# What is data?

This seems like a very simple question, but what is data in a computer, really? To a computer, everything that can be represented as a string of bits is data, whether that is numbers, text, audio, or video. For data scientists and business analysts, "data" usually means a collection of tables, and that is what we will focus on here.
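To make the "string of bits" idea concrete, here is a minimal sketch using only the Python standard library. The sample text, the number, and the two-byte width are arbitrary choices for illustration.

```python
# Text becomes bytes via an encoding (UTF-8 here), and each byte is 8 bits.
text = "Hi"
text_bytes = text.encode("utf-8")
text_bits = " ".join(format(b, "08b") for b in text_bytes)

# An integer can be laid out as bytes too; 2 bytes, big-endian, chosen for illustration.
number = 42
number_bits = " ".join(format(b, "08b") for b in number.to_bytes(2, "big"))

print(text_bits)    # 01001000 01101001
print(number_bits)  # 00000000 00101010
```

Nothing in the bits themselves says whether they are text or a number; that meaning comes from how the program chooses to interpret them.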

Almost everyone here will be familiar with Excel, but few of us think much about the way data is organized. This becomes much more important when we want to interact with data programmatically.

## Tidy data vs wide data vs messy data

Data science therefore has specific terminology for the different ways of laying out data. The most common layout is **tidy** data:

<img src="https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png" width=50%>

<center> Tidy data: Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells. (<a href="https://r4ds.had.co.nz/tidy-data.html">Source: Hadley Wickham: R for Data Science</a>)</center>

<br>

Tidy data is very convenient for a number of reasons:

1. Easy to visualize by mapping columns to visual properties
2. Easy to analyse by subsetting or grouping data by some column(s)
3. Consistent data formats make it easier to reason about
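As a rough illustration of the second point, the sketch below subsets and groups a small tidy table with pandas. The table is an illustrative sample with one row per (country, year) observation; the column names are assumptions made for this example.

```python
import pandas as pd

# Illustrative tidy table: each row is one observation, each column one variable.
tidy_cases = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil"],
    "year": [1999, 2000, 1999, 2000],
    "cases": [745, 2666, 37737, 80488],
})

# Subsetting: keep only the observations for a single year.
cases_2000 = tidy_cases[tidy_cases["year"] == 2000]

# Grouping: total cases per country across all years.
total_by_country = tidy_cases.groupby("country")["cases"].sum()
print(total_by_country)
```

Because every variable lives in its own column, both operations are expressed by naming a column rather than by pointing at cell positions.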

**Wide** data, on the other hand, is very common when working in Excel. How often have you seen something like this:

<img src="https://d33wubrfki0l68.cloudfront.net/3aea19108d39606bbe49981acda07696c0c7fcd8/2de65/images/tidy-9.png" width=70%>

<center> Tidy data: Difference between wide data (table 4) and tidy data. (<a href="https://r4ds.had.co.nz/tidy-data.html">Source: Hadley Wickham: R for Data Science</a>)</center>
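One way to go from a wide layout to a tidy one in pandas is `DataFrame.melt`. The sketch below assumes a small wide table modelled on the example from R for Data Science, with one column per year; the variable names are chosen for illustration.

```python
import pandas as pd

# Wide layout: the year, which is really a variable, is spread across column headers.
wide_cases = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})

# melt() moves the year headers into a "year" column and their cells into "cases".
long_cases = wide_cases.melt(id_vars="country", var_name="year", value_name="cases")

# The former headers are strings, so convert them to integers.
long_cases["year"] = long_cases["year"].astype(int)

# Sort so that each country's observations sit together.
long_cases = long_cases.sort_values(["country", "year"]).reset_index(drop=True)
print(long_cases)
```

The result has one row per country and year, so the three rows of the wide table become six rows in the tidy one.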

Sadly, a lot of data is **messy** and conforms to neither format. Transforming this kind of data takes up a lot of people's time, and there is no good general-purpose solution.

## Common data formats

* CSV
* Excel files
* SQL (or other databases)

In [11]:
import pandas as pd

pd.read_csv('./mpg.csv')

| Unnamed: 0 | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 393 | 27.0 | 4 | 140.0 | 86 | 2790 | 15.6 | 82 | 1 | ford mustang gl |
| 394 | 44.0 | 4 | 97.0 | 52 | 2130 | 24.6 | 82 | 2 | vw pickup |
| 395 | 32.0 | 4 | 135.0 | 84 | 2295 | 11.6 | 82 | 1 | dodge rampage |
| 396 | 28.0 | 4 | 120.0 | 79 | 2625 | 18.6 | 82 | 1 | ford ranger |
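The `Unnamed: 0` column in the output above suggests that the file's first column has an empty header, which typically happens when a row index was written out along with the data. If that is the case, passing `index_col=0` to `pd.read_csv` uses that column as the index instead. The sketch below shows the difference on a hypothetical two-row excerpt held in memory, not on the real `mpg.csv`.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for mpg.csv; note the empty first header field.
csv_text = (
    ",mpg,cylinders,name\n"
    "0,18.0,8,chevrolet chevelle malibu\n"
    "1,15.0,8,buick skylark 320\n"
)

# Default behaviour: the nameless first column is kept and labelled "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```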


Similarly, to read an Excel file you can use:

In [14]:
pd.read_excel

<function pandas.io.excel._base.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skip_footer=0, skipfooter=0, convert_float=True, mangle_dupe_cols=True, **kwds)>
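The third format in the list, SQL, can be read in much the same way with `pd.read_sql_query`. The sketch below builds a throwaway in-memory SQLite database so that it is self-contained; the `cars` table and its two rows are hypothetical stand-ins for a real database.

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database; a real project would connect to an existing one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (name TEXT, mpg REAL, cylinders INTEGER)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("chevrolet chevelle malibu", 18.0, 8), ("vw pickup", 44.0, 4)],
)
conn.commit()

# The query result comes back as a DataFrame, just like read_csv or read_excel.
efficient_cars = pd.read_sql_query("SELECT name, mpg FROM cars WHERE mpg > 20", conn)
conn.close()

print(efficient_cars)
```

Filtering in the query means only the matching rows are loaded into memory, which matters once tables grow beyond what fits comfortably in a DataFrame.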

## Data catalogues

<img src="https://intake.readthedocs.io/en/latest/_static/images/logo.png" width=20%></img>

Intake (developed at Anaconda) abstracts away the differences between these formats and lets you access data catalogues in a consistent way. The long-term goal at Anaconda should be to make as much data as possible accessible this way (with appropriate access controls in place).