# Dealing with missing values

**Author:** 'Felipe Millacura'
    
**Date:** '21st February 2021'

### Learning objectives

* Understand the challenges of missing and null values
* Know potential methods of dealing with missing values
* Identify why the data was missing - random or pattern
* Be able to identify and justify the approach taken for each missing data item
* Be able to tidy up a dataset and deal with missing values

### Missing values: what are they? 

Missing values - in their simplest form - are data points that are not entered into your spreadsheet/database/table. First, understand that there is not one good way of dealing with missing data. The missing values you will get differ by dataset, and the way you deal with them will also depend on what analysis you are doing. 


### Missing values: why do they go missing?

Lots of different things contribute to missing data. The types of missing values can be split into categories.  

**Missing at Random (MAR):** There is a pattern in the missing data, but it does not depend on your primary dependent variables. Instead, it is related to some other observed feature of your data. For example, maybe you have cyclical data, and every night there is a period where the data collection device is shut off. 

**Missing Completely at Random (MCAR):** There is no pattern in the missing data. This is good! This might mean someone has accidentally entered data wrong, or it's just a forgotten entry, etc.   

**Missing not at Random (MNAR):** There is a pattern in the missing data that affects your primary dependent variables. For example, lower-income participants are less likely to respond, which affects your conclusions about income and likelihood to recommend. Missing not at random is your worst-case scenario. Proceed with caution.    

In the first two cases, it is pretty ok to remove the missing values (depending on how many occurrences there are, and where those occurrences are). In the last case, removing observations can produce bias in the results. 
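
To make the difference concrete, here is a minimal simulation sketch. The income figures, the missingness rates and all variable names are made up for illustration; the point is only that dropping values that are missing completely at random leaves the mean roughly unchanged, while dropping values that are missing not at random shifts it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey: 10,000 incomes drawn from a skewed distribution
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=10_000))

# MCAR: every value has the same 30% chance of being missing
mcar = income.mask(rng.random(income.size) < 0.3)

# MNAR: incomes below the median are much more likely to be missing
p_missing = np.where(income < income.median(), 0.5, 0.1)
mnar = income.mask(rng.random(income.size) < p_missing)

# pandas skips NaN when computing the mean
print(f"true mean:       {income.mean():.0f}")
print(f"mean under MCAR: {mcar.mean():.0f}")  # close to the true mean
print(f"mean under MNAR: {mnar.mean():.0f}")  # pulled upwards
```

Under these assumptions the MNAR mean overstates income, because the people most likely to be missing are the ones with the lowest values.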



### How are missing values represented? 

In Python, missing values are represented by `None`, `NaN` or `NaT`.

     None : Pythonic missing data
     NaN : (“Not a Number”) a special floating-point value, used for missing numeric data
     NaT : (“Not a Time”) for datetime64[ns] types

To identify missing values in your dataset you can use the function `isnull()`, which detects `NaN` in numeric arrays, `None` or `NaN` in object arrays, and `NaT` in datetime-like arrays. You can also use `isna()` or `notna()` (the Boolean inverse of `isna()`) on DataFrames and Series. Further information [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html).
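
As a quick illustration, the sketch below builds one small Series for each kind of marker (the values are invented) and checks that `isna()` flags all three.

```python
import numpy as np
import pandas as pd

# One Series per kind of missing marker
numbers = pd.Series([1.0, np.nan, 3.0])                  # NaN in a float column
labels = pd.Series(["a", None, "c"], dtype="object")     # None in an object column
dates = pd.to_datetime(pd.Series(["2021-02-21", None]))  # None becomes NaT

print(numbers.isna().tolist())   # [False, True, False]
print(labels.isna().tolist())    # [False, True, False]
print(dates.isna().tolist())     # [False, True]

# isnull() is an alias of isna(); notna() is the Boolean inverse
print(numbers.isnull().equals(numbers.isna()))  # True
print(numbers.notna().tolist())                 # [True, False, True]
```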

String variables are harder to code. They can just be left blank, or replaced with a large variety of different symbols - whatever the analyst chooses really! Missing data within `object` variables are MUCH harder to find, but potentially a bit easier to deal with. 
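
A minimal sketch of the problem, assuming a made-up column in which blanks and a "?" were used as placeholders: Pandas counts none of them as missing until we say which strings mean "no value".

```python
import numpy as np
import pandas as pd

# Hypothetical column where blanks and ad-hoc symbols stand in for missing data
payment = pd.Series(["card", "", "   ", "?", "cash"], dtype="object")

print(payment.isna().sum())   # 0 -> none of the placeholders are detected

# Treat empty strings, whitespace-only strings and "?" as missing
is_placeholder = payment.str.strip().isin(["", "?"])
cleaned = payment.mask(is_placeholder, np.nan)

print(cleaned.isna().sum())   # 3
```

The list of placeholders is an assumption about this particular column; in real data you would first inspect the distinct values (for example with `value_counts()`) before deciding.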


### Standard Missing Values

These are missing values that Pandas can detect. Let's take a look at the `Street Number` column on the next example DataFrame.

![](https://miro.medium.com/max/875/0*veOKXtXllBUoIOr-.jpg)

In the third row there's an empty cell. In the seventh row there’s an "NA" value. Clearly these are both missing values and, hence, can be detected by Pandas with no issue.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/property_data.csv')

print(df['ST_NUM'])


0    104.0
1    197.0
2      NaN
3    201.0
4    203.0
5    207.0
6      NaN
7    213.0
8    215.0
Name: ST_NUM, dtype: float64


In [3]:
print(df['ST_NUM'].isnull())

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
Name: ST_NUM, dtype: bool


This is a simple example, but it highlights an important point: Pandas will recognize both empty cells and “NA” types as missing values.

### Non-Standard Missing Values

There are times where missing values can have other formats. Let's now take a look at the `Number of Bedrooms` column.

![](https://miro.medium.com/max/875/0*wlRcCspEXuvanHPP.jpg)

In [3]:
df

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3,1,1000
1,100002000.0,197.0,LEXINGTON,N,3,1.5,--
2,100003000.0,,LEXINGTON,N,,1,850
3,100004000.0,201.0,BERKELEY,12,1,,700
4,,203.0,BERKELEY,Y,3,2,1600
5,100006000.0,207.0,BERKELEY,Y,,1,800
6,100007000.0,,WASHINGTON,,2,HURLEY,950
7,100008000.0,213.0,TREMONT,Y,--,1,
8,100009000.0,215.0,TREMONT,Y,na,2,1800


Just like before, Pandas recognized the “NA” as a missing value. Unfortunately, not all other types were recognized.

This becomes a common issue when multiple users are in charge of entering data manually. Maybe someone likes to use "n/a" but someone else uses "na" or "--" instead.

An easy way to detect all the possible formats is to put them in a list. Then once we import the data, we pass it in the argument `na_values` allowing Pandas to recognize them right away. Here's an example for that:

In [4]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--"]
df = pd.read_csv("data/property_data.csv", na_values=missing_values)

In [5]:
df

Unnamed: 0,PID,ST_NUM,ST_NAME,OWN_OCCUPIED,NUM_BEDROOMS,NUM_BATH,SQ_FT
0,100001000.0,104.0,PUTNAM,Y,3.0,1,1000.0
1,100002000.0,197.0,LEXINGTON,N,3.0,1.5,
2,100003000.0,,LEXINGTON,N,,1,850.0
3,100004000.0,201.0,BERKELEY,12,1.0,,700.0
4,,203.0,BERKELEY,Y,3.0,2,1600.0
5,100006000.0,207.0,BERKELEY,Y,,1,800.0
6,100007000.0,,WASHINGTON,,2.0,HURLEY,950.0
7,100008000.0,213.0,TREMONT,Y,,1,
8,100009000.0,215.0,TREMONT,Y,,2,1800.0
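
`na_values` also accepts a dictionary, so that a marker is only treated as missing in the columns where it makes sense. The sketch below uses a tiny in-memory CSV (made-up rows, loosely modelled on the property data) rather than the real file.

```python
import io
import pandas as pd

# Made-up rows standing in for the property data
csv_text = """PID,NUM_BEDROOMS,SQ_FT
1,3,1000
2,n/a,--
3,na,850
4,,700
"""

# Default behaviour: the empty cell and "n/a" are detected, "na" and "--" are not
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.isna().sum())

# Per-column markers, added on top of the default ones
cleaned = pd.read_csv(
    io.StringIO(csv_text),
    na_values={"NUM_BEDROOMS": ["na"], "SQ_FT": ["--"]},
)
print(cleaned.isna().sum())
print(cleaned.dtypes)   # both columns can now be read as numbers
```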


### Missing values: the danger

Both data analysts and software developers may confuse a missing value with a default value or category. For instance, in Excel, if you add a cell that contains the number `2` to a field that is empty, the result will come out as `2`. What could be wrong with this?   

*In this case, the problem is that Excel mistakenly imputes `0` where it should have recognised a missing value. This might not be a problem in a standard addition task, but what if you were calculating a mean? Do you want all missing values to be included? No - then you're dividing the sum total by the wrong number of observations.*
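
The same effect is easy to reproduce in Pandas. This small sketch (with invented scores) compares the default behaviour, which skips missing values, with what happens if they are silently treated as `0`.

```python
import numpy as np
import pandas as pd

scores = pd.Series([10.0, np.nan, 20.0, np.nan])

# pandas skips NaN by default: (10 + 20) / 2
print(scores.mean())              # 15.0

# Treating missing as 0 divides by all four rows: (10 + 0 + 20 + 0) / 4
print(scores.fillna(0).mean())    # 7.5

# Sums also skip NaN, which has the same effect as adding 0
print(scores.sum())               # 30.0

# skipna=False propagates the missing value instead of hiding it
print(scores.mean(skipna=False))  # nan
```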

Another commonly encountered mistake is to confuse an `NA` in categorical data. For example, a missing value can either represent an unknown category OR actual missing data. 

For example, let's suppose you are analysing data from a survey about demographics in a particular city council region. Unfortunately, the survey hasn't been designed or set up properly, and so you've received lots of missing values. There may be values where the person left the entry blank because they wanted to respond "don't know" (for example, a parent's birth city). And there may be values where the person intentionally chose not to respond and left the field blank (e.g. marital status). Now you've got two kinds of missing data point, and they represent totally different things. 

This is a common problem, and it's worth remembering that if unknown is indeed a category, it should be added as a response choice, so it can be appropriately analyzed.
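
A short sketch of why this matters, using an invented survey with hypothetical column names: once "don't know" is a real response, it can be counted separately from values that are genuinely missing.

```python
import numpy as np
import pandas as pd

# Hypothetical survey answers; np.nan marks a question left blank
survey = pd.DataFrame({
    "parent_birth_city": ["Leeds", "don't know", np.nan, "Perth"],
    "marital_status": ["single", np.nan, "married", np.nan],
})

# "don't know" is counted as its own category;
# dropna=False keeps the truly missing values visible in the counts
city_counts = survey["parent_birth_city"].value_counts(dropna=False)
status_counts = survey["marital_status"].value_counts(dropna=False)

print(city_counts)
print(status_counts)
```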



### Missing values: what should you do? 

Overall, it's up to you to decide how empty values are handled, since a default imputation by your software might give you an unexpected or erroneous result. There are three main strategies usually employed:

1. **Drop the NA values.** Delete all data from any participant with missing data. 

2. **Replace the NA values with something else.** This is known as **imputation** and there are different kinds:

* *Average imputation*: replace the missing values with the mean value. This isn't always recommended though, as the mean is sensitive to outliers, which can affect the validity of your data. It can also artificially reduce the variability of your data.   

* *Common point imputation*: you can use the median, mode or most commonly chosen value. For example, on a 5 point scale you could substitute a missing value as a 3 (e.g. neutral). This is better than guessing, but still a risky approach, and again, reduces the variability in your dataset.   

* *Regression substitution*: Regression analysis estimates and predicts values. Regression substitution predicts the missing value from the other values. This typically gives you more statistically stable imputations. 

* *Multiple imputations*: This is a newer technique, and a very popular one. In summary, it involves using software to create plausible values based on correlations between responses, via a number of simulations. Each of the simulated datasets is analysed, and then the results are combined into a single estimate. In essence, this technique uses models to simulate several plausible versions of the missing values, rather than committing to a single guess. 

    *Note: it is worth noting that imputation does not necessarily give better results! It's just one way of preserving some data.*

3. **Just leave them alone.** This can sometimes work if you just want a true representation of the data. However, if you want to do any summary stats, you will need to deal with missing values. 
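
To illustrate two of the imputation options above, here is a rough sketch on simulated data (the variables `x` and `y`, the noise level and the 30% missing rate are all assumptions). It compares average imputation with a simple regression substitution fitted using `numpy.polyfit`, and shows how average imputation shrinks the spread of the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical data: y depends linearly on x, plus noise
x = rng.uniform(0, 10, size=200)
y = pd.Series(3 * x + rng.normal(0, 2, size=200))
data = pd.DataFrame({"x": x, "y": y})

# Knock out roughly 30% of y completely at random
missing = rng.random(200) < 0.3
data.loc[missing, "y"] = np.nan

# Average imputation: every gap gets the same value
mean_imputed = data["y"].fillna(data["y"].mean())

# Regression substitution: fit y ~ x on the complete rows, predict the gaps
complete = data.dropna()
slope, intercept = np.polyfit(complete["x"], complete["y"], deg=1)
reg_imputed = data["y"].fillna(intercept + slope * data["x"])

print(f"std of the full y:          {y.std():.2f}")
print(f"std after mean imputation:  {mean_imputed.std():.2f}")
print(f"std after regression:       {reg_imputed.std():.2f}")
```

This is a toy comparison rather than a recommendation: regression substitution looks good here only because the data were generated from a straight line in the first place.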



<blockquote class='task'>
<b>Task - 5 mins</b>

Have a think about these three categories of options. What are the benefits and pitfalls of each? 

<br>
<details>
<summary><b>Answer</b></summary>

<b>Drop the NAs :</b> 
The benefit of this approach is that you get rid of all the missing data, and you can't accidentally convert or summarise it.   
The pitfall of this approach is that you might end up losing a lot of data, which has cost time and money to collect. Similarly, you might end up deleting all participants in a category by accident.   

<b>Impute the values :</b>
The benefit of this approach is that you minimise data loss. For example, you might have a set of temperature recordings over a month, with a few days missing because the equipment was down. In this case, you might feel comfortable replacing the missing values with the average temperature for the month. Or the median. As the temperature within a month doesn't tend to vary that much, either of these is probably a good estimate of the temperature on the missing days. 

<b>Just leave the values: </b>
The benefit of this approach is that you're keeping the data as close to raw as possible. Sometimes, this can be an appropriate answer. For example, if you were just creating datasets and wanted to retain as much of the raw data as possible for traceability, and you knew you weren't going to do any summary stats, then it might be best just to leave them. However, you MUST remember that your summary stats will be affected, should you choose to do any.  
The pitfalls of this approach however are many. For example, if you leave the missing values in, they might accidentally get summarised if someone who isn't familiar with the data starts working on it. Similarly, you might have non-standard missing values in the data, and these can cause problems.

</details>
</blockquote>

Now we know what missing values are, why they are bad, and what we should do with them, let's look at working with them in Python. 

[This post](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4) sums the different practices up pretty well:

![](https://miro.medium.com/max/875/1*_RA3mCS30Pr0vUxbp25Yxw.png)


## Identifying missing values in Pandas: an exercise


Pandas's core functionality is set up to be consistent with the idea that the analyst should choose how to deal with missing values, mostly by giving you options to remove or impute NAs. 

For this exercise we're going to work with some customer attrition data - that is, data which represents the loss of clients or customers. Places such as banks, internet companies, tv companies, insurance firms all use this kind of data and analysis as part of their metrics. We've all heard the adverts talking about "Our clients stay with us for X years because they're so happy". Well, that's customer attrition data. 

Here we have some data from a phone supply company (communication company). Let's load in the data and look at the missing values we have. 




In [6]:
comms_data = pd.read_csv('data/telecom_data.csv')

comms_data.head()

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
0,7590,29.85,118.9,electronic check,yes
1,5575,56.78,na,mailed check,yes
2,3668,,108.15,--,yes
3,7795,42.3,1840.75,bank transfer,no
4,9237,75.7,,electronic check,no


Once you analyse the data you are going to realise that almost every column in the dataset contains some kind of missing value, and a wide array of them! Let's first start by looking at some standard missing values. 


## Standard missing values 

The column `monthly_charges` contains what we refer to as "standard missing values". That is, it has NAs in it. Let's take a look, using the `isna()` function. 


In [7]:
comms_data[comms_data['monthly_charges'].isna()]

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
2,3668,,108.15,--,yes
7,6713,,,,yes
10,6790,,119.9,electronic check,yes
11,6672,,na,mailed check,yes
13,6435,,5305.95,bank transfer,no
17,5962,,1745.475,,yes
19,5725,,1815.365,electronic check,no
25,5015,,3573.35,--,yes
27,4778,,,,yes
30,4423,,297.0924419,electronic check,yes


From this you can see that we have plenty of missing values. Now we can count how many missing values we have to see the scope of the problem we're dealing with.   



In [9]:
comms_data.isna().sum()

customerID          0
monthly_charges    12
total_charges       4
payment_method      3
churn               0
dtype: int64
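
Counts are easier to judge as proportions. The sketch below uses a tiny made-up stand-in for the telecom data (not the real file) to show the share of missing values per column and the number of affected rows.

```python
import numpy as np
import pandas as pd

# Small stand-in for the telecom data (values are made up)
toy = pd.DataFrame({
    "customerID": [1, 2, 3, 4],
    "monthly_charges": [29.85, np.nan, np.nan, 42.30],
    "total_charges": [118.9, 108.15, np.nan, 1840.75],
})

# isna() gives True/False; the mean of a Boolean column is the share of True
share_missing = toy.isna().mean().sort_values(ascending=False)
print(share_missing)

# Number of rows with at least one missing value
rows_affected = toy.isna().any(axis=1).sum()
print(rows_affected)   # 2
```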


So, what should we do? As mentioned in the introduction, you usually have three broad options. 


### 1. Remove them 

You can remove them, as we learnt to do in a previous lesson last week: 


In [10]:
comms_data.dropna()

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
0,7590,29.85,118.9,electronic check,yes
1,5575,56.78,na,mailed check,yes
3,7795,42.3,1840.75,bank transfer,no
5,9305,Nan,830.5,--,yes
6,1452,89.1,1949.5,credit card,no
8,7892,104.8,3046.05,electronic check,no
9,8451,56.2,456.95,electronic check,no
12,6553,Nan,3573.35,--,yes
15,6198,23.5,1675.585,--,yes
16,6080,68.9,1710.53,credit card,no
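
Notice that the output above still contains strings such as `na`, `Nan` and `--`, because `dropna()` only removes values Pandas already recognises as missing. `dropna()` also has a couple of arguments that control how aggressive it is. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the telecom data
toy = pd.DataFrame({
    "customerID": [1, 2, 3, 4],
    "monthly_charges": [29.85, np.nan, np.nan, 42.30],
    "total_charges": [118.9, 108.15, np.nan, np.nan],
})

# Default: drop a row if ANY column is missing
print(len(toy.dropna()))                            # 1

# Only drop rows where monthly_charges is missing
print(len(toy.dropna(subset=["monthly_charges"])))  # 2

# Keep rows that have at least 2 non-missing values
print(len(toy.dropna(thresh=2)))                    # 3

# dropna() returns a new DataFrame; the original is unchanged
print(len(toy))                                     # 4
```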


### 2. Impute them 

You can impute them (that is, replace them with something else). For example, you could replace them with the median value of `monthly_charges`. 


In [12]:
# replace NaN observations in the monthly_charges column, with the median of that column

median = comms_data['monthly_charges'].astype('float').median()

comms_imputed_median = comms_data

comms_imputed_median['monthly_charges'] = comms_data['monthly_charges'].fillna(value=median)

comms_imputed_median

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
0,7590,29.85,118.9,electronic check,yes
1,5575,56.78,na,mailed check,yes
2,3668,52.427895,108.15,--,yes
3,7795,42.3,1840.75,bank transfer,no
4,9237,75.7,,electronic check,no
5,9305,Nan,830.5,--,yes
6,1452,89.1,1949.5,credit card,no
7,6713,52.427895,,,yes
8,7892,104.8,3046.05,electronic check,no
9,8451,56.2,456.95,electronic check,no


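Here our code is overwriting the `monthly_charges` column with a new one where the `NaN` values are replaced with the median. Note that `comms_imputed_median = comms_data` does not make a copy: both names point to the same DataFrame, so `comms_data` is changed as well. Use `comms_data.copy()` if you want to keep the original untouched. 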
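
A small sketch of the difference between a plain assignment and `.copy()`, using a made-up three-row DataFrame rather than the telecom data:

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"monthly_charges": [29.85, np.nan, 42.30]})

alias = original               # a second name for the SAME DataFrame
independent = original.copy()  # a separate DataFrame with the same contents

# Impute on the copy only
median = original["monthly_charges"].median()
independent["monthly_charges"] = independent["monthly_charges"].fillna(median)

print(alias is original)                            # True
print(original["monthly_charges"].isna().sum())     # 1 -> original untouched
print(independent["monthly_charges"].isna().sum())  # 0
```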

We can also do average imputation, and replace the values with the mean.  


In [13]:
# replace NaN observations in the monthly_charges column, with the mean of that column

mean = comms_data['monthly_charges'].astype('float').mean()

comms_imputed_mean = comms_data

comms_imputed_mean['monthly_charges'] = comms_data['monthly_charges'].fillna(value= mean)

comms_imputed_mean

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
0,7590,29.85,118.9,electronic check,yes
1,5575,56.78,na,mailed check,yes
2,3668,52.427895,108.15,--,yes
3,7795,42.3,1840.75,bank transfer,no
4,9237,75.7,,electronic check,no
5,9305,Nan,830.5,--,yes
6,1452,89.1,1949.5,credit card,no
7,6713,52.427895,,,yes
8,7892,104.8,3046.05,electronic check,no
9,8451,56.2,456.95,electronic check,no


The final way you can impute is via building models to simulate the missing values. This is generally only used on more complex datasets in prep for model building, and so we won't cover this yet. We will learn about some packages that allow you to do this later in the course, but meanwhile you could have a look at the `method` argument (‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None) of [`fillna`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html), which recent Pandas versions deprecate in favour of the dedicated `ffill()` and `bfill()` methods, or at the [`interpolate`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html) function.
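
For ordered data, filling from neighbouring values is often a reasonable first step. Here is a minimal sketch on a made-up series of daily temperature readings with a two-day gap:

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with a two-day gap
temps = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2021-02-01", periods=4, freq="D"),
)

# Forward fill: carry the last known value forward
print(temps.ffill().tolist())        # [10.0, 10.0, 10.0, 16.0]

# Backward fill: use the next known value
print(temps.bfill().tolist())        # [10.0, 16.0, 16.0, 16.0]

# Linear interpolation between the known neighbours
print(temps.interpolate().tolist())  # [10.0, 12.0, 14.0, 16.0]
```

These only make sense when row order carries meaning (time series, for example); on unordered rows the "neighbour" is arbitrary.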


### 3. Leave them as is 

As we stated in the intro, the final option you have is to just leave them as is. In this particular case, this option isn't appropriate, because we want to summarise the charges eventually.   


Now let's look at dealing with some of the non-standard missing values. 


Take a look at the `total_charges` and `payment_method` columns and find the different type of missing value identifiers we have.

What is the problem here? Pandas hasn't recognised all of the different missing value markers in these columns. By default `read_csv` treats markers such as "NA" and "N/A" as missing, but "na", "Nan" and "--" are read in as ordinary strings.


So, what to do? 
Well, one option is to tell Pandas explicitly which markers should all be read as `NaN`:


In [14]:
missing_values = ['n/a', 'na', '--', 'N/A', 'NA', 'Nan']
df = pd.read_csv('data/telecom_data.csv', na_values = missing_values)

In [15]:
df

Unnamed: 0,customerID,monthly_charges,total_charges,payment_method,churn
0,7590,29.85,118.9,electronic check,yes
1,5575,56.78,,mailed check,yes
2,3668,,108.15,,yes
3,7795,42.3,1840.75,bank transfer,no
4,9237,75.7,,electronic check,no
5,9305,,830.5,,yes
6,1452,89.1,1949.5,credit card,no
7,6713,,,,yes
8,7892,104.8,3046.05,electronic check,no
9,8451,56.2,456.95,electronic check,no


But this method relies on you knowing all the different ways missing values are coded in your data. That is ok if you have a small dataset like we do here, but what if you had millions of values?   

A better method of dealing with this would be to ensure that you're working with numeric data. In our case, most columns are actually set to `object`. 


In [16]:
# Remember that these are still handled as strings aka object type

comms_data.dtypes

customerID          int64
monthly_charges    object
total_charges      object
payment_method     object
churn              object
dtype: object

If we handle the missing value markers while importing, as we did for `df` above, the numeric columns are read in as floats:

In [17]:
df.dtypes

customerID           int64
monthly_charges    float64
total_charges      float64
payment_method      object
churn               object
dtype: object

If we change these [`to_numeric`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) types using the `errors` argument as 'coerce', you'll notice what happens:

In [19]:
pd.to_numeric(comms_data['monthly_charges'], errors= 'coerce')

0      29.850000
1      56.780000
2      52.427895
3      42.300000
4      75.700000
5            NaN
6      89.100000
7      52.427895
8     104.800000
9      56.200000
10     52.427895
11     52.427895
12           NaN
13     52.427895
14     53.800000
15     23.500000
16     68.900000
17     52.427895
18     76.500000
19     52.427895
20     45.600000
21           NaN
22     23.600000
23     15.500000
24    105.600000
25     52.427895
26           NaN
27     52.427895
28     34.700000
29     12.500000
30     52.427895
31     45.600000
32     35.600000
33     52.427895
34     52.427895
Name: monthly_charges, dtype: float64

Now you can see that once again, Pandas has found all the NAs. 
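
Because `errors='coerce'` silently turns anything unparseable into `NaN`, it is worth checking which raw entries were affected. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up raw column: numbers stored as text, plus some placeholders
raw = pd.Series(["29.85", "na", "--", "42.3", None], dtype="object")

numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.tolist())           # [29.85, nan, nan, 42.3, nan]

# Entries that were present in the raw data but could not be parsed
coerced = raw[numeric.isna() & raw.notna()]
print(coerced.unique().tolist())  # ['na', '--']
```

If `coerced` contains something that is clearly a typo rather than a placeholder (say `"42,3"`), that is a sign the value should be repaired instead of discarded.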

Ok, so we've got those ones covered. But what about if we have an even more non-standard missing value? For example, take the `payment_method` column. 


<blockquote class='task'>
Take a look at the `payment_method` column and find the different types of missing value identifiers we have, and how many of each there are.

Replace these with `NaN`, and then check that you have the right number of missing values.
    

</blockquote>


And voilà! There you have it. You've successfully cleaned the data and fixed all the missing values, dealing with them in different ways. The final thing you can do is look at your newly cleaned data. 


### Implicit missing data 

So far we have dealt with explicit missing data, in that we have entries in our data which have values of NA etc. But what happens if the data is implicitly missing? Let's take a look at an example.


In [20]:
import numpy as np

In [21]:
patient_data = pd.DataFrame({'patient': ['A'*10, 'B'*5], 
                             'year': [[*range(2000, 2009, 1)], #here we use the argument-unpacking operator *
                                      [2000, 2004, 2006, 2007, 2008]], 
                             'result': [np.random.randint(100, size=15), 
                                        15]
                            })

patient_data

Unnamed: 0,patient,year,result
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...","[86, 51, 29, 32, 86, 96, 99, 93, 14, 99, 95, 3..."
1,BBBBB,"[2000, 2004, 2006, 2007, 2008]",15


So some results for patient B are implicitly missing (2001-2003 and 2005), in that we do not have data for those years. But what happens if we want to turn these into explicit missing entries (and then potentially impute them)?

In [22]:
patient_data.dtypes

patient    object
year       object
result     object
dtype: object

Here we can use the [`explode`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html) function to unpack each element from a list-like format to a row:

In [24]:
patient_data.explode('result')

Unnamed: 0,patient,year,result
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",86
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",51
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",29
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",32
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",86
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",96
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",99
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",93
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",14
0,AAAAAAAAAA,"[2000, 2001, 2002, 2003, 2004, 2005, 2006, 200...",99


Once you unpack the years you would need to insert the missing year rows and add the NaN values. Give it a try!
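
One possible approach, sketched on a simplified stand-in for the patient data: it assumes the data already has one row per patient and year, with made-up results and a shorter range of years. The idea is to build every patient-year combination that should exist and reindex against it.

```python
import pandas as pd

# Long-format stand-in for the patient data: one row per patient-year
# (result values are made up)
observed = pd.DataFrame({
    "patient": ["A", "A", "A", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2002],
    "result": [86, 51, 29, 15, 40],
})

# Every patient-year combination that should exist
full_index = pd.MultiIndex.from_product(
    [observed["patient"].unique(), range(2000, 2003)],
    names=["patient", "year"],
)

# Reindexing adds the absent combinations as rows filled with NaN
explicit = (
    observed.set_index(["patient", "year"])
            .reindex(full_index)
            .reset_index()
)
print(explicit)
```

Patient B's 2001 row now exists with a `NaN` result, so it can be counted, dropped or imputed like any other explicit missing value.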

### Recap 

**How does Python represent missing values?**

Python represents missing values with `None`, but there are also `NaN` and `NaT` in Pandas and NumPy. `NaN` stands for "Not a Number" and `NaT` for "Not a Time".

**What is the best way to deal with string-related missing values?**

Best way is usually to create a category to represent them all, e.g. a category of “unknown” could be used to represent missing values.

**What function would you use to overwrite missing values?**

You can use the `fillna` function, or you can also use `replace`.


**Now is a perfect time to talk about Data tidiness!**

# What is Tidy Data?


### Learning Objectives

1. Understand the concept of tidy data
2. Be able to describe why datasets are not tidy
3. Know why tidy data is important for analytics




### What is Tidy Data?

<center>
<br>
<div class='emphasis'>
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
</div>
</center>

Up until now, the data you've worked with has more or less been in a reasonable format: rows and columns have been organised in a sensible way, and variables have been separated out. However, more often than not, the data you work with will not be formatted in the most efficient way possible. You'll have columns representing the same thing, oddly named rows and column headers, and sometimes more than one data entry per cell. This type of data is known as messy data. What we need is the opposite - tidy data!


![](images/real_world_data.jpg)

### Tidy data has structure 

There are three fundamental rules defining Tidy Data:

1. Each variable must have its own column. 
2. Each observation must have its own row.
3. Each value must have its own cell (i.e. no grouping two variables together, e.g. a person's name and age should not be in one cell 'Joe Blogs, 28' ).

This figure from [statseducation.com](http://statseducation.com/Introduction-to-R/img/tidy-1.png) shows this a bit more clearly: 

![](http://statseducation.com/Introduction-to-R/img/tidy-1.png)  



Arranging your data in this way makes it easier to work with because you have a consistent way of referring to variables (as column names) and observations (as row indices). When you use tidy data and tidy tools, you spend less time worrying about how to feed the output from one function into the input of another, and more time answering your questions about the data. 



### Tidy data contains relationships 

Tidy data also works on a premise that data contains values which have relationships between one another, and it displays these in the dataset as consistently as it does the values. For example, in a tidy dataset you might have relationships such as:  

1. A name associated with a bank account number 
2. A date associated with a time
3. A patient ID associated with a test result


Basically, tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).  


### Long vs. Wide 

Finally, tidy data can be structured in long format or wide format.

**Long format** is where every row represents an observation belonging to a particular category.


| Product | Attribute | Value     | 
|---------|-----------|-----------| 
| A       | Height    | 10        | 
| A       | Width     | 5         | 
| A       | Weight    | 2         | 
| B       | Height    | 20        | 
| B       | Width     | 10        | 

| country     | year    | avg_temp |
|-------------|---------|----------|
| Sweden      | 1994    |   11     |
| Denmark     | 1994    |    5     |
| Norway      | 1994    |    9     |
| Sweden      | 1995    |    8     |
| Denmark     | 1995    |    7     |
| Norway      | 1995    |    8     |
| Sweden      | 1996    |    9     |
| Denmark     | 1996    |   10     |
| Norway      | 1996    |   11     |

**Wide format** is where each observation is spread across multiple columns.

| Product | Height | Width | Weight   | 
|---------|--------|-------|----------|   
| A       | 10     | 5     | 2        | 
| B       | 20     | 10    | NA       | 

<br>

| country     | avg_temp_1994 | avg_temp_1995 | avg_temp_1996 |
|-------------|---------------|---------------|---------------|
| Sweden      |       11      |       8       |       9       |
| Denmark     |       5       |       7       |      10       |
| Norway      |       9       |       8       |      11       |

<br>
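
In Pandas you can move between the two layouts with `pivot` (long to wide) and `melt` (wide to long). A minimal sketch using the temperature table above:

```python
import pandas as pd

# The long-format temperature table from above
long_df = pd.DataFrame({
    "country": ["Sweden", "Denmark", "Norway"] * 3,
    "year": [1994] * 3 + [1995] * 3 + [1996] * 3,
    "avg_temp": [11, 5, 9, 8, 7, 8, 9, 10, 11],
})

# Long -> wide: one row per country, one column per year
wide_df = long_df.pivot(index="country", columns="year", values="avg_temp")
print(wide_df)

# Wide -> long again: one row per country-year pair
back_to_long = (
    wide_df.reset_index()
           .melt(id_vars="country", var_name="year", value_name="avg_temp")
)
print(back_to_long)
```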



### Which is the better format?

Out of the two formats above, which do you prefer? 

Tidy is a mix of both of these approaches, but in general long format works best for data-wrangling for three main reasons: 

1. If you have a lot of variables, wide format can be tricky. Imagine if we had 50 years worth of average temperatures, for each month. In wide format, you'd end up with 600 columns (50*12 months). In long format, you would only need a handful of columns (for example country, year, month and avg_temp), however many years you add.   

2. Long format structures the data in key-value pairs and therefore aids clarity. For example, it is relatively quick to see that in the long dataset above, the temperature belongs to a country-year pair.  

3. Many plotting and statistics tools (the seaborn plotting library, for example) work most naturally with data in long format. Therefore, it's easier to organise your data in this format to start with. 



Take a look at the image below: 

![](https://d33wubrfki0l68.cloudfront.net/f6fca537e77896868fedcd85d9d01031930d76c9/637d9/images/tidy-17.png)  

On the left, you have data that isn't classed as 'tidy'. Why is this?

This data isn't tidy because one column holds more than one value per cell (rule 3). On the right, this dataset has been 'tidied', and the cases and population have been separated, making it easier to analyse. 
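
In Pandas, this kind of separation can be done with `str.split`. The sketch below assumes a `rate` column stored as "cases/population" text; the column names and numbers are illustrative only.

```python
import pandas as pd

# Illustrative values only
messy = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "year": [1999, 1999],
    "rate": ["745/19987071", "37737/172006362"],
})

# Split the combined cell into two columns, then convert the text to integers
parts = messy["rate"].str.split("/", expand=True).astype("int64")

# Replace the combined column with one column per variable
tidy = messy.drop(columns="rate").assign(cases=parts[0], population=parts[1])
print(tidy)
```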


Now let's look at a slightly harder example. Pretend you're working at a drug testing unit, and you've got some data from three subjects, and you've recorded their heart rate twice a day. 


In [25]:
# create subject info DataFrame
subject_hr = pd.DataFrame({'name': ["SUBJ01","SUBJ02","SUBJ03"],
                           'hr_am': [68,72,68],
                           'hr_pm': [58,89,52]
                          })
    
subject_hr

Unnamed: 0,name,hr_am,hr_pm
0,SUBJ01,68,58
1,SUBJ02,72,89
2,SUBJ03,68,52


* There are two heart rate columns, when we really should only have one (rule 1: a variable must have its own column). 
* Because heart rate is spread out over two columns, rows don't contain one unique observation (rule 2). If you look, row one contains two different heart rate readings for one particular subject. 

What we really need is for our data to look like this: 


| SUBJID  | TIME    | HEART RATE  | 
|---------|---------|-------------| 
| SUBJ01  | hr_am   | 68          | 
| SUBJ01  | hr_pm   | 58          | 
| SUBJ02  | hr_am   | 72          |  
| SUBJ02  | hr_pm   | 89          | 
| SUBJ03  | hr_am   | 68          | 
| SUBJ03  | hr_pm   | 52          | 

In this layout, each observation has its own row, and each variable appears in exactly one column. 
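
One way to get there in Pandas is `melt`. The sketch below rebuilds `subject_hr` so that it runs on its own; the output column names `time` and `heart_rate` are our own choice.

```python
import pandas as pd

subject_hr = pd.DataFrame({
    "name": ["SUBJ01", "SUBJ02", "SUBJ03"],
    "hr_am": [68, 72, 68],
    "hr_pm": [58, 89, 52],
})

# Turn the two heart rate columns into (time, heart_rate) pairs,
# then order the rows by subject and time of day
tidy_hr = (
    subject_hr.melt(id_vars="name", var_name="time", value_name="heart_rate")
              .sort_values(["name", "time"])
              .reset_index(drop=True)
)
print(tidy_hr)
```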


## Why is tidy data important?

<br>
<center>
<div class='emphasis'>
"The development of tidy data has been driven by my experience working with real-world datasets. With few, if any, constraints on their organisation, such datasets are often constructed in bizarre ways. I have spent countless hours struggling to get such datasets organised in a way that makes data analysis possible, let alone easy."
- Hadley Wickham
</div>
</center>
<br>


Right now, you might be sitting thinking that tidy data is such a good (and almost, obvious) idea, that most datasets you come into contact with in the wild will be in this format? Wrong. In practice, raw data you're given is rarely perfectly tidy. Once you receive 

The main reason tidy data is important is because it makes it easy for a data analyst or scientist to extract the necessary data, because it has a standardised way of structuring the data. Messy data requires different strategies and often times unique programming solutions in order to extract data in the same way. This slows analysis, invites errors, and makes the analysis pipeline less robust and reproducible. 

Secondly, good ordering and structure of the data makes it easier to scan and eyeball the raw data. As we've already learned in the first section of today, this is an important and very necessary part to understanding and working with your data. 

Third, it's important to have tidy data in Python because of the type of programming language Python is. Python is an object-oriented programming language, meaning that data structures, functions and operations are designed to work on or as objects. A Pandas DataFrame is one such object: a collection of named columns (each one a `Series`) that share a row index, so its methods work most naturally when each column holds a single variable. 

Finally, tidy data arranges all values so that the relationships parallel the structure of DataFrames in Python. As a result, tidy datasets can easily use Python's different functions. This means you'll have an easier user experience: you can analyse data using functions that someone has already written for you! 



### Recap

What are the three main rules of tidy data?

**Answer**

1. Each variable must have its own column. 
2. Each observation must have its own row.
3. Each value must have its own cell. 


What are the main benefits of tidy data?

**Answer**

* Consistent structure, makes it easy to analyse  
* No need for complex or unique programming solutions  
* Works with Pandas DataFrames, therefore can use all of Python's inbuilt functions and syntax. 



### Additional information

* an [excerpt](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)  from the Python Data Science Handbook by Jake VanderPlas
* a great [post](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4) on how to handle missing values

* a [post](https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b) about data cleaning

* [Journal of Statistical Software - Tidy Data (Hadley Wickham)](http://vita.had.co.nz/papers/tidy-data.html)
