# Tidying data for analysis

## Principles of tidy data
1. Columns represent separate variables
2. Rows represent individual observations
3. Each type of observational unit forms its own table

In [1]:
import pandas as pd

airquality = pd.read_csv('data/airquality.csv')

In [2]:
airquality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


## Reshaping your data using melt

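Use `pd.melt()` to turn the `Ozone`, `Solar.R`, `Wind`, and `Temp` columns of `airquality` into rows. Pass the column you do not wish to melt to `id_vars`; the code below uses `Day`. Every column not listed in `id_vars` is melted, so `Month` is melted along with the measurement columns.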

In [3]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars='Day')

# Print the head of airquality_melt
airquality_melt.head()

Unnamed: 0,Day,variable,value
0,1,Ozone,41.0
1,2,Ozone,36.0
2,3,Ozone,12.0
3,4,Ozone,18.0
4,5,Ozone,
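
As a rough illustration of how `id_vars` decides what gets melted, here is a minimal sketch on a small, made-up DataFrame. The `toy` frame and its values are invented for this example and are not read from `airquality.csv`. Any column not listed in `id_vars` is melted unless `value_vars` restricts the set.

```python
import pandas as pd

# Made-up data that mimics a few airquality columns (values are invented)
toy = pd.DataFrame({
    'Month': [5, 5, 6],
    'Day': [1, 2, 1],
    'Ozone': [41.0, 36.0, 12.0],
    'Wind': [7.4, 8.0, 12.6],
})

# Only 'Day' is kept as an identifier, so 'Month' is melted as well
melt_day = pd.melt(toy, id_vars='Day')

# Keep both date columns and melt only the named measurement columns
melt_both = pd.melt(toy, id_vars=['Month', 'Day'], value_vars=['Ozone', 'Wind'])

print(melt_day['variable'].unique())   # includes 'Month'
print(melt_both['variable'].unique())  # only 'Ozone' and 'Wind'
print(melt_day.shape, melt_both.shape)
```

If a date-like column should stay attached to every reading, listing it in `id_vars` is usually the simpler choice.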


## Customizing melted data

When melting DataFrames, it is better to have column names more meaningful than `variable` and `value`, the defaults used by `pd.melt()`. Set them with the `var_name` and `value_name` parameters.

In [4]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(airquality, id_vars='Day', var_name='measurement', value_name='reading')

# Print the head of airquality_melt
airquality_melt.head()

Unnamed: 0,Day,measurement,reading
0,1,Ozone,41.0
1,2,Ozone,36.0
2,3,Ozone,12.0
3,4,Ozone,18.0
4,5,Ozone,


## `pivot()`: un-melting data

* Opposite of melting
    - Melting takes a set of columns and turns it into a single column; pivoting creates a new column for each unique value in a specified column.
* Melting: turn columns into rows
* Pivoting: turn unique values into separate columns
* Takes data from an analysis-friendly shape to a reporting-friendly shape
* Also fixes data that violates the tidy data principle that each row is an observation
    - For example, multiple variables stored in the same column
* But `.pivot()` cannot handle duplicate values: it raises a `ValueError` when the same index/column pair appears more than once
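
The points above can be sketched on a small, made-up long-format table. The names `long_df`, `wide_df`, and `dup_df` and all values are invented for illustration. The second half adds a repeated `(Day, measurement)` pair to show the kind of input `.pivot()` rejects.

```python
import pandas as pd

# Made-up long-format data: one row per (Day, measurement) pair
long_df = pd.DataFrame({
    'Day': [1, 1, 2, 2],
    'measurement': ['Ozone', 'Wind', 'Ozone', 'Wind'],
    'reading': [41.0, 7.4, 36.0, 8.0],
})

# Each unique value of 'measurement' becomes its own column
wide_df = long_df.pivot(index='Day', columns='measurement', values='reading')
print(wide_df)

# Repeat the first row so that (1, 'Ozone') is no longer unique
dup_df = pd.concat([long_df, long_df.iloc[[0]]], ignore_index=True)

try:
    dup_df.pivot(index='Day', columns='measurement', values='reading')
    pivot_failed = False
except ValueError as err:
    # pandas refuses to reshape when an index/column pair is duplicated
    pivot_failed = True
    print('pivot failed:', err)
```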

## `pivot_table()`

* Has an `aggfunc` parameter that specifies how to deal with duplicate values
* Example: aggregate the duplicate values by taking their average (the default)
* `.pivot_table()` has an `index` parameter, which specifies the columns that you don't want pivoted:
    - It is similar to the `id_vars` parameter of `pd.melt()`.
    - Two other parameters you will usually specify are `columns` (the name of the column you want to pivot) and `values` (the values to be used when the column is pivoted).
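
To make the roles of `index`, `columns`, and `values` concrete, the following sketch melts a tiny made-up table and pivots it back. The frame `wide` and its values are assumptions for illustration only. It also clears the leftover column-axis label, which is an optional cosmetic step.

```python
import pandas as pd

# Made-up wide data (values are invented)
wide = pd.DataFrame({
    'Day': [1, 2, 3],
    'Ozone': [41.0, 36.0, 12.0],
    'Wind': [7.4, 8.0, 12.6],
})

# Wide -> long
long_df = pd.melt(wide, id_vars='Day', var_name='measurement', value_name='reading')

# Long -> wide: 'Day' stays as the row label, 'measurement' values become columns
restored = long_df.pivot_table(index='Day', columns='measurement', values='reading')

# 'Day' comes back as the index; reset_index() turns it into a column again
restored = restored.reset_index()
restored.columns.name = None  # drop the leftover 'measurement' axis label

print(restored)
```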

In [5]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index='Day', columns='measurement', values='reading')

# Print the head of airquality_pivot
airquality_pivot.head()

measurement,Month,Ozone,Solar.R,Temp,Wind
Day,,,,,
1,7.0,77.75,199.0,80.2,6.78
2,7.0,43.0,174.8,80.8,9.16
3,7.0,33.25,177.4,79.4,9.62
4,7.0,62.333333,197.25,81.8,8.62
5,7.0,48.666667,163.333333,79.2,8.46


In [6]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()

In [7]:
# Print the head of airquality_pivot_reset
airquality_pivot_reset.head()

measurement,Day,Month,Ozone,Solar.R,Temp,Wind
0,1,7.0,77.75,199.0,80.2,6.78
1,2,7.0,43.0,174.8,80.8,9.16
2,3,7.0,33.25,177.4,79.4,9.62
3,4,7.0,62.333333,197.25,81.8,8.62
4,5,7.0,48.666667,163.333333,79.2,8.46


## Pivoting duplicate values

* By using `.pivot_table()` and the `aggfunc` parameter, you can not only reshape your data, but also collapse duplicate entries into a single aggregated value.

In [8]:
# Pivot airquality_melt, averaging duplicate readings: airquality_pivot
# The string 'mean' names the same aggregation as np.mean and needs no NumPy import
airquality_pivot = airquality_melt.pivot_table(index='Day', columns='measurement', values='reading', aggfunc='mean')

# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

airquality_pivot.head()

measurement,Day,Month,Ozone,Solar.R,Temp,Wind
0,1,7.0,77.75,199.0,80.2,6.78
1,2,7.0,43.0,174.8,80.8,9.16
2,3,7.0,33.25,177.4,79.4,9.62
3,4,7.0,62.333333,197.25,81.8,8.62
4,5,7.0,48.666667,163.333333,79.2,8.46
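
As a small illustration of how `aggfunc` settles duplicates, the sketch below uses a made-up table in which Day 1 has two `Ozone` readings. The frame `dup` and its values are invented for this example. Swapping the aggregation changes which single value survives.

```python
import pandas as pd

# Made-up long data where Day 1 has two Ozone readings
dup = pd.DataFrame({
    'Day': [1, 1, 1, 2, 2],
    'measurement': ['Ozone', 'Ozone', 'Wind', 'Ozone', 'Wind'],
    'reading': [40.0, 50.0, 7.4, 36.0, 8.0],
})

# Average the duplicates
by_mean = dup.pivot_table(index='Day', columns='measurement',
                          values='reading', aggfunc='mean')

# Keep the largest of the duplicates instead
by_max = dup.pivot_table(index='Day', columns='measurement',
                         values='reading', aggfunc='max')

print(by_mean.loc[1, 'Ozone'])  # 45.0
print(by_max.loc[1, 'Ozone'])   # 50.0
```

Which aggregation is appropriate depends on why the duplicates exist; averaging is only one reasonable choice.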


## Exercise - 1

* In this exercise, you're going to tidy the `m014` column, which represents males aged 0-14 years.
* To parse this value, extract the first letter into a new column for `gender` and the rest into a column for `age_group`. Because the values can be parsed by position, you can use pandas' vectorized string slicing through the `str` attribute of columns of type `object`.

In [9]:
tb = pd.read_csv('data/tb.csv')
tb.head()

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,f1524,f2534,f3544,f4554,f5564,f65,fu
0,AD,2000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,,,
1,AE,2000,2.0,4.0,4.0,6.0,5.0,12.0,10.0,,3.0,16.0,1.0,3.0,0.0,0.0,4.0,
2,AF,2000,52.0,228.0,183.0,149.0,129.0,94.0,80.0,,93.0,414.0,565.0,339.0,205.0,99.0,36.0,
3,AG,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,
4,AL,2000,2.0,19.0,21.0,14.0,24.0,19.0,16.0,,3.0,11.0,10.0,8.0,8.0,5.0,11.0,


In [10]:
# Melt tb: tb_melt
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
tb_melt.head()

Unnamed: 0,country,year,variable,value
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0


In [11]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
tb_melt.head()

Unnamed: 0,country,year,variable,value,gender,age_group
0,AD,2000,m014,0.0,m,014
1,AE,2000,m014,2.0,m,014
2,AF,2000,m014,52.0,m,014
3,AG,2000,m014,0.0,m,014
4,AL,2000,m014,2.0,m,014
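
The positional slicing used in this exercise can be tried on a small Series of codes. The Series `codes` is a made-up sample of column names in the same style as the `tb` data. The slices stay strings, so a leading zero such as the one in `014` is kept.

```python
import pandas as pd

# A few codes in the same style as the tb column names
codes = pd.Series(['m014', 'f1524', 'mu', 'f65'])

parsed = pd.DataFrame({
    'variable': codes,
    'gender': codes.str[0],      # first character
    'age_group': codes.str[1:],  # everything after the first character
})

print(parsed)
```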


## Exercise - 2

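* Splitting a column with `.str.split()` and `.str.get()`: split `type_country` on the underscore, then pull the first piece into `type` and the second into `country`.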

In [12]:
ebola = pd.read_csv('data/ebola.csv')
ebola.head()

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,


In [13]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0


In [14]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')

# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

ebola_melt.head()

Unnamed: 0,Date,Day,type_country,counts,str_split,type,country
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]",Cases,Guinea
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]",Cases,Guinea
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]",Cases,Guinea
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]",Cases,Guinea
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]",Cases,Guinea
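
The same split-then-get pattern can be sketched on a short, made-up Series. The names `labels`, `parts`, `kinds`, `countries`, and `expanded` are introduced only for this example. It also shows one possible alternative, `expand=True`, which returns the pieces as separate columns without an intermediate list column.

```python
import pandas as pd

# Made-up labels in the same 'type_country' style as the ebola columns
labels = pd.Series(['Cases_Guinea', 'Deaths_Liberia', 'Cases_SierraLeone'])

# Split each label into a list of pieces, then pick pieces by position
parts = labels.str.split('_')
kinds = parts.str.get(0)
countries = parts.str.get(1)

# Alternative: expand=True returns a DataFrame with one column per piece
expanded = labels.str.split('_', expand=True)
expanded.columns = ['type', 'country']

print(expanded)
```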
