# Data Cleaning with Pandas

>“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

>“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

There are three interrelated rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

![tidy](img/tidy-1.png)
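
To make the three rules concrete, here is a minimal sketch of what a tidy frame for the data used in this notebook could look like, built by hand from values that appear in the examples. The column names `cases` and `population` and the helper `is_one_row_per_observation` are illustrative choices, not part of the `untidy` module.

```python
import pandas as pd

# Hand-built tidy table: one column per variable, one row per (country, year).
tidy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil"],
    "year": [1999, 2000, 1999],
    "cases": [745, 2666, 37737],
    "population": [19987071, 20595360, 172006362],
})


def is_one_row_per_observation(df, keys):
    """Hypothetical helper: True if no combination of `keys` repeats."""
    return not df.duplicated(subset=keys).any()


print(tidy)
print(is_one_row_per_observation(tidy, ["country", "year"]))
```

A check like this only covers the second rule; whether each column really holds a single variable still has to be judged by reading the data.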

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import untidy
import pandas as pd

Below are several untidy versions of the same dataset. For each one, ask:
- What makes each of these datasets untidy?
- What would the ideal tidy version of this data look like?
- What code would you use to fix each dataset?

In [4]:
untidy.exampleone()

       country  year       obser      count
0  Afghanistan  1999       cases        745
1  Afghanistan  1999  population   19987071
2  Afghanistan  2000       cases       2666
3  Afghanistan  2000  population   20595360
4       Brazil  1999       cases      37737
5       Brazil  1999  population  172006362


In [11]:
df = pd.DataFrame(untidy.exampleone())
df

       country  year       obser      count
0  Afghanistan  1999       cases        745
1  Afghanistan  1999  population   19987071
2  Afghanistan  2000       cases       2666
3  Afghanistan  2000  population   20595360
4       Brazil  1999       cases      37737
5       Brazil  1999  population  172006362


In [23]:
# Spread the values of 'obser' into their own columns, one row per (country, year)
df_tidy = df.pivot_table(
    index=['country', 'year'],
    columns=['obser'],
    values=['count'])

In [30]:
df_tidy

                  count
obser             cases  population
country     year
Afghanistan 1999    745    19987071
            2000   2666    20595360
Brazil      1999  37737   172006362


In [29]:
df_tidy.reset_index()

           country  year  count
obser                     cases  population
0      Afghanistan  1999    745    19987071
1      Afghanistan  2000   2666    20595360
2           Brazil  1999  37737   172006362


In [43]:
df_tidy.dtypes

       obser     
count  cases         int64
       population    int64
dtype: object
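
The result above still carries a two-level column index (`count` over `cases`/`population`). As one possible way to end up with plain column names, the sketch below passes `columns` and `values` as scalars to `DataFrame.pivot`, then clears the leftover `obser` axis name. It rebuilds the example-one data by hand rather than calling `untidy.exampleone()`, and it assumes pandas 1.1 or newer, where `pivot` accepts a list for `index`.

```python
import pandas as pd

# Hand-copied version of the example-one data shown above.
df = pd.DataFrame({
    "country": ["Afghanistan"] * 4 + ["Brazil"] * 2,
    "year": [1999, 1999, 2000, 2000, 1999, 1999],
    "obser": ["cases", "population"] * 3,
    "count": [745, 19987071, 2666, 20595360, 37737, 172006362],
})

flat = (
    # Scalar `columns`/`values` give a single-level column index.
    df.pivot(index=["country", "year"], columns="obser", values="count")
      .reset_index()                # country and year become ordinary columns
      .rename_axis(columns=None)    # drop the leftover 'obser' axis label
)
print(flat)
```

Unlike `pivot_table`, `pivot` does not aggregate, so it raises an error if a `(country, year, obser)` combination appears twice; that can be a useful signal when cleaning.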

In [8]:
untidy.exampletwo()

       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583


In [68]:
df_two = pd.DataFrame(untidy.exampletwo())
df_two

       country  year               rate
0  Afghanistan  1999       745/19987071
1  Afghanistan  2000      2666/20595360
2       Brazil  1999    37737/172006362
3       Brazil  2000    80488/174504898
4        China  1999  212258/1272915272
5        China  2000  213766/1280428583


In [72]:
rates = df_two['rate'].str.split('/',expand=True)
rates

        0           1
0     745    19987071
1    2666    20595360
2   37737   172006362
3   80488   174504898
4  212258  1272915272
5  213766  1280428583


In [85]:
df_two['rate1'] = rates[0]
df_two['rate2'] = rates[1]
df_two = df_two.drop(columns='rate')  # the combined string column is no longer needed
df_two

       country  year   rate1       rate2
0  Afghanistan  1999     745    19987071
1  Afghanistan  2000    2666    20595360
2       Brazil  1999   37737   172006362
3       Brazil  2000   80488   174504898
4        China  1999  212258  1272915272
5        China  2000  213766  1280428583


In [126]:
df_two['rate1'] = pd.to_numeric(df_two['rate1'])
df_two['rate2'] = pd.to_numeric(df_two['rate2'])
df_two.dtypes

country    object
year        int64
rate1       int64
rate2       int64
dtype: object

In [151]:
# Replace the two count columns with the rate they encode (rate1 / rate2)
df_two['rate'] = df_two['rate1'] / df_two['rate2']
df_two = df_two.drop(['rate1', 'rate2'], axis=1)

In [152]:
df_two  # final tidy table

       country  year      rate
0  Afghanistan  1999  0.000037
1  Afghanistan  2000  0.000129
2       Brazil  1999  0.000219
3       Brazil  2000  0.000461
4        China  1999  0.000167
5        China  2000  0.000167
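
As an alternative sketch of the same cleanup, the split and the type conversion can be chained, and the two counts can be kept under descriptive names next to the computed rate. The names `cases` and `population` are an assumption based on example one, where the same numbers appear under those labels; the data is copied by hand rather than loaded from `untidy.exampletwo()`.

```python
import pandas as pd

# First three rows of the example-two data, copied by hand.
df_two = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil"],
    "year": [1999, 2000, 1999],
    "rate": ["745/19987071", "2666/20595360", "37737/172006362"],
})

# Split "cases/population" into two string columns labelled 0 and 1.
parts = df_two["rate"].str.split("/", expand=True)

tidy_two = (
    df_two.drop(columns="rate")
          .assign(cases=parts[0].astype("int64"),
                  population=parts[1].astype("int64"))
)
# Keep the counts and add the ratio as its own numeric column.
tidy_two["rate"] = tidy_two["cases"] / tidy_two["population"]
print(tidy_two.dtypes)
```

Keeping `cases` and `population` means the original counts are not lost once the ratio is computed; whether that is preferable depends on what the analysis needs.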


In [124]:
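# Alternative wide layout, kept for reference. It needs the rate1 and rate2
# columns, so it only works before they are dropped from df_two.
# two_tidy = df_two.pivot_table(
#     index = ['year'],
#     columns = ['country'],
#     values = ['rate1','rate2'])
# two_tidy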

               rate1                        rate2
country  Afghanistan  Brazil   China  Afghanistan     Brazil       China
year
1999             745   37737  212258     19987071  172006362  1272915272
2000            2666   80488  213766     20595360  174504898  1280428583


In [125]:
two_tidy.columns

MultiIndex(levels=[['rate1', 'rate2'], ['Afghanistan', 'Brazil', 'China']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]],
           names=[None, 'country'])
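
A `MultiIndex` on the columns like the one above is awkward to select from. One common way to deal with it, sketched here on a hand-built subset of the data, is to join the two levels into single string labels. The `_` separator and the variable names `wide` and `flat` are arbitrary choices for illustration.

```python
import pandas as pd

# Hand-built subset with the rate1/rate2 columns from the steps above.
df_two = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Brazil", "Brazil"],
    "year": [1999, 2000, 1999, 2000],
    "rate1": [745, 2666, 37737, 80488],
    "rate2": [19987071, 20595360, 172006362, 174504898],
})

wide = df_two.pivot_table(index=["year"], columns=["country"],
                          values=["rate1", "rate2"])

# to_flat_index() yields plain (level0, level1) tuples, e.g. ("rate1", "Brazil").
flat = wide.copy()
flat.columns = ["_".join(pair) for pair in wide.columns.to_flat_index()]
print(list(flat.columns))
```

Note that this wide layout is still untidy in the sense of the three rules, since country names end up inside column labels; flattening only makes the frame easier to index.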

In [6]:
untidy.examplethree()

table 1
       country    1999    2000
0  Afghanistan     745    2666
1       Brazil   37737   80488
2        China  212258  213766 

table 2
       country        1999        2000
0  Afghanistan    19987071    20595360
1       Brazil   172006362   174504898
2        China  1272915272  1280428583
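
Here the year values are spread across column headers and the two variables live in separate tables. A possible fix, sketched below, is to melt each table into long form and merge the results on `country` and `year`. The sketch assumes that table 1 holds cases and table 2 holds population (the numbers match the earlier examples) and that the year headers are strings; since the return type of `untidy.examplethree()` is not shown, both tables are rebuilt by hand.

```python
import pandas as pd

# Hand-copied versions of "table 1" and "table 2" (year headers assumed to be strings).
table1 = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [745, 37737, 212258],
    "2000": [2666, 80488, 213766],
})
table2 = pd.DataFrame({
    "country": ["Afghanistan", "Brazil", "China"],
    "1999": [19987071, 172006362, 1272915272],
    "2000": [20595360, 174504898, 1280428583],
})

# Turn the year columns into a single 'year' variable in each table.
cases = table1.melt(id_vars="country", var_name="year", value_name="cases")
population = table2.melt(id_vars="country", var_name="year",
                         value_name="population")

tidy_three = (
    cases.merge(population, on=["country", "year"])
         .astype({"year": "int64"})          # headers were strings
         .sort_values(["country", "year"])
         .reset_index(drop=True)
)
print(tidy_three)
```

The merged frame has one row per country and year with `cases` and `population` as separate columns, which is the same tidy shape targeted in the first two examples.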
