# data_cleansing_tutorial
A data cleansing tutorial in pandas for the ChiPy Scientific SIG.

## Data Import

In [10]:
import pandas as pd

In [11]:
df14 = pd.read_csv('../data/Immunization_14-15.csv')
df09 = pd.read_csv('../data/immunization_09-10.csv')
df04 = pd.read_csv('../data/immunization_04-05.csv')

### Initial data verification
Let's first take a look at the data briefly to make sure it's in the format we want.

In [12]:
df04.head()

Unnamed: 0,Immunization School Survey Report 2004-05,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62
0,RCDT,Region,District,School,Total Sch. Enroll.,Total - NC,Unduplic.,Total Excluded,NC-PE Only,Polio Prot.,...,Cpox Prot.,Cpox Rel.,Cpox Med.,Cpox AS,Cpox UP/NC,Cpox Total,,,,
1,Adams County,,,,,,,,,,...,,,,,,,,,,
2,01001001026,01,0010,0001,279,2,2,2,0,277,...,0,0,0,0,0,0,,,,
3,01001002026,01,0020,2002,278,1,1,1,0,276,...,138,0,0,1,1,140,,,,
4,01001002026,01,0020,2002,419,0,0,0,0,419,...,161,0,0,0,0,161,,,,


Here we see that the header information is wrong: the first line of the file is only a report title, and the real column names ended up in row 0. Because we are lazy, we will re-import the data and skip that first line instead of repairing the current frame.

In [20]:
df04 = pd.read_csv('../data/immunization_04-05.csv', skiprows=1)
df04.head()

Unnamed: 0,RCDT,Region,District,School,Total Sch. Enroll.,Total - NC,Unduplic.,Total Excluded,NC-PE Only,Polio Prot.,...,Cpox Prot.,Cpox Rel.,Cpox Med.,Cpox AS,Cpox UP/NC,Cpox Total,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62
0,Adams County,,,,,,,,,,...,,,,,,,,,,
1,01001001026,1.0,0010,1.0,279.0,2.0,2.0,2.0,0.0,277.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,01001002026,1.0,0020,2002.0,278.0,1.0,1.0,1.0,0.0,276.0,...,138.0,0.0,0.0,1.0,1.0,140.0,,,,
3,01001002026,1.0,0020,2002.0,419.0,0.0,0.0,0.0,0.0,419.0,...,161.0,0.0,0.0,0.0,0.0,161.0,,,,
4,01001002X06,1.0,002X,,197.0,0.0,0.0,0.0,0.0,193.0,...,101.0,2.0,0.0,0.0,0.0,103.0,,,,
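To make the effect of `skiprows=1` concrete, here is a minimal sketch on a made-up three-line file (the contents of `raw` are an assumption for illustration, not the real data). It also shows that `header=1` should give the same result here, since both treat the second line as the header.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: a title line, then the real header.
raw = io.StringIO(
    "Immunization School Survey Report 2004-05,,\n"
    "RCDT,School,Polio Prot.\n"
    "01001001026,0001,277\n"
)

# skiprows=1 discards the title line, so the next line becomes the header.
demo = pd.read_csv(raw, skiprows=1)
print(demo.columns.tolist())

# header=1 names line index 1 as the header and ignores what precedes it.
raw.seek(0)
same = pd.read_csv(raw, header=1)
print(demo.equals(same))
```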


Cool, the columns are correct, but for our purposes we only need the school identifiers, the enrollment, and the Polio column. Let's remove the rest.

In [21]:
columns_to_keep = ['RCDT', 'School', 'Total Sch. Enroll.', 'Polio Prot.']
df04 = df04.loc[:, columns_to_keep] # .ix was removed from pandas; .loc takes all rows, only the columns in the list columns_to_keep
df04.head()

| | RCDT | School | Total Sch. Enroll. | Polio Prot. |
|---|---|---|---|---|
| 0 | Adams County | | | |
| 1 | 01001001026 | 1.0 | 279.0 | 277.0 |
| 2 | 01001002026 | 2002.0 | 278.0 | 276.0 |
| 3 | 01001002026 | 2002.0 | 419.0 | 419.0 |
| 4 | 01001002X06 | | 197.0 | 193.0 |
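For reference, selecting a list of columns by label can be written with `.loc` or with plain brackets. The sketch below uses a small made-up frame (its values are assumptions for illustration) to show that the two spellings return the same result.

```python
import pandas as pd

# Made-up stand-in for the imported frame.
demo = pd.DataFrame({
    "RCDT": ["01001001026", "01001002026"],
    "Region": ["01", "01"],
    "School": ["0001", "2002"],
    "Polio Prot.": [277, 276],
})
keep = ["RCDT", "School", "Polio Prot."]

by_loc = demo.loc[:, keep]   # all rows, only the listed columns
by_brackets = demo[keep]     # shorthand for the same selection
print(by_loc.equals(by_brackets))
```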


Interesting, the first row is NaN everywhere except the first column. Looking back at the table above, this is because the original table has extra rows that only name a county, e.g. "Adams County".

Let's simply drop the NaNs in this table.

In [24]:
df04.dropna(inplace=True) # inplace=True replaces the previous data frame.
                          # This is the same as df04 = df04.dropna(inplace=False)
df04.head()

| | RCDT | School | Total Sch. Enroll. | Polio Prot. |
|---|---|---|---|---|
| 1 | 01001001026 | 1.0 | 279 | 277 |
| 2 | 01001002026 | 2002.0 | 278 | 276 |
| 3 | 01001002026 | 2002.0 | 419 | 419 |
| 4 | 01001002X06 | | 197 | 193 |
| 5 | 01001003026 | 1001.0 | 330 | 329 |
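By default `dropna()` removes every row that contains at least one NaN, which can be more than intended. A minimal sketch on made-up rows (the values are assumptions, not the real data) shows the difference between the default and restricting the check with `subset`.

```python
import numpy as np
import pandas as pd

# Made-up rows: a county label row, a complete row, and a row missing School.
demo = pd.DataFrame({
    "RCDT": ["Adams County", "01001001026", "01001002X06"],
    "School": [np.nan, "0001", np.nan],
    "Polio Prot.": [np.nan, 277, 193],
})

any_nan = demo.dropna()                         # drops rows 0 and 2
key_only = demo.dropna(subset=["Polio Prot."])  # drops only row 0
print(len(any_nan), len(key_only))
```

Whether the stricter or the looser variant is appropriate depends on which columns the later analysis actually needs.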


Instead of looking at just the first five rows, let's check out some summary statistics.

#### Summary Statistics

In [25]:
df04.describe()

| | RCDT | School | Total Sch. Enroll. | Polio Prot. |
|---|---|---|---|---|
| count | 5179 | 5179.0 | 5179 | 5179 |
| unique | 2045 | 701.0 | 1237 | 1240 |
| top | 15016299025 | | 231 | 127 |
| freq | 594 | 1165.0 | 19 | 16 |


In [26]:
df04.dtypes

RCDT                  object
School                object
Total Sch. Enroll.    object
Polio Prot.           object
dtype: object

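Uh oh, the enrollment and polio columns are supposed to be numerical, not strings (denoted here as objects). That is also why `describe()` reported `unique`, `top` and `freq` instead of means and quartiles.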

Let's convert these now.

In [33]:
# inplace isn't available for this function, so assign the result back manually
cols = ['Total Sch. Enroll.', 'Polio Prot.']
df04[cols] = df04[cols].applymap(lambda x: int(x))
df04.head()

ValueError: ("invalid literal for int() with base 10: '1,628'", u'occurred at index Total Sch. Enroll.')

This error shows that some values contain commas as thousands separators (e.g. '1,628'), so Python can't convert them to int directly.

We have to remove the comma, then convert to int.

In [None]:
df04[cols] = df04[cols].applymap(lambda x: int(x.replace(',', '')))
df04.head()
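As an alternative to `applymap` with a lambda, the same cleanup can be sketched with pandas' vectorized string methods. The frame below is made up for illustration (its values are assumptions); the sketch first lists the entries that do not parse as numbers, then strips the commas and converts.

```python
import pandas as pd

# Made-up stand-in for the two string columns.
demo = pd.DataFrame({
    "Total Sch. Enroll.": ["279", "1,628"],
    "Polio Prot.": ["277", "1,601"],
})
cols = ["Total Sch. Enroll.", "Polio Prot."]

for col in cols:
    # Entries that fail numeric parsing become NaN under errors="coerce".
    bad = pd.to_numeric(demo[col], errors="coerce").isna()
    print(col, demo.loc[bad, col].tolist())
    # Remove the literal commas, then convert the column to integers.
    demo[col] = demo[col].str.replace(",", "", regex=False).astype(int)

print(demo.dtypes)
```

Listing the offending values first is a cheap way to confirm that commas are the only problem before converting.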

In [37]:
df04.dtypes

RCDT                  object
School                object
Total Sch. Enroll.     int64
Polio Prot.            int64
dtype: object
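The commas could also be handled at import time with the `thousands` parameter of `read_csv`. This is only a sketch on a made-up file: it assumes the commas are thousands separators inside quoted fields, which has not been checked against the real CSV.

```python
import io
import pandas as pd

# Hypothetical file contents; quoted fields carry thousands separators.
raw = io.StringIO(
    'RCDT,Total Sch. Enroll.,Polio Prot.\n'
    '01001001026,279,277\n'
    '15016299025,"1,628","1,601"\n'
)

# thousands="," makes the parser read "1,628" as the number 1628.
demo = pd.read_csv(raw, thousands=",", dtype={"RCDT": str})
print(demo["Total Sch. Enroll."].tolist())
print(demo.dtypes)
```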

Beautiful. Now let's run those summary stats again.

In [39]:
df04.describe(include='all')

| | RCDT | School | Total Sch. Enroll. | Polio Prot. |
|---|---|---|---|---|
| count | 5179.0 | 5179.0 | 5179.0 | 5179.0 |
| unique | 2045.0 | 701.0 | | |
| top | 15016299025.0 | | | |
| freq | 594.0 | 1165.0 | | |
| mean | | | 454.953659 | 450.193281 |
| std | | | 459.038458 | 455.74756 |
| min | | | 1.0 | 1.0 |
| 25% | | | 191.0 | 189.0 |
| 50% | | | 344.0 | 341.0 |
| 75% | | | 561.0 | 554.5 |
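The empty cells in this output come from `include='all'`: string columns only get `count`, `unique`, `top` and `freq`, numeric columns only get the numeric statistics, and everything else is NaN. A minimal sketch on a made-up frame (values are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame({
    "RCDT": ["a", "a", "b"],          # string column
    "Polio Prot.": [277, 276, 419],   # numeric column
})

summary = demo.describe(include="all")
# 'unique' is filled only for the string column, 'mean' only for the numeric one.
print(summary.loc[["unique", "mean"]])
```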


This looks fine so far.

We can probably do the same things to the next datasets.

In [41]:
df09.head()

Unnamed: 0,2009-2010 School Year Immunization Data,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 47,Unnamed: 48,Unnamed: 49,Unnamed: 50,Unnamed: 51,Unnamed: 52,Unnamed: 53,Unnamed: 54,Unnamed: 55,Unnamed: 56
0,RCDTS,Facility,RCDT,School,Enrollment,TotalNC,UnduplCount,NCPE,Excluded,PolioProt,...,HibMedObj,HibRelObj,HibApprSchd,HibTotal,CpoxProt,CpoxNcNp,CpoxMedObj,CpoxRelObj,CpoxApprSchd,CpoxTotal
1,140161630022008,21st Century Preparatory Center,14016163002,2008,234,3,3,0,3,232,...,0,0,0,54,233,0,1,0,0,234
2,540921180243104,2x2 Christian Educational Center,54092118024,3104,19,19,19,0,0,0,...,0,0,0,19,0,19,0,0,0,19
3,140162180160007,A B Shepard High Sch (Campus),14016218016,0007,1944,3,2,1,3,1942,...,0,0,0,0,0,0,0,0,0,0
4,041012050252059,A C Thompson Elem School,04101205025,2059,307,1,0,1,0,307,...,0,0,0,0,307,0,0,0,0,307


Same header issue as before.

In [42]:
df09 = pd.read_csv('../data/immunization_09-10.csv', skiprows=1)
df09.head()

Unnamed: 0,RCDTS,Facility,RCDT,School,Enrollment,TotalNC,UnduplCount,NCPE,Excluded,PolioProt,...,HibMedObj,HibRelObj,HibApprSchd,HibTotal,CpoxProt,CpoxNcNp,CpoxMedObj,CpoxRelObj,CpoxApprSchd,CpoxTotal
0,140161630022008,21st Century Preparatory Center,14016163002,2008,234,3,3,0,3,232,...,0,0,0,54,233,0,1,0,0,234
1,540921180243104,2x2 Christian Educational Center,54092118024,3104,19,19,19,0,0,0,...,0,0,0,19,0,19,0,0,0,19
2,140162180160007,A B Shepard High Sch (Campus),14016218016,7,1944,3,2,1,3,1942,...,0,0,0,0,0,0,0,0,0,0
3,41012050252059,A C Thompson Elem School,4101205025,2059,307,1,0,1,0,307,...,0,0,0,0,307,0,0,0,0,307
4,140160960022001,A F Ames Elem School,14016096002,2001,334,1,0,1,0,330,...,0,0,0,0,330,0,0,4,0,334
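One thing visible in the two outputs above: the identifier shown as 041012050252059 before the re-import appears as 41012050252059 afterwards, and School 0007 appears as 7. That suggests some identifier values were parsed as numbers. A possible remedy, sketched here on a made-up one-row file, is to ask `read_csv` to keep those columns as text via `dtype`.

```python
import io
import pandas as pd

# Hypothetical one-row file with zero-padded identifiers.
csv_text = "RCDTS,School,Enrollment\n041012050252059,0007,307\n"

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text),
                      dtype={"RCDTS": str, "School": str})

print(inferred["RCDTS"].iloc[0])  # parsed as an integer, leading zero lost
print(as_text["RCDTS"].iloc[0])   # kept as text, leading zero preserved
```

Keeping identifiers as strings matters if they are later used to match schools across years.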


In [43]:
df09.columns

Index([u'RCDTS', u'Facility', u'RCDT', u'School', u'Enrollment', u'TotalNC',
       u'UnduplCount', u'NCPE', u'Excluded', u'PolioProt', u'PolioNcNp',
       u'PolioMedObj', u'PolioRelObj', u'PolioApprSchd', u'PolioTotal',
       u'DtpProt', u'DtpNcNp', u'DtpMedObj', u'DtpRelObj', u'DtpApprSchd',
       u'DtpTotal', u'MeasProt', u'MeasNcNp', u'MeasMedObj', u'MeasRelObj',
       u'MeasApprSchd', u'MeasTotal', u'RubProt', u'RubNcNp', u'RubMedObj',
       u'RubRelObj', u'RubApprSchd', u'RubTotal', u'MumpProt', u'MumpNcNp',
       u'MumpMedObj', u'MumpRelObj', u'MumpApprSchd', u'MumpTotal', u'HepProt',
       u'HepNcNp', u'HepMedObj', u'HepRelObj', u'HepApprSchd', u'HepTotal',
       u'HibProt', u'HibNcNp', u'HibMedObj', u'HibRelObj', u'HibApprSchd',
       u'HibTotal', u'CpoxProt', u'CpoxNcNp', u'CpoxMedObj', u'CpoxRelObj',
       u'CpoxApprSchd', u'CpoxTotal'],
      dtype='object')

In [45]:
columns_to_keep = ['RCDTS', 'Enrollment', 'PolioProt']
df09 = df09[columns_to_keep].dropna() # chaining avoids calling dropna in place on a slice
df09.head()

| | RCDTS | Enrollment | PolioProt |
|---|---|---|---|
| 0 | 140161630022008 | 234 | 232 |
| 1 | 540921180243104 | 19 | 0 |
| 2 | 140162180160007 | 1944 | 1942 |
| 3 | 41012050252059 | 307 | 307 |
| 4 | 140160960022001 | 334 | 330 |


In [46]:
df09.dtypes

RCDTS          object
Enrollment    float64
PolioProt     float64
dtype: object

How nice, they're floats already.

In [48]:
df09.describe(include='all')

| | RCDTS | Enrollment | PolioProt |
|---|---|---|---|
| count | 5021.0 | 5021.0 | 5021.0 |
| unique | 5021.0 | | |
| top | 280065050160001.0 | | |
| freq | 1.0 | | |
| mean | | 460.148576 | 453.327026 |
| std | | 469.334927 | 465.449582 |
| min | | 2.0 | 0.0 |
| 25% | | 199.0 | 195.0 |
| 50% | | 353.0 | 348.0 |
| 75% | | 549.0 | 540.0 |
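Since the same steps were applied to both files, they could be bundled into one function. The helper below is a hypothetical sketch, not part of the original notebook: `load_polio` and its parameters are made-up names, and unlike the cells above it reads every column as text (`dtype=str`) before converting the numeric ones. It is exercised on a made-up file.

```python
import io
import pandas as pd

def load_polio(source, keep, numeric_cols):
    """Hypothetical helper: skip the title line, keep columns, clean numbers."""
    df = pd.read_csv(source, skiprows=1, dtype=str)  # everything as text
    df = df.loc[:, keep].dropna()                    # drop county-label rows
    for col in numeric_cols:
        # Strip thousands separators, then convert to integers.
        df[col] = df[col].str.replace(",", "", regex=False).astype(int)
    return df

# Made-up file mimicking the 2004-05 layout.
raw = io.StringIO(
    "Immunization School Survey Report 2004-05,,,\n"
    "RCDT,School,Total Sch. Enroll.,Polio Prot.\n"
    "Adams County,,,\n"
    "01001001026,0001,279,277\n"
    '15016299025,2002,"1,628","1,601"\n'
)
numeric = ["Total Sch. Enroll.", "Polio Prot."]
clean = load_polio(raw, ["RCDT", "School"] + numeric, numeric)
print(clean)
```

The 2009-10 file uses different column names (`RCDTS`, `Enrollment`, `PolioProt`), so those would be passed as `keep` and `numeric_cols` instead.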
