# 2. Prepare Data

From https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them using scatter plots and histograms. I also like to describe each attribute in detail, along with the relationships between attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the algorithms.

The actual data preparation process has three steps:

* Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
* Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
* Step 3: Data Transformation: Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.
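
The three steps above can be sketched on a small, made-up frame. This is only an illustration under assumptions: the `toy` frame, its values, and the choice of row-wise min-max scaling are hypothetical and are not taken from the real training set or from the process described in the source article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, loosely shaped like the training data (values are invented)
toy = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "2005 [YR2005]": [0.10, np.nan, 0.30, 0.40],
    "2006 [YR2006]": [0.20, np.nan, 0.35, 0.50],
    "2007 [YR2007]": [0.30, np.nan, 0.40, 0.80],
    "Country Name": ["A", "A", "B", "B"],
})
year_cols = [c for c in toy.columns if "[YR" in c]

# Step 1, selection: drop rows that have no observation in any year column
selected = toy.dropna(subset=year_cols, how="all")

# Step 2, preprocessing: rename "2005 [YR2005]" style columns to integer years
years = [int(c.split()[0]) for c in year_cols]
cleaned = selected.rename(columns=dict(zip(year_cols, years)))

# Step 3, transformation: min-max scale each row to the range [0, 1]
# (a row whose values are all equal would produce NaN here)
values = cleaned[years]
row_min = values.min(axis=1)
row_range = values.max(axis=1) - row_min
scaled = values.sub(row_min, axis=0).div(row_range, axis=0)

print(scaled)
```

Which scaling, if any, is appropriate depends on the series being modelled; the point here is only the order of the steps.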


In [27]:
import numpy as np
import pandas as pd

## Note: the data is assumed to be in the "data" subdirectory as "data/TrainingSet.csv" and "data/SubmissionRows.csv"
The data directory will be ignored by git.

In [28]:
training_data = pd.read_csv("data/TrainingSet.csv")

In [29]:
# Get some basic stats about the training data
training_data.describe()

Unnamed: 0.1,Unnamed: 0,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],...,1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007]
count,195402.0,64945.0,64443.0,64966.0,66973.0,67717.0,69735.0,69763.0,69906.0,75250.0,...,125944.0,130880.0,140547.0,136783.0,140315.0,139159.0,142379.0,161544.0,158888.0,161596.0
mean,141942.303426,163063800000.0,183948800000.0,208953400000.0,214882600000.0,232151700000.0,241368200000.0,254058300000.0,274281000000.0,267485800000.0,...,707904500000.0,721459000000.0,739618900000.0,823633100000.0,883434200000.0,969198300000.0,1054572000000.0,1057680000000.0,1203163000000.0,1353147000000.0
std,82594.568035,4261616000000.0,4749746000000.0,5378336000000.0,5647070000000.0,6120314000000.0,6398377000000.0,6710724000000.0,7213662000000.0,7381164000000.0,...,19272250000000.0,19751100000000.0,20552620000000.0,22407170000000.0,24124110000000.0,26120310000000.0,28748330000000.0,30474570000000.0,34695900000000.0,40021080000000.0
min,0.0,-104793900000000.0,-112888900000000.0,-71341610000000.0,-82695880000000.0,-97356520000000.0,-94334220000000.0,-94958980000000.0,-53624790000000.0,-56497900000000.0,...,-101474200000000.0,-96461400000000.0,-92161800000000.0,-66210600000000.0,-56357000000000.0,-185355200000000.0,-151522200000000.0,-135000500000000.0,-142268900000000.0,-169182000000000.0
25%,70571.25,3.176702,3.550009,4.0,3.671917,4.5953,5.0,4.901495,5.0,5.682373,...,4.902281,5.264969,5.304083,5.249579,5.269189,5.5,5.46,5.20667,5.206982,5.0
50%,141554.5,63.94,66.31737,70.48563,71.88613,74.56354,78.94462,78.63808,81.40866,81.82969,...,64.23599,65.7,62.70796,63.916,63.3,64.52287,62.9062,57.37856,55.47417,55.04303
75%,211984.75,5007000.0,7131000.0,9250000.0,11082000.0,12900000.0,15409500.0,19228500.0,24093750.0,24192250.0,...,18704450.0,13000000.0,6544000.0,10388500.0,9278000.0,12000000.0,10660500.0,8484250.0,10250220.0,8599101.0
max,286117.0,268133500000000.0,294346700000000.0,318650600000000.0,338354100000000.0,358615200000000.0,389586900000000.0,425450600000000.0,455626200000000.0,503905000000000.0,...,1348416000000000.0,1324599000000000.0,1389770000000000.0,1646322000000000.0,1821833000000000.0,2013675000000000.0,2295826000000000.0,2774281000000000.0,3339217000000000.0,3950893000000000.0


In [30]:
# Look at the first few lines to get an idea of what the actual data looks like
training_data.head()

Unnamed: 0.1,Unnamed: 0,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],...,2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],Country Name,Series Code,Series Name
0,0,,,,,,,,,,...,,,,,,,3.769214,Afghanistan,allsi.bi_q1,(%) Benefits held by 1st 20% population - All ...
1,1,,,,,,,,,,...,,,,,,,7.027746,Afghanistan,allsp.bi_q1,(%) Benefits held by 1st 20% population - All ...
2,2,,,,,,,,,,...,,,,,,,8.244887,Afghanistan,allsa.bi_q1,(%) Benefits held by 1st 20% population - All ...
3,4,,,,,,,,,,...,,,,,,,12.933105,Afghanistan,allsi.gen_pop,(%) Generosity of All Social Insurance
4,5,,,,,,,,,,...,,,,,,,18.996814,Afghanistan,allsp.gen_pop,(%) Generosity of All Social Protection


In [31]:
# Same for the last few lines
training_data.tail()

Unnamed: 0.1,Unnamed: 0,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],...,2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],Country Name,Series Code,Series Name
195397,286113,,,,,,,,,,...,,,,,,12.2,,Zimbabwe,SG.VAW.BURN.ZS,Women who believe a husband is justified in be...
195398,286114,,,,,,,,,,...,,,,,,33.0,,Zimbabwe,SG.VAW.GOES.ZS,Women who believe a husband is justified in be...
195399,286115,,,,,,,,,,...,,,,,,30.2,,Zimbabwe,SG.VAW.NEGL.ZS,Women who believe a husband is justified in be...
195400,286116,,,,,,,,,,...,,,,,,24.3,,Zimbabwe,SG.VAW.REFU.ZS,Women who believe a husband is justified in be...
195401,286117,,,,,,,,,,...,57.0,57.2,57.5,57.7,57.9,58.1,58.3,Zimbabwe,SH.DYN.AIDS.FE.ZS,Women's share of population ages 15+ living wi...


In [32]:
# What column names do we have
training_data.columns

Index(['Unnamed: 0', '1972 [YR1972]', '1973 [YR1973]', '1974 [YR1974]',
       '1975 [YR1975]', '1976 [YR1976]', '1977 [YR1977]', '1978 [YR1978]',
       '1979 [YR1979]', '1980 [YR1980]', '1981 [YR1981]', '1982 [YR1982]',
       '1983 [YR1983]', '1984 [YR1984]', '1985 [YR1985]', '1986 [YR1986]',
       '1987 [YR1987]', '1988 [YR1988]', '1989 [YR1989]', '1990 [YR1990]',
       '1991 [YR1991]', '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]',
       '1995 [YR1995]', '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]',
       '1999 [YR1999]', '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]',
       '2003 [YR2003]', '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]',
       '2007 [YR2007]', 'Country Name', 'Series Code', 'Series Name'],
      dtype='object')

In [33]:
# Find unique countries
training_data['Country Name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra',
       'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas, The', 'Bahrain',
       'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin',
       'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei Darussalam', 'Bulgaria',
       'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Canada', 'Cayman Islands', 'Central African Republic', 'Chad',
       'Channel Islands', 'Chile', 'China', 'Colombia', 'Comoros',
       'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
       'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador',
       'Egypt, Arab Rep.', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Ethiopia', 'Faeroe Islands', 'Fiji', 'Finland',
       'France', 

In [35]:
print ("Number of distinct countries {}".format(len(training_data['Country Name'].unique())))

Number of distinct countries 214


In [36]:
# Pandas can also tell us how many unique values are in each column
training_data.nunique()

Unnamed: 0       195402
1972 [YR1972]     47534
1973 [YR1973]     47677
1974 [YR1974]     48521
1975 [YR1975]     50269
1976 [YR1976]     51279
1977 [YR1977]     53185
1978 [YR1978]     53578
1979 [YR1979]     53882
1980 [YR1980]     58032
1981 [YR1981]     59693
1982 [YR1982]     60516
1983 [YR1983]     60615
1984 [YR1984]     61088
1985 [YR1985]     62170
1986 [YR1986]     62726
1987 [YR1987]     63482
1988 [YR1988]     63814
1989 [YR1989]     65248
1990 [YR1990]     78732
1991 [YR1991]     77506
1992 [YR1992]     80642
1993 [YR1993]     81969
1994 [YR1994]     83600
1995 [YR1995]     88951
1996 [YR1996]     89206
1997 [YR1997]     90177
1998 [YR1998]     90256
1999 [YR1999]     94957
2000 [YR2000]     99186
2001 [YR2001]     99397
2002 [YR2002]    102167
2003 [YR2003]    101224
2004 [YR2004]    102863
2005 [YR2005]    111682
2006 [YR2006]    112989
2007 [YR2007]    115182
Country Name        214
Series Code        1305
Series Name        1305
dtype: int64
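
The `count` row of `describe()` above is much smaller for the early years than the 195402 rows in the file, which suggests that many year values are missing. A minimal sketch of how the gaps could be counted per column, shown on a hypothetical toy frame (the values are invented) rather than on the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column naming style as the training data
toy = pd.DataFrame({
    "1972 [YR1972]": [np.nan, np.nan, 1.5, np.nan],
    "2007 [YR2007]": [3.7, 7.0, np.nan, 12.9],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Fraction of missing values in each column (the mean of a boolean mask)
missing_fraction = toy.isna().mean()

print(missing_per_column)
print(missing_fraction)
```

The same two calls should work unchanged on `training_data`, although the result there is not shown in this notebook.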

In [37]:
# read the data containing the rows we need to predict
submission_rows = pd.read_csv('data/SubmissionRows.csv')

In [17]:
submission_rows.describe()

Unnamed: 0.1,Unnamed: 0,2008 [YR2008],2012 [YR2012]
count,737.0,0.0,0.0
mean,142507.36228,,
std,82711.218777,,
min,559.0,,
25%,70277.0,,
50%,142475.0,,
75%,213142.0,,
max,285811.0,,


In [18]:
submission_rows.nunique()

Unnamed: 0       737
2008 [YR2008]      0
2012 [YR2012]      0
dtype: int64

In [41]:
# The 'Unnamed: 0' columns are really row IDs so let's rename them to make things clearer
training_data.rename(columns={'Unnamed: 0':'row_id'}, inplace=True)
submission_rows.rename(columns={'Unnamed: 0':'row_id'}, inplace=True)

In [42]:
# isin() returns a boolean Series that is True for each training row whose row_id also appears in the submission rows
rows_present_in_training_and_submission = training_data['row_id'].isin(submission_rows['row_id'])

In [43]:
rows_present_in_training_and_submission.describe()

count     195402
unique         2
top        False
freq      194665
Name: Unnamed: 0, dtype: object

In [44]:
rows_present_in_training_and_submission.value_counts()

False    194665
True        737
Name: Unnamed: 0, dtype: int64

In [45]:
# Select only the rows from the training data that have row IDs matching the submission data
joined_data = training_data[rows_present_in_training_and_submission]

In [46]:
joined_data.describe()

Unnamed: 0,row_id,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],...,1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007]
count,737.0,172.0,170.0,170.0,179.0,186.0,192.0,195.0,200.0,201.0,...,675.0,701.0,714.0,713.0,715.0,713.0,713.0,726.0,725.0,737.0
mean,142507.36228,0.219937,0.207817,0.203128,0.189418,0.198788,0.203314,0.195784,0.201251,0.193704,...,0.26333,0.286637,0.302485,0.306344,0.318082,0.32726,0.334718,0.343434,0.352864,0.361211
std,82711.218777,0.263087,0.253884,0.256467,0.247787,0.264639,0.269425,0.260019,0.274549,0.272135,...,0.363003,0.374184,0.378456,0.378707,0.381181,0.383393,0.383628,0.385937,0.387891,0.390645
min,559.0,0.0122,0.0116,0.0109,0.0,0.0,0.0,0.0,0.0,0.0,...,4e-06,2e-06,5.9e-05,3e-06,0.0,0.00017,0.00017,0.0001,0.00012,6e-05
25%,70277.0,0.056925,0.053775,0.050825,0.04325,0.04095,0.039575,0.03715,0.0341,0.0308,...,0.007,0.008454,0.0118,0.0124,0.0151,0.0153,0.0155,0.0168,0.018,0.018
50%,142475.0,0.1327,0.12725,0.11935,0.1074,0.10925,0.1081,0.1069,0.10245,0.0947,...,0.0414,0.0527,0.070743,0.076123,0.085181,0.092325,0.1026,0.1124,0.12,0.1237
75%,213142.0,0.237075,0.2285,0.22125,0.21335,0.213675,0.21785,0.2085,0.210875,0.1991,...,0.533,0.61,0.6635,0.668,0.70335,0.752,0.744,0.770868,0.794,0.802
max,285811.0,0.999836,0.991109,0.992353,0.99747,0.999596,0.999627,0.999718,0.999843,0.999532,...,1.0,0.999986,0.999702,0.999746,0.999751,0.999956,0.999657,0.999986,0.999762,0.999992
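
The row selection used above combines `Series.isin` with boolean indexing. A small sketch of the same pattern, where the `train` and `wanted` frames and their values are hypothetical stand-ins for the training data and the submission rows:

```python
import pandas as pd

# Hypothetical stand-ins for the training data and the submission rows
train = pd.DataFrame({
    "row_id": [0, 1, 2, 4, 5],
    "value": [3.7, 7.0, 8.2, 12.9, 19.0],
})
wanted = pd.DataFrame({"row_id": [1, 4]})

# Boolean mask: True where the training row_id also appears in `wanted`
mask = train["row_id"].isin(wanted["row_id"])

# Boolean indexing keeps only the rows where the mask is True
subset = train[mask]

print(mask.value_counts())
print(subset)
```

Because the mask keeps the index of `train`, `subset` keeps the original row labels, which makes it straightforward to trace a selected row back to the full data.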
