# 2. Prepare Data

From https://machinelearningmastery.com/process-for-working-through-machine-learning-problems/

I preface data preparation with a data analysis phase that involves summarizing the attributes and visualizing them using scatter plots and histograms. I also like to describe in detail each attribute and the relationships between attributes. This grunt work forces me to think about the data in the context of the problem before it is lost to the algorithms.
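As a rough illustration of that analysis phase, the sketch below summarizes two attributes, measures the relationship between them, and computes the bin counts a histogram would draw. The column names and values are hypothetical toy data, not taken from the training set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two numeric attributes.
toy = pd.DataFrame({
    "gdp_growth": [1.0, 2.0, 3.0, 4.0, 5.0],
    "inflation": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Per-attribute summary statistics (count, mean, std, quartiles).
summary = toy.describe()

# Pairwise linear correlation between attributes.
correlations = toy.corr()

# Histogram bin counts: the numbers behind a histogram plot.
counts, bin_edges = np.histogram(toy["gdp_growth"], bins=2)

print(summary)
print(correlations)
print(counts, bin_edges)
```

A plotting library could draw the same histogram directly; the bin counts are used here only to keep the sketch free of extra dependencies.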

The actual data preparation process has three steps:

* Step 1: Data Selection: Consider what data is available, what data is missing and what data can be removed.
* Step 2: Data Preprocessing: Organize your selected data by formatting, cleaning and sampling from it.
* Step 3: Data Transformation: Transform preprocessed data ready for machine learning by engineering features using scaling, attribute decomposition and attribute aggregation.
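The three steps above can be sketched on a tiny hypothetical table. The column names, the choice to drop rows with missing values, and the min-max scaling are assumptions made for illustration only; they are not the method used later in this notebook.

```python
import pandas as pd

# Hypothetical raw data: one row per country-year observation.
raw = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2006, 2007, 2006, 2007],
    "value": [10.0, None, 30.0, 50.0],
    "notes": ["x", "y", "z", "w"],  # assumed irrelevant to the problem
})

# Step 1, selection: keep only the attributes that matter.
selected = raw[["country", "year", "value"]]

# Step 2, preprocessing: clean by dropping rows with a missing value.
cleaned = selected.dropna(subset=["value"]).reset_index(drop=True)

# Step 3, transformation: min-max scale the value to [0, 1],
# then aggregate the scaled value per country.
vmin, vmax = cleaned["value"].min(), cleaned["value"].max()
cleaned = cleaned.assign(value_scaled=(cleaned["value"] - vmin) / (vmax - vmin))
per_country_mean = cleaned.groupby("country")["value_scaled"].mean()

print(cleaned)
print(per_country_mean)
```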


In [1]:
import numpy as np
import pandas as pd

In [50]:
df_train = pd.read_csv('data/TrainingSet.csv')

In [51]:
df_train.head(2)

Unnamed: 0.1,Unnamed: 0,1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],...,2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],Country Name,Series Code,Series Name
0,0,,,,,,,,,,...,,,,,,,3.769214,Afghanistan,allsi.bi_q1,(%) Benefits held by 1st 20% population - All ...
1,1,,,,,,,,,,...,,,,,,,7.027746,Afghanistan,allsp.bi_q1,(%) Benefits held by 1st 20% population - All ...


In [52]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195402 entries, 0 to 195401
Data columns (total 40 columns):
Unnamed: 0       195402 non-null int64
1972 [YR1972]    64945 non-null float64
1973 [YR1973]    64443 non-null float64
1974 [YR1974]    64966 non-null float64
1975 [YR1975]    66973 non-null float64
1976 [YR1976]    67717 non-null float64
1977 [YR1977]    69735 non-null float64
1978 [YR1978]    69763 non-null float64
1979 [YR1979]    69906 non-null float64
1980 [YR1980]    75250 non-null float64
1981 [YR1981]    78034 non-null float64
1982 [YR1982]    79016 non-null float64
1983 [YR1983]    78982 non-null float64
1984 [YR1984]    79532 non-null float64
1985 [YR1985]    81017 non-null float64
1986 [YR1986]    81455 non-null float64
1987 [YR1987]    82752 non-null float64
1988 [YR1988]    83242 non-null float64
1989 [YR1989]    86331 non-null float64
1990 [YR1990]    106955 non-null float64
1991 [YR1991]    106991 non-null float64
1992 [YR1992]    112243 non-null float64
1993 [Y

### Percentage of missing data per year

In [53]:
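# Use len(df_train) rather than a hard-coded row count so the cell still works if the data changes
100.0 - (df_train.count() / len(df_train) * 100.0)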

Unnamed: 0        0.000000
1972 [YR1972]    66.763390
1973 [YR1973]    67.020297
1974 [YR1974]    66.752643
1975 [YR1975]    65.725530
1976 [YR1976]    65.344776
1977 [YR1977]    64.312034
1978 [YR1978]    64.297704
1979 [YR1979]    64.224522
1980 [YR1980]    61.489647
1981 [YR1981]    60.064892
1982 [YR1982]    59.562338
1983 [YR1983]    59.579738
1984 [YR1984]    59.298267
1985 [YR1985]    58.538295
1986 [YR1986]    58.314142
1987 [YR1987]    57.650382
1988 [YR1988]    57.399617
1989 [YR1989]    55.818774
1990 [YR1990]    45.264122
1991 [YR1991]    45.245699
1992 [YR1992]    42.557906
1993 [YR1993]    41.375728
1994 [YR1994]    40.214020
1995 [YR1995]    36.301573
1996 [YR1996]    36.349679
1997 [YR1997]    35.678243
1998 [YR1998]    35.546207
1999 [YR1999]    33.020133
2000 [YR2000]    28.072896
2001 [YR2001]    29.999181
2002 [YR2002]    28.191625
2003 [YR2003]    28.783226
2004 [YR2004]    27.135342
2005 [YR2005]    17.327356
2006 [YR2006]    18.686605
2007 [YR2007]    17.300744
C
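An equivalent way to get the same percentages is to average the boolean mask returned by `isnull()`. The sketch below checks on a small hypothetical frame (made-up values, real column-label style) that both formulas agree.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the same column-label style as the training set.
toy = pd.DataFrame({
    "1972 [YR1972]": [1.0, np.nan, np.nan, np.nan],
    "2007 [YR2007]": [1.0, 2.0, 3.0, np.nan],
    "Country Name": ["A", "A", "B", "B"],
})

# isnull() gives a boolean frame; its column mean is the missing fraction.
missing_pct = toy.isnull().mean() * 100.0

# Count-based version: non-null count divided by the number of rows.
count_based = 100.0 - (toy.count() / len(toy) * 100.0)

print(missing_pct)
```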

### Entries per country

In [54]:
pd.DataFrame(df_train['Country Name'].value_counts())

Unnamed: 0,Country Name
Bolivia,1255
Bangladesh,1242
India,1240
Sri Lanka,1239
Kyrgyz Republic,1236
Philippines,1236
Ghana,1233
Morocco,1233
Kenya,1232
Mongolia,1230


In [56]:
# Drop the identifier columns by label; what remains are the year columns
year_columns = df_train.drop(['Unnamed: 0', 'Country Name', 'Series Code', 'Series Name'], axis=1).columns.values

In [57]:
year_columns

array(['1972 [YR1972]', '1973 [YR1973]', '1974 [YR1974]', '1975 [YR1975]',
       '1976 [YR1976]', '1977 [YR1977]', '1978 [YR1978]', '1979 [YR1979]',
       '1980 [YR1980]', '1981 [YR1981]', '1982 [YR1982]', '1983 [YR1983]',
       '1984 [YR1984]', '1985 [YR1985]', '1986 [YR1986]', '1987 [YR1987]',
       '1988 [YR1988]', '1989 [YR1989]', '1990 [YR1990]', '1991 [YR1991]',
       '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]', '1995 [YR1995]',
       '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]', '1999 [YR1999]',
       '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
       '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]'],
      dtype=object)
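The year columns could also be picked out by their label pattern instead of by dropping the identifier columns. This is a sketch on an empty hypothetical frame with the same label style; the regular expression and the integer conversion are my own assumptions about how the labels are formed.

```python
import pandas as pd

# Hypothetical empty frame carrying only the column labels of interest.
toy = pd.DataFrame(columns=[
    "Unnamed: 0", "1972 [YR1972]", "1973 [YR1973]",
    "Country Name", "Series Code", "Series Name",
])

# Keep columns whose label looks like "1972 [YR1972]".
year_cols = toy.filter(regex=r"^\d{4} \[YR\d{4}\]$").columns.tolist()

# The first four characters of each label are the numeric year.
years = [int(label[:4]) for label in year_cols]

print(year_cols)
print(years)
```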

In [59]:
df_long = df_train.melt(id_vars=['Unnamed: 0', 'Country Name', 'Series Code', 'Series Name'])

In [60]:
df_long.count()

Unnamed: 0      7034472
Country Name    7034472
Series Code     7034472
Series Name     7034472
variable        7034472
value           3714187
dtype: int64

In [61]:
# Keep only rows that actually have a value for that year
df_long = df_long[df_long['value'].notnull()]

In [62]:
df_long.count()

Unnamed: 0      3714187
Country Name    3714187
Series Code     3714187
Series Name     3714187
variable        3714187
value           3714187
dtype: int64
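The row counts above follow from the reshape: melting 195402 rows across 36 year columns gives 195402 × 36 = 7034472 rows, of which 3714187 have a value. A minimal sketch of the same wide-to-long step on hypothetical toy data (the `var_name` and `value_name` arguments are optional naming choices, not used in the cell above):

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: one row per series, one column per year.
wide = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Country Name": ["A", "B"],
    "1972 [YR1972]": [np.nan, 2.0],
    "1973 [YR1973]": [3.0, np.nan],
})

# Melt turns each (row, year column) pair into its own row.
long = wide.melt(
    id_vars=["Unnamed: 0", "Country Name"],
    var_name="year_label",
    value_name="value",
)
rows_after_melt = len(long)  # 2 rows x 2 year columns

# Drop the pairs that had no observation.
long = long.dropna(subset=["value"])

print(long)
```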

In [64]:
df_long.head(10)

Unnamed: 0.1,Unnamed: 0,Country Name,Series Code,Series Name,variable,value
19,29,Afghanistan,NY.ADJ.DCO2.GN.ZS,Adjusted savings: carbon dioxide damage (% of ...,1972 [YR1972],0.1383995
20,30,Afghanistan,NY.ADJ.DCO2.CD,Adjusted savings: carbon dioxide damage (curre...,1972 [YR1972],2257446.0
23,33,Afghanistan,NY.ADJ.AEDU.GN.ZS,Adjusted savings: education expenditure (% of ...,1972 [YR1972],1.02965
24,34,Afghanistan,NY.ADJ.AEDU.CD,Adjusted savings: education expenditure (curre...,1972 [YR1972],16794710.0
25,35,Afghanistan,NY.ADJ.DNGY.GN.ZS,Adjusted savings: energy depletion (% of GNI),1972 [YR1972],1.369047
26,36,Afghanistan,NY.ADJ.DNGY.CD,Adjusted savings: energy depletion (current US$),1972 [YR1972],22330650.0
27,38,Afghanistan,NY.ADJ.DMIN.GN.ZS,Adjusted savings: mineral depletion (% of GNI),1972 [YR1972],0.0
28,39,Afghanistan,NY.ADJ.DMIN.CD,Adjusted savings: mineral depletion (current US$),1972 [YR1972],0.0
29,40,Afghanistan,NY.ADJ.DRES.GN.ZS,Adjusted savings: natural resources depletion ...,1972 [YR1972],3.596662
30,41,Afghanistan,NY.ADJ.DFOR.GN.ZS,Adjusted savings: net forest depletion (% of GNI),1972 [YR1972],2.227615


In [65]:
df_submissions = pd.read_csv('data/SubmissionRows.csv')

In [72]:
df_submissions.head()

Unnamed: 0.1,Unnamed: 0,2008 [YR2008],2012 [YR2012]
0,559,,
1,618,,
2,753,,
3,1030,,
4,1896,,


In [68]:
df_submissions.count()

Unnamed: 0       737
2008 [YR2008]      0
2012 [YR2012]      0
dtype: int64

In [73]:
df_merged = pd.merge(df_long, df_submissions, on='Unnamed: 0')

In [74]:
df_merged.count()

Unnamed: 0       15563
Country Name     15563
Series Code      15563
Series Name      15563
variable         15563
value            15563
2008 [YR2008]        0
2012 [YR2012]        0
dtype: int64
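The merged frame has more rows than the 737 submission rows because `pd.merge` defaults to an inner join and each submission id matches one long-format row per year that has a value. The sketch below reproduces that behaviour on hypothetical toy data; the `validate="many_to_one"` check is an optional addition that assumes each id appears only once in the submissions file.

```python
import pandas as pd

# Hypothetical long-format data: id 559 has two yearly values, 700 is not a submission row.
long = pd.DataFrame({
    "Unnamed: 0": [559, 559, 618, 700],
    "variable": ["2006 [YR2006]", "2007 [YR2007]", "2007 [YR2007]", "2007 [YR2007]"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Hypothetical submission rows with empty prediction columns.
submissions = pd.DataFrame({
    "Unnamed: 0": [559, 618],
    "2008 [YR2008]": [float("nan"), float("nan")],
    "2012 [YR2012]": [float("nan"), float("nan")],
})

# Inner join keeps only ids present in both frames; validate raises
# if an id is duplicated on the submissions side.
merged = pd.merge(long, submissions, on="Unnamed: 0", how="inner", validate="many_to_one")

print(merged)
```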