**SYPA: Fundamental Analysis of Foreign Direct Investment** <br>
*3A_EDA_Python* <br>
Harvard SYPA <br>
User: Jake Schneider <br>
Date Created: February 8, 2020 <br>
Date Updated: March 6, 2020

____

**Note: If using fancyimpute, this notebook needs to be run in the tfcs109a environment.**

**Run R and Python in the same notebook** <br>
Docs: https://stackoverflow.com/questions/39008069/r-and-python-in-one-jupyter-notebook

----

In [1]:
## enables the %%R magic, not necessary if you've already done this
#%load_ext rpy2.ipython

----

**Load Packages**

In [2]:
#Import libraries
import sys
import pandas as pd
from datetime import date, datetime, time, timedelta
import json
import requests
import numpy as np
import math

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style='ticks', context='talk')

from matplotlib.offsetbox import AnchoredText
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.impute import KNNImputer
import statsmodels.api as sm
import statsmodels.imputation as st
import statsmodels
#from fancyimpute import IterativeImputer
#import fancyimpute

import warnings
import itertools

import missingno as msno

from matplotlib.backends.backend_pdf import PdfPages
from PIL import Image, ImageDraw, ImageFont

In [3]:
# Create function 'jprint'

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

----

**Load Data**

In [4]:
analysis_df = pd.read_csv('../../2_Inputs/Final/analysis_df.csv')
analysis_df = analysis_df.drop(["Unnamed: 0"], axis = 1)
analysis_df.head()

Unnamed: 0,country,date,code,iso2Code,region,adminregion,incomeLevel,lendingType,capitalCity,longitude,...,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y,year
0,Afghanistan,1960.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,32.446,7.45,2.798308,,,1960.0
1,Afghanistan,1961.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,32.962,7.45,2.808131,,,1961.0
2,Afghanistan,1962.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,-20000.0,,33.471,7.45,2.804113,,,1962.0
3,Afghanistan,1963.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,33.971,7.45,2.786171,,,1963.0
4,Afghanistan,1964.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,34.463,7.45,2.754223,,,1964.0


----

**Summary Statistics**

In [5]:
## Give Summary Statistics for all variables
#
#analysis_df.describe()

In [6]:
# Count the number of unique countries

np.count_nonzero(np.unique(analysis_df["country"]))

218

In [7]:
# Count the number of unique years

np.count_nonzero(np.unique(analysis_df["date"]))

61

In [8]:
# Count the number of variables

np.count_nonzero(analysis_df.columns)

2324
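
The three counts above can also be read directly from pandas. A minimal sketch on a made-up toy frame (`toy` and its values are illustrative only, not the project data):

```python
import pandas as pd

# Hypothetical toy frame standing in for analysis_df
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Albania"],
    "date": [1960.0, 1961.0, 1960.0],
    "value": [1.0, 2.0, 3.0],
})

n_countries = toy["country"].nunique()  # distinct countries (NaN is not counted)
n_years = toy["date"].nunique()         # distinct years (NaN is not counted)
n_rows, n_cols = toy.shape              # number of rows and of variables

print(n_countries, n_years, n_cols)
```

`nunique()` skips missing values by default, which `np.unique` does not, so the two approaches can differ by one when a column contains NaN.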

----

**Data Cleaning**

*Remove Variables with All NAs*

In [9]:
# Check Dimensions

analysis_df.shape

(13042, 2324)

In [10]:
# Drop variables with all NAs (the shape is unchanged if none exist)

analysis_df = analysis_df.dropna(axis=1, how='all')
analysis_df.shape

(13042, 2324)

In [11]:
# Drop variables with all zeroes (the shape is unchanged if none exist)

analysis_df = analysis_df.loc[:, (analysis_df != 0).any(axis=0)]
analysis_df.shape

(13042, 2324)
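
A small sketch of how the two filters above behave, using a made-up toy frame (all column names here are hypothetical). One detail worth knowing: `NaN != 0` evaluates to `True`, so a column holding only zeroes and NaNs passes the zero filter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are for illustration only
toy = pd.DataFrame({
    "all_na": [np.nan, np.nan, np.nan],
    "all_zero": [0.0, 0.0, 0.0],
    "zero_or_na": [0.0, np.nan, 0.0],
    "useful": [1.0, np.nan, 0.0],
})

# Drop columns in which every value is missing
step1 = toy.dropna(axis=1, how="all")

# Keep columns with at least one entry that is not equal to zero.
# NaN != 0 is True, so "zero_or_na" survives this filter.
step2 = step1.loc[:, (step1 != 0).any(axis=0)]

# Stricter variant: treat NaN as zero before testing
step3 = step1.loc[:, (step1.fillna(0) != 0).any(axis=0)]

print(list(step2.columns), list(step3.columns))
```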

----

**Visualize Missing Values** <br>
Docs R: https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html <br>
Docs Python: https://dev.to/tomoyukiaota/visualizing-the-patterns-of-missing-value-occurrence-with-python-46dj

In [12]:
# List the variables that contain at least one missing value

missingdata_df = analysis_df.columns[analysis_df.isnull().any()].tolist()
print(len(missingdata_df))

2323


In [13]:
## Missingno: Plot
## Can only comfortably accommodate about 50 variables
#
#missing_plot = msno.matrix(analysis_df) 
#fig = missing_plot.get_figure()
#fig.savefig('../../3_Outputs/Missing Data Visualizations/Missing Plot - All.jpg')
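
Since the matrix plot only handles about 50 variables comfortably, one option is to rank the variables by their share of missing values and plot a subset. A minimal sketch on a made-up toy frame (`toy`, `k`, and the column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for analysis_df
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Share of missing values per column, most complete first
missing_share = toy.isnull().mean().sort_values()

# Keep the k most complete columns (k=2 here; about 50 for a missingno matrix)
k = 2
most_complete = missing_share.index[:k].tolist()
subset = toy[most_complete]

# msno.matrix(subset) could then be called on this narrower frame
print(missing_share)
```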

In [14]:
## Missingno: Bar plot
#
#msno.bar(analysis_df) 

In [15]:
# Missingno: Dendrogram

#msno.dendrogram(analysis_df)

----

**Data Imputation** <br>
Docs: https://datascienceplus.com/knnimputer-for-missing-value-imputation-in-python-using-scikit-learn/

*Prerequisites*

In [16]:
## Drop Years Before 2000 to avoid massive missingness
#
#analysis_df = analysis_df[analysis_df['date'] >= 2000]
#print("Analysis Dimensions:", analysis_df.shape)
#print(analysis_df['date'].describe())

In [17]:
# Drop Observations without FDI Data

analysis_df = analysis_df[analysis_df['Foreign direct investment, net inflows (% of GDP)'].notna()]
analysis_df2 = analysis_df.loc[:, 'country':'capitalCity']
print(analysis_df.shape)
print(analysis_df2.shape)
analysis_df.head()

(7383, 2324)
(7383, 9)


Unnamed: 0,country,date,code,iso2Code,region,adminregion,incomeLevel,lendingType,capitalCity,longitude,...,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y,year
10,Afghanistan,1970.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,37.409,7.45,2.631613,,,1970.0
11,Afghanistan,1971.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,37.93,7.45,2.635235,,,1971.0
12,Afghanistan,1972.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,-20000.0,,38.461,7.45,2.627456,,,1972.0
13,Afghanistan,1973.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,39.003,7.45,2.609505,,,1973.0
16,Afghanistan,1976.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,69.1761,...,,,,,40.715,7.45,2.558353,,,1976.0


In [18]:
# Describe FDI Data

analysis_df['Foreign direct investment, net inflows (% of GDP)'].describe()

count    7383.000000
mean        6.037451
std        49.330609
min       -58.322880
25%         0.421949
50%         1.593219
75%         4.183845
max      1846.596366
Name: Foreign direct investment, net inflows (% of GDP), dtype: float64

In [84]:
# Create column_names

column_names1 = analysis_df.loc[:,'longitude':'Electricity production from coal sources (% of total)'].columns
#print(column_names1[0:5])
column_names2 = analysis_df.loc[:,'Electricity production from natural gas sources (% of total)':'Dealing.with.construction.permits..Liability.and.insurance.regimes.index..0.2...DB16.20.methodology.'].columns
column_names3 = analysis_df.loc[:, 'Dealing.with.construction.permits..Building.quality.control.index..0.15...DB16.20.methodology....Score':'Population.density..people.per.sq..km.of.land.area.'].columns
column_names4 = analysis_df.loc[:, 'Terrestrial.and.marine.protected.areas....of.total.territorial.area.':].columns

In [85]:
column_names = list(column_names1) + list(column_names2) + list(column_names3) + list(column_names4)
column_names[2308:2312]

['Population.ages.65.and.above....of.total.population.',
 'Unmet.need.for.contraception....of.married.women.ages.15.49.',
 'Voice.and.Accountability..Estimate.y',
 'year']

In [86]:
print(len(column_names))

2312
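
The four slices above skip a few columns by hand. If the skipped columns are those left entirely missing after the FDI filter (an assumption, not confirmed in this notebook; `KNNImputer` discards all-missing columns when it is fitted), the same list can be derived programmatically. A sketch on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "empty_indicator" plays the role of an all-NaN variable
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "longitude": [1.0, 2.0, 3.0],
    "empty_indicator": [np.nan, np.nan, np.nan],
    "gdp": [10.0, np.nan, 30.0],
    "year": [2000.0, 2001.0, 2002.0],
})

# Numeric block from 'longitude' onward, minus columns with no observed value
numeric_block = toy.loc[:, "longitude":].dropna(axis=1, how="all")

# These names line up with the columns the imputer would return
column_names = numeric_block.columns.tolist()
print(column_names)
```

Passing `numeric_block` to the imputer instead of the raw slice keeps the names and the array columns in step without hard-coded boundaries.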


*Linear Interpolation*

In [59]:
# Linear interpolation within groups that share a longitude (used here as a stand-in for country)

analysis_df_interpolation = analysis_df.loc[:,'longitude':].groupby('longitude').apply(lambda group: group.interpolate(method='linear', axis = 0))

In [60]:
# View Interpolated Df

analysis_df_interpolation.head()

Unnamed: 0,longitude,latitude,"2005 PPP conversion factor, GDP (LCU per international $)","2005 PPP conversion factor, private consumption (LCU per international $)",Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y,year
10,69.1761,34.5228,,,,,,,,,...,,,,,37.409,7.45,2.631613,,,1970.0
11,69.1761,34.5228,,,,,,,,,...,,,,,37.93,7.45,2.635235,,,1971.0
12,69.1761,34.5228,,,,,,,,,...,,,-20000.0,,38.461,7.45,2.627456,,,1972.0
13,69.1761,34.5228,,,,,,,,,...,,,-397986.333333,,39.003,7.45,2.609505,,,1973.0
16,69.1761,34.5228,,,,,,,,,...,,,-775972.666667,,40.715,7.45,2.558353,,,1976.0
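
Note that `method='linear'` treats consecutive rows as equally spaced, so a jump from 1973 to 1976 counts as a single step, and grouping on longitude only approximates grouping by country. A hedged alternative sketch that groups on a country column and interpolates against the year (the toy frame, `value`, and `value_interp` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel with a gap in the years of country A
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2004, 2005, 2000, 2001, 2002],
    "value": [1.0, np.nan, np.nan, 6.0, 10.0, np.nan, 30.0],
})

parts = []
for _, group in toy.groupby("country"):
    # Interpolate against the year values rather than row position;
    # limit_area="inside" leaves leading and trailing gaps untouched
    series = group.set_index("year")["value"].interpolate(
        method="index", limit_area="inside"
    )
    parts.append(pd.Series(series.to_numpy(), index=group.index))

# Assignment aligns on the original row index
toy["value_interp"] = pd.concat(parts)
print(toy)
```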


*KNN Imputer*

In [21]:
# Initialize Imputer

imputer = KNNImputer(n_neighbors=5) #round(np.sqrt(len(analysis_df.columns)))-1
print(imputer)

KNNImputer(add_indicator=False, copy=True, metric='nan_euclidean',
           missing_values=nan, n_neighbors=5, weights='uniform')


In [22]:
# Impute the values

data = imputer.fit_transform(analysis_df.loc[:,'longitude':])

In [87]:
# View Data

data[1:5]

array([[ 6.91761000e+01,  3.45228000e+01,  2.34189781e+01, ...,
         2.96800000e+01, -9.56633200e-01,  1.97100000e+03],
       [ 6.91761000e+01,  3.45228000e+01,  2.20349583e+01, ...,
         2.67000000e+01, -1.04392014e+00,  1.97200000e+03],
       [ 6.91761000e+01,  3.45228000e+01,  2.34189781e+01, ...,
         2.70000000e+01, -7.92019940e-01,  1.97300000e+03],
       [ 6.91761000e+01,  3.45228000e+01,  1.86438781e+01, ...,
         2.80600000e+01, -5.55132540e-01,  1.97600000e+03]])

In [88]:
# View Dimensions

data.shape

(7383, 2312)

In [89]:
# View One data Row

data[0, :]

array([ 6.91761000e+01,  3.45228000e+01,  2.34189781e+01, ...,
        2.94800000e+01, -9.79530660e-01,  1.97000000e+03])

In [91]:
# Create New Dataframe with Imputed Data

imputed_df = pd.DataFrame(data = data,
                          columns = column_names)

In [92]:
# See dimensions of new dataframe

imputed_df.shape

(7383, 2312)
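
Because `fit_transform` returns a bare NumPy array, the row labels of the input are lost unless they are passed back in. A minimal sketch on a made-up toy frame (the values and the index labels are assumptions for illustration) that keeps the original index:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with a non-contiguous index, as after row filtering
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, np.nan, 4.0],
        "y": [10.0, np.nan, 30.0, 40.0],
    },
    index=[10, 11, 12, 16],
)

imputer = KNNImputer(n_neighbors=2)
values = imputer.fit_transform(toy)

# Reattach both the column names and the original row labels
imputed = pd.DataFrame(values, columns=toy.columns, index=toy.index)
print(imputed)
```

With the index preserved, the imputed frame can be joined back to the identifier columns on the index, without resetting either side.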

In [93]:
# View imputed_df

imputed_df.head()

Unnamed: 0,longitude,latitude,"2005 PPP conversion factor, GDP (LCU per international $)","2005 PPP conversion factor, private consumption (LCU per international $)",Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),"Access to electricity, rural (% of rural population)","Access to electricity, urban (% of urban population)",Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)",...,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y,year
0,69.1761,34.5228,23.418978,29.363218,4.84,29.422765,10.407148,59.216434,27.030349,25.446126,...,78.0033,8.3546,-23616.0,43.52,37.409,7.45,2.631613,29.48,-0.979531,1970.0
1,69.1761,34.5228,23.418978,29.363218,15.852,31.141782,12.720031,64.387019,27.698254,25.449469,...,78.201934,8.5824,-22903.2,43.88,37.93,7.45,2.635235,29.68,-0.956633,1971.0
2,69.1761,34.5228,22.034958,27.729061,4.842,37.29855,13.64541,73.960427,27.030349,25.446126,...,72.199611,8.7142,-20000.0,26.32,38.461,7.45,2.627456,26.7,-1.04392,1972.0
3,69.1761,34.5228,23.418978,29.363218,4.84,30.221782,9.017772,62.487019,27.030349,25.446126,...,69.824106,6.7992,-14881.2,35.16,39.003,7.45,2.609505,27.0,-0.79202,1973.0
4,69.1761,34.5228,18.643878,21.166994,2.344,30.962882,20.820039,52.362763,31.45841,29.303403,...,77.27871,4.2436,-261777.8,38.1,40.715,7.45,2.558353,28.06,-0.555133,1976.0


*Use MICE Imputer* <br>
Docs: https://www.statsmodels.org/stable/generated/statsmodels.imputation.mice.MICEData.html <br>
Forum Post: https://stackoverflow.com/questions/45239256/data-imputation-with-fancyimpute-and-pandas <br>
Slides (DataCamp, hosted on Amazon S3): https://s3.amazonaws.com/assets.datacamp.com/production/course_17404/slides/chapter4.pdf

In [94]:
## Run MICE
#
##analysis_df_matrix = analysis_df.loc[:,'longitude':] #.as_matrix()
##mice_df=statsmodels.imputation.mice.MICE().complete(analysis_df_matrix)
#
#mice_df=pd.DataFrame(data=mice.complete(analysis_df.loc[:,'longitude':]), columns=column_names, index=analysis_df.index)

In [95]:
#analysis_df_MICE = fancyimpute.MICE().complete(analysis_df.loc[:,'longitude':])

In [96]:
## Run MICE
#
#MICE_imputer = IterativeImputer()
#analysis_df_MICE = analysis_df.loc[:,'longitude':].copy(deep=True)
#analysis_df_MICE.iloc[:, :] = MICE_imputer.fit_transform(analysis_df_MICE)

In [97]:
#imp = statsmodels.imputation.mice.MICEData(analysis_df.loc[:,'longitude':])

In [98]:
## View Data
#
#j = 0
#for data in imp:
#    print(data)
#    j +=1
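
For reference, scikit-learn ships its own `IterativeImputer`, which is inspired by MICE but returns a single imputed dataset rather than multiple ones. This is a sketch of that alternative on a made-up toy frame, not the fancyimpute or statsmodels calls attempted above:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, np.nan, 5.0],
    "y": [2.0, np.nan, 6.0, 8.0, 10.0],
})

# Each variable with gaps is modeled from the other variables in turn
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
toy_mice = pd.DataFrame(
    mice_imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(toy_mice)
```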

In [99]:
# View analysis_df2

#analysis_df2 = analysis_df.loc[:, 'country':'capitalCity']
analysis_df2.head()

Unnamed: 0,country,date,code,iso2Code,region,adminregion,incomeLevel,lendingType,capitalCity
10,Afghanistan,1970.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul
11,Afghanistan,1971.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul
12,Afghanistan,1972.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul
13,Afghanistan,1973.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul
16,Afghanistan,1976.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul


*Concatenate analysis_df2 and imputed_df*

In [108]:
# Concat dataframes together

final_df = pd.concat([analysis_df2.reset_index(), imputed_df.reset_index()], axis = 1) #.reindex(analysis_df2.index)
#final_df = pd.merge(analysis_df2, imputed_df, on = ['longitude','latitude'])

In [109]:
# View the dimensions of final_df

final_df.shape

(7383, 2323)
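
Calling `reset_index()` on both frames adds an `index` column from each of them, so the result carries two columns named `index`. A minimal sketch on made-up toy frames (`meta` and `imputed` are hypothetical stand-ins) showing the effect and one way to avoid it:

```python
import pandas as pd

# Hypothetical stand-ins for analysis_df2 and imputed_df
meta = pd.DataFrame({"country": ["A", "A", "B"]}, index=[10, 11, 16])
imputed = pd.DataFrame({"gdp": [1.0, 2.0, 3.0]})  # default 0..2 index

# reset_index() on both frames yields two "index" columns
both = pd.concat([meta.reset_index(), imputed.reset_index()], axis=1)

# drop=True on the second frame discards its positional index instead
clean = pd.concat(
    [meta.reset_index(), imputed.reset_index(drop=True)], axis=1
)

print(list(both.columns))
print(list(clean.columns))
```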

In [110]:
# View final_df
final_df.head(10)

Unnamed: 0,index,country,date,code,iso2Code,region,adminregion,incomeLevel,lendingType,capitalCity,...,Ratio.of.female.to.male.labor.force.participation.rate......modeled.ILO.estimate.,Unemployment..total....of.total.labor.force...modeled.ILO.estimate.,Net.migration,Prevalence.of.undernourishment....of.population.,Life.expectancy.at.birth..total..years.,Fertility.rate..total..births.per.woman.,Population.ages.65.and.above....of.total.population.,Unmet.need.for.contraception....of.married.women.ages.15.49.,Voice.and.Accountability..Estimate.y,year
0,10,Afghanistan,1970.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,78.0033,8.3546,-23616.0,43.52,37.409,7.45,2.631613,29.48,-0.979531,1970.0
1,11,Afghanistan,1971.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,78.201934,8.5824,-22903.2,43.88,37.93,7.45,2.635235,29.68,-0.956633,1971.0
2,12,Afghanistan,1972.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,72.199611,8.7142,-20000.0,26.32,38.461,7.45,2.627456,26.7,-1.04392,1972.0
3,13,Afghanistan,1973.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,69.824106,6.7992,-14881.2,35.16,39.003,7.45,2.609505,27.0,-0.79202,1973.0
4,16,Afghanistan,1976.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,77.27871,4.2436,-261777.8,38.1,40.715,7.45,2.558353,28.06,-0.555133,1976.0
5,17,Afghanistan,1977.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,62.988349,3.705,-1153959.0,26.1,41.32,7.449,2.549322,30.7,-0.718322,1977.0
6,19,Afghanistan,1979.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,68.932985,3.5344,145307.4,23.2,42.585,7.449,2.485762,30.24,-0.645187,1979.0
7,20,Afghanistan,1980.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,70.469737,4.136,145307.4,21.06,43.244,7.449,2.434753,29.36,-0.671656,1980.0
8,21,Afghanistan,1981.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,63.260997,7.0952,149475.2,27.02,43.923,7.449,2.452756,34.42,-0.86607,1981.0
9,42,Afghanistan,2002.0,AFG,AF,South Asia,South Asia,Low income,IDA,Kabul,...,51.791618,3.55,744193.0,43.7,56.784,7.272,2.255848,20.34,-1.433421,2002.0


In [111]:
# Check whether any missing values remain
# Some identifier columns (e.g. iso2Code) still contain missing values

final_df.isnull().any()

index                                                           False
country                                                         False
date                                                            False
code                                                            False
iso2Code                                                         True
                                                                ...  
Fertility.rate..total..births.per.woman.                        False
Population.ages.65.and.above....of.total.population.            False
Unmet.need.for.contraception....of.married.women.ages.15.49.    False
Voice.and.Accountability..Estimate.y                            False
year                                                            False
Length: 2323, dtype: bool

In [112]:
#Find nan_rows

nan_rows = []

for var in final_df.loc[:, 'country':'capitalCity'].columns:
    print(var)
    nan_row = final_df[final_df[var].isnull()]
    nan_rows.append(nan_row)

country
date
code
iso2Code
region
adminregion
incomeLevel
lendingType
capitalCity
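
The loop above collects one frame per identifier column. A more compact summary, sketched on a made-up toy frame (`toy` and `meta_cols` are assumptions for illustration), counts the gaps per column and lists the affected countries:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two identifier columns
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "iso2Code": ["AA", np.nan, "CC"],
    "capitalCity": [np.nan, np.nan, "Cc"],
})

meta_cols = ["country", "iso2Code", "capitalCity"]

# Number of missing entries in each identifier column
missing_counts = toy[meta_cols].isnull().sum()

# Rows with at least one missing identifier, and the countries they belong to
rows_with_gaps = toy[toy[meta_cols].isnull().any(axis=1)]
countries_with_gaps = rows_with_gaps["country"].unique().tolist()

print(missing_counts)
print(countries_with_gaps)
```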


In [113]:
nan_rows

[Empty DataFrame
 Columns: [index, country, date, code, iso2Code, region, adminregion, incomeLevel, lendingType, capitalCity, index, longitude, latitude, 2005 PPP conversion factor, GDP (LCU per international $), 2005 PPP conversion factor, private consumption (LCU per international $), Access to clean fuels and technologies for cooking (% of population), Access to electricity (% of population), Access to electricity, rural (% of rural population), Access to electricity, urban (% of urban population), Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+), Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+), Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+), Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+), Account owner

In [114]:
# Drop if Taiwan

final_df.drop(final_df[final_df['country'] == 'Taiwan, China'].index, inplace = True) 

In [115]:
# Check Dimensions

final_df.shape

(7383, 2323)
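
The shape is the same before and after the drop, which suggests that no row carried the label 'Taiwan, China' at this point. A small sketch on a made-up toy frame that counts the matches before filtering, so a silent no-op is easy to spot:

```python
import pandas as pd

# Hypothetical toy frame that does not contain the label being dropped
toy = pd.DataFrame({"country": ["Afghanistan", "Albania"], "fdi": [1.0, 2.0]})

mask = toy["country"] == "Taiwan, China"
n_matches = int(mask.sum())  # 0 here, so filtering changes nothing

filtered = toy[~mask]
print(n_matches, filtered.shape)
```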

In [116]:
# Check the identifier columns for missing values
# The remaining gaps (iso2Code, adminregion, capitalCity) are acceptable

final_df.loc[:, 'country':'capitalCity'].isnull().any()

country        False
date           False
code           False
iso2Code        True
region         False
adminregion     True
incomeLevel    False
lendingType    False
capitalCity     True
dtype: bool

----

**Output CSV**

In [117]:
final_df.to_csv('../../2_Inputs/Final/final_df_knn.csv')
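
By default `to_csv` writes the row index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back (the column dropped at the top of this notebook). A minimal sketch using an in-memory buffer and a made-up toy frame:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for final_df
toy = pd.DataFrame({"country": ["A", "B"], "fdi": [1.0, 2.0]})

# Default: the index is written and comes back as "Unnamed: 0"
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```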