# Econometrics with Python 

This tutorial aims to get first-time Python users 'up-and-running' in Python for Econometrics.  Before diving into the application of Python for economics research, we define the objectives for the session. 


### Objectives
    
- 1) Understand what coding is and how it can support economic research.  
- 2) See a demonstration of how Python can be used for economic analysis.
- 3) Learn some basic programming skills. 
- 4) Apply these skills to econometric analysis. 


### What is coding?  

Coding makes it possible for us to create computer software, apps and websites. Your internet browser, your operating system, the apps on your phone, Facebook, and this website – they’re all made with code.
 
Coding gives orders to machines to achieve goals.  In our case, we are providing the computer with data and directions on what statistical operations to perform with this data. 

#### What is Python?

Python is a programming language. Thus, python is a way for us to give orders to computers. 

#### Why use Python for Econometrics?

- Reason 1) It's free. Python is an open-source language which is free to download. In contrast, econometric programmes such as STATA or SAS cost hundreds or thousands of dollars depending on the size of an organisation. 
- Reason 2) It's fast. The packages used for econometrics in Python (Numpy, Pandas, Statsmodels, and others) are well engineered, and the heavy numerical work in Numpy runs as compiled code, so calculations are very efficient. 
- Reason 3) Greater capabilities. Because python is a general programming language, python can access and visualise data in more interesting and effective ways along with performing econometric analysis. 

# Motivation and Demonstration

#### 1) Get Data - World Bank Data 

#### 2) Visualize Data - [ODI Data Portal](https://conda.io/miniconda.html). 

#### 3) Analyse Data - Regression Analysis

#### How do I download Python?

- Here is a link to download MiniConda, an installer which sets up python properly: [Miniconda](https://conda.io/miniconda.html). 
- Along with MiniConda, it is necessary to install econometrics-specific packages through your computer's terminal. We will explore this in greater depth later in the course.  

#### Should I get Python 2 or 3? 

- In almost all cases, it is best to use Python 3. We will use Python 3 in this course.  
  

# The Jupyter Notebook 

#### If I'm coding in Python, why does the top of this page say 'Jupyter'? 

- Jupyter is a notebook for python. It allows you to work in python through an attractive display. It also makes data visualisation and exploration convenient.
- Also, it should be noted that Jupyter does not need the internet. Although it is displayed in a browser (Firefox, Chrome, etc.), the notebook runs on your own computer. Thus, Jupyter will not affect your 3G bill. 

#### Markdowns in the Jupyter Notebook

- This text section is a 'Markdown'.  

# Course Outline 

### Motivation 

The initial stages of learning python for econometrics can be tedious and frustrating.  Thus, we will begin the course with a few applications of python for econometrics that (I think) are rather cool.   

### Basics of the Language 

We will examine some of the basics of the language such as: 

##### 1) Variables and Types
- Learn how to assign variables. 
- Learn what are the types to which variables can refer (float, string, bool, int). 
    * Different Type = Different Behavior! 

##### 2) Python Lists
- Learn about python lists as a collection of values - []. 
- Some of the functionality and behavior of lists. 

##### 3) Functions, Methods and Packages 
- Functions as pieces of reusable code.  
- Packages as a way to get the functions you need. 
    * Downloading Packages - pip install ___ 
    

##### 4) Basics of Numpy 
- Learn about the Numpy package and its uses for 1D and 2D Numpy Arrays

### Intermediate Python Tools 

##### 5) Dictionaries
- Motivation and Basics of Dictionaries 
    * Dictionary  = {keys : values}
- Rules of dictionaries

##### 6) Basics of Pandas 
- Answer: What is a DataFrame?
- How to upload data from CSV, STATA, and Excel files. 
- Locate and select columns and rows. 
- Filtering Pandas DataFrames  

##### 7) Accessing Data with API-Specific Packages
- Access World Bank Data easily and efficiently through the wbdata package. 
- Remember, always look for a package before performing difficult work! 


### Visualisation - Holoviews and Bokeh 

##### 8) Holoviews and Bokeh 
Learn interactive visualisations for data explorations. 
- Histogram 
- Scatter - also with regression line. 
- Box-and-Whisker Plot 
- Area Chart

### Python for Applied Economics - Statsmodels and Pandas

##### 9) Summary Statistics in Statsmodels and Pandas 
- Describe() functions and other useful summary options. 

##### 10) OLS Regression (Single and Multiple) 
- Basic OLS Regression Analysis
- Generate and Export Tables to Latex 
- Heteroskedasticity-Robust SE Regressions 
- Use multiple regression analysis to measure total factor productivity of Ivorian firms. 

##### 11) Fixed or Random Effect Estimation 
- OLS with fixed/random effects to examine the effect of changes over time. 
- Explore the relationship between Remittances and Economic Growth in Sub-Saharan African Countries. 

##### 12) Two-Stage Least Squares in Python 
- Learn 2SLS through the famous example of Acemoglu and Robinson's institutions and future incomes paper.  
 

## Lesson I: The Basics of Python

This lesson introduces students to the basics of python. In particular, we will go over variables and value types, lists, functions, methods, packages and numpy.  

## Variables 

Variables are specific, case-sensitive names for values. 

For example, I can save my height as a variable. 

In [1]:
# generating a variable of 1.87 meters 
# As you notice, everything that follows the '#' sign turns blue. 
# This occurs because python does not read this code.  
# These are used as notes
AJL_height_m = 1.87

# I could also generate a variable with my weight (around 80 KG)
AJL_weight_kg = 80 

Python saves these values within the computer memory and we refer to the variable name to retrieve the value. We can print these values using the print() function. 

In [2]:
print(AJL_height_m)

1.87


In [3]:
print(AJL_weight_kg)

80


There are a few rules with regard to variable names. 
- 1) They cannot start with a number or include symbols such as '#' or '?'. Letters, digits, and underscores are allowed. 
- 2) They are case sensitive. 

In [4]:
4_variables = 340 

SyntaxError: invalid token (<ipython-input-4-2b60a566e72e>, line 1)

In [36]:
variable? = 'the end'

SyntaxError: invalid syntax (<ipython-input-36-d191aff0f4d6>, line 1)

In [37]:
Var = 23 
var = 35
print(Var)
print(var)

23
35
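A quick way to check whether a name is allowed is the string method isidentifier(), which is built into python. The sketch below applies it to the names tried above; the list `candidates` is made up for this illustration.

```python
# Candidate variable names (chosen for this illustration)
candidates = ['AJL_height_m', '4_variables', 'variable?', 'Var', 'var']

# str.isidentifier() returns True when the text is a valid variable name
for name in candidates:
    print(name, name.isidentifier())

# Var and var both pass, and they are two different names (case sensitive)
valid_names = [name for name in candidates if name.isidentifier()]
print(valid_names)
```

Only the names that start with a number or contain a symbol are rejected.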


## Types 

We will work with four basic types of python values: float, integer, string, and boolean.  It is important to remember that different types of python values have different operations and rules associated with them.  The examples below highlight some of these differences. 

##### 1) Float (float) 
    * Real numbers - examples: 2.20304, 4.5, 2.0, etc

In [38]:
height = 2.12 
# type() returns the type of value the variable refers to. 
type(height)

float

In [39]:
height * 2

4.24

In [None]:
a = (height*2)-10
print(a)
type(a)

##### 2) Integer (int)
    * Integers - examples: 1,2,3,4,5,500

In [None]:
length = 2 
type(length)

In [None]:
a = length*3
print(a)

##### 3) String 
    * String of letters or numbers in a phrase - examples: 'the', 'i', '23', 'etc'

In [None]:
string = 'Université Félix Houphouët-Boigny'
type(string)

So... What do you expect string times 3 to equal???

In [None]:
string * 3

In [None]:
string + string

##### 4) Boolean (bool)
    * True/False Value - examples: True, False 

In [None]:
cestvrai = True 
type(cestvrai)

Now, before running this line of code, what do you expect? True looks like a string, thus True True True seems intuitive.  But...


In [None]:
cestvrai*3

The answer is 3 because Booleans actually represent binary values of 0 and 1.

True = 1

False = 0 

Thus, when performing mathematical operations on booleans, they act as integers. 
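This behavior is handy for counting. As a small sketch (the list `exported` is hypothetical survey data, invented for this example), summing a list of booleans counts how many are True:

```python
# Hypothetical survey answers: did each firm export last year?
exported = [True, False, True, True, False]

# Each True counts as 1 and each False counts as 0
number_exporting = sum(exported)
share_exporting = number_exporting / len(exported)

print(number_exporting)
print(share_exporting)
```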

##### Combinations and Practice 

At this point, you are likely a bit lost.  It's ok; everyone is in the beginning.  The best way to understand the types and values is to explore on your own.  Try adding a string variable to an integer variable and see what happens (my guess is it won't work... but try it to find out.) 

At some point, it will become intuitive. Don't worry. 
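If you want to check your guess, here is one possible sketch of the experiment. The variable names are made up for this example; int() and str() are built-in functions that convert a value from one type to another.

```python
count_text = '23'   # a string that happens to contain digits
count = 5           # an integer

# Adding a string to an integer raises a TypeError
try:
    result = count_text + count
except TypeError as error:
    result = None
    print('This did not work:', error)

# Convert one side so that both values share a type
as_number = int(count_text) + count   # integer addition
as_text = count_text + str(count)     # string concatenation
print(as_number)
print(as_text)
```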

## Lists 

Python variables refer to one value.  However, in most cases we are interested in analysing many values. For example, we might want to examine the height of everyone in this classroom.  How would we store this information in a convenient form? 

We could hold this information in a python list. 

### LIST = [values]

In [5]:
# notice that lists are surrounded by square brackets
class_heights = [1.53, 1.89, 1.43, 1.56, 1.52]
type(class_heights)

list

Lists can contain values of different python types. 

In [6]:
random = ['the day', 5, 5.235, True]
type(random)

list

Just as python value types follow a particular set of rules, python lists also have a set of operations which are unique. For example, adding two lists combines the two in the order of the addition.

In [7]:
random + class_heights

['the day', 5, 5.235, True, 1.53, 1.89, 1.43, 1.56, 1.52]

However, subtraction of lists is not supported. 

In [8]:
random - class_heights

TypeError: unsupported operand type(s) for -: 'list' and 'list'

### Subsetting Lists

There are two important aspects of subsetting lists: 

- 1) The index begins at zero. 


In [15]:
list_a = ['the', '3', 4, 4.56]

# to call the first observation in the list, place its index within [] 
# notice that [0] returns the first observation 
print(list_a[0]) 

# list_a[1] returns the second observation 
print(list_a[1])

the
3


#### List Slicing 

To subset several observations, you define the beginning and end of the slice, separated by a colon. 

In [11]:
list_a[0:3]

['the', '3', 4]

Notice that the slice only returns three values: list_a[0], list_a[1], and list_a[2].

In [12]:
# list_a[3] is not included in the slice.   
list_a[3]

4.56
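Slices have a few shortcuts that are worth trying. The sketch below uses the same list_a; the names on the left of each = sign are made up for this example.

```python
list_a = ['the', '3', 4, 4.56]

first_two = list_a[:2]    # leaving out the start means 'from the beginning'
from_third = list_a[2:]   # leaving out the end means 'through the last value'
last_value = list_a[-1]   # negative positions count backwards from the end

print(first_two)
print(from_third)
print(last_value)
```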

### Changing Lists 

You can change the value of a list by subsetting the value and setting it equal to a new value.  Python automatically assumes you want to replace the value. 

In [16]:
print(list_a)
list_a[0] = 'cat'
print(list_a)

['the', '3', 4, 4.56]
['cat', '3', 4, 4.56]


## Functions

Functions are pieces of reusable code. 

We have already used functions in this course such as type() and print(). These are functions built into python. There are several other useful built-in functions such as max(), min(), round(), etc.   

In [23]:
numbers = [1,2,3,4,43.5967,6000]

# Here we use the max function to find the largest number in the list 
print(max(numbers))

6000


In [22]:
print(round(numbers[4]))

44


However, we can define our own functions when we find repetitiveness in our own code. For example, we may want to generate a function which adds two variables together. 

In [28]:
# We define a function with 'def name(arguments):'
def add_them_up(a, b): 
    c = a + b 
    return c 

In [30]:
the = 5 
end = 6

# notice that the variables we pass in do not need to be named a or b. 
# this is because the 'a' and 'b' within the function are 'local' to the function
# a and b take the values of the first and second arguments placed within the function call. 

add_them_up(the, end)

11
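Functions become more useful once they capture a calculation we repeat often. As a sketch, the function below computes a body mass index (kilograms divided by meters squared). The function name, argument names, and the optional `digits` argument are all choices made for this example.

```python
def body_mass_index(weight_kg, height_m, digits=1):
    """Return kilograms / (meters squared), rounded to `digits` decimals."""
    bmi = weight_kg / height_m ** 2
    return round(bmi, digits)

# Arguments can be passed by position...
print(body_mass_index(80, 1.87))

# ...or by name, which also lets us override the default for digits
print(body_mass_index(weight_kg=80, height_m=1.87, digits=3))
```

Giving `digits` a default value means the caller only has to supply it when the default is not what they want.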

### Packages and Numpy 

Packages are a set of functions created by someone else.  Packages allow you to access functions written by others to simplify the tasks within your project. 

One important package is Numpy, a package on which most econometrics functions are based. Numpy makes many calculations simple and efficient.  

To download a package, use the code - pip install numpy - in the terminal. 

In [33]:
# First we need to import the numpy package 
# To limit our typing, we import numpy 'as np' 
# Now from here, we only need to type np to refer to numpy
import numpy as np

Numpy arrays can perform operations which python lists cannot. Let's create two lists of numbers: height and weight

In [71]:
height = [1.5, 1.67, 1.78, 1.86, 1.45]
weight = [60, 84,65,70,89]

If we try to measure BMI from these lists, we get an error.

Equation: 
BMI = kilograms/(meters squared)

In [51]:
BMI = weight/(height**2)

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

However, if we turn these lists into numpy arrays, this operation works exactly how we would want it to perform. 

In [52]:
np_h = np.array(height)
np_w = np.array(weight)

In [53]:
BMI = np_w /(np_h**2)
print(BMI)

[ 26.66666667  30.11940191  20.51508648  20.23355301  42.33055886]
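To see what numpy is saving us, here is a sketch of the same calculation done with plain lists. It relies on zip(), a built-in function that pairs up the values of two lists; the names bmi_list and bmi_array are made up for this example.

```python
import numpy as np

height = [1.5, 1.67, 1.78, 1.86, 1.45]
weight = [60, 84, 65, 70, 89]

# With plain lists, each pair of values must be handled one at a time
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]

# With numpy arrays, the calculation is applied to every element at once
bmi_array = np.array(weight) / np.array(height) ** 2

# Both approaches give the same numbers
print(np.allclose(bmi_list, bmi_array))
```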


The arrays above represent one-dimensional numpy arrays.  However, numpy arrays can also be two-dimensional. 

In [60]:
array2d = np.array([height, weight])
print(array2d)

[[  1.5    1.67   1.78   1.86   1.45]
 [ 60.    84.    65.    70.    89.  ]]


How would you go about selecting 1.5, the value in the first row and first column?

In [66]:
array2d[0]

array([ 1.5 ,  1.67,  1.78,  1.86,  1.45])

To select within the row, a second bracket is necessary.

In [67]:
array2d[0][0]

1.5

What if you would like to select multiple observations? For example, we may want to access the first two columns of the 2D array. 

In [70]:
array2d[0:2, 0:2]

array([[  1.5 ,   1.67],
       [ 60.  ,  84.  ]])

Or the last two. Notice the negative signs on the values.  A start of -2 means 'begin two positions back from the end'.

In [69]:
array2d[-2:, -2:]

array([[  1.86,   1.45],
       [ 70.  ,  89.  ]])
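Numpy also accepts the row and column positions inside a single pair of brackets, separated by a comma. The sketch below rebuilds array2d and tries a few selections; the names on the left of each = sign are made up for this example.

```python
import numpy as np

height = [1.5, 1.67, 1.78, 1.86, 1.45]
weight = [60, 84, 65, 70, 89]
array2d = np.array([height, weight])

print(array2d.shape)            # (number of rows, number of columns)

first_value = array2d[0, 0]     # row 0, column 0: same as array2d[0][0]
all_weights = array2d[1, :]     # row 1, every column
first_person = array2d[:, 0]    # every row, column 0

print(first_value)
print(all_weights)
print(first_person)
```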

The best way to understand how to access parts of numpy arrays is to mess with it on your own.  Trial and error - the basis of early coding exploration. 

## Dictionaries 

As we explore python, we have been continually adding ways to efficiently store larger and larger amounts of data. In the beginning, one variable name referred to one value. We then learned how to store multiple types of data points in lists. In the last section, we transformed these lists into numpy arrays which can be used for easy and efficient calculations.

But, how do we store many lists or numpy arrays? 

We can store this information in a dictionary. 

In [72]:
#### Dictionary Format
#### important to notice that dictionaries are surrounded by {}

Dictionary = {'key' : ['value'], 
            'string or number' : ['list', 'or', 'numpy array'] , 
             'diff string or number': ['list', 'or', 'numpy array'] }

For example, we can place the height list and the weight array we generated before into a dictionary

In [84]:
bmi_data = {'height': height, 
           'weight': np_w}

In [85]:
print(bmi_data)

{'height': [1.5, 1.67, 1.78, 1.86, 1.45], 'weight': array([60, 84, 65, 70, 89])}


We can access a particular list or numpy array by referring to the key. 

For example to access the height list, we type bmi_data['height']

In [86]:
bmi_data['height']

[1.5, 1.67, 1.78, 1.86, 1.45]

Based on our previous work with numpy arrays, how would you access the first observation in the height list (1.5)?

In [87]:
bmi_data['height'][0]

1.5

What about the first three observations?

In [88]:
bmi_data['height'][0:3]

[1.5, 1.67, 1.78]
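Dictionaries can also be changed after they are created. The sketch below uses plain lists for both entries to keep it short, and the 'bmi' key and the replacement weights are invented for this example.

```python
height = [1.5, 1.67, 1.78, 1.86, 1.45]
weight = [60, 84, 65, 70, 89]
bmi_data = {'height': height, 'weight': weight}

# Assigning to a new key adds an entry to the dictionary
bmi_data['bmi'] = [w / h ** 2 for w, h in zip(weight, height)]

# Keys are unique: assigning to an existing key replaces its value
bmi_data['weight'] = [61, 84, 65, 70, 89]

print(list(bmi_data.keys()))
print('age' in bmi_data)    # check whether a key exists
```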

Dictionaries are great at holding several types of data such as lists and numpy arrays.  However, performing calculations with dictionaries can be challenging, especially because the keys can refer to different types of python objects.  For example, bmi_data includes a list and a numpy array.  

In the next section, we explore the Pandas package to change dictionaries into powerful and convenient DataFrames. 

## Pandas DataFrames 

Pandas is a Python library for data analysis. Its central object, the DataFrame, offers meaningful row and column labels, time series functionality, handling of missing data, and comparison operations (>, <, ==, etc.). 

We will first create a pandas DataFrame from the bmi_data dictionary before learning how to generate Pandas DataFrames from csv, excel, and stata files. 

In [89]:
# we first need to import the pandas package
# similarly to numpy we import it as pd to save time typing. 
import pandas as pd

In order to turn our dictionary into a Pandas DataFrame, we merely place the dictionary within the pd.DataFrame() function. 

In [106]:
df = pd.DataFrame(bmi_data)

In [107]:
df

|  | height | weight |
|---|---|---|
| 0 | 1.5 | 60 |
| 1 | 1.67 | 84 |
| 2 | 1.78 | 65 |
| 3 | 1.86 | 70 |
| 4 | 1.45 | 89 |


GREAT! Now we have a pandas dataframe of height and weight which is easy to interpret. 
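One payoff of the DataFrame is that calculations across columns become one-liners. As a sketch, the code below rebuilds the same data from plain lists and stores the BMI in a new column; the column name 'bmi' is a choice made for this example.

```python
import pandas as pd

bmi_data = {'height': [1.5, 1.67, 1.78, 1.86, 1.45],
            'weight': [60, 84, 65, 70, 89]}
df = pd.DataFrame(bmi_data)

# Column arithmetic works row by row, just like numpy arrays
df['bmi'] = df['weight'] / df['height'] ** 2

# round(2) only tidies the display; df itself keeps full precision
print(df.round(2))
```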

### Basics of Pandas DataFrames 

Now we will go over some of the basics of Pandas Dataframes such as adding columns, changing the index, and selecting rows and columns. 

It turns out that we know the names of the height and weight observations.  

In [96]:
names = ['Fatima', 'Ahmadou', 'Michael', 'Frank', 'Gretta']

In [108]:
df['names'] = names 

In [109]:
df

|  | height | weight | names |
|---|---|---|---|
| 0 | 1.5 | 60 | Fatima |
| 1 | 1.67 | 84 | Ahmadou |
| 2 | 1.78 | 65 | Michael |
| 3 | 1.86 | 70 | Frank |
| 4 | 1.45 | 89 | Gretta |


Notice that the names were attached to the dataframe in the order they appear in the list.  Also notice that the index (the leftmost column of numbers) begins with zero, like a list index. 

In some cases, it is useful to change the index to refer to names of observations.  For example, we may want to place the names as the index. The 'inplace=True' portion of the code ensures the change is applied to df itself rather than returned as a new DataFrame. 

In [110]:
df.set_index('names', inplace=True)
df

| names | height | weight |
|---|---|---|
| Fatima | 1.5 | 60 |
| Ahmadou | 1.67 | 84 |
| Michael | 1.78 | 65 |
| Frank | 1.86 | 70 |
| Gretta | 1.45 | 89 |


Now that we have a clean dataset, let's try to identify specific columns and observations. 

What is the height of Ahmadou? 

Try to remember how we captured values within the dictionary. If we refer to the 'height' within the dataframe (df), we receive a Pandas Series of names and their respective heights.  

In [105]:
df['height']

names
Fatima     1.50
Ahmadou    1.67
Michael    1.78
Frank      1.86
Gretta     1.45
Name: height, dtype: float64

In [111]:
type(df['height'])

pandas.core.series.Series

But how do we capture the height of only Ahmadou? What about the heights from Ahmadou through Frank?

In [113]:
df['height']['Ahmadou']

1.6699999999999999

In [114]:
df['height']['Ahmadou':'Frank']

names
Ahmadou    1.67
Michael    1.78
Frank      1.86
Name: height, dtype: float64
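Pandas also provides .loc, which selects rows and columns by label in a single step. The sketch below rebuilds the small dataframe so that it runs on its own; the names ahmadou_height and subset are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'height': [1.5, 1.67, 1.78, 1.86, 1.45],
                   'weight': [60, 84, 65, 70, 89]},
                  index=['Fatima', 'Ahmadou', 'Michael', 'Frank', 'Gretta'])

# .loc[row label, column label] selects by name
ahmadou_height = df.loc['Ahmadou', 'height']

# Label slices include both endpoints, unlike list slices
subset = df.loc['Ahmadou':'Frank', ['height', 'weight']]

print(ahmadou_height)
print(subset)
```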

### Tidy Data 

Before moving forward, it is important to discuss the conventions of data for analysis. In most cases, data analysis requires data to be in 'tidy' format. 

This means that each observation is a row and each column is a variable or description of that observation. 
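As a sketch of what this looks like in practice, the code below takes a small table with one column per year and reshapes it with pd.melt() so that each row is one country-year. The country names and numbers are made up for this illustration.

```python
import pandas as pd

# Made-up numbers in 'wide' format: one column per year
wide = pd.DataFrame({'country': ['A', 'B'],
                     '2015': [1.0, 3.0],
                     '2016': [2.0, 4.0]})

# melt() stacks the year columns so that each row is one country-year
tidy = pd.melt(wide, id_vars='country', var_name='year', value_name='value')

print(tidy)
```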

## Access World Bank Data

Along with performing data analysis, Python can gather data for you! In this example, we will use the wbdata package to gather data from the World Bank Development Indicators.  

In [1]:
# As usual, our first step is to import the package
import wbdata

First, let's take a look at the types of data we can download.

In [2]:
wbdata.get_source()

11	Africa Development Indicators
36	Statistical Capacity Indicators
31	Country Policy and Institutional Assessment
41	Country Partnership Strategy for India (FY2013 - 17)
1 	Doing Business
30	Exporter Dynamics Database - Indicators at Country-Year Level
12	Education Statistics
13	Enterprise Surveys
28	Global Financial Inclusion
33	G20 Financial Inclusion Indicators
14	Gender Statistics
15	Global Economic Monitor
27	Global Economic Prospects
32	Global Financial Development
21	Global Economic Monitor Commodities
55	Commodity Prices- History and Projections
34	Global Partnership for Education
29	The Atlas of Social Protection: Indicators of Resilience and Equity
16	Health Nutrition and Population Statistics
39	Health Nutrition and Population Statistics by Wealth Quintile
40	Population estimates and projections
18	IDA Results Measurement System
45	Indonesia Database for Policy and Economic Research
6 	International Debt Statistics
54	Joint External Debt Hub
25	Jobs
37	LAC Equity Lab
19	M

Well... That's quite a lot. I think we will examine some of the World Development Indicators (source 2).  The code below provides the names and codes for several indicators. 

In [3]:
wbdata.get_indicator(source=2)

NV.IND.TOTL.KD.ZG        	Industry, value added (annual % growth)
NV.IND.TOTL.KD           	Industry, value added (constant 2010 US$)
NV.IND.TOTL.CN           	Industry, value added (current LCU)
NV.IND.TOTL.CD           	Industry, value added (current US$)
NV.IND.MANF.ZS           	Manufacturing, value added (% of GDP)
NV.IND.MANF.KN           	Manufacturing, value added (constant LCU)
NV.IND.MANF.KD.ZG        	Manufacturing, value added (annual % growth)
NV.IND.MANF.KD           	Manufacturing, value added (constant 2010 US$)
NV.IND.MANF.CN           	Manufacturing, value added (current LCU)
NV.IND.MANF.CD           	Manufacturing, value added (current US$)
NV.AGR.TOTL.ZS           	Agriculture, value added (% of GDP)
NV.AGR.TOTL.KN           	Agriculture, value added (constant LCU)
NV.AGR.TOTL.KD.ZG        	Agriculture, value added (annual % growth)
NV.AGR.TOTL.KD           	Agriculture, value added (constant 2010 US$)
NV.AGR.TOTL.CN           	Agriculture, value added (current LCU)

##### Great! What indicators do we want? 

I found a study which examines the relationship between GDP per capita, remittances, and private investment.  I figured we could recreate it using the World Development Indicators.  

First, we create a dictionary of indicator codes and their respective names. 

In [123]:
indicators_dict = {'NY.GDP.PCAP.PP.KD': 'GDP Per Capita (Constant 2011 USD)', 
                  'BX.TRF.PWKR.DT.GD.ZS': 'Personal remittances, received (% of GDP)', 
                  'BX.KLT.DINV.WD.GD.ZS': 'Foreign direct investment, net inflows (% of GDP)', 
                  'GB.XPD.RSDV.GD.ZS' : 'Research and development expenditure (% of GDP)', 
                  'FS.AST.PRVT.GD.ZS' : 'Domestic credit to private sector (% of GDP)'}


##### Choose Lower Middle Income Countries 

Since we are in a Lower Middle Income Country, we want to examine countries which fall into this category. 

In [124]:
# Don't worry if this code doesn't make sense (I didn't write it either.)
# Got it from the internet! 
LMC_countries = [i['id'] for i in wbdata.get_country(incomelevel="LMC", display=False)]

In [6]:
print(LMC_countries)

['AGO', 'ARM', 'BGD', 'BOL', 'BTN', 'CIV', 'CMR', 'COG', 'CPV', 'DJI', 'EGY', 'FSM', 'GEO', 'GHA', 'GTM', 'HND', 'IDN', 'IND', 'JOR', 'KEN', 'KGZ', 'KHM', 'KIR', 'LAO', 'LKA', 'LSO', 'MAR', 'MDA', 'MMR', 'MNG', 'MRT', 'NGA', 'NIC', 'PAK', 'PHL', 'PNG', 'PSE', 'SDN', 'SLB', 'SLV', 'STP', 'SWZ', 'SYR', 'TJK', 'TLS', 'TUN', 'UKR', 'UZB', 'VNM', 'VUT', 'XKX', 'YEM', 'ZMB']
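The one-line code above is a 'list comprehension'. To take some of the mystery out of it, here is a sketch that runs without the internet. The list `countries` is a stand-in: it assumes the World Bank call returns a list of dictionaries, one per country, each with an 'id' key (as the i['id'] in the code suggests), and only two made-up entries are shown.

```python
# Stand-in for the assumed result of the World Bank call:
# a list of dictionaries, one per country (fields abbreviated)
countries = [{'id': 'CIV', 'name': "Cote d'Ivoire"},
             {'id': 'GHA', 'name': 'Ghana'}]

# The list comprehension: take i['id'] for every i in the list
codes = [i['id'] for i in countries]

# It is shorthand for this loop
codes_loop = []
for i in countries:
    codes_loop.append(i['id'])

print(codes)
print(codes == codes_loop)
```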


##### Get the Data 

Now that we have defined what indicators we want and the countries we want to examine, we can grab the data and generate a pandas dataframe. 

In [125]:
# see that we place the indicators dictionary first, then country= followed by the name of the list. 

df = wbdata.get_dataframe(indicators_dict, country=LMC_countries, convert_date=False)

##### Now wait. 

Now, depending on internet speeds this may take a bit of time.  But, it is quite impressive what is occurring at this moment.  Our computer is communicating with a large computer somewhere in the US, telling this computer the data we would like, and then returning the data to us in a form which pandas can turn into a pandas dataframe.  

Pretty cool.  Yes, I'm a nerd. 

Now let's look at our Dataset

In [126]:
df

| country | date | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) |
|---|---|---|---|---|---|---|
| Angola | 2017 | NaN | NaN | NaN | NaN | NaN |
| Angola | 2016 | 21.131451 | 4.305258 | 5984.640422 | 0.004183 | NaN |
| Angola | 2015 | 27.224227 | 9.015118 | 6231.067992 | 0.010795 | NaN |
| Angola | 2014 | 22.885054 | 1.515813 | 6260.132681 | 0.024430 | NaN |
| Angola | 2013 | 23.387915 | -5.700024 | 6185.013829 | 0.029331 | NaN |
| Angola | 2012 | 22.262184 | -5.977515 | 5998.638601 | 0.034965 | NaN |
| Angola | 2011 | 20.179426 | -2.904235 | 5911.254334 | 0.000197 | NaN |
| Angola | 2010 | 20.215879 | -3.913151 | 5895.114088 | 0.021792 | NaN |
| Angola | 2009 | 21.468893 | 2.921219 | 5908.051427 | 0.000215 | NaN |
| Angola | 2008 | 12.682510 | 1.994548 | 5978.334873 | 0.097512 | NaN |


#### Search for the country codes for country of interest

In some cases, we may just want one variable for one country.  

First, we find the country code for Cote d'Ivoire. 

In [133]:
wbdata.search_countries("ivoire")

CIV	Cote d'Ivoire


Then we include the indicator and country of choice in the get_dataframe() function. 

In [161]:
df_CIV = wbdata.get_dataframe({'NY.GDP.PCAP.PP.KD': 'GDP Per Capita (Constant 2011 USD)'}, country='CIV', convert_date=False)

Voila! We have a dataset with GDP per capita of Cote d'Ivoire from 1965 to 2016. (Where the data is available)

## Pandas: Data Exploration, Cleaning and Summary Statistics

We now explore the Lower Middle Income dataframe.  First, we will use the built in pandas functions for summary statistics and analysis.  Then, we will use the Holoviews visualisation tool to explore the data through graphs and tables.   

##### DataFrame.columns

An initial question for a researcher may be: what variables are in my dataset? Use df.columns to list the names of the columns of df. 

In [9]:
df.columns

Index(['Domestic credit to private sector (% of GDP)',
       'Foreign direct investment, net inflows (% of GDP)',
       'GDP Per Capita (Constant 2011 USD)',
       'Personal remittances, received (% of GDP)',
       'Research and development expenditure (% of GDP)'],
      dtype='object')

##### Head() and Tail()
One of the first tasks is to examine the beginning and end of the dataset to ensure the data appears reasonable.

In [168]:
# Choose the first five observations
df.head()

| country | date | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) |
|---|---|---|---|---|---|---|
| Angola | 2017 | NaN | NaN | NaN | NaN | NaN |
| Angola | 2016 | 21.131451 | 4.305258 | 5984.640422 | 0.004183 | NaN |
| Angola | 2015 | 27.224227 | 9.015118 | 6231.067992 | 0.010795 | NaN |
| Angola | 2014 | 22.885054 | 1.515813 | 6260.132681 | 0.02443 | NaN |
| Angola | 2013 | 23.387915 | -5.700024 | 6185.013829 | 0.029331 | NaN |


In [211]:
# Choose the last five observations 
df.tail()

| country | date | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) |
|---|---|---|---|---|---|---|
| Zambia | 1964 | NaN | NaN | NaN | NaN | NaN |
| Zambia | 1963 | NaN | NaN | NaN | NaN | NaN |
| Zambia | 1962 | NaN | NaN | NaN | NaN | NaN |
| Zambia | 1961 | NaN | NaN | NaN | NaN | NaN |
| Zambia | 1960 | NaN | NaN | NaN | NaN | NaN |


#### Extract the Index - Generate a Variable 

In some cases, it is useful to transform the index into regular columns.  For example, you may want to use them in a regression. 

##### dataframe.reset_index()  

In [43]:
df.reset_index(inplace=True)

#### Change Variable Names 

However, 'country' and 'date' are not particularly attractive names for variables.  We may want to change their names to 'Country' and 'Year'. 

In [44]:
df.rename(index=str, columns={"country": "Country", "date": "Year"}, inplace=True)

In [23]:
df.head()

|  | Country | Year | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) |
|---|---|---|---|---|---|---|---|
| 0 | Angola | 2017 | NaN | NaN | NaN | NaN | NaN |
| 1 | Angola | 2016 | 21.131451 | 4.305258 | 5984.640422 | 0.004183 | NaN |
| 2 | Angola | 2015 | 27.224227 | 9.015118 | 6231.067992 | 0.010795 | NaN |
| 3 | Angola | 2014 | 22.885054 | 1.515813 | 6260.132681 | 0.02443 | NaN |
| 4 | Angola | 2013 | 23.387915 | -5.700024 | 6185.013829 | 0.029331 | NaN |
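The two steps above can be tried on a tiny example that does not need the internet. The sketch below builds a made-up dataframe with the same (country, date) index layout; the name `small`, the 'gdp' column, and its values are invented. The last step is only needed if the years are stored as text.

```python
import pandas as pd

# Made-up values with a (country, date) index, like the World Bank dataframe
index = pd.MultiIndex.from_tuples([('Angola', '2016'), ('Angola', '2015')],
                                  names=['country', 'date'])
small = pd.DataFrame({'gdp': [100.0, 90.0]}, index=index)

# reset_index() turns the index levels into ordinary columns
small = small.reset_index()

# rename() returns a new DataFrame unless inplace=True is given
small = small.rename(columns={'country': 'Country', 'date': 'Year'})

# If the years are stored as text, convert them to whole numbers
small['Year'] = small['Year'].astype(int)
print(small)
```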


While visually examining data can be useful as a first step, when working with hundreds or thousands of observations, analysing the data in this manner is inadequate.  

#### DataFrame.describe()

DataFrame.describe() offers a quick way to get summary statistics on all columns. This code gives the number of observations (non-missing), mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and max values. 

As shown, the variable on research and development expenditure has many missing values; only 302 observations.  We also see that the average income in our sample of countries is roughly 4 thousand dollars (2011). 

In [256]:
df.describe()

|  | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) | Year |
|---|---|---|---|---|---|---|
| count | 2062.0 | 1837.0 | 1362.0 | 1573.0 | 302.0 | 3074.0 |
| mean | 23.418182 | 2.934455 | 4079.120767 | 6.970477 | 0.317233 | 1988.5 |
| std | 17.086032 | 4.705017 | 2149.869653 | 10.933354 | 0.262884 | 16.743393 |
| min | 0.915545 | -37.165648 | 728.031675 | 0.000197 | 0.00544 | 1960.0 |
| 25% | 11.319736 | 0.47412 | 2500.92549 | 0.827754 | 0.117842 | 1974.0 |
| 50% | 19.406387 | 1.521803 | 3396.193639 | 3.502235 | 0.230825 | 1988.5 |
| 75% | 30.963818 | 3.918115 | 5352.323735 | 8.951668 | 0.433588 | 2003.0 |
| max | 123.814875 | 43.912112 | 11639.309652 | 99.821794 | 1.1923 | 2017.0 |
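Since missing values matter so much in this dataset, it helps to count them directly. The sketch below uses a small made-up dataframe (the name `sample`, its columns, and its values are invented) so it can be run without downloading anything.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps, standing in for the World Bank columns
sample = pd.DataFrame({'gdp': [5000.0, np.nan, 3000.0, 4000.0],
                       'rd_spending': [np.nan, np.nan, 0.3, np.nan]})

observed = sample.count()        # non-missing values per column
missing = sample.isna().sum()    # missing values per column
complete_rows = sample.dropna()  # rows with no missing values at all

print(observed)
print(missing)
print(len(complete_rows))
```

Dropping every row with a gap can shrink a dataset quickly, so it is worth checking these counts before deciding which variables to keep.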


### Filtering Pandas DataFrames

In many cases, we do not need all of the data within the dataframe.  For example, we may consider countries with over 8,000 dollars per capita income as too wealthy for our study.  In this case, we would want to select all those who record less than this value. 

In [188]:
# Select the rows (country-years) with less than 8,000 GDP per capita. 
# Notice that in df[df[]>value], we refer to the dataset both within and outside the []
df_under8 = df[df['GDP Per Capita (Constant 2011 USD)']<8000]

If we run df_under8.describe(), we will see that the max value for GDP Per Capita is now less than 8000. 

It should also be noted that df_under8 is a new dataset. If we return to df, it will be unchanged. 

In [189]:
df_under8.describe()

|  | Domestic credit to private sector (% of GDP) | Foreign direct investment, net inflows (% of GDP) | GDP Per Capita (Constant 2011 USD) | Personal remittances, received (% of GDP) | Research and development expenditure (% of GDP) |
|---|---|---|---|---|---|
| count | 1156.0 | 1231.0 | 1279.0 | 1078.0 | 258.0 |
| mean | 24.922883 | 3.555015 | 3734.88484 | 7.102411 | 0.289433 |
| std | 17.868419 | 4.912137 | 1707.881597 | 8.872907 | 0.255774 |
| min | 0.915545 | -32.346992 | 728.031675 | 0.000197 | 0.00544 |
| 25% | 11.794539 | 0.731693 | 2433.434715 | 1.159889 | 0.11336 |
| 50% | 21.21285 | 2.095116 | 3273.576579 | 3.898547 | 0.213255 |
| 75% | 34.119434 | 4.945689 | 4774.491659 | 9.752879 | 0.351618 |
| max | 123.814875 | 42.092808 | 7989.997256 | 71.741617 | 1.1923 |
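Filters can also be combined. As a sketch on made-up data (the dataframe `sample` and its values are invented for this example), the code below keeps only the rows that satisfy two conditions at once.

```python
import pandas as pd

# Made-up data for illustration
sample = pd.DataFrame({'Country': ['A', 'A', 'B', 'B'],
                       'Year': [1995, 2005, 1995, 2005],
                       'gdp': [2000.0, 9000.0, 3000.0, 4000.0]})

# Each condition goes in its own parentheses; & means 'and', | means 'or'
recent_and_under8 = sample[(sample['gdp'] < 8000) & (sample['Year'] >= 2000)]

print(recent_and_under8)
print(len(sample))    # the original DataFrame still has all four rows
```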


## Visualisation with Holoviews 

One of the benefits of working in Python is its abilities beyond basic calculations.  In fact, Python can generate beautiful and convenient interactive visuals with very little effort. 

Holoviews is a python package which generates interactive visuals within the jupyter notebook.  

In [196]:
# Import Holoviews 
import holoviews as hv

# Choose bokeh as the extension 
hv.extension('bokeh')

# if the symbol of holoviews with bokeh shows up, the import was successful

#### Basic Options 

The first section creates the output window and alters some of the formatting of the visuals. Explore by making small changes to the format. 

In [206]:
# Choose output size for the graphs
%output size=200

#Domestic credit to private sector (% of GDP) 	Foreign direct investment, net inflows (% of GDP) 	GDP Per Capita (Constant 2011 USD) 	Personal remittances, received (% of GDP) 	Research and development expenditure (% of GDP)

### Scatter Plot 

One typical way to explore data in the initial stages is through a scatter plot.  This plots the relationship between two variables. 

For example, in the session below we examine the relationship between Domestic credit to private sector (% of GDP) and GDP Per Capita (Constant 2011 USD). 

As economic theory would suggest, there appears to be a positive relationship. 

In [258]:
%%opts Scatter [width=250, height=250, tools=['hover']] (size=5)
                 
scatter1 = hv.Scatter(df, 
                      kdims= ['GDP Per Capita (Constant 2011 USD)'],
                      vdims = ['Domestic credit to private sector (% of GDP)', 'Year', 'Country'])
scatter1

### Scatter - Examine Changes Over Time

Although the scatter plot above shows a positive relationship, this may simply be driven by both variables rising at the same time.  

Let's examine this by seeing how the plot changes as we look at one year at a time. 

In [257]:
%%opts Scatter [show_grid=True, width=250, height=250, tools=['hover']] (size=10)

scatter2 = hv.Scatter(df, 
                      kdims= ['GDP Per Capita (Constant 2011 USD)'],
                      vdims = ['Domestic credit to private sector (% of GDP)', 'Year', 'Country']).groupby(['Year'])


# Take a look
scatter2
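
The same concern can also be checked numerically. The sketch below uses invented numbers (the `toy` DataFrame and its column names `gdp` and `credit` are hypothetical) in which both variables rise between two years, so the pooled correlation is strongly positive even though the relationship within each year is negative.

```python
import pandas as pd

# Made-up data: both variables are higher in 2010 than in 2000,
# but within each year they move in opposite directions
toy = pd.DataFrame({
    'Year':   [2000, 2000, 2000, 2010, 2010, 2010],
    'gdp':    [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
    'credit': [3.0, 2.0, 1.0, 13.0, 12.0, 11.0],
})

# Correlation across all observations, ignoring the year
pooled = toy['gdp'].corr(toy['credit'])

# Correlation computed separately for each year
by_year = {year: grp['gdp'].corr(grp['credit'])
           for year, grp in toy.groupby('Year')}

print(pooled)
print(by_year)
```

If the within-year correlations in the real data look similar to the pooled one, a common time trend is less likely to be driving the pattern.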

### Histogram Plot 

To examine the variance or distribution of a given variable, it can be useful to plot a histogram of the variable. Holoviews makes this task very straightforward.  

In [269]:
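import numpy as np

# Drop missing values and convert to a NumPy array
# (.as_matrix() was removed in pandas 1.0; .to_numpy() replaces it)
df_hist = df['GDP Per Capita (Constant 2011 USD)']
df_hist = df_hist.dropna().to_numpy()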
frequencies, edges = np.histogram(df_hist, 20)
hv.Histogram(frequencies, edges)
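
To see what `np.histogram` hands over to Holoviews, here is a small sketch on made-up values (the `values` array is invented for illustration). It returns the count in each bin and the bin boundaries, so there is always one more edge than there are bins.

```python
import numpy as np

# Made-up values to be sorted into 3 equal-width bins
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 9.0, 10.0])

# frequencies: number of values falling in each bin
# edges: the bin boundaries, running from the minimum to the maximum value
frequencies, edges = np.histogram(values, 3)

print(frequencies)
print(edges)
```

Increasing the second argument gives a finer histogram, at the cost of fewer observations per bin.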



### Line Chart 

To examine time series data, line charts are useful to see patterns over time. 

Below I examine the changes in GDP Per Capita over time for Cote d'Ivoire. Alter the variable names and country selection to change the plot. 

In [291]:
%%opts Curve {+framewise}

points = [(df['Year']["Cote d'Ivoire"][str(i)], df['GDP Per Capita (Constant 2011 USD)']["Cote d'Ivoire"][str(i)]) for i in range(1960,2017)]

hv.Curve(points)

### Bokeh Line Chart 

I find the bokeh charts to be slightly more attractive. However, they require a bit more code.  

For more interesting interactive plots, use the separate bokeh tutorial jupyter notebooks.  Here is an example: 

In [296]:
from bokeh.plotting import figure, output_file, show

output_file("line.html")

p = figure(plot_width=400, plot_height=400, tools='hover')

# add a line renderer
p.line(df['Year']["Cote d'Ivoire"], df['GDP Per Capita (Constant 2011 USD)']["Cote d'Ivoire"] , line_width=2)

show(p)

# Econometrics with Python 

We have learned some of the basics of python and how python packages can be used to store, access, clean, summarize, and visualize data.  Now, we will examine how to perform econometric analysis with python. 

## Statsmodels

Although there are a few packages for econometrics for python, we will work with statsmodels. 

Statsmodels is a package with functions specifically for econometrics such as correlation, regression, and time series analysis. 

Moreover, statsmodels provides a convenient way to export regression results as Latex tables for publication. 

In [127]:
# Import statsmodels as sm
import statsmodels.api as sm 

##### Add a constant term 

In [128]:
# Add constant term to dataset
df['Constant'] = 1
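
To illustrate why the constant column matters, here is a minimal sketch using plain NumPy least squares rather than statsmodels, on made-up data generated exactly from y = 5 + 2x. The variable names are invented for illustration. With the column of ones the intercept is recovered; without it the fitted line is forced through the origin and the slope is distorted.

```python
import numpy as np

# Made-up data generated exactly from y = 5 + 2x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 5.0 + 2.0 * x

# Design matrix with a column of ones, mirroring df['Constant'] = 1
X_with_const = np.column_stack([x, np.ones_like(x)])
coef_with, *_ = np.linalg.lstsq(X_with_const, y, rcond=None)

# Without the column of ones, the fitted line must pass through the origin
X_no_const = x.reshape(-1, 1)
coef_without, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

print(coef_with)     # slope and intercept
print(coef_without)  # slope only, biased upward here
```

Statsmodels also offers `sm.add_constant`, which adds a column of ones named `const` to a set of regressors; the manual column used in this tutorial achieves the same thing.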

#### List of Independent Variables in regressions 

Each Xvars_'#' represents a set of independent variables within a regression. 

In [129]:
# Create lists of variables to be used in the first regression
Xvars_1 = ['Domestic credit to private sector (% of GDP)', 'Constant']
# Create lists of variables to be used in the second regression
Xvars_2 = ['Domestic credit to private sector (% of GDP)', 'Foreign direct investment, net inflows (% of GDP)', 'Constant']
# Create lists of variables to be used in the third regression
Xvars_3 = ['Domestic credit to private sector (% of GDP)', 'Foreign direct investment, net inflows (% of GDP)', 'Personal remittances, received (% of GDP)', 'Constant']

#### Run Regressions and save the fitted results

In [130]:
# Estimate an OLS regression for each set of variables
# First Regression
# order = (dependent variable, independent variables, missing='drop')
reg1 = sm.OLS(df['GDP Per Capita (Constant 2011 USD)'], df[Xvars_1], missing='drop').fit()
print('Results from Regression 1')
print(reg1.summary())


# Second Regression 
reg2 = sm.OLS(df['GDP Per Capita (Constant 2011 USD)'], df[Xvars_2], missing='drop').fit()
print()
print('Results from Regression 2')
print(reg2.summary())


# Third Regression
reg3 = sm.OLS(df['GDP Per Capita (Constant 2011 USD)'], df[Xvars_3], missing='drop').fit()
print()
print('Results from Regression 3')
print(reg3.summary())

Results from Regression 1
                                    OLS Regression Results                                    
Dep. Variable:     GDP Per Capita (Constant 2011 USD)   R-squared:                       0.287
Model:                                            OLS   Adj. R-squared:                  0.286
Method:                                 Least Squares   F-statistic:                     496.7
Date:                                Wed, 28 Feb 2018   Prob (F-statistic):           9.68e-93
Time:                                        17:55:29   Log-Likelihood:                -11043.
No. Observations:                                1236   AIC:                         2.209e+04
Df Residuals:                                    1234   BIC:                         2.210e+04
Df Model:                                           1                                         
Covariance Type:                            nonrobust                                         
                        

### Export Results Into an Attractive Format

In [131]:
from statsmodels.iolib.summary2 import summary_col

info_dict={'R-squared' : lambda x: "{:.2f}".format(x.rsquared),
           'No. observations' : lambda x: "{0:d}".format(int(x.nobs))}

results_table = summary_col(results=[reg1,reg2,reg3],
                            float_format='%0.2f',
                            stars = True,
                            model_names=['Priv Cred',
                                         'Add FDI',
                                         'Add Remit'],
                            info_dict=info_dict,
                            regressor_order=['Domestic credit to private sector (% of GDP)', 'Foreign direct investment, net inflows (% of GDP)', 'Personal remittances, received (% of GDP)', 'Constant'])

results_table.add_title('Example - OLS Regression in Python')

print(results_table)

                        Example - OLS Regression in Python
                                                  Priv Cred   Add FDI   Add Remit 
----------------------------------------------------------------------------------
Domestic credit to private sector (% of GDP)      60.48***   60.94***   62.18***  
                                                  (2.71)     (2.74)     (2.89)    
Foreign direct investment, net inflows (% of GDP)            -19.33*    -33.46*** 
                                                             (9.98)     (11.74)   
Personal remittances, received (% of GDP)                               -0.07     
                                                                        (6.41)    
Constant                                          2569.81*** 2646.94*** 2684.24***
                                                  (89.43)    (95.81)    (109.36)  
R-squared                                         0.29       0.29       0.30      
No. observations            
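
The entries of `info_dict` are small functions that take a fitted result and return a formatted string. The sketch below shows how they behave, using a hypothetical stand-in object (`fake_result`, built with `SimpleNamespace`) instead of a real statsmodels result; its two numbers echo the Regression 1 output above.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a fitted result, carrying only the two
# attributes that the functions below read
fake_result = SimpleNamespace(rsquared=0.287, nobs=1236.0)

info_dict = {
    # Round R-squared to two decimal places
    'R-squared': lambda x: "{:.2f}".format(x.rsquared),
    # nobs is stored as a float, so convert to int before using the 'd' format
    'No. observations': lambda x: "{0:d}".format(int(x.nobs)),
}

# Apply each function to the stand-in result
rows = {label: fn(fake_result) for label, fn in info_dict.items()}
print(rows)
```

Further rows can be added in the same way, as long as each function reads an attribute that the fitted result actually provides.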

### Generate Latex Code for Attractive Presentation

Unfortunately, I had to make some alterations to the Latex code printed below.  However, the changes were small. 

- Erase the \\ 
- Erase the begin{center} and end{center}
- Get rid of the first begin/end{tabular} and place headings within second tabular{}

If the user only wants to post the table, place the text between the following code:

    \documentclass[paper=a4, fontsize=11pt]{scrartcl} % A4 paper and 11pt font size

    \begin{document}
    
    ...
    
    \end{document}
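
As a rough sketch, the wrapping step can also be done in Python. The helper name `wrap_in_document` and the `example_table` string are hypothetical and only illustrate the idea; in practice the table text would be the (edited) output of `results_table.as_latex()`.

```python
def wrap_in_document(table_tex):
    """Return a minimal standalone Latex document containing table_tex."""
    preamble = r"\documentclass[paper=a4, fontsize=11pt]{scrartcl}"
    parts = [preamble, "", r"\begin{document}", "", table_tex, "",
             r"\end{document}", ""]
    return "\n".join(parts)


# Hypothetical placeholder table, used only to show the wrapping
example_table = r"\begin{tabular}{lc} Variable & Coef. \\ \end{tabular}"

document = wrap_in_document(example_table)
print(document)
```

The resulting string can then be written to a `.tex` file and compiled as usual.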

In [132]:
a = results_table.as_latex()
print(a)

\begin{table}
\caption{Example - OLS Regression in Python} \\
\begin{center}
\begin{tabular}{lccc}
\hline
                                                   & Priv Cred  &  Add FDI   & Add Remit   \\
\hline
\hline
\end{tabular}
\begin{tabular}{llll}
Domestic credit to private sector (\% of GDP)      & 60.48***   & 60.94***   & 62.18***    \\
                                                   & (2.71)     & (2.74)     & (2.89)      \\
Foreign direct investment, net inflows (\% of GDP) &            & -19.33*    & -33.46***   \\
                                                   &            & (9.98)     & (11.74)     \\
Personal remittances, received (\% of GDP)         &            &            & -0.07       \\
                                                   &            &            & (6.41)      \\
Constant                                           & 2569.81*** & 2646.94*** & 2684.24***  \\
                                                   & (89.43)    & (95.81)    & (109.36)    \

### Outcome of Latex Table 

![Screenshot%20from%202018-02-28%2000-38-53.png](attachment:Screenshot%20from%202018-02-28%2000-38-53.png)

### Introducing Lagged Variables

See the results from Regression 1 below. In contrast to what traditional economic theory would suggest, the effects of FDI on GDP are not statistically significant.  

However, several constraints to this regression may explain these results.  For example, foreign direct investment likely does not provide returns until a later period.  Thus, FDI from last year or the year before may be a better predictor of GDP Per Capita.

###### dataframe['column'].shift()

Pandas offers a simple method to generate lagged variables: .shift()

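In this example, we want to lag FDI by one year (or time observation). Because the rows are sorted from the most recent year to the oldest, the prior year sits one row below the current one. Thus, we shift by -1: 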

In [54]:
# Generate a variable holding FDI from the prior year (rows are sorted newest first)
df['FDI_lagged'] = df['Foreign direct investment, net inflows (% of GDP)'].shift(-1)

Now, if we examine the table below, we will see that the value of FDI_lagged is the same as the Foreign direct investment, net inflows (% of GDP) a year prior. 

In [53]:
df

Unnamed: 0,Country,Year,Domestic credit to private sector (% of GDP),"Foreign direct investment, net inflows (% of GDP)",GDP Per Capita (Constant 2011 USD),"Personal remittances, received (% of GDP)",Research and development expenditure (% of GDP),Constant,FDI_lagged
0,Angola,2017,,,,,,1,4.305258
1,Angola,2016,21.131451,4.305258,5984.640422,0.004183,,1,9.015118
2,Angola,2015,27.224227,9.015118,6231.067992,0.010795,,1,1.515813
3,Angola,2014,22.885054,1.515813,6260.132681,0.024430,,1,-5.700024
4,Angola,2013,23.387915,-5.700024,6185.013829,0.029331,,1,-5.977515
5,Angola,2012,22.262184,-5.977515,5998.638601,0.034965,,1,-2.904235
6,Angola,2011,20.179426,-2.904235,5911.254334,0.000197,,1,-3.913151
7,Angola,2010,20.215879,-3.913151,5895.114088,0.021792,,1,2.921219
8,Angola,2009,21.468893,2.921219,5908.051427,0.000215,,1,1.994548
9,Angola,2008,12.682510,1.994548,5978.334873,0.097512,,1,-1.477846
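
One caveat with a plain `.shift()` on panel data is that it does not know where one country ends and the next begins. The sketch below, on a made-up two-country panel (the `toy` DataFrame and its column names are invented), compares a plain shift with a shift applied within each country using `groupby`. It is offered as a possible refinement under these assumptions, not as the method used in this tutorial.

```python
import pandas as pd

# Made-up panel sorted newest year first within each country
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year':    [2016, 2015, 2014, 2016, 2015, 2014],
    'FDI':     [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Plain shift(-1): the oldest row of country A picks up country B's first value
toy['lag_plain'] = toy['FDI'].shift(-1)

# Grouped shift: each country is shifted on its own,
# so the oldest year of every country is left missing
toy['lag_grouped'] = toy.groupby('Country')['FDI'].shift(-1)

print(toy)
```

Whether the plain shift causes a problem in practice depends on what sits in the first row of the following country; checking the rows at each country boundary is a quick way to find out.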


## Mixed Effect Methods 

The following code runs fixed/mixed effect models.  The first regression measures the effect of trends common to all countries in a given year.  The second introduces only country fixed effects, which control for all aspects of a country that are constant over time, such as climate.  The third regression includes both year and country mixed effects for a difference-in-differences model.  

In [32]:
# import additional package for fixed effects
import statsmodels.formula.api as smf

Unfortunately, the code for the mixed-effect models is slightly more complicated. First, the regression input is a string with the following format: 

' DependentVar ~ IndependentVar + IndependentVar...' 

where the variable names cannot have spaces within them. 

Thus, our first step is to rename all variables in the dataset to clear out all spaces.

In [60]:
# Rename all columns 
# Make sure to pay attention to the order of the names!
df.columns = ['Country', 'Year', 'creditprive', 'FDI', 'GDPpc', 'Remitt', 'ResDev', 'Constant', 'FDI_lagged']
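
Because assigning a list to `df.columns` depends on the column order, a mapping from old names to new names can be a safer alternative. The sketch below uses a made-up one-row dataset (the `toy` DataFrame and the `short_names` mapping are illustrative) and pandas' `rename`.

```python
import pandas as pd

# Made-up one-row dataset with two of the long World Bank column names
toy = pd.DataFrame({
    'Country': ['Angola'],
    'GDP Per Capita (Constant 2011 USD)': [5984.64],
    'Foreign direct investment, net inflows (% of GDP)': [4.31],
})

# Map each long name to a short name without spaces;
# columns not listed here keep their original name
short_names = {
    'GDP Per Capita (Constant 2011 USD)': 'GDPpc',
    'Foreign direct investment, net inflows (% of GDP)': 'FDI',
}
toy = toy.rename(columns=short_names)

print(list(toy.columns))
```

With this approach, each column is matched by name, so a change in column order does not silently mislabel the data.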

The regression also does not appear to handle missing values well.  As a result, we must generate different datasets for each regression and drop the missing values. 

In [61]:
# Regression 1 
dfreg1= df[['Year', 'FDI_lagged', 'GDPpc', 'Constant', 'Country']]
dfreg1 = dfreg1.dropna()

#### Design the model

Now, our data is ready for analysis with country mixed effects. 

First we need to choose our model.  In this first model, we examine the effects of lagged FDI as a percentage of GDP on GDP Per Capita. By including country mixed effects, we control for all features of countries which are constant over time.  Thus, we are examining the relationship between changes in FDI within countries and the corresponding changes in GDP in the following year. 

In [64]:
model = 'GDPpc ~ FDI_lagged'
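
To build intuition for what "controlling for features that are constant over time" means, here is a sketch of the within (fixed-effects) transformation on made-up data. Note the assumption: this is a different estimator from the random-intercept model fitted by `mixedlm` below, and it is shown only to illustrate the idea of comparing changes within countries. The `toy` DataFrame and its values are invented.

```python
import numpy as np
import pandas as pd

# Made-up panel: each country has its own level (10 vs 100),
# but the same slope of 2 on FDI_lagged
toy = pd.DataFrame({
    'Country':    ['A', 'A', 'A', 'B', 'B', 'B'],
    'FDI_lagged': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
intercepts = toy['Country'].map({'A': 10.0, 'B': 100.0})
toy['GDPpc'] = intercepts + 2.0 * toy['FDI_lagged']

# Subtract each country's own mean from its observations
x_within = toy['FDI_lagged'] - toy.groupby('Country')['FDI_lagged'].transform('mean')
y_within = toy['GDPpc'] - toy.groupby('Country')['GDPpc'].transform('mean')

# Slope using only within-country variation
within_slope = (x_within * y_within).sum() / (x_within ** 2).sum()

# Slope from pooling all observations, which mixes in the level differences
pooled_slope = np.polyfit(toy['FDI_lagged'], toy['GDPpc'], 1)[0]

print(within_slope, pooled_slope)
```

In this toy example the pooled slope is far larger than the within-country slope, because it attributes the difference in country levels to FDI.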

##### Insert model into the regression - Include mixed-effect Country Controls

As economic theory would suggest, an increase in FDI (as a percentage of GDP) is followed by an increase in GDP per capita in the following year.  In this model, a one percentage point increase in FDI as a percentage of GDP corresponds to an increase in GDP per capita in the following year of 46.7 US 2011 Dollars. 

However, we may be concerned that all countries have recorded GDP per capita and FDI increases due to some third factor such as new technologies or banking sector reforms.  This hypothesis is supported by the scatter plot we generated earlier which examines changes over time.  In the scatter plot, as we moved through the years, both variables increased in value.  We can control for this trend using time mixed effects by including year dummies (for all but one year).  

In [100]:
# insert model and data types into the regression equation 
# insert Country mixed-effects by setting groups equal to dataframe['Country']
mixed_reg1 = smf.mixedlm(model, dfreg1 , groups=dfreg1['Country'], missing='drop').fit()
print('Results from Regression 1')
print(mixed_reg1.summary())

Results from Regression 1
              Mixed Linear Model Regression Results
Model:                MixedLM   Dependent Variable:   GDPpc      
No. Observations:     1294      Method:               REML       
No. Groups:           52        Scale:                525606.1789
Min. group size:      12        Likelihood:           -10322.0265
Max. group size:      27        Converged:            Yes        
Mean group size:      24.9                                       
-----------------------------------------------------------------
              Coef.    Std.Err.    z    P>|z|   [0.025    0.975] 
-----------------------------------------------------------------
Intercept     5579.129  281.136  19.845 0.000  5028.113  6130.146
FDI_lagged      10.119    4.818   2.100 0.036     0.676    19.562
y1990        -2562.711  161.729 -15.846 0.000 -2879.694 -2245.728
y1991        -2575.334  159.062 -16.191 0.000 -2887.090 -2263.579
y1992        -2523.193  156.452 -16.128 0.000 -2829.833 -2216.55

### Generate Year Dummies

First, we need to generate the year binary variables. The following loop generates these efficiently. This code generates year binary variables and a string of the binary variable names to add to the regression. 

In [114]:
import numpy as np

years = ""
ystr = []
# For each year in the dataset except the last (range() excludes it, so it becomes
# the omitted base year), create a dummy variable which equals 1 if the year == i 
for i in range(min(dfreg1["Year"].astype(int)), max(dfreg1["Year"].astype(int))) :
    dfreg1['y'+str(i)] = np.where(dfreg1['Year']==str(i), 1, 0)
    years = years + '+ y'+str(i)+' '
    ystr = ystr + list(['y'+str(i)])
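
As an alternative sketch, pandas can build the same kind of binary variables with `pd.get_dummies`. The `toy` DataFrame below is made up, and the choice to drop the final year mirrors the loop above, which leaves the last year out as the base category.

```python
import pandas as pd

# Made-up data with Year stored as text, as in the loop above
toy = pd.DataFrame({'Year': ['1990', '1991', '1992', '1990'],
                    'GDPpc': [1.0, 2.0, 3.0, 4.0]})

# One 0/1 column per year, named y1990, y1991, ...
dummies = pd.get_dummies(toy['Year'], prefix='y', prefix_sep='', dtype=int)

# Drop the final year so it serves as the omitted base category
dummies = dummies.iloc[:, :-1]

# Attach the dummies and build the string to append to the formula
toy = pd.concat([toy, dummies], axis=1)
years = ''.join('+ ' + name + ' ' for name in dummies.columns)
model = 'GDPpc ~ FDI_lagged' + years

print(model)
```

Formula strings in statsmodels also accept categorical terms such as `C(Year)`, which create the dummies automatically; the explicit columns are kept here because they make the construction visible.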

If we print the dataset for the regression, we should observe variables which take the value '1' when the year binary variable matches the year of the observation.  

In [93]:
dfreg1

Unnamed: 0,Year,FDI_lagged,GDPpc,Constant,Country,y1990,y1991,y1992,y1993,y1994,...,y2006,y2007,y2008,y2009,y2010,y2011,y2012,y2013,y2014,y2015
1,2016,9.015118,5984.640422,1,Angola,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2015,1.515813,6231.067992,1,Angola,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,2014,-5.700024,6260.132681,1,Angola,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,2013,-5.977515,6185.013829,1,Angola,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,2012,-2.904235,5998.638601,1,Angola,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,2011,-3.913151,5911.254334,1,Angola,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
7,2010,2.921219,5895.114088,1,Angola,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,2009,1.994548,5908.051427,1,Angola,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
9,2008,-1.477846,5978.334873,1,Angola,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
10,2007,-0.090250,5443.126215,1,Angola,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


### Define Model with Year Binary Variables

Now we redefine our model including the year binary variables. 

In [95]:
model = 'GDPpc ~ FDI_lagged' + years
print(model)

GDPpc ~ FDI_lagged+ y1990 + y1991 + y1992 + y1993 + y1994 + y1995 + y1996 + y1997 + y1998 + y1999 + y2000 + y2001 + y2002 + y2003 + y2004 + y2005 + y2006 + y2007 + y2008 + y2009 + y2010 + y2011 + y2012 + y2013 + y2014 + y2015 


#### Run the new model 

By including the year dummy variables, the lagged FDI coefficient picks up only the country-specific variation in foreign direct investment that is not explained by common year effects. When including the year controls, the effect of lagged FDI on GDP per capita is reduced.  However, the result remains statistically significant at the 95 percent level (p = 0.036).  

Interpretation: A one standard deviation increase in lagged foreign direct investment (% of GDP) within a country, about 5.07 percentage points, corresponds to an increase in GDP per capita of over 50 dollars (Constant 2011 USD) within that country, on average (10.119 × 5.07 ≈ 51). 

In [99]:
# insert model and data types into the regression equation 
# insert Country mixed-effects by setting groups equal to dataframe['Country']
mixed_reg2 = smf.mixedlm(model, dfreg1 , groups=dfreg1['Country'], missing='drop').fit()
print('Results from Regression 2')
print(mixed_reg2.summary())

Results from Regression 2
              Mixed Linear Model Regression Results
Model:                MixedLM   Dependent Variable:   GDPpc      
No. Observations:     1294      Method:               REML       
No. Groups:           52        Scale:                525606.1789
Min. group size:      12        Likelihood:           -10322.0265
Max. group size:      27        Converged:            Yes        
Mean group size:      24.9                                       
-----------------------------------------------------------------
              Coef.    Std.Err.    z    P>|z|   [0.025    0.975] 
-----------------------------------------------------------------
Intercept     5579.129  281.136  19.845 0.000  5028.113  6130.146
FDI_lagged      10.119    4.818   2.100 0.036     0.676    19.562
y1990        -2562.711  161.729 -15.846 0.000 -2879.694 -2245.728
y1991        -2575.334  159.062 -16.191 0.000 -2887.090 -2263.579
y1992        -2523.193  156.452 -16.128 0.000 -2829.833 -2216.55

In [98]:
dfreg1['FDI_lagged'].describe()

count    1294.000000
mean        3.614419
std         5.068844
min       -32.346992
25%         0.728763
50%         2.118969
75%         4.975645
max        43.912112
Name: FDI_lagged, dtype: float64
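
The interpretation above combines two numbers from the outputs: the coefficient on FDI_lagged and the standard deviation of FDI_lagged. A short sketch of that arithmetic, with the values copied by hand from the regression summary and the describe() output (the variable names are illustrative):

```python
# Coefficient on FDI_lagged from the regression summary above
coef_fdi_lagged = 10.119

# Standard deviation of FDI_lagged from the describe() output above
std_fdi_lagged = 5.068844

# Change in GDP per capita associated with a one standard deviation increase
effect_one_sd = coef_fdi_lagged * std_fdi_lagged
print(round(effect_one_sd, 2))
```

With a fitted model in memory, the same numbers could be read from `mixed_reg2.params['FDI_lagged']` and `dfreg1['FDI_lagged'].std()` rather than typed by hand.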