## Final Project Pt. 2 - Sierra Leone

## Aim and Hypothesis

* By comparing aid project funding for Sierra Leone between 1992 and 2014 with that country's rankings on HDI indicators over the same time period, I want to see whether I can discover any relationships between these two sets of data from a time-series and causation point of view.

* I think this could be an interesting analysis because it could indicate (in order of what I consider most likely to least likely) that AID funding has an immediate positive effect, a delayed positive effect, no clear effect, or a negative effect on a country's performance on HDI indicators over time. If this analysis were on a larger scale and more rigorous, it could provide insight to AID organisations about which sectors to focus on, and when to expect to see results from their AID funding, based on historical data.

- Null hypothesis: There is no relationship between the amount spent on projects per sector, by year, and the country's performance on related HDI indicators
- Alternative hypothesis: There is some relationship between the amount spent on projects per sector, by year, and the country's performance on related HDI indicators

## Risks

* The data may be too unstructured to analyse in the way I propose 
* My assumptions about the data may be incorrect or incomplete
* The visualizations and conclusions I draw may be misleading if there are problems in the data or the assumptions
* There may be no discernible relationship I can discover between the two data sets
* The lack of project funding in the early 90s may skew the regression
* The AID dataset may only represent a small part of foreign aid to Sierra Leone and thus will not demonstrate any causal effects on the higher level indicators
* There may be a significant time lag between the project funding and the HDI effect that I will not detect.
* Though I am enthusiastic about AID projects in Africa, having seen some of their beneficial effects in person, I lack AID domain knowledge, and this may hinder my ability to determine the right analysis to perform
* There may be better HDI indicators than the ones I've chosen for finding relationships with project funding
* There may not be enough data points/years for time series to be a valid method of analysis

## Assumptions

**About the AID project data**
* The original data only states the year when the aid transactions for a project began and ended, and the total disbursements overall for the project. I have assumed that the total disbursements for a project occurred equally in each year (inclusive) that the project was active.
* I have assumed that total disbursements is a better measure of project funding than total commitments.
* Because there are too many distinct 'Sector Names' in the original data set, I have assumed that I can accurately group projects together by searching the 'Sector Name' for keywords (i.e. any project containing the word 'Health' in ad_sector_names has been assigned to Health funding, likewise for Agriculture, Government and Education)
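
The first and third assumptions above can be sketched roughly as follows. This is a minimal illustration, not the code actually used to build `projects.csv`: the second project row, the `|` separator in its sector field, and the helper name `spread_by_year` are all invented for the example.

```python
import pandas as pd

# Hypothetical miniature of the original projects table. The first row uses
# values from the sample output; the second row is invented for illustration.
projects = pd.DataFrame({
    "project_id": ["SL/000021", "EXAMPLE/1"],
    "ad_sector_names": ["Health", "Agriculture|Education"],
    "transactions_start_year": [2005, 2010],
    "transactions_end_year": [2013, 2011],
    "total_disbursements": [36149108.0, 1000.0],
})

SECTORS = ["Agriculture", "Government", "Education", "Health"]


def spread_by_year(row):
    """Split a project's total disbursements equally over its active years (inclusive)."""
    years = list(range(row["transactions_start_year"], row["transactions_end_year"] + 1))
    per_year = row["total_disbursements"] / len(years)
    return pd.DataFrame({
        "Year": years,
        "amount": per_year,
        "ad_sector_names": row["ad_sector_names"],
    })


# One row per project per active year.
yearly = pd.concat([spread_by_year(r) for _, r in projects.iterrows()],
                   ignore_index=True)

# Keyword search on the sector name: a project counts towards every sector
# whose keyword appears in its ad_sector_names value.
for sector in SECTORS:
    matches = yearly["ad_sector_names"].str.contains(sector, case=False, na=False)
    yearly[sector] = yearly["amount"].where(matches, 0.0)

by_year = yearly.groupby("Year")[SECTORS].sum()
print(by_year)
```

Note that with keyword matching, a project whose sector name mentions two keywords contributes its full yearly amount to both sectors, so sector totals may double count such projects.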

Data sourced from https://www.aiddata.org/data/sierra-leone-aims-geocoded-research-release-level-1-v1-0
"Summary
This is all geocoded projects from the Development Assistance Database (DAD) Aid Information Management System (AIMS) managed for Sierra Leone. This dataset tracks 856 projects across 2314 locations, 3,340,138,954.00 USD in geocoded commitments, and 2,785,175,978.00 USD in geocoded disbursements between 1992 and 2014."

In [5]:
import pandas as pd
df = pd.read_csv("projects_original.csv")
df.head(1)

| Unnamed: 0 | project_id | is_geocoded | project_title | start_actual_isodate | start_actual_type | end_actual_isodate | end_actual_type | donors | donors_iso3 | recipients | recipients_iso3 | ad_sector_codes | ad_sector_names | ad_purpose_codes | ad_purpose_names | status | transactions_start_year | transactions_end_year | total_commitments | total_disbursements |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SL/000021 | 1 | Strengthening District Health Services Project | 2005-10-17 | start-actual | 2013-12-31 | end-actual | African Development Bank (AfDB) | DAC | Sierra Leone | SLE | 120 | Health | 12005 | Health, purpose unspecified | Completion | 2005 | 2013 | 36149108.0 | 36149108.0 |


The modified projects dataset shows the total amount of disbursements on projects in four given sectors, in a given year between 1992 and 2014 inclusive.


In [7]:
df = pd.read_csv("projects.csv")
df.head(1)

| Unnamed: 0 | Year | Agriculture | Government | Education | Health |
|---|---|---|---|---|---|
| 0 | 1992 | 0.0 | 0.0 | 51750.07143 | 0.0 |
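
With one row per year and one column per sector, questions such as each sector's share of disbursements and the year with the largest total reduce to column and row sums. The sketch below assumes the layout shown above; only the 1992 row comes from the real output, and the 1993 row is invented so the example has more than one year.

```python
import pandas as pd

# Same layout as projects.csv. The 1992 row is from the output above;
# the 1993 row is invented for illustration.
df = pd.DataFrame({
    "Year": [1992, 1993],
    "Agriculture": [0.0, 100.0],
    "Government": [0.0, 50.0],
    "Education": [51750.07143, 25.0],
    "Health": [0.0, 25.0],
})
sectors = ["Agriculture", "Government", "Education", "Health"]

# Share of all disbursements that went to each sector over the whole period.
sector_totals = df[sectors].sum()
sector_share = sector_totals / sector_totals.sum()

# Total disbursed per year across the four sectors, and the largest year.
yearly_total = df.set_index("Year")[sectors].sum(axis=1)
top_year = yearly_total.idxmax()

print(sector_share.round(4))
print("Largest year:", top_year)
```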


**About the HDI data**
* I have assumed the HDI 'Education Index' is an appropriate measure of Sierra Leone's performance in education, and that this may have some relationship with Education funding over the same time period
* I have assumed the HDI 'Agri Employment' (the percentage of the Sierra Leone workforce employed in agriculture) will have some relationship with Agricultural funding over the same time period.
* Agri Employment data prior to 2010 was only available for the years 1991, 1995, 2000 and 2005. To analyse this indicator over the same time period as the others, I filled each missing year with the previous year's value plus the average change per year between the two surrounding observations. E.g. between 2005 and 2010 Agri Employment changed from 65.9% to 63.7%, so I have assumed that in each of the 4 years in between it decreased by (63.7 - 65.9) / 4 = -0.55 per year, giving 2006 = 65.9 - 0.55 = 65.35, and so on.
* I have assumed the HDI 'Income Index' is an appropriate measure of the Sierra Leone government's performance, and that it will have some relationship with government/bureaucratic aid funding over the same time period.
* I have assumed the HDI 'Life Expectancy Index' is an appropriate measure of Sierra Leone's health outcomes and performance, and that it will have some relationship with health aid funding over the same time period. 
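
The gap-filling rule for Agri Employment can be sketched as below, using the 2005 and 2010 figures quoted above. The function name `fill_as_described` is hypothetical, and pandas' built-in linear interpolation is shown alongside purely for comparison; it is not the method described in the assumptions.

```python
import pandas as pd

# Observed values for 2005 and 2010; the four years in between are missing.
agri = pd.Series(
    [65.9, None, None, None, None, 63.7],
    index=range(2005, 2011),
    name="Agri Employment",
    dtype="float64",
)


def fill_as_described(series):
    """Fill each gap by stepping from the last observed value by
    (next observed - last observed) / (number of missing years)."""
    filled = series.copy()
    known = list(series.dropna().items())
    for (y0, v0), (y1, v1) in zip(known, known[1:]):
        missing = list(range(y0 + 1, y1))
        if not missing:
            continue
        step = (v1 - v0) / len(missing)
        for i, year in enumerate(missing, start=1):
            filled.loc[year] = v0 + i * step
    return filled


described = fill_as_described(agri)

# For comparison only: linear interpolation spreads the same change over the
# five year-to-year intervals between 2005 and 2010.
linear = agri.interpolate(method="linear")

print(pd.DataFrame({"described": described, "linear": linear}).round(2))
```

Under the rule as worded, the filled 2009 value already equals the 2010 observation, whereas linear interpolation uses a smaller step of -0.44 per year and reaches 63.7 only in 2010.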

Data sourced from http://hdr.undp.org/en/data, pre-filtered for Sierra Leone, downloaded to separate CSVs, and then combined into a single table with one row per year.

In [8]:
hdi_df_o = pd.read_csv('hdi.csv')
hdi_df_o.head(1)

| Unnamed: 0 | Year | Education Index | Agri Employment | Income Index | Life Expectancy Index |
|---|---|---|---|---|---|
| 0 | 1992 | 0.194 | 65.3 | 0.382 | 0.25 |


**About the machine learning**
 * I have assumed that time series regression is an appropriate ML/statistical model for analysing the proposed relationships.
 * I have assumed that all numbers involved are real, continuous floating-point numbers.

## Goals

* Get project and HDI data into final format for analysis
* Perform EDA on the separate datasets
* Combine the datasets
* Perform time series regressions between the predictor variables (Aid increases/decreases across time) and the target variable (HDI performance increases/decreases across time)
* Present results and conclusions
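
As a rough sketch of the "combine" step and of expressing both series as increases/decreases across time, the two tables can be joined on `Year` and differenced. The frames below are tiny stand-ins in the layout of `projects.csv` and `hdi.csv`; apart from the 1992 values, the numbers are invented.

```python
import pandas as pd

# Stand-ins for projects.csv and hdi.csv (one sector / one indicator only).
# Values after 1992 are invented for illustration.
projects = pd.DataFrame({
    "Year": [1992, 1993, 1994],
    "Education": [51750.07143, 60000.0, 55000.0],
})
hdi = pd.DataFrame({
    "Year": [1992, 1993, 1994],
    "Education Index": [0.194, 0.196, 0.199],
})

# Join on Year; validate raises if either table has duplicate years.
combined = (
    projects.merge(hdi, on="Year", how="inner", validate="one_to_one")
    .set_index("Year")
)

# Year-on-year changes; the first year has no predecessor and is dropped.
changes = combined.diff().dropna()

print(combined)
print(changes)
```

Working with year-on-year changes rather than levels is one common way to reduce the influence of a shared upward trend, at the cost of losing one observation.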

## Success Metrics

* If the EDA and visualisations are clear, the markers of the project will understand how the data has been changing over time in a concise and accurate way.
* If the models fit well, with a high R-squared value and coefficients with low p-values, this will indicate some kind of statistically significant relationship between the predictor and the target

* If I am able to answer the questions proposed in my lightning talk:
    1. What proportion of disbursed AID dollars were given to various sectors of the country overall?
    2. Which years contain the largest amount of disbursed AID dollars?
    3. What are the trends over time, by sector?
    4. How has Sierra Leone been tracking on various human development indicators throughout those years, and what are the trends over time, by area?

* If I am able to address the confounders mentioned in my lightning talk:

    1. Confounder 1: Time Lag. There may be a negative relationship due to the time lag involved, i.e. more project money may go into an area as it’s getting ‘worse’ in order to try to improve it, making it look like there is a negative relationship between the money and the indicator. Alternatively, if there’s not as much time lag, the money may have a positive relationship with indicators, showing that it is helping immediately.
    2. Confounder 2: Correlation vs Causation. If, for example, Sierra Leone's HDI indicators improved over the time period, and the project funding also increased over the time period, it may not mean that the project funding caused the HDI improvement. Both aspects could have been influenced by a third factor, such as improvements in the global economy and technology.

* If my conclusions logically follow from my analysis, and accurately describe whether I was able to find any meaningful relationships and if there is any further analysis that may be required

* If I'm unable to determine any significant relationships, the project can still be considered a success as I applied data science skills to a problem I was interested in, and gained practice with data manipulation and Python programming.
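
The R-squared and p-value criterion, together with the time-lag confounder, could be explored with something like the sketch below. It fits a simple linear regression of an indicator on funding from `lag` years earlier using `scipy.stats.linregress`. The helper `lagged_fits` and the synthetic series are assumptions for illustration, and a simple per-lag regression is a stand-in rather than a full time series model.

```python
import pandas as pd
from scipy.stats import linregress


def lagged_fits(funding, indicator, max_lag=3):
    """Regress the indicator on funding from `lag` years earlier, for each lag."""
    rows = []
    for lag in range(max_lag + 1):
        # shift(lag) moves funding forward in time, so year t holds funding from t - lag.
        paired = pd.concat(
            [funding.shift(lag), indicator], axis=1, keys=["x", "y"]
        ).dropna()
        fit = linregress(paired["x"], paired["y"])
        rows.append({
            "lag": lag,
            "slope": fit.slope,
            "r_squared": fit.rvalue ** 2,
            "p_value": fit.pvalue,
            "n": len(paired),
        })
    return pd.DataFrame(rows).set_index("lag")


# Synthetic data: the indicator responds to funding exactly two years later.
years = range(2000, 2012)
funding = pd.Series([1, 3, 2, 5, 4, 6, 8, 7, 9, 11, 10, 12],
                    index=years, dtype="float64")
indicator = 0.2 + 0.01 * funding.shift(2)

results = lagged_fits(funding, indicator)
print(results)
```

In this synthetic case the lag-2 fit stands out, but the other lags also show strong fits simply because the funding series trends upward. That is the correlation versus causation concern in practice: with roughly two decades of annual data, a high R-squared and low p-value on trending series are suggestive at best.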