# Decision Tree Regression on the World Population
© Explore Data Science Academy

In this test we'll train a simple decision tree model using the world population data from the Analyse Practical Exam. 

<img src="https://github.com/Explore-AI/Pictures/blob/master/population.png?raw=true">

## Honour Code

I **YOUR NAME, YOUR SURNAME**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


### Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

In [2]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')

In [3]:
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


### Question 1: Population Growth

The world population data spans from 1960 to 2017. We'd like to build a predictive model that gives us the best guess at what the population growth rate in a given year might be. We will calculate the population growth rate as follows:

$$
Growth\_rate = \frac{current\_year\_population - previous\_year\_population}{previous\_year\_population}
$$

As such, we can only calculate the growth rate for the year 1961 onwards.
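
As a quick check of the formula, the sketch below applies it to the first three ABW populations (1960 to 1962) shown in the preview above. It uses plain numpy rather than the full dataframe, and the variable names are illustrative only.

```python
import numpy as np

# ABW populations for 1960, 1961 and 1962, copied from the preview above
population = np.array([54211.0, 55438.0, 56225.0])

# (current - previous) / previous for each consecutive pair of years
growth = np.diff(population) / population[:-1]

# One fewer value than the input: 1960 has no previous year
print(growth)  # roughly 0.0226 for 1961 and 0.0142 for 1962
```

The result has one element fewer than the input, which is why the growth rate series starts in 1961.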

Write a function that takes the `population_df` and a `country_code` as input and computes the population growth rate for the given country starting from the year 1961. This function must return a 2-d numpy array that contains each year and the corresponding growth rate for the country.

_**Function Specifications:**_
* Should take a `population_df` and `country_code` string as input and return a numpy `array` as output.
* The array should only have two columns containing the year and the population growth rate, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
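
One possible way to build an array of shape `(?, 2)` is to stack a column of years next to a column of growth rates. The sketch below uses made-up values, so the names `years` and `rates` are assumptions for illustration and not part of the exam data.

```python
import numpy as np

# Hypothetical inputs: three years and made-up growth rates
years = np.array([1961, 1962, 1963])
rates = np.array([0.02, 0.01, 0.03])

# Place the two 1-d arrays side by side as columns
result = np.column_stack((years, rates))

print(result.shape)  # (3, 2)
print(result.dtype)  # float64
```

Because a numpy array holds a single dtype, the integer years are promoted to floats, which is why the years appear as `1.96100000e+03` in the expected output.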


In [4]:
### START FUNCTION
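def get_population_growth_rate_by_country_year(df, country_code):
    # Population of the chosen country as a Series indexed by year
    population = df.T[country_code]
    # Year-on-year growth: (current - previous) / previous
    growth = population.pct_change()
    # Drop 1960, which has no previous year to compare against
    growth = growth.iloc[1:]
    # Pair each year with its unrounded growth rate in a float array of shape (N, 2)
    years = growth.index.astype(int)
    return np.column_stack((years, growth.to_numpy(dtype=float)))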
    

### END FUNCTION

In [12]:
get_population_growth_rate_by_country_year(population_df,'ABW')

array([[ 1.96100000e+03,  2.26337828e-02],
       [ 1.96200000e+03,  1.41960388e-02],
       [ 1.96300000e+03,  8.35927079e-03],
       [ 1.96400000e+03,  5.94408678e-03],
       [ 1.96500000e+03,  5.75115725e-03],
       [ 1.96600000e+03,  6.18898187e-03],
       [ 1.96700000e+03,  5.89101620e-03],
       [ 1.96800000e+03,  5.70148997e-03],
       [ 1.96900000e+03,  5.82331381e-03],
       [ 1.97000000e+03,  5.73851446e-03],
       [ 1.97100000e+03,  6.38301475e-03],
       [ 1.97200000e+03,  6.72947510e-03],
       [ 1.97300000e+03,  6.73462567e-03],
       [ 1.97400000e+03,  4.73084010e-03],
       [ 1.97500000e+03,  2.13124504e-03],
       [ 1.97600000e+03, -1.17051618e-03],
       [ 1.97700000e+03, -3.63120193e-03],
       [ 1.97800000e+03, -4.35675711e-03],
       [ 1.97900000e+03, -2.04648686e-03],
       [ 1.98000000e+03,  1.93397799e-03],
       [ 1.98100000e+03,  7.83746006e-03],
       [ 1.98200000e+03,  1.28452788e-02],
       [ 1.98300000e+03,  1.39538675e-02],
       ...

_**Expected Outputs:**_
```python
get_population_growth_rate_by_country_year(population_df,'ABW')
```
> ```
 array([[ 1.96100000e+03,  2.26337828e-02],
        [ 1.96200000e+03,  1.41960388e-02],
        [ 1.96300000e+03,  8.35927079e-03],
        [ 1.96400000e+03,  5.94408678e-03],
         ...
        [ 2.01400000e+03,  5.89221510e-03],
        [ 2.01500000e+03,  5.26036900e-03],
        [ 2.01600000e+03,  4.60988490e-03],
        [ 2.01700000e+03,  4.21667207e-03]])
```




### Question 2: Even-Odd Train-Test Split

Now that we have our data, we need to split it into a training set, which the model learns from, and a test set, which we make predictions on. Here the training set consists of growth rates for even years and the test set consists of growth rates for odd years. We also need to split our data into the predictive features (denoted `X`) and the response (denoted `y`).

Write a function that takes a 2-d numpy array as input and returns four variables in the form `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features / response of the training set and `(X_test, y_test)` are the features / response of the testing set. The training and testing data consist of even and odd years respectively.

_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* `(X_train, y_train)` should consist of data from even years and `(X_test, y_test)` should consist of data from odd years.
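
The even/odd split can also be expressed with a boolean mask instead of a loop. This is a minimal sketch on a toy array with made-up growth rates; `is_even` is an illustrative name, not something the exam requires.

```python
import numpy as np

# Toy array in the (year, growth rate) layout; the rates are made up
arr = np.array([[1961, 0.02],
                [1962, 0.01],
                [1963, 0.03],
                [1964, 0.04]])

# True where the year in column 0 is even
is_even = arr[:, 0] % 2 == 0

# Even years form the training set, odd years the test set
X_train, y_train = arr[is_even, 0], arr[is_even, 1]
X_test, y_test = arr[~is_even, 0], arr[~is_even, 1]

print(X_train, y_train)
print(X_test, y_test)
```

Both approaches give 1-d arrays for the features and the response.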

In [5]:
### START FUNCTION
def feature_response_split(arr):
    X_train, y_train = [], []
    X_test, y_test = [], []

    for year, growth_rate in arr:
        if year % 2 == 0:
            # Even years go to the training set
            X_train.append(year)
            y_train.append(growth_rate)
        else:
            # Odd years go to the test set
            X_test.append(year)
            y_test.append(growth_rate)

    return (np.array(X_train), np.array(y_train)), (np.array(X_test), np.array(y_test))

### END FUNCTION

In [14]:
data = get_population_growth_rate_by_country_year(population_df,'ABW');
(X_train, y_train), (X_test, y_test) = feature_response_split(data)

_**Expected Outputs:**_
```python
data = get_population_growth_rate_by_country_year(population_df,'ABW')
feature_response_split(data)
```
> ```
X_train == array([1962., 1964., 1966., 1968., 1970., 1972., 1974., 1976., 1978.,
       1980., 1982., 1984., 1986., 1988., 1990., 1992., 1994., 1996.,
       1998., 2000., 2002., 2004., 2006., 2008., 2010., 2012., 2014.,
       2016.])
```

> ```
y_train ==  array([ 0.01419604,  0.00594409,  0.00618898,  0.00570149,  0.00573851,
        0.00672948,  0.00473084, -0.00117052, -0.00435676,  0.00193398,
        0.01284528,  0.01020884, -0.00606099, -0.01219414,  0.01830187,
        0.05590975,  0.05787267,  0.03580499,  0.02136897,  0.02076288,
        0.02254085,  0.01772885,  0.00800752,  0.00131397,  0.00212906,
        0.00513459,  0.00589222,  0.00460988])
```

> ```
X_test == array([1961., 1963., 1965., 1967., 1969., 1971., 1973., 1975., 1977.,
       1979., 1981., 1983., 1985., 1987., 1989., 1991., 1993., 1995.,
       1997., 1999., 2001., 2003., 2005., 2007., 2009., 2011., 2013.,
       2015., 2017.])
```

> ```
y_test == array([ 0.02263378,  0.00835927,  0.00575116,  0.00589102,  0.00582331,
        0.00638301,  0.00673463,  0.00213125, -0.0036312 , -0.00204649,
        0.00783746,  0.01395387,  0.00302374, -0.01294617, -0.0007695 ,
        0.03979147,  0.0625632 ,  0.04724902,  0.02705529,  0.01979903,
        0.02250889,  0.02131758,  0.01310552,  0.00384798,  0.00098665,
        0.00377696,  0.00594675,  0.00526037,  0.00421667])      
 ```

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `DecisionTreeRegressor` class. We'll write a function that will take as input the features and response variables that we created in the last question, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)` as well as a `MaxDepth` int corresponding to the max_depth hyperparameter in decision trees.
* Should return an sklearn `DecisionTreeRegressor` model.
* The returned model should be fitted to the data.
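
To get a feel for what `max_depth` controls, the sketch below fits two trees on synthetic data (not the population data) and compares their sizes. A binary tree of depth `d` has at most `2**d` leaves, so it can produce at most `2**d` distinct predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 20 "years" and a smooth, non-constant response
X = np.arange(2000, 2020).reshape(-1, 1)
y = np.sin(np.arange(20) / 3.0)

shallow = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X, y)
deeper = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X, y)

# Number of leaves is bounded by 2 ** max_depth
print(shallow.get_n_leaves(), deeper.get_n_leaves())

# Each leaf predicts one constant value, so predictions form a step function
print(np.unique(deeper.predict(X)).size)
```

A small `max_depth` gives a coarse step function, while a larger value lets the tree follow the training data more closely.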

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
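
As an illustration of the hint, the sketch below shows how `.reshape(-1, 1)` turns a 1-d array of years into a single-column 2-d array, which is the `(n_samples, n_features)` layout sklearn estimators expect for features. The sample years are arbitrary.

```python
import numpy as np

years = np.array([1962.0, 1964.0, 1966.0])
print(years.shape)   # (3,)

# -1 lets numpy infer the number of rows; 1 fixes a single column
column = years.reshape(-1, 1)
print(column.shape)  # (3, 1)
```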


In [6]:
### START FUNCTION
def train_model(X_train, y_train, MaxDepth):
    # sklearn expects a 2-d feature matrix of shape (n_samples, n_features)
    X_train = X_train.reshape(-1, 1)
    regr_tree = DecisionTreeRegressor(max_depth=MaxDepth, random_state=42)
    # fit returns the fitted estimator itself
    return regr_tree.fit(X_train, y_train)

### END FUNCTION

In [81]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train,3).predict([[2017]])

array([0.00451454])

_**Expected Outputs:**_
```python
train_model(X_train, y_train,3).predict([[2017]]) == array([0.00451454])
```

### Question 4

We would now like to evaluate the model on the test data produced in Question 2. We will measure its performance with the Root Mean Squared Logarithmic Error (RMSLE), which is given by:

$$
RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^N [\log(1+p_i) - \log(1+y_i)]^2}
$$

where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
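
The formula can be sketched directly in numpy. The helper name `rmsle` and the sample values below are assumptions for illustration. `np.log1p(x)` computes `log(1 + x)` and is defined for any value greater than -1, so it copes with the small negative growth rates that appear in the data.

```python
import numpy as np

def rmsle(predictions, targets):
    """Root Mean Squared Logarithmic Error for values greater than -1."""
    log_diff = np.log1p(predictions) - np.log1p(targets)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up growth rates, including a negative one, purely for illustration
p = np.array([0.010, 0.020, -0.005])
y = np.array([0.012, 0.018, -0.004])

print(rmsle(p, y))
print(rmsle(y, y))  # identical inputs give an error of 0.0
```

For growth rates close to zero, `log(1 + x)` is approximately `x`, so the RMSLE will be close to the ordinary root mean squared error.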

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables from Question 2. 
* Should return the RMSLE of the values predicted from `X_test` compared with the values of `y_test`.
* The output should be a `float` rounded to 3 decimal places.


In [7]:
### START FUNCTION
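def test_model(model, y_test, X_test):
    # Predict growth rates for the test years (sklearn needs a 2-d feature matrix)
    y_pred = model.predict(X_test.reshape(-1, 1))
    # RMSLE: np.log1p(x) computes log(1 + x) and accepts the small negative growth rates
    squared_log_errors = (np.log1p(y_pred) - np.log1p(y_test)) ** 2
    return round(float(np.sqrt(np.mean(squared_log_errors))), 3)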

### END FUNCTION

In [11]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)

lm = train_model(X_train, y_train,3)

test_model(lm, y_test, X_test) == 0.008

True

In [79]:
test_model(lm, y_test, X_test)

0.008

_**Expected Outputs:**_
```python
test_model(lm, y_test, X_test) == 0.008
```