# Regression 5: Random Forest Regression on the World Population

For the final test of the week, we'll learn how decision trees, which are simple learners on their own, can be combined to create an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning) model known as a Random Forest. As in our previous coding challenges, we train this new model using the world population data from the Analyse Supplementary Exam.
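To make the ensemble idea concrete, here is a minimal sketch on made-up data (the `X_toy` and `y_toy` arrays are illustrative assumptions, not the population data). It checks that, for regression, a scikit-learn forest's prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: a simple quadratic, purely for illustration
X_toy = np.arange(20).reshape(-1, 1)
y_toy = (X_toy.ravel() ** 2).astype(float)

forest = RandomForestRegressor(n_estimators=5, random_state=42).fit(X_toy, y_toy)

X_new = np.array([[7.5]])
# Each fitted decision tree is stored in forest.estimators_
tree_preds = np.array([tree.predict(X_new)[0] for tree in forest.estimators_])
forest_pred = forest.predict(X_new)[0]

# The forest's output should match the mean of the individual tree outputs
print(tree_preds.mean(), forest_pred)
```

Averaging many trees, each trained on a different bootstrap sample, tends to smooth out the errors any single tree makes.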

### Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor

In [2]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')
meta_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/metadata.csv', index_col='Country Code')

In [3]:
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


In [4]:
meta_df.head()

Country Code,Region,Income Group,Special Notes
ABW,Latin America & Caribbean,High income,Mining is included in agriculture\r\r\r\nElect...
AFG,South Asia,Low income,Fiscal year end: March 20; reporting period fo...
AGO,Sub-Saharan Africa,Lower middle income,
ALB,Europe & Central Asia,Upper middle income,
AND,Europe & Central Asia,High income,WB-3 code changed from ADO to AND to align wit...


### Question 1

As we've seen previously, the world population data spans from 1960 to 2017. We'd like to build a predictive model that can give us the best guess at what the world population in a given year was. However, as a slight twist this time, we want to compute this estimate for only _countries within a given income group_. To do this, similar to our previous coding challenges, we need to partition our data such that we have testing data which is reserved for our model's evaluation.  

First, however, we need to format our data so that sklearn's `RandomForestRegressor` class can train on it. To do this, we will write a function that takes an income group as input and returns a 2-d numpy array containing the year and the measured population.

_**Function Specifications:**_
* Should take a `str` argument, called `income_group` as input and return a numpy `array` type as output.
* Set the default argument of `income_group` to equal `'Low income'`.
* If the specified value of `income_group` does not exist, the function must raise a `ValueError`.
* The array should only have two columns containing the year and the population, in other words, it should have a shape `(?, 2)` where `?` is the length of the data.
* The values within the array should be of type `np.int64`. 
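The specifications above can be sketched on a miniature, made-up dataset. The names `toy_pop`, `toy_meta` and `toy_year_pop` are hypothetical and only illustrate the expected behaviour (default argument, `ValueError`, shape and dtype); this is one possible approach, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the population and metadata tables
toy_pop = pd.DataFrame(
    {'1960': [10.0, 20.0, 5.0], '1961': [11.0, 22.0, 6.0]},
    index=pd.Index(['AAA', 'BBB', 'CCC'], name='Country Code'),
)
toy_meta = pd.DataFrame(
    {'Income Group': ['Low income', 'Low income', 'High income']},
    index=pd.Index(['AAA', 'BBB', 'CCC'], name='Country Code'),
)

def toy_year_pop(income_group='Low income'):
    # Reject income groups that do not appear in the metadata
    if income_group not in set(toy_meta['Income Group']):
        raise ValueError(f'Unknown income group: {income_group!r}')
    # Select the countries in the group and total their populations per year
    codes = toy_meta.index[toy_meta['Income Group'] == income_group]
    totals = toy_pop.loc[codes].sum(axis=0)
    # Column labels are strings, so convert them to integer years
    years = np.array([int(label) for label in totals.index])
    return np.column_stack([years, totals.to_numpy()]).astype(np.int64)

result = toy_year_pop()
print(result.shape, result.dtype)  # (2, 2) int64
```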

_**Further Reading:**_

Data types are associated with memory allocation. As such, your choice of data type affects the range and precision of the values your program can store. For example, the `np.int32` data type in numpy can only store values between -2147483648 and 2147483647, and values outside this range may overflow or raise errors. To avoid this, we can use data types with larger memory capacity, e.g. `np.int64`.

https://docs.scipy.org/doc/numpy/user/basics.types.html
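As a quick check of the ranges involved, the sketch below uses `np.iinfo` to inspect the limits of 32-bit and 64-bit integers. The `world_pop` value is a hypothetical round number chosen only because it exceeds the 32-bit maximum.

```python
import numpy as np

# Smallest and largest values a 32-bit signed integer can hold
print(np.iinfo(np.int32).min, np.iinfo(np.int32).max)  # -2147483648 2147483647

# A 64-bit signed integer has a far larger range
print(np.iinfo(np.int64).max)  # 9223372036854775807

# Hypothetical population total that does not fit in 32 bits
world_pop = 7_500_000_000
fits_int32 = world_pop <= np.iinfo(np.int32).max
as_int64 = np.int64(world_pop)
print(fits_int32, as_int64)  # False 7500000000
```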

In [29]:
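def get_year_pop_by_income(income_group='Low income'):
    # Reject income groups that do not appear in the metadata
    if income_group not in meta_df['Income Group'].unique():
        raise ValueError(f'Unknown income group: {income_group!r}')
    # Attach the metadata to each country and keep only the requested group
    df = population_df.join(meta_df, how='inner')
    df = df[df['Income Group'] == income_group]
    df = df.drop(['Region', 'Income Group', 'Special Notes'], axis=1)
    # Sum the population across countries for each year, giving rows of (year, population)
    totals = df.melt().groupby('variable').sum().reset_index()
    return totals.to_numpy().astype(np.int64)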

In [30]:
get_year_pop_by_income('High income')

array([[      1960,  769889923],
       [      1961,  781225329],
       [      1962,  791207437],
       [      1963,  801108277],
       [      1964,  810900987],
       [      1965,  820309686],
       [      1966,  829088382],
       [      1967,  837479954],
       [      1968,  844905494],
       [      1969,  854059674],
       [      1970,  862276721],
       [      1971,  871169187],
       [      1972,  880246152],
       [      1973,  888486025],
       [      1974,  897803169],
       [      1975,  906573084],
       [      1976,  913843314],
       [      1977,  921330504],
       [      1978,  928906293],
       [      1979,  936836246],
       [      1980,  944587066],
       [      1981,  952368316],
       [      1982,  959759971],
       [      1983,  966754949],
       [      1984,  973423742],
       [      1985,  980143630],
       [      1986,  987194728],
       [      1987,  994242786],
       [      1988, 1001421456],
       [      1989, 1009036892],
       ...

_**Expected Outputs:**_
```python
get_year_pop_by_income('High income')
```
> ```
array([[      1960,  769889923],
       [      1961,  781225329],
       [      1962,  791207437],
       [      1963,  801108277],
       ...
       [      2015, 1211252041],
       [      2016, 1218629612],
       [      2017, 1225514228]])
```




### Question 2

Now that we have our data, we need to split it into a set of variables we will train on and a set of variables we will make our predictions on.

Unlike in the previous coding challenges, a friend of ours has pointed out that sklearn has its own built-in functionality for creating training and testing sets. Using the `train_test_split` [method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), we can easily shuffle the data and randomly choose a subset of it as the test set.

Using this knowledge, write a function which uses sklearn's `train_test_split` [method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) internally, and that takes a 2-d numpy array as input and returns four variables in the form of `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features / response of the training set, and `(X_test, y_test)` are the features / response of the testing set.

**Important Note:** Because sklearn's `train_test_split` method shuffles the data randomly before splitting it, you will need to fix the value of the `random_state` argument in order to get repeatable and predictable results.


_**Function Specifications:**_
* Should take a 2-d numpy `array` as input.
* Should use sklearn's `train_test_split` [method](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
* Set `random_state` to equal `42` for this internal method.  
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
* `(X_test, y_test)` should contain 1% of the input array. Both should be in the form of an `array`, not a single value.
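The effect of `test_size` and `random_state` can be seen on a small stand-in array. In this sketch, `toy` is a hypothetical 58-row array shaped like the (year, population) data, with made-up values in the second column. With 58 rows, a 1% test set rounds up to a single sample, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the (year, population) array: 58 rows, 2 columns
toy = np.column_stack([np.arange(1960, 2018), np.arange(58) * 1000])
X, y = toy[:, 0], toy[:, 1]

# Same arguments, called twice, to show that the split is repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.01, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.01, random_state=42)

print(X_tr.shape, X_te.shape)          # (57,) (1,)
print(np.array_equal(X_te, X_te2))     # True
```

Note that the test features come back as a length-1 array rather than a scalar, which matches the last specification above.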


In [31]:
from sklearn.model_selection import train_test_split

def sklearn_feature_response_split(arr):
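    # Column 0 holds the years (features); column 1 holds the populations (response)
    X = arr[:, 0]
    y = arr[:, 1]
    # Hold out 1% of the rows as the test set; random_state fixes the shuffle
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
    return (X_train, y_train), (X_test, y_test)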

In [32]:
data = get_year_pop_by_income('High income')
sklearn_feature_response_split(data)

((array([1965, 1994, 1973, 2004, 2012, 1997, 1985, 2006, 1972, 2008, 1963,
         1996, 1991, 1968, 1977, 1966, 1964, 2001, 1979, 1990, 2009, 2010,
         2014, 1975, 1969, 1987, 1986, 1976, 1984, 1993, 2015, 2000, 1971,
         1992, 2016, 2003, 1989, 2013, 1961, 1981, 1962, 2005, 1999, 1995,
         1983, 2007, 1970, 1982, 1978, 2017, 1980, 1967, 2002, 1974, 1988,
         2011, 1998]),
  array([ 820309686, 1048121445,  888486025, 1123325037, 1188796100,
         1071969568,  980143630, 1140084827,  880246152, 1158965286,
          801108277, 1064630661, 1025345408,  844905494,  921330504,
          829088382,  810900987, 1100293969,  936836246, 1017092667,
         1167712409, 1175649232, 1203819897,  906573084,  854059674,
          994242786,  987194728,  913843314,  973423742, 1040349480,
         1211252041, 1092825678,  871169187, 1031949811, 1218629612,
         1115390519, 1009036892, 1196212921,  781225329,  952368316,
          791207437, 1131426281, 1085992668, ...

_**Expected Outputs:**_
```python
data = get_year_pop_by_income('High income')
sklearn_feature_response_split(data)
```
> ```
((array([1965, 1994, 1973, 2004, 2012, 1997, 1985, 2006, 1972, 2008, 1963,
         1996, 1991, 1968, 1977, 1966, 1964, 2001, 1979, 1990, 2009, 2010,
         2014, 1975, 1969, 1987, 1986, 1976, 1984, 1993, 2015, 2000, 1971,
         1992, 2016, 2003, 1989, 2013, 1961, 1981, 1962, 2005, 1999, 1995,
         1983, 2007, 1970, 1982, 1978, 2017, 1980, 1967, 2002, 1974, 1988,
         2011, 1998]),
  array([ 820309686, 1048121445,  888486025, 1123325037, 1188796100,
         1071969568,  980143630, 1140084827,  880246152, 1158965286,
          801108277, 1064630661, 1025345408,  844905494,  921330504,
          829088382,  810900987, 1100293969,  936836246, 1017092667,
         1167712409, 1175649232, 1203819897,  906573084,  854059674,
          994242786,  987194728,  913843314,  973423742, 1040349480,
         1211252041, 1092825678,  871169187, 1031949811, 1218629612,
         1115390519, 1009036892, 1196212921,  781225329,  952368316,
          791207437, 1131426281, 1085992668, 1057290586,  966754949,
         1149238990,  862276721,  959759971,  928906293, 1225514228,
          944587066,  837479954, 1107836355,  897803169, 1001421456,
         1181451343, 1078927765])),
 (array([1960]), array([769889923])))
 ```

### Question 3

Now that we have formatted our data, we can fit a model using sklearn's `RandomForestRegressor` class. We'll write a function that takes as input the features and response variables that we created in the last question, and returns a trained model.

**Important Note:** Due to the random processes used within sklearn's `RandomForestRegressor` class (such as bootstrap sampling of the training data), you will need to fix the value of the `random_state` argument in order to get repeatable and predictable results.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* Should return an sklearn `RandomForestRegressor` model.
* Set the `random_state` argument of the model to equal `42`.
* The returned model should be fitted to the data.

_**Hint:**_
You may need to reshape the data within the function. You can use `.reshape(-1, 1)` to do this.
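The reshape in the hint turns a 1-d array of shape `(n,)` into a single-column 2-d array of shape `(n, 1)`, which is the layout sklearn estimators expect for features. A minimal sketch, using made-up values for `X_1d` and `y_toy`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical years and made-up responses, purely for illustration
X_1d = np.array([1960, 1961, 1962, 1963])
y_toy = np.array([10, 12, 15, 19])

# -1 lets numpy infer the number of rows; 1 fixes a single feature column
X_2d = X_1d.reshape(-1, 1)
print(X_1d.shape, X_2d.shape)  # (4,) (4, 1)

# Fitting works on the 2-d version; predictions also need 2-d input
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_2d, y_toy)
pred = model.predict(np.array([[1961]]))
print(pred.shape)  # (1,)
```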


In [39]:
def train_model(X_train, y_train):
    # sklearn expects a 2-d feature array, so reshape the 1-d years into a single column
    model = RandomForestRegressor(random_state=42).fit(X_train.reshape(-1, 1), y_train)
    return model

In [40]:
data = get_year_pop_by_income('High income')
(X_train, y_train), _ = sklearn_feature_response_split(data)

train_model(X_train, y_train).predict([[1960]])



array([7.86208256e+08])

_**Expected Outputs:**_
```python
train_model(X_train, y_train).predict([[1960]]) == array([7.86208256e+08])
```

### Question 4

We would now like to test on our testing data that we produced from Question 2. This test will give the Mean Absolute Error (MAE), which is given by:

$$
MAE = \frac{1}{N} \sum_{i=1}^N |p_i - y_i|
$$

where $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`, $y_i$ refers to the $i^{\rm th}$ value in `y_test`, and $N$ is the length of `y_test`.
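As a small worked example of the formula, the sketch below computes the MAE by hand for three made-up predictions and compares it against sklearn's `mean_absolute_error`. The arrays `p` and `y` are illustrative values only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up predictions and true values
p = np.array([3.0, 5.0, 2.0])
y = np.array([1.0, 5.0, 4.0])

# Absolute errors are 2, 0 and 2, so the MAE is 4 / 3
mae_manual = np.sum(np.abs(p - y)) / len(y)
mae_sklearn = mean_absolute_error(y, p)

print(round(mae_manual, 2))  # 1.33
```

Because the errors are taken in absolute value, over-predictions and under-predictions cannot cancel each other out.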

_**Function Specifications:**_
* Should take a trained model and two `arrays` as input. This will be the `X_test` and `y_test` variables from Question 2. 
* Should return the mean absolute error (MAE) between the values predicted from `X_test` and the values of `y_test`.
* The output should be a `float` rounded to 2 decimal places.

In [47]:
def test_model(model, X_test, y_test):
    # Predict on the test features, reshaped into a single column
    y_pred = model.predict(X_test.reshape(-1, 1))
    # Mean absolute error: the average of |prediction - actual|
    MAE = 1 / len(y_test) * np.sum(np.absolute(y_pred - y_test))
    return np.round(MAE, 2)

In [48]:
data = get_year_pop_by_income('High income')
(X_train, y_train), (X_test, y_test) = sklearn_feature_response_split(data)
lm = train_model(X_train, y_train)



In [49]:
test_model(lm, X_test, y_test)

16318333.2

_**Expected Outputs:**_
```python
test_model(lm, X_test, y_test) == 16318333.2
```