# Decision trees

This notebook walks through the fundamental concepts of decision trees by implementing a decision tree regression model and analysing its root mean squared logarithmic error (RMSLE).

We begin by importing the necessary packages for the challenges.

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

## The dataset

The dataset (by ExploreAI) contains population data for various countries over the years from 1960 to 2017. Each row corresponds to a specific country, identified by a country code, and each column represents a year. The values within the dataset represent the population count for each country in the corresponding year.

In [2]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col = 'Country Code')
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0
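To make the wide layout concrete, the sketch below rebuilds a two-country, two-year excerpt of the table by hand and looks up single values. The name `sample_df` is illustrative and not part of the notebook. The year labels are assumed to be strings, which matches how the later cells select columns such as `'1960'`.

```python
import pandas as pd

# A two-country, two-year excerpt of the table above, rebuilt by hand.
sample_df = pd.DataFrame(
    {"1960": [54211.0, 8996351.0], "1961": [55438.0, 9166764.0]},
    index=pd.Index(["ABW", "AFG"], name="Country Code"),
)

# Year labels are strings here, so the column key is quoted.
abw_1961 = sample_df.loc["ABW", "1961"]

# Selecting a whole row returns a Series indexed by year label.
abw_series = sample_df.loc["ABW"]
print(abw_1961, list(abw_series.index))
```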


## Analysis

### Population growth

The world population data spans from 1960 to 2017. We'd like to build a predictive model that estimates the population growth rate for a given year. We calculate the population growth rate as follows:

$$
Growth\_rate = \frac{current\_year\_population - previous\_year\_population}{previous\_year\_population}
$$

As such, we can only calculate the growth rate for the year 1961 onwards.
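As a quick check of the formula, the sketch below applies it to the first three ABW population counts from the table above. It uses `np.diff` for the numerator, which is one possible vectorised approach rather than the required one. The variable names are illustrative.

```python
import numpy as np

# First three ABW population counts (1960-1962), copied from the table above.
years = np.array([1960, 1961, 1962])
population = np.array([54211.0, 55438.0, 56225.0])

# (current - previous) / previous for every consecutive pair of years.
growth = np.diff(population) / population[:-1]

# Pair each rate with the later year of its pair; 1960 has no predecessor.
result = np.column_stack((years[1:], np.round(growth, 5)))
print(result)
```

The two rates come out as 0.02263 for 1961 and 0.0142 for 1962.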

Write a function that takes `population_df` and a `country_code` as input and computes the population growth rate for the given country starting from the year 1961. The function must return a 2-D numpy array that contains each year and the corresponding growth rate for the country.

In [44]:
def get_population_growth_rate_by_country_year(df,country_code):
    '''
    This function looks up a country by its code in the index column, and then calculates
    its year-on-year population growth rate.

    Args:
        df (DataFrame): Population counts indexed by country code, with one column per year
        country_code (str): Code of the country to select

    Returns:
        growth_rates (arr): A 2-D array containing years and the respective growth rate for the selected country.

    ***ALTERNATIVE CODE***
    country_data = df.loc[country_code, '1960':]
    pop_values = country_data.values
    pop_growth_rate = np.round(np.diff(pop_values)/pop_values[:-1], 5)
    years = np.array(country_data.index[1:], dtype = np.int64)
    result_array = np.column_stack((years, pop_growth_rate))
    return result_array
    '''
    
    growth_rates = []  # Collects [year, growth_rate] pairs while looping through the years
    for year in df.columns.drop('1960'):
        current = df.loc[str(country_code), str(year)]  # Current year's population
        previous = df.loc[str(country_code), str(int(year) - 1)]  # Previous year's population
        growth_rate = round((current - previous) / previous, 5)
        growth_rates.append([int(year), growth_rate])
    return np.array(growth_rates, dtype=np.float64)

Input:

In [45]:
get_population_growth_rate_by_country_year(population_df,'ABW')

array([[ 1.961e+03,  2.263e-02],
       [ 1.962e+03,  1.420e-02],
       [ 1.963e+03,  8.360e-03],
       [ 1.964e+03,  5.940e-03],
       [ 1.965e+03,  5.750e-03],
       [ 1.966e+03,  6.190e-03],
       [ 1.967e+03,  5.890e-03],
       [ 1.968e+03,  5.700e-03],
       [ 1.969e+03,  5.820e-03],
       [ 1.970e+03,  5.740e-03],
       [ 1.971e+03,  6.380e-03],
       [ 1.972e+03,  6.730e-03],
       [ 1.973e+03,  6.730e-03],
       [ 1.974e+03,  4.730e-03],
       [ 1.975e+03,  2.130e-03],
       [ 1.976e+03, -1.170e-03],
       [ 1.977e+03, -3.630e-03],
       [ 1.978e+03, -4.360e-03],
       [ 1.979e+03, -2.050e-03],
       [ 1.980e+03,  1.930e-03],
       [ 1.981e+03,  7.840e-03],
       [ 1.982e+03,  1.285e-02],
       [ 1.983e+03,  1.395e-02],
       [ 1.984e+03,  1.021e-02],
       [ 1.985e+03,  3.020e-03],
       [ 1.986e+03, -6.060e-03],
       [ 1.987e+03, -1.295e-02],
       [ 1.988e+03, -1.219e-02],
       [ 1.989e+03, -7.700e-04],
       [ 1.990e+03,  1.830e-02],
       ...

### Even-odd train-test split

Now that we have our data, we need to divide it into two sets: the data we will train on and the data we will test on. In this scenario, we split the data so that the **training set contains growth rates for even years and the test set contains growth rates for odd years**. We also need to separate the predictive feature (`X`, the year) from the response variable (`y`, the growth rate).

Write a function that takes a 2-D numpy array as input and returns two tuples of the form `(X_train, y_train), (X_test, y_test)`, where `(X_train, y_train)` are the features and response variables of the training set (even years) and `(X_test, y_test)` are the features and response variables of the testing set (odd years).

In [46]:
def feature_response_split(arr):
    '''
    This function takes the array from the previous function and splits it into training data (even years)
    and testing data (odd years), each separated into features and responses.

    Args:
        arr (arr): Array containing data for years and population growth rate

    Returns:
        (X_train, y_train) (tup): Tuple containing data for training
        (X_test, y_test) (tup): Tuple containing data for testing

    ***ALTERNATIVE CODE***
    even_set_years = [arr[i][0] for i in range(len(arr)) if int(arr[i][0]) % 2 == 0]
    odd_set_years = [arr[i][0] for i in range(len(arr)) if int(arr[i][0]) % 2 != 0]
    even_set_rate = [arr[i][1] for i in range(len(arr)) if int(arr[i][0]) % 2 == 0]
    odd_set_rate = [arr[i][1] for i in range(len(arr)) if int(arr[i][0]) % 2 != 0]
    
    (X_train, y_train) = (np.array(even_set_years), np.array(even_set_rate))
    (X_test, y_test) = (np.array(odd_set_years), np.array(odd_set_rate))
    return (X_train, y_train), (X_test, y_test)
    '''
    
    even_set_years = []
    odd_set_years = []
    even_set_rate = []
    odd_set_rate = []
    for i in range(len(arr)):
        if int(arr[i][0])%2 == 0:
            even_set_years.append(arr[i][0])
            even_set_rate.append(arr[i][1])
        else:
            odd_set_years.append(arr[i][0])
            odd_set_rate.append(arr[i][1])
    (X_train, y_train) = (np.array(even_set_years), np.array(even_set_rate))
    (X_test, y_test) = (np.array(odd_set_years), np.array(odd_set_rate))
    return (X_train, y_train), (X_test, y_test)

Input:

In [49]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
print(y_train)

[ 0.0142   0.00594  0.00619  0.0057   0.00574  0.00673  0.00473 -0.00117
 -0.00436  0.00193  0.01285  0.01021 -0.00606 -0.01219  0.0183   0.05591
  0.05787  0.0358   0.02137  0.02076  0.02254  0.01773  0.00801  0.00131
  0.00213  0.00513  0.00589  0.00461]
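
The loop in `feature_response_split` can also be expressed with a boolean mask. The sketch below is one possible alternative, using the first four rows of the ABW growth-rate array shown earlier. The names `years` and `is_even` are illustrative and not part of the original solution.

```python
import numpy as np

# First four (year, growth_rate) rows of the ABW output shown earlier.
arr = np.array([[1961, 0.02263],
                [1962, 0.01420],
                [1963, 0.00836],
                [1964, 0.00594]])

# Years are stored as floats in the array, so cast before the parity check.
years = arr[:, 0].astype(int)
is_even = years % 2 == 0

# Even years go to the training set, odd years to the test set.
X_train, y_train = arr[is_even, 0], arr[is_even, 1]
X_test, y_test = arr[~is_even, 0], arr[~is_even, 1]
print(X_train, y_train)
```

Here `y_train` starts with `0.0142` and `0.00594`, matching the first two values printed above.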


### Model training

Now that we have formatted our data, we can fit a model using sklearn's `DecisionTreeRegressor` class. We'll write a function that takes as input the features and response variables created in the previous section, together with a maximum tree depth, and returns a trained model.

In [50]:
def train_model(X_train, y_train, MaxDepth):
    '''
    This function fits a Decision Tree model from the sklearn module to two arrays holding the
    predictor and response variables, after reshaping them into column vectors.

    Args:
        X_train (arr): An array containing the training features (years)
        y_train (arr): An array containing the training responses (growth rates)
        MaxDepth (int): The maximum depth for the decision tree

    Returns:
        A model fitted to the data provided
    '''
        
    X_train_reshaped = X_train.reshape(-1, 1)
    y_train_reshaped = y_train.reshape(-1, 1)
    tree_m = DecisionTreeRegressor(max_depth = MaxDepth)
    return tree_m.fit(X_train_reshaped, y_train_reshaped)

Input:

In [51]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train,3).predict([[2017]])

array([0.00451333])

Expected output:

```
array([0.00451333])
```
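
To see how a regression tree arrives at a single number like this, the sketch below fits a depth-1 tree to toy data that is not from the population dataset. It illustrates two points: the feature array must be reshaped to `(n_samples, n_features)`, and each prediction is the mean response of the leaf that the input falls into. That is why inputs beyond the training range get the value of the outermost leaf. The name `stump` and the `random_state` argument are additions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (not from the population dataset): a step from 0 to 1 between x=1 and x=2.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(X.reshape(-1, 1), y)

# Each prediction is the mean response of the leaf the input falls into;
# x=10 lies outside the training range and lands in the right-hand leaf.
predictions = stump.predict([[0.5], [2.5], [10.0]])
print(predictions)
```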

### Model testing

Now we would like to test our model on the testing data produced by the even-odd split. The test reports the Root Mean Squared Logarithmic Error (RMSLE), which is defined as:

$$
RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^N \left[\log(1+p_i) - \log(1+y_i)\right]^2}
$$

* $p_i$ refers to the $i^{\rm th}$ prediction made from `X_test`
* $y_i$ refers to the $i^{\rm th}$ value in `y_test`
* $N$ is the length of `y_test`
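
The formula can be sketched directly in numpy. The helper name `rmsle` and the sample values below are made up purely to exercise the formula and are not taken from the dataset. Because the formula takes $\log(1 + x)$, every prediction and target must be greater than $-1$, which small negative growth rates satisfy.

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error; inputs must be greater than -1."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    # np.log1p(x) computes log(1 + x) accurately for small x.
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up predictions and targets, chosen only to exercise the formula.
p = [0.02, 0.01, -0.005]
y = [0.015, 0.012, -0.004]
print(rmsle(p, y))
```

Identical predictions and targets give an error of exactly zero, which is a convenient sanity check for any implementation.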

In [100]:
def test_model(model, y_test, X_test):
    '''
    This function fetches predictions for the testing data, then calculates the Root Mean Squared Logarithmic
    Error after reshaping the testing data arrays.

    Args:
        model (class): A model fitted from the previous function
        y_test (arr): An array containing the response variable for the testing data
        X_test (arr): An array containing the predictor variable for the testing data

    Returns:
        RMSLE (float): The root mean squared logarithmic error
    '''
    
    X_test_flat = X_test.flatten()
    y_test_flat = y_test.flatten()
    y_pred = model.predict(X_test_flat.reshape(-1, 1))
    RMSLE = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_test_flat))**2))
    return round(RMSLE, 3)

Input:

In [99]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
lm = train_model(X_train, y_train,3)
test_model(lm, y_test, X_test)

0.008