# Decision trees


In this challenge, I will implement a decision tree regression model and analyse its RMSLE.



## Importing the necessary packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

## The dataset

The dataset contains population data for various countries over the years from 1960 to 2017. Each row corresponds to a specific country, identified by a country code, and each column represents a year. The values within the dataset represent the population count for each country in the corresponding year.

In [2]:
population_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/AnalyseProject/world_population.csv', index_col='Country Code')
population_df.head()

Country Code,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
ABW,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,57715.0,58055.0,58386.0,58726.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
AFG,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,10152331.0,10372630.0,10604346.0,10854428.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
AGO,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,6309770.0,6414995.0,6523791.0,6642632.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
ALB,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,1914573.0,1965598.0,2022272.0,2081695.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
AND,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,19647.0,20758.0,21890.0,23058.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


## Analysis

### Population growth analysis

The world population data spans from 1960 to 2017. I'd like to build a predictive model that gives the best guess at what the population growth rate in a given year might be. I will calculate the population growth rate as follows:

$$
Growth\_rate = \frac{current\_year\_population - previous\_year\_population}{previous\_year\_population}
$$

Because the formula needs the previous year's population and the data starts in 1960, the growth rate can only be calculated from 1961 onwards.
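
As a quick check of the formula, here is a minimal sketch that applies it to the first three ABW values from the table above. It uses pandas' `pct_change` rather than an explicit loop; the `population` and `growth_rate` names are only for illustration.

```python
import pandas as pd

# First three ABW population values, copied from the table above
population = pd.Series([54211.0, 55438.0, 56225.0], index=["1960", "1961", "1962"])

# pct_change computes (current - previous) / previous for each consecutive pair;
# the first entry has no previous year, so it is NaN and gets dropped
growth_rate = population.pct_change().dropna()

print(growth_rate.round(5))
```

The result has one entry fewer than the input, which is why the series of growth rates starts at 1961.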



In [3]:
def get_population_growth_rate_by_country_year(df, country_code):
    """
    Calculates the population growth rate for a given country starting from the year 1961.

    Args:
    - df (DataFrame): A pandas DataFrame containing population data for various countries over the years from 1960 to 2017.

    - country_code (str): The code representing the country for which the population growth rate is to be calculated.

    Returns:
    - numpy.ndarray: A 2-dimensional numpy array containing the year and corresponding population growth rate for the
                      specified country.
    """

    # Filter the DataFrame to get data for the specified country
    country_data = df.loc[country_code]

    # Initialize an empty list to store the results
    growth_rates = []

    # Iterate over the columns (years) starting from the second column (index 1)
    for i in range(1, len(country_data)):
        # Get the population for the current year and the previous year
        current_population = country_data.iloc[i]
        previous_population = country_data.iloc[i - 1]

        # Calculate the population growth rate and round it to 5 decimal places
        growth_rate = round((current_population - previous_population) / previous_population, 5)

        # Append the year and growth rate to the results list
        growth_rates.append([int(country_data.index[i]), float(growth_rate)])

    # Convert the list of results to a numpy array
    growth_rates_array = np.array(growth_rates)

    return growth_rates_array



In [4]:
get_population_growth_rate_by_country_year(population_df, 'ABW')  # ABW is the country code used to test the function

array([[ 1.961e+03,  2.263e-02],
       [ 1.962e+03,  1.420e-02],
       [ 1.963e+03,  8.360e-03],
       [ 1.964e+03,  5.940e-03],
       [ 1.965e+03,  5.750e-03],
       [ 1.966e+03,  6.190e-03],
       [ 1.967e+03,  5.890e-03],
       [ 1.968e+03,  5.700e-03],
       [ 1.969e+03,  5.820e-03],
       [ 1.970e+03,  5.740e-03],
       [ 1.971e+03,  6.380e-03],
       [ 1.972e+03,  6.730e-03],
       [ 1.973e+03,  6.730e-03],
       [ 1.974e+03,  4.730e-03],
       [ 1.975e+03,  2.130e-03],
       [ 1.976e+03, -1.170e-03],
       [ 1.977e+03, -3.630e-03],
       [ 1.978e+03, -4.360e-03],
       [ 1.979e+03, -2.050e-03],
       [ 1.980e+03,  1.930e-03],
       [ 1.981e+03,  7.840e-03],
       [ 1.982e+03,  1.285e-02],
       [ 1.983e+03,  1.395e-02],
       [ 1.984e+03,  1.021e-02],
       [ 1.985e+03,  3.020e-03],
       [ 1.986e+03, -6.060e-03],
       [ 1.987e+03, -1.295e-02],
       [ 1.988e+03, -1.219e-02],
       [ 1.989e+03, -7.700e-04],
       [ 1.990e+03,  1.830e-02],
       ...])

### Even-odd train-test split

Now that our data is ready, I need to divide it into two sets: the data the model trains on and the data it is tested on. In this scenario, I separate the data so that the **training set contains growth rates for even years and the test set contains growth rates for odd years**. Each set is further split into features (`X`, the years) and a response (`y`, the growth rates).



In [7]:
def feature_response_split(arr):
    """
    Split the input 2-D numpy array into training and testing sets based on even and odd years.

    Args:
    - arr (numpy.ndarray): A 2-dimensional numpy array containing the year and corresponding growth rate.

    Returns:
    - tuple: Two tuples of the form (X_train, y_train), (X_test, y_test).
             (X_train, y_train) contains features and response variables for the training set.
             (X_test, y_test) contains features and response variables for the testing set.
    """

    # Separate the year column (feature) from the growth-rate column (response)
    years = arr[:, 0].astype(int)
    growth_rates = arr[:, 1].astype(float)

    # Boolean mask: True for even years (training), False for odd years (testing)
    is_even = years % 2 == 0

    # Features: the years as floats (e.g. 1961.)
    X_train = years[is_even].astype(float)
    X_test = years[~is_even].astype(float)

    # Responses: the growth rates for the corresponding years
    y_train = growth_rates[is_even]
    y_test = growth_rates[~is_even]

    return (X_train, y_train), (X_test, y_test)


In [8]:
data = get_population_growth_rate_by_country_year(population_df, 'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)

```
y_train ==  array([ 0.01419604,  0.00594409,  0.00618898,  0.00570149,  0.00573851,
        0.00672948,  0.00473084, -0.00117052, -0.00435676,  0.00193398,
        0.01284528,  0.01020884, -0.00606099, -0.01219414,  0.01830187,
        0.05590975,  0.05787267,  0.03580499,  0.02136897,  0.02076288,
        0.02254085,  0.01772885,  0.00800752,  0.00131397,  0.00212906,
        0.00513459,  0.00589222,  0.00460988])
```

```
X_test == array([1961., 1963., 1965., 1967., 1969., 1971., 1973., 1975., 1977.,
       1979., 1981., 1983., 1985., 1987., 1989., 1991., 1993., 1995.,
       1997., 1999., 2001., 2003., 2005., 2007., 2009., 2011., 2013.,
       2015., 2017.])
```

```
y_test == array([ 0.02263378,  0.00835927,  0.00575116,  0.00589102,  0.00582331,
        0.00638301,  0.00673463,  0.00213125, -0.0036312 , -0.00204649,
        0.00783746,  0.01395387,  0.00302374, -0.01294617, -0.0007695 ,
        0.03979147,  0.0625632 ,  0.04724902,  0.02705529,  0.01979903,
        0.02250889,  0.02131758,  0.01310552,  0.00384798,  0.00098665,
        0.00377696,  0.00594675,  0.00526037,  0.00421667])
```



Now that the data is formatted, I can fit a model using sklearn's `DecisionTreeRegressor` class. I'll write a function that takes as input the features and response variables created in the previous part, plus a maximum tree depth, and returns a trained model.


In [9]:
def train_model(X_train, y_train, MaxDepth):
    """
    Train a Decision Tree Regressor model using the input features and response variables.

    Args:
    - X_train (numpy.ndarray): Features for training the model.
    - y_train (numpy.ndarray): Response variables for training the model.
    - MaxDepth (int): Maximum depth hyperparameter for the Decision Tree Regressor.

    Returns:
    - DecisionTreeRegressor: Trained Decision Tree Regressor model.
    """

    # Initialize Decision Tree Regressor with specified max_depth
    model = DecisionTreeRegressor(max_depth=MaxDepth)

    # Reshape the features to match the expected format and fit the model to the training data
    model.fit(X_train.reshape(-1, 1), y_train)

    return model




I train the model with a maximum depth of 3 for the country code ABW and use it to predict the growth rate for 2017.

In [10]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), _ = feature_response_split(data)

train_model(X_train, y_train,3).predict([[2017]])

array([0.00451333])
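
To illustrate what the depth limit means, here is a minimal sketch on synthetic data (not the real ABW values). A tree of depth 3 has at most 2**3 = 8 leaves, so it can return at most eight distinct predictions, and any year beyond the training range lands in the right-most leaf. The synthetic `y` values and the `random_state=0` setting are my own assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the even-year training data (not the real ABW values)
X = np.arange(1962, 2018, 2, dtype=float).reshape(-1, 1)
y = np.sin(X.ravel() / 5.0) * 0.02

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# A depth-3 tree has at most 2**3 = 8 leaves, hence at most 8 distinct outputs
n_leaves = model.get_n_leaves()
distinct_predictions = np.unique(model.predict(X))

# Years past the last training year fall into the same right-most leaf
beyond = model.predict([[2017.0], [2050.0]])

print(n_leaves, len(distinct_predictions), beyond)
```

This is why the prediction for 2017 is a constant taken from the most recent training years rather than an extrapolated trend.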

### Calculating the accuracy of the model

I now test the model on the testing data produced earlier in this notebook by the even-odd split. The test returns the Root Mean Squared Logarithmic Error (RMSLE).
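
For reference, the RMSLE is commonly written as below. The symbols are my own notation: $\hat{y}_i$ is the predicted growth rate, $y_i$ the actual growth rate, and $n$ the number of test observations. The logarithm is only defined while every value is greater than $-1$, which holds for the growth rates shown above.

$$
RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1 + \hat{y}_i) - \log(1 + y_i)\right)^2}
$$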



In [12]:

def test_model(model, y_test, X_test):
    """
    Test a trained model on testing data and calculate Root Mean Squared Logarithmic Error (RMSLE).

    Args:
    - model: Trained model object.
    - y_test (numpy.ndarray): Actual response variables for testing the model.
    - X_test (numpy.ndarray): Features for testing the model.

    Returns:
    - float: Root Mean Squared Logarithmic Error (RMSLE) rounded to 3 decimal places.
    """

    # Predict the target variable for the test data
    y_pred = model.predict(X_test.reshape(-1, 1))

    # Calculate squared differences between log(1 + predicted) and log(1 + actual)
    squared_log_diff = (np.log1p(y_pred) - np.log1p(y_test))**2

    # Calculate the mean squared logarithmic error
    mean_squared_log_error = np.mean(squared_log_diff)

    # Calculate Root Mean Squared Logarithmic Error (RMSLE)
    rmsle = np.sqrt(mean_squared_log_error)

    # Round RMSLE to 3 decimal places and return
    return round(rmsle, 3)



In [14]:
data = get_population_growth_rate_by_country_year(population_df,'ABW')
(X_train, y_train), (X_test, y_test) = feature_response_split(data)
model = train_model(X_train, y_train, 3)
test_model(model, y_test, X_test)

0.008

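Root Mean Squared Logarithmic Error (RMSLE): a measure of how far the predicted values are from the actual values in a regression problem, computed on a logarithmic scale. It's calculated by taking the square root of the average of the squared differences between `log(1 + predicted)` and `log(1 + actual)`.
An RMSLE of 0.008 means the model's predictions deviate from the actual values by approximately 0.008 on the `log(1 + x)` scale. Because the growth rates here are close to zero, where `log(1 + x)` is approximately `x`, this is roughly 0.008 in growth-rate units.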
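
As a small worked example of the metric, the sketch below computes the RMSLE next to the plain RMSE for three actual values copied from `y_test` above. The predictions are hypothetical numbers made up for illustration, and the `rmsle` helper is my own, not part of sklearn.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error; valid while every value is > -1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# First three odd-year growth rates for ABW, copied from y_test above
y_true = [0.02263378, 0.00835927, 0.00575116]
# Hypothetical predictions, made up purely for illustration
y_pred = [0.015, 0.006, 0.006]

score = rmsle(y_true, y_pred)
rmse = float(np.sqrt(np.mean((np.array(y_pred) - np.array(y_true)) ** 2)))

print(round(score, 5), round(rmse, 5))
```

For growth rates this small the two numbers are nearly identical, because `log(1 + x)` is close to `x` near zero.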