# Regression with World Development Indicators

In this workbook we will load up our cleaned data from the World Development Indicators data set and take it through the process of building a regression model.  We will try to predict the actual life expectancy, in years, of each country.

First load the libraries we need.  For this course we will use this library, which combines the components we need from NumPy, Pandas, Matplotlib and scikit-learn and wraps them in a simplified class called a DasiFrame.  DasiFrame is essentially a Pandas DataFrame extended with machine learning capabilities.

In [None]:
from dasi_library import *

## Load the data from the CSV file

In [None]:
dataset = readCsv('World Indicators 2000.csv')

## Inspect the data

First we will poke around the data to see what we can find.  The aim is to understand the data a bit more whilst wearing our machine learning hat.  We want to understand the features and identify which features might be useful for us when training our model.

### Identify the number of features (columns) and samples (rows)
Start by understanding the size of the data.

In [None]:
dataset.shape

### Have a quick look at the data
Take a quick look at the data to understand what you are dealing with.

In [None]:
dataset.head(5)

### Calculate descriptive stats
These give an idea of the range and spread of values for each feature.

In [None]:
dataset.describe()

## Analytical visualisation
We can gain a better understanding of the data using some visualisations.  

### Box plots
Box plots give an idea of spread:

In [None]:
boxPlotAll(dataset)

### Histograms
Histograms give an idea of distribution:

In [None]:
histPlotAll(dataset)

### Correlation matrix

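A correlation matrix allows you to quickly see the extent to which there are correlations (positive or negative) between pairs of attributes.  Dark blues and bright yellows are a good sign: they sit at the two ends of the colour scale, so they mark the pairs with the strongest correlations.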

In [None]:
correlationMatrix(dataset)
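
To make the idea of a correlation concrete, here is a minimal sketch of the Pearson correlation coefficient for one pair of features, written in plain Python.  It assumes the matrix above is built from Pearson correlations, which the workbook does not state, and the numbers are made up for illustration rather than taken from the World Indicators file.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, without dividing by n (the factor cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative values (hypothetical feature names).
income = [1.0, 2.0, 3.0, 4.0, 5.0]
life_exp = [52.0, 58.0, 63.0, 70.0, 75.0]
infant_mortality = [90.0, 70.0, 52.0, 30.0, 12.0]

r_pos = pearson(income, life_exp)          # strong positive correlation
r_neg = pearson(income, infant_mortality)  # strong negative correlation
print(round(r_pos, 3), round(r_neg, 3))
```

A value near +1 or -1 corresponds to the strongly coloured cells in the matrix; a value near 0 corresponds to the washed-out middle of the scale.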

## Prepare the data

### Remove identifiers (i.e. anything that is not a feature)

We will remove the country name, as it is not used for creating the model and will get in the way.

In [None]:
dataset = removeCol(dataset, 'CountryName')

### Select our target feature

For a regression task, we will choose a numeric feature.  Here we will choose actual life expectancy (if you remember, for the classification task we split the life expectancy into L, M and H bands).

## Split out the target feature

By convention, Y is the set of target values for the samples.  These are the values we hope our model will be able to predict.  X is the set of input samples, which we will use to make our prediction.

In [None]:
X,Y = splitXY(dataset, 'LifeExp')

## Pre-process and select the best features

We will rescale all features to have values between 0 and 1.  This helps some algorithms.

In [None]:
X = rescale(X)
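
As a rough picture of what rescaling to the range 0 to 1 means, here is a minimal sketch of min-max scaling for a single column.  It assumes `rescale` does something along these lines; the function name `min_max_rescale` and the sample values are hypothetical.

```python
def min_max_rescale(values):
    """Map values linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up values on very different scales.
gdp_per_capita = [500.0, 2500.0, 10500.0, 40500.0]
birth_rate = [10.0, 20.0, 30.0, 50.0]

scaled_gdp = min_max_rescale(gdp_per_capita)
scaled_birth = min_max_rescale(birth_rate)
print(scaled_gdp)
print(scaled_birth)
```

After scaling, both columns live on the same 0 to 1 range, so a distance-based algorithm such as k-nearest neighbours is not dominated by the feature with the largest raw numbers.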

This time, rather than doing this manually, we will use statistics to find the 4 features that best contribute to the target values.

In [None]:
X = selectFeaturesKBestRegression(4, X, Y)
X
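
To give a feel for how a "k best" selection can work, here is a minimal sketch that scores each feature by the absolute value of its correlation with the target and keeps the top k.  This is an assumption about the kind of statistic involved, not a description of what `selectFeaturesKBestRegression` actually does; the feature names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_k_best(features, target, k):
    """Return the names of the k features with the largest |correlation| to target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Made-up columns (hypothetical names).
features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with the target
    "feature_b": [5.0, 1.0, 4.0, 2.0, 3.0],  # only weakly related
    "feature_c": [9.0, 7.0, 6.0, 3.0, 1.0],  # falls as the target rises
}
target = [50.0, 55.0, 61.0, 66.0, 70.0]

best_two = select_k_best(features, target, 2)
print(best_two)
```

Note that a strongly negative correlation counts just as much as a strongly positive one, which is why the sketch ranks by absolute value.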

## Scatter Plot to check our features
Let's just have a quick look at a scatter plot to see how the SelectKBest algorithm did.  Scatter plot matrices show how pairs of features are related, which makes them useful for spotting correlations.  Because we got the machine learning tools to select the features, we'd hope each one shows a correlation with our target feature.

In [None]:
cols = listColumns(X)+['LifeExp']
scatterMatrix(selectCols(dataset, cols))

## Split into training and test sets

Now split the data set into a training set (67%) and a test set (33%):

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
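
The idea behind the split can be sketched in a few lines of plain Python: shuffle the row positions using a fixed seed, then cut off a portion for testing.  The helper name `train_test_split_rows` is hypothetical, and the exact rounding and shuffling used by `trainTestSplit` may differ.

```python
import random


def train_test_split_rows(rows, test_size, seed):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed gives the same shuffle
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(100))  # stand-ins for 100 samples
train, test = train_test_split_rows(rows, test_size=0.33, seed=7)
print(len(train), len(test))
```

Fixing the seed means everyone running the workbook gets the same split, so results can be compared from one run to the next.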

## Train the models

The evaluateAlgorithmsRegression function creates multiple train / test splits (called **folds**), creates models using all of the algorithms against all of the folds, and returns the results.  The process of using folds in this way is called **k-fold cross-validation**. 
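
Here is a minimal sketch of how folds can be carved out of a data set.  The number of folds and whether the rows are shuffled first are not stated for `evaluateAlgorithmsRegression`, so `k=3` and the unshuffled order below are purely illustrative, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validate:", validation)
```

Every sample is used for validation exactly once, and each model is always scored on rows it did not see during training.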

In [None]:
algorithms = []
algorithms.append(LinearRegression)
algorithms.append(KNeighborsRegressor)
algorithms.append(DecisionTreeRegressor)
evaluateAlgorithmsRegression(X_train, Y_train, algorithms, seed)

The number in the evaluation above is the mean absolute error (MAE) of the results.  It's the average error in life expectancy (in years) of our model.  A value of 0 means a perfect predictor, so the best models will have the smallest MAE.
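
The MAE calculation itself is short enough to write out by hand.  This sketch uses a hypothetical helper called `mae` and made-up life expectancy values.

```python
def mae(actual, predicted):
    """Mean absolute error: the average size of the errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up life expectancies in years.
actual_life_exp = [70.0, 65.0, 80.0, 55.0]
predicted_life_exp = [72.0, 61.0, 79.0, 58.0]

# The individual errors are 2, 4, 1 and 3 years.
error_in_years = mae(actual_life_exp, predicted_life_exp)
print(error_in_years)  # 2.5
```

Because the sign is dropped, predicting 2 years too high and 2 years too low count the same, and they do not cancel each other out.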

<hr/>

**Question: >>**

**Why can't we evaluate our models using the accuracy score and confusion matrix that we used for our classification models?**

<hr/>

We can now take our best algorithm and create a model using all of the training data:

In [None]:
model = modelFit(X_train, Y_train, LinearRegression)

Test our model using the training data:

In [None]:
predictions = predict(model, X_train)
print(mean_absolute_error(Y_train, predictions))

## Test the model

Now we do a final test of the model against the test data:

In [None]:
predictions = predict(model, X_test)
print(mean_absolute_error(Y_test, predictions))

Let's also put the predictions side by side with the input features and the correct values:

In [None]:
comparePredictionsWithOriginals(X_test, predictions, Y_test)

## Apply the model

Now let's apply the model to the World Indicators 2010 data, to see if our model based on 2000 data holds for 2010 figures.

Load the World Indicators 2010 data:

In [None]:
unseen_original_dataset = readCsv('World Indicators 2010.csv')
unseen_original_dataset

Select just the columns we used in our model:

In [None]:
selectedFeatures = listColumns(X_test)
targetFeature = ['LifeExp']
selectedFeatures + targetFeature

Select the columns from above out of the 2010 data set, together with our target feature:

In [None]:
unseen_dataset = selectCols(unseen_original_dataset, selectedFeatures + targetFeature)


Split into target feature and input features.

In [None]:
X,Y = splitXY(unseen_dataset, targetFeature[0])

In [None]:
X = rescale(X)

Use our model to make predictions.

In [None]:
predictions = predict(model, X)

In [None]:
comparePredictionsWithOriginals(X, predictions, Y)

Let's get a measure of how well we did:

In [None]:
mean_absolute_error(Y, predictions)

The above is the average error in life expectancy (in years).

## Inspecting the model
For our classification tasks, we were able to visualise the decision tree created by the algorithm.

Different algorithms have different ways of modelling the relationships in the data, so the approach for inspecting and visualising the model will vary from algorithm to algorithm.  Some algorithms are more "explainable" than others.

Let's look at the linear regression model.  We can visualise the **coefficients**, which are the numbers assigned to each input feature we used to build our model.

In [None]:
model = modelFit(X_train, Y_train, LinearRegression)
linearRegressionSummary(model, X.columns)

<hr/>

**Question: >>**

**What do you think these numbers represent?  What can we say about the relative size of the numbers?  Why are some numbers positive, and some negative?**

<hr/>

These are all unfair questions for you without me explaining how linear regression actually works!  I will explain on the whiteboard!!
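
As something to come back to after the whiteboard explanation, here is a minimal sketch of a linear regression with just one input feature, fitted by ordinary least squares.  The helper `fit_line` and the data are hypothetical; the model we trained uses four features, but each coefficient plays the same role as the single slope here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# A made-up feature already rescaled to the range 0 to 1.
x = [0.0, 0.25, 0.5, 0.75, 1.0]
rising = [50.0, 55.0, 60.0, 65.0, 70.0]   # target goes up as x goes up
falling = [70.0, 65.0, 60.0, 55.0, 50.0]  # target goes down as x goes up

slope_up, intercept_up = fit_line(x, rising)
slope_down, intercept_down = fit_line(x, falling)
print(slope_up, intercept_up)      # 20.0 50.0
print(slope_down, intercept_down)  # -20.0 70.0
```

In this sketch a coefficient of 20 means that moving the rescaled feature from 0 to 1 adds 20 years to the prediction, and a coefficient of -20 means it takes 20 years away.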