# Classification with World Development Indicators - WITH SOLUTION

In this workbook we will load up our cleaned data from the World Development Indicators data set and take it through the process of building a classification model.

First load the libraries we need.  For this course we will use a single library that combines the components we need from NumPy, Pandas, Matplotlib and Scikit-learn and wraps them in a simplified class called a DasiFrame.  A DasiFrame is essentially a Pandas DataFrame extended with machine learning capabilities.

In [None]:
from dasi_library import *

## Load the data from the CSV file

In [None]:
dataset = readCsv('../datasets/World Development Indicators/World Indicators 2000.csv')

## Inspect the data

First we will poke around the data to see what we can find.  The aim is to understand the data a bit more whilst wearing our machine learning hat.  We want to understand the features and identify which features might be useful for us when training our model.

### Identify the number of features (columns) and samples (rows)
Start by checking the size of the data.

In [None]:
dataset.shape

### Have a quick look at the data
Take a quick look at the data to understand what you are dealing with.

In [None]:
dataset.head(5)

### Calculate descriptive stats
These give an idea of the range and spread of values for each feature.

In [None]:
dataset.describe()

## Analytical visualisation
We can gain a better understanding of the data using some visualisations.  

### Box plots
Box plots give an idea of spread:

In [None]:
boxPlotAll(dataset)

### Histograms
Histograms give an idea of distribution:

In [None]:
histPlotAll(dataset)

### Correlation matrix

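A correlation matrix allows you to quickly see the extent to which there are correlations (positive or negative) between pairs of attributes.  Dark blues and bright yellows sit at the extremes of the colour scale, so they mark the strongest correlations.  Seeing plenty of them is a good sign that the features are related to one another and may be useful to a model.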

In [None]:
correlationMatrix(dataset)
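
The internals of `correlationMatrix` are not shown in this workbook.  Assuming it is built on Pandas, the numbers behind the plot could be computed roughly as in the sketch below.  The toy values are made up; only the column names echo the data set.

```python
import pandas as pd

# Hypothetical toy data: the values are invented for illustration.
toy = pd.DataFrame({
    "BirthRate": [10.0, 20.0, 30.0, 40.0],
    "FertilityRate": [1.5, 2.5, 3.5, 4.5],   # rises with BirthRate
    "LifeExp": [80.0, 70.0, 60.0, 50.0],     # falls as BirthRate rises
})

# Pearson correlation between every pair of numeric columns, in [-1, 1].
corr = toy.corr()
print(corr.round(2))
```

Values near +1 or -1 correspond to the strongest colours in the plot; values near 0 mean little linear relationship.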

## Prepare the data

### Remove identifiers (i.e. anything that is not a feature)

We will remove the country name, as it is an identifier rather than a feature and would get in the way when building the model.

In [None]:
dataset = removeCol(dataset, 'CountryName')

### Add additional derived features we may need

We may want to derive new features from existing features.  For example, here we will band the life expectancy into L, M and H.

In [None]:
dataset = appendClass(dataset, 'LifeExpBand','LifeExp',[0,50,60,100],['L','M','H'])
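
Banding of this kind can be sketched with `pandas.cut`.  This is an assumption about what `appendClass` does internally, not a description of its actual implementation, and the values are made up.

```python
import pandas as pd

# Hypothetical life expectancy values; the real ones come from the CSV file.
toy = pd.DataFrame({"LifeExp": [45.0, 50.0, 55.0, 60.0, 78.0]})

# Bin edges [0, 50, 60, 100] define the intervals (0, 50], (50, 60], (60, 100].
# By default pandas.cut includes the right-hand edge in each interval.
toy["LifeExpBand"] = pd.cut(
    toy["LifeExp"], bins=[0, 50, 60, 100], labels=["L", "M", "H"]
)
print(toy)
```

Note that a value exactly on an edge (50 or 60 here) falls into the lower band under this default.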

Let's just quickly check this worked:

In [None]:
selectCols(dataset, ['LifeExpBand','LifeExp'])

### Select our target feature

For a classification task, the target feature is a categorical feature with 2 or more unique values (the classes).  Here we will select the LifeExpBand that we just created.  Our aim with the model is to predict the LifeExpBand based on other features.  In other words, we want to build a model that uses a few key features to predict the life expectancy band (L, M or H) of a country.

Let's check how many countries are in each band:

In [None]:
classDistribution(dataset, 'LifeExpBand')

The L band has only 4 countries, which gives the model very few examples of that class to learn from.  Let's adjust the band boundaries:

In [None]:
dataset = appendClass(dataset, 'LifeExpBand','LifeExp',[0,65,73,100],['L','M','H'])
classDistribution(dataset, 'LifeExpBand')

That's a bit better.
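
To see why moving the bin edges changes the class balance, here is a small sketch using `pandas.cut` and `value_counts` on made-up values.  The course functions may work differently internally.

```python
import pandas as pd

# Hypothetical life expectancy values, skewed towards the high end.
life_exp = pd.Series([48, 58, 63, 66, 70, 72, 75, 79, 81], dtype=float)
labels = ["L", "M", "H"]

# Original edges: almost everything lands in the H band.
counts_old = pd.cut(life_exp, bins=[0, 50, 60, 100], labels=labels).value_counts().to_dict()

# Adjusted edges: the three bands are evenly populated.
counts_new = pd.cut(life_exp, bins=[0, 65, 73, 100], labels=labels).value_counts().to_dict()

print("old edges:", counts_old)
print("new edges:", counts_new)
```

A more even spread gives the classifier a fairer number of examples per class.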

## Inspect some more

Let's see how each feature compares for each life expectancy band.  This is really useful - we can already visually see which features might be selected.

In [None]:
selectCols(dataset, ['LifeExpBand','LifeExp'])

In [None]:
classComparePlot(dataset, 'LifeExpBand', 'density')

## Split out the target feature

By convention, Y is the set of target values for the samples.  These are the values we hope our model will be able to predict.  X is the set of input samples, which we will use to make our predictions.

In [None]:
X,Y = splitXY(dataset, 'LifeExpBand')
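
In plain Pandas, a split like this amounts to pulling out one column and dropping it from the rest.  The sketch below is an assumed equivalent on made-up data, not the source of `splitXY`.

```python
import pandas as pd

# Hypothetical toy data with two input features and the target column.
toy = pd.DataFrame({
    "BirthRate": [12.0, 35.0, 22.0],
    "Internet": [60.0, 2.0, 15.0],
    "LifeExpBand": ["H", "L", "M"],
})

target = "LifeExpBand"
Y_toy = toy[target]                 # target values, one per sample
X_toy = toy.drop(columns=[target])  # every other column is an input feature

print(list(X_toy.columns))
print(Y_toy.tolist())
```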

## Pre-process and select the best features

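We will rescale all features to have values between 0 and 1.  This helps algorithms that are sensitive to the scale of their inputs, such as k-nearest neighbours and support vector machines.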

In [None]:
X = rescale(X)
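
Rescaling to the range 0 to 1 is commonly done with Scikit-learn's `MinMaxScaler`.  Whether `rescale` uses it is an assumption; the sketch below shows the idea on made-up values.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
toy = pd.DataFrame({
    "BirthRate": [10.0, 25.0, 40.0],
    "Internet": [0.0, 50.0, 100.0],
})

# Each column is mapped to (value - min) / (max - min).
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)
print(scaled)
```

After scaling, each column has a minimum of 0 and a maximum of 1, so no feature dominates purely because of its units.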

This time, rather than choosing features manually, we will use a statistical test to find the 4 features that are most strongly related to the target values.

In [None]:
X = selectFeaturesKBestClassification(4, X, Y)
X
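
The function name suggests Scikit-learn's `SelectKBest`.  The sketch below assumes the ANOVA F-test (`f_classif`) as the scoring function, which may differ from what the course library uses, and runs on synthetic data with hypothetical column names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic two-class target: 20 samples of each class.
y = np.array(["L"] * 20 + ["H"] * 20)
signal = np.where(y == "H", 1.0, 0.0)

# Two columns that track the class and two that are pure noise.
X_toy = pd.DataFrame({
    "informative_a": signal + rng.normal(0, 0.1, 40),
    "informative_b": -signal + rng.normal(0, 0.1, 40),
    "noise_a": rng.normal(0, 1, 40),
    "noise_b": rng.normal(0, 1, 40),
})

# Score each feature against the target and keep the k highest scoring.
selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_toy, y)
selected = list(X_toy.columns[selector.get_support()])
print(selected)
```

Each feature is scored on its own, so this approach does not account for features that are redundant with one another.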

## Scatter Plot to check our features
Let's have a quick look at a scatter plot matrix to see how the SelectKBest algorithm did.  A scatter plot matrix shows how pairs of features are related, which makes it useful for spotting correlations.  Because the features were selected statistically for their relationship to the target, we would hope to see clear correlations with LifeExp.

In [None]:
cols = listColumns(X)+['LifeExp']
scatterMatrix(selectCols(dataset, cols))

## Split into training and test sets

Now split the data set into a training set (67%) and a test set (33%):

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
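
The arguments match Scikit-learn's `train_test_split`, which `trainTestSplit` presumably wraps (an assumption).  The sketch below shows how `test_size` and `random_state` behave, using synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["L", "H"] * 50)

# test_size=0.33 holds out 33% of the samples; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=7
)

# Repeating the call with the same seed gives the same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=7
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

Fixing the seed makes results reproducible between runs of the workbook.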

## Train the model

This time we will take a different approach.  Rather than just choosing a particular algorithm, we will evaluate a bunch of algorithms to see which one performs the best with this data set.

In [None]:
algorithms = []
algorithms.append(LogisticRegression)
algorithms.append(LinearDiscriminantAnalysis)
algorithms.append(KNeighborsClassifier)
algorithms.append(DecisionTreeClassifier)
algorithms.append(GaussianNB)
algorithms.append(SVC)
evaluateAlgorithmsClassification(X_train, Y_train, algorithms, seed)
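
A comparison like this is typically done with k-fold cross-validation on the training data.  The sketch below assumes 10 shuffled folds scored by accuracy, which may not match the course function, and uses the iris data set bundled with Scikit-learn as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (not the World Indicators data).
X_toy, y_toy = load_iris(return_X_y=True)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=7),
}

# The same folds are used for every algorithm so the scores are comparable.
folds = KFold(n_splits=10, shuffle=True, random_state=7)

mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_toy, y_toy, cv=folds, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The spread across folds matters as well as the mean: an algorithm with a slightly lower mean but a much smaller spread may be the safer choice.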

Take our best algorithm and train it to create a model:

In [None]:
model = modelFit(X_train, Y_train, LogisticRegression)

## Test the model

Now use the model to make predictions on data that was not used for training:

In [None]:
predictions = predict(model, X_test)
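
In Scikit-learn terms, training and predicting are the `fit` and `predict` methods of an estimator.  The sketch below assumes `modelFit` and `predict` are thin wrappers around them, and uses the bundled iris data as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data set (not the World Indicators data).
X_toy, y_toy = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=7
)

# Fitting learns the model parameters from the training samples only.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Predicting returns one class label per held-out sample.
toy_predictions = clf.predict(X_te)
print(toy_predictions[:5])
```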

Check how well we did:

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
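
To make the output easier to read, here is a small worked example with made-up labels.  Passing `labels` fixes the row and column order of the confusion matrix: rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted bands for six countries.
y_true = ["L", "M", "H", "H", "M", "L"]
y_pred = ["L", "M", "H", "M", "M", "H"]

# Accuracy is the fraction of predictions that match: 4 of 6 here.
acc = accuracy_score(y_true, y_pred)

# Rows: true L, M, H.  Columns: predicted L, M, H.
cm = confusion_matrix(y_true, y_pred, labels=["L", "M", "H"])

print(round(acc, 3))
print(cm)
```

Off-diagonal entries are the mistakes: in this example one true L was predicted as H, and one true H was predicted as M.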

Let's also join the predictions to the test inputs and the correct values, so we can compare them row by row:

In [None]:
comparePredictionsWithOriginals(X_test, predictions, Y_test)

## Apply the model

Now see if you can apply the model to the World Indicators 2010 data, to see if our model based on 2000 data holds for 2010 figures.


Load the World Indicators 2010 data.

In [None]:
unseen_dataset = readCsv('../datasets/World Development Indicators/World Indicators 2010.csv')
unseen_dataset

Add the LifeExpBand class.

In [None]:
unseen_dataset = appendClass(unseen_dataset,
    class_name='LifeExpBand',
    feature='LifeExp',
    bins=[0,65,73,100],
    labels=['L','M','H'])

See what the distribution is:

In [None]:
classDistribution(unseen_dataset, class_name='LifeExpBand')

Select just the columns we used in our model:

In [None]:
listColumns(X_test)

In [None]:
unseen_dataset = selectCols(unseen_dataset, ['LifeExpBand', 'BirthRate', 'FertilityRate', 'Sanitation', 'Internet'])


Split into target feature and input features.

In [None]:
X,Y = splitXY(unseen_dataset, 'LifeExpBand')

In [None]:
X = rescale(X)
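
One point to keep in mind here: if `rescale` computes the minimum and maximum from whatever data it is given (an assumption, since its implementation is not shown), the 2010 features are scaled against 2010 ranges, not the 2000 ranges the model was trained on.  The sketch below uses Scikit-learn's `MinMaxScaler` on made-up numbers to show the difference between refitting a scaler and reusing one fitted on the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_values = np.array([[0.0], [10.0]])  # made-up "2000" feature values
new_values = np.array([[5.0], [20.0]])    # made-up "2010" feature values

# Reusing the scaler fitted on the training data keeps the original scale,
# so new values outside the training range fall outside 0 to 1.
scaler = MinMaxScaler().fit(train_values)
reused = scaler.transform(new_values)

# Refitting on the new data forces it back into 0 to 1 on a different scale.
refit = MinMaxScaler().fit_transform(new_values)

print(reused.ravel())
print(refit.ravel())
```

The same raw value of 5 maps to 0.5 with the reused scaler but to 0 with the refitted one, which can shift the model's predictions.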

Use our model to make predictions.

In [None]:
predictions = predict(model, X)

In [None]:
comparePredictionsWithOriginals(X, predictions, Y)

In [None]:
print(accuracy_score(Y, predictions))
print(confusion_matrix(Y, predictions))