# European Development Indicators - Analysis

In this workbook we will take the approach we used for the England or Brazil exercise and apply it to a meatier dataset: a collection of development indicators for European countries in the year 2000.

First load the libraries we need.  For this course we will use `dasi_library`, which combines the components we need from NumPy, Pandas, Matplotlib and Scikit-learn and wraps them in a simplified class called a DasiFrame.  DasiFrame is essentially a Pandas DataFrame extended with machine learning capabilities.

In [None]:
from dasi_library import *

## Load the data from the CSV file

Load our data set containing the European indicators.

In [None]:
dataset = readCsv('european indicators 2000.csv')

## Inspect the data

First we will poke around the data to see what we can find.  
The aim is to understand the data a bit more whilst wearing our machine learning hat.  We want to understand the features and identify which features might be useful for us when training our model.

### Identify the number of features (columns) and samples (rows)
Check the size of the data.

In [None]:
dataset.shape

### Have a quick look at the data
Take a quick look at the data to understand what you are dealing with.

In [None]:
dataset.head(5)

### Calculate descriptive stats
These give an idea of the range and spread of values for each feature.

In [None]:
dataset.describe()

## Visualise the data
We can gain a better understanding of the data using some visualisations.  

### Box plots
Box plots give an idea of spread:

In [None]:
boxPlotAll(dataset)

### Histograms
Histograms give an idea of distribution:

In [None]:
histPlotAll(dataset)

## Split the data into target feature and input features
Our aim is to use the input features to predict the target feature.


### Select our target feature

For a classification task, the target feature is a feature with 2 or more unique values that we are trying to predict.  In the England or Brazil exercise we already had such a feature, which was the country.  In this exercise we will start with one of the numeric features.  These numeric features are **continuous variables**, not **categorical** ones, so we need to turn the numbers into categorical values.  A simple way is to split the range into bands: H (high), M (medium) and L (low).  Our classification exercise is to try to find features that predict these bands.  I'm going to choose PopGrowth, but why don't you choose something different?

First take a look at the range of values in our chosen attribute.

In [None]:
selectCol(dataset, 'PopGrowth').describe()

Now choose the boundaries of our bands to match the range between the min and max.  The line below will add a new column to the dataset based on these boundaries:

In [None]:
dataset = appendClass(dataset, 'PopGrowthBand','PopGrowth',[-3,0,0.5,2],['L','M','H'])
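
If you are curious what a banding step like this involves, here is a minimal sketch in plain pandas.  It assumes `appendClass` behaves much like `pandas.cut`, which is my assumption and not something the library confirms.  The `demo` frame and its values are invented for illustration.

```python
import pandas as pd

# Invented growth rates, for illustration only (not taken from the CSV).
demo = pd.DataFrame({'PopGrowth': [-0.5, 0.0, 0.2, 0.5, 1.0]})

# Same boundaries and labels as the cell above. With the pandas defaults
# each band includes its upper edge: (-3, 0] is L, (0, 0.5] is M, (0.5, 2] is H.
demo['PopGrowthBand'] = pd.cut(demo['PopGrowth'],
                               bins=[-3, 0, 0.5, 2],
                               labels=['L', 'M', 'H'])
print(demo)
```

Note where the edge values land: with these defaults a growth rate of exactly 0 falls in L and exactly 0.5 falls in M.  Whether `appendClass` treats the edges the same way is worth checking against your own output.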

Let's do a quick check to see what this did:

In [None]:
selectCols(dataset, ['PopGrowthBand','PopGrowth'])

And let's also quickly check that we have useful numbers in each band.  If the split is too uneven we won't be able to train our model properly.

In [None]:
classDistribution(dataset, 'PopGrowthBand')
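
A class distribution check can also be sketched in plain pandas with `value_counts`.  This is not necessarily how `classDistribution` works internally, and the band labels below are made up for the example.

```python
import pandas as pd

# Made-up band labels, for illustration only.
bands = pd.Series(['L', 'M', 'M', 'H', 'M', 'L'], name='PopGrowthBand')

counts = bands.value_counts()                 # samples per band
shares = bands.value_counts(normalize=True)   # fraction of samples per band

print(counts)
print(shares.round(2))
```

If one band holds nearly all of the samples, a model can score well just by always predicting that band, so the shares are worth a glance before training.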

Ok, that all looks good, so we can continue.

Now we need to remove all the features that are not going to be used for inputs and are not the target.  

<hr/>

**You choose >>**

**Choose up to 4 features you think might be interesting to use as predictors of your chosen target feature.  Be careful not to choose features that are obviously related to the target.  For example, I wouldn't choose any other population growth feature such as UrbanPopGrowth or Fertility Rate.  Also add your target feature.**

<hr/>

I'm going with this:


In [None]:
dataset = selectCols(dataset, ['InternetUsers', 'MobileCellular', 'PopGrowthBand'])

### Split out the target feature

By convention, Y is the set of target values for the samples.  These are the values we hope our model will be able to predict. By convention, X is the set of input samples.  These are the values we will use to create a model.

In [None]:
X,Y = splitXY(dataset, 'PopGrowthBand')

In [None]:
X

In [None]:
Y
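
The X/Y split itself is simple.  As a rough sketch in plain pandas (assuming `splitXY` does something equivalent, which I have not confirmed; the `demo` values are invented):

```python
import pandas as pd

# Invented rows, for illustration only.
demo = pd.DataFrame({
    'InternetUsers': [10.0, 25.0, 40.0],
    'MobileCellular': [30.0, 55.0, 70.0],
    'PopGrowthBand': ['L', 'M', 'H'],
})

# X keeps every column except the target; Y is the target column alone.
X_demo = demo.drop(columns=['PopGrowthBand'])
Y_demo = demo['PopGrowthBand']

print(X_demo)
print(Y_demo)
```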

## Split the data set into training and test sets
We will train the model on the training set and test it using the test set.

We will use 67% of the data for training and 33% for testing.

In [None]:
test_size = 0.33
seed = 1
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)

In [None]:
X_train

In [None]:
X_test
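
To see what a 67/33 split with a fixed seed looks like in isolation, here is a small sketch using Scikit-learn's `train_test_split` directly.  I am assuming `trainTestSplit` wraps something similar; the arrays below are synthetic stand-ins, not our dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 30 samples with 2 features each.
X_demo = np.arange(60).reshape(30, 2)
Y_demo = np.array(['L', 'M', 'H'] * 10)

# Hold back roughly a third of the samples for testing.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.33, random_state=1)

# Repeating the split with the same seed gives the same partition.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, Y_demo, test_size=0.33, random_state=1)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the exercise repeatable: change the seed and you get a different set of test countries, and quite possibly a different score.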

## Train the model
Use the training data to devise a model that can perform our predictions.

We will use the decision tree algorithm to train our model.

In [None]:
model = modelFit(X_train, Y_train, DecisionTreeClassifier)

Let's see how the model performs on the *training data*:

In [None]:
predictions = predict(model, X_train)
print(accuracy_score(Y_train, predictions))
print(confusion_matrix(Y_train, predictions))
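
A word of caution on training scores.  The sketch below uses Scikit-learn directly on random data where the labels have nothing to do with the features, so there is nothing real to learn.  The data is synthetic and the variable names are mine; it is not a reproduction of `modelFit`.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two random features and labels drawn independently of them.
X_demo = rng.random((40, 2))
Y_demo = rng.choice(['L', 'M', 'H'], size=40)

# A decision tree with default settings keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, Y_demo)

train_acc = accuracy_score(Y_demo, tree.predict(X_demo))
print(train_acc)
```

Even on noise the tree can reproduce its own training labels, so a high score on the training data tells us little by itself.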

## Test the model

Now use the model to make predictions on data that was not used for training (i.e. the *test data*).

In [None]:
predictions = predict(model, X_test)

## Check how well we did

Compare the results of the model with the known answers.

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
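
If the confusion matrix is hard to read, a tiny hand-made example may help.  The labels below are invented.  In Scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels, for illustration only.
y_true = ['L', 'M', 'H', 'M', 'L']
y_pred = ['L', 'M', 'M', 'M', 'H']

# Pass labels explicitly so rows and columns follow the order L, M, H.
cm = confusion_matrix(y_true, y_pred, labels=['L', 'M', 'H'])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

The diagonal counts the correct predictions; everything off the diagonal is a mistake.  Without the `labels` argument the classes are ordered alphabetically (H, L, M), which is easy to misread.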

In [None]:
comparePredictionsWithOriginals(X_test, predictions, Y_test)

## Apply the model
Use data that we have not yet seen to try to make real predictions.

Load the unseen data.  This contains European indicators for 2010.

In [None]:
unseen_dataset = readCsv('european indicators 2010.csv')
unseen_dataset

In [None]:
unseen_dataset = appendClass(unseen_dataset, 'PopGrowthBand','PopGrowth',[-3,0,0.5,2],['L','M','H'])
unseen_dataset = selectCols(unseen_dataset, ['InternetUsers', 'MobileCellular', 'PopGrowthBand'])
classDistribution(unseen_dataset, 'PopGrowthBand')
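
One thing to watch when reusing band boundaries on new data: values outside the original range may not fall into any band.  The sketch below shows how `pandas.cut` handles this.  Whether `appendClass` does the same is an assumption on my part, and the 2010 values here are invented.

```python
import pandas as pd

# Invented 2010 growth rates; the last one lies above the top boundary of 2.
growth_2010 = pd.Series([-0.4, 0.3, 1.2, 2.6], name='PopGrowth')

# Same boundaries and labels as used for the 2000 data.
bands = pd.cut(growth_2010, bins=[-3, 0, 0.5, 2], labels=['L', 'M', 'H'])

# Out-of-range values get no band and show up as missing.
n_unbanded = int(bands.isna().sum())
print(bands.tolist(), n_unbanded)
```

It is worth comparing the min and max of the new column with your boundaries before trusting the class distribution.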

Split into target feature and input features.

In [None]:
X,Y = splitXY(unseen_dataset, 'PopGrowthBand')

Use our model to make predictions.

In [None]:
predictions = predict(model, X)

Compare our results with what we know actually happened in 2010.

In [None]:
comparePredictionsWithOriginals(X, predictions, Y)

Calculate how accurate our predictions were.

In [None]:
print(accuracy_score(Y, predictions))
print(confusion_matrix(Y, predictions))

## Summary
As you can see, my model sucks!  Hopefully you managed to create a better model by selecting a better set of input and target features.

<hr/>

**Question: >>**

**Why do you think my accuracy score degrades so much from when it was run on the training set, to when it was run on the test set, to when it was run on the unseen data?**

<hr/>