# England or Brazil: Really Simple Classification Example

In this workbook we will take our unplugged exercise and plug it in!  In the England or Brazil exercise we used our human brains to try to work out how to classify data samples from World Cups as either England or Brazil.  The samples contained the total goals scored in the tournament and the average attendance at games.  The data set was extremely small, so whilst it was handy as an example for manually classifying samples, it is not ideal for machine learning, which benefits from larger data sets.  Nevertheless, it is useful to see how the same data set that we worked on manually can be processed with machine learning in Python.

First load the libraries we need.  For this course we will use the `dasi_library` module, which combines the components we need from NumPy, pandas, Matplotlib and scikit-learn and wraps them in a simplified class called a DasiFrame.  A DasiFrame is essentially a pandas DataFrame extended with machine learning capabilities.

In [None]:
from dasi_library import *

## Load the data from the CSV file

Load our data set containing data from World Cups from 1950 to 1970.

In [None]:
dataset = readCsv('England or Brazil.csv')

## Inspect the data

First we will poke around the data to see what we can find.  
The aim is to understand the data a bit more whilst wearing our machine learning hat.  We want to understand the features and identify which features might be useful for us when training our model.

### Identify the number of features (columns) and samples (rows)
Check the size of the data set.

In [None]:
dataset.shape

### Have a quick look at the data
Take a quick look at the data to understand what you are dealing with.

In [None]:
dataset.head(10)

### Calculate descriptive stats
These give an idea of the range and spread of values for each feature.

In [None]:
dataset.describe()

## Visualise the data
We can gain a better understanding of the data using some visualisations.  

### Box plots
Box plots give an idea of spread:

In [None]:
boxPlotAll(dataset)

### Histograms
Histograms give an idea of distribution:

In [None]:
histPlotAll(dataset)

### Comparative histograms
This will help us see how features behave for the two classes.

In [None]:
classComparePlot(dataset,'Country', plotType='hist')

## Split the data into target feature and input features
Our aim is to use the input features to predict the target feature.


### Select our target feature

For a classification task, the target feature is a feature with 2 or more unique values that we are trying to predict.  Here it is the country, England or Brazil.

### Split out the target feature

By convention, Y is the set of target values for the samples.  These are the values we hope our model will be able to predict. By convention, X is the set of input samples.  These are the values we will use to create a model.
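To make the X/Y convention concrete, here is a minimal plain-Python sketch of the idea. It is not how `splitXY` is implemented; the helper `split_xy`, the column names `Goals` and `Attendance`, and the row values are all invented for illustration.

```python
# Toy rows invented for illustration; these are NOT the real World Cup figures,
# and the column names "Goals" and "Attendance" are assumptions.
rows = [
    {"Goals": 10, "Attendance": 40000, "Country": "Brazil"},
    {"Goals": 4, "Attendance": 30000, "Country": "England"},
    {"Goals": 14, "Attendance": 50000, "Country": "Brazil"},
]

def split_xy(rows, target):
    """Separate each row into its input features (X) and its target value (Y)."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    Y = [row[target] for row in rows]
    return X, Y

X_toy, Y_toy = split_xy(rows, "Country")
print(X_toy)  # the input features, with the Country column removed
print(Y_toy)  # the target values, one per row
```

The rows of X and the entries of Y stay in the same order, so the first entry of Y is the answer for the first row of X.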

In [None]:
X,Y = splitXY(dataset, 'Country')

In [None]:
X

In [None]:
Y

## Split the data set into training and test sets
We will train the model on the training set and test it using the test set.

We will use 67% of the data for training and 33% for testing.


The seed parameter starts off the random number generator at a given point.  Using the same seed will generate the same random split.  This is useful if we want to repeat the exact same experiment, say to compare different algorithms.  
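The effect of the seed can be sketched with Python's standard `random` module. This is an illustration only: the helper `train_test_split_sketch` is hypothetical, and the real `trainTestSplit` may shuffle and round differently.

```python
import random

def train_test_split_sketch(samples, test_size, seed):
    """Shuffle the sample indices with a seeded generator, then cut off the test share."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)      # same seed -> same shuffle order
    n_test = round(len(samples) * test_size)  # assumption: simple rounding
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test

samples = list(range(12))  # stand-ins for 12 rows of data

first = train_test_split_sketch(samples, test_size=0.33, seed=1)
second = train_test_split_sketch(samples, test_size=0.33, seed=1)

print(first == second)               # True: same seed, same split
print(len(first[0]), len(first[1]))  # 8 4
```

Calling the function with a different seed will usually put different rows in the test set, which is why results can change when you change the seed on a small data set.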


<hr/>

**You choose >>**

**Go ahead and change the seed number to another integer.**

<hr/>

In [None]:
test_size = 0.33
seed = 1
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)

In [None]:
X_train

In [None]:
X_test

In [None]:
Y_train

In [None]:
Y_test

## Train the model
Use the training data to devise a model that can perform our predictions.

We will use the decision tree algorithm to train our model.

In [None]:
model = modelFit(X_train, Y_train, DecisionTreeClassifier)

Let's see how the model performs on the *training data*:

In [None]:
# Make predictions based on the training data
predictions = predict(model, X_train)

# Compare the results of the model with the known answers for the training data.
print(accuracy_score(Y_train, predictions))
print(confusion_matrix(Y_train, predictions))

## Test the model

Now use the model to make predictions on data that was not used for training (i.e. the *test data*).

In [None]:
predictions = predict(model, X_test)

## Check how well we did

Compare the results of the model with the known answers.

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
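To show what these two numbers mean, here is a small plain-Python sketch that computes them by hand. It assumes `accuracy_score` and `confusion_matrix` follow the scikit-learn convention (rows are the actual classes, columns are the predicted classes). The helper names and the labels below are made up for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known answers."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion(actual, predicted, labels):
    """Count outcomes: rows are the actual classes, columns the predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Made-up labels for illustration only
actual = ["Brazil", "Brazil", "England", "England"]
predicted = ["Brazil", "England", "England", "England"]

print(accuracy(actual, predicted))                          # 0.75
print(confusion(actual, predicted, ["Brazil", "England"]))  # [[1, 1], [0, 2]]
```

In this toy case the first row says that of the two actual Brazil samples, one was predicted as Brazil and one as England. The numbers on the diagonal are the correct predictions.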

In [None]:
comparePredictionsWithOriginals(X_test, predictions, Y_test)

In [None]:
predictions

## Apply the model
Use data that we have not yet seen to try to make real predictions.

Load the unseen data.  This contains data from World Cups from 1974 onwards.

In [None]:
unseen_data = readCsv('England or Brazil Unseen.csv')
unseen_data

Split into target feature and input features.

In [None]:
X,Y = splitXY(unseen_data, 'Country')

Use our model to make predictions.

In [None]:
predictions = predict(model, X)

Compare our results with what we know actually happened from 1974 onwards.

In [None]:
comparePredictionsWithOriginals(X, predictions, Y)

Calculate how accurate our predictions were.

In [None]:
print(accuracy_score(Y, predictions))
print(confusion_matrix(Y, predictions))

## Summary
So our model works.  It's not brilliant, but that's mainly because the data set is extremely small, so the statistical algorithms don't have much to work with.  But hopefully you now have a better sense of how the machine learning process works.

## Visualising the Decision Tree

Just like with our paper-based exercises, we can interrogate the machine learning algorithm to draw a visualisation of the decision tree.

To run this code you need to first install graphviz:

<code>pip install graphviz</code>

On Linux also install the following:

<code>sudo apt-get install graphviz</code>


In [None]:
viewDecisionTree(model, X.columns)

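The numbers in each node are as follows:
- gini is a measure of the impurity of the labels at that node.  A value of 0 means every sample reaching the node belongs to the same class.
- value tells us how many training samples of each class (England or Brazil) reached that node.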
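The Gini impurity of a node can be worked out by hand from its class counts: it is 1 minus the sum of the squared class proportions. The sketch below uses made-up labels, and the helper `gini` is written for illustration rather than taken from the library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Made-up nodes for illustration only
print(gini(["Brazil"] * 4))                                # 0.0   (pure node)
print(gini(["Brazil", "Brazil", "England", "England"]))  # 0.5   (evenly mixed)
print(gini(["Brazil", "Brazil", "Brazil", "England"]))   # 0.375
```

With two classes the impurity ranges from 0 (all one class) to 0.5 (an even mix), so lower values lower down the tree mean the splits are separating England from Brazil more cleanly.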

<hr/>

**Question: >>**

**How does this decision tree compare with the ones you came up with?  Has the computer done a good job in modelling the underlying relations in the data?**

<hr/>