## Classification

Building a classifier involves four steps:

1. Organise and clean the dataset
2. Divide the dataset into training and testing subsets
3. Fit the classifier so that it associates feature attributes in the training dataset with known classifications
4. Test the strength of the model fit by predicting the classifications within the test dataset


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places

First of all, read in the data required for the classification.

In [None]:
data = pd.read_csv("")

Organising and cleaning the data includes:

1. Split `data` into two datasets corresponding to the predictors `X` (stored below as `attributes`) and the response variable `y`
2. Split the datasets into a training set and a testing set

The response variable here is the cluster that each MSOA belongs to, as assigned by our clustering algorithm.

This requires splitting the dataset into one dataset containing the attribute data known as `attributes`, and one containing the classifications known as `y`.

In [None]:
attributes = data.drop("labels", axis = 1)
y = data["labels"]

Since we don't have any categorical variables, we can convert the attributes and classifications into NumPy arrays, which is the format the classifier algorithms work with.

In [None]:
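# this could be an issue but we shall see
# to_numpy() returns a new array rather than converting in place, so the results are assigned
attributes = attributes.to_numpy()
y = y.to_numpy()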
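
As a small illustration of the conversion step, the sketch below uses a made-up three-row table (the column names `feature_1` and `feature_2` and all values are hypothetical, not the MSOA data). It shows that `to_numpy()` returns a new array and leaves the original DataFrame untouched, so the result has to be assigned to a name if it is to be used later.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "feature_1": [0.1, 0.4, 0.9],
    "feature_2": [1.0, 2.0, 3.0],
    "labels": [0, 1, 1],
})

toy_attributes = toy.drop("labels", axis=1)   # predictors
toy_y = toy["labels"]                         # response variable

# to_numpy() returns a new array; toy_attributes itself is still a DataFrame
X_array = toy_attributes.to_numpy()
y_array = toy_y.to_numpy()

print(type(toy_attributes).__name__, type(X_array).__name__)
print(X_array.shape, y_array.shape)
```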


The final stage in data preprocessing involves splitting the prepared dataset into training and testing subsets. The training data will be used to create the classifier; the testing data will then be used to test the accuracy of the classification.

This is done using the `train_test_split` function from `scikit-learn`, which splits the attribute and label data into training and testing subsets.

By default this splits the data 75:25 between training and testing, roughly in line with convention.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# named train_d / test_d to match the names used when fitting the classifiers below
train_d, test_d, train_lab, test_lab = train_test_split(attributes, y)
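
For reference, the split can also be written with its settings made explicit. This is a sketch on synthetic data from `make_classification` (an assumption, used only so that the example runs on its own). `test_size=0.25` matches the default, while `random_state=42` and `stratify` are optional extras that are not part of the original call.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# synthetic stand-in for the real attributes and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,     # same as the default 75:25 split
    random_state=42,    # fixes the shuffle so the split is reproducible
    stratify=y,         # keeps class proportions similar in both subsets
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` is mainly useful when results need to be compared between runs.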

## Decision tree

The first classification method we will use is that of a decision tree.

This takes two arrays as inputs: an array X of size `[n_samples, n_features]` holding the training samples, and an array Y of integer values, size `[n_samples]`, holding the class labels for the training samples.

In [None]:
from sklearn.tree import DecisionTreeClassifier

RSEED = 42  # assumed value: RSEED is not defined earlier in this section; any fixed integer makes the fit reproducible
clf_decision_tree = DecisionTreeClassifier(random_state=RSEED)  # creates the decision tree classifier with a fixed random seed
clf_decision_tree.fit(train_d, train_lab)
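
To make the expected input shapes concrete, here is a minimal sketch with hand-made arrays (the values are invented for illustration): `X` has one row per sample and one column per feature, and `y` has one integer label per sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# 6 samples, 2 features -> shape (n_samples, n_features) = (6, 2)
X = np.array([[0.0, 1.0], [0.2, 0.8], [0.1, 0.9],
              [1.0, 0.1], [0.9, 0.3], [0.8, 0.2]])
# one integer class label per sample -> shape (n_samples,) = (6,)
y = np.array([0, 0, 0, 1, 1, 1])

toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(X, y)

print(X.shape, y.shape)
predictions = toy_tree.predict([[0.15, 0.85], [0.95, 0.2]])
print(predictions)
```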

We can first check the accuracy of the model on the training data:

In [None]:
print(f'Model Accuracy: {clf_decision_tree.score(train_d, train_lab)}')

We then test the accuracy of the model on the test data:

In [None]:
test_pred_decision_tree = clf_decision_tree.predict(test_d)

In [None]:

from sklearn import metrics  # provides classification_report
print(metrics.classification_report(test_lab, test_pred_decision_tree))
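
Alongside the classification report, the overall accuracy and a confusion matrix can be useful summaries. The labels below are invented to keep the example self-contained; they are not results from the model above.

```python
from sklearn import metrics

# hypothetical true and predicted labels for 8 test samples
true_labels = [0, 0, 1, 1, 1, 2, 2, 2]
pred_labels = [0, 1, 1, 1, 0, 2, 2, 1]

accuracy = metrics.accuracy_score(true_labels, pred_labels)
confusion = metrics.confusion_matrix(true_labels, pred_labels)

print(f"Accuracy: {accuracy:.3f}")
print(confusion)   # rows are true classes, columns are predicted classes
```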

Once a tree has been trained we can plot it. A depth limit of 5 keeps the plot readable.

In [None]:
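from sklearn import tree

# fit a shallower tree (max_depth=5) so that the plot stays readable
decision_tree_depth_5 = DecisionTreeClassifier(max_depth=5)
decision_tree_depth_5.fit(train_d, train_lab)

# create the figure before drawing, so that the tree is drawn at the requested size
plt.figure(figsize=(20, 20))
# filled=True colours nodes by majority class; class_names=True adds the `class` line to each node
tree.plot_tree(decision_tree_depth_5, filled=True, class_names=True)
plt.show()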

Here all nodes except the leaf nodes (the terminal nodes) have 5 parts. Leaf nodes don't have a question because they are where the final prediction is made:

1. The question asked about the data, based on the value of a feature. The answer is either true or false, and this determines how the node is split
2. `gini`: the Gini impurity of the node. The average weighted Gini impurity decreases as we move down the tree
3. `samples`: the number of observations in the node
4. `value`: the number of samples in each class
5. `class`: the majority classification for points in the node (shown when `class_names` is passed to `plot_tree`). In the case of leaf nodes, this is the prediction for all samples in the node
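
The same quantities shown in each plotted node can also be read from the fitted tree's `tree_` attribute. This sketch fits a small tree on the iris dataset bundled with scikit-learn (chosen only for convenience, it is not the dataset used here) and prints the parts of the root node.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

t = small_tree.tree_
root = 0  # node 0 is always the root

# the question asked at the root: is feature <= threshold?
print("question:", iris.feature_names[t.feature[root]], "<=", t.threshold[root])
print("gini:", t.impurity[root])              # Gini impurity of the node
print("samples:", t.n_node_samples[root])     # number of observations in the node

# leaf nodes have no children, which is marked by -1 in children_left
is_leaf = t.children_left == -1
print("number of leaves:", is_leaf.sum())
```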

We can also view the rules in text format, which is more intuitive to read.

In [None]:
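# the code is based on this link: https://stackoverflow.com/a/57335067/4667568
from sklearn.tree import export_text
# export_text needs a fitted tree, so fit the depth-5 tree first
decision_tree_depth_5 = DecisionTreeClassifier(max_depth=5).fit(train_d, train_lab)
# feature names come from the original columns, minus the "labels" column
feature_names = list(data.drop("labels", axis=1).columns)
tree_rules = export_text(decision_tree_depth_5, feature_names=feature_names)
print(tree_rules)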

The Gini impurity of a node is the probability that a randomly chosen sample in the node would be incorrectly labelled if it were labelled at random according to the distribution of classes in the node.
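
Put as a calculation, the Gini impurity is 1 minus the sum of the squared class proportions in the node. A minimal sketch of this, where `gini_impurity` is a helper written for this example rather than a scikit-learn function:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node, given the class labels of its samples."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

pure = gini_impurity([0, 0, 0, 0])       # a pure node
two_way = gini_impurity([0, 0, 1, 1])    # two classes, evenly mixed
three_way = gini_impurity([0, 1, 2])     # three classes, evenly mixed
print(pure, two_way, three_way)
```

A pure node scores 0, and the score grows as the classes become more evenly mixed.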

At each node the decision tree searches through the features for the value to split on that results in the greatest reduction in Gini impurity (an alternative criterion is the information gain).
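
A simplified sketch of that search for a single numeric feature is shown below. It only illustrates the idea: scikit-learn's actual implementation is more involved, and `gini` and `best_split` are hypothetical helper names with made-up data.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(feature, labels):
    """Return (threshold, reduction) giving the largest Gini reduction."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    parent = gini(labels)
    values = np.unique(feature)
    # candidate thresholds are midpoints between consecutive distinct values
    thresholds = (values[:-1] + values[1:]) / 2
    best = (None, 0.0)
    for thr in thresholds:
        left, right = labels[feature <= thr], labels[feature > thr]
        # impurity of the children, weighted by the share of samples in each
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - weighted > best[1]:
            best = (thr, parent - weighted)
    return best

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, reduction = best_split(x, labels)
print(threshold, reduction)
```

A full tree repeats this search over every feature at every node and keeps the best result.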

If the tree is allowed to keep growing, the weighted total Gini impurity of the last layer eventually goes to 0, so that every leaf node is completely pure. This means that the model may be overfitting, because the nodes are constructed using only the training data.

To limit overfitting we can set a maximum depth for the tree, as was done with `max_depth=5` above.
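
One way to see the overfitting described above is to compare training and test accuracy for an unrestricted tree and a depth-limited one. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption made so that the example is self-contained), and `max_depth=3` is an arbitrary choice. The exact scores will differ on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, which makes overfitting easier to see
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

A large gap between the training and test scores of the unrestricted tree is the usual sign of overfitting.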

## Random forest classifier

A random forest combines the predictions of many decision trees.

The `n_estimators` parameter specifies how many trees should be created in the construction of the whole forest. The more trees chosen, the longer the fitting will take.

The benefit of this is that it injects some randomness into the fitting of the trees, which reduces the overfitting seen with single decision trees and hence usually produces a better overall model.
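
As a rough illustration of `n_estimators`, the sketch below fits forests of two different sizes on synthetic data (an assumption for the sake of a runnable example) and checks that the number of fitted trees matches the setting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for n_trees in (5, 50):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    # estimators_ holds the individual fitted trees
    print(n_trees, len(forest.estimators_), round(forest.score(X_test, y_test), 3))
```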


In [None]:
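from sklearn.ensemble import RandomForestClassifier

# named rf so that the later cells (rf.feature_importances_, rf.estimators_) refer to this fitted model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(train_d, train_lab)
test_pred_random_forest = rf.predict(test_d)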

In [None]:
from sklearn import metrics
print(metrics.classification_report(test_lab, test_pred_random_forest))

In [None]:
print(rf.feature_importances_)
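
The raw array of importances is easier to read when paired with feature names. A sketch using the iris dataset bundled with scikit-learn (a stand-in for the real data):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# importances are normalised to sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=iris.feature_names)
print(importances.sort_values(ascending=False))
```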

As with a single tree, you can inspect the structure and rules of each tree inside a random forest. If we have created and trained a `RandomForestClassifier` called `rf`, we can extract the first tree using `rf.estimators_[0]`.
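
A short sketch of this, again on the bundled iris dataset rather than the data used here. The forest is kept small and shallow (`n_estimators=10`, `max_depth=3`, both arbitrary choices) so that the printed rules stay short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(iris.data, iris.target)

first_tree = rf.estimators_[0]   # each element is a fitted DecisionTreeClassifier
print(isinstance(first_tree, DecisionTreeClassifier))
print(export_text(first_tree, feature_names=list(iris.feature_names)))
```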