# Random Forest - Tutorial

In this project we will train a Random Forest model to classify data from the breast cancer dataset available in the sklearn.datasets library. The first part of this notebook is a tutorial that uses the iris dataset to show how to access the data, how to examine it, and how to create and train a Random Forest model.

In [None]:
# Necessary libraries for data exploration
import numpy as np # matrix operations
import pandas as pd # Dataframes

# Datasets are available through this
import sklearn.datasets as datasets

import matplotlib.pyplot as plt
%matplotlib inline

## Load the data

The iris dataset provided by sklearn is loaded with the following code (sklearn.datasets provides a corresponding loader function for each of its bundled datasets). The returned object behaves like a dictionary: the information inside it is accessible by indexing with the "keys" of the dictionary.

In [None]:
tutorial_data = datasets.load_iris()
print("Available information in data:")
print(tutorial_data.keys())

For example, the target values (also known as the labels in labeled data) are accessed by using the "target" key and the actual data is accessed via the "data" key.

In [None]:
# Accessing different parts of the data
print("First datapoint in the dataset:", tutorial_data["data"][0])
print("Target value of the first instance: ", tutorial_data["target"][0],  "= (" + tutorial_data["target_names"][tutorial_data["target"][0]] + ")")
print("The names of the targets:", tutorial_data["target_names"])
print("The features available for every datapoint:", tutorial_data["feature_names"])
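
The same information can also be inspected as a table. The following is a minimal sketch, assuming pandas is installed; the names `iris` and `iris_df` and the extra `species` column are illustrative choices, not part of the dataset itself.

```python
import pandas as pd
import sklearn.datasets as datasets

# Load a fresh copy of the dataset (the name `iris` is an illustrative choice)
iris = datasets.load_iris()

# One row per instance, one column per feature
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Translate the integer targets into class names and store them as a column
iris_df["species"] = iris["target_names"][iris["target"]]

# Show the first rows and the number of instances per class
print(iris_df.head())
print(iris_df["species"].value_counts())
```

A DataFrame like this makes it easy to check at a glance how many instances and features there are, and whether the classes are balanced.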

We now split the data into three separate sets: the training set, the validation set and the test set. The training set contains the data and the respective targets on which the model is trained. The validation set is used to evaluate the model after training. Once the model performs well on the validation set, it is evaluated on the test set. This two-step evaluation guards against bias towards the training and validation sets: a model may work well on both and still fail on other unseen data.

In [None]:
from sklearn.utils import shuffle
# Shuffle the dataset
tutorial_data["data"], tutorial_data["target"] = shuffle(tutorial_data["data"], tutorial_data["target"])

from sklearn.model_selection import train_test_split

# Separation into training and test sets
train_data, test_data, train_labels, test_labels = train_test_split(tutorial_data["data"], tutorial_data["target"], test_size=0.3)

# Separation of test set into test and validation
test_data, validation_data, test_labels, validation_labels = train_test_split(test_data, test_labels, test_size = 0.5)

# Show the number of instances per target class in each set
print(np.bincount(train_labels), np.bincount(validation_labels), np.bincount(test_labels))
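
As a variation on the split above, the sketch below splits row indices instead of the data itself, which makes it easy to confirm that no instance ends up in more than one set. It assumes you want reproducible, class-balanced splits, so it passes `random_state` and `stratify` to `train_test_split`; the 70/15/15 proportions mirror the cell above and the variable names are illustrative.

```python
import numpy as np
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
targets = iris["target"]
indices = np.arange(len(targets))

# 70% of the indices for training, 30% held out.
# stratify keeps the class proportions equal in both parts.
train_idx, holdout_idx = train_test_split(
    indices, test_size=0.3, stratify=targets, random_state=0
)

# Split the held-out indices in half: validation and test
val_idx, test_idx = train_test_split(
    holdout_idx, test_size=0.5, stratify=targets[holdout_idx], random_state=0
)

# Sizes of the three sets and the class counts in the training set
print(len(train_idx), len(val_idx), len(test_idx))
print(np.bincount(targets[train_idx]))

# The three index sets are disjoint if their union covers every instance once
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print("No overlap:", len(np.unique(all_idx)) == len(indices))
```

The data for each set is then obtained by indexing, for example `iris["data"][train_idx]`.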


## Data exploration

For further analysis we can display the data in a number of ways. Here we simply plot each data instance as a line: the x-axis lists the features, and the height of the line at each feature shows that instance's value for it. Take the example below.

In [None]:
# Change size of plot
plt.figure(figsize=(13, 7))

# Plot all setosa instances as blue lines
plt.plot(train_data[train_labels == 0].T, color = "blue")

# Plot all versicolor instances as red lines
plt.plot(train_data[train_labels == 1].T, color = "red")

# Plot all virginica instances as green lines
plt.plot(train_data[train_labels == 2].T, color = "green")

# Set the labels on the x-axis equal to the feature names
plt.xticks(np.arange(len(tutorial_data["feature_names"])),tutorial_data["feature_names"])
plt.show()

We can see that the instances in the training set follow distinct patterns depending on which target class they belong to. In the figure above, the blue lines (instances belonging to the setosa class) show that the petal length and petal width differ clearly from the other classes. The red (versicolor) and green (virginica) instances also differ at those points (red instances are mainly below green instances for the features petal length and petal width). Based on this, we can see that petal length and width are good features for dividing the data into the different target categories.

Note that these kinds of patterns are not always this obvious in other scenarios, for example when the data is noisy.
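
The visual impression can be backed up with numbers. The sketch below computes the mean of every feature per class; as a simplifying assumption it uses the complete iris dataset rather than only the training set, and the variable names are illustrative.

```python
import numpy as np
import sklearn.datasets as datasets

iris = datasets.load_iris()
X, y = iris["data"], iris["target"]

# One row per class, one column per feature
class_means = np.array(
    [X[y == class_index].mean(axis=0) for class_index in range(len(iris["target_names"]))]
)

print("Features:", iris["feature_names"])
for class_name, means in zip(iris["target_names"], class_means):
    print(class_name, np.round(means, 2))
```

Features whose class means lie far apart relative to their spread, such as the petal measurements here, are the ones that separate the classes well.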


## Random Forest

Following the exploration stage we can move on to creating an instance of a random forest model. A random forest model can take a number of parameters to optimize the training of the model. In this project we mainly consider:
- The number of estimators in the model (how many decision trees the forest consists of)
- The function for determining the quality of a split (sklearn supports, among others, "gini" and "entropy")
- The depth of the estimators (how big each decision tree is allowed to get)


For more information about the available functionality of random forest models in sklearn, visit: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
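
The three parameters listed above correspond to the `n_estimators`, `criterion` and `max_depth` arguments of `RandomForestClassifier`. The following sketch shows one way to compare a few settings on a held-out validation set; the specific values tried, the single train/validation split and the variable names are arbitrary choices for illustration.

```python
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris["data"], iris["target"], test_size=0.3, random_state=0
)

# Validation accuracy for every combination of split criterion and tree depth
results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (2, None):  # None lets each tree grow without a depth limit
        model = RandomForestClassifier(
            n_estimators=50, criterion=criterion, max_depth=max_depth, random_state=0
        )
        model.fit(X_train, y_train)
        results[(criterion, max_depth)] = model.score(X_val, y_val)

for setting, accuracy in results.items():
    print(setting, accuracy)
```

On a small and clean dataset such as iris the settings are likely to score similarly; differences tend to matter more on larger or noisier data.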

In [None]:
# Get the algorithm from sklearn
from sklearn.ensemble import RandomForestClassifier

# Create a model instance
rf = RandomForestClassifier(n_estimators = 100, criterion="gini", max_depth = None, random_state= 0)

# Train the model on training data
rf.fit(train_data, train_labels)

# Get model predictions on training data
predictions = rf.predict(train_data)

# Calculate accuracy on training
train_acc = sum(predictions ==  train_labels)/len(train_labels)
print("Final training accuracy:", train_acc)

# Get accuracy on validation set
validation_accuracy = sum(rf.predict(validation_data) ==  validation_labels)/len(validation_labels)
print("Validation accuracy:", validation_accuracy)
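
The accuracy computed by hand above can also be obtained from sklearn directly. This sketch, which trains its own small model on an illustrative split, shows that `accuracy_score` and the classifier's `score` method should give the same value as the manual calculation.

```python
import numpy as np
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_val, y_train, y_val = train_test_split(
    iris["data"], iris["target"], test_size=0.3, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
val_predictions = model.predict(X_val)

# Fraction of correct predictions, computed in three equivalent ways
manual = np.mean(val_predictions == y_val)
via_metric = accuracy_score(y_val, val_predictions)
via_score = model.score(X_val, y_val)

print(manual, via_metric, via_score)
```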

Provided the validation accuracy is good, we can then test the model on the test data.

In [None]:
# Accuracy of test set once training and validation are "good enough"
test_accuracy = sum(rf.predict(test_data) ==  test_labels)/len(test_labels)
print("Test accuracy:", test_accuracy)

Finally, one of the exciting things about random forest models is the ability to calculate the most "predictive" features of the dataset, i.e. which features are most valuable when predicting the target value of a given data instance. The importances are computed during training and stored within the model itself. You can access them via the following attribute:

In [None]:
rf.feature_importances_

The array gives the relative importance of each feature in making predictions, expressed as fractions that sum to 1. The index of the array corresponds with the order of the features in the data. Based on the scores, we can see that the features petal length and petal width have higher importance in categorization than sepal length and sepal width.
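
To make the array easier to read, the importances can be paired with the feature names and sorted. This is a minimal sketch that, for simplicity, trains a model on the complete iris dataset; the variable names are illustrative.

```python
import numpy as np
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris["data"], iris["target"])

importances = model.feature_importances_

# Feature indices ordered from most to least important
order = np.argsort(importances)[::-1]
ranked = [(iris["feature_names"][i], importances[i]) for i in order]

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")

print("Sum of importances:", importances.sum())
```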

# Random Forest - Project

Your task is to perform a similar analysis of a dataset consisting of breast cancer tumours. The tasks are as follows:
- Find out what the features and the target values are
- Sort the data into three separate datasets (training, validation and test)
- Analyse the data by visualization. Can you find any patterns in the data which can be used to determine the target value?
- Create a random forest model for categorizing the data instances into the target categories and calculating feature importance.

Feel free to copy and test the code given in the tutorial above.

In [None]:
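import random
random.seed(2021)
# sklearn draws from NumPy's global random state by default, so seed it as well
np.random.seed(2021)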

## Load the data

In [None]:
Cancer_data = datasets.load_breast_cancer()

### Data

From the documentation: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset


"This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets. https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using Multisurface Method-Tree (MSM-T) [K. P. Bennett, “Decision Tree Construction Via Linear Programming.” Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34]."

## Examine data structure

Try to find out what the dataset consists of by printing the feature names and the target names.

Q1. How many instances are available in the complete dataset?

## Sort data into different sets

Create a training, validation and test set from the dataset. You can choose how big each set should be, but make sure no instance appears in more than one set! (Remember to shuffle the data.)

## Analyze the data

Use matplotlib (or any other means) to visualise the data. 

## Random forest model

Create your model instance here. Calculate the accuracy of the model on the validation and the test set.

Train the model in different setups using the gini criterion and the entropy criterion. Use 20 decision trees in your model, each of maximum depth 3.

Q2. Which feature is the most important one?

Q3. What class does the model predict for an input consisting only of zeros?