# Classification Project

In this project, you will work with the breast cancer dataset again and use different classification models. The goal is to predict whether a case is malignant or benign. You will apply what you have learned about cross-validation and hyperparameter search.

## Part 0 - Importing the Dataset

The cell below imports the libraries you need and loads the breast cancer dataset. Run it without modification before you proceed.

In [None]:
# Non-sklearn packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Loading the dataset once and reusing it
data = load_breast_cancer()

# Getting the data and targets
X = pd.DataFrame(data['data'], columns=data['feature_names'])
y = pd.DataFrame(data['target'], columns=["target"])

# Printing out description of the dataset
print(data['DESCR'])

## Part 1 - Exploring the Dataset

When exploring the data, it is convenient to have the features and targets in a single dataframe. Combine the variables `X` and `y` into a single dataframe called `combined`.
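As a rough illustration of the idea (on made-up toy data, not the project data), two dataframes with the same row index can be placed side by side with `pd.concat`. The names `X_toy`, `y_toy`, and `combined_toy` are hypothetical and only serve this sketch.

```python
import pandas as pd

# Toy stand-ins for X and y (hypothetical values, not the project data)
X_toy = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0], "feature_b": [0.5, 0.1, 0.9]})
y_toy = pd.DataFrame({"target": [0, 1, 1]})

# axis=1 puts the dataframes next to each other, aligning rows on the index
combined_toy = pd.concat([X_toy, y_toy], axis=1)

print(combined_toy.shape)
print(list(combined_toy.columns))
```

The same pattern should carry over to `X` and `y`, since both were built from the same dataset and share a default integer index.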

In [None]:
# Combine into a single dataframe


In [None]:
# Show the first 5 rows of the dataframe


In [None]:
# Get summary statistics for the columns


In [None]:
# Check if there are any missing values


In [None]:
# How many of the patients are in the class "malignant" or "benign"


In [None]:
# Visualize with a bar chart how many of the patients are in the class "malignant" or "benign"


In [None]:
# Check the correlation of the combined dataframe


In [None]:
# Plot the correlation of combined with a heatmap (HINT: sns.heatmap)


Which features look to be highly correlated with the target?
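A heatmap can be hard to read when there are many columns. One possible complement, sketched here on made-up toy data, is to pull out the `target` column of the correlation matrix and rank the features by absolute correlation, since a strongly negative correlation is as informative as a strongly positive one. The dataframe `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: feature_a rises with the target, feature_b tends to fall
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [5.0, 3.0, 4.0, 1.0, 2.0],
    "target": [0, 0, 1, 1, 1],
})

# Correlation of every feature with the target (dropping the target itself)
target_corr = toy.corr()["target"].drop("target")

# Rank by absolute value so that negative correlations are not overlooked
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```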

## Part 2 - Preprocessing

It is now time to start preprocessing the dataset. There are no missing values, so we only need to scale the features. Before doing this, we should set aside a test set, since we plan to evaluate our final chosen model on it. Splitting BEFORE any data processing helps us avoid data leakage.

We no longer need the variable `combined` and can work with the `X` and `y` variables. Let us first set aside 10% of the dataset as a test set by using `train_test_split`.
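The following is a minimal sketch of such a split on made-up toy data. The variable names (`X_main`, `X_test`, `y_main`, `y_test`) are hypothetical, and the choices of `random_state=0` and `stratify` are assumptions added for reproducibility and balanced classes; the project does not prescribe them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins with 20 rows (hypothetical values, not the project data)
X_toy = pd.DataFrame(np.arange(40).reshape(20, 2), columns=["f1", "f2"])
y_toy = pd.DataFrame({"target": [0, 1] * 10})

# test_size=0.1 holds out 10% of the rows; random_state makes the split repeatable;
# stratify keeps the class proportions similar in both parts (an optional assumption)
X_main, X_test, y_main, y_test = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=0, stratify=y_toy["target"]
)

print(X_main.shape, X_test.shape)
```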

In [None]:
# Separate 10% of the dataset


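Now we can use `StandardScaler` to scale the main part and the test set. A common approach is to fit the scaler on the main part only and reuse it to transform the test set, so that no information from the test set leaks into the preprocessing. We DO NOT need to scale the target variables.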
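One common way to scale without leaking test information, sketched here on made-up toy arrays, is to call `fit_transform` on the main part and only `transform` on the test set. This is an assumption about good practice rather than the only accepted solution; the array names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the main part and the test set (hypothetical values)
X_main_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# Learn the mean and standard deviation from the main part, then scale it
X_main_scaled = scaler.fit_transform(X_main_toy)

# Reuse the statistics learned above; the test set is never used for fitting
X_test_scaled = scaler.transform(X_test_toy)

print(X_main_scaled.mean(axis=0), X_main_scaled.std(axis=0))
print(X_test_scaled)
```

Note that the scaled test set will generally not have mean 0 and standard deviation 1 exactly, because it is scaled with the statistics of the main part.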

In [None]:
# Use StandardScaler to scale both the main part, and the test set


The shape of the targets indicates that they are two-dimensional, with a single column. The estimators we will pass to `GridSearchCV` later expect one-dimensional targets, so use `np.ravel` to flatten them.
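To see what flattening does, here is a small sketch on a made-up single-column dataframe (the names `y_toy` and `y_flat` are hypothetical). `np.ravel` converts its input to a NumPy array and returns a one-dimensional version of it.

```python
import numpy as np
import pandas as pd

# A single-column target dataframe has shape (n, 1)
y_toy = pd.DataFrame({"target": [0, 1, 1, 0]})

# np.ravel returns a one-dimensional NumPy array of shape (n,)
y_flat = np.ravel(y_toy)

print(y_toy.shape, y_flat.shape)
```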

In [None]:
# Flatten the targets


## Part 3 - Random Forests and Hyperparameters

In this section we will create random forests with different numbers of decision trees. The goal is to get experience with hyperparameter search over a single hyperparameter.

In [None]:
# Create a random forest classifier


The `RandomForestClassifier` class has an `n_estimators` hyperparameter that sets the number of decision trees in the model. Create a Python dictionary called `param_grid_forest` that specifies the values `1`, `2`, `5`, `10`, `20`, `50`, `100`, `200`, `500`, `1000` for the `n_estimators` hyperparameter.
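The general pattern of a one-parameter grid search can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, with a deliberately small grid so that it runs quickly; the names `param_grid_toy` and `grid_toy`, the `cv=3` setting, and `random_state=0` are assumptions, not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic toy data (not the breast cancer dataset)
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

# Keys are hyperparameter names, values are the lists of candidates to try
param_grid_toy = {"n_estimators": [1, 5, 10]}

# GridSearchCV cross-validates one model per candidate value
grid_toy = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid_toy,
    cv=3,
)
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_)   # the candidate with the best mean CV score
print(grid_toy.best_score_)    # that mean cross-validated score
```

After fitting, `cv_results_` holds the per-candidate scores and `best_estimator_` holds the model refitted with the best candidate.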

In [None]:
# Specify the Python dictionary with the n_estimators key

# Use GridSearchCV on the random forest model with the param_grid set to the above Python dictionary


In [None]:
# Fit the GridSearchCV model to the data


In [None]:
# Check out the results


In [None]:
# Get the best score


In [None]:
# Get the best forest model


In [None]:
# Predict on the scaled test data


In [None]:
# Get the accuracy for the best forest

# Get the precision for the best forest

# Get the recall for the best forest


## Part 4 - SVMs and Hyperparameters

In this section we will create support vector machines with different values for the `C` hyperparameter and the `degree` hyperparameter. The goal is to get experience with hyperparameter search over two hyperparameters.

In [None]:
# Create a support vector machine with kernel equal to "poly"


The `SVC` class has a `C` hyperparameter and a `degree` hyperparameter (the latter is used when the `kernel` is set to `poly`). Create a Python dictionary called `param_grid_svc` that specifies the values `1`, `2`, `5`, `10` for the `C` hyperparameter, and the values `1`, `2`, `3`, `4` for the `degree` hyperparameter.
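With two hyperparameters, the grid search tries every combination of the listed values, so the number of candidates is the product of the list lengths. The sketch below illustrates this on synthetic data with a reduced 2 x 2 grid; the names `param_grid_toy`, `combos`, and `grid_toy`, as well as `cv=3`, are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# A reduced grid: 2 values for C times 2 values for degree
param_grid_toy = {"C": [1, 10], "degree": [2, 3]}

# ParameterGrid enumerates every combination that GridSearchCV would try
combos = list(ParameterGrid(param_grid_toy))
print(len(combos))

# Synthetic toy data (not the breast cancer dataset)
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

# The degree hyperparameter only has an effect with the polynomial kernel
grid_toy = GridSearchCV(SVC(kernel="poly"), param_grid=param_grid_toy, cv=3)
grid_toy.fit(X_toy, y_toy)

print(grid_toy.best_params_)
```

By the same reasoning, the 4 x 4 grid described above would give 16 candidates, each of which is cross-validated.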

In [None]:
# Specify the Python dictionary with the C and degree keys

# Use GridSearchCV on the SVC model with the param_grid set to the above Python dictionary


In [None]:
# Fit the GridSearchCV model to the data


In [None]:
# Check out the results


In [None]:
# Get the best score


In [None]:
# Get the best SVC model


In [None]:
# Predict on the scaled test data


In [None]:
# Get the accuracy for the best SVC

# Get the precision for the best SVC

# Get the recall for the best SVC


What conclusions can you draw from the results? Should you continue with the support vector machine, or the random forest?
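When comparing the models, it may help to recall what each metric measures. By default, `precision_score` and `recall_score` treat the label `1` as the positive class, and in scikit-learn's copy of this dataset `1` denotes benign and `0` malignant (see `target_names`). The sketch below uses made-up labels to show how the metrics can differ and how `pos_label=0` switches the focus to the other class; all values are hypothetical.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up true labels and predictions (not results from the project)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)           # share of correct predictions
prec = precision_score(y_true, y_pred)         # of predicted 1s, how many are truly 1
rec = recall_score(y_true, y_pred)             # of true 1s, how many were found

# Recall for label 0, i.e. how many of the true 0s were found
rec_label0 = recall_score(y_true, y_pred, pos_label=0)

print(acc, prec, rec, rec_label0)
```

In a medical setting, missing a malignant case is usually the costlier mistake, so the recall for label `0` is worth a look alongside the default metrics.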