# Lesson 05 Assignment
# Houda Aynaou

## Workplace Scenario

Rooney's client is a tech-manufacturing startup working on a number of automated detection devices for the medical and construction industries. Among the auto-detection devices is a reader that examines possible carcinoma tissue samples and classifies each sample as either benign or malignant. Rooney asks you for help in developing a better algorithm than the current classifier; perhaps a decision tree can help.

For this assignment, you will design an experiment using decision tree classifiers for the detection of breast cancer and compare their accuracy using the [Breast Cancer Wisconsin Data Set](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)).

| Column                    |Description      |
|---------------------------|-----------------|
|Sample code number         | id number       |
|Clump Thickness            | 1 - 10 |     
|Uniformity of Cell Size    | 1 - 10 |
|Uniformity of Cell Shape   | 1 - 10 |
|Marginal Adhesion          | 1 - 10 |
|Single Epithelial Cell Size| 1 - 10 |
|Bare Nuclei                | 1 - 10 |
|Bland Chromatin            | 1 - 10 |
|Normal Nucleoli            | 1 - 10 |
|Mitosis                    | 1 - 10 |
|Class                      | 4 for malignant, 2 for benign |



## To do

1. Test both entropy and the gini coefficient. Which performs better and why?
2. What are the best hyperparameter settings for both?
3. Visualize both models and see which feature is selected for each criterion. Are they the same for both? Why or why not?
4. Determine the AUC for the best model you can achieve. What are the precision and recall values and which might be the one you want to maximize?
5. What are the implications of using this type of machine learning algorithm for breast cancer analysis?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings('ignore')


# 1. Data

In [None]:
LINK = 'https://raw.githubusercontent.com/houdaaynaou/DS-Certificate-UW/master/Course%203%20Machine%20Learning%20Techniques/Data/breast-cancer-wisconsin.csv'
col_names = ['ID', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitosis' , 'Class']

data = pd.read_csv(LINK, names= col_names)
data.head()


## 1.1 Inspecting Data

In [None]:
data.shape

In [None]:
data.info()

In [None]:
# Inspecting 'Bare Nuclei' column:

print("'Bare Nuclei' unique observations:\n", data['Bare Nuclei'].unique())

print('\nNumber of rows with missing value:', data[data['Bare Nuclei'] == '?'].shape[0])

print('\nObservations with "Bare Nuclei" missing values:')

data[data['Bare Nuclei'] == '?']


All features are numeric except the `Bare Nuclei` column, which has type object because it contains missing values marked as `?`. These missing values will be handled before building the decision tree.

In [None]:
data.describe()

The summary of the data shows that most features are right skewed.

In [None]:
# Histogram of column features
sns.set()
for col in list(data.drop(['ID', 'Class'], axis=1).columns):
    data[col].hist()
    plt.title('Histogram of '+ col)
    plt.show();

In [None]:
# Target variable:
sns.set()
data['Class'].hist()
plt.title('Histogram of Tumor Class')
plt.show()

print(data['Class'].value_counts())

The dataset contains **458** observations from **class 2** (*benign tumor*), nearly twice the number of observations from **class 4** (*malignant tumor*). This indicates class imbalance.

## 1.2 Handling missing values in column Bare Nuclei

In [None]:
# Median 
bare_median = np.median(pd.to_numeric(data[data['Bare Nuclei'] != '?']['Bare Nuclei']))

# Imputing missing value
missing = data['Bare Nuclei'] == '?'
data.loc[missing, 'Bare Nuclei'] = bare_median

# Coerce Column to numeric 
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'], errors='coerce')
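
The imputation step above can be illustrated without pandas. The following is a minimal sketch using only the standard library; the function name `impute_median` and the sample values are hypothetical and are not taken from the dataset.

```python
from statistics import median

def impute_median(values, missing_marker="?"):
    """Replace missing markers with the median of the observed values."""
    # Keep only the entries that are actually observed and convert them to int
    observed = [int(v) for v in values if v != missing_marker]
    fill = median(observed)
    # Substitute the median wherever the marker appears
    return [fill if v == missing_marker else int(v) for v in values]

# Hypothetical column values, for illustration only
raw = ["1", "10", "?", "2", "1", "?", "3"]
imputed = impute_median(raw)
print(imputed)
```

Note that the notebook computes the median on the full dataset before the train/test split, so test rows contribute to the fill value. A stricter variant would compute the median from the training rows only.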


# 2. Decision Tree Classifier Model
## 2.1. Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop(['ID', 'Class'], axis = 1)
Y = data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state = 42, test_size = 0.3)
y_test.value_counts()

## 2.2. Gini Coefficient Model


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Gini coefficient model
gini_tree = DecisionTreeClassifier(criterion= 'gini', random_state= 42)
gini_tree.fit(X_train, y_train)

# Predictions
pred = gini_tree.predict(X_test)
 
# Model performance
confusion_matrix(y_test, pred)


The decision tree with the Gini coefficient as its splitting criterion performed fairly well, correctly classifying 60 of the 67 malignant samples and 139 of the 143 benign samples.

## 2.3. Entropy Model

In [None]:
entropy_tree = DecisionTreeClassifier(criterion= 'entropy', random_state= 42)
entropy_tree.fit(X_train, y_train)

# Predictions
entropy_pred = entropy_tree.predict(X_test)
 
# Model performance
confusion_matrix(y_test, entropy_pred)

The decision tree with entropy as its splitting criterion also performed fairly well, correctly classifying 59 of the 67 malignant samples and 137 of the 143 benign samples.

On this test split, the Gini tree performed slightly better than the entropy tree: 199 versus 196 correct predictions out of 210.
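
To put the comparison on a common scale, the counts reported above can be converted to accuracy. This is a small sketch based only on those counts; the helper name `accuracy` is hypothetical, and the baseline assumes a classifier that always predicts the majority class (benign).

```python
def accuracy(correct_malignant, correct_benign, n_malignant=67, n_benign=143):
    """Fraction of test samples classified correctly."""
    return (correct_malignant + correct_benign) / (n_malignant + n_benign)

gini_acc = accuracy(60, 139)      # counts reported for the Gini tree
entropy_acc = accuracy(59, 137)   # counts reported for the entropy tree
baseline_acc = 143 / (67 + 143)   # always predicting benign

print(f"Gini: {gini_acc:.3f}  Entropy: {entropy_acc:.3f}  Baseline: {baseline_acc:.3f}")
```

The gap between the two criteria is three samples on a single split, so it may not reflect a systematic difference; repeating the comparison with cross-validation would give a firmer answer.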

# 3. GridSearch for best hyperparameter

## 3.1 GridSearch for Gini_tree

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

gini_tree = DecisionTreeClassifier(criterion= 'gini', random_state=42)
depths = np.arange(1, 10)
num_leafs = [1, 5, 10, 20]

param_grid = [{'max_depth':depths,
              'min_samples_leaf':num_leafs}]

# Gini Gridsearch
gini_gs = GridSearchCV(estimator = gini_tree, param_grid=param_grid, cv=5)

In [None]:
# Best parameters
gini_gs.fit(X_train, y_train)
gini_gs.best_params_

In [None]:
# predictions:
pred_gini_cv = gini_gs.predict(X_test)

# performance:
print('Confusion matrix with the best found parameters:\n',confusion_matrix(y_test, pred_gini_cv))

With the best parameters found by the grid search, the Gini tree improved, correctly detecting 61 of the 67 malignant samples (up from 60).
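
For a sense of how much work the search does, the parameter grid can be enumerated by hand. This is a rough sketch using `itertools.product` with the same value ranges as the grid above; it only counts combinations and does not train any model.

```python
from itertools import product

depths = range(1, 10)          # same values as np.arange(1, 10): 1 through 9
num_leafs = [1, 5, 10, 20]
cv_folds = 5                   # matches cv=5 in the Gini grid search

# Every (max_depth, min_samples_leaf) pair the search evaluates
grid = [{"max_depth": d, "min_samples_leaf": m} for d, m in product(depths, num_leafs)]
n_fits = len(grid) * cv_folds

print(len(grid), "candidates,", n_fits, "cross-validation fits")
```

Each candidate is scored by its mean cross-validation score on the training data, so the test set plays no part in choosing the hyperparameters.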

## 3.2 GridSearch for entropy tree Model

In [None]:
# Set the criterion explicitly; without it the classifier defaults to 'gini'
entropy_pipe_tree = DecisionTreeClassifier(criterion='entropy', random_state=42)
depths = np.arange(1, 10)
num_leafs = [1, 5, 10, 20]
ent_param_grid = [{'max_depth':depths,
              'min_samples_leaf':num_leafs}]

# entropy Gridsearch
entropy_gs = GridSearchCV(estimator= entropy_pipe_tree, param_grid=ent_param_grid, cv=10)

entropy_cv = entropy_gs.fit(X_train, y_train)

# Best parameters:
entropy_cv.best_params_


In [None]:
# predictions:
pred_entropy_cv = entropy_cv.predict(X_test)

# performance:
print('Confusion matrix with the best found parameters:\n',confusion_matrix(y_test, pred_entropy_cv))


With the best parameters found by the grid search, the entropy tree also improved, correctly detecting 61 of the 67 malignant samples (up from 59).

# 4. Visualizing Trees 

## 4.1 Gini Coefficient Model Tree with best parameters


In [None]:
from sklearn.tree import export_graphviz

# Training Decision tree model with gini coefficient and best parameters
gini_best_param = DecisionTreeClassifier(criterion= 'gini', max_depth= 4, min_samples_leaf= 1, random_state=42)
gini_best_param.fit(X_train, y_train)
export_graphviz(gini_best_param,'gini_tree.dot', feature_names = list(X.columns))

# predictions
gini_best_param_pred = gini_best_param.predict(X_test)

In [None]:
from graphviz import render
render('dot', 'png', 'gini_tree.dot')

In [None]:
from IPython.display import Image
Image('gini_tree.dot.png')

## 4.2 Entropy Model Tree

In [None]:
from sklearn.tree import export_graphviz

# Training Decision tree model with entropy and best parameters
entropy_best_param = DecisionTreeClassifier(criterion= 'entropy', max_depth= 4, min_samples_leaf= 1, random_state=42)
entropy_best_param.fit(X_train, y_train)
export_graphviz(entropy_best_param,'entropy_tree.dot', feature_names = list(X.columns))

In [None]:
render('dot', 'png', 'entropy_tree.dot')

In [None]:
Image('entropy_tree.dot.png')

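Both models select `Uniformity of Cell Size` as the split feature at the root node; further down the tree, the two models differ in the features they split on.
The criteria differ in what they measure. Gini impurity is the probability that a random sample is classified incorrectly if we randomly pick a label according to the class distribution in a branch, while entropy measures the uncertainty of that distribution. Because the two measures can rank candidate splits slightly differently, the trees can diverge below the root.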
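
The two splitting criteria can be computed directly from the class counts in a node. The sketch below uses the standard definitions; the function names and the example counts are hypothetical and are not taken from the fitted trees.

```python
from math import log2

def gini_impurity(counts):
    """Probability of mislabeling a random sample drawn from the node
    when the label is picked at random from the node's class distribution."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy (in bits) of the node's class distribution."""
    total = sum(counts)
    # Empty classes contribute nothing, so they are skipped
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

# Hypothetical (benign, malignant) counts: evenly mixed, mostly one class, pure
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, round(gini_impurity(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for an evenly mixed one, which is why the two criteria often agree on the strongest split and differ only on closer calls.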

# 5. AUC, Precision and Recall of the best model: 

## 5.1 AUC for the Gini coefficient model with best parameters:

In [None]:
from sklearn import  metrics 

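# Treat malignant (class 4) as the positive class; the predicted labels are used as scores,
# so the positive label must be the larger one for the curve to be oriented correctly
fpr, tpr, thresholds = metrics.roc_curve(y_test, gini_best_param_pred, pos_label=4)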

# Auc
print('Auc:', metrics.auc(fpr, tpr))

The area under the ROC curve (AUC) measures how well the model separates the samples into those with and without the disease in question. An area of 1 represents a perfect test; an area of 0.5 represents a test that is no better than chance.

## 5.2 Precision and Recall
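
When the ROC curve is built from hard class predictions rather than scores, it has a single interior point, and the area reduces to the average of sensitivity and specificity. The sketch below checks this with the counts reported for the untuned Gini tree in section 2.2, treating malignant as the positive class; the function name `single_point_auc` is hypothetical.

```python
def single_point_auc(tpr, fpr):
    """Trapezoidal area under the ROC polyline (0, 0) -> (fpr, tpr) -> (1, 1)."""
    left = 0.5 * fpr * tpr                    # triangle from the origin to the point
    right = 0.5 * (1.0 - fpr) * (tpr + 1.0)   # trapezoid from the point to (1, 1)
    return left + right

tpr = 60 / 67    # malignant samples correctly flagged (sensitivity)
fpr = 4 / 143    # benign samples wrongly flagged as malignant
auc = single_point_auc(tpr, fpr)

print(f"AUC: {auc:.3f}")
```

Passing class probabilities (for example from the classifier's `predict_proba` method) instead of hard labels would give the curve more points and a more informative AUC.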

In [None]:
print('Classification report for Gini coefficient model with best parameters:\n\n',metrics.classification_report(y_test, gini_best_param_pred, target_names=('2', '4')))


**Which metric to maximize? Precision or Recall?** 

The metric to maximize is recall: the proportion of actual positives that are correctly identified, that is, the percentage of women with breast cancer who are correctly identified as having the condition. Maximizing recall minimizes false negatives, which are errors where the test result indicates that no cancer is present when in reality it is.
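
As a worked example of the two metrics, the sketch below derives precision and recall for the malignant class from the counts reported for the untuned Gini tree in section 2.2. The tuned model's full confusion matrix is not reproduced here, so these numbers are illustrative rather than the final report.

```python
# Malignant (class 4) is the positive class
tp = 60    # malignant samples predicted malignant
fn = 7     # malignant samples missed (67 - 60)
tn = 139   # benign samples predicted benign
fp = 4     # benign samples flagged as malignant (143 - 139)

recall = tp / (tp + fn)       # share of actual cancers that were caught
precision = tp / (tp + fp)    # share of malignant calls that were correct

print(f"Recall: {recall:.3f}  Precision: {precision:.3f}")
```

Here the seven false negatives are the costly errors: each one is a malignant sample reported as benign, which is why recall is the metric to watch.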


# 6. Implications of using Decision Trees for breast cancer analysis

Decision trees have been used in different areas of medical decision making. They are an effective decision-making technique that can provide high classification accuracy with a simple, readable representation of the knowledge they capture. They can detect subtle similarities and differences that a human analyst may not notice, and can therefore produce more accurate and useful categories.