# **Classification Tree Exercise (Practice)**

This task uses a breast cancer data set that can be found [here](https://drive.google.com/file/d/1Zms1RfgkWrTp7S6_BFpeELYyvX1s2FSN/view?usp=sharing). The target vector is the diagnosis as either malignant (M) or benign (B).

Your task is to use a

1. decision tree classifier,

2. bagging classifier, and a

3. random forest classifier

to obtain the highest accuracy possible on the test set.

You may want to review bagging and random forests in the previous chapter, where they were explained for regression. They work much the same way in a classification problem, except that the final prediction is a majority vote over the predicted classes instead of an average of continuous values.
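To make the difference concrete, here is a minimal sketch of the two aggregation rules. The per-tree outputs are made-up values for illustration, not results from this data set.

```python
from collections import Counter
from statistics import mean

# hypothetical outputs of five trees for a single sample
tree_class_votes = ['M', 'B', 'M', 'M', 'B']   # classification: one label per tree
tree_value_preds = [2.0, 3.0, 2.5, 4.0, 3.5]   # regression: one number per tree

# classification ensembles take the most common label (majority vote)
majority_class = Counter(tree_class_votes).most_common(1)[0][0]

# regression ensembles average the continuous predictions
average_value = mean(tree_value_preds)

print(majority_class)  # M
print(average_value)   # 3.0
```

Note that scikit-learn's ensemble classifiers average the trees' predicted class probabilities rather than counting hard votes, which usually gives the same answer as a plain majority vote.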

Note that the following code imports the classification tree models required:

```
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
```


[Assignment Solution](https://github.com/coding-dojo-data-science/machine-learning-practice-solutions/blob/main/Classification_Tree_Exercise_(Practice)_Solution.ipynb)





# Preliminary Steps

In [None]:
# import libraries

# foundation
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# preprocessing
from sklearn.model_selection import train_test_split

# models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier

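# regression metrics (used only by the disabled eval_model helpers below;
# the classifiers in this notebook are scored with accuracy via .score())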
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error


In [None]:
# mount drive

from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [None]:
# load data

path = '/content/drive/MyDrive/Coding Dojo/07 Week 7: Classification Models/cancer.csv'
df = pd.read_csv(path)

In [None]:
# inspect data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [None]:
df.sample(10)

index,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
357,901028,B,13.87,16.21,88.52,593.7,0.08743,0.05492,0.01502,0.02088,...,15.11,25.58,96.74,694.4,0.1153,0.1008,0.05285,0.05556,0.2362,0.07113
344,89864002,B,11.71,15.45,75.03,420.3,0.115,0.07281,0.04006,0.0325,...,13.06,18.16,84.16,516.4,0.146,0.1115,0.1087,0.07864,0.2765,0.07806
172,87164,M,15.46,11.89,102.5,736.9,0.1257,0.1555,0.2032,0.1097,...,18.79,17.04,125.0,1102.0,0.1531,0.3583,0.583,0.1827,0.3216,0.101
268,8910506,B,12.87,16.21,82.38,512.2,0.09425,0.06219,0.039,0.01615,...,13.9,23.64,89.27,597.5,0.1256,0.1808,0.1992,0.0578,0.3604,0.07062
204,87930,B,12.47,18.6,81.09,481.9,0.09965,0.1058,0.08005,0.03821,...,14.97,24.64,96.05,677.9,0.1426,0.2378,0.2671,0.1015,0.3014,0.0875
216,8811523,B,11.89,18.35,77.32,432.2,0.09363,0.1154,0.06636,0.03142,...,13.25,27.1,86.2,531.2,0.1405,0.3046,0.2806,0.1138,0.3397,0.08365
504,915186,B,9.268,12.87,61.49,248.7,0.1634,0.2239,0.0973,0.05252,...,10.28,16.38,69.05,300.2,0.1902,0.3441,0.2099,0.1025,0.3038,0.1252
140,868999,B,9.738,11.97,61.24,288.5,0.0925,0.04102,0.0,0.0,...,10.62,14.1,66.53,342.9,0.1234,0.07204,0.0,0.0,0.3105,0.08151
281,8912055,B,11.74,14.02,74.24,427.3,0.07813,0.0434,0.02245,0.02763,...,13.31,18.26,84.7,533.7,0.1036,0.085,0.06735,0.0829,0.3101,0.06688
448,911150,B,14.53,19.34,94.25,659.7,0.08388,0.078,0.08817,0.02925,...,16.3,28.39,108.1,830.5,0.1089,0.2649,0.3779,0.09594,0.2471,0.07463


In [None]:
df.describe(include = 'number')

statistic,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


In [None]:
df.describe(include = 'object')

statistic,diagnosis
count,569
unique,2
top,B
freq,357


In [None]:
# check to see if data is balanced
df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [None]:
# check with percentages (alternative)
df['diagnosis'].value_counts(normalize = True)

B    0.627417
M    0.372583
Name: diagnosis, dtype: float64
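
The class balance also sets a baseline for the models below: a classifier that always predicts the most common class would already be right about 62.7% of the time on the full data set. A small sketch of that arithmetic, using the counts printed above:

```python
# class counts copied from the value_counts() output above
counts = {'B': 357, 'M': 212}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# accuracy of a "model" that always predicts the most common class
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))  # B 0.6274
```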

In [None]:
# check for duplicates
df.duplicated().sum()

0

In [None]:
'''
# set function for evaluating models with error metrics
# (copied from Josh Johnson's Code-Along Notebook from 4/3/2023 CD class)

## Create a function to take the true and predicted values
## and print MAE, MSE, RMSE, and R2 metrics for a model
def eval_model(y_true, y_pred, name='model'):
  """Takes true targets and predictions from a model and prints
  MAE, MSE, RMSE, AND R2 scores
  Set 'name' to name of model and 'train' or 'test' as appropriate"""
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(y_true, y_pred)

  print(f'{name} Scores')
  print(f'MAE: {mae:,.4f} \nMSE: {mse:,.4f} \nRMSE: {rmse:,.4f} \nR2: {r2:.4f}\n')
'''

In [None]:
'''
# make function that will calculate all metrics for a model after it has been 
# fitted, and return a df with the metrics

def eval_model(model, X_train, X_test, y_train, y_test):

  # create list of metrics and df to store calculations
  metrics = ['MAE', 'MSE', 'RMSE', 'R2']
  metrics = pd.DataFrame(index = metrics, 
                         columns = [str(model) + '\n' + 'Train Score', 
                                    str(model) + '\n' + 'Test Score'])

  # create training and testing predictions
  train_pred = model.predict(X_train)
  test_pred = model.predict(X_test)
  
  # calculate mae for training and testing data
  train_mae = mean_absolute_error(y_train, train_pred)
  test_mae = mean_absolute_error(y_test, test_pred)

  # calculate mse
  train_mse = mean_squared_error(y_train, train_pred)
  test_mse = mean_squared_error(y_test, test_pred)

  # calculate r2
  train_r2 = r2_score(y_train, train_pred)
  test_r2 = r2_score(y_test, test_pred)

  # store values in df
  metrics.loc['MAE', str(model) + '\n' + 'Train Score'] = train_mae
  metrics.loc['MAE', str(model) + '\n' + 'Test Score'] = test_mae
  metrics.loc['MSE', str(model) + '\n' + 'Train Score'] = train_mse
  metrics.loc['MSE', str(model) + '\n' + 'Test Score'] = test_mse
  metrics.loc['RMSE', str(model) + '\n' + 'Train Score'] = np.sqrt(train_mse)
  metrics.loc['RMSE', str(model) + '\n' + 'Test Score'] = np.sqrt(test_mse)
  metrics.loc['R2', str(model) + '\n' + 'Train Score'] = train_r2
  metrics.loc['R2', str(model) + '\n' + 'Test Score'] = test_r2

  return metrics

'''

# Preprocessing

In [None]:
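# convert string target variable values to numbers
# (assign the result back instead of calling replace(..., inplace = True) on a
# selected column, which newer pandas versions warn about)
df['diagnosis'] = df['diagnosis'].map({'B': 0,
                                       'M': 1})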

In [None]:
# split into features matrix and target vector
target = 'diagnosis'
y = df[target]
X = df.drop(columns = target)

In [None]:
# train test split (model validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
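
Two optional variations on the split above, shown as a sketch on a tiny made-up frame (`toy` and its values are hypothetical): dropping the `id` column, which is an identifier rather than a measurement, and passing `stratify` so both splits keep the same class proportions. Either change would alter the accuracy scores reported in the rest of this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# tiny made-up frame with the same kinds of columns as the cancer data
toy = pd.DataFrame({
    'id': range(100, 120),
    'diagnosis': [0, 1] * 10,
    'radius_mean': [10.0 + i for i in range(20)],
})

target = 'diagnosis'
y_toy = toy[target]
# drop the identifier along with the target so it cannot be used as a feature
X_toy = toy.drop(columns=[target, 'id'])

# stratify keeps the class proportions (nearly) the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42)

print(list(X_toy.columns))
print(y_tr.mean(), y_te.mean())
```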

# Decision Tree Classifier Model

In [None]:
# instantiate model with default parameter settings
dec_tree_class = DecisionTreeClassifier(random_state = 42)

In [None]:
# fit model on training data
dec_tree_class.fit(X_train, y_train)

In [None]:
# get accuracy for model
print(f"Model's accuracy score on training data: {dec_tree_class.score(X_train, y_train)}")
print(f"Model's accuracy score on testing data: {dec_tree_class.score(X_test, y_test)}")

Model's accuracy score on training data: 1.0
Model's accuracy score on testing data: 0.951048951048951
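
A training accuracy of 1.0 next to a test accuracy of about 0.951 suggests the unrestricted tree is overfitting. Accuracy alone also does not show which class is being misclassified. The sketch below is one possible classification counterpart to the disabled regression helpers; `eval_classifier` is a hypothetical name, and the data is synthetic so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def eval_classifier(model, X, y, name='model'):
    """Print accuracy, recall for class 1, and the confusion matrix.
    Hypothetical helper for illustration; not part of the assignment."""
    pred = model.predict(X)
    acc = accuracy_score(y, pred)
    rec = recall_score(y, pred)       # recall for the positive class (1)
    cm = confusion_matrix(y, pred)    # rows: true class, columns: predicted class
    print(f'{name} accuracy: {acc:.4f}  recall: {rec:.4f}')
    print(cm)
    return acc, rec, cm


# synthetic stand-in data (an assumption, not the cancer data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
train_acc, _, _ = eval_classifier(tree, X_tr, y_tr, name='train')
test_acc, _, test_cm = eval_classifier(tree, X_te, y_te, name='test')
```

For a diagnosis task, recall on the malignant class is often worth tracking alongside accuracy, since a missed malignant case is usually the costlier error.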


In [None]:
# consider tuning hyperparameters (solution notebook did not tune hyperparameters, so model
# accuracy scores will differ slightly)
# example from LP: max_depth
dec_tree_class.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': 42,
 'splitter': 'best'}

In [None]:
# tune hyperparameter max_depth to improve model performance on test

# get depth from default model
dec_tree_class.get_depth()

7

In [None]:
# loop through depths 2-6 (range(2, 7) stops before 7, the depth of the default tree
# fitted above), record train and test accuracy, and choose best version of model
depths = list(range(2, 7))

scores = pd.DataFrame(index = depths,
                      columns = ['Test Score',
                                 'Train Score'])

for depth in depths:
  dec_tree = DecisionTreeClassifier(max_depth = depth,
                                    random_state = 42)
  dec_tree.fit(X_train, y_train)
  scores.loc[depth, 'Train Score'] = dec_tree.score(X_train, y_train)
  scores.loc[depth, 'Test Score'] = dec_tree.score(X_test, y_test)

sorted_scores = scores.sort_values(by = 'Test Score',
                                   ascending = False)

sorted_scores.head()

max_depth,Test Score,Train Score
3,0.958042,0.971831
5,0.958042,0.995305
4,0.944056,0.995305
6,0.944056,0.997653
2,0.916084,0.946009
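
Choosing the depth by test score, as in the table above, uses the test set for tuning, so the reported test accuracy is likely to be a little optimistic. A common alternative is to pick the depth with cross-validation on the training split only and score the test set once at the end. The sketch below assumes scikit-learn's bundled breast cancer data as a stand-in; it is not necessarily identical to `cancer.csv`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# stand-in data: scikit-learn's bundled breast cancer set (an assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

cv_scores = {}
for depth in range(2, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # mean accuracy over 5 folds of the training data; the test set is untouched
    cv_scores[depth] = cross_val_score(tree, X_tr, y_tr, cv=5).mean()

best_depth = max(cv_scores, key=cv_scores.get)
print(pd.Series(cv_scores, name='CV accuracy'))

# refit on all training data with the chosen depth, then score once on the test set
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42).fit(X_tr, y_tr)
print(best_depth, final_tree.score(X_te, y_te))
```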


In [None]:
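# instantiate best version of model (depths 3 and 5 tie on test accuracy;
# depth 3 is the simpler tree and fits the training data less tightly)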
dec_tree_class_3 = DecisionTreeClassifier(max_depth = 3,
                                          random_state = 42)

In [None]:
# fit model on training data only
dec_tree_class_3.fit(X_train, y_train)

In [None]:
# get accuracy for model
print(f"Model's accuracy score on training data: {dec_tree_class_3.score(X_train, y_train)}")
print(f"Model's accuracy score on testing data: {dec_tree_class_3.score(X_test, y_test)}")

Model's accuracy score on training data: 0.971830985915493
Model's accuracy score on testing data: 0.958041958041958


# Bagging Classifier Model

In [None]:
# instantiate model with default parameters
bag_class = BaggingClassifier(random_state = 42)

In [None]:
# fit model on training data
bag_class.fit(X_train, y_train)

In [None]:
# get accuracy for model
print(f"Model's accuracy score on training data: {bag_class.score(X_train, y_train)}")
print(f"Model's accuracy score on testing data: {bag_class.score(X_test, y_test)}")

Model's accuracy score on training data: 0.9929577464788732
Model's accuracy score on testing data: 0.951048951048951


In [None]:
# consider tuning hyperparameters
# example from LP: n_estimators
bag_class.get_params()

{'base_estimator': 'deprecated',
 'bootstrap': True,
 'bootstrap_features': False,
 'estimator': None,
 'max_features': 1.0,
 'max_samples': 1.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
# tune hyperparameter n_estimators to improve model performance on test data

# start by guessing some number of n_estimators (refine later if necessary)

num_estimators = [50, 100, 150, 200]

scores = pd.DataFrame(index = num_estimators, columns = ['Test Score',
                                                         'Train Score'])

for num in num_estimators:
  # use a separate name so the default bag_class fitted above is not overwritten
  bag_model = BaggingClassifier(n_estimators = num,
                                random_state = 42)
  bag_model.fit(X_train, y_train)
  scores.loc[num, 'Train Score'] = bag_model.score(X_train, y_train)
  scores.loc[num, 'Test Score'] = bag_model.score(X_test, y_test)

sorted_scores = scores.sort_values(by = 'Test Score',
                                   ascending = False)

sorted_scores.head()

n_estimators,Test Score,Train Score
50,0.958042,1.0
100,0.958042,1.0
150,0.958042,1.0
200,0.958042,1.0


In [None]:
# should put plot here to see how n_estimators affects performance
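
One possible version of that plot, as a sketch: the accuracy values are copied by hand from the outputs in this section (10 is the default `n_estimators`), so nothing is refit here.

```python
import matplotlib.pyplot as plt

# accuracy values copied from the outputs above (10 is the default n_estimators)
n_estimators = [10, 50, 100, 150, 200]
train_scores = [0.992958, 1.0, 1.0, 1.0, 1.0]
test_scores = [0.951049, 0.958042, 0.958042, 0.958042, 0.958042]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(n_estimators, train_scores, marker='o', label='Train')
ax.plot(n_estimators, test_scores, marker='o', label='Test')
ax.set_xlabel('n_estimators')
ax.set_ylabel('Accuracy')
ax.set_title('Bagging classifier accuracy by number of estimators')
ax.legend()
fig.tight_layout()
```

With these values both lines are flat from 50 estimators onward, which suggests that adding more trees beyond that point does not help on this split.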

In [None]:
# all four n_estimators values tie on test accuracy (0.958), slightly above the default
# of 10 estimators (0.951), so the smallest of them (50) is the cheapest good choice

# Random Forest Classifier Model

In [None]:
# instantiate model with default parameter settings
rf_class = RandomForestClassifier(random_state = 42)

In [None]:
# fit model on training data
rf_class.fit(X_train, y_train)

In [None]:
# get accuracy for model
print(f"Model's accuracy score on training data: {rf_class.score(X_train, y_train)}")
print(f"Model's accuracy score on testing data: {rf_class.score(X_test, y_test)}")

Model's accuracy score on training data: 1.0
Model's accuracy score on testing data: 0.972027972027972


In [None]:
# these accuracy scores differ from the rf classifier in the solutions notebook??

In [None]:
# consider tuning hyperparameters
# example from LP: max_depth
rf_class.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
# did not tune any hyperparameters for this model
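
If tuning were attempted, one option that leaves the test set untouched is the out-of-bag (OOB) score: each tree is fit on a bootstrap sample, and the rows it did not see act as a built-in validation set. The sketch below compares a few `max_depth` values this way. It assumes scikit-learn's bundled breast cancer data as a stand-in for `cancer.csv`, and the candidate depths are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data: scikit-learn's bundled breast cancer set (an assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

oob_by_depth = {}
for depth in [2, 4, 6, None]:
    forest = RandomForestClassifier(max_depth=depth, oob_score=True, random_state=42)
    forest.fit(X_tr, y_tr)
    # accuracy estimated from the rows each tree did not see during fitting
    oob_by_depth[depth] = forest.oob_score_

best_depth = max(oob_by_depth, key=oob_by_depth.get)
print(oob_by_depth)
print(best_depth)
```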

The best model according to test accuracy is the random forest with default settings, which was 97.2% accurate on the test data, ahead of the tuned decision tree and the bagging classifier (both 95.8%).