
# Tasks

The target vector is the diagnosis, recorded as either malignant (M) or benign (B).



Use the following classifiers to obtain the highest accuracy possible on the test set:

1. Decision Tree Classifier

2. Bagging Classifier

3. Random Forest Classifier

Link to original dataset:
https://drive.google.com/file/d/1Zms1RfgkWrTp7S6_BFpeELYyvX1s2FSN/view

# Imports and Drive Mounting

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

from sklearn import set_config
set_config(display = 'diagram')

import warnings
warnings.filterwarnings('ignore')

# Data Loading & Cleaning

In [3]:
filename = '/content/drive/MyDrive/Colab Notebooks/02. Weeks 5-8: Machine Learning/07 Week 7: Classification/4. Practice Assignments/cancer.csv'
df = pd.read_csv(filename)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5

In [5]:
df.nunique()

id                         569
diagnosis                    2
radius_mean                456
texture_mean               479
perimeter_mean             522
area_mean                  539
smoothness_mean            474
compactness_mean           537
concavity_mean             537
concave points_mean        542
symmetry_mean              432
fractal_dimension_mean     499
radius_se                  540
texture_se                 519
perimeter_se               533
area_se                    528
smoothness_se              547
compactness_se             541
concavity_se               533
concave points_se          507
symmetry_se                498
fractal_dimension_se       545
radius_worst               457
texture_worst              511
perimeter_worst            514
area_worst                 544
smoothness_worst           411
compactness_worst          529
concavity_worst            539
concave points_worst       492
symmetry_worst             500
fractal_dimension_worst    535
dtype: i

Here is what I need to do to prepare the data:
- Delete the id column. Every row has its own id (569 unique values across 569 rows), so the column cannot help the model in any way.
- Encode the target variable. Its possible categories are benign (noncancerous) and malignant (cancerous), which I am going to change to the following:
  - 0 = benign
  - 1 = malignant
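The two steps listed above can be sanity-checked before they are applied. The sketch below is a minimal illustration on a made-up toy frame (the `toy` DataFrame and its values are hypothetical, not taken from the cancer file): it confirms the identifier is unique per row and uses `map`, which leaves `NaN` for any label that is not in the dictionary, so unexpected categories are easy to spot.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; values are made up.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["B", "M", "B", "M"],
    "radius_mean": [12.1, 20.3, 11.4, 18.9],
})

# An identifier column is safe to drop when every row has a distinct value.
id_is_unique = toy["id"].is_unique

# map() yields NaN for labels missing from the dictionary,
# so counting NaN values reveals unexpected categories.
encoded = toy["diagnosis"].map({"B": 0, "M": 1})
unmapped = int(encoded.isna().sum())

# Features are everything except the identifier and the target.
features = toy.drop(columns=["id", "diagnosis"])

print(id_is_unique, unmapped, list(features.columns))
```

If `unmapped` were greater than zero, the label dictionary would need another entry before modeling.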

# Preprocessing

In [6]:
df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

In [7]:
df['diagnosis'] = df['diagnosis'].map({'B': 0, 'M': 1}) # assign the result back instead of using inplace on a column selection
df['diagnosis'].value_counts()

0    357
1    212
Name: diagnosis, dtype: int64

In [8]:
X = df.drop(columns = ['id', 'diagnosis']).copy() # sets features eliminating the 'id' and the target columns
y = df['diagnosis'].copy() # sets target variable

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42) #train test split
X_train.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
287,12.89,13.12,81.89,515.9,0.06955,0.03729,0.0226,0.01171,0.1337,0.05581,...,13.62,15.54,87.4,577.0,0.09616,0.1147,0.1186,0.05366,0.2309,0.06915
512,13.4,20.52,88.64,556.7,0.1106,0.1469,0.1445,0.08172,0.2116,0.07325,...,16.41,29.66,113.3,844.4,0.1574,0.3856,0.5106,0.2051,0.3585,0.1109
402,12.96,18.29,84.18,525.2,0.07351,0.07899,0.04057,0.01883,0.1874,0.05899,...,14.13,24.61,96.31,621.9,0.09329,0.2318,0.1604,0.06608,0.3207,0.07247
446,17.75,28.03,117.3,981.6,0.09997,0.1314,0.1698,0.08293,0.1713,0.05916,...,21.53,38.54,145.4,1437.0,0.1401,0.3762,0.6399,0.197,0.2972,0.09075
210,20.58,22.14,134.7,1290.0,0.0909,0.1348,0.164,0.09561,0.1765,0.05024,...,23.24,27.84,158.3,1656.0,0.1178,0.292,0.3861,0.192,0.2909,0.05865
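The split above uses the default 25% test size without stratification, so the benign/malignant ratio in each part is left to chance. As a possible alternative (not what this notebook does), the sketch below passes `stratify` to `train_test_split`. It uses synthetic labels with the same class counts as the diagnosis column (357 and 212); `X_demo` and `y_demo` are placeholder names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the diagnosis column.
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify keeps the class proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(f"rows: train={len(y_tr)}, test={len(y_te)}")
print(f"overall malignant share: {y_demo.mean():.3f}")
print(f"train malignant share:   {y_tr.mean():.3f}")
print(f"test malignant share:    {y_te.mean():.3f}")
```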


Since all of the feature columns are numerical, the pipeline only needs a numerical preprocessing step.

In [10]:
#setting up a tuple to process the numerical columns by imputing the median for any missing values that may come later.
median_imputer = SimpleImputer(strategy = 'median')
scaler = StandardScaler()

numerical_pipeline = make_pipeline(median_imputer, scaler)

numerical_selector = make_column_selector(dtype_include = 'number')

numerical_tuple = (numerical_pipeline, numerical_selector)
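To make the effect of the numerical pipeline concrete, here is a minimal sketch on a made-up single column with one missing value (the array `col` is hypothetical). It shows the median being filled in first and the column then being standardized to mean 0 and unit variance.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up single-feature column with one missing value.
col = np.array([[1.0], [2.0], [np.nan], [9.0]])

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
out = pipe.fit_transform(col)

# The median of the observed values (1, 2, 9) is 2, so the NaN becomes 2
# before scaling; the scaled column has mean 0 and standard deviation 1.
print(pipe.named_steps["simpleimputer"].statistics_)
print(out.ravel().round(3))
```

Tree-based models split on thresholds, so scaling is generally not expected to change their predictions; the scaler mainly keeps the preprocessor reusable with other model types.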

In [11]:
#finishing up the preprocessor and fitting the training data.
preprocessor = make_column_transformer(numerical_tuple, remainder='drop', verbose_feature_names_out=False)
preprocessor.fit(X_train)

# Decision Tree Classifier

In [12]:
dec_tree = DecisionTreeClassifier()
dec_tree_pipeline = make_pipeline(preprocessor, dec_tree)
dec_tree_pipeline.fit(X_train, y_train)

In [13]:
#Create predictions using data sets
decision_train_predictions = dec_tree_pipeline.predict(X_train)
decision_test_predictions = dec_tree_pipeline.predict(X_test)

## Metrics 

In [14]:
#Model Accuracy
dec_tree_train_accuracy = round(dec_tree_pipeline.score(X_train, y_train), 4)
dec_tree_test_accuracy = round(dec_tree_pipeline.score(X_test, y_test), 4)

# Training Data
dec_train_mae = round(mean_absolute_error(y_train, decision_train_predictions), 4)
dec_train_mse = round(mean_squared_error(y_train, decision_train_predictions),4)
dec_train_rmse = round(np.sqrt(mean_squared_error(y_train, decision_train_predictions)),4)
dec_train_r2 = round(r2_score(y_train, decision_train_predictions), 4)

# Testing Data
dec_test_mae = round(mean_absolute_error(y_test, decision_test_predictions),4)
dec_test_mse = round((mean_squared_error(y_test, decision_test_predictions)),4)
dec_test_rmse = round(np.sqrt(mean_squared_error(y_test, decision_test_predictions)),4)
dec_test_r2 = round(r2_score(y_test, decision_test_predictions), 4)

In [15]:
#Model Accuracy
print(f'Training Model Accuracy: {dec_tree_train_accuracy}')
print(f'Testing Model Accuracy: {dec_tree_test_accuracy}')
print('')

# Training Data
print('Decision Classifier Training Metrics')
print(f'MAE : {dec_train_mae}')
print(f'MSE : {dec_train_mse}')
print(f'RMSE : {dec_train_rmse}')
print(f'R-SQUARED : {dec_train_r2}')
print('')

# Testing Data
print('Decision Classifier Testing Metrics')
print(f'MAE : {dec_test_mae}')
print(f'MSE : {dec_test_mse}')
print(f'RMSE : {dec_test_rmse}')
print(f'R-SQUARED : {dec_test_r2}')

Training Model Accuracy: 1.0
Testing Model Accuracy: 0.951

Decision Classifier Training Metrics
MAE : 0.0
MSE : 0.0
RMSE : 0.0
R-SQUARED : 1.0

Decision Classifier Testing Metrics
MAE : 0.049
MSE : 0.049
RMSE : 0.2212
R-SQUARED : 0.7917
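MAE, MSE, RMSE, and R-squared are regression metrics. With 0/1 labels every wrong prediction contributes an error of exactly 1, so MAE and MSE both equal 1 minus accuracy, which is why the test MAE of 0.049 mirrors the test accuracy of 0.951. The sketch below demonstrates this on made-up labels (`y_true` and `y_pred` are hypothetical) and shows classification metrics that separate the two kinds of error, assuming 1 = malignant as encoded earlier.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error,
                             precision_score, recall_score)

# Made-up labels and predictions (1 = malignant, 0 = benign).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# With 0/1 labels each error is exactly 1, so MAE == MSE == 1 - accuracy.
print(acc, mae, mse)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Recall: share of malignant cases caught. Precision: share of
# malignant predictions that were actually malignant.
recall = recall_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
print(recall, precision)
```

For a diagnosis task, a false negative (a malignant case predicted benign) is usually the costlier mistake, so recall on the malignant class is worth reporting next to accuracy.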


# Bagging Classifier

In [16]:
bag_tree = BaggingClassifier()
bag_tree_pipeline = make_pipeline(preprocessor, bag_tree)
bag_tree_pipeline.fit(X_train, y_train)
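For intuition about what a bagging ensemble does, the sketch below builds one by hand: each tree is trained on a bootstrap sample (rows drawn with replacement) and the trees vote on each prediction. This is a simplified illustration on synthetic data from `make_classification`, not the cancer dataset, and it uses a hard majority vote; scikit-learn's `BaggingClassifier` may combine its estimators differently (for example by averaging predicted probabilities), so treat it as a sketch of the idea rather than a reimplementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the cancer dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # an odd count avoids tied votes
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # Bootstrap: draw len(X_tr) row indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Majority vote across trees for each test row.
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()
print(f"hand-rolled bagging accuracy: {bagged_acc:.3f}")
```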

In [17]:
bag_train_predictions = bag_tree_pipeline.predict(X_train)
bag_test_predictions = bag_tree_pipeline.predict(X_test)

## Metrics

In [18]:
#Model Accuracy
bag_tree_train_accuracy = round(bag_tree_pipeline.score(X_train, y_train), 4)
bag_tree_test_accuracy = round(bag_tree_pipeline.score(X_test, y_test), 4)

# Training Data
bag_train_mae = round(mean_absolute_error(y_train, bag_train_predictions), 4)
bag_train_mse = round(mean_squared_error(y_train, bag_train_predictions),4)
bag_train_rmse = round(np.sqrt(mean_squared_error(y_train, bag_train_predictions)),4)
bag_train_r2 = round(r2_score(y_train, bag_train_predictions), 4)

# Testing Data
bag_test_mae = round(mean_absolute_error(y_test, bag_test_predictions),4)
bag_test_mse = round((mean_squared_error(y_test, bag_test_predictions)),4)
bag_test_rmse = round(np.sqrt(mean_squared_error(y_test, bag_test_predictions)),4)
bag_test_r2 = round(r2_score(y_test, bag_test_predictions), 4)

In [19]:
#Model Accuracy
print(f'Training Model Accuracy: {bag_tree_train_accuracy}')
print(f'Testing Model Accuracy: {bag_tree_test_accuracy}')
print('')

# Training Data
print('Bagging Classifier Training Metrics')
print(f'MAE : {bag_train_mae}')
print(f'MSE : {bag_train_mse}')
print(f'RMSE : {bag_train_rmse}')
print(f'R-SQUARED : {bag_train_r2}')
print('')

# Testing Data
print('Bagging Classifier Testing Metrics')
print(f'MAE : {bag_test_mae}')
print(f'MSE : {bag_test_mse}')
print(f'RMSE : {bag_test_rmse}')
print(f'R-SQUARED : {bag_test_r2}')

Training Model Accuracy: 0.9977
Testing Model Accuracy: 0.958

Bagging Classifier Training Metrics
MAE : 0.0023
MSE : 0.0023
RMSE : 0.0485
R-SQUARED : 0.9899

Bagging Classifier Testing Metrics
MAE : 0.042
MSE : 0.042
RMSE : 0.2048
R-SQUARED : 0.8215


# Random Forest Classifier

In [20]:
random_forest = RandomForestClassifier()
random_forest_pipeline = make_pipeline(preprocessor, random_forest)
random_forest_pipeline.fit(X_train, y_train)
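The forest above uses default hyperparameters. Since the task asks for the highest test accuracy possible, one option is a cross-validated grid search over a few settings. The sketch below is an illustration on synthetic data, and the grid values are arbitrary assumptions chosen to keep it fast, not recommendations for the cancer dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; not the cancer dataset.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, stratify=y)

# Small, arbitrary grid to keep the example quick.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # cross-validation uses training data only

print(search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
# The test set is scored once, after the parameters are chosen.
print(f"held-out test accuracy:   {search.score(X_te, y_te):.3f}")
```

Choosing parameters by cross-validation on the training set keeps the test set as an honest estimate; picking settings by test accuracy directly would leak information from it.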

In [21]:
random_forest_train_predictions = random_forest_pipeline.predict(X_train)
random_forest_test_predictions = random_forest_pipeline.predict(X_test)

## Metrics

In [22]:
#Model Accuracy
random_forest_train_accuracy = round(random_forest_pipeline.score(X_train, y_train), 4)
random_forest_test_accuracy = round(random_forest_pipeline.score(X_test, y_test), 4)

# Training Data
random_forest_train_mae = round(mean_absolute_error(y_train, random_forest_train_predictions), 4)
random_forest_train_mse = round(mean_squared_error(y_train, random_forest_train_predictions),4)
random_forest_train_rmse = round(np.sqrt(mean_squared_error(y_train, random_forest_train_predictions)),4)
random_forest_train_r2 = round(r2_score(y_train, random_forest_train_predictions), 4)

# Testing Data
random_forest_test_mae = round(mean_absolute_error(y_test, random_forest_test_predictions),4)
random_forest_test_mse = round((mean_squared_error(y_test, random_forest_test_predictions)),4)
random_forest_test_rmse = round(np.sqrt(mean_squared_error(y_test, random_forest_test_predictions)),4)
random_forest_test_r2 = round(r2_score(y_test, random_forest_test_predictions), 4)

In [23]:
#Model Accuracy
print(f'Training Model Accuracy: {random_forest_train_accuracy}')
print(f'Testing Model Accuracy: {random_forest_test_accuracy}')
print('')

# Training Data
print('Random Forest Classifier Training Metrics')
print(f'MAE : {random_forest_train_mae}')
print(f'MSE : {random_forest_train_mse}')
print(f'RMSE : {random_forest_train_rmse}')
print(f'R-SQUARED : {random_forest_train_r2}')
print('')

# Testing Data
print('Random Forest Classifier Testing Metrics')
print(f'MAE : {random_forest_test_mae}')
print(f'MSE : {random_forest_test_mse}')
print(f'RMSE : {random_forest_test_rmse}')
print(f'R-SQUARED : {random_forest_test_r2}')

Training Model Accuracy: 1.0
Testing Model Accuracy: 0.965

Random Forest Classifier Training Metrics
MAE : 0.0
MSE : 0.0
RMSE : 0.0
R-SQUARED : 1.0

Random Forest Classifier Testing Metrics
MAE : 0.035
MSE : 0.035
RMSE : 0.187
R-SQUARED : 0.8512
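With 569 rows and the default 25% test split, the test set holds 143 rows, so the three test accuracies (0.951, 0.958, 0.965) correspond to 7, 6, and 5 misclassified rows. The models are also created without a fixed `random_state`, so reruns can shift these numbers. A cross-validated comparison gives a steadier picture. The sketch below shows the pattern on synthetic data (not the cancer dataset); the seeds and fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the feature matrix (569 x 30).
X, y = make_classification(
    n_samples=569, n_features=30, n_informative=10, random_state=42
)

# Fixed seeds make the comparison repeatable.
models = {
    "decision tree": DecisionTreeClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Five folds: each row is used for validation exactly once.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name:14s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the gap between model means with the per-fold standard deviation indicates whether a ranking is likely to hold up or is within noise.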
