# ***UCI Breast Cancer Pipeline Project***
###
### Some noteworthy information from UCI:
1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32)

**Ten real-valued features are computed for each cell nucleus:**

1) radius (mean of distances from center to points on the perimeter)
2) texture (standard deviation of gray-scale values)
3) perimeter
4) area
5) smoothness (local variation in radius lengths)
6) compactness (perimeter^2 / area - 1.0)
7) concavity (severity of concave portions on the contour)
8) concave points (number of concave portions of the contour)
9) symmetry
10) fractal dimension ("coastline approximation" - 1)
###


## 0. Import Modules:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression,LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.decomposition import PCA

## 1. Import UCI Dataset &#8594; Write Dataset to Local CSV

In [2]:
## Import UCI Dataset and write to local csv
## NOTE: Commented out code for the sake of minimizing runtime, it is functional/reusable.
# from ucimlrepo import fetch_ucirepo
# breast_ca = fetch_ucirepo(id=17)

# breast_ca_df = breast_ca.data.original
# breast_ca_df.to_csv('UCI_BreastCancer.csv', index=False)
# print('Successfully wrote dataset to csv file!')

# Read csv and store as df
df = pd.read_csv('UCI_BreastCancer.csv')

## 2. Search for missing values and verify shape &#8594; Identify Feature Types

In [3]:
# Search Dataset for missing / null values:
try:
    if df.isnull().sum().any()>0:
        print('NaN values found: ', df.isnull().sum())
    else:
        print('No NaN or null values found')
except Exception as e:
    print(e)

# Consider the number of unique values for each feature:
# All features are numeric.
# print(df.nunique())

# Verify features and shape:
print(df.columns)
print(df.shape)

No NaN or null values found
Index(['ID', 'radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3', 'Diagnosis'],
      dtype='object')
(569, 32)


## 3. Define Target (y) and Features (X) &#8594; Convert Target to Binary &#8594; Train_Test_Split()

In [4]:
# Define features and target
y = df.Diagnosis
X = df.drop(columns=['Diagnosis'])
print(X.shape)
print(y.shape)

# Convert target data to binary and verify value_counts.
print('\nPrior to binary conversion: \n',y.value_counts())
try:
    y = pd.DataFrame(np.where(y == 'M',1,0), columns=['Diagnosis'])
    y = y.Diagnosis
    print('\nPost binary conversion: \n',y.value_counts(),'\n')

except Exception as e:
    print(e)

print(X.shape)
print(y.shape)

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

(569, 31)
(569,)

Prior to binary conversion: 
 Diagnosis
B    357
M    212
Name: count, dtype: int64

Post binary conversion: 
 Diagnosis
0    357
1    212
Name: count, dtype: int64 

(569, 31)
(569,)


## 4. PCA and Exploratory Data Analysis:

In [5]:
# In progress ...

## 4.1. Prepare Classifier Switching Class:

In [6]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import SGDClassifier

class ClfSwitch(BaseEstimator):
    def __init__(self, estimator=SGDClassifier()):
        self.estimator = estimator
    def fit(self,xx,yy=None,**kwargs):
        self.estimator.fit(xx,yy)
        return self
    def predict(self,xx,yy=None):
        return self.estimator.predict(xx)
    def predict_proba(self,xx,yy=None):
        return self.estimator.predict_proba(xx)
    def score(self,xx,yy):
        return self.estimator.score(xx,yy)

## 5. Define Pipeline / Preprocessing / Search_space

In [7]:
scaler = StandardScaler()

preprocessor = ColumnTransformer([
    ('scaler', scaler, X_train.columns)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', ClfSwitch())
])

search_space = [
    {'clf__estimator': [RandomForestClassifier(random_state=13)],
     'clf__estimator__max_depth':[10,15,25],
     'clf__estimator__n_estimators':[150,200,250],
     },
    {'clf__estimator': [GradientBoostingClassifier(random_state=13)],
     'clf__estimator__learning_rate':[0.001,0.01,0.1],
     'clf__estimator__n_estimators':[150,200,250],
     },
    {'clf__estimator': [SGDClassifier(random_state=13)],
     'clf__estimator__loss': ['hinge','log_loss'],
     'clf__estimator__alpha': [0.001,0.005,0.01],
     'clf__estimator__penalty': ['l2']
     }
]

## 6. GridSearchCV Implementation &#8594; Analyze Results and Prepare to Tune Hyperparameters
*About 30 sec runtime with current search_space (9,9,6)*

In [8]:
from sklearn.metrics import accuracy_score

# USE sklearn version 1.5.2 to suppress FutureWarning during fit:
gs = GridSearchCV(estimator=pipeline, param_grid=search_space, cv=5, error_score='raise')
gs.fit(X_train, y_train)

# Load the best estimator:
gs_best = gs.best_estimator_

# Load the best classifier:
gs_best_clf = gs_best.named_steps['clf']

# Print the best classifier's estimator:
print(gs_best_clf.get_params()['estimator'])

## Parameters can be accessed using:
#print(gs_best_clf.get_params()['estimator__max_depth'])
#print(gs_best_clf.get_params()['estimator__n_estimators'])

# Compare the accuracy score on test data to the best_score_ (training data):
print('Test Data score: \t\t',gs_best.score(X_test, y_test))
y_pred = gs_best.predict(X_test)
print('Test Data score: \t\t',accuracy_score(y_test, y_pred))
print('Training Data score: \t',gs.best_score_)

SGDClassifier(alpha=0.005, loss='log_loss', random_state=13)
Test Data score: 		 0.9912280701754386
Test Data score: 		 0.9912280701754386
Training Data score: 	 0.9758241758241759


## 6.1. Feature Analysis using cv_results_

In [9]:
# Used .cv_results_ to create a dataframe of the results:
cv_df = pd.DataFrame(gs.cv_results_)

columns_interest = [
    'param_clf__estimator',
    'param_clf__estimator__max_depth',
    'param_clf__estimator__n_estimators',
    'param_clf__estimator__loss',
    'param_clf__estimator__penalty',
    'param_clf__estimator__alpha',
    'mean_test_score',
    'std_test_score',
    'rank_test_score']

cv_df_results = cv_df[columns_interest].round(3)

cv_df_results

Unnamed: 0,param_clf__estimator,param_clf__estimator__max_depth,param_clf__estimator__n_estimators,param_clf__estimator__loss,param_clf__estimator__penalty,param_clf__estimator__alpha,mean_test_score,std_test_score,rank_test_score
0,RandomForestClassifier(random_state=13),10.0,150.0,,,,0.96,0.02,7
1,RandomForestClassifier(random_state=13),10.0,200.0,,,,0.96,0.02,7
2,RandomForestClassifier(random_state=13),10.0,250.0,,,,0.96,0.02,7
3,RandomForestClassifier(random_state=13),15.0,150.0,,,,0.96,0.02,7
4,RandomForestClassifier(random_state=13),15.0,200.0,,,,0.96,0.02,7
5,RandomForestClassifier(random_state=13),15.0,250.0,,,,0.96,0.02,7
6,RandomForestClassifier(random_state=13),25.0,150.0,,,,0.96,0.02,7
7,RandomForestClassifier(random_state=13),25.0,200.0,,,,0.96,0.02,7
8,RandomForestClassifier(random_state=13),25.0,250.0,,,,0.96,0.02,7
9,GradientBoostingClassifier(random_state=13),,150.0,,,,0.613,0.036,24


## 7. Hyperparameter Tuning and Retesting models

In [10]:
# In progress ...

## 7.1. Diving Deeper into RandomForestClassifier Validation
*Using RandomForestClassifier(max_depth=10, n_estimators=150)*

In [11]:
# Modifying best rfc() to validate scores across dataset.
score_list = list()

scaler = StandardScaler()
scaler.fit(X)
X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test)

# Loop to analyze for any meaningful impact of Random_state on bootstrapping in the RFC().
for i in range(1,10):
    clf = RandomForestClassifier(random_state=i+1,max_depth=10,n_estimators=150)
    clf.fit(X_train_scale,y_train)
    current_score = clf.score(X_test_scale,y_test)
    score_list.append([i+1,current_score])

c_names = ['random_state:','RFC score:']
score_df = pd.DataFrame(score_list,columns=c_names)
print('RFC Score Average: \t\t\t',str(score_df['RFC score:'].mean()))
print('RFC Score Std Deviation: \t',str(score_df['RFC score:'].std()))
# No indication that Random_state had a meaningful impact on our accuracy scores.

from time import perf_counter as tpc

start = tpc()
clf = RandomForestClassifier(random_state=11,max_depth=10,n_estimators=150)
clf.fit(X_train_scale,y_train)
current_score = clf.score(X_test_scale,y_test)
end = tpc()
print('RFC time: \t\t\t\t\t', str((end - start) * 1000), 'ms')

score_df

RFC Score Average: 			 0.9220272904483431
RFC Score Std Deviation: 	 0.00814000637840649
RFC time: 					 127.50887500000019 ms


Unnamed: 0,random_state:,RFC score:
0,2,0.921053
1,3,0.929825
2,4,0.912281
3,5,0.912281
4,6,0.929825
5,7,0.929825
6,8,0.929825
7,9,0.921053
8,10,0.912281


## 7.2. Diving Deeper into SGDClassifier Validation:
*hinge vs log_loss, alpha = 0.01, penalty = l2*

*NOTE: Must scale data in this block because we are no longer using Pipeline and thus no preprocessor/scaler*


In [12]:
# Modifying best SGDClassifier() to validate scores across dataset and compare loss='hinge' vs. loss='log_loss'
score_list = list()

# Cannot stress enough the importance of calling the scaler before fitting to SGDClassifer!
scaler = StandardScaler()
scaler.fit(X)
X_train_scale = scaler.transform(X_train)
X_test_scale = scaler.transform(X_test)

# Loop to analyze for any meaningful impact of Random_state on SGDClassifier().
for i in range(1,10):
    clf = SGDClassifier(random_state=i+1,loss='hinge',alpha=0.01,penalty='l2')
    clf.fit(X_train_scale,y_train)
    current_score = clf.score(X_test_scale,y_test)
    clf_log = SGDClassifier(random_state=i+1, loss='log_loss',alpha=0.01,penalty='l2')
    clf_log.fit(X_train_scale,y_train)
    current_score_log = clf.score(X_test_scale,y_test)
    score_list.append([i+1,current_score,current_score_log])

c_names = ['random_state:','SGDClassifier hinge score:','SGDClassifier log_loss score:']
score_df = pd.DataFrame(score_list,columns=c_names)
print('SGDC Hinge Score Average: \t\t\t',str(score_df['SGDClassifier hinge score:'].mean()))
print('SGDC Hinge Score Std Deviation: \t',str(score_df['SGDClassifier hinge score:'].std()))
print('SGDC Log_loss Score Average: \t\t',str(score_df['SGDClassifier log_loss score:'].mean()))
print('SGDC Log_loss Score Std Deviation: \t',str(score_df['SGDClassifier log_loss score:'].std()))
# No difference!
print()
from time import perf_counter as tpc

start = tpc()
clf = SGDClassifier(random_state=11,loss='hinge',alpha=0.01,penalty='l2')
clf.fit(X_train_scale,y_train)
current_score = clf.score(X_test_scale,y_test)
end = tpc()
print('SGDC Hinge time: \t\t\t\t\t', str((end - start) * 1000), 'ms')


start = tpc()
clf_log = SGDClassifier(random_state=11, loss='log_loss',alpha=0.01,penalty='l2')
clf_log.fit(X_train_scale,y_train)
current_score_log = clf.score(X_test_scale,y_test)
end = tpc()
print('SGDC Log_loss time:\t\t\t\t\t', str((end - start) * 1000), 'ms')
# Hinge takes the advantage for runtime!

score_df

SGDC Hinge Score Average: 			 0.976608187134503
SGDC Hinge Score Std Deviation: 	 0.01074337606483848
SGDC Log_loss Score Average: 		 0.976608187134503
SGDC Log_loss Score Std Deviation: 	 0.01074337606483848

SGDC Hinge time: 					 1.1974589999965701 ms
SGDC Log_loss time:					 1.236583000000735 ms


Unnamed: 0,random_state:,SGDClassifier hinge score:,SGDClassifier log_loss score:
0,2,0.982456,0.982456
1,3,0.964912,0.964912
2,4,0.991228,0.991228
3,5,0.964912,0.964912
4,6,0.991228,0.991228
5,7,0.973684,0.973684
6,8,0.982456,0.982456
7,9,0.973684,0.973684
8,10,0.964912,0.964912


## 7.3. RandomizedSearchCV Implementation
*Note: Runtime benefits from defining n_iter parameter (default n_iter=10)*

*When n_iter=5 the Runtime is 5 to 8 seconds, when n_iter=10 the Runtime is 8 to 18 seconds*

In [13]:
# RandomizedSearchCV method:
from sklearn.model_selection import RandomizedSearchCV

rs = RandomizedSearchCV(estimator=pipeline, param_distributions=search_space, cv=5, n_iter=10, error_score='raise')
rs.fit(X_train, y_train)

# Load the best estimator:
rs_best = rs.best_estimator_
# Load the best classifier:
rs_best_clf = rs_best.named_steps['clf']
# Print the best classifier's estimator:
print(rs_best_clf.get_params()['estimator'])

# Compare the test data score to the training data (best_score_):
print(rs_best.score(X_test, y_test))
print(rs.best_score_)

SGDClassifier(alpha=0.005, loss='log_loss', random_state=13)
0.9912280701754386
0.9758241758241759
