# Developing an ML Model End-to-End: Exercises

Now that you've had some exposure to using Python for training, tuning, evaluating, and selecting an ML model, it's time to try some exercises on your own. Below, you'll find four exercises that use a new training data set, the Diagnostic Breast Cancer Wisconsin data set. You can read more about this data set [here](https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)). When applicable, an exercise begins with starter code to help guide you toward a solution. Remember, there isn't just one solution to each problem, and your code may produce similar results even if it's written differently from the code in the solution notebook.

These exercises are intended to be performed in Anaconda Notebooks, where the only additional Python library you need to install is `ydata-profiling`. If you decide to do these exercises on your own machine, you'll need to download the data set locally and install each of the Python libraries used. The first code block is provided for you to download the data set.

In [1]:
# Import Python libraries needed
import pandas as pd
import os
import urllib.request  # Import the submodule explicitly; a bare 'import urllib' may not expose it
import tarfile

# !pip install ydata-profiling
 
# Create a function for pulling the breast cancer data set from GitHub
DOWNLOAD_ROOT = 'https://raw.githubusercontent.com/jsukup/Developing-an-ML-Model-End-to-End/master'
DATA_PATH = os.path.join('datasets', 'cancer')
DATA_URL = DOWNLOAD_ROOT + '/datasets/cancer/cancer.tgz'  # '/' separates the root from the file path

def fetch_cancer_data(data_url=DATA_URL, data_path=DATA_PATH):
    # Create the target directory if it does not already exist
    os.makedirs(data_path, exist_ok=True)
    tgz_path = os.path.join(data_path, 'cancer.tgz')
    # Download the archive, then extract its contents into data_path
    urllib.request.urlretrieve(data_url, tgz_path)
    with tarfile.open(tgz_path) as data_tgz:
        data_tgz.extractall(path=data_path)

fetch_cancer_data() # Pull data set from GitHub

# Create a function for loading data into a Python object
def load_cancer_data(data_path=DATA_PATH):
    csv_path = os.path.join(data_path, 'cancer.csv')
    return pd.read_csv(csv_path)

data = load_cancer_data()
data.head() # Inspect first five rows of data set

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | NaN |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | NaN |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | NaN |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | NaN |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | NaN |


## Exercise 1: Describe and Verify Data Quality

To begin, have a look at the data set in its raw form before performing any transformations. The easiest approach is to use the `ydata_profiling` library. Answer the following questions:
1. Describe the data set. How many variables/features does it include? What do you believe is the target we would like to predict?
2. Are there any missing values?
3. Are there any highly correlated variables/features with a correlation coefficient >.7? If so, what are they? Are there any variables/features highly correlated with the target variable that might be good predictors?
4. Are there any variables/features we can remove?
5. What is the proportion of each label for the target variable/feature?

In [None]:
# Run the data profiler and save the output to an HTML file
from ydata_profiling import ProfileReport 

profile = ProfileReport(data)
profile.to_file('profile.html')
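
If you'd like to cross-check the profiler report by hand, the same questions can be approached with plain `pandas`. The sketch below is only illustrative: `toy` is a hypothetical five-row stand-in with made-up values (not rows from the real data set), so the printed results say nothing about the actual data. Replace `toy` with `data` to run the same checks on the cancer data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cancer DataFrame (values are made up)
toy = pd.DataFrame({
    'diagnosis': ['M', 'M', 'B', 'B', 'B'],
    'radius_mean': [18.0, 20.5, 11.4, 12.0, 13.1],
    'perimeter_mean': [120.0, 133.0, 75.0, 78.0, 85.0],
    'smoothness_mean': [0.10, 0.09, 0.11, 0.08, 0.12],
    'Unnamed: 32': [np.nan] * 5,
})

# Question 2: count missing values per column
missing_counts = toy.isna().sum()

# Question 4: columns that are entirely empty are candidates for removal
empty_columns = [col for col in toy.columns if toy[col].isna().all()]

# Question 3: feature pairs whose absolute Pearson correlation exceeds 0.7
features = toy.drop(columns=['diagnosis'] + empty_columns)
corr = features.corr().abs()
high_corr_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.7
]

# Question 5: proportion of each label in the target
label_proportions = toy['diagnosis'].value_counts(normalize=True)

print(missing_counts)
print(empty_columns)
print(high_corr_pairs)
print(label_proportions)
```

The pair list only reports each pair once (upper triangle of the correlation matrix), which keeps the output readable when many features are correlated.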

## Exercise 2: Data Preparation

1. Write a function to delete any variables/features that are not useful for training the model (e.g., empty variables/features).
2. Write a function to convert the categorical target variable/feature to a binary, numeric one.
3. Write a function to apply `StandardScaler` to the training variables/features (but *not* the target variable/feature).

In [2]:
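# Function to delete any unneeded variables/features
def delete_unnamed_32(data):
    if 'Unnamed: 32' in data.columns:
        # Drop the empty column in place; drop(..., inplace=True) itself returns None
        data.drop('Unnamed: 32', axis=1, inplace=True)
    else:
        print("Variable 'Unnamed: 32' does not exist in the DataFrame.")
    return data  # Always return the DataFrame so the result can be reassigned

data = delete_unnamed_32(data)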

# Function to convert the categorical target variable/feature
def convert_target_variable(data, target_column):
    data[target_column] = data[target_column].map({'B': 0, 'M': 1})
    return data

convert_target_variable(data, 'diagnosis')

# Function to apply `StandardScaler`
from sklearn.preprocessing import StandardScaler

def apply_standard_scaler(data):
    scaler = StandardScaler()
    data.iloc[:,2:] = scaler.fit_transform(data.iloc[:,2:])
    return data

apply_standard_scaler(data)

|  | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | 1 | 1.097064 | -2.073335 | 1.269934 | 0.984375 | 1.568466 | 3.283515 | 2.652874 | 2.532475 | ... | 1.886690 | -1.359293 | 2.303601 | 2.001237 | 1.307686 | 2.616665 | 2.109526 | 2.296076 | 2.750622 | 1.937015 |
| 1 | 842517 | 1 | 1.829821 | -0.353632 | 1.685955 | 1.908708 | -0.826962 | -0.487072 | -0.023846 | 0.548144 | ... | 1.805927 | -0.369203 | 1.535126 | 1.890489 | -0.375612 | -0.430444 | -0.146749 | 1.087084 | -0.243890 | 0.281190 |
| 2 | 84300903 | 1 | 1.579888 | 0.456187 | 1.566503 | 1.558884 | 0.942210 | 1.052926 | 1.363478 | 2.037231 | ... | 1.511870 | -0.023974 | 1.347475 | 1.456285 | 0.527407 | 1.082932 | 0.854974 | 1.955000 | 1.152255 | 0.201391 |
| 3 | 84348301 | 1 | -0.768909 | 0.253732 | -0.592687 | -0.764464 | 3.283553 | 3.402909 | 1.915897 | 1.451707 | ... | -0.281464 | 0.133984 | -0.249939 | -0.550021 | 3.394275 | 3.893397 | 1.989588 | 2.175786 | 6.046041 | 4.935010 |
| 4 | 84358402 | 1 | 1.750297 | -1.151816 | 1.776573 | 1.826229 | 0.280372 | 0.539340 | 1.371011 | 1.428493 | ... | 1.298575 | -1.466770 | 1.338539 | 1.220724 | 0.220556 | -0.313395 | 0.613179 | 0.729259 | -0.868353 | -0.397100 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 926424 | 1 | 2.110995 | 0.721473 | 2.060786 | 2.343856 | 1.041842 | 0.219060 | 1.947285 | 2.320965 | ... | 1.901185 | 0.117700 | 1.752563 | 2.015301 | 0.378365 | -0.273318 | 0.664512 | 1.629151 | -1.360158 | -0.709091 |
| 565 | 926682 | 1 | 1.704854 | 2.085134 | 1.615931 | 1.723842 | 0.102458 | -0.017833 | 0.693043 | 1.263669 | ... | 1.536720 | 2.047399 | 1.421940 | 1.494959 | -0.691230 | -0.394820 | 0.236573 | 0.733827 | -0.531855 | -0.973978 |
| 566 | 926954 | 1 | 0.702284 | 2.045574 | 0.672676 | 0.577953 | -0.840484 | -0.038680 | 0.046588 | 0.105777 | ... | 0.561361 | 1.374854 | 0.579001 | 0.427906 | -0.809587 | 0.350735 | 0.326767 | 0.414069 | -1.104549 | -0.318409 |
| 567 | 927241 | 1 | 1.838341 | 2.336457 | 1.982524 | 1.735218 | 1.525767 | 3.272144 | 3.296944 | 2.658866 | ... | 1.961239 | 2.237926 | 2.303601 | 1.653171 | 1.430427 | 3.904848 | 3.197605 | 2.289985 | 1.919083 | 2.219635 |


## Exercise 3: Model Development

1. Split the data set into training data and testing data using the `scikit-learn` function `train_test_split`. You should create four new Python objects: `x_train`, `x_test`, `y_train`, and `y_test`. This way, the target variable/feature is held in its own object. Use an 80/20 train/test split with `random_state=734`.

In [3]:
# Create a train/test split
from sklearn.model_selection import train_test_split 

y = data['diagnosis']
x = data.drop(['id','diagnosis'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.2,
                                                    random_state=734) 
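
Exercise 2 scales the features before the split, so the scaler's means and standard deviations are computed from rows that later end up in the test set. One variation worth knowing, sketched below, is to fit the scaler on the training rows only by wrapping it in a `scikit-learn` pipeline. This is an illustrative alternative rather than the course's solution. To stay self-contained it loads the copy of the Wisconsin diagnostic data bundled with `scikit-learn` (an assumption made for convenience) instead of the downloaded CSV; note that the bundled copy codes the labels as 0 = malignant and 1 = benign, the reverse of the mapping used in Exercise 2.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wisconsin diagnostic data (0 = malignant, 1 = benign)
X, y = load_breast_cancer(return_X_y=True)

# Same 80/20 split settings as the exercise
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=734
)

# The pipeline fits StandardScaler on the training rows only, then reuses
# those training means and standard deviations when scoring the test rows
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_train, y_train)

test_accuracy = scaled_lr.score(X_test, y_test)
print(f"Test accuracy with train-only scaling: {test_accuracy:.4f}")
```

On a data set this small the difference in accuracy is likely to be minor, but the pipeline form also keeps the scaler inside each fold if the model is later passed to a cross-validated search.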

2. Train up to 3 different models using `scikit-learn` to build a model that predicts the target variable/feature using the training variables/features. Since the target is a discrete value, this will be a classification model. More information on models and `scikit-learn` can be found [here](https://scikit-learn.org/stable/). After training, save each model's prediction on the testing data in a new Python object. You don't have to adjust the default hyperparameters.

In [4]:
# Logistic Regression model
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression()
lr_clf.fit(x_train, y_train)

# Random Forest model
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
rf_clf.fit(x_train, y_train)

# Stochastic Gradient Descent (SGD) classifier model
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier()
sgd_clf.fit(x_train, y_train)

3. Evaluate each model using a simple measurement of Accuracy (use the `scikit-learn` method `.score()`) comparing the predicted values with the actual values in the `y_test` Python object created earlier. 

In [13]:
# Model accuracy
lr_score = lr_clf.score(x_test, y_test)
rf_score = rf_clf.score(x_test, y_test)
sgd_score = sgd_clf.score(x_test, y_test)

print(f"Model accuracy:\n Logistic regression: {lr_score},\n Random Forest: {rf_score},\n SGD: {sgd_score}")

Model accuracy:
 Logistic regression: 0.9912280701754386,
 Random Forest: 0.9824561403508771,
 SGD: 0.9824561403508771
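
The exercise text also asks you to save each model's predictions in a new Python object. A minimal sketch of that step follows, again using the data set bundled with `scikit-learn` as a stand-in for the downloaded CSV (an assumption for self-containment; the names `rf_pred`, `manual_accuracy`, and `builtin_accuracy` are illustrative). It shows that `.score()` on a classifier reports the same accuracy as comparing the saved predictions with `y_test`, and adds a confusion matrix to show where the errors fall.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=734
)

clf = RandomForestClassifier(random_state=734)
clf.fit(X_train, y_train)

# Save the model's predictions on the testing data in their own object
rf_pred = clf.predict(X_test)

# Accuracy computed from the saved predictions vs. the built-in .score()
manual_accuracy = accuracy_score(y_test, rf_pred)
builtin_accuracy = clf.score(X_test, y_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, rf_pred)

print(f"accuracy_score: {manual_accuracy}")
print(f".score():       {builtin_accuracy}")
print(cm)
```

Keeping the predictions around makes it easy to compute other metrics later without calling `predict` again.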


## Exercise 4: Hyperparameter Optimization and Model Selection

Now that we've trained a few candidate models, we can try to improve their performance using hyperparameter optimization.

1. Choose a single model, or all the candidate models, to perform hyperparameter optimization on. You may use the built-in tools in `scikit-learn` that we used in the course exercises (e.g., `GridSearchCV` and `RandomizedSearchCV`). Remember to choose a `scoring` parameter that's suitable for a classification model, such as `accuracy`! You should be able to reuse a lot of the same code from the course lab, too.

In [15]:
# Train a Random Forest model using grid search with cross-validation
from sklearn.model_selection import GridSearchCV

# Create a list of hyperparameter values set as key:value pairs
param_grid = [
    # Try 20 (4×5) combinations of hyperparameters
    {'n_estimators': [3, 5, 10, 30], 'max_features': [2, 4, 6, 8, 10]},
    # Try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Train Random Forest model
forest_clf = RandomForestClassifier(random_state=734)

# Five-fold cross-validation: (20+6)*5=130 rounds of training
grid_search = GridSearchCV(forest_clf, 
                           param_grid, 
                           cv=5,
                           scoring='accuracy',
                           return_train_score=True)

# Fit model to training data
grid_search.fit(x_train, y_train)

# Print best parameters and estimator
print(f"Optimal estimated hyperparameters: {grid_search.best_estimator_}")

Optimal estimated hyperparameters: RandomForestClassifier(max_features=8, n_estimators=10, random_state=734)
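
To see how much work a grid search will do before running it, you can count the candidate combinations with `ParameterGrid` from `scikit-learn`. The short sketch below repeats the `param_grid` from the cell above and multiplies the number of candidates by the number of folds; `n_candidates`, `n_folds`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the exercise cell
param_grid = [
    {'n_estimators': [3, 5, 10, 30], 'max_features': [2, 4, 6, 8, 10]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dictionary is expanded separately: 4*5 + 1*2*3 candidates
n_candidates = len(ParameterGrid(param_grid))

# Every candidate is trained once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, the search also trains the best candidate one more time on the full training data. After fitting, `grid_search.best_params_` and `grid_search.best_score_` give the winning combination and its mean cross-validated score.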


In [8]:
# Train a Random Forest model using randomized search with cross-validation
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Create a range of hyperparameter values set as key:value pairs
param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
}

# Train Random Forest model
forest_clf = RandomForestClassifier(random_state=734)

# Five-fold cross-validation: n_iter defaults to 10 sampled candidates, so 10*5=50 rounds of training
rand_search = RandomizedSearchCV(forest_clf, 
                                 param_distribs,
                                 cv=5,
                                 scoring='accuracy',
                                 return_train_score=True)

# Fit model to training data
rand_search.fit(x_train, y_train)

# Print best parameters and estimator
print(f"Optimal estimated hyperparameters: {rand_search.best_estimator_}")

'Optimal estimated hyperparameters: RandomForestClassifier(max_features=5, n_estimators=168, random_state=734)'
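
Because the randomized search above samples its candidates without a fixed seed, rerunning the cell can select different hyperparameters. The sketch below shows one way to make the sampling repeatable by passing `random_state` to `RandomizedSearchCV`, and how `n_iter` controls the number of sampled candidates. It is a reduced illustration, not the course's solution: it uses the data set bundled with `scikit-learn`, and the smaller `n_iter=5`, `cv=3`, and `n_estimators` range are assumptions chosen only to keep the run short.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=734
)

# randint excludes the upper bound: high=50 draws 1..49, high=8 draws 1..7
param_distribs = {
    'n_estimators': randint(low=1, high=50),
    'max_features': randint(low=1, high=8),
}

rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=734),
    param_distribs,
    n_iter=5,            # number of sampled candidates (default is 10)
    cv=3,                # three folds to keep the sketch quick
    scoring='accuracy',
    random_state=734,    # fixes which candidates are sampled
)
rand_search.fit(X_train, y_train)

print(f"Best parameters: {rand_search.best_params_}")
print(f"Best mean CV accuracy: {rand_search.best_score_:.4f}")
```

The `random_state` on the classifier and the one on the search do different jobs: the first fixes how each forest is grown, the second fixes which hyperparameter values are drawn.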

2. Compare the results of the first model(s) to the second model(s) with optimal hyperparameters. You'll probably have different results depending on which technique was used. Which model performs the best?

In [14]:
# Train a Random Forest model with GridSearch optimal hyperparameters
forest_gs = RandomForestClassifier(max_features=8, n_estimators=10, random_state=734)
forest_gs.fit(x_train, y_train)
forest_gs_score = forest_gs.score(x_test, y_test)

# Train a Random Forest model with the RandomSearch optimal hyperparameters
forest_rs = RandomForestClassifier(max_features=5, n_estimators=168, random_state=734)
forest_rs.fit(x_train, y_train)
forest_rs_score = forest_rs.score(x_test, y_test)

# Print the results
print(f"Model accuracy:\n Random Forest (GS): {forest_gs_score},\n Random Forest (RS): {forest_rs_score},\n Random Forest (default): {rf_score}")


Model accuracy:
 Random Forest (GS): 0.9824561403508771,
 Random Forest (RS): 0.9736842105263158,
 Random Forest (default): 0.9824561403508771
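
When comparing these scores, it helps to translate them back into counts of test rows. The sketch below assumes the data set has 569 rows (the figure given in its published description) and that `train_test_split` rounds the 20% test fraction up to a whole number of rows; the reported accuracies are consistent with that, since each is a multiple of 1/114.

```python
import math

n_rows = 569        # assumption: row count from the data set's description
test_size = 0.2
n_test = math.ceil(test_size * n_rows)  # size of the held-out test set

# Accuracies reported in the output above
reported = {
    'Random Forest (GS)': 0.9824561403508771,
    'Random Forest (RS)': 0.9736842105263158,
    'Random Forest (default)': 0.9824561403508771,
}

# Convert each accuracy into a number of misclassified test rows
errors = {
    name: n_test - round(score * n_test)
    for name, score in reported.items()
}

for name, n_wrong in errors.items():
    print(f"{name}: {n_wrong} of {n_test} test rows misclassified")
```

Seen this way, the models differ by a single test row, so a difference this small on one split is weak evidence that one set of hyperparameters is better than another.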


# Congrats! That's the end of the exercises!