# LabManual_5 - Deployment of a Model

## Overview

This lab continues the guided labs on the ML implementation pipeline.

In this lab, you will deploy a trained model and perform a prediction against the model. You will then perform a batch transform on the test dataset.


## Introduction to the business scenario

You work for a healthcare provider and want to improve the detection of abnormalities in orthopedic patients.

You are tasked with solving this problem by using machine learning (ML). You have access to a dataset that contains six biomechanical features and a target of *normal* or *abnormal*. You can use this dataset to train an ML model to predict if a patient will have an abnormality.


## About this dataset

This biomedical dataset was built by Dr. Henrique da Mota during a medical residence period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The data has been organized in two different, but related, classification tasks. 

The first task consists of classifying patients as belonging to one of three categories:

- *Normal* (100 patients)
- *Disk Hernia* (60 patients)
- *Spondylolisthesis* (150 patients)

For the second task, the categories *Disk Hernia* and *Spondylolisthesis* were merged into a single category that is labeled as *abnormal*. Thus, the second task consists of classifying patients as belonging to one of two categories: *Normal* (100 patients) or *Abnormal* (210 patients).
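As a minimal sketch of how the two tasks relate, the following snippet merges the three-class patient counts into the two-class counts. The dictionary names are hypothetical and used only for illustration; the counts come from the description above.

```python
from collections import Counter

# Patient counts for the three-class task, as stated in the dataset description.
three_class_counts = {"Normal": 100, "Disk Hernia": 60, "Spondylolisthesis": 150}

# Illustrative mapping from the three categories to the two-class labels
# (the name `to_binary` is hypothetical, not part of the dataset).
to_binary = {
    "Normal": "Normal",
    "Disk Hernia": "Abnormal",
    "Spondylolisthesis": "Abnormal",
}

# Add up the patient counts under their merged label.
two_class_counts = Counter()
for label, count in three_class_counts.items():
    two_class_counts[to_binary[label]] += count

print(dict(two_class_counts))
```

The merged counts show that the two-class dataset is imbalanced, with roughly twice as many *Abnormal* patients as *Normal* patients.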


## Attribute information

Each patient is represented in the dataset by six biomechanical attributes that are derived from the shape and orientation of the pelvis and lumbar spine (in this order): 

- Pelvic incidence
- Pelvic tilt
- Lumbar lordosis angle
- Sacral slope
- Pelvic radius
- Grade of spondylolisthesis

The following convention is used for the class labels:

- Disk Hernia (DH)
- Spondylolisthesis (SL)
- Normal (NO)
- Abnormal (AB)

For more information about this dataset, see the [Vertebral Column dataset webpage](http://archive.ics.uci.edu/ml/datasets/Vertebral+Column).


## Dataset attributions

This dataset was obtained from:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.


# Lab setup

Because this solution is split across several labs in the module, run the following cells to load the data and train the model that you will use in this lab.

**Note:** The setup can take up to 5 minutes to complete.

## Importing the data, splitting data sets, and training the model (repeat steps)

Running the following cells imports the data and prepares it for use.

**Note:** The following cells represent the key steps in the previous labs.


In [56]:
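import io
import zipfile

import pandas as pd
import requests
from scipy.io import arff
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Download and extract the dataset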
f_zip = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00212/vertebral_column_data.zip'
r = requests.get(f_zip, stream=True)
Vertebral_zip = zipfile.ZipFile(io.BytesIO(r.content))
Vertebral_zip.extractall()

# Load and prepare the data
data = arff.loadarff('column_2C_weka.arff')
df = pd.DataFrame(data[0])

# Map class values to binary
class_mapper = {b'Abnormal': 1, b'Normal': 0}
df['class'] = df['class'].replace(class_mapper)

# Save the class column separately before reordering
class_column = df['class'].copy()

# Reorder columns to place 'class' at the first position
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

# Split the data into train, test, and validation sets
train, test_and_validate = train_test_split(df, test_size=0.2, random_state=42, stratify=df['class'])
test, validate = train_test_split(test_and_validate, test_size=0.5, random_state=42, stratify=test_and_validate['class'])

# Drop the 'class' column from features and extract the target variable
X_train = train.drop(['class'], axis=1)
y_train = train['class']

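# Initialize and train the model
# Note: the scikit-learn wrapper sets the number of boosting rounds with `n_estimators`.
# `num_round` is the SageMaker XGBoost hyperparameter name and is likely ignored here.
model = XGBClassifier(objective='binary:logistic', eval_metric='auc', num_round=42)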
model.fit(X_train, y_train)
print("Training Completed")

Training Completed
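The two calls to `train_test_split` in the setup cell hold out 20 percent of the rows and then split that holdout in half. The following sketch works out the resulting row counts for the 310 patients in the dataset. It assumes that the held-out portion is rounded up, which is how scikit-learn is understood to behave; the helper `split_sizes` is hypothetical.

```python
def split_sizes(n_rows, holdout_percent):
    """Return (kept, held_out) row counts, rounding the held-out part up."""
    held_out = -(-n_rows * holdout_percent // 100)  # integer ceiling division
    return n_rows - held_out, held_out


n_total = 310  # 100 normal + 210 abnormal patients

# First split: keep 80 percent for training, hold out 20 percent.
n_train, n_holdout = split_sizes(n_total, 20)

# Second split: divide the holdout evenly into test and validation sets.
n_test, n_validate = split_sizes(n_holdout, 50)

print(n_train, n_test, n_validate)
```

Under these assumptions, the test set has 31 rows, which you can check against `test.shape` in the next step.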


# Step 1: Performing predictions

Now that the model is trained and available in the notebook, you will run some predictions.

First, review the test data and re-familiarize yourself with it.

In [14]:
test.shape

(31, 7)

The test set has 31 instances, each with seven columns: the class label and six features. The first five instances are:

In [8]:
test.head(5)

| | class | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|---|
| 136 | 1 | 88.024499 | 39.844669 | 81.774473 | 48.17983 | 116.601538 | 56.766083 |
| 230 | 0 | 65.611802 | 23.137919 | 62.582179 | 42.473883 | 124.128001 | -4.083298 |
| 134 | 1 | 52.204693 | 17.212673 | 78.094969 | 34.99202 | 136.972517 | 54.939134 |
| 130 | 1 | 50.066786 | 9.12034 | 32.168463 | 40.946446 | 99.712453 | 26.766697 |
| 47 | 1 | 41.352504 | 16.577364 | 30.706191 | 24.775141 | 113.266675 | -4.497958 |


You don't need to include the target value (class) when you request a prediction. The model accepts the feature columns as a DataFrame, so you can get the first row *without the class column* by using the following code:

`test.iloc[:1,1:]` 

The **iloc** indexer selects by position and takes slices in the form [*rows*, *cols*].

To get only the first row, use `0:1`. To get the second row, use `1:2`.

To get all columns *except* the first column (*col 0*), use `1:`.
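The slicing rules above can be tried on a small stand-in DataFrame. This is only a sketch: the frame `toy` and its values are made up for illustration, and it mirrors the layout of `test` only in having the label in the first column.

```python
import pandas as pd

# Toy frame with the label in column 0, mirroring the layout of `test`
# (column names and values are made up for illustration).
toy = pd.DataFrame(
    {
        "class": [1, 0, 1],
        "feature_a": [0.1, 0.2, 0.3],
        "feature_b": [10.0, 20.0, 30.0],
    },
    index=[136, 230, 134],
)

first_row = toy.iloc[0:1, 1:]   # row position 0, every column after the first
second_row = toy.iloc[1:2, 1:]  # row position 1, every column after the first

print(first_row.shape)
print(list(second_row.columns))
print(second_row.index[0])
```

Note that `iloc` works with positions, not index labels: row position 1 is the row whose index label is 230.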



In [19]:
row = test.iloc[0:1,1:]
row.head()

| | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|
| 136 | 88.024499 | 39.844669 | 81.774473 | 48.17983 | 116.601538 | 56.766083 |


Now, you can use the data to perform a prediction.

In [21]:
model.predict_proba(row)

array([[0.00177544, 0.99822456]], dtype=float32)

The result you get isn't a *0* or a *1*. Instead, you get a pair of *probability scores*: the first value is the probability of class 0 (normal), and the second is the probability of class 1 (abnormal). You can apply some conditional logic to the probability score to determine if the answer should be presented as a 0 or a 1. You will work with this process when you do batch predictions.
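As a minimal sketch of that conditional logic, the following snippet converts the probability pair from the output above into a class label. The function name `to_label` and the default threshold of 0.5 are assumptions made for illustration.

```python
def to_label(probabilities, threshold=0.5):
    """Return 1 if the positive-class probability exceeds the threshold, else 0."""
    positive_probability = probabilities[1]  # second entry is P(class 1)
    return 1 if positive_probability > threshold else 0


# [P(class 0), P(class 1)] copied from the prediction output above.
scores = [0.00177544, 0.99822456]

print(to_label(scores))
```

Raising the threshold makes the function more conservative about labeling a patient as abnormal.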

For now, compare the result with the test data.

In [11]:
test.head(5)

| | class | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|---|
| 136 | 1 | 88.024499 | 39.844669 | 81.774473 | 48.17983 | 116.601538 | 56.766083 |
| 230 | 0 | 65.611802 | 23.137919 | 62.582179 | 42.473883 | 124.128001 | -4.083298 |
| 134 | 1 | 52.204693 | 17.212673 | 78.094969 | 34.99202 | 136.972517 | 54.939134 |
| 130 | 1 | 50.066786 | 9.12034 | 32.168463 | 40.946446 | 99.712453 | 26.766697 |
| 47 | 1 | 41.352504 | 16.577364 | 30.706191 | 24.775141 | 113.266675 | -4.497958 |


**Question:** Is the prediction accurate?

**Challenge task:** Update the previous code to send the second row of the dataset. Are those predictions correct? Try this task with a few other rows.

Sending these rows one at a time can be tedious. You could write a function to submit the values in a batch, which you will examine next.

In [62]:
row2 = test.iloc[1:2, 1:]
row2_actual = test.iloc[1:2, 0].values[0]

row2_prediction = model.predict_proba(row2)
row2_predicted_class = (row2_prediction[0][1] > 0.5).astype(int)

print(f"Row 2 prediction: {row2_prediction}")
print(f"Actual class: {row2_actual}, Predicted class: {row2_predicted_class}")

Row 2 prediction: [[0.33137828 0.6686217 ]]
Actual class: 0, Predicted class: 1


In [66]:
# Sample 5 random rows from the test set
random_sample = test.sample(n=5, random_state=42)

# Function to convert probabilities to class labels
def predict_class(probabilities, threshold=0.5):
    return (probabilities[:, 1] > threshold).astype(int)

# Make predictions on the sampled rows and print results
for i, row in random_sample.iterrows():
    row_data = row[1:].values.reshape(1, -1)  # Reshape to 2D array for prediction
    actual_class = row['class']
    prediction = model.predict_proba(row_data)
    predicted_class = predict_class(prediction)
    print(f"Row {i}: Actual class: {actual_class}, Predicted class: {predicted_class[0]}, Probabilities: {prediction}")

Row 0: Actual class: 1.0, Predicted class: 0, Probabilities: [[0.9680065  0.03199349]]
Row 194: Actual class: 1.0, Predicted class: 1, Probabilities: [[0.01658463 0.98341537]]
Row 95: Actual class: 1.0, Predicted class: 1, Probabilities: [[6.198883e-04 9.993801e-01]]
Row 174: Actual class: 1.0, Predicted class: 1, Probabilities: [[0.0016551 0.9983449]]
Row 297: Actual class: 0.0, Predicted class: 1, Probabilities: [[0.00233924 0.99766076]]


# Step 2: Performing a batch transform

When you are in the training-testing-feature engineering cycle, you want to test your holdout or test sets against the model. You can then use those results to calculate metrics. Instead of sending rows one at a time, you can pass the entire test set to the model in a single call.


In [68]:
batch_X = test.iloc[:, 1:]
batch_X.head()

| | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|
| 136 | 88.024499 | 39.844669 | 81.774473 | 48.17983 | 116.601538 | 56.766083 |
| 230 | 65.611802 | 23.137919 | 62.582179 | 42.473883 | 124.128001 | -4.083298 |
| 134 | 52.204693 | 17.212673 | 78.094969 | 34.99202 | 136.972517 | 54.939134 |
| 130 | 50.066786 | 9.12034 | 32.168463 | 40.946446 | 99.712453 | 26.766697 |
| 47 | 41.352504 | 16.577364 | 30.706191 | 24.775141 | 113.266675 | -4.497958 |


In [70]:
predicted_probabilities = model.predict_proba(batch_X)

In [72]:
target_predicted = pd.DataFrame(predicted_probabilities[:, 1], columns=['class'])
target_predicted.head(5)

| | class |
|---|---|
| 0 | 0.998225 |
| 1 | 0.668622 |
| 2 | 0.995486 |
| 3 | 0.998336 |
| 4 | 0.961274 |


In [74]:
def binary_convert(x):
    threshold = 0.65
    if x > threshold:
        return 1
    else:
        return 0

target_predicted['binary'] = target_predicted['class'].apply(binary_convert)

print(target_predicted.head(10))
test.head(10)

      class  binary
0  0.998225       1
1  0.668622       1
2  0.995486       1
3  0.998336       1
4  0.961274       1
5  0.999004       1
6  0.997197       1
7  0.991417       1
8  0.997661       1
9  0.659416       1


| | class | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|---|
| 136 | 1 | 88.024499 | 39.844669 | 81.774473 | 48.17983 | 116.601538 | 56.766083 |
| 230 | 0 | 65.611802 | 23.137919 | 62.582179 | 42.473883 | 124.128001 | -4.083298 |
| 134 | 1 | 52.204693 | 17.212673 | 78.094969 | 34.99202 | 136.972517 | 54.939134 |
| 130 | 1 | 50.066786 | 9.12034 | 32.168463 | 40.946446 | 99.712453 | 26.766697 |
| 47 | 1 | 41.352504 | 16.577364 | 30.706191 | 24.775141 | 113.266675 | -4.497958 |
| 135 | 1 | 77.121344 | 30.349874 | 77.481083 | 46.77147 | 110.611148 | 82.093607 |
| 100 | 1 | 84.585607 | 30.361685 | 65.479486 | 54.223922 | 108.010218 | 25.118478 |
| 89 | 1 | 71.186811 | 23.896201 | 43.696665 | 47.29061 | 119.864938 | 27.283985 |
| 297 | 0 | 45.575482 | 18.759135 | 33.774143 | 26.816347 | 116.797007 | 3.13191 |
| 4 | 1 | 49.712859 | 9.652075 | 28.317406 | 40.060784 | 108.168725 | 7.918501 |


**Note:** The *threshold* in the **binary_convert** function is set to *0.65*.

**Challenge task:** Experiment with changing the value of the threshold. Does changing it affect the results?

**Note:** The initial model might not perform well. You will generate some metrics in the next lab, before you tune the model in the final lab.

In [77]:
# Define a function to convert probabilities to binary class labels based on a threshold
def binary_convert(x, threshold=0.65):
    if x > threshold:
        return 1
    else:
        return 0

# Apply the binary_convert function with the default threshold of 0.65
target_predicted['binary'] = target_predicted['class'].apply(lambda x: binary_convert(x, threshold=0.65))

# Experiment with different thresholds by applying the binary_convert function with each threshold
thresholds = [0.4, 0.5, 0.6, 0.7, 0.8]
for threshold in thresholds:
    target_predicted[f'binary_{threshold}'] = target_predicted['class'].apply(lambda x: binary_convert(x, threshold))

# Print the first 10 rows of the DataFrame with results for different thresholds
print("First 10 rows with different thresholds:")
print(target_predicted.head(10))

# Print the first 10 rows of the test DataFrame for comparison
print("\nFirst 10 rows of the test DataFrame:")
print(test.head(10))



First 10 rows with different thresholds:
      class  binary  binary_0.4  binary_0.5  binary_0.6  binary_0.7  \
0  0.998225       1           1           1           1           1   
1  0.668622       1           1           1           1           0   
2  0.995486       1           1           1           1           1   
3  0.998336       1           1           1           1           1   
4  0.961274       1           1           1           1           1   
5  0.999004       1           1           1           1           1   
6  0.997197       1           1           1           1           1   
7  0.991417       1           1           1           1           1   
8  0.997661       1           1           1           1           1   
9  0.659416       1           1           1           1           0   

   binary_0.8  
0           1  
1           0  
2           1  
3           1  
4           1  
5           1  
6           1  
7           1  
8           1  
9           0  
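To see whether the threshold affects the results, you can compare the thresholded scores with the actual classes. The following sketch does this for the first 10 test rows only, using the scores and class values shown in the outputs above. The helper `summarize` is hypothetical, and 10 rows are too few to draw conclusions about the model.

```python
# Positive-class scores for the first 10 test rows (from the output above).
scores = [0.998225, 0.668622, 0.995486, 0.998336, 0.961274,
          0.999004, 0.997197, 0.991417, 0.997661, 0.659416]

# Actual class values for the same 10 rows (from test.head(10)).
actual = [1, 0, 1, 1, 1, 1, 1, 1, 0, 1]


def summarize(threshold):
    """Return (number predicted as 1, number of correct predictions)."""
    predicted = [1 if score > threshold else 0 for score in scores]
    positives = sum(predicted)
    correct = sum(p == a for p, a in zip(predicted, actual))
    return positives, correct


for threshold in (0.5, 0.65, 0.7):
    print(threshold, summarize(threshold))
```

On these 10 rows, moving the threshold from 0.65 to 0.7 changes which rows are misclassified without changing how many are correct: one normal patient is now labeled correctly, but one abnormal patient is missed.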



# Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.