# LabManual_4 - Train the Model Using XGBoost algorithm


## Overview

This lab continues the series of guided labs on the ML implementation pipeline.

In this lab, you will split the data into three separate datasets:

- *Training Set* - This will be used to train the model.
- *Validation Set* - This will be used during training to validate the model.
- *Test Set* - This will be held back and used to produce metrics after the model is trained. You will use this dataset in an upcoming lab.

With the split data, you will train an XGBoost model.


## Introduction to the business scenario

You work for a healthcare provider, and want to improve detection of abnormalities in orthopedic patients. 

You are tasked with solving this problem by using machine learning (ML). You have access to a dataset that contains six biomechanical features and a target of *normal* or *abnormal*. You can use this dataset to train an ML model to predict if a patient will have an abnormality.


## About this dataset

This biomedical dataset was built by Dr. Henrique da Mota during a medical residence period in the Group of Applied Research in Orthopaedics (GARO) of the Centre Médico-Chirurgical de Réadaptation des Massues, Lyon, France. The data has been organized in two different, but related, classification tasks. 

The first task consists in classifying patients as belonging to one of three categories: 

- *Normal* (100 patients)
- *Disk Hernia* (60 patients)
- *Spondylolisthesis* (150 patients)

For the second task, the categories *Disk Hernia* and *Spondylolisthesis* were merged into a single category that is labeled as *abnormal*. Thus, the second task consists in classifying patients as belonging to one of two categories: *Normal* (100 patients) or *Abnormal* (210 patients).
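
The class merge for the second task can be sketched in a few lines. The label strings and the `merge_map` name are illustrative assumptions; only the patient counts come from the description above.

```python
from collections import Counter

# Patient counts for the three-class task, as stated above.
three_class_counts = {"Normal": 100, "Disk Hernia": 60, "Spondylolisthesis": 150}

# Hypothetical mapping that folds both pathologies into one "Abnormal" label.
merge_map = {
    "Normal": "Normal",
    "Disk Hernia": "Abnormal",
    "Spondylolisthesis": "Abnormal",
}

two_class_counts = Counter()
for label, count in three_class_counts.items():
    two_class_counts[merge_map[label]] += count

print(dict(two_class_counts))  # {'Normal': 100, 'Abnormal': 210}
```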


## Attribute information

Each patient is represented in the dataset by six biomechanical attributes that are derived from the shape and orientation of the pelvis and lumbar spine (in this order): 

- Pelvic incidence
- Pelvic tilt
- Lumbar lordosis angle
- Sacral slope
- Pelvic radius
- Grade of spondylolisthesis

The following convention is used for the class labels: 

- Disk Hernia (DH)
- Spondylolisthesis (SL)
- Normal (NO)
- Abnormal (AB)


For more information about this dataset, see the [Vertebral Column dataset webpage](http://archive.ics.uci.edu/ml/datasets/Vertebral+Column).


## Dataset attributions

This dataset was obtained from:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.


# Lab setup
Because this solution is split across several labs in the module, you must run the following cells so that you can load the data.

## Importing the data and Exploring the Data (Repeat steps)

By running the following cells, the data will be imported and ready for use. 

**Note:** The following cells represent the key steps in the previous labs.

In [1]:
import io
import os
import warnings
import zipfile

import pandas as pd
import requests
from scipy.io import arff
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")

In [2]:
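# Download the dataset archive and extract it into the working directory.
f_zip = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00212/vertebral_column_data.zip'
r = requests.get(f_zip, stream=True)
r.raise_for_status()  # stop early if the download failed
Vertebral_zip = zipfile.ZipFile(io.BytesIO(r.content))
Vertebral_zip.extractall()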

In [3]:
data = arff.loadarff('column_2C_weka.arff')
df = pd.DataFrame(data[0])

You will start with a quick reminder of the data in the dataset.  To get the most out of this lab, carefully read the instructions and code before you run the cells. Take time to experiment!

First, use shape to examine the number of rows and columns.

In [4]:
df.shape

(310, 7)

In [5]:
df.columns

Index(['pelvic_incidence', 'pelvic_tilt', 'lumbar_lordosis_angle',
       'sacral_slope', 'pelvic_radius', 'degree_spondylolisthesis', 'class'],
      dtype='object')

You can see the six biomechanical features, and that the target column is named *class*.


Next, convert the *class* column to numerical values.

In [13]:
class_mapper = {b'Abnormal':1,b'Normal':0}
df['class']=df['class'].replace(class_mapper)
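
The mapper keys are written as `b'Abnormal'` and `b'Normal'` because `scipy.io.arff.loadarff` returns nominal attributes as byte strings. A minimal sketch of the same encoding on a small made-up frame follows. The `toy` DataFrame is an assumption for illustration, and `map` is used here as an alternative to `replace`.

```python
import pandas as pd

# Toy stand-in for the loaded data: the class column holds bytes, as loadarff produces.
toy = pd.DataFrame({"class": [b"Abnormal", b"Normal", b"Abnormal"]})

class_mapper = {b"Abnormal": 1, b"Normal": 0}

# map() yields NaN for any label missing from the mapper, so astype(int)
# fails loudly instead of silently keeping an unexpected value.
toy["class"] = toy["class"].map(class_mapper).astype(int)

print(toy["class"].tolist())  # [1, 0, 1]
```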

## Step 1: Preparing the data for Training

For this lab, you must split the data into three datasets.

There are many different ways to split datasets. Many code samples that you might find first separate the dataset into the *target* and the *features*, and then split each of those into three subsets, which results in a total of six datasets to track. In this lab, you keep the target and the features together in one DataFrame and split that instead, so you track only three datasets.
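
For contrast, the features-and-target pattern described above might look like the following sketch. The synthetic frame and the names `X`, `y`, and `X_rest` are assumptions for illustration; the lab itself keeps the target inside the DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the lab data: 310 rows, 6 features, 1 target.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(310, 6)), columns=[f"f{i}" for i in range(6)])
toy["class"] = [1] * 210 + [0] * 100

# Separate features and target first...
X = toy.drop(columns="class")
y = toy["class"]

# ...then split both twice, which leaves six objects to keep in sync.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_test), len(X_val))  # 248 31 31
```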

## Moving the target column position

Get the target column and move it to the first position.

In [14]:
cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

You should see that the **class** is now the first column.

In [15]:
df.columns
df.head()

| | class | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 63.027817 | 22.552586 | 39.609117 | 40.475232 | 98.672917 | -0.2544 |
| 1 | 1 | 39.056951 | 10.060991 | 25.015378 | 28.99596 | 114.405425 | 4.564259 |
| 2 | 1 | 68.832021 | 22.218482 | 50.092194 | 46.613539 | 105.985135 | -3.530317 |
| 3 | 1 | 69.297008 | 24.652878 | 44.311238 | 44.64413 | 101.868495 | 11.211523 |
| 4 | 1 | 49.712859 | 9.652075 | 28.317406 | 40.060784 | 108.168725 | 7.918501 |
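
The `cols[-1:] + cols[:-1]` slicing works because the target happens to be the last column. A sketch of a name-based variant, which does not depend on the column's position, is shown here. The `toy` frame and the `target` variable are illustrative assumptions.

```python
import pandas as pd

# Small made-up frame with the target in the last position.
toy = pd.DataFrame(
    {"pelvic_tilt": [22.5, 10.1], "sacral_slope": [40.5, 29.0], "class": [1, 0]}
)

target = "class"
# Put the target first, then every other column in its existing order.
toy = toy[[target] + [c for c in toy.columns if c != target]]

print(toy.columns.tolist())  # ['class', 'pelvic_tilt', 'sacral_slope']
```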


## Splitting the data

You will start by splitting the dataset into two datasets. You will use one dataset for training, and you will split the other dataset again for use with validation and testing.

You will use the *train_test_split function* from the *scikit-learn library*, which is a free machine learning library for Python. It has many algorithms and useful functions, such as the one you will use. 

- For more information about the function, see the [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- For more information about scikit-learn, see the [scikit-learn guide](https://scikit-learn.org/stable/).

Because you don't have a lot of data, you want to make sure that each split contains a representative proportion of each class. Thus, you will use the *stratify* parameter. Finally, you will set *random_state* to a fixed seed so that the splits are repeatable.

In [16]:
from sklearn.model_selection import train_test_split
train, test_and_validate = train_test_split(df, test_size=0.2, random_state=42, stratify=df['class'])

In [17]:
train

| | class | pelvic_incidence | pelvic_tilt | lumbar_lordosis_angle | sacral_slope | pelvic_radius | degree_spondylolisthesis |
|---|---|---|---|---|---|---|---|
| 202 | 1 | 76.314028 | 41.933683 | 93.284863 | 34.380345 | 132.267286 | 101.218783 |
| 178 | 1 | 80.654320 | 26.344379 | 60.898118 | 54.309940 | 120.103493 | 52.467552 |
| 68 | 1 | 72.076278 | 18.946176 | 51.000000 | 53.130102 | 114.213013 | 1.010041 |
| 118 | 1 | 65.536003 | 24.157487 | 45.775170 | 41.378515 | 136.440302 | 16.378086 |
| 182 | 1 | 75.437748 | 31.539454 | 89.600000 | 43.898294 | 106.829590 | 54.965789 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 282 | 0 | 53.683380 | 13.447022 | 41.584297 | 40.236358 | 113.913703 | 2.737035 |
| 265 | 0 | 48.170746 | 9.594217 | 39.710920 | 38.576530 | 135.623310 | 5.360051 |
| 180 | 1 | 37.903910 | 4.479099 | 24.710274 | 33.424811 | 157.848799 | 33.607027 |
| 28 | 1 | 44.551012 | 21.931147 | 26.785916 | 22.619865 | 111.072920 | 2.652321 |


Next, split the *test_and_validate* dataset into two equal parts.

In [18]:
test, validate = train_test_split(test_and_validate, test_size=0.5, random_state=42, stratify=test_and_validate['class'])

Examine the three datasets.

In [19]:
print(train.shape)
print(test.shape)
print(validate.shape)

(248, 7)
(31, 7)
(31, 7)
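
The two-stage split can also be wrapped in a small helper. This is a sketch, not part of the lab: the function name `three_way_split` and its parameters are hypothetical, and it is shown on synthetic data with the same class counts as the lab dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(frame, target, holdout=0.2, seed=42):
    """Return (train, test, validate), stratified on the target column."""
    train, rest = train_test_split(
        frame, test_size=holdout, random_state=seed, stratify=frame[target]
    )
    # Split the held-out rows in half: one half for testing, one for validation.
    test, validate = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest[target]
    )
    return train, test, validate


# Synthetic frame: 210 abnormal (1) and 100 normal (0) rows.
toy = pd.DataFrame({"class": [1] * 210 + [0] * 100, "feature": range(310)})
train, test, validate = three_way_split(toy, "class")
print(train.shape, test.shape, validate.shape)  # (248, 2) (31, 2) (31, 2)
```

With 310 rows, holding out 20% leaves 248 rows for training and 62 rows to divide equally, which matches the shapes printed above.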


Now, check the distribution of the classes.

In [20]:
print(train['class'].value_counts())
print(test['class'].value_counts())
print(validate['class'].value_counts())

class
1    168
0     80
Name: count, dtype: int64
class
1    21
0    10
Name: count, dtype: int64
class
1    21
0    10
Name: count, dtype: int64
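
The counts above can be checked as proportions. The arithmetic below uses only the numbers printed by the cell, plus the 210 abnormal cases out of 310 patients in the full dataset.

```python
from fractions import Fraction

# (abnormal, total) for each subset, taken from the value_counts output above.
subsets = {
    "full": (210, 310),
    "train": (168, 248),
    "test": (21, 31),
    "validate": (21, 31),
}

for name, (abnormal, total) in subsets.items():
    share = Fraction(abnormal, total)
    print(f"{name:<8} {share} = {float(share):.3f}")
```

All four fractions reduce to 21/31, about 67.7% abnormal, so stratification preserved the class balance exactly in every subset.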


## Step 2: Training the model


If xgboost is not already installed in your environment, install it first.

Then, import the XGBClassifier from xgboost and create the model. Running **fit** will train the model.

In [15]:
!conda install xgboost -y

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.9.0
  latest version: 24.3.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /root/miniconda3

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           9 KB
    ca-certificates-2024.3.11  |       h06a4308_0         127 KB
    libxgboost-1.7.3           |       h6a678d5_0         3.9 MB
    py-xgboost-1.7.3           |   py39h06a4308_0         213 KB
    xgboost-1.7.3              |   py39h06a4308_0          12 KB
    ------------------------------------------------------------
                                           Total:         4.2 MB

The following NEW packages will be INSTALLED:

  _py-xgboost-mutex  pkgs/main/l

In [21]:
from xgboost import XGBClassifier
# The scikit-learn wrapper takes n_estimators (boosting rounds); num_round is not
# one of its parameters and is ignored with a warning.
model = XGBClassifier(objective='binary:logistic', eval_metric='auc', n_estimators=42)
# Train on the feature columns, with class as the target.
model.fit(train.drop(['class'], axis = 1).values, train['class'].values)
print("Training Completed")

Training Completed


After the training is complete, you are ready to test and evaluate the model. However, you will do the testing and validation in later labs.

### Congratulations!

You have completed this lab, and you can now end the lab by following the lab guide instructions.