# Task Overview 
Task: This synthetic dataset is based on a real dataset funded by the Mayo Clinic. Each example holds general and survival information about a patient with liver cirrhosis, a condition involving prolonged liver damage. The goal is to train a machine learning model that predicts the patient's survival status (the Status column) from the other features.

Approach: We first train a Logistic Regression model with sklearn as a baseline, then train a performance-focused model with XGBoost as a learning exercise.

Conclusions: 

In [1]:
import numpy as np
import pandas as pd

# Read in the training data as a pandas df 
train_file_path = "/kaggle/input/playground-series-s3e26/train.csv"
train_df = pd.read_csv(train_file_path) 

train_df.head(5)

Unnamed: 0,id,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,0,999,D-penicillamine,21532,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,D
1,1,2574,Placebo,19237,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,C
2,2,3428,Placebo,13727,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,D
3,3,2576,Placebo,18460,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,C
4,4,788,Placebo,16658,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,C


In [2]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             7905 non-null   int64  
 1   N_Days         7905 non-null   int64  
 2   Drug           7905 non-null   object 
 3   Age            7905 non-null   int64  
 4   Sex            7905 non-null   object 
 5   Ascites        7905 non-null   object 
 6   Hepatomegaly   7905 non-null   object 
 7   Spiders        7905 non-null   object 
 8   Edema          7905 non-null   object 
 9   Bilirubin      7905 non-null   float64
 10  Cholesterol    7905 non-null   float64
 11  Albumin        7905 non-null   float64
 12  Copper         7905 non-null   float64
 13  Alk_Phos       7905 non-null   float64
 14  SGOT           7905 non-null   float64
 15  Tryglicerides  7905 non-null   float64
 16  Platelets      7905 non-null   float64
 17  Prothrombin    7905 non-null   float64
 18  Stage   

In [3]:
train_df.describe()

Unnamed: 0,id,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0
mean,3952.0,2030.173308,18373.14649,2.594485,350.561923,3.548323,83.902846,1816.74525,114.604602,115.340164,265.228969,10.629462,3.032511
std,2282.121272,1094.233744,3679.958739,3.81296,195.379344,0.346171,75.899266,1903.750657,48.790945,52.530402,87.465579,0.781735,0.866511
min,0.0,41.0,9598.0,0.3,120.0,1.96,4.0,289.0,26.35,33.0,62.0,9.0,1.0
25%,1976.0,1230.0,15574.0,0.7,248.0,3.35,39.0,834.0,75.95,84.0,211.0,10.0,2.0
50%,3952.0,1831.0,18713.0,1.1,298.0,3.58,63.0,1181.0,108.5,104.0,265.0,10.6,3.0
75%,5928.0,2689.0,20684.0,3.0,390.0,3.77,102.0,1857.0,137.95,139.0,316.0,11.0,4.0
max,7904.0,4795.0,28650.0,28.0,1775.0,4.64,588.0,13862.4,457.25,598.0,563.0,18.0,4.0


## Dataset Pre-Processing
Task Overview
1. Missing Feature Value Handling (Unnecessary for XGBoost)
2. Categorical Attribute Handling
3. Feature Scaling (Unnecessary for XGBoost)

## Missing Feature Value Handling

Problem: We need to check if there are any missing feature values and handle the problem accordingly.

Solution: XGBoost handles missing feature values natively, so no pre-processing is needed for it. The Logistic Regression baseline in sklearn does not accept missing values, so we check for them anyway. The DataFrame method isna returns a boolean DataFrame in which each cell is True if the value is missing and False otherwise. Applying sum to the result gives the number of missing values in each column.

In [4]:
# Check for missing feature values
print("Number of missing feature values by column: ")
print(train_df.isna().sum())

Number of missing feature values by column: 
id               0
N_Days           0
Drug             0
Age              0
Sex              0
Ascites          0
Hepatomegaly     0
Spiders          0
Edema            0
Bilirubin        0
Cholesterol      0
Albumin          0
Copper           0
Alk_Phos         0
SGOT             0
Tryglicerides    0
Platelets        0
Prothrombin      0
Stage            0
Status           0
dtype: int64
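
The check above finds no missing values, so no imputation is needed for this dataset. If a future version of the data did contain gaps, the Logistic Regression baseline would need them filled first. The following is a minimal sketch of one common option, median imputation with sklearn's SimpleImputer. The toy frame and its values are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; the values are illustrative, not taken from train_df.
toy = pd.DataFrame({
    "Bilirubin": [2.3, np.nan, 3.3],
    "Albumin": [3.35, 3.54, np.nan],
})

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled.isna().sum())
```

The median is only one possible choice; the mean or a constant could be used instead by changing the strategy argument.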


## Categorical Attributes/Columns 
Problem: The DataFrame info method shows six categorical predictor columns (Drug, Sex, Ascites, Hepatomegaly, Spiders, Edema) that we need to convert from text into numerical values. The target column, Status, is also stored as text.

Solution: The main approaches for categorical attribute handling are 
1. Ordinal Encoding - Useful when the categories correspond to an ascending or descending order. 
2. One-Hot Encoding - For each categorical column, convert it into multiple columns, one for each possible category. This is used when the categories do not have an obvious logical order. 
3. Numerical Feature Replacement (Advanced) - When the number of categories is very large (hundreds or thousands), consider replacing the categorical column with a numerical column that maps each category to a meaningful number. For example, one could convert a country code into the country's population. 
4. Embedding Replacement (Advanced) - Alternatively, one can replace categories with embeddings, which are low dimensional vectors that represent the category. 

In this case, we use one-hot encoding, since none of the categories seem to have a logical order and the number of categories is low (under 10 for all categorical columns).
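
To make the difference between the first two approaches concrete, here is a small sketch on a toy frame. The Size column and its ordering are hypothetical and exist only to illustrate ordinal encoding; they are not part of the cirrhosis dataset.

```python
import pandas as pd

# Toy frame; "Size" and its values are hypothetical, "Drug" mirrors the dataset.
toy = pd.DataFrame({
    "Drug": ["Placebo", "D-penicillamine", "Placebo"],
    "Size": ["small", "large", "medium"],
})

# Ordinal encoding: map categories to integers that respect their order.
size_order = {"small": 0, "medium": 1, "large": 2}
toy["Size_ordinal"] = toy["Size"].map(size_order)

# One-hot encoding: one indicator column per category, with no implied order.
drug_onehot = pd.get_dummies(toy["Drug"], prefix="Drug")

encoded = pd.concat([toy[["Size_ordinal"]], drug_onehot], axis=1)
print(encoded)
```

Ordinal encoding keeps a single column but imposes an order on the categories, while one-hot encoding adds a column per category and imposes none.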

In [5]:
# Confirm that the number of categories in the categorical columns is manageable (< 100)
unique_values_per_column = train_df.nunique()

print(unique_values_per_column)

id               7905
N_Days            461
Drug                2
Age               391
Sex                 2
Ascites             2
Hepatomegaly        2
Spiders             2
Edema               3
Bilirubin         111
Cholesterol       226
Albumin           160
Copper            171
Alk_Phos          364
SGOT              206
Tryglicerides     154
Platelets         227
Prothrombin        49
Stage               4
Status              3
dtype: int64


In [6]:
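# Convert the categorical columns into one-hot encodings.
# Note: with no `columns` argument, get_dummies encodes every object column,
# so the Status target is one-hot encoded along with the six predictors.
train_df = pd.get_dummies(train_df)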

# Confirm the transformation was successful
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 29 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    7905 non-null   int64  
 1   N_Days                7905 non-null   int64  
 2   Age                   7905 non-null   int64  
 3   Bilirubin             7905 non-null   float64
 4   Cholesterol           7905 non-null   float64
 5   Albumin               7905 non-null   float64
 6   Copper                7905 non-null   float64
 7   Alk_Phos              7905 non-null   float64
 8   SGOT                  7905 non-null   float64
 9   Tryglicerides         7905 non-null   float64
 10  Platelets             7905 non-null   float64
 11  Prothrombin           7905 non-null   float64
 12  Stage                 7905 non-null   float64
 13  Drug_D-penicillamine  7905 non-null   bool   
 14  Drug_Placebo          7905 non-null   bool   
 15  Sex_F                
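
Because the call above also expands the Status target into indicator columns, one may prefer to encode only the predictors and keep Status as a single label column. The sketch below shows one way to do that with the `columns` parameter of get_dummies. It uses a two-row toy frame copied from the first rows of the preview above, and the name `predictor_categoricals` is introduced here for illustration.

```python
import pandas as pd

# Two rows mirroring the first entries of train_df.head(), reduced to four columns.
toy_df = pd.DataFrame({
    "Drug": ["D-penicillamine", "Placebo"],
    "Sex": ["M", "F"],
    "Bilirubin": [2.3, 0.9],
    "Status": ["D", "C"],
})

# Encode only the listed predictor columns; Status is left untouched.
predictor_categoricals = ["Drug", "Sex"]
encoded = pd.get_dummies(toy_df, columns=predictor_categoricals)

print(encoded.columns.tolist())
```

For the full dataset, the list would presumably contain all six categorical predictors named earlier.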

## Feature Scaling

Problem: Models that combine feature values into a single score, such as Logistic Regression, typically perform better when the features are on the same scale. 

Solution: XGBoost's decision trees split on one column at a time by comparing its values to a threshold, so values in different columns do not interact and scaling the features is not necessary for XGBoost. The Logistic Regression baseline, by contrast, can still benefit from scaled features.
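
The scale invariance of tree splits can be checked empirically. The sketch below uses sklearn's DecisionTreeClassifier as a stand-in for the trees inside XGBoost (an assumption made to keep the example small) and fits it on synthetic data with and without standardization. The data and the labeling rule are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Two synthetic features on very different scales.
X = np.column_stack([
    rng.normal(0, 1, 200),
    rng.normal(1800, 500, 200),
])
# Arbitrary labeling rule that depends on both features.
y = (X[:, 0] + (X[:, 1] - 1800) / 500 > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# Same tree settings, fitted once on raw and once on standardized features.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Identical predictions:", same_predictions)
```

Standardization preserves the ordering of values within each column, so the tree can find the same partitions of the rows in both cases.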

# Baseline Model Training Tasks
1. Divide the training set into target and predictor columns 
2. Train the Logistic Regression model using sklearn
3. Evaluate the results 
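
The three steps above can be sketched as follows. This is a minimal outline under stated assumptions, not the notebook's final training code: it runs on a small synthetic stand-in for train_df with randomly assigned labels, so the resulting score is meaningless. The label values, the 80/20 split, and the use of accuracy as the metric are all choices made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for train_df; all values and labels are made up.
toy_df = pd.DataFrame({
    "id": np.arange(n),
    "Bilirubin": rng.gamma(2.0, 1.3, n),
    "Albumin": rng.normal(3.5, 0.35, n),
    "Sex": rng.choice(["F", "M"], n),
    "Status": rng.choice(["C", "CL", "D"], n),
})

# 1. Separate the target from the predictors and drop the row identifier.
y = toy_df["Status"]
X = pd.get_dummies(toy_df.drop(columns=["id", "Status"]), dtype=float)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# 2. Standardize the features, then fit the Logistic Regression baseline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 3. Evaluate on the held-out rows.
val_accuracy = model.score(X_val, y_val)
val_proba = model.predict_proba(X_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline ensures the scaling statistics are learned from the training rows only.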