# Task Overview 
Task: This synthetic dataset is based on a real dataset funded by the Mayo Clinic. Each example holds general and survival-related information about a patient with liver cirrhosis, a condition involving prolonged liver damage. The goal is to train a machine learning model that predicts the patient's survival status from the data features.

Approach: We will train a Logistic Regression model with sklearn as a baseline model, and then train a performance-focused model with XGBoost as a learning exercise.

Conclusions: 

In [1]:
import numpy as np
import pandas as pd

# Read in the training and test data as Pandas DataFrames 
train_file_path = "/kaggle/input/playground-series-s3e26/train.csv"
train_df = pd.read_csv(train_file_path) 
test_file_path = "/kaggle/input/playground-series-s3e26/test.csv"
test_df = pd.read_csv(test_file_path)

# Remove the id column since it is not useful for prediction
test_id_df = test_df['id'].astype(int)
train_df = train_df.drop('id', axis=1)
test_df = test_df.drop('id', axis=1)


train_df.head(5)

Unnamed: 0,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,999,D-penicillamine,21532,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,D
1,2574,Placebo,19237,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,C
2,3428,Placebo,13727,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,D
3,2576,Placebo,18460,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,C
4,788,Placebo,16658,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,C


In [2]:
train_df.info()

# Make a list of categorical columns for future use (in the feature scaling section)
categorical_cols = ['Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema', 'Status']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   N_Days         7905 non-null   int64  
 1   Drug           7905 non-null   object 
 2   Age            7905 non-null   int64  
 3   Sex            7905 non-null   object 
 4   Ascites        7905 non-null   object 
 5   Hepatomegaly   7905 non-null   object 
 6   Spiders        7905 non-null   object 
 7   Edema          7905 non-null   object 
 8   Bilirubin      7905 non-null   float64
 9   Cholesterol    7905 non-null   float64
 10  Albumin        7905 non-null   float64
 11  Copper         7905 non-null   float64
 12  Alk_Phos       7905 non-null   float64
 13  SGOT           7905 non-null   float64
 14  Tryglicerides  7905 non-null   float64
 15  Platelets      7905 non-null   float64
 16  Prothrombin    7905 non-null   float64
 17  Stage          7905 non-null   float64
 18  Status  

In [3]:
train_df.describe()

Unnamed: 0,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0
mean,2030.173308,18373.14649,2.594485,350.561923,3.548323,83.902846,1816.74525,114.604602,115.340164,265.228969,10.629462,3.032511
std,1094.233744,3679.958739,3.81296,195.379344,0.346171,75.899266,1903.750657,48.790945,52.530402,87.465579,0.781735,0.866511
min,41.0,9598.0,0.3,120.0,1.96,4.0,289.0,26.35,33.0,62.0,9.0,1.0
25%,1230.0,15574.0,0.7,248.0,3.35,39.0,834.0,75.95,84.0,211.0,10.0,2.0
50%,1831.0,18713.0,1.1,298.0,3.58,63.0,1181.0,108.5,104.0,265.0,10.6,3.0
75%,2689.0,20684.0,3.0,390.0,3.77,102.0,1857.0,137.95,139.0,316.0,11.0,4.0
max,4795.0,28650.0,28.0,1775.0,4.64,588.0,13862.4,457.25,598.0,563.0,18.0,4.0


## Dataset Pre-Processing
Task Overview
1. Target Variable Label Encoding 
2. Missing Feature Value Handling (Unnecessary for XGBoost)
3. Feature Scaling (Unnecessary for XGBoost)
4. Categorical Attribute Handling (Should be done at the end since it may change the number of columns)
5. Train/Test Split

## Target Variable Label Encoding

Problem: Since the target column (Status) is a text column, we need to convert the values into integers.

Approaches:
1. Label Encoding - Convert each text category into an integer label. 
2. Ordinal Encoding - Convert each text category into an integer label but with a particular order. Used when the categories have some quantitative order that can be taken advantage of.   
3. One-Hot Encoding - Convert each text category into a separate column. For example, class targets are often one-hot encoded when training a neural network with a softmax output layer.

Solution: There is no obvious ordering in the 'Status' column, and ML libraries typically expect a single target column, so we will use label encoding.

In [4]:
train_df['Status'].value_counts()

Status
C     4965
D     2665
CL     275
Name: count, dtype: int64

In [5]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder() 
train_df['Status'] = label_encoder.fit_transform(train_df['Status'])

# Check that the label encoder transformation was applied correctly
train_df['Status'].value_counts()

Status
0    4965
2    2665
1     275
Name: count, dtype: int64
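The counts above suggest the mapping C -> 0, CL -> 1, D -> 2. As a minimal sketch on a made-up list of labels (the `statuses` list is hypothetical, not taken from the dataset), the fitted encoder can confirm the mapping and reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels using the three Status categories
statuses = ["C", "D", "CL", "C", "D"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(statuses)

# classes_ lists the original labels in sorted order; the position is the integer code
classes = list(encoder.classes_)

# inverse_transform maps the integer codes back to the original text labels
decoded = list(encoder.inverse_transform(encoded))

print(classes)
print(list(encoded))
print(decoded)
```

Keeping the fitted encoder around is useful later, since `classes_` tells us which probability column belongs to which status.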

## Missing Feature Value Handling

Problem: We need to check if there are any missing feature values and handle the problem accordingly.

Solution: XGBoost handles missing feature values automatically, so no pre-processing is needed for that model. sklearn's LogisticRegression does not accept missing values, however, so the check still matters for the baseline. We use the isna DataFrame method, which returns a boolean DataFrame where each cell is True if the value is missing and False otherwise, and then apply the sum method to count the missing values in each column.

In [6]:
# Check for missing feature values
print("Number of missing feature values by column: ")
print(train_df.isna().sum())

Number of missing feature values by column: 
N_Days           0
Drug             0
Age              0
Sex              0
Ascites          0
Hepatomegaly     0
Spiders          0
Edema            0
Bilirubin        0
Cholesterol      0
Albumin          0
Copper           0
Alk_Phos         0
SGOT             0
Tryglicerides    0
Platelets        0
Prothrombin      0
Stage            0
Status           0
dtype: int64


In [7]:
print("Total number of missing feature values in test data:", test_df.isna().sum().sum())

Total number of missing feature values in test data: 0
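Since both checks report zero, the isna/sum chain is hard to see at work on this dataset. Here is a minimal sketch on a made-up three-row frame (`toy_df` and its values are hypothetical) that does contain gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing cells
toy_df = pd.DataFrame({
    "Bilirubin": [2.3, np.nan, 3.3],
    "Albumin": [3.35, 3.54, np.nan],
    "Stage": [3.0, 3.0, 4.0],
})

# Boolean frame: True where a value is missing
missing_mask = toy_df.isna()

# Summing booleans counts the True cells in each column
missing_per_column = missing_mask.sum()

# A second sum collapses the per-column counts into one total
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing:", total_missing)
```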


## Feature Scaling

Problem: Models that combine feature values into a single score, such as logistic regression, typically perform better when the features are on the same scale.

Approaches: 
1. Min-Max Scaling - Scales the data to a fixed range between two values (typically 0 and 1). Most useful for neural networks.
2. Standardization (Z-score Normalization) - Scales the data so that the mean is 0 and the standard deviation is 1. Most useful for algorithms that are sensitive to feature scale, such as SVMs and regularized logistic regression.
3. Robust Scaling - Scaling based on median and IQR. Most useful for handling significant outliers.
4. MaxAbsScaler - Scales each feature based on its maximum absolute value. Useful for sparse data. 
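
To make the four approaches concrete, here is a small sketch that applies sklearn's scalers to one made-up column containing an outlier (the `values` array is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature column with one large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "min_max": MinMaxScaler().fit_transform(values).ravel(),
    "standard": StandardScaler().fit_transform(values).ravel(),
    "robust": RobustScaler().fit_transform(values).ravel(),
    "max_abs": MaxAbsScaler().fit_transform(values).ravel(),
}

for name, column in scaled.items():
    print(name, np.round(column, 3))
```

With min-max and max-abs scaling, the outlier squeezes the other four values into a narrow band near 0, while robust scaling centers on the median and keeps them spread out.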

Solution: XGBoost's decision trees split on one column at a time, so values from different columns are never combined and scaling the features is not necessary. Logistic Regression, our baseline model, does combine the features into a weighted sum during training, so it needs feature scaling. Since we are doing feature scaling specifically for our Logistic Regression model, we will use the Standardization approach.
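
The claim that tree splits do not care about feature scale can be checked empirically. The sketch below uses sklearn's DecisionTreeClassifier as a stand-in for XGBoost's trees (an assumption made to keep the example light) on synthetic data from make_classification:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, not the cirrhosis dataset
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_toy)

# Train the same tree on raw and on standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

# Fraction of samples on which the two trees agree
agreement = np.mean(tree_raw.predict(X_toy) == tree_scaled.predict(X_scaled))
print("Prediction agreement:", agreement)
```

Standardization keeps the ordering of values within each column, so the two trees should choose equivalent splits and agree on essentially every sample.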

In [8]:
from sklearn.preprocessing import StandardScaler 

numerical_cols = train_df.columns.difference(categorical_cols)

std_scaler = StandardScaler() 
std_scaler.fit(train_df[numerical_cols])

train_df[numerical_cols] = std_scaler.transform(train_df[numerical_cols])
test_df[numerical_cols] = std_scaler.transform(test_df[numerical_cols])
                      
# Sanity-check the transformation: the scaler was fit on the training data, so the test
# columns should have a mean close to (not exactly) 0 and a std close to (not exactly) 1
test_df.describe()

Unnamed: 0,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0
mean,0.00779,0.033864,0.001549,0.009851,-0.029617,0.010526,-0.002895,-0.020847,-0.001029,-0.013781,0.004353,0.005175
std,0.993309,0.973958,1.010406,1.025961,1.02524,1.021709,1.016664,1.003627,1.001441,1.001418,1.014105,0.987967
min,-1.817984,-2.384728,-0.601797,-1.180148,-4.588553,-1.052815,-0.802543,-1.808946,-1.567576,-2.323678,-2.08455,-2.345776
25%,-0.727654,-0.7183,-0.496885,-0.524971,-0.57294,-0.591648,-0.522026,-0.811772,-0.596648,-0.64291,-0.805263,-1.191649
50%,-0.135421,0.117632,-0.391973,-0.263923,0.062625,-0.249068,-0.354452,-0.156896,-0.215892,-0.071221,-0.037691,-0.037522
75%,0.604869,0.627996,0.106359,0.201867,0.640411,0.238452,0.011428,0.478508,0.431393,0.591939,0.474024,1.116605
max,2.526884,2.792831,6.663359,7.291089,3.15378,6.642081,6.327728,7.023169,9.188781,3.404652,5.84703,1.116605
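
The test-set means and standard deviations above are near 0 and 1 rather than exactly 0 and 1 because the statistics come from the training data. A minimal NumPy sketch of the same idea, on synthetic values (the arrays below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "train" and "test" samples drawn from the same distribution
train_values = rng.normal(loc=50.0, scale=10.0, size=1000)
test_values = rng.normal(loc=50.0, scale=10.0, size=500)

# Statistics are computed on the training data only
train_mean = train_values.mean()
train_std = train_values.std()  # population std (ddof=0), which StandardScaler also uses

# z = (x - mean) / std, applied to both sets with the training statistics
train_scaled = (train_values - train_mean) / train_std
test_scaled = (test_values - train_mean) / train_std

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean/std: ", test_scaled.mean(), test_scaled.std())
```

The training column lands on mean 0 and std 1 up to floating-point error, while the test column is only close to those values.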


## Categorical Attributes/Columns 
Problem: The DataFrame info method shows that there are six categorical feature columns (Drug, Sex, Ascites, Hepatomegaly, Spiders, Edema), which we need to convert from text into numerical values.

Solution: The main approaches for categorical attribute handling are 
1. Ordinal Encoding - Useful when the categories correspond to an ascending or descending order. 
2. One-Hot Encoding - For each categorical column, convert it into multiple columns, one for each possible category. This is used when the categories do not have an obvious logical order. 
3. Numerical Feature Replacement (Advanced) - In cases where the number of categories is very large (hundreds or thousands), one should consider replacing the categorical column with a numerical column that maps each category to a meaningful number. For example, one could convert a country code into the country's population.
4. Embedding Replacement (Advanced) - Alternatively, one can replace categories with embeddings, which are low dimensional vectors that represent the category. 

In this case, we use a one-hot encoding since none of the categories seem to have a logical order and the number of categories is low (under 10 for all categorical columns).
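
One thing to watch with one-hot encoding is that the train and test frames can end up with different columns if a category is absent from one of them. A minimal sketch with pandas, using made-up frames (`train_toy`, `test_toy`, and the Edema values N, S, Y are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames: the test frame happens to contain no "S" or "Y" rows
train_toy = pd.DataFrame({"Edema": ["N", "S", "Y"], "Age": [50, 60, 70]})
test_toy = pd.DataFrame({"Edema": ["N", "N"], "Age": [55, 65]})

# One column per category; numeric columns pass through unchanged
train_encoded = pd.get_dummies(train_toy)

# Reindex the test frame to the training columns, filling absent categories with 0
test_encoded = pd.get_dummies(test_toy).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(list(train_encoded.columns))
print(test_encoded)
```

The reindex step is a no-op when both frames already contain every category, so it is a cheap safeguard.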

In [9]:
# Confirm that the number of categories in the categorical columns is manageable (< 10)
unique_values_per_column = train_df.nunique()

print(unique_values_per_column)

N_Days           461
Drug               2
Age              391
Sex                2
Ascites            2
Hepatomegaly       2
Spiders            2
Edema              3
Bilirubin        111
Cholesterol      226
Albumin          160
Copper           171
Alk_Phos         364
SGOT             206
Tryglicerides    154
Platelets        227
Prothrombin       49
Stage              4
Status             3
dtype: int64


In [10]:
# Convert the categorical columns into one-hot encodings
status = train_df['Status']
train_df_dummies = pd.get_dummies(train_df.drop('Status', axis=1))
train_df = pd.concat([train_df_dummies, status], axis=1)
test_df = pd.get_dummies(test_df)

# Confirm the transformation was successful
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5271 entries, 0 to 5270
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   N_Days                5271 non-null   float64
 1   Age                   5271 non-null   float64
 2   Bilirubin             5271 non-null   float64
 3   Cholesterol           5271 non-null   float64
 4   Albumin               5271 non-null   float64
 5   Copper                5271 non-null   float64
 6   Alk_Phos              5271 non-null   float64
 7   SGOT                  5271 non-null   float64
 8   Tryglicerides         5271 non-null   float64
 9   Platelets             5271 non-null   float64
 10  Prothrombin           5271 non-null   float64
 11  Stage                 5271 non-null   float64
 12  Drug_D-penicillamine  5271 non-null   bool   
 13  Drug_Placebo          5271 non-null   bool   
 14  Sex_F                 5271 non-null   bool   
 15  Sex_M                

## Train/Test Split

Problem: We need to split the dataset into a training dataset and a testing/tuning dataset.

Solution: We wait until all dataset pre-processing is done (to avoid needing to pre-process the train/test datasets separately) and then split the dataset using sklearn's train_test_split function. 

In [11]:
from sklearn.model_selection import train_test_split

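# Split the labeled training data into a training set and a held-out test set
# (this held-out set is separate from the unlabeled competition data in test_df)
X = train_df.drop("Status", axis=1)
y = train_df["Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)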
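
The Status counts shown earlier are imbalanced (only 275 of 7905 rows are CL), so a purely random split can shift the class proportions in the held-out set. One option, not used in the cell above, is the `stratify` argument of train_test_split. A sketch on synthetic labels with a similar imbalance (the arrays and the `random_state` value are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and imbalanced labels (classes 0, 1, 2)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 630 + [1] * 30 + [2] * 340)

# stratify=y_toy keeps the class proportions roughly equal in both parts;
# random_state makes the split reproducible
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_counts = np.bincount(y_tr)
val_counts = np.bincount(y_val)
print("train class counts:", train_counts)
print("validation class counts:", val_counts)
```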

# Logistic Regression Baseline Model Training Tasks
1. Train the Logistic Regression model using sklearn
2. Evaluate the results 

In [12]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

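# Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict class probabilities for the competition test set (used for the submission)
y_pred = model.predict_proba(test_df)

print(y_pred)

# Evaluate the results on the held-out split. These metrics need class labels
# predicted for X_test, not the probabilities in y_pred (which are for test_df).
#y_test_pred = model.predict(X_test)
#print("Accuracy:", accuracy_score(y_test, y_test_pred))
#print("Classification Report:\n", classification_report(y_test, y_test_pred))
#print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))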


[[0.8355211  0.01168076 0.15279814]
 [0.8726466  0.03659344 0.09075996]
 [0.05280175 0.02772749 0.91947076]
 ...
 [0.84435891 0.01785279 0.1377883 ]
 [0.94407197 0.03246214 0.02346589]
 [0.40346918 0.00844295 0.58808786]]
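
The probabilities above are for the unlabeled competition data, so they cannot be scored locally. Here is a sketch of how a held-out split could be evaluated, using synthetic data from make_classification rather than the cirrhosis features. Using log loss next to accuracy is an assumption on our part; it is a common choice when the predictions are probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for the real data
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Accuracy needs hard class labels; log loss needs class probabilities
val_labels = clf.predict(X_val)
val_probs = clf.predict_proba(X_val)

val_accuracy = accuracy_score(y_val, val_labels)
val_log_loss = log_loss(y_val, val_probs)

print("Accuracy:", val_accuracy)
print("Log loss:", val_log_loss)
```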


In [13]:
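# Modify the probability predictions into the submission format.
# predict_proba orders its columns by model.classes_ (0, 1, 2), which the
# LabelEncoder assigned to the alphabetically sorted labels C, CL, D.
submission_df = pd.DataFrame(y_pred, columns=['Status_C', 'Status_CL', 'Status_D'])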
final_submission_df = pd.concat([test_id_df, submission_df], axis=1)
final_submission_df.head(10)

# Create a submission.csv file that Kaggle will automatically evaluate for submission
final_submission_df.to_csv('submission.csv', index = False)
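
Before uploading, a quick sanity check on the submission frame can catch column or probability mistakes. The sketch below runs on a made-up three-row submission (the ids and probabilities are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical ids and class probabilities
toy_ids = pd.Series([101, 102, 103], name="id")
toy_probs = np.array([
    [0.80, 0.05, 0.15],
    [0.10, 0.10, 0.80],
    [0.40, 0.01, 0.59],
])
prob_cols = ["Status_C", "Status_CL", "Status_D"]

toy_submission = pd.concat(
    [toy_ids, pd.DataFrame(toy_probs, columns=prob_cols)], axis=1
)

# Each row of probabilities should sum to 1, and no cell should be missing
rows_sum_to_one = bool(np.allclose(toy_submission[prob_cols].sum(axis=1), 1.0))
has_missing = bool(toy_submission.isna().any().any())

print(list(toy_submission.columns))
print("rows sum to one:", rows_sum_to_one, "| has missing:", has_missing)
```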