# Introduction 

Cirrhosis results from prolonged liver damage, leading to extensive scarring, often due to conditions like hepatitis or chronic alcohol consumption. The [data](https://archive.ics.uci.edu/dataset/878/cirrhosis+patient+survival+prediction+dataset-1) provided is sourced from a Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver carried out from 1974 to 1984.

Between 1974 and 1984, 424 patients with PBC referred to the Mayo Clinic qualified for the randomized placebo-controlled trial of the drug _D-penicillamine_. Of these, the first 312 patients took part in the trial and have mostly complete data. The remaining 112 patients did not join the clinical trial but agreed to have basic measurements recorded and to be followed for survival. Six of these patients were lost to follow-up soon after their diagnosis, leaving data for 106 of these individuals in addition to the 312 who were part of the randomized trial.

# Problem Statement
- Which factors influence mortality in patients with cirrhosis?
- Which model predicts mortality in patients with cirrhosis most accurately?

# Data Collection
- [Source of dataset](https://archive.ics.uci.edu/dataset/878/cirrhosis+patient+survival+prediction+dataset-1)
- [Research Publication](https://pubmed.ncbi.nlm.nih.gov/2737595/)

**Variable Information**
- `ID`: unique identifier
- `N_Days`: number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986
- `Status`: status of the patient C (censored), CL (censored due to liver tx), or D (death)
- `Drug`: type of drug, D-penicillamine or placebo
- `Age`: age in [days]
- `Sex`: M (male) or F (female)
- `Ascites`: presence of ascites N (No) or Y (Yes)
- `Hepatomegaly`: presence of hepatomegaly N (No) or Y (Yes)
- `Spiders`: presence of spiders N (No) or Y (Yes)
- `Edema`: 
  - `N`: no edema and no diuretic therapy for edema
  - `S`: edema present without diuretics or edema resolved by diuretics
  - `Y`: edema despite diuretic therapy
- `Bilirubin`: serum bilirubin in [mg/dl]
- `Cholesterol`: serum cholesterol in [mg/dl]
- `Albumin`: albumin in [gm/dl]
- `Copper`: urine copper in [ug/day]
- `Alk_Phos`: alkaline phosphatase in [U/liter]
- `SGOT`: Serum Glutamic Oxaloacetic Transaminase, which is also known as aspartate aminotransferase (AST) in [U/ml]
- `Triglycerides`: triglycerides in [mg/dl]
- `Platelets`: platelets per cubic [ml/1000]
- `Prothrombin`: prothrombin time in seconds [s]
- `Stage`: histologic stage of disease (1, 2, 3, or 4)

**Import Modules**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import chi2_contingency
from statsmodels.graphics.mosaicplot import mosaic
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.model_selection import GridSearchCV

**Dataset Extraction**

In [None]:
df = pd.read_csv('/kaggle/input/cirrhosis-prediction-dataset/cirrhosis.csv')
df.head()

# Data Cleaning and Preprocessing

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
missing_columns = ['Drug', 'Ascites', 'Hepatomegaly', 
                   'Spiders', 'Cholesterol', 'Copper', 
                   'Alk_Phos', 'SGOT', 'Tryglicerides', 
                   'Platelets', 'Prothrombin', 'Stage']

# Drop missing values from affected columns
df = df.dropna(subset=missing_columns)

df.info()

In [None]:
columns = ['Status', 'Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema', 'Stage']

for column in columns:
    print(f'The unique values of {column}: {df[column].unique()}')

In [None]:
# Change 'Age' from days to years
df['Age'] = df['Age'] / 365.25
df['Age'] = df['Age'].astype(int)

df.head()

In [None]:
# Convert `Stage` from float to integer, then to string so it is treated as a categorical label
df['Stage'] = df['Stage'].astype(int)
df['Stage'] = df['Stage'].astype(str)
print(df['Stage'].unique())

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Rename the columns
df.rename(columns={'Tryglicerides': 'Triglycerides', 'Alk_Phos': 'ALP', 'SGOT': 'AST'}, inplace=True) 
df.head()

In [None]:
# Check for duplicated rows
duplicated_rows = df[df.duplicated()]

# Group by the duplicated rows and calculate their sum
sum_of_duplicated_rows = duplicated_rows.groupby(duplicated_rows.columns.tolist()).size().reset_index(name='count')

# Display the sum of duplicated rows
print(sum_of_duplicated_rows)

In [None]:
df.describe()

# Features Engineering

## Feature Transformation
Continuous variables are binned into categorical variables. Each interval includes its lower bound and excludes its upper bound:
- `Age`: < 35: young adult, 35 to < 65: middle-aged adult, >= 65: elderly
- `Bilirubin`: < 1.2: normal, >= 1.2: high
- `Cholesterol`: < 200: normal, 200 to < 240: borderline high, >= 240: high
- `Albumin`: < 3.4: low, 3.4 to < 5.4: normal, >= 5.4: high
- `Copper`: < 70: low, 70 to < 140: normal, >= 140: high
- `ALP`: < 20: low, 20 to < 140: normal, >= 140: high
- `AST`: < 10: low, 10 to < 40: normal, >= 40: high
- `Triglycerides`: < 150: normal, 150 to < 200: borderline high, 200 to < 500: high, >= 500: very high
- `Platelets`: < 150: low, 150 to < 450: normal, >= 450: high
- `Prothrombin`: < 11: shortened, 11 to < 13: normal, >= 13: prolonged
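
To make the boundary behaviour concrete, here is a minimal sketch of how `pd.cut` with `right=False` treats values that sit exactly on a bin edge. The ages are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([34, 35, 64, 65], name="Age")

bins = [0, 35, 65, float("inf")]
labels = ["Young Adult", "Middle-Aged Adult", "Elderly"]

# right=False makes each bin left-closed: [0, 35), [35, 65), [65, inf)
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "Group": age_group}))
```

With left-closed bins, a 35-year-old falls into the middle-aged group and a 65-year-old into the elderly group.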

In [None]:
# Age
bins = [0, 35, 65, float('inf')]  
labels = ['Young Adult', 'Middle-Aged Adult', 'Elderly']
df['Age'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Bilirubin
bins = [0, 1.2, float('inf')]  
labels = ['normal', 'high']
df['Bilirubin'] = pd.cut(df['Bilirubin'], bins=bins, labels=labels, right=False)

# Cholesterol
bins = [0, 200, 240, float('inf')]
labels = ['normal', 'borderline high', 'high' ]
df['Cholesterol'] = pd.cut(df['Cholesterol'], bins=bins, labels=labels, right=False)

# Albumin
bins = [0, 3.4, 5.4, float('inf')]
labels = ['low', 'normal', 'high' ]
df['Albumin'] = pd.cut(df['Albumin'], bins=bins, labels=labels, right=False)

# Copper
bins = [0, 70, 140, float('inf')]
labels = ['low', 'normal', 'high' ]
df['Copper'] = pd.cut(df['Copper'], bins=bins, labels=labels, right=False)

# ALP
bins = [0, 20, 140, float('inf')]
labels = ['low', 'normal', 'high' ]
df['ALP'] = pd.cut(df['ALP'], bins=bins, labels=labels, right=False)

# AST
bins = [0, 10, 40, float('inf')]
labels = ['low', 'normal', 'high' ]
df['AST'] = pd.cut(df['AST'], bins=bins, labels=labels, right=False)

# Triglycerides
bins = [0, 150, 200, 500, float('inf')]
labels = ['normal', 'borderline high', 'high', 'very high']
df['Triglycerides'] = pd.cut(df['Triglycerides'], bins=bins, labels=labels, right=False)

# Platelets
bins = [0, 150, 450, float('inf')]
labels = ['low', 'normal', 'high' ]
df['Platelets'] = pd.cut(df['Platelets'], bins=bins, labels=labels, right=False)

# Prothrombin
bins = [0, 11, 13, float('inf')]
labels = ['shortened', 'normal', 'prolonged' ]
df['Prothrombin'] = pd.cut(df['Prothrombin'], bins=bins, labels=labels, right=False)

# Display the DataFrame with the new categorical column
df.head()

# Exploratory Data Analysis 

### Number of Days

In [None]:
# Create a figure with two subplots (1 row, 2 columns)
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot histogram distribution of N_Days
axs[0].hist(df['N_Days'], density=True, alpha=0.5, color='blue', label='Histogram')
sns.kdeplot(df['N_Days'], color='red', ax=axs[0], label='KDE')

axs[0].set_title('Distribution of Number of Days')
axs[0].set_xlabel('Number of Days')
axs[0].set_ylabel('Density')
axs[0].legend()

# Define custom colors for the boxplot
box_color = 'lightblue'  # Color of the box
whisker_color = 'darkblue'  # Color of the whiskers
median_color = 'red'  # Color of the median line
flier_color = 'green'  # Color of the outliers

# Create a boxplot with custom colors
axs[1].boxplot(
    df['N_Days'],
    boxprops={'color': box_color},
    capprops={'color': whisker_color},
    medianprops={'color': median_color},
    flierprops={'markerfacecolor': flier_color, 'markeredgecolor': flier_color},
    patch_artist=True,  # Fill the box with color
)

axs[1].set_title('Boxplot of Number of Days')
axs[1].set_ylabel('Number of Days')

# Customize the x-axis and y-axis tick colors if needed
axs[1].tick_params(axis='x', colors='gray')
axs[1].tick_params(axis='y', colors='gray')

# Adjust the layout to prevent overlap
plt.tight_layout()

plt.show()

mean_days = np.mean(df['N_Days'])
median_days = np.median(df['N_Days'])
min_days = np.min(df['N_Days'])
max_days = np.max(df['N_Days'])

print("Mean number of days:", mean_days)
print("Median number of days:", median_days)
print("Minimum number of days:", min_days)
print("Maximum number of days:", max_days)

- The number of days appears roughly normally distributed
- The boxplot shows no outliers and there is no marked skewness
- The average number of days is around 1800 (_about 5 years_)
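
The visual impression of normality can be backed by a few numbers. The sketch below computes the sample skewness, a Shapiro-Wilk test, and the outlier count under the 1.5 x IQR rule that a default boxplot uses. The array `n_days` is a synthetic stand-in for `df['N_Days']` (an assumption for illustration), so the printed values say nothing about the study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for df['N_Days']; not the study data
n_days = rng.normal(loc=1800, scale=600, size=300)

# Skewness near 0 suggests a symmetric distribution
skewness = stats.skew(n_days)

# Shapiro-Wilk: a small p-value is evidence against normality
shapiro_stat, shapiro_p = stats.shapiro(n_days)

# Outliers according to the 1.5 * IQR rule
q1, q3 = np.percentile(n_days, [25, 75])
iqr = q3 - q1
outlier_mask = (n_days < q1 - 1.5 * iqr) | (n_days > q3 + 1.5 * iqr)
n_outliers = int(outlier_mask.sum())

print(f"Skewness: {skewness:.2f}")
print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Outliers (1.5 * IQR rule): {n_outliers}")
```

In the notebook, replacing `n_days` with `df['N_Days']` would run the same checks on the actual column.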

### Status

In [None]:
# Create a count plot
sns.countplot(data=df, x='Status', palette='Set2')
plt.title('Distribution of Status')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

# Calculate death rate of patients
death = len(df[df.Status == 'D'])
death_rate = round((death / len(df))*100, 2)
print("Death Rate: "+ str(death_rate) + "%.")

- Most patients' records are censored.
- The death rate among patients during the clinical trial is 40%.

### Drug

In [None]:
# Calculate the value counts for the 'Drug'
drug_counts = df['Drug'].value_counts()

# Create a pie chart
plt.pie(drug_counts, labels=drug_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Drugs')
plt.show()

Placebo and D-penicillamine were given to equal proportions of patients.

### Age

In [None]:
# Calculate the value counts for 'Age'
age_counts = df['Age'].value_counts()

# Create a pie chart
plt.pie(age_counts, labels=age_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Age')
plt.show()

Most of the patients are middle-aged adults.

### Sex

In [None]:
# Calculate the value counts for 'Sex'
sex_counts = df['Sex'].value_counts()

# Create a pie chart
plt.pie(sex_counts, labels=sex_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Sex')
plt.show()

- Most of the patients are females.
- The distribution of genders is unbalanced.

## Hypothesis Testing

### Number of Days vs Status

In [None]:
# Extract data for different 'Status' categories
c_days = df['N_Days'][df['Status'] == 'C']
cl_days = df['N_Days'][df['Status'] == 'CL']
d_days = df['N_Days'][df['Status'] == 'D']

# Create a list of data to be plotted
data_to_plot = [c_days, cl_days, d_days]

# Create a figure and axis
fig, ax = plt.subplots()

# Create a box plot
boxplot = ax.boxplot(data_to_plot, labels=['C', 'CL', 'D'], patch_artist=True)

# Customize boxplot colors
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)

# Customize the color of the median line
for median in boxplot['medians']:
    median.set_color('black')

# Add labels and title
ax.set_xlabel('Status')
ax.set_ylabel('Number of Days')
ax.set_title('Box Plot of Number of Days by Status')

# Show the plot
plt.show()

# Perform ANOVA
f_statistic, p_value = stats.f_oneway(c_days, cl_days, d_days)
print(f'ANOVA F-statistic: {f_statistic:.2f}')
print(f'ANOVA p-value: {p_value:.4f}')

# Perform Tukey's test for pairwise comparisons
tukey_results = pairwise_tukeyhsd(df['N_Days'], df['Status'])
print(tukey_results)

The number of days for censored patients is significantly higher than for patients censored due to liver transplant and for patients who died.
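
One-way ANOVA assumes roughly normal groups with similar variances. As a cross-check, a rank-based Kruskal-Wallis test can be run alongside it. The sketch below uses three synthetic groups with made-up means and sizes (assumptions for illustration, not the study data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic follow-up times in days for three hypothetical status groups
group_c = rng.normal(2300, 500, size=100)
group_cl = rng.normal(1500, 500, size=20)
group_d = rng.normal(1400, 500, size=100)

# Parametric test: compares group means
f_statistic, anova_p = stats.f_oneway(group_c, group_cl, group_d)

# Non-parametric test: compares rank distributions, no normality assumption
h_statistic, kruskal_p = stats.kruskal(group_c, group_cl, group_d)

print(f"ANOVA: F = {f_statistic:.2f}, p = {anova_p:.4g}")
print(f"Kruskal-Wallis: H = {h_statistic:.2f}, p = {kruskal_p:.4g}")
```

When both tests agree, the conclusion is less likely to depend on the ANOVA assumptions.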

### Number of Days vs Drug

In [None]:
# Extract data for different drug categories
placebo_days = df['N_Days'][df['Drug'] == 'Placebo']
penicilline_days = df['N_Days'][df['Drug'] == 'D-penicillamine']

# Create a list of data to be plotted
data_to_plot = [placebo_days, penicilline_days]

# Create a figure and axis
fig, ax = plt.subplots()

# Create a box plot
boxplot = ax.boxplot(data_to_plot, labels=['Placebo', 'D-penicillamine'], patch_artist=True)

# Customize boxplot colors
colors = ['lightblue', 'lightgreen', 'lightcoral']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)

# Customize the color of the median line
for median in boxplot['medians']:
    median.set_color('black')

# Add labels and title
ax.set_xlabel('Drug')
ax.set_ylabel('Number of Days')
ax.set_title('Box Plot of Number of Days by Drug')

# Show the plot
plt.show()

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(placebo_days, penicilline_days)

# Print the results
print(f'P-value: {p_value:.4f}')

# Determine the significance level (alpha)
alpha = 0.05

# Check if the p-value is less than the significance level
if p_value < alpha:
    print("Reject the null hypothesis: The number of days differs significantly between the drug groups.")
else:
    print("Fail to reject the null hypothesis: The number of days does not differ significantly between the drug groups.")
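
A two-sample comparison can also be run without assuming equal variances (Welch's t-test) and reported with an effect size. The sketch below uses synthetic groups with made-up sizes and spread, not the trial data. The Cohen's d shown averages the two sample variances, which is a reasonable simplification when the group sizes are similar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic follow-up times in days for two hypothetical treatment groups
placebo = rng.normal(2000, 1100, size=140)
treated = rng.normal(2000, 1100, size=136)

# Welch's t-test: does not assume equal variances
t_stat, welch_p = stats.ttest_ind(placebo, treated, equal_var=False)

# Cohen's d using the average of the two sample variances
pooled_sd = np.sqrt((placebo.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (placebo.mean() - treated.mean()) / pooled_sd

print(f"Welch t = {t_stat:.2f}, p = {welch_p:.4f}")
print(f"Cohen's d = {cohens_d:.2f}")
```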

### Number of Days vs Stages of Disease

In [None]:
# Extract data for different stage categories
stage1_days = df['N_Days'][df['Stage'] == '1']
stage2_days = df['N_Days'][df['Stage'] == '2']
stage3_days = df['N_Days'][df['Stage'] == '3']
stage4_days = df['N_Days'][df['Stage'] == '4']

# Create a list of data to be plotted
data_to_plot = [stage1_days, stage2_days, stage3_days, stage4_days]

# Create a figure and axis
fig, ax = plt.subplots()

# Create a box plot
boxplot = ax.boxplot(data_to_plot, labels=['Stage 1', 'Stage 2', 'Stage 3', 'Stage 4'], patch_artist=True)

# Customize boxplot colors
colors = ['lightblue', 'lightgreen', 'lightcoral', 'lightyellow']
for patch, color in zip(boxplot['boxes'], colors):
    patch.set_facecolor(color)

# Customize the color of the median line
for median in boxplot['medians']:
    median.set_color('black')

# Add labels and title
ax.set_xlabel('Stage')
ax.set_ylabel('Number of Days')
ax.set_title('Box Plot of Number of Days by Stage')

# Show the plot
plt.show()

# Perform ANOVA
f_statistic, p_value = stats.f_oneway(stage1_days, stage2_days, stage3_days, stage4_days)
print(f'ANOVA F-statistic: {f_statistic:.2f}')
print(f'ANOVA p-value: {p_value:.4f}')

# Perform Tukey's test for pairwise comparisons
tukey_results = pairwise_tukeyhsd(df['N_Days'], df['Stage'])
print(tukey_results)

Stage 4 patients have the shortest number of days during the clinical trial.

### Stage of Disease vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Stage'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Stage")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients in Stage 4 of the disease have a significantly higher risk of death compared to patients in other stages.
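
A chi-square p-value says whether an association exists, not how strong it is, and the test becomes unreliable when expected cell counts are small. The sketch below checks both points on a hypothetical Stage by Status table. The counts are invented for illustration and are not the notebook's cross-tabulation.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical Stage x Status counts (not the study data)
table = pd.DataFrame(
    {"C": [12, 40, 60, 30], "CL": [1, 3, 8, 6], "D": [2, 15, 35, 60]},
    index=["1", "2", "3", "4"],
)

chi2, pval, dof, expected = chi2_contingency(table)

# Rule of thumb: expected counts below 5 weaken the chi-square approximation
n_small_cells = int((expected < 5).sum())

# Cramer's V: effect size between 0 (no association) and 1 (perfect association)
n_total = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n_total * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {pval:.4g}")
print(f"Cells with expected count < 5: {n_small_cells}")
print(f"Cramer's V = {cramers_v:.2f}")
```

In this notebook the `CL` column is small, so a check of this kind would show whether the sparse cells affect any of the tests.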

### Drug vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Drug'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Drug")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

There is no significant association between the type of drug and patient status.

### Age vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Age'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Age")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Elderly patients with liver disease have a higher risk of death compared to other age groups.

### Ascites vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Ascites'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Ascites")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with ascites have a higher risk of death.

### Hepatomegaly vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Hepatomegaly'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Hepatomegaly")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with hepatomegaly have a higher risk of death.

### Spiders vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Spiders'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Spiders")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with spiders have a higher risk of death.

### Edema vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Edema'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Edema")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with edema have a higher risk of death.

### Bilirubin vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Bilirubin'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Bilirubin")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with high bilirubin levels have a higher risk of death.

### Cholesterol vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Cholesterol'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Cholesterol")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

There is no significant difference in cholesterol levels across patient statuses.

### Albumin vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Albumin'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Albumin")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with low albumin levels have a higher risk of death.

### Copper vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Copper'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Copper")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with high copper levels have a higher risk of death.

### Alkaline phosphatase vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['ALP'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("ALP")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

All patients with liver disease in this dataset have high ALP levels.

### AST vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['AST'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("AST")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

There is no significant difference in AST levels across patient statuses.

### Triglycerides vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Triglycerides'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Triglycerides")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with a high amount of cholesterol have a higher risk of death.

### Platelets vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Platelets'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Platelets")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

There is no significant difference in platelet counts across patient statuses.

### Prothrombin vs Status

In [None]:
# Create a cross-tabulation (contingency table)
Xtab = pd.crosstab(df['Prothrombin'], df['Status'])

# Plot a clustered bar chart
Xtab.plot(kind='bar')
plt.title("Clustered Bar Chart of Cross-Tabulation")
plt.xlabel("Prothrombin")
plt.ylabel("Count")
plt.legend(title="Status")
plt.show()

# Perform the chi-square test
chi2, pval, dof, expected = chi2_contingency(Xtab)

# Output the results
print("Cross-Tabulation:")
print(Xtab)
print("P-value:", pval)

Patients with normal or prolonged prothrombin time have a higher risk of death.

# Features Engineering

## Features Selection

- The features that are significantly associated with risk of death due to liver cirrhosis are:
  - Stage 4 liver disease
  - Elderly age
  - Ascites
  - Hepatomegaly
  - Spiders
  - Edema
  - High bilirubin
  - Low albumin
  - High copper
  - High cholesterol
  - Normal or prolonged prothrombin time

## Features Transformation
- One-hot encoding of all categorical variables produces the feature data, X
- `Status` variable will represent target data, y
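
As a small illustration of what one-hot encoding with `drop_first=True` produces, the sketch below encodes a three-row toy frame (made-up values, not the dataset). The first category of each column is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with a few of the notebook's categorical columns (values are made up)
toy = pd.DataFrame({
    "Ascites": ["N", "Y", "N"],
    "Edema": ["N", "S", "Y"],
    "Status": ["C", "D", "CL"],
})

# drop_first=True removes the first category of each column (N, N and C here)
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Note that a target built from `Status_D` alone is binary: it marks deaths, while both `C` and `CL` patients end up in the negative class.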

In [None]:
# Extract the specified columns into a new DataFrame
columns = df[['Stage', 'Age', 'Ascites', 'Hepatomegaly', 
              'Spiders', 'Edema', 'Bilirubin', 'Albumin', 
              'Copper', 'Cholesterol', 'Prothrombin', 'Status']]

# Perform one-hot encoding for the other columns
df = pd.get_dummies(columns, drop_first=True)

df.head()

In [None]:
df.info()

In [None]:
# Calculate the correlation matrix
correlation_matrix = df.corr()

# Create a heatmap
plt.figure(figsize=(10, 8))  # Adjust the figure size as needed
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")

# Display the heatmap
plt.title("Correlation Heatmap")
plt.show()

Features that contribute to the high risk of death due to liver cirrhosis:
- Stage 4 liver disease
- Elderly age
- Presence of ascites
- Presence of hepatomegaly
- Presence of spiders
- Presence of edema
- High level bilirubin
- High level of copper
- Normal prothrombin time

# Predictive Model

## Data splitting

In [None]:
# Define X (features) and y (target)
X = df[['Stage_4', 'Age_Elderly', 'Ascites_Y', 
        'Hepatomegaly_Y', 'Spiders_Y', 'Edema_Y', 'Bilirubin_high', 'Copper_high', 'Prothrombin_normal']]
y = df['Status_D']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
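
With an imbalanced target, a plain random split can leave the test set with a different share of deaths than the training set. A hedged alternative is to pass `stratify=y` to `train_test_split`. The sketch below shows the effect on synthetic labels with a 40% positive rate, a figure chosen to mirror the death rate reported earlier. The arrays `X_demo` and `y_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 40% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```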

## Logistic Regression

In [None]:
# Create a Logistic Regression classifier
lr = LogisticRegression()

# Train the classifier on the training data
lr.fit(X_train, y_train)

# Make predictions on the test data
y_pred = lr.predict(X_test)

# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print the evaluation results
print(f"Accuracy: {accuracy:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)
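
Accuracy alone can hide how well deaths are identified. The imports already include `roc_auc_score`, so a threshold-independent view is one call away. The sketch below fits a logistic regression on synthetic data from `make_classification` (an assumption for illustration, not the cirrhosis features) and reports ROC AUC together with recall for the positive class.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with 9 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC uses predicted probabilities rather than hard class labels
proba = model.predict_proba(X_te)[:, 1]
auc_score = roc_auc_score(y_te, proba)

# Recall for the positive class: share of true positives that were detected
recall_pos = recall_score(y_te, model.predict(X_te))

print(f"ROC AUC: {auc_score:.2f}")
print(f"Recall (positive class): {recall_pos:.2f}")
```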

## Hyperparameter Tuning

In [None]:
# Define a grid of hyperparameters to search
param_grid = {
    'penalty': ['l1', 'l2'],  # Regularization penalty
    'C': [0.001, 0.01, 0.1, 1, 10],  # Inverse of regularization strength
    'solver': ['liblinear'],  # Solver for L1 penalty
}

# Create a grid search cross-validation object
grid_search = GridSearchCV(estimator=lr, param_grid=param_grid, cv=5)

# Fit the grid search on the training data only, so the test set stays unseen
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found
print("Best Hyperparameters: ", grid_search.best_params_)

# Print the best cross-validation score
print("Best Cross-Validation Score: {:.2f}".format(grid_search.best_score_))

# With the default refit=True, best_estimator_ is already refitted on the training data
best_logistic_regression = grid_search.best_estimator_

# Make predictions on your test data
y_pred = best_logistic_regression.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion Matrix:\n", confusion)
print("Classification Report:\n", classification_rep)

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# Create a Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Train the classifier on the training data
gb_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = gb_classifier.predict(X_test)

# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_predictions)
print("Random Forest Accuracy: {:.2f}".format(rf_accuracy))

# AdaBoost Classifier
adaboost_classifier = AdaBoostClassifier(n_estimators=50, random_state=42)
adaboost_classifier.fit(X_train, y_train)
adaboost_predictions = adaboost_classifier.predict(X_test)
adaboost_accuracy = accuracy_score(y_test, adaboost_predictions)
print("AdaBoost Accuracy: {:.2f}".format(adaboost_accuracy))

## Stacking Method

In [None]:
from sklearn.model_selection import cross_val_predict

# Define base models
base_model_1 = RandomForestClassifier(n_estimators=100, random_state=42)
base_model_2 = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Build training meta-features from out-of-fold predictions, so the meta-model
# is never trained on predictions for data the base models were fitted on
oof_1 = cross_val_predict(base_model_1, X_train, y_train, cv=5)
oof_2 = cross_val_predict(base_model_2, X_train, y_train, cv=5)
stacked_X_train = np.column_stack((oof_1, oof_2))

# Train base models on the full training set
base_model_1.fit(X_train, y_train)
base_model_2.fit(X_train, y_train)

# Create the test feature matrix using predictions from base models
predictions_1 = base_model_1.predict(X_test)
predictions_2 = base_model_2.predict(X_test)
stacked_X_test = np.column_stack((predictions_1, predictions_2))

# Train a meta-model (e.g., Logistic Regression) on the training meta-features only
meta_model = LogisticRegression()
meta_model.fit(stacked_X_train, y_train)

# Make predictions using the stacked model on the held-out test set
stacked_predictions = meta_model.predict(stacked_X_test)

# Evaluate the performance of the stacked model
stacked_accuracy = accuracy_score(y_test, stacked_predictions)
print("Stacked Model Accuracy: {:.2f}".format(stacked_accuracy))
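
As an alternative sketch, scikit-learn's `StackingClassifier` wraps the stacking pattern and builds the meta-model's training features from cross-validated predictions internally, so the test set is used only for evaluation. The data below is synthetic (from `make_classification`) and the smaller `n_estimators` is an assumption to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions feed the final estimator
)
stack.fit(X_tr, y_tr)

stack_accuracy = accuracy_score(y_te, stack.predict(X_te))
print(f"Stacked model accuracy: {stack_accuracy:.2f}")
```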

# Conclusion

## Univariate analysis
- Average follow-up is around 5 years (_about 1800 days_)
- Death rate is 40%
- Both placebo and D-penicillamine are used in equal proportion
- 85% of the patients are middle-aged adults (_35 - 65 years old_)
- 88% of the patients are females
- Neither placebo nor D-penicillamine improved the status of the patients

## Hypothesis testing
The features that are significantly associated with mortality due to liver cirrhosis are:
- Stage 4 liver disease
- Elderly age
- Ascites
- Hepatomegaly
- Spiders
- Edema
- High bilirubin
- Low albumin
- High copper
- High cholesterol
- Normal or prolonged prothrombin time

## Predictive Modelling
Logistic regression with hyperparameter tuning has the highest accuracy score (80%) among the models compared.
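
A ranking based on a single 80/20 split can change with the random seed. A hedged way to compare models more robustly is k-fold cross-validation, which reports a mean and spread per model. The sketch below uses synthetic data and default-like settings (assumptions for illustration), so its scores do not reproduce the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the encoded features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Same stratified folds for every model, so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.2f} +/- {scores.std():.2f}")
```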

# Future Work
- Almost 90% of the participants in this study were female, so the sample does not represent the actual population of patients with liver disease. It is therefore recommended to repeat the trial with equal numbers of male and female participants.
- The eventual outcome of patients whose records are censored is unknown. Collecting further data on these patients would help explore other possible factors that may contribute to this research.
- Additional data such as medical imaging (_e.g. MRI, CT scan, ultrasound_), concurrent illnesses (_e.g. cholelithiasis (gallstones), portal hypertension_) and laboratory investigations may be useful to improve the accuracy of prediction.