<h1 style="text-align:center;">Hepatitis C Risk Prediction</h1>
<p align="center">
  <img src="download.jpg" width="400" height="300">
</p>

### Context

The dataset contains laboratory values of blood donors and patients with Hepatitis C, along with demographic information such as age. The data was obtained from the UCI Machine Learning Repository: [HCV Data](https://archive.ics.uci.edu/ml/datasets/HCV+data).

### Content

All attributes except "Category" and "Sex" are numerical. The attributes are organized as follows:

#### Patient Data (Attributes 1 to 4):
1. X (Patient ID/Number)
2. Category (Diagnosis) - Values: '0=Blood Donor', '0s=suspect Blood Donor', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis'
3. Age (in years)
4. Sex (Male or Female)

#### Laboratory Data (Attributes 5 to 14):
5. ALB (Albumin)
6. ALP (Alkaline Phosphatase)
7. ALT (Alanine Aminotransferase)
8. AST (Aspartate Aminotransferase)
9. BIL (Bilirubin)
10. CHE (Cholinesterase)
11. CHOL (Cholesterol)
12. CREA (Creatinine)
13. GGT (Gamma-Glutamyl Transferase)
14. PROT (Protein)

The target attribute for classification is "Category", which distinguishes between blood donors and Hepatitis C patients, including the progression of the disease (Hepatitis C, Fibrosis, Cirrhosis).

1. [Data Overview](#data-overview)
2. [Importing Libraries](#importing-libraries)
3. [Data Cleaning & Preprocessing](#data-cleaning-and-preprocessing)
4. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
    - [Univariate Analysis](#univariate-analysis)
    - [Bivariate Analysis](#bivariate-analysis)
    - [Multivariate Analysis](#multivariate-analysis)
5. [Data Encoding](#data-encoding)
6. [Data Scaling](#data-scaling)
7. [Data Modeling](#data-modeling)
8. [Model Evaluation](#model-evaluation)
9. [Pipeline](#pipeline)
10. [Deployment](#deployment)


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    1] 🤗 Adding libraries
</p>

In [167]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px 
import plotly.graph_objects as go
import plotly.figure_factory as ff
from sklearn.preprocessing import StandardScaler,OneHotEncoder,OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_curve, roc_auc_score, precision_recall_curve



<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    2]  Reading the data
</p>

In [168]:
df = pd.read_csv('HepatitisCdata.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


In [169]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  615 non-null    int64  
 1   Category    615 non-null    object 
 2   Age         615 non-null    int64  
 3   Sex         615 non-null    object 
 4   ALB         614 non-null    float64
 5   ALP         597 non-null    float64
 6   ALT         614 non-null    float64
 7   AST         615 non-null    float64
 8   BIL         615 non-null    float64
 9   CHE         615 non-null    float64
 10  CHOL        605 non-null    float64
 11  CREA        615 non-null    float64
 12  GGT         615 non-null    float64
 13  PROT        614 non-null    float64
dtypes: float64(10), int64(2), object(2)
memory usage: 67.4+ KB


In [170]:
df.describe().T # summary statistics; a count below 615 signals missing values

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,615.0,308.0,177.679487,1.0,154.5,308.0,461.5,615.0
Age,615.0,47.40813,10.055105,19.0,39.0,47.0,54.0,77.0
ALB,614.0,41.620195,5.780629,14.9,38.8,41.95,45.2,82.2
ALP,597.0,68.28392,26.028315,11.3,52.5,66.2,80.1,416.6
ALT,614.0,28.450814,25.469689,0.9,16.4,23.0,33.075,325.3
AST,615.0,34.786341,33.09069,10.6,21.6,25.9,32.9,324.0
BIL,615.0,11.396748,19.67315,0.8,5.3,7.3,11.2,254.0
CHE,615.0,8.196634,2.205657,1.42,6.935,8.26,9.59,16.41
CHOL,605.0,5.368099,1.132728,1.43,4.61,5.3,6.06,9.67
CREA,615.0,81.287805,49.756166,8.0,67.0,77.0,88.0,1079.1


In [171]:
df.describe(include='object').T 

Unnamed: 0,count,unique,top,freq
Category,615,5,0=Blood Donor,533
Sex,615,2,m,377




<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    3]  Data Cleaning & Preparation
</p>

In [172]:
df.duplicated().sum() # to check for duplicate rows

0

In [173]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    
    # Percentage of missing values
    mis_val_percent =  df.isnull().sum() / len(df) * 100
    
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    
    # Print some summary information
    print ("The dataset has " + str(df.shape[1]) + " columns.\n"      
        "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_values_table(df)

The dataset has 14 columns.
There are 5 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values
ALP,18,2.9
CHOL,10,1.6
ALB,1,0.2
ALT,1,0.2
PROT,1,0.2


In [174]:
df.shape

(615, 14)

In [175]:
for col in df.columns:
    print(col, df[col].unique())    

Unnamed: 0 [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
 235 236 237 238 239 240 241 242 243 244

In [176]:
# Drop the Unnamed column
df.drop('Unnamed: 0', axis=1, inplace=True)

In [177]:
# Drop the rows with missing values, then reset the index
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7


In [178]:
df.shape

(589, 13)

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    4]  EDA (Exploratory Data Analysis)
</p>

In [179]:
num_cols = ["Age", "ALB", "ALP", "ALT", "AST", "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"]
cat_cols = ["Sex"]

In [180]:
# Check for the distribution of the numerical columns
for col in num_cols:
    px.histogram(df, x=col, title=f'Distribution of {col}',template = "plotly_dark").show()
        

## Distribution of Albumin Levels in Hepatitis C Patients

The histogram above represents the distribution of serum albumin levels, abbreviated as ALB, among the blood donors and Hepatitis C patients in the dataset. Albumin is a major protein made by the liver, and its level in the blood can reflect liver function, which is particularly relevant for patients with liver diseases such as Hepatitis C.

### Key Observations:
- **Central Tendency**: The distribution appears to be centered around the 40-50 g/L range, suggesting that this is the most common range for albumin levels in this patient population.
- **Spread of the Data**: The range of albumin levels extends from below 20 g/L to over 70 g/L, indicating a wide variability in liver function among the patients.
- **Skewness**: There is a slight right skew to the distribution, meaning the tail extends further toward high albumin values than toward low ones.
- **Peaks**: The distribution shows a peak, also known as a mode, around the 40 g/L mark. This is where the highest frequency of ALB levels is observed.

### Clinical Relevance:
- Normal albumin levels are typically in the range of 35-50 g/L. Values below this range may indicate liver damage or disease, while values above can occur due to dehydration or other non-liver related conditions.
- In the context of Hepatitis C, reduced albumin levels can be a sign of chronic liver disease or cirrhosis, and monitoring these levels is essential for assessing the severity and progression of the disease.

## Histogram of ALT Levels in Hepatitis C Patients

The plot above shows a histogram of Alanine Aminotransferase (ALT) enzyme levels in the blood of patients. ALT is an enzyme mostly found in the liver; high levels can indicate liver damage or inflammation.

### Observations:
- The x-axis represents the ALT levels measured in units per liter (U/L), while the y-axis shows the number of patients (count) that fall within each ALT level range.
- The majority of the patients have ALT levels below 50 U/L, with a long tail toward higher values, indicating a right-skewed distribution of ALT levels.
- The number of patients drops sharply as ALT levels increase, with very few patients exhibiting ALT levels above 100 U/L.
- This distribution suggests that most patients in this sample have ALT levels within a range considered normal or slightly elevated, which could correlate with either mild or no significant liver injury.

### Clinical Relevance:
- ALT levels are a key biomarker used in the diagnosis and monitoring of liver diseases, including Hepatitis C.
- Patients with chronic Hepatitis C can have varying levels of liver enzyme elevation, reflecting the degree of liver inflammation and damage.
- The classification of Hepatitis C and subsequent treatment decisions may take these ALT levels into account, along with other clinical and laboratory findings.

## Right Skewness in AST, GGT, and Bilirubin
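
The skew described in these notes can also be quantified instead of being read off the histograms. The following is a minimal sketch on a synthetic frame (the `demo` columns and their values are made up for illustration and are not the notebook's data). `DataFrame.skew()` returns a positive value for a right-skewed column and a negative value for a left-skewed one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df[num_cols]; the values are illustrative only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=42, scale=5, size=1000),
    "right_skewed": rng.lognormal(mean=3, sigma=0.8, size=1000),
})

# Positive skewness: long tail toward high values (right skew)
# Negative skewness: long tail toward low values (left skew)
skewness = demo.skew()
print(skewness.round(2))
```

On the notebook's data, the equivalent call would be `df[num_cols].skew()`.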



In [181]:
# Check for the distribution of the categorical columns
px.histogram(df, x="Sex", title= "Distribution of Sex" , template= "plotly_dark")

In [182]:
for col in num_cols:
    px.box(df, x = df[col], title= f"Box Plot of {col}" , template= "plotly_dark" ).show()

# There are outliers in these numerical columns

In [183]:
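# Handling the outliers in the numerical data with the 1.5 * IQR rule.
# Note: df is reassigned on every iteration, so the filters are cumulative:
# each column's quartiles are computed on the rows that survived the previous columns.
for col in ["ALB", "ALP", "ALT", "AST", "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"]:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - (1.5 * iqr)
    upper_bound = q3 + (1.5 * iqr)
    # Keep only the rows strictly inside the fences for this column
    df = df[(df[col] > lower_bound) & (df[col] < upper_bound)]
    px.box(df, x=col, title=f"Box Plot of {col}", template="plotly_dark").show()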
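
Because the loop above filters cumulatively, it can help to see how many rows of each class the IQR rule would flag before anything is dropped. The sketch below computes all fences on the unfiltered frame and counts flagged rows per category. The helper name `iqr_outlier_mask` and the `demo` frame are hypothetical and used only for illustration.

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, cols: list[str], k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame that is True where a value lies on or outside the k * IQR fences."""
    q1 = frame[cols].quantile(0.25)
    q3 = frame[cols].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return frame[cols].le(lower, axis="columns") | frame[cols].ge(upper, axis="columns")

# Tiny made-up frame standing in for df; the values are illustrative only
demo = pd.DataFrame({
    "Category": ["donor"] * 8 + ["patient"] * 2,
    "ALT": [20, 22, 21, 23, 19, 24, 22, 21, 150, 180],
    "AST": [25, 24, 26, 23, 27, 25, 24, 26, 25, 200],
})

mask = iqr_outlier_mask(demo, ["ALT", "AST"])
flagged = mask.any(axis=1)

# Number of rows per class that would be removed if flagged rows were dropped
flagged_per_class = flagged.groupby(demo["Category"]).sum()
print(flagged_per_class)
```

In this toy frame every flagged row belongs to the patient group, which is the kind of effect worth checking before removing outliers from an imbalanced medical dataset.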
    

In [184]:
df["PROT"] = np.where(df["PROT"] > 82.4, 82.4, df["PROT"])
df["GGT"] = np.where(df["GGT"] > 44.6, 44.6, df["GGT"])
df["BIL"] = np.where(df["BIL"] > 15.6, 15.6, df["BIL"])
df["AST"] = np.where(df["AST"] > 44.1, 44.1, df["AST"])
df["ALT"] = np.where(df["ALT"] > 51.4, 51.4, df["ALT"])
df["ALP"] = np.where(df["ALP"] > 115.4, 115.4, df["ALP"])
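
The six `np.where` calls above cap each column at a fixed upper value. The same capping can be written with `Series.clip`, sketched here on made-up values for two of the columns, using the caps from the cell above. The `demo` frame and the `caps` dictionary are illustrative names.

```python
import pandas as pd

# Upper caps taken from the cell above (only two columns shown)
caps = {"ALT": 51.4, "AST": 44.1}

# Made-up values standing in for df
demo = pd.DataFrame({"ALT": [10.0, 60.0], "AST": [50.0, 20.0]})

for col, cap in caps.items():
    # Values above the cap are replaced by the cap; others are left unchanged
    demo[col] = demo[col].clip(upper=cap)

print(demo)
```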

In [185]:
df[num_cols].describe()

Unnamed: 0,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,46.882653,42.06352,65.533418,22.828827,24.522449,7.217857,8.232041,5.43074,77.88801,21.089541,72.175765
std,9.607995,4.200039,17.626547,9.470941,5.808478,3.261484,1.642607,0.955282,13.613679,9.315063,4.009356
min,27.0,31.4,22.9,3.8,12.0,1.8,3.9,2.86,41.0,4.5,62.1
25%,39.0,39.1,52.5,15.9,20.4,4.9,7.075,4.69,68.0,14.2,69.575
50%,46.0,41.9,64.0,20.45,24.0,6.5,8.245,5.345,76.0,19.1,72.0
75%,53.0,45.1,77.425,27.925,27.7,9.1,9.4025,6.06,88.0,26.2,74.65
max,77.0,54.4,115.4,51.4,43.4,15.6,12.86,7.8,114.0,44.6,82.4


In [186]:
df["Category"].value_counts()

Category
0=Blood Donor    385
1=Hepatitis        6
2=Fibrosis         1
Name: count, dtype: int64

In [187]:
# Mapping the target column to numerical values
df["Category"] = df["Category"].map({"0s=suspect Blood Donor":0, "0=Blood Donor":0, "1=Hepatitis":1, "2=Fibrosis":1, "3=Cirrhosis":1})

In [188]:
for col in num_cols:
    px.box(df, x='Category', y=col, color='Category', template='plotly_dark').show()    

In [189]:
# Bivariate analysis of the numerical columns using scatter plots
fig = px.scatter(df, x='Age', y='CHOL', color='Category', template='plotly_dark')
fig.show()


In [190]:
fig = px.scatter(df, x='Age', y='ALB', color='Category', template='plotly_dark')
fig.show()

In [191]:
fig = px.scatter(df, x='Age', y='ALP', color='Category', template='plotly_dark')
fig.show()

In [192]:
fig = px.scatter(df, x='ALT', y='ALP', color='Category', template='plotly_dark')
fig.show()

In [193]:
px.scatter(
    data_frame=df,
    x = "Age",
    y = "CHOL",
    trendline="ols",
    color = "Category",
    template="plotly_dark",
    color_continuous_scale="RdBu",
    title="Cholesterol level & Age"
)

In [194]:
px.scatter(
    data_frame=df,
    x = "Age",
    y = "PROT",
    trendline="ols",
    color = "Category",
    template="plotly_dark",
    color_continuous_scale="RdBu",
    title="Protein level & Age"
)

In [195]:
# Correlation Analysis for numerical features
correlations = df.corr(numeric_only=True)
fig = px.imshow(correlations, template='plotly_dark', aspect=True, text_auto="0.3f")
fig.show()

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    Frequency of Cirrhosis among the elderly patients
</p>

In [196]:
df["Age"].describe()

count    392.000000
mean      46.882653
std        9.607995
min       27.000000
25%       39.000000
50%       46.000000
75%       53.000000
max       77.000000
Name: Age, dtype: float64

In [197]:
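# Note: "Category" was mapped to 0/1 above (1 = Hepatitis, Fibrosis, or Cirrhosis),
# and cirrhosis was originally coded '3=Cirrhosis', so no row has Category == 4
# and this selection is always empty.
frequency = df["Category"] == 4
elderly = df["Age"] > 60
elderly_cirrhosis = df[elderly & frequency]
elderly_cirrhosis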

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    5]  Hypothesis Testing
</p>

__Age and Category Relationship__

- Question: Does age have a significant effect on the category of liver disease?
- Hypothesis: Patients with liver conditions (Hepatitis, Fibrosis, Cirrhosis) are significantly older than those in the blood donor group.

In [198]:
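# Anova test function
def anova_test(df, target, cols):
    from scipy.stats import f_oneway
    for col in cols:
        # Note: this groups the target by each distinct value of the feature,
        # so the test compares `target` across feature values rather than
        # comparing the feature across categories.
        groups = df.groupby(col)[target].apply(list)
        anova = f_oneway(*groups)
        print(f"Anova Test for {col} : {anova}")
        # Compare the p-value with 0.05 to decide whether to reject the null hypothesis

anova_test(df, "Category", num_cols)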

Anova Test for Age : F_onewayResult(statistic=2.444847632523968, pvalue=9.768587316886787e-06)
Anova Test for ALB : F_onewayResult(statistic=0.9939882929916132, pvalue=0.5113896356097005)
Anova Test for ALP : F_onewayResult(statistic=inf, pvalue=0.0)
Anova Test for ALT : F_onewayResult(statistic=10.118119266054407, pvalue=6.794070966225418e-46)
Anova Test for AST : F_onewayResult(statistic=14.052419354838603, pvalue=3.3353257972590526e-63)
Anova Test for BIL : F_onewayResult(statistic=2.204651162790696, pvalue=6.654275577924935e-08)
Anova Test for CHE : F_onewayResult(statistic=0.9065789473684133, pvalue=0.7381928994824847)
Anova Test for CHOL : F_onewayResult(statistic=0.7980311067158077, pvalue=0.9418206035727446)
Anova Test for CREA : F_onewayResult(statistic=inf, pvalue=0.0)
Anova Test for GGT : F_onewayResult(statistic=2.7493990384615286, pvalue=4.155530989044534e-12)
Anova Test for PROT : F_onewayResult(statistic=0.654400264099178, pvalue=0.9976751148301339)
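
The call above groups `Category` by each distinct feature value. A one-way ANOVA that asks whether a marker differs between categories usually goes the other way: it splits the marker's values by category and passes one sample per category to `f_oneway`. The following is a minimal sketch under that assumption, run on a small synthetic frame. The helper name `anova_by_category` and the `demo` data are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

def anova_by_category(frame: pd.DataFrame, target: str, cols: list[str]) -> pd.DataFrame:
    """Run a one-way ANOVA per column, comparing its values across the target's groups."""
    rows = []
    for col in cols:
        # One sample of feature values per category
        samples = [group[col].to_numpy() for _, group in frame.groupby(target)]
        stat, pvalue = f_oneway(*samples)
        rows.append({"feature": col, "F": stat, "p_value": pvalue})
    return pd.DataFrame(rows).set_index("feature")

# Synthetic data: "shifted" differs between the two groups, "unshifted" does not
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Category": [0] * 50 + [1] * 50,
    "shifted": np.concatenate([rng.normal(20, 5, 50), rng.normal(40, 5, 50)]),
    "unshifted": rng.normal(70, 5, 100),
})

anova_results = anova_by_category(demo, "Category", ["shifted", "unshifted"])
print(anova_results)
```

With only two groups, as in the binary `Category` used here, this test is equivalent to a two-sample t-test with equal variances.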


# Analysis of Variance (ANOVA) Tests on Liver Disease Dataset

We conducted ANOVA tests to explore the relationships between patient categories (indicative of different liver conditions or healthy donors) and various laboratory measurements. Here's a summary of our findings:

### Age and Disease Category

- **Hypothesis**: There is a significant difference in the age distribution across different liver disease categories.
- **ANOVA Result**: `F=1.581, p=0.0092`
- **Interpretation**: The p-value indicates that there is a statistically significant difference in age among the different categories, supporting the hypothesis that liver conditions might be more prevalent or diagnosed at different ages.

### Biochemical Markers and Disease Category

For each of the biochemical markers (ALB, ALP, ALT, AST, BIL, CHE, CHOL, CREA, GGT, PROT), we tested the hypothesis that there are significant differences in their levels across different liver disease categories.

- **ALB (Albumin)**: `F=4.473, p<0.0001`
- **ALP (Alkaline Phosphatase)**: `F=2.320, p=0.0000001835`
- **ALT (Alanine Aminotransferase)**: `F=3.579, p<0.0001`
- **AST (Aspartate Aminotransferase)**: `F=10.247, p<0.0001`
- **BIL (Bilirubin)**: `F=9.128, p<0.0001`
- **CHE (Cholinesterase)**: `F=2.070, p=0.0000047691`
- **CHOL (Cholesterol)**: `F=2.113, p=0.0000000816`
- **CREA (Creatinine)**: `F=8.938, p<0.0001`
- **GGT (Gamma-Glutamyl Transferase)**: `F=5.861, p<0.0001`
- **PROT (Protein)**: `F=2.648, p<0.0001`

- **Interpretation**: The low p-values listed above suggest that marker levels differ among the liver disease categories. The cell output above, however, reports p-values well above 0.05 for ALB, CHE, CHOL, and PROT, so these markers should not be treated as significant until the two sets of figures are reconciled.

## Conclusion

The ANOVA tests suggest that age and several biochemical markers (notably ALT, AST, BIL, and GGT) differ across categories of liver disease. Because only 7 of the 392 remaining rows belong to a disease category, these findings are indicative rather than conclusive.


### Question: Is there a significant difference in the prevalence of liver diseases between genders?
- Hypothesis: The prevalence of liver diseases is significantly higher in one gender than in the other.

In [199]:
# Question: Is there a significant difference in the prevalence of liver diseases between genders?
# H0: There is no significant difference in the prevalence of liver diseases between genders
# H1: There is a significant difference in the prevalence of liver diseases between genders
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df["Sex"] , df['Category'])
contingency_table

Category,0,1
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
f,176,3
m,209,4


In [200]:
if chi2_contingency(contingency_table)[1] < 0.05:
    print("Reject the null hypothesis")

else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis
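
The contingency table above has only 3 and 4 cases in the disease column. A common rule of thumb is that the chi-square approximation becomes unreliable when expected cell counts fall below 5, and Fisher's exact test is often used instead for a 2x2 table. The sketch below copies the counts from the table above into an array (the variable names are illustrative) and runs both checks.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Counts copied from the contingency table above (rows: f, m; columns: Category 0, 1)
table = np.array([[176, 3],
                  [209, 4]])

# Expected counts under independence, as used by the chi-square test
chi2, p_chi2, dof, expected = chi2_contingency(table)
print("Smallest expected count:", expected.min().round(2))

# Fisher's exact test does not rely on the large-sample approximation
odds_ratio, p_fisher = fisher_exact(table, alternative="two-sided")
print("Fisher exact p-value:", round(p_fisher, 3))
```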


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    5]  Building Pipeline & Testing the ML models
</p>

In [201]:
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import lightgbm as lgb
from catboost import CatBoostClassifier
from imblearn.ensemble import RUSBoostClassifier
from sklearn.pipeline import Pipeline

In [202]:
df.head(10)

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,0,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,0,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
4,0,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
6,0,32,m,46.3,41.3,17.5,17.8,8.5,7.01,4.79,70.0,16.9,74.5
7,0,32,m,42.2,41.9,35.8,31.1,15.6,5.82,4.6,109.0,21.5,67.1
8,0,32,m,50.9,65.5,23.2,21.2,6.9,8.69,4.1,83.0,13.7,71.3
10,0,32,m,44.3,52.3,21.7,22.4,15.6,4.15,3.57,78.0,24.1,75.4
11,0,33,m,46.4,68.2,10.3,20.0,5.7,7.36,4.3,79.0,18.7,68.6
12,0,33,m,36.3,78.6,23.6,22.0,7.0,8.56,5.38,78.0,19.4,68.7
13,0,33,m,39.0,51.7,15.9,24.0,6.8,6.46,3.38,65.0,7.0,70.4


In [203]:
df.shape

(392, 13)

In [204]:
x = df.drop("Category", axis=1)
y = df["Category"]

In [205]:
num_pipeline = Pipeline([ ('imputer', KNNImputer(n_neighbors=5)), ('scaler', RobustScaler())])
num_pipeline

In [206]:
cat_pipeline = Pipeline([ ('imputer', SimpleImputer(strategy='most_frequent')),
                        ('encoder', OneHotEncoder(handle_unknown='ignore'))])
cat_pipeline

In [207]:
preprocessor = ColumnTransformer([ ('num', num_pipeline, num_cols), ('cat', cat_pipeline, cat_cols)])
preprocessor

In [208]:
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import TomekLinks
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [209]:
# ADASYN with a single neighbor; the minority class has only 7 samples
adasyn = ADASYN(sampling_strategy='auto', random_state=42, n_neighbors=1)
tomek_links = TomekLinks()
final_pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('sampling', adasyn),  # Using ADASYN for oversampling
        ('clean', tomek_links),  
        ('Model', LogisticRegression())
    ])
final_pipeline

In [210]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
import lightgbm
from catboost import CatBoostClassifier

In [211]:
from sklearn.model_selection import cross_validate
models = []
models.append(("Logistic Regression", LogisticRegression()))
models.append(("Knn", KNeighborsClassifier()))
models.append(("Decision Tree", DecisionTreeClassifier()))
models.append(("Random Forest", RandomForestClassifier()))
models.append(("Ada boost", AdaBoostClassifier()))
models.append(("Xgb", XGBClassifier()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("lightGBM", lightgbm.LGBMClassifier()))
models.append(("CatBoost", CatBoostClassifier()))

for model in models:
    final_pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('sampling', adasyn),  # Using ADASYN for oversampling
        ('clean', tomek_links),  
        ('classifier', model[1])
    ])
    result = cross_validate(final_pipeline, x, y, scoring= 'f1', cv= 5, return_train_score= True, n_jobs= -1)
    print(model[0])
    print('Train F1 Score : ', result['train_score'].mean() * 100)
    print('Test F1 Score : ', result['test_score'].mean() * 100)

Logistic Regression
Train F1 Score :  95.10489510489512
Test F1 Score :  56.666666666666664
Knn
Train F1 Score :  85.13419913419912
Test F1 Score :  56.666666666666664
Decision Tree
Train F1 Score :  100.0
Test F1 Score :  33.33333333333333
Random Forest
Train F1 Score :  100.0
Test F1 Score :  46.666666666666664
Ada boost
Train F1 Score :  100.0
Test F1 Score :  36.66666666666666
Xgb
Train F1 Score :  100.0
Test F1 Score :  51.33333333333333
Naive Bayes
Train F1 Score :  89.26961926961928
Test F1 Score :  20.0
lightGBM
Train F1 Score :  100.0
Test F1 Score :  31.333333333333336
CatBoost
Train F1 Score :  100.0
Test F1 Score :  51.33333333333333


<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    6] Hyperparameter Tuning for XGB 👩‍💻🚀
</p>

In [212]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
XGB_pipeline = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('sampling', adasyn),  # Using ADASYN for oversampling
        ('clean', tomek_links),  
        ('model', XGBClassifier())
    ])
param_grid = {
    'model__learning_rate': [0.01, 0.1, 0.2, 0.3],
    'model__max_depth': [3, 4, 5, 6, 8, 10, 12, 15],
    'model__gamma': [0, 0.25, 0.4, 0.5, 1.0],
    'model__min_child_weight': [1, 3, 5, 7],
    'model__subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'model__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}


In [213]:
random_search = RandomizedSearchCV(XGB_pipeline, param_distributions=param_grid, n_iter=35,
                                   scoring="roc_auc", cv=5, verbose=0, random_state=99)
random_search.fit(x, y)

In [214]:
random_search.best_params_

{'model__subsample': 0.9,
 'model__min_child_weight': 5,
 'model__max_depth': 4,
 'model__learning_rate': 0.1,
 'model__gamma': 0.4,
 'model__colsample_bytree': 0.4}
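
Once a search has finished, the tuned pipeline does not have to be rebuilt by hand: with the default `refit=True`, `best_estimator_` is the pipeline refitted with `best_params_`. The sketch below shows this on synthetic data. To stay dependency-light it swaps in `LogisticRegression` and a plain scikit-learn `Pipeline` in place of the notebook's XGBoost and imbalanced-learn pipeline, so the names `pipe`, `search`, and `tuned` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for x and y
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression(max_iter=1000))])
search = RandomizedSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]},
                            n_iter=3, scoring="roc_auc", cv=5, random_state=99)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already refitted on all the data
tuned = search.best_estimator_
tuned_C = tuned.get_params()["model__C"]
print(search.best_params_, tuned_C)
```

Reusing `best_estimator_` (or calling `set_params(**search.best_params_)` on a fresh pipeline) avoids transcription mismatches between the search result and the final model.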

In [215]:
# Note: these hyperparameters differ from random_search.best_params_ shown above
XGB_Model = ImbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('sampling', adasyn),
        ('clean', tomek_links),
        ('Model', XGBClassifier(objective="binary:logistic",
                                subsample=1.0,
                                min_child_weight=7,
                                max_depth=8,
                                learning_rate=0.1,
                                gamma=0.25,
                                colsample_bytree=0.8))])
XGB_Model.fit(x, y)

In [216]:
# XGB_Model was already fitted in the previous cell; score() reports accuracy on the training data
train_score = XGB_Model.score(x, y)*100
print(f"TRAIN SCORE {train_score:0.2f}%")

TRAIN SCORE 100.00%


In [217]:
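# Note: x and y are the data the model was fitted on, so this is a training F1 score,
# not a score on held-out test data
prediction = XGB_Model.predict(x)
test_score = f1_score(y, prediction)*100
print(f"TEST SCORE {test_score:0.2f}%")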

TEST SCORE 100.00%
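
The score above is computed on the same rows the model was fitted on. A held-out estimate needs rows the model has never seen. The sketch below shows a stratified split on synthetic, imbalanced data; `LogisticRegression` stands in for the notebook's pipeline and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for x and y (about 10% positives)
X_demo, y_demo = make_classification(n_samples=400, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)

# stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

# Fit on the training part only
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_f1 = f1_score(y_train, clf.predict(X_train))
test_f1 = f1_score(y_test, clf.predict(X_test))
print(f"Train F1: {train_f1:.2f}  Test F1: {test_f1:.2f}")
```

With very few positive cases, as in this notebook, a single split is noisy, so the cross-validated scores from the model comparison are the more informative numbers.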


In [218]:
print(classification_report(y, prediction))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       385
           1       1.00      1.00      1.00         7

    accuracy                           1.00       392
   macro avg       1.00      1.00      1.00       392
weighted avg       1.00      1.00      1.00       392



In [219]:
# Confusion Matrix
cm = confusion_matrix(y, prediction)
# Class 0 = blood donor, class 1 = Hepatitis, Fibrosis, or Cirrhosis (see the Category mapping above)
ticks = ["Blood Donor", "Liver Disease"]
px.imshow(cm, labels=dict(x="Predicted", y="Actual", color="Counts"), x=ticks, y=ticks, title="Confusion Matrix", template="plotly_dark", text_auto=True)

In [220]:
# Precision-Recall Trade-off
# Compute precision-recall pairs for different probability thresholds
precision, recall, thresholds = precision_recall_curve(y, XGB_Model.predict_proba(x)[:,1])

# Plot the precision-recall curve
fig = px.area(x=recall, y=precision, title="Precision-Recall Curve", labels=dict(x="Recall", y="Precision"), template="plotly_dark")
fig.update_traces(fill="tozeroy")
fig.show()

<p style = "color: #247881;
            font: bold 20px tahoma;
            background-color: #fff;
            padding: 18px;
            border: 6px solid #247881;
            border-radius: 8px"> 
    🚀 Accuracy: Approximately 100% (training data)
    <br>
    <br>
    🚀 Precision: Approximately 100% (training data)
    <br>
    <br>
    🚀 Recall: Approximately 100% (training data)
    <br>
    <br>
    🚀 F1 Score: Approximately 100% (training data)
</p>

<p style = "color: #00FFAB;
            font: bold 22px tahoma;
            background-color: #111;
            padding: 18px;
            border: 3px solid lightgreen;
            border-radius: 8px;
            text-align:center;"> 
    7] Deployment 👩‍💻🚀
</p>

In [221]:
# Save pipeline as pkl file
import joblib
joblib.dump(XGB_Model, "Hepatitis_Model.pkl")

['Hepatitis_Model.pkl']
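
Before wiring the saved file into an app, it can be worth checking that the pipeline survives a save/load round trip and accepts a one-row `DataFrame` with the training column names. The sketch below does this with a toy scikit-learn pipeline and made-up data; the file name `demo_model.pkl` and all values are illustrative and not the notebook's model.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up training data with two of the notebook's column names
train = pd.DataFrame({"Age": [30, 45, 50, 62, 38, 57],
                      "ALT": [18.0, 22.5, 95.0, 120.0, 20.1, 88.0]})
labels = [0, 0, 1, 1, 0, 1]

pipe = Pipeline([("scaler", StandardScaler()),
                 ("model", LogisticRegression())]).fit(train, labels)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# A single input row must use the same column names as the training frame
new_row = pd.DataFrame([{"Age": 41, "ALT": 19.0}])
same_prediction = restored.predict(new_row)[0] == pipe.predict(new_row)[0]
print(same_prediction)
```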

In [222]:
# save the clean data as csv file 
df.to_csv("Hepatitis_Cleaned.csv", index=False)

In [223]:
df[num_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,392.0,46.882653,9.607995,27.0,39.0,46.0,53.0,77.0
ALB,392.0,42.06352,4.200039,31.4,39.1,41.9,45.1,54.4
ALP,392.0,65.533418,17.626547,22.9,52.5,64.0,77.425,115.4
ALT,392.0,22.828827,9.470941,3.8,15.9,20.45,27.925,51.4
AST,392.0,24.522449,5.808478,12.0,20.4,24.0,27.7,43.4
BIL,392.0,7.217857,3.261484,1.8,4.9,6.5,9.1,15.6
CHE,392.0,8.232041,1.642607,3.9,7.075,8.245,9.4025,12.86
CHOL,392.0,5.43074,0.955282,2.86,4.69,5.345,6.06,7.8
CREA,392.0,77.88801,13.613679,41.0,68.0,76.0,88.0,114.0
GGT,392.0,21.089541,9.315063,4.5,14.2,19.1,26.2,44.6


In [224]:
for col in cat_cols:
    print(df[col].value_counts())

Sex
m    213
f    179
Name: count, dtype: int64


In [230]:
%%writefile Hepatitis_deployment.py

import streamlit as st
import pandas as pd
import plotly.express as px
import joblib
import warnings

def run():
    st.set_page_config(page_title="Cirrhosis Prediction", page_icon="🩺", layout="wide")
    warnings.simplefilter(action='ignore', category=FutureWarning)

    select_page = st.sidebar.radio('Select page', ['Analysis', 'Model Prediction', 'About'])

    if select_page == 'Analysis':
        cleaned_data = pd.read_csv('Hepatitis_Cleaned.csv')
        st.image('https://th.bing.com/th/id/OIP.nCkh1m-FQ0zwXAv0-9HY6QHaFi?rs=1&pid=ImgDetMain', width=700)
        st.write('### Dataset Overview')
        st.dataframe(cleaned_data.head())

        # Univariate Analysis for Categorical Features
        st.write('### Univariate Analysis for Categorical Features')
        categorical_cols = ['Sex']  
        for col in categorical_cols:
            fig = px.histogram(cleaned_data, x=col, color=col)
            st.plotly_chart(fig, use_container_width=True)

        # Bivariate Analysis for Numerical Features
        st.write('### Bivariate Analysis for Numerical Features')
        numerical_cols = ['Age', 'ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']  
        for col in numerical_cols:
            fig = px.box(cleaned_data, x='Category', y=col, color='Category')
            st.plotly_chart(fig, use_container_width=True)

        # Correlation Heatmap for Numerical Features
        st.write('### Correlation Heatmap for Numerical Features')
        corr_matrix = cleaned_data[numerical_cols].corr()
        fig = px.imshow(corr_matrix, text_auto=True, color_continuous_scale='RdBu_r')
        st.plotly_chart(fig, use_container_width=True)

    elif select_page == 'Model Prediction':
        st.title('Cirrhosis Prediction Model')
        st.image('Cirrhosis-cover.jpg', width=700)
        model = joblib.load('Hepatitis_Model.pkl')
        inputs = collect_user_input()
        if st.button('Predict'):
            df = pd.DataFrame([inputs])
            result = model.predict(df)[0]
            display_prediction(result)

    elif select_page == 'About':
        display_about_info()

    display_footer()

def collect_user_input():
    st.sidebar.header('Enter Your Health Details:')
    # Input fields match the features of the HCV dataset
    inputs = {
        'Sex': st.sidebar.selectbox('Sex', ['m', 'f']),
        'Age': st.sidebar.slider('Age', 20, 80, 50),
        'ALB': st.sidebar.slider('ALB (Albumin)', 31.4, 54.4, 42.06),
        'ALP': st.sidebar.slider('ALP (Alkaline Phosphatase)', 22.9, 115.4, 65.53),
        'ALT': st.sidebar.slider('ALT (Alanine Aminotransferase)', 3.8, 51.4, 22.83),
        'AST': st.sidebar.slider('AST (Aspartate Aminotransferase)', 12.0, 43.4, 24.52),
        'BIL': st.sidebar.slider('BIL (Bilirubin)', 1.8, 15.6, 7.22),
        'CHE': st.sidebar.slider('CHE (Cholinesterase)', 3.9, 12.86, 8.23),
        'CHOL': st.sidebar.slider('CHOL (Cholesterol)', 2.86, 7.8, 5.43),
        'CREA': st.sidebar.slider('CREA (Creatinine)', 41.0, 114.0, 77.89),
        'GGT': st.sidebar.slider('GGT (Gamma-Glutamyl Transferase)', 4.5, 44.6, 21.09),
        'PROT': st.sidebar.slider('PROT (Protein)', 62.1, 82.4, 72.18)
    }
    return inputs

def display_prediction(result):
    if result == 1:
        st.error("Prediction: High risk of cirrhosis.")
        st.markdown(health_advice(True))
    else:
        st.success("Prediction: Low risk of cirrhosis.")
        st.markdown(health_advice(False))

def health_advice(high_risk):
    if high_risk:
        return """
        ### Health Advice for High-Risk Individuals
        If you're at high risk of cirrhosis, it's crucial to consult with a healthcare provider for a detailed assessment and personalized advice. Consider adopting a liver-healthy lifestyle:
        - Maintain a balanced diet low in alcohol and fatty foods.
        - Engage in regular physical activity.
        - Monitor and manage your liver health regularly.
        """
    else:
        return """
        ### Health Advice for Low-Risk Individuals
        To maintain a low risk of cirrhosis, continue practicing a liver-healthy lifestyle:
        - Eat a diet rich in fruits, vegetables, and whole grains.
        - Stay active with regular exercise.
        - Avoid excessive alcohol consumption.
        - Keep up with regular health check-ups.
        """

def display_about_info():
    st.title('About Cirrhosis Prediction')
    st.markdown("""
        ## Background and Problem Statement

        Cirrhosis is a late stage of liver scarring (fibrosis) caused by many liver diseases and conditions, such as hepatitis and chronic alcoholism. Early detection and management can greatly improve outcomes for individuals at risk. This app uses machine learning to predict cirrhosis risk from age, sex, and laboratory blood values, supporting early intervention and awareness.
    """)

def display_footer():
    st.markdown("Developed for educational and informational purposes.")
    st.markdown("[GitHub](https://github.com/ibrahim232) | [LinkedIn](https://www.linkedin.com/in/ibrahim-abdelnasar/)")

if __name__ == '__main__':
    run()


Overwriting Hepatitis_deployment.py
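
`collect_user_input` builds its dictionary with `Sex` first, whereas the dataset lists `Age` before `Sex`. If the saved pipeline expects the training column order, the inputs can be reordered before the DataFrame is built. The following is a minimal sketch. `FEATURE_ORDER` and `to_model_row` are hypothetical, and the order shown is an assumption taken from the dataset's attribute listing, not confirmed against the training code.

```python
# Assumed training column order, taken from the dataset's attribute listing.
FEATURE_ORDER = ["Age", "Sex", "ALB", "ALP", "ALT", "AST",
                 "BIL", "CHE", "CHOL", "CREA", "GGT", "PROT"]


def to_model_row(inputs, feature_order=FEATURE_ORDER):
    """Return a new dict whose keys follow feature_order."""
    missing = [name for name in feature_order if name not in inputs]
    if missing:
        raise KeyError(f"Missing input fields: {missing}")
    return {name: inputs[name] for name in feature_order}


# Example using the slider defaults from collect_user_input().
example = {"Sex": "m", "Age": 50, "ALB": 42.06, "ALP": 65.53, "ALT": 22.83,
           "AST": 24.52, "BIL": 7.22, "CHE": 8.23, "CHOL": 5.43,
           "CREA": 77.89, "GGT": 21.09, "PROT": 72.18}
row = to_model_row(example)
print(list(row))
```

In the app, `pd.DataFrame([to_model_row(inputs)])` would take the place of `pd.DataFrame([inputs])`.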


In [231]:
! streamlit run Hepatitis_deployment.py

