# **PRESENTED BY**: **JAYAKRISHNAN J**

**DATE: 09/04/2024**

**Predicting Hepatitis C Patients**

Outline:

1. Data Preprocessing: data cleaning, simple EDA, data scaling

2. Model Building: K-Nearest Neighbors (baseline), Logistic Regression, Random Forest, Support Vector Machine

3. Model Evaluation: accuracy, AUC-ROC, AUC-PRC




**Background**

Hepatitis C is a viral infection specifically targeting the liver, potentially leading to severe and life-threatening liver damage if left untreated. This bloodborne virus, primarily transmitted through exposure to infected blood, can establish a chronic infection that may persist for years. In its chronic form, hepatitis C can quietly progress, often without noticeable symptoms, making early detection crucial for effective management.

Diagnostic tools, such as liver function tests, play a pivotal role in assessing the health of the liver. These blood tests evaluate the levels of various enzymes and proteins, providing insights into the liver's performance in tasks like protein production and bilirubin clearance. Elevated levels of certain enzymes may indicate liver cell damage or disease, helping healthcare professionals monitor and address the progression of hepatitis C and its impact on liver function. Regular monitoring and timely intervention are essential components of managing hepatitis C to prevent long-term complications.



**Some important terminologies for this dataset:**

Alanine transaminase (ALT): ALT is a liver enzyme responsible for converting proteins into energy, and elevated levels in the bloodstream may indicate liver damage or disease.

Aspartate transaminase (AST): AST, also found in the liver, assists in amino acid metabolism; increased levels could signify liver damage, disease, or muscle injury.

Alkaline phosphatase (ALP): ALP, present in the liver and bone, aids in protein breakdown, and elevated levels may suggest liver damage, bile duct obstruction, or certain bone diseases.

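Albumin (ALB) and Total Protein (PROT): Albumin, the main protein produced by the liver, helps maintain fluid balance and carries hormones, drugs, and other substances in the blood; lower-than-normal albumin or total protein levels may indicate liver damage or disease.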

Bilirubin: Bilirubin, a byproduct of red blood cell breakdown, passes through the liver, and elevated levels could signal liver damage, disease, or certain types of anemia.

Gamma-glutamyltransferase (GGT): GGT is a blood enzyme, and higher-than-normal levels may indicate liver or bile duct damage.

Acetylcholinesterase (AChE): AChE is an enzyme pivotal in nerve signal transmission; it breaks down acetylcholine and contributes to neuromuscular function. In this dataset the cholinesterase measurement appears as the CHE column.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

In [None]:
# Colab path; adjust it to wherever HepatitisCdata.csv is stored.
# A missing file raises FileNotFoundError, which is what the recorded run of this cell produced.
df = pd.read_csv('/content/HepatitisCdata.csv')

In [None]:
df.head()

In [None]:
df.tail()

**Data Cleaning**

In [None]:

# 'Unnamed: 0' is a leftover row-index column from the CSV, not a feature.
df = df.drop(columns='Unnamed: 0')
df

In [None]:
df.info()

In [None]:
print(df['Category'].unique())

In [None]:
df.isna().sum()

In [None]:
# Fill missing lab values with the column mean. Assigning the result back avoids
# the chained inplace fillna, which newer pandas versions warn about.
for col in ['ALB', 'ALP', 'ALT', 'CHOL', 'PROT']:
    df[col] = df[col].fillna(df[col].mean())

In [None]:
df.isna().sum()
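The mean fill above can also be expressed with scikit-learn's `SimpleImputer`, which makes it easy to swap in the median when a column has strong outliers. The following is only a sketch on an invented frame: the values are made up, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "CHOL": [3.2, 4.8, np.nan, 5.2],
})

# strategy="mean" reproduces the fill used in this notebook;
# strategy="median" is less sensitive to extreme lab values.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

print(filled)
print(filled.isna().sum())
```

Because the imputer is a fitted object, the same column means can later be applied to new data with `imputer.transform`.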

In [None]:
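# Binary target: 0 = blood donor (including suspect donors),
# 1 = hepatitis C at any stage (hepatitis, fibrosis, cirrhosis).
df['Category'] = df['Category'].map({'0=Blood Donor': 0, '0s=suspect Blood Donor': 0,
                                     "1=Hepatitis" : 1, "2=Fibrosis" : 1, "3=Cirrhosis" : 1})

# Encode sex numerically: male = 0, female = 1.
df['Sex'] = df['Sex'].map({'m': 0, 'f': 1})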
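`Series.map` with a dictionary silently turns any label that is not a key into `NaN`, so a misspelled or unexpected category would go unnoticed. A minimal sketch of a sanity check, using invented labels (the last one is deliberately misspelled):

```python
import pandas as pd

category_map = {
    "0=Blood Donor": 0, "0s=suspect Blood Donor": 0,
    "1=Hepatitis": 1, "2=Fibrosis": 1, "3=Cirrhosis": 1,
}

# Toy labels for illustration; "2=fibrosis" does not match any key.
labels = pd.Series(["0=Blood Donor", "1=Hepatitis", "3=Cirrhosis", "2=fibrosis"])
mapped = labels.map(category_map)

# Any label that failed to map shows up as NaN in the result.
unmapped = labels[mapped.isna()].unique()
print("Unmapped labels:", list(unmapped))
```

Running the same check on the real `Category` column before overwriting it would confirm that every row received a 0 or a 1.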

**Exploratory Data Analysis**

In [None]:
#check disease distribution

plt.figure()
sns.countplot(x='Category', data=df)
plt.title('Disease Comparison')
plt.show()
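The count plot shows the class balance visually; the same information as numbers gives a baseline for reading accuracy later, since a model that always predicts the majority class already scores the majority proportion. A small sketch with invented labels (the 9:1 split is an assumption for illustration, not the dataset's actual ratio):

```python
import pandas as pd

# Invented labels: 0 = donor, 1 = hepatitis C (any stage).
y_toy = pd.Series([0] * 9 + [1] * 1, name="Category")

proportions = y_toy.value_counts(normalize=True).sort_index()

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = proportions.max()

print(proportions)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```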

In [None]:
plt.figure()
sns.countplot(x='Sex', data=df)
plt.title('Gender Comparison')
plt.show()

In [None]:
plt.figure()
sns.histplot(data=df, x='Age', hue='Category', kde=True)
plt.title('Age Distribution')
plt.show()

In [None]:
fig, axes = plt.subplots(5, 2, figsize=(12, 15))
axes = axes.flatten()

columns = ['ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']

for i, column in enumerate(columns):
    sns.boxplot(x=df['Category'], y=df[column], ax=axes[i])
    axes[i].set_title(f'Boxplot of {column}')

plt.tight_layout()
plt.show()

In [None]:


# Pairwise correlations between all (now numeric) columns.
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cbar=True, cmap='coolwarm', ax=ax)
plt.show()
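A heatmap of the full correlation matrix can be hard to scan for the one column that matters most here, the target. One option is to rank features by the absolute value of their correlation with `Category`. The sketch below uses synthetic data in which "AST" is constructed to track the target and "CREA" is pure noise; these relationships are assumptions of the example, not findings about the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

# Synthetic features: "AST" depends on the target, "CREA" does not.
toy = pd.DataFrame({
    "Category": target,
    "AST": target * 2.0 + rng.normal(size=n),
    "CREA": rng.normal(size=n),
})

# Correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()["Category"]
    .drop("Category")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```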

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the lab values to zero mean and unit variance.
# Note: the scaler is fit on the full dataset here, before the train/test split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
cols_to_scale = ['ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
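Fitting the scaler on the whole dataset lets information from the test rows leak into preprocessing. One common way to avoid this is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaling statistics are learned from the training rows only. This is a sketch on synthetic data generated with `make_classification`, not the notebook's actual workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ten lab-value features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# fit() learns the scaler's mean and variance from X_train only;
# score() reuses those statistics to transform X_test.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_train, y_train)

print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

A pipeline can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.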

In [None]:
df

In [None]:
from sklearn.model_selection import train_test_split
x = df.drop("Category", axis=1)
y = df["Category"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
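With an imbalanced target, a purely random split can leave the test set with very few positive cases. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. A sketch with invented labels (90 negatives, 10 positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced labels and a placeholder feature column.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Both parts keep the original 10% positive rate.
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test): ", y_te.mean())
```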

In [None]:
x_train

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Baseline: k-nearest neighbors with scikit-learn's defaults (n_neighbors=5).
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)


In [None]:
from sklearn.metrics import accuracy_score,confusion_matrix
y_pred=knn.predict(x_test)
score=accuracy_score(y_test,y_pred)
score
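The KNN baseline uses the default number of neighbors. If it were to be tuned like the other models, one simple approach is to compare a few values of `n_neighbors` by cross-validated accuracy. The sketch below runs on synthetic data, and the candidate values are an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Mean 5-fold accuracy for each candidate k.
cv_scores = {}
for k in (3, 5, 7, 9, 11):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_k, X, y, cv=5, scoring="accuracy").mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("Best k:", best_k)
```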

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV

In [None]:
lr_model = LogisticRegression()
lr_params = {
    "penalty": ["l2"],
    "C": [0.01, 0.1, 1, 10],
    "max_iter": [500]
}

grid_search_lr = GridSearchCV(lr_model, lr_params, scoring='accuracy', cv=5)
grid_search_lr.fit(x_train, y_train)

best_lr_model = grid_search_lr.best_estimator_
y_pred_lr = best_lr_model.predict(x_test)

accuracy_lr = accuracy_score(y_test, y_pred_lr)

print("Logistic Regression")
print(f"Best parameters: {grid_search_lr.best_params_}")
print(f"Accuracy: {accuracy_lr}")
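`best_params_` reports only the winning combination. The fitted search also stores the score of every candidate in `cv_results_`, which shows how close the alternatives were. The sketch below, on synthetic data, tabulates those scores and builds a final model from `best_params_` rather than from values typed in by hand.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

search = GridSearchCV(
    LogisticRegression(max_iter=500),
    {"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

# One row per candidate, best rank first.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)

# Reuse the winning settings programmatically instead of hard-coding them.
final_model = LogisticRegression(max_iter=500, **search.best_params_)
```

With the default `refit=True`, `search.best_estimator_` is already trained on all of the data passed to `fit`, so a separate retraining step is optional.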

In [None]:
# Retrain with the best hyperparameters (values hard-coded from the grid search run).

lr_model = LogisticRegression(C=10, penalty='l2', max_iter=500)
lr_model.fit(x_train, y_train)
y_pred_lr = lr_model.predict(x_test)
y_pred1_lr =lr_model.predict(x_train)

accuracy_lr = accuracy_score(y_test, y_pred_lr)
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
classification_rep_lr = classification_report(y_test, y_pred_lr)

print("Logistic Regression")
print(f"Best Model Accuracy: {accuracy_lr:.3f}")
print("Best Model Confusion Matrix:")
print(conf_matrix_lr)
print("Best Model Classification Report:")
print(classification_rep_lr)
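For a screening task, the two numbers most often read off a confusion matrix are sensitivity (the share of true cases that are caught) and specificity (the share of healthy donors correctly cleared). A sketch of how they follow from the matrix, using invented labels and predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels and predictions for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

sensitivity = tp / (tp + fn)  # recall for the positive (hepatitis C) class
specificity = tn / (tn + fp)  # recall for the negative (donor) class

print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")
```

These are the same quantities the classification report lists as the recall of class 1 and class 0.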

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc_model = RandomForestClassifier(random_state=42)  # fixed seed so the search is reproducible
rfc_params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 20, None]
}

grid_search_rfc = GridSearchCV(rfc_model, rfc_params, scoring='accuracy', cv=5)
grid_search_rfc.fit(x_train, y_train)

best_rfc_model = grid_search_rfc.best_estimator_
y_pred_rfc = best_rfc_model.predict(x_test)

accuracy_rfc = accuracy_score(y_test, y_pred_rfc)

print("Random Forest")
print(f"Best parameters: {grid_search_rfc.best_params_}")
print(f"Accuracy: {accuracy_rfc}")

In [None]:
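# Retrain with the best hyperparameters (values hard-coded from the grid search run).
rfc_model = RandomForestClassifier(max_depth=20, n_estimators=200, random_state=42)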
rfc_model.fit(x_train, y_train)
y_pred_rfc = rfc_model.predict(x_test)
y_pred1_rfc = rfc_model.predict(x_train)

accuracy_rfc = accuracy_score(y_test, y_pred_rfc)
conf_matrix_rfc = confusion_matrix(y_test, y_pred_rfc)
classification_rep_rfc = classification_report(y_test, y_pred_rfc)

print("Random Forest")
print(f"Best Model Accuracy: {accuracy_rfc:.3f}")
print("Best Model Confusion Matrix:")
print(conf_matrix_rfc)
print("Best Model Classification Report:")
print(classification_rep_rfc)
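A fitted random forest exposes `feature_importances_`, which can hint at which lab values drive its predictions. The sketch below uses synthetic data; the column names are only labels attached to synthetic columns, so the resulting ranking says nothing about the real markers.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names borrowed from the dataset purely as labels for synthetic columns.
feature_names = ["ALB", "ALP", "ALT", "AST", "BIL"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough guide rather than a definitive ranking.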

In [None]:
from sklearn.svm import SVC

svc_model = SVC()
svc_params = {
    "C": [0.01, 0.1, 1, 10],
    "kernel": ["linear", "rbf", "sigmoid"],
    "gamma": ["scale", "auto"]
}

grid_search_svc = GridSearchCV(svc_model, svc_params, scoring='accuracy', cv=5)
grid_search_svc.fit(x_train, y_train)

best_svc_model = grid_search_svc.best_estimator_
y_pred_svc = best_svc_model.predict(x_test)

accuracy_svc = accuracy_score(y_test, y_pred_svc)

print("Support Vector Machine")
print(f"Best parameters: {grid_search_svc.best_params_}")
print(f"Accuracy: {accuracy_svc}")

In [None]:
# Retrain with the best hyperparameters (values hard-coded from the grid search run).
svc_model = SVC(C=10, gamma='scale', kernel='linear')
svc_model.fit(x_train, y_train)
y_pred_svc = svc_model.predict(x_test)

accuracy_svc = accuracy_score(y_test, y_pred_svc)
conf_matrix_svc = confusion_matrix(y_test, y_pred_svc)
classification_rep_svc = classification_report(y_test, y_pred_svc)

print("Support Vector Machine")
print(f"Best Model Accuracy: {accuracy_svc:.3f}")
print("Best Model Confusion Matrix:")
print(conf_matrix_svc)
print("Best Model Classification Report:")
print(classification_rep_svc)

In [None]:
# Use dedicated names so the feature matrix x and target y are not overwritten.
model_names = ['Logistic Regression',
               'Random Forest',
               'Support Vector Machine']

accuracies = [accuracy_lr,
              accuracy_rfc,
              accuracy_svc]

fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(x=model_names, y=accuracies, palette='coolwarm', ax=ax)
plt.ylabel("Model Accuracy")
plt.xticks(rotation=20)
plt.title("Model Accuracy Comparison")

# Annotate each bar with its accuracy as a percentage.
for i, v in enumerate(accuracies):
    ax.text(i, v + 0.01, f'{v*100:.2f}%', ha='center', va='bottom', fontsize=10)

plt.show()

In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve, precision_recall_curve

# Continuous scores for the positive class from each tuned model.
y_scores_lr = best_lr_model.decision_function(x_test)
y_scores_rfc = best_rfc_model.predict_proba(x_test)[:, 1]
y_scores_svc = best_svc_model.decision_function(x_test)

models = ['Logistic Regression', 'Random Forest', 'Support Vector Machine']
scores = [y_scores_lr, y_scores_rfc, y_scores_svc]

# Row 0: ROC curves; row 1: precision-recall curves. The two curve types use
# different axes, so each gets its own panel.
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for i, model in enumerate(models):
    # ROC curve and its area
    auc_roc = roc_auc_score(y_test, scores[i])
    fpr, tpr, _ = roc_curve(y_test, scores[i])
    axes[0, i].plot(fpr, tpr, label=f'AUC-ROC = {auc_roc:.2f}')
    axes[0, i].plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance level
    axes[0, i].set(title=f'{model}: ROC', xlabel='False positive rate', ylabel='True positive rate')
    axes[0, i].legend()

    # Precision-recall curve; average precision summarizes its area
    auc_prc = average_precision_score(y_test, scores[i])
    precision, recall, _ = precision_recall_curve(y_test, scores[i])
    axes[1, i].plot(recall, precision, label=f'AUC-PRC = {auc_prc:.2f}')
    axes[1, i].set(title=f'{model}: Precision-Recall', xlabel='Recall', ylabel='Precision')
    axes[1, i].legend()

plt.tight_layout()
plt.show()
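A ROC curve summarizes every possible decision threshold, while `predict` uses only one. If a specific operating point is needed, one simple heuristic (an assumption of this sketch, not something the notebook applies) is Youden's J statistic, which picks the threshold that maximizes the true positive rate minus the false positive rate. The labels and scores below are invented.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented labels and classifier scores for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Youden's J = TPR - FPR; the best threshold maximizes it.
best = np.argmax(tpr - fpr)
best_threshold = thresholds[best]

print(f"AUC-ROC: {auc:.3f}")
print(f"Threshold maximizing TPR - FPR: {best_threshold}")
```

In a screening setting, missed cases usually cost more than false alarms, so a lower threshold than the one Youden's J suggests may be preferable.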

In [None]:
# Overfitting check for logistic regression: compare train and test reports.
rep = classification_report(y_train, y_pred1_lr)
print("Report of train data:")
print(rep)

print("-----------------------------------------------------------")

rep = classification_report(y_test, y_pred_lr)
print("Report of test data:")
print(rep)

In [None]:
# Overfitting check for random forest: compare train and test reports.
rep = classification_report(y_train, y_pred1_rfc)
print("Report of train data:")
print(rep)

print("-----------------------------------------------------------")

rep = classification_report(y_test, y_pred_rfc)
print("Report of test data:")
print(rep)
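Comparing one train report with one test report depends on a single split. A complementary check is to measure the train/validation gap across several folds with `cross_validate` and `return_train_score=True`. The sketch below uses synthetic data and a random forest with arbitrary settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the feature matrix and target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

cv = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)

# A large positive gap suggests the model fits the training folds
# much better than unseen data.
gap = cv["train_score"].mean() - cv["test_score"].mean()

print(f"Mean train accuracy:      {cv['train_score'].mean():.3f}")
print(f"Mean validation accuracy: {cv['test_score'].mean():.3f}")
print(f"Gap:                      {gap:.3f}")
```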