# Problem 2: Liver Disease Prediction (10 points)

## Problem Statement
You are tasked with predicting whether a patient has liver disease using the **Indian Liver Patient Dataset (ILPD)**. The dataset contains demographic information and blood test results for patients. Your goal is to preprocess the data, build a KNN classifier, and use synthetic samples to make predictions.

## Tasks

**Part A (3 points):**  
- Display the **first 10** and **last 10** rows of the dataset.

**Part B (4 points):**  
- Preprocess the data by:
  - Handling missing values, if any  
  - Encoding categorical features (e.g., Gender)  
  - Scaling numerical features if required  
- Build a **K-Nearest Neighbors (KNN)** classifier with **K=5**.  
- Train the model and make predictions on the test set.

**Part C (3 points):**  
- Create **synthetic patient samples** with relevant feature values.  
- Use the trained KNN model to predict whether these synthetic patients have liver disease.  

## Expected Outcome
- Correctly display the dataset rows.  
- Build and train a KNN model.  
- Make predictions on new synthetic samples and interpret the results.


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

In [5]:
df = pd.read_csv(r'C:\Users\LENOVO\OneDrive - Green University\Desktop\ML Lab Final\indian_liver_patient.csv')

In [6]:
df.head(10)

,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1
5,46,Male,1.8,0.7,208,19,14,7.6,4.4,1.3,1
6,26,Female,0.9,0.2,154,16,12,7.0,3.5,1.0,1
7,29,Female,0.9,0.3,202,14,11,6.7,3.6,1.1,1
8,17,Male,0.9,0.3,202,22,19,7.4,4.1,1.2,2
9,55,Male,0.7,0.2,290,53,58,6.8,3.4,1.0,1


In [7]:
df.tail(10)

,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Dataset
573,32,Male,3.7,1.6,612,50,88,6.2,1.9,0.4,1
574,32,Male,12.1,6.0,515,48,92,6.6,2.4,0.5,1
575,32,Male,25.0,13.7,560,41,88,7.9,2.5,2.5,1
576,32,Male,15.0,8.2,289,58,80,5.3,2.2,0.7,1
577,32,Male,12.7,8.4,190,28,47,5.4,2.6,0.9,1
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.1,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.0,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.0,1
582,38,Male,1.0,0.3,216,21,24,7.3,4.4,1.5,2


In [8]:
df.isna().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64

In [9]:
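# Drop the 4 rows with a missing Albumin_and_Globulin_Ratio.
# .copy() avoids a SettingWithCopyWarning when columns are reassigned later.
df = df.dropna().copy()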
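
Dropping rows is reasonable here because only 4 values are missing. As an alternative, the sketch below fills the gap with the column median instead. The `toy` frame and its values are made up for illustration and are not taken from the dataset; in a real run the median would ideally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    "Albumin": [3.3, 3.2, 3.4, 2.4],
    "Albumin_and_Globulin_Ratio": [0.9, np.nan, 1.0, 0.4],
})

# Median of the observed ratios (NaN is skipped by default).
ratio_median = toy["Albumin_and_Globulin_Ratio"].median()

# Fill only the ratio column; every row is kept.
toy_filled = toy.fillna({"Albumin_and_Globulin_Ratio": ratio_median})
print(toy_filled)
```

Imputation keeps all rows, which can matter more on datasets where missing values are common.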

In [10]:
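# Encode Gender as integers. LabelEncoder sorts the classes alphabetically,
# so Female -> 0 and Male -> 1.
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])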
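
To confirm which integer each gender received, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a small hand-written list rather than the real column; `le_demo` and `mapping` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["Female", "Male", "Male", "Female"])

# classes_ holds the original labels in sorted order; the position is the code.
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}
```

Knowing the mapping matters later, when gender values are typed in by hand for new samples.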

In [11]:
# Features are every column except the target; 'Dataset' holds the class label.
x = df.drop('Dataset', axis=1)
y = df['Dataset']

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [13]:
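# Fit the scaler on the training split only, then apply the same
# transformation to the test split to avoid leaking test statistics.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)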

In [15]:
# KNN with K=5, as required by Part B.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

In [16]:
# Evaluate
print("\n--- KNN Model Performance ---")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))
print("F1 Score:", f1_score(y_test, y_pred, average="weighted"))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


--- KNN Model Performance ---
Accuracy: 0.6609195402298851
Precision: 0.6355573263788274
Recall: 0.6609195402298851
F1 Score: 0.6308276483767685

Classification Report:
               precision    recall  f1-score   support

           1       0.69      0.86      0.77       113
           2       0.53      0.30      0.38        61

    accuracy                           0.66       174
   macro avg       0.61      0.58      0.57       174
weighted avg       0.64      0.66      0.63       174
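
The report shows unequal class support (113 vs. 61) and a recall of only 0.30 for class 2. One option worth trying is a stratified split combined with a scikit-learn pipeline. The sketch below illustrates the idea on data generated by `make_classification`, not on the liver dataset, so its numbers say nothing about the real model; all `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 10 features, roughly 70/30 class balance.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, weights=[0.7, 0.3], random_state=0
)

# stratify=y_demo keeps the class ratio nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only, then fits KNN.
demo_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
demo_model.fit(X_tr, y_tr)
demo_pred = demo_model.predict(X_te)

print("train minority share:", y_tr.mean())
print("test minority share:", y_te.mean())
```

Stratification does not fix imbalance by itself, but it makes the test-set metrics less dependent on how the random split happened to fall.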



In [17]:
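# Two hand-made patients; values follow the column order of x.
# Gender uses the encoded form: 1 = Male, 0 = Female.
synthetic_samples = pd.DataFrame([
    [45, 1, 1.2, 0.3, 200, 45, 35, 6.8, 3.5, 1.0],   # Sample 1
    [55, 0, 0.9, 0.2, 180, 40, 30, 6.5, 3.0, 1.1],   # Sample 2
], columns=x.columns)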

In [18]:
# Reuse the scaler fitted on the training data; do not refit on new samples.
synthetic_scaled = scaler.transform(synthetic_samples)

In [19]:
synthetic_pred = knn.predict(synthetic_scaled)
synthetic_samples['Predicted_Disease'] = synthetic_pred
print("\nSynthetic Samples Predictions:\n", synthetic_samples)


Synthetic Samples Predictions:
    Age  Gender  ...  Albumin_and_Globulin_Ratio  Predicted_Disease
0   45       1  ...                         1.0                  1
1   55       0  ...                         1.1                  2

[2 rows x 11 columns]
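
To interpret these predictions, the numeric labels can be mapped to readable names. The sketch below assumes the dataset's usual encoding of the `Dataset` column (1 = liver patient, 2 = non-liver patient); `label_names` is an illustrative helper, and the two values are the predictions printed above.

```python
import pandas as pd

# Assumed label encoding for the 'Dataset' column.
label_names = {1: "liver disease", 2: "no liver disease"}

# Predictions for the two synthetic patients, copied from the output above.
predicted = pd.Series([1, 2], name="Predicted_Disease")
readable = predicted.map(label_names)
print(readable.tolist())  # ['liver disease', 'no liver disease']
```

Under that assumption, the model flags the first synthetic patient (45, male) as having liver disease and the second (55, female) as not having it. Given the test accuracy of about 0.66, these predictions should be read as rough indications only.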
