
**DPhi Data Science Bootcamp (Project 1)**  
**Name:** Johny Ijaq  

# Disease Prediction

**About the Dataset**  

This dataset contains records of liver patients and non-liver patients collected from the north-east of Andhra Pradesh, India. It was downloaded from the UCI Machine Learning Repository.  

The `Liver_Problem` column is the target variable. It divides the records into liver patients (`Liver_Problem == 1`) and non-liver patients (`Liver_Problem == 2`).
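
For readability, the numeric target codes can be mapped to text labels. The following is a minimal sketch, assuming the coding described above (1 = liver patient, 2 = non-liver patient). The `TARGET_LABELS` dictionary and the toy series are illustrative and not part of the dataset.

```python
import pandas as pd

# Hypothetical mapping, based on the coding described above.
TARGET_LABELS = {1: "liver patient", 2: "non-liver patient"}

# Toy stand-in for the Liver_Problem column (values are illustrative).
toy_target = pd.Series([1, 2, 1, 1], name="Liver_Problem")

# map() replaces each code with its text label.
readable = toy_target.map(TARGET_LABELS)
print(readable.tolist())
```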

**Features:**

- Age of the patient

- Gender of the patient

- Total Bilirubin

- Direct Bilirubin

- Alkaline Phosphatase (column `Alkaline_Phosphotase`)

- Alanine Aminotransferase (column `Alamine_Aminotransferase`)

- Aspartate Aminotransferase

- Total Proteins (column `Total_Protiens`)

- Albumin

- Albumin and Globulin Ratio

- Liver_Problem (the target variable)

**Objective:**

To build a machine learning model that predicts whether a patient is healthy (non-liver patient) or ill (liver patient) from the clinical and demographic input variables listed above.

## 1. Importing the Libraries

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")

## 2. Importing the Dataset

In [6]:
liver_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/liver_patient_data/indian_liver_patient_dataset.csv')

In [7]:
liver_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         500 non-null    int64  
 1   Gender                      500 non-null    object 
 2   Total_Bilirubin             500 non-null    float64
 3   Direct_Bilirubin            500 non-null    float64
 4   Alkaline_Phosphotase        500 non-null    int64  
 5   Alamine_Aminotransferase    500 non-null    int64  
 6   Aspartate_Aminotransferase  500 non-null    int64  
 7   Total_Protiens              500 non-null    float64
 8   Albumin                     500 non-null    float64
 9   Albumin_and_Globulin_Ratio  496 non-null    float64
 10  Liver_Problem               500 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 43.1+ KB


In [8]:
liver_data.shape

(500, 11)

In [9]:
liver_data.head()

|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Liver_Problem |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | Female | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.90 | 1 |
| 1 | 62 | Male | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | Male | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | Male | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.00 | 1 |
| 4 | 72 | Male | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.40 | 1 |


In [10]:
liver_data['Liver_Problem'].value_counts()

1    350
2    150
Name: Liver_Problem, dtype: int64
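
The counts above show that the two classes are not balanced. As a quick worked calculation, using only the counts printed above, the class shares are:

```python
# Class counts copied from the value_counts() output above.
counts = {1: 350, 2: 150}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(shares)  # {1: 0.7, 2: 0.3}
```

A model that always predicted class 1 would therefore be right for about 70% of these rows, which is a useful reference point when reading accuracy figures.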

## 3. Data Preprocessing

In [11]:
liver_data.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Liver_Problem                 0
dtype: int64

There are 4 missing values in the `Albumin_and_Globulin_Ratio` column.

### 3.1. Handling Missing Values

In [11]:
liver_data[liver_data['Albumin_and_Globulin_Ratio'].isnull()]

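|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Liver_Problem |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 209 | 45 | Female | 0.9 | 0.3 | 189 | 23 | 33 | 6.6 | 3.9 | NaN | 1 |
| 241 | 51 | Male | 0.8 | 0.2 | 230 | 24 | 46 | 6.5 | 3.1 | NaN | 1 |
| 253 | 35 | Female | 0.6 | 0.2 | 180 | 12 | 15 | 5.2 | 2.7 | NaN | 2 |
| 312 | 27 | Male | 1.3 | 0.6 | 106 | 25 | 54 | 8.5 | 4.8 | NaN | 2 |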


The missing values are filled with the mean of the column.

In [19]:
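# Assign the result back instead of using inplace=True on a selected column, which may not update liver_data in recent pandas versions.
liver_data['Albumin_and_Globulin_Ratio'] = liver_data['Albumin_and_Globulin_Ratio'].fillna(liver_data['Albumin_and_Globulin_Ratio'].mean())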

In [20]:
liver_data.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Liver_Problem                 0
dtype: int64
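
To make the effect of mean imputation concrete, here is a small sketch on a toy column. The values and the names `ratio`, `fill_value` and `filled` are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative values only).
ratio = pd.Series([0.9, 0.74, np.nan, 1.0], name="Albumin_and_Globulin_Ratio")

# mean() skips NaN by default, so the fill value is the mean of the observed values.
fill_value = ratio.mean()

# fillna() returns a new Series in which every NaN is replaced by fill_value.
filled = ratio.fillna(fill_value)

print(filled.isnull().sum())
```

The observed values are left unchanged; only the missing entry takes the mean of the other three.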

### 3.2. Label Encoding

In [21]:
le = LabelEncoder()
liver_data.Gender = le.fit_transform(liver_data.Gender)
liver_data.head()

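|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio | Liver_Problem |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65 | 0 | 0.7 | 0.1 | 187 | 16 | 18 | 6.8 | 3.3 | 0.90 | 1 |
| 1 | 62 | 1 | 10.9 | 5.5 | 699 | 64 | 100 | 7.5 | 3.2 | 0.74 | 1 |
| 2 | 62 | 1 | 7.3 | 4.1 | 490 | 60 | 68 | 7.0 | 3.3 | 0.89 | 1 |
| 3 | 58 | 1 | 1.0 | 0.4 | 182 | 14 | 20 | 6.8 | 3.4 | 1.00 | 1 |
| 4 | 72 | 1 | 3.9 | 2.0 | 195 | 27 | 59 | 7.3 | 2.4 | 0.40 | 1 |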
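
The output above shows `Female` encoded as 0 and `Male` as 1. A short sketch of how that mapping can be read from a fitted `LabelEncoder`, using a toy list of values (the names `encoder` and `mapping` are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Female", "Male", "Male"])

# classes_ holds the sorted unique labels; each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# transform() applies the fitted mapping to new values.
codes = encoder.transform(["Male", "Female"]).tolist()
print(codes)
```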


## 4. Separating input and target variables

`y` holds the target variable `Liver_Problem`, which is what we need to predict.

`X` holds the input variables used to predict `y`, that is, all columns of the data except the target variable `Liver_Problem`.

In [22]:
X = liver_data.drop('Liver_Problem', axis = 1) 
y = liver_data['Liver_Problem']

## 5. Splitting Data into Train and Test

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101)
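
Because the classes are imbalanced, one option (not used in this notebook) is a stratified split, which keeps the class ratio the same in the train and test sets. A minimal sketch on toy data, assuming a 70/30 class ratio like the one in the training file; `X_toy` and `y_toy` are illustrative names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 70/30 class ratio (illustrative values only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [2] * 30)

# stratify=y_toy asks the splitter to preserve the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)

# The 20-row test set keeps the 70/30 ratio: 14 rows of class 1 and 6 of class 2.
print((y_te == 1).sum(), (y_te == 2).sum())
```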

## 6. Training

In [24]:
model = LogisticRegression()

In [25]:
model.fit(X_train, y_train)

LogisticRegression()

In [26]:
pred = model.predict(X_test)

In [27]:
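from sklearn.metrics import confusion_matrix
# Labels are sorted as [1, 2], so ravel() treats class 2 (non-liver patient) as the "positive" class.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()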
print("True Positive", tp)
print("True Negative", tn)
print("False Positive", fp)
print("False Negative", fn)

True Positive 5
True Negative 60
False Positive 6
False Negative 29
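
`confusion_matrix` sorts the labels as `[1, 2]`, so in the counts above class 2 (non-liver patient) takes the "positive" position and class 1 (liver patient) the "negative" one. As a worked calculation from those counts, precision and recall for class 2, and recall for class 1, come out as follows:

```python
# Counts copied from the output above.
tp, tn, fp, fn = 5, 60, 6, 29

precision_class_2 = tp / (tp + fp)  # 5 / 11
recall_class_2 = tp / (tp + fn)     # 5 / 34
recall_class_1 = tn / (tn + fp)     # 60 / 66

print(round(precision_class_2, 3), round(recall_class_2, 3), round(recall_class_1, 3))
# 0.455 0.147 0.909
```

The model finds most liver patients (class 1) but only 5 of the 34 non-liver patients in the test split.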


In [28]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pred)

0.65
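
To put this accuracy in context, it can be compared with a baseline that always predicts class 1. The sketch below uses only the confusion-matrix counts reported above; the test split contains `tn + fp = 66` rows of class 1.

```python
# Counts copied from the confusion-matrix output above.
tp, tn, fp, fn = 5, 60, 6, 29
total = tp + tn + fp + fn

accuracy = (tp + tn) / total   # correct predictions / all predictions
baseline = (tn + fp) / total   # accuracy of always predicting class 1

print(accuracy, baseline)  # 0.65 0.66
```

On this split the model scores slightly below the always-class-1 baseline, so accuracy alone gives little evidence that it has learned a useful decision rule.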

## 7. Testing

In [29]:
test_new = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/liver_patient_data/indian_liver_patient_new_testdataset.csv')

In [30]:
test_new.head()

|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | Male | 2.8 | 1.5 | 305 | 28 | 76 | 5.9 | 2.5 | 0.7 |
| 1 | 42 | Male | 0.8 | 0.2 | 127 | 29 | 30 | 4.9 | 2.7 | 1.2 |
| 2 | 53 | Male | 19.8 | 10.4 | 238 | 39 | 221 | 8.1 | 2.5 | 0.4 |
| 3 | 32 | Male | 30.5 | 17.1 | 218 | 39 | 79 | 5.5 | 2.7 | 0.9 |
| 4 | 32 | Male | 32.6 | 14.1 | 219 | 95 | 235 | 5.8 | 3.1 | 1.1 |


In [31]:
# Reuse the encoder fitted on the training data so the gender codes stay consistent.
test_new.Gender = le.transform(test_new.Gender)
test_new.head()

|   | Age | Gender | Total_Bilirubin | Direct_Bilirubin | Alkaline_Phosphotase | Alamine_Aminotransferase | Aspartate_Aminotransferase | Total_Protiens | Albumin | Albumin_and_Globulin_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 1 | 2.8 | 1.5 | 305 | 28 | 76 | 5.9 | 2.5 | 0.7 |
| 1 | 42 | 1 | 0.8 | 0.2 | 127 | 29 | 30 | 4.9 | 2.7 | 1.2 |
| 2 | 53 | 1 | 19.8 | 10.4 | 238 | 39 | 221 | 8.1 | 2.5 | 0.4 |
| 3 | 32 | 1 | 30.5 | 17.1 | 218 | 39 | 79 | 5.5 | 2.7 | 0.9 |
| 4 | 32 | 1 | 32.6 | 14.1 | 219 | 95 | 235 | 5.8 | 3.1 | 1.1 |


In [32]:
Liver_Problem = model.predict(test_new)
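
Prediction on new data relies on the new frame having the same columns, in the same order, as the training inputs. A hedged sketch of one way to enforce this on toy data; `X_toy`, `new_rows` and `toy_model` are illustrative names and the values are made up:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy training data (illustrative values only).
X_toy = pd.DataFrame({"Age": [30, 60, 45, 70], "Total_Bilirubin": [0.7, 8.0, 0.9, 5.5]})
y_toy = pd.Series([2, 1, 2, 1])
toy_model = LogisticRegression().fit(X_toy, y_toy)

# New data with the same columns in a different order.
new_rows = pd.DataFrame({"Total_Bilirubin": [0.8, 9.0], "Age": [35, 65]})

# Reorder the columns to match the training frame before predicting.
new_rows = new_rows[X_toy.columns]
toy_pred = toy_model.predict(new_rows)
print(toy_pred)
```

Selecting with `X_toy.columns` also raises a `KeyError` if an expected column is missing, which surfaces schema mismatches early.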

## 8. Saving the Prediction File

In [23]:
res = pd.DataFrame(Liver_Problem)
res.index = test_new.index  # keep the test data's index so rows can be compared
res.columns = ["Liver_Problem"]
# The CSV file is saved in the same directory as this notebook.
res.to_csv("prediction_results_Liver_Problem.csv")
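
Since `to_csv` writes the index as an unnamed first column, reading the file back needs `index_col=0` to restore it. A minimal round-trip sketch, using an in-memory buffer in place of a file and a toy frame (`toy_res` and its values are illustrative):

```python
import io
import pandas as pd

# Toy prediction frame (illustrative values only).
toy_res = pd.DataFrame({"Liver_Problem": [1, 2, 1]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
toy_res.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the index instead of creating an "Unnamed: 0" column.
loaded = pd.read_csv(buffer, index_col=0)
print(list(loaded.columns), loaded["Liver_Problem"].tolist())
```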