# Business Case

The number of patients with liver disease has been rising steadily because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs. This dataset is used to evaluate prediction algorithms that could reduce the burden on doctors by helping to predict whether a patient has liver disease.

## Importing Basic Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

## Importing Dataset

In [2]:
data = pd.read_csv("Indian Liver Patient Dataset (ILPD).csv", header=None)

In [3]:
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


* Since the dataset has no header row, we pass **header = None** so that the first record is read as data and the column names can be added in code. Otherwise we would have to edit the original file.
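
As a minimal sketch of the same idea, the column names can also be supplied at read time through the `names` parameter of `pd.read_csv`. The in-memory `raw` buffer and the `sample` variable are assumptions made for illustration; the two rows are the first two records shown in the output above.

```python
import io
import pandas as pd

# Two rows in the same layout as the ILPD file (no header row).
raw = io.StringIO(
    "65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1\n"
    "62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1\n"
)

# Column names spelled exactly as they are used in this notebook.
column_names = ['Age', 'Gender', 'Total Bilirubin', 'Direct Bilirubin',
                'Alkaline Phosphotase', 'Alamine Aminotransferase',
                'Aspartate Aminotransferase', 'Total Protiens', 'Albumin',
                'Albumin and Globulin Ratio', 'Target']

# header=None keeps the first record as data; names= labels the columns.
sample = pd.read_csv(raw, header=None, names=column_names)
print(sample.shape)
print(sample.columns.tolist())
```

Both approaches give the same result; assigning `data.columns` afterwards, as this notebook does, simply separates loading from labeling.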

## Defining Column Names to the Dataset

In [4]:
column_names = ['Age','Gender','Total Bilirubin','Direct Bilirubin','Alkaline Phosphotase',
                'Alamine Aminotransferase','Aspartate Aminotransferase','Total Protiens','Albumin',
                'Albumin and Globulin Ratio','Target']
data.columns = column_names
data

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


## Basic Checks

In [5]:
data.head()    #First 5 rows in the dataset

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


In [6]:
data.tail()    #Last 5 rows in dataset

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.1,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.0,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.0,1
582,38,Male,1.0,0.3,216,21,24,7.3,4.4,1.5,2


In [7]:
data.info()       #Gives the Datatype and non-Null value count for all the columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total Bilirubin             583 non-null    float64
 3   Direct Bilirubin            583 non-null    float64
 4   Alkaline Phosphotase        583 non-null    int64  
 5   Alamine Aminotransferase    583 non-null    int64  
 6   Aspartate Aminotransferase  583 non-null    int64  
 7   Total Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin and Globulin Ratio  579 non-null    float64
 10  Target                      583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


In [8]:
data.describe()    #Describes details of numerical data in dataset

Unnamed: 0,Age,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
count,583.0,583.0,583.0,583.0,583.0,583.0,583.0,583.0,579.0,583.0
mean,44.746141,3.298799,1.486106,290.576329,80.713551,109.910806,6.48319,3.141852,0.947064,1.286449
std,16.189833,6.209522,2.808498,242.937989,182.620356,288.918529,1.085451,0.795519,0.319592,0.45249
min,4.0,0.4,0.1,63.0,10.0,10.0,2.7,0.9,0.3,1.0
25%,33.0,0.8,0.2,175.5,23.0,25.0,5.8,2.6,0.7,1.0
50%,45.0,1.0,0.3,208.0,35.0,42.0,6.6,3.1,0.93,1.0
75%,58.0,2.6,1.3,298.0,60.5,87.0,7.2,3.8,1.1,2.0
max,90.0,75.0,19.7,2110.0,2000.0,4929.0,9.6,5.5,2.8,2.0


In [9]:
data.describe().T    # Transpose 

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,583.0,44.746141,16.189833,4.0,33.0,45.0,58.0,90.0
Total Bilirubin,583.0,3.298799,6.209522,0.4,0.8,1.0,2.6,75.0
Direct Bilirubin,583.0,1.486106,2.808498,0.1,0.2,0.3,1.3,19.7
Alkaline Phosphotase,583.0,290.576329,242.937989,63.0,175.5,208.0,298.0,2110.0
Alamine Aminotransferase,583.0,80.713551,182.620356,10.0,23.0,35.0,60.5,2000.0
Aspartate Aminotransferase,583.0,109.910806,288.918529,10.0,25.0,42.0,87.0,4929.0
Total Protiens,583.0,6.48319,1.085451,2.7,5.8,6.6,7.2,9.6
Albumin,583.0,3.141852,0.795519,0.9,2.6,3.1,3.8,5.5
Albumin and Globulin Ratio,579.0,0.947064,0.319592,0.3,0.7,0.93,1.1,2.8
Target,583.0,1.286449,0.45249,1.0,1.0,1.0,2.0,2.0


In [10]:
data.describe(include='O')      #Displays details of categorical column

Unnamed: 0,Gender
count,583
unique,2
top,Male
freq,441


In [11]:
data['Gender'].value_counts()   #Displays the count of different values in categorical column

Male      441
Female    142
Name: Gender, dtype: int64
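
The counts above are easier to compare as percentages. A small sketch, assuming a stand-in Series named `gender` that is rebuilt from the two counts in the output above rather than read from the file:

```python
import pandas as pd

# Stand-in Series rebuilt from the counts shown above (441 Male, 142 Female).
gender = pd.Series(['Male'] * 441 + ['Female'] * 142, name='Gender')

# normalize=True returns fractions instead of raw counts.
shares = gender.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

On the real frame the equivalent call would be `data['Gender'].value_counts(normalize=True)`.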

In [12]:
data.isnull().sum()    #Displays the sum of null values in each column

Age                           0
Gender                        0
Total Bilirubin               0
Direct Bilirubin              0
Alkaline Phosphotase          0
Alamine Aminotransferase      0
Aspartate Aminotransferase    0
Total Protiens                0
Albumin                       0
Albumin and Globulin Ratio    4
Target                        0
dtype: int64
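
Before deciding how to treat the four missing ratios, it can help to look at the affected rows. The sketch below uses a small stand-in frame called `demo` with illustrative values (not taken from the dataset); the same mask would apply to `data`.

```python
import numpy as np
import pandas as pd

# Stand-in frame; values are illustrative only.
demo = pd.DataFrame({
    'Albumin': [3.3, 3.2, 2.4, 3.4],
    'Albumin and Globulin Ratio': [0.90, np.nan, 0.40, np.nan],
})

# Boolean mask that is True where the ratio is missing.
missing_mask = demo['Albumin and Globulin Ratio'].isnull()

print(demo[missing_mask])          # rows with a missing ratio
print(missing_mask.mean() * 100)   # percentage of rows affected
```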

In [13]:
data[data.duplicated()]

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
19,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,1
26,34,Male,4.1,2.0,289,875,731,5.0,2.7,1.1,1
34,38,Female,2.6,1.2,410,59,57,5.6,3.0,0.8,2
55,42,Male,8.9,4.5,272,31,61,5.8,2.0,0.5,1
62,58,Male,1.0,0.5,158,37,43,7.2,3.6,1.0,1
106,36,Male,5.3,2.3,145,32,92,5.1,2.6,1.0,2
108,36,Male,0.8,0.2,158,29,39,6.0,2.2,0.5,2
138,18,Male,0.8,0.2,282,72,140,5.5,2.5,0.8,1
143,30,Male,1.6,0.4,332,84,139,5.6,2.7,0.9,1
158,72,Male,0.7,0.1,196,20,35,5.8,2.0,0.5,1


* To check that a flagged row really has an earlier identical copy, filter on its values, for example `data.loc[(data['Age'] == 40) & (data['Gender'] == 'Female')]`.
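
Another way to see every member of a duplicate group, including the first occurrence, is `duplicated(keep=False)`. A minimal sketch on a stand-in frame `demo` with illustrative values; the final `reset_index` step is an optional assumption and is not something this notebook does.

```python
import pandas as pd

# Stand-in frame with one row repeated three times; values are illustrative.
demo = pd.DataFrame({
    'Age': [40, 40, 34, 40],
    'Gender': ['Female', 'Female', 'Male', 'Female'],
    'Total Bilirubin': [0.9, 0.9, 4.1, 0.9],
})

# keep=False marks every row of a duplicate group, including the first.
all_copies = demo[demo.duplicated(keep=False)]

# The default keep='first' leaves the first occurrence unmarked,
# so this counts only the extra copies.
extra_copies = int(demo.duplicated().sum())

# Dropping duplicates leaves gaps in the index; reset_index closes them.
deduped = demo.drop_duplicates().reset_index(drop=True)

print(all_copies)
print(extra_copies)
print(deduped)
```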

In [16]:
data.duplicated().sum()      #Sum of duplicated values (First value that is original entry is not counted as duplicate)

13

In [17]:
data.drop_duplicates(inplace = True)     #Removes duplicate rows available in the dataset

In [18]:
data

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


## Domain Analysis

The dataset contains blood test reports of patients who were tested to check whether they have liver disease. Each report contains the following attributes: Age, Gender, Total Bilirubin, Direct Bilirubin, Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Protiens, Albumin, and Albumin and Globulin Ratio. These attributes can be used to predict whether a person has liver disease.

**Detailed Description:**
* **Age:** The age of the person who took the blood test.
* **Gender:** The gender of the person who took the blood test.
* **Total Bilirubin:** The normal range for total bilirubin is 0.1 to 1.2 mg/dL. A level above 1.2 mg/dL is considered dangerous.
* **Direct Bilirubin:** The normal range for direct bilirubin is 0 to 0.4 mg/dL.
* **Alkaline Phosphotase:** Alkaline phosphatase, abbreviated ALP. The normal ALP range is 44 to 147 IU/L or 0.73 to 2.45 µkat/L. High levels of ALP may indicate liver disease.
* **Alamine Aminotransferase:** Alanine aminotransferase, abbreviated ALT. The normal ALT range is 5 to 56 U/L. ALT levels are typically higher in male patients than in female patients. High levels of ALT in the blood can be due to damage or injury to liver cells.
* **Aspartate Aminotransferase:** Abbreviated AST. The normal ranges for AST are 14 to 20 units/L for men and 10 to 36 units/L for women. An increased AST level is often a sign of liver disease.
* **Total Protiens:** The normal range for total protein (TP) is 6.0 to 8.3 grams per deciliter (g/dL) or 60 to 83 g/L. Low TP levels indicate liver or kidney disease.
* **Albumin:** The normal albumin range is 3.4 to 5.4 g/dL. A lower albumin level may indicate liver disease, kidney disease, or an inflammatory disease. Higher albumin levels may be caused by acute infections, burns, or a heart attack.
* **Albumin and Globulin Ratio:** The normal albumin/globulin ratio is above 1, usually around 1 to 2.
* **Target:** Indicates whether the patient has liver disease. 1 means the person has liver disease and 2 means the person does not have liver disease.
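
The reference ranges in the list above can be turned into a quick screening step. The sketch below is only an illustration: `reference_ranges` and `flag_out_of_range` are hypothetical names, AST is left out because its range differs by gender, and the three demo rows are the first three records of the dataset as displayed earlier.

```python
import pandas as pd

# (lower, upper) bounds as quoted in the description list above.
reference_ranges = {
    'Total Bilirubin': (0.1, 1.2),
    'Direct Bilirubin': (0.0, 0.4),
    'Alkaline Phosphotase': (44, 147),
    'Alamine Aminotransferase': (5, 56),
    'Total Protiens': (6.0, 8.3),
    'Albumin': (3.4, 5.4),
}

def flag_out_of_range(frame, ranges):
    """Return a boolean frame that is True where a value is outside its range.

    Missing values are also flagged, because between() is False for NaN.
    """
    flags = pd.DataFrame(index=frame.index)
    for column, (low, high) in ranges.items():
        if column in frame.columns:
            # between() is inclusive on both ends by default.
            flags[column] = ~frame[column].between(low, high)
    return flags

# First three records of the dataset, restricted to the columns above.
demo = pd.DataFrame({
    'Total Bilirubin': [0.7, 10.9, 7.3],
    'Direct Bilirubin': [0.1, 5.5, 4.1],
    'Alkaline Phosphotase': [187, 699, 490],
    'Alamine Aminotransferase': [16, 64, 60],
    'Total Protiens': [6.8, 7.5, 7.0],
    'Albumin': [3.3, 3.2, 3.3],
})

flags = flag_out_of_range(demo, reference_ranges)
print(flags.sum())  # number of out-of-range values per column
```

A count like this is only a descriptive aid; it says how many readings fall outside the quoted ranges, not whether a patient has liver disease.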


## Exploratory Data Analysis

### Univariate Analysis

In [26]:
!pip install sweetviz



In [27]:
import sweetviz as sv
my_report = sv.analyze(data)
my_report.show_html('Univariate.html')     #Default name will be SWEETVIZ_REPORT.html

Report Univariate.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


### Univariate Analysis Insights

* **Age:** No missing values found. The maximum age in the dataset is 90 years and the minimum is 4 years. The upper quartile is 58 and the lower quartile is 33, so the IQR is 25. The standard deviation is 16.2 years. The distribution in the report does not look normal.
* **Gender:** No missing values found. Around 75% of the patients are male and the rest are female.
* **Total Bilirubin:** No missing values found. The maximum is 75 and the minimum is 0.4. The upper quartile is 2.6 and the lower quartile is 0.8. The IQR is 1.8 and the standard deviation is 6.27. The maximum lies far above the upper quartile, which suggests a right-skewed, non-normal distribution.
* **Direct Bilirubin:** No missing values found. The maximum is 19.7 and the minimum is 0.1. The upper quartile is 1.3 and the lower quartile is 0.2. The IQR is 1.1 and the standard deviation is 2.83. The maximum lies far above the upper quartile, which suggests a right-skewed, non-normal distribution.
* **Alkaline Phosphotase:** No missing values found. The maximum is 2110 and the minimum is 63. The upper quartile is 298 and the lower quartile is 176. The IQR is 122 and the standard deviation is 245. The maximum lies far above the upper quartile, which suggests a right-skewed, non-normal distribution.
* **Alamine Aminotransferase:** No missing values found. The maximum is 2000 and the minimum is 10. The upper quartile is 60 and the lower quartile is 23. The IQR is 37 and the standard deviation is 181. The maximum lies far above the upper quartile, which suggests a right-skewed, non-normal distribution.
* **Aspartate Aminotransferase:** No missing values found. The maximum is 4929 and the minimum is 10. The upper quartile is 87 and the lower quartile is 25. The IQR is 61.8 and the standard deviation is 291. The maximum lies far above the upper quartile, which suggests a right-skewed, non-normal distribution.
* **Total Proteins:** No missing values found. The maximum is 9.6 and the minimum is 2.7. The upper quartile is 7.2 and the lower quartile is 5.8. The IQR is 1.4 and the standard deviation is 1.09. The distribution in the report does not look normal.
* **Albumin:** No missing values found. The maximum is 5.5 and the minimum is 0.9. The upper quartile is 3.8 and the lower quartile is 2.6. The IQR is 1.2 and the standard deviation is 0.797. The distribution in the report does not look normal.
* **Albumin and Globulin Ratio:** 4 missing values found. The maximum is 2.8 and the minimum is 0.3. The upper quartile is 1.1 and the lower quartile is 0.7. The IQR is 0.4 and the standard deviation is 0.32. The distribution in the report looks approximately normal.
* **Target:** This column separates patients with liver disease from patients without it. Around 70% of patients have liver disease and around 30% do not.
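
The quartiles, IQR, and standard deviation quoted above can be reproduced directly in pandas, and adding the sample skewness gives a rough numeric hint about asymmetry. This is a sketch under assumptions: `spread_summary` is a hypothetical helper, the demo values are illustrative, and skewness is a heuristic rather than a formal normality test.

```python
import pandas as pd

def spread_summary(series):
    """Collect quartiles, IQR, standard deviation, and skewness of a Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return {
        'q1': q1,
        'q3': q3,
        'iqr': q3 - q1,
        'std': series.std(),
        'skew': series.skew(),  # near 0 for symmetric data, > 0 for a long right tail
    }

# Illustrative values with one very large reading, mimicking a right tail.
demo = pd.Series(
    [0.7, 0.8, 0.9, 1.0, 1.0, 1.3, 2.6, 3.9, 7.3, 10.9, 75.0],
    name='Total Bilirubin',
)

summary = spread_summary(demo)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Applied to the real frame, the same helper could be mapped over the numeric columns, for example `data.select_dtypes('number').apply(lambda s: pd.Series(spread_summary(s)))`.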