# Problem Statement

The given dataset is related to Indian patients who have been tested for a liver disease.
Based on chemical compounds (bilrubin,albumin,protiens,alkaline phosphatase) present in human body and tests like SGOT, SGPT the outcome mentioned is whether person is a patient i.e, whether he needs to be diagnosed further or not.
Perform data cleansing, and required transformations and build a predictive model which will be able to predict most of the cases accurately.

## Following are the feature names for the given data:
Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,
Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,
Albumin_and_Globulin_Ratio,Class.



In [1]:
# Importing the required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:
#loading data 
ind_pat = pd.read_csv('IndianLiverPatientData.txt', delimiter="\t" ,names=('Age','Gender','Total_Bilirubin',
                    'Direct_Bilirubin','Alkaline_Phosphotase','Alamine_Aminotransferase','Aspartate_Aminotransferase',
                     'Total_Protiens','Albumin','Albumin_and_Globulin_Ratio','Class'))

In [3]:
ind_pat.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Class
1,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,No
2,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,No
3,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,No
4,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,No
5,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,No


In [4]:
ind_pat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 583 entries, 1 to 583
Data columns (total 11 columns):
Age                           583 non-null int64
Gender                        563 non-null object
Total_Bilirubin               583 non-null float64
Direct_Bilirubin              583 non-null float64
Alkaline_Phosphotase          583 non-null int64
Alamine_Aminotransferase      583 non-null int64
Aspartate_Aminotransferase    583 non-null int64
Total_Protiens                568 non-null float64
Albumin                       583 non-null float64
Albumin_and_Globulin_Ratio    579 non-null float64
Class                         583 non-null object
dtypes: float64(5), int64(4), object(2)
memory usage: 54.7+ KB


In [5]:
ind_pat.describe()

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
count,583.0,583.0,583.0,583.0,583.0,583.0,568.0,583.0,579.0
mean,44.746141,3.298799,1.486106,290.576329,80.713551,109.910806,6.483979,3.141852,0.947064
std,16.189833,6.209522,2.808498,242.937989,182.620356,288.918529,1.084039,0.795519,0.319592
min,4.0,0.4,0.1,63.0,10.0,10.0,2.7,0.9,0.3
25%,33.0,0.8,0.2,175.5,23.0,25.0,5.8,2.6,0.7
50%,45.0,1.0,0.3,208.0,35.0,42.0,6.6,3.1,0.93
75%,58.0,2.6,1.3,298.0,60.5,87.0,7.2,3.8,1.1
max,90.0,75.0,19.7,2110.0,2000.0,4929.0,9.6,5.5,2.8


Missing Values in data?

In [161]:
ind_pat.isnull().sum()

Age                            0
Gender                        20
Total_Bilirubin                0
Direct_Bilirubin               0
Alkaline_Phosphotase           0
Alamine_Aminotransferase       0
Aspartate_Aminotransferase     0
Total_Protiens                15
Albumin                        0
Albumin_and_Globulin_Ratio     4
Class                          0
dtype: int64

In [6]:
ind_pat.shape

(583, 11)

Removing duplicate data

In [7]:
ind_pat_dup = ind_pat[ind_pat.duplicated(keep = False)]

In [8]:
ind_pat_dup

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Class
19,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,No
20,40,Female,0.9,0.3,293,232,245,6.8,3.1,0.8,No
26,34,Male,4.1,2.0,289,875,731,5.0,2.7,1.1,No
27,34,Male,4.1,2.0,289,875,731,5.0,2.7,1.1,No
34,38,Female,2.6,1.2,410,59,57,5.6,3.0,0.8,Yes
35,38,Female,2.6,1.2,410,59,57,5.6,3.0,0.8,Yes
55,42,Male,8.9,4.5,272,31,61,5.8,2.0,0.5,No
56,42,Male,8.9,4.5,272,31,61,5.8,2.0,0.5,No
62,58,Male,1.0,0.5,158,37,43,7.2,3.6,1.0,No
63,58,Male,1.0,0.5,158,37,43,7.2,3.6,1.0,No


In [9]:
ind_pat = ind_pat[~ind_pat.duplicated(subset = None, keep = 'first')]

Duplicate data is removed

In [10]:
ind_pat.shape

(570, 11)

# performing imputations in missing values

In [12]:
ind_pat['Gender'].value_counts()

Male      410
Female    140
Name: Gender, dtype: int64

In [13]:
ind_pat['Gender']=ind_pat['Gender'].fillna(value='Male')

In [171]:
ind_pat['Total_Protiens'].value_counts()

7.0    32
6.0    28
6.8    26
6.2    24
6.9    24
7.1    21
7.2    20
8.0    18
6.1    18
7.3    18
6.4    17
6.6    16
5.6    16
5.5    16
6.7    15
7.9    14
6.3    14
7.5    14
5.9    14
5.4    13
6.5    13
5.2    12
5.7    11
5.8    11
7.4    11
7.8     9
7.6     9
5.1     9
5.0     8
5.3     8
8.2     8
4.9     6
8.1     6
8.5     5
4.5     4
4.6     4
4.4     4
3.6     3
8.6     3
4.8     3
8.4     3
8.3     3
4.3     3
7.7     2
3.8     2
4.1     2
9.2     2
3.9     2
4.7     2
8.7     1
3.7     1
9.5     1
9.6     1
4.0     1
3.0     1
2.8     1
8.9     1
2.7     1
Name: Total_Protiens, dtype: int64

In [14]:
ind_pat['Total_Protiens'].median()

6.6

In [15]:
ind_pat['Total_Protiens'].fillna(ind_pat['Total_Protiens'].median(),inplace=True)

In [16]:
ind_pat['Albumin_and_Globulin_Ratio'].value_counts()

1.00    104
0.80     62
0.90     57
0.70     53
1.10     45
1.20     35
0.60     31
0.50     26
1.30     25
1.40     17
0.40     14
1.50     10
1.60      4
1.70      4
0.30      4
0.75      4
0.96      3
1.80      3
0.92      2
1.16      2
0.52      2
0.47      2
0.76      2
0.93      2
1.38      2
2.50      2
1.85      2
1.34      2
0.95      2
1.06      2
       ... 
0.48      1
0.62      1
1.12      1
1.03      1
0.53      1
0.55      1
0.46      1
0.69      1
1.25      1
1.66      1
1.55      1
0.61      1
0.39      1
0.64      1
0.78      1
0.37      1
1.27      1
0.68      1
0.67      1
0.58      1
1.11      1
0.71      1
1.72      1
0.35      1
0.45      1
0.88      1
1.02      1
1.09      1
0.89      1
1.36      1
Name: Albumin_and_Globulin_Ratio, Length: 69, dtype: int64

In [17]:
ind_pat['Albumin_and_Globulin_Ratio'].median()

0.95

In [18]:
ind_pat['Albumin_and_Globulin_Ratio'].mean()

0.9480035335689051

In [19]:
ind_pat['Albumin_and_Globulin_Ratio'].fillna(ind_pat['Albumin_and_Globulin_Ratio'].median(),inplace=True)

Now the data is clean and Complete

In [20]:
ind_pat.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Class                         0
dtype: int64

In [21]:
ind_pat.shape

(570, 11)

In [24]:
ind_pat['Class'].replace({'No': 0},inplace=True)
ind_pat['Class'].replace({'Yes': 1},inplace=True)

# Hypothesis testing to check relationship with the label 

In [25]:
c_table = pd.crosstab(ind_pat['Gender'],ind_pat['Class'])

from scipy.stats import chi2_contingency
stat, p, dof, expected = chi2_contingency(c_table)
if p <= 0.05:
    print("Alternate Hypothesis (H1): Gender and Class have some form of relationship.")
else:
    print("Null Hypothesis(H0): Gender and Class are independent of each other.")
print("Confidence Level : {} %".format(((1- p)*100)))



Null Hypothesis(H0): Gender and Class are independent of each other.
Confidence Level : 92.27222170498041 %


In [26]:

from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Age'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Age and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Age and class has no linear relationship")

-0.13809318041844038
0.0009479292716647942
Confidence Level : 99.90520707283352 %
Alternate Hypothesis (H1) - Age and class has linear relationship


In [27]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Total_Bilirubin'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Total_Bilirubin and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Total_Bilirubin and class has no linear relationship")

-0.2244297174135663
6.106350132485954e-08
Confidence Level : 99.99999389364986 %
Alternate Hypothesis (H1) - Total_Bilirubin and class has linear relationship


In [28]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Direct_Bilirubin'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Direct_Bilirubin and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Direct_Bilirubin and class has no linear relationship")

-0.25066633406488736
1.2903824781009159e-09
Confidence Level : 99.99999987096176 %
Alternate Hypothesis (H1) - Direct_Bilirubin and class has linear relationship


In [29]:
ind_pat.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Class'],
      dtype='object')

In [30]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Alkaline_Phosphotase'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Alkaline_Phosphotase and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Alkaline_Phosphotase and class has no linear relationship")

-0.18755981377621078
6.538991047830893e-06
Confidence Level : 99.99934610089521 %
Alternate Hypothesis (H1) - Alkaline_Phosphotase and class has linear relationship


In [31]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Alamine_Aminotransferase'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Alamine_Aminotransferase and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Alamine_Aminotransferase and class has no linear relationship")

-0.16191728893425517
0.00010322353521362907
Confidence Level : 99.98967764647864 %
Alternate Hypothesis (H1) - Alamine_Aminotransferase and class has linear relationship


In [32]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Aspartate_Aminotransferase'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Aspartate_Aminotransferase and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Aspartate_Aminotransferase and class has no linear relationship")

-0.15110067538891003
0.0002941855097133594
Confidence Level : 99.97058144902866 %
Alternate Hypothesis (H1) - Aspartate_Aminotransferase and class has linear relationship


In [33]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Total_Protiens'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Total_Protiens and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Total_Protiens and class has no linear relationship")

0.037504043309767296
0.3714580009292938
Confidence Level : 62.854199907070615 %
Null Hypothesis (H0)- Total_Protiens and class has no linear relationship


In [34]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Albumin'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Albumin and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Albumin and class has no linear relationship")

0.16683508037277303
6.268355891508786e-05
Confidence Level : 99.9937316441085 %
Alternate Hypothesis (H1) - Albumin and class has linear relationship


In [35]:
from scipy.stats import pearsonr
correlation,pvalue = pearsonr(ind_pat['Albumin_and_Globulin_Ratio'],ind_pat['Class'])
print(correlation)
print(pvalue)
print("Confidence Level : {} %".format(((1- pvalue)*100)))
if pvalue <= 0.05:
    print("Alternate Hypothesis (H1) - Albumin_and_Globulin_Ratio and class has linear relationship")
else:
    print("Null Hypothesis (H0)- Albumin_and_Globulin_Ratio and class has no linear relationship")

0.17055333734348094
4.2584221495744994e-05
Confidence Level : 99.99574157785042 %
Alternate Hypothesis (H1) - Albumin_and_Globulin_Ratio and class has linear relationship


In [36]:
ind_pat.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Class
1,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,0
2,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,0
3,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,0
4,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,0
5,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,0


Lets drop the features which has no relationship with the label

In [37]:
ind_pat.drop(columns="Gender",inplace=True)

In [38]:
ind_pat.drop(columns="Total_Protiens",inplace=True)

In [39]:
ind_pat.head()

Unnamed: 0,Age,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Albumin,Albumin_and_Globulin_Ratio,Class
1,65,0.7,0.1,187,16,18,3.3,0.9,0
2,62,10.9,5.5,699,64,100,3.2,0.74,0
3,62,7.3,4.1,490,60,68,3.3,0.89,0
4,58,1.0,0.4,182,14,20,3.4,1.0,0
5,72,3.9,2.0,195,27,59,2.4,0.4,0


In [40]:
import warnings
warnings.filterwarnings('ignore')

In [41]:
features = ind_pat.iloc[:,[0,1,2,3,4,5,6,7]].values
label = ind_pat.iloc[:,8].values

In [42]:
##for loop 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

for i in range(1,401):
    X_train,X_test,y_train,y_test=train_test_split(features,
                                                  label,
                                                  test_size=0.2,
                                                  random_state=i)
    
    model = LogisticRegression()
    model.fit(X_train,y_train)
    
    train_score=model.score(X_train,y_train)
    test_score=model.score(X_test,y_test)
    
    if test_score > train_score:
        print("Test Score: {} Train Score: {} Random State: {}".format(test_score,train_score,i))

Test Score: 0.7368421052631579 Train Score: 0.7149122807017544 Random State: 4
Test Score: 0.7543859649122807 Train Score: 0.7192982456140351 Random State: 7
Test Score: 0.7719298245614035 Train Score: 0.7149122807017544 Random State: 9
Test Score: 0.7456140350877193 Train Score: 0.7171052631578947 Random State: 10
Test Score: 0.7719298245614035 Train Score: 0.7039473684210527 Random State: 12
Test Score: 0.7807017543859649 Train Score: 0.7039473684210527 Random State: 14
Test Score: 0.7807017543859649 Train Score: 0.7083333333333334 Random State: 18
Test Score: 0.7543859649122807 Train Score: 0.7083333333333334 Random State: 21
Test Score: 0.7192982456140351 Train Score: 0.7105263157894737 Random State: 26
Test Score: 0.7631578947368421 Train Score: 0.7171052631578947 Random State: 30
Test Score: 0.7543859649122807 Train Score: 0.7192982456140351 Random State: 38
Test Score: 0.7631578947368421 Train Score: 0.7083333333333334 Random State: 39
Test Score: 0.7456140350877193 Train Score:

Test Score: 0.7456140350877193 Train Score: 0.7039473684210527 Random State: 292
Test Score: 0.7456140350877193 Train Score: 0.7171052631578947 Random State: 297
Test Score: 0.7807017543859649 Train Score: 0.7039473684210527 Random State: 298
Test Score: 0.7807017543859649 Train Score: 0.7017543859649122 Random State: 299
Test Score: 0.7456140350877193 Train Score: 0.7127192982456141 Random State: 301
Test Score: 0.7368421052631579 Train Score: 0.7214912280701754 Random State: 303
Test Score: 0.7368421052631579 Train Score: 0.7105263157894737 Random State: 304
Test Score: 0.7192982456140351 Train Score: 0.7127192982456141 Random State: 305
Test Score: 0.7631578947368421 Train Score: 0.7302631578947368 Random State: 306
Test Score: 0.7894736842105263 Train Score: 0.7017543859649122 Random State: 307
Test Score: 0.7543859649122807 Train Score: 0.7214912280701754 Random State: 308
Test Score: 0.7719298245614035 Train Score: 0.7039473684210527 Random State: 311
Test Score: 0.76315789473684

After checking the random_state,test score and train score we have come up to a solution that
the random_state_287 gives best result


In [43]:
#now by giving the random_state 267 we will create a logistic model 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train,X_test,y_train,y_test=train_test_split(features,
                                                  label,
                                                  test_size=0.2,
                                                  random_state=287)
    
model = LogisticRegression()
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))

0.7105263157894737
0.8070175438596491


In [44]:
from sklearn.metrics import classification_report
#print(classification_report())
print(classification_report(label,model.predict(features)))

              precision    recall  f1-score   support

           0       0.75      0.93      0.83       406
           1       0.58      0.23      0.33       164

   micro avg       0.73      0.73      0.73       570
   macro avg       0.66      0.58      0.58       570
weighted avg       0.70      0.73      0.69       570



# accuracy of test model

In [48]:
from sklearn.metrics import accuracy_score
print (accuracy_score(y_test,model.predict(X_test)))

0.8070175438596491


here are the train score test score and classification report

In [49]:
from sklearn.metrics import confusion_matrix
trainscor=(model.score(X_train,y_train))
testscor=(model.score(X_test,y_test))
cm = confusion_matrix(label,model.predict(features))
if testscor > trainscor:
    print('model is balanced')
    print("Train score : ",trainscor)
    print("Test score: ",testscor)
    print(cm)
else:
    print('model is not balanced')

model is balanced
Train score :  0.7105263157894737
Test score:  0.8070175438596491
[[378  28]
 [126  38]]
