# **Business case (Problem Statement)**
The number of patients with liver disease has been rising steadily because of excessive alcohol consumption, inhalation of harmful gases, and intake of contaminated food, pickles, and drugs, among other causes. This dataset is used to evaluate prediction algorithms that could help predict the disease and so reduce the burden on doctors.

## **Importing Basic Libraries**

In [2]:
import numpy as np
import pandas as pd
import seaborn as sb
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

## **Importing Dataset**

In [3]:
data=pd.read_csv("Indian Liver Patient Dataset (ILPD).csv" , header = None)

* The dataset has no header row, so we pass **header = None** and assign the column names in code. Otherwise we would have to edit the original Excel / CSV file.
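As a hedged alternative, `pd.read_csv` also accepts a `names` argument, so the header can be assigned in the same call that reads the file. The sketch below reads two rows taken from the start of the dataset out of an in-memory string; `csv_text` and `sample` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# Two rows from the start of the dataset; a stand-in for the real CSV file.
csv_text = (
    "65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1\n"
    "62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1\n"
)

column_names = ['Age', 'Gender', 'Total Bilirubin', 'Direct Bilirubin',
                'Alkaline Phosphotase', 'Alamine Aminotransferase',
                'Aspartate Aminotransferase', 'Total Protiens', 'Albumin',
                'Albumin and Globulin Ratio', 'Target']

# header=None: the file has no header row; names=...: use these labels instead.
sample = pd.read_csv(io.StringIO(csv_text), header=None, names=column_names)
print(sample.shape)
```

Either approach gives the same DataFrame; passing `names` simply avoids a second assignment step.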


## **Defining Column Names to the Dataset**

In [4]:
column_names = ['Age' , 'Gender' , 'Total Bilirubin' , 'Direct Bilirubin' , 'Alkaline Phosphotase' , 'Alamine Aminotransferase' ,
                'Aspartate Aminotransferase' , 'Total Protiens' , 'Albumin' , 'Albumin and Globulin Ratio' , 'Target']

data.columns = column_names

* This code assigns a name to each column of the dataset, which makes the data easier to understand and to use in later calculations.

## **Basic Checks**

In [5]:
data.head()

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1


* This line displays the first five rows of the dataset.

In [6]:
data.tail()

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.1,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.0,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.0,1
582,38,Male,1.0,0.3,216,21,24,7.3,4.4,1.5,2


* This line displays the last five rows of the dataset.

In [7]:
data.shape

(583, 11)

* This line displays the number of rows and columns in the dataset.

In [8]:
data.dtypes


Age                             int64
Gender                         object
Total Bilirubin               float64
Direct Bilirubin              float64
Alkaline Phosphotase            int64
Alamine Aminotransferase        int64
Aspartate Aminotransferase      int64
Total Protiens                float64
Albumin                       float64
Albumin and Globulin Ratio    float64
Target                          int64
dtype: object

* This line displays the data type of each column in the dataset.

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total Bilirubin             583 non-null    float64
 3   Direct Bilirubin            583 non-null    float64
 4   Alkaline Phosphotase        583 non-null    int64  
 5   Alamine Aminotransferase    583 non-null    int64  
 6   Aspartate Aminotransferase  583 non-null    int64  
 7   Total Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin and Globulin Ratio  579 non-null    float64
 10  Target                      583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB


* This line displays the data type and the non-null count of each column. It shows that the Albumin and Globulin Ratio column contains null values (579 non-null entries out of 583).

In [10]:
data.isnull().sum()

Age                           0
Gender                        0
Total Bilirubin               0
Direct Bilirubin              0
Alkaline Phosphotase          0
Alamine Aminotransferase      0
Aspartate Aminotransferase    0
Total Protiens                0
Albumin                       0
Albumin and Globulin Ratio    4
Target                        0
dtype: int64

* This line displays the number of null values in each column. Only the Albumin and Globulin Ratio column has any.

In [11]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,583.0,44.746141,16.189833,4.0,33.0,45.0,58.0,90.0
Total Bilirubin,583.0,3.298799,6.209522,0.4,0.8,1.0,2.6,75.0
Direct Bilirubin,583.0,1.486106,2.808498,0.1,0.2,0.3,1.3,19.7
Alkaline Phosphotase,583.0,290.576329,242.937989,63.0,175.5,208.0,298.0,2110.0
Alamine Aminotransferase,583.0,80.713551,182.620356,10.0,23.0,35.0,60.5,2000.0
Aspartate Aminotransferase,583.0,109.910806,288.918529,10.0,25.0,42.0,87.0,4929.0
Total Protiens,583.0,6.48319,1.085451,2.7,5.8,6.6,7.2,9.6
Albumin,583.0,3.141852,0.795519,0.9,2.6,3.1,3.8,5.5
Albumin and Globulin Ratio,579.0,0.947064,0.319592,0.3,0.7,0.93,1.1,2.8
Target,583.0,1.286449,0.45249,1.0,1.0,1.0,2.0,2.0


* This line displays summary statistics for the numerical columns: count, mean, standard deviation, minimum, the three quartiles (25%, 50%, 75%), and maximum.

In [12]:
data.describe(include='O')

Unnamed: 0,Gender
count,583
unique,2
top,Male
freq,441


* This line displays a summary of the categorical column: total count, number of unique values, most frequent value, and its frequency.

In [13]:
data.isnull().sum()

Age                           0
Gender                        0
Total Bilirubin               0
Direct Bilirubin              0
Alkaline Phosphotase          0
Alamine Aminotransferase      0
Aspartate Aminotransferase    0
Total Protiens                0
Albumin                       0
Albumin and Globulin Ratio    4
Target                        0
dtype: int64

* This line displays the null counts again and confirms that there are 4 null values in the Albumin and Globulin Ratio column.

In [14]:
data['Gender'].value_counts()

Male      441
Female    142
Name: Gender, dtype: int64

* This line displays the distinct values in the Gender column and the count of each.

In [15]:
data.duplicated().sum()

13

* This line displays the number of duplicate rows in the dataset.

In [16]:
data.drop_duplicates(inplace = True)

* This line removes the duplicate rows from the dataset.
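The duplicate check and removal can be sketched on a small frame. Note that `drop_duplicates` keeps the original index labels, so gaps remain where rows were dropped; calling `reset_index(drop=True)` afterwards is optional and is shown here only as one possible follow-up. The `toy` frame is made up for illustration.

```python
import pandas as pd

# Toy frame with one repeated row (rows 1 and 2 are identical).
toy = pd.DataFrame({
    "Age": [65, 62, 62, 58],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Target": [1, 1, 1, 1],
})

# duplicated() marks every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# Drop the repeats, then renumber the index so it runs 0..n-1 without gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```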

In [17]:
data

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.90,1
1,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
4,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
...,...,...,...,...,...,...,...,...,...,...,...
578,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
579,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
580,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
581,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1


## **Domain Analysis**

This dataset is essentially a collection of blood test reports from people who were tested to check liver function. Each report contains the features Age, Gender, Total Bilirubin, Direct Bilirubin, Alkaline Phosphotase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Protiens, Albumin, and Albumin and Globulin Ratio.

Together these can help indicate whether a person has a liver disorder.

**Detailed Description:**

* **Age:** The age of the person who took the blood test.
* **Gender:** The gender of the person who took the test. Only Male and Female appear in the dataset.
* **Total_Bilirubin:** The normal range for total bilirubin is 0.3 mg/dL to 1 mg/dL. In adults it can go up to 1.2 mg/dL; levels above 1.2 mg/dL are considered a warning sign.
* **Direct_Bilirubin:** The form of bilirubin that has been conjugated with glucuronic acid and is excreted in the bile. Normal values of direct bilirubin are 0 to 0.4 mg/dL; levels above 0.4 mg/dL are considered a warning sign.
* **alkaline_phosphotase:** Usually denoted ALP. The normal range is about 44 to 147 IU/L, or 0.73 to 2.45 µkat/L. Levels above the normal range may indicate impaired liver function.
* **Alamine_Aminotransferase:** Usually denoted ALT. It is an enzyme that helps the liver convert food into energy. A normal ALT range is 7 to 55 U/L. Levels above the normal range may indicate liver damage.
* **Aspartate_Aminotransferase:** Usually denoted AST. The normal range for AST varies from laboratory to laboratory; one common reference range is 8 to 33 U/L. Levels above the normal range may indicate liver damage.
* **Total_Protiens:** A low total protein level may point to a liver or kidney problem, or to protein not being digested or absorbed properly. The normal range is 6.0 to 8.3 g/dL.
* **Albumin:** A normal albumin range is 3.4 to 5.4 g/dL. A lower albumin level may indicate malnutrition. It can also mean that the person has liver disease or that the liver is not working properly, which can cause a decrease in albumin levels.
* **Albumin_and_Globulin_Ratio:** Usually the albumin/globulin ratio is between 1.1 and 2.5. A ratio outside this range helps a doctor judge whether the problem is related to the liver, the kidneys, or the intestine.
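The reference ranges listed above can be encoded as a lookup and used to flag out-of-range values. This is only an illustrative sketch: `REFERENCE_RANGES` and `flag_out_of_range` are hypothetical names, the upper bound for total bilirubin uses the adult value of 1.2 mg/dL, and the two sample rows are the first two rows shown by `data.head()`.

```python
import pandas as pd

# (lower, upper) bounds quoted in the description above; units as in the text.
REFERENCE_RANGES = {
    "Total Bilirubin": (0.3, 1.2),
    "Direct Bilirubin": (0.0, 0.4),
    "Alkaline Phosphotase": (44, 147),
    "Alamine Aminotransferase": (7, 55),
    "Aspartate Aminotransferase": (8, 33),
    "Total Protiens": (6.0, 8.3),
    "Albumin": (3.4, 5.4),
    "Albumin and Globulin Ratio": (1.1, 2.5),
}

def flag_out_of_range(frame, ranges=REFERENCE_RANGES):
    """Return a boolean frame that is True where a value is outside its range.

    Missing values are also flagged, because `between` returns False for NaN.
    """
    flags = pd.DataFrame(index=frame.index)
    for column, (low, high) in ranges.items():
        if column in frame.columns:
            flags[column] = ~frame[column].between(low, high)
    return flags

# First two rows of the dataset.
sample = pd.DataFrame({
    "Total Bilirubin": [0.7, 10.9],
    "Direct Bilirubin": [0.1, 5.5],
    "Alkaline Phosphotase": [187, 699],
    "Alamine Aminotransferase": [16, 64],
    "Aspartate Aminotransferase": [18, 100],
    "Total Protiens": [6.8, 7.5],
    "Albumin": [3.3, 3.2],
    "Albumin and Globulin Ratio": [0.9, 0.74],
})

flags = flag_out_of_range(sample)
print(flags.sum(axis=1))  # number of out-of-range measurements per patient
```

A count like this is a descriptive aid only; it does not replace the Target label or a clinical judgement.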

In [18]:
data.isnull().sum()

Age                           0
Gender                        0
Total Bilirubin               0
Direct Bilirubin              0
Alkaline Phosphotase          0
Alamine Aminotransferase      0
Aspartate Aminotransferase    0
Total Protiens                0
Albumin                       0
Albumin and Globulin Ratio    4
Target                        0
dtype: int64

* This line shows that there are 4 missing values in the Albumin and Globulin Ratio column. As this column is approximately normally distributed, we replace the null values with the mean.

In [19]:
print("As data seems to be balanced we will use the mean value to replace the Null Values:" , data['Albumin and Globulin Ratio'].mean())
data.loc[data['Albumin and Globulin Ratio'].isnull()==True,'Albumin and Globulin Ratio']=data['Albumin and Globulin Ratio'].mean()

As data seems to be balanced we will use the mean value to replace the Null Values: 0.9480035335689044
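The same mean imputation can be written with `fillna`, which is equivalent to the `.loc` assignment above. The sketch below uses a short made-up series rather than the real column.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value.
ratio = pd.Series([0.9, 0.74, np.nan, 1.0, 0.4], name="Albumin and Globulin Ratio")

# mean() skips NaN by default, so the fill value is based on observed rows only.
fill_value = ratio.mean()
filled = ratio.fillna(fill_value)
print(fill_value, filled.isnull().sum())
```

For a strongly skewed column the median is a common alternative fill value; whether that applies here depends on the column's actual distribution.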


* **Handling Outliers**

* The box plots show outliers in every column except Age and Albumin, so we check the percentage of outliers in each column. If it is below 5 %, we replace those values; if it is above 5 %, we leave them untouched, as replacing that many values might distort the results.

* **The Aspartate Aminotransferase column has more than 12% outliers, so we do not handle them, as doing so would strongly affect the prediction.**
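The IQR rule and the 5 % threshold described above can be wrapped in one helper so the same check can be repeated per column. This is a sketch under stated assumptions: `iqr_outlier_percent` is a hypothetical helper, and it divides by the non-null count, whereas the cells below divide by `len(data)`.

```python
import pandas as pd

def iqr_outlier_percent(series, k=1.5):
    """Return the IQR fences and the share of values outside them, in percent."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n = series.notna().sum()
    below_pct = (series < lower).sum() / n * 100
    above_pct = (series > upper).sum() / n * 100
    return {"lower": lower, "upper": upper,
            "below_pct": below_pct, "above_pct": above_pct}

# Toy data: nine ordinary values and one extreme value.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
report = iqr_outlier_percent(toy)

# Decision rule from the text: only handle outliers when they are under 5 %.
handle = (report["below_pct"] + report["above_pct"]) < 5
print(report, handle)
```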

In [20]:
Q1_TP = data['Total Protiens'].quantile(0.25)
print("lower quartile: ",Q1_TP)
Q3_TP = data['Total Protiens'].quantile(0.75)
print("upper quartile: ",Q3_TP,"\n")

IQR_TP = Q3_TP - Q1_TP
print("IQR for Total Protiens: ",IQR_TP,"\n")

ll_TP = Q1_TP - (1.5 * IQR_TP)
print('lower limit is',ll_TP)
ul_TP = Q3_TP + (1.5 * IQR_TP)
print('upper limit is',ul_TP,'\n')

print("Data above upper limit, considered outliers: ", len(data.loc[data['Total Protiens'] > ul_TP]) / len(data) * 100,"%")
print("Data below lower limit, considered outliers: ", len(data.loc[data['Total Protiens'] < ll_TP]) / len(data) * 100,"%")

lower quartile:  5.8
upper quartile:  7.2 

IQR for Total Protiens:  1.4000000000000004 

lower limit is 3.6999999999999993
upper limit is 9.3 

Data above upper limit, considered outliers:  0.3508771929824561 %
Data below lower limit, considered outliers:  1.0526315789473684 %


In [21]:
data.loc[data['Total Protiens'] < ll_TP,'Total Protiens'] = data['Total Protiens'].median()
data.loc[data['Total Protiens'] > ul_TP,'Total Protiens'] = data['Total Protiens'].median()

print("Data above upper limit, considered outliers: ", len(data.loc[data['Total Protiens'] > ul_TP]) / len(data) * 100,"%")
print("Data below lower limit, considered outliers: ", len(data.loc[data['Total Protiens'] < ll_TP]) / len(data) * 100,"%")

Data above upper limit, considered outliers:  0.0 %
Data below lower limit, considered outliers:  0.0 %


* **This code replaces the outliers in the Total Protiens column with the median.**
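The replacement step can also be expressed as a reusable function. The sketch below is a hypothetical helper, not the notebook's code: it computes the median once before replacing anything, whereas the cell above recomputes it between the two assignments, so results may differ slightly.

```python
import pandas as pd

def replace_outliers_with_median(series, k=1.5):
    """Replace values outside the IQR fences with the column median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    median = series.median()  # computed once, before any replacement
    # Keep in-range values and missing values; everything else becomes the median.
    keep = series.between(lower, upper) | series.isna()
    return series.where(keep, median)

# Toy values with one low and one high outlier (2.7 and 9.6).
toy = pd.Series([5.8, 6.0, 6.6, 6.8, 7.2, 7.3, 2.7, 9.6, 6.5, 7.0])
cleaned = replace_outliers_with_median(toy)
print(cleaned.tolist())
```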


* **There are no outliers in the Albumin column, so we leave it unchanged.**

In [22]:
Q1_AGR = data['Albumin and Globulin Ratio'].quantile(0.25)
print("lower quartile: ",Q1_AGR)
Q3_AGR = data['Albumin and Globulin Ratio'].quantile(0.75)
print("upper quartile: ",Q3_AGR,"\n")

IQR_AGR = Q3_AGR - Q1_AGR
print("IQR for Albumin and Globulin Ratio: ",IQR_AGR,"\n")

ll_AGR = Q1_AGR - (1.5 * IQR_AGR)
print('lower limit is',ll_AGR)
ul_AGR = Q3_AGR + (1.5 * IQR_AGR)
print('upper limit is',ul_AGR,'\n')

print("Data above upper limit, considered outliers: ", len(data.loc[data['Albumin and Globulin Ratio'] > ul_AGR]) / len(data) * 100,"%")
print("Data below lower limit, considered outliers: ", len(data.loc[data['Albumin and Globulin Ratio'] < ll_AGR]) / len(data) * 100,"%")

lower quartile:  0.7
upper quartile:  1.1 

IQR for Albumin and Globulin Ratio:  0.40000000000000013 

lower limit is 0.09999999999999976
upper limit is 1.7000000000000002 

Data above upper limit, considered outliers:  1.7543859649122806 %
Data below lower limit, considered outliers:  0.0 %


* **Outliers make up less than 2% of the Albumin and Globulin Ratio column, so we handle them.**

In [23]:
data.loc[data['Albumin and Globulin Ratio'] > ul_AGR,'Albumin and Globulin Ratio'] = data['Albumin and Globulin Ratio'].mean()

print("Data above upper limit, considered outliers: ", len(data.loc[data['Albumin and Globulin Ratio'] > ul_AGR]) / len(data) * 100,"%")
print("Data below lower limit, considered outliers: ", len(data.loc[data['Albumin and Globulin Ratio'] < ll_AGR]) / len(data) * 100,"%")

Data above upper limit, considered outliers:  0.0 %
Data below lower limit, considered outliers:  0.0 %


* **This code replaces the outliers in the Albumin and Globulin Ratio column with the mean.**

* **Conversion of Categorical Column to Numerical Column**

In [24]:
data.Gender = data.Gender.replace({'Male':1,'Female':0})

data.head()

Unnamed: 0,Age,Gender,Total Bilirubin,Direct Bilirubin,Alkaline Phosphotase,Alamine Aminotransferase,Aspartate Aminotransferase,Total Protiens,Albumin,Albumin and Globulin Ratio,Target
0,65,0,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,1,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,1,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,1,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,1,3.9,2.0,195,27,59,7.3,2.4,0.4,1


* This line converts the categorical Gender column to a numerical one (Male = 1, Female = 0).

* **Renaming columns to shorter names for easier calculations**

In [25]:
data.rename(columns={'Gender':'Sex' , 'Total Bilirubin':'TB' , 'Direct Bilirubin':'DB' , 'Alkaline Phosphotase' : 'ALP' , 'Alamine Aminotransferase':'ALT' ,
                'Aspartate Aminotransferase':'AST' , 'Total Protiens':'TP' , 'Albumin':'ALB' , 'Albumin and Globulin Ratio':'ABR'},inplace=True)


data.head(5)

Unnamed: 0,Age,Sex,TB,DB,ALP,ALT,AST,TP,ALB,ABR,Target
0,65,0,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,1,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,1,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,1,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,1,3.9,2.0,195,27,59,7.3,2.4,0.4,1


* This line shortens the column names for easier calculations.

## **Input Variables**

* 1) Age
* 2) Gender (Sex)
* 3) Total Bilirubin (TB)
* 4) Direct Bilirubin (DB)
* 5) Alkaline Phosphotase (ALP)
* 6) Alamine Aminotransferase (ALT)
* 7) Aspartate Aminotransferase (AST)
* 8) Total Protiens (TP)
* 9) Albumin (ALB)
* 10) Albumin and Globulin Ratio (ABR)

## **Output Variables**

* Target

## **Feature Selection**

* The correlation graph and chart show a high correlation between **Direct Bilirubin and Total Bilirubin**, between **Alamine Aminotransferase and Aspartate Aminotransferase**, between **Total Protiens and Albumin**, and between **Albumin and Albumin and Globulin Ratio**.


* Only **Direct Bilirubin and Total Bilirubin** are extremely highly correlated, so we can drop either one of the two columns for modelling.
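One way to list highly correlated pairs programmatically is sketched below on synthetic data. Everything here is an assumption for illustration: the `toy` frame is random data in which `DB` is built to track `TB`, `highly_correlated_pairs` is a hypothetical helper, and the 0.8 threshold is arbitrary rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tb = rng.gamma(shape=1.0, scale=3.0, size=200)
toy = pd.DataFrame({
    "TB": tb,
    "DB": 0.45 * tb + rng.normal(0, 0.1, size=200),  # constructed to follow TB closely
    "Age": rng.integers(4, 90, size=200),             # unrelated to the other two
})

def highly_correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, |r|) for each column pair with |r| >= threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data the same function could be called on the numeric columns, and one column of each reported pair considered for removal.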

## **Scaling of Data for easy Model Building** (Min Max Scaler)

In [26]:
# Min max scaler
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler()
df = ['Sex','Target']
data1 = scaling.fit_transform(data.drop(df, axis = 1))
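The cell above fits the scaler on the whole dataset before the train/test split. A commonly used alternative, sketched here on synthetic data, is to split first and fit `MinMaxScaler` on the training rows only, so that the test rows do not influence the learned minimum and maximum. The arrays `X` and `y` are random stand-ins, not the liver data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(50, 10, size=(100, 3))  # synthetic stand-in for the numeric columns
y = rng.integers(1, 3, size=100)       # labels 1 and 2, as in Target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=15
)

scaler = MinMaxScaler()
# Learn each column's min and max from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those values for the test rows; results may fall slightly outside [0, 1].
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.min(), X_train_scaled.max())
```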

In [27]:
data.columns

Index(['Age', 'Sex', 'TB', 'DB', 'ALP', 'ALT', 'AST', 'TP', 'ALB', 'ABR',
       'Target'],
      dtype='object')

In [28]:
data2=pd.DataFrame(data1,columns=['Age', 'TB', 'DB', 'ALP', 'ALT', 'AST', 'TP', 'ALB', 'ABR'])
data2

Unnamed: 0,Age,TB,DB,ALP,ALT,AST,TP,ALB,ABR
0,0.709302,0.004021,0.000000,0.060576,0.003015,0.001626,0.563636,0.521739,0.428571
1,0.674419,0.140751,0.275510,0.310699,0.027136,0.018296,0.690909,0.500000,0.314286
2,0.674419,0.092493,0.204082,0.208598,0.025126,0.011791,0.600000,0.521739,0.421429
3,0.627907,0.008043,0.015306,0.058134,0.002010,0.002033,0.563636,0.543478,0.500000
4,0.790698,0.046917,0.096939,0.064485,0.008543,0.009961,0.654545,0.326087,0.071429
...,...,...,...,...,...,...,...,...,...
565,0.651163,0.001340,0.000000,0.213483,0.005025,0.004879,0.400000,0.152174,0.050000
566,0.418605,0.002681,0.000000,0.017098,0.012563,0.004269,0.418182,0.500000,0.571429
567,0.558140,0.005362,0.005102,0.088911,0.019095,0.007928,0.490909,0.500000,0.500000
568,0.313953,0.012064,0.020408,0.059111,0.009548,0.004472,0.563636,0.543478,0.500000


In [29]:
df1 = data[['Sex','Target']]
df1

Unnamed: 0,Sex,Target
0,0,1
1,1,1
2,1,1
3,1,1
4,1,1
...,...,...
578,1,2
579,1,1
580,1,1
581,1,1


In [30]:
data2 = data2.reset_index(drop = True)
df1 = df1.reset_index(drop = True)

In [31]:
data = pd.concat([data2, df1], axis = 1)
data

Unnamed: 0,Age,TB,DB,ALP,ALT,AST,TP,ALB,ABR,Sex,Target
0,0.709302,0.004021,0.000000,0.060576,0.003015,0.001626,0.563636,0.521739,0.428571,0,1
1,0.674419,0.140751,0.275510,0.310699,0.027136,0.018296,0.690909,0.500000,0.314286,1,1
2,0.674419,0.092493,0.204082,0.208598,0.025126,0.011791,0.600000,0.521739,0.421429,1,1
3,0.627907,0.008043,0.015306,0.058134,0.002010,0.002033,0.563636,0.543478,0.500000,1,1
4,0.790698,0.046917,0.096939,0.064485,0.008543,0.009961,0.654545,0.326087,0.071429,1,1
...,...,...,...,...,...,...,...,...,...,...,...
565,0.651163,0.001340,0.000000,0.213483,0.005025,0.004879,0.400000,0.152174,0.050000,1,2
566,0.418605,0.002681,0.000000,0.017098,0.012563,0.004269,0.418182,0.500000,0.571429,1,1
567,0.558140,0.005362,0.005102,0.088911,0.019095,0.007928,0.490909,0.500000,0.500000,1,1
568,0.313953,0.012064,0.020408,0.059111,0.009548,0.004472,0.563636,0.543478,0.500000,1,1


## **Splitting Data Into X & Y**

In [33]:
x = data.drop(['Target','DB'], axis = 1)
y = data['Target']

## **Splitting data into Train and Test sets for evaluation**

In [34]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report

x_train,x_test,y_train,y_test = train_test_split(x, y, test_size = 0.30, random_state = 15)

In [35]:
print('Shape of input train data is:', x_train.shape,'\n','Shape of input test data is:', x_test.shape,'\n')
print('Shape of output train data is:', y_train.shape,'\n','Shape of output test data is:', y_test.shape,'\n')

Shape of input train data is: (399, 9) 
 Shape of input test data is: (171, 9) 

Shape of output train data is: (399,) 
 Shape of output test data is: (171,) 
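Because the two classes are not equally frequent, a stratified split is one option for keeping the class proportions similar in the training and test sets. The sketch below uses a toy label vector with a 70/30 class mix; it illustrates the `stratify` argument of `train_test_split` and is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 70 rows of class 1 and 30 rows of class 2.
y = np.array([1] * 70 + [2] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=15, stratify=y
)

# Share of class 2 in each part; stratification keeps both close to 0.30.
share_train = (y_tr == 2).mean()
share_test = (y_te == 2).mean()
print(share_train, share_test)
```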



In [36]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(x_train,y_train)
y_pred=LR.predict(x_test)

## **Model Building - Bagging**

In [37]:
from sklearn.ensemble import BaggingClassifier
# `estimator` replaced `base_estimator` in scikit-learn 1.2; use `base_estimator` on older versions
model_bag = BaggingClassifier(estimator = LR, n_estimators = 200)
model_bag.fit(x_train,y_train)

## **Model Prediction - Bagging**

In [38]:
Pred_BAG = model_bag.predict(x_test)

## **Model Evaluation - Bagging**

In [39]:
print(classification_report(y_test, Pred_BAG))

print("Accuracy Score for Bagging Model is:", accuracy_score(y_test,Pred_BAG))

              precision    recall  f1-score   support

           1       0.65      1.00      0.79       110
           2       1.00      0.02      0.03        61

    accuracy                           0.65       171
   macro avg       0.82      0.51      0.41       171
weighted avg       0.77      0.65      0.52       171

Accuracy Score for Bagging Model is: 0.6491228070175439
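The report above can be read as a confusion matrix. With supports of 110 and 61, a class-1 recall of 1.00, a class-2 precision of 1.00, and a class-2 recall of 0.02, the counts that fit the rounded figures are: all 110 class-1 rows predicted as 1, and only 1 of the 61 class-2 rows predicted as 2. The sketch below rebuilds label vectors with those inferred counts (they are not the model's actual predictions row by row) and recomputes the metrics.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Label vectors with the counts inferred from the report above.
y_true = np.array([1] * 110 + [2] * 61)
y_pred = np.array([1] * 110 + [1] * 60 + [2] * 1)

# Rows are true classes, columns are predicted classes, in the order [1, 2].
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])
acc = accuracy_score(y_true, y_pred)
recall_2 = recall_score(y_true, y_pred, pos_label=2)
print(cm)
print(acc, recall_2)
```

Read this way, the accuracy comes almost entirely from predicting class 1 for nearly every row, so accuracy alone says little about how well class 2 is detected.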


## **Model Building - Gradient Boosting**

In [40]:
# Model building
from sklearn.ensemble import GradientBoostingClassifier
model_GB = GradientBoostingClassifier(n_estimators = 120)
model_GB.fit(x_train,y_train)

## **Model Prediction - Gradient Boosting**

In [41]:
Pred_GB = model_GB.predict(x_test)

## **Model Evaluation - Gradient Boosting**

In [42]:
print(classification_report(y_test, Pred_GB))

print("Accuracy Score for Gradient Boosting Model (Without Hyperparameter Tuning) is:", accuracy_score(y_test,Pred_GB))

              precision    recall  f1-score   support

           1       0.69      0.84      0.75       110
           2       0.51      0.31      0.39        61

    accuracy                           0.65       171
   macro avg       0.60      0.57      0.57       171
weighted avg       0.62      0.65      0.62       171

Accuracy Score for Gradient Boosting Model (Without Hyperparameter Tuning) is: 0.6491228070175439


## **XG Boost Installation**

In [43]:
!pip install xgboost



In [44]:
y_train_mapped = [0 if label == 1 else 1 for label in y_train]
y_test_mapped = [0 if label == 1 else 1 for label in y_test]

## **Model Building - XG Boost**

In [45]:
from xgboost import XGBClassifier
model_XGBP=XGBClassifier()
model_XGBP.fit(x_train,y_train_mapped)

## **Model Prediction - XG Boost**

In [46]:
Pred_XGBP = model_XGBP.predict(x_test)

## **Model Evaluation - XG Boost**

In [47]:
print(classification_report(y_test_mapped, Pred_XGBP))

print("Accuracy Score for XG Boost Model (Without Hyperparameter Tuning) is:", accuracy_score(y_test_mapped,Pred_XGBP))

              precision    recall  f1-score   support

           0       0.71      0.85      0.78       110
           1       0.59      0.38      0.46        61

    accuracy                           0.68       171
   macro avg       0.65      0.62      0.62       171
weighted avg       0.67      0.68      0.66       171

Accuracy Score for XG Boost Model (Without Hyperparameter Tuning) is: 0.6842105263157895


## **Hyperparameter Tuning**

In [48]:
gamma = np.linspace(1,300, num = 19).astype(int)
learning_rate = np.linspace(0,1, num = 10)
max_depth = np.linspace(1,20, num = 13).astype(int)
n_estimator = np.linspace(50,300, num = 11).astype(int)
reg_alpha = np.linspace(1,250, num = 13).astype(int)
reg_lambda = np.linspace(1,250, num = 13).astype(int)

param_grid = {'gamma': gamma,'learning_rate': learning_rate, 'max_depth': max_depth,
              'n_estimators': n_estimator,'reg_alpha': reg_alpha,'reg_lambda': reg_lambda}
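One detail of this grid is worth checking: `np.linspace(0, 1, num = 10)` includes 0.0 as a candidate learning rate. The learning rate scales each tree's contribution, so a value of 0 means the boosting rounds add nothing. The sketch below shows the issue and one hypothetical alternative, a log-spaced grid of strictly positive values; the bounds 0.01 and 1.0 are an assumption, not a recommendation from the notebook.

```python
import numpy as np

# The grid used above: its first candidate is exactly 0.0.
original_grid = np.linspace(0, 1, num=10)

# Hypothetical alternative: ten log-spaced values from 0.01 to 1.0, all > 0.
positive_grid = np.logspace(-2, 0, num=10)

print(original_grid[0], positive_grid.min())
```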


In [49]:
from sklearn.model_selection import RandomizedSearchCV
modelXGB = XGBClassifier(random_state = 42,verbosity = 0)
rcv = RandomizedSearchCV(estimator = modelXGB, scoring = 'f1',param_distributions = param_grid,
                         n_iter = 100, cv = 10, verbose = 2, random_state = 42, n_jobs = -1)

rcv.fit(x_train, y_train_mapped)
cv_best_params = rcv.best_params_
print(f"Best parameters: {cv_best_params}")

Fitting 10 folds for each of 100 candidates, totalling 1000 fits
Best parameters: {'reg_lambda': 1, 'reg_alpha': 63, 'n_estimators': 200, 'max_depth': 1, 'learning_rate': 0.0, 'gamma': 117}


## **Model Building - XG Boost (After Tuning)**

In [50]:
XGB2 = XGBClassifier(reg_lambda = 1, reg_alpha = 63, n_estimators = 200,
                     max_depth = 1, learning_rate = 0.0, gamma = 117)
XGB2.fit(x_train, y_train_mapped)

## **Model Prediction - XG Boost (After Tuning)**

In [51]:
Pred_XGBP2 = XGB2.predict(x_test)

## **Model Evaluation - XG Boost (After Tuning)**

In [52]:
print(classification_report(y_test_mapped, Pred_XGBP2))

print("Accuracy Score for XG Boost Model (with Hyperparameter Tuning) is:", accuracy_score(y_test_mapped,Pred_XGBP2))

              precision    recall  f1-score   support

           0       0.64      1.00      0.78       110
           1       0.00      0.00      0.00        61

    accuracy                           0.64       171
   macro avg       0.32      0.50      0.39       171
weighted avg       0.41      0.64      0.50       171

Accuracy Score for XG Boost Model (with Hyperparameter Tuning) is: 0.6432748538011696
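The tuned model predicts class 0 for every test row (recall 1.00 for class 0 and 0.00 for class 1), which is consistent with the selected `learning_rate` of 0.0. Its accuracy therefore equals that of a majority-class baseline. The sketch below reproduces that baseline with scikit-learn's `DummyClassifier` on labels that have the same class counts as the test set; `y_test_like` and `X_dummy` are illustrative stand-ins, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Labels with the same class counts as the test set (110 of class 0, 61 of class 1).
y_test_like = np.array([0] * 110 + [1] * 61)
X_dummy = np.zeros((len(y_test_like), 1))  # features are ignored by the baseline

# Always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_test_like)
pred = baseline.predict(X_dummy)

acc = accuracy_score(y_test_like, pred)
f1 = f1_score(y_test_like, pred, zero_division=0)  # F1 for class 1
print(acc, f1)
```

A baseline like this gives a floor for comparison: a model is only informative about the minority class if it beats the baseline on metrics such as recall or F1, not just on accuracy.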


## **Model Building - ANN**

In [54]:
from sklearn.neural_network import MLPClassifier
model_ANN = MLPClassifier( hidden_layer_sizes = (50,3), learning_rate_init = 0.1,
                      max_iter = 100, random_state = 2)
model_ANN.fit(x_train,y_train)

## **Model Prediction - ANN**

In [62]:
Pred_proba = model_ANN.predict_proba(x_test)
Pred_ANN = model_ANN.predict(x_test)

## **Model Evaluation - ANN**

In [65]:
print(classification_report(y_test, Pred_ANN))

print("Accuracy Score for ANN Model is:", accuracy_score(y_test,Pred_ANN))

              precision    recall  f1-score   support

           1       0.65      0.98      0.78       110
           2       0.60      0.05      0.09        61

    accuracy                           0.65       171
   macro avg       0.63      0.52      0.44       171
weighted avg       0.63      0.65      0.54       171

Accuracy Score for ANN Model is: 0.6491228070175439
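To compare the models side by side, the figures reported in the evaluation cells can be collected into one table. The values below are copied from the classification reports above (accuracy rounded to three decimals, recall for the minority class, which is labelled 2 or, after mapping, 1); the column names are illustrative.

```python
import pandas as pd

# (model, accuracy, recall of the minority class), taken from the reports above.
results = pd.DataFrame(
    [
        ("Bagging (logistic regression)", 0.649, 0.02),
        ("Gradient Boosting", 0.649, 0.31),
        ("XG Boost", 0.684, 0.38),
        ("XG Boost (tuned)", 0.643, 0.00),
        ("ANN (MLPClassifier)", 0.649, 0.05),
    ],
    columns=["model", "accuracy", "minority_recall"],
)

# Sort by minority-class recall, since accuracy is nearly identical across models.
ranked = results.sort_values("minority_recall", ascending=False).reset_index(drop=True)
print(ranked)
```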
