# Predicting 

## Introduction

### Authored by:
#### Team Name : UPMOST

Team Members: Sudhir, Ranjith, Srikar, Abbas, Shiva Ram, Neha.
### Description of the analysis

In this project, we are using a dataset containing Liver Patient from UCI's repository.

Our prediction task is to determine whether a person needs to be diagnosed for Cirrhosis Based on chemical compounds(bilirubin, albumin, proteins, alkaline phosphatase) present in human body.
We are using the input variables that include Age, Gender,TB(Total Bilirubin),DB(Direct Bilirubin),Alkphos(Alkaline Phosphatase),Sgpt(Alamine Aminotransferase),Sgot(Aspartate Aminotransferase) 
TP(Total Proteins),ALB(Albumin),A/G(Ratio Albumin)and Globulin Ratio Education.

The important Factor here is recall.

To conduct our analysis, we will utilize a set of Machine Learning Modules(k-nn, Descision Tree, RandomForest,XGBoost,Neural network, AdaBoost, and gradient boost).

## Preliminary (Business) Problem Scoping

We are developing a binary classifier to identify if a given person in the datasetneed to be diagnosed for Cirrhosis or not. Our positive case will therefore be Class1(Need's to be diagnosed) and Class0 (Does not need to be diagnosed) will be our negative case.

We will be trying out different models and check if we can develop a model that has sufficient predictive power to accurately 

## Step 1 - Importing the required packages

In [1]:
# importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, plot_roc_curve, roc_auc_score, roc_curve, auc, RocCurveDisplay, PrecisionRecallDisplay, precision_recall_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV



## Step 2: Load, clean and prepare data


### Step 2.1-Loading the data from data source

In [2]:
patient_data = pd.read_csv("indian_liver_patient (1).csv")

### Step 2.2- Data Exploration

In [3]:
# Explore the dataset
# read the first row of the dataset 
print(patient_data.columns)
print(patient_data.describe())
print(patient_data.info())

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Result_data'],
      dtype='object')
              Age  Total_Bilirubin  Direct_Bilirubin  Alkaline_Phosphotase  \
count  579.000000       579.000000        579.000000            579.000000   
mean    44.782383         3.315371          1.494128            291.366149   
std     16.221786         6.227716          2.816499            243.561863   
min      4.000000         0.400000          0.100000             63.000000   
25%     33.000000         0.800000          0.200000            175.500000   
50%     45.000000         1.000000          0.300000            208.000000   
75%     58.000000         2.600000          1.300000            298.000000   
max     90.000000        75.000000         19.700000           2110.000000   

       Alamine_Aminotransferase  A

### 2.3 Clean/transform data (where necessary)
Cleaning up column names

In [4]:
patient_data.columns = [s.strip() for s in patient_data.columns] 
patient_data.columns

Index(['Age', 'Gender', 'Total_Bilirubin', 'Direct_Bilirubin',
       'Alkaline_Phosphotase', 'Alamine_Aminotransferase',
       'Aspartate_Aminotransferase', 'Total_Protiens', 'Albumin',
       'Albumin_and_Globulin_Ratio', 'Result_data'],
      dtype='object')

In [5]:
# Checking for null values
patient_data.isnull().sum()

Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Result_data                   0
dtype: int64

#### Transforming Gender column

In [6]:
# Categorizing Gender column
patient_data['Gender'] = patient_data['Gender'].astype('category')
patient_data.dtypes

Age                              int64
Gender                        category
Total_Bilirubin                float64
Direct_Bilirubin               float64
Alkaline_Phosphotase             int64
Alamine_Aminotransferase         int64
Aspartate_Aminotransferase       int64
Total_Protiens                 float64
Albumin                        float64
Albumin_and_Globulin_Ratio     float64
Result_data                      int64
dtype: object

We will encode this data using OrdinalEncoder for the Gender column

In [7]:
# Encode Gender column
le =LabelEncoder()
patient_data['Gender'] = le.fit_transform(patient_data['Gender'])
patient_data.head()

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio,Result_data
0,65,0,0.7,0.1,187,16,18,6.8,3.3,0.9,1
1,62,1,10.9,5.5,699,64,100,7.5,3.2,0.74,1
2,62,1,7.3,4.1,490,60,68,7.0,3.3,0.89,1
3,58,1,1.0,0.4,182,14,20,6.8,3.4,1.0,1
4,72,1,3.9,2.0,195,27,59,7.3,2.4,0.4,1


### 2.4 Checking distribution of classes


In [8]:
patient_data['Result_data'].value_counts()

1    414
0    165
Name: Result_data, dtype: int64

In [9]:
X = patient_data.drop('Result_data', axis = 1).copy()

In [10]:
y = patient_data['Result_data'].copy()

In [11]:
X

Unnamed: 0,Age,Gender,Total_Bilirubin,Direct_Bilirubin,Alkaline_Phosphotase,Alamine_Aminotransferase,Aspartate_Aminotransferase,Total_Protiens,Albumin,Albumin_and_Globulin_Ratio
0,65,0,0.7,0.1,187,16,18,6.8,3.3,0.90
1,62,1,10.9,5.5,699,64,100,7.5,3.2,0.74
2,62,1,7.3,4.1,490,60,68,7.0,3.3,0.89
3,58,1,1.0,0.4,182,14,20,6.8,3.4,1.00
4,72,1,3.9,2.0,195,27,59,7.3,2.4,0.40
...,...,...,...,...,...,...,...,...,...,...
574,60,1,0.5,0.1,500,20,34,5.9,1.6,0.37
575,40,1,0.6,0.1,98,35,31,6.0,3.2,1.10
576,52,1,0.8,0.2,245,48,49,6.4,3.2,1.00
577,31,1,1.3,0.5,184,29,32,6.8,3.4,1.00


In [12]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Result_data, dtype: int64

## Step 3 Split data intro training and validation sets


#### Create the training set and the test set with a 70/30 split.
We've decided to utilize a training/test split of the data at 70% training and 30% testing. This percentage split ratio is inline with common practice for small to medium sized datasets, which this data represents. Moreover, we have decided not to do a three way data split, as we are only testing two models and we wish to allocated as much data as possible to training and validation steps.

In [13]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)

## DecisionTree

In [14]:
dtree=DecisionTreeClassifier(random_state=1)
_ = dtree.fit(patient_data.drop(columns=['Result_data']), patient_data['Result_data'])

In [15]:
dtree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=1)

In [16]:
# Criterion used to guide data splits
criterion = ['gini', 'entropy']

# Maximum number of levels in tree.
# default = None
max_depth = [int(x) for x in np.linspace(1, 400, 50)]
max_depth.append(None)

# Minimum number of samples required to split a node
# default is 2
min_samples_split = [int(x) for x in np.linspace(2, 500, 50)]

# Minimum number of samples required at each leaf node
# default = 1 
min_samples_leaf = [int(x) for x in np.linspace(1, 1000, 50)]

# max_leaf_nodes  - Grow trees with max_leaf_nodes in best-first fashion.
# If None then unlimited number of leaf nodes.
# default=None 
max_leaf_nodes = [int(x) for x in np.linspace(2, len(y_test), 50)]
max_leaf_nodes.append(None)

# min_impurity_decrease - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
# default=0.0
min_impurity_decrease = [x for x in np.arange(0.0, 0.01, 0.0001).round(5)]

# Create the random grid
param_grid_random = { 'criterion': criterion,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf' : min_samples_leaf,
                      'max_leaf_nodes' : max_leaf_nodes,
                      'min_impurity_decrease' : min_impurity_decrease,
                     }

In [17]:
random_seed = 1
np.random.seed(random_seed)

In [18]:


dtree_default = DecisionTreeClassifier(random_state=random_seed)

best_random_search_model = RandomizedSearchCV(
        estimator=DecisionTreeClassifier(random_state=random_seed), 
        scoring='recall', 
        param_distributions=param_grid_random, 
        n_iter = 2, 
        cv=10, 
        verbose=0, 
        n_jobs = -1
    )
_ = best_random_search_model.fit(X_train.values, y_train)

In [19]:
dtree_default = DecisionTreeClassifier(random_state=random_seed)

best_random_search_model = RandomizedSearchCV(
        estimator=DecisionTreeClassifier(random_state=random_seed), 
        scoring='recall', 
        param_distributions=param_grid_random, 
        n_iter = 2, 
        cv=10, 
        verbose=0, 
        n_jobs = -1
    )
_ = best_random_search_model.fit(X_train.values, y_train)

random_search_best_params = best_random_search_model.best_params_
print('Best parameters found: ', random_search_best_params)

print("Best Recall score is {}".format(best_random_search_model.best_score_))

Best parameters found:  {'min_samples_split': 134, 'min_samples_leaf': 123, 'min_impurity_decrease': 0.0016, 'max_leaf_nodes': 30, 'max_depth': 351, 'criterion': 'gini'}
Best Recall score is 1.0


In [20]:
plus_minus = 8 # change this to 10-15 when doing a final run. this current value is for testing
increment = 2

param_grid = { 'min_samples_split': [x for x in range(random_search_best_params['min_samples_split']-plus_minus, random_search_best_params['min_samples_split']+plus_minus,2) if x >= 2],       
              'min_samples_leaf': [x for x in range(random_search_best_params['min_samples_leaf']-plus_minus , random_search_best_params['min_samples_leaf']+plus_minus,2) if x > 0],
              'min_impurity_decrease': [x for x in np.arange(random_search_best_params['min_impurity_decrease']-0.001, random_search_best_params['min_impurity_decrease']+0.001,.0001).round(5) if x >= 0.000],
              'max_leaf_nodes':[x for x in range(random_search_best_params['max_leaf_nodes']-plus_minus , random_search_best_params['max_leaf_nodes']+plus_minus, 2) if x > 1],  
              'max_depth': [x for x in range(random_search_best_params['max_depth']-plus_minus , random_search_best_params['max_depth']+plus_minus, 2) if x > 1],
              'criterion': [random_search_best_params['criterion']]
              }
best_grid_search_model = GridSearchCV(estimator=DecisionTreeClassifier(random_state=random_seed), 
                                    scoring='recall', param_grid=param_grid, cv=10, verbose=0,  n_jobs = -1)
best_grid_search_dtree_model = best_grid_search_model.fit(X_train, y_train)

print('Best parameters found: ', best_grid_search_dtree_model.best_params_)

print("Best Recall score is {}".format(best_random_search_model.best_score_))

Best parameters found:  {'criterion': 'gini', 'max_depth': 343, 'max_leaf_nodes': 22, 'min_impurity_decrease': 0.0006, 'min_samples_leaf': 119, 'min_samples_split': 126}
Best Recall score is 1.0


## RandomForest

In [21]:
criterion = ['gini', 'entropy']
max_depth = [int(x) for x in np.linspace(1, 500, 50)]
min_samples_split = [int(x) for x in np.linspace(2, 500, 50)]
min_samples_leaf = [int(x) for x in np.linspace(1, 100, 50)]
max_leaf_nodes = [int(x) for x in np.linspace(2, len(y_test), 50)]
min_impurity_decrease = [x for x in np.arange(0.0, 0.01, 0.0001).round(5)]
param_grid_random = { 
                      'criterion': criterion,
                      'max_depth': max_depth,
                      'min_samples_split': min_samples_split,
                      'min_samples_leaf' : min_samples_leaf,
                      'max_leaf_nodes' : max_leaf_nodes,
                      'min_impurity_decrease' : min_impurity_decrease,
                     }

In [22]:
random_seed=11
randomtree_default = RandomForestClassifier(random_state=random_seed)
# change n_iter to 200_000 for full run
best_random_search_model = RandomizedSearchCV(
        estimator=RandomForestClassifier(random_state=random_seed), 
        scoring='recall', 
        param_distributions=param_grid_random, 
        n_iter = 5_000, 
        cv=10, 
        verbose=0, 
        n_jobs = -1,
        random_state=random_seed
    )
best_random_search_rtree_model = best_random_search_model.fit(X_train, y_train)

In [23]:
random_search_best_rtree_params = best_random_search_model.best_params_
print('Best parameters found: ', random_search_best_rtree_params)

Best parameters found:  {'min_samples_split': 418, 'min_samples_leaf': 97, 'min_impurity_decrease': 0.0, 'max_leaf_nodes': 166, 'max_depth': 113, 'criterion': 'entropy'}


In [24]:
y_pred = best_random_search_rtree_model.predict(X_test)

print("************************************")
print(f"{'Recall Score:':18}{recall_score(y_test, y_pred,average='weighted')}")
print("************************************")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred,average='weighted')}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred,average='weighted')}")
print("************************************")

************************************
Recall Score:     0.7413793103448276
************************************
Accuracy Score:   0.7413793103448276
Precision Score:  0.5496432818073722
F1 Score:         0.6312734721748037
************************************


  _warn_prf(average, modifier, msg_start, len(result))
