<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:35%"><img src='https://dl.dropbox.com/s/qtzukmzqavebjd2/icon_smu.jpg' style="width: 300px; height: 90px; "></th>
    <th style="text-align:center;"><font size="4"> <br/>IS.215 - Analytics in Python Practical 1</font></th>
    </tr>
</table> 

Practical Test

The data "liver_patient.csv" is adapted from the Indian Liver Patient dataset (https://www.mldata.io/dataset-details/indian_liver_patient/). It is a binary (2-class) classification problem: predicting whether the patient has liver disease based on various readings and values. There are 583 observations with 9 input variables and 1 output/target variable. The column names are 'age', 't_bilirubin', 'd_bilirubin', 'akl_ph', 'ala_at', 'asp_at', 't_protein', 'albumin', 'alb_glo', 'target' and the details are as follows:

- age: Age of patient.
- t_bilirubin: Total reading of Bilirubin.
- d_bilirubin: Direct reading of Bilirubin.
- akl_ph: Alkaline Phosphatase reading.
- ala_at: Alanine Aminotransferase reading.
- asp_at: Aspartate Aminotransferase reading.
- t_protein: Total amount of proteins.
- albumin: Albumin reading.
- alb_glo: Ratio of Albumin and Globulin reading.
- target: Target variable (1 -'yes, a liver patient' or 2-'no, not a liver patient').

**Step 1: import relevant libraries**

In [1]:
#Step 1: import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import MinMaxScaler

**Step 2: read in the data**

After importing the libraries, the source file 'liver_patient.csv' is read into a dataframe (df) using read_csv from pandas. The names argument assigns the corresponding name to each dataframe column for ease of reference.

The preliminary analysis includes checking the dimensions of the imported dataframe (df.shape) and making sure that it has 583 rows and 10 columns (including the target column). df.describe() is used to check the summary statistics of each individual column.

In [2]:
#Step 2: read in the data and do a preliminary analysis

df = pd.read_csv("liver_patient.csv", names=['age', 't_bilirubin', 'd_bilirubin', 'akl_ph', 'ala_at', 'asp_at', 't_protein', 'albumin', 'alb_glo', 'target'])
print(df.shape)
print(df.describe())

(583, 10)
              age  t_bilirubin  d_bilirubin       akl_ph       ala_at  \
count  583.000000   583.000000   583.000000   583.000000   583.000000   
mean    44.746141     3.298799     1.486106   290.576329    80.713551   
std     16.189833     6.209522     2.808498   242.937989   182.620356   
min      4.000000     0.400000     0.100000    63.000000    10.000000   
25%     33.000000     0.800000     0.200000   175.500000    23.000000   
50%     45.000000     1.000000     0.300000   208.000000    35.000000   
75%     58.000000     2.600000     1.300000   298.000000    60.500000   
max     90.000000    75.000000    19.700000  2110.000000  2000.000000   

            asp_at   t_protein     albumin       alb_glo      target  
count   583.000000  583.000000  583.000000     583.00000  583.000000  
mean    109.910806    6.483190    3.141852    -685.16578    1.286449  
std     288.918529    1.085451    0.795519    8261.85600    0.452490  
min      10.000000    2.700000    0.900000 -1000

**Step 3: create input and target data**

Since the data is read as a whole into df, there is a need to separate the input variables (the diagnostic values) from the target value. The target value is the label of the input values, indicating whether the person is a liver patient or not.

input_df = df.drop('target', axis=1) creates input_df, which contains only the first 9 columns of diagnostic information, with the last column ('target') removed. It doesn't make sense to include 'target' in the data used to train the model since it is the value we want to learn and predict.

target = df['target'] creates a Series that contains only the target value - '1' or '2'.

In [3]:
#Step 3: Split into input_df and target dataframe. axis=0, row, axis=1, column.
input_df = df.drop('target', axis=1)
target = df['target']

print(input_df.shape,target.shape)

(583, 9) (583,)


In [4]:
#distribution of class - imbalanced classes, with '2' having the lesser count
target.value_counts()

1    416
2    167
Name: target, dtype: int64
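The counts above can also be read as proportions, which gives a reference point for any accuracy figure reported later. The sketch below rebuilds a synthetic label column with the same class counts (the name `target_demo` is an assumption for illustration, not part of the practical) and computes the accuracy of a trivial model that always predicts the majority class.

```python
import pandas as pd

# Synthetic label column with the same class counts as the output above
# (416 rows of class 1, 167 rows of class 2); not read from liver_patient.csv.
target_demo = pd.Series([1] * 416 + [2] * 167, name="target")

# normalize=True returns proportions instead of raw counts
proportions = target_demo.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = proportions.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Because about 71% of the rows belong to class '1', a classifier needs to beat roughly 0.71 accuracy before it is doing better than always answering '1'.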

**Step 4: split data into training and testing data**

In order to train a classifier (a classification model), there is a need to split the input data and the corresponding target values into two parts - training and testing data. The purpose of having two sets of data is to evaluate the model built using the training data. In other words, we want to judge how good the model is by comparing its predictions on the testing inputs with the actual testing targets. Scikit-learn provides a useful function, train_test_split, that separates the data into two sets according to a proportion; for example, test_size=0.3 means splitting into 70% training and 30% testing.

It is common to use 'X' to denote input values and 'y' to denote the target value in scikit-learn. As a result, X_train and y_train are the input values and corresponding target values of the training data. Similarly, X_test and y_test are those of the testing data.

Question 1. Split data into 70% training and 30% testing data and keeping random state as 7. How many records (rows) are found in the newly created training data set?



In [5]:
#Step 4: Split feature and label sets into train and test sets - 70-30, random_state is desirable for reproducibility, stratify - same class proportion as input data

X_train, X_test, y_train, y_test = train_test_split(input_df, target, test_size = 0.3, random_state = 7, stratify = target)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(408, 9) (175, 9) (408,) (175,)
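The effect of stratify can be checked with a small sketch. It assumes synthetic stand-in data (`X_demo`, `y_demo` are illustrative names) with the same number of rows and the same class counts as the notebook, and compares the share of class '2' before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 583 rows, same class counts as the notebook
X_demo = np.arange(583).reshape(-1, 1)
y_demo = np.array([1] * 416 + [2] * 167)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=7, stratify=y_demo
)

# Row count and share of class 2 overall, in the training split and in the testing split
for name, labels in [("all", y_demo), ("train", y_tr), ("test", y_te)]:
    print(name, len(labels), round(float(np.mean(labels == 2)), 3))
```

With stratification, the three shares should be nearly identical, so the testing set has the same class mix the model was trained on.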


Since the input variables are on different scales (see df.describe()), there is a need to normalize the values to between 0 and 1 so that the range of a variable won't influence the model. It is important to fit the scaler on the training inputs (X_train) only and then apply the same transformation to both the training inputs and the testing inputs (X_test) to ensure a meaningful evaluation.

In [6]:
#Question 2 - Normalize using MinMaxScaler to constrain values to between 0 and 1.

scaler = MinMaxScaler(feature_range = (0,1))

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
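The arithmetic behind the fit and transform steps can be sketched in plain NumPy. This is an illustration of the min-max formula on hypothetical single-feature data, not a replacement for MinMaxScaler; the array names are assumptions.

```python
import numpy as np

# Hypothetical single-feature data, not taken from liver_patient.csv
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [6.0]])

# "fit": learn the per-column minimum and maximum from the training data only
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the same training statistics to both sets
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Testing values that lie outside the training range map outside 0-1. That is expected when the scaler is fitted on the training data alone, and it keeps information about the testing set from leaking into the model.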

**Step 5: create the Logistic Regression model**

The algorithm used in this example is Logistic Regression, one of the most commonly used classification algorithms. The model created is 'logreg'. You can use any other variable name to represent it, e.g., 'lr'.

logreg.fit(X_train, y_train) - uses the training input data and target values to build the 'logreg' model. After this command, 'logreg' will have learned the 'rules'/'knowledge' for differentiating a liver patient from a non-liver patient.

y_pred = logreg.predict(X_test) - the 'logreg' model is used to predict X_test and the predicted result is stored in y_pred. If the model had learnt all the 'knowledge/rules', y_pred would match y_test 100%.

The subsequent code evaluates the predicted results using accuracy and the precision, recall and F1-score from classification_report. You may observe that the recall value is quite low for target value '2'; this is likely due to the class imbalance in the original data, with '1' having 416 and '2' having 167 records.

In [6]:
#Step 5: Create a logistic regression classifier, default C=1

logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Testing accuracy %s' % accuracy_score(y_test, y_pred))

#look at the value under the '1' class (or 'yes') for the corresponding precision, recall and f1 score
print(classification_report(y_test, y_pred))

Testing accuracy 0.72
              precision    recall  f1-score   support

           1       0.74      0.94      0.83       125
           2       0.53      0.16      0.25        50

    accuracy                           0.72       175
   macro avg       0.64      0.55      0.54       175
weighted avg       0.68      0.72      0.66       175
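To make the report concrete, the sketch below recomputes the metrics from a confusion matrix. The counts are an assumption: they are inferred from the rounded precision, recall and support figures above, not printed by the notebook.

```python
import numpy as np

# Rows = actual class (1, 2), columns = predicted class (1, 2).
# These counts are inferred from the rounded report, not read from the notebook.
cm = np.array([[118, 7],
               [42, 8]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / actual, per class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / predicted, per class
accuracy = cm.diagonal().sum() / cm.sum()

print("recall   ", recall.round(2))
print("precision", precision.round(2))
print("accuracy ", round(float(accuracy), 2))
```

Under these counts, only 8 of the 50 class '2' records are recovered, yet accuracy still reaches 0.72 because the model predicts class '1' for most rows.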



One approach to improve the model is to handle the imbalanced class data by oversampling the minority class. A common method is SMOTE, which stands for Synthetic Minority Over-sampling Technique. SMOTE is applied to X_train and y_train only, so that both the '1' and '2' classes are represented equally during training while the testing data is left untouched.

With a balanced training set, precision and recall for the minority class ('2', not a liver patient) are expected to be more balanced than in the report above, although the resampled results are not shown in this notebook.
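The idea of balancing the training set can be sketched with plain random oversampling, a simpler stand-in for SMOTE: it duplicates existing minority rows with replacement instead of synthesizing new ones. The data and the names `X_tr`, `y_tr`, `X_bal`, `y_bal` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic training data: 20 rows of class 1 and 8 rows of class 2
X_tr = rng.normal(size=(28, 3))
y_tr = np.array([1] * 20 + [2] * 8)

minority_idx = np.flatnonzero(y_tr == 2)
n_extra = int(np.sum(y_tr == 1) - minority_idx.size)

# Draw minority rows with replacement until both classes have the same count
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra_idx]])
y_bal = np.concatenate([y_tr, y_tr[extra_idx]])

print(np.unique(y_bal, return_counts=True))
```

Only the training split is resampled; evaluating on an untouched testing split keeps the reported metrics representative of the original class mix.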

It is important to be able to interpret the model and to understand the importance of each feature. For Logistic Regression, since we have normalized the input values, we can make use of the coefficients (coef_) to find the important features.

argsort returns the indices that sort the values from smallest to largest. Negating logreg.coef_ therefore sorts the signed coefficients in descending order; to rank strictly by magnitude, np.abs(logreg.coef_) would be negated instead.

The feature listed first has the largest coefficient and, on this basis, contributes the most to the model (in the output below, it is albumin).

In [7]:
#Question 1,3 - get the descending sorted indices based on coefficient values
sorted_index = np.argsort(-logreg.coef_)

#get the feature_names
feature_names = input_df.columns

#get the names of the important features (largest to smallest)
print(feature_names.to_numpy()[sorted_index])

[['albumin' 'alb_glo' 'akl_ph' 'asp_at' 'ala_at' 'age' 't_bilirubin'
  't_protein' 'd_bilirubin']]
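Sorting by signed value and sorting by magnitude can give different rankings when some coefficients are negative. The sketch below uses hypothetical feature names and coefficient values (not the fitted logreg.coef_) to show the difference.

```python
import numpy as np

# Hypothetical coefficients with shape (1, n_features), like logreg.coef_
feature_names = np.array(["f_a", "f_b", "f_c", "f_d"])
coef = np.array([[0.4, -2.1, 1.3, -0.2]])

signed_order = np.argsort(-coef[0])             # largest positive first
magnitude_order = np.argsort(-np.abs(coef[0]))  # largest absolute value first

print(feature_names[signed_order])
print(feature_names[magnitude_order])
```

In this made-up case, f_b has the strongest effect by magnitude but appears last in the signed ranking because its coefficient is negative.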


Question 3: Accuracy is a suitable evaluation metric to assess this classifier.



Question 4: Which is the most important feature based on the raw data (considering that the coefficient values of logistic regression can be used for the purpose)?

`albumin`

Question 5: Does scaling or normalization have an impact on the ranking of the important features? Explain your reason.


Yes, it does. By looking at df.describe(), we can see that alb_glo has a very large range and a high standard deviation - the minimum is -100000 and the max is 2.80000!

The -100000 is most likely an outlier, which will have an impact on the ranking of the features.

This feature, along with all the others, should be scaled to between 0 and 1 using a MinMaxScaler so that the coefficients are comparable across features. Note that min-max scaling is itself sensitive to extreme values, so an outlier like this one is best handled before scaling.
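To see how much a single extreme value dominates min-max scaling, the sketch below scales a few illustrative values. Only the -100000 and 2.8 endpoints come from the answer above; the other ratios and the name `alb_glo_demo` are assumptions.

```python
import numpy as np

# Illustrative values: one extreme entry alongside ordinary ratios
alb_glo_demo = np.array([-100000.0, 0.3, 0.9, 2.8])

lo, hi = alb_glo_demo.min(), alb_glo_demo.max()
scaled = (alb_glo_demo - lo) / (hi - lo)
print(scaled)

# Spread of the ordinary values after scaling
print(scaled[1:].max() - scaled[1:].min())
```

The ordinary ratios end up packed into a tiny band just below 1, so almost all of the feature's scaled range is spent on the single extreme entry.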