<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:35%"><img src='https://dl.dropbox.com/s/qtzukmzqavebjd2/icon_smu.jpg' style="width: 300px; height: 90px; "></th>
    <th style="text-align:center;"><font size="4"> <br/>IS.215 - Analytics in Python Practical 1</font></th>
    </tr>
</table> 

This program builds a classifier for the [Pima Indians Diabetes dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database). It is a binary (2-class) classification problem. There are 768 observations with 8 input variables and 1 output/target variable. The variables are as follows:

- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Target variable (0 if non-diabetic, 1 if diabetic).

In [1]:
# Step 1: import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [2]:
# Step 2: Read in the data and analyse
df = pd.read_csv(
    "diabetes.csv",
    names=['num_pregnant','glucose_conc','diastolic_bp','triceps_thick','serum_insulin','bmi','pedigree','age','class'])

print(df.shape) # returns the dimensions of the data (row x column)
print(df.describe()) # summarises the data

(768, 9)
       num_pregnant  glucose_conc  diastolic_bp  triceps_thick  serum_insulin  \
count    768.000000    768.000000    768.000000     768.000000     768.000000   
mean       3.845052    120.894531     69.105469      20.536458      79.799479   
std        3.369578     31.972618     19.355807      15.952218     115.244002   
min        0.000000      0.000000      0.000000       0.000000       0.000000   
25%        1.000000     99.000000     62.000000       0.000000       0.000000   
50%        3.000000    117.000000     72.000000      23.000000      30.500000   
75%        6.000000    140.250000     80.000000      32.000000     127.250000   
max       17.000000    199.000000    122.000000      99.000000     846.000000   

              bmi    pedigree         age       class  
count  768.000000  768.000000  768.000000  768.000000  
mean    31.992578    0.471876   33.240885    0.348958  
std      7.884160    0.331329   11.760232    0.476951  
min      0.000000    0.078000   21.00

In [3]:
# Step 3: Split into input and target dataframes. axis=0 refers to rows, axis=1 to columns.
input_df = df.drop("class", axis=1) # the model has to predict `class` from the inputs, so we remove it from the features
target = df['class'] # target is the expected output

print(input_df.shape, target.shape) # expect 8 columns because we dropped the `class` column

(768, 8) (768,)
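
To make the `axis` argument concrete, here is a minimal sketch on a made-up three-row frame (the values and the name `toy` are placeholders, not rows from `diabetes.csv`). It only illustrates that `axis=1` drops a column by label and `axis=0` drops a row by index label.

```python
import pandas as pd

# Hypothetical three-row frame, used only to illustrate `axis`
toy = pd.DataFrame({"bmi": [33.6, 26.6, 23.3], "age": [50, 31, 32], "class": [1, 0, 1]})

without_column = toy.drop("class", axis=1)  # axis=1: drop a column by its label
without_row = toy.drop(0, axis=0)           # axis=0: drop a row by its index label

print(without_column.shape)  # one column fewer
print(without_row.shape)     # one row fewer
```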


In [4]:
# Distribution of class, i.e. how many have diabetes and how many do not
target.value_counts()

0    500
1    268
Name: class, dtype: int64

For `train_test_split`, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [5]:
# Step 4:
# Split feature (input_df) and label sets (target) to train and test datasets - 70-30
# random_state is desirable for reproducibility (so that the randomness is deterministic)
# stratify - keeps the class proportions of `target` the same in the train and test splits

X_train, X_test, y_train, y_test = train_test_split(input_df, target, test_size=0.3, random_state=10, stratify=target)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Now, X becomes all the features (e.g. bmi, age, glucose) and y becomes the result

(537, 8) (231, 8) (537,) (231,)
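
To see what `stratify` does, the sketch below splits placeholder labels with the same 500 vs 268 class balance, once with and once without stratification. The arrays `X` and `y` here are dummy stand-ins (an assumption for illustration), not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same class balance as the dataset (500 vs 268)
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single feature

_, _, y_tr_strat, y_te_strat = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.3, random_state=10)

# The mean of a 0/1 label array is the share of the positive class
print("overall positive rate:    ", y.mean())
print("stratified train / test:  ", y_tr_strat.mean(), y_te_strat.mean())
print("unstratified train / test:", y_tr_plain.mean(), y_te_plain.mean())
```

With stratification, both splits should stay very close to the overall positive rate; without it, the rates can drift apart by chance.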


### Exercise 2
Check the range of the values of the features. Are they of similar ranges? Will the ranges affect the prediction result? What should you do?

#### Answer
The ranges are not similar: in the summary above, `num_pregnant` goes up to 17 while `serum_insulin` goes up to 846. Features with larger ranges can dominate the fitted model, so the features should be rescaled to a common range before training.

I understand [from this Medium article](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02) that `MinMaxScaler` preserves the shape of the original distribution and does not meaningfully change the information embedded in the original data. In this case, we can use `MinMaxScaler` to scale the values:
- to be between 0 and 1
- while still keeping the original distribution

`MinMaxScaler` is a reasonable starting point when we are not sure whether the features should instead be standardised to zero mean and unit variance using `StandardScaler`.
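
As a rough sketch of what the scaler computes, the block below applies the min-max formula `(x - min) / (max - min)` by hand and compares it with `MinMaxScaler`. The `train` and `test` arrays are made-up values (an assumption for illustration), chosen so that one test value lies above the training maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values for two features on very different scales
train = np.array([[99.0, 0.10], [117.0, 0.40], [199.0, 2.40]])
test = np.array([[210.0, 0.25]])  # first value lies above the training max

# Min-max scaling by hand, using training statistics only
col_min, col_max = train.min(axis=0), train.max(axis=0)
manual_train = (train - col_min) / (col_max - col_min)

scaler = MinMaxScaler().fit(train)
assert np.allclose(manual_train, scaler.transform(train))

# Unseen values are scaled with the training min/max, so they can fall outside [0, 1]
print(scaler.transform(test))

# StandardScaler instead centres each column at 0 with unit variance
standardised = StandardScaler().fit_transform(train)
print(standardised.mean(axis=0), standardised.std(axis=0))
```

Fitting on the training split only, as the notebook does, keeps information about the test split out of the scaling step.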

In [6]:
# Question 2 - Normalize using MinMaxScaler to constrain values to between 0 and 1.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))


# print(X_train)

# Fit the scaler on the training dataset only, so that it learns the training min and max
scaler.fit(X_train)
# Then scale both sets using those training statistics
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# print(X_train) # X_train is now scaled to between 0 and 1

In [7]:
# Step 5: Create a logistic regression classifier, default C=1
logreg = LogisticRegression(solver="liblinear")

# This is where you actually train the model, using fit() with the training dataset
# See https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit
logreg.fit(X_train, y_train)

# Now, using your trained LogisticRegression model,
# we pass in the X_test (features, but test dataset)
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Testing accuracy %s" % accuracy)

# look at the values for the positive class (1)
# macro avg is the unweighted mean over classes, so each class counts equally regardless of its size
# accuracy and weighted avg are influenced by the majority class when the data is imbalanced
print(classification_report(y_test, y_pred))

Testing accuracy 0.8051948051948052
              precision    recall  f1-score   support

           0       0.80      0.93      0.86       150
           1       0.81      0.58      0.68        81

    accuracy                           0.81       231
   macro avg       0.81      0.75      0.77       231
weighted avg       0.81      0.81      0.80       231
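
To check my understanding of the `macro avg` and `weighted avg` rows, the sketch below recomputes them for recall. The confusion counts are an assumption: they are inferred from the rounded report and the printed accuracy, not printed by the notebook itself.

```python
import numpy as np

# Per-class counts inferred from the rounded report above (an assumption):
# rows are true classes, columns are predicted classes.
confusion = np.array([[139, 11],
                      [34, 47]])

support = confusion.sum(axis=1)        # number of true samples per class
recall = np.diag(confusion) / support  # per-class recall

macro_recall = recall.mean()                           # every class counts equally
weighted_recall = np.average(recall, weights=support)  # classes weighted by support

print(recall.round(2), round(float(macro_recall), 2), round(float(weighted_recall), 2))
```

Because the weighted average of recall uses support as the weight, it works out to the overall accuracy, which is why it sits closer to the majority class's recall than the macro average does.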



### Exercise 4
This is a two-class dataset, but it is imbalanced. As a result, one class is over-represented and the resulting model may be biased towards the majority class. One approach to address this is to oversample the minority class.

#### My notes
Since the answer is already given above, I am just writing my notes here. In our training dataset, 350 people do not have diabetes and 187 do, which makes the dataset imbalanced.
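
Before using SMOTE, it helped me to sketch the simplest form of oversampling: repeating minority rows at random until the classes match. This is plain random oversampling, not SMOTE (which creates new synthetic points rather than copies), and the feature values below are random placeholders with the 350 vs 187 class counts from the training split.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
# Placeholder training data with the class counts quoted above (350 vs 187)
X = rng.random((537, 8))
y = np.array([0] * 350 + [1] * 187)

X_minority, X_majority = X[y == 1], X[y == 0]

# Draw minority rows with replacement until both classes have the same size
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=2)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([0] * len(X_majority) + [1] * len(X_minority_up))

print(np.bincount(y), np.bincount(y_balanced))
```

Only the training data is resampled; the test split should keep its original class balance so that the evaluation reflects real proportions.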

In [8]:
# Question 4 - handling imbalanced data
from imblearn.over_sampling import SMOTE

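# Rerunning above with resampled data - using oversampling
sm = SMOTE(random_state=2) # creates synthetic minority-class samples so that the classes are balanced
# fit_sample is no longer available in recent imbalanced-learn releases; fit_resample is the current name
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train.to_numpy()) # resample only the training split (70% of our dataset)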

print("Existing training dataset's counts")
print(y_train.value_counts())
print("\n")

print("Resampled dataset's counts")
print(pd.Series(y_train_sm).value_counts())
print("\n")

print("Resampled dataset vs. existing training dataset")
print(X_train_sm.shape, X_train.shape)
print("\n")

clf = logreg.fit(X_train_sm, y_train_sm) # refit the same classifier on the balanced training data
y_pred = clf.predict(X_test)
# clf.score is a method; printing it without calling it only shows the bound method,
# so we read the accuracy from the classification report instead
print(classification_report(y_test, y_pred))

Existing training dataset's counts
0    350
1    187
Name: class, dtype: int64


Resampled dataset's counts
1    350
0    350
dtype: int64


Resampled dataset vs. existing training dataset
(700, 8) (537, 8)


              precision    recall  f1-score   support

           0       0.84      0.72      0.78       150
           1       0.59      0.75      0.66        81

    accuracy                           0.73       231
   macro avg       0.72      0.74      0.72       231
weighted avg       0.76      0.73      0.74       231



In [9]:
# Question 1, 3 - analyse the features
# Get the indices that sort the coefficients in descending order (largest positive coefficient first)
sorted_index = np.argsort(-logreg.coef_)

# Get the feature_names
feature_names = input_df.columns

# Get the names of the important features
print(feature_names.to_numpy()[sorted_index])

[['glucose_conc' 'bmi' 'pedigree' 'num_pregnant' 'age' 'serum_insulin'
  'triceps_thick' 'diastolic_bp']]
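
A small sketch of how the `np.argsort(-coef)` ranking works, using made-up coefficients (the numbers and the four-feature subset are assumptions for illustration, not the fitted values). It also shows a ranking by absolute value, which would additionally surface features with large negative coefficients.

```python
import numpy as np

# Hypothetical coefficients with the (1, n_features) shape of `coef_` for a binary model
coef = np.array([[0.8, 3.1, -1.2, 0.1]])
names = np.array(["num_pregnant", "glucose_conc", "diastolic_bp", "triceps_thick"])

by_value = np.argsort(-coef[0])              # largest positive coefficient first
by_magnitude = np.argsort(-np.abs(coef[0]))  # largest effect in either direction first

print(names[by_value])
print(names[by_magnitude])
```

Comparing coefficients like this is only meaningful when the features are on a common scale, which is why the ranking is read after the `MinMaxScaler` step.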


### Exercise 1
Which features play an important role in the prediction? Is it explainable?

#### Answer
I would think `glucose_conc`, `serum_insulin` and `pedigree` are the more important features. [Cursory research](https://www.healthhub.sg/a-z/diseases-and-conditions/626/diabetes) shows that glucose tolerance, having sufficient insulin and a family history (i.e. `pedigree`) affect the chances of getting diabetes.


### Exercise 3
Rerun the program and check the features again. Is the result more explainable now?

#### Answer
Yes, the top three features are:

- `glucose_conc`
- `bmi`
- `pedigree`


### Question
What is the accuracy? What does it imply?

#### Answer
The testing accuracy printed in Step 5 is 80.5%. This is the proportion of observations in the testing dataset that the model classified correctly.
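
To put the accuracy in context, here is a rough sketch comparing it with a reference "model" that always predicts the majority class. The label array is rebuilt from the support column of the report (150 vs 81), and `always_negative` is a hypothetical baseline, not something the notebook trains.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Test-set class sizes come from the report's support column (150 vs 81)
y_test_like = np.array([0] * 150 + [1] * 81)

# A reference "model" that always predicts the majority class (non-diabetic)
always_negative = np.zeros_like(y_test_like)
baseline = accuracy_score(y_test_like, always_negative)

reported = 0.8051948051948052  # testing accuracy printed in Step 5
n_correct = round(reported * len(y_test_like))

print(f"baseline {baseline:.3f}, model {reported:.3f}")
print("correct predictions:", n_correct, "of", len(y_test_like))
```

On imbalanced data the baseline is already well above 50%, so accuracy is best read alongside the per-class precision and recall.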