<table class="table table-bordered">
    <tr>
        <th style="text-align:center; width:35%"><img src='https://dl.dropbox.com/s/qtzukmzqavebjd2/icon_smu.jpg' style="width: 300px; height: 90px; "></th>
    <th style="text-align:center;"><font size="4"> <br/>IS.215 - Analytics in Python Practical 1</font></th>
    </tr>
</table> 

This program builds a classifier for the Pima Indians Diabetes dataset (https://www.kaggle.com/uciml/pima-indians-diabetes-database). It is a binary (2-class) classification problem with 768 observations, 8 input variables and 1 output/target variable. The variables are as follows:

- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Target variable (0 = 'no', 1 = 'yes').

In [2]:
# Step 1: import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [4]:
# Step 2: Read in the data and analyse
df = pd.read_csv(
    "diabetes.csv",
    names=['num_pregnant','glucose_conc','diastolic_bp','triceps_thick','serum_insulin','bmi','pedigree','age','class'])

print(df.shape) # returns the dimensions of the data (row x column)
print(df.describe()) # summarises the data

(768, 9)
       num_pregnant  glucose_conc  diastolic_bp  triceps_thick  serum_insulin  \
count    768.000000    768.000000    768.000000     768.000000     768.000000   
mean       3.845052    120.894531     69.105469      20.536458      79.799479   
std        3.369578     31.972618     19.355807      15.952218     115.244002   
min        0.000000      0.000000      0.000000       0.000000       0.000000   
25%        1.000000     99.000000     62.000000       0.000000       0.000000   
50%        3.000000    117.000000     72.000000      23.000000      30.500000   
75%        6.000000    140.250000     80.000000      32.000000     127.250000   
max       17.000000    199.000000    122.000000      99.000000     846.000000   

              bmi    pedigree         age       class  
count  768.000000  768.000000  768.000000  768.000000  
mean    31.992578    0.471876   33.240885    0.348958  
std      7.884160    0.331329   11.760232    0.476951  
min      0.000000    0.078000   21.00

In [7]:
# Step 3: Split into input and target dataframes. axis=0 refers to rows, axis=1 to columns.
input_df = df.drop("class", axis=1) # the model has to predict `class`, so we remove it from the inputs
target = df['class'] # target is the expected output

print(input_df.shape, target.shape) # expect 8 columns because we dropped the `class` column

(768, 8) (768,)


In [8]:
# Distribution of class, i.e. how many have diabetes and how many do not
target.value_counts()

0    500
1    268
Name: class, dtype: int64

In [9]:
# Step 4:
# Split the feature and label sets into train and test sets (70-30)
# random_state is fixed for reproducibility
# stratify keeps the same class proportions as the input data

X_train, X_test, y_train, y_test = train_test_split(input_df, target, test_size=0.3, random_state=10, stratify=target)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(537, 8) (231, 8) (537,) (231,)
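
To see what `stratify` does, here is a minimal sketch on made-up labels. `X_toy`, `y_toy` and `positive_share` are hypothetical names used only for illustration; the toy labels simply reuse the 500/268 class split reported above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with the same 500/268 class split as the dataset.
y_toy = np.array([0] * 500 + [1] * 268)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10, stratify=y_toy
)

def positive_share(labels):
    """Fraction of samples labelled 1."""
    return labels.mean()

# With stratify, all three shares should be almost identical.
print(f"full:  {positive_share(y_toy):.3f}")
print(f"train: {positive_share(y_tr):.3f}")
print(f"test:  {positive_share(y_te):.3f}")
```

Without `stratify`, the train and test shares can drift apart by chance, which matters more when one class is small.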


In [10]:
# Question 2 - Normalize using MinMaxScaler to constrain values to between 0 and 1.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range = (0,1))

#?

# scaler.fit(X_train)
# X_train = scaler.transform(X_train)
# X_test = scaler.transform(X_test)
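
The commented-out lines above follow the usual pattern: fit the scaler on the training rows only, then apply the same transformation to both sets. A small sketch of that pattern, using hypothetical arrays (`X_train_demo`, `X_test_demo`) in place of the real data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-ins for X_train / X_test (two features, tiny sample).
X_train_demo = np.array([[0.0, 20.0], [5.0, 40.0], [10.0, 60.0]])
X_test_demo = np.array([[2.5, 30.0], [12.0, 50.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
scaler_demo.fit(X_train_demo)                       # learn min/max from the training rows only
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)  # reuse the training min/max

print(X_train_scaled)
print(X_test_scaled)
```

Note that a test value larger than the training maximum (12.0 here) is scaled to above 1. This is expected, because the test set must not influence the fitted min/max.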

In [11]:
#Step 5: Create a logistic regression classifier, default C=1

logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Testing accuracy %s" % accuracy_score(y_test, y_pred))
# Look at the values for the positive class (1). The macro avg weights both classes equally,
# whereas the micro avg (equal to accuracy here) and the weighted avg are influenced by the majority class when the data is imbalanced.
print(classification_report(y_test, y_pred))

Testing accuracy 0.8181818181818182
              precision    recall  f1-score   support

           0       0.82      0.92      0.87       150
           1       0.81      0.63      0.71        81

    accuracy                           0.82       231
   macro avg       0.82      0.77      0.79       231
weighted avg       0.82      0.82      0.81       231
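
As a quick check of how the two averages differ, the recall row can be recomputed by hand. This is only a sketch using the rounded per-class values from the report above, so the results match the report only approximately.

```python
# Per-class recall and support copied from the report above (rounded values).
recall = {0: 0.92, 1: 0.63}
support = {0: 150, 1: 81}

total = sum(support.values())

# Macro average: plain mean over classes, each class counts equally.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.3f}")
print(f"weighted recall: {weighted_recall:.3f}")
```

The weighted figure sits closer to the majority-class recall because class 0 contributes 150 of the 231 test rows.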



### Question 4
This is a 2-class dataset, but it is imbalanced. As a result, one class is over-represented and the model may be biased towards the majority class. One approach to address this is to oversample the minority class.

In our training set, 350 people do not have diabetes and 187 do, which makes the data imbalanced.

In [19]:
# Question 4 - handling imbalanced data
from imblearn.over_sampling import SMOTE

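# Rerunning above with resampled data - using oversampling
sm = SMOTE(random_state=2) # synthetically creates minority-class samples until the classes are balanced
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train.to_numpy()) # fit_sample was renamed to fit_resample in newer imbalanced-learn releases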

print("Existing dataset's counts")
print(y_train.value_counts())
print("\n")

print("Resampled dataset's counts")
print(pd.Series(y_train_sm).value_counts())
print("\n")

print("Resampled dataset vs. existing dataset")
print(X_train_sm.shape, X_train.shape)
print("\n")

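clf = logreg.fit(X_train_sm, y_train_sm) # note: this refits the same `logreg` object, so `clf` and `logreg` are the same model
y_pred = clf.predict(X_test)
print(clf.score) # prints the bound method itself; clf.score(X_test, y_test) would return the test accuracy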
print(classification_report(y_test, y_pred))

#?


Existing dataset's counts
0    350
1    187
Name: class, dtype: int64


Resampled dataset's counts
1    350
0    350
dtype: int64


Resampled dataset vs. existing dataset
(700, 8) (537, 8)


<bound method ClassifierMixin.score of LogisticRegression(solver='liblinear')>
              precision    recall  f1-score   support

           0       0.85      0.74      0.79       150
           1       0.61      0.75      0.67        81

    accuracy                           0.74       231
   macro avg       0.73      0.75      0.73       231
weighted avg       0.76      0.74      0.75       231
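
For intuition on how the synthetic samples are created, here is a simplified sketch of the interpolation idea behind SMOTE. It is not the `imblearn` implementation; the `synthesize` helper and the `minority` points are hypothetical and only illustrate the principle of placing new points between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical minority-class points in two dimensions.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize(points, n_new, k=2, rng=rng):
    """Create n_new points by interpolating between a point and one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Euclidean distances from point i to every point
        dists = np.linalg.norm(points - points[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip position 0, the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=3)
print(synthetic)
```

Because every new point lies on a segment between two existing minority points, the synthetic data stays inside the region the minority class already occupies.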



In [12]:
#Question 1,3 - analyse the features
#get the indices that sort the coefficients in descending order (largest positive coefficient first)
sorted_index = np.argsort(-logreg.coef_)

#get the feature_names
feature_names = input_df.columns

#get the feature names ordered from largest to smallest coefficient
print(feature_names.to_numpy()[sorted_index])

[['pedigree' 'num_pregnant' 'bmi' 'glucose_conc' 'age' 'serum_insulin'
  'triceps_thick' 'diastolic_bp']]
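
The ranking above orders the signed coefficients of unscaled features, so a feature with a small numeric range (such as `pedigree`) can rank high partly because of its units. One possible alternative, sketched here on made-up data, is to standardize the features first and rank by absolute coefficient. The feature names, effect sizes and the use of `StandardScaler` are all assumptions for illustration, not part of the practical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: three features on very different scales.
n = 500
X_demo = np.column_stack([
    rng.normal(0, 1, n),      # "small_scale": strong effect
    rng.normal(0, 100, n),    # "large_scale": weak effect
    rng.normal(0, 10, n),     # "noise": no effect
])
names = np.array(["small_scale", "large_scale", "noise"])
logits = 2.0 * X_demo[:, 0] + 0.005 * X_demo[:, 1]
y_demo = (logits + rng.logistic(size=n) > 0).astype(int)

# Standardize so that coefficients are comparable across features.
X_std = StandardScaler().fit_transform(X_demo)
model = LogisticRegression(solver="liblinear").fit(X_std, y_demo)

# Rank by absolute coefficient, largest first, keeping the sign for display.
order = np.argsort(-np.abs(model.coef_[0]))
for name, coef in zip(names[order], model.coef_[0][order]):
    print(f"{name:12s} {coef:+.3f}")
```

Taking the absolute value matters because a large negative coefficient is as influential as a large positive one; the sign only tells the direction of the effect.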


#### Question 1