# Data Preprocessing

This is an individual assignment. 

You are to use the attached diabetes dataset to train and evaluate any classifier for predicting the outcome (0 or 1). 

Before training, perform two transformations on the data:

- Fill in the missing values. In this dataset they are recorded as zeros in columns where a zero does not make sense. Research how to convert those zeros into NaN, then fill the NaNs with the column median, as in the lecture notes.
- Standardize the data.
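The zero-to-NaN step can be sketched with pandas alone. The rows below are the first five rows printed by `diabetes.head()` in this notebook. The list of columns in which a zero counts as missing (`zero_as_missing`) is an assumption, since the assignment does not name them.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the diabetes.head() output.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero is physiologically implausible in these columns.
# Pregnancies is left out because zero is a valid count there.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Work on a copy so the original frame keeps its zeros.
marked = sample.copy()
marked[zero_as_missing] = marked[zero_as_missing].replace(0, np.nan)

# Number of values now flagged as missing in each column.
missing_counts = marked.isna().sum()
print(missing_counts)
```

Restricting the replacement to a named column list keeps legitimate zeros, such as zero pregnancies, out of the imputation.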

# Load the diabetes dataset

In [6]:
import pandas as pd
diabetes = pd.read_csv("diabetes.csv")
diabetes.head()

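|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |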


In [7]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


# Convert zeros to NaN and fill the NaNs with the median value

In [9]:
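from sklearn.impute import SimpleImputer
# Median imputation on the features only; the label column is dropped.
imputer = SimpleImputer(strategy="median")
new_diabetes = diabetes.drop("Outcome", axis=1)
# Note: no zeros have been converted to NaN yet and info() reports no nulls,
# so the imputer finds nothing to fill and the zeros stay in the data.
imputer.fit(new_diabetes)
# Column medians, computed with the zeros still included.
new_diabetes.median().values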

array([  3.    , 117.    ,  72.    ,  23.    ,  30.5   ,  32.    ,
         0.3725,  29.    ])
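Because `diabetes.info()` reports no null values, the imputer has nothing to fill until the zeros are marked as NaN. Below is a minimal sketch of the combined step, using the `SkinThickness` and `Insulin` values from the first five rows shown earlier. Treating zero as missing in these two columns is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Two columns from the first five rows of the dataset.
sample = pd.DataFrame({
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
})

# Mark zeros as missing so the imputer can see them.
marked = sample.replace(0, np.nan)

# Fit the median imputer and rebuild a DataFrame with the original labels.
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(
    imputer.fit_transform(marked),
    columns=marked.columns,
    index=marked.index,
)

print(imputer.statistics_)  # per-column medians of the non-missing values
print(filled)
```

With the zeros excluded, the medians come only from the observed measurements, which is the point of the exercise.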

In [10]:
new_diabetes.head()

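|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |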


In [11]:
X = imputer.transform(new_diabetes)
diabetes_tr = pd.DataFrame(X, columns=new_diabetes.columns, index=new_diabetes.index)

In [12]:
diabetes_tr.head()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0 |
| 2 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 |


# Standardization

In [14]:
diabetes_tr_columns = diabetes_tr.columns
diabetes_tr.head()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 0.0 | 33.6 | 0.627 | 50.0 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 0.0 | 26.6 | 0.351 | 31.0 |
| 2 | 8.0 | 183.0 | 64.0 | 0.0 | 0.0 | 23.3 | 0.672 | 32.0 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21.0 |
| 4 | 0.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33.0 |


In [15]:
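from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale the imputed frame from the previous step rather than the raw features.
# fit_transform returns a NumPy array, so the column names are restored in the next cell.
diabetes_tr = scaler.fit_transform(diabetes_tr)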

In [16]:
diabetes_tr_df = pd.DataFrame(diabetes_tr, columns=new_diabetes.columns, index=new_diabetes.index)
diabetes_tr_df.head()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639947 | 0.848324 | 0.149641 | 0.90727 | -0.692891 | 0.204013 | 0.468492 | 1.425995 |
| 1 | -0.844885 | -1.123396 | -0.160546 | 0.530902 | -0.692891 | -0.684422 | -0.365061 | -0.190672 |
| 2 | 1.23388 | 1.943724 | -0.263941 | -1.288212 | -0.692891 | -1.103255 | 0.604397 | -0.105584 |
| 3 | -0.844885 | -0.998208 | -0.160546 | 0.154533 | 0.123302 | -0.494043 | -0.920763 | -1.041549 |
| 4 | -1.141852 | 0.504055 | -1.504687 | 0.90727 | 0.765836 | 1.409746 | 5.484909 | -0.020496 |


In [17]:
diabetes_tr_df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 2.544261e-17 | 3.614007e-18 | -1.327244e-17 | 7.994184e-17 | -3.556183e-17 | 2.295979e-16 | 2.462585e-16 | 1.8576e-16 |
| std | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 | 1.000652 |
| min | -1.141852 | -3.783654 | -3.572597 | -1.288212 | -0.6928906 | -4.060474 | -1.189553 | -1.041549 |
| 25% | -0.8448851 | -0.6852363 | -0.3673367 | -1.288212 | -0.6928906 | -0.5955785 | -0.6889685 | -0.7862862 |
| 50% | -0.2509521 | -0.1218877 | 0.1496408 | 0.1545332 | -0.4280622 | 0.0009419788 | -0.3001282 | -0.3608474 |
| 75% | 0.6399473 | 0.6057709 | 0.5632228 | 0.7190857 | 0.4120079 | 0.5847705 | 0.4662269 | 0.6602056 |
| max | 3.906578 | 2.444478 | 2.734528 | 4.921866 | 6.652839 | 4.455807 | 5.883565 | 4.063716 |
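The `std` row reads 1.000652 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). With 768 rows the ratio between the two is sqrt(768/767), roughly 1.000652. The NumPy sketch below reproduces the effect on randomly generated values, which are illustrative only and not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=120.0, scale=30.0, size=768)  # made-up column with 768 rows

# Standardize the way StandardScaler does by default: subtract the mean,
# then divide by the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)     # the quantity the scaler normalizes to 1
sample_std = z.std(ddof=1)  # the quantity pandas describe() reports

print(round(float(pop_std), 6), round(float(sample_std), 6))
```

The means in the table are likewise not exactly zero, only zero up to floating-point rounding.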


# Train a model to classify the outcome (0 or 1)

In [93]:
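from sklearn.ensemble import RandomForestClassifier
rf_cl = RandomForestClassifier()
diabetes_label = diabetes["Outcome"]
# Fit on the DataFrame so the feature names match X_train and X_test at predict time.
# Caution: this uses all 768 rows, including those that end up in the test split.
rf_cl.fit(diabetes_tr_df, diabetes_label)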

RandomForestClassifier()

## Model Evaluation

In [87]:
from sklearn.model_selection import train_test_split
X = diabetes_tr_df
y = diabetes["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

In [94]:
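from sklearn.metrics import mean_squared_error
train_pred = rf_cl.predict(X_train)
# Take the square root explicitly; newer scikit-learn releases no longer accept squared=False.
# For 0/1 labels, RMSE is the square root of the misclassification rate.
rmse = mean_squared_error(y_train, train_pred) ** 0.5
print("Error on train data:", rmse)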

Error on train data: 0.0




In [95]:
test_pred = rf_cl.predict(X_test)
# Same metric on the held-out rows (mean_squared_error is already imported above).
rmse = mean_squared_error(y_test, test_pred) ** 0.5
print("Error on test data:", rmse)

Error on test data: 0.0
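A test error of exactly 0.0 is likely a sign of leakage: the classifier was fit on all 768 rows, so the rows in `X_test` had already been seen during training. The scaler was also fit before the split. Below is a minimal sketch of an evaluation that avoids both problems. It runs on synthetic data from `make_classification` (a stand-in, not the diabetes data) and swaps in `accuracy_score` as the more conventional metric for a classifier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the feature matrix (768 x 8).
X, y = make_classification(n_samples=768, n_features=8, random_state=13)

# Split first, so nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=13
)

# Imputer, scaler and classifier are all fit on the training rows only.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestClassifier(random_state=13),
)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Wrapping the preprocessing in a pipeline means the medians, means and standard deviations are re-estimated from whatever data `fit` receives, which also makes the model safe to pass to cross-validation helpers.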




In [None]:
rf_cl.sc