# Instructions
This is an individual assignment. 

You are to use the attached diabetes dataset to train and evaluate any classifier for predicting the outcome (0 or 1). 

Before training, perform two transformations on the data:

- Fill in the missing values. These will be zeros where a zero does not make sense. Research how to convert a zero into a NaN and then fill in the NaNs with the median value, as in the lecture notes.
- Standardize the data.

## Load the Dataset

In [17]:
import pandas as pd
diabetes_data = pd.read_csv('diabetes.csv')
diabetes_data.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Describe the Dataset

In [18]:
diabetes_data.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


## Inspect the Data

In [19]:
diabetes_data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
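
`isnull()` reports no missing values here because the gaps are stored as zeros rather than NaN. A minimal sketch of how the disguised gaps could be counted instead; the `toy` frame and its values are made up for illustration and merely stand in for `diabetes_data`:

```python
import pandas as pd

# Hypothetical toy frame standing in for diabetes_data (values are invented).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# isnull() finds nothing because the gaps are encoded as 0, not NaN.
null_counts = toy.isnull().sum()

# Counting zeros per feature column exposes the disguised missing values.
zero_counts = (toy.drop(columns="Outcome") == 0).sum()
print(zero_counts)
```

Running the same zero count on the real features would show which columns actually need imputation.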

## Replace Zeros with NaN, except for Outcome

In [20]:
import numpy as np
# Replace zeros with NaN in every feature column; Outcome is left untouched.
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
diabetes_data[feature_cols] = diabetes_data[feature_cols].replace(0, np.nan)
diabetes_data.head()



,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,,33.6,0.627,50,1
1,1.0,85.0,66.0,29.0,,26.6,0.351,31,0
2,8.0,183.0,64.0,,,23.3,0.672,32,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,,137.0,40.0,35.0,168.0,43.1,2.288,33,1


## Replace NaNs with median values

In [21]:
diabetes_data.fillna(diabetes_data.median(), inplace=True)
diabetes_data.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,125.0,33.6,0.627,50,1
1,1.0,85.0,66.0,29.0,125.0,26.6,0.351,31,0
2,8.0,183.0,64.0,29.0,125.0,23.3,0.672,32,1
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21,0
4,4.0,137.0,40.0,35.0,168.0,43.1,2.288,33,1
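
The cells above convert zeros in every feature column, including Pregnancies, where a zero is a plausible value (row 4 changes from 0 to the median 4.0). A possible variation, sketched on a made-up `toy` frame, restricts the conversion to columns where a zero cannot be a real measurement. The choice of columns in `zero_is_missing` is an assumption for illustration, not part of the assignment:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are invented for illustration).
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 1],
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: only these columns treat zero as "missing".
zero_is_missing = ["Glucose", "Insulin"]

# Step 1: turn implausible zeros into NaN.
toy[zero_is_missing] = toy[zero_is_missing].replace(0, np.nan)

# Step 2: fill each NaN with its column median (median() skips NaN by default).
toy[zero_is_missing] = toy[zero_is_missing].fillna(toy[zero_is_missing].median())
print(toy)
```

Here the zero in Pregnancies survives, while the zeros in Glucose and Insulin are replaced by the medians of the remaining values.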


## Separate features and target variable, then standardize the features

In [22]:
from sklearn.preprocessing import StandardScaler
X = diabetes_data.iloc[:, :-1]
y = diabetes_data.iloc[:, -1]

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

## Split data into training and testing sets

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
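
Because the scaler above is fitted on the full dataset before the split, the test rows contribute to the mean and standard deviation used for standardization. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical and not taken from the notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 rows, 3 features, random 0/1 labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows play no part in fitting the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_std = scaler_demo.fit_transform(X_tr)  # statistics come from training rows only
X_te_std = scaler_demo.transform(X_te)      # test rows reuse the training mean and std
```

With this ordering the training features have mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized, which mirrors how unseen data would be treated.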

## Train a logistic regression classifier model

In [24]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

LogisticRegression(random_state=42)

## Evaluate classifier on the testing set

In [25]:
from sklearn.metrics import accuracy_score
y_pred = log_reg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.7597402597402597
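
Accuracy alone can hide how the errors are distributed between the two classes. The mean of Outcome in the `describe()` output is about 0.35, so roughly 65% of the rows belong to class 0. A confusion matrix with precision and recall gives a fuller picture. The sketch below uses small hand-written label lists (`y_true_demo`, `y_pred_demo`) rather than the notebook's predictions:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels and predictions, invented for illustration.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)

acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP)
rec = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN)
print(cm)
print("Accuracy:", acc, "Precision:", prec, "Recall:", rec)
```

In this toy case the accuracy is 0.7 even though only half of the positive cases are found, which is the kind of detail a single accuracy figure does not show.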


## Check for Errors on Training & Testing Data

In [26]:
from sklearn.metrics import mean_squared_error
# RMSE of the 0/1 predictions. np.sqrt is used because the `squared=False`
# argument is deprecated in newer scikit-learn releases.
train_pred = log_reg.predict(X_train)
train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
print("Error on train data:", train_rmse)
test_pred = log_reg.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))
print("Error on test data:", test_rmse)

Error on train data: 0.47579865996117193
Error on test data: 0.4901629731627434
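
For 0/1 labels and 0/1 predictions, each squared error is either 0 or 1, so the RMSE equals the square root of the misclassification rate. The short check below illustrates this on made-up arrays and then applies the relation to the test accuracy reported earlier; the array names are hypothetical:

```python
import numpy as np

# Hypothetical labels and predictions, invented for illustration.
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Share of wrong predictions (2 of 8 here).
error_rate = np.mean(y_true_demo != y_pred_demo)

# RMSE computed directly from the squared differences.
rmse_demo = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

# The same relation applied to the reported test accuracy.
rmse_from_accuracy = np.sqrt(1 - 0.7597402597402597)
print(error_rate, rmse_demo, rmse_from_accuracy)
```

The last value reproduces the test RMSE printed above, so the RMSE figures carry the same information as the accuracy rather than an independent measure of error.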


# Conclusion
My model reaches an accuracy of approximately 76% on the test set. That is not a strong result, but for the reasons laid out below, I think it is a workable starting point.

I think the main reason for this weak performance is the imputation step, where I replaced the zeros with NaN and then replaced the NaNs with the column medians.

In my opinion, this distorted the information that the original, unaltered data would have provided.

In doing so, I took the risk of trading accuracy for a uniform dataset.

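When I checked the errors on the training and test data, both RMSE values were high (about 0.48 and 0.49). For 0/1 labels the RMSE is the square root of the misclassification rate, so these values correspond to accuracies of roughly 77% on the training set and 76% on the test set. I think they again point to the effect of replacing zeros with the median values.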

Despite this, I think the model can be improved with further research and additional techniques.
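
One way to check whether a preprocessing choice helps or hurts is to wrap imputation, scaling, and the classifier in a single pipeline and score it with cross-validation. The sketch below does this on synthetic data generated with `make_classification`; the data, the 10% share of missing values, and all variable names are assumptions for illustration, so the printed score says nothing about the diabetes dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with about 10% of the entries set to NaN.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Imputation and scaling are refitted inside each fold, on training rows only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=42)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Swapping the `strategy` argument (for example to `"mean"`) and comparing the mean scores would be one way to test the effect of the imputation method.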