<h3>Predicting Diabetic Patients</h3> <hr>
<h2>Project Summary</h2>
<hr>
<h3>Requirements</h3><br>

Nowadays, anyone can get outside food delivered at home through a mobile application. Eating outside food on a daily basis can be harmful and can contribute to diseases such as diabetes. Diabetes is one of the diseases associated with poor eating habits. <br>

You are given a dataset that contains the details of 1 million patients who have undergone different tests and medications for diabetes. Your task is to predict whether a patient will require medication for diabetes again in the near future. Most of the features are anonymized to protect the privacy of patients, insurance companies, etc.<br>

You are required to predict the 'diabetesMed' column.

Note: Refer to the sample_submission.csv file to check the format of the submissions.

<h3>Evaluation Criteria</h3> <br>
The evaluation criterion for this problem is the weighted recall score: <br>
Score = 100 * recall_score(actual_values, predicted_values, average='weighted')
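
As a minimal sketch of how this score behaves, the example below computes it with scikit-learn's `recall_score` on a small set of made-up labels. The label values are assumptions for illustration only and do not come from the competition data.

```python
from sklearn.metrics import recall_score

# Toy labels, made up for illustration (1 = needs medication, 0 = does not).
actual_values = [1, 1, 1, 0, 0, 1, 0, 1]
predicted_values = [1, 0, 1, 0, 1, 1, 0, 1]

# Per-class recall: the fraction of each class's true members predicted correctly.
per_class = recall_score(actual_values, predicted_values, average=None)

# Weighted recall averages the per-class recalls, weighting each by its support.
weighted = recall_score(actual_values, predicted_values, average='weighted')

score = 100 * weighted
print(per_class, score)
```

Because each class's recall is weighted by its share of the samples, the weighted recall works out to the overall fraction of correct predictions (6 of 8 here), so the score is 75.0.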

<h3>Data Description</h3> <br>
1) encounter_id = A Calculated unique ID for each encounter with the patient <br>
2) Patient_ID = Unique ID for each patient<br>
3) race = Patient race <br>
4) gender = Patient gender <br>
5) age = Patient age <br>
6) weight = Patient weight <br>
7) Admission_type_id = The ID assigned while taking admission in the hospital <br>
8) Discharge_disposition_id = The ID assigned while discharging <br>
9) Admission_source_id = The ID of the physician for whom the patient got admitted <br>
10) Time_in_hospital = Time spent by the patient in the hospital <br>
11) diabetesMed = Two unique values, Yes or No, indicating whether the patient needs medicines for diabetes <br>

<h3>Analysis</h3><br>

Steps we performed:<br>
1) Read the datasets <br>
2) Remove ID columns <br>
3) Remove the target column (diabetesMed) from the train dataset <br>
4) Merge the train and test datasets <br>
5) Label-encode the categorical columns <br>
6) Apply the Box-Cox transformation to numerical columns <br>
7) Separate the train and test datasets again <br>
8) Build a Logistic Regression model <br>
9) Predict on the test data <br>


<h3>Summary</h3><br>
The test data has no target column, so model accuracy was measured by submitting predictions to HackerEarth.
We trained several algorithms (Logistic Regression, Decision Tree, Random Forest, and AdaBoost) and compared their accuracy to find out which one performs best.
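
Since the test data has no labels, one way to compare algorithms locally is cross-validation on the training data. The sketch below is an assumption about how such a comparison could look, not the notebook's actual procedure: it uses synthetic data from `make_classification` as a stand-in for the preprocessed features, and the hyperparameters are illustrative defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training features and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate models named in the summary; settings here are illustrative only.
models = {
    'Logistic Regression': LogisticRegression(max_iter=10000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
}

# 5-fold cross-validated weighted recall, scaled to match the competition score.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='recall_weighted')
    cv_scores[name] = 100 * scores.mean()

# Print the models from best to worst mean score.
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {score:.2f}')
```

Scoring with `'recall_weighted'` keeps the local comparison aligned with the competition metric rather than relying on a different measure.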



<h3>Results</h3><br>
The Logistic Regression model had the highest accuracy. We observed that the model was 99.3% correct on the 1 million-record test data.


<h3>Reference</h3><br>
Hackathon organised by HackerEarth


### Python Code

In [189]:
#import libraries 
import pandas as pd
import numpy as np
from scipy import stats 
pd.set_option('display.max_columns', None)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

### Read Data sets

In [190]:
df_train_raw=pd.read_csv('train.csv')
df_test_raw=pd.read_csv('test.csv')

df_train=df_train_raw.copy()
df_test=df_test_raw.copy()

df_train_y=df_train['diabetesMed']

In [191]:
#drop id columns from test and train data
df_train.drop(['encounter_id','patient_id','diabetesMed'],axis=1,inplace=True)
df_test.drop(['encounter_id','patient_id'],axis=1,inplace=True)


In [192]:
# Concat Train and Test data for label encoding

df = pd.concat([df_train,df_test], axis = 0 )

In [193]:
df.shape

(22666, 47)

In [194]:
# Select all object columns and apply label encoding

object_cols = df.select_dtypes(include=['object']).columns
df[object_cols] = df[object_cols].apply(le.fit_transform)
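
The reason for concatenating train and test before label encoding can be illustrated with a small sketch. The toy frames and category values below are hypothetical and chosen only to show what happens when the test data contains a category that the training data lacks.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames; the 'race' values are made up for illustration.
toy_train = pd.DataFrame({'race': ['A', 'B', 'A']})
toy_test = pd.DataFrame({'race': ['B', 'C']})  # 'C' never appears in train

# An encoder fitted on train alone cannot transform the unseen category.
train_only = LabelEncoder().fit(toy_train['race'])
try:
    train_only.transform(toy_test['race'])
    unseen_failed = False
except ValueError:
    unseen_failed = True

# Fitting on the concatenated column gives every category a code.
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoder = LabelEncoder().fit(combined['race'])
train_codes = encoder.transform(toy_train['race'])
test_codes = encoder.transform(toy_test['race'])

print(unseen_failed, list(encoder.classes_), list(train_codes), list(test_codes))
```

Note that when a single `LabelEncoder` is passed to `DataFrame.apply`, it is refitted for each column, so afterwards the encoder object only holds the mapping of the last column it processed.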

#### Box-Cox Transformation
Real-world data often contains heavily skewed features. The Box-Cox transformation helps stabilize variance, makes the data more closely resemble a normal distribution, and improves the validity of measures of association. <br> <br>
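
To make the effect concrete, here is a minimal sketch on synthetic data (a log-normal sample, assumed purely for illustration). For a parameter λ, Box-Cox maps x to (x^λ - 1)/λ when λ is nonzero and to log(x) when λ is 0. `scipy.stats.boxcox` estimates λ by maximum likelihood and only accepts strictly positive inputs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed, strictly positive sample, for illustration only.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed values and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

# Compare sample skewness before and after the transformation.
skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
print(fitted_lambda, skew_before, skew_after)

# Box-Cox is defined only for positive inputs; zeros or negatives raise ValueError.
try:
    stats.boxcox(np.array([0.0, 1.0, 2.0]))
    rejected_nonpositive = False
except ValueError:
    rejected_nonpositive = True
```

The positivity requirement matters for ID-like or count columns: any column containing zeros would need a shift (or a different transformation) before Box-Cox could be applied.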


In [195]:
# applying Box Cox Transformation
admission_type_id_1, fitted_lambda = stats.boxcox(df['admission_type_id'])
discharge_disposition_id_1, fitted_lambda = stats.boxcox(df['discharge_disposition_id'])
admission_source_id_1, fitted_lambda = stats.boxcox(df['admission_source_id'])
time_in_hospital_1, fitted_lambda = stats.boxcox(df['time_in_hospital'])
tel_3_1, fitted_lambda = stats.boxcox(df['tel_3'])
tel_5_1, fitted_lambda = stats.boxcox(df['tel_5'])
tel_12_1, fitted_lambda = stats.boxcox(df['tel_12'])

In [196]:
# Drop existing columns
df.drop(['admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'tel_3', 'tel_5', 'tel_12'],axis=1,inplace=True)

In [197]:
# Reassign the value to the individual columns
df['admission_type_id']=admission_type_id_1
df['discharge_disposition_id']=discharge_disposition_id_1
df['admission_source_id']=admission_source_id_1
df['time_in_hospital']=time_in_hospital_1
df['tel_3']=tel_3_1
df['tel_5']=tel_5_1
df['tel_12']=tel_12_1

In [198]:
df.head(2)

Unnamed: 0,race,gender,age,weight,tel_1,tel_2,tel_4,tel_6,tel_7,tel_8,tel_9,tel_10,tel_11,tel_13,tel_14,tel_15,tel_16,tel_17,tel_18,tel_19,tel_20,tel_21,tel_22,tel_23,tel_24,tel_25,tel_26,tel_27,tel_28,tel_29,tel_30,tel_41,tel_42,tel_43,tel_44,tel_45,tel_46,tel_47,tel_48,tel_49,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,tel_3,tel_5,tel_12
0,3,1,8,0,7,16,0,0,0,0,438,100,433,2,2,1,1,1,0,1,0,2,1,0,1,1,1,1,0,0,0,0,2,0,0,0,0,0,0,1,0.0,0.709312,3.094682,2.031779,54.472918,3.10059,15.156166
1,1,0,5,0,1,25,1,0,0,0,421,404,196,2,2,1,1,1,0,1,0,1,1,0,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,2,0.679927,0.0,0.0,0.727254,28.664223,4.771756,9.107879


#### Split df into df_train and df_test and verify the shapes

In [199]:
df_train.shape

(14696, 47)

In [200]:
df_test.shape

(7970, 47)

In [201]:
#Split Data into Train and Test
df_train=df.iloc[:df_train.shape[0],:]
df_test=df.iloc[df_train.shape[0]:,:]

In [202]:
df_train.shape

(14696, 47)

In [203]:
df_test.shape

(7970, 47)

### Model Building

In [204]:
from sklearn.linear_model import LogisticRegression  # import the logistic regression
logmodel = LogisticRegression(max_iter=10000)

In [205]:
logmodel.fit(df_train,df_train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [206]:
predict=logmodel.predict(df_test)

In [207]:
submission = pd.concat([df_test_raw.encounter_id, pd.Series(predict)], axis = 1) 
submission.rename(columns={submission.columns[0]:"encounter_id", submission.columns[1]:"diabetesMed"},inplace=True)


In [208]:
submission['diabetesMed'].value_counts()

1    4558
0    3412
Name: diabetesMed, dtype: int64

In [209]:
submission.to_csv('submission.csv',index=False)

In [210]:
!jupyter nbconvert --to html diabetic_patients.ipynb
import os
os.rename(r'diabetic_patients.html',r'index.html')

[NbConvertApp] Converting notebook diabetic_patients.ipynb to html
[NbConvertApp] Writing 301179 bytes to diabetic_patients.html
