# Data Cleansing and Transformation

In this notebook we attempt to improve our model by cleansing our data and potentially imputing it. 

## Initial Model Testing

The first run of the model was with the complete raw dataset, including all of the rows with 'Unknown' or 'Unknown/Other'. We wanted to first see how well the model performed leaving these entries in. Surprisingly, using SkLearn's Random Forest Classifier, our model was able to predict the persistence of patients in the test set with approximately 80% accuracy. We plan on improving our model by using other algorithms, but first we would like to explore ways of cleansing or transforming our data so that our model performs better regardless of the algorithm used. 

## Improving the model

### Attempting to improve the model by removing data

Next we try to improve the model's accuracy by removing rows with any uknown entries. We expect this to reduce the accuracy however we believe there might be a slight chance it helps so we attempt it. 

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import seaborn as sns
import scipy.stats as st

In [37]:
df_drug = pd.read_excel("Healthcare_dataset.xlsx",'Dataset')

# Some of the columns in the data set only have one unique value, so we can remove these
df_drug.drop(['Risk_Immobilization','Risk_Estrogen_Deficiency','Risk_Chronic_Liver_Disease','Risk_Untreated_Early_Menopause','Risk_Untreated_Chronic_Hyperthyroidism','Risk_Osteogenesis_Imperfecta'],axis=1)

# Now we will drop rows with unknown values

df_drug = df_drug[df_drug.Race != "Other/Unknown"]
df_drug = df_drug[df_drug.Region != "Unknown"]
#df_drug = df_drug[df_drug.Ntm_Speciality != "Unknown"] Leaving this one in ended up yielding a better model. 
print(len(df_drug))

drug_y = df_drug["Persistency_Flag"]
drug_x = df_drug.drop("Persistency_Flag",axis=1)
drug_x = df_drug.drop("Ptid",axis=1)
drug_x=df_drug.drop("Gender",axis=1) #Trying to see if dropping genderhelps

features = drug_x.columns

# Encode the columns
x_factorized = pd.DataFrame()
for feature in features:
    x_factorized[feature] = pd.factorize(drug_x[feature])[0]

drug_y = pd.DataFrame(pd.factorize(drug_y)[0])
drug_x = x_factorized
drug_y = np.ravel(drug_y)

drug_x = drug_x.drop("Persistency_Flag",axis=1)

#Train and test
X_train, X_test, y_train, y_test = train_test_split(drug_x,drug_y,test_size=0.3,random_state=42)

#Create classifier object
clf = RandomForestClassifier(n_estimators=100)

# Train the model
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
#clf.get_params()

# Now let's test accuracy of the random forest classifier
print("Accuracy: ",metrics.accuracy_score(y_test,y_pred))

3327
Accuracy:  0.8188188188188188


### Second Method

We ended up improving the model's accuracy by exluding the rows that had Race or Region unknown or other.

Next we will try replacing some of the unknown entries with the mode of the corresponding column.  

In [38]:
df_drug.head()


Unnamed: 0,Ptid,Persistency_Flag,Gender,Race,Ethnicity,Region,Age_Bucket,Ntm_Speciality,Ntm_Specialist_Flag,Ntm_Speciality_Bucket,...,Risk_Family_History_Of_Osteoporosis,Risk_Low_Calcium_Intake,Risk_Vitamin_D_Insufficiency,Risk_Poor_Health_Frailty,Risk_Excessive_Thinness,Risk_Hysterectomy_Oophorectomy,Risk_Estrogen_Deficiency,Risk_Immobilization,Risk_Recurring_Falls,Count_Of_Risks
0,P1,Persistent,Male,Caucasian,Not Hispanic,West,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
1,P2,Non-Persistent,Male,Asian,Not Hispanic,West,55-65,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
3,P4,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1
4,P5,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1
5,P6,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,Y,N,N,N,N,N,N,N,N,2


In [41]:
mode_region = st.mode(df_drug['Region'])
mode_race = st.mode(df_drug['Race'])

replace_dict = {'Region':{'Unknown':mode_region},'Race':{'Other/Unknown':mode_race}}

df_drug.replace(replace_dict)



  mode_region = st.mode(df_drug['Region'])
  mode_region = st.mode(df_drug['Region'])
  mode_race = st.mode(df_drug['Race'])
  mode_race = st.mode(df_drug['Race'])


Unnamed: 0,Ptid,Persistency_Flag,Gender,Race,Ethnicity,Region,Age_Bucket,Ntm_Speciality,Ntm_Specialist_Flag,Ntm_Speciality_Bucket,...,Risk_Family_History_Of_Osteoporosis,Risk_Low_Calcium_Intake,Risk_Vitamin_D_Insufficiency,Risk_Poor_Health_Frailty,Risk_Excessive_Thinness,Risk_Hysterectomy_Oophorectomy,Risk_Estrogen_Deficiency,Risk_Immobilization,Risk_Recurring_Falls,Count_Of_Risks
0,P1,Persistent,Male,Caucasian,Not Hispanic,West,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
1,P2,Non-Persistent,Male,Asian,Not Hispanic,West,55-65,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
3,P4,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1
4,P5,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,1
5,P6,Non-Persistent,Female,Caucasian,Not Hispanic,Midwest,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,Y,N,N,N,N,N,N,N,N,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3419,P3420,Persistent,Female,Caucasian,Not Hispanic,South,>75,GENERAL PRACTITIONER,Others,OB/GYN/Others/PCP/Unknown,...,N,N,Y,N,N,N,N,N,N,1
3420,P3421,Persistent,Female,Caucasian,Not Hispanic,South,>75,Unknown,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0
3421,P3422,Persistent,Female,Caucasian,Not Hispanic,South,>75,ENDOCRINOLOGY,Specialist,Endo/Onc/Uro,...,N,N,Y,N,N,N,N,N,N,1
3422,P3423,Non-Persistent,Female,Caucasian,Not Hispanic,South,55-65,Unknown,Others,OB/GYN/Others/PCP/Unknown,...,N,N,N,N,N,N,N,N,N,0


In [50]:
# Finally let's check the accuracy of the model after these changes

drug_y = df_drug["Persistency_Flag"]
drug_x = df_drug.drop("Persistency_Flag",axis=1)
drug_x = df_drug.drop("Ptid",axis=1)
drug_x=df_drug.drop("Gender",axis=1) #Trying to see if dropping genderhelps

features = drug_x.columns

# Encode the columns
x_factorized = pd.DataFrame()
for feature in features:
    x_factorized[feature] = pd.factorize(drug_x[feature])[0]

drug_y = pd.DataFrame(pd.factorize(drug_y)[0])
drug_x = x_factorized
drug_y = np.ravel(drug_y)

drug_x = drug_x.drop("Persistency_Flag",axis=1)

#Train and test
X_train, X_test, y_train, y_test = train_test_split(drug_x,drug_y,test_size=0.3,random_state=42)

#Create classifier object
clf = RandomForestClassifier(n_estimators=100)

# Train the model
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test)
#clf.get_params()

# Now let's test accuracy of the random forest classifier
print("Accuracy: ",metrics.accuracy_score(y_test,y_pred))

Accuracy:  0.8228228228228228


It seems like we have gotten a tiny improvement from this change so we will go with this second method for our final model. 