# Factors affecting Canadians’ decisions to obtain flu vaccinations
## Initial Results and Code
## Lise Doucette  |  July 23, 2019 


## _Section 1: Comparisons of three classifiers_

In [54]:
# import libraries (note that a DeprecationWarning is sometimes given,
# but it is not relevant to the specific functions I am using)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [51]:
# use a saved version of the data that is cleaned and ready for analysis
# (the code to do the cleaning is CapstoneCodeCCHS.py at https://github.com/librarianlise/RyersonCapstoneProject/)

df2=pd.read_csv('CapstoneDataCleaned.csv', sep=',')

In [52]:
# check that data is read in properly
df2.head()

Unnamed: 0,flu_past_year,age,sex,self_health,education,has_doctor,is_immigrant,province,income,household_size,income_level,marital_status,children_in_household
0,1,5,1,2,2,1,0,24,1,1,1,4,0
1,0,1,0,3,3,1,0,35,3,3,2,4,0
2,0,4,0,1,3,1,1,35,3,4,2,1,0
3,0,3,0,4,3,0,0,35,5,1,3,4,0
4,0,5,0,2,2,1,0,10,2,1,2,3,0


In [31]:
# display percentage of rows/participants with each class outcome to compare to classifier results

df2['flu_past_year'].value_counts().sort_index()/len(df2)

0    0.627921
1    0.372079
Name: flu_past_year, dtype: float64

### Preparatory work for all analyses

In [25]:
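# make a new data frame with the twelve independent variables (age through children_in_household),
# selecting by column name so that a saved index column in the CSV cannot shift the selection
X = df2.loc[:, 'age':'children_in_household']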

# create dataframe that includes dummy variables for the marital status and province variables, as they are
# the only two non-binary nominal variables
X = pd.get_dummies(X, columns=['marital_status', 'province'])

# make marital status of married (1) and province of Ontario (35) the baseline measurements by removing them
X = X.drop(['marital_status_1', 'province_35'], axis=1)

# confirm that correct variables are included
X.columns

Index(['age', 'sex', 'self_health', 'education', 'has_doctor', 'is_immigrant',
       'income', 'household_size', 'income_level', 'children_in_household',
       'marital_status_2', 'marital_status_3', 'marital_status_4',
       'province_10', 'province_11', 'province_12', 'province_13',
       'province_24', 'province_46', 'province_47', 'province_48',
       'province_59', 'province_60', 'province_61', 'province_62'],
      dtype='object')

In [26]:
# class series that includes only the classification/prediction variable (flu_past_year)
y = df2['flu_past_year']

### Classifier 1: Logistic Regression

In [14]:
# create test and train data

X_trainLR, X_testLR, y_trainLR, y_testLR = train_test_split(X, y, test_size=0.33, random_state=0)

# run classifier

classifier = LogisticRegression(random_state=4)
classifier.fit(X_trainLR, y_trainLR)
y_predLR = classifier.predict(X_testLR)

In [39]:
# create and display confusion matrix

confusion_matrixLR = confusion_matrix(y_testLR, y_predLR)
print(confusion_matrixLR)

[[17021  3277]
 [ 6736  5414]]


In [17]:
# display accuracy
classifier.score(X_testLR, y_testLR)

0.6914139546351085

In [19]:
# display metrics (precision, recall, etc.)
print(classification_report(y_testLR, y_predLR))

             precision    recall  f1-score   support

          0       0.72      0.84      0.77     20298
          1       0.62      0.45      0.52     12150

avg / total       0.68      0.69      0.68     32448



### Results of Logistic Regression

The overall accuracy is 69.1%. For comparison, 62.8% of respondents did not obtain the flu shot, so a classifier that simply chose the majority class every time would reach an accuracy of about 62.8%.

Logistic regression is therefore 6.3 percentage points better than the majority-class baseline; some tuning may improve it further.
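
The comparison with the majority class can be checked directly from the confusion matrix. The following is a minimal sketch that uses only the counts printed above; the helper names `accuracy` and `majority_baseline` are illustrative and not part of the original analysis. It computes the baseline on the test split (the share of the larger class among the 32,448 test rows), which can differ slightly from the 62.8% measured on the full data set.

```python
# Confusion matrix for logistic regression, copied from the output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm_lr = [[17021, 3277],
         [6736, 5414]]

def accuracy(cm):
    """Share of test rows on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def majority_baseline(cm):
    """Accuracy of always predicting the most common true class."""
    total = sum(sum(row) for row in cm)
    return max(sum(row) for row in cm) / total

acc = accuracy(cm_lr)
base = majority_baseline(cm_lr)
print(f"accuracy: {acc:.4f}")
print(f"majority-class baseline (test split): {base:.4f}")
print(f"gain in percentage points: {100 * (acc - base):.1f}")
```

The same two helpers can be applied to the confusion matrices of the other classifiers, so that all three are compared against the same test-split baseline.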

### Classifier 2: Random Forest

In [38]:
# create test and train data

X_trainRF, X_testRF, y_trainRF, y_testRF = train_test_split(X, y, test_size=0.33, random_state=0)

# run classifier

clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_trainRF,y_trainRF)
y_predRF=clf.predict(X_testRF)

In [40]:
# create and display confusion matrix

confusion_matrixRF = confusion_matrix(y_testRF, y_predRF)
print(confusion_matrixRF)

[[15814  4484]
 [ 6627  5523]]


In [43]:
# display accuracy
clf.score(X_testRF, y_testRF)

0.6575751972386588

In [44]:
# display metrics (precision, recall, etc.)
print(classification_report(y_testRF, y_predRF))

             precision    recall  f1-score   support

          0       0.70      0.78      0.74     20298
          1       0.55      0.45      0.50     12150

avg / total       0.65      0.66      0.65     32448



### Results of Random Forest

The overall accuracy is 65.8%. For comparison, 62.8% of respondents did not obtain the flu shot, so a classifier that simply chose the majority class every time would reach an accuracy of about 62.8%.

Random forest is therefore 3.0 percentage points better than the majority-class baseline; some tuning may improve it further.

### Classifier 3: Naive Bayes

In [47]:
# create test and train data

X_trainNB, X_testNB, y_trainNB, y_testNB = train_test_split(X, y, test_size=0.33, random_state=0)

# run classifier

model = GaussianNB()
model.fit(X_trainNB,y_trainNB)
y_predNB = model.predict(X_testNB)

In [48]:
# create and display confusion matrix

confusion_matrixNB = confusion_matrix(y_testNB, y_predNB)
print(confusion_matrixNB)

[[13313  6985]
 [ 4702  7448]]


In [49]:
# display accuracy
model.score(X_testNB, y_testNB)

0.639823717948718

In [50]:
# display metrics (precision, recall, etc.)
print(classification_report(y_testNB, y_predNB))

             precision    recall  f1-score   support

          0       0.74      0.66      0.69     20298
          1       0.52      0.61      0.56     12150

avg / total       0.66      0.64      0.64     32448



### Results of Naive Bayes

The overall accuracy is 64.0%. For comparison, 62.8% of respondents did not obtain the flu shot, so a classifier that simply chose the majority class every time would reach an accuracy of about 62.8%.

Naive Bayes is therefore only 1.2 percentage points better than the majority-class baseline; some tuning may improve it further.

### Comparing Classifiers

The Logistic Regression classifier performed best in terms of accuracy, at 69.1%, compared with 65.8% for Random Forest and 64.0% for Naive Bayes. I still need to look at the results of the three classifiers in more detail, in particular precision and recall.
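
As a starting point for that closer look, precision, recall and F1 for class 1 (respondents who obtained a flu shot) can be recomputed side by side from the three confusion matrices. This is a minimal sketch using only the counts printed above; the function name `positive_class_metrics` and the `confusion` dictionary are illustrative assumptions, not part of the original analysis.

```python
# Confusion matrices copied from the outputs above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
confusion = {
    "Logistic Regression": [[17021, 3277], [6736, 5414]],
    "Random Forest":       [[15814, 4484], [6627, 5523]],
    "Naive Bayes":         [[13313, 6985], [4702, 7448]],
}

def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 (obtained a flu shot)."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {name: positive_class_metrics(cm) for name, cm in confusion.items()}
for name, (p, r, f1) in results.items():
    print(f"{name:<20} precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
```

On these counts, Naive Bayes has the highest recall for class 1 (about 0.61) despite having the lowest accuracy, which is the kind of trade-off that accuracy alone hides.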

## _Section 2: Decision Tree Rules_

In [59]:
#create tree model
dtree = tree.DecisionTreeClassifier(criterion='entropy')
dtree.fit(X, y)

# export tree so that I can create the diagram in the http://webgraphviz.com/ program
# (the with-statement closes the file even if the export raises an error)
with open("dtree.dot", "w") as dotfile:
    tree.export_graphviz(dtree, out_file=dotfile, feature_names=X.columns)

The above code runs and creates a decision tree and a file that can be imported into a visualization program, though the tree is so large that it cannot be easily visualized.

I think there may be some issues with the dummy variables: in some cases, multiple provinces appear within the same rule (for example, province_12 <= 0.5 and then, later in the same branch, province_60 <= 0.5). However, running it with province as a single integer-coded variable causes problems, because the algorithm then treats it as a numeric variable. I need to look into this.
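
One possible reading of these branches, offered as a sketch rather than a diagnosis: with one dummy column per province, each split can test only a single column, so a branch that passes through `province_12 <= 0.5` and later `province_60 <= 0.5` describes respondents who are in neither province. The province codes below are taken from the column list shown earlier; the helper names `one_hot` and `follows_path` are hypothetical and written in plain Python so the logic is visible.

```python
# Province codes that have dummy columns (Ontario, 35, is the dropped baseline).
PROVINCE_CODES = [10, 11, 12, 13, 24, 46, 47, 48, 59, 60, 61, 62]

def one_hot(province):
    """Return the dummy columns for one respondent's province code."""
    return {f"province_{code}": int(province == code) for code in PROVINCE_CODES}

def follows_path(row, conditions):
    """True if the row satisfies every 'column <= threshold' test on a branch."""
    return all(row[column] <= threshold for column, threshold in conditions)

# A branch that tests two different province dummies in sequence.
path = [("province_12", 0.5), ("province_60", 0.5)]

for province in (12, 35, 48, 60):
    print(province, follows_path(one_hot(province), path))
```

Under this reading the two conditions do not contradict each other; the branch narrows the group one province at a time. Whether that is acceptable, or whether a different encoding would give more readable rules, remains the open question noted above.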

I am still working out how to extract the top decision rules in Python, as it is not straightforward. This will also likely lead to setting constraints for the decision tree algorithm in terms of how to select the 'best' rules. For example, from the dtree.dot file above, I can see rules that look like:


    1 [label="has_doctor <= 0.5\nentropy = 0.829\nsamples = 63689\nvalue = [47032, 16657]"] ; 

and then later rules that look like this:

    84 [label="age <= 1.5\nentropy = 0.985\nsamples = 7\nvalue = [3, 4]"] ;

These are obviously based on vastly different amounts of data (63,689 samples versus 7), and I will need to set constraints so that rules are rated as 'best' rules only when they are supported by a certain minimum number of samples.
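
One way to apply such a minimum after the fact is to parse the node labels in the exported file and keep only the splits with enough samples. The following is a minimal sketch under stated assumptions: it works on the two node lines quoted above, the threshold of 500 is a hypothetical value chosen only for illustration, and the names `NODE_PATTERN` and `nodes_with_min_samples` are not part of the original code.

```python
import re

# Two node lines copied from dtree.dot as quoted above.
dot_lines = [
    r'1 [label="has_doctor <= 0.5\nentropy = 0.829\nsamples = 63689\nvalue = [47032, 16657]"] ;',
    r'84 [label="age <= 1.5\nentropy = 0.985\nsamples = 7\nvalue = [3, 4]"] ;',
]

# Matches a split node: node id, split test, entropy and sample count.
# In the .dot file, the "\n" inside a label is a literal backslash followed by "n".
NODE_PATTERN = re.compile(
    r'^(?P<node>\d+) \[label="(?P<test>[^\\]+)\\n'
    r'entropy = (?P<entropy>[\d.]+)\\n'
    r'samples = (?P<samples>\d+)\\n'
)

def nodes_with_min_samples(lines, min_samples):
    """Return parsed split nodes whose sample count is at least min_samples."""
    kept = []
    for line in lines:
        match = NODE_PATTERN.match(line.strip())
        if match and int(match.group("samples")) >= min_samples:
            kept.append({
                "node": int(match.group("node")),
                "test": match.group("test"),
                "entropy": float(match.group("entropy")),
                "samples": int(match.group("samples")),
            })
    return kept

MIN_SAMPLES = 500  # hypothetical threshold, to be tuned
for node in nodes_with_min_samples(dot_lines, MIN_SAMPLES):
    print(node)
```

An alternative would be to constrain the tree while fitting, for example through the `min_samples_leaf` or `max_depth` parameters of `DecisionTreeClassifier`, so that very small nodes are never created in the first place.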