# Assignment 8:

## Task: Can we accurately predict voting outcomes by using informal polling questions? (https://www.kaggle.com/c/can-we-predict-voting-outcomes)

## Data: In this data, we have to predict whether a person is going to vote Democrat or Republican based on their answers to informal questions and other attributes. There are 101 informal questions in total, along with their answers, in the data. The Questions.pdf file contains information about the questions.

## Attributes:
### USER_ID - an anonymous id unique to a given user
### YOB - the year of birth of the user
### Gender - the gender of the user, either Male or Female
### Income - the household income of the user. Either not provided, or one of "under $25,000", "$25,001 - $50,000", "$50,000 - $74,999", "$75,000 - $100,000", "$100,001 - $150,000", or "over $150,000".
### HouseholdStatus - the household status of the user. Either not provided, or one of "Domestic Partners (no kids)", "Domestic Partners (w/kids)", "Married (no kids)", "Married (w/kids)", "Single (no kids)", or "Single (w/kids)".
### EducationLevel - the education level of the user. Either not provided, or one of "Current K-12", "High School Diploma", "Current Undergraduate", "Associate's Degree", "Bachelor's Degree", "Master's Degree", or "Doctoral Degree".
### Party - the political party for which the user intends to vote. Either "Democrat" or "Republican".
### Q124742, Q124122, . . . , Q96024 - 101 different questions that the users were asked on Show of Hands. If the user didn't answer the question, there is a blank. For information about the question text and possible answers, see the file Questions.pdf.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
import numpy as np

In [2]:
data = pd.read_csv("train2016.csv")

In [3]:
data.head()

Unnamed: 0,USER_ID,YOB,Gender,Income,HouseholdStatus,EducationLevel,Party,Q124742,Q124122,Q123464,...,Q100010,Q99716,Q99581,Q99480,Q98869,Q98578,Q98059,Q98078,Q98197,Q96024
0,1,1938.0,Male,,Married (w/kids),,Democrat,No,,No,...,Yes,No,No,,No,,Only-child,No,No,Yes
1,4,1970.0,Female,"over $150,000",Domestic Partners (w/kids),Bachelor's Degree,Democrat,,Yes,No,...,,,,No,No,No,Only-child,Yes,No,No
2,5,1997.0,Male,"$75,000 - $100,000",Single (no kids),High School Diploma,Republican,,Yes,Yes,...,Yes,No,No,No,Yes,No,Yes,No,Yes,No
3,8,1983.0,Male,"$100,001 - $150,000",Married (w/kids),Bachelor's Degree,Democrat,No,Yes,No,...,No,No,No,Yes,Yes,No,Yes,No,No,Yes
4,9,1984.0,Female,"$50,000 - $74,999",Married (w/kids),High School Diploma,Republican,No,Yes,No,...,Yes,No,No,Yes,No,No,Yes,No,No,Yes


## Data Pre-Processing

In [4]:
X = data[data.columns.difference(['Party'])].copy()  # copy so the fills below modify X itself and not a slice of data
Y = data['Party']

In [5]:
## Fill missing values: most frequent value for text columns, integer-truncated mean for numeric columns
for i in X.columns:
    if(X[i].dtype=='object'):
        X[i] = X[i].fillna(X[i].mode()[0])
    else:
        X[i] = X[i].fillna(int(X[i].mean()))


In [6]:
X = pd.get_dummies(X)

In [7]:
del X['USER_ID']

In [8]:
## Binarize the year of birth around its mean (computed once instead of once per row)
yob_mean = X['YOB'].mean()
X['YOB<%d'%yob_mean] = (X['YOB']<yob_mean).astype(int)
X['YOB>%d'%yob_mean] = (X['YOB']>yob_mean).astype(int)

In [9]:
del X['YOB']
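
The pre-processing steps above can be sketched end to end on a small made-up frame. This is only an illustration under stated assumptions: the `toy` data and the `preprocess` helper are hypothetical and not part of the assignment, and the dtype check uses `pd.api.types.is_numeric_dtype` rather than the `dtype=='object'` comparison used in the notebook.

```python
import pandas as pd

# Toy frame standing in for train2016.csv (values are made up for illustration).
toy = pd.DataFrame({
    "YOB": [1950.0, None, 1990.0, 1980.0],
    "Gender": ["Male", "Female", None, "Female"],
    "Q1": ["Yes", None, "No", "Yes"],
})

def preprocess(df):
    """Impute missing values, one-hot encode, and binarize YOB around its mean."""
    out = df.copy()  # work on a copy so the input frame is untouched
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill with the integer-truncated mean.
            out[col] = out[col].fillna(int(out[col].mean()))
        else:
            # Text column: fill with the most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    yob_mean = out["YOB"].mean()
    out = pd.get_dummies(out, dtype=int)  # text columns become 0/1 indicator columns
    out["YOB<%d" % yob_mean] = (out["YOB"] < yob_mean).astype(int)
    out["YOB>%d" % yob_mean] = (out["YOB"] > yob_mean).astype(int)
    return out.drop(columns=["YOB"])

features = preprocess(toy)
print(features)
```

Every resulting column is binary, which is the input format `BernoulliNB` expects.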

## Model Fitting

In [10]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,train_size=0.67,random_state = 3)

In [11]:
X_test.index = range(0,len(X_test))

In [12]:
clf = BernoulliNB()

In [13]:
clf.fit(X_train,Y_train)
test_prob = clf.predict_proba(X_test)
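
The columns of `predict_proba` follow the order of `clf.classes_`, which scikit-learn stores sorted, so with string labels column 0 corresponds to "Democrat" and column 1 to "Republican". A minimal sketch on made-up data (the `toy_` names are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features and alternating labels, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(40, 5))
y_toy = np.array(["Democrat", "Republican"] * 20)

toy_clf = BernoulliNB().fit(X_toy, y_toy)
toy_proba = toy_clf.predict_proba(X_toy)

# classes_ is sorted, and predict_proba columns follow that order.
print(toy_clf.classes_)
dem_col = list(toy_clf.classes_).index("Democrat")
print("Democrat probabilities are in column", dem_col)
```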

## Probability and Evidence Part

In [18]:
## Log Probabilities for feature value = 1
pos_proba = clf.feature_log_prob_

In [19]:
## Log Prior Probabilities of the Classes (index 0 = Democrat, index 1 = Republican)
class_prob = clf.class_log_prior_

In [20]:
## Log Probabilities for feature value = 0
neg_prob = np.log(1-np.exp(clf.feature_log_prob_))
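
As a sanity check on these quantities, the class log priors together with the two per-feature log-probability tables should be enough to reproduce the model's own output. The sketch below does this on made-up data; the `toy_` names and the `log_p1`/`log_p0` variables are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary data, for illustration only.
rng = np.random.default_rng(1)
X_toy = rng.integers(0, 2, size=(50, 6))
y_toy = np.array([0, 1] * 25)
toy_clf = BernoulliNB().fit(X_toy, y_toy)

log_p1 = toy_clf.feature_log_prob_        # log P(x_j = 1 | class)
log_p0 = np.log(1 - np.exp(log_p1))       # log P(x_j = 0 | class)

# Joint log-likelihood: prior + sum over features of the matching log-probability.
joint = toy_clf.class_log_prior_ + X_toy @ log_p1.T + (1 - X_toy) @ log_p0.T

# Normalize across the two classes to obtain log posterior probabilities.
log_norm = np.logaddexp(joint[:, 0], joint[:, 1])[:, None]
manual_log_proba = joint - log_norm

print(np.allclose(manual_log_proba, toy_clf.predict_log_proba(X_toy)))
```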

In [21]:
## Index of most positive and negative object with respect to the probabilities
pos_obj_index = np.argmax(test_prob[:,0])
neg_obj_index = np.argmax(test_prob[:,1])

In [22]:
## Difference Between the Log Prior Probabilities of the Democrat and Republican Classes
class_prob[0] - class_prob[1]

0.11271973369983712

In [23]:
total_pos_evi = {}
total_neg_evi = {}
pos_evi_feat = {}
neg_evi_feat = {}

In [24]:
for i,r in X_test.iterrows():
    ## Since the log-prior difference between the Democrat and Republican classes is positive, we add it to the positive evidence
    pos_evd = class_prob[0] - class_prob[1]
    neg_evd = 0
    feat_pos = {}
    feat_neg = {}
    for j in range(0,len(r)):
        ## Pick the log-probabilities that match the observed value; r.iloc[j] indexes by position
        if(r.iloc[j]==0):
            diff = neg_prob[0][j] - neg_prob[1][j]
        elif(r.iloc[j]==1):
            diff = pos_proba[0][j] - pos_proba[1][j]
        else:
            continue
        ## A positive difference favours Democrat, a negative difference favours Republican
        if(diff>0):
            pos_evd += diff
            feat_pos[j] = diff
        if(diff<0):
            neg_evd += diff
            feat_neg[j] = diff
    pos_evi_feat[i] = feat_pos
    neg_evi_feat[i] = feat_neg
    total_pos_evi[i] = pos_evd
    total_neg_evi[i] = neg_evd
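
The same bookkeeping can be written without Python loops. The sketch below is a vectorized variant on made-up data, not the notebook's own implementation. The `toy_` names are hypothetical, and the prior difference is assigned to the positive or negative side by its sign, whereas the loop above always adds it to the positive side. It also checks the identity that makes the decomposition useful: positive plus negative evidence equals the log-odds of class 0 over class 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary data, for illustration only.
rng = np.random.default_rng(2)
X_toy = rng.integers(0, 2, size=(60, 8))
y_toy = np.array([0, 1] * 30)
toy_clf = BernoulliNB().fit(X_toy, y_toy)

log_p1 = toy_clf.feature_log_prob_
log_p0 = np.log(1 - np.exp(log_p1))
prior_diff = toy_clf.class_log_prior_[0] - toy_clf.class_log_prior_[1]

# Per-object, per-feature log-evidence for class 0 over class 1.
contrib = np.where(X_toy == 1, log_p1[0] - log_p1[1], log_p0[0] - log_p0[1])

# Split contributions by sign and add the prior difference to the matching side.
pos_evidence = np.clip(contrib, 0, None).sum(axis=1) + max(prior_diff, 0.0)
neg_evidence = np.clip(contrib, None, 0).sum(axis=1) + min(prior_diff, 0.0)

toy_log_proba = toy_clf.predict_log_proba(X_toy)
log_odds = toy_log_proba[:, 0] - toy_log_proba[:, 1]
print(np.allclose(pos_evidence + neg_evidence, log_odds))
```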

In [25]:
largest_pos_evd = sorted(total_pos_evi, key=total_pos_evi.get,reverse=True)
largest_neg_evd = sorted(total_neg_evi,key=total_neg_evi.get,reverse=True)

In [26]:
# The most uncertain object with respect to the probabilities
print (min(enumerate(test_prob[:,0]),key=lambda x: abs(x[1]-0.5)))
print (min(enumerate(test_prob[:,1]),key=lambda x: abs(x[1]-0.5)))

(1464, 0.49980092339604293)
(1464, 0.5001990766039609)


## 1) The most positive object with respect to the probabilities

In [27]:
X_test.iloc[pos_obj_index]

EducationLevel_Associate's Degree              1
EducationLevel_Bachelor's Degree               0
EducationLevel_Current K-12                    0
EducationLevel_Current Undergraduate           0
EducationLevel_Doctoral Degree                 0
EducationLevel_High School Diploma             0
EducationLevel_Master's Degree                 0
Gender_Female                                  1
Gender_Male                                    0
HouseholdStatus_Domestic Partners (no kids)    0
HouseholdStatus_Domestic Partners (w/kids)     0
HouseholdStatus_Married (no kids)              0
HouseholdStatus_Married (w/kids)               0
HouseholdStatus_Single (no kids)               1
HouseholdStatus_Single (w/kids)                0
Income_$100,001 - $150,000                     0
Income_$25,001 - $50,000                       0
Income_$50,000 - $74,999                       0
Income_$75,000 - $100,000                      0
Income_over $150,000                           0
Income_under $25,000

### 1.1) the total positive log-evidence

In [28]:
print ("The total positive log-evidence: %f"%total_pos_evi[pos_obj_index])

The total positive log-evidence: 12.338390


### 1.2) the total negative log-evidence

In [29]:
print ("The total negative log-evidence: %f"%total_neg_evi[pos_obj_index])

The total negative log-evidence: -2.638247


### 1.3) Probability Distribution

In [30]:
print ("Class-1 Probability:%f"%test_prob[pos_obj_index][0])
print ("Class-0 Probability: %f"%test_prob[pos_obj_index][1])

Class-1 Probability:0.999939
Class-0 Probability: 0.000061
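
The printed probabilities can be cross-checked against the two evidence totals: their sum is the log-odds of the first class, and applying the logistic function to it should recover the probability. A short worked check using the values printed in sections 1.1 and 1.2 (the variable names are only illustrative):

```python
import math

total_pos = 12.338390   # total positive log-evidence printed in 1.1
total_neg = -2.638247   # total negative log-evidence printed in 1.2

# Sum of the evidence is the log-odds of the class in column 0.
log_odds = total_pos + total_neg

# The logistic function turns log-odds into a probability.
p_first = 1.0 / (1.0 + math.exp(-log_odds))
p_second = 1.0 - p_first

print("%f %f" % (p_first, p_second))
```

Up to rounding, this agrees with the two probabilities printed for this object.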


### 1.4) top 3 feature values that contribute most to the positive evidence

In [31]:
top3_pos_feat_index = sorted(pos_evi_feat[pos_obj_index], key=pos_evi_feat[pos_obj_index].get,reverse=True)[0:3]

In [32]:
print ("First Feature: %s"%X_test.columns[top3_pos_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_pos_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_pos_feat_index[2]])

First Feature: Q109244_No
Second Feature: Q109244_Yes
Third Feature: Q98869_Yes


### 1.5) top 3 feature values that contribute the most to the negative evidence.

In [33]:
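## Note: the values in neg_evi_feat are negative, so reverse=True lists the contributions closest to zero first; reverse=False would list the strongest negative contributions first
top3_neg_feat_index = sorted(neg_evi_feat[pos_obj_index], key=neg_evi_feat[pos_obj_index].get,reverse=True)[0:3]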

In [34]:
print ("First Feature: %s"%X_test.columns[top3_neg_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_neg_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_neg_feat_index[2]])

First Feature: Income_$100,001 - $150,000
Second Feature: EducationLevel_Doctoral Degree
Third Feature: Q108343_Yes
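
Because negative contributions are stored as negative numbers, the sort direction determines which end of the list is returned. The sketch below uses a hypothetical helper, `top_k_features`, and made-up contribution values to show the difference between the two directions.

```python
def top_k_features(contrib, k=3, strongest_negative=False):
    """Return the k keys with the largest (or the most negative) contributions."""
    # reverse=True puts the largest values first; for negative evidence the
    # strongest contributions are the most negative, so sort ascending instead.
    return sorted(contrib, key=contrib.get, reverse=not strongest_negative)[:k]

# Hypothetical per-feature contributions, keyed by feature name.
pos_contrib = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
neg_contrib = {"e": -0.05, "f": -1.2, "g": -0.4, "h": -0.8}

print(top_k_features(pos_contrib))                                  # ['a', 'c', 'd']
print(top_k_features(neg_contrib, strongest_negative=True))         # ['f', 'h', 'g']
print(sorted(neg_contrib, key=neg_contrib.get, reverse=True)[:3])   # ['e', 'g', 'h']
```

The last line shows that a descending sort on negative values returns the weakest negative contributions.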


## 2) The most negative object with respect to the probabilities.

In [35]:
X_test.iloc[neg_obj_index]

EducationLevel_Associate's Degree              0
EducationLevel_Bachelor's Degree               1
EducationLevel_Current K-12                    0
EducationLevel_Current Undergraduate           0
EducationLevel_Doctoral Degree                 0
EducationLevel_High School Diploma             0
EducationLevel_Master's Degree                 0
Gender_Female                                  0
Gender_Male                                    1
HouseholdStatus_Domestic Partners (no kids)    0
HouseholdStatus_Domestic Partners (w/kids)     0
HouseholdStatus_Married (no kids)              0
HouseholdStatus_Married (w/kids)               1
HouseholdStatus_Single (no kids)               0
HouseholdStatus_Single (w/kids)                0
Income_$100,001 - $150,000                     0
Income_$25,001 - $50,000                       0
Income_$50,000 - $74,999                       0
Income_$75,000 - $100,000                      0
Income_over $150,000                           1
Income_under $25,000

### 2.1) the total positive log-evidence

In [36]:
print ("The total positive log-evidence: %f"%total_pos_evi[neg_obj_index])

The total positive log-evidence: 2.676487


### 2.2) the total negative log-evidence

In [37]:
print ("The total negative log-evidence: %f"%total_neg_evi[neg_obj_index])

The total negative log-evidence: -12.011135


### 2.3) Probability Distribution

In [38]:
print ("Class-1 Probability:%f"%test_prob[neg_obj_index][0])
print ("Class-0 Probability: %f"%test_prob[neg_obj_index][1])

Class-1 Probability:0.000088
Class-0 Probability: 0.999912


### 2.4) top 3 feature values that contribute most to the positive evidence

In [39]:
top3_pos_feat_index = sorted(pos_evi_feat[neg_obj_index], key=pos_evi_feat[neg_obj_index].get,reverse=True)[0:3]

In [40]:
print ("First Feature: %s"%X_test.columns[top3_pos_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_pos_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_pos_feat_index[2]])

First Feature: Q120012_No
Second Feature: Q120012_Yes
Third Feature: Q110740_PC


### 2.5) top 3 feature values that contribute the most to the negative evidence.

In [41]:
top3_neg_feat_index = sorted(neg_evi_feat[neg_obj_index], key=neg_evi_feat[neg_obj_index].get,reverse=True)[0:3]

In [42]:
print ("First Feature: %s"%X_test.columns[top3_neg_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_neg_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_neg_feat_index[2]])

First Feature: Income_$100,001 - $150,000
Second Feature: EducationLevel_Doctoral Degree
Third Feature: Q108343_Yes


## 3) The object that has the largest positive evidence.

In [43]:
largest_pos_evd = sorted(total_pos_evi, key=total_pos_evi.get,reverse=True)
largest_pos_evd_index = largest_pos_evd[0]
largest_pos_evd_obj = X_test.iloc[largest_pos_evd[0]]

In [44]:
largest_pos_evd_obj

EducationLevel_Associate's Degree              0
EducationLevel_Bachelor's Degree               0
EducationLevel_Current K-12                    0
EducationLevel_Current Undergraduate           0
EducationLevel_Doctoral Degree                 0
EducationLevel_High School Diploma             0
EducationLevel_Master's Degree                 1
Gender_Female                                  1
Gender_Male                                    0
HouseholdStatus_Domestic Partners (no kids)    0
HouseholdStatus_Domestic Partners (w/kids)     0
HouseholdStatus_Married (no kids)              0
HouseholdStatus_Married (w/kids)               0
HouseholdStatus_Single (no kids)               1
HouseholdStatus_Single (w/kids)                0
Income_$100,001 - $150,000                     0
Income_$25,001 - $50,000                       0
Income_$50,000 - $74,999                       1
Income_$75,000 - $100,000                      0
Income_over $150,000                           0
Income_under $25,000

### 3.1) the total positive log-evidence

In [45]:
print ("The total positive log-evidence: %f"%total_pos_evi[largest_pos_evd_index])

The total positive log-evidence: 12.607412


### 3.2) the total negative log-evidence

In [46]:
print ("The total negative log-evidence: %f"%total_neg_evi[largest_pos_evd_index])

The total negative log-evidence: -4.112884


### 3.3) Probability Distribution

In [47]:
print ("Class-1 Probability:%f"%test_prob[largest_pos_evd_index][0])
print ("Class-0 Probability: %f"%test_prob[largest_pos_evd_index][1])

Class-1 Probability:0.999795
Class-0 Probability: 0.000205


### 3.4) top 3 feature values that contribute most to the positive evidence

In [48]:
top3_pos_feat_index = sorted(pos_evi_feat[largest_pos_evd_index], key=pos_evi_feat[largest_pos_evd_index].get,reverse=True)[0:3]

In [49]:
print ("First Feature: %s"%X_test.columns[top3_pos_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_pos_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_pos_feat_index[2]])

First Feature: Q109244_No
Second Feature: Q109244_Yes
Third Feature: Q98869_Yes


### 3.5) top 3 feature values that contribute the most to the negative evidence.

In [50]:
top3_neg_feat_index = sorted(neg_evi_feat[largest_pos_evd_index], key=neg_evi_feat[largest_pos_evd_index].get,reverse=True)[0:3]

In [51]:
print ("First Feature: %s"%X_test.columns[top3_neg_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_neg_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_neg_feat_index[2]])

First Feature: Income_$100,001 - $150,000
Second Feature: EducationLevel_Doctoral Degree
Third Feature: Q108343_Yes


## 4) The object that has the largest (in magnitude) negative evidence.

In [52]:
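## Note: total_neg_evi values are negative, so reverse=True puts the total closest to zero first; reverse=False would put the largest-magnitude negative evidence first
largest_neg_evd = sorted(total_neg_evi, key=total_neg_evi.get,reverse=True)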
largest_neg_evd_index = largest_neg_evd[0]
largest_neg_evd_obj = X_test.iloc[largest_neg_evd[0]]

In [53]:
largest_neg_evd_obj

EducationLevel_Associate's Degree              0
EducationLevel_Bachelor's Degree               1
EducationLevel_Current K-12                    0
EducationLevel_Current Undergraduate           0
EducationLevel_Doctoral Degree                 0
EducationLevel_High School Diploma             0
EducationLevel_Master's Degree                 0
Gender_Female                                  1
Gender_Male                                    0
HouseholdStatus_Domestic Partners (no kids)    0
HouseholdStatus_Domestic Partners (w/kids)     0
HouseholdStatus_Married (no kids)              0
HouseholdStatus_Married (w/kids)               0
HouseholdStatus_Single (no kids)               1
HouseholdStatus_Single (w/kids)                0
Income_$100,001 - $150,000                     0
Income_$25,001 - $50,000                       0
Income_$50,000 - $74,999                       0
Income_$75,000 - $100,000                      0
Income_over $150,000                           0
Income_under $25,000

### 4.1) the total positive log-evidence

In [54]:
print ("The total positive log-evidence: %f"%total_pos_evi[largest_neg_evd_index])

The total positive log-evidence: 8.616982


### 4.2) the total negative log-evidence

In [55]:
print ("The total negative log-evidence: %f"%total_neg_evi[largest_neg_evd_index])

The total negative log-evidence: -1.910867


### 4.3) Probability Distribution

In [56]:
print ("Class-1 Probability:%f"%test_prob[largest_neg_evd_index][0])
print ("Class-0 Probability: %f"%test_prob[largest_neg_evd_index][1])

Class-1 Probability:0.998778
Class-0 Probability: 0.001222


### 4.4) top 3 feature values that contribute most to the positive evidence

In [57]:
top3_pos_feat_index = sorted(pos_evi_feat[largest_neg_evd_index], key=pos_evi_feat[largest_neg_evd_index].get,reverse=True)[0:3]

In [58]:
print ("First Feature: %s"%X_test.columns[top3_pos_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_pos_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_pos_feat_index[2]])

First Feature: Q109244_No
Second Feature: Q109244_Yes
Third Feature: Gender_Male


### 4.5) top 3 feature values that contribute the most to the negative evidence.

In [59]:
top3_neg_feat_index = sorted(neg_evi_feat[largest_neg_evd_index], key=neg_evi_feat[largest_neg_evd_index].get,reverse=True)[0:3]
print ("First Feature: %s"%X_test.columns[top3_neg_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_neg_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_neg_feat_index[2]])

First Feature: Income_$100,001 - $150,000
Second Feature: EducationLevel_Doctoral Degree
Third Feature: Q111580_Supportive


## 5) The most uncertain object (the probabilities are closest to 0.5)

In [60]:
# The most uncertain object with respect to the probabilities
print (min(enumerate(test_prob[:,0]),key=lambda x: abs(x[1]-0.5)))
print (min(enumerate(test_prob[:,1]),key=lambda x: abs(x[1]-0.5)))

(1464, 0.49980092339604293)
(1464, 0.5001990766039609)
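
An equivalent way to locate the extreme and uncertain objects is with `np.argmax` and `np.argmin`, keeping each index in a variable so that later cells do not need a typed-in literal. A minimal sketch on made-up probabilities (the `toy_prob` array and the index names are hypothetical):

```python
import numpy as np

# Hypothetical predict_proba output: one row per object, each row sums to 1.
toy_prob = np.array([
    [0.90, 0.10],
    [0.48, 0.52],
    [0.30, 0.70],
    [0.505, 0.495],
])

# Most uncertain object: column-0 probability closest to 0.5.
uncertain_idx = int(np.argmin(np.abs(toy_prob[:, 0] - 0.5)))

# Most confident objects for each class.
most_pos_idx = int(np.argmax(toy_prob[:, 0]))
most_neg_idx = int(np.argmax(toy_prob[:, 1]))

print(uncertain_idx, most_pos_idx, most_neg_idx)
```

With two classes the two columns are complementary, so searching either column for the value closest to 0.5 gives the same object.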


In [61]:
X_test.iloc[1464]

EducationLevel_Associate's Degree              0
EducationLevel_Bachelor's Degree               0
EducationLevel_Current K-12                    0
EducationLevel_Current Undergraduate           0
EducationLevel_Doctoral Degree                 0
EducationLevel_High School Diploma             0
EducationLevel_Master's Degree                 1
Gender_Female                                  0
Gender_Male                                    1
HouseholdStatus_Domestic Partners (no kids)    0
HouseholdStatus_Domestic Partners (w/kids)     0
HouseholdStatus_Married (no kids)              0
HouseholdStatus_Married (w/kids)               1
HouseholdStatus_Single (no kids)               0
HouseholdStatus_Single (w/kids)                0
Income_$100,001 - $150,000                     0
Income_$25,001 - $50,000                       0
Income_$50,000 - $74,999                       0
Income_$75,000 - $100,000                      0
Income_over $150,000                           1
Income_under $25,000

### 5.1) the total positive log-evidence

In [62]:
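## Note: index 828 is hard-coded in this section, while the cell above reports 1464 as the most uncertain object
print ("The total positive log-evidence: %f"%total_pos_evi[828])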

The total positive log-evidence: 4.919405


### 5.2) the total negative log-evidence

In [63]:
print ("The total negative log-evidence: %f"%total_neg_evi[828])

The total negative log-evidence: -4.937221


### 5.3) Probability Distribution

In [64]:
print ("Class-1 Probability:%f"%test_prob[828][0])
print ("Class-0 Probability: %f"%test_prob[828][1])

Class-1 Probability:0.495546
Class-0 Probability: 0.504454


### 5.4) top 3 feature values that contribute most to the positive evidence

In [65]:
top3_pos_feat_index = sorted(pos_evi_feat[828], key=pos_evi_feat[828].get,reverse=True)[0:3]

In [66]:
print ("First Feature: %s"%X_test.columns[top3_pos_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_pos_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_pos_feat_index[2]])

First Feature: Q101163_Dad
Second Feature: Q101163_Mom
Third Feature: EducationLevel_Current Undergraduate


### 5.5) top 3 feature values that contribute the most to the negative evidence.

In [67]:
top3_neg_feat_index = sorted(neg_evi_feat[828], key=neg_evi_feat[828].get,reverse=True)[0:3]
print ("First Feature: %s"%X_test.columns[top3_neg_feat_index[0]])
print ("Second Feature: %s"%X_test.columns[top3_neg_feat_index[1]])
print ("Third Feature: %s"%X_test.columns[top3_neg_feat_index[2]])

First Feature: Income_$100,001 - $150,000
Second Feature: EducationLevel_Doctoral Degree
Third Feature: Q108343_Yes


In [68]:
## End of Ipython Notebook