### Dealing with Missing Data

In [1]:
polling = read.csv('./dataset/PollingData.csv')
str(polling)

'data.frame':	145 obs. of  7 variables:
 $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
 $ Year      : int  2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
 $ Rasmussen : int  11 21 NA 16 5 5 8 7 10 NA ...
 $ SurveyUSA : int  18 25 NA NA 15 NA NA 5 NA NA ...
 $ DiffCount : int  5 5 1 6 8 9 4 8 5 2 ...
 $ PropR     : num  1 1 1 1 1 ...
 $ Republican: int  1 1 1 1 1 1 1 1 1 1 ...


In [2]:
table(polling$Year)


2004 2008 2012 
  50   50   45 

In [3]:
summary(polling)

         State          Year        Rasmussen          SurveyUSA       
 Arizona    :  3   Min.   :2004   Min.   :-41.0000   Min.   :-33.0000  
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.0000   1st Qu.:-11.7500  
 California :  3   Median :2008   Median :  1.0000   Median : -2.0000  
 Colorado   :  3   Mean   :2008   Mean   :  0.0404   Mean   : -0.8243  
 Connecticut:  3   3rd Qu.:2012   3rd Qu.:  8.5000   3rd Qu.:  8.0000  
 Florida    :  3   Max.   :2012   Max.   : 39.0000   Max.   : 30.0000  
 (Other)    :127                  NA's   :46         NA's   :71        
   DiffCount           PropR          Republican    
 Min.   :-19.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: -6.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :  1.000   Median :0.6250   Median :1.0000  
 Mean   : -1.269   Mean   :0.5259   Mean   :0.5103  
 3rd Qu.:  4.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   : 11.000   Max.   :1.0000   Max.   :1.0000  
                                                    

#### Simple Approaches to Missing Data

- Delete the observations with missing values
    - this would throw away more than 50% of the data
    - we want to make predictions for all states, not only those with complete polling
- Delete the variables with missing values
    - we want to retain the data from Rasmussen and SurveyUSA
- Fill missing data points with average values
    - the average value for a poll is close to 0 (a tie between Democrat and Republican)
    - if the other polls in a state favor one candidate, the missing one probably would have too, so 0 is a poor guess
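
The drawbacks listed above can be seen on a small example. The sketch below uses a hypothetical six-row data frame `toy` (invented for illustration, not taken from `PollingData.csv`) and applies each of the three simple approaches in base R.

```R
# Hypothetical toy data: poll margins from two pollsters, with gaps
toy <- data.frame(
  Rasmussen = c(11, NA, 5, -12, NA, -20),
  SurveyUSA = c(18, 25, NA, -15, NA, -24)
)

# Approach 1: delete every observation that has a missing value
kept <- toy[complete.cases(toy), ]
frac_dropped <- 1 - nrow(kept) / nrow(toy)

# Approach 2: delete every variable that has a missing value
no_na_cols <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# Approach 3: fill each column's gaps with that column's mean
mean_filled <- as.data.frame(lapply(toy, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

frac_dropped      # share of rows lost by approach 1
ncol(no_na_cols)  # number of poll columns left by approach 2
mean_filled[2, ]  # row 2: SurveyUSA is strongly positive, the fill is not
```

In this toy data, row 2 shows the weakness of mean filling: the observed SurveyUSA margin is strongly positive, yet the filled Rasmussen value is just the column average and ignores it.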

#### Multiple Imputation

- Fill in missing values based on non-missing values
    - If Rasmussen is very negative, then a missing SurveyUSA value will likely be negative too
    - Just like `sample.split`, the results differ between runs unless you fix the random seed
- Although the method is mathematically complicated, it is easy to use through R's libraries
- We will use the Multiple Imputation by Chained Equations (`mice`) package

The package must be installed first:
```R
install.packages("mice")
```
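
The idea of filling a missing value from the non-missing ones can be sketched in base R before using the package. This is a deliberately simplified stand-in, not what `mice` does internally: it fits one regression and fills each gap with a single deterministic prediction, whereas `mice` cycles through the variables and adds randomness to its imputations. The data frame `toy` is hypothetical.

```R
# Hypothetical toy data: SurveyUSA is missing for the last two rows
toy <- data.frame(
  Rasmussen = c(-20, -12, -5, 3, 11, 21, -15, 16),
  SurveyUSA = c(-24, -15, -3, 5, 18, 25, NA, NA)
)

# Learn how SurveyUSA relates to Rasmussen from the complete rows
observed <- !is.na(toy$SurveyUSA)
fit <- lm(SurveyUSA ~ Rasmussen, data = toy[observed, ])

# Fill each gap with the value predicted from that row's Rasmussen margin
filled <- toy
filled$SurveyUSA[!observed] <- predict(fit, newdata = toy[!observed, ])

filled[!observed, ]
```

The row with a very negative Rasmussen margin receives a negative SurveyUSA fill and the row with a positive margin receives a positive one, which is the behavior the first bullet describes.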

In [4]:
library("mice")


Attaching package: ‘mice’


The following objects are masked from ‘package:base’:

    cbind, rbind




In [5]:
simple = polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]
summary(simple)

   Rasmussen          SurveyUSA            PropR          DiffCount      
 Min.   :-41.0000   Min.   :-33.0000   Min.   :0.0000   Min.   :-19.000  
 1st Qu.: -8.0000   1st Qu.:-11.7500   1st Qu.:0.0000   1st Qu.: -6.000  
 Median :  1.0000   Median : -2.0000   Median :0.6250   Median :  1.000  
 Mean   :  0.0404   Mean   : -0.8243   Mean   :0.5259   Mean   : -1.269  
 3rd Qu.:  8.5000   3rd Qu.:  8.0000   3rd Qu.:1.0000   3rd Qu.:  4.000  
 Max.   : 39.0000   Max.   : 30.0000   Max.   :1.0000   Max.   : 11.000  
 NA's   :46         NA's   :71                                           

In [6]:
set.seed(144)

In [7]:
imputed = complete(mice(simple))


 iter imp variable
  1   1  Rasmussen  SurveyUSA
  1   2  Rasmussen  SurveyUSA
  1   3  Rasmussen  SurveyUSA
  1   4  Rasmussen  SurveyUSA
  1   5  Rasmussen  SurveyUSA
  2   1  Rasmussen  SurveyUSA
  2   2  Rasmussen  SurveyUSA
  2   3  Rasmussen  SurveyUSA
  2   4  Rasmussen  SurveyUSA
  2   5  Rasmussen  SurveyUSA
  3   1  Rasmussen  SurveyUSA
  3   2  Rasmussen  SurveyUSA
  3   3  Rasmussen  SurveyUSA
  3   4  Rasmussen  SurveyUSA
  3   5  Rasmussen  SurveyUSA
  4   1  Rasmussen  SurveyUSA
  4   2  Rasmussen  SurveyUSA
  4   3  Rasmussen  SurveyUSA
  4   4  Rasmussen  SurveyUSA
  4   5  Rasmussen  SurveyUSA
  5   1  Rasmussen  SurveyUSA
  5   2  Rasmussen  SurveyUSA
  5   3  Rasmussen  SurveyUSA
  5   4  Rasmussen  SurveyUSA
  5   5  Rasmussen  SurveyUSA


In [8]:
summary(imputed)

   Rasmussen         SurveyUSA           PropR          DiffCount      
 Min.   :-41.000   Min.   :-33.000   Min.   :0.0000   Min.   :-19.000  
 1st Qu.: -8.000   1st Qu.:-11.000   1st Qu.:0.0000   1st Qu.: -6.000  
 Median :  3.000   Median :  1.000   Median :0.6250   Median :  1.000  
 Mean   :  2.786   Mean   :  2.014   Mean   :0.5259   Mean   : -1.269  
 3rd Qu.: 13.000   3rd Qu.: 18.000   3rd Qu.:1.0000   3rd Qu.:  4.000  
 Max.   : 39.000   Max.   : 30.000   Max.   :1.0000   Max.   : 11.000  

In [9]:
polling$Rasmussen = imputed$Rasmussen
polling$SurveyUSA = imputed$SurveyUSA
summary(polling)

         State          Year        Rasmussen         SurveyUSA      
 Arizona    :  3   Min.   :2004   Min.   :-41.000   Min.   :-33.000  
 Arkansas   :  3   1st Qu.:2004   1st Qu.: -8.000   1st Qu.:-11.000  
 California :  3   Median :2008   Median :  3.000   Median :  1.000  
 Colorado   :  3   Mean   :2008   Mean   :  2.786   Mean   :  2.014  
 Connecticut:  3   3rd Qu.:2012   3rd Qu.: 13.000   3rd Qu.: 18.000  
 Florida    :  3   Max.   :2012   Max.   : 39.000   Max.   : 30.000  
 (Other)    :127                                                     
   DiffCount           PropR          Republican    
 Min.   :-19.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: -6.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :  1.000   Median :0.6250   Median :1.0000  
 Mean   : -1.269   Mean   :0.5259   Mean   :0.5103  
 3rd Qu.:  4.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   : 11.000   Max.   :1.0000   Max.   :1.0000  
                                                    

### Sophisticated Baseline Method

In [10]:
train = subset(polling, Year == 2004 | Year == 2008)
test = subset(polling, Year == 2012)

In [11]:
table(train$Republican)


 0  1 
47 53 

In [12]:
table(sign(train$Rasmussen))


-1  0  1 
42  2 56 

In [13]:
table(train$Republican, sign(train$Rasmussen))

   
    -1  0  1
  0 42  1  4
  1  0  1 52
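
As a worked reading of the two tables above, the sketch below re-enters the cross-tabulation by hand and compares the sign-of-Rasmussen baseline with the naive rule of always predicting the majority class. The variable names are illustrative, and the counts are copied from the output rather than recomputed from `train`.

```R
# Rows: actual Republican (0/1); columns: sign(Rasmussen) (-1/0/1)
baseline <- matrix(c(42, 1,  4,
                      0, 1, 52),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Republican = c("0", "1"),
                                   Sign = c("-1", "0", "1")))

correct      <- baseline["0", "-1"] + baseline["1", "1"]  # poll sign matches outcome
inconclusive <- sum(baseline[, "0"])                      # poll shows a tie
mistakes     <- baseline["0", "1"] + baseline["1", "-1"]  # poll sign is wrong

# Naive baseline: always predict the more common outcome in train
naive_accuracy <- max(rowSums(baseline)) / sum(baseline)
smart_accuracy <- correct / sum(baseline)

c(correct = correct, inconclusive = inconclusive, mistakes = mistakes)
c(naive = naive_accuracy, smart = smart_accuracy)
```

Counting the two ties as not correct, the sign-based baseline is right on 94 of the 100 training observations, against 53 for the naive rule, so it is the more demanding benchmark for the models that follow.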

### Logistic Regression Models

In [14]:
cor(train)

ERROR: Error in cor(train): 'x' must be numeric


In [15]:
str(train)

'data.frame':	100 obs. of  7 variables:
 $ State     : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 4 4 5 5 ...
 $ Year      : int  2004 2008 2004 2008 2004 2008 2004 2008 2004 2008 ...
 $ Rasmussen : int  11 21 39 16 5 5 7 10 -11 -27 ...
 $ SurveyUSA : int  18 25 24 19 15 3 5 18 -11 -24 ...
 $ DiffCount : int  5 5 1 6 8 9 8 5 -8 -5 ...
 $ PropR     : num  1 1 1 1 1 1 1 1 0 0 ...
 $ Republican: int  1 1 1 1 1 1 1 1 0 0 ...


In [16]:
cor(train[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")])

           Rasmussen SurveyUSA     PropR DiffCount Republican
Rasmussen  1.0000000 0.9127481 0.8356056 0.4926308  0.7908133
SurveyUSA  0.9127481 1.0000000 0.8869625 0.5695477  0.8418046
PropR      0.8356056 0.8869625 1.0000000 0.8273785  0.9484204
DiffCount  0.4926308 0.5695477 0.8273785 1.0000000  0.8092777
Republican 0.7908133 0.8418046 0.9484204 0.8092777  1.0000000


In [17]:
mod1 = glm(Republican ~ PropR, data=train, family="binomial")
summary(mod1)


Call:
glm(formula = Republican ~ PropR, family = "binomial", data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.22880  -0.06541   0.10260   0.10260   1.37392  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -6.146      1.977  -3.108 0.001882 ** 
PropR         11.390      3.153   3.613 0.000303 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.269  on 99  degrees of freedom
Residual deviance:  15.772  on 98  degrees of freedom
AIC: 19.772

Number of Fisher Scoring iterations: 8


In [18]:
pred1 = predict(mod1, type="response")

In [19]:
table(train$Republican, pred1 >= 0.5)

   
    FALSE TRUE
  0    45    2
  1     2   51

In [20]:
mod2 = glm(Republican ~ SurveyUSA + DiffCount, data=train, family="binomial")
summary(mod2)


Call:
glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial", 
    data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.96335  -0.01207   0.01526   0.06363   1.50373  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.9991     1.2437  -0.803   0.4218  
SurveyUSA     0.2583     0.1454   1.777   0.0756 .
DiffCount     0.7388     0.4464   1.655   0.0979 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.269  on 99  degrees of freedom
Residual deviance:  10.749  on 97  degrees of freedom
AIC: 16.749

Number of Fisher Scoring iterations: 9


In [21]:
pred2 = predict(mod2, type='response')
table(train$Republican, pred2 >= 0.5)

   
    FALSE TRUE
  0    45    2
  1     1   52

### Test Set Predictions

In [23]:
table(test$Republican, sign(test$Rasmussen))

   
    -1  0  1
  0 18  2  4
  1  0  0 21

In [24]:
testPred = predict(mod2, newdata=test, type='response')
table(test$Republican, testPred >= 0.5)

   
    FALSE TRUE
  0    23    1
  1     0   21
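
To compare the two test-set tables directly, the sketch below re-enters both by hand and counts correct predictions. The names are illustrative, and the counts are copied from the outputs above rather than recomputed from `test`.

```R
# Baseline on the test set: rows = actual Republican, columns = sign(Rasmussen)
baseline_test <- matrix(c(18, 2,  4,
                           0, 0, 21),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Republican = c("0", "1"),
                                        Sign = c("-1", "0", "1")))

# mod2 on the test set: rows = actual Republican, columns = testPred >= 0.5
model_test <- matrix(c(23,  1,
                        0, 21),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(Republican = c("0", "1"),
                                     Predicted = c("FALSE", "TRUE")))

baseline_correct <- baseline_test["0", "-1"] + baseline_test["1", "1"]
model_correct    <- sum(diag(model_test))
model_errors     <- sum(model_test) - model_correct
model_accuracy   <- model_correct / sum(model_test)

c(baseline_correct = baseline_correct, model_correct = model_correct,
  model_errors = model_errors)
model_accuracy
```

On the 45 test observations, the baseline gets 39 right with 2 ties and 4 mistakes, while `mod2` gets 44 right and makes a single mistake.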

In [25]:
subset(test, testPred >= 0.5 & Republican == 0)

     State Year Rasmussen SurveyUSA DiffCount     PropR Republican
24 Florida 2012         2         0         6 0.6666667          0
