First we need to install the pollstR package. This package, and all of the pre-processing code, uses R. We chose R because it makes visualizing data easy, which greatly sped up the cleaning process. The pollstR package lets us pull polls directly from the Huffington Post Pollster API.

In [1]:
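install.packages('pollstR')
library('pollstR')
library('readr') ### provides read_csv() and write_csv(), used below (assumes readr is already installed)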

"unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/3.5/PACKAGES'"

package 'pollstR' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\xon12\AppData\Local\Temp\RtmpKqEKLN\downloaded_packages
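
If the install step is rerun often, it can help to install only when the package is missing and to name the mirror explicitly. The helper below is a minimal sketch: `ensure_package` is a hypothetical name (it is not part of pollstR), and the mirror URL is just one common choice.

```r
# Hypothetical helper (not part of pollstR): install a package only if it is missing.
# The default mirror is an assumption; any CRAN mirror works.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  # TRUE if the package can be loaded after the (possible) install
  requireNamespace(pkg, quietly = TRUE)
}

# Example call, left commented out so the sketch has no side effects:
# ensure_package("pollstR")
```

The function returns `TRUE` or `FALSE`, so a failed install can be caught before `library()` is called.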


Now we need to import the polls. For convenience, I defined a few things first. Since polls are indexed by date, I included the date of each election in my set. I also included a function that lists each poll for a given year (note that the year needs to be entered as a prefix of the form "201_-", for example '2012-'). This function greatly simplified identifying polls that we wanted to exclude.

In [None]:
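### returns, for each poll object, the part of its chart identifier that follows the date prefix (e.g. '2012-')
poll_lister <- function(Poll_obj,date){
  polls <- c()
  for(i in seq_along(Poll_obj)){
    polls <- c(polls,strsplit(as.character(Poll_obj[[i]]$question$charts[[1]]),date)[[1]][2])
  }
  return(polls)
}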



### dates
date_2014 <- '2014-11-04'
date_2012 <- '2012-11-06'
date_2016 <- '2016-11-08'


### imports polls

polls_2012 <- pollster_charts(tag = '2012-senate', election_date = date_2012)$content$items

polls_2014 <- pollster_charts(tag = '2014-senate', election_date = date_2014)$content$items

polls_2016 <- pollster_charts(tag = '2016-senate', election_date = date_2016)$content$items
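
To see what `poll_lister` returns without calling the API, here is a small sketch on mock data. The list layout mirrors the `$question$charts[[1]]` access used above, but the identifier strings themselves are made up for illustration, and `poll_lister_v` is a hypothetical vectorized variant, not a pollstR function.

```r
# Mock poll objects; the identifier format is an assumption for illustration only
mock_charts <- list(
  list(question = list(charts = list("2012-ohio-senate-mandel-vs-brown"))),
  list(question = list(charts = list("2012-texas-senate-cruz-vs-sadler")))
)

# Same logic as poll_lister, written with vapply instead of growing a vector
poll_lister_v <- function(poll_obj, date) {
  vapply(poll_obj, function(p) {
    # split on the date prefix and keep the piece that follows it
    strsplit(as.character(p$question$charts[[1]]), date, fixed = TRUE)[[1]][2]
  }, character(1))
}

listed <- poll_lister_v(mock_charts, "2012-")
print(listed)
```

Because the prefix sits at the start of each string, `strsplit` returns an empty first piece and the remainder as the second piece, which is why element `[2]` is kept.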

We now begin the cleaning procedure. Each year is handled slightly differently. The first set of polls we clean is from 2012. Our pipeline defines a dataframe with our desired features, then excludes any items that we don't want from our collection of polls. The exclusions account for three things: "jungle primaries", where the election goes to a runoff in December if no candidate receives over 50% of the vote (some southern states use this system); polls conducted before the end of the primary (these often included matchups of candidates that didn't compete in the general election); and states with electoral oddities (for example, in the 2014 Kansas Senate election, an independent faced off with a Republican after the Democrat withdrew from the race, but no head-to-head polls of that candidate and the Republican were conducted).
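
The exclusion step can be sketched on toy data. The identifiers below are invented, and the sketch drops entries by name with `%in%` rather than by sorted position, which is an alternative to the positional approach used in this notebook, not a description of it.

```r
# Toy chart identifiers (made up for illustration)
slugs <- c("ohio-senate", "louisiana-senate-primary", "texas-senate", "kansas-senate")
polls <- as.list(slugs)

# Races we want to leave out
exclude <- c("louisiana-senate-primary", "kansas-senate")

# Keep every poll whose identifier is not in the exclusion list
kept <- polls[!(slugs %in% exclude)]
print(unlist(kept))

# For comparison: negative indexing with match() on an empty exclusion list
none_excluded <- polls[-match(character(0), slugs)]
```

One thing to keep in mind with the `-match(...)` form: if the exclusion list is ever empty, `match` returns a zero-length index and the negative subset returns an empty list instead of the full one. The `%in%` form keeps everything in that case.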

We then append our election results (the name of the leading candidate and whether they won or lost) for each state. We do this by checking the name of the election winner against the first candidate listed in the poll: if they are the same, we record a 1; if they are different, we record a 0. We then drop some unnecessary features included in the original poll objects and find the party affiliation of the first candidate in our poll. If that candidate is a Republican, we leave the poll as is; if they are a Democrat, we flip the order of the candidates (so it is Republican, then Democrat) and flip the result. Now our results tell us whether the Republican candidate won the election, and our polls have the Republican listed first. We then convert our date variables to the number of days before the election. Lastly, we convert our method, population type, and polling affiliation variables to binary variables, convert our polling percentages to a Republican margin (how many percentage points the Republican leads the Democrat by), and then drop the state variable from the final dataframe (since the states in our training set don't overlap heavily with our test set).
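
The reorientation and date steps can be illustrated on a two-row toy frame. The numbers are invented, and `orient_republican` is a hypothetical helper that follows the description above; the real pipeline works on the columns returned by the API.

```r
# Toy poll rows with the Democrat listed first (values are made up)
poll <- data.frame(
  State = "Ohio",
  Democrat = c(50, 48),
  Republican = c(44, 45),
  Start = c("2012-10-01", "2012-10-20"),
  Result = 1L,  # 1 = the first-listed candidate (here the Democrat) won
  stringsAsFactors = FALSE
)

# Put the Republican first and make Result refer to the Republican
orient_republican <- function(df) {
  if (colnames(df)[3] == "Republican") {
    df <- df[, c(1, 3, 2, 4:ncol(df))]  # swap the two candidate columns
    df$Result <- 1L - df$Result         # a win for the other side is a Republican loss
  }
  df
}

oriented <- orient_republican(poll)

# Days from the start of the poll to election day
oriented$Start_DFE <- as.integer(as.Date("2012-11-06") - as.Date(oriented$Start))
print(oriented)
```

Subtracting two `Date` objects gives a difference in days, so `as.integer` turns it into a plain day count.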

Thus, our final data will have the Republican margin, the number of days before the election that the poll began, the number of days before the election that the poll ended, the sample size, a binary variable for whether the voters were registered, a binary variable for whether the poll had no partisan affiliation, binary variables for whether the poll was conducted by Democrats or Republicans, several method variables (whether the poll was conducted online, over a live phone call, by an automated phone system, or by an interactive voice response (IVR) system), and who won the election (binary: Republican or Democrat).
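
As a small illustration of two of these features, the sketch below builds 0/1 indicator columns from a categorical column and computes the Republican margin. The data are invented and the approach (one column per distinct value, via `sapply`) is a compact stand-in for the loops used in this notebook.

```r
# Toy polls (values are made up)
polls <- data.frame(
  Republican = c(44, 47, 51),
  Democrat = c(50, 45, 46),
  Method = c("Live Phone", "Internet", "Live Phone"),
  stringsAsFactors = FALSE
)

# One 0/1 column per distinct method; column names are the method labels
levels_seen <- unique(polls$Method)
indicators <- sapply(levels_seen, function(m) as.integer(polls$Method == m))
polls <- cbind(polls, indicators)

# Republican margin in percentage points (negative when the Democrat leads)
polls$rep_margin <- polls$Republican - polls$Democrat
print(polls)
```

Note that this sketch assumes at least two distinct methods; with a single method, `sapply` returns a plain vector rather than a matrix.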

The changes in the cleaning techniques between the years mostly concern which polls we eliminate, since that had to be done manually. However, each set of polls also has a few particularities of its own: we eliminated some extraneous methods and population types that only appear once or twice and are unique to that year, such as mail polls or surveys of Republican-only voters.
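
Dropping rare categories is easiest with a single logical filter over the whole frame. The following is a sketch on invented rows; the category labels are the ones mentioned in this notebook, but the vectors `drop_subpops` and `drop_methods` are names introduced here for illustration.

```r
# Toy polls (values are made up)
polls <- data.frame(
  Subpop = c("Likely Voters", "Likely Voters - Republican", "Registered Voters", "Adults"),
  Method = c("Live Phone", "Mail", "Internet", "Live Phone"),
  stringsAsFactors = FALSE
)

# Categories that are too rare to keep
drop_subpops <- c("Likely Voters - Republican", "Registered Voters - Republican", "Adults")
drop_methods <- c("Mail", "Mixed")

# Build one TRUE/FALSE value per row, then subset once
keep <- !(polls$Subpop %in% drop_subpops) & !(polls$Method %in% drop_methods)
cleaned <- polls[keep, ]
print(cleaned)
```

Filtering in one step avoids deleting rows inside a loop, where the row positions shift after every deletion.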

In [None]:
### 2012


### creates data array
full_data <- matrix(ncol = 12)
full_data <- data.frame(full_data)

names <- c("State","Republican","Democrat","Pollster","Start_DFE","End_DFE","Subpop","Sample_size", "Method", "Partisanship","Affiliation","Result")

colnames(full_data) <- names

### removes polls before primary, states with jungle primaries, and kansas due to an independent running and no polling directly 
# comparing him to the republican nominee
polls_2012 <- pollster_charts(tag = '2012-senate', election_date = date_2012)$content$items
exclude <- sort(poll_lister(polls_2012,'2012-'))[c(3,5,6,7:10,11,13,18,20,21,23,24,26,29,32,33,35,39,41,42,47:49)]
polls_2012 <- polls_2012[-match(exclude,poll_lister(polls_2012,'2012-'))]


polls_2012 <- polls_2012[-10] ## excludes oddly coded Penn race
polls_2012 <- polls_2012[-3] ### removes maryland for similar reasons

### imports results
X2012_results <- read_csv("~/2012_results.csv", col_names = FALSE)
num_of_races <- nrow(X2012_results)


### processes the data for the rest of the states

for(i in 1:length(polls_2012)){
  poll <- polls_2012[[i]]
  data <- pollster_charts_polls(poll$slug)$content

# adds state and candidate names to the data from the results (specifically coded to prevent a bug with west virginia being read 
# as both west virginia and virginia)
    
  for(j in 1:num_of_races){
    state_name <- as.character(X2012_results[j,1])
    candidate_name <- as.character(X2012_results[j,2])
    if(state_name != 'Virginia'){
      if(grepl(state_name,poll$title) ==TRUE){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    else{
      if((grepl('West Virginia',poll$title) ==FALSE) && (grepl(state_name,poll$title) ==TRUE)){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    }
    

  data <- cbind(data,as.integer(result))


#######################################################

  ### clears unneeded variables 
  ind1 <- which(colnames(data) == 'poll_slug') ## poll ID
  ind2 <- which(colnames(data) == 'question_text') ## question text -- this is often empty
  ind3 <- which(colnames(data) == 'margin_of_error') ## margin of error (many polls don't compute)
  ind0 <- 4:(ind1-1) ### non-top two candidates (we don't care about 3rd party candidates)
  to_delete <- c(ind0,ind1,ind2,ind3) ### deletes those 
  data <- data[,-to_delete]


  ### replaces candidate names with party affiliation
  colnames(data)[2] <- poll$question$responses[[1]]$party
  colnames(data)[3] <- poll$question$responses[[2]]$party

  n <- length(colnames(data))

  ### if candidate order is dem, rep (or independent, rep), swaps the two candidate columns
  if(colnames(data)[3] == 'Republican') { 
      data <- data[,c(1,3,2,4:n)]
      ### flips the result so that it tracks whether the Republican won
      data[,n] <- abs(data[,n]-1)}
  
### date from election
  data[,5] <- as.integer(as.Date(date_2012)-as.Date(data[,5]))
  data[,6] <- as.integer(as.Date(date_2012)-as.Date(data[,6]))

  colnames(data) <- names
    
  full_data <- rbind(full_data,data)}

full_data <- full_data[,-10]
full_data <- full_data[-1,]
rownames(full_data) <- as.character(sort(as.integer(rownames(full_data))-1))


#################################################################################################
### converts the subpop,method,affiliation categorical variables to binary variables

subpop <- unique(full_data[,'Subpop'])
method <- unique(full_data[,'Method'])
affiliation <- unique(full_data[,'Affiliation'])



data_indicator1 <- matrix(ncol = length(subpop), nrow = nrow(full_data))
data_indicator1 <- data.frame(data_indicator1)

data_indicator2 <- matrix(ncol = length(method), nrow = nrow(full_data))
data_indicator2 <- data.frame(data_indicator2)

data_indicator3 <- matrix(ncol = length(affiliation), nrow = nrow(full_data))
data_indicator3 <- data.frame(data_indicator3)


colnames(data_indicator1) <- subpop
colnames(data_indicator2) <- method
colnames(data_indicator3) <- affiliation


for(i in 1:nrow(full_data)) {
  for(j in 1:length(subpop)) data_indicator1[i,j] <- as.integer(full_data[i,'Subpop'] == subpop[j])
  for(j in 1:length(method)) data_indicator2[i,j] <- as.integer(full_data[i,'Method'] == method[j])
  for(j in 1:length(affiliation)) data_indicator3[i,j] <- as.integer(full_data[i,'Affiliation'] == affiliation[j])
}


data_indicator <- cbind(data_indicator1,data_indicator2,data_indicator3)

### adds to data frame and removes old categorical variables
full_data <- full_data[,-c(7,9,10)] 
full_data <- cbind(full_data,data_indicator)
full_data <- full_data[,c(1:7,9:21,8)]

### deletes mixed-method polls and polls of Republican-only subpopulations, as they account for at most 2-3 polls in the dataset
### (done in one vectorized step, after the indicator columns exist, so the column positions used above and below stay valid)
drop_cols <- intersect(c('Mixed','Likely Voters - Republican','Registered Voters - Republican'),colnames(full_data))
drop_rows <- rowSums(as.matrix(full_data[,drop_cols,drop = FALSE]) == 1, na.rm = TRUE) > 0
full_data <- full_data[!drop_rows,]

### there are a lot of overlapping survey type variables (live phone + online), so we want a variable for the presence of each 
# method, with the possibility of there being ones in multiple of these variables

method2 <- c('Online','Live Phone', 'Automated Phone', 'IVR')
data_indicator4 <- matrix(ncol =4, nrow = nrow(full_data))
data_indicator4 <- data.frame(data_indicator4)
data_indicator4[is.na(data_indicator4)] <- 0
for(i in 1:nrow(full_data)){
  if(full_data$`Live Phone`[i] == 1) data_indicator4[i,2] <- 1
  if(full_data$`Automated Phone`[i] == 1) data_indicator4[i,3] <- 1
  if(full_data$`IVR/Live Phone`[i] == 1) {
    data_indicator4[i,2] <- 1
    data_indicator4[i,4] <- 1
  }
  if(full_data$Internet[i] == 1) data_indicator4[i,1] <- 1
  if(full_data$`IVR/Online`[i] == 1) {
    data_indicator4[i,1] <- 1
    data_indicator4[i,4] <- 1}
  
}
colnames(data_indicator4) <- method2

full_data <- full_data[,-c(9:17)]
full_data <- cbind(full_data,data_indicator4)
full_data <- full_data[,c(1:11,13:16,12)]

# converts percentages to rep margin
rep_margin <- full_data[,2]-full_data[,3]
year <- 2012
full_data[,2] <- year
full_data[,3] <- rep_margin
colnames(full_data)[2] <- 'year'
colnames(full_data)[3] <- 'rep_margin'
write_csv(full_data, '2012_senate_results.csv')
senate_2012 <- full_data
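
The special case for Virginia in the loop above exists because "Virginia" is a substring of "West Virginia". One general way to handle such overlaps, sketched here under the assumption that the state name appears verbatim in the poll title, is to keep the longest state name that matches. `match_state` is a hypothetical helper and the titles are invented.

```r
# Return the longest state name found in the title, or NA if none is found
match_state <- function(title, states) {
  found <- vapply(states, function(s) grepl(s, title, fixed = TRUE), logical(1))
  hits <- states[found]
  if (length(hits) == 0) return(NA_character_)
  hits[which.max(nchar(hits))]
}

states <- c("Virginia", "West Virginia", "Ohio")
wv <- match_state("2012 West Virginia Senate: Raese vs. Manchin", states)
va <- match_state("2012 Virginia Senate: Allen vs. Kaine", states)
print(c(wv, va))
```

With this approach no state needs its own branch, at the cost of assuming that a longer match is always the intended one.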



We now do the same thing for 2014.

In [None]:
### 2014
```{r}
full_data <- matrix(ncol = 12)
full_data <- data.frame(full_data)

names <- c("State","Republican","Democrat","Pollster","Start_DFE","End_DFE","Subpop","Sample_size", "Method", "Partisanship","Affiliation","Result")

colnames(full_data) <- names


polls_2014 <- pollster_charts(tag = '2014-senate', election_date = date_2014)$content$items
### removes polls before primary, states with jungle primaries, and kansas due to an independent running and no polling directly comparing him to the republican nominee
exclude <- sort(poll_lister(polls_2014,'2014-'))[c(1,3,6:11,15,17,18,19:21,27,28,30,32,36,40,44)]
polls_2014 <- polls_2014[-match(exclude,poll_lister(polls_2014,'2014-'))]

polls_2014 <- polls_2014[-27] ### removes badly coded NC polls
X2014_results <- read_csv("~/2014_results.csv", col_names = FALSE)
num_of_races <- nrow(X2014_results)


### Rest of the states

for(i in 1:length(polls_2014)){
  poll <- polls_2014[[i]]
  data <- pollster_charts_polls(poll$slug)$content

  ### adds state name 
  
  for(j in 1:num_of_races){
    state_name <- as.character(X2014_results[j,1])
    candidate_name <- as.character(X2014_results[j,2])
    if(state_name != 'Virginia'){
      if(grepl(state_name,poll$title) ==TRUE){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    else{
      if((grepl('West Virginia',poll$title) ==FALSE) && (grepl(state_name,poll$title) ==TRUE)){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    }
    
      
  data <- cbind(data,as.integer(result))


#######################################################

  ### clears unneeded variables 
  ind1 <- which(colnames(data) == 'poll_slug') ## poll ID
  ind2 <- which(colnames(data) == 'question_text') ## question text -- this is often empty
  ind3 <- which(colnames(data) == 'margin_of_error') ## margin of error (many polls don't compute)
  ind0 <- 4:(ind1-1) ### non-top two candidates (we don't care about 3rd party candidates)
  to_delete <- c(ind0,ind1,ind2,ind3) ### deletes those 
  data <- data[,-to_delete]


  ### replaces candidate names with party affiliation
  colnames(data)[2] <- poll$question$responses[[1]]$party
  colnames(data)[3] <- poll$question$responses[[2]]$party

  n <- length(colnames(data))

  ### if candidate order is dem, rep (or independent, rep), swaps the two candidate columns
  if(colnames(data)[3] == 'Republican') { 
      data <- data[,c(1,3,2,4:n)]
      ### flips the result so that it tracks whether the Republican won
      data[,n] <- abs(data[,n]-1)}
  
  data[,5] <- as.integer(as.Date(date_2014)-as.Date(data[,5]))
  data[,6] <- as.integer(as.Date(date_2014)-as.Date(data[,6]))

  colnames(data) <- names
##########################################################
  full_data <- rbind(full_data,data)}

full_data <- full_data[,-10]
full_data <- full_data[-1,]
rownames(full_data) <- as.character(sort(as.integer(rownames(full_data))-1))


#################################################################################################

### converts the subpop,method,affiliation categorical variables to binary variables

subpop <- unique(full_data[,'Subpop'])
method <- unique(full_data[,'Method'])
affiliation <- unique(full_data[,'Affiliation'])


data_indicator1 <- matrix(ncol = length(subpop), nrow = nrow(full_data))
data_indicator1 <- data.frame(data_indicator1)

data_indicator2 <- matrix(ncol = length(method), nrow = nrow(full_data))
data_indicator2 <- data.frame(data_indicator2)

data_indicator3 <- matrix(ncol = length(affiliation), nrow = nrow(full_data))
data_indicator3 <- data.frame(data_indicator3)


colnames(data_indicator1) <- subpop
colnames(data_indicator2) <- method
colnames(data_indicator3) <- affiliation


for(i in 1:nrow(full_data)) {
  for(j in 1:length(subpop)) data_indicator1[i,j] <- as.integer(full_data[i,'Subpop'] == subpop[j])
  for(j in 1:length(method)) data_indicator2[i,j] <- as.integer(full_data[i,'Method'] == method[j])
  for(j in 1:length(affiliation)) data_indicator3[i,j] <- as.integer(full_data[i,'Affiliation'] == affiliation[j])
}

### combines "Other" into "None" since we only care about partisan affiliation
data_indicator3[,1] <- data_indicator3[,1]+data_indicator3[,4]
data_indicator3 <- data_indicator3[,-4]

data_indicator <- matrix(ncol = length(affiliation)+length(subpop)+length(method)-1, nrow = nrow(full_data))
data_indicator <- data.frame(data_indicator)
data_indicator <- cbind(data_indicator1,data_indicator2,data_indicator3)

### adds to data frame and removes old categorical variables
full_data <- full_data[,-c(7,9,10)] 
full_data <- cbind(full_data,data_indicator)
full_data <- full_data[,c(1:7,9:21,8)]

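### deletes "adult" subpop and mixed methods (one vectorized step instead of deleting rows inside a loop)
drop_cols <- intersect(c('Mixed','Adults'),colnames(full_data))
drop_rows <- rowSums(as.matrix(full_data[,drop_cols,drop = FALSE]) == 1, na.rm = TRUE) > 0
full_data <- full_data[!drop_rows,]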

method2 <- c('Online','Live Phone', 'Automated Phone', 'IVR')
data_indicator4 <- matrix(ncol =4, nrow = nrow(full_data))
data_indicator4 <- data.frame(data_indicator4)
data_indicator4[is.na(data_indicator4)] <- 0
for(i in 1:nrow(full_data)){
  if(full_data$`Live Phone`[i] == 1) data_indicator4[i,2] <- 1
  if(full_data$`Automated Phone`[i] == 1) data_indicator4[i,3] <- 1
  if(full_data$`IVR/Live Phone`[i] == 1) {
    data_indicator4[i,2] <- 1
    data_indicator4[i,4] <- 1
  }
  if(full_data$Internet[i] == 1) data_indicator4[i,1] <- 1
  if(full_data$`IVR/Online`[i] == 1) {
    data_indicator4[i,1] <- 1
    data_indicator4[i,4] <- 1}
  if(full_data$`Live Phone/Online`[i] == 1) {
    data_indicator4[i,1] <- 1
    data_indicator4[i,2] <- 1
  }
  
}
colnames(data_indicator4) <- method2

full_data <- full_data[,-c(8,10:17)]
full_data <- cbind(full_data,data_indicator4)
full_data <- full_data[,c(1:11,13:16,12)]
rep_margin <- full_data[,2]-full_data[,3]
year <- 2014
full_data[,2] <- year
full_data[,3] <- rep_margin
colnames(full_data)[2] <- 'year'
colnames(full_data)[3] <- 'rep_margin'
write_csv(full_data, '2014_senate_results.csv')
senate_2014 <- full_data
```

And the same again for 2016.

In [None]:
full_data <- matrix(ncol = 12)
full_data <- data.frame(full_data)

names <- c("State","Republican","Democrat","Pollster","Start_DFE","End_DFE","Subpop","Sample_size", "Method", "Partisanship","Affiliation","Result")

colnames(full_data) <- names

date_2016 <- '2016-11-08'
polls_2016 <- pollster_charts(tag = '2016-senate', election_date = date_2016)$content$items
polls_2016 <- polls_2016[-c(10,22,26,29,34)] ### removes polls before primary, california (since it was a dem v dem race), LA due to the jungle primary, and a duplicated nevada entry


X2016_results <- read_csv("~/2016_senate.csv", skip = 1)
X2016_results <- X2016_results[,-c(2,3,4,5,7,8,9)]
num_of_races <- nrow(X2016_results)


### Rest of the states

for(i in 1:length(polls_2016)){
  poll <- polls_2016[[i]]
  data <- pollster_charts_polls(poll$slug)$content

  ### adds state name 
  for(j in 1:num_of_races){
    state_name <- as.character(X2016_results[j,1])
    candidate_name <- as.character(X2016_results[j,2])
    if(state_name != 'Virginia'){
      if(grepl(state_name,poll$title) ==TRUE){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    else{
      if((grepl('West Virginia',poll$title) ==FALSE) && (grepl(state_name,poll$title) ==TRUE)){
        data <- cbind(rep(state_name,nrow(data)),data)
        result <- grepl(poll$question$responses[[1]]$label,candidate_name)}}
    }
    

  data <- cbind(data,as.integer(result))


#######################################################

  ### clears unneeded variables 
  ind1 <- which(colnames(data) == 'poll_slug') ## poll ID
  ind2 <- which(colnames(data) == 'question_text') ## question text -- this is often empty
  ind3 <- which(colnames(data) == 'margin_of_error') ## margin of error (many polls don't compute)
  ind0 <- 4:(ind1-1) ### non-top two candidates (we don't care about 3rd party candidates)
  to_delete <- c(ind0,ind1,ind2,ind3) ### deletes those 
  data <- data[,-to_delete]


  ### replaces candidate names with party affiliation
  colnames(data)[2] <- poll$question$responses[[1]]$party
  colnames(data)[3] <- poll$question$responses[[2]]$party

  n <- length(colnames(data))

  ### if candidate order is dem, rep (or independent, rep), swaps the two candidate columns
  if(colnames(data)[3] == 'Republican') { 
      data <- data[,c(1,3,2,4:n)]
      ### flips the result so that it tracks whether the Republican won
      data[,n] <- abs(data[,n]-1)}
  
  data[,5] <- as.integer(as.Date(date_2016)-as.Date(data[,5]))
  data[,6] <- as.integer(as.Date(date_2016)-as.Date(data[,6]))

  colnames(data) <- names
##########################################################
  full_data <- rbind(full_data,data)}

full_data <- full_data[,-10]
full_data <- full_data[-1,]
rownames(full_data) <- as.character(sort(as.integer(rownames(full_data))-1))


#################################################################################################

### converts the subpop,method,affiliation categorical variables to binary variables

subpop <- unique(full_data[,'Subpop'])
method <- unique(full_data[,'Method'])
affiliation <- unique(full_data[,'Affiliation'])


data_indicator1 <- matrix(ncol = length(subpop), nrow = nrow(full_data))
data_indicator1 <- data.frame(data_indicator1)

data_indicator2 <- matrix(ncol = length(method), nrow = nrow(full_data))
data_indicator2 <- data.frame(data_indicator2)

data_indicator3 <- matrix(ncol = length(affiliation), nrow = nrow(full_data))
data_indicator3 <- data.frame(data_indicator3)


colnames(data_indicator1) <- subpop
colnames(data_indicator2) <- method
colnames(data_indicator3) <- affiliation


for(i in 1:nrow(full_data)) {
  for(j in 1:length(subpop)) data_indicator1[i,j] <- as.integer(full_data[i,'Subpop'] == subpop[j])
  for(j in 1:length(method)) data_indicator2[i,j] <- as.integer(full_data[i,'Method'] == method[j])
  for(j in 1:length(affiliation)) data_indicator3[i,j] <- as.integer(full_data[i,'Affiliation'] == affiliation[j])
}


data_indicator <- matrix(ncol = length(affiliation)+length(subpop)+length(method), nrow = nrow(full_data))
data_indicator <- data.frame(data_indicator)
data_indicator <- cbind(data_indicator1,data_indicator2,data_indicator3)

### adds to data frame and removes old categorical variables
full_data <- full_data[,-c(7,9,10)] 
full_data <- cbind(full_data,data_indicator)
full_data <- full_data[,c(1:7,9:22,8)]

### deletes "adult" subpop, mail and mixed methods (one vectorized step instead of deleting rows inside a loop)
drop_cols <- intersect(c('Mixed','Mail','Adults'),colnames(full_data))
drop_rows <- rowSums(as.matrix(full_data[,drop_cols,drop = FALSE]) == 1, na.rm = TRUE) > 0
full_data <- full_data[!drop_rows,]

method2 <- c('Online','Live Phone', 'Automated Phone', 'IVR')
data_indicator4 <- matrix(ncol =4, nrow = nrow(full_data))
data_indicator4 <- data.frame(data_indicator4)
data_indicator4[is.na(data_indicator4)] <- 0
for(i in 1:nrow(full_data)){
  if(full_data$`Live Phone`[i] == 1) data_indicator4[i,2] <- 1
  if(full_data$`Automated Phone`[i] == 1) data_indicator4[i,3] <- 1
  if(full_data$`IVR/Live Phone`[i] == 1) {
    data_indicator4[i,2] <- 1
    data_indicator4[i,4] <- 1
  }
  if(full_data$Internet[i] == 1) data_indicator4[i,1] <- 1
  if(full_data$`IVR/Online`[i] == 1) {
    data_indicator4[i,1] <- 1
    data_indicator4[i,4] <- 1}
  if(full_data$`Live Phone/Online`[i] == 1) {
    data_indicator4[i,1] <- 1
    data_indicator4[i,2] <- 1
  }
  
}
colnames(data_indicator4) <- method2

full_data <- full_data[,-c(8,10:18)]
full_data <- cbind(full_data,data_indicator4)
full_data <- full_data[,c(1:11,13:16,12)]
rep_margin <- full_data[,2]-full_data[,3]
year <- 2016
full_data[,2] <- year
full_data[,3] <- rep_margin
colnames(full_data)[2] <- 'year'
colnames(full_data)[3] <- 'rep_margin'
write_csv(full_data, '2016_senate_results.csv')
senate_2016 <- full_data
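
The collapse from combined method labels to four method flags, done row by row in the cells above, can also be written as one vectorized mapping. The sketch below uses the method labels that appear in this notebook; `method_flags` is a hypothetical helper, and it works on the raw label column rather than on the indicator columns.

```r
# Map each method label to four 0/1 flags; combined labels set more than one flag
method_flags <- function(method) {
  data.frame(
    Online = as.integer(grepl("Internet|Online", method)),
    `Live Phone` = as.integer(grepl("Live Phone", method, fixed = TRUE)),
    `Automated Phone` = as.integer(method == "Automated Phone"),
    IVR = as.integer(grepl("IVR", method, fixed = TRUE)),
    check.names = FALSE  # keep the spaces in the column names
  )
}

labels <- c("Live Phone", "IVR/Online", "Internet",
            "Live Phone/Online", "IVR/Live Phone", "Automated Phone")
flags <- method_flags(labels)
print(cbind(labels, flags))
```

A label such as "IVR/Live Phone" sets both the `Live Phone` and `IVR` flags, which matches the intent of allowing ones in several method columns at once.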

Now we split our data into a training set and a test set. We use 2016 as the test set, and 2012 and 2014 as the training set. We change the start-date and end-date variables into an end date (days before the election) and a length-of-poll variable (this is a simple linear transformation that captures the same information). We then exclude polls conducted a year or more before an election (of which there are a handful), since these polls are either (a) miscoded datewise or (b) conducted before the primaries have come close to concluding. We also drop the state name variable and the survey identifying code. Finally, we write each set to a CSV file.

In [None]:
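senate_train <- rbind(senate_2012,senate_2014)
### drops columns 2 and 4 of the combined frame
senate_train <- senate_train[,-c(2,4)]
### replaces the start-date column with the length of the poll in days (start minus end); columns are
### referenced by name so that rep_margin is not overwritten
senate_train$Start_DFE <- senate_train$Start_DFE-senate_train$End_DFE
colnames(senate_train)[colnames(senate_train) == 'Start_DFE'] <- 'length'
### drops polls that ended a year or more before the election (the guard keeps all rows when none match)
too_early <- which(senate_train$End_DFE >= 365)
if(length(too_early) > 0) senate_train <- senate_train[-too_early,]
write_csv(senate_train, 'senate_train.csv')

senate_2016 <- senate_2016[,-c(2,4)]
senate_2016$Start_DFE <- senate_2016$Start_DFE-senate_2016$End_DFE
colnames(senate_2016)[colnames(senate_2016) == 'Start_DFE'] <- 'length'
too_early <- which(senate_2016$End_DFE >= 365)
if(length(too_early) > 0) senate_2016 <- senate_2016[-too_early,]
write_csv(senate_2016, 'senate_test.csv')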
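
The date transformation and the one-year cutoff can be checked on a tiny invented frame. The column names follow the ones used in this notebook; the numbers are made up.

```r
# Toy polls: days before the election at which each poll started and ended
polls <- data.frame(Start_DFE = c(10, 400, 30), End_DFE = c(7, 396, 28))

# Poll length in days; together with End_DFE this carries the same information
polls$length <- polls$Start_DFE - polls$End_DFE
polls$Start_DFE <- NULL

# Keep polls that ended less than a year before the election
recent <- polls[polls$End_DFE < 365, ]
print(recent)

# Pitfall: a negative which() index drops every row when nothing matches
emptied <- recent[-which(recent$End_DFE >= 365), ]
```

In the last line no poll in `recent` is a year old, so `which` returns a zero-length index and the subset comes back with zero rows. Logical indexing, or a length check before the negative index, avoids this.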