# Analyzing Data using R
By Shuhei Kitamura

### Outline
1. Preparing Data for Analysis
    - Importing Data
    - Combining Data
    - Reshaping Data  
    - Making Variables
    - Saving Data

In [17]:
library(plyr)
library(tidyverse)

In [18]:
options(repr.matrix.max.rows=200, repr.matrix.max.cols=100) # set # of rows and columns to display 

In [19]:
setwd('...') # set the working directory

## 1. Preparing Data for Analysis
- We already have cleaned data. The next step is to build the final dataset for analysis.
- In this exercise, we will put eight files together (four election files and four temperature files). How?
    1. Append US Senate election data
    2. Append daily temperature data
    3. Merge them
- Our goal is to build a dataset with a panel structure (one observation per state and election year).

### Importing Data
- Import data as usual. Recall that all files are saved in csv format.

In [20]:
elec_data <- list()
temp_data <- list()
for (year in seq(2008,2014,by=2)) {
    # read one election file and one temperature file per election year
    elec_data[[paste0('elec_',year)]] <- read.table(paste0('data/elec_senate_R_', year, '.csv'), sep=",", header=TRUE, stringsAsFactors=FALSE)
    temp_data[[paste0('elec_',year)]] <- read.table(paste0('data/daily_temp_R_', year, '.csv'), sep=",", header=TRUE, stringsAsFactors=FALSE)
}

- Check data entries and their types, if you do not know them yet.
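- A minimal sketch of such a check, assuming a toy data frame in place of an imported file. The object `toy` and its values are made up for illustration; the real files may have other columns.

```r
# Toy stand-in for one imported file (hypothetical values)
toy <- data.frame(
    state_long = c('Alabama', 'Alaska'),
    elec_year = c(2008L, 2008L),
    gelec_rep = c(100, 200),
    stringsAsFactors = FALSE
)

str(toy)                          # compact overview: dimensions, column types, first values
col_types <- sapply(toy, class)   # class of each column as a named character vector
print(col_types)
head(toy, 1)                      # first row only
```

- With the imported lists, the same calls can be applied to one element at a time, e.g. `str(elec_data[['elec_2008']])`.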

### Combining Data
#### - Appending
- To append data in R, you can use `rbind()`.
- Append `data2` to `data1`.

In [16]:
# build two small data frames with columns v1 and v2
data1 <- setNames(as.data.frame(matrix(1:4, nrow=2, ncol=2, byrow=TRUE)), c('v1', 'v2'))
data2 <- setNames(as.data.frame(matrix(5:8, nrow=2, ncol=2, byrow=TRUE)), c('v1', 'v2'))
rbind(data1, data2) # append data2 to data1

- However, `rbind()` is not quite as useful as pandas' `concat`. For example, it requires both data frames to have the same columns, so an outer join (keeping the union of columns) is not supported.
    - In that case, use `rbind.fill()` in the `plyr` package.
    - Alternatively, use `bind_rows()` in the `dplyr` package.
- Append `data1` to `data2` using `rbind.fill()`.

In [10]:
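# mixing numbers and strings in one matrix coerces every entry to character
data1 <- setNames(as.data.frame(matrix(c(1, NaN, 10L, 'abc'), nrow=2, ncol=2), stringsAsFactors=FALSE), c('v1', 'v2'))
data2 <- setNames(as.data.frame(matrix(1:9, nrow=3, ncol=3, byrow=TRUE)), c('v1', 'v2', 'v3'))
rbind.fill(data2, data1) # append data1 to data2; the missing v3 is filled with NA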
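
- The `bind_rows()` alternative mentioned above can be sketched as follows. The data frames `a` and `b` are hypothetical toy inputs with different column sets.

```r
library(dplyr)

# Hypothetical toy data: both have v1, but only a has v2 and only b has v3
a <- data.frame(v1 = 1:2, v2 = 3:4)
b <- data.frame(v1 = 5:6, v3 = 7:8)

# Columns missing from one input are filled with NA in the result
ab <- bind_rows(a, b)
print(ab)
```

- Note that, at least in recent `dplyr` versions, `bind_rows()` stops with an error when a shared column has incompatible types (for example, integer in one input and character in the other), so check the column types first.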

- Let's combine all years for election data and temperature data, respectively.
- Wait... but temperature data are already very long (> 250,000 observations).
- Let's reduce the sizes of the datasets before appending them.

#### - Group Aggregation
- Our goal is to keep a single observation for each state and year. 
- How? There are several strategies.
    - Take the mean/max/min/std, etc.
    - Keep some of the observations
    - Reshape data
- I suggest the following procedure:
    1. Keep the Election Day temperature
    2. Take the mean of `arithmetic_mean` (daily average) and the max and mean of `x1st_max_value` (daily max) for each state.
- Does it make sense to you?

- Let's keep the Election Day temperature for each year.
    - Election Day: November 4th, 2008; November 2nd, 2010; November 6th, 2012; November 4th, 2014

In [21]:
temp_data[['elec_2008']] <- temp_data[['elec_2008']][temp_data[['elec_2008']]['date_local'] == '2008-11-04',]
temp_data[['elec_2010']] <- temp_data[['elec_2010']][temp_data[['elec_2010']]['date_local'] == '2010-11-02',]
temp_data[['elec_2012']] <- temp_data[['elec_2012']][temp_data[['elec_2012']]['date_local'] == '2012-11-06',]
temp_data[['elec_2014']] <- temp_data[['elec_2014']][temp_data[['elec_2014']]['date_local'] == '2014-11-04',]

- Next, check whether `arithmetic_mean` and `x1st_max_value` have missing values before aggregating their values.
    - Recall that some computation methods do not ignore missing values.
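- A minimal sketch of the missing-value check, using a toy data frame with made-up values (the object `toy_temp` is hypothetical):

```r
# Toy temperature data with one missing value per column (hypothetical values)
toy_temp <- data.frame(
    arithmetic_mean = c(50.2, NA, 47.1),
    x1st_max_value = c(61, 58, NA)
)

na_counts <- colSums(is.na(toy_temp))   # number of missing values per column
print(na_counts)

mean_with_na <- mean(toy_temp$arithmetic_mean)               # NA: mean() does not drop NAs by default
mean_no_na <- mean(toy_temp$arithmetic_mean, na.rm = TRUE)   # NAs dropped before averaging
print(c(mean_with_na, mean_no_na))
```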

- A powerful method for aggregation is `group_by()` followed by `summarize()` in the `dplyr` package.
- In this example, we use the pipe operator `%>%`.
    - Starting from an initial input, it applies several functions in sequence, passing each result as the first argument of the next function.
    - For example: `x %>% f(y)` is equivalent to `f(x, y)`, and `x %>% f(y) %>% g(z)` is equivalent to `g(f(x, y), z)`.
    - See also [this site](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) for an explanation about pipes.
    - If you want to see the outcome for each step, install an [addin](https://github.com/daranzolin/ViewPipeSteps) for RStudio.
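- The equivalence between piped and nested calls can be checked with a small sketch. The vector `x` is a toy input; `%>%` is made available here by loading `dplyr`.

```r
library(dplyr)  # makes the pipe operator %>% available

x <- c(1, 4, 9)

piped <- x %>% sqrt() %>% sum()   # read left to right: take x, then sqrt, then sum
nested <- sum(sqrt(x))            # the same computation written inside out
print(identical(piped, nested))

# Extra arguments go after the piped-in first argument: x %>% f(y) is f(x, y)
rounded_piped <- c(1.234, 5.678) %>% round(1)
rounded_nested <- round(c(1.234, 5.678), 1)
print(identical(rounded_piped, rounded_nested))
```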

In [None]:
head(mtcars)
mtcars %>%
    filter(hp > 100) %>% # or subset(); keep cars with gross horsepower > 100
    group_by(cyl) %>% # group by the number of cylinders
    summarize(mpg_mean=mean(mpg), mpg_sd=sd(mpg), n=n()) # summary statistics of mpg (miles per gallon)

- Compute the mean of `arithmetic_mean` (daily average) and the max and mean of `x1st_max_value` (daily max) for each state.
    - Also, add a new column `elec_year`.

In [22]:
for (year in seq(2008,2014,by=2)) {
    temp_data[[paste0('elec_', year, '_agg')]] <- 
        temp_data[[paste0('elec_', year)]] %>%
        group_by(state_name) %>%
        summarize(
            temp_mean = mean(arithmetic_mean, na.rm=TRUE),
            temp_max_max = max(x1st_max_value, na.rm=TRUE),
            temp_max_mean = mean(x1st_max_value, na.rm=TRUE)
        )    
    temp_data[[paste0('elec_', year, '_agg')]]['elec_year'] <- as.integer(year)
}

- Finally, it's time to append the data!

In [23]:
elec_all <- elec_data[['elec_2008']] 
temp_all <- temp_data[['elec_2008_agg']] 
for (year in seq(2010,2014,by=2)){
    elec_all <- rbind(elec_all, elec_data[[paste0('elec_',year)]])
    temp_all <- rbind(temp_all, temp_data[[paste0('elec_',year,'_agg')]])
}
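
- As an alternative to growing the data frame inside a loop, a list of data frames with identical columns can be appended in one call. A minimal sketch, assuming a hypothetical list `toy_list` of per-year data frames:

```r
# Hypothetical list of per-year data frames with identical columns
toy_list <- list(
    elec_2008 = data.frame(state = c('A', 'B'), elec_year = 2008L),
    elec_2010 = data.frame(state = c('A', 'B'), elec_year = 2010L)
)

toy_all <- do.call(rbind, toy_list)   # same as rbind(toy_list[[1]], toy_list[[2]])
rownames(toy_all) <- NULL             # drop row names derived from the list names
print(toy_all)
```

- This only works when every element of the list should be appended, so a list that mixes raw and aggregated data would need to be subset first.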

- Check that each dataset contains state names and election years.

In [None]:
print(c(unique(elec_all[['state_long']]), unique(elec_all[['elec_year']])))
print(c(unique(temp_all[['state_name']]), unique(temp_all[['elec_year']])))

#### - Removing spaces
- Though not in this case, strings often contain stray leading or trailing spaces.
- In that case, we have to remove them. Otherwise, keys will not match and we will not be able to merge the data properly.
- To remove spaces, use `str_trim(string, side = c('both', 'left', 'right'))` in the `stringr` package, where `side` is the side from which whitespace is removed (default: `'both'`).
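- A minimal sketch with made-up state names (the vector `states` is hypothetical):

```r
library(stringr)

# Hypothetical keys with stray leading/trailing spaces
states <- c(' Alabama', 'Alaska  ', '  New York ')

trimmed <- str_trim(states, side = 'both')      # remove whitespace on both sides
left_only <- str_trim(states, side = 'left')    # remove leading whitespace only
print(trimmed)
print(left_only)
```

- Spaces inside a string (as in `'New York'`) are kept; only the ends are trimmed.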

#### - Merging
- Merging means that you combine data horizontally by matching rows on one or more key columns.
- To merge data frames in R, use `inner_join()` or `full_join()` in the `dplyr` package.
    - You can also use `right_join()` and `left_join()`.
    - Alternatively, use base R's `merge()` instead.
- Merge `data1` and `data2`.

In [None]:
data1 <- data.frame(name=c('tom', 'jerry'), educ=c(9, 12), stringsAsFactors=FALSE)
data2 <- data.frame(name=c('tom', 'jerry', 'spike'), height=c(185, 170, 165), weight=c(70, 62, 60), stringsAsFactors=FALSE)
print(inner_join(data1, data2, by='name')) # inner join (intersection)
print(full_join(data1, data2, by='name')) # outer join (union)
print(right_join(data1, data2, by='name'))  # right join (keep right data)
print(left_join(data1, data2, by='name')) # left join (keep left data)
# print(merge(data1, data2, by='name')) # inner join (intersection)
# print(merge(data1, data2, by='name', all=TRUE)) # outer join (union)
# print(merge(data1, data2, by='name', all.y=TRUE))  # right join (keep right data)
# print(merge(data1, data2, by='name', all.x=TRUE)) # left join (keep left data)

- Key names can be different.

In [None]:
data1 <- data.frame(name1=c('tom', 'jerry'), educ=c(9, 12), stringsAsFactors=FALSE)
data2 <- data.frame(name2=c('tom', 'jerry', 'spike'), height=c(185, 170, 165), weight=c(70, 62, 60), stringsAsFactors=FALSE)
print(full_join(data1, data2, by=c('name1' = 'name2')))
#print(merge(data1, data2, by.x='name1', by.y='name2', all=TRUE))

- You can use more than one key.

In [None]:
data1 <- data.frame(name=c('tom', 'jerry'), educ=c(9, 12), year=c(2000, 2000))
data2 <- data.frame(name=c('tom', 'jerry'), height=c(185, 170, 187, 171), weight=c(70, 62, 75, 63), year=c(2000, 2000, 2001, 2001))
print(full_join(data1, data2, by=c('name','year')))
#print(merge(data1, data2, by=c('name','year'), all=TRUE))

- What happens if two datasets have the same column name with different values?

In [None]:
data1 <- data.frame(name=c('tom', 'jerry'), height=c(185, 170))
data2 <- data.frame(name=c('tom', 'jerry'), height=c(185, 172))
print(full_join(data1, data2, by='name'))
#print(merge(data1, data2, by='name', all=TRUE))
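
- A sketch of how the name clash is resolved: non-key columns that share a name are kept side by side and disambiguated with suffixes. The objects `d1` and `d2` are toy copies of the data above, and the custom suffixes are chosen only for illustration.

```r
library(dplyr)

d1 <- data.frame(name = c('tom', 'jerry'), height = c(185, 170))
d2 <- data.frame(name = c('tom', 'jerry'), height = c(185, 172))

# Default: the clashing columns get the suffixes .x (left) and .y (right)
joined <- full_join(d1, d2, by = 'name')
print(names(joined))

# The suffixes can be chosen explicitly
joined_custom <- full_join(d1, d2, by = 'name', suffix = c('_d1', '_d2'))
print(joined_custom)
```

- Comparing the two suffixed columns is a quick way to spot conflicting values between sources.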

- Let's merge election and temperature data.
    - What are the keys?

In [24]:
data_use <- full_join(elec_all, temp_all, by=c('state_long' = 'state_name', 'elec_year' = 'elec_year'))
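
- Since a full join keeps unmatched rows from both sides, it can help to check which keys fail to match. A minimal sketch using `anti_join()` on hypothetical toy data (`elec_toy` and `temp_toy` are made up; only the key names follow the datasets above):

```r
library(dplyr)

# Hypothetical toy versions of the two datasets
elec_toy <- data.frame(state_long = c('Alabama', 'Alaska'),
                       elec_year = c(2008L, 2008L),
                       stringsAsFactors = FALSE)
temp_toy <- data.frame(state_name = c('Alabama', 'Arizona'),
                       elec_year = c(2008L, 2008L),
                       temp_mean = c(60, 70),
                       stringsAsFactors = FALSE)

# Rows of the first data frame that have no match in the second
only_elec <- anti_join(elec_toy, temp_toy,
                       by = c('state_long' = 'state_name', 'elec_year' = 'elec_year'))
only_temp <- anti_join(temp_toy, elec_toy,
                       by = c('state_name' = 'state_long', 'elec_year' = 'elec_year'))
print(only_elec)
print(only_temp)
```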

- Print `data_use`, which should be in long format. You will often use this type of data structure for panel data analysis.

In [None]:
data_use

### Reshaping Data
- If necessary, reshape data for analysis. In that case, use `reshape()`.
    - In our exercise, we don't need to reshape the data.

In [None]:
data1 <- reshape(data_use, idvar="state_long", timevar="elec_year", direction="wide") # from long to wide
rownames(data1) <- NULL # delete old rownames
data1

In [None]:
data2 <- reshape(data1, idvar="state_long", timevar="elec_year", direction="long") # back to long
rownames(data2) <- NULL # delete old rownames
data2
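
- To see what the wide format looks like, here is a minimal sketch on a small toy panel (the object `long_toy` and its values are hypothetical):

```r
# Hypothetical long-format panel: one row per state and election year
long_toy <- data.frame(
    state_long = c('A', 'A', 'B', 'B'),
    elec_year = c(2008L, 2010L, 2008L, 2010L),
    temp_mean = c(50, 52, 60, 61)
)

# Wide format: one row per state, one temp_mean column per election year
wide_toy <- reshape(long_toy, idvar = 'state_long', timevar = 'elec_year', direction = 'wide')
rownames(wide_toy) <- NULL
print(wide_toy)
```

- The time-varying column is split into `temp_mean.2008` and `temp_mean.2010`, i.e. the variable name, a dot, and the value of `timevar`.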

### Making Variables
- You may need more variables for analysis. For example:
    - Logarithm
    - Total, mean, min, max...
    - Share, ratio...
- Let's make 
    - Vote share of Republican and Democratic candidates
    - Natural logarithm of temperature    

In [25]:
# make vote shares
data_use['gelec_total'] <- rowSums(data_use[c('gelec_dem','gelec_rep','gelec_oth')], na.rm=TRUE)
data_use[data_use['gelec_total'] == 0, 'gelec_total'] <- NA
data_use['rep_share'] <- data_use['gelec_rep'] / data_use['gelec_total'] # Republican vote share
data_use['dem_share'] <- data_use['gelec_dem'] / data_use['gelec_total'] # Democratic vote share
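
- The reason for recoding a zero total to `NA` can be seen in a small sketch with made-up vote counts (the object `votes` is hypothetical; column names follow the cell above):

```r
# Hypothetical vote counts; the second row has no data at all
votes <- data.frame(
    gelec_dem = c(60, NA, 30),
    gelec_rep = c(40, NA, 50),
    gelec_oth = c(NA, NA, 20)
)

# With na.rm=TRUE an all-missing row sums to 0, which is not a real total
votes$gelec_total <- rowSums(votes[c('gelec_dem', 'gelec_rep', 'gelec_oth')], na.rm = TRUE)
votes$gelec_total[votes$gelec_total == 0] <- NA

votes$rep_share <- votes$gelec_rep / votes$gelec_total
print(votes)
```

- Without the recoding step, the second row would have a total of 0 even though its vote counts are simply missing.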

In [None]:
# take natural logs
data_use['ln_temp_mean'] <- log(data_use['temp_mean'])
data_use['ln_temp_max_max'] <- log(data_use['temp_max_max'])
data_use['ln_temp_max_mean'] <- log(data_use['temp_max_mean'])

### Saving Data

In [27]:
write.table(data_use, file='data/data_use_R.csv', sep=",", na="", row.names=FALSE)
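
- A quick way to verify a saved file is to read it back and compare. A minimal sketch, assuming a toy data frame and a temporary file rather than the real output path:

```r
# Hypothetical toy data with one missing value
toy_out <- data.frame(state_long = c('A', 'B'),
                      temp_mean = c(50.5, NA),
                      stringsAsFactors = FALSE)

tmp <- tempfile(fileext = '.csv')   # temporary file, so nothing real is overwritten
write.table(toy_out, file = tmp, sep = ',', na = '', row.names = FALSE)

# Missing numeric values were written as empty fields and come back as NA
toy_back <- read.table(tmp, sep = ',', header = TRUE, stringsAsFactors = FALSE)
print(toy_back)
unlink(tmp)                         # remove the temporary file
```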