# Recasting data

Sometimes long data needs to be wide, and sometimes wide data needs to be long. I'll explain.

You will soon discover that long before you can visualize data, you need to get it into a form the visualization library can handle. One aspect of that form that isn't immediately obvious is how your data is cast. Most of the data you encounter will be wide -- each row represents a single entity, with multiple measures for that entity in separate columns. Think of states: your dataset could have a column each for population, average life expectancy and other demographic data.

But what if your visualization library needs one row for each measure? That's where recasting your data comes in. We can use a library called `reshape2` to `melt` the data into long form or to cast it back into wide form with `dcast`, depending on what we need.
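To make the difference concrete, here is a minimal sketch of the same information in both shapes. The state names and numbers are made up for illustration; they are not from any dataset used in this lesson.

```r
# Hypothetical values, for illustration only.
wide <- data.frame(
  State = c("A", "B"),
  Population = c(100, 200),
  LifeExpectancy = c(78.5, 80.1)
)

# The same information in long form: one row per state per measure.
long <- data.frame(
  State = c("A", "B", "A", "B"),
  variable = c("Population", "Population", "LifeExpectancy", "LifeExpectancy"),
  value = c(100, 200, 78.5, 80.1)
)

print(wide)
print(long)
```

The wide version has one row per state; the long version has one row per state per measure, so two states with two measures become four rows.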

So let's transform a dataset we've already used -- registered voters in Nebraska -- from wide data to long data and back again. First, we'll import the library and then open the data. 

In [1]:
library(reshape2)

In [2]:
voters <- read.csv("../../Data/registeredvoters.csv")

In [3]:
head(voters)

County,Republican10,Democrat10,Libertarian10,Nonpartisan10,Total10,Republican16,Democrat16,Nonpartisan16,Libertarian16,Total16
Adams,10018,5536,6,2972,18532,10746,5027,3591,163,19527
Antelope,3005,1147,0,538,4690,3088,863,594,12,4557
Arthur,284,52,0,10,346,286,37,15,3,341
Banner,424,53,0,53,530,427,38,73,7,545
Blaine,314,56,0,24,394,310,43,29,2,384
Boone,2390,1156,0,408,3954,2469,901,404,11,3785


Making data long is, in most cases, very easy. We're going to create a new data frame called `longvoters` and `melt` our voters data into it. Then we'll run `head`, and you'll see each measure gets its own row -- so each county ends up with 10 rows of data, one for each of the 10 measure columns.

In [4]:
longvoters <- melt(voters)
head(longvoters)

Using County as id variables


County,variable,value
Adams,Republican10,10018
Antelope,Republican10,3005
Arthur,Republican10,284
Banner,Republican10,424
Blaine,Republican10,314
Boone,Republican10,2390
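The "Using County as id variables" message means `melt` guessed which column identifies each row. If you would rather not rely on the guess, you can name the identifier yourself with `id.vars`. This is a small sketch on a toy data frame built from two rows and two columns of the table above; the name `toy` is just for this example.

```r
library(reshape2)

# Two rows and two measures copied from the voters table above.
toy <- data.frame(
  County = c("Adams", "Antelope"),
  Republican10 = c(10018, 3005),
  Democrat10 = c(5536, 1147)
)

# id.vars names the identifier column(s); every other column is melted.
toy_long <- melt(toy, id.vars = "County")
print(toy_long)
```

Two counties with two measures each give four rows, with the old column headers stacked in `variable` and the numbers in `value`.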


But there's a problem that isn't immediately clear: while a header like Republican10 or Republican16 makes sense in spreadsheet world, it doesn't here. We need separate fields for County, Party, Year and a count of those voters. So let's do some more work with `mutate` and create the columns we need. And let's start exploring programmatic text manipulation. To simplify things, we're going to use a library called `stringr`, which lets us take advantage of predictable patterns in our data.

So if you look at the `variable` field, you see we have the party at the front and the year at the back, and the year is always the last two characters of the field. The problem? The party names are different lengths, so we don't know WHICH character positions those are. But `stringr` has a powerful tool called `str_sub` that can get us that, if we know one trick: telling it to start at -2 means go to the end of the string and move backwards two characters.
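Here is a quick sketch of that trick on its own, using three of the headers from our data. The vector name `headers` is just for this example.

```r
library(stringr)

# Three headers of different lengths, taken from the voters columns.
headers <- c("Republican10", "Democrat16", "Libertarian16")

# Negative positions count from the end of the string: -1 is the last
# character, so starting at -2 returns the last two characters.
years <- str_sub(headers, start = -2)
print(years)   # "10" "16" "16"
```

It doesn't matter how long the party name is; counting from the end always lands on the year.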

In [5]:
library(stringr)
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [6]:
longvoters %>% mutate(
    Year = str_sub(variable, start= -2),
)

County,variable,value,Year
Adams,Republican10,10018,10
Antelope,Republican10,3005,10
Arthur,Republican10,284,10
Banner,Republican10,424,10
Blaine,Republican10,314,10
Boone,Republican10,2390,10
Box Butte,Republican10,4115,10
Boyd,Republican10,1036,10
Brown,Republican10,1663,10
Buffalo,Republican10,15768,10


If I really wanted to be correct, I could make those four-digit years by adding 2000 to what `str_sub` finds, but first I have to turn that text into a number.

In [7]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
) 

County,variable,value,Year
Adams,Republican10,10018,2010
Antelope,Republican10,3005,2010
Arthur,Republican10,284,2010
Banner,Republican10,424,2010
Blaine,Republican10,314,2010
Boone,Republican10,2390,2010
Box Butte,Republican10,4115,2010
Boyd,Republican10,1036,2010
Brown,Republican10,1663,2010
Buffalo,Republican10,15768,2010


Okay, so I have one side of it. Now I need the other. Luckily, we can tell `str_sub` where to start and where to end. In this case, we want to start at 1 and end at -3, which keeps everything except the last two characters.

In [8]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
)

County,variable,value,Year,Party
Adams,Republican10,10018,2010,Republican
Antelope,Republican10,3005,2010,Republican
Arthur,Republican10,284,2010,Republican
Banner,Republican10,424,2010,Republican
Blaine,Republican10,314,2010,Republican
Boone,Republican10,2390,2010,Republican
Box Butte,Republican10,4115,2010,Republican
Boyd,Republican10,1036,2010,Republican
Brown,Republican10,1663,2010,Republican
Buffalo,Republican10,15768,2010,Republican


Two last things are bothering me about our data: I don't like the number of voters being called `value`, and we don't need `variable`. So we'll use `dplyr` to clean this up a bit.

First, we're going to use `dplyr`'s `select` and a little R trick where a negative sign means "not this."

Then, we'll use `rename`, which is backwards to me. The new name goes on the left and the existing column goes on the right, so `rename(Count=value)` reads as "make a column called Count out of the column called value." The reverse makes more sense to me -- value = Count -- but `dplyr` does it the other way. If you get an error, you probably did it backwards (I did).
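If the order is hard to remember, it can help to try `rename` on something tiny first. This is a minimal sketch on a two-row stand-in for our melted data; the name `toy` is just for this example.

```r
library(dplyr)

# Two rows borrowed from the melted voters data above.
toy <- data.frame(County = c("Adams", "Antelope"), value = c(10018, 3005))

# rename() takes new_name = old_name.
renamed <- toy %>% rename(Count = value)
print(names(renamed))
```

Writing it the other way around, `rename(value = Count)`, fails here because there is no existing column called Count to rename.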

In [9]:
longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)

County,Count,Year,Party
Adams,10018,2010,Republican
Antelope,3005,2010,Republican
Arthur,284,2010,Republican
Banner,424,2010,Republican
Blaine,314,2010,Republican
Boone,2390,2010,Republican
Box Butte,4115,2010,Republican
Boyd,1036,2010,Republican
Brown,1663,2010,Republican
Buffalo,15768,2010,Republican


In [10]:
newlongvoters <- longvoters %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start= -2)),
    Party = str_sub(variable, 1, -3),
) %>% select (-c(variable)) %>% rename(Count=value)

Now that we've saved the result as `newlongvoters`, we can put it back together again by casting it with `dcast`. We give `dcast` a formula: the left side of the `~` is our main identifier -- which is County -- and the right side says what the headers should be built from. We also have to tell it where the numbers should come from with `value.var`, since we blew it all apart. And you'll see that although we've changed the data substantially, the result looks almost identical to the original dataset, just with the columns renamed and reordered.

In [11]:
widevoters <- dcast(newlongvoters, County ~ Party+Year, value.var = "Count")
head(widevoters) 

County,Democrat_2010,Democrat_2016,Libertarian_2010,Libertarian_2016,Nonpartisan_2010,Nonpartisan_2016,Republican_2010,Republican_2016,Total_2010,Total_2016
Adams,5536,5027,6,163,2972,3591,10018,10746,18532,19527
Antelope,1147,863,0,12,538,594,3005,3088,4690,4557
Arthur,52,37,0,3,10,15,284,286,346,341
Banner,53,38,0,7,53,73,424,427,530,545
Blaine,56,43,0,2,24,29,314,310,394,384
Boone,1156,901,0,11,408,404,2390,2469,3954,3785
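To see how the formula builds the headers, here is a small sketch of the same `dcast` call on four rows of long data. The values are the Republican counts for Adams and Antelope from the tables above; the names `toy_long` and `toy_wide` are just for this example.

```r
library(reshape2)

# Four long rows: two counties, one party, two years.
toy_long <- data.frame(
  County = c("Adams", "Adams", "Antelope", "Antelope"),
  Party = "Republican",
  Year = c(2010, 2016, 2010, 2016),
  Count = c(10018, 10746, 3005, 3088)
)

# Left of ~ : one row per County.
# Right of ~ : one column per Party/Year combination, joined with "_".
# value.var : the column that fills the cells.
toy_wide <- dcast(toy_long, County ~ Party + Year, value.var = "Count")
print(toy_wide)
```

Each distinct combination of Party and Year becomes its own column, which is where headers like Republican_2010 come from.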


## Assignment

Melt the [population estimates data](https://www.dropbox.com/s/u4s7rhcb37sk3cy/population.csv?dl=0) from assignment 3. Create long data, where each row is a single year for a single county, with columns for the state, county, year and estimate.  

#### Rubric

1. Did you import the data correctly?
2. Did you apply melt correctly?
3. Did you mutate/rename columns correctly?
4. Did you explain your steps using Markdown comments?