In [1]:
library(reshape2)

reshape2 is an R package for transforming data between wide and long formats. It was created by Hadley Wickham, Chief Scientist at RStudio and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University.

In [2]:
population <- read.csv("Data/population-4.csv")

Import the population data from the CSV file.

In [3]:
head(population)

STNAME,CTYNAME,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016
Alabama,Autauga County,54742,55255,55027,54792,54977,55035,55416
Alabama,Baldwin County,183199,186653,190403,195147,199745,203690,208563
Alabama,Barbour County,27348,27326,27132,26938,26763,26270,25965
Alabama,Bibb County,22861,22736,22645,22501,22511,22561,22643
Alabama,Blount County,57376,57707,57772,57746,57621,57676,57704
Alabama,Bullock County,10892,10722,10654,10576,10712,10455,10362


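Check the form and headers of the dataset. It is in wide format: each county has one row, and each year's population estimate has its own column.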
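Besides head(), a few base R helpers report the shape and headers directly. The sketch below uses a small hand-built data frame so that it runs without the CSV file; the name toy is hypothetical, and its values are copied from the first two rows of the output above.

```r
# Toy stand-in for the population data (two rows, two years, copied from above).
toy <- data.frame(
  STNAME = c("Alabama", "Alabama"),
  CTYNAME = c("Autauga County", "Baldwin County"),
  POPESTIMATE2010 = c(54742, 183199),
  POPESTIMATE2011 = c(55255, 186653)
)

dim(toy)    # number of rows and number of columns
names(toy)  # column headers
str(toy)    # type and preview of each column
```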

In [4]:
longpopulation <- melt(population)
head(longpopulation)

Using STNAME, CTYNAME as id variables


STNAME,CTYNAME,variable,value
Alabama,Autauga County,POPESTIMATE2010,54742
Alabama,Baldwin County,POPESTIMATE2010,183199
Alabama,Barbour County,POPESTIMATE2010,27348
Alabama,Bibb County,POPESTIMATE2010,22861
Alabama,Blount County,POPESTIMATE2010,57376
Alabama,Bullock County,POPESTIMATE2010,10892


We want long-format data, which has a "variable" column and a "value" column. Here the "variable" column holds the name of the original column (for example, POPESTIMATE2010) and the "value" column holds the population estimate. To get long-format data from wide-format data, we use "melt". Because no id variables were given, melt used STNAME and CTYNAME as id variables, as the message above shows.

The melt command takes wide-format data and melts it into long-format data, like melting metal. (Reference: Sean C. Anderson, http://seananderson.ca/2013/10/19/reshape.html)

According to Anderson, there is also the opposite command, "cast", which casts long-format data into wide-format data. In reshape2 it is available as dcast (which returns a data frame) and acast (which returns a vector, matrix, or array).

What is wide-format data? It is the spread-out counterpart of long-format data: each category (here, each year) gets its own column, so each county takes a single row.
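To make the melt/cast pair concrete, here is a minimal round trip on a toy data frame. The names wide, long, and wide_again are hypothetical, and naming the id columns explicitly through id.vars is a choice made for this sketch; the notebook itself lets melt pick them.

```r
library(reshape2)

# Toy wide-format data: one row per county, one column per year.
wide <- data.frame(
  STNAME = c("Alabama", "Alabama"),
  CTYNAME = c("Autauga County", "Baldwin County"),
  POPESTIMATE2010 = c(54742, 183199),
  POPESTIMATE2011 = c(55255, 186653)
)

# Wide to long: name the id columns explicitly instead of letting melt() guess.
long <- melt(wide, id.vars = c("STNAME", "CTYNAME"))

# Long to wide: id columns go left of ~, the column to spread goes on the right.
wide_again <- dcast(long, STNAME + CTYNAME ~ variable, value.var = "value")

long
wide_again
```

Each county has two rows in long (one per year) and a single row again in wide_again.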

In [5]:
library(stringr)
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Use the stringr package to manipulate strings and extract parts of them (for example, trimming a column header down to the part you need). Reference: Introduction to stringr, CRAN.R-project.org. In short, stringr helps clean data: it simplifies string operations such as dropping text you don't want and extracting or replacing substrings in a character vector. Some of its commands:

str_sub
str_subset
str_locate
str_replace
word
%>% (the pipe operator: it passes the result on its left to the function on its right)
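A quick sketch of what the listed commands do, run on a made-up character vector x (the vector is an assumption for illustration, not part of the dataset):

```r
library(stringr)

x <- c("POPESTIMATE2010", "POPESTIMATE2016", "STNAME")

str_sub(x, start = -4)             # last four characters of each string
str_subset(x, "POPESTIMATE")       # keep only the strings that match the pattern
str_locate(x, "2016")              # start and end position of the match (NA if none)
str_replace(x, "POPESTIMATE", "")  # replace the first match in each string
word("Autauga County", 1)          # first word of a string

# The pipe feeds the left-hand side into the function on the right.
x %>% str_subset("POPESTIMATE") %>% str_sub(start = -4)
```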

In [6]:
longpopulation %>% mutate(
    Year = str_sub(variable, start = -2)
)

STNAME,CTYNAME,variable,value,Year
Alabama,Autauga County,POPESTIMATE2010,54742,10
Alabama,Baldwin County,POPESTIMATE2010,183199,10
Alabama,Barbour County,POPESTIMATE2010,27348,10
Alabama,Bibb County,POPESTIMATE2010,22861,10
Alabama,Blount County,POPESTIMATE2010,57376,10
Alabama,Bullock County,POPESTIMATE2010,10892,10
Alabama,Butler County,POPESTIMATE2010,20938,10
Alabama,Calhoun County,POPESTIMATE2010,118468,10
Alabama,Chambers County,POPESTIMATE2010,34101,10
Alabama,Cherokee County,POPESTIMATE2010,25977,10


We want to add the year as a variable, so we use the command: Year = str_sub(variable, start = -2).
str_sub extracts (or replaces) a substring. A negative start counts from the end of the string, so start = -2 returns the last two characters, e.g. "10" from POPESTIMATE2010. Meanwhile, dplyr's mutate adds new variables while preserving the existing ones, as with Year here. Reference: Statistical tools for high-throughput data analysis (sthda.com)
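The behaviour of this step can be checked on a tiny long-format data frame. The names toy_long and with_year are hypothetical; the values are copied from the Autauga County row above.

```r
library(dplyr)
library(stringr)

toy_long <- data.frame(
  CTYNAME = c("Autauga County", "Autauga County"),
  variable = c("POPESTIMATE2010", "POPESTIMATE2016"),
  value = c(54742, 55416)
)

# mutate() keeps every existing column and appends Year at the end.
with_year <- toy_long %>% mutate(Year = str_sub(variable, start = -2))

names(with_year)       # original columns plus Year
class(with_year$Year)  # the extracted years are text, not numbers
```

Because str_sub returns text, Year cannot be used in arithmetic until it is converted.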

In [7]:
longpopulation %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start = -2))
)

STNAME,CTYNAME,variable,value,Year
Alabama,Autauga County,POPESTIMATE2010,54742,2010
Alabama,Baldwin County,POPESTIMATE2010,183199,2010
Alabama,Barbour County,POPESTIMATE2010,27348,2010
Alabama,Bibb County,POPESTIMATE2010,22861,2010
Alabama,Blount County,POPESTIMATE2010,57376,2010
Alabama,Bullock County,POPESTIMATE2010,10892,2010
Alabama,Butler County,POPESTIMATE2010,20938,2010
Alabama,Calhoun County,POPESTIMATE2010,118468,2010
Alabama,Chambers County,POPESTIMATE2010,34101,2010
Alabama,Cherokee County,POPESTIMATE2010,25977,2010


We want 2010, not 10, for "Year". Therefore, we use this command: Year = 2000 + as.integer(str_sub(variable, start = -2)), which reads as: take the last two characters, convert them to an integer, and add 2000. E.g. for POPESTIMATE2016, 2000 + 16 = 2016.
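The arithmetic can be traced step by step on two column names. The last line shows an alternative that is not used in this notebook: taking all four trailing digits, which avoids hard-coding 2000.

```r
library(stringr)

variable <- c("POPESTIMATE2010", "POPESTIMATE2016")

two_digit <- str_sub(variable, start = -2)  # "10" "16" (character)
year <- 2000 + as.integer(two_digit)        # 2010 2016 (numeric)
year

# Alternative (an assumption, not the notebook's approach): use all four digits.
year_alt <- as.integer(str_sub(variable, start = -4))
year_alt
```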

In [8]:
longpopulation %>% mutate(
    Year = 2000 + as.integer(str_sub(variable, start = -2))
) %>% select(-c(variable)) %>% rename(Estimate = value)

STNAME,CTYNAME,Estimate,Year
Alabama,Autauga County,54742,2010
Alabama,Baldwin County,183199,2010
Alabama,Barbour County,27348,2010
Alabama,Bibb County,22861,2010
Alabama,Blount County,57376,2010
Alabama,Bullock County,10892,2010
Alabama,Butler County,20938,2010
Alabama,Calhoun County,118468,2010
Alabama,Chambers County,34101,2010
Alabama,Cherokee County,25977,2010


Remove the unwanted variable to keep the dataset short and simple: drop "variable" and rename "value" to "Estimate" with %>% select(-c(variable)) %>% rename(Estimate = value). In the end, the dataset consists of state name, county name, estimate (the former value) and year.
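The cells above print their results without storing them. A minimal sketch of the whole pipeline, assigned to a new object, could look like the following; the names wide and tidy and the toy values are assumptions for illustration, and id.vars is spelled out rather than left for melt to choose.

```r
library(reshape2)
library(dplyr)
library(stringr)

# Toy wide-format input so the example runs without the CSV file.
wide <- data.frame(
  STNAME = c("Alabama", "Alabama"),
  CTYNAME = c("Autauga County", "Baldwin County"),
  POPESTIMATE2010 = c(54742, 183199),
  POPESTIMATE2011 = c(55255, 186653)
)

tidy <- wide %>%
  melt(id.vars = c("STNAME", "CTYNAME")) %>%                       # wide to long
  mutate(Year = 2000 + as.integer(str_sub(variable, start = -2))) %>%  # add Year
  select(-variable) %>%                                            # drop the old header column
  rename(Estimate = value)                                         # give the value a clearer name

tidy
```

Assigning the result (here to tidy) keeps the cleaned data available for later steps.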