# Making new data

One of the most common data analysis techniques is to look at change over time. The most common way of comparing change over time is through percent change. The math behind calculating percent change is very simple, and you should know it off the top of your head. The easy way to remember it is:

`(new - old) / old` 

Or new minus old divided by old. Your new number minus the old number, the result of which is divided by the old number. To do that in R, we can use `dplyr` and `mutate` to calculate new metrics in a new field using existing fields of data. 
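To see the arithmetic on its own before any package is involved, here is a minimal sketch in base R. The numbers and the variable names (`old_value`, `new_value`, `old_values`, `new_values`) are made up for illustration and do not come from the dataset used later.

```r
# Hypothetical values, not taken from the population data
old_value <- 200
new_value <- 250

# (new - old) / old
single_change <- (new_value - old_value) / old_value
single_change   # 0.25, which is a 25 percent increase

# The same expression works element by element on vectors,
# which is why it also works on whole columns of a data frame
old_values <- c(200, 80, 50)
new_values <- c(250, 60, 50)
pct_change <- (new_values - old_values) / old_values
pct_change      # 0.25, -0.25, 0: a gain, a loss, and no change
```

Because R does the subtraction and division on every element at once, `mutate` can apply the formula to every row without a loop.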

So first we'll load `dplyr`.

In [1]:
library(dplyr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Now we'll import a common and simple dataset of population estimates for every county in the US. The estimates run from 2010 to 2016.

In [2]:
population <- read.csv("../../Data/population.csv")

In [3]:
head(population)

STNAME,CTYNAME,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016
Alabama,Autauga County,54742,55255,55027,54792,54977,55035,55416
Alabama,Baldwin County,183199,186653,190403,195147,199745,203690,208563
Alabama,Barbour County,27348,27326,27132,26938,26763,26270,25965
Alabama,Bibb County,22861,22736,22645,22501,22511,22561,22643
Alabama,Blount County,57376,57707,57772,57746,57621,57676,57704
Alabama,Bullock County,10892,10722,10654,10576,10712,10455,10362


The code to calculate percent change is pretty simple. Remember, with `summarize`, we used `n()` to count things. With `mutate`, we use very similar syntax to calculate a new value using other values in our dataset. So in this case, we're trying to do (new-old)/old, but we're doing it with fields. If you look at what we got when we did `head`, you'll see there's `POPESTIMATE2016` as the new data, and we'll use `POPESTIMATE2015` as the old data. So we're looking at one year of change. Then, to help us, we'll use `arrange` again with `desc` to sort from largest to smallest, so we get the county with the fastest growing population over one year.

In [4]:
population %>% mutate(
  change = (POPESTIMATE2016 - POPESTIMATE2015)/POPESTIMATE2015,
) %>% arrange(desc(change))

STNAME,CTYNAME,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,change
Texas,Hudspeth County,3467,3417,3351,3331,3243,3425,4053,0.18335766
Utah,San Juan County,14797,14787,14900,14988,15208,15707,16895,0.07563507
Texas,Kendall County,33651,34525,35766,37461,38830,40452,42540,0.05161673
Texas,Hays County,158241,163209,168408,176029,184951,194574,204470,0.05085983
Utah,Wasatch County,23629,24403,25385,26609,27789,29165,30528,0.04673410
Iowa,Dallas County,66699,69759,72271,75010,77798,80777,84516,0.04628793
Colorado,Costilla County,3526,3646,3603,3521,3540,3563,3721,0.04434465
Texas,Comal County,109294,112047,115005,118776,123487,129113,134788,0.04395375
Nebraska,Thomas County,650,688,692,700,688,686,716,0.04373178
Florida,Sumter County,94280,98584,102790,108263,114012,118882,123996,0.04301745
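One thing the output above does not make obvious is that `mutate` returns a new data frame and does not change `population` itself. The sketch below illustrates this on a small hand-built data frame, assuming `dplyr` is installed. The three rows are copied from the `head(population)` output earlier. The names `sample_counties` and `with_change` are hypothetical and used only for this example.

```r
library(dplyr)

# Three rows copied from the head(population) output,
# keeping only the columns the calculation needs
sample_counties <- data.frame(
  CTYNAME = c("Autauga County", "Baldwin County", "Barbour County"),
  POPESTIMATE2015 = c(55035, 203690, 26270),
  POPESTIMATE2016 = c(55416, 208563, 25965),
  stringsAsFactors = FALSE
)

# mutate() returns a new data frame with the extra column;
# assign the result to a name if you want to keep it
with_change <- sample_counties %>%
  mutate(change = (POPESTIMATE2016 - POPESTIMATE2015) / POPESTIMATE2015) %>%
  arrange(desc(change))

with_change

# The original data frame still has only its three columns
names(sample_counties)
```

Baldwin County sorts first because it grew the fastest of the three, and Barbour County sorts last because its population fell, which gives a negative `change`.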


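But if we look at `change`, we'll see that it's a decimal. That's because for percent change to be a percent, you must multiply it by 100. The code below does that, and it also adds a second field, `longchange`, that measures the change from 2010 to 2016. This time we sort by `longchange` in ascending order, so the counties that lost the largest share of their population come first: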

In [6]:
population %>% mutate(
  change = ((POPESTIMATE2016 - POPESTIMATE2015)/POPESTIMATE2015)*100,
  longchange = ((POPESTIMATE2016 - POPESTIMATE2010)/POPESTIMATE2010)*100,
) %>% arrange(longchange)

STNAME,CTYNAME,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,change,longchange
Illinois,Alexander County,8214,8018,7712,7215,7074,6776,6478,-4.3978749,-21.134648
Texas,Terrell County,1009,952,917,888,905,855,812,-5.0292398,-19.524281
Kentucky,Lee County,7707,7691,7551,6838,6774,6737,6580,-2.3304141,-14.623070
Idaho,Butte County,2907,2805,2722,2626,2609,2501,2501,0.0000000,-13.966288
West Virginia,McDowell County,22076,21708,21335,20901,20291,19698,19141,-2.8276982,-13.294981
Texas,Schleicher County,3499,3304,3255,3193,3156,3196,3056,-4.3804756,-12.660760
South Carolina,Allendale County,10344,10247,9988,9805,9689,9420,9045,-3.9808917,-12.558005
Arkansas,Phillips County,21670,21413,20762,20437,19938,19534,18975,-2.8616771,-12.436548
Michigan,Ontonagon County,6750,6626,6417,6314,6173,6008,5911,-1.6145140,-12.429630
Idaho,Clark County,979,951,874,862,874,872,860,-1.3761468,-12.155260
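Percentages with seven decimal places are hard to read. One option, not part of the original walkthrough, is to wrap each calculation in base R's `round`. The sketch below assumes `dplyr` is installed and uses three rows copied from the outputs above. The names `sample_counties` and `rounded` are hypothetical.

```r
library(dplyr)

# Three rows copied from the outputs above
sample_counties <- data.frame(
  CTYNAME = c("Hudspeth County", "Alexander County", "Butte County"),
  POPESTIMATE2010 = c(3467, 8214, 2907),
  POPESTIMATE2015 = c(3425, 6776, 2501),
  POPESTIMATE2016 = c(4053, 6478, 2501),
  stringsAsFactors = FALSE
)

# Both new fields are created in a single mutate() call;
# round(x, 1) keeps one digit after the decimal point
rounded <- sample_counties %>%
  mutate(
    change = round((POPESTIMATE2016 - POPESTIMATE2015) / POPESTIMATE2015 * 100, 1),
    longchange = round((POPESTIMATE2016 - POPESTIMATE2010) / POPESTIMATE2010 * 100, 1)
  ) %>%
  arrange(longchange)  # ascending: the steepest decline comes first

rounded
```

Rounding is best left for the final display step. If you plan to do more math with the new fields, keep the unrounded values.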


## Assignment

How has Nebraska's electorate changed from 2010 to the election of Donald Trump in 2016? Specifically, how has the total number of voters, Republicans, Democrats and Non-Partisans changed in that time **in each county** in Nebraska? Which counties have changed the most in terms of the number of registered voters? [Here is your dataset](https://www.dropbox.com/s/7epokhv6ruujfgf/registeredvoters.csv?dl=0).

#### Rubric

1. Did you import the data correctly?
2. Did you mutate the data correctly? Did you do it in one step?
3. Did you sort the data correctly?
4. Did you explain each step using Markdown?