# Creating New Variables
___

Some common reasons to create new variables:
* Missingness indicators - is this observation fully intact?
* "Cutting up" quantitative variables - turn them into factors
* Applying transforms
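
As a rough illustration of the first and last items, here is a minimal sketch on a small made-up data frame. The names `toy`, `intact`, and `logScore` are hypothetical and are not part of the Restaurants dataset.

```r
# Toy data frame (hypothetical values, not from the Restaurants dataset)
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  score = c(10, NA, 1000, 100)
)

# Missingness indicator: TRUE where every column of the row is observed
toy$intact <- complete.cases(toy)

# Transform: log10 of the score (NA stays NA)
toy$logScore <- log10(toy$score)

toy
```

`complete.cases` checks all columns at once, so it is computed here before any derived columns are added.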

Useful links:

* [Biostat lecture 2](http://www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf)
* [StatMethods transforms](http://statmethods.net/management/functions.html)
* [Plyr tutorial](http://plyr.had.co.nz/09-user/)

## Dataset to be used

The dataset used here - the *Baltimore Restaurants* dataset - can be downloaded [here](https://data.baltimorecity.gov/Culture-Arts/Restaurants/k5ry-ef3g).

In [3]:
rest.data <- read.csv("data/Restaurants.csv")

## Creating sequences
Sequences are particularly useful for generating indices into a dataset. Some basic uses of `seq` are covered here.

In [4]:
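s1 <- seq(1, 10, by = 2)          # step of 2: 1 3 5 7 9
s1

In [5]:
s2 <- seq(1, 10, length.out = 3)  # 3 evenly spaced values: 1.0 5.5 10.0
s2

In [6]:
x <- c(1, 3, 8, 25, 100)
seq_along(x)                      # one index per element of x: 1 2 3 4 5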
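
To make the link to indexing concrete, the following sketch uses sequences to pick rows out of a data frame. The `toy` data frame and its columns are made up for illustration.

```r
# Hypothetical toy data frame standing in for a real dataset
toy <- data.frame(id    = c("a", "b", "c", "d", "e", "f"),
                  value = c(5, 3, 8, 1, 9, 2))

# Row indices 1..n; seq_len is safe even when the data frame has zero rows
idx <- seq_len(nrow(toy))

# Every second row, starting from the first
oddRows <- toy[seq(1, nrow(toy), by = 2), ]
oddRows$id
```

`seq_len(n)` is generally preferable to `1:n` for loops and indexing, because `1:0` yields `c(1, 0)` while `seq_len(0)` yields an empty vector.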

## Adding new columns

In [7]:
rest.data$nearMe <- rest.data$neighborhood %in% c("Roland Park", "Homeland")  # TRUE for restaurants in these two neighborhoods
table(rest.data$nearMe)


FALSE  TRUE 
 1314    13 
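
One reason `%in%` works well for indicator columns is that it never returns `NA`, whereas `==` does when a value is missing. A small sketch, using a made-up vector rather than the real `neighborhood` column:

```r
# Hypothetical neighborhood values, including a missing one
hood <- c("Roland Park", NA, "Downtown", "Homeland")

# %in% treats a missing value as "not a match"
viaIn <- hood %in% c("Roland Park", "Homeland")

# == propagates the missing value
viaEq <- hood == "Roland Park"

viaIn   # TRUE FALSE FALSE TRUE
viaEq   # TRUE NA FALSE FALSE
```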

In [8]:
rest.data$zipWrong <- rest.data$zipCode < 0  # flag negative (and thus invalid) zip codes
table(rest.data$zipWrong)


FALSE  TRUE 
 1326     1 

## Quantitative into Qualitative

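Group a quantitative variable into ranges with `cut`. Here the breaks are the quartiles of the zip codes. By default `cut` builds intervals that are open on the left, so the minimum value falls outside the first interval and becomes `NA` unless `include.lowest = TRUE` is set. That is why the counts below sum to 1326 rather than 1327.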

In [9]:
rest.data$zipGroups <- cut(rest.data$zipCode, breaks=quantile(rest.data$zipCode))

In [10]:
table(rest.data$zipGroups)


(-2.123e+04,2.12e+04]  (2.12e+04,2.122e+04] (2.122e+04,2.123e+04] 
                  337                   375                   282 
(2.123e+04,2.129e+04] 
                  332 
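
The effect of `include.lowest` is easier to see on a tiny vector. This is a minimal sketch on made-up numbers, not on the zip codes themselves:

```r
# Hypothetical toy vector, not taken from the Restaurants dataset
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Default: intervals are (a, b], so the minimum (1) falls outside and becomes NA
g1 <- cut(x, breaks = quantile(x))
sum(is.na(g1))   # 1

# include.lowest = TRUE closes the first interval on the left: [a, b]
g2 <- cut(x, breaks = quantile(x), include.lowest = TRUE)
sum(is.na(g2))   # 0
table(g2)
```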

In [11]:
table(rest.data$zipGroups, rest.data$zipCode)

                       
                        -21226 21201 21202 21205 21206 21207 21208 21209 21210
  (-2.123e+04,2.12e+04]      0   136   201     0     0     0     0     0     0
  (2.12e+04,2.122e+04]       0     0     0    27    30     4     1     8    23
  (2.122e+04,2.123e+04]      0     0     0     0     0     0     0     0     0
  (2.123e+04,2.129e+04]      0     0     0     0     0     0     0     0     0
                       
                        21211 21212 21213 21214 21215 21216 21217 21218 21220
  (-2.123e+04,2.12e+04]     0     0     0     0     0     0     0     0     0
  (2.12e+04,2.122e+04]     41    28    31    17    54    10    32    69     0
  (2.122e+04,2.123e+04]     0     0     0     0     0     0     0     0     1
  (2.123e+04,2.129e+04]     0     0     0     0     0     0     0     0     0
                       
                        21222 21223 21224 21225 21226 21227 21229 21230 21231
  (-2.123e+04,2.12e+04]     0     0     0     0     0     0     0

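Alternatively, `cut2` from the `Hmisc` package splits a variable into `g` quantile groups without computing the breaks by hand.

In [12]:
library(Hmisc)  # provides cut2()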

In [13]:
rest.data$zipGroups2 <- cut2(rest.data$zipCode, g=5)

In [15]:
table(rest.data$zipGroups2)


[-21226,21205) [ 21205,21214) [ 21214,21225) [ 21225,21231) [ 21231,21287] 
           338            193            445            210            141 

Convert the zip codes directly into a factor with `factor`.

In [14]:
rest.data$zipcodefactor <- factor(rest.data$zipCode)
rest.data$zipcodefactor[1:10]
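
A factor stores each distinct value as a level plus an integer code, which matters when converting back to numbers. The sketch below uses a few made-up zip codes to show the difference:

```r
# Hypothetical zip codes, not taken from the Restaurants dataset
zip <- c(21231, 21201, 21231, 21205)
zipFactor <- factor(zip)

levels(zipFactor)       # distinct values, sorted: "21201" "21205" "21231"
as.integer(zipFactor)   # internal level codes: 3 1 3 2

# To recover the original numbers, go through the level labels
zipBack <- as.numeric(as.character(zipFactor))
zipBack
```

Calling `as.numeric` directly on a factor returns the level codes, not the original values, so the detour through `as.character` is needed.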
