# Creating Factor Variables

In the previous warm-up we explored how factor variables could
be used to split a dataset.  Such splits are usually performed in
order to apply a calculation to each split and perhaps even
combine the results in a later step.  This scenario is so
common that it has its own name: **split-apply-combine**.

In the last warm-up we used factor variables that came with the
original dataset for the split.  It's great when such factors
are readily available.  But sometimes we need to split according
to criteria that no existing factor variable captures.
In this case we often create one or more factor variables with
values that capture the desired criteria and then perform the
split with these new factor variables.  Here are some of the
ways to do it.

* [Regular Patterns](#regular-patterns)
* [Level Interactions](#level-interactions)
* [Column Computation](#column-computation)
* [Cut Function](#cut-function)


## <a id="regular-patterns">Regular Patterns</a>

Sometimes the data in your dataset is structured in regular
patterns.  A useful function for generating factor variables
in regular patterns is **gl** (for Generate Levels).  A few
examples will help.

In [1]:
gl(2, 4, labels=c('this', 'that'))

In [2]:
gl(2, 1, 8, labels=c('this', 'that'))

The parameters of `gl` are described below.

* `n` - the number of levels to generate
* `k` - the number of consecutive times each level is repeated
* `length` - (optional) the total length of the result, `n * k` by default
* `labels` - (optional) names assigned to the factor levels, defaults to the integers `1` through `n`

We can see from the outputs above that the result is a regular
pattern of two levels; so the first parameter is `2` in both
cases.  The difference is in the number of times each level
is repeated.  In the first case, each level is repeated `4`
times.  This results in groups of four adjacent elements.

The second example alternates every element; so the second
parameter is `1`.  The default length of such a pattern is
`n * k = 2 * 1 = 2`.

In [4]:
gl(2, 1, labels=c('this', 'that'))

In order to get eight elements like in the first example, we need
to specify the optional third parameter as `8`.
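
To tie a regular pattern back to split-apply-combine, here is a minimal sketch that uses a `gl` factor to split a vector and average each group.  The `values` vector is made up for illustration; only `gl`, `split`, and `sapply` come from base R.

```r
# Hypothetical measurements: eight values recorded in two blocks of four
values <- c(3, 5, 4, 6, 10, 12, 11, 13)

# Factor with the pattern this, this, this, this, that, that, that, that
block <- gl(2, 4, labels = c('this', 'that'))

# split: one vector per level; apply + combine: mean of each vector
groups <- split(values, block)
means <- sapply(groups, mean)
print(means)
# this = 4.5, that = 11.5
```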


## <a id="level-interactions">Level Interactions</a>

We can create a factor from two existing factors through their
**interaction** - that is, through the cross product of their
possible values.

In [5]:
f1 <- gl(2, 2, labels=c('this', 'that'))
f1
f2 <- gl(2, 1, labels=c('one', 'other'))
f2
interaction(f1, f2)

Note that `f2` has length `2`, while `f1` has length `4`.
An interaction pairs the factors element by element, so they should
cover the same number of elements.  Since the length of `f1` is a
multiple of the length of `f2`, *recycling* was used to extend `f2`
for the interaction.
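
The recycling can be made explicit.  The sketch below extends `f2` by hand with `rep` and compares the result with the recycled version; `f2_full`, `recycled`, and `explicit` are names introduced here for illustration.

```r
f1 <- gl(2, 2, labels = c('this', 'that'))   # this this that that
f2 <- gl(2, 1, labels = c('one', 'other'))   # one other

# Extend f2 to the length of f1 by hand: one other one other
f2_full <- rep(f2, length.out = length(f1))

recycled <- interaction(f1, f2)
explicit <- interaction(f1, f2_full)

# Both approaches pair the elements the same way
print(as.character(recycled))
print(as.character(explicit))

# The levels are the full cross product, including combinations
# that might not occur in the data
print(levels(recycled))
```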


## <a id="column-computation">Column Computation</a>

Another common way to create a factor variable is to use other
non-factor columns in a data frame.  In the following example,
we create a data frame consisting of days of the month for
August 2017.

In [11]:
aug2017 <- data.frame(day=1:31)
head(aug2017)

day
1
2
3
4
5
6


Now create a factor variable labeled by the days of the week.
Make sure the days of the week correctly correspond to the
day of the month.

In [12]:
dow <- as.factor(aug2017$day %% 7)
dow

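Now assign labels to these levels according to the day of the week.
The levels of `dow` are sorted from `0` to `6`, and level `0` belongs
to days 7, 14, 21, and 28.  In August 2017 those days were Mondays
(August 1 was a Tuesday), so the labels start with `mon`.

In [13]:
levels(dow) <- c('mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun')
dow

In [14]:
aug2017['day_of_week'] <- dow
head(aug2017, n=10)

day,day_of_week
1,tue
2,wed
3,thu
4,fri
5,sat
6,sun
7,mon
8,tue
9,wed
10,thu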
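
One way to double-check a hand-assigned labeling is to derive the weekday from actual calendar dates.  The sketch below assumes base R's `as.POSIXlt`, whose `wday` field counts from Sunday (`0`) to Saturday (`6`); the names `dates`, `wday`, and `dow_check` are introduced here for illustration.

```r
# All 31 dates of August 2017
dates <- as.Date('2017-08-01') + 0:30

# Numeric day of the week, 0 = Sunday ... 6 = Saturday (locale independent)
wday <- as.POSIXlt(dates)$wday

# Turn the numbers into a labeled factor
dow_check <- factor(wday, levels = 0:6,
                    labels = c('sun', 'mon', 'tue', 'wed', 'thu', 'fri', 'sat'))

print(head(data.frame(day = 1:31, day_of_week = dow_check), n = 7))
```

Using explicit `levels` and `labels` in `factor` avoids relying on the sort order of the computed values.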


## <a id="cut-function">Cut Function</a>

Another important way to generate factors is with the **cut** function.
When a category is to be based on ranges of values of a numeric variable,
the `cut` function can be used to create a factor variable from these
ranges.  Its first three parameters are

1. a numeric vector to cut
2. a specification for the cuts
3. (optional) labels for the cuts

The result is a factor variable with the same length as the first argument.
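
Before turning to a real dataset, here is a tiny sketch of the three parameters on a made-up vector (`scores` is hypothetical).  It also shows how the `right` argument changes which end of each interval is closed.

```r
scores <- c(1, 4, 5, 9, 10)

# Default: intervals are (0,5] and (5,10], closed on the right
by_default <- cut(scores, breaks = c(0, 5, 10), labels = c('low', 'high'))
print(as.character(by_default))
# "low" "low" "low" "high" "high"

# right = FALSE: intervals are [0,5) and [5,10), closed on the left,
# so 5 moves to 'high' and 10 falls outside every interval (NA)
left_closed <- cut(scores, breaks = c(0, 5, 10), labels = c('low', 'high'),
                   right = FALSE)
print(as.character(left_closed))
# "low" "low" "high" "high" NA
```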

Let's try this out on the **InsectSprays** dataset.

In [15]:
head(InsectSprays)

count,spray
10,A
7,A
20,A
14,A
14,A
12,A


We'll treat the **count** column as the number of insects eradicated by
the spray (the dataset's help page describes it only as an insect count).
The **spray** column is an anonymized ID for the spray.  Let's split
this dataset based on the "quality" of the spray, which we presume to
be proportional to the eradication count.  We'll create a **quality**
factor variable with a value of `bad`, `ok`, or `good`, depending
on the eradication count, in two ways.

1. `qualityA` - based on absolute values of the count
2. `qualityC` - based on quantiles (college students refer to this as "the curve").

For the absolute case, we simply divide the range of values into equal
intervals.

In [16]:
qualityA <- cut(InsectSprays$count, 3)
table(qualityA)

qualityA
(-0.026,8.67]   (8.67,17.3]     (17.3,26] 
           37            25            10 

The names of the intervals default to a string representation of
the intervals, which by default are open on the left and closed
on the right.  Let's assign friendlier, if less informative, names
with the `labels` parameter.

In [17]:
qualityA <- cut(InsectSprays$count, 3, labels=c('bad', 'ok', 'good'))
table(qualityA)

qualityA
 bad   ok good 
  37   25   10 

For the absolute case above, we specified the number of intervals (`3`) and
let the `cut` function establish equal-width intervals.  If we want more
control, we can specify the break points ourselves.  For the curve case,
we assign

* `bad` - to the lower third
* `ok` - to the middle third
* `good` - to the upper third

We use the `quantile` function to determine the break points.

In [19]:
curveBreakPoints <- quantile(InsectSprays$count, c(0, .33, .66, 1))
curveBreakPoints

In [20]:
qualityC <- cut(InsectSprays$count, curveBreakPoints, labels=c('bad', 'ok', 'good'))
table(qualityC)

qualityC
 bad   ok good 
  22   26   22 

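Since `qualityC` is based on a curve, one might expect the numbers in
each bucket to be equal.  They are only approximately equal because many
counts are tied, and tied values always fall into the same bucket.  The
choice of quantile algorithm can also shift the break points.  (Check the
`quantile` help documentation; there are no fewer than **nine** algorithms
from which to choose.)  The default is usually fine.

Note also that the buckets total 70 rather than 72.  Because the intervals
are open on the left, the two counts equal to the lowest break point (`0`)
fall outside every interval and become `NA`.  Passing `include.lowest=TRUE`
to `cut` would place them in the `bad` bucket.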
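
A minimal sketch of how `cut` treats the lowest break point, using the same break points as above.  The name `qualityC_all` is introduced here for illustration; `include.lowest=TRUE` asks `cut` to close the first interval on the left, so values equal to the minimum are kept rather than turned into `NA`.

```r
curveBreakPoints <- quantile(InsectSprays$count, c(0, .33, .66, 1))

# Without include.lowest, counts equal to the minimum become NA
qualityC <- cut(InsectSprays$count, curveBreakPoints,
                labels = c('bad', 'ok', 'good'))
print(sum(is.na(qualityC)))
# 2

# With include.lowest = TRUE every observation lands in a bucket
qualityC_all <- cut(InsectSprays$count, curveBreakPoints,
                    labels = c('bad', 'ok', 'good'), include.lowest = TRUE)
print(table(qualityC_all, useNA = 'ifany'))
# bad 24, ok 26, good 22
```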

Now that we have "quality buckets", let's see how they split the spray
brands.

In [21]:
table(InsectSprays$spray, qualityA)
table(InsectSprays$spray, qualityC)

   qualityA
    bad ok good
  A   1  8    3
  B   1  8    3
  C  12  0    0
  D  11  1    0
  E  12  0    0
  F   0  8    4

   qualityC
    bad ok good
  A   0  5    7
  B   0  4    8
  C   9  1    0
  D   5  7    0
  E   8  4    0
  F   0  5    7

We can see how our choice of splitting affected the distribution of
sprays into `bad`, `ok`, and `good`.
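
To close the loop on split-apply-combine, here is a minimal sketch that splits the counts by the absolute quality buckets and averages each group.  The names `countsByQuality` and `meanByQuality` are illustrative, not part of the dataset.

```r
# Recreate the absolute-value quality factor
qualityA <- cut(InsectSprays$count, 3, labels = c('bad', 'ok', 'good'))

# split: one vector of counts per quality bucket
countsByQuality <- split(InsectSprays$count, qualityA)

# apply + combine: mean count per bucket, collected into a named vector
meanByQuality <- sapply(countsByQuality, mean)
print(meanByQuality)
```

The same pattern works with `qualityC` or with an `interaction` of `spray` and a quality factor.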