# Exploring Factor Variables

Factor variables, also known as categorical variables, usually convey
a qualitative characteristic instead of a quantitative one.  Examples
of factor variables are

* a product brand
* a person's ethnicity
* a label for a diet

Factor variables in R are often created from strings.

In [1]:
myStringList <- c('one', 'one', 'two', 'two', 'one')
myStringList

Given such a vector of strings, we can tell R to treat it as a
factor using the **as.factor** function.

In [2]:
myFactor <- as.factor(myStringList)
myFactor

The **levels** command lists the unique values (the levels)
that a factor variable can take.

In [3]:
levels(myFactor)

We can assign new labels to the levels by assigning to the
output of the levels command.

In [4]:
levels(myFactor) <- c('first', 'second')
myFactor
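
The conversion and the relabeling can also be done in one step with the
`factor` function, which takes `levels` and `labels` arguments. A minimal
sketch, where the names `sizes` and `sizeFactor` are made up for
illustration and the data mirrors the strings used above:

```r
# Hypothetical vector, mirroring the strings used above
sizes <- c('one', 'one', 'two', 'two', 'one')

# levels: the values as they occur in the data
# labels: the names to display for those levels, in the same order
sizeFactor <- factor(sizes,
                     levels = c('one', 'two'),
                     labels = c('first', 'second'))

levels(sizeFactor)
table(sizeFactor)
```

Listing the levels explicitly also fixes their order, rather than leaving
it to R's default alphabetical sorting.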

Factor counts are easily tabulated with the **table** command.

In [5]:
table(myFactor)

myFactor
 first second 
     3      2 

Let's experiment with a larger dataset like `ChickWeight`.

In [6]:
cw <- ChickWeight
head(cw, n=10)

weight,Time,Chick,Diet
42,0,1,1
51,2,1,1
59,4,1,1
64,6,1,1
76,8,1,1
93,10,1,1
106,12,1,1
125,14,1,1
149,16,1,1
171,18,1,1


This is a dataset that records the weight of a set of
baby chickens fed on different diets.  The `Chick` and
`Diet` variables are factors.

In [7]:
str(cw$Chick)
str(cw$Diet)

 Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
 Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...


We can quickly compare the counts for each diet.

In [8]:
table(cw$Diet)


  1   2   3   4 
220 120 120 118 

We can see that they are not all equal.  By supplying two
arguments to `table`, we get a 2-D table.  The first argument
will represent the rows, the second the columns.

In [9]:
table(cw$Time, cw$Diet)

    
      1  2  3  4
  0  20 10 10 10
  2  20 10 10 10
  4  19 10 10 10
  6  19 10 10 10
  8  19 10 10 10
  10 19 10 10 10
  12 19 10 10 10
  14 18 10 10 10
  16 17 10 10 10
  18 17 10 10 10
  20 17 10 10  9
  21 16 10 10  9

The counts in each column still sum to the total for that diet,
but this time they are broken down by `Time`, since we
added `Time` as an argument.
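
One way to check that the column totals are unchanged is to sum the
columns of the 2-D table and compare them with the one-argument counts.
This is only a sketch; the name `byTimeAndDiet` is introduced here for
illustration:

```r
cw <- ChickWeight

# Rows are Time values, columns are Diet values
byTimeAndDiet <- table(cw$Time, cw$Diet)

# Summing each column collapses the Time dimension ...
colSums(byTimeAndDiet)

# ... which should match the counts per diet from the one-argument call
table(cw$Diet)
```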

Another way to produce tables is with the `xtabs` command.
The `xtabs` command accepts a formula for its first parameter
and a data frame for its second parameter.  Formulas take some
getting used to, but they are more flexible.

In [10]:
xtabs( ~ Diet, cw)

Diet
  1   2   3   4 
220 120 120 118 

The formula ` ~ Diet` has nothing on the left hand side and
`Diet` on the right hand side.  When there is nothing on the
left hand side, that means use counts (just like the `table`
command).  The right hand side determines the column to count.
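
It may help to see that a formula is an ordinary R object that can be
stored in a variable and inspected before it is passed to `xtabs`. A
small sketch, with `countFormula` as a made-up name:

```r
# A one-sided formula: nothing to the left of the ~
countFormula <- ~ Diet

class(countFormula)      # the object's class is "formula"
all.vars(countFormula)   # the variable names mentioned in the formula

# Passing the stored formula is the same as writing it inline
xtabs(countFormula, ChickWeight)
```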

The equivalent of the two-argument `table` command for `xtabs`
is

In [11]:
xtabs( ~ Time + Diet, cw)

    Diet
Time  1  2  3  4
  0  20 10 10 10
  2  20 10 10 10
  4  19 10 10 10
  6  19 10 10 10
  8  19 10 10 10
  10 19 10 10 10
  12 19 10 10 10
  14 18 10 10 10
  16 17 10 10 10
  18 17 10 10 10
  20 17 10 10  9
  21 16 10 10  9

Once again, the left hand side is empty.  The right hand side
separates its elements with the `+` symbol.  The first specifies
the rows and the second specifies the columns, just like with
the `table` command.

So far we've left the left hand side blank, which counts the
various entries.  But the left hand side allows us to do more
than count the entries.  We can sum a column's values rather than
simply count rows.  Let's sum the associated `weight` values for
each `Time` and `Diet` pair.

In [12]:
xtabs(weight ~ Time + Diet, cw)

    Diet
Time    1    2    3    4
  0   828  407  408  410
  2   945  494  504  518
  4  1073  598  622  645
  6  1269  754  779  839
  8  1514  917  984 1056
  10 1768 1085 1171 1260
  12 2062 1313 1444 1514
  14 2221 1419 1645 1618
  16 2459 1647 1974 1820
  18 2702 1877 2331 2029
  20 2897 2056 2589 2105
  21 2844 2147 2703 2147

In the above example, we used the values of factor variables
to **split** the dataset into a collection of disjoint
subsets.  Each subset corresponds to a particular combination
of `Time` and `Diet`.  The `weight` values for each subset were
summed.
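
The split-then-sum idea can be spelled out by hand with base R's `split`
and `sapply`. This is an illustrative sketch of the concept, not a
description of how `xtabs` is implemented; the names `groups` and
`groupSums` are introduced here:

```r
cw <- ChickWeight

# Split the weights into one subset per Time x Diet combination.
# The subsets are named "<Time>.<Diet>", e.g. "0.1" for Time 0, Diet 1.
groups <- split(cw$weight, list(cw$Time, cw$Diet))
length(groups)            # 12 time points x 4 diets

# Apply sum to each subset
groupSums <- sapply(groups, sum)

# Compare with the Time 0, Diet 1 cell of the xtabs output above
groupSums[['0.1']]
```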

The **aggregate** function allows us to extend this notion beyond
sums to arbitrary operations.  Let's say we're interested in the
**mean** rather than the sum of each `Time` x `Diet` combination.

In [14]:
ag <- aggregate(weight ~ Time + Diet, cw, mean)
head(ag, n=10)

Time,Diet,weight
0,1,41.4
2,1,47.25
4,1,56.47368
6,1,66.78947
8,1,79.68421
10,1,93.05263
12,1,108.52632
14,1,123.38889
16,1,144.64706
18,1,158.94118


The `aggregate` function offers a formula interface much like `xtabs`.
There are several differences.

* There is a third parameter to specify an arbitrary function
  to operate on the variable specified by the left hand side
  of the formula.

* The result is a `data.frame` where the right hand side variables
  are columns rather than dimension labels of a table.
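
The second difference can be seen by laying the `aggregate` result back
out as a table. Because the data frame holds exactly one row per `Time`
and `Diet` combination, passing it to `xtabs` simply places each mean in
its cell. A sketch, where `meanTable` is a name made up for illustration:

```r
ag <- aggregate(weight ~ Time + Diet, ChickWeight, mean)

# One row per combination, with Time and Diet as ordinary columns
str(ag)

# Lay the means out with Time as rows and Diet as columns
meanTable <- xtabs(weight ~ Time + Diet, ag)
meanTable['0', '1']   # the mean weight for Time 0, Diet 1
```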

The third parameter does not have to be a predefined R function.
We can dynamically provide our own.

In [15]:
ag <- aggregate(weight ~ Time + Diet, cw, function(x) { max(x) - min(x) })
head(ag, n=10)

Time,Diet,weight
0,1,4
2,1,16
4,1,15
6,1,33
8,1,55
10,1,88
12,1,114
14,1,124
16,1,156
18,1,169
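
If the same calculation is needed more than once, the function can be
given a name first. The helper `weightSpread` below is hypothetical; it
computes the same max-minus-min spread using `range` and `diff`:

```r
# Hypothetical helper: spread between the heaviest and lightest chick
weightSpread <- function(x) {
  diff(range(x))   # range() returns c(min, max); diff() subtracts them
}

spread <- aggregate(weight ~ Time + Diet, ChickWeight, weightSpread)
head(spread, n = 3)
```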


## Summary

In this *R Warm-up* we explored factor variables.
A vector can be converted to a factor type using the
`as.factor` function.

A factor variable is *qualitative* rather than
quantitative in character.  Even when represented with
numbers, quantitative calculations on a factor variable,
such as average, maximum, or variance, don't usually make
sense.  A popular use for factor variables is to *split*
a dataset by grouping members of the set according to the
value of the factor variable.
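
A short sketch of why numeric-looking factors should not be treated as
quantities: converting a factor with `as.numeric` yields its internal
level codes, not the numbers shown in its labels. The vector `dietLabels`
is made up for illustration:

```r
# Hypothetical factor whose labels happen to look like numbers
dietLabels <- factor(c('10', '20', '20', '40'))

as.numeric(dietLabels)                 # internal level codes: 1 2 2 3
as.numeric(as.character(dietLabels))   # the labels as numbers: 10 20 20 40
```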

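R provides functions for analyzing the distribution of the
values of a factor variable.  We explored a few of these,
namely `table` and `xtabs`.  The `xtabs` function extends
the capabilities of `table` by allowing for sums of particular
columns rather than just the counts of each combination of
factors.  The `aggregate` function goes a step further by
applying an arbitrary function, such as `mean`, to each
combination and returning the result as a data frame.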