# Lecture 9.1: Factor
<div style="border: 1px double black; padding: 10px; margin: 10px">

**After today's lecture you will understand:**
* How to work with categorical variables
</div>

This corresponds to Chapter 15 of your book.



    




In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## Factors
Recall that a random variable is *categorical* if it takes on one of a (small) number of discrete values. 

In [2]:
birth_months <- c("Jan", "Feb", "Sep", "Ser", "Dec", "Jan", "Jul", "Aug")  # categorical variable
birth_months

Using strings to record this variable has some problems. First, there are only twelve possible months, but nothing prevents typos such as the `"Ser"` entry above.

Second, strings do not sort in a useful way: they are ordered alphabetically rather than by calendar month.
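As a quick illustration, sorting the character vector defined above gives alphabetical order, not calendar order. The name `sorted_months` is introduced here only for the example.

```r
# Character vectors sort alphabetically, which is not calendar order.
birth_months <- c("Jan", "Feb", "Sep", "Ser", "Dec", "Jan", "Jul", "Aug")

sorted_months <- sort(birth_months)
sorted_months  # "Aug" comes first, and the typo "Ser" is kept as a valid string
```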

You can fix both problems by converting the categorical variable to a factor.

The *possible* values of a categorical variable are called the *levels*. The levels of `birth_months` should be `Jan`, `Feb`, ..., `Dec`. The *actual* values of `birth_months` are just called the values.

*Factors* are the traditional way to represent categorical data in R. To create a factor, pass the values and the levels to the `factor` function.
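A minimal sketch of creating the factor, assuming the twelve three-letter abbreviations are the intended levels. The names `month_levels` and `birth_factor` are chosen for this example.

```r
# The possible values (levels), listed in calendar order.
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# The actual values, including the typo "Ser".
birth_months <- c("Jan", "Feb", "Sep", "Ser", "Dec", "Jan", "Jul", "Aug")

birth_factor <- factor(birth_months, levels = month_levels)
birth_factor          # "Ser" is not a level, so it is stored as NA
levels(birth_factor)  # all twelve months, in the order given
sort(birth_factor)    # sorts by level order; NA is dropped by default
```

Sorting the factor now follows the calendar, because `sort()` uses the order of the levels rather than the alphabet.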

If you specify the factor levels using the `levels=` option, then that will specify the default order. If you do *not* specify the levels, then they will be sorted alphabetically by default:
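For example, omitting `levels=` on the same vector gives alphabetically sorted levels (the name `default_factor` is illustrative):

```r
birth_months <- c("Jan", "Feb", "Sep", "Ser", "Dec", "Jan", "Jul", "Aug")

# No levels supplied: R uses the sorted unique values as the levels.
default_factor <- factor(birth_months)
levels(default_factor)  # alphabetical, and the typo "Ser" becomes a level
```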

It's best to be explicit about the factor levels. This way, if there are typos or data entry errors, you will catch them more easily:
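One way to surface such errors, sketched here with base R only: values that are not among the levels become `NA`, so comparing the factor with the original strings reveals the offending entries. The built-in constant `month.abb` supplies the twelve abbreviations; `bad_values` is a name chosen for this example.

```r
birth_months <- c("Jan", "Feb", "Sep", "Ser", "Dec", "Jan", "Jul", "Aug")

# month.abb is a built-in vector: "Jan", "Feb", ..., "Dec".
birth_factor <- factor(birth_months, levels = month.abb)

# Entries that were present in the data but did not match any level.
bad_values <- birth_months[is.na(birth_factor) & !is.na(birth_months)]
bad_values
```

Note that `factor()` performs this conversion silently. If a warning is preferred, `readr::parse_factor()` with the same `levels` argument reports values that fail to match.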

### The `forcats` package
The `tidyverse` includes `forcats`, a package with tools for working with factors. Recent versions of the `tidyverse` metapackage attach it automatically (it appears in the startup message above), but older versions did not, in which case you must load it manually:

In [11]:
library(forcats)

`forcats` commands are prefixed by `fct_` (compare the `str_` prefix in `stringr`).

For the rest of the examples, we'll use a data set included in `forcats` called `gss_cat`. This is a standard data set from the General Social Survey which contains a lot of categorical variables:

In [12]:
print(gss_cat)

# A tibble: 21,483 x 9
    year marital     age race  rincome    partyid     relig     denom    tvhours
   <int> <fct>     <int> <fct> <fct>      <fct>       <fct>     <fct>      <int>
 1  2000 Never ma…    26 White $8000 to … Ind,near r… Protesta… Souther…      12
 2  2000 Divorced     48 White $8000 to … Not str re… Protesta… Baptist…      NA
 3  2000 Widowed      67 White Not appli… Independent Protesta… No deno…       2
 4  2000 Never ma…    39 White Not appli… Ind,near r… Orthodox… Not app…       4
 5  2000 Divorced     25 White Not appli… Not str de… None      Not app…       1
 6  2000

### Order
One advantage of factors is that they can be ordered. This enables them to sort and plot in the way you would expect. Compare:
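A small side-by-side comparison, using a made-up vector `x` and the built-in `month.abb` constant as the levels:

```r
x <- c("Dec", "Apr", "Jan", "Mar")

# As strings: alphabetical order.
sorted_strings <- sort(x)
sorted_strings

# As a factor with calendar-ordered levels: calendar order.
f <- factor(x, levels = month.abb)
sorted_factor <- sort(f)
sorted_factor
```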


There are several options for reordering factor levels. The first is `fct_reorder` which we have already seen. It reorders a factor based on the values of another continuous variable.

Suppose we want a plot showing whether religion (`relig`) is associated with television viewing (`tvhours`).
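A minimal sketch of such a plot, assuming the mean of `tvhours` within each religion is a reasonable summary. The name `relig_summary` is chosen for this example.

```r
library(tidyverse)

# Average TV hours per religion, ignoring missing values.
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE), n = n())

# Religions appear on the y-axis in their existing level order.
ggplot(relig_summary, aes(tvhours, relig)) +
  geom_point()
```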

The above plot is very difficult to interpret because there is no overall pattern. We can improve it by reordering the levels of religion using the `fct_reorder` function. It takes two main arguments: the factor you want to modify and a numeric vector used to reorder its levels.
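The reordering could look like the following sketch, which repeats the per-religion summary so that it runs on its own. `relig_summary` and `relig_ordered` are illustrative names.

```r
library(tidyverse)

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(tvhours = mean(tvhours, na.rm = TRUE), n = n())

# Reorder the levels of relig by the corresponding tvhours value.
relig_ordered <- relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours))

ggplot(relig_ordered, aes(tvhours, relig)) +
  geom_point()
```

Only the order of the levels changes; the values stored in each row stay the same.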

What if we want to create a plot to investigate how age varies across reported income level?
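One possible version of this plot, again summarising by the mean and reordering the income levels by age. The name `rincome_summary` is an assumption made for the example.

```r
library(tidyverse)

# Average age within each reported income level.
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(age = mean(age, na.rm = TRUE), n = n())

# Reordering rincome by age discards the natural income ordering.
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) +
  geom_point()
```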

The plot may look visually appealing, but the $y$-axis is totally jumbled! Income levels already have a natural order, so `fct_reorder` should be reserved for factors whose levels are not naturally ordered.

Nevertheless, there are a few categories that can be sensibly broken out: `No answer`, `Not applicable`, `Don't know` and `Refused`. The command `fct_relevel(f, lvls)` takes a factor `f` and returns a new factor in which the levels listed in the vector `lvls` are moved to the front:
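For instance, the following sketch moves `Not applicable` to the front while leaving the other income levels in their original order. `rincome_summary` and `rincome_front` are illustrative names.

```r
library(tidyverse)

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(age = mean(age, na.rm = TRUE), n = n())

# Move "Not applicable" to the front of the level order.
rincome_front <- fct_relevel(rincome_summary$rincome, "Not applicable")
levels(rincome_front)

ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
```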

Finally, we have a couple of other useful commands. `fct_infreq(f)` reorders the levels of `f` by decreasing frequency, so the most common level comes first. This is useful with `geom_bar`:
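A short sketch using the `marital` column of `gss_cat` (the name `marital_by_freq` is chosen for this example):

```r
library(tidyverse)

# Levels ordered from most to least common.
marital_by_freq <- fct_infreq(gss_cat$marital)
levels(marital_by_freq)

# Bars now appear in order of decreasing count.
gss_cat %>%
  mutate(marital = fct_infreq(marital)) %>%
  ggplot(aes(marital)) +
  geom_bar()
```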

`fct_rev` will reverse the order of a factor:
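Combining it with `fct_infreq` gives bars in order of increasing frequency. This is a sketch, with `marital_rev` as an illustrative name:

```r
library(tidyverse)

# Reverse the frequency ordering: least common level first.
marital_rev <- fct_rev(fct_infreq(gss_cat$marital))
levels(marital_rev)

gss_cat %>%
  mutate(marital = fct_rev(fct_infreq(marital))) %>%
  ggplot(aes(marital)) +
  geom_bar()
```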

### Altering levels
In many cases it is necessary to change the *levels* of a factor (the labels attached to its values), especially when generating plots and tables for publications.

The `fct_recode` command makes this easy. This command takes a factor and a set of `new_level=old_level` options:
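As an example, the party identification labels in `gss_cat` can be made more consistent. The new labels below are chosen for illustration; the old labels are the existing levels of `partyid`.

```r
library(tidyverse)

# Each argument has the form "new level" = "old level".
partyid_recoded <- fct_recode(gss_cat$partyid,
  "Republican, strong"    = "Strong republican",
  "Republican, weak"      = "Not str republican",
  "Independent, near rep" = "Ind,near rep",
  "Independent, near dem" = "Ind,near dem",
  "Democrat, weak"        = "Not str democrat",
  "Democrat, strong"      = "Strong democrat"
)

levels(partyid_recoded)
```

Levels that are not mentioned are left unchanged.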

A useful feature of `fct_recode` is that it can combine multiple levels into one, by assigning several old levels to the same new level:
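A sketch that merges three small `partyid` categories into a single level. The label `"Other"` and the name `partyid_collapsed` are assumptions made for this example.

```r
library(tidyverse)

# Three old levels are mapped to the same new level, "Other".
partyid_collapsed <- fct_recode(gss_cat$partyid,
  "Other" = "No answer",
  "Other" = "Don't know",
  "Other" = "Other party"
)

levels(partyid_collapsed)
table(partyid_collapsed)
```

When many levels need to be merged, `fct_collapse()` offers a more compact way to express the same idea.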