In [None]:
options(jupyter.rich_display = FALSE)
options(repr.plot.width=6, repr.plot.height=4)

# Factors and categories

Consider the following data table:

|Name|Gender|Month of Birth|
|----|----|-----|
|Can|Male|January|
|Cem|Male|July|
|Hande|Female|May|
|Mehmet|Male|May|
|Deniz|Female|February|
|Kemal|Male|July|
|Derya|Female|May|
|Fatma|Female|April|

* All columns are strings.
* `Gender` and `Month of Birth` columns can be considered **categories**.

# Categories and levels

* A _categorical variable_ (factor) takes one value from a predetermined set of discrete values.
    * Day of week
    * Month of year
    * Shirt sizes
* A single value of a categorical variable is called a _level_
    * Monday
    * December
    * XL

# Pause to think

Find three quantities that cannot be represented as a factor variable.

1. ...
2. ...
3. ...

Generate vectors to hold the relevant data.

In [None]:
name <- c("Can","Cem","Hande","Mehmet","Deniz","Kemal","Derya","Fatma")
gender <- c("Male","Male","Female","Male","Female","Male","Female","Female")
mode(gender)

We can convert the `gender` vector to a factor variable using the `factor()` function.

In [None]:
gender_fac <- factor(gender)
gender_fac

# Getting the levels of a factor

A factor vector has an additional attribute: the _levels_ information.

In [None]:
levels(gender_fac)

In [None]:
nlevels(gender_fac)

Also, common R functions for analysis and data description handle factors in specialized ways.

In [None]:
summary(gender)  # character vector

In [None]:
summary(gender_fac) # factor

One can change the level names easily using an assignment to the `levels()` function.

In [None]:
levels(gender_fac) <- c("F","M")
gender_fac

Internally, categories are represented with integers starting at 1.
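
A minimal sketch of how these integer codes can be inspected. The vector `g` is a made-up example, not part of the data above. `as.integer()` returns the code of each element, and indexing `levels()` with those codes recovers the level names.

```r
# Hypothetical example vector, used only for illustration
g <- factor(c("Male", "Female", "Male"))

codes <- as.integer(g)          # internal integer code of each element
names_back <- levels(g)[codes]  # map the codes back to the level names

print(codes)
print(names_back)
```

The codes follow the order of `levels(g)`, so the first level gets code 1, the second gets code 2, and so on.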

# Indexing and subsetting

Factor-valued vectors are subsetted in the same way as any other vector.

In [None]:
print(gender_fac[2:5])

In [None]:
gender_fac[c(3,5,7:8)]

Note that after subsetting a factor object, the object continues to store
all defined levels even if some of the levels are no longer represented in the
subsetted object.
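
As a small illustration of this behavior, the sketch below uses a made-up factor `f` and compares ordinary subsetting with the `drop = TRUE` argument of the factor subsetting operator, which discards unused levels at the same time.

```r
# Hypothetical factor with three levels
f <- factor(c("a", "b", "c", "a"))

sub_keep <- f[c(1, 4)]               # only "a" remains, but all levels are kept
sub_drop <- f[c(1, 4), drop = TRUE]  # unused levels are dropped while subsetting

levels(sub_keep)
levels(sub_drop)
```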

# Filtering with factors

In [None]:
gender_fac

In [None]:
gender_fac=="M"

In [None]:
name[gender_fac=="M"]

# Removing categories

Sometimes we may want to remove a level from a factor. For example, consider the following factor, where the same category appears under two different spellings.

In [None]:
gender_fac <- factor(c("Male","Male","Female","Male","female","Male","female","Female"))
gender_fac

The factor has technically three levels, but actually `"female"` and `"Female"` are the same. Fix this by overwriting all occurrences of `"female"` with `"Female"`.

In [None]:
gender_fac[gender_fac=="female"] <- "Female"
gender_fac

However, the levels attribute still lists the invalid `"female"` category. To remove it, we use the `droplevels()` function. It removes all levels for which there are no entries.

In [None]:
gender_fac <- droplevels(gender_fac)
gender_fac
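
The same cleanup can also be sketched in a single step. Assigning an existing level name to another level through `levels()` merges the two levels, so no separate `droplevels()` call is needed. The factor `f` below is a made-up example.

```r
# Hypothetical factor with an inconsistently spelled level
f <- factor(c("Male", "female", "Female", "female"))

# Renaming "female" to the existing name "Female" merges the two levels
levels(f)[levels(f) == "female"] <- "Female"

levels(f)
as.character(f)
```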

# Nominal and ordinal factors
* The _gender_ factor is an example of a _nominal factor_: There is no inherent order between levels. We cannot ask the question whether "Male" is greater than "Female" or not.

* The _month of birth_ information is an _ordinal factor_: Months appear in a certain order, so it makes sense to say that "January" < "February".

|Name|Gender|Month of Birth|
|----|----|-----|
|Can|Male|January|
|Cem|Male|July|
|Hande|Female|May|
|Mehmet|Male|May|
|Deniz|Female|February|
|Kemal|Male|July|
|Derya|Female|May|
|Fatma|Female|April|

Let’s store the observed month-of-birth (MOB) data as a character vector.

In [None]:
mob <- c("January","July","May","May","February","July","May","April")

Two problems with this vector:

1. Only five unique months. Not all possible categories are represented.
2. Doesn’t reflect the natural order of the months. If you compare January and February to see which is greater, you get:

In [None]:
mob[1] < mob[5]  # alphabetical ordering

When we create a factor object, we can set the `levels` parameter of the `factor()` function to ensure that it holds all the levels of the factor in the correct order.

In [None]:
months <- c("January","February","March","April","May",
            "June","July","August","September","October","November","December")

In [None]:
mob_fac <- factor(mob, levels=months, ordered=TRUE)
mob_fac

Comparisons can be done correctly:

In [None]:
mob_fac[1] < mob_fac[5]  # January < February
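
An ordered factor can also be compared against a level name directly, which is handy for filtering. The sketch below rebuilds the data under a separate name, `mob_ord`, and uses the built-in constant `month.name` (the twelve English month names) in place of the hand-written `months` vector.

```r
# month.name is a built-in R constant with the month names in calendar order
mob <- c("January", "July", "May", "May", "February", "July", "May", "April")
mob_ord <- factor(mob, levels = month.name, ordered = TRUE)

late <- mob_ord >= "May"   # element-wise comparison against a level name
mob[late]                  # births in May or later
range(mob_ord)             # earliest and latest observed month
```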

The `summary()` function gives a count of elements in each category.

In [None]:
summary(mob_fac)

# Combining two factor objects

Earlier we saw that two vectors are combined into a single vector with the `c()` function:

In [None]:
x1 <- c(1,2,3,4)
x2 <- c(7,8,9)
c(x1, x2)

However, in R versions before 4.1.0 this does not work as expected with factor objects (from R 4.1.0 onwards, `c()` returns a combined factor):

In [None]:
mob_fac
mob2 <- factor(c("April","March","May"), levels=months, ordered=TRUE)
mob2

In [None]:
c(mob_fac, mob2)

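* In R versions before 4.1.0, `c()` returns a vector of integers: it combines the internal integer codes of the categories, which is not what we want.
* From R 4.1.0 onwards, `c()` applied to factors returns a factor, so the indirect approach that follows matters mainly for older versions.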

Where `c()` returns integer codes, factors can be combined indirectly: first use the result of `c()` to index the vector of levels (`levels(mob_fac)`, which here equals `months`), which holds all categories in order. This gives a character vector:

In [None]:
levels(mob_fac)[ c(mob_fac, mob2) ]

Then we convert this to a factor object:

In [None]:
factor(levels(mob_fac)[ c(mob_fac, mob2) ], levels=levels(mob_fac), ordered=TRUE)

If we need to perform this task frequently, we can write a function for it:

In [None]:
concat_factors <- function(f1, f2, ordered=TRUE) {
    stopifnot( identical(levels(f1), levels(f2)) ) # ensure that the levels are the same
    return( factor(levels(f1)[ c(f1,f2) ], levels=levels(f1), ordered=ordered) )
}

In [None]:
concat_factors(mob_fac, mob2)
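
An alternative sketch that does not depend on how `c()` treats factors: convert both factors to character vectors first, concatenate those, and rebuild the factor. The names `f1`, `f2` and `combined` are made up for this illustration, and `month.name` is R's built-in vector of month names.

```r
f1 <- factor(c("May", "January"), levels = month.name, ordered = TRUE)
f2 <- factor(c("April", "March"), levels = month.name, ordered = TRUE)

# Work on the labels, so the internal integer codes are never involved
combined <- factor(c(as.character(f1), as.character(f2)),
                   levels = levels(f1), ordered = TRUE)
combined
```

Because it works on labels rather than codes, this version should give the same result whether `c()` returns integers or a factor.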

# Binning
One can create categories from continuous data, such as Small/Medium/Large, or Low/High.

Example:

In [None]:
x <- c(11, 18, 36, 74, 43, 81, 95, 64, 32, 51)

Suppose we want to categorize this data as _low_ for values in [0, 30), _medium_ for [30, 70), and _high_ for [70, 100]. The notation [30, 70) means that the value 30 belongs to this category, but 70 does not.

The `cut()` function generates a factor object whose interval endpoints are specified by the `breaks` parameter.

In [None]:
cut(x, breaks=c(0, 30, 70, 100))

However, note that the ends of the intervals are not as we want. By default, the left endpoint of each interval is excluded and the right endpoint is included.

To fix this, we set the parameter `right` to `FALSE`.

In [None]:
cut(x, breaks = c(0, 30, 70, 100), right = FALSE)

But the last value 100 is excluded now. We can include it by setting the `include.lowest` parameter to `TRUE`.

In [None]:
cut(x, breaks = c(0, 30, 70, 100), right = FALSE, include.lowest = TRUE)

The levels can be set with the `labels` parameter.

In [None]:
cut(x, breaks = c(0, 30, 70, 100), right = FALSE, include.lowest = TRUE,
    labels = c("Low", "Medium", "High"))
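
To check the boundary behavior, the following sketch bins values that sit exactly on the break points. The vector `edge` is made up for this purpose.

```r
edge <- c(0, 30, 70, 100)   # values lying exactly on the break points

binned <- cut(edge, breaks = c(0, 30, 70, 100), right = FALSE,
              include.lowest = TRUE, labels = c("Low", "Medium", "High"))
binned

# Without include.lowest = TRUE, the value 100 falls outside every interval
no_edge <- cut(edge, breaks = c(0, 30, 70, 100), right = FALSE)
is.na(no_edge)
```

Despite its name, `include.lowest = TRUE` closes the last interval on the right when `right = FALSE`, which is why 100 ends up in the highest category.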

# Factors and data frames

Suppose that we create a data frame out of the `name`, `gender`, and `mob` vectors:

In [None]:
df <- data.frame(name, gender, mob)
df

See a summary of the dataframe:

In [None]:
summary(df)

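In R versions before 4.0.0, all fields are interpreted as factors in `df`, including `name`, because the `stringsAsFactors` parameter of `data.frame()` was `TRUE` by default. Since R 4.0.0 the default is `FALSE`, and character columns stay as character vectors. Either way, we can set the parameter to `FALSE` explicitly and use the factor vectors we prepared before: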

In [None]:
df <- data.frame(name, gender_fac, mob_fac, stringsAsFactors = FALSE )
summary(df)
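
A quick way to confirm which columns ended up as factors is to apply `is.factor()` to every column. The sketch below uses a small made-up data frame `d` rather than the one above.

```r
# Hypothetical two-row example
nm  <- c("Can", "Hande")
sex <- factor(c("Male", "Female"))

d <- data.frame(nm, sex, stringsAsFactors = FALSE)
col_is_factor <- sapply(d, is.factor)  # nm stays character, sex stays a factor
col_is_factor
```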

As another example, consider the _mtcars_ data set:

In [None]:
head(mtcars)

The `summary()` function returns the summary statistics for each numeric field.

In [None]:
summary(mtcars)

However, it makes more sense to treat `"cyl"`, `"vs"`, `"am"`, `"gear"` and `"carb"` as categorical variables.

In [None]:
mtcars$cyl <- factor(mtcars$cyl, ordered=TRUE)
mtcars$gear <- factor(mtcars$gear, ordered=TRUE)
mtcars$carb <- factor(mtcars$carb, ordered=TRUE)
mtcars$vs <- factor(mtcars$vs)
mtcars$am <- factor(mtcars$am)

Now we can use the `summary()` function to get the counts of categories in each factor field.

In [None]:
summary(mtcars)

The `"vs"` (V engine or straight) and `"am"` (Automatic or manual transmission) fields have level values 0 or 1. Let's replace them with clearer labels.

In [None]:
levels(mtcars$vs) <- c("V-engine","Straight")  # 0 = V-shaped, 1 = straight
levels(mtcars$am) <- c("Automatic","Manual")

In [None]:
summary(mtcars)
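
The conversion and the relabeling can also be sketched as one step, by passing both `levels` and `labels` to `factor()`. This states the mapping from each code to its label explicitly instead of relying on the default level order. The copy `cars` is a hypothetical name, taken fresh from the built-in data set.

```r
cars <- datasets::mtcars   # fresh copy of the built-in data set

# levels gives the original codes, labels the names they map to, in the same order
cars$vs <- factor(cars$vs, levels = c(0, 1), labels = c("V-engine", "Straight"))
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

table(cars$vs, cars$am)
```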

# Plotting factor variables
When we specify a factor-type vector as data, the `plot()` function displays a bar plot.

In [None]:
plot(mtcars$am)

When the x-axis is categorical and the y-axis is numerical, a boxplot is displayed.

In [None]:
plot(x = mtcars$vs, y = mtcars$hp, ylab = "Horsepower")

If both axes are categorical, a spine plot is displayed: a stacked bar plot in which the width of each bar reflects the size of its group.

In [None]:
plot(x = mtcars$vs, y=mtcars$gear, xlab="Engine type",ylab="Gear")
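
The three cases above can also be requested explicitly with the corresponding base graphics functions. This is a sketch on a fresh copy of the data (`cars` is a hypothetical name), intended to show roughly which plot `plot()` selects in each case.

```r
cars <- datasets::mtcars
cars$vs   <- factor(cars$vs, labels = c("V-engine", "Straight"))
cars$am   <- factor(cars$am, labels = c("Automatic", "Manual"))
cars$gear <- factor(cars$gear)

barplot(table(cars$am))             # a single factor: bar plot of the counts
boxplot(hp ~ vs, data = cars)       # numeric y against factor x: box plot
spineplot(gear ~ vs, data = cars)   # factor y against factor x: spine plot
```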