In [1]:
library(tidyverse)
employees <- read_csv("_build/data/employee_data.csv")
employees$Salary <- parse_number(employees$Salary)
employees$Start_Date <- parse_date(employees$Start_Date, format = "%m/%d/%Y")

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang


Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2


-- Attaching packages --------------------------------------- tidyverse 1.2.1 --


v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  


-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


Parsed with column specification:
cols(
  ID = col_double(),
  Name = col_character(),
  Gender = col_character(),
  Age = col_double(),
  Rating = col_double(),
  Degree = col_character(),
  Start_Date = col_character(),
  Retired = col_logical(),
  Division = col_character(),
  Salary = col_character()
)


# Categorical Variables

Categorical variables take on values that correspond to categories. For example, `Degree` in `employees` can only take on the values `High School`, `Associate's`, `Bachelor's`, `Master's`, and `Ph.D`. Because these values are labels rather than quantities, categorical variables cannot be summarized by the mean, median, or standard deviation. Instead, they are usually summarized with tables and bar plots. The `table()` command shows the number of observations in each category, and the `prop.table()` command shows the **prop**ortion of observations in each category (a value between 0 and 1, not a percentage). Note that `prop.table()` expects a table of counts as its input, so we need to apply `table()` first.

```{admonition} Syntax
`table(x)` & `prop.table(table(x))`
+ *Required arguments*
  - `x`: An atomic vector of categorical values.
```

In [2]:
table(employees$Division)


     Accounting       Corporate     Engineering Human Resources      Operations 
             63             103             236              97             287 
          Sales 
            214 

In [3]:
prop.table(table(employees$Division))


     Accounting       Corporate     Engineering Human Resources      Operations 
          0.063           0.103           0.236           0.097           0.287 
          Sales 
          0.214 
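Since `prop.table()` returns proportions, a short extra step is needed to report percentages. The sketch below shows one way to do this; the `division` vector is a small made-up stand-in for a column such as `employees$Division`, not the real data.

```r
# Hypothetical toy vector standing in for a column such as employees$Division
division <- c("Sales", "Sales", "Operations", "Accounting",
              "Operations", "Sales", "Operations", "Operations")

counts <- table(division)        # number of observations in each category
props  <- prop.table(counts)     # proportions, which sum to 1
pcts   <- round(100 * props, 1)  # percentages, rounded to one decimal place

counts
props
pcts
```

Multiplying by 100 after calling `prop.table()` keeps the counts, proportions, and percentages as separate objects, so each can be reused later.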

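Two categorical variables can be summarized in a two-way table by passing both variables to `table()`; the first variable defines the rows and the second defines the columns. The `prop.table()` command can then be applied to the result, just as above. For example: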

In [4]:
table(employees$Division, employees$Degree)

                 
                  Associate's Bachelor's High School Master's Ph.D
  Accounting                0         31           0       32    0
  Corporate                 0         20           0       40   43
  Engineering               0         36           0       43  157
  Human Resources          35         30           0       32    0
  Operations              110         16         146       15    0
  Sales                    55         67          54       38    0

The `prop.table()` command has an optional second argument, `margin`, that calculates the proportion of observations within each row (`margin = 1`) or within each column (`margin = 2`). The term `margin` refers to the "margins" (*i.e.*, the outer edges) of the table, where the row and column sums are often written. In the code chunk below we do not specify `margin` in `prop.table()`, so each cell represents the proportion of all observations in the data set. For example, 5.4% of all employees work in Sales and have a high school diploma.

In [5]:
prop.table(table(employees$Division, employees$Degree))

                 
                  Associate's Bachelor's High School Master's  Ph.D
  Accounting            0.000      0.031       0.000    0.032 0.000
  Corporate             0.000      0.020       0.000    0.040 0.043
  Engineering           0.000      0.036       0.000    0.043 0.157
  Human Resources       0.035      0.030       0.000    0.032 0.000
  Operations            0.110      0.016       0.146    0.015 0.000
  Sales                 0.055      0.067       0.054    0.038 0.000
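To make the idea of table "margins" concrete, the row and column totals can be appended to a table with the base R function `addmargins()`. The following is a minimal sketch on made-up data; the `division` and `degree` vectors are hypothetical and only mirror the column names in `employees`.

```r
# Hypothetical toy data; the values are made up for illustration
division <- c("Sales", "Sales", "Sales", "Operations", "Operations", "Accounting")
degree   <- c("Bachelor's", "Master's", "Bachelor's",
              "High School", "Bachelor's", "Master's")

two_way <- table(division, degree)  # rows: division, columns: degree

# Append a "Sum" row and a "Sum" column holding the column and row totals
addmargins(two_way)

# The same for overall proportions; the bottom-right "Sum" cell equals 1
addmargins(prop.table(two_way))
```

The totals in the `Sum` row and column are the quantities that `margin = 1` and `margin = 2` divide by.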

If we set `margin` equal to 1, each cell represents the proportion of observations within its row, so the values in each row sum to 1. For example, of all employees in Accounting, 49.2% have a Bachelor's degree.

In [6]:
prop.table(table(employees$Division, employees$Degree), margin = 1)

                 
                  Associate's Bachelor's High School   Master's       Ph.D
  Accounting       0.00000000 0.49206349  0.00000000 0.50793651 0.00000000
  Corporate        0.00000000 0.19417476  0.00000000 0.38834951 0.41747573
  Engineering      0.00000000 0.15254237  0.00000000 0.18220339 0.66525424
  Human Resources  0.36082474 0.30927835  0.00000000 0.32989691 0.00000000
  Operations       0.38327526 0.05574913  0.50871080 0.05226481 0.00000000
  Sales            0.25700935 0.31308411  0.25233645 0.17757009 0.00000000
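Row proportions are often easier to read after rounding, and a single cell can be pulled out by its row and column names. The sketch below assumes the same kind of made-up `division` and `degree` vectors as a stand-in for the `employees` columns.

```r
# Hypothetical toy data; the values are made up for illustration
division <- c("Sales", "Sales", "Sales", "Operations", "Operations", "Accounting")
degree   <- c("Bachelor's", "Master's", "Bachelor's",
              "High School", "Bachelor's", "Master's")

by_row <- prop.table(table(division, degree), margin = 1)

# Round for display only; by_row itself keeps full precision
round(by_row, 3)

# Share of Sales employees with a Bachelor's (2 of the 3 toy Sales rows)
by_row["Sales", "Bachelor's"]
```

Rounding at display time rather than when the table is created avoids carrying rounding error into later calculations.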

If we set `margin` equal to 2, each cell represents the proportion of observations within its column, so the values in each column sum to 1. For example, of all employees with an Associate's degree, 55.0% work in Operations.

In [7]:
prop.table(table(employees$Division, employees$Degree), margin = 2)

                 
                  Associate's Bachelor's High School Master's  Ph.D
  Accounting            0.000      0.155       0.000    0.160 0.000
  Corporate             0.000      0.100       0.000    0.200 0.215
  Engineering           0.000      0.180       0.000    0.215 0.785
  Human Resources       0.175      0.150       0.000    0.160 0.000
  Operations            0.550      0.080       0.730    0.075 0.000
  Sales                 0.275      0.335       0.270    0.190 0.000
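One way to check which direction a `margin` value normalizes over is to confirm that the corresponding sums equal 1. The sketch below does this on made-up data (the `division` and `degree` vectors are hypothetical) and also shows that `margin = 1` matches dividing each row by its row total.

```r
# Hypothetical toy data; the values are made up for illustration
division <- c("Sales", "Sales", "Sales", "Operations", "Operations", "Accounting")
degree   <- c("Bachelor's", "Master's", "Bachelor's",
              "High School", "Bachelor's", "Master's")

two_way <- table(division, degree)

by_row <- prop.table(two_way, margin = 1)  # proportions within each row
by_col <- prop.table(two_way, margin = 2)  # proportions within each column

rowSums(by_row)  # every row sums to 1
colSums(by_col)  # every column sums to 1

# margin = 1 is equivalent to dividing each row by its row total
manual_by_row <- two_way / rowSums(two_way)
all.equal(as.vector(by_row), as.vector(manual_by_row))
```

This check relies on every row and column having at least one observation; a row or column of all zeros would produce `NaN` values instead.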