# Categoricals

## Overview

Many datasets contain non-numeric (text) data, which raises two issues.

* It wastes memory to store many repeated text values.
* When using regression and some other types of analysis, numeric-only data is required.

The first issue can be solved by using the datatype __category__ (such values are then called *categoricals*); the second by __dummy variables__ (AKA 'one-hot' encoding). 

__*Note*__: in pandas, text columns are typically stored with the "object" dtype.
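The dummy-variable approach mentioned above can be sketched with `pd.get_dummies`. This is a minimal illustration on a hypothetical toy frame (the column names and values are assumptions, not the Chicago dataset).

```python
import pandas as pd

# Hypothetical toy data, not taken from the Chicago dataset.
df = pd.DataFrame(
    {
        "department": ["POLICE", "AVIATION", "POLICE", "WATER MGMNT"],
        "salary": [75372.0, 89440.0, 80000.0, 85512.0],
    }
)

# Replace the text column with one 0/1 indicator column per distinct value.
dummies = pd.get_dummies(df, columns=["department"], dtype=int)
print(dummies)
```

Each distinct department becomes its own numeric column, which is the form most regression routines expect.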

 

**Categoricals** are a data type ("category") that saves memory when a column has _low cardinality_ (contains many repeated text values). For instance, if a "gender" column contains the strings "MALE" and "FEMALE", it takes extra memory to store all the strings, when there are only two distinct values.

Using categoricals allows you to view, query, and otherwise use the text values, but behind the scenes, pandas replaces the values with integers that index a lookup table of the actual values.

The larger the dataset, the bigger the savings.
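As a rough illustration of the lookup table described above, the sketch below builds a hypothetical two-value column (the data is made up for the example) and inspects the categories and integer codes through the `.cat` accessor.

```python
import pandas as pd

# Hypothetical toy column with only two distinct strings, repeated many times.
gender = pd.Series(
    ["MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"] * 10_000, dtype="object"
)
gender_cat = gender.astype("category")

# The lookup table of distinct values, and the integer codes that index it.
print(gender_cat.cat.categories.tolist())    # ['FEMALE', 'MALE']
print(gender_cat.cat.codes.head().tolist())  # [1, 0, 0, 1, 0]

# Compare deep memory usage (in bytes) of the two representations.
print(gender.memory_usage(deep=True), gender_cat.memory_usage(deep=True))
```

Queries still use the text labels, for example `gender_cat == "MALE"`, even though only small integers are stored per row.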

## Categorical examples

We'll use a dataset from the city of Chicago which has 4 columns -- name, position, department, and salary. The position and department columns have low cardinality, and are thus good candidates for categoricals. 

To get started, we just need to import __pandas__.

In [1]:
import pandas as pd

### Dataset without categoricals

First, let's read in the dataset.

In [2]:
# columns:
# Name,Position Title,Department,Employee Annual Salary
df_orig = pd.read_csv(
    "../DATA/city-of-chicago-salaries.csv",
)
df_orig.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$80916.00
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$99648.00
4,"ABBATACOLA, ROBERT J",ELECTRICAL MECHANIC,AVIATION,$89440.00


We can use the __.info()__ method to see how much memory is being used by the entire dataframe -- a little over 8.5MB. 

In [3]:
df_orig.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32054 entries, 0 to 32053
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Name                    32054 non-null  object
 1   Position Title          32054 non-null  object
 2   Department              32054 non-null  object
 3   Employee Annual Salary  32054 non-null  object
dtypes: object(4)
memory usage: 8.6 MB


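We can use the __.memory_usage()__ method to find out how much memory each column is using. By default it reports shallow sizes: for an object column, that is only the 8-byte pointer stored for each row (32054 × 8 = 256432 bytes), not the strings themselves. Pass `deep=True` to count the string contents as well, as `.info(memory_usage="deep")` does.

In [4]:
df_orig.memory_usage()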

Index                        132
Name                      256432
Position Title            256432
Department                256432
Employee Annual Salary    256432
dtype: int64
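The identical 256432-byte figures above are shallow sizes (one pointer per row). Note that the first positional parameter of `memory_usage()` is `index`, so `deep` has to be passed by keyword. A minimal sketch on a hypothetical toy frame shows the difference:

```python
import pandas as pd

# Hypothetical toy frame; dtype="object" is set explicitly to mirror the dataset above.
df = pd.DataFrame(
    {"dept": pd.Series(["POLICE", "AVIATION", "WATER MGMNT"], dtype="object")}
)

shallow = df.memory_usage()        # object column: pointer storage only
deep = df.memory_usage(deep=True)  # also counts the Python string objects

print(shallow["dept"])
print(deep["dept"])
```

The deep figure is the one that reflects what the text actually costs in memory.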

### Dataset with categoricals

First, we'll analyze the data with the __.describe()__ method. That will show us the cardinality of the object (text) fields. 

By default, describe() only shows numeric fields, so we'll tell it to show objects with the `include="O"` argument.

We see that __Position Title__ has only 1098 unique values out of 32K entries, and __Department__ has even lower cardinality, with only 35 unique values. These two columns are prime candidates for the __category__ datatype. 

In [5]:
df_orig.describe(include="O")

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
count,32054,32054,32054,32054
unique,31795,1098,35,1097
top,"HERNANDEZ, JUAN C",POLICE OFFICER,POLICE,$78012.00
freq,4,9432,13623,2750
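Another way to spot candidates is to compute the fraction of distinct values per column. The sketch below uses a hypothetical toy frame; where to draw the line between "low" and "high" cardinality is a judgment call, not a pandas rule.

```python
import pandas as pd

# Hypothetical toy frame standing in for a larger dataset.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D", "E", "F"],
        "department": ["POLICE", "POLICE", "AVIATION", "POLICE", "AVIATION", "POLICE"],
    }
)

# Fraction of distinct values per column: values near 0 suggest a good
# candidate for "category"; values near 1 suggest little or no benefit.
ratio = df.nunique() / len(df)
print(ratio.sort_values())
```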


Now we'll reload the data, but specify "category" as the data type for our two low-cardinality columns. The data still looks and acts the same. 

In [6]:
df_cat = pd.read_csv(
    "../DATA/city-of-chicago-salaries.csv",
    dtype={"Position Title": "category", "Department": "category"},
)
df_cat.head()

Unnamed: 0,Name,Position Title,Department,Employee Annual Salary
0,"AARON, ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
1,"AARON, JEFFERY M",POLICE OFFICER,POLICE,$75372.00
2,"AARON, KIMBERLEI R",CHIEF CONTRACT EXPEDITER,GENERAL SERVICES,$80916.00
3,"ABAD JR, VICENTE M",CIVIL ENGINEER IV,WATER MGMNT,$99648.00
4,"ABBATACOLA, ROBERT J",ELECTRICAL MECHANIC,AVIATION,$89440.00
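To illustrate that a categorical column still behaves like text, here is a small self-contained sketch. It reads a hypothetical inline CSV (made up so the example runs without the data file) and filters and counts by label.

```python
import io

import pandas as pd

# Hypothetical inline CSV so the example runs without the data file.
csv_text = """Name,Department
"AARON, ELVIA J",WATER MGMNT
"AARON, JEFFERY M",POLICE
"ABAD JR, VICENTE M",WATER MGMNT
"""

df = pd.read_csv(io.StringIO(csv_text), dtype={"Department": "category"})

# Filtering and counting use the text labels, just as with object columns.
print(df[df["Department"] == "WATER MGMNT"]["Name"].tolist())
print(df["Department"].value_counts().to_dict())
```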


The new dataframe uses about half the memory of the original. 

In [7]:
df_cat.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32054 entries, 0 to 32053
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   Name                    32054 non-null  object  
 1   Position Title          32054 non-null  category
 2   Department              32054 non-null  category
 3   Employee Annual Salary  32054 non-null  object  
dtypes: category(2), object(2)
memory usage: 4.5 MB


In [8]:
# Shallow sizes; pass deep=True (by keyword) to include string contents.
df_cat.memory_usage()

Index                        132
Name                      256432
Position Title            105956
Department                 33406
Employee Annual Salary    256432
dtype: int64

### Dataset with categoricals and splitting name
Just for fun, we'll split the "Name" column into separate "First Name" and "Last Name" columns, because there is much lower cardinality if we consider them separately. This might result in some more memory savings. 

In [9]:
df_split = pd.read_csv(
    "../DATA/city-of-chicago-salaries.csv",
    dtype={"Position Title": "category", "Department": "category"},
)
# Names are stored as "LAST, FIRST", so split on the first comma only.
# Recent pandas requires n as a keyword; expand=True returns two columns.
df_split[["Last Name", "First Name"]] = df_split["Name"].str.split(",", n=1, expand=True)
df_split = df_split.drop(columns="Name")
# Strip the space left after the comma, then convert to categoricals.
df_split["First Name"] = df_split["First Name"].str.strip().astype("category")
df_split["Last Name"] = df_split["Last Name"].str.strip().astype("category")

df_split.head()

It turns out that this uses slightly *more* memory than *not* splitting the name fields. 

In [None]:
df_split.info(memory_usage="deep")

The __Employee Annual Salary__ field is read in as strings because of the dollar signs. It has only 1097 unique values, so it could also be stored as a category, but any analysis of that field would be numerical, so we should convert the values to floats instead.

You can pass a dictionary to the __converters__ argument. The keys are column names; the values are functions that accept the original value and return the converted value.

In [None]:
df_convert = pd.read_csv(
    "../DATA/city-of-chicago-salaries.csv",
    dtype={"Position Title": "category", "Department": "category"},
    # Drop the leading "$" and parse the rest as a float.
    converters={"Employee Annual Salary": lambda s: float(s[1:])},
)
df_convert.head()
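A named function can be easier to test than a lambda. The following sketch assumes a hypothetical helper `parse_salary` and a made-up inline CSV, and also tolerates thousands separators, which the dataset above does not appear to use.

```python
import io

import pandas as pd


def parse_salary(text):
    """Convert a string such as "$85512.00" to a float (hypothetical helper)."""
    return float(text.replace("$", "").replace(",", ""))


# Hypothetical inline CSV so the example runs without the data file.
csv_text = "Name,Employee Annual Salary\nA,$85512.00\nB,$75372.00\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"Employee Annual Salary": parse_salary},
)
print(df["Employee Annual Salary"].dtype)
print(df["Employee Annual Salary"].mean())
```

Once the column is numeric, aggregations such as `.mean()` work directly.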

This also saves another couple of megabytes, since floats take up less space than objects. Our total reduction of memory usage is close to 75%. 

In [None]:
df_convert.info(memory_usage="deep")

### Conclusion
Using categoricals can lead to big memory savings if the dataset contains low-cardinality text fields. 