### (lighter | faster | safer | flexible | fool-proof)

# How to Encode Categorical data: Unleash the power of 'category' dtypes

### A hands-on tutorial on using the 'category' data type in Python

Lately, I was working on a former Kaggle competition dataset: TalkingData, predicting customer demographics. The original dataset includes 8 data frames, 5 of which I think are relevant; all are in 'csv' format, with a total size of 2G. But when I tried to merge the data frames into a single tabular data frame, BOOM, 2G exploded to 18G. 😟

Luckily, there are easy fixes. After some simple tricks, I was able to shrink the dataset from 18G to 5G without losing any information or changing the data structure. The way I did it was by using the 'category' data type.

Later, I found out that the 'category' data type not only makes the dataset lighter, but also helps to improve the performance of data operations and machine learning.

So, today I will talk about how to use the 'category' dtype in Python and why you should consider using it.

# Things to cover

- [ ]  show starter: lighter and faster
- [ ]  explain how the 'category' dtype works
- [ ]  why it matters to machine learning
- [ ]  takeaways: when you should use 'category'?

# Is this article for you?

The goal of this tutorial is not to cover the very rich topic of categorical encoding. (I doubt anybody can do that in just one blog post.) Instead, the focus is on Python and its Pandas library. But if you are not a Python user or do not work closely with Pandas, don't worry. I believe this post can still shed some light.

As always, I will explain in a hands-on style. Without further ado, let's get our hands dirty and jump into the dataset. The link to download the example data is in the code below. (We will not work on the whole dataset, only a portion of it.)

In [204]:
# load the data

In [205]:
import pandas as pd

In [233]:
url = 'https://raw.githubusercontent.com/kefeimo/DataScienceBlog/master/3.category_dtype/df_example.csv'
df_original = pd.read_csv(url)
df_original.head()

Unnamed: 0,event_id,timestamp,longitude,latitude,app_id,device_id,label_id,gender,brand_parse,model_parse
0,2466991,2016-05-01 00:43:07,117.09,36.12,8165649363453695304,1438711534922792517,713,M,OPPO,A33
1,370002,2016-05-04 08:11:03,0.0,0.0,-755461362045697404,-2449610688324901118,548,F,Meizu,Charm Blue NOTE
2,1608644,2016-05-02 13:56:37,116.28,40.1,8893877044209647765,4075941473982616348,206,F,Huawei,Glory 6 Plus
3,3008180,2016-05-03 19:02:56,0.0,0.0,-1633887856876571208,1915112695298339924,779,F,OPPO,R7s
4,107379,2016-05-02 17:44:32,116.5,39.91,2229153468836897886,7353572136329657630,782,F,Huawei,Mate 7


In [207]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   event_id     130821 non-null  int64  
 1   timestamp    130821 non-null  object 
 2   longitude    130821 non-null  float64
 3   latitude     130821 non-null  float64
 4   app_id       130821 non-null  int64  
 5   device_id    130821 non-null  int64  
 6   label_id     130821 non-null  int64  
 7   gender       130821 non-null  object 
 8   brand_parse  130821 non-null  object 
 9   model_parse  130821 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 10.0+ MB


From the dataframe info, there are 130,821 rows and 10 columns: 2 float, 4 int, and 4 object columns.

The first tip (and the most important one) is to change 'id'-type/label data into the categorical type, which is basically everything except 'timestamp', 'longitude', and 'latitude'.

Note: we need to use some common sense to decide which features should be 'category' type, and not just rely on their default data type. E.g. 'timestamp' has object type, but it should not be 'category' type (duh), while 'app_id', 'device_id', and 'label_id' are 'int64' type by default but are actually categorical.

Here is how we do it, starting with 'event_id':

# show starter: lighter and faster

In [208]:
df_tmp = df_original.copy()
df_tmp.event_id = df_tmp.event_id.astype('category')
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   event_id     130821 non-null  category
 1   timestamp    130821 non-null  object  
 2   longitude    130821 non-null  float64 
 3   latitude     130821 non-null  float64 
 4   app_id       130821 non-null  int64   
 5   device_id    130821 non-null  int64   
 6   label_id     130821 non-null  int64   
 7   gender       130821 non-null  object  
 8   brand_parse  130821 non-null  object  
 9   model_parse  130821 non-null  object  
dtypes: category(1), float64(2), int64(3), object(4)
memory usage: 15.3+ MB


## showcase 1: lighter
Let's convert 'gender' to the 'category' type.
As the output below shows, memory usage drops from 10.0 MB to 9.1 MB, a saving of around 10%.

In [209]:
# encode the 'gender'
df_tmp = df_original.copy()
df_tmp.gender = df_tmp.gender.astype('category')
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   event_id     130821 non-null  int64   
 1   timestamp    130821 non-null  object  
 2   longitude    130821 non-null  float64 
 3   latitude     130821 non-null  float64 
 4   app_id       130821 non-null  int64   
 5   device_id    130821 non-null  int64   
 6   label_id     130821 non-null  int64   
 7   gender       130821 non-null  category
 8   brand_parse  130821 non-null  object  
 9   model_parse  130821 non-null  object  
dtypes: category(1), float64(2), int64(4), object(3)
memory usage: 9.1+ MB
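The "+" in "10.0+ MB" means that info() did not count the Python strings behind the object columns, so the real saving is likely larger than it looks. Here is a minimal sketch on made-up data (the `gender` Series below is a synthetic stand-in, not the example dataset) that compares the shallow and deep numbers:

```python
import pandas as pd

# Synthetic stand-in for the 'gender' column: two labels repeated many times.
gender = pd.Series(["M", "F"] * 50_000, dtype=object)
gender_cat = gender.astype("category")

# Shallow usage counts only the 8-byte pointers of an object column.
shallow = gender.memory_usage(index=False)
# deep=True also counts the Python string objects those pointers refer to.
deep = gender.memory_usage(index=False, deep=True)
deep_cat = gender_cat.memory_usage(index=False, deep=True)

print(f"object, shallow : {shallow:>9,} bytes")
print(f"object, deep    : {deep:>9,} bytes")
print(f"category, deep  : {deep_cat:>9,} bytes")
```

The exact deep numbers depend on the Python and pandas versions, so I do not quote them here; the ordering (category smallest) is what matters.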


## showcase 2: faster

In [210]:
%timeit df_original.groupby('gender').latitude.mean()

df_tmp = df_original.copy()
df_tmp.gender = df_tmp.gender.astype('category')

%timeit df_tmp.groupby('gender').latitude.mean()

17 ms ± 268 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [211]:
# grouping by 'device_id' (note: only 'gender' is converted below, so 'device_id' stays int64 in both runs)
%timeit df_original.groupby('device_id').latitude.mean()

df_tmp = df_original.copy()
df_tmp.gender = df_tmp.gender.astype('category')

%timeit df_tmp.groupby('device_id').latitude.mean()

19.4 ms ± 727 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
18.9 ms ± 383 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
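If you want to repeat the speed comparison outside a notebook, here is a small sketch using the standard `timeit` module on synthetic data. The frame, its size, and the function names are my own assumptions for illustration; the timings will differ from machine to machine, so no numbers are quoted.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200_000
df = pd.DataFrame({
    "gender": pd.Series(rng.choice(["M", "F"], size=n), dtype=object),
    "latitude": rng.uniform(0, 50, size=n),
})
df_cat = df.assign(gender=df["gender"].astype("category"))


def mean_object():
    # group by the object (string) column
    return df.groupby("gender")["latitude"].mean()


def mean_category():
    # group by the category column; observed=True keeps only categories that occur
    return df_cat.groupby("gender", observed=True)["latitude"].mean()


runs = 20
t_obj = timeit.timeit(mean_object, number=runs)
t_cat = timeit.timeit(mean_category, number=runs)
print(f"object  : {t_obj / runs * 1e3:.2f} ms per call")
print(f"category: {t_cat / runs * 1e3:.2f} ms per call")
```

Both functions return the same group means, so the only thing being compared is the time spent grouping.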


## two more examples:

In [212]:
# encode 'event_id'
df_tmp = df_original.copy()
df_tmp.event_id = df_tmp.event_id.astype('category')
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   event_id     130821 non-null  category
 1   timestamp    130821 non-null  object  
 2   longitude    130821 non-null  float64 
 3   latitude     130821 non-null  float64 
 4   app_id       130821 non-null  int64   
 5   device_id    130821 non-null  int64   
 6   label_id     130821 non-null  int64   
 7   gender       130821 non-null  object  
 8   brand_parse  130821 non-null  object  
 9   model_parse  130821 non-null  object  
dtypes: category(1), float64(2), int64(3), object(4)
memory usage: 15.3+ MB


In [213]:
# encode several
df_tmp = df_original.copy()
col_cate_list =  [
#     'event_id', 
    'device_id', 
    'app_id', 'label_id', 'gender', 'brand_parse', 'model_parse'
                 ]
# assign whole columns (not via .loc) so that the new dtype replaces the old one
df_tmp[col_cate_list] = df_tmp[col_cate_list].astype('category')
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   event_id     130821 non-null  int64   
 1   timestamp    130821 non-null  object  
 2   longitude    130821 non-null  float64 
 3   latitude     130821 non-null  float64 
 4   app_id       130821 non-null  category
 5   device_id    130821 non-null  category
 6   label_id     130821 non-null  category
 7   gender       130821 non-null  category
 8   brand_parse  130821 non-null  category
 9   model_parse  130821 non-null  category
dtypes: category(6), float64(2), int64(1), object(1)
memory usage: 6.1+ MB


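Converting 'event_id' made the data frame heavier (15.3 MB instead of 10.0 MB), while converting 'gender' made it lighter. You might have guessed why: the number of unique event_id values is large, while gender has only two unique values (M and F). Could that be the reason?

The last cell above follows this hunch and converts the features with relatively few unique values into the 'category' type, namely 'device_id', 'app_id', 'label_id', 'gender', 'brand_parse', and 'model_parse'.

Aha, this time we saved nearly 40% of the memory usage (10.0 MB down to 6.1 MB). That's a lot.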
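As a side note, the conversion does not have to happen after loading. Here is a minimal sketch, assuming a tiny made-up CSV (the values below are invented, only the column names mirror the example data), that asks `pd.read_csv` for 'category' columns directly through its `dtype` argument:

```python
import io

import pandas as pd

# Tiny made-up CSV with the same kind of columns as the example data.
csv_text = """device_id,gender,brand_parse,latitude
101,M,OPPO,36.12
102,F,Meizu,0.0
101,F,Huawei,40.1
103,F,OPPO,39.91
"""

cat_cols = ["device_id", "gender", "brand_parse"]
df_small = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: "category" for col in cat_cols},  # convert while parsing
)
print(df_small.dtypes)
```

One thing to keep in mind: as far as I know, categories created this way are parsed as strings, so 'device_id' would hold '101' rather than the integer 101.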

- But why does using the 'category' data type sometimes save memory, and sometimes not? What is going on here?

- Is it also possible to pick other combinations and save even more memory? (Remember, I made a heavy 18G dataset lose weight down to 5G, saving more than 70% of the memory. How did I do that?)

- Is there any other way to do that?

OK, hold your horses. Let's address these questions one by one:



# explain how the 'category' dtype works

It's time to talk about data types.

Note: Python has its own built-in data types, but NumPy is almost the default library of Python and pandas uses NumPy as its backend, so in this post 'data type' means the NumPy data type.

Table of dtypes: https://pbpython.com/pandas_dtypes.html

- But why does using the 'category' data type sometimes save memory, and sometimes not? What is going on here?

First of all, 'category' is a pandas data type. The pandas documentation says:

'Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood type, country affiliation, observation time or rating via Likert scales.'

But the definition is not important; what we care about most is why using the 'category' data type can save space.

The mechanism behind it is a lookup table (what I will loosely call a 'hash table'): each distinct value is stored once, and every row stores only a small integer code that points to it. Before we talk about the hash table, here is a brief review of the data type table and the memory consumed in bytes...


Let's take the 'gender' feature for instance. There are 130,821 items (rows). In an object column, each item takes up 8 bytes (a pointer to the Python object; the string itself is not counted), so the total is 130821 * 8 = 1,046,568 bytes, which matches the value reported by the 'nbytes' attribute. Similarly, 'brand_parse' and 'app_id' both take up 130821 * 8 = 1,046,568 bytes. (Note: each int64 also consumes 8 bytes.)

Next, if we change the data type from 'object' to 'category', the calculation is a little more complicated. I use this formula:
bytes_hashed * num_of_row + (0 + bytes_object) * n_unique,
where bytes_hashed = hash_func(n_unique)
- bytes_hashed is the size of the smallest int type that can cover the n_unique values. E.g. there are 2 unique values in gender; an int8 dtype can cover them, and an int8 takes up 1 byte, so bytes_hashed = 1. Another example: there are 2660 unique values in app_id; an int16 dtype can cover them, and an int16 takes up 2 bytes, so bytes_hashed = 2.
- bytes_object is the 8 bytes taken by each original class name in the hash table, since the class name is stored as an object type, which takes up 8 bytes. (That's the reason why, when the number of classes is close to the number of rows, using the category type consumes even more memory.)
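The formula can be sketched as a small function. The helper names `code_bytes` and `estimate_category_nbytes` are mine (hypothetical, not a pandas API), and the exact cut-off at which pandas switches from int8 to int16 codes is an implementation detail, so I only test values far away from the boundaries. The demo uses synthetic int64 ids, like 'app_id'.

```python
import numpy as np
import pandas as pd


def code_bytes(n_unique):
    """Bytes per row for the integer codes (smallest signed int that fits)."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        if n_unique < np.iinfo(dtype).max:
            return np.dtype(dtype).itemsize
    raise ValueError("too many categories")


def estimate_category_nbytes(n_rows, n_unique, bytes_per_category=8):
    """bytes_hashed * num_of_row + bytes_object * n_unique"""
    return code_bytes(n_unique) * n_rows + bytes_per_category * n_unique


n_rows = 100_000
for n_unique in (2, 1_000):
    # synthetic id column with exactly n_unique distinct int64 values
    ids = pd.Series(np.arange(n_rows, dtype=np.int64) % n_unique)
    actual = ids.astype("category").nbytes
    estimate = estimate_category_nbytes(n_rows, n_unique)
    print(n_unique, ids.nbytes, estimate, actual)
```

Plugging in the numbers from this post, `estimate_category_nbytes(130821, 2)` gives 130837 and `estimate_category_nbytes(130821, 2660)` gives 282922, the same values as the nbytes outputs shown in this section.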


Reference: numpy.ndarray.nbytes (https://numpy.org/doc/stable/reference/generated/numpy.ndarray.nbytes.html#numpy.ndarray.nbytes), which pandas.Series.nbytes follows, returns the 'Total bytes consumed by the elements of the array'.


In [214]:
df_original[['gender', 'brand_parse', 'app_id']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   gender       130821 non-null  object
 1   brand_parse  130821 non-null  object
 2   app_id       130821 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 3.0+ MB


In [215]:
# gender original (object)
print(8 * 130821)
print(df_original.gender.nbytes)

1046568
1046568


In [216]:
# gender encode (category)
print(1*130821 + (0+8)*2)
print(df_original.gender.astype('category').nbytes)

130837
130837


In [217]:
# more example

# brand_parse
print(df_original.brand_parse.nunique())
print(1* 130821+ (0 + 8) *76)
print(df_original.brand_parse.astype('category').nbytes)
print()

# app_id
print(df_original.app_id.nunique())
print(2* 130821+ (0 + 8) * 2660)
print(df_original.app_id.astype('category').nbytes)

76
131429
131429

2660
282922
282922


### what about ('actual') memory_usage
The actual memory_usage can be a bit different from nbytes. We can check the memory with .memory_usage(). The formula to calculate memory_usage can be very tedious and is out of the scope of this post, but if you are interested, check the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

Generally speaking, using nbytes to estimate memory_usage is sufficient.

In [218]:
# what about memory_usage
print(df_original.gender.memory_usage()- df_original.gender.nbytes)
print(df_original.brand_parse.memory_usage()- df_original.brand_parse.nbytes)
print(df_original.app_id.memory_usage()- df_original.app_id.nbytes)

# print(df_original.gender.memory_usage(index=False)- df_original.gender.nbytes)
# print(df_original.brand_parse.memory_usage(index=False)- df_original.brand_parse.nbytes)
# print(df_original.app_id.memory_usage(index=False)- df_original.app_id.nbytes)

print(df_original.gender.astype('category').memory_usage() - df_original.gender.astype('category').nbytes)
print(df_original.brand_parse.astype('category').memory_usage() - df_original.brand_parse.astype('category').nbytes)
print(df_original.app_id.astype('category').memory_usage() - df_original.app_id.astype('category').nbytes)

# print(df_original.gender.astype('category').memory_usage(index=False) - df_original.gender.astype('category').nbytes)
# print(df_original.brand_parse.astype('category').memory_usage(index=False) - df_original.brand_parse.astype('category').nbytes)
# print(df_original.app_id.astype('category').memory_usage(index=False) - df_original.app_id.astype('category').nbytes)

128
128
128
208
2688
82048
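Where does the gap between memory_usage and nbytes come from? My understanding (an assumption based on how pandas behaves, not something the documentation spells out in this form) is that for a plain column the gap is just the row index, while a category column may also count some bookkeeping for its categories. A minimal sketch on synthetic ids:

```python
import numpy as np
import pandas as pd

ids = pd.Series(np.arange(100_000, dtype=np.int64) % 1_000)
ids_cat = ids.astype("category")

# For a plain int64 column, the gap between memory_usage() and nbytes
# is the memory taken by the row index.
index_bytes = ids.index.memory_usage()
print(ids.memory_usage() - ids.nbytes, index_bytes)

# For a category column, memory_usage(index=False) may count extra
# bookkeeping of the categories on top of nbytes.
print(ids_cat.memory_usage(index=False) - ids_cat.nbytes)
```

The exact byte counts depend on the Python and pandas versions, which is why no expected output is shown.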


- Is it also possible to pick other combinations and save even more memory? (Remember, I made a heavy 18G dataset lose weight down to 5G, saving more than 70% of the memory. How did I do that?)

- Is there any other way to do that? Why use 'category'?

In [219]:
for col in df_original.columns:
    print('{:11}'.format(col), ': ', '{:6} ({:.2f})'.format(df_original[col].nunique(), df_original[col].nunique()/len(df_original)))

event_id    :  107900 (0.82)
timestamp   :   97356 (0.74)
longitude   :    2071 (0.02)
latitude    :    2057 (0.02)
app_id      :    2660 (0.02)
device_id   :   13986 (0.11)
label_id    :     353 (0.00)
gender      :       2 (0.00)
brand_parse :      76 (0.00)
model_parse :     761 (0.01)
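The unique-value ratios suggest a simple rule of thumb for picking columns. Below is a sketch with a hypothetical helper, `pick_category_columns`, and a hand-picked threshold of 0.5. The threshold is my assumption: with 4-byte codes and 8-byte originals, the nbytes formula breaks even when about half of the values are unique. The demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def pick_category_columns(df, max_unique_ratio=0.5, skip=()):
    """Return columns whose share of unique values is below the threshold."""
    picked = []
    for col in df.columns:
        if col in skip:
            continue  # e.g. true numeric features or timestamps
        ratio = df[col].nunique() / len(df)
        if ratio < max_unique_ratio:
            picked.append(col)
    return picked


n = 10_000
demo = pd.DataFrame({
    "event_id": np.arange(n, dtype=np.int64),          # all unique
    "device_id": np.arange(n, dtype=np.int64) % 500,   # 500 unique ids
    "gender": pd.Series(["M", "F"] * (n // 2), dtype=object),
    "latitude": np.linspace(0.0, 50.0, n),             # numeric, skipped by hand
})

cols = pick_category_columns(demo, skip=("latitude",))
converted = demo.astype({col: "category" for col in cols})
print(cols)
print(demo.memory_usage(index=False).sum(), converted.memory_usage(index=False).sum())
```

The `skip` argument is there because common sense still has to decide which columns are categorical at all; the ratio only tells you whether the conversion is likely to pay off.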


In [220]:
# other options (caution: astype('bool') turns every non-empty string into True, so only the size is comparable)
print(df_original.gender.nbytes)
print(df_original.gender.astype('category').nbytes)
print(df_original.gender.astype('bool').nbytes)

1046568
130837
130821


In [221]:
# this is wrong but we got no warning
df_tmp = df_original.copy()
df_tmp.brand_parse = df_tmp.brand_parse.astype('bool')
df_tmp.brand_parse

0         True
1         True
2         True
3         True
4         True
          ... 
130816    True
130817    True
130818    True
130819    True
130820    True
Name: brand_parse, Length: 130821, dtype: bool
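To see why the bool cast is wrong, here is a minimal sketch on a four-row made-up column. It also shows what I would assume to be a safe alternative for a truly two-valued feature: an explicit comparison.

```python
import pandas as pd

gender = pd.Series(["M", "F", "F", "M"], dtype=object)

# astype('bool') only checks whether each string is non-empty,
# so both 'M' and 'F' become True and the information is lost.
as_bool = gender.astype("bool")
print(as_bool.tolist())   # [True, True, True, True]

# An explicit comparison keeps the information in a 1-byte-per-row column.
is_male = gender == "M"
print(is_male.tolist())   # [True, False, False, True]
print(is_male.nbytes)     # 4
```

This only works for two classes; for anything like 'brand_parse' with 76 classes, 'category' is the option that keeps the labels.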

# Why do category dtypes matter to machine learning?

If a machine learning package cannot directly handle categorical variables, there are two conventional ways to encode categorical data: label encoding and one-hot encoding.

(Note: theoretically, a machine learning model like random forest has no problem taking categorical data. But the random forest model in Sklearn cannot handle categorical data.)

## label encoder

Take brand_parse for example: we can use a label encoder to map Huawei, Xiaomi, ... to 0, 1, 2, 3, ... (int type).

The problem with this way of encoding is that the machine learning model might misinterpret the encoded labels as ordinal (meaning ordered), while the original categorical data is nominal, i.e. it doesn't make much sense to say Huawei is less than Xiaomi, Xiaomi is less than Meizu, etc. Thus a tree algorithm could split the data improperly.

In [222]:
from sklearn.preprocessing import LabelEncoder

brand_parse_label_encode = LabelEncoder().fit_transform(df_original.brand_parse)
pd.concat([df_original.brand_parse, pd.DataFrame({'brand_encode': brand_parse_label_encode})], axis=1)

Unnamed: 0,brand_parse,brand_encode
0,OPPO,49
1,Meizu,39
2,Huawei,28
3,OPPO,49
4,Huawei,28
...,...,...
130816,OPPO,49
130817,Huawei,28
130818,Huawei,28
130819,vivo,74


In [223]:
pd.concat([df_original.brand_parse, pd.DataFrame({'brand_encode': brand_parse_label_encode})], axis=1).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   brand_parse   130821 non-null  object
 1   brand_encode  130821 non-null  int32 
dtypes: int32(1), object(1)
memory usage: 1.5+ MB
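For comparison, a 'category' column already carries this kind of integer encoding inside it. A minimal sketch on four made-up rows, using the `.cat.codes` and `.cat.categories` accessors:

```python
import pandas as pd

brands = pd.Series(["OPPO", "Meizu", "Huawei", "OPPO"], dtype="category")

# Categories are the sorted unique labels; codes are their positions.
print(list(brands.cat.categories))  # ['Huawei', 'Meizu', 'OPPO']
print(brands.cat.codes.tolist())    # [2, 1, 0, 2]

# The mapping back to the original labels stays attached to the column.
decoded = brands.cat.categories.take(brands.cat.codes.to_numpy())
print(decoded.tolist())             # ['OPPO', 'Meizu', 'Huawei', 'OPPO']
```

Since both approaches sort the labels, I would expect the codes to match what LabelEncoder produces for the same data, but I have not checked that on the full dataset.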


## one hot encoding

Conceptually, with one-hot encoding the data will look like this:

In [224]:
from sklearn.preprocessing import OneHotEncoder

# note: scikit-learn >= 1.2 renames the `sparse` parameter to `sparse_output`
ohe = OneHotEncoder(sparse=False)
brand_parse_one_hot_encode = ohe.fit_transform(X=df_original.brand_parse.values.reshape(-1, 1))
pd.concat([df_original.brand_parse, pd.DataFrame(brand_parse_one_hot_encode).astype('int8')], axis=1)

Unnamed: 0,brand_parse,0,1,2,3,4,5,6,7,8,...,66,67,68,69,70,71,72,73,74,75
0,OPPO,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Meizu,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Huawei,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,OPPO,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Huawei,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
130816,OPPO,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
130817,Huawei,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
130818,Huawei,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
130819,vivo,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [225]:
pd.concat([df_original.brand_parse, pd.DataFrame(brand_parse_one_hot_encode).astype('int8')], axis=1).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 77 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   brand_parse  130821 non-null  object
 1   0            130821 non-null  int8  
 2   1            130821 non-null  int8  
 3   2            130821 non-null  int8  
 4   3            130821 non-null  int8  
 5   4            130821 non-null  int8  
 6   5            130821 non-null  int8  
 7   6            130821 non-null  int8  
 8   7            130821 non-null  int8  
 9   8            130821 non-null  int8  
 10  9            130821 non-null  int8  
 11  10           130821 non-null  int8  
 12  11           130821 non-null  int8  
 13  12           130821 non-null  int8  
 14  13           130821 non-null  int8  
 15  14           130821 non-null  int8  
 16  15           130821 non-null  int8  
 17  16           130821 non-null  int8  
 18  17           130821 non-null  int8  
 19  18

The first issue is memory usage. You should not use a dense matrix, otherwise you are very likely to experience a memory blow-up. E.g. after one-hot encoding, brand_parse turns into 76 columns; even with the most efficient data type (int8 or boolean), it still takes a lot of memory. Imagine if there were more than 76 sub-classes.

There are two potential solutions.

**Potential solution 1**: use sparse matrix

- advantage: save space
- disadvantage: hard for a human to interpret

In [232]:
# sparse (note: scikit-learn >= 1.2 renames the `sparse` parameter to `sparse_output`)
brand_parse_one_hot_encode = OneHotEncoder(sparse=True).fit_transform(X=df_original.brand_parse.values.reshape(-1, 1))

print(OneHotEncoder(sparse=False).fit_transform(X=df_original.brand_parse.values.reshape(-1, 1)).data.nbytes)
print(brand_parse_one_hot_encode.data.nbytes)
brand_parse_one_hot_encode

79539168
1046568


<130821x76 sparse matrix of type '<class 'numpy.float64'>'
	with 130821 stored elements in Compressed Sparse Row format>
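The two numbers above can be checked with plain arithmetic. The sketch below also adds the index arrays that a CSR matrix keeps next to its values, which `.data.nbytes` does not include; the 4-byte size of those indices is my assumption (32-bit indices).

```python
n_rows, n_categories = 130_821, 76   # brand_parse in the example data
float64_bytes, int32_bytes = 8, 4

# Dense one-hot matrix: one float64 cell per (row, category) pair.
dense_bytes = n_rows * n_categories * float64_bytes

# CSR sparse matrix: one stored value per row, plus index arrays.
# Assumption: 32-bit column indices and row pointers.
csr_data_bytes = n_rows * float64_bytes
csr_index_bytes = n_rows * int32_bytes + (n_rows + 1) * int32_bytes

print(dense_bytes)                       # 79539168
print(csr_data_bytes)                    # 1046568
print(csr_data_bytes + csr_index_bytes)  # 2093140
```

Even with the index arrays included, the sparse form stays far below the dense one.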

In [227]:
# sparse matrix is hard to interpret
print(brand_parse_one_hot_encode)

  (0, 49)	1.0
  (1, 39)	1.0
  (2, 28)	1.0
  (3, 49)	1.0
  (4, 28)	1.0
  (5, 59)	1.0
  (6, 28)	1.0
  (7, 67)	1.0
  (8, 35)	1.0
  (9, 72)	1.0
  (10, 28)	1.0
  (11, 28)	1.0
  (12, 59)	1.0
  (13, 28)	1.0
  (14, 67)	1.0
  (15, 67)	1.0
  (16, 28)	1.0
  (17, 67)	1.0
  (18, 49)	1.0
  (19, 39)	1.0
  (20, 35)	1.0
  (21, 35)	1.0
  (22, 74)	1.0
  (23, 39)	1.0
  (24, 59)	1.0
  :	:
  (130796, 74)	1.0
  (130797, 28)	1.0
  (130798, 67)	1.0
  (130799, 59)	1.0
  (130800, 74)	1.0
  (130801, 32)	1.0
  (130802, 28)	1.0
  (130803, 28)	1.0
  (130804, 28)	1.0
  (130805, 39)	1.0
  (130806, 28)	1.0
  (130807, 28)	1.0
  (130808, 28)	1.0
  (130809, 70)	1.0
  (130810, 28)	1.0
  (130811, 13)	1.0
  (130812, 32)	1.0
  (130813, 62)	1.0
  (130814, 28)	1.0
  (130815, 28)	1.0
  (130816, 49)	1.0
  (130817, 28)	1.0
  (130818, 28)	1.0
  (130819, 74)	1.0
  (130820, 28)	1.0


**Potential solution 2**: group or dispose of sub-classes

- advantage: save space
- disadvantage: hard to decide which sub-classes to group or dispose of; risk of losing information

Additionally, it is often said that tree algorithms do not favor one-hot-encoded data, since it is likely to grow sparse trees. (ref: [https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769))

# Tips: pickle is better

In [228]:
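%%timeit
# df_sample: the sample DataFrame behind df_example.csv (assumed to hold the same data as df_original)
df_sample.to_csv('./df_example.csv', index=False)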

1.97 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [229]:
%%timeit
df_sample.to_pickle('./df_example_pkl.pkl')

207 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
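Speed is not the only difference. A CSV file stores plain text, so the 'category' dtype has to be re-applied after loading, while a pickle keeps it. A minimal sketch with a made-up three-row frame and temporary files (file names are arbitrary):

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({"gender": ["M", "F", "F"], "latitude": [36.12, 0.0, 40.1]})
df_demo["gender"] = df_demo["gender"].astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    pkl_path = os.path.join(tmp_dir, "demo.pkl")
    df_demo.to_csv(csv_path, index=False)
    df_demo.to_pickle(pkl_path)
    # read both files back before the temporary directory is removed
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv["gender"].dtype)  # no longer 'category'
print(from_pkl["gender"].dtype)  # category
```

One caveat worth keeping in mind: pickle files should only be loaded from sources you trust, and they are tied to Python, so CSV is still the safer choice for sharing data.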


# Takeaways from this article

- How to use 'category'? astype('category').
- Why should you use 'category'? To save memory (except in the gotcha situations, when the number of unique values is close to the number of rows), speed up data operations (most likely), and improve machine learning performance (no proof, but there are good reasons).
- When to use it? Almost always; at least give it a try. Even though it might sometimes take more space, in the long run it speeds up operations and integrates seamlessly into machine learning models with categorical feature support.
- But 'category' is not a silver bullet. Trial-and-error and iteration are your friends.
- Stop to_csv, start to_pickle.