### Example 4 - Categorical Data 

In [1]:
import pandas as pd


url = 'https://raw.githubusercontent.com/kefeimo/DataScienceBlog/master/3.category_dtype/df_example.csv'
df_original = pd.read_csv(url) 
df_original.head() 

| | event_id | timestamp | longitude | latitude | app_id | device_id | label_id | gender | brand_parse | model_parse |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2466991 | 2016-05-01 00:43:07 | 117.09 | 36.12 | 8165649363453695304 | 1438711534922792517 | 713 | M | OPPO | A33 |
| 1 | 370002 | 2016-05-04 08:11:03 | 0.0 | 0.0 | -755461362045697404 | -2449610688324901118 | 548 | F | Meizu | Charm Blue NOTE |
| 2 | 1608644 | 2016-05-02 13:56:37 | 116.28 | 40.1 | 8893877044209647765 | 4075941473982616348 | 206 | F | Huawei | Glory 6 Plus |
| 3 | 3008180 | 2016-05-03 19:02:56 | 0.0 | 0.0 | -1633887856876571208 | 1915112695298339924 | 779 | F | OPPO | R7s |
| 4 | 107379 | 2016-05-02 17:44:32 | 116.5 | 39.91 | 2229153468836897886 | 7353572136329657630 | 782 | F | Huawei | Mate 7 |


In [2]:
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   event_id     130821 non-null  int64  
 1   timestamp    130821 non-null  object 
 2   longitude    130821 non-null  float64
 3   latitude     130821 non-null  float64
 4   app_id       130821 non-null  int64  
 5   device_id    130821 non-null  int64  
 6   label_id     130821 non-null  int64  
 7   gender       130821 non-null  object 
 8   brand_parse  130821 non-null  object 
 9   model_parse  130821 non-null  object 
dtypes: float64(2), int64(4), object(4)
memory usage: 10.0+ MB


Let’s change the ‘gender’ feature dtype to ‘category’.

In [3]:
df_tmp = df_original.copy()
df_tmp.gender = df_tmp.gender.astype('category')

In [4]:
df_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130821 entries, 0 to 130820
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype   
---  ------       --------------   -----   
 0   event_id     130821 non-null  int64   
 1   timestamp    130821 non-null  object  
 2   longitude    130821 non-null  float64 
 3   latitude     130821 non-null  float64 
 4   app_id       130821 non-null  int64   
 5   device_id    130821 non-null  int64   
 6   label_id     130821 non-null  int64   
 7   gender       130821 non-null  category
 8   brand_parse  130821 non-null  object  
 9   model_parse  130821 non-null  object  
dtypes: category(1), float64(2), int64(4), object(3)
memory usage: 9.1+ MB
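The ‘+’ in the memory usage line means the figure is a lower bound: for ‘object’ columns, pandas counts only the pointers unless asked to inspect the strings themselves. To see the saving on the converted column more directly, one option is `memory_usage(deep=True)`. The snippet below is a minimal sketch on a synthetic stand-in for the ‘gender’ column (an assumption, not the dataset above), so the exact byte counts will differ from the real data.

```python
import pandas as pd

# Synthetic stand-in for the 'gender' column (assumption: not the real dataset).
s_object = pd.Series(['M', 'F'] * 50_000, dtype='object', name='gender')
s_category = s_object.astype('category')

# deep=True also counts the Python string objects, not only the pointers to them.
bytes_object = s_object.memory_usage(deep=True)
bytes_category = s_category.memory_usage(deep=True)

print(f'object:   {bytes_object:,} bytes')
print(f'category: {bytes_category:,} bytes')
```

On the full DataFrame, `df_tmp.info(memory_usage='deep')` reports the same kind of deep measurement for all columns at once.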


‘groupby’ is a handy operation but not the best for categorical data analysis. 

To demonstrate the performance (i.e., time efficiency) of the ‘category’ dtype, we time the same ‘groupby’ on both versions of the column. As shown below, with the ‘category’ dtype the execution is much faster: about 1.4 ms versus 7.03 ms, i.e., roughly one fifth of the time taken with the ‘object’ data type.

In [7]:
%timeit df_original.groupby('gender').latitude.mean()

df_tmp = df_original.copy()
df_tmp.gender = df_tmp.gender.astype('category')

%timeit df_tmp.groupby('gender').latitude.mean()

7.03 ms ± 84.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.4 ms ± 18 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
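`%timeit` only works inside IPython or Jupyter. Outside a notebook, a comparable check can be sketched with the standard `timeit` module. The data here is randomly generated (an assumption, chosen only to match the row count above), and the timings will vary by machine and pandas version, so no expected numbers are given. The sketch also confirms that both dtypes produce the same group means.

```python
import timeit

import numpy as np
import pandas as pd

# Randomly generated stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 130_821
df_obj = pd.DataFrame({
    'gender': pd.Series(rng.choice(['M', 'F'], size=n)).astype(object),
    'latitude': rng.uniform(0.0, 45.0, size=n),
})

df_cat = df_obj.copy()
df_cat['gender'] = df_cat['gender'].astype('category')

# Same aggregation on both frames; observed=True keeps only categories that occur.
mean_obj = df_obj.groupby('gender')['latitude'].mean()
mean_cat = df_cat.groupby('gender', observed=True)['latitude'].mean()

# Time 20 repetitions of each groupby.
t_obj = timeit.timeit(lambda: df_obj.groupby('gender')['latitude'].mean(), number=20)
t_cat = timeit.timeit(
    lambda: df_cat.groupby('gender', observed=True)['latitude'].mean(), number=20
)

print(f'object:   {t_obj / 20 * 1e3:.2f} ms per call')
print(f'category: {t_cat / 20 * 1e3:.2f} ms per call')
```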


All values of categorical data are either in categories or np.nan. 
Order is defined by the order of categories, not lexical order of the values. 
Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
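
These three properties can be checked directly through the `.cat` accessor. The following is a small sketch on toy values (the ‘small’/‘medium’/‘large’ series is a made-up illustration, not part of the dataset): missing values get the code -1, and an ordered categorical sorts by category order instead of alphabetically.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value.
s = pd.Series(['M', 'F', np.nan, 'M'], dtype='category')

categories = list(s.cat.categories)  # the categories array
codes = s.cat.codes.tolist()         # integer codes; -1 marks np.nan

print(categories)
print(codes)

# Order follows the categories, not the lexical order of the values.
size_dtype = pd.CategoricalDtype(['small', 'medium', 'large'], ordered=True)
sizes = pd.Series(['large', 'small', 'medium']).astype(size_dtype)
sorted_sizes = sizes.sort_values().tolist()

print(sorted_sizes)
```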

![Categories](cat.png)

- First, we create a lookup table (HASH TABLE) that maps the unique values ‘M’ and ‘F’ (a.k.a. keys) to small integers such as ‘0’ and ‘1’. Pandas calls these integers codes, and the exact assignment depends on the order of the categories.
- Then we ENCODE the original dataset: each ‘M’ and ‘F’ is replaced with its integer code. Only the encoded dataset and the small categories array are stored in memory.
- When displaying the data, Pandas does not show the encoded dataset; it maps the codes back to the original ‘M’ and ‘F’ values (INV_ENCODE). In other words, the new dataset with the ‘category’ dtype looks identical to the original one, but the two are stored differently.
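
The three steps above can be sketched by hand with a dictionary and NumPy. This is only an illustration of the idea, not pandas’ actual implementation; the names `lookup`, `codes`, and `decoded` are hypothetical. One detail worth noting: when pandas infers the categories itself, it sorts them, so in this sketch ‘F’ gets code 0 and ‘M’ gets code 1.

```python
import numpy as np
import pandas as pd

values = np.array(['M', 'F', 'F', 'M', 'F'], dtype=object)

# 1. Lookup table: unique value -> integer code (sorted, as pandas does when inferring).
categories = np.array(sorted(set(values)), dtype=object)
lookup = {value: code for code, value in enumerate(categories)}

# 2. ENCODE: replace every value with its small integer code.
codes = np.array([lookup[v] for v in values], dtype=np.int8)

# 3. INV_ENCODE: index the categories array with the codes to recover the values.
decoded = categories[codes]

# For comparison, let pandas build the same structure.
cat = pd.Categorical(values)

print(lookup)
print(codes.tolist())
print(decoded.tolist())
```

Storing one `int8` code per row plus a tiny categories array, instead of one Python string per row, is what makes the ‘category’ column smaller and faster to group.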