The categorical data type is useful in the following cases:

- A string variable that takes only a few distinct values. Converting such a variable to a categorical saves memory.
- A variable whose lexical order differs from its logical order (“one”, “two”, “three”). After converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
- As a signal to other Python libraries that the column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
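
The difference between lexical and logical order can be seen in a minimal sketch. The variable names and the sample values are made up for illustration; only the “one”, “two”, “three” categories come from the text above.

```python
import pandas as pd

# Plain strings sort lexically
words = pd.Series(["two", "three", "one", "two"])
lexical = words.sort_values().tolist()

# An ordered categorical sorts by the declared category order
number_type = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = words.astype(number_type)
logical = ordered.sort_values().tolist()

print(lexical)  # ['one', 'three', 'two', 'two']
print(logical)  # ['one', 'two', 'two', 'three']
print(ordered.min(), ordered.max())  # one three
```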

### Create categorical variable

In [125]:
import pandas as pd
import numpy as np

s = pd.Series(["a","b","c","a"], dtype="category")
s.cat.codes

0    0
1    1
2    2
3    0
dtype: int8

In [126]:
df = pd.DataFrame({"A":["a","b","c","a"]})
print(df)
df["B"] = df["A"].astype('category')
df

   A
0  a
1  b
2  c
3  a


   A  B
0  a  a
1  b  b
2  c  c
3  a  a


### Cut

Categoricals can also be created with special functions such as cut(), which groups numeric data into discrete bins.

In [128]:
df = pd.DataFrame({'number': np.random.randint(1, 100, 10)})
print (df)

# Bins are right-closed: (0, 20], (20, 40], ... so every value from 1 to 100 gets a label
df['bins'] = pd.cut(x=df['number'], bins=[0, 20, 40, 60, 80, 100],
                    labels=['1 - 20', '21 - 40', '41 - 60',
                            '61 - 80', '81 - 100'], right=True)

print(df)

# List the distinct bins that occur in the data
print(df['bins'].unique())

   number
0      63
1      79
2       8
3      64
4      54
5      56
6      54
7      67
8      89
9      30
   number      bins
0      63   61 - 80
1      79   61 - 80
2       8    1 - 20
3      64   61 - 80
4      54   41 - 60
5      56   41 - 60
6      54   41 - 60
7      67   61 - 80
8      89  81 - 100
9      30   21 - 40
['61 - 80', '1 - 20', '41 - 60', '81 - 100', '21 - 40']
Categories (5, object): ['1 - 20' < '21 - 40' < '41 - 60' < '61 - 80' < '81 - 100']
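
To get the frequency of each bin rather than the distinct bins, `value_counts()` can be used. The following is a small sketch that reuses the ten numbers from the run above; the names `numbers`, `binned` and `counts` are chosen here for illustration, and the bins start at 0 so that a value of 1 also lands in the first bin.

```python
import pandas as pd

# The ten values printed in the run above
numbers = pd.Series([63, 79, 8, 64, 54, 56, 54, 67, 89, 30])

binned = pd.cut(
    numbers,
    bins=[0, 20, 40, 60, 80, 100],
    labels=["1 - 20", "21 - 40", "41 - 60", "61 - 80", "81 - 100"],
)

# Count the rows per bin and list the bins in category order
counts = binned.value_counts().sort_index()
print(counts)
```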


In [142]:
# Sample data
data = {'Age': [0, 12, 19, 35, 50, 70, 85, 95, 100, 101]}

# Create a DataFrame
df = pd.DataFrame(data)

# Define custom age categories
bins = [0, 12, 19, 65, float('inf')]
labels = ['Child', 'Teenager', 'Adult', 'Senior']

# By default, include_lowest is False: the first bin is (0, 12], so Age 0 gets no category
df['Age_Category_exclusive'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=False, right=True)

# include_lowest=True makes the lower bound of the first bin inclusive, so Age 0 becomes 'Child'
df['Age_Category_inclusive'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)

# Display the DataFrame
print(df)

   Age Age_Category_exclusive Age_Category_inclusive
0    0                    NaN                  Child
1   12                  Child                  Child
2   19               Teenager               Teenager
3   35                  Adult                  Adult
4   50                  Adult                  Adult
5   70                 Senior                 Senior
6   85                 Senior                 Senior
7   95                 Senior                 Senior
8  100                 Senior                 Senior
9  101                 Senior                 Senior
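
Leaving out `labels` makes cut() return the intervals themselves, which shows which side of each bin is closed. This is a sketch with made-up sample ages and edges (`ages`, `edges` are illustrative names), comparing the default, `include_lowest=True` and `right=False`.

```python
import pandas as pd

ages = pd.Series([0, 12, 13, 19, 20])
edges = [0, 12, 19, 65]

# Default: right-closed bins (0, 12], (12, 19], (19, 65]; the value 0 falls outside
right_closed = pd.cut(ages, bins=edges)

# include_lowest=True additionally admits the lowest edge into the first bin
with_lowest = pd.cut(ages, bins=edges, include_lowest=True)

# right=False switches to left-closed bins [0, 12), [12, 19), [19, 65)
left_closed = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({
    "age": ages,
    "right_closed": right_closed,
    "with_lowest": with_lowest,
    "left_closed": left_closed,
}))
```

With right-closed bins the age 12 belongs to the first bin, while with left-closed bins it belongs to the second.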


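A pandas.Categorical object can also be passed to a Series or assigned to a DataFrame column. Values that are not listed in categories ("b" in this example) become NaN.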

In [None]:
cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["a", "c", "d"], ordered=False
)

s = pd.Series(cat)
print(s)

df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = cat
df.dtypes
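
Since the cell above has no recorded output, here is a small sketch of what happens to a value that is missing from `categories`: it is stored as NaN with the code -1. The variable names are illustrative.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"], categories=["a", "c", "d"], ordered=False)

# Each value is stored as the position of its category; -1 marks a missing value
codes = cat.codes.tolist()
missing = cat.isna().tolist()

print(codes)    # [0, -1, 1, 0]
print(missing)  # [False, True, False, False]
print(list(cat.categories))  # ['a', 'c', 'd']
```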



### CategoricalDtype

A custom categorical type can be used in a cast. Values that are not among its categories become NaN.

In [144]:
from pandas.api.types import CategoricalDtype

cat_type= CategoricalDtype(["a", "b", "c"], ordered=True)

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
print(df)
df_cat = df.astype(cat_type)
print(df_cat)

print(cat_type == "category")

   A  B
0  a  b
1  b  c
2  c  c
3  a  d
   A    B
0  a    b
1  b    c
2  c    c
3  a  NaN
True
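
The last line of the output shows that a CategoricalDtype compares equal to the string "category". The sketch below illustrates how two CategoricalDtype instances compare with each other; the four dtype names are made up for this example.

```python
from pandas.api.types import CategoricalDtype

unordered_abc = CategoricalDtype(["a", "b", "c"])
unordered_cba = CategoricalDtype(["c", "b", "a"])
ordered_abc = CategoricalDtype(["a", "b", "c"], ordered=True)
ordered_cba = CategoricalDtype(["c", "b", "a"], ordered=True)

# Unordered dtypes ignore the order in which the categories are listed
print(unordered_abc == unordered_cba)  # True

# Ordered dtypes must list the same categories in the same order
print(ordered_abc == ordered_cba)      # False

# An ordered and an unordered dtype are never equal
print(unordered_abc == ordered_abc)    # False

# Every CategoricalDtype equals the string "category"
print(ordered_abc == "category")       # True
```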


### Characteristics


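describe() summarizes each categorical column by its count, the number of unique values, the most frequent value and that value's frequency.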

In [145]:
df_cat.describe()

        A  B
count   4  3
unique  3  2
top     a  c
freq    2  2


### Properties

Categorical data has a categories and an ordered property, which list the possible values and whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

In [146]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
print(s.cat.codes)
print(s.cat.categories)
print(s.cat.ordered)

0    0
1    1
2    2
3    0
dtype: int8
Index(['a', 'b', 'c'], dtype='object')
False


### Rename

Renaming categories is done by using the rename_categories() method:

In [148]:
# create a dataframe
df = pd.DataFrame({
        "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
        "Gender": ["Male", "Female", "Male", "Female", "Male"]
})

# change to category dtype
df["Gender"] = df["Gender"].astype("category")
# display the dataframe
print(df)
# a list is matched by position to the existing categories, here ['Female', 'Male']
df["Gender"] = df["Gender"].cat.rename_categories(["F", "M"])
df


    Name  Gender
0    Tim    Male
1  Sarah  Female
2  Hasan    Male
3  Jyoti  Female
4   Jack    Male


    Name Gender
0    Tim      M
1  Sarah      F
2  Hasan      M
3  Jyoti      F
4   Jack      M
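
Passing a list relies on knowing the current order of the categories. A dict makes the mapping explicit, as in this sketch (the `gender` and `short` names and the shortened sample data are illustrative):

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Male"], dtype="category")

# Map each old category name to its new name, independent of category order
short = gender.cat.rename_categories({"Female": "F", "Male": "M"})

print(short.tolist())              # ['M', 'F', 'M']
print(list(short.cat.categories))  # ['F', 'M']
```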


### Appending new categories

Appending categories can be done by using the add_categories() method:

In [151]:
cat = pd.Categorical(
    ["a", "b", "c", "a", "e"], categories=["a", "c", "d"], ordered=False
)

s = pd.Series(cat)
print (s)

# add a new category; existing values, including the NaN entries, are unchanged
s = s.cat.add_categories("e")

# set the full list of categories (remove and add in one step)
s = s.cat.set_categories(["a", "c", "d", "e"])
print (s)



0      a
1    NaN
2      c
3      a
4    NaN
dtype: category
Categories (3, object): ['a', 'c', 'd']
0      a
1    NaN
2      c
3      a
4    NaN
dtype: category
Categories (4, object): ['a', 'c', 'd', 'e']
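
As the output shows, adding a category does not bring back values that were already turned into NaN. What it does allow is assigning the new value afterwards. The following sketch assumes a recent pandas version, where assigning an unknown category raises a TypeError; the variable names are made up.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "c", "a"], categories=["a", "c", "d"]))

# Assigning a value that is not a category is rejected
rejected = False
try:
    s.iloc[1] = "e"
except (TypeError, ValueError):
    rejected = True

# After the category has been added, the same assignment succeeds
s = s.cat.add_categories(["e"])
s.iloc[1] = "e"

# Drop the categories that no longer occur in the data ("c" and "d")
trimmed = s.cat.remove_unused_categories()

print(rejected)
print(s.tolist())
print(list(trimmed.cat.categories))
```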


### Sorting and order

If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and certain operations are possible.

In [119]:
s = pd.Series([1, 2, 3, 1], dtype="category")
s = s.cat.set_categories([2, 3, 1], ordered=True)
print(s.sort_values())

print ("min:" + str(s.min()))
print ("max:" + str(s.max()))
print ("mode:" + str(s.mode()))




1    2
2    3
0    1
3    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
min:2
max:1
mode:0    1
dtype: category
Categories (3, int64): [2 < 3 < 1]
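
Besides sorting and min/max, an ordered categorical also supports comparisons against one of its categories. This is a sketch with a hypothetical `sizes` series; the category names are not from the original notebook.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"], dtype="category")
sizes = sizes.cat.set_categories(["small", "medium", "large"], ordered=True)

# Comparisons follow the category order, not the alphabetical order
bigger_than_small = (sizes > "small").tolist()

print(bigger_than_small)            # [True, False, True, False]
print(sizes.sort_values().tolist()) # ['small', 'small', 'medium', 'large']
print(sizes.min(), sizes.max())     # small large
```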


### Operations

Series methods like `Series.value_counts()` use all categories, even if some categories are not present in the data.

In [152]:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
s.value_counts()

c    2
a    1
b    1
d    0
dtype: int64

With observed=False, groupby also shows “unused” categories:

In [153]:
cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)

df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
print (df)

df.groupby("cats", observed=False).sum()

  cats  values
0    a       1
1    b       2
2    b       2
3    b       2
4    c       3
5    c       4
6    c       5


      values
cats
a          1
b          6
c         12
d          0
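
The `observed` parameter controls whether unused categories appear in the result. A minimal sketch comparing both settings, with shortened sample data and illustrative variable names:

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3, 4]})

# observed=False keeps every category, including the unused "d"
all_categories = df.groupby("cats", observed=False)["values"].sum()

# observed=True keeps only the categories that occur in the data
seen_only = df.groupby("cats", observed=True)["values"].sum()

print(all_categories.to_dict())  # {'a': 1, 'b': 5, 'c': 4, 'd': 0}
print(seen_only.to_dict())       # {'a': 1, 'b': 5, 'c': 4}
```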


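Pivot tables can use a categorical column as an index level. In this run, the unused category "c" of column A does not appear in the result.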

In [154]:
raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
print(df)

pd.pivot_table(df, values="values", index=["A", "B"])

   A  B  values
0  a  c       1
1  a  d       2
2  b  c       3
3  b  d       4


     values
A B
a c       1
  d       2
b c       3
  d       4


### Missing data

In [155]:
s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

print (s)
print (s.cat.codes)
print(pd.isna(s))
s.fillna("a")


0      a
1      b
2    NaN
3      a
dtype: category
Categories (2, object): ['a', 'b']
0    0
1    1
2   -1
3    0
dtype: int8
0    False
1    False
2     True
3    False
dtype: bool


0    a
1    b
2    a
3    a
dtype: category
Categories (2, object): ['a', 'b']
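
fillna("a") works above because "a" is already a category. Filling with a value that is not a category is rejected (a TypeError in recent pandas versions, as far as this sketch assumes), so the category has to be added first. The "missing" label is a made-up placeholder.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# Filling with an unknown category fails
raised = False
try:
    s.fillna("missing")
except (TypeError, ValueError):
    raised = True

# Add the category first, then fill
filled = s.cat.add_categories(["missing"]).fillna("missing")

print(raised)
print(filled.tolist())
```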

### Resource consumption

In [156]:
s = pd.Series(['foo','bar']*1000)
print(s.nbytes)

s = s.astype('category')
print(s.nbytes)


16000
2016
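
For object-dtype strings, nbytes counts only the references to the strings, not the strings themselves. A sketch using `memory_usage(deep=True)`, which also accounts for the string contents, gives a fuller comparison. Exact byte counts depend on the pandas version and platform, so none are quoted here.

```python
import pandas as pd

plain = pd.Series(["foo", "bar"] * 1000)
as_category = plain.astype("category")

# deep=True includes the memory of the string objects; index=False ignores the index
plain_bytes = plain.memory_usage(index=False, deep=True)
category_bytes = as_category.memory_usage(index=False, deep=True)

print(plain_bytes, category_bytes)
```

The saving comes from storing one small integer code per row plus a single copy of each distinct string, so it shrinks as the number of distinct values approaches the number of rows.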
