The categorical data type is useful in the following cases:

- A string variable consists of only a few different values. Converting such a variable to a categorical variable saves memory.
- The lexical order of a variable differs from its logical order (“one”, “two”, “three”). After converting to a categorical and specifying an order on the categories, sorting and min/max use the logical order instead of the lexical order.
- The column should signal to other Python libraries that it is to be treated as a categorical variable (e.g. so that they use suitable statistical methods or plot types).
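
The second case can be illustrated with a minimal sketch. The variable names (`words`, `number_type`) are made up for this example; it only assumes that the logical order is "one" < "two" < "three".

```python
import pandas as pd

words = pd.Series(["three", "one", "two", "one"])

# Plain strings sort alphabetically (lexical order)
lexical = words.sort_values().tolist()

# An ordered categorical sorts by the position of each category (logical order)
number_type = pd.CategoricalDtype(["one", "two", "three"], ordered=True)
ordered = words.astype(number_type)
logical = ordered.sort_values().tolist()

print(lexical)
print(logical)
print(ordered.min(), ordered.max())
```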

### Create categorical variable

In [None]:
import pandas as pd
import numpy as np

s = pd.Series(["a","b","c","a"], dtype="category")
s

In [None]:
df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = df["A"].astype('category')
df

### Cut

Categorical data can also be created with special functions such as cut(), which groups data into discrete bins.

In [None]:
df = pd.DataFrame({'number': np.random.randint(1, 100, 10)})
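# Bins are right-closed: (0, 20], (20, 40], ...; starting at 0 keeps the value 1 in the first bin
df['bins'] = pd.cut(x=df['number'], bins=[0, 20, 40, 60, 80, 100],
                    labels=['1 - 20', '21 - 40', '41 - 60',
                            '61 - 80', '81 - 100'])

print(df)

# Check the frequency of each bin (empty bins are listed with a count of 0)
print(df['bins'].value_counts())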

In [None]:
# Sample data
data = {'Age': [0, 12, 19, 35, 50, 70, 85, 95, 100, 101]}

# Create a DataFrame
df = pd.DataFrame(data)

# Define custom age categories
bins = [0, 12, 19, 65, 100]
labels = ['Child', 'Teenager', 'Adult', 'Senior']

# right=False: bins are left-closed, [0, 12), [12, 19), [19, 65), [65, 100),
# so the upper edge is exclusive and an age of 100 falls outside every bin (NaN)
df['Age_Category_exclusive'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=False)

# right=True: bins are right-closed, (0, 12], (12, 19], (19, 65], (65, 100];
# include_lowest=True (default: False) also makes the lowest edge, 0, inclusive
df['Age_Category_inclusive'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True, right=True)

# Display the DataFrame
print(df)
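
To see which interval each bin actually covers, it can help to call cut() without labels, so that the categories are the intervals themselves. The following is a small sketch with made-up values (`ages`, `edges`), not part of the original notebook.

```python
import pandas as pd

ages = pd.Series([0, 12, 19, 100])
edges = [0, 12, 19, 100]

# Default right=True: intervals are closed on the right, so 0 is not in any bin
right_closed = pd.cut(ages, bins=edges)

# right=False: intervals are closed on the left, so 100 is not in any bin
left_closed = pd.cut(ages, bins=edges, right=False)

print(right_closed.cat.categories)
print(left_closed.cat.categories)
print(right_closed.isna().tolist())
print(left_closed.isna().tolist())
```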

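Categorical data can also be created by passing a pandas.Categorical object to a Series or by assigning it to a DataFrame column. Values that are not listed in categories (here "b") become NaN.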

In [None]:
cat = pd.Categorical(
    ["a", "b", "c", "a"], categories=["a", "c", "d"], ordered=False
)

s = pd.Series(cat)
print(s)

df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = cat
df.dtypes
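
The effect of values that are missing from the category list can be checked through the integer codes. This is a minimal sketch reusing the values from the cell above.

```python
import pandas as pd

cat = pd.Categorical(["a", "b", "c", "a"], categories=["a", "c", "d"])
s = pd.Series(cat)

# "b" is not a declared category, so it is stored as missing
print(s.isna().tolist())

# Codes are positions in the category list; -1 marks a missing value
print(s.cat.codes.tolist())
```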



### CategoricalDtype

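A custom categorical type (CategoricalDtype) can be used as the target of a cast with astype(). Values outside the listed categories become NaN.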

In [None]:
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(["a", "b", "c"], ordered=True)

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
df_cat = df.astype(cat_type)
print(df_cat)

print(cat_type == "category")
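
How two CategoricalDtype instances compare depends on their categories and on the ordered flag. The sketch below uses made-up names (`unordered_abc`, `unordered_bca`, `ordered_abc`) to show the comparison rules.

```python
from pandas.api.types import CategoricalDtype

unordered_abc = CategoricalDtype(["a", "b", "c"], ordered=False)
unordered_bca = CategoricalDtype(["b", "c", "a"], ordered=False)
ordered_abc = CategoricalDtype(["a", "b", "c"], ordered=True)

# For unordered types the order of the categories is ignored
print(unordered_abc == unordered_bca)

# Same categories but a different ordered flag: not equal
print(unordered_abc == ordered_abc)

# Every CategoricalDtype compares equal to the string "category"
print(ordered_abc == "category")
```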

### Characteristics


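describe() on categorical data returns the count, the number of unique categories, the most frequent value (top) and its frequency (freq).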

In [None]:
df_cat.describe()
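
The same summary is available for a single categorical Series. Here is a minimal sketch on a small made-up Series.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# For categorical data describe() reports count, unique, top and freq
summary = s.describe()
print(summary)
```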

### Properties

Categorical data has a categories property and an ordered property, which list the possible values and state whether the ordering matters. These properties are exposed as s.cat.categories and s.cat.ordered. If you don’t specify categories and ordering manually, they are inferred from the passed arguments.

In [None]:
s = pd.Series(["a", "b", "c", "a"], dtype="category")
print(s.cat.categories)
print(s.cat.ordered)
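
Note that the declared categories are not the same thing as the values that actually occur. The sketch below, with made-up data, contrasts s.cat.categories with unique().

```python
import pandas as pd

s = pd.Series(pd.Categorical(["c", "a"], categories=["a", "b", "c"]))

# All declared categories, whether or not they occur in the data
declared = list(s.cat.categories)

# Only the values that occur, in order of appearance
observed = list(s.unique())

print(declared)
print(observed)
```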

### Rename

Renaming categories is done by using the rename_categories() method:

In [None]:
# create a dataframe
df = pd.DataFrame({
        "Name": ["Tim", "Sarah", "Hasan", "Jyoti", "Jack"],
        "Gender": ["Male", "Female", "Male", "Female", "Male"]
})

# change to category dtype
df["Gender"] = df["Gender"].astype("category")
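# display the dataframe
print(df)
print(df["Gender"])
# map old names to new ones explicitly; a plain list would be applied to the
# sorted categories (["Female", "Male"]) and would swap the labels here
df["Gender"] = df["Gender"].cat.rename_categories({"Male": "M", "Female": "F"})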
df
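
rename_categories() accepts either a list or a dict-like mapping. A list is applied by position to the current categories, which are sorted for inferred string categories, so a mapping is often the safer choice. This is a minimal sketch of the difference on made-up data.

```python
import pandas as pd

s = pd.Series(["Male", "Female", "Male"], dtype="category")

# Inferred categories are sorted: ["Female", "Male"]
print(list(s.cat.categories))

# A list renames by position: "Female" -> "M", "Male" -> "F" (labels swapped)
by_position = s.cat.rename_categories(["M", "F"]).tolist()

# A dict renames by name, independent of the category order
by_name = s.cat.rename_categories({"Male": "M", "Female": "F"}).tolist()

print(by_position)
print(by_name)
```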


### Appending new categories

Appending categories can be done by using the add_categories() method:

In [None]:
cat = pd.Categorical(
    ["a", "b", "c", "a", "e"], categories=["a", "c", "d"], ordered=False
)

s = pd.Series(cat)
print (s)

# add a new category (values that were already NaN are not restored)
s = s.cat.add_categories("e")

# set the full list of categories (removes and adds categories in one step)
s = s.cat.set_categories(["a", "c", "d", "e"])
print (s)
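
Categories can be removed as well as added. The following sketch, on made-up data, shows add_categories(), remove_unused_categories() and set_categories() side by side.

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "c", "a"], categories=["a", "c", "d"]))

# Append a category without touching the data
extended = s.cat.add_categories(["e"])

# Drop every category that does not occur in the data
trimmed = extended.cat.remove_unused_categories()

# Replace the whole category list; values no longer listed become NaN
replaced = s.cat.set_categories(["a", "e"])

print(list(extended.cat.categories))
print(list(trimmed.cat.categories))
print(replaced.isna().tolist())
```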


### Sorting and order

If categorical data is ordered (s.cat.ordered == True), the order of the categories has a meaning and operations such as sorting by category order, min(), max() and comparisons are possible.

In [None]:
s = pd.Series([1, 2, 3, 1], dtype="category")
s = s.cat.set_categories([2, 3, 1], ordered=True)
print(s)
s.sort_values()
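
With the category order 2 < 3 < 1 used above, min(), max() and comparisons follow that order rather than the numeric one. A minimal sketch, building the same Series through a CategoricalDtype:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype([2, 3, 1], ordered=True)
s = pd.Series([1, 2, 3, 1]).astype(order)

# Sorting follows the category order, not the numeric order
sorted_values = s.sort_values().tolist()

# min/max return the first/last category that occurs in the data
smallest, largest = s.min(), s.max()

# Comparison with a scalar that is one of the categories
after_two = (s > 2).tolist()

print(sorted_values, smallest, largest, after_two)
```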



### Operations

Series methods like `Series.value_counts()` use all categories, even those that do not occur in the data; such categories are reported with a count of 0.

In [None]:
s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
s.value_counts()

groupby() also shows “unused” categories when it is called with observed=False:

In [None]:
cats = pd.Categorical(
    ["a", "b", "b", "b", "c", "c", "c"], categories=["a", "b", "c", "d"]
)

df = pd.DataFrame({"cats": cats, "values": [1, 2, 2, 2, 3, 4, 5]})
print (df)

df.groupby("cats", observed=False).sum()
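
The observed parameter controls whether unused categories appear in the result. The sketch below compares both settings on a small made-up frame.

```python
import pandas as pd

cats = pd.Categorical(["a", "b", "b", "c"], categories=["a", "b", "c", "d"])
df = pd.DataFrame({"cats": cats, "values": [1, 2, 3, 4]})

# observed=False keeps every declared category; the empty group "d" sums to 0
all_categories = df.groupby("cats", observed=False)["values"].sum()

# observed=True keeps only the categories that occur in the data
observed_only = df.groupby("cats", observed=True)["values"].sum()

print(all_categories)
print(observed_only)
```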

pivot_table() also accepts categorical columns as index:

In [None]:
raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])

df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
print(df)

pd.pivot_table(df, values="values", index=["A", "B"])

### Missing data

In [None]:
s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

print (s)
print (s.cat.codes)
print(pd.isna(s))
s.fillna("a")
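
Missing values are not a category of their own: they do not appear in s.cat.categories and are stored with the code -1. To fill them with a value that is not yet a category, one option is to add that category first. In this sketch the label "missing" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# NaN is not listed among the categories and has the code -1
print(list(s.cat.categories))
print(s.cat.codes.tolist())

# Add the fill value as a category before filling
filled = s.cat.add_categories("missing").fillna("missing")
print(filled.tolist())
```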


### Resource consumption

In [101]:
s = pd.Series(['foo','bar']*1000)
print(s.nbytes)

s = s.astype('category')
print(s.nbytes)


16000
2016
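
For object columns, nbytes counts only the 8-byte references to the strings, not the strings themselves. memory_usage(deep=True) also includes the string objects and usually gives a more realistic comparison. This is a sketch on the same data; exact byte counts depend on the pandas and Python versions, so none are shown.

```python
import pandas as pd

# dtype=object is set explicitly so the comparison is against Python strings
s = pd.Series(["foo", "bar"] * 1000, dtype=object)

object_bytes = s.memory_usage(deep=True)
categorical_bytes = s.astype("category").memory_usage(deep=True)

print(object_bytes)
print(categorical_bytes)
```

The saving comes from storing each distinct string once and keeping small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.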
