# Categories

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [30]:
import numpy as np
import pandas as pd

In [65]:
df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": [5, 60, 80, 90, 30, 70]}
)

df

   id  raw_grade
0   1          5
1   2         60
2   3         80
3   4         90
4   5         30
5   6         70



In [66]:
# Define score ranges and corresponding grade labels
score_bins = [0, 30, 50, 70, 85, 100]
grade_labels = ["very bad", "bad", "medium", "good", "very good"]

# Create categories based on score ranges using pd.cut()
df["grade_manual"] = pd.cut(
    df["raw_grade"],
    bins=score_bins,
    labels=grade_labels,
    include_lowest=True,
)

print("Original scores:", df["raw_grade"].tolist())
print("Manual categories:", df["grade_manual"].tolist())
print("\nDataFrame with manual categories:")
df[["id", "raw_grade", "grade_manual"]]

Original scores: [5, 60, 80, 90, 30, 70]
Manual categories: ['very bad', 'medium', 'good', 'very good', 'very bad', 'medium']

DataFrame with manual categories:


   id  raw_grade grade_manual
0   1          5     very bad
1   2         60       medium
2   3         80         good
3   4         90    very good
4   5         30     very bad
5   6         70       medium
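
The bin edges decide which label a boundary score receives. `pd.cut` builds right-closed intervals by default, so 30 lands in the first bin and 70 in the third, and `include_lowest=True` makes the first interval also contain its left edge, 0. A minimal sketch of that behaviour; the `scores` Series is made up for illustration and sits exactly on or just past the edges:

```python
import pandas as pd

score_bins = [0, 30, 50, 70, 85, 100]
grade_labels = ["very bad", "bad", "medium", "good", "very good"]

# Hypothetical boundary scores, chosen to sit on or just past the bin edges
scores = pd.Series([0, 30, 31, 70, 71, 100])

# Without labels, pd.cut returns the intervals themselves
intervals = pd.cut(scores, bins=score_bins, include_lowest=True)
labels = pd.cut(scores, bins=score_bins, labels=grade_labels, include_lowest=True)

print(pd.DataFrame({"score": scores, "interval": intervals, "label": labels}))
```

Printing the intervals next to the labels is a quick way to check where each edge falls before trusting the labels.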


In [57]:
# Print each student's score next to its assigned category
print("Score → Category mapping:")
for student_id, score, category in zip(df["id"], df["raw_grade"], df["grade_manual"]):
    print(f"ID {student_id}: Score {score} → {category}")

Score → Category mapping:
ID 1: Score 5 → very bad
ID 2: Score 60 → medium
ID 3: Score 80 → good
ID 4: Score 90 → very good
ID 5: Score 30 → very bad
ID 6: Score 70 → medium


In [67]:
# Compare with the "grade" column created further below;
# "N/A" is printed while that column does not exist yet
print("Comparison:")
print("Score | Manual Method | Original Method")
print("------|---------------|----------------")
for i in range(len(df)):
    score = df.iloc[i]["raw_grade"]
    manual = df.iloc[i]["grade_manual"]
    original = df.iloc[i]["grade"] if "grade" in df.columns else "N/A"
    print(f"{score:5} | {manual:13} | {original}")

Comparison:
Score | Manual Method | Original Method
------|---------------|----------------
    5 | very bad      | N/A
   60 | medium        | N/A
   80 | good          | N/A
   90 | very good     | N/A
   30 | very bad      | N/A
   70 | medium        | N/A


---

Converting the raw grades to a categorical data type:


In [52]:
df["grade"] = df["raw_grade"].astype("category")

df["grade"]

0     5
1    60
2    80
3    90
4    30
5    70
Name: grade, dtype: category
Categories (6, int64): [5, 30, 60, 70, 80, 90]
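
Internally, a categorical column stores each distinct value once and refers to it by an integer code. A small sketch of that layout, using the same six scores; `.cat.categories` and `.cat.codes` are the accessors for inspecting it:

```python
import pandas as pd

raw_grade = pd.Series([5, 60, 80, 90, 30, 70])
grade = raw_grade.astype("category")

# The distinct values, sorted, form the categories
print(grade.cat.categories.tolist())  # [5, 30, 60, 70, 80, 90]

# Each element is stored as the position of its category
print(grade.cat.codes.tolist())  # [0, 2, 4, 5, 1, 3]
```

Because every score is distinct here, the conversion yields six categories, one per row.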

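Rename the categories to more meaningful names. rename_categories needs exactly one new name per existing category, so passing three names for the six numeric categories above raises a ValueError: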

In [54]:
new_categories = ["very good", "good", "very bad"]

df["grade"] = df["grade"].cat.rename_categories(new_categories)

ValueError: new categories need to have the same number of items as the old categories!
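
One way to avoid the mismatch is to compare the lengths first, or to pass a dict that maps old categories to new ones. The sketch below assumes the six numeric categories from above; the `score_<n>` labels are placeholders invented for illustration, not grades from this tutorial:

```python
import pandas as pd

grade = pd.Series([5, 60, 80, 90, 30, 70]).astype("category")
new_categories = ["very good", "good", "very bad"]

# A list must supply one name per existing category
print(len(grade.cat.categories), "categories vs", len(new_categories), "names")
try:
    grade.cat.rename_categories(new_categories)
except ValueError as exc:
    print("ValueError:", exc)

# A dict renames category by category (hypothetical placeholder labels)
mapping = {value: f"score_{value}" for value in grade.cat.categories}
renamed = grade.cat.rename_categories(mapping)
print(renamed.tolist())
```

Renaming is one-to-one: the new names must stay unique, so collapsing several scores into one label is a job for `pd.cut` rather than `rename_categories`.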

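With exactly three categories the rename succeeds. The remaining outputs in this section come from a run with the letter grades ["a", "b", "b", "a", "a", "e"] as raw_grade, whose categories a, b and e take the three new names in order:

In [49]:
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): ['very good', 'good', 'very bad']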

Reorder the categories and add the missing ones in the same call (methods under the Series.cat accessor return a new Series, so the result is assigned back to the column):

In [50]:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
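
`set_categories` can drop categories as well as add them. As a minimal sketch on a three-element Series made up for illustration: adding unused categories leaves the values untouched, while dropping a category that is in use turns those values into NaN.

```python
import pandas as pd

grade = pd.Series(["very good", "good", "very bad"], dtype="category")

# Adding unused categories leaves the values untouched
extended = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)

# Dropping a category that is in use replaces those values with NaN
reduced = grade.cat.set_categories(["good", "very good"])

print(extended.cat.categories.tolist())
print(reduced.tolist())  # ['very good', 'good', nan]

# The original Series is unchanged: the accessor returned new objects
print(grade.cat.categories.tolist())
```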

Sorting is per order in the categories, not lexical order:

In [36]:
df.sort_values(by="grade")

   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
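
To make the difference visible, the same labels can be sorted once as plain strings and once as a categorical. A minimal sketch; the variable names are illustrative:

```python
import pandas as pd

labels = ["very good", "good", "good", "very good", "very good", "very bad"]
order = ["very bad", "bad", "medium", "good", "very good"]

as_text = pd.Series(labels)
as_category = pd.Series(pd.Categorical(labels, categories=order))

# Plain strings sort lexically: "good" comes before "very bad"
print(as_text.sort_values().tolist())

# Categoricals sort by position in the category list: "very bad" comes first
print(as_category.sort_values().tolist())
```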


Grouping by a categorical column with observed=False also shows empty categories:

In [37]:
df.groupby("grade", observed=False).size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
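
For comparison, `observed=True` keeps only the categories that actually occur in the data. A short sketch that rebuilds the grade column directly with `pd.Categorical`, so it runs on its own:

```python
import pandas as pd

grade = pd.Categorical(
    ["very good", "good", "good", "very good", "very good", "very bad"],
    categories=["very bad", "bad", "medium", "good", "very good"],
)
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "grade": grade})

# observed=False: one row per category, including empty ones
all_categories = df.groupby("grade", observed=False).size()

# observed=True: only categories present in the data
seen_only = df.groupby("grade", observed=True).size()

print(all_categories.to_dict())
print(seen_only.to_dict())
```

Passing `observed` explicitly keeps the result independent of the library default.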