---
title: "Categories (Levels) in a DataFrame"
description: "A categorical variable takes one of a fixed, usually small, set of possible values, called categories (or levels). Examples include sex, blood type, and political affiliation. Pandas handles such values with its dedicated category dtype."
tags: Pandas, Data Cleaning / Preprocessing
URL: https://github.com/ageron/handson-ml
Licence: Apache License 2.0
Creator: 
Meta: ""

---

 <div>
    	<img src="./coco.png" style="float: left;height: 55px">
    	<div style="height: 150px;text-align: center; padding-top:5px">
        <h1>
      	Categories (Levels) in a DataFrame
        </h1>
        <p>A categorical variable takes one of a fixed, usually small, set of possible values, called categories (or levels). Examples include sex, blood type, and political affiliation. Pandas handles such values with its dedicated category dtype.</p>
    	</div>
		</div> 

 <div style="height:40px">
		<div style="width:100%; text-align:center; border-bottom: 1px solid #000; line-height:0.1em; margin:40px 0 20px;">
    	<span style="background:#fff; padding:0 10px; font-size:25px; font-family: 'Open Sans', sans-serif;">
        Key Code
    	</span>
		</div>
		</div>
			

In [None]:
import pandas as pd

In [None]:
# make a new column from codes (ex: levels are 0, 1, 2)
df['new_col'] = df['levels'].astype('category')
df['new_col'].cat.categories # => Index of the distinct categories/levels

In [None]:
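# rename the categories for readability
# (assigning to .cat.categories was removed in pandas 2.0; use rename_categories)
df['new_col'] = df['new_col'].cat.rename_categories(['name_1', 'name_2', 'name_3'])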

 <div style="height:40px">
		<div style="width:100%; text-align:center; border-bottom: 1px solid #000; line-height:0.1em; margin:40px 0 20px;">
    	<span style="background:#fff; padding:0 10px; font-size:25px; font-family: 'Open Sans', sans-serif;">
        Example
    	</span>
		</div>
		</div>
			

## Example DataFrame with codes/levels

In [36]:
city_eco = pd.DataFrame(
    [
        [808976, "San Francisco", "California", 0],
        [8363710, "New York", "New-York", 0],
        [413201, "Miami", "Florida", 3],
        [2242193, "Houston", "Texas", 7],
    ], columns=["population", "city", "state", "eco_code"])

city_eco

| | population | city | state | eco_code |
|---|---|---|---|---|
| 0 | 808976 | San Francisco | California | 0 |
| 1 | 8363710 | New York | New-York | 0 |
| 2 | 413201 | Miami | Florida | 3 |
| 3 | 2242193 | Houston | Texas | 7 |


Right now the `eco_code` column is full of apparently meaningless codes. Let's fix that. 

## Create a new categorical column based on the `eco_code`s

In [38]:
city_eco["economy"] = city_eco["eco_code"].astype('category')
city_eco["economy"].cat.categories

Int64Index([0, 3, 7], dtype='int64')
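
A categorical column stores each distinct level once and keeps a small integer code per row that points into that list of levels. The following is a minimal sketch of that relationship. It uses a standalone Series with the same `eco_code` values as the example rather than the `city_eco` DataFrame itself, and the variable names are illustrative:

```python
import pandas as pd

# Same eco_code values as the example DataFrame (standalone Series for illustration)
eco_code = pd.Series([0, 0, 3, 7], name="eco_code")
economy = eco_code.astype("category")

# The distinct levels, in sorted order
levels = economy.cat.categories.tolist()

# For each row, the position of its value within `levels`
codes = economy.cat.codes.tolist()

print(levels)
print(codes)
```

Note that the per-row codes are positions in the category list, not the original `eco_code` values.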

## Rename the categories with meaningful names

In [39]:
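# rename_categories returns a new Series; direct assignment to .cat.categories was removed in pandas 2.0
city_eco["economy"] = city_eco["economy"].cat.rename_categories(["Finance", "Energy", "Tourism"])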
city_eco

| | population | city | state | eco_code | economy |
|---|---|---|---|---|---|
| 0 | 808976 | San Francisco | California | 0 | Finance |
| 1 | 8363710 | New York | New-York | 0 | Finance |
| 2 | 413201 | Miami | Florida | 3 | Energy |
| 3 | 2242193 | Houston | Texas | 7 | Tourism |
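
Labels given as a list are matched to the existing categories by position, so a misordered list silently mislabels rows. As a possible alternative, `rename_categories` also accepts a dict that maps each old category to its new name. The sketch below assumes the same code-to-label pairing as the example above and works on a standalone Series; the name `labels` is illustrative:

```python
import pandas as pd

# Standalone Series with the same codes as the example's eco_code column
economy = pd.Series([0, 0, 3, 7]).astype("category")

# Explicit old -> new mapping, independent of the order of the categories
labels = {0: "Finance", 3: "Energy", 7: "Tourism"}
economy = economy.cat.rename_categories(labels)

print(economy.tolist())
print(economy.cat.categories.tolist())
```

With a dict, each label stays attached to its code even if the set of codes in the data changes later.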


**Note:** categorical values are sorted by the order of their categories (here Finance, Energy, Tourism), *not* by alphabetical (lexicographical) order:

In [40]:
city_eco.sort_values(by="economy", ascending=False)

| | population | city | state | eco_code | economy |
|---|---|---|---|---|---|
| 3 | 2242193 | Houston | Texas | 7 | Tourism |
| 2 | 413201 | Miami | Florida | 3 | Energy |
| 1 | 8363710 | New York | New-York | 0 | Finance |
| 0 | 808976 | San Francisco | California | 0 | Finance |
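
If a different sort order is wanted, the category order itself can be changed. The sketch below is one way to do this, assuming a standalone Series that carries the same categories in the same order as the `economy` column; it uses `reorder_categories`, and the target order chosen here (alphabetical) is only an illustration:

```python
import pandas as pd

# Standalone Series with the category order used in the example
economy = pd.Series(
    pd.Categorical(
        ["Finance", "Finance", "Energy", "Tourism"],
        categories=["Finance", "Energy", "Tourism"],
    )
)

# Sorting follows the category order: Finance < Energy < Tourism
by_category_order = economy.sort_values().tolist()

# Declare a new, explicitly ordered category sequence, then sort again
reordered = economy.cat.reorder_categories(
    ["Energy", "Finance", "Tourism"], ordered=True
)
by_new_order = reordered.sort_values().tolist()

print(by_category_order)
print(by_new_order)
```

Passing `ordered=True` additionally marks the categories as ordered, which enables comparisons such as `<` and `>` between values.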
