# Working with categorical data

Categorical data comes in two kinds:
- Ordinal: the categories have a natural order
- Nominal: the categories have no natural order

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

import warnings
warnings.filterwarnings('ignore')


In [None]:
df = pd.read_csv('../data/adult.csv')

In [None]:
df.info()

The object columns contain strings.
Pandas tries to infer the type of each column, but it does not flag categories as categorical data; it reads them as strings (object dtype).


In [None]:
df.describe()

In [None]:
# How many distinct values does each variable have?
df.nunique()

In [None]:
# What are the different values of a specific variable?
df.workclass.value_counts()

In [None]:
# And their relative frequencies
df.workclass.value_counts(normalize=True)

Object types can be converted into categorical ones by using the astype method:

In [None]:
df['marital.status'] = df['marital.status'].astype('category')

In [None]:
df['marital.status'].dtype

In [None]:
# Creating categorical series

my_data = ['A', 'A', 'B', 'A', 'C', 'B']

# Unordered
my_series1 = pd.Series(my_data, dtype='category')
print(my_series1)

In [None]:
# Ordered: pd.Categorical alone returns a Categorical, so wrap it in a Series
my_series2 = pd.Series(pd.Categorical(my_data, categories=["C", "B", "A"], ordered=True))
print(my_series2)
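Declaring an order matters because ordered categoricals support comparisons, min/max and sorting that follow the declared order rather than the alphabetical one. The sketch below uses made-up size labels (not taken from any dataset in this notebook) to illustrate it:

```python
import pandas as pd

# Hypothetical sizes, declared from smallest to largest.
sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# Comparisons follow the declared category order, not alphabetical order.
at_least_medium = sizes >= "medium"
smallest, largest = sizes.min(), sizes.max()
sorted_sizes = sizes.sort_values().tolist()

print(at_least_medium.tolist())
print(smallest, largest)
print(sorted_sizes)
```

The same comparison on an unordered categorical raises a TypeError, which is a quick way to notice that the order was never set.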

Storing a column as categorical can reduce its memory footprint considerably when it holds a few distinct values that repeat often.
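A quick way to check the saving is to compare memory_usage(deep=True) before and after the conversion. The column below is synthetic (three illustrative labels repeated many times), so the exact numbers will differ from those of a real dataset:

```python
import pandas as pd

# Synthetic column: 3 distinct strings repeated many times (illustrative data).
labels = pd.Series(["Private", "Self-employed", "Government"] * 10_000)

as_strings = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```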

To benefit from the lower memory requirements from the start, pass the dtypes of the variables when reading from the source:

In [None]:
df_dtypes={
"marital.status": "category"
}

df = pd.read_csv('../data/adult.csv', dtype=df_dtypes)

df.dtypes
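The same idea can be tried without the file on disk. This sketch reads a tiny in-memory CSV (the rows are made up) and confirms that only the requested column comes back as a category:

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for a real file (made-up rows).
csv_text = """age,marital.status
39,Never-married
50,Married
38,Divorced
53,Married
"""

sample = pd.read_csv(io.StringIO(csv_text), dtype={"marital.status": "category"})
print(sample.dtypes)
print(sample["marital.status"].cat.categories.tolist())
```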


## Grouping data by categories



In [None]:
df = pd.read_csv('../data/adult.csv')

# Filtering each income group by hand, as in the next two lines,
df1 = df[df["income"]=="<=50K"]
df2 = df[df["income"]==">50K"]
# can be replaced by a single groupby
groupby_object = df.groupby(by=["income"])

In [None]:
# We can now use functions like count, sum, mean... or our own custom functions
groupby_object.size()

In [None]:
groupby_object[["education.num", "age"]].sum()

In [None]:
groupby_object = df.groupby(by=["income", "marital.status"])

In [None]:
groupby_object[["education.num", "age"]].sum()
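Custom functions can be mixed with the built-in aggregations through agg. The sketch below uses a small made-up sample with the same column names; value_range is a hypothetical helper written for this example:

```python
import pandas as pd

# Small made-up sample with the same column names as the dataset above.
people = pd.DataFrame(
    {
        "income": ["<=50K", ">50K", "<=50K", ">50K", "<=50K"],
        "age": [25, 48, 37, 52, 31],
        "education.num": [9, 13, 10, 14, 9],
    }
)

def value_range(values):
    """Custom aggregation: spread between the largest and smallest value."""
    return values.max() - values.min()

# Built-in aggregations are given by name, the custom one as a function.
summary = people.groupby("income")["age"].agg(["count", "mean", value_range])
print(summary)
```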

## Setting categorical variables

In this section we are going to see how we can manipulate categories. For the most part we are going to use the *.cat* accessor.

In [None]:
dogs = pd.read_csv('../data/ShelterDogs.csv')

In [None]:
dogs.info()

In [None]:
dogs["coat"] = dogs["coat"].astype("category")

In [None]:
dogs["coat"].value_counts()

In [None]:
# Setting the categories assigns NaN to any value whose previous category is not included in the new list
dogs["coat"] = dogs["coat"].cat.set_categories(new_categories=["short", "medium", "long"])

In [None]:
dogs["coat"].value_counts(dropna=False)

The wirehaired category is now gone, and the rows that had it show up as NaN.
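The effect is easy to reproduce on a handful of values. This sketch uses a made-up coat column rather than the shelter data:

```python
import pandas as pd

coat = pd.Series(["short", "wirehaired", "long", "short"], dtype="category")

# "wirehaired" is left out of the new list, so its rows become NaN.
trimmed = coat.cat.set_categories(["short", "medium", "long"])

print(trimmed.tolist())
print(trimmed.isna().sum())
```

Note that "medium" becomes a valid category even though no row uses it yet.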

In [None]:
dogs["coat"] = dogs["coat"].cat.set_categories(
    new_categories=["short", "medium", "long"], 
    ordered=True
)
dogs["coat"].head(3)

In [None]:
# In the case of likes_people, there are a lot of null values

dogs['likes_people'].value_counts(dropna=False)

In [None]:
dogs["likes_people"] = dogs["likes_people"].astype("category")
dogs["likes_people"] = dogs["likes_people"].cat.add_categories(["did not check", "could not tell"])

In [None]:
dogs["likes_people"].value_counts(dropna=False)
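Adding a category does not change any value by itself; it only makes the new label legal. One possible follow-up, sketched here on made-up answers, is to use the new label to fill the missing values:

```python
import pandas as pd

likes_people = pd.Series(["yes", None, "no", None], dtype="category")

# fillna only accepts values that are already categories, so add the label first.
likes_people = likes_people.cat.add_categories(["did not check"])
filled = likes_people.fillna("did not check")

print(filled.value_counts(dropna=False))
```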

In [None]:
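# We can remove categories as well. remove_categories raises a ValueError
# if the category does not exist (for example after set_categories dropped it), so check first
if 'wirehaired' in dogs['coat'].cat.categories:
    dogs['coat'] = dogs['coat'].cat.remove_categories(removals=['wirehaired'])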

## Updating and collapsing categories

In [None]:
dogs['breed'] = dogs['breed'].astype('category')

In [None]:
dogs['breed'].value_counts()

In [None]:
#Renaming one or more categories
dogs['breed'] = dogs['breed'].cat.rename_categories(new_categories={
   'Unknown Mix': 'Unknown'
})

In [None]:
dogs['breed'].value_counts()

In [None]:
# Renaming can be done with lambdas too
dogs['sex'] = dogs['sex'].astype('category')
dogs['sex'] = dogs['sex'].cat.rename_categories(lambda x: x.title())

In [None]:
dogs['sex'].value_counts()

In [None]:
dogs['color'] = dogs['color'].astype("category")
print(dogs['color'].cat.categories)

In [None]:
update_colors = {
    "black and brown": "black",
    "black and tan": "black",
    "black and white": "black"
}

In [None]:
dogs["main_color"] = dogs["color"].replace(update_colors)

In [None]:
dogs["main_color"].value_counts()

In [None]:
dogs["main_color"].dtype
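Whether the result of a collapse is still categorical can depend on the method and the pandas version, which is why the dtype check matters. An alternative that makes the final dtype explicit, sketched on made-up colors, is to collapse the labels on plain objects and cast back at the end:

```python
import pandas as pd

color = pd.Series(
    ["black and tan", "brown", "black and white", "black"], dtype="category"
)
to_main_color = {"black and tan": "black", "black and white": "black"}

# Work on plain Python objects, collapse the labels, then cast back explicitly.
main_color = (
    color.astype(object)
    .map(lambda value: to_main_color.get(value, value))
    .astype("category")
)

print(main_color.cat.categories.tolist())
```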

## Reordering Categories

In [None]:
dogs = pd.read_csv('../data/ShelterDogs.csv')

dogs['coat'] = dogs['coat'].astype('category')
dogs['coat'] = dogs['coat'].cat.reorder_categories(
    new_categories=['short', 'medium', 'wirehaired', 'long'],
    ordered=True
)

dogs['coat'].cat.categories

In [None]:
# The category order determines the order in which the grouped results are displayed
dogs.groupby(by='coat')['age'].mean()
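The following sketch reproduces this behavior on a few made-up rows, so it can be run without the CSV file. The rows are deliberately not in category order:

```python
import pandas as pd

dogs_sample = pd.DataFrame(
    {
        "coat": ["long", "short", "medium", "short", "long"],
        "age": [4.0, 2.0, 6.0, 3.0, 8.0],
    }
)
dogs_sample["coat"] = pd.Categorical(
    dogs_sample["coat"], categories=["short", "medium", "long"], ordered=True
)

# observed=False is passed explicitly to keep every declared category in the result.
mean_age = dogs_sample.groupby("coat", observed=False)["age"].mean()
print(mean_age)
```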

## Cleaning and accessing data

There are several kinds of issues we can face when dealing with categorical data:
- Inconsistent values
- Misspelled values
- Wrong dtype after corrections

To identify them we can use either the cat.categories attribute or the value_counts method.

For fixing the inconsistent values we can use the same methods as for fixing strings:
- str.strip()
- str.title(), str.upper() or str.lower()
- replace(dict) or map(dict) on the Series, to remap specific values

Don't forget to check the dtype after the change.

To access data we can use .str.contains(string or regex)
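Put together, a cleaning pass might look like the sketch below. The column is made up, with stray whitespace and inconsistent capitalization added on purpose:

```python
import pandas as pd

# Made-up column with stray whitespace and inconsistent capitalization.
raw = pd.Series([" yes", "Yes ", "NO", "no", "yes"], dtype="category")
print(raw.cat.categories.tolist())  # five spellings of two answers

# The .str methods return a non-categorical Series, so cast back at the end.
clean = raw.str.strip().str.lower().astype("category")
print(clean.cat.categories.tolist())

# Boolean mask for the rows whose label contains a substring.
mask = clean.str.contains("ye", regex=False)
print(mask.tolist())
```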

In [None]:
# Accessing data with loc

dogs.loc[dogs['get_along_cats']=='yes', "size"]

In [None]:
dogs.loc[dogs['get_along_cats']=='yes', "size"].value_counts(sort=False)

## Using categorical data in visualization

Seaborn has a figure-level function for categorical plots, called catplot.

In [None]:
df = pd.read_csv('../data/lasvegas_tripadvisor.csv')

In [None]:
df

In [None]:
df.describe()

In [None]:
df.info()

Catplot has the following parameters:
- x
- y
- data
- kind: strip, swarm, box, violin, boxen, point, bar, count

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='box')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='boxen')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='violin')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='swarm')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='strip')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='point')

In [None]:
sns.catplot(data=df, y='Nr. rooms', kind='bar')

In [None]:
df['Score'].value_counts()

In [None]:
sns.catplot(x='Pool', y='Score', data=df, kind='box')

In [None]:
sns.set(font_scale=1.4)
sns.set_style('whitegrid')

In [None]:
sns.catplot(x='Pool', y='Score', data=df, kind='box')

In [None]:
# Pandas offers a way to plot bar charts out of the box:
df['Traveler type'].value_counts().plot.bar()

In [None]:
# Seaborn goes beyond this simple plot and allows us to analyze numerical variables through categorical ones
sns.catplot(x='Traveler type', hue='Traveler type', y='Score', data=df, kind='bar')

In a seaborn bar plot, the height of each bar represents the point estimate of the mean of the data.

The black line represents the confidence interval for that estimate. By default it is a 95% confidence interval for the mean: it expresses how uncertain the estimate of the mean is, not the range in which individual observations fall.
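To make the idea of a confidence interval for the mean concrete, here is a minimal percentile-bootstrap sketch in plain NumPy. The scores are made up, and this illustrates the general technique rather than seaborn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up review scores on a 1-5 scale.
scores = np.array([5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 5, 4])

# Resample with replacement many times and record the mean of each resample.
boot_means = np.array(
    [rng.choice(scores, size=scores.size, replace=True).mean() for _ in range(2000)]
)

# The 2.5th and 97.5th percentiles of those means give a 95% interval.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={scores.mean():.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The interval is much narrower than the spread of the individual scores, which is exactly the difference between uncertainty about the mean and variability of the observations.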

### Ordering categories

In [None]:
df['Traveler type'] = df['Traveler type'].astype('category')
df['Traveler type'].cat.categories

sns.catplot(x='Traveler type', hue='Traveler type', y='Score', data=df, kind='bar')

The catplot function has an "order" parameter, but it is preferable to set the order on the category itself, since that order then applies to every plot and not only to this one.

### The hue parameter

It colors the plot elements according to the categories of a second variable.

In [None]:
sns.catplot(x='Traveler type', hue='Tennis court', y='Score', data=df, kind='bar')

## Point and count plot

The point plot connects the means of the categories, which helps to see the differences between them. Confidence intervals are shown as well.

In [None]:
sns.catplot(x='Pool', y='Score', data=df, kind='point')

In [None]:
sns.catplot(x='Spa', y='Score', data=df, kind='point', hue='Tennis court', dodge=True)

In [None]:
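# join=False removes the connecting lines (seaborn 0.13 deprecates it in favor of linestyle='none')
sns.catplot(x='Spa', y='Score', data=df, kind='point', hue='Tennis court', dodge=True, join=False)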

In [None]:
sns.catplot(x='Spa', data=df, kind='count', hue='Tennis court')
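The count plot draws one bar per combination of x and hue, with a height equal to the number of rows in that combination. The underlying numbers can be checked with pd.crosstab; the sketch below uses a few made-up rows with the same column names:

```python
import pandas as pd

hotels = pd.DataFrame(
    {
        "Spa": ["YES", "NO", "YES", "YES", "NO"],
        "Tennis court": ["NO", "NO", "YES", "NO", "YES"],
    }
)

# Rows are the x variable, columns are the hue variable, cells are row counts.
counts = pd.crosstab(hotels["Spa"], hotels["Tennis court"])
print(counts)
```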

## The challenge of displaying several categories at the same time

In [None]:
ax = sns.catplot(x='Traveler type', 
            kind='count', 
            col='User continent', 
            col_wrap=3, 
            palette=sns.color_palette('Set1'), 
            data=df)

ax.fig.suptitle('Hotel Score by Traveler Type and User Continent')
ax.set_axis_labels("Traveler Type", 'Number of Reviews')
plt.subplots_adjust(top=.9)


## Categorical Pitfalls

The first one concerns the memory footprint reduction associated with categories: it does not materialize when the variable has a large number of distinct values.

In [None]:
df = pd.read_csv('../data/cars.csv')

In [None]:
df

In [None]:
df.manufacturer_name.nunique()

In [None]:
df.manufacturer_name.nbytes

In [None]:
df.manufacturer_name.astype('category').nbytes

In [None]:
df.odometer_value.nbytes

In [None]:
df.odometer_value.astype('category').nbytes

In [None]:
df.odometer_value.nunique()

The memory saving on odometer_value is quite limited compared to what happens with manufacturer_name, because odometer_value has many more distinct values.
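The effect of cardinality can be isolated with synthetic data. In this sketch both columns have the same length, but one has 5 distinct values and the other has only distinct values; a categorical stores one small code per row plus the list of categories, so the second case ends up larger than the plain column:

```python
import numpy as np
import pandas as pd

n = 10_000
few_values = pd.Series(np.arange(n) % 5)  # 5 distinct integers
all_unique = pd.Series(np.arange(n))      # every value is different

for name, column in [("few_values", few_values), ("all_unique", all_unique)]:
    plain = column.nbytes
    categorical = column.astype("category").nbytes
    print(f"{name}: plain={plain:,} bytes, category={categorical:,} bytes")
```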

Using categories can be frustrating as well because:
- Using the .str accessor to manipulate data converts the Series back to object dtype
- The .apply() method outputs a new Series with object dtype
- The common methods for adding, removing, replacing or setting categories do not all handle missing categories the same way
- NumPy functions generally do not work with categorical Series


In [None]:
# Double check that the variable is still a category after an operation
df['color']=df['color'].astype('category')
df['color']=df['color'].str.upper()
print(df['color'].dtype)


In [None]:
df['color']=df['color'].astype('category')
print(df['color'].dtype)
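When the operation only transforms the labels, one way to avoid the round trip through strings is to apply it to the categories themselves with rename_categories, which accepts a callable. A sketch on made-up colors:

```python
import pandas as pd

color = pd.Series(["red", "blue", "red"], dtype="category")

# Transform the category labels themselves; the values keep their categorical dtype.
upper = color.cat.rename_categories(str.upper)

print(upper.dtype)
print(upper.tolist())
```

This only works when the transformed labels stay unique; collapsing two labels into one still requires a different approach.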

In [None]:
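# Check the missing values: values whose category is not in the new list become NaN.
# The labels were upper-cased above, so these lower-case names will not match them.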
df['color']=df['color'].astype('category')
df['color']=df['color'].cat.set_categories(['black', 'silver', 'blue'])
df['color'].value_counts(dropna=False)

## Label Encoding

Label encoding codes each category with an integer from 0 to n-1, where n is the number of different categories.
-1 is often used to encode missing values.
Label encoding is often used to save memory.

The cat.codes attribute gives access to the codes:


In [None]:
df['manufacturer_name'] = df['manufacturer_name'].astype('category')

In [None]:
df['manufacturer_code'] = df['manufacturer_name'].cat.codes

In [None]:
df[['manufacturer_code', 'manufacturer_name']]
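The numbering and the treatment of missing values are easier to see on a few values. This sketch uses made-up manufacturer names, one of them missing:

```python
import pandas as pd

maker = pd.Series(["Ford", "Audi", None, "Ford"], dtype="category")

print(maker.cat.categories.tolist())  # categories are sorted: Audi, Ford
print(maker.cat.codes.tolist())       # the missing value is encoded as -1
```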

If we want to create a code book with the codes and names for later manipulation, we can do so with Python's built-in zip function:

In [None]:
name_map = dict(zip(df.manufacturer_code, df.manufacturer_name))

In [None]:
name_map

In [None]:
#to convert codes back into names:
df['manufacturer_code'].map(name_map)
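Zipping the two columns visits every row, although only one entry per category is needed. A possible alternative, sketched on made-up names, builds the code book from the categories, relying on the fact that code i refers to the i-th category:

```python
import pandas as pd

maker = pd.Series(["Ford", "Audi", "Ford", "Volvo"], dtype="category")

# Code -> name lookup built from the categories, one entry per category.
code_to_name = dict(enumerate(maker.cat.categories))

# Round trip: encode, then decode with the lookup.
codes = maker.cat.codes
decoded = codes.map(code_to_name)

print(code_to_name)
print(decoded.tolist())
```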

In [None]:
# To create a van code
df['van_code'] = np.where(df['body_type'].str.contains('van', regex=False), 1, 0)

In [None]:
df['van_code'].value_counts()
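If the column can contain missing values, the result of str.contains may contain missing entries too, which np.where does not treat as False. A defensive variant, sketched on made-up body types, passes na=False so that the mask is strictly boolean:

```python
import numpy as np
import pandas as pd

# Made-up body types, including one missing value.
body_type = pd.Series(["minivan", "sedan", "van", None])

# na=False treats missing values as "no match", so the mask is strictly boolean.
is_van = body_type.str.contains("van", regex=False, na=False)
van_code = np.where(is_van, 1, 0)

print(van_code.tolist())
```

Note that a plain substring match also flags "minivan"; whether that is desired depends on the encoding being built.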

### One hot encoding 

One hot encoding creates one indicator column per category. It is very helpful for machine learning data preparation.

The pandas get_dummies function does this.

In [None]:
df[['odometer_value', 'color']].head()

In [None]:
df_onehot = pd.get_dummies(df[['odometer_value', 'color']])

In [None]:
df_onehot.head()

In [None]:
df_onehot = pd.get_dummies(df, columns=['color'], prefix='onehot')

In [None]:
df_onehot
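The shape of the result is easier to inspect on a few rows. This sketch uses a made-up frame with the same two column names and checks which columns get_dummies produces:

```python
import pandas as pd

cars = pd.DataFrame(
    {"odometer_value": [190000, 290000, 402000], "color": ["silver", "blue", "silver"]}
)

# Only "color" is expanded; the numeric column is passed through unchanged.
encoded = pd.get_dummies(cars, columns=["color"], prefix="onehot")

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one indicator set among the onehot columns, which is where the name of the technique comes from.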