# Considerations on categorical data

The data (the sample) has to be representative of the population it was drawn from.

One of the most important considerations when working with categorical data is the set of classes (or labels) each variable can take.

Sometimes, our classes are imbalanced: one class is more frequent than the others.

There are several ways to check class frequencies:

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

import warnings
warnings.filterwarnings('ignore')


In [None]:
divorce = pd.read_csv('../data/divorces.csv')

In [None]:
divorce

In [None]:
divorce['Type_of_divorce'] = divorce['Type_of_divorce'].astype('category')
divorce['Level_of_education_partner_man'] = divorce['Level_of_education_partner_man'].astype('category')
divorce['Level_of_education_partner_woman'] = divorce['Level_of_education_partner_woman'].astype('category')


In [None]:
sns.histplot(data=divorce, x='Type_of_divorce', hue='Type_of_divorce')

In [None]:
sns.histplot(data=divorce, x='Level_of_education_partner_man', hue='Level_of_education_partner_man')

In [None]:
divorce['Level_of_education_partner_man'].value_counts(normalize=True)
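
The same check can be sketched on a small, made-up Series. The labels and counts below are invented for illustration and are not taken from `divorces.csv`; the ratio between the most and least frequent class is just one informal way to put a number on imbalance.

```python
import pandas as pd

# Toy labels, invented for illustration only
toy_labels = pd.Series(
    ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'], dtype='category'
)

# Absolute and relative class frequencies
counts = toy_labels.value_counts()
shares = toy_labels.value_counts(normalize=True)

# Ratio of the most frequent class to the least frequent one
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f'imbalance ratio: {imbalance_ratio:.1f}')
```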

In [None]:
pd.crosstab(divorce['Level_of_education_partner_man'], divorce['Level_of_education_partner_woman'])

In [None]:
pd.crosstab(divorce['Level_of_education_partner_man'], divorce['Level_of_education_partner_woman'], 
            values=divorce['Num_Children'], aggfunc='median')
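
A cross-tabulation of raw counts can be hard to compare across rows of different sizes. As a minimal sketch on made-up data (the column names `edu_man` and `edu_woman` are hypothetical stand-ins), `normalize='index'` turns each row into proportions that sum to 1:

```python
import pandas as pd

# Toy education levels, invented for illustration only
toy = pd.DataFrame({
    'edu_man': ['Primary', 'Primary', 'Secondary',
                'Secondary', 'Secondary', 'University'],
    'edu_woman': ['Primary', 'Secondary', 'Secondary',
                  'Secondary', 'University', 'University'],
})

# Share of each woman's level within every man's level (rows sum to 1)
row_shares = pd.crosstab(toy['edu_man'], toy['edu_woman'], normalize='index')

print(row_shares)
```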

# Generating new features 

Sometimes we have to transform the data as we received it: replacing characters, converting text to numeric values, or extracting the year, month, or day from a date.



In [None]:
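# Pandas accessors commonly used for these transformations.
# They are called on a Series and are listed here as a reference, not executed:
# Series.str.replace  -> replace characters in a string column
# Series.dt.month     -> month of a datetime column
# Series.dt.year      -> year of a datetime column
# Series.dt.day       -> day of the month of a datetime column
# Series.dt.hour      -> hour of a datetime column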
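
The transformations listed above can be sketched on a small made-up DataFrame. The columns `income` and `date` and their values are hypothetical and do not come from `divorces.csv`; the point is only to show the order of the steps.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'income': ['1,500', '2,250', '980'],
    'date': ['2005-03-14', '2010-11-02', '2018-07-30'],
})

# Strip the thousands separator, then convert the text to numbers
toy['income'] = pd.to_numeric(toy['income'].str.replace(',', '', regex=False))

# Parse the text to datetime so the .dt accessor becomes available
toy['date'] = pd.to_datetime(toy['date'])
toy['year'] = toy['date'].dt.year
toy['month'] = toy['date'].dt.month
toy['day'] = toy['date'].dt.day

print(toy)
```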


We can also group numeric values into classes:

In [None]:
# Quartiles and maximum of the man's monthly income, used as bin edges
twenty_fifth = divorce['Monthly_income_partner_man_peso'].quantile(0.25)
fiftieth = divorce['Monthly_income_partner_man_peso'].quantile(0.5)
seventy_fifth = divorce['Monthly_income_partner_man_peso'].quantile(0.75)
max_ = divorce['Monthly_income_partner_man_peso'].max()

# One label per interval; the first edge (0) assumes incomes are non-negative
labels = ['low_inc', 'med_inc', 'med_high_inc', 'high_inc']
bins = [0, twenty_fifth, fiftieth, seventy_fifth, max_]

In [None]:
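# include_lowest=True keeps incomes equal to the first edge (0) in 'low_inc'
divorce['inc_cat'] = pd.cut(divorce['Monthly_income_partner_man_peso'],
                            bins=bins,
                            labels=labels,
                            include_lowest=True)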
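
When the bin edges are quantiles, as in the cell above, `pd.qcut` is a possible shortcut: it is a different function from `pd.cut` and computes the quantile edges itself. The sketch below uses a made-up income Series, not the divorce data, and assumes the quantile edges are distinct (by default `pd.qcut` raises an error on duplicate edges).

```python
import pandas as pd

# Toy incomes, invented for illustration only
income = pd.Series([100, 200, 300, 400, 500, 600, 700, 800])
inc_labels = ['low_inc', 'med_inc', 'med_high_inc', 'high_inc']

# q=4 splits the values into four quantile-based bins of roughly equal size
inc_cat = pd.qcut(income, q=4, labels=inc_labels)

print(inc_cat.value_counts())
```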

In [None]:
divorce