#### Introduction to Categorical Data
When exploring data, we’re often interested in summarizing a large amount of information with a single number or visualization.

Depending on what we’re trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using the mean, median, or mode, and we can summarize spread using the standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.

For example, here are the first five rows and some selected columns of a dataset from the 1994 U.S. census:

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Manually Creating the dataframe used in the example
ages = [90, 82, 66, 54, 41]
educations = ['HS-grad', 'HS-grad', 'Some-college', '7th-8th', 'Some-college']
marital_statuses = ['Widowed', 'Widowed', 'Widowed', 'Divorced', 'Separated']
races = ['White', 'White', 'Black', 'White', 'White']


data = {'age': ages, 
        'education': educations, 
        'marital_status': marital_statuses, 
        'race': races }

data_df = pd.DataFrame(data)

print(data_df)

   age     education marital_status   race
0   90       HS-grad        Widowed  White
1   82       HS-grad        Widowed  White
2   66  Some-college        Widowed  Black
3   54       7th-8th       Divorced  White
4   41  Some-college      Separated  White


Age is a quantitative variable, so we can calculate the average (or mean) age. However, for a variable like `marital_status`, we can’t calculate something like "average marital status" because the possible values of marital status are categories rather than numbers (e.g. "Married", "Widowed", "Separated", etc.).
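As a minimal sketch of this distinction, the snippet below rebuilds two columns of the sample (the name `census_sample` is only illustrative) and shows that a mean works for `age`, fails for `marital_status`, and that the mode is still available for the categorical column.

```python
import pandas as pd

# Small illustrative frame mirroring two columns of the census sample
census_sample = pd.DataFrame({
    'age': [90, 82, 66, 54, 41],
    'marital_status': ['Widowed', 'Widowed', 'Widowed', 'Divorced', 'Separated'],
})

# A mean is well defined for a quantitative column
mean_age = census_sample['age'].mean()
print(mean_age)

# A mean of string categories is not defined, so pandas raises an error
try:
    census_sample['marital_status'].mean()
    mean_failed = False
except (TypeError, ValueError) as err:
    mean_failed = True
    print('Cannot average categories:', type(err).__name__)

# The mode (most frequent value) is still available
mode_status = census_sample['marital_status'].mode()[0]
print(mode_status)
```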

In [3]:
nyc_trees = pd.read_csv('../Datasets/nyc_tree_census.csv')
print(nyc_trees.head(3))

   tree_id  trunk_diam status health   spc_common       neighborhood
0   199250           8  Alive   Good   crab apple     Lincoln Square
1   136891          17  Alive   Good  honeylocust  East Harlem North
2   200218           3  Alive   Good       ginkgo          Chinatown


In [4]:
nyc_trees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tree_id       50000 non-null  int64 
 1   trunk_diam    50000 non-null  int64 
 2   status        50000 non-null  object
 3   health        47695 non-null  object
 4   spc_common    47695 non-null  object
 5   neighborhood  50000 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.3+ MB


#### Nominal Categories
Depending on the data, some of the summary statistics we use for quantitative data can still be meaningful for categorical data. Let’s first consider a nominal categorical variable. A nominal categorical variable is a categorical variable with no intrinsic ordering to the categories. Examples from the census dataset introduced above include `marital_status` and `race`.

Because these variables’ categories have no ordering or numeric equivalents, it’s impossible to calculate a mean or median. It would also be impossible to describe spread with statistics like variance, standard deviation, a range, IQR, or percentiles, because these statistics all rely on being able to order the data in some way. However, it is still possible to calculate the mode, the most common value in the dataset.

We can do this in Python using the `.value_counts()` method. The `.value_counts()` method counts how often each value occurs in a column and returns the result as a pandas Series. By default, `.value_counts()` sorts categories in descending order of frequency, so the top row in the output is the mode.



In [5]:
# Get tree counts by neighborhood
tree_counts = nyc_trees['neighborhood'].value_counts()
print(tree_counts, '\n')

# Get neighborhoods with most trees
greenest_neighborhood = tree_counts.index[0]
print('Greenest neighborhood is {}'.format(greenest_neighborhood))

Annadale-Huguenot-Prince's Bay-Eltingville    950
Great Kills                                   761
East New York                                 702
Bayside-Bayside Hills                         665
Rossville-Woodrow                             633
                                             ... 
39                                              1
BX33                                            1
MN50                                            1
82                                              1
87                                              1
Name: neighborhood, Length: 442, dtype: int64 

Greenest neighborhood is Annadale-Huguenot-Prince's Bay-Eltingville
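Taking `index[0]` returns a single category even when several categories share the highest count. The sketch below uses a small hypothetical Series (not the tree census data) to show two ways of recovering every mode in that situation.

```python
import pandas as pd

# Hypothetical neighborhood labels with a two-way tie for most common
neighborhoods = pd.Series(['Chinatown', 'Great Kills', 'Chinatown',
                           'Great Kills', 'Bayside'])

counts = neighborhoods.value_counts()
top_count = counts.iloc[0]

# Every category whose count equals the maximum is a mode
modes_from_counts = sorted(counts[counts == top_count].index)

# Series.mode() returns all modes, sorted
modes = neighborhoods.mode().tolist()
print(modes)
```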


In [6]:
#Finding the unique health statuses of the trees
tree_health_statuses = nyc_trees['health'].unique()
print(tree_health_statuses)

['Good' 'Poor' 'Fair' nan]


In [7]:
#Converting the health statuses to categories and ordering like poor < fair < good
health_categories = ['Poor', 'Fair', 'Good']
nyc_trees['health'] = pd.Categorical(nyc_trees['health'], health_categories, ordered = True)
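To see what the ordered conversion produces, here is a small sketch on hypothetical values (not the real `nyc_trees` column). It shows the integer codes behind each category, including the code `-1` that pandas assigns to missing values, and an order-aware comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical health values, including one missing entry
health = pd.Series(['Good', 'Poor', np.nan, 'Fair', 'Good'])
health_levels = ['Poor', 'Fair', 'Good']
health_cat = pd.Series(
    pd.Categorical(health, categories=health_levels, ordered=True))

# Codes follow the category order; a missing value is coded as -1
codes = health_cat.cat.codes
print(codes.tolist())

# Keep only codes that correspond to a real category
valid_codes = codes[codes >= 0]

# Ordered categoricals support comparisons against a category label
at_least_fair = int((health_cat >= 'Fair').sum())
print(at_least_fair)
```

Because missing values carry the code `-1`, it is usually safer to filter them out before computing statistics on the codes.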

In [8]:
#Finding the median health status (missing values are coded as -1, so exclude them)
health_codes = nyc_trees['health'].cat.codes
median_index = np.median(health_codes[health_codes >= 0])
median_health_status = health_categories[int(median_index)]
print(median_health_status)

Good


In [9]:
nyc_trees2 = pd.read_csv('../Datasets/nyc_tree_census2.csv')

This dataset contains two variables related to trunk size. The first variable, `trunk_diam`, contains the diameter of the trunk (in inches) for each tree. The variable `tree_diam_category`, on the other hand, categorizes each tree based on the size of the trunk. The categories are: 'Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)', and 'Very large (>24in)'. You’ll notice that these categories are not evenly spaced with respect to diameter.
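The sketch below shows one way such size categories could be derived from a diameter column with `pd.cut`. How the dataset's categories were actually built is not stated, so the bin edges, the inclusive upper edges, and the sample diameters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

size_labels = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)',
               'Large (18-24in)', 'Very large (>24in)']

# Assumed bin edges; each bin includes its upper edge in this sketch
bin_edges = [0, 3, 10, 18, 24, np.inf]

# Hypothetical trunk diameters in inches
trunk_diam = pd.Series([2, 8, 17, 20, 30])

# pd.cut returns an ordered categorical when labels are supplied
diam_category = pd.cut(trunk_diam, bins=bin_edges, labels=size_labels,
                       include_lowest=True)
print(diam_category.cat.codes.tolist())

# Widths of the four bounded bins are unequal
bin_widths = np.diff(bin_edges[:-1])
print(bin_widths.tolist())
```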

In [10]:
nyc_trees2.tree_diam_category = pd.Categorical(
    nyc_trees2.tree_diam_category,
    ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)',
     'Large (18-24in)', 'Very large (>24in)'],
    ordered=True)


In [11]:
# Get Mean Diam of diameter variable, `trunk_diam`
mean_diam = np.mean(nyc_trees2['trunk_diam'])
print(mean_diam)

# Get Mean Category of `tree_diam_category`
mean_diam_cat = np.mean(nyc_trees2['tree_diam_category'].cat.codes)
print(mean_diam_cat)

11.27048
1.97282
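The mean of the category codes depends on how the categories happen to be numbered, while the median does not. As a rough illustration, the sketch below uses hypothetical codes for five trees and two numberings that both respect the category order; the helper `nearest_category` is introduced here only for the example.

```python
import numpy as np

# Hypothetical category codes for five trees (0 = smallest size class)
codes = np.array([0, 1, 1, 2, 4])

# Two numberings that both respect the category order
default_values = np.array([0, 1, 2, 3, 4])
stretched_values = np.array([0, 1, 2, 3, 100])  # same order, different spacing

def nearest_category(stat, values):
    """Index of the category whose numeric value is closest to stat."""
    return int(np.argmin(np.abs(values - stat)))

# The mean lands on a different category under each numbering
mean_default = nearest_category(default_values[codes].mean(), default_values)
mean_stretched = nearest_category(stretched_values[codes].mean(), stretched_values)

# The median lands on the same category under both numberings
median_default = nearest_category(np.median(default_values[codes]), default_values)
median_stretched = nearest_category(np.median(stretched_values[codes]), stretched_values)

print(mean_default, mean_stretched)
print(median_default, median_stretched)
```

This is why the mean category should be read with caution when the categories are unevenly spaced.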


In [12]:
size_labels_ordered = ['Small (0-3in)', 'Medium (3-10in)', 
                       'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)']

nyc_trees2.tree_diam_category = pd.Categorical(nyc_trees2.tree_diam_category, size_labels_ordered, ordered=True)


In [13]:
# Calculate 25th Percentile Category
p25_ind = np.percentile(nyc_trees2['tree_diam_category'].cat.codes, 25)
p25_tree_diam_category = size_labels_ordered[int(p25_ind)]
print(p25_tree_diam_category)

Medium (3-10in)


In [14]:
# Calculate 75th Percentile Category

p75_ind = np.percentile(nyc_trees2['tree_diam_category'].cat.codes, 75)
p75_tree_diam_category = size_labels_ordered[int(p75_ind)]
print(p75_tree_diam_category)

Large (18-24in)
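The 25th and 75th percentile categories together describe the interquartile range of an ordinal variable. The following sketch, on a small hypothetical sample rather than the census file, drops missing codes and uses `Series.quantile` with `interpolation='nearest'` so that each quantile is always an actual category code.

```python
import pandas as pd

size_labels = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)',
               'Large (18-24in)', 'Very large (>24in)']

# Hypothetical sample of size categories, with one missing value
sizes = pd.Series(pd.Categorical(
    ['Small (0-3in)', 'Medium (3-10in)', 'Medium (3-10in)', 'Medium (3-10in)',
     'Medium-Large (10-18in)', 'Large (18-24in)', 'Large (18-24in)',
     'Large (18-24in)', 'Very large (>24in)', None],
    categories=size_labels, ordered=True))

# Missing values are coded as -1, so remove them first
codes = sizes.cat.codes
codes = codes[codes >= 0]

# 'nearest' returns an observed code rather than an interpolated value
q1_code = int(codes.quantile(0.25, interpolation='nearest'))
q3_code = int(codes.quantile(0.75, interpolation='nearest'))

q1_label = size_labels[q1_code]
q3_label = size_labels[q3_code]
print('Middle half of trees ranges from', q1_label, 'to', q3_label)
```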


In [15]:
#returning proportions of tree statuses
tree_status_proportions = nyc_trees2['status'].value_counts(normalize = True)
print(tree_status_proportions)

Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64


In [16]:
health_proportions = nyc_trees2['health'].value_counts(normalize = True)
print(health_proportions, '\n')


health_proportions_2 = nyc_trees2['health'].value_counts(dropna = False, normalize = True)
print(health_proportions_2)


Good    0.810986
Fair    0.146871
Poor    0.042143
Name: health, dtype: float64 

Good    0.7736
Fair    0.1401
NaN     0.0461
Poor    0.0402
Name: health, dtype: float64


In [17]:
living_frequency = np.sum(nyc_trees['status'] == 'Alive')
living_proportion = np.mean(nyc_trees['status'] == 'Alive')
print(living_frequency)
print(living_proportion)

47695
0.9539


- For nominal categorical variables, there is no ordering to the categories. Because of this, we’re limited to using the mode to describe central tendency and there is no way to summarize the spread.

- For ordinal categorical variables, there is an implied ordering to the categories. In Python, we can use `pd.Categorical()` to transform a variable to a categorical type. The Categorical type allows us to access a numeric code for each category by using `.cat.codes` (missing values are coded as -1). From there, we may perform operations on these codes as if they were a regular numeric variable.

- However, when calculating statistics for an ordinal categorical variable we should be mindful that some numeric statistics rely on the assumption of equal spacing between categories.

- For ordinal categorical variables, median and mode can be used to summarize the central tendency, and the IQR (or any difference between percentiles) can be used to summarize the spread.

- Certain summary statistics (e.g. frequencies and proportions) can be used for all categorical variables. You can create true/false columns and use `np.sum()` and `np.mean()` to quickly summarize how many rows, and what proportion of your data, meet certain criteria.
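As a brief sketch of the last point, the example below builds a true/false column from two conditions on a small hypothetical frame (the values are made up for illustration) and summarizes it with a sum and a mean.

```python
import pandas as pd

# Hypothetical trees; column names follow the tree census data
trees = pd.DataFrame({
    'status': ['Alive', 'Alive', 'Dead', 'Alive', 'Stump'],
    'trunk_diam': [8, 17, 3, 25, 12],
})

# True where a tree is alive and wider than 10 inches
is_large_alive = (trees['status'] == 'Alive') & (trees['trunk_diam'] > 10)

# Sum counts the True values; mean gives their proportion
n_large_alive = int(is_large_alive.sum())
share_large_alive = float(is_large_alive.mean())
print(n_large_alive, share_large_alive)
```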

In [18]:
# Read CSV
film_permits = pd.read_csv('../Datasets/film_permits.csv')
print(film_permits.head(), '\n')
print(film_permits['EventType'].unique())

   EventID                      EventType           StartDateTime  \
0   446168                Shooting Permit  10/19/2018 02:00:00 PM   
1   186438                Shooting Permit  10/30/2014 07:00:00 AM   
2   445255                Shooting Permit  10/20/2018 07:00:00 AM   
3   128794  Theater Load in and Load Outs  11/16/2013 12:01:00 AM   
4    43547                Shooting Permit  01/10/2012 07:00:00 AM   

              EndDateTime    Borough           Category  SubCategoryName  
0  10/20/2018 02:00:00 AM  Manhattan               Film          Feature  
1  10/31/2014 02:00:00 AM     Queens         Television  Episodic series  
2  10/20/2018 06:00:00 PM   Brooklyn  Still Photography   Not Applicable  
3  11/17/2013 06:00:00 AM  Manhattan            Theater          Theater  
4  01/10/2012 07:00:00 PM   Brooklyn         Television  Episodic series   

['Shooting Permit' 'Theater Load in and Load Outs'
 'DCAS Prep/Shoot/Wrap Permit' 'Rigging Permit']


In [19]:
print(film_permits['Category'].unique())

['Film' 'Television' 'Still Photography' 'Theater' 'WEB' 'Commercial'
 'Student' 'Documentary' 'Music Video']


In [20]:
film_permits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   EventID          9999 non-null   int64 
 1   EventType        9999 non-null   object
 2   StartDateTime    9999 non-null   object
 3   EndDateTime      9999 non-null   object
 4   Borough          9999 non-null   object
 5   Category         9999 non-null   object
 6   SubCategoryName  9999 non-null   object
dtypes: int64(1), object(6)
memory usage: 546.9+ KB


In [21]:
# Nominal Vars
nominalvars = ['EventType', 'Borough', 'Category', 'SubCategoryName']

In [22]:
# Ordinal Vars - We might consider an Id like 'EventID' an ordinal variable in some situations

# Borough with the most permits for pilot episodes
print(film_permits[film_permits.SubCategoryName == 'Pilot'].Borough.value_counts())

Manhattan        149
Brooklyn          89
Queens            21
Bronx             10
Staten Island      2
Name: Borough, dtype: int64
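Filtering and then calling `.value_counts()` answers the question for one subcategory at a time. To compare several subcategories at once, a cross-tabulation is one option; the sketch below uses `pd.crosstab` on a small hypothetical frame rather than the real permits file.

```python
import pandas as pd

# Hypothetical permits; column names follow film_permits
permits = pd.DataFrame({
    'Borough': ['Manhattan', 'Brooklyn', 'Manhattan',
                'Queens', 'Manhattan', 'Brooklyn'],
    'SubCategoryName': ['Pilot', 'Pilot', 'Feature',
                        'Pilot', 'Pilot', 'Feature'],
})

# Rows are boroughs, columns are subcategories, cells are permit counts
table = pd.crosstab(permits['Borough'], permits['SubCategoryName'])
print(table)

# Borough with the most pilot permits in this toy sample
top_pilot_borough = table['Pilot'].idxmax()
print(top_pilot_borough)
```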


In [24]:
# Summarize the Top Categories
print(film_permits.Category.value_counts())

Television           5271
Film                 1765
Theater               966
Commercial            878
Still Photography     658
WEB                   313
Student                72
Documentary            48
Music Video            28
Name: Category, dtype: int64


In [25]:
# Summarize the Top Subcategories
print(film_permits.SubCategoryName.value_counts())

Episodic series            2916
Feature                    1382
Not Applicable             1381
Cable-episodic             1033
Theater                     966
Commercial                  686
Pilot                       271
News                        202
Cable-other                 126
Reality                     124
Morning Show                121
Short                       120
Promo                       112
Made for TV/mini-series      90
Variety                      76
Student Film                 65
Special/Awards Show          59
Cable-daily                  55
Industrial/Corporate         54
Talk Show                    48
PSA                          27
Game show                    25
Signed Artist                15
Children                     12
Syndication/First Run        11
Independent Artist            9
Magazine Show                 8
Daytime soap                  5
Name: SubCategoryName, dtype: int64
