#### Introduction to Categorical Data
When exploring data, we’re often interested in summarizing a large amount of information with a single number or visualization.

Depending on what we’re trying to understand from our data, we may need to rely on different statistics. For quantitative data, we can summarize central tendency using the mean, median, or mode, and we can summarize spread using the standard deviation, variance, or percentiles. However, when working with categorical data, we may not be able to use all the same summary statistics.

For example, here are the first five rows and some selected columns of a dataset from the 1994 U.S. census:

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Manually create the DataFrame used in the example
ages = [90, 82, 66, 54, 41]
educations = ['HS-grad', 'HS-grad', 'Some-college', '7th-8th', 'Some-college']
marital_statuses = ['Widowed', 'Widowed', 'Widowed', 'Divorced', 'Separated']
races = ['White', 'White', 'Black', 'White', 'White']


data = {'age': ages, 
        'education': educations, 
        'marital_status': marital_statuses, 
        'race': races }

data_df = pd.DataFrame(data)

print(data_df)

   age     education marital_status   race
0   90       HS-grad        Widowed  White
1   82       HS-grad        Widowed  White
2   66  Some-college        Widowed  Black
3   54       7th-8th       Divorced  White
4   41  Some-college      Separated  White


Age is a quantitative variable, so we can calculate the average (or mean) age. However, for a variable like `marital_status`, we can’t calculate something like "average marital status" because the possible values of marital status are categories rather than numbers (e.g. "Married", "Widowed", "Separated", etc.).
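As a quick check of this distinction, the sketch below rebuilds two columns of the five-row sample shown above and computes a mean for `age` and a most common value for `marital_status`. The name `sample_df` is illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild two columns of the five sample rows (sample_df is an illustrative name)
sample_df = pd.DataFrame({
    'age': [90, 82, 66, 54, 41],
    'marital_status': ['Widowed', 'Widowed', 'Widowed', 'Divorced', 'Separated'],
})

# A mean is meaningful for a quantitative column
mean_age = sample_df['age'].mean()
print(mean_age)

# For a categorical column we can still ask for the most common value
most_common_status = sample_df['marital_status'].mode()[0]
print(most_common_status)
```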

In [3]:
nyc_trees = pd.read_csv('../Datasets/nyc_tree_census.csv')
print(nyc_trees.head(3))

   tree_id  trunk_diam status health   spc_common       neighborhood
0   199250           8  Alive   Good   crab apple     Lincoln Square
1   136891          17  Alive   Good  honeylocust  East Harlem North
2   200218           3  Alive   Good       ginkgo          Chinatown


In [4]:
nyc_trees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   tree_id       50000 non-null  int64 
 1   trunk_diam    50000 non-null  int64 
 2   status        50000 non-null  object
 3   health        47695 non-null  object
 4   spc_common    47695 non-null  object
 5   neighborhood  50000 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.3+ MB


#### Nominal Categories
Depending on the data, some of the summary statistics we use for quantitative data can still be meaningful for categorical data. Let’s first consider a nominal categorical variable. A nominal categorical variable is a categorical variable with no intrinsic ordering to the categories. Examples from the census dataset introduced above include `marital_status` and `race`.

Because these variables’ categories have no ordering or numeric equivalents, it’s impossible to calculate a mean or median. It is likewise impossible to describe spread with statistics like variance, standard deviation, range, IQR, or percentiles, because these statistics all rely on numeric values or at least on being able to order the data in some way. However, it is still possible to calculate the mode, the most common value in the dataset.

We can do this in Python using the `.value_counts()` method. It counts the occurrences of each value in a column and returns the result as a pandas Series. By default, `.value_counts()` sorts categories in descending order of frequency (and leaves out missing values), so the top row in the output is the mode.



In [5]:
# Get tree counts by neighborhood
tree_counts = nyc_trees['neighborhood'].value_counts()
print(tree_counts, '\n')

# Get neighborhoods with most trees
greenest_neighborhood = tree_counts.index[0]
print('Greenest neighborhood is {}'.format(greenest_neighborhood))

Annadale-Huguenot-Prince's Bay-Eltingville    950
Great Kills                                   761
East New York                                 702
Bayside-Bayside Hills                         665
Rossville-Woodrow                             633
                                             ... 
72                                              1
MN50                                            1
40                                              1
69                                              1
65                                              1
Name: neighborhood, Length: 442, dtype: int64 

Greenest neighborhood is Annadale-Huguenot-Prince's Bay-Eltingville
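
One caveat with taking `tree_counts.index[0]`: if two categories are tied for the highest count, only one of them is reported. As a small sketch on made-up data (the `species` values below are invented for illustration), `Series.mode()` returns every tied category:

```python
import pandas as pd

# Made-up data with a tie between 'ginkgo' and 'honeylocust'
species = pd.Series(['ginkgo', 'honeylocust', 'ginkgo', 'honeylocust', 'crab apple'])

counts = species.value_counts()
top_by_counts = counts.index[0]      # a single label, even though there is a tie

all_modes = species.mode().tolist()  # every category sharing the highest count
print(top_by_counts)
print(all_modes)
```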


In [6]:
#Finding the unique health statuses of the trees
tree_health_statuses = nyc_trees['health'].unique()
print(tree_health_statuses)

['Good' 'Poor' 'Fair' nan]


In [7]:
# Convert health to an ordered categorical with Poor < Fair < Good
health_categories = ['Poor', 'Fair', 'Good']
nyc_trees['health'] = pd.Categorical(nyc_trees['health'], categories=health_categories, ordered=True)

In [8]:
# Find the median health status, ignoring missing values (pandas codes them as -1)
health_codes = nyc_trees['health'].cat.codes
median_index = np.median(health_codes[health_codes >= 0])
median_health_status = health_categories[int(median_index)]
print(median_health_status)

Good
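
The median above relies on the integer codes that pandas assigns to an ordered categorical. One detail worth keeping in mind is that missing values receive the code `-1`, so they should be excluded before taking a median. A minimal sketch on made-up data (the `health` values here are invented, not taken from the tree census):

```python
import numpy as np
import pandas as pd

health_levels = ['Poor', 'Fair', 'Good']
# Made-up sample with one missing value
health = pd.Series(
    pd.Categorical(['Good', 'Fair', None, 'Poor', 'Good', 'Fair', 'Good', 'Good'],
                   categories=health_levels, ordered=True)
)

codes = health.cat.codes            # Poor=0, Fair=1, Good=2, missing=-1
valid_codes = codes[codes >= 0]     # drop the missing-value code

median_code = int(np.median(valid_codes))
median_label = health_levels[median_code]
print(median_label)
```

With an even number of valid observations the median can land halfway between two codes, in which case `int()` truncates toward the lower category.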


In [9]:
nyc_trees2 = pd.read_csv('../Datasets/nyc_tree_census2.csv')

This dataset contains two variables related to trunk size. The first variable, `trunk_diam`, contains the diameter of the trunk (in inches) for each tree. The variable `tree_diam_category`, on the other hand, categorizes each tree based on the size of the trunk. The categories are: 'Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)', 'Large (18-24in)', and 'Very large (>24in)'. You’ll notice that these categories are not evenly spaced with respect to diameter: 'Small' spans 3 inches, for example, while 'Medium-Large' spans 8.

In [10]:
nyc_trees2.tree_diam_category = pd.Categorical(nyc_trees2.tree_diam_category, ['Small (0-3in)', 
                                                                             'Medium (3-10in)', 
                                                                             'Medium-Large (10-18in)', 
                                                                             'Large (18-24in)',
                                                                             'Very large (>24in)'], ordered=True)


In [11]:
# Mean of the quantitative diameter variable, `trunk_diam` (in inches)
mean_diam = np.mean(nyc_trees2['trunk_diam'])
print(mean_diam)

# Mean of the integer codes of `tree_diam_category`; the categories are
# not evenly spaced in inches, so interpret this value with care
mean_diam_cat = np.mean(nyc_trees2['tree_diam_category'].cat.codes)
print(mean_diam_cat)

11.27048
1.97282
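
To see why a mean of category codes should be read with care, here is a small sketch on made-up diameters. It assumes the categories can be derived with `pd.cut` using the bin edges implied by the labels; the notebook does not show how `tree_diam_category` was built, so the binning is an assumption.

```python
import numpy as np
import pandas as pd

size_labels = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)',
               'Large (18-24in)', 'Very large (>24in)']
# Assumed bin edges, inferred from the labels
bin_edges = [0, 3, 10, 18, 24, np.inf]

# Made-up diameters: two small trees and one very large tree
diams = pd.Series([2, 2, 40])
cats = pd.cut(diams, bins=bin_edges, labels=size_labels, include_lowest=True)

# Category that contains the mean diameter
mean_diam = diams.mean()
cat_of_mean = pd.cut([mean_diam], bins=bin_edges, labels=size_labels)[0]

# Category suggested by rounding the mean of the integer codes
mean_code = cats.cat.codes.mean()
cat_from_codes = size_labels[int(round(mean_code))]

print(cat_of_mean)
print(cat_from_codes)
```

In this toy sample the two approaches point to different categories, because the codes treat every step between categories as equal while the underlying diameters do not.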


In [12]:
size_labels_ordered = ['Small (0-3in)', 'Medium (3-10in)', 
                       'Medium-Large (10-18in)', 'Large (18-24in)','Very large (>24in)']

nyc_trees2.tree_diam_category = pd.Categorical(nyc_trees2.tree_diam_category, size_labels_ordered, ordered=True)


In [13]:
# Calculate 25th Percentile Category
p25_ind = np.percentile(nyc_trees2['tree_diam_category'].cat.codes, 25)
p25_tree_diam_category = size_labels_ordered[int(p25_ind)]
print(p25_tree_diam_category)

Medium (3-10in)


In [14]:
# Calculate 75th Percentile Category

p75_ind = np.percentile(nyc_trees2['tree_diam_category'].cat.codes, 75)
p75_tree_diam_category = size_labels_ordered[int(p75_ind)]
print(p75_tree_diam_category)

Large (18-24in)
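
Because `np.percentile` interpolates linearly by default, it can return a value that falls between two codes, which `int()` then truncates. As an alternative sketch on made-up data, `Series.quantile` with `interpolation='lower'` always returns one of the observed codes. The 25th and 75th percentile categories can then be reported as a pair, in place of a numeric IQR.

```python
import pandas as pd

size_labels = ['Small (0-3in)', 'Medium (3-10in)', 'Medium-Large (10-18in)',
               'Large (18-24in)', 'Very large (>24in)']

# Made-up sample of eight trees
sizes = pd.Series(pd.Categorical(
    ['Small (0-3in)', 'Medium (3-10in)', 'Medium (3-10in)',
     'Medium-Large (10-18in)', 'Medium-Large (10-18in)',
     'Large (18-24in)', 'Large (18-24in)', 'Very large (>24in)'],
    categories=size_labels, ordered=True))

# Work with plain integer codes; 'lower' picks an observed code, never a fraction
codes = sizes.cat.codes.astype(int)
p25_code = int(codes.quantile(0.25, interpolation='lower'))
p75_code = int(codes.quantile(0.75, interpolation='lower'))

p25_label = size_labels[p25_code]
p75_label = size_labels[p75_code]
print(p25_label, 'to', p75_label)
```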


In [16]:
# Proportion of trees in each status category
tree_status_proportions = nyc_trees2['status'].value_counts(normalize = True)
print(tree_status_proportions)

Alive    0.9539
Stump    0.0267
Dead     0.0194
Name: status, dtype: float64
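
The `normalize=True` option divides each count by the number of non-missing values. A small sketch on made-up statuses shows the equivalence, along with `dropna=False` for cases where missing values should count toward the total:

```python
import pandas as pd

# Made-up statuses with one missing value
status = pd.Series(['Alive', 'Alive', 'Alive', 'Stump', 'Dead', None])

# Proportions among non-missing values
proportions = status.value_counts(normalize=True)

# The same result computed by hand
manual = status.value_counts() / status.notna().sum()

# Include missing values in the denominator
with_missing = status.value_counts(normalize=True, dropna=False)

print(proportions)
print(with_missing)
```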
