# Categorical Manipulation


A common form of data is textual data, and categorical data is a subset of it: textual data whose values repeat. Categories are labels that describe data.

Values usually repeat, and when the categories have an intrinsic order they are called ordinal values. One example is shirt sizes: small, medium, and large. Unordered values, such as colors, are called nominal values. In addition, you can convert numerical data to categories by binning it.
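As a minimal sketch of binning, the example below uses `pd.cut` to turn numbers into ordered categories. The `mpg` values, the bin edges, and the labels are all assumptions made up for illustration; they are not taken from the vehicles dataset.

```python
import pandas as pd

# Hypothetical fuel-economy readings (illustrative values only).
mpg = pd.Series([12, 18, 25, 31, 45], name="city_mpg")

# Bin edges and labels are assumptions chosen for this example.
# With the default right=True, the intervals are (0, 15], (15, 30], (30, 100].
mpg_bins = pd.cut(
    mpg,
    bins=[0, 15, 30, 100],
    labels=["low", "medium", "high"],
)

print(mpg_bins.tolist())
# pd.cut returns an ordered categorical by default when labels are given.
print(mpg_bins.cat.ordered)
```

Because the result is ordered, the bins behave like ordinal values: they sort from `low` to `high` rather than alphabetically.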

In [1]:
import pandas as pd

In [3]:
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'

auto_df = pd.read_csv(url)

make = auto_df['make']

make


0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: object

## Frequency Counts

Use the `.value_counts` method to see how often each value occurs; the length of the result is the cardinality (the number of distinct values). How often values repeat determines whether the column is categorical. If every value is unique, or the column is free-form text, it is not categorical.

In [4]:
make.value_counts()

Chevrolet                      4003
Ford                           3371
Dodge                          2583
GMC                            2494
Toyota                         2071
                               ... 
Volga Associated Automobile       1
Panos                             1
Mahindra                          1
Excalibur Autos                   1
London Coach Co Inc               1
Name: make, Length: 136, dtype: int64

> Inspect the size and the number of unique items to infer cardinality.

In [5]:
make.shape, make.nunique()

((41144,), 136)
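One rough way to turn that inspection into a number is the ratio of unique values to total values. The helper `unique_ratio` and the two small series below are hypothetical, and deciding what ratio counts as "categorical" remains a judgment call.

```python
import pandas as pd

def unique_ratio(ser):
    """Return the fraction of values that are distinct (1.0 means all unique)."""
    return ser.nunique() / len(ser)

# Hypothetical data: repeated labels versus unique identifiers.
colors = pd.Series(["red", "blue", "red", "green", "blue", "red"])
ids = pd.Series(["a1", "b2", "c3", "d4", "e5", "f6"])

color_ratio = unique_ratio(colors)  # few distinct values: a category candidate
id_ratio = unique_ratio(ids)        # every value unique: not categorical

print(color_ratio, id_ratio)
```

For the `make` column above, the same ratio would be 136 / 41144, which is well under one percent.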

## Benefits of Categories

- Less memory usage
- Faster computations

In [6]:
cat_make = make.astype('category')

cat_make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [8]:
print(f"The memory usage of string column types is {make.memory_usage(deep=True)}")
print(f"The memory usage of categorical column types is {cat_make.memory_usage(deep=True)}")

The memory usage of string column types is 2606395
The memory usage of categorical column types is 95888
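The saving comes from how a categorical is stored: each distinct label is kept once, and the column itself holds small integer codes. The sketch below reproduces the comparison on synthetic data (the repeated labels are an assumption, so no download is needed); exact byte counts will differ by pandas version.

```python
import pandas as pd

# Synthetic column: three labels repeated many times (illustrative only).
labels = pd.Series(["Chevrolet", "Ford", "Dodge"] * 10_000)
as_category = labels.astype("category")

string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"strings:  {string_bytes}")
print(f"category: {category_bytes}")

# The per-row storage is an integer code that points into the category list.
print(as_category.cat.codes.dtype)
print(as_category.cat.categories.tolist())
```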


In [9]:
# Categorical data types have access to the .str attribute

cat_make.str.upper()

0        ALFA ROMEO
1           FERRARI
2             DODGE
3             DODGE
4            SUBARU
            ...    
41139        SUBARU
41140        SUBARU
41141        SUBARU
41142        SUBARU
41143        SUBARU
Name: make, Length: 41144, dtype: object
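Note that the output above is no longer a categorical. A small sketch on a hypothetical series shows the round trip: the `.str` result loses the category dtype, and `.astype('category')` restores it if it is still wanted.

```python
import pandas as pd

# Hypothetical stand-in for the make column.
makes = pd.Series(["Dodge", "Subaru", "Dodge"], dtype="category")

upper = makes.str.upper()             # string result, not a categorical
upper_cat = upper.astype("category")  # convert back when needed

print(upper.dtype)
print(upper_cat.dtype)
```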

## Conversion to Ordinal Categories

You can make an ordinal categorical from the series.

In [11]:
# Create an ordered series of make types

make_types = pd.CategoricalDtype(categories=sorted(make.unique()), ordered=True)

make_types

CategoricalDtype(categories=['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo',
                  'American Motors Corporation', 'Aston Martin', 'Audi',
                  'Aurora Cars Ltd', 'Autokraft Limited',
                  'Avanti Motor Corporation',
                  ...
                  'Toyota', 'VPG', 'Vector', 'Vixen Motor Company',
                  'Volga Associated Automobile', 'Volkswagen', 'Volvo',
                  'Wallace Environmental', 'Yugo', 'smart'],
, ordered=True)

In [14]:
# Assign the new ordered dtype to the make series

ordered_make = make.astype(make_types)

ordered_make

0        Alfa Romeo
1           Ferrari
2             Dodge
3             Dodge
4            Subaru
            ...    
41139        Subaru
41140        Subaru
41141        Subaru
41142        Subaru
41143        Subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General' < 'ASC Incorporated' < 'Acura' < 'Alfa Romeo' ... 'Volvo' < 'Wallace Environmental' < 'Yugo' < 'smart']

In [15]:
# Now, we can sort the makes

ordered_make.sort_values()

20288    AM General
20289    AM General
369      AM General
358      AM General
19314    AM General
            ...    
31289         smart
31290         smart
29605         smart
22974         smart
26882         smart
Name: make, Length: 41144, dtype: category
Categories (136, object): ['AM General' < 'ASC Incorporated' < 'Acura' < 'Alfa Romeo' ... 'Volvo' < 'Wallace Environmental' < 'Yugo' < 'smart']
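Alphabetical order is only one possible ordering. The sketch below uses the shirt-size example from the introduction, where the order carries meaning; the `sizes` data is hypothetical. With an ordered dtype, sorting and comparisons follow the declared order rather than the alphabet.

```python
import pandas as pd

# Declare the order explicitly: small < medium < large.
size_type = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)

# Hypothetical shirt sizes.
sizes = pd.Series(["medium", "small", "large", "small"]).astype(size_type)

sorted_sizes = sizes.sort_values()   # follows the declared order
at_least_medium = sizes >= "medium"  # comparisons respect the order too
largest = sizes.max()

print(sorted_sizes.tolist())
print(at_least_medium.tolist())
print(largest)
```

Sorting the same values as plain strings would put `large` first, which is why the explicit ordering matters for ordinal data.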

## The .cat Accessor

If you need to rename the categories, use the `.rename_categories` method. Pass either a list with the same length as the current categories or a dictionary mapping old values to new values.

In [18]:
cat_make.cat.rename_categories(
    [c.lower() for c in cat_make.cat.categories]
)

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['am general', 'asc incorporated', 'acura', 'alfa romeo', ..., 'volvo', 'wallace environmental', 'yugo', 'smart']

In [17]:
ordered_make.cat.rename_categories({c: c.lower() for c in ordered_make.cat.categories})

0        alfa romeo
1           ferrari
2             dodge
3             dodge
4            subaru
            ...    
41139        subaru
41140        subaru
41141        subaru
41142        subaru
41143        subaru
Name: make, Length: 41144, dtype: category
Categories (136, object): ['am general' < 'asc incorporated' < 'acura' < 'alfa romeo' ... 'volvo' < 'wallace environmental' < 'yugo' < 'smart']
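The dictionary form does not have to cover every category. As a small sketch on a hypothetical series, the mapping below renames a single label and leaves the rest untouched.

```python
import pandas as pd

# Hypothetical stand-in for the make column.
makes = pd.Series(["GMC", "Ford", "GMC", "smart"], dtype="category")

# Only 'smart' is mapped; categories missing from the dict pass through.
renamed = makes.cat.rename_categories({"smart": "Smart"})

print(renamed.cat.categories.tolist())
print(renamed.tolist())
```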

## Things to Note About Categories

Applying the `.value_counts` method or `.groupby` to categorical data uses all of the categories, even those with no values.

In [19]:
ordered_make.iloc[:100].value_counts()

Dodge                        17
Oldsmobile                    8
Ford                          8
Buick                         7
Chevrolet                     5
                             ..
Grumman Allied Industries     0
Goldacre                      0
Geo                           0
Genesis                       0
smart                         0
Name: make, Length: 136, dtype: int64

In [21]:
(
    cat_make
    # Select the first 100
    .iloc[:100]
    # Group the data
    .groupby(cat_make.iloc[:100])
    # Return the first
    .first()
)

make
AM General                            NaN
ASC Incorporated                      NaN
Acura                                 NaN
Alfa Romeo                     Alfa Romeo
American Motors Corporation           NaN
                                  ...    
Volkswagen                     Volkswagen
Volvo                               Volvo
Wallace Environmental                 NaN
Yugo                                  NaN
smart                                 NaN
Name: make, Length: 136, dtype: category
Categories (136, object): ['AM General', 'ASC Incorporated', 'Acura', 'Alfa Romeo', ..., 'Volvo', 'Wallace Environmental', 'Yugo', 'smart']

In [22]:
# However, using the string version returns only the makes that are present

(
    make
    # Select the first 100
    .iloc[:100]
    # Group the data
    .groupby(make.iloc[:100])
    # Return the first
    .first()
)

make
Alfa Romeo          Alfa Romeo
Audi                      Audi
BMW                        BMW
Buick                    Buick
CX Automotive    CX Automotive
Cadillac              Cadillac
Chevrolet            Chevrolet
Chrysler              Chrysler
Dodge                    Dodge
Ferrari                Ferrari
Ford                      Ford
Hyundai                Hyundai
Infiniti              Infiniti
Lexus                    Lexus
Mazda                    Mazda
Mercury                Mercury
Nissan                  Nissan
Oldsmobile          Oldsmobile
Plymouth              Plymouth
Pontiac                Pontiac
Rolls-Royce        Rolls-Royce
Subaru                  Subaru
Toyota                  Toyota
Volkswagen          Volkswagen
Volvo                    Volvo
Name: make, dtype: object
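You do not have to fall back to strings to avoid the empty categories. Two options, sketched below on a hypothetical series, are dropping unused categories with `.cat.remove_unused_categories` and passing `observed=True` to `.groupby`. The default for `observed` has varied across pandas versions, so setting it explicitly is the safer choice.

```python
import pandas as pd

# Hypothetical stand-in for the make column.
makes = pd.Series(["Dodge", "Ford", "Dodge", "Subaru"], dtype="category")
subset = makes.iloc[:3]  # 'Subaru' is still a category but has no rows here

# value_counts reports every category, including the empty one.
all_counts = subset.value_counts()

# Option 1: drop categories that no longer appear in the data.
trimmed_counts = subset.cat.remove_unused_categories().value_counts()

# Option 2: ask groupby to use only the categories that are present.
observed_sizes = subset.groupby(subset, observed=True).size()

print(all_counts)
print(trimmed_counts)
print(observed_sizes)
```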

## Generalization

In [24]:
# Generalization

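def generalize_topn(ser, n, other='Other'):
    # Keep the n most frequent values; replace everything else with `other`
    topn = ser.value_counts().index[:n]
    if isinstance(ser.dtype, pd.CategoricalDtype):
        # A categorical must list `other` as a category before it can hold it
        ser = ser.cat.set_categories(
            topn.set_categories(list(topn) + [other]))
    return ser.where(ser.isin(topn), other)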

# Call the function
cat_make.pipe(generalize_topn, n=20, other='NA')
        

0            NA
1            NA
2         Dodge
3         Dodge
4        Subaru
          ...  
41139    Subaru
41140    Subaru
41141    Subaru
41142    Subaru
41143    Subaru
Name: make, Length: 41144, dtype: category
Categories (21, object): ['Chevrolet', 'Ford', 'Dodge', 'GMC', ..., 'Volvo', 'Hyundai', 'Chrysler', 'NA']
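For a plain string series the categorical branch is skipped, so the idea reduces to two lines. The sketch below is a simplified variant for strings only; the name `generalize_topn_str` and the `colors` data are hypothetical.

```python
import pandas as pd

def generalize_topn_str(ser, n, other="Other"):
    """Keep the n most frequent values and replace the rest with `other`."""
    topn = ser.value_counts().index[:n]
    return ser.where(ser.isin(topn), other)

# Hypothetical data: two frequent colors and two rare ones.
colors = pd.Series(["red", "red", "red", "blue", "blue", "green", "teal"])
result = generalize_topn_str(colors, n=2)

print(result.tolist())
```

Collapsing rare values this way reduces cardinality, which keeps later `.value_counts` or `.groupby` output short.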