# Coercing to Categorical

### Introduction

In the last lesson, we saw how we can discover boolean values by looking for a large percentage of repeated values in a feature.  In this lesson, we'll continue with our Airbnb dataset, and tackle the categorical features.

Let's keep going.

### Loading our Airbnb Data

We'll start by loading the data where we last left off.

In [39]:
import pandas as pd
df = pd.read_csv('./coerced_bools_complete.csv', index_col = 0)

And let's load our datatypes and set them in our dataframe.

In [40]:
import json

file = './coerced_bools_complete_dtypes.json'
with open(file, 'r') as f:
    dtypes = json.load(f)

In [41]:
df = df.astype(dtypes)

Now to see what work we have left, we can select our object columns.

In [42]:
object_df = df.select_dtypes('object')

### Loading our Library

Let's load up our methods for discovering categorical variables.  Remember that our technique is to use `value_counts` to see the top values in each column.

In [43]:
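def get_multiple_val_counts(df, num_vals = 1):
    # rename keeps the column name on each result; pandas 2.0+ would otherwise name it 'proportion'
    return [df[column].value_counts(normalize=True).iloc[:num_vals].rename(column) for column in df.columns]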

Then, in `summarize_cats`, we sort the columns by the percentage of each column that its top value makes up.

In [44]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([val_count.name, val_count.index[0], float(val_count.values[0])]) for val_count in multiple_val_counts])
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]
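To make the ranking idea concrete, here is a minimal sketch on a made-up dataframe. The helper `top_value_share` and the toy data are assumptions for illustration, not part of the lesson's library; it returns plain tuples rather than the NumPy array that `summarize_cats` builds.

```python
import pandas as pd

# Toy data (hypothetical): one constant column, one dominated column, one unique column
toy = pd.DataFrame({
    'room_type': ['Private', 'Private', 'Private', 'Entire'],
    'name': ['a', 'b', 'c', 'd'],
    'bed_type': ['Real Bed'] * 4,
})

def top_value_share(df):
    # For each column, record (column name, most common value, share of rows it covers)
    rows = []
    for column in df.columns:
        counts = df[column].value_counts(normalize=True)
        rows.append((column, counts.index[0], float(counts.iloc[0])))
    # Highest share first: these columns are the most likely categorical candidates
    return sorted(rows, key=lambda row: row[2], reverse=True)

summary = top_value_share(toy)
```

Columns such as `name`, where every value is unique, fall to the bottom of the ranking.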

Overall, this approach once again appears to do a good job of identifying our categorical columns.

### Selecting Columns

Let's use `summarize_cats` to select the names of our columns that are likely categorical. 

In [45]:
cols = summarize_cats(object_df)[:16][:, 0]

In [46]:
cols

array(['street', 'bed_type', 'property_type', 'host_is_superhost',
       'host_location', 'room_type', 'host_identity_verified',
       'host_response_time', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications',
       'neighbourhood', 'host_neighbourhood', 'calendar_updated',
       'neighbourhood_cleansed', 'zipcode'], dtype='<U1000')

Then we can use the `selected_cat_values` method to take a deeper look at the values in each of the columns.

In [47]:
def selected_cat_values(column, threshold = .02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]
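As a quick illustration of the threshold, the sketch below runs the same function on a made-up column of 100 values. The toy series is an assumption for illustration only.

```python
import pandas as pd

def selected_cat_values(column, threshold=.02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]

# Toy column (hypothetical): 95% 'Real Bed', 4% 'Futon', 1% 'Airbed'
toy = pd.Series(['Real Bed'] * 95 + ['Futon'] * 4 + ['Airbed'])

# Only values covering more than 2% of rows survive, so 'Airbed' is dropped
common = selected_cat_values(toy)
```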

> Uncomment the cell below to inspect the values, then comment it out again.

In [48]:
# [selected_cat_values(df[col]) for col in cols]

### Coercing our Categorical Columns

Once we feel that we have selected our categorical columns, we can replace the sparse values in each column with `other` using the `reduce_cat_values` method.

In [49]:
def reduce_cat_values(column, threshold = .02):
    column = column.copy()
    selected_values = selected_cat_values(column, threshold).index
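    # values outside the selected set (including missing values) become 'other'
    column[~column.isin(selected_values)] = 'other'
    # astype returns a new Series, so return its result rather than discarding it
    return column.astype('category')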
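One side effect of this approach is worth seeing on a small example: `value_counts` skips missing values by default, so they are never among the selected values and end up labeled `other` too. The sketch below uses a hypothetical helper, `reduce_rare_values`, and made-up data to show this; it is an illustration of the technique, not the lesson's exact function.

```python
import numpy as np
import pandas as pd

def reduce_rare_values(column, threshold=.02):  # hypothetical name, for illustration
    column = column.copy()
    # value_counts drops NaN by default, so NaN never appears in `keep`
    shares = column.value_counts(normalize=True)
    keep = shares[shares > threshold].index
    # isin returns False for NaN, so missing values are relabeled along with rare ones
    column[~column.isin(keep)] = 'other'
    return column.astype('category')

# Toy column (hypothetical): eight apartments, one loft, one missing value
toy = pd.Series(['Apartment'] * 8 + ['Loft', np.nan])
reduced = reduce_rare_values(toy, threshold=.2)
```

Here both `Loft` and the missing value become `other`, and the result contains no na values.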

In [50]:
cat_cols = ['street', 'bed_type', 'property_type', 'host_is_superhost',
       'host_location', 'room_type', 'host_identity_verified',
       'host_response_time', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications',
       'neighbourhood', 'host_neighbourhood', 'calendar_updated',
       'neighbourhood_cleansed', 'zipcode']

In [51]:
remaining_cat_df = object_df[cat_cols]

> Below we remove all values that comprise less than 1 percent of the data.

In [52]:
reduced_df = remaining_cat_df.apply(lambda col: reduce_cat_values(col, .01))

Then let's take a look at the changes we made.

In [53]:
val_counts_grid = [reduced_df[col].value_counts(normalize = True)[:5] for col in reduced_df.columns]

In [54]:
val_counts_grid[:3]

[Berlin, Berlin, Germany    0.989125
 other                      0.010875
 Name: street, dtype: float64,
 Real Bed         0.935250
 Pull-out Sofa    0.034375
 Futon            0.022125
 other            0.008250
 Name: bed_type, dtype: float64,
 Apartment      0.899000
 Condominium    0.029375
 other          0.028750
 Loft           0.022625
 House          0.020250
 Name: property_type, dtype: float64]

### Integrating our Mapper

Ok, it's now time to apply one hot encoding to our categorical features.  Before we do, let's check whether any of our categorical columns still have missing values.

In [55]:
reduced_df.isna().sum()

street                          0
bed_type                        0
property_type                   0
host_is_superhost               0
host_location                   0
room_type                       0
host_identity_verified          0
host_response_time              0
cancellation_policy             0
neighbourhood_group_cleansed    0
host_verifications              0
neighbourhood                   0
host_neighbourhood              0
calendar_updated                0
neighbourhood_cleansed          0
zipcode                         0
dtype: int64

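The counts are all zero because `reduce_cat_values` labeled every value outside the selected set, missing values included, as `other`.  In general, to apply one hot encoding, we first have to coerce our na values to a string such as `na`, and then we can apply the encoder.

> Write a mapper that does both.  Here, with no missing values left, the encoder alone is enough.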

In [56]:
cat_cols = reduced_df.columns

In [57]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
steps = [([col], [OneHotEncoder()]) for col in cat_cols]

In [58]:
from sklearn_pandas import DataFrameMapper

In [59]:
mapper = DataFrameMapper(steps, df_out = True)

In [60]:
coerced_cat_df = mapper.fit_transform(reduced_df)

In [61]:
coerced_cat_df.shape

(8000, 180)
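If a column did still contain missing values, the imputing step could be chained in front of the encoder. Below is a minimal sketch using scikit-learn's `make_pipeline` on a made-up column; the toy data and the fill value `'na'` are assumptions, and this is not the lesson's mapper. With `DataFrameMapper`, the analogous step would presumably be a list holding both transformers for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy column (hypothetical) with one missing value
toy = pd.DataFrame({'room_type': ['Private room', np.nan, 'Entire home/apt', 'Private room']})

pipeline = make_pipeline(
    # Step 1: replace missing values with the string 'na'
    SimpleImputer(strategy='constant', fill_value='na'),
    # Step 2: one hot encode, so 'na' becomes its own indicator column
    OneHotEncoder(handle_unknown='ignore'),
)

# OneHotEncoder returns a sparse matrix by default, so convert it to a dense array
encoded = pipeline.fit_transform(toy[['room_type']]).toarray()
categories = list(pipeline.named_steps['onehotencoder'].categories_[0])
```

Each row has exactly one indicator set, and the row that was missing is flagged in the `na` column.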

### Aggregating our Data

At this point, we can reload our original dataframe and its datatypes.

In [62]:
import pandas as pd
df = pd.read_csv('./coerced_bools_complete.csv', index_col = 0)

In [63]:
import json

file = './coerced_bools_complete_dtypes.json'
with open(file, 'r') as f:
    dtypes = json.load(f)

In [64]:
df = df.astype(dtypes)

In [65]:
df.shape

(8000, 83)

Next, we drop the original categorical columns and add in the columns from our `coerced_cat_df`.

In [66]:
df_dropped_cats = df.drop(columns = cat_cols)

In [67]:
df_dropped_cats[coerced_cat_df.columns] = coerced_cat_df
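An alternative way to combine the two dataframes is `pd.concat` along the column axis, which makes the row alignment explicit. This is a sketch on made-up data (the frames and column names are assumptions), not a replacement for the cell above.

```python
import pandas as pd

# Toy frames (hypothetical): an original frame and its one hot encoded categorical column
original = pd.DataFrame({'price': [60, 17, 90],
                         'room_type': ['Private', 'Entire', 'Private']})
encoded = pd.DataFrame({'room_type_Entire': [0.0, 1.0, 0.0],
                        'room_type_Private': [1.0, 0.0, 1.0]},
                       index=original.index)

# concat aligns rows on the index, so confirm both frames share the same index first
if not original.index.equals(encoded.index):
    raise ValueError('indexes differ; rows would be misaligned')

combined = pd.concat([original.drop(columns=['room_type']), encoded], axis=1)
```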

Let's take a look at what object columns we have left.

In [68]:
remaining_object_df = df_dropped_cats.select_dtypes('object')

In [69]:
remaining_object_df.shape

(8000, 19)

In [70]:
remaining_object_df[:2].T

| | 0 | 1 |
|---|---|---|
| listing_url | https://www.airbnb.com/rooms/2015 | https://www.airbnb.com/rooms/2695 |
| name | Berlin-Mitte Value! Quiet courtyard/very central | Prenzlauer Berg close to Mauerpark |
| summary | Great location! 30 of 75 sq meters. This wood... | |
| space | A+++ location! This „Einliegerwohnung“ is an e... | In the summertime we are spending most of our ... |
| description | Great location! 30 of 75 sq meters. This wood... | In the summertime we are spending most of our ... |
| neighborhood_overview | It is located in the former East Berlin area o... | |
| notes | This is my home, not a hotel. I rent out occas... | |
| transit | Close to U-Bahn U8 and U2 (metro), Trams M12, ... | Within walking distance you'll find the S-Bahn... |
| access | Simple kitchen/cooking, refrigerator, microwav... | Außer deinem Zimmer kannst du noch die Küche u... |
| interaction | Always available | |


It looks like these are not categorical (with the possible exception of amenities and licence), so let's export our data.

In [174]:
df_dropped_cats.to_csv('./coerced_cats.csv')

In [175]:
dtypes = df_dropped_cats.dtypes.astype(str).to_dict()

In [176]:
import json
with open('./dtypes_coerced_cats.json', 'w') as f:
    json.dump(dtypes, f)
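To check that an export like this can be read back with the same datatypes, a round trip along these lines may help. It is a sketch on a made-up dataframe written to a temporary directory; the file names and columns are assumptions, and only numeric and boolean columns are used to keep the comparison simple.

```python
import json
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical) with float, integer, and boolean columns
toy = pd.DataFrame({'price': [60.0, 17.5], 'beds': [1, 2], 'is_superhost': [True, False]})
dtypes = toy.dtypes.astype(str).to_dict()

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy.csv')
    json_path = os.path.join(tmp_dir, 'toy_dtypes.json')

    # Export the data and its datatypes side by side
    toy.to_csv(csv_path)
    with open(json_path, 'w') as f:
        json.dump(dtypes, f)

    # Reload both and reapply the saved datatypes
    with open(json_path, 'r') as f:
        loaded_dtypes = json.load(f)
    restored = pd.read_csv(csv_path, index_col=0).astype(loaded_dtypes)
```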

### Summary 

In this lesson, we coerced our categorical data by first identifying our categorical features with the `summarize_cats` method.  We then used `selected_cat_values` to take a look at the common values in each of these columns.

Finally, we moved on to coercing our categorical columns.  We made three coercions in all.  First, we replaced sparse values with `other`.  That same step also replaced our na values with `other`.  Then we applied one hot encoding with a `DataFrameMapper`.