# Coercing to Categorical

### Introduction

### Loading our AirBnb Data

Our plan is to go after the low-hanging fruit first:

1. Numeric
    * price and percent
2. Datetimes
3. Booleans
4. Categorical
5. Datetimes -> add_datepart (weekday)
6. Get rid of our nas -> impute, is_na column

We'll start by loading the data where we last left off.

In [11]:
import pandas as pd
df = pd.read_csv('./coerced_bools_complete.csv')

And let's load our datatypes and set them in our dataframe.

In [12]:
import requests 
url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/coerced_bools_complete_dtypes.json"
response = requests.get(url)
dtypes = response.json()

# df.to_feather('./whatever.feather')

In [13]:
df = df.astype(dtypes)

In [14]:
object_df = df.select_dtypes('object')

In [16]:
# object_df

### Loading our Library

In [18]:
def get_multiple_val_counts(df, num_vals = 1):
    return [df[column].value_counts(normalize=True).iloc[:num_vals] for column in df.columns]

In [19]:
import numpy as np
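def summarize_cats(df):
    # Top value and its share for each column, sorted from most to least dominant.
    multiple_val_counts = get_multiple_val_counts(df)
    # Take names from df.columns: since pandas 2.0, value_counts(normalize=True) names its result 'proportion'.
    stacked_counts = np.vstack([np.array([column, val_count.index[0], float(val_count.values[0])])
                                for column, val_count in zip(df.columns, multiple_val_counts)])
    sorted_cols = np.argsort(stacked_counts[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]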

In [20]:
summarize_cats(object_df)[:10]

array([['street', 'Berlin, Berlin, Germany', '0.989125'],
       ['bed_type', 'Real Bed', '0.93525'],
       ['property_type', 'Apartment', '0.899'],
       ['host_is_superhost', 'f', '0.8743265254980579'],
       ['host_location', 'Berlin, Berlin, Germany', '0.8262015309323629'],
       ['room_type', 'Entire home/apt', '0.54125'],
       ['host_identity_verified', 't', '0.5090840746773587'],
       ['host_response_time', 'within an hour', '0.45789473684210524'],
       ['cancellation_policy', 'flexible', '0.34125'],
       ['neighbourhood_group_cleansed', 'Friedrichshain-Kreuzberg',
        '0.255125']], dtype='<U1000')
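To make the output above easier to interpret, here is a minimal sketch of the same idea on toy data. The helper name `top_value_share` and the toy values are assumptions for illustration, and the sketch returns a dataframe rather than the string array used in the lesson.

```python
import pandas as pd

def top_value_share(df):
    """Hypothetical helper: most common value per column and its share of non-null rows."""
    rows = []
    for column in df.columns:
        # value_counts sorts descending and excludes NaN by default
        shares = df[column].value_counts(normalize=True)
        rows.append({"column": column, "top_value": shares.index[0], "share": shares.iloc[0]})
    return pd.DataFrame(rows).sort_values("share", ascending=False).reset_index(drop=True)

# Toy data (made up): one column dominated by a single value, one more evenly split
toy_df = pd.DataFrame({
    "bed_type": ["Real Bed"] * 9 + ["Futon"],
    "room_type": ["Entire home/apt"] * 6 + ["Private room"] * 4,
})
summary = top_value_share(toy_df)
print(summary)
```

A column whose top value covers almost every row (like `street` above) carries little information, while a column with a more even spread is a better candidate for one-hot encoding.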

### Selecting Columns

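Then we can use the `selected_cat_values` function to take a deeper look at the values in each of the columns. It returns the values that make up more than `threshold` (2% by default) of a column's non-null entries.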

In [22]:
def selected_cat_values(column, threshold = .02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]

In [31]:
# [selected_cat_values(object_df[col]) for col in object_df.columns][:20]
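As a quick illustration of the threshold, the sketch below applies `selected_cat_values` to a made-up column of 100 property types. The toy values are assumptions, not taken from the Airbnb data.

```python
import pandas as pd

def selected_cat_values(column, threshold=.02):
    values_counted = column.value_counts(normalize=True)
    # Strict inequality: a value sitting exactly at the threshold is not kept
    return values_counted[values_counted > threshold]

# Shares: Apartment 0.90, Loft 0.08, Boat 0.01, Castle 0.01
toy_column = pd.Series(["Apartment"] * 90 + ["Loft"] * 8 + ["Boat"] + ["Castle"])

kept = selected_cat_values(toy_column)                       # default 2% threshold
kept_strict = selected_cat_values(toy_column, threshold=.5)  # only dominant values
print(kept)
```

Raising the threshold shrinks the set of values that survive, which later means fewer one-hot columns.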

### Coercing our Categorical Columns

In [53]:
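def reduce_cat_values(column, threshold = .02):
    column = column.copy()
    selected_values = selected_cat_values(column, threshold).index
    # Values below the threshold (and missing values) become 'other'.
    column[~column.isin(selected_values)] = 'other'
    # astype returns a new Series, so we return its result.
    return column.astype('category')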

In [None]:
# FunctionTransformer()
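The sketch below shows the reduction step on a toy column, including what happens to missing values. The helper name `reduce_to_common_values` and the toy data are assumptions. It also assumes we want the result back as a `category` dtype, so it returns the output of `astype`, which produces a new Series rather than changing the column in place.

```python
import numpy as np
import pandas as pd

def reduce_to_common_values(column, threshold=.02, fill_value="other"):
    """Hypothetical helper: keep values above the threshold share, replace the rest."""
    shares = column.value_counts(normalize=True)   # NaN is excluded from the counts
    common_values = shares[shares > threshold].index
    # Rows that are rare or missing fail the isin check and get the fill value
    reduced = column.where(column.isin(common_values), fill_value)
    return reduced.astype("category")

# 60 flexible, 38 moderate, one rare value, and one missing value
toy_column = pd.Series(["flexible"] * 60 + ["moderate"] * 38 + ["strict"] + [np.nan])
reduced = reduce_to_common_values(toy_column)
print(reduced.value_counts())
```

Because missing values never appear among the common values, they end up in `other` alongside the rare values.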

In [36]:
cat_cols = summarize_cats(object_df)[:16, 0]
cat_cols

array(['street', 'bed_type', 'property_type', 'host_is_superhost',
       'host_location', 'room_type', 'host_identity_verified',
       'host_response_time', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications',
       'neighbourhood', 'host_neighbourhood', 'calendar_updated',
       'neighbourhood_cleansed', 'zipcode'], dtype='<U1000')

In [52]:
cat_df = object_df[cat_cols]
cat_df.shape

(8000, 16)

In [39]:
reduced_cat_df = cat_df.apply(lambda col: reduce_cat_values(col))

In [42]:
# [for col in reduced_cat_df.columns]

# get_multiple_val_counts(reduced_cat_df, num_vals = 3)

### Integrating our Mapper

In [43]:
reduced_cat_df.columns

Index(['street', 'bed_type', 'property_type', 'host_is_superhost',
       'host_location', 'room_type', 'host_identity_verified',
       'host_response_time', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications', 'neighbourhood',
       'host_neighbourhood', 'calendar_updated', 'neighbourhood_cleansed',
       'zipcode'],
      dtype='object')

In [47]:
from sklearn.preprocessing import OneHotEncoder
steps = [([col], OneHotEncoder()) for col in reduced_cat_df.columns]

In [48]:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper(steps, df_out = True)

In [49]:
transformed_cat = mapper.fit_transform(reduced_cat_df)

In [51]:
transformed_cat.shape

(8000, 123)
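The jump from 16 columns to 123 follows from one-hot encoding creating one output column per distinct value in each input column. As a rough sketch, the example below uses pandas' `get_dummies` as a stand-in for the `OneHotEncoder` and `DataFrameMapper` pipeline above; the toy data is made up.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "bed_type": ["Real Bed", "Real Bed", "other", "Real Bed"],
})

# One indicator column per (column, value) pair
one_hot = pd.get_dummies(toy_df)

# The encoded width equals the total number of distinct values across columns
expected_width = sum(toy_df[col].nunique() for col in toy_df.columns)
print(one_hot.shape, expected_width)
```

This is why reducing sparse values to `other` first matters: without it, every rare value would add its own column.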

In [60]:
coerced_cat_df = mapper.fit_transform(reduced_cat_df)

In [56]:
cat_cols = reduced_cat_df.columns

### Aggregating our Data

At this point, we can reload our original dataframe and its datatypes.

In [62]:
import pandas as pd
df = pd.read_csv('./coerced_bools_complete.csv', index_col = 0)

In [63]:
import json

file = './coerced_bools_complete_dtypes.json'
with open(file, 'r') as f:
    dtypes = json.load(f)

In [64]:
df = df.astype(dtypes)

In [65]:
df.shape

(8000, 83)

Then we drop the original categorical columns and add in the one-hot encoded columns from `coerced_cat_df`.

In [66]:
df_dropped_cats = df.drop(columns = cat_cols)

In [67]:
df_dropped_cats[coerced_cat_df.columns] = coerced_cat_df
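One thing worth checking at this step: when pandas assigns a Series as a column, it matches rows by index label, not by position. The toy sketch below (all names and values are made up) shows what happens when the two indexes do not line up.

```python
import pandas as pd

base = pd.DataFrame({"price": [50.0, 80.0, 120.0]}, index=[0, 1, 2])

encoded_same_index = pd.Series([1.0, 0.0, 1.0], index=[0, 1, 2])
encoded_shifted_index = pd.Series([1.0, 0.0, 1.0], index=[1, 2, 3])

# Assignment aligns on index labels
base["aligned"] = encoded_same_index
base["misaligned"] = encoded_shifted_index  # label 0 has no match, so it becomes NaN

print(base)
```

A quick `df_a.index.equals(df_b.index)` before combining two dataframes is a cheap way to confirm they share the same row labels.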

Let's take a look at what object columns we have left.

In [68]:
remaining_object_df = df_dropped_cats.select_dtypes('object')

In [69]:
remaining_object_df.shape

(8000, 19)

In [70]:
remaining_object_df[:2].T

| | 0 | 1 |
|---|---|---|
| listing_url | https://www.airbnb.com/rooms/2015 | https://www.airbnb.com/rooms/2695 |
| name | Berlin-Mitte Value! Quiet courtyard/very central | Prenzlauer Berg close to Mauerpark |
| summary | Great location! 30 of 75 sq meters. This wood... | |
| space | A+++ location! This „Einliegerwohnung“ is an e... | In the summertime we are spending most of our ... |
| description | Great location! 30 of 75 sq meters. This wood... | In the summertime we are spending most of our ... |
| neighborhood_overview | It is located in the former East Berlin area o... | |
| notes | This is my home, not a hotel. I rent out occas... | |
| transit | Close to U-Bahn U8 and U2 (metro), Trams M12, ... | Within walking distance you'll find the S-Bahn... |
| access | Simple kitchen/cooking, refrigerator, microwav... | Außer deinem Zimmer kannst du noch die Küche u... |
| interaction | Always available | |


It looks like these are not categorical (with the possible exception of amenities and licence), so let's export our data.

In [174]:
df_dropped_cats.to_csv('./coerced_cats.csv')

In [175]:
dtypes = df_dropped_cats.dtypes.astype(str).to_dict()

In [176]:
import json
with open('./dtypes_coerced_cats.json', 'w') as f:
    json.dump(dtypes, f)
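Saving the dtypes alongside the CSV matters because a CSV does not store column types, so they have to be reapplied after loading. Below is a minimal round-trip sketch using a toy dataframe and temporary files; the file names and data are assumptions for illustration.

```python
import json
import os
import tempfile

import pandas as pd

toy_df = pd.DataFrame({
    "price": [50.0, 80.0],
    "room_type": pd.Series(["Entire home/apt", "Private room"], dtype="category"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    dtypes_path = os.path.join(tmp_dir, "toy_dtypes.json")

    # Export the data and a {column: dtype name} mapping
    toy_df.to_csv(csv_path)
    with open(dtypes_path, "w") as f:
        json.dump(toy_df.dtypes.astype(str).to_dict(), f)

    # Reload and reapply the saved dtypes
    with open(dtypes_path, "r") as f:
        saved_dtypes = json.load(f)
    reloaded = pd.read_csv(csv_path, index_col=0).astype(saved_dtypes)

print(reloaded.dtypes)
```

Note that `index_col=0` is used on reload because `to_csv` writes the index as the first column by default.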

### Summary 

In this lesson we coerced our categorical data by first identifying our categorical features with the `summarize_cats` function.  We then used `selected_cat_values` to take a look at the common values in each of these columns.

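Finally, we moved on to coercing our categorical columns.  We made three coercions in all.  First, we replaced sparse values with `other`.  Then, we replaced na values (which `reduce_cat_values` also maps to `other`, since missing values never appear among the selected values) and applied one-hot encoding with a `DataFrameMapper`.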