# Coercing to Booleans

### Introduction

In this lesson, we'll work through identifying and coercing data to boolean values.  This will also prepare us to identify and coerce categorical values in our dataset.

### Loading our Airbnb Data

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).  Let's load our data.

In [1]:
import pandas as pd
df = pd.read_feather('./coerced_nums_and_dates.feather')

In [2]:
df.shape

(18035, 85)

Lucky for us, a good amount of our data is already coerced.  But we still have more work to do.

In [3]:
object_df = df.select_dtypes(include = 'object')

object_df.shape

(18035, 47)

### Feature engineering

So 47 of our 85 columns are still of type `object`.  Let's take a look at some of them.

In [5]:
object_df[:2]

Unnamed: 0,listing_url,name,summary,space,description,neighborhood_overview,notes,transit,access,interaction,...,amenities,weekly_price,monthly_price,calendar_updated,requires_license,license,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,https://www.airbnb.com/rooms/17934396,primeflats - Apartment am Schillerpark 6,This ground-floor apartment in a former newspa...,Welcome to my Berlin classic! The style is mar...,This ground-floor apartment in a former newspa...,The apartment is located in the heart of Berli...,,The train line 6 is reachable within a five-mi...,,"Guests rent the whole apartment, but I am avai...",...,"{TV,Internet,Wifi,Kitchen,""Free street parking...",,,today,t,,t,super_strict_30,f,f
1,https://www.airbnb.com/rooms/11836363,Mostaza Bright 1Bed in Mitte,This spacious and calm one bedroom has everyth...,Enjoy of a bright apartment in the center of B...,This spacious and calm one bedroom has everyth...,"Everyone coming to Berlin, knows Mitte center....",Underground parking its available at an additi...,Getting around from the apartment its no probl...,You will have access at no extra charge to the...,As much as needed,...,"{TV,Internet,Wifi,Kitchen,Gym,Elevator,Heating...",,,2 days ago,t,,t,strict_14_with_grace_period,f,f


Where a large percentage of the values in a column repeat, we can think of the column as categorical, and eventually one hot encode it.  So we wrote a function called `percent_different` that returns the fraction (between 0 and 1) of a series' non-null values that are unique.  If most of the values in a series are unique, then it is not a categorical column.

So in `find_categorical`, we loop through our columns and keep those where `percent_different` falls below a threshold (0.5 by default). Those are our candidate categorical columns.

In [9]:
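def percent_different(df_series):
    # Ignore missing values so NaN does not count as a distinct category
    series_filled = df_series.dropna()
    # Fraction (0 to 1) of the non-null values that are distinct
    return len(series_filled.unique())/len(series_filled)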

In [10]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 
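
To make the threshold concrete, here is a minimal sketch on a made-up six-row frame. The `toy` data is hypothetical and not taken from the listings, and `percent_different` is restated only so the block runs on its own.

```python
import pandas as pd

def percent_different(series):
    # Same idea as above: distinct non-null values / total non-null values
    non_null = series.dropna()
    return non_null.nunique() / len(non_null)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "listing_url": ["u1", "u2", "u3", "u4", "u5", "u6"],   # every value unique
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Entire home/apt", "Private room", "Entire home/apt"],
    "instant_bookable": ["t", "f", None, "t", "t", "f"],   # one missing value
})

ratios = {col: percent_different(toy[col]) for col in toy.columns}

# Keep only the columns whose ratio falls below the default threshold of 0.5
categorical = [col for col, ratio in ratios.items() if ratio < 0.5]
print(ratios)
print(categorical)
```

Here `listing_url` has a ratio of 1.0 and is dropped, while `room_type` (2 distinct values out of 6) and `instant_bookable` (2 out of 5 non-null values) are kept.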

Let's see how this works.

In [11]:
potential_cat = find_categorical(object_df)

In [12]:
potential_cat.shape

(18035, 29)

In [13]:
potential_cat[:2]

Unnamed: 0,host_name,host_location,host_response_time,host_is_superhost,host_neighbourhood,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,...,room_type,bed_type,weekly_price,monthly_price,calendar_updated,requires_license,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification
0,Ben,"Berlin, Berlin, Germany",,f,,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Berlin, Berlin, Germany",Wedding,...,Entire home/apt,Real Bed,,,today,t,t,super_strict_30,f,f
1,Khadine,DE,,f,Mitte,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Berlin, Berlin, Germany",Mitte,...,Entire home/apt,Real Bed,,,2 days ago,t,t,strict_14_with_grace_period,f,f


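It looks like it did a reasonable job: columns such as `room_type`, `bed_type`, and `cancellation_policy` clearly hold a small set of repeated values.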

### Inspecting the Candidate Categorical Columns

The next step is to look at the values in those columns to see whether they really are full of categories.  Our `get_multiple_val_counts` function loops through a dataframe's columns and returns, for each one, the top `num_vals` entries from `value_counts`, labeled with the column name.

In [15]:
def get_multiple_val_counts(df, num_vals = 1):
    return [df[column].value_counts(normalize=True).iloc[:num_vals] for column in df.columns]

In [16]:
get_multiple_val_counts(potential_cat)[:3]

[Anna    0.009992
 Name: host_name, dtype: float64,
 Berlin, Berlin, Germany    0.763945
 Name: host_location, dtype: float64,
 within an hour    0.527298
 Name: host_response_time, dtype: float64]

And `summarize_cats` puts this information in an easier-to-work-with NumPy array, sorted from the most to the least dominant top value.

In [21]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([val_count.name, val_count.index[0], float(val_count.values[0])]) for val_count in multiple_val_counts])
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]

In [24]:
summary = summarize_cats(potential_cat)
summary[:10]

array([['market', 'Berlin', '0.9997220369134979'],
       ['requires_license', 't', '0.999611865816468'],
       ['state', 'Berlin', '0.9976065902259824'],
       ['host_has_profile_pic', 't', '0.9973355537052456'],
       ['city', 'Berlin', '0.9941209095951192'],
       ['smart_location', 'Berlin, Germany', '0.9939561962850014'],
       ['require_guest_profile_picture', 'f', '0.9932353756584419'],
       ['street', 'Berlin, Berlin, Germany', '0.9893540338231217'],
       ['require_guest_phone_verification', 'f', '0.9878014970889937'],
       ['bed_type', 'Real Bed', '0.9642916551150541']], dtype='<U32')

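The first column in `summary` is the name of the dataframe column, the second is its most frequent value, and the last is the fraction of the column's non-null values equal to that value.  Because a NumPy array holds a single dtype, all three are stored as strings.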
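
If keeping the frequencies numeric is useful, the same summary could be built as a pandas DataFrame instead. This is only a sketch of an alternative, not the approach used in this lesson; `summarize_top_values` and the `toy` frame are hypothetical names and data introduced for illustration.

```python
import pandas as pd

def summarize_top_values(df):
    """Return one row per column: its top value and that value's share."""
    rows = []
    for column in df.columns:
        # value_counts drops NaN by default and sorts by frequency
        counts = df[column].value_counts(normalize=True)
        if counts.empty:
            continue  # skip columns that are entirely missing
        rows.append({
            "column": column,
            "top_value": counts.index[0],
            "share": counts.iloc[0],
        })
    summary_df = pd.DataFrame(rows)
    return summary_df.sort_values("share", ascending=False).reset_index(drop=True)

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "market": ["Berlin", "Berlin", "Berlin", "Other"],
    "requires_license": ["t", "t", "t", "t"],
    "bed_type": ["Real Bed", "Futon", "Real Bed", None],
})

toy_summary = summarize_top_values(toy)
print(toy_summary)
```

With a DataFrame, the `share` column stays a float, so it can be sorted or filtered without casting back from strings.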

### Identifying Boolean Values

From the summary grid, we can start to see some strings that are really boolean values.  These are the columns with `t` or `f` as their top values.  

In [25]:
summary[:3]

array([['market', 'Berlin', '0.9997220369134979'],
       ['requires_license', 't', '0.999611865816468'],
       ['state', 'Berlin', '0.9976065902259824']], dtype='<U32')

Let's select all of the columns from our summary that have values of `t` or `f`.

In [26]:
boolean_summary = summary[np.isin(summary[:, 1], ['t', 'f'])]

In [27]:
true_boolean_cols = boolean_summary[:, 0]
true_boolean_cols

array(['requires_license', 'host_has_profile_pic',
       'require_guest_profile_picture',
       'require_guest_phone_verification', 'host_is_superhost',
       'is_location_exact', 'instant_bookable', 'host_identity_verified'],
      dtype='<U32')
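
Selecting on the top value alone assumes that a column whose most frequent value is `t` or `f` contains nothing else. A stricter check, sketched below under that caveat, confirms that every non-null value is `t` or `f`. The helper `is_tf_column` and the `toy` frame (including the contrived `mixed_flag` column) are hypothetical and only for illustration.

```python
import pandas as pd

def is_tf_column(series):
    # True only if the column has at least one value and all non-null values are 't' or 'f'
    values = set(series.dropna().unique())
    return len(values) > 0 and values <= {"t", "f"}

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "instant_bookable": ["t", "f", None, "t"],
    "requires_license": ["t", "t", "t", "t"],
    "mixed_flag": ["t", "t", "unknown", "t"],   # top value is 't', but not purely t/f
    "market": ["Berlin", "Berlin", "Berlin", "Berlin"],
})

tf_cols = [col for col in toy.columns if is_tf_column(toy[col])]
print(tf_cols)
```

In this toy frame, `mixed_flag` would pass a top-value check but is rejected by the stricter one.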

Now that we have selected our boolean columns, we can use the `MissingIndicator` to convert these columns to True or False values.  By passing `missing_values = 't'`, we tell the indicator to treat `t` as the "missing" value, so it returns True wherever a cell equals `t` and False for every other value.

> We can loop through to do this for each of our boolean columns.

In [29]:
from sklearn.impute import MissingIndicator
steps = [([col], MissingIndicator(missing_values = 't')) 
         for col in true_boolean_cols]

And then place these steps in a DataFrameMapper.

In [30]:
from sklearn_pandas import DataFrameMapper
boolean_mapper = DataFrameMapper(steps, df_out = True)

In [31]:
bool_df = boolean_mapper.fit_transform(df)

In [32]:
bool_df[:2]

Unnamed: 0,requires_license,host_has_profile_pic,require_guest_profile_picture,require_guest_phone_verification,host_is_superhost,is_location_exact,instant_bookable,host_identity_verified
0,True,True,False,False,False,False,True,True
1,True,True,False,False,False,True,True,True
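The same `t`-to-True conversion can also be sketched with plain pandas, which makes the treatment of missing values explicit. This is an alternative technique, not the `MissingIndicator` approach used above, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "instant_bookable": ["f", "t", "t"],
})
bool_cols = ["host_is_superhost", "instant_bookable"]

# Option 1: 't' becomes True, everything else (including missing) becomes False
converted = toy[bool_cols].eq("t")

# Option 2: map explicitly, so missing values stay missing instead of becoming False
nullable = toy["host_is_superhost"].map({"t": True, "f": False})

print(converted)
print(nullable)
```

The first option matches the rule described above, where every value other than `t` ends up as False. The second may be preferable when a missing flag should not be silently read as False.
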


Then we can update our dataframe.

In [35]:
df.loc[:, bool_df.columns] = bool_df

In [36]:
df.to_feather('./listings_coerced_bools.feather')

Then we can use NumPy's `setdiff1d` to identify the remaining `potential_cat` columns that we still need to coerce.

In [33]:
import numpy as np
remaining_cat_cols = np.setdiff1d(potential_cat.columns, bool_df.columns)

remaining_cat_cols

array(['bed_type', 'calendar_updated', 'cancellation_policy', 'city',
       'host_location', 'host_name', 'host_neighbourhood',
       'host_response_time', 'host_verifications', 'market',
       'monthly_price', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'property_type', 'room_type',
       'smart_location', 'state', 'street', 'weekly_price', 'zipcode'],
      dtype=object)
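
As a small illustration of what `np.setdiff1d` does, the sketch below uses a handful of column names picked for the example. It returns the unique values of the first array that are not in the second, in sorted order, which is why the result above comes back alphabetized.

```python
import numpy as np

# A few column names chosen for illustration
all_cols = np.array(["room_type", "instant_bookable", "bed_type", "host_is_superhost"])
bool_cols = np.array(["instant_bookable", "host_is_superhost"])

# Values in all_cols that are not in bool_cols, returned sorted and de-duplicated
remaining = np.setdiff1d(all_cols, bool_cols)
print(remaining)
```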

### Summary

In this lesson, we were introduced to some of the methods for handling boolean and categorical data.  We identified candidate categorical columns by looking at the fraction of distinct values: if only a small share of a column's values are different, it is likely categorical or boolean.  We then used our `summarize_cats` function to view the top value in each column, along with how often it occurs.

Finally, we used the `MissingIndicator` to convert the `t` and `f` strings in our boolean columns to True and False values.

In [59]:
import numpy as np
def summarize_counts(df):
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in df]).reshape(-1, 1)
    columns = df.columns.to_numpy().reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in df]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [60]:
summary = summarize_counts(potential_cat)