# Coercing to Booleans

### Introduction

In this lesson, we'll work through identifying and coercing data to boolean values.  This will also prepare us to identify and coerce categorical values in our dataset.

### Loading our AirBnb Data

For this lesson, we'll work with [AirBnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).  Let's load our data.

In [1]:
import pandas as pd
df = pd.read_feather('./coerced_nums_and_dates.feather')

In [2]:
df.shape

(18035, 85)

Luckily, a good amount of our data is already coerced.  But we still have more work to do.

In [3]:
object_df = df.select_dtypes(include = 'object')

object_df.shape

(18035, 47)

### Feature engineering

So 47 of our 85 columns are still of type object.  Let's take a look at some of our object columns.

In [5]:
object_df[:2]

|  | listing_url | name | summary | space | description | neighborhood_overview | notes | transit | access | interaction | ... | amenities | weekly_price | monthly_price | calendar_updated | requires_license | license | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.airbnb.com/rooms/17934396 | primeflats - Apartment am Schillerpark 6 | This ground-floor apartment in a former newspa... | Welcome to my Berlin classic! The style is mar... | This ground-floor apartment in a former newspa... | The apartment is located in the heart of Berli... |  | The train line 6 is reachable within a five-mi... |  | Guests rent the whole apartment, but I am avai... | ... | {TV,Internet,Wifi,Kitchen,"Free street parking... |  |  | today | t |  | t | super_strict_30 | f | f |
| 1 | https://www.airbnb.com/rooms/11836363 | Mostaza Bright 1Bed in Mitte | This spacious and calm one bedroom has everyth... | Enjoy of a bright apartment in the center of B... | This spacious and calm one bedroom has everyth... | Everyone coming to Berlin, knows Mitte center.... | Underground parking its available at an additi... | Getting around from the apartment its no probl... | You will have access at no extra charge to the... | As much as needed | ... | {TV,Internet,Wifi,Kitchen,Gym,Elevator,Heating... |  |  | 2 days ago | t |  | t | strict_14_with_grace_period | f | f |


Where a large share of the values in a column repeat, we can think of the column as categorical, and eventually one-hot encode it.  So we wrote a function called `percent_different` that returns the fraction of a series' non-null values that are unique.  If most of the values in a series are unique, then it is not a categorical column.

So in `find_categorical`, we loop through our columns, identifying those where `percent_different` is below a threshold (0.5 by default) - and those are our potential categorical columns.

In [9]:
def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [10]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 
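
To make the threshold concrete, here is a minimal sketch on a toy dataframe. The column values are made up for illustration and are not taken from the listings data; the function mirrors `percent_different` above but uses `nunique`.

```python
import pandas as pd

def percent_different(series):
    """Fraction of non-null values that are distinct (between 0 and 1)."""
    non_null = series.dropna()
    return non_null.nunique() / len(non_null)

# Hypothetical toy columns, not taken from the listings data.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Private room", "Entire home/apt"],
    "listing_url": ["url_a", "url_b", "url_c", "url_d", "url_e"],
})

ratios = {col: percent_different(toy[col]) for col in toy.columns}

# room_type has 2 distinct values out of 5, listing_url has 5 out of 5,
# so only room_type falls under the 0.5 threshold.
likely_categorical = [col for col, ratio in ratios.items() if ratio < 0.5]
```

One caveat with this approach: a column that is entirely null leaves nothing after `dropna`, so the division would raise a `ZeroDivisionError`.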

Let's see how this works.

In [11]:
potential_cat = find_categorical(object_df)

In [12]:
potential_cat.shape

(18035, 29)

In [13]:
potential_cat[:2]

|  | host_name | host_location | host_response_time | host_is_superhost | host_neighbourhood | host_verifications | host_has_profile_pic | host_identity_verified | street | neighbourhood | ... | room_type | bed_type | weekly_price | monthly_price | calendar_updated | requires_license | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ben | Berlin, Berlin, Germany |  | f |  | ['email', 'phone', 'facebook', 'reviews', 'jum... | t | t | Berlin, Berlin, Germany | Wedding | ... | Entire home/apt | Real Bed |  |  | today | t | t | super_strict_30 | f | f |
| 1 | Khadine | DE |  | f | Mitte | ['email', 'phone', 'reviews', 'jumio', 'offlin... | t | t | Berlin, Berlin, Germany | Mitte | ... | Entire home/apt | Real Bed |  |  | 2 days ago | t | t | strict_14_with_grace_period | f | f |


It looks like it did a good job.

### Combine with Selecting Categorical Columns

The next step is to take a look at the values in those identified columns, to see if indeed they are full of categories.  Our `get_multiple_val_counts` method loops through a dataframe, providing the top `value_counts` values, and the related column.

In [15]:
def get_multiple_val_counts(df, num_vals = 1):
    return [df[column].value_counts(normalize=True).iloc[:num_vals] for column in df.columns]

In [16]:
get_multiple_val_counts(potential_cat)[:3]

[Anna    0.009992
 Name: host_name, dtype: float64,
 Berlin, Berlin, Germany    0.763945
 Name: host_location, dtype: float64,
 within an hour    0.527298
 Name: host_response_time, dtype: float64]

And `summarize_cats` puts this information in an easier-to-work-with numpy array.

In [21]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([val_count.name, val_count.index[0], float(val_count.values[0])]) for val_count in multiple_val_counts])
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]

In [24]:
summary = summarize_cats(potential_cat)
summary[:10]

array([['market', 'Berlin', '0.9997220369134979'],
       ['requires_license', 't', '0.999611865816468'],
       ['state', 'Berlin', '0.9976065902259824'],
       ['host_has_profile_pic', 't', '0.9973355537052456'],
       ['city', 'Berlin', '0.9941209095951192'],
       ['smart_location', 'Berlin, Germany', '0.9939561962850014'],
       ['require_guest_profile_picture', 'f', '0.9932353756584419'],
       ['street', 'Berlin, Berlin, Germany', '0.9893540338231217'],
       ['require_guest_phone_verification', 'f', '0.9878014970889937'],
       ['bed_type', 'Real Bed', '0.9642916551150541']], dtype='<U32')

The first column in summary is the name of the column, the second is the top value, and the last is the fraction of the column's non-null rows that hold that value.
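
Because the numpy array above has dtype `<U32`, the fractions are stored as strings. As an alternative sketch (not the lesson's method), the same three pieces of information can be kept in a dataframe so the fraction stays numeric. The function name `summarize_top_values` and the toy data are assumptions made for illustration.

```python
import pandas as pd

def summarize_top_values(df):
    """One row per column: its most frequent value and that value's share."""
    rows = []
    for column in df.columns:
        counts = df[column].value_counts(normalize=True)
        rows.append({"column": column,
                     "top_value": counts.index[0],
                     "share": float(counts.iloc[0])})
    # Most dominant top values first, as in the numpy version.
    return (pd.DataFrame(rows)
              .sort_values("share", ascending=False)
              .reset_index(drop=True))

# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "requires_license": ["t", "t", "t", "f"],
    "bed_type": ["Real Bed", "Futon", "Real Bed", "Couch"],
})
toy_summary = summarize_top_values(toy)
```

With a dataframe, later filters such as `toy_summary[toy_summary["top_value"].isin(["t", "f"])]` can be written against named columns rather than positional indices.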

### Identifying Boolean Values

From the summary grid, we can start to see some strings that are really boolean values.  These are the columns with `t` or `f` as their top values.  

In [25]:
summary[:3]

array([['market', 'Berlin', '0.9997220369134979'],
       ['requires_license', 't', '0.999611865816468'],
       ['state', 'Berlin', '0.9976065902259824']], dtype='<U32')

Let's select all of the columns from our summary that have values of `t` or `f`.

In [26]:
boolean_summary = summary[np.isin(summary[:, 1], ['t', 'f'])]

In [27]:
true_boolean_cols = boolean_summary[:, 0]
true_boolean_cols

array(['requires_license', 'host_has_profile_pic',
       'require_guest_profile_picture',
       'require_guest_phone_verification', 'host_is_superhost',
       'is_location_exact', 'instant_bookable', 'host_identity_verified'],
      dtype='<U32')

Now that we have selected our boolean columns, we can use scikit-learn's `MissingIndicator` to convert these columns to True or False values.  We do so by telling the transformer to treat `t` as the missing value, so that it marks `t` as True and all other values as False.

> We can loop through to do this for each of our boolean columns.

In [29]:
from sklearn.impute import MissingIndicator
steps = [([col], MissingIndicator(missing_values = 't')) 
         for col in true_boolean_cols]
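
Since the indicator simply flags entries equal to `t`, the same result can be sketched with a plain pandas comparison. This is an alternative for illustration, not the approach the lesson uses, and the toy dataframe below is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same 't' / 'f' encoding as the listings.
toy = pd.DataFrame({
    "instant_bookable": ["t", "f", "t"],
    "host_is_superhost": ["f", np.nan, "t"],
})
flag_cols = ["instant_bookable", "host_is_superhost"]

# Comparing against 't' yields True for 't' and False for everything else,
# including 'f' and missing values.
toy_bools = toy[flag_cols].eq("t")
```

Note that a missing value becomes False here, so the information that the flag was absent is lost; whether that is acceptable depends on the column.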

And then place these steps in a DataFrameMapper.

In [30]:
from sklearn_pandas import DataFrameMapper
boolean_mapper = DataFrameMapper(steps, df_out = True)

In [31]:
bool_df = boolean_mapper.fit_transform(df)

In [32]:
bool_df[:2]

|  | requires_license | host_has_profile_pic | require_guest_profile_picture | require_guest_phone_verification | host_is_superhost | is_location_exact | instant_bookable | host_identity_verified |
|---|---|---|---|---|---|---|---|---|
| 0 | True | True | False | False | False | False | True | True |
| 1 | True | True | False | False | False | True | True | True |


Then we can update our dataframe.

In [35]:
df.loc[:, bool_df.columns] = bool_df

In [36]:
df.to_feather('./listings_coerced_bools.feather')

Then we can use numpy to identify our remaining potential_cat columns that we should coerce.

In [33]:
import numpy as np
remaining_cat_cols = np.setdiff1d(potential_cat.columns, bool_df.columns)

remaining_cat_cols

array(['bed_type', 'calendar_updated', 'cancellation_policy', 'city',
       'host_location', 'host_name', 'host_neighbourhood',
       'host_response_time', 'host_verifications', 'market',
       'monthly_price', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'property_type', 'room_type',
       'smart_location', 'state', 'street', 'weekly_price', 'zipcode'],
      dtype=object)

### Summary

In this lesson, we were introduced to some of the methods for handling boolean and categorical data.  We identified our potential categorical columns by looking at the fraction of values that are different.  If only a small fraction of a column's values are different, it is likely categorical or boolean.  We then used our `summarize_cats` method to view the top value in each of the columns, along with how often it occurs.

Finally, we used the `MissingIndicator` to convert the `t` and `f` values in our boolean columns to True and False values.

### Handle Categorical

In [339]:
from sklearn.impute import MissingIndicator
steps = [([col], MissingIndicator(missing_values = top_val), {'alias': f'{col}_is_{top_val}'}) for col, top_val in paired_bools]

In [341]:
practical_bools_mapper = DataFrameMapper(steps, df_out = True)

In [342]:
df_practical_bools = practical_bools_mapper.fit_transform(df)

In [343]:
df_practical_bools[:2]

|  | market_is_Berlin | state_is_Berlin | city_is_Berlin | smart_location_is_Berlin, Germany | street_is_Berlin, Berlin, Germany |
|---|---|---|---|---|---|
| 0 | True | True | True | True | True |
| 1 | True | True | True | True | True |


### Remaining Categories

In [347]:
np.setdiff1d(remaining_cat_cols,  potential_bool_cols[:, 0])

array(['bed_type', 'calendar_updated', 'cancellation_policy',
       'host_location', 'host_name', 'host_neighbourhood',
       'host_response_time', 'host_verifications', 'monthly_price',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'property_type', 'room_type',
       'weekly_price', 'zipcode'], dtype=object)

In [253]:
# get_multiple_val_counts(df[potential_bool_cols], num_vals = 3)

[Berlin                   0.999722
 Other (International)    0.000111
 Juarez                   0.000056
 Name: market, dtype: float64,
 Berlin                0.997607
 Brandenburg           0.000612
 Schleswig-Holstein    0.000390
 Name: state, dtype: float64,
 Berlin        0.994121
 Berlin        0.000555
 Schöneberg    0.000555
 Name: city, dtype: float64,
 Berlin, Germany        0.993956
 ., Germany             0.000554
 Schöneberg, Germany    0.000554
 Name: smart_location, dtype: float64,
 Berlin, Berlin, Germany    0.989354
 Berlin, Germany            0.003216
 ., Berlin, Germany         0.000554
 Name: street, dtype: float64,
 Real Bed         0.964292
 Pull-out Sofa    0.020238
 Futon            0.010923
 Name: bed_type, dtype: float64,
 Apartment      0.896812
 Condominium    0.026948
 Loft           0.020128
 Name: property_type, dtype: float64]

In [None]:
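# Assumes we want to look at each column's top value counts in turn
for val_count in get_multiple_val_counts(potential_cat):
    print(val_count)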

In [72]:
df['requires_license'].value_counts(normalize=True).iloc[:2]

t    0.999612
f    0.000388
Name: requires_license, dtype: float64

In [59]:
import numpy as np
def summarize_counts(df):
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in df]).reshape(-1, 1)
    columns = df.columns.to_numpy().reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in df]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [60]:
summary = summarize_counts(potential_cat)

In [61]:
summary[np.isin(summary[:, -1], ['t', 'f'])]

array([['requires_license', 0.999611865816468, 't'],
       ['host_has_profile_pic', 0.9973355537052456, 't'],
       ['require_guest_profile_picture', 0.9932353756584419, 'f'],
       ['require_guest_phone_verification', 0.9878014970889937, 'f'],
       ['host_is_superhost', 0.8658340271995559, 'f'],
       ['is_location_exact', 0.7439423343498752, 't'],
       ['instant_bookable', 0.6882173551427779, 'f'],
       ['host_identity_verified', 0.6122120455176242, 'f']], dtype=object)

In [35]:
summary[~np.isin(summary[:, -1], ['t', 'f'])]

array([['market', 0.9997220369134979, 'Berlin'],
       ['state', 0.9976065902259824, 'Berlin'],
       ['city', 0.9941209095951192, 'Berlin'],
       ['smart_location', 0.9939561962850014, 'Berlin, Germany'],
       ['street', 0.9893540338231217, 'Berlin, Berlin, Germany'],
       ['bed_type', 0.9642916551150541, 'Real Bed'],
       ['property_type', 0.896811754920987, 'Apartment'],
       ['host_location', 0.7639453886876567, 'Berlin, Berlin, Germany'],
       ['host_response_time', 0.5272984805562709, 'within an hour'],
       ['room_type', 0.50945383975603, 'Private room'],
       ['cancellation_policy', 0.404435819240366, 'flexible'],
       ['neighbourhood_group_cleansed', 0.24324923759356806,
        'Friedrichshain-Kreuzberg'],
       ['host_verifications', 0.1810923204879401,
        "['email', 'phone', 'reviews']"],
       ['neighbourhood', 0.14616911936463442, 'Neukölln'],
       ['host_neighbourhood', 0.14327652871258773, 'Neukölln'],
       ['calendar_updated', 0.111283615

In [18]:
def selected_summaries(df, not_values = [], lower_bound = .1, upper_bound = 1):
    potential_cols = summarize_counts(df)
    potential_cols = potential_cols[potential_cols[:, 1] > lower_bound]
    potential_cols = potential_cols[potential_cols[:, 1] < upper_bound]
    not_tf = ~np.isin(potential_cols[:, 2], not_values)
    return potential_cols[not_tf]

In [123]:
selected = selected_summaries(df, not_values = ['t', 'f'], upper_bound = .90)
selected

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['bathrooms', 0.8795293072824156, '1.0'],
       ['review_scores_communication', 0.805393184074115, '10.0'],
       ['review_scores_checkin', 0.7908940397350993, '10.0'],
       ['guests_included', 0.7746984746363959, '1'],
       ['host_location', 0.7660902121590302, 'Berlin, Berlin, Germany'],
       ['calculated_host_listings_count', 0.7643667967364314, '1'],
       ['bedrooms', 0.761693441022455, '1.0'],
       ['review_scores_accuracy', 0.7544381960524865, '10.0'],
       ['host_response_rate', 0.7380138759449104, '100%'],
       ['host_listings_count', 0.7170824824647074, '1.0'],
       ['host_total_listings_count', 0.7170824824647074, '1.0'],
       ['availability_30', 0.6433575736076623, '0'],
       ['beds', 0.636549395877754, '1.0'],
       ['review_scores_location', 0.616356713205673, '10.0'],
       ['review_scores_cleanliness', 0.568325891626702, '10.0'],
       ['availability_60', 0.5574228449804896, '0'],


* But we may not want columns whose top value starts or ends with a digit, as we could coerce those columns to floats instead.

In [126]:
def num_is_digit(array, str_index = 0):
    return np.array([value[str_index].isdigit() for value in array])

In [127]:
num_is_digit(selected[:, 2], str_index = 0)[0:10]

array([False,  True,  True,  True,  True, False,  True,  True,  True,
        True])

In [128]:
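def remove_digits_from_selected(selected_matrix, col_idx, str_indices = [0, -1]):
    # Each pass filters the original matrix, so only the last index in
    # str_indices determines which rows are returned.
    for idx in str_indices:
        selected_col = selected_matrix[~num_is_digit(selected_matrix[:, col_idx], idx)]
    return selected_col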
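
If the intent is to drop a row when a digit appears at any of the listed positions, the per-position checks can be combined into a single mask. The following is a sketch under that assumption; the helper names and the toy matrix are hypothetical. Under this rule a value such as `100%` is dropped as well, because its first character is a digit.

```python
import numpy as np

def has_digit_at(values, positions=(0, -1)):
    """True where the string has a digit at any of the given positions."""
    return np.array([any(str(value)[pos].isdigit() for pos in positions)
                     for value in values])

def drop_digit_rows(matrix, col_idx, positions=(0, -1)):
    """Keep rows whose value in col_idx has no digit at any listed position."""
    return matrix[~has_digit_at(matrix[:, col_idx], positions)]

# Hypothetical toy matrix shaped like the summary: name, share, top value.
toy = np.array([["property_type", 0.90, "Apartment"],
                ["bathrooms", 0.88, "1.0"],
                ["host_response_rate", 0.74, "100%"]], dtype=object)
kept = drop_digit_rows(toy, col_idx=2)
```

An empty string in the checked column would raise an `IndexError` in this sketch, so such values would need to be handled before calling it.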

In [130]:
selected_sums_no_digits = remove_digits_from_selected(selected, 2, [0, -1])
selected_sums_no_digits

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7660902121590302, 'Berlin, Berlin, Germany'],
       ['host_response_rate', 0.7380138759449104, '100%'],
       ['host_response_time', 0.5260923586663906, 'within an hour'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['host_verifications', 0.18193508336289466,
        "['email', 'phone', 'reviews']"],
       ['neighbourhood', 0.1498062648802577, 'Neukölln'],
       ['host_neighbourhood', 0.14640852331309429, 'Neukölln'],
       ['calendar_updated', 0.11160872649875843, 'today']], dtype=object)

### Cleaning Values

1. Find columns to clean

In [31]:
def categorical_plus_values(df, threshold = 5):
    categorical_cols = find_categorical(df)
    return [column for column in categorical_cols if len(df[column].value_counts()) > threshold]

In [131]:
selected_cat_cols = selected_sums_no_digits[:, 0]

selected_cat_cols

array(['property_type', 'host_location', 'host_response_rate',
       'host_response_time', 'room_type', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications',
       'neighbourhood', 'host_neighbourhood', 'calendar_updated'],
      dtype=object)

In [132]:
cat_cols_df = df_informative[selected_cat_cols]
cat_cols_df[:3]

|  | property_type | host_location | host_response_rate | host_response_time | room_type | cancellation_policy | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood | calendar_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Guesthouse | Key Biscayne, Florida, United States | 96% | within an hour | Entire home/apt | strict_14_with_grace_period | Mitte | ['email', 'phone', 'reviews', 'jumio', 'offlin... | Mitte | Mitte | 3 months ago |
| 1 | Apartment | Berlin, Berlin, Germany |  |  | Private room | flexible | Pankow | ['email', 'phone', 'reviews', 'jumio', 'govern... |  | Prenzlauer Berg | 7 weeks ago |
| 2 | Apartment | Coledale, New South Wales, Australia | 100% | within a day | Entire home/apt | strict_14_with_grace_period | Pankow | ['email', 'phone', 'facebook', 'reviews', 'man... | Prenzlauer Berg | Prenzlauer Berg | a week ago |


In [133]:
updated_non_digits = categorical_plus_values(cat_cols_df)

In [134]:
len(updated_non_digits)

8

In [135]:
updated_non_digits

['property_type',
 'host_location',
 'host_response_rate',
 'neighbourhood_group_cleansed',
 'host_verifications',
 'neighbourhood',
 'host_neighbourhood',
 'calendar_updated']

In [64]:
df[updated_non_digits].describe()

|  | property_type | host_location | host_response_rate | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood | calendar_updated |
|---|---|---|---|---|---|---|---|---|
| count | 22552 | 22436 | 9657 | 22552 | 22552 | 21421 | 17458 | 22552 |
| unique | 33 | 1036 | 64 | 12 | 301 | 91 | 181 | 75 |
| top | Apartment | Berlin, Berlin, Germany | 100% | Friedrichshain-Kreuzberg | ['email', 'phone', 'reviews'] | Neukölln | Neukölln | today |
| freq | 20225 | 17188 | 7127 | 5497 | 4103 | 3209 | 2556 | 2517 |


In [141]:
# df['property_type'].value_counts(normalize = True)

### Clean Values of Relevant Columns

In [54]:
def selected_cat_values(column, threshold = .02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]

In [144]:
selected = selected_cat_values(df.neighbourhood_cleansed, .02)

In [145]:
selected

Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Prenzlauer Berg Süd         0.024610
Wedding Zentrum             0.022925
Moabit West                 0.021728
nördliche Luisenstadt       0.021462
Schöneberg-Süd              0.021018
Helmholtzplatz              0.020353
Name: neighbourhood_cleansed, dtype: float64

In [156]:
def reduce_cat_values(column, threshold = .02):
    column = column.copy()
    selected_values = selected_cat_values(column, threshold).index
    # Label every value at or below the threshold share as 'other'
    column[~column.isin(selected_values)] = 'other'
    # astype returns a new series, so return its result
    return column.astype('category')
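
To see the lumping step in isolation, here is a small self-contained sketch on a made-up series. The name `lump_rare_values` and the `other_label` parameter are assumptions for illustration; the logic follows `selected_cat_values` and `reduce_cat_values` above but uses `Series.where` instead of in-place assignment.

```python
import pandas as pd

def lump_rare_values(column, threshold=0.02, other_label="other"):
    """Replace values whose share is at or below threshold with other_label."""
    shares = column.value_counts(normalize=True)
    frequent = shares[shares > threshold].index
    # Keep frequent values as they are and relabel everything else.
    lumped = column.where(column.isin(frequent), other_label)
    return lumped.astype("category")

# Hypothetical toy series: shares are 0.6, 0.3 and 0.1.
toy = pd.Series(["Mitte"] * 6 + ["Pankow"] * 3 + ["Spandau"],
                name="neighbourhood")
reduced = lump_rare_values(toy, threshold=0.2)
```

As with the function above, missing values are not in the frequent set, so they would also be relabeled as `other`; if missing values should stay missing, they would need to be excluded from the replacement.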

In [157]:
new_neigh_cleansed =  reduce_cat_values(df.neighbourhood_cleansed, .02)

In [159]:
new_neigh_cleansed.value_counts(normalize = True)

other                       0.380232
Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Prenzlauer Berg Süd         0.024610
Wedding Zentrum             0.022925
Moabit West                 0.021728
nördliche Luisenstadt       0.021462
Schöneberg-Süd              0.021018
Helmholtzplatz              0.020353
Name: neighbourhood_cleansed, dtype: float64

In [151]:
len(df[updated_non_digits].columns)

8

In [70]:
categoricals = ['property_type', 'host_location', 'neighbourhood_cleansed', 'room_type', 'cancellation_policy', 'neighbourhood_group_cleansed', 'host_verifications', 'neighbourhood', 'host_neighbourhood']



In [71]:
def df_reduced_categories(df, categoricals, threshold = .01):
    new_df = pd.DataFrame()
    for category in categoricals:
        new_df[category] = reduce_cat_values(df[category], threshold)
    return new_df

In [72]:
df_reduced = df_reduced_categories(df, categoricals)

In [73]:
df_reduced.describe()

|  | property_type | host_location | neighbourhood_cleansed | room_type | cancellation_policy | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood |
|---|---|---|---|---|---|---|---|---|---|
| count | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 |
| unique | 5 | 4 | 31 | 3 | 4 | 11 | 18 | 14 | 14 |
| top | Apartment | Berlin, Berlin, Germany | other | Private room | flexible | Friedrichshain-Kreuzberg | ['email', 'phone', 'reviews'] | other | other |
| freq | 20225 | 17188 | 5031 | 11534 | 9102 | 5497 | 4103 | 4152 | 7648 |


In [198]:
summarize_counts(df_reduced)

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7621496984746364, 'Berlin, Berlin, Germany'],
       ['neighbourhood_cleansed', 0.51232706633558, 'other'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['host_neighbourhood', 0.38479957431713374, 'other'],
       ['host_verifications', 0.2920361830436325, 'other'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['neighbourhood', 0.21882759843916283, 'other']], dtype=object)

In [236]:
def replace_df_columns(original_df, replacing_df):
    replacing_cols = replacing_df.columns
    original_df = original_df.drop(columns = replacing_cols)
    new_df = pd.concat([original_df, replacing_df], axis = 1)
    return new_df
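
The drop-and-concat approach above moves the replaced columns to the end of the dataframe. If keeping the original column order matters, one possible variant assigns the new columns in place on a copy. This is a sketch, not the lesson's function; the name `replace_columns_keep_order` and the toy data are hypothetical, and it assumes both dataframes share the same index.

```python
import pandas as pd

def replace_columns_keep_order(original_df, replacing_df):
    """Return a copy of original_df with replacing_df's columns swapped in."""
    updated = original_df.copy()
    for column in replacing_df.columns:
        # Assignment aligns on the index, so both frames must share it.
        updated[column] = replacing_df[column]
    return updated

# Hypothetical toy data for illustration.
original = pd.DataFrame({"id": [1, 2],
                         "room_type": ["Shared room", "Private room"],
                         "price": [10, 20]})
replacement = pd.DataFrame({"room_type": ["other", "Private room"]})
result = replace_columns_keep_order(original, replacement)
```

Working on a copy leaves the original dataframe untouched, which makes it easier to compare the columns before and after the replacement.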

In [233]:
new_df = replace_df_columns(df, df_reduced)