# Coercing to Booleans

### Introduction

In this lesson, we'll work through identifying and coercing data to boolean values.  This will also prepare us to identify and coerce categorical values in our dataset.

### Loading our Airbnb Data

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data). Let's load our data.

In [1]:
import pandas as pd
df = pd.read_csv('./nums_and_dates_ten_k.csv', index_col = 0)

In [None]:
df.select_dtypes('object').shape

In [278]:
potential_date_cols = ['last_scraped',
 'host_since',
 'calendar_last_scraped',
 'first_review',
 'last_review']
# Parse each column as dates (pandas 2.x rejects the unit-less astype('datetime64'))
df[potential_date_cols] = df[potential_date_cols].apply(pd.to_datetime)

In [279]:
df.shape

(8000, 83)

Lucky for us, a good amount of our data is already coerced. But we still have more work to do.

In [220]:
object_df = df.select_dtypes(include = 'object')

object_df.shape

(8000, 45)

In [221]:
def contains_date(column):
    # Match dd-mm-yyyy, yyyy-mm-dd, dd/mm/yyyy and yyyy/mm/dd style strings
    regex_string = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$' +
                    r'|^\d{1,2}\/\d{1,2}\/\d{4}$|^\d{4}\/\d{1,2}\/\d{1,2}$')
    # na=False treats missing values as non-matches instead of propagating NaN
    return column.str.contains(regex_string, na=False).any()
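As a quick check of the idea, here is a minimal sketch that runs a date-detecting function on two small, made-up series. The sample values are hypothetical, and the function is restated here (with `na=False`, so missing values count as non-matches) so the cell runs on its own.

```python
import pandas as pd

DATE_REGEX = (r'^\d{1,2}-\d{1,2}-\d{4}$|^\d{4}-\d{1,2}-\d{1,2}$'
              r'|^\d{1,2}/\d{1,2}/\d{4}$|^\d{4}/\d{1,2}/\d{1,2}$')

def contains_date(column):
    # True if at least one value looks like a date; missing values never match
    return bool(column.str.contains(DATE_REGEX, na=False).any())

# Hypothetical sample values, not taken from the listings file
scraped = pd.Series(['2018-11-07', None, '2018-11-09'])
cities = pd.Series(['Berlin', 'Mitte', None])

looks_like_dates = contains_date(scraped)
looks_like_text = contains_date(cities)
print(looks_like_dates, looks_like_text)
```

Because the function uses `.any()`, a single date-like value is enough to flag a column, so it works best as a first filter before inspecting the column by eye.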

### Feature engineering

Many of our columns are still of type `object`. Let's take a look at some of them.

In [222]:
object_df.dtypes

In [223]:
def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [224]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 
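To make the threshold concrete, the sketch below applies the two functions to a small toy DataFrame. The rows are invented for illustration (they are not from the Berlin file), and the functions are repeated so the cell is self-contained.

```python
import pandas as pd

def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique()) / len(series_filled)

def find_categorical(df, threshold=.5):
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df

# Hypothetical toy listings
toy_df = pd.DataFrame({
    'name': ['Cozy flat', 'Loft in Mitte', 'Sunny studio',
             'Quiet room', 'Altbau gem', 'Garden view'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Private room', 'Entire home/apt', None],
    'instant_bookable': ['t', 'f', 'f', 't', 'f', 'f'],
})

# Share of distinct values among the non-missing entries of each column
ratios = {column: percent_different(toy_df[column]) for column in toy_df.columns}
toy_cats = find_categorical(toy_df)
print(ratios)
print(list(toy_cats.columns))
```

Every `name` is distinct, so its ratio is 1.0 and it is left out, while `room_type` (2 distinct values out of 5 non-missing) and `instant_bookable` (2 out of 6) fall under the default threshold of .5 and are kept.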

### Combine with Selecting Categorical Columns

In [228]:
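def get_multiple_val_counts(df, num_vals = 1):
    # rename() keeps the column name on each result (pandas 2.x would otherwise name it 'proportion')
    return [df[column].value_counts(normalize=True).iloc[:num_vals].rename(column) for column in df.columns]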

In [229]:
potential_cat = find_categorical(object_df)  # assumption: built with the default threshold
get_multiple_val_counts(potential_cat)[:3]

[Anna    0.00852
 Name: host_name, dtype: float64,
 Berlin, Berlin, Germany    0.826202
 Name: host_location, dtype: float64,
 within an hour    0.457895
 Name: host_response_time, dtype: float64]

The `summarize_cats` function puts this information into an easier-to-work-with NumPy array.

In [230]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    # One row per column: [column name, most common value, its share of non-missing rows]
    stacked_counts = np.array([[column, val_count.index[0], float(val_count.values[0])]
                               for column, val_count in zip(df.columns, multiple_val_counts)])
    # The array holds strings, so cast the shares back to float before sorting
    sorted_cols = np.argsort(stacked_counts[:, 2].astype('float'))
    # Largest share first
    return stacked_counts[sorted_cols[::-1]]
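The sketch below shows the shape of the summary on a toy DataFrame. The column names are borrowed from the listings data, but the values are made up, and both helper functions are restated in a compact form so the cell runs on its own.

```python
import numpy as np
import pandas as pd

def get_multiple_val_counts(df, num_vals=1):
    # rename() keeps the column name on each result across pandas versions
    return [df[column].value_counts(normalize=True).iloc[:num_vals].rename(column)
            for column in df.columns]

def summarize_cats(df):
    rows = [[val_count.name, val_count.index[0], float(val_count.values[0])]
            for val_count in get_multiple_val_counts(df)]
    stacked_counts = np.array(rows)  # mixed types are stored as strings
    order = np.argsort(stacked_counts[:, 2].astype('float'))
    return stacked_counts[order[::-1]]

# Hypothetical values for three columns
toy_df = pd.DataFrame({
    'host_response_time': ['within an hour', 'within a day', 'within an hour', None],
    'host_has_profile_pic': ['t', 't', 't', 't'],
    'host_is_superhost': ['f', 'f', 'f', 't'],
})

summary = summarize_cats(toy_df)
print(summary)
```

Columns dominated by a single value rise to the top of the summary, which makes boolean-like columns such as `host_has_profile_pic` easy to spot.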

### Identifying Boolean Values

In [280]:
from sklearn.impute import SimpleImputer, MissingIndicator
# For each boolean column: fill missing values with 'f', then flag every 't' as True
steps = [([col], [SimpleImputer(strategy = 'constant', missing_values = np.nan, fill_value = 'f'),
                  MissingIndicator(missing_values = 't')])
         for col in true_boolean_cols]
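To see what this pair of transformers does to a single column, here is a minimal sketch on a made-up array, assuming the column stores `'t'`, `'f'`, and missing values. It applies the two transformers by hand rather than through whatever mapper consumes `steps`, which is not shown in this lesson.

```python
import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

# Hypothetical column with one missing value
raw = np.array([['t'], ['f'], [np.nan], ['t']], dtype=object)

# Step 1: treat a missing value as 'f'
fill_missing = SimpleImputer(strategy='constant', missing_values=np.nan, fill_value='f')
filled = fill_missing.fit_transform(raw)

# Step 2: MissingIndicator returns True wherever it finds its "missing" marker.
# Declaring 't' as that marker turns the column into True/False values.
flag_true = MissingIndicator(missing_values='t')
as_bool = flag_true.fit_transform(filled)

print(filled.ravel())
print(as_bool.ravel())
```

Using `MissingIndicator` this way is a workaround: it is designed to flag missing data, and here it is repurposed to flag `'t'`. A plain comparison such as `df[col] == 't'` would give the same result outside of a transformer pipeline.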

In [None]:
df.to_csv('./listings_coerced_booleans.csv')

In [243]:
import json
data = df.dtypes.astype(str).to_dict()

file = './coerced_bools_dtypes.json'

with open(file, 'w') as f:
    json.dump(data, f)
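Saving the dtypes alongside the CSV matters because a CSV file does not record column types. The sketch below shows one way the two files might be read back together; it is an assumption about how they are used later, and it works on a toy DataFrame with in-memory text instead of the files above. Date columns go through `parse_dates` because `read_csv` does not accept datetime dtypes in its `dtype` argument.

```python
import io
import json
import pandas as pd

# Hypothetical toy data standing in for the coerced listings
toy_df = pd.DataFrame({
    'price': [50.0, 80.0],
    'instant_bookable': [True, False],
    'host_since': pd.to_datetime(['2015-03-01', '2017-08-15']),
})

# Same idea as above, kept in memory instead of written to disk
csv_text = toy_df.to_csv()
dtypes_json = json.dumps(toy_df.dtypes.astype(str).to_dict())

# Reload: split the saved dtypes into date columns and everything else
saved = json.loads(dtypes_json)
date_cols = [col for col, dtype in saved.items() if dtype.startswith('datetime64')]
other_dtypes = {col: dtype for col, dtype in saved.items() if col not in date_cols}

restored = pd.read_csv(io.StringIO(csv_text), index_col=0,
                       dtype=other_dtypes, parse_dates=date_cols)
print(restored.dtypes)
```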

In [115]:
import numpy as np
remaining_cat_cols = np.setdiff1d(potential_cat.columns, bool_df.columns)

remaining_cat_cols

array(['bed_type', 'calendar_updated', 'cancellation_policy', 'city',
       'host_has_profile_pic', 'host_identity_verified',
       'host_is_superhost', 'host_location', 'host_name',
       'host_neighbourhood', 'host_response_time', 'host_verifications',
       'market', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'property_type', 'room_type',
       'smart_location', 'state', 'street', 'zipcode'], dtype=object)

### Summary

In this lesson, we were introduced to some of the methods for handling boolean and categorical data. We identified our candidate categorical columns by looking at the share of distinct values in each column: if only a small share of a column's values are distinct, the column is likely categorical or boolean. We then used our `summarize_cats` function to view the top value in each column, along with how often it occurs.

Finally, we used `SimpleImputer` together with `MissingIndicator` to convert values in almost-boolean columns to True and False values.