# Coerce to Booleans Lab

### Introduction

In this lesson, we'll continue to work with our Airbnb dataset.  Let's start by loading our data.

### Working with Airbnb

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [119]:
import pandas as pd
df = pd.read_csv('./listings_coerced_bools.csv', index_col = 0)

In [120]:
import json
file = "./coerced_bools_dtypes.json"
with open(file, 'r') as f:
    dtypes = json.load(f)

In [121]:
df = df.astype(dtypes)

In [122]:
df.select_dtypes('object').shape

(8000, 40)

We'll pick up where we left off with our data.  Let's take a look at how many columns we still have to convert by selecting the columns that are still of type object.

In [123]:
object_df = df.select_dtypes(include = 'object')

# object_df.shape

In [152]:
# object_df.dtypes

So we're down to 40 features.

### Reviewing our Functions

These are the functions that we'll work with.  Remember that the first two functions identify our potential categorical columns, and the next two let us look more closely at the top values in those columns.

In [125]:
def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [126]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 

In [127]:
def get_multiple_val_counts(df, num_vals = 1):
    return [df[column].value_counts(normalize=True).iloc[:num_vals] for column in df.columns]

In [128]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([val_count.name, val_count.index[0], float(val_count.values[0])]) for val_count in multiple_val_counts])
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]
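
To make the threshold concrete, here is a minimal sketch that runs the first two functions on a toy dataframe. The toy data and the `.75` threshold are assumptions chosen for illustration; they are not taken from the listings data.

```python
import pandas as pd

def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique()) / len(series_filled)

def find_categorical(df, threshold=.5):
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df

# Toy data (hypothetical): one repetitive column and one with all-unique values.
toy = pd.DataFrame({
    'market': ['Berlin', 'Berlin', 'Berlin', 'Leipzig'],
    'listing_name': ['Loft', 'Studio', 'Flat', 'Room'],
})

# Share of distinct values per column: lower means more repetition.
ratios = {col: percent_different(toy[col]) for col in toy.columns}

# Only columns whose ratio falls below the threshold are kept.
kept_columns = list(find_categorical(toy, threshold=.75).columns)
```

Here `market` has two distinct values out of four rows, so it falls below the threshold, while `listing_name` is unique on every row and is left out.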

### Selecting Categorical Columns

Begin by selecting the remaining categorical columns from our dataframe that have not yet been coerced.

In [129]:
potential_cat = find_categorical(object_df)

In [130]:
potential_cat.shape

(8000, 22)

So we have 22 remaining columns that we can work with.  Now let's use the `summarize_cats` function to get a better sense of these columns.

In [131]:
summary = summarize_cats(potential_cat)

summary

array([['market', 'Berlin', '0.99975'],
       ['state', 'Berlin', '0.9978704747588626'],
       ['host_has_profile_pic', 't', '0.9976193459466233'],
       ['smart_location', 'Berlin, Germany', '0.99175'],
       ['city', 'Berlin', '0.9917489686210776'],
       ['street', 'Berlin, Berlin, Germany', '0.989125'],
       ['bed_type', 'Real Bed', '0.93525'],
       ['property_type', 'Apartment', '0.899'],
       ['host_is_superhost', 'f', '0.8743265254980579'],
       ['host_location', 'Berlin, Berlin, Germany', '0.8262015309323629'],
       ['room_type', 'Entire home/apt', '0.54125'],
       ['host_identity_verified', 't', '0.5090840746773587'],
       ['host_response_time', 'within an hour', '0.45789473684210524'],
       ['cancellation_policy', 'flexible', '0.34125'],
       ['neighbourhood_group_cleansed', 'Friedrichshain-Kreuzberg',
        '0.255125'],
       ['host_verifications',
        "['email', 'phone', 'reviews', 'jumio', 'government_id']",
        '0.224875'],
       ['neigh

The main thing to notice is that some of these columns can probably be reframed as booleans.  For example, the `market` column could be renamed `market_is_Berlin`, with a value of True whenever the market is Berlin and False otherwise.
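
As a rough illustration of that idea, the sketch below builds such a column in plain pandas. The toy values are assumptions, and this is not the approach the lab itself uses later on.

```python
import pandas as pd

# Toy stand-in for the market column (hypothetical values).
toy = pd.DataFrame({'market': ['Berlin', 'Berlin', 'Leipzig', 'Zurich']})

# The most frequent value comes first in value_counts().
top_val = toy['market'].value_counts().index[0]

# True where the row holds the top value, False otherwise.
toy[f'market_is_{top_val}'] = toy['market'] == top_val
```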

Let's select the first five column names from the summary above.

In [132]:
summary[:5, 0]

array(['market', 'state', 'host_has_profile_pic', 'smart_location',
       'city'], dtype='<U55')

And then we can use our `get_multiple_val_counts` function to look at the top three values in each of these columns.

In [133]:
get_multiple_val_counts(df[summary[:5, 0]], 3)

[Berlin     0.999750
 Leipzig    0.000125
 Zurich     0.000125
 Name: market, dtype: float64,
 Berlin         0.997870
 Brandenburg    0.000752
 Germany        0.000376
 Name: state, dtype: float64,
 t    0.997619
 f    0.002381
 Name: host_has_profile_pic, dtype: float64,
 Berlin, Germany            0.991750
 Berlin , Germany           0.000625
 Berlin - Mitte, Germany    0.000375
 Name: smart_location, dtype: float64,
 Berlin                     0.991749
 Berlin                     0.000625
 Berlin, friedrichshain     0.000375
 Name: city, dtype: float64]
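
The `city` output lists `Berlin` twice, which may mean the two entries differ only in surrounding whitespace. That is an assumption; the output alone does not confirm it. A minimal sketch of how one could check, using toy values:

```python
import pandas as pd

# Toy values (hypothetical); note the trailing space in the third entry.
city = pd.Series(['Berlin', 'Berlin', 'Berlin ', 'Berlin, friedrichshain'], name='city')

# Share of the top value before and after stripping whitespace.
raw_share = city.value_counts(normalize=True).iloc[0]
clean_share = city.str.strip().value_counts(normalize=True).iloc[0]
```

If the stripped share is higher than the raw share, some variants were being counted as separate values.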

In each of these five columns, the top value accounts for more than 99% of the rows, so it seems safe to relabel them as boolean columns.

### Creating our Steps

Now we'll need to create our steps.  Note that each step will take on the following format.

```python
([col], MissingIndicator(missing_values = top_val) )
```

So we specify both the column and the value that we want set to True, which is the top value.  To do this for all columns, we first need to pair each boolean column with its top value.

In [134]:
potential_bool_cols = summary[:5, :]

In [135]:
paired_bools = list(zip(list(potential_bool_cols[:, 0]), list(potential_bool_cols[:, 1])))

In [136]:
paired_bools

[('market', 'Berlin'),
 ('state', 'Berlin'),
 ('host_has_profile_pic', 't'),
 ('smart_location', 'Berlin, Germany'),
 ('city', 'Berlin')]
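
To see why `MissingIndicator(missing_values = top_val)` produces a boolean column, here is a minimal sketch on a toy array. The indicator flags every entry equal to `missing_values`, so treating the top value as the "missing" one marks it True. The toy values are assumptions, and the array is given object dtype because the indicator expects string data in that form.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy stand-in for a single column (object dtype, no NaN values).
market = np.array([['Berlin'], ['Leipzig'], ['Berlin']], dtype=object)

# Treat the top value as "missing" so the indicator flags it as True.
indicator = MissingIndicator(missing_values='Berlin')
is_berlin = indicator.fit_transform(market)

flags = is_berlin.ravel().tolist()
```

Note that with the default `features='missing-only'`, a column only produces output if the flagged value appears in it at fit time, which holds here because each top value is present in its own column.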

Now we are about to use a list comprehension to create our steps, but notice that we still have missing values in some of our columns.

In [137]:
df[summary[:5, 0]].isna().sum()

market                   0
state                   17
host_has_profile_pic    19
smart_location           0
city                     1
dtype: int64

The number of missing values is quite small: less than one half of one percent of the rows in any column.  So let's handle these simply by setting any `na` value to `'f'` and coercing all values to True or False.

Create our steps for each of the above columns.  Use a list comprehension to do so.

In [138]:
from sklearn.impute import MissingIndicator, SimpleImputer
steps = [([col], [SimpleImputer(strategy = 'constant', fill_value = 'f'), 
                  MissingIndicator(missing_values = top_val)]) 
         for col, top_val in paired_bools]

In [139]:
steps[:2]

[(['market'],
  [SimpleImputer(add_indicator=False, copy=True, fill_value='f',
                 missing_values=nan, strategy='constant', verbose=0),
   MissingIndicator(error_on_new=True, features='missing-only',
                    missing_values='Berlin', sparse='auto')]),
 (['state'],
  [SimpleImputer(add_indicator=False, copy=True, fill_value='f',
                 missing_values=nan, strategy='constant', verbose=0),
   MissingIndicator(error_on_new=True, features='missing-only',
                    missing_values='Berlin', sparse='auto')])]

In [140]:
from sklearn_pandas import DataFrameMapper
practical_bools_mapper = DataFrameMapper(steps, df_out = True)

In [141]:
df_practical_bools = practical_bools_mapper.fit_transform(df)

In [142]:
df_practical_bools[:2]

| | market | state | host_has_profile_pic | smart_location | city |
|---|---|---|---|---|---|
| 0 | True | True | True | True | True |
| 1 | True | True | True | True | True |
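
As a cross-check, the effect of these steps can be approximated in plain pandas: fill missing values with `'f'`, then compare each column with its top value. This is a sketch on toy data (the values are assumptions), not a replacement for the mapper.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with one missing value per column.
toy = pd.DataFrame({
    'state': ['Berlin', np.nan, 'Brandenburg'],
    'host_has_profile_pic': ['t', 'f', np.nan],
})
paired = [('state', 'Berlin'), ('host_has_profile_pic', 't')]

# Missing values become 'f', which never equals a top value, so they end up False.
bools = pd.DataFrame({
    col: toy[col].fillna('f').eq(top_val) for col, top_val in paired
})
```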


It would be even better to rename each column using the pattern `col_is_top_value`, so that we would see, for example, `market_is_Berlin` with values of True or False.  Add an alias when looping through the steps.

In [143]:
steps = [([col], [SimpleImputer(strategy = 'constant', fill_value = 'f'),
                  MissingIndicator(missing_values = top_val)], 
          {'alias': f'{col}_is_{top_val}'}) 
         for col, top_val in paired_bools]

Then place these steps in a DataFrameMapper and coerce the values.

In [144]:
from sklearn_pandas import DataFrameMapper
practical_bools_mapper = DataFrameMapper(steps, df_out = True)

In [145]:
df_practical_bools = practical_bools_mapper.fit_transform(df)

In [146]:
df_practical_bools[:3]

| | market_is_Berlin | state_is_Berlin | host_has_profile_pic_is_t | smart_location_is_Berlin, Germany | city_is_Berlin |
|---|---|---|---|---|---|
| 0 | True | True | True | True | True |
| 1 | True | True | True | True | True |
| 2 | True | True | True | True | True |
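
The alias for `smart_location` contains a comma and a space, which can be awkward to type later. One option is to clean the top value before building the alias. `make_alias` below is a hypothetical helper written for this sketch; it is not part of the lab or of `sklearn_pandas`.

```python
import re

def make_alias(col, top_val):
    # Hypothetical helper: collapse runs of non-word characters into "_".
    clean_val = re.sub(r'\W+', '_', str(top_val)).strip('_')
    return f'{col}_is_{clean_val}'

paired_bools = [('market', 'Berlin'), ('smart_location', 'Berlin, Germany')]
aliases = [make_alias(col, top_val) for col, top_val in paired_bools]
```

The resulting strings could then be passed as the `'alias'` value when building the steps.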


Ok, this looks good.  Now let's update our dataframe.

We can drop the above columns from our dataframe.

In [147]:
cols_to_remove = [col for col, top_val in paired_bools]
cols_to_remove

['market', 'state', 'host_has_profile_pic', 'smart_location', 'city']

In [148]:
df_coerced = df.drop(columns = cols_to_remove)

Then we add the new boolean columns.

In [149]:
df_coerced[df_practical_bools.columns] = df_practical_bools

In [150]:
df_coerced.select_dtypes(include = 'object').shape

(8000, 35)
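
The same drop-then-attach pattern can be sketched on a toy dataframe to confirm that the object-column count falls by the number of columns converted. The toy data is an assumption, and `pd.concat` is used here as an alternative to column assignment; the two should match when the indexes line up.

```python
import pandas as pd

# Toy data (hypothetical) with two object columns.
toy = pd.DataFrame({
    'market': pd.Series(['Berlin', 'Leipzig'], dtype=object),
    'name': pd.Series(['Loft', 'Studio'], dtype=object),
})
toy_bools = pd.DataFrame({'market_is_Berlin': toy['market'] == 'Berlin'})

object_cols_before = toy.select_dtypes(include='object').shape[1]

# Drop the original column, then attach its boolean replacement.
toy_coerced = pd.concat([toy.drop(columns=['market']), toy_bools], axis=1)

object_cols_after = toy_coerced.select_dtypes(include='object').shape[1]
```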

In [151]:
df_coerced.to_csv('./coerced_bools_complete.csv')
import json
data = df_coerced.dtypes.astype(str).to_dict()

file = './coerced_bools_complete_dtypes.json'

with open(file, 'w') as f:
    json.dump(data, f)

We can see that we now have five fewer object columns: 35 rather than 40.
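
Saving the dtypes next to the CSV matters because a CSV does not record column types on its own. The sketch below round-trips a toy dataframe through a temporary directory, using the same load pattern this lab started with. The file names and toy data are assumptions.

```python
import json
import os
import tempfile

import pandas as pd

# Toy data (hypothetical) with a boolean and a float column.
toy = pd.DataFrame({'market_is_Berlin': [True, False], 'price': [50.0, 80.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy.csv')
    dtypes_path = os.path.join(tmp_dir, 'toy_dtypes.json')

    # Save the data and, separately, the dtype of each column as a string.
    toy.to_csv(csv_path)
    with open(dtypes_path, 'w') as f:
        json.dump(toy.dtypes.astype(str).to_dict(), f)

    # Reload the data and reapply the saved dtypes.
    with open(dtypes_path, 'r') as f:
        dtypes = json.load(f)
    reloaded = pd.read_csv(csv_path, index_col=0).astype(dtypes)

reloaded_dtypes = reloaded.dtypes.astype(str).to_dict()
```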

### Summary

In this lesson, we coerced more of our columns into boolean columns. We continued to practice using list comprehensions to create our steps, this time adding an alias to each step.  We also practiced using our categorical functions, both to identify potential categorical or boolean columns and to look at the top values in those columns.