# Coerce to Booleans Lab

### Introduction

In this lesson, we'll continue working with our Airbnb dataset.  Let's start by loading our data.

### Working with AirBnb

For this lesson, we'll work with [AirBnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/listings_coerced_bools.csv"
df = pd.read_csv(url)

In [2]:
import requests
url = "https://raw.githubusercontent.com/jigsawlabs-student/engineering-large-datasets/master/coerced_bools_dtypes.json"
response = requests.get(url)
dtypes_dict = response.json()

In [4]:
df = df.astype(dtypes_dict)

In [5]:
df.select_dtypes('object').shape

(8000, 40)

We'll pick up where we left off with our data.  Let's take a look at the number of columns we have left to convert by selecting the columns that are still of type object.

In [29]:
object_df = None

# object_df.shape

So we're down to 39 features.
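
As a minimal sketch of that selection on made-up data (the toy frame and its column names are assumptions, not part of the listings dataset), `select_dtypes` keeps only the columns whose dtype is `object`:

```python
import pandas as pd

# Toy frame with hypothetical columns, not the Berlin listings.
toy_df = pd.DataFrame({
    "city": pd.Series(["Berlin", "Berlin", "Potsdam"], dtype="object"),
    "price": [50.0, 75.0, 60.0],
    "is_superhost": [True, False, True],
})

# Keep only the columns still stored with the generic object dtype.
toy_object_df = toy_df.select_dtypes("object")
print(toy_object_df.columns.tolist())
print(toy_object_df.shape)
```

The shape's second entry is the number of columns still waiting to be coerced.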

### Reviewing our Functions

These are the functions that we'll have to work with.  Remember that the first two functions are to identify our potential categorical columns, and the next set of functions allow us to look more closely at the top values in those columns.

In [30]:
def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [31]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 

In [32]:
def get_multiple_val_counts(df, num_vals = 1):
    return [df[column].value_counts(normalize=True).iloc[:num_vals] for column in df.columns]

In [33]:
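import numpy as np
def summarize_cats(df):
    # Build one row per column: [column name, top value, share of rows with that value].
    # Names come from df.columns because pandas >= 2.0 names value_counts results 'proportion'.
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([column, val_count.index[0], float(val_count.values[0])]) for column, val_count in zip(df.columns, multiple_val_counts)])
    # Sort the rows by share, largest first.
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]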
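
To make the idea behind these helpers concrete, here is a small self-contained sketch on toy data. The frame, the `unique_ratio` helper name, and the values are illustrative assumptions; the logic mirrors `percent_different` and the `.5` threshold used by `find_categorical`.

```python
import pandas as pd

# Toy data: "market" repeats one value often, "name" is unique per row.
toy_df = pd.DataFrame({
    "market": pd.Series(["Berlin", "Berlin", "Berlin", "Berlin", "Other"], dtype="object"),
    "name": pd.Series(["Loft", "Studio", "Flat", "Room", "Attic"], dtype="object"),
})

def unique_ratio(series):
    # Same idea as percent_different: distinct non-null values / non-null count.
    filled = series.dropna()
    return filled.nunique() / len(filled)

ratios = {col: unique_ratio(toy_df[col]) for col in toy_df.columns}

# Columns under the 0.5 threshold are treated as potentially categorical.
candidates = [col for col, ratio in ratios.items() if ratio < 0.5]

# Share of rows taken by the most common value in each candidate column.
top_shares = {col: toy_df[col].value_counts(normalize=True).iloc[0] for col in candidates}
print(ratios)
print(candidates, top_shares)
```

A low ratio flags a column as a categorical candidate, and a top share close to 1 suggests the column may work better as a boolean.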

### Selecting Categorical Columns

Begin by selecting the remaining categorical columns from our dataframe that have not yet been coerced.

In [34]:
potential_cat = None

In [35]:
potential_cat.shape
# (18035, 21)

So we have 21 remaining columns that we can work with.  Now let's use the `summarize_cats` function to get a better sense of these columns.

In [36]:
summary = None

summary

Ok, so the main thing to notice is that some of the columns can probably be reframed as booleans.  For example, the `market` column could be renamed `market_is_Berlin` and be labeled True whenever the value is Berlin, and False otherwise.
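
As a rough illustration of that reframing in plain pandas (the toy values are made up, and this is not the mapper-based approach used later in the lab):

```python
import pandas as pd

# Toy stand-in for the market column.
toy_market = pd.Series(["Berlin", "Berlin", "Juarez", "Berlin"], name="market", dtype="object")

# True where the value equals the top value, False otherwise.
market_is_berlin = (toy_market == "Berlin").rename("market_is_Berlin")
print(market_is_berlin.tolist())
print(market_is_berlin.dtype)
```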

Let's select the first five column names from the summary above.

In [20]:
top_five_cols = None

top_five_cols
# array(['market', 'state', 'city', 'smart_location', 'street'],
#       dtype='<U29')

And then we can use our `get_multiple_val_counts` function to look at the top three values in each of these columns.

In [22]:


# [Berlin                   0.999722
#  Other (International)    0.000111
#  Juarez                   0.000056
#  Name: market, dtype: float64,
#  Berlin                0.997607
#  Brandenburg           0.000612
#  Schleswig-Holstein    0.000390
#  Name: state, dtype: float64,
#  Berlin        0.994121
#  Schöneberg    0.000555
#  Berlin        0.000555
#  Name: city, dtype: float64,
#  Berlin, Germany        0.993956
#  Schöneberg, Germany    0.000554
#  ., Germany             0.000554
#  Name: smart_location, dtype: float64,
#  Berlin, Berlin, Germany    0.989354
#  Berlin, Germany            0.003216
#  ., Berlin, Germany         0.000554
#  Name: street, dtype: float64]

So it seems safe to relabel these five as boolean columns.

### Creating our Steps

Now we'll need to create our steps.  Note that each step will take on the following format.

```python
([col], MissingIndicator(missing_values = top_val) )
```

So we specify both the column and the value that we want set to True, which is the top value.  To do this for all columns, we first need to pair our boolean columns with their related top values.
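
Before pairing, here is a minimal sketch of what a single such transformer does on its own, outside of any mapper. The toy array is an assumption; `MissingIndicator` comes from `sklearn.impute`, and passing a string as `missing_values` makes the mask True wherever that string occurs.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Toy single-column input with object dtype (made-up values).
X = np.array([["Berlin"], ["Potsdam"], ["Berlin"]], dtype=object)

# Treat the top value as the "missing" marker, so the mask flags its occurrences.
indicator = MissingIndicator(missing_values="Berlin")
mask = indicator.fit_transform(X)
print(mask.ravel().tolist())
```

So the transformer is being repurposed here: rather than flagging missing data, it flags rows that hold the top value.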

In [25]:
potential_bool_cols = summary[:5, :]

In [26]:
paired_bools = list(zip(list(potential_bool_cols[:, 0]), list(potential_bool_cols[:, 1])))

In [27]:
paired_bools

[('market', 'Berlin'),
 ('state', 'Berlin'),
 ('city', 'Berlin'),
 ('smart_location', 'Berlin, Germany'),
 ('street', 'Berlin, Berlin, Germany')]

Then use a list comprehension to create your steps.

In [31]:

from sklearn.impute import MissingIndicator
steps = None

In [32]:
steps

# [(['market'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto')),
#  (['state'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto')),
#  (['city'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto')),
#  (['smart_location'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Germany', sparse='auto')),
#  (['street'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Berlin, Germany', sparse='auto'))]

And place these steps in a mapper.

In [35]:

from sklearn_pandas import DataFrameMapper
practical_bools_mapper = None

In [36]:
df_practical_bools = practical_bools_mapper.fit_transform(df)

In [37]:
df_practical_bools[:2]

# 	market	state	city	smart_location	street
# 0	True	True	True	True	True
# 1	True	True	True	True	True


Now it would be even better if we renamed each column in the form `col_is_top_value`.  Then we would see, for example, `market_is_Berlin` with values of True or False.  Add an alias to each step when looping through the pairs.
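
A minimal sketch of building those alias names, assuming each step takes an optional third element of the form `{'alias': name}` (the option `sklearn_pandas` provides for renaming output columns). The pairs below are illustrative stand-ins for `paired_bools`.

```python
# Hypothetical pairs in the same (column, top value) shape as paired_bools.
toy_pairs = [("market", "Berlin"), ("smart_location", "Berlin, Germany")]

# Build the alias dict that would sit in the third slot of each step.
aliases = [{"alias": f"{col}_is_{top_val}"} for col, top_val in toy_pairs]
print(aliases)
```

The same comprehension can produce the full step tuple by placing the column list and transformer ahead of the dict.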

In [38]:
steps = None

# steps
# [(['market'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'market_is_Berlin'}),
#  (['state'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'state_is_Berlin'}),
#  (['city'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'city_is_Berlin'}),
#  (['smart_location'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Germany', sparse='auto'),
#   {'alias': 'smart_location_is_Berlin, Germany'}),
#  (['street'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Berlin, Germany', sparse='auto'),
#   {'alias': 'street_is_Berlin, Berlin, Germany'})]

Then place these steps in a DataFrameMapper and coerce the values.

In [40]:
from sklearn_pandas import DataFrameMapper
practical_bools_mapper_with_al = None

In [41]:
df_practical_bools_with_al = practical_bools_mapper_with_al.fit_transform(df)

In [42]:
df_practical_bools_with_al[:3]

# 	market_is_Berlin	state_is_Berlin	city_is_Berlin	smart_location_is_Berlin, Germany	street_is_Berlin, Berlin, Germany
# 0	True	True	True	True	True
# 1	True	True	True	True	True
# 2	True	True	True	True	True


Ok, this looks good.  Now let's update our dataframe.

In [50]:
cols_to_remove = [col for col, top_val in paired_bools]
cols_to_remove

['market', 'state', 'city', 'smart_location', 'street']

We can drop the above columns from our dataframe.

In [51]:
df_coerced = df.drop(columns = cols_to_remove)

And update with the new boolean columns.

In [52]:
df_coerced[df_practical_bools_with_al.columns] = df_practical_bools_with_al

In [54]:
df_coerced.select_dtypes(include = 'object').shape

(18035, 34)

And we can see that we're now down to 34 columns of type object.
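
The drop-then-update pattern can be sketched on a toy frame as follows. The data and the `toy_` names are assumptions made for illustration, and the boolean column is built with a plain comparison rather than a mapper.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "market": pd.Series(["Berlin", "Juarez", "Berlin"], dtype="object"),
    "name": pd.Series(["Loft", "Studio", "Flat"], dtype="object"),
})

# Stand-in for the mapper output: one boolean column per coerced feature.
toy_bools = pd.DataFrame({"market_is_Berlin": toy_df["market"] == "Berlin"})

# Drop the original object column, then attach the boolean replacement.
toy_coerced = toy_df.drop(columns=["market"])
toy_coerced[toy_bools.columns] = toy_bools

print(toy_coerced.dtypes)
print(toy_coerced.select_dtypes(include="object").shape)
```

Only `name` remains of type object here, which is the same check the lab uses to count the columns still left to convert.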

### Summary

In this lesson, we worked through coercing more of our columns to boolean columns.  We continued to practice using list comprehensions to create our steps, this time adding an alias to each step.  We also practiced using our categorical functions both to identify potential categorical or boolean columns and to look at the top values in those columns.