# Coerce to Booleans Lab

### Introduction

In this lesson, we'll continue working with our AirBnb dataset.  Let's start by loading our data.

### Working with AirBnb

For this lesson, we'll work with [AirBnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
df = pd.read_feather('./listings_coerced_bools.feather')

We'll pick up where we last left off with our data.  Let's take a look at the number of columns we still have to convert.  Select the columns that are still of type object.

In [8]:
object_df = df.select_dtypes(include = 'object')

# object_df.shape

So we're down to 39 features.
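As a quick illustration of what `select_dtypes` keeps, here is a minimal sketch on a made-up frame. The `toy` data and its column names are hypothetical and not taken from the listings file.

```python
import pandas as pd

# Hypothetical toy frame: one text column, one numeric column, one boolean column.
toy = pd.DataFrame({
    "city": pd.Series(["Berlin", "Berlin", "Potsdam"], dtype="object"),
    "price": [50.0, 75.0, 40.0],
    "is_superhost": [True, False, True],
})

# Keep only the columns still stored with the generic object dtype.
toy_objects = toy.select_dtypes(include="object")
remaining = list(toy_objects.columns)
print(remaining)
```

Only the text column survives the selection, which is why the object columns are the ones left to coerce.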

### Reviewing our Functions

These are the functions that we'll work with.  Remember that the first two functions identify our potential categorical columns, and the next two allow us to look more closely at the top values in those columns.

In [5]:
def percent_different(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [6]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 
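To see how the threshold behaves, the sketch below reruns the two functions above on a small made-up frame. The `toy` data is hypothetical: one column repeats a couple of values, the other is unique in every non-missing row.

```python
import pandas as pd

def percent_different(df_series):
    # Share of distinct values among the non-missing entries.
    series_filled = df_series.dropna()
    return len(series_filled.unique()) / len(series_filled)

def find_categorical(df, threshold=.5):
    # Keep only columns whose share of distinct values is below the threshold.
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percent_different(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df

toy = pd.DataFrame({
    "room_type": pd.Series(["Private room", "Entire home/apt", "Private room",
                            "Private room", "Entire home/apt"], dtype="object"),
    "name": pd.Series(["Loft", "Studio", "Cozy flat", None, "Garden view"],
                      dtype="object"),
})

ratios = {col: percent_different(toy[col]) for col in toy.columns}
kept = list(find_categorical(toy).columns)
print(ratios)
print(kept)
```

Here `room_type` has 2 distinct values across 5 rows, so it falls under the default threshold, while `name` is distinct in all 4 of its non-missing rows and is left out.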

In [14]:
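def get_multiple_val_counts(df, num_vals = 1):
    # rename(column) keeps each Series named after its column; recent pandas
    # versions otherwise name a normalized value_counts result 'proportion'
    return [df[column].value_counts(normalize=True).iloc[:num_vals].rename(column) for column in df.columns]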

In [17]:
import numpy as np
def summarize_cats(df):
    multiple_val_counts = get_multiple_val_counts(df)
    stacked_counts = np.vstack([np.array([val_count.name, val_count.index[0], float(val_count.values[0])]) for val_count in multiple_val_counts])
    sorted_cols = np.argsort(stacked_counts.reshape(-1, 3)[:, 2].astype('float'))
    return stacked_counts[sorted_cols[::-1]]
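For intuition about what `summarize_cats` returns, here is a minimal pandas-only sketch of the same idea: for each column, take the most frequent value and its share, then sort the columns from most to least dominant. The helper name `top_value_summary` and the `toy` frame are hypothetical and not part of the lesson's code.

```python
import pandas as pd

def top_value_summary(df):
    # For each column, record its most frequent value and that value's share.
    rows = []
    for column in df.columns:
        shares = df[column].value_counts(normalize=True)
        rows.append((column, shares.index[0], float(shares.iloc[0])))
    # Most dominant columns first, mirroring the ordering of summarize_cats.
    return sorted(rows, key=lambda row: row[2], reverse=True)

toy = pd.DataFrame({
    "room_type": pd.Series(["Private room", "Private room",
                            "Entire home/apt", "Shared room"], dtype="object"),
    "market": pd.Series(["Berlin", "Berlin", "Berlin", "Berlin"], dtype="object"),
})

toy_summary = top_value_summary(toy)
print(toy_summary)
```

Unlike the NumPy version, this sketch keeps the shares as floats instead of casting every field to a string, so no `astype('float')` is needed before sorting.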

### Selecting Categorical Columns

Begin by selecting the remaining categorical columns from our dataframe that have not yet been coerced.

In [9]:
potential_cat = find_categorical(object_df)

In [12]:
potential_cat.shape

(18035, 21)

So we have 21 remaining columns that we can work with.  Now let's use the `summarize_cats` function to get a better sense of these columns.

In [19]:
summary = summarize_cats(potential_cat)

summary

array([['market', 'Berlin', '0.9997220369134979'],
       ['state', 'Berlin', '0.9976065902259824'],
       ['city', 'Berlin', '0.9941209095951192'],
       ['smart_location', 'Berlin, Germany', '0.9939561962850014'],
       ['street', 'Berlin, Berlin, Germany', '0.9893540338231217'],
       ['bed_type', 'Real Bed', '0.9642916551150541'],
       ['property_type', 'Apartment', '0.896811754920987'],
       ['host_location', 'Berlin, Berlin, Germany', '0.7639453886876567'],
       ['host_response_time', 'within an hour', '0.5272984805562709'],
       ['room_type', 'Private room', '0.50945383975603'],
       ['cancellation_policy', 'flexible', '0.404435819240366'],
       ['neighbourhood_group_cleansed', 'Friedrichshain-Kreuzberg',
        '0.24324923759356806'],
       ['host_verifications', "['email', 'phone', 'reviews']",
        '0.1810923204879401'],
       ['neighbourhood', 'Neukölln', '0.14616911936463442'],
       ['host_neighbourhood', 'Neukölln', '0.14327652871258773'],
       ['

Ok, so the main thing to notice is that some of these columns can probably be reframed as booleans.  For example, the `market` column could be renamed `market_is_Berlin` and labeled True whenever the value is Berlin, and False otherwise.

Let's select the first five column names from the summary above.

In [20]:
summary[:5, 0]

array(['market', 'state', 'city', 'smart_location', 'street'],
      dtype='<U29')

And then we can use our `get_multiple_val_counts` function to look at the top three values in each of these columns.

In [22]:
get_multiple_val_counts(df[summary[:5, 0]], 3)

[Berlin                   0.999722
 Other (International)    0.000111
 Juarez                   0.000056
 Name: market, dtype: float64,
 Berlin                0.997607
 Brandenburg           0.000612
 Schleswig-Holstein    0.000390
 Name: state, dtype: float64,
 Berlin        0.994121
 Schöneberg    0.000555
 Berlin        0.000555
 Name: city, dtype: float64,
 Berlin, Germany        0.993956
 Schöneberg, Germany    0.000554
 ., Germany             0.000554
 Name: smart_location, dtype: float64,
 Berlin, Berlin, Germany    0.989354
 Berlin, Germany            0.003216
 ., Berlin, Germany         0.000554
 Name: street, dtype: float64]

In each of these five columns the top value accounts for roughly 99% of the rows, so it seems safe to relabel them as boolean columns.

### Creating our Steps

Now we'll need to create our steps.  Note that each step will take the following format.

```python
([col], MissingIndicator(missing_values = top_val) )
```

So we specify both the column and the value that we want set to True, which is the top value.  To do this for all columns, we first need to pair each boolean column with its top value.

In [25]:
potential_bool_cols = summary[:5, :]

In [26]:
paired_bools = list(zip(list(potential_bool_cols[:, 0]), list(potential_bool_cols[:, 1])))

In [27]:
paired_bools

[('market', 'Berlin'),
 ('state', 'Berlin'),
 ('city', 'Berlin'),
 ('smart_location', 'Berlin, Germany'),
 ('street', 'Berlin, Berlin, Germany')]

Then use a list comprehension to build one step for each pair.

In [31]:
from sklearn.impute import MissingIndicator
steps = [([col], MissingIndicator(missing_values = top_val)) 
         for col, top_val in paired_bools]

In [32]:
steps

# [(['market'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'market_is_Berlin'}),
#  (['state'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'state_is_Berlin'}),
#  (['city'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin', sparse='auto'),
#   {'alias': 'city_is_Berlin'}),
#  (['smart_location'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Germany', sparse='auto'),
#   {'alias': 'smart_location_is_Berlin, Germany'}),
#  (['street'],
#   MissingIndicator(error_on_new=True, features='missing-only',
#                    missing_values='Berlin, Berlin, Germany', sparse='auto'),
#   {'alias': 'street_is_Berlin, Berlin, Germany'})]

[(['market'],
  MissingIndicator(error_on_new=True, features='missing-only',
                   missing_values='Berlin', sparse='auto')),
 (['state'],
  MissingIndicator(error_on_new=True, features='missing-only',
                   missing_values='Berlin', sparse='auto')),
 (['city'],
  MissingIndicator(error_on_new=True, features='missing-only',
                   missing_values='Berlin', sparse='auto')),
 (['smart_location'],
  MissingIndicator(error_on_new=True, features='missing-only',
                   missing_values='Berlin, Germany', sparse='auto')),
 (['street'],
  MissingIndicator(error_on_new=True, features='missing-only',
                   missing_values='Berlin, Berlin, Germany', sparse='auto'))]

In [35]:
from sklearn_pandas import DataFrameMapper
practical_bools_mapper = DataFrameMapper(steps, df_out = True)

In [36]:
df_practical_bools = practical_bools_mapper.fit_transform(df)

In [37]:
df_practical_bools[:2]

| | market | state | city | smart_location | street |
|---|---|---|---|---|---|
| 0 | True | True | True | True | True |
| 1 | True | True | True | True | True |
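For intuition, the comparison that each step performs can be sketched in plain pandas: a cell is `True` when it equals the column's top value. The frame below is made up, and the `_is_` column names follow the renaming idea suggested earlier in this lesson. This is only an illustration of the logic under those assumptions, not a replacement for the mapper.

```python
import pandas as pd

# Hypothetical toy data standing in for two of the listing columns.
toy = pd.DataFrame({
    "market": pd.Series(["Berlin", "Berlin", "Juarez"], dtype="object"),
    "state": pd.Series(["Berlin", "Brandenburg", "Berlin"], dtype="object"),
})
paired = [("market", "Berlin"), ("state", "Berlin")]

# One boolean column per (column, top value) pair.
flags = pd.DataFrame({
    f"{col}_is_{top_val}": toy[col] == top_val
    for col, top_val in paired
})
print(flags)
```

With a plain `==` comparison, a missing value in the original column simply compares as `False`.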


### Remaining Categories

* We may not want to treat columns whose top values contain digits as categorical, since those values could instead be converted to floats.

In [126]:
def num_is_digit(array, str_index = 0):
    return np.array([value[str_index].isdigit() for value in array])

In [127]:
num_is_digit(selected[:, 2], str_index = 0)[0:10]

array([False,  True,  True,  True,  True, False,  True,  True,  True,
        True])

In [128]:
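def remove_digits_from_selected(selected_matrix, col_idx, str_indices = [0, -1]):
    # Note: every pass filters the original matrix, so the returned rows
    # reflect only the last index in str_indices
    for idx in str_indices:
        selected_col = selected_matrix[~num_is_digit(selected_matrix[:, col_idx], idx)]
    return selected_col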

In [130]:
selected_sums_no_digits = remove_digits_from_selected(selected, 2, [0, -1])
selected_sums_no_digits

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7660902121590302, 'Berlin, Berlin, Germany'],
       ['host_response_rate', 0.7380138759449104, '100%'],
       ['host_response_time', 0.5260923586663906, 'within an hour'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['host_verifications', 0.18193508336289466,
        "['email', 'phone', 'reviews']"],
       ['neighbourhood', 0.1498062648802577, 'Neukölln'],
       ['host_neighbourhood', 0.14640852331309429, 'Neukölln'],
       ['calendar_updated', 0.11160872649875843, 'today']], dtype=object)
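Notice that `host_response_rate`, whose top value is `100%`, is still present above: the value starts with a digit but does not end with one, and only the check on the last character takes effect in the loop. If the intent is to drop values that start or end with a digit, one possible variant combines the checks into a single mask. This is a sketch under that assumption; the name `remove_digit_values` and the `zipcode` and `beds` rows are made up for illustration.

```python
import numpy as np

def num_is_digit(array, str_index=0):
    return np.array([value[str_index].isdigit() for value in array])

def remove_digit_values(selected_matrix, col_idx, str_indices=(0, -1)):
    # Start by keeping every row, then drop rows whose value has a digit
    # at any of the requested string positions.
    keep = np.ones(len(selected_matrix), dtype=bool)
    for idx in str_indices:
        keep &= ~num_is_digit(selected_matrix[:, col_idx], idx)
    return selected_matrix[keep]

toy = np.array([
    ["property_type", 0.90, "Apartment"],
    ["host_response_rate", 0.74, "100%"],
    ["zipcode", 0.05, "10245"],
    ["beds", 0.60, "1 bed"],
], dtype=object)

filtered = remove_digit_values(toy, 2)
print(filtered[:, 0])
```

Accumulating into one boolean mask means each index narrows the previous result instead of starting again from the full matrix.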

### Cleaning Values

1. Find columns to clean

In [31]:
def categorical_plus_values(df, threshold = 5):
    categorical_cols = find_categorical(df)
    return [column for column in categorical_cols if len(df[column].value_counts()) > threshold]

In [131]:
selected_cat_cols = selected_sums_no_digits[:, 0]

selected_cat_cols

array(['property_type', 'host_location', 'host_response_rate',
       'host_response_time', 'room_type', 'cancellation_policy',
       'neighbourhood_group_cleansed', 'host_verifications',
       'neighbourhood', 'host_neighbourhood', 'calendar_updated'],
      dtype=object)

In [132]:
cat_cols_df = df_informative[selected_cat_cols]
cat_cols_df[:3]

| | property_type | host_location | host_response_rate | host_response_time | room_type | cancellation_policy | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood | calendar_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Guesthouse | Key Biscayne, Florida, United States | 96% | within an hour | Entire home/apt | strict_14_with_grace_period | Mitte | ['email', 'phone', 'reviews', 'jumio', 'offlin... | Mitte | Mitte | 3 months ago |
| 1 | Apartment | Berlin, Berlin, Germany | | | Private room | flexible | Pankow | ['email', 'phone', 'reviews', 'jumio', 'govern... | | Prenzlauer Berg | 7 weeks ago |
| 2 | Apartment | Coledale, New South Wales, Australia | 100% | within a day | Entire home/apt | strict_14_with_grace_period | Pankow | ['email', 'phone', 'facebook', 'reviews', 'man... | Prenzlauer Berg | Prenzlauer Berg | a week ago |


In [133]:
updated_non_digits = categorical_plus_values(cat_cols_df)

In [134]:
len(updated_non_digits)

8

In [135]:
updated_non_digits

['property_type',
 'host_location',
 'host_response_rate',
 'neighbourhood_group_cleansed',
 'host_verifications',
 'neighbourhood',
 'host_neighbourhood',
 'calendar_updated']

In [64]:
df[updated_non_digits].describe()

| | property_type | host_location | host_response_rate | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood | calendar_updated |
|---|---|---|---|---|---|---|---|---|
| count | 22552 | 22436 | 9657 | 22552 | 22552 | 21421 | 17458 | 22552 |
| unique | 33 | 1036 | 64 | 12 | 301 | 91 | 181 | 75 |
| top | Apartment | Berlin, Berlin, Germany | 100% | Friedrichshain-Kreuzberg | ['email', 'phone', 'reviews'] | Neukölln | Neukölln | today |
| freq | 20225 | 17188 | 7127 | 5497 | 4103 | 3209 | 2556 | 2517 |


In [141]:
# df['property_type'].value_counts(normalize = True)

### Clean Values of Relevant Columns

In [54]:
def selected_cat_values(column, threshold = .02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]

In [144]:
selected = selected_cat_values(df.neighbourhood_cleansed, .02)

In [145]:
selected

Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Prenzlauer Berg Süd         0.024610
Wedding Zentrum             0.022925
Moabit West                 0.021728
nördliche Luisenstadt       0.021462
Schöneberg-Süd              0.021018
Helmholtzplatz              0.020353
Name: neighbourhood_cleansed, dtype: float64

In [156]:
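def reduce_cat_values(column, threshold = .02):
    column = column.copy()
    selected_values = selected_cat_values(column, threshold).index
    # values at or below the threshold (and missing values) are grouped under 'other'
    column[~column.isin(selected_values)] = 'other'
    # astype returns a new Series, so return its result rather than discarding it
    return column.astype('category')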

In [157]:
new_neigh_cleansed =  reduce_cat_values(df.neighbourhood_cleansed, .02)

In [159]:
new_neigh_cleansed.value_counts(normalize = True)

other                       0.380232
Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Prenzlauer Berg Süd         0.024610
Wedding Zentrum             0.022925
Moabit West                 0.021728
nördliche Luisenstadt       0.021462
Schöneberg-Süd              0.021018
Helmholtzplatz              0.020353
Name: neighbourhood_cleansed, dtype: float64
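To make the grouping easier to follow, here is a minimal sketch of the same logic on a made-up series. The `toy` values and the threshold of `.2` are hypothetical. The copy of `reduce_cat_values` below inlines the frequency filter and returns the categorical conversion.

```python
import pandas as pd

def reduce_cat_values(column, threshold=.02):
    column = column.copy()
    shares = column.value_counts(normalize=True)
    keep = shares[shares > threshold].index
    # Anything outside the frequent values, including missing entries,
    # is relabeled as 'other'.
    column[~column.isin(keep)] = 'other'
    return column.astype('category')

toy = pd.Series(["Mitte"] * 5 + ["Pankow"] * 4 + ["Spandau", None],
                dtype="object", name="neighbourhood")

reduced = reduce_cat_values(toy, threshold=.2)
counts = reduced.value_counts().to_dict()
print(counts)
```

Note that the missing entry ends up in `other` as well, because `isin` treats it as not belonging to the frequent values. The reduced column therefore has no missing values left.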

In [151]:
len(df[updated_non_digits].columns)

8

In [70]:
categoricals = ['property_type', 'host_location', 'neighbourhood_cleansed', 'room_type', 'cancellation_policy', 'neighbourhood_group_cleansed', 'host_verifications', 'neighbourhood', 'host_neighbourhood']



In [71]:
def df_reduced_categories(df, categoricals, threshold = .01):
    new_df = pd.DataFrame()
    for category in categoricals:
        new_df[category] = reduce_cat_values(df[category], threshold)
    return new_df

In [72]:
df_reduced = df_reduced_categories(df, categoricals)

In [73]:
df_reduced.describe()

| | property_type | host_location | neighbourhood_cleansed | room_type | cancellation_policy | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood |
|---|---|---|---|---|---|---|---|---|---|
| count | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 | 22552 |
| unique | 5 | 4 | 31 | 3 | 4 | 11 | 18 | 14 | 14 |
| top | Apartment | Berlin, Berlin, Germany | other | Private room | flexible | Friedrichshain-Kreuzberg | ['email', 'phone', 'reviews'] | other | other |
| freq | 20225 | 17188 | 5031 | 11534 | 9102 | 5497 | 4103 | 4152 | 7648 |


In [198]:
summarize_counts(df_reduced)

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7621496984746364, 'Berlin, Berlin, Germany'],
       ['neighbourhood_cleansed', 0.51232706633558, 'other'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['host_neighbourhood', 0.38479957431713374, 'other'],
       ['host_verifications', 0.2920361830436325, 'other'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['neighbourhood', 0.21882759843916283, 'other']], dtype=object)

In [236]:
def replace_df_columns(original_df, replacing_df):
    replacing_cols = replacing_df.columns
    original_df = original_df.drop(columns = replacing_cols)
    new_df = pd.concat([original_df, replacing_df], axis = 1)
    return new_df

In [233]:
new_df = replace_df_columns(df, df_reduced)
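As a small check of what `replace_df_columns` does, the sketch below runs it on two made-up frames. The `toy` and `toy_reduced` data are hypothetical.

```python
import pandas as pd

def replace_df_columns(original_df, replacing_df):
    # Drop the columns being replaced, then append the new versions.
    replacing_cols = replacing_df.columns
    original_df = original_df.drop(columns=replacing_cols)
    return pd.concat([original_df, replacing_df], axis=1)

toy = pd.DataFrame({
    "room_type": ["Private room", "Shared room"],
    "price": [50.0, 20.0],
    "beds": [1, 2],
})
toy_reduced = pd.DataFrame({"room_type": ["Private room", "other"]})

swapped = replace_df_columns(toy, toy_reduced)
print(list(swapped.columns))
```

Two things are worth keeping in mind: the replaced columns move to the end of the frame, and `pd.concat` aligns rows on the index, so both frames should share the same index.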