# Boolean Values

### Introduction

Deciding which features to include and focus on in a linear model is an important skill for any data scientist. As we saw previously, if we include features that are too collinear, we will mismeasure the coefficients of those features. In addition, selecting features and ranking them by feature importance helps us decide where to devote our attention in feature engineering and in building domain understanding. Finally, limiting the number of features in a model, and identifying the most crucial ones, makes the model and its insights easier to understand.

### Working with Airbnb

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mhan1/Data-Science/master/listings_summary.csv')

In [2]:
pd.set_option('display.max_rows',100)

In [3]:
df.head(2)

| | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | requires_license | license | jurisdiction_names | instant_bookable | is_business_travel_ready | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015 | https://www.airbnb.com/rooms/2015 | 20181107122246 | 2018-11-07 | Berlin-Mitte Value! Quiet courtyard/very central | Great location! 30 of 75 sq meters. This wood... | A+++ location! This „Einliegerwohnung“ is an e... | Great location! 30 of 75 sq meters. This wood... | none | It is located in the former East Berlin area o... | ... | t | | | f | f | strict_14_with_grace_period | f | f | 4 | 3.76 |
| 1 | 2695 | https://www.airbnb.com/rooms/2695 | 20181107122246 | 2018-11-07 | Prenzlauer Berg close to Mauerpark | | In the summertime we are spending most of our ... | In the summertime we are spending most of our ... | none | | ... | t | | | f | f | flexible | f | f | 1 | 1.42 |


In [4]:
df.shape

(22552, 96)

### Feature engineering

Let's try to capture as much information as possible from the columns stored with the `object` dtype.

In [5]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [6]:
find_object_features(df)

['listing_url',
 'last_scraped',
 'name',
 'summary',
 'space',
 'description',
 'experiences_offered',
 'neighborhood_overview',
 'notes',
 'transit',
 'access',
 'interaction',
 'house_rules',
 'picture_url',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'country_code',
 'country',
 'is_location_exact',
 'property_type',
 'room_type',
 'bed_type',
 'amenities',
 'price',
 'weekly_price',
 'monthly_price',
 'security_deposit',
 'cleaning_fee',
 'extra_people',
 'calendar_updated',
 'has_availability',
 'calendar_last_scraped',
 'first_review',
 'last_review',
 'requires_license',
 'license',
 'instant_

In [7]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:1].values[0]

In [8]:
find_object_feature_values(df)

array(['https://www.airbnb.com/rooms/2015', '2018-11-07',
       'Berlin-Mitte Value! Quiet courtyard/very central',
       'Great location!  30 of 75 sq meters. This wood floored/high ceiling typical Berlin "Altbau" section of an apartment consists of 1 simple large room, a small kitchen and a bathroom + shower. The apartment is in Mitte, close to Prenzlauer Berg/Mauerpark. Perfect for short visits, singles or couples. Your section is closed from the rest of the bigger flat wich is not noticeable. You will not be sharing your space.',
       'A+++ location! This „Einliegerwohnung“ is an extention of a larger apartment with a separate entrance, bathroom and kitchen. The door to the rest of the apartment is soundproof, hidden, locked and barely noticable (behind mirror in pictures). Your 30 sq meters are facing a quiet courtyard. This wood floored/high ceiling typical Berlin "Altbau" apartment consists of 1 large room with a large double bed, optionally with an extra matress for a 3rd g

In [9]:
import numpy as np

def find_booleans(df):
    # Exploratory version: returns every intermediate array so each step can be inspected.
    columns = df.columns
    # Columns with exactly two distinct non-null values.
    boolean_columns = np.array([column for column in columns if len(df[column].value_counts(dropna=True)) == 2])
    boolean_values = np.array([df[column].value_counts(dropna=True).index for column in boolean_columns])
    stacked = np.stack((boolean_columns, boolean_values[:, 0], boolean_values[:, 1]))
    return boolean_columns, boolean_values, boolean_values[:, 0], boolean_values[:, 1], stacked, stacked.T

In [10]:
find_booleans(df)

(array(['last_scraped', 'host_is_superhost', 'host_has_profile_pic',
        'host_identity_verified', 'is_location_exact',
        'calendar_last_scraped', 'requires_license', 'instant_bookable',
        'require_guest_profile_picture',
        'require_guest_phone_verification'], dtype='<U32'),
 array([['2018-11-07', '2018-11-09'],
        ['f', 't'],
        ['t', 'f'],
        ['f', 't'],
        ['t', 'f'],
        ['2018-11-07', '2018-11-09'],
        ['t', 'f'],
        ['f', 't'],
        ['f', 't'],
        ['f', 't']], dtype=object),
 array(['2018-11-07', 'f', 't', 'f', 't', '2018-11-07', 't', 'f', 'f', 'f'],
       dtype=object),
 array(['2018-11-09', 't', 'f', 't', 'f', '2018-11-09', 'f', 't', 't', 't'],
       dtype=object),
 array([['last_scraped', 'host_is_superhost', 'host_has_profile_pic',
         'host_identity_verified', 'is_location_exact',
         'calendar_last_scraped', 'requires_license', 'instant_bookable',
         'require_guest_profile_picture',
         '

In [11]:
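import numpy as np
def find_booleans(df):
    columns = df.columns
    # Columns with exactly two distinct non-null values.
    boolean_columns = np.array([column for column in columns if len(df[column].value_counts(dropna=True)) == 2])
    # Full value arrays for those columns, one row per column.
    boolean_values = np.array([df[column] for column in boolean_columns])
    # Pair each column name with the values in its first two rows.
    # Note: these are sample values, not the two distinct values of the column.
    columns_and_values = np.stack((boolean_columns, boolean_values[:, 0], boolean_values[:, 1])).T
    return columns_and_values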

In [12]:
find_booleans(df)

array([['last_scraped', '2018-11-07', '2018-11-07'],
       ['host_is_superhost', 't', 'f'],
       ['host_has_profile_pic', 't', 't'],
       ['host_identity_verified', 't', 't'],
       ['is_location_exact', 'f', 't'],
       ['calendar_last_scraped', '2018-11-07', '2018-11-07'],
       ['requires_license', 't', 't'],
       ['instant_bookable', 'f', 'f'],
       ['require_guest_profile_picture', 'f', 'f'],
       ['require_guest_phone_verification', 'f', 'f']], dtype=object)
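
The matrix above pairs each column with the values found in its first two rows, which is why some rows show the same value twice. A minimal sketch of a variant that reports the two distinct values instead, assuming a hypothetical helper named `two_valued_columns` and a small toy frame (neither is part of the original notebook):

```python
import pandas as pd

def two_valued_columns(frame):
    """Map each column with exactly two distinct non-null values to those values."""
    result = {}
    for column in frame.columns:
        # dropna() keeps missing entries from counting as a third value.
        distinct = frame[column].dropna().unique()
        if len(distinct) == 2:
            result[column] = sorted(distinct)
    return result

# Toy data for illustration only.
toy = pd.DataFrame({
    'instant_bookable': ['f', 'f', 't', None],
    'room_type': ['Private room', 'Entire home/apt', 'Shared room', 'Private room'],
    'last_scraped': ['2018-11-07', '2018-11-07', '2018-11-09', '2018-11-07'],
})
pairs = two_valued_columns(toy)
print(pairs)
```

Returning a dictionary keyed by column name avoids positional indexing such as `[:, 1]` when filtering for `'t'`/`'f'` columns later.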

In [13]:
df.host_acceptance_rate.value_counts()

Series([], Name: host_acceptance_rate, dtype: int64)

In [14]:
df.host_acceptance_rate.value_counts().index

Float64Index([], dtype='float64')

In [15]:
df.host_acceptance_rate.index

RangeIndex(start=0, stop=22552, step=1)

In [16]:
df.host_acceptance_rate.head(2)

0   NaN
1   NaN
Name: host_acceptance_rate, dtype: float64

In [17]:
boolean_columns = find_booleans(df)

In [18]:
boolean_columns

array([['last_scraped', '2018-11-07', '2018-11-07'],
       ['host_is_superhost', 't', 'f'],
       ['host_has_profile_pic', 't', 't'],
       ['host_identity_verified', 't', 't'],
       ['is_location_exact', 'f', 't'],
       ['calendar_last_scraped', '2018-11-07', '2018-11-07'],
       ['requires_license', 't', 't'],
       ['instant_bookable', 'f', 'f'],
       ['require_guest_profile_picture', 'f', 'f'],
       ['require_guest_phone_verification', 'f', 'f']], dtype=object)

In [19]:
boolean_columns[:, 1]

array(['2018-11-07', 't', 't', 't', 'f', '2018-11-07', 't', 'f', 'f', 'f'],
      dtype=object)

In [20]:
def select_boolean(df, values = ()):
    # Exploratory version: also returns the candidate matrix and the boolean mask.
    boolean_columns = find_booleans(df)
    matches = np.isin(boolean_columns[:, 1], values)
    return boolean_columns, matches, boolean_columns[matches]

In [21]:
boolean_values = ['t', 'f']
select_boolean(df, boolean_values)

(array([['last_scraped', '2018-11-07', '2018-11-07'],
        ['host_is_superhost', 't', 'f'],
        ['host_has_profile_pic', 't', 't'],
        ['host_identity_verified', 't', 't'],
        ['is_location_exact', 'f', 't'],
        ['calendar_last_scraped', '2018-11-07', '2018-11-07'],
        ['requires_license', 't', 't'],
        ['instant_bookable', 'f', 'f'],
        ['require_guest_profile_picture', 'f', 'f'],
        ['require_guest_phone_verification', 'f', 'f']], dtype=object),
 array([False,  True,  True,  True,  True, False,  True,  True,  True,
         True]),
 array([['host_is_superhost', 't', 'f'],
        ['host_has_profile_pic', 't', 't'],
        ['host_identity_verified', 't', 't'],
        ['is_location_exact', 'f', 't'],
        ['requires_license', 't', 't'],
        ['instant_bookable', 'f', 'f'],
        ['require_guest_profile_picture', 'f', 'f'],
        ['require_guest_phone_verification', 'f', 'f']], dtype=object))

In [22]:
def select_booleans(df, values = ()):
    # Keep the two-valued columns whose first-row value is one of `values`.
    boolean_columns = find_booleans(df)
    matches = np.isin(boolean_columns[:, 1], values)
    return boolean_columns[matches]

In [23]:
boolean_values = ['t', 'f']
select_booleans(df, boolean_values)

array([['host_is_superhost', 't', 'f'],
       ['host_has_profile_pic', 't', 't'],
       ['host_identity_verified', 't', 't'],
       ['is_location_exact', 'f', 't'],
       ['requires_license', 't', 't'],
       ['instant_bookable', 'f', 'f'],
       ['require_guest_profile_picture', 'f', 'f'],
       ['require_guest_phone_verification', 'f', 'f']], dtype=object)

In [24]:
select_booleans(df, boolean_values)[:, 0]

array(['host_is_superhost', 'host_has_profile_pic',
       'host_identity_verified', 'is_location_exact', 'requires_license',
       'instant_bookable', 'require_guest_profile_picture',
       'require_guest_phone_verification'], dtype=object)

In [25]:
False == 0

True

In [26]:
boolean_mapping = {'t': 1, 'f': 0}

In [27]:
boolean_mapping.keys()

dict_keys(['t', 'f'])

In [28]:
boolean_mapping.values()

dict_values([1, 0])

In [29]:
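import numpy as np
def to_booleans(df, boolean_mapping):
    # The mapping's keys ('t', 'f') identify which two-valued columns to convert.
    boolean_values = list(boolean_mapping.keys())
    boolean_features = select_booleans(df, boolean_values)[:, 0]
    boolean_df = pd.DataFrame({})
    for feature in boolean_features:
        # Missing values stay NaN after map, so columns that contain them become float64.
        boolean_df[feature] = df[feature].map(boolean_mapping)
    return boolean_df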

In [30]:
to_booleans(df, boolean_mapping).head()

| | host_is_superhost | host_has_profile_pic | host_identity_verified | is_location_exact | requires_license | instant_bookable | require_guest_profile_picture | require_guest_phone_verification |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0.0 | 1.0 | 1.0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0.0 | 1.0 | 1.0 | 1 | 1 | 1 | 0 | 0 |
| 3 | 0.0 | 1.0 | 1.0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1.0 | 1.0 | 1.0 | 1 | 1 | 0 | 0 | 0 |


In [31]:
new_boolean_cols = to_booleans(df, boolean_mapping)
new_boolean_cols[0:2]

| | host_is_superhost | host_has_profile_pic | host_identity_verified | is_location_exact | requires_license | instant_bookable | require_guest_profile_picture | require_guest_phone_verification |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0.0 | 1.0 | 1.0 | 1 | 1 | 0 | 0 | 0 |


In [32]:
new_boolean_cols['host_is_superhost'].dtypes

dtype('float64')

In [33]:
new_boolean_cols.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 8 columns):
host_is_superhost                   22526 non-null float64
host_has_profile_pic                22526 non-null float64
host_identity_verified              22526 non-null float64
is_location_exact                   22552 non-null int64
requires_license                    22552 non-null int64
instant_bookable                    22552 non-null int64
require_guest_profile_picture       22552 non-null int64
require_guest_phone_verification    22552 non-null int64
dtypes: float64(3), int64(5)
memory usage: 1.4 MB
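
The three host columns come out as `float64` because they contain missing values, which `map` leaves as `NaN`. A minimal sketch of one way to keep an integer representation, assuming pandas' nullable `Int64` dtype is acceptable downstream and using a toy series in place of the real column:

```python
import pandas as pd

boolean_mapping = {'t': 1, 'f': 0}

# Toy stand-in for a 't'/'f' column with one missing entry.
host_is_superhost = pd.Series(['t', 'f', None, 'f'], name='host_is_superhost')

# map leaves the missing entry as NaN, which forces a floating-point column.
as_float = host_is_superhost.map(boolean_mapping)

# The nullable integer dtype stores 1/0 as integers and the gap as <NA>.
as_nullable_int = as_float.astype('Int64')
print(as_nullable_int)
```

Whether to keep the gaps, fill them, or drop those rows is a modeling decision; the sketch only shows that the integer encoding does not have to be lost.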


### Detecting Almost Binary Features

In [34]:
def almost_binary(df):
    # First pass: the share of the most frequent value in each column that is not entirely missing.
    non_empty_columns = df.dropna(axis=1,how='all').columns
    return np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)

In [35]:
non_empty_columns = df.dropna(axis=1, how='all').columns
print(non_empty_columns)
print(len(non_empty_columns))
print(len(df.columns))

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'we

In [36]:
print(df['zipcode'].value_counts(normalize=True))
print(df['zipcode'].value_counts())

10245                    0.039048
10247                    0.037998
10437                    0.032106
10115                    0.030188
10997                    0.027311
10439                    0.027220
10405                    0.026717
10999                    0.026215
10119                    0.024525
10967                    0.022333
12049                    0.021328
10249                    0.021282
12051                    0.021237
10243                    0.020415
12047                    0.019547
10961                    0.019182
10435                    0.019136
12045                    0.018177
12043                    0.017537
12053                    0.016944
12059                    0.015939
10407                    0.015893
12055                    0.015208
10965                    0.014295
13353                    0.014249
13357                    0.014158
10117                    0.014021
10179                    0.012788
13347                    0.011418
10178         

In [37]:
df.shape

(22552, 96)

In [38]:
df['zipcode'].head()

0    10119
1    10437
2    10405
3    10777
4    10437
Name: zipcode, dtype: object

In [39]:
df['zipcode'].value_counts(normalize=True).values[0:5]

array([0.03904823, 0.03799781, 0.03210632, 0.03018816, 0.02731092])

In [40]:
almost_binary(df)

array([[4.43419652e-05],
       [4.43419652e-05],
       [1.00000000e+00],
       [9.99866974e-01],
       [6.22415863e-04],
       [6.48478392e-04],
       [2.92439372e-03],
       [5.36936776e-04],
       [1.00000000e+00],
       [3.72616984e-03],
       [5.12820513e-03],
       [1.99447683e-03],
       [3.87561133e-03],
       [4.13223140e-03],
       [2.62031618e-03],
       [1.33025896e-04],
       [1.99538844e-03],
       [1.99538844e-03],
       [9.58891947e-03],
       [2.26405043e-03],
       [7.66090212e-01],
       [6.34551792e-03],
       [5.26092359e-01],
       [7.38013876e-01],
       [8.66332238e-01],
       [2.48601616e-03],
       [2.48601616e-03],
       [1.46408523e-01],
       [7.17082482e-01],
       [7.17082482e-01],
       [1.81935083e-01],
       [9.97513984e-01],
       [6.13690846e-01],
       [9.89579638e-01],
       [1.49806265e-01],
       [5.87531039e-02],
       [2.43747783e-01],
       [9.94234266e-01],
       [9.97730105e-01],
       [3.90482280e-02],


In [41]:
len(almost_binary(df))

91

In [42]:
def summarize_counts(df):
    # Columns that are entirely missing have no value counts, so leave them out.
    non_empty_columns = df.dropna(axis=1,how='all').columns
    # Share of non-missing rows taken by the most frequent value in each column.
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)
    columns = np.array(non_empty_columns).reshape(-1, 1)
    # The most frequent value itself.
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in non_empty_columns]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    # Sort rows by frequency, highest first.
    return summarize[summarize[:, 1].argsort()[::-1]]

In [43]:
df['zipcode'].value_counts(normalize=True).index

Index(['10245', '10247', '10437', '10115', '10997', '10439', '10405', '10999',
       '10119', '10967',
       ...
       '10969\n10969', '10119\n10119', '1043', '10349', '9248', '10176',
       '15366', '10436', '16548', '2455'],
      dtype='object', length=215)

In [44]:
summarize_counts(df)

array([['country', 1.0, 'Germany'],
       ['country_code', 1.0, 'DE'],
       ['experiences_offered', 1.0, 'none'],
       ['is_business_travel_ready', 1.0, 'f'],
       ['has_availability', 1.0, 't'],
       ['scrape_id', 1.0, '20181107122246'],
       ['last_scraped', 0.9998669741042923, '2018-11-07'],
       ['calendar_last_scraped', 0.9998669741042923, '2018-11-07'],
       ['market', 0.9997332028991952, 'Berlin'],
       ['requires_license', 0.9996452642781128, 't'],
       ['state', 0.9977301050382766, 'Berlin'],
       ['host_has_profile_pic', 0.997513983840895, 't'],
       ['city', 0.9942342661994944, 'Berlin'],
       ['smart_location', 0.9941025186236254, 'Berlin, Germany'],
       ['require_guest_profile_picture', 0.9930826534231997, 'f'],
       ['street', 0.9895796381695636, 'Berlin, Berlin, Germany'],
       ['require_guest_phone_verification', 0.9876285916991842, 'f'],
       ['bed_type', 0.9651472153245831, 'Real Bed'],
       ['property_type', 0.8968162468960624, 'Ap

In [45]:
summary = summarize_counts(df)

In [46]:
summary

array([['country', 1.0, 'Germany'],
       ['country_code', 1.0, 'DE'],
       ['experiences_offered', 1.0, 'none'],
       ['is_business_travel_ready', 1.0, 'f'],
       ['has_availability', 1.0, 't'],
       ['scrape_id', 1.0, '20181107122246'],
       ['last_scraped', 0.9998669741042923, '2018-11-07'],
       ['calendar_last_scraped', 0.9998669741042923, '2018-11-07'],
       ['market', 0.9997332028991952, 'Berlin'],
       ['requires_license', 0.9996452642781128, 't'],
       ['state', 0.9977301050382766, 'Berlin'],
       ['host_has_profile_pic', 0.997513983840895, 't'],
       ['city', 0.9942342661994944, 'Berlin'],
       ['smart_location', 0.9941025186236254, 'Berlin, Germany'],
       ['require_guest_profile_picture', 0.9930826534231997, 'f'],
       ['street', 0.9895796381695636, 'Berlin, Berlin, Germany'],
       ['require_guest_phone_verification', 0.9876285916991842, 'f'],
       ['bed_type', 0.9651472153245831, 'Real Bed'],
       ['property_type', 0.8968162468960624, 'Ap
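
`summarize_counts` stacks column names, frequencies, and top values into a single object array. A minimal sketch of the same summary as a DataFrame, which keeps the frequencies numeric and the columns named; the helper name `top_value_summary` and the toy data are assumptions for illustration:

```python
import pandas as pd

def top_value_summary(frame):
    """Return one row per column: its most frequent value and that value's share."""
    rows = []
    for column in frame.columns:
        shares = frame[column].value_counts(normalize=True)
        if shares.empty:
            # The column is entirely missing, so there is no top value to report.
            continue
        rows.append({'column': column,
                     'top_value': shares.index[0],
                     'share': shares.iloc[0]})
    summary = pd.DataFrame(rows)
    return summary.sort_values('share', ascending=False).reset_index(drop=True)

# Toy data for illustration only.
toy = pd.DataFrame({
    'country': ['Germany'] * 5,
    'bed_type': ['Real Bed'] * 4 + ['Futon'],
    'empty': [None] * 5,
    'room_type': ['Private room'] * 3 + ['Entire home/apt', 'Shared room'],
})
summary_df = top_value_summary(toy)
print(summary_df)
```

With named columns, a threshold filter can be written as `summary_df[(summary_df['share'] > 0.95) & (summary_df['share'] < 1.0)]`.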

In [47]:
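def almost_binary(df, threshold = .95):
    # Columns whose most frequent value covers more than `threshold` of the non-missing rows, but not all of them.
    return np.array([np.array([cat, top]) for cat, frequency, top in summarize_counts(df) if 1.0 > frequency > threshold])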

In [48]:
almost_bin_feats = almost_binary(df)

In [49]:
almost_bin_feats

array([['last_scraped', '2018-11-07'],
       ['calendar_last_scraped', '2018-11-07'],
       ['market', 'Berlin'],
       ['requires_license', 't'],
       ['state', 'Berlin'],
       ['host_has_profile_pic', 't'],
       ['city', 'Berlin'],
       ['smart_location', 'Berlin, Germany'],
       ['require_guest_profile_picture', 'f'],
       ['street', 'Berlin, Berlin, Germany'],
       ['require_guest_phone_verification', 'f'],
       ['bed_type', 'Real Bed']], dtype='<U32')

In [50]:
def remove_punctuation(string):
    return string.strip().lower().replace(' ', '_').replace('(', '').replace(')', '').replace(',', '')

In [51]:
def matrix_new_features(df):
    bin_feats = almost_binary(df)
    new_bin_feats = np.array(['{column}_is_{top}'.format(column = column, top = remove_punctuation(top)) for column, top in bin_feats])
    return np.hstack((bin_feats[:, 0].reshape(-1, 1), bin_feats[:, 1].reshape(-1, 1), new_bin_feats.reshape(-1, 1)))

In [52]:
potential_new_features = matrix_new_features(df)

In [53]:
potential_new_features

array([['last_scraped', '2018-11-07', 'last_scraped_is_2018-11-07'],
       ['calendar_last_scraped', '2018-11-07',
        'calendar_last_scraped_is_2018-11-07'],
       ['market', 'Berlin', 'market_is_berlin'],
       ['requires_license', 't', 'requires_license_is_t'],
       ['state', 'Berlin', 'state_is_berlin'],
       ['host_has_profile_pic', 't', 'host_has_profile_pic_is_t'],
       ['city', 'Berlin', 'city_is_berlin'],
       ['smart_location', 'Berlin, Germany',
        'smart_location_is_berlin_germany'],
       ['require_guest_profile_picture', 'f',
        'require_guest_profile_picture_is_f'],
       ['street', 'Berlin, Berlin, Germany',
        'street_is_berlin_berlin_germany'],
       ['require_guest_phone_verification', 'f',
        'require_guest_phone_verification_is_f'],
       ['bed_type', 'Real Bed', 'bed_type_is_real_bed']], dtype='<U37')
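
Some of the generated names above still carry characters such as the hyphens in `last_scraped_is_2018-11-07`, because `remove_punctuation` only handles spaces, parentheses, and commas. A minimal sketch of a more general normalizer, assuming a hypothetical helper named `to_snake_case` (not part of the original notebook) built on the standard `re` module:

```python
import re

def to_snake_case(text):
    """Lowercase `text` and collapse every run of non-word characters into one underscore."""
    lowered = text.strip().lower()
    # \W matches anything that is not a letter, digit, or underscore.
    return re.sub(r'\W+', '_', lowered).strip('_')

examples = ['Berlin, Germany', 'Real Bed', '2018-11-07', 'Entire home/apt']
normalized = [to_snake_case(example) for example in examples]
print(normalized)
```

Swapping this in for `remove_punctuation` would change the generated column names, so any code that refers to the existing names would need to be updated with it.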

In [54]:
def booleans_without_top_values(df, not_values):
    potential_new_features = matrix_new_features(df)
    not_tf = ~np.isin(potential_new_features[:, 1], not_values)
    return potential_new_features[not_tf]

In [55]:
not_values = ['t', 'f', '2018-11-07']
not_tf = ~np.isin(potential_new_features[:, 1], not_values)
not_tf

array([False, False,  True, False,  True, False,  True,  True, False,
        True, False,  True])

In [56]:
selected_booleans = booleans_without_top_values(df, ['t', 'f', '2018-11-07'])

In [57]:
selected_booleans

array([['market', 'Berlin', 'market_is_berlin'],
       ['state', 'Berlin', 'state_is_berlin'],
       ['city', 'Berlin', 'city_is_berlin'],
       ['smart_location', 'Berlin, Germany',
        'smart_location_is_berlin_germany'],
       ['street', 'Berlin, Berlin, Germany',
        'street_is_berlin_berlin_germany'],
       ['bed_type', 'Real Bed', 'bed_type_is_real_bed']], dtype='<U37')

In [58]:
selected_bool_cols = selected_booleans[:, 0]
selected_booleans_df = df[selected_bool_cols]
print(selected_bool_cols)
selected_booleans_df.head(3)

['market' 'state' 'city' 'smart_location' 'street' 'bed_type']


| | market | state | city | smart_location | street | bed_type |
|---|---|---|---|---|---|---|
| 0 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |
| 1 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |
| 2 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |


In [59]:
def almost_to_boolean(df):
    # Each row of the matrix is (original column, top value, new column name).
    new_features = matrix_new_features(df)
    to_replace_df = pd.DataFrame({})
    for column, value, new_name in new_features:
        # 1 where the row holds the column's top value, 0 otherwise (including missing values).
        to_replace_df[new_name] = np.where(df[column] == value,1,0)
    return to_replace_df

In [60]:
columns_to_replace = matrix_new_features(df)[:, 0]
values_to_replace = matrix_new_features(df)[:, 1]
new_column_names = matrix_new_features(df)[:, 2]

print(list(zip(columns_to_replace, values_to_replace, new_column_names)))

[('last_scraped', '2018-11-07', 'last_scraped_is_2018-11-07'), ('calendar_last_scraped', '2018-11-07', 'calendar_last_scraped_is_2018-11-07'), ('market', 'Berlin', 'market_is_berlin'), ('requires_license', 't', 'requires_license_is_t'), ('state', 'Berlin', 'state_is_berlin'), ('host_has_profile_pic', 't', 'host_has_profile_pic_is_t'), ('city', 'Berlin', 'city_is_berlin'), ('smart_location', 'Berlin, Germany', 'smart_location_is_berlin_germany'), ('require_guest_profile_picture', 'f', 'require_guest_profile_picture_is_f'), ('street', 'Berlin, Berlin, Germany', 'street_is_berlin_berlin_germany'), ('require_guest_phone_verification', 'f', 'require_guest_phone_verification_is_f'), ('bed_type', 'Real Bed', 'bed_type_is_real_bed')]


In [61]:
almost = almost_to_boolean(selected_booleans_df)

In [62]:
almost.head(11)

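| | market_is_berlin | state_is_berlin | city_is_berlin | smart_location_is_berlin_germany | street_is_berlin_berlin_germany | bed_type_is_real_bed |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 0 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| 5 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 | 1 | 1 | 1 |
| 7 | 1 | 1 | 1 | 1 | 1 | 1 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 |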


In [63]:
almost.dtypes

market_is_berlin                    int32
state_is_berlin                     int32
city_is_berlin                      int32
smart_location_is_berlin_germany    int32
street_is_berlin_berlin_germany     int32
bed_type_is_real_bed                int32
dtype: object
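
Because `np.where(df[column] == value, 1, 0)` compares each row to the top value, a row with a missing entry is encoded as 0, the same as a row with a different value. A minimal sketch of that behavior and of a variant that keeps missing entries missing, using a toy `city` series and pandas' nullable `Int64` dtype as assumptions:

```python
import pandas as pd

# Toy stand-in for an almost-binary column with one missing entry.
city = pd.Series(['Berlin', 'Berlin', None, 'Potsdam'], name='city')

# Same encoding as np.where(city == 'Berlin', 1, 0): the missing row becomes 0.
city_is_berlin = city.eq('Berlin').astype(int)

# Variant that leaves the missing row as <NA> instead of 0.
city_is_berlin_nullable = city.eq('Berlin').astype('Int64').mask(city.isna())

print(city_is_berlin.tolist())
print(city_is_berlin_nullable)
```

Which encoding is preferable depends on whether "not Berlin" and "unknown" should be treated alike in the model.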

In [64]:
boolean_mapping = {'t': 1, 'f': 0}
boolean_values = ['t', 'f']

def df_with_replaced_columns(original_df, selected_booleans_df):
    # Drop the almost-binary columns that will be replaced by indicator columns.
    matrix_features = matrix_new_features(selected_booleans_df)
    cols_to_drop = matrix_features[:, 0]
    pruned_df = original_df.copy().drop(cols_to_drop, axis = 1)

    # Drop the 't'/'f' columns that will be replaced by their 1/0 versions.
    boolean_df = to_booleans(original_df, boolean_mapping)
    boolean_features = select_booleans(original_df, boolean_values)[:,0]
    pruned_df2 = pruned_df.drop(boolean_features, axis = 1)

    # Append the new indicator columns and the converted boolean columns.
    return pd.concat([pruned_df2, almost_to_boolean(selected_booleans_df), boolean_df], axis = 1)

In [65]:
new_df = df_with_replaced_columns(df, selected_booleans_df)
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22552 entries, 0 to 22551
Data columns (total 96 columns):
id                                  22552 non-null int64
listing_url                         22552 non-null object
scrape_id                           22552 non-null int64
last_scraped                        22552 non-null object
name                                22493 non-null object
summary                             21589 non-null object
space                               14020 non-null object
description                         22349 non-null object
experiences_offered                 22552 non-null object
neighborhood_overview               11540 non-null object
notes                               7215 non-null object
transit                             13036 non-null object
access                              10837 non-null object
interaction                         10406 non-null object
house_rules                         11449 non-null object
thumbnail_url           

In [66]:
matrix_features = matrix_new_features(selected_booleans_df)
matrix_features    

array([['market', 'Berlin', 'market_is_berlin'],
       ['state', 'Berlin', 'state_is_berlin'],
       ['city', 'Berlin', 'city_is_berlin'],
       ['smart_location', 'Berlin, Germany',
        'smart_location_is_berlin_germany'],
       ['street', 'Berlin, Berlin, Germany',
        'street_is_berlin_berlin_germany'],
       ['bed_type', 'Real Bed', 'bed_type_is_real_bed']], dtype='<U32')

In [67]:
selected_booleans_df.head(3)

| | market | state | city | smart_location | street | bed_type |
|---|---|---|---|---|---|---|
| 0 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |
| 1 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |
| 2 | Berlin | Berlin | Berlin | Berlin, Germany | Berlin, Berlin, Germany | Real Bed |


In [68]:
new_df['market_is_berlin'].head()

0    1
1    1
2    1
3    1
4    1
Name: market_is_berlin, dtype: int32

In [69]:
new_df['host_is_superhost'].head()

0    1.0
1    0.0
2    0.0
3    0.0
4    1.0
Name: host_is_superhost, dtype: float64

In [70]:
new_df.shape

(22552, 96)

In [71]:
new_df.head(2).T

| | 0 | 1 |
|---|---|---|
| id | 2015 | 2695 |
| listing_url | https://www.airbnb.com/rooms/2015 | https://www.airbnb.com/rooms/2695 |
| scrape_id | 20181107122246 | 20181107122246 |
| last_scraped | 2018-11-07 | 2018-11-07 |
| name | Berlin-Mitte Value! Quiet courtyard/very central | Prenzlauer Berg close to Mauerpark |
| summary | Great location! 30 of 75 sq meters. This wood... | |
| space | A+++ location! This „Einliegerwohnung“ is an e... | In the summertime we are spending most of our ... |
| description | Great location! 30 of 75 sq meters. This wood... | In the summertime we are spending most of our ... |
| experiences_offered | none | none |
| neighborhood_overview | It is located in the former East Berlin area o... | |


### Summary