# Boolean Values

### Introduction

Deciding which features to include and focus on in a linear model is an important skill for any data scientist.  As we saw previously, if we include features that are too collinear, we will mismeasure the coefficients of those features.  In addition, selecting features and ranking them by feature importance helps us decide which features deserve our attention in feature engineering and domain understanding.  Finally, limiting the number of features in a model and identifying the most crucial ones makes the model and its insights easier to understand.

### Working with Airbnb

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/mhan1/Data-Science/master/listings_summary.csv')

In [2]:
pd.set_option('display.max_rows',100)

### Feature engineering

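Many of the columns in this dataset have the `object` dtype, which for data read from a CSV typically means they hold strings.  Let's try to capture as much of this object data as possible, starting with the columns that hold only two distinct values.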

In [3]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [4]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:1].values[0]
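
To see what these helpers look for without downloading the listings file, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical, and the columns are built with an explicit `object` dtype so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the listings file
toy = pd.DataFrame({
    "host_is_superhost": pd.Series(["t", "f", "f", "t"], dtype=object),
    "room_type": pd.Series(
        ["Private room", "Entire home/apt", "Private room", "Shared room"],
        dtype=object,
    ),
    "price": [60, 17, 90, 26],
})

# Same test as find_object_features: keep columns stored with the object dtype
object_columns = list(toy.dtypes[toy.dtypes == "object"].index)

# Of those, the columns with exactly two distinct non-null values
two_valued = [col for col in object_columns if toy[col].nunique(dropna=True) == 2]

print(object_columns)
print(two_valued)
```

Only the two-valued column is a candidate for a direct boolean encoding; `room_type` has three distinct values, so it would need a different treatment.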

In [5]:
import numpy as np
def find_booleans(df):
    columns = df.columns
    # Columns with exactly two distinct non-null values are candidate booleans
    boolean_columns = np.array([column for column in columns if len(df[column].value_counts(dropna=True)) == 2])
    # Index.tolist() also exists on older pandas versions that lack Index.to_list()
    boolean_values = np.array([df[column].value_counts(dropna=True).index.tolist() for column in boolean_columns])
    # One row per column: (column name, most frequent value, other value)
    columns_and_values = np.stack((boolean_columns, boolean_values[:, 0], boolean_values[:, 1])).T
    return columns_and_values

In [6]:
boolean_columns = find_booleans(df)

In [7]:
def select_booleans(df, values=()):
    # Keep the two-valued columns whose most frequent value is one of `values`
    boolean_columns = find_booleans(df)
    matches = np.isin(boolean_columns[:, 1], values)
    return boolean_columns[matches]

In [8]:
boolean_values = ['t', 'f']
select_booleans(df, boolean_values)

In [9]:
boolean_mapping = {'t': 1, 'f': 0}
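
As a quick illustration of how a mapping like this behaves, the sketch below applies it with `Series.map` to a small, made-up column (the values are hypothetical, not taken from the listings). It also shows what happens to entries that have no key in the mapping.

```python
import pandas as pd

# Hypothetical column using the 't'/'f' encoding, with one missing entry
superhost = pd.Series(["t", "f", None, "t"], dtype=object, name="host_is_superhost")

boolean_mapping = {"t": 1, "f": 0}

# Entries without a key in the mapping (including missing ones) come back as NaN
encoded = superhost.map(boolean_mapping)

print(encoded.tolist())
print(encoded.isna().sum())
```

Because unmapped entries become `NaN`, it is worth checking for missing values after the mapping before feeding the column to a model.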

In [10]:
def to_booleans(df, boolean_mapping):
    # Keep the two-valued columns whose most frequent value is a key of the mapping
    boolean_values = list(boolean_mapping.keys())
    boolean_features = select_booleans(df, boolean_values)[:, 0]
    boolean_df = pd.DataFrame({})
    for feature in boolean_features:
        # Translate each raw value (for example 't' or 'f') to its numeric code
        boolean_df[feature] = df[feature].map(boolean_mapping)
    return boolean_df[boolean_features]

In [11]:
new_boolean_cols = to_booleans(df, boolean_mapping)
new_boolean_cols[0:2]

### Detecting Almost Binary Features

In [12]:
def almost_binary(df):
    non_empty_columns = df.dropna(axis=1,how='all').columns
    return np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)


In [13]:
def summarize_counts(df):
    # Skip columns that contain no values at all
    non_empty_columns = df.dropna(axis=1,how='all').columns
    # Share of non-null rows held by each column's most frequent value
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)
    # Index.values also exists on older pandas versions that lack Index.to_numpy()
    columns = non_empty_columns.values.reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in non_empty_columns]).reshape(-1, 1)
    # Rows of (column name, top frequency, top value), sorted by frequency, highest first
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [14]:
summary = summarize_counts(df)

In [15]:
summary[:2]

In [18]:
def almost_binary(df, threshold = .95):
    return np.array([np.array([cat, top]) for cat, frequency, top in summarize_counts(df) if 1.0 > frequency > threshold])
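
The rule used by `almost_binary` is that a column's most frequent value must cover more than `threshold` of the non-null rows, but not all of them. A minimal sketch of that test on a single made-up column follows; the column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical column where one value dominates but is not the only one
bed_type = pd.Series(["Real Bed"] * 97 + ["Futon", "Couch", "Airbed"], dtype=object)

# Share of non-null rows taken by each value, most frequent first
shares = bed_type.value_counts(normalize=True)
top_value, top_share = shares.index[0], shares.iloc[0]

threshold = 0.95
# Same comparison as in almost_binary: dominant, but not the only value
is_almost_binary = 1.0 > top_share > threshold

print(top_value, top_share, is_almost_binary)
```

The upper bound of `1.0` excludes constant columns, which carry no information and cannot be turned into a useful 0/1 feature.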

In [19]:
almost_bin_feats = almost_binary(df)

In [20]:
almost_bin_feats

In [22]:
def remove_punctuation(string):
    return string.strip().lower().replace(' ', '_').replace('(', '').replace(')', '').replace(',', '')
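
To make the effect of this cleaning step concrete, the sketch below repeats the function and runs it on a few made-up values, then builds a column name in the `{column}_is_{top}` format used in the next cell. The sample strings and the `bed_type` column name are hypothetical.

```python
def remove_punctuation(string):
    # Same cleaning steps as the function above
    return string.strip().lower().replace(' ', '_').replace('(', '').replace(')', '').replace(',', '')

# Hypothetical raw values
raw_values = [" Real Bed ", "Entire home/apt", "Berlin, Germany (DE)"]
cleaned = [remove_punctuation(value) for value in raw_values]

# Build a new column name in the '{column}_is_{top}' format
new_name = "{column}_is_{top}".format(column="bed_type", top=cleaned[0])

print(cleaned)
print(new_name)
```

Note that only spaces, parentheses, and commas are handled; other characters, such as the slash in the second value, pass through unchanged.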

In [23]:
def matrix_new_features(df):
    bin_feats = almost_binary(df)
    new_bin_feats = np.array(['{column}_is_{top}'.format(column = column, top = remove_punctuation(top)) for column, top in bin_feats])
    return np.hstack((bin_feats[:, 0].reshape(-1, 1), bin_feats[:, 1].reshape(-1, 1), new_bin_feats.reshape(-1, 1)))

In [24]:
potential_new_features = matrix_new_features(df)

In [25]:
def booleans_without_top_values(df, not_values):
    potential_new_features = matrix_new_features(df)
    not_tf = ~np.isin(potential_new_features[:, 1], not_values)
    return potential_new_features[not_tf]

In [26]:
selected_booleans = booleans_without_top_values(df, ['t', 'f', '2018-11-07'])

In [27]:
selected_bool_cols = selected_booleans[:, 0]
selected_booleans_df = df[selected_bool_cols]

In [29]:
def almost_to_boolean(df):
    # Compute the feature matrix once: (original column, top value, new column name)
    new_features = matrix_new_features(df)
    to_replace_df = pd.DataFrame({})
    for column, value, new_name in new_features:
        # 1 where the row holds the dominant value, 0 otherwise
        to_replace_df[new_name] = np.where(df[column] == value, 1, 0)
    return to_replace_df
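
The encoding step itself is a single `np.where` call. The sketch below isolates it on a made-up column (the values are hypothetical) to show how rows that differ from the dominant value, including missing ones, are treated.

```python
import numpy as np
import pandas as pd

# Hypothetical almost-binary column, with one missing entry
bed_type = pd.Series(["Real Bed", "Real Bed", "Futon", None, "Real Bed"], dtype=object)

# 1 where the row equals the dominant value, 0 everywhere else
is_real_bed = np.where(bed_type == "Real Bed", 1, 0)

print(is_real_bed.tolist())
```

Missing entries end up as 0 alongside every other non-dominant value, so this encoding does not distinguish "unknown" from "something else".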

In [30]:
almost = almost_to_boolean(selected_booleans_df)

In [31]:
almost.dtypes

In [32]:
def df_with_replaced_columns(original_df, selected_booleans_df):
    matrix_features = matrix_new_features(selected_booleans_df)
    cols_to_drop = matrix_features[:, 0]
    # drop returns a new frame, so original_df itself is left untouched
    pruned_df = original_df.drop(cols_to_drop, axis = 1)
    # Assumption: the 0/1 encoded columns, not the raw ones, are meant to replace the dropped columns
    encoded_df = almost_to_boolean(selected_booleans_df)
    return pd.concat([pruned_df, encoded_df], axis = 1)

In [33]:
new_df = df_with_replaced_columns(df, selected_booleans_df)

In [34]:
len(new_df.columns)

### Summary