# Feature Selection

### Introduction

Deciding which features should be included and focused on in our linear model is an important skill of any data scientist.  As we saw previously, if we include features which are too collinear, we will improperly measure the coefficients related to our collinear features.  In addition, feature selection and prioritizing features with feature importance will help us to understand which features to devote our attention to in terms of feature engineering and domain understanding.  Finally, limiting the number of features in our model, and identifying the most crucial features in our model will make our models, and their insights more understandable.

### Working with AirBnb

For this lesson, we'll work with [AirBnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [321]:
import pandas as pd
df = pd.read_csv('listings_summary.csv.zip')

In [322]:
df.shape

(22552, 96)

As we can see, our dataset as 22,500 rows and close to 100 features.  Our goal is to predict the price.

In [752]:
# df.columns

Some of these columns include data that we do not know how to handle - mainly textual data and images.  A lot of this data is simply not in the correct format.  Let's take a look.

In [324]:
pd.set_option('display.max_rows',100)

In [325]:
df.dtypes[10:15]

notes          object
transit        object
access         object
interaction    object
house_rules    object
dtype: object

### Feature engineering

Let's try to capture as much of this object data as possible.

In [326]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [327]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:1].values[0]

In [328]:
def find_booleans(df, boolean_values):
    object_features = find_object_features(df)
    object_feature_values = find_object_feature_values(df)
    return [feature for feature, value in zip(object_features, object_feature_values) if value in boolean_values]


In [329]:
boolean_values = ['t', 'f']

In [330]:
boolean_mapping = {'t': 1, 'f': 0}

In [331]:
import numpy as np
def to_booleans(df, boolean_values):
    boolean_mapping = dict(list(zip(boolean_values, [1, 0])))
    boolean_features = find_booleans(df, boolean_values)
    boolean_df = pd.DataFrame({})
    for feature in boolean_features:
        boolean_df[feature] = df[feature].map(boolean_mapping)
    return boolean_df[boolean_features]

In [332]:
to_booleans(df, boolean_values)[0:2]

Unnamed: 0,host_is_superhost,host_has_profile_pic,host_identity_verified,is_location_exact,has_availability,requires_license,instant_bookable,is_business_travel_ready,require_guest_profile_picture,require_guest_phone_verification
0,1.0,1.0,1.0,0,1,1,0,0,0,0
1,0.0,1.0,1.0,1,1,1,0,0,0,0


In [349]:
def merge_dfs(original_df, new_dfs):
    copied_original = original_df.copy()
    for new_df in new_dfs:
        copied_original[new_df.columns] = new_df
    return copied_original

In [351]:
def find_categorical(df, threshold = .5):
    
    object_feature_columns = find_object_features(df)
    # change to just return the categorical feature columns
    for column in object_feature_columns:
        if percentage_unique(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 

In [560]:
def find_categorical_columns(df, threshold = .5):
    categorical_df = pd.DataFrame({})
    object_feature_columns = find_object_features(df)
    categorical_columns = [column for column in object_feature_columns if percentage_unique(df[column]) < threshold]    
    return categorical_columns

In [551]:
percentage_unique(df.name)

0.9724358689370026

In [352]:
categorical_prospects = find_categorical(df)

In [316]:
percentage_unique(df.neighborhood_overview)

0.9342287694974003

In [312]:
def percentage_unique(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [423]:
cat_prospies = find_categorical(updated_df)
cat_prospies.iloc[:1, 2:].T

Unnamed: 0,0
host_location,"Key Biscayne, Florida, United States"
host_response_time,within an hour
host_response_rate,96%
host_neighbourhood,Mitte
host_verifications,"['email', 'phone', 'reviews', 'jumio', 'offlin..."
street,"Berlin, Berlin, Germany"
neighbourhood,Mitte
neighbourhood_cleansed,Brunnenstr. Süd
neighbourhood_group_cleansed,Mitte
city,Berlin


In [462]:
def informative(df):
    non_informative = [column for column in df.columns if len(df[column].unique()) == 1]
    informative_columns = list(set(df.columns.to_list()) - set(non_informative))
    return df[informative_columns]

In [467]:
informative(cat_prospies)[:1]

Unnamed: 0,smart_location,neighbourhood,property_type,host_verifications,calendar_updated,host_neighbourhood,host_location,bed_type,host_response_time,neighbourhood_cleansed,room_type,cancellation_policy,market,state,city,host_response_rate,host_name,zipcode,neighbourhood_group_cleansed,street
0,"Berlin, Germany",Mitte,Guesthouse,"['email', 'phone', 'reviews', 'jumio', 'offlin...",3 months ago,Mitte,"Key Biscayne, Florida, United States",Real Bed,within an hour,Brunnenstr. Süd,Entire home/apt,strict_14_with_grace_period,Berlin,Berlin,Berlin,96%,Ian,10119,Mitte,"Berlin, Berlin, Germany"


In [469]:
informative(cat_prospies).columns

categorical = ['smart_location', 'neighborhood', 'property_type', 'host_neighborhood', 'host_location', 'bed_type', 'neighborhood_cleansed', 'room_type', '']

to_coerce = ['host_verifications', 'calendar_updated', 'host_response_time']

to_remove = ['host_name']


In [439]:
def percentage_to_num(percentage):
    if type(percentage) == str:
        return float(percentage[:-1])

In [419]:
response_time_nums = host_response_time_to_num(cat_prospies.host_response_time)

In [438]:
response_rate_nums = updated_df.host_response_rate.map(percentage_to_num)

In [440]:
cat_prospies.columns

Index(['experiences_offered', 'host_name', 'host_location',
       'host_response_time', 'host_response_rate', 'host_neighbourhood',
       'host_verifications', 'street', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city',
       'state', 'zipcode', 'market', 'smart_location', 'country_code',
       'country', 'property_type', 'room_type', 'bed_type', 'calendar_updated',
       'cancellation_policy'],
      dtype='object')

In [396]:
from past_date import get_past_date

In [397]:
get_past_date(cat_prospies.calendar_updated[0])

'2019-04-05'

In [480]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = cat_prospies.bed_type.to_numpy().reshape(-1, 1)
enc.fit(X)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [484]:
enc.transform(X).toarray().shape

(22552, 5)

In [494]:
cat_prospies.bed_type.value_counts()

Real Bed         21766
Pull-out Sofa      451
Futon              240
Couch               72
Airbed              23
Name: bed_type, dtype: int64

In [500]:
def categorical_to_combine(df):
    return [column for column in df.columns if len(df[column].value_counts()) > 5]

In [538]:
to_combine_columns = categorical_to_combine(cat_prospies)
to_combine_columns

['host_name',
 'host_location',
 'host_response_rate',
 'host_neighbourhood',
 'host_verifications',
 'street',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'city',
 'state',
 'zipcode',
 'market',
 'smart_location',
 'property_type',
 'calendar_updated']

In [539]:
cat_prospies[to_combine_columns].describe()

Unnamed: 0,host_name,host_location,host_response_rate,host_neighbourhood,host_verifications,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,property_type,calendar_updated
count,22526,22436,9657,17458,22552,22552,21421,22552,22552,22547,22468,21896,22489,22552,22552,22552
unique,5997,1036,64,181,301,86,91,14,12,60,19,215,6,61,33,75
top,Anna,"Berlin, Berlin, Germany",100%,Neukölln,"['email', 'phone', 'reviews']","Berlin, Berlin, Germany",Neukölln,other,Friedrichshain-Kreuzberg,Berlin,Berlin,10245,Berlin,"Berlin, Germany",Apartment,today
freq,216,17188,7127,2556,4103,22317,3209,11554,5497,22417,22417,855,22483,22419,20225,2517


In [725]:
df.street.value_counts(normalize=True).index[0]

'Berlin, Berlin, Germany'

In [740]:
def summarize_counts(df):
    frequencies = np.array([df[column].value_counts(normalize=True)[:1][0] for column in df.columns]).reshape(-1, 1)
    columns = df.columns.to_numpy().reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in df.columns]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [741]:
# find those more than 95 percent
# if more than .95 the same then 
summarize_counts(cat_prospies[to_combine_columns])

array([['market', 0.9997332028991952, 'Berlin'],
       ['state', 0.9977301050382766, 'Berlin'],
       ['city', 0.9942342661994944, 'Berlin'],
       ['smart_location', 0.9941025186236254, 'Berlin, Germany'],
       ['street', 0.9895796381695636, 'Berlin, Berlin, Germany'],
       ['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7660902121590302, 'Berlin, Berlin, Germany'],
       ['host_response_rate', 0.7380138759449104, '100%'],
       ['neighbourhood_cleansed', 0.51232706633558, 'other'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['host_verifications', 0.18193508336289466,
        "['email', 'phone', 'reviews']"],
       ['neighbourhood', 0.1498062648802577, 'Neukölln'],
       ['host_neighbourhood', 0.14640852331309429, 'Neukölln'],
       ['calendar_updated', 0.11160872649875843, 'today'],
       ['zipcode', 0.039048227986846915, '10245'],
       ['host_name', 0.009588919470833703, 'Anna'

In [607]:
def remove_punctuation(string):
    return string.strip().lower().replace(' ', '_').replace('(', '').replace(')', '').replace(',', '')

[From here](https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd)

In [743]:
def almost_binary(df, threshold = .95):
    return np.array([np.array([cat, top]) for cat, frequency, top in summarize_counts(df) if frequency > .95])

In [744]:
bin_feats = almost_binary(cat_prospies[to_combine_columns])

In [745]:
bin_feats

array([['market', 'Berlin'],
       ['state', 'Berlin'],
       ['city', 'Berlin'],
       ['smart_location', 'Berlin, Germany'],
       ['street', 'Berlin, Berlin, Germany']], dtype='<U23')

In [746]:
new_bin_feats = new_binary_features(cat_prospies[to_combine_columns])

In [747]:
new_bin_feats

array(['market_is_berlin', 'state_is_berlin', 'city_is_berlin',
       'smart_location_is_berlin_germany',
       'street_is_berlin_berlin_germany'], dtype='<U32')

In [750]:
def matrix_new_features(df):
    bin_feats = almost_binary(df)
    new_bin_feats = np.array(['{column}_is_{top}'.format(column = column, top = remove_punctuation(top)) for column, top in bin_feats])
    return np.hstack((bin_feats[:, 0].reshape(-1, 1), bin_feats[:, 1].reshape(-1, 1), new_bin_feats.reshape(-1, 1)))

In [749]:
almost_binary(cat_prospies[to_combine_columns])

array([['market', 'Berlin'],
       ['state', 'Berlin'],
       ['city', 'Berlin'],
       ['smart_location', 'Berlin, Germany'],
       ['street', 'Berlin, Berlin, Germany']], dtype='<U23')

In [751]:
matrix_new_features(cat_prospies[to_combine_columns])

array([['market', 'Berlin', 'market_is_berlin'],
       ['state', 'Berlin', 'state_is_berlin'],
       ['city', 'Berlin', 'city_is_berlin'],
       ['smart_location', 'Berlin, Germany',
        'smart_location_is_berlin_germany'],
       ['street', 'Berlin, Berlin, Germany',
        'street_is_berlin_berlin_germany']], dtype='<U32')

In [656]:
def almost_to_boolean(df):
    columns_to_replace = matrix_new_features(df)[:, 0]
    values_to_replace = matrix_new_features(df)[:, 1]
    new_column_names = matrix_new_features(df)[:, 2]
    to_replace_df = pd.DataFrame({})
    for column, value, new_name in zip(columns_to_replace, values_to_replace, new_column_names):
        bool_column = np.where(df[column] == value,1,0)
        to_replace_df[new_name] = bool_column
    return to_replace_df

In [662]:
almost = almost_to_boolean(cat_prospies[to_combine_columns])

In [665]:
almost.dtypes

street_is_berlin_berlin_germany     int64
city_is_berlin                      int64
state_is_berlin                     int64
market_is_berlin                    int64
smart_location_is_berlin_germany    int64
dtype: object

In [667]:
cat_prospies[to_combine_columns].columns

Index(['host_name', 'host_location', 'host_response_rate',
       'host_neighbourhood', 'host_verifications', 'street', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city',
       'state', 'zipcode', 'market', 'smart_location', 'property_type',
       'calendar_updated'],
      dtype='object')

In [520]:
def selected_cat_values(column, threshold = .05):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > .025]

In [530]:
selected = selected_cat_values(cat_prospies.neighbourhood_cleansed, .02)
selected

Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Name: neighbourhood_cleansed, dtype: float64

In [668]:
def reduce_cat_values(column, threshold = .02):
    selected_values = selected_cat_values(column, threshold).index
    column[~column.isin(selected_values)] = 'other'
    return column

In [533]:
column.value_counts(normalize=True)

other                       0.512327
Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Alexanderplatz              0.048377
Reuterstraße                0.044431
Rixdorf                     0.039021
Neuköllner Mitte/Zentrum    0.035341
Brunnenstr. Süd             0.034276
Frankfurter Allee Nord      0.032591
Schillerpromenade           0.029354
südliche Luisenstadt        0.028512
Prenzlauer Berg Nordwest    0.027625
Prenzlauer Berg Südwest     0.027403
Schöneberg-Nord             0.025142
Name: neighbourhood_cleansed, dtype: float64