# Categorical Features

### Introduction

Deciding which features to include and focus on in a linear model is an important skill for any data scientist.  As we saw previously, if we include features that are highly collinear, the coefficients estimated for those features become unreliable.  In addition, selecting features and ranking them by importance helps us decide where to devote our attention in feature engineering and domain understanding.  Finally, limiting the number of features and identifying the most important ones makes our models, and the insights we draw from them, easier to understand.

### Working with Airbnb

For this lesson, we'll work with [Airbnb listings in Berlin](https://www.kaggle.com/brittabettendorf/berlin-airbnb-data).

In [201]:
import numpy as np
import pandas as pd
df = pd.read_csv('listings_summary.csv.zip')

In [5]:
pd.set_option('display.max_rows',100)

### Feature engineering

Many columns in this dataset have the `object` dtype, which usually means they hold strings.  Let's try to capture as much of this object data as possible.

In [6]:
def find_object_features(df):
    return list(df.dtypes[df.dtypes == 'object'].index)

In [7]:
def find_object_feature_values(df):
    object_features = find_object_features(df)
    return df[object_features][:1].values[0]

In [8]:
def informative(df):
    non_informative = [column for column in df.columns if len(df[column].unique()) == 1]
    informative_columns = list(set(df.columns.to_list()) - set(non_informative))
    return df[informative_columns]

In [9]:
def percentage_unique(df_series):
    series_filled = df_series.dropna()
    return len(series_filled.unique())/len(series_filled)

In [72]:
def find_categorical(df, threshold = .5):    
    categorical_df = pd.DataFrame({})
    for column in df.columns:
        if percentage_unique(df[column]) < threshold:
            categorical_df[column] = df[column]
    return categorical_df 

In [73]:
len(find_categorical(df).columns)

44
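
To see how the threshold behaves, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its column names are hypothetical and do not come from the listings file; the ratio is the same one used above, the number of distinct values divided by the number of non-null values.

```python
import pandas as pd

def percentage_unique(series):
    # Share of distinct values among the non-null entries.
    filled = series.dropna()
    return filled.nunique() / len(filled)

# Hypothetical toy data: 'room_type' repeats, 'listing_id' never does.
toy = pd.DataFrame({
    'listing_id': [101, 102, 103, 104, 105, 106],
    'room_type': ['Private room', 'Entire home/apt', 'Private room',
                  'Private room', 'Entire home/apt', 'Private room'],
})

ratios = {column: percentage_unique(toy[column]) for column in toy.columns}
categorical = [column for column, ratio in ratios.items() if ratio < .5]
print(ratios)
print(categorical)
```

An identifier-like column has a ratio close to 1 and is excluded, while a column with a handful of repeated values falls well below the threshold.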

### Combine with Selecting Categorical Columns

In [86]:
def summarize_counts(df):
    non_empty_columns = df.dropna(axis=1,how='all').columns
    frequencies = np.array([df[column].value_counts(normalize=True).values[0] for column in non_empty_columns]).reshape(-1, 1)
    columns = non_empty_columns.to_numpy().reshape(-1, 1)
    top_values = np.array([df[column].value_counts(normalize=True).index[0] for column in non_empty_columns]).reshape(-1, 1)
    summarize = np.hstack((columns, frequencies, top_values))
    return summarize[summarize[:,1].argsort()[::-1]]

In [88]:
top_counts = summarize_counts(df)

In [239]:
def selected_summaries(df, not_values = [], lower_bound = .1, upper_bound = 1):
    potential_cols = summarize_counts(df)
    potential_cols = potential_cols[potential_cols[:, 1] > lower_bound]
    potential_cols = potential_cols[potential_cols[:, 1] < upper_bound]
    not_tf = ~np.isin(potential_cols[:, 2], not_values)
    return potential_cols[not_tf]

In [241]:
selected = selected_summaries(df, not_values = ['t', 'f'], upper_bound = .90)
selected[0:3]

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['bathrooms', 0.8795293072824156, '1.0'],
       ['review_scores_communication', 0.805393184074115, '10.0']],
      dtype=object)

* But we may not want columns whose top value starts or ends with a digit, since those columns could instead be converted to floats.

In [242]:
def num_is_digit(array, str_index = 0):
    return np.array([value[str_index].isdigit() for value in array])

In [245]:
num_is_digit(selected[:, 2], str_index = 0)[0:10]

array([False,  True,  True,  True,  True, False,  True,  True,  True,
        True])
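
For the digit-like columns, one possible way to convert them to floats is `pd.to_numeric`. This is a sketch under assumptions: the `toy` frame below is made up, and it is only a guess that the real columns follow the same string formats.

```python
import pandas as pd

# Hypothetical string columns resembling the digit-like values above.
toy = pd.DataFrame({
    'bathrooms': ['1.0', '2.0', '1.5'],
    'host_response_rate': ['100%', '90%', None],
})

# Plain numeric strings convert directly.
bathrooms = pd.to_numeric(toy['bathrooms'])

# Percent strings need the trailing '%' removed first; missing values stay NaN.
response_rate = pd.to_numeric(toy['host_response_rate'].str.rstrip('%')) / 100
print(bathrooms.tolist())
print(response_rate.tolist())
```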

In [258]:
def remove_digits_from_selected(selected_matrix, col_idx, str_indices = (0, -1)):
    # Apply each filter to the already-filtered rows, so a row is dropped
    # if the character at any of str_indices is a digit.
    selected_rows = selected_matrix
    for idx in str_indices:
        selected_rows = selected_rows[~num_is_digit(selected_rows[:, col_idx], idx)]
    return selected_rows

In [260]:
remove_digits_from_selected(selected, 2, [0, -1])

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7660902121590302, 'Berlin, Berlin, Germany'],
       ['host_response_time', 0.5260923586663906, 'within an hour'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['host_verifications', 0.18193508336289466,
        "['email', 'phone', 'reviews']"],
       ['neighbourhood', 0.1498062648802577, 'Neukölln'],
       ['host_neighbourhood', 0.14640852331309429, 'Neukölln'],
       ['calendar_updated', 0.11160872649875843, 'today']], dtype=object)

### Cleaning Values

1. Find columns to clean

In [74]:
def categorical_plus_values(df, threshold = 5):
    categorical_cols = find_categorical(df).columns
    return [column for column in categorical_cols if len(df[column].value_counts()) > threshold]

In [75]:
to_combine_columns = categorical_plus_values(df)

In [76]:
to_combine_columns[0:3]

['host_name', 'host_since', 'host_location']

In [153]:
# df[to_combine_columns].describe()

In [83]:
def selected_cat_values(column, threshold = .02):
    values_counted = column.value_counts(normalize=True)
    return values_counted[values_counted > threshold]

In [84]:
selected = selected_cat_values(df.neighbourhood_cleansed, .02)

In [85]:
selected[0:3]

other                       0.512327
Tempelhofer Vorstadt        0.058753
Frankfurter Allee Süd FK    0.056846
Name: neighbourhood_cleansed, dtype: float64

In [203]:
def reduce_cat_values(column, threshold = .02):
    column = column.copy()
    selected_values = selected_cat_values(column, threshold).index
    column[~column.isin(selected_values)] = 'other'
    return column.astype('category')
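
To illustrate the effect of collapsing rare values, here is a minimal sketch on made-up data. `reduce_rare_values` is a hypothetical variant of the function above that uses `Series.where` instead of boolean assignment, and the neighbourhood counts are invented for the example.

```python
import pandas as pd

def reduce_rare_values(column, threshold=.02):
    # Hypothetical variant of reduce_cat_values that uses Series.where.
    shares = column.value_counts(normalize=True)
    frequent = shares[shares > threshold].index
    # Keep frequent values; replace everything else with 'other'.
    return column.where(column.isin(frequent), 'other').astype('category')

# Shares: Mitte 0.6, Pankow 0.3, Spandau 0.1.
toy = pd.Series(['Mitte'] * 6 + ['Pankow'] * 3 + ['Spandau'])
reduced = reduce_rare_values(toy, threshold=.2)
print(reduced.value_counts())
```

With a threshold of `.2`, only `Spandau` falls below the cutoff and is relabeled. As in the original function, missing values are not in the frequent set, so they also end up as `'other'`.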

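In [156]:
# updated_nondigits is not defined in the cells shown; presumably it holds
# the array returned by remove_digits_from_selected.
df[updated_nondigits[:, 0]][:1]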

| | property_type | host_location | host_response_time | neighbourhood_cleansed | room_type | cancellation_policy | neighbourhood_group_cleansed | host_verifications | neighbourhood | host_neighbourhood | calendar_updated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Guesthouse | Key Biscayne, Florida, United States | within an hour | Brunnenstr. Süd | Entire home/apt | strict_14_with_grace_period | Mitte | ['email', 'phone', 'reviews', 'jumio', 'offlin... | Mitte | Mitte | 3 months ago |


In [176]:
categoricals = ['property_type', 'host_location', 'neighbourhood_cleansed', 'room_type', 'cancellation_policy', 'neighbourhood_group_cleansed', 'host_verifications', 'neighbourhood', 'host_neighbourhood']



In [204]:
def df_reduced_categories(df, categoricals, threshold = .01):
    new_df = pd.DataFrame()
    for category in categoricals:
        new_df[category] = reduce_cat_values(df[category], threshold)
    return new_df

In [205]:
df_reduced = df_reduced_categories(df, categoricals)

In [198]:
summarize_counts(df_reduced)

array([['property_type', 0.8968162468960624, 'Apartment'],
       ['host_location', 0.7621496984746364, 'Berlin, Berlin, Germany'],
       ['neighbourhood_cleansed', 0.51232706633558, 'other'],
       ['room_type', 0.511440227030862, 'Private room'],
       ['cancellation_policy', 0.403600567577155, 'flexible'],
       ['host_neighbourhood', 0.38479957431713374, 'other'],
       ['host_verifications', 0.2920361830436325, 'other'],
       ['neighbourhood_group_cleansed', 0.2437477829017382,
        'Friedrichshain-Kreuzberg'],
       ['neighbourhood', 0.21882759843916283, 'other']], dtype=object)

In [236]:
def replace_df_columns(original_df, replacing_df):
    replacing_cols = replacing_df.columns
    original_df = original_df.drop(columns = replacing_cols)
    new_df = pd.concat([original_df, replacing_df], axis = 1)
    return new_df

In [233]:
new_df = replace_df_columns(df, df_reduced)
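
A quick way to check what `replace_df_columns` does is to run the same logic on two small frames. The sketch below restates the function so it runs on its own; `original` and `replacement` are hypothetical stand-ins for `df` and `df_reduced`.

```python
import pandas as pd

def replace_df_columns(original_df, replacing_df):
    # Same logic as above: drop the old columns, then append the new ones.
    remaining = original_df.drop(columns=replacing_df.columns)
    return pd.concat([remaining, replacing_df], axis=1)

# Hypothetical toy frames standing in for df and df_reduced.
original = pd.DataFrame({'price': [50, 80, 65],
                         'neighbourhood': ['Mitte', 'Pankow', 'Spandau']})
replacement = pd.DataFrame({'neighbourhood': ['Mitte', 'other', 'other']})

result = replace_df_columns(original, replacement)
print(result)
```

Two details are worth keeping in mind: the replaced columns move to the end of the frame, and `pd.concat` aligns rows on the index, so both frames should share the same index.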