# Using Correlations

### Introduction

In the last sections, we saw how we can use recursive feature selection to reduce the number of features while retaining a relatively accurate model.  The mechanism of recursive feature selection is quite automatic: we simply go through the features one by one and see which feature's removal would least hurt our score.

In this lesson, the goal is to use some of our own judgment to find features which are multicollinear and remove them.

### Loading the Data

In [24]:
top_forty = ['first_reviewElapsed', 'first_reviewYear_is_na', 'reviews_per_month_is_na', 'last_reviewElapsed',
'last_reviewYear_is_na', 'first_reviewYear', 'host_sinceElapsed', 'host_sinceYear_is_na', 'host_total_listings_count_is_na', 
'host_listings_count_is_na', 'first_reviewDayofyear', 'host_sinceYear', 'host_sinceDayofyear', 'host_sinceMonth',
'last_reviewYear', 'cancellation_policy_x0_flexible', 'cancellation_policy_x0_moderate', 'cancellation_policy_x0_strict_14_with_grace_period',
'accommodates', 'room_type_x0_Entire home/apt', 'host_sinceDay', 'review_scores_communication_is_na', 'neighbourhood_group_cleansed_x0_Mitte', 'last_reviewDayofyear',
'review_scores_rating_is_na', 'bedrooms', 'review_scores_value_is_na', 'first_reviewMonth', 'cleaning_fee',
'neighbourhood_group_cleansed_x0_Friedrichshain-Kreuzberg', 'guests_included', 'availability_90',
'neighbourhood_cleansed_x0_Moabit West', 'bathrooms', 'neighbourhood_cleansed_x0_Parkviertel', 'neighbourhood_cleansed_x0_Wedding Zentrum',
'review_scores_location_is_na', 'neighbourhood_cleansed_x0_Osloer Straße', 'calculated_host_listings_count', 'property_type_x0_Apartment', 'price']

In [30]:
import pandas as pd
url = "./listings_train_df.csv"

train_df = pd.read_csv(url, usecols = top_forty)

FileNotFoundError: [Errno 2] File b'./listings_train_df.csv' does not exist: b'./listings_train_df.csv'

### Becoming more picky

In [11]:
bool_cols = X_train_pruned.select_dtypes(include='bool').columns

In [12]:
X_train_pruned[bool_cols] = X_train_pruned[bool_cols].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]
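
The warning above typically appears when the DataFrame being modified was itself produced by selecting from another DataFrame. Here is a minimal sketch of one common way to avoid it, which is to take an explicit copy before casting. The small frame and the names `full_df` and `pruned_df` are made up for illustration and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical stand-in for the full feature matrix.
full_df = pd.DataFrame({
    'bedrooms': [1.0, 2.0, 3.0, 1.0],
    'requires_license': [True, False, True, True],
    'zipcode_is_na': [False, False, True, False],
    'price': [50.0, 80.0, 120.0, 45.0],
})

# .copy() makes the selection an independent DataFrame, so later
# assignments are unambiguous and do not touch full_df.
pruned_df = full_df[['bedrooms', 'requires_license', 'zipcode_is_na']].copy()

# Cast the boolean columns to 0/1 integers on the independent copy.
bool_cols = pruned_df.select_dtypes(include='bool').columns
pruned_df[bool_cols] = pruned_df[bool_cols].astype(int)

print(pruned_df.dtypes)
```

The original frame keeps its boolean columns, and only the copy is converted.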


In [13]:
subsampled_df = X_train_pruned.sample(3000)

In [14]:
# pd.plotting.scatter_matrix(subsampled_df)
print('scatter matrix')

scatter matrix


We still have a lot of features.  One guess is that the geographic features overlap with one another, so we gather together all of the columns that seem to have a geographic component.

In [15]:

neighborhood_cols = [col for col in X_train_pruned.columns if 'neighbourhood' in col]
zip_cols = [col for col in X_train_pruned.columns if 'zip' in col]
berlin_cols = [col for col in X_train_pruned.columns if 'berlin' in col]
geo_cols = neighborhood_cols + zip_cols + berlin_cols

In [16]:
len(geo_cols)

16

We see that almost half of our columns are geographic.

In [17]:
geo_df = X_train_pruned.loc[:, geo_cols]

In [18]:
subsampled_geo = geo_df.sample(3000)

In [23]:
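import numpy as np

# Take the absolute value before thresholding, so strong negative correlations are flagged too
indexed_corr = np.abs(geo_df.corr(method = 'spearman')) > .60
indexed_corr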

| | neighbourhood_group_cleansed_Mitte | neighbourhood_group_cleansed_Neukölln | neighbourhood_group_cleansed_Reinickendorf | neighbourhood_group_cleansed_other | neighbourhood_Wedding | neighbourhood_Wilmersdorf | host_neighbourhood_Wilmersdorf | neighbourhood_cleansed_Moabit Ost | neighbourhood_cleansed_Moabit West | neighbourhood_cleansed_Parkviertel | zipcode_is_na | zip_dists_1082 | zip_dists_1335 | zip_dists_other | state_is_berlin | street_is_berlin_berlin_germany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| neighbourhood_group_cleansed_Mitte | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| neighbourhood_group_cleansed_Neukölln | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
| neighbourhood_group_cleansed_Reinickendorf | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False |
| neighbourhood_group_cleansed_other | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False |
| neighbourhood_Wedding | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False |
| neighbourhood_Wilmersdorf | False | False | False | False | False | True | True | False | False | False | False | False | False | False | False | False |
| host_neighbourhood_Wilmersdorf | False | False | False | False | False | True | True | False | False | False | False | False | False | False | False | False |
| neighbourhood_cleansed_Moabit Ost | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False |
| neighbourhood_cleansed_Moabit West | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False |
| neighbourhood_cleansed_Parkviertel | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |


In [19]:
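# Count, for each column, how many columns (itself included) it is strongly correlated with
indexed_corr = (np.abs(geo_df.corr(method = 'spearman')) > .60).sum()
# A count above 1 means the column is strongly correlated with at least one other column
correlated_cols = indexed_corr[indexed_corr > 1].index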

geo_df[correlated_cols].corr(method = 'spearman')

| | neighbourhood_Wedding | neighbourhood_Wilmersdorf | host_neighbourhood_Wilmersdorf | zip_dists_1335 | state_is_berlin | street_is_berlin_berlin_germany |
|---|---|---|---|---|---|---|
| neighbourhood_Wedding | 1.0 | -0.038366 | -0.031592 | 0.808018 | 0.016985 | 0.016855 |
| neighbourhood_Wilmersdorf | -0.038366 | 1.0 | 0.803552 | -0.033285 | -0.005961 | -0.010946 |
| host_neighbourhood_Wilmersdorf | -0.031592 | 0.803552 | 1.0 | -0.026839 | -0.003056 | -0.011327 |
| zip_dists_1335 | 0.808018 | -0.033285 | -0.026839 | 1.0 | 0.017751 | 0.020733 |
| state_is_berlin | 0.016985 | -0.005961 | -0.003056 | 0.017751 | 1.0 | 0.748955 |
| street_is_berlin_berlin_germany | 0.016855 | -0.010946 | -0.011327 | 0.020733 | 0.748955 | 1.0 |


So from here we can see some pairings:

* neighbourhood_Wedding and zip_dists_1335
* neighbourhood_Wilmersdorf and host_neighbourhood_Wilmersdorf
* state_is_berlin and street_is_berlin_berlin_germany
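
Reading pairings off a correlation matrix by eye gets tedious as the number of columns grows. The sketch below shows one way to list them programmatically, under the assumption that each pair should be reported once and that strong negative correlations count as well. The helper name `correlated_pairs` and the toy data are illustrative and not part of the lesson's code.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.60, method='spearman'):
    """Return (col_a, col_b, corr) for each column pair with |corr| above threshold."""
    corr = df.corr(method=method)
    cols = corr.columns
    pairs = []
    # Walk the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is skipped.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data: 'b' mirrors 'a' exactly, 'c' is uncorrelated with both.
toy = pd.DataFrame({
    'a': [0, 1, 0, 1, 0, 1, 0, 1],
    'b': [0, 1, 0, 1, 0, 1, 0, 1],
    'c': [0, 0, 1, 1, 0, 0, 1, 1],
})
print(correlated_pairs(toy))
```

Applied to a frame like `geo_df`, the same helper would return one tuple per correlated pair rather than a full matrix to scan.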

We could of course also do this with our entire feature matrix.  When we do, again using the Spearman correlation, an even stronger relationship appears.

In [64]:
# Take absolute values so that strong negative correlations (below -.60) are flagged too
indexed_corr = (X_train_pruned.corr(method = 'spearman').abs() > .60).sum()
correlated_cols = indexed_corr[indexed_corr > 1].index
correlated_cols

Index(['host_response_rate_is_na', 'state_is_berlin',
       'street_is_berlin_berlin_germany', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'host_neighbourhood_Wilmersdorf',
       'host_response_time_other', 'zip_dists_1335'],
      dtype='object')

In [62]:
X_train_pruned[correlated_cols].corr(method = 'spearman')

| | host_response_rate_is_na | state_is_berlin | street_is_berlin_berlin_germany | neighbourhood_Wedding | neighbourhood_Wilmersdorf | host_neighbourhood_Wilmersdorf | host_response_time_other | zip_dists_1335 |
|---|---|---|---|---|---|---|---|---|
| host_response_rate_is_na | 1.0 | 0.008039 | 0.034077 | 0.046751 | -0.027278 | -0.02585 | 1.0 | 0.027216 |
| state_is_berlin | 0.008039 | 1.0 | 0.748955 | 0.016985 | -0.005961 | -0.003056 | 0.008039 | 0.017751 |
| street_is_berlin_berlin_germany | 0.034077 | 0.748955 | 1.0 | 0.016855 | -0.010946 | -0.011327 | 0.034077 | 0.020733 |
| neighbourhood_Wedding | 0.046751 | 0.016985 | 0.016855 | 1.0 | -0.038366 | -0.031592 | 0.046751 | 0.808018 |
| neighbourhood_Wilmersdorf | -0.027278 | -0.005961 | -0.010946 | -0.038366 | 1.0 | 0.803552 | -0.027278 | -0.033285 |
| host_neighbourhood_Wilmersdorf | -0.02585 | -0.003056 | -0.011327 | -0.031592 | 0.803552 | 1.0 | -0.02585 | -0.026839 |
| host_response_time_other | 1.0 | 0.008039 | 0.034077 | 0.046751 | -0.027278 | -0.02585 | 1.0 | 0.027216 |
| zip_dists_1335 | 0.027216 | 0.017751 | 0.020733 | 0.808018 | -0.033285 | -0.026839 | 0.027216 | 1.0 |


So to this we can add:

* host_response_time_other and host_response_rate_is_na

### Removing Correlated Columns

Gathering the pairings we found:

* neighbourhood_Wedding and zip_dists_1335
* neighbourhood_Wilmersdorf and host_neighbourhood_Wilmersdorf
* state_is_berlin and street_is_berlin_berlin_germany
* host_response_time_other and host_response_rate_is_na

From each pair we drop one column, except for the Berlin pair, where we drop both.

In [65]:
X_train_dropped = X_train_pruned.drop(columns=['zip_dists_1335', 'host_neighbourhood_Wilmersdorf', 'state_is_berlin', 'street_is_berlin_berlin_germany', 'host_response_time_other'])

In [67]:
len(X_train_dropped.columns)

30

# Working with Dendrograms

### Introduction

In the last lesson, we saw how we can detect variables that are associated with each other by using correlations.  The motivation for examining the correlations between variables is that if two features are correlated, we can likely remove one of them without suffering a significant decrease in our model's accuracy.  And by reducing the number of features, we see our standard benefits of a more understandable model, a decrease in variance, and a reduction in multicollinearity.

In the last section, we also saw two different types of correlation: the Pearson correlation, which strictly measures the strength of a linear relationship between variables, and the Spearman correlation (or rank correlation), which also captures non-linear relationships as long as they are monotonic.
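
To make the difference between the two correlations concrete, here is a small sketch on synthetic data, where one variable is an exponential function of the other. The data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 10, 50)
# y always increases with x, but the relationship is far from a straight line.
df = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']

print(f'pearson:  {pearson:.3f}')
print(f'spearman: {spearman:.3f}')
```

Because the ranks of `x` and `y` line up perfectly, the Spearman correlation is at its maximum, while the Pearson correlation is noticeably lower. Note that Spearman only rewards monotonic relationships: a U-shaped relationship would score low on both measures.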

As mentioned, we'll focus on the Spearman correlation going forward.

### Plotting Correlations with Dendrograms

Previously, we examined the relationships among our variables with scatter plots.  In this section, we'll use a dendrogram to see whether two variables are associated with one another.  Let's start by computing the rank (Spearman) correlations of the features and plotting them as a dendrogram.

In [41]:
import numpy as np
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform

corr = X_train_dropped.corr(method = 'spearman')
# Convert correlations to distances: strongly correlated features end up close together
corr_condensed = squareform(1 - np.abs(corr.values), checks = False)
z = hc.linkage(corr_condensed, method = 'average')

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels = list(X_train_dropped.columns), orientation = 'left', leaf_font_size=16)

NameError: name 'X_train_dropped' is not defined
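
Since the cell above did not produce a plot, here is a self-contained sketch of the same idea on synthetic data. It builds the linkage from `1 - |Spearman correlation|` and then, instead of drawing the tree, cuts it with `fcluster` to list which features fall into the same group. The toy columns and the cut-off of 0.4 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'a': base,
    'a_noisy': base + rng.normal(scale=0.1, size=200),  # near-duplicate of 'a'
    'b': rng.normal(size=200),                          # generated independently of 'a'
})

corr = toy.corr(method='spearman')
# Distance is 0 for perfectly (anti)correlated features and close to 1 for unrelated ones.
dist = 1 - np.abs(corr.to_numpy())
np.fill_diagonal(dist, 0.0)  # guard against floating-point noise on the diagonal
z = hc.linkage(squareform(dist, checks=False), method='average')

# Cut the tree at distance 0.4, which for a pair of features
# corresponds to an absolute correlation above 0.6.
cluster_ids = hc.fcluster(z, t=0.4, criterion='distance')
print(dict(zip(toy.columns, cluster_ids)))
```

Features that share a cluster id are the ones that would join early, near the leaves, in the plotted dendrogram, and are candidates for a closer look.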

In [70]:
X_train_dropped.columns

Index(['longitude', 'bedrooms', 'bedrooms_is_na', 'host_response_rate_is_na',
       'summary_is_na', 'license_is_na', 'zipcode_is_na', 'bathrooms_is_na',
       'beds_is_na', 'host_sinceIs_year_end', 'requires_license',
       'property_type_Loft', 'property_type_other',
       'neighbourhood_group_cleansed_Mitte',
       'neighbourhood_group_cleansed_Neukölln',
       'neighbourhood_group_cleansed_Reinickendorf',
       'neighbourhood_group_cleansed_other', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'cancellation_policy_super_strict_60',
       'room_type_Private room', 'room_type_Shared room',
       'neighbourhood_cleansed_Moabit Ost',
       'neighbourhood_cleansed_Moabit West',
       'neighbourhood_cleansed_Parkviertel', 'zip_dists_1082',
       'zip_dists_other', 'last_reviewMonth_2.0', 'last_reviewMonth_3.0',
       'last_reviewMonth_12.0'],
      dtype='object')

In [73]:
first_pair = ['longitude', 'neighbourhood_group_cleansed_Neukölln']
second_pair = ['neighbourhood_group_cleansed_Mitte', 'neighbourhood_Wedding']
X_train_dropped[first_pair].corr(method = 'spearman')

| | longitude | neighbourhood_group_cleansed_Neukölln |
|---|---|---|
| longitude | 1.0 | 0.294026 |
| neighbourhood_group_cleansed_Neukölln | 0.294026 | 1.0 |


In [74]:
X_train_dropped[second_pair].corr(method = 'spearman')

| | neighbourhood_group_cleansed_Mitte | neighbourhood_Wedding |
|---|---|---|
| neighbourhood_group_cleansed_Mitte | 1.0 | 0.509146 |
| neighbourhood_Wedding | 0.509146 | 1.0 |


### Removing Features

Neither of these correlations (0.29 and 0.51) looks too strong, so we can leave both pairs in.

In [None]:
# model.fit(X_train_pruned, y_train_pruned)
# model.score(X_test_pruned, y_test_pruned)

In [76]:
model.fit(X_train_dropped, y_train_pruned)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [77]:
dropped_columns =['zip_dists_1335', 'host_neighbourhood_Wilmersdorf', 'state_is_berlin', 'street_is_berlin_berlin_germany', 'host_response_time_other']

In [79]:
X_test_dropped = X_test_pruned.drop(columns = dropped_columns)

model.score(X_test_dropped, y_test_pruned)

0.46007398634498153

In [80]:
X_test_pruned.columns

Index(['longitude', 'bedrooms', 'bedrooms_is_na', 'host_response_rate_is_na',
       'summary_is_na', 'license_is_na', 'zipcode_is_na', 'bathrooms_is_na',
       'beds_is_na', 'host_sinceIs_year_end', 'requires_license',
       'state_is_berlin', 'street_is_berlin_berlin_germany',
       'property_type_Loft', 'property_type_other',
       'neighbourhood_group_cleansed_Mitte',
       'neighbourhood_group_cleansed_Neukölln',
       'neighbourhood_group_cleansed_Reinickendorf',
       'neighbourhood_group_cleansed_other', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'host_neighbourhood_Wilmersdorf',
       'cancellation_policy_super_strict_60', 'room_type_Private room',
       'room_type_Shared room', 'host_response_time_other',
       'neighbourhood_cleansed_Moabit Ost',
       'neighbourhood_cleansed_Moabit West',
       'neighbourhood_cleansed_Parkviertel', 'zip_dists_1082',
       'zip_dists_1335', 'zip_dists_other', 'last_reviewMonth_2.0',
       'last_reviewMonth_3.0

So we removed five columns and essentially wound up with the same score as our previous, larger model (35 features and a score of 0.4628732594292488).
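
The observation that dropping a redundant column barely moves the score can be reproduced on synthetic data. The sketch below is not the lesson's model or data: it fits ordinary least squares with NumPy on made-up features, one of which is a near-duplicate of another, and compares in-sample R² with and without the duplicate. The helper `r_squared` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x1_copy = x1 + rng.normal(scale=0.05, size=n)  # almost a duplicate of x1
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(features, target):
    """Fit least squares with an intercept and return the in-sample R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    return 1 - residuals.var() / target.var()

full_score = r_squared(np.column_stack([x1, x1_copy, x2]), y)
reduced_score = r_squared(np.column_stack([x1, x2]), y)

print(f'with near-duplicate:    {full_score:.4f}')
print(f'without near-duplicate: {reduced_score:.4f}')
```

The two scores differ only marginally, because the near-duplicate carries almost no information that `x1` does not already provide.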

Here are our new importances: 

In [84]:
feature_importances(X_test_dropped, model)[:, :-1]

array([['room_type_Shared room', -0.9389809471607655],
       ['cancellation_policy_super_strict_60', 0.8424899490730072],
       ['requires_license', 0.7047767596167392],
       ['longitude', -0.6794678066264861],
       ['room_type_Private room', -0.5248350063414015],
       ['beds_is_na', 0.49824074208911406],
       ['neighbourhood_cleansed_Moabit West', -0.3956887963818316],
       ['property_type_Loft', 0.37686854970723155],
       ['property_type_other', 0.37594275346111766],
       ['neighbourhood_Wedding', -0.3633285128970777],
       ['bathrooms_is_na', -0.3257359699532433],
       ['host_sinceIs_year_end', -0.3195202583914894],
       ['bedrooms_is_na', -0.31867848759722883],
       ['neighbourhood_cleansed_Moabit Ost', -0.2977904215325947],
       ['bedrooms', 0.2662678090864989],
       ['neighbourhood_group_cleansed_Reinickendorf',
        -0.2191956998422411],
       ['neighbourhood_group_cleansed_other', -0.21602486160547915],
       ['license_is_na', -0.188607483790663