# Using Correlations

### Introduction

In this lesson, we'll apply what we learned about correlations to reducing the features in our Airbnb dataset.  Let's get started.

### Loading the Data

We'll start by loading up our training set with the top features that we previously found using permutation importance.

In [169]:
import pandas as pd

listings_df = pd.read_csv('./listings_train_top_forty.csv', index_col = 0)
listings_df.shape

(17952, 41)

Included here is the `price` column, which is the target we want to predict, so we separate it from the features.

In [170]:
X = listings_df.drop('price', axis = 1)

In [171]:
y = listings_df['price']

### Finding Correlated Features In Code

Because we have so many features, we'll avoid plotting a scatter matrix.

Instead we can calculate our correlation in code.  To do so, we first should coerce all of our values to numbers.  So we can select our boolean features and coerce them to 1s and 0s.

In [172]:
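bool_cols = listings_df.select_dtypes(include='bool').columns

# cast the boolean columns to 1s and 0s with astype(int)
listings_df[bool_cols] = listings_df[bool_cols].astype(int)

# X was split off before the cast, so rebuild it from the updated frame
X = listings_df.drop('price', axis = 1)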
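To make the coercion step concrete, here is a minimal sketch on a toy frame. The frame and its column values are made up for illustration and only stand in for `listings_df`.

```python
import pandas as pd

# Toy frame standing in for listings_df (values are invented)
toy_df = pd.DataFrame({
    'bedrooms': [1, 2, 3],
    'bedrooms_is_na': [False, True, False],
    'requires_license': [True, True, False],
})

# Select only the boolean columns, then cast them to integers
toy_bool_cols = toy_df.select_dtypes(include='bool').columns
toy_df[toy_bool_cols] = toy_df[toy_bool_cols].astype(int)

print(toy_df.dtypes)
```

Only the two boolean columns are touched; the numeric `bedrooms` column is left as it was.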

Now that everything is numeric, we can calculate the correlation between all columns.

Let's start with the Pearson correlation.  We'll assign `pearson_corr_df` to a grid of the Pearson correlation coefficients.

In [176]:
pearson_corr_df = X.corr(method = 'pearson')

pearson_corr_df.iloc[:5, :5]

| | accommodates | guests_included | availability_90 | calculated_host_listings_count | property_type_x0_Apartment |
|---|---|---|---|---|---|
| accommodates | 1.0 | 0.502932 | 0.148772 | 0.182929 | -0.118029 |
| guests_included | 0.502932 | 1.0 | 0.099042 | 0.09966 | -0.059833 |
| availability_90 | 0.148772 | 0.099042 | 1.0 | 0.223432 | -0.178087 |
| calculated_host_listings_count | 0.182929 | 0.09966 | 0.223432 | 1.0 | -0.159715 |
| property_type_x0_Apartment | -0.118029 | -0.059833 | -0.178087 | -0.159715 | 1.0 |


In [178]:
pearson_corr_df.shape

(40, 40)

With the grid above, we can see how strong the linear relationship is between each pair of features.  Next, for each feature, we'll count the number of correlations over $0.70$ and sort by that count.  Note that each count includes the feature's correlation of 1 with itself.

In [184]:
pearson_corr_amounts = (pearson_corr_df > .70).sum(axis = 1).sort_values(ascending = False)

In [185]:
pearson_corr_amounts[:10]

last_reviewYear_is_na                7
reviews_per_month_is_na              7
review_scores_value_is_na            7
review_scores_location_is_na         7
review_scores_communication_is_na    7
review_scores_rating_is_na           7
first_reviewYear_is_na               7
host_sinceYear_is_na                 3
host_total_listings_count_is_na      3
host_listings_count_is_na            3
dtype: int64
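Because every feature correlates perfectly with itself, the counts above are each one higher than the number of *other* features involved. A small sketch on synthetic data (the columns `a`, `a_copy`, and `b` are invented for illustration) shows one way to subtract that self-match:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_X = pd.DataFrame({
    'a': a,
    'a_copy': a + rng.normal(scale=0.05, size=200),  # nearly identical to a
    'b': rng.normal(size=200),                         # unrelated to a
})

toy_corr = toy_X.corr(method='pearson')

# Every feature has correlation 1.0 with itself, so subtract that self-match
toy_counts = (toy_corr > .70).sum(axis=1) - 1
print(toy_counts.sort_values(ascending=False))
```

Here `a` and `a_copy` each report one highly correlated partner, while `b` reports none.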

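We can see that many of our `_is_na` columns are correlated with each other.  Let's start by removing some of those columns.

Let's check, for each row of the correlation grid, where its largest value sits.  A position that differs from the row's own index points to another column that is correlated just as strongly as the column is with itself.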

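In [None]:
import numpy as np

# assumption: corr_df is the Pearson correlation grid of all 41 columns, price included
corr_df = listings_df.corr(method = 'pearson')

In [152]:
# position of each row's largest correlation; normally the row's own index
np.argmax(corr_df.to_numpy(), axis = 1)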

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 30, 32, 33,
       34, 35, 36, 30, 36, 39, 40])

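In [47]:
# assumption: indexed_corr marks which correlations in the grid are over .70
indexed_corr = pearson_corr_df > .70
over_one_corr = indexed_corr.sum(axis = 0)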

In [51]:
highly_corr_cols = over_one_corr.sort_values(ascending = False).index[:7]

In [102]:
highly_corr_cols

Index(['first_reviewYear_is_na', 'reviews_per_month_is_na',
       'review_scores_value_is_na', 'review_scores_location_is_na',
       'review_scores_communication_is_na', 'review_scores_rating_is_na',
       'last_reviewYear_is_na'],
      dtype='object')
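The lesson picks columns to drop from the sorted counts. As an alternative sketch (not the method used in this lesson), one common approach is to look only at the upper triangle of the correlation grid, so each pair is considered once and only the later column of a pair is flagged. The data and column names below are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy_X = pd.DataFrame({
    'x': base,
    'x_dup': base + rng.normal(scale=0.01, size=300),  # near copy of x
    'z': rng.normal(size=300),                           # independent of x
})

toy_corr = toy_X.corr(method='pearson').abs()

# Keep only the entries above the diagonal so each pair is considered once
upper_mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
upper = toy_corr.where(upper_mask)

# Flag a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > .70).any()]
print(to_drop)
```

With this approach, one member of each correlated pair always survives, which is not guaranteed when dropping by count alone.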

### Drop Columns

In [55]:
X = listings_df.drop('price', axis = 1)

In [56]:
y = listings_df['price']

In [93]:
X_pruned = X.drop(columns = highly_corr_cols[:5])

In [94]:
from sklearn.model_selection import train_test_split

In [95]:
X_train, X_validate, y_train, y_validate = train_test_split(X_pruned, y)

In [96]:
from sklearn.linear_model import LinearRegression

In [97]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [98]:
model.score(X_validate, y_validate)

0.5007910801998332

In [99]:
highly_corr_cols[:5]

Index(['first_reviewYear_is_na', 'reviews_per_month_is_na',
       'review_scores_value_is_na', 'review_scores_location_is_na',
       'review_scores_communication_is_na'],
      dtype='object')

In [100]:
X_pruned.shape

(17952, 35)

### Resources

[Khan Academy correlation coefficient](https://www.youtube.com/watch?v=u4ugaNo6v1Q)

### Finding Correlated Pairs

So from here we can see some pairings:

* neighbourhood_Wedding and zip_dists_1335
* neighbourhood_Wilmersdorf and host_neighbourhood_Wilmersdorf
* state_is_berlin and street_is_berlin_berlin_germany

We could of course also do this with our entire matrix.

But we can see an even stronger relationship if we use a rank-based metric like the Spearman correlation.

In [64]:
# note: this only counts positive correlations; values below -.60 should also be checked
indexed_corr = (X_train_pruned.corr(method = 'spearman') > .60).sum()
correlated_cols = indexed_corr[indexed_corr > 1].index
correlated_cols

Index(['host_response_rate_is_na', 'state_is_berlin',
       'street_is_berlin_berlin_germany', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'host_neighbourhood_Wilmersdorf',
       'host_response_time_other', 'zip_dists_1335'],
      dtype='object')
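The comparison above only catches positive correlations. A minimal sketch of also catching strong negative ones is to take the absolute value before thresholding. The toy columns below are invented for illustration.

```python
import pandas as pd

# Toy data: 'down' decreases exactly as 'up' increases
toy_X = pd.DataFrame({
    'up': [1, 2, 3, 4, 5, 6],
    'down': [12, 10, 8, 6, 4, 2],
    'other': [3, 6, 1, 5, 2, 4],
})

toy_spearman = toy_X.corr(method='spearman')

# Counts include each column's correlation of 1 with itself
positive_only = (toy_spearman > .60).sum()
either_sign = (toy_spearman.abs() > .60).sum()

print(positive_only)
print(either_sign)
```

With the positive-only threshold, `up` and `down` look unrelated even though their rank correlation is -1; the absolute-value version counts them as a pair.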

In [62]:
X_train_pruned[correlated_cols].corr(method = 'spearman')

| | host_response_rate_is_na | state_is_berlin | street_is_berlin_berlin_germany | neighbourhood_Wedding | neighbourhood_Wilmersdorf | host_neighbourhood_Wilmersdorf | host_response_time_other | zip_dists_1335 |
|---|---|---|---|---|---|---|---|---|
| host_response_rate_is_na | 1.0 | 0.008039 | 0.034077 | 0.046751 | -0.027278 | -0.02585 | 1.0 | 0.027216 |
| state_is_berlin | 0.008039 | 1.0 | 0.748955 | 0.016985 | -0.005961 | -0.003056 | 0.008039 | 0.017751 |
| street_is_berlin_berlin_germany | 0.034077 | 0.748955 | 1.0 | 0.016855 | -0.010946 | -0.011327 | 0.034077 | 0.020733 |
| neighbourhood_Wedding | 0.046751 | 0.016985 | 0.016855 | 1.0 | -0.038366 | -0.031592 | 0.046751 | 0.808018 |
| neighbourhood_Wilmersdorf | -0.027278 | -0.005961 | -0.010946 | -0.038366 | 1.0 | 0.803552 | -0.027278 | -0.033285 |
| host_neighbourhood_Wilmersdorf | -0.02585 | -0.003056 | -0.011327 | -0.031592 | 0.803552 | 1.0 | -0.02585 | -0.026839 |
| host_response_time_other | 1.0 | 0.008039 | 0.034077 | 0.046751 | -0.027278 | -0.02585 | 1.0 | 0.027216 |
| zip_dists_1335 | 0.027216 | 0.017751 | 0.020733 | 0.808018 | -0.033285 | -0.026839 | 0.027216 | 1.0 |


So to this we can add:

* host_response_time_other and host_response_rate_is_na

### Removing Correlated Columns

* neighbourhood_Wedding and zip_dists_1335
* neighbourhood_Wilmersdorf and host_neighbourhood_Wilmersdorf
* state_is_berlin and street_is_berlin_berlin_germany
* host_response_time_other and host_response_rate_is_na

In [65]:
X_train_dropped = X_train_pruned.drop(columns=['zip_dists_1335', 'host_neighbourhood_Wilmersdorf', 'state_is_berlin', 'street_is_berlin_berlin_germany', 'host_response_time_other'])

In [67]:
len(X_train_dropped.columns)

30

# Working with Dendrograms

### Introduction

In the last lesson, we saw how we can detect variables that are associated with each other by using correlations.  The motivation for examining the correlations between variables is that if two features are correlated, we can likely remove one of them without suffering a significant decrease in our model's accuracy.  And by reducing the number of features, we see our standard benefits of a more understandable model, a decrease in variance, and a reduction in multicollinearity.
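To illustrate why a correlated feature can usually be removed, here is a small sketch on synthetic data. It fits ordinary least squares with NumPy rather than scikit-learn and reports the in-sample $R^2$, not a validation score as in the lesson; the helper `r_squared` and all variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    return 1 - residuals.var() / target.var()

r2_both = r_squared(np.column_stack([x1, x2]), y)   # both correlated features
r2_one = r_squared(x1, y)                            # only one of the pair
print(r2_both, r2_one)
```

The two scores should be nearly identical, since `x2` carries almost no information that `x1` does not already provide.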

In the last section, we also saw two different types of correlations: the Pearson correlation, which measures strictly the strength of a linear relationship between variables, and the Spearman correlation, or rank correlation, which also captures non-linear relationships so long as they are monotonic.

As mentioned, we'll focus on the spearman correlation going forward.
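The difference between the two correlations shows up on a relationship that is monotonic but not linear. A minimal sketch, using an invented toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'x': range(1, 11)})
toy['x_cubed'] = toy['x'] ** 3   # always increasing in x, but not linear

pearson = toy.corr(method='pearson').loc['x', 'x_cubed']
spearman = toy.corr(method='spearman').loc['x', 'x_cubed']
print(pearson, spearman)
```

Because the ranks of `x` and `x_cubed` line up exactly, the Spearman correlation is 1, while the Pearson correlation falls short of 1 because a straight line does not fit the curve perfectly.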

### Plotting Correlations with Dendrograms

Previously, we looked at the relationships among variables with scatter plots.  In this section, we'll use a dendrogram to see if two variables are associated with one another.  Ok, so let's start by plotting a dendrogram built from the rank correlations of the features.

In [41]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform

# turn rank correlations into distances: strongly correlated features sit close together
corr = X_train_dropped.corr(method = 'spearman')
corr_condensed = squareform(1 - np.abs(corr))
z = hc.linkage(corr_condensed, method = 'average')

fig = plt.figure(figsize=(16,10))
dendrogram = hc.dendrogram(z, labels = list(X_train_dropped.columns), orientation = 'left', leaf_font_size=16)

In [70]:
X_train_dropped.columns

Index(['longitude', 'bedrooms', 'bedrooms_is_na', 'host_response_rate_is_na',
       'summary_is_na', 'license_is_na', 'zipcode_is_na', 'bathrooms_is_na',
       'beds_is_na', 'host_sinceIs_year_end', 'requires_license',
       'property_type_Loft', 'property_type_other',
       'neighbourhood_group_cleansed_Mitte',
       'neighbourhood_group_cleansed_Neukölln',
       'neighbourhood_group_cleansed_Reinickendorf',
       'neighbourhood_group_cleansed_other', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'cancellation_policy_super_strict_60',
       'room_type_Private room', 'room_type_Shared room',
       'neighbourhood_cleansed_Moabit Ost',
       'neighbourhood_cleansed_Moabit West',
       'neighbourhood_cleansed_Parkviertel', 'zip_dists_1082',
       'zip_dists_other', 'last_reviewMonth_2.0', 'last_reviewMonth_3.0',
       'last_reviewMonth_12.0'],
      dtype='object')

In [73]:
first_pair = ['longitude', 'neighbourhood_group_cleansed_Neukölln']
second_pair = ['neighbourhood_group_cleansed_Mitte', 'neighbourhood_Wedding']
X_train_dropped[first_pair].corr(method = 'spearman')

| | longitude | neighbourhood_group_cleansed_Neukölln |
|---|---|---|
| longitude | 1.0 | 0.294026 |
| neighbourhood_group_cleansed_Neukölln | 0.294026 | 1.0 |


In [74]:
X_train_dropped[second_pair].corr(method = 'spearman')

| | neighbourhood_group_cleansed_Mitte | neighbourhood_Wedding |
|---|---|---|
| neighbourhood_group_cleansed_Mitte | 1.0 | 0.509146 |
| neighbourhood_Wedding | 0.509146 | 1.0 |
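Groups of related features like the pairs inspected above can also be pulled out of the linkage programmatically rather than read off the plot. The following is a sketch on synthetic data using SciPy's `fcluster`; the columns `p`, `p_twin`, and `q` and the 0.3 distance cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
base = rng.normal(size=300)
toy_X = pd.DataFrame({
    'p': base,
    'p_twin': base + rng.normal(scale=0.05, size=300),  # near copy of p
    'q': rng.normal(size=300),                           # independent of p
})

toy_corr = toy_X.corr(method='spearman')
toy_dist = 1 - np.abs(toy_corr.to_numpy())
np.fill_diagonal(toy_dist, 0)            # guard against tiny float error on the diagonal
toy_dist = (toy_dist + toy_dist.T) / 2   # enforce symmetry before condensing

toy_z = hc.linkage(squareform(toy_dist), method='average')

# Features whose clusters merge below distance 0.3 share a label
labels = hc.fcluster(toy_z, t=0.3, criterion='distance')
groups = pd.Series(labels, index=toy_X.columns)
print(groups)
```

Features that share a label are candidates for pruning down to a single representative, while features with their own label are left alone.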


### Removing Features

Neither of these correlations looks too strong, so we can leave both features of each pair in.

In [None]:
# model.fit(X_train_pruned, y_train_pruned)
# model.score(X_test_pruned, y_test_pruned)

In [76]:
model.fit(X_train_dropped, y_train_pruned)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [77]:
dropped_columns =['zip_dists_1335', 'host_neighbourhood_Wilmersdorf', 'state_is_berlin', 'street_is_berlin_berlin_germany', 'host_response_time_other']

In [79]:
X_test_dropped = X_test_pruned.drop(columns = dropped_columns)

model.score(X_test_dropped, y_test_pruned)

0.46007398634498153
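When scoring on a held-out set, the same columns must be dropped from it as from the training set, and in the same order. A minimal sketch with invented toy frames:

```python
import pandas as pd

toy_train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
toy_test = pd.DataFrame({'a': [7], 'b': [8], 'c': [9]})

toy_dropped_columns = ['b']

toy_train_dropped = toy_train.drop(columns=toy_dropped_columns)

# Drop the same columns from the test set, then select the training columns
# so that both membership and order match what the model was fit on
toy_test_dropped = toy_test.drop(columns=toy_dropped_columns)[toy_train_dropped.columns]

print(list(toy_train_dropped.columns) == list(toy_test_dropped.columns))
```

Keeping the list of dropped columns in one variable, as the lesson does with `dropped_columns`, avoids the two sets drifting apart.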

In [80]:
X_test_pruned.columns

Index(['longitude', 'bedrooms', 'bedrooms_is_na', 'host_response_rate_is_na',
       'summary_is_na', 'license_is_na', 'zipcode_is_na', 'bathrooms_is_na',
       'beds_is_na', 'host_sinceIs_year_end', 'requires_license',
       'state_is_berlin', 'street_is_berlin_berlin_germany',
       'property_type_Loft', 'property_type_other',
       'neighbourhood_group_cleansed_Mitte',
       'neighbourhood_group_cleansed_Neukölln',
       'neighbourhood_group_cleansed_Reinickendorf',
       'neighbourhood_group_cleansed_other', 'neighbourhood_Wedding',
       'neighbourhood_Wilmersdorf', 'host_neighbourhood_Wilmersdorf',
       'cancellation_policy_super_strict_60', 'room_type_Private room',
       'room_type_Shared room', 'host_response_time_other',
       'neighbourhood_cleansed_Moabit Ost',
       'neighbourhood_cleansed_Moabit West',
       'neighbourhood_cleansed_Parkviertel', 'zip_dists_1082',
       'zip_dists_1335', 'zip_dists_other', 'last_reviewMonth_2.0',
       'last_reviewMonth_3.0

So we removed five columns and essentially wound up with the same score as our previous, larger model (35 features and a score of 0.4628732594292488).

Here are our new importances: 

In [84]:
feature_importances(X_test_dropped, model)[:, :-1]

array([['room_type_Shared room', -0.9389809471607655],
       ['cancellation_policy_super_strict_60', 0.8424899490730072],
       ['requires_license', 0.7047767596167392],
       ['longitude', -0.6794678066264861],
       ['room_type_Private room', -0.5248350063414015],
       ['beds_is_na', 0.49824074208911406],
       ['neighbourhood_cleansed_Moabit West', -0.3956887963818316],
       ['property_type_Loft', 0.37686854970723155],
       ['property_type_other', 0.37594275346111766],
       ['neighbourhood_Wedding', -0.3633285128970777],
       ['bathrooms_is_na', -0.3257359699532433],
       ['host_sinceIs_year_end', -0.3195202583914894],
       ['bedrooms_is_na', -0.31867848759722883],
       ['neighbourhood_cleansed_Moabit Ost', -0.2977904215325947],
       ['bedrooms', 0.2662678090864989],
       ['neighbourhood_group_cleansed_Reinickendorf',
        -0.2191956998422411],
       ['neighbourhood_group_cleansed_other', -0.21602486160547915],
       ['license_is_na', -0.188607483790663