# Using Correlations

### Introduction

In this lesson, we'll apply what we learned about correlations to reduce the number of features in our Airbnb dataset.  Let's get started.

### Loading the Data

We'll start by loading up our training set with the top features that we previously found using permutation importance.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-selection/master/listings_train_top_forty.csv"
listings_df = pd.read_csv(url, index_col = 0)
listings_df.shape

(17952, 41)

The 41 columns include `price`, which is our target, so we'll separate it from the 40 features.

In [170]:
X = listings_df.drop('price', axis = 1)

In [171]:
y = listings_df['price']

In [210]:
from sklearn.model_selection import train_test_split
X_train, X_validate, y_train, y_validate = train_test_split(X, y, random_state = 21)

In [212]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train).score(X_validate, y_validate)
# 0.4847218710182034

0.4847218710182034

### Finding Correlated Features In Code

Because we have so many features, we'll avoid plotting a scatter matrix.

Instead, we can calculate our correlations in code.  To do so, every value first needs to be numeric, so we'll select our boolean features and coerce them to 1s and 0s.

In [199]:
bool_cols = X.select_dtypes(include='bool').columns

# we do this with astype int
X[bool_cols] = X[bool_cols].astype(int)

Now that everything is numeric, we can calculate the correlation between all columns.

Let's start with the Pearson correlation.  Assign `pearson_corr_df` to a grid of the Pearson correlation coefficients between every pair of features.

In [200]:
pearson_corr_df = None

pearson_corr_df.iloc[:5, :5]

| | accommodates | guests_included | availability_90 | calculated_host_listings_count | property_type_x0_Apartment |
|---|---|---|---|---|---|
| accommodates | 1.0 | 0.502932 | 0.148772 | 0.182929 | -0.118029 |
| guests_included | 0.502932 | 1.0 | 0.099042 | 0.09966 | -0.059833 |
| availability_90 | 0.148772 | 0.099042 | 1.0 | 0.223432 | -0.178087 |
| calculated_host_listings_count | 0.182929 | 0.09966 | 0.223432 | 1.0 | -0.159715 |
| property_type_x0_Apartment | -0.118029 | -0.059833 | -0.178087 | -0.159715 | 1.0 |


In [201]:
pearson_corr_df.shape

# (40, 40)

(40, 40)
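A minimal sketch of building such a grid, using a small made-up DataFrame rather than the listings data. The column values and the `toy_` names are hypothetical; the sketch assumes pandas' `DataFrame.corr`, which returns one row and one column per numeric feature.

```python
import pandas as pd

# Toy stand-in for the listings features (hypothetical values).
toy_df = pd.DataFrame({
    "accommodates": [2, 4, 6, 8, 10],
    "guests_included": [1, 2, 3, 4, 5],
    "is_flexible": [True, False, True, False, True],
})

# Coerce booleans to 1s and 0s so that every column is numeric.
toy_bool_cols = toy_df.select_dtypes(include="bool").columns
toy_df[toy_bool_cols] = toy_df[toy_bool_cols].astype(int)

# Each cell holds the Pearson coefficient for one pair of features.
toy_pearson = toy_df.corr(method="pearson")
print(toy_pearson.shape)
```

With three features the grid is 3 by 3, just as the 40 listing features give a 40 by 40 grid.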

Each cell above tells us how strong the linear relationship is between a pair of features.  Next, for each feature, count the number of correlations over 0.70, and sort the features by this count.

In [202]:
pearson_corr_amounts = None

In [231]:
pearson_corr_amounts[:12]

last_reviewYear_is_na                7
reviews_per_month_is_na              7
review_scores_value_is_na            7
review_scores_location_is_na         7
review_scores_communication_is_na    7
review_scores_rating_is_na           7
first_reviewYear_is_na               7
host_sinceYear_is_na                 3
host_total_listings_count_is_na      3
host_listings_count_is_na            3
host_sinceDayofyear                  2
first_reviewElapsed                  2
dtype: int64
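One way to produce counts like these, sketched on a hypothetical four-feature correlation grid (the names and coefficients are made up). It assumes a plain `> 0.70` comparison; wrapping the grid in `.abs()` first would also catch strong negative correlations, and the lesson does not say which variant it uses.

```python
import pandas as pd

# Hypothetical correlation grid for four features.
toy_names = ["a_is_na", "b_is_na", "c_is_na", "accommodates"]
toy_corr = pd.DataFrame(
    [[1.00, 0.95, 0.90, 0.10],
     [0.95, 1.00, 0.85, 0.05],
     [0.90, 0.85, 1.00, 0.20],
     [0.10, 0.05, 0.20, 1.00]],
    index=toy_names, columns=toy_names,
)

# The comparison gives a grid of True/False; summing counts the Trues per column.
toy_amounts = (toy_corr > 0.70).sum().sort_values(ascending=False)
print(toy_amounts)
```

In this sketch every feature's correlation of 1.0 with itself is included in its count, so an uncorrelated feature scores 1 rather than 0. The lesson's counts appear to behave the same way, which is consistent with the later `> 1` filter.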

We can see that a lot of our na columns are correlated with each other.  Let's start by removing the columns that have a count of 7.  We can leave one of them in place, so that the information is still captured in that one column.

Select the columns with a count equal to 7, drop all but one of them from our training and validation features, train a model, and check the score on the validation set.

In [194]:
pearson_corr_cols = None

['last_reviewYear_is_na',
 'reviews_per_month_is_na',
 'review_scores_value_is_na',
 'review_scores_location_is_na',
 'review_scores_communication_is_na',
 'review_scores_rating_is_na',
 'first_reviewYear_is_na']

In [233]:
X_train_removed_pearson = None
X_validate_removed_pearson = None

In [234]:

# 0.4848176620843354

0.4848176620843354

In [236]:
X_train_removed_pearson.shape
# (13464, 34)

(13464, 34)
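The select-then-drop step can be sketched as follows on toy data. All names and values here are hypothetical, and keeping the first member of the group is an arbitrary choice made for illustration. The lesson's shape going from 40 to 34 columns is consistent with six of the seven flagged columns being dropped.

```python
import pandas as pd

# Hypothetical counts and training features.
toy_amounts = pd.Series({"a_is_na": 3, "b_is_na": 3, "c_is_na": 3, "accommodates": 1})
toy_X_train = pd.DataFrame({
    "a_is_na": [0, 1, 0, 1],
    "b_is_na": [0, 1, 0, 1],
    "c_is_na": [0, 1, 0, 0],
    "accommodates": [2, 4, 3, 6],
})

# Features that share the top count (3 here, 7 in the lesson).
toy_group = list(toy_amounts[toy_amounts == 3].index)

# Keep the first member of the group and drop the rest.
toy_drop_cols = toy_group[1:]
toy_X_reduced = toy_X_train.drop(columns=toy_drop_cols)
print(list(toy_X_reduced.columns))
```

The same list of columns would then be dropped from the validation features so that both sets keep matching columns.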

OK, with this reduced set of features, let's repeat the check using the Spearman correlation, again counting the correlations above 0.70 for each feature.  First, find the Spearman correlation between each pair of the remaining features.

In [245]:
spearman_corr_df = None

In [246]:
spearman_corr_df[:2]

| | accommodates | guests_included | availability_90 | calculated_host_listings_count | property_type_x0_Apartment | room_type_x0_Entire home/apt | cancellation_policy_x0_flexible | cancellation_policy_x0_moderate | cancellation_policy_x0_strict_14_with_grace_period | neighbourhood_group_cleansed_x0_Friedrichshain-Kreuzberg | ... | host_sinceDayofyear | first_reviewYear | first_reviewMonth | first_reviewDayofyear | last_reviewYear | last_reviewDayofyear | host_listings_count_is_na | host_total_listings_count_is_na | host_sinceYear_is_na | last_reviewYear_is_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| accommodates | 1.0 | 0.44539 | 0.127489 | 0.131958 | -0.073413 | 0.461801 | -0.175517 | 0.034774 | 0.151342 | -0.016655 | ... | -0.005133 | -0.05133 | -0.014691 | -0.013655 | 0.085635 | 0.057022 | 0.005791 | 0.005791 | 0.005791 | -0.092999 |
| guests_included | 0.44539 | 1.0 | 0.12166 | 0.11011 | -0.049294 | 0.31726 | -0.222347 | 0.084096 | 0.147671 | -0.002118 | ... | 0.001708 | -0.096877 | -0.020118 | -0.019819 | 0.092963 | 0.070199 | 0.004369 | 0.004369 | 0.004369 | -0.104033 |


In [247]:
spearman_corr_df.shape
# (34, 34)

(34, 34)
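To see why a second pass with Spearman can flag pairs that Pearson misses, here is a small sketch on made-up data (the `toy_` names are hypothetical). Spearman works on ranks, so any strictly increasing relationship scores 1.0 even when it is far from a straight line.

```python
import pandas as pd

# x_cubed always rises with x, but not along a straight line.
toy_df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
toy_df["x_cubed"] = toy_df["x"] ** 3

toy_pearson_r = toy_df.corr(method="pearson").loc["x", "x_cubed"]
toy_spearman_r = toy_df.corr(method="spearman").loc["x", "x_cubed"]

# Spearman is 1.0 because the two columns have identical rank order;
# Pearson is lower because the points do not lie on one line.
print(round(toy_pearson_r, 3), round(toy_spearman_r, 3))
```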

Then, for each feature, calculate the number of Spearman correlations above 0.70.

In [248]:
spearman_corr_amounts = None

In [250]:
spearman_corr_amounts[spearman_corr_amounts > 1]

host_total_listings_count_is_na    3
host_listings_count_is_na          3
host_sinceYear_is_na               3
first_reviewElapsed                2
host_sinceDayofyear                2
host_sinceElapsed                  2
last_reviewElapsed                 2
host_sinceYear                     2
host_sinceMonth                    2
first_reviewYear                   2
first_reviewMonth                  2
first_reviewDayofyear              2
last_reviewYear                    2
dtype: int64

Ok, we see a lot of repeated words in the above features -- `host_since`, `first_review`, `listings_count`.  Let's try removing these potentially duplicate features.  We'll hold onto the `elapsed` columns, and drop the duplicates.  This means we'll drop:
* `host_total_listings_count_is_na`
* `host_sinceYear_is_na`, `host_sinceDayofyear`, `host_sinceYear`, `host_sinceMonth`
* `first_reviewYear`, `first_reviewMonth`, `first_reviewDayofyear`
* `last_reviewYear`

In [251]:
spearman_corr_cols = ['host_total_listings_count_is_na', 'host_sinceYear_is_na', 'host_sinceDayofyear', 'host_sinceYear',
                      'host_sinceMonth', 'first_reviewYear', 'first_reviewMonth', 'first_reviewDayofyear', 'last_reviewYear'
                     ]


In [253]:
X_train_removed_spearman = X_train_removed_pearson.drop(columns = spearman_corr_cols)

In [254]:
X_validate_removed_spearman = X_validate_removed_pearson.drop(columns = spearman_corr_cols)

In [255]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train_removed_spearman, y_train).score(X_validate_removed_spearman, y_validate)

0.4824619939089561

So the score is essentially unchanged: 0.4825 here versus 0.4848 before.

In [256]:
X_validate_removed_spearman.shape

(4488, 25)

And now we are down to 25 columns.

Let's take another look at our feature importances.

In [257]:
from eli5.sklearn import PermutationImportance
import eli5


perm = PermutationImportance(model).fit(X_validate_removed_spearman, y_validate)

exp_df = eli5.explain_weights_df(perm, feature_names = list(X_train_removed_spearman.columns))
exp_df

| | feature | weight | std |
|---|---|---|---|
| 0 | last_reviewYear_is_na | 37.105381 | 0.562376 |
| 1 | last_reviewElapsed | 31.327157 | 0.719975 |
| 2 | cancellation_policy_x0_flexible | 0.351787 | 0.004794 |
| 3 | cancellation_policy_x0_moderate | 0.287987 | 0.004986 |
| 4 | cancellation_policy_x0_strict_14_with_grace_pe... | 0.24642 | 0.008429 |
| 5 | first_reviewElapsed | 0.146035 | 0.006223 |
| 6 | accommodates | 0.121869 | 0.006517 |
| 7 | room_type_x0_Entire home/apt | 0.107334 | 0.003195 |
| 8 | availability_90 | 0.044049 | 0.002377 |
| 9 | bedrooms | 0.04391 | 0.004688 |
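The lesson uses `eli5` for permutation importance. If `eli5` is unavailable, scikit-learn's own `sklearn.inspection.permutation_importance` is one alternative. The sketch below swaps in that function on synthetic data; the data, the `toy_` names, and the choice of `n_repeats=5` are all assumptions for illustration, and its numbers will not match the `eli5` weights above.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends on "signal" and not on "noise".
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
toy_y = 3 * toy_X["signal"] + rng.normal(scale=0.1, size=200)

toy_model = LinearRegression().fit(toy_X, toy_y)

# Shuffle each column in turn and record how much the score drops.
# For brevity this scores on the training data; a held-out set is preferable.
toy_result = permutation_importance(toy_model, toy_X, toy_y, n_repeats=5, random_state=0)

toy_importances = pd.DataFrame({
    "feature": toy_X.columns,
    "weight": toy_result.importances_mean,
    "std": toy_result.importances_std,
}).sort_values("weight", ascending=False)
print(toy_importances)
```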


Select the bottom four features from the `exp_df` dataframe above.

In [271]:
to_drop = None
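Picking the lowest-ranked features can be sketched as below, on a hypothetical importance table with the same `feature` and `weight` columns. Sorting explicitly before taking the tail avoids relying on the table already being in descending order.

```python
import pandas as pd

# Hypothetical importance table.
toy_exp_df = pd.DataFrame({
    "feature": ["f1", "f2", "f3", "f4", "f5", "f6"],
    "weight": [0.9, 0.5, 0.2, 0.05, 0.01, 0.0],
})

# Sort from most to least important, then take the last four rows.
toy_sorted = toy_exp_df.sort_values("weight", ascending=False)
toy_to_drop = list(toy_sorted["feature"].tail(4))
print(toy_to_drop)
```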

Then we'll remove these features and see how we do.

In [272]:
from sklearn.linear_model import LinearRegression

reduced_X_train = X_train_removed_spearman.drop(columns = to_drop)
reduced_X_validate = X_validate_removed_spearman.drop(columns = to_drop)
model = LinearRegression()
model.fit(reduced_X_train, y_train).score(reduced_X_validate, y_validate)
# 0.4825615791102038

0.4825615791102038

So now we are down to 21 features.

In [286]:
selected_cols = exp_df[:21].feature.values
# selected_cols

While we could now use this information to focus our feature engineering, let's stop here for now.

### Summary

In this lesson, we reduced our features from 40 down to 21.  We did so by removing features that were highly correlated with several other features, which we identified using both the `pearson` and `spearman` correlation coefficients.  Then, after reducing our features with correlations, we took another look at our feature importances and removed the features with the lowest scores.  Throughout, the validation score held steady at roughly 0.48.