## Considering Correlated Variables (a.k.a. Feature Selection)

Depending on the clustering technique, correlated variables can have an unexpected effect on the results by allowing some dimensions to be 'double-weighted'. So we don't want to keep too many correlated variables in the clustering data, since they bias the clustering algorithm towards those dimensions and may result in poor 'performance'.

<div style="padding:5px;margin-top:5px;margin-bottom:5px;border:dotted 1px red;background-color:rgb(255,233,233);color:red">STOP. Think about _why_ correlation between two variables could lead to 'double-weighting' in the clustering results!</div>
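
If you'd like to check your reasoning, here's a minimal sketch using made-up numbers (the two points and the duplicated column are assumptions for illustration, not values from our data). Euclidean distance sums squared differences across every column, so a perfectly correlated copy of a variable makes that dimension count twice:

```python
import numpy as np

# Two hypothetical observations described by two variables (x, y)
a = np.array([0.0, 0.0])
b = np.array([1.0, 1.0])

# Distance using x and y once each
d_plain = np.linalg.norm(a - b)

# Add a perfectly correlated copy of x: the columns are now (x, x_copy, y)
a_dup = np.array([a[0], a[0], a[1]])
b_dup = np.array([b[0], b[0], b[1]])
d_dup = np.linalg.norm(a_dup - b_dup)

# Share of the squared distance that comes from the x dimension
share_plain = (a[0] - b[0])**2 / d_plain**2      # 1 of 2 equal terms
share_dup   = 2 * (a[0] - b[0])**2 / d_dup**2    # 2 of 3 equal terms

print(f"Without duplicate: distance={d_plain:.3f}, x share={share_plain:.2f}")
print(f"With duplicate:    distance={d_dup:.3f}, x share={share_dup:.2f}")
```

Real variables are rarely perfect copies of one another, but the stronger the correlation, the closer you get to this situation.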

One way to deal with this is to produce a correlation table for all variables and then look to remove problematic variables. For a gentle introduction (that kind of leaves you hanging at the end) there's a nice-looking blog post on [Medium](https://medium.com/towards-artificial-intelligence/feature-selection-and-dimensionality-reduction-using-covariance-matrix-plot-b4c7498abd07): 

> Feature selection and dimensionality reduction are important because of three main reasons:
> - Prevents Overfitting: A high-dimensional dataset having too many features can sometimes lead to overfitting (model captures both real and random effects).
> - Simplicity: An over-complex model having too many features can be hard to interpret especially when features are correlated with each other.
> - Computational Efficiency: A model trained on a lower-dimensional dataset is computationally efficient (execution of algorithm requires less computational time).
> Dimensionality reduction, therefore, plays a crucial role in data preprocessing.

There's also [this post](https://towardsdatascience.com/feature-selection-correlation-and-p-value-da8921bfb3cf) and [this one](https://towardsdatascience.com/why-feature-correlation-matters-a-lot-847e8ba439c4). We could also use Principal Components Analysis (PCA) to perform dimensionality reduction whilst also dealing with correlation between the variables.
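
To give a feel for what PCA does with correlated variables, here's a rough sketch on synthetic data (the three variables below are invented for illustration). It computes the components by hand with NumPy's SVD so that each step is visible; in practice you would more likely reach for `sklearn.decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): x2 is almost a copy
# of x1, while x3 is independent noise
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# PCA by hand: centre the data, then take the singular value decomposition
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Proportion of total variance captured by each component
explained = S**2 / np.sum(S**2)

# Component scores: the data expressed on the new, uncorrelated axes
scores = Xc @ Vt.T

print("Explained variance ratio:", np.round(explained, 3))
print("Correlation between component scores:")
print(np.round(np.corrcoef(scores, rowvar=False), 3))
```

Because x1 and x2 carry nearly the same information, the last component should account for very little of the variance, and the component scores should be uncorrelated with one another.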

In [None]:
# Here's an output table which gives you nice, specific
# numbers but is hard to read, so I'm only showing a
# 6x6 slice (rows and columns 1 to 6, skipping the first)...
scdf.corr().iloc[1:7,1:7]

### Finding Strong Correlations Visually

In [None]:
# And here's a correlation heatmap... which is easier to read but has
# less detail. What it *does* highlight is high levels of *negative*
# correlation as well as positive, so you'll need the absolute value,
# not just whether something is more than 0.x correlated.
# 
# From https://seaborn.pydata.org/examples/many_pairwise_correlations.html
cdf = scdf.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(cdf, dtype=bool) # np.bool was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))

# Generate a custom diverging colormap
cm = sns.diverging_palette(240, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(cdf, mask=mask, cmap=cm, vmax=1.0, vmin=-1.0, center=0,
            square=True, linewidths=.1, cbar_kws={"shrink": .5})

<div style="padding:5px;margin-top:5px;margin-bottom:5px;border:dotted 1px red;background-color:rgb(255,233,233);color:red">STOP. Make sure that you understand what the figure above is showing before proceeding to the next stage.</div>
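
If the mask is the confusing part, this small sketch may help. It uses a made-up 3x3 correlation matrix (the values and the names A, B and C are assumptions) to show which cells the heatmap hides, and then lists the remaining pairs by strength of correlation regardless of sign:

```python
import numpy as np
import pandas as pd

# A small, made-up correlation matrix (values are illustrative only)
toy = pd.DataFrame(
    [[ 1.0,  0.8, -0.7],
     [ 0.8,  1.0,  0.1],
     [-0.7,  0.1,  1.0]],
    index=list("ABC"), columns=list("ABC"))

# True on the diagonal and above: these are the cells the heatmap hides
mask = np.zeros(toy.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True
print(mask)

# The visible cells are the ones strictly below the diagonal
rows, cols = np.tril_indices_from(toy.to_numpy(), k=-1)
pairs = pd.Series(
    toy.to_numpy()[rows, cols],
    index=[f"{toy.index[r]}-{toy.columns[c]}" for r, c in zip(rows, cols)])

# Order by absolute value so strong negative correlations rank highly too
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs)
```

Each pair of variables appears exactly once below the diagonal, which is why nothing is lost by hiding the upper triangle.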

### Finding Strong Correlations Numerically

In [None]:
# Generate the matrix but capture the output this time
cdf = scdf.corr()
cdf['name'] = cdf.index # We need a copy of the index

In [None]:
corrh = 0.66 # Specify threshold for highly correlated?
print("! High correlation threshold is {0}.".format(corrh))

num_corrs = []
hi_corrs  = []

for c in cdf.name.unique():
    if c != 'name':
        # Some formatting
        print("=" * 10 + f" {c} " + "=" * 10)
        
        # Find highly correlated variables
        hits = cdf.loc[(abs(cdf[c]) >= corrh), c]
        hits = hits.drop(c) # Ignore the variable's correlation with itself

        if hits.size == 0: # No correlations >= corrh
            print("+ Not highly correlated with other variables.")
        else:
            num_corrs.append(hits.size)
            
            print("- High correlations ({0}) with other variables:".format(hits.size))
            print("    " + "\n    ".join(hits.index.values))
            hi_corrs.append(hits.size)  
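
The same counts can also be produced without a loop. Here's a sketch on a stand-in data frame (`demo` and its columns are invented for illustration; with the real data you would use `scdf` instead):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for `scdf` (an assumption: any all-numeric DataFrame works)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.2, size=200),   # strongly tied to a
    "c": -base + rng.normal(scale=0.2, size=200),  # strongly, negatively tied to a
    "d": rng.normal(size=200),                     # unrelated noise
})

corrh = 0.66
corr = demo.corr()

# Boolean matrix of strong correlations; subtract 1 to ignore each
# variable's perfect correlation with itself on the diagonal
n_strong = (corr.abs() >= corrh).sum() - 1

print(n_strong.sort_values(ascending=False))
```

Unlike the loop above, this keeps the zeros, so variables with no strong correlations show up in the result as well.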

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(hi_corrs, bins=list(range(0,20))).set_title(
    "Number of Strong Correlations (|r| >= " + str(corrh) + ") with Other Variables")

### Stripping Out 'Redundant' Variables

Let's remove any variable that has a '*lot*' of strong correlations with other variables, though we need to define what is 'a lot'. This will reduce the dimensionality of our data and make clustering a bit easier. An alternative approach to dimensionality reduction would be to apply Principal Components Analysis (PCA) to the data set and to work with the resulting components afterwards. This can be more 'robust', provided that all of the variables are on a comparable scale (which we've done using the MinMaxScaler), though it is harder for many to understand. PCA is also available in `sklearn`.

We'll set our threshold (`maxcorrs`) at 4.0 based on a visual inspection of the chart above.

In [None]:
corrh     = 0.66 # Specify threshold for highly correlated?
maxcorrs  = 4.0 # What's our threshold for too many strong correlations?
threshold = 0.5*maxcorrs # What's our threshold for too many strong correlations with columns we keep!

print("! High correlation threshold is {0}.".format(corrh))

to_drop = [] # Columns to drop
to_keep = [] # Columns to keep

num_corrs = []
hi_corrs  = []

for c in cdf.columns:
    if c != 'name':
        
        # Find highly correlated variables, but let's
        # keep the focus on *positive* correlation now
        hits = cdf.loc[(cdf[c] >= corrh), c]
        hits = hits.drop(c) # Ignore the variable's correlation with itself
        
        multi_vals = False
        
        # Remove ones with many correlations
        if hits.size >= maxcorrs: 
            print(f"- {c} meets or exceeds maxcorr ({maxcorrs}) correlation threshold (by {hits.size-maxcorrs}).")
            s1 = set(to_keep)
            s2 = set(hits.index.values)
            #print("Comparing to_keep (" + ", ".join(s1) + ") to hits (" + ", ".join(s2) + ")")
            s1 &= s2
            #print("Column found in 'many correlations' :" + str(s1))
            if len(s1) >= threshold: 
                multi_vals = True
                print(f"    - Dropping b/c exceed {threshold} correlations with retained cols: \n        -" + "\n        -".join(s1))
            else:
                print(f"    + Keeping b/c fewer than {threshold} correlations with retained columns.")
        else: 
            print(f"+ {c} falls below maxcorr ({maxcorrs}) correlation threshold (by {maxcorrs-hits.size}).")
            
        if multi_vals:
            to_drop.append(c)
        else:
            to_keep.append(c)
        

print(" ")
print("To drop ({0}): ".format(len(to_drop)) + ", ".join(to_drop))
print(" ")
print("To keep ({0}): ".format(len(to_keep)) + ", ".join(to_keep))
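
To make the keep/drop rule easier to follow, here's the same logic condensed into a small function and run on a toy matrix. The function name `split_columns`, the toy values and the variable names `v1` to `v6` are all assumptions for illustration; this is a sketch of the rule, not a replacement for the cell above.

```python
import pandas as pd

def split_columns(corr, corrh=0.66, maxcorrs=4, threshold=2):
    """Greedy keep/drop pass over a correlation matrix (illustrative sketch)."""
    to_keep, to_drop = [], []
    for c in corr.columns:
        # Strong positive correlations with *other* variables
        strong = (corr[c] >= corrh).to_numpy() & (corr.index != c)
        hits = corr.index[strong]
        # How many of those partners have we already decided to keep?
        overlap = set(hits) & set(to_keep)
        if len(hits) >= maxcorrs and len(overlap) >= threshold:
            to_drop.append(c)
        else:
            to_keep.append(c)
    return to_keep, to_drop

# Toy matrix: v1..v5 are all strongly inter-correlated, v6 stands alone
names = [f"v{i}" for i in range(1, 7)]
toy = pd.DataFrame(0.9, index=names, columns=names)
toy.loc["v6", :] = 0.0
toy.loc[:, "v6"] = 0.0
for n in names:
    toy.loc[n, n] = 1.0

keep, drop = split_columns(toy)
print("Keep:", keep)   # Keep: ['v1', 'v2', 'v6']
print("Drop:", drop)   # Drop: ['v3', 'v4', 'v5']
```

Notice that the result depends on column order: `v1` and `v2` survive only because they are visited first, before any of their correlated partners have been retained.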

In [None]:
to_save = scdf.drop(to_drop, axis=1, errors='raise')
print("Retained variables: " + ", ".join(to_save.columns.values))
to_save.to_pickle(os.path.join('data','LSOA_2Cluster.pickle'))
del to_save