# Analyzing correlations between variables in customer_data_clean_python

This notebook studies the correlations between the variables of the dataset. It contains the following sections:

* [Setup and loading the data](#setup)
* [Correlation matrix](#corr-matrix)
* [Scatterplot matrix](#scatter-matrix)
* [Detailed analysis between two variables](#two-vars)

<center><strong>Select Cell > Run All to execute the whole analysis</strong></center>

## Setup and dataset loading <a id="setup"></a> 

First of all, let's load the libraries that we'll use.

In [0]:
%matplotlib inline
import dataiku                          # Access to Dataiku datasets
import pandas as pd, numpy as np        # Data manipulation
import matplotlib.pyplot as plt         # Plotting
import scipy.cluster.hierarchy as sch   # Used for reordering the correlation matrix
import warnings                         # Used to silence deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

Next, we load the dataset and identify the three main types of columns:

* Numerical
* Categorical
* Dates

Since analyzing correlations requires the data to be in memory, we only load a sample of the data. Modify the following cell to change the size of the sample.

In [0]:
# Take a handle on the dataset
mydataset = dataiku.Dataset("customer_data_clean_python")

# Load the first 100,000 rows.
# You can also load random samples, limit yourself to some columns, or only load
# data matching some filters.
#
# Please refer to the Dataiku Python API documentation for more information.
df = mydataset.get_dataframe(limit=100000)

# Get the column names
numerical_columns = list(df.select_dtypes(include=[np.number]).columns)
categorical_columns = list(df.select_dtypes(include=[object]).columns)
date_columns = list(df.select_dtypes(include=['<M8[ns]']).columns)

# Print a quick summary of what we just loaded
print("Loaded dataset")
print("   Rows: %s" % df.shape[0])
print("   Columns: %s (%s num, %s cat, %s date)" % (df.shape[1], 
                                                    len(numerical_columns), len(categorical_columns),
                                                    len(date_columns)))
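
To make the column split above more concrete, here is a minimal sketch on a toy DataFrame. The `toy` frame and its column names are made up for illustration and are not part of the dataset; only the `select_dtypes` calls mirror the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numerical columns, one text column, one date column.
toy = pd.DataFrame({
    "age": [34, 51, 27],
    "income": [42000.0, 58500.5, 39900.0],
    "segment": pd.Series(["a", "b", "a"], dtype=object),
    "signup": np.array(["2020-01-15", "2021-06-01", "2019-11-30"], dtype="datetime64[ns]"),
})

# Same selectors as in the loading cell.
toy_numerical = list(toy.select_dtypes(include=[np.number]).columns)
toy_categorical = list(toy.select_dtypes(include=[object]).columns)
toy_dates = list(toy.select_dtypes(include=['<M8[ns]']).columns)

print(toy_numerical, toy_categorical, toy_dates)
```

Integer and float columns both count as numerical, while text stored with the `object` dtype is treated as categorical.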

## Correlation matrix <a id="corr-matrix"></a>

The very first correlation analysis consists of plotting the "Correlation matrix" for numerical variables.

For each pair of numerical variables, this computes the strength of their linear correlation, known as the Pearson coefficient:

 * 1.0 means a perfect correlation
 * 0.0 means no correlation
 * -1.0 means a perfect "inverse" correlation
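
The three reference values above can be reproduced by hand. The following is a small sketch, not the code used by pandas: the helper `pearson` and the sample arrays are illustrative only. It also shows that a coefficient of 0 only rules out a linear relationship, since the third pair is fully dependent yet uncorrelated.

```python
import numpy as np

def pearson(a, b):
    """Pearson coefficient: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos = pearson(x, 2 * x + 1)      # increasing straight line: perfect correlation
r_neg = pearson(x, -3 * x + 4)     # decreasing straight line: perfect inverse correlation
r_zero = pearson(x, (x - 3) ** 2)  # symmetric parabola: dependent, but no linear trend

print(r_pos, r_neg, r_zero)
```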
 
Since it does not really make sense to print this correlation plot for hundreds of variables, we restrict it to the first 50 numerical variables of the dataset. Modify the following cell to change this.

In [0]:
# Select variables to plot for the correlation matrix
corr_matrix_vars = numerical_columns[0:50]

print("Plotting the correlation matrix on the following variables : %s" % corr_matrix_vars)

In [0]:
# Only select the requested columns
df_corr_matrix = df[corr_matrix_vars]

# Compute the Pearson coefficient for every pair of columns.
# Pairs with an undefined coefficient (for example when a column is constant) are set to 0.
corr = df_corr_matrix.corr().fillna(0)

# Start drawing

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
size = max(10, len(corr.columns)/2.)
f, ax = plt.subplots(figsize=(size, size))

# Draw the heatmap with the mask and correct aspect ratio
p = ax.imshow(corr.mask(mask), cmap='inferno')
f.colorbar(p)
ax.set_yticks(range(0, len(corr.columns)))
ax.set_yticklabels(corr.columns)
ax.set_xticks(range(0, len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation = 90)
plt.show()
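
Because a correlation matrix is symmetric, the cell above hides the upper triangle before plotting. Here is a minimal sketch of that masking step on a hypothetical 3x3 matrix (`toy_corr` and its labels are made up for illustration):

```python
import numpy as np
import pandas as pd

labels = ["a", "b", "c"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=labels, columns=labels)

# True marks the cells to hide: the upper triangle, diagonal included.
toy_mask = np.zeros(toy_corr.shape, dtype=bool)
toy_mask[np.triu_indices_from(toy_mask)] = True

# DataFrame.mask replaces the cells where the condition is True with NaN,
# which imshow then leaves blank.
lower_only = toy_corr.mask(toy_mask)
print(lower_only)
```

Only the three coefficients below the diagonal remain, so each pair of variables is drawn once.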


### Reordered correlation matrix

An interesting improvement over the correlation matrix is to reorder it by similarity between the variables so that the "groups" of variables that are strongly correlated appear close in the matrix.

In [0]:
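# Treat each row of the correlation matrix as the feature vector of a variable.
D = corr.values
# Cluster the variables hierarchically; the dendrogram is computed but not plotted.
Y = sch.linkage(D, method='centroid')
Z = sch.dendrogram(Y, orientation='right', no_plot=True)
# Reorder the rows and columns of the matrix to follow the dendrogram leaves.
index = Z['leaves']
D = D[index,:]
D = D[:,index]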

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy versions
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
size = max(10, len(corr.columns)/2.)
f, ax = plt.subplots(figsize=(size, size))

# Draw the heatmap with the mask and correct aspect ratio

p = ax.imshow(np.where(mask, float('NaN'), D), cmap='inferno')
f.colorbar(p)
ax.set_yticks(range(0, len(corr.columns)))
ax.set_yticklabels(corr.columns[index])
ax.set_xticks(range(0, len(corr.columns)))
ax.set_xticklabels(corr.columns[index], rotation = 90)
plt.show()
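
To see what the reordering does, here is a sketch on a hypothetical four-variable matrix in which `a` and `c` are strongly correlated, as are `b` and `d`, but the two groups are interleaved. The matrix values and names are invented for illustration; the clustering calls are the same as in the cell above.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

names = ["a", "b", "c", "d"]
toy_corr = pd.DataFrame(
    [[1.0, 0.1, 0.9, 0.0],
     [0.1, 1.0, 0.2, 0.8],
     [0.9, 0.2, 1.0, 0.1],
     [0.0, 0.8, 0.1, 1.0]],
    index=names, columns=names)

# Each row is used as the observation vector of one variable, so variables
# with similar correlation profiles end up in the same cluster.
toy_linkage = sch.linkage(toy_corr.values, method="centroid")
toy_order = sch.dendrogram(toy_linkage, no_plot=True)["leaves"]

# Apply the same permutation to the rows and the columns.
reordered = toy_corr.iloc[toy_order, toy_order]
print(list(reordered.index))
```

After reordering, `a` sits next to `c` and `b` next to `d`, so each group forms a contiguous block in the heatmap.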


## Scatter matrices <a id="scatter-matrix"></a>

In [0]:
# Only generate the scatterplot matrix on a sample
df_scatter_samp = df.sample(min(5000, df.shape[0])) # 5000 points maximum on the scatter plot

# Take the first 4 numerical variables to plot the scatterplot matrix
scatter_matrix_vars = numerical_columns[0:4]

# If we have categorical variables, use the one with the fewest distinct values (modalities)
# to color the points of the scatterplot
scatter_matrix_color = None

cat_cols_with_cards = [(x, df[x].nunique()) for x in categorical_columns]
# Skip columns with a single modality (nothing to distinguish) and columns
# with more than 10 modalities (too many colors to read on a plot)
cat_cols_with_cards_f = [x for x in cat_cols_with_cards if x[1] >= 2 and x[1] <= 10]

if len(cat_cols_with_cards_f) > 0:
    # We have at least one categorical variable with a good number of modalities, use it
    scatter_matrix_color = sorted(cat_cols_with_cards_f, key= lambda c : c[1])[0][0]
    
print("We will plot the following numerical variables : %s" % scatter_matrix_vars)
if scatter_matrix_color is not None:
    print("Coloring the scatters by: %s" % scatter_matrix_color)
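
The selection rule for the coloring column can be checked on a small example. This is a sketch with invented data: `toy`, `country`, `gender` and `constant` are hypothetical names, not columns of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": pd.Series(["FR", "US", "DE", "FR", "US", "IT"], dtype=object),  # 4 distinct values
    "gender": pd.Series(["F", "M", "F", "M", "F", "M"], dtype=object),         # 2 distinct values
    "constant": pd.Series(["x"] * 6, dtype=object),                            # 1 distinct value
})

# Count the distinct values of every candidate column.
cards = [(col, toy[col].nunique()) for col in toy.columns]

# Keep columns with between 2 and 10 distinct values, then pick the smallest.
eligible = [(col, n) for col, n in cards if 2 <= n <= 10]
color_col = min(eligible, key=lambda pair: pair[1])[0] if eligible else None

print(color_col)
```

Here `constant` is rejected because it has a single value, and `gender` wins over `country` because it has fewer distinct values.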

In [0]:
# Uncomment this if you want to take manual control over which variables are plotted
# scatter_matrix_vars = ["num1", "num2", "num3"]
# scatter_matrix_color = "cat1"

In [0]:
fig, axs = plt.subplots(ncols=len(scatter_matrix_vars), nrows=len(scatter_matrix_vars), figsize=(16, 16))
c_s = df_scatter_samp[scatter_matrix_color].astype('category').cat.codes if scatter_matrix_color is not None else None
for i in range(0, len(scatter_matrix_vars)):
    for j in range(0, len(scatter_matrix_vars)):
        ax = axs[j, i]
        x_s = df_scatter_samp[scatter_matrix_vars[i]]
        y_s = df_scatter_samp[scatter_matrix_vars[j]]
        if i == j:
            if scatter_matrix_color is None:
                df_scatter_samp[scatter_matrix_vars[i]].plot.kde(ax=ax, secondary_y=True)
            else:
                df_scatter_samp.groupby(scatter_matrix_color)[scatter_matrix_vars[i]].plot.kde(ax=ax, secondary_y=True)
        else:
            ax.scatter(x_s, y_s, c=c_s)
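            if i > 0:
                ax.set_yticks([])
            else:
                # The y data of this subplot comes from the variable of row j
                ax.set_ylabel(scatter_matrix_vars[j])
        if j < len(scatter_matrix_vars) - 1:
            ax.set_xticks([])
        else:
            # The x data of this subplot comes from the variable of column i
            ax.set_xlabel(scatter_matrix_vars[i])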
# align X axis
for i in range(0, len(scatter_matrix_vars)):
    for j in range(0, len(scatter_matrix_vars)):
        axs[j, i].set_xlim(axs[i, i].get_xlim())
plt.show()   
