# Correlation and tests of relationship

In this notebook, we briefly show how to visualize and test for correlation between two variables.

First, we import the standard packages.

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats

Then we import some example data: in this case, Google Analytics web data on the daily number of users and the daily number of purchases in the webshop.

In [None]:
webdata = pd.read_excel("GA users and convertions.xlsx")

In [None]:
webdata

## Visualizing correlation between two numeric variables

To visualize the correlation between two numeric variables, we use a scatterplot.

In [None]:
sns.scatterplot(data = webdata, x = "Users", y = "PurchaseCompleted")
plt.title("Visualization of the correlation between users and purchases completed")
plt.savefig('corrplot.png')
plt.show()

## Correlation coefficient

To calculate the (Pearson) correlation coefficient between two numeric variables, we can use either the `.corr` method in pandas or the `pearsonr` function from SciPy. The latter also returns a p-value for the null hypothesis of no correlation.

In [None]:
webdata["Users"].corr(webdata["PurchaseCompleted"])

In [None]:
stats.pearsonr(webdata["Users"], webdata["PurchaseCompleted"])
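To make the difference between the two calls concrete, here is a minimal sketch on synthetic data. The numbers are made up and only reuse the column names from above; they are not taken from the Google Analytics file.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the web data (made-up numbers, not the real file)
rng = np.random.default_rng(0)
users = rng.integers(100, 1000, size=50)
purchases = 0.05 * users + rng.normal(0, 5, size=50)
demo = pd.DataFrame({"Users": users, "PurchaseCompleted": purchases})

# pandas returns only the correlation coefficient
r_pandas = demo["Users"].corr(demo["PurchaseCompleted"])

# SciPy returns the coefficient together with a two-sided p-value
r_scipy, p_value = stats.pearsonr(demo["Users"], demo["PurchaseCompleted"])

print(f"pandas r = {r_pandas:.4f}")
print(f"SciPy  r = {r_scipy:.4f}, p-value = {p_value:.3g}")
```

Both coefficients should agree up to floating-point rounding; the p-value is the extra piece of information `pearsonr` provides.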

## Relationship between a numeric and a categorical variable

To visualize the relationship, one can draw either a histogram for each value of the categorical variable or a boxplot.

We will use the Adult dataset as an example.

In [None]:
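from ucimlrepo import fetch_ucirepo

# Fetch the Adult dataset (id=2) from the UCI Machine Learning Repository
adult_temp = fetch_ucirepo(id=2)

# Combine features and target into one DataFrame; work on a copy so the
# fetched data is left untouched and pandas does not warn about chained assignment
X = adult_temp.data.features
y = adult_temp.data.targets
adult = X.copy()
adult["income"] = y.iloc[:, 0]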

adult.head()

We look at the relationship between *sex* and *hours-per-week*. We can do both histograms and boxplots.

In [None]:
g=sns.FacetGrid(data = adult, row="sex", height = 5)
g.map(sns.histplot, "hours-per-week", bins = 12)
plt.show()

In [None]:
sns.catplot(y="hours-per-week", hue = "sex", data = adult, kind="box", height = 7,
            showmeans=True,
            meanprops={"marker":"X", "markerfacecolor":"white", "markeredgecolor":"black", "markersize": "10"})
plt.show()

To test whether the difference in hours-per-week is significant, we can use the tests from the statistics class, such as Student's t-test or the Mann-Whitney U test.

In [None]:
stats.ttest_ind(adult[adult["sex"]=="Female"]["hours-per-week"], adult[adult["sex"]=="Male"]["hours-per-week"])

In [None]:
stats.mannwhitneyu(adult[adult["sex"]=="Female"]["hours-per-week"], adult[adult["sex"]=="Male"]["hours-per-week"])
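Both functions return a test statistic and a p-value. The sketch below shows how the results might be unpacked and compared against a significance level. It uses synthetic data (the group means and spreads are made up), and the Welch variant (`equal_var=False`) is an addition not used above, included only as an option when the two groups may have different variances.

```python
import numpy as np
from scipy import stats

# Synthetic weekly hours for two groups (made-up values, not the Adult data)
rng = np.random.default_rng(1)
group_a = rng.normal(36, 12, size=300)
group_b = rng.normal(42, 12, size=300)

# Student's t-test: compares means, assumes equal variances
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Welch's t-test: same call, without the equal-variance assumption
w_stat, w_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U test: rank-based, does not assume normality
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

alpha = 0.05  # significance level
for name, p in [("Student t", t_p), ("Welch t", w_p), ("Mann-Whitney U", u_p)]:
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.3g} -> {verdict}")
```

The t-tests compare means, while the Mann-Whitney U test compares ranks, so it is a reasonable choice when the histograms look far from normal.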

## Relationship between two categorical variables

For visualizing the relationship we can use a mosaic plot.

In [None]:
from statsmodels.graphics.mosaicplot import mosaic

mosaic(adult, ["sex", "income"])
plt.show()

It seems that there is something strange with the *income* variable...

In [None]:
adult["income"].groupby(adult["income"]).count()

It seems that some values are "<=50K." where they should be "<=50K", and ">50K." where they should be ">50K". We can fix this by removing the "." from the *income* column.

In [None]:
# regex=False makes "." a literal dot; as a regex it would match every character
adult['income'] = adult['income'].str.replace('.', '', regex=False)
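One detail worth checking with `str.replace` is whether the pattern is treated as a literal string or as a regular expression, since the default for `regex` has differed between pandas versions. The sketch below illustrates the difference on a tiny hypothetical Series standing in for the *income* column.

```python
import pandas as pd

# Tiny stand-in for the income column (hypothetical values)
income = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."])

# regex=False treats "." as a literal dot, so only the trailing dots are removed
cleaned = income.str.replace(".", "", regex=False)
print(cleaned.tolist())

# regex=True treats "." as "any character", so every string is emptied
emptied = income.str.replace(".", "", regex=True)
print(emptied.tolist())
```

Passing `regex=False` explicitly keeps the behavior the same regardless of the installed pandas version.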

In [None]:
adult["income"].groupby(adult["income"]).count()

In [None]:
mosaic(adult, ["sex", "income"])
plt.show()

We can get the number of individuals in each combined group with the pandas `crosstab` function.

In [None]:
pd.crosstab(adult["sex"], adult["income"])

Since there are plenty of individuals in all combined groups, we can use the chi-squared test to test whether there is a statistically significant relationship between the two categorical variables. (If a combined group had fewer than 5 individuals, one should use Fisher's exact test instead.)

In [None]:
stats.chi2_contingency(pd.crosstab(adult["sex"], adult["income"]))
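`chi2_contingency` returns the test statistic, the p-value, the degrees of freedom, and the table of expected counts. The sketch below unpacks these on a hypothetical 2x2 table (the counts are made up) and shows one way the small-count check might be done. It applies the threshold of 5 to the expected counts, which is how the rule of thumb is usually stated; that choice is an assumption of this sketch.

```python
import pandas as pd
from scipy import stats

# Hypothetical 2x2 table of counts (made-up numbers, not the Adult data)
table = pd.DataFrame(
    [[90, 10], [60, 40]],
    index=["group A", "group B"],
    columns=["<=50K", ">50K"],
)

# Unpack the statistic, p-value, degrees of freedom and expected counts
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")
print("smallest expected count:", expected.min())

# Rule of thumb: fall back to Fisher's exact test (2x2 tables) when counts are small
if expected.min() < 5:
    odds_ratio, p_value = stats.fisher_exact(table.to_numpy())
    print(f"Fisher's exact test p-value = {p_value:.3g}")
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, so the statistic can differ slightly from a hand calculation without the correction.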

As the p-value is well below our significance level of 0.05, we reject the null hypothesis that there is no relationship between the two variables. Thus, we have statistically significant support for there being a difference in income across genders.