In [14]:
import pandas

In [15]:
salesDF = pandas.read_excel("SampleSalesData.xlsx")

In [16]:
# correlation study with pandas
# note: Pearson correlation is the default; pass method="spearman" or method="kendall" for the others
# (on pandas 2.0+, also pass numeric_only=True so that text columns are skipped)
salesDF.corr()

| | Row ID | Discount | Unit Price | Shipping Cost | Customer ID | Product Base Margin | Postal Code | Profit | Quantity ordered new | Sales | Order ID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Row ID | 1.0 | 0.000159 | 0.006013 | 0.006623 | 0.035232 | -0.001871 | 0.033987 | -0.000298 | -0.586831 | -0.220869 | 0.918587 |
| Discount | 0.000159 | 1.0 | 0.001242 | 0.002907 | -0.00918 | 0.006839 | -0.001531 | -0.032304 | -0.00864 | -0.014077 | 0.00035 |
| Unit Price | 0.006013 | 0.001242 | 1.0 | 0.311096 | -0.017679 | 0.119362 | -0.010841 | 0.245763 | -0.03798 | 0.540051 | 0.00847 |
| Shipping Cost | 0.006623 | 0.002907 | 0.311096 | 1.0 | -0.019277 | 0.417205 | -0.013294 | -0.083272 | -0.030452 | 0.325326 | 0.007488 |
| Customer ID | 0.035232 | -0.00918 | -0.017679 | -0.019277 | 1.0 | -0.01551 | -0.083468 | -0.005771 | -0.023643 | -0.01577 | 0.03866 |
| Product Base Margin | -0.001871 | 0.006839 | 0.119362 | 0.417205 | -0.01551 | 1.0 | -0.004987 | -0.09351 | 0.002285 | 0.156021 | 0.004131 |
| Postal Code | 0.033987 | -0.001531 | -0.010841 | -0.013294 | -0.083468 | -0.004987 | 1.0 | 0.032323 | -0.0145 | -0.017011 | 0.038635 |
| Profit | -0.000298 | -0.032304 | 0.245763 | -0.083272 | -0.005771 | -0.09351 | 0.032323 | 1.0 | 0.105127 | 0.396902 | 0.00372 |
| Quantity ordered new | -0.586831 | -0.00864 | -0.03798 | -0.030452 | -0.023643 | 0.002285 | -0.0145 | 0.105127 | 1.0 | 0.360657 | -0.612492 |
| Sales | -0.220869 | -0.014077 | 0.540051 | 0.325326 | -0.01577 | 0.156021 | -0.017011 | 0.396902 | 0.360657 | 1.0 | -0.228632 |


The correlation matrix above shows how strongly each pair of variables is linearly related.<br>
The closer the coefficient is to +1 or -1, the stronger the relationship; values near 0 indicate little or no linear relationship.

<img src="./images/about correlation.jpg" style="height:400px"/>


In [None]:
salesDF

## However, it isn't that straightforward!
<br>
Pearson's correlation is one of the most commonly used correlation measures, but its significance test assumes both variables are roughly normally distributed, and the coefficient is very sensitive to outliers. We run Pearson's correlation test when we want to know whether there is a linear relationship between the variables.<br/>

Spearman's correlation is often a better choice than Pearson's because it works on ranks: it does not require normally distributed variables, it handles ordinal data*, and it picks up any monotonic relationship rather than only linear ones.<br/>
See https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php and https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient for good explanations of when it is the better choice.<br/>

*Ordinal = ordered categories, e.g. "unhappy", "neutral", "happy"
<br/>

<img src="./images/spearman_pearson.png" style="height: 300px;"/>

Similarly, Kendall's correlation is often preferred over Spearman's because it tends to be more robust to outliers and ties, particularly in small samples. It is more computationally expensive, but that matters less and less as computing power increases.
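
As a rough illustration of the difference between the three methods, the sketch below uses made-up toy data (not the sales dataset) in which `y` always increases with `x`, but not in a straight line. The column names and the exponential shape are assumptions chosen purely for the example.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration): y rises with x, but along a curve.
x = np.arange(1, 21, dtype=float)
toy = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# DataFrame.corr accepts method="pearson" (default), "spearman" or "kendall".
results = {}
for method in ("pearson", "spearman", "kendall"):
    results[method] = toy.corr(method=method).loc["x", "y"]
    print(f"{method:>8}: {results[method]:.3f}")
```

The two rank-based methods report a perfect monotonic relationship, while Pearson's coefficient comes out lower because the relationship is not linear.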

Note: these correlation tests are only hypothesis tests; more on this later.


In [4]:
# let's start off with a chi-square test for categorical data fields
# we import scipy, a library for scientific computing.
# its stats module has many convenient functions for statistical tests
from scipy import stats

The chi-square test checks whether two categorical variables are independent, i.e. whether a subset of the population in the dataset is significantly skewed towards an outcome.<br>

<img src="./images/hqdefault.jpg" style="height:300px">

In [5]:
# let's convert a measurement into a categorical variable:
# True when the order did not lose money (Profit >= 0), False otherwise
salesDF["MadeProfit"] = salesDF["Profit"] >= 0

In [6]:
# what is the global distribution of this category
salesDF.MadeProfit.value_counts()

True     4312
False    4120
Name: MadeProfit, dtype: int64

In [7]:
# not all categories are equal
salesDF["Order Priority"].value_counts()

High             1737
Low              1732
Not Specified    1680
Medium           1667
Critical         1615
Critical            1
Name: Order Priority, dtype: int64
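
The counts above list "Critical" twice, once with a single row. One plausible cause, which the output does not confirm, is stray whitespace around the label. The sketch below shows on a made-up Series how `str.strip()` would merge such labels; whether that applies to `salesDF["Order Priority"]` is an assumption you should check first.

```python
import pandas as pd

# Made-up labels for illustration: the second "Critical" has a trailing space.
priority = pd.Series(["Critical", "Critical ", "High", "Low"])
print(priority.value_counts())

# Strip leading/trailing whitespace so both spellings count as one category.
cleaned = priority.str.strip()
print(cleaned.value_counts())
```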

In [8]:
# let's create a contingency table, similar to the one above, for the chi-square test
madeProfitCrossOrderPriority = pandas.crosstab(salesDF["MadeProfit"], salesDF["Order Priority"])
madeProfitCrossOrderPriority

| MadeProfit \ Order Priority | Critical | Critical | High | Low | Medium | Not Specified |
|---|---|---|---|---|---|---|
| False | 815 | 1 | 833 | 837 | 803 | 831 |
| True | 800 | 0 | 904 | 895 | 864 | 849 |


In [9]:
stats.chi2_contingency(madeProfitCrossOrderPriority)
# remember to check the documentation on the return values

(4.0389006116900203,
 0.54382892665861349,
 5,
 array([[  7.89112903e+02,   4.88614801e-01,   8.48723909e+02,
           8.46280835e+02,   8.14520873e+02,   8.20872865e+02],
        [  8.25887097e+02,   5.11385199e-01,   8.88276091e+02,
           8.85719165e+02,   8.52479127e+02,   8.59127135e+02]]))

In [10]:
chiSquareStatistic, pValue, degreeOfFreedom, _ = stats.chi2_contingency(madeProfitCrossOrderPriority)

In [11]:
chiSquareStatistic, pValue, degreeOfFreedom

(4.0389006116900203, 0.54382892665861349, 5)

What is this p-value?<br>
Recall the distribution curve from the previous exercise.<br>

<img src="./images/2017-09-11-Statistical-Significance-P-Value-1.png"  style="height:300px">

This p-value is derived from the chi-square statistic and the degrees of freedom.<br>
Further reading for those that want to delve deeper https://machinelearningmastery.com/chi-squared-test-for-machine-learning/
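
As a minimal sketch of that derivation, the p-value is the area under the chi-square distribution to the right of the statistic, which `scipy.stats.chi2.sf` (the survival function) computes. The two inputs are copied from the output above; the variable names are just for this example.

```python
from scipy import stats

# Values copied from the chi2_contingency output above.
chi_square_statistic = 4.0389006116900203
degrees_of_freedom = 5  # (2 rows - 1) * (6 columns - 1)

# Survival function = 1 - CDF = probability of a statistic at least this large
# if the two variables were truly independent.
p_value = stats.chi2.sf(chi_square_statistic, degrees_of_freedom)
print(p_value)
```

This should reproduce the p-value of roughly 0.544 reported by `chi2_contingency`.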

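By rule of thumb, the p-value should be less than 0.05 (5%) for a result to count as statistically significant. <br>
You can choose a stricter threshold, e.g. 0.01, which means demanding stronger evidence before accepting the result. <br>
Here the p-value is about 0.54, well above 0.05, so this test gives no evidence that Order Priority and MadeProfit are related. <br>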


Try this out on other categorical data fields.<br>
For the more advanced Python users, try writing a function to find correlated values and filter them out from the rest of the data.
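
One possible shape for such a function is sketched below. It is only a starting point: the name `chi_square_scan`, its parameters, and the toy columns `Segment` and `Region` are all hypothetical and not part of the sales dataset.

```python
import pandas as pd
from scipy import stats


def chi_square_scan(df, target, candidates, alpha=0.05):
    """Chi-square test of independence between `target` and each candidate column."""
    rows = []
    for column in candidates:
        table = pd.crosstab(df[target], df[column])
        statistic, p_value, dof, _ = stats.chi2_contingency(table)
        rows.append({
            "column": column,
            "chi2": statistic,
            "p_value": p_value,
            "dof": dof,
            "significant": bool(p_value < alpha),
        })
    # Most significant columns first.
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "MadeProfit": [True] * 50 + [False] * 50,
    # Segment is strongly tied to MadeProfit ...
    "Segment": ["A"] * 45 + ["B"] * 5 + ["A"] * 5 + ["B"] * 45,
    # ... while Region is split evenly within both groups.
    "Region": ["North", "South"] * 50,
})

result = chi_square_scan(toy, "MadeProfit", ["Segment", "Region"])
print(result)
```

On the real data you would call it with `salesDF`, `"MadeProfit"` and a list of the categorical column names you want to check.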

## Note down the significant categorical columns

Pearson's correlation is used for numerical data (measurements). 


<img src="./images/corsample-1.png"  style="height:500px"/>

In [12]:
stats.pearsonr(salesDF["Quantity ordered new"], salesDF["Unit Price"])

(-0.037980245859996056, 0.00048607948268290975)

In [13]:
stats.pearsonr(salesDF["Quantity ordered new"], salesDF["Discount"])

(-0.008640134291191837, 0.42761151358930671)
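
Note that the first result is statistically significant (p is about 0.0005) even though the correlation itself is tiny (r is about -0.04). The sketch below shows why, using the standard t-distribution formula for testing a Pearson coefficient. It assumes the dataset has 8432 rows (the total of the value counts shown earlier) and no missing values in these two columns.

```python
import math
from scipy import stats

r = -0.037980245859996056  # copied from the pearsonr output above
n = 8432                   # assumption: row count, taken from the value counts

# Under the null hypothesis of no correlation, this statistic follows a
# t-distribution with n - 2 degrees of freedom.
t_statistic = r * math.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t_statistic), df=n - 2)
print(t_statistic, p_value)
```

With thousands of rows, even a very weak correlation can pass the 0.05 threshold, so always read the p-value together with the size of the coefficient.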

Try this out on a few more measurements to see what you get!

There are many more statistical tests, each suited to particular scenarios.<br>
These are just an introduction to correlation studies and statistics, but they give a starting point for finding which variables are significantly related to each other. <br>

## A very important pitfall:

<img src="./images/dreamstime_m_37904189.jpg" style="height:500px">



## Additional Exercise:
1. What are the analytics pitfalls demonstrated here?
2. How would you avoid or verify such pitfalls?
3. What can you do with these insights on correlation?
4. Additional homework: when should you not use the chi-square test?



## Correlation, Covariance, Cointegration

Statistical tests are continuously evolving to better quantify any relationship between variables.<br/>

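Covariance is one such measure: it shows whether two variables move in the same direction, and correlation is covariance rescaled to the range -1 to +1 so that the strength of the relationship can be compared across variables.<br/>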
<img src="./images/correlation_covariance.jpg" style="height:400px"/>
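
To see how covariance and correlation relate in practice, here is a small sketch on made-up numbers (the `price` and `sales` arrays are invented, not taken from the sales dataset). It checks that Pearson's correlation equals the covariance divided by the two standard deviations, and that changing units changes the covariance but not the correlation.

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
price = rng.normal(100, 20, 500)
sales = 3 * price + rng.normal(0, 30, 500)

cov = np.cov(price, sales)[0, 1]
corr = np.corrcoef(price, sales)[0, 1]

# Correlation is covariance divided by both (sample) standard deviations.
manual_corr = cov / (price.std(ddof=1) * sales.std(ddof=1))

# Express price in cents instead: covariance grows 100x, correlation is unchanged.
cov_cents = np.cov(price * 100, sales)[0, 1]
corr_cents = np.corrcoef(price * 100, sales)[0, 1]

print(cov, cov_cents)
print(corr, manual_corr, corr_cents)
```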

As quantitative financial analytics ("quants") becomes more popular, more statistical research goes into studying how variables move together in the stock market. <br/>
Cointegration measures whether a set of time series is tied together over the long run, so that some combination of them stays stable even while each series drifts on its own.

<img src="./images/cointegration.png" style="height:400px"/>
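
The sketch below illustrates the idea on simulated data only; it is not a cointegration test. Two made-up series share a common random-walk trend, and the multiplier 2 that links them is assumed to be known. Formal tests such as Engle-Granger estimate that multiplier from the data and then test the spread for stationarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# A shared random-walk trend: wanders without settling around any level.
trend = np.cumsum(rng.normal(size=2000))

# Two simulated series that both follow the trend, plus their own noise.
series_a = trend + rng.normal(scale=0.5, size=2000)
series_b = 2 * trend + 1 + rng.normal(scale=0.5, size=2000)

# With the right multiplier (assumed known here), the trend cancels out
# and the spread just fluctuates around a constant.
spread = series_b - 2 * series_a

print("std of series_a:", series_a.std())
print("std of series_b:", series_b.std())
print("std of spread:  ", spread.std())
```

Each series drifts widely on its own, yet the spread between them stays in a narrow band, which is the behaviour cointegration describes.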