# Relationships

Various tests for relationships between variables can be found in the **scipy.stats** module of the **SciPy** package, which we will import under the short name **ss**.

In [72]:
import numpy as np
import pandas as pd
import scipy.stats as ss

## Pearson's Correlation Coefficient

This measure of relatedness between two variables can be used when the variables are both **normally distributed** (which implies that they are both numeric). 

Assuming that both the latitude and longitude variables in the magnetic pole data are normally distributed, we look at whether there is a correlation between the two using Pearson's Correlation Coefficient.

In [73]:
# Get the pole data
mag_pole_data = pd.read_csv("polar.csv", index_col=0)
print(mag_pole_data)

     lat   long
1  -26.4  324.0
2  -32.2  163.7
3  -73.1   51.9
4  -80.2  140.5
5  -71.1  267.2
6  -58.7   32.0
7  -40.8   28.1
8  -14.9  266.3
9  -66.1  144.3
10  -1.8  256.2
11 -52.1   83.2
12 -77.3  182.1
13 -68.8  110.4
14 -68.4  142.2
15 -29.2  246.3
16 -78.5  222.6
17 -65.4  247.7
18 -49.0   65.6
19 -67.0  282.6
20 -56.7   56.2
21 -80.5  108.4
22 -77.7  266.0
23  -6.9   19.1
24 -59.4  281.7
25  -5.6  107.4
26 -62.6  105.3
27 -74.7  120.2
28 -65.3  286.6
29 -71.6  106.4
30 -23.3   96.5
31 -74.3   90.2
32 -81.0  170.9
33 -12.7  199.4
34 -75.4  118.6
35 -85.9   63.7
36 -84.8   74.9
37  -7.4   93.8
38 -29.8   72.8
39 -85.2  113.2
40 -53.1   51.5
41 -38.3  146.8
42 -72.7  103.1
43 -60.2   33.2
44 -63.4  154.8
45 -17.2   89.9
46 -81.6  295.6
47 -40.4   41.0
48 -53.6   59.1
49 -56.2   35.6
50 -75.1   70.7


In [74]:
# Get the correlation coefficient and p-value (the function returns two values)
r, p = ss.pearsonr(mag_pole_data["lat"], mag_pole_data["long"])
print("r = " + str(r))
print("p = " + str(p))

r = -0.0021438009514379956
p = 0.9882112420590565


**Interpretation:** The calculated Pearson's Correlation Coefficient is very close to zero (-0.002), indicating no linear correlation. The two-tailed p-value is very close to 1 (0.99), meaning that a coefficient with an absolute value at least this large would be obtained in about 99% of cases for entirely unrelated variables with the same parameters as ours.
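To make the coefficient less of a black box, here is a minimal sketch of the textbook formula for Pearson's r, written with the standard library only. The helper name `pearson_r` is hypothetical (it is not part of SciPy), it does not compute a p-value, and the sample uses only the first five rows of the pole data printed above, so its result will differ from the full-data value.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (covariance numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# First five rows of the pole data (illustration only, not the full sample)
lat = [-26.4, -32.2, -73.1, -80.2, -71.1]
lon = [324.0, 163.7, 51.9, 140.5, 267.2]
print("r (first five rows) = " + str(pearson_r(lat, lon)))
```

The value always lies between -1 and 1, with the extremes reached only when the points fall exactly on a straight line.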

## Kendall's Tau

This measure of relatedness between two variables is a **rank** correlation coefficient, meaning that it can be used on variables with values that can be **ranked**, including any numeric or ordinal variables, regardless of their distribution.
     
Let us look at an example of a competition where judges rate competitors based on their performance. Let's say that the competitors are labelled A, B, C, D, E, F and G. We would like to see if there is a correlation between the ratings by two judges J1 and J2.

In [78]:
# Create lists of competitors ordered by rating, one for each judge
J1Ratings = ["A", "C", "B", "F", "D", "E", "G"]
J2Ratings = ["A", "B", "C", "F", "E", "D", "G"]

In [79]:
# Apply the Kendall Tau test to the lists
tau, p = ss.kendalltau(J1Ratings, J2Ratings)
print("tau = " + str(tau))
print("p = " + str(p))

tau = 0.8095238095238096
p = 0.010714285714285714


**Interpretation:** The calculated Kendall's Tau is high (0.81), indicating a positive correlation. The two-tailed p-value is approximately 0.01, meaning that a Tau value at least this extreme would be obtained in about 1% of cases for entirely unrelated sequences of the same length as ours (7). Now it is up to us to decide whether this gives us enough certainty in our particular scenario.
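The Tau value can also be reproduced by hand, which shows what it measures. The sketch below assumes there are no ties and uses a hypothetical helper `kendall_tau` (not a SciPy function): it converts each judge's ordering into a rank per competitor, then counts the pairs of competitors on which the judges agree (concordant) and disagree (discordant).

```python
from itertools import combinations

# Competitors ordered by rating, one list per judge (same lists as above)
j1_order = ["A", "C", "B", "F", "D", "E", "G"]
j2_order = ["A", "B", "C", "F", "E", "D", "G"]

def kendall_tau(order_a, order_b):
    """Kendall's tau for two orderings of the same items (no ties)."""
    rank_a = {item: pos for pos, item in enumerate(order_a)}
    rank_b = {item: pos for pos, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # Positive product: both judges place x and y in the same relative order
        agreement = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if agreement > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = concordant + discordant
    return (concordant - discordant) / n_pairs, concordant, discordant

tau, concordant, discordant = kendall_tau(j1_order, j2_order)
print("concordant = " + str(concordant) + ", discordant = " + str(discordant))
print("tau = " + str(tau))
```

Of the 21 possible pairs, the judges disagree only on (B, C) and (D, E), so Tau is (19 - 2) / 21, which matches the value reported by `ss.kendalltau` up to floating-point rounding.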

## Chi-squared

The chi-squared statistic can measure the relatedness between two categorical variables. It is applied to a table of values, containing instance counts for all possible combinations of values for the two variables. 

### Example with ready data frame

In [80]:
# Numbers of patients with no, partial and full recovery for each of the 4 treatments
none = [20, 32, 8, 52]
partial = [9, 72, 8, 32] 
full = [16, 64, 30, 12] 

skin_treatment_df = pd.DataFrame({'None':none, 'Partial':partial, 'Full':full})
skin_treatment_df.index = ["Injection", "Tablet", "Laser", "Herbal"]

# Print the data frame representing the table of counts for each combination of treatment and 
# level of improvement (e.g. the number of patients that were treated with tablets and improved
# partially was 72).
print(skin_treatment_df)

           None  Partial  Full
Injection    20        9    16
Tablet       32       72    64
Laser         8        8    30
Herbal       52       32    12


In [81]:
# Apply the Chi-squared test to the data
chisq, p, dof, evs = ss.chi2_contingency(skin_treatment_df)
print("chi-squared " + str(chisq))
print("p = " + str(p))
print("degrees of freedom = " + str(dof))
print("expected value table: ")
print(evs)

chi-squared 66.16615112075414
p = 2.492436462625812e-12
degrees of freedom = 6
expected value table: 
[[14.1971831  15.33802817 15.46478873]
 [53.0028169  57.26197183 57.73521127]
 [14.51267606 15.67887324 15.8084507 ]
 [30.28732394 32.72112676 32.9915493 ]]


**Interpretation:** The Chi-squared statistic is 66.17, and for a table with 6 degrees of freedom this equates to a p-value of 2.49 * 10^-12. This p-value indicates that it is practically impossible for a Chi-squared value like ours, or greater, to be obtained from a table of this size when the two variables are entirely unrelated. This means high certainty that there is a relationship between the type of treatment and outcome for the patient.
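As a rough sketch of where these numbers come from, the expected value table and the statistic can be recomputed with plain Python from the same counts. This is the standard Pearson chi-squared calculation without any continuity correction; it is meant as an illustration, not a replacement for `ss.chi2_contingency`, and it does not compute the p-value.

```python
# Observed counts: rows are treatments, columns are None / Partial / Full
observed = [
    [20, 9, 16],   # Injection
    [32, 72, 64],  # Tablet
    [8, 8, 30],    # Laser
    [52, 32, 12],  # Herbal
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum of (observed - expected)^2 / expected over every cell
chisq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print("chi-squared " + str(chisq))
print("degrees of freedom = " + str(dof))
```

The largest contributions come from cells where the observed count is far from the expected one, such as Laser/Full (30 observed against about 15.8 expected) and Herbal/None (52 against about 30.3).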

### Example with contingency table

In [66]:
# Read in some data about commuting in Ireland
irl_trans = pd.read_csv("ireland_transport.csv")
irl_trans

Unnamed: 0,Group,Sex,Mode,1986,1991,1996,2001,2006,2011,2016
0,Children at school aged between 5 and 12 years,Male,On foot,129973,107125,79164,57570,55799,59986,64764
1,Children at school aged between 5 and 12 years,Male,Bicycle,14507,14203,7837,3640,3043,4457,4858
2,Children at school aged between 5 and 12 years,Male,"Bus, minibus or coach",52928,55474,51008,38590,34413,31217,29410
3,Children at school aged between 5 and 12 years,Male,"Train, DART or LUAS",325,438,306,262,328,371,383
4,Children at school aged between 5 and 12 years,Male,Motorcycle or scooter,0,0,0,0,0,0,0
5,Children at school aged between 5 and 12 years,Male,Motor car: Driver,0,0,0,0,0,0,0
6,Children at school aged between 5 and 12 years,Male,Motor car: Passenger,66293,73398,84966,110882,125866,150818,165890
7,Children at school aged between 5 and 12 years,Male,Other means (incl. lorry or van),898,1375,1109,1049,797,588,573
8,Children at school aged between 5 and 12 years,Male,Work mainly at or from home,507,639,2410,3489,4394,135,147
9,Children at school aged between 5 and 12 years,Male,Not stated,20344,19591,14406,6421,5606,9140,13640


In [84]:
# Create a contingency table
grp_vs_gen_tab = pd.crosstab(irl_trans["Group"], irl_trans["Sex"], values=irl_trans["2016"], aggfunc="sum")
grp_vs_gen_tab

Sex                                                         Female     Male
Group
Children at school aged between 5 and 12 years              267251   279665
Population aged 15 years and over at work                   913163  1057565
Students at school or college aged 19 years and over         98309    92486
Students at school or college aged between 13 and 18 years  172315   177853

In [85]:
# Apply the Chi-squared test to the contingency table
chisq, p, dof, evs = ss.chi2_contingency(grp_vs_gen_tab)
print("chi-squared " + str(chisq))
print("p = " + str(p))
print("degrees of freedom = " + str(dof))
print("expected value table: ")
print(evs)

chi-squared 3125.270216784954
p = 0.0
degrees of freedom = 3
expected value table: 
[[ 259463.18007119  287452.81992881]
 [ 934935.81086553 1035792.18913447]
 [  90515.32125899  100279.67874101]
 [ 166123.68780428  184044.31219572]]


**Interpretation:** The Chi-squared statistic is high and the p-value is reported as 0 (it is too small to be represented as a floating-point number), indicating that there is a relationship between the sex and group variables. This means that the sexes are not evenly represented across the groups in the sample.
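To see what "not evenly represented" looks like in practice, the share of females in each group can be compared with the overall share. The sketch below hard-codes the 2016 counts from the contingency table above; the variable names are illustrative only.

```python
# 2016 counts from the contingency table above: (female, male) per group
counts = {
    "Children at school aged between 5 and 12 years": (267251, 279665),
    "Population aged 15 years and over at work": (913163, 1057565),
    "Students at school or college aged 19 years and over": (98309, 92486),
    "Students at school or college aged between 13 and 18 years": (172315, 177853),
}

# Share of females within each group
female_share = {group: f / (f + m) for group, (f, m) in counts.items()}

# Share of females across all groups combined
total_female = sum(f for f, _ in counts.values())
total_all = sum(f + m for f, m in counts.values())
overall_share = total_female / total_all

for group, share in female_share.items():
    print(group + ": " + str(round(share, 3)))
print("Overall: " + str(round(overall_share, 3)))
```

The shares differ by only a few percentage points between groups, but with counts in the hundreds of thousands even differences of this size produce a very large Chi-squared statistic.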

### Chi-squared test on data fixed to match expected values

In [87]:
# Make some perfectly distributed data
col = [1111, 2222, 3333, 4444]
tab = pd.DataFrame({'one':col, 'two':col, 'three':col})

In [88]:
# Apply the Chi-squared test to the 'fixed' data
chisq, p, dof, evs = ss.chi2_contingency(tab)
print("chi-squared " + str(chisq))
print("p = " + str(p))
print("degrees of freedom = " + str(dof))
print("expected value table: ")
print(evs)

chi-squared 0.0
p = 1.0
degrees of freedom = 6
expected value table: 
[[1111. 1111. 1111.]
 [2222. 2222. 2222.]
 [3333. 3333. 3333.]
 [4444. 4444. 4444.]]


**Interpretation:** The Chi-squared statistic is 0 and the p-value is 1. The observed counts match the expected counts exactly, so there is no evidence at all of a relationship between the column and row variables. However, a result like this in a real scenario would be 'too good to be true' and may indicate data fixing.
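The zero statistic follows directly from how the expected counts are defined. As a short worked derivation, write $c_i$ for the common value in row $i$, $R_i$ and $C_j$ for the row and column totals, and $N$ for the grand total (this notation is introduced here for illustration only):

$$
R_i = 3c_i, \qquad C_j = \sum_i c_i = \frac{N}{3}, \qquad
E_{ij} = \frac{R_i C_j}{N} = \frac{3c_i \cdot N/3}{N} = c_i = O_{ij}
$$

$$
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 0
$$

Every observed count $O_{ij}$ equals its expected count $E_{ij}$, so each term of the sum vanishes.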