# Day 7: Pearson Correlation Coefficient I 

In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation, is a measure of the linear correlation between two variables X and Y. Its value lies between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

We use the following formula to calculate the Pearson correlation coefficient: 

$$\rho_{x,y} = \frac{\sum (x_i - \mu_x) \cdot (y_i - \mu_y)}{n \cdot \sigma_x \cdot \sigma_y}$$

#### Note: $n_X$ should equal $n_Y$
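
To make the symbols explicit, here is a short derivation. It takes $\mu$ to be the mean and $\sigma$ to be the population standard deviation (dividing by $n$, not $n - 1$), which is how the code in this notebook computes them:

$$\mu_x = \frac{1}{n} \sum x_i, \qquad \sigma_x = \sqrt{\frac{1}{n} \sum (x_i - \mu_x)^2}$$

Under that convention the factor $n$ in the denominator cancels against the two $\frac{1}{\sqrt{n}}$ factors inside $\sigma_x$ and $\sigma_y$, so the coefficient can equivalently be written as:

$$\rho_{x,y} = \frac{\sum (x_i - \mu_x) \cdot (y_i - \mu_y)}{\sqrt{\sum (x_i - \mu_x)^2 \cdot \sum (y_i - \mu_y)^2}}$$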

### Exercise

Print the value of the Pearson correlation coefficient, rounded to three decimal places.

Sample Input

10

10 9.8 8 7.8 7.7 7 6 5 4 2 

200 44 32 24 22 17 15 12 8 4

The first line contains an integer, n, denoting the size of data sets X and Y.

The second line contains n space-separated real numbers (scaled to at most one decimal place), defining data set X.

The third line contains n space-separated real numbers (scaled to at most one decimal place), defining data set Y.


In [1]:
import math
# n = int(input())
# X = [float(i) for i in input().split()]
# Y = [float(i) for i in input().split()]

n = 10
X = [10, 9.8, 8, 7.8, 7.7, 7, 6, 5, 4, 2]
Y = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]

# Calculating the mean and standard deviation of X and Y
mean_x = sum(X) / n
mean_y = sum(Y) / n

var_x = sum([(i - mean_x) ** 2 for i in X]) / n
std_x = math.sqrt(var_x)

var_y = sum([(i - mean_y) ** 2 for i in Y]) / n
std_y = math.sqrt(var_y)

lst = []
for i in range(n):
    lst.append((X[i] - mean_x) * (Y[i] - mean_y))

ro = sum(lst) / (n * std_x * std_y)
print(round(ro, 3))

0.612


In [30]:
# Creating function Pearson
import math

def Pearson(X, Y):
    'Return the linear correlation between X and Y, rounded to 3 decimal places.'
    # Both data sets must have the same number of values.
    if len(X) != len(Y):
        raise ValueError('the number of values in X is different from in Y')
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(Y) / n
    # Population variance and standard deviation (divide by n).
    var_x = sum((i - mean_x) ** 2 for i in X) / n
    std_x = math.sqrt(var_x)
    var_y = sum((i - mean_y) ** 2 for i in Y) / n
    std_y = math.sqrt(var_y)
    # Sum of the products of paired deviations from the means.
    dev_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
    ro = dev_sum / (n * std_x * std_y)
    return round(ro, 3)


X = [10, 9.8, 8, 7.8, 7.7, 7, 6, 5, 4, 2]
Y = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]

Pearson(X, Y)

0.612
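
The formula divides by $\sigma_x$ and $\sigma_y$, so it is undefined when either data set is constant. The sketch below shows one way to guard against that case and against mismatched sizes while returning the unrounded value. The name `pearson_checked` and the choice to raise `ValueError` are assumptions made for illustration; they are not part of the exercise.

```python
import math

def pearson_checked(xs, ys):
    """Pearson's r, unrounded, with explicit input checks (hypothetical helper)."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("X and Y must be non-empty and of equal size")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population standard deviations (divide by n), matching the formula.
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        # Exact-zero test; float data may call for a small tolerance instead.
        raise ValueError("correlation is undefined for a constant data set")
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * std_x * std_y)

X = [10, 9.8, 8, 7.8, 7.7, 7, 6, 5, 4, 2]
Y = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]
print(round(pearson_checked(X, Y), 3))
```

Returning the unrounded value and rounding only when printing keeps the function usable as a building block in further calculations.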

# Day 7: Spearman's Rank Correlation Coefficient

https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

The idea is to compute the ranks $Rank_X$ and $Rank_Y$ of the values in X and Y, then pass those ranks to the Pearson function. The result is Spearman's rank correlation coefficient $r_s$:

$$r_s = Pearson(Rank_X, Rank_Y)$$
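
A minimal sketch of this general route is shown here. It assumes that tied values share the average of the positions they occupy, which is a common convention but is not stated in this notebook. The helper names `average_ranks`, `pearson`, and `spearman` are hypothetical, and the data is the exercise's sample input.

```python
import math

def average_ranks(data):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value is equal.
        while end + 1 < len(order) and data[order[end + 1]] == data[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(xs, ys):
    """Unrounded Pearson's r; assumes equal sizes and non-constant inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman's r_s as Pearson's r applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

X = [10, 9.8, 8, 7.8, 7.7, 1.7, 6, 5, 1.4, 2]
Y = [200, 44, 32, 24, 22, 17, 15, 12, 8, 4]
print(round(spearman(X, Y), 3))
print(average_ranks([1, 2, 2, 3]))
```

When neither data set contains duplicates, this gives the same value as the special-case formula, so the shortcut is only an optimization.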

Special Case: X and Y Don't Contain Duplicates

$$r_s = 1 - \frac{6 \cdot \sum d_i^2}{n \cdot (n^2 - 1)}$$

Here, $d_i$ is the difference between the respective values of $Rank_X$ and $Rank_Y$. 

### Exercise

https://www.hackerrank.com/challenges/s10-spearman-rank-correlation-coefficient/problem

Sample Input (all values in both X and Y are unique, so we can use the special-case formula)

10

10 9.8 8 7.8 7.7 1.7 6 5 1.4 2 

200 44 32 24 22 17 15 12 8 4


In [29]:
n = int(input())
X = [float(i) for i in input().split()]
Y = [float(i) for i in input().split()]


def Rank(data):
    'Return the 1-based rank of each value in data (assumes no duplicate values).'
    rank_of = {value: rank for rank, value in enumerate(sorted(data), start=1)}
    return [rank_of[value] for value in data]


# Using the formula for the special case (no duplicate values in X and Y)
rank_x, rank_y = Rank(X), Rank(Y)  # compute each ranking once
d = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]
result = 1 - 6 * sum(d) / (n * (n ** 2 - 1))
    
print(round(result, 3))

10
10 9.8 8 7.8 7.7 1.7 6 5 1.4 2 
200 44 32 24 22 17 15 12 8 4
0.903
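
As a hand check of this result, the special-case formula can be worked through for the sample input, with rank 1 assigned to the smallest value:

$$Rank_X = (10, 9, 8, 7, 6, 2, 5, 4, 1, 3), \qquad Rank_Y = (10, 9, 8, 7, 6, 5, 4, 3, 2, 1)$$

$$d = (0, 0, 0, 0, 0, -3, 1, 1, -1, 2), \qquad \sum d_i^2 = 9 + 1 + 1 + 1 + 4 = 16$$

$$r_s = 1 - \frac{6 \cdot 16}{10 \cdot (10^2 - 1)} = 1 - \frac{96}{990} \approx 0.903$$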
