## Pearson correlation
Here, I give a worked example based on the notes in [Lecture l18](https://github.com/iit-cs579/main/blob/master/lec/l18/l18.pdf), computing the Pearson correlation between two movies.

Recall that the goal is to compute the correlation between the ratings for two movies $x$ and $y$ using the formula given on slide 24:

$$
sim(x,y) = \frac{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)(r_{ys} - 
\bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \bar{r}_y)^2}}
$$
where
- $r_{xs}$ is the rating given by user $s$ to movie $x$
- $\bar{r}_x$ is the mean rating of movie $x$ and $\bar{r}_y$ is the mean rating of movie $y$ (each computed over that movie's non-zero ratings only)
- $S_{xy}$ is the set of users who have rated both movie $x$ and movie $y$.

In [1]:
# In this example, we'll compute the similarity of
# rows 1 and 3 from slide 30, where 0 means "no rating".
import numpy as np
import math
m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])

In [18]:
from scipy.stats import pearsonr
pearsonr(m1, m3)[0]
# Wrong: pearsonr treats every 0 as an actual rating of 0 rather than as a missing value.

0.14285714285714285
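Restricting the standard coefficient to the co-rated entries is a tempting fix, but it does not match the slide formula either. The sketch below uses `np.corrcoef`, which computes the same standard Pearson coefficient as `pearsonr`, on the overlapping entries only. It centers each vector on its mean over the overlap, whereas the formula above uses each movie's mean over all of its non-zero ratings. The names `both_rated` and `r_overlap` are introduced here for illustration.

```python
import numpy as np

m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])

# Boolean mask of users who rated both movies (0 means "no rating").
both_rated = (m1 != 0) & (m3 != 0)

# Standard Pearson on the co-rated entries: each vector is centered
# on the mean of its co-rated entries only.
r_overlap = np.corrcoef(m1[both_rated], m3[both_rated])[0, 1]
print('standard Pearson on overlap only: %.3f' % r_overlap)

# The two candidate means for m1 differ, which is why the results differ.
print('m1 mean over overlap:  %.3f' % m1[both_rated].mean())
print('m1 mean over non-zero: %.3f' % m1[m1 != 0].mean())
```

On these two rows the overlap-only coefficient comes out near 0.84, so the choice of mean matters.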

In [2]:
# find indices of overlapping ratings.
# S_xy
overlap = [i for i in range(len(m1)) if m1[i] != 0 and m3[i] != 0]
overlap

[1, 3, 11]

In [3]:
# Compute means over the non-zero ratings (i.e., excluding zeros)
m1_mean = np.mean(m1[np.where(m1 != 0)])
m3_mean = np.mean(m3[np.where(m3 != 0)])
print('m1 mean=%.3f m3 mean=%.3f' % (m1_mean, m3_mean))

m1 mean=3.600 m3 mean=3.000


Compute numerator:
$$
\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)(r_{ys} - 
\bar{r}_y)
$$

In [4]:
numerator = ((m1[overlap] - m1_mean) * (m3[overlap] - m3_mean)).sum()
numerator

4.8

Compute denominator:

$$
\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \bar{r}_y)^2}
$$

In [5]:
denominator = (  math.sqrt(((m1[overlap] - m1_mean)**2).sum()) 
               * math.sqrt(((m3[overlap] - m3_mean)**2).sum())  )
denominator

7.299315036357863

In [6]:
numerator / denominator

0.6575959492214292

So, 0.658 is the similarity between movies m1 and m3.
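As a check, the same arithmetic can be written out by hand. The three users in $S_{xy}$ (indices 1, 3, and 11) gave ratings $(4, 5)$, $(5, 4)$, and $(1, 2)$ to m1 and m3 respectively, with $\bar{r}_x = 3.6$ and $\bar{r}_y = 3.0$:

$$
\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)(r_{ys} - \bar{r}_y) = (0.4)(2) + (1.4)(1) + (-2.6)(-1) = 0.8 + 1.4 + 2.6 = 4.8
$$

$$
\sqrt{0.4^2 + 1.4^2 + (-2.6)^2}\,\sqrt{2^2 + 1^2 + (-1)^2} = \sqrt{8.88}\,\sqrt{6} \approx 7.299
$$

$$
sim(x,y) \approx \frac{4.8}{7.299} \approx 0.658
$$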

The same process works for any other pair of movies; the function below wraps the steps above so they can be reused.

In [7]:
def corr(r1, r2):
    """Pearson similarity of rating vectors r1 and r2, where 0 means "no rating"."""
    # Mean of each movie over its own non-zero ratings.
    mean1 = np.mean(r1[np.where(r1 != 0)])
    mean2 = np.mean(r2[np.where(r2 != 0)])
    # S_xy: indices of users who rated both movies.
    overlap = [i for i in range(len(r1)) if r1[i] != 0 and r2[i] != 0]
    numerator = ((r1[overlap] - mean1) * (r2[overlap] - mean2)).sum()
    denominator = (  math.sqrt(((r1[overlap] - mean1)**2).sum())
                   * math.sqrt(((r2[overlap] - mean2)**2).sum())  )
    return numerator / denominator

corr(m1, m3)

0.6575959492214292
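Note that `corr` divides by zero when two movies share no raters, or when every co-rated rating of one movie equals that movie's mean. A minimal sketch of a variant that uses boolean masks and returns `nan` in those cases follows; the name `corr_masked` and the choice of `nan` as the fallback are assumptions made here, not part of the lecture notes.

```python
import numpy as np

def corr_masked(r1, r2):
    """Hypothetical variant of corr: same formula, but nan when undefined."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    # Users who rated both movies (0 means "no rating").
    both = (r1 != 0) & (r2 != 0)
    if not both.any():
        return float('nan')  # no co-rated users, similarity undefined
    # Deviations from each movie's mean over its own non-zero ratings.
    d1 = r1[both] - r1[r1 != 0].mean()
    d2 = r2[both] - r2[r2 != 0].mean()
    denominator = np.sqrt((d1 ** 2).sum()) * np.sqrt((d2 ** 2).sum())
    if denominator == 0:
        return float('nan')  # all co-rated deviations are zero for one movie
    return float((d1 * d2).sum() / denominator)

m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])
print('%.3f' % corr_masked(m1, m3))
print(corr_masked([1, 0], [0, 2]))  # no overlap
```

Whether `nan`, `0`, or skipping the pair is the right fallback depends on how the similarities are used downstream.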

In [8]:
m = np.array([
    [0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1],
    [3, 1, 2, 0, 0, 4, 0, 0, 4, 5, 0, 0],
    [0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2],
    [0, 2, 0, 0, 4, 0, 0, 5, 0, 4, 2, 0],
    [5, 2, 0, 0, 0, 0, 2, 4, 3, 4, 0, 0],
    [0, 4, 0, 0, 2, 0, 0, 3, 0, 3, 0, 1]
    ])

for i in range(len(m)):    
    print('r(m1, m%d)=%.2f' % (i+1, corr(m[0], m[i])))

r(m1, m1)=1.00
r(m1, m2)=-0.96
r(m1, m3)=0.66
r(m1, m4)=-0.84
r(m1, m5)=-0.89
r(m1, m6)=0.77
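The loop above compares m1 against every movie. The same idea extends to all pairs; here is a rough sketch that fills a 6x6 similarity matrix. The names `corr_pair` and `sim` are introduced here for illustration, and the function restates `corr` in mask form so that the block runs on its own. It assumes every pair has a non-zero denominator, which holds for this particular matrix.

```python
import numpy as np

m = np.array([
    [0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1],
    [3, 1, 2, 0, 0, 4, 0, 0, 4, 5, 0, 0],
    [0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2],
    [0, 2, 0, 0, 4, 0, 0, 5, 0, 4, 2, 0],
    [5, 2, 0, 0, 0, 0, 2, 4, 3, 4, 0, 0],
    [0, 4, 0, 0, 2, 0, 0, 3, 0, 3, 0, 1]
    ])

def corr_pair(r1, r2):
    # Same formula as corr: means over each row's non-zero ratings,
    # sums over the users who rated both movies.
    both = (r1 != 0) & (r2 != 0)
    d1 = r1[both] - r1[r1 != 0].mean()
    d2 = r2[both] - r2[r2 != 0].mean()
    return (d1 * d2).sum() / (np.sqrt((d1 ** 2).sum()) * np.sqrt((d2 ** 2).sum()))

n = len(m)
sim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim[i, j] = corr_pair(m[i], m[j])

# Row 0 of sim reproduces the r(m1, m*) values printed above.
print(np.round(sim, 2))
```

The matrix is symmetric with ones on the diagonal, so in practice only the upper triangle needs to be computed.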
