## Pearson correlation
Here, I correct the notes in [Lecture l18](https://github.com/iit-cs579/main/blob/master/lec/l18/l18.pdf), which miscalculate the Pearson correlation between two movies.

Recall that the goal is to compute the correlation between the ratings for two movies $x$ and $y$ using the formula given on slide 24:

$$
sim(x,y) = \frac{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)(r_{ys} - 
\bar{r}_y)}{\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \bar{r}_y)^2}}
$$
where
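- $r_{xs}$ is the rating given by user $s$ to movie $x$ (and $r_{ys}$ is that user's rating of movie $y$)
- $\bar{r}_x$ is the mean of all ratings given to movie $x$; $\bar{r}_y$ is the mean of all ratings given to movie $y$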
- $S_{xy}$ is the set of users who have rated both movie $x$ and movie $y$.

The error on the slides is in the denominator. The examples on page 30 compute each norm over all of a movie's ratings; instead, only the ratings from users in $S_{xy}$ should be used.

In [8]:
# In this example, we'll compute the similarity of
# rows 1 and 3 from page 30, where 0 means "no rating".
import numpy as np
import math
m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])

In [9]:
# find indices of overlapping ratings.
overlap = [i for i in range(len(m1)) if m1[i] != 0 and m3[i] != 0]
overlap

[1, 3, 11]

In [10]:
# Compute each movie's mean rating, excluding zeros (unrated entries).
m1_mean = np.mean(m1[m1 != 0])
m3_mean = np.mean(m3[m3 != 0])
print('m1 mean=%.3f m3 mean=%.3f' % (m1_mean, m3_mean))

m1 mean=3.600 m3 mean=3.000


Compute numerator:
$$
\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)(r_{ys} - 
\bar{r}_y)
$$

In [11]:
numerator = ((m1[overlap] - m1_mean) * (m3[overlap] - m3_mean)).sum()
numerator

4.7999999999999998

Compute denominator:

$$
\sqrt{\sum_{s \in S_{xy}}(r_{xs} - \bar{r}_x)^2}\sqrt{\sum_{s \in S_{xy}}(r_{ys} - \bar{r}_y)^2}
$$

In [14]:
denominator = (  math.sqrt(((m1[overlap] - m1_mean)**2).sum()) 
               * math.sqrt(((m3[overlap] - m3_mean)**2).sum())  )
denominator

7.299315036357863

In [15]:
numerator / denominator

0.65759594922142917

So, 0.658 should be the correct answer for the similarity between movies m1 and m3 (**not** 0.41, as listed on the slides).
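
For comparison, the slide's figure can be reproduced by taking each norm over every rating a movie received rather than only the ratings in $S_{xy}$. This is a minimal sketch under that assumption about how the slide's value was obtained; the names `rated1`, `rated3`, `slide_denominator`, and `slide_value` are introduced here for illustration only.

```python
import math
import numpy as np

m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])

# All ratings each movie received (0 means "no rating").
rated1 = m1[m1 != 0]
rated3 = m3[m3 != 0]

# The numerator is the same as in the corrected calculation:
# it sums only over users who rated both movies.
overlap = (m1 != 0) & (m3 != 0)
numerator = ((m1[overlap] - rated1.mean()) * (m3[overlap] - rated3.mean())).sum()

# Assumed slide-style denominator: norms taken over ALL ratings of each movie.
slide_denominator = (math.sqrt(((rated1 - rated1.mean()) ** 2).sum())
                     * math.sqrt(((rated3 - rated3.mean()) ** 2).sum()))

slide_value = numerator / slide_denominator
print('slide-style similarity=%.3f' % slide_value)  # prints 0.414
```

The larger denominator (about 11.59 instead of 7.30) is what pulls the similarity down from 0.658 to roughly 0.41.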

You can repeat the process to get values for other movies.
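
One way to repeat the process is to wrap the steps in a helper. The function below is a sketch, not part of the lecture material: the name `pearson_sim` is hypothetical, and returning 0 when two movies share no raters (or when a norm is zero) is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def pearson_sim(x, y):
    """Pearson similarity of two rating vectors, where 0 means "no rating"."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # S_xy: users who rated both movies.
    overlap = (x != 0) & (y != 0)
    if not overlap.any():
        return 0.0  # assumption: no shared raters -> similarity 0
    # Means use every rating each movie received; deviations use only S_xy.
    x_dev = x[overlap] - x[x != 0].mean()
    y_dev = y[overlap] - y[y != 0].mean()
    denominator = np.sqrt((x_dev ** 2).sum()) * np.sqrt((y_dev ** 2).sum())
    if denominator == 0:
        return 0.0  # assumption: undefined correlation reported as 0
    return float((x_dev * y_dev).sum() / denominator)

m1 = np.array([0, 4, 0, 5, 0, 0, 5, 0, 0, 3, 0, 1])
m3 = np.array([0, 5, 3, 4, 0, 3, 0, 2, 1, 0, 4, 2])
print('sim(m1, m3)=%.3f' % pearson_sim(m1, m3))  # prints sim(m1, m3)=0.658
```

Calling `pearson_sim` on any other pair of rows from page 30 gives the corresponding corrected similarity.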