## Pearson Correlation Coefficient 

The Pearson correlation coefficient, often denoted as r, is a statistical measure that quantifies the strength and direction of the linear relationship between two continuous variables. It ranges from -1 to +1, where:

- A value of +1 indicates a perfect positive linear correlation
- A value of -1 indicates a perfect negative linear correlation
- A value of 0 indicates no linear correlation
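
The three cases above can be checked numerically. The following is a minimal sketch using `np.corrcoef` on made-up toy values (the array `x` and the three relationships are illustrative assumptions, not part of the ratings data). The last case shows that r can be zero even when y is fully determined by x, as long as the dependence is not linear.

```python
import numpy as np

# Toy values chosen for illustration only; symmetric around zero
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r between the two inputs
r_positive = np.corrcoef(x, 2 * x + 1)[0, 1]  # increasing straight line
r_negative = np.corrcoef(x, -3 * x)[0, 1]     # decreasing straight line
r_parabola = np.corrcoef(x, x ** 2)[0, 1]     # y depends on x, but not linearly

print(f"y = 2x + 1: r = {r_positive:.3f}")
print(f"y = -3x:    r = {r_negative:.3f}")
print(f"y = x^2:    r = {abs(r_parabola):.3f}")
```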

The Pearson correlation coefficient is widely used in statistics to measure the degree of linear dependence between two variables. It's important to note that it only captures linear relationships and may not accurately represent non-linear associations between variables. Let's calculate an example by using some data.

The data below represents a group of users and their ratings of items. There are 5 users and 6 items. Each item's list holds the scores given by users 1 through 5, in that order. Entries shown as NaN are missing ratings, which we want to predict using the Pearson correlation coefficient. In particular, we want to predict the ratings user-3 will give item-1 and item-6. We will do this by finding the two users most similar to user-3 and using their raw ratings of those items to predict what user-3 would rate them. Let's jump into it.


In [97]:
import pandas as pd
import numpy as np
import scipy.stats

# Ratings given by 5 users (rows) to 6 items (columns); np.nan marks a missing rating
data = {
    'item-1': [7, 6, np.nan, 1, 1],
    'item-2': [6, 7, 3, 2, np.nan],
    'item-3': [7, np.nan, 3, 2, 1],
    'item-4': [4, 4, 1, 3, 2],
    'item-5': [5, 3, 1, 3, 3],
    'item-6': [4, 4, np.nan, 4, 3]
}
data_frame = pd.DataFrame(data, index=['user-1', 'user-2', 'user-3', 'user-4', 'user-5']).astype(float)

print("Data table prior to Pearson Correlation Coefficient calculation") 
print(data_frame)

Data table prior to Pearson Correlation Coefficient calculation
        item-1  item-2  item-3  item-4  item-5  item-6
user-1     7.0     6.0     7.0     4.0     5.0     4.0
user-2     6.0     7.0     NaN     4.0     3.0     4.0
user-3     NaN     3.0     3.0     1.0     1.0     NaN
user-4     1.0     2.0     2.0     3.0     3.0     4.0
user-5     1.0     NaN     1.0     2.0     3.0     3.0


## Calculate the correlation

The coefficient is calculated as the covariance of the two variables divided by the product of their standard deviations:

r = cov(X,Y) / (σX * σY)

Where:
- cov(X,Y) is the covariance between variables X and Y
- σX is the standard deviation of X
- σY is the standard deviation of Y

Alternatively, it can be expressed as:

r = Σ((x - μx)(y - μy)) / (n * σx * σy)

Where:
- x and y are individual sample points
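- μx and μy are the sample means of X and Y
- σx and σy are standard deviations computed with a 1/n divisor (the population form); if the sample standard deviation with a 1/(n - 1) divisor is used instead, replace n with n - 1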
- n is the sample size
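
To see that the two expressions above agree, here is a small sketch that evaluates both by hand and compares them with `np.corrcoef`. It uses the four items that user-1 and user-3 both rated in the table above (item-2 through item-5). The variable names are illustrative, and the standard deviations use NumPy's default 1/n divisor so that they match the n in the second formula.

```python
import numpy as np

# Ratings that user-1 and user-3 both gave (item-2 to item-5)
x = np.array([6.0, 7.0, 4.0, 5.0])  # user-1
y = np.array([3.0, 3.0, 1.0, 1.0])  # user-3

n = len(x)
mu_x, mu_y = x.mean(), y.mean()
sigma_x, sigma_y = x.std(), y.std()  # np.std defaults to ddof=0, i.e. a 1/n divisor

# First form: covariance divided by the product of standard deviations
cov_xy = np.sum((x - mu_x) * (y - mu_y)) / n
r_from_cov = cov_xy / (sigma_x * sigma_y)

# Second form: sum of products of deviations over n * sigma_x * sigma_y
r_from_sum = np.sum((x - mu_x) * (y - mu_y)) / (n * sigma_x * sigma_y)

# Reference value from NumPy
r_numpy = np.corrcoef(x, y)[0, 1]

print(f"{r_from_cov:.6f} {r_from_sum:.6f} {r_numpy:.6f}")
```

All three values should agree up to floating-point rounding, because the normalization constant cancels as long as the covariance and the standard deviations use the same divisor.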



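## Use the built-in method

pandas provides DataFrame.corr (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html), which computes the pairwise correlation between columns and excludes missing values. Our users are rows, so we transpose the data frame before calling it. The method parameter accepts:
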
* pearson : standard correlation coefficient (default)
* kendall : Kendall Tau correlation coefficient
* spearman : Spearman rank correlation

In [104]:
# corr() correlates columns, so transpose to put the users in columns
correlation = data_frame.T.corr()
print("Pearson coefficients for Users: ")
print(f"{correlation}")


Pearson coefficients for Users: 
          user-1    user-2    user-3    user-4    user-5
user-1  1.000000  0.723478  0.894427 -0.899229 -0.824226
user-2  0.723478  1.000000  0.970725 -0.720577 -0.899229
user-3  0.894427  0.970725  1.000000 -1.000000 -0.866025
user-4 -0.899229 -0.720577 -1.000000  1.000000  0.877058
user-5 -0.824226 -0.899229 -0.866025  0.877058  1.000000
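
Because the table has missing ratings, each entry above is computed only from the items that both users rated. The sketch below reproduces the user-3 / user-4 entry by hand under that assumption, and shows the `min_periods` parameter of `DataFrame.corr`, which can be used to discard pairs with too few co-rated items. The threshold of 5 is an arbitrary illustrative choice.

```python
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({
    'item-1': [7, 6, np.nan, 1, 1],
    'item-2': [6, 7, 3, 2, np.nan],
    'item-3': [7, np.nan, 3, 2, 1],
    'item-4': [4, 4, 1, 3, 2],
    'item-5': [5, 3, 1, 3, 3],
    'item-6': [4, 4, np.nan, 4, 3],
}, index=['user-1', 'user-2', 'user-3', 'user-4', 'user-5']).astype(float)

# Keep only the items that both user-3 and user-4 rated
pair = data_frame.loc[['user-3', 'user-4']].T.dropna()
r_pair = pair['user-3'].corr(pair['user-4'])

# The same entry taken from the full correlation table
r_table = data_frame.T.corr().loc['user-3', 'user-4']

print(f"co-rated items: {list(pair.index)}")
print(f"by hand: {r_pair:.6f}, from table: {r_table:.6f}")

# Require at least 5 co-rated items; pairs with fewer become NaN
strict = data_frame.T.corr(min_periods=5)
print(strict.loc['user-3', 'user-4'])
```

The perfect -1.0 between user-3 and user-4 rests on only four shared items, so a coefficient this extreme should be read with some caution.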


## Interpret the data

How do we predict the item-1 and item-6 scores for user-3? We use the correlation table to find the top-k users closest to user-3. For this example we use the top 2, which based on the table above are user-1 and user-2, with correlation scores of 0.89 and 0.97 respectively. The prediction is the average of the raw ratings of user-1 and user-2, weighted by their Pearson correlation with user-3. Let's calculate the predicted scores for user-3 on item-1 and item-6 with this information.

In [105]:
# Similarity weights of user-3's two nearest neighbors
w1 = correlation.loc['user-3', 'user-1']
w2 = correlation.loc['user-3', 'user-2']
# Similarity-weighted average of the neighbors' raw ratings
prediction_user_3_item_1 = (w1 * data_frame.loc['user-1', 'item-1'] + w2 * data_frame.loc['user-2', 'item-1']) / (w1 + w2)
prediction_user_3_item_6 = (w1 * data_frame.loc['user-1', 'item-6'] + w2 * data_frame.loc['user-2', 'item-6']) / (w1 + w2)
print(f"Prediction User 3 Item 1: {prediction_user_3_item_1}")
print(f"Prediction User 3 Item 6: {prediction_user_3_item_6}")

Prediction User 3 Item 1: 6.479546404117822
Prediction User 3 Item 6: 4.0
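
The calculation above hard-codes the two neighbors. As a rough sketch, it could be generalized to any user, item, and k along the following lines. The function name `predict_weighted` and its parameters are hypothetical, and the sketch assumes the selected neighbors are positively correlated with the target user so that the weights do not cancel out.

```python
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({
    'item-1': [7, 6, np.nan, 1, 1],
    'item-2': [6, 7, 3, 2, np.nan],
    'item-3': [7, np.nan, 3, 2, 1],
    'item-4': [4, 4, 1, 3, 2],
    'item-5': [5, 3, 1, 3, 3],
    'item-6': [4, 4, np.nan, 4, 3],
}, index=['user-1', 'user-2', 'user-3', 'user-4', 'user-5']).astype(float)
correlation = data_frame.T.corr()


def predict_weighted(ratings, similarity, user, item, k=2):
    """Hypothetical helper: similarity-weighted average of the top-k neighbors' raw ratings."""
    # Candidate neighbors: every other user who has rated the item
    candidates = ratings[item].dropna().index.drop(user, errors='ignore')
    # Keep the k candidates most similar to the target user
    weights = similarity.loc[user, candidates].nlargest(k)
    neighbor_ratings = ratings.loc[weights.index, item]
    # Assumes the weights are positive, so the denominator is not close to zero
    return (weights * neighbor_ratings).sum() / weights.sum()


pred_item_1 = predict_weighted(data_frame, correlation, 'user-3', 'item-1')
pred_item_6 = predict_weighted(data_frame, correlation, 'user-3', 'item-6')
print(f"Prediction User 3 Item 1: {pred_item_1:.2f}")
print(f"Prediction User 3 Item 6: {pred_item_6:.2f}")
```

Restricting the candidates to users who actually rated the item avoids multiplying a weight by a missing rating.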


## Answer

There we have it: the predicted scores user-3 would give to item-1 and item-6, based on the top-2 closest neighbors. User-3 would give item-1 a rating of 6.48 and item-6 a rating of 4.0. However, can we do better? We can use the mean-centered equation to calculate a predicted rating. It accounts for each user's mean rating: the neighbors' ratings are expressed as deviations from their own means, and the weighted average of those deviations is added to user-3's mean. Let's look at that below.

In [106]:
# Mean rating of each user; missing ratings are skipped
user_means = data_frame.mean(axis=1)
# Subtract each user's mean from that user's ratings
means_data_frame = data_frame.sub(user_means, axis=0)

print("Means data frame")
print(f"{means_data_frame}")
# Similarity weights of user-3's two nearest neighbors
w1 = correlation.loc['user-3', 'user-1']
w2 = correlation.loc['user-3', 'user-2']
# user-3's mean plus the similarity-weighted average of the neighbors' mean-centered ratings
mean_prediction_user_3_item_1 = user_means['user-3'] + (w1 * means_data_frame.loc['user-1', 'item-1'] + w2 * means_data_frame.loc['user-2', 'item-1']) / (w1 + w2)
mean_prediction_user_3_item_6 = user_means['user-3'] + (w1 * means_data_frame.loc['user-1', 'item-6'] + w2 * means_data_frame.loc['user-2', 'item-6']) / (w1 + w2)
print(f"Prediction User 3 Item 1: {mean_prediction_user_3_item_1}")
print(f"Prediction User 3 Item 6: {mean_prediction_user_3_item_6}")

Means data frame
        item-1  item-2  item-3  item-4  item-5  item-6
user-1     1.5     0.5     1.5    -1.5    -0.5    -1.5
user-2     1.2     2.2     NaN    -0.8    -1.8    -0.8
user-3     NaN     1.0     1.0    -1.0    -1.0     NaN
user-4    -1.5    -0.5    -0.5     0.5     0.5     1.5
user-5    -1.0     NaN    -1.0     0.0     1.0     1.0
Prediction User 3 Item 1: 3.3438639212353465
Prediction User 3 Item 6: 0.864317517117525
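
The mean-centered calculation above can also be written as a reusable function. This is only a sketch: `predict_mean_centered` and its parameters are hypothetical names, and as in the cell above the denominator is the plain sum of the weights, which assumes the chosen neighbors are positively correlated with the target user.

```python
import numpy as np
import pandas as pd

data_frame = pd.DataFrame({
    'item-1': [7, 6, np.nan, 1, 1],
    'item-2': [6, 7, 3, 2, np.nan],
    'item-3': [7, np.nan, 3, 2, 1],
    'item-4': [4, 4, 1, 3, 2],
    'item-5': [5, 3, 1, 3, 3],
    'item-6': [4, 4, np.nan, 4, 3],
}, index=['user-1', 'user-2', 'user-3', 'user-4', 'user-5']).astype(float)
correlation = data_frame.T.corr()


def predict_mean_centered(ratings, similarity, user, item, k=2):
    """Hypothetical helper: target user's mean plus weighted neighbor deviations."""
    # Mean rating of each user over the items that user has rated
    user_means = ratings.mean(axis=1)
    # Candidate neighbors: every other user who has rated the item
    candidates = ratings[item].dropna().index.drop(user, errors='ignore')
    # Keep the k candidates most similar to the target user
    weights = similarity.loc[user, candidates].nlargest(k)
    # How far each neighbor's rating of the item sits from that neighbor's own mean
    deviations = ratings.loc[weights.index, item] - user_means.loc[weights.index]
    # Assumes positive weights, matching the denominator used in the cell above
    return user_means.loc[user] + (weights * deviations).sum() / weights.sum()


mc_item_1 = predict_mean_centered(data_frame, correlation, 'user-3', 'item-1')
mc_item_6 = predict_mean_centered(data_frame, correlation, 'user-3', 'item-6')
print(f"Prediction User 3 Item 1: {mc_item_1:.2f}")
print(f"Prediction User 3 Item 6: {mc_item_6:.2f}")
```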


## Mean-Centered Weighted Average

With the mean-centered predictions, user-3 is expected to give item-1 a score of ~3.34 and item-6 a score of ~0.86. Given our original data, this seems more appropriate, since user-3 doesn't historically give high ratings: the highest rating user-3 has given is 3.