# Cosine Similarity

Cosine similarity measures how alike two vectors are by computing the cosine of the angle between them. The result lies between -1 and 1, where:
- 1 means the vectors point in the same direction (their magnitudes may still differ)
- 0 means the vectors are perpendicular (orthogonal), so they share no directional component
- -1 means the vectors point in opposite directions

## Mathematical Definition

The cosine similarity between two vectors A and B is calculated as:

cos(θ) = (A · B) / (||A|| ||B||)

Where:
- A · B is the dot product of vectors A and B
- ||A|| and ||B|| are the magnitudes (Euclidean norms) of vectors A and B
- θ is the angle between vectors A and B
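
The formula translates almost directly into NumPy. The following is a minimal sketch, not a library API: the helper name `cosine_sim` and the sample vectors are made up for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between 1-D vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)  # ||A|| ||B||
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(a, b) / norm_product)             # (A · B) / (||A|| ||B||)

# Made-up vectors covering the three reference cases
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # close to 1
orthogonal = cosine_sim([1, 0], [0, 1])            # 0
opposite = cosine_sim([1, 2], [-1, -2])            # close to -1
```

The zero-vector check is a design choice of this sketch; the formula itself is simply undefined when either magnitude is 0.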

## In Terms of Ratings

For recommendation systems with user ratings:
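- Vectors A and B hold the ratings of two items (item-to-item) or of two users (user-to-user); the example below does both
- The dot product (A · B) is the sum of the products of paired ratings
- Each magnitude is the square root of the sum of squared ratings
- Ideally only mutual ratings (positions rated in both vectors) are considered. The example below instead fills missing ratings with 0, which leaves the dot product unchanged but still counts each vector's unmatched ratings in its magnitude
- Values typically fall between 0 and 1 because raw ratings are usually non-negative; mean-centered ratings can produce negative similarities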
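
To make the "mutual ratings" point concrete, here is a small sketch that restricts both vectors to the positions rated in both before applying the formula, and compares the result with zero-filling. The helper `cosine_sim_mutual` and the variable names are hypothetical, and missing ratings are assumed to be encoded as NaN.

```python
import numpy as np

def cosine_sim_mutual(a, b):
    """Cosine similarity over positions where both vectors have a rating."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mutual = ~np.isnan(a) & ~np.isnan(b)   # True where both ratings exist
    if not mutual.any():
        return float("nan")                # no overlap, similarity undefined
    a_m, b_m = a[mutual], b[mutual]
    denom = np.linalg.norm(a_m) * np.linalg.norm(b_m)
    return float(np.dot(a_m, b_m) / denom) if denom else float("nan")

ratings_a = [7, 6, 7, 4, 5, 4]
ratings_b = [np.nan, 3, 3, 1, 1, np.nan]   # two ratings missing

mutual_only = cosine_sim_mutual(ratings_a, ratings_b)
# Zero-filling keeps the same dot product but a larger norm for ratings_a
zero_filled = cosine_sim_mutual(ratings_a, np.nan_to_num(ratings_b, nan=0.0))
```

With these vectors the mutual-only value comes out higher than the zero-filled one, because zero-filling keeps the two unmatched ratings of `ratings_a` in its magnitude.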

## Key Properties

1. Scale Independence: Cosine similarity measures the angle between vectors, not their magnitudes
2. Easy to calculate and interpret
3. Effective with sparse data (common in recommendation systems)
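4. Partially handles the "ratings scale" problem: multiplying a user's ratings by a constant does not change the similarity, but a constant offset (a user who rates everything a point higher) does, which is why mean centering is often applied first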
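
A quick sketch of what scale independence does and does not cover, using made-up ratings. The helper `cosine_sim` is hypothetical; the point is that scaling a vector leaves the similarity at 1, while adding a constant offset lowers it until both vectors are mean centered.

```python
import numpy as np

def cosine_sim(a, b):
    # Hypothetical helper: cosine of the angle between two 1-D vectors
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

base = np.array([1.0, 2.0, 3.0, 4.0])  # made-up ratings
scaled = 2.0 * base                    # same taste, stretched scale
shifted = base + 3.0                   # same taste, everything rated 3 higher

sim_scaled = cosine_sim(base, scaled)      # stays at 1
sim_shifted = cosine_sim(base, shifted)    # drops below 1
# Subtracting each vector's own mean removes the offset again
sim_centered = cosine_sim(base - base.mean(), shifted - shifted.mean())
```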

## Example

Let's calculate the cosine similarity between 5 users based on their ratings of 6 different items, so that we can predict the ratings user 3 would give items 1 and 6.


In [78]:
import pandas as pd
import numpy as np

# Ratings given by 5 users (rows) to 6 items (columns); np.nan marks a missing rating
data = {
    'item-1': [7, 6, np.nan, 1, 1],
    'item-2': [6, 7, 3, 2, np.nan],
    'item-3': [7, np.nan, 3, 2, 1],
    'item-4': [4, 4, 1, 3, 2],
    'item-5': [5, 3, 1, 3, 3],
    'item-6': [4, 4, np.nan, 4, 3]
}
data_frame = pd.DataFrame(data, index=['user-1', 'user-2', 'user-3', 'user-4', 'user-5'])

print("Data table prior to cosine similarity calculation")
print(data_frame)

Data table prior to cosine similarity calculation
        item-1  item-2  item-3  item-4  item-5  item-6
user-1     7.0     6.0     7.0       4       5     4.0
user-2     6.0     7.0     NaN       4       3     4.0
user-3     NaN     3.0     3.0       1       1     NaN
user-4     1.0     2.0     2.0       3       3     4.0
user-5     1.0     NaN     1.0       2       3     3.0


In [81]:
from sklearn.metrics.pairwise import cosine_similarity

# cosine_similarity does not accept NaN, so missing ratings are filled with 0 first
cosine_similarity_raw = pd.DataFrame(cosine_similarity(data_frame.fillna(0)), index=data_frame.index, columns=data_frame.index)
print("Cosine similarity matrix")
print(cosine_similarity_raw)

Cosine similarity matrix
          user-1    user-2    user-3    user-4    user-5
user-1  1.000000  0.844441  0.776622  0.838615  0.723725
user-2  0.844441  1.000000  0.557773  0.774382  0.636469
user-3  0.776622  0.557773  1.000000  0.613795  0.365148
user-4  0.838615  0.774382  0.613795  1.000000  0.933859
user-5  0.723725  0.636469  0.365148  0.933859  1.000000


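Now with mean centering: subtract each user's mean rating from that user's ratings, then fill the remaining gaps with 0 (which equals the user's mean after centering) before computing the similarity.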

In [82]:
means_data_frame = data_frame.copy().astype(float)
user_means = means_data_frame.mean(axis=1)
for user in means_data_frame.index:
    means_data_frame.loc[user] = means_data_frame.loc[user].subtract(user_means[user])

print("Mean centered data")
print(f"{means_data_frame}")

cosine_similarity_mean = pd.DataFrame(cosine_similarity(means_data_frame.fillna(0)), index=data_frame.index, columns=data_frame.index)
print("Cosine similarity matrix")
print(f"{cosine_similarity_mean}")

Mean centered data
        item-1  item-2  item-3  item-4  item-5  item-6
user-1     1.5     0.5     1.5    -1.5    -0.5    -1.5
user-2     1.2     2.2     NaN    -0.8    -1.8    -0.8
user-3     NaN     1.0     1.0    -1.0    -1.0     NaN
user-4    -1.5    -0.5    -0.5     0.5     0.5     1.5
user-5    -1.0     NaN    -1.0     0.0     1.0     1.0
Cosine similarity matrix
          user-1    user-2    user-3    user-4    user-5
user-1  1.000000  0.612094  0.648886 -0.899229 -0.811107
user-2  0.612094  1.000000  0.730297 -0.700649 -0.578152
user-3  0.648886  0.730297  1.000000 -0.426401 -0.500000
user-4 -0.899229 -0.700649 -0.426401  1.000000  0.852803
user-5 -0.811107 -0.578152 -0.500000  0.852803  1.000000
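
The row-by-row centering loop can also be written as a single vectorized step. This is a sketch on a small made-up table (the names `ratings`, `user_means` and `centered` are illustrative), relying on `DataFrame.mean` skipping NaN by default and `DataFrame.sub` with `axis=0` aligning the means to the rows.

```python
import numpy as np
import pandas as pd

# Made-up ratings: 2 users, 3 items, one missing value
ratings = pd.DataFrame(
    {"item-1": [5.0, 2.0], "item-2": [3.0, 4.0], "item-3": [np.nan, 6.0]},
    index=["user-a", "user-b"],
)

user_means = ratings.mean(axis=1)           # per-user mean, NaN ignored
centered = ratings.sub(user_means, axis=0)  # subtract each row's own mean
```

Missing ratings stay NaN after centering, so they still need to be filled (or masked) before the similarity is computed.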


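Given these similarities, let's predict user 3's ratings of items 1 and 6 from both the raw and the mean-centered data, using user 1 and user 2 as neighbors and weighting their ratings by their similarity to user 3.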

In [86]:
# Raw ratings: similarity-weighted average of the ratings from neighbors user-1 and user-2
sim_1 = cosine_similarity_raw.loc['user-3', 'user-1']
sim_2 = cosine_similarity_raw.loc['user-3', 'user-2']
prediction_user_3_item_1 = (sim_1 * data_frame.loc['user-1', 'item-1'] + sim_2 * data_frame.loc['user-2', 'item-1']) / (sim_1 + sim_2)
prediction_user_3_item_6 = (sim_1 * data_frame.loc['user-1', 'item-6'] + sim_2 * data_frame.loc['user-2', 'item-6']) / (sim_1 + sim_2)
print(f"Prediction User 3 Item 1: {prediction_user_3_item_1}")
print(f"Prediction User 3 Item 6: {prediction_user_3_item_6}")

# Mean-centered ratings: user-3's mean plus the weighted average of the neighbors' deviations
mean_sim_1 = cosine_similarity_mean.loc['user-3', 'user-1']
mean_sim_2 = cosine_similarity_mean.loc['user-3', 'user-2']
mean_prediction_user_3_item_1 = user_means['user-3'] + (mean_sim_1 * means_data_frame.loc['user-1', 'item-1'] + mean_sim_2 * means_data_frame.loc['user-2', 'item-1']) / (mean_sim_1 + mean_sim_2)
mean_prediction_user_3_item_6 = user_means['user-3'] + (mean_sim_1 * means_data_frame.loc['user-1', 'item-6'] + mean_sim_2 * means_data_frame.loc['user-2', 'item-6']) / (mean_sim_1 + mean_sim_2)
print(f"Mean Prediction User 3 Item 1: {mean_prediction_user_3_item_1}")
print(f"Mean Prediction User 3 Item 6: {mean_prediction_user_3_item_6}")

Prediction User 3 Item 1: 6.582002852403109
Prediction User 3 Item 6: 4.0
Mean Prediction User 3 Item 1: 3.34114572621006
Mean Prediction User 3 Item 6: 0.8706599721765269
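
The four predictions above share one pattern: a baseline plus a similarity-weighted average of the neighbors' ratings. A minimal sketch of that pattern follows; the function `predict_rating` is hypothetical, and dividing by the sum of absolute similarities is an assumption of this sketch (a common variant) that matches the calculation above whenever the neighbor similarities are positive. The similarity values are copied, rounded, from the matrices above.

```python
import numpy as np

def predict_rating(neighbor_sims, neighbor_ratings, baseline=0.0):
    """Baseline plus the similarity-weighted average of neighbor ratings."""
    sims = np.asarray(neighbor_sims, dtype=float)
    ratings = np.asarray(neighbor_ratings, dtype=float)
    # Assumption: normalize by the sum of absolute similarities
    return float(baseline + np.dot(sims, ratings) / np.sum(np.abs(sims)))

# Raw data: user-3 / item-1, neighbors user-1 and user-2
raw_item_1 = predict_rating([0.776622, 0.557773], [7, 6])

# Mean-centered data: neighbors' deviations, baseline is user-3's mean (2.0)
centered_item_1 = predict_rating([0.648886, 0.730297], [1.5, 1.2], baseline=2.0)
```

Passing the neighbors as lists makes it straightforward to try more than two neighbors, or a different neighbor selection, without rewriting the formula.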


Now let's calculate item-to-item similarity for fun, transposing the table so that each item becomes a vector of user ratings.

In [87]:
similarity_matrix = pd.DataFrame(cosine_similarity(data_frame.fillna(0).T), index=data_frame.columns, columns=data_frame.columns)
print("Cosine similarity matrix")
print(f"{similarity_matrix}")

Cosine similarity matrix
          item-1    item-2    item-3    item-4    item-5    item-6
item-1  1.000000  0.931378  0.702382  0.901024  0.868869  0.837828
item-2  0.931378  1.000000  0.699970  0.908527  0.832531  0.802788
item-3  0.702382  0.699970  1.000000  0.724462  0.813373  0.650814
item-4  0.901024  0.908527  0.724462  1.000000  0.972130  0.976458
item-5  0.868869  0.832531  0.813373  0.972130  1.000000  0.964274
item-6  0.837828  0.802788  0.650814  0.976458  0.964274  1.000000
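
One way to read an item-to-item matrix like this is to ask which other item is closest to each item. The following NumPy-only sketch rebuilds the same zero-filled ratings and picks the most similar neighbor per item; the names `ratings`, `items` and `nearest` are illustrative, not part of the notebook.

```python
import numpy as np

# Same ratings as the table above: rows are users, columns are items
ratings = np.array([
    [7, 6, 7, 4, 5, 4],
    [6, 7, np.nan, 4, 3, 4],
    [np.nan, 3, 3, 1, 1, np.nan],
    [1, 2, 2, 3, 3, 4],
    [1, np.nan, 1, 2, 3, 3],
])
items = [f"item-{i}" for i in range(1, 7)]

item_vectors = np.nan_to_num(ratings, nan=0.0).T  # one row per item
norms = np.linalg.norm(item_vectors, axis=1)
sim = item_vectors @ item_vectors.T / np.outer(norms, norms)

np.fill_diagonal(sim, -np.inf)                    # ignore self-similarity
nearest = {items[i]: items[j] for i, j in enumerate(sim.argmax(axis=1))}
```

For example, `nearest["item-4"]` is `"item-6"`, matching the largest off-diagonal value in the item-4 row above.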


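Now item-to-item with mean centering. Each rating is still centered on its user's mean (not the item's mean), an approach commonly called adjusted cosine similarity.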

In [88]:
# Recompute the user-mean-centered ratings (same steps as in the user-to-user cell)
means_data_frame = data_frame.copy().astype(float)
user_means = means_data_frame.mean(axis=1)
for user in means_data_frame.index:
    means_data_frame.loc[user] = means_data_frame.loc[user].subtract(user_means[user])

print("Mean centered data")
print(f"{means_data_frame}")

similarity_matrix = pd.DataFrame(cosine_similarity(means_data_frame.fillna(0).T), index=data_frame.columns, columns=data_frame.columns)
print("Cosine similarity matrix")
print(f"{similarity_matrix}")

Mean centered data
        item-1  item-2  item-3  item-4  item-5  item-6
user-1     1.5     0.5     1.5    -1.5    -0.5    -1.5
user-2     1.2     2.2     NaN    -0.8    -1.8    -0.8
user-3     NaN     1.0     1.0    -1.0    -1.0     NaN
user-4    -1.5    -0.5    -0.5     0.5     0.5     1.5
user-5    -1.0     NaN    -1.0     0.0     1.0     1.0
Cosine similarity matrix
          item-1    item-2    item-3    item-4    item-5    item-6
item-1  1.000000  0.624131  0.715771 -0.738780 -0.738330 -0.989620
item-2  0.624131  1.000000  0.374437 -0.733910 -0.905091 -0.522503
item-3  0.715771  0.374437  1.000000 -0.810889 -0.590281 -0.760974
item-4 -0.738780 -0.733910 -0.810889  1.000000  0.705671  0.721966
item-5 -0.738330 -0.905091 -0.590281  0.705671  1.000000  0.663676
item-6 -0.989620 -0.522503 -0.760974  0.721966  0.663676  1.000000
