# Cosine Similarity

In this section you will construct another similarity metric, this time based on the cosine.

Remember trigonometry (or better, linear algebra!) from your mathematics class? This metric is the cosine of the angle between two vectors: it is 1 when the vectors point in the same direction and 0 when they are orthogonal. It might look difficult, but it is rather simple.
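
To make the formula concrete before touching the dataset, here is a minimal sketch with two toy vectors. The values of `u` and `v` are invented for illustration and do not come from the ratings file.

```python
import numpy as np

# Toy vectors chosen for illustration only (not taken from the ratings data).
u = np.array([3.0, 4.0])
v = np.array([4.0, 3.0])

# cos(theta) = (u . v) / (|u| * |v|)
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# The angle itself, in degrees, recovered with arccos.
theta_deg = np.degrees(np.arccos(cos_theta))

print(cos_theta)   # 24 / 25 = 0.96
print(theta_deg)
```

The dot product is 3·4 + 4·3 = 24 and both vectors have length 5, so the cosine is 24 / 25.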

## 1. Load the dataset

In [None]:
import pandas as pd
df_books_ratings = pd.read_csv('data/BX-Book-Ratings-Subset.csv', sep=';')
df_books_ratings

## 2. Explore the dimensions (shape) of users' rating data

Here are the IDs of two users in our ratings dataset. What are the dimensions (shape) of their respective ratings? How many books did each of these users rate?

In [None]:
df = df_books_ratings

a = df[df['User-ID'] == 277427]
b = df[df['User-ID'] == 277203]

print(a.shape)
print(b.shape)

# What is the problem here? The shapes differ, but the vectors need to be the same size.

If we are to build vectors from the users' ratings and apply trigonometric operations to them, can you see a problem here? Are the vectors of the same dimension? If not, why is this a 'problem'?
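
One way to see the problem is to try a dot product on two rating lists of different lengths. The sketch below uses made-up ratings (`ratings_a` and `ratings_b` are hypothetical, not real users) and shows that numpy refuses to combine them.

```python
import numpy as np

# Hypothetical rating lists of different lengths (illustrative values only).
ratings_a = np.array([8, 5, 10])
ratings_b = np.array([7, 9])

try:
    np.dot(ratings_a, ratings_b)
    mismatch_detected = False
except ValueError as err:
    # numpy cannot pair up the elements of a 3-vector and a 2-vector
    mismatch_detected = True
    print(f"dot product failed: {err}")
```

Even if the two lists happened to have the same length, position *i* would still have to refer to the same book for both users, which is why a shared ordering of ISBNs is needed.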

## 3. Vectorize ratings

Can you vectorize the above users' ratings so they have the same dimension? To help you do this, here is a sorted list of all the ISBNs in our dataset. How can you use this list of all the ISBNs to create a (large!) vector for `a_user_id`?

In [None]:
import numpy as np

ISBNS_array = df['ISBN'].unique()
ISBNS_array = np.sort(ISBNS_array).tolist()

# print(ISBNS_array[:10])

# Select a user

a_user_id = 277427

df_a_user = df[df['User-ID'] == a_user_id]

# print(df_a_user)

a_user_ISBN_rating = dict(zip(df_a_user['ISBN'], df_a_user['Book-Rating']))

# print(a_user_ISBN_rating)

a_user_ISBNS = a_user_ISBN_rating.keys()

# print(a_user_ISBNS)

a_user_ISBN_rating_vector = [0 if v not in a_user_ISBNS else a_user_ISBN_rating[v] for v in ISBNS_array]

# len(a_user_ISBN_rating_vector)

print(a_user_ISBN_rating_vector)

# big (and sparse) vector!!


In [None]:
# We can of course use pandas to do this vectorization

# create empty df with ISBNS columns
dfv = pd.DataFrame(columns=ISBNS_array)

# concat this empty df with the dict for a user's ratings indexed by the user's ID
dfv = pd.concat([dfv, pd.DataFrame(a_user_ISBN_rating, index=[a_user_id])])

# make sure we compute with 0 (ignore the warning)
dfv.fillna(0, inplace=True)

# convert to numpy array
dfv.loc[a_user_id].to_numpy()


## 4. Helper functions

Below are two functions that (1) retrieve user ratings from a given dataset and (2) vectorize these ratings according to a certain dimension (all ISBNs).

In [None]:
def get_user_ratings(user_id, df_subset):
    
    df_user = df_subset[df_subset['User-ID'] == user_id]
        
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

In [None]:
def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    
    user_ISBNS = user_ISBN_rating_dict.keys()
    
    return [0 if v not in user_ISBNS else user_ISBN_rating_dict[v] for v in all_ISBNS_array]    
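
As a quick check of the two helpers, the sketch below runs them on a tiny hand-made DataFrame. The `toy` data and `all_isbns` are invented for illustration, and `create_ratings_vector` is written here with `dict.get`, an equivalent variant of the list comprehension above.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the ratings file.
toy = pd.DataFrame({
    'User-ID': [1, 1, 2],
    'ISBN': ['B', 'C', 'A'],
    'Book-Rating': [7, 9, 5],
})
all_isbns = sorted(toy['ISBN'].unique())   # ['A', 'B', 'C']

def get_user_ratings(user_id, df_subset):
    df_user = df_subset[df_subset['User-ID'] == user_id]
    return dict(zip(df_user['ISBN'], df_user['Book-Rating']))

def create_ratings_vector(user_ISBN_rating_dict, all_ISBNS_array):
    # dict.get returns 0 for books the user has not rated
    return [user_ISBN_rating_dict.get(isbn, 0) for isbn in all_ISBNS_array]

vec_1 = create_ratings_vector(get_user_ratings(1, toy), all_isbns)
vec_2 = create_ratings_vector(get_user_ratings(2, toy), all_isbns)
print(vec_1)  # [0, 7, 9]
print(vec_2)  # [5, 0, 0]
```

Both vectors now have one slot per ISBN, in the same order, so they can be compared element by element.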

## 5. Cosine distance function

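Can you finish writing the function below, which calculates the cosine of the angle between two vectors that have the same dimension? As you can see, you can use the numpy `dot` and `norm` functions to translate the given formula into code. Note that, despite its name, `cosine_distance` returns the cosine similarity: 1 for vectors pointing in the same direction and 0 for users with no rated book in common.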

In [None]:
from numpy import dot
from numpy.linalg import norm

def cosine_distance(ratings_vector_user_a, ratings_vector_user_b):
    
    # cos(theta) = (a . b) / (|a| |b|)
    #   a . b    -> dot(a, b)
    #   |a| |b|  -> norm(a) * norm(b)
    
    return dot(ratings_vector_user_a, ratings_vector_user_b) / (norm(ratings_vector_user_a) * norm(ratings_vector_user_b))
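
One edge case worth keeping in mind: if a user's vector contains only zeros, its norm is 0 and the division above produces `nan`. The sketch below is a hypothetical variant (`safe_cosine` is not part of the notebook) that returns 0.0 in that case. Treating an all-zero vector as "no similarity" is an assumption, not the only reasonable choice.

```python
import numpy as np

def safe_cosine(vec_a, vec_b):
    """Hypothetical variant: returns 0.0 when either vector has zero length."""
    a = np.asarray(vec_a, dtype=float)
    b = np.asarray(vec_b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

same_direction = safe_cosine([1, 2, 0], [2, 4, 0])   # parallel vectors
no_overlap = safe_cosine([5, 0, 0], [0, 0, 9])       # no book in common
all_zero = safe_cosine([0, 0, 0], [3, 1, 4])         # zero vector

print(same_direction, no_overlap, all_zero)
```

Parallel vectors give a value of (approximately) 1, and vectors with no rated book in common give 0.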
    

## 6. Calculate distances

Here is the ID of a user in our dataset (you can of course choose another one!).

Can you calculate this user's cosine distance from all the other users in the dataset?

In [None]:
all_user_ids = df['User-ID'].unique().tolist()

a_user_ISBN_ratings = get_user_ratings(a_user_id, df)

a_user_ratings_vector = create_ratings_vector(a_user_ISBN_ratings, ISBNS_array)

for u_id in all_user_ids:
    
    if u_id == a_user_id:
        continue
    
    user_ISBN_ratings = get_user_ratings(u_id, df)
    
    user_ratings_vector = create_ratings_vector(user_ISBN_ratings, ISBNS_array)
    
    d = cosine_distance(a_user_ratings_vector, user_ratings_vector)
    
    if d > 0.0:
        print(f'{a_user_id} - {u_id} : d={d}')


## 7. Function calculating distances

Considering the code above, can you make a function that will take as input a given user's ID and calculate its distance from all other users in our dataset?
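
A possible shape for such a function is sketched below on toy data. The name `similarities_for_user`, the `toy` DataFrame and the choice to return a sorted pandas Series are all assumptions made for illustration. The function inlines the vectorization step so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def similarities_for_user(user_id, df_ratings_long):
    """Hypothetical helper: cosine similarity of user_id to every other user."""
    isbns = sorted(df_ratings_long['ISBN'].unique())

    def vector(uid):
        rows = df_ratings_long[df_ratings_long['User-ID'] == uid]
        ratings = dict(zip(rows['ISBN'], rows['Book-Rating']))
        return np.array([ratings.get(i, 0) for i in isbns], dtype=float)

    target = vector(user_id)
    result = {}
    for other_id in df_ratings_long['User-ID'].unique():
        if other_id == user_id:
            continue
        other = vector(other_id)
        denom = np.linalg.norm(target) * np.linalg.norm(other)
        # guard against all-zero vectors (assumed to mean "no similarity")
        result[int(other_id)] = float(target @ other / denom) if denom else 0.0
    return pd.Series(result).sort_values(ascending=False)

# Toy data with the same column names as the ratings file (invented values).
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['B', 'C', 'A', 'B'],
    'Book-Rating': [7, 9, 5, 8],
})
toy_result = similarities_for_user(1, toy)
print(toy_result)
```

In the toy data, user 3 shares book `B` with user 1 and therefore ranks first, while user 2 shares nothing and gets 0.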

In [None]:
# rather than following the instructions here (!) I will calculate all distances between users and record them in a pandas dataframe

# to do so, I will first create a df containing all users' book ratings and work with this df to calculate distances

# create empty df with ISBNS columns
df_ratings = pd.DataFrame(columns=ISBNS_array)

# input ratings from all users in the empty df (⚠️ this will take a rather long time)
for u_id in all_user_ids:

    # retrieve user ratings
    user_ISBN_ratings = get_user_ratings(u_id, df)

    # add ratings to the df 
    df_ratings = pd.concat([df_ratings, pd.DataFrame(user_ISBN_ratings, index=[u_id])])
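
If the loop above is too slow, a user-by-ISBN table can also be built in a single call with `pivot_table`. This is a sketch on invented toy data, not the notebook's method. It assumes that averaging is an acceptable way to handle a user who rated the same ISBN more than once.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the ratings file.
toy = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['B', 'C', 'A', 'B'],
    'Book-Rating': [7, 9, 5, 8],
})

# One row per user, one column per ISBN; unrated books become 0.
toy_ratings = toy.pivot_table(index='User-ID', columns='ISBN',
                              values='Book-Rating', aggfunc='mean',
                              fill_value=0)
print(toy_ratings)
```

Because `fill_value=0` is applied up front, the resulting table contains no NaN values to convert later.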

In [None]:
# now let's use this df to calculate the distances and record them in a new df

# since the computation below can take a very long time, let's subset the number of users
user_ids_subset = all_user_ids[:150]  # you can change the number here if you want to process more

df_distances = pd.DataFrame(index=user_ids_subset, columns=user_ids_subset)

# for all users (⚠️ this will take a very long time)
for u_id in user_ids_subset:

    # retrieve the user's ratings vector from df_ratings 
    user_ratings_vect = df_ratings.loc[u_id].to_numpy()

    # for all users (⚠️ this will take a very long time)
    for r_id in user_ids_subset:

        # check if we are calculating distance with ourselves
        if u_id == r_id:
            df_distances.loc[u_id, r_id] = 1.0
            continue

        # check if we already calculated the distance (r_id -> u_id)
        if not pd.isna(df_distances.loc[r_id, u_id]):
            df_distances.loc[u_id, r_id] = df_distances.loc[r_id, u_id]
            continue

        # retrieve the other user's ratings vector from df_ratings 
        other_ratings_vect = df_ratings.loc[r_id].to_numpy()      

        # convert all nan to 0
        user_ratings_vect[pd.isnull(user_ratings_vect)] = 0
        other_ratings_vect[pd.isnull(other_ratings_vect)] = 0

        # calculate distance
        d = cosine_distance(user_ratings_vect, other_ratings_vect)

        # record result in new df
        df_distances.loc[u_id, r_id] = d
                    
df_distances
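
The nested loops can also be replaced by a single matrix product: once every row is scaled to unit length, the dot product of two rows is their cosine. The sketch below shows the idea on a small invented matrix (`toy_ratings` is hypothetical). All-zero rows are left as zeros, which is an assumption about how such users should be treated.

```python
import numpy as np
import pandas as pd

# Hypothetical user x ISBN matrix (rows: users, columns: books, 0 = unrated).
toy_ratings = pd.DataFrame(
    [[0, 7, 9], [5, 0, 0], [0, 8, 0]],
    index=[1, 2, 3], columns=['A', 'B', 'C'], dtype=float,
)

m = toy_ratings.to_numpy()
norms = np.linalg.norm(m, axis=1)
norms[norms == 0] = 1.0          # avoid dividing by zero for all-zero rows
unit = m / norms[:, np.newaxis]  # scale every row to length 1

# Dot products of unit vectors are the pairwise cosines.
toy_sim = pd.DataFrame(unit @ unit.T,
                       index=toy_ratings.index, columns=toy_ratings.index)
print(toy_sim.round(3))
```

The result is symmetric with 1.0 on the diagonal, matching what the loop version fills in by hand.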