## Cosine Similarity
- The goal of this notebook is to recommend movies to a user based on their previous reviews, using cosine similarity. In layman's terms, cosine similarity measures how similar two vectors are: it is the cosine of the angle between them, so vectors pointing in the same direction score 1 and vectors with nothing in common score 0. Here, I use it to determine how similar a single user's reviews are to those of every other user. This notebook goes through the process in depth.
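To make the definition concrete, here is a minimal sketch that computes cosine similarity by hand with NumPy. The three rating vectors and the helper name `cosine` are made up for illustration and are not part of the project data.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical ratings of four movies by three users (0 = not rated).
user_a = np.array([5.0, 0.0, 3.0, 0.0])
user_b = np.array([10.0, 0.0, 6.0, 0.0])  # same direction as user_a, twice the magnitude
user_c = np.array([0.0, 4.0, 0.0, 2.0])   # shares no rated movie with user_a

sim_ab = cosine(user_a, user_b)  # approximately 1: identical taste profile
sim_ac = cosine(user_a, user_c)  # 0: no overlap at all
```

Note that only the direction of the vectors matters, not their length, which is why `user_a` and `user_b` count as maximally similar even though their raw ratings differ.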

In [3]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.metrics.pairwise import cosine_similarity

- These cells create the connection to the PostgreSQL server set up in the Preprocessing notebook. The query simply pulls every column (one per movie) and every row (one per user, holding that user's review of each movie) and loads the result into a pandas DataFrame.
- Note: The password and IP address are not provided with this project. I recommend loading the saved CSV file that should have been created by the Preprocessing notebook instead.

In [4]:
# SQLAlchemy 1.4+ no longer accepts the "postgres://" scheme; use "postgresql://"
engine = create_engine('postgresql://postgres:password@IP_address:5432/postgres')

In [5]:
query = """
SELECT * FROM raw_user"""
all_users = pd.read_sql(query,engine,index_col='user_id')
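If the database is not available, the same table can presumably be loaded from the saved CSV file mentioned in the note above. The sketch below assumes the file was written with `user_id` as its index column. The file name `raw_user.csv` and the toy data are hypothetical, and the round trip through a temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the ratings table: one row per user, one column per movie.
toy_users = pd.DataFrame(
    {"Titanic": [0.5, 0.0], "The Big Chill": [0.0, 1.2]},
    index=pd.Index([7, 79], name="user_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "raw_user.csv")  # hypothetical file name
    toy_users.to_csv(csv_path)                        # stands in for the preprocessing step
    # index_col restores user_id as the index, matching the read_sql call
    all_users_from_csv = pd.read_csv(csv_path, index_col="user_id")
```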

- Because pandas DataFrames are memory intensive, this cell creates a NumPy matrix of all of the review data. It also creates dictionaries that map each user ID and each movie title to its position (row or column) in the NumPy matrix.

In [4]:
all_user_index = dict(zip(all_users.index,range(all_users.shape[0])))
all_user_columns = dict(zip(all_users.columns,range(all_users.shape[1])))
data_mat = np.asarray(all_users)
del all_users
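As an illustration of how these lookup dictionaries are used, the following sketch builds the same three objects from a made-up three-user table and reads one rating back out of the matrix. All names prefixed with `toy_` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings table: rows are users, columns are movies.
toy_users = pd.DataFrame(
    {"Titanic": [0.5, 0.0, 1.0], "The Big Chill": [0.0, 1.2, 0.3]},
    index=pd.Index([7, 79, 102], name="user_id"),
)

# Same construction as in the cell above, applied to the toy table.
toy_user_index = dict(zip(toy_users.index, range(toy_users.shape[0])))
toy_user_columns = dict(zip(toy_users.columns, range(toy_users.shape[1])))
toy_mat = np.asarray(toy_users)

# Look up user 79's rating of "The Big Chill" without the DataFrame.
row = toy_user_index[79]
col = toy_user_columns["The Big Chill"]
rating = toy_mat[row, col]
```

The dictionaries preserve the labels that are lost when the DataFrame is converted to a plain array, so the DataFrame itself can be deleted afterwards.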

This function performs the recommendation in several steps.
 - Step 1: Generates a SQL query from the user_id passed to the function. The query pulls all of the review data for that user into a pandas DataFrame.
 - Step 2: Creates a list of the movies that the user has seen and reviewed. This list is needed if we want to recommend only movies that the user has not seen yet.
 - Step 3: This is where the cosine similarity calculation happens. The variable similarity is a pandas DataFrame holding every user's cosine similarity to the queried user. Because a user is always most similar to themselves, the row for the user's similarity with themselves is dropped.
 - Step 4: Pulls the top n users from the index, ranked by highest cosine similarity to the user being queried. n is set when calling the function.
 - Step 5: A for loop iterates through the similar users found in step 4 and creates a DataFrame for each one containing their reviews for every movie.
 - Step 6: This final step has two options. The function can return the average of the similar users' scores for every movie, including movies that the user has already seen, sorted in descending order so that the most recommended movies come first. Alternatively, it can return the same results but exclude movies that the user has seen. By default, the function returns recommendations that can include movies the user has seen before.

In [5]:
def cosine_for_user(user_id, num_of_sim_users=10, return_watched=True):
    # Step 1: pull the queried user's ratings (int() guards the concatenated query)
    query = """
    SELECT * FROM raw_user
    WHERE user_id = """ + str(int(user_id))
    user_df = pd.read_sql(query, engine, index_col='user_id')
    # Make sure the movie columns are in the same order as in data_mat
    user_df = user_df[list(all_user_columns.keys())]
    # Step 2: movies the user has already reviewed (non-zero rating)
    watched = []
    for column in user_df.columns:
        if float(user_df[column].iloc[0]) != 0:
            watched.append(column)
    # Step 3: cosine similarity of the user to every user, then drop the user's own row
    similarity = pd.DataFrame(cosine_similarity(user_df, data_mat), index=["similar_users"], columns=list(all_user_index.keys())).T
    similarity.drop(index=user_id, inplace=True)
    # Step 4: the n most similar users
    top_similar = similarity.sort_values('similar_users', ascending=False).index[0:num_of_sim_users]
    # Step 5: one single-column DataFrame of ratings per similar user
    user_data = []
    for user in top_similar:
        user_data.append(pd.DataFrame(data_mat[all_user_index[user]], index=list(all_user_columns.keys()), columns=["movies"]))
    # Step 6: average the similar users' ratings and sort, best first
    recommendations = (sum(user_data) / num_of_sim_users).sort_values('movies', ascending=False)
    if return_watched:
        return recommendations
    # Otherwise exclude the movies the user has already reviewed
    return recommendations.drop(index=watched)
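To trace the six steps without a database connection, here is a minimal sketch on a made-up four-user table. It is not the function above: `toy_recommend`, the movie names, and the ratings are all hypothetical, and the table is passed in directly instead of being queried from SQL. It assumes, as the function above does, that a rating of 0 means the user has not reviewed the movie.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

movies = ["Movie A", "Movie B", "Movie C", "Movie D"]
toy_users = pd.DataFrame(
    [
        [2.0, 0.0, 1.0, 0.0],  # user 1 (the user being queried)
        [2.0, 0.0, 1.0, 3.0],  # user 2
        [2.0, 1.0, 1.0, 0.0],  # user 3
        [0.0, 3.0, 0.0, 0.0],  # user 4 (nothing in common with user 1)
    ],
    index=pd.Index([1, 2, 3, 4], name="user_id"),
    columns=movies,
)

def toy_recommend(ratings, user_id, num_of_sim_users=2, return_watched=True):
    # Step 1: the queried user's ratings as a one-row DataFrame
    user_row = ratings.loc[[user_id]]
    # Step 2: movies the user has already reviewed (non-zero rating)
    watched = [m for m in ratings.columns if user_row[m].iloc[0] != 0]
    # Step 3: similarity to every user, then drop the user's own entry
    sims = pd.Series(cosine_similarity(user_row, ratings)[0], index=ratings.index)
    sims = sims.drop(index=user_id)
    # Step 4: the n most similar users
    top_similar = sims.sort_values(ascending=False).index[:num_of_sim_users]
    # Steps 5 and 6: unweighted mean of the similar users' ratings, best first
    # (mean() divides by the number of users actually selected)
    scores = ratings.loc[top_similar].mean(axis=0).sort_values(ascending=False)
    return scores if return_watched else scores.drop(index=watched)

with_seen = toy_recommend(toy_users, user_id=1)
unseen_only = toy_recommend(toy_users, user_id=1, return_watched=False)
```

In this toy data, users 3 and 2 are the two nearest neighbours of user 1. With the default settings the top result is "Movie A", which user 1 has already reviewed. Passing `return_watched=False` leaves only "Movie D" and "Movie B".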

- This cell demonstrates the end result of the function above: the top 5 movies recommended for user 7, based on the preferences of the 10 most similar users.

In [6]:
user_7 = cosine_for_user(7,num_of_sim_users=10)
user_7.head()

| | movies |
|---|---|
| Terms of Endearment | 0.90549 |
| The Big Chill | 0.874393 |
| Good Morning | 0.834938 |
| As Good as It Gets | 0.833926 |
| Thelma & Louise: Special Edition | 0.787187 |


In [9]:
user_79 = cosine_for_user(79,num_of_sim_users=10)
user_79.head()

| | movies |
|---|---|
| Sleepless in Seattle | 1.113931 |
| Sister Act 2: Back in the Habit | 1.035549 |
| Three Men and a Baby | 0.949249 |
| The First Wives Club | 0.944583 |
| Titanic | 0.917973 |
