### Similarities of tables
This notebook uses the tables produced by Olga to try out collaborative filtering, both user-based and item-based.

In [21]:
# Loading in Olga's tables and the dataset

# Importing libraries needed for data preprocessing
import numpy as np
import pandas as pd
import os
%matplotlib inline

# Reading in small movie dataset
movie = pd.read_csv("ml-latest-small/movies.csv")
# Functions used for loading in movie dataset
def match_genre(row, cur_genre):
    movie_genres = row['genres'].split('|')
    return int(cur_genre in movie_genres)
# Extracting genres and year, putting them into table
# Raw string avoids invalid-escape warnings; the pattern captures the last parenthesised group in the title
movie['year'] = movie['title'].str.extract(r'.*\((.*)\).*', expand=False)
# Collect the distinct genre names from the pipe-separated 'genres' column
unique_genres = movie['genres'].unique()
split_genre = [genre_string.split('|') for genre_string in unique_genres]
genres_set = sorted(set(item for sublist in split_genre for item in sublist))
# create additional 20 features (one 0/1 column per genre) for content-based analysis
for genre in genres_set:
    movie[genre] = movie.apply(lambda row: match_genre(row, genre), axis=1)

# Loading in tags from small movie dataset, TODO: Work in progress by Olga
tags = pd.read_csv("ml-latest-small/tags.csv")
tags = tags.loc[:,["userId","movieId","tag"]]
tag_counts = tags['tag'].value_counts()
movies_tags = movie.merge(tags, on='movieId', how='inner')

    
# Loading in rating data from small movie dataset
rating = pd.read_csv("ml-latest-small/ratings.csv")
# We only need the user id, movie id and rating
rating = rating.loc[:,["userId","movieId","rating"]]
# Compute the average rating for each movie, to replace missing values
avg_ratings = rating.groupby('movieId', as_index=False)['rating'].mean()
# Note: the userId column that still shows up in Table 1 comes from the tags table (via movies_tags), not from avg_ratings
# Add average ratings to movie table
movies_tag_avgrating = movies_tags.merge(avg_ratings, on='movieId', how='inner')
data = pd.merge(movie,rating)

# Pivot table for collaborative item-based filtering
# Rows are users, columns are movie titles, and values are ratings
pivot_table = data.pivot_table(index = ["userId"],columns = ["title"],values = "rating")




# Showing tables so far:
print("Table 1: Movie ID |Movie Title|Genres(remove)|Year it came out|Unique Genres (with 0 if not in there, 1 if in there)|Tags features (TODO)|UserID (remove)|Average rating")
display(movies_tag_avgrating.head(5))

print("Table 2: User ID's vs Movie titles their rating")
display(pivot_table.head(5))


Table 1: Movie ID |Movie Title|Genres(remove)|Year it came out|Unique Genres (with 0 if not in there, 1 if in there)|Tags features (TODO)|UserID (remove)|Average rating


,movieId,title,genres,year,(no genres listed),Action,Adventure,Animation,Children,Comedy,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,userId,tag,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,0,1,1,1,1,...,0,0,0,0,0,0,0,336,pixar,3.92093
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,0,1,1,1,1,...,0,0,0,0,0,0,0,474,pixar,3.92093
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,0,1,1,1,1,...,0,0,0,0,0,0,0,567,fun,3.92093
3,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0,0,1,0,1,0,...,0,0,0,0,0,0,0,62,fantasy,3.431818
4,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0,0,1,0,1,0,...,0,0,0,0,0,0,0,62,magic board game,3.431818


Table 2: User ID's vs Movie titles their rating


userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


title
Carnage (2011)                                                1.000000
Texas Chainsaw Massacre 2, The (1986)                         1.000000
Hills Have Eyes II, The (2007)                                1.000000
Mr. 3000 (2004)                                               1.000000
Blue Jasmine (2013)                                           1.000000
Down to You (2000)                                            1.000000
Spy Kids 2: The Island of Lost Dreams (2002)                  1.000000
Mississippi Burning (1988)                                    1.000000
Mindhunters (2004)                                            1.000000
Brave One, The (2007)                                         1.000000
Holy Man (1998)                                               1.000000
Millennium Actress (Sennen joyû) (2001)                       1.000000
The Great Raid (2005)                                         1.000000
Mighty Joe Young (1949)                                       1.000000


In [None]:
# Trying out collaborative item-based filtering
# The original item-based recommendation relies only on user-item ratings (e.g., a user rated a movie with 3
# stars, or a user "likes" a video). When computing the similarity between items, nothing is assumed to be known
# other than all users' rating history. So the similarity between items is computed from
# the ratings rather than from the metadata of the item content.

# Recommending users movies that are similar to the movies that they've liked so far (based on ratings)
# Jaccard similarity?

movie_watched = pivot_table["Bad Boys (1995)"]
similarity_with_other_movies = pivot_table.corrwith(movie_watched)  # find correlation between "Bad Boys (1995)" and other movies
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
display(similarity_with_other_movies.head(1000))
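
A correlation of exactly 1.0 (as in the series shown earlier) can come from two movies that share only a couple of raters. The following is a minimal sketch, not part of the original notebook, of the same `corrwith` approach with a minimum-overlap filter. The toy matrix and the names `toy_ratings`, `similar_items` and `min_common` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy user x movie rating matrix (NaN = not rated); values are invented for illustration.
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 2.0, np.nan],
        "Movie B": [4.0, 5.0, 2.0, 1.0, 3.0],
        "Movie C": [1.0, 2.0, np.nan, np.nan, 5.0],
        "Movie D": [1.0, 2.0, 5.0, 4.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="userId"),
)


def similar_items(ratings, title, min_common=3):
    """Pearson correlation of `title` with every other movie,
    keeping only movies that share at least `min_common` raters with it."""
    target = ratings[title]
    corr = ratings.corrwith(target)
    # Count the users who rated both the target movie and each other movie
    rated = ratings.notna().astype(int)
    common = rated.mul(target.notna().astype(int), axis=0).sum()
    result = pd.DataFrame({"correlation": corr, "n_common": common})
    result = result[result["n_common"] >= min_common].drop(index=title)
    return result.sort_values("correlation", ascending=False)


toy_similar = similar_items(toy_ratings, "Movie A", min_common=3)
print(toy_similar)
```

Here "Movie C" shares only two raters with "Movie A", so its perfect (negative) correlation is dropped rather than ranked. On the real `pivot_table` the threshold would need tuning; 3 is only a placeholder.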

In [None]:
# Trying out collaborative user-based filtering

# Finding similar users based on the ratings they gave to movies

# Recommending a user movies that similar users have watched and liked, but that the user has not seen yet
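
The two steps outlined in the comments above could be sketched as follows. This is one possible approach under stated assumptions, not the notebook's implementation: users are compared with Pearson correlation, and unseen movies are scored by a similarity-weighted average of the neighbours' ratings. The toy data and the names `toy_ratings`, `recommend_for_user`, `k` and `min_common` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user x movie rating matrix (NaN = not rated); values are invented for illustration.
toy_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 1.0, 5.0],
        "Movie B": [4.0, 5.0, 2.0, 4.0],
        "Movie C": [1.0, 2.0, 5.0, 1.0],
        "Movie D": [np.nan, 5.0, 1.0, 4.0],
        "Movie E": [np.nan, 1.0, 5.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)


def recommend_for_user(ratings, user_id, k=2, min_common=2):
    """Score the movies `user_id` has not rated, using the k most similar users."""
    # User-user Pearson correlation: transpose so that users become columns
    user_sim = ratings.T.corr(min_periods=min_common)[user_id].drop(user_id)
    # Keep only positively correlated users, then take the k most similar
    neighbours = user_sim[user_sim > 0].nlargest(k)
    unseen = ratings.columns[ratings.loc[user_id].isna()]
    neighbour_ratings = ratings.loc[neighbours.index, unseen]
    # Similarity-weighted average; a neighbour only contributes weight
    # to the movies that neighbour actually rated
    weights = neighbour_ratings.notna().astype(float).mul(neighbours, axis=0)
    scores = neighbour_ratings.mul(neighbours, axis=0).sum() / weights.sum()
    return scores.dropna().sort_values(ascending=False)


recs = recommend_for_user(toy_ratings, user_id=1)
print(recs)
```

With the real data, `ratings` would be `pivot_table`; the neighbour count and overlap threshold are placeholders to be tuned.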

In [2]:
# Trying out content based filtering

# Finding movies that are similar based on genres, year, rating and tags

# Possible similarity measures:
#Cosine-Based Similarity
#Correlation-Based Similarity
#Adjusted Cosine Similarity
#1-Jaccard distance
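
Two of the measures listed above, cosine similarity and Jaccard similarity (1 minus the Jaccard distance), can be sketched on binary genre vectors. This is an illustrative sketch only: the first two rows use the genres shown in Table 1, the third row is a made-up movie, and the names `toy_genres`, `cosine_similarity` and `jaccard_similarity` are assumptions. Correlation-based and adjusted cosine similarity are usually applied to ratings and are not shown.

```python
import numpy as np
import pandas as pd

# 0/1 genre indicators; the third row is a hypothetical movie for contrast.
toy_genres = pd.DataFrame(
    {
        "Adventure": [1, 1, 0],
        "Animation": [1, 0, 0],
        "Children": [1, 1, 0],
        "Comedy": [1, 0, 1],
        "Fantasy": [1, 1, 0],
        "Romance": [0, 0, 1],
    },
    index=["Toy Story (1995)", "Jumanji (1995)", "Hypothetical Rom-Com"],
)


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def jaccard_similarity(a, b):
    """Shared genres divided by genres present in either movie."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


toy_story = toy_genres.loc["Toy Story (1995)"]
jumanji = toy_genres.loc["Jumanji (1995)"]
rom_com = toy_genres.loc["Hypothetical Rom-Com"]

cos_tj = cosine_similarity(toy_story, jumanji)
jac_tj = jaccard_similarity(toy_story, jumanji)
jac_jr = jaccard_similarity(jumanji, rom_com)
print(cos_tj, jac_tj, jac_jr)
```

Year, average rating and tags would need scaling or encoding before being mixed with the genre indicators; that step is left out here.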
