# Loading and cleaning the data

We will use a small version (`ml-100k`) of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. It contains movie ratings given by users, demographic data about those users, and some information about the movies, such as release date and genre.

Demographic data will not be used. The ratings will be binarized (0 if the movie was not rated or was given a rating of 2 or lower, 1 otherwise), and the genres will be included as additional features.
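
The binarization rule can be tried on a tiny example first. This is a minimal sketch on made-up data: the frame `toy_ratings` and the column name `liked` are hypothetical and are not part of the notebook or of MovieLens.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens.
toy_ratings = pd.DataFrame({
    "user_id": ["0", "0", "1", "1"],
    "movie_id": ["0", "1", "0", "2"],
    "rating": [5.0, 2.0, 3.0, 1.0],
})

# Ratings of 2 or lower become 0; ratings of 3 or higher become 1.
toy_ratings["liked"] = (toy_ratings["rating"] > 2).astype(int)
print(toy_ratings)
```

User/movie pairs that never appear in the table have no row at all here; they only turn into zeros later, when the data is pivoted into a matrix.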

In [None]:
import pandas as pd
import numpy as np

from urllib.request import urlretrieve
import zipfile

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

In [None]:
#Download the dataset
print('Downloading Dataset...')
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-100k.zip", "movielens.zip")
#Use a context manager so the archive is closed once we are done with it
with zipfile.ZipFile('movielens.zip', "r") as zip_ref:
    zip_ref.extractall()
    print("Done. Dataset contains:")
    print(zip_ref.read('ml-100k/u.info').decode('utf-8'))

In [None]:
#Demographic data about users
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(
    'ml-100k/u.user', sep='|', names=users_cols, encoding='latin-1')

#User rated movies
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

#Movies info
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

In [None]:
users.head(5)

In [None]:
ratings.head(5)

In [None]:
movies.head(5)

In [None]:
#ids start at one, shift to zero
users["user_id"] = users["user_id"].apply(lambda x: str(x-1))
movies["movie_id"] = movies["movie_id"].apply(lambda x: str(x-1))
movies["year"] = movies['release_date'].apply(lambda x: str(x).split('-')[-1])
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: str(x-1))
ratings["user_id"] = ratings["user_id"].apply(lambda x: str(x-1))
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))

In [None]:
movielens = ratings.merge(movies, on='movie_id').merge(users, on='user_id')

In [None]:
movielens.head()

In [None]:
#select only the relevant columns for our model
relevant_cols = ['user_id', 'movie_id', 'rating'] + genre_cols
#copy so that later assignments modify df itself and do not raise SettingWithCopyWarning
df = movielens[relevant_cols].copy()

In [None]:
df.sample(5)

In [None]:
#binarize the ratings
df.loc[:, 'rating'] = df['rating'].apply(lambda x: 0 if x <=2 else 1)

In [None]:
df.head(5)

In [None]:
#np.object was removed from NumPy; the builtin object is equivalent here
df.describe(include=[object])

In [None]:
#discard zero values (they'll be reintroduced when we store the values in a one-hot encoded matrix)
df = df[df['rating'] == 1]

In [None]:
df.describe(include=[object])

In [None]:
df

# One-hot encode, normalize and calculate similarities
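
Before building the full matrix, it may help to see its shape on a toy example. The sketch below uses made-up interactions (the frame `toy` and the names `user_part`, `genre_part` and `toy_matrix` are hypothetical) and only two genre columns. Each movie becomes one row, with one column per user followed by the genre flags.

```python
import pandas as pd

# Hypothetical positive interactions (ratings already binarized and filtered).
toy = pd.DataFrame({
    "movie_id": ["0", "0", "1", "2"],
    "user_id": ["0", "1", "1", "0"],
    "Action": [1, 1, 0, 1],
    "Comedy": [0, 0, 1, 0],
})

# One row per movie, one column per user; a missing pair becomes 0.
user_part = toy.groupby(["movie_id", "user_id"]).size().unstack().fillna(0)

# Genre flags are constant per movie, so the first row of each group is enough.
genre_part = toy.drop(columns="user_id").groupby("movie_id").first()

# Side by side: user columns first, genre columns after.
toy_matrix = pd.concat([user_part, genre_part], axis=1)
print(toy_matrix)
```

Movie `1` was liked only by user `1` and is a comedy, so its row is 0, 1, 0, 1 across the columns `0`, `1`, `Action`, `Comedy`.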

In [None]:
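#movie x user matrix: 1 if the user liked the movie, 0 otherwise
matrix1 = df.drop(genre_cols, axis=1).groupby(['movie_id', 'user_id']).size().unstack().fillna(0)
#movie x genre matrix; 'rating' is dropped as well because it is 1 for every remaining row
matrix2 = df.drop(['user_id', 'rating'], axis=1).groupby('movie_id').first()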

In [None]:
matrix = pd.concat([matrix1, matrix2], axis=1)

matrix.index.name = None

In [None]:
matrix

In [None]:
magnitude = np.sqrt(np.square(matrix).sum(axis=1))

In [None]:
matrix = matrix.divide(magnitude, axis=0)
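
The effect of this row normalization can be checked on a small example. The sketch below uses made-up vectors (the names `vectors`, `unit`, `dot` and `by_definition` are hypothetical) and plain NumPy. It shows that once every row has length 1, the dot product between two rows equals their cosine similarity.

```python
import numpy as np

# Hypothetical item vectors: rows are movies, columns are features.
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])

# Divide each row by its Euclidean norm so that every row has length 1.
norms = np.linalg.norm(vectors, axis=1)
unit = vectors / norms[:, np.newaxis]

# Dot products between unit vectors.
dot = unit @ unit.T

# Cosine similarity from its definition: a.b / (|a| * |b|).
by_definition = (vectors @ vectors.T) / np.outer(norms, norms)

print(np.allclose(dot, by_definition))
print(np.round(dot, 3))
```

Note that scikit-learn's `cosine_similarity` normalizes its inputs itself, so normalizing beforehand should not change the similarities; it mainly makes the step explicit.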

In [None]:
def calculate_similarity(matrix):
    data_sparse = sparse.csr_matrix(matrix)
    similarities = cosine_similarity(data_sparse)
    sim = pd.DataFrame(data=similarities, index=matrix.index.values, columns=matrix.index.values)
    return sim

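def translate_sim(s_matrix):
    #s_matrix is a Series of similarity scores indexed by movie_id
    results = []
    #Series.iteritems() was removed in pandas 2.0; items() is the replacement
    for idx, score in s_matrix.items():
        #look up the title that matches this movie_id
        movie = movies[movies['movie_id'].astype(str) == idx]['title'].to_list()[0]
        results.append((movie, score))
    return results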

In [None]:
sim_matrix = calculate_similarity(matrix)

In [None]:
translate_sim(sim_matrix.loc['1'].nlargest(10))

In [None]:
#sample a random movie id and see for yourself
#(sample from sim_matrix: movies without any positive rating were filtered out and have no row)
item_id = sim_matrix.sample(1).index[0]
r = translate_sim(sim_matrix.loc[item_id].nlargest(10))
r

# Conclusion

This is one of the simplest methods for finding similar items with collaborative filtering. We included the genres to give the model more information about how the movies should be grouped. Some of the recommendations don't make sense. Possible reasons are that we simplified the ratings, that we used only part of the dataset, that there are not enough users, or that a model based only on cosine similarity is not very sophisticated.

There are many machine learning models that perform much better. We use this one to introduce some key concepts, such as vectorization, one-hot encoding and cosine similarity.