# Diversity Score on the MovieLens Dataset

In this notebook, we measure the diversity of movie consumption in the MovieLens dataset. The dataset records the interactions between users and movies on [MovieLens](https://movielens.org/), a public movie recommendation service run by [GroupLens](https://grouplens.org/), based at the University of Minnesota.

Our goal is to apply Spotify's diversity measure to the MovieLens dataset. This requires training an embedding model on the dataset (the imports below include gensim's Word2Vec), then computing the Generalist-Specialist metric for each user.
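To make the target concrete, here is a minimal sketch of the Generalist-Specialist score as we understand it: the weighted average cosine similarity between each item embedding a user has consumed and the weighted centroid of those embeddings. Scores close to 1 indicate a specialist, lower scores a generalist. The function name, the optional `weights` argument, and the toy 2-D vectors are illustrative assumptions, not code from this notebook.

```python
import numpy as np


def generalist_specialist_score(item_vectors, weights=None):
    """Weighted mean cosine similarity between a user's items and their centroid."""
    vectors = np.asarray(item_vectors, dtype=float)
    if weights is None:
        # Assumption: every interaction counts equally unless weights are given.
        weights = np.ones(len(vectors))
    weights = np.asarray(weights, dtype=float)

    # Weighted centroid of the user's item embeddings.
    centroid = (weights[:, None] * vectors).sum(axis=0) / weights.sum()

    # Cosine similarity of each item vector to the centroid.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid)
    cosines = vectors @ centroid / norms

    return float((weights * cosines).sum() / weights.sum())


# Toy embeddings, made up for illustration only.
specialist = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]  # all items point the same way
generalist = [[1.0, 0.0], [0.0, 1.0]]              # items point in orthogonal directions

gs_specialist = generalist_specialist_score(specialist)
gs_generalist = generalist_specialist_score(generalist)
```

The sketch does not guard against zero-length vectors or an all-zero centroid; real embeddings would need that check.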

In [1]:
import os
import pandas as pd
import numpy as np
import re
import logging
import multiprocessing
import umap
import matplotlib.pyplot as plt
import requests
import seaborn as sns

from time import time
from ast import literal_eval
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from tqdm import tqdm

from core.util import *

%matplotlib inline

logger = setup_logging()

## Dataset Download and Preparation

First we set up the paths to the ratings, links, and movies files. Then we download the dataset into the current working directory and unzip it, unless it is already there.

In [2]:
dataset_dir_name = "ml-latest"
base_path = os.path.join(os.getcwd(), dataset_dir_name)
ratings_path = os.path.join(base_path, "ratings.csv")
links_path = os.path.join(base_path, "links.csv")
movies_path = os.path.join(base_path, "movies.csv")

In [3]:
if not os.path.exists(base_path):
    dataset_url = "http://files.grouplens.org/datasets/movielens/" + dataset_dir_name + ".zip"
    # skip the download if the archive is already on disk
    if not os.path.exists(base_path + ".zip"):
        download_dataset(dataset_url, base_path + ".zip")
    
    logger.info("unzipping file")
    unzip_file(dataset_dir_name + ".zip")

INFO - 10:59:40: unzipping file
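The helpers `download_dataset` and `unzip_file` come from `core.util` and are not shown in this notebook. As a rough idea of what such helpers might do, here is a standard-library sketch. The `_sketch` function names, their signatures, and the choice to extract next to the archive are assumptions, not the actual `core.util` implementation.

```python
import os
import shutil
import tempfile
import urllib.request
import zipfile


def download_dataset_sketch(url, target_path):
    """Stream the file at `url` to `target_path` (hypothetical stand-in)."""
    with urllib.request.urlopen(url) as response, open(target_path, "wb") as fh:
        shutil.copyfileobj(response, fh)


def unzip_file_sketch(zip_path, extract_to=None):
    """Extract `zip_path` into `extract_to` (defaults to the archive's directory)."""
    extract_to = extract_to or os.path.dirname(os.path.abspath(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_to)
    return extract_to


# Offline demo: build a tiny archive in a temporary directory and extract it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_zip = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("demo/ratings.csv", "userId,movieId,rating,timestamp\n")
    unzip_file_sketch(demo_zip)
    extracted_files = sorted(os.listdir(os.path.join(tmp_dir, "demo")))
```

The demo avoids the network on purpose, so it only exercises the extraction step.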


## Read and Explore Dataset

In [None]:
ratings_df = pd.read_csv(ratings_path)

In [14]:
ratings_df.head()

,userId,movieId,rating,timestamp
0,1,307,3.5,1256677221
1,1,481,3.5,1256677456
2,1,1091,1.5,1256677471
3,1,1257,4.5,1256677460
4,1,1449,4.5,1256677264


In [31]:
movies_df = pd.read_csv(movies_path)

In [32]:
movies_df["movie_title"] = movies_df["title"].apply(lambda x: x.split(" (")[0])

In [33]:
movies_df.head()

,movieId,title,genres,movie_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II
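Splitting on `" ("` cuts the title at the first parenthesis, so any parenthesised text before the year (an alternative title, for instance) is dropped together with the year. If that text should be kept, one option is to strip only a trailing four-digit year. The sketch below compares both approaches; the helper name and the sample titles are illustrative and are not read from `movies.csv`.

```python
import re

import pandas as pd

# Matches a trailing "(YYYY)" with optional surrounding whitespace.
YEAR_SUFFIX = re.compile(r"\s*\((\d{4})\)\s*$")


def split_title_and_year(title):
    """Return (title_without_year, year or None) for a 'Name (YYYY)' string."""
    match = YEAR_SUFFIX.search(title)
    if match is None:
        return title.strip(), None
    return title[: match.start()].strip(), int(match.group(1))


# Illustrative titles only.
titles = pd.Series(["Toy Story (1995)", "Some Film (Un film) (1993)", "No Year Given"])

parsed = titles.apply(split_title_and_year)          # keeps "(Un film)"
simple = titles.apply(lambda x: x.split(" (")[0])    # the notebook's approach
```

Which behaviour is preferable depends on whether the alternative titles matter for the analysis; the notebook's shorter titles are kept in the cells that follow.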


In [34]:
movies_df["genres"] = movies_df["genres"].apply(lambda x: x.split("|"))

In [35]:
movies_df.head()

,movieId,title,genres,movie_title
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]",Toy Story
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]",Jumanji
2,3,Grumpier Old Men (1995),"[Comedy, Romance]",Grumpier Old Men
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]",Waiting to Exhale
4,5,Father of the Bride Part II (1995),[Comedy],Father of the Bride Part II
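Once `genres` holds lists, `DataFrame.explode` can turn each (movie, genre) pair into its own row, which makes per-genre counts straightforward. A small sketch on made-up data; the `toy_movies` frame is an assumption for illustration and is not part of the notebook.

```python
import pandas as pd

# Made-up movies in the same shape as movies.csv.
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children", "Comedy|Romance", "Comedy"],
})
toy_movies["genres"] = toy_movies["genres"].apply(lambda x: x.split("|"))

# One row per (movie, genre) pair, then count movies per genre.
exploded = toy_movies.explode("genres")
genre_counts = exploded["genres"].value_counts()
```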


In [36]:
movies_df = movies_df[["movieId", "movie_title", "genres"]]

In [37]:
df = pd.merge(ratings_df, movies_df, on="movieId")

In [38]:
df.head()

,userId,movieId,rating,timestamp,movie_title,genres
0,1,307,3.5,1256677221,Three Colors: Blue,[Drama]
1,6,307,4.0,832059248,Three Colors: Blue,[Drama]
2,56,307,4.0,1383625728,Three Colors: Blue,[Drama]
3,71,307,5.0,1257795414,Three Colors: Blue,[Drama]
4,84,307,3.0,999055519,Three Colors: Blue,[Drama]
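`pd.merge` defaults to an inner join, so ratings whose `movieId` is missing from the movies table would be dropped silently. If that matters, a left join with `indicator=True` exposes such rows, and `validate="many_to_one"` raises an error if a `movieId` appears more than once on the movies side. A sketch on made-up frames (the `toy_*` names and values are illustrative):

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [3.5, 4.0, 2.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "movie_title": ["Film A", "Film B"],
})

# Keep every rating and flag where each row came from.
checked = pd.merge(
    toy_ratings,
    toy_movies,
    on="movieId",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = checked[checked["_merge"] == "left_only"]

# The default inner join drops the unmatched rating.
inner = pd.merge(toy_ratings, toy_movies, on="movieId")
```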


In [44]:
# collect each user's titles and ratings in timestamp order
users_movie = (
    df.sort_values(by=["timestamp"])
    .groupby("userId")
    .agg({"movie_title": list, "rating": list})
)

In [43]:
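# assumed completion of the truncated call: the output below has a fresh 0-based index and no userId
users_movie.reset_index(drop=True)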

,movie_title,rating
0,"[Hollow Man, Three Colors: Blue, Event Horizon...","[2.0, 3.5, 2.5, 4.0, 3.5, 4.0, 4.5, 4.5, 3.5, ..."
1,"[Driving Miss Daisy, Escape from L.A., L.A. St...","[4.0, 3.5, 3.5, 4.0, 3.0, 3.5, 3.0, 4.0, 3.5, ..."
2,"[Godfather: Part II, The, Angel on My Shoulder...","[4.0, 3.0, 4.0, 4.0, 3.0, 4.0, 4.0, 3.0, 3.0, ..."
3,"[Austin Powers: The Spy Who Shagged Me, Being ...","[3.5, 4.0, 4.0, 5.0, 3.5, 4.0, 5.0, 3.5, 3.5, ..."
4,"[Sex, Lies, and Videotape, She's All That, Col...","[2.0, 3.0, 3.0, 4.0, 3.5, 3.5, 4.5, 4.0, 4.0, ..."
...,...,...
283223,"[Birdcage, The, Twelve Monkeys, Grumpier Old M...","[4.0, 3.0, 4.0, 5.0, 3.0, 3.0, 3.0, 3.0, 4.0, ..."
283224,"[Cable Guy, The, Election, Grease, Grosse Poin...","[3.0, 4.0, 2.5, 4.0, 3.0, 2.5, 3.0, 2.5, 3.0, ..."
283225,"[Ace Ventura: Pet Detective, Dumb & Dumber, Be...","[3.0, 2.0, 1.0, 2.0, 2.0, 2.0, 1.0, 2.0, 1.0, ..."
283226,"[Fried Green Tomatoes, Beavis and Butt-Head Do...","[4.0, 2.5, 5.0, 5.0, 4.0, 5.0, 3.0, 2.5, 5.0, ..."


In [42]:
users_movie

userId,movie_title,rating
1,"[Hollow Man, Three Colors: Blue, Event Horizon...","[2.0, 3.5, 2.5, 4.0, 3.5, 4.0, 4.5, 4.5, 3.5, ..."
2,"[Driving Miss Daisy, Escape from L.A., L.A. St...","[4.0, 3.5, 3.5, 4.0, 3.0, 3.5, 3.0, 4.0, 3.5, ..."
3,"[Godfather: Part II, The, Angel on My Shoulder...","[4.0, 3.0, 4.0, 4.0, 3.0, 4.0, 4.0, 3.0, 3.0, ..."
4,"[Austin Powers: The Spy Who Shagged Me, Being ...","[3.5, 4.0, 4.0, 5.0, 3.5, 4.0, 5.0, 3.5, 3.5, ..."
5,"[Sex, Lies, and Videotape, She's All That, Col...","[2.0, 3.0, 3.0, 4.0, 3.5, 3.5, 4.5, 4.0, 4.0, ..."
...,...,...
283224,"[Birdcage, The, Twelve Monkeys, Grumpier Old M...","[4.0, 3.0, 4.0, 5.0, 3.0, 3.0, 3.0, 3.0, 4.0, ..."
283225,"[Cable Guy, The, Election, Grease, Grosse Poin...","[3.0, 4.0, 2.5, 4.0, 3.0, 2.5, 3.0, 2.5, 3.0, ..."
283226,"[Ace Ventura: Pet Detective, Dumb & Dumber, Be...","[3.0, 2.0, 1.0, 2.0, 2.0, 2.0, 1.0, 2.0, 1.0, ..."
283227,"[Fried Green Tomatoes, Beavis and Butt-Head Do...","[4.0, 2.5, 5.0, 5.0, 4.0, 5.0, 3.0, 2.5, 5.0, ..."
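The per-user lists above are already close to the input that Word2Vec-style models expect, namely an iterable of token lists. The sketch below repeats the sort-then-group step on a made-up frame so the ordering is easy to check by hand, then drops very short histories. The `MIN_HISTORY` threshold and the toy data are assumptions, not values used in this notebook.

```python
import pandas as pd

# Made-up interactions; rows are deliberately not in time order.
toy = pd.DataFrame({
    "userId": [1, 2, 1, 2, 1, 3],
    "movie_title": ["A", "B", "C", "D", "E", "F"],
    "timestamp": [30, 10, 10, 20, 20, 5],
})

# Sort by time first so each user's list follows their viewing order.
sequences = (
    toy.sort_values(by=["timestamp"])
    .groupby("userId")["movie_title"]
    .agg(list)
)

# Hypothetical threshold: skip users with too few items to provide context.
MIN_HISTORY = 2
sentences = [titles for titles in sequences if len(titles) >= MIN_HISTORY]
```

Each inner list then plays the role of a sentence and each title the role of a word.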
