# Embeddings for Recommendation Systems

As we’ve mentioned, embeddings are useful well beyond text. In industry, for example, they are widely used in recommendation systems.

We’ll use the word2vec algorithm to embed movies using per-user watch lists. Imagine that we treat each movie as we would a word or token, and each user’s list of rated movies as a sentence. The resulting embeddings can then be used to recommend movies that often appear together in the same lists.

The dataset we’ll use is the small MovieLens dataset (`ml-latest-small`) published by GroupLens. It contains movie ratings from several hundred users, which we group into one watch list per user. Figure 2-17 demonstrates this dataset.

![Three playlists containing watched video IDs](../assets/videos_playlists.png)

Figure 2-17. For video embeddings that capture video similarity we’ll use a dataset made up of a collection of playlists, each containing a list of videos.


We’ll train the model first, then give it a movie and see what it recommends in response.



### Training a Movie Embedding Model

We’ll start by loading the dataset containing the user ratings as well as each movie’s metadata, such as its title and genres:



In [32]:
import io
import zipfile
from urllib import request

import pandas as pd

# Download the small MovieLens archive and read two CSV files from it in memory
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
response = request.urlopen(url)
with zipfile.ZipFile(io.BytesIO(response.read())) as z:
    with z.open('ml-latest-small/ratings.csv') as f:
        ratings = pd.read_csv(f)
    with z.open('ml-latest-small/movies.csv') as f:
        movies_df = pd.read_csv(f)

# Index the metadata by string IDs so lookups match the string tokens given to word2vec
movies_df['movieId'] = movies_df['movieId'].astype(str)
movies_df = movies_df.set_index('movieId')

# One "sentence" per user: the list of movie IDs that user rated
movie_watch_lists = ratings.groupby('userId')['movieId'].apply(list).values.tolist()
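
One caveat about the grouping above: word2vec's context window depends on the order of items in each list, and `groupby` keeps whatever row order `ratings.csv` happens to have. The following is a minimal sketch, assuming you want each list in viewing order, that sorts by the `timestamp` column before grouping. The toy DataFrame and the helper name `build_watch_lists` are made up for illustration; only the column names follow the MovieLens file.

```python
import pandas as pd

# Toy stand-in for ratings.csv (hypothetical values, same column names)
toy_ratings = pd.DataFrame({
    "userId":    [1, 1, 1, 2, 2],
    "movieId":   [10, 20, 30, 20, 10],
    "rating":    [4.0, 5.0, 3.0, 2.0, 4.5],
    "timestamp": [300, 100, 200, 50, 60],
})

def build_watch_lists(ratings_frame):
    """Return one list of string movie IDs per user, ordered by rating time."""
    ordered = ratings_frame.sort_values(["userId", "timestamp"])
    return (
        ordered.groupby("userId")["movieId"]
        .apply(lambda ids: [str(i) for i in ids])  # string tokens for word2vec
        .tolist()
    )

toy_watch_lists = build_watch_lists(toy_ratings)
print(toy_watch_lists)
```

Whether time ordering actually improves the recommendations on the real data is something to check empirically; this sketch only shows how the ordering could be made explicit.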

In [33]:
# Peek at the first ten movie IDs of the first two watch lists
print('Watch list #1 (first 10 movie IDs):\n ', movie_watch_lists[0][:10], '\n')
print('Watch list #2 (first 10 movie IDs):\n ', movie_watch_lists[1][:10])

The model is trained with a call to `Word2Vec`. Based on the official [Gensim Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html), its parameters are:

* **`sentences`:** The input data. It must be an iterable of lists of tokens. Here each "sentence" is one user's watch list and each token is a movie ID.
* **`vector_size=32`:** The dimensionality of the vectors, that is, the number of features used to represent each item.
* **`window=10`:** The maximum distance between the current and predicted token within a sentence. A larger window captures broader context.
* **`negative=20`:** How many "noise words" are drawn for **negative sampling**. The documentation gives 5 to 20 as the usual range, so 20 sits at the upper end; the original word2vec paper notes that 2 to 5 can suffice for large datasets.
* **`min_count=1`:** The model ignores all tokens with a total frequency lower than this. Setting it to 1 keeps every movie that appears in a watch list in the vocabulary.
* **`workers=4`:** The number of worker threads used to train the model, allowing multicore parallelization to speed up training.
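
To make the `window` parameter more concrete, here is a simplified sketch of which (current, context) pairs a fixed window produces from a single list. The function `context_pairs` and the toy list are hypothetical, and this is not Gensim's implementation: among other differences, Gensim can randomly shrink the effective window during training.

```python
def context_pairs(sequence, window):
    """List every (current, context) pair within `window` positions of each token."""
    pairs = []
    for i, current in enumerate(sequence):
        start = max(0, i - window)
        stop = min(len(sequence), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((current, sequence[j]))
    return pairs

toy_watch_list = ["1", "16", "32", "34"]  # made-up list of movie IDs
pairs = context_pairs(toy_watch_list, window=1)
print(pairs)
```

With `window=1`, each movie is paired only with its immediate neighbors; a larger window pairs it with movies further away in the same list.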

In [34]:
!pip install gensim



In [37]:
from gensim.models import Word2Vec

movie_watch_lists_str = [[str(m) for m in playlist] for playlist in movie_watch_lists]

model = Word2Vec(
    sentences=movie_watch_lists_str,
    vector_size=32,   # Dimensionality of the movie vectors
    window=10,        # Context window for movies appearing together
    negative=20,       # Number of noise words for negative sampling
    min_count=1,       # Ensure even rare movies are included in the vocabulary
    workers=4         # Use parallel processing for training
)

movie_id = '1'

try:
    similar_movies = model.wv.most_similar(positive=movie_id, topn=5)

    print(f"--- Similar movies for Movie ID: {movie_id} ---")
    for movie in similar_movies:
        print(f"Movie ID: {movie[0]}, Similarity Score: {movie[1]:.4f}")

except KeyError:
    # Handle cases where the movie ID might not be in the training data
    print(f"Movie ID {movie_id} not found. Try another ID such as {movie_watch_lists_str[0][0]}")

--- Similar movies for Movie ID: 1 ---
Movie ID: 16, Similarity Score: 0.9965
Movie ID: 32, Similarity Score: 0.9944
Movie ID: 34, Similarity Score: 0.9912
Movie ID: 10, Similarity Score: 0.9907
Movie ID: 36, Similarity Score: 0.9893
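
The scores above are cosine similarities between embedding vectors, which is the measure `most_similar` reports. As a rough illustration of what that number means, the sketch below computes it by hand for made-up two-dimensional vectors; the function name and the vectors are hypothetical, and the trained model's vectors have 32 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated directions

print(round(same_direction, 4), round(orthogonal, 4))
```

A score close to 1 means two vectors point in nearly the same direction, regardless of their lengths.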


In [38]:
print(movies_df.loc['1'])

title                                Toy Story (1995)
genres    Adventure|Animation|Children|Comedy|Fantasy
Name: 1, dtype: object


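In [39]:
def print_movie_recommendations(movie_id):
    try:
        # The model was trained on string tokens, so look up the ID as a string
        mid = str(movie_id)

        # most_similar returns (movie_id, score) pairs; keep only the IDs
        similar_ids = [item[0] for item in model.wv.most_similar(positive=mid, topn=5)]

        # Return the metadata rows for the recommended movies
        return movies_df.loc[similar_ids]
    except KeyError:
        return f"ID {movie_id} not found. Make sure you trained the model with strings."

print_movie_recommendations('1')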

                                             title                     genres
movieId                                                                      
16                                   Casino (1995)                Crime|Drama
32       Twelve Monkeys (a.k.a. 12 Monkeys) (1995)    Mystery|Sci-Fi|Thriller
34                                     Babe (1995)             Children|Drama
10                                GoldenEye (1995)  Action|Adventure|Thriller
36                         Dead Man Walking (1995)                Crime|Drama


In [40]:
print("--- Recommendations for Toy Story (ID: 1) ---")
print(print_movie_recommendations('1'))

--- Recommendations for Toy Story (ID: 1) ---
                                             title                     genres
movieId                                                                      
16                                   Casino (1995)                Crime|Drama
32       Twelve Monkeys (a.k.a. 12 Monkeys) (1995)    Mystery|Sci-Fi|Thriller
34                                     Babe (1995)             Children|Drama
10                                GoldenEye (1995)  Action|Adventure|Thriller
36                         Dead Man Walking (1995)                Crime|Drama


In [42]:
def get_movie_recommendations(movie_id):
    try:
        mid = str(movie_id)
        original = movies_df.loc[mid]
        print(f"--- Recommendations for: {original['title']} ({original['genres']}) ---")

        similar = model.wv.most_similar(positive=mid, topn=5)
        similar_ids = [item[0] for item in similar]

        return movies_df.loc[similar_ids]
    except KeyError:
        return "ID not found in movies dataset."

print(get_movie_recommendations('1'))

--- Recommendations for: Toy Story (1995) (Adventure|Animation|Children|Comedy|Fantasy) ---
                                             title                     genres
movieId                                                                      
16                                   Casino (1995)                Crime|Drama
32       Twelve Monkeys (a.k.a. 12 Monkeys) (1995)    Mystery|Sci-Fi|Thriller
34                                     Babe (1995)             Children|Drama
10                                GoldenEye (1995)  Action|Adventure|Thriller
36                         Dead Man Walking (1995)                Crime|Drama
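
A recommender is often given several seed items rather than one. The sketch below shows one simple way that could work: normalize the seed vectors, average them into a single query vector, and rank the remaining items by cosine similarity. Everything here is hypothetical (the toy vectors, the IDs, and the helper names), and it only approximates what a library would do. With the trained model, `model.wv.most_similar` also accepts a list of IDs for `positive`, and its scores may differ from this sketch.

```python
import math

# Made-up 2-D vectors standing in for trained movie embeddings
toy_vectors = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.7, 0.7],
}

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def recommend_from_seeds(seed_ids, vectors, topn=2):
    """Rank non-seed items by cosine similarity to the mean of the seed vectors."""
    seeds = [unit(vectors[s]) for s in seed_ids]
    dims = len(seeds[0])
    query = unit([sum(v[d] for v in seeds) / len(seeds) for d in range(dims)])
    scored = []
    for item_id, vec in vectors.items():
        if item_id in seed_ids:
            continue  # do not recommend the seeds themselves
        score = sum(q * x for q, x in zip(query, unit(vec)))
        scored.append((item_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

recommendations = recommend_from_seeds(["A", "B"], toy_vectors)
print([item_id for item_id, _ in recommendations])
```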
