# COMM7380 Recommender Systems for Digital Media

In [None]:
# Install required packages using pip package manager in the current Jupyter kernel
import sys
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install gensim

# Importing our dataset

The dataset includes the descriptions of famous movies from [imdb.com](https://www.imdb.com/)

In [None]:
import pandas as pd 
import numpy as np

In [None]:
movies = pd.read_csv('../data/' + 'imdb_movie_description.csv')

In [None]:
# Take a glance at the head 
movies.head(5)

In [None]:
movies.shape # How many?

The `id` column duplicates the index, so it can be dropped safely.

In [None]:
movies.drop(columns=['id'], inplace=True)

In [None]:
movies.columns

# Text processing pipeline

Import the Natural Language toolkit (`nltk`) library. 

In [None]:
import nltk
import string

# The stop-word list used below is separate NLTK data; download it once
nltk.download('stopwords')

Define a function to take care of text preprocessing (cleaning, tokenizing and stemming)

In [None]:
def tokenize(text):
    # Remove punctuation
    tokenizer = nltk.RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    
    # Remove Stop Words (all lowercase in corpora)
    stop_words = set(nltk.corpus.stopwords.words('english')) 
    filteredTokens = [w for w in tokens if not w.lower() in stop_words]
    
    # Create stemmed tokens using the Porter stemmer
    # Also convert to lowercase
    stemmer = nltk.PorterStemmer()
    stems = []
    for item in filteredTokens:
        stems.append(stemmer.stem(item))
    return stems
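
To see what the first two steps of `tokenize` do to a sentence, the sketch below reproduces them with the standard library only. It assumes that `RegexpTokenizer(r'\w+')` behaves like `re.findall` with the same pattern, and it swaps NLTK's English stop-word list for a tiny hand-written set (`TINY_STOP_WORDS`, a hypothetical name). Stemming is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list)
TINY_STOP_WORDS = {"a", "the", "is", "it", "s", "of", "and"}

def simple_tokenize(text):
    # \w+ keeps runs of letters, digits and underscores, so punctuation is dropped
    tokens = re.findall(r"\w+", text)
    # Compare in lowercase, but keep the original casing of the surviving tokens
    return [t for t in tokens if t.lower() not in TINY_STOP_WORDS]

print(simple_tokenize("It's a sci-fi movie, and the hero is Neo!"))
```

Note that apostrophes and hyphens act as separators here, so "sci-fi" becomes the two tokens `sci` and `fi`.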

Select a part of the dataset (optional: the cells below use the full `movies` DataFrame)

In [None]:
selectedMovies = movies.head(500)
#selectedMovies

Create tokenized descriptions of the movies

In [None]:
tokenDesc = []
for index, movie in movies.iterrows():
    tokenDesc.append(tokenize(movie['description']))

The result is a list of token lists, one per movie

In [None]:
tokenDesc[0]

## Exercise

1. Create a function `add_genre` that will add the movie genres to the tokens in the form `<genreId>genre` (e.g. `18genre`)

# Topic modeling using LDA

We need to generate a corpus for training our model. 
We will use the `gensim` library to create
- a dictionary 
- and a corpus of bags of words.

In [None]:
from gensim import corpora, models, similarities

First generate a dictionary from our token list

In [None]:
dictionary = corpora.Dictionary(tokenDesc)

Then, generate a corpus containing a Bag of Words for each description.

In [None]:
corpus = [dictionary.doc2bow(text) for text in tokenDesc]
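
As an illustration of what a bag of words is, the sketch below builds one by hand with the standard library. It is a simplified stand-in, not gensim's implementation: the names `token2id` and `to_bow` and the two toy documents are assumptions, and gensim may assign token ids in a different order.

```python
from collections import Counter

# Two toy tokenized documents (hypothetical data)
docs = [["space", "war", "space"], ["love", "war"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count occurrences, then express them as sorted (token_id, count) pairs
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```

Each document becomes a sparse vector: only the tokens that occur in it are listed, and word order is discarded.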

## Generate an LDA model

The number of topics (10 in our example) is one of the many parameters that can be tuned.
We are going to use the default values for the other parameters. For a full list, please refer to the [gensim LDA manual page](https://radimrehurek.com/gensim/models/ldamodel.html).

Note: changing parameters when your recommender is "live" requires recomputing all the previously computed topics.

Performance-wise, training this model can be time- and memory-consuming. Since the approach is model-based, it's advisable to persist the model to disk once you find a good number of topics (and possibly good values for other parameters). You can save it with `lda_model.save(filename)` and load it back with `models.ldamodel.LdaModel.load(filename)`.

In [None]:
n_topics = 10

lda_model = models.ldamodel.LdaModel(corpus=corpus,
                                     id2word=dictionary,
                                     num_topics=n_topics)

Let's see which topics have been discovered. The output can be confusing to analyse.

Each tuple starts with the topic number, followed by a list of terms and their probabilities.

In [None]:
lda_model.show_topics()

Topics are a machine interpretation of text. This doesn't mean that they are human-interpretable.

Some sanity checks are needed to see whether the topics "make sense" (e.g. the model doesn't categorise Terminator as Romance... even if...).

### Topic distribution of a document

Let's feed a new document to our model to extract its topic distribution.

In [None]:
# Get from dataset
newdoc = movies['description'].iloc[1001]

# Tokenize the text
tokensNewdoc = tokenize(newdoc)
print(tokensNewdoc)

# Convert to Bag of Words
newCorpus = dictionary.doc2bow(tokensNewdoc)

# Extract the topic of the new document with the model
sim = lda_model[newCorpus]

The result is a list of (topic number, probability) pairs: the document's distribution over the topics in the model

In [None]:
print(sim)
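
A common next step is to pick the topic with the highest probability. The sketch below assumes the output is a list of `(topic_id, probability)` pairs. The values in `example_topics` are made up, and `dominant_topic` is a hypothetical helper.

```python
# Hypothetical output in the shape of lda_model[bow]: (topic_id, probability) pairs
example_topics = [(0, 0.05), (3, 0.62), (7, 0.33)]

def dominant_topic(topic_probs):
    # Pick the pair with the highest probability
    return max(topic_probs, key=lambda pair: pair[1])

topic_id, prob = dominant_topic(example_topics)
print(topic_id, prob)
```

By default gensim leaves out topics whose probability falls under a small threshold, so the list can be shorter than `n_topics`.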

## Content similarity using LDA topics

Compute cosine similarity between the documents. The result is a square similarity matrix.

In [None]:
indexLDA = similarities.MatrixSimilarity(lda_model[corpus])
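
For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes a small similarity matrix by hand. The `cosine` helper and the three toy topic distributions are assumptions for illustration, not what gensim runs internally.

```python
import math

def cosine(u, v):
    # u and v are sparse vectors stored as {index: weight} dicts
    dot = sum(w * v.get(i, 0.0) for i, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

# Hypothetical topic distributions for three documents
doc_topics = [{0: 0.9, 1: 0.1}, {0: 0.8, 1: 0.2}, {2: 1.0}]

# Square matrix: entry [i][j] is the similarity between documents i and j
sim_matrix = [[cosine(a, b) for b in doc_topics] for a in doc_topics]
for row in sim_matrix:
    print([round(x, 3) for x in row])
```

The diagonal is 1 (each document compared with itself), and documents that share no topic get 0.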

Print the first row

In [None]:
for item in indexLDA:
    print(item)
    break

Let's check how this differs from the cosine similarity computed directly on the bag-of-words vectors of the items

In [None]:
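# Pass the vocabulary size explicitly so gensim does not have to scan the corpus for it
indexCos = similarities.MatrixSimilarity(corpus, num_features=len(dictionary))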

In [None]:
for item in indexCos:
    print(item)
    break

# Recommending similar items

Let's recommend some content to our user.

To simplify the discussion, we skip the selection of the best movies: they were already selected elsewhere and saved to a list.

In [None]:
bestMoviesIDs = [11, 181808, 140607, 284053, 73338, 424, 27205, 354912, 332562]

bestMovies = movies[movies['movie_id'].isin(bestMoviesIDs)]
print(bestMovies['title'])
print(bestMovies.index)

## Find similar items

First define a function for returning similarity matrix rows

In [None]:
def search_item_by_id(simMatrix, itemId):
    '''
        Input: gensim similarity matrix and itemId (row position)
        Returns the row of similarities for that item if found
        Returns False otherwise
    '''
    
    # Default result, returned when itemId is beyond the last row
    itemFound = False
    
    # Walk the rows until we reach the requested position
    for count, item in enumerate(simMatrix):
        if count == itemId:
            itemFound = item
            break
    
    return itemFound

Then we define a function for recommending items based on content

In [None]:
def content_rec_items(df, itemList, minSim, maxSim, maxItems):
    
    # Create list of indexes
    count = 0
    indexList = []
    for item in itemList:
        if item < maxSim:
            if item > minSim:
                indexList.append(count)
        count +=1
    
    items = df.iloc[indexList]
    
    if maxItems > 0:
        items = items.head(maxItems) # Can be optimised?
        
    #print(items['title'])
    return items
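
Note that `content_rec_items` keeps the first `maxItems` rows, in dataset order, that fall inside the similarity window. These are not necessarily the most similar ones. One possible answer to the "Can be optimised?" comment is to rank the candidates by similarity first. The sketch below does this with `heapq.nlargest`. The helper `top_similar_positions` and the sample row are hypothetical.

```python
import heapq

def top_similar_positions(sim_row, min_sim, max_sim, max_items):
    # Keep positions whose similarity lies strictly inside (min_sim, max_sim)
    candidates = [(s, pos) for pos, s in enumerate(sim_row) if min_sim < s < max_sim]
    # Take the max_items highest similarities, best first (assumes max_items > 0)
    best = heapq.nlargest(max_items, candidates)
    return [pos for s, pos in best]

# Hypothetical similarity row; position 0 is the query item itself
sim_row = [1.0, 0.25, 0.05, 0.9, 0.6, 0.95]
print(top_similar_positions(sim_row, 0.1, 1, 3))
```

The returned positions could then be passed to `df.iloc[...]`, as in the function above.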

Compute the recommendations for our user, based on how the descriptions of the movies they liked compare to the descriptions of other items.

Use some parameters to filter the results.

In [None]:
minSim, maxSim, maxItems = 0.1, 1, 3
sim = indexLDA

for item in bestMovies.index:
    similarItems = search_item_by_id(sim, item)
    #print(similarItems)
    recItems = content_rec_items(movies, similarItems, minSim, maxSim, maxItems)
    print(recItems['title'])

Let's check the same using cosine similarity between the word vectors

In [None]:
minSim, maxSim, maxItems = 0.1, 1, 3
sim = indexCos

for item in bestMovies.index:
    similarItems = search_item_by_id(sim, item)
    #print(similarItems)
    recItems = content_rec_items(movies, similarItems, minSim, maxSim, maxItems)
    print(recItems['title'])

### Exercises

1. When using the LDA similarity matrix, print the starting movie name before the recommendations (if needed, clean up the printed list)
1. Tune the recommendation parameters to see if you can achieve better results than cosine similarity
1. Tune the number of topics extracted by the LDA model. Can you find a number of topics that makes more sense when evaluating the recommendations?

- Course Instructor: Dr. Paolo Mengoni (Visiting Scholar, School of Communication, Hong Kong Baptist University) 
  - pmengoni@hkbu.edu.hk

- The code in this notebook takes inspiration from various sources. All code is for educational purposes only and released under the CC1.0.