In [None]:
# !git clone <repo>
# !pip install <packages>

In [2]:
# imports
import numpy as np
import pandas as pd

# Information Retrieval Workshop (Rice Datathon 2023)

In this notebook, we will build a simple movie search engine.

Using the metadata we have available, we want to retrieve the movies most relevant to a user's query.

We will build progressively more complex methods for finding which movies are relevant.


## Data
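First, let's look at the data. Each row describes one movie: release date, title, overview, genres, characters, cast, tags, popularity, and vote statistics.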

In [18]:
data = pd.read_csv('movies_db.csv', converters={'release_date':str, 'title':str, 'overview':str, 'genres':str, 'characters':str, 'cast':str, 'genome_tags_gte_085':str})
data

Unnamed: 0,release_date,title,overview,genres,characters,cast,genome_tags_gte_085,popularity,vote_average,vote_count
0,1995-10-30,Toy Story,"Led by Woody, Andy's toys live happily in his ...","adventure, animation, children, comedy, fantasy","Woody, Buzz Lightyear, Mr. Potato Head, Slinky...","Tom Hanks, Tim Allen, Don Rickles, Jim Varney,...","childhood, animation, imdb top 250, nostalgic,...",21.946943,7.7,5415.0
1,1995-12-15,Jumanji,When siblings Judy and Peter discover an encha...,"adventure, children, fantasy","Alan Parrish, Samuel Alan Parrish / Van Pelt, ...","Robin Williams, Jonathan Hyde, Kirsten Dunst, ...","fun movie, animals, fantasy world, fantasy, ju...",17.015539,6.9,2413.0
2,1995-12-22,Grumpier Old Men,A family wedding reignites the ancient feud be...,"comedy, romance","Max Goldman, John Gustafson, Ariel Gustafson, ...","Walter Matthau, Jack Lemmon, Ann-Margret, Soph...","good sequel, sequels, comedy, sequel",11.712900,6.5,92.0
3,1995-12-22,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...","comedy, drama, romance","Savannah 'Vannah' Jackson, Bernadine 'Bernie' ...","Whitney Houston, Angela Bassett, Loretta Devin...","women, chick flick",3.859495,6.1,34.0
4,1995-02-10,Father of the Bride Part II,Just when George Banks has recovered from his ...,comedy,"George Banks, Nina Banks, Franck Eggelhoffer, ...","Steve Martin, Diane Keaton, Martin Short, Kimb...","pregnancy, good sequel, sequel, sequels, famil...",8.387519,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...
10358,2017-08-03,Wind River,An FBI agent teams with the town's veteran gam...,"action, crime, mystery, thriller","Cory Lambert, Jane Banner, Martin Hanson, Nata...","Jeremy Renner, Elizabeth Olsen, Gil Birmingham...","native americans, suspense, violence, tense, g...",40.796775,7.4,181.0
10359,2017-07-13,Shot Caller,A newly-released prison gangster is forced by ...,"action, crime, drama, thriller","Jacob / Money, Frank 'Shotgun', Kate, Kutcher,...","Nikolaj Coster-Waldau, Jon Bernthal, Lake Bell...","drama, brutal, prison, great acting, gangsters...",15.786854,6.9,324.0
10360,2017-07-21,Girls Trip,Four girlfriends take a trip to New Orleans fo...,comedy,"Ryan Pierce, Sasha Franklin, Lisa Cooper, Dina...","Regina Hall, Queen Latifah, Jada Pinkett Smith...","women, funny, comedy",37.964872,7.1,393.0
10361,2017-07-28,Detroit,A police raid in Detroit in 1967 results in on...,"crime, drama, thriller","Melvin Dismukes, Philip Krauss, Larry Reed, Ca...","John Boyega, Will Poulter, Algee Smith, Jason ...","brutality, segregation, history, race issues",9.797505,7.3,67.0
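
The `converters` argument in the `read_csv` call above reads the text columns as plain strings. A minimal sketch of why that matters, using a made-up two-column CSV rather than `movies_db.csv`: without a converter, an empty cell becomes `NaN`, and calling `.split()` on it would fail; with `str` as the converter, it stays an empty string.

```python
import io

import pandas as pd

# Hypothetical CSV with an empty overview cell (not taken from movies_db.csv).
csv_text = "title,overview\nToy Story,\n"

default = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), converters={'overview': str})

print(default.loc[0, 'overview'])        # missing value (NaN)
print(repr(as_text.loc[0, 'overview']))  # empty string
print(as_text.loc[0, 'overview'].split())  # [] - safe to tokenize
```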


## A Few Simple Methods

### 1 - Retrieve Movies by Overlap with Movie Title
We start by retrieving every movie whose title shares at least one word with the query. The score is the number of shared words. Matching is exact and case-sensitive, and words are split on whitespace only.

In [19]:

# title overlap
def scoring_func(row, query):
    # score = number of distinct words shared by the title and the query
    title_terms = set(row.loc['title'].split())
    query_terms = set(query.split())
    return len(title_terms & query_terms)


def retrieve_results(movie_db, query):
    scores = movie_db.apply(scoring_func, axis=1, query=query)  # one score per movie
    scores = scores.sort_values(ascending=False)  # sort scores, best first
    scores = scores[scores > 0.0]  # remove zero scores
    retrieved = movie_db.loc[scores.index].copy()  # scored results only; copy so movie_db is untouched
    retrieved['scores'] = scores  # add the scores back
    retrieved.reset_index(inplace=True)
    return retrieved


print(retrieve_results(data, 'Lion King')[['title', 'scores']])


                                            title  scores
0                  The Lion King 2: Simba's Pride       2
1                                   The Lion King       2
2                                The Lion King 1½       2
3                              Lion of the Desert       1
4   The Librarian: Return to King Solomon's Mines       1
5                                       King Kong       1
6                              The King of Comedy       1
7                            King Solomon's Mines       1
8                                 King Kong Lives       1
9        Davy Crockett, King of the Wild Frontier       1
10                                      King Corn       1
11                                   Joe the King       1
12                      The Last King of Scotland       1
13                           King Solomon's Mines       1
14                          The Wind and the Lion       1
15                               King of New York       1
16            
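
Because the overlap above splits on whitespace and compares words exactly, a query like `lion king` (lowercase) would match nothing, and a title word such as `2:` never matches `2`. A minimal sketch of one possible fix follows; the `tokenize` helper and its regular expression are assumptions for illustration, not part of the workshop code.

```python
import re


def tokenize(text):
    # Assumed rule: lowercase, then keep runs of letters, digits, or apostrophes.
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def title_overlap(title, query):
    # Number of distinct normalized words shared by the title and the query.
    return len(tokenize(title) & tokenize(query))


# Exact, case-sensitive matching finds no shared words here:
print(len(set("The Lion King".split()) & set("lion king".split())))     # 0
# Normalized matching recovers them:
print(title_overlap("The Lion King", "lion king"))                      # 2
print(title_overlap("The Lion King 2: Simba's Pride", "lion king 2"))   # 3
```

Swapping a function like this into `scoring_func` would change the rankings shown in this notebook, so treat it as an optional refinement.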

### 2 - Break Ties by Accounting for Movie Popularity
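Title overlap alone gives many movies exactly the same score, with no way to break those ties.
So, we also include popularity as a factor in our scoring, assuming that the more popular movie is more often the one being searched for. We multiply the overlap score by `log(1 + popularity)`, so the effect of popularity grows slowly. Because the two factors are multiplied, popularity is more than a pure tiebreaker: a sufficiently popular movie with less overlap could outrank an obscure one with more.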

In [20]:

# popularity tiebreaker
def scoring_func(row, query):
    title_overlap_score = len(set(row.loc['title'].split()).intersection(set(query.split())))
    popularity_score = row.loc['popularity']
    return (title_overlap_score * 100) * np.log1p(popularity_score)


def retrieve_results(movie_db, query):
    scores = movie_db.apply(scoring_func, axis=1, **{'query': query})
    scores.sort_values(ascending=False, inplace=True)  # sort scores
    scores = scores[scores > 0.0]  # remove zero scores
    retrieved = movie_db.loc[scores.index]  # get only scored results
    retrieved['scores'] = scores  # add the scores back
    retrieved.reset_index(inplace=True)
    return retrieved


print(retrieve_results(data, 'Lion King')[['title', 'scores', 'popularity']])

                                            title      scores  popularity
0                                   The Lion King  623.640957   21.605761
1                  The Lion King 2: Simba's Pride  468.692828    9.417261
2                                The Lion King 1½  459.132900    8.931033
3                King Arthur: Legend of the Sword  381.223292   44.251369
4   The Lord of the Rings: The Return of the King  341.195128   29.324358
5                                       King Kong  303.308413   19.761164
6                               Anna and the King  292.585473   17.650160
7                                            Lion  289.509088   17.085145
8                              King of California  264.538193   13.088825
9   The Librarian: Return to King Solomon's Mines  260.224588   12.494010
10                               King of New York  250.389448   11.230031
11                                The Fisher King  250.015521   11.184385
12                              The Sc
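
As a sanity check, the scores in the output above can be reproduced by hand from the overlap count and the popularity column. The sketch below reuses the formula from the cell above; the helper name `popularity_weighted` is made up for illustration, and the popularity values are copied from the printed table.

```python
import numpy as np


def popularity_weighted(title_overlap, popularity):
    # Same formula as scoring_func above: overlap * 100 * log(1 + popularity)
    return title_overlap * 100 * np.log1p(popularity)


print(popularity_weighted(2, 21.605761))  # The Lion King, about 623.64
print(popularity_weighted(2, 9.417261))   # The Lion King 2: Simba's Pride, about 468.69
print(popularity_weighted(1, 44.251369))  # King Arthur: Legend of the Sword, about 381.22

# Popularity a one-word match would need to beat a two-word match with
# popularity p: log1p(q) > 2 * log1p(p), i.e. q > (1 + p)**2 - 1.
p = 8.931033  # The Lion King 1½
threshold = (1 + p) ** 2 - 1
print(threshold)
```

The last value shows that a single-word match would need a popularity in the high nineties to overtake "The Lion King 1½", which is why the three Lion King titles stay on top here.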

### 3 - Expand Context by Adding the Movie Overview
A common way to improve results is to pre- and/or post-process the queries and documents. For instance, we can fix spelling mistakes in the query or include additional information about each document. Here we add each movie's overview, which often mentions characters by name, so that we can better distinguish queries such as "Lion King Nala". An overview match counts for a tenth of a title match, and the popularity factor is capped at 2.1 so that it stops growing once a movie is moderately popular.

In [21]:

# add overview
def scoring_func(row, query):
    title_overlap_score = len(set(row.loc['title'].split()).intersection(set(query.split())))
    overview_overlap_score = len(set(row.loc['overview'].split()).intersection(set(query.split())))
    popularity_score = row.loc['popularity']
    return (title_overlap_score * 100 + overview_overlap_score * 10) * min(np.log1p(popularity_score), 2.1)


def retrieve_results(movie_db, query):
    scores = movie_db.apply(scoring_func, axis=1, **{'query': query})
    scores.sort_values(ascending=False, inplace=True)  # sort scores
    scores = scores[scores > 0.0]  # remove zero scores
    retrieved = movie_db.loc[scores.index]  # get only scored results
    retrieved['scores'] = scores  # add the scores back
    retrieved.reset_index(inplace=True)
    return retrieved


print(retrieve_results(data, 'Lion King Nala')[['title', 'scores', 'popularity']])

                                             title      scores  popularity
0                   The Lion King 2: Simba's Pride  441.000000    9.417261
1                                 The Lion King 1½  420.000000    8.931033
2                                    The Lion King  420.000000   21.605761
3    The Librarian: Return to King Solomon's Mines  231.000000   12.494010
4                  Aladdin and the King of Thieves  231.000000    8.654244
..                                             ...         ...         ...
134          The Little Mermaid: Ariel's Beginning    3.475577    0.415606
135                               The Wizard of Oz    3.439301    0.410480
136                        Kids of the Round Table    2.677918    0.307075
137                       The Man Who Saw Tomorrow    2.483652    0.281928
138       The Haunted World of Edward D. Wood, Jr.    0.367590    0.037443

[139 rows x 3 columns]
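
To see how the field weights and the cap interact, here is a small sketch that recomputes two of the scores above. The helper name `capped_score` is hypothetical, and the overlap counts passed in are inferred from the printed scores (441 = 210 × 2.1), not read from the data.

```python
import numpy as np


def capped_score(title_overlap, overview_overlap, popularity, cap=2.1):
    # Popularity factor as in the cell above: log(1 + popularity), capped at 2.1
    popularity_factor = min(np.log1p(popularity), cap)
    return (title_overlap * 100 + overview_overlap * 10) * popularity_factor


print(capped_score(2, 1, 9.417261))   # The Lion King 2: two title words, one overview word
print(capped_score(2, 0, 21.605761))  # The Lion King: two title words only
print(np.expm1(2.1))                  # popularity above which the cap applies
```

Any movie with popularity above roughly 7.17 gets the same factor of 2.1, so among the top results only the word overlap decides the order. That is why "The Lion King" no longer ranks first despite being the most popular of the three.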


### 4 - TFIDF
TFIDF is the common name for weighting each term by the product:
$$
\text{TF-IDF} = (\text{Term Frequency}) \times (\text{Inverse Document Frequency})
$$

*Document frequency* ($DF$) is the number of documents a term appears in, so a low $DF$ marks a distinctive term. If the $DF$ of a term is 1, then it appears in only a single document; therefore, if we see that term in the user's query, it is a dead giveaway of which document the user is looking for. To weigh our terms, we use the inverse, $IDF = \frac{1}{DF}$, since we want more distinctive terms, those with lower $DF$, to have greater weight. In the code below, term frequency is simplified to presence or absence: a term either appears in both the query and the field, or it does not.

In [23]:

# tfidf
def calculate_idf(movie_db):
    # document frequency: number of movies whose title or overview contains the term
    term_df = {}
    for item in movie_db.itertuples():
        # use a set so each term counts at most once per movie
        under_consideration = set(item.title.split()) | set(item.overview.split())

        for term in under_consideration:
            term_df[term] = term_df.get(term, 0) + 1

    # IDF = 1 / DF, so rarer terms get larger weights
    return {term: (1.0 / df) for term, df in term_df.items()}


def scoring_func(row, query, term_weights):
    title_overlap = set(row.loc['title'].split()).intersection(set(query.split()))
    title_overlap_score = sum([term_weights[t] for t in title_overlap])

    overview_overlap = set(str(row.loc['overview']).split()).intersection(set(query.split()))
    overview_overlap_score = sum([term_weights[t] for t in overview_overlap])

    popularity_score = row.loc['popularity']
    return (title_overlap_score * 100 + overview_overlap_score * 10) * np.log1p(popularity_score)


def retrieve_results(movie_db, query):
    scores = movie_db.apply(scoring_func, axis=1, **{'query': query, 'term_weights':calculate_idf(movie_db)})
    scores.sort_values(ascending=False, inplace=True)  # sort scores
    scores = scores[scores > 0.0]  # remove zero scores
    retrieved = movie_db.loc[scores.index]  # get only scored results
    retrieved['scores'] = scores  # add the scores back
    retrieved.reset_index(inplace=True)
    return retrieved


print(retrieve_results(data, 'Lion King Nala')[['title', 'scores', 'popularity']])

                                        title     scores  popularity
0              The Lion King 2: Simba's Pride  51.221986    9.417261
1                               The Lion King  36.973738   21.605761
2                                        Lion  32.167676   17.085145
3                            The Lion King 1½  27.220566    8.931033
4                       The Wind and the Lion  25.162123    8.627187
..                                        ...        ...         ...
134                           Change of Habit   0.038966    0.685640
135     The Little Mermaid: Ariel's Beginning   0.025937    0.415606
136                   Kids of the Round Table   0.019984    0.307075
137                  The Man Who Saw Tomorrow   0.018535    0.281928
138  The Haunted World of Edward D. Wood, Jr.   0.002743    0.037443

[139 rows x 3 columns]
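
The cell above only checks whether a term is present. A minimal sketch of the full TF × IDF product on a tiny corpus is shown below; the three documents are invented for illustration, and $IDF = \frac{1}{DF}$ follows this notebook's definition rather than the logarithmic variants found elsewhere.

```python
from collections import Counter

# Hypothetical three-document corpus (not taken from movies_db.csv).
docs = {
    "a": "the lion king",
    "b": "the king of comedy",
    "c": "the wind and the lion",
}

# Document frequency: in how many documents does each term appear?
df = Counter(term for text in docs.values() for term in set(text.split()))
idf = {term: 1.0 / count for term, count in df.items()}  # IDF = 1 / DF


def tfidf_score(text, query):
    tf = Counter(text.split())  # term frequency inside this document
    # Sum TF * IDF over the distinct query terms; unseen terms contribute 0.
    return sum(tf[t] * idf.get(t, 0.0) for t in set(query.split()))


for name, text in docs.items():
    print(name, round(tfidf_score(text, "the lion"), 3))
# a 0.833
# b 0.333
# c 1.167
```

Note that `idf` is computed once, outside the scoring function. Document "c" wins only because "the" occurs twice in it, which illustrates why raw term frequency is often dampened or why very common words are dropped.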


# Learning to Rank
As the previous examples show, there are many parameters to adjust by hand, such as the field weights of 100 and 10 and the popularity cap of 2.1. Rather than tuning them manually, we may want to learn from the data how best to rank our results.


Different methods of learning to rank:
- **Point-wise**: you learn to assign a *score* to each entry; sorting by these scores leads to a ranking.
- **Pair-wise**: you learn a *comparison function* between entries; this comparison function can be used to sort entries into a ranking.
- **List-wise**: you learn to output a *permutation*; that is your ranking.
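
To make the point-wise idea concrete, here is a minimal sketch that learns the field weights by least squares instead of setting them by hand. Everything in it is an assumption for illustration: the feature rows are invented, the score is a plain weighted sum (simpler than the multiplicative scoring used earlier), and the labels are generated from hidden weights so that the fit can be checked. Real labels would come from relevance judgments or click logs.

```python
import numpy as np

# Hypothetical training data: one row per (query, movie) pair.
# Columns: title overlap, overview overlap, log1p(popularity).
features = np.array([
    [2.0, 1.0, 2.3],
    [2.0, 0.0, 3.1],
    [1.0, 0.0, 3.8],
    [0.0, 1.0, 2.9],
    [1.0, 1.0, 1.2],
])

# Hypothetical relevance labels, generated from hidden weights for this demo.
hidden_weights = np.array([100.0, 10.0, 5.0])
labels = features @ hidden_weights

# Point-wise learning: fit a scoring function that predicts each label.
learned_weights, *_ = np.linalg.lstsq(features, labels, rcond=None)

# Ranking: sort entries by their learned score, best first.
scores = features @ learned_weights
ranking = np.argsort(-scores)

print(learned_weights.round(2))
print(ranking)
```

With noise-free labels the fit recovers the hidden weights, and sorting by the learned score reproduces the intended order. Pair-wise and list-wise methods change what is being predicted (a comparison or a whole ordering) but follow the same learn-then-sort pattern.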