# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"  # allows multiple outputs from a cell
import numpy as np, pandas as pd, seaborn as sns, matplotlib.pyplot as plt, scipy.spatial.distance  # binds `scipy` and loads the submodule used for cosine distance
import re, json                    # JavaScript Object Notation (JSON) is a string of key-value pairs
from sentence_transformers import SentenceTransformer

pd.set_option('display.max_rows', 20, 'display.max_columns', 15, 'display.max_colwidth', 1000, 'display.precision', 2)  # full option names avoid ambiguous-key errors in newer pandas

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to prepare, encode and query the movie database in search of movies that are semantically similar to the query phrase.

## **JSON Strings**

A JavaScript Object Notation ([JSON](https://www.json.org/json-en.html)) string is a dictionary-like object (written out as a string) containing key-value pairs (unique keys and the corresponding non-unique values). The JSON object is an exceptionally popular choice for storing and sharing data due to its transparency, safety and relative ease of use. The cost is speed of processing (compared to binary formats). The following JSON string has a key `"id"` mapped to the value `28` and key `"name"` mapped to the value `"Action"`.

     '{"id": 28, "name": "Action"}'
     
Note that the whole value is a string. Inside it, the double quotes mark both keys and the value `"Action"` as strings, while the unquoted value `28` is interpreted as a number. To convert this JSON string to a Python dictionary, you can use the standard-library [*`json`*](https://docs.python.org/3/library/json.html) module, which parses and manipulates JSON strings.
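
As a minimal sketch of that conversion, the standard-library `json.loads()` call below parses the example string into a Python dictionary, and `json.dumps()` converts it back. The variable names are illustrative only.

```python
import json

sJSON = '{"id": 28, "name": "Action"}'   # the JSON string from the example above
dGenre = json.loads(sJSON)               # parse the string into a Python dict

print(type(dGenre).__name__)   # the parsed object is a dict
print(dGenre["id"] + 1)        # 28 was parsed as an integer, so arithmetic works
print(dGenre["name"].upper())  # "Action" was parsed as a string

# json.dumps() goes the other way: dict -> JSON string
print(json.dumps(dGenre))
```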

## **The Movie Database (TMDB)**

[TMDB](https://www.themoviedb.org/) data comes in a tabular data file (saved as a compressed text file `movies.zip`) containing one row per movie, with columns holding numeric and textual movie attributes, such as original title, movie budget, relevant genres, etc. Each movie has one original title, one id, and one homepage URL, but can have multiple genres or multiple spoken languages. Such multi-valued attributes are stored as a list of JSON objects.

Below is a sample of two movies with their attributes presented in transposed fashion (with rows/columns flipped). Each field describes the given movie either semantically or structurally or both. You will try to use some textual values to build a meaningful vector representation for each film for the purpose of a better search engine.

In [None]:
pd.set_option('display.max_rows', 20, 'display.max_columns', 100, 'display.max_colwidth', 1000, 'display.precision', 2)
df = pd.read_csv('movies.zip').fillna('').set_index('original_title')
print(f'df.shape = {df.shape}')
df[:2].T

## **Parsing Movie Genres**

Note how the **genres** field contains a string formatted as a list of dictionary-like values (i.e., JSON objects). Thus, the movie *Avatar* has the following four genres packaged as four JSON objects:

    '[{"id": 28,  "name": "Action"}, 
      {"id": 12,  "name": "Adventure"}, 
      {"id": 14,  "name": "Fantasy"}, 
      {"id": 878, "name": "Science Fiction"}]'

The following user-defined function (UDF) `JSON_Values()` takes one string (a list-like sequence of JSON objects) and the desired key (`'name'`, in this case) and retrieves (using [regex](https://docs.python.org/3/library/re.html)) all values associated with this key. Thus, if the UDF is applied to each value of the column `genres`, it returns the genres of each movie.

In [None]:
def JSON_Values(sJSONs, sKey='name', asString=True, sep=', '):
    # Convert: '[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Español"}]' --->>> ['English', 'Español']
    sJSONs = re.sub(r'[\[\]]', '', sJSONs)   # remove square brackets from the string (raw string avoids invalid-escape warnings)
    LsJSONs = re.sub('}, {', '}|{', sJSONs).split('|')   # replace the comma between JSON objects with a pipe, then split on the pipe
    try:    LsValues = [json.loads(s)[sKey] for s in LsJSONs]   # parse each JSON object and pull the value for sKey
    except (ValueError, KeyError, TypeError): LsValues = []     # in case of a parsing or lookup error, use an empty list
    return sep.join(map(str, LsValues)) if asString else LsValues

print(df.genres.iloc[0])                              # original JSON records as a string (positional access via .iloc)
print(JSON_Values(df.genres.iloc[0], asString=True))  # extracted values as a string
print(JSON_Values(df.genres.iloc[0], asString=False)) # extracted values as a list of strings
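
Because the whole `genres` value is itself valid JSON (an array of objects), it can also be parsed with a single `json.loads()` call instead of regex. The sketch below is an alternative under that assumption, not the function used in the video. The name `json_values_alt` is hypothetical, and unlike `JSON_Values()` it skips records that lack the key rather than returning an empty list.

```python
import json

def json_values_alt(sJSONs, sKey='name'):
    """Return the values stored under sKey in a JSON array of objects."""
    try:
        records = json.loads(sJSONs)          # '[{...}, {...}]' -> list of dicts
    except (json.JSONDecodeError, TypeError):
        return []                             # empty, malformed, or non-string input
    if not isinstance(records, list):
        return []                             # valid JSON, but not an array
    return [d[sKey] for d in records if isinstance(d, dict) and sKey in d]

sGenres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(json_values_alt(sGenres))   # list of the two genre names
print(json_values_alt(''))        # empty input gives an empty list
```

One practical difference: values that happen to contain a pipe character or the sequence `}, {` do not break this version, because no string splitting is involved.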

## **Building More Complete Movie Descriptions**

You will now preprocess the data. Several textual fields carrying semantically relevant text are concatenated (separated by spaces) to produce a more complete textual description of each movie, containing its title, tagline, overview, keywords, genres, etc. All JSON strings are parsed into comma-separated strings using the helper `ToStr()`, which applies `JSON_Values()` to every value in a column. The goal is to pull together as much relevant text as possible in hopes that full sentences, phrases and even individual words will yield "higher quality" descriptive embeddings (i.e., description vectors).

In [None]:
pd.set_option('display.max_rows', 5)
ToStr = lambda pdSeries: ' ' + pdSeries.apply(JSON_Values)
dfMovAll = (df.title  + ' ' + df.tagline + ' ' + df.overview + ToStr(df.keywords) + ToStr(df.genres) + ToStr(df.production_countries)).to_frame().rename(columns={0:'Desc'})
dfMovAll
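
To make the concatenation concrete, here is a minimal pure-Python sketch for a single made-up movie record. The record contents and the helpers `names` and `build_description` are hypothetical, and missing fields are represented as empty strings, mirroring the `fillna('')` call used when loading the data.

```python
import json

def names(sJSONs):
    """Join the 'name' values of a JSON array of objects with ', '."""
    return ', '.join(d['name'] for d in json.loads(sJSONs or '[]'))

def build_description(movie):
    """Concatenate the plain-text and JSON-derived fields with single spaces."""
    parts = [movie['title'], movie['tagline'], movie['overview'],
             names(movie['keywords']), names(movie['genres'])]
    return ' '.join(parts)

movie = {  # made-up record for illustration only
    'title': 'Moon Quest',
    'tagline': '',
    'overview': 'A crew travels to the moon.',
    'keywords': '[{"id": 1, "name": "space"}, {"id": 2, "name": "moon"}]',
    'genres': '[{"id": 878, "name": "Science Fiction"}]',
}
print(build_description(movie))
```

An empty field simply leaves an extra space in the result, which is also true of the pandas expression in the cell above.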

Embedding all ~4803 movies takes about 10 minutes on a central processing unit (CPU), so you will work with a smaller subset of movies. In contrast, embedding takes about 10 seconds on a graphics processing unit (GPU), but this notebook uses a CPU only.

In [None]:
dfMov = dfMovAll[dfMovAll.Desc.str.contains('Fantasy')][:100]  # select a subset of movies
dfMov.shape

## **Encoding Movie Descriptions**

After preprocessing, you are ready to apply the [SBERT](https://www.sbert.net/) language model to embed each description into a numeric vector of the same fixed length.

<strong>Note:</strong> In the previous video, Professor Melnikov uses the `paraphrase-distilroberta-base-v1` (330 MB) model. In this activity, you will use a smaller model, `paraphrase-albert-small-v2` (~50 MB), which still encodes text of any length into a 768-dimensional vector.

In [None]:
%time SBERT = SentenceTransformer('paraphrase-albert-small-v2')  # load a pre-trained language model

Next, you will derive a 768-dimensional movie vector for each of the selected movies. These vectors (as rows) will be packed together into a matrix and wrapped into the following dataframe `dfEmb`, with movie titles as row indices.

In [None]:
%time mEmb = SBERT.encode(dfMov.Desc.tolist())  # embedding movie descriptions as vectors in a 768D vector space
dfEmb = pd.DataFrame(mEmb, index=dfMov.index)
dfEmb

## **Finding a Similar Movie**

These vectors can now be used just like any other numerical vector in a vector space. That means you can compute the proximity (i.e. similarity or distance) between any pair of movie vectors. Also, you can encode any search phrase, say `sQuery`, and use its vector to sort movies by relevance. 

In [None]:
CosSim = lambda u, v: 1 - scipy.spatial.distance.cosine(u,v)
sQuery = 'moon'
vQuery = SBERT.encode(sQuery)
dfCS = dfEmb.apply(lambda v: CosSim(vQuery, v), axis=1).to_frame().rename(columns={0:'CosSim'}) # Cosine similarities in original order
dfCSDecr = dfCS.sort_values('CosSim', ascending=False)    # Cosine similarities in decreasing order
dfCSDecr.T
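
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, `dot(u, v) / (|u| * |v|)`, and `scipy.spatial.distance.cosine` returns one minus this value. The standard-library sketch below spells out the arithmetic on tiny made-up vectors. The helper `cos_sim` is hypothetical and not part of the notebook.

```python
import math

def cos_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cos_sim([1, 0], [2, 0]))    # same direction
print(cos_sim([1, 0], [0, 3]))    # perpendicular
print(cos_sim([1, 0], [-1, 0]))   # opposite direction
```

The result depends only on the angle between the vectors, not on their lengths, which is why `[1, 0]` and `[2, 0]` count as identical in direction.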

You can also draw a bar plot of the top 10 movies, followed by every hundredth movie after that, showing each movie's cosine similarity to the query phrase. (With the 100-movie subset used here, only one movie beyond the top 10 is shown.) Most of the movies show some positive cosine similarity, but you might be interested in, say, only the top 10 movie recommendations for the user making this search.

In [None]:
RowIx = list(range(10)) + list(range(10, len(dfCS), 100)) # draw top 10 and then every 100th movie
dfCSDecr.iloc[RowIx].plot(grid=True, figsize=(25,4), title=f'CosSim for movies similar to "{sQuery}"', kind='bar');
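
Keeping only the top few results does not require sorting the full list. As a sketch, the standard-library `heapq.nlargest` and `heapq.nsmallest` pick the best or worst `k` entries from a score table. The titles and scores below are made up for illustration.

```python
import heapq

# made-up cosine similarities, keyed by hypothetical movie titles
dScores = {'Movie A': 0.41, 'Movie B': 0.12, 'Movie C': 0.33, 'Movie D': -0.05}

top2 = heapq.nlargest(2, dScores.items(), key=lambda kv: kv[1])
bottom2 = heapq.nsmallest(2, dScores.items(), key=lambda kv: kv[1])

print(top2)     # highest similarity first
print(bottom2)  # lowest similarity first
```

On a pandas Series or DataFrame, the `nlargest()` and `nsmallest()` methods play the same role.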

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice applying SBERT to a movie search.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell to see whether you've gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1

Find the top five movies related to the query *'a whole new world'*.

<b>Hint:</b> You can reuse the code above, which takes <code>sQuery</code> and returns all ranked movies, but keep only the top few from that list.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
sQuery = 'a whole new world'
vQuery = SBERT.encode(sQuery)
dfCS = dfEmb.apply(lambda v: CosSim(vQuery, v), axis=1).to_frame().rename(columns={0:'CosSim'}) # Cosine similarities in original order
dfCSDecr = dfCS.sort_values('CosSim', ascending=False)    # Cosine similarities in decreasing order
dfCSDecr[:5].T
</pre>
</details> 
</font>

<hr>

## Task 2

Find the bottom five movies (i.e., the five least related) for the query *'Battles of aliens'*.

<b>Hint:</b> You can reuse the code above, which takes <code>sQuery</code> and returns all ranked movies, but keep only the bottom few from that list.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
sQuery = 'Battles of aliens'
vQuery = SBERT.encode(sQuery)
dfCS = dfEmb.apply(lambda v: CosSim(vQuery, v), axis=1).to_frame().rename(columns={0:'CosSim'}) # Cosine similarities in original order
dfCSDecr = dfCS.sort_values('CosSim', ascending=False)    # Cosine similarities in decreasing order
dfCSDecr[-5:].T
</pre>
</details> 
</font>

<hr>