Jina AI introduced `jina-embeddings-v2`, an open-source embedding model that supports an 8K (8192-token) context length, which puts it on par with OpenAI's proprietary model `text-embedding-ada-002`. Refer to [this announcement](https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/) for more details.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [1]:
# Put your "kaggle.json" file in a directory and set the "KAGGLE_CONFIG_DIR" env var to that directory's path.
import os

os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/Colab Notebooks/kaggle"
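
Before running the download, it may be worth checking that the credentials file really sits where `KAGGLE_CONFIG_DIR` points. The following is a minimal sketch; the helper name `has_kaggle_credentials` is hypothetical and not part of the Kaggle CLI.

```python
import os

def has_kaggle_credentials(config_dir=None):
    """Return True if a kaggle.json file exists in the given directory.

    Falls back to the KAGGLE_CONFIG_DIR environment variable when no
    directory is passed. Hypothetical helper, for illustration only.
    """
    config_dir = config_dir or os.environ.get("KAGGLE_CONFIG_DIR")
    if not config_dir:
        return False
    return os.path.isfile(os.path.join(config_dir, "kaggle.json"))

if not has_kaggle_credentials():
    print("kaggle.json not found; check KAGGLE_CONFIG_DIR")
```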

## Dataset

In [3]:
!kaggle datasets download -d cryptexcode/mpst-movie-plot-synopses-with-tags

Downloading mpst-movie-plot-synopses-with-tags.zip to /content
 97% 28.0M/28.8M [00:08<00:00, 12.2MB/s]
100% 28.8M/28.8M [00:08<00:00, 3.54MB/s]


In [4]:
!mkdir -p movie_dataset && unzip *.zip -d movie_dataset && rm *.zip

Archive:  mpst-movie-plot-synopses-with-tags.zip
  inflating: movie_dataset/mpst_full_data.csv  
  inflating: movie_dataset/partition.json  


In [2]:
import pandas as pd

df = pd.read_csv('movie_dataset/mpst_full_data.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14828 entries, 0 to 14827
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   imdb_id          14828 non-null  object
 1   title            14828 non-null  object
 2   plot_synopsis    14828 non-null  object
 3   tags             14828 non-null  object
 4   split            14828 non-null  object
 5   synopsis_source  14828 non-null  object
dtypes: object(6)
memory usage: 695.2+ KB


## Preprocess Data

In [4]:
df = df[~df.title.isnull()]
df = df[~df.plot_synopsis.isnull()]
df = df[~df.tags.isnull()]
df = df.sort_values('title').reset_index(drop=True)
df['lev'] = None
df

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
0,tt0068152,$,"Set in Hamburg, West Germany, several criminal...",murder,test,imdb,
1,tt0190938,$windle,A 6th grader named Griffin Bing decides to gat...,flashback,train,wikipedia,
2,tt2614684,'71,"Gary Hook, a new recruit to the British Army, ...","suspenseful, neo noir, murder, violence",train,wikipedia,
3,tt0085127,'A' gai wak,Sergeant Dragon Ma (Jackie Chan) is part of th...,"cult, violence",train,wikipedia,
4,tt0080310,'Breaker' Morant,"In Pretoria, South Africa, in 1902, Major Char...","murder, anti war, violence, flashback, tragedy...",train,wikipedia,
...,...,...,...,...,...,...,...
14823,tt3543634,Ängelby,Vera Fors lives an ordinary life with her husb...,"paranormal, murder",train,wikipedia,
14824,tt0065261,Ådalen 31,"In 1931, the working-class family Andersson of...",romantic,val,wikipedia,
14825,tt3425936,Él,The film opens during a foot washing ceremony ...,autobiographical,train,wikipedia,
14826,tt0235198,Ôdishon,"Shigeharu Aoyama (Ryo Ishibashi), a middle-age...","cruelty, murder, cult, violence, flashback, ps...",train,imdb,


In [5]:
!python -m pip install levenshtein



In [6]:
from Levenshtein import distance

# Titles are sorted, so near-duplicate titles sit next to each other.
for a in range(len(df)-1):
    d = distance(df.iloc[a].title, df.iloc[a+1].title)
    if d <= 3:
        df.at[a, 'lev'] = d
# Drop every row whose title is within 3 edits of the next row's title
df = df[df['lev'].isnull()].reset_index(drop=True)
df

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
0,tt0068152,$,"Set in Hamburg, West Germany, several criminal...",murder,test,imdb,
1,tt0190938,$windle,A 6th grader named Griffin Bing decides to gat...,flashback,train,wikipedia,
2,tt2614684,'71,"Gary Hook, a new recruit to the British Army, ...","suspenseful, neo noir, murder, violence",train,wikipedia,
3,tt0085127,'A' gai wak,Sergeant Dragon Ma (Jackie Chan) is part of th...,"cult, violence",train,wikipedia,
4,tt0080310,'Breaker' Morant,"In Pretoria, South Africa, in 1902, Major Char...","murder, anti war, violence, flashback, tragedy...",train,wikipedia,
...,...,...,...,...,...,...,...
12953,tt3543634,Ängelby,Vera Fors lives an ordinary life with her husb...,"paranormal, murder",train,wikipedia,
12954,tt0065261,Ådalen 31,"In 1931, the working-class family Andersson of...",romantic,val,wikipedia,
12955,tt3425936,Él,The film opens during a foot washing ceremony ...,autobiographical,train,wikipedia,
12956,tt0235198,Ôdishon,"Shigeharu Aoyama (Ryo Ishibashi), a middle-age...","cruelty, murder, cult, violence, flashback, ps...",train,imdb,
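
To make the filtering rule above concrete, here is a small sketch on a toy list. The `levenshtein` function is a plain-Python stand-in for `Levenshtein.distance`, and `drop_near_duplicates` and the titles are hypothetical, chosen only for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def drop_near_duplicates(sorted_titles, threshold=3):
    """Keep a title only if the next title is more than `threshold` edits away."""
    kept = []
    for i, title in enumerate(sorted_titles):
        is_last = i == len(sorted_titles) - 1
        if is_last or levenshtein(title, sorted_titles[i + 1]) > threshold:
            kept.append(title)
    return kept

titles = sorted(["Alien", "Aliens", "Heat", "Alien 3"])  # toy titles
print(drop_near_duplicates(titles))  # ['Aliens', 'Heat']
```

Note that a threshold of 3 can also remove distinct movies with similar names, such as sequels: "The Godfather" and "The Godfather II" are exactly 3 edits apart, so the first of the two would be dropped if they end up adjacent after sorting.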


In [7]:
# Manually find Avengers duplicates
for i in range(len(df)):
    if df.iloc[i]['title'].find('Avengers') != -1:
        pass
        # print(i, df.iloc[i]['imdb_id'], df.iloc[i]['title'])
# Drop the extra rows by index
# df = df.drop([9572]).reset_index(drop=True)  # pass several indices, e.g. [1, 2, 3], to drop multiple rows
# df

In [8]:
df.to_csv('movie_dataset/mpst_no_duplicates.csv')

## Encode Data

In [9]:
!python -m pip install transformers



In [10]:
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = model.to('cuda')



In [11]:
from tqdm import tqdm
import numpy as np

tqdm.pandas()
MAX_LENGTH = 2048

df_ = df.copy()
df_['desc'] = df_['title'] + ", " + df_['plot_synopsis'] + ", " + df_['tags']
df_['desc'] = df_['desc'].progress_apply(lambda x: model.encode(x, max_length=MAX_LENGTH))
df_index = df_.pop('title')
df_ = df_[['desc']]
df_ = pd.DataFrame(np.vstack(df_['desc'].to_list()))  # one row per movie, one column per embedding dimension
df_.index = df_index
df_

100%|██████████| 12958/12958 [23:45<00:00,  9.09it/s]


title,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
$,-0.305243,-0.478121,0.573684,-0.416724,-0.063186,-0.104392,0.272295,-0.385394,0.613305,0.671237,...,-0.218130,0.251676,0.266929,-0.269716,-0.097133,-0.075343,0.155538,0.376259,-0.647098,-0.654646
$windle,-0.549233,-0.584572,0.183384,-0.415960,-0.102594,0.176665,0.334762,-0.761044,0.538778,0.453364,...,-0.211560,0.169959,0.815229,0.164532,-0.439802,-0.154438,0.196866,0.134196,-0.379456,-0.448769
'71,-0.710627,-0.277886,0.703571,-0.205504,-0.094787,-0.096663,0.390978,-0.281449,0.699541,0.398097,...,0.154344,0.192901,-0.243648,-0.277452,-0.095334,-0.602673,0.007325,0.325771,-0.496855,-0.147031
'A' gai wak,-0.907518,-0.563106,0.398969,-0.163028,-0.443714,-0.183414,0.203620,-0.399402,0.692871,0.427145,...,-0.203693,0.602331,0.015307,-0.417551,-0.279454,-0.098001,-0.287980,0.022968,-0.583060,-0.261329
'Breaker' Morant,-0.569088,-0.454713,0.695941,0.149527,-0.092742,-0.280021,0.327507,-0.393071,0.666448,0.265905,...,0.181207,0.119770,-0.030417,-0.107527,-0.100398,-0.019875,0.058334,0.199861,-0.353650,-0.364900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ängelby,-0.024747,-0.848546,0.971305,-0.330309,-0.230411,-0.083415,0.443274,-0.636123,0.634419,0.825696,...,0.563743,0.672568,0.448131,0.210745,-0.038528,-0.839997,-0.270615,0.350160,-0.405666,-0.247988
Ådalen 31,-0.489226,-0.146825,0.471086,-0.051622,-0.274401,-0.067239,0.469520,-0.472675,0.292372,0.289693,...,-0.032497,0.183455,0.003145,0.316135,-0.010873,-0.402247,-0.462090,0.370307,0.081739,-0.398458
Él,-0.441571,-0.626663,0.970927,0.290038,-0.327410,-0.338166,0.236051,-0.627263,0.482511,0.596960,...,-0.433985,0.714728,-0.120911,-0.324769,-0.033564,-0.595949,0.104578,0.555788,-0.266922,-0.161359
Ôdishon,-0.586189,-0.445926,0.548699,-0.103584,-0.623864,-0.334163,0.522906,-0.441216,0.529069,0.400032,...,-0.433847,0.421194,0.094667,-0.002444,-0.047095,-0.123772,-0.189924,0.161609,-0.068636,-0.345551
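
The encoding cell above calls the model once per row, which took about 24 minutes here. Passing several texts per call may be faster on a GPU. The sketch below shows only the batching logic, under the assumption that the encoder accepts a list of strings and returns one vector per string; `encode_in_batches` and `fake_encode` are hypothetical names, and the stand-in encoder exists only so the example runs without the model.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_in_batches(texts, encode_fn, batch_size=8):
    """Encode `texts` batch by batch and return one vector per input text."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode_fn(batch))
    return vectors

# Stand-in encoder for illustration only: maps each text to a 2-number vector.
def fake_encode(batch):
    return [[len(text), text.count(" ")] for text in batch]

texts = ["a b", "cd", "e f g", "h", "ij kl"]
vectors = encode_in_batches(texts, fake_encode, batch_size=2)
print(vectors)  # [[3, 1], [2, 0], [5, 2], [1, 0], [5, 1]]
```

With the real model, `encode_fn` could be something like `lambda batch: model.encode(batch, max_length=MAX_LENGTH)`, assuming `encode` accepts a list of strings; that is worth confirming against the model card before relying on it.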


In [12]:
df_.to_csv('movie_dataset/mpst_dedup_embedding.csv')

## Perform Nearest Neighbor Search

In [14]:
df_movies_embed = pd.read_csv('movie_dataset/mpst_dedup_embedding.csv')
df_movies_embed.index = df_movies_embed.pop('title')
df_movies_embed

title,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
$,-0.305243,-0.478121,0.573684,-0.416724,-0.063186,-0.104392,0.272295,-0.385394,0.613305,0.671237,...,-0.218130,0.251676,0.266929,-0.269716,-0.097133,-0.075343,0.155538,0.376259,-0.647098,-0.654646
$windle,-0.549233,-0.584572,0.183384,-0.415960,-0.102594,0.176665,0.334762,-0.761044,0.538778,0.453364,...,-0.211560,0.169959,0.815229,0.164532,-0.439802,-0.154438,0.196866,0.134196,-0.379456,-0.448769
'71,-0.710627,-0.277886,0.703571,-0.205504,-0.094787,-0.096663,0.390978,-0.281449,0.699541,0.398097,...,0.154344,0.192901,-0.243648,-0.277452,-0.095334,-0.602673,0.007325,0.325771,-0.496855,-0.147031
'A' gai wak,-0.907518,-0.563106,0.398969,-0.163028,-0.443714,-0.183414,0.203620,-0.399402,0.692871,0.427145,...,-0.203693,0.602331,0.015307,-0.417551,-0.279454,-0.098001,-0.287980,0.022968,-0.583060,-0.261329
'Breaker' Morant,-0.569088,-0.454713,0.695941,0.149527,-0.092742,-0.280021,0.327507,-0.393071,0.666448,0.265905,...,0.181207,0.119770,-0.030417,-0.107527,-0.100398,-0.019875,0.058334,0.199861,-0.353650,-0.364900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Ängelby,-0.024747,-0.848546,0.971305,-0.330309,-0.230411,-0.083415,0.443274,-0.636123,0.634419,0.825696,...,0.563743,0.672568,0.448131,0.210745,-0.038528,-0.839997,-0.270615,0.350160,-0.405666,-0.247988
Ådalen 31,-0.489226,-0.146825,0.471086,-0.051622,-0.274401,-0.067239,0.469520,-0.472675,0.292372,0.289693,...,-0.032497,0.183455,0.003145,0.316135,-0.010873,-0.402247,-0.462090,0.370307,0.081739,-0.398458
Él,-0.441571,-0.626663,0.970927,0.290038,-0.327410,-0.338166,0.236051,-0.627263,0.482511,0.596960,...,-0.433985,0.714728,-0.120911,-0.324769,-0.033564,-0.595949,0.104578,0.555788,-0.266922,-0.161359
Ôdishon,-0.586189,-0.445926,0.548699,-0.103584,-0.623864,-0.334163,0.522906,-0.441216,0.529069,0.400032,...,-0.433847,0.421194,0.094667,-0.002444,-0.047095,-0.123772,-0.189924,0.161609,-0.068636,-0.345551


In [15]:
embeddings = df_movies_embed.to_numpy(dtype='float32')
embeddings.shape, embeddings.dtype

((12958, 768), dtype('float32'))

In [16]:
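def calc_pairwise_dist(features1, features2):
  """Squared Euclidean distance between every row of features1 (N, D) and every row of features2 (T, D)."""
  N, D = features1.shape
  T, _ = features2.shape
  assert features1.dtype == features2.dtype
  dtype = features1.dtype
  # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all pairs at once
  sq1 = np.matmul(features1**2, np.ones((D, T), dtype=dtype))    # (N, T): ||a||^2 repeated along columns
  sq2 = np.matmul(np.ones((N, D), dtype=dtype), features2.T**2)  # (N, T): ||b||^2 repeated along rows
  return sq1 + sq2 - 2 * np.matmul(features1, features2.T)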

# a = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]], dtype=int)
# b = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]], dtype=int)
# calc_pairwise_dist(a, b)
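
`calc_pairwise_dist` relies on the identity ||a - b||² = ||a||² + ||b||² - 2 a·b, so it returns squared distances rather than distances. The sketch below checks that identity in plain Python on the toy arrays from the commented-out lines; the two helper names are hypothetical.

```python
def squared_dist_direct(x, y):
    """Sum of squared coordinate differences between two vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def squared_dist_expanded(x, y):
    """Same quantity via ||x||^2 + ||y||^2 - 2 * (x . y)."""
    x_norm = sum(xi * xi for xi in x)
    y_norm = sum(yi * yi for yi in y)
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return x_norm + y_norm - 2 * dot

a = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
b = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

direct = [[squared_dist_direct(x, y) for y in b] for x in a]
expanded = [[squared_dist_expanded(x, y) for y in b] for x in a]
assert direct == expanded
print(direct[0])  # [2, 29, 110, 245]
```

Because the square root is monotonic, ranking by squared distance gives the same order as ranking by Euclidean distance, so skipping the root does not change which neighbors come first.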

In [17]:
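def find_similar_videos(query, top_k=3):
  """Return the top_k rows of df whose embeddings are closest to the query text."""
  query_embedding = model.encode(query, max_length=MAX_LENGTH)

  # Squared distance from every movie embedding to the query, shape (N, 1) -> (N,)
  dist = calc_pairwise_dist(embeddings, query_embedding[None, :])
  dist = np.squeeze(dist, axis=1)
  indices = np.argsort(dist)  # ascending: closest first
  # Assumes the rows of df are in the same order as the rows of embeddings
  return df.iloc[indices[:top_k]]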
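
`np.argsort` orders all distances even though only the first `top_k` are used. For roughly 13K rows that is cheap, but a partial selection is enough in principle. A minimal sketch with the standard library's `heapq.nsmallest` (NumPy's `np.argpartition` is a comparable option); `top_k_indices` is a hypothetical helper and the distance values are made up.

```python
import heapq

def top_k_indices(distances, k=3):
    """Return the indices of the k smallest distances, closest first."""
    return heapq.nsmallest(k, range(len(distances)), key=distances.__getitem__)

distances = [0.9, 0.1, 0.5, 0.3, 0.7]  # made-up values for illustration
print(top_k_indices(distances, k=3))  # [1, 3, 2]
```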

In [18]:
query = '''The Avengers follows the aftermath of several previous Marvel Cinematic Universe (MCU) films, where various superheroes have been introduced, each with their own unique abilities and backgrounds. The movie opens with the Tesseract, a powerful energy source, falling into the hands of the malevolent Loki (Tom Hiddleston), who plans to use it to conquer Earth and subjugate humanity. In response to the imminent threat, Nick Fury (Samuel L. Jackson), the director of S.H.I.E.L.D., assembles a team of extraordinary individuals to thwart Loki’s nefarious plans. This team, known as the Avengers, comprises Iron Man, Captain America, Thor, Hulk, Black Widow, and Hawkeye. However, their initial inability to work together as a cohesive unit threatens to jeopardize their mission. As the Avengers confront Loki and his army of Chitauri invaders in New York City, they must overcome their differences and egos to forge a formidable alliance. Together, they showcase their unique strengths and talents in a pulse-pounding battle to protect Earth and its inhabitants from ultimate destruction.'''

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
9572,tt0848228,The Avengers,"Nick Fury (Samuel L. Jackson), director of S.H...","comedy, boring, flashback, good versus evil, h...",train,imdb,
887,tt0203247,Avengers,"The Asgardian Loki encounters the Other, the l...",good versus evil,train,wikipedia,
889,tt2395427,Avengers: Age of Ultron,"In the Eastern European country of Sokovia, th...","murder, violence, flashback, good versus evil,...",train,wikipedia,


In [19]:
query = 'chris evans #chrisevans #paulrudd #captainamerica #antman #theavengers #avengers #marvel #sebastianstan'

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
9573,tt1626038,The Avengers: Earth's Mightiest Heroes,=== Season One ===\nAs 75 of the world's most ...,violence,test,wikipedia,
1767,tt1843866,Captain America: The Winter Soldier,"Two years after the events of 'The Avengers', ...","murder, violence, flashback, good versus evil,...",val,imdb,
730,tt0478970,Ant-Man,"In 1989, Hank Pym (Michael Douglas) resigns fr...","comedy, murder, violence, flashback, good vers...",train,imdb,


In [20]:
query = 'Ayo this thing racist #spiderman #avengers #marvel #marvelstudios #funny #foryou'

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
5763,tt3322904,Lego Marvel Super Heroes: Maximum Overload,The mischievous Loki challenges the Marvel Sup...,humor,train,wikipedia,
4771,tt1754811,InAPPropriate Comedy,The framing device has Vince Offer pressing bu...,comedy,train,wikipedia,
12468,tt0066550,Watermelon Man,"Los Angeles, 1968. Jeff Gerber (Godfrey Cambri...","comedy, blaxploitation",train,imdb,


In [21]:
query = 'It focuses on the transformation of his youngest son, Michael Corleone, from reluctant family outsider to ruthless mafia boss.'

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
10215,tt0099674,The Godfather: Part III,"The movie begins in 1979, with a brief flashba...","murder, dramatic, cult, flashback, good versus...",train,imdb,
10214,tt1198207,The Godfather II,"In 1901, the family of nine-year-old Vito Ando...","violence, humor, murder",train,wikipedia,
4227,tt0070165,Heavy Traffic,"The film starts out in live action, introducin...","pornographic, cult, violence, psychedelic, hum...",train,wikipedia,


In [23]:
query = '''This is Quentin Tarantino at his finest! Engaging characters; non-stop action; and intricately woven individual stories that produce a mesmerizing whole.

The depth of the overall plot is astounding. One minute dark and moody, the next riddled with black comedy and unforgettable utterances that remain noteworthy
quotes even into present day. Violence and blood aplenty, rampant drug use, startlingly intense situations- all signature Tarantino. And yet, this film is interspersed with sparkling nuggets of stark morality and unexpected wisdom. John Travolta as Vincent Vega, Samuel L. Jackson as Jules, Uma Thurman as Mia Wallace, and Bruce Willis as Butch are exceptionally complex characters who must learn to survive in an often cold and brutal world that still has hidden gems of humanity at its most basic fineness. Pulp Fiction is a roller coaster ride that will leave you gasping for more!'''

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
2327,tt0082220,Cutter's Way,Cutter's way is hard to put in a category. It'...,"revenge, neo noir, murder",train,imdb,
7688,tt0110912,Pulp Fiction,"Late one morning in the Hawthorne Grill, a res...","comedy, murder, stupid, cult, action, revenge,...",train,imdb,
7626,tt0119942,Primary Colors,"Well acted drama, adapted from a novel by Joe ...","satire, atmospheric",train,imdb,


In [24]:
query = "I haven't seen The Godfather in years but as it was on BBC2 last night, 13 December, decided to watch it again.  So glad I did as it is a superb film.  Although made in 1971-72 it has not dated, as so many films of that decade do.  Performances are superb, notably Marlon Brando and Al Pacino.  Pacino's face is like a mask - no emotion is shown  and he can casually order an opponent's execution as sitting down to a meal.  I also love the stillness of the movie and how family life was vividly shown - the Italian love of good food and drink.  Colours are muted in browns and golds and Coppola made a masterpiece.  5 star rating."

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
10215,tt0099674,The Godfather: Part III,"The movie begins in 1979, with a brief flashba...","murder, dramatic, cult, flashback, good versus...",train,imdb,
3920,tt0929425,Gomorra,The film opens with the murder of gangsters re...,"violence, cruelty, murder",train,wikipedia,
10214,tt1198207,The Godfather II,"In 1901, the family of nine-year-old Vito Ando...","violence, humor, murder",train,wikipedia,


In [25]:
query = "复仇者联盟"

find_similar_videos(query, 3)

Unnamed: 0,imdb_id,title,plot_synopsis,tags,split,synopsis_source,lev
5596,tt0084228,Laberinto de pasiones,Un hombre y una mujer caminan por una abarrota...,melodrama,train,imdb,
559,tt0058898,"Alphaville, une étrange aventure de Lemmy Caution",A man arrives in a hotel very possessive about...,"avant garde, neo noir, mystery, dramatic, viol...",train,imdb,
2624,tt0082259,"Deprisa, deprisa",CONTIENE SPOILERS\nLa película trata la histor...,murder,val,imdb,
