# Demonstration of Embeddings for Similarity Measure
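This demo generates embeddings for movie plot summaries, reduces them with latent feature analysis, and uses the reduced vectors to find the plots most similar to a reference movie.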

First, let us make the necessary imports.

In [1]:
import pandas as pd
from operations.calling import EmbeddingManager
from embeddings.embeddings import LatentFeatureAnalysis, SimilarityMeasurement

# Import the data
The data consists of movie plots scraped from Wikipedia. In this
case, the plots belong to movies from the adventure genre, released
from the 1980s up to the 2020s.

In [2]:
raw_data = pd.read_csv(
    filepath_or_buffer=r"D:\Projects\025_scenario_writing\other\output\screenplays_adventure_films_1980_2020_prepared.csv"
)

Generating the embeddings takes only a few calls to EmbeddingManager.

In [3]:
pieces_of_text = raw_data['content'].tolist()
embedding_manager = EmbeddingManager(
    texts=pieces_of_text,
    limit=None
)
embedding_manager.generate_embeddings()
embeddings = embedding_manager.get_embeddings()

Embedding piece of text: 1 out of 836
Embedding piece of text: 2 out of 836
Embedding piece of text: 3 out of 836
...

Once the embeddings are available, it is possible to perform latent
feature analysis (LFA) on them. The reduction (a) aims to preserve as
much of the variance in the data as the retained components allow,
and (b) lets us compute similarity at a fraction of the cost that the
original embeddings would require. Computation over two dimensions is
much faster than computation over thousands of dimensions.

In [4]:
latent_feature_analyses = LatentFeatureAnalysis(
    data=embeddings,
    method='PCA'
)
latent_feature_analyses.execute_analysis()
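
The internals of LatentFeatureAnalysis are not shown in this notebook, so the following is only a minimal sketch of what a PCA reduction to two components might look like, written with plain NumPy. The helper name pca_reduce and the synthetic data are assumptions made for illustration; they are not part of the embeddings package. The sketch also reports the share of variance that the retained components keep, which is one way to check claim (a) above.

```python
import numpy as np


def pca_reduce(data, n_components=2):
    """Project the rows of `data` onto the top principal components.

    Hypothetical helper for illustration; not the LatentFeatureAnalysis API.
    """
    # PCA operates on mean-centered data.
    centered = data - data.mean(axis=0)
    # Economy SVD: the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:n_components].T
    # Squared singular values are proportional to the variance per component.
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return reduced, retained


# Synthetic stand-in for real embeddings (shape chosen arbitrarily).
rng = np.random.default_rng(seed=0)
fake_embeddings = rng.normal(size=(200, 64))

reduced, retained = pca_reduce(fake_embeddings, n_components=2)
print(reduced.shape)
print(f"Share of variance retained: {retained:.1%}")
```

If the retained share turns out to be small for the real embeddings, two components may not be enough for the similarity search to be meaningful.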

Let us now empirically evaluate the quality of the reduction. This can
be done by selecting a reference movie and finding the movies whose
plots are most similar to it.

In [5]:
reference_movie = 'The Batman'
mask = raw_data.loc[:, 'film_name'] == reference_movie
reference_index = raw_data.index[mask].to_list()[0]
print(reference_index)

832
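
The lookup above returns an index label and raises an IndexError when the title is absent, while the later cells address rows by position with iloc. With the default index produced by read_csv the label and the position coincide, but that would not hold after filtering or re-indexing. As a hedged sketch, a small helper (find_row_position is a hypothetical name, not part of the project) can return the position directly and fail with a clearer message:

```python
import pandas as pd


def find_row_position(frame, column, value):
    """Return the position of the first row where frame[column] == value.

    Hypothetical helper for illustration only.
    """
    # Boolean mask -> array of integer positions where the mask is True.
    matches = (frame[column] == value).to_numpy().nonzero()[0]
    if len(matches) == 0:
        raise KeyError(f"No row with {column} == {value!r}")
    return int(matches[0])


# Toy frame with a non-default index to show that a position is returned.
toy = pd.DataFrame({"film_name": ["A", "B", "C"]}, index=[10, 20, 30])
position = find_row_position(toy, "film_name", "B")
print(position)  # 1
```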


Now that the index of the reference movie has been retrieved, it can
be used to measure similarity and identify the most similar movie
plots. The number of most similar movie plots is set to 5.

In [6]:
similarity_measurement = SimilarityMeasurement(
    data=latent_feature_analyses.get_reduction(),
    reference_index=reference_index,
    top_k=5,
    method='euclidean'
)
similarity_measurement.compute_distances()
top_k_distances = similarity_measurement.get_top_k()
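
How SimilarityMeasurement computes its result is not shown here. As an illustration only, a Euclidean top-k search over the reduced points could be sketched as follows. The function name top_k_nearest is hypothetical, and excluding the reference point from its own results is an assumption about the intended behaviour.

```python
import numpy as np


def top_k_nearest(points, reference_index, k):
    """Return positions of the k rows closest to the reference row.

    Hypothetical helper; uses Euclidean distance and skips the reference itself.
    """
    # Euclidean distance from every row to the reference row.
    distances = np.linalg.norm(points - points[reference_index], axis=1)
    # Assumption: the reference should not appear among its own neighbours.
    distances[reference_index] = np.inf
    # Positions sorted by increasing distance; keep the first k.
    return np.argsort(distances)[:k].tolist()


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
nearest = top_k_nearest(points, reference_index=0, k=2)
print(nearest)  # [1, 2]
```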

Let us see which movies are most similar to our reference movie
according to the embeddings and the LFA performed.

In [7]:
for most_similar in top_k_distances:
    # Look up the row once, then read both fields from it
    row = raw_data.iloc[most_similar]
    print(f"{row['film_name']}, {row['source']}")

The Hunger Games: Mockingjay – Part 1, https://en.wikipedia.org/wiki/The_Hunger_Games:_Mockingjay_%E2%80%93_Part_1
X-Men: The Last Stand, https://en.wikipedia.org/wiki/X-Men:_The_Last_Stand
Army of Darkness, https://en.wikipedia.org/wiki/Army_of_Darkness
X-Men, https://en.wikipedia.org/wiki/X-Men_(film)
Lupin the 3rd: The Golden Legend of Babylon, https://en.wikipedia.org/wiki/Lupin_the_3rd:_The_Golden_Legend_of_Babylon


As can be seen from the results, the embeddings, the LFA, and the
similarity measurement all require further investigation.