# word_importance
To comprehend the plot of an entire film, or just an individual scene, we need to understand what the characters are talking about. We'll use a variety of NLP tools to parse the dialogue and identify the most important subjects of conversation, whether those are other characters or concepts.

In [1]:
from subtitle_dataframes_io import *
import pysrt
import nltk
import spacy
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from collections import Counter
pd.set_option('display.max_colwidth', None)
# 'en' is a spaCy 2 shortcut; spaCy 3+ needs the full package name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')

We'll start with a scene from *Plus One* (2019). We've previously defined a few functions to parse the subtitles file.

In [2]:
subs = pysrt.open('../subtitles/plus_one.srt')
subtitle_df = generate_base_subtitle_df(subs)
subtitle_df = generate_subtitle_features(subtitle_df)
subtitle_df['cleaned_text'] = subtitle_df['concat_sep_text'].map(clean_line)
sentences = partition_sentences(remove_blanks(subtitle_df['cleaned_text'].tolist()), nlp)

We now have a list of all the film's sentences, called `sentences`. We'll also define two more data objects at the scene level: `scene_sentences`, a single long string of the scene's sentences, and `scene_nlp_doc`, a `spaCy` doc built from that string.

In [3]:
scene_sentences = (' ').join(sentences[880:976])
scene_nlp_doc = nlp(scene_sentences)

## Entity Counting
`spaCy` can perform named entity recognition, identifying people and organizations, as well as more abstract things like quantities or periods of time (like "tomorrow"). We can simply count up the most common entities in a scene.

In [4]:
entities = []

for ent in scene_nlp_doc.ents:
    entities.append(ent.text)
count = Counter(entities)
count.most_common(10)

[('Nate', 6),
 ('Alice', 3),
 ('Jess Ramsey', 2),
 ('first', 2),
 ('Ben King', 2),
 ('Maggie', 1),
 ('one night', 1),
 ('Alice Mori', 1),
 ('tomorrow', 1),
 ('two weeks', 1)]

In this scene, the pair (Alice and Ben King) is arguing about exes and prospective romantic partners: Nate, Maggie, and Jess Ramsey. This was a rough solution, but it worked, mostly because they're talking about named entities.
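One way to tighten the count, sketched here under assumptions, is to keep only entities labelled `PERSON`, so that ordinals and time expressions like "first" or "tomorrow" drop out. The snippet works on plain `(text, label)` pairs, such as those produced by `[(ent.text, ent.label_) for ent in scene_nlp_doc.ents]`, so it runs without a model. The sample pairs and their labels are illustrative stand-ins, not the actual output for this scene, and `count_people` is a hypothetical helper name.

```python
from collections import Counter


def count_people(entity_pairs):
    """Count entity mentions, keeping only those labelled PERSON."""
    return Counter(text for text, label in entity_pairs if label == "PERSON")


# Illustrative stand-in data; with spaCy this would come from
# [(ent.text, ent.label_) for ent in scene_nlp_doc.ents]
sample_pairs = [
    ("Nate", "PERSON"), ("first", "ORDINAL"), ("Nate", "PERSON"),
    ("Alice", "PERSON"), ("tomorrow", "DATE"), ("Maggie", "PERSON"),
    ("two weeks", "DATE"),
]

people_counts = count_people(sample_pairs)
print(people_counts.most_common())
```

Filtering by label trades recall for precision: a name the model mislabels (or misses) will silently disappear from the count.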

This could be improved further with pronoun (coreference) resolution, which links pronouns back to the person they refer to. This string of sentences early in the scene is all about Maggie: *Um... she was good. Yeah she was... she was really... Yeah. She was cool.* If we could tell these sentences were about Maggie, she would appear higher in our list.
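As a rough illustration of the idea, and not a substitute for a real coreference model, the sketch below credits each feminine pronoun to the most recently mentioned known name. It ignores gender, number, and syntax, so it will misattribute pronouns whenever the last-named person is not the referent. The function name `count_mentions`, the pronoun list, and the opening question in the sample dialogue are all hypothetical; only the remaining sentences come from the scene quoted above.

```python
import re
from collections import Counter

# Assumption: only feminine third-person pronouns are tracked.
PRONOUNS = {"she", "her", "hers"}


def count_mentions(text, known_names):
    """Count name mentions, crediting each pronoun to the most recent name."""
    counts = Counter()
    last_name = None
    for word in re.findall(r"[A-Za-z']+", text):
        if word in known_names:
            last_name = word
            counts[word] += 1
        elif word.lower() in PRONOUNS and last_name is not None:
            counts[last_name] += 1
    return counts


# The first sentence is a hypothetical lead-in; the rest is the quoted dialogue.
dialogue = ("How was Maggie? Um... she was good. Yeah she was... "
            "she was really... Yeah. She was cool.")
mention_counts = count_mentions(dialogue, {"Maggie", "Nate", "Alice"})
print(mention_counts)
```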

## TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a measure of word importance that compares how often a word appears within a specific document to how often it appears across all documents. In other words, if a word appears often within a specific document but rarely across the collection as a whole, it is probably important to that specific document.

In our context, we're comparing the frequency of words in our scene against the frequency of words throughout the entire movie (minus this specific scene).
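To make the weighting concrete, here is a minimal pure-Python sketch of the calculation. It assumes raw counts for term frequency, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document, which are the defaults `TfidfVectorizer` applies. The function name `tf_idf` and the two toy token lists are hypothetical stand-ins for the scene and the rest of the film.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (each a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalise so each document's weight vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({term: w / norm for term, w in raw.items()})
    return weighted


# Toy token lists standing in for the scene and the rest of the film.
toy_scene = ["nate", "moping", "nate", "know"]
toy_film = ["know", "wedding", "know", "speech"]

scene_weights, film_weights = tf_idf([toy_scene, toy_film])
for term, weight in sorted(scene_weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s}{weight:.3f}")
```

With only two documents, a term's document frequency is either 1 or 2, so the IDF factor can only take two values. Raw frequency within the scene therefore still carries a lot of weight, which helps explain why common words can rank highly.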

So we'll have to build two data objects: `scene_doc`, which contains the sentences of just the scene, and `film_doc`, which contains all the film's sentences minus the scene's sentences. Then we'll combine them into a two-element list to pass to the TF-IDF vectorizer.

In [5]:
film_doc = sentences.copy()
scene_doc = film_doc[880:976]
del film_doc[880:976]
scene_doc_joined = (' '.join(scene_doc))
film_doc_joined = (' '.join(film_doc))
film_scene_doc = [scene_doc_joined, film_doc_joined]

Next, we'll use `scikit-learn`'s TF-IDF implementation. The `ngram_range` argument sets the minimum and maximum phrase length to consider. Here we've set it to `(1, 3)`, so the vectorizer scores single words as well as two- and three-word phrases like "what the heck".

In [6]:
vectorizer = TfidfVectorizer(use_idf=True, stop_words='english', ngram_range = (1,3))
idf_transformed = vectorizer.fit_transform(film_scene_doc)

In [7]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
tf_idf_df = pd.DataFrame(idf_transformed[0].T.todense(), index=vectorizer.get_feature_names_out(), columns=["TF-IDF"])
tf_idf_df = tf_idf_df.sort_values('TF-IDF', ascending=False)

In [8]:
tf_idf_df.head(10)

|  | TF-IDF |
| --- | --- |
| know | 0.317364 |
| just | 0.244126 |
| don | 0.170888 |
| nate | 0.170888 |
| got | 0.146476 |
| right | 0.122063 |
| yeah | 0.122063 |
| alice | 0.122063 |
| moping nate | 0.102933 |
| moping | 0.102933 |


This gave us decent results: it identified that one of the characters is moping about Nate. Some of the phrases may sound strange because stop words were removed before the phrases were built; the phrase in the dialogue was "moping over Nate". The word "don" is actually "don't": the vectorizer's default tokenizer splits on the apostrophe and only keeps tokens of two or more characters, so "don" survives while the lone "t" is dropped.
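The sketch below reproduces both effects in plain Python. It assumes the vectorizer's default token pattern (`\b\w\w+\b`, applied to lowercased text) and that stop words are filtered out before n-grams are formed. The function name `word_ngrams` and the tiny `STOP_WORDS` set are hypothetical; the real English stop list is much longer.

```python
import re

# Tiny illustrative subset of an English stop list (assumption).
STOP_WORDS = {"over", "the", "is", "he"}


def word_ngrams(text, max_n=3):
    """Tokenize, drop stop words, then build 1- to max_n-word phrases."""
    # Tokens must be at least two word characters, so "I" and the "t"
    # left over from "don't" never become tokens.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [
        " ".join(kept[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(kept) - n + 1)
    ]


print(word_ngrams("moping over Nate"))  # ['moping', 'nate', 'moping nate']
print(word_ngrams("I don't know"))      # ['don', 'know', 'don know']
```

If contractions matter for the analysis, one option is to pass a custom `token_pattern` or `tokenizer` to `TfidfVectorizer` so that "don't" stays a single token.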

## Scene-Level Word Importance
Now that we can identify specific scenes, we can apply some of these techniques to individual, self-contained conversations.

In [9]:
import sys
sys.path.append('../data_serialization')
from serialization_preprocessing_io import *
from time_reference_io import *
from scene_identification_io import *
from scene_details_io import *
from character_identification_io import *
from character_details_io import *

In [10]:
film = 'plus_one_2019'
srt_df, subtitle_df, sentence_df, vision_df, face_df = read_pickle(film)
scene_dictionaries = generate_scenes(vision_df, face_df, substantial_minimum=4, anchor_search=8)
character_dictionaries = generate_characters(scene_dictionaries)

In [11]:
scene_dict = scene_dictionaries[7]
scene_start_time = frame_to_time(scene_dict['first_frame'])
scene_end_time = frame_to_time(scene_dict['last_frame'] + 1) # add 1 second; scene ends one second after this frame is onscreen
scene_subtitle_df = subtitle_df[(subtitle_df['end_time'] > scene_start_time) & (subtitle_df['start_time'] < scene_end_time)].copy()
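scene_sentence_indices = []
# collect the position of every sentence that draws on a subtitle shown during the scene
for sentence_position, sub_index_list in enumerate(sentence_df.subtitle_indices.values):
    for sub_index in sub_index_list:
        if sub_index in scene_subtitle_df.index.values:
            scene_sentence_indices.append(sentence_position)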
scene_sentence_df = sentence_df[scene_sentence_indices[0]: scene_sentence_indices[-1] + 1]
scene_sentences = scene_sentence_df.sentence.tolist()

In [12]:
scene_sentence_doc = nlp((' '.join(scene_sentence_df.sentence.tolist())))

We can look through every token in the scene's conversation and pick out the people. From there, we can use each token's dependency label to differentiate between a name being addressed and a name being discussed.

In [13]:
people_addressed = []
for token in scene_sentence_doc:
    if token.ent_type_ == 'PERSON' and token.dep_ == 'npadvmod':
        people_addressed.append(token.text)
people_addressed

['Ben']

In [14]:
people_discussed = []
for token in scene_sentence_doc:
    if token.ent_type_ == 'PERSON' and token.dep_ != 'npadvmod':
        people_discussed.append(token.text)
people_discussed

['Lily']

In [15]:
print(scene_sentences[20])
print(scene_sentences[10])

Ben, my parents don't like each other.
I was bugged out at first when Lily asked me to be her maid of honor.
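As a possible cross-check on the dependency-label approach, a simple punctuation heuristic can flag names that are likely being addressed: in written dialogue, a name used in direct address is usually set off by a comma at the start or end of the sentence. This is only a sketch under that assumption, and `is_addressed` is a hypothetical helper; it will miss names addressed mid-sentence and can misfire on other comma constructions.

```python
import re


def is_addressed(name, sentence):
    """Heuristic: is the name set off by a comma at the sentence's start or end?"""
    escaped = re.escape(name)
    at_start = re.match(rf"\s*{escaped}\s*,", sentence)
    at_end = re.search(rf",\s*{escaped}\s*[.!?]*\s*$", sentence)
    return bool(at_start or at_end)


print(is_addressed("Ben", "Ben, my parents don't like each other."))  # True
print(is_addressed(
    "Lily",
    "I was bugged out at first when Lily asked me to be her maid of honor.",
))  # False
```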
