# character_identification
We can learn the names of characters by getting creative with the subtitle track. Character names are important for NLP-based plot comprehension. We'll also want to identify names and tie them to face and vocal encodings, to persistently track characters throughout the film.

The audience learns the names of characters by listening to the dialogue (except where character names are displayed onscreen, most often in documentaries or docu-dramatizations). So screenwriters know they have to put character names in dialogue. These might take the form of self-introductions, such as "I am Detective Lieutenant Elliot", or subtler hints, such as a line that addresses the character in the second person: "I'm sorry, Marta."

Screenwriters need to drop these hints when the character is introduced. But since we can analyze movies non-chronologically (all at once, in an instant), we can look for these types of clues everywhere.

In previous notebooks, we've demonstrated how to parse and clean subtitles. To keep this notebook focused, we'll type in the lines of dialogue manually instead.

In [1]:
import pysrt
import spacy

In [2]:
nlp = spacy.load('en')  # 'en' is a spaCy 2.x shortcut link; spaCy 3.x needs a full model name such as 'en_core_web_sm'

# Introductions
## Self-Introduction
The most basic form of character introduction is the first-person introduction, which may take the form of "I'm Alice", or "My name is Marlowe."

This is a good time to clarify that many of the sentence structures we're looking for will be somewhat hard-coded. The two examples above are very common — there are only so many ways screenwriters can have a character introduce herself.

### "I'm Alan" or "I am Alan"

In [3]:
sent = "Hey, I'm Vlad." # Teen Spirit (2018), subtitle 70
sent_doc = nlp(sent)

In [4]:
for token in sent_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

Hey hey INTJ UH intj Xxx
, , PUNCT , punct ,
I -PRON- PRON PRP nsubj X
'm be AUX VBP ROOT 'x
Vlad Vlad PROPN NNP attr Xxxx
. . PUNCT . punct .


We can have spaCy analyze this simple sentence. This three-word sentence turns into six tokens, because the punctuation marks and the contraction "'m" each become their own token. Vlad is properly labeled as a PROPN, a proper noun. In the next example, spaCy also labels the words "Detective" and "Lieutenant" as proper nouns. This way we can get the character's full name and title, exactly how he introduced himself.

In [5]:
sent = "I am Detective Lieutenant Elliot" # Knives Out (2019), subtitle 64
sent_doc = nlp(sent)

In [6]:
for token in sent_doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

I -PRON- PRON PRP nsubj X
am be AUX VBP ROOT xx
Detective Detective PROPN NNP compound Xxxxx
Lieutenant Lieutenant PROPN NNP compound Xxxxx
Elliot Elliot PROPN NNP attr Xxxxx


We can look for these two cases of self-introduction: "I'm Alan" or "I am Alan". We'll look for the tokens "I" and "'m" or "I" and "am", and then check if the next token is a proper noun. If it is, we'll look at the next token to see if it's also a proper noun, presumably another component of the character name.

In [7]:
pnoun_components = []
start_token = 0  # index of the token where we expect "I"; a parameter so we can iterate through a sentence

# Check the length first so a short sentence such as "I am" doesn't raise an IndexError
if (len(sent_doc) > start_token + 2
        and sent_doc[start_token].text == 'I'
        and sent_doc[start_token + 1].text in ("'m", "am")
        and sent_doc[start_token + 2].pos_ == 'PROPN'):
    # Collect the run of consecutive proper nouns that follows "I'm" / "I am"
    for token in sent_doc[start_token + 2:]:
        if token.pos_ != 'PROPN':
            break
        pnoun_components.append(token.text)

string_value = ' '.join(pnoun_components)
print(string_value)  # this will be returned

Detective Lieutenant Elliot


### "My name is Alan"
We can do something similar for the phrasing "My name is Alan".

In [8]:
sent = "My name is Henckels." # The Grand Budapest Hotel (2014), subtitle 642
sent_doc = nlp(sent)

In [9]:
pnoun_components = []
start_token = 0

# Check the length first so a short sentence doesn't raise an IndexError
if (len(sent_doc) > start_token + 3
        and sent_doc[start_token].text in ('My', 'my')
        and sent_doc[start_token + 1].text == "name"
        and sent_doc[start_token + 2].text == "is"
        and sent_doc[start_token + 3].pos_ == 'PROPN'):
    # Collect the run of consecutive proper nouns that follows "My name is"
    for token in sent_doc[start_token + 3:]:
        if token.pos_ != 'PROPN':
            break
        pnoun_components.append(token.text)

string_value = ' '.join(pnoun_components)
print(string_value)

Henckels


## Other-Introduction
We can also identify phrases where someone introduces another character, using logic similar to the self-introduction checks above.

### "This is Alan"

In [10]:
sent = "and this is Trooper Wagner." # Knives Out (2019), subtitle 65
sent_doc = nlp(sent)

In [11]:
pnoun_components = []
start_token = 1  # skip the leading "and"

# Check the length first so a short sentence doesn't raise an IndexError
if (len(sent_doc) > start_token + 2
        and sent_doc[start_token].text in ('This', 'this')
        and sent_doc[start_token + 1].text == "is"
        and sent_doc[start_token + 2].pos_ == 'PROPN'):
    # Collect the run of consecutive proper nouns that follows "this is"
    for token in sent_doc[start_token + 2:]:
        if token.pos_ != 'PROPN':
            break
        pnoun_components.append(token.text)

string_value = ' '.join(pnoun_components)
print(string_value)

Trooper Wagner
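
The introduction checks above differ only in the fixed words that precede the name, so they can be folded into one helper driven by a table of prefixes. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: `find_introduction`, `INTRO_PREFIXES`, and the `Tok` stand-in are hypothetical names introduced here. `Tok` only mimics the `.text` and `.pos_` attributes of a spaCy token so that the sketch runs without loading a model; a real `sent_doc` should work the same way, since only `len()`, slicing, `.text`, and `.pos_` are used.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only .text and .pos_ are used.
Tok = namedtuple("Tok", ["text", "pos_"])

# Hypothetical table of lowercase word sequences that precede an introduced name.
INTRO_PREFIXES = [
    ("i", "'m"),
    ("i", "am"),
    ("my", "name", "is"),
    ("this", "is"),
]


def find_introduction(tokens, prefixes=INTRO_PREFIXES):
    """Return the first name introduced in `tokens`, or '' if none is found.

    Every token position is tried as a possible start, so a leading
    "Hey," or "and" does not need a hand-picked start_token.
    """
    for start in range(len(tokens)):
        for prefix in prefixes:
            end = start + len(prefix)
            words = tuple(t.text.lower() for t in tokens[start:end])
            if words != prefix:
                continue
            # Collect the run of consecutive proper nouns after the prefix.
            name = []
            for tok in tokens[end:]:
                if tok.pos_ != "PROPN":
                    break
                name.append(tok.text)
            if name:
                return " ".join(name)
    return ""


# Tokens and POS tags taken from the spaCy output shown earlier for "Hey, I'm Vlad."
vlad = [
    Tok("Hey", "INTJ"), Tok(",", "PUNCT"), Tok("I", "PRON"),
    Tok("'m", "AUX"), Tok("Vlad", "PROPN"), Tok(".", "PUNCT"),
]
print(find_introduction(vlad))  # prints: Vlad
```

Unlike the cells above, this sketch lowercases each word before comparing, and it returns only the first introduction it finds in a sentence. Adding a new phrasing means adding one tuple to the table rather than writing another cell.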


### Interjection Cleanup
A quick interruption to deal with a common problem: interjections such as "um" or "uh" in the middle of a line. An interjection between "is" and the name breaks the logic above, because the token after "is" is no longer a proper noun. We can define the interjection strings to remove, and then rebuild the sentence without them.

These ums and uhs are pretty common in naturalistic comedies, so we'll be sure to remove these when we can. We may also choose to save these in the DataFrame, to denote that there was a hesitation/interjection in the original sentence, before we cleaned it.

In [12]:
sent = "This is, uh, Maggie." # Plus One (2019), subtitle 619
sent_doc = nlp(sent)

In [13]:
interjection_string = ', uh,'

# Cut the interjection out by joining the text before it to the text after it
found_sent = sent.find(interjection_string)
if found_sent != -1:
    sent = sent[:found_sent] + sent[found_sent + len(interjection_string):]

sent_doc = nlp(sent)
print(sent)

This is Maggie.


We can now look for an other-introduction in this cleaned sentence.

In [14]:
pnoun_components = []
start_token = 0

# Check the length first so a short sentence doesn't raise an IndexError
if (len(sent_doc) > start_token + 2
        and sent_doc[start_token].text in ('This', 'this')
        and sent_doc[start_token + 1].text == "is"
        and sent_doc[start_token + 2].pos_ == 'PROPN'):
    # Collect the run of consecutive proper nouns that follows "This is"
    for token in sent_doc[start_token + 2:]:
        if token.pos_ != 'PROPN':
            break
        pnoun_components.append(token.text)

string_value = ' '.join(pnoun_components)
print(string_value)

Maggie
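
The interjection cleanup can be generalized beyond the single string `', uh,'`. The sketch below is one possible approach, not the notebook's method: it swaps the `str.find` slicing for a regular expression, and it assumes a small, hypothetical filler list (`uh`, `um`, `er`) and a hypothetical helper named `strip_interjections`. The helper also returns a flag, which could be stored in the DataFrame to record that the original line contained a hesitation.

```python
import re

# Hypothetical filler list; extend it to suit the films being analyzed.
FILLERS = ("uh", "um", "er")

# Matches a filler set off by commas, e.g. ", uh," in "This is, uh, Maggie."
INTERJECTION_RE = re.compile(
    r",\s*(?:" + "|".join(FILLERS) + r")\s*,",
    flags=re.IGNORECASE,
)


def strip_interjections(sent):
    """Return (cleaned_sentence, had_interjection)."""
    cleaned, count = INTERJECTION_RE.subn("", sent)
    return cleaned, count > 0


print(strip_interjections("This is, uh, Maggie."))  # prints: ('This is Maggie.', True)
print(strip_interjections("This is Maggie."))       # prints: ('This is Maggie.', False)
```

This only handles fillers enclosed by commas on both sides, as in the Plus One example. A filler at the start of a line ("Um, this is Maggie.") would need an additional pattern.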
