# subtitle_cleaning
We can extract a film's English-language subtitle track to get the ground-truth dialogue. We can glean clues about a scene's location, characters, and context. While we're not yet ready to use subtitles to analyze the film's entire plot, we can start small and see what localized information we can learn. But first we'll need to get the subtitles into a usable format to feed into our NLP analysis.

## Loading Subtitles
We'll be using the `pysrt` library to parse .srt subtitle files.

In [1]:
import pysrt
from collections import Counter
from subtitle_cleaning_io import *

In [2]:
subs = pysrt.open('../subtitles/booksmart.srt')

In [3]:
len(subs)

2373

Since each entry in a subtitle file is explicitly numbered, starting at 1, there's an off-by-one discrepancy with the list object (which starts at 0). We can offset the list by duplicating the first subtitle item, so that `subs[n]` is entry number `n`. This adds one element to the list, so `len(subs)` becomes 2374.

In [4]:
# subtitle files (.srt) are explicitly numbered, and start at 1
subs.insert(0, subs[0])
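
The effect of the padding can be checked on a plain list. This is a minimal sketch with made-up data standing in for the parsed subtitles; the names `numbered` and `items` are illustrative and not part of `pysrt`.

```python
# Hypothetical stand-in for a parsed .srt file: entry number -> text.
numbered = {1: 'first', 2: 'second', 3: 'third'}
items = list(numbered.values())      # zero-based, like the parsed list

# Pad index 0 with a copy of the first item, as done with subs above.
items.insert(0, items[0])

# After padding, the list index matches the entry number.
aligned = all(items[n] == text for n, text in numbered.items())
print(aligned)       # True
print(len(items))    # 4: one more than the number of entries
```

The trade-off is that index 0 now holds a duplicate of entry 1, so any loop over the whole list visits the first subtitle twice.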

Each SubRipItem contains the subtitle text as well as the start and end time.

In [5]:
print(subs[4].text)
print(subs[4].start)
print(subs[4].end)

Take a deep breath.
00:00:09,177
00:00:11,011


## Subtitle Cleanup
Subtitle files are already formatted very neatly. It shouldn't be too hard to clean and shape this data into a format we can use.

### Individual Line Cleaning
Subtitle text spans either one or two lines. Text that spans two lines may contain dialogue from either one character or two separate characters.

This is a one-liner, which is just one character speaking.

`29
00:01:19,747 --> 00:01:21,081
I missed you.`

Here, a single character spoke enough dialogue to span two lines.

`69
00:02:43,331 --> 00:02:45,248
I mean, he's you know,
he's the vice president.`

And this is a two-liner that has two characters speaking. (Molly's name is printed because she's speaking from offscreen.) It starts with a dash on each line.

`30
00:01:21,165 --> 00:01:22,832
-I missed you so much.
-MOLLY: Been one night.`

Note that a long subtitle line may occasionally wrap onto an extra line when played in a media player. However, the file still stores the text as two lines, so it is processed with no issue. For example, this line from *Double Indemnity (1944)* renders onscreen as three lines because of the long second line.

`782
00:47:40,148 --> 00:47:43,985
From here on, it was a question
of following the timetable move by move.`

In [6]:
subs[29].text # one-liner

'I missed you.'

In [7]:
subs[69].text # two-liner from one character

"I mean, he's you know,\nhe's the vice president."

In [8]:
subs[30].text # two-liner spoken by two characters

'-I missed you so much.\n-MOLLY: Been one night.'

For best results during NLP processing, we'll want to separate two-line, two-character text into two separate lines. We'll also want to combine two-line, one-character text into a single line. The key to this is searching for the newline escape sequence.

If there's no newline escape, then it's a one-liner. If it has a newline sequence and both the top and bottom lines start with a dash, it's a two-line, two-character text and should be broken into two separate lines, discarding both dashes. And if it has a newline sequence but no dashes, it's a single character speaking across two lines, and we'll concatenate the two.

In [9]:
def clean_line(text):
    newline = text.find('\n')
    if newline == -1:                     # one-liner
        return text, 0
    elif text[0] == '-' and text[newline + 1] == '-': # two-liner spoken by two characters
        top_line = text[1:newline]
        bottom_line = text[newline + 2:]
        return top_line.lstrip(), bottom_line.lstrip()
    else:                                        # two-liner from one character
        concat_line = text[:newline] + ' ' + text[newline + 1:]
        return concat_line, 0

In [10]:
clean_line(subs[29].text) # one-liner

('I missed you.', 0)

In [11]:
clean_line(subs[30].text) # two-liner spoken by two characters

('I missed you so much.', 'MOLLY: Been one night.')

In [12]:
clean_line(subs[69].text) # two-liner from one character

("I mean, he's you know, he's the vice president.", 0)

With this function, we can separate or combine each line appropriately. These can be collected into a single list.

In [13]:
all_dialogue = []
for sub_object in subs:
    text = sub_object.text
    line_a, line_b = clean_line(text)
    all_dialogue.append(line_a)
    if line_b != 0:
        all_dialogue.append(line_b)
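
`clean_line` signals "no second line" with a `0` sentinel, so every caller has to check for it. As a hedged alternative, the sketch below returns a list of lines instead. `split_dialogue` is a hypothetical name, not a function from this notebook or from `subtitle_cleaning_io.py`, and unlike `clean_line` it also strips surrounding whitespace and tolerates empty text or more than two lines.

```python
def split_dialogue(text):
    """Return the dialogue lines contained in one subtitle entry.

    Hypothetical variant of clean_line: if every line starts with a dash,
    each line is a separate speaker; otherwise the lines are joined.
    """
    parts = [part.strip() for part in text.split('\n') if part.strip()]
    if len(parts) > 1 and all(part.startswith('-') for part in parts):
        return [part[1:].lstrip() for part in parts]   # drop the leading dashes
    return [' '.join(parts)] if parts else []


samples = [
    'I missed you.',
    '-I missed you so much.\n-MOLLY: Been one night.',
    "I mean, he's you know,\nhe's the vice president.",
]

collected = []
for sample in samples:
    collected.extend(split_dialogue(sample))   # no sentinel check needed

for line in collected:
    print(line)
```

With a list return value, the collection loop reduces to a single `extend` call per subtitle.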

In [14]:
all_dialogue[57:67]

['True.',
 'PRINCIPAL BROWN: I hope',
 'I never have to see any of you',
 'ever again, okay.',
 "That's it. Signin' off.",
 'Go, Crocketts!',
 '(mic feedback)',
 'Boom.',
 'MOLLY: Principal Brown?',
 '(groaning)']

In [15]:
print(len(subs)) # number of subtitle objects
print(len(all_dialogue)) # number of cleaned lines

2374
2700


### Parsing Specific Cases
Though the majority of subtitle text is spoken dialogue, there are non-dialogue lines which clarify non-word sounds like laughter or intentionally inaudible audio. They may also contain song lyrics or denote an off-screen speaker. All of these specific cases have distinct formatting.

- parenthetical, entire-line: may describe laughter, sighing, indistinct muttering
- music, entire-line: may transcribe song lyrics sung by characters, or non-diegetic score music
- laughter, partial- or entire-line: describes laughter as the entire line, or perhaps a quick chuckle before speaking dialogue. The list of strings that might describe laughter, like '(laughing)' or '(chuckles)', is small enough that we can hard-code them
- offscreen character: clarification on the speaker, if coming from an offscreen character. Hearing-impaired viewers need this clue because they aren't able to recognize voices
- italics, entire-line: may indicate narration, voice-over, or an off-screen voice speaking on the phone

Each of these checks is prototyped below and later packaged as a function in `subtitle_cleaning_io.py`.

In [16]:
for line in all_dialogue[57:67]:
    if line[:1] == '(' and line[-1:] == ')':    # parenthetical, entire-line
        print(line)

(mic feedback)
(groaning)


In [17]:
for line in all_dialogue[35:45]:
    if line[:1] == '♪' and line[-1:] == '♪':   # music, entire-line
        print(line)

♪ I don't wanna stress you out ♪
♪ I just wanna tell you the truth ♪
♪ Motherfuckers try to tear us apart ♪
♪ But we're electric linked ♪


In [18]:
laugh_strings = ['(laughing)', '(laughs)', '(chuckles)']      # laughter
for line in all_dialogue[650:700]:
    for laugh in laugh_strings:
        if laugh in line:
            print(line)

(laughing)
every single night! (laughs)


In [19]:
# this finds offscreen speakers in the form of 'ADAM: No way', but some subtitles format this as '[Adam] No way'
# the alternative case will be identified in subtitle_dataframes.ipynb
# for now, the alternative case is saved as a parenthetical

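for line in all_dialogue[30:50]:
    colon_find = line.find(':')            # off-screen speaker
    # find() returns -1 when there is no colon; skip those lines so that
    # an all-caps line such as 'NO!' is not mistaken for a speaker name
    if colon_find != -1 and line[:colon_find].isupper():
        print(line[:colon_find])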

AMY
MOLLY
MOLLY
BOY
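
Both speaker formats mentioned in the comments above can also be matched with regular expressions. This is only a sketch under assumptions: `find_speaker` and the two patterns are hypothetical, and they are not the approach used in `subtitle_dataframes.ipynb`.

```python
import re

# Hypothetical patterns for the two formats: 'ADAM: No way' and '[Adam] No way'.
COLON_SPEAKER = re.compile(r"^([A-Z][A-Z .'\-]*):\s*(.*)$")
BRACKET_SPEAKER = re.compile(r"^\[([^\]]+)\]\s*(.+)$")


def find_speaker(line):
    """Return (speaker, remaining_text); speaker is None if no label is found."""
    for pattern in (COLON_SPEAKER, BRACKET_SPEAKER):
        match = pattern.match(line)
        if match:
            return match.group(1).strip(), match.group(2)
    return None, line


print(find_speaker('MOLLY: Been one night.'))   # ('MOLLY', 'Been one night.')
print(find_speaker('[Adam] No way'))            # ('Adam', 'No way')
print(find_speaker('I missed you.'))            # (None, 'I missed you.')
```

The bracket pattern requires some text after the closing bracket, so a bare sound description such as `[laughs]` is not treated as a speaker label.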


In [20]:
subs = pysrt.open('../subtitles/the_grand_budapest_hotel.srt') # switching subtitles to The Grand Budapest Hotel
subs.insert(0, subs[0])
all_dialogue = []
for sub_object in subs:
    text = sub_object.text
    line_a, line_b = clean_line(text)
    all_dialogue.append(line_a)
    if line_b != 0:
        all_dialogue.append(line_b)
        
for line in all_dialogue[25:29]:                # italics
    if line[:3] == '<i>' and line[-4:] == '</i>':
        print(line)

<i>I decided to spend the month of August</i>
<i>in the spa town of Nebelsbad below the Alpine Sudetenwaltz,</i>
<i>and had taken up rooms in the Grand Budapest,</i>
<i>a picturesque, elaborate, and once widely celebrated establishment.</i>


## Gathering Text and Populating DataFrame
We've turned the above functionality into various functions found in `subtitle_cleaning_io.py`. Each of the functions returns cleaned text, which will be fed as NLP input, as well as a flag or other piece of information for the DataFrame. For example, the `italic_clean()` function returns a flag if the entire line is in italics. The `speaker_clean()` function will return the speaker name, if any.
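
The source of `subtitle_cleaning_io.py` is not shown here, so the following is only a guess at the shape of two of these functions, based on how they are called below: each takes a line and returns a `(flag, cleaned_line)` pair. The `_sketch` suffix marks them as hypothetical stand-ins, not the module's actual code.

```python
def italic_clean_sketch(line):
    """Return (flag, text); flag is 1 when the whole line is wrapped in <i>...</i>."""
    if line.startswith('<i>') and line.endswith('</i>'):
        return 1, line[3:-4]      # strip the opening and closing tags
    return 0, line


def parenthetical_clean_sketch(line):
    """Return (flag, text); an entire-line parenthetical is flagged and blanked."""
    if line.startswith('(') and line.endswith(')'):
        return 1, ''              # nothing is left to feed into NLP
    return 0, line


print(italic_clean_sketch('<i>I decided to spend the month of August</i>'))
print(parenthetical_clean_sketch('(mic feedback)'))
print(parenthetical_clean_sketch('Boom.'))
```

Because every function returns the possibly shortened line, the calls can be chained, with each one passing its cleaned text to the next.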

In [21]:
subs = pysrt.open('../subtitles/booksmart.srt')
subs.insert(0, subs[0])
single_lines = generate_single_lines(subs)

In [22]:
italic_flags = []
music_flags = []
laugh_flags = []
speakers = []
parenthetical_flags = []

cleaned_lines = []

for line in single_lines:
    entire_line_italic, line = italic_clean(line)
    italic_flags.append(entire_line_italic)
    
    entire_line_music, line = music_clean(line)
    music_flags.append(entire_line_music)
    
    laugh_found, line = laugh_clean(line)
    laugh_flags.append(laugh_found)
    
    speaker, line = speaker_clean(line)
    speakers.append(speaker)
    
    entire_line_parenthetical, line = parenthetical_clean(line)
    parenthetical_flags.append(entire_line_parenthetical)
    
    cleaned_lines.append(line)

In [23]:
for line in cleaned_lines[100:105]:
    print(line)

Don't call her that.
Everybody calls her that.
She gave roadside assistance
to three senior guys last year.
You hear them getting degrading Nicknames?


Not only do we have a list of cleaned, NLP-ready dialogue, but we also have various lists that we'll use for the DataFrame. Here are two examples: these lists indicate where a line contains laughter, or a specific offscreen speaker.

In [24]:
laugh_flags[80:100]

[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [25]:
speakers[100:120]

['none',
 'none',
 'none',
 'none',
 'AMY',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'none',
 'MISS FINE',
 'none',
 'none']

We now have a list of clean dialogue lines, each properly separated or concatenated. Some blank lines remain, generated wherever an entire line was removed during cleaning. The blank in the output below, for example, comes from the parenthetical description "(mic feedback)".

We can easily remove these.

In [26]:
for line in cleaned_lines[59:65]:
    print(line)

I never have to see any of you
ever again, okay.
That's it. Signin' off.
Go, Crocketts!

Boom.


In [27]:
blanks_removed = []

for line in cleaned_lines:
    if line:
        blanks_removed.append(line)

In [28]:
print(len(cleaned_lines))  # all cleaned lines
print(len(blanks_removed)) # blanks removed

2700
2270


The `blanks_removed` list is ready to be put through various NLP analyses. We can also start populating the DataFrame by combining all the lists. Both steps take place in the next notebooks.
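
One thing to keep in mind when combining the lists: the flag lists have one entry per cleaned line (2700), while `blanks_removed` is shorter (2270), so their indexes no longer line up. A possible way to keep everything aligned, sketched here with a few illustrative values rather than the real data, is to zip the lists into per-line records before dropping blanks. The record keys are assumptions, not column names from the later notebooks.

```python
# Illustrative values only; in the notebook these lists come from the loop above.
cleaned_lines = ['Go, Crocketts!', '', 'Boom.', 'Principal Brown?']
speakers = ['none', 'none', 'none', 'MOLLY']
parenthetical_flags = [0, 1, 0, 0]

# One record per cleaned line keeps each flag attached to its text.
rows = [
    {'line': line, 'speaker': speaker, 'parenthetical': flag}
    for line, speaker, flag in zip(cleaned_lines, speakers, parenthetical_flags)
]

# Dropping blank lines at the record level keeps the remaining fields aligned.
spoken_rows = [row for row in rows if row['line']]

print(len(rows), len(spoken_rows))   # 4 3
print(spoken_rows[-1])
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame` to build the table.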