## Read and Parse Mulan Screenplay

https://imsdb.com/scripts/Mulan.html

Load libraries for web scraping and regular expressions

In [3]:
from bs4 import BeautifulSoup
import requests
import re

Read in Mulan screenplay

In [6]:
response = requests.get("https://imsdb.com/scripts/Mulan.html")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

Isolate the text

In [10]:
mulan_screenplay = document.find("pre").text

In [49]:
# Empty list that will hold one dictionary per line of dialogue, later turned into a Pandas dataframe
dict_to_df = []

for line in mulan_screenplay.split("\r\n\r\n"):
    
    # Split character name and dialogue
    character_dialogue = re.split(":", line)
    
    # Treat the block as dialogue only if it contains at least one colon
    if len(character_dialogue) >= 2:
        
        # First item is name
        name = character_dialogue[0]
        
        # Other items should be dialogue
        dialogue = " ".join(character_dialogue[1:])
        
        print(f"{name}: {dialogue}")
        
        # Append a dictionary to the empty list
        dict_to_df.append({
            "character": name,
            "dialogue": dialogue
        })

Disney's Mulan
Compiled by Barry Adams  during theater showings in 1998
Last updated:  August 18, 1998
Guard [yelling]:   We're under attack!  Light the signal!
Guard [sternly]:   Now all of China knows you're here.
Shan-Yu [taking the flag and holding it over the fire]:   Perfect.
General Li:   Your Majesty, the Huns have crossed our Northern border.
Chi Fu:   Impossible! No one can get through The Great Wall.  [The Emperor
motions for Chi Fu's silence]
General Li:   Shun-Yu is leading them.  We'll set up defenses around your
palace immediately.
Emperor [forcefully]:   No!  Send your troops to protect my people.  Chi Fu, 
Chi Fu:   Yes, your highness.
Emperor:  Deliver conscription notices throughout all the provinces.  Call up
reserves and as many new recruits as possible.
General Li:   Forgive me your Majesty, but I believe my troops can stop him.
Emperor:   I wont take any chances, General.  A single grain of rice can tip
the scale.  One man may be the difference between victory an
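The loop above rejoins everything after the first colon with a space, so any colon inside the dialogue itself is replaced. A minimal sketch of an alternative that splits on the first colon only follows. `split_speaker` is a hypothetical helper introduced here for illustration; it is not part of the notebook.

```python
def split_speaker(block):
    """Split one screenplay block into (speaker, dialogue) at the first colon.

    Returns None when the block contains no colon.
    Hypothetical helper for illustration; not part of the original notebook.
    """
    name, separator, dialogue = block.partition(":")
    if not separator:
        # No colon found, so this block is not "Name: dialogue"
        return None
    return name.strip(), dialogue.strip()


print(split_speaker("Guard [yelling]:   We're under attack!  Light the signal!"))
print(split_speaker("Disney's Mulan"))
```

Note that the title block would still pass this check, because it contains "Last updated:". Rows like that need a separate filter.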

Make a dataframe

In [51]:
import pandas as pd

In [52]:
pd.DataFrame(dict_to_df)

Unnamed: 0,character,dialogue
0,Disney's Mulan\r\nCompiled by Barry Adams dur...,"August 18, 1998"
1,Guard [yelling],We're under attack! Light the signal!
2,Guard [sternly],Now all of China knows you're here.
3,Shan-Yu [taking the flag and holding it over t...,Perfect.
4,General Li,"Your Majesty, the Huns have crossed our Nort..."
...,...,...
542,Mushu [spoken while swinging on a chain],Call out for egg rolls!
543,First Ancestor [disgusted],Guardians.
544,Mulan,"Thanks, Mushu [kisses Mushu on the forehead]."
545,Little Brother,"Bark, bark, bark, bark, bark, bark, bark"


In [58]:
mulan_dilaogue_df = pd.DataFrame(dict_to_df)

Remove dialogue directions

In [53]:
def remove_direction(text):
    # Raw string so that \[ and \] reach the regex engine as escaped brackets
    return re.sub(r"\[.*\]", "", text)

In [59]:
mulan_dilaogue_df["character"] = mulan_dilaogue_df["character"].apply(remove_direction)
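The pattern `\[.*\]` is greedy: when a string contains two bracketed directions, it removes everything from the first `[` to the last `]`, including any words in between. The sketch below compares it with a lazy variant. `remove_direction_lazy` is a hypothetical name, and the sample string is made up for illustration (the `[sighs]` direction is not from the screenplay).

```python
import re


def remove_direction_lazy(text):
    # ".*?" stops at the first closing bracket, so each [...] pair is removed separately
    return re.sub(r"\[.*?\]", "", text)


# Made-up sample with two directions in one string
sample = "[sighs] Thanks, Mushu [kisses Mushu on the forehead]."

greedy = re.sub(r"\[.*\]", "", sample)
lazy = remove_direction_lazy(sample)

print(repr(greedy))  # only the final period survives
print(repr(lazy))    # the spoken words survive
```

For character names, which usually carry a single direction, both patterns behave the same. The difference matters mainly if the function is later applied to the dialogue column.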

Strip whitespace

In [65]:
mulan_dilaogue_df['character'] = mulan_dilaogue_df['character'].str.strip()

Count number of words

In [99]:
def count_words(dialogue):
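    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    words = re.split(r"\W+", dialogue)
    # Keep only tokens longer than 1 character; this drops empty strings,
    # but also one-letter words such as "I" and "a"
    words = [word for word in words if len(word) > 1]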
            
    num_words = len(words)
    return num_words
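Because `count_words` splits on every non-word character, a contraction such as "We're" counts as two words ("We" and "re"), while "I" and "a" are dropped entirely. A minimal sketch of a different tokenization follows, assuming we want contractions to count once and one-letter words to be kept. `count_words_keep_contractions` is a hypothetical name, and the pattern only recognizes ASCII letters.

```python
import re

# One or more ASCII letters, optionally followed by apostrophe + letters ("We're", "you're")
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def count_words_keep_contractions(dialogue):
    # findall returns every non-overlapping match; digits and punctuation are ignored
    return len(WORD_PATTERN.findall(dialogue))


print(count_words_keep_contractions("We're under attack! Light the signal!"))
print(count_words_keep_contractions("I wont take any chances, General."))
```

On the first line this gives 6 rather than the 7 reported above. Neither count is "the" correct one; the point is that the totals depend on the tokenization rule.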

In [97]:
mulan_dilaogue_df["num_words"] = mulan_dilaogue_df["dialogue"].apply(count_words)

In [98]:
mulan_dilaogue_df

Unnamed: 0,character,dialogue,num_words
0,Disney's Mulan\r\nCompiled by Barry Adams dur...,"August 18, 1998",3
1,Guard,We're under attack! Light the signal!,7
2,Guard,Now all of China knows you're here.,8
3,Shan-Yu,Perfect.,1
4,General Li,"Your Majesty, the Huns have crossed our Nort...",9
...,...,...,...
542,Mushu,Call out for egg rolls!,5
543,First Ancestor,Guardians.,1
544,Mulan,"Thanks, Mushu [kisses Mushu on the forehead].",7
545,Little Brother,"Bark, bark, bark, bark, bark, bark, bark",7


## How many words did each character speak?

In [100]:
mulan_dilaogue_df.groupby("character")["num_words"].sum().sort_values(ascending=False)

character
Mushu              2260
Mulan              1348
Shang               834
Chi Fu              419
Yao                 338
                   ... 
Archer Guy            1
Recruits              1
Cow                   1
Man in Crowd #2       1
All Soldiers          1
Name: num_words, Length: 83, dtype: int64
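To make the `groupby(...).sum().sort_values(...)` chain less opaque, here is a sketch of the same aggregation in plain Python. It uses the four dialogue rows shown at the top of the dataframe above; the variable names `rows`, `totals` and `ranking` are illustrative only.

```python
from collections import Counter

# Rows 1-4 of the dataframe shown above
rows = [
    {"character": "Guard", "num_words": 7},
    {"character": "Guard", "num_words": 8},
    {"character": "Shan-Yu", "num_words": 1},
    {"character": "General Li", "num_words": 9},
]

# Sum num_words per character, which is what groupby("character")["num_words"].sum() does
totals = Counter()
for row in rows:
    totals[row["character"]] += row["num_words"]

# most_common() orders by total, largest first, like sort_values(ascending=False)
ranking = totals.most_common()
print(ranking)
```

Grouping is by exact string, so any leftover whitespace or spelling variant in a name would create a separate group.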

## Why don't our results match up with The Pudding's?

In [103]:
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 10000

In [104]:
mulan_dilaogue_df

Unnamed: 0,character,dialogue,num_words
0,Disney's Mulan\r\nCompiled by Barry Adams during theater showings in 1998\r\nLast updated,"August 18, 1998",3
1,Guard,We're under attack! Light the signal!,7
2,Guard,Now all of China knows you're here.,8
3,Shan-Yu,Perfect.,1
4,General Li,"Your Majesty, the Huns have crossed our Northern border.",9
5,Chi Fu,Impossible! No one can get through The Great Wall. [The Emperor\r\nmotions for Chi Fu's silence],16
6,General Li,Shun-Yu is leading them. We'll set up defenses around your\r\npalace immediately.,14
7,Emperor,"No! Send your troops to protect my people. Chi Fu,",10
8,Chi Fu,"Yes, your highness.",3
9,Emperor,Deliver conscription notices throughout all the provinces. Call up\r\nreserves and as many new recruits as possible.,17
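The full view shows two things that may inflate the counts: `remove_direction` was applied only to the character column, so bracketed directions remain in the dialogue, and some directions wrap across a `\r\n`, which `.` does not match by default. The sketch below is one possible cleanup, not a confirmed explanation of the gap with The Pudding. `clean_dialogue` is a hypothetical helper, and `count_words` is copied from the notebook so the block runs on its own.

```python
import re


def count_words(dialogue):
    # Same logic as the notebook's count_words
    words = re.split(r"\W+", dialogue)
    return len([word for word in words if len(word) > 1])


def clean_dialogue(dialogue):
    # re.DOTALL lets "." match newlines, so directions that wrap across lines are removed
    without_directions = re.sub(r"\[.*?\]", "", dialogue, flags=re.DOTALL)
    # Collapse \r\n and repeated spaces into single spaces
    return " ".join(without_directions.split())


# Row 5 of the dataframe above
raw = "Impossible! No one can get through The Great Wall.  [The Emperor\r\nmotions for Chi Fu's silence]"
cleaned = clean_dialogue(raw)

print(cleaned)
print(count_words(raw), count_words(cleaned))
```

For this row the count drops from 16 to 9 once the direction is removed. Row 0, which is the title block rather than dialogue, would still need to be dropped separately.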


In [5]:
html_string

'<html>\r\n<head><meta name="viewport" content="width=device-width, initial-scale=1" />\r\n<meta name="HandheldFriendly" content="true">\r\n<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">\r\n<meta http-equiv="Content-Language" content="EN">\r\n\r\n<meta name=objecttype CONTENT=Document>\r\n<meta name=ROBOTS CONTENT="INDEX, FOLLOW">\r\n<meta name=Subject CONTENT="Movie scripts, Film scripts">\r\n<meta name=rating CONTENT=General>\r\n<meta name=distribution content=Global>\r\n<meta name=revisit-after CONTENT="2 days">\r\n\r\n<link href="/style.css" rel="stylesheet" type="text/css">\r\n\r\n<script type="text/javascript">\r\n  var _gaq = _gaq || [];\r\n  _gaq.push([\'_setAccount\', \'UA-3785444-3\']);\r\n  _gaq.push([\'_trackPageview\']);\r\n\r\n  (function() {\r\n    var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\n    ga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-a