This script generates the data needed for the network analysis of the show South Park. These are:  
 - The characters
 - The scripts, split in scenes, for each episode  
 
The analysis is based on the relationships between characters, built from the scenes of each episode. This requires the following steps:  
 - Get the scripts for each episode of each season from the South Park Fandom wiki.
 - From each script, extract the characters, what they say, and the scenes, which act as separators. An episode may consist of parallel stories with different character interactions, so relationships are created only between characters who appear in the same scene, not merely in the same episode.
 
From the data gathered above, the list of unique characters can be built. After some formatting, the scripts can also be used for sentiment and other text-based analysis.  

The character names can be acquired in two ways:  
1. From the beginning of each line of dialogue in the script. This is the most obvious way.
2. From the wiki itself, which lists all the characters of an episode at the beginning of its script page.  

However, the second way uses full names, so further analysis would be required to match those characters to the names used in the scripts. For this reason, the first method is used.

In [12]:
# Necessary imports
import re
import requests
from bs4 import BeautifulSoup
import numpy as np
from pathlib import Path
import pandas as pd

from tqdm.autonotebook import tqdm

In [13]:
# Create a folder to store the scripts

scripts_dir = Path.cwd() / 'Scripts'
scripts_dir.mkdir(exist_ok=True)

### Part 1: Get the scripts links for all episodes of each season  

The goal is to get the source of the wiki page that stores the links. This is easy with the library `bs4`, because the source page contains only one textbox. Regular expressions are then used to extract the links to the scripts of each season.
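
As a minimal sketch of the extraction step, the season pattern used in `get_links_titles` below can be tried on a couple of wikitext lines. The sample lines are hypothetical and only imitate the `|[[target|title]]` shape the pattern expects; the real page source may differ.

```python
import re

# Hypothetical wikitext lines, shaped like the entries the season pattern targets
sample_wikitext = (
    "|[[Season One/Scripts|Season One]]\n"
    "|[[Season Two/Scripts|Season Two]]\n"
)

# Group 1 is the link target, group 2 is the displayed title
season_pattern = r'\|\[{2}(.+)\|(.+)\]{2}'
pairs = re.findall(season_pattern, sample_wikitext)

for target, title in pairs:
    print(target, '->', title)
```

Because `.` does not match a newline, each match stays within one line, which is why one link per line is assumed here.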

In [14]:
def link_to_url(link):
    '''
    Convert a link acquired from scraping into the wiki URL of its source text.
    '''
    # Have a base url to append the links to, in order to request the page needed at a time
    base_url = 'https://southpark.fandom.com/wiki/'
    
    # Define an 'api' string that, when appended to a url, directs to the page source
    source_api = '?action=edit'
    
    url = base_url + link + source_api
    return url
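
The escaping of special characters is done by hand later in this notebook (`?` is replaced with `%3F`). A possible alternative, sketched here under the assumption that general percent-encoding is acceptable for the wiki, is to let `urllib.parse.quote` handle it. The function name `link_to_url_quoted` is hypothetical and not part of the notebook.

```python
from urllib.parse import quote

BASE_URL = 'https://southpark.fandom.com/wiki/'
SOURCE_SUFFIX = '?action=edit'

def link_to_url_quoted(link):
    """Hypothetical variant of link_to_url that percent-encodes the page name."""
    # '/' and ':' are kept as-is; characters such as '?' become '%3F'
    page = quote(link.replace(' ', '_'), safe='/:')
    return BASE_URL + page + SOURCE_SUFFIX

example_url = link_to_url_quoted('Season One/Scripts')
print(example_url)
```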

In [15]:
def url_to_textbox(url):
    '''
    Return the text from the (unique) textbox of the source wiki page
    '''
    raw_html = requests.get(url)
    soup = BeautifulSoup(raw_html.text, 'html.parser')
    main_table = soup.find_all('textarea', class_='mw-editfont-default')
    
    try:
        text = main_table[0].text
        return text
    except IndexError as e:
        print(e)
        print(url)
        return -1

Links for each season:

In [16]:
def get_links_titles(text, mode='s'):
    '''
    Return a tuple of two lists: the urls and the corresponding numbered titles.
    mode: 's' for seasons, 'e' for episodes.
    '''
    if mode.lower()=='s':
        pattern = r'\|\[{2}(.+)\|(.+)\]{2}'
    elif mode.lower() == 'e':
        pattern = r'\"\[{2}(.+)\|(.+)\]{2}'
    else:
        print('wrong mode')
        return None
    
    matches = re.findall(pattern, text)
    max_elements = len(str(len(matches))) if matches else 0
    
    links = [] # Separate lists for urls
    titles = [] # and titles
    for i, match in enumerate(matches):
        links.append(link_to_url(match[0].replace(' ', '_').replace('?', '%3F')))
        
        # Left-pad the number with zeros to avoid sorting problems
        el_num = str(i+1).zfill(max_elements)
        element_title = match[1].replace(' ', '_')
        element_title = element_title.replace("'", '_')
        element_title = element_title.replace('?', '%3F')
        titles.append(el_num+'_'+element_title)
    return links, titles
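
The reason for zero-padding the numbers can be seen in a short sketch. The titles below are hypothetical placeholders; the point is only that plain string sorting puts `10_...` before `1_...` unless the numbers have equal width.

```python
titles = [f'Episode_{n}' for n in range(1, 13)]  # hypothetical titles

# Without padding, lexicographic order differs from numeric order
unpadded = [f'{i + 1}_{t}' for i, t in enumerate(titles)]

# With padding to the width of the largest number, both orders agree
width = len(str(len(titles)))
padded = [f'{str(i + 1).zfill(width)}_{t}' for i, t in enumerate(titles)]

print(sorted(unpadded)[:3])
print(sorted(padded)[:3])
```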

In the textbox of each script's source page, two elements can be identified:  
 - `ScriptScene`: describes a scene setting  
 - `ScriptDialog`: gives the character and what they say  

Regular expressions are used to isolate the content of each of these elements.  

Since the results are stored in a text file, the strategy for separating them later is as follows:  
- Each scene description begins and ends with 5 `+`: `+++++Something happens+++++`
- The character's name is followed by a space, `:`, a space, and then what they say.
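
The target text format can be illustrated with a small self-contained sketch. It works line by line on a hypothetical wikitext sample and uses the five-`+` separator that the code below defines as `scene_separators`; the helper name `format_wikitext` is an assumption for illustration and is simpler than the functions that follow.

```python
import re

SEP = '+' * 5  # same separator as `scene_separators` in the notebook code

def format_wikitext(wikitext):
    """Turn ScriptScene / ScriptDialog lines into the plain-text format."""
    lines = []
    for raw in wikitext.splitlines():
        scene = re.match(r'\{\{ScriptScene\|(.+)\}\}', raw)
        dialog = re.match(r'\{\{ScriptDialog\|(.+?)\|(.+)\}\}', raw)
        if scene:
            lines.append(SEP + scene.group(1) + SEP)
        elif dialog:
            lines.append(dialog.group(1) + ' : ' + dialog.group(2))
    return '\n'.join(lines) + '\n'

# Hypothetical sample, not taken from a real script
sample = (
    '{{ScriptScene|The bus stop}}\n'
    '{{ScriptDialog|Stan|Hey, dude.}}\n'
    '{{ScriptDialog|Kyle|Hey.}}\n'
)
formatted = format_wikitext(sample)
print(formatted)
```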

In [17]:
def process_dialog(dialog_textbox):
    dialog_pattern = r'\{\{ScriptDialog\|(.+)\|(.+)\}\}'
    matches = re.findall(dialog_pattern, dialog_textbox)
    
    document = ''
    for match in matches:
        document += match[0] + " : "
        document += match[1] + '\n'
    return document

In [18]:
def process_script_textbox(textbox):
    # Identify the scenes. Record each scene's content and its start and end index
    scene_pattern = r'\{\{ScriptScene\|(.+)\}\}'
    matches  =re.finditer(scene_pattern, textbox)
    
    scene_separators = '+'*5
    
    starts = []
    ends = []
    content = []
    for match in matches:
        starts.append(match.start())
        ends.append(match.end())
        content.append(match.group(1))
    
    
    # Take only the parts between scenes. Each episode starts with a scene and ends with dialogue
    starts_ = starts[1:]
    ends_ = ends[:-1]
    
    # Start creating the document for this episode with the first scene
    document = scene_separators + content[0] + scene_separators + '\n'
    
    content_i = 1
    for start_i, end_i in zip(ends_, starts_):
        dialog_txt = textbox[start_i: end_i]
                
        processed_dialog_txt = process_dialog(dialog_txt)
        document += processed_dialog_txt
        
        # Add the next scene text
        document += scene_separators + content[content_i] + scene_separators + '\n'
        content_i += 1
    
    # Add the last dialog part
    dialog_txt = textbox[ends[-1]:]

    processed_dialog_txt = process_dialog(dialog_txt)
    document += processed_dialog_txt
        
    return document

In [46]:
url = 'https://southpark.fandom.com/wiki/Portal:Scripts?action=edit'
seasons_textbox = url_to_textbox(url)
season_urls, season_names = get_links_titles(seasons_textbox)

# Create a dataframe to store season names, links and episode title and link
links_df = pd.DataFrame({'Season':[], 'Episode':[], 'URL':[]})

for i in tqdm(range(len(season_names))):
    # Create a dir in the scripts dir for the season
    season_path = scripts_dir / season_names[i]
    season_path.mkdir(exist_ok=True)
    
    # Get all the episode links and names of the season
    episodes_textbox = url_to_textbox(season_urls[i])
    episodes_urls, episodes_names = get_links_titles(episodes_textbox, 'e')
    
    # Create lists for the dataframe update
    season_nm = []
    episode_nm = []
    episode_lnk = []
    for j in range(len(episodes_names)):
        
        # To update dataframe
        season_nm.append(season_names[i])
        episode_nm.append(episodes_names[j])
        episode_lnk.append(episodes_urls[j])
        
        # Get the raw data for the scripts of this episode
        episode_textbox = url_to_textbox(episodes_urls[j])
        
        # Find scenes and dialogs and add them to a formatted document
        document = process_script_textbox(episode_textbox)
        
        # Save the formatted script in a text file    
        ep_name = episodes_names[j].replace(':', ' ')
        with open(season_path.as_posix()+f'/{ep_name}.txt', 'w', encoding='utf8') as f:
            f.write(document)
            
    temp_df = pd.DataFrame({'Season': season_nm, 'Episode':episode_nm, 'URL':episode_lnk})
    links_df = pd.concat([links_df, temp_df])
        

# Save the created dataframe in a csv file
links_df.to_csv('episode_script_urls.csv', index=False)


### Part 2: Get the characters from the documents  

Several steps could be combined, such as building the character list and building the relationships between characters for the network. However, it is clearer to proceed step by step, since performance is not a concern and the script does not need to run in real time.

In [47]:
def get_episode_characters(formatted_episode_script):
    characters_pattern = r"\n([\w\d\s.,'-]+)\s:"
    matches = re.findall(characters_pattern, formatted_episode_script)
    
    characters = []
    for match in matches:
        # Remove groups
        if len(match.split(','))>1:
            continue
        if len(match.split(' and '))>1:
            continue
        if len(match.strip().split(" ")) > 1 and match.strip().split(' ')[-1].isnumeric():
            continue        
        characters.append(match.strip())
        
    
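    # Remove duplicates due to introduction: drop a multi-word name when one of
    # its words is already present as a single-word character name
    unique_characters = set(characters)
    single_name = [character for character in unique_characters if len(character.split())<2]
    dual_name = [character for character in unique_characters if len(character.split())>1]
    
    # Build a new list instead of removing from a list while iterating over it,
    # which would silently skip elements
    tmp_dual_chars = [el for el in dual_name
                      if not any(character in el.split() for character in single_name)]

    characters = single_name + tmp_dual_chars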
    
    # Remove the aggregation of Man and Woman and generic names
    if 'Man' in characters:
        characters.remove('Man')
    if 'Woman' in characters:
        characters.remove('Woman')
    if 'All' in characters:
        characters.remove('All')
    if 'Everyone' in characters:
        characters.remove('Everyone')
    
    return set(characters)
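
The de-duplication step inside `get_episode_characters` can be looked at in isolation. The sketch below is a simplified stand-in with a hypothetical name and hypothetical sample names: a multi-word name is dropped when one of its words already exists as a single-word character.

```python
def drop_full_name_duplicates(names):
    """Keep single-word names, plus multi-word names that share no word with them."""
    unique = set(names)
    single = {n for n in unique if len(n.split()) < 2}
    multi = {n for n in unique if len(n.split()) > 1}
    kept_multi = {n for n in multi if not single.intersection(n.split())}
    return single | kept_multi

# 'Stan Marsh' is dropped because 'Stan' is already a character
result = drop_full_name_duplicates(
    ['Stan', 'Stan Marsh', 'Mr. Garrison', 'Kyle', 'Kyle']
)
print(sorted(result))
```

One consequence of this rule is that two different characters sharing a word with a single-word name would both be dropped, so the output is worth checking by eye.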

In [48]:
characters = []
for ep_script in tqdm(scripts_dir.glob('**/*.txt')):
    with open(ep_script.as_posix(), 'r', encoding='utf8') as f:
        document = f.read()
    episode_characters = get_episode_characters(document)
    characters += episode_characters
    
# We want unique characters, avoid repetition between episodes
characters = list(set(characters))

# Save the characters as a pandas dataframe
characters_df = pd.DataFrame.from_dict({'name':characters})
characters_df.to_csv('characters_df.csv', index=False)


### Relationships between characters

In [22]:
relationships_path = Path.cwd() / 'Relationships'
relationships_path.mkdir(exist_ok=True)

Only the character names are needed here. A dictionary mapping full names to first names would be one option; a plain list of the names collected above is simpler:

In [23]:
# Use the list of character names
names = characters_df.name.to_list()

In [24]:
# Create a dictionary to create a data frame from it afterwards
# It will be of the form:   'first_char': 'second_char'
characters_interactions = {} 

Define two functions used to search for names and relationships in the scripts:

In [25]:
def get_characters_in_text(text, characters_list):

    pattern = r'\n(.+)\s:'
    matches = re.findall(pattern, text)
    chars = [nm.strip() for nm in matches]
    
    char_list = []
    for character in chars:
        if character in characters_list:
            char_list.append(character)
    
    return char_list

In [26]:
def create_relationship_dict(char_list):
    relationship_dict_list = []
    for i, el in enumerate(char_list[:-1]):
        for character in char_list[i+1:]:
            if not character == el:
                relationship_dict_list.append({ 'source':el, 'target':character })
    return relationship_dict_list
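
For a worked example of what `create_relationship_dict` produces, the sketch below expresses the same pairing with `itertools.combinations`. It is an equivalent formulation for illustration, with a hypothetical function name, and not the code the notebook runs.

```python
from itertools import combinations

def scene_pairs(speakers):
    """Pair every speaker line with every later line, skipping self-pairs."""
    return [
        {'source': a, 'target': b}
        for a, b in combinations(speakers, 2)
        if a != b
    ]

# Stan speaks twice in this hypothetical scene, so the Stan/Kyle pair
# appears twice, once in each direction
pairs = scene_pairs(['Stan', 'Kyle', 'Stan'])
print(pairs)
```

Since the input holds one entry per line of dialogue, characters who talk more in a scene generate more pairs, which is what later becomes the edge weight.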

In [50]:
# Define a regex pattern. Compile it to be faster since there are many files
pattern = r"[+]{2}\n([^+]+)[+]{2}"
prog = re.compile(pattern)

total_relationships_dict_list = []

for folder in tqdm(scripts_dir.iterdir()):
    season_relationship_dict_list = []
    season_name = folder.parts[-1]
    for file_ in folder.iterdir():
        
        episode_relationship_dict_list = []
        season_nr = file_.as_posix().split('/')[-2]
        fname = file_.as_posix().split('/')[-1]
        fname = fname.split('.')[0]

        # Create a folder for each season and save the csv of the relationships in there for each episode
        season_path = relationships_path / f"{season_nr}"
        season_path.mkdir(exist_ok=True)

        with open(file_, 'r', encoding='utf-8') as f:
            test_txt = f.read()


        # Use regex to find the text between the pluses
        results = prog.findall(test_txt)
        for result in results:
            # Get the list of characters in this scene
            chars_in_part = get_characters_in_text(result, names)
            # If there are more than 1 characters in the list, create a relationship between them and
            # append to the corresponding lists

            if len(chars_in_part)>1:
                rel_lst = create_relationship_dict(chars_in_part)                
                episode_relationship_dict_list += rel_lst
                season_relationship_dict_list += rel_lst
                total_relationships_dict_list += rel_lst

        # For this episode, create now a dataframe from the episode relationships
        episode_rel_df = pd.DataFrame(episode_relationship_dict_list)

        # The same pair can appear many times, and in either order.
        # Sort each row so that a given pair of characters always ends up
        # with the same source and the same target column
        episode_rel_df = pd.DataFrame( np.sort(episode_rel_df.values, axis=1), columns=episode_rel_df.columns )

        # The duplicates are then summed to form weights on the edges, representing
        # how strong the relationship is
        episode_rel_df['weight'] = 1 # initialize
        try:
            episode_rel_df = episode_rel_df.groupby(['source', 'target'], sort=False, as_index=False).sum()
            episode_rel_df.to_csv(season_path.as_posix()+'/'+fname+'.csv')
        except KeyError as e:
            print(e)
            print(season_nr, fname)
            print(episode_rel_df)
            print()
    
    season_relationship_dict_list = pd.DataFrame(season_relationship_dict_list)
    season_relationship_dict_list = pd.DataFrame( np.sort(season_relationship_dict_list.values, axis=1), columns=season_relationship_dict_list.columns )
    season_relationship_dict_list['weight'] = 1
    season_relationship_dict_list = season_relationship_dict_list.groupby(['source', 'target'], sort=False, as_index=False).sum()
    season_relationship_dict_list.to_csv(relationships_path.as_posix()+'/'+season_name+'.csv')
    
        
        
# Do the same for the total relationship
total_relationships_dict_list = pd.DataFrame(total_relationships_dict_list)
total_relationships_dict_list = pd.DataFrame( np.sort(total_relationships_dict_list.values, axis=1), columns=total_relationships_dict_list.columns )
total_relationships_dict_list['weight'] = 1
total_relationships_dict_list = total_relationships_dict_list.groupby(['source', 'target'], sort=False, as_index=False).sum()
total_relationships_dict_list.to_csv(relationships_path.as_posix()+'/'+'total_relationships'+'.csv')


In [51]:
total_relationships_dict_list

| | source | target | weight |
|---|---|---|---|
| 0 | Cartman | Kyle | 20509 |
| 1 | Kyle | Stan | 17212 |
| 2 | Ike | Kyle | 591 |
| 3 | Cartman | Stan | 16394 |
| 4 | Cartman | Ike | 209 |
| ... | ... | ... | ... |
| 17319 | Linda Black | Stevens | 2 |
| 17320 | Butters | Stevens | 2 |
| 17321 | Butters | Saint Patrick | 24 |
| 17322 | Saint Patrick | Yates | 16 |
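
The weights in this table count how often a pair occurs once its direction is ignored. A minimal sketch of that idea, using only the standard library instead of the `np.sort` and `groupby` combination used above, could look as follows. The helper name and the sample pairs are hypothetical.

```python
from collections import Counter

def weighted_edges(pairs):
    """Count undirected pairs: (a, b) and (b, a) contribute to the same edge."""
    counts = Counter(tuple(sorted(pair)) for pair in pairs)
    return [
        {'source': a, 'target': b, 'weight': w}
        for (a, b), w in counts.items()
    ]

edges = weighted_edges([
    ('Kyle', 'Cartman'),
    ('Cartman', 'Kyle'),
    ('Stan', 'Kyle'),
])
print(edges)
```

Sorting each pair alphabetically plays the same role as sorting each dataframe row: it fixes which name is the source and which is the target before counting.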


### Create each character's text

In [53]:
texts_path = Path.cwd() / 'Texts'
texts_path.mkdir(exist_ok=True)

In [76]:
def get_char_text_dict(document):
    pattern = r'(.+)\s:\s(.+)\n'
    matches = re.findall(pattern, document)
    
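    # Pattern for bracketed stage directions. Non-greedy, so that speech between
    # two bracketed parts on the same line is not removed as well
    repl_pattern = r"\[.+?\]"
    
    char_text_dict = {}
    for match in matches:
        char_txt = re.sub(repl_pattern, '', match[1])
        char_text_dict[match[0]] = char_text_dict.get(match[0], '') + char_txt + ' '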
        
    return char_text_dict        
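
The per-character aggregation can be tried on a tiny formatted script. The sketch below is a simplified, hypothetical variant: it splits each line at the first ` : ` rather than using the regular expression above, and it removes bracketed stage directions with a non-greedy `\[.+?\]`, which is an assumption of this sketch so that text between two bracketed parts survives.

```python
import re
from collections import defaultdict

def speaker_texts(document):
    """Collect the spoken text of each character, without stage directions."""
    texts = defaultdict(str)
    for line in document.splitlines():
        name, sep, speech = line.partition(' : ')
        if not sep:
            continue  # scene separator lines have no ' : '
        texts[name.strip()] += re.sub(r'\[.+?\]', '', speech) + ' '
    return dict(texts)

# Hypothetical sample in the notebook's text format
sample = (
    "+++++The bus stop+++++\n"
    "Stan : [waves] Hey, dude. [smiles]\n"
    "Kyle : Hey.\n"
    "Stan : Let's go.\n"
)
texts = speaker_texts(sample)
print(texts)
```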

In [79]:
def update_dictionary(dict_to_be_updated, new_dict):
    old_dict = dict_to_be_updated.copy()
    for new_el in list(new_dict.keys()):
        old_dict[new_el] = old_dict.get(new_el, '') + new_dict[new_el]
        
    return old_dict

In [86]:
total_char_texts_dict = {}
for season in tqdm(scripts_dir.iterdir()):
    season_name = season.stem
    season_texts_path = texts_path / season_name
    season_texts_path.mkdir(exist_ok=True)
    
    season_char_texts_dict = {}
    
    for episode in season.iterdir():
        ep_name = episode.stem.replace('.', '')
        episode_texts_path = season_texts_path / ep_name
        episode_texts_path.mkdir(exist_ok=True)
        
        episode_char_texts_dict = {}
        
        # Open the script of the episode
        with open(episode, 'r', encoding='utf-8') as f:
            doc = f.read()
        
        # Get the characters and what they are saying
        episode_text_dict = get_char_text_dict(doc)
        
        # Create the files for each episode
        for character in list(episode_text_dict.keys()):
            if character in characters:
                char_name_path = texts_path / season_name / ep_name / character
                with open(char_name_path.as_posix()+ '.txt', 'a', encoding='utf-8') as f:
                    f.write(episode_text_dict[character])
                    
        # Update the season's dictionary
        season_char_texts_dict = update_dictionary(season_char_texts_dict, episode_text_dict)
    
    
    # Create the files for each season
    for character in list(season_char_texts_dict.keys()):
        season_name_path = texts_path / season_name / season_name
        with open(season_name_path.as_posix()+ '.txt', 'a', encoding='utf-8') as f:
            f.write(season_char_texts_dict[character])
        
    # Update the total dictionary
    total_char_texts_dict = update_dictionary(total_char_texts_dict, season_char_texts_dict)
        
# Create the file for the whole show inside the texts folder
for character in list(total_char_texts_dict.keys()):
    with open(texts_path / 'texts.txt', 'a', encoding='utf-8') as f:
        f.write(total_char_texts_dict[character])
