# When Data Breaks Bad — Analysing Breaking Bad Dialogues

# Introduction

In this project, we aim to analyse Breaking Bad transcripts to identify patterns. We'll extract data from an online source and organise it into a dataframe. Our goal is to uncover insights such as:

- Characters who speak the most in each season.
- Top 8 characters by dialogue frequency.
- Top 3 characters with the shortest and longest sentences.
- Season with the highest number of sentences.
- Relationship between sentences per character and sentence length.
- Episode theme identification through cluster analysis.
- Common interactions among characters via network analysis of dialogue exchanges.

## Summary of Results

Here are the main conclusions we reached so far:

- The top 3 characters with the longest sentences are Hank, Walter, and Skyler. The top 3 characters with the shortest sentences are 'Everyone' (i.e., a group of people), Skyler, and Walter.
- Out of all the seasons we managed to analyse, season 2 is the longest. However, we could only analyse seasons 1 and 2 and part of season 3. We couldn't analyse seasons 4 and 5 because the character names are missing from their transcripts.
- We found a linear correlation between the number of sentences a character says and the length of their longest sentence: the more sentences a character says, the longer their longest sentence tends to be. One possible explanation is that the main characters often deliver long, reflective lines.
- We discovered that Jesse and Walter Jr have the most distinct clusters. Characters who speak similarly include women in love (Skyler, Marie, and Jane), young characters (Walter Jr and Jane), and characters who discuss similar topics (Walter, Hank, and Saul).
- The most frequent conversation happens between Walter and Jesse. The second one is between Walter and Skyler. The third strongest relationship is the one between Jesse and Jane.

For more details, please refer to the full analysis below.

## Loading the Data

To avoid transcribing the episodes ourselves, we'll use existing data. Let's see if we can get some interesting patterns from it.

We'll use [Forever Dreaming](https://foreverdreaming.org/), an online entertainment resource. The website has a URL for each Breaking Bad episode, from which we can download the full source code of the webpage. This will allow us to extract the data we want (i.e., transcript, episode title, episode number, and season). Once that's done, we can build a dataframe, cleanse it, explore it, analyse it, and make some plots. Let's start by extracting the data.

In [3]:
# import necessary libraries 
import urllib.request
import re
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import sent_tokenize



In [4]:
# assign URL to a variable
base_url = 'https://transcripts.foreverdreaming.org/viewtopic.php?t={}'

# create a list of episodes' urls 
urls = []
start_number = 10044
end_number = 10107

for number in range(start_number, end_number + 1):
    episode_url = base_url.format(number)
    urls.append(episode_url)

In [5]:
# remove unnecessary urls 
urls.pop(1)
urls.pop(-2)

'https://transcripts.foreverdreaming.org/viewtopic.php?t=10106'

In [6]:
# download episodes' webpages 
from tqdm.auto import tqdm
webpage_texts = [] 

for url in tqdm(urls):
    # use a context manager so that each connection is closed after reading
    with urllib.request.urlopen(url) as fid:
        webpage_text = fid.read().decode('utf-8')
    webpage_texts.append(webpage_text)

  0%|          | 0/62 [00:00<?, ?it/s]
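
Since the loop above downloads every page each time the cell is run, it may be worth caching the pages on disk. The following is only a sketch of one way to do that: `fetch_page`, `cached_fetch`, and the `page_cache` directory are hypothetical names that are not part of the original notebook.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_page(url, timeout=30):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def cached_fetch(url, cache_dir="page_cache", fetch=fetch_page):
    """Return the page for `url`, downloading it only if no cached copy exists."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    # hash the URL to get a filesystem-safe file name
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    file_path = cache_path / file_name
    if file_path.exists():
        return file_path.read_text(encoding="utf-8")
    text = fetch(url)
    file_path.write_text(text, encoding="utf-8")
    return text
```

With this in place, the download loop could become `webpage_texts = [cached_fetch(url) for url in tqdm(urls)]`, so that a second run reads the pages from disk instead of requesting them again.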

In [7]:
# create empty lists to store the data 
transcripts = [] 
titles = [] 
seasons = [] 
episodes = [] 

# extract transcripts, titles, seasons, and episodes
for webpage_text in webpage_texts:
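    # extract the episode's transcript
    transcript_start_idx = webpage_text.find('<div class="content"')
    # look for the end marker only after the start marker
    transcript_end_idx = webpage_text.find('<div class="share-list"', transcript_start_idx)
    transcript = webpage_text[transcript_start_idx:transcript_end_idx]

    # extract the episode's title heading
    title_start_idx = webpage_text.find('<h2 class="topic-title">')
    # look for the closing tag only after the opening tag
    title_end_idx = webpage_text.find('</h2>', title_start_idx)
    title = webpage_text[title_start_idx:title_end_idx]
    pattern = r'<a.*?>(.*?)</a>'
    match = re.search(pattern, title)
    if match:
        extracted_text = match.group(1)
        # season is the first character, episode number is characters 2-3,
        # and the episode title starts at character 7
        episode_title = extracted_text[7:]
        titles.append(episode_title)
        season_num = int(extracted_text[0])
        seasons.append(season_num)
        episode_num = int(extracted_text[2:4])
        episodes.append(episode_num)
        # store the transcript only when the title was parsed,
        # so that the four lists stay the same length
        transcripts.append(transcript)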
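
The fixed character offsets used for the season, episode number, and title only work if every heading has exactly the same layout. As a hedged alternative, here is a small regular-expression sketch. It assumes the link text looks like `1x01 - Pilot`, which is inferred from the offsets above rather than checked against the website, and `parse_episode_heading` is a hypothetical helper name.

```python
import re

# assumed layout: <season>x<episode> - <title>, e.g. "1x01 - Pilot"
TITLE_PATTERN = re.compile(r"^\s*(\d+)x(\d+)\s*-\s*(.+?)\s*$")


def parse_episode_heading(text):
    """Return (season, episode, title), or None if the text does not match."""
    match = TITLE_PATTERN.match(text)
    if match is None:
        return None
    season, episode, title = match.groups()
    return int(season), int(episode), title


print(parse_episode_heading("1x01 - Pilot"))      # (1, 1, 'Pilot')
print(parse_episode_heading("Transcript index"))  # None
```

Returning `None` for headings that do not match makes it easy to skip or log unexpected pages instead of failing on an `int()` conversion.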

Now that we've managed to extract the data, we can create the dataframe we will analyse. Below, we also cleanse the data of unnecessary words and formatting. This will make our analysis easier later on.

In [8]:
# create dataframe
df = pd.DataFrame({'Title': titles, 'Season': seasons, 'Episodes': episodes, 'Transcript': transcripts})

In [9]:
# create empty lists to store the data 
char_names = []
char_sentences = [] 
sentence_orders = [] 
episodes_names = [] 
episodes_numbers = []
seasons_numbers = [] 

# extract characters names, character sentences, etc. 
for name, episode_number, season_number, transcript in zip(df["Title"], df["Episodes"], df["Season"], df["Transcript"]):
    for idx, sentence in enumerate(transcript.split('<br>\n<br>\n')):
        if '<div class' in sentence:
            continue
        if '<br>\n' in sentence:
            sentence = sentence.replace('<br>\n', '')
        if '</div>\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t' in sentence:
            sentence = sentence.replace('</div>\n\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t', '')
        sentence = sentence.replace("Mr: Pinkman:", "Mr Pinkman:")
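        # a line starts a new utterance if a colon appears within its first three words
        if any(":" in x for x in sentence.split(' ')[:3]) and ('Then he said' not in sentence):
            # split on the first colon only, so that colons inside the dialogue are kept
            char_name, char_sentence = sentence.split(':', 1)
            char_names.append(char_name)
            char_sentences.append(char_sentence)
            sentence_orders.append(idx)
            episodes_names.append(name)
            episodes_numbers.append(episode_number)
            seasons_numbers.append(season_number)
        elif char_sentences:
            # otherwise treat the line as a continuation of the previous utterance
            char_sentences[-1] = char_sentences[-1] + sentence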

We found that some transcripts lack the colon that separates a character's name from their sentence, because the character names are missing from those transcripts. Since we rely on the colon to split each line into name and sentence, we can unfortunately only use seasons 1 and 2 and season 3 up to episode 8. Season 3 from episode 9 onwards, as well as seasons 4 and 5, will not be available for this project.
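
To make the colon rule concrete, here is a minimal sketch on made-up lines (they are not taken from the transcripts). `split_speaker_line` and `max_name_words` are hypothetical names, and the rule is a simplified variant of the check used in the loop above: a line counts as dialogue only if the text before the first colon is a short name.

```python
def split_speaker_line(line, max_name_words=3):
    """Split 'Name: text' into (name, text); return (None, line) if no speaker is found."""
    # partition splits on the first colon only, so later colons stay in the text
    name, sep, text = line.partition(":")
    if not sep or not name.strip() or len(name.split()) > max_name_words:
        return None, line
    return name.strip(), text.strip()


toy_lines = [
    "Walter: We need to talk.",
    "Jesse: Two things: speed and quality.",
    "The clock on the wall reads 10:30.",
    "(door slams)",
]

for line in toy_lines:
    print(split_speaker_line(line))
```

The third and fourth lines come back with `None` as the speaker, which is the situation described above: without a name and colon there is nothing to attribute the sentence to.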

In [10]:
# create merged dataframe
new_df = pd.DataFrame({'episode_title': episodes_names, 'episode_number': episodes_numbers, 'season_number': seasons_numbers, 'character_name': char_names, 'character_sentence': char_sentences, 'sentence_order':sentence_orders})

In [11]:
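# cleanse the data
# drop the last row
new_df.drop(new_df.tail(1).index, inplace=True)
# remove scene descriptions; .copy() avoids a SettingWithCopyWarning on the next assignment
new_df = new_df[new_df.character_name != "Scene"].copy()
new_df["character_name"] = new_df["character_name"].replace("Walt", "Walter")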

In [12]:
# give the generic 'Lawyer' character a distinct name per episode
lawyer_by_episode = {'Down': 'Lawyer_1', 'Caballo Sin Nombre': 'Lawyer_3', 'One Minute': 'Lawyer_4'}

for episode, lawyer_name in lawyer_by_episode.items():
    mask = (new_df['character_name'] == 'Lawyer') & (new_df['episode_title'] == episode)
    new_df.loc[mask, 'character_name'] = lawyer_name

In [13]:
# label the 'Lawyer' character in these episodes as 'Lawyer_2'
list_of_episodes = ['No Mas', 'I.F.T.', 'Mas']

for episode in list_of_episodes:
    mask = (new_df['character_name'] == 'Lawyer') & (new_df['episode_title'] == episode)
    new_df.loc[mask, 'character_name'] = 'Lawyer_2'

In [14]:
# replace incorrect names with right names 
list_of_tuples = [('Walter Junior', 'Walter Jr'), 
                  ('Gus', 'Gustavo'), 
                  ('Mr Pinkman', 'Mr. Pinkman'),
                  ('Hank(on the news)', 'Hank'), 
                  ('Jane’s Voicemail', 'Jane'), 
                  ('Jesse(Answering Machine)', 'Jesse'), 
                  (' Krazy-8', 'Krazy-8'),
                  ('Skyler (Walt’s Imagination)', 'Skyler'), 
                  ('Walter(Answering Machine)', 'Walter'), 
                  ('Reporter(on the news)', 'Reporter')
                 ]

for old_name, new_name in list_of_tuples:
    new_df.loc[new_df['character_name'] == old_name, 'character_name'] = new_name
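
The explicit list of name pairs works well for a known set of variants. As a possible generalisation, the sketch below strips surrounding whitespace and a trailing parenthetical such as `(on the news)` before applying a mapping. `normalise_names` is a hypothetical helper, and the toy names are only a small sample, not the full set handled above.

```python
import pandas as pd


def normalise_names(names, fixes):
    """Strip whitespace and trailing '(...)' qualifiers, then apply exact-match fixes."""
    cleaned = names.str.strip()
    # remove a trailing parenthetical, e.g. "Hank(on the news)" -> "Hank"
    cleaned = cleaned.str.replace(r"\s*\([^)]*\)$", "", regex=True)
    # Series.replace with a dict replaces whole values, not substrings
    return cleaned.replace(fixes)


toy_names = pd.Series(["Walt", "Hank(on the news)", " Krazy-8", "Gus", "Hank"])
fixes = {"Walt": "Walter", "Gus": "Gustavo"}

print(normalise_names(toy_names, fixes).tolist())
# ['Walter', 'Hank', 'Krazy-8', 'Gustavo', 'Hank']
```

Names that do not follow the pattern, such as a voicemail entry, would still need an explicit mapping like the one in the cell above.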

In [15]:
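# extract length of sentences (number of words)
list_of_len_of_sen = []

for sentence in new_df['character_sentence']:
    # split() without arguments ignores leading, trailing, and repeated whitespace,
    # so empty strings are not counted as words
    list_of_len_of_sen.append(len(sentence.split()))

# create new length of sentences column
new_df["sentence_length"] = list_of_len_of_sen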
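
The word count depends on how each sentence is split. This toy comparison (the strings are made up) shows how splitting on a single space and splitting on any whitespace can differ when a sentence starts with a space, contains a double space, or is empty.

```python
import pandas as pd

toy_sentences = pd.Series([" Let's cook.", "Yes,  really", ""])

# splitting on a single space keeps empty strings as "words"
space_counts = toy_sentences.apply(lambda s: len(s.split(" ")))

# splitting on any run of whitespace drops them
whitespace_counts = toy_sentences.str.split().str.len()

print(space_counts.tolist())       # [3, 3, 1]
print(whitespace_counts.tolist())  # [2, 2, 0]
```

Because the text after the colon usually starts with a space, the single-space split tends to overcount each sentence by one word.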

In [16]:
# save dataframe into a csv file 
new_df.to_csv("processed_data.csv", index=False)

We've extracted, cleansed, and reformatted the data. Now we're ready to explore it to see if we can find some interesting patterns.