# Data Cleaning

## Overview

This notebook cleans the PDF scripts located in the ```data``` folder. It outputs each script as a CSV file with one ```character``` column and one column for the corresponding ```line```, in show order.

First, all necessary packages are imported.

In [52]:
import pdfplumber
import re
import csv

Then the path to the raw script PDF and the path for the output CSV are set. This is done for each of the 3 seasons of data.

In [53]:
pdf_path = "data/s1_raw_pdf.pdf"
output_file = "data/s1_split_csv.csv"
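
Since the same steps are repeated for each season, the two paths could also be generated rather than edited by hand. The following is only a sketch: it assumes that the files for seasons 2 and 3 follow the same naming pattern as the season 1 files above, which the notebook does not confirm, and ```season_paths``` is a hypothetical helper name.

```python
from pathlib import Path


def season_paths(season, data_dir="data"):
    """Return (pdf_path, output_file) for one season.

    Assumption: every season uses the s<N>_raw_pdf.pdf / s<N>_split_csv.csv
    naming seen for season 1.
    """
    base = Path(data_dir)
    return (
        (base / f"s{season}_raw_pdf.pdf").as_posix(),
        (base / f"s{season}_split_csv.csv").as_posix(),
    )


for season in (1, 2, 3):
    pdf_path, output_file = season_paths(season)
    print(pdf_path, "->", output_file)
```

The rest of the notebook would then run once per pair of paths.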

This project only observes characters from the **main cast**. The set below is defined so that lines from all other characters are ignored.

In [54]:
ALLOWED_CHARACTERS = {
    "Jay",
    "Gloria",
    "Phil",
    "Claire",
    "Cameron",
    "Mitchell",
    "Manny",
    "Luke",
    "Alex",
    "Haley"
}

There is a specific pattern to how lines are notated in the PDF. It comes in two formats, ```[Character]: [Line]``` and ```[Character] : [Line]```. To extract characters and lines from both, a regular expression is compiled that allows optional whitespace on either side of the colon.

In [55]:
speaker_pattern = re.compile(r'^([^:]+?)\s*:\s*(.*)')
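
To illustrate what the pattern captures, here is a small check on a few made-up lines (the sample text is invented for this example and is not taken from the scripts). It shows both colon formats producing the same speaker, and a speaker tag with no dialogue on the same line.

```python
import re

speaker_pattern = re.compile(r'^([^:]+?)\s*:\s*(.*)')

# Invented sample lines, one per notation format plus an empty line of dialogue
samples = [
    "Phil: I'm the cool dad.",
    "Claire : Kids, get down here!",
    "Haley:",
]

for text in samples:
    m = speaker_pattern.match(text)
    # group(1) is the speaker, group(2) is whatever follows the first colon
    print(repr(m.group(1)), repr(m.group(2)))
```

Note that the pattern matches any line containing a colon, not only lines that begin with a character name, so the captured speaker still has to be checked against ```ALLOWED_CHARACTERS```.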

Then we create a list for the output rows and two variables that track the current character and their dialogue while scanning through each line.

In [56]:
rows = []
current_character = None
current_dialogue = []

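Two functions process the data. The first, ```save_current()```, appends the current character and their accumulated dialogue to ```rows``` (only if the character is in the main cast) and then resets both. The second, ```process_lines()```, goes through the extracted lines one by one: a line matching the speaker pattern starts a new entry, and any other line is appended to the current character's dialogue. It is important here to skip lines that are not dialogue, such as headers and page numbers. The checks cover lines starting with "Modern Family", lines starting with a pattern like ```1x01```, and lines fully enclosed in square brackets.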

In [57]:
def save_current():
    global current_character, current_dialogue
    if current_character in ALLOWED_CHARACTERS and current_dialogue:
        rows.append([
            current_character,
            " ".join(current_dialogue).strip()
        ])
    current_character = None
    current_dialogue = []

def process_lines(lines):
    global current_character, current_dialogue

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Skip metadata
        if line.startswith("Modern Family"):
            continue
        if re.match(r'^\d+x\d+', line):
            continue
        if re.match(r'^\[.*\]$', line):
            continue

        match = speaker_pattern.match(line)

        if match:
            save_current()

            speaker = match.group(1).strip()
            dialogue = match.group(2).strip()

            if speaker in ALLOWED_CHARACTERS:
                current_character = speaker
                current_dialogue = [dialogue] if dialogue else []
            else:
                current_character = None
                current_dialogue = []
        else:
            if current_character:
                current_dialogue.append(line)
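
The three skip checks can be easier to test when they are grouped into one predicate. The sketch below does that with a hypothetical helper, ```is_metadata```, which is not part of the notebook. The comments describing what each pattern is meant to catch are assumptions based on the patterns themselves.

```python
import re


def is_metadata(line):
    """Return True for lines that process_lines() skips before matching speakers."""
    line = line.strip()
    return (
        line.startswith("Modern Family")              # presumably a title/header line
        or re.match(r'^\d+x\d+', line) is not None    # starts like an episode code, e.g. 1x01
        or re.match(r'^\[.*\]$', line) is not None    # whole line enclosed in square brackets
    )


# Invented sample lines
for text in ["Modern Family 1x01", "[laughs]", "Jay: Hello."]:
    print(repr(text), is_metadata(text))
```

Keeping the checks in one place makes it simple to add another rule later without touching the speaker-matching logic.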

Because the PDF is laid out in two columns, each page is cropped into a left half and a right half with ```pdfplumber```, and text is extracted from the left column before the right so that the lines stay in reading order.

In [None]:
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:

        width = page.width
        height = page.height

        left_bbox = (0, 0, width / 2, height)
        right_bbox = (width / 2, 0, width, height)

        left_column = page.crop(left_bbox)
        right_column = page.crop(right_bbox)

        left_text = left_column.extract_text()
        if left_text:
            process_lines(left_text.split("\n"))

        right_text = right_column.extract_text()
        if right_text:
            process_lines(right_text.split("\n"))

save_current()
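
The bounding-box arithmetic used above can be looked at on its own. This sketch uses a hypothetical helper, ```column_bboxes```, and an example page size that is not taken from the actual PDF. Like the cell above, it assumes the gap between the columns sits exactly at the horizontal midpoint of the page.

```python
def column_bboxes(width, height):
    """Split a page of the given size into left-half and right-half boxes.

    Boxes use the (x0, top, x1, bottom) order that pdfplumber's crop() expects.
    """
    mid = width / 2
    return (0, 0, mid, height), (mid, 0, width, height)


# 612 x 792 points is US Letter, used here purely as an example size
left_bbox, right_bbox = column_bboxes(612, 792)
print(left_bbox)
print(right_bbox)
```

If the columns in a given PDF are not centered, the midpoint would need to be replaced with the measured position of the gap.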

Finally, the characters and lines are written to a CSV file using Python's ```csv``` module.

In [59]:
with open(output_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["character", "line"])
    writer.writerows(rows)
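
Dialogue often contains commas and quotation marks, so it can be worth confirming that rows survive a write and read cycle unchanged. The sketch below does this in memory with invented sample rows, using ```io.StringIO``` in place of a real file.

```python
import csv
import io

# Invented sample rows containing a comma and embedded quotes
rows = [
    ["Phil", 'He said "hi", then left.'],
    ["Gloria", "Ay, no!"],
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["character", "line"])
writer.writerows(rows)

# Read the CSV text back and compare it with what was written
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[0])
print(read_back[1:] == rows)
```

The ```csv``` writer quotes such fields automatically, which is why no manual escaping is needed in the cell above.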

This results in the cleaned data that is ready for further tokenization and pre-processing.