# Tokenization

## Overview

This script combines the CSV data from all seasons into a single file and tokenizes each line. Pre-processing consists of lowercasing each line and removing punctuation.

First, we import the required packages. `word_tokenize` depends on the `punkt` tokenizer models from the Natural Language Toolkit (`nltk`); uncomment the `nltk.download` line on the first run to fetch them.

In [11]:
import csv
import pandas as pd
import glob
import os
import nltk
from collections import defaultdict
from nltk.tokenize import word_tokenize

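# Uncomment on the first run to fetch the tokenizer models.
# Newer NLTK releases may ask for "punkt_tab" instead of "punkt".
# nltk.download("punkt")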

We begin by combining all CSVs from the data cleaning step into one large CSV.

In [12]:
combined_csv = 'data/combined_seasons.csv'

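# Skip a previously written combined file so that re-running this cell
# does not duplicate rows; sort for a deterministic order.
all_seasons = sorted(
    f for f in glob.glob(os.path.join("./data", "*.csv"))
    if os.path.basename(f) != os.path.basename(combined_csv)
)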

df_csvs = [pd.read_csv(f) for f in all_seasons]
combined_df = pd.concat(df_csvs, ignore_index=True)

combined_df.to_csv(combined_csv, index=False)

input_file = combined_csv

For the project, we create counters that track the number of tokens, the vocabulary, and the number of lines for each character. The overall totals are derived from these counters.

In [13]:
token_counts = defaultdict(int)
vocab = defaultdict(set)
line_counts = defaultdict(int)
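
As a minimal sketch of how these three counters behave, the example below fills them from a few made-up rows. The `rows` list and its contents are hypothetical and are not taken from the dataset; the tokens are assumed to be pre-processed already.

```python
from collections import defaultdict

# Hypothetical toy data: (character, already-tokenized line) pairs.
rows = [
    ("Phil", ["hello", "hello", "world"]),
    ("Claire", ["hello"]),
    ("Phil", ["world", "again"]),
]

token_counts = defaultdict(int)  # a missing key starts at 0
vocab = defaultdict(set)         # a missing key starts as an empty set
line_counts = defaultdict(int)

for character, tokens in rows:
    token_counts[character] += len(tokens)  # every token counts, repeats included
    vocab[character].update(tokens)         # each distinct token is stored once
    line_counts[character] += 1

print(dict(token_counts))
print({name: sorted(words) for name, words in vocab.items()})
print(dict(line_counts))
```

Because `defaultdict` creates the entry on first access, no check for a previously unseen character is needed inside the loop.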

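We then open the combined CSV and iterate over its rows. Each line is lowercased and tokenized with `word_tokenize`, and only purely alphabetic tokens are kept. This removes punctuation, but it also drops numbers and contraction fragments such as `n't` and `'s`. For each character, we then update the token count, the vocabulary, and the line count.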

In [14]:
with open(input_file, "r", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f)

    for row in reader:
        character = row["character"]
        line = row["line"]

        tokens = word_tokenize(line.lower())

        tokens = [t for t in tokens if t.isalpha()]

        token_counts[character] += len(tokens)
        vocab[character].update(tokens)
        line_counts[character] += 1
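
To make the effect of the `isalpha()` filter concrete, here is a small sketch that does not call NLTK. The token list is hand-written in the style `word_tokenize` produces for the lowercased sentence "i don't know, phil! it's 3 pm."; the sentence is an assumed example, not a line from the data.

```python
# Hand-written tokens for the assumed example sentence
# "i don't know, phil! it's 3 pm." (not produced by running NLTK here).
tokens = ["i", "do", "n't", "know", ",", "phil", "!", "it", "'s", "3", "pm", "."]

# Same filter as in the loop above: keep purely alphabetic tokens only.
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['i', 'do', 'know', 'phil', 'it', 'pm']
print(dropped)  # ["n't", ',', '!', "'s", '3', '.']
```

The filter removes more than punctuation: the number and both contraction fragments are discarded as well, so they contribute to neither the token counts nor the vocabulary.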

Finally, we print the statistics for each character, followed by the overall totals.

In [15]:
for character in sorted(token_counts.keys()):
    total_tokens = token_counts[character]
    vocab_size = len(vocab[character])

    print(f"{character}")
    print(f"  Total Lines: {line_counts[character]}")
    print(f"  Total Tokens: {total_tokens}")
    print(f"  Vocabulary Size: {vocab_size}")
    print()

unique_vocab_total = {word for char_vocab in vocab.values() for word in char_vocab}

print("Overall Statistics")
print(f"  Total Lines Overall: {sum(line_counts.values())}")
print(f"  Total Tokens: {sum(token_counts.values())}")
print(f"  Vocabulary Size: {len(unique_vocab_total)}")

Alex
  Total Lines: 885
  Total Tokens: 7278
  Vocabulary Size: 1415

Cameron
  Total Lines: 2451
  Total Tokens: 27850
  Vocabulary Size: 3475

Claire
  Total Lines: 3433
  Total Tokens: 34934
  Vocabulary Size: 3087

Gloria
  Total Lines: 1907
  Total Tokens: 20496
  Vocabulary Size: 2257

Haley
  Total Lines: 1116
  Total Tokens: 9716
  Vocabulary Size: 1476

Jay
  Total Lines: 2473
  Total Tokens: 29196
  Vocabulary Size: 3143

Luke
  Total Lines: 960
  Total Tokens: 7516
  Vocabulary Size: 1414

Manny
  Total Lines: 1030
  Total Tokens: 10034
  Vocabulary Size: 1855

Mitchell
  Total Lines: 2752
  Total Tokens: 29456
  Vocabulary Size: 3079

Phil
  Total Lines: 3316
  Total Tokens: 37590
  Vocabulary Size: 4029

Overall Statistics
  Total Lines Overall: 20323
  Total Tokens: 214066
  Vocabulary Size: 9564
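
Note that the overall vocabulary size is the size of the union of the per-character vocabularies, not their sum: the ten per-character sizes above add up to 25,230, while the overall vocabulary has 9,564 words, because most words are shared between characters. A minimal sketch with made-up vocabularies (the words below are hypothetical) shows the difference:

```python
# Hypothetical per-character vocabularies with one shared word.
vocab = {
    "Phil": {"hello", "world", "again"},
    "Claire": {"hello", "honey"},
}

# Adding the sizes counts the shared word "hello" twice.
summed_sizes = sum(len(words) for words in vocab.values())

# The union counts each distinct word once, like unique_vocab_total above.
overall_vocab = set().union(*vocab.values())

print(summed_sizes)        # 5
print(len(overall_vocab))  # 4
```

The token and line totals, by contrast, are plain sums, since every token and every line belongs to exactly one character.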


At this point, tokenization and pre-processing are complete, and the tokens are ready for the following steps.