In [2]:
import pandas as pd
import numpy as np

In [6]:
raw_data = pd.read_csv('./SouthParkData-master/All-seasons.csv')
raw_data.head()

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n
3,10,1,Chef,I'm sorry boys.\n
4,10,1,Stan,"Chef said he's been bored, so he joining a gro..."


In [9]:
raw_data['Line'][4]

"Chef said he's been bored, so he joining a group called the Super Adventure Club. \n"

## We're going to attempt to implement the Viterbi algorithm

### Here's what we need:

#### parameters:

K => number of hidden states

T => length of sequence of observations

N => number of possible observations 

S => the "state space", i.e. all possible words (s1, s2, ... , sK)

Priors => an array of prior probabilities for each state (i.e. how likely is a word to occur without context?)

Transition Matrix A => K x K matrix where A[i, j] stores probability of transitioning from si to sj

Emission Matrix   B => K x N matrix where B[i, j] stores probability of observing oj from state si
    
#### output:

X => the most likely sequence of hidden states (x1, x2, ..., xT)
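To make the parameters above concrete, here is a minimal sketch of Viterbi decoding in log space. The function name `viterbi` and the toy two-state numbers are illustrative assumptions, not something derived from the South Park data; observations are assumed to be given as integer column indices into B.

```python
import numpy as np

def viterbi(obs, priors, A, B):
    """Most likely state sequence for a list of observation indices (sketch)."""
    K = A.shape[0]          # number of hidden states
    T = len(obs)            # length of the observation sequence
    # Work with log-probabilities so long sequences do not underflow.
    with np.errstate(divide="ignore"):
        log_pi, log_A, log_B = np.log(priors), np.log(A), np.log(B)

    score = np.empty((T, K))            # best log-prob of any path ending in state k at time t
    back = np.zeros((T, K), dtype=int)  # best predecessor of state k at time t

    score[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best path ending in state i at t-1, then moving to state j
        cand = score[t - 1][:, None] + log_A
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_B[:, obs[t]]

    # Follow the back-pointers from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example (assumed numbers): K = 2 states, N = 3 possible observations.
toy_priors = np.array([0.6, 0.4])
toy_A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
toy_B = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])

print(viterbi([0, 1, 2], toy_priors, toy_A, toy_B))  # [0, 0, 1]
```

The work per time step is proportional to K x K, so the size of the state space matters a lot for how practical this is.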

Let's first see how large a vocabulary we're working with.

I'm going to include all individual words, plus some basic punctuation: ',', '.', '!' and '?'.

In [47]:
# Takes a string, tokenizes it 
# Returns a list of the tokens
def tokenize_str(string):
    
    string = string.lower()
    
    # insert a space before punctuation so it splits off as its own token
    string = string.replace(',', ' ,')
    string = string.replace('.', ' .')
    string = string.replace('!', ' !')
    string = string.replace('?', ' ?')
    
    # split string and add newline at end
    string = string.split()
    string += ['\n']
    
    return string

print(tokenize_str(raw_data['Line'][4]))

['chef', 'said', "he's", 'been', 'bored', ',', 'so', 'he', 'joining', 'a', 'group', 'called', 'the', 'super', 'adventure', 'club', '.', '\n']


In [48]:
# Flatten every line of dialogue into one long list of tokens
tokens = [word for sentence in raw_data['Line'] for word in tokenize_str(sentence)]

states = set(tokens)
vocab_size = len(states)
print(vocab_size)

32834


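32834 distinct tokens is a lot to work with: a dense K x K transition matrix would have about 1.08 billion entries (roughly 8.6 GB as float64). It might still work, but we may have to reduce dimensionality at some point.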
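One possible way to shrink the vocabulary, sketched below, is to replace rare tokens with a single placeholder. The helper `prune_vocab`, the `min_count` threshold and the `<unk>` token are all assumptions made for illustration; the notebook has not committed to any particular reduction.

```python
from collections import Counter

def prune_vocab(tokens, min_count=2, unk="<unk>"):
    """Replace tokens seen fewer than min_count times with a placeholder (sketch)."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else unk for tok in tokens]

# Toy token list standing in for the real corpus.
toy_tokens = ["chef", "is", "going", "away", ".", "\n",
              "going", "away", "?", "\n"]
pruned = prune_vocab(toy_tokens)

print(len(set(toy_tokens)), "->", len(set(pruned)))  # 7 -> 4
```

The trade-off is that generated text can then contain the placeholder, so the threshold would need tuning against the real data.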

Since we're trying to generate text from this corpus, the observations are the words themselves, so the number of possible observations N equals the number of states K (vocab_size).

That gives us the state space S, the number of states K, and the number of observations N.
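Because a Python `set` has no guaranteed order, indexing into A[i, j] or B[i, j] needs a fixed mapping between words and integer indices. A minimal sketch, using a toy set in place of `states` and the hypothetical names `idx_to_word` and `word_to_idx`:

```python
# Toy stand-in for the real `states` set.
toy_states = {"chef", "going", "away", ".", "\n"}

# Sorting gives a reproducible order across runs.
idx_to_word = sorted(toy_states)
word_to_idx = {word: i for i, word in enumerate(idx_to_word)}

# Round trip: word -> index -> word
i = word_to_idx["chef"]
print(i, idx_to_word[i])
```

With the real data, the same two lines would apply to `states`, giving indices from 0 to K - 1.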


In [None]:
# We still need the transition matrix A and our emission matrix B 
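As a starting point for A, one option is to estimate transition probabilities from bigram counts (how often word j follows word i, divided by how often word i is followed by anything). The sketch below stores them sparsely in nested dicts rather than a dense K x K array; `estimate_transitions` is a hypothetical helper, and this is only one possible estimator. It does not address the emission matrix B.

```python
from collections import Counter, defaultdict

def estimate_transitions(tokens):
    """Sparse bigram estimate: probs[prev][nxt] = count(prev, nxt) / count(prev, *)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    probs = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[prev] = {word: c / total for word, c in nxt_counts.items()}
    return probs

# Toy token list standing in for the real `tokens`.
toy_tokens = ["going", "away", "?", "\n", "going", "away", ".", "\n"]
A_sparse = estimate_transitions(toy_tokens)

print(A_sparse["away"])  # {'?': 0.5, '.': 0.5}
```

Note that this counts bigrams across line boundaries too (for example '\n' followed by the first word of the next line), which may or may not be what we want.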
