<a href="https://colab.research.google.com/github/guru3/the_office_series_analysis/blob/master/The%20Office%20Transcript%20Generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pickle
import random
import sys
import keras
import numpy as np
from keras.layers import LSTM, Dense
from keras.models import Sequential

Using TensorFlow backend.


In [0]:
# Load the parsed transcripts; the context manager closes the file afterwards
with open('./the_office_transcript.pickle', 'rb') as f:
    season_map_parsed, season_map_cleaned, theOfficeIMDBRating = pickle.load(f)

#### Let's write a dialogue generator for Michael, Dwight and Creed!

In [0]:
CHARACTERS = ['MICHAEL', 'DWIGHT', 'CREED']
# Map each character name to the list of lines they speak
chr_dialogue_map = {char: [] for char in CHARACTERS}

for season in season_map_parsed.keys():
    episodes = season_map_parsed[season]
    for episode in episodes.keys():
        dialogues = episodes[episode]
        for dialogue in dialogues:
            # dialogue[0] is the speaker, dialogue[1] the spoken line
            char = dialogue[0]
            if char not in CHARACTERS:
                continue
            chr_dialogue_map[char].append(dialogue[1])

In [0]:
maxlen = 50 #length of input sequence
step = 3    #sample a new sequence after every step characters

In [0]:
def getDataForCharacter( char ):
    sentences = []   # input windows of maxlen characters
    next_chars = []  # output: the character that follows each window

    dialogues = chr_dialogue_map[char]
    for dialogue in dialogues:
        for i in range(0, len(dialogue) - maxlen, step):
            sentences.append( dialogue[i: i+maxlen] )
            next_chars.append( dialogue[i+maxlen] )
    chars = sorted(set(' '.join(dialogues)))
    char_indices = {c: idx for idx, c in enumerate(chars)}
    # np.bool was removed in NumPy 1.24; the builtin bool is equivalent
    x = np.zeros( (len(sentences), maxlen, len(chars)), dtype=bool )
    y = np.zeros( (len(sentences), len(chars)), dtype=bool )
    for i, sentence in enumerate(sentences):
        for t, c in enumerate(sentence):
            x[i, t, char_indices[c] ] = 1
        y[i, char_indices[next_chars[i]]] = 1
    return x, y, char_indices, dialogues
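
To make the windowing concrete, here is a minimal sketch on a toy string. The names `toy_text`, `toy_maxlen`, `toy_step`, `vocab` and `index_of` are illustrative and not part of the notebook; the logic mirrors `getDataForCharacter`, just with much smaller values.

```python
import numpy as np

toy_text = "that's what she said"   # hypothetical sample line
toy_maxlen = 5                      # window length (the notebook uses 50)
toy_step = 3                        # stride between windows

windows, targets = [], []
for i in range(0, len(toy_text) - toy_maxlen, toy_step):
    windows.append(toy_text[i:i + toy_maxlen])   # input sequence
    targets.append(toy_text[i + toy_maxlen])     # character to predict

# Character vocabulary and its index lookup
vocab = sorted(set(toy_text))
index_of = {c: k for k, c in enumerate(vocab)}

# One-hot encode: x is (windows, time steps, vocabulary), y is (windows, vocabulary)
x = np.zeros((len(windows), toy_maxlen, len(vocab)), dtype=bool)
y = np.zeros((len(windows), len(vocab)), dtype=bool)
for i, window in enumerate(windows):
    for t, c in enumerate(window):
        x[i, t, index_of[c]] = True
    y[i, index_of[targets[i]]] = True

for w, t in zip(windows, targets):
    print(repr(w), '->', repr(t))
print(x.shape, y.shape)
```

Each row of `x` holds exactly one `True` per time step, and each row of `y` holds exactly one `True` for the target character.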

In [0]:
def getModelLSTM(chars):
    model = Sequential()
    model.add( LSTM(256, return_sequences=True, input_shape=(maxlen, len(chars))))
    model.add( LSTM(128))
    model.add( Dense(len(chars), activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adaDelta')
    return model

In [0]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)/temperature
    exp_preds = np.exp(preds) #can use np.power too instead of log and exp
    preds = exp_preds/np.sum(exp_preds)
    probs = np.random.multinomial(1, preds, 1)
    return np.argmax(probs)
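
The comment in `sample` notes that `np.power` could replace the log/exp pair. A minimal sketch checking that the two forms agree, using a made-up probability vector (`toy_preds`, `rescale_log_exp` and `rescale_power` are illustrative names, not part of the notebook):

```python
import numpy as np

def rescale_log_exp(preds, temperature):
    # Same steps as sample(): log, divide by temperature, exp, normalise
    preds = np.asarray(preds, dtype='float64')
    scaled = np.exp(np.log(preds) / temperature)
    return scaled / scaled.sum()

def rescale_power(preds, temperature):
    # Equivalent form: raise each probability to 1/temperature, normalise
    preds = np.asarray(preds, dtype='float64')
    scaled = np.power(preds, 1.0 / temperature)
    return scaled / scaled.sum()

toy_preds = [0.5, 0.3, 0.2]   # hypothetical softmax output
for temperature in [0.2, 0.5, 1.0, 1.2]:
    a = rescale_log_exp(toy_preds, temperature)
    b = rescale_power(toy_preds, temperature)
    print(temperature, np.round(a, 3), np.allclose(a, b))
```

Temperatures below 1 sharpen the distribution towards its largest entry, and temperatures above 1 flatten it. The power form also avoids taking `log(0)` when a probability underflows to zero.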

In [0]:
def runForCharacter( charName ):
    x, y, char_indices, dialogues = getDataForCharacter( charName )
    chars = list(char_indices.keys())
    model = getModelLSTM(chars)
    
    for i in range(6):
      model.fit(x, y, batch_size=128, epochs=20)

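      # Pick a random dialogue long enough to supply a full seed.
      # random.choice avoids the off-by-one of randint(0, len(dialogues)),
      # whose upper bound is inclusive and can raise IndexError.
      while True:
          dialogue = random.choice(dialogues)
          if len(dialogue) <= maxlen:
              continue
          start_index = random.randint(0, len(dialogue) - maxlen - 1)
          break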
      
      for temperature in [0.2, 0.5, 1.0, 1.2]:
          generated_text = dialogue[start_index: start_index + maxlen]
          print('--- Generating with seed: "' + generated_text + '"')
          print('------ temperature:', temperature)
          sys.stdout.write(generated_text)

          # We generate 400 characters
          for i in range(400):
              sampled = np.zeros((1, maxlen, len(chars)))
              for t, char in enumerate(generated_text):
                  sampled[0, t, char_indices[char]] = 1.

              preds = model.predict(sampled, verbose=0)[0]
              next_index = sample(preds, temperature)
              next_char = chars[next_index]

              generated_text += next_char
              generated_text = generated_text[1:]

              sys.stdout.write(next_char)
          print()
      
      sys.stdout.flush()
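
The generation loop above keeps a fixed-length window: it appends the sampled character and drops the oldest one. A minimal sketch of that loop with a stand-in for the trained model, so it runs without Keras. `stub_predict`, `toy_chars`, `toy_indices` and `toy_maxlen` are hypothetical, and `rng.choice` is used here in place of the notebook's `sample`:

```python
import numpy as np

toy_chars = ['a', 'b', 'c']                      # hypothetical vocabulary
toy_indices = {c: k for k, c in enumerate(toy_chars)}
toy_maxlen = 4

def stub_predict(one_hot):
    # Stand-in for model.predict: favours the character after the last one
    last = int(np.argmax(one_hot[0, -1]))
    probs = np.full(len(toy_chars), 0.1)
    probs[(last + 1) % len(toy_chars)] = 0.8
    return probs

def generate(seed, n_steps, rng):
    window, output = seed, seed
    for _ in range(n_steps):
        # One-hot encode the current window, as the notebook does
        one_hot = np.zeros((1, toy_maxlen, len(toy_chars)))
        for t, c in enumerate(window):
            one_hot[0, t, toy_indices[c]] = 1.0
        probs = stub_predict(one_hot)
        next_char = toy_chars[rng.choice(len(toy_chars), p=probs)]
        output += next_char
        window = window[1:] + next_char   # keep the window at toy_maxlen
    return output

print(generate('abca', 12, np.random.default_rng(0)))
```

The window length never changes, so the model always sees an input of shape `(1, maxlen, len(chars))`.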

In [0]:
runForCharacter( 'MICHAEL' );

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
nd that is the watch that you are going to wear? N
--- Generating with seed: "nd that is the watch that you are going to wear? N"
------ temperature: 0.2
nd that is the watch that you are going to wear? Now this the and and the wat in the a do he to the to the the ke to the to the want the and the and the the the kere the than that and the and the se to go the a bane wo he to go the me to the wank hat and and the and and wase I go the and the wase and what wase the and and we the wes and and the were ho he the a want the a to the and wowe want the to to and and of and the the the and of an have wo
nd that is the watch that you are going to wear? N
--- Generating with seed: "nd that is the watch that you are going to wear? N"
------ temperature: 0.5
nd that is

### After some tweaking of the model structure, we finally got something close to making sense, yet still far from it!
#### Example:
"ke it's hot. Forward it like it's hot. "Old Schoole, I'm goung to sreat it at out out like me... I'm not! we, gotan. I'm and n't eace?... I'm not goy forgt. . I would toll you that.... I wout't gunting it outfreaver."
#### Manually (and poorly) tweaked to:
"ke it's hot. Forward it like it's hot. "Old School, I'm young to sweat it at out out like me... I'm not! we, gotan. I'm and n't eace?... I'm not guy <who> forgets. . I would tell you that.... I wouldn't gunting it outforever. "

#### Another example:
I dnow. No, don'l whave I and overyy? doont, I manz ha1p topre soreace.... Mim.... I wantte furt2n the kus'me the youc. 

#### Manually (and poorly) tweaked to:
I know. No, don't have I and over? dont, I may he1p to pre soreace.... Mim.... I wantted further the kus'me the your.

#### Not satisfactory enough though :(

#### Let's try generating the transcript itself! This time we will use words as tokens.

In [0]:
maxlen = 5  # length of input sequence, now counted in words

def getData():
    transcripts = []
    for season in season_map_cleaned.keys():
        episodes = season_map_cleaned[season]
        for episode in episodes.keys():
            dialogues = episodes[episode]
            for dialogue in dialogues:
                # dialogue[0] is the speaker, dialogue[1] the spoken line
                words = dialogue[1].split()
                transcripts.append( words )
    
    sentences = []   # input windows of maxlen words
    next_words = []  # output: the word that follows each window

    for dialogue in transcripts:
        for i in range(0, len(dialogue) - maxlen, step):
            sentences.append( dialogue[i: i+maxlen] )
            next_words.append( dialogue[i+maxlen] )
    
    # Vocabulary: every word that appears in a window or as a target
    all_words = set(next_words)
    for sentence in sentences:
        all_words.update(sentence)
    all_words = sorted(all_words)
    word_indices = {word: idx for idx, word in enumerate(all_words)}
    
    # np.bool was removed in NumPy 1.24; the builtin bool is equivalent
    x = np.zeros( (len(sentences), maxlen, len(all_words)), dtype=bool )
    y = np.zeros( (len(sentences), len(all_words)), dtype=bool )
    for i, sentence in enumerate(sentences):
        for t, word in enumerate(sentence):
            x[i, t, word_indices[word] ] = 1
        y[i, word_indices[next_words[i]]] = 1
    return x, y, word_indices, transcripts

In [0]:
def runTranscript(rangeToRun=6):
    x, y, word_indices, dialogues = getData();
    words = list(word_indices.keys())
    maxlen = 5

    model = Sequential()
    model.add( LSTM(128, return_sequences=True, input_shape=(maxlen, len(words))))
    model.add( LSTM(64))
    model.add( Dense(64, activation='relu'))
    model.add( Dense(len(words), activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adaDelta')
    
    for i in range(rangeToRun):
      model.fit(x, y, batch_size=512, epochs=50)
      
      for temperature in [0.2, 0.5, 1.0, 1.2]:
          print('------ temperature:', temperature)
          for i in range(3):
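            # Pick a random dialogue long enough to supply a full seed.
            # random.choice avoids the off-by-one of randint(0, len(dialogues)),
            # whose upper bound is inclusive and can raise IndexError.
            while True:
              dialogue = random.choice(dialogues)
              if len(dialogue) <= maxlen:
                  continue
              start_index = random.randint(0, len(dialogue) - maxlen - 1)
              break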
            generated_text = dialogue[start_index: start_index + maxlen]
            print('--- Generating with seed: "' + ' '.join(generated_text) + '"')
            sys.stdout.write(' '.join(generated_text))

            # We generate 40 words
            for i in range(40):
                sampled = np.zeros((1, maxlen, len(words)))
                for t, char in enumerate(generated_text):
                    sampled[0, t, word_indices[char]] = 1.

                preds = model.predict(sampled, verbose=0)[0]
                next_index = sample(preds, temperature)
                next_word = words[next_index]

                generated_text.append(next_word)
                generated_text = generated_text[1:]

                sys.stdout.write(' ' + next_word)
            print('\n')
          print('\n\n')
      
      sys.stdout.flush()

In [0]:
runTranscript()

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
--- Generating with seed: "yeah and diameter sun 870"
------ temperature: 0.2
yeah and diameter sun 870 i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i



 join mouth going lets thats writing got man wants gettin don jim guys start party its right inroads its i going great needs i appreciate live town energy years intruders for relief company i asked excuse am place and fool places going better come cake exactly month eyes thank ha and sleep did i know youre feel try eating so camera coaster way stop but whats having theres incompetent so studio file herd technique brownie feeding would official filled sea time breakfast right went okay youre annoying know make orgasm hygienist things tuna right style hot asprin do tube anybody right make gone does great wait said they hey have did love sofas disobeying kids little handsome s biggest they know women talk time do feel charge personality going leave happen world new sit like use tip jim you trying need printers why handle she come can good 20 justine start little baby 20 50 chair birthday twin ringie antidote appreciated door whats gave whoa join hired felony bring attention mob person sen

There are a lot of mini-sentences starting with 'i', which the network may have learnt because the characters often talk directly to the camera throughout the series. However, this is nowhere close to the actual transcripts!
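
A possible contributing factor, separate from what the network learnt, is the sampling temperature: at 0.2, whichever word is only moderately ahead ends up with nearly all of the probability mass. A minimal sketch with made-up numbers (`toy_words` and `toy_probs` are assumptions, not values from the trained model):

```python
import numpy as np

toy_words = ['i', 'you', 'the', 'jim', 'okay']          # hypothetical vocabulary
toy_probs = np.array([0.40, 0.20, 0.15, 0.15, 0.10])    # hypothetical model output

def rescale(probs, temperature):
    # Temperature scaling written as a power; normalised to sum to 1
    scaled = np.power(probs, 1.0 / temperature)
    return scaled / scaled.sum()

rng = np.random.default_rng(0)
for temperature in [0.2, 1.0, 1.2]:
    p = rescale(toy_probs, temperature)
    draws = rng.choice(toy_words, size=10, p=p)
    print(temperature, np.round(p, 2), ' '.join(draws))
```

With these numbers, 'i' goes from 40% of the mass at temperature 1.0 to over 90% at 0.2, so long runs of the same word are expected at low temperature.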

I think a better approach would be to break each dialogue down by its grammatical structure and somehow pass that knowledge to the network.
For now, let's run it for a few more iterations and hope that the loss drops further.

In [0]:
runTranscript(rangeToRun=20)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
------ temperature: 0.2
--- Generating with seed: "i called tallahassee he ask"
i called tallahassee he ask i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i

--- Generating with seed: "ok im gon na office"
ok im gon na office i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i i

--- Generating with seed: "oh pamcake no we love"
oh pamcake no we love



 gang stop its married away gay solution frequent absorb sexist right did oh

--- Generating with seed: "recent incident involving phyllis man"
recent incident involving phyllis man and i think ill just know there event stuff remember stanley there friendly studying whoever morale nipples cheer sexist number set day picture sad roy pound good like really asleep here instead totally salesmen who railing broke lady why wouldnt




------ temperature: 1.2
--- Generating with seed: "used sales calls time in"
used sales calls time in named lives black license hysterectomy experts smell say monster gang bouche skates kids bernies 200 plop thomas build news reciepts paper rich s—hey infants skirt yeah ooh dogs robot join catastrophe look athletes warm marie brush humiliating jo approval animal

--- Generating with seed: "hey whos charge making drinks"
hey whos charge making drinks im called christmas 308 test sabre sherman sabre keyboard bag notices fitness speech keeping awkward trip okay ad

In [0]:
#### We got a few bits of decent dialogue:
#### im saying shes love but left feel shes bad today um day ive sales you thinkin anytime keeps saw woody friends