# Assignment 05: What are they worried about?

For this assignment, I want to create a poem based on questions and ways to answer them. To do that, I will use spaCy to find questions and their answers. Then, I'll run Markov chains on the questions and the answers separately to produce some text from them.

As source material, I'll use the script of the movie [Kiss Kiss Bang Bang](http://www.imdb.com/title/tt0373469/), a noir black comedy film. (I plan to do this with more movies later, but it is difficult to get movie scripts because of how different they all are. Maybe some web scraping tools will help 🙄😅)


In [1]:
# IMPORT ALL THE PACKAGES!
import random as rng
from collections import Counter
import re
import spacy

nlp = spacy.load('en_core_web_md')


## File loading and cleaning

First, I will load the file, process it with spacy and have a list of the sentences.

In [2]:
kkbb_script = [line.strip() for line in open("./sources/kisskissbangbang.txt").readlines()]

In [4]:
kkbb = nlp( ' '.join(kkbb_script) )

In [16]:
# get sentences!
kkbb_sents = [line.text.strip() for line in list(kkbb.sents)]

In [18]:
# let's see what we have
kkbb_sents

["I'll tell you what, I take notes, in general...   ...",
 "so if you're just real specific, thorough and precise, that'll help.",
 'I remember only lying to her one time.',
 'I said to her:   "Maybe the man who is living here is not your father. "',
 'You told your sister she was adopted, yeah?',
 'No, I did one better.',
 'I told her her real father was an actor...   ...who was in the movie that came through town.',
 '- Jonny Gossamer- - Jonny Gossamer movie.',
 "That's right.",
 "I told Jenna one day she'd go to Hollywood...   ...",
 "and she'd meet her famous, real father.",
 'She believed me, Harry, and she came out here looking for him.',
 'Okay, this is a bunch.',
 'I can start my...',
 '- ...process, I guess.',
 '- Thank you.',
 "I'm on the case.",
 'Here is my card.',
 "- It's a magic card, so.... -",
 'Wow, "the Amazing Harold. "',
 '- Just say "abracadabra. "',
 '- What happened, somebody sue you?',
 'It used to be "alakazam" when you cut me in half.',
 'And, not to be picky

There's some clean-up to do. A lot of '...', '   ' and '-' need to be erased, but it's hard to tell which ones should go and which ones should stay. The ones I'll clean are:
- '   ...'
- '- '
- ' -'
- '   -'
- '   '

But besides that, each line has a sentence, so discriminating between questions and answers will not be that hard.

In [23]:
# note: the replacements run left to right, so '   -' can never match once ' -' has been removed
kkbb_cleans = [line.replace('- ','').replace(' -','').replace('   -','').replace('   ...', ' ').replace('   ',' ') for line in kkbb_sents]

In [24]:
kkbb_cleans

["I'll tell you what, I take notes, in general... ",
 "so if you're just real specific, thorough and precise, that'll help.",
 'I remember only lying to her one time.',
 'I said to her: "Maybe the man who is living here is not your father. "',
 'You told your sister she was adopted, yeah?',
 'No, I did one better.',
 'I told her her real father was an actor... who was in the movie that came through town.',
 'Jonny GossamerJonny Gossamer movie.',
 "That's right.",
 "I told Jenna one day she'd go to Hollywood... ",
 "and she'd meet her famous, real father.",
 'She believed me, Harry, and she came out here looking for him.',
 'Okay, this is a bunch.',
 'I can start my...',
 '...process, I guess.',
 'Thank you.',
 "I'm on the case.",
 'Here is my card.',
 "It's a magic card, so....",
 'Wow, "the Amazing Harold. "',
 'Just say "abracadabra. "',
 'What happened, somebody sue you?',
 'It used to be "alakazam" when you cut me in half.',
 'And, not to be picky, but it was "Harold the Great. "',

Looks much better now!

Now, we can create lists of questions and answers. As the sentences have been separated, it is easy to pick out the questions, since they will contain a '?' sign, and the answer will just be the next line. A list comprehension gets the questions (`[line for line in kkbb_cleans if '?' in line]`), but we have to expand it to also get the next line (the answer).

Thankfully, the internet exists, and the `zip()` function pairs up the elements of two lists as tuples. The beauty is that we can use `zip(kkbb_cleans, kkbb_cleans[1:])` to create pairs of consecutive elements in a list.
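To see what that pairing trick does on its own, here is a minimal sketch on a tiny list. The four lines are made up for illustration and only stand in for `kkbb_cleans`.

```python
# Toy stand-in for kkbb_cleans; these lines are invented for illustration.
lines = ["Where are you?", "At home.", "Why?", "I live here."]

# zip stops at the shorter input, so the last line is never used as a "question".
pairs = list(zip(lines, lines[1:]))

# Keep only the pairs whose first element contains a question mark.
q_a = [(q, a) for q, a in pairs if "?" in q]

for q, a in q_a:
    print(q, "->", a)
```

Note that a question followed by another question still forms a pair, so the "answer" can itself be a question.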

In [28]:
q_a_lines = [[q_line, a_line] for q_line,a_line in zip(kkbb_cleans, kkbb_cleans[1:]) if '?' in q_line]
q_a_lines

[['You told your sister she was adopted, yeah?', 'No, I did one better.'],
 ['What happened, somebody sue you?',
  'It used to be "alakazam" when you cut me in half.'],
 ['You hear anything?', "No, there's nothing in the papers, so...."],
 ['What case?', 'Leave her alone.'],
 ['that the girl in the lake, that was Veronica Dexter?',
  'Positive ID, scars, dental records.'],
 ['You talked to your police guy?', 'Yeah, not much there.'],
 ['That was the last anybody saw of her? With a symmetrical, ungooshed head.',
  'Police ever find the car?'],
 ['Police ever find the car?', 'No, genius, that was us.'],
 ['Remember?', 'Oh, yeah, right.'],
 ['They were?  ', "The killers were at Dexter's?"],
 ["The killers were at Dexter's?", "That's how they recognized you."],
 [', okay?', 'Colin Farrell wants too much money.'],
 ['Do you get me now?', 'Dabney, he unearths a discovery.'],
 ['You told her?', 'Pick those up.'],
 ['Why did you tell her?', "You didn't have to tell her."],
 ['What, are you thr

Now, we just have to separate the pairs into a list of questions and a list of answers.

In [29]:
kkbb_q = [pair[0] for pair in q_a_lines]
kkbb_a = [pair[1] for pair in q_a_lines]

## Markov analysis

First, let's write a function that generates the tuples from each question. This analysis could be done character by character or word by word. I'm more interested in the latter, as I want to recreate the worries and questions of the characters in the movie.

Another choice is how to separate the words in each question or answer. Is "Who" the same as "Who's"? Or will they have separate values ("Who" and "'s") for the Markov analysis? I will let technology decide (not really) and use nlp on each of the lists.
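As a rough illustration of the difference between the two levels, here is a small sketch. The `ngrams` helper is a hypothetical name of mine, and the word split is a naive whitespace split rather than spaCy's tokenizer.

```python
def ngrams(sequence, n):
    """Return every run of n consecutive items in sequence as a tuple."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

question = "What case?"  # one of the questions found in the script

# A string is a sequence of characters, so this gives character bigrams.
char_grams = ngrams(question, 2)

# Naive whitespace split (an assumption for this sketch, not spaCy).
word_grams = ngrams(question.split(), 2)

print(char_grams[:3])
print(word_grams)
```

With the whitespace split, the question mark stays glued to the last word, which is one reason the choice of tokenizer matters.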

In [36]:
nlp_q = nlp( ' '.join(kkbb_q) )
q_words = [item.text for item in nlp_q]

In [41]:
# and let's see if I got what I wanted
"'s" in q_words

True

In [42]:
# the same for the answers
nlp_a = nlp( ' '.join(kkbb_a) )
a_words = [item.text for item in nlp_a]

Now, onto the model generation. I will use word ngrams of length 2, as I'm not really sure of what will turn up. 

In [44]:
print(q_words[-1])
print(a_words[-1])

?
...


In [51]:
def markov_model(list_source, n):
    model = {}
    # append None to the end of the list - I'm not sure...
#     source = list(list_source) + [None]
    source = list(list_source)
    # and we go over the source list
    for i in range( len(source)-n  ):
        # grab the key AS A TUPLE!
        key = tuple(source[i:i+n])
        # find if that key exists already in the model
        if key not in model:
            # initialize the dictionary entry
            model[key] = []
        # append the value it leads to
        model[key].append( source[i+n] )
    return model
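
To get a feel for what the model holds, here is a sketch on a toy token list. `build_model` is a compact restatement of the function above under a hypothetical name, and the tokens are made up.

```python
def build_model(tokens, n):
    """Map each n-token tuple to the list of tokens that follow it."""
    model = {}
    for i in range(len(tokens) - n):
        key = tuple(tokens[i:i + n])
        # setdefault creates the empty list the first time a key is seen
        model.setdefault(key, []).append(tokens[i + n])
    return model

# Invented tokens, already split the way spaCy would split them.
toy_tokens = ["What", "'s", "up", "?", "What", "'s", "wrong", "?"]
toy_model = build_model(toy_tokens, 2)

print(toy_model[("What", "'s")])      # both continuations are recorded
print(("wrong", "?") in toy_model)    # the final pair has no successor
```

The last pair of the source never becomes a key, so a chain that reaches it has nowhere to go.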

In [52]:
q_model = markov_model(q_words, 2)
a_model = markov_model(a_words, 2)

## Preparing the scripts

Now to the testing... I don't really have an "escape" route in each model. I could have appended 'None' to each '?' symbol, but I want to be able to create multiple questions if I want to. 

So, first, I need to create two generation functions to use the chain and get a desired output.

In [54]:
def get_question(list_start, model):
    # initialize output
    output = list(list_start)
    # iterate a lot to get a question
    for i in range(20):
        # transform the start list into a tuple (so it can be used as key)
        key = tuple(output[-2:])
        # search for the next word and append it
        word = rng.choice( model[key] )
        output.append(word)
        # check for stopping case
        if '?' == word:
            break
    # transform the output to a string and return
    return ' '.join(output)

In [119]:
def get_answer(list_start, model):
    # initialize output
    output = list(list_start)
    # iterate a lot to get an answer
    for i in range(40):
        # transform the start list into a tuple (so it can be used as key)
        key = tuple(output[-2:])
        # search for the next word and append it
        word = rng.choice( model[key] )
        output.append(word)
        # check for stopping case
        if '...' == word or '.' == word:
            break
    # transform the output to a string and return
    return ' '.join(output)
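
The two functions differ only in the iteration cap and the stop tokens, and `model[key]` would raise a `KeyError` if the chain reached a pair with no recorded successor. One possible way to fold both into a single helper is sketched below. `generate`, `stop_tokens` and `max_words` are names I made up, and the toy model is invented.

```python
import random as rng

def generate(list_start, model, stop_tokens, max_words=20):
    """Walk the chain from list_start until a stop token, a dead end, or the cap."""
    output = list(list_start)
    for _ in range(max_words):
        key = tuple(output[-2:])
        # .get returns None for an unknown key instead of raising KeyError
        choices = model.get(key)
        if not choices:
            break
        word = rng.choice(choices)
        output.append(word)
        if word in stop_tokens:
            break
    return ' '.join(output)

# Invented two-entry model, small enough to follow by hand.
toy_model = {
    ("What", "'s"): ["up"],
    ("'s", "up"): ["?"],
}

question = generate(["What", "'s"], toy_model, {"?"})
dead_end = generate(["No", ","], toy_model, {".", "..."}, max_words=40)
print(question)
print(dead_end)
```

With this shape, a question would use `{"?"}` as stop tokens and an answer would use `{".", "..."}` with a larger cap.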

In [58]:
# let's test them!
print( get_question(["What","'s"], q_model) )

What 's my present , Slick ?


In [63]:
print( get_answer(["I","just"], a_model) )

I just put in one bullet , did n't have to tell her .


The final problem is how to select two words for each start case. As I have a list of all the questions and answers, the first thing I'll try is to extract the first two words from each, and see where that gets me.

In [93]:
q_nlp_starts = [nlp(line) for line in kkbb_q]
q_starts = [ (words[0].text,words[1].text) for words in q_nlp_starts ]

In [80]:
a_nlp_starts = [nlp(line) for line in kkbb_a]
a_starts = [ (words[0].text,words[1].text) for words in a_nlp_starts ]

IndexError: Attempt to access token at 1, max length 1

So, I got an error, because the list has some sentences of length 1. As nlp splits '.' off as its own token, the only problem is the multiple occurrences of '-'. As the `remove()` method only takes out the first match, I will create a small function that removes all of them.

In [86]:
def remove_all(source_list, target):
    while target in source_list:
        source_list.remove(target)
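
A list comprehension is another way to do this, sketched below on made-up data. It builds a new list in a single pass and leaves the original untouched, which may or may not be what you want.

```python
# Invented sample standing in for kkbb_a.
answers = ["No, I did one better.", "-", "Oh, yeah, right.", "-"]

# Keep every line that is not exactly the target string.
cleaned = [line for line in answers if line != "-"]

print(cleaned)
print(len(answers))  # the original list still has all four entries
```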

In [87]:
remove_all(kkbb_a, '-')

In [89]:
'-' in kkbb_a

False

In [94]:
# let's try again
a_nlp_starts = [nlp(line) for line in kkbb_a]
a_starts = [ (words[0].text,words[1].text) for words in a_nlp_starts ]

Let's see if there are any recurring pairs in each of the lists.

In [98]:
q_pairs = Counter(q_starts)
q_pairs.most_common(13)

[(('What', '?'), 10),
 (('What', "'s"), 7),
 (('Do', 'you'), 4),
 (('What', ','), 3),
 (('You', 'know'), 3),
 (('What', 'do'), 3),
 (('What', 'is'), 3),
 (('Harry', ','), 3),
 (('Where', 'the'), 3),
 (('Hello', '?'), 3),
 (('Who', 'are'), 3),
 (('How', 'about'), 3),
 (('You', 'told'), 2)]

In [99]:
a_pairs = Counter(a_starts)
a_pairs.most_common(13)

[(('No', ','), 10),
 (('I', "'m"), 8),
 (('Oh', ','), 5),
 (('I', 'just'), 4),
 (('Yeah', '.'), 4),
 (('Well', ','), 4),
 (('Come', 'on'), 3),
 (('I', 'do'), 3),
 (('Yeah', ','), 2),
 (('That', "'s"), 2),
 (('Not', 'really'), 2),
 (('Nothing', '.'), 2),
 (('It', "'s"), 2)]

This is good enough to do some interesting things. I'll pick only the pairs that have 3 or more repetitions to have a better spread of words.

In [111]:
print(a_pairs.most_common(10)[0][0])
print(a_pairs.most_common(10)[0][1])

('No', ',')
10


In [113]:
a_common = [ list(item[0]) for item in a_pairs.most_common() if item[1] >= 3]
a_common

[['No', ','],
 ['I', "'m"],
 ['Oh', ','],
 ['I', 'just'],
 ['Yeah', '.'],
 ['Well', ','],
 ['Come', 'on'],
 ['I', 'do']]

In [114]:
q_common = [ list(item[0]) for item in q_pairs.most_common() if item[1] >= 3]
q_common

[['What', '?'],
 ['What', "'s"],
 ['Do', 'you'],
 ['What', ','],
 ['You', 'know'],
 ['What', 'do'],
 ['What', 'is'],
 ['Harry', ','],
 ['Where', 'the'],
 ['Hello', '?'],
 ['Who', 'are'],
 ['How', 'about']]

Now, I can finally do something with all this! 😃

## LET'S HAVE FUN

### A simple dialog

First, let's create a simple dialog alternating some q&a's

In [120]:
for i in range(10):
    q_cue = rng.choice(q_common) #haha
    print( get_question(q_cue, q_model) )
    
    a_cue = rng.choice(a_common)
    print( get_answer(a_cue, a_model) )
    
    print('\n')

What 's up , honey ?
Well , he unearths a discovery .


How about you ?
No , moron .


What is he , like , probe deeper ?
Come on , Harry .


You know what , Harry ?
Come on , breathe .


What do you ?
Come on , I do for a living .


You know what else is nuts ?
I just saw your really distinctive ears .


How about you ?
Well , he did .


What 's up ?
Yeah . Oh , forget it .


Do you think I 'm playing a little game called " Am I Bluffing ?
I just want to say something to her .


What ? Why not simply go to the hospital ?
I do n't wanna come up .




### A world full of questions

We have so many questions. The world is a horrible place. Can we answer all of these with one sentence? 

(Idea: do this again, but with the script of "The Hitchhiker's Guide to the Galaxy", and see how many questions we can answer with "42")

In [124]:
for i in range(rng.randrange(5,15)):
    q_cue = rng.choice(q_common) #haha again
    print( get_question(q_cue, q_model) )

print('\n')
a_cue = rng.choice(a_common)
print( get_answer(a_cue, a_model) )

Where the fuck is Harmony ?
What ? And who would that be ?
Harry , do n't you give the kid a break ?
What 's up ?
What , are you going ?
What ? What happened , somebody sue you ?


Well , he used to beat me in half .


In [125]:
for i in range(rng.randrange(5,15)):
    q_cue = rng.choice(q_common) #haha again
    print( get_question(q_cue, q_model) )

print('\n')
a_cue = rng.choice(a_common)
print( get_answer(a_cue, a_model) )

How about yours ?
What , are you ?
You know what else is nuts ?
Where the fuck is Harmony ?
What ?    You 're not worried ?
Who are you doing ?
Who are you going with this ?
What is he , like 6'4 " ?
Who are you doing ?
What 's going on ?
What is it ?


I do n't think so .
