# word2vec

Here, I experiment with word2vec-style word vectors to pick out the questions I want from the dialogs and to extract their keywords. From those keywords, I will define the topics I use to create the recipes.

In [2]:
import random as rng
from collections import Counter
import re
import spacy
nlp = spacy.load('en_core_web_md')

In [3]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

## sentences and word parsing

First, I load the text files. I will experiment with the Harry Potter dialogs I got.

In [3]:
# load the dialog lines of every available book, closing each file after reading
hp_dialogs = []
for book in [1, 3, 4, 5, 6, 7]:
    with open(f"harrypotter{book}_dialog.txt") as f:
        hp_dialogs.extend(line.strip() for line in f)

Now, I will use the spaCy pipeline (`nlp`) to get the sentences and the noun_chunks.

In [4]:
hp_dialogs_joined = ' '.join(hp_dialogs)

In [None]:
nlp_hp = nlp( hp_dialogs_joined )

In [None]:
hp_sents = [line.text.strip() for line in list(nlp_hp.sents)]

In [None]:
hp_nounch = nlp_hp.noun_chunks

![damn_kernel.png](attachment:damn_kernel.png)

It seems to be too much information... Let's see if I can do it book by book
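
Besides going book by book, another option would be to hand spaCy smaller pieces of text instead of one giant string. The sketch below is only an illustration of that idea: `chunk_lines` and `max_chars` are hypothetical names that do not exist in this notebook, and the character budget is arbitrary.

```python
def chunk_lines(lines, max_chars=100_000):
    """Group consecutive lines into space-joined chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        extra = len(line) + (1 if current else 0)  # +1 for the joining space
        if current and size + extra > max_chars:
            # the current chunk is full: close it and start a new one
            chunks.append(' '.join(current))
            current, size = [], 0
            extra = len(line)
        current.append(line)
        size += extra
    if current:
        chunks.append(' '.join(current))
    return chunks


# toy input standing in for hp_dialogs
toy_lines = ["aaaa", "bbbb", "cccc", "dd"]
toy_chunks = chunk_lines(toy_lines, max_chars=9)
print(toy_chunks)
```

With spaCy, the chunks could then be processed as `list(nlp.pipe(chunk_lines(hp_dialogs)))`, at the cost of possibly splitting a sentence that straddles two chunks.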


In [1]:
hp1_dialog = [line.strip() for line in open("harrypotter1_dialog.txt").readlines()]

In [4]:
nlp_hp1 = nlp( ' '.join(hp1_dialog) )

In [20]:
hp1_sents  = [line.text.strip() for line in list(nlp_hp1.sents)]
hp1_nounch = nlp_hp1.noun_chunks

In [28]:
hp1_tokens = list(set([w.text for w in nlp_hp1 if w.is_alpha]))

Awesome, it works with only one book!

Now, onto the questions and answers...

In [23]:
q_a_lines = [[q_line, a_line] for q_line,a_line in zip(hp1_sents, hp1_sents[1:]) if '?' in q_line]

In [25]:
hp1_q = [pair[0] for pair in q_a_lines]
hp1_a = [pair[1] for pair in q_a_lines]
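
To make the pairing rule concrete, here is a toy run of the same comprehension on invented sentences (not taken from the books). It also shows a limitation worth keeping in mind: when two questions follow each other, the second question is recorded as the "answer" to the first.

```python
toy_sents = [
    "Where are we going?",
    "To the station.",
    "It was raining.",
    "Why?",
    "Who knows?",
    "Nobody.",
]

# same rule as q_a_lines: a sentence containing '?' plus the sentence right after it
toy_pairs = [[q, a] for q, a in zip(toy_sents, toy_sents[1:]) if '?' in q]

for q, a in toy_pairs:
    print(q, '->', a)
```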

I need a Counter to see the recurring themes in the answers

In [134]:
hp1_a_nlp = nlp( ' '.join(hp1_a) )

And to know what to use... Let's see the difference between noun_chunks and entities!

In [135]:
hp1_a_noun = hp1_a_nlp.noun_chunks
hp1_noun_counter = Counter([item.text.capitalize() for item in hp1_a_noun])
hp1_noun_counter.most_common(20)

[('He', 138),
 ('I', 133),
 ('You', 130),
 ('Harry', 103),
 ('It', 80),
 ('Ron', 52),
 ('We', 39),
 ('Me', 38),
 ('Hagrid', 36),
 ('They', 36),
 ('Him', 34),
 ('What', 32),
 ('She', 25),
 ('Hermione', 21),
 ('Who', 19),
 ('Dumbledore', 17),
 ('Them', 15),
 ('Snape', 14),
 ('Professor mcgonagall', 12),
 ('Something', 12)]

In [136]:
hp1_noun_counter.most_common()[20:40]

[('Malfoy', 8),
 ('Aunt petunia', 7),
 ('Uncle vernon', 7),
 ('The stone', 7),
 ('People', 6),
 ('Himself', 6),
 ('Nothing', 6),
 ('Wood', 6),
 ('The door', 5),
 ('His face', 5),
 ('Us', 5),
 ('Hogwarts', 5),
 ('The train', 5),
 ('Ronan', 5),
 ('Quirrell', 5),
 ('The muggles', 4),
 ('Dudley', 4),
 ('Charlie', 4),
 ('Gryffindor', 4),
 ('Anything', 4)]

In [137]:
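hp1_a_noun = hp1_a_nlp.noun_chunks
for chunk in hp1_a_noun:  # renamed from `np`, which shadowed the numpy alias
    print(chunk, chunk.dep_)  # fails: a noun chunk is a Span, and Span has no dep_
# Span-level attributes I tried:
#     chunk.dep_   : doesn't exist
#     chunk.label_ : all are "NP"
#     chunk.tag_   : doesn't exist
#     chunk.pos_   : doesn't exist
# dep_, tag_ and pos_ live on tokens: use chunk.root.dep_ or loop over the chunk's tokens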

AttributeError: 'spacy.tokens.span.Span' object has no attribute 'dep_'

In [140]:
hp1_a_ents = hp1_a_nlp.ents
hp1_ents_counter = Counter([item.text.capitalize() for item in hp1_a_ents])
hp1_ents_counter.most_common(15)

[('Harry', 117),
 ('Ron', 55),
 ('Hagrid', 42),
 ('Dumbledore', 23),
 ('Hermione', 23),
 ('Snape', 17),
 ('Malfoy', 10),
 ('One', 9),
 ('Hogwarts', 8),
 ('Mcgonagall', 8),
 ('Professor mcgonagall', 7),
 ('Aunt petunia', 7),
 ('Quirrell', 7),
 ('Uncle vernon', 6),
 ('Quidditch', 6)]

In [None]:
hp1_a_ents = hp1_a_nlp.ents
for nn in hp1_a_ents:
    print(nn, nn.label_)
#     print(nn, nn.tag_)   : doesn't exist
#     print(nn, nn.pos_)   : doesn't exist

The things that get mentioned the most are people. 

In [145]:
[(thing, thing.label_) for thing in hp1_a_nlp.ents if thing.label_ not in ["PERSON", "CARDINAL", "QUANTITY", "ORDINAL", "TIME", "DATE"] ]

[(Tufty, 'NORP'),
 (Majorca, 'GPE'),
 (Aunt Petunia, 'PRODUCT'),
 (Aunt Petunia, 'WORK_OF_ART'),
 (Piers, 'GPE'),
 (Aunt Petunia, 'WORK_OF_ART'),
 (Aunt Petunia, 'WORK_OF_ART'),
 (Aunt Petunia, 'WORK_OF_ART'),
 (Aunt Petunia, 'PRODUCT'),
 (Aunt Petunia, 'PRODUCT'),
 (Uncle Vernon, 'GPE'),
 (London, 'GPE'),
 (Gringotts, 'LOC'),
 (Harry Potter, 'WORK_OF_ART'),
 (D-Defense Against the D-D-Dark Arts, 'WORK_OF_ART'),
 (Gringotts, 'LOC'),
 (Blimey, 'NORP'),
 (King's Cross, 'ORG'),
 (Harry Potter, 'WORK_OF_ART'),
 (Harry Potter, 'WORK_OF_ART'),
 (Starving, 'WORK_OF_ART'),
 (Chocolate Frogs, 'ORG'),
 (Romania, 'GPE'),
 (Africa, 'LOC'),
 (Daily Prophet, 'WORK_OF_ART'),
 (Scabbers, 'ORG'),
 (The Harry Potter, 'WORK_OF_ART'),
 (Malfoy, 'ORG'),
 (Malfoy, 'ORG'),
 (Malfoy, 'ORG'),
 (Wood, 'GPE'),
 (Crabbe, 'ORG'),
 (Malfoy, 'ORG'),
 (Nimbus, 'ORG'),
 (Malfoy, 'ORG'),
 (The Chasers throw the Quaffle, 'WORK_OF_ART'),
 (Wood, 'ORG'),
 (Three Chasers try and score with the Quaffle, 'WORK_OF_ART'),
 (Ke

In [144]:
set( [person.text for person in hp1_a_nlp.ents if person.label_ == "PERSON"] )

{'Agrippa',
 'Bane',
 'Bill',
 'Bin',
 'Blimey',
 'Bludger',
 'CarA',
 'Centaurs',
 'Charlie',
 'DURSLEY',
 'Dean',
 'Draco Malfoy',
 'Dudley',
 'Dumbledore',
 'Dursley',
 'Dursleys',
 'Fang',
 'Figg',
 'Firenze',
 'Fitch',
 'Flamel',
 'Flitwick',
 'Fluffy',
 'Fred',
 'Gallopin',
 'George',
 'George Weasley',
 'Gorgons',
 'Gringotts',
 'Griphook',
 'Gryffindor',
 'Gulpin',
 'Hagrid',
 'Harry',
 'Harry Potter',
 'Hermione',
 'Hmm',
 'Hogwarts',
 'Howard',
 'Hufflepuff',
 'Hurry',
 'Jus',
 'Malfoy',
 'Malfoys',
 'Malkin',
 'Marcus Flint',
 'McGonagall',
 'Mighta',
 'Miss Granger',
 'Mmm',
 'Mom',
 'Nah',
 'Neville',
 'Neville Longbottom',
 'Nicholas de Mimsy-Porpington',
 'Nicolas',
 'Nicolas Flamel',
 'Parvati',
 'Parvati Patil',
 'Paws',
 'Percy',
 'Pomfrey',
 'Potter',
 'Professor McGonagall',
 'Quidditch',
 'Quirrell',
 'Remembrall',
 'Ron',
 'Ronan',
 'Seamus',
 'Snape',
 'Snitch',
 'Sprouts',
 'Stalagmite',
 'Stone',
 'Ted',
 'Tibbles',
 'Tom',
 'Trevor',
 'Uncle Vernon',
 'Vernon'

What if I just count the words in the text directly?

In [111]:
hp1_counter = Counter([word for word in nlp_hp1 if word.is_alpha])
hp1_counter.most_common(20)

[(At, 1),
 (half, 1),
 (past, 1),
 (eight, 1),
 (Dursley, 1),
 (picked, 1),
 (up, 1),
 (his, 1),
 (briefcase, 1),
 (pecked, 1),
 (Dursley, 1),
 (on, 1),
 (the, 1),
 (cheek, 1),
 (and, 1),
 (tried, 1),
 (to, 1),
 (kiss, 1),
 (Dudley, 1),
 (good, 1)]

But every spaCy token is a distinct object, even when the text is the same... So they won't be aggregated in a Counter! D:
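
A possible fix is to count each token's string instead of the token object. The sketch below uses a hypothetical stand-in class, `FakeToken`, so it runs without a spaCy model. It only mimics the relevant behaviour: two tokens with the same text are still different dictionary keys.

```python
from collections import Counter


class FakeToken:
    """Hypothetical stand-in for a spaCy Token: hashed by identity, not by text."""

    def __init__(self, text):
        self.text = text
        self.is_alpha = text.isalpha()


doc = [FakeToken(w) for w in "the cat saw the other cat".split()]

# counting the objects: every key is unique, so every count stays at 1
by_object = Counter(tok for tok in doc if tok.is_alpha)

# counting the strings: repeated words are aggregated
by_text = Counter(tok.text.lower() for tok in doc if tok.is_alpha)

print(max(by_object.values()))
print(by_text["the"], by_text["cat"])
```

In the notebook, the equivalent change would be `Counter([word.text for word in nlp_hp1 if word.is_alpha])`, probably also skipping stop words so that "the" and "and" do not dominate the list.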

In [122]:
for word in nlp_hp1[30010:30090]:
    if word.is_alpha:
        print(word.text, word.pos_, word.dep_, word.tag_)

snarled VERB ROOT VBD
Hermione PROPN nsubj NNP
rolled VERB ROOT VBD
up PART prt RP
the DET det DT
sleeves NOUN dobj NNS
of ADP prep IN
her ADJ poss PRP$
gown NOUN pobj NN
flicked VERB conj VBD
her ADJ poss PRP$
wand NOUN dobj NN
and CCONJ cc CC
said VERB conj VBD
Wingardium PROPN compound NNP
Leviosa PROPN ccomp NNP
Oh INTJ intj UH
well INTJ advmod UH
done VERB ccomp VBN
cried VERB ROOT VBD
Professor PROPN compound NNP
Flitwick PROPN nsubj NNP
clapping VERB advcl VBG
Everyone NOUN nsubj NN
see VERB ccomp VBP
here ADV advmod RB
Miss PROPN compound NNP
Granger PROPN nsubj NNP
done VERB ROOT VBN
it PRON dobj PRP
Ron PROPN nsubj NNP
was VERB ROOT VBD
in ADP prep IN
a DET det DT
very ADV advmod RB
bad ADJ amod JJ
mood NOUN pobj NN
by ADP prep IN
the DET det DT
end NOUN pobj NN
of ADP prep IN
the DET det DT
class NOUN pobj NN
It PRON nsubj PRP
no DET det DT
wonder NOUN attr NN
no DET det DT
one NOUN nsubj NN
can VERB aux MD
stand VERB ccomp VB
her PRON dobj PRP
he PRON nsubj PRP
said VERB RO

Might as well just leave this for the word2vec results :/

## word2vec-ing

Functions for word2vec

In [177]:
def addv(coord1, coord2):
    return [c1 + c2 for c1, c2 in zip(coord1, coord2)]

def subtractv(coord1, coord2):
    return [c1 - c2 for c1, c2 in zip(coord1, coord2)]

def meanv(coords):
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
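
For reference, the three helpers above have direct NumPy counterparts. This is just a sketch on made-up vectors to show the correspondence; the helper names in the comments are the ones defined in the cell above.

```python
import numpy as np

a = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 0.5]

added = np.add(a, b)             # same values as addv(a, b)
subtracted = np.subtract(a, b)   # same values as subtractv(a, b)
mean = np.mean([a, b], axis=0)   # same values as meanv([a, b])

print(added, subtracted, mean)
```

The NumPy versions return arrays rather than lists, which `dot` and `norm` accept as well.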

In [181]:
# get spacy vector
def vec(s):
    return nlp.vocab[s].vector

def sentvec(s):
    sent = nlp(s)
    return meanv([w.vector for w in sent])

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

# closest word to target vector from token list
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(token_list, key=lambda x: cosine(vec_to_check, vec(x)), reverse=True)[:n]

# closest sentence to target vector from token list
def spacy_closest_sent(token_list, vec_to_check, n=10):
    return sorted(token_list, key=lambda x: cosine(vec_to_check, sentvec(x)), reverse=True)[:n]
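
Note that `spacy_closest_sent` re-runs `nlp` on every sentence each time it is called, which gets slow. A possible alternative, sketched below under the assumption that the vectors are computed once beforehand, is to rank precomputed vectors with a single matrix product. `rank_by_cosine` is a hypothetical helper, and the 2-d vectors are invented for the example.

```python
import numpy as np


def rank_by_cosine(items, vectors, target, n=10):
    """Return the n items whose row in `vectors` is most cosine-similar to `target`."""
    vectors = np.asarray(vectors, dtype=float)
    target = np.asarray(target, dtype=float)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    sims = np.zeros(len(items))
    nonzero = norms > 0  # zero vectors get similarity 0.0, as in cosine()
    sims[nonzero] = vectors[nonzero] @ target / norms[nonzero]
    order = np.argsort(-sims, kind="stable")[:n]  # highest similarity first
    return [items[i] for i in order]


# toy 2-d vectors standing in for precomputed sentence vectors
toy_items = ["east", "north-east", "north", "west", "unknown"]
toy_vectors = [[1, 0], [1, 1], [0, 1], [-1, 0], [0, 0]]
ranked = rank_by_cosine(toy_items, toy_vectors, target=[1, 0], n=3)
print(ranked)
```

In the notebook this would mean computing `[sentvec(q) for q in hp1_q]` once and reusing it for every target word.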

First, let's pick a topic for the question and the solution we're aiming for. From that, I'll work on getting the ingredients.

In [179]:
q_target = "sadness"
a_target = "happiness"

In [183]:
q_possibilities = spacy_closest_sent( hp1_q, vec(q_target), 10 )

In [188]:
q_possibilities

['"And what if whatever hurt the unicorn finds us first?" said Malfoy, unable to keep the fear out of his voice.',
 '"What on earth were you thinking of?" said Professor McGonagall, with cold fury in her voice.',
 'Am I?" said Harry, feeling dazed.',
 'They watched the birds soaring overhead, glittering -- glittering?',
 '"That\'s your problem, isn\'t it?" said Filch, his voice cracking with glee.',
 '"Where was I?" said Hagrid, but at that moment, Uncle Vernon, still ashen-faced but looking very angry, moved into the firelight.',
 '"Mrs. Norris?" breathed Ron, squinting through the dark. "',
 'Have you no shame?',
 'Never made things happen when you was scared or angry?"',
 '"Light?" said Ron, but Hermione told him to be quiet until she\'d looked something up, and started flicking frantically through the pages, muttering to herself.']

In [189]:
sel_q = rng.choice(q_possibilities)
print(sel_q)

They watched the birds soaring overhead, glittering -- glittering?


Now, let's extract the noun_chunk(s) to know what it is referring to...

In [190]:
sel_chunks = [ch.text for ch in nlp(sel_q).noun_chunks]
sel_chunks

['They', 'the birds']

In [191]:
# and get the vector
vec_sel_q = sentvec(rng.choice(sel_chunks))

In [192]:
# so we want the offset that goes from one of the sel_chunks to q_target
chunk_to_target = subtractv(vec(q_target), vec_sel_q)

And starting from that, I'll scrape the answer list for the noun_chunks or entities that will be the ingredients

In [193]:
# using noun_chunks
# recalling: hp1_a_noun = hp1_a_nlp.noun_chunks
hp1_tokens_noun = list(set( [ch.text for ch in hp1_a_nlp.noun_chunks] ))

In [194]:
a_noun = spacy_closest(hp1_tokens_noun, addv(chunk_to_target, vec(a_target)))
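
Written out, the vector being searched for is `vec(a_target) + (vec(q_target) - vec_sel_q)`. The toy example below works through that arithmetic with invented 2-d vectors (they are not spaCy vectors), only to make the two steps visible.

```python
import numpy as np

# invented 2-d vectors, purely illustrative
toy_vec = {
    "sadness": np.array([-1.0, 0.0]),
    "happiness": np.array([1.0, 0.0]),
    "the birds": np.array([-0.5, 1.0]),
}

# step 1: offset that goes from the chunk to the question target
chunk_to_target = toy_vec["sadness"] - toy_vec["the birds"]

# step 2: apply that same offset starting from the answer target
search_vec = chunk_to_target + toy_vec["happiness"]

print(chunk_to_target, search_vec)
```

The candidates returned by `spacy_closest` are then the tokens whose vectors point in the direction closest to `search_vec`.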

In [195]:
a_noun

['awe',
 'amazement',
 'goodness',
 'freedom',
 'heaven',
 'ourselves',
 'Nothing',
 'nothing',
 'death',
 'myself']

In [196]:
# using entities
# recalling: hp1_a_ents = hp1_a_nlp.ents
hp1_tokens_ents = list(set( [ch.text for ch in hp1_a_nlp.ents] ))

In [197]:
a_ents = spacy_closest(hp1_tokens_ents, addv(chunk_to_target, vec(a_target)))

In [198]:
a_ents

['Mom',
 'Erm',
 'Hmm',
 'Agrippa',
 'Hurry',
 'Blimey',
 'Bane',
 'Nah',
 'Mmm',
 'Centaurs']

For this example, the noun_chunks are better... let's see if that holds!

In [199]:
q_targets = ["death", "sadness", "miserable", "stress", "lost", "hate"]
a_targets = ["happiness", "love", "time", "peace"]

In [200]:
for i in range(10):
    q_trgt = rng.choice(q_targets)
    question = rng.choice( spacy_closest_sent( hp1_q, vec(q_trgt), 10 ) )
    print("QUESTION ")
    print(question)
    question_chunks = [ch.text for ch in nlp(question).noun_chunks]
    question_to_target = subtractv(vec(q_trgt), sentvec(rng.choice(question_chunks)))
    a_trgt = rng.choice(a_targets)
    answer_noun = spacy_closest(hp1_tokens_noun, addv(question_to_target, vec(a_trgt)))
    answer_ents = spacy_closest(hp1_tokens_ents, addv(question_to_target, vec(a_trgt)))
    print("NOUNS ")
    print(answer_noun)
    print("ENTS ")
    print(answer_ents)
    print("")
    print("===========")

QUESTION 
"What on earth were you thinking of?" said Professor McGonagall, with cold fury in her voice.
NOUNS 
['time', 'Things', 'things', 'everyone', 'myself', 'we', 'We', 'Thought', 'what', 'What']
ENTS 
['one', 'One', 'years', 'first', 'today', 'Mom', 'third', 'half', 'Hurry', 'tonight']

QUESTION 
Listen, I'm glad we've run inter yeh, Ronan, 'cause there's a unicorn bin hurt -- you seen anythin'?"
NOUNS 
['everyone', 'Things', 'things', 'Nothing', 'nothing', 'people', 'freedom', 'everything', 'Thought', 'anything']
ENTS 
['one', 'One', 'Mom', 'today', 'half', 'two', 'Two', 'Starving', 'first', 'Hurry']

QUESTION 
"Hagrid and my aunt and uncle -- so who sent these?"
NOUNS 
['death', 'heaven', 'goodness', 'horror', 'Nothing', 'nothing', 'Things', 'things', 'me', 'everything']
ENTS 
['Christmas', 'Mom', 'one', 'One', 'tonight', 'Mmm', 'Charlie', 'Halloween', 'George', 'Harry']

QUESTION 
D'yeh think yer parents didn't leave yeh anything?"
NOUNS 
['I', 'everyone', 'Thought', 'Mom', 'p

Maybe I can get something different (and better) if I curate the entities list

In [201]:
curated_labels = ["PERSON", "CARDINAL", "QUANTITY", "ORDINAL", "TIME", "DATE"]
hp1_tokens_ents2 = list(set( [ch.text for ch in hp1_a_nlp.ents if ch.label_ not in curated_labels] ))

In [202]:
for i in range(10):
    q_trgt = rng.choice(q_targets)
    question = rng.choice( spacy_closest_sent( hp1_q, vec(q_trgt), 10 ) )
    print(question)
    question_chunks = [ch.text for ch in nlp(question).noun_chunks]
    question_to_target = subtractv(vec(q_trgt), sentvec(rng.choice(question_chunks)))
    a_trgt = rng.choice(a_targets)
    answer_ents2 = spacy_closest(hp1_tokens_ents2, addv(question_to_target, vec(a_trgt)))
    print(answer_ents2)
    print("")
    print("===========")

Yer not still lookin' fer Nicolas Flamel, are yeh?"
['Blimey', 'Malfoy', 'POTTER', 'Potter', 'Slytherin', 'Greek', 'Starving', 'Romania', 'Erm', 'Fluffy']

"How could a car crash kill Lily an' James Potter?
['Starving', 'Filch', 'Africa', 'POTTER', 'Potter', 'Piers', 'Erm', 'Slytherin', 'Malfoy', 'Greek']

"They've never lost a hundred and fifty points in one go, though, have they?" said Harry miserably.
['Starving', 'POTTER', 'Potter', 'Erm', 'Greek', 'Keeper', 'Fluffy', 'Mars', 'Majorca', 'Filch']

"What on earth were you thinking of?" said Professor McGonagall, with cold fury in her voice.
['Erm', 'Fluffy', 'Blimey', 'Starving', 'Malfoy', 'Slytherin', 'Filch', 'Crabbe', 'POTTER', 'Potter']

Don't mess with me, Peeves, now where did they go?"
['Starving', 'Erm', 'Wood', 'Filch', 'Piers', 'Africa', 'London', 'Mars', 'Blimey', 'POTTER']

"Hagrid and my aunt and uncle -- so who sent these?"
['Erm', 'POTTER', 'Potter', 'Mars', 'Greek', 'Fluffy', 'Malfoy', 'Starving', 'Wood', 'London']

W

Overall, I like the possibilities from the noun_chunks lists best.