# GIAN 7: Word Embeddings

In this notebook, we will learn more about how vector representations for words, also called *word embeddings*, can be used in text-mining tasks.

In [1]:
import os
import random
from collections import Counter

SpaCy has prebuilt models that include semantic vectors for English. These vectors were constructed using a [dependency-based word2vec model](https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/) (Levy & Goldberg, 2014). This means that SpaCy doesn't build new vectors when you run a text through the NLP pipeline. Instead, it uses existing vectors that were derived from very large corpora.

However, the default model for English (en_core_web_sm) does not include vectors. You will need to install an additional model (en_core_web_md).

In [2]:
!python -m spacy download en_core_web_md

# # Note: if you created an additional environment in Anaconda, 
# # you should first activate the environment and then install the model, by uncommenting the line below.
# !source activate TextMiningClass; python -m spacy download en_core_web_md

^C
Traceback (most recent call last):
  File "/Users/emmanuel/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/emmanuel/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/emmanuel/anaconda3/lib/python3.6/site-packages/spacy/__main__.py", line 31, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/Users/emmanuel/anaconda3/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Users/emmanuel/anaconda3/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Users/emmanuel/anaconda3/lib/python3.6/site-packages/spacy/cli/download.py", line 36, in download
    .format(m=model_name, v=version), pip_args)
 

In [3]:
import spacy
import en_core_web_md
nlp=en_core_web_md.load(disable=["parser", "tagger", "ner"])

Finding the vector for a word is easy. One option is to look the word up in the built-in vocabulary.

In [4]:
v_gold=nlp.vocab['gold'].vector

In [5]:
len(v_gold)

300

Alternatively, you can access the vector via a token from a text that has gone through the nlp pipeline.

In [6]:
doc=nlp("I've been a miner for a heart of gold")
doc[9].vector

array([-1.9140e-01, -7.1784e-02,  4.9133e-01,  2.1924e-01, -1.5227e-01,
       -4.1885e-01, -1.5643e-02, -1.3133e-01, -4.6671e-01,  1.4846e+00,
       -5.4597e-01, -2.6525e-01,  6.4074e-01, -1.0065e-01,  8.2549e-01,
       -2.7165e-01,  1.0220e-01,  1.5139e+00, -7.1276e-02, -3.9695e-01,
        2.8444e-01, -2.4974e-01, -2.5262e-01, -5.5885e-01, -1.3782e-01,
        1.0003e-01,  5.9343e-02,  2.7292e-01, -4.5149e-01,  3.1552e-01,
       -4.6274e-01, -1.0079e-01,  4.2756e-01,  7.7212e-02, -1.8375e-01,
       -6.7547e-01,  4.5902e-01, -5.0890e-01,  8.3544e-01,  2.0627e-01,
        4.7060e-02,  5.4251e-01,  6.6573e-01, -7.7310e-01,  1.3013e+00,
       -4.6937e-01,  3.9686e-02, -4.6208e-01,  2.4292e-01, -2.6870e-01,
       -1.7660e-01,  6.0788e-01, -1.3447e-01, -5.6040e-02,  5.6556e-01,
        1.5908e-01, -7.7136e-01,  1.2400e-02, -2.3463e-01, -1.3738e-01,
       -1.3710e-01, -6.4679e-01,  8.7458e-02,  7.4438e-02, -3.2940e-01,
       -1.3547e-02,  7.5329e-01,  1.5656e-01, -3.8866e-01,  4.16

The values in this vector are exactly the same as in the vector we retrieved from the vocabulary, which we can verify by comparing the two element by element.

In [7]:
all(v_gold==doc[9].vector)

True

Let's now compare the vector for 'gold' to vectors for some other materials. Similarity is computed using a cosine metric.

In [8]:
def similarity(s1, s2, vocab=nlp.vocab):
    """Compute the similarity between two word embeddings.
    
    The higher the score, the more similar two word embeddings are"""
    return(vocab[s1].similarity(vocab[s2]))

In [9]:
similarity("gold", "silver")

0.8831682

In [10]:
similarity("gold", "bronze")

0.6275987

In [11]:
similarity("gold", "cat")

0.2118918

In [12]:
similarity("religion", "science")

0.5280959
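
The cosine metric mentioned above is the dot product of two vectors divided by the product of their lengths. The sketch below writes it out in plain Python; the function name `cosine_similarity` and the 3-dimensional toy vectors are assumptions for illustration and are not part of SpaCy. Returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors (not real embeddings)
toy_gold = [1.0, 2.0, 0.0]
toy_silver = [2.0, 4.0, 0.0]  # points in the same direction as toy_gold
toy_cat = [0.0, 0.0, 3.0]     # orthogonal to toy_gold

sim_same_direction = cosine_similarity(toy_gold, toy_silver)
sim_orthogonal = cosine_similarity(toy_gold, toy_cat)
print(sim_same_direction, sim_orthogonal)
```

The score depends only on the direction of the vectors, not on their length: `toy_silver` is twice as long as `toy_gold` but still gets a similarity of (approximately) 1. The function should also accept NumPy arrays such as `v_gold`, since it only iterates over the values.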

Let's now turn to a more serious task. Word embeddings can also be used to estimate similarity between documents.

For this, the vectors of all of a document's tokens are averaged. This average vector represents the meaning of the document, and the similarity between two documents corresponds to the similarity between their document vectors.

We will use some of the subtitle files from lecture 5 to illustrate this.

We select 500 random files and run the nlp pipeline over them (only tokenizing in this case, since the parser, tagger and NER components were disabled when the model was loaded). If a document contains more than 10 tokens, we append it to our list of documents.

In [16]:
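docs=[]
doc_labels=[]
filenames=[filename for filename in os.listdir('GIAN5_data/') if filename.endswith('.srt')]
# Sample at most 500 files; random.sample raises a ValueError if fewer are available
filenames=random.sample(filenames, min(500, len(filenames)))
for filename in filenames:
    # Use a context manager so that each file is closed after reading
    with open(os.path.join('GIAN5_data', filename), encoding="utf-8") as f:
        text=f.read()
    doc=nlp(text)
    # Keep only documents with more than 10 tokens
    if len(doc)>10:
        docs.append(doc)
        # The first five characters of the filename serve as the programme label
        doc_labels.append(filename[:5])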

Each document has a vector, which is the average of the vectors of its tokens.

In [17]:
len(docs[0].vector)

300

In [18]:
all(docs[0].vector==sum([token.vector for token in docs[0]])/len(docs[0]))

True
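
To make the averaging step concrete, here is a minimal sketch in plain Python. The helper `mean_vector` and the 2-dimensional "token vectors" are made up for illustration; real SpaCy vectors have 300 dimensions, and SpaCy computes the average internally.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    # zip(*vectors) groups the first components together, then the second, ...
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical 2-dimensional token vectors (not real embeddings)
toy_token_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
]

toy_doc_vector = mean_vector(toy_token_vectors)
print(toy_doc_vector)
```

Because an average does not depend on the order of its inputs, two documents that contain the same tokens in a different order end up with the same document vector.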

Now, we can compute the similarity between any pair of documents.

In [19]:
docs[0].similarity(docs[1])

0.9952307992204985

And we can rank documents by their similarity to a document of our choice.

In [20]:
def sort_by_similarity(docs, i):
    """Compute the vector similarity between document i and all other documents in docs.

    Returns a list of (similarity, document number) tuples,
    sorted from most to least similar."""
    similarities=[(docs[i].similarity(other), j) for j, other in enumerate(docs) if i!=j]
    similarities.sort(reverse=True)
    return similarities

In [26]:
similar_docs = sort_by_similarity(docs, 0)
similar_docs[:10]

[(0.996878820265896, 221),
 (0.9965588484378802, 68),
 (0.9965578522480503, 468),
 (0.9960479610648957, 261),
 (0.995800366872086, 461),
 (0.9957089517806117, 429),
 (0.9956411627722322, 100),
 (0.9955323612894019, 287),
 (0.9955282445262172, 271),
 (0.9955176536024245, 428)]
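
Only the ten best matches are inspected here, so a full sort is not strictly needed. As a possible alternative, the standard library's `heapq.nlargest` returns the k largest items directly. The helper `top_k_similar` and the toy tuples below are hypothetical and only illustrate the idea.

```python
import heapq

def top_k_similar(similarities, k):
    """Return the k largest (similarity, document number) tuples, best first."""
    # Tuples compare element by element, so the similarity decides the order
    return heapq.nlargest(k, similarities)

# Made-up (similarity, document number) tuples
toy_similarities = [(0.91, 3), (0.99, 7), (0.42, 1), (0.95, 5)]

top_two = top_k_similar(toy_similarities, 2)
print(top_two)
```

For a few hundred documents the difference is negligible; this mainly matters when ranking against a much larger collection.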

Let's look at the first 100 tokens of document 0

In [27]:
docs[0][:100]

She passed the water test, mate, she's not a sea ghost.
Tai, the ghosts have gone, it's been weeks.
The lagoon is finally running clear. You whipped Ragnar's butt.
I did, didn't I?
Chloe, this is Natasha.
She's spending the summer here as my lab assistant.
  Do you want to go diving tomorrow?
  Can I see the device?
We check it every day, make sure the green light's on.

In [28]:
similar_docs[0]

(0.996878820265896, 221)

And the beginning of the most similar document (the cell below prints up to 1000 tokens):

In [29]:
docnum=similar_docs[0][1]
docs[docnum][:1000]

Vincent, can you hear me?
Please, Vincent.
It's not enough he goes drinking all round the town,
now the whole neighbourhood has to listen to his screaming.
He's very ill, Madame Vernet.
Look at this, even worse than his usual rubbish.
What's it supposed to be?
It was found behind the wall in an attic in France.
It's genuine...it's a Van Gogh...
Why bring it to me?
Because it's obviously a message...
and you can see who it's for.
Can't say I understand it.
You're not supposed to understand it, Prime Minister.
You're supposed to deliver it.
Cell 426.
The Doctor? Do you mean Dr Song?
Give me that. Seriously, just give it to me. I'm entitled to phone calls.
Doctor?
No, and neither are you. Where is he?
'You're phoning the time vortex, it doesn't always work.
'But the TARDIS is smart, she's re routed the call.'
Talk quickly. This connection will last less than a minute.
Dr Song.
Are you finished with that?
You're new here, aren't you?
First day.
Then I'm very sorry.
Stay exactly where you a

... the first 100 tokens of the second most similar document

In [30]:
docnum=similar_docs[1][1]
docs[docnum][:100]

  Will it be me, Uncle?
  Yes, it's going to be you.
I only wish I could go in your place, Idris.
Nah, I don't, cos it's really going to hurt.
It's starting.
What will happen?
Oh. Um, er, Nephew will drain
your mind and your soul from your body and leave your body empty.
  I'm scared!
  I expect so, dear. But soon you'll have a new

Finally, let's look at the first 100 tokens of one of the *least* similar documents (the second to last in the ranking)

In [31]:
docnum=similar_docs[-2][1]
docs[docnum][:100]

Attention all humans...
  Will somebody get him a tissue?
Boo!
Za ba!
Hmm?
Oh! Oh?
Aargh!
Hey!
Yay!
Argh!
Hey!
Huh?
Ah ha!
Hello!
Hmmm!
Ha ha!
Bah!
  Argh!
  Ha ha ha ha ha ha!
Weee! Woo hoo! Woo hoo!
Whee! Huh? Oh...!
Wahey! Ha ha!
Ba doing,

We can now use these vectors instead of the actual words in any classification or clustering task.

Since many programmes are part of a series, let's see if we can predict the series using the document vectors. 

Let's first build a dataset containing only the n most common programmes

In [32]:
n=5
Counter(doc_labels).most_common(n)

[('Newsn', 15), ('BBC_N', 15), ('EastE', 12), ('Docto', 11), ('Weake', 10)]

In [33]:
select_labels=[programme for programme, frequency in Counter(doc_labels).most_common(n)]

In [34]:
cl_docs=[]
cl_labels=[]
for i, label in enumerate(doc_labels):
    if label in select_labels:
        cl_docs.append(docs[i].vector)
        cl_labels.append(label)

In [35]:
cl_docs

[array([-2.69896165e-02,  1.53131813e-01, -1.33663729e-01, -7.37608671e-02,
         6.36853650e-02, -2.06807461e-02,  4.51411493e-03, -9.31164920e-02,
        -1.43433129e-02,  2.20053267e+00, -1.83704391e-01,  3.56338881e-02,
         9.57479998e-02, -3.16022113e-02, -1.35628000e-01, -8.46068859e-02,
        -8.10736716e-02,  9.63173747e-01, -1.58656418e-01, -2.70170011e-02,
         2.07313821e-02, -3.16821933e-02, -2.14550197e-02, -2.67721117e-02,
         5.32116694e-03,  3.50165367e-02, -8.73820260e-02, -3.14923711e-02,
         4.58393842e-02, -5.90042546e-02, -3.58132608e-02,  9.02599692e-02,
        -2.71498784e-02,  5.83959967e-02,  7.46400878e-02, -4.76663336e-02,
         7.46128894e-03,  1.70130115e-02, -6.64383546e-02, -8.59879926e-02,
        -8.31035804e-03,  7.72152171e-02,  3.02234199e-02, -6.08552098e-02,
         5.20839617e-02,  4.17471603e-02, -1.24576867e-01, -3.51445638e-02,
        -5.12586767e-03,  7.23268697e-03, -5.30469194e-02,  6.21188208e-02,
        -2.7

In [36]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
cl_knb=KNeighborsClassifier(n_neighbors=1)

In [37]:
cl_knb.fit(cl_docs, cl_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')
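
The printed parameters show `n_neighbors=1` with the Minkowski metric and `p=2`, which is Euclidean distance. Conceptually, such a classifier gives a new document the label of the single closest training vector. The sketch below spells out that idea in plain Python; `predict_1nn`, the 2-dimensional vectors and the labels are made up for illustration, and scikit-learn's own implementation is more elaborate (it may use tree-based search, for example).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the training vector that is closest to query."""
    distances = [euclidean(vec, query) for vec in train_vectors]
    nearest = distances.index(min(distances))
    return train_labels[nearest]

# Hypothetical 2-dimensional document vectors and programme labels
toy_vectors = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
toy_labels = ["news", "news", "drama", "drama"]

predicted = predict_1nn(toy_vectors, toy_labels, [4.8, 5.1])
print(predicted)
```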

In [38]:
scores = cross_val_score(cl_knb, cl_docs, cl_labels, cv=5)

In [39]:
scores

array([0.92857143, 0.92307692, 1.        , 1.        , 0.91666667])
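
Each value is the accuracy on one of the five folds. `cross_val_score` fits fresh copies of the classifier on each training split, so these scores do not depend on the earlier call to `fit`. A small sketch for summarising them, with the fold accuracies copied from the output above (the variable names are assumptions):

```python
import statistics

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.92857143, 0.92307692, 1.0, 1.0, 0.91666667]

mean_accuracy = statistics.mean(fold_scores)
spread = statistics.stdev(fold_scores)  # sample standard deviation

print("accuracy: {:.3f} +/- {:.3f}".format(mean_accuracy, spread))
```

The mean accuracy is roughly 0.95. With 63 documents spread over five folds, each fold holds only about a dozen documents, so a single misclassification changes a fold's accuracy by several percentage points; the spread between folds should be read with that in mind.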