## word vectors

Code from Allison Parrish's [word vector tutorial](https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb) with modifications by kathy wu.


In [192]:
import math

In [45]:
import json

In [193]:
with open("xkcd.json") as f:
    color_data = json.load(f)

In [194]:
def hex_to_int(s):
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)

In [195]:
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

The `meanv` function takes a list of vectors and returns their component-wise mean (average):

In [23]:
def meanv(coords):
    # assumes every item in coords has same length as item 0
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
meanv([[0, 1], [2, 2], [4, 3]])

[2.0, 2.0]
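
The `closest` function in the next section calls a `distance` function that never appears in this notebook. A minimal sketch, assuming plain Euclidean (straight-line) distance between two vectors of the same length, which would also explain the `import math` at the top:

```python
import math

def distance(coord1, coord2):
    # Assumption: Euclidean distance; the notebook does not show its own definition.
    # zip() stops at the shorter input, so both vectors should have the same length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coord1, coord2)))

# distance between two made-up 2-d points
print(distance([10, 1], [5, 2]))
```

With RGB triples this treats each color as a point in 3-d space, so "close" colors are the ones with similar red, green, and blue values.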

### this function finds the n closest items

In [196]:
def closest(space, coord, n=10):
    # return the n keys in space whose vectors are nearest to coord
    # (distance() must be defined before calling this)
    return sorted(space.keys(),
                  key=lambda x: distance(coord, space[x]))[:n]

In [198]:
closest(colors, colors['blue'])

['blue',
 'vibrant blue',
 'electric blue',
 'azul',
 'blue blue',
 'vivid blue',
 'bright blue',
 'cerulean blue',
 'rich blue',
 'true blue']

#### (Doing bad digital humanities)

## averaging body colors of alien species in popular sci-fi


In [36]:
import spacy
nlp = spacy.load('en_core_web_md')

In [181]:
starwars = nlp(open("starwars.txt").read())
# use word.lower_ to normalize case; multi-word color names never match a single token
starwars_word_colors = [colors[word.lower_] for word in starwars if word.lower_ in colors]
avg_color1 = meanv(starwars_word_colors)

In [202]:
print(avg_color1)
print("----")

print("Colors of aliens in Star Wars")
starwars_colors = closest(colors, avg_color1)
print(starwars_colors)

[110.29268292682927, 95.59756097560975, 92.1951219512195]
----
Colors of aliens in Star Wars
['greyish brown', 'dark taupe', 'dirty purple', 'grey brown', 'slate grey', 'gunmetal', 'dark mauve', 'dull brown', 'brownish grey', 'brownish purple']


In [182]:
stargate = nlp(open("stargate.txt").read())
# use word.lower_ to normalize case
stargate_word_colors = [colors[word.lower_] for word in stargate if word.lower_ in colors]
avg_color2 = meanv(stargate_word_colors)

In [203]:
print(avg_color2)
print("----")

print("Colors of aliens in Stargate")
stargate_colors = closest(colors, avg_color2)
print(stargate_colors)

[151.23333333333332, 111.63333333333334, 72.7]
----
Colors of aliens in Stargate
['mocha', 'dirt', 'brownish', 'dull brown', 'earth', 'puce', 'coffee', 'cocoa', 'tan brown', 'dark taupe']


In [183]:
doctorwho = nlp(open("doctorwho.txt").read())
# use word.lower_ to normalize case
doctorwho_word_colors = [colors[word.lower_] for word in doctorwho if word.lower_ in colors]
avg_color3 = meanv(doctorwho_word_colors)

In [204]:
print(avg_color3)
print("----")

print("Colors of aliens in Doctor Who")
doctorwho_colors = closest(colors, avg_color3)
print(doctorwho_colors)

[146.6530612244898, 108.16326530612245, 86.12244897959184]
----
Colors of aliens in Doctor Who
['brownish', 'mocha', 'dull brown', 'brownish grey', 'dirt', 'grey brown', 'dark taupe', 'greyish brown', 'puce', 'cocoa']


## adding colors to sentences


In [210]:
print("aliens of doctor who")
print("----")

print("On another planet, where people are " + doctorwho_colors[0]+".")
print("How different they seem, with their " + doctorwho_colors[1] + " complexion.")
print("And their " + doctorwho_colors[2] + " eyes,")
print("and their " + doctorwho_colors[3] + " hair.")
print("Watch as they eat their " + doctorwho_colors[4] + " food")
print("Watch as they drive their " + doctorwho_colors[5] + " cars")

aliens of doctor who
----
On another planet, where people are brownish.
How different they seem, with their mocha complexion.
And their dull brown eyes,
and their brownish grey hair.
Watch as they eat their dirt food
Watch as they drive their grey brown cars


In [209]:
print("aliens of starwars")
print("----")

print("On another planet, where people are " + starwars_colors[0]+".")
print("How different they seem, with their " + starwars_colors[1] + " complexion.")
print("And their " + starwars_colors[2] + " eyes,")
print("and their " + starwars_colors[3] + " hair.")
print("Watch as they eat their " + starwars_colors[4] + " food")
print("Watch as they drive their " + starwars_colors[5] + " cars")

aliens of starwars
----
On another planet, where people are greyish brown.
How different they seem, with their dark taupe complexion.
And their dirty purple eyes,
and their grey brown hair.
Watch as they eat their slate grey food
Watch as they drive their gunmetal cars


In [208]:
print("aliens of stargate")
print("----")

print("On another planet, where people are " + stargate_colors[0]+".")
print("How different they seem, with their " + stargate_colors[1] + " complexion.")
print("And their " + stargate_colors[2] + " eyes,")
print("and their " + stargate_colors[3] + " hair.")
print("Watch as they eat their " + stargate_colors[4] + " food")
print("Watch as they drive their " + stargate_colors[5] + " cars")

aliens of stargate
----
On another planet, where people are mocha.
How different they seem, with their dirt complexion.
And their brownish eyes,
and their dull brown hair.
Watch as they eat their earth food
Watch as they drive their puce cars
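
The three cells above repeat the same six-line template with a different list each time. A small helper could generate it instead; this is only a sketch, and `alien_poem` is a hypothetical name that does not appear in the original notebook:

```python
def alien_poem(title, color_names):
    # Hypothetical helper: fills the template with the first six color names.
    # Expects color_names to contain at least six entries.
    c = color_names
    return "\n".join([
        title,
        "----",
        "On another planet, where people are " + c[0] + ".",
        "How different they seem, with their " + c[1] + " complexion.",
        "And their " + c[2] + " eyes,",
        "and their " + c[3] + " hair.",
        "Watch as they eat their " + c[4] + " food",
        "Watch as they drive their " + c[5] + " cars",
    ])

# Example input copied from the first six Stargate colors printed earlier.
example_colors = ['mocha', 'dirt', 'brownish', 'dull brown', 'earth', 'puce']
print(alien_poem("aliens of stargate", example_colors))
```

In the notebook itself the call would be something like `print(alien_poem("aliens of doctor who", doctorwho_colors))`, since `closest` already returns ten names by default.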


## word vectors + spacy + zuckerberg rambles

Using spaCy to analyze the text of the Facebook hearing.

In [50]:
import spacy

In [72]:
# load the language model and parse the input text
nlp = spacy.load('en_core_web_md')
doc = nlp(open("facebook.txt").read())

In [83]:
# the unique alphabetic words in the text file (set() drops duplicates, so the order is arbitrary)
tokens = list(set([w.text for w in doc if w.is_alpha]))

In [75]:
def vec(s):
    return nlp.vocab[s].vector

from tutorial notes:

### Cosine similarity and finding closest neighbors

The cell below defines a function `cosine()`, which returns the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors. Cosine similarity is another way of measuring how similar two vectors are: it compares the directions they point in rather than their lengths, which makes it better suited to high-dimensional spaces. [See the Encyclopedia of Distances for more information and even more ways of determining vector similarity.](http://www.uco.es/users/ma1fegan/Comunes/asignaturas/vision/Encyclopedia-of-distances-2009.pdf)

(You'll need to install `numpy` to get this to work. If you haven't already: `pip install numpy`. Use `sudo` if you need to and make sure you've upgraded to the most recent version of `pip` with `sudo pip install --upgrade pip`.)

In [199]:
import numpy as np
from numpy import dot
from numpy.linalg import norm

# cosine similarity
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1, v2) / (norm(v1) * norm(v2))
    else:
        return 0.0
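
To get a feel for what `cosine()` returns, here is the same formula written without numpy and run on a few tiny vectors. This is only an illustrative sketch; `cosine_plain` is a made-up name and the vectors are invented:

```python
import math

def cosine_plain(v1, v2):
    # Same formula as cosine(): dot product divided by the product of the lengths.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 > 0 and norm2 > 0:
        return dot / (norm1 * norm2)
    # a zero-length vector has no direction, so fall back to 0.0
    return 0.0

print(cosine_plain([1, 0], [0, 1]))    # perpendicular vectors
print(cosine_plain([1, 2], [10, 20]))  # same direction, different lengths
print(cosine_plain([1, 0], [-1, 0]))   # opposite directions
print(cosine_plain([0, 0], [3, 4]))    # zero vector
```

Perpendicular vectors score 0, vectors pointing the same way score close to 1 no matter how long they are, and opposite vectors score -1.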

The following cell defines a function that sorts a list of tokens by cosine similarity to a given vector and returns the `n` most similar tokens, most similar first.

In [81]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(token_list,
                  key=lambda x: cosine(vec_to_check, vec(x)),
                  reverse=True)[:n]

In [98]:
#finding similar words
spacy_closest(tokens, vec("laughter"))

['LAUGHTER',
 'happiness',
 'displeasure',
 'encouragement',
 'conversation',
 'feelings',
 'pleasure',
 'fear',
 'sudden',
 'voices']

In [100]:
tokens

['intervention',
 'strictly',
 'legally',
 'Excuse',
 'What',
 'includes',
 'sit',
 'harvested',
 'past',
 'As',
 'HAWAII',
 'revelations',
 'concrete',
 'pace',
 'algorithm',
 'gave',
 'misusing',
 'R',
 'fire',
 'revealing',
 'LINDSEY',
 'Delaware',
 'malware',
 'Kids',
 'spell',
 'cause',
 'bowl',
 'unprecedented',
 'scheduling',
 'reach',
 'Transportation',
 'voices',
 'imagine',
 'blindness',
 'endless',
 'absolutely',
 'interviewed',
 'recess',
 'patchwork',
 'regards',
 'establishing',
 'embrace',
 'altogether',
 'displeasure',
 'Service',
 'Picture',
 'leave',
 'affecting',
 'profitability',
 'COO',
 'yours',
 'Those',
 'expectation',
 'personalization',
 'okay',
 'approved',
 'CRUZ',
 'board',
 'out',
 'face',
 'journalist',
 'More',
 'settings',
 'purposes',
 'previously',
 'comparison',
 'softball',
 'Myanmar',
 'defined',
 'experiences',
 'messaged',
 'personal',
 'interrupting',
 'MO',
 'lots',
 'receiving',
 'Final',
 'officials',
 'standards',
 'employ',
 'abuses',
 'tak

In [106]:
def sentvec(s):
    sent = nlp(s)
    return meanv([w.vector for w in sent])

In [107]:
sentences = list(doc.sents)

In [108]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    return sorted(space,
                  key=lambda x: cosine(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]
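
Both `sentvec` and `spacy_closest_sent` treat a sentence as the average of its word vectors and then rank sentences by cosine similarity. The sketch below shows that idea without loading a spaCy model. Everything in it is made up for illustration: the names `toy_vectors`, `toy_sentvec`, `toy_cosine`, and `toy_closest_sent` are hypothetical, and the 2-d numbers are invented.

```python
import math

# Invented 2-d "word vectors", for illustration only.
toy_vectors = {
    "is": [1.0, 0.0],
    "it": [1.0, 0.2],
    "safe": [0.8, 0.6],
    "ha": [0.0, 1.0],
    "laughter": [0.1, 1.0],
}

def toy_sentvec(words):
    # A sentence vector is the component-wise average of its word vectors.
    vecs = [toy_vectors[w] for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def toy_cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 > 0 and norm2 > 0:
        return dot / (norm1 * norm2)
    return 0.0

def toy_closest_sent(sentences, query_words, n=2):
    # Rank sentences by how similar their average vector is to the query's.
    query_vec = toy_sentvec(query_words)
    return sorted(sentences,
                  key=lambda s: toy_cosine(toy_sentvec(s), query_vec),
                  reverse=True)[:n]

toy_sentences = [["is", "it", "safe"], ["ha", "ha"], ["laughter"], ["is", "it"]]
print(toy_closest_sent(toy_sentences, ["laughter"]))
```

Because every word counts equally in the average, short sentences made of words near the query tend to rank highest.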

## "that's not what i asked."

In [201]:
print("an abridged reading of april 11, 2018")
print("---")
print(" ")

for sent3 in spacy_closest_sent(sentences, "But is it safe?"):
    print(sent3.text)
    print(" ")
    
for sent1 in spacy_closest_sent(sentences, "LAUGHTER"):
    print(sent1.text)
    print(" ")
for sent2 in spacy_closest_sent(sentences, "That's not what I asked."):
    print(sent2.text)
    print(" ")


an abridged reading of april 11, 2018
---
 
Is that it?

 
So is this — is then a question of Facebook is about feeling safe, or are users actually safe?
 
But is it ever really gone?

 
Is that right?

 
but now it isn't with them.
 
How long is that?

 
Isn't that correct?

 
Isn't that correct?

 
So that just isn't a feature that's even available anymore.
 
So it doesn't.
 
(LAUGHTER)
...
 
(LAUGHTER)
...
 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
(LAUGHTER)

 
— that's not what I'm asking.
 
That's what I want to see.
 
I think that that's very important.
 
But here's the concern that I have.
 
That's all I need.
 
I think that that's an important conversation to have.
 
I figured that would be the answer.
 
I think common sense would tell us that that's pretty difficult.
 
It's not that I expect anything that I say here today — to necessarily change people's view.

 
If I own that data, I know it's being breached.
 
