# Orthographic variation in word embeddings

Word embeddings are a neural network's representation of the relationships between words, something that we have [obsessed about](http://www.lab41.org/anything2vec/) for a while at Lab41. A network that has seen,
say, 20 billion words in its lifetime will have a lot to say about them. We are going to talk about why word embeddings know more about _style_ than you do--at least when it comes to linguistic style.

A word embedding takes the form of a giant matrix; what's neat about that is that every row of the matrix represents a word as a vector of real numbers. This by itself is nothing new: so-called "sparse" vector representations of words have been around for decades. "One-hot" vectors, which represent a word like _hello_ by taking a bunch of zeros and changing exactly one of them to 1, at a position predetermined to correspond to the word "hello," are very useful, but they have an interesting property: every word is equally distant from every other word. From _hello_ to _goodbye_ to _hamburger_ to _justice_, every word in a one-hot vector space is exactly $\sqrt 2$ units away from every other word.

<table border="0"><tr><td><img src="pictures/hellogoodbye.png" /></td><td><img src=pictures/hamburgerjustice.png /></td></tr></table>

These are all the same distance from each other!
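
If you want to check the $\sqrt 2$ claim yourself, here is a minimal sketch with a four-word toy vocabulary. The vocabulary, its ordering, and the helper name `one_hot_distance` are made up for illustration; they are not part of any embedding library.

```python
import numpy as np

# Toy vocabulary; the words and their ordering are arbitrary choices for illustration.
vocab = ["hello", "goodbye", "hamburger", "justice"]

# Each row of the identity matrix is the one-hot vector for one word.
one_hot = np.eye(len(vocab))

def one_hot_distance(word_a, word_b):
    """Euclidean distance between the one-hot vectors of two words."""
    a = one_hot[vocab.index(word_a)]
    b = one_hot[vocab.index(word_b)]
    return np.linalg.norm(a - b)

# Distance for every unordered pair of distinct words.
distances = {
    (a, b): one_hot_distance(a, b)
    for i, a in enumerate(vocab)
    for b in vocab[i + 1:]
}
for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 4))
```

Two distinct one-hot vectors differ in exactly two positions, each by 1, so the distance is always $\sqrt{1^2 + 1^2} = \sqrt 2$ no matter which words you pick.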

GloVe, word2vec, and related models are made up of *dense* rather than sparse vectors, meaning they use a lot of nonzero values, so there are more dimensions on which any two words can be differentiated. With these vectors you can compare words in nifty ways. Computing the cosine of the angle between two vectors gives the cosine similarity score, which maxes out at 1 if the vectors have the same direction and gets lower as the angle between the vectors increases:

$$\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
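
As a rough sketch, the formula translates to NumPy almost word for word. The function name `cosine_similarity` is my own choice here, and the vectors are made up.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y (undefined for zero vectors)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
right_angle = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # opposite directions
print(same_direction, right_angle, opposite)
```

Note that the score ignores vector length: `[1, 2]` and `[2, 4]` count as identical because they point the same way.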

## Riding English
Using this metric you can choose an arbitrary vector and find the words closest to it, on whatever dimensions you want. This vector could be a word, or something you calculate yourself. As an example, let's look up the 10 nearest neighbors to "English" in a 200-dimensional GloVe embedding trained on 27 billion words from Twitter (available from the GloVe creators; the full code I used to process the pretrained vectors is available [here](https://github.com/pcallier/glove-twitter)):

In [116]:
import os
import urllib2
import pprint
import zipfile
import numpy as np
from miniglove import Glove

In [117]:
myglove = Glove()
vocab_size=300000
glove_folder = "/home/pcallier/data/datasets/glove-twitter/downloads"
glove_path = os.path.join(glove_folder, "glove.twitter.27B.200d.txt")
# Download if necessary (big)
if not os.path.isfile(glove_path):
    print("Downloading pretrained GloVe")
    if not os.path.isdir(glove_folder):
        os.makedirs(glove_folder)
    glove_zip_path=os.path.join(glove_folder,"glove.zip")
    with open(glove_zip_path, "wb") as glove_zip:
        glove_url = "http://nlp.stanford.edu/data/glove.twitter.27B.zip"
        glove_data = urllib2.urlopen(glove_url)
        glove_zip.write(glove_data.read())
    print("Downloaded.\nExtracting...")
    glove_zip = zipfile.ZipFile(glove_zip_path, "r")
    glove_zip.extractall(glove_folder)
    glove_zip.close()
    print("Extracted")
myglove.load_glove(glove_path, max_entries=vocab_size, gz=False)
near_words = [i[0] for i in myglove.get_nearest('english')]
for wd in near_words:
    print wd

Bad entry 38522 , GloVe components: " [u'\x85'] "
Stopping:  300000 300000
english
spanish
language
math
french
speaking
class
arabic
exam
essay


This example shows some of the diversity in relationships that a word embedding model can represent. The relationship
between _English_ and _Spanish_ is different from the relationship between _English_ and _language_. It also points to one of the shortcomings of the model: the _English_ that is related to _math_, _class_, and _exam_ is a different word sense from the _English_ that is related to _language_ and _speaking_, yet both senses share a single vector.
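
For the curious, a nearest-neighbor lookup like the one above can be sketched in a few lines. This is not necessarily how `miniglove` implements `get_nearest`; the function `nearest_neighbors` and the 3-dimensional toy vectors are assumptions made up for illustration.

```python
import numpy as np

def nearest_neighbors(query, words, matrix, k=3):
    """Return the k (word, cosine similarity) pairs closest to the query vector."""
    # Normalize the rows and the query so a dot product equals cosine similarity.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    sims = unit_rows.dot(unit_query)
    # Indices of the k largest similarities, best first.
    top = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in top]

# Made-up 3-dimensional vectors, purely for illustration.
toy_words = ["english", "spanish", "math", "hamburger"]
toy_matrix = np.array([
    [0.9, 0.8, 0.1],
    [0.8, 0.9, 0.0],
    [0.7, 0.1, 0.6],
    [0.0, 0.1, 0.9],
])
neighbors = nearest_neighbors(toy_matrix[0], toy_words, toy_matrix)
print(neighbors)
```

As in the real output, the query word comes back first with a similarity of 1, since every vector is its own nearest neighbor.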

A lot of ink has already been spilled on how and why GloVe and word2vec encode semantic and syntactic content of words. What I'd like to point out is the extent to which they also encode
*stylistic* relationships, even across semantically and syntactically diverse contexts.

Twitter data provides a nice playground for this because it plays host to many different styles and varieties of English. One variation in written style that is super fruitful is the one between "working" and "workin", as in "I'm working hard on this overgrown book report" vs "I'm workin hard not to crack up right now."

As it turns out, this difference can be represented as... a difference--i.e. subtraction:

$$v_{in'} \approx v_{workin}-v_{working}$$

And you can add $v_{in'}$ (I do delight in subtle word2vec humor) to other _-ing_ words to get the _-in_ forms back:

$$v_{goin} \approx v_{in'}+v_{going}$$
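
To make the arithmetic concrete, here is a toy sketch with hand-built vectors. The three dimensions (two standing in for "meaning," one for "informality") are an assumption for illustration; real GloVe dimensions are not this tidy.

```python
import numpy as np

# Hypothetical toy vectors: dimensions 0-1 stand in for "meaning",
# dimension 2 for "informality".
toy = {
    "working": np.array([1.0, 0.0, 0.0]),
    "workin":  np.array([1.0, 0.0, 1.0]),
    "going":   np.array([0.0, 1.0, 0.0]),
    "goin":    np.array([0.0, 1.0, 1.0]),
}

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

v_in = toy["workin"] - toy["working"]   # the style offset
target = v_in + toy["going"]            # "going", nudged toward informality

# Rank every word by similarity to the shifted vector, best first.
ranked = sorted(toy, key=lambda w: cosine(toy[w], target), reverse=True)
print(ranked)
```

In this idealized setup the offset is exactly the informality axis, so _goin_ lands on the target and _going_ comes second. With real embeddings the match is only approximate, which is why the equations above use $\approx$.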

Evidence from the Twitter GloVe results, showing the nearest neighbors of $v_{in'}+v_{going}$:

In [118]:
myg=myglove
pprint.pprint(myg[myg['workin']-myg['working']+myg['going']][:5])

[(u'goin', 0.88831662665994482),
 (u'comin', 0.77613290568496363),
 (u'gonna', 0.75843861554999137),
 (u'going', 0.75744115550051516),
 (u'gone', 0.75716465566920121)]


Interestingly, this trick works for more than just taking the 'g' off of words. Turns out it can make almost any word more laid-back and relatable:

In [119]:
pprint.pprint(myg[myg['workin']-myg['working']+myg['better']][:5])

[(u'better', 0.74960560902773565),
 (u'betta', 0.74875750122722051),
 (u'gon', 0.69962990087166721),
 (u'gettin', 0.69176624707648626),
 (u'aint', 0.68820858269487417)]


The second-nearest neighbor of $v_{workin}-v_{working}+v_{better}$ is _betta_. If you're a linguist, this is cool because _-er_=>_-a_ and _-ing_=>_-in_ are different processes. This kind of vector math, in a model that knows nothing about English grammar and phonology, demonstrates that, in a sense, _betta_:_better_::_workin_:_working_.  It makes you wonder if GloVe knows anything about English that linguists and grammarians *haven't* figured out yet.

"But Patrick," you say, "'better' is still at the top of the nearest neighbors list. How can I even be sure that you have gone very far at all?  Is 'betta' normally one of the nearest neighbors of 'better'? You are the most boring liar ever."

Thanks for writing in! While I yield the latter point, you can just look and see that 'better' has some *really* boring nearest neighbors when you don't mess with it:

In [120]:
pprint.pprint(myg[myg["better"]][:5])

[(u'better', 1.0000000000000002),
 (u'than', 0.8628326837717033),
 (u'think', 0.83140772783299211),
 (u'but', 0.8271108152387191),
 (u'should', 0.82638419111297556)]


Similar results hold for _never_, where the second neighbor of $v_{in'} + v_{never}$ is _neva_, as in "neva say never."

In [121]:
pprint.pprint(myg[myg['workin']-myg['working']+myg['never']][:5])

[(u'never', 0.76820980718345255),
 (u'neva', 0.75148017760913943),
 (u'aint', 0.71765577611700071),
 (u'gon', 0.68844101454285755),
 (u'kno', 0.67143423520239087)]


Finally, check out how this style transformation affects a word like _hello_, whose nearest neighbors ordinarily sound like someone's grandma inviting you in for pie:

In [122]:
pprint.pprint(myg[myg['hello']])

[(u'hello', 0.99999999999999967),
 (u'hey', 0.79004704935816128),
 (u'hi', 0.76735115675033327),
 (u'dear', 0.71815507273413681),
 (u'welcome', 0.69861532706782348),
 (u'morning', 0.68387741688549786),
 (u'goodbye', 0.65207510130197299),
 (u'thanks', 0.63994165327909924),
 (u'thank', 0.6307863015211308),
 (u'yes', 0.6267277876940357)]


After adding the vector $v_{workin}-v_{working}$, _hello_'s neighbors begin to sound more like someone you need to block on Tinder:

In [123]:
pprint.pprint(myg[myg['workin']-myg['working']+myg['hello']][:7])

[(u'hello', 0.69317891346005667),
 (u'heyy', 0.59410393324734401),
 (u'hey', 0.58777962383237869),
 (u'wassup', 0.56454447734209456),
 (u'heeey', 0.54898272580485286),
 (u'helloo', 0.54826528148085985),
 (u'heey', 0.53740413715101087)]


If this is beginning to look more and more like translation, well--it is a _lot_ like translation. There has been a lot of research on using dense vector models for machine translation, but there are easy examples you can pull out with the Twitter GloVe vectors. Here we take the difference between _jaune_ 'yellow' and _yellow_ (this gives us a vector that means something like "is in French") and add it to _dog_:

In [124]:
pprint.pprint(myg[myg['jaune']-myg['yellow']+myg['dog']])

[(u'jaune', 0.59636133712321482),
 (u'chien', 0.53978101024027159),
 (u'dog', 0.52195876478283232),
 (u'lapin', 0.45008049145266815),
 (u'vert', 0.42810379873608584),
 (u'singe', 0.4163984796954972),
 (u'pet', 0.41430019738637597),
 (u'pr\xe9f\xe9r\xe9', 0.41164841171416805),
 (u'poussin', 0.40766867307001764),
 (u'flipper', 0.4066168237385443)]


We get back _chien_ (albeit in the second slot), which means, of course, 'dog.' The nasty trick, though, is that _chien_ also comes out when we try it with _cat_. Like, the *same* word in the *same* slot:

In [125]:
pprint.pprint(myg[myg['jaune']-myg['yellow']+myg['cat']])

[(u'jaune', 0.60934649332900459),
 (u'chien', 0.50129263425099924),
 (u'lapin', 0.47679089647578032),
 (u'cat', 0.47173153372284493),
 (u'poussin', 0.46016845213828739),
 (u'vert', 0.44008617657357857),
 (u'singe', 0.43984166150506449),
 (u'noir', 0.42799578578268194),
 (u'rire', 0.42102450976252132),
 (u't\xeate', 0.42084313238627324)]


<img src="pictures/dog-and-cat.jpg" width=300 />

Unfortunately, _chat_ is more than just a French word for "Make Pat sneeze and cry," so its neighborhood is a little different than you might expect:

In [126]:
pprint.pprint(myg[myg['chat']])

[(u'chat', 1.0000000000000002),
 (u'skype', 0.76015260467855716),
 (u'fb', 0.6996539906125141),
 (u'chats', 0.64665352211403826),
 (u'bbm', 0.63690212719119388),
 (u'twitter', 0.63678889555221718),
 (u'dm', 0.62655024532587489),
 (u'whatsapp', 0.6259950508742973),
 (u'kik', 0.61840193558276491),
 (u'facebook', 0.61717381818479344)]


Word sense disambiguation in dense vector models: also an area of active research.

Machine translation and content representation are two of the main commercial applications for dense vector models of words and documents. But word vectors are exceptionally versatile. The methods that word2vec and GloVe use to encode them inevitably end up capturing all sorts of cool information, like the stylistic permutations we played with today. As algorithmic chat [re-enters the consumer landscape](http://www.forbes.com/sites/parmyolson/2016/06/07/yahoo-chat-bots-kik-weather-news-monkeys/), getting style right in NLP may soon be just as important as nailing the content, and dense vector representations may be part of it.

In [127]:
pprint.pprint(myg[myg['king']-myg['man']+myg['woman']])

[(u'king', 0.69214193817476943),
 (u'queen', 0.65601796378599242),
 (u'woman', 0.59939595640420673),
 (u'prince', 0.55449537284821027),
 (u'princess', 0.54145840558796821),
 (u'royal', 0.53234426923594003),
 (u'mother', 0.50829932162282887),
 (u'elizabeth', 0.50362984236174746),
 (u'women', 0.47577372704148152),
 (u'lion', 0.47300935148718831)]
