
# Word embeddings as a service
<br/>
<br/>
#### François Scharffe ([@lechatpito](http://www.twitter.com/lechatpito)), [3Top Inc.](http://www.3top.com)
<br/>
<br/>
#### PyData NYC 2015

## Outline of the talk

* What is 3Top?
* What are word embeddings?
* How to implement a simple recommendation system for 3Top categories?

## Rank Anything, Rank Everything
* 3Top is a ranking and recommendation platform
* Rankings convey more information than star ratings
    * Who cares about 3 stars or less? I just want the best stuff
    * I'd rather trust my friends than read through reviews
    * If I have more than 3 items to rank, I can probably use a more precise category


* Not yet launched, but the site is up

Let's take a look at http://www.3top.com



<img src="imgs/1980s Movies.jpg">

* Places
<img src="imgs/Gyms for Students near Lower East Side.png">
http://www.3top.com/category/1138/gyms-for-students-near-lower-east-side

 * Movies
<img src="imgs/Movies About Wall Street.png">
http://www.3top.com/category/142/movies-about-wall-street

* Anything really
<img src="imgs/Foods Named After People.png">
http://www.3top.com/category/765/foods-named-after-people

## Data & knowledge engineering at 3Top
Building a solid data engineering architecture before launching the site.
* Natural language processing pipeline
    * Parsing categories
    * Detecting named entities, locations
* A large knowledge graph backed by an ontology
* An itemization pipeline 
    * matching free text items to entities in the knowledge graph

<img src="imgs/Knowledge Graph.jpg">

## Category recommendation

How are we going to build a simple recommendation system without a significant number of users, categories or rankings?

Note the impressive figures:
 * Number of Users: 316
 * Number of Rankings: 2123
 * Number of Categories: 1316

        Wow ! ;)
 
Feel free to add a few rankings: http://www.3top.com

### Word embeddings ?
Who hasn't heard about word2vec?

Word embeddings represent words in a high-dimensional space in such a way that words appearing in the same contexts are close to each other in that space.
* The dimensionality of the space is not that high, typically around 100 dimensions.
* Word embeddings are a language modeling method, more precisely a distributed vector representation of words.

Compared to Bag of words:
 * Dimensionality is low and constant
 * Depending on the technique, partially learned models give partially good results (5)

Compared to topic modeling:
 * Better granularity, the base element is a word
 * Phrase vectors can also be learned

What is it good at:
 * Modeling similarity between words
       sim(tomato, beefsteak) < sim(apple, tomato) < sim(pear, apple)
 * Allows algebraic operations on word vectors
       v(Paris) - v(France) ~= v(Berlin) - v(Germany)
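
To make the two properties above concrete, here is a minimal sketch in plain Python. The 2-d vectors are made up for illustration (they are not taken from any trained model), and `analogy` uses the simple additive combination of vectors, not the multiplicative objective behind gensim's `most_similar_cosmul`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors (axis 0 ~ "royalty", axis 1 ~ "gender"); NOT real embeddings
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-1.0, 0.1],
}

def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy(["king", "woman"], ["man"], toy))  # queen
```

With real embeddings the nearest neighbour is only approximately the expected word, which is why the examples in the next section return ranked lists with scores.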

## Examples

Examples here are using a small GloVe model (100d, 400k vocab, trained on Wikipedia and Gigaword).

In [3]:
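from gensim.models import Word2Vec
# load_word2vec_format is a classmethod, so no instance is needed.
# GloVe text files lack the "<vocabulary size> <dimensions>" header line this loader expects; prepend it first.
model = Word2Vec.load_word2vec_format("./glove.6B.100d.txt")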

In [4]:
model.most_similar("python", topn=10)

[(u'monty', 0.6886237859725952),
 (u'php', 0.586538553237915),
 (u'perl', 0.5784406661987305),
 (u'cleese', 0.5446674823760986),
 (u'flipper', 0.5112984776496887),
 (u'ruby', 0.5066927671432495),
 (u'spamalot', 0.505638837814331),
 (u'javascript', 0.5030568838119507),
 (u'reticulated', 0.4983375668525696),
 (u'monkey', 0.49764129519462585)]

In [5]:
model.most_similar_cosmul(positive=["python", "programming"], topn=5)

[(u'perl', 0.5658619999885559),
 (u'scripting', 0.559501588344574),
 (u'scripts', 0.5469149351119995),
 (u'php', 0.5461974740028381),
 (u'language', 0.5350533127784729)]

In [6]:
model.most_similar_cosmul(positive=["python", "venomous"], topn=5)

[(u'scorpion', 0.5413044095039368),
 (u'snakes', 0.5263831615447998),
 (u'snake', 0.5222328901290894),
 (u'spider', 0.5214570164680481),
 (u'marsupial', 0.517005205154419)]

The classical example: 
                
                v(king) - v(man) + v(woman) -> v(queen)

In [7]:
model.most_similar_cosmul(positive=["king", "woman"], negative=["man"])

[(u'queen', 0.8964556455612183),
 (u'monarch', 0.8495977520942688),
 (u'throne', 0.8447030782699585),
 (u'princess', 0.8371668457984924),
 (u'elizabeth', 0.835679292678833),
 (u'daughter', 0.8348594903945923),
 (u'prince', 0.8230059742927551),
 (u'mother', 0.8154449462890625),
 (u'margaret', 0.8147734999656677),
 (u'father', 0.8100854158401489)]

## Training a model

* Very easy once you have a clean corpus
* Great tools in Python
  * Tutorial on training a model using Gensim: http://rare-technologies.com/word2vec-tutorial/
  * Radim Řehůřek gave a talk last year at PyData Berlin about optimizations in Cython: https://www.youtube.com/watch?v=vU4TlwZzTfU
  * For GloVe: https://github.com/maciejkula/glove-python

Gensim word2vec implementation specifics:
 * Training time: ~8 hours on a machine with 8 processors/8 threads to learn 600 dimensions on a <em>1.9B-word</em> corpus
 * Memory requirements depend on the vocabulary size and on the number of dimensions:
       3 matrices * 4 bytes (float) * |dimensions| * |vocabulary|
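
As a quick sanity check, the formula above can be turned into a small helper. This is only a sketch of the arithmetic: the function name and the example sizes (100 dimensions, 400,000-word vocabulary) are chosen for illustration.

```python
def training_memory_bytes(dimensions, vocabulary_size, matrices=3, bytes_per_float=4):
    """Estimate from the formula above: matrices * bytes per float * dimensions * vocabulary."""
    return matrices * bytes_per_float * dimensions * vocabulary_size

# Illustrative sizes: 100 dimensions and a 400,000-word vocabulary
estimate = training_memory_bytes(dimensions=100, vocabulary_size=400000)
print("{:.2f} GB".format(estimate / 1e9))  # 0.48 GB
```

The estimate grows linearly in both parameters, so tripling the dimensions or the vocabulary triples the memory needed.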

The [GloVe implementation in Python](https://github.com/maciejkula/glove-python/) takes half the time, but its memory usage is quadratic instead of linear. Check the pull requests for memory optimizations.

A good thing to know: a bigger training set does improve the quality of the model, <em>even for specialized tasks</em>.

As a consequence, you probably want to use a huge corpus. Good models are available.

Building your own model can be useful when you want to find out about the properties of your corpus, or when you want to compare different corpora. For example, you could study the evolution of language in a newspaper over different periods of time.

## Finding a model

From: https://github.com/3Top/word2vec-api/

| Model file | Number of dimensions | Corpus (size)| Vocabulary size | Author | Architecture | Training Algorithm | Context window - size | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| [Google News](GoogleNews-vectors-negative300.bin.gz) | 300 |Google News (100B) | 3M | Google | word2vec | negative sampling | BoW - ~5| [link](http://code.google.com/p/word2vec/) |
| [Freebase IDs](https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | [link](http://code.google.com/p/word2vec/) |
| [Freebase names](https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing) | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW - ~10 | [link](http://code.google.com/p/word2vec/) |
| [Wikipedia+Gigaword 5](http://nlp.stanford.edu/data/glove.6B.zip) | 50/100/200/300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | [link](http://nlp.stanford.edu/projects/glove/) |
| [Common Crawl 42B](http://nlp.stanford.edu/data/glove.42B.300d.zip) | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | [link](http://nlp.stanford.edu/projects/glove/) |
| [Common Crawl 840B](http://nlp.stanford.edu/data/glove.840B.300d.zip) | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | [link](http://nlp.stanford.edu/projects/glove/) |
| [Twitter (2B Tweets)](http://www-nlp.stanford.edu/data/glove.twitter.27B.zip) | 25/50/100/200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | [link](http://nlp.stanford.edu/projects/glove/) |
| [Wikipedia dependency](http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | [link](https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/) |
| [DBPedia vectors](https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent) | 1000 | Wikipedia (?) | ? | wiki2vec | word2vec | word2vec, skip-gram | BoW, 10 | [link](https://github.com/idio/wiki2vec#prebuilt-models) |

## Building a recommendation engine for 3Top categories

By combining word vectors, we build category vectors.



In [8]:
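import numpy as np
from gensim import matutils

def build_category_vector(category):
    postags = get_postags(category)  # hypothetical helper: the POS-tagging step is elided in the talk
    tagset = set()  # words seen in this category
    vectors = []
    for tag in postags:
        # Only keep meaningful words (nouns, adjectives, gerunds, numbers)
        if tag.tagged in ['NN', 'NNS', 'JJ', 'NNP', 'NNPS', 'NNDBN', 'VBG', 'CD']:
            try:
                v = word2vec(tag.tagValue)  # Get the word vector
                if v.any():
                    vectors.append(v)
            except Exception:
                logger.debug("Word not found in corpus: %s" % tag.tagValue)
            tagset.add(tag.tagValue)
    if vectors:
        # Average the word vectors and normalize the result to unit length
        return matutils.unitvec(np.array(vectors).mean(axis=0))
    else:
        # No known word: return zeros (np.empty would return uninitialized memory)
        return np.zeros(300)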

We store those vectors in a category space, and at page load time compute the most similar categories for a given category.

Now let us look at the similarity method.

    sim(c1, c2) = cos(v(c1), v(c2))
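
The averaging and similarity steps can be tried end to end on toy data. The sketch below assumes made-up 3-d word vectors and skips POS tagging entirely; `category_vector` and `sim` are illustrative names, not the 3Top code.

```python
import numpy as np

# Made-up 3-d word vectors standing in for a real embedding model
word_vectors = {
    "italian": np.array([0.9, 0.1, 0.0]),
    "thai": np.array([0.8, 0.2, 0.1]),
    "restaurants": np.array([0.7, 0.3, 0.0]),
    "movies": np.array([0.0, 0.2, 0.9]),
    "pixar": np.array([0.1, 0.1, 0.8]),
}

def category_vector(words):
    """Average the vectors of the known words and scale the result to unit length."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        return np.zeros(3)
    mean = np.mean(known, axis=0)
    return mean / np.linalg.norm(mean)

def sim(c1, c2):
    """Cosine similarity; for unit-length vectors it is just the dot product."""
    return float(np.dot(c1, c2))

italian = category_vector(["italian", "restaurants"])
thai = category_vector(["thai", "restaurants"])
pixar = category_vector(["pixar", "movies"])

print(sim(italian, thai) > sim(italian, pixar))  # True
```

Normalizing once when the category vector is built keeps the per-request work down to dot products.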

In [21]:
cs = CategorySimilarity()
# print(Category.objects.all().count())
category = Category.objects.get(category=u"Blue-collar beers that come in a can")
_ = [print(c) for c in cs.most_similar_categories(category, n=5)]

DEBUG Category space size (as found in the cache): 1125


Belgian Trappist Beers 
Belgian Beer Cafe in NYC
Dark and Stormy Cocktail in NYC
Brands of Ginger Beer
Pink Drinks


In [10]:
category = Category.objects.get(category=u"Italian Restaurants in NYC.")
_ = [print(c) for c in cs.most_similar_categories(category, n=5)]

Italian Restaurants in NY
Restaurants in Nyc
NYC Mexican Restaurants 
Romanian Restaurants in NYC
Thai Restaurants in NYC


In [11]:
category = Category.objects.get(category=u"Coen Brothers Movies")
_ = [print(c) for c in cs.most_similar_categories(category, n=10)]

Quentin Tarantino Movies
Martin Scorsese Films.
Movies Starring Creepy Children
Tim Burton Movies
Movies Starring Sean Penn
Pixar Movies
Godfather Movies
Berlin Indie Movie Theaters
Kubrick Movies
Harry Potter Movies


Our recommendation system uses the Common Crawl 42B words, 300 dimensions model trained with GloVe.

It takes around 6GB in memory ... and this is a problem:

We run a Django server and 8 celery workers on an EC2 T2 Micro... That would be a lot of memory
for that poor instance. 

## A word embedding service

* We split the word embedding model out into a separate service
* Simple Flask server with a few primitives:
    * `curl "http://127.0.0.1:5000/word2vec/similarity?w1=Python&w2=Java"`
    * `curl "http://127.0.0.1:5000/word2vec/n_similarity?ws1=Python&ws1=programming&ws2=Java"`
    * `curl "http://127.0.0.1:5000/word2vec/model?word=Python"`
    * `curl "http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=queen&negative=man"`
* Easy to set up: 
    * `python word2vec-api --model path/to/the/model [--host host --port 1234]`
* Get it at https://github.com/3Top/word2vec-api
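
From Python, the same endpoints can be queried with the standard library. This is a minimal client sketch: the endpoint paths and parameter names come from the curl examples above, while `build_url` and `fetch` are illustrative helpers, and the response is returned as raw text because its format is not described here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:5000/word2vec"  # host and port from the curl examples

def build_url(endpoint, params):
    """Build a query URL; params is a list of (name, value) pairs so repeated names survive."""
    return "{}/{}?{}".format(BASE_URL, endpoint, urlencode(params))

def fetch(url):
    """Send the request and return the response body as text (needs a running server)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")

url = build_url("most_similar", [("positive", "king"), ("positive", "woman"), ("negative", "man")])
print(url)
# http://127.0.0.1:5000/word2vec/most_similar?positive=king&positive=woman&negative=man
```

Passing the parameters as a list of pairs, not a dict, matters here because `positive` and `ws1` may appear several times in one query.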


<img src="imgs/3Top Production Infrastructure 11-2014.jpg" />

## Caching the vector space

As the number of categories increases, we do not want to hit the database every time a recommendation is needed (that is, on every page access). The size of the category vectors is actually significant:

In [12]:
print(np.fromstring(category.vector).nbytes)

2400


In [13]:
print(u"... and for {} categories the space becomes large (~{}MB)".format(len(cs.category_space.syn0), cs.category_space.syn0.nbytes/1000000))

... and for 1125 categories the space becomes large (~2MB)


We store vectors with the category object in MySQL, using a base64 encoding of the numpy object. Let's look at it:

In [14]:
print(category._vector[:1000] + "...")

oNl7j9l8hr/2FoHhEbSyv+GALkJ6JVU/rxm5pueouL80aF72QLiivzr0z1WKaZM/i3M5FvqQmD+hDmIs7fitv+lL6cLFSbM/lSwwkBxv0D/sA+FgTxmhP5+lJZPGVLE/q8pBj07ukT9OceKjxl2jv4s4cA7RJIc/JxVUF8afnr/RQcXciUyCP+M4N3mbtrS/Ngwo85uUor/+4vqargCyP7YnHbTSv3W/MHjrh6iHsr/1vkzSmI+yP+bsS7E0B5u/JJ5iJAUNoT//xo0IJ3inP5/BwCfgWZ2/Q2r8q9Fuir/KdAOAr0OQPwzGTnUXU5i/9uQD77+xuD/1QKbEaDWjvyqfSePd7XE/3RLqJXiOrz8ZyEDICd2UP2beFLiqPZy/rIb+8sFgqr+ILyc3/5yoP5pL25IahpQ/4WpgCeuNZ7/ley+Tl9ygP+knz2odUHo/yBSdc5+Klj+GLgrafftvP2yiB76KBY2/z0RqB+1ri7+THdXBVVKvPzwZ2X+2HaA/oOThsHeidL/O7w8+bummv8Z8XCqeYas/WzQpioG6qr+JaauGrie2P7+8NmNBN5o/0Ji6XFJMpj/xAtoHvg9PPyr3OOBXVaG/M2aCbN8dv79pANKgDzNrPy4XXBNVi4S/WBuYYjWZlD8T/W3jLbOJPy3xHNTzarQ/MoOWx7aZtz/RDMwbryievwA5kQgazaO/3Ep0jVo1rD+ns3oJ3iWUv4TlEPcAnJy/dHNcwygjnr/cMGYNKPbCP5E06afPWa6/mUHAC+8mjj+NwgyjQFB5v6ffLvduuai/kBntVvsdpb8Yg3KzY/qev9r5son3VJg/h06aD0/IuD8NMHm7jGViv7o8zQzPd5U/esP4Ax6BrL8TOZuX+qGpP1WHNPzdQH0/7HXRMAqXmr9G8pkwjbenv3RvQppal7i/E5jWmLXSp792VpPxJeOjPyEKhEhl324/1E00QnHdvr9Mg0Fohd+cP6UAj0X5R5g/2umwTF42

In [15]:
# a property method takes care of the decoding
def get_vector(self):
    return base64.b64decode(self._vector)

def set_vector(self, value):
    encoded = base64.b64encode(value)
    self._vector = encoded

vector = property(get_vector, set_vector)

In [16]:
np.fromstring(category.vector)[:100]

array([-0.01098032, -0.07306015,  0.00129067, -0.09632728, -0.03656199,
        0.01895729,  0.02399054, -0.05853978,  0.07534443,  0.25678171,
        0.03339623,  0.06769982,  0.01751063, -0.03782483,  0.01130069,
       -0.02990636,  0.00893505, -0.08091137, -0.03629005,  0.07032291,
       -0.00530989, -0.07238248,  0.07250362, -0.02639468,  0.03330246,
        0.04583857, -0.02866316, -0.01290668,  0.0158832 , -0.02375447,
        0.09646225, -0.03751686,  0.00437724,  0.06163383,  0.02037444,
       -0.02757899, -0.05151945,  0.04807279,  0.02004282, -0.00287529,
        0.03293298,  0.00642406,  0.02201318,  0.0039041 , -0.01417073,
       -0.01338945,  0.06117504,  0.03147669, -0.00503775, -0.04474968,
        0.05347914, -0.05220418,  0.086543  ,  0.02560141,  0.04355104,
        0.00094792, -0.03385424, -0.12154957,  0.00332025, -0.01003138,
        0.02011569,  0.01254879,  0.07975696,  0.09218924, -0.02945207,
       -0.03867418,  0.05509456, -0.0196757 , -0.02793886, -0.02
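
The round trip between a numpy vector and its base64 text form can be checked in isolation. This is a sketch that assumes 300 float64 values per vector (matching the 2400 bytes printed earlier); `encode_vector` and `decode_vector` are illustrative helpers, and `np.frombuffer` is used as the current replacement for `np.fromstring` on binary data.

```python
import base64
import numpy as np

def encode_vector(vector):
    """Serialize a vector to base64 text suitable for a text column."""
    raw = np.asarray(vector, dtype=np.float64).tobytes()
    return base64.b64encode(raw).decode("ascii")

def decode_vector(text):
    """Rebuild the float64 vector from its base64 text form."""
    return np.frombuffer(base64.b64decode(text), dtype=np.float64)

original = np.linspace(-1.0, 1.0, 300)  # stand-in for a 300-d category vector
encoded = encode_vector(original)
restored = decode_vector(encoded)

print(restored.nbytes, len(encoded))  # 2400 3200
```

Base64 adds a third to the stored size (2400 bytes become 3200 characters), which is the price of keeping the vector in a text column.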

* In order to avoid issuing a few thousand SQL queries every time a page is loaded we use Memcache to store the category space. 
* As the space is larger than 1 MB (the default Memcached item size limit), we store each vector under its own key (the category ID). All keys share a common prefix.
* Here we directly store the numpy vectors through the Gensim API
* A separate key is used for the vocabulary indexes.

In [17]:
def set_space_cache(space):
    # Vocabulary and index go under their own keys, then one key per vector
    sim.set(VOC, space.vocab)
    sim.set(IDX, space.index2word)
    sim.set_many({"{0}-{1}".format(VEC, i): space.syn0[i] for i in range(len(space.vocab))})

This also makes it possible to add a category vector to the space without rebuilding it: we simply store the new vector in the cache and update the cached space indexes.

In [18]:
def add_last_vector_to_space_cache(space):
    # Refresh the indexes, then store only the newly added (last) vector
    sim.set(VOC, space.vocab)
    sim.set(IDX, space.index2word)
    sim.set("{}-{}".format(VEC, len(space.vocab)-1), space.syn0[-1])
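
The key layout can be exercised without a Memcached server. The sketch below uses a plain dict subclass as a stand-in for the cache client (only `set`, `set_many` and `get` are assumed), and the key prefixes and function names are hypothetical.

```python
class FakeCache(dict):
    """In-memory stand-in for a cache client exposing set / set_many / get."""

    def set(self, key, value):
        self[key] = value

    def set_many(self, mapping):
        self.update(mapping)

VEC, IDX = "vec", "idx"  # hypothetical key prefixes

def cache_space(cache, names, vectors):
    """Store the index under one key and each vector under its own prefixed key."""
    cache.set(IDX, list(names))
    cache.set_many({"{}-{}".format(VEC, i): v for i, v in enumerate(vectors)})

def append_to_cache(cache, name, vector):
    """Add one vector without rewriting the ones already cached."""
    names = cache.get(IDX) + [name]
    cache.set("{}-{}".format(VEC, len(names) - 1), vector)
    cache.set(IDX, names)

def load_space(cache):
    """Rebuild the (names, vectors) pair from the cache."""
    names = cache.get(IDX)
    return names, [cache.get("{}-{}".format(VEC, i)) for i in range(len(names))]

cache = FakeCache()
cache_space(cache, ["beers", "movies"], [[1.0, 0.0], [0.0, 1.0]])
append_to_cache(cache, "gyms", [1.0, 1.0])
print(load_space(cache)[0])  # ['beers', 'movies', 'gyms']
```

Writing the vector before the index means a reader never sees an index entry whose vector is still missing from the cache.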

### Updates
* Each process gets its own copy of the vector space.
* Whenever a process adds a category, it also updates the space. 
* Django signals are used to tell other processes to reload the space from cache.

## Work in progress
* We are about to add a few 100k generated categories
* The category space will become large in memory: 8 workers \* 2.4 kB \* 100000 categories = 1.9 GB 
* Including entity vectors would improve results for names, places, etc.
* Training a specialized corpus using categories scraped all over the web
* Train a phrase2Vec model on these categories

## Resources
### Tutorials & Applications
* Instagram: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji
* Word embeddings and RNNs: http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
* Word2vec gensim tutorial: http://rare-technologies.com/word2vec-tutorial/
* Clothing style search: http://multithreaded.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
* In digital humanities: http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html
* In digital humanities, application to gender studies: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html

### Academic Papers
* Le, Quoc V., and Tomas Mikolov. "Distributed representations of sentences and documents." arXiv preprint arXiv:1405.4053 (2014).
* Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global vectors for word representation." (2014).
* Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 2. 2014.
* Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method." arXiv preprint arXiv:1402.3722 (2014).

Thank you!