# Machine Reading: Advanced Topics in Word Vectors
## Part III. Pre-trained Models and Extended Vector Algorithms (50 mins)

This is a 4-part series of Jupyter notebooks on the topic of word embeddings, originally created for a workshop during the Digital Humanities 2018 Conference in Mexico City. Each part comprises a mix of theoretical explanations and fill-in-the-blanks activities of increasing difficulty.

Instructors:
- Eun Seo Jo, <a href="mailto:eunseo@stanford.edu">*eunseo@stanford.edu*</a>, Stanford University
- Javier de la Rosa, <a href="mailto:versae@stanford.edu">*versae@stanford.edu*</a>, Stanford University
- Scott Bailey, <a href="mailto:scottbailey@stanford.edu">*scottbailey@stanford.edu*</a>, Stanford University

This unit will explore the various flavors of word embeddings tailored to word meaning, sentences, paragraphs, or entire documents. We will give an overview of pre-trained embeddings, including where they can be found, how to use them, and what they're effective for.

- 0:00 - 0:15 [Out-of-vocabulary words and pre-trained embeddings](#1.-Out-of-vocabulary-words-and-pre-trained-embeddings)
- 0:15 - 0:25 [Activity] Bias in pre-trained historical word embeddings
- 0:25 - 0:40 [Extending Vector Algorithms: Text Classification](#2.-Extending-Vector-Algorithms:-Text-Classification)
- 0:40 - 0:50 [Activity] Authorship attribution

---

### 0. Setting Up 

Before we get started, let's go ahead and set up our notebook. We will start by importing a few Python libraries that we will use throughout the workshop.

#### What are these libraries?

1. NumPy: A package for scientific computing in Python. For us, NumPy is useful for vector operations.
2. NLTK: An easy-to-use Python package for text processing (lemmatization, tokenization, POS-tagging, etc.)
3. gensim: Implementations of word2vec and other NLP algorithms
4. fastText: A very fast word embeddings library

We will be working with a few sample texts using NLTK's corpus package and other corpora downloaded for that purpose.

In [None]:
%%capture --no-stderr
import sys
!pip install Cython  # needed to compile fasttext
!pip install -r requirements.txt
!python -m nltk.downloader all
print("All done!", file=sys.stderr)

If all went well, we should now be able to import the following packages into our workspace.

In [None]:
import io
import pickle
import os

import numpy as np
import nltk
import gensim
import fasttext
from tqdm import tqdm



---



### 1. Out-of-vocabulary words and pre-trained embeddings

So far, we've seen the power of word embeddings and how easy they are to obtain from your own corpus. In most cases, however, we do not have access to millions of unlabelled documents in our target domain that would allow for training good embeddings from scratch. Training word embeddings is very resource intensive and may require relatively large corpora for the geometric relationships to be semantically meaningful. Even then, regular word-oriented embeddings have some issues. To illustrate this, consider the following code, which trains on the text of _Alice in Wonderland_.

In [None]:
print(nltk.corpus.gutenberg.raw('carroll-alice.txt')[0:200])

We'll use the handy `.words()` method in NLTK to access just the words.

In [None]:
words = list(map(str.lower, nltk.corpus.gutenberg.words('carroll-alice.txt')))
words[:10]

And now let's train a very simple `word2vec` model.

In [None]:
documents = [words]
model = gensim.models.Word2Vec(
    documents,
    size=25,
    window=5,
    min_count=1,
    workers=10
)
model.train(documents, total_examples=len(documents), epochs=10)
model.wv['alice']

Regardless of whether this model is able to compute semantic similarities or not, word vectors have been computed. However, if you try to look for words that are not in the vocabulary you'll get an error.

In [None]:
try:
    model.wv['google']
except KeyError as e:
    print(e)

This is known as the Out-Of-Vocabulary (OOV) issue in Word2Vec and similar approaches.

Now, you may think, I could get synonyms of the OOV words using something like WordNet, and then look for those words' embeddings. And while that might work in some cases, in others it is not that simple. Two such cases are new-ish words like `facebook` and `google`, or proper names of places, like `Teotihuacan`.

One way to solve this issue is to use a different measure of atomicity in your algorithm. In Word2Vec-like approaches, including GloVe, the word is the minimum unit, and as such, when looking for words that are not in the vocabulary there is certainly no vector information for it. In contrast, a different approach could train for sub-word units, for example 3-grams. While not guaranteeing that all words will be covered, a good amount of them might be, due to the fact that it's more likely for all possible trigrams to be included in a large enough corpus than all possible words. This is the approach taken by Facebook's fastText.
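
To make the sub-word idea more concrete, here is a minimal sketch in plain Python. It is not fastText's actual implementation: the helper names `char_ngrams` and `ngram_coverage`, the toy vocabulary, and the `<`/`>` word-boundary markers are assumptions made for illustration. It splits words into character trigrams and checks how many trigrams of an unseen word already occur in a small vocabulary.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_coverage(word, vocabulary, n=3):
    """Fraction of the n-grams of `word` that also occur in some vocabulary word."""
    known = {gram for entry in vocabulary for gram in char_ngrams(entry, n)}
    grams = char_ngrams(word, n)
    if not grams:  # word too short to produce any n-gram
        return 0.0
    return sum(gram in known for gram in grams) / len(grams)


vocabulary = ["good", "goose", "bugle", "eagle"]  # toy vocabulary (made up)
print(char_ngrams("google"))
print(ngram_coverage("google", vocabulary))
```

Even though `google` is not in the toy vocabulary, most of its trigrams are, which is what lets a sub-word model say something about it.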

In [None]:
from gensim.models import FastText

fasttext_model = FastText(documents, size=25, min_count=1)
fasttext_model.wv['alice']

In [None]:
fasttext_model.wv['google']

fastText also distributes word vectors pre-trained on [Common Crawl](http://commoncrawl.org/) and [Wikipedia](https://www.wikipedia.org/) for 157 languages. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. They come in binary and text formats: the binary format includes a model ready to use, while the text format only contains the actual vectors associated with each word in the training set.

Gensim is soon to include a special method to load in these fastText embeddings (not working as of 3.4.0). Just take into account that only the `.bin` format allows for OOV word vectors. For the regular and usually lighter `.vec` format you would still need to load in the vectors, save a binary Gensim model, and load it back in.

Let's see a couple of examples of using `.vec` files from the Somali and the Simple English Wikipedia corpora available for fastText. These files are loaded in using the regular Gensim `KeyedVectors` word2vec model (`.load_word2vec_format()`), so vectors for out-of-vocabulary words cannot be computed.

In Somali, the word `xiddigta` (meaning *the star*) should have its own vector available since the word is present in the corpus.

In [None]:
filename = 'wiki.so.vec'
if not os.path.isfile(filename):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename

somali_model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=False)
somali_model.wv['xiddigta'][:25]  # it means 'the star' in Somali

But the word `ciyaalsuuq` (meaning *unruly youth*) raises a `KeyError` in the word vectors dictionary.

In [None]:
try:
    somali_model.wv['ciyaalsuuq'][:25]
except KeyError as e:
    print(e)

And the same thing occurs in English: while words like `star` are certainly available, words such as `bibliopole` (meaning *a person who buys and sells books, especially rare ones*) are not.

In [None]:
# This might take a while
filename = 'wiki.simple.zip'
if (not os.path.isfile(filename)
        and not os.path.isfile(filename.replace('.zip', '.vec'))
        and not os.path.isfile(filename.replace('.zip', '.bin'))):
    !echo "Downloading $filename"
    !curl --progress-bar -Lo $filename https://s3-us-west-1.amazonaws.com/fasttext-vectors/$filename
if (os.path.isfile(filename)
        and (not os.path.isfile(filename.replace('.zip', '.vec'))
                 or not os.path.isfile(filename.replace('.zip', '.bin')))):
    !unzip $filename

In [None]:
english_model = gensim.models.KeyedVectors.load_word2vec_format(
    filename.replace('.zip', '.vec'), binary=False)

In [None]:
english_model.wv['star'][:25] 

In [None]:
try:
    english_model.wv['bibliopole'][:25] 
except KeyError as e:
    print(e)

The fastText English embeddings are also included in Gensim's `downloader` feature, **without** the sub-word information needed to build vectors for out-of-vocabulary words.

In [None]:
import gensim.downloader as pretrained

pretrained.info()['models']['fasttext-wiki-news-subwords-300']

In [None]:
fasttext_english = pretrained.load('fasttext-wiki-news-subwords-300')

In [None]:
fasttext_english.wv['star'][:25]

By contrast, when using the `.bin` file and loading it in Gensim using the special `FastText.load_fasttext_format()` method, out-of-vocabulary words suddenly have embeddings available.

In [None]:
english_oov = FastText.load_fasttext_format('wiki.simple')

In [None]:
english_oov.wv['bibliopole'][:25]

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
<strong>Activity</strong>
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Could you find a word in the `english_oov` model for which there is no embedding? And in the `english_model` model? Would the embedding for `ciyaalsuuq` be available in any of these models?
<br>
<em>
<!--
<strong>Hint</strong>: Use the numpy functions
-->
</em>
</p>
</div>

In [None]:
# Enter your solution here

As we've seen, words that do not exist in English, such as the Somali `ciyaalsuuq`, also become available, so this is a feature we must use with great care.

In [None]:
english_oov.wv['ciyaalsuuq'][:25]

Unsurprisingly, if we check what other words are similar in English to the Somali word `ciyaalsuuq` we get a bunch of words that are not really from English. To be completely fair, the Simple English corpus might not be as reliable as the full English one for finding semantic similarities.

In [None]:
english_model.similar_by_vector(english_oov.wv['ciyaalsuuq'])

#### fastText package

While Gensim provides a way to create fastText embeddings with sub-word information and even load fastText pre-trained word embeddings, there is also a standalone tool, `fasttext`, and an accompanying Python library to do the same. Unfortunately, the Python bindings haven't been updated and it seems to be broken when trying to load in binary models generated with newer versions of the fastText command line tool.

In [None]:
import fasttext

try:
    fasttext.load_model("wiki.simple.bin")
except Exception as e:
    print(e)

Other functionalities, such as building embeddings from your own corpus using either Skip-gram or CBOW, are available, as well as methods to create text classifiers very easily.

In [None]:
fasttext.skipgram(nltk.corpus.gutenberg.abspath('carroll-alice.txt'), 'alice_model')

In [None]:
fasttext.cbow(nltk.corpus.gutenberg.abspath('carroll-alice.txt'), 'alice_model')

In [None]:
text = """
__label__pos This is some wonderful positive text.
__label__neg This is some awful negative text.
"""
with open('sentiment_train.txt', 'w') as f:
    f.write(text.strip())
test = """
__label__pos This is wonderful.
__label__neg This is awful.
"""
with open('sentiment_test.txt', 'w') as f:
    f.write(test.strip())

classifier = fasttext.supervised('sentiment_train.txt', 'sentiment_model')
result = classifier.test('sentiment_test.txt')
print('Precision:', result.precision)
print('Recall:', result.recall)
print('Number of examples:', result.nexamples)
print('"This is wonderfully awful":',
      classifier.predict_proba(['This is wonderfully awful'])[0][0])

#### Pre-trained vectors

The list of pre-trained word vectors grows every day, and while it's impractical to enumerate them all, some of them are listed below.

- English
  - fastText. Embeddings (300 dimensions) by Facebook [with](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M-subword.vec.zip) and [without sub-word information](https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip) trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens), and on [Common Crawl (600B tokens)](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip).
  - [Google News](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/). Embeddings (300 dimensions) by Google trained on Google News (100B) using word2vec with negative sampling and context window BoW with size ~5 ([link](http://code.google.com/p/word2vec/)). There are also fastText versions from 2016 with and without sub-word information for Wikipedia and with no sub-word information for Common Crawl.
  - [LexVec](https://github.com/alexandres/lexvec). Embeddings (300 dimensions) trained using LexVec with and without sub-word information trained on Common Crawl, and on Wikipedia 2015 + NewsCrawl.
  - Freebase [IDs](https://docs.google.com/file/d/0B7XkCwpI5KDYaDBDQm1tZGNDRHc/edit?usp=sharing) and [names](https://docs.google.com/file/d/0B7XkCwpI5KDYeFdmcVltWkhtbmM/edit?usp=sharing). Embeddings (1000 dimensions) by Google trained on Google News (100B) using word2vec, skip-gram and context window BoW with size ~10 ([link](http://code.google.com/p/word2vec/)).
  - [Wikipedia 2014 + Gigaword 5](http://nlp.stanford.edu/data/glove.6B.zip). Embeddings (50, 100, 200, and 300 dimensions) by GloVe trained on Wikipedia data from 2014 and newswire data from the mid 1990s through 2011 using GloVe with AdaGrad and context window 10+10 ([link](http://nlp.stanford.edu/projects/glove/)).
  - Common Crawl [42B](http://nlp.stanford.edu/data/glove.42B.300d.zip) and [840B](http://nlp.stanford.edu/data/glove.840B.300d.zip). Embeddings (300 dimensions) by GloVe trained on Common Crawl (42B and 840B) using GloVe and AdaGrad ([link](http://nlp.stanford.edu/projects/glove/)).
  - [Twitter (2B Tweets)](http://www-nlp.stanford.edu/data/glove.twitter.27B.zip). Embeddings (25, 50, 100, and 200 dimensions) by GloVe trained on Twitter (27B) using GloVe with AdaGrad ([link](http://nlp.stanford.edu/projects/glove/)).
  - [Wikipedia dependency](http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2). Embeddings (300 dimensions) by Levy & Goldberg trained on Wikipedia 2015 using a modified word2vec with syntactic dependencies as contexts ([link](https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/)).
  - [DBPedia vectors (wiki2vec)](https://github.com/idio/wiki2vec/raw/master/torrents/enwiki-gensim-word2vec-1000-nostem-10cbow.torrent). Embeddings (1000 dimensions) by Idio trained on Wikipedia 2013 using word2vec, skip-gram and context window BoW with size 10 ([link](https://github.com/idio/wiki2vec#prebuilt-models)).
  - [60 Wikipedia embeddings with 4 kinds of context](http://vsmlib.readthedocs.io/en/latest/tutorial/getting_vectors.html#). Embeddings (25, 50, 100, 250, and 500 dimensions) by Li, Liu et al. trained on Wikipedia using Skip-Gram, CBOW, and GloVe, in original and modified versions, with context window 2 ([link](http://vsmlib.readthedocs.io/en/latest/tutorial/getting_vectors.html#)).
- Multi-lingual
  - [fastText](https://fasttext.cc/docs/en/crawl-vectors.html). Embeddings for 157 languages trained using fastText on Wikipedia 2016 and Common Crawl using CBOW with position-weights, 300 dimensions, with character n-grams of length 5, a window of size 5 and 10 negatives. Both vectors and binary models for OOV are available. There is an old version of these embeddings trained only on Wikipedia 2016 for almost [300 languages](https://fasttext.cc/docs/en/pretrained-vectors.html).
  - [BPEemb](https://github.com/bheinzerling/bpemb). Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) on Wikipedia 2017 with sub-word information.
  - [Kyubyong's wordvectors](https://github.com/Kyubyong/wordvectors#pre-trained-models). Embeddings with and without sub-word information trained on Wikipedia dumps from 2017 for +30 languages.
  - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot#h.p_ID_98). Embeddings for more than 100 languages trained on their Wikipedias from 2013. Provides competitive performance with near state-of-art methods in English, Danish and Swedish.

There is even a tool, [`chakin`](https://github.com/chakki-works/chakin#supported-vectors), that makes it easy to download word vectors with and without sub-word information for 11 languages.

In [None]:
import chakin

chakin.search(lang='Japanese')

#### Historical Word Vectors

In the Humanities, despite the value of pre-trained word embeddings, we usually want to train our own models or to have access to models related to a specific time period of study. It might not be of much help to analyze 19th-century literature with word vectors trained on a Google News corpus, especially since the meanings of the words themselves have been shown to change over time.

There is, however, a collection of [historical word vectors](https://nlp.stanford.edu/projects/histwords/) made available by the Stanford NLP Group and others (special thanks to [Ryan Heuser](http://ryanheuser.org/)). The embeddings (300 dimensions) are generated using word2vec skip-gram with negative sampling and trained on Google N-Grams for English, English Fiction, French, German, and Simplified Chinese; on the Corpus of Historical American English (COHA); and on Eighteenth Century Collections Online (ECCO):
- English:
  - [All English](http://snap.stanford.edu/historical_embeddings/eng-all_sgns.zip) (1800s-1990s by decade)
  - [English Fiction](http://snap.stanford.edu/historical_embeddings/eng-fiction-all_sgns.zip) (1800s-1990s by decade)
  - [Genre-Balanced American English](http://snap.stanford.edu/historical_embeddings/coha-word_sgns.zip) (1830s-2000s by decade) (COHA)
  - [Genre-Balanced American English, word lemmas](http://snap.stanford.edu/historical_embeddings/coha-lemma_sgns.zip) (1830s-2000s) (COHA)
  - [ECCO](http://ryanheuser.org/data/word2vec.ECCO.skipgram_n=10.model.txt.gz). Eighteenth Century Collections Online (ECCO), “Literature and Language,” 1700-99, with 1.9 billion words and trained using word2vec with skip-gram size of 10 words
  - [ECCO20](https://archive.org/details/word-vectors-18c-word2vec-models-across-20-year-periods). ECCO split in twenty-year periods of 18C, with 150 million words each and trained using word2vec with skip-gram size of 10 words
  - [ECCO-TCP](http://ryanheuser.org/data/word2vec.ECCO-TCP.txt.zip). ECCO with 80 million words trained using skip-gram size of 5 words. Also available for [size of 10 words](http://ryanheuser.org/data/word2vec.ECCO-TCP.skipgram_n=10.txt.zip).
- Multi-lingual:
  - [French](http://snap.stanford.edu/historical_embeddings/fre-all_sgns.zip) (1800s-1990s by decade)
  - [German](http://snap.stanford.edu/historical_embeddings/ger-all_sgns.zip) (1800s-1990s by decade)
  - Simplified Chinese (1950s-1990s by decade)

Let's download and prepare some of these pre-trained word vectors.

In [None]:
# Downloading and preparing the pre-trained embeddings
pretrained.load('word2vec-google-news-300',
                return_path=True)  # return_path avoids loading the model into memory

for filename, dirname in (('eng-fiction-all_sgns.zip', 'fiction'),
                          ('coha-word_sgns.zip', 'coha')):
    if (not os.path.isfile(filename)
            and not os.path.isdir(dirname)):
        print(f'Downloading {filename}')
        !curl --progress-bar -Lo $filename http://snap.stanford.edu/historical_embeddings/$filename
    if (os.path.isfile(filename)
            and not os.path.isdir(dirname)):
        print(f'Uncompressing {filename}')
        !unzip -q -o $filename -d $dirname

In [None]:
for corpus, years in (('fiction', (1900, 1950)),  # range(1800, 1991, 10)
                      ('coha', (1900, 1950))):  # range(1810, 2001, 10)
    for year in tqdm(list(years), desc=f'Generating vector files - {corpus}'):
        if os.path.isfile(f'{corpus}/{year}.vec'):
            continue
        with open(f'{corpus}/{year}.vec', 'w') as vector_file:
            vectors = np.load(open(f'{corpus}/sgns/{year}-w.npy', 'rb'))
            words = pickle.load(open(f'{corpus}/sgns/{year}-vocab.pkl', 'rb'))
            vector_file.write("{} {}".format(*vectors.shape))
            for index, word in enumerate(words):
                vector = np.array2string(vectors[index],
                                         formatter={'float_kind':'{0:.9f}'.format})[1:-1]
                vector = vector.replace('\n', '')
                vector_file.write(f'\n{word} {vector}')

Now we have historical word embeddings for English Fiction and COHA for 1900 and 1950, available as `.vec` files as `<fiction|coha>/<year>.vec`. For example, `fiction/1900.vec`.
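
The `.vec` files follow the word2vec text layout: a first line with the number of rows and the number of dimensions, then one line per word with the word followed by its values. As a rough sketch of what a loader has to do with such a file, the hypothetical helper `read_vec` below parses a made-up two-word sample held in memory. It only illustrates the layout and is not how Gensim's `load_word2vec_format()` is implemented.

```python
import io


def read_vec(handle):
    """Parse word2vec text format: a 'count dimensions' header, then one word per line."""
    count, dimensions = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.split()  # tolerate repeated spaces between values
        if len(parts) != dimensions + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return count, dimensions, vectors


# Made-up sample using the same layout as `fiction/1900.vec`
sample = io.StringIO(
    "2 3\n"
    "she 0.100000000 0.200000000 0.300000000\n"
    "he 0.300000000 0.200000000 0.100000000"
)
count, dimensions, vectors = read_vec(sample)
print(count, dimensions, vectors["she"])
```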

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
<strong>Activity</strong>
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Word embeddings allow for analogy checking. For example, `man is to king as woman is to queen`, expressed as `man:king :: woman:queen`, is reflected in the vector representations of the words `man`, `king`, `woman`, `queen` in such a way that $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$. However, this can also highlight some biases in the specific corpora the model has been trained on. Using the pair `she-he` as a base, find the most similar female term for each term in the following list: doctor, captain, gallant, sheriff, engineer, scientist, author, surgeon, honorable, philosopher, warrior, architect, magician, liar, and coward.
<br>
When possible, compute the similarity between the expected term for female and the one for male. Use the Google News (2015), English Fiction (1900, 1950) and Genre-Balanced American English (1900, 1950) embeddings. For example, using the English Fiction 1850 embeddings, `he:gallant :: she:x` solves to `x=gentle`, and the similarity between gallant and gentle is `~0.418`. Can you spot the problem?
<br>
<em>
<strong>Hint</strong>: Use Gensim's `most_similar_cosmul()`/`most_similar()` and `similarity()` functions
</em>
</p>
</div>
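
Before turning to Gensim, it may help to see the analogy arithmetic on its own. The sketch below uses hand-made 2-dimensional vectors (entirely made up, chosen so that the analogy works exactly) and a hypothetical `analogy` helper in plain Python. Gensim's `most_similar()` differs in its details, so treat this only as an illustration of the idea.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Solve a:b :: c:x by ranking words against vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Made-up toy vectors, not taken from any trained model
toy = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
print(analogy(toy, "man", "king", "woman"))
```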

Solution (redacted):
```python
words = ...

def get_models():
    """Yields a tuple of 2 elements: (model name, model instance)"""
    yield ('Fiction 1900', ...)
    yield ('Fiction 1950', ...)
    yield ('COHA 1900', ...)
    yield ('COHA 1950', ...)
    yield ('Google News', ...)

for (model_name, model) in get_models():
    print(model_name)
    print(len(model_name) * '=')
    for ...:
        expected = ...
        similarity = ...
        print(f"he:{word:20} :: she:{expected:20}\tsimilarity = {similarity:.4}")
    print()
del model
```

Output (excerpt):
```
Fiction 1900
============
he:doctor               :: she:girl                	similarity = 0.1677
he:captain              :: she:major               	similarity = 0.4955
he:gallant              :: she:kindest             	similarity = 0.2594
...
```

In [None]:
# Enter your solution here

<div align="right"><small class="text-muted">*Double click here to see the full code and output solution*</small></div>

<!--
%%capture --no-stdout
words = "doctor, captain, gallant, sheriff, engineer, scientist, author, surgeon, honorable, philosopher, warrior, architect, magician, liar, coward"
words = words.split(', ')

def get_models():
    """Yields a tuple of 2 elements: (model name, model instance)"""
    yield ('Fiction 1900',
        gensim.models.KeyedVectors.load_word2vec_format('fiction/1900.vec', binary=False))
    yield ('Fiction 1950',
        gensim.models.KeyedVectors.load_word2vec_format('fiction/1950.vec', binary=False))
    yield ('COHA 1900',
        gensim.models.KeyedVectors.load_word2vec_format('coha/1900.vec', binary=False))
    yield ('COHA 1950',
        gensim.models.KeyedVectors.load_word2vec_format('coha/1950.vec', binary=False))
    yield ('Google News', pretrained.load('word2vec-google-news-300'))

for (model_name, model) in get_models():
    print(model_name)
    print(len(model_name) * '=')
    for word in words:
        expected = model.most_similar(
            positive=['she', word], negative=['he'])[0][0]
        similarity = model.similarity(expected, word)
        print(f"he:{word:20} :: she:{expected:20}\tsimilarity = {similarity:.4}")
    print()
del model
-->

---

### 2. Extending Vector Algorithms: Text Classification

#### Averaging vectors

We've seen that vectors for out of vocabulary words are obtained by splitting the word into its n-grams, getting the embedding for the n-grams, and then averaging the composition to produce the final word vector for the OOV word.
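
As a minimal sketch of that composition step, assuming character trigrams with `<`/`>` boundary markers and a made-up table of 2-dimensional n-gram vectors, the hypothetical helper `oov_vector` below averages whatever n-gram vectors are known for a word. It follows the description above and is not fastText's own code.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def oov_vector(word, ngram_vectors, n=3):
    """Average the vectors of the known n-grams of `word`; None if none is known."""
    known = [ngram_vectors[gram]
             for gram in char_ngrams(word, n) if gram in ngram_vectors]
    if not known:
        return None
    return [sum(values) / len(known) for values in zip(*known)]


# Made-up n-gram vectors, for illustration only
ngram_vectors = {"<go": [1.0, 0.0], "goo": [0.0, 1.0], "gle": [2.0, 2.0]}
print(oov_vector("google", ngram_vectors))
print(oov_vector("xyz", ngram_vectors))
```

A word whose n-grams are all unknown still gets no vector, which is why sub-word models reduce the OOV problem rather than remove it.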


What is best in the word2vec approach is that operations on the vectors approximately keep the characteristics of the words, so that joining (averaging) the vectors of the words in a sentence produces a vector that is likely to represent the general topic of the sentence.

Therefore, the same technique used for OOV words in fastText can also be used to produce embeddings for sentences, paragraphs and even entire documents, which makes them usable for text classification.

> The goal of text classification is to assign documents (such as emails, posts, text messages, product reviews, etc...) to one or multiple categories. Such categories can be review scores, spam v.s. non-spam, or the language in which the document was typed. Nowadays, the dominant approach to build such classifiers is machine learning, that is learning classification rules from examples. In order to build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).
-- [fastText documentation](https://fasttext.cc/docs/en/supervised-tutorial.html#what-is-text-classification)

Let's see how this way of seeing sentences and documents might actually work. Consider the next dummy sentiment texts with 1 positive sentence and 1 negative sentence.

In [None]:
positive_sentence = 'This is some wonderful positive text'
negative_sentence = 'This is some awful negative text'

Now let's get the embeddings for every single word then average them per sentence.

In [None]:
documents = [positive_sentence.split(), negative_sentence.split()]
model = gensim.models.Word2Vec(
    documents,
    size=25,
    window=5,
    min_count=1,
    workers=10
)
model.train(documents, total_examples=len(documents), epochs=10)
model.wv['text']

In [None]:
positive_vectors = [model.wv[word]
                    for word in positive_sentence.split()]
negative_vectors = [model.wv[word]
                    for word in negative_sentence.split()]

In [None]:
positive_vectors[:2]  # first 2 words

And now, let's get the average vectors for the positive and negative sentences.

In [None]:
positive_vector = np.mean(positive_vectors, axis=0)
negative_vector = np.mean(negative_vectors, axis=0)
positive_vector, negative_vector

We can now run the same process for a couple of test sentences and see if their sentence vectors are similar to the positive or the negative one.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

for test_sentence in ('This is awful', 'This is wonderful'):
    test_vector = np.mean([model.wv[word] for word in test_sentence.split()], axis=0)
    print(test_sentence)
    print('\tSimilarity to positive sentence',
          cosine_similarity(positive_vector.reshape(1, -1), test_vector.reshape(1, -1))[0][0])
    print('\tSimilarity to negative sentence',
          cosine_similarity(negative_vector.reshape(1, -1), test_vector.reshape(1, -1))[0][0])

Although this approach seems naive, it's still part of the way fastText does its text classification. In addition, fastText uses a shallow neural network and ideas similar to CBOW but for word n-grams. The result, which rivals state-of-the-art text classification techniques based on deep learning, runs several orders of magnitude faster.
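
To give a feel for what "word n-grams" means here, the following sketch builds a bag of features made of single words plus word bigrams, which keeps a little local word order that plain averaging throws away. The helper names `word_ngrams` and `features` are hypothetical, and fastText's own feature handling differs in its details.

```python
def word_ngrams(tokens, n=2):
    """Return the word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(sentence, max_n=2):
    """Bag of features: single words plus word n-grams up to `max_n`."""
    tokens = sentence.lower().split()
    bag = list(tokens)
    for n in range(2, max_n + 1):
        bag.extend(word_ngrams(tokens, n))
    return bag


print(features("This is wonderfully awful"))
```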

#### Doc2Vec

In Gensim, this functionality is under `gensim.models.Doc2Vec`, and it uses a slightly different approach based on Word2Vec, the *Paragraph Vector*, where the model learns to correlate labels and words, rather than words with other words.

> The idea is straightforward: we act as if a paragraph (or document) is just another vector like a word vector, but we will call it a paragraph vector. We determine the embedding of the paragraph in vector space in the same way as words. Our paragraph vector model considers local word order like bag of n-grams, but gives us a denser representation in vector space compared to a sparse, high-dimensional representation. -- [RaRe Technologies](http://nbviewer.jupyter.org/github/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb#Paragraph-Vector)

The first step is coming up with a vector that represents the *meaning* of a document, which can then be used as input to a supervised machine learning algorithm to associate documents with labels. There are 2 flavors that are roughly the equivalents of CBOW and Skip-gram. In the **Paragraph Vector Distributed Memory (PV-DM)** model, analogous to CBOW Word2vec, the paragraph vectors are obtained by training a neural network on the fake task of inferring a center word based on context words and a context paragraph. A paragraph is a context for all words in the paragraph, and a word in a paragraph can have that paragraph as a context. In the **Paragraph Vector Distributed Bag of Words (PV-DBOW)** model, analogous to Skip-gram Word2vec, the paragraph vectors are obtained by training a neural network on the fake task of predicting a probability distribution of words in a paragraph given a randomly-sampled word from the paragraph.

Doc2Vec, which treats blocks of text as units, has built-in support for both flavors: *distributed memory* (`dm`), the equivalent of CBOW, and *distributed bag of words* (`dbow`), the equivalent of Skip-gram. Since the distributed memory model performed noticeably better in the paper, that algorithm is the default when running Doc2Vec. You can still force the `dbow` model if you wish, by using the `dm=0` flag in the constructor.
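
One way to picture the difference between the two flavors is to look at the (input, target) pairs each one would be trained on. The sketch below only generates such pairs for a toy paragraph and does no training. The function names, the symmetric window, and the simplification of using every word instead of random sampling are assumptions made for illustration, not Gensim's implementation.

```python
def pv_dm_examples(tag, words, window=2):
    """PV-DM: predict each word from the paragraph tag plus its neighbouring words."""
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        yield ([tag] + context, target)


def pv_dbow_examples(tag, words):
    """PV-DBOW: predict words of the paragraph from the paragraph tag alone."""
    for target in words:
        yield (tag, target)


words = "the cat sat".split()
print(list(pv_dm_examples("doc_0", words, window=1)))
print(list(pv_dbow_examples("doc_0", words)))
```

In both cases the paragraph tag takes part in every example for its paragraph, which is how it ends up with a vector of its own.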

Let's now see an example where we aim to build a classifier for Jane Austen and G.K. Chesterton's works, where each work is assigned a label or tag with its author. Let's also suppose we don't know who the author of Austen's *Emma* was, but we know for sure that it is either Austen or Chesterton. One way to approach this is by building a classifier to predict a label or tag for the unseen anonymous work, so we can see who the classifier thinks the work belongs to. This is, taking some liberties, a very basic instance of authorship attribution.

We start by obtaining the total number of words and sentences for all their works in NLTK's Gutenberg corpus, and consider those to be all the works that they ever wrote.

In [None]:
works = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
         'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt']
for work in works:
    print("{:25} {:8} words {:5} sentences".format(
        work,
        len(nltk.corpus.gutenberg.words(work)),
        len(nltk.corpus.gutenberg.sents(work)))
    )

By removing *Emma* from our list of works, we have roughly the same number of words per author, which should alleviate the class imbalance issue.

| Austen's Work          | Words    | Sentences  |
| ---------------------- |:--------:| ----------:|
| austen-persuasion.txt  |  98171   |  3747      |
| austen-sense.txt       |  141576  |  4999      |
| **Total**              |**239747**|**8746**    |    

| Chesterton's Work      | Words    | Sentences  |
| ---------------------- |:--------:| ----------:|
| chesterton-ball.txt    |  96996   |  4779      |
| chesterton-brown.txt   |  86063   |  3806      |
| chesterton-thursday.txt|  69213   |  3742      |
| **Total**              |**252272**|**12327**   |    
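
As a quick check of the balance claim, the word counts from the two tables can be summed per author and compared. The dictionary below simply copies the numbers shown above.

```python
# Word counts copied from the tables above
counts = {
    "austen": {"austen-persuasion.txt": 98171, "austen-sense.txt": 141576},
    "chesterton": {
        "chesterton-ball.txt": 96996,
        "chesterton-brown.txt": 86063,
        "chesterton-thursday.txt": 69213,
    },
}
totals = {author: sum(works.values()) for author, works in counts.items()}
ratio = min(totals.values()) / max(totals.values())
print(totals)
print(f"smaller class / larger class = {ratio:.2f}")
```

A ratio close to 1 means neither author dominates the training data in terms of words, although Chesterton still contributes noticeably more sentences.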



Gensim provides a couple of classes to encode sentences and entire documents: `LabeledSentence()` and `TaggedDocument()`. In recent Gensim versions `LabeledSentence` is only kept as a deprecated alias of `TaggedDocument`, so we will use the latter.

Once we have the words from our corpus of works, let's create instances of each of them as `TaggedDocument`s by passing the list of words and the label or tag. We will use the label `anonymous` for Austen's *Emma*. Similarly, we could have built `documents` to have a list of sentences instead and then predict the proportion of sentences in Austen's *Emma* properly assigned to her.

In [None]:
from gensim.models.doc2vec import TaggedDocument
    
documents = [
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('austen-persuasion.txt'),
        tags=['austen']),
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('austen-sense.txt'),
        tags=['austen']),
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('chesterton-ball.txt'),
        tags=['chesterton']),
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('chesterton-brown.txt'),
        tags=['chesterton']),
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('chesterton-thursday.txt'),
        tags=['chesterton']),
    TaggedDocument(
        words=nltk.corpus.gutenberg.words('austen-emma.txt'),
        tags=['anonymous']),
]

Now it's time to initialize a `Doc2Vec()` model with its default learning rate `alpha`, build the vocabulary from our list of works, and start training. We call `train()` 10 times, and each call runs `model.epochs` passes over the corpus. The values for `total_examples` and `epochs` are the default ones, although `Doc2Vec()` requires you to pass them in explicitly: `total_examples=model.corpus_count` and `epochs=model.epochs`.

In [None]:
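from gensim.models import Doc2Vec
from tqdm import tqdm  # progress bar for the training loop

model = Doc2Vec(min_count=0)  # min_count=0 keeps every word, however rare
model.build_vocab(documents)
# Each call to train() runs model.epochs passes over the corpus
for epoch in tqdm(range(10), desc='Epochs'):
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)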

Once the model has been fit to our corpus, the usual functions from word2vec are also available for documents (doc2vec) through `model.docvecs`. Let's see which of the tags that represent the authors of the works our anonymous work is most similar to.

In [None]:
model.docvecs.most_similar_to_given('anonymous', ['austen', 'chesterton'])

Finally, let's compute the similarity and distance from Austen's *Emma* (`anonymous`) to the vectors that represent Austen and Chesterton.

In [None]:
for author in ('austen', 'chesterton'):
    print(author)
    print("\tSimilarity", model.docvecs.similarity('anonymous', author))
    print("\tDistance", model.docvecs.distance('anonymous', author))    
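
To get an intuition for what the two cells above compute, here is a minimal pure-Python sketch. It assumes that `similarity()` is the cosine similarity between two tag vectors, that `distance()` is one minus that similarity, and that `most_similar_to_given()` simply returns the candidate with the highest similarity. The three-dimensional vectors and the helper names are invented for illustration; they are not the vectors learned by the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Complement of the cosine similarity (assumed definition of distance)."""
    return 1.0 - cosine_similarity(u, v)

def closest_tag(tag, candidates, vectors):
    """Return the candidate whose vector is most similar to the vector of `tag`."""
    return max(candidates, key=lambda c: cosine_similarity(vectors[tag], vectors[c]))

# Hypothetical tag vectors, made up for this example only
toy_vectors = {
    "anonymous": [0.9, 0.1, 0.3],
    "austen": [0.8, 0.2, 0.4],
    "chesterton": [0.1, 0.9, 0.5],
}

print("Closest tag:", closest_tag("anonymous", ["austen", "chesterton"], toy_vectors))
for author in ("austen", "chesterton"):
    sim = cosine_similarity(toy_vectors["anonymous"], toy_vectors[author])
    dist = cosine_distance(toy_vectors["anonymous"], toy_vectors[author])
    print(f"{author}\tSimilarity {sim:.3f}\tDistance {dist:.3f}")
```

Under these assumptions, a higher similarity always means a smaller distance, so both numbers rank the two authors in the same way.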

Of course, this result is open to many interpretations, and this is by no means the best way to tackle the problem of authorship attribution. However, it gives a sense of how ideas from word2vec can be extended to other problems.

<div style="font-size: 1em; margin: 1em 0 1em 0; border: 1px solid #86989B; background-color: #f7f7f7; padding: 0;">
<p style="margin: 0; padding: 0.1em 0 0.1em 0.5em; color: white; border-bottom: 1px solid #86989B; font-weight: bold; background-color: #AFC1C4;">
<strong>Activity</strong>
</p>
<p style="margin: 0.5em 1em 0.5em 1em; padding: 0;">
Using the example code given for fastText text classification, replicate the Austen and Chesterton classifier.
You can use a sentence-level approach, holding out 25% of all sentences for testing and assessing performance.
Once the model has been fit, we'll use it to predict the percentage of sentences from Austen's *Emma* that are attributed to either Austen or Chesterton.
</p>
</div>

Solution (redacted):
```python
# 1. Build the list of sentences, prefixing each one with __label__author
texts = []
for work in works:
    for sent in ...:
        if 'emma' not in work:  # Our anonymous work!
            author = ...
            sentence = ...
            texts.append(f"__label__{author} {sentence}")

# 2. Randomly split the list in training and testing
training = ...
testing = ...

# 3. Get the texts from the training and testing lists
training_text = ...
testing_text = ...

# 4. Create the temporary files so fastText can read them 
with open('author_train.txt', 'w') as f:
    f.write(training_text)
with open('author_test.txt', 'w') as f:
    f.write(testing_text)

# 5. Build a classifier and test it on the testing sentences
classifier = fasttext....
result = classifier...
print('Precision:', result...)
print('Recall:', result...)
print('Number of sentences:', result...)

# 6. For each sentence in Austen's Emma, predict its author
emma_sentences = ...
predictions = classifier...

# 7. Print the percentage of sentences from Austen's Emma associated to each author
authors = ...
counts = ...
print()
print("Predicting Austen's Emma's sentences")
for author, count in zip(*[authors, counts]):
    print(f"{author.capitalize()} accounts for {100*count:.2f}% of sentences.")
```

Output:
```
Precision: 0.8980831277282216
Recall: 0.8980831277282216
Number of sentences: 5269

Predicting Austen's Emma's sentences
Austen accounts for 80.38% of sentences.
Chesterton accounts for 19.62% of sentences.
```

In [None]:
# Enter your code here

<div align="right"><small class="text-muted">*Double click here to see the full code solution*</small></div>

<!--

#- 1. Build the list of sentences, prefixing each one with __label__author
texts = []
for work in works:
    for sent in nltk.corpus.gutenberg.sents(work):
        if 'emma' not in work:  # Our anonymous work!
            author = work.split('-')[0]
            sentence = ' '.join(sent)
            texts.append(f"__label__{author} {sentence}")

#- 2. Randomly split the list in training and testing
training_size = int(np.floor(len(texts) * 0.75))
np.random.shuffle(texts)
training, testing = texts[:training_size], texts[training_size:]

#- 3. Get the texts from the training and testing lists
training_text = '\n'.join(training).strip()
testing_text = '\n'.join(testing).strip()

#- 4. Create the temporary files so fastText can read them 
with open('author_train.txt', 'w') as f:
    f.write(training_text)
with open('author_test.txt', 'w') as f:
    f.write(testing_text)

#- 5. Build a classifier and test it on the testing sentences
classifier = fasttext.supervised('author_train.txt', 'author_model')
result = classifier.test('author_test.txt')
print('Precision:', result.precision)
print('Recall:', result.recall)
print('Number of sentences:', result.nexamples)

#- 6. For each sentence in Austen's Emma, predict its author
emma_sentences = [' '.join(sent)
                  for sent in nltk.corpus.gutenberg.sents('austen-emma.txt')]
predictions = classifier.predict(emma_sentences)

#- 7. Print the percentage of sentences from Austen's Emma associated to each author
authors, counts = np.unique(predictions, return_counts=True)
counts = counts / sum(counts)
print()
print("Predicting Austen's Emma's sentences")
for author, percentage in zip(*[authors, counts]):
    print(f"{author.capitalize()} accounts for {100*percentage:.2f}% of sentences.")
-->