# Week 20: SCC.413 Applied Data Mining

## Word Embedding Models

## Contents
---
* [1. Introduction](#intro)
* [2. Training the Embedding Model](#embmodel)
* [3. Querying word vectors: So, what does our trained model contain](#query)
* [4. More querying: What can we do with the word vectors?](#morequery)
* [5. Using Pretrained Embedding Models](#pretrained)
* [6. Evaluating Pre-trained Word Embeddings](#evaluation)
    * [6a. Task1: Vector Analogy](#analogy)
        - [Exercise 1](#ex1)
    * [6b. Task2: Sentiment Classification](#sentiments)
        - [Exercise 2](#ex2)

<a name="intro"></a>
## 1. Introduction
---

We have seen different (even more sophisticated) methods for preprocessing text in previous labs, so this is merely a review. By now we have established that processing text and understanding the patterns therein is at the heart of whatever we do in data mining and NLP.

Also, you may have noticed that machine learning algorithms we use to train our models for most of the downstream tasks prefer to have text inputs converted to some form of numerical data. When we perform classification of documents, for example, each document is an **input** and a class label is the **output** for our predictive algorithm. We give *vectors* of numbers (also called *input features*) as input.

So, we need to convert documents to *fixed-length* vectors of numbers. *Scikit-learn*'s *CountVectorizer* is one way to achieve that. With embedding models, we go beyond counting words and also capture relationships between words.
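
To make the idea of a *fixed-length* count vector concrete, here is a minimal pure-Python sketch of what a count-based vectoriser does. The function `count_vectorize` and the two toy documents are made up for illustration; it is not Scikit-learn's implementation, which also handles tokenisation and sparse storage.

```python
from collections import Counter

def count_vectorize(docs):
    """Map each document to a fixed-length vector of word counts."""
    tokenised = [doc.lower().split() for doc in docs]
    # The sorted vocabulary fixes the vector length and what each position means.
    vocab = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

toy_docs = ["the cat sat", "the dog sat on the mat"]  # illustrative data
vocab, vectors = count_vectorize(toy_docs)
print(vocab)
print(vectors)
```

Every document gets one slot per vocabulary word, so both vectors have the same length even though the documents do not. Note that the positions carry no notion of which words are related, which is the gap embeddings try to fill.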

There are two key tasks in this lab:
* **Answer analogy questions** with embedded vectors as described in [this paper](http://arxiv.org/pdf/1301.3781.pdf)
    - e.g. `man` is to `king` as `woman` is to `?`; (Answer = `queen`)

* **Training a word sentiment classifier** as described in [this blog](http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/)
    - e.g. `awesome` is *positive* while `horrible` is *negative*
    

At the end of the lab, you will hopefully learn
* how to train, save and load your own embedding models using the Gensim library (discussed later).
* how to query trained word vectors
    - similarity scores
    - vector analogy
    - odd-words
    - Word Mover's Distance
* how to fit a classifier with embedding vectors as features


Useful skills from previous labs:
   
   * file processing
   * text processing
   * training classifiers
   * etc
   

---
### Installing and Importing libraries ...

In [None]:
# pip install -r requirements.txt  #this isn't needed on colab, as all libraries are installed

In [None]:
import os
import gensim.downloader as api
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn
import re

from nltk.tokenize import word_tokenize, sent_tokenize
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from collections import Counter

# suppressing some deprecation warning..
import warnings
warnings.simplefilter(action='ignore')

You should upload all of the provided files to a Google Drive folder; you can then access these files from your Python code. See also the files tab.

In [None]:
from google.colab import drive
drive.mount('/content/gdrive/')

We save the folder we are working from as a variable for easy access. You may need to edit the path to match your own.

In [None]:
working_folder = '/content/gdrive/MyDrive/413/wk20/'

### Data preparation
The files for this exercise are kept in a folder named `data` with different subfolders, each of which will be used at different stages. We will normalise the text to lowercase and present each sentence as a list of words, so the corpus is a list of sentences. To create our word embeddings, we will use the `CreateCorpus` class below to prepare our training data.

#### Task 0:
Observe that no pre-processing or cleaning was done on the data. Proceeding without that will produce unwanted tokens. Modify parts of the `CreateCorpus(object)` to pre-process the text better e.g. removing the punctuation etc. You can also remove stopwords if you wish.

In [None]:
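# preparing the training data: yields one list of lowercased tokens per line
class CreateCorpus(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # the context manager closes each file once it has been read
            with open(os.path.join(self.dirname, fname)) as infile:
                for line in infile:
                    yield line.lower().split()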
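
As a starting point for Task 0, the sketch below shows one possible cleaning step. The helper `clean_tokens`, the pattern `WORD_RE` and the tiny stopword set are hypothetical names introduced here, not part of the lab code, and the pattern is only one of many reasonable choices.

```python
import re

# Hypothetical helper: keeps runs of letters, digits and apostrophes,
# and drops everything else (punctuation, emoticons, arrows, ...).
WORD_RE = re.compile(r"[a-z0-9']+")
TOY_STOPWORDS = {"the", "a", "is", "to"}  # illustrative only, far from complete

def clean_tokens(line, stopwords=frozenset()):
    """Lowercase a line, extract word-like tokens and drop stopwords."""
    tokens = WORD_RE.findall(line.lower())
    return [tok for tok in tokens if tok not in stopwords]

print(clean_tokens("The food is GREAT!!! :) -> go there", TOY_STOPWORDS))
```

Inside `CreateCorpus.__iter__`, a call such as `clean_tokens(line)` could then take the place of `line.lower().split()`.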

<a name="embmodel"></a>
## 2. Training the Embedding Model
---
The main purpose of this exercise is to see how to train a word embedding model. There are many ways to train embedding models. Some of the popular methods and libraries include:

 - Tomas Mikolov's [Word2vec](https://en.wikipedia.org/wiki/Word2vec),
 - Stanford University's [GloVe](https://nlp.stanford.edu/projects/glove/),
 - AllenNLP's [ELMo](https://allennlp.org/elmo),
 - FacebookAI's [fastText](https://en.wikipedia.org/wiki/FastText),
 - Radim Řehůřek's [Gensim](https://radimrehurek.com/gensim/index.html)
 - Google's [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) 
 
For our tasks, we shall use [Gensim](https://radimrehurek.com/gensim/index.html), which is an open source Python library for natural language processing, with a focus on [topic modeling](https://en.wikipedia.org/wiki/Topic_model). If you haven't done so already, you can install Gensim with `pip install --upgrade gensim` (`pip install -r requirements.txt` installs all dependencies for this lab). I also find [Gensim's tutorial and demo page](https://rare-technologies.com/word2vec-tutorial/) quite insightful.

---
### Training Parameters
Some of the common training parameters you may use to optimise your model based on your immediate task include:

 * **size:** (int, optional, default=100) – Dimensionality of the word vectors. (This parameter is named `vector_size` in Gensim 4.0 and later.)
 * **window:** (int, optional, default=5) – Maximum distance between the current and predicted word within a sentence.
 * **min_count:** (int, optional, default=5) – Ignores all words with total frequency lower than this.
 * **workers:** (int, optional, default=3) – Use this many worker threads to train the model (=faster training with multicore machines).
 * **sg:** ({0, 1}, optional, default=0) – Training algorithm: 1 for *skip-gram(SG)*; otherwise *CBOW*.
 * **alpha:** (float, optional, default=0.025) – The initial learning rate.
 * **min_alpha:** (float, optional, default=0.0001) – Learning rate will linearly drop to min_alpha as training progresses.
 * **iter:** (int, optional, default=5) – Number of iterations (epochs) over the corpus. (This parameter is named `epochs` in Gensim 4.0 and later.)
 * **negative:** (int, optional, default=5) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.

See [this page](https://radimrehurek.com/gensim/models/word2vec.html) for other parameters if you wish to perform training optimisation later.
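
To give a feel for what the **window** parameter controls, the sketch below lists the (centre word, context word) pairs that fall within a given window. The function `context_pairs` is a hypothetical illustration, not Gensim code; real implementations add details (such as randomly shrinking the window and subsampling frequent words) that this sketch leaves out.

```python
def context_pairs(sentence, window=2):
    """Return (centre, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, centre in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((centre, sentence[j]))
    return pairs

toy_sentence = ["i", "love", "this", "phone"]  # illustrative data
pairs = context_pairs(toy_sentence, window=1)
print(pairs)
```

A larger window produces more pairs per word and tends to capture broader, more topical associations, while a small window focuses on immediate neighbours.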

---
### Training our word embedding model
So let's train a simple word embedding model that uses the default parameters, which means the embedding dimension (size) is 100.

To train the model, we simply pass our `corpus` (an iterable of tokenised *sentences*) to the `Word2Vec` object.

In [None]:
corpus = CreateCorpus(working_folder + 'data/sentiment_data/all/')
twitter_sentiment_model = Word2Vec(corpus)

Done! That's it. We now have a model. We can look at some of the parameter configurations for training our model.

<a name="query"></a>
## 3. Querying word vectors
The *full model* often retains the training details allowing you to checkpoint it and continue training later with more data (more data implies better representation). But when the model is fully trained within certain bounds (e.g. domain, data size, embedding dimension, target task), we may only require the (pre-trained) word vectors as inputs to our downstream tasks. We will come back to that later.

Some of the things we may be interested in with regards to text processing include:

 - **Model vocabulary** (may be different from the corpus vocabulary depending on pre-processing and model initial parameters)
 - **Frequency counts** (of words/tokens in model)
 - **Embedding dimension** (size)
---
### Model vocabulary
We can use `twitter_sentiment_model.wv.key_to_index` to view the vocabulary built in training, and `len(twitter_sentiment_model.wv.key_to_index)` to get its size. (Gensim versions before 4.0 exposed the vocabulary as `twitter_sentiment_model.wv.vocab` instead.) Looking at the list displayed below, you may decide, for example, what is okay to keep and what constitutes noise (e.g. *his...*, *|*, *-&gt* and *numbers* in general) and, more importantly, how to deal with them e.g. using a better tokeniser, setting the right **min_count** parameter value.

In [None]:
# Gensim >= 4.0 keeps the vocabulary in `wv.key_to_index` (older versions used `wv.vocab`)
vocab = twitter_sentiment_model.wv.key_to_index
print(f"Vocabulary size: {len(vocab)}")
for w in vocab:
    print(f"{w}, ", end="")

### Frequency counts

This list will show that a lot of the input words to the model are punctuation marks and stopwords. Depending on what your main task is, you may decide to retain or remove them before the model training using simple processing techniques learnt in the previous labs.

In [None]:
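def show_top_token_freq(model, topn):
    # Gensim >= 4.0: counts come from `get_vecattr` (older versions used `wv.vocab[w].count`)
    counts = {w: model.wv.get_vecattr(w, "count") for w in model.wv.key_to_index}
    for w, c in sorted(counts.items(), key=lambda x: x[1], reverse=True)[:topn]:
        if topn <= 20:
            print(f"{w:>10s} {c:5d}")
        else:
            print(f"{w}({c}), ", end="")
show_top_token_freq(twitter_sentiment_model, 10)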
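
The effect of the **min_count** parameter on the vocabulary can be sketched with a plain `Counter`. The function `build_vocab` and the toy corpus are made up for illustration and only mimic the frequency filter, not Gensim's full vocabulary building.

```python
from collections import Counter

toy_corpus = [  # illustrative data
    ["good", "phone", "!"],
    ["bad", "phone", "!"],
    ["good", "battery", "!"],
]

def build_vocab(sentences, min_count=1):
    """Count tokens and keep only those seen at least `min_count` times."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok: n for tok, n in counts.items() if n >= min_count}

vocab_all = build_vocab(toy_corpus)
vocab_min2 = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab_min2.items(), key=lambda kv: kv[1], reverse=True))
```

Raising `min_count` removes rare tokens such as `bad` and `battery` here, but the most frequent token is still the punctuation mark, which is why cleaning the text matters as well.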

### Embedding dimension
So, each word is represented by a *vector* (a list of numbers) with length equal to the value assigned to the `size` parameter during training. Remember, our model used `size = 100` (the default) in the training example above. Most of the pre-trained models (such as [Glove](https://nlp.stanford.edu/projects/glove/), [Word2Vec](https://code.google.com/archive/p/word2vec/), [fastText](https://fasttext.cc/)) come in a range of dimensions (50, 100, 200, 300) and are trained on far larger corpora (up to 100s of billions of tokens) to capture better semantic information.

In [None]:
twitter_sentiment_model.wv.get_vector("the")

### Saving a model
Again, because we used a toy corpus for this training, we can afford to train and retrain as many times as we like. In reality, you don't want to throw away your (hard) trained model because it takes a lot of time and is very computationally expensive to train a 'good enough' model.

There are basically two ways you can save the model:

* *full model*: `model.save('path\to\file_name')` contains the full model state (hidden weights, vocab frequencies and other training parameter values). You can initialise a new model with this and continue training at a later time.

* *word vectors*: `model.wv.save_word2vec_format('path\to\file_name')`. This is (more or less) the actual output of a training process. It is a plain text file with each line containing a token and its vector representation. Technically, you cannot continue training on this when loaded, but you can view it with a text editor (if not too large!) or process it as a text file. Most pre-trained models are available in this format for re-use.

For more options see [Usage examples on Gensim](https://radimrehurek.com/gensim/models/word2vec.html)
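
The plain-text word vector format is simple enough to read by hand. The sketch below parses a tiny in-memory example, assuming the usual layout of a header line (`vocabulary_size dimension`) followed by one `token v1 v2 ...` line per word. The function `read_word2vec_text` and the toy content are hypothetical; in practice `KeyedVectors.load_word2vec_format` does this job.

```python
import io

# A made-up file with 2 words of dimension 3, held in memory for the example.
toy_file = io.StringIO(
    "2 3\n"
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
)

def read_word2vec_text(handle):
    """Read 'token v1 v2 ...' lines into a dict, checking them against the header."""
    n_words, dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == n_words
    assert all(len(vec) == dim for vec in vectors.values())
    return vectors

toy_vectors = read_word2vec_text(toy_file)
print(toy_vectors["cat"])
```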

In [None]:
twitter_sentiment_model.save(working_folder + 'data/embedding_model/twitter_sentiment_model.bin')
twitter_sentiment_model.wv.save_word2vec_format(working_folder + 'data/embedding_model/twitter_sentiment_model.txt')

<a name="loading"></a>
### Loading a model
You can load the *full model* or the *word vectors* for later use as shown below.

In [None]:
loaded_model_bin = Word2Vec.load(working_folder + 'data/embedding_model/twitter_sentiment_model.bin')
loaded_model_txt = KeyedVectors.load_word2vec_format(working_folder + 'data/embedding_model/twitter_sentiment_model.txt')

In [None]:
loaded_model_bin.wv['the']

In [None]:
loaded_model_txt['the']

<a name="morequery"></a>
## 4. What we can do with word vectors

Now we have seen how to configure, train, save and load embedding models. But what can we do with these pre-trained word vectors? Word embedding models became popular in 2013 ([Ruder, 2018](https://ruder.io/tag/word-embeddings/)), which also popularised the application of neural networks to NLP tasks, including the application areas listed below.

### Application areas

- Language modeling
- Information extraction
- Machine translation
- Named entity recognition
- Question answering
- Text classification
- etc.

Also, here is an interesting blog on [why we use embedding models](https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2).

### How we can use word vectors

A common way to use the embeddings is to have them as a layer (the *embedding layer*) in a deep neural network setting, i.e. the embedding is learned jointly with a neural network model on a specific natural language processing task. (That will not be our focus in this lab.)

But you can also use pre-trained vectors, combined in some format, as inputs to a classification algorithm. Obviously, your task determines how the vectors are created and/or combined and that can be optimised experimentally.
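
One simple way of combining word vectors into a single fixed-length input is to average them. The sketch below uses hand-made 2-dimensional toy vectors; `document_vector` is a hypothetical helper, and averaging is only one of several possible combination schemes.

```python
toy_vectors = {  # hand-made values, not learned from data
    "good": [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

def document_vector(tokens, vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None  # nothing to average
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(document_vector(["good", "phone", "zzz"], toy_vectors))
```

The result has the same length as a single word vector regardless of how many tokens the text contains, so it can be passed directly to a classifier.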

In this lab we will be looking at a *very* simple sentiment classification task using pre-trained vectors as inputs. Before that, let's look at some of the properties of a generic embedding model.

---
### Analogy - A Basic Question Answering problem 

#### `king` minus `man` plus `woman` = ?

![](https://dpzbhybb2pdcj.cloudfront.net/smith4/Figures/f0207-01.jpg)

One of the most popular tasks that word embeddings support (as demonstrated in this [Mikolov's paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)) is the *vector analogy* task. (You can read all about how and why it works [in this paper](https://levyomer.files.wordpress.com/2014/04/linguistic-regularities-in-sparse-and-explicit-word-representations-conll-2014.pdf).) The key idea is that the vectors can capture analogies like *man is to king* what *woman is to queen*.

Therefore, if you get the vector for the word, say *king* and subtract the vector for *man* and then add the vector for *woman*, the closest approximation happens to be the vector for *queen* i.e. `vec(king) - vec(man) + vec(woman) ~= vec(queen)`
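
The arithmetic can be tried out on hand-made vectors. In the sketch below the three dimensions loosely stand for (royal, male, female); the values are invented so that the analogy works, and `analogy` and `cosine` are illustrative helpers rather than Gensim functions. Gensim's own `most_similar` normalises the vectors first, which this sketch does not do.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

toy = {  # hand-made (royal, male, female) values, not learned from data
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "prince": [0.9, 1.0, 0.0],
    "apple": [0.1, 0.1, 0.1],
}

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("man", "king", "woman", toy))
```

Excluding the question words matters: without that step the nearest vector to the target is often one of the inputs themselves.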

We may not be able to see much with our toy dataset, but let's explore a little.

---
### Similarity measure
From our model, let's look at the 10 most similar words (according to our model!) to each of these randomly selected words: *man, woman, king, queen, good, bad, office, kitchen*. Here, we use *similarity* as a measure of the 'closeness' of words, e.g. synonyms (*coast* vs *shore*) or related words (*clothes* and *closet*). This is a good paper on [similarity and relatedness](https://www.aclweb.org/anthology/D15-1242.pdf).

What are the most similar words to *queen*? How good or bad are they? By default we trained the model with 5 epochs (i.e. `iter=5`). You can retrain it with 10, 20 or more and see how much difference it will make.
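
The similarity scores used for this kind of query are cosine similarities between word vectors. The sketch below ranks neighbours in the same spirit, using invented 2-dimensional vectors; the local `most_similar` function is a simplified stand-in written for this example, not the Gensim method of the same name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

toy = {  # hand-made values, not learned from data
    "coast": [0.9, 0.1],
    "shore": [0.8, 0.2],
    "closet": [0.1, 0.9],
    "clothes": [0.2, 0.8],
}

def most_similar(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(w, cosine(vectors[word], vec)) for w, vec in vectors.items() if w != word]
    return sorted(scores, key=lambda kv: kv[1], reverse=True)[:topn]

for wd, score in most_similar("coast", toy):
    print(f"{wd:>10s}: {score:.4f}")
```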

In [None]:
words = ['man', 'woman', 'king', 'queen', 'good', 'bad', 'office', 'kitchen']

In [None]:
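for word in words:
    print(f"\n---{word}:")
    if word not in twitter_sentiment_model.wv:
        # most_similar raises a KeyError for out-of-vocabulary words
        print("   (not in the model vocabulary)")
        continue
    similar_words = twitter_sentiment_model.wv.most_similar(word)
    for wd, score in similar_words:
        print(f"{wd:>15s}: {score:.4f}")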

Neither *king* nor any other word linked to royalty appears among the most similar words to *queen*, so it looks like our model may not be able to answer the question *king - man + woman $\approx$ queen?* We don't even have `kitchen` in our vocabulary. But let's try to find out.

In [None]:
result = twitter_sentiment_model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
for word, score in result:
    print(f"{word:>15s}: {score:.4f}")

Well, our model is struggling, and this is not very surprising because it was trained with a comparatively small amount of data, too little to capture anything meaningful. It's common practice to use pre-trained models for your work unless you have a good reason not to (e.g. if they are too generic for your task) and you have a lot of time and compute resources.

---
<a name="pretrained"></a>
## 5. Using Pretrained Embedding Models
Using pre-trained models is common but, in some cases, it may be better to train your own embedding. Here is a paper on [when-and-why-using-pretrained-embedding](https://www.aclweb.org/anthology/N18-2084.pdf). It is about machine translation but has general advice.

Gensim's `downloader` API can be used to get these `Word2Vec` models but they may be too heavy for the lab:

    word2vec-google-news-300 (1662 MB) (dimensionality: 300)
    word2vec-ruscorpora-300 (198 MB) (dimensionality: 300)

Alternatively, there are lighter versions trained with the [GloVe](https://nlp.stanford.edu/projects/glove/) architecture described in [this paper](https://nlp.stanford.edu/pubs/glove.pdf). 

    glove-wiki-gigaword-50 (65 MB)
    glove-wiki-gigaword-100 (128 MB)
    glove-wiki-gigaword-200 (252 MB)
    glove-wiki-gigaword-300 (376 MB)

For our lab exercise, we will use one of the pre-trained models that can be downloaded using Gensim's `downloader` API, `glove-wiki-gigaword-50`, as shown below. Also, the `data/word_vectors/` folder contains other versions with different dimensions which you can load and use.

Also you can use the trained vectors irrespective of how they are trained ([Word2Vec](https://nlp.stanford.edu/projects/glove/), [fastText](https://fasttext.cc/), [WordRank](https://www.groundai.com/project/wordrank-learning-word-embeddings-via-robust-ranking/4), VarEmbed etc). Here is [a nice blog](https://rare-technologies.com/wordrank-embedding-crowned-is-most-similar-to-king-not-word2vecs-canute/) on the comparison of different embedding models. As demonstrated [above](#loading), Gensim represents them as a standalone structure called `KeyedVectors`.

Let us use the `gensim.downloader` API for loading the Glove model (it takes a while) and check the most similar words to each word in `words` with the `glove_vectors`. Try to answer the *king - man + woman $\approx$ queen?* question again. Compare the results with those of the previous model.

First, let's download the embeddings and save the vectors for the classification task later.

In [None]:
# download pre-trained embeddings from gensim-data
glove_50 = api.load("glove-wiki-gigaword-50")

You don't always have to do this but if you wish to back up the vector format for later use, run the code below.

In [None]:
glove_50.save_word2vec_format(working_folder + 'data/embedding_model/glove.6B.50d.txt')

In [None]:
for word in words:
    similar_words = glove_50.most_similar(word)
    print(f"\n---{word}:")
    for wd, score in similar_words:
        print(f"{wd:>15s}: {score:.4f}")

In [None]:
result = glove_50.most_similar(positive=['king', 'woman'], negative=['man'], topn=10)
for word, score in result:
    print(f"{word:>15s}: {score:.4f}")

Clearly, `queen` tops the list and that's good. Also, we can see that the list is made up of words that are expected in the context (e.g. royalty, womanhood etc.)

You can try a few more analogy examples:

    - positive=['uk', 'russia'], negative=['london'] = ?
    - positive=['father', 'mother'], negative=['son'] = ?

or anything you like, make something up.

---    
### Odd-word: Another simple question answering task

Here, we give a set of words to our model and let it pick the **odd one** out. Are the results below as expected? Give examples that can break it.

In [None]:
print(glove_50.doesnt_match('breakfast cereal dinner lunch'.split())) #cereal?
print(glove_50.doesnt_match('london savannah manchester glasgow'.split())) #savannah?
print(glove_50.doesnt_match('bad nice horrible disgusting'.split())) #nice?
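
One way to compute an odd-one-out answer is to normalise each word vector, average them, and pick the word that is least similar to that average. The sketch below follows this idea with invented 2-dimensional vectors; `odd_one_out` is a hypothetical helper that roughly mirrors the approach and is not Gensim's code.

```python
import math

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dim = len(vectors[words[0]])
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]

    def closeness(word):
        return sum(a * b for a, b in zip(units[word], mean))

    return min(words, key=closeness)

toy = {  # hand-made values: three "meal" words and one "food item"
    "breakfast": [0.9, 0.1],
    "lunch": [0.8, 0.2],
    "dinner": [0.85, 0.15],
    "cereal": [0.2, 0.8],
}
print(odd_one_out("breakfast cereal dinner lunch".split(), toy))
```

Because the mean is computed over the whole group, a list with two competing clusters of equal size gives no clear odd word, which is one way to build examples that break the method.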

### Similar documents with WMD
[Word Mover's Distance (WMD)](https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html) is an interesting technique in building machine learning models for [document retrieval](https://en.wikipedia.org/wiki/Document_retrieval). We will use the example sentences from the tutorial. How close are these sentence pairs?

* [**Warning:** `pyemd` could not run on my Windows machine. Skip cell if you have problems running this]

In [None]:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()
sentence_orange = 'Oranges are my favorite fruit'.lower().split()

similarity1 = glove_50.wmdistance(sentence_obama, sentence_president)
similarity2 = glove_50.wmdistance(sentence_obama, sentence_orange)
similarity3 = glove_50.wmdistance(sentence_orange, sentence_president)

print(f"{'[sentence_obama, sentence_president]':38s}: {similarity1:.4f}")
print(f"{'[sentence_obama, sentence_orange]':38s}: {similarity2:.4f}")
print(f"{'[sentence_orange, sentence_president]':38s}: {similarity3:.4f}")

<a name="evaluation"></a>
## 6. Evaluating Pre-trained Word Embeddings
We have seen how pre-trained embeddings are loaded and queried and the interesting information they capture. But we need to be able to evaluate them, i.e. to measure how useful they are likely to be in real world scenarios. These evaluations usually take one of two forms:

* *intrinsic evaluation* applies word vectors to *specific intermediate subtasks* (e.g. *analogy completion*, *word similarity and relatedness*). These subtasks are simple and show whether the model is behaving as expected, or not, and why.

* *extrinsic evaluation* applies the vectors to a real world task, which is typically elaborate and slow to compute. Optimizing an underperforming system in this case is hard, so we often use intrinsic evaluations during development.
   
See this lecture note on [Evaluation of Word Vectors](https://cs224d.stanford.edu/lecture_notes/notes2.pdf) and  [Evaluating Word Embedding Models: Methods and Experimental Results](https://arxiv.org/pdf/1901.09785.pdf) for details on evaluation methods.

<a name="analogy"></a>
### 6a. Task1: Vector Analogy

This task is described in *Section 4.1* of the paper [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781v3.pdf).

<img src="questions-answers.png" width=600>

The image may not display on google colab, either move the image to your files or view it separately (questions-answers.png).

**Task Setup:**
* *Aim:* To measure the impact of vector `size` on model quality or performance

* *Problem:* To answer semantic and syntactic questions such as `athens:greece::oslo:?` (*norway*) or `walking:walked::swimming:?` (*swam*), `big:bigger::small:?` (*smaller*)

* *Dataset:* In `data/analogy/questions-words.txt`. Categories: 5 semantic and 9 syntactic (see Table 1), with 8869 semantic and 10675 syntactic questions. **[10% of each category's questions should be enough.]**
 
* *Method:* For `big:bigger::small:?`, compute `answer = vector("bigger") − vector("big") + vector("small")` and find the word whose vector is closest to `answer` by *cosine similarity* (use a library function). The question is answered correctly if the *most similar* word is *smaller*; otherwise it is wrong.
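
The sketch below shows one way to organise the questions before answering them. It assumes the file follows the layout of the standard `questions-words.txt` distributed with word2vec: a line starting with `:` names a category, each following line holds four words, and the syntactic categories have names beginning with `gram`. The toy lines stand in for the real file, and `split_categories` and `question_type` are only suggestions for how such helpers might look.

```python
toy_lines = [  # stand-in for the contents of the real file
    ": capital-common-countries",
    "Athens Greece Oslo Norway",
    ": gram1-adjective-to-adverb",
    "amazing amazingly calm calmly",
    "",
]

def split_categories(lines):
    """Group four-word questions under the most recent ': category' header."""
    categories = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            current = line[1:].strip()
            categories[current] = []
        else:
            # Lowercased on the assumption that the embedding vocabulary is lowercase.
            categories.setdefault(current, []).append(line.lower().split())
    return categories

def question_type(category):
    """Assumed naming convention: syntactic categories start with 'gram'."""
    return "syntactic" if category.startswith("gram") else "semantic"

categories = split_categories(toy_lines)
for name, questions in categories.items():
    print(name, question_type(name), questions)
```

Each question is then a list `[a, b, c, expected]`, so the first three words can be passed to an analogy query and the fourth compared with the returned answer.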

---
<a name="ex1"></a>
## Exercise 1: Vector Analogy
* Compare the performance of the two models on the analogy task defined above:

        glove_50 = api.load("glove-wiki-gigaword-50")
        glove_100 = api.load("glove-wiki-gigaword-100")

Show the:

  - Overall performance on all questions (selected 10% from each category)
  - Performance for each question type (i.e. semantic, syntactic)
  - Performance for each category (e.g. `capital-common-countries`, `family`)

* It's entirely up to you how you do this but you may need functions similar to these
  - `getAnswer(question)` function
  - `getScore(y_test, y_pred)` function
  - `split_categories(questions)` or `split_types(questions)` function

In [None]:
# see the file structure
questions_words = open(working_folder + "data/analogy/questions-words.txt", 'r', encoding='utf8').read().split("\n")
questions_words[:10]

In [None]:
# Task 1: Your solution here...

<a name="sentiments"></a>
### 6b. Task2: Word Sentiment Classification

In this exercise, we train a classifier that predicts the sentiment - *positive* or *negative* - of a given word using the sentiment lexicon on [Bing Liu's website](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon). It is actually a modified reproduction of the experiment presented [in this interesting blog](http://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/) about de-biasing embedding models. Following the same steps, we will: 

* Use our saved embeddings in `data/word_vectors/`
* Get training and test data, with gold-standard examples of *positive* and *negative* words
* Train a classifier to classify words as *positive* or *negative* given vector representations
* Compute sentiment scores for texts with the classifier

### Loading the word vectors models

In [None]:
# Load a DataFrame from the generalized text format 
# Author: Robyn Speer
def load_embeddings(filename):
    rows, labels = [], []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)
    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

glove_50 = load_embeddings(working_folder + 'data/embedding_model/glove.6B.50d.txt')
glove_50

### Loading a sentiment lexicon

We use a gold-standard sentiment lexicon [(Hu and Liu, 2004)](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon) to train a word sentiment classifier which uses only the embedding vectors as its features. The cell below will load the positive and negative sentiment words into the variables `pos_words` and `neg_words`.

In [None]:
# Load a sentiment lexicon: one word per line, lines starting with ';' are comments
# Author: Robyn Speer
def load_lexicon(filename):
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon(working_folder + 'data/sentiment_lexicon/positive-words.txt')
neg_words = load_lexicon(working_folder +'data/sentiment_lexicon/negative-words.txt')

### Train-Test Split
Simultaneously separate the input vectors, output values, and labels into training and test data, with 10% of the data used for testing. Our classes are `positive` and `negative` words. Words that are not in the embedding, e.g. misspelt words (“fancinating”), end up as rows of missing values after `reindex`, and `.dropna()` removes them. The rest are split into training and test data.

Here we will start with the `glove_50`, the inputs are the embeddings, and the outputs are `1` for *positive* words and `-1` for negative words. We keep track of the words they’re labeled with, so we can interpret the results. Your task will involve comparing our results here with the ones from using `glove_100`.

In [None]:
pos_vectors = glove_50.reindex(pos_words).dropna()
neg_vectors = glove_50.reindex(neg_words).dropna()

vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])

labels = list(pos_vectors.index) + list(neg_vectors.index)

train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
    train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)

### Training and Evaluation
We use the `SGDClassifier`, setting only the `loss`, `random_state` and `max_iter` parameters. We use a logistic function as the loss, so that the resulting classifier can output the probability that a word is positive or negative.

In [None]:
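# On scikit-learn >= 1.1 the logistic loss is called 'log_loss' ('log' was removed in 1.3)
model = SGDClassifier(loss='log', random_state=0, max_iter=100)
model.fit(train_vectors, train_targets)
# accuracy_score expects (y_true, y_pred)
accuracy_score(test_targets, model.predict(test_vectors))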

### Words and their sentiment scores
Apologies for some really horrible ones.

In [None]:
def vecs_to_sentiment(vecs):
    # predict_log_proba gives the log probability for each class
    predictions = model.predict_log_proba(vecs)

    # To see an overall positive vs. negative classification in one number,
    # we take the log probability of positive sentiment minus the log
    # probability of negative sentiment.
    return predictions[:, 1] - predictions[:, 0]

def words_to_sentiment(words):
    vecs = glove_50.reindex(words).dropna()
    log_odds = vecs_to_sentiment(vecs)
    return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)

# Show 20 examples from the test set
words_to_sentiment(test_labels).iloc[:20]
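
The score returned by `vecs_to_sentiment` is the difference of two log probabilities, which is the log-odds of the positive class. The short sketch below works this out for a few made-up probabilities; `log_odds` is a hypothetical helper and the probabilities do not come from the classifier.

```python
import math

def log_odds(p_positive):
    """log P(positive) - log P(negative) for a two-class problem."""
    return math.log(p_positive) - math.log(1.0 - p_positive)

for p in (0.9, 0.5, 0.1):  # illustrative probabilities
    print(f"P(positive)={p:.1f} -> score {log_odds(p):+.3f}")
```

A score above zero means the classifier considers the word more likely positive than negative, a score of zero means it is undecided, and the magnitude grows as the classifier becomes more confident.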

### Classifying texts
We can pass the tokenised text (a list of words) to the `words_to_sentiment` function as above and take the average of the returned sentiment scores (which are computed from `predict_log_proba`).

In [None]:
TOKEN_RE = re.compile(r"\w.*?\b")
# The regex above finds tokens that start with a word-like character (\w), and continues
# matching characters lazily (.*?) until the next word break (\b). It's a relatively simple
# expression that manages to extract something very much like words from text.

def text_to_sentiment(text):
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    sentiments = words_to_sentiment(tokens)
    return sentiments['sentiment'].mean()

s1 = text_to_sentiment("this example is pretty cool")
s2 = text_to_sentiment("food so tasteless dont go there")
s1,s2
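
To see what the tokenising pattern does on its own, and how the averaging behaves when some tokens have no score, here is a small standalone sketch. The scores in `toy_scores` are invented and `mean_score` is a hypothetical stand-in for `text_to_sentiment` that needs no trained model.

```python
import re

TOKEN_RE = re.compile(r"\w.*?\b")

def tokenize(text):
    """Extract word-like tokens and lowercase them."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]

toy_scores = {"this": 0.0, "pretty": 1.0, "cool": 2.0}  # invented values

def mean_score(text, scores):
    """Average the scores of known tokens; tokens without a score are skipped."""
    known = [scores[tok] for tok in tokenize(text) if tok in scores]
    return sum(known) / len(known) if known else float("nan")

print(tokenize("We're winding down!"))
print(mean_score("this example is pretty cool", toy_scores))
```

Note that the pattern splits contractions at the apostrophe (`We're` becomes `we` and `re`), and that unknown tokens silently drop out of the average, so a text made up mostly of out-of-vocabulary words is scored on very little evidence.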

### Classifying tweets
We can also do a similar thing with some positive and negative tweets and observe their returned scores.

In [None]:
pos = ["""Happy 50th Birthday Michelle Obama! 
                We're winding down our Friday with our 
                favorite FLOTUS quotes!""",
       """Jay Z 'pays tribute to Michael Jackson as he joins
                Instagram' with touching photo, it may be his last
                """
      ]

neg = ["""California May Label Monsanto's Roundup as 'Known to Cause
                Cancer' | Natural Society well it's about time!""", 
      """Why in the name of Monsanto am I sitting @ home on Saturday
         night trolling for used RVs on Craigslist..oh yeah,
         I have no date""",
      ] 
tweets = {
    'pos': pos,
    'neg': neg
}

In [None]:
def tweet_sentiment_table():
    sentiments={}
    for sentiment, tweet_list in sorted(tweets.items()):
        for i, tweet in enumerate(tweet_list):
            sentiments.setdefault(sentiment,[]).append(text_to_sentiment(tweet.lower()))
    return sentiments

tweet_sentiments = tweet_sentiment_table()
tweet_sentiments

### Scatterplot of tweet sentiments

In [None]:
for key in tweet_sentiments:
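    # one x position per score, so the plot still works if the lists change length
    plt.scatter([key] * len(tweet_sentiments[key]), tweet_sentiments[key], label=key)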
plt.legend()
plt.show()

---
<a name="ex2"></a>
## Exercise 2: Sentiment classification

* Using the embedded features in the two models in "data/word_vectors", and the sentiment lexicon used above, train a classifier for each model and present a comparative analysis of their performance on
    - word sentiment prediction
    - tweet sentiment classification using the positive and negative Twitter data in `data/sentiment_data`
    
* Train another classifier using the normal feature extraction methods used in the previous labs
* Compare the results with the above and discuss your findings.
---

In [None]:
# Task 2: Your solution here...