<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Text Classification using Doc2Vec embeddings
In this notebook we classify text using document embeddings obtained from a Doc2Vec model that we train on the corpus. The Doc2Vec algorithm was introduced in 2014 by Le and Mikolov to overcome the limitations of representing a document as a simple average of the Word2Vec vectors of its words, which ignores word order. Doc2Vec learns an embedding for each word in the document (as Word2Vec does) and also learns an additional vector representing the entire document, which contributes to the training predictions.
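
To see why plain averaging is limiting, note that an average is insensitive to word order. The sketch below uses invented 2-dimensional word vectors (the values are purely illustrative and do not come from a trained Word2Vec model) to show that two documents with opposite meanings receive the same averaged representation.

```python
# Toy word vectors; the numbers are invented for illustration only
word_vectors = {
    "dog":   [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man":   [1.0, 1.0],
}

def average_embedding(tokens):
    """Average the vectors of all tokens, dimension by dimension."""
    vectors = [word_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

doc_a = ["dog", "bites", "man"]
doc_b = ["man", "bites", "dog"]

# Same words in a different order give exactly the same average
print(average_embedding(doc_a) == average_embedding(doc_b))
```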

Our goal will be to classify the articles in the AgNews dataset into their correct category: "World", "Sports", "Business", or "Sci/Tec".

**Notes:**  
- This does not need to be run on GPU, but will take 5-10 minutes on CPU  

**References:**  
- Read the original [Doc2Vec paper](https://arxiv.org/pdf/1405.4053v2.pdf) by Le & Mikolov  
- Some portions of this code are from the [GENSIM docs Doc2Vec tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_doc2vec_lee.html)

In [1]:
import os
import numpy as np
import pandas as pd
import string
import time
from sklearn.linear_model import LogisticRegression
import urllib.request
import zipfile

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
#!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

import gensim

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Download the data
if not os.path.exists('../data'):
    os.mkdir('../data')
if not os.path.exists('../data/agnews'):
    url = 'https://storage.googleapis.com/aipi540-datasets/agnews.zip'
    urllib.request.urlretrieve(url,filename='../data/agnews.zip')
    zip_ref = zipfile.ZipFile('../data/agnews.zip', 'r')
    zip_ref.extractall('../data/agnews')
    zip_ref.close()

train_df = pd.read_csv('../data/agnews/train.csv')
test_df = pd.read_csv('../data/agnews/test.csv')

# Combine title and description of article to use as input documents for model
train_df['full_text'] = train_df.apply(lambda x: ' '.join([x['Title'],x['Description']]),axis=1)
test_df['full_text'] = test_df.apply(lambda x: ' '.join([x['Title'],x['Description']]),axis=1)

# Create dictionary to store mapping of labels
ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

train_df.head()

| Unnamed: 0 | Class Index | Title | Description | full_text |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | Wall St. Bears Claw Back Into the Black (Reute... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | Carlyle Looks Toward Commercial Aerospace (Reu... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | Oil and Economy Cloud Stocks' Outlook (Reuters... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | Iraq Halts Oil Exports from Main Southern Pipe... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | Oil prices soar to all-time record, posing new... |


In [3]:
# View a couple of the documents
for i in range(5):
    print(train_df.iloc[i]['full_text'])
    print()

Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.

Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.

Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.

Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.

Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world

## Pre-process text
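
The cells below rely on `gensim.utils.simple_preprocess` to turn each document into a list of tokens. As a rough, standard-library-only approximation of what that step does (lowercasing, keeping runs of non-digit word characters, and dropping tokens shorter than 2 or longer than 15 characters), consider the sketch below. The function name `approx_simple_preprocess`, the regular expression, and the sample sentence are assumptions made for illustration; this is not gensim's own code.

```python
import re

# Runs of word characters that are not digits (assumed to mirror gensim's tokenizer)
TOKEN_PATTERN = re.compile(r"(?:(?!\d)\w)+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Lowercase, split into alphabetic tokens, and drop very short or long ones."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# Made-up sentence resembling one of the news articles
sample = "A second team of rocketeers competing for the $10 million Ansari X Prize"
print(approx_simple_preprocess(sample))
```

Single-letter words such as "A" and "X" and the number "10" are dropped, which is why tokens like these are absent from the tokenized documents.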

In [4]:
# Function to generate gensim tokens from a text corpus
import smart_open
def read_corpus(fname, tokens_only=False):
    with smart_open.open(fname, encoding="iso-8859-1") as f:
        for i, line in enumerate(f):
            tokens = gensim.utils.simple_preprocess(line)
            if tokens_only:
                yield tokens
            else:
                # For training data, add tags
                yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

# Function to generate gensim tokens from a list of text docs
def read_corpus_from_list(lst, tokens_only=False):
    for i, line in enumerate(lst):
        tokens = gensim.utils.simple_preprocess(line)
        if tokens_only:
            yield tokens
        else:
            # For training data, add tags
            yield gensim.models.doc2vec.TaggedDocument(tokens, [i])

In [5]:
# Read in the training corpus
corpus = list(read_corpus_from_list(train_df['full_text'].tolist()))

# Tokenize the training and test sets
train_tokens = list(read_corpus_from_list(train_df['full_text'].tolist(),tokens_only=True))
test_tokens = list(read_corpus_from_list(test_df['full_text'].tolist(),tokens_only=True))

In [6]:
test_tokens[:10]

[['fears',
  'for',
  'pension',
  'after',
  'talks',
  'unions',
  'representing',
  'workers',
  'at',
  'turner',
  'newall',
  'say',
  'they',
  'are',
  'disappointed',
  'after',
  'talks',
  'with',
  'stricken',
  'parent',
  'firm',
  'federal',
  'mogul'],
 ['the',
  'race',
  'is',
  'on',
  'second',
  'private',
  'team',
  'sets',
  'launch',
  'date',
  'for',
  'human',
  'spaceflight',
  'space',
  'com',
  'space',
  'com',
  'toronto',
  'canada',
  'second',
  'team',
  'of',
  'rocketeers',
  'competing',
  'for',
  'the',
  'million',
  'ansari',
  'prize',
  'contest',
  'for',
  'privately',
  'funded',
  'suborbital',
  'space',
  'flight',
  'has',
  'officially',
  'announced',
  'the',
  'first',
  'launch',
  'date',
  'for',
  'its',
  'manned',
  'rocket'],
 ['ky',
  'company',
  'wins',
  'grant',
  'to',
  'study',
  'peptides',
  'ap',
  'ap',
  'company',
  'founded',
  'by',
  'chemistry',
  'researcher',
  'at',
  'the',
  'university',
  'of',
  

## Create document embeddings
Our first step is to build the vocabulary.  Essentially, the vocabulary is a list (accessible via `model.wv.index_to_key`) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the `model.wv.get_vecattr()` method.

After our vocabulary is built, we train our embedding model using the training corpus.
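
Conceptually, building the vocabulary amounts to counting every token in the corpus and discarding words that occur fewer than `min_count` times. The plain-Python sketch below illustrates that idea on an invented `toy_corpus`; it is not gensim's implementation, and it only mimics the frequency threshold and the most-frequent-first ordering.

```python
from collections import Counter

# Tiny tokenized corpus, invented for illustration
toy_corpus = [
    ["oil", "prices", "soar"],
    ["oil", "exports", "halted"],
    ["stocks", "outlook", "prices", "oil"],
]

min_count = 2  # same threshold as the Doc2Vec model in this notebook

# Count every token, then keep only words seen at least min_count times
counts = Counter(token for doc in toy_corpus for token in doc)
vocab = [word for word, n in counts.most_common() if n >= min_count]

print(vocab)                          # words ordered from most to least frequent
print({w: counts[w] for w in vocab})  # per-word counts
```

With the trained model, the analogous information should be available through `model.wv.index_to_key` and `model.wv.get_vecattr(word, 'count')`.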

In [7]:
# Build vocabulary and train embedding model on the training corpus
# dm defaults to 1 (PV-DM), which trains word and document vectors jointly from scratch;
# dbow_words only matters in PV-DBOW mode (dm=0), where 0 skips word-vector training
doc2vec_model = gensim.models.doc2vec.Doc2Vec(vector_size=25, min_count=2, dbow_words=0, epochs=20)
doc2vec_model.build_vocab(corpus)
doc2vec_model.train(corpus, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)



Now that our embedding model is trained, we can use it to get the embeddings for the training and test sets using `model.infer_vector()` for each document in the set.  We can then use the vectors as representations of the documents to do things such as evaluate similarity using cosine similarity.
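
As a minimal sketch of the cosine similarity mentioned above, the function below computes the cosine of the angle between two vectors. The 3-dimensional vectors are invented stand-ins for document embeddings; the helper name `cosine_similarity` is our own and is not taken from gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional "document vectors" for illustration
doc_a = [0.4, -0.2, 0.1]
doc_b = [0.8, -0.4, 0.2]   # same direction as doc_a, twice the length
doc_c = [-0.4, 0.2, -0.1]  # opposite direction to doc_a

print(cosine_similarity(doc_a, doc_b))  # close to 1.0
print(cosine_similarity(doc_a, doc_c))  # close to -1.0
```

Because the measure depends only on direction, not length, it should also work on pairs of inferred vectors such as `X_test[0]` and `X_test[1]` once they have been computed.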

In [8]:
# Use the embedding model to get the embedding vectors for the training and test sets
X_train = [doc2vec_model.infer_vector(doc) for doc in train_tokens]
X_test = [doc2vec_model.infer_vector(doc) for doc in test_tokens]

In [9]:
X_test[0]

array([ 0.4254839 , -0.18550114, -0.01266643,  0.23160307, -0.13381748,
       -0.18228108,  0.36365193,  0.25047034, -0.06107698,  0.22020012,
       -0.28994694,  0.10834277, -0.12223433,  0.03980525, -0.27706906,
        0.16899094, -0.15330431, -0.4889337 , -0.36030567, -0.2705503 ,
        0.7236407 ,  0.150491  , -0.51067156,  0.07982385,  0.3432706 ],
      dtype=float32)

## Train classification model
Finally, we will use our embeddings as features to train a softmax regression model to classify the documents.
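
Softmax regression (multinomial logistic regression) computes one linear score per class and converts the scores into probabilities with the softmax function. The sketch below shows only that final step, using hypothetical scores for the four AgNews classes; whether scikit-learn's `LogisticRegression` fits exactly this multinomial form depends on the library version and solver settings.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the four AgNews classes (indices 1 to 4)
scores = [2.0, 0.5, 1.0, -1.0]
probs = softmax(scores)

# The predicted class is the one with the highest probability
predicted_class = probs.index(max(probs)) + 1  # class indices start at 1
print(probs, predicted_class)
```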

In [10]:
# Train a classification model using logistic regression classifier
y_train = train_df['Class Index']
logreg_model = LogisticRegression(solver='saga')
logreg_model.fit(X_train,y_train)
preds = logreg_model.predict(X_train)
acc = sum(preds==y_train)/len(y_train)
print('Accuracy on the training set is {:.3f}'.format(acc))

Accuracy on the training set is 0.763


## Evaluate model performance

In [11]:
# Evaluate performance on the test set
y_test = test_df['Class Index']
preds = logreg_model.predict(X_test)
acc = sum(preds==y_test)/len(y_test)
print('Accuracy on the test set is {:.3f}'.format(acc))

Accuracy on the test set is 0.757
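
Overall accuracy can hide differences between categories. As one possible follow-up, the sketch below computes accuracy separately for each true class. The helper `per_class_accuracy` and the demo labels are invented for illustration; the label mapping repeats the one defined earlier in the notebook.

```python
from collections import defaultdict

ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}

def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified documents for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return {ag_news_label[c]: correct[c] / total[c] for c in sorted(total)}

# Invented labels and predictions, for illustration only
y_true_demo = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred_demo = [1, 3, 2, 2, 3, 4, 4, 4]

print(per_class_accuracy(y_true_demo, y_pred_demo))
```

In this notebook, calling `per_class_accuracy(y_test, preds)` should give the same breakdown for the actual test predictions.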
