# Word2Vec for Text Classification

In this short notebook, we use a pre-trained Word2Vec model to extract features from text and then train a text classifier on those features.

We will use the Sentiment Labelled Sentences dataset from the UCI repository:
http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

The dataset consists of 1500 positive and 1500 negative sentences from Amazon, Yelp, and IMDB. Let us first combine the three separate data files into one using the following Unix command:

```cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt```

For a pre-trained embedding model, we will use the Google News vectors.
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

Let us get started!

In [1]:
# To install only the requirements of this notebook, run this cell

# ===========================

!pip install numpy==1.19.5
!pip install pandas==1.1.5
!pip install gensim==3.8.3
!pip install wget==3.2
!pip install nltk==3.5
!pip install scikit-learn==0.21.3

# ===========================

Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.3
Collecting wget==3.2
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9673 sha256=4877de9e41ccfba395a6bc044ccad7ba2ea4f6324ca63bbf9da41b644eb8efea
  Stored in directory: /root/.cache/pip/wheels/a1/b6/7c/0e63e34eb06634181c63adacca38b79ff8f35c37e3c13e3c02
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Collecting nltk==3.5
  Downloading nltk-3.5.zip (1.4 MB)

In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try:
#     import google.colab
#     !curl  https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch4/ch4-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError:
#     !pip install -r "ch4-requirements.txt"

# ===========================

In [3]:
#basic imports
import warnings
warnings.filterwarnings('ignore')
import os
import wget
import gzip
import shutil
from time import time

#pre-processing imports
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
try:
    from google.colab import files
    
    # upload 'amazon_cells_labelled.txt', 'imdb_labelled.txt' and 'yelp_labelled.txt' present in "sentiment labelled sentences" folder
    uploaded = files.upload()
    
    !mkdir DATAPATH
    !mv -t DATAPATH amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt
    !cat DATAPATH/amazon_cells_labelled.txt DATAPATH/imdb_labelled.txt DATAPATH/yelp_labelled.txt > DATAPATH/sentiment_sentences.txt
    
except ModuleNotFoundError:

    # Running locally: build Data/sentiment_sentences.txt from the three source files
    out_path = os.path.join('Data', 'sentiment_sentences.txt')

    if not os.path.exists(out_path):
        # combine the three files to make sentiment_sentences.txt
        filenames = ['amazon_cells_labelled.txt', 'imdb_labelled.txt', 'yelp_labelled.txt']

        with open(out_path, 'w') as outfile:
            for fname in filenames:
                with open(os.path.join('Data', 'sentiment labelled sentences', fname)) as infile:
                    outfile.write(infile.read())
        print("File created")
    else:
        print("File already exists")

Saving amazon_cells_labelled.txt to amazon_cells_labelled.txt
Saving imdb_labelled.txt to imdb_labelled.txt
Saving yelp_labelled.txt to yelp_labelled.txt


In [5]:
#Load the pre-trained word2vec model and the dataset
try:
    
    from google.colab import files
    data_path= "DATAPATH"
    !wget -P DATAPATH https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
    !gunzip DATAPATH/GoogleNews-vectors-negative300.bin.gz      
    path_to_model = 'DATAPATH/GoogleNews-vectors-negative300.bin'
    training_data_path = "DATAPATH/sentiment_sentences.txt"
    
except ModuleNotFoundError:
    
    data_path= "Data"
    
    if not os.path.exists('GoogleNews-vectors-negative300.bin'):
        if not os.path.exists('../Ch2/GoogleNews-vectors-negative300.bin'):
            if not os.path.exists('../Ch3/GoogleNews-vectors-negative300.bin'):
                wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")

                with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in:
                    with open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
                        shutil.copyfileobj(f_in, f_out)

                path_to_model = 'GoogleNews-vectors-negative300.bin'
            else:
                path_to_model = '../Ch3/GoogleNews-vectors-negative300.bin'

        else:
            path_to_model = '../Ch2/GoogleNews-vectors-negative300.bin'
    else:
        path_to_model = 'GoogleNews-vectors-negative300.bin'
        
    training_data_path = os.path.join(data_path, "sentiment_sentences.txt")
    
    
#Load W2V model. This will take some time. 
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')

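#Read the text data and the categories (cats).
#Each line of the file holds a sentence and its category, separated by a tab.
texts = []
cats = []
with open(training_data_path) as fh:
    for line in fh:
        #strip the trailing newline so the categories are "0"/"1" rather than "0\n"/"1\n"
        text, sentiment = line.rstrip("\n").split("\t")
        texts.append(text)
        cats.append(sentiment)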

--2021-07-20 08:36:30--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.130.248
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.130.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘DATAPATH/GoogleNews-vectors-negative300.bin.gz’


2021-07-20 08:37:16 (34.1 MB/s) - ‘DATAPATH/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]

CPU times: user 19.6 s, sys: 3.11 s, total: 22.7 s
Wall time: 35.2 s
done loading Word2Vec


In [6]:
#Inspect the model
word2vec_vocab = w2v_model.vocab.keys()
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000


In [7]:
#Inspect the dataset
print(len(cats), len(texts))
print(texts[1])
print(cats[1])

3000 3000
Good case, Excellent value.
1
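
Before building features, it can help to confirm that the labels parsed cleanly and that the two classes are balanced. The sketch below is a minimal illustration on a few sample lines in the same tab-separated format; `parse_labelled_lines` and the sample sentences (other than the one printed above) are hypothetical and not part of the original notebook.

```python
from collections import Counter

def parse_labelled_lines(lines):
    """Split 'sentence<TAB>label' lines into parallel lists of texts and labels."""
    texts, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split on the last tab only, in case a sentence itself contains a tab
        text, label = line.rsplit("\t", 1)
        texts.append(text.strip())
        labels.append(label.strip())
    return texts, labels

# Illustrative lines only; the second and third sentences are made up.
sample_lines = [
    "Good case, Excellent value.\t1\n",
    "Works exactly as described.\t1\n",
    "Stopped charging after a week.\t0\n",
]

sample_texts, sample_labels = parse_labelled_lines(sample_lines)
label_counts = Counter(sample_labels)
print(label_counts)
```

Applied to the notebook's own `cats` list, `Counter(cats)` would be expected to show 1500 examples per class, matching the dataset description.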



In [8]:
#preprocess the text.
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        #Nested function that lowercases, removes stopwords and digits from a list of tokens
        return [token.lower() for token in tokens if token.lower() not in mystopwords and not token.isdigit()
               and token not in punctuation]
    #The return statement below applies the function above to the output of word_tokenize.
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

texts_processed = preprocess_corpus(texts)
print(len(cats), len(texts_processed))
print(texts_processed[1])
print(cats[1])

3000 3000
['good', 'case', 'excellent', 'value']
1
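
Because the preprocessing step lowercases every token while the Google News vectors are case-sensitive, some tokens may have no embedding. A rough way to see how much text is actually covered is to count the tokens found in the vocabulary. The helper below, `vocab_coverage`, is a hypothetical sketch that uses a small Python set as a stand-in for the model vocabulary.

```python
def vocab_coverage(token_lists, vocab):
    """Return (fraction of tokens found in vocab, sorted list of missing tokens)."""
    total = 0
    found = 0
    missing = set()
    for tokens in token_lists:
        for token in tokens:
            total += 1
            if token in vocab:
                found += 1
            else:
                missing.add(token)
    coverage = found / total if total else 0.0
    return coverage, sorted(missing)

# Toy stand-ins: a tiny vocabulary and two tokenized sentences (illustrative only).
toy_vocab = {"good", "case", "excellent", "value", "Paris"}
toy_sentences = [["good", "case", "excellent", "value"], ["paris", "good"]]

coverage, missing = vocab_coverage(toy_sentences, toy_vocab)
print(round(coverage, 3), missing)
```

In the notebook, passing `texts_processed` and `w2v_model` should work in the same way, assuming the model supports the `in` operator as it is used in the feature-extraction cell.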



In [9]:
# Create one feature vector per sentence by averaging the embeddings of its words
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this = np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        if count_for_this != 0:
            feats.append(feat_for_this / count_for_this)
        else:
            # no token had an embedding: fall back to the zero vector
            feats.append(zero_vector)
    return feats


train_vectors = embedding_feats(texts_processed)
print(len(train_vectors))

3000
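
To make the averaging step concrete, here is a small self-contained sketch with made-up 3-dimensional vectors in place of the 300-dimensional Word2Vec embeddings. It uses plain Python lists so it runs without the model; `mean_embedding` and `toy_vectors` are hypothetical names for illustration.

```python
def mean_embedding(tokens, vectors, dimension):
    """Average the vectors of tokens present in `vectors`; all zeros if none are."""
    total = [0.0] * dimension
    count = 0
    for token in tokens:
        if token in vectors:
            for i, value in enumerate(vectors[token]):
                total[i] += value
            count += 1
    if count == 0:
        return total  # no known tokens: return the zero vector
    return [value / count for value in total]

# Made-up toy embeddings, not real Word2Vec values.
toy_vectors = {"good": [1.0, 0.0, 2.0], "value": [3.0, 2.0, 0.0]}

known = mean_embedding(["good", "unknownword", "value"], toy_vectors, 3)
unknown = mean_embedding(["unknownword"], toy_vectors, 3)
print(known)    # [2.0, 1.0, 1.0]
print(unknown)  # [0.0, 0.0, 0.0]
```

Tokens without an embedding are skipped rather than counted, so they do not pull the average toward zero.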


In [10]:
#Take any classifier (LogisticRegression here) and train/test it like before.
classifier = LogisticRegression(random_state=1234)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
classifier.fit(train_data, train_cats)
print("Accuracy: ", classifier.score(test_data, test_cats))
preds = classifier.predict(test_data)
print(classification_report(test_cats, preds))

Accuracy:  0.8453333333333334
              precision    recall  f1-score   support

           0       0.87      0.83      0.85       388
           1       0.82      0.86      0.84       362

    accuracy                           0.85       750
   macro avg       0.85      0.85      0.85       750
weighted avg       0.85      0.85      0.85       750
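
The precision, recall, and F1 columns in the report above are computed per class from counts of correct and incorrect predictions. The sketch below reproduces that arithmetic on a tiny made-up example; `precision_recall_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels and predictions, not taken from the notebook's test split.
toy_true = ["1", "1", "1", "0", "0", "0"]
toy_pred = ["1", "1", "0", "0", "0", "1"]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive="1")
print(precision, recall, f1)
```

Here class "1" has two true positives, one false positive, and one false negative, so all three metrics come out to 2/3.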



Not bad. With little effort we got about 85% accuracy in this run. That's a great starting model to have!