
# Using Word2Vec to Solve Document Classification Problem 

Acknowledgement: This notebook is provided by Intel AI Developer Program

### Introduction

Word2Vec represents each word in a vocabulary as a dense numeric vector that captures features of how the word is used. 
Once we have vectors for our examples, we can apply machine learning (both supervised and unsupervised).
Here we use these vectors to solve a classification problem.

In this exercise, we will be using a well-known text dataset to explore the capabilities of Word2Vec:
- [20 Newsgroups Dataset](http://qwone.com/~jason/20Newsgroups/): Famous text classification dataset from user discussion forums with 20 classes, including...
  
To perform our tasks, we will both derive our own **word vectors** from the data and borrow Google's massive set of word vectors trained on the web ([Google Vectors]()).

In [1]:
# NLP tools
import nltk
import gensim

# Data tools
import numpy as np
import pandas as pd

# Necessary for adding accessory_functions module to path
import os, sys
lib_path = os.path.abspath(os.path.join('..', '..'))
sys.path.append(lib_path)
from accessory_functions import google_vec_file, nltk_path


### Define a function to derive document vectors
Consider the case of document classification. Word2Vec gives us vectors for words, but the examples we need to classify are documents. How do we get a vector representation for a whole document? The most common answer is to take the average of all the word vectors in the document. Let's try that with our sample data.
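
As a toy illustration of the averaging idea, the sketch below uses a hypothetical three-dimensional vocabulary (`toy_vectors`, invented for this example) rather than a real Word2Vec model. Words without a vector are skipped, and the remaining vectors are averaged component-wise.

```python
import numpy as np

# Hypothetical word vectors, invented for illustration only
toy_vectors = {
    "space": np.array([1.0, 0.0, 2.0]),
    "shuttle": np.array([3.0, 2.0, 0.0]),
}

def average_vector(words, vectors):
    """Average the vectors of the words that have one; return None if none do."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# "zzz" has no vector, so only "space" and "shuttle" contribute
doc_vec = average_vector(["space", "shuttle", "zzz"], toy_vectors)
```

Here `doc_vec` is the component-wise mean of the vectors for "space" and "shuttle"; the unknown word "zzz" is ignored.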



The function below returns the document vector for a single document; it will be used later on.

In [2]:
# Take a document as a list of words, together with the Word2Vec model (keyed vectors).
# Each word is checked against the model; words that have a vector are added to good_words.
# The mean of the vectors of all good words is returned as the document vector.
# If none of the words has a vector, None is returned.

def get_doc_vec(words, model):
    good_words = []   # words whose vectors are available in the model
    for word in words:
        # Looking up a word that is not in the model raises a KeyError
        try:
            if model[word] is not None:
                good_words.append(word)
        except KeyError:
            continue
    # If no words are in the model
    if len(good_words) == 0:
        return None
    # Return the mean of the vectors for the words in good_words
    # ref to https://www.geeksforgeeks.org/numpy-mean-in-python/ for documentation
    return model[good_words].mean(axis=0)

### Load Data and Prepare The Data

We will be using a portion of the well-known 20 Newsgroups data set, which contains approximately 20,000 posts partitioned evenly across 20 different newsgroups. Our sample contains 5 topics and about 3,000 posts.

The cell below fetches the input data for the five selected topics.

In [3]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from accessory_functions import preprocess_series_text, nltk_path

topic_list = ['sci.space', 'comp.sys.mac.hardware', 'rec.autos',
              'rec.sport.baseball', 'sci.med']

 
# Retrieve the data
# The result is dictionary-like: the key 'data' holds a list of strings containing the post text
dataset = fetch_20newsgroups(shuffle=True, random_state=1, data_home='./data',
                             categories=topic_list,
                             remove=('headers', 'footers', 'quotes'))


Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


The cell below loads the input data into a DataFrame and preprocesses the text.

In [4]:
ng_data = pd.DataFrame(dataset['data'], columns=['text'])
ng_data['label'] = dataset['target']

# Preprocess
# The function preprocess_series_text() is defined in the accessory_functions.py file
ng_data['text'] = preprocess_series_text(ng_data.text, nltk_path=nltk_path)

print(len(ng_data))
ng_data.head(10)

2956


| | text | label |
| --- | --- | --- |
| 0 | otoh u get lucky unplugged replugged scsi adb ... | 0 |
| 1 | yes everyone else may wonder fred well would o... | 4 |
| 2 | umm perhaps could explain right talk | 4 |
| 3 | like alomar like differ opinion city likely po... | 2 |
| 4 | wow know uranus long way think far away | 4 |
| 5 | outbreak chronic mono like entity originally c... | 3 |
| 6 | couple question multimedia set anybody phone f... | 0 |
| 7 | sure dietician date crohn ulcerative colitis p... | 3 |
| 8 | seek recommendation vendor networkable fax wou... | 0 |
| 9 | mlb standing score friday april include yester... | 2 |


### Create a Custom Word2Vec model 



In this section, you will train a custom Word2Vec model based on sentences from the documents.

#### Train a Word2Vec model to generate word vectors from the 20 Newsgroups data. 
* Split documents into sentences
* Split each sentence into a list of words, since gensim requires the training documents to be represented as a list of tokenized sentences
* Instantiate a new Word2Vec object
* Use your custom Word2Vec model to generate document vectors from these word vectors.
* Combine these vectors with the 20 Newsgroups class labels to create a DataFrame for classification.
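
The input format and the effect of `min_count` can be sketched without gensim. The snippet below only approximates the vocabulary filtering step: it uses a made-up two-document corpus and a hypothetical helper `build_vocab`, and it keeps words whose total count reaches the threshold, which is how the `min_count` parameter is documented to behave.

```python
from collections import Counter

# Hypothetical mini-corpus: each document is an already preprocessed string
docs = ["space shuttle launch space", "mac hardware scsi space"]

# Word2Vec expects an iterable of token lists, one list per sentence
sentences = [doc.split() for doc in docs]

def build_vocab(sentences, min_count):
    """Keep words whose total frequency is at least min_count (illustrative helper)."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

vocab = build_vocab(sentences, min_count=2)
```

With `min_count=2` only "space" survives. Words dropped this way have no vector in the trained model, which is why `get_doc_vec` has to skip them.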


In [5]:
from gensim.models import Word2Vec
# Generate tokenized sentences for training Word2Vec (each document is treated as one sentence)
sentences = ng_data.text.str.split()
# Train a Word2Vec model (ng_model ==> newsgroup model)
# Note: in gensim 4.0 and later, the `size` argument is named `vector_size`
ng_model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
# Print a summary of the custom ng_model
print(ng_model)


Word2Vec(vocab=6005, size=100, alpha=0.025)


In [6]:
# Did it work? Check out similar words to baseball, airplane etc
ng_model.wv.most_similar('airplane')

[('full', 0.9991823434829712),
 ('plant', 0.9991742372512817),
 ('self', 0.9991512298583984),
 ('rate', 0.999147355556488),
 ('fluid', 0.999129056930542),
 ('consideration', 0.9991122484207153),
 ('burst', 0.9991054534912109),
 ('country', 0.9991052150726318),
 ('tax', 0.9991040825843811),
 ('gun', 0.9991024732589722)]


### Derive Document Vectors from Word Vectors
Now that the custom Word2Vec model is created, we can pass it, together with each document, to the function `get_doc_vec` (defined earlier) to obtain a vector that represents the document.

In [7]:
# Make a copy of the newsgroup dataframe for the custom word vectors
ng_data1 = ng_data.copy()

# Retrieve the document vectors based on newsgroup word vectors
ng_vecs = ng_data1.text.str.split().map(lambda x: get_doc_vec(x, ng_model.wv))

# Add to dataframe
ng_data1['vecs'] = ng_vecs

# Drop the bad docs
ng_data1 = ng_data1.dropna()

# Create a Numpy array of the document vectors
ng_np_vecs = np.zeros((len(ng_data1), 100))
for i, vec in enumerate(ng_data1.vecs):
    ng_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
ng_w2v_data = pd.concat([ng_data1.reset_index().label, pd.DataFrame(ng_np_vecs)], axis=1)
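
The loop above fills a preallocated array row by row. An equivalent way to think about this step, shown here on made-up data, is to discard the documents that have no vector and stack the rest with `np.vstack`. The names `doc_vecs` and `labels` are illustrative only.

```python
import numpy as np

# Made-up document vectors; None marks a document with no known words
doc_vecs = [np.array([0.1, 0.2]), None, np.array([0.3, 0.4])]
labels = [0, 4, 2]

# Filter vectors and labels together so they stay aligned
kept = [(vec, label) for vec, label in zip(doc_vecs, labels) if vec is not None]

# One row per remaining document
X = np.vstack([vec for vec, _ in kept])
y = np.array([label for _, label in kept])
```

Filtering the vectors and labels as pairs keeps each row of `X` matched to the right entry of `y`, which is the same reason the notebook resets the index before concatenating.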



In [8]:
ng_data1.head()

| | text | label | vecs |
| --- | --- | --- | --- |
| 0 | otoh u get lucky unplugged replugged scsi adb ... | 0 | [-0.3021288, 0.66113096, -0.48783496, 0.263959... |
| 1 | yes everyone else may wonder fred well would o... | 4 | [-0.32507056, 0.6875524, -0.47469854, 0.254792... |
| 2 | umm perhaps could explain right talk | 4 | [-0.36925447, 0.79559404, -0.5978705, 0.325746... |
| 3 | like alomar like differ opinion city likely po... | 2 | [-0.32073867, 0.6705939, -0.5382954, 0.2905089... |
| 4 | wow know uranus long way think far away | 4 | [-0.3174214, 0.67041767, -0.5557123, 0.2942629... |


### Trained 20 Newsgroups Classifier
Now that the document vectors are set up, we can use the classification algorithms available in sklearn to build classification models based on the document vectors.


In [9]:
## Training a Classifier with our own trained vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Split the data
X_ng1 = ng_w2v_data.iloc[:, 1:]
y_ng1 = ng_w2v_data.label
X_train_ng1, X_test_ng1, y_train_ng1, y_test_ng1 = train_test_split(X_ng1, y_ng1, test_size=0.3)

# Train a KNN or Logistic Regression classifier
est = KNeighborsClassifier(algorithm='brute', metric='cosine')
# est = LogisticRegression()
est.fit(X_train_ng1, y_train_ng1)


KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='cosine',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
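
With `metric='cosine'`, neighbors are ranked by cosine distance, which is one minus the cosine similarity of two vectors, so only the direction of a vector matters and not its length. A minimal NumPy sketch of that distance, using made-up vectors and a hypothetical helper `cosine_distance`:

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors (illustrative helper)."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - similarity

query = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # longer, but points the same way
orthogonal = np.array([0.0, 2.0])       # shares no direction with the query

d_same = cosine_distance(query, same_direction)
d_orth = cosine_distance(query, orthogonal)
```

The vector pointing in the same direction is at distance 0 despite its different length, while the orthogonal vector is at distance 1.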

The score for the model is given below.

In [10]:
est.score(X_test_ng1, y_test_ng1)

0.6619718309859155
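
For a classifier, `score` reports mean accuracy: the fraction of test documents whose predicted label equals the true label. The made-up labels below show the calculation; `accuracy` is an illustrative helper, not part of sklearn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Made-up labels: three of the five predictions are correct
y_true = [0, 4, 2, 3, 0]
y_pred = [0, 4, 1, 3, 2]
acc = accuracy(y_true, y_pred)
```

With five roughly balanced classes, random guessing would score about 0.2, which gives some context for the score reported above.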

########################################################################

### Classification with Google Word2Vec

In this section, you will repeat the earlier steps, but this time using Google's pretrained model, and then perform a classification task.

* Use the Google vectors to generate document vectors for the 20 Newsgroups data.
* Combine these vectors with the 20 Newsgroups class labels to create a DataFrame for classification.
* Train a classification model for these five 20 Newsgroups classes and evaluate its performance.  
  **Hint**: Try a K-Nearest Neighbors Model.

Note the performance of the Google vectors vs your own Word2Vec training.


### Loading Google Word2Vec Vectors
* Load the Google vectors into an object `google_model` using `gensim` 
* This step will take a while, as it has to load 3 million vectors into the appropriate Word2Vec format.
* Google's model contains an extensive vocabulary.
* Confirm that you have 3 million vectors of length 300.

In [None]:
# Load the Google vectors
google_vec_file = "./data/GoogleNews-vectors-negative300.bin.gz"
google_model = gensim.models.KeyedVectors.load_word2vec_format(google_vec_file, binary=True)



### Derive Document Vectors from Word Vectors
Now that the Google Word2Vec model is loaded, we can pass it, together with each document, to the function `get_doc_vec` (defined earlier) to obtain a vector that represents the document.

In [None]:
# Make a copy of the newsgroup dataframe for the Google word vectors
ng_data2 = ng_data.copy()

# Retrieve the document vectors based on google word vectors
ng_google_vecs = ng_data2.text.str.split().map(lambda x: get_doc_vec(x, google_model))

# Add to dataframe
ng_data2['vecs'] = ng_google_vecs

# Drop the bad docs
ng_data2 = ng_data2.dropna()

# Create a Numpy array of the document vectors
ng_np_vecs = np.zeros((len(ng_data2), 300))
for i, vec in enumerate(ng_data2.vecs):
    ng_np_vecs[i, :] = vec
    
# Combine the full dataframe with the labels
ng_google_data = pd.concat([ng_data2.reset_index().label, pd.DataFrame(ng_np_vecs)], axis=1)

### Trained Word2Vec 20 Newsgroups Classifier
Now that the document vectors are set up, we can use the classification algorithms available in sklearn to build classification models based on the document vectors.
* Train a classification model for these five 20 Newsgroups classes and evaluate its performance. Hint: Try a K-Nearest Neighbors Model.

In [None]:
## Training a Classifier with Google's vectors
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Split the data
X_ng2 = ng_google_data.iloc[:, 1:]
y_ng2 = ng_google_data.label
X_train_ng2, X_test_ng2, y_train_ng2, y_test_ng2 = train_test_split(X_ng2, y_ng2, test_size=0.3)

# Train a KNN or Logistic Regression classifier
est = KNeighborsClassifier(algorithm='brute', metric='cosine')
# est = LogisticRegression()
est.fit(X_train_ng2, y_train_ng2)


In [None]:
est.score(X_test_ng2, y_test_ng2)

#### Question:
Compare the score for the Google Word2Vec model and the custom Word2Vec model.
Which one performs better?


In [None]:
# your answer


### Exercise A

So far, you have only performed classification for 5 newsgroups. 
Repeat the training process, but this time extend it to all 20 newsgroups. 
1. Modify the code to include all 20 newsgroups.
2. Has the score for the custom word vectors improved, worsened, or remained the same?
3. Has the score for the Google pretrained vectors improved, worsened, or remained the same?



In [None]:
# Your code


### Exercise B
According to Mikolov, Skip-gram works well with small amounts of data and represents rare words well. On the other hand, CBOW is faster and has better representations for more frequent words.
1. Which model was used by default when instantiating our custom Word2Vec? (Hint: look at the hyperparameter `sg`.)
2. Modify the code to instantiate the model with the other architecture (Skip-gram or CBOW, whichever is not the default).
3. Repeat the classification training.
4. Discuss if the result is consistent with Mikolov.

Hint: Refer to the documentation at https://radimrehurek.com/gensim/models/word2vec.html for the `sg` parameter.

In [None]:
# Your answers


In [None]:
# end