<a href="https://colab.research.google.com/github/mesahwi/TextAnlaysis/blob/master/Learning/Doc2Vec_toy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text Classification

(Doc2Vec -> Classification) Toy Example, <br>

Using 20newsgroups data

Step 0 : import necessary packages

In [1]:
import numpy as np
import gensim
import sklearn
import nltk
import collections

from sklearn.datasets import fetch_20newsgroups
categories = ['rec.sport.baseball', 'sci.electronics']

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.corpora.dictionary import Dictionary

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

from sklearn.linear_model import LogisticRegression

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt')

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Step 1 : Gather data

In [2]:
print('-----Training Dataset Word Count-----')
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
length = len(np.unique(data_train.target))
for i in range(length):
  n = len(np.where(data_train.target==i)[0])
  wc = 0
  for j in range(n):
    wc = wc + len(data_train.data[j])
    
  print('type',i, ' : ', wc)
  i=i+1
  
print('-----Test Dataset Word Count-----')  
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
length = len(np.unique(data_test.target))
for i in range(length):
  n = len(np.where(data_test.target==i)[0])
  wc = 0
  for j in range(n):
    wc = wc + len(data_test.data[j])
    
  print('type',i, ' : ', wc)
  i=i+1


-----Training Dataset Word Count-----
type 0  :  789442
type 1  :  779070
-----Test Dataset Word Count-----
type 0  :  585660
type 1  :  580836


Step 2 : Preprocess data

In [0]:
def preprocess(text, stemmer=False):
  tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
#   print('tokenized into words')
  
  tokens = [word.lower() for word in tokens]
#   print('lower capitalization')

  tokens = [word for word in tokens if len(word) >= 4]
#   print('removed short (length<4) words')

  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]
#   print('lemmatized words')
  
  tokens = [lemmatizer.lemmatize(word, 'v') for word in tokens]
#   print('lemmatized verbs')

  stop = stopwords.words('english')
  my_stopwords = ['from', 'subject', 'line', 'say', 'would', 'like', 'write', 'article', 'organization', 'year', 'university', 'nntp-posting-host', 'reply-to', 'distribution', 'know']
  # my_stopwords were chosen in a post-hoc manner
  tokens = [token for token in tokens if token not in stop and token not in my_stopwords]
#   print('removed stopwords')
  
  
  if stemmer:
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
#     print('stemmed words')
  
  tokens = [word for word in tokens if not any(char.isdigit() for char in word)]
#   print('removed words containing numbers')
  
#   preprocessed = ' '.join(tokens)
  return tokens

In [0]:
text_train = data_train.data
train_corpus = [TaggedDocument(words = preprocess(_d), tags=[str(i)]) for i, _d in enumerate(text_train)] #'TaggedDocument', to be used for doc2vec

Step 3 : Train Doc2vec Model (, and test performance with training data)

In [0]:
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)

In [6]:
correct_list = []
wrong_list = []
wrong_id_list = []
for doc_id in range(len(train_corpus)):
  v = model.infer_vector(train_corpus[doc_id].words)
  sims = model.docvecs.most_similar([v])
  dif = int(sims[0][0]) - doc_id
  if dif==0:
    correct_list.append(sims)
  else:
    wrong_list.append(sims)
    wrong_id_list.append(doc_id)

print(len(correct_list) / len(train_corpus), ' correct')


0.9941077441077442  correct


Plugging in the training documents into 'model.docvecs.most_similar()' , we can see that 99.4% of the training documents were most similar to themselves.<br>
Therefore, we can see that our doc2vec is working fine

Step 4 : Train Logistic Regression using Doc2vec vectors.


In [0]:
def getDocVec(model):
  totLen = len(model.docvecs)
  X = [model.docvecs[i] for i in range(totLen)]
  return X

Sidenote : Haven't yet figured out if there is a more efficient way (getting the vectors directly from model.docvecs, not by iteration)

In [0]:
X_train = getDocVec(model)
Y_train = data_train.target

lm = LogisticRegression()
lmfit = lm.fit(X_train, Y_train)
y_train_lm = lmfit.predict(X_train)

In [9]:
col = collections.Counter(y_train_lm - Y_train)
print('Logistic Regression Performance with Training set : ', col[0]/len(y_train_lm))

Logistic Regression Performance with Training set :  0.9882154882154882


Now that we have assurance the logistic regression model can perform, we bring in the test set

Step 5 : Test Logistic Regression Model

In [0]:
text_test = data_test.data
test_corpus =[TaggedDocument(words = preprocess(_d), tags=[str(i)]) for i, _d in enumerate(text_test)]

model2 = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model2.build_vocab(test_corpus)
model2.train(test_corpus, total_examples=model2.corpus_count, epochs=model2.epochs)

In [11]:
X_test = getDocVec(model2)
y_test_lm = lmfit.predict(X_test)
col = collections.Counter(y_test_lm - data_test.target)
print('Logistic Regression Performance with Test set : ', col[0]/len(y_test_lm))

Logistic Regression Performance with Test set :  0.8240506329113924


Though far from 98.8% of our training set, 82.4% is still an acceptable performance :-)