# Experimenting with Topic Modeling using Word Embeddings

The data set contains research paper titles and abstracts, each labeled as Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance, or some combination of these.  My approach is to convert the text to a vector using word embeddings trained on this data set, then train a separate classifier for each label.  At the end I create a function that takes text as input and returns the likely topic(s) of the title or abstract.

This function runs the input text through each classifier separately, then returns an array of the results in the same order as the label columns in the training dataset.
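
To make the output format concrete, here is a minimal sketch of how such a result array can be turned back into label names. The helper `decode_predictions` and the constant `LABELS` are hypothetical names used only for illustration, and the label order is assumed to match the label columns of the training CSV.

```python
# Assumption: this order matches the label columns in the training dataset.
LABELS = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

def decode_predictions(preds, labels=LABELS):
    """Return the names of the labels whose prediction is 1 (hypothetical helper)."""
    if len(preds) != len(labels):
        raise ValueError("expected one prediction per label")
    return [name for name, flag in zip(labels, preds) if flag == 1]

# One 0/1 entry per label, in column order
print(decode_predictions([1, 0, 0, 1, 0, 0]))  # ['Computer Science', 'Statistics']
```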

In [1]:
import sys
import pandas as pd
import matplotlib.pyplot as plt
import random
import numpy as np

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.ensemble import RandomForestClassifier

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models import Word2Vec
import gensim.utils

import tensorflow as tf

%matplotlib notebook
print('You\'re running python %s' % sys.version.split(' ')[0])

You're running python 3.8.1


#### Load the training data:

In [2]:
train = pd.read_csv('TM_Dataset/train.csv',keep_default_na=False)

#### Take a look at the training data:

In [3]:
train.head()

Unnamed: 0,ID,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance
0,1,Reconstructing Subject-Specific Effect Maps,Predictive models allow subject-specific inf...,1,0,0,0,0,0
1,2,Rotation Invariance Neural Network,Rotation invariance and translation invarian...,1,0,0,0,0,0
2,3,Spherical polyharmonics and Poisson kernels fo...,We introduce and develop the notion of spher...,0,0,1,0,0,0
3,4,A finite element approximation for the stochas...,The stochastic Landau--Lifshitz--Gilbert (LL...,0,0,1,0,0,0
4,5,Comparative study of Discrete Wavelet Transfor...,Fourier-transform infra-red (FTIR) spectra o...,1,0,0,1,0,0


#### Create array with labels for later:

In [4]:
labels = np.array(['Computer Science', 'Physics', 'Mathematics', 'Statistics', 'Quantitative Biology', 'Quantitative Finance'],dtype='str')

#### Break up dataset into lists that can be used for training and testing sets:

In [5]:
# The title in this row yields no tokens after simple_preprocess, so the row is removed
issueRowNumber = 8270

# X's
inputAbstracts = train['ABSTRACT'].tolist()
inputTitles = train['TITLE'].tolist()
inputAbstracts.pop(issueRowNumber)
inputTitles.pop(issueRowNumber)

# y's or labels
labelColumns = [None]*len(labels)
for i in range(len(labels)):
    col = train[labels[i]].tolist()
    col.pop(issueRowNumber)
    labelColumns[i] = col
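
The row index above is hard-coded. As a rough sketch of how such rows could be found automatically, the snippet below scans a list of texts for entries that produce no tokens. `rough_tokens` is a hypothetical stand-in that only approximates `gensim.utils.simple_preprocess` (lowercasing, letters only, tokens of 2 to 15 characters), so its results may differ from gensim's in edge cases.

```python
import re

def rough_tokens(text, min_len=2, max_len=15):
    """Approximate stand-in for gensim.utils.simple_preprocess (an assumption)."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [w for w in words if min_len <= len(w) <= max_len]

def rows_without_tokens(texts):
    """Return the indices of texts that yield no tokens at all."""
    return [i for i, text in enumerate(texts) if not rough_tokens(text)]

# Toy titles for illustration, not taken from the dataset
titles = ["Rotation Invariance Neural Network", "A 3 x 3", "New prime discovered"]
print(rows_without_tokens(titles))  # [1]
```

If more than one index is returned, the rows should be popped from the highest index down so that earlier removals do not shift the later ones.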

#### Tokenize titles and abstracts:

In [6]:
# Tokenize titles:
inputTitleTokens = [gensim.utils.simple_preprocess(title) for title in inputTitles]

# Tokenize abstracts:
inputAbstractTokens = [gensim.utils.simple_preprocess(abstract) for abstract in inputAbstracts]

#### Create Word Embeddings for article titles using Word2Vec

In [7]:
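# Skip-gram (sg=1) embeddings with 100 dimensions; min_count=1 keeps every word in the vocabulary.
# Note: gensim 4.0+ renamed the `size` argument to `vector_size`.
W2V_model = Word2Vec(inputTitleTokens, min_count=1, size=100, workers=3, window=5, sg=1)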

#### Vectorize article titles using Word Embeddings:

In [8]:
# Represent each title as the mean of its word vectors
vectorizedTitles = [None]*len(inputTitleTokens)
for i in range(len(inputTitleTokens)):
    post = []
    for word in inputTitleTokens[i]:
        try:
            post.append(W2V_model.wv[word])
        except KeyError:
            pass  # skip words without an embedding
    vectorizedTitles[i] = np.mean(np.array(post, dtype='f'), axis=0)
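
Averaging only works when at least one token has a vector; taking `np.mean` of an empty array gives `nan`. The sketch below shows the same averaging idea with an explicit guard for that case. It uses plain lists and a toy embedding dictionary to stay dependency-free; `average_vectors` and `toy_embeddings` are hypothetical names, and returning a zero vector is just one possible fallback.

```python
def average_vectors(tokens, embeddings, dim):
    """Average the embeddings of known tokens; fall back to zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # assumption: a zero vector stands in for "no information"
    # zip(*known) groups the vectors component by component
    return [sum(component) / len(known) for component in zip(*known)]

# Toy 2-dimensional embeddings for illustration only
toy_embeddings = {"neural": [1.0, 0.0], "network": [0.0, 1.0]}
print(average_vectors(["neural", "network", "unseen"], toy_embeddings, dim=2))  # [0.5, 0.5]
print(average_vectors(["unseen"], toy_embeddings, dim=2))                       # [0.0, 0.0]
```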

#### Split up testing and training sets:

In [9]:
test_size = len(inputTitles)//5
train_size = len(inputTitles)-test_size
print('Testing set size: '+str(test_size),'|','Training set size: '+str(train_size),'|','Total size: '+str(test_size+train_size))

Testing set size: 4194 | Training set size: 16777 | Total size: 20971


In [10]:
# Create the X test and training matrices for the article titles
temp = np.array(vectorizedTitles)
X_title_test,X_title_train = temp[train_size:],temp[:train_size]

#create the Y test and training arrays for the article labels (list of "np.array columns")
Y_train,Y_test = [None]*len(labelColumns),[None]*len(labelColumns)
for colNumber in range(len(labelColumns)):
    temp = np.array(labelColumns[colNumber])
    Y_test[colNumber],Y_train[colNumber]  = temp[train_size:],temp[:train_size]
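
The split above takes the last fifth of the rows as the test set, which relies on the file not being ordered in any meaningful way. A possible alternative, sketched here under that caveat, is to shuffle the row indices with a fixed seed before splitting. `shuffled_split` is a hypothetical helper, and the seed and test fraction are arbitrary choices.

```python
import random

def shuffled_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) after a reproducible shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG keeps the global state untouched
    test_size = int(n_rows * test_fraction)
    return indices[test_size:], indices[:test_size]

train_idx, test_idx = shuffled_split(20971)
print(len(train_idx), len(test_idx))  # 16777 4194
```

The resulting index lists can then be used to select rows from both the feature matrix and each label column, so features and labels stay aligned.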

#### Create random forest classifiers for each label:

In [11]:
classifiers = [None]*len(Y_train)
for colNumber in range(len(Y_train)):
    temp = RandomForestClassifier(max_depth=6,n_estimators=10)
    temp.fit(X_title_train, Y_train[colNumber])
    classifiers[colNumber] = temp
    print(colNumber,labels[colNumber])
    print('Training accuracy:',np.sum(temp.predict(X_title_train)==Y_train[colNumber])/len(X_title_train))
    print('Testing accuracy:',np.sum(temp.predict(X_title_test)==Y_test[colNumber])/len(X_title_test))
    print()

0 Computer Science
Training accuracy: 0.7981760743875543
Testing accuracy: 0.7906533142584645

1 Physics
Training accuracy: 0.8664242713238363
Testing accuracy: 0.8512160228898427

2 Mathematics
Training accuracy: 0.8533110806461227
Testing accuracy: 0.837863614687649

3 Statistics
Training accuracy: 0.8357870894677236
Testing accuracy: 0.8116356700047688

4 Quantitative Biology
Training accuracy: 0.970972164272516
Testing accuracy: 0.9763948497854077

5 Quantitative Finance
Training accuracy: 0.988496155450915
Testing accuracy: 0.9871244635193133
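
Quantitative Biology and Quantitative Finance score the highest accuracies, but if those labels are rare (which I have not checked here), a classifier that always predicts 0 would score well too. The sketch below shows how accuracy can be compared against a majority-class baseline and against recall. The labels are toy values, not taken from the dataset, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(y_true)
    return max(positives, len(y_true) - positives) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted as positive."""
    hits = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(hits) / len(hits) if hits else 0.0

# Toy example: 2 positives out of 100, and a model that always predicts 0
y_true = [1, 1] + [0] * 98
y_pred = [0] * 100
print(accuracy(y_true, y_pred), majority_baseline_accuracy(y_true), recall(y_true, y_pred))
# 0.98 0.98 0.0
```

An accuracy that is no better than the baseline, together with a recall near zero, would suggest the classifier has not learned to detect that label.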



#### Create classifier function that evaluates input text on all six labels:

In [12]:
def classifier(title):
    """Return a 0/1 prediction for each label, in the same order as `labels`."""
    tokenTitle = gensim.utils.simple_preprocess(title)
    vecTitle = []
    for word in tokenTitle:
        try:
            vecTitle.append(W2V_model.wv[word])
        except KeyError:
            pass  # skip words the embedding model has never seen
    if not vecTitle:
        raise ValueError('No known words in the input title')
    vecTitle = np.mean(np.array(vecTitle, dtype='f'), axis=0)
    preds = [None]*len(classifiers)
    for index in range(len(classifiers)):
        preds[index] = int(classifiers[index].predict(vecTitle.reshape(1, -1))[0])
    return np.array(preds)

#### Try out the classifier on some made-up article titles:

In [13]:
articleName = "New Methods for KNN with text data"
preds = classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

Output vector: [1 0 0 1 0 0] | Predicted Label(s): ['Computer Science' 'Statistics']


In [14]:
articleName = "Pi used in new formula"
preds = classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

Output vector: [0 1 0 0 0 0] | Predicted Label(s): ['Physics']


In [16]:
articleName = "New prime discovered"
preds = classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

Output vector: [0 0 1 0 0 0] | Predicted Label(s): ['Mathematics']


In [18]:
articleName = "New Data distribution used to speed up training"
preds = classifier(articleName)
print('Output vector:',preds,'|','Predicted Label(s):',labels[preds==1])

Output vector: [1 0 0 0 0 0] | Predicted Label(s): ['Computer Science']
