In [1]:
import numpy as np
import pandas as pd
import statistics as stats
import matplotlib.pyplot as plt
%matplotlib inline
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import gensim
from gensim import corpora, models
import pyLDAvis.gensim

In [2]:
df = pd.read_csv('cleaned_papers')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Title,Abstract,Category,Cleaned Title,Cleaned Abstract
0,0,Calculation of prompt diphoton production cros...,A fully differential calculation in perturbati...,physics,calcul prompt diphoton product cross section t...,a fulli differenti calcul perturb quantum chro...
1,1,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\\ell)$-p...",math,sparsiti certifi graph decomposit,we describ new algorithm k ell pebbl game colo...
2,2,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is describe...,physics,the evolut earth moon system base dark matter ...,the evolut earth moon system describ dark matt...
3,3,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle n...,math,a determin stirl cycl number count unlabel acy...,we show determin stirl cycl number count unlab...
4,4,From dyadic $\\Lambda_{\\alpha}$ to $\\Lambda_...,In this paper we show how to compute the $\\La...,math,from dyadic lambda alpha lambda alpha,in paper show comput lambda alpha norm alpha g...


In [4]:
vectorizer = TfidfVectorizer()
RANDOMSEED = 42

Below, we convert the `Cleaned Title` column of our data into TF-IDF features and split the result into training and test sets. We then fit a random forest on the training set and evaluate it on the test set.

Note: Since the data contains disproportionately many physics papers compared to the other categories, we [stratify](https://en.wikipedia.org/wiki/Stratified_sampling) the split by passing the target `Y_t` to the `stratify` parameter of `train_test_split`. We do the same for the abstracts, setting `stratify=Y_a`.
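
To illustrate what stratification does, here is a minimal sketch on made-up labels (80 physics, 16 math, 4 cs; these counts are an assumption for illustration, not the real class balance). With `stratify` set, each class keeps the same share in the training and test sets:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up labels for illustration only (not the real dataset).
toy_labels = ["physics"] * 80 + ["math"] * 16 + ["cs"] * 4
toy_features = [[i] for i in range(len(toy_labels))]  # placeholder features

_, _, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Count how many samples of each class landed in each split.
train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts)
print(test_counts)
```

Without `stratify`, a rare class such as the 4 cs samples here could end up entirely in one of the two splits.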

In [5]:
title_final_features = vectorizer.fit_transform(df['Cleaned Title']).toarray()
# title training and testing sets
X_t = title_final_features
Y_t = df['Category']
X_t_train, X_t_test, y_t_train, y_t_test = train_test_split(X_t, Y_t, test_size=0.25, random_state=RANDOMSEED, stratify=Y_t)

In [6]:
# https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
# RANDOM FOREST
rf_t = RandomForestClassifier(n_estimators=100, random_state=RANDOMSEED)
rf_t.fit(X_t_train, y_t_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [7]:
# https://stackabuse.com/text-classification-with-python-and-scikit-learn/
y_t_pred = rf_t.predict(X_t_test)
print(accuracy_score(y_t_test, y_t_pred))

0.832


Random forest yields 83% accuracy from the title of the paper alone. This is fairly impressive: I am not sure how well I would do myself with nothing but such jargon-heavy titles.
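
One thing to keep in mind: the cells above fit the TF-IDF vocabulary on every title before splitting, so the test titles influence the features. A possible variation, sketched here on a handful of made-up stemmed titles (the toy data and the names `toy_titles`, `toy_labels`, and `clf` are assumptions, not part of the original notebook), wraps the vectorizer and the forest in a scikit-learn `Pipeline` so that the vocabulary is learned from the training split only:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up stemmed titles and labels, for illustration only.
toy_titles = [
    "dark matter galaxi cluster", "star format galaxi", "black hole accret",
    "quantum spin state", "graph color algorithm", "algebra group represent",
    "prime number theorem", "random graph network",
]
toy_labels = ["physics"] * 4 + ["math"] * 4

# Split the raw text first, so the vectorizer never sees the test titles.
titles_train, titles_test, y_train, y_test = train_test_split(
    toy_titles, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
])
clf.fit(titles_train, y_train)      # vocabulary is built from titles_train only
y_pred = clf.predict(titles_test)   # test titles are transformed with that vocabulary
print(y_pred)
```

The random forest accepts the sparse matrix that `TfidfVectorizer` produces, so `.toarray()` is not needed in this setup. The accuracy on the real data may differ slightly from the figure reported above.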

Below we apply the same as above for the abstracts of the papers. 

In [8]:
abstract_final_features = vectorizer.fit_transform(df['Cleaned Abstract']).toarray()
# abstract training and testing sets
X_a = abstract_final_features
Y_a = df['Category']
X_a_train, X_a_test, y_a_train, y_a_test = train_test_split(X_a, Y_a, test_size=0.25, random_state=RANDOMSEED, stratify=Y_a)

In [9]:
# https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
# RANDOM FOREST
rf_a = RandomForestClassifier(n_estimators=100, random_state=RANDOMSEED)
rf_a.fit(X_a_train, y_a_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [10]:
# https://stackabuse.com/text-classification-with-python-and-scikit-learn/
y_a_pred = rf_a.predict(X_a_test)
print(accuracy_score(y_a_test, y_a_pred))

0.8816


The abstract proves to be a better indicator of subject, with 88% accuracy using random forest.
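
Because physics dominates the data, overall accuracy can hide weak performance on the smaller categories. The sketch below uses made-up predictions (not the notebook's results) to show how `confusion_matrix` and per-class recall expose this; on the real data the same calls would take `y_a_test` and `y_a_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up example: a model that always predicts "physics".
y_true = ["physics"] * 8 + ["math"] * 2
y_pred = ["physics"] * 10
labels = ["math", "physics"]

acc = accuracy_score(y_true, y_pred)                  # overall accuracy
cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, columns: predicted
recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(acc)
print(cm)
print(recall)  # one recall value per class, in the order of `labels`
```

Here the overall accuracy looks respectable even though no math paper is ever identified, which is exactly the situation per-class metrics are meant to catch.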

Now we want to try out [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA). We first need to create a Dictionary: we make each paper's title (or abstract) into a list of all the words in that title (or abstract), and then make a list out of all such lists; we then run that larger list through the `corpora.Dictionary`. This is all that really needs to be done since we have already cleaned and stemmed our titles (and abstracts). 
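
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the idea on two made-up titles. It only mimics what `corpora.Dictionary` and `doc2bow` produce and is not gensim's implementation; the names `token2id` and `to_bow` are introduced here for illustration, and the ids gensim assigns may differ.

```python
from collections import Counter

# Two made-up, already-stemmed titles.
docs = ["sparsiti certifi graph decomposit", "graph color graph algorithm"]
tokenized = [doc.split() for doc in docs]

# Step 1: the "dictionary" assigns an integer id to every distinct token.
token2id = {}
for tokens in tokenized:
    for tok in tokens:
        token2id.setdefault(tok, len(token2id))

# Step 2: each document becomes a sorted list of (token_id, count) pairs.
def to_bow(tokens):
    return sorted(Counter(token2id[t] for t in tokens).items())

corpus = [to_bow(tokens) for tokens in tokenized]
print(token2id)
print(corpus)
```

Word order is discarded in this representation; only which tokens occur, and how often, is passed on to LDA.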

We begin by using LDA on the titles:

In [28]:
# split each cleaned title into its list of tokens
word_list_t = [title.split() for title in df['Cleaned Title']]
dictionary_t = corpora.Dictionary(word_list_t)
corpus_t = [dictionary_t.doc2bow(text) for text in word_list_t]

Below we create the LDA model. We set the number of topics to 6 because our initial EDA showed that only 6 subjects appear in the data. Additionally, we set `minimum_probability` to 0.0013: the least common subject (quantitative finance) makes up a 0.0014 share of the papers, and we didn't want the model to disregard it.
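
The 0.0014 figure is the share of the rarest category. Here is a minimal sketch of that arithmetic, using made-up counts chosen to give the same share (the real counts are not shown in this notebook). On the actual DataFrame, `df['Category'].value_counts(normalize=True)` should give the proportions directly.

```python
from collections import Counter

# Made-up category counts for illustration (10,000 papers in total).
toy_counts = Counter({"physics": 7000, "math": 2000, "cs": 800,
                      "stat": 100, "q-bio": 86, "q-fin": 14})

total = sum(toy_counts.values())
proportions = {cat: n / total for cat, n in toy_counts.items()}
rarest_cat, rarest_share = min(proportions.items(), key=lambda kv: kv[1])

# Pick a threshold just under the rarest share, as the text describes.
minimum_probability = 0.0013
print(rarest_cat, rarest_share, minimum_probability < rarest_share)
```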

In [29]:
ldamodel_t = gensim.models.ldamodel.LdaModel(corpus_t, num_topics=6, id2word = dictionary_t, passes=20, 
                                           minimum_probability=0.0013, random_state=RANDOMSEED)

The list below shows the index of each generated topic along with its most heavily weighted words, each preceded by its weight in that topic.

In [30]:
print(ldamodel_t.print_topics())

[(0, '0.023*"the" + 0.017*"star" + 0.017*"a" + 0.016*"galaxi" + 0.016*"cluster" + 0.009*"dark" + 0.009*"mass" + 0.008*"format" + 0.007*"problem" + 0.007*"model"'), (1, '0.021*"quantum" + 0.018*"system" + 0.012*"equat" + 0.011*"spin" + 0.010*"state" + 0.010*"non" + 0.009*"theori" + 0.008*"gener" + 0.008*"graviti" + 0.008*"the"'), (2, '0.019*"x" + 0.018*"ray" + 0.012*"hole" + 0.011*"black" + 0.009*"the" + 0.009*"effect" + 0.008*"quantum" + 0.008*"observ" + 0.007*"accret" + 0.007*"einstein"'), (3, '0.015*"network" + 0.014*"a" + 0.010*"algebra" + 0.009*"gener" + 0.008*"model" + 0.008*"on" + 0.007*"random" + 0.007*"algorithm" + 0.007*"energi" + 0.006*"complex"'), (4, '0.015*"the" + 0.010*"decay" + 0.010*"high" + 0.008*"photon" + 0.008*"mass" + 0.008*"gamma" + 0.007*"field" + 0.007*"pi" + 0.007*"energi" + 0.007*"meson"'), (5, '0.015*"model" + 0.015*"two" + 0.012*"phase" + 0.011*"field" + 0.009*"effect" + 0.009*"2" + 0.009*"theori" + 0.009*"dimension" + 0.009*"quantum" + 0.008*"space"')]
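
Each entry pairs a word with its weight in that topic. If the weights are needed as numbers, the string format shown above can be parsed with a small helper (`parse_topic` is a hypothetical name introduced here); gensim's `show_topic` method should return the same pairs without any parsing.

```python
import re

# The first few terms of topic 0, copied from the output above.
topic_str = '0.023*"the" + 0.017*"star" + 0.017*"a" + 0.016*"galaxi"'

def parse_topic(s):
    """Turn a print_topics-style string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', s)]

pairs = parse_topic(topic_str)
print(pairs)
```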


In [31]:
pyLDAvis.enable_notebook()
vis_t = pyLDAvis.gensim.prepare(ldamodel_t, corpus_t, dictionary_t)
vis_t


Clearly, all the topics generated are overwhelmingly biased towards the physics papers. This is not too surprising given how heavily physics papers are represented in the data. While this is not even close to what we hoped the LDA model would do, there may be promise in checking whether the model was able to distinguish between different subfields of physics.
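
One way to check this impression more systematically would be to assign each paper its highest-probability topic and count how those assignments line up with `Category`. Below is a minimal sketch on made-up topic distributions (in the notebook these would come from something like `ldamodel_t.get_document_topics(bow)` for each `bow` in `corpus_t`; the names `doc_topics`, `dominant_topic`, and `table` are hypothetical):

```python
from collections import Counter

# Made-up per-document topic distributions as (topic_id, probability) pairs,
# the shape that gensim's get_document_topics returns.
doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.8), (2, 0.1)],
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(0, 0.2), (1, 0.2), (2, 0.6)],
]
categories = ["physics", "physics", "physics", "math"]  # made-up labels

def dominant_topic(dist):
    """Return the id of the topic with the highest probability."""
    return max(dist, key=lambda pair: pair[1])[0]

# Count how often each (category, dominant topic) combination occurs.
table = Counter((cat, dominant_topic(dist))
                for cat, dist in zip(categories, doc_topics))
for (cat, topic), n in sorted(table.items()):
    print(cat, topic, n)
```

If the topics tracked the subjects well, each category would concentrate in one topic; physics papers spread across several topics would instead point to subfield structure.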

Regardless, the hope is that the abstract of a paper, which has more content to help the model and performed better with random forests than the titles did, might work better. We will determine this below:

In [32]:
# split each cleaned abstract into its list of tokens
word_list_a = [abstract.split() for abstract in df['Cleaned Abstract']]
dictionary_a = corpora.Dictionary(word_list_a)
corpus_a = [dictionary_a.doc2bow(text) for text in word_list_a]

In [33]:
ldamodel_a = gensim.models.ldamodel.LdaModel(corpus_a, num_topics=6, id2word = dictionary_a, passes=20, 
                                           minimum_probability=0.0013, random_state=RANDOMSEED)
print(ldamodel_a.print_topics())

[(0, '0.031*"1" + 0.027*"2" + 0.027*"0" + 0.017*"n" + 0.014*"k" + 0.013*"x" + 0.013*"3" + 0.011*"p" + 0.011*"c" + 0.010*"we"'), (1, '0.016*"we" + 0.008*"in" + 0.008*"gener" + 0.008*"space" + 0.007*"group" + 0.007*"algebra" + 0.007*"n" + 0.007*"the" + 0.007*"paper" + 0.006*"show"'), (2, '0.018*"the" + 0.009*"magnet" + 0.009*"we" + 0.008*"field" + 0.007*"use" + 0.007*"time" + 0.007*"simul" + 0.006*"result" + 0.006*"model" + 0.006*"densiti"'), (3, '0.016*"we" + 0.013*"model" + 0.013*"the" + 0.010*"theori" + 0.009*"field" + 0.008*"equat" + 0.008*"gener" + 0.006*"energi" + 0.006*"use" + 0.006*"in"'), (4, '0.014*"state" + 0.014*"the" + 0.013*"we" + 0.011*"phase" + 0.011*"spin" + 0.009*"quantum" + 0.009*"electron" + 0.009*"system" + 0.008*"two" + 0.007*"effect"'), (5, '0.013*"the" + 0.012*"we" + 0.012*"star" + 0.011*"observ" + 0.008*"galaxi" + 0.008*"mass" + 0.007*"ray" + 0.007*"cluster" + 0.006*"1" + 0.006*"0"')]


In [34]:
vis_a = pyLDAvis.gensim.prepare(ldamodel_a, corpus_a, dictionary_a)
vis_a



Admittedly, while a little more varied, the results of this model are not particularly pretty either. Topic 1 here seems to be astrophysics. Topic 3 seems like it could be based on computer science papers, a partial win! The visual shows that topics 4 and 5 clearly overlap quite a bit; they seem to be based more on quantum/energy physics. Hovering over Topic 6 shows an interesting result: a large number of the single-letter "words" were grouped together here. I initially kept them in the hope that the mathematical symbols that get interpreted this way might help the model, but it seems they do not.

Notes on the current state of this analysis:
* stopwords are still present in the cleaned text
* numbers and single-letter "words" should be removed
* consider increasing the number of topics allowed
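
The first two notes could be addressed with a small token filter applied before building the dictionary. This is a minimal sketch: the stopword set is a tiny illustrative sample rather than a complete list, and `keep_token` and `filter_tokens` are hypothetical helper names. gensim's `Dictionary.filter_extremes` is another option for pruning very common or very rare tokens.

```python
# Tiny illustrative stopword set; a real run would use a fuller list.
STOPWORDS = {"the", "a", "we", "in", "on", "of"}

def keep_token(tok):
    """Keep tokens that are longer than one character, not purely digits, and not stopwords."""
    return len(tok) > 1 and not tok.isdigit() and tok not in STOPWORDS

def filter_tokens(text):
    """Split a cleaned string on whitespace and drop unwanted tokens."""
    return [tok for tok in text.split() if keep_token(tok)]

filtered = filter_tokens("we describ new algorithm k ell pebbl game 2 color")
print(filtered)
```

Applied to the notebook, `filter_tokens` would replace the plain `.split()` call when building `word_list_t` and `word_list_a`.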