# LDA Topic Extraction

## Goal
In this notebook I begin extracting topics from the arXiv abstracts using Latent Dirichlet Allocation (LDA). The purpose is to tune the LDA model so that it outputs generalized topics in an unsupervised setting. While we already have the general topics (as seen in the arXiv_PrelimEDA notebook), we noticed that almost all of the abstracts span several topics at once. The topics listed on the arXiv are also too broad to be useful on their own.

I will note, however, that if we were to use ADS, the keywords would likely already be provided and would be easier to extract. But if we want to extend the analysis to fields other than Astronomy and Astrophysics, which ADS does not cover, then understanding how well we perform here is critical.

In [1]:
import spacy
import re
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import json

import numpy as np 
import scipy as sp
from scipy import signal
import pandas as pd

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
%cd ../arXiv_MetadataExtraction

/Users/nakulgangolli/Desktop/arXiv_Reader/arXiv_MetadataExtraction




In [3]:
df = pd.read_csv('cleaned_arxiv_astro.csv')
print(df.head)

<bound method NDFrame.head of         Unnamed: 0      arXiv_ID  \
0                0     9501001v1   
1                1     9501002v1   
2                2     9501003v1   
3                3     9501004v1   
4                4     9501005v1   
...            ...           ...   
324341      324341  2507.17833v1   
324342      324342  2507.18188v1   
324343      324343  2507.14343v1   
324344      324344  2507.12860v1   
324345      324345  2507.06255v1   

                                                    title  \
0                      The Origin of Galactic Cosmic Rays   
1            Turbulent Convection in Thin Accretion Disks   
2                               THE ORIGIN OF COSMIC RAYS   
3        ROSAT Observations of Compact Groups of Galaxies   
4       Gamma-Ray Transfer and Energy Deposition in Su...   
...                                                   ...   
324341  Not-quite-primordial black holes seeded by cos...   
324342  Updated constraints on modified gravity f

In [4]:
CLEANED_ABSTRACT_LIST = df['Cleaned Abstracts'].values

In [5]:
print(CLEANED_ABSTRACT_LIST[2:10])

['propose cosmic originate mainly site normal supernova explosion interstellar medium supernova explosion stellar wind spot powerful radio galaxy proposal depend assumption scaling turbulent diffusive transport cosmic mediate shock region proposal specific model interstellar transport cosmic model investigate compare radio star wolf rayet stars radio supernovae radio supernova remnant gammaray line continuum emission starforme region cosmic electron spectrum akeno shower particle energy range akeno shower specifically discuss assumption inherent model propose concern transport energetic particle acceleration region galaxy'
 'systematically analyze sample archival rosat pspc observation compact group galaxy hickson compact group plus group approximately twothird group extend xray emission emission resolve diffuse emission temperature group potential group extended emission spiral fraction baryon fraction group diffuse emission similar value cluster galaxy single exception gastostellarma

In [6]:
import gensim
from gensim import corpora
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

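BAD_ABSTRACT = 0
TOKENIZED_ABSTRACT_LIST = []
GOOD_IDX = []  # positions of the abstracts that could be tokenized
for idx, ABSTRACT in enumerate(CLEANED_ABSTRACT_LIST):
    # Only strings can be split; anything else (most likely a missing
    # abstract read in as NaN) is counted as bad and skipped.
    if isinstance(ABSTRACT, str):
        TOKENIZED_ABSTRACT_LIST.append(ABSTRACT.split(' '))
        GOOD_IDX.append(idx)
    else:
        BAD_ABSTRACT += 1
        print("Something weird is going on with this abstract")

# Drop the bad abstracts in one step so indices do not shift mid-loop
CLEANED_ABSTRACT_LIST = CLEANED_ABSTRACT_LIST[GOOD_IDX]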

print(F"Total Number of Abstracts: {len(TOKENIZED_ABSTRACT_LIST)} \t Number of Bad Abstracts: {BAD_ABSTRACT}")

ARXIV_CONNECTOR_WORDS = list(ENGLISH_CONNECTOR_WORDS)  # + ['dark', 'black', 'physical',
                                                       #    'galactic', 'stellar', 'primordial']
print(ARXIV_CONNECTOR_WORDS)

bigrams = Phrases(TOKENIZED_ABSTRACT_LIST, min_count=2, threshold=10., max_vocab_size=40000, connector_words=ARXIV_CONNECTOR_WORDS)
bigram_model = gensim.models.phrases.Phraser(bigrams)

BIGRAM_ABSTRACT_LIST = [bigram_model[ABSTRACT] for ABSTRACT in TOKENIZED_ABSTRACT_LIST]
print(BIGRAM_ABSTRACT_LIST[0])

# Create the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(BIGRAM_ABSTRACT_LIST)
# Drop tokens that appear in fewer than 5 abstracts or in more than 95% of them
dictionary.filter_extremes(no_below=5, no_above=0.95)
corpus = [dictionary.doc2bow(ABSTRACT) for ABSTRACT in BIGRAM_ABSTRACT_LIST]


Something weird is going on with this abstract
Something weird is going on with this abstract
Total Number of Abstracts: 324344 	 Number of Bad Abstracts: 2
['by', 'an', 'without', 'of', 'the', 'with', 'from', 'to', 'in', 'on', 'at', 'or', 'and', 'a', 'for']
['motivate', 'recent', 'measurement', 'major', 'component', 'cosmic', 'radiation', 'discuss', 'phenomenology', 'model', 'distinct', 'kind', 'cosmic', 'accelerator', 'galaxy', 'comparison', 'spectra', 'hydrogen', 'helium', 'nucleon', 'suggest', 'element', 'spectrum', 'magnetic', 'rigidity', 'entire', 'region', 'dominant', 'element', 'receive', 'contribution', 'different', 'source']
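
As a rough illustration of what the two preprocessing steps in the cell above do, the sketch below re-implements them in plain Python on a toy corpus. It assumes that gensim's default bigram score has the form (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), and that `filter_extremes` keeps tokens whose document frequency lies between `no_below` and `no_above * num_docs`. The helper names `bigram_score` and `keep_token` and the toy documents are made up for this example. This is not gensim's actual code, and gensim's vocabulary size is counted differently, so absolute scores will not match.

```python
from collections import Counter

# Toy tokenized "abstracts" (hypothetical data, not from the arXiv set)
docs = [
    ["dark", "matter", "halo", "mass"],
    ["dark", "matter", "density", "profile"],
    ["black", "hole", "mass", "accretion"],
    ["dark", "energy", "survey", "mass"],
]

unigram_counts = Counter(tok for doc in docs for tok in doc)
bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
vocab_size = len(unigram_counts)  # simplification: unigrams only


def bigram_score(a, b, min_count=1):
    """Assumed form of the default bigram score for the pair (a, b)."""
    return (bigram_counts[(a, b)] - min_count) * vocab_size / (
        unigram_counts[a] * unigram_counts[b]
    )


def keep_token(token, doc_freq, num_docs, no_below=2, no_above=0.95):
    """Assumed filter rule: keep tokens that are neither too rare nor too common."""
    return no_below <= doc_freq[token] <= no_above * num_docs


# Document frequency: number of documents that contain each token
doc_freq = Counter(tok for doc in docs for tok in set(doc))

score_dm = bigram_score("dark", "matter")  # pair seen twice
score_de = bigram_score("dark", "energy")  # pair seen once
kept = sorted(t for t in doc_freq if keep_token(t, doc_freq, len(docs)))
print(score_dm, score_de, kept)
```

A pair is merged into a single token only when its score exceeds `threshold`, so a pair that co-occurs often relative to how common its words are individually is favoured over a pair that is seen once.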


In [7]:
# topic_model.visualize_topics()

In [25]:
from gensim.models import LdaModel

NTOPICS = 24
lda_model = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=NTOPICS,         # Change to desired number of topics
    random_state=1008,
    passes=20,
    alpha='auto',
    per_word_topics=True,
    iterations=200
    # chunksize = len(CLEANED_ABSTRACT_LIST)//50
)

In [26]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

import warnings
warnings.filterwarnings("ignore")

pyLDAvis.enable_notebook()
vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis_data)

In [27]:
def get_TopicKeyWords(_lda_model, _TOPIC_NUM):
    # Return the ten highest-weighted words of a topic as a
    # single comma-separated string.
    words = _lda_model.show_topic(_TOPIC_NUM, topn=10)
    keywords = ", ".join([word for word, _ in words])
    return keywords

def get_dominant_topic(_abstract_BOW, _ldamodel):

    # Infer the topic distribution of the abstract from its
    # bag-of-words representation.
    topics = _ldamodel.get_document_topics(_abstract_BOW)

    # Sort topics by probability, in descending order.
    topics = sorted(topics, key=lambda x: x[1], reverse=True)

    # Keep the most probable topics until their cumulative
    # probability reaches the cutoff below.
    _PROB_THRESHOLD = 0.4
    counter, tot_prob = 0, 0.
    return_topics = []
    # Also stop if the abstract runs out of topics: low-probability
    # topics are omitted, so the total may never reach the cutoff.
    while tot_prob < _PROB_THRESHOLD and counter < len(topics):
        return_topics.append(topics[counter])
        tot_prob += topics[counter][1]
        counter += 1
    return return_topics

ABSTRACT_TOPIC_LIST = []
# Assign dominant topic to each abstract
for idx, abstract in enumerate(CLEANED_ABSTRACT_LIST):
    abstract_BOW = corpus[idx]
    topics = get_dominant_topic(abstract_BOW, lda_model)
    ABSTRACT_TOPIC_LIST.append(topics)
    # print(ABSTRACT_TOPIC_LIST[idx])
    # print(f"Abstract {idx} => Topic {topic_id} (confidence: {prob:.2f})")
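
To make the cutoff logic in `get_dominant_topic` concrete, here is a small standalone sketch on made-up topic distributions. The function name `select_topics` and the probabilities are hypothetical. It mirrors the idea of adding topics in descending order of probability until their cumulative probability reaches the threshold, and it simply stops if the topics run out first.

```python
def select_topics(topic_probs, threshold=0.4):
    """Return the most probable (topic_id, prob) pairs, in descending
    order, until their cumulative probability reaches `threshold`."""
    ranked = sorted(topic_probs, key=lambda pair: pair[1], reverse=True)
    selected, total = [], 0.0
    for topic_id, prob in ranked:
        if total >= threshold:
            break
        selected.append((topic_id, prob))
        total += prob
    return selected


# Hypothetical document-topic distributions
focused = [(3, 0.70), (7, 0.20), (12, 0.10)]
mixed = [(3, 0.30), (7, 0.25), (12, 0.25), (20, 0.20)]

print(select_topics(focused))  # a single topic already passes the cutoff
print(select_topics(mixed))    # two topics are needed to pass the cutoff
```

With a cutoff of 0.4, an abstract dominated by one topic keeps only that topic, while a cross-topic abstract keeps several.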

In [28]:
# corpus only contains abstracts that tokenized successfully, so drop the
# non-string rows from df to keep idx aligned with ABSTRACT_TOPIC_LIST.
df_valid = df[df['Cleaned Abstracts'].map(lambda a: isinstance(a, str))].reset_index(drop=True)

# topic_dict[rank] holds one row per abstract for its rank-th most
# probable topic (rank 0 is the most probable one).
topic_dict = {}
COLS = ['arXiv_ID', 'Month', 'Year', 'title', 'authors', 'abstract', 'Cleaned Abstracts']
for idx, topic_list in enumerate(ABSTRACT_TOPIC_LIST):
    row = [df_valid[col][idx] for col in COLS]
    for rank, (topic_id, _prob) in enumerate(topic_list):
        topic_dict.setdefault(rank, []).append(row + [topic_id])


for key in topic_dict.keys():
    print(f"Number of Topics: {len(topic_dict[key])}")

# print(topic_dict[0])

Number of Topics: 324344
Number of Topics: 298586
Number of Topics: 92559
Number of Topics: 3800
Number of Topics: 27
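
The counts printed above are indexed by rank rather than by topic: entry k is the number of abstracts that needed at least k+1 topics to reach the probability cutoff. A toy sketch of that bookkeeping follows, with invented topic lists (`toy_topic_lists` is a hypothetical stand-in for `ABSTRACT_TOPIC_LIST`).

```python
from collections import Counter

# Each inner list holds the ranked (topic_id, probability) pairs of one abstract
toy_topic_lists = [
    [(3, 0.70)],
    [(3, 0.30), (7, 0.25)],
    [(5, 0.20), (1, 0.15), (9, 0.10)],
]

# Count how many abstracts have an entry at each rank
rank_counts = Counter(
    rank for topic_list in toy_topic_lists for rank in range(len(topic_list))
)

for rank in sorted(rank_counts):
    print(f"rank {rank}: {rank_counts[rank]} abstracts")
```

The counts can only decrease with rank, since every abstract with a second topic also has a first one. This matches the pattern in the output above.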


In [29]:
df_FirstTopic = pd.DataFrame(topic_dict[0], columns=['arXiv_ID', 'Month', 'Year', 
                                                     'Title', 'Authors', 
                                                     'Abstract', 'Cleaned Abstract',
                                                     'Topic'])
print(np.unique(df_FirstTopic['Topic'].values))
df_FirstTopic.to_csv('abstractFirstTopics.csv')

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
