## Data in paper
In the original paper, the authors used 16,000 documents from a subset of the TREC AP corpus (Harman, 1992). The TREC dataset is not easy to obtain, since it requires signing an individual agreement and getting approval from NIST. Instead, we downloaded the sample data from [Blei's webpage](http://www.cs.columbia.edu/~blei/lda-c/). This sample is only a subset of the data used in the paper, so we cannot reproduce the paper's results exactly.

In [1]:
import pandas as pd
import numpy as np
from ldapkg.mymodel import LDA_OPT
import gensim
from nltk.stem import WordNetLemmatizer
from nltk import PorterStemmer

In [2]:
## Functions used to preprocess the data
def lemmatize_stemming(text):
    '''
    Lemmatize and stem the text.
    '''
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    '''
    Preprocess the text.
    '''
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
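
The filtering step in `preprocess` keeps a token only if it is not a stopword and has more than three characters. The block below is a minimal standard-library sketch of that filter alone. `filter_tokens` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny made-up stand-in for gensim's list, the regex tokenizer only approximates `gensim.utils.simple_preprocess`, and lemmatization and stemming are left out.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# gensim ships a much longer list.
TOY_STOPWORDS = frozenset({"the", "and", "with", "from", "that", "this"})

def filter_tokens(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, split on non-letters, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

kept = filter_tokens("The markets rallied with the new trade deal from Asia")
print(kept)
```

Here "new" is dropped for its length and "the", "with", and "from" are dropped as stopwords. In the notebook, the surviving tokens would then be lemmatized and stemmed.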

In [8]:
# Functions used to show the result
import heapq
def get_words(words_list, word_num):
    '''
    Get the words with the largest probabilities in a specific topic.
    words_list is a list of probability of words under a specific topic.
    word_num is the number of words that we return.
    Return the index of words in vocabulary.
    '''
    # Rank indices directly so that tied probabilities still yield distinct words.
    return heapq.nlargest(word_num, range(len(words_list)), key=words_list.__getitem__)

def word_to_list(index, beta):
    """Return the 10 most probable words of topic `index` as a list (uses the global `vocabulary`)."""
    topic = []
    words_top = get_words(list(beta[:,index]), 10)
    for i in words_top:
        topic.append(vocabulary[i])
    return topic
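
Selecting the top words comes down to finding the indices of the largest entries in a topic's probability list. The toy sketch below uses made-up probabilities and a made-up five-word vocabulary, and `top_k_indices` is a hypothetical helper name. It compares ranking indices directly with looking values up through `list.index`, which returns the first matching position and so repeats an index when two words tie.

```python
import heapq

def top_k_indices(probs, k):
    """Return the indices of the k largest entries, highest first."""
    return heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)

# Toy topic-word probabilities; entries 1 and 2 are tied on purpose.
toy_probs = [0.1, 0.4, 0.4, 0.05, 0.05]
toy_vocab = ["bank", "market", "stock", "troop", "vaccin"]

# Ranking indices keeps tied entries distinct.
by_index = top_k_indices(toy_probs, 2)

# Looking values up with list.index finds the first match both times.
by_value = [toy_probs.index(p) for p in heapq.nlargest(2, toy_probs)]

top_words = [toy_vocab[i] for i in by_index]
print(by_index, by_value, top_words)
```

With real-valued topic probabilities exact ties are uncommon, but they can occur among words that share the same smoothed probability.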

In [4]:
# Load the data
ap = []

with open("data/ap.txt") as f:
    for line in f:
        if not (line.startswith("<") or line.startswith(" <")):
            ap.append(line)
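
The loader keeps every line that does not begin with `<` or ` <`, which discards the markup lines and leaves the article text. A small sketch on a made-up snippet shows which lines survive. The tag layout and the sentence are illustrative assumptions, not copied from `ap.txt`, and `read_documents` is a hypothetical helper name.

```python
import io

# Made-up sample imitating a tagged layout of the kind the loader expects.
sample = io.StringIO(
    "<DOC>\n"
    "<DOCNO> AP-0001 </DOCNO>\n"
    "<TEXT>\n"
    " Stocks rose on Monday as trading volume increased.\n"
    " </TEXT>\n"
    "</DOC>\n"
)

def read_documents(handle):
    """Keep only lines that do not start with a markup tag."""
    docs = []
    for line in handle:
        # str.startswith accepts a tuple of prefixes.
        if not line.startswith(("<", " <")):
            docs.append(line.strip())
    return docs

docs = read_documents(sample)
print(docs)
```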

In [None]:
ap_pd = pd.Series(ap)
processed_ap = ap_pd.map(preprocess)  
vocabulary = gensim.corpora.Dictionary(processed_ap)
bow_corpus = [vocabulary.doc2bow(doc) for doc in processed_ap]
doc = [dict(bow) for bow in bow_corpus]
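
Each entry of `doc` maps a word id to the number of times that word occurs in one document. The sketch below reproduces that structure with the standard library on two made-up token lists. `build_vocabulary` and `to_bow_dict` are hypothetical helpers, and the ids they assign need not match those that `gensim.corpora.Dictionary` would produce.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Assign an integer id to each distinct token, in first-seen order."""
    token_to_id = {}
    for tokens in token_lists:
        for token in tokens:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow_dict(tokens, token_to_id):
    """Map each word id to its count within a single document."""
    return dict(Counter(token_to_id[t] for t in tokens))

# Two made-up, already preprocessed documents.
toy_docs = [["market", "stock", "market"], ["presid", "stock"]]
token_to_id = build_vocabulary(toy_docs)
toy_bow = [to_bow_dict(tokens, token_to_id) for tokens in toy_docs]
print(token_to_id)
print(toy_bow)
```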

In [6]:
lda1 = LDA_OPT(30, 100, 100, 100)
alpha1, beta1 = lda1.fit(doc, vocabulary)

In [9]:
list1 = word_to_list(18, beta1)
list2 = word_to_list(0, beta1)
list3 = word_to_list(8, beta1)
dict1 = {'Life': list1, 'Economy': list2, 'Politics': list3}  
result = pd.DataFrame(dict1) 
result 

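|   | Economy | Life | Politics |
|---|---------|------|----------|
| 0 | million | year | presid |
| 1 | year | work | unit |
| 2 | market | peopl | forc |
| 3 | billion | famili | reagan |
| 4 | stock | children | defens |
| 5 | month | live | talk |
| 6 | compani | health | support |
| 7 | busi | studi | statement |
| 8 | point | life | sourc |
| 9 | board | like | troop |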

# Another Dataset

This dataset is named "All the news" and comes from [Kaggle](https://www.kaggle.com/snapcrack/all-the-news). It contains articles from the New York Times, Breitbart, CNN, Business Insider, the Atlantic, Fox News, and other outlets. The original dataset consists of three CSV files, but we use only the first 1000 rows of the second file.

In [None]:
df = pd.read_csv('data/articles.csv')
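
If `data/articles.csv` holds the full second file rather than an already truncated copy, the limit of 1000 rows could be applied at load time. The sketch below shows the idea with the standard library on a made-up five-row CSV. `read_first_rows` is a hypothetical helper, and the column names are assumptions apart from `content`. In pandas, the `nrows` argument of `read_csv` serves the same purpose.

```python
import csv
import io
from itertools import islice

def read_first_rows(handle, limit):
    """Read at most `limit` data rows from an open CSV file as dictionaries."""
    reader = csv.DictReader(handle)
    return list(islice(reader, limit))

# Made-up stand-in for the articles file: a header plus five rows.
toy_csv = io.StringIO(
    "id,content\n" + "".join(f"{i},article text {i}\n" for i in range(5))
)
rows = read_first_rows(toy_csv, 3)
contents = [row["content"] for row in rows]
print(contents)
```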

In [11]:
document = df[['content']].copy()
document['index'] = document.index
processed_docs = document['content'].map(preprocess)
vocabulary = gensim.corpora.Dictionary(processed_docs)
bow_corpus = [vocabulary.doc2bow(doc) for doc in processed_docs]
doc = [dict(bow) for bow in bow_corpus]

In [12]:
lda2 = LDA_OPT(10) 
alpha, beta = lda2.fit(doc, vocabulary)

In [14]:
list1 = word_to_list(4, beta)
list2 = word_to_list(3, beta)
list3 = word_to_list(2, beta)
dict2 = {'President Election': list1, 'Medical': list2, 'Astronomy': list3}  
result = pd.DataFrame(dict2) 
result 

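|   | Astronomy | Medical | President Election |
|---|-----------|---------|--------------------|
| 0 | say | say | trump |
| 1 | scientist | patient | presid |
| 2 | space | drug | busi |
| 3 | year | studi | organ |
| 4 | human | medic | hotel |
| 5 | planet | doctor | conflict |
| 6 | earth | health | properti |
| 7 | climat | pain | elect |
| 8 | univers | cancer | accord |
| 9 | speci | vaccin | tabl |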