# Topic Model

- A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
- Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently.
- A document typically concerns multiple topics in different proportions.

### Application 1: Topic Modeling in Financial Documents

<img src="graph/docflow.png">

<img src="graph/clu1.png">

- network company
- accounting terms

### Application 2: Social network analysis with topic models

 [**group or label** the edges and nodes in the graph based on their topic similarity](http://oak.cs.ucla.edu/~cho/papers/SIGIR12.pdf).

### Application 3: Discovering Health Topics in Social Media
- [By aggregating self-reported health statuses across millions of users, characterize the variety of health information discussed in Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0103408)

<img src="graph/dis.png">

<img src="graph/trend1.png">

### Application 4: Topic Models in Financial Market

- [Apply topic models to financial data to obtain a more accurate view of economic networks than that supplied by traditional economic statistics](https://web.stanford.edu/~gdoyle/papers/doyle-elkan-2009-nips-paper.pdf).
- The learned topic models can serve as a substitute for or a complement to more complicated network analysis.

<img src="graph/market.png">

# Latent Dirichlet Allocation (LDA) - Most Popular Topic Model

- LDA generates topics based on word frequency from a set of documents.


## Necessary packages

- NLTK: a natural language toolkit for Python. 
- stop_words: a Python package containing stop words.
- gensim: a topic modeling package containing our LDA model.


!pip install -U nltk
!pip install stop-words
!pip install gensim

## Raw data
We have the following documents:

In [2]:
doc_a = "Brocolli is  good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."



and make a corpus as follows:

In [3]:
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

## Preprocessing of data

- Tokenizing: converting a document to its atomic elements.
- Stopping: removing meaningless words.
- Stemming: merging words that are equivalent in meaning.

### Tokenizing

We use NLTK’s `tokenize.regexp` module to segment a document into words.

In [4]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [7]:
raw_a = doc_a.lower()  # convert all words to lower case
tokens_a = tokenizer.tokenize(raw_a)  # split on runs of word characters

In [6]:
print(tokens_a)

['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes', 'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']


### Stop words

In [8]:
from stop_words import get_stop_words
# create English stop words list
en_stop = get_stop_words('en')

In [9]:
stop_a = [i for i in tokens_a if i not in en_stop]

In [11]:
print(tokens_a)
print(stop_a)

['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes', 'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']
['brocolli', 'good', 'eat', 'brother', 'likes', 'eat', 'good', 'brocolli', 'mother']


### Stemming
Reduce words to their stems, so that variants such as "likes" and "like" are counted as the same term.

In [12]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

In [14]:
texts_a = [p_stemmer.stem(i) for i in stop_a]


In [21]:
# Apply the same three steps (tokenize, remove stop words, stem) to the other documents
raw_b = doc_b.lower()
tokens_b = tokenizer.tokenize(raw_b)
stop_b = [i for i in tokens_b if i not in en_stop]
texts_b = [p_stemmer.stem(i) for i in stop_b]

raw_c = doc_c.lower()
tokens_c = tokenizer.tokenize(raw_c)
stop_c = [i for i in tokens_c if i not in en_stop]
texts_c = [p_stemmer.stem(i) for i in stop_c]

raw_d = doc_d.lower()
tokens_d = tokenizer.tokenize(raw_d)
stop_d = [i for i in tokens_d if i not in en_stop]
texts_d = [p_stemmer.stem(i) for i in stop_d]

raw_e = doc_e.lower()
tokens_e = tokenizer.tokenize(raw_e)
stop_e = [i for i in tokens_e if i not in en_stop]
texts_e = [p_stemmer.stem(i) for i in stop_e]
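The repeated blocks above could be folded into one helper. The following is a minimal sketch, not part of the original notebook: the function name `preprocess` and the stand-ins `toy_tokenize`, `toy_stop_words` and `toy_stem` are hypothetical, and the stand-ins are only there so the example runs without NLTK or `stop_words` installed. They are not the real stop list or the Porter stemmer.

```python
import re


def preprocess(doc, tokenize, stop_words, stem):
    """Lower-case one document, tokenize it, drop stop words, then stem."""
    tokens = tokenize(doc.lower())
    kept = [t for t in tokens if t not in stop_words]
    return [stem(t) for t in kept]


# Stand-ins (assumptions for illustration only).
def toy_tokenize(text):
    # Same pattern as RegexpTokenizer(r'\w+'): runs of word characters.
    return re.findall(r"\w+", text)


toy_stop_words = {"is", "to", "my", "but", "not", "for", "your", "that"}


def toy_stem(word):
    # Crude plural stripping; NOT the Porter algorithm.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


doc_a = "Brocolli is  good to eat. My brother likes to eat good brocolli, but not my mother."
print(preprocess(doc_a, toy_tokenize, toy_stop_words, toy_stem))
```

With the objects defined in this notebook, the same helper would be called as `preprocess(doc, tokenizer.tokenize, en_stop, p_stemmer.stem)`, for example inside a list comprehension over `doc_set`.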

## Constructing a document-term matrix


In [19]:
TEXT = [texts_a, texts_b, texts_c, texts_d, texts_e]
print(len(TEXT))  # TEXT is the tokenized corpus: one list of stems per document

5


### Assign an ID to each word

In [22]:
from gensim import corpora

dictionary = corpora.Dictionary(TEXT)

In [23]:
print(dictionary.token2id)

{'perform': 22, 'brocolli': 1, 'tension': 15, 'never': 23, 'suggest': 17, 'basebal': 8, 'say': 30, 'eat': 2, 'time': 10, 'better': 29, 'feel': 28, 'often': 24, 'drive': 6, 'spend': 9, 'mother': 3, 'brother': 5, 'profession': 31, 'around': 12, 'increas': 19, 'seem': 25, 'good': 4, 'lot': 7, 'health': 20, 'blood': 14, 'school': 27, 'like': 0, 'well': 26, 'may': 16, 'expert': 13, 'caus': 18, 'practic': 11, 'pressur': 21}


## Replace words with IDs and frequencies
- Convert each document to a bag-of-words representation.

In [24]:
corpus = [dictionary.doc2bow(text) for text in TEXT] # bow : bag of words

In [26]:
print(len(corpus))

5


In [29]:
print(corpus[2])

[(6, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)]


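Each entry is a pair (word ID, term frequency). For example, an entry (0, 2) in the first document means that the word with ID 0 occurs twice in that document. To see which word an ID stands for, look it up in `dictionary.token2id`; the IDs may differ from run to run, so 0 is 'good' only if the dictionary of that run says so.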

<font color='blue'>"corpus" is a document-term matrix</font>, which is our starting point for LDA (topic) analysis.
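To make the (word ID, frequency) pairs concrete, here is a pure-Python sketch of what a dictionary plus a bag-of-words conversion does. It is not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and assigning IDs in order of first appearance is an assumption made for illustration, so the IDs will not match the gensim output above.

```python
from collections import Counter


def build_token2id(texts):
    """Give each distinct token an integer ID, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(text, token2id):
    """Return sorted (token ID, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())


texts = [
    ["brocolli", "good", "eat", "brother", "like", "eat", "good", "brocolli", "mother"],
    ["health", "profession", "say", "brocolli", "good", "health"],
]
token2id = build_token2id(texts)
print(token2id)
print(to_bow(texts[0], token2id))
print(to_bow(texts[1], token2id))
```

Stacking the bag-of-words lists of all documents gives the sparse document-term matrix: one row per document, one column per word ID.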

In [97]:
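from gensim import models

# Re-weight the raw counts with TF-IDF and compare with the plain bag of words
t = models.TfidfModel(corpus=corpus)
print(t[corpus[0]])
print(corpus[0])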

[(0, 0.40784451109112935), (1, 0.3581834867987973), (2, 0.11368521994734913), (3, 0.7163669735975946), (4, 0.40784451109112935), (5, 0.11368521994734913)]
[(0, 2), (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)]
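The weights printed above can be reproduced by hand. The sketch below assumes the default gensim weighting, namely raw term frequency times log2(N / document frequency), followed by L2 normalization. The function name `tfidf` is hypothetical, and the token lists are the stemmed documents written out by hand from the preprocessing steps above.

```python
import math
from collections import Counter

docs = [
    ["brocolli", "good", "eat", "brother", "like", "eat", "good", "brocolli", "mother"],
    ["mother", "spend", "lot", "time", "drive", "brother", "around", "basebal", "practic"],
    ["health", "expert", "suggest", "drive", "may", "caus", "increas", "tension",
     "blood", "pressur"],
    ["often", "feel", "pressur", "perform", "well", "school", "mother", "never",
     "seem", "drive", "brother", "better"],
    ["health", "profession", "say", "brocolli", "good", "health"],
]


def tfidf(doc, docs):
    """TF-IDF weights for one document: tf * log2(N / df), then L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    tf = Counter(doc)
    raw = {term: count * math.log2(n_docs / df[term]) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:  # every term occurs in every document
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


weights = tfidf(docs[0], docs)
for term, weight in sorted(weights.items()):
    print(term, round(weight, 4))
```

Under that assumption, "eat" gets the largest weight in the first document because it occurs twice there and in no other document, while "brother" and "mother" get the smallest because they occur in three of the five documents.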


## LDA Analysis

In [34]:
from gensim import models

In [39]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=3,
                                    id2word=dictionary, passes=20)

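- num_topics: the number of topics to generate; you have to choose this value yourself.
- id2word: the dictionary that maps word IDs back to words.
- passes: the number of passes over the corpus during training; more passes may give a better fit, at the cost of longer training.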

In [38]:
ldamodel.print_topic(1,topn=20)

'0.086*"health" + 0.086*"brocolli" + 0.086*"good" + 0.061*"eat" + 0.037*"caus" + 0.037*"may" + 0.037*"blood" + 0.037*"tension" + 0.037*"suggest" + 0.037*"expert" + 0.037*"increas" + 0.037*"say" + 0.037*"profession" + 0.037*"like" + 0.036*"pressur" + 0.036*"drive" + 0.036*"mother" + 0.036*"brother" + 0.012*"often" + 0.012*"lot"'

In [41]:
ldamodel.print_topics(num_topics=3,  num_words=3)

[(0, '0.098*"health" + 0.069*"drive" + 0.040*"increas"'),
 (1, '0.150*"brocolli" + 0.150*"good" + 0.108*"eat"'),
 (2, '0.059*"brother" + 0.059*"mother" + 0.059*"pressur"')]

We can also print the topic distribution for each document in the corpus:

In [42]:
ldamodel.get_document_topics(corpus[0])

[(0, 0.034067378103823549),
 (1, 0.93146216802237836),
 (2, 0.034470453873797977)]
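To turn such a distribution into a single label, one option is to pick the topic with the highest probability. This is a small sketch that works on the list of (topic ID, probability) pairs printed above; the helper name `dominant_topic` is hypothetical.

```python
def dominant_topic(doc_topics):
    """Return the (topic ID, probability) pair with the largest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


# Values copied from the output above.
doc_topics = [
    (0, 0.034067378103823549),
    (1, 0.93146216802237836),
    (2, 0.034470453873797977),
]

topic_id, prob = dominant_topic(doc_topics)
print(f"topic {topic_id} with probability {prob:.3f}")  # topic 1 with probability 0.931
```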

## What is LDA 

[LDA model](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)  : It is damn hard

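It is all about Bayesian inference: we go from the probability of words given a topic to the probability of a topic given the words.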
$$P(Words|Topic)\longrightarrow P(Topic|Words)$$
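
Written out with Bayes' rule, and using the same symbols as above, the inversion reads as follows (a simplified view for a single topic and a set of words, not the full LDA posterior):

$$P(Topic|Words) = \frac{P(Words|Topic)\,P(Topic)}{P(Words)}$$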

#### Forward thinking - how we generate a document
- Determine the topics the document will cover and their percentages.
- For each word slot in the document, randomly pick a topic according to those percentages, then randomly generate a word from that topic to fill the slot.
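
The forward (generative) view can be sketched in a few lines of standard-library Python. Everything here is a toy assumption: the topic names, the word probabilities and the function name `generate_document` are made up for illustration. In full LDA the topic mixture and the per-topic word distributions are themselves drawn from Dirichlet priors, whereas this sketch simply fixes them.

```python
import random

# Toy topics: each maps words to probabilities (hypothetical values).
topics = {
    "food": {"brocolli": 0.4, "good": 0.3, "eat": 0.3},
    "family": {"mother": 0.4, "brother": 0.4, "drive": 0.2},
    "health": {"health": 0.5, "pressur": 0.3, "blood": 0.2},
}


def generate_document(topic_mix, topics, n_words, rng):
    """Fill n_words slots: draw a topic per slot, then a word from that topic."""
    names = list(topic_mix)
    mix_weights = [topic_mix[name] for name in names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mix_weights)[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same document
topic_mix = {"food": 0.7, "family": 0.2, "health": 0.1}
print(generate_document(topic_mix, topics, 8, rng))
```

Fitting LDA runs this story backwards: given only the generated words, it estimates the topic mixture of each document and the word distribution of each topic.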




#### Backward tracking - backtrack and detect topics

## Visualization of Topic Models

In [101]:
!pip install pyLDAvis



In [109]:
import pyLDAvis.gensim  # in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # display the visualization inline in the notebook

In [133]:
pyLDAvis.gensim.prepare(ldamodel, corpus[0:4], dictionary)

In [128]:
ldamodel.print_topic(0,topn=15)

'0.061*"mother" + 0.061*"brother" + 0.060*"drive" + 0.059*"lot" + 0.059*"around" + 0.059*"practic" + 0.059*"time" + 0.059*"basebal" + 0.059*"spend" + 0.020*"say" + 0.020*"profession" + 0.020*"health" + 0.020*"brocolli" + 0.020*"good" + 0.020*"eat"'

## In-class practice


In [44]:
Newlist=[]
Newlist.append("4 UK Students Prove Their Amazing Money-Making System, Worth Millions, On live TV")
Newlist.append("U.S. Supreme Court may limit where companies can be sued")
Newlist.append("Wynn's New Macau Casino Delivers Forecast-Topping Profit")
Newlist.append("Gilead (GILD) Q1 Earnings: Stock Likely to Beat Estimates?")
Newlist.append("Bank of America just said there's ‘material risk’ to the long-term viability of Tesla")
Newlist.append("Is the Options Market Predicting a Spike in SeaDrill (SDRL) Stock?")
Newlist.append("Exelixis (EXEL) Q1 Earnings: Will the Stock Disappoint?")
Newlist.append("Express Scripts, Teva Pharmaceuticals Drop into Tuesday’s 52-Week Low Club")
Newlist.append("As retail bankruptcies climb toward post-recession high, these companies could be next")
Newlist.append("This Is the Eye-Opening Data That Hints the U.S. Restaurant Industry Has Completely Collapsed")
Newlist.append("Goldman has figured out the trick for making money off Amazon")
Newlist.append("T-Mobile CEO: Most people have no idea what 5G is")
Newlist.append("Is a Surprise in Store for Ford (F) this Earnings Season?")
Newlist.append("How Earned Income Affects Social Security in Retirement")
Newlist.append("EINHORN ON TESLA: 'We expect these bubbles to pop'")
Newlist.append("It's time for a reality check on Microsoft's grand turnaround vision")
Newlist.append("General Electric: Cash is King…and a Problem")

In [45]:
print(len(Newlist))

17


In [46]:
TEXT = []
for headline in Newlist:
    raw = headline.lower()
    tokens = tokenizer.tokenize(raw)
    stop = [i for i in tokens if i not in en_stop]
    texts = [p_stemmer.stem(i) for i in stop]
    TEXT.append(texts)

In [47]:
dictionary = corpora.Dictionary(TEXT)

In [48]:
print(dictionary.token2id)

{'gilead': 32, 'peopl': 90, 'money': 2, 'vision': 109, 'wynn': 22, 'q1': 29, 'gild': 31, '4': 8, 'million': 11, 'teva': 63, 'low': 62, 'industri': 81, 'deliv': 27, 'disappoint': 53, 'einhorn': 106, 'prove': 3, 'live': 4, 'student': 6, '52': 57, 'ceo': 88, 'like': 35, 'tesla': 39, 'limit': 14, 'su': 18, 'materi': 45, 'post': 67, 'season': 94, 'compani': 19, 'estim': 34, 'trick': 86, 'worth': 0, 'may': 16, '5g': 91, 'sdrl': 47, 'pop': 105, 'restaur': 77, 'risk': 37, 'long': 40, 'drop': 61, 'just': 46, 'high': 71, 's': 17, 'top': 28, 'can': 15, 'recess': 68, 'problem': 114, 'bankruptci': 69, 'retail': 70, 'said': 42, 'amazon': 83, 'next': 72, 'f': 96, 'realiti': 113, 'exelixi': 55, 'ford': 97, 'gener': 118, 'uk': 10, 'king': 117, 'option': 48, 'bank': 38, 'week': 58, 'collaps': 80, 'make': 5, 'retir': 102, 'profit': 26, 'suprem': 20, 'surpris': 93, 'will': 56, 'express': 66, 'hint': 75, 'idea': 87, 'forecast': 21, 'exel': 54, 'tv': 1, 'seadril': 49, 'predict': 50, 'viabil': 43, 'america':

In [49]:
corpus = [dictionary.doc2bow(text) for text in TEXT]

In [50]:
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1)]


In [56]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=5,
                                    id2word=dictionary, passes=40)

In [57]:
ldamodel.print_topics(num_topics=5,  num_words=3)

[(0, '0.053*"compani" + 0.029*"u" + 0.029*"high"'),
 (1, '0.028*"money" + 0.028*"make" + 0.028*"stock"'),
 (2, '0.024*"profit" + 0.024*"macau" + 0.024*"casino"'),
 (3, '0.063*"s" + 0.018*"pharmaceut" + 0.018*"script"'),
 (4, '0.041*"earn" + 0.022*"tesla" + 0.022*"stock"')]

In [58]:
ldamodel.get_document_topics(corpus[0])

[(0, 0.015387364985435362),
 (1, 0.93840886993991157),
 (2, 0.015386918619843593),
 (3, 0.015430101024563086),
 (4, 0.01538674543024654)]

In [59]:
Newlist[0]

'4 UK Students Prove Their Amazing Money-Making System, Worth Millions, On live TV'