# Topic Model

- A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.
- Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently.
- A document typically concerns multiple topics in different proportions.

### Application 1: Topic Modeling in Financial Documents

<img src="graph/docflow.png">

<img src="graph/clu1.png">

- network company
- accounting terms

### Application 2: Social Network Analysis with Topic Models

- [**Group or label** the edges and nodes in the graph based on their topic similarity](http://oak.cs.ucla.edu/~cho/papers/SIGIR12.pdf).

### Application 3: Discovering Health Topics in Social Media
- [By aggregating self-reported health statuses across millions of users, characterize the variety of health information discussed in Twitter](http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0103408)

<img src="graph/dis.png">

<img src="graph/trend1.png">

### Application 4: Topic Models in Financial Markets

- [Apply topic models to financial data to obtain a more accurate view of economic networks than that supplied by traditional economic statistics](https://web.stanford.edu/~gdoyle/papers/doyle-elkan-2009-nips-paper.pdf).
- The learned topic models can serve as a substitute for or a complement to more complicated network analysis.

<img src="graph/market.png">

# Latent Dirichlet Allocation (LDA) - Most Popular Topic Model

- LDA generates topics based on word frequency from a set of documents.


## Necessary packages

- NLTK: a natural language toolkit for Python. 
- stop_words: a Python package containing stop words.
- gensim: a topic modeling package containing our LDA model.


In [5]:
!pip install -U nltk
!pip install stop-words
!pip install gensim

Requirement already up-to-date: nltk in /Users/neuron/anaconda3/lib/python3.5/site-packages
Requirement already up-to-date: six in /Users/neuron/anaconda3/lib/python3.5/site-packages (from nltk)


## Raw data
We have the following documents:

In [18]:
doc_a = "Brocolli is  good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."



We collect them into a corpus as follows:

In [19]:
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

## Preprocessing of data

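- Tokenizing: splitting a document into its atomic elements (here, words).
- Stopping: removing common words that carry little meaning, such as "is" or "to".
- Stemming: reducing words that share a root to a common stem, so that equivalent forms are merged.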

### Tokenizing

We use NLTK's RegexpTokenizer to segment a document into words; the pattern \w+ matches each run of letters, digits, and underscores.

In [20]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [21]:
raw_a = doc_a.lower()  # convert all words to lower case
tokens_a = tokenizer.tokenize(raw_a)

In [22]:
print(tokens_a)

['brocolli', 'is', 'good', 'to', 'eat', 'my', 'brother', 'likes', 'to', 'eat', 'good', 'brocolli', 'but', 'not', 'my', 'mother']
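As a rough cross-check, the same tokens can be produced with only the standard library. This is a sketch, assuming that RegexpTokenizer with its default settings simply returns every non-overlapping match of the pattern, which is what `re.findall` does.

```python
import re

doc_a = "Brocolli is  good to eat. My brother likes to eat good brocolli, but not my mother."

# \w+ matches each run of letters, digits, or underscores, so punctuation
# and whitespace act as separators and are dropped.
tokens_a = re.findall(r'\w+', doc_a.lower())
print(tokens_a)
```

Under that assumption, the printed list should match the NLTK output above.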


### Stop words

In [23]:
from stop_words import get_stop_words

# create English stop words list
en_stop = get_stop_words('en')

In [30]:
stop_a = [i for i in tokens_a if i not in en_stop]

In [31]:
print(stop_a)

['brocolli', 'good', 'eat', 'brother', 'likes', 'eat', 'good', 'brocolli', 'mother']


### Stemming
Reduce words to their stems.

In [27]:
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

In [32]:
texts_a = [p_stemmer.stem(i) for i in stop_a]

In [33]:
print(texts_a)

['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']


## Constructing a document-term matrix


In [47]:
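# Apply the same three steps (tokenize, remove stop words, stem) to every document
TEXT = []
for doc in doc_set:
    tokens = tokenizer.tokenize(doc.lower())
    stopped = [t for t in tokens if t not in en_stop]
    TEXT.append([p_stemmer.stem(t) for t in stopped])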


### Assign an ID to each word

In [58]:
from gensim import corpora

dictionary = corpora.Dictionary(TEXT)

In [51]:
print(dictionary.token2id)

{'blood': 13, 'like': 1, 'increas': 15, 'perform': 24, 'mother': 5, 'practic': 8, 'often': 26, 'brocolli': 4, 'around': 12, 'brother': 2, 'better': 27, 'drive': 9, 'say': 30, 'school': 25, 'good': 0, 'feel': 22, 'lot': 10, 'caus': 16, 'suggest': 18, 'spend': 7, 'time': 11, 'basebal': 6, 'health': 14, 'never': 28, 'seem': 23, 'eat': 3, 'pressur': 21, 'tension': 19, 'may': 20, 'profession': 31, 'well': 29, 'expert': 17}


### Replace words with IDs and frequencies
- Convert each document to a bag-of-words representation.

In [54]:
corpus = [dictionary.doc2bow(text) for text in TEXT]

In [55]:
print(len(corpus))

5


In [83]:
print(corpus[0])

[(0, 2), (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)]


Each pair is (word ID, term frequency). For example, in (0, 2) the ID 0 stands for the word 'good', and 2 is the number of times 'good' occurs in the first document.
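To make the bag-of-words idea concrete, here is a minimal sketch that builds the same kind of (ID, count) pairs with the standard library. The ID assignment (order of first appearance) is an assumption made for illustration; gensim numbers words differently, so the IDs will not match the dictionary printed above, although the counts do.

```python
from collections import Counter

# Stemmed tokens of the first document, copied from the stemming step.
texts_a = ['brocolli', 'good', 'eat', 'brother', 'like', 'eat', 'good', 'brocolli', 'mother']

# Hypothetical ID scheme: number each new word in order of first appearance.
token2id = {}
for token in texts_a:
    token2id.setdefault(token, len(token2id))

# Count how often each ID occurs and sort the pairs by ID.
counts = Counter(token2id[token] for token in texts_a)
bow = sorted(counts.items())
print(bow)
```

Word order is discarded: only which words occur, and how often, survives into the pairs.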

<font color='blue'>"corpus" is a document-term matrix</font>, which is our starting point for LDA (topic) analysis.

In [97]:
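from gensim import models

# Re-weight the raw counts with TF-IDF (shown for comparison; the LDA model below is trained on the raw counts)
t = models.TfidfModel(corpus=corpus)
print(t[corpus[0]])
print(corpus[0])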

[(0, 0.40784451109112935), (1, 0.3581834867987973), (2, 0.11368521994734913), (3, 0.7163669735975946), (4, 0.40784451109112935), (5, 0.11368521994734913)]
[(0, 2), (1, 1), (2, 1), (3, 2), (4, 2), (5, 1)]
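The TF-IDF weights above can be approximated by hand. The sketch below assumes the weighting is term frequency times log2(N / document frequency), followed by scaling the vector to unit length; the document frequencies are counted by hand from the five example documents, and the variable names are illustrative.

```python
import math

num_docs = 5
# Term counts in the first document (corpus[0], with IDs replaced by words).
term_freq = {'good': 2, 'like': 1, 'brother': 1, 'eat': 2, 'brocolli': 2, 'mother': 1}
# Number of the five example documents that contain each stem, counted by hand.
doc_freq = {'good': 2, 'like': 1, 'brother': 3, 'eat': 1, 'brocolli': 2, 'mother': 3}

# Weight = term frequency * log2(N / document frequency).
raw = {w: tf * math.log2(num_docs / doc_freq[w]) for w, tf in term_freq.items()}

# Scale the vector to unit Euclidean length.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {w: v / norm for w, v in raw.items()}

for word, weight in tfidf.items():
    print(word, round(weight, 4))
```

Words that occur in many documents, such as 'mother' and 'brother', end up with low weights, while 'eat', which appears twice in this document and nowhere else, gets the highest weight.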


## LDA Analysis

In [57]:
from gensim import models

In [118]:
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)

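- num_topics: the number of topics to generate; you must choose this in advance.
- id2word: the dictionary that maps word IDs back to words.
- passes: the number of passes over the corpus during training; more passes can give a better fit, at the cost of longer training time.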

In [125]:
ldamodel.print_topic(0, topn=3)

'0.061*"mother" + 0.061*"brother" + 0.060*"drive"'

In [122]:
ldamodel.print_topics(num_topics=2, num_words=3)

[(0, '0.061*"mother" + 0.061*"brother" + 0.060*"drive"'),
 (1, '0.066*"good" + 0.066*"brocolli" + 0.066*"health"')]
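The topic strings are convenient to read but awkward to process. As a sketch, the snippet below parses one of the printed strings into (word, weight) pairs with a regular expression, so it runs without a trained model. gensim's `show_topic` method should return such pairs directly; the parsing approach here is only an illustration.

```python
import re

# Topic 0 as printed above.
topic_str = '0.061*"mother" + 0.061*"brother" + 0.060*"drive"'

# Each term looks like 0.061*"mother": capture the weight and the quoted word.
pairs = [(word, float(weight))
         for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)]
print(pairs)
```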

We can also print the topic distribution for each document in the corpus:

In [123]:
ldamodel.get_document_topics(corpus[0])

[(0, 0.056829289219893886), (1, 0.94317071078010617)]
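The result is a list of (topic ID, probability) pairs. A small sketch of how one might pick the dominant topic from such a list, using the values printed above as literal input:

```python
# Output of get_document_topics for the first document, copied from above.
doc_topics = [(0, 0.056829289219893886), (1, 0.94317071078010617)]

# The dominant topic is the pair with the largest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(best_topic, round(best_prob, 3))
```

Here the first document is assigned almost entirely to topic 1, the topic whose top words are "good", "brocolli", and "health".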

## What is LDA?

[LDA model](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation): the full derivation is genuinely hard.

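It is all about Bayesian inference: from the probability of the words given a topic, we infer the probability of a topic given the words.
$$P(\text{Words} \mid \text{Topic}) \longrightarrow P(\text{Topic} \mid \text{Words})$$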
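
Written out with Bayes' rule, the arrow above corresponds to the identity below. This is a simplification: full LDA places Dirichlet priors on the topic and word distributions and infers them jointly, which this single formula does not show.

$$P(\text{Topic} \mid \text{Words}) = \frac{P(\text{Words} \mid \text{Topic})\, P(\text{Topic})}{P(\text{Words})}$$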

#### Forward thinking - how a document is generated
- Determine the topics the document will cover and their proportions.
- For each word slot in the document, randomly pick a topic according to those proportions, then randomly generate a word from that topic to fill the slot.
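
The generative story can be sketched in a few lines of plain Python. The topic names, proportions, and word probabilities below are hypothetical values chosen for illustration, and the function name `generate_document` is not part of any library. Full LDA would also draw the topic proportions from a Dirichlet prior; this sketch takes them as given.

```python
import random

def generate_document(topic_mix, topic_words, length, seed=0):
    """Generate `length` words: pick a topic per slot, then a word from that topic."""
    rng = random.Random(seed)
    topics = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 1: choose a topic according to the document's topic proportions.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 2: choose a word according to that topic's word distribution.
        vocab = list(topic_words[topic])
        word_weights = [topic_words[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words

# Hypothetical topics, loosely modelled on the two topics found earlier.
topic_words = {
    'family': {'mother': 0.4, 'brother': 0.4, 'drive': 0.2},
    'food': {'good': 0.3, 'brocolli': 0.4, 'health': 0.3},
}
topic_mix = {'family': 0.3, 'food': 0.7}

doc = generate_document(topic_mix, topic_words, length=8)
print(doc)
```

With these proportions, most generated words should come from the 'food' topic, but any individual run is random.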




#### Backward tracking - backtrack from the words to detect the topics
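
The backward direction can be illustrated for a single word with Bayes' rule. This is only a sketch under strong assumptions: the topic names, the prior, and the word probabilities are hypothetical and treated as known, whereas real LDA has to estimate all of them from the corpus with approximate inference.

```python
# Hypothetical prior over topics and word probabilities within each topic.
topic_prior = {'family': 0.5, 'food': 0.5}
word_given_topic = {
    'family': {'mother': 0.4, 'brother': 0.4, 'brocolli': 0.1, 'good': 0.1},
    'food': {'mother': 0.1, 'brother': 0.1, 'brocolli': 0.4, 'good': 0.4},
}

def topic_posterior(word):
    """Return P(topic | word) for every topic using Bayes' rule."""
    # Numerator of Bayes' rule: P(word | topic) * P(topic).
    joint = {t: topic_prior[t] * word_given_topic[t][word] for t in topic_prior}
    # Denominator: P(word), the sum of the numerators over all topics.
    evidence = sum(joint.values())
    return {t: p / evidence for t, p in joint.items()}

posterior = topic_posterior('brocolli')
print(posterior)
```

Seeing the word 'brocolli' shifts the belief from an even split to roughly 0.8 for 'food' and 0.2 for 'family'.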

## Visualization of Topic Models

In [101]:
!pip install pyLDAvis



In [109]:
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # render the visualization inline in the notebook

In [133]:
pyLDAvis.gensim.prepare(ldamodel, corpus[0:4], dictionary)

In [128]:
ldamodel.print_topic(0, topn=15)

'0.061*"mother" + 0.061*"brother" + 0.060*"drive" + 0.059*"lot" + 0.059*"around" + 0.059*"practic" + 0.059*"time" + 0.059*"basebal" + 0.059*"spend" + 0.020*"say" + 0.020*"profession" + 0.020*"health" + 0.020*"brocolli" + 0.020*"good" + 0.020*"eat"'