## Topic Modelling with Latent Dirichlet Allocation (LDA)

- Describes the broad taks of assigning topics to unlabelled text documents.
- Latent Dirichlet Allocation is often abbriviated as LDA, it is not to be confused with Linear Discriminant analysis, - a supervised dimenstionality reduction technique.


### Decomposing text documents with LDA

- The mathematics behind LDA is quite involved and required knowledge about __Bayesian inference__, the approach here is from a practitioner's perspective.


LDA is a __generative probabilistic model__ that tries to find groups of words that appear frequently together across different documents. These frequently appearing words represent our topics assuming that each document is a mixture of different words.

- The input to an LDA is the bag-of-words model we discused earlier
- Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:
> A document to topic matrix<br>
> A word to topic matrix


LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together, we would be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. In practice, we are interested in those topics that LDA found in the bag-of-words matrix.  

- The only downside may be that we must define the number of topics beforehand -- the number of topic is a hyperparameter of LDA that has to be specified manually.

### LDA with scikit-learn



In [2]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding='utf-8')

We are going to use the _CountVectorizer_ to create the bag-of-words matrix as input to the LDA.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# for convenience, we are using scikit-learn's built-in English stop word library, via stop_words = 'English
count = CountVectorizer(stop_words = 'english',
                       max_df = .1,
                       max_features=5000)

X= count.fit_transform(df['review'].values)

Notice that we set _maximum document frequency_ of words to be considered to 10 percent(max_df=.1) to exclude words that occur too frequently across documetns. 
- The rationale being that frequently common occuring words might be common words appearing across all documents and are therefore less likely associated with a specific topic category of a given document.
- We have also limited the number of words to be considered to the most frequentyly occuring 5,000 words (_max_featueres=5000_), to limit the dimensionality of this dateset so that it improves the inference performed by LDA. However both max_df and max_features are hyperparameter values and should be tune

In [4]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10, random_state=123, learning_method='batch')

X_topics = lda.fit_transform(X)

By setting _learning-method = 'bathc'_ ,we let the _lda_ estimator do its estimation based on all available tarining data (thebag-of-words matrix) in one iteration, which is slower than the alternative _'online'_ learning method but can lead to more accuracte results.


We now have access to the _components_ attribute of the _lda_ instance, which stores a matrix containing the word importance(here, 5000) for each of the 10  in increasing order


In [5]:
lda.components_.shape

(10, 5000)

let's print the five most import words for each of the 10 topics . The word imprtance values are ranked in increasing order. thus to pringt the top 5 words, we need to sort the _topic_ array in reverse order:

In [6]:
n_top_words=5
feature_names = count.get_feature_names()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d: " %(topic_idx+1))
    print(" ".join([feature_names[i] for i in topic.argsort() ]))

Topic 1: 
Topic 2: 
Topic 3: 
Topic 4: 
Topic 5: 
Topic 6: 
Topic 7: 
Topic 8: 
Topic 9: 
Topic 10: 


Based on readng  teh five most important words for each topic, we may guess that the LDA identified the following topics:
1. Generally bad movies (not really a topic category)
2. Movies about fam8lies
3. War Movies
4. Art Movies
5. Crime Movies
6. Horror Movies
7. Comedy Movies
8. Movies somehow related to TV show
9. Movies based on books
10. Action Movies

let's plot three movies form the horror movie category (horror movies belong to category 6 at inde position 5)

In [7]:
horror = X_topics[:,5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' %(iter_idx +1))
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...
