# Topic Modelling with LDA


#### Acknowledgement: This exercise is adapted from the Intel AI Developer Program

For this exercise, we will derive topics from an existing corpus of newsgroup posts. The newsgroups are already categorized, so we have a useful set of baseline topics to help us understand the outcome of LDA better.



###  Import the necessary libraries
- LDA works only with a bag-of-words representation, which scikit-learn's `CountVectorizer` provides.
- Regular expressions (`re`) and `nltk` are imported for processing text.
- pyLDAvis and matplotlib are used for visualization, and numpy for numerical work.
- Pandas is used for manipulating and viewing data in tabular format.



In [3]:
#!pip show pyLDAvis

Name: pyLDAvis
Version: 2.1.2
Summary: Interactive topic model visualization. Port of the R package.
Home-page: https://github.com/bmabey/pyLDAvis
Author: Ben Mabey
Author-email: ben@benmabey.com
License: MIT
Location: c:\users\david\anaconda3\lib\site-packages
Requires: pytest, scipy, future, wheel, numpy, pandas, funcy, jinja2, numexpr, joblib
Required-by: 


In [4]:
# Sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

# Data handling and text processing
import numpy as np
import pandas as pd
import re, nltk

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline



#### display_topics() is a commonly used function to display topics and their top terms
- model - the fitted LDA model
- feature_names - the feature names (the vocabulary terms)
- no_top_words - how many terms to display per topic

**Do not change this function**

In [5]:

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
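
The slice `topic.argsort()[:-no_top_words - 1:-1]` in `display_topics` can look cryptic. `argsort()` returns indices in ascending order of weight, so stepping backwards from the end yields the highest-weighted terms first. The sketch below reproduces this with plain Python lists; the vocabulary and weights are made-up values, and the `argsort` helper is a hypothetical stand-in for the NumPy method.

```python
def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical topic-term weights for a five-word vocabulary.
feature_names = ["game", "team", "windows", "gun", "play"]
topic = [9.0, 7.5, 0.2, 0.1, 4.0]
no_top_words = 3

order = argsort(topic)                      # ascending: [3, 2, 4, 1, 0]
top_indices = order[:-no_top_words - 1:-1]  # last three, reversed: [0, 1, 4]
top_terms = [feature_names[i] for i in top_indices]
print(" ".join(top_terms))                  # game team play
```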

#### Function for Data Cleaning
- clean_documents() is a function that performs data cleansing on the raw documents.
- It is a stub function for now. When you prepare your own set of data, you will write your own pre-processing logic.
- As you evaluate the results of the topic components, you may have to revisit this function to improve the data clean-up.

In [6]:
def clean_documents(document):
    # placeholder: write data preparation code here
    
    return document
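
As a starting point for that pre-processing logic, the sketch below shows one possible cleaning pass using only `re`. The function name `clean_documents_sketch` and the specific rules (dropping e-mail addresses, digit-only tokens and extra whitespace) are assumptions for illustration, not part of the original exercise; which rules help depends on your corpus.

```python
import re

EMAIL_RE = re.compile(r"\S+@\S+")    # anything that looks like an e-mail address
NUMBER_RE = re.compile(r"\b\d+\b")   # tokens made only of digits
SPACE_RE = re.compile(r"\s+")        # runs of whitespace, including newlines

def clean_documents_sketch(documents):
    """Return a cleaned, lowercased copy of a list of raw document strings."""
    cleaned = []
    for text in documents:
        text = EMAIL_RE.sub(" ", text)
        text = NUMBER_RE.sub(" ", text)
        text = SPACE_RE.sub(" ", text).strip()
        cleaned.append(text.lower())
    return cleaned

sample = ["Mail me at  joe@example.com\nby 10 April 1993!"]
print(clean_documents_sketch(sample))   # ['mail me at by april !']
```

Removing digit-only tokens is relevant here because Topic 3 in the output further down consists entirely of numbers.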

###  Data Processing 

#### Modify this section if your documents come from another source
- Load the documents from their source.
- The LDA topic model algorithm requires a document-word matrix as its main input.
- Vectorize the documents using count vectorization.
- LDA should be given raw term counts (not TF-IDF weights) because it is a probabilistic graphical model of word counts.
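
To make the document-word matrix concrete, here is a minimal sketch in plain Python. It builds raw term counts for a tiny made-up corpus, with one row per document and one column per vocabulary term. The helper name `build_document_term_matrix` and the simple tokenizer are assumptions for illustration; `CountVectorizer` applies the same idea at scale and returns a sparse matrix.

```python
import re
from collections import Counter

def build_document_term_matrix(documents):
    """Return (vocabulary, matrix) holding raw term counts per document."""
    # Lowercase the text and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["The team won the game", "Windows card and video card"]
vocabulary, matrix = build_document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded and only the counts remain, which is what "bag of words" means.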


In [14]:

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
corpus = clean_documents(dataset.data)
print(corpus[0])

Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



In [8]:
no_features = 1000
cv = CountVectorizer(max_features=no_features, stop_words='english')
document_terms = cv.fit_transform(corpus)
tf_feature_names = cv.get_feature_names()
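
A compatibility note on `get_feature_names()`: scikit-learn 1.0 introduced `get_feature_names_out()` and later releases removed the older method, so the cell above may raise an `AttributeError` on a recent installation. The sketch below is one way to support both. The helper name `vocabulary_terms` is hypothetical, and the two stand-in classes exist only to keep the example runnable without scikit-learn.

```python
def vocabulary_terms(vectorizer):
    """Return the vectorizer's terms as a list, whichever method is available."""
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

# Stand-ins that mimic only the method names of older and newer vectorizers.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["game", "team"]

class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["game", "team"]

print(vocabulary_terms(OldStyleVectorizer()))
print(vocabulary_terms(NewStyleVectorizer()))
```

In the notebook this would be called as `tf_feature_names = vocabulary_terms(cv)`.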

#### Create the LDA model and apply it to the corpus of documents
- Create the LDA object
- Fit and transform the vectorized documents (the term-count matrix) using LDA


In [10]:
# TO DO:
    
no_topics = 20    # this is just a wild guess
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
lda_output = lda_model.fit_transform(document_terms)

# print(lda_model)  # To look at the Model attributes
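
`fit_transform` returns one row per document and one column per topic, holding that document's topic weights. A common next step is to assign each document its highest-weighted topic. The sketch below does this on a small made-up matrix; the values and the helper name `dominant_topics` are illustrative assumptions, and in the notebook `lda_output` would take the place of `doc_topic`.

```python
def dominant_topics(doc_topic):
    """Return, for each document row, the index of its largest topic weight."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic]

# Hypothetical document-topic weights: 3 documents, 4 topics.
doc_topic = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.25, 0.40, 0.20, 0.15],
]

labels = dominant_topics(doc_topic)
print(labels)   # [0, 2, 1]
```

Grouping documents by these labels and comparing the groups with `dataset.target` gives a rough check against the known newsgroup categories, bearing in mind that topic indices and category indices are unrelated numberings.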


### Preview the terms of each topic
- Display the top terms of each topic

In [11]:
no_top_words = 20
display_topics(lda_model, tf_feature_names, no_top_words)

Topic 0:
don think db use like just want good need stuff know sure ground wire pretty btw deleted used food really
Topic 1:
mr gun law guns control stephanopoulos crime laws weapons firearms state self states use police rate children defense don carry
Topic 2:
ve like line post bike thanks got right ll know lines sound way sorry good current area dod yes point
Topic 3:
10 11 25 15 12 17 14 16 20 13 24 18 27 19 30 21 23 26 22 00
Topic 4:
game team games play season hockey league players win player teams nhl good won better best runs lost hit second
Topic 5:
does people question problem case point answer rules believe article used given cause actually fact certain far use different idea
Topic 6:
windows thanks card does dos window use know using problem memory help pc like video hi work ms need program
Topic 7:
new 1993 university april national 000 health information research american washington york medical center news 1992 san united report states
Topic 8:
just don say think know way 

#### Using pyLDAvis to visualize topic models
- pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data.
- The package extracts information from a fitted LDA topic model to form an interactive web-based visualization.
- Intuitively, a good topic model shows fairly large, non-overlapping circles, one for each topic.
- For examples of how to use scikit-learn's topic models with pyLDAvis, see https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/sklearn.ipynb
- Note: the topic numbering in this visualization does not correspond to the model's own topic numbering.


In [13]:
# TO DO: Write the code to display pyLDAvis
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_model, document_terms, cv)
panel



#### Questions:
1. Select topic 15. Based on its terms, can you suggest what this topic is about?
2. Set the relevance metric parameter (λ) to 0.6.
3. Look at the first term of topic 15. Is this top term unique to the topic?
4. Click on the term. Notice that the sizes of the circles in the scatter plot change. For which other topics is the term significant?
5. Continue to explore the various topics and terms. Are there any top terms that are not very helpful? Should you remove these terms using NLP preprocessing techniques?
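
The relevance setting in question 2 controls how pyLDAvis ranks terms within a topic. Following the definition by Sievert and Shirley, relevance blends a term's probability within the topic with its lift over the corpus-wide probability: λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)). The sketch below ranks three terms for one topic using made-up probabilities; the function names `relevance` and `rank_terms` and all numbers are assumptions for illustration.

```python
import math

def relevance(p_term_given_topic, p_term, lam):
    """Relevance of a term to a topic for a weight `lam` between 0 and 1."""
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)

# Hypothetical probabilities: (p(term | topic), p(term) over the whole corpus).
terms = {
    "know":   (0.050, 0.0400),   # frequent everywhere, not distinctive
    "hockey": (0.030, 0.0020),   # rarer overall, concentrated in this topic
    "nhl":    (0.010, 0.0005),
}

def rank_terms(lam):
    """Return the terms sorted from most to least relevant."""
    return sorted(terms, key=lambda t: relevance(*terms[t], lam), reverse=True)

print(rank_terms(1.0))   # ['know', 'hockey', 'nhl']
print(rank_terms(0.6))   # ['hockey', 'nhl', 'know']
```

With λ = 1 the ranking follows in-topic probability alone, so generic words rise to the top; lowering λ favours terms that are distinctive to the topic.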

#### Reference:
- www.machinelearningplus.com/nlp/topic-modeling-python-sklearn-examples/
- https://youtu.be/IksL96ls4o0
