<a href="https://colab.research.google.com/github/kiranshahi/Natural-Language-Processing/blob/main/Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modelling

## Problem description

Topic modelling is a type of statistical modelling that aims to discover the hidden semantic structure in a large collection of documents, known as a corpus.

Here we use Latent Dirichlet Allocation (LDA) as the topic model to uncover the topics in a set of documents. LDA assumes that each document in a corpus is a mixture of one or more hidden topics, and that each topic is characterised by a probability distribution over words.

We find these hidden topics and their supporting words by choosing the topics and word distributions that make the observed corpus most probable, that is, by maximising the likelihood `p(corpus|topics,words)`.
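To make the assumption behind LDA more concrete, here is a minimal sketch of the generative story it posits: each document draws a topic mixture, and each word is drawn from one of the topics in that mixture. It uses only the Python standard library. The helper names (`sample_dirichlet`, `generate_document`), the toy vocabulary, the topic-word probabilities, and `alpha=0.5` are all assumptions made for illustration; they are not taken from the fitted model in this notebook.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, alpha, length, rng):
    """Generate one toy document following the LDA generative story."""
    num_topics = len(topic_word_probs)
    # 1. Draw the topic mixture of this document.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(num_topics), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
# Hypothetical vocabulary and topic-word distributions (illustration only).
vocabulary = ["machin", "learn", "stock", "price"]
topic_word_probs = [
    [0.5, 0.4, 0.05, 0.05],   # a "machine learning"-like topic
    [0.05, 0.05, 0.5, 0.4],   # a "stock market"-like topic
]
theta, words = generate_document(topic_word_probs, vocabulary, alpha=0.5, length=10, rng=rng)
print(theta, words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.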

## Implementation and Results

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

In [2]:
documents = [
  """
  Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data. 
  It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, 
  in order to make predictions or decisions without being explicitly programmed to do so. 
  Machine learning are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, 
  and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
  We’re in the age where machines are no different. Machine Learning is still fairly a new concept. 
  We can teach machines how to learn and some machines can even learn on its own. This is magical phenomenon is called Machine Learning.
  ".
  """,
  """
  Indian market on a working day, opens at 9:00 AM and closes at 3:30 PM. 
  The price of the stocks when the market opens is called the opening price. 
  The price of the stocks when the market closes is called the closing price. 
  Through the session, the stocks also hit two more values of importance 
  which are the day’s highest price and the day’s lowest price.
  A market is where trading of stocks happen. Traders are of two kinds.
  Trader who buys stocks and a trader who sells stocks. Sellers offer the stocks and buyers bid the stocks. 
  If there is buying pressure and buyers bid at a higher price and the stock prices rise, we call this state of market as bullish.
  """
]

In [3]:
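# Clean the data: lower-case, remove stop words, then stem each remaining word

nltk.download('stopwords')
def stemming(documents):
  """Return one list of stemmed, stop-word-free tokens per document."""
  stemmer = SnowballStemmer('english')
  stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
  # Note: split() tokenises on whitespace only, so punctuation stays attached
  # to words (this is why a token such as "data," appears in the results).
  texts = [
    [stemmer.stem(word) for word in document.lower().split() if word not in stop_words]
    for document in documents
  ]
  return texts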

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
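Because the cleaning step splits on whitespace, punctuation stays attached to tokens. One possible alternative, sketched here under stated assumptions, is to extract runs of letters with a regular expression before filtering stop words. The `tokenize` helper and the tiny `DEMO_STOP_WORDS` set are hypothetical and exist only to keep the example self-contained; in the notebook, the nltk stop-word list and the Snowball stemmer would still be applied. Using this tokeniser would change the token counts and therefore the reported topic weights.

```python
import re

# Hypothetical mini stop-word list, for illustration only; the notebook uses
# nltk's stopwords.words('english') instead.
DEMO_STOP_WORDS = {"as", "is", "the", "a", "of", "to", "and"}

def tokenize(document, stop_words=DEMO_STOP_WORDS):
    """Lower-case the text and keep only runs of letters, dropping stop words."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [token for token in tokens if token not in stop_words]

tokens = tokenize("Sample data, known as training data.")
print(tokens)  # ['sample', 'data', 'known', 'training', 'data']
```

Note that this pattern also splits words at apostrophes and drops digits, which may or may not be desirable for a given corpus.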


In [4]:
texts = stemming(documents)

# Create a dictionary from the words
dictionary = corpora.Dictionary(texts)

# Create a document-term matrix
doc_term_mat = [dictionary.doc2bow(text) for text in texts]

# Generate the LDA model 
num_topics = 2
ldamodel = models.ldamodel.LdaModel(doc_term_mat, 
        num_topics=num_topics, id2word=dictionary, passes=25)
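The bag-of-words representation built in the cell above stores each document as a list of `(token_id, count)` pairs. The following standard-library sketch mimics that idea on toy data. The helpers `build_vocabulary` and `to_bow` are hypothetical stand-ins, and the way ids are assigned here (first-seen order) is an assumption that may differ from how gensim numbers its tokens.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct token to an integer id, in first-seen order."""
    token_to_id = {}
    for text in texts:
        for token in text:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(text, token_to_id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in the vocabulary."""
    counts = Counter(token_to_id[t] for t in text if t in token_to_id)
    return sorted(counts.items())

# Toy, already-stemmed documents (illustration only).
toy_texts = [["machin", "learn", "machin"], ["stock", "price", "stock", "market"]]
token_to_id = build_vocabulary(toy_texts)
bows = [to_bow(text, token_to_id) for text in toy_texts]
print(bows)  # [[(0, 2), (1, 1)], [(2, 2), (3, 1), (4, 1)]]
```

Tokens that are missing from the vocabulary are silently dropped, which matters later when a new document is projected onto the trained model.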

In [5]:
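num_words = 5
# Note: print_topics returns topics, not documents. Labelling topic 0 as the
# "first document" matches this run, but LDA does not guarantee that ordering.
for item in ldamodel.print_topics(num_topics=num_topics, num_words=num_words):
  print(f'\nTop {num_words} contributing words for {("first" if item[0] == 0 else "second" )} document:')
  # Each topic is formatted as 'weight*"word" + weight*"word" + ...'
  list_of_strings = item[1].split(' + ')
  for text in list_of_strings:
    details = text.split('*')  # [weight, quoted word]
    print("%-12s:%0.2f%%" %(details[1], 100*float(details[0])))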


Top 5 contributing words for first document:
"machin"    :6.70%
"learn"     :5.10%
"algorithm" :2.80%
"use"       :2.00%
"data,"     :2.00%

Top 5 contributing words for second document:
"stock"     :6.40%
"price"     :4.70%
"market"    :4.70%
"call"      :3.00%
"close"     :3.00%
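The loop above parses the formatted topic string by splitting on `' + '` and `'*'`, which leaves the quotation marks around each word. A slightly more robust way to parse the same format, sketched here with a regular expression, returns clean `(word, weight)` pairs. The `parse_topic` helper is hypothetical, and the example string is reconstructed from the percentages printed above rather than copied from the model. As far as I know, gensim can also return unformatted pairs directly via `show_topics(formatted=False)`, which would avoid string parsing altogether.

```python
import re

def parse_topic(topic_string):
    """Parse a formatted topic string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]*)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Example string reconstructed from the printed output (illustration only).
example = '0.067*"machin" + 0.051*"learn" + 0.028*"algorithm"'
for word, weight in parse_topic(example):
    print(f"{word:<12}:{100 * weight:0.2f}%")
```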


In [6]:
new_docs = [
  """
  Stock market prediction and analysis are some of the most difficult jobs to complete. There are numerous causes for this,
  including market volatility and a variety of other dependent and independent variables that influence the value of a certain stock 
  in the market. These variables make it extremely difficult for any stock market expert to anticipate the rise and fall of the 
  market with great precision. However, with the introduction of Machine Learning and its strong algorithms, the most recent market research 
  and Stock Market Prediction advancements have begun to include such approaches in analyzing stock market data. In summary, Machine Learning 
  Algorithms are widely utilized by many organizations in Stock market prediction. This article will walk through a simple implementation 
  of analyzing and forecasting the stock prices of a Popular Worldwide Online Retail Store in Python using various Machine Learning Algorithms.
  """
]

new_texts = stemming(new_docs)
new_doc_term_mat = [dictionary.doc2bow(text) for text in new_texts]

vector = ldamodel[new_doc_term_mat]
print(vector[0])


[(0, 0.4519081), (1, 0.5480919)]


## Discussions
In this task, we built an LDA model and discovered the abstract topics that occur in the collection of documents.

### First document
For the first document, the top five contributing words of the matching topic are machin (6.70%), learn (5.10%), algorithm (2.80%), use (2.00%) and data, (2.00%). The document is about machine learning, and the topic found by the model reflects that, as expected. The trailing comma in `data,` is left over from splitting the text on whitespace.

### Second document
For the second document, the top five contributing words of the matching topic are stock (6.40%), price (4.70%), market (4.70%), call (3.00%) and close (3.00%). The document is about the stock market, and the topic found by the model reflects that, as expected.

### New document
For the new document, we picked an article about applying machine learning to stock market prediction, so that it contains material related to both training documents. After computing its topic distribution with the LDA model, we got the following result.

`[(0, 0.4519081), (1, 0.5480919)]`

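The result shows that the new document has content related to both the first and second documents: the model assigns roughly 45% of it to the machine-learning topic and 55% to the stock-market topic.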
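As a small illustration of how such a topic vector can be read, the sketch below picks the dominant topic and the margin by which it leads. It uses the vector printed by the last code cell; the `dominant_topic` helper is a hypothetical name introduced for this example.

```python
# Topic distribution printed for the new document in the last code cell.
vector = [(0, 0.4519081), (1, 0.5480919)]

def dominant_topic(topic_weights):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_weights, key=lambda pair: pair[1])

topic_id, weight = dominant_topic(vector)
margin = weight - min(w for _, w in vector)
print(f"dominant topic: {topic_id} ({weight:.1%}), margin over the other topic: {margin:.1%}")
```

A margin of about ten percentage points is small, which is consistent with a document that mixes both subjects rather than belonging clearly to one.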