## Introduction to Natural Language Processing


In [1]:
from IPython.display import Image

## Topic modelling

In [2]:
Image(url= "../img/TM.png")
# source: https://www.analyticsvidhya.com/blog/2021/07/topic-modelling-with-lda-a-hands-on-introduction/

Topic modeling is a type of statistical modeling for discovering abstract topics that occur in a collection of documents. At its core, topic modeling is about uncovering hidden structure in text data, and it is a powerful tool for organizing and understanding large collections of unstructured data.

Topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between uses of words with multiple meanings.



### Topic modelling in the context of Machine Learning

Topic modeling is an <b>unsupervised</b> machine learning task. The aim of topic modeling is to identify the main topics that occur in a collection of documents. Each topic is represented as a distribution over words, and each document is then represented as a distribution over topics.
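To make these two kinds of distributions concrete, here is a toy illustration. The vocabulary, the two topics, and all probabilities are made up for this sketch and are not taken from the BBC data.

```python
import numpy as np

# Hypothetical vocabulary and two hypothetical topics (all numbers are invented).
vocab = ["match", "goal", "vote", "party"]

# Each row is one topic: a probability distribution over the vocabulary.
topic_word = np.array([
    [0.45, 0.45, 0.05, 0.05],  # a "sport"-like topic
    [0.05, 0.05, 0.45, 0.45],  # a "politics"-like topic
])

# Each row is one document: a probability distribution over the topics.
doc_topic = np.array([
    [0.9, 0.1],  # mostly about the first topic
    [0.3, 0.7],  # a mix, leaning towards the second topic
])

# Both kinds of rows are probability distributions, so each row sums to 1.
print(topic_word.sum(axis=1))
print(doc_topic.sum(axis=1))

# Word probabilities implied for each document: mix the topics by their weights.
doc_word = doc_topic @ topic_word
print(dict(zip(vocab, doc_word[0].round(2))))
```

A topic model works in the opposite direction: it only sees the words in the documents and has to estimate both matrices.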

Topic modeling is unsupervised because the documents come without labels: we do not know in advance what the topics are, and there is no "correct" answer to predict. Instead, the algorithm looks for patterns in the data and uses these patterns to determine the topics. Note that many algorithms, including the LDA implementation used later in this notebook, still require us to choose the number of topics ourselves.

There are several algorithms for topic modeling, including Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and others. Each of these algorithms has its own assumptions and methods for determining the topics.

In [3]:
Image(url= "../img/tm2.png")
# source: https://www.cognub.com/index.php/cognitive-platform/

### The practical part

Before we start with any Machine Learning or Natural Language Processing, we need data. Here, we are using the BBC News dataset. It contains articles from BBC News.

In Python, we can use the pandas library to load the data from a CSV file.



In [4]:
import pandas as pd

#### Load the dataset

In [5]:
df = pd.read_csv('../data/bbc-text.csv', sep=',')

In [6]:
df.shape

(2225, 1)

#### Check out the first 10 rows

In [7]:
df.head(10)

Unnamed: 0,text
0,tv future in the hands of viewers with home th...
1,worldcom boss left books alone former worldc...
2,tigers wary of farrell gamble leicester say ...
3,yeading face newcastle in fa cup premiership s...
4,ocean s twelve raids box office ocean s twelve...
5,howard hits back at mongrel jibe michael howar...
6,blair prepares to name poll date tony blair is...
7,henman hopes ended in dubai third seed tim hen...
8,wilkinson fit to face edinburgh england captai...
9,last star wars not for children the sixth an...


## Latent Dirichlet Allocation (LDA):

LDA is a generative statistical model: it explains a set of observations through unobserved groups, and those groups account for why some parts of the data are similar. In the context of topic modeling, the "observations" are the words in the documents, the "unobserved groups" are the topics, and documents are similar when they share topics.

In other words, LDA explains why some documents resemble others: they are about the same topics. It is a probabilistic graphical model in which each document is assumed to be a mixture of topics, and each word in a document is drawn from one of those topics.
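The generative story can be sketched in a few lines of NumPy. This is only an illustration under assumed values: the vocabulary, the number of topics, the Dirichlet parameters, and the helper `generate_document` are all hypothetical, and the topics are drawn at random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and number of topics.
vocab = ["match", "goal", "vote", "party", "film", "actor"]
n_topics = 3

# Each topic is a distribution over the vocabulary (drawn at random here).
topic_word = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

def generate_document(n_words, alpha=0.5):
    """Generate one toy document following LDA's generative story."""
    # 1. Draw the document's mixture over topics.
    theta = rng.dirichlet(np.full(n_topics, alpha))
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        z = rng.choice(n_topics, p=theta)
        # 3. Pick a word from that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(n_words=8)
print(theta.round(2))
print(" ".join(words))
```

Fitting an LDA model runs this story in reverse: given only the observed words, it estimates the topic-word distributions and each document's topic mixture.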



In [8]:
Image(url= "../img/lda.png")
# Blei, D.M. (2012) Probabilistic Topic Models. Communications of the ACM, 55, 77-84. http://dx.doi.org/10.1145/2133806.2133826



### Text Preprocessing


The first step in most NLP tasks is text preprocessing. This usually involves converting all the text to lowercase, removing punctuation and stop words, and tokenization (breaking the text down into individual words).

We also add one step that is particularly useful for topic modelling: lemmatization, which reduces words to their dictionary form (lemma). For instance, "cars" is reduced to "car". Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so "running" is only reduced to "run" when it is lemmatized as a verb (pos='v').

In [9]:
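import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')  # one-time download of the stop word lists
nltk.download('wordnet')    # one-time download of the WordNet data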

In [10]:
# Initialize the lemmatizer and build the stop word set once (set lookups are fast)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocess_text(text):
    # Lower case
    text = text.lower()
    # Remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove punctuation
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Remove tokens that contain digits
    text = re.sub(r'\w*\d\w*', '', text)
    # Tokenization (split on whitespace)
    tokens = text.split()
    # Remove stop words and lemmatize (words are treated as nouns by default)
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Join words to a single string
    return ' '.join(tokens)
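To see what the regular expressions in the function do, the cleaning steps can be run one at a time on a single string. The sample sentence is made up for this illustration, and stop word removal and lemmatization are left out so that the sketch needs no NLTK downloads.

```python
import re
import string

# Made-up sample sentence, used only for this illustration.
sample = "In 2004, BBC's [archive] said: sales rose 5% to 3bn!"

step1 = sample.lower()
# Drop anything inside square brackets, brackets included.
step2 = re.sub(r'\[.*?\]', '', step1)
# Drop ASCII punctuation characters.
step3 = re.sub(r'[%s]' % re.escape(string.punctuation), '', step2)
# Drop every token that contains a digit.
step4 = re.sub(r'\w*\d\w*', '', step3)
# Split on whitespace.
tokens = step4.split()

print(tokens)  # ['in', 'bbcs', 'said', 'sales', 'rose', 'to']
```

Note that removing punctuation turns "bbc's" into "bbcs" instead of splitting it, and that "3bn" disappears entirely because it contains a digit.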

In [11]:
df['processed_text'] = df['text'].apply(lambda x: preprocess_text(x))

#### Digression - Lambda function:

- A lambda function is an anonymous function: it has no name when it is defined. It is created with the lambda keyword instead of def, which is used for normal (named) functions.

- Here, x is the argument (an individual value from the 'text' column), and preprocess_text(x) is the expression. The lambda function applies preprocess_text to each value in the 'text' column.

- So the entire line of code takes each value in the 'text' column of the dataframe, applies preprocess_text to it, and stores the result in a new column called 'processed_text'. Because the lambda only forwards its argument, df['text'].apply(preprocess_text) gives the same result.
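A small sketch of the same pattern on a toy dataframe. The function `shout` is a hypothetical stand-in for preprocess_text, used so that the example runs without NLTK.

```python
import pandas as pd

def shout(text):
    # Hypothetical stand-in for preprocess_text, used only for this illustration.
    return text.upper()

toy = pd.DataFrame({"text": ["first article", "second article"]})

# Wrapping the function in a lambda ...
with_lambda = toy["text"].apply(lambda x: shout(x))
# ... gives the same result as passing the function itself.
without_lambda = toy["text"].apply(shout)

print(with_lambda.tolist())
print(with_lambda.equals(without_lambda))
```

A lambda becomes useful when the call needs extra arguments or a small inline expression, for example `apply(lambda x: shout(x)[:10])`.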

In [None]:
df.head(5)

Now that we have our text preprocessed, let's start with our first topic modelling algorithm: Latent Dirichlet Allocation (LDA). LDA assumes that every document is a mixture of topics and that every word in the document is attributable to the document's topics.

We'll use the LDA implementation from the sklearn library.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

### Create a CountVectorizer for parsing/counting words

In [None]:
count_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
term_matrix = count_vectorizer.fit_transform(df['processed_text'])

### Digression - CountVectorizer

- CountVectorizer is a class provided by the sklearn.feature_extraction.text module in the scikit-learn library. It's used to convert a collection of text documents to a matrix of token (word) counts.

<b>When using CountVectorizer, the process usually involves the following steps:</b>

1. The text is tokenized, meaning it's split into individual words according to some rule. By default, CountVectorizer lowercases the text and keeps tokens of two or more alphanumeric characters; whitespace and punctuation act as separators.

2. The words are counted. For each document, CountVectorizer maintains a count of how many times each word appeared.

3. A document-term matrix is created. Each row of the matrix represents a document, and each column represents a unique word from across all documents. The entry in the ith row and the jth column of the matrix is the count of word j in document i.

4. The output is a sparse matrix representation of the documents, which can be used as input to a machine learning model.

#### max_df: 
- used for removing terms that appear too frequently, also known as "corpus-specific stop words"
- For example, max_df=0.95 means "ignore terms that appear in more than 95% of the documents"
- These common words usually don't carry important meaning and are often removed.


#### min_df:
- used for removing terms that appear too infrequently
- For example, min_df=2 means "ignore terms that appear in fewer than 2 documents"
- The intuition behind this approach is that words that appear in only one or a few documents of the entire corpus might be typos, rare words or otherwise irrelevant to the analysis, and can usually be ignored.



### Create and fit the LDA model


In [None]:
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(term_matrix)

Here, n_components is the number of topics we want to retrieve. The choice of this parameter influences the quality of the results, much like the number of clusters does in clustering. Usually the best approach is to try several values until we get a good evaluation score, or until the topics simply look coherent enough.
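One way to "play around" with the number of topics is to fit several models and compare a score. The sketch below uses perplexity, which scikit-learn's LatentDirichletAllocation exposes through its perplexity() method (lower is better). The toy corpus and the candidate values are assumptions made for this illustration; on a real corpus the score would normally be computed on held-out documents, and a lower perplexity does not guarantee topics that humans find more interpretable.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus; the notebook would use term_matrix instead.
docs = [
    "the team won the match with a late goal",
    "the striker scored a goal in the cup match",
    "the party won the vote in the election",
    "voters backed the party in the election",
    "the film won an award for best actor",
    "the actor starred in a new film",
]
X = CountVectorizer().fit_transform(docs)

scores = {}
for k in (2, 3, 4):  # candidate numbers of topics
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(X)
    scores[k] = model.perplexity(X)

for k, perplexity in scores.items():
    print(f"n_components={k}: perplexity={perplexity:.1f}")
```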

In [None]:
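# Display the top 10 words of each topic
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = count_vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {idx}:", " ".join(top_words))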


 - The argsort() method from numpy returns the indices that would sort the topic array in ascending order.

 - The [:-11:-1] slice walks backwards from the end of those sorted indices and takes 10 of them. This gives the indices of the 10 highest values in the topic array, from highest to lowest, which are the top 10 words for that topic.
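The same slicing trick on a small made-up array, taking the top 3 values instead of the top 10:

```python
import numpy as np

weights = np.array([0.1, 0.5, 0.2, 0.9, 0.3])  # toy topic weights, one per word

order = weights.argsort()  # indices from lowest to highest weight
top3 = order[:-4:-1]       # last 3 indices, highest weight first

print(order)  # [0 2 4 1 3]
print(top3)   # [3 1 4]
```

In general, `[:-(n + 1):-1]` returns the indices of the n largest values.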



In [None]:
!pip install wordcloud
 

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt


In [None]:
# Get the feature names from the count vectorizer
# (get_feature_names() was removed in scikit-learn 1.2)
feature_names = count_vectorizer.get_feature_names_out()

# Get the topics and their top 10 words for LDA
lda_topics = [[(feature_names[i], topic[i]) for i in topic.argsort()[:-11:-1]] for topic in lda.components_]

for i, topic in enumerate(lda_topics):
    wc = WordCloud(background_color="white", max_words=2000)
    wc.generate_from_frequencies(dict(topic))
    
    plt.figure(figsize=(10,5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title(f'Topic {i}')  # 0-based, matching the printed topic list
    plt.show()


In [None]:
lda_topics

### How do we interpret these topics? 


1. Check the most significant words: Start by looking at the words with the highest weights in each topic. These are the words that are most representative of the topic according to the model.

2. Understand the common theme: Try to find a common theme or category among these words. For instance, if the top words are "doctor", "patient", "hospital", and "medicine", then a good interpretation of the topic might be "Healthcare" or "Medicine".

3. Use your domain knowledge: Your own understanding of the subject matter can be very useful in interpreting the topics. For instance, if you're analyzing news articles and one of the topics contains words like "election", "votes", "candidate", and "campaign", then you could interpret this as a "Politics" topic.

4. Check the related documents: Another way to interpret the topics is to look at some documents that are heavily associated with each topic. By reading these documents, you might get a better understanding of what the topic represents.

5. Keep in mind that topics are probabilistic: Topic modelling algorithms like LDA are probabilistic, which means that they provide a probability distribution over all words for each topic, and a probability distribution over all topics for each document. The topics and the document-topic associations are not definitive but rather represent the algorithm's best guess based on the data and its own internal mathematics.

6. Don't overinterpret: Finally, remember that not all topics might make perfect sense, and that's okay. Topic models are statistical models that try to find structure in the data, but sometimes this structure doesn't map perfectly onto human interpretability.

Interpreting topics from topic modelling is more of an art than a science, requiring a mix of understanding the model's output, using your own domain knowledge, and making sensible judgments.
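For point 4, the document-topic distribution can be obtained with the model's transform() method. The sketch below uses a made-up toy corpus and two topics so that it runs on its own; all names and values in it are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, used only for this illustration.
docs = [
    "the team won the match with a late goal",
    "the striker scored a goal in the cup match",
    "the party won the vote in the election",
    "voters backed the party in the election",
]
X = CountVectorizer().fit_transform(docs)
model = LatentDirichletAllocation(n_components=2, random_state=42).fit(X)

# One row per document, one column per topic; each row sums to 1.
doc_topic = model.transform(X)

# For each topic, show the document that is most strongly associated with it.
for topic_idx in range(doc_topic.shape[1]):
    best_doc = int(np.argmax(doc_topic[:, topic_idx]))
    weight = doc_topic[best_doc, topic_idx]
    print(f"Topic {topic_idx}: {docs[best_doc]!r} (weight {weight:.2f})")
```

In this notebook, the corresponding call would be lda.transform(term_matrix), and the row indices map back to the rows of df.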






<div class="alert alert-block alert-info">
<b>Exercise 1 - LDA</b>
<p>
<li>Think of the topic names for the identified words in the word clouds. What would be the common themes?</li>

</p>
  
</div>

### Literature and references

- "Latent Dirichlet Allocation" by David M. Blei, Andrew Y. Ng, Michael I. Jordan (https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) - This is the original paper that introduced LDA.

- Topic modeling in Python with NLTK and Gensim (https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24) - This blog post provides a practical guide to topic modeling in Python.

- Nonnegative Matrix Factorization (NMF), a practical guide (https://medium.com/python-in-plain-english/topic-modelling-with-nmf-in-python-194eb6ae04a5) - This post explains how to perform topic modeling using NMF with practical examples.