Topic Modelling is a Natural Language Processing technique for uncovering the hidden topics in a collection of text documents and assigning a topic label to each document.

It helps us relate the content of each document to the topics that run through the whole collection.

To identify the topics, we need an algorithm that analyzes word frequencies to find relationships between the content and the topics. We also need textual data to work with; here it is a CSV file of articles with their titles.
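
To make "analyzing the frequency of words" concrete, here is a minimal sketch that counts words per document using only the standard library. The two sentences in `documents` are hypothetical stand-ins for the article texts, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

# Hypothetical mini-corpus; the real input would be the article texts
documents = [
    "machine learning models learn patterns from data",
    "data analysis explores data to support decisions",
]

# Count how often each word occurs in each document
word_counts = [Counter(doc.split()) for doc in documents]

for counts in word_counts:
    # most_common(2) returns the two highest-count words with their counts
    print(counts.most_common(2))
```

Counts like these are the raw material that vectorizers and topic models build on.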

In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from google.colab import files
uploaded = files.upload()
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

data = pd.read_csv("articles.csv", encoding = 'latin1')
print(data.head())

Saving articles.csv to articles.csv


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


                                             Article  \
0  Data analysis is the process of inspecting and...   
1  The performance of a machine learning algorith...   
2  You must have seen the news divided into categ...   
3  When there are only two classes in a classific...   
4  The Multinomial Naive Bayes is one of the vari...   

                                               Title  
0                  Best Books to Learn Data Analysis  
1         Assumptions of Machine Learning Algorithms  
2          News Classification with Machine Learning  
3  Multiclass Classification Algorithms in Machin...  
4        Multinomial Naive Bayes in Machine Learning  


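As we are working on a Natural Language Processing problem, we need to clean the textual content by lowercasing it, removing punctuation and stopwords, and lemmatizing the remaining words. The cells below first make sure the required NLTK resources are available and then define and apply the cleaning function: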

In [5]:
import nltk

nltk.data.path.append('/path/to/custom/directory')  # Replace with your directory
nltk.download('punkt', download_dir='/path/to/custom/directory')
nltk.download('stopwords', download_dir='/path/to/custom/directory')
nltk.download('wordnet', download_dir='/path/to/custom/directory')

[nltk_data] Downloading package punkt to /path/to/custom/directory...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /path/to/custom/directory...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /path/to/custom/directory...


True

In [6]:
import nltk
print(nltk.data.find('tokenizers/punkt'))

/root/nltk_data/tokenizers/punkt


In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string

# Download necessary resources
# Note: newer NLTK releases look for 'punkt_tab' when tokenizing;
# if the run reports it missing, nltk.download('punkt_tab') is also needed
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build the stopword set and the lemmatizer once instead of once per article
stop_words = set(stopwords.words("english"))
lemma = WordNetLemmatizer()

def preprocess_text(text):
    try:
        # Convert text to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Tokenize text (raises LookupError if the tokenizer resource is missing)
        tokens = word_tokenize(text)
        # Remove stopwords
        tokens = [word for word in tokens if word not in stop_words]
        # Lemmatize tokens
        tokens = [lemma.lemmatize(word) for word in tokens]
        # Join tokens to form preprocessed text
        return ' '.join(tokens)
    except Exception as e:
        # Fallback: report the error and return the text as processed so far
        # (lowercased, punctuation removed) so the pipeline keeps running
        print(f"Error processing text: {text}\nError: {e}")
        return text

# Apply the preprocessing to every article in the DataFrame
data['Article'] = data['Article'].apply(preprocess_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Error processing text: data analysis is the process of inspecting and exploring data generated by a particular population to find the information needed to make decisions and draw conclusions with the use of data in decision making most businesses today need data analysts so if you want to know about the best books to learn data analysis this article is for you in this article i will introduce you to some of the best books to learn data analysis
Error: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '
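
The message above indicates that `word_tokenize` could not load the `punkt_tab` resource, so the `except` branch returned each article lowercased and stripped of punctuation, but with stopwords and inflected forms still in place. One way to notice this kind of silent fallback is to check whether stopwords survive in the output. The snippet below is a minimal sketch of such a check; `leftover_stopwords` and `SAMPLE_STOP_WORDS` are hypothetical names introduced here, and the stopword set is a tiny stand-in for NLTK's English list.

```python
# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook itself uses stopwords.words("english") from NLTK
SAMPLE_STOP_WORDS = {"is", "the", "of", "and", "a", "to"}

def leftover_stopwords(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the stopwords still present in a supposedly cleaned text."""
    return sorted(set(text.split()) & stop_words)

# Start of the first article as returned by the fallback branch
fallback_output = "data analysis is the process of inspecting and exploring data"

# A non-empty result suggests that stopword removal did not run
print(leftover_stopwords(fallback_output))
```

If the check returns anything for text that should already be clean, the preprocessing step probably fell back to the `except` branch and is worth rerunning once the missing resource is installed.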

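Now we need to convert the textual data into a numerical representation. We can use TF-IDF vectorization here, which weights each word by how often it appears in an article and how rare it is across all articles: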

In [9]:
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(data['Article'].values)
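
To show what the vectorizer computes, here is a minimal pure-Python sketch of TF-IDF. As far as we know it follows the default weighting documented for scikit-learn's `TfidfVectorizer` (raw term count times a smoothed inverse document frequency, then L2 normalization of each row), but it skips the library's own tokenization rules and sparse matrix storage. The function name `tfidf_rows` and the three short documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed inverse document frequency
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale the row to unit Euclidean length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical documents standing in for the cleaned articles
docs = ["data analysis books", "machine learning data", "naive bayes classification"]
rows = tfidf_rows(docs)
print(rows[0])
```

In the first document, "data" receives a lower weight than "analysis" because it also appears in the second document.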

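Now we will use an algorithm that finds relationships in the vectorized text in order to assign topic labels. We can use the Latent Dirichlet Allocation algorithm for this task. Latent Dirichlet Allocation (LDA) is a generative probabilistic model used to uncover the underlying topics in a corpus of textual data: it represents each document as a mixture of topics and each topic as a distribution over words. Note that LDA is usually fitted on raw word counts (for example from the CountVectorizer imported earlier); here the TF-IDF matrix is passed in as it is. Let’s use the LDA algorithm with five topics to assign topic labels: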

In [10]:
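# Fit an LDA model with 5 topics (random_state fixed for reproducibility)
lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(x)

# Per-document topic weights: one row per article, one column per topic
topic_modelling = lda.transform(x)

# Label each article with its highest-weighted topic
topic_labels = np.argmax(topic_modelling, axis=1)
data['topic_labels'] = topic_labels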
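
The labelling step boils down to picking, for each document, the topic with the largest weight. The sketch below does the same thing as `np.argmax(..., axis=1)` in plain Python. The `doc_topic` values are hypothetical and use three topics for brevity; they are not taken from the fitted model.

```python
# Hypothetical document-topic weights: one row per document, one column per topic
doc_topic = [
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.25, 0.50, 0.25],
]

def dominant_topic(row):
    """Index of the largest weight in the row (first index on ties)."""
    return max(range(len(row)), key=row.__getitem__)

labels = [dominant_topic(row) for row in doc_topic]
print(labels)  # [2, 0, 1]
```

A document whose weights are spread almost evenly across topics still gets a single label, so it can be worth looking at the full row before trusting the label.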

Now here’s the final data with topic labels:

In [11]:
print(data.head())

                                             Article  \
0  data analysis is the process of inspecting and...   
1  the performance of a machine learning algorith...   
2  you must have seen the news divided into categ...   
3  when there are only two classes in a classific...   
4  the multinomial naive bayes is one of the vari...   

                                               Title  topic_labels  
0                  Best Books to Learn Data Analysis             4  
1         Assumptions of Machine Learning Algorithms             1  
2          News Classification with Machine Learning             2  
3  Multiclass Classification Algorithms in Machin...             4  
4        Multinomial Naive Bayes in Machine Learning             4  
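
The integer labels above do not say what each topic is about. A common way to interpret them is to list the highest-weighted terms of each topic. The snippet below is a minimal sketch of that idea with a hypothetical vocabulary and hypothetical weights; with the fitted scikit-learn objects the corresponding inputs would be `lda.components_` and `vectorizer.get_feature_names_out()`.

```python
# Hypothetical vocabulary and topic-term weights (rows = topics, columns = terms)
vocabulary = ["data", "analysis", "classification", "bayes", "news"]
topic_term = [
    [0.9, 0.7, 0.1, 0.0, 0.1],
    [0.2, 0.0, 0.8, 0.6, 0.3],
]

def top_terms(weights, terms, n=2):
    """Return the n terms with the largest weights, highest first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [terms[i] for i in ranked[:n]]

for topic_id, weights in enumerate(topic_term):
    print(topic_id, top_terms(weights, vocabulary))
```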


So this is how you can assign topic labels with Machine Learning using the Python programming language.