Topic-Modeling using LDA Model

Topic modeling to group news stories into topics using Gensim and NLTK.

Tasks Involved:

Loading the dataset
Dataset: https://www.kaggle.com/datasets/akash14/news-category-dataset?select=Data_Train.csv

About Dataset:
Size of training set: 7,628 records
Size of test set: 2,748 records

FEATURES:
STORY: A part of the main content of the article to be published as a piece of news.
SECTION: The genre/category the STORY falls into.

Each story falls into one of four sections, labelled as follows:
Politics: 0
Technology: 1
Entertainment: 2
Business: 3
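
As a rough sketch of the loading step, the snippet below reads the CSV with pandas and attaches a readable section name to each row. The column names STORY and SECTION and the label mapping come from the dataset description above; the helper name load_stories, the SECTION_NAME column, and the choice to drop rows with a missing STORY are assumptions made for illustration.

```python
import pandas as pd

# Label mapping as given in the dataset description.
SECTION_NAMES = {0: "Politics", 1: "Technology", 2: "Entertainment", 3: "Business"}


def load_stories(source):
    """Read the news CSV (a path or file-like object) into a DataFrame.

    Hypothetical helper: drops rows without a STORY and adds a
    SECTION_NAME column derived from the numeric SECTION label.
    """
    df = pd.read_csv(source)
    df = df.dropna(subset=["STORY"]).reset_index(drop=True)
    df["SECTION_NAME"] = df["SECTION"].map(SECTION_NAMES)
    return df
```

With the Kaggle file downloaded locally, this would be called as load_stories("Data_Train.csv"); the exact path depends on where the file is saved.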

Data Preprocessing

Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
Words that have fewer than 3 characters are removed.
All stopwords are removed.
Words are lemmatized - words in third person are changed to first person and verbs in past and future tenses are changed into present.
Words are stemmed - words are reduced to their root form.

Bag of Words on the dataset
Create a dictionary from the processed data that maps each unique word to an integer id and records how often each word appears in the training set. I used gensim.corpora.Dictionary() for this, followed by some additional filtering. Steps involved:

Create dictionary from words present in the entire training data.
Remove very rare and very common words from the dictionary under consideration.
Convert each document (a list of words) into bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string.

Topic Modelling using Gensim's LDA
LDA (Latent Dirichlet Allocation) is a generative model: it assumes each document is produced from a mixture of topics, and each topic is a distribution over words. We use num_topics = 4 for our use case since the dataset has four major categories.

pmehta16/Topic-Modeling-using-Natural-Language-Processing
