
We explore and implement LDA in order to estimate topics on various datasets (text documents).


mokleit/topic-modeling-lda


TOPIC MODELING USING LATENT DIRICHLET ALLOCATION (LDA)

SCIKIT IMPLEMENTATION

We implement LDA using scikit-learn on two different datasets.

Datasets

  1. Text documents from the Associated Press found here and here in our project.
  2. Speech-to-text transcripts of the IFT6269 lectures at MILA (Université de Montréal), found here.

Pre-processing

We pre-process the data and store it as corpus.txt.
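The README does not describe the pre-processing steps, so the following is only a minimal sketch of one plausible pipeline: lowercase the text, keep alphabetic tokens, drop a few stop words, and write one cleaned document per line to corpus.txt. The function names `preprocess` and `write_corpus`, the stop-word list, and the one-document-per-line layout are all assumptions for illustration, not the project's actual code.

```python
import re
from pathlib import Path

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "and", "that", "for", "with", "this"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens of 3+ letters, drop stop words."""
    tokens = re.findall(r"[a-z]{3,}", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def write_corpus(documents, path="corpus.txt"):
    """Write one cleaned document per line, skipping documents that end up empty."""
    cleaned = [preprocess(doc) for doc in documents]
    cleaned = [doc for doc in cleaned if doc]
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

Calling `write_corpus(raw_documents)` on a list of raw strings would produce a corpus.txt that a bag-of-words vectorizer can read line by line.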

Training

  1. Use train to train the model and save it as a pickle file.
  2. Use save to save the topics extracted from training.

PYTHON IMPLEMENTATION

Datasets

  1. Text documents from the scribe notes of IFT6269, found here.
  2. Text documents from the Associated Press found here and here in our project.

Pre-processing and training

Both steps are implemented in lda. The code still needs to be fixed and cleaned up, since it was originally run in Colab.
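The README does not say which inference method lda uses. As an illustration of what a from-scratch Python implementation can look like, here is a minimal sketch of collapsed Gibbs sampling, one common way to fit LDA. The function name `gibbs_lda`, the hyperparameter defaults, and the input format (each document as a list of integer word ids) are assumptions for this example and are not taken from the project.

```python
import random


def gibbs_lda(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    vocab_size = max(w for doc in docs for w in doc) + 1
    doc_topic = [[0] * n_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic
    assignments = []

    # Give every token a random initial topic and record it in the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional for each topic t:
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

The returned count tables can be normalized (after adding `alpha` and `beta` respectively) to estimate per-document topic proportions and per-topic word distributions.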
