<div class="alert"  style="background-color:#1f2e6b; color:white; padding:0px 5px; border-radius:10px;"><h2 style='margin:10px'>LDA</h2></div>

# Model overview

Latent Dirichlet Allocation (LDA) is a probabilistic model that identifies the topics present in a collection of documents. It assumes that each document is a mixture of topics and that each topic is a distribution over words, then works backwards from the observed words to recover the topics that most plausibly generated the documents.

One common way to fit the model iterates over each word in each document and reassigns it to a topic, taking into account how prevalent that topic is in the document and how strongly the word is associated with the topic across the corpus. The outcome is a set of topics, each represented by its most probable words, together with the topic mixture of each document, revealing the underlying thematic structure of the dataset.
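The generative assumption described above can be made concrete with a small simulation. This is only a sketch using the standard library: the vocabulary, the two topics, and the helper names `sample_dirichlet` and `generate_document` are made up for illustration and are not part of this notebook's pipeline.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, doc_alpha, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(doc_alpha, rng)  # this document's topic mixture
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # topic assigned to this position
        w = rng.choices(vocab, weights=topic_word[z])[0]  # word drawn from that topic
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["shopify", "store", "wordpress", "plugin"]
# Two hypothetical topics, each a probability distribution over the vocabulary
topic_word = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
theta, words = generate_document(topic_word, vocab, doc_alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, words)
```

Fitting LDA is the reverse problem: only the words are observed, and the topic mixtures and topic-word distributions have to be inferred.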

In [24]:
import pandas as pd

df = pd.read_csv("full.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63503 entries, 0 to 63502
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Job Title    63503 non-null  object
 1   Description  63503 non-null  object
 2   Category_1   63503 non-null  object
dtypes: object(3)
memory usage: 1.5+ MB


In [25]:
df.head()

Unnamed: 0,Job Title,Description,Category_1
0,Power bi specialist freelance,Already data pooled and designed. Need to refi...,Data Analysis
1,Case Study (on-demand delivery startup),"Hi,\n\nWould you be able to help me do a case-...",Google Data Studio
2,"File Maker Pro Reports, Charts, Query and Ongo...",NITIAL PROJECT\n\nSet up Monthly Report mimick...,Report Writing
3,Implementation of EleutherAI/gpt-neox-20b,"As a first step, you will implement the instal...",Machine Learning Model
4,BI and Data Engineer for Upwork Finance System...,The Upwork Finance Systems team is looking for...,Data Analysis


The dataframe consists of 3 columns and 63503 rows. The column of interest is Description; Category_1 is used to select which descriptions are modelled.

# Importing the necessary libraries

This model uses Gensim to tokenize the text, build a bag-of-words corpus, and fit the LDA model that produces the topics seen by the end user.

The only function that will be defined is a text cleaning function, which tokenizes the text and removes punctuation, numbers, and stop words.

Now, let's make the dataset dynamically filterable by the category the user selects. This is done by saving the 20 most frequent categories and programmatically filtering the dataset by an index into that list.

In [3]:
full_df_index = df["Category_1"].value_counts().head(20).index
full_df_index 

Index(['Web Development', 'Social Media Marketing', 'Marketing Strategy',
       'Mobile App Development', '3D Modeling', 'Email Marketing', 'Python',
       'Android App Development', '3D Design', 'Market Research',
       '3D Rendering', 'WordPress', 'Data Scraping', '3D Animation',
       'Lead Generation', 'Data Analysis', 'JavaScript', 'React', 'Shopify',
       'Sales & Marketing'],
      dtype='object', name='Category_1')

In [26]:
# Dynamically choose the category that the user wants based on the selected index
curr_index = 0
category = full_df_index[curr_index]

# Keep only the descriptions that belong to the selected category
current_df = df.loc[df["Category_1"] == category, "Description"].tolist()
current = current_df
descriptions = current

# Extra stop words; the category name is added so that the tokenizer removes it as well
additional_words = ["like", "needed", "need", "looking", "help"]
additional_words.extend(category.lower().split())
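The filtering step above could also be wrapped in a small function, so that switching category only means changing the index. This is a sketch on a toy dataframe; the function name `descriptions_for_category` and the toy rows are hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

def descriptions_for_category(frame, categories, index):
    """Return (category, descriptions, extra stop words) for the category at `index`."""
    category = categories[index]
    descriptions = frame.loc[frame["Category_1"] == category, "Description"].tolist()
    extra_stop_words = category.lower().split()  # e.g. "Web Development" -> ["web", "development"]
    return category, descriptions, extra_stop_words

# Toy data standing in for full.csv
toy_df = pd.DataFrame({
    "Description": ["Build a landing page", "Fix my store theme", "Create a REST API"],
    "Category_1": ["Web Development", "Shopify", "Web Development"],
})
toy_categories = toy_df["Category_1"].value_counts().head(20).index

toy_category, toy_descriptions, toy_extra = descriptions_for_category(toy_df, toy_categories, 0)
print(toy_category, toy_descriptions, toy_extra)
```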

Gensim will be used to create the LDA model. Gensim is a library for topic modeling and natural language processing. It allows an LDA model to be created in a few lines of code, and the resulting topics can be visualized with another library, pyLDAvis.

In [27]:
# Import necessary libraries
import pandas as pd
from gensim.utils import simple_preprocess
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import nltk
from nltk.corpus import stopwords

In [22]:
# Set of English stopwords (requires the NLTK stopwords corpus, e.g. a one-time nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

# Adding additional stopwords
additional_words = ['help', 'design', 'looking', 'need', 'like', 'needed', 'experience', 'page']
additional_words.extend(full_df_index[curr_index].lower().split()) # adds the category name to the list of additional words to be removed by the tokenizer
stop_words.update(additional_words)


A simple preprocessing step tokenizes the text and filters out the stop words.

In [9]:
def preprocess(text):
    return [word for word in simple_preprocess(text) if word not in stop_words]

In [28]:
# This creates the tokens for each document
tokenized_docs = [preprocess(doc) for doc in current_df]
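To see what this step does to a single description, here is a standard-library approximation. It is an assumption about how `simple_preprocess` behaves with its default settings (lowercasing, alphabetic tokens only, length between 2 and 15), not Gensim's actual code, and `demo_stop_words` is a tiny stand-in for the real stop word set.

```python
import re

TOKEN_PATTERN = re.compile(r"[^\W\d]+")  # runs of letters/underscores, no digits

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim's simple_preprocess: lowercase, alphabetic tokens, length limits."""
    found = TOKEN_PATTERN.findall(text.lower())
    return [t for t in found if min_len <= len(t) <= max_len and not t.startswith("_")]

# Tiny illustrative stop word set (the notebook uses NLTK's list plus custom words)
demo_stop_words = {"a", "for", "my", "need", "looking", "web", "development"}

def demo_preprocess(text):
    """Tokenize the text and drop stop words, mirroring the preprocess function above."""
    return [w for w in approx_simple_preprocess(text) if w not in demo_stop_words]

demo_tokens = demo_preprocess(
    "Looking for a Web Development expert: need 2 landing pages for my Shopify store!"
)
print(demo_tokens)  # ['expert', 'landing', 'pages', 'shopify', 'store']
```

Punctuation and the number disappear during tokenization, while the category name and filler words are removed by the stop word filter.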

id2word is a Gensim Dictionary object that maps each word to an integer id.

It will be used to create the corpus, in which each document is a list of (word id, count) tuples giving how often each word occurs in that document.

In [29]:
# Create a Gensim Dictionary object
id2word = Dictionary(tokenized_docs)

# no_below: minimum number of documents a word must appear in to be kept
# no_above: maximum proportion of documents a word may appear in to be kept
id2word.filter_extremes(no_below=5, no_above=0.5)
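The effect of the two thresholds can be illustrated with a small standard-library sketch. The function `filter_by_document_frequency` is hypothetical and only mimics the document-frequency rule; it ignores other `filter_extremes` options such as the cap on vocabulary size.

```python
from collections import Counter

def filter_by_document_frequency(tokenized, no_below, no_above):
    """Keep words found in at least `no_below` documents and in at most `no_above` of all documents."""
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))  # count each word once per document
    max_docs = int(no_above * len(tokenized))
    return {word for word, count in doc_freq.items() if no_below <= count <= max_docs}

toy_docs = [
    ["site", "shopify", "store"],
    ["site", "wordpress", "plugin"],
    ["site", "shopify", "theme"],
    ["site", "wordpress", "theme"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['shopify', 'theme', 'wordpress']
```

Here "site" is dropped because it appears in every document, and "store" and "plugin" are dropped because they appear in only one.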


Finally, we will be left with a bag of words representation of the text data, which will be used to train the LDA model.

In [30]:
corpus = [id2word.doc2bow(doc) for doc in tokenized_docs]
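For intuition, the bag-of-words conversion can be reproduced with plain Python. This is a sketch with hypothetical helpers (`build_vocabulary`, `to_bow`); the ids Gensim assigns will generally differ, but the shape of the result, a sorted list of (word id, count) tuples per document, is the same.

```python
from collections import Counter

def build_vocabulary(tokenized):
    """Assign an integer id to each distinct word, in order of first appearance."""
    word_to_id = {}
    for doc in tokenized:
        for word in doc:
            word_to_id.setdefault(word, len(word_to_id))
    return word_to_id

def to_bow(doc, word_to_id):
    """Return sorted (word id, count) tuples, skipping words that are not in the vocabulary."""
    counts = Counter(word_to_id[w] for w in doc if w in word_to_id)
    return sorted(counts.items())

toy_docs = [["shopify", "store", "shopify"], ["wordpress", "store", "plugin"]]
toy_vocab = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Words removed from the dictionary by the filtering step are simply skipped, which is why a document's bag of words can be shorter than its token list.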

In [31]:
# Train the LDA model
lda_model = LdaModel(corpus=corpus, id2word=id2word, num_topics=5, random_state=42)

# Display the topics
topics = lda_model.print_topics()
for topic in topics:
    print(topic)

# Preparing the LDA visualization
pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
pyLDAvis.display(vis)


(0, '0.015*"work" + 0.012*"developer" + 0.011*"team" + 0.010*"skills" + 0.009*"wordpress" + 0.008*"strong" + 0.007*"project" + 0.007*"requirements" + 0.006*"ability" + 0.006*"ensure"')
(1, '0.018*"com" + 0.016*"https" + 0.009*"would" + 0.007*"want" + 0.007*"site" + 0.007*"www" + 0.007*"create" + 0.007*"please" + 0.007*"also" + 0.006*"content"')
(2, '0.017*"shopify" + 0.014*"product" + 0.013*"user" + 0.011*"products" + 0.010*"app" + 0.010*"store" + 0.008*"someone" + 0.006*"site" + 0.005*"please" + 0.005*"work"')
(3, '0.015*"developer" + 0.014*"wordpress" + 0.012*"project" + 0.012*"site" + 0.011*"work" + 0.009*"someone" + 0.008*"new" + 0.008*"time" + 0.007*"make" + 0.007*"please"')
(4, '0.010*"site" + 0.008*"would" + 0.008*"want" + 0.007*"build" + 0.007*"work" + 0.006*"project" + 0.006*"us" + 0.006*"create" + 0.006*"new" + 0.006*"also"')




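We can see that topics have been extracted from the text data and that some of them are meaningful: topic 2 is about Shopify stores and products, topic 3 is about WordPress site work, and topic 0 reads like developer role requirements (team, skills, ability).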

pyLDAvis is very useful here. It gives a bird's-eye view of how the topics overlap, or do not, and of how prevalent each topic is in the dataset. Hovering over a single topic shows the words it consists of, which gives a granular view of what has been uncovered.

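However, even with the custom stopwords added to the pipeline, some meaningless topics are left over, which I could not manage to remove: topic 1 is dominated by URL fragments (com, https, www) and topic 4 by generic words (would, want, also).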
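One possible way to reduce the URL fragments (com, https, www) seen in topic 1 would be to strip web addresses from each description before tokenizing. This is not something the notebook does; the pattern below is an assumption, only catches addresses that start with http://, https:// or www., and its effect on the topics has not been tested here.

```python
import re

# Matches tokens that start with http://, https:// or www. up to the next whitespace
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def strip_urls(text):
    """Replace anything that looks like a web address with a single space."""
    return URL_PATTERN.sub(" ", text)

sample = "Please copy the layout of https://www.example.com/shop and www.example.org, thanks"
cleaned = strip_urls(sample)
print(cleaned)
```

In the pipeline above this would amount to calling `strip_urls(text)` inside `preprocess` before `simple_preprocess`. Generic words such as "would" or "also" would still need to be added to the stop word list.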