# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [3]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `cleaned_text` of the DataFrame.

In [5]:
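import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Note: word_tokenize and WordNetLemmatizer rely on NLTK data packages
# (tokenizer models and WordNet) that must be downloaded once with nltk.download

def preprocessing(text):
    # Strip surrounding whitespace and lowercase
    text = text.strip().lower()

    # Remove digits
    text = ''.join([i for i in text if not i.isdigit()])

    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Tokenize
    tokens = word_tokenize(text)

    # Lemmatize each token (treated as a noun by default)
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return ' '.join(lemmatized_tokens)

data['cleaned_text'] = data['text'].apply(preprocessing)

data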

| | text | cleaned_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | from gldcunixbcccolumbiaedu gary l dare subjec... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | from minerkuhubccukansedu subject re ancient b... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | from atterlepvelaacsoaklandedu cardinal ximene... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | from vzhivovsuperiorcarletonca vladimir zhivov... |
| ... | ... | ... |
| 1194 | From: jerryb@eskimo.com (Jerry Kaufman)\nSubje... | from jerrybeskimocom jerry kaufman subject re ... |
| 1195 | From: golchowy@alchemy.chem.utoronto.ca (Geral... | from golchowyalchemychemutorontoca gerald olch... |
| 1196 | From: jayne@mmalt.guild.org (Jayne Kulikauskas... | from jaynemmaltguildorg jayne kulikauskas subj... |
| 1197 | From: sclark@epas.utoronto.ca (Susan Clark)\nS... | from sclarkepasutorontoca susan clark subject ... |
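
In the cleaned text above, email addresses end up as single fused tokens (for example `gldcunixbcccolumbiaedu`) because punctuation is deleted rather than replaced. As a minimal sketch of one possible variant, not part of the original solution, punctuation can be mapped to spaces so that it acts as a token boundary. The helper name `split_on_punctuation` is hypothetical, and lemmatization is left out to keep the example dependency-free.

```python
import string

# Translation table that maps every punctuation character to a space
# (instead of deleting it, as the preprocessing above does).
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def split_on_punctuation(text):
    """Lowercase, drop digits, and turn punctuation into token boundaries."""
    text = text.strip().lower()
    text = "".join(ch for ch in text if not ch.isdigit())
    text = text.translate(PUNCT_TO_SPACE)
    # Collapse the runs of whitespace created by the substitution
    return " ".join(text.split())

sample = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
print(split_on_punctuation(sample))
# from gld cunixb cc columbia edu gary l dare
```

Whether this helps depends on the corpus: it produces more short tokens such as `cc` and `edu`, which the `min_df` and `max_df` filters of the vectorizer may or may not remove.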


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [31]:
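from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep at most 5000 terms that appear in at least 50 documents
# and in no more than 80% of them
vectorizer = TfidfVectorizer(max_df=0.8, max_features=5000, min_df=50)

vectorized_documents = vectorizer.fit_transform(data['cleaned_text'])

# Instantiate and fit the LDA
# (no random_state is set, so the topics can differ from one run to the next)
n_components = 10
lda_model = LatentDirichletAllocation(n_components=n_components, max_iter=100)
lda_model.fit(vectorized_documents)

# Document-term matrix: one row per email, one column per vocabulary term.
# The topic-word weights themselves are stored in lda_model.components_
document_term_weights = pd.DataFrame(
    vectorized_documents.toarray(),
    columns=vectorizer.get_feature_names_out()
)

document_term_weights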


| | able | about | above | accept | according | account | act | actually | adam | after | ... | wrong | wrote | year | yes | yet | york | you | young | your | youre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.148971 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 |
| 1 | 0.000000 | 0.000000 | 0.000000 | 0.071875 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.063403 | 0.0 | 0.262909 | 0.0 | 0.000000 | 0.0 |
| 2 | 0.000000 | 0.118244 | 0.080304 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.160259 | 0.0 | 0.045218 | 0.0 |
| 3 | 0.000000 | 0.049450 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 |
| 4 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1194 | 0.141084 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.241756 | 0.0 | 0.511598 | 0.0 |
| 1195 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.193848 | 0.0 | 0.000000 | 0.0 |
| 1196 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.051697 | 0.0 | 0.000000 | 0.0 | 0.033403 | 0.0 | 0.000000 | 0.0 |
| 1197 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.232058 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | ... | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 |


## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with the potential topics.

In [32]:
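def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])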
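
The slice applied to `topic.argsort()` in `print_topics` is compact but easy to misread. As a small illustration on made-up weights (the `weights` array and `n_top` are hypothetical), it walks the ascending `argsort` result backwards from the end and stops after `n_top` items, which is equivalent to reversing first and then taking the head.

```python
import numpy as np

# Made-up topic weights for five vocabulary terms
weights = np.array([0.2, 1.5, 0.1, 3.0, 0.7])
n_top = 3

# argsort() returns indices in ascending order of weight: [2, 0, 4, 1, 3]
ascending = weights.argsort()

# Step backwards from the last index and stop before position -(n_top + 1)
top_via_slice = ascending[:-n_top - 1:-1]

# Same result, written as "reverse, then keep the first n_top"
top_via_reverse = ascending[::-1][:n_top]

print(top_via_slice)    # [3 1 4]
print(top_via_reverse)  # [3 1 4]
```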

❓ **Question** ❓ Print the topics extracted by your LDA.

In [33]:
print_topics(lda_model, vectorizer)

Topic 0:
[('la', 5.5106869152831885), ('center', 1.1840208013675395), ('distribution', 1.1063524847121604), ('just', 0.6505251931762636), ('usa', 0.6118299159598204), ('flame', 0.579228123936542), ('university', 0.563994268496561), ('please', 0.49276629207711165), ('andrew', 0.45688158215177244), ('no', 0.3492755273860856)]
Topic 1:
[('period', 5.298462379499342), ('power', 2.825890070108786), ('play', 2.196654059610108), ('third', 1.6724529011341769), ('second', 1.3427931044335693), ('ranger', 1.3342275062745534), ('islander', 1.10051013180538), ('new', 1.0612330838617776), ('first', 1.0434957641798122), ('detroit', 1.0189386577745254)]
Topic 2:
[('fan', 0.78671738292884), ('coverage', 0.5383334479305526), ('th', 0.5141246271056759), ('help', 0.4758841962790922), ('university', 0.2805004415886749), ('spirit', 0.10000059604803019), ('calgary', 0.10000049117246393), ('study', 0.10000046079021713), ('blue', 0.10000045316030935), ('father', 0.10000044137426663)]
Topic 3:
[('that', 83.4793

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [34]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [35]:
# Vectorize the new text using the TF-IDF vectorizer
example_vectorized = vectorizer.transform(example)

# Use the fitted LDA model to predict the topics of the vectorized new text
predicted_topics = lda_model.transform(example_vectorized)
predicted_topics

array([[0.02276932, 0.02277008, 0.02276932, 0.02277325, 0.02276932,
        0.7950708 , 0.02276932, 0.02276953, 0.02276932, 0.02276972]])
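
The array above holds one weight per topic for the example, and the weights sum to roughly 1. A minimal sketch of how to read off the dominant topic, using the values copied from the output above (the names `predicted` and `dominant_topic` are only illustrative):

```python
# Document-topic weights copied from the output above
predicted = [0.02276932, 0.02277008, 0.02276932, 0.02277325, 0.02276932,
             0.7950708, 0.02276932, 0.02276953, 0.02276932, 0.02276972]

# Index of the topic with the largest weight
dominant_topic = max(range(len(predicted)), key=predicted.__getitem__)

print(dominant_topic)             # 5
print(predicted[dominant_topic])  # 0.7950708
print(round(sum(predicted), 6))   # 1.0
```

With a NumPy array such as `predicted_topics`, `predicted_topics.argmax(axis=1)` gives the same index for each row. Since the model is fitted without a fixed `random_state`, the topic numbering may change from one run to the next, so the index should be checked against the words printed for that topic.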

🏁 Congratulations! You now know how to quickly implement an LDA model.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!