# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails with the **LDA** algorithm (Unsupervised Learning in NLP)

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords 
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer


url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
# YOUR CODE HERE

def preprocessing(sentence):
    # Trim surrounding whitespace and lowercase (str.strip returns a new string)
    sentence = sentence.strip().lower()
    # Drop digits, then delete punctuation
    # (deleting punctuation also fuses email addresses into single tokens)
    sentence = ''.join(char for char in sentence if not char.isdigit())
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))

    # Tokenize and remove English stopwords
    tokens = word_tokenize(sentence)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize every token as a verb
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
    return ' '.join(lemmatized_tokens)

In [4]:
data['clean_text'] = data['text'].apply(preprocessing)

In [5]:
data

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | gldcunixbcccolumbiaedu gary l dare subject sta... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | minerkuhubccukansedu subject ancient book orga... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | atterlepvelaacsoaklandedu cardinal ximenez sub... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | vzhivovsuperiorcarletonca vladimir zhivov subj... |
| ... | ... | ... |
| 1194 | From: jerryb@eskimo.com (Jerry Kaufman)\nSubje... | jerrybeskimocom jerry kaufman subject prayers ... |
| 1195 | From: golchowy@alchemy.chem.utoronto.ca (Geral... | golchowyalchemychemutorontoca gerald olchowy s... |
| 1196 | From: jayne@mmalt.guild.org (Jayne Kulikauskas... | jaynemmaltguildorg jayne kulikauskas subject q... |
| 1197 | From: sclark@epas.utoronto.ca (Susan Clark)\nS... | sclarkepasutorontoca susan clark subject pick ... |
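
The first token of each cleaned email (for example `gldcunixbcccolumbiaedu`) is a whole email address with its punctuation deleted. The sketch below shows why, using only the standard library, and contrasts it with one possible alternative that replaces punctuation with spaces. The helper names `drop_punctuation` and `punctuation_to_spaces` are hypothetical and not part of the challenge; whether splitting addresses gives better topics is an assumption you would need to check on this corpus.

```python
import string

# Hypothetical helpers for illustration; not part of the original notebook.
def drop_punctuation(text: str) -> str:
    """Delete punctuation outright, as the preprocessing cell does."""
    return text.translate(str.maketrans('', '', string.punctuation))

def punctuation_to_spaces(text: str) -> str:
    """Replace each punctuation character with a space, then collapse whitespace."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

address = "gld@cunixb.cc.columbia.edu"
print(drop_punctuation(address))       # gldcunixbcccolumbiaedu
print(punctuation_to_spaces(address))  # gld cunixb cc columbia edu
```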


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [6]:
data['clean_text']


0       gldcunixbcccolumbiaedu gary l dare subject sta...
1       atterlepvelaacsoaklandedu cardinal ximenez sub...
2       minerkuhubccukansedu subject ancient book orga...
3       atterlepvelaacsoaklandedu cardinal ximenez sub...
4       vzhivovsuperiorcarletonca vladimir zhivov subj...
                              ...                        
1194    jerrybeskimocom jerry kaufman subject prayers ...
1195    golchowyalchemychemutorontoca gerald olchowy s...
1196    jaynemmaltguildorg jayne kulikauskas subject q...
1197    sclarkepasutorontoca susan clark subject pick ...
1198    lmvecwestminsteracuk william hargreaves subjec...
Name: clean_text, Length: 1199, dtype: object

In [7]:
# YOUR CODE HERE
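# Note: LDA models word counts, so CountVectorizer is the more common input;
# TF-IDF weights are non-negative, so scikit-learn accepts them as well
vectorizer = TfidfVectorizer()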

vectorized_text = vectorizer.fit_transform(data['clean_text'])
vectorized_text = pd.DataFrame(
    vectorized_text.toarray(), 
    columns = vectorizer.get_feature_names_out()
)

vectorized_text

| | aa | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg | aacc | aadams | aafreenetcarletonca | aargh | aaron | aaronbinahccbrandeisedu | aaroncathenamitedu | aarons | ... | zombo | zone | zoo | zoom | zorasterism | zubov | zupancic | zurich | zwart | zzzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.086661 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.073976 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1194 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1195 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1196 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1197 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [8]:
from sklearn.decomposition import LatentDirichletAllocation

n_components = 2
# No random_state is set, so the fitted topics can vary between runs
model = LatentDirichletAllocation(n_components=n_components, max_iter=100)
model.fit(vectorized_text)

In [9]:
text_topic_mixture = model.transform(vectorized_text)

In [10]:
text_topic_mixture

array([[0.9508423 , 0.0491577 ],
       [0.95107325, 0.04892675],
       [0.9523986 , 0.0476014 ],
       ...,
       [0.94332557, 0.05667443],
       [0.90131856, 0.09868144],
       [0.93808433, 0.06191567]])
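
Each row of this array is one document's distribution over the topics, so the values in a row are non-negative and sum to 1. As a minimal sketch of the same workflow, the block below fits an LDA on a toy corpus invented for illustration. It assumes two variations on the cells above: raw term counts from `CountVectorizer` (the more usual input for LDA) instead of TF-IDF, and a fixed `random_state` so reruns are reproducible. It is not a replacement for the challenge code.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the team won the game",
    "the player scored in the game",
    "god and faith and church",
    "prayer and faith in god",
]

# Sparse matrix of raw term counts; LDA accepts sparse input directly
counts = CountVectorizer().fit_transform(docs)

# random_state pins the initialisation so reruns give the same topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)        # (4, 2): one row per document, one column per topic
print(doc_topic.sum(axis=1))  # each row sums to 1, up to floating-point rounding
```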

## (3) Visualize potential topics

🎁 We coded a function for you that prints the ten highest-weighted words of each potential topic.

In [11]:
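def print_topics(model, vectorizer):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # Indices of the 10 largest weights, in descending order
        top_indices = topic.argsort()[::-1][:10]
        print([(feature_names[i], topic[i]) for i in top_indices])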

❓ **Question** ❓ Print the topics extracted by your LDA.

In [12]:
# YOUR CODE HERE
print_topics(model, vectorizer)

Topic 0:
[('god', 30.564371807676498), ('game', 26.972446730680588), ('go', 26.488316121874973), ('would', 26.306312106335817), ('team', 25.716731613896673), ('write', 23.78042316886099), ('say', 23.775830529037755), ('one', 23.450407028137185), ('line', 23.18590324516018), ('subject', 23.022939059821237)]
Topic 1:
[('wsh', 0.9960588427961377), ('sturm', 0.9944635078632753), ('gakwrscom', 0.936461105052841), ('dee', 0.9050623754958804), ('hfd', 0.8270766272107497), ('stueven', 0.7796818164150636), ('wpg', 0.7680602212649625), ('teamabucknelledu', 0.7244227293987856), ('bucknell', 0.7244227293987856), ('lewisburg', 0.7244227293987856)]
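
The weights printed above come from `model.components_`, which holds unnormalized topic-word weights; that is why Topic 0 shows values near 25 while Topic 1 stays below 1. Dividing each row by its sum turns it into a probability distribution over the vocabulary, which makes topics easier to compare. The sketch below illustrates this with NumPy on a small made-up array; the `components` values and the three-word `vocab` are illustrative assumptions, not output of the fitted model.

```python
import numpy as np

# Illustrative stand-in for model.components_ (2 topics x 3 words), not real output
components = np.array([
    [30.0, 27.0, 3.0],
    [0.9, 0.1, 1.0],
])
vocab = np.array(["god", "game", "zoo"])

# Normalize each topic row so it sums to 1
topic_word = components / components.sum(axis=1, keepdims=True)

# Two highest-probability words per topic, in descending order
top_words = [vocab[np.argsort(row)[::-1][:2]].tolist() for row in topic_word]

print(topic_word.sum(axis=1))  # [1. 1.]
print(top_words)               # [['god', 'game'], ['zoo', 'god']]
```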


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [13]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [16]:
# YOUR CODE HERE
X_new = vectorizer.transform(example)

topic_dist = model.transform(X_new)

prediction = np.argmax(topic_dist, axis=1)
prediction

array([0])
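
`np.argmax` keeps only the index of the strongest topic and discards how dominant it is. A small helper can report both. The sketch below is one way to do it; `dominant_topics` is a hypothetical name, and the input is the first row of `text_topic_mixture` shown earlier rather than the distribution of the new example, which is not printed in this notebook.

```python
import numpy as np

def dominant_topics(topic_dist):
    """Return (topic index, probability) of the strongest topic for each document."""
    topic_dist = np.asarray(topic_dist)
    best = topic_dist.argmax(axis=1)
    return [(int(k), float(row[k])) for k, row in zip(best, topic_dist)]

# First row of text_topic_mixture from the training section
mixture = np.array([[0.9508423, 0.0491577]])
print(dominant_topics(mixture))  # [(0, 0.9508423)]
```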

🏁 Congratulations! You know how to implement an LDA quickly.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!