# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm, an unsupervised learning technique in NLP.

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [10]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [11]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [12]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build these once instead of on every call
# (the NLTK tokenizer, stopwords and WordNet resources must be downloaded)
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # remove leading and trailing whitespace
    sentence = sentence.strip()

    # lowercase characters
    sentence = sentence.lower()

    # remove digits
    sentence = "".join(char for char in sentence if not char.isdigit())

    # remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")

    # tokenize
    tokens = word_tokenize(sentence)

    # remove stopwords
    tokens_without_stopwords = [token for token in tokens if token not in stop_words]

    # lemmatize, first as verbs and then as nouns
    lemmatized_verbs = [lemmatizer.lemmatize(token, pos="v") for token in tokens_without_stopwords]
    lemmatized_nouns = [lemmatizer.lemmatize(token, pos="n") for token in lemmatized_verbs]

    return " ".join(lemmatized_nouns)

# Clean the emails
data["clean_text"] = data["text"].apply(preprocessing)
data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
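
The `clean_text` preview shows that removing punctuation glues each sender address into a single token such as `gldcunixbcccolumbiaedu`. One possible refinement, which is not part of the solution above, is to drop email addresses before stripping punctuation. Here is a minimal sketch using only the standard library; `EMAIL_PATTERN` and `strip_email_addresses` are hypothetical names, and the regex is deliberately loose rather than a full address parser:

```python
import re

# Hypothetical, deliberately loose pattern: any run of non-space characters around an "@"
EMAIL_PATTERN = re.compile(r"\S+@\S+")

def strip_email_addresses(text):
    """Replace anything that looks like an email address with a space."""
    return EMAIL_PATTERN.sub(" ", text)

raw = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
cleaned = strip_email_addresses(raw)
print(cleaned)
```

If you try this, call it at the very start of `preprocessing`, while the `@` characters are still present.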


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Fit on the cleaned emails and use the vocabulary terms as column names
vectorized_data = pd.DataFrame(
    vectorizer.fit_transform(data.clean_text).toarray(),
    columns=vectorizer.get_feature_names_out(),
)

vectorized_data

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aassists,...,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.088609,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.07373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1195,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1196,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1197,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
from sklearn.decomposition import LatentDirichletAllocation

n_components = 10
lda = LatentDirichletAllocation(n_components=n_components, max_iter=100)

topics = lda.fit_transform(vectorized_data)
topics = pd.DataFrame(topics, columns=[f"Topic {i+1}" for i in range(n_components)])
topics

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9,Topic 10
0,0.008696,0.315571,0.486086,0.137467,0.008696,0.008700,0.008696,0.008696,0.008696,0.008696
1,0.009275,0.009275,0.916525,0.009275,0.009275,0.009275,0.009275,0.009275,0.009275,0.009275
2,0.008664,0.008666,0.723813,0.008664,0.206870,0.008664,0.008664,0.008664,0.008664,0.008664
3,0.011165,0.011165,0.886472,0.011165,0.011165,0.011165,0.011165,0.011165,0.024206,0.011165
4,0.009921,0.163271,0.578338,0.009921,0.009922,0.009921,0.009922,0.188940,0.009921,0.009922
...,...,...,...,...,...,...,...,...,...,...
1194,0.013542,0.013543,0.664616,0.227041,0.013542,0.013542,0.013542,0.013550,0.013542,0.013542
1195,0.014095,0.014095,0.517082,0.014094,0.014095,0.370163,0.014094,0.014094,0.014094,0.014094
1196,0.010465,0.010467,0.797783,0.010465,0.010465,0.010465,0.118489,0.010465,0.010472,0.010465
1197,0.018414,0.089294,0.763392,0.018414,0.018414,0.018414,0.018414,0.018414,0.018414,0.018414
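
Each row of `topics` is a document-topic mixture, so a quick way to summarize it is to keep the topic with the largest weight. The sketch below is rebuilt from the first two rows of the output above so that it runs on its own; on the real data you would call `idxmax` on `topics` directly:

```python
import pandas as pd

columns = [f"Topic {i + 1}" for i in range(10)]

# First two rows of the output above, copied by hand
sample = pd.DataFrame(
    [
        [0.008696, 0.315571, 0.486086, 0.137467, 0.008696,
         0.008700, 0.008696, 0.008696, 0.008696, 0.008696],
        [0.009275, 0.009275, 0.916525, 0.009275, 0.009275,
         0.009275, 0.009275, 0.009275, 0.009275, 0.009275],
    ],
    columns=columns,
)

dominant_topic = sample.idxmax(axis=1)  # column label of the largest weight in each row
row_sums = sample.sum(axis=1)           # each mixture sums to approximately 1

print(dominant_topic)
print(row_sums)
```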


## (3) Visualize potential topics

🎁 We coded a function for you that prints the words most strongly associated with each potential topic.

In [21]:
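def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per printed word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print(f"Topic {idx}:")
        # argsort is ascending, so the reversed slice yields the heaviest words first
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[i], topic[i]) for i in top_indices])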
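
The reversed slice applied to `argsort()` in `print_topics` is compact but easy to misread. `argsort()` returns indices in ascending order of weight, and a slice of the form `[:-n - 1:-1]` walks backwards from the end and stops after `n` items, which yields the `n` largest weights in descending order. A small NumPy sketch with made-up weights:

```python
import numpy as np

weights = np.array([0.5, 3.0, 1.2, 4.1, 0.1])  # made-up topic-word weights
n = 3

ascending = weights.argsort()   # indices from smallest to largest weight
top_n = ascending[:-n - 1:-1]   # last n indices, read in reverse order

print(top_n)
print(weights[top_n])
```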

❓ **Question** ❓ Print the topics extracted by your LDA.

In [22]:
# YOUR CODE HERE
print_topics(lda, vectorizer)

Topic 0:
[('grass', 4.064857256520005), ('valley', 3.8134220050726233), ('chuck', 3.131125223274644), ('petchgvggvgtekcom', 3.050467805117499), ('cell', 2.344385802640014), ('petch', 2.1515652355304873), ('daily', 2.138701707725435), ('statemaine', 1.0349620650615423), ('finalswho', 1.0349620650615423), ('ata', 1.0224572473286888)]
Topic 1:
[('espn', 8.766644264207741), ('ranger', 8.354649262523514), ('captain', 7.710132818628686), ('islander', 6.5050200907771245), ('gary', 6.023975669218992), ('mask', 5.793995695648306), ('pt', 5.752492841557317), ('hawk', 5.664793328320538), ('jet', 5.650328188638847), ('dare', 5.633748474990385)]
Topic 2:
[('god', 35.723133778996655), ('game', 26.688698002351035), ('go', 26.185827082569567), ('would', 25.94013754551293), ('team', 25.359105336545575), ('one', 24.121526980633046), ('write', 23.414177149625477), ('say', 23.368415293740068), ('line', 22.88364693411444), ('subject', 22.71792375093564)]
Topic 3:
[('cdkaupaneosncsuedu', 1.0443215540667932)
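
The numbers printed next to each word are pseudo-counts, not probabilities. The scikit-learn documentation notes that dividing each row of `components_` by its sum gives a distribution over words for that topic. A sketch on a made-up 2×3 matrix standing in for `lda.components_`:

```python
import numpy as np

# Made-up stand-in for lda.components_ (2 topics, 3 words)
components = np.array([
    [4.0, 1.0, 5.0],
    [2.0, 2.0, 4.0],
])

# Divide every row by its own sum so that each topic sums to 1
topic_word = components / components.sum(axis=1)[:, np.newaxis]
print(topic_word)
```

Normalized weights are easier to compare across topics, since a topic with large pseudo-counts everywhere no longer dominates the printout.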

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [23]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [29]:
# Vectorize the example with the vectorizer fitted on the training emails
example_vectorized = pd.DataFrame(
    vectorizer.transform(example).toarray(),
    columns=vectorizer.get_feature_names_out(),
)

In [31]:
lda_vectors = lda.transform(example_vectorized)
print(pd.DataFrame(lda_vectors))
print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

          0         1        2         3         4         5         6  \
0  0.027739  0.214135  0.56395  0.027739  0.027739  0.027739  0.027739   

          7         8         9  
0  0.027739  0.027739  0.027739  
topic 0 : 0.027739368234211182
topic 1 : 0.21413514299992303
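
One caveat: the vectorizer was fitted on `clean_text`, whereas the example is vectorized raw, so inflected words such as "performed" may not match the lemmatized vocabulary. It is probably safer to clean new text the same way as the training data before vectorizing it. The sketch below shows the idea on a toy corpus; `simple_clean` is a hypothetical stand-in for `preprocessing` (no NLTK needed), and the documents are invented:

```python
import string

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_clean(text):
    """Hypothetical stand-in for `preprocessing`: lowercase and drop punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Toy corpus, invented for this illustration only
toy_docs = ["The team won the game!", "God and faith.", "A great game for the team."]

toy_vectorizer = TfidfVectorizer()
X = toy_vectorizer.fit_transform([simple_clean(doc) for doc in toy_docs])

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Clean the new text the same way before vectorizing it
new_text = ["My TEAM played one game."]
new_vectorized = toy_vectorizer.transform([simple_clean(text) for text in new_text])
mixture = toy_lda.transform(new_vectorized)
print(mixture)
```

In the notebook itself, this would mean applying `preprocessing` to each string in `example` before calling `vectorizer.transform`.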


🏁 Congratulations! You now know how to quickly implement an LDA.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!