# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [1]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [2]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [3]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Built once and reused for every email
# (the NLTK stopword, tokenizer and WordNet data must be downloaded beforehand)
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocessing(sentence):
    # remove leading and trailing whitespace
    sentence = sentence.strip()

    # lowercase characters
    sentence = sentence.lower()

    # remove numbers
    sentence = "".join(char for char in sentence if not char.isdigit())

    # remove punctuation
    for punctuation in string.punctuation:
        sentence = sentence.replace(punctuation, "")

    # tokenize
    tokens = word_tokenize(sentence)

    # remove stopwords
    tokens_without_stopwords = [token for token in tokens if token not in stop_words]

    # lemmatize as verbs first, then as nouns
    lemmatized_verbs = [lemmatizer.lemmatize(token, pos="v") for token in tokens_without_stopwords]
    lemmatized_nouns = [lemmatizer.lemmatize(token, pos="n") for token in lemmatized_verbs]

    return " ".join(lemmatized_nouns)

# Clean the emails
data["clean_text"] = data["text"].apply(preprocessing)
data.head()

Unnamed: 0,text,clean_text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...,gldcunixbcccolumbiaedu gary l dare subject sta...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...,minerkuhubccukansedu subject ancient book orga...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...,atterlepvelaacsoaklandedu cardinal ximenez sub...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...,vzhivovsuperiorcarletonca vladimir zhivov subj...
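
💡 Notice that the addresses in `clean_text` come out as single fused tokens such as `gldcunixbcccolumbiaedu`. That is a consequence of deleting punctuation outright. The sketch below uses only the standard library to reproduce the effect and to show one possible alternative, replacing punctuation and digits with spaces. The helper names `strip_chars` and `space_chars` are made up for this illustration and are not part of the challenge.

```python
import string

# Both translation tables are built once; the names are illustrative only.
_DROP = str.maketrans("", "", string.punctuation + string.digits)
_TO_SPACE = str.maketrans({c: " " for c in string.punctuation + string.digits})

def strip_chars(text: str) -> str:
    """Delete digits and punctuation, like the loop in `preprocessing`."""
    return text.strip().lower().translate(_DROP)

def space_chars(text: str) -> str:
    """Replace digits and punctuation with spaces, then collapse whitespace."""
    return " ".join(text.lower().translate(_TO_SPACE).split())

header = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)"
print(strip_chars(header))  # from gldcunixbcccolumbiaedu gary l dare
print(space_chars(header))  # from gld cunixb cc columbia edu gary l dare
```

Whether the split version is preferable depends on the corpus: it keeps fragments like `columbia` and `edu` as separate words, which may or may not be useful for topic extraction.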


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [4]:
# Vectorize the cleaned emails (converted to a dense DataFrame so it can be displayed)
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

vectorized_data = pd.DataFrame(vectorizer.fit_transform(data.clean_text).toarray())
vectorized_data.columns = vectorizer.get_feature_names_out()

vectorized_data

Unnamed: 0,aa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aacc,aadams,aafreenetcarletonca,aargh,aaron,aaronbinahccbrandeisedu,aaroncathenamitedu,aassists,...,zombo,zone,zoo,zoom,zorasterism,zubov,zupancic,zurich,zwart,zzzzzz
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.088609,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.07373,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1195,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1196,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1197,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
from sklearn.decomposition import LatentDirichletAllocation

n_components = 2
lda = LatentDirichletAllocation(n_components=n_components, max_iter=100)

topics = lda.fit_transform(vectorized_data)
topics = pd.DataFrame(topics, columns=[f"Topic {i+1}" for i in range(n_components)])
topics

Unnamed: 0,Topic 1,Topic 2
0,0.939465,0.060535
1,0.052553,0.947447
2,0.063136,0.936864
3,0.064230,0.935770
4,0.930382,0.069618
...,...,...
1194,0.079498,0.920502
1195,0.906709,0.093291
1196,0.072484,0.927516
1197,0.899159,0.100841
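
💡 LDA is formulated as a model of word counts, so a count-based vectorizer is a common alternative to TF-IDF here. The sketch below is an assumption-laden variant rather than the expected solution: it uses a made-up four-document corpus (`toy_docs`), `CountVectorizer`, and a fixed `random_state` so that repeated runs give the same result. Without a fixed seed, the order of the topics can change from one fit to the next.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus, not the email dataset
toy_docs = [
    "team win game hockey player",
    "player score game team season",
    "god church believe jesus christian",
    "christian people believe god church",
]

# Raw term counts; the sparse matrix can be passed to LDA directly
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)

# random_state fixes the initialisation so the fit is reproducible
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=50, random_state=0)
doc_topics = toy_lda.fit_transform(counts)

# Each row is a document-topic mixture; argmax gives the dominant topic
dominant_topic = doc_topics.argmax(axis=1)
print(doc_topics.round(3))
print(dominant_topic)
```

Column `j` of `doc_topics` corresponds to row `j` of `toy_lda.components_`, so a column labelled `Topic 1` in a DataFrame like the one above maps to `components_[0]`.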


## (3) Visualize potential topics

🎁 We coded a function for you that prints the 10 highest-weighted words of each potential topic.

In [6]:
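def print_topics(model, vectorizer, n_top_words=10):
    # Look up the vocabulary once instead of once per word
    feature_names = vectorizer.get_feature_names_out()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so slice backwards to get the heaviest words first
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-n_top_words - 1:-1]])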

❓ **Question** ❓ Print the topics extracted by your LDA.

In [7]:
# Top words per topic, with their weights from lda.components_
print_topics(lda, vectorizer)

Topic 0:
[('god', 36.095018832661026), ('christian', 22.70553978977035), ('jesus', 19.279784679557554), ('say', 17.880034635631997), ('people', 17.852801353085912), ('would', 16.99364300672694), ('one', 16.69350383182574), ('believe', 16.659330251404057), ('church', 16.648502375184915), ('know', 15.721240653844724)]
Topic 1:
[('game', 27.037892254270613), ('team', 25.74732813784154), ('play', 20.040878946500857), ('go', 19.01962228434318), ('hockey', 18.716719243252758), ('player', 18.365812296540604), ('get', 14.647979163797128), ('win', 14.239403611685697), ('nhl', 13.599848126613603), ('year', 13.447013986985953)]
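
💡 The numbers printed next to each word are entries of `components_`, which scikit-learn describes as pseudo-counts rather than probabilities. Dividing each row by its sum turns them into a word distribution per topic. The sketch below illustrates this with NumPy only; the `components` matrix and the five-word vocabulary are invented for the example and merely echo the words seen above.

```python
import numpy as np

# Invented vocabulary and topic-word pseudo-counts (2 topics x 5 words)
feature_names = np.array(["church", "game", "god", "hockey", "team"])
components = np.array([
    [16.0, 1.0, 36.0, 0.5, 0.5],
    [0.5, 27.0, 1.0, 18.5, 25.0],
])

def top_words(components, feature_names, n_top=3):
    """Return the n_top most probable words of each topic with their probabilities."""
    # Normalise each row so that it sums to 1
    word_probs = components / components.sum(axis=1, keepdims=True)
    result = []
    for row in word_probs:
        # argsort is ascending: reverse it, then keep the first n_top indices
        top_idx = row.argsort()[::-1][:n_top]
        result.append([(str(feature_names[i]), round(float(row[i]), 3)) for i in top_idx])
    return result

for idx, words in enumerate(top_words(components, feature_names)):
    print(f"Topic {idx}:", words)
```

The slice `[::-1][:n_top]` selects the same indices as the `[:-10 - 1:-1]` idiom in `print_topics`, just written in two steps.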


## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [8]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [9]:
# Vectorize the example with the vectorizer already fitted on the corpus
example_vectorized = vectorizer.transform(example).toarray()
example_vectorized = pd.DataFrame(example_vectorized, columns=vectorizer.get_feature_names_out())

In [10]:
lda_vectors = lda.transform(example_vectorized)
print(pd.DataFrame(lda_vectors))
print("topic 0 :", lda_vectors[0][0])
print("topic 1 :", lda_vectors[0][1])

          0         1
0  0.156156  0.843844
topic 0 : 0.15615629394292976
topic 1 : 0.8438437060570702


🏁 Congratulations! You now know how to quickly implement an LDA.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!