# Latent Dirichlet Allocation (LDA)

🎯 The goal of this challenge is to find topics within a corpus of emails using the **LDA** algorithm (unsupervised learning in NLP).

✉️ Here is a collection of 1K+ ***unlabelled emails***. Let's try to ***extract topics*** from them!

In [2]:
import pandas as pd

url = 'https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/10-Natural-Language-Processing/lda_data'

data = pd.read_csv(url, sep=",", header=None)
data.columns = ['text']
data.head()

| | text |
|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... |


In [3]:
data.shape

(1199, 1)

## (1) Preprocessing 

❓ **Question (Cleaning)** ❓ You're used to it by now... Clean up! Store the cleaned text in a new column `clean_text` of the DataFrame.

In [4]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Assumes the required NLTK resources (tokenizer models, stopwords, wordnet)
# have already been fetched with nltk.download
stop_words = set(stopwords.words('english'))  # Built once, not on every call
lemmatizer = WordNetLemmatizer()

def clean(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, ' ')  # Replace punctuation with spaces
    lowercased = text.lower()  # Lowercase
    tokenized = word_tokenize(lowercased)  # Tokenize
    words_only = [word for word in tokenized if word.isalpha()]  # Keep alphabetic tokens only
    without_stopwords = [word for word in words_only if word not in stop_words]  # Remove stop words
    lemmatized = [lemmatizer.lemmatize(word) for word in without_stopwords]  # Lemmatize (as nouns, the default)
    return lemmatized

# Apply to all texts
data['clean_text'] = data.text.apply(clean)
# Store each token list as its string representation, e.g. "['gld', 'cunixb', ...]"
data['clean_text'] = data['clean_text'].astype('str')

data.head()

| | text | clean_text |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)... | ['gld', 'cunixb', 'cc', 'columbia', 'edu', 'ga... |
| 1 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | ['atterlep', 'vela', 'ac', 'oakland', 'edu', '... |
| 2 | From: miner@kuhub.cc.ukans.edu\nSubject: Re: A... | ['miner', 'kuhub', 'cc', 'ukans', 'edu', 'subj... |
| 3 | From: atterlep@vela.acs.oakland.edu (Cardinal ... | ['atterlep', 'vela', 'ac', 'oakland', 'edu', '... |
| 4 | From: vzhivov@superior.carleton.ca (Vladimir Z... | ['vzhivov', 'superior', 'carleton', 'ca', 'vla... |


## (2) Latent Dirichlet Allocation model

❓ **Question (Training)** ❓ Train an LDA model to extract potential topics.

In [5]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
vectorizer = CountVectorizer()

data_vectorized = vectorizer.fit_transform(data.clean_text)

lda_model = LatentDirichletAllocation(n_components=2)

lda_vectors = lda_model.fit_transform(data_vectorized)


In [7]:
lda_vectors

array([[0.99611028, 0.00388972],
       [0.00339439, 0.99660561],
       [0.0133878 , 0.9866122 ],
       ...,
       [0.00380039, 0.99619961],
       [0.98555334, 0.01444666],
       [0.00950916, 0.99049084]])
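
Each row of this array is the topic mixture of one document, so its entries sum to 1. A minimal sketch on a hypothetical toy corpus (all names prefixed with `toy_` are illustrative) checks this and reads off the dominant topic of each document with `argmax`:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two visibly different themes
toy_docs = [
    "hockey team game player play",
    "team game season player hockey",
    "god jesus christian people say",
    "christian people god say jesus",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
toy_vectors = toy_lda.fit_transform(toy_counts)  # shape: (n_documents, n_topics)

row_sums = toy_vectors.sum(axis=1)           # each row is a distribution over topics
dominant_topic = toy_vectors.argmax(axis=1)  # index of the heaviest topic per document

print(row_sums)
print(dominant_topic)
```

Fixing `random_state` makes the fit reproducible. Topic indices are arbitrary labels: which theme ends up as topic 0 or topic 1 can vary between fits.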

## (3) Visualize potential topics

🎁 We coded a function for you that prints the words associated with the potential topics.

In [13]:
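def print_topics(model, vectorizer, n_top_words=10):
    feature_names = vectorizer.get_feature_names_out()  # Vocabulary, indexed by column
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % topic_idx)
        # Indices of the n_top_words largest weights, in descending order
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print([(feature_names[j], topic[j]) for j in top_indices])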

❓ **Question** ❓ Print the topics extracted by your LDA.

In [71]:
print_topics(lda_model, vectorizer)

Topic 0:
[('edu', 1097.4469072079066), ('team', 958.4923130542303), ('game', 950.8338440921185), ('line', 735.5040425582603), ('ca', 687.3131944894501), ('subject', 650.9049194382225), ('hockey', 649.4673791678098), ('organization', 627.2359927218225), ('player', 529.3338984729344), ('play', 520.7404813856008)]
Topic 1:
[('god', 1525.4483122897243), ('edu', 1030.553092792042), ('one', 848.0805369757429), ('would', 835.7042437119907), ('christian', 816.9774010218118), ('people', 716.1423470748451), ('subject', 656.0950805617266), ('jesus', 626.4924718876902), ('line', 621.4959574416882), ('say', 547.8497455097026)]


In [14]:
lda_model.get_feature_names_out()

array(['latentdirichletallocation0', 'latentdirichletallocation1'],
      dtype=object)

## (4) Predict the document-topic mixture of a new text

❓ **Question (Prediction)** ❓

Now that your LDA model is fitted, you can use it to predict the topics of a new text.

1. Vectorize the example
2. Use the LDA on the vectorized example to predict the topics

In [16]:
example = ["My team performed poorly last season. Their best player was out injured and only played one game"]

In [17]:
# Reuse the fitted vectorizer (transform, not fit_transform) so the vocabulary matches training
example_vectorized = vectorizer.transform(example)

lda_vectors = lda_model.transform(example_vectorized)

In [18]:
lda_vectors

array([[0.95574312, 0.04425688]])

In [76]:
print('topic 0 :', lda_vectors[0][0])
print('topic 1 :', lda_vectors[0][1])

topic 0 : 0.9557592843870686
topic 1 : 0.04424071561293148
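
The two prediction steps (vectorize, then transform with the LDA) can also be bundled into a single object. The sketch below uses scikit-learn's `make_pipeline` on a hypothetical toy corpus; `topic_pipeline` and `toy_docs` are illustrative names, and the mixture it prints will not match the numbers above, since it is fitted on different data.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus standing in for the cleaned emails
toy_docs = [
    "hockey team game player play season",
    "team game season player hockey",
    "god jesus christian people say",
    "christian people god say jesus",
]

# Vectorizer and LDA chained: fit() fits both, transform() applies both in order
topic_pipeline = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
)
topic_pipeline.fit(toy_docs)

new_text = ["My team performed poorly last season"]
new_mixture = topic_pipeline.transform(new_text)  # shape: (1, n_topics)

for topic_idx, weight in enumerate(new_mixture[0]):
    print("topic", topic_idx, ":", weight)
```

A pipeline keeps the fitted vocabulary and the fitted model together, which reduces the risk of transforming new text with a mismatched vectorizer.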


🏁 Congratulations! You now know how to quickly implement LDA.

💾 Don't forget to `git add/commit/push` your notebook...

🚀 ... and move on to the next challenge!