# Latent Dirichlet Allocation

In [7]:
import pandas as pd

data = pd.read_csv('data', sep=",", header=None)

data.columns = ['text']

data.head()

Unnamed: 0,text
0,From: gld@cunixb.cc.columbia.edu (Gary L Dare)...
1,From: atterlep@vela.acs.oakland.edu (Cardinal ...
2,From: miner@kuhub.cc.ukans.edu\nSubject: Re: A...
3,From: atterlep@vela.acs.oakland.edu (Cardinal ...
4,From: vzhivov@superior.carleton.ca (Vladimir Z...


In [8]:
df=data

The data is a collection of unlabelled emails. Let's try to extract topics from them!

## Preprocessing 

👇 You're used to it by now... Clean up! Store the cleaned text in a new dataframe column "clean_text".

In [9]:

import numpy as np
import re
from string import punctuation
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords

#df["text"] = df['text'].str.lower()
#df["text"] = df['text'].str.replace('[0-9]','')
#lemmatizer = WordNetLemmatizer()
#df['text'] = df['text'].apply(lambda x: lemmatizer.lemmatize(x))
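
The cleaning step asked for above could look roughly like the sketch below. It is a minimal version using only the standard library, and it assumes that lowercasing, dropping digits and punctuation, and collapsing whitespace is enough cleaning for this exercise. Lemmatization is left out on purpose because it needs extra NLTK data. The helper name `clean_text` and the `example` string are illustrative, not part of the original notebook.

```python
import re
from string import punctuation

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans({ch: " " for ch in punctuation})


def clean_text(text):
    """Lowercase, drop digits and punctuation, and collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"\d+", " ", text)          # remove digits
    text = text.translate(_PUNCT_TABLE)       # punctuation -> spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Illustrative input modelled on the email headers in the dataset.
example = "From: gld@cunixb.cc.columbia.edu (Gary L Dare)\nLines: 37"
print(clean_text(example))
```

With a pandas dataframe, the helper could then be applied as `df["clean_text"] = df["text"].apply(clean_text)`.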



## Latent Dirichlet Allocation model

👇 Train an LDA model to extract potential topics.

In [10]:
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes (dropping punctuation);
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

def remove_stopwords(texts):
    # texts is an iterable of token lists, so filter the tokens directly
    stop_set = set(stop_words)
    return [[word for word in doc if word not in stop_set] for doc in texts]


# Tokenize every document
data_words = list(sent_to_words(data['text']))
# Remove stop words
data_words = remove_stopwords(data_words)



import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]


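from pprint import pprint

# Number of topics
num_topics = 10

# Build LDA model.
# Note: num_topics is not passed to LdaMulticore, so gensim uses its default
# of 100 topics; add num_topics=num_topics to the call to train 10 topics.
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word)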



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\philg\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
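
To make the dictionary and bag-of-words step more concrete, here is a small pure-Python analogue of what `corpora.Dictionary` and `doc2bow` produce. It is only a sketch for intuition, not gensim's implementation: the helper names `build_vocab` and `to_bow` and the toy documents are made up, and the integer ids gensim assigns may differ.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list to sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Toy, already tokenized documents (illustrative only).
docs = [["hockey", "team", "game", "team"], ["god", "church", "god"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Each document becomes a sparse list of `(token_id, count)` pairs, which is the input format the LDA model is trained on. Tokens that are not in the vocabulary are ignored, which is also what happens to unseen words when a new text is converted later.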


## Visualize potential topics

👇 The function to print the words associated with the potential topics is already made for you. You just have to pass the correct arguments!

In [11]:
pprint(lda_model.print_topics())

#doc_lda = lda_model[corpus]

[(53,
  '0.010*"god" + 0.009*"hell" + 0.009*"edu" + 0.006*"said" + 0.006*"would" + '
  '0.006*"church" + 0.005*"lines" + 0.005*"subject" + 0.004*"organization" + '
  '0.004*"period"'),
 (26,
  '0.011*"edu" + 0.007*"ca" + 0.005*"writes" + 0.005*"organization" + '
  '0.005*"subject" + 0.005*"would" + 0.004*"university" + 0.004*"lines" + '
  '0.004*"article" + 0.004*"god"'),
 (39,
  '0.012*"edu" + 0.007*"one" + 0.006*"writes" + 0.005*"subject" + '
  '0.005*"organization" + 0.005*"would" + 0.005*"lines" + 0.005*"university" + '
  '0.005*"god" + 0.004*"marriage"'),
 (61,
  '0.010*"edu" + 0.005*"lines" + 0.005*"get" + 0.005*"subject" + '
  '0.005*"hockey" + 0.005*"game" + 0.005*"one" + 0.005*"organization" + '
  '0.005*"team" + 0.004*"would"'),
 (94,
  '0.013*"edu" + 0.012*"god" + 0.007*"one" + 0.006*"go" + 0.005*"organization" '
  '+ 0.005*"lines" + 0.005*"subject" + 0.004*"would" + 0.004*"mail" + '
  '0.004*"gld"'),
 (63,
  '0.011*"church" + 0.008*"edu" + 0.006*"god" + 0.006*"one" + 0.005*
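
The topic strings printed above are convenient to read but awkward to process. As a rough sketch, a small helper can turn one such string into `(word, weight)` pairs. The helper name `parse_topic` and the regular expression are assumptions made for illustration. With a trained gensim model, `lda_model.show_topic(topic_id)` returns such pairs directly, so parsing is mainly useful when only the printed text is available.

```python
import re

# Matches terms of the form 0.010*"god" inside a printed topic string.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_str):
    """Parse a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_str)]


# First three terms of topic 53 from the output above.
topic = '0.010*"god" + 0.009*"hell" + 0.009*"edu"'
pairs = parse_topic(topic)
print(pairs)
```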

## Predict topic of new text

👇 You can now use your LDA model to predict the topic of a new text. First, use your vectorizer to vectorize the example. Then, use your LDA model to predict the topic of the vectorized example.

In [12]:
string_input = df['text'].iloc[100]
print(string_input)

# Tokenize the example the same way as the training data, then map it to a
# bag-of-words vector with the training dictionary
tokens = remove_stopwords(sent_to_words([string_input]))[0]
bow = id2word.doc2bow(tokens)

# lda_model[bow] returns (topic_id, probability) pairs; keep the most probable one
topic_id, prob = max(lda_model[bow], key=lambda pair: pair[1])
lda_model.print_topic(topic_id)

From: farenebt@logic.camp.clarkson.edu (Droopy)
Subject: Re: AHL Calder Cup Playoff preview
Organization: Clarkson University
Lines: 37
Nntp-Posting-Host: logic.clarkson.edu
X-Newsreader: TIN [version 1.1 PL8]

Daryl Turner (umturne4@ccu.umanitoba.ca) wrote:
: In article <1993Apr14.193524.25755@news.clarkson.edu> farenebt@craft.camp.clarkson.edu (Droopy) writes:
: >
: >ATLANTIC DIVISION
: >	
: >	ST JOHN'S MAPLE LEAFS VS MONCTON HAWKS
: >	MONCTON HAWKS
: >See CD Islanders. Moncton is a very similar team to CDI. Low scoring,
: >defensive, good goaltending. John Leblanc and Stu Barnes are the only
: >noticable guns on the team. But the defense is top notch and 
: >Mike O'Neill is the most underrated goalie in the league.
: >

: Bri, as I have tried to tell you since 2 February, Michael O'Neill
: might be the most underrated goalie in the AHL, but he ISN'T in the
: AHL.  He's on the Winnipeg Jets' injury list, as he has been since
: his first NHL start against the Ottawa Senators.  He's ou

'0.007*"edu" + 0.007*"would" + 0.007*"vs" + 0.006*"ahl" + 0.006*"hockey" + 0.005*"clarkson" + 0.004*"lines" + 0.004*"subject" + 0.004*"la" + 0.004*"ca"'
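
Indexing a gensim LDA model with a bag-of-words vector yields `(topic_id, probability)` pairs, and these are not guaranteed to be ordered by probability. A minimal sketch for ranking them is shown below. The helper name `rank_topics` is hypothetical and the probabilities are made-up values, not output from the model above.

```python
def rank_topics(doc_topics, top_k=3):
    """Return the top_k (topic_id, probability) pairs, most probable first."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical topic distribution for one document (values are made up).
doc_topics = [(26, 0.08), (61, 0.71), (94, 0.15), (53, 0.06)]
ranked = rank_topics(doc_topics)
best_topic_id, best_prob = ranked[0]
print(ranked)
```

Looking at the two or three strongest topics rather than only the first can help when a document mixes several themes.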