Topic Modelling:
For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in newsgroup_data. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable ldamodel. Extract 10 topics using corpus and id_map, and with passes=25 and random_state=34.

In [3]:
import pickle
# import gensim
from sklearn.feature_extraction.text import CountVectorizer
# from gensim.models import LdaModel
import gensim
from gensim import corpora
from gensim.models import LdaModel

import numpy as np
import nltk
nltk.download('punkt')
from nltk.corpus import wordnet as wn
import pandas as pd
nltk.data.path.append("assets/")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [13]:
# Convert the corpus in txt file into pickle
with open(r'/content/newsgroups.txt', 'r') as f:
    newsgroups = f.readlines()

# Save the corpus using pickle.dump()
with open('newsgroups.pkl', 'wb') as f:
    pickle.dump(newsgroups, f)



In [14]:
# testing the corpus content

with open(r'newsgroups.pkl', 'rb') as f:
    newsgroup_data = pickle.load(f)
len(newsgroup_data)


1000

In [17]:

def lda_topics():

    # Load the list of documents
    with open('newsgroups.pkl', 'rb') as f:
        newsgroup_data = pickle.load(f)

    # The CountVectorizor to find three letter tokens, remove stop_words, 
    # removing tokens that don't appear in at least 20 documents,
    # removing tokens that appear in more than 20% of the documents
    vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                           token_pattern='(?u)\\b\\w\\w\\w+\\b')
    # Fit and transform
    X = vect.fit_transform(newsgroup_data)

    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

    # Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
    id_map = dict((v, k) for k, v in vect.vocabulary_.items())
    
    ldamodel=LdaModel(corpus=corpus, num_topics=10, id2word=id_map, passes=15, random_state=42)

    # YOUR CODE HERE
    topics = ldamodel.print_topics(num_topics=10, num_words=10)
    # print(topics)
    return topics

In [18]:
lda_topics()

[(0,
  '0.039*"don" + 0.029*"know" + 0.023*"people" + 0.018*"think" + 0.016*"work" + 0.013*"science" + 0.013*"doesn" + 0.013*"way" + 0.013*"area" + 0.012*"look"'),
 (1,
  '0.025*"ground" + 0.020*"like" + 0.020*"people" + 0.019*"current" + 0.017*"does" + 0.016*"card" + 0.015*"used" + 0.014*"help" + 0.014*"use" + 0.014*"don"'),
 (2,
  '0.044*"drive" + 0.025*"just" + 0.024*"car" + 0.022*"good" + 0.015*"hard" + 0.015*"problem" + 0.013*"know" + 0.013*"like" + 0.012*"disk" + 0.011*"don"'),
 (3,
  '0.042*"god" + 0.028*"does" + 0.021*"believe" + 0.017*"nand" + 0.016*"nthat" + 0.015*"know" + 0.014*"nof" + 0.014*"posting" + 0.014*"true" + 0.013*"said"'),
 (4,
  '0.066*"space" + 0.055*"nasa" + 0.053*"data" + 0.035*"information" + 0.035*"available" + 0.028*"program" + 0.026*"use" + 0.017*"edu" + 0.015*"mail" + 0.014*"com"'),
 (5,
  '0.032*"year" + 0.030*"team" + 0.030*"think" + 0.022*"don" + 0.019*"better" + 0.017*"play" + 0.015*"like" + 0.015*"season" + 0.014*"just" + 0.014*"good"'),
 (6,
  '0.05

topic_names:

From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

In [19]:
def topic_names():

    # Define list of topic names
    list_topics = ['Health', 'Science', 'Automobiles', 'Politics', 'Government', 'Travel', 
                   'Computers & IT', 'Sports', 'Business', 'Society & Lifestyle', 'Religion', 'Education']

    # Load the list of documents
    with open('newsgroups.pkl', 'rb') as f:
        newsgroup_data = pickle.load(f)

    # Use CountVectorizor to find three letter tokens, remove stop_words, 
    # remove tokens that don't appear in at least 20 documents,
    # remove tokens that appear in more than 20% of the documents
    vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                           token_pattern='(?u)\\b\\w\\w\\w+\\b')
    # Fit and transform
    X = vect.fit_transform(newsgroup_data)

    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)



    # Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
    id_map = dict((v, k) for k, v in vect.vocabulary_.items())

    # Train LDA model
    ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=id_map, passes=15, random_state=42)

    # Convert to gensim corpus
    new_doc_corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

    # Get the topic distribution
    topic_distribution = ldamodel.get_document_topics(new_doc_corpus)

    # Create a list to store the topics found
    topics_found = []

    # Set a threshold for the probability
    threshold = 0.1

    # Check each topic probability against the threshold
    for topic, prob in topic_distribution[0]:
        if prob > threshold:
            topics_found.append(list_topics[topic])

    # Print the topics found
    # print(topics_found)
    # raise NotImplementedError()
    return topics_found

In [20]:
topic_names()

['Health', 'Government']

topic_distribution:

For the new document `new_doc`, find the topic distribution. Remember to use vect.transform on the the new doc, and Sparse2Corpus to convert the sparse matrix to gensim corpus.

*This function should return a list of tuples, where each tuple is `(#topic, probability)`*

In [21]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [36]:
def topic_distribution():
    vect = CountVectorizer( stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
    # Fit and transform
    X = vect.fit_transform(new_doc)
    # Convert sparse matrix to gensim corpus.
    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)

    # Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
    id_map = dict((v, k) for k, v in vect.vocabulary_.items())

    
    new_doc_transformed = vect.transform(new_doc)
    
    # Train LDA model
    ldamodel = LdaModel(corpus=corpus, num_topics=10, id2word=id_map, passes=15, random_state=42)

    # Convert to gensim corpus
    new_doc_corpus = gensim.matutils.Sparse2Corpus(new_doc_transformed, documents_columns=False)

    # Get the topic distribution
    topic_distribution = ldamodel.get_document_topics(new_doc_corpus)

    # Return list of tuples (#topic, probability)
    topic_distribution = [( topic, prob) for topic, prob in topic_distribution[0]]
    
    return topic_distribution

In [37]:
topic_distribution()

[(1, 0.9571427)]