# Topic Modeling with Gensim on Newsgroups

Jon Chun
28 Feb 2022

* https://colab.research.google.com/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb

**KEEP IN MIND**

Topic Modeling arose, in part, from Library Information Sciences and Information Retrieval. It was designed to identify latent or 'hidden' topics to facilitate searching large collections (Corpora) of individual Documents.

If you have one large text (Corpus), you must first segment it into semantically coherent units that serve as Documents. For example, the long novel *Middlemarch* could be segmented into Documents of one or more paragraphs, pages, or blocks of 1000 words.
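
As a minimal sketch of the fixed-size option, assuming whitespace-separated words are an adequate unit, the hypothetical helper `split_into_word_blocks` below cuts one long text into Documents of at most `block_size` words. The function name and its parameters are illustrative only and are not part of any library.

```python
def split_into_word_blocks(text, block_size=1000):
    """Split one long text into Documents of at most block_size words.

    Hypothetical helper for illustration; words are whitespace-separated.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + block_size])
            for i in range(0, len(words), block_size)]


# Tiny demonstration with a 3-word block size
sample_text = "one two three four five six seven"
documents = split_into_word_blocks(sample_text, block_size=3)
print(documents)  # ['one two three', 'four five six', 'seven']
```

Fixed-size blocks give Documents of uniform length, while splitting on paragraphs is more likely to keep each Document on one subject. Which works better depends on the text.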

# Configure this Jupyter Notebook

In [None]:
## Configure Jupyter Notebook

# Ignore warnings

import warnings
warnings.filterwarnings('ignore')

# Configure Jupyter

# Enable multiple outputs from one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from IPython.display import display
from IPython.display import Image
from ipywidgets import widgets, interactive

# Connect this Jupyter Notebook to Google gDrive permanent storage

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

Mounted at /gdrive
/gdrive


In [None]:
!pwd

/gdrive/MyDrive/iphs300/topic_modeling


In [None]:
# Create a project directory to hold IPHS300 Projects like this Topic Modeling
#   then cHANGE dIRECTORY into it with the command below

%cd ./MyDrive/iphs300/topic_modeling/

/gdrive/MyDrive/iphs300/topic_modeling


# Load Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

pd.set_option('display.max_colwidth', 100)  # use None for unlimited width

# Visualizing a Gensim model

To illustrate how to use [`pyLDAvis`](https://github.com/bmabey/pyLDAvis)'s gensim [helper functions](https://pyldavis.readthedocs.org/en/latest/modules/API.html#module-pyLDAvis.gensim) we will create a model from the [20 Newsgroup corpus](http://qwone.com/~jason/20Newsgroups/). Preprocessing is minimal, so the model is not the best. However, the goal of this notebook is to demonstrate the helper functions.

## Downloading the data

### Option (a): Manually copy the file to Google gDrive

```
Outside this Colab Jupyter Notebook, use any browser to download
your text file, then drag it into your gDrive ./MyDrive/iphs300/data folder.

Once the data file is there, it is visible from within this Colab Jupyter Notebook.
```

### Option (b): Grab an unprotected datafile with !wget

In [None]:
# Verify we're in the ./data directory
#   %cd into ./data if necessary

!pwd

/gdrive/MyDrive/iphs300/topic_modeling


In [None]:
%cd ../data

/gdrive/MyDrive/iphs300/data


In [None]:
!pwd

/gdrive/MyDrive/iphs300/data


In [None]:
# Get plain text (not HTML version) file from Gutenberg

# Marx's Manifesto is too short
# !wget https://www.gutenberg.org/cache/epub/61/pg61.txt 

# Benedetto Croce's commentary
!wget https://gutenberg.org/files/39653/39653-0.txt

--2022-02-28 19:51:56--  https://gutenberg.org/files/39653/39653-0.txt
Resolving gutenberg.org (gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to gutenberg.org (gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 351372 (343K) [text/plain]
Saving to: ‘39653-0.txt’


2022-02-28 19:51:57 (2.50 MB/s) - ‘39653-0.txt’ saved [351372/351372]



In [None]:
# Verify the file content by looking at the first 20 lines at the head of the file

#  There is some header/footer cruft you could delete with any text editor
#    (e.g. Notepad on Windows or TextEdit on macOS)

# !mv pg61.txt marx_manifesto.txt # Rename
!mv 39653-0.txt bcroce_histmat.txt

# !head -n 20 marx_manifesto.txt  # View first 20 lines
!head -n 20 bcroce_histmat.txt

﻿The Project Gutenberg EBook of Historical materialism and the economics of
Karl Marx, by Benedetto Croce

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Historical materialism and the economics of Karl Marx

Author: Benedetto Croce

Translator: C. M. Meredith

Release Date: May 8, 2012 [EBook #39653]

Language: English

Character set encoding: UTF-8
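
The header and footer cruft can also be removed in code rather than by hand. The sketch below assumes the file contains the usual `*** START OF ...` and `*** END OF ...` marker lines; the exact wording varies between Project Gutenberg files, so check yours before relying on it. The helper name `strip_gutenberg_boilerplate` and the regular expressions are hypothetical choices for illustration.

```python
import re

# Assumed marker patterns; the exact wording varies between Gutenberg files.
START_MARKER = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END_MARKER = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    Hypothetical helper: if a marker is missing, that side is left untouched.
    """
    start = START_MARKER.search(text)
    if start:
        text = text[start.end():]
    end = END_MARKER.search(text)
    if end:
        text = text[:end.start()]
    return text.strip()


# Small made-up sample that mimics the layout of a Gutenberg file
sample = (
    "Title: Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "First paragraph.\n\nSecond paragraph.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text...\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Applied to the full text of `bcroce_histmat.txt` after reading it, this would keep the licence and catalogue lines out of the Documents.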


In [None]:
# Go back to project root directory

%cd ..
!pwd

/gdrive/MyDrive/iphs300
/gdrive/MyDrive/iphs300


### Option (c): Be an OG and Use a Bash shell script

In [None]:
%%bash
mkdir -p data
pushd data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Let's take a look at the groups..."
ls 20news-bydate-train/
popd

/content/data /content
20news-bydate-test/
20news-bydate-test/alt.atheism/
20news-bydate-test/alt.atheism/53265
20news-bydate-test/alt.atheism/53339
20news-bydate-test/alt.atheism/53260
20news-bydate-test/alt.atheism/53340
20news-bydate-test/alt.atheism/53333
20news-bydate-test/alt.atheism/53302
20news-bydate-test/alt.atheism/53313
20news-bydate-test/alt.atheism/53293
20news-bydate-test/alt.atheism/53297
20news-bydate-test/alt.atheism/53315
20news-bydate-test/alt.atheism/53320
20news-bydate-test/alt.atheism/53324
20news-bydate-test/alt.atheism/53328
20news-bydate-test/alt.atheism/53325
20news-bydate-test/alt.atheism/53322
20news-bydate-test/alt.atheism/53326
20news-bydate-test/alt.atheism/53261
20news-bydate-test/alt.atheism/53327
20news-bydate-test/alt.atheism/53329
20news-bydate-test/alt.atheism/53321
20news-bydate-test/alt.atheism/53068
20news-bydate-test/alt.atheism/53338
20news-bydate-test/alt.atheism/53257
20news-bydate-test/alt.atheism/53262
20news-bydate-test/alt.atheism/53276


--2022-02-28 16:30:40--  http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
Resolving qwone.com (qwone.com)... 173.48.209.137
Connecting to qwone.com (qwone.com)|173.48.209.137|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14464277 (14M) [application/x-gzip]
Saving to: ‘20news-bydate.tar.gz’


## Exploring the dataset 

In [None]:
!pwd

/gdrive/MyDrive/iphs300


In [None]:
%cd /gdrive/MyDrive/iphs300

/gdrive/MyDrive/iphs300


In [None]:
# Look at datafiles

!ls ./data

bcroce_histmat.txt  marx_manifesto.txt


In [None]:
MIN_DOC_LEN = 5 # Min Document length, in characters

In [None]:
# Read Textfile(Corpus) and split into Paragraphs(Documents)

corpus_ls = []
corpus_clean_ls = []

textfile_name = './data/bcroce_histmat.txt'

parag_delimiter = "\n\n"
with open(textfile_name, "r") as file_ptr:
    all_content = file_ptr.read() #reading all the content in one step
    #using the string methods we split it
    corpus_ls = all_content.split(parag_delimiter)

print(f'There are {len(corpus_ls)} Documents extracted\n\nfrom the Corpus: {textfile_name}\n')

# Filter out Documents shorter than MIN_DOC_LEN characters
corpus_clean_ls = [adoc for adoc in corpus_ls if len(adoc) >= MIN_DOC_LEN]

# Length of each remaining Document, in characters
corpus_lengths_ls = [len(adoc) for adoc in corpus_clean_ls]

print(f'The longest Document is: {max(corpus_lengths_ls)}')
print(f'The shortest Document is: {min(corpus_lengths_ls)}')



There are 656 Documents extracted

from the Corpus: ./data/bcroce_histmat.txt

The longest Document is: 3605
The shortest Document is: 9


## Loading and tokenizing the corpus

In [None]:
!pip install funcy  # A collection of fancy functional tools focused on practicality.

Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Installing collected packages: funcy
Successfully installed funcy-1.17


In [None]:
from glob import glob
import re
import string

import funcy as fp

from gensim import models
from gensim.corpora import Dictionary, MmCorpus

import nltk

In [None]:
# quick and dirty....
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()
    
def tokenize(lines, token_size_filter=2):
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]
    

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename, errors='ignore') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}


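# glob paths are relative to the current working directory. If nothing matches,
# the DataFrame is empty and set_index raises a KeyError.
docs = pd.DataFrame(list(map(load_doc, glob('data/20news-bydate-train/*/*')))).set_index(['group','id'])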
docs.head()

KeyError: ignored

## Creating the dictionary, and bag of words corpus

In [None]:

def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=None, no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    stopwords = nltk_stopwords().union(additional_stopwords or set())
    # Look up ids only for stopwords that actually occur in the dictionary
    stopword_ids = [dictionary.token2id[w] for w in stopwords if w in dictionary.token2id]
    dictionary.filter_tokens(bad_ids=stopword_ids)
    dictionary.compactify()
    # Drop tokens that are too rare (no_below) or too common (no_above)
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus



In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
dictionary, corpus = prep_corpus(docs['tokens'])

Building dictionary...
Building corpus...
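
To make the two steps in `prep_corpus` more concrete, here is a library-free sketch of what a dictionary and a bag-of-words corpus contain, with document-frequency filtering in the spirit of `no_below` and `no_above`. It is an illustration only, not gensim's implementation, and the names `build_vocab`, `to_bow`, and `toy_docs` are hypothetical.

```python
from collections import Counter


def build_vocab(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id.

    A token is kept if it appears in at least no_below documents and in
    at most a fraction no_above of all documents.
    """
    doc_freq = Counter(token for doc in docs for token in set(doc))
    n_docs = len(docs)
    kept = sorted(t for t, df in doc_freq.items()
                  if df >= no_below and df / n_docs <= no_above)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [
    ["topic", "model", "topic"],
    ["model", "corpus"],
    ["model", "topic", "dictionary"],
]
toy_token2id = build_vocab(toy_docs, no_below=2, no_above=0.7)
toy_bows = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)  # {'topic': 0}
print(toy_bows)      # [[(0, 2)], [], [(0, 1)]]
```

In this toy example "model" is dropped because it appears in every document, and "corpus" and "dictionary" are dropped because each appears in only one. Word order is discarded: each Document becomes a list of token ids with counts.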


In [None]:
MmCorpus.serialize('newsgroups.mm', corpus)
dictionary.save('newsgroups.dict')

## Fitting the LDA model

In [None]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
                                      
lda.save('newsgroups_50_lda.model')

CPU times: user 3min 29s, sys: 2min 30s, total: 5min 59s
Wall time: 3min 13s


## Visualizing the model with pyLDAvis

Okay, the moment we have all been waiting for is finally here!  You'll notice in the visualization that we have a few junk topics that would probably disappear after better preprocessing of the corpus. This is left as an exercise for the reader. :)
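
As one possible starting point for that preprocessing, the sketch below drops the header block and quoted reply lines from a newsgroup post before tokenizing. It assumes headers end at the first blank line and quoted text starts with `>`; the helper name `strip_headers_and_quotes` is hypothetical, and whether this actually removes the junk topics would need to be checked by refitting the model.

```python
def strip_headers_and_quotes(lines):
    """Drop the header block and quoted reply lines from one post.

    Hypothetical helper: assumes headers end at the first blank line and
    quoted text starts with '>'. If there is no blank line, nothing is
    treated as a header.
    """
    body_start = 0
    for i, line in enumerate(lines):
        if not line.strip():
            body_start = i + 1
            break
    return [line for line in lines[body_start:]
            if not line.lstrip().startswith(">")]


# Made-up post in the same line-list form that readlines() returns
post = [
    "From: someone@example.com\n",
    "Subject: Re: hello\n",
    "\n",
    "> earlier message\n",
    "My actual reply.\n",
]
cleaned = strip_headers_and_quotes(post)
print(cleaned)  # ['My actual reply.\n']
```

In this notebook, such a function could be applied to `doc` inside `load_doc` before calling `tokenize`.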

In [None]:
!pip install pyldavis

Collecting pyldavis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
     |████████████████████████████████| 1.7 MB 5.1 MB/s 
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: pyldavis
  Building wheel for pyldavis (PEP 517) ... done
  Created wheel for pyldavis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136898 sha256=42611cafac27977893f4767dc754256d9ca9ef172f0b961f3ec04a2d712608c2
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
Successfully built pyldavis
Installing collected packages: pyldavis
Successfully installed pyldavis-3.3.1


In [None]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis



In [None]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)



## Fitting the HDP model

pyLDAvis can visualize gensim HDP models as well as LDA models.

The difference between HDP and LDA is that HDP is a non-parametric method, which means we don't need to specify the number of topics in advance. HDP infers the number of topics from the data, up to a truncation limit.
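
One way to build intuition for "no fixed number of topics" is the Chinese restaurant process, a standard description of the Dirichlet process on which HDP is built. The sketch below only illustrates that idea and is not gensim's HDP algorithm; the function name, the `alpha` value, and the seed are arbitrary choices.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table sizes under a Chinese restaurant process.

    Each customer joins an existing table with probability proportional
    to its size, or opens a new table with probability proportional to alpha.
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        # Total weight is n (customers already seated) plus alpha
        r = rng.uniform(0, n + alpha)
        cumulative = 0.0
        for t, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[t] += 1
                break
        else:
            table_sizes.append(1)  # open a new table (a new "topic")
    return table_sizes


for n in (10, 100, 1000):
    tables = chinese_restaurant_process(n, alpha=2.0)
    print(n, "customers ->", len(tables), "tables")
```

The number of tables is never passed in: it emerges from the data and tends to grow slowly as more customers arrive, which is the sense in which such models are called non-parametric.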

In [None]:
%%time
# The optional parameter T is the top-level truncation:
# HDP will find at most 50 topics.
hdp = models.hdpmodel.HdpModel(corpus, dictionary, T=50)
                                      
hdp.save('newsgroups_hdp.model')



CPU times: user 55.6 s, sys: 16.4 s, total: 1min 11s
Wall time: 1min 2s


## Visualizing the HDP model with pyLDAvis

As with the LDA model, preparing the visualization only requires passing the model, the corpus, and the associated dictionary.

In [None]:
vis_data = gensimvis.prepare(hdp, corpus, dictionary)
pyLDAvis.display(vis_data)

