### SSS - Extracting Topics from Text ###

*Super simple script (SSS) for summarising stuff*

Latent Dirichlet Allocation (LDA) is one of the most commonly used methods for extracting groups of topics from text. Latent Semantic Indexing (LSI) is another possibility. There are good posts on LDA and LSI on the interwebs, so I shall not go into the math or theory of these methods. Rather, I shall just demonstrate what LDA and LSI are good for - discovering groups of topics in a body of text (usually called the corpus).

An aside - this post provides a good introduction to LDA:
http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

There are also a number of threads on Quora on the difference between LDA and LSI, such as this one:
https://www.quora.com/Whats-the-difference-between-Latent-Semantic-Indexing-LSI-and-Latent-Dirichlet-Allocation-LDA

First, we need to find something interesting to run LDA/LSI on. Ideally, the larger the corpus, the more interesting the results. But as this is just a demo, we will use 'Ulysses' from the Project Gutenberg site.

In [1]:
source = 'https://www.gutenberg.org/files/4300/4300-0.txt'

Download the data with the 'requests' library, as we did in the other scripts.

In [2]:
import requests
import re
corpus = requests.get(source)

Using a combination of string functions and the 're' library, we strip the byte-order mark, replace whitespace characters such as newlines and tabs with spaces, and remove backslashes.

In [3]:
corpus_text = corpus.text.strip('\ufeff')
regex = re.compile(r'[\n\r\t]')
corpus_text=regex.sub(' ', corpus_text)
corpus_text = corpus_text.replace('\\', '')
corpus_text[:1000] # First 1000 characters

'  The Project Gutenberg EBook of Ulysses, by James Joyce    This eBook is for the use of anyone anywhere at no cost and with almost  no restrictions whatsoever. You may copy it, give it away or re-use  it under the terms of the Project Gutenberg License included with this  eBook or online at www.gutenberg.org      Title: Ulysses    Author: James Joyce    Release Date: August 1, 2008 [EBook #4300]  Last Updated: August 17, 2017    Language: English    Character set encoding: UTF-8    *** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***          Produced by Col Choat, and David Widger.            Ulysses    by James Joyce          — I —          [ 1 ]    Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of  lather on which a mirror and a razor lay crossed. A yellow dressinggown,  ungirdled, was sustained gently behind him on the mild morning air. He  held the bowl aloft and intoned:    —Introibo ad altare Dei.    Halted, he peered down the dark winding stairs and cal

For this exercise, we will use gensim to do LDA and LSI, so let's import the libraries first.

In [4]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS



get_tokens is just a simple function that tokenises the text we downloaded and strips out the stopwords (words like 'is', 'are' ...).

In [5]:
def get_tokens(text):
    return [t for t in simple_preprocess(text) if t not in STOPWORDS]
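To get a feel for what this step does without installing gensim, here is a rough standard-library approximation. It is only a sketch: rough_tokens and DEMO_STOPWORDS are hypothetical names, the stopword set is a tiny stand-in for gensim's much larger STOPWORDS, and the length limits are my assumption about simple_preprocess's defaults.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# gensim's STOPWORDS contains several hundred words.
DEMO_STOPWORDS = frozenset({'from', 'the', 'of', 'on', 'which', 'and'})

def rough_tokens(text, stopwords=DEMO_STOPWORDS, min_len=2, max_len=15):
    """Lowercase, keep alphabetic runs, drop short/long tokens and stopwords."""
    words = re.findall(r'[^\W\d_]+', text.lower())
    return [w for w in words
            if min_len <= len(w) <= max_len and w not in stopwords]

sample = ('Stately, plump Buck Mulligan came from the stairhead, '
          'bearing a bowl of lather')
demo = rough_tokens(sample)
print(demo)
```

Punctuation disappears, everything is lowercased, and 'a' is dropped for being shorter than two characters.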

In [6]:
tokens = get_tokens(corpus_text)

In [7]:
len(tokens)

132581

The first 25 tokens:

In [8]:
tokens[:25]

['project',
 'gutenberg',
 'ebook',
 'ulysses',
 'james',
 'joyce',
 'ebook',
 'use',
 'cost',
 'restrictions',
 'whatsoever',
 'copy',
 'away',
 'use',
 'terms',
 'project',
 'gutenberg',
 'license',
 'included',
 'ebook',
 'online',
 'www',
 'gutenberg',
 'org',
 'title']
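Notice that these first 25 tokens all come from the Project Gutenberg licence header rather than from the novel. A minimal sketch for trimming that boilerplate before tokenising is shown below. The START marker appears in the text printed earlier; the matching END marker is my assumption about how the file closes, and strip_gutenberg_boilerplate is a hypothetical helper name.

```python
import re

# The START pattern is based on the header shown earlier; the END pattern
# is an assumption about the footer of the same file.
START_RE = re.compile(
    r'\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*')
END_RE = re.compile(
    r'\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*')

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, if present."""
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

sample = ('Licence header '
          '*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES *** '
          'Stately, plump Buck Mulligan '
          '*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES *** '
          'licence footer')
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If neither marker is found, the function returns the text unchanged (apart from surrounding whitespace), so it should be safe to apply to corpus_text before calling get_tokens.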

We then convert the tokens into a bag of words - basically preparing the text so that we can feed it into the LDA/LSI model.

In [9]:
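# Treat the whole book as a single document (a list holding one token list)
dictionary = gensim.corpora.Dictionary([tokens])
# Note: this rebinds 'corpus', which previously held the HTTP response
corpus = [dictionary.doc2bow(doc) for doc in [tokens]]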
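For intuition, a bag of words is just a list of (token id, count) pairs for each document. The pure-Python sketch below mimics that idea; it is not gensim's implementation, build_vocab and to_bow are hypothetical helper names, and the ids gensim assigns may well differ from the ones produced here.

```python
from collections import Counter

def build_vocab(documents):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert one tokenised document into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [['project', 'gutenberg', 'ebook', 'ulysses', 'ebook']]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bow)
```

Word order is thrown away; only how often each word occurs in each document is kept, which is all that LDA and LSI look at.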

Now to build the LDA model for this text. The main parameters are:
- num_topics - number of topic groups we want
- update_every - 0 for batch learning (the model is updated once per pass over the whole corpus); 1 for online learning (the model is updated incrementally as chunks of documents are processed)
- passes - number of passes through the data

In [10]:
lda_model = gensim.models.LdaModel(corpus, 
                                   num_topics=5, 
                                   id2word=dictionary, 
                                   update_every=0,
                                   passes=50)

In [11]:
lda_model.show_topics(5)

[(0,
  '0.000*"bloom" + 0.000*"said" + 0.000*"mr" + 0.000*"like" + 0.000*"old" + 0.000*"stephen" + 0.000*"time" + 0.000*"says" + 0.000*"man" + 0.000*"eyes"'),
 (1,
  '0.000*"said" + 0.000*"bloom" + 0.000*"like" + 0.000*"mr" + 0.000*"says" + 0.000*"stephen" + 0.000*"man" + 0.000*"old" + 0.000*"yes" + 0.000*"eyes"'),
 (2,
  '0.000*"said" + 0.000*"like" + 0.000*"bloom" + 0.000*"mr" + 0.000*"stephen" + 0.000*"old" + 0.000*"says" + 0.000*"man" + 0.000*"yes" + 0.000*"hand"'),
 (3,
  '0.000*"said" + 0.000*"bloom" + 0.000*"like" + 0.000*"mr" + 0.000*"stephen" + 0.000*"old" + 0.000*"know" + 0.000*"yes" + 0.000*"says" + 0.000*"man"'),
 (4,
  '0.009*"said" + 0.007*"bloom" + 0.005*"like" + 0.005*"mr" + 0.004*"stephen" + 0.004*"old" + 0.003*"says" + 0.003*"man" + 0.003*"time" + 0.003*"yes"')]

Next, we build the LSI model on the same corpus.

In [12]:
lsi_model = gensim.models.LsiModel(corpus, 
                                   id2word=dictionary)

In [13]:
lsi_model.print_topics()

[(0,
  '0.376*"said" + 0.311*"bloom" + 0.227*"like" + 0.224*"mr" + 0.178*"stephen" + 0.153*"old" + 0.147*"says" + 0.140*"man" + 0.118*"time" + 0.112*"yes"')]

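The topics generated here are obviously not satisfactory: the five LDA topics contain nearly identical words, and LSI returns only a single topic. The text body is small, and a likely bigger cause is that the whole book was fed in as one document, so the models have no document-to-document variation to learn from. Still, the simple script shows how easy it would be to do this for any other collection of text.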
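Since [tokens] in cell 9 holds the entire novel as one document, one option worth trying is to split the token stream into fixed-size pseudo-documents first. The sketch below shows the idea; chunk_tokens is a hypothetical helper, and the chunk size is an arbitrary assumption rather than a tuned value.

```python
def chunk_tokens(tokens, chunk_size=1000):
    """Split a flat token list into consecutive pseudo-documents."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens), chunk_size)]

demo_tokens = ['stately', 'plump', 'buck', 'mulligan',
               'came', 'stairhead', 'bearing']
demo_docs = chunk_tokens(demo_tokens, chunk_size=3)
print(demo_docs)
```

With the 132,581 tokens above and a chunk size of 1,000, this would give 133 pseudo-documents, which could be passed to gensim.corpora.Dictionary and doc2bow in place of [tokens]. Splitting on the book's own chapter or episode boundaries may be a more natural choice. I have not verified how much either approach improves the topics.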

We must, however, understand that these methods can only show us collections of words that tend to occur together and can be grouped as topics. We will still have to eyeball them to see if they make sense, and decide how to interpret the words in each of these clusters.
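The eyeballing can be helped along a little. The sketch below parses two of the LDA topic strings printed above into (word, weight) pairs and measures how much their top words overlap. parse_topic and jaccard are hypothetical helper names, and the regular expression assumes the weight*"word" format seen in the output above.

```python
import re

# Matches terms of the form 0.009*"said", as printed by show_topics above
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in TERM_RE.findall(topic_string)]

def jaccard(words_a, words_b):
    """Share of distinct words that the two word lists have in common."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Topics 0 and 1, copied from the LDA output above
topic_0 = ('0.000*"bloom" + 0.000*"said" + 0.000*"mr" + 0.000*"like" + '
           '0.000*"old" + 0.000*"stephen" + 0.000*"time" + 0.000*"says" + '
           '0.000*"man" + 0.000*"eyes"')
topic_1 = ('0.000*"said" + 0.000*"bloom" + 0.000*"like" + 0.000*"mr" + '
           '0.000*"says" + 0.000*"stephen" + 0.000*"man" + 0.000*"old" + '
           '0.000*"yes" + 0.000*"eyes"')

words_0 = [word for word, _ in parse_topic(topic_0)]
words_1 = [word for word, _ in parse_topic(topic_1)]
overlap = jaccard(words_0, words_1)
print(round(overlap, 3))
```

The two topics share 9 of their 11 distinct top words, which is a quick numerical signal that they are not really separate topics.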