<a href="https://colab.research.google.com/github/popelucha/NLP-notebooks/blob/main/Topic_Modeling_for_20ng_with_gensim.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling with `gensim`

In [None]:
import re
import tarfile
import itertools
import gensim
import pprint
import logging
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary, MmCorpus

## Get the data
Download the 20 Newsgroups texts, which are categorized into 20 groups by topic. Later in the notebook we convert the texts to a Matrix Market corpus (see https://radimrehurek.com/gensim/corpora/mmcorpus.html for more details).

In [None]:
!wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz -P data

--2023-10-30 16:43:47--  http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
Resolving qwone.com (qwone.com)... 173.48.205.131
Connecting to qwone.com (qwone.com)|173.48.205.131|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14464277 (14M) [application/x-gzip]
Saving to: ‘data/20news-bydate.tar.gz’


2023-10-30 16:43:48 (9.80 MB/s) - ‘data/20news-bydate.tar.gz’ saved [14464277/14464277]



## Preprocessing

We need to extract the email body from each raw message.
There are several ways to do that.
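
One alternative to the block-splitting approach used in this notebook is Python's standard `email` module, which separates the header block from the body. The following is only a minimal sketch: the helper name `extract_body` and the sample message are made up for illustration, and unlike `process_message` it does not strip quoted text or signatures.

```python
import email

def extract_body(raw_message: str) -> str:
    """Return the message text with the leading header block removed."""
    msg = email.message_from_string(raw_message)
    payload = msg.get_payload()
    # For a non-multipart message, get_payload() returns the body as a string.
    return payload if isinstance(payload, str) else ""

# Hypothetical sample message: two headers, a blank line, then the body.
raw = (
    "From: alice@example.com\n"
    "Subject: test\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "Second paragraph.\n"
)
body = extract_body(raw)
print(body)
```

The raw newsgroup files are bytes, so in practice they would need decoding first (the notebook uses `latin1`).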

In [None]:
def process_message(message, skip_top=1, skip_bottom=1,
                    re_regex=r'^\s*[a-zA-Z]*>\s.*',
                    ignore_regex=r'^.*[@0-9].* (wrote|writes).*:'):
    """
    Preprocess a single 20newsgroups message, returning the result as
    a unicode string.

    """
    message = gensim.utils.to_unicode(message, 'latin1').strip()
    blocks = message.split(u'\n\n')
    # skip email headers (first `skip_top` blocks) and footer (last `skip_bottom` blocks);
    # the explicit end index keeps skip_bottom=0 from producing an empty slice
    end = max(len(blocks) - skip_bottom, 0)
    reduced_blocks = []
    for block in blocks[skip_top:end]:
        lines = block.split('\n')
        # drop the whole block if any of its lines is quoted text (e.g. starts with '> ')
        if not any(re.match(re_regex, l) for l in lines):
            # within kept blocks, drop attribution lines such as '... writes:'
            reduced_blocks.append('\n'.join(l for l in lines if not re.match(ignore_regex, l)))
    content = u'\n\n'.join(reduced_blocks)
    return content

def iter_20newsgroups(fname, log_every=None):
    """
    Yield plain text of each 20 newsgroups message, as a unicode string.

    The messages are read from raw tar.gz file `fname` on disk (e.g. `./data/20news-bydate.tar.gz`)

    """
    extracted = 0
    with tarfile.open(fname, 'r:gz') as tf:
        for file_number, file_info in enumerate(tf):
            if file_info.isfile():
                if log_every and extracted % log_every == 0:
                    logging.info("extracting 20newsgroups file #%i: %s" % (extracted, file_info.name))
                content = tf.extractfile(file_info).read()
                yield (file_number, file_info, process_message(content))
                extracted += 1
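
To see what the two default patterns of `process_message` match, the sketch below applies them to a few made-up lines. The names `QUOTE_REGEX`, `ATTRIBUTION_REGEX`, and `classify`, and the sample lines, are illustrative assumptions; only the two pattern strings are copied from the function above.

```python
import re

# Default patterns copied from process_message
QUOTE_REGEX = r'^\s*[a-zA-Z]*>\s.*'
ATTRIBUTION_REGEX = r'^.*[@0-9].* (wrote|writes).*:'

# Hypothetical sample lines
sample_lines = [
    "> I think this is quoted text",
    "JB> quoted with initials",
    "In article <123@foo.com> bob@foo.com writes:",
    "This is my own reply.",
]

def classify(line):
    """Label a line the way the two patterns would treat it."""
    if re.match(QUOTE_REGEX, line):
        return "quote"
    if re.match(ATTRIBUTION_REGEX, line):
        return "attribution"
    return "text"

labels = [classify(line) for line in sample_lines]
for line, label in zip(sample_lines, labels):
    print(f"{label:12} {line}")
```

Note that `process_message` discards an entire block (paragraph) as soon as one of its lines is a quote, whereas attribution lines are removed individually.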

In [None]:
EMAIL_REGEX = re.compile(r"[a-zA-Z0-9\.\+_-]+@[a-zA-Z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-zA-Z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, " "), (FILTER_REGEX, ' ')]

def load_doc(file_number, file_info, text):
    # strip email addresses and non-alphabetic characters, then tokenize:
    # lowercase, drop stopwords and tokens shorter than 3 characters
    for regexp, replacement in TOKEN_MAPPINGS:
        text = regexp.sub(replacement, text)
    words = gensim.utils.tokenize(text, lower=True)
    tokenized = [w for w in words if w not in STOPWORDS and len(w) > 2]
    # the path inside the archive ends with <group>/<doc_id>
    group, doc_id = file_info.name.split('/')[-2:]
    return {'group': group,
            'doc': text,
            'tokens': tokenized,
            'id': doc_id}


def get_docs(filename):
    docs = []
    for file_number, file_info, text in iter_20newsgroups(filename):
        docs.append(load_doc(file_number, file_info, text))

    return docs

docs = get_docs('./data/20news-bydate.tar.gz')

# print the tokens of the first two messages
#print([d['tokens'] for d in docs[:2]])
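
The effect of the two cleaning patterns can be checked on a single made-up sentence. This sketch uses only `re`; the helper `clean` and the sample string are assumptions for illustration, and the whitespace split is a rough stand-in for the notebook's `gensim.utils.tokenize` step, which may split further (for example at apostrophes).

```python
import re

# Patterns copied from the cell above
EMAIL_REGEX = re.compile(r"[a-zA-Z0-9\.\+_-]+@[a-zA-Z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-zA-Z '#]")

def clean(text):
    """Replace email addresses, then every disallowed character, with a space."""
    for regexp in (EMAIL_REGEX, FILTER_REGEX):
        text = regexp.sub(" ", text)
    return text

sample = "Contact bob@example.com, it's 42% done!"
cleaned = clean(sample)
# Rough stand-in for tokenization: lowercase and split on whitespace
rough_tokens = cleaned.lower().split()
print(rough_tokens)
```

The address, the digits, and the punctuation disappear, while the apostrophe survives because `'` is in the allowed character set.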

In [None]:
def prep_corpus(docs, additional_stopwords=None, no_below=5, no_above=0.5):
    # additional_stopwords is currently unused; it is kept as a hook for experiments
    print('Building dictionary...')
    dictionary = Dictionary([d['tokens'] for d in docs])
    # you can play with this: uncomment to drop rare and very frequent tokens
    # dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(d['tokens']) for d in docs]
    return dictionary, corpus

dictionary, corpus = prep_corpus(docs)

MmCorpus.serialize('newsgroups.mm', corpus)

dictionary.save('newsgroups.dict')

Building dictionary...
Building corpus...
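
The commented-out `filter_extremes` call in `prep_corpus` prunes the vocabulary by document frequency. The sketch below mimics the idea in plain Python, assuming `no_below` is a minimum number of documents and `no_above` a maximum fraction of documents. The function `filter_vocabulary` and the toy documents are hypothetical, and gensim's own rounding of the upper bound may differ.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus: "the" is in every document, "drive" in half of them
toy_docs = [
    ["the", "drive", "scsi"],
    ["the", "god"],
    ["the", "drive"],
    ["the", "orbit"],
]
kept = filter_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(kept)
```

Here `the` is dropped as too frequent, the words seen in a single document are dropped as too rare, and only `drive` remains.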


## Check the corpus
Get familiar with the Matrix Market (MM) corpus and its features.

In [None]:
len(corpus)
#corpus[3]

18846

In [None]:
dictionary[1]

'allowing'

In [None]:

#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load id->word mapping (the dictionary) saved above with dictionary.save
id2word = gensim.corpora.Dictionary.load('newsgroups.dict')

# load corpus iterator
mm = gensim.corpora.MmCorpus('newsgroups.mm')

## Corpus Parameters

In [None]:
len(mm) # num of documents

18846

In [None]:
len(id2word) # vocabulary size

74908

In [None]:
id2word[113] # example mapping between word id and words

'moslems'

In [None]:
[(i, len(mm[i])) for i in range(0,20)] # number of distinct tokens in each of the first 20 documents

[(0, 71),
 (1, 16),
 (2, 65),
 (3, 92),
 (4, 54),
 (5, 82),
 (6, 77),
 (7, 163),
 (8, 33),
 (9, 29),
 (10, 53),
 (11, 239),
 (12, 208),
 (13, 206),
 (14, 74),
 (15, 78),
 (16, 12),
 (17, 103),
 (18, 22),
 (19, 2)]

In [None]:
mm[19] # example document as (token id, count) pairs; these are raw counts, not TF-IDF

[(1223, 1.0), (1224, 1.0)]

In [None]:
mm[0][:10]

[(0, 1.0),
 (1, 1.0),
 (2, 1.0),
 (3, 1.0),
 (4, 1.0),
 (5, 1.0),
 (6, 2.0),
 (7, 1.0),
 (8, 1.0),
 (9, 1.0)]
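
Each document in the corpus is a list of `(token id, count)` pairs, as in the output above. The toy sketch below reproduces that representation without gensim. The helpers `build_vocab` and `to_bow` are hypothetical, and the id assignment (global alphabetical order) is a simplification rather than a description of how `Dictionary` numbers tokens.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in alphabetical order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_bow(tokens, token2id):
    """Turn a token list into sorted (token id, count) pairs."""
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], float(n)) for tok, n in counts.items())

toy_docs = [["speak", "hear"], ["belief", "believe", "belief"]]
token2id = build_vocab(toy_docs)
bows = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bows)
```

Word order inside a document is lost; only which tokens occur, and how often, is kept.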

In [None]:
[id2word[t[0]] for t in mm[19]]

['hear', 'speak']

In [None]:
[id2word[t[0]] for t in mm[0][:10]]

['agree',
 'allowing',
 'atheist',
 'bad',
 'baggage',
 'beaten',
 'belief',
 'believe',
 'believed',
 'believing']

In [None]:
docs[0]['group']

'alt.atheism'

In [None]:
docs[19]

{'group': 'alt.atheism',
 'doc': "      Could you speak up  I can't hear you      ",
 'tokens': ['speak', 'hear'],
 'id': '53321'}

## LSA Topics

In [None]:
# extract 20 LSA topics; use the default one-pass algorithm
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=20)
pprint.pprint(lsa.show_topics())

[(0,
  '0.305*"jpeg" + 0.259*"file" + 0.237*"dos" + 0.229*"image" + 0.161*"use" + '
  '0.157*"available" + 0.136*"ftp" + 0.134*"version" + 0.133*"windows" + '
  '0.132*"graphics"'),
 (1,
  '0.788*"dos" + 0.277*"windows" + -0.190*"jpeg" + 0.149*"microsoft" + '
  '-0.112*"file" + 0.111*"tcp" + -0.109*"image" + 0.088*"mouse" + -0.080*"gif" '
  '+ 0.077*"amiga"'),
 (2,
  '-0.394*"jpeg" + 0.233*"people" + 0.200*"said" + -0.192*"image" + '
  '0.187*"know" + 0.176*"think" + -0.160*"gif" + 0.152*"god" + '
  '0.136*"president" + 0.122*"going"'),
 (3,
  '0.439*"jpeg" + 0.191*"dos" + -0.185*"pub" + -0.177*"edu" + 0.167*"gif" + '
  '-0.156*"data" + -0.150*"available" + -0.150*"ftp" + 0.141*"people" + '
  '0.134*"said"'),
 (4,
  '-0.419*"god" + -0.408*"jehovah" + -0.322*"lord" + -0.307*"elohim" + '
  '-0.189*"christ" + -0.177*"jesus" + 0.159*"stephanopoulos" + '
  '0.154*"president" + -0.153*"father" + -0.134*"mcconkie"'),
 (5,
  '-0.953*"max" + -0.155*"giz" + -0.116*"bhj" + -0.091*"rlk" + -0.077*"

## LDA Topics

In [None]:
# extract 20 LDA topics, using 1 pass and updating once every 1 chunk (5,000 documents)
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=20, update_every=1, chunksize=5000, passes=1)
pprint.pprint(lda.show_topics())



[(8,
  '0.024*"jews" + 0.007*"jewish" + 0.006*"van" + 0.004*"cal" + 0.004*"german" '
  '+ 0.004*"det" + 0.004*"chi" + 0.004*"bos" + 0.004*"nazi" + 0.003*"tor"'),
 (4,
  '0.016*"president" + 0.016*"myers" + 0.011*"encryption" + 0.009*"april" + '
  '0.008*"chip" + 0.008*"clipper" + 0.007*"nuclear" + 0.006*"secretary" + '
  '0.005*"administration" + 0.005*"mission"'),
 (9,
  '0.017*"orbit" + 0.015*"file" + 0.006*"gif" + 0.006*"officers" + '
  '0.006*"keyboard" + 0.005*"use" + 0.005*"available" + 0.004*"jpeg" + '
  '0.004*"good" + 0.004*"know"'),
 (11,
  '0.014*"people" + 0.008*"think" + 0.007*"god" + 0.007*"know" + 0.006*"like" '
  '+ 0.005*"right" + 0.005*"time" + 0.005*"believe" + 0.004*"government" + '
  '0.004*"said"'),
 (18,
  '0.011*"key" + 0.008*"president" + 0.007*"congress" + 0.007*"keys" + '
  '0.006*"government" + 0.005*"law" + 0.005*"federal" + 0.005*"ripem" + '
  '0.004*"public" + 0.004*"encryption"'),
 (15,
  '0.005*"time" + 0.005*"know" + 0.004*"people" + 0.004*"file" + 0.0

## Coherence Score
Resources about different coherence measures: https://datascience.oneoffcoder.com/topic-modeling-gensim.html

Overview of coherence score implementations: https://github.com/dice-group/Palmetto/wiki/Coherences


In [None]:
# compute coherence for LDA model
cm = gensim.models.coherencemodel.CoherenceModel(model=lda, corpus=mm, coherence='u_mass')
print(cm.get_coherence())

-2.818568989230388
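
The `u_mass` score is negative because it averages logarithms of conditional co-occurrence probabilities, which are at most 1; values closer to zero indicate top words that tend to appear in the same documents. The sketch below is a simplified, gensim-free version of that idea. The function `umass_coherence`, the smoothing constant `eps`, and the toy documents are assumptions for illustration, and gensim's implementation may differ in detail.

```python
import math

def umass_coherence(top_words, documents, eps=1e-12):
    """Mean of log((P(w_i, w_j) + eps) / P(w_j)) over all pairs where
    w_j is ranked above w_i; probabilities are document frequencies.
    Every top word must occur in at least one document."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def doc_prob(*words):
        # fraction of documents containing all the given words
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_prob(w_i, w_j) + eps) / doc_prob(w_j)))
    return sum(scores) / len(scores)

# Toy documents
toy_docs = [
    ["drive", "scsi", "disk"],
    ["drive", "scsi"],
    ["god", "jesus"],
    ["god", "drive"],
]
coherent = umass_coherence(["drive", "scsi"], toy_docs)
mixed = umass_coherence(["god", "scsi"], toy_docs)
print(coherent, mixed)
```

Words that share documents ("drive", "scsi") score close to zero, while words that never co-occur ("god", "scsi") receive a strongly negative score.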


In [None]:
list(zip(range(0,lda.num_topics), cm.get_coherence_per_topic()))

[(0, -2.669009128165177),
 (1, -2.2662126814868078),
 (2, -1.7071216796604614),
 (3, -2.5110169135291804),
 (4, -4.022165491082864),
 (5, -1.9997254246564846),
 (6, -2.6560061957845256),
 (7, -3.7302674538756833),
 (8, -6.455294311018299),
 (9, -2.4593399161418765),
 (10, -4.0183780000732945),
 (11, -1.7910597232866061),
 (12, -2.5173883307694283),
 (13, -2.0109091990848884),
 (14, -2.3799541261237227),
 (15, -1.9927138109293876),
 (16, -3.2291364607343067),
 (17, -2.6121061058806294),
 (18, -2.9944324703494636),
 (19, -2.349142361974658)]

In [None]:
lda.show_topic(3)

[('people', 0.00803572),
 ('time', 0.0072467527),
 ('know', 0.00611841),
 ('like', 0.0059714084),
 ('year', 0.0047999495),
 ('think', 0.0047546597),
 ('said', 0.0047238134),
 ('going', 0.0046848585),
 ('president', 0.0045393007),
 ('scripture', 0.004043937)]

In [None]:
lda.show_topic(1)

[('drive', 0.008777292),
 ('scsi', 0.008715821),
 ('like', 0.0066580456),
 ('use', 0.0058125956),
 ('hard', 0.0047446),
 ('need', 0.0046949787),
 ('thanks', 0.004437542),
 ('know', 0.0043188017),
 ('mac', 0.0042235204),
 ('drives', 0.0038375352)]

# Assignment

The task is to find the optimal preprocessing and the optimal number of topics.

1. Check the preprocessing. If you change something, describe it in `YOUR_FILE`.

1. Experiment with different numbers of topics for LDA. You can try to find the optimal number using the coherence score (try different numbers and check the coherence). HINT: For 20 newsgroups, 20 topics are not enough.

1. Compare the LDA topics obtained with a good number of topics against the actual topics of the texts (the group in which each text was published).

1. Describe how the topics fit into the original categories. For example, there can be topics "virus - bacteria" and "allergy immune" in the `sci.med` group.

1. Try to manually assign the topics to particular newsgroups. Are all the words good descriptions of the topic? E.g., "allergy immune" are good words for the `sci.med` group, while "image" does not make much sense there.

1. OPTIONAL: Use a method to assign a topic name. You can use Wikipedia search or ChatGPT (or something else) to find a common term for the top K topic words.

1. Describe your observations in `YOUR_FILE`.
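
One possible way to compare topics with newsgroups is to count, for each group, how often each topic is the dominant topic of its documents. The sketch below assumes you already have, per document, its group and a list of `(topic id, probability)` pairs. The helpers `dominant_topic` and `topics_by_group` and the toy data are hypothetical, not part of the notebook.

```python
from collections import Counter, defaultdict

def dominant_topic(topic_probs):
    """Return the topic id with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]

def topics_by_group(groups, doc_topic_probs):
    """Count, for every group, how often each topic dominates a document."""
    table = defaultdict(Counter)
    for group, probs in zip(groups, doc_topic_probs):
        if probs:  # skip documents without any topic assignment
            table[group][dominant_topic(probs)] += 1
    return table

# Toy input: one group label and one topic distribution per document
groups = ["sci.med", "sci.med", "comp.graphics"]
doc_topic_probs = [
    [(3, 0.7), (5, 0.3)],
    [(3, 0.6), (1, 0.4)],
    [(5, 0.9)],
]
table = topics_by_group(groups, doc_topic_probs)
for group, counts in sorted(table.items()):
    print(group, counts.most_common(3))
```

With the notebook's objects, the group labels could come from `docs[i]['group']` and the pairs from `lda.get_document_topics(mm[i])`, assuming `docs` and `mm` are in the same order.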
