---
title: "Simple topic identification"
format:
  html:
    code-fold: true
jupyter: python3
author: "kakamana"
date: "2023-03-24"
categories: [python, datacamp, machine learning, nlp, tf-idf]
image: "simpleTopicIdentification.png"

---

# Simple topic identification

By using basic NLP models, you will be able to identify topics from any text you encounter in the wild. You will experiment and compare two simple methods: bag-of-words and Tf-idf using NLTK, as well as a new library called Gensim.

This **Simple topic identification** is part of [Datacamp course: Introduction to Natural Language Processing in Python] You will learn the basics of natural language processing (NLP), such as how to identify and separate words, how to extract topics from a text, and how to construct your own fake news classifier. As part of this course, you will also learn how to use basic libraries such as NLTK as well as libraries that utilize deep learning to solve common NLP problems. The purpose of this course is to provide you with the foundation for processing and parsing text as you progress through your Python learning journey.

This is my learning experience of data science through DataCamp. These repository contributions are part of my learning journey through my graduate program masters of applied data sciences (MADS) at University Of Michigan, [DeepLearning.AI], [Coursera] & [DataCamp]. You can find my similar articles & more stories at my [medium] & [LinkedIn] profile. I am available at [kaggle] & [github blogs] & [github repos]. Thank you for your motivation, support & valuable feedback.

These include projects, coursework & notebook which I learned through my data science journey. They are created for reproducible & future reference purpose only. All source code, slides or screenshot are intellactual property of respective content authors. If you find these contents beneficial, kindly consider learning subscription from [DeepLearning.AI Subscription], [Coursera], [DataCamp]



[DeepLearning.AI]: https://www.deeplearning.ai
[DeepLearning.AI Subscription]: https://www.deeplearning.ai
[Coursera]: https://www.coursera.org
[DataCamp]: https://www.datacamp.com
[medium]: https://medium.com/@kamig4u
[LinkedIn]: https://www.linkedin.com/in/asadenterprisearchitect
[kaggle]: https://www.kaggle.com/kakamana
[github blogs]: https://kakamana.github.io
[github repos]: https://github.com/kakamana
[Datacamp course: Introduction to Natural Language Processing in Python]: (https://app.datacamp.com/learn/courses/introduction-to-natural-language-processing-in-python)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import re
from nltk.tokenize import word_tokenize, sent_tokenize

# Word counts with bag-of-words

* Bag-of-words
    * Basic method for finding topics in a text
    * Need to first create tokens using tokenization
    *  ... and then count up all the tokens
    * The more frequent a word, the more important it might be
    * Can be a great way to determine the significant words in a text


## Building a Counter with bag-of-words

In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as article_title. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

In [2]:
with open('dataset/Wikipedia articles/wiki_text_debugging.txt', 'r') as file:
    article = file.read()
    article_title = word_tokenize(article)[2]

In [3]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))


[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 69), ('to', 63), ('a', 60), ('``', 47), ('in', 44), ('and', 41)]


# Simple text preprocessing

* preprocessing
    * Helps make for better input data
        * When performing machine learning or other statistical methods
    * Examples
        * Tokenization to create a bag of words
        * Lowercasing words
    * Lemmatization / Stemming
        * Shorten words to their root stems
    * Removing stop words, punctuation, or unwanted tokens


## Text preprocessing practice

It is now your turn to apply the techniques you have learned to help clean up text for better NLP results by removing stop words and non-alphabetic characters, lemmatizing, and performing a new bag-of-words on your cleaned text.

In [4]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dghr201\AppData\Roaming\nltk_data...


True

In [5]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\dghr201\AppData\Roaming\nltk_data...


True

In [6]:
with open('dataset/english_stopwords.txt', 'r') as file:
    english_stops = file.read()

In [7]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('debugging', 39), ('system', 25), ('bug', 17), ('software', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('debugger', 13)]


# Introduction to gensim

* gensim
    * Popular open-source NLP library
    * Uses top academic models to perform complex tasks
        * Building document or word vectors
        * Performing topic identification and document comparison

![](simpleTopic-1.png)

## Creating and querying a corpus with gensim

It's time to apply the methods you learned in the previous video to create your first gensim dictionary and corpus!

You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called articles. You'll need to do some light preprocessing and then generate the gensim dictionary and corpus.

In [27]:
import glob

path_list = glob.glob('dataset/wikipedia_articles/*.txt')
articles = []
for article_path in path_list:
    article = []
    with open(article_path, encoding="utf-8") as file:
        a = file.read()
    tokens = word_tokenize(a)
    lower_tokens = [t.lower() for t in tokens]

    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]

    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in english_stops]
    articles.append(no_stops)

In [28]:
print(articles)



In [30]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a MmCorpus: corpus
corpus = [dictionary.doc2bow(article) for article in articles]
#print(corpus)

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])


computer
[(1, 1), (13, 1), (15, 1), (18, 1), (26, 1), (29, 1), (37, 1), (38, 4), (47, 2), (48, 7)]


## Gensim bag-of-words

Now, you'll use your new gensim corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms

In [31]:
from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

debugging 39
system 19
software 16
tools 14
computer 12
computer 597
software 451
cite 322
ref 259
code 235


# Tf-idf with gensim
![](simpleTopic-2.png)

In [32]:
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

[(1, 0.012368676656974298), (13, 0.015622064172391845), (15, 0.019603888684849375), (18, 0.012368676656974298), (26, 0.019603888684849375)]


In [33]:
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

wolf 0.22170620999398985
debugging 0.2002051205348696
fence 0.17736496799519189
debugger 0.13605544322671728
squeeze 0.13302372599639392
