# Simple topic identification

This chapter introduces topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and tf-idf, using NLTK and a new library, Gensim.

### Building a Counter with bag-of-words

In this exercise, you'll build your first bag-of-words counter in this course using a Wikipedia article, which is loaded as `article`. Try building the bag-of-words without looking at the full article text, and guess what the topic is! If you'd like to peek at the title at the end, we've included it as `article_title`. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

In [5]:
import wikipedia
from collections import Counter
from nltk.tokenize import word_tokenize

# Load an article from Wikipedia
article = wikipedia.page("Debugging").content

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [token.lower() for token in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))



[('the', 155), (',', 144), ('.', 132), ('of', 79), ('to', 64), ('a', 63), ('and', 52), ('debugging', 51), ('in', 44), ('(', 37)]


### Text Pre-processing

These pre-processing steps help produce better input data for machine learning or other statistical methods:
* Tokenization to create a bag of words
* Lowercasing words
* Lemmatization/stemming: reduce words to their root forms
* Removing stop words, punctuation, or unwanted tokens

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.
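
Before running the NLTK version, it may help to see the same steps on a toy sentence. The sketch below uses only the standard library; the regex tokenizer, the tiny `TOY_STOPS` set, and the `toy_preprocess` helper are illustrative assumptions rather than NLTK's behavior, and lemmatization is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; NLTK's English list is much longer.
TOY_STOPS = {"the", "a", "is", "of", "and", "to", "in"}

def toy_preprocess(text):
    """Lowercase, tokenize crudely, keep alphabetic tokens that are not stop words."""
    # Crude tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Drop punctuation and numbers.
    alpha_only = [t for t in tokens if t.isalpha()]
    # Drop stop words.
    return [t for t in alpha_only if t not in TOY_STOPS]

sample = ("The bug is in the parser. The parser bug crashed the program, "
          "and the bug took 3 days to fix!")
toy_bow = Counter(toy_preprocess(sample))
print(toy_bow.most_common(2))
```

Once stop words and punctuation are gone, the content words rise to the top of the counts, which is the effect the full pipeline aims for on the real article.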

In [9]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
english_stops = stopwords.words('english')
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))


[('debugging', 51), ('system', 27), ('bug', 18), ('problem', 16), ('software', 16), ('term', 15), ('tool', 14), ('debugger', 14), ('program', 13), ('process', 12)]


### Introduction to gensim

- Popular open-source NLP library
- Uses top academic models to perform complex tasks
    - Building document or word vectors
    - Performing topic identification and document comparison

**Creating and querying a corpus with gensim**  
It's time to apply the methods you learned in the previous video to create your first `gensim` dictionary and corpus!

You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called `articles`. You'll need to do some light preprocessing and then generate the gensim dictionary and corpus.
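
To make the two data structures concrete, here is a rough pure-Python sketch of what a dictionary (token to integer id) and a bag-of-words corpus (lists of `(id, count)` pairs) look like. The helpers `build_token2id` and `toy_doc2bow` are hypothetical stand-ins; gensim's `Dictionary` may assign ids in a different order and offers far more functionality.

```python
from collections import Counter

def build_token2id(docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            # setdefault only stores the new id when the token is unseen.
            token2id.setdefault(token, len(token2id))
    return token2id

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_articles = [["bug", "software", "bug"], ["malware", "software", "virus"]]
toy_token2id = build_token2id(toy_articles)
toy_corpus = [toy_doc2bow(doc, toy_token2id) for doc in toy_articles]

print(toy_token2id)
print(toy_corpus)
```

Each document becomes a sparse list of pairs, so a word that does not occur in a document simply has no entry there.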

In [18]:
# Let's load more articles from Wikipedia and pre-process them
# Application software, Debugging, Crash, Malware, Reverse engineering, Software, Computer programming
import wikipedia
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.corpora.dictionary import Dictionary

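# Note: topics is a set, so the order in which articles are loaded
# (and therefore each article's index in the corpus) can vary between runs
topics = {'Application software', 'Debugging', 'Crash (computing)', 'Malware', 'Reverse engineering', 'Software', 'Computer programming'}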

english_stops = stopwords.words('english')

articles = []

for topic in topics:
    article = wikipedia.page(topic).content
    # Tokenize the article: tokens
    tokens = word_tokenize(article)

    # Convert the tokens into lowercase: lower_tokens
    lower_tokens = [token.lower() for token in tokens]
    
    # Retain alphabetic words: alpha_only
    alpha_only = [t for t in lower_tokens if t.isalpha()]

    # Remove all stop words: no_stops
    no_stops = [t for t in alpha_only if t not in english_stops]

    # Instantiate the WordNetLemmatizer
    wordnet_lemmatizer = WordNetLemmatizer()

    # Lemmatize all tokens into a new list: lemmatized
    lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
    
    # Store the cleaned token list for this article
    articles.append(lemmatized)

# Create a dictionary and corpus
dictionary = Dictionary(articles)
corpus = [dictionary.doc2bow(article) for article in articles]
    


### Gensim bag-of-words  
Now, you'll use your new `gensim` corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell!

You have access to the `dictionary` and `corpus` objects you created in the previous exercise, as well as Python's `defaultdict` and `itertools` to help with the creation of intermediate data structures for analysis.

- `defaultdict` allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument `int`, we are able to ensure that any non-existent keys are automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.

- `itertools.chain.from_iterable()` allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
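
The two helpers described above can be tried on a hand-made corpus first. In this small sketch, `mini_corpus` and its ids and counts are made up for illustration and are not taken from the Wikipedia articles.

```python
from collections import defaultdict
import itertools

# Two toy documents in bag-of-words form: lists of (word_id, count) pairs.
mini_corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]

# Missing keys start at 0, so counts can be incremented without checking first.
totals = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of pairs.
for word_id, count in itertools.chain.from_iterable(mini_corpus):
    totals[word_id] += count

print(dict(totals))
```

Word id 1 appears in both documents, so its counts are summed across them.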



In [21]:
from collections import defaultdict
import itertools

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

print('====================')
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

malware 66
computer 53
system 52
software 43
user 38
====================
software 357
system 188
computer 162
application 125
program 109


### Tf-idf with gensim

**What is tf-idf?**  
- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stopwords
- These words should be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures the most common words don't show up as keywords
- Keeps document-specific frequent words weighted high
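
The down-weighting idea in the list above can be sketched in a few lines of plain Python. This version assumes the weight is term frequency times `log2(N / document frequency)`, followed by scaling each document vector to unit length; that is believed to be close to gensim's default `TfidfModel` settings, but treat the exact formula as an assumption. The function `toy_tfidf` and the astronomy corpus are hypothetical.

```python
import math

def toy_tfidf(corpus_bow):
    """Weight (term_id, tf) pairs by tf * log2(N / df), then L2-normalize per document."""
    n_docs = len(corpus_bow)
    # Document frequency: in how many documents each term id appears.
    doc_freq = {}
    for doc in corpus_bow:
        for term_id, _ in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    weighted = []
    for doc in corpus_bow:
        raw = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        # Terms with zero weight (present in every document) are dropped.
        weighted.append([(tid, w / norm) for tid, w in raw if w > 0] if norm else [])
    return weighted

# Toy astronomy corpus: id 0 = "sky" (in every document), 1 = "nebula", 2 = "comet".
astro_corpus = [[(0, 5), (1, 2)], [(0, 4), (2, 3)], [(0, 6), (1, 1), (2, 1)]]
astro_weights = toy_tfidf(astro_corpus)
for doc_weights in astro_weights:
    print(doc_weights)
```

Because "sky" occurs in all three documents, its inverse document frequency is `log2(3/3) = 0`, so it vanishes from every weighted document even though it has the highest raw count.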

### Tf-idf with Wikipedia  
Now it's your turn to determine new significant terms for your corpus by applying `gensim`'s tf-idf. You will again have access to the `dictionary`, `corpus`, and `doc` objects you created in the previous exercises.  
Will tf-idf make for more interesting results on the document level?

In [25]:
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[0:5])

# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

[(4, 0.03460279270559653), (5, 0.010304832646815313), (6, 0.011534264235198844), (7, 0.010304832646815313), (20, 0.0030979201833038284)]
malware 0.3400594773449053
worm 0.26874168220485045
virus 0.24183479816635686
ransomware 0.19707723361689033
spread 0.19707723361689033
