<font color = blue size=5><b> 2. Simple Topic Identification</b></font>

<font color = green size=3><b> 2.1 Bag of Words </b></font>

- Basic method for finding topics in text
- Need to first tokenize and then count the tokens
- Basic theory is that the more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text
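
The tokenize-then-count idea above can be sketched on a toy sentence before moving to the real article. This is a minimal illustration only: the text is made up, and a crude regular expression stands in for `word_tokenize()` so that the snippet needs nothing beyond the standard library.

```python
import re
from collections import Counter

# Toy text, made up for illustration (not the Wikipedia article).
text = "The cat chased the mouse. The mouse hid in the box."

# Crude tokenizer (an assumption, not NLTK): lowercase, keep runs of letters.
toy_tokens = re.findall(r"[a-z]+", text.lower())

# The bag-of-words is just a count of how often each token occurs.
bow_toy = Counter(toy_tokens)

# The most frequent tokens come first.
print(bow_toy.most_common(2))
```

Note that the most frequent token here is a stop word, which is the problem the preprocessing steps in section 2.2 address.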

**TASKS: Building a Counter with bag-of-words**

In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as article. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

- Import Counter from collections.
- Use word_tokenize() to split the article into tokens.
- Use a list comprehension with t as the iterator variable to convert all the tokens into lowercase. The .lower() method converts text into lowercase.
- Create a bag-of-words counter called bow_simple by using Counter() with lower_tokens as the argument.
- Use the .most_common() method of bow_simple to print the 10 most common tokens.


In [91]:
with open("wiki_article.txt") as file:
    article = file.read()
    
article[0:500]

"'\\'\\'\\'Debugging\\'\\'\\' is the process of finding and resolving of defects that prevent correct operation of computer software or a system.  \\n\\nNumerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statist"

In [92]:
# Import necessary modules
from nltk.tokenize import word_tokenize
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)
tokens[0:10]

["'\\'\\'\\'Debugging\\'\\'\\",
 "'",
 'is',
 'the',
 'process',
 'of',
 'finding',
 'and',
 'resolving',
 'of']

In [93]:
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]
lower_tokens[0:10]

["'\\'\\'\\'debugging\\'\\'\\",
 "'",
 'is',
 'the',
 'process',
 'of',
 'finding',
 'and',
 'resolving',
 'of']

In [94]:
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))


[(',', 151), ('the', 147), ('of', 81), ('.', 70), ('to', 61), ('a', 59), ("''", 42), ('and', 41), ('in', 41), ('(', 40)]


<font color = green size=3><b> 2.2 Text Preprocessing </b></font>

**TASKS: Text Preprocessing Practice**

- Import the WordNetLemmatizer class from nltk.stem.
- Create a list called alpha_only that iterates through lower_tokens and retains only the tokens made up entirely of alphabetic characters. You can use the .isalpha() method to check for this.
- Create another list called no_stops in which you remove all stop words, which are held in a list called english_stops.
- Initialize a WordNetLemmatizer object called wordnet_lemmatizer and use its .lemmatize() method on the tokens in no_stops to create a new list called lemmatized.
- Finally, create a new Counter called bow with the lemmatized words and show the 10 most common tokens.

In [95]:
import re
from nltk.stem import WordNetLemmatizer

# Import the english_stops data
with open('english_stops.txt') as f:
    english_stops = f.read()

#clean english_stops file
english_stops = re.sub(r'\n|\'|\s',"",english_stops)
english_stops=english_stops.split(',')
english_stops[0:5]

['i', 'me', 'my', 'myself', 'we']

In [96]:
# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]
alpha_only[0:8]

['is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of']

In [97]:
# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]
no_stops[0:20]

['process',
 'finding',
 'resolving',
 'defects',
 'prevent',
 'correct',
 'operation',
 'computer',
 'software',
 'system',
 'books',
 'written',
 'debugging',
 'see',
 'reading',
 'involves',
 'numerous',
 'aspects',
 'including',
 'interactive']

In [98]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

lemmatized[0:20]

['process',
 'finding',
 'resolving',
 'defect',
 'prevent',
 'correct',
 'operation',
 'computer',
 'software',
 'system',
 'book',
 'written',
 'debugging',
 'see',
 'reading',
 'involves',
 'numerous',
 'aspect',
 'including',
 'interactive']

In [99]:
# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 30), ('system', 23), ('software', 16), ('computer', 14), ('bug', 14), ('problem', 14), ('term', 13), ('tool', 13), ('process', 12), ('used', 12)]


<font color = green size=3><b> 2.3 gensim </b></font>

**What is a word vector?**

Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.

In the graphic below, we can see that the vector operation king minus queen is approximately equal to man minus woman. The deep learning algorithm used to create the word vectors has been able to distill this meaning based on how these words are used throughout the text.
![](pictures/wordvec.jpg)
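
To make the "king minus queen is roughly man minus woman" idea concrete, here is a small sketch with hand-picked vectors. The numbers are entirely hypothetical and chosen so the relationship holds; real word vectors are learned from a corpus and have many more dimensions.

```python
import math

# Hypothetical 3-dimensional vectors, hand-picked for illustration only.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.15, 0.8],
}

def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal_diff = subtract(toy_vectors["king"], toy_vectors["queen"])
plain_diff = subtract(toy_vectors["man"], toy_vectors["woman"])

# The two difference vectors point in nearly the same direction.
similarity = cosine(royal_diff, plain_diff)
print(round(similarity, 3))
```

A similarity near 1 means the two differences encode almost the same direction, which is what "approximately equal" means in the graphic.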

In [100]:
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['The movie was about a spaceship and aliens.',
                'I really liked the movie!',
                'Awesome action scenes, but boring characters.',
                'The movie was awful! I hate alien films.',
                'Space is cool! I liked the movie.',
                'More space films, please!']

tokenized_docs = [word_tokenize(doc.lower())for doc in my_documents]
tokenized_docs

[['the', 'movie', 'was', 'about', 'a', 'spaceship', 'and', 'aliens', '.'],
 ['i', 'really', 'liked', 'the', 'movie', '!'],
 ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters', '.'],
 ['the', 'movie', 'was', 'awful', '!', 'i', 'hate', 'alien', 'films', '.'],
 ['space', 'is', 'cool', '!', 'i', 'liked', 'the', 'movie', '.'],
 ['more', 'space', 'films', ',', 'please', '!']]

In [101]:
dictionary = Dictionary(tokenized_docs)
dictionary.token2id

{'!': 12,
 ',': 16,
 '.': 8,
 'a': 4,
 'about': 3,
 'action': 14,
 'alien': 22,
 'aliens': 7,
 'and': 6,
 'awesome': 13,
 'awful': 20,
 'boring': 18,
 'but': 17,
 'characters': 19,
 'cool': 26,
 'films': 23,
 'hate': 21,
 'i': 9,
 'is': 25,
 'liked': 11,
 'more': 27,
 'movie': 1,
 'please': 28,
 'really': 10,
 'scenes': 15,
 'space': 24,
 'spaceship': 5,
 'the': 0,
 'was': 2}

**TASKS: Creating and querying a corpus with gensim**

- Import Dictionary from gensim.corpora.dictionary.
- Initialize a gensim Dictionary with the tokens in articles.
- Obtain the id for "computer" from dictionary. To do this, use its .token2id attribute, a dict that maps tokens to ids, and chain .get() to look up the id. Pass in "computer" as an argument to .get().
- Use a list comprehension in which you iterate over articles to create a corpus from dictionary: a list with one bag-of-words document per article (the exercise text calls this a gensim MmCorpus).
- In the output expression, use the .doc2bow() method on dictionary with article as the argument.
- Print the first 10 word ids with their frequency counts from the fifth document.



In [102]:
#recreate 'articles' variable from datacamp
with open('articles.txt') as f:
    articles = f.read()
    
#clean file
articles = re.sub(r'\n|\'|\s|\[|\]',"",articles)
articles=articles.split(',')
num_words = round(len(articles)/12)

# Recreate the list (it won't be exact, because we can't easily get the full list, but we will have 12 documents)
words = len(articles)
num_list = list(range(0,12))
articles_list = [articles[num_words*k:num_words*(k+1)] for k in num_list]
len(articles_list)

12

In [103]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles_list)
dictionary

<gensim.corpora.dictionary.Dictionary at 0x2a5375f1978>

In [104]:
# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

computer


In [105]:
# Create the corpus as a list of bag-of-words documents: corpus
corpus = [dictionary.doc2bow(article) for article in articles_list]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

[(0, 1), (1, 9), (3, 9), (5, 12), (9, 1), (10, 3), (14, 1), (22, 4), (26, 1), (28, 36)]
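
Each pair in the output above is `(word_id, word_count)`. The pure-Python sketch below mimics what `.doc2bow()` produces on a made-up two-document corpus. It is an illustration under assumptions, not gensim's implementation: the helper `toy_doc2bow` is hypothetical, and ids are assigned in first-seen order, which may differ from how gensim's `Dictionary` numbers tokens.

```python
from collections import Counter

# Two tiny tokenized documents, made up for illustration.
toy_docs = [
    ["space", "movie", "space"],
    ["movie", "aliens"],
]

# Assign each distinct token an integer id in first-seen order.
# Assumption: gensim's Dictionary may number tokens differently.
toy_token2id = {}
for doc in toy_docs:
    for token in doc:
        toy_token2id.setdefault(token, len(toy_token2id))

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in token2id."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_corpus = [toy_doc2bow(doc, toy_token2id) for doc in toy_docs]
print(toy_corpus)
# [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Tokens that do not occur in a document are simply absent from its list, which is why the ids in the real output above skip numbers.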


**TASKS: Gensim bag-of-words**

Now, you'll use your new gensim corpus and dictionary to see **the most common terms per document and across all documents.** You can use your dictionary to look up the terms. Take a guess at what the topics are!

- Print the top five words of bow_doc using each word_id with the dictionary alongside word_count. The word for each word_id can be looked up using the .get() method of dictionary.
- Create a defaultdict called total_word_count in which the keys are all the token ids (word_id) and the values are the sum of their occurrence across all documents (word_count). Remember to specify int when creating the defaultdict, and inside the for loop, increment each word_id of total_word_count by word_count.
- Create a sorted list from the defaultdict, using words across the entire corpus. To achieve this, use the .items() method on total_word_count inside sorted().
- Similar to how you printed the top five words of bow_doc earlier, print the top five words of sorted_word_count as well as the number of occurrences of each word across all the documents.

**HINT**
- To print the word for a word_id inside the for loop, pass it into dictionary.get(), such that dictionary.get(word_id) is the first argument of print().
- Use defaultdict(int) to create total_word_count, and be sure you're correctly incrementing total_word_count[word_id] by word_count.
- Use the .items() method on total_word_count as the first argument to sorted(), to ensure that words across the entire corpus are used.

In [106]:
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True) #sorting for second element of each tuple (which is the frequency!)
bow_doc[0:5]

[(28, 36), (40, 23), (1568, 16), (2132, 14), (5, 12)]

In [107]:
# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

"" 36
engineering 23
reverse 16
medal 14
software 12


In [108]:
print(dictionary.get(28))
print(dictionary.get(40))
print(dictionary.get(1568))
print(dictionary.get(2132))
print(dictionary.get(5))

""
engineering
reverse
medal
software
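
The cells above cover the per-document part of the task. The corpus-wide count with `defaultdict` can be sketched as follows. To keep the snippet self-contained it uses hypothetical stand-ins: `toy_corpus` plays the role of `corpus`, and `toy_id2token.get` plays the role of `dictionary.get`.

```python
from collections import defaultdict

# Hypothetical stand-ins for the notebook's corpus and dictionary.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
    [(0, 1), (2, 4)],
]
toy_id2token = {0: "space", 1: "movie", 2: "aliens"}

# Sum each word_id's counts over every document.
# defaultdict(int) starts missing keys at 0, so += works on first sight.
total_word_count = defaultdict(int)
for bow in toy_corpus:
    for word_id, word_count in bow:
        total_word_count[word_id] += word_count

# Sort by total count, highest first.
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

# Print the top five words across all documents alongside their counts.
for word_id, word_count in sorted_word_count[:5]:
    print(toy_id2token.get(word_id), word_count)
```

With the notebook's own objects, the same loop should work after swapping `toy_corpus` for `corpus` and `toy_id2token.get` for `dictionary.get`.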


<font color = green size=3><b> 2.4 Tf-idf with gensim </b></font>

**Term Frequency - inverse document frequency**

- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stop words
- These words should be down-weighted in importance
- Example from astronomy: "sky"
- Ensures the most common words don't show up as keywords
- Keeps document-specific frequent words weighted high

Tf-idf formula:

![](pictures/tfids.jpg)

- The weight will be LOW if the term doesn't appear very often in the document
- **The weight will ALSO be low if the log term is low: when a term appears in nearly every document, N/df_i is close to 1, and the log of 1 is zero. This effectively penalizes words that are common across ALL documents.**
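
The two points above can be checked with a few lines of arithmetic. This sketch assumes the common form of the formula, weight = tf × log(N / df), which is what the N/df_i and log references in these notes suggest. The corpus and the helper `tfidf_weight` are hypothetical, and the natural log is used. gensim's `TfidfModel` applies its own scaling and normalization, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
toy_docs = [
    ["sky", "star", "star", "telescope"],
    ["sky", "planet", "orbit"],
    ["sky", "star", "galaxy"],
]

def tfidf_weight(term, doc, docs):
    """Weight of term in doc, computed as tf * log(N / df)."""
    tf = doc.count(term)                     # occurrences in this document
    df = sum(1 for d in docs if term in d)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

first_doc = toy_docs[0]

# "sky" is in all 3 documents: N/df = 1 and log(1) = 0, so the weight is zero.
sky_weight = tfidf_weight("sky", first_doc, toy_docs)

# "star" occurs twice here but is also in another document.
star_weight = tfidf_weight("star", first_doc, toy_docs)

# "telescope" occurs once and only in this document, so it scores highest.
telescope_weight = tfidf_weight("telescope", first_doc, toy_docs)

print(sky_weight, star_weight, telescope_weight)
```

Even though "star" is more frequent in the first document, "telescope" gets the higher weight because it is unique to that document.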

**TASKS: Tf-idf with Wikipedia**

- Import TfidfModel from gensim.models.tfidfmodel.
- Initialize a new TfidfModel called tfidf using corpus.
- Use doc to calculate the weights. You can do this by indexing tfidf with doc, as in tfidf[doc].
- Print the first five term ids with weights.
- Sort the term ids and weights in a new list from highest to lowest weight.
- Print the top five weighted words (term_id) from sorted_tfidf_weights along with their weighted score (weight).

In [109]:
# Import TfidfModel
from gensim.models.tfidfmodel import TfidfModel

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

[(0, 0.008517407350738034), (1, 0.009622778920251195), (3, 0.009622778920251195), (5, 0.02688445418094919), (9, 0.004982363903548477)]


In [110]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

medal 0.3082405004703436
reverse 0.2725570352236171
3d 0.24427668764610064
ribbon.svg|border|22px 0.21374210169033805
engineering 0.19590036906697475
