# Simple topic identification
>  This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and tf-idf, using NLTK and a new library, Gensim.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 2 exercises from the DataCamp course "Introduction to Natural Language Processing in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Word counts with bag-of-words

### Bag-of-words picker

<div class=""><p>It's time for a quick check on your understanding of bag-of-words. Which of the options below, assuming basic <code>nltk</code> tokenization, is the correct bag-of-words for the following text?</p>
<p>"The cat is in the box. The cat box."</p></div>

<pre>
Possible Answers

('the', 3), ('box.', 2), ('cat', 2), ('is', 1)

('The', 3), ('box', 2), ('cat', 2), ('is', 1), ('in', 1), ('.', 1)

('the', 3), ('cat box', 1), ('cat', 1), ('box', 1), ('is', 1), ('in', 1)

<b>('The', 2), ('box', 2), ('.', 2), ('cat', 2), ('is', 1), ('in', 1), ('the', 1)</b>
</pre>
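
The highlighted answer can be checked with a few lines of code. This is a minimal sketch that uses a regular expression as a hypothetical stand-in for basic `nltk` tokenization (it splits words and punctuation marks into separate tokens and does not lowercase anything), so it needs no NLTK data download. The regex and the variable names are assumptions made for illustration.

```python
import re
from collections import Counter

picker_text = "The cat is in the box. The cat box."

# Stand-in tokenizer (an assumption, not nltk itself):
# runs of word characters, or single punctuation marks, become tokens.
picker_tokens = re.findall(r"\w+|[^\w\s]", picker_text)

# Count every token exactly as it appears; case is preserved.
picker_bow = Counter(picker_tokens)
print(picker_bow.most_common())
```

Because nothing is lowercased, `'The'` and `'the'` are counted separately, and the period is a token of its own.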

### Building a Counter with bag-of-words

<div class=""><p>In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as <code>article</code>. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as <code>article_title</code>. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.</p>
<p><code>word_tokenize</code> has been imported for you.</p></div>

In [None]:
article = '\'\'\'Debugging\'\'\' is the process of finding and resolving of defects that prevent correct operation of computer software or a system.  \n\nNumerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Process Control, and special design tactics to improve detection while simplifying changes.\n\nOrigin\nA computer log entry from the Mark&nbsp;II, with a moth taped to the page\n\nThe terms "bug" and "debugging" are popularly attributed to Admiral Grace Hopper in the 1940s.[http://foldoc.org/Grace+Hopper Grace Hopper]  from FOLDOC While she was working on a Harvard Mark II|Mark II Computer at Harvard University, her associates discovered a moth stuck in a relay and thereby impeding operation, whereupon she remarked that they were "debugging" the system. However the term "bug" in the meaning of technical error dates back at least to 1878 and Thomas Edison (see software bug for a full discussion), and "debugging" seems to have been used as a term in aeronautics before entering the world of computers. Indeed, in an interview Grace Hopper remarked that she was not coining the term{{Citation needed|date=July 2015}}. The moth fit the already existing terminology, so it was saved.  A letter from J. Robert Oppenheimer (director of the WWII atomic bomb "Manhattan" project at Los Alamos, NM) used the term in a letter to Dr. 
Ernest Lawrence at UC Berkeley, dated October 27, 1944,http://bancroft.berkeley.edu/Exhibits/physics/images/bigscience25.jpg regarding the recruitment of additional technical staff.\n\nThe Oxford English Dictionary entry for "debug" quotes the term "debugging" used in reference to airplane engine testing in a 1945 article in the Journal of the Royal Aeronautical Society. An article in "Airforce" (June 1945 p.&nbsp;50) also refers to debugging, this time of aircraft cameras.  Hopper\'s computer bug|bug was found on September 9, 1947. The term was not adopted by computer programmers until the early 1950s.\nThe seminal article by GillS. Gill, [http://www.jstor.org/stable/98663 The Diagnosis of Mistakes in Programmes on the EDSAC], Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, Vol. 206, No. 1087 (May 22, 1951), pp. 538-554 in 1951 is the earliest in-depth discussion of programming errors, but it does not use the term "bug" or "debugging".\nIn the Association for Computing Machinery|ACM\'s digital library, the term "debugging" is first used in three papers from 1952 ACM National Meetings.Robert V. D. Campbell, [http://portal.acm.org/citation.cfm?id=609784.609786 Evolution of automatic computation], Proceedings of the 1952 ACM national meeting (Pittsburgh), p 29-32, 1952.Alex Orden, [http://portal.acm.org/citation.cfm?id=609784.609793 Solution of systems of linear inequalities on a digital computer], Proceedings of the 1952 ACM national meeting (Pittsburgh), p. 91-95, 1952.Howard B. Demuth, John B. Jackson, Edmund Klein, N. Metropolis, Walter Orvedahl, James H. Richardson, [http://portal.acm.org/citation.cfm?id=800259.808982 MANIAC], Proceedings of the 1952 ACM national meeting (Toronto), p. 
13-16 Two of the three use the term in quotation marks.\nBy 1963 "debugging" was a common enough term to be mentioned in passing without explanation on page 1 of the Compatible Time-Sharing System|CTSS manual.[http://www.bitsavers.org/pdf/mit/ctss/CTSS_ProgrammersGuide.pdf The Compatible Time-Sharing System], M.I.T. Press, 1963\n\nKidwell\'s article \'\'Stalking the Elusive Computer Bug\'\'Peggy Aldrich Kidwell, [http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?tp=&arnumber=728224&isnumber=15706 Stalking the Elusive Computer Bug], IEEE Annals of the History of Computing, 1998. discusses the etymology of "bug" and "debug" in greater detail.\n\nScope\nAs software and electronic systems have become generally more complex, the various common debugging techniques have expanded with more methods to detect anomalies, assess impact, and schedule software patches or full updates to a system. The words "anomaly" and "discrepancy" can be used, as being more neutral terms, to avoid the words "error" and "defect" or "bug" where there might be an implication that all so-called \'\'errors\'\', \'\'defects\'\' or \'\'bugs\'\' must be fixed (at all costs). Instead, an impact assessment can be made to determine if changes to remove an \'\'anomaly\'\' (or \'\'discrepancy\'\') would be cost-effective for the system, or perhaps a scheduled new release might render the change(s) unnecessary. Not all issues are life-critical or mission-critical in a system. Also, it is important to avoid the situation where a change might be more upsetting to users, long-term, than living with the known problem(s) (where the "cure would be worse than the disease"). Basing decisions of the acceptability of some anomalies can avoid a culture of a "zero-defects" mandate, where people might be tempted to deny the existence of problems so that the result would appear as zero \'\'defects\'\'. 
Considering the collateral issues, such as the cost-versus-benefit impact assessment, then broader debugging techniques will expand to determine the frequency of anomalies (how often the same "bugs" occur) to help assess their impact to the overall system.\n\nTools\nDebugging on video game consoles is usually done with special hardware such as this Xbox (console)|Xbox debug unit intended for developers.\n\nDebugging ranges in complexity from fixing simple errors to performing lengthy and tiresome tasks of data collection, analysis, and scheduling updates.  The debugging skill of the programmer can be a major factor in the ability to debug a problem, but the difficulty of software debugging varies greatly with the complexity of the system, and also depends, to some extent, on the programming language(s) used and the available tools, such as \'\'debuggers\'\'. Debuggers are software tools which enable the programmer to monitor the execution (computers)|execution of a program, stop it, restart it, set breakpoints, and change values in memory. The term \'\'debugger\'\' can also refer to the person who is doing the debugging.\n\nGenerally, high-level programming languages, such as Java (programming language)|Java, make debugging easier, because they have features such as exception handling that make real sources of erratic behaviour easier to spot. In programming languages such as C (programming language)|C or assembly language|assembly, bugs may cause silent problems such as memory corruption, and it is often difficult to see where the initial problem happened. In those cases, memory debugging|memory debugger tools may be needed.\n\nIn certain situations, general purpose software tools that are language specific in nature can be very useful.  These take the form of \'\'List of tools for static code analysis|static code analysis tools\'\'.  These tools look for a very specific set of known problems, some common and some rare, within the source code.  
All such issues detected by these tools would rarely be picked up by a compiler or interpreter, thus they are not syntax checkers, but more semantic checkers.  Some tools claim to be able to detect 300+ unique problems. Both commercial and free tools exist in various languages.  These tools can be extremely useful when checking very large source trees, where it is impractical to do code walkthroughs.  A typical example of a problem detected would be a variable dereference that occurs \'\'before\'\' the variable is assigned a value.  Another example would be to perform strong type checking when the language does not require such.  Thus, they are better at locating likely errors, versus actual errors.  As a result, these tools have a reputation of false positives.  The old Unix \'\'Lint programming tool|lint\'\' program is an early example.\n\nFor debugging electronic hardware (e.g., computer hardware) as well as low-level software (e.g., BIOSes, device drivers) and firmware, instruments such as oscilloscopes, logic analyzers or in-circuit emulator|in-circuit emulators (ICEs) are often used, alone or in combination.  An ICE may perform many of the typical software debugger\'s tasks on low-level software and firmware.\n\nDebugging process \nNormally the first step in debugging is to attempt to reproduce the problem. This can be a non-trivial task, for example as with Parallel computing|parallel processes or some unusual software bugs. Also, specific user environment and usage history can make it difficult to reproduce the problem.\n\nAfter the bug is reproduced, the input of the program may need to be simplified to make it easier to debug. For example, a bug in a compiler can make it Crash (computing)|crash when parsing some large source file. However, after simplification of the test case, only few lines from the original source file can be sufficient to reproduce the same crash. 
Such simplification can be made manually, using a Divide and conquer algorithm|divide-and-conquer approach. The programmer will try to remove some parts of original test case and check if the problem still exists. When debugging the problem in a Graphical user interface|GUI, the programmer can try to skip some user interaction from the original problem description and check if remaining actions are sufficient for bugs to appear.\n\nAfter the test case is sufficiently simplified, a programmer can use a debugger tool to examine program states (values of variables, plus the call stack) and track down the origin of the problem(s). Alternatively, Tracing (software)|tracing can be used. In simple cases, tracing is just a few print statements, which output the values of variables at certain points of program execution.{{citation needed|date=February 2016}}\n\n Techniques \n \'\'Interactive debugging\'\'\n \'\'{{visible anchor|Print debugging}}\'\' (or tracing) is the act of watching (live or recorded) trace statements, or print statements, that indicate the flow of execution of a process. This is sometimes called \'\'{{visible anchor|printf debugging}}\'\', due to the use of the printf function in C. This kind of debugging was turned on by the command TRON in the original versions of the novice-oriented BASIC programming language. TRON stood for, "Trace On." TRON caused the line numbers of each BASIC command line to print as the program ran.\n \'\'Remote debugging\'\' is the process of debugging a program running on a system different from the debugger. To start remote debugging, a debugger connects to a remote system over a network. The debugger can then control the execution of the program on the remote system and retrieve information about its state.\n \'\'Post-mortem debugging\'\' is debugging of the program after it has already Crash (computing)|crashed. 
Related techniques often include various tracing techniques (for example,[http://www.drdobbs.com/tools/185300443 Postmortem Debugging, Stephen Wormuller, Dr. Dobbs Journal, 2006]) and/or analysis of memory dump (or core dump) of the crashed process. The dump of the process could be obtained automatically by the system (for example, when process has terminated due to an unhandled exception), or by a programmer-inserted instruction, or manually by the interactive user.\n \'\'"Wolf fence" algorithm:\'\' Edward Gauss described this simple but very useful and now famous algorithm in a 1982 article for communications of the ACM as follows: "There\'s one wolf in Alaska; how do you find it? First build a fence down the middle of the state, wait for the wolf to howl, determine which side of the fence it is on. Repeat process on that side only, until you get to the point where you can see the wolf."<ref name="communications of the ACM">{{cite journal | title="Pracniques: The "Wolf Fence" Algorithm for Debugging", | author=E. J. Gauss | year=1982}} This is implemented e.g. in the Git (software)|Git version control system as the command \'\'git bisect\'\', which uses the above algorithm to determine which Commit (data management)|commit introduced a particular bug.\n \'\'Delta Debugging\'\'{{snd}} a technique of automating test case simplification.Andreas Zeller: <cite>Why Programs Fail: A Guide to Systematic Debugging</cite>, Morgan Kaufmann, 2005. 
ISBN 1-55860-866-4{{rp|p.123}}<!-- for redirect from \'Saff Squeeze\' -->\n \'\'Saff Squeeze\'\'{{snd}} a technique of isolating failure within the test using progressive inlining of parts of the failing test.[http://www.threeriversinstitute.org/HitEmHighHitEmLow.html Kent Beck, Hit \'em High, Hit \'em Low: Regression Testing and the Saff Squeeze]\n\nDebugging for embedded systems\nIn contrast to the general purpose computer software design environment, a primary characteristic of embedded environments is the sheer number of different platforms available to the developers (CPU architectures, vendors, operating systems and their variants). Embedded systems are, by definition, not general-purpose designs: they are typically developed for a single task (or small range of tasks), and the platform is chosen specifically to optimize that application. Not only does this fact make life tough for embedded system developers, it also makes debugging and testing of these systems harder as well, since different debugging tools are needed in different platforms.\n\nto identify and fix bugs in the system (e.g. logical or synchronization problems in the code, or a design error in the hardware);\nto collect information about the operating states of the system that may then be used to analyze the system: to find ways to boost its performance or to optimize other important characteristics (e.g. 
energy consumption, reliability, real-time response etc.).\n\nAnti-debugging\nAnti-debugging is "the implementation of one or more techniques within computer code that hinders attempts at reverse engineering or debugging a target process".<ref name="veracode-antidebugging">{{cite web |url=http://www.veracode.com/blog/2008/12/anti-debugging-series-part-i/ |title=Anti-Debugging Series - Part I |last=Shields |first=Tyler |date=2008-12-02 |work=Veracode |accessdate=2009-03-17}} It is actively used by recognized publishers in copy protection|copy-protection schemas, but is also used by malware to complicate its detection and elimination.<ref name="soft-prot">[http://people.seas.harvard.edu/~mgagnon/software_protection_through_anti_debugging.pdf Software Protection through Anti-Debugging Michael N Gagnon, Stephen Taylor, Anup Ghosh] Techniques used in anti-debugging include:\nAPI-based: check for the existence of a debugger using system information\nException-based: check to see if exceptions are interfered with\nProcess and thread blocks: check whether process and thread blocks have been manipulated\nModified code: check for code modifications made by a debugger handling software breakpoints\nHardware- and register-based: check for hardware breakpoints and CPU registers\nTiming and latency: check the time taken for the execution of instructions\nDetecting and penalizing debugger<ref name="soft-prot" /><!-- reference does not exist -->\n\nAn early example of anti-debugging existed in early versions of Microsoft Word which, if a debugger was detected, produced a message that said: "The tree of evil bears bitter fruit. Now trashing program disk.", after which it caused the floppy disk drive to emit alarming noises with the intent of scaring the user away from attempting it again.<ref name="SecurityEngineeringRA">{{cite book | url=http://www.cl.cam.ac.uk/~rja14/book.html | author=Ross J. 
Anderson | title=Security Engineering | isbn = 0-471-38922-6 | page=684 }}<ref name="toastytech">{{cite web | url=http://toastytech.com/guis/word1153.html | title=Microsoft Word for DOS 1.15}}\n'

In [None]:
#import re
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Instructions
<ul>
<li>Import <code>Counter</code> from <code>collections</code>.</li>
<li>Use <code>word_tokenize()</code> to split the article into tokens.</li>
<li>Use a list comprehension with <code>t</code> as the iterator variable to convert all the tokens into lowercase. The <code>.lower()</code> method converts text into lowercase.</li>
<li>Create a bag-of-words counter called <code>bow_simple</code> by using <code>Counter()</code> with <code>lower_tokens</code> as the argument.</li>
<li>Use the <code>.most_common()</code> method of <code>bow_simple</code> to print the 10 most common tokens.</li>
</ul>

In [None]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

[(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]


## Simple text preprocessing

### Text preprocessing steps

<p>Which of the following are useful text preprocessing steps?</p>

<pre>
Possible Answers

Stems, spelling corrections, lowercase.

<b>Lemmatization, lowercasing, removing unwanted tokens.</b>

Removing stop words, leaving in capital words.

Strip stop words, word endings and digits.

</pre>
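
The three steps in the highlighted answer can be sketched on a toy token list. Everything here is hypothetical: the tokens, the tiny stop list, and the lookup table that stands in for a real lemmatizer such as NLTK's `WordNetLemmatizer`. It is meant only to show the order of operations, not a production pipeline.

```python
from collections import Counter

# Toy inputs, invented for illustration.
toy_tokens = ["The", "cats", "are", "in", "the", "boxes", ".", "2", "cat", "box"]
toy_stops = {"the", "are", "in"}
toy_lemmas = {"cats": "cat", "boxes": "box"}  # stand-in for a real lemmatizer

# 1. Lowercase every token.
toy_lower = [t.lower() for t in toy_tokens]

# 2. Remove unwanted tokens: punctuation, digits, and stop words.
toy_alpha = [t for t in toy_lower if t.isalpha()]
toy_no_stops = [t for t in toy_alpha if t not in toy_stops]

# 3. Lemmatize: map each token to its base form when one is known.
toy_lemmatized = [toy_lemmas.get(t, t) for t in toy_no_stops]

toy_bow = Counter(toy_lemmatized)
print(toy_bow.most_common())
```

After cleaning, the singular and plural forms collapse into the same two entries, which is what makes the resulting counts more informative.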

### Text preprocessing practice

<div class=""><p>Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.</p>
<p>You start with the same tokens you created in the last exercise: <code>lower_tokens</code>. You also have the <code>Counter</code> class imported.</p></div>

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('wordnet')
english_stops = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Instructions
<ul>
<li>Import the <code>WordNetLemmatizer</code> class from <code>nltk.stem</code>. </li>
<li>Create a list <code>alpha_only</code> that contains <strong>only</strong> the tokens made up entirely of alphabetic characters. You can use the <code>.isalpha()</code> method to check for this.</li>
<li>Create another list called <code>no_stops</code> consisting of words from <code>alpha_only</code> that <strong>are not</strong> contained in <code>english_stops</code>.</li>
<li>Initialize a <code>WordNetLemmatizer</code> object called <code>wordnet_lemmatizer</code> and use its <code>.lemmatize()</code> method on the tokens in <code>no_stops</code> to create a new list called <code>lemmatized</code>.</li>
<li>Create a new <code>Counter</code> called <code>bow</code> with the lemmatized words.</li>
<li>Lastly, print the 10 most common tokens.</li>
</ul>

In [None]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 40), ('system', 25), ('software', 16), ('bug', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('used', 12)]


## Introduction to gensim

### What are word vectors?

<p>What are word vectors and how do they help with NLP?</p>

<pre>
Possible Answers

They are similar to bags of words, just with numbers. You use them to count how many tokens there are.

Word vectors are sparse arrays representing bigrams in the corpora. You can use them to compare two sets of words to one another.

<b>Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.</b>

Word vectors don't actually help NLP and are just hype.

</pre>

**Keep working to use some word vectors yourself!**
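
To make "relationships between words" a little more concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration; real word vectors are learned from a corpus and usually have many more dimensions.

```python
import math

# Hypothetical vectors, made up for this example only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

Vectors that point in similar directions score close to 1, so in this toy setup "king" is much closer to "queen" than to "apple".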

### Creating and querying a corpus with gensim

<div class=""><p>It's time to apply the methods you learned in the previous video to create your first <code>gensim</code> dictionary and corpus! </p>
<p>You'll use these data structures to investigate word trends and potential interesting topics in your document set. To get started, we have imported a few additional messy articles from Wikipedia, which were preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation. These were then stored in a list of document tokens called <code>articles</code>. You'll need to do some light preprocessing and then generate the <code>gensim</code> dictionary and corpus.</p></div>

In [None]:
! wget https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/12-introduction-to-natural-language-processing-in-python/datasets/articles.txt
import pickle
with open('/content/articles.txt', 'rb') as fp:
  articles = pickle.load(fp)

--2021-02-06 14:40:52--  https://github.com/lnunesAI/Datacamp/raw/main/2-machine-learning-scientist-with-python/12-introduction-to-natural-language-processing-in-python/datasets/articles.txt
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/12-introduction-to-natural-language-processing-in-python/datasets/articles.txt [following]
--2021-02-06 14:40:53--  https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/12-introduction-to-natural-language-processing-in-python/datasets/articles.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent,

Instructions
<ul>
<li>Import <code>Dictionary</code> from <code>gensim.corpora.dictionary</code>.</li>
<li>Initialize a <code>gensim</code> <code>Dictionary</code> with the tokens in <code>articles</code>.</li>
<li>Obtain the id for <code>"computer"</code> from <code>dictionary</code>. To do this, use its <code>.token2id</code> attribute, a regular Python dict that maps each token to its id, and call <code>.get()</code> on it. Pass in <code>"computer"</code> as an argument to <code>.get()</code>.</li>
<li>Use a list comprehension in which you iterate over <code>articles</code> to create a <code>gensim</code> bag-of-words corpus from <code>dictionary</code> (the course calls it an <code>MmCorpus</code>; here it is a plain list with one bag-of-words per document).<ul>
<li>In the output expression, use the <code>.doc2bow()</code> method on <code>dictionary</code> with <code>article</code> as the argument.</li></ul></li>
<li>Print the first 10 word ids with their frequency counts from the fifth document.</li>
</ul>

In [42]:
# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create the bag-of-words corpus, one list of (word_id, count) pairs per article: corpus
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

computer
[(0, 88), (23, 11), (24, 2), (39, 1), (41, 2), (55, 22), (56, 1), (57, 1), (58, 1), (59, 3)]
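
To see what these two data structures contain, here is a pure-Python sketch of the same idea. It is a simplified stand-in and not how `gensim` is implemented: the id assignment order is an assumption, and the names `toy_token2id` and `toy_doc2bow` are hypothetical.

```python
from collections import Counter

# Hypothetical, already-tokenized documents.
toy_articles = [
    ["computer", "software", "bug", "software"],
    ["computer", "hardware", "computer"],
]

# Give each distinct token an integer id, in order of first appearance.
toy_token2id = {}
for toy_article in toy_articles:
    for token in toy_article:
        toy_token2id.setdefault(token, len(toy_token2id))

# Reverse lookup: id -> token.
toy_id2token = {idx: token for token, idx in toy_token2id.items()}

def toy_doc2bow(tokens):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(toy_token2id[t] for t in tokens)
    return sorted(counts.items())

toy_bow_corpus = [toy_doc2bow(a) for a in toy_articles]
print(toy_token2id.get("computer"))
print(toy_bow_corpus)
```

Each document becomes a short list of `(id, count)` pairs, and tokens that do not occur in a document are simply absent from its list.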


### Gensim bag-of-words

<div class=""><p>Now, you'll use your new <code>gensim</code> corpus and dictionary to see the most common terms per document and across all documents. You can use your dictionary to look up the terms. Take a guess at what the topics are and feel free to explore more documents in the IPython Shell! </p>
<p>You have access to the <code>dictionary</code> and <code>corpus</code> objects you created in the previous exercise, as well as the Python <code>defaultdict</code> and <code>itertools</code> to help with the creation of intermediate data structures for analysis. </p>
<ul>
<li><p><code>defaultdict</code> allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument <code>int</code>, we are able to ensure that any non-existent keys are automatically assigned a default value of <code>0</code>. This makes it ideal for storing the counts of words in this exercise.</p></li>
<li><p><code>itertools.chain.from_iterable()</code> allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our <code>corpus</code> object (which is a list of lists).</p></li>
</ul>
<p>The fifth document from <code>corpus</code> is stored in the variable <code>doc</code>; sorting it by count in descending order gives <code>bow_doc</code>.</p></div>
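
The two helpers described above can be tried on a tiny corpus first. This is a minimal sketch with made-up `(word_id, count)` pairs; the names `toy_pairs_corpus` and `toy_totals` are hypothetical.

```python
from collections import defaultdict
import itertools

# Two toy documents in bag-of-words form: lists of (word_id, count) pairs.
toy_pairs_corpus = [[(0, 2), (1, 1)], [(0, 3), (2, 4)]]

# chain.from_iterable flattens the list of lists into one stream of pairs.
flat_pairs = list(itertools.chain.from_iterable(toy_pairs_corpus))

# defaultdict(int) starts every unseen key at 0, so += works right away.
toy_totals = defaultdict(int)
for word_id, word_count in flat_pairs:
    toy_totals[word_id] += word_count

print(flat_pairs)
print(dict(toy_totals))
```

With a plain `dict`, the first `+=` on a new key would raise a `KeyError`, which is the problem `defaultdict(int)` avoids.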

In [46]:
from collections import defaultdict
import itertools

Instructions 1/2
<ul>
<li><p>Using the first <code>for</code> loop, print the top five words of <code>bow_doc</code> using each <code>word_id</code> with the <code>dictionary</code> alongside <code>word_count</code>. </p>
<ul>
<li>The word for each <code>word_id</code> can be looked up using the <code>.get()</code> method of <code>dictionary</code>.</li></ul></li>
<li><p>Create a <code>defaultdict</code> called <code>total_word_count</code> in which the keys are all the token ids (<code>word_id</code>) and the values are the sum of their occurrence across all documents (<code>word_count</code>). </p>
<ul>
<li>Remember to specify <code>int</code> when creating the <code>defaultdict</code>, and inside the second <code>for</code> loop, increment each <code>word_id</code> of <code>total_word_count</code> by <code>word_count</code>.</li></ul></li>
</ul>

In [47]:
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

engineering 91
'' 88
reverse 71
software 51
cite 26


Instructions 2/2
<ul>
<li>Create a sorted list from the <code>defaultdict</code>, using words across the entire corpus. To achieve this, use the <code>.items()</code> method on <code>total_word_count</code> inside <code>sorted()</code>.</li>
<li>Similar to how you printed the top five words of <code>bow_doc</code> earlier, print the top five words of <code>sorted_word_count</code> as well as the number of occurrences of each word across all the documents.</li>
</ul>

In [48]:
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

'' 1042
computer 594
software 450
`` 345
cite 322


## Tf-idf with gensim

### What is tf-idf?

<div class=""><p>You want to calculate the tf-idf weight for the word <code>"computer"</code>, which appears five times in a document containing 100 words. Given a corpus containing 200 documents, with 20 documents mentioning the word <code>"computer"</code>, tf-idf can be calculated by multiplying term frequency with inverse document frequency.</p>
<p>Term frequency = share of the document's tokens that are the word (its count divided by the total number of tokens in the document)<br>
Inverse document frequency = logarithm of the total number of documents in the corpus divided by the number of documents containing the term</p>
<p>Which of the below options is correct?</p></div>

<pre>
Possible Answers

<b>(5 / 100) * log(200 / 20)</b>

(5 * 100) / log(200 * 20)

(20 / 5) * log(200 / 20)

(200 * 5) * log(400 / 5)

</pre>
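
The highlighted expression can be evaluated directly. This sketch assumes the natural logarithm, since the exercise does not say which base to use; a different base only rescales the result.

```python
import math

term_count = 5        # occurrences of "computer" in the document
doc_length = 100      # total tokens in the document
n_docs = 200          # documents in the corpus
docs_with_term = 20   # documents that mention "computer"

tf = term_count / doc_length               # term frequency
idf = math.log(n_docs / docs_with_term)    # inverse document frequency (natural log assumed)
tfidf_weight = tf * idf

print(tf, idf, tfidf_weight)
```

Libraries may apply their own weighting and normalization choices, so the numbers they report will not necessarily match this hand calculation.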

### Tf-idf with Wikipedia

<div class=""><p>Now it's your turn to determine new significant terms for your corpus by applying <code>gensim</code>'s tf-idf. You will again have access to the same corpus and dictionary objects you created in the previous exercises - <code>dictionary</code>, <code>corpus</code>, and <code>doc</code>. Will tf-idf make for more interesting results on the document level?</p>
<p><code>TfidfModel</code> has been imported for you from <code>gensim.models.tfidfmodel</code>.</p></div>

In [50]:
from gensim.models.tfidfmodel import TfidfModel

Instructions 1/2
<ul>
<li>Initialize a new <code>TfidfModel</code> called <code>tfidf</code> using <code>corpus</code>.</li>
<li>Use <code>doc</code> to calculate the weights. You can do this by indexing <code>tfidf</code> with <code>doc</code>, that is, <code>tfidf[doc]</code>.</li>
<li>Print the first five term ids with weights.</li>
</ul>

In [51]:
# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

[(24, 0.0022836332291091273), (39, 0.0043409401554717324), (41, 0.008681880310943465), (55, 0.011988285029371418), (56, 0.005482756770026296)]


Instructions 2/2
<ul>
<li>Sort the term ids and weights in a new list from highest to lowest weight. <em>This has been done for you.</em> </li>
<li>Using your pre-existing <code>dictionary</code>, print the top five weighted words (<code>term_id</code>) from <code>sorted_tfidf_weights</code>, along with their weighted score (<code>weight</code>).</li>
</ul>

In [52]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

reverse 0.4884961428651127
infringement 0.18674529210288995
engineering 0.16395041814479536
interoperability 0.12449686140192663
reverse-engineered 0.12449686140192663
