# <center> Word counts with bag-of-words

#### What is Bag-of-words?
- A basic method for finding topics in a text. First, create tokens using tokenization, then count them all.
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text

Basic example:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
text="The cat is in the box. The cat likes the box. The box is over the cat."
## Apply the Counter class to the tokens created with word_tokenize
counter=Counter(word_tokenize(text)) ## this still needs pre-processing ('The' and 'the' are counted separately)
## the result is a Counter object, which behaves like a dictionary
counter 

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

In [3]:
## returns a list of tuples with this structure: (token, frequency)
counter.most_common(2)

[('The', 3), ('cat', 3)]
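
The counts above treat 'The' and 'the' as different tokens and also count the full stop. As a minimal sketch of why pre-processing matters, the snippet below lowercases the same text before counting. It assumes a simple regular expression as a stand-in for `word_tokenize` (so it runs without NLTK data), and the names `simple_tokens` and `simple_counter` are illustrative only.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: a regex replaces word_tokenize here; it keeps runs of
# letters only, so punctuation is dropped along the way.
simple_tokens = re.findall(r"[a-z]+", text.lower())

# Count the lowercased tokens
simple_counter = Counter(simple_tokens)
print(simple_counter.most_common(3))
# [('the', 6), ('cat', 3), ('box', 3)]
```

After lowercasing, 'the' dominates the counts, which is one reason stop words are usually removed in a later step.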

#### Building a Counter with bag-of-words

In [4]:
## using a Wikipedia article that was copied into a txt file
with open('article.txt','r',encoding='UTF-8') as file:
    article=file.read()
article[:500]

"'''Debugging''' is the process of finding and resolving of defects that prevent correct operation of computer software or a system.  \n\nNumerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Proc"

In [5]:
# Import Counter
from collections import Counter
# Tokenize the article: tokens
tokens = word_tokenize(article)
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [word.lower() for word in tokens]
lower_tokens[15:21]

['prevent', 'correct', 'operation', 'of', 'computer', 'software']

In [6]:
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

##Article title: Debugging
# Print the 15 most common tokens
print(bow_simple.most_common(15))

[('the', 283), (',', 281), ('.', 170), ('of', 155), ("''", 124), ('to', 120), ('a', 110), ('``', 86), ('in', 82), ('and', 76), ('(', 75), (')', 75), ('debugging', 74), (':', 59), ('for', 50)]


# <center> Simple text preprocessing
#### Why preprocess?
- Helps produce better input data when performing machine learning or other statistical methods
    
Some examples:
- Tokenization to create a bag of words
- Lowercasing words
- Lemmatization/Stemming: Shorten words to their root stems
- Removing stop words, punctuation, or unwanted tokens (examples: "the", "and", ".", "," , etc)

    
<b>Recommendation:</b> Good to experiment with different approaches
    
Basic example:

In [7]:
from nltk.corpus import stopwords
##text to use in this example
text

'The cat is in the box. The cat likes the box. The box is over the cat.'

In [8]:
# list comprehension that tokenizes the lowercased text
## use the str.isalpha() method to keep only alphabetic strings (this effectively strips tokens containing numbers or punctuation)
tokens = [w for w in word_tokenize(text.lower()) 
                  if w.isalpha()]
print(tokens)

['the', 'cat', 'is', 'in', 'the', 'box', 'the', 'cat', 'likes', 'the', 'box', 'the', 'box', 'is', 'over', 'the', 'cat']


In [9]:
# if the stopwords corpus wasn't downloaded before, download it
import nltk
#nltk.download('stopwords')

In [10]:
# list comprehension to remove words that are in the stopwords list
# the English stopwords ship with NLTK but need to be downloaded the first time they are used
# build the set once so the stopword list is not reloaded for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in tokens 
                    if t not in english_stops]
print(no_stops)

['cat', 'box', 'cat', 'likes', 'box', 'box', 'cat']


In [11]:
##count the pre-processed words
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

#### Text preprocessing practice

In [12]:
#using lowercase words from the previous exercise
print(lower_tokens[:50])

["'", "''", 'debugging', "''", "'", 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', '.', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', '(', 'see', 'below', ':', '#', 'further', 'reading|further', 'reading', ')', ',', 'as', 'it', 'involves', 'numerous', 'aspects', ',', 'including', 'interactive']


In [13]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]
print(alpha_only[:50])

['debugging', 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', 'see', 'below', 'further', 'reading', 'as', 'it', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling']


In [14]:
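# Remove all stop words: no_stops
# build the set once so the stopword list is not reloaded for every token
english_stops = set(stopwords.words('english'))
no_stops = [t for t in alpha_only if t not in english_stops]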
print(no_stops[:50])

['debugging', 'process', 'finding', 'resolving', 'defects', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'books', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactics', 'improve', 'detection', 'simplifying', 'changes', 'origin', 'computer', 'log', 'entry']


In [15]:
## if the WordNet corpus wasn't downloaded before, download it
import nltk
#nltk.download('wordnet')

In [16]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
print(lemmatized[:50])

['debugging', 'process', 'finding', 'resolving', 'defect', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'book', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspect', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'file', 'monitoring', 'application', 'system', 'memory', 'dump', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactic', 'improve', 'detection', 'simplifying', 'change', 'origin', 'computer', 'log', 'entry']


In [17]:
# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 74), ('system', 47), ('bug', 31), ('software', 30), ('problem', 30), ('tool', 30), ('debugger', 26), ('process', 24), ('computer', 23), ('used', 22)]


# <center> Introduction to gensim
#### What is gensim?
- Popular open-source NLP library
- Uses top academic models to perform complex tasks:
    - Building document or word vectors
    - Performing topic identification and document comparison

#### What is a word embedding (document vector)?
- It is trained on a large corpus.
- It is a multidimensional representation of a word (a multidimensional array).
- With these vectors it is possible to see relationships between words or documents.
    
Def: Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.
    
<img src="https://cdn-images-1.medium.com/max/1000/0*4ctH2ps5Y-ZIYW1g.png" width="800" height="400">    
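
To make "relationships between words" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and their values are made up for illustration; real embeddings are learned from a corpus and usually have hundreds of dimensions. The helper name `cosine_similarity` is also an assumption, not a gensim function.

```python
import math

# Hypothetical 3-dimensional word vectors (values invented for illustration)
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

royal = cosine_similarity(vectors["king"], vectors["queen"])
fruit = cosine_similarity(vectors["king"], vectors["apple"])

# Vectors pointing in similar directions score closer to 1
print(royal > fruit)
# True
```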
    
    
Basic example:

In [18]:
from gensim.corpora.dictionary import Dictionary
my_documents = ['The movie was about a spaceship and aliens.','I really liked the movie!',
'Awesome action scenes, but boring characters.','The movie was awful! I hate alien films.','Space is cool! I liked the movie.','More space films, please!']

In [19]:
###pre-process using lowercase
tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
tokenized_docs

[['the', 'movie', 'was', 'about', 'a', 'spaceship', 'and', 'aliens', '.'],
 ['i', 'really', 'liked', 'the', 'movie', '!'],
 ['awesome', 'action', 'scenes', ',', 'but', 'boring', 'characters', '.'],
 ['the', 'movie', 'was', 'awful', '!', 'i', 'hate', 'alien', 'films', '.'],
 ['space', 'is', 'cool', '!', 'i', 'liked', 'the', 'movie', '.'],
 ['more', 'space', 'films', ',', 'please', '!']]

#### Creating a gensim dictionary

In [20]:
dictionary = Dictionary(tokenized_docs)
## print the dictionary of all tokens with their IDs
print(dictionary.token2id)

{'.': 0, 'a': 1, 'about': 2, 'aliens': 3, 'and': 4, 'movie': 5, 'spaceship': 6, 'the': 7, 'was': 8, '!': 9, 'i': 10, 'liked': 11, 'really': 12, ',': 13, 'action': 14, 'awesome': 15, 'boring': 16, 'but': 17, 'characters': 18, 'scenes': 19, 'alien': 20, 'awful': 21, 'films': 22, 'hate': 23, 'cool': 24, 'is': 25, 'space': 26, 'more': 27, 'please': 28}


#### Creating a gensim corpus
- gensim models can be easily saved, updated, and reused
- The dictionary created can also be updated
- This is a more advanced and feature-rich bag-of-words model

In [21]:
## This transforms each document into a bag-of-words using the token IDs and their respective frequencies in the document
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
# Each inner list represents a document
# Each document becomes a series of tuples
print('In each tuple items: (token ID, token frequency in document)')
corpus

In each tuple items: (token ID, token frequency in document)


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
 [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
 [(0, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1)],
 [(0, 1),
  (5, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1)],
 [(0, 1), (5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (24, 1), (25, 1), (26, 1)],
 [(9, 1), (13, 1), (22, 1), (26, 1), (27, 1), (28, 1)]]
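
To see what the dictionary and `doc2bow` steps amount to, the following plain-Python sketch rebuilds the idea for the first two documents. It is an approximation for illustration, not gensim's implementation: the helper names `build_token2id` and `to_bow` are hypothetical, and the id assignment (new tokens of each document in sorted order) is an assumption that happens to reproduce the ids printed above.

```python
from collections import Counter

docs = [
    ['the', 'movie', 'was', 'about', 'a', 'spaceship', 'and', 'aliens', '.'],
    ['i', 'really', 'liked', 'the', 'movie', '!'],
]

def build_token2id(tokenized_docs):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: unseen tokens get ids in sorted order, document by document
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Turn a token list into sorted (token id, frequency) tuples."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

token2id = build_token2id(docs)
print(to_bow(docs[1], token2id))
# [(5, 1), (7, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
```

The printed list matches the second document in the gensim corpus above: each tuple pairs a token id with how often that token occurs in the document.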

#### Examples: Creating and querying a corpus with gensim

In [22]:
import pandas as pd
## using tokens from Wikipedia articles that were saved to a csv file
wiki_tokens=pd.read_csv('Wiki_articles_tokens.csv')
print(wiki_tokens.shape)
wiki_tokens.head()

(12, 1)


Unnamed: 0,articles
0,"['uses', 'file', 'operating', 'system', 'place..."
1,"[""''"", 'debugging', ""''"", 'process', 'finding'..."
2,"['use', 'dmy', 'dates|date=september', '2013',..."
3,"['use', 'dmy', 'dates|date=march', '2014', 'in..."
4,"[""''"", 'reverse', 'engineering', ""''"", 'also',..."


In [23]:
wiki_tok=[]
for num in range(len(wiki_tokens['articles'])):
    wiki_tok.append([text.replace("'","").strip().replace("[","").replace("]","") for text in wiki_tokens['articles'][num].split(',')])
print(wiki_tok[0][:25])

['uses', 'file', 'operating', 'system', 'placement', 'software', '.svg|thumb|upright|a', 'diagram', 'showing', 'user', 'computing', '|user', 'interacts', 'application', 'software', 'typical', 'desktop', 'computer.the', 'application', 'software', 'layer', 'interfaces', 'operating', 'system', 'turn']
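
Splitting on commas works for most tokens, but it can misparse tokens that contain commas or quotes; the `""` entries in later outputs appear to come from `''` tokens whose quotes were stripped. As a hedged alternative, assuming each cell holds a valid Python list literal, `ast.literal_eval` from the standard library can parse it directly. The string `raw_cell` below is a made-up example, not a row from the csv file.

```python
import ast

# Hypothetical cell content: a list of tokens stored as a string
raw_cell = "['uses', 'file', \"''\", 'a,b', 'system']"

# Naive approach: split on commas and strip quotes and brackets
naive = [t.replace("'", "").strip().replace("[", "").replace("]", "")
         for t in raw_cell.split(',')]

# Safer approach: evaluate the string as a Python literal
parsed = ast.literal_eval(raw_cell)

print(naive)
# ['uses', 'file', '""', 'a', 'b', 'system']
print(parsed)
# ['uses', 'file', "''", 'a,b', 'system']
```

Switching the notebook to this approach would change the token lists, so the counts shown in the outputs below would no longer match exactly.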


In [24]:
# Dictionary was already imported from gensim.corpora.dictionary above
# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(wiki_tok)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

computer


In [25]:
# Create a bag-of-words corpus (a list of doc2bow results): corpus
corpus = [dictionary.doc2bow(article) for article in wiki_tok]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])

[(0, 88), (20, 11), (24, 2), (39, 1), (41, 2), (55, 22), (56, 1), (57, 1), (58, 1), (59, 3)]


Use gensim corpus and dictionary to see the most common terms per document and across all documents

In [26]:
# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

engineering 91
"" 88
reverse 71
software 51
cite 26


Python's <b> defaultdict </b> and <b>itertools</b> help with the creation of intermediate data structures for analysis. 
- <b> defaultdict : </b> allows us to initialize a dictionary that will assign a default value to non-existent keys. By supplying the argument int, we are able to ensure that any non-existent keys are automatically assigned a default value of 0. This makes it ideal for storing the counts of words in this exercise.

- <b> itertools.chain.from_iterable() : </b>  allows us to iterate through a set of sequences as if they were one continuous sequence. Using this function, we can easily iterate through our corpus object (which is a list of lists).
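
As a small illustration of how the two tools combine, the sketch below sums counts over a tiny hypothetical corpus (the ids and counts are invented) so the intermediate result is easy to inspect. The names `toy_corpus` and `toy_totals` are illustrative only.

```python
import collections
import itertools

# Hypothetical corpus: two documents as lists of (token id, count) tuples
toy_corpus = [
    [(0, 2), (1, 1)],
    [(1, 3), (2, 1)],
]

# Missing keys start at 0 because int() returns 0
toy_totals = collections.defaultdict(int)

# chain.from_iterable walks the tuples of all documents as one sequence
for word_id, word_count in itertools.chain.from_iterable(toy_corpus):
    toy_totals[word_id] += word_count

print(sorted(toy_totals.items(), key=lambda w: w[1], reverse=True))
# [(1, 4), (0, 2), (2, 1)]
```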

In [27]:
## collections and itertools are part of the Python standard library, so no installation is needed
import collections
import itertools

In [28]:
# Create a defaultdict called total_word_count in which the keys are all the token ids (word_id) and the values are the sum of 
#their occurrence across all documents (word_count).
total_word_count = collections.defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

In [29]:
# Create a sorted list from the defaultdict: sorted_word_count
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

"" 1042
computer 594
software 450
`` 345
cite 322


# <center> Tf-idf with gensim
#### What is tf-idf?
- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stopwords
- These words should be down-weighted in importance
- Example from astronomy: "Sky"

#### Tf-idf formula

 <img src="https://plumbr.io/app/uploads/2016/06/tf-idf.png" >  

<b>Term frequency </b>= percentage share of the word compared to all tokens in the document 
    
<b>Inverse document frequency</b> = logarithm of the total number of documents in the corpus divided by the number of documents containing the term

Observations:
- The WEIGHT will be low if the term doesn't appear often in the document, because the tf term will then be low.
- The WEIGHT will be low if the logarithm is close to zero, meaning the ratio inside the logarithm is close to 1 (remember log(1) = 0).
    - This happens when the term appears in almost every document of the corpus.
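
The definitions above can be worked through by hand. The sketch below computes tf × log(N / df) on a tiny made-up astronomy corpus using only the standard library. The helper name `tf_idf` and the documents are assumptions for illustration, and the natural logarithm is used here. gensim's `TfidfModel` may apply a different log base and normalize each document's weights, so its numbers will generally differ from these.

```python
import math

# Hypothetical corpus: "sky" appears in every document
docs = [
    ["sky", "star", "sky", "telescope"],
    ["sky", "planet"],
    ["sky", "galaxy", "galaxy"],
]

def tf_idf(term, doc, all_docs):
    """tf * log(N / df); the term must occur in at least one document."""
    tf = doc.count(term) / len(doc)               # share of the document's tokens
    df = sum(1 for d in all_docs if term in d)    # documents containing the term
    idf = math.log(len(all_docs) / df)            # log(1) = 0 when df == N
    return tf * idf

sky_weight = tf_idf("sky", docs[0], docs)
telescope_weight = tf_idf("telescope", docs[0], docs)

print(sky_weight)
# 0.0
print(telescope_weight > sky_weight)
# True
```

Although "sky" is the most frequent word in the first document, its weight is zero because it appears in all three documents, while the rarer "telescope" receives a positive weight.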

Tf-idf with gensim example:

In [30]:
from gensim.models.tfidfmodel import TfidfModel

# Fit a tf-idf model on the bag-of-words corpus
tfidf = TfidfModel(corpus)
# Look up a document's weights by passing its bag-of-words to the tf-idf model, like a dictionary key
## each tuple is (token id, token weight)
tfidf[corpus[1]][:10]

[(24, 0.011385999579543369),
 (55, 0.04075401940563056),
 (56, 0.013668277699438486),
 (57, 0.010821777804552644),
 (63, 0.010821777804552644),
 (67, 0.017152112011352465),
 (75, 0.0336603857417326),
 (82, 0.012660668413599637),
 (94, 0.0063303342067998185),
 (98, 0.0168301928708663)]

In [31]:
# Calculate the tfidf weights of doc 3: tfidf_weights
tfidf_weights = tfidf[corpus[2]]

# Print the first five weights
print(tfidf_weights[:5])

[(44, 0.01862887035545794), (59, 0.020941657966372935), (64, 0.012381948203932928), (77, 0.004188331593274587), (78, 0.015923132108888866)]


In [32]:
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

crashes 0.5350890921415871
segmentation 0.2283353260175823
attempting 0.1910775853066664
crashed 0.1646427975820268
invalid 0.15923132108888866
