# <center> Word counts with bag-of-words

#### What is Bag-of-words?
- A basic method for finding topics in a text. First, tokens are created through tokenization, and then the occurrences of each token are counted.
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text

Basic example:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

In [2]:
text="The cat is in the box. The cat likes the box. The box is over the cat."
## Apply the Counter class to the tokens created with word_tokenize
counter=Counter(word_tokenize(text)) ## this needs pre-processing
## the result is a Counter object, which behaves like a dictionary
counter 

Counter({'The': 3,
         'cat': 3,
         'is': 2,
         'in': 1,
         'the': 3,
         'box': 3,
         '.': 3,
         'likes': 1,
         'over': 1})

In [3]:
## most_common returns a list of (token, frequency) tuples
counter.most_common(2)

[('The', 3), ('cat', 3)]
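
The dictionary-like behaviour of the counter can be sketched as follows. This is a minimal illustration, assuming a simple regular expression as a stand-in for `word_tokenize` so that it runs without NLTK data; the regex is an assumption made for this sketch, not part of the original example.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: a simple regex stands in for word_tokenize here;
# it splits the text into words and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
counter = Counter(tokens)

print(counter["cat"])                  # count of a token that is present
print(counter["dog"])                  # a missing token counts as 0, no KeyError
print(sum(counter.values()))           # total number of tokens counted
print(counter["The"], counter["the"])  # case-sensitive: counted separately
```

Because the counts are case-sensitive, `'The'` and `'the'` are tallied as two different tokens, which is one reason the raw counts need pre-processing.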

#### Building a Counter with bag-of-words

In [4]:
## using a Wikipedia article that was copied into a txt file
with open('article.txt','r',encoding='UTF-8') as file:
    article=file.read()
article[:500]

"'''Debugging''' is the process of finding and resolving of defects that prevent correct operation of computer software or a system.  \n\nNumerous books have been written about debugging (see below: #Further reading|Further reading), as it involves numerous aspects, including interactive debugging, control flow, integration testing, Logfile|log files, monitoring (Application monitoring|application, System Monitoring|system), memory dumps, Profiling (computer programming)|profiling, Statistical Proc"

In [5]:
# Import Counter
from collections import Counter
# Tokenize the article: tokens
tokens = word_tokenize(article)
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [word.lower() for word in tokens]
lower_tokens[15:21]

['prevent', 'correct', 'operation', 'of', 'computer', 'software']

In [6]:
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

##Article title: Debugging
# Print the 15 most common tokens
print(bow_simple.most_common(15))

[('the', 283), (',', 281), ('.', 170), ('of', 155), ("''", 124), ('to', 120), ('a', 110), ('``', 86), ('in', 82), ('and', 76), ('(', 75), (')', 75), ('debugging', 74), (':', 59), ('for', 50)]
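
To see how much of this ranking is noise, the printed list can be filtered by hand. The sketch below copies the 15 pairs from the output above and keeps only alphabetic tokens; the variable names `top15` and `alpha_top` are introduced here for illustration only.

```python
# The 15 most common tokens, copied from the output above.
top15 = [('the', 283), (',', 281), ('.', 170), ('of', 155), ("''", 124),
         ('to', 120), ('a', 110), ('``', 86), ('in', 82), ('and', 76),
         ('(', 75), (')', 75), ('debugging', 74), (':', 59), ('for', 50)]

# Keep only tokens made entirely of letters (drops punctuation and quote marks).
alpha_top = [(token, count) for token, count in top15 if token.isalpha()]
print(alpha_top)
```

Of the remaining tokens, only `'debugging'` says anything about the article's topic; the rest are common function words, which is what motivates the preprocessing steps in the next section.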


# <center> Simple text preprocessing
#### Why preprocess?
- Helps produce better input data when performing machine learning or other statistical methods
    
Some examples:
- Tokenization to create a bag of words
- Lowercasing words
- Lemmatization/Stemming: Shorten words to their root stems
- Removing stop words, punctuation, or unwanted tokens (examples: "the", "and", ".", ",", etc.)

    
<b>Recommendation:</b> It is good to experiment with different approaches
    
Basic example:

In [7]:
from nltk.corpus import stopwords
##text to use in this example
text

'The cat is in the box. The cat likes the box. The box is over the cat.'

In [8]:
# list comprehension to tokenize the text after lowercasing it
## use the str.isalpha method to keep only alphabetic tokens (this effectively drops tokens containing digits or punctuation)
tokens = [w for w in word_tokenize(text.lower()) 
                  if w.isalpha()]
print(tokens)

['the', 'cat', 'is', 'in', 'the', 'box', 'the', 'cat', 'likes', 'the', 'box', 'the', 'box', 'is', 'over', 'the', 'cat']


In [9]:
# if the stopwords corpus hasn't been downloaded before, download it
import nltk
#nltk.download('stopwords')

In [10]:
# list comprehension to remove words that are in the stop word list
# the English stop word list ships as an NLTK corpus - it needs to be downloaded the first time it is used

english_stops = set(stopwords.words('english'))  # build the set once for fast lookups
no_stops = [t for t in tokens 
                    if t not in english_stops]
print(no_stops)

['cat', 'box', 'cat', 'likes', 'box', 'box', 'cat']


In [11]:
##count the pre-processed words
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]
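
One way to experiment with different approaches is to wrap the steps in a small helper and switch them on and off. The sketch below is only an illustration: `bag_of_words` and `SAMPLE_STOPS` are hypothetical names, the regex is a stand-in for `word_tokenize`, and the tiny stop word set is hand-written for this example rather than taken from NLTK.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only;
# NLTK's stopwords.words('english') is much longer.
SAMPLE_STOPS = frozenset({"the", "is", "in", "over"})

def bag_of_words(text, lowercase=True, alpha_only=True, stop_words=frozenset()):
    """Count tokens after applying the selected preprocessing steps."""
    if lowercase:
        text = text.lower()
    # Assumption: a regex tokenizer standing in for word_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if alpha_only:
        tokens = [t for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    return Counter(tokens)

text = "The cat is in the box. The cat likes the box. The box is over the cat."

raw = bag_of_words(text, lowercase=False, alpha_only=False)  # no preprocessing
lowered = bag_of_words(text)                                 # lowercase + alphabetic only
cleaned = bag_of_words(text, stop_words=SAMPLE_STOPS)        # also drop stop words

print(raw.most_common(3))
print(lowered.most_common(3))
print(cleaned.most_common(3))
```

Comparing the three counters shows how each step changes the ranking: lowercasing merges `'The'` and `'the'`, and removing stop words leaves only the content words.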

#### Text preprocessing practice

In [12]:
#using lowercase words from the previous exercise
print(lower_tokens[:50])

["'", "''", 'debugging', "''", "'", 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', '.', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', '(', 'see', 'below', ':', '#', 'further', 'reading|further', 'reading', ')', ',', 'as', 'it', 'involves', 'numerous', 'aspects', ',', 'including', 'interactive']


In [13]:
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]
print(alpha_only[:50])

['debugging', 'is', 'the', 'process', 'of', 'finding', 'and', 'resolving', 'of', 'defects', 'that', 'prevent', 'correct', 'operation', 'of', 'computer', 'software', 'or', 'a', 'system', 'numerous', 'books', 'have', 'been', 'written', 'about', 'debugging', 'see', 'below', 'further', 'reading', 'as', 'it', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling']
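
Note that `isalpha()` discards a token entirely if it contains any non-letter character, so `'reading|further'` in the earlier output disappears rather than being split. The sketch below illustrates this on a few tokens taken from that output and shows one possible alternative, extracting alphabetic runs with a regular expression; that alternative is an assumption added here, not a step in the original exercise.

```python
import re

# A few tokens taken from the lowercase token list printed earlier.
sample = ["'", "''", "debugging", "(", "#", "further",
          "reading|further", "reading", ")", ","]

# str.isalpha() keeps a token only if every character is a letter.
alpha_only_sample = [t for t in sample if t.isalpha()]
print(alpha_only_sample)

# Alternative: pull out every run of letters, so the two words
# inside 'reading|further' are recovered instead of dropped.
recovered = [piece for t in sample for piece in re.findall(r"[a-z]+", t)]
print(recovered)
```

Which behaviour is preferable depends on the data: for wiki markup such as `reading|further`, splitting would count the link target and its label as separate words.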


In [14]:
# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in stopwords.words('english')]
print(no_stops[:50])

['debugging', 'process', 'finding', 'resolving', 'defects', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'books', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspects', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'files', 'monitoring', 'application', 'system', 'memory', 'dumps', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactics', 'improve', 'detection', 'simplifying', 'changes', 'origin', 'computer', 'log', 'entry']


In [15]:
## if the WordNet corpus hasn't been downloaded before, download it
import nltk
#nltk.download('wordnet')

In [16]:
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
print(lemmatized[:50])

['debugging', 'process', 'finding', 'resolving', 'defect', 'prevent', 'correct', 'operation', 'computer', 'software', 'system', 'numerous', 'book', 'written', 'debugging', 'see', 'reading', 'involves', 'numerous', 'aspect', 'including', 'interactive', 'debugging', 'control', 'flow', 'integration', 'testing', 'file', 'monitoring', 'application', 'system', 'memory', 'dump', 'profiling', 'computer', 'programming', 'statistical', 'process', 'control', 'special', 'design', 'tactic', 'improve', 'detection', 'simplifying', 'change', 'origin', 'computer', 'log', 'entry']


In [17]:
# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

[('debugging', 74), ('system', 47), ('bug', 31), ('software', 30), ('problem', 30), ('tool', 30), ('debugger', 26), ('process', 24), ('computer', 23), ('used', 22)]
