# Bag of Words

In [1]:
text = "How many roads must a man walk down Before you call him a man? Yes, and how many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannonballs fly Before they're forever banned? The answer, my friend, is blowing in the wind The answer is blowing in the wind"
print(text)

How many roads must a man walk down Before you call him a man? Yes, and how many seas must a white dove sail Before she sleeps in the sand? Yes, and how many times must the cannonballs fly Before they're forever banned? The answer, my friend, is blowing in the wind The answer is blowing in the wind


## Clean up text

### De-capitalize
Convert the text to lower case so that differently capitalized forms of the same word (e.g. "Before" and "before") are counted as one word.

In [2]:
text = text.lower() 

### Remove punctuation: option 1

In [7]:
clean_text = ''.join([c for c in text if c not in "?,."])
print(clean_text)

how many roads must a man walk down before you call him a man yes and how many seas must a white dove sail before she sleeps in the sand yes and how many times must the cannonballs fly before they're forever banned the answer my friend is blowing in the wind the answer is blowing in the wind


### Remove punctuation: option 2 - string.punctuation
- Use <code>string.punctuation</code>, a ready-made constant containing the ASCII punctuation characters.
- Note that <code>string.punctuation</code> is a <code>str</code>. Converting it to a <code>set</code> gives faster membership tests.
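A minimal sketch of the set-based variant mentioned in the note above. The `sample` string and the `punctuation` / `cleaned` names are made up for illustration; with only 32 punctuation characters the speed-up is likely modest, so treat this as a habit for larger lookup collections rather than a measured gain.

```python
import string

# Build the set once, outside the comprehension, so that each membership
# test is a hash lookup instead of a scan over the punctuation string.
punctuation = set(string.punctuation)

sample = "yes, and how many seas must a white dove sail?"  # illustrative input
cleaned = ''.join(c for c in sample if c not in punctuation)
print(cleaned)
```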

In [8]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [9]:
import string
print(f"punctuation: {string.punctuation}. type: {type(string.punctuation)}")
clean_text = ''.join([c for c in text if c not in string.punctuation])  # note: converting string.punctuation to a set first makes the lookup faster
print(clean_text)

punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. type: <class 'str'>
how many roads must a man walk down before you call him a man yes and how many seas must a white dove sail before she sleeps in the sand yes and how many times must the cannonballs fly before theyre forever banned the answer my friend is blowing in the wind the answer is blowing in the wind


### Remove punctuation: option 3 - str.translate
- Use <code>str.translate</code> together with <code>str.maketrans('', '', string.punctuation)</code>.
- <code>str.translate</code> is implemented in C (in CPython) and replaces characters through a lookup table, which makes it very efficient, especially for large texts.
- <code>str.maketrans</code> builds the mapping table that <code>str.translate</code> uses to replace the text character by character.

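<code>str.maketrans</code> accepts:
- option 1 (one argument): a dictionary mapping each character (or its Unicode code point) to a replacement string, a code point, or <code>None</code>, where <code>None</code> deletes the character
- option 2 (two arguments): two strings of equal length; each character of the first is mapped to the character at the same position in the second
- option 3 (three arguments): the same as option 2, plus a third string whose characters are all mapped to <code>None</code>, i.e. deleted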
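The three call forms listed above can be compared side by side. This is only an illustrative sketch; the short input strings and the `table_*` / `out_*` names are invented for the example.

```python
# One argument: a dict; a value of None deletes the character.
table_dict = str.maketrans({"a": "A", "?": None})
out_dict = "a man?".translate(table_dict)

# Two arguments: equal-length strings, mapped position by position.
table_pairs = str.maketrans("ab", "AB")
out_pairs = "a cab".translate(table_pairs)

# Three arguments: every character of the third string is deleted.
table_delete = str.maketrans("", "", "?,")
out_delete = "yes, and how?".translate(table_delete)

print(out_dict, out_pairs, out_delete, sep=" | ")

# The resulting table is a plain dict keyed by Unicode code points.
print(str.maketrans("", "", "?") == {ord("?"): None})
```

Whichever form is used, the table is keyed by integer code points, which is what <code>str.translate</code> looks up for each character.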

In [10]:
import string
print(f"punctuation: {string.punctuation}")
clean_text = text.translate(str.maketrans('', '', string.punctuation))
print(clean_text)

punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
how many roads must a man walk down before you call him a man yes and how many seas must a white dove sail before she sleeps in the sand yes and how many times must the cannonballs fly before theyre forever banned the answer my friend is blowing in the wind the answer is blowing in the wind


In [11]:
import string
# same result with the one-argument (dictionary) form of str.maketrans
clean_text = text.translate(str.maketrans({c: None for c in string.punctuation}))
print(clean_text)

how many roads must a man walk down before you call him a man yes and how many seas must a white dove sail before she sleeps in the sand yes and how many times must the cannonballs fly before theyre forever banned the answer my friend is blowing in the wind the answer is blowing in the wind


## Text tokenization

In [12]:
words = clean_text.split()
print(words)

['how', 'many', 'roads', 'must', 'a', 'man', 'walk', 'down', 'before', 'you', 'call', 'him', 'a', 'man', 'yes', 'and', 'how', 'many', 'seas', 'must', 'a', 'white', 'dove', 'sail', 'before', 'she', 'sleeps', 'in', 'the', 'sand', 'yes', 'and', 'how', 'many', 'times', 'must', 'the', 'cannonballs', 'fly', 'before', 'theyre', 'forever', 'banned', 'the', 'answer', 'my', 'friend', 'is', 'blowing', 'in', 'the', 'wind', 'the', 'answer', 'is', 'blowing', 'in', 'the', 'wind']


## Count Words

In [14]:
from collections import Counter
counts = Counter(words)
# print(counts)
v = []  # counts, in order of each word's first appearance
for i, (w, n) in enumerate(counts.items()):
    print(f"{i}. {w} – {n}")
    v.append(n)

0. how – 3
1. many – 3
2. roads – 1
3. must – 3
4. a – 3
5. man – 2
6. walk – 1
7. down – 1
8. before – 3
9. you – 1
10. call – 1
11. him – 1
12. yes – 2
13. and – 2
14. seas – 1
15. white – 1
16. dove – 1
17. sail – 1
18. she – 1
19. sleeps – 1
20. in – 3
21. the – 6
22. sand – 1
23. times – 1
24. cannonballs – 1
25. fly – 1
26. theyre – 1
27. forever – 1
28. banned – 1
29. answer – 2
30. my – 1
31. friend – 1
32. is – 2
33. blowing – 2
34. wind – 2


### Display document vector

In [15]:
import numpy as np
v = np.array(v)
print(v)

[3 3 1 3 3 2 1 1 3 1 1 1 2 2 1 1 1 1 1 1 3 6 1 1 1 1 1 1 1 2 1 1 2 2 2]
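The vector above follows the order in which words first appear in this one text. To compare several documents, their vectors need to share a single vocabulary. The following is a rough sketch under that assumption: the `tokenize` helper, the two-line `docs` list, and the alphabetical ordering are illustrative choices, not part of the notebook so far.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

# Two short example documents (illustrative only).
docs = [
    "The answer is blowing in the wind",
    "How many roads must a man walk down?",
]

counts = [Counter(tokenize(d)) for d in docs]

# Shared vocabulary: every distinct word across all documents, sorted
# so that position i means the same word in every vector.
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing words, so absent words become zeros.
vectors = [[c[w] for w in vocabulary] for c in counts]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each inner list can be passed to <code>np.array</code> if NumPy operations are needed afterwards.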
