## Bag of Words

- Bag of Words is a method that converts a text corpus into a numerical matrix so that machine learning methods can use it as training data.

| concept | description |
|---------|-------------|
| Corpus | a list of strings (text documents) |
| Tokenization | dividing a text into words (or other units) |
| Vectorization | converting text into numbers |
| Stop words | frequent words that carry little meaning |
| Stemming | cutting off word endings |
| n-grams | tokenizing into strings with n words |
| TF-IDF | method for normalizing counts |

* Understand:
    * What are word vectors
    * Why we need to transform text into word vectors
    * Term Frequency
    * Inverse Document Frequency
    * Difference between CV and TF-IDF
* Do:
    * Transform text to bag of words
    * Implement Count Vectorizer
    * Implement TF-IDF

**There is no order in a bag!**

You can quickly measure term importance!

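- Term frequency = number of times the term occurs in a document / total number of words in that document

TF-IDF:
- Inverse Document Frequency = log(total number of documents / number of documents in which the word appears)
- TF-IDF = Term frequency × Inverse Document Frequency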
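As a rough illustration of the two formulas above, the sketch below computes term frequency and inverse document frequency by hand. The three-sentence toy corpus and the helper names `term_frequency` and `inverse_document_frequency` are made up for this example, and the natural logarithm is assumed because the notes do not fix a base.

```python
import math

# Toy corpus, chosen only for illustration
docs = [
    "it was the best of times",
    "it was the worst of times",
    "pizza is pizza is pizza",
]
tokenized = [doc.lower().split() for doc in docs]


def term_frequency(term, tokens):
    """Occurrences of the term divided by the number of words in the document."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, documents):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / containing)


tf_pizza = term_frequency("pizza", tokenized[2])            # 3 of 5 words
idf_pizza = inverse_document_frequency("pizza", tokenized)  # log(3 / 1)
idf_was = inverse_document_frequency("was", tokenized)      # log(3 / 2)

print(tf_pizza, tf_pizza * idf_pizza, idf_was)
```

A word that appears in fewer documents ("pizza") gets a higher IDF than one that appears in most of them ("was"). Note that this plain formula divides by zero for a term that occurs in no document, which is one reason libraries usually apply some smoothing.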

---

#### What are word vectors?
* Bag of words represents our first attempt at word vectors: we transform text into a numerical representation.

In [1]:
corpus = []

In [2]:
corpus = ['Inertia is a property of matter',
'Pizza funghi with peppermint oil',
'FC Bayern rules number one',
'Peter piper picked a peck of pickled peppers',
'Game of Thrones made me forget my laptop',
'I dont know ask me in five minutes',
'Peter is a pickled pepper',
'Bayern is also agreed to not be the best',
'a counter is a special kind of dictionary',
'it was the best of times',
'it was the worst of times',
'pizza is pizza is pizza',
'a full class is better than an empty one']

In [3]:
corpus

['Inertia is a property of matter',
 'Pizza funghi with peppermint oil',
 'FC Bayern rules number one',
 'Peter piper picked a peck of pickled peppers',
 'Game of Thrones made me forget my laptop',
 'I dont know ask me in five minutes',
 'Peter is a pickled pepper',
 'Bayern is also agreed to not be the best',
 'a counter is a special kind of dictionary',
 'it was the best of times',
 'it was the worst of times',
 'pizza is pizza is pizza',
 'a full class is better than an empty one']

#### 1: Transform text to bag of words
* Columns represent the words of the vocabulary; rows represent the individual documents of the corpus (here sentences, elsewhere e.g. songs)
* The columns are the union of all distinct words across the whole corpus
* We can make a first attempt at a BOW with **collections.Counter**

In [4]:
from collections import Counter

In [5]:
c = Counter()
for sentence in corpus:
    for word in sentence.split():
        c[word.lower()] += 1

#### Counters are good at quickly measuring term importance: literally how many times each term turns up in the corpus

In [6]:
c

Counter({'inertia': 1,
         'is': 7,
         'a': 6,
         'property': 1,
         'of': 6,
         'matter': 1,
         'pizza': 4,
         'funghi': 1,
         'with': 1,
         'peppermint': 1,
         'oil': 1,
         'fc': 1,
         'bayern': 2,
         'rules': 1,
         'number': 1,
         'one': 2,
         'peter': 2,
         'piper': 1,
         'picked': 1,
         'peck': 1,
         'pickled': 2,
         'peppers': 1,
         'game': 1,
         'thrones': 1,
         'made': 1,
         'me': 2,
         'forget': 1,
         'my': 1,
         'laptop': 1,
         'i': 1,
         'dont': 1,
         'know': 1,
         'ask': 1,
         'in': 1,
         'five': 1,
         'minutes': 1,
         'pepper': 1,
         'also': 1,
         'agreed': 1,
         'to': 1,
         'not': 1,
         'be': 1,
         'the': 3,
         'best': 2,
         'counter': 1,
         'special': 1,
         'kind': 1,
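         'dictionary': 1,
         'it': 2,
         'was': 2,
         'times': 2,
         'worst': 1,
         'full': 1,
         'class': 1,
         'better': 1,
         'than': 1,
         'an': 1,
         'empty': 1})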

In [7]:
c.most_common(5)

[('is', 7), ('a', 6), ('of', 6), ('pizza', 4), ('the', 3)]

---
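The `Counter` above sums over the whole corpus, so it gives one count per word rather than one row per sentence. A minimal sketch of the row-per-sentence layout described earlier, using only three of the sentences to keep the output readable (the variable names are illustrative):

```python
from collections import Counter

# Three sentences from the corpus, kept short for readability
docs = [
    "it was the best of times",
    "it was the worst of times",
    "pizza is pizza is pizza",
]

# One Counter per sentence instead of one Counter for the whole corpus
counts = [Counter(doc.lower().split()) for doc in docs]

# The columns are the sorted union of all words seen in any sentence
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing keys, so absent words become zeros
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, which is what makes the result usable as a feature matrix.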

#### OK, so we have turned textual data into a numerical word vector representation. But why would you do this? Two main reasons:
* We can run basic numerical analysis on the features in the vector space
    * Term frequency
    * Inverse Document Frequency
    * TF-IDF
* We can also pass the matrices into ML models, for example to classify the documents
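To make the second reason concrete, here is a hedged sketch of a text classifier built on word counts. The labels `food` and `football` are invented for this example (the corpus in these notes is unlabelled), and `MultinomialNB` is just one possible model choice, not something these notes prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: the labels are invented for this sketch
texts = [
    "Pizza funghi with peppermint oil",
    "pizza is pizza is pizza",
    "Peter is a pickled pepper",
    "FC Bayern rules number one",
    "Bayern is also agreed to not be the best",
]
labels = ["food", "food", "food", "football", "football"]

# The pipeline vectorizes the text and then fits the classifier on the counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# New text is mapped onto the vocabulary learned during fit
prediction = model.predict(["Bayern is number one"])
print(prediction)
```

Words in the new text that were never seen during `fit` are simply ignored, because they have no column in the learned vocabulary.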

---

### Fortunately sklearn has a simple way to implement bag of words, namely `CountVectorizer`:

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer()

In [10]:
vec = cv.fit_transform(corpus)

In [11]:
cv.vocabulary_

{'inertia': 20,
 'is': 21,
 'property': 45,
 'of': 33,
 'matter': 27,
 'pizza': 44,
 'funghi': 17,
 'with': 54,
 'peppermint': 38,
 'oil': 34,
 'fc': 13,
 'bayern': 4,
 'rules': 46,
 'number': 32,
 'one': 35,
 'peter': 40,
 'piper': 43,
 'picked': 41,
 'peck': 36,
 'pickled': 42,
 'peppers': 39,
 'game': 18,
 'thrones': 50,
 'made': 26,
 'me': 28,
 'forget': 15,
 'my': 30,
 'laptop': 25,
 'dont': 11,
 'know': 24,
 'ask': 3,
 'in': 19,
 'five': 14,
 'minutes': 29,
 'pepper': 37,
 'also': 1,
 'agreed': 0,
 'to': 52,
 'not': 31,
 'be': 5,
 'the': 49,
 'best': 6,
 'counter': 9,
 'special': 47,
 'kind': 23,
 'dictionary': 10,
 'it': 22,
 'was': 53,
 'times': 51,
 'worst': 55,
 'full': 16,
 'class': 8,
 'better': 7,
 'than': 48,
 'an': 2,
 'empty': 12}
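`vocabulary_` maps each token to its column index, and the indices follow the alphabetical order of the tokens. The words `a` and `i`, which the `Counter` picked up earlier, are missing here because the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters. The sketch below approximates that behaviour with the standard `re` module on two sentences from the corpus. It is an illustration of the idea, not the library's actual implementation.

```python
import re

# Default token pattern of CountVectorizer: two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

docs = [
    "Peter is a pickled pepper",
    "I dont know ask me in five minutes",
]

# Lowercase first, then tokenize, mirroring the default lowercase=True
tokens = {tok for doc in docs for tok in TOKEN_PATTERN.findall(doc.lower())}

# Column indices follow the alphabetical order of the tokens
vocabulary = {tok: idx for idx, tok in enumerate(sorted(tokens))}

print(vocabulary)
print("a" in vocabulary, "i" in vocabulary)  # False False
```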

#### Count Vectorizer makes some assumptions about the text it's receiving:
* How close words are to each other within a sentence doesn't matter
* The order of the words doesn't matter
* Semantic meaning is ignored (CV only counts tokens, so it also works on non-text data)
* By default, no preprocessing beyond lowercasing and tokenization is carried out
    * e.g. stop word removal / lemmatization (you'll learn more about this during the exercise on spaCy)
* 'Small' corpora will be used: the number of columns grows with the vocabulary (else the curse of dimensionality occurs)
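The point about word order can be checked directly. In the sketch below, two made-up sentences contain the same words in a different order: their bags of single words are identical, while bags of bigrams (n-grams with n = 2) differ. The helper `ngrams` is written for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-word strings found in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


first = "peter picked peppers".split()
second = "peppers picked peter".split()

# Same words, different order: the unigram bags are identical
same_unigrams = Counter(first) == Counter(second)

# Bigrams keep pairs of neighbouring words, so some order survives
same_bigrams = Counter(ngrams(first, 2)) == Counter(ngrams(second, 2))

print(same_unigrams, same_bigrams)  # True False
```

`CountVectorizer` exposes the same idea through its `ngram_range` parameter, at the cost of a much larger vocabulary.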

In [12]:
vec

<13x56 sparse matrix of type '<class 'numpy.int64'>'
	with 78 stored elements in Compressed Sparse Row format>

In [13]:
type(vec)

scipy.sparse.csr.csr_matrix

In [14]:
vec.data # the non-zero counts stored inside the sparse matrix

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)
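The `data` array only holds the non-zero counts. In the Compressed Sparse Row format it is accompanied by two more arrays that record where those values sit. A plain-Python sketch of the layout, using a small matrix invented for this example (the names `data`, `indices` and `indptr` match the attributes of a SciPy CSR matrix):

```python
# A small dense count matrix, invented for this sketch
dense = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [3, 0, 2, 0],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # where each row starts and ends inside `data`

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

print(data)     # [1, 3, 2]
print(indices)  # [1, 0, 2]
print(indptr)   # [0, 1, 1, 3]
```

Only three values are stored instead of twelve. For a bag of words, where most entries are zero, the saving is far larger.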

#### For now, remember it is a matrix and to visualise it call .todense() on it!
- A sparse matrix stores only the non-zero entries; .todense() converts it to a dense matrix with all the zeros filled in

In [15]:
# We can visualise it with .todense()
vec.todense() # Each row is one sentence, each column one word of the vocabulary

matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
         1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
         1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, ...

In [16]:
corpus[7]

'Bayern is also agreed to not be the best'

#### Let's pass this information into a DataFrame to better visualise the data

In [17]:
import pandas as pd

# The column indices in vocabulary_ are alphabetical, so the sorted words line up with the matrix columns
df = pd.DataFrame(vec.todense(),
                  columns=sorted(cv.vocabulary_))
df

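| | agreed | also | an | ask | bayern | be | best | better | class | counter | ... | rules | special | than | the | thrones | times | to | was | with | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |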


---

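#### TF-IDF
* Term Frequency multiplied by Inverse Document Frequency
* Normalises the counts according to originality: words that appear in many documents are weighted down
* IDF is the log of the ratio (total number of documents / number of documents in which the word appears)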


#### We can work out how often words occur, relatively speaking!

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer

In [19]:
tf = TfidfTransformer()

vec2 = tf.fit_transform(vec)

In [20]:
vec2.todense()

matrix([[0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.52266099, 0.3003968 , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.52266099, 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.3003968 , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.52266099, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , ...
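The non-zero values in the first row of this matrix (0.52266099 and 0.3003968) can be checked by hand. The sketch below assumes the default settings of `TfidfTransformer`, which use a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, rather than the plain log ratio, and then scale each row to unit Euclidean length. The document frequencies are counted manually from the corpus, and the helper `smoothed_idf` is named for this example only.

```python
import math

n_docs = 13  # documents in the corpus

# Document frequencies, counted by hand from the corpus above
doc_freq = {"inertia": 1, "is": 6, "property": 1, "of": 6, "matter": 1}

# Raw counts for 'Inertia is a property of matter' ('a' is dropped by the tokenizer)
counts = {"inertia": 1, "is": 1, "property": 1, "of": 1, "matter": 1}


def smoothed_idf(df, n):
    """IDF with add-one smoothing, matching TfidfTransformer's default settings."""
    return math.log((1 + n) / (1 + df)) + 1


weights = {t: c * smoothed_idf(doc_freq[t], n_docs) for t, c in counts.items()}

# L2 normalisation: divide by the Euclidean length of the row
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {t: w / norm for t, w in weights.items()}

for term, value in tfidf.items():
    print(term, round(value, 4))
```

The rare words (`inertia`, `property`, `matter`) end up with a higher weight than the common ones (`is`, `of`), even though each occurs exactly once in the sentence.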

In [21]:
df2 = pd.DataFrame(vec2.todense(),
                   columns=sorted(cv.vocabulary_))

In [22]:
df2

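| | agreed | also | an | ask | bayern | be | best | better | class | counter | ... | rules | special | than | the | thrones | times | to | was | with | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.459137 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.407095 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.472069 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.375982 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.385081 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.367546 | 0.367546 | 0.0 | 0.0 | 0.316959 | 0.367546 | 0.316959 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.281066 | 0.0 | 0.0 | 0.367546 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.463208 | ... | 0.0 | 0.463208 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.437247 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.387733 | 0.0 | 0.437247 | 0.0 | 0.437247 | 0.0 | 0.0 |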
