## "Bag of Words":
* One of the earliest ways of turning text into vectors: each document becomes a vector of word counts.
* The common approach for word vectorisation until 2013 (Mikolov et al).

#### Pros
* Works for any text
* Easy and fast to do
* Easy to understand
* Does not require a language model (just the corpus), i.e. no training

#### Cons
* Uses little linguistic knowledge (scikit-learn's built-in stop-word list is English only)
* All words are equally similar / dissimilar (discrete, orthogonal vectors)
* Order of words is ignored
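
To illustrate the last point, here is a minimal sketch using only the standard library. The helper `bag_of_words` is a hypothetical name for this example, and splitting on whitespace is a simplifying assumption, not what scikit-learn does internally.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each whitespace-separated token occurs."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Both sentences contain the same words, so their bags are identical
# even though the meaning is reversed.
print(a == b)  # True
```

Any model built on these counts cannot tell the two sentences apart.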

---

## Example Implementation in Python

---

### Step 1. Construct a Text Corpus
- This is basically what you end up with after all your scraping and cleaning.

In [1]:
CORPUS = ["yesterday all my troubles seemed so far away", #beatles
          "we all live in a yellow submarine yellow submarine", #beatles
          "when i find myself in times of trouble mother mary comes to me", #beatles
          "penny lane is in my ears and in my eyes", #beatles
          "here comes the sun and i say its alright little darling", #beatles
          "i look at the world and i notice its turning while my guitar gently weeps", #beatles
          "youre the one for me youre my ecstasy youre the one i need hey yeah ohh", #backstreet boys
          "you are my fire the one desire believe me when i say i want it that way", #backstreet boys
          "everybody rock your body everybody rock your body right backstreets back alright", #backstreet boys
          "show me the meaning of being lonely is this the feeling i need to walk with", #backstreet boys
          "now i can see that weve fallen apart from the way that it used to be yeah", #backstreet boys
          "that leaves you battered and broken so forgive me for my mixed emotions yeah yeah" #backstreet boys
]

In [2]:
CORPUS = [s.lower() for s in CORPUS]

In [3]:
LABELS = ['beatles'] * 6 + ['backstreet boys'] * 6

### Step 2. Vectorize the text input using the "Bag of Words" technique.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
cv = CountVectorizer(stop_words='english')

In [7]:
vec = cv.fit_transform(CORPUS)

In [8]:
vec

<12x57 sparse matrix of type '<class 'numpy.int64'>'
	with 64 stored elements in Compressed Sparse Row format>
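
To make `fit_transform` less of a black box, the following is a rough sketch of the idea in plain Python: tokenise, drop stop words, build a sorted vocabulary, then count. The functions `tokenize` and `count_vectorize` and the tiny stop-word set are hypothetical and only for illustration. The real `CountVectorizer` uses a much longer English stop-word list and returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

def tokenize(doc):
    # Keep tokens of two or more word characters, similar to
    # scikit-learn's default token pattern (single letters are dropped).
    return re.findall(r"\b\w\w+\b", doc.lower())

def count_vectorize(docs, stop_words=frozenset()):
    """Return (vocabulary, rows) where rows[i][j] counts term j in doc i."""
    tokenized = [[t for t in tokenize(d) if t not in stop_words] for d in docs]
    # The vocabulary is the sorted set of all remaining tokens.
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

toy_docs = ["we all live in a yellow submarine yellow submarine",
            "here comes the sun"]
toy_stop_words = {"we", "all", "in", "the", "here"}  # hypothetical tiny list

vocabulary, rows = count_vectorize(toy_docs, toy_stop_words)
print(vocabulary)  # ['comes', 'live', 'submarine', 'sun', 'yellow']
print(rows)        # [[0, 1, 2, 0, 2], [1, 0, 0, 1, 0]]
```

Most entries of such a matrix are zero, which is why scikit-learn stores only the non-zero counts (64 of the 12 x 57 cells in the output above).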

In [9]:
vec.todense()

matrix([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
     

In [10]:
import pandas as pd

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(vec.todense(), columns=cv.get_feature_names_out(), index=LABELS)

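| | alright | apart | away | backstreets | battered | believe | body | broken | comes | darling | ... | walk | want | way | weeps | weve | world | yeah | yellow | yesterday | youre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| beatles | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| beatles | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |
| beatles | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| beatles | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| beatles | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| beatles | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| backstreet boys | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 |
| backstreet boys | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| backstreet boys | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| backstreet boys | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |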


### Step 3. Apply Tf-Idf Transformation (Normalization)

* TF - Term Frequency (how often a word $w$ occurs in doc $d$, as a raw count or as a share of the doc's words)
* IDF - Inverse Document Frequency

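$TFIDF(w,d) = TF(w,d) \cdot IDF(w)$

$IDF(w) = \ln\left(\frac{1 + N}{1 + DF(w)}\right) + 1$

where $N$ is the number of documents and $DF(w)$ is the number of documents containing the word $w$. Note that IDF depends only on the word, not on the individual document.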
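
As a worked example with the 12-document corpus above (taking the logarithm as the natural log, which is what scikit-learn uses): "yesterday" occurs in 1 document and "yeah" in 3, so

$IDF(yesterday) = \ln\left(\frac{1 + 12}{1 + 1}\right) + 1 = \ln(6.5) + 1 \approx 2.87$

$IDF(yeah) = \ln\left(\frac{1 + 12}{1 + 3}\right) + 1 = \ln(3.25) + 1 \approx 2.18$

The rarer word gets the larger weight.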

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse doc frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean (L2) norm (`numpy.linalg.norm`)
    * $v_{norm} = \frac{v}{||v||_2}$

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
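
The steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not scikit-learn's implementation: the function name `tfidf` and the two-document count matrix are hypothetical, TF is taken as the raw count (the `TfidfTransformer` default), and the IDF is the smoothed variant from the formula above.

```python
import math

def tfidf(count_rows):
    """Turn rows of raw term counts into L2-normalised TF-IDF rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (natural log).
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_rows:
        # Multiply term frequency by IDF for each term.
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Divide by the Euclidean norm so every non-empty row has length 1.
        norm = math.sqrt(sum(x * x for x in weighted))
        result.append([x / norm if norm else 0.0 for x in weighted])
    return result

# Hypothetical count matrix: 2 documents, 5 terms.
counts = [[0, 1, 2, 0, 2],
          [1, 0, 0, 1, 0]]

for row in tfidf(counts):
    print([round(x, 3) for x in row])
# [0.0, 0.333, 0.667, 0.0, 0.667]
# [0.707, 0.0, 0.0, 0.707, 0.0]
```

In this toy case every term occurs in exactly one document, so all IDF weights are equal and only the normalisation changes the counts. With a larger corpus, terms shared by many documents are weighted down.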

### Step 4. Put everything together in a pipeline