## "Bag of Words":
* One of the earliest ways to turn text into numeric vectors: each document becomes a vector of word counts.
* The common approach for text vectorisation until word2vec arrived in 2013 (Mikolov et al.).

#### Pros
* Works for any text
* Easy and fast to do
* Easy to understand
* Does not require a language model (just the corpus), i.e. no training

#### Cons
* Does not apply language knowledge (the only built-in stop-word list in scikit-learn is English)
* All words are equally similar / dissimilar (discrete, orthogonal vectors)
* Order of words is ignored
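
To make the "orthogonal vectors" drawback concrete, here is a minimal sketch using a made-up three-word vocabulary. The names `VOCAB`, `one_hot` and `cosine` are illustrative and not part of the notebook. Each word gets its own dimension, so the cosine similarity between any two different words is 0, no matter how related they are.

```python
import math

# Hypothetical toy vocabulary, chosen only for illustration.
VOCAB = ["trouble", "troubles", "submarine"]

def one_hot(word):
    """Return the one-hot vector of `word` over VOCAB."""
    return [1.0 if w == word else 0.0 for w in VOCAB]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Near-synonyms come out exactly as dissimilar as unrelated words.
sim_related = cosine(one_hot("trouble"), one_hot("troubles"))
sim_unrelated = cosine(one_hot("trouble"), one_hot("submarine"))
print(sim_related, sim_unrelated)  # 0.0 0.0
```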

---

## Example Implementation in Python

---

### Step 1. Construct a Text Corpus
- This is basically what you end up with after all your scraping and cleaning.

In [1]:
CORPUS = ["yesterday all my troubles seemed so far away", #beatles
          "we all live in a yellow submarine yellow submarine", #beatles
          "when i find myself in times of trouble mother mary comes to me", #beatles
          "penny lane is in my ears and in my eyes", #beatles
          "here comes the sun and i say its alright little darling", #beatles
          "i look at the world and i notice its turning while my guitar gently weeps", #beatles
          "youre the one for me youre my ecstasy youre the one i need hey yeah ohh", #backstreet boys
          "you are my fire the one desire believe me when i say i want it that way", #backstreet boys
          "everybody rock your body everybody rock your body right backstreets back alright", #backstreet boys
          "show me the meaning of being lonely is this the feeling i need to walk with", #backstreet boys
          "now i can see that weve fallen apart from the way that it used to be yeah", #backstreet boys
          "that leaves you battered and broken so forgive me for my mixed emotions yeah yeah" #backstreet boys
]

In [2]:
type(CORPUS)

list

In [2]:
CORPUS = [s.lower() for s in CORPUS]

In [4]:
LABELS = ['beatles'] * 6 + ['backstreet boys'] * 6

In [5]:
type(LABELS)

list

### Step 2. Vectorize the text input using the "Bag of Words" technique.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1))

In [6]:
vec = cv.fit_transform(CORPUS)

In [7]:
import pandas as pd

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(vec.todense(), columns=cv.get_feature_names_out(), index=LABELS)

Unnamed: 0,alright,apart,away,backstreets,battered,believe,body,broken,comes,darling,...,walk,want,way,weeps,weve,world,yeah,yellow,yesterday,youre
beatles,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
beatles,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beatles,1,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
backstreet boys,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,3
backstreet boys,0,0,0,0,0,1,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
backstreet boys,1,0,0,1,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
backstreet boys,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
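
The table above is essentially a word count per document. As a rough pure-Python sketch of the idea, assuming plain whitespace tokenisation and a tiny hand-picked stop-word set (`STOP_WORDS` here is illustrative and far smaller than scikit-learn's English list):

```python
from collections import Counter

docs = [
    "we all live in a yellow submarine yellow submarine",
    "here comes the sun and i say its alright little darling",
]
# Illustrative stop words only; not scikit-learn's actual list.
STOP_WORDS = {"we", "all", "in", "a", "the", "and", "i", "its", "here"}

# Count the remaining tokens of each document.
counts = [Counter(t for t in d.split() if t not in STOP_WORDS) for d in docs]

# The vocabulary is the sorted set of all remaining tokens (one column each).
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary word.
matrix = [[c[w] for w in vocab] for c in counts]
print(vocab)
print(matrix)
```

`CountVectorizer` does the same job, but with its own tokeniser and stop-word list, and it returns a sparse matrix instead of nested lists.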


Other things you can do with CountVectorizer:
- play around with the `ngram_range` hyperparameter: e.g. `ngram_range=(1, 2)` also counts pairs of adjacent words (bigrams), so the model can "learn" words that commonly appear together
- a wide `ngram_range` creates lots of features, which tends to lead to overfitting
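
A small sketch of the effect of n-grams on the feature count, using two lines from the corpus. The variable names are illustrative; `ngram_range=(1, 2)` asks for unigrams plus bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "we all live in a yellow submarine yellow submarine",
    "here comes the sun and i say its alright little darling",
]

# Single words only (the default).
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
# Single words plus pairs of adjacent words.
uni_and_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# vocabulary_ maps each feature (word or phrase) to its column index.
n_uni = len(unigrams.vocabulary_)
n_both = len(uni_and_bigrams.vocabulary_)
print(n_uni, n_both)
print("yellow submarine" in uni_and_bigrams.vocabulary_)
```

Even on two short lines the vocabulary grows noticeably once bigrams are included, which is why large n-gram ranges on small corpora invite overfitting.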

How to reduce the number of words / features:
- removing stop words
- feature selection techniques that we've already seen (e.g. selection based on feature importance)
- other dimensionality reduction techniques, e.g. PCA
- Lemmatization (tomorrow)
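
Two of these options can be sketched in a few lines. Note the assumptions: the example uses `TruncatedSVD` rather than PCA itself, because it accepts the sparse count matrix directly, and `n_components=2` is an arbitrary choice for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "yesterday all my troubles seemed so far away",
    "we all live in a yellow submarine yellow submarine",
    "penny lane is in my ears and in my eyes",
    "here comes the sun and i say its alright little darling",
]

# Option 1: drop stop words and compare the number of columns.
with_stop = CountVectorizer().fit_transform(docs)
without_stop = CountVectorizer(stop_words="english").fit_transform(docs)
print(with_stop.shape, without_stop.shape)

# Option 2: project the remaining columns onto 2 components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(without_stop)
print(reduced.shape)
```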

### Step 3. Apply Tf-Idf Transformation (Normalization)

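* TF - Term Frequency (how often a word $w$ occurs in doc $d$; scikit-learn uses the raw count)
* IDF - Inverse Document Frequency (down-weights words that occur in many documents)

$TFIDF(w,d) = TF(w,d) \cdot IDF(w)$

$IDF(w) = \ln\left(\frac{1 + N}{1 + DF(w)}\right) + 1$

where $N$ is the number of documents and $DF(w)$ is the number of documents containing word $w$.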

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse doc frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{||v||_2}$
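
The steps above can be traced by hand on two lines from the corpus. This is only a sketch under stated assumptions: stop words were removed manually, TF is the raw count, and the logarithm is the natural log, which is how the linked scikit-learn documentation describes the defaults.

```python
import math
from collections import Counter

# Two corpus lines with stop words already removed by hand (an assumption
# made to keep the example small).
docs = [
    ["yesterday", "troubles", "far", "away"],
    ["live", "yellow", "submarine", "yellow", "submarine"],
]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: in how many documents does each word occur?
df = {w: sum(w in d for d in docs) for w in vocab}
# Smoothed IDF, following the formula above.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

tfidf = []
for d in docs:
    tf = Counter(d)                            # raw term counts
    row = [tf[w] * idf[w] for w in vocab]      # TF * IDF per term
    norm = math.sqrt(sum(x * x for x in row))  # Euclidean (L2) norm
    tfidf.append([x / norm for x in row])      # rescale to unit length

print(round(tfidf[0][vocab.index("away")], 6))    # 0.5
print(round(tfidf[1][vocab.index("yellow")], 6))  # 0.666667
```

Every word here occurs in exactly one document, so all IDF values are equal and only the counts and the normalisation matter: four words with count 1 give 1/2 each, and counts of 1, 2, 2 give a norm of 3, hence 2/3 for `yellow`.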

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

In [10]:
tf = TfidfTransformer()

In [11]:
vec2 = tf.fit_transform(vec)

In [12]:
# raw counts again, for comparison with the TF-IDF values in the next cell
pd.DataFrame(vec.todense(), columns=cv.get_feature_names_out(), index=LABELS)

Unnamed: 0,alright,apart,away,backstreets,battered,believe,body,broken,comes,darling,...,walk,want,way,weeps,weve,world,yeah,yellow,yesterday,youre
beatles,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
beatles,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
beatles,1,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
beatles,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
backstreet boys,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,3
backstreet boys,0,0,0,0,0,1,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
backstreet boys,1,0,0,1,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
backstreet boys,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [13]:
pd.DataFrame(vec2.todense(), columns=cv.get_feature_names_out(), index=LABELS)

Unnamed: 0,alright,apart,away,backstreets,battered,believe,body,broken,comes,darling,...,walk,want,way,weeps,weve,world,yeah,yellow,yesterday,youre
beatles,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
beatles,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.666667,0.0,0.0
beatles,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.394567,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beatles,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beatles,0.376156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.376156,0.437996,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
beatles,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.377964,0.0,0.377964,0.0,0.0,0.0,0.0
backstreet boys,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.207919,0.0,0.0,0.822208
backstreet boys,0.0,0.0,0.0,0.0,0.0,0.472713,0.0,0.0,0.0,0.0,...,0.0,0.472713,0.405972,0.0,0.0,0.0,0.0,0.0,0.0,0.0
backstreet boys,0.22371,0.0,0.0,0.260488,0.0,0.0,0.520975,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
backstreet boys,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.459434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Step 4. Put everything together in a pipeline
- Option A: Use a CountVectorizer + TfidfTransformer sequentially in a pipeline, OR
- Option B: Use a TfidfVectorizer in a single step.
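
As a quick check that the two options produce the same features, the sketch below builds both on a few corpus lines and compares the results. The names `option_a`, `X_a` and `X_b` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

docs = [
    "yesterday all my troubles seemed so far away",
    "we all live in a yellow submarine yellow submarine",
    "you are my fire the one desire believe me when i say i want it that way",
]

# Option A: count first, then reweight the counts.
option_a = make_pipeline(CountVectorizer(stop_words="english"), TfidfTransformer())
X_a = option_a.fit_transform(docs)

# Option B: both steps in a single estimator.
X_b = TfidfVectorizer(stop_words="english").fit_transform(docs)

same = np.allclose(X_a.toarray(), X_b.toarray())
print(same)
```

Option B is shorter; Option A can be handy when the raw counts are also needed elsewhere.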

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [15]:
vectorizer = TfidfVectorizer(stop_words='english')

In [16]:
X = vectorizer.fit_transform(CORPUS)

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [19]:
m = RandomForestClassifier(random_state=666)

In [20]:
m.fit(X, LABELS)

RandomForestClassifier(random_state=666)

In [21]:
m.score(X, LABELS)  # accuracy on the training data itself, not an estimate of generalisation

1.0

In [21]:
# m.predict(['yellow submarine'])

^ This won't work: the model was trained on the TF-IDF matrix `X`, so any new text has to pass through the same fitted vectorizer before it can be handed to `predict`.

One solution is to put everything, i.e. feature engineering and modeling, into a single pipeline.

In [22]:
from sklearn.pipeline import make_pipeline

In [23]:
pipeline = make_pipeline(TfidfVectorizer(stop_words='english'), RandomForestClassifier(max_depth=5, random_state=666))

In [24]:
pipeline.fit(CORPUS, LABELS)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(stop_words='english')),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=5, random_state=666))])

In [25]:
pipeline.predict_proba(['yellow submarine'])

array([[0.3715, 0.6285]])

In [26]:
pipeline.predict_proba(['backstreets yeah yeah yeah'])

array([[0.66066667, 0.33933333]])

In [27]:
pipeline.predict_proba(['chocolate'])  # an unseen word gives an all-zero feature vector, so every such input gets this same prediction

array([[0.462, 0.538]])

In [28]:
pipeline.predict_proba(['backstreets'])

array([[0.5145, 0.4855]])
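
The columns of `predict_proba` follow the order of the fitted `classes_` attribute, which scikit-learn sorts, so in the arrays above the first column should correspond to `'backstreet boys'` and the second to `'beatles'`. A small sketch for labelling the output, trained on only four corpus lines for brevity (so its probabilities will differ from the ones shown above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = [
    "we all live in a yellow submarine yellow submarine",
    "penny lane is in my ears and in my eyes",
    "everybody rock your body everybody rock your body right backstreets back alright",
    "show me the meaning of being lonely is this the feeling i need to walk with",
]
labels = ["beatles", "beatles", "backstreet boys", "backstreet boys"]

pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(max_depth=5, random_state=666),
)
pipe.fit(docs, labels)

# classes_ gives the column order used by predict_proba.
print(pipe.classes_)
proba = pipe.predict_proba(["yellow submarine"])[0]
for label, p in zip(pipe.classes_, proba):
    print(f"{label}: {p:.3f}")
```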