# Bag of Words
### A preprocessing method for language representation

Bag of Words is a method that converts a text corpus into a numerical matrix, so that machine learning methods can use it as training data.

| <p style="font-size: 15px">Concept</p>      | <p style="font-size: 15px">Description</p> |
| ----------- | ----------- |
| <p style="font-size: 15px">Corpus</p>      | <p style="font-size: 15px">a list of strings (text documents)</p>       |
| <p style="font-size: 15px">Tokenization</p>      | <p style="font-size: 15px">dividing a text into words (or other units)</p>       |
| <p style="font-size: 15px">Vectorization</p>      | <p style="font-size: 15px">converting text into numbers</p> |
| <p style="font-size: 15px">Stop words</p>      | <p style="font-size: 15px">frequent words that carry little meaning</p>      |
| <p style="font-size: 15px">Stemming</p>      | <p style="font-size: 15px">cutting off word endings</p>      |
| <p style="font-size: 15px">n-grams</p>      | <p style="font-size: 15px">tokenizing into strings with n words</p>  |
| <p style="font-size: 15px">TF-IDF</p>      | <p style="font-size: 15px">method for normalizing counts</p> |


---

### Learning Objectives:
* Why do we need to preprocess language for Machine Learning?
* What is a count vector?
* What is a TF-IDF vector?
* How do we apply this to the week's task?

---

### First off - we need a corpus!

In [1]:
coldplay = """Drink from me, drink from me (oh-ah-oh-ah)
Then we'll shoot across the (symphony)
Then we'll shoot across the sky
Drink from me, drink from me (oh-ah-oh-ah)
Then we'll shoot across the (oh-ah-oh-ah)
Symphony
(So high, so high)
Then we'll shoot across the sky
Oh, angels sent from up above
You know you make my world light up
When I was down, when I was hurt
You came to lift me up
Life is a drink, and love's a drug
Oh now I think I must be miles up
When I was hurt, withered, dried up
You came to rain a flood
So drink from me, drink from me
When I was so thirsty
We're on a symphony
Now I just can't get enough
Put your wings on me, wings on me
When I was so heavy
We're on a symphony
When I'm lower, lower, lower, low
Ah-oh-ah-oh-ah
Got me feeling drunk and high
So high, so high
Oh-ah-oh-ah-oh-ah
I'm feeling drunk and high
So high, so high
(Woo)
(Woo-ooo-ooo-woo)
Oh, angels sent from up above
I feel it coursing through my blood
Life is a drink, your love's about
To make the stars come out
Put your wings on me, wings on me
When I was so heavy
We're on a symphony
When I'm lower, lower, lower, low
Ah-oh-ah-oh-ah
Got me feeling drunk and high
So high, so high
Oh-ah-oh-ah-oh-ah
I'm feeling drunk and high
So high, so high
Ah-oh-ah-oh-ah
La, la, la, la, la, la, la
So high, so high
Ah-oh-ah-oh-ah
I'm feeling drunk and high
So high, so high
Then we'll shoot across the sky
Then we'll shoot across the
Then we'll shoot across the sky
Then we'll shoot across the (then we'll shoot)
Then we'll shoot across the sky
Then we'll shoot across the
Then we'll shoot across the sky
Then we'll shoot across the"""

---

In [2]:
masego = """Big white house with the chains in the mirror
Fender guitars and they hang in the air
Ladies are large, they look amazing in here
It's her birthday, got her panties in the air
Hit up the broad with the bangs in her hair
Hit up the mob and the gang and the men
Quit your job, there's a change in the air
Call me MJ, with the man in the mirror

Face down, laid up
Bust down, cake up
Ladies, get your weight up
Welcome to the cake club
Face down, laid up
Bust down, cake up
Ladies, get your weight up
Welcome to the cake club

Jiggle it, hit the room
Partna, pon it's way
Jiggle it, hit the room
Show time finally

Bring that back, you put your hands in the air
Just like that, yeah, you're famous in here
Do that that and make it rain over here
Happy birthday, it's some quakin' over here (yeah)
It's getting wild over here (yeah)
It's getting loud over here (yeah)
A big crowd over here
Get the first aid 'cause you're killing over here

Face down, laid up
Bust down, cake up
Ladies, get your weight up
Welcome to the cake club
Face down, laid up
Bust down, cake up
Ladies, get your weight up
Welcome to the cake club

Jiggle it, hit the room
Partna, pon it's way
Jiggle it, hit the room
Show time finally

Yeah (ahh)
Yeah (yeah)
Finally, finally
Yeah (finally, hee, hee, hee)
Listen"""

In [3]:
songs = []
songs.append(coldplay)
songs.append(masego)

In [4]:
artist_labels = ['coldplay', 'masego']

In [5]:
X = songs
y = artist_labels

### Why do we need to preprocess language?
* In short: nothing works if you don't! Models need numbers, not raw strings.

In [6]:
from sklearn.linear_model import LogisticRegression

In [7]:
def why_transform(X, y):
    """Try to fit a model directly on raw strings to show that it fails."""
    m = LogisticRegression()
    try:
        m.fit(X, y)
    except ValueError:
        print('We have a Value error!')

In [8]:
why_transform(X, y)

We have a Value error!


---

## Bag of Words:
* The first attempt at creating word vectors. 
* The common approach for word vectorisation until 2013 (Mikolov et al.)
* (These days we use Sesame Street characters: BERT, ELMo)
* Two types of BOW - Count Vectors and TF-IDF Vectors

---

### What is a Count Vectorizer?:
* A word-counter for every word in a given document

#### Steps to build
* Create a corpus
* Fit a CV on it
* Transform the corpus into a sparse matrix, then convert it into a dense one
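
The steps above can be sketched by hand in plain Python. This is only a rough approximation of what a count vectorizer does, not scikit-learn's implementation: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the two-line toy corpus are made up for illustration, and only the token pattern is taken from the `CountVectorizer` repr shown further down.

```python
import re
from collections import Counter

# Same default token pattern that CountVectorizer reports in its repr.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return all tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(corpus):
    """Map every distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(corpus, vocabulary):
    """Return one dense row of counts per document."""
    rows = []
    for doc in corpus:
        counts = Counter(tokenize(doc))
        # The vocabulary dict was built in column order, so iterating it
        # yields the tokens in the order of their column indices.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy corpus: two lines borrowed from the lyrics above.
toy_corpus = ["So high, so high", "Drink from me, drink from me"]
vocabulary = fit_vocabulary(toy_corpus)
matrix = transform(toy_corpus, vocabulary)

print(vocabulary)  # {'drink': 0, 'from': 1, 'high': 2, 'me': 3, 'so': 4}
print(matrix)      # [[0, 0, 2, 0, 2], [2, 2, 0, 2, 0]]
```

Each row is one document and each column is one word of the vocabulary, which is the same layout as the DataFrame built later in this notebook.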

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
cv = CountVectorizer() #stop_words='english'

In [11]:
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

`ngram_range`: if the upper bound is greater than 1, we consider not only single words but also sequences of words, e.g. the pair "New York".

`token_pattern='(?u)\\b\\w\\w+\\b'` captures all alphanumeric tokens with two or more characters (that means that "I" or "a" will not be included).

Whitespace separates the words for us in English and German; this does not work for languages such as Chinese.

### Override the token_pattern with your custom one!
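
As a rough illustration of what the token pattern does, the sketch below applies the default pattern and a custom one to a single lowercased lyric line using Python's `re` module. The custom pattern `(?u)\b\w+\b` is only an assumed example of an alternative that also keeps one-letter tokens; it is not a recommendation.

```python
import re

line = "life is a drink, and love's a drug"

default_pattern = r"(?u)\b\w\w+\b"  # the pattern shown in the repr above
custom_pattern = r"(?u)\b\w+\b"     # assumed alternative: also keep one-letter tokens

default_tokens = re.findall(default_pattern, line)
custom_tokens = re.findall(custom_pattern, line)

print(default_tokens)  # ['life', 'is', 'drink', 'and', 'love', 'drug']
print(custom_tokens)   # ['life', 'is', 'a', 'drink', 'and', 'love', 's', 'a', 'drug']
```

A custom pattern like this can be passed to the vectorizer through the `token_pattern` parameter listed in the repr, e.g. `CountVectorizer(token_pattern=custom_pattern)`.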

In [12]:
cv.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [53]:
cv.vocabulary_

{'drink': 34,
 'from': 46,
 'me': 89,
 'oh': 98,
 'ah': 3,
 'then': 122,
 'we': 133,
 'll': 81,
 'shoot': 113,
 'across': 2,
 'the': 121,
 'symphony': 119,
 'sky': 115,
 'so': 116,
 'high': 60,
 'angels': 9,
 'sent': 112,
 'up': 130,
 'above': 1,
 'you': 145,
 'know': 71,
 'make': 87,
 'my': 96,
 'world': 143,
 'light': 78,
 'when': 136,
 'was': 131,
 'down': 32,
 'hurt': 63,
 'came': 22,
 'to': 129,
 'lift': 77,
 'life': 76,
 'is': 65,
 'and': 8,
 'love': 84,
 'drug': 35,
 'now': 97,
 'think': 125,
 'must': 95,
 'be': 13,
 'miles': 91,
 'withered': 141,
 'dried': 33,
 'rain': 109,
 'flood': 45,
 'thirsty': 126,
 're': 110,
 'on': 99,
 'just': 69,
 'can': 23,
 'get': 48,
 'enough': 37,
 'put': 106,
 'your': 146,
 'wings': 139,
 'heavy': 56,
 'lower': 86,
 'low': 85,
 'got': 50,
 'feeling': 41,
 'drunk': 36,
 'woo': 142,
 'ooo': 100,
 'feel': 40,
 'it': 66,
 'coursing': 29,
 'through': 127,
 'blood': 16,
 'about': 0,
 'stars': 118,
 'come': 28,
 'out': 101,
 'la': 72,
 'big': 14,
 'whit

In [13]:
new_corpus = cv.transform(X)

In [14]:
new_corpus

<2x147 sparse matrix of type '<class 'numpy.int64'>'
	with 163 stored elements in Compressed Sparse Row format>

#### Sparse Matrix vs Dense Matrix
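With a larger corpus, most of the matrix consists of zeros (here, with only two songs, 163 of the 2 x 147 = 294 cells are still non-zero). A sparse matrix stores only the non-zero values to save memory. We need to convert it into a **dense** matrix to view it conveniently.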
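
The idea can be sketched with plain Python containers. Note the simplification: scikit-learn returns a Compressed Sparse Row matrix, while the dictionary of coordinates used here is just an easy-to-read stand-in, and `to_dense` is a hypothetical helper that mimics what `.todense()` gives us.

```python
# Dense representation: every cell is stored, including the zeros.
dense = [
    [0, 0, 2, 0, 2],
    [2, 2, 0, 2, 0],
]

# Sparse representation (coordinate style): keep only the non-zero cells.
sparse = {
    (row, col): value
    for row, cells in enumerate(dense)
    for col, value in enumerate(cells)
    if value != 0
}


def to_dense(sparse_cells, n_rows, n_cols):
    """Rebuild the full matrix, filling every missing cell with zero."""
    return [
        [sparse_cells.get((r, c), 0) for c in range(n_cols)]
        for r in range(n_rows)
    ]


print(len(sparse))                      # 5 stored cells instead of 10
print(to_dense(sparse, 2, 5) == dense)  # True
```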

In [20]:
import pandas as pd

In [25]:
songs[0]

"Drink from me, drink from me (oh-ah-oh-ah)\nThen we'll shoot across the (symphony)\nThen we'll shoot across the sky\nDrink from me, drink from me (oh-ah-oh-ah)\nThen we'll shoot across the (oh-ah-oh-ah)\nSymphony\n(So high, so high)\nThen we'll shoot across the sky\nOh, angels sent from up above\nYou know you make my world light up\nWhen I was down, when I was hurt\nYou came to lift me up\nLife is a drink, and love's a drug\nOh now I think I must be miles up\nWhen I was hurt, withered, dried up\nYou came to rain a flood\nSo drink from me, drink from me\nWhen I was so thirsty\nWe're on a symphony\nNow I just can't get enough\nPut your wings on me, wings on me\nWhen I was so heavy\nWe're on a symphony\nWhen I'm lower, lower, lower, low\nAh-oh-ah-oh-ah\nGot me feeling drunk and high\nSo high, so high\nOh-ah-oh-ah-oh-ah\nI'm feeling drunk and high\nSo high, so high\n(Woo)\n(Woo-ooo-ooo-woo)\nOh, angels sent from up above\nI feel it coursing through my blood\nLife is a drink, your love's a

In [22]:

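# note: on scikit-learn >= 1.0 use cv.get_feature_names_out(); get_feature_names() was removed in 1.2
df = pd.DataFrame(new_corpus.todense(), columns=cv.get_feature_names(), index=['coldplay', 'masego'])
df.head()
# first row is the coldplay song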

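| | about | above | across | ah | ahh | aid | air | amazing | and | angels | ... | white | wild | wings | with | withered | woo | world | yeah | you | your |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| coldplay | 1 | 2 | 12 | 24 | 0 | 0 | 0 | 0 | 6 | 2 | ... | 0 | 0 | 4 | 0 | 1 | 3 | 1 | 0 | 4 | 3 |
| masego | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 1 | 4 | 0 | ... | 1 | 1 | 0 | 3 | 0 | 0 | 0 | 8 | 3 | 6 |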


In [24]:
sorted(songs[0].lower().split())

['(oh-ah-oh-ah)',
 '(oh-ah-oh-ah)',
 '(oh-ah-oh-ah)',
 '(so',
 '(symphony)',
 '(then',
 '(woo)',
 '(woo-ooo-ooo-woo)',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'a',
 'about',
 'above',
 'above',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'across',
 'ah-oh-ah-oh-ah',
 'ah-oh-ah-oh-ah',
 'ah-oh-ah-oh-ah',
 'ah-oh-ah-oh-ah',
 'and',
 'and',
 'and',
 'and',
 'and',
 'and',
 'angels',
 'angels',
 'be',
 'blood',
 'came',
 'came',
 "can't",
 'come',
 'coursing',
 'down,',
 'dried',
 'drink',
 'drink',
 'drink',
 'drink',
 'drink',
 'drink',
 'drink,',
 'drink,',
 'drug',
 'drunk',
 'drunk',
 'drunk',
 'drunk',
 'drunk',
 'enough',
 'feel',
 'feeling',
 'feeling',
 'feeling',
 'feeling',
 'feeling',
 'flood',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'from',
 'get',
 'got',
 'got',
 'heavy',
 'heavy',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high',
 'high'

### Cons
* word order isn't recognized in BOW
* the number of columns gets big very quickly, especially with n-grams where n > 1
* raw counts in the CV do not show how distinctive a term is -  <span style="background-color:lightblue">**WE NEED TO NORMALIZE OUR VECTORS IN THE COLUMN AND THE ROW SPACE**</span> 
    * column space - how much of a 'fingerprint' does a particular word give a particular artist?
    * row space - long songs should have less value assigned to individual words than short songs
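
To get a feeling for how n-grams inflate the number of columns, here is a small hand-rolled sketch. The `ngrams` helper is hypothetical and only approximates how a vectorizer builds n-grams from the token list; the lyric line is taken from the Coldplay song above.

```python
import re


def ngrams(text, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


line = "So high, so high. I'm feeling drunk and high"

unigrams = set(ngrams(line, 1))
bigrams = set(ngrams(line, 2))

print(sorted(unigrams))         # ['and', 'drunk', 'feeling', 'high', 'so']
print(len(bigrams))             # 6 distinct word pairs
print(len(unigrams | bigrams))  # 11 columns when both are used
```

Even on one short line, using unigrams and bigrams together more than doubles the vocabulary. On the other hand, bigrams such as `'so high'` and `'high so'` keep a little of the word order that plain counts throw away.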


### Pros
* easy to understand
* quick to implement
* works surprisingly well for many NLP tasks

In [27]:
# tf (term frequency) is just what the count vectorizer produces

---

### The Tf-Idf Transformer:

* TF - Term Frequency (how often a word w occurs in doc d; scikit-learn uses the raw count)
* IDF - Inverse Document Frequency

$TFIDF(w,d) = TF(w,d) \cdot IDF(w)$

$IDF(w) = \log\left(\frac{1 + \text{no. of documents}}{1 + \text{no. of documents containing word } w}\right) + 1$

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse doc frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by the Euclidean norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{||v||_2}$
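
The steps above can be followed by hand on a tiny example. This sketch assumes the raw count as term frequency, the smoothed IDF formula given above with the natural logarithm, and L2 normalisation; the 2 x 3 count matrix is invented for illustration.

```python
import math

# Toy count matrix: 2 documents (rows) x 3 terms (columns).
counts = [
    [3, 0, 1],
    [2, 1, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents does each term occur?
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency, following the formula above.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalise(vector):
    """Divide a vector by its Euclidean length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


# Multiply TF by IDF term by term, then normalise each document vector.
tfidf = [l2_normalise([tf * w for tf, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(x, 2) for x in row])
```

The first term occurs in both documents, so its IDF is exactly 1, while the other two terms get a higher weight. After normalisation the squared entries of each row sum to 1, which is not the same as the entries themselves summing to 1.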

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer

In [30]:
tf = TfidfTransformer()

In [31]:
tf.fit(new_corpus) # not on the original corpus, but on the count vectorizer corpus

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [32]:
tf_corpus = tf.transform(new_corpus)

In [35]:
df2 = pd.DataFrame(tf_corpus.todense().round(2), columns=cv.get_feature_names(), index=['coldplay', 'masego'])
df2.head()

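| | about | above | across | ah | ahh | aid | air | amazing | and | angels | ... | white | wild | wings | with | withered | woo | world | yeah | you | your |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| coldplay | 0.02 | 0.03 | 0.2 | 0.41 | 0.0 | 0.0 | 0.0 | 0.0 | 0.07 | 0.03 | ... | 0.0 | 0.0 | 0.07 | 0.0 | 0.02 | 0.05 | 0.02 | 0.0 | 0.05 | 0.04 |
| masego | 0.0 | 0.0 | 0.0 | 0.0 | 0.03 | 0.03 | 0.12 | 0.03 | 0.08 | 0.0 | ... | 0.03 | 0.03 | 0.0 | 0.09 | 0.0 | 0.0 | 0.0 | 0.24 | 0.06 | 0.13 |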


In [38]:
df.loc['coldplay'].sum() # no normalization

340

In [40]:
df2.loc['coldplay'].sum() # does not sum up to 1: with norm='l2' the squared entries sum to 1, not the entries themselves

5.459999999999999

---

#### What happens with new words??
* we do not want to refit our model on new lyrics!

In [43]:
assert 'jovial' not in cv.get_feature_names()

In [50]:
new_song = ['i am very jovial songwriter']

In [51]:
pd.DataFrame(cv.transform(new_song).todense(), columns=cv.get_feature_names())
# the vectorizer silently ignores words it has not seen during fit, such as "jovial"

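| | about | above | across | ah | ahh | aid | air | amazing | and | angels | ... | white | wild | wings | with | withered | woo | world | yeah | you | your |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |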


In [49]:
pd.DataFrame(cv.transform(new_song).todense(), columns=cv.get_feature_names()).sum(axis=1)


0    0
dtype: int64
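
Why the row is all zeros can be seen in a small hand-written sketch of the transform step. The five-word `vocabulary` below is a made-up stand-in for `cv.vocabulary_`, and `transform` is a hypothetical helper, not scikit-learn's code.

```python
import re
from collections import Counter

# Hypothetical fitted vocabulary (token -> column index).
vocabulary = {"drink": 0, "from": 1, "high": 2, "me": 3, "so": 4}


def transform(doc, vocabulary):
    """Count only the tokens that exist in the fitted vocabulary."""
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        row[vocabulary[tok]] = n
    return row


unseen = transform("i am very jovial songwriter", vocabulary)
partly_seen = transform("a jovial drink for me", vocabulary)

print(unseen)       # [0, 0, 0, 0, 0]
print(partly_seen)  # [1, 0, 0, 1, 0]
```

Unknown words simply have no column to land in, so they are dropped without an error. A document made up entirely of unknown words becomes a zero vector, which carries no information for the model.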

### How do we apply this to this week's task?

#### What is this week's task?
Take a string of lyrics and let the model guess who the artist is!

### Feature Engineering ("word engineering")
* download and save your corpus
* create labels for the corpus
* transform your lyrics into BOW document vectors
* delete some words - cf. the curse of dimensionality
* How can we decide what words to delete?!
    * ML solution - coefficients of the words, correlation matrix
    * all feature selection techniques should apply
    * domain expertise solution - knowing a bit about language
    * remove stop words *(e.g. 'i am very similar to something which is also similar to me and that thing is batman' - important words here: similar, something, batman; we got rid of the stop words like i, am, to, is)*
    * standardize plural / singular differentiation - stemming 
    * more tomorrow
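
The two language-based ideas from the list, stop word removal and stemming, can be sketched as follows. Both parts are deliberately crude: the stop word set is hand-picked for the batman sentence (an assumption, not a standard list), and `naive_stem` only strips a plural "s", which is far simpler than a real stemmer.

```python
import re

# Hand-picked stop words for this one sentence; not a standard list.
STOP_WORDS = {"i", "am", "very", "to", "which", "is", "also",
              "me", "and", "that", "thing"}


def remove_stop_words(text):
    """Tokenize the text and drop every token found in STOP_WORDS."""
    tokens = re.findall(r"(?u)\b\w+\b", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def naive_stem(token):
    """Strip a plural 's' from longer words; keep words ending in 'ss'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


sentence = ("i am very similar to something which is also similar "
            "to me and that thing is batman")
kept = remove_stop_words(sentence)
stems = [naive_stem(tok) for tok in ["angels", "wings", "stars", "across"]]

print(kept)   # ['similar', 'something', 'similar', 'batman']
print(stems)  # ['angel', 'wing', 'star', 'across']
```

With `CountVectorizer`, the commented-out `stop_words='english'` option shown earlier in this notebook applies a built-in English stop word list instead of a hand-made one.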

### The rest - standard ML pipeline
* train-test split
* choose a model - random forest (RFC), logistic regression (LR), naive Bayes (NB), etc.
* train and evaluate it
* cross-validate

---

## To make your code shorter, you could use the TfidfVectorizer
* This does both steps (CountVectorizer and TfidfTransformer) in one. The reason we show both in the tutorial is that word vectors are easier to understand this way.

`from sklearn.feature_extraction.text import TfidfVectorizer`
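
A minimal sketch of the shortcut, assuming scikit-learn and NumPy are installed and all vectorizers are used with their default settings. The two-line toy corpus is made up from lyric fragments above; the comparison checks that the one-step and two-step routes give the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

toy_corpus = ["So high, so high", "Drink from me, drink from me, so thirsty"]

# Two steps: counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(toy_corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(toy_corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape)  # (2, 6): two documents, six distinct tokens
print(same)            # True
```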