## Bag of Words:
* One of the earliest and simplest ways of turning text into vectors.
* The common approach for word vectorisation until word2vec was introduced in 2013 (Mikolov et al.)

#### Pros
* Works for any text
* Easy and fast to do
* Does not require a language model (just the corpus)

#### Cons
* Does not apply language knowledge (apart from an optional stop-word list, which scikit-learn ships for English only)
* All words are equally similar / dissimilar (discrete, orthogonal vectors)
* Order of words is ignored

In [1]:
from sklearn.feature_extraction.text import CountVectorizer


In [5]:
corpus = ["I'm a ghost Living in a ghost town I'm a ghost Living in a ghost town You can look for me But I can't be found You can search for me I had to go underground Life was so beautiful Then we all got locked down Feel a like ghost Living in a ghost town Once this place was humming And the air was full of drumming The sound of cymbals crashing Glasses were all smashing Trumpets were all screaming Saxophones were blaring Nobody was caring if it's day or night I'm a ghost Living in a ghost town I'm going nowhere Shut up all alone So much time to lose Just staring at my phone Every night I am dreaming That you'll come and creep in my bed Please let this be over Not stuck in a world without end Preachers were all preaching Charities beseeching Politicians dealing Thieves were happy stealing Widows were all weeping There's no beds for us to sleep in Always had the feeling It will all come tumbling down I'm a ghost Living in a ghost town You can look for me But I can't be found We're all living in a ghost town Living in a ghost town We were so beautiful I was your man about town Living in this ghost town Ain't having any fun If I want a party It's a party of one",
          "following lyrics contain explicit language: The simulation just went bad, but you're the best I ever had Like hand prints in wet cement, she touch me, it's permanent In my head, in my head, I couldn't hear anything you said, but In my head, in my head, I'm callin' you 'girlfriend,' what the fuck? I don't do fake love, but I'll take some from you tonight I know I've got to go, but I might just miss the flight I can't stay forever, let's play pretend And treat this night like it'll happen again You'll be my bloody valentine tonight I'm overstimulated and I'm sad, I don't expect you to understand There's nothing less than true romance, or am I just makin' a mess? In my head, in my head, I'm lyin' naked with you, yeah In my head, in my head, I'm ready to die holding your hand I don't do fake love, but I'll take some from you tonight (Take some from you tonight) I know I've got to go, but I might just miss the flight I can't stay forever, let's play pretend And treat this night like it'll happen again You'll be my bloody valentine tonight I can't hide how I feel about you Inside, I gave everything up Tonight, if I could just have you Be my, be my, baby I can't hide how I feel about you (I cannot hide these feelings) Inside, I gave everything up (I cannot hide these feelings) Tonight, if I could just have you (I gave up everything for you) Be my, be my (I gave up everything), ayy I don't do fake love, but I'll take some from you tonight (Take some from you tonight) I know I've got to go, but I might just miss the flight I can't stay forever, let's play pretend And treat this night like it'll happen again You'll be my bloody valentine tonight Na-na-na, na-na-na, na-na-na (Just tonight) Na-na-na, na-na-na, na-na-na (Just tonight) Na-na-na, na-na-na, na-na-na (Just tonight) Na-na-na, na-na-na, na-na-na (Just tonight) Were we on two track?"]

In [6]:
quotes = ['single quote marks handle normal text',
         "double quote marks handle text that has 'quotes in it'"]

---

### The Count Vectorizer:
#### Steps to build
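* Create a corpus (a list of documents, one string per document)
* Fit a CountVectorizer (CV) on it to learn the vocabulary
* Transform the corpus into a sparse matrix of word counts, then convert it to a dense matrix for viewing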

In [7]:
cv = CountVectorizer()

In [8]:
cv

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [9]:
cv.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [10]:
corpus_vecs = cv.transform(corpus)

#### Sparse Matrix
Most of our matrix consists of zeroes. A Sparse Matrix only stores the non-zero values to save memory. We need to convert it into a **dense** matrix to view it effectively.

In [13]:
corpus_vecs.todense()

matrix([[ 1,  0,  1,  1,  8,  1,  1,  1,  2,  1,  0,  1,  0,  0,  0,  3,
          2,  1,  1,  1,  0,  1,  0,  2,  0,  5,  0,  1,  0,  1,  2,  0,
          0,  0,  1,  1,  1,  1,  1,  0,  0,  0,  2,  1,  1,  1,  0,  1,
          0,  0,  0,  0,  1,  1,  0,  0,  0,  4,  0,  2,  0,  0,  1,  1,
          0, 13,  0,  1,  1,  1,  1,  2,  0,  0,  1,  0,  1,  0,  0,  0,
          0,  0,  1,  2, 11,  0,  3,  1,  0,  0,  0,  1,  1,  1,  8,  1,
          1,  2,  1,  0,  0,  0,  0,  1,  3,  0,  0,  0,  1,  2,  0,  0,
          2,  1,  1,  1,  0,  1,  3,  0,  1,  1,  1,  1,  0,  2,  0,  1,
          1,  0,  1,  1,  1,  1,  0,  0,  1,  0,  0,  0,  0,  1,  1,  1,
          0,  1,  0,  1,  1,  3,  0,  1,  1,  0,  1,  1,  0,  0,  1,  3,
          1,  1,  0,  1,  3,  1,  3,  0,  0,  9,  0,  0,  0,  1,  1,  0,
          1,  0,  1,  1,  0,  0,  1,  5,  3,  1,  0,  7,  0,  0,  1,  1,
          0,  1,  1,  0,  4,  1],
        [ 2,  3,  0,  0,  0,  0,  0,  1,  4,  0,  1,  0,  1,  1,  1,  7,
          0,  0, 
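To make the difference between sparse and dense storage concrete, here is a minimal sketch on a two-document toy corpus. The corpus and the names `toy_corpus` and `toy_vecs` are made up for illustration and are not part of the notebook above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, chosen so the full matrix is easy to read
toy_corpus = ["ghost town", "bloody valentine tonight"]
toy_vecs = CountVectorizer().fit_transform(toy_corpus)

print(toy_vecs.shape)      # (documents, vocabulary size)
print(toy_vecs.nnz)        # number of non-zero entries actually stored
print(toy_vecs.toarray())  # dense view, zeros included
```

Only the non-zero counts are stored in the sparse matrix; the dense view fills in every zero, which is why it grows quickly with the vocabulary.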

In [14]:
import pandas as pd


In [18]:
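# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use cv.get_feature_names() instead
df = pd.DataFrame(corpus_vecs.todense(), columns=cv.get_feature_names_out(), index=['eminem', 'aretha franklin'])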
df

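|  | about | again | ain | air | all | alone | always | am | and | any | ... | wet | what | widows | will | with | without | world | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eminem | 1 | 0 | 1 | 1 | 8 | 1 | 1 | 1 | 2 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 4 | 1 |
| aretha franklin | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 18 | 1 |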


In [16]:
#document vectors

### Downsides:
* Word order is ignored
* Every word counts equally - 'the' is as important as 'death'
* It's inefficient! - there is one column per vocabulary word, and most entries are zero (sparse vectors) - "curse of dimensionality"
* Word vectors are orthogonal - 'cat' is as similar to 'dog' as 'cat' is to 'widow'
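Two of these downsides can be checked directly. The sketch below uses made-up sentences and one-hot vectors built with `np.eye`; it is an illustration under those assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# 1) Word order is ignored: two sentences with opposite meaning get the same vector
docs = ["dog bites man", "man bites dog"]  # hypothetical example sentences
vecs = CountVectorizer().fit_transform(docs).toarray()
same_vector = np.array_equal(vecs[0], vecs[1])
print(same_vector)

# 2) Orthogonality: in a bag of words each word owns one dimension (a one-hot vector),
#    so the dot product between any two different words is zero
words = ["cat", "dog", "widow"]
cat, dog, widow = np.eye(len(words), dtype=int)
print(cat @ dog, cat @ widow)
```

Both dot products come out the same, so the representation cannot express that 'cat' is closer in meaning to 'dog' than to 'widow'.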

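**A downside of the Count Vectorizer is that it does not take into account how distinctive a word is: a word that appears in every document gets the same weight as one that appears in only a single document. This is where TF-IDF comes in.**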

---

### The Tf-Idf Transformer:

* TF - Term Frequency (count of a word w in doc d)
* IDF - Inverse Document Frequency

$TFIDF(w,d) = TF(w,d) \cdot IDF(w)$

$IDF(w) = \log\left(\frac{1 + \text{no. of documents}}{1 + \text{no. of documents containing word } w}\right) + 1$

##### The steps for calculating TFIDF are:
* For each vector:
    * Calculate the term frequency for each term in the vector
    * Calculate the inverse doc frequency for each term in the vector
    * Multiply the two for each term in the vector
* Then normalise each vector by its Euclidean (L2) norm (numpy.linalg.norm)
    * $v_{norm} = \frac{v}{||v||_2}$
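The steps above can be sketched by hand with NumPy and compared against `TfidfTransformer` with its default settings. The toy corpus and the variable names are assumptions made for this illustration; the logarithm is the natural log.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, small enough to follow by hand
toy_corpus = ["ghost town ghost", "bloody valentine town"]
sparse_counts = CountVectorizer().fit_transform(toy_corpus)
counts = sparse_counts.toarray()  # term frequency: raw counts per document

# Inverse document frequency with the +1 smoothing from the formula above
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then divide each row by its Euclidean norm
tfidf = counts * idf
manual = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with scikit-learn's result
reference = TfidfTransformer().fit_transform(sparse_counts).toarray()
print(np.allclose(manual, reference))
```

After normalisation every document vector has length 1, so long and short documents become comparable.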

Check out the math behind TFIDF:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

In [20]:
tf = TfidfTransformer()

In [21]:
tf_vec = tf.fit_transform(corpus_vecs)

In [30]:
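# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use cv.get_feature_names() instead
df2 = pd.DataFrame(tf_vec.todense().round(2), columns=cv.get_feature_names_out(), index=['eminem', 'aretha franklin'])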

In [31]:
df2

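|  | about | again | ain | air | all | alone | always | am | and | any | ... | wet | what | widows | will | with | without | world | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eminem | 0.03 | 0.0 | 0.04 | 0.04 | 0.31 | 0.04 | 0.04 | 0.03 | 0.05 | 0.04 | ... | 0.0 | 0.0 | 0.04 | 0.04 | 0.0 | 0.04 | 0.04 | 0.0 | 0.11 | 0.03 |
| aretha franklin | 0.03 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 | 0.06 | 0.0 | ... | 0.02 | 0.02 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.02 | 0.26 | 0.01 |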


In [29]:
df

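|  | about | again | ain | air | all | alone | always | am | and | any | ... | wet | what | widows | will | with | without | world | yeah | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eminem | 1 | 0 | 1 | 1 | 8 | 1 | 1 | 1 | 2 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 4 | 1 |
| aretha franklin | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | ... | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 18 | 1 |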


In [32]:
new_corpus = ['work to live', 'live to work']

In [39]:
cv2 = CountVectorizer(ngram_range=(1,3)) # counts single words as well as bigrams and trigrams
vec_work = cv2.fit_transform(new_corpus)

In [37]:
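# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use cv2.get_feature_names() instead
df3 = pd.DataFrame(vec_work.todense(), columns=cv2.get_feature_names_out())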

In [38]:
df3

|  | live | live to | live to work | to | to live | to work | work | work to | work to live |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |


### OK, so how could we use this to predict an artist with a classification model we have already seen?

**First, add a labels column to your dataframe by factorizing the artist name**

In [None]:
# scikit-learn classifiers generally accept string labels as they are,
# but if in doubt, converting them to numbers is a safe default:
# pd.factorize(label_column)


# remove filler words (English stop words): cv = CountVectorizer(stop_words='english')

**Now follow the normal workflow for training and predicting a model**

*In this case, the only difference is that new data (a song lyric) must first be passed through the fitted vectorizers*

*Remember not to refit the vectorizers on the new data; only use them to transform it*
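A minimal sketch of that workflow, assuming a tiny made-up training set and a Naive Bayes classifier. The lyrics, artist names and variable names below are all hypothetical, and any other classifier could be swapped in.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: one string per song, one artist name per song
train_lyrics = [
    "ghost town living alone",
    "living in a ghost town tonight",
    "bloody valentine fake love",
    "fake love bloody flight",
]
train_artists = pd.Series(["artist_a", "artist_a", "artist_b", "artist_b"])

# Factorize the artist names into integer labels (artist_names maps them back)
labels, artist_names = pd.factorize(train_artists)

# Fit the vectorizers on the training lyrics only
cv = CountVectorizer(stop_words='english')
tf = TfidfTransformer()
X_train = tf.fit_transform(cv.fit_transform(train_lyrics))

model = MultinomialNB()
model.fit(X_train, labels)

# New data: transform only, never fit again
new_song = ["ghost town"]
X_new = tf.transform(cv.transform(new_song))
pred = model.predict(X_new)
print(artist_names[pred[0]])
```

Because the new lyric is only transformed, it ends up with the same columns as the training matrix, which is what the classifier expects.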

---

## To make your code shorter, you could use the TfidfVectorizer
* This does both steps (CountVectorizer and TfidfTransformer) in one. The reason I show both in the tutorial is that it's easier to understand word vectors this way

`from sklearn.feature_extraction.text import TfidfVectorizer`
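As a quick check, the sketch below compares the one-step and two-step routes on a made-up toy corpus with default settings. The corpus and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
toy_corpus = ["ghost town ghost", "bloody valentine town"]

# Two steps: count the words, then re-weight the counts
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(toy_corpus)
)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_corpus)

print(np.allclose(two_step.toarray(), one_step.toarray()))
```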