# Goal of this week's project

- Give a trained model a piece of text it hasn't seen before, and it returns the probability that the text belongs to each artist

## This week's project pipeline

**1. Use Requests and either RegEx or Beautiful Soup to scrape lyrics from the internet.**

    - One long column/series in which each entry holds the full text of one song (X)
    - One long column/series OF THE SAME LENGTH containing the artists, i.e. the labels (y)
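
A minimal sketch of step 1, using only RegEx on a page that has already been downloaded. The markup (`<div class="lyrics">`), the helper names `extract_lyrics` and `fetch_song`, and the sample page are all assumptions made for illustration; a real lyrics site will need its own pattern.

```python
import html
import re

# Hypothetical markup: assumes each song page wraps its lyrics in <div class="lyrics">.
LYRICS_RE = re.compile(r'<div class="lyrics">(.*?)</div>', re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_lyrics(page_html):
    """Return the lyrics of one song as a single whitespace-normalised string."""
    match = LYRICS_RE.search(page_html)
    if match is None:
        return ""
    text = TAG_RE.sub(" ", match.group(1))  # drop <br> and other inline tags
    text = html.unescape(text)              # turn &amp; back into &
    return " ".join(text.split())


def fetch_song(url):
    """Download one page and extract its lyrics (needs the `requests` package)."""
    import requests  # imported lazily so the parsing code runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return extract_lyrics(response.text)


# Offline demonstration on an inline page instead of a real request.
sample_page = (
    '<html><div class="lyrics">Come as you are<br>'
    "as you were &amp; as I want you to be</div></html>"
)
X = [extract_lyrics(sample_page)]  # one string per song
y = ["Nirvana"]                    # one label per song, same length as X
```

Keeping the parsing separate from the download makes it easy to test the pattern on a saved page before hitting the site repeatedly.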
    
**2. Make sure that every artist has approximately (to the best of your ability...) the same number of songs**
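
One possible way to balance the classes is to downsample every artist to the size of the smallest one. The sketch below assumes `X` and `y` are plain Python lists; the function name `balance` and the seed are illustrative choices, not part of the project brief.

```python
import random
from collections import Counter, defaultdict


def balance(X, y, seed=42):
    """Downsample so that every artist keeps as many songs as the rarest artist."""
    by_artist = defaultdict(list)
    for text, artist in zip(X, y):
        by_artist[artist].append(text)

    n = min(len(songs) for songs in by_artist.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    X_bal, y_bal = [], []
    for artist, songs in by_artist.items():
        for text in rng.sample(songs, n):
            X_bal.append(text)
            y_bal.append(artist)
    return X_bal, y_bal


X_demo = ["song a", "song b", "song c", "song d", "song e"]
y_demo = ["Nirvana", "Nirvana", "Nirvana", "Madonna", "Madonna"]
X_bal, y_bal = balance(X_demo, y_demo)
counts = Counter(y_bal)  # two songs per artist
```

Downsampling throws data away; scraping more songs for the under-represented artist is the better fix when it is possible.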

**3. Clean your data - RegEx, remove punctuation, lemmatisation, drop stopwords - can use spaCy**
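
A rough sketch of the RegEx part of the cleaning step. The stop-word set here is a tiny illustrative list, not a real one, and lemmatisation is left out because it needs a library such as spaCy.

```python
import re

# Tiny illustrative stop-word list (an assumption); use a proper list in practice.
STOPWORDS = {"a", "the", "to", "of", "and", "in", "i", "you", "my"}


def clean(text):
    """Lower-case, strip punctuation and digits, and drop stop words."""
    text = text.lower()
    # Keeps only a-z and whitespace, so accented letters (e.g. in "piña") are dropped too.
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


cleaned = clean("Hey Mr. DJ, put a record on!")
```

Here `cleaned` is `"hey mr dj put record on"`.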

**4. Vectorise your X data (i.e. Apply Count Vectorizer, TF-IDF, etc...)**
- Sklearn doesn't take strings! It needs numbers!
- Vectorising also standardises the length / shape of all your documents: every document becomes one row with one column per vocabulary word
    - vec = cv.fit_transform(Xtrain)
    - vec2 = tf.fit_transform(vec)
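
As a side note on the two-step recipe above: scikit-learn also ships `TfidfVectorizer`, which is documented as equivalent to `CountVectorizer` followed by `TfidfTransformer`. The sketch below checks that on three toy lines, with default settings for all three classes; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "like a virgin touched for the very first time",
    "we can escape to a higher plane",
    "hey mr dj please put a record on",
]

# Two steps: raw word counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: the same result straight from the raw strings.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
```

Both routes give a matrix with one row per document and one column per vocabulary word, so either can feed step 5.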

**5. Fit a model on X, y!**
- Naive Bayes (Bayes' theorem / probability theory)
- m.fit(X,y)

**6. Use your fitted / trained model to predict new text!**
- vec = cv.transform(Xnew/Xtest)
- vec2 = tf.transform(vec)
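
Steps 4 to 6 can also be chained so that `fit` and `transform` are always called in the right order. This is a sketch using scikit-learn's `make_pipeline` on four toy lines; the training data and the name `model` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train = [
    "we can escape to a higher plane",
    "polly wants a cracker",
    "like a virgin touched for the very first time",
    "living in a material world and im a material girl",
]
y_train = ["Nirvana", "Nirvana", "Madonna", "Madonna"]

# fit() runs fit_transform on both vectorising steps, then fits the classifier.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(X_train, y_train)

# predict() only calls transform() on the vectorising steps, never fit.
X_new = ["material girl"]
pred = model.predict(X_new)
proba = model.predict_proba(X_new)
```

The key point is the same as in the bullets above: new text goes through `transform`, not `fit_transform`, so it is mapped onto the vocabulary learned from the training data.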

The worked example below picks up at roughly the beginning of step 3, with a small toy corpus standing in for scraped lyrics:

In [1]:
artist1 = 'Nirvana'
artist2 = 'Madonna'

text_corpus = ['come as you are as you were as I want you to be',
              'polly wants a cracker think i should get off her first',
              "im so happy because today found my friends they're in my head",
              'we can escape to a higher plane',
              'im so lonely i shaved my head and im not sad i dont care im so horny',
              'like a virgin touched for the very first time',
              'tropical the island breeze all of nature wild and free',
              'when you call me name its like a little prayer down on my knees i want to take you there',
              'hey mr dj please put a record on i want to dance with my baby',
              'because we are living in a material world and im a material girl']

labels = [artist1] * 5 + [artist2] * 5

In [2]:
text_corpus

['come as you are as you were as I want you to be',
 'polly wants a cracker think i should get off her first',
 "im so happy because today found my friends they're in my head",
 'we can escape to a higher plane',
 'im so lonely i shaved my head and im not sad i dont care im so horny',
 'like a virgin touched for the very first time',
 'tropical the island breeze all of nature wild and free',
 'when you call me name its like a little prayer down on my knees i want to take you there',
 'hey mr dj please put a record on i want to dance with my baby',
 'because we are living in a material world and im a material girl']

In [3]:
len(text_corpus) == len(labels)

True

### Let's vectorise our text corpus:

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [5]:
# There are some interesting CountVect. hyperparameters!
cv = CountVectorizer(stop_words='english')

In [6]:
vec = cv.fit_transform(text_corpus)

In [7]:
vec # 10x43 means 43 unique words (after dropping stop words) in our 10 phrases

<10x43 sparse matrix of type '<class 'numpy.int64'>'
	with 49 stored elements in Compressed Sparse Row format>

In [9]:
# vec.todense()

In [10]:
tf = TfidfTransformer()

In [11]:
vec2 = tf.fit_transform(vec)

In [23]:
# vec2.todense() #a bit more scaling and normalisation done to it
               # we now have floating points

In [12]:
X = vec2
y = labels # our artists

In [13]:
from sklearn.naive_bayes import MultinomialNB

In [14]:
m = MultinomialNB()
# alpha is the additive (Laplace) smoothing hyperparameter: the higher the alpha,
# the more discriminating, distinct words it takes to tell which artist a song belongs to
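
To make the effect of alpha concrete, the sketch below fits the same toy data with three alpha values and records how confident the model is about one query. The corpus, the query, and the names `cv_demo`, `nb` and `confidence` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "polly wants a cracker",
    "we can escape to a higher plane",
    "like a virgin touched for the very first time",
    "living in a material world",
]
artists = ["Nirvana", "Nirvana", "Madonna", "Madonna"]

cv_demo = CountVectorizer()
X_demo = cv_demo.fit_transform(docs)
X_query = cv_demo.transform(["material world"])

confidence = {}
for alpha in (0.01, 1.0, 100.0):
    nb = MultinomialNB(alpha=alpha).fit(X_demo, artists)
    # Highest class probability for the query under this amount of smoothing.
    confidence[alpha] = nb.predict_proba(X_query).max()
```

With heavy smoothing the per-word probabilities of the two artists become almost identical, so the prediction drifts towards the class prior (50/50 here).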

In [15]:
m.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
m.score(X, y)

1.0
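
The 1.0 above is accuracy on the same lines the model was trained on, so it says little about unseen lyrics. A hedged sketch of scoring on held-out text instead, using `train_test_split`; the eight-line corpus and the 25% split are assumptions for illustration, and a real project would use far more songs.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "come as you are as you were",
    "polly wants a cracker",
    "we can escape to a higher plane",
    "im so lonely i shaved my head",
    "like a virgin touched for the very first time",
    "tropical the island breeze",
    "hey mr dj please put a record on",
    "living in a material world",
]
artists = ["Nirvana"] * 4 + ["Madonna"] * 4

# stratify keeps the artist proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, artists, test_size=0.25, stratify=artists, random_state=0
)

model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)  # optimistic
test_accuracy = model.score(X_test, y_test)     # closer to real-world performance
```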

In [17]:
new_text = ["baby girl you rock my world",
           "kill me now now and shave my head",
           "tropical piña colada getting caught in the rain",
           "come dj prayer"]

In [18]:
vec_test = cv.transform(new_text)

In [19]:
vec_test_final = tf.transform(vec_test)

In [20]:
m.predict(vec_test_final)

array(['Madonna', 'Nirvana', 'Madonna', 'Madonna'], dtype='<U7')

In [21]:
m.predict_proba(vec_test_final)

array([[0.62342229, 0.37657771],
       [0.36996851, 0.63003149],
       [0.57849236, 0.42150764],
       [0.50834738, 0.49165262]])

In [22]:
m.predict_log_proba(vec_test_final)

array([[-0.47253116, -0.97663085],
       [-0.99433739, -0.46198547],
       [-0.54732993, -0.86391738],
       [-0.67659025, -0.70998287]])
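
The columns of the two arrays above follow the order of the model's `classes_` attribute, which scikit-learn sorts alphabetically, and the log-probabilities are simply the logarithm of the probabilities. A small sketch that checks both on toy data (corpus and variable names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "polly wants a cracker",
    "we can escape to a higher plane",
    "like a virgin touched for the very first time",
    "living in a material world",
]
artists = ["Nirvana", "Nirvana", "Madonna", "Madonna"]

cv_demo = CountVectorizer()
nb = MultinomialNB().fit(cv_demo.fit_transform(docs), artists)
X_query = cv_demo.transform(["material world", "higher plane"])

proba = nb.predict_proba(X_query)
log_proba = nb.predict_log_proba(X_query)

# Attach the artist name to each column: column 0 is classes_[0], and so on.
labelled = [dict(zip(nb.classes_, row)) for row in proba]

# exp(log-probability) recovers the probability.
consistent = np.allclose(np.exp(log_proba), proba)
```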

In [23]:
prob_log = m.feature_log_prob_

In [24]:
difference = prob_log[0] - prob_log[1]
difference

array([ 0.30400415,  0.31658745, -0.28221518, -0.61488304, -0.43122425,
        0.30400415,  0.30400415, -0.28221518, -0.48150553,  0.31658745,
       -0.42023489,  0.28456107, -0.42023489, -0.53235192,  0.30400415,
       -0.48150553, -0.28221518, -0.48535885,  0.31658745,  0.36871662,
        0.5903472 ,  0.36871662,  0.28456107, -0.28221518,  0.52104552,
        0.30400415,  0.31658745, -0.48150553, -0.43122425,  0.36871662,
        0.30400415, -0.28221518, -0.28221518, -0.43122425,  0.39182616,
       -0.42023489,  0.39182616,  0.31658745,  0.39182616,  0.00715254,
       -0.43122425,  0.31658745,  0.28456107])

In [25]:
sorted(cv.vocabulary_.keys())

['baby',
 'breeze',
 'care',
 'come',
 'cracker',
 'dance',
 'dj',
 'dont',
 'escape',
 'free',
 'friends',
 'girl',
 'happy',
 'head',
 'hey',
 'higher',
 'horny',
 'im',
 'island',
 'knees',
 'like',
 'little',
 'living',
 'lonely',
 'material',
 'mr',
 'nature',
 'plane',
 'polly',
 'prayer',
 'record',
 'sad',
 'shaved',
 'think',
 'time',
 'today',
 'touched',
 'tropical',
 'virgin',
 'want',
 'wants',
 'wild',
 'world']

In [26]:
import pandas as pd

pd.DataFrame(data=difference, 
             index=sorted(cv.vocabulary_.keys())).sort_values(by=0)

|  | 0 |
| --- | --- |
| come | -0.614883 |
| head | -0.532352 |
| im | -0.485359 |
| higher | -0.481506 |
| plane | -0.481506 |
| escape | -0.481506 |
| polly | -0.431224 |
| wants | -0.431224 |
| cracker | -0.431224 |
| think | -0.431224 |


#### Negative values mean a word is more closely associated with Nirvana; positive values mean it is more closely associated with Madonna

**Bonus**
- Rather than using Requests + RegEx / BS4, use **Scrapy**
- Rather than storing all your text / artist labels locally in CSV files / .txt files, etc...
    - Store them in a Database:
        - SQLite (you need to use SQL)
        - PostgreSQL (probably overkill...)
        - A NoSQL database, e.g. MongoDB
        
- Add a command-line interface
    - Move your project out of the Jupyter Notebook and into a **.py** file, and allow the 'user' to simply type a piece of text into the command line / terminal; the script then prints its result / guess in the terminal.
    
    `python lyric_classifier.py "is this the real life is this just fantasy"`
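
A minimal sketch of what `lyric_classifier.py` could look like, using `argparse` from the standard library. The helper `train_model` and its four-line corpus are stand-ins (assumptions): a real version would train on, or load, the scraped lyrics instead.

```python
import argparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_model():
    """Toy stand-in: a real project would use the scraped lyrics here."""
    texts = [
        "we can escape to a higher plane",
        "polly wants a cracker",
        "like a virgin touched for the very first time",
        "living in a material world",
    ]
    artists = ["Nirvana", "Nirvana", "Madonna", "Madonna"]
    model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
    return model.fit(texts, artists)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Guess the artist of a lyric.")
    parser.add_argument("text", nargs="?", help="lyric text to classify (in quotes)")
    args = parser.parse_args(argv)

    if args.text is None:  # no text given: show the usage message instead
        parser.print_help()
        return None

    model = train_model()
    artist = model.predict([args.text])[0]
    confidence = model.predict_proba([args.text]).max()
    print(f"{artist} ({confidence:.0%})")
    return artist


if __name__ == "__main__":
    main()
```

Retraining on every call is only acceptable for a toy corpus; with real data it is more practical to train once and save the fitted pipeline to disk.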