# taylor_naive_bayes

The aim of this project is to train a Naive Bayes model that can play my Taylor Swift lyrics guessing game: given a lyric from any Taylor Swift song, it guesses which song the lyric comes from. The model is trained on a dataset of all lyrics from Taylor's album discography (as of March 2024).

Naive Bayes classifiers are *probabilistic* classifiers based on Bayes' theorem. Once trained, they can predict the probability $P(c|W)$ that a given sequence of words $W$ belongs to a certain class $c$ (the song in this case), based on information learned during training. This information includes the **likelihood** $P(W|c)$, which is the probability that the lyrics come up given the song, and the **prior** $P(c)$, the probability of the song itself coming up. By Bayes' theorem, $P(c|W) \propto P(W|c)\,P(c)$.

Ultimately, the model computes the **posterior** $P(c|W)$ for *every class (song)*. The song with the greatest posterior is the song that the lyric most likely comes from. Because Naive Bayes models are probabilistic, the model can also output a ranked list of the songs that the lyric is most likely to come from, not just its overall prediction.
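
As a rough illustration of the posterior computation described above, here is a minimal from-scratch sketch on a made-up three-line corpus. The function names (`train_counts`, `posterior`), the toy lines, and the "Song A/B/C" labels are illustrative assumptions and not part of the project code, which relies on scikit-learn's `MultinomialNB`. The sketch assumes whitespace tokenization and add-$\alpha$ smoothing.

```python
import math
from collections import Counter, defaultdict

def train_counts(lines, labels):
    """Count words per class and lines per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for line, label in zip(lines, labels):
        word_counts[label].update(line.lower().split())
    return word_counts, class_counts

def posterior(lyric, word_counts, class_counts, alpha=1.0):
    """Return {class: P(class | lyric)} under a multinomial Naive Bayes model."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_lines = sum(class_counts.values())
    log_scores = {}
    for c in class_counts:
        # log prior: log P(c)
        score = math.log(class_counts[c] / total_lines)
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        for w in lyric.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # smoothed log likelihood: log P(w | c)
                score += math.log((word_counts[c][w] + alpha) / denom)
        log_scores[c] = score
    # normalize in log space so the posteriors sum to 1
    m = max(log_scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

# Toy corpus: placeholder lines, not rows from the real dataset
lines = ["red scarf autumn leaves", "red lipstick party", "getaway car driving"]
labels = ["Song A", "Song B", "Song C"]
word_counts, class_counts = train_counts(lines, labels)
post = posterior("autumn leaves", word_counts, class_counts)
best = max(post, key=post.get)
print(best, post)
```

The class with the largest posterior is the prediction, and the full dictionary is what makes a ranked list of candidate songs possible.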

### Data preprocessing

In [148]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

In [149]:
taylor = pd.read_csv("allTaylorLyrics_2024.csv", usecols=[1, 2, 3, 4, 5])
taylor

Unnamed: 0,line,lyric,element,track_name,album_name
0,1,He said the way my blue eyes shined,Verse 1,Tim McGraw,Taylor Swift
1,2,Put those Georgia stars to shame that night,Verse 1,Tim McGraw,Taylor Swift
2,3,"I said, ""That's a lie""",Verse 1,Tim McGraw,Taylor Swift
3,4,Just a boy in a Chevy truck,Verse 1,Tim McGraw,Taylor Swift
4,5,That had a tendency of gettin' stuck,Verse 1,Tim McGraw,Taylor Swift
...,...,...,...,...,...
10045,39,Let you know that what I feel is true,Chorus,I'm Only Me When I'm With You,Taylor Swift
10046,40,"And I'm only me, who I wanna be",Chorus,I'm Only Me When I'm With You,Taylor Swift
10047,41,"Well, I'm only me when I'm with you",Chorus,I'm Only Me When I'm With You,Taylor Swift
10048,42,With you,Outro,I'm Only Me When I'm With You,Taylor Swift


In [150]:
# replacing All Too Well 10MV with "All Too Well 10" so that it is distinct from normal All Too Well when parentheses are removed
# if this is not done, then the probabilities of ATW are probably more than doubled
taylor.loc[(taylor['track_name'] == "All Too Well (10 Minute Version) (Taylor's Version) (From The Vault)"), 'track_name'] = "All Too Well 10"

In [151]:
# replace all parentheses in track names; helps readability
taylor["track_name"] = taylor["track_name"].str.replace(r'\s*\([^)]*\)', '', regex=True)
np.unique(taylor["track_name"])

array(["'tis the damn season", '...Ready For It?', '22',
       'A Perfectly Good Heart', 'A Place In This World', 'Afterglow',
       'All Too Well', 'All Too Well 10', 'All You Had To Do Was Stay',
       'Anti-Hero', 'Babe', 'Back To December', 'Bad Blood',
       'Begin Again', 'Bejeweled', 'Better Man', 'Better Than Revenge',
       'Bigger Than The Whole Sky', 'Blank Space', 'Breathe',
       'Bye Bye Baby', 'Call It What You Want', 'Castles Crumbling',
       'Change', 'Clean', 'Cold As You', 'Come Back...Be Here',
       'Come In With The Rain', 'Cornelia Street', 'Cruel Summer',
       'Dancing With Our Hands Tied', 'Daylight', 'Dear John',
       'Dear Reader', 'Death By A Thousand Cuts', 'Delicate',
       "Don't Blame Me", "Don't You", 'Dress', 'Electric Touch',
       'Enchanted', 'End Game', 'Everything Has Changed', 'False God',
       'Fearless', 'Fifteen', 'Foolish One', 'Forever & Always',
       'Forever Winter', 'Getaway Car', 'Girl At Home', 'Glitch',
       'Gorge
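
To see why the 10 minute version is renamed before the parentheses are stripped, the same regular expression can be tried on two strings using only the standard library. The second track name, `"All Too Well (Taylor's Version)"`, is an assumed spelling for illustration; the exact name in the CSV may differ.

```python
import re

def strip_parentheticals(track_name):
    """Remove every '(...)' group together with the whitespace in front of it."""
    return re.sub(r"\s*\([^)]*\)", "", track_name)

ten_minute = "All Too Well (10 Minute Version) (Taylor's Version) (From The Vault)"
standard = "All Too Well (Taylor's Version)"  # assumed spelling, for illustration

print(strip_parentheticals(ten_minute))
print(strip_parentheticals(standard))
```

Both strings collapse to the same label, so without the rename the two recordings would be merged into a single class.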

### Split dataset

In [152]:
# split dataset into training and testing sets
lyrics_train, lyrics_test, song_train, song_test = train_test_split(taylor["lyric"], 
                                                                    taylor["track_name"], 
                                                                    test_size = 0.2, random_state = 21)

In [153]:
lyrics_train

9716                  'Cause we survived the Great War
4567                      All you had to do was (Stay)
748     Now it's too late for you and your white horse
2794           We fall in love 'til it hurts or bleeds
9917                        Memories feel like weapons
                             ...                      
9336              And you're not sure and I don't know
48                           When you think Tim McGraw
8964               Sometimes to run is the brave thing
5944                      Scratches down your back now
5327                                  On the way home?
Name: lyric, Length: 8040, dtype: object

In [154]:
song_train

9716                    The Great War
4567       All You Had To Do Was Stay
748                       White Horse
2794                   State Of Grace
9917    Would've, Could've, Should've
                    ...              
9336                     Question...?
48                         Tim McGraw
8964                  it's time to go
5944                    So It Goes...
5327           Now That We Don't Talk
Name: track_name, Length: 8040, dtype: object

### Models

I would like to find the Naive Bayes model that achieves the highest classification accuracy out of the following: 

1. model trained on only unigrams
2. model trained on unigrams *and* bigrams
3. model trained on unigrams, bigrams, *and* trigrams

I'd also like to examine the effects of removing English stopwords, as well as altering the $\alpha$ hyperparameter (the additive, or Laplace, smoothing parameter; default is 1.0).
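
For reference, the feature sets listed above differ only in which word n-grams are counted. The following is a simplified sketch of that expansion, not scikit-learn's implementation: `ngrams` is a hypothetical helper that lowercases the lyric, keeps tokens of two or more word characters (the same pattern `CountVectorizer` uses by default), and joins consecutive tokens.

```python
import re

def ngrams(lyric, ngram_range=(1, 1)):
    """List the word n-grams of a lyric for every n in [min_n, max_n]."""
    tokens = re.findall(r"\b\w\w+\b", lyric.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

lyric = "Look what you made me do"
unigrams = ngrams(lyric, (1, 1))   # single words only
uni_bi = ngrams(lyric, (1, 2))     # single words plus adjacent pairs
bi_tri = ngrams(lyric, (2, 3))     # pairs and triples, no single words
print(unigrams)
print(uni_bi)
print(bi_tri)
```

Each extra n-gram order adds features that encode word order, which single words cannot capture.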

### Model trained only on unigrams

In [155]:
# initialize vectorizers (one with stopwords removed, other one retaining them)
vectorizer_unigrams_rm = CountVectorizer(stop_words="english")
vectorizer_unigrams_all = CountVectorizer()

In [156]:
# fit transform the training set
lyrics_train_unigrams_rm = vectorizer_unigrams_rm.fit_transform(lyrics_train)
lyrics_train_unigrams_all = vectorizer_unigrams_all.fit_transform(lyrics_train)

In [157]:
# observe the shapes - vectorized training set containing stopwords obviously has more features
print(lyrics_train_unigrams_rm.shape)
print(lyrics_train_unigrams_all.shape)

(8040, 3334)
(8040, 3567)


In [158]:
# transform the testing set
lyrics_test_unigrams_rm = vectorizer_unigrams_rm.transform(lyrics_test)
lyrics_test_unigrams_all = vectorizer_unigrams_all.transform(lyrics_test)

In [159]:
print(lyrics_test_unigrams_rm.shape)
print(lyrics_test_unigrams_all.shape)

(2010, 3334)
(2010, 3567)


In [160]:
# initialize NB models
nb_model_unigrams_rm = MultinomialNB()
nb_model_unigrams_all = MultinomialNB()

In [161]:
# fit the models
nb_model_unigrams_rm.fit(lyrics_train_unigrams_rm, song_train)
nb_model_unigrams_all.fit(lyrics_train_unigrams_all, song_train)

In [162]:
# make predictions
preds_unigrams_rm = nb_model_unigrams_rm.predict(lyrics_test_unigrams_rm)
preds_unigrams_all = nb_model_unigrams_all.predict(lyrics_test_unigrams_all)

In [163]:
# compute accuracy scores
accuracy_unigrams_rm = accuracy_score(song_test, preds_unigrams_rm)
accuracy_unigrams_all = accuracy_score(song_test, preds_unigrams_all)
print("Accuracy for unigram model with stopwords removed: ", accuracy_unigrams_rm, "\n", 
      "Accuracy for unigram model with stopwords retained: ", accuracy_unigrams_all, sep="")

Accuracy for unigram model with stopwords removed: 0.3756218905472637
Accuracy for unigram model with stopwords retained: 0.39950248756218903


It looks like it is more helpful to retain stopwords: the model that keeps them scores about 0.40, versus about 0.38 with them removed.
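
The accuracies above favour keeping stopwords. One possible reason, offered as a guess rather than a tested result, is that stopword removal leaves very little of a short lyric to match on. The sketch below uses `build_analyzer()` to show which tokens each vectorizer configuration would count for one lyric from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzers apply the same preprocessing, tokenization and stopword
# filtering that the corresponding vectorizer applies before counting
analyze_rm = CountVectorizer(stop_words="english").build_analyzer()
analyze_all = CountVectorizer().build_analyzer()

lyric = "And I chose you"
tokens_rm = analyze_rm(lyric)
tokens_all = analyze_all(lyric)
print("stopwords removed: ", tokens_rm)
print("stopwords retained:", tokens_all)
```

Note that even the default configuration drops one-character tokens such as "I", because of the default token pattern.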

### Reusable functions: `predict_song` and `check_songs`

`predict_song` takes the NB model, the vectorizer, and a lyric (string), and returns 1) the model's predicted song and 2) the 15 songs that the lyric is most likely to come from, together with their associated probabilities.

In [164]:
def predict_song(model, vectorizer, lyric): 
    vec = vectorizer.transform([lyric])
    song_prediction = model.predict(vec)
    classes = model.classes_
    top_15 = sorted(zip(model.predict_proba(vec)[0], classes), reverse=True)[:15]
    return (song_prediction, top_15)
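
The ranking step inside `predict_song` can be looked at in isolation. In this sketch `rank_songs` is a hypothetical helper, and the probabilities are made-up round numbers rather than model output; the point is only that each entry of a `predict_proba` row lines up with the entry of `classes_` at the same position, so zipping and sorting gives the ranked list.

```python
def rank_songs(probabilities, classes, k=15):
    """Return (predicted class, top-k list of (probability, class) pairs)."""
    # tuples sort by probability first; reverse=True puts the largest first
    ranked = sorted(zip(probabilities, classes), reverse=True)
    return ranked[0][1], ranked[:k]

# illustrative values only
classes = ["Bejeweled", "Fifteen", "Shake It Off", "cowboy like me"]
probabilities = [0.038, 0.030, 0.070, 0.035]

prediction, top_2 = rank_songs(probabilities, classes, k=2)
print(prediction, top_2)
```

Here the prediction comes back as a plain string. `predict_song` instead returns the one-element array from `model.predict(vec)`, which is why its output prints as `['...']`; indexing with `[0]` would give the bare song title.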

`check_songs` samples n random lyrics from the test set and prints, for each one, the lyric, the model's predicted song, the correct song, and the top 15 predictions. More for fun than for actual model testing.

In [165]:
def check_songs(model, vectorizer, n=5, test_lyrics=lyrics_test, test_songs=song_test, random_state=21): 
    for lyric, correct_song in zip(test_lyrics.sample(n, random_state=random_state), test_songs.sample(n, random_state=random_state).values): 
        song_prediction, top_15 = predict_song(model, vectorizer, lyric)
        print("Lyric: ", lyric, "\n", 
              "Predicted song: ", song_prediction, "\n", 
              "Correct song: ", correct_song, "\n",
              "Top 15: ", top_15, "\n\n", sep="")   

In [166]:
check_songs(nb_model_unigrams_rm, vectorizer_unigrams_rm)

Lyric: Fell behind all my classmates and I ended up here
Predicted song: ['Today Was A Fairytale']
Correct song: this is me trying
Top 15: [(0.02530754164099304, 'Today Was A Fairytale'), (0.01873254749197323, 'Getaway Car'), (0.018226304470134073, 'All Too Well 10'), (0.017413654308159037, 'long story short'), (0.011221515921588692, 'Timeless'), (0.01103570999073583, 'The Very First Night'), (0.010580784729557577, 'I Knew You Were Trouble'), (0.010487119587642776, 'All You Had To Do Was Stay'), (0.009566038663692832, 'This Love'), (0.009382365623234495, 'Innocent'), (0.00902211962818367, 'Mine'), (0.008972052734997828, 'Holy Ground'), (0.008741451286610828, 'I Bet You Think About Me'), (0.00757762086923376, 'Suburban Legends'), (0.007570231971710424, "I'm Only Me When I'm With You")]


Lyric: Wind in my hair, I was there
Predicted song: ['All Too Well 10']
Correct song: All Too Well
Top 15: [(0.24818842574285815, 'All Too Well 10'), (0.015907591187168935, 'All Too Well'), (0.015253552
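
`check_songs` samples the lyrics and the songs in two separate calls and relies on both using the same `random_state` over identically indexed Series. A sketch of an alternative that draws (lyric, song) pairs in a single call, so the two cannot drift apart; `sample_pairs` is a hypothetical helper that works on plain lists rather than pandas objects.

```python
import random

def sample_pairs(lyrics, songs, n=5, seed=21):
    """Draw n (lyric, song) pairs together so each lyric keeps its own label."""
    pairs = list(zip(lyrics, songs))
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(pairs, n)

# a few rows copied from the training split shown earlier
lyrics = ["'Cause we survived the Great War",
          "All you had to do was (Stay)",
          "Now it's too late for you and your white horse",
          "Memories feel like weapons"]
songs = ["The Great War",
         "All You Had To Do Was Stay",
         "White Horse",
         "Would've, Could've, Should've"]

sampled = sample_pairs(lyrics, songs, n=2)
for lyric, song in sampled:
    print(lyric, "->", song)
```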

In [168]:
check_songs(nb_model_unigrams_rm, vectorizer_unigrams_rm, random_state=3)

Lyric: And I chose you
Predicted song: ['All Too Well 10']
Correct song: Maroon
Top 15: [(0.01070377287123913, 'All Too Well 10'), (0.008603693518145954, "I'm Only Me When I'm With You"), (0.008588956960592758, '22'), (0.007686177862520797, 'I Wish You Would'), (0.0073957432667351546, 'Call It What You Want'), (0.0073876508079150525, 'Blank Space'), (0.007253758817196911, 'Look What You Made Me Do'), (0.0072063486288492075, 'Everything Has Changed'), (0.00718555496547938, 'End Game'), (0.0071737314921981295, 'Getaway Car'), (0.006957668121332763, 'Shake It Off'), (0.006954160923233091, 'Dress'), (0.006918566538190607, 'happiness'), (0.006866069427876857, 'So It Goes...'), (0.006830360312100555, 'Delicate')]


Lyric: You're gonna believe them
Predicted song: ['Shake It Off']
Correct song: Fifteen
Top 15: [(0.07041325413815955, 'Shake It Off'), (0.03810926456471235, 'Bejeweled'), (0.03545467733705139, 'cowboy like me'), (0.03148617216167857, 'Timeless'), (0.029585780541975713, 'Fifteen')

In [169]:
check_songs(nb_model_unigrams_rm, vectorizer_unigrams_rm, random_state=13)

Lyric: And we fell down a rabbit hole
Predicted song: ['long story short']
Correct song: Wonderland
Top 15: [(0.22241190642699163, 'long story short'), (0.020473185210198537, 'Today Was A Fairytale'), (0.014699764072329233, 'Getaway Car'), (0.013989861434338568, 'All Too Well 10'), (0.0089328309436789, 'The Very First Night'), (0.008891240537798651, 'Timeless'), (0.008473111774665518, 'I Knew You Were Trouble'), (0.0076077913969866025, 'This Love'), (0.007563636008105778, 'Innocent'), (0.007275188294348008, 'Run'), (0.00724760418400063, 'Holy Ground'), (0.00721867816700773, 'Mine'), (0.006969988367207225, 'I Bet You Think About Me'), (0.006005048171702458, "I'm Only Me When I'm With You"), (0.005974244337030376, '22')]


Lyric: Look what you made me do
Predicted song: ['Look What You Made Me Do']
Correct song: Look What You Made Me Do
Top 15: [(0.13816969070389865, 'Look What You Made Me Do'), (0.03135587306349674, 'Girl At Home'), (0.02047585329468981, 'Bad Blood'), (0.017711822042582

### Model trained on unigrams and bigrams

In [170]:
# initialize vectorizers (one with stopwords removed, other one retaining them)
vectorizer_bigrams_rm = CountVectorizer(stop_words="english", ngram_range=(1, 2))
vectorizer_bigrams_all = CountVectorizer(ngram_range=(1, 2))

In [171]:
# fit transform the training set
lyrics_train_bigrams_rm = vectorizer_bigrams_rm.fit_transform(lyrics_train)
lyrics_train_bigrams_all = vectorizer_bigrams_all.fit_transform(lyrics_train)

In [172]:
# observe the shapes - these matrices obviously have more features than when there were only unigrams
print(lyrics_train_bigrams_rm.shape)
print(lyrics_train_bigrams_all.shape)

(8040, 12630)
(8040, 20674)


In [173]:
# transform the testing set
lyrics_test_bigrams_rm = vectorizer_bigrams_rm.transform(lyrics_test)
lyrics_test_bigrams_all = vectorizer_bigrams_all.transform(lyrics_test)

In [174]:
print(lyrics_test_bigrams_rm.shape)
print(lyrics_test_bigrams_all.shape)

(2010, 12630)
(2010, 20674)


In [175]:
# initialize NB models
nb_model_bigrams_rm = MultinomialNB()
nb_model_bigrams_all = MultinomialNB()

In [176]:
# fit the models
nb_model_bigrams_rm.fit(lyrics_train_bigrams_rm, song_train)
nb_model_bigrams_all.fit(lyrics_train_bigrams_all, song_train)

In [177]:
# make predictions
preds_bigrams_rm = nb_model_bigrams_rm.predict(lyrics_test_bigrams_rm)
preds_bigrams_all = nb_model_bigrams_all.predict(lyrics_test_bigrams_all)

In [178]:
# compute accuracy scores
accuracy_bigrams_rm = accuracy_score(song_test, preds_bigrams_rm)
accuracy_bigrams_all = accuracy_score(song_test, preds_bigrams_all)
print("Accuracy for unigram and bigram model with stopwords removed: ", accuracy_bigrams_rm, "\n", 
      "Accuracy for unigram and bigram model with stopwords retained: ", accuracy_bigrams_all, sep="")

Accuracy for unigram and bigram model with stopwords removed: 0.46616915422885574
Accuracy for unigram and bigram model with stopwords retained: 0.5288557213930348
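
The unigram and unigram + bigram runs above repeat the same steps with a different vectorizer each time. A sketch of how those steps could be wrapped in one helper follows; `evaluate_config` is a hypothetical name, and the tiny "Song A/B" dataset is placeholder data so that the block runs on its own. In the notebook the inputs would be `lyrics_train`, `song_train`, `lyrics_test`, and `song_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def evaluate_config(train_text, train_labels, test_text, test_labels,
                    ngram_range=(1, 1), stop_words=None, alpha=1.0):
    """Fit a vectorizer and MultinomialNB; return (accuracy, model, vectorizer)."""
    vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=ngram_range)
    X_train = vectorizer.fit_transform(train_text)
    X_test = vectorizer.transform(test_text)  # reuse the training vocabulary
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train, train_labels)
    accuracy = accuracy_score(test_labels, model.predict(X_test))
    return accuracy, model, vectorizer

# placeholder data, not real lyrics
train_text = ["red scarf in the drawer", "getaway car on the road",
              "red lipstick at the party", "driving the getaway car"]
train_labels = ["Song A", "Song B", "Song A", "Song B"]
test_text = ["the red scarf", "a getaway car"]
test_labels = ["Song A", "Song B"]

results = {}
for ngram_range in [(1, 1), (1, 2), (1, 3), (2, 3)]:
    for stop_words in ["english", None]:
        acc, _, _ = evaluate_config(train_text, train_labels, test_text,
                                    test_labels, ngram_range, stop_words)
        results[(ngram_range, stop_words)] = acc
        print(ngram_range, stop_words, acc)
```

Returning the fitted model and vectorizer alongside the accuracy keeps them available for `predict_song` and `check_songs`.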


In [179]:
check_songs(nb_model_bigrams_rm, vectorizer_bigrams_rm)

Lyric: Fell behind all my classmates and I ended up here
Predicted song: ['Today Was A Fairytale']
Correct song: this is me trying
Top 15: [(0.02506324088585431, 'Today Was A Fairytale'), (0.01912515351426839, 'Getaway Car'), (0.01907111616370117, 'All Too Well 10'), (0.017475317892038034, 'long story short'), (0.011343806353566964, 'Timeless'), (0.010923272071965468, 'The Very First Night'), (0.010582374796796959, 'I Knew You Were Trouble'), (0.010430009585089443, 'All You Had To Do Was Stay'), (0.009623930517986696, 'This Love'), (0.009320434097313977, 'Innocent'), (0.009023648921314477, 'Mine'), (0.008893268361431705, 'Holy Ground'), (0.008768056227737608, 'I Bet You Think About Me'), (0.007651838486878427, "I'm Only Me When I'm With You"), (0.007644724358352853, '22')]


Lyric: Wind in my hair, I was there
Predicted song: ['All Too Well 10']
Correct song: All Too Well
Top 15: [(0.6656986445977718, 'All Too Well 10'), (0.013880147253002172, 'All Too Well'), (0.0066555581283482814, '

In [180]:
predict_song(nb_model_bigrams_rm, vectorizer_bigrams_rm, "Wreck my plans, that's my man")

(array(['willow'], dtype='<U39'),
 [(0.997193683370179, 'willow'),
  (0.00027241343261402235, 'The Man'),
  (0.0001731881163337371, 'Better Man'),
  (5.264701739113467e-05, 'Bejeweled'),
  (5.243489074755659e-05, 'Mine'),
  (5.2005440063009925e-05, 'Babe'),
  (4.89191820985214e-05, 'I Think He Knows'),
  (4.1877016866183017e-05, 'august'),
  (2.6019123572055494e-05, 'Timeless'),
  (2.5015330124976055e-05, 'I Did Something Bad'),
  (2.4940073316007533e-05, 'Mean'),
  (2.4127658456653617e-05, '...Ready For It?'),
  (2.325441373662321e-05, 'Death By A Thousand Cuts'),
  (2.321400918122227e-05, 'Girl At Home'),
  (2.1999138458479298e-05, 'Foolish One')])

The unigrams + bigrams model is much more confident in its predictions, which makes sense: a matching bigram adds evidence on top of the matching unigrams.
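
One way to see where the extra confidence comes from is a deliberately simplified calculation. Assume equal priors, and assume every matching feature is `ratio` times likelier under the correct song than under each of `n_rivals` other songs. Both assumptions, and the numbers used, are illustrative and not measured from the model. Under Naive Bayes the per-feature likelihoods multiply, so the advantage grows exponentially with the number of matching features.

```python
def posterior_for_match(n_features, ratio, n_rivals):
    """Posterior of the matching song under the simplifying assumptions above."""
    advantage = ratio ** n_features  # product of per-feature likelihood ratios
    return advantage / (advantage + n_rivals)

# a 5-word lyric gives 5 unigram features, or 5 + 4 = 9 with bigrams added
p_unigrams = posterior_for_match(n_features=5, ratio=3, n_rivals=200)
p_with_bigrams = posterior_for_match(n_features=9, ratio=3, n_rivals=200)
print(p_unigrams, p_with_bigrams)
```

Because n-grams of the same lyric overlap, they are far from independent, so the added confidence is not necessarily added accuracy.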

In [181]:
predict_song(nb_model_bigrams_rm, vectorizer_bigrams_rm, "Disappointments, close your eyes"), predict_song(nb_model_unigrams_rm, vectorizer_unigrams_rm, "Disappointments, close your eyes")

((array(['coney island'], dtype='<U39'),
  [(0.8400106567660309, 'coney island'),
   (0.004643407904114179, 'Love Story'),
   (0.004152097984757242, "Would've, Could've, Should've"),
   (0.003802111983073004, 'Everything Has Changed'),
   (0.002619121975771275, 'London Boy'),
   (0.002431501003684468, 'Sparks Fly'),
   (0.002250736050015099, 'So It Goes...'),
   (0.002229120882557555, 'happiness'),
   (0.002057075341418272, 'Bejeweled'),
   (0.00200131220861887, 'Cruel Summer'),
   (0.0019764223314305113, 'Wonderland'),
   (0.0019293803929518372, 'cowboy like me'),
   (0.0018229146766772153, 'The Last Time'),
   (0.001649267842383858, 'Run'),
   (0.0015860558640348739, 'Call It What You Want')]),
 (array(['coney island'], dtype='<U39'),
  [(0.24958133539951213, 'coney island'),
   (0.019712133972111277, "Would've, Could've, Should've"),
   (0.01789978139943831, 'Everything Has Changed'),
   (0.012367607350653419, 'London Boy'),
   (0.011651090844188333, 'Sparks Fly'),
   (0.01103275844

In [182]:
check_songs(nb_model_bigrams_rm, vectorizer_bigrams_rm, random_state=3)

Lyric: And I chose you
Predicted song: ['All Too Well 10']
Correct song: Maroon
Top 15: [(0.010950737724026131, 'All Too Well 10'), (0.008651292228401181, "I'm Only Me When I'm With You"), (0.008647269618871547, '22'), (0.007757294179339282, 'I Wish You Would'), (0.007480406621253302, 'Blank Space'), (0.00742806331792825, 'Call It What You Want'), (0.007312212789116519, 'End Game'), (0.007293499875425897, 'Look What You Made Me Do'), (0.007267574445067379, 'Everything Has Changed'), (0.007249647798783373, 'Getaway Car'), (0.007073880528354188, 'Shake It Off'), (0.007015794619532247, 'Dress'), (0.006938247608389144, 'happiness'), (0.0068521634499028785, 'So It Goes...'), (0.0068340035342978585, 'Delicate')]


Lyric: You're gonna believe them
Predicted song: ['Fifteen']
Correct song: Fifteen
Top 15: [(0.08363287046425581, 'Fifteen'), (0.06744707662061201, 'Shake It Off'), (0.03608920459471683, 'Bejeweled'), (0.03335152628344587, 'cowboy like me'), (0.02986519787815517, 'Timeless'), (0.02

### Model trained on unigrams, bigrams, and trigrams

In [183]:
# initialize vectorizers with and without stopwords
vectorizer_trigrams_rm = CountVectorizer(stop_words="english", ngram_range=(1, 3))
vectorizer_trigrams_all = CountVectorizer(ngram_range=(1, 3))

In [184]:
# fit transform the training set
lyrics_train_trigrams_rm = vectorizer_trigrams_rm.fit_transform(lyrics_train)
lyrics_train_trigrams_all = vectorizer_trigrams_all.fit_transform(lyrics_train)

In [185]:
# observe the shapes - these matrices have more features than the unigram and unigram + bigram ones
print(lyrics_train_trigrams_rm.shape)
print(lyrics_train_trigrams_all.shape)

(8040, 19035)
(8040, 43524)


In [186]:
# transform the testing set
lyrics_test_trigrams_rm = vectorizer_trigrams_rm.transform(lyrics_test)
lyrics_test_trigrams_all = vectorizer_trigrams_all.transform(lyrics_test)

In [187]:
print(lyrics_test_trigrams_rm.shape)
print(lyrics_test_trigrams_all.shape)

(2010, 19035)
(2010, 43524)


In [188]:
nb_model_trigrams_rm = MultinomialNB()
nb_model_trigrams_all = MultinomialNB()

In [189]:
nb_model_trigrams_rm.fit(lyrics_train_trigrams_rm, song_train)
nb_model_trigrams_all.fit(lyrics_train_trigrams_all, song_train)

In [190]:
# make predictions
preds_trigrams_rm = nb_model_trigrams_rm.predict(lyrics_test_trigrams_rm)
preds_trigrams_all = nb_model_trigrams_all.predict(lyrics_test_trigrams_all)

In [191]:
# compute accuracy scores
accuracy_trigrams_rm = accuracy_score(song_test, preds_trigrams_rm)
accuracy_trigrams_all = accuracy_score(song_test, preds_trigrams_all)
print("Accuracy for (unigram, trigram) model with stopwords removed: ", accuracy_trigrams_rm, "\n", 
      "Accuracy for (unigram, trigram) model with stopwords retained: ", accuracy_trigrams_all, sep="")

Accuracy for (unigram, trigram) model with stopwords removed: 0.4771144278606965
Accuracy for (unigram, trigram) model with stopwords retained: 0.5651741293532339


In [192]:
check_songs(nb_model_trigrams_all, vectorizer_trigrams_all)

Lyric: Fell behind all my classmates and I ended up here
Predicted song: ['All Too Well 10']
Correct song: this is me trying
Top 15: [(0.43570324260846705, 'All Too Well 10'), (0.07689172605926287, 'King Of My Heart'), (0.07352940152314187, 'Castles Crumbling'), (0.05095098947229434, 'I Wish You Would'), (0.049755820230526306, 'Everything Has Changed'), (0.03203575139552464, "I'm Only Me When I'm With You"), (0.025254654434517517, 'All Too Well'), (0.022009565462996847, 'happiness'), (0.020020226263296744, 'Dress'), (0.014409466092154169, 'End Game'), (0.010071409132131566, 'Never Grow Up'), (0.008718347458867627, 'You Belong With Me'), (0.008443677140331354, 'Death By A Thousand Cuts'), (0.006251858611523574, 'long story short'), (0.005698166527652479, 'I Did Something Bad')]


Lyric: Wind in my hair, I was there
Predicted song: ['All Too Well 10']
Correct song: All Too Well
Top 15: [(0.9999999917825448, 'All Too Well 10'), (8.059531962609122e-09, 'All Too Well'), (1.345925691987082e-

### Model trained on bigrams and trigrams

A concern with using unigrams as features is that a word that appears many times in one song gets weighted very highly for that song, even when the lyric being classified comes from a different song in which the word appears just once. For example, any lyric containing the word "all" is likely to have All Too Well in its list of predictions.

In [195]:
predict_song(nb_model_trigrams_all, vectorizer_trigrams_all, "all i know")

(array(['Everything Has Changed'], dtype='<U39'),
 [(0.9130568202617246, 'Everything Has Changed'),
  (0.0072120601522594665, 'All You Had To Do Was Stay'),
  (0.005467464017994903, 'All Too Well 10'),
  (0.0023475970730876825, 'London Boy'),
  (0.002219905061991272, 'Delicate'),
  (0.002127649267207212, 'You Belong With Me'),
  (0.0019741113906986325, "I'm Only Me When I'm With You"),
  (0.0018684363314578306, 'So It Goes...'),
  (0.0017808592673787208, 'All Too Well'),
  (0.0017238430159414483, 'The Story Of Us'),
  (0.0015025077559177296, 'betty'),
  (0.0014830029704618732, 'Enchanted'),
  (0.001384155920607818, 'long story short'),
  (0.0013574805174089652, 'The Best Day'),
  (0.001264716272841617, 'Mean')])

In [251]:
# initialize vectorizers (one with stopwords removed, other one retaining them)
vectorizer_tri_rm = CountVectorizer(stop_words="english", ngram_range=(2, 3))
vectorizer_tri_all = CountVectorizer(ngram_range=(2, 3))

In [252]:
# fit transform the training set
lyrics_train_tri_rm = vectorizer_tri_rm.fit_transform(lyrics_train)
lyrics_train_tri_all = vectorizer_tri_all.fit_transform(lyrics_train)

In [253]:
# observe the shapes
print(lyrics_train_tri_rm.shape)
print(lyrics_train_tri_all.shape)

(8040, 15701)
(8040, 39957)


In [254]:
# transform the testing set
lyrics_test_tri_rm = vectorizer_tri_rm.transform(lyrics_test)
lyrics_test_tri_all = vectorizer_tri_all.transform(lyrics_test)

In [255]:
print(lyrics_test_tri_rm.shape)
print(lyrics_test_tri_all.shape)

(2010, 15701)
(2010, 39957)


In [256]:
nb_model_tri_rm = MultinomialNB()
nb_model_tri_all = MultinomialNB()

In [257]:
nb_model_tri_rm.fit(lyrics_train_tri_rm, song_train)
nb_model_tri_all.fit(lyrics_train_tri_all, song_train)

In [258]:
# make predictions
preds_tri_rm = nb_model_tri_rm.predict(lyrics_test_tri_rm)
preds_tri_all = nb_model_tri_all.predict(lyrics_test_tri_all)

In [259]:
# compute accuracy scores
accuracy_tri_rm = accuracy_score(song_test, preds_tri_rm)
accuracy_tri_all = accuracy_score(song_test, preds_tri_all)
print("Accuracy for (bigram, trigram) model with stopwords removed: ", accuracy_tri_rm, "\n", 
      "Accuracy for (bigram, trigram) model with stopwords retained: ", accuracy_tri_all, sep="")

Accuracy for (bigram, trigram) model with stopwords removed: 0.4407960199004975
Accuracy for (bigram, trigram) model with stopwords retained: 0.5761194029850746


In [260]:
best_model = nb_model_tri_all
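
The choice of `best_model` can be checked against the eight test accuracies printed in the sections above. The values below are copied from those outputs and rounded to four decimals; the dictionary layout and the configuration labels are just one way to organize them.

```python
# test accuracies copied from the outputs above, rounded to four decimals
accuracies = {
    ("unigram", "removed"): 0.3756,
    ("unigram", "retained"): 0.3995,
    ("unigram+bigram", "removed"): 0.4662,
    ("unigram+bigram", "retained"): 0.5289,
    ("unigram+bigram+trigram", "removed"): 0.4771,
    ("unigram+bigram+trigram", "retained"): 0.5652,
    ("bigram+trigram", "removed"): 0.4408,
    ("bigram+trigram", "retained"): 0.5761,
}

best_config = max(accuracies, key=accuracies.get)
ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for (features, stopwords), acc in ranked:
    print(f"{features:<24} stopwords {stopwords:<9} {acc:.4f}")
```

In every feature set, the variant that retains stopwords scores higher than the one that removes them.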

It looks like the main source of misclassification is short lyrics consisting of fewer than 5 words. This makes sense, because something like "Oh, oh" is much harder to pinpoint than, say, "One night, he wakes, strange look on his face." If the model is to be used as support in tayLyrics, it would only be helpful for lyrics of sufficient length. The check below keeps only lyrics of more than 3 words.

In [350]:
taylor_long = taylor[taylor['lyric'].str.split().apply(len) > 3]

In [351]:
random_test = taylor_long.sample(2000, random_state=21)
rand_lyrics = random_test["lyric"]
rand_songs = random_test["track_name"]

In [352]:
# vectorize the sampled lyrics with the vocabulary fitted on the training set
rand_features = vectorizer_tri_all.transform(rand_lyrics)

In [353]:
rand_preds = best_model.predict(rand_features)

In [354]:
accuracy_score(rand_songs, rand_preds)

0.9035
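
Because `taylor_long` is filtered from the full `taylor` frame, the 2,000 sampled lines include lines that were also in the training split, so 0.9035 is probably optimistic as an estimate for unseen lyrics. A sketch of the same length filter applied to held-out predictions only follows. `accuracy_by_length` is a hypothetical helper, and the predictions in the toy example are made up; in the notebook the inputs would be `lyrics_test`, `song_test`, and `preds_tri_all`.

```python
def accuracy_by_length(lyrics, true_songs, predicted_songs, min_words=4):
    """Accuracy over lyrics with at least min_words whitespace-separated words."""
    kept = [(t, p) for l, t, p in zip(lyrics, true_songs, predicted_songs)
            if len(l.split()) >= min_words]
    if not kept:
        return None  # nothing long enough to score
    return sum(t == p for t, p in kept) / len(kept)

# lyrics and true songs taken from rows shown above; predictions are made up
lyrics = ["With you", "On the way home?",
          "Memories feel like weapons", "And I chose you"]
true_songs = ["I'm Only Me When I'm With You", "Now That We Don't Talk",
              "Would've, Could've, Should've", "Maroon"]
predicted = ["ME!", "Now That We Don't Talk",
             "Would've, Could've, Should've", "All Too Well 10"]

acc_all = accuracy_by_length(lyrics, true_songs, predicted, min_words=1)
acc_long = accuracy_by_length(lyrics, true_songs, predicted, min_words=4)
print(acc_all, acc_long)
```

The default `min_words=4` matches the `> 3` filter used for `taylor_long`.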

### Playing around with the best model

In [317]:
predict_song(best_model, vectorizer_tri_all, "I hit you like bang, we tried to forget it, but we just couldn't") # correct

(array(['End Game'], dtype='<U39'),
 [(0.9999790643401751, 'End Game'),
  (2.6959459926663996e-06, 'Getaway Car'),
  (4.949882415977247e-07, 'no body, no crime'),
  (4.894067756717157e-07, 'The Very First Night'),
  (4.878019417276862e-07, 'Dress'),
  (4.12009933151888e-07, 'Dancing With Our Hands Tied'),
  (3.883941508393997e-07, 'ME!'),
  (3.8217130426758094e-07, 'All Too Well 10'),
  (3.737777993927537e-07, 'I Bet You Think About Me'),
  (3.086296554867145e-07, 'Bye Bye Baby'),
  (2.943184092186004e-07, 'Come Back...Be Here'),
  (2.798071430575251e-07, 'We Are Never Ever Getting Back Together'),
  (2.74563126633123e-07, 'False God'),
  (2.1520627356057965e-07, '22'),
  (2.0432600526328306e-07, 'New Romantics')])

In [318]:
predict_song(best_model, vectorizer_tri_all, "My baby's fly like a jet stream")

(array(['Call It What You Want'], dtype='<U39'),
 [(0.9962779187541664, 'Call It What You Want'),
  (0.00014050839958826736, "Don't Blame Me"),
  (7.874287026041877e-05, 'Death By A Thousand Cuts'),
  (6.433532886879519e-05, 'coney island'),
  (3.344624218687505e-05, 'All Too Well 10'),
  (3.3121374566584174e-05, "This Is Why We Can't Have Nice Things"),
  (3.0068611687371044e-05, '22'),
  (2.897518334453782e-05, "I'm Only Me When I'm With You"),
  (2.6385604414136923e-05, 'I Wish You Would'),
  (2.5982371700955625e-05, 'Blank Space'),
  (2.5264818136272848e-05, 'Everything Has Changed'),
  (2.462287074562163e-05, 'Getaway Car'),
  (2.457927116282166e-05, 'Look What You Made Me Do'),
  (2.4340772408920167e-05, 'So It Goes...'),
  (2.431428837366203e-05, 'End Game')])

In [319]:
predict_song(best_model, vectorizer_tri_all, "Ladies and gentlemen, will you please stand?")

(array(['Lover'], dtype='<U39'),
 [(0.8661052291236252, 'Lover'),
  (0.0032353129054091302, 'betty'),
  (0.0027721260161374024, 'Nothing New'),
  (0.0020309704757076196, 'Love Story'),
  (0.0019743916450157247, 'Stay Beautiful'),
  (0.0015252970044637771, 'Long Live'),
  (0.0012545514204597603, 'august'),
  (0.0011553445835073512, 'All Too Well 10'),
  (0.0010726491515541325, '22'),
  (0.001025169271188413, "I'm Only Me When I'm With You"),
  (0.0009359861012582924, 'I Wish You Would'),
  (0.0009251419140731085, 'Blank Space'),
  (0.0009006586679908218, 'Everything Has Changed'),
  (0.000899179382610405, 'Call It What You Want'),
  (0.000874093432210988, 'So It Goes...')])

In [320]:
predict_song(best_model, vectorizer_tri_all, "Two headlights shine through the sleepless night")

(array(['Treacherous'], dtype='<U39'),
 [(0.9986473272650471, 'Treacherous'),
  (2.2626558942680883e-05, 'All Too Well 10'),
  (1.798647202589803e-05, 'Fifteen'),
  (1.554177459762014e-05, 'Cruel Summer'),
  (1.5275959854252726e-05, 'The Very First Night'),
  (1.5231053307945232e-05, 'Jump Then Fall'),
  (1.4129885315009038e-05, 'All Too Well'),
  (1.3809034068256999e-05, 'Death By A Thousand Cuts'),
  (1.3291609210075265e-05, 'The Moment I Knew'),
  (1.3258345557807187e-05, 'Love Story'),
  (1.3224068505644519e-05, 'Red'),
  (1.3157876841390851e-05, 'This Love'),
  (1.2528107157660476e-05, 'State Of Grace'),
  (1.1895212161559937e-05, 'Anti-Hero'),
  (1.1734219298625625e-05, 'The Other Side Of The Door')])

In [321]:
predict_song(best_model, vectorizer_tri_all, "That I can't say hello to you and risk another goodbye")

(array(['I Almost Do'], dtype='<U39'),
 [(0.9977427800714883, 'I Almost Do'),
  (0.0005239637882887701, "'tis the damn season"),
  (0.00023836386238914795, 'Tell Me Why'),
  (0.00012102298639821544, 'Better Man'),
  (7.766619487065169e-05, "New Year's Day"),
  (5.462284821219675e-05, 'The Very First Night'),
  (3.837941423811344e-05, 'Wonderland'),
  (3.729137494873231e-05, 'Today Was A Fairytale'),
  (3.028038559404978e-05, 'Message In A Bottle'),
  (2.798968179722552e-05, 'Miss Americana & The Heartbreak Prince'),
  (2.675302111078317e-05, '22'),
  (2.5676609579987984e-05, 'End Game'),
  (2.181028706725997e-05, 'Gorgeous'),
  (2.0338233932222156e-05, 'betty'),
  (2.0241643406920995e-05, 'Bad Blood')])

In [322]:
predict_song(best_model, vectorizer_tri_all, "Ooh-ooh-ooh-ooh-ooh")

(array(['We Are Never Ever Getting Back Together'], dtype='<U39'),
 [(0.9139705509863931, 'We Are Never Ever Getting Back Together'),
  (0.08353574247668508, 'ME!'),
  (0.0021648484236013988, 'dorothea'),
  (0.00032640724907036624, 'invisible string'),
  (1.5413107963626675e-06, "Would've, Could've, Should've"),
  (7.435557292461757e-07, 'Dancing With Our Hands Tied'),
  (7.257702198174102e-08, 'High Infidelity'),
  (6.449194674333262e-08, 'exile'),
  (6.29221462621261e-09, 'Sweet Nothing'),
  (6.067110938799289e-09, 'Starlight'),
  (1.0887070241695766e-09, 'Breathe'),
  (1.626462674223086e-10, 'All Too Well 10'),
  (1.415891689874593e-10, '22'),
  (1.3756811253896413e-10, "I'm Only Me When I'm With You"),
  (1.2494695235284544e-10, 'I Wish You Would')])

In [323]:
predict_song(best_model, vectorizer_tri_all, "In red lipstick")

(array(['The Moment I Knew'], dtype='<U39'),
 [(0.25194856175741265, 'The Moment I Knew'),
  (0.010790518296098997, 'Look What You Made Me Do'),
  (0.007922557832714667, 'All Too Well 10'),
  (0.006466825179184315, '22'),
  (0.0063874678321166846, "I'm Only Me When I'm With You"),
  (0.005771260580085831, 'I Wish You Would'),
  (0.005619540285117, 'Blank Space'),
  (0.005509453740475802, 'Call It What You Want'),
  (0.005444960100075555, 'Everything Has Changed'),
  (0.005436373954213943, 'End Game'),
  (0.005398447364334674, 'Getaway Car'),
  (0.005310477159977377, 'Shake It Off'),
  (0.005214678885835655, 'Dress'),
  (0.005158540388273988, 'happiness'),
  (0.005131857005247268, 'So It Goes...')])

In [324]:
predict_song(best_model, vectorizer_tri_all, "Think about the place where you first met me")

(array(['Getaway Car'], dtype='<U39'),
 [(0.9999999999523936, 'Getaway Car'),
  (4.621520284901889e-12, 'Gorgeous'),
  (1.6623118359885664e-12, 'I Bet You Think About Me'),
  (1.594218681589229e-12, 'Bad Blood'),
  (1.4184237261849997e-12, 'right where you left me'),
  (1.021748977570577e-12, 'Better Than Revenge'),
  (7.866431905121289e-13, '22'),
  (7.733516446664934e-13, 'The Very First Night'),
  (6.954019808452815e-13, 'King Of My Heart'),
  (6.787433152806021e-13, 'I Wish You Would'),
  (6.499493753561846e-13, 'Innocent'),
  (6.19359349266863e-13, 'You Belong With Me'),
  (5.209253824460425e-13, 'Is It Over Now?'),
  (4.225912721887503e-13, 'Lover'),
  (4.200263998711958e-13, 'Look What You Made Me Do')])

In [325]:
predict_song(best_model, vectorizer_tri_all, "I don't wanna keep secrets just to keep you")

(array(['Cruel Summer'], dtype='<U39'),
 [(0.9886151860165364, 'Cruel Summer'),
  (0.0006688488066800808, 'End Game'),
  (0.0004951491506496191, "I'm Only Me When I'm With You"),
  (0.0004547434323349502, 'Afterglow'),
  (0.0002747996852207756, 'Come Back...Be Here'),
  (0.00025841297091033365, 'Is It Over Now?'),
  (0.00023599621589174988, 'mirrorball'),
  (0.00021425079095448775, 'So It Goes...'),
  (0.00019974477666444195, "Say Don't Go"),
  (0.00019876958697114718, 'Holy Ground'),
  (0.00019074949382741916, 'Daylight'),
  (0.0001530111866580123, 'long story short'),
  (0.00012525887812678694, 'Castles Crumbling'),
  (0.00012136349869933872, 'Foolish One'),
  (0.0001124214035566652, 'Blank Space')])

In [326]:
predict_song(best_model, vectorizer_tri_all, "Daring you to leave me just so I can try and scare you")

(array(['False God'], dtype='<U39'),
 [(0.9997021107752267, 'False God'),
  (0.00016493617849117388, "Say Don't Go"),
  (8.807737576680491e-06, 'coney island'),
  (6.351489840491865e-06, 'invisible string'),
  (3.3183881392310644e-06, 'willow'),
  (3.2576622708904433e-06, 'Style'),
  (2.833756250218804e-06, 'The Very First Night'),
  (2.372943698341053e-06, 'Getaway Car'),
  (2.0789536949709215e-06, 'Miss Americana & The Heartbreak Prince'),
  (1.982436844171696e-06, 'Haunted'),
  (1.8860149989365461e-06, "'tis the damn season"),
  (1.6640818872323307e-06, 'this is me trying'),
  (1.3806712430729346e-06, 'All Too Well 10'),
  (1.327381838428162e-06, 'The Archer'),
  (1.3060869275716799e-06, 'I Bet You Think About Me')])

In [327]:
predict_song(best_model, vectorizer_tri_all, "We gather stones, never knowing what they'll mean")

(array(['my tears ricochet'], dtype='<U39'),
 [(0.9938384819685051, 'my tears ricochet'),
  (0.0002888367260986306, 'I Know Places'),
  (0.0001788672324323188, 'End Game'),
  (0.00013558260651198127, 'The Lucky One'),
  (0.00011855640680993045, 'Blank Space'),
  (0.00010931913233857718, 'Shake It Off'),
  (5.115214733164157e-05, "This Is Why We Can't Have Nice Things"),
  (4.8121570608858844e-05, 'Nothing New'),
  (4.644898431769027e-05, 'All Too Well 10'),
  (4.5992033509659446e-05, '22'),
  (4.323850196166723e-05, "I'm Only Me When I'm With You"),
  (3.9683538454734115e-05, 'I Wish You Would'),
  (3.896481509124603e-05, 'You Need To Calm Down'),
  (3.856422701549499e-05, 'Everything Has Changed'),
  (3.824343312212407e-05, 'Call It What You Want')])

In [330]:
predict_song(best_model, vectorizer_tri_all, "Uh-huh, uh-huh")

(array(['Glitch'], dtype='<U39'),
 [(0.3324817631397138, 'Glitch'),
  (0.12646949972724875, 'The Great War'),
  (0.09445164527322489, 'Is It Over Now?'),
  (0.08090163678415893, 'Paper Rings'),
  (0.026336380499353988, "I'm Only Me When I'm With You"),
  (0.00759070681968109, 'Karma'),
  (0.0035863087710287715, 'All Too Well 10'),
  (0.0029748388063658366, '22'),
  (0.0026474149632157934, 'I Wish You Would'),
  (0.0025826510478511506, 'Blank Space'),
  (0.002529311168864851, 'Call It What You Want'),
  (0.002503899474452888, 'Everything Has Changed'),
  (0.002484677180157404, 'End Game'),
  (0.0024754212523827356, 'Getaway Car'),
  (0.0024734721978027802, 'Look What You Made Me Do')])

In [332]:
predict_song(best_model, vectorizer_tri_all, "I could show you incredible things")

(array(['Blank Space'], dtype='<U39'),
 [(0.48285961183091974, 'Blank Space'),
  (0.009426621407678982, 'Jump Then Fall'),
  (0.0065044010319507605, 'Invisible'),
  (0.005005469431859659, 'All Too Well 10'),
  (0.004357433272101828, '22'),
  (0.004233684504572694, "I'm Only Me When I'm With You"),
  (0.0038452659290502695, 'I Wish You Would'),
  (0.0037828932252648184, 'mirrorball'),
  (0.0036824238845061604, 'Call It What You Want'),
  (0.003663817270382582, 'Everything Has Changed'),
  (0.003591205035195155, 'Getaway Car'),
  (0.003586258229203078, 'Look What You Made Me Do'),
  (0.0035770746029650815, 'Picture To Burn'),
  (0.00356945795329874, 'End Game'),
  (0.003535817151196132, 'Shake It Off')])

In [358]:
predict_song(best_model, vectorizer_tri_all, "By the way")  # incorrect prediction, though "Today Was A Fairytale" does repeat "the way" a lot

(array(['Today Was A Fairytale'], dtype='<U39'),
 [(0.06606601366391912, 'Today Was A Fairytale'),
  (0.03362640135179085, 'Bejeweled'),
  (0.030649424034485253, 'cowboy like me'),
  (0.030264878107575804, 'Question...?'),
  (0.02623467675673819, 'Stay Beautiful'),
  (0.019544545727522063, 'The Way I Loved You'),
  (0.018211749584057205, 'Picture To Burn'),
  (0.016147356509495652, 'Jump Then Fall'),
  (0.014986381674831665, 'Snow On The Beach'),
  (0.013218332090972614, 'You Are In Love'),
  (0.013208468270418147, 'Mine'),
  (0.011643364446504906, 'Enchanted'),
  (0.011014281638643679, 'Paris'),
  (0.010627266739749765, 'Hey Stephen'),
  (0.009463390820411215, 'End Game')])
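
One way to sanity-check the hunch about "the way" is to look at how often each n-gram of the query appears per song. The sketch below is only an illustration on a tiny made-up corpus: `toy_lyrics`, `toy_songs`, `toy_vectorizer` and `ngram_counts_per_class` are hypothetical names that are not part of this notebook, and the `ngram_range=(1, 3)` setting is an assumption based on the name `vectorizer_tri_all`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in corpus: one row per "song" (NOT the real lyrics dataset).
toy_lyrics = [
    "the way you smile the way you laugh the way you go",
    "by the way i am leaving tonight",
    "nothing here matches at all",
]
toy_songs = ["song_a", "song_b", "song_c"]

# Assumed setup: unigrams, bigrams and trigrams.
toy_vectorizer = CountVectorizer(ngram_range=(1, 3))
X = toy_vectorizer.fit_transform(toy_lyrics)


def ngram_counts_per_class(vectorizer, X, labels, query):
    """For each n-gram of `query` that the vectorizer knows, return its count per label."""
    analyzer = vectorizer.build_analyzer()  # same tokenisation as used in fitting
    vocab = vectorizer.vocabulary_
    result = {}
    for ngram in analyzer(query):
        if ngram in vocab:
            column = X[:, vocab[ngram]].toarray().ravel()
            result[ngram] = dict(zip(labels, column.tolist()))
    return result


counts = ngram_counts_per_class(toy_vectorizer, X, toy_songs, "By the way")
for ngram, per_song in counts.items():
    print(ngram, per_song)
```

In this toy case `song_b` is the only one containing the full phrase, but `song_a` has three times as many occurrences of "the way". With the real data the rows are individual lyric lines rather than whole songs, so the counts would need to be summed per song before comparing.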

In [360]:
predict_song(best_model, vectorizer_tri_all, "Take me away")

(array(['The Very First Night'], dtype='<U39'),
 [(0.879114701578543, 'The Very First Night'),
  (0.004311172393939708, 'Superman'),
  (0.002642949886767918, "Mary's Song"),
  (0.0024677209252992175, 'Lover'),
  (0.0017628509049690464, 'Love Story'),
  (0.0015199823820698031, 'Style'),
  (0.0013986343960045076, 'Wildest Dreams'),
  (0.0013052007253788866, 'Mean'),
  (0.0012493727730723979, 'the lakes'),
  (0.0012443562233638932, 'New Romantics'),
  (0.00114401377830554, 'All Too Well 10'),
  (0.0009711683663875972, "You're On Your Own, Kid"),
  (0.0009338066396096064, '22'),
  (0.0009223474744798472, "I'm Only Me When I'm With You"),
  (0.0008333674251699971, 'I Wish You Would')])