# Examining the distinguishing characteristics of contemporary pop songs

## Contents: 

0. [Question](#zeroth-section) 
1. [Operationalization of the question](#first-section)
2. [Code](#second-section)
3. [Interpretation & discussion](#third-section)
4. [Analysis of new lyrics](#fourth-section)

## Question <a class="anchor" id="zeroth-section"></a>

A record producer has approached you with the question of whether
there are any distinguishable features of a pop song. In other words: what
sets a pop song (lyrically) apart from other genres?


Try to formulate a good operationalization of this question (how are you going
to quantify this, what method do you use, what steps do you need to take,
what time period are you focusing on) and argue why this operationalization would be suitable to formulate an answer to the question of the record
producer.

Implement your operationalization and
formulate an answer to the question of the record producer.

After that, consider the new lyrics provided by our hypothetical
producer [(copy-pastable lyrics here)](https://gist.githubusercontent.com/veerbeek/d5b5c971abe05a9460e9f31d786183e5/raw/ab8b8192fa330cc0f97314e3a4f0a312190a735a/lyrics.txt).

## 1. Operationalization of the question <a class="anchor" id="first-section"></a>

To answer the question “what are the distinguishable features of a pop song?” the first crucial step is to define what we mean by “features”. Various approaches are possible. For example, we can choose to look at individual words occurring in pop and non-pop songs or we can look at the topics occurring in these genres. In my analysis I have chosen to examine the topics as these bear closer resemblance to the common human notion of “features” of a given genre.

Since we know that genres evolve over time (from previous exercises), the answer to the question "what are the distinguishing features of pop songs?" would likely be different for pop songs from the 1970s vs contemporary pop songs. It would be pragmatic to only look at songs from recent years. To do this, I will fit a topic model on all songs published between 2010 and 2020.
Because each genre is conceptually distinct, we would need at least the same number of topics as music genres. Moreover, because there are sub-genres within every genre of music, we would need a higher number of topics than the number of genres. We can use, say, 50 topics.

After fitting the topic model, I will use the topic distributions of the songs as features to train a logistic regression model that classifies songs as pop or non-pop. Once the model is trained, we can interpret the topics most predictive of pop songs as the most distinguishing features of the genre: the higher (more positive) the coefficient of a topic, the more distinctive that topic is of the pop genre.
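
To make the role of the coefficients concrete, the following minimal sketch shows how a logistic regression turns topic proportions into a pop probability. The two-topic setup, the coefficient values, the zero intercept and the function name `pop_probability` are illustrative assumptions, not values from the fitted model.

```python
import math

def pop_probability(topic_proportions, coefficients, intercept=0.0):
    """Logistic regression: sigmoid of intercept + sum(coef_k * proportion_k)."""
    score = intercept + sum(c * x for c, x in zip(coefficients, topic_proportions))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical two-topic model: the first topic has a positive coefficient,
# the second a negative one (values invented for illustration only).
coefficients = [3.0, -1.0]

mostly_first_topic = [0.9, 0.1]
mostly_second_topic = [0.1, 0.9]

p_first = pop_probability(mostly_first_topic, coefficients)
p_second = pop_probability(mostly_second_topic, coefficients)
print(round(p_first, 3), round(p_second, 3))
```

A song dominated by a topic with a positive coefficient ends up above 0.5, and a song dominated by a topic with a negative coefficient ends up below it.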

Lastly, I will classify the new song using the logistic regression model and look at the most prominent topics in the song to determine whether it resembles other contemporary pop songs.

## 2. Code <a class="anchor" id="second-section"></a>

In [70]:
import pandas as pd
import numpy as np
import spacy
from tqdm.auto import tqdm
from gensim.corpora import Dictionary
import os
from gensim.models.wrappers import LdaMallet
import gensim
import pickle

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
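from gensim.models import Word2Vec # missing import; note that this model is not used in the analysis below
model = Word2Vec.load("word2vec.model")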

### 2.1 Preprocessing

Loading the dataset and making a correction for the dates of the songs using the dates from Spotify.

In [5]:
PATH_DF = 'data/english_cleaned_lyrics.csv' # location of song lyrics dataset
PATH_CORRECTION = 'data/indx2newdate.p' # location of dataset with correct dates

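# correcting the dates of the songs
def load_dataset(data_path, path_correction):
    df = pd.read_csv(data_path)
    with open(path_correction, 'rb') as f: # mapping from song index to corrected (Spotify) dates
        indx2newdate = pickle.load(f)
    # take the year (first four characters) of the corrected date; 0 if no date is available
    df['year'] = df['index'].apply(lambda x: int(indx2newdate[x][0][:4]) if indx2newdate[x][0] != '' else 0)
    return df[df.year > 1960][['song', 'year', 'artist', 'genre', 'lyrics']] # drops undated (year 0) and pre-1961 songs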

df = load_dataset(PATH_DF, PATH_CORRECTION)
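
The year extraction can be illustrated in isolation. This is a sketch on made-up data: the mapping below only mimics the structure that the code above implies for `indx2newdate` (a song index mapped to a sequence whose first element is a date string, possibly empty), and the helper name `corrected_year` is hypothetical.

```python
def corrected_year(index, indx2newdate):
    """Return the year from the corrected date string, or 0 if the date is empty."""
    date = indx2newdate[index][0]
    return int(date[:4]) if date != '' else 0

# Made-up stand-in for the pickled mapping (the real contents are not shown here).
toy_mapping = {
    0: ["2009-03-17"],
    1: ["2014"],
    2: [""],  # no corrected date available
}

years = [corrected_year(i, toy_mapping) for i in sorted(toy_mapping)]
print(years)
```

Songs that end up with year 0 are then dropped by the `year > 1960` filter.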

Examining the first five rows of the dataframe

In [6]:
df.head()

Unnamed: 0,song,year,artist,genre,lyrics
0,ego-remix,2009,beyonce-knowles,Pop,Oh baby how you doing You know I'm gonna cut r...
5,all-i-could-do-was-cry,2008,beyonce-knowles,Pop,I heard Church bells ringing I heard A choir s...
6,once-in-a-lifetime,2008,beyonce-knowles,Pop,This is just another day that I would spend Wa...
9,why-don-t-you-love-me,2009,beyonce-knowles,Pop,N n now honey You better sit down and look aro...
16,poison,2009,beyonce-knowles,Pop,You're bad for me I clearly get it I don't see...


Showing the number of songs per genre

In [7]:
df.groupby('genre').agg('count')

genre,song,year,artist,lyrics
Country,10545,10545,10545,10545
Electronic,5194,5194,5194,5194
Folk,1373,1373,1373,1373
Hip-Hop,14878,14878,14878,14878
Indie,2489,2489,2489,2489
Jazz,5068,5068,5068,5068
Metal,15671,15671,15671,15671
Other,2449,2449,2449,2449
Pop,23295,23295,23295,23295
R&B,2338,2338,2338,2338


Making a subset of the songs published in or after 2010

In [19]:
df2010 = df[df['year'] >= 2010]
len(df2010)

47436

Showing the number of songs per genre in the 2010-and-later subset

In [18]:
df2010.groupby('genre').agg('count')

genre,song,year,artist,lyrics
Country,2877,2877,2877,2877
Electronic,2458,2458,2458,2458
Folk,497,497,497,497
Hip-Hop,6047,6047,6047,6047
Indie,1469,1469,1469,1469
Jazz,1608,1608,1608,1608
Metal,4084,4084,4084,4084
Other,1949,1949,1949,1949
Pop,7791,7791,7791,7791
R&B,669,669,669,669


This has reduced the number of songs in our dataset from 160856 to 47436.

Processing the text through spaCy

In [22]:
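# NOTE: the cell that created `nlp` is missing from this notebook; the model name below is an assumption
nlp = spacy.load("en_core_web_sm")

processed_texts = [text for text in tqdm(nlp.pipe(df2010.lyrics, 
                                              disable=["ner",
                                                       "parser"]))]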


Tokenizing the processed texts

In [34]:
tokenized_texts = [[word.lemma_.lower() for word in processed_text if not word.is_stop and not word.is_punct] for processed_text in processed_texts]

Creating a dictionary and corpus

In [37]:
from gensim.corpora import Dictionary

MIN_DF = 5 # minimum document frequency (number of songs)
MAX_DF = 0.9 # maximum document frequency

dictionary = Dictionary(tokenized_texts) # get the vocabulary
dictionary.filter_extremes(no_below=MIN_DF, 
                           no_above=MAX_DF)

corpus = [dictionary.doc2bow(text) for text in tokenized_texts]
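
What `filter_extremes` does to the vocabulary can be sketched in plain Python. This is a simplified illustration rather than gensim's implementation (it ignores the `keep_n` limit, for instance), and the toy documents and the function name `filter_vocabulary` are made up.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, no_below, no_above):
    """Keep tokens that occur in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (simplified document-frequency filter)."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(tokenized_docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

toy_docs = [
    ["love", "baby", "yeah"],
    ["love", "rain", "yeah"],
    ["love", "baby", "night"],
    ["love", "night", "yeah"],
]

# "love" is in every document (too common), "rain" in only one (too rare).
vocab = filter_vocabulary(toy_docs, no_below=2, no_above=0.9)
print(sorted(vocab))
```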

### 2.2 Fitting a topic model

Learning the topics

In [39]:
mallet_path = r'C:/mallet/bin/mallet.bat'

N_TOPICS = 50 # k
N_ITERATIONS = 1000 # usually 1000 will do

lda = LdaMallet(mallet_path,
                corpus=corpus,
                id2word=dictionary,
                num_topics=N_TOPICS,
                optimize_interval=10,
                iterations=N_ITERATIONS)

Storing the top 10 words of each topic in a list and displaying the contents of the list

In [41]:
topics = []

for topic in range(N_TOPICS): # iterating over the index of topics
    words = lda.show_topic(topic, 10) # fetching the top 10 words of the topics
    topic_n_words = ' '.join([word[0] for word in words]) # joining the words to a single string
    topics.append('Topic {}: {}'.format(str(topic+1), topic_n_words)) # appending to list of topics
    
topics

['Topic 1: cry tear leave hurt give goodbye pain lie hate die',
 'Topic 2: baby girl boy crazy ready wanna body babe love yeah',
 "Topic 3: yeah ooh oooh ohh 'cause mmm hey dat whoa wanna",
 'Topic 4: tonight round cold catch shut middle air alright spin feel',
 'Topic 5: word face write speak time place read mind story perfect',
 'Topic 6: man money woman pay buy spend dollar big rich girl',
 'Topic 7: christmas woah boom bell ring snow santa year merry happy',
 'Topic 8: life live dream forever wake time day sleep die true',
 "Topic 9: wanna run slow wild fast fun feel ey 'cause hide",
 'Topic 10: goin gettin lookin nothin comin feelin yea tryin talkin runnin',
 'Topic 11: big hair wear red room head sit dress bed pretty',
 'Topic 12: hey nah mi dem di yuh ma fi gal nuh',
 'Topic 13: yo em shit ya man rap hit game smoke big',
 'Topic 14: hold arm black tight hand white gimme close night darling',
 'Topic 15: shake city easy street lady town pa york london pum',
 'Topic 16: play mind 

Transforming songs to their topic distribution

In [43]:
# Transforming songs to their topic distribution
transformed_docs = lda.load_document_topics()

# Creating a dataframe of the topic distributions
topic_distributions = pd.DataFrame([[x[1] for x in doc] for doc in transformed_docs], 
             columns=['topic_{}'.format(i+1) for i in range(N_TOPICS)])

topic_distributions.head()

Unnamed: 0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,...,topic_41,topic_42,topic_43,topic_44,topic_45,topic_46,topic_47,topic_48,topic_49,topic_50
0,0.001071,0.000995,0.000514,0.00038,0.027242,0.000717,0.000163,0.009653,0.000521,0.00066,...,0.000667,0.000334,0.000446,0.000116,0.000307,0.295475,0.000583,0.295964,0.001102,0.000397
1,0.000558,0.158631,0.009303,0.000198,0.000644,0.000374,8.5e-05,0.000513,0.081586,0.000344,...,0.000348,0.000174,0.000232,6e-05,0.00016,0.000391,0.000304,0.000646,0.000574,0.000207
2,0.001296,0.001204,0.000622,0.000459,0.001496,0.000868,0.000197,0.001192,0.00063,0.000799,...,0.000807,0.000404,0.000539,0.00014,0.000371,0.000908,0.000705,0.064418,0.011819,0.000481
3,0.000521,0.000484,0.00025,0.000185,0.000601,0.029839,7.9e-05,0.000479,0.000253,0.000321,...,0.000324,0.080209,0.000217,5.6e-05,0.000149,0.000365,0.004496,0.017454,0.000535,0.000193
4,0.000516,0.092391,0.000248,0.029427,0.000596,0.000346,7.9e-05,0.08403,0.000251,0.000318,...,0.380499,0.000161,0.021104,5.6e-05,0.000148,0.09645,0.000281,0.000597,0.017242,0.000192


Joining the topic distributions with the rest of the df

In [52]:
joined_topic_dist = df2010.reset_index().join(topic_distributions)

joined_topic_dist.head()

Unnamed: 0,index,song,year,artist,genre,lyrics,topic_1,topic_2,topic_3,topic_4,...,topic_41,topic_42,topic_43,topic_44,topic_45,topic_46,topic_47,topic_48,topic_49,topic_50
0,38,angel,2014,beyonce-knowles,Pop,This is for my fans Uhu uhu This is for my des...,0.001071,0.000995,0.000514,0.00038,...,0.000667,0.000334,0.000446,0.000116,0.000307,0.295475,0.000583,0.295964,0.001102,0.000397
1,49,mine,2014,beyonce-knowles,Pop,I've been watching for the signs Took a trip ...,0.000558,0.158631,0.009303,0.000198,...,0.000348,0.000174,0.000232,6e-05,0.00016,0.000391,0.000304,0.000646,0.000574,0.000207
2,50,superpower,2014,beyonce-knowles,Pop,When the palm of my two hands hold each other...,0.001296,0.001204,0.000622,0.000459,...,0.000807,0.000404,0.000539,0.00014,0.000371,0.000908,0.000705,0.064418,0.011819,0.000481
3,51,haunted,2014,beyonce-knowles,Pop,The winner is Beyonce Knowles female pop voca...,0.000521,0.000484,0.00025,0.000185,...,0.000324,0.080209,0.000217,5.6e-05,0.000149,0.000365,0.004496,0.017454,0.000535,0.000193
4,52,flawless,2014,beyonce-knowles,Pop,Your challengers are a young group from Houst...,0.000516,0.092391,0.000248,0.029427,...,0.380499,0.000161,0.021104,5.6e-05,0.000148,0.09645,0.000281,0.000597,0.017242,0.000192


Transforming the labels to 1 for pop songs and 0 for non-pop songs

In [57]:
# creating new feature called "label": 1 if the song is pop, 0 otherwise
joined_topic_dist['label'] = (joined_topic_dist['genre'] == 'Pop').astype(int)

joined_topic_dist[joined_topic_dist['genre'] == 'Rock'].head(10)

Unnamed: 0,index,song,year,artist,genre,lyrics,topic_1,topic_2,topic_3,topic_4,...,topic_42,topic_43,topic_44,topic_45,topic_46,topic_47,topic_48,topic_49,topic_50,label
70,343,taken,2010,brightwood,Rock,I am taken I am not my own I am floating Teach...,0.007554,0.007015,0.003623,0.002677,...,0.002353,0.003141,0.000817,0.002163,0.00529,0.004109,0.130974,0.007768,0.002802,0
84,470,hisingen-blues,2011,graveyard,Rock,Going by the riot Call the rest a stone Leadin...,0.00314,0.002916,0.026911,0.001113,...,0.000978,0.001306,0.00034,0.000899,0.002199,0.103329,0.003632,0.003229,0.001165,0
85,471,no-good-mr-holden,2011,graveyard,Rock,New day when beauty's all gone And blues follo...,0.009825,0.001004,0.000518,0.000383,...,0.000337,0.000449,0.000117,0.000309,0.061966,0.000588,0.00125,0.0186,0.000401,0
86,472,uncomfortably-numb,2011,graveyard,Rock,I wasn't there when you needed me You never le...,0.033515,0.001811,0.000936,0.000691,...,0.000608,0.000811,0.000211,0.000559,0.001366,0.001061,0.175862,0.002006,0.000723,0
94,499,move,2020,arcade-fire,Rock,Everybody stand up get down Move when I tell y...,0.064548,0.010101,0.000537,0.000397,...,0.000349,0.000466,0.000121,0.000321,0.073273,0.00967,0.001296,0.001152,0.000415,0
95,502,the-arcade-fire,2010,arcade-fire,Rock,In the middle of the summer I'm not sleeping c...,0.001835,0.001704,0.00088,0.134257,...,0.030262,0.000763,0.000198,0.000525,0.001285,0.000998,0.002123,0.001887,0.00068,0
96,508,m-i-a,2010,arcade-fire,Rock,Here's my song about gun control As my politic...,0.000678,0.011597,0.000325,0.00024,...,0.000211,0.011249,7.3e-05,0.000194,0.000475,0.000369,0.011751,0.017148,0.000251,0
97,529,here-comes-the-night-time-ii,2013,arcade-fire,Rock,Here comes the night time Here comes the nigh...,0.067201,0.00187,0.000966,0.000714,...,0.000627,0.000838,0.000218,0.000577,0.001411,0.001096,0.00233,0.002071,0.000747,0
98,530,joan-of-arc,2013,arcade-fire,Rock,You're the one that they used to hate But they...,0.000934,0.144413,0.000448,0.000331,...,0.000291,0.000388,0.211642,0.000267,0.000654,0.000508,0.144626,0.099176,0.000346,0
99,532,reflektor,2013,arcade-fire,Rock,Trapped in a prism in a prism of light Alone i...,0.001283,0.001191,0.000615,0.000455,...,0.0004,0.000533,0.176557,0.000367,0.000898,0.000698,0.032616,0.001319,0.000476,0


Examining the number of pop vs non-pop songs

In [58]:
joined_topic_dist.groupby('label').agg({'label': 'count'})

label,count
0,39645
1,7791


### 2.3 Fitting a LOGIT model

Making a train-test split

In [90]:
# separating features (the 50 topic columns) from labels
X = joined_topic_dist.iloc[:,6:56].astype(float)
y = joined_topic_dist['label'].astype(int)

# making a train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)

Learning the logistic regression model. Because the classes are highly imbalanced (7,791 pop vs 39,645 non-pop songs), I am adjusting the class weights to be inversely proportional to the number of observations in each class.

In [None]:
lr = LogisticRegression(max_iter=1000000, class_weight='balanced') 

lr.fit(X_train, y_train)
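
As a rough check on what `class_weight='balanced'` implies, scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to the label counts of the full 2010-and-later subset shown earlier. The model itself is fit on the training split only, so its actual weights will differ slightly; the function name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights following n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Label counts of the whole subset (0 = non-pop, 1 = pop), used as an approximation.
weights = balanced_class_weights({0: 39645, 1: 7791})
for label, weight in weights.items():
    print(label, round(weight, 3))
```

Each pop song therefore counts roughly five times as much as a non-pop song when the model is fit.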

Getting predictions on the test set and examining the classification report

In [232]:
labels = ['non-pop', 'pop']
y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred, 
                          target_names=labels))

              precision    recall  f1-score   support

     non-pop       0.91      0.58      0.71      7944
         pop       0.25      0.70      0.36      1544

    accuracy                           0.60      9488
   macro avg       0.58      0.64      0.54      9488
weighted avg       0.80      0.60      0.65      9488
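
For readers less familiar with this report, the sketch below shows how precision, recall and F1 follow from raw counts. The counts are hypothetical, chosen only so that precision and recall come out at 0.25 and 0.70, close to the pop row above; they are not the model's actual confusion matrix.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 100 real pop songs, 70 of them found,
# plus 210 non-pop songs wrongly labelled as pop.
precision, recall, f1 = precision_recall_f1(true_pos=70, false_pos=210, false_neg=30)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

A low precision combined with a high recall, as in the pop row, means the classifier labels many songs as pop and is often wrong when it does so.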



Showing the topics with the highest coefficients

In [136]:
features = X.columns.values # getting the column (topic) names
regression_coefficients = lr.coef_[0] # get the LR coefficients
feature_coef_combined = list(zip(features, regression_coefficients)) # zipping the two lists

# creating a dataframe of topics and coefficients
feature_importance = pd.DataFrame(feature_coef_combined, 
                      columns=['topic', 'coef'])

# printing the top 10 topics most distinctive of pop songs
feature_importance.sort_values('coef', ascending=False).head(10)

Unnamed: 0,topic,coef
2,topic_3,3.07416
1,topic_2,2.455029
43,topic_44,2.17654
35,topic_36,1.987309
8,topic_9,1.722746
47,topic_48,1.714878
41,topic_42,1.644119
3,topic_4,1.528376
23,topic_24,1.459794
27,topic_28,1.111788


Showing the top 10 words of each of the top 10 most distinctive topics

In [152]:
print(f"""
{topics[2]}
{topics[1]}
{topics[43]}
{topics[35]}
{topics[8]}
{topics[47]}
{topics[41]}
{topics[3]}
{topics[23]}
{topics[27]}
""")


Topic 3: yeah ooh oooh ohh 'cause mmm hey dat whoa wanna
Topic 2: baby girl boy crazy ready wanna body babe love yeah
Topic 44: la da ba eh de dee ding dah tu dum
Topic 36: stop hand start kiss real make touch throw feel raise
Topic 9: wanna run slow wild fast fun feel ey 'cause hide
Topic 48: love true heart lover feel 'cause kiss baby thing sweet
Topic 42: chorus verse 2 1 doo repeat 3 cuz bridge 4
Topic 4: tonight round cold catch shut middle air alright spin feel
Topic 24: heart break broken beat start piece love chain feel fall
Topic 28: ah uh ha ho bounce huh bass ahh bomb ow



### 2.4 Classifying new song

In [172]:
new_song = '''The sky breaks open and the rain falls down 
Oh, and the pain is blinding 
But we carry on... 
Seems like the end of everything 
When the one you love 
Turns their back on you 
And the whole world falls down on you... 
It's the end of the world 
It's the end of the world 
Well, I'm holding on...
But the world keeps dragging me down... 
It's the end of the world 
It's the end of the world 
Well, I'm holding on... 
But the world keeps dragging me down... 
I got my heart on lockdown 
And my eyes on the lookout 
But I just know that I'm 
Never gonna win this 
They can keep the lights on 
They can keep the music loud 
I don't need anything 
When I got my music 
And I'm holding on... 
The sky breaks open and the rain falls down 
And the pain is blinding 
But we carry on 
(They tell you lies) 
I'm holding on... 
(They tell you lies)The sky breaks open and the rain falls down 
Oh, and the pain is blinding 
But we carry on... 
Seems like the end of everything 
When the one you love 
Turns their back on you 
And the whole world falls down on you... 
It's the end of the world 
It's the end of the world 
Well, I'm holding on...
But the world keeps dragging me down... 
It's the end of the world 
It's the end of the world 
Well, I'm holding on... 
But the world keeps dragging me down... 
I got my heart on lockdown 
And my eyes on the lookout 
But I just know that I'm 
Never gonna win this 
They can keep the lights on 
They can keep the music loud 
I don't need anything 
When I got my music  
And I'm holding on... 
The sky breaks open and the rain falls down 
And the pain is blinding 
But we carry on 
(They tell you lies) 
I'm holding on... 
(They tell you lies)'''

I will perform the same preprocessing steps to this new song. 

In [166]:
# I will begin with processing and tokenizing the song with spacy.
processed_song = nlp(new_song) # processing song
tokenized_song = [word.lemma_.lower() for word in processed_song if not word.is_stop and not word.is_punct] # tokenizing song
tokenized_song = [word for word in tokenized_song if word != '\n'] # removing new line characters

# Getting the corpus of the new song. Here it is important to use the dictionary of the training set as those are the words that the topics are based on.
corpus_song = dictionary.doc2bow(tokenized_song)
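
The reason for reusing the training dictionary can be shown with a small stand-in for `doc2bow`. This is a plain-Python sketch with a made-up vocabulary, not gensim's implementation: tokens that are not in the fixed vocabulary are silently dropped, so a new song is described only in terms of words the topics were learned from. The function name `bag_of_words` is hypothetical.

```python
from collections import Counter

def bag_of_words(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs;
    tokens missing from the vocabulary are ignored."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Made-up vocabulary standing in for the training dictionary.
token2id = {"world": 0, "end": 1, "rain": 2, "heart": 3}

# "lockdown" is not in the vocabulary, so it does not appear in the result.
new_tokens = ["end", "world", "end", "world", "lockdown", "heart"]
bow = bag_of_words(new_tokens, token2id)
print(bow)
```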

Next I will generate the topic distribution of the new song.

I found out how to get the topic distribution of a new, unseen document using LdaMallet on Stack Overflow:
- link1 - https://stackoverflow.com/questions/55789477/how-to-predict-test-data-on-gensim-topic-modelling
- link2 - https://stackoverflow.com/questions/45310925/how-to-get-a-complete-topic-distribution-for-a-document-using-gensim-lda

In [181]:
song_topics = lda[corpus_song]

Next I will classify this new song using the logistic regression model from before

In [233]:
# creating a numpy array that can be passed to the lr.predict() function
song_array = np.array([y for x, y in song_topics]) # converting the list of topic distributions to a numpy array
song_array = song_array.reshape(1, -1) # reshaping array to equate 1 observation with 50 features

# getting predictions
song_pred = lr.predict(song_array)
song_pred[0]

1

I will also display the class probability estimate (for the pop class) of the new song

In [237]:
song_proba = lr.predict_proba(song_array) # generating a class probability estimate for the new song
song_proba[0][1]

0.5417554125384615

I will next examine the topic distribution itself of the new song. I will look at the 10 most prominent topics in the song

In [241]:
# creating a series of the topic distribution
song_series = pd.Series([y for x, y in song_topics], index=['topic_{}'.format(i+1) for i in range(N_TOPICS)])

# showing the top 10 most prominent topics in the new song
song_series.sort_values(ascending=False).head(10)

topic_32    0.307204
topic_40    0.139682
topic_33    0.092578
topic_24    0.083733
topic_14    0.069910
topic_43    0.056430
topic_1     0.049329
topic_22    0.041690
topic_31    0.039625
topic_23    0.026983
dtype: float64

In [253]:
print(f"""
{topics[31]}   // coef: {feature_importance.iloc[31][1]}
{topics[39]}   // coef: {feature_importance.iloc[39][1]}
{topics[32]}   // coef: {feature_importance.iloc[32][1]}
{topics[23]}   // coef: {feature_importance.iloc[23][1]}
{topics[13]}   // coef: {feature_importance.iloc[13][1]}
{topics[42]}   // coef: {feature_importance.iloc[42][1]}
{topics[0]}   // coef: {feature_importance.iloc[0][1]}
{topics[21]}   // coef: {feature_importance.iloc[21][1]}
{topics[30]}   // coef: {feature_importance.iloc[30][1]}
{topics[22]}   // coef: {feature_importance.iloc[22][1]}
""")


Topic 32: world find end hope bring life place follow time lose   // coef: -0.6619051256139659
Topic 40: eye lie close fly open wall door hide wide secret   // coef: -0.2113542965543397
Topic 33: fall leave remember forget time watch memory carry moment day   // coef: -0.21051904989248685
Topic 24: heart break broken beat start piece love chain feel fall   // coef: 1.4597935467189582
Topic 14: hold arm black tight hand white gimme close night darling   // coef: 0.8128662365306996
Topic 43: dance rock beat music roll floor jump play move rhythm   // coef: 0.6645181817183963
Topic 1: cry tear leave hurt give goodbye pain lie hate die   // coef: -0.47719863680777835
Topic 22: light sun shine star sky night dark moon bright rise   // coef: 0.5578400299790878
Topic 31: turn change push pull drop dead hang swear super loud   // coef: 0.6086145575515177
Topic 23: rain blue wind blow tree sky summer grow cold sun   // coef: -1.083791431602423



## 3. Interpretation & discussion <a class="anchor" id="third-section"></a>

When looking at the topics most predictive of pop songs, we can see that one of the most distinctive features of the genre appears to be **scat singing**: the use of “meaningless” vocal sounds that are not actual words. We can see this clearly in some of the topics with the highest coefficients for pop songs, such as topic 3 (ooh, oooh, ohh, whoa), topic 44 (la, da, ba, eh, de, dee, ding, dah, dum) and topic 28 (ah, uh, ha, ho, ow).

Other highly distinctive features of pop songs are the recurring themes of:
-	**passionate love** in topics 2 (baby, girl, boy, crazy, body, babe, love), 36 (hand, kiss, touch, feel), 48 (love, true, heart, lover, feel, kiss, baby, sweet) and 24 (heart, break, broken, love, fall), and
-	**excitement** in topics 9 (run, wild, fast, feel) and 4 (tonight, air, spin, feel).

Thus we can conclude that distinctive lyrical features of pop songs do indeed exist: an overarching thematic focus on passionate love and excitement, and the recurring use of scatting.

## 4. Analysis of new lyrics <a class="anchor" id="fourth-section"></a>

When applying the logistic regression to the new lyrics, the model classified it as a pop song. When looking at the class probability estimates, we can say that our model was not very confident in this decision. The model estimated the probability of the new song being a pop song at 0.54, which is very close to the decision threshold of 0.5.
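
How close 0.54 is to the threshold can also be expressed on the log-odds scale, where the decision boundary of a binary logistic regression lies at 0. The following is a small sketch; the helper name `log_odds` is hypothetical.

```python
import math

def log_odds(probability):
    """Convert a probability into log-odds (the inverse of the sigmoid)."""
    return math.log(probability / (1.0 - probability))

p_pop = 0.5417554125384615  # class probability estimate reported in section 2.4

score = log_odds(p_pop)
predicted_label = int(p_pop > 0.5)  # 1 = pop, 0 = non-pop
print(round(score, 3), predicted_label)
```

The resulting score is positive but small, which is another way of saying that the song sits only just on the pop side of the boundary.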

When looking at the 10 most prominent topics in the new lyrics, we see mixed results. There are topics with positive coefficients, such as:
-	topics 24 and 14, which cover the theme of **passionate love**, and
-	topics 43 and 31, which cover the theme of **excitement**.

However, there are also topics with clearly negative coefficients, such as:
-	topic 23 (rain, blue, wind, blow, tree, sky, summer, grow, cold, sun) and
-	topic 32 (world, find, end, hope, bring, life, place, follow, time, lose), which is also the most prominent topic in the song (about 31% of its topic distribution).

Thus, while the new song shares thematic similarities with the pop genre, it also contains features that set it apart from typical pop songs.
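
The mixed picture can be made more tangible by multiplying each prominent topic's share of the song by its coefficient, which gives that topic's contribution to the log-odds of the pop class. The sketch below uses the proportions and coefficients printed in section 2.4 (rounded). It covers only these ten topics, so it leaves out the intercept and the remaining forty topics, which are not shown in this notebook.

```python
# (topic, share of the new song, logistic regression coefficient), rounded
# from the outputs in section 2.4.
prominent_topics = [
    ("topic_32", 0.307204, -0.661905),
    ("topic_40", 0.139682, -0.211354),
    ("topic_33", 0.092578, -0.210519),
    ("topic_24", 0.083733, 1.459794),
    ("topic_14", 0.069910, 0.812866),
    ("topic_43", 0.056430, 0.664518),
    ("topic_1", 0.049329, -0.477199),
    ("topic_22", 0.041690, 0.557840),
    ("topic_31", 0.039625, 0.608615),
    ("topic_23", 0.026983, -1.083791),
]

# Contribution of each topic to the log-odds: share * coefficient.
contributions = {name: share * coef for name, share, coef in prominent_topics}

# Print from the strongest push towards pop to the strongest push away from it.
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:+.3f}")
```

In this partial view, topic 24 pushes the song most strongly towards pop, while topic 32, the most prominent topic, pushes it most strongly away.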
