# Billboard Hits Prediction

## Problem Statement
Finding a hit song is widely known to be challenging, especially in the local music industry. Relying solely on subjective judgment and industry experience can lead to missed opportunities or poor investment decisions, and small or independent record companies with limited budgets cannot afford to waste time and money producing and promoting a ‘dud’ or ‘miss’ song.

What if data science could lend a hand in this endeavor? The goal is to develop one or more effective machine learning models that can predict hit songs, helping record companies big and small make the right decisions.


## Data Sources

- Billboard Top 100 Weekly Charts. The charts provide peak position data for the songs and their corresponding artists from 1999 to 2019.
- Spotify Audio Features and Genres. Audio features and genres of the respective songs will be retrieved from the Spotify API.
- Lyrics of songs. Lyrics of the respective songs will be scraped from Genius.com.

## Objective

According to Google, a song is considered a hit once it reaches the Top 40 of the Billboard Top 100 charts, that is, once its peak position is 40 or better. We want to see whether the audio features, together with the lyrical content of the songs, help to determine a hit song. We will use Natural Language Processing (NLP) to analyze the lyrics.
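
The Top 40 rule can be sketched as a simple labelling function. This is a minimal illustration, not necessarily how the `Hit or Not` column in the dataset was produced: the names `HIT_THRESHOLD` and `label_hit` are hypothetical, and the three example rows reuse songs and peak ranks that appear in the output later in this notebook.

```python
import pandas as pd

# Assumption: a "hit" is any song whose peak position is within the Top 40.
HIT_THRESHOLD = 40


def label_hit(peak_rank: int, threshold: int = HIT_THRESHOLD) -> str:
    """Return 'hit' when the peak rank is within the threshold, else 'non-hit'."""
    return "hit" if peak_rank <= threshold else "non-hit"


# Toy frame using three songs shown in the merged data below.
toy = pd.DataFrame(
    {
        "song": ["White & Nerdy", "Word Crimes", "Canadian Idiot"],
        "peak-rank": [9, 39, 82],
    }
)
toy["Hit or Not"] = toy["peak-rank"].apply(label_hit)
print(toy)
```

The labels produced here agree with the `Hit or Not` values shown for the same songs in the merged data, which is consistent with a threshold of 40.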

In [1]:
import pandas as pd

## Reading in Dataset with Missing Lyrics

In [2]:
missing_lyrics = pd.read_csv("./data/data.csv", encoding="utf-8")

In [3]:
missing_lyrics.shape

(9248, 27)

## Reading in Dataset of the Identified Missing Lyrics

In [4]:
ident_lyrics = pd.read_csv("./data/datalyrics.csv", encoding="utf-8")

In [5]:
# Keep only the artist, song and Lyrics columns
ident_lyrics = ident_lyrics[["artist", "song", "Lyrics"]]

In [6]:
ident_lyrics.shape

(401, 3)

## Merging the missing lyrics data and the identified missing lyrics data

In [7]:
# Left-merge missing_lyrics with ident_lyrics on 'song' and 'artist';
# the overlapping 'Lyrics' column becomes 'Lyrics_missing' and 'Lyrics_data'
merged_data = missing_lyrics.merge(ident_lyrics, on=['song', 'artist'], how='left', suffixes=('_missing', '_data'))
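
A left merge keeps every row of the left frame only if the join keys are unique on the right side; duplicate `(song, artist)` pairs in the lookup table would silently add rows. The sketch below, on made-up toy data, shows one way to guard against that with the `validate` and `indicator` parameters of `DataFrame.merge`. The frames `left` and `right` and their contents are hypothetical.

```python
import pandas as pd

# Toy stand-in for the main dataset: two songs lack lyrics.
left = pd.DataFrame(
    {
        "artist": ["A", "B", "C"],
        "song": ["s1", "s2", "s3"],
        "Lyrics": ["la la", None, None],
    }
)

# Toy stand-in for the recovered lyrics: only one of the gaps is covered.
right = pd.DataFrame({"artist": ["B"], "song": ["s2"], "Lyrics": ["na na"]})

merged = left.merge(
    right,
    on=["song", "artist"],
    how="left",
    suffixes=("_missing", "_data"),
    validate="many_to_one",  # raises MergeError if keys repeat in `right`
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

print(merged)
print(merged["_merge"].value_counts())
```

With `validate="many_to_one"`, the row count of the result is guaranteed to equal that of the left frame, which matches the 9248 entries reported by `info()` in the notebook.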

In [8]:
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9248 entries, 0 to 9247
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   artist             9248 non-null   object 
 1   song               9248 non-null   object 
 2   peak-rank          9248 non-null   int64  
 3   Hit or Not         9248 non-null   object 
 4   Lyrics_missing     8847 non-null   object 
 5   danceability       9248 non-null   float64
 6   energy             9248 non-null   float64
 7   key                9248 non-null   int64  
 8   loudness           9248 non-null   float64
 9   mode               9248 non-null   int64  
 10  speechiness        9248 non-null   float64
 11  acousticness       9248 non-null   float64
 12  instrumentalness   9248 non-null   float64
 13  liveness           9248 non-null   float64
 14  valence            9248 non-null   float64
 15  tempo              9248 non-null   float64
 16  type               9248 

## Drop all the unnecessary columns

In [9]:
# Initial removal of unimportant columns
merged_data.drop(columns=['type','id','uri','track_href','analysis_url','artist_id','track_popularity','artist_popularity'], inplace=True)

In [10]:
# Use combine_first() to fill in merged_data 'Lyrics_missing' with values from 'Lyrics_data'
merged_data['Lyrics_missing'] = merged_data['Lyrics_missing'].combine_first(merged_data['Lyrics_data'])

merged_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9248 entries, 0 to 9247
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist            9248 non-null   object 
 1   song              9248 non-null   object 
 2   peak-rank         9248 non-null   int64  
 3   Hit or Not        9248 non-null   object 
 4   Lyrics_missing    9212 non-null   object 
 5   danceability      9248 non-null   float64
 6   energy            9248 non-null   float64
 7   key               9248 non-null   int64  
 8   loudness          9248 non-null   float64
 9   mode              9248 non-null   int64  
 10  speechiness       9248 non-null   float64
 11  acousticness      9248 non-null   float64
 12  instrumentalness  9248 non-null   float64
 13  liveness          9248 non-null   float64
 14  valence           9248 non-null   float64
 15  tempo             9248 non-null   float64
 16  duration_ms       9248 non-null   int64  


In [11]:
# Drop 'Lyrics_data' column
merged_data = merged_data.drop('Lyrics_data', axis=1)
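
The fill-then-drop step above can be illustrated on a toy frame. `Series.combine_first` keeps the caller's value wherever it is present and falls back to the other series only where the caller is null. The data below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Lyrics_missing": ["original", None, None],
        "Lyrics_data": ["ignored", "recovered", None],
    }
)

# Existing lyrics win; gaps are filled from 'Lyrics_data' where available.
toy["Lyrics_missing"] = toy["Lyrics_missing"].combine_first(toy["Lyrics_data"])

# The helper column is no longer needed once its values have been copied over.
toy = toy.drop(columns="Lyrics_data")
print(toy)
```

In the notebook's own output, the non-null count of `Lyrics_missing` rises from 8847 to 9212, so 365 of the 401 gaps were filled and 36 rows still lack lyrics.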

## Check for any null values with merged_data

In [12]:
# Check for any null values with merged_data
merged_data.isnull().sum()

artist               0
song                 0
peak-rank            0
Hit or Not           0
Lyrics_missing      36
danceability         0
energy               0
key                  0
loudness             0
mode                 0
speechiness          0
acousticness         0
instrumentalness     0
liveness             0
valence              0
tempo                0
duration_ms          0
time_signature       0
genres               0
dtype: int64

In [13]:
# Drop the rows of merged_data that contain null values
merged_data = merged_data.dropna()
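
Since `Lyrics_missing` is the only column with nulls here, a plain `dropna()` removes exactly the rows without lyrics. A slightly more explicit variant, sketched below on made-up data, restricts the check to that column with `subset` and reports how many rows were removed.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "song": ["s1", "s2", "s3"],
        "Lyrics_missing": ["la la", None, "na na"],
        "tempo": [120.0, 98.5, 143.0],
    }
)

rows_before = len(toy)
# Only rows whose lyrics are missing are dropped; other columns are not checked.
cleaned = toy.dropna(subset=["Lyrics_missing"])
rows_dropped = rows_before - len(cleaned)

print(f"Dropped {rows_dropped} of {rows_before} rows")
print(cleaned.index.tolist())  # original index labels are preserved
```

Because `dropna` preserves the original index labels, gaps appear in the index afterwards, which is why the displayed frame below skips from 2 to 4. Calling `reset_index(drop=True)` would renumber the rows if a contiguous index is preferred.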

In [14]:
merged_data

Unnamed: 0,artist,song,peak-rank,Hit or Not,Lyrics_missing,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,genres
0,"""Weird Al"" Yankovic",Canadian Idiot,82,non-hit,37 ContributorsCanadian Idiot Lyrics[Verse 1]\...,0.543,0.697,8,-9.211,1,0.0612,0.00206,0.000002,0.3430,0.861,185.978,143040,4,"['comedy rock', 'comic', 'parody']"
1,"""Weird Al"" Yankovic",White & Nerdy,9,hit,99 ContributorsWhite & Nerdy Lyrics[Chorus]\nT...,0.791,0.613,1,-11.628,0,0.0763,0.09860,0.000000,0.0765,0.896,143.017,170640,4,"['comedy rock', 'comic', 'parody']"
2,"""Weird Al"" Yankovic",Word Crimes,39,hit,93 ContributorsWord Crimes Lyrics[Intro]\nEver...,0.897,0.430,7,-12.759,1,0.0551,0.01180,0.000000,0.0473,0.964,121.987,223120,4,"['comedy rock', 'comic', 'parody']"
4,N Sync,Bye Bye Bye,4,hit,58 ContributorsBye Bye Bye Lyrics[Intro: Justi...,0.610,0.926,8,-4.843,0,0.0479,0.03100,0.001200,0.0821,0.861,172.638,200400,4,"['boy band', 'dance pop', 'pop']"
5,N Sync,I Drive Myself Crazy,67,non-hit,14 ContributorsThinking of You (I Drive Myself...,0.495,0.704,9,-5.260,1,0.0331,0.01840,0.000000,0.1900,0.407,174.056,239733,4,"['boy band', 'dance pop', 'pop']"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9243,will.i.am & Nicki Minaj,Check It Out,24,hit,76 ContributorsTranslationsPortuguêsCheck It O...,0.848,0.694,1,-4.272,1,0.0572,0.04930,0.000012,0.1050,0.728,130.075,251013,4,"['dance pop', 'pop']"
9244,will.i.am Featuring Justin Bieber,#thatPOWER,17,hit,60 ContributorsTranslationsPortuguês#thatPOWER...,0.797,0.608,6,-6.096,0,0.0584,0.00112,0.000077,0.0748,0.402,127.999,279506,4,"['dance pop', 'pop']"
9245,will.i.am Featuring Mick Jagger & Jennifer Lopez,T.H.E (The Hardest Ever),36,hit,46 ContributorsT.H.E. (The Hardest Ever) Lyric...,0.586,0.712,9,-4.823,1,0.0969,0.10400,0.000006,0.0377,0.450,106.024,287973,4,"['dance pop', 'pop']"
9246,will.i.am Featuring Miley Cyrus,Fall Down,58,non-hit,17 ContributorsFall Down Lyrics[Chorus: will.i...,0.619,0.621,0,-5.465,0,0.0357,0.00994,0.003320,0.1210,0.338,127.042,307493,4,"['dance pop', 'pop']"


In [15]:
# Check for any null values in merged_data
merged_data.isnull().sum()

artist              0
song                0
peak-rank           0
Hit or Not          0
Lyrics_missing      0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
time_signature      0
genres              0
dtype: int64

In [16]:
# Rename the 'Lyrics_missing' column to 'lyrics'
merged_data = merged_data.rename(columns={'Lyrics_missing': 'lyrics'})

## Export merged_data

In [17]:
merged_data.to_csv('./data/data_top100.csv', encoding='UTF8', index=False)
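
One thing to keep in mind when reading the exported file back: CSV stores every cell as text, so list-like values such as those in `genres` come back as strings. The sketch below shows a round trip through an in-memory buffer (used here instead of `./data/data_top100.csv` so the example is self-contained) and converts the strings back to Python lists with `ast.literal_eval`. The toy rows reuse genre values shown in the output above.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame(
    {
        "song": ["Word Crimes", "Bye Bye Bye"],
        "genres": [
            "['comedy rock', 'comic', 'parody']",
            "['boy band', 'dance pop', 'pop']",
        ],
    }
)

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip each 'genres' cell is a string; parse it into a list.
reloaded["genres"] = reloaded["genres"].apply(ast.literal_eval)
print(reloaded["genres"].iloc[0])
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for parsing values read from a file.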