# Capstone

The following notebook was created to help us understand what it takes, lyrically, to be in the Billboard Top 100 charts. The data explored is largely from 1999 to 2019, although some additional data dates back to 1959.

Throughout the notebook, we will extensively clean our data and fit three separate models: Random Forest, XGBoost (XGB) and SVM. There will also be further exploration of how genre can play a role in making the Billboard Top 100 charts.

The variable below controls whether cell 7 runs. Cell 7 is where I query both the Genius API and the Spotify API to get lyrics as well as certain Spotify features. If you'd like to run cell 7 to see how it works, change False to True.

In [None]:
scrape = False
api_file = '/Users/jamesbrochhausen/.secret/Spotify.json'
# 

In [None]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn import tree
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# plot_confusion_matrix is defined later in this notebook, so it is not imported from sklearn
from sklearn.metrics import recall_score
from xgboost import XGBRFClassifier,XGBClassifier
import shap
from imblearn.over_sampling import SMOTE
shap.initjs()
import warnings
from collections import Counter
from nltk import FreqDist
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import string
from wordcloud import WordCloud
warnings.filterwarnings('ignore')

# Cleaning

## Random Songs

The Spotify-2000.csv dataset is used for songs that are not in the top 100. TODO: find and document where the Spotify-2000.csv file came from.

In [None]:
df_rando = pd.read_csv('Spotify-2000.csv', index_col=[0])

In [None]:
df_rando.head()

### Spotify and Genius API

In [None]:
import os

In [None]:
# pip install lyricsgenius
# pip install spotipy

In [None]:
if scrape:
    import json

    with open(api_file) as file:
        login = json.loads(file.read())

    login.keys()

    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=login['client_id'],
                                                               client_secret=login['client_secret']))

    df_rando['Lyrics']=''
    df_rando.head()

    import lyricsgenius
    genius = lyricsgenius.Genius(login['genius_secret'])

    def get_lyrics(x):
        title = x['Title']
        artist = x['Artist']

        song = genius.search_song(title, artist)
        # search_song returns None when nothing matches, so song.lyrics raises AttributeError
        try:
            return song.lyrics
        except AttributeError:
            return 'Not Found'
    print('Scraping data')
    df_rando['Lyrics']=df_rando.apply(lambda x: get_lyrics(x), axis=1)

    df_rando.to_csv('randomsongs.csv',index=False)
else:
    print('Skipping since scrape == False')
# Change this to a relative file path

In [None]:
df_rando = pd.read_csv('randomsongs.csv')

In [None]:
df_rando.head()

In [None]:
print(df_rando['Lyrics'].iloc[0])

Now that I've confirmed the lyrics are present for the songs, I need to go through the Lyrics column and remove section labels such as Chorus, Bridge and Verse.

In [None]:
## Removing words from rows in Lyrics Column

word_list = ['[Chorus 1]','[Chorus 2]','[Chorus 3]','[Chorus 4]',
             '[Verse 1]','[Verse 2]','[Verse 3]','[Verse 4]','[Bridge]',
             '[Intro]','[Chorus]','[Outro]','[outro]','[Verse]','[verse]',
            '[Pre-Chorus]','[pre-chorus]','[Instrumental]','[instrumental]',
             '[post-chorus]','[Post-Chorus]']

In [None]:
# for word in word_list:
#     for lyric in df_rando['Lyrics']:
# #     if word in df_rando['Lyrics']:
#         if word in lyric:
# #             print(word)
#             df_rando['Lyrics'].replace((word,''),inplace=True)

In [None]:
def replace_words(word_list, lyrics):
#     print(lyrics)
    for word in word_list:
        if word in lyrics:
            lyrics = lyrics.replace(word,'')
#             print(word)
    return lyrics


df_rando['Lyrics']=df_rando['Lyrics'].apply(lambda x: replace_words(word_list,
                                                                    x))
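
The explicit word_list above only removes the tags it names, so a header such as [Verse 5] or [Chorus: Artist] would slip through. As a hedged alternative, the sketch below strips any square-bracketed tag with a regular expression. The helper name `strip_section_tags` and the sample lyric are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches any text wrapped in square brackets, e.g. "[Chorus]" or "[Verse 5]".
SECTION_TAG = re.compile(r"\[[^\]]*\]")

def strip_section_tags(lyrics):
    """Remove bracketed section headers from a lyrics string (hypothetical helper)."""
    return SECTION_TAG.sub("", lyrics)

sample = "[Verse 1]\nHello darkness\n[Chorus: Both]\nMy old friend"
cleaned = strip_section_tags(sample)
print(cleaned)
```

This would also remove bracketed text that is genuinely part of a lyric, so whether it is appropriate depends on the data.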

In [None]:
print(df_rando['Lyrics'][1])

## Billboard

In [None]:
## Load the Billboard Hot 100 data (1999-2019)
df_bb = pd.read_csv('billboardHot100_1999-2019.csv', index_col=[0])

Removing columns that are unnecessary for our solution.

In [None]:
df_bb = df_bb.drop(columns=['Features','Writing.Credits','Week','Date'])
df_bb.head()

In [None]:
print(df_bb['Lyrics'].iloc[0])

In [None]:
df_bb.shape

In [None]:
## Checking for duplicates
df_bb['Name'].duplicated().any()

Since I'm eventually going to join this data frame with the Hot 100 audio features data frame below, I need to remove any and all duplicates before I can start that process.

In [None]:
df_bb['Name'].duplicated().value_counts()

The dataset above contains duplicate songs because a song can appear on the chart in many different weeks throughout a year, which explains why there are so many. What we care about most, however, is the lyrics and how similar they are to one another, so we're going to drop rows with duplicate lyrics rather than duplicate titles (in case different songs share the same title).

In [None]:
df_bb['Lyrics'].duplicated().value_counts()

In [None]:
df_bb[df_bb['Lyrics'].duplicated(keep=False)].sort_values('Lyrics')

In [None]:
df_bb = df_bb.drop_duplicates(subset=['Lyrics'])
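
As a small illustration of the two calls used above, the sketch below runs them on a toy data frame. The rows are made up for the example and only mimic the shape of df_bb.

```python
import pandas as pd

# Toy stand-in for df_bb: the same song charting in two different weeks.
toy = pd.DataFrame({
    "Name": ["Song A", "Song A", "Song B"],
    "Lyrics": ["la la la", "la la la", "na na na"],
    "Week": ["2019-01-05", "2019-01-12", "2019-01-05"],
})

# keep=False marks every member of a duplicate group, which is handy for inspection.
all_dupes = toy[toy["Lyrics"].duplicated(keep=False)]

# drop_duplicates keeps the first row of each group by default.
deduped = toy.drop_duplicates(subset=["Lyrics"])

print(len(all_dupes), len(deduped))
```

Because only the first row per lyric survives, week-specific values such as the weekly rank come from whichever row happens to appear first.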

In [None]:
print(df_bb.shape)
df_bb.head(3)

### Verifying all the lyrics are in the actual song

In [None]:
print(df_bb['Lyrics'][1])

## Hot Features

I may also use this with an API call to create a data frame that includes more of the other information, but for now we're focused on lyrical results.

In [None]:
df_hot_audio_features = pd.read_csv('Hot 100 Audio Features.csv')

In [None]:
df_hot_audio_features.head(3)

In [None]:
##Further removing features I don't want included

df_hot_audio_features = df_hot_audio_features.drop(columns=['spotify_track_id',
                                    'spotify_track_preview_url',
                     'spotify_track_album','spotify_track_explicit',
                     'spotify_track_duration_ms'])

Checking duplicates again to remove before joining our two data frames.

In [None]:
# df_hot_audio_features['SongID'].duplicated().value_counts()

In [None]:
df_hot_audio_features['Song'].duplicated().value_counts()

In [None]:
df_hot_audio_features = df_hot_audio_features.drop_duplicates(subset=['Song'])

In [None]:
df_hot_audio_features['Song'].duplicated().value_counts()

In [None]:
df_hot_audio_features.head(3)

In [None]:
print(df_hot_audio_features.shape)
df_bb.shape

In [None]:
df_hot_audio_features.head()

In [None]:
df_bb.head()

In [None]:
# df_hot_stuff.info()

## Merge Data Frames on Artist and Song Title

In [None]:
## Show all columns
pd.set_option('display.max_columns', None)

In [None]:
# Join Columns on both Artists and Song name

print(df_bb.shape, df_hot_audio_features.shape)
master_df = pd.merge(df_hot_audio_features, df_bb,
                  left_on=['Performer','Song'], right_on=['Artists','Name'],
                     how='inner')



# print(df_bb.shape)
# print(df_hot_audio_features.shape)
# print()
print(master_df.shape)
master_df.head(3)
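
An inner merge silently drops every song whose artist and title do not match exactly in both tables. One way to see how many rows are lost is to run an outer merge with `indicator=True`. The sketch below uses two made-up tables with the same key columns as above; the rows and values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for df_hot_audio_features and df_bb.
features = pd.DataFrame({"Performer": ["Artist X", "Artist Y"],
                         "Song": ["One", "Two"],
                         "energy": [0.8, 0.4]})
lyrics = pd.DataFrame({"Artists": ["Artist X", "Artist Z"],
                       "Name": ["One", "Three"],
                       "Lyrics": ["la la", "na na"]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
check = pd.merge(features, lyrics,
                 left_on=["Performer", "Song"], right_on=["Artists", "Name"],
                 how="outer", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)
```

Only the rows marked `both` would survive the inner merge, so large `left_only` or `right_only` counts can hint at spelling or formatting differences in the keys.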

## Deal with duplicates round 1

In [None]:
## One duplicate
master_df['Song'].duplicated().value_counts()

In [None]:
## Remove last duplicate
master_df = master_df.drop_duplicates(subset=['Song'])

In [None]:
master_df['Song'].duplicated().value_counts()

In [None]:
master_df.isna().sum()

In [None]:
master_df.head(3)

## Merging data

In [None]:
master_df.head(1)

In [None]:
df_rando.head(1)

Dropping columns in df_rando that are not in master_df and vice versa, aside from the Billboard rankings, which we will fill in later.

In [None]:
# title, yes
# artist, yes
# year, no
# bpm, no
# energy, yes
# dance, yes
# loudness, yes
# liveness, yes
# valence, yes
# length, yes
# acousticness, yes
# speechiness, yes
# popularity, yes



df_rando = df_rando.drop(['Beats Per Minute (BPM)','Year'], axis=1)
df_rando.head(3)

In [None]:
master_df.head(1)

In [None]:
# tempo, no
# instrumental, no
# mode, no
# key, no
# songid, no

master_df = master_df.drop(['SongID','tempo','instrumentalness','mode','key','Artists','Name'],axis=1)
master_df.head(3)

## Filling null values in our features with the mean

In [None]:
# numeric_only=True keeps text columns such as Lyrics out of the mean calculation
master_df = master_df.fillna(master_df.mean(numeric_only=True))
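
Mean imputation only makes sense for numeric columns, and recent pandas versions raise an error when `DataFrame.mean()` is asked to average text columns. The toy example below, with made-up values, shows a fill that touches only the numeric column.

```python
import pandas as pd

toy = pd.DataFrame({"Song": ["One", "Two", "Three"],
                    "energy": [0.2, None, 0.6],
                    "Lyrics": ["la", None, "na"]})

# mean(numeric_only=True) returns a Series indexed by the numeric column names,
# so fillna only fills gaps in those columns.
filled = toy.fillna(toy.mean(numeric_only=True))
print(filled)
```

The missing lyric stays missing, which is why text columns need their own fill step.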

In [None]:
master_df.isna().sum()

In [None]:
## Fill missing genres with 'No Genre'
master_df["spotify_genre"] = master_df["spotify_genre"].fillna("['No Genre']")

In [None]:
master_df.isna().sum()

In [None]:
master_df.info()

In [None]:
master_df['Weekly.rank'] = master_df['Weekly.rank'].astype(float)

In [None]:
master_df.info()

In [None]:
df_rando.dtypes

In [None]:
df_rando.head(10)

First I'm going to rename the columns in my master data frame to match our random data frame. This way, when we join, no duplicate columns will appear. It will also look much cleaner.

In [None]:
master_df.rename(columns={'Song':'Title', 'Performer':'Artist',
                         'spotify_genre':'Top Genre',
                          'spotify_track_popularity':'Popularity',
                         'danceability':'Danceability','loudness':'Loudness (dB)',
                        'energy':'Energy','speechiness':'Speechiness',
                        'acousticness':'Acousticness','liveness':'Liveness',
                          'valence':'Valence','time_signature':'Length (Duration)',
                        }, inplace=True)

In [None]:
master_df.head(5)

# Getting Genres

In [None]:
master_df.loc[0,'Top Genre']

Right now the contents of our Top Genre column are strings. Python's eval() evaluates an expression from a string-based or compiled-code-based input, which lets me convert each string back into a list. Note that eval() executes whatever code the string contains, so it should only be used on trusted data.

In [None]:
## Experimenting, it worked
eval(master_df.loc[0,'Top Genre'])

In [None]:
## applying to the entire column
master_df['Top Genre'] = master_df['Top Genre'].apply(eval)

In [None]:
print(master_df.loc[0,'Top Genre'])
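
A more cautious way to do the same conversion is `ast.literal_eval` from the standard library, which only accepts Python literals such as lists, strings and numbers. The sketch below uses a made-up genre string in the same format as the column.

```python
import ast

raw = "['dance pop', 'pop', 'post-teen pop']"

# literal_eval parses the string into a real list without running any code.
genres = ast.literal_eval(raw)
print(genres)

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Applied to the whole column, this would be `master_df['Top Genre'].apply(ast.literal_eval)`, assuming every cell holds a list literal.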

Now my goal is to turn this back into a string by using .join. I'm doing this so that when I join master_df and df_rando below they'll be in the same format.

In [None]:
' '.join(master_df.loc[0,'Top Genre'])

In [None]:
master_df['Top Genre'] = [', '.join(map(str, l)) for l in master_df['Top Genre']]


In [None]:
master_df.head()

In [None]:
# final_df = pd.concat([master_df,df_rando])

In [None]:
# master_df_2 = master_df.explode('Top Genre')

In [None]:
# final_df_2 = pd.concat([master_df_2,df_rando])

In [None]:
# final_df_2['Top Genre'].value_counts().head(60)

In [None]:
# final_df_2['Top Genre'].value_counts().head(20).index

In [None]:
# rep = {'pop':'Pop', 'contemporary country':'Country', 'country':'Country',
#        'country road':'Country'}
#        'pop rap', 'post-teen pop', 'rap', 'hip hop', 'r&b',
#        'modern country rock', 'album rock', 'urban contemporary', 'pop rock',
#        'trap', 'hip pop', 'post-grunge', 'neo mellow', 'southern hip hop',
#        'alternative metal']


In [None]:
# final_df_2['Top Genre'].map(rep)

In [None]:
# final_df_2

In [None]:
master_df['df'] = 'master_df'

In [None]:
df_rando['df'] = 'rando_df'

Now that my column names are the same (except for the ones with additional data), we can join the two data frames.

In [None]:
potential_master = pd.concat([master_df,df_rando])

In [None]:
potential_master.head()

In [None]:
print(df_rando.shape)
print(master_df.shape)
potential_master.shape

In [None]:
final_df = potential_master

In [None]:
print(final_df.shape)
final_df.head(3)

In [None]:
final_df.isna().sum()

Dropping Genre, since Top Genre now holds all the genres. We can see that the genres mapped over correctly into Top Genre, as it has no null values.

In [None]:
final_df.drop(['Genre'], axis=1, inplace=True)

# Exploring the Text / EDA

In [None]:
final_df.head()

In [None]:
corpus = final_df['Lyrics'].to_list()

In [None]:
stopwords_ls = stopwords.words('english')
stopwords_ls[:10]

In [None]:
stopwords_ls.extend(string.punctuation)
stopwords_ls[-5:]

In my experience, the standard word_tokenize function doesn't split these lyrics the way we want. For example, the word can't is split into "ca" and "n't". We want full words instead, and I found that the tweet tokenizer handles this dataset more appropriately.
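
A quick check of that behavior on a made-up lyric line: the sketch below only uses `TweetTokenizer`, which needs no extra NLTK data downloads.

```python
from nltk.tokenize import TweetTokenizer

line = "I can't stop, won't stop"

# TweetTokenizer keeps contractions together as single tokens.
tokens = TweetTokenizer().tokenize(line)
print(tokens)
```

Keeping contractions whole matters later, because the stop word list contains entries that only match the unsplit forms.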

In [None]:
from nltk.tokenize import TweetTokenizer

In [None]:
tknzr = TweetTokenizer()

In [None]:
token = tknzr.tokenize('.'.join(corpus))

In [None]:
removed_tokens = [w.lower() for w in token if w.lower() not in stopwords_ls]
removed_tokens[:25]

In [None]:
freq = FreqDist(removed_tokens)
freq.most_common(50)

In [None]:
stopwords_ls.extend(['—','2018','scp','’','”','“','feat'])

In [None]:
removed_tokens = [w.lower() for w in token if w.lower() not in stopwords_ls]
removed_tokens[:25]

In [None]:
freq = FreqDist(removed_tokens)
freq.most_common(50)

## WordCloud

In [None]:
wordcloud = WordCloud(stopwords=stopwords_ls, collocations=False)
wordcloud.generate(','.join(removed_tokens))

plt.figure(figsize = (15,15), facecolor = 'black')
plt.imshow(wordcloud)
plt.axis('off')

## Bigram

In [None]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
lyric = nltk.BigramCollocationFinder.from_words(removed_tokens)
lyric_freq = lyric.score_ngrams(bigram_measures.raw_freq)

In [None]:
pd.DataFrame(lyric_freq, columns=["Lyric","Freq"]).head(20)
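
To make the Freq column easier to read, the toy example below scores bigrams on five made-up tokens. With `raw_freq`, each score is the bigram's count divided by the total number of tokens.

```python
import nltk

tokens = ["love", "me", "love", "me", "tender"]

measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.BigramCollocationFinder.from_words(tokens)

# score_ngrams returns (bigram, score) pairs sorted from highest to lowest score.
scored = finder.score_ngrams(measures.raw_freq)
top_bigram, top_score = scored[0]
print(top_bigram, top_score)
```

Here ("love", "me") occurs twice among five tokens, so it scores 2/5. Because the lyrics were joined into one long token list, bigrams can also span the boundary between two songs.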

## Exploring genre

In [None]:
final_df['Top Genre']
# .value_counts()[:20]

In [None]:
genre_corpus = final_df['Top Genre'].to_list()

In [None]:
## Viewing the most popular genres and recognizing there are similarities
# FreqDist(genre_corpus)

In [None]:
# final_df['Genre'] = final_df['Genre'].to_list()

In [None]:
# final_df.info()

In [None]:
# #Took most popular genres
# ids = {'id':['Country','Metal','Pop','Classical','Rock','Electronic',
#              'Rap','R&;B'],
#       'Genre':['Country','Metal','Pop','Classical','Rock','Electronic',
#                      'Rap','R&;B']}
# ids = dict(zip(ids['id'], ids['Genre']))               
# print(ids)

In [None]:
# final_df.head(1)

In [None]:
# final_df.info()

In [None]:
# final_df['Top Genre'][0]

In [None]:
# final_df['Top Genre'] = [str(i) for i in final_df['Top Genre']]

In [None]:
# final_df['Top Genre'] = final_df['Top Genre'].apply(lambda x: x.upper()) 

# final_df.head()  

In [None]:
# final_df.dtypes

In [None]:
## Viewing the most popular genres and recognizing there are similarities
FreqDist(genre_corpus)

I'm importing re because re.findall takes a regular expression pattern and a string and returns every match of that pattern within the string. This sets me up for collapsing each song's genres into one generic genre.

In [None]:
import re
## Most frequent / common genres
## 'R&B' is spelled without the stray ';' so it can match Spotify genre strings such as 'r&b'
rep = ['Country','Metal','Pop','Classical','Rock','Electronic',
       'Rap','R&B','Adult Standards','Indie','Cabaret','Hip Hop','Soul']
final_df['Top Genre_'] = [re.findall(r'|'.join(rep), i,
                                 re.IGNORECASE) for i in final_df['Top Genre']]

In [None]:
final_df['Top Genre_']

Now I'm checking which original Top Genre values matched none of the genres in rep, meaning their Top Genre_ list is empty. .str.len() gives the length of each list, and .eq(0) compares those lengths with 0 element-wise. The repeated matches we saw above are collapsed in the next step.

In [None]:
final_df['Top Genre'][final_df['Top Genre_'].str.len().eq(0)].value_counts()

Now that this is somewhat cleaned up, I need to replace every genre that's not in 'rep' with 'Unique Genre' and keep only the most frequent match for everything else. This will make the information more digestible than a genre like 'cyberpunk'.

In [None]:
def replace_genre(rep, genre):
    #if genre has nothing in it or whitespace
    if len(genre)==0:
        return "Unique Genre"
    else:
        return max(set(genre), key=genre.count)
final_df['Top Genre_']=final_df['Top Genre_'].apply(lambda x: replace_genre(rep,
                                                                            x)) 
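
One thing to watch with this approach is that re.findall returns matches in whatever case they had in the text, so 'pop' and 'Pop' can end up as separate categories. The sketch below is a hedged variant that normalizes case before counting. The names `UMBRELLA_GENRES` and `collapse_genre` and the shortened genre list are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, shortened list of umbrella genres for illustration.
UMBRELLA_GENRES = ["Country", "Pop", "Rock", "Hip Hop"]
PATTERN = re.compile("|".join(re.escape(g) for g in UMBRELLA_GENRES), re.IGNORECASE)

def collapse_genre(genre_string, default="Unique Genre"):
    """Return the most frequent umbrella genre in a genre string (hypothetical helper)."""
    # title() turns 'pop' and 'POP' into 'Pop' so they are counted together.
    matches = [m.title() for m in PATTERN.findall(genre_string)]
    if not matches:
        return default
    return Counter(matches).most_common(1)[0][0]

print(collapse_genre("dance pop, pop, pop rock"))
print(collapse_genre("cyberpunk"))
```

Ties are broken by whichever genre appears first in the string, which is a design choice rather than anything the data dictates.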

In [None]:
final_df['Top Genre_'].value_counts()

In [None]:
final_df.head()

In [None]:
# final_df['Top Genre'][final_df['Top Genre_'].str.len().eq(0)].value_counts()

In [None]:
# import re
# rep = ['Country','Metal','Pop','Classical','Rock','Electronic',
#              'Rap','R&;B'] ## add other genres here
# final_df['Top Genre_'] = [re.findall(r'|'.join(rep), i,
#                                  re.IGNORECASE)[0] for i in final_df['Top Genre']]

In [None]:
# print(final_df['Top Genre'])

In [None]:
# final_df.head()

In [None]:
final_df['Top Genre_'].value_counts()

In [None]:
final_df.isna().sum()

In [None]:
## filling null with 0's

final_df['Weekly.rank'] = final_df['Weekly.rank'].fillna(value=0)
final_df['Peak.position'] = final_df['Peak.position'].fillna(value=0)
final_df['Weeks.on.chart'] = final_df['Weeks.on.chart'].fillna(value=0)

In [None]:
final_df.isna().sum()

In [None]:
from numpy import median

sns.catplot(x="Top Genre_", y="Weekly.rank", kind="bar",
            estimator=np.median,data=final_df,
            height=10,aspect=11/10)
sns.swarmplot(x="Top Genre_", y="Weekly.rank", data=final_df,edgecolor='black',
              linewidth=0.5)
plt.show();

In [None]:
sns.catplot(x="Top Genre_", y="Peak.position", kind="bar",
            estimator=np.median,data=final_df,
            height=10,aspect=11/10)
sns.swarmplot(x="Top Genre_", y="Peak.position", data=final_df)
plt.show();

In [None]:
sns.catplot(x="Top Genre_", y="Weeks.on.chart", kind="bar",
            estimator=np.median,data=final_df,
            height=10,aspect=11/10)
sns.swarmplot(x="Top Genre_", y="Weeks.on.chart", data=final_df)
plt.show();

# Train Test Split

In [None]:
final_df.head(1)

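Create a top_100 column that shows whether a song was in the Billboard Top 100. Songs from df_rando have no chart data, so their missing Peak.position was filled with 0 above, while every charted song has a peak position of at least 1. Checking Peak.position >= 1 therefore separates the two groups.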

In [None]:


final_df['top_100'] = (final_df['Peak.position']>=1)
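
The labeling logic can be checked on a two-row toy frame. The rows below are made up: one song with a chart peak and one without any chart data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Title": ["Charted hit", "Album track"],
    "Peak.position": [3.0, None],   # None: no chart data for this song
})

# Missing chart positions become 0, which no charted song can have.
toy["Peak.position"] = toy["Peak.position"].fillna(0)

# True for charted songs, False otherwise.
toy["top_100"] = toy["Peak.position"] >= 1

# A boolean column can also be turned into 0/1 directly.
toy["top.100"] = toy["top_100"].astype(int)
print(toy)
```

Casting the boolean column to int gives the same 0/1 target that a label encoder would produce for two classes.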

In [None]:
final_df.head()

In [None]:
final_df.sort_values(['Artist','Title','df'],inplace=True)

In [None]:
final_df[final_df.duplicated(keep=False,
                             subset=('Artist','Title'))]

## look in to dropping duplicates here
# .sort_values('Title')

In [None]:
df_bb[df_bb['Lyrics'].duplicated(keep=False)].sort_values('Lyrics')

In [None]:
## Exploring how many made it
final_df['top_100'].value_counts(1)

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
final_df['top.100'] = le.fit_transform(final_df['top_100'])
final_df.info()

In [None]:
y = final_df['top.100']
X = final_df['Lyrics']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=10)

In [None]:
y_train.value_counts(1)
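
When the classes are imbalanced, a plain random split can shift the class ratio between the train and test sets. Passing `stratify=y` to `train_test_split` keeps the ratio the same in both. The sketch below uses made-up labels with an 80/20 balance; the actual balance in final_df may differ.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 80 positives and 20 negatives.
y = np.array([1] * 80 + [0] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=10, stratify=y)

print(y_tr.mean(), y_te.mean())
```

Both splits keep the 80% positive rate, which makes test metrics easier to compare with training behavior.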

Leveraging tweet tokenizer again so it can capture the full word.

In [None]:
tokenizer = nltk.TweetTokenizer(preserve_case=False)
tokenizer

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


## Initialize TfIdf Vectorizer, feed in function of tokenize
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenize,
                            stop_words=stopwords_ls)

# Vectorize data and make X_train_tfidf and X_test_tfidf
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
X_train_tfidf
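
The fit/transform split above matters: the vocabulary and IDF weights are learned from the training lyrics only, and the test lyrics are mapped onto that fixed vocabulary. A toy example with made-up lines and the default tokenizer shows the effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["baby baby oh", "rock the night", "oh baby rock"]
test_docs = ["baby tonight"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_te = vec.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
```

The word "tonight" never appeared in training, so it contributes nothing to the test vector.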

In [None]:
# from imblearn.over_sampling import RandomOverSampler 

In [None]:
X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train_tfidf, y_train)

# Random Forest

In [None]:
rf = RandomForestClassifier(class_weight="balanced")
rf.fit(X_train_tfidf,y_train)

In [None]:
y_hat_test = rf.predict(X_test_tfidf)

In [None]:
y_test_arr = y_test.to_numpy()

In [None]:
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):    
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 8))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:0.2f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
#     plt.grid(False)
    plt.show()

In [None]:
## Evaluate and view my model.

from sklearn.metrics import multilabel_confusion_matrix

def evaluate_model(y_test,y_hat_test,X_test,clf=None,
                  scoring=None,
                   verbose=False,scorer=False,
                   classes=['Not Top 100','Top 100'],
                  normalize = 'true'):
    
    print(metrics.classification_report(y_test,y_hat_test,
                                        target_names=classes))
    if scoring is None:
        scoring = metrics.recall_score(y_test,y_hat_test,average='macro')
    
    # The matrix is normalized here, so the plot helper is told not to normalize again
    cm = metrics.confusion_matrix(y_test, y_hat_test,
                                  normalize=normalize)
    plot_confusion_matrix(cm,
                      normalize    = False,
                      target_names = classes,
                      title        = "Confusion Matrix")

    if verbose and clf is not None:
        print("MODEL PARAMETERS:")
        # Report the parameters of the model passed in, not the global rf
        print(pd.Series(clf.get_params()))

    ## Use scoring = recall_macro in gridsearch.
    if scorer:
        return metrics.recall_score(y_test,y_hat_test,average='macro')

In [None]:
evaluate_model(y_test_arr,y_hat_test,X_test_tfidf,rf)
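
Macro recall, the score used in evaluate_model, is the unweighted average of the recall for each class. It is also the mean of the diagonal of a row-normalized confusion matrix. The toy labels below are made up to show that relationship.

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]

# Class 0 recall is 2/3 and class 1 recall is 1, so the macro average is 5/6.
macro_recall = recall_score(y_true, y_pred, average="macro")

# normalize="true" divides each row by its total, putting per-class recall on the diagonal.
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(macro_recall, cm.diagonal().mean())
```

Because each class counts equally, this score does not reward a model for simply predicting the majority class.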

In [None]:
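# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
pd.Series(rf.feature_importances_,
          index=vectorizer.get_feature_names_out()).sort_values().tail(25).plot(kind='barh')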

# XGB

In [None]:
xgb_rf = XGBClassifier()
xgb_rf.fit(X_train_tfidf, y_train)

In [None]:
y_pred2 = xgb_rf.predict(X_test_tfidf)

evaluate_model(y_test_arr,y_pred2,X_test_tfidf,xgb_rf)

In [None]:
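# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
pd.Series(xgb_rf.feature_importances_,
          index=vectorizer.get_feature_names_out()).sort_values().tail(25).plot(kind='barh')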

# SVM

In [None]:
from sklearn.svm import SVC

clf = SVC(kernel='linear')
clf.fit(X_train_tfidf,y_train)
y_pred = clf.predict(X_test_tfidf)

evaluate_model(y_test_arr,y_pred,X_test_tfidf,clf)

# Conclusion