## Lyrics Dataset EDA
Description of dataset (Moura et al. 2020: Temporal Analysis and Visualisation of Music): "Using spotipy we fetched data from songs of 7 musical genres (rock, reggae, jazz, blues, hip hop, country, pop) and release date ranging from 1950 to 2019. Our dataset consisted of 82452 songs distributed of 7 musical genres and release dates ranging from 1950 to 2019. The main information retained was the artist name, track name, release date, genre and track id. The track id is a unique id for each searched track. We used the ’track id’ as input to the spotipy’s audio features tool and we kept only some of these features. The selected features were:  

• Acousticness: Presence of acoustic instruments;  

• Danceability: how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity;  

• Loudness: The average loudness in decibels (dB) across the entire track;  

• Instrumentalness: A high value describes whether a track contains fewer vocals;  

• Valence: High (low) values means that the track is more happy, euphoric (sad, angry);  

• Energy: Measures intensity and activity of music. Energetic tracks will be fast, loud and noisy.  

(...) Using Genius API, we queried lyrics using the artist name and song name of the same songs in which we obtained the metadata. We started cleaning the texts by identifying the language using Google’s library language-detection [Shuyo 2010] and removed all non-English texts. One of the patterns found in the retrieved lyrics was the presence of bracketed texts including the artist who sings that phrase and if the phrase is either a verse, chorus, or intro. There were also texts in parentheses which in most cases were onomatopoeia or backing vocals. These texts between parenthesis/brackets were removed. We cleaned the remaining texts consisted of removing symbols, numbers, and stop words like common English words and proper nouns. The remaining words were lemmatized to its canonical form using WordNet Lemmatizer [Fellbaum 2005] provided by NLTK package [Bird 2009]. The data is available at [Moura 2020]."
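
The cleaning steps quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: `clean_lyrics` and `STOP_WORDS` are hypothetical names, the stop-word list is a tiny placeholder (the paper does not say which list was used), and the language-detection and lemmatization steps are left out so the sketch runs with the standard library alone.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption; not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "and", "i", "you", "it", "is", "to", "of", "in", "me", "my"}

def clean_lyrics(text):
    """Approximate the quoted cleaning steps, minus language detection and lemmatization."""
    # Drop bracketed section markers such as "[Chorus: Singer]" and parenthesised backing vocals.
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)
    # Lowercase, then replace everything except ASCII letters and whitespace (symbols, numbers).
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Tokenize on whitespace and drop stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

# Made-up example line, not taken from the dataset.
example = "[Chorus: Singer]\nHold me tight (ooh ooh), 2 times tonight!"
print(clean_lyrics(example))  # ['hold', 'tight', 'times', 'tonight']
```

In the paper's pipeline the surviving tokens would then be lemmatized with NLTK's WordNet lemmatizer, which is why the `lyrics` column in this dataset contains base forms only.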

In [5]:
# The dataset provides lyrics in lemmatized form, which precludes some types of analysis (e.g. rhymes).
# In the source paper, the authors did topic analysis by genre and over time.

In [1]:
#import relevant packages
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re, string

In [10]:
# The file appears to be UTF-8: decoding it as latin-1 turns "pérez prado" into "pÃ©rez prado".
data = pd.read_csv("tcc_ceds_music.csv", delimiter=',', encoding='utf-8')
data.shape

(28372, 31)

In [11]:
data.head()

| | Unnamed: 0 | artist_name | track_name | release_date | genre | lyrics | len | dating | violence | world/life | ... | sadness | feelings | danceability | loudness | acousticness | instrumentalness | valence | energy | topic | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | mukesh | mohabbat bhi jhoothi | 1950 | pop | hold time feel break feel untrue convince spea... | 95 | 0.000598 | 0.063746 | 0.000598 | ... | 0.380299 | 0.117175 | 0.357739 | 0.454119 | 0.997992 | 0.901822 | 0.339448 | 0.13711 | sadness | 1.0 |
| 1 | 4 | frankie laine | i believe | 1950 | pop | believe drop rain fall grow believe darkest ni... | 51 | 0.035537 | 0.096777 | 0.443435 | ... | 0.001284 | 0.001284 | 0.331745 | 0.64754 | 0.954819 | 2e-06 | 0.325021 | 0.26324 | world/life | 1.0 |
| 2 | 6 | johnnie ray | cry | 1950 | pop | sweetheart send letter goodbye secret feel bet... | 24 | 0.00277 | 0.00277 | 0.00277 | ... | 0.00277 | 0.225422 | 0.456298 | 0.585288 | 0.840361 | 0.0 | 0.351814 | 0.139112 | music | 1.0 |
| 3 | 10 | pérez prado | patricia | 1950 | pop | kiss lips want stroll charm mambo chacha merin... | 54 | 0.048249 | 0.001548 | 0.001548 | ... | 0.225889 | 0.001548 | 0.686992 | 0.744404 | 0.083935 | 0.199393 | 0.77535 | 0.743736 | romantic | 1.0 |
| 4 | 12 | giorgos papadopoulos | apopse eida oneiro | 1950 | pop | till darling till matter know till dream live ... | 48 | 0.00135 | 0.00135 | 0.417772 | ... | 0.0688 | 0.00135 | 0.291671 | 0.646489 | 0.975904 | 0.000246 | 0.597073 | 0.394375 | romantic | 1.0 |


In [7]:
data.describe()

| | Unnamed: 0 | release_date | len | dating | violence | world/life | night/time | shake the audience | family/gospel | romantic | ... | like/girls | sadness | feelings | danceability | loudness | acousticness | instrumentalness | valence | energy | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | ... | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 | 28372.0 |
| mean | 42946.323558 | 1990.236888 | 73.028444 | 0.021112 | 0.118396 | 0.120973 | 0.057387 | 0.017422 | 0.017045 | 0.048681 | ... | 0.028057 | 0.129389 | 0.030996 | 0.533348 | 0.665249 | 0.3392347 | 0.080049 | 0.532864 | 0.569875 | 0.425187 |
| std | 24749.325492 | 18.487463 | 41.829831 | 0.05237 | 0.178684 | 0.1722 | 0.111923 | 0.04067 | 0.041966 | 0.106095 | ... | 0.058473 | 0.181143 | 0.071652 | 0.173218 | 0.108434 | 0.3267143 | 0.211245 | 0.250972 | 0.244385 | 0.264107 |
| min | 0.0 | 1950.0 | 1.0 | 0.000291 | 0.000284 | 0.000291 | 0.000289 | 0.000284 | 0.000289 | 0.000284 | ... | 0.000284 | 0.000284 | 0.000289 | 0.005415 | 0.0 | 2.811248e-07 | 0.0 | 0.0 | 0.0 | 0.014286 |
| 25% | 20391.25 | 1975.0 | 42.0 | 0.000923 | 0.00112 | 0.00117 | 0.001032 | 0.000993 | 0.000923 | 0.000975 | ... | 0.000975 | 0.001144 | 0.000993 | 0.412975 | 0.595364 | 0.03423598 | 0.0 | 0.329143 | 0.380361 | 0.185714 |
| 50% | 45405.5 | 1991.0 | 63.0 | 0.001462 | 0.002506 | 0.006579 | 0.001949 | 0.001595 | 0.001504 | 0.001754 | ... | 0.001595 | 0.005263 | 0.001754 | 0.538612 | 0.67905 | 0.2259028 | 8.5e-05 | 0.539365 | 0.580567 | 0.414286 |
| 75% | 64090.5 | 2007.0 | 93.0 | 0.004049 | 0.192608 | 0.197793 | 0.065842 | 0.010002 | 0.004785 | 0.042301 | ... | 0.026622 | 0.235113 | 0.032622 | 0.656666 | 0.749026 | 0.6325298 | 0.009335 | 0.738252 | 0.772766 | 0.642857 |
| max | 82451.0 | 2019.0 | 199.0 | 0.647706 | 0.981781 | 0.962105 | 0.973684 | 0.497463 | 0.545303 | 0.940789 | ... | 0.594459 | 0.981424 | 0.95881 | 0.993502 | 1.0 | 1.0 | 0.996964 | 1.0 | 1.0 | 1.0 |
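
The summary statistics suggest that `age` is a linear rescaling of `release_date`. The sketch below checks one candidate mapping, `age = (2020 - release_date) / 70`, against the values read off the `describe()` output. The mapping is an inference from the table, not something the source states, and `age_from_year` is a hypothetical helper name.

```python
# (release_date, age) pairs read off the describe() output above.
# The candidate mapping is decreasing, so the 25% year pairs with the 75% age and vice versa.
pairs = [
    (1950.0, 1.0),            # min year, max age
    (1975.0, 0.642857),       # 25% year, 75% age
    (1991.0, 0.414286),       # medians
    (2007.0, 0.185714),       # 75% year, 25% age
    (2019.0, 0.014286),       # max year, min age
    (1990.236888, 0.425187),  # means (comparable because the mapping is linear)
]

def age_from_year(year):
    # Hypothesised mapping inferred from the table (assumption, not documented in the source).
    return (2020 - year) / 70

for year, age in pairs:
    print(year, age, round(age_from_year(year), 6))
```

All six pairs agree to the printed precision. If the mapping holds for every row, `age` carries no information beyond `release_date`; a comparison of `data['age']` with `(2020 - data['release_date']) / 70` over the full column would confirm or refute that.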


In [8]:
data['lyrics'].head(10)

0    hold time feel break feel untrue convince spea...
1    believe drop rain fall grow believe darkest ni...
2    sweetheart send letter goodbye secret feel bet...
3    kiss lips want stroll charm mambo chacha merin...
4    till darling till matter know till dream live ...
5    convoy light dead ahead merchantmen trump dies...
6    piece mindin world knowin life come bring give...
7    care moment hold fast press lips dream heaven ...
8    lonely night surround power read mind hour nig...
9    tear heart seat stay awhile tear heart game st...
Name: lyrics, dtype: object
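
Because the lyrics are already cleaned and lemmatized, whitespace splitting is enough to get tokens for frequency counts. A minimal sketch, using only the visible (truncated) prefixes of the first three rows above with the cut-off final word of each dropped; `snippets` is an illustrative stand-in for the full column.

```python
from collections import Counter

# Visible prefixes of the first three lyrics shown above; the full texts are longer.
snippets = [
    "hold time feel break feel untrue convince",
    "believe drop rain fall grow believe darkest",
    "sweetheart send letter goodbye secret feel",
]

# Tokens are space-separated lemmas, so str.split() is a sufficient tokenizer here.
counts = Counter(token for text in snippets for token in text.split())
print(counts.most_common(2))  # [('feel', 3), ('believe', 2)]
```

On the full DataFrame the same idea could be written as `data['lyrics'].str.split().explode().value_counts()`, optionally grouped by `genre` or by decade of `release_date`.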