# NLP: CLASSIFICATION OF SONG LYRICS WITH EXPLICIT CONTENT

## Part 1: Lyrics Preprocessing

In [1]:
import pandas as pd
import fasttext
from src.preprocess import prep_lyrics, lemmatize  # project helpers used below
import warnings
warnings.filterwarnings("ignore")

### Load dataset

The dataset was originally downloaded from [Kaggle](https://www.kaggle.com/mousehead/songlyrics), but it appears to have been removed since then.

In [2]:
data = pd.read_csv("./data/raw_data.csv")
data.drop("Unnamed: 0", axis=1, inplace=True)
data.head()

Unnamed: 0,artist,song,link,text,explicit_label
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd...",no match
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl...",FALSE
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...,FALSE
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...,no match
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...,FALSE


The raw data has 57,650 rows and 5 columns, and none of the columns has missing values (yay!).

In [3]:
print(data.shape)
data.isnull().sum()

(57650, 5)


artist            0
song              0
link              0
text              0
explicit_label    0
dtype: int64

### Clean dataset

First, we remove the `link` column and the rows with `no match` in the `explicit_label` column. This removes more than half of the songs we started with, leaving 24,676 of the original 57,650.

In [4]:
# Drop columns and rows
data.drop("link", inplace=True, axis=1)
data = data[data["explicit_label"] != "no match"]
data.shape

(24676, 4)

Next, we remove non-English songs as well as live versions and remixes of a song. I wrote the function for removing non-English songs here, rather than in the source code, because loading `lid.176.ftz` gave me permission errors when I tried to include it there. This step removes another 194 songs.

In [5]:
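# Remove non-English songs
def keep_english(df):
    # Pretrained fastText language-identification model (must be available locally)
    model = fasttext.load_model("lid.176.ftz")
    languages = []
    for lyrics in df["text"]:
        # fastText predicts one line at a time, so strip line breaks first
        lyrics = lyrics.replace("\n", "")
        label = model.predict(lyrics)[0][0]  # e.g. "__label__en"
        languages.append(label.replace("__label__", ""))
    # Work on a copy so the new column is not assigned to a slice of the original
    df = df.copy()
    df["language"] = languages
    return df[df["language"] == "en"]

data = keep_english(data)

# Remove live versions and remixes (raw strings avoid invalid-escape warnings)
data = data[~data["song"].str.contains(r"\[Live\]")]
data = data[~data["song"].str.contains(r"\(Live\)")]
data = data[~data["song"].str.contains("Remix")]
data.shape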



(24482, 5)
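As a side note on the language filter, here is a minimal sketch of how a fastText label such as `__label__en` can be turned into a plain language code without relying on fixed string positions. The helper names `label_to_lang` and `detect_language`, as well as the `FakeModel` stand-in, are hypothetical and only serve the illustration; a real run would pass the model returned by `fasttext.load_model("lid.176.ftz")` instead.

```python
LABEL_PREFIX = "__label__"


def label_to_lang(label):
    """Strip the fastText label prefix, e.g. '__label__en' -> 'en'."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def detect_language(model, text):
    """Return the top predicted language code for one lyric string."""
    # predict() works on a single line, so line breaks are replaced first
    labels, _probabilities = model.predict(text.replace("\n", " "))
    return label_to_lang(labels[0])


class FakeModel:
    """Hypothetical stand-in for a loaded fastText model (illustration only)."""

    def predict(self, text):
        # Mimics the (labels, probabilities) pair returned for a single string
        return ("__label__en",), [0.99]


print(detect_language(FakeModel(), "Look at her face\nit's a wonderful face"))
```

Stripping the prefix by name also handles three-letter language codes, which a fixed two-character slice would truncate.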

We convert binary labels to `1` and `0` for later use.

In [6]:
# Convert labels
data.loc[data["explicit_label"] == "TRUE", ["explicit_label"]] = 1
data.loc[data["explicit_label"] == "FALSE", ["explicit_label"]] = 0
data.head()

Unnamed: 0,artist,song,text,explicit_label,language
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0,en
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0,en
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0,en
7,ABBA,Chiquitita,"Chiquitita, tell me what's wrong \nYou're enc...",0,en
11,ABBA,Dancing Queen,"You can dance, you can jive, having the time o...",0,en


Now, we make the lyrics analysis-friendly as follows:
* Tokenize lyric string into a list of words
* Make everything lower-case
* Replace line break (`\n`) with space (` `)
* Remove punctuation marks
* Remove digits
* Expand contractions
* Replace multiple spaces with a single space
* Remove stopwords
* Remove hyperlinks (if any)
* Remove `&gt;` (if any)
* Remove emojis (if any)

In [7]:
path = "./data"
prep_lyrics(data, path)
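The actual cleaning happens inside `prep_lyrics`, which lives in the source code and is not shown in this notebook. For readers who want a feel for the steps listed above, here is a rough sketch using only the standard library. The function name `clean_lyrics`, the tiny `CONTRACTIONS` and `STOPWORDS` tables, and the order of the steps are all assumptions made for illustration; they are not necessarily what `prep_lyrics` does.

```python
import re
import string

# Tiny illustrative subsets (assumptions); a real pipeline would use fuller lists.
CONTRACTIONS = {"i'll": "i will", "it's": "it is", "you're": "you are",
                "can't": "cannot", "don't": "do not"}
STOPWORDS = {"i", "me", "you", "it", "is", "a", "an", "the", "and", "of",
             "to", "with", "at", "are", "will", "do", "not"}

PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def clean_lyrics(text):
    """Turn a raw lyric string into a list of cleaned, lower-case tokens."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"https?://\S+", " ", text)      # hyperlinks
    text = text.replace("&gt;", " ")               # leftover HTML entity
    for short, full in CONTRACTIONS.items():       # expand before stripping "'"
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = PUNCTUATION.sub(" ", text)              # punctuation marks
    text = re.sub(r"\d+", " ", text)               # digits
    text = re.sub(r"[^\x00-\x7f]", " ", text)      # crude emoji removal: drops all non-ASCII
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated spaces
    return [word for word in text.split(" ") if word and word not in STOPWORDS]


print(clean_lyrics("Take it easy with me, please \nTouch me gently 2 times"))
```

Hyperlinks and `&gt;` are removed before punctuation because stripping punctuation first would break both patterns apart.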

### Lemmatize

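After preparing the lyrics, we lemmatize the words to reduce each one to its dictionary form (its lemma). We start by assigning parts of speech to the words, since the correct lemma depends on whether a word is used as a noun, verb, adjective, or adverb.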

In [8]:
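# Load data and lemmatize
data = pd.read_csv("./data/data.csv")
lemmatize(data, path)

# Take a look! The list columns are stored as strings in the CSV,
# so parse them back into Python lists (literal_eval is safer than eval)
from ast import literal_eval
data = pd.read_csv("./data/data.csv", converters={"prepped_lyrics": literal_eval,
                                                  "lemmatized": literal_eval,
                                                  "unique_words": literal_eval})
data.head()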

Unnamed: 0,artist,song,text,explicit_label,language,prepped_lyrics,lemmatized,unique_words
0,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl...",0,en,"[take, easy, please, touch, gently, summer, ev...","[take, easy, please, touch, gently, summer, ev...","[thousand, butterfly, slow, night, soul, body,..."
1,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...,0,en,"[never, know, go, put, lousy, rotten, show, bo...","[never, know, go, put, lousy, rotten, show, bo...","[take, say, found, another, way, know, thank, ..."
2,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...,0,en,"[making, somebody, happy, question, give, take...","[make, somebody, happy, question, give, take, ...","[show, tool, boomerang, throw, found, boom, kn..."
3,ABBA,Chiquitita,"Chiquitita, tell me what's wrong \nYou're enc...",0,en,"[chiquitita, tell, wrong, enchained, sorrow, e...","[chiquitita, tell, wrong, enchain, sorrow, eye...","[shoulder, cry, candle, feather, best, way, so..."
4,ABBA,Dancing Queen,"You can dance, you can jive, having the time o...",0,en,"[dance, jive, time, life, see, girl, watch, sc...","[dance, jive, time, life, see, girl, watch, sc...","[light, teaser, leave, night, another, dance, ..."
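The `lemmatize` helper itself is defined in the source code and not shown here. As a sketch of what part-of-speech-aware lemmatization can look like, the snippet below maps Penn Treebank tags to the single-letter categories that WordNet-based lemmatizers expect. Using NLTK is an assumption on my part (the notebook does not say which library `lemmatize` relies on), and the names `penn_to_wordnet` and `lemmatize_tokens` are hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as WordNet lemmatizers usually do


def lemmatize_tokens(tokens):
    """Lemmatize a list of tokens with NLTK (assumed; needs its tagger and WordNet data)."""
    # Imported lazily so the tag mapping above stays usable without NLTK installed
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in pos_tag(tokens)]


print(penn_to_wordnet("VBG"))
```

Without the tag, a WordNet lemmatizer treats every word as a noun, so a verb form such as "making" would stay unchanged instead of becoming "make".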


Now, our data is ready for exploratory data analysis (EDA) and modeling!