# Cleaning Data

This notebook describes our process for cleaning the data, removing unwanted features, and handling a few other small considerations.

## Dropping Unimportant Features

This section describes which features from the original data we remove and why.

In [1]:
## importing libraries

import os
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
## loading in data

data_folder = "../data/raw/"
file = data_folder + "all_songs_data_raw.csv"

# read each csv and concatenate into one dataframe (currently a single file)
df_list = [pd.read_csv(file)]

raw_df = pd.concat(df_list, ignore_index=True)

We start by removing features whose form makes them unusable, such as "Media", "Writers", "Album URL", and "Song URL". These hold URLs or other embedded information rather than values we can analyze.

In [3]:
# copying soon-to-be-cleaned data
cleaned_df = raw_df.copy()

# dropping writers and media columns
cleaned_df = cleaned_df.drop(columns=["Media", "Writers"])

# also drop album and song urls
cleaned_df = cleaned_df.drop(columns=["Album URL", "Song URL"])

We then drop features that are simply redundant, like "Release Date", since all we care about is predicting the year.

In [4]:
cleaned_df = cleaned_df.drop(columns=["Release Date"])
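
One practical note on the drop cells above: re-running a cell after its columns are already gone raises a `KeyError`. A minimal sketch of a re-runnable variant, using a hypothetical toy frame (`toy_df`) rather than the real data, passes `errors="ignore"` to `drop`:

```python
import pandas as pd

# Hypothetical toy frame standing in for cleaned_df.
toy_df = pd.DataFrame(
    {"Title": ["Song A"], "Year": [1960.0], "Release Date": ["1960-01-01"]}
)

# errors="ignore" skips labels that are no longer present,
# so dropping the same column twice does not raise.
once = toy_df.drop(columns=["Release Date"], errors="ignore")
twice = once.drop(columns=["Release Date"], errors="ignore")

print(list(twice.columns))
```

The trade-off is that a misspelled column name is also silently ignored, so this is only worth using once the column names are known to be correct.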

## Cleaning Data Values

This section will describe our process for cleaning up values in the data.

The first values we clean are those in the "Year" column. They are currently stored as floats, but years are whole numbers, so we convert them to ints.

In [5]:
cleaned_df["Year"] = cleaned_df["Year"].astype(int)
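
The cast above assumes that every year is present and is a whole number. `astype(int)` raises on missing values and silently truncates any fractional part, so a check before casting can be useful. The sketch below is one possible guard, not part of our pipeline; the helper name `to_int_year` and the sample values are hypothetical.

```python
import pandas as pd

def to_int_year(years: pd.Series) -> pd.Series:
    """Cast a float year column to int, failing loudly on bad values."""
    if years.isna().any():
        raise ValueError("Year column contains missing values")
    if not (years % 1 == 0).all():
        raise ValueError("Year column contains non-whole numbers")
    return years.astype(int)

# Hypothetical sample values, not taken from the dataset.
sample_years = pd.Series([1959.0, 1984.0, 2003.0], name="Year")
converted = to_int_year(sample_years)
print(converted.tolist())
```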

Next we clean the "Featured Artists" column. Right now each value is metadata stored as a string (a list of dicts), when we only want the artists' names. We combine those names with the main artist into a new "Artists" column.

In [6]:
import ast

combined_artists_list = []

for i, row in cleaned_df.iterrows():
    artist = row["Artist"]
    featured_artists = row["Featured Artists"]

    # ensure featured artists is a string of a list of dicts
    if isinstance(featured_artists, str) and featured_artists != "[]":
        # convert if needed
        featured_artists = ast.literal_eval(featured_artists)

    if isinstance(featured_artists, list) and featured_artists:

        # get all names
        featured_artists_names = [fa['name'] for fa in featured_artists if isinstance(fa, dict) and 'name' in fa]

        # add each featured name that does not already appear in the main artist string
        combined_artists = [artist]
        for fa_name in featured_artists_names:
            if fa_name.lower() not in artist.lower():
                combined_artists.append(fa_name)

        # join the names into one string for this row
        combined_artists_list.append(", ".join(combined_artists))
    else:
        combined_artists_list.append(artist)
        
cleaned_df["Artists"] = combined_artists_list
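
To make the per-row logic easier to follow, here is the same idea as a standalone function run on made-up inputs. This is a sketch, not the code used above: the function name `combine_artists` and the sample artist names are hypothetical, and the extra `try`/`except` around `ast.literal_eval` is an added assumption that malformed strings should fall back to the main artist only.

```python
import ast

def combine_artists(artist, featured_raw):
    """Return the main artist plus any featured names not already in it."""
    featured = featured_raw
    if isinstance(featured_raw, str):
        try:
            # the raw value is expected to look like "[{'name': ...}, ...]"
            featured = ast.literal_eval(featured_raw)
        except (ValueError, SyntaxError):
            # assumption: treat unparseable metadata as "no featured artists"
            featured = []
    if not isinstance(featured, list):
        # covers NaN and any other non-list value
        return artist
    names = [fa["name"] for fa in featured if isinstance(fa, dict) and "name" in fa]
    extra = [n for n in names if n.lower() not in artist.lower()]
    return ", ".join([artist] + extra)

print(combine_artists("Artist A", "[{'name': 'Artist B', 'id': 1}]"))
print(combine_artists("Artist A featuring Artist B", "[{'name': 'Artist B'}]"))
```

The second call shows the substring check at work: the featured name is already part of the main artist string, so it is not appended again.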

There is still one more thing to be done with this column, though. Many bands are named "[Main Vocalist] and the [Band Name]", especially in the earlier years of this dataset. Such a name should be treated as one artist, if possible. This differs from many songs today that credit two or more artists separated by "and", in the format "[1st Artist] and [2nd Artist]". There needs to be a way to distinguish these two cases, as the difference may affect the model.

A note: after many attempts at separating these two cases, we have to concede that this will very likely have to be done manually or through a method that we do not have the time to pursue right now. Unfortunately, we have to accept that the model will use this flawed data.

Now that the combined "Artists" column is finalized, the original "Artist" and "Featured Artists" columns are no longer needed, so we drop them.

In [7]:
cleaned_df = cleaned_df.drop(columns=["Artist", "Featured Artists"])

## Removing Non-Applicable Data Values

Now we remove any instrumental songs. Their "lyrics" are represented unpredictably and we cannot run sentiment analysis on them, so they must be removed. Oddly enough, the data has no definite marker for instrumentals, but from looking at it, removing any row with an empty "Lyrics", "Verbs", "Nouns", "Adverbs", or "Corpus" value will do.

In [8]:
cleaned_df = cleaned_df.dropna(subset=["Lyrics", "Verbs", "Nouns", "Adverbs", "Corpus"], how='any')
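
As a small illustration of what `how='any'` does, the sketch below applies the same call to a hypothetical three-row frame (`toy_df`, with made-up titles and only two of the five columns). A row is dropped as soon as one of the listed columns is missing.

```python
import pandas as pd

# Hypothetical toy frame; None marks a missing value, as NaN would after read_csv.
toy_df = pd.DataFrame(
    {
        "Title": ["Song A", "Instrumental B", "Song C"],
        "Lyrics": ["la la la", None, "oh yeah"],
        "Corpus": ["la la la", None, None],
    }
)

# how="any": drop a row if ANY column in subset is missing.
kept = toy_df.dropna(subset=["Lyrics", "Corpus"], how="any")
print(kept["Title"].tolist())
```

Note that `dropna` only removes missing values (NaN/None). An empty string would be kept, although `pd.read_csv` reads empty fields as NaN by default.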

As the output below shows, the songs with the lowest word counts all contain actual words rather than symbols or placeholders that would signify an instrumental.

In [9]:
print(cleaned_df.nsmallest(10, 'Word Counts')[['Lyrics', 'Word Counts']])

                                                 Lyrics  Word Counts
5952  I love it when you call me A-nita cause it's m...           19
1504  Spoken: Grand piano Reed and pipe organ Glocke...           26
1849  Guess mine is not the first heart broken My ey...           28
23    Sorry, sorry, oh so sorry SPOKEN: Uh-oh! **I r...           33
1686  Baby face, youve got the cutest little baby fa...           55
340   (Roy Orbison)  Sweet dream baby Sweet dream ba...           62
845   Here he comes now I've got to tell him somehow...           62
1318  DAY BY DAY GODSPELL Day by day (solo voice) Da...           65
1007  This brand new album is called Hawaii Five-O P...           66
2492   The dark side's callin' now Nothin' is real S...           66


## Creating New Features

The code below creates a new feature, the "Repetition Ratio": the ratio of a song's total word count to its unique word count. For example, if a song has 180 words, 90 of them unique, the repetition ratio is 2, meaning each word is used twice on average.

In [10]:
cleaned_df["Repetition Ratio"] = cleaned_df["Word Counts"] / cleaned_df["Unique Word Counts"]
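
For intuition, the same ratio can be computed directly from a lyric string. This is only a sketch: we do not know exactly how the "Word Counts" and "Unique Word Counts" columns in the raw data were produced, so the lowercasing and whitespace splitting here are assumptions, and `repetition_ratio` is a hypothetical helper.

```python
def repetition_ratio(lyrics):
    """Total words divided by unique words, using a naive whitespace split."""
    words = lyrics.lower().split()
    if not words:
        # no words at all: the ratio is undefined
        return float("nan")
    return len(words) / len(set(words))

# 6 words, 2 unique ("la", "hey")
print(repetition_ratio("la la la la hey hey"))  # 3.0
# every word appears once
print(repetition_ratio("one two three"))  # 1.0
```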

The next feature we create is a sentiment score for the corpus of each song's lyrics. In theory this method could also be applied to the verbs, adverbs, nouns, etc. of the song, but that is unlikely to provide useful results. If sentiment follows a pattern over time, the model could use the sentiment score of a song's lyrics to help predict its year.

The following code uses the Natural Language Toolkit (NLTK) to run a lexicon-based sentiment analysis on these lyrics.

In [11]:
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize
from nltk import download

# download lexicons
download('opinion_lexicon')
download('punkt')

# build the sets of positive and negative words once, not once per song
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

# function to apply to each song
def lexicon_sentiment_score(text):

    # lowercase and tokenize the corpus
    words = word_tokenize(text.lower())

    # count positive and negative words
    pos_count = sum(1 for word in words if word in positive_words)
    neg_count = sum(1 for word in words if word in negative_words)
    total = pos_count + neg_count

    # return the share of positive words among all opinion words,
    # or 0.5 (neutral) if the song contains none
    if total == 0:
        return 0.5
    return pos_count / total

cleaned_df["Sentiment"] = cleaned_df["Corpus"].apply(lexicon_sentiment_score)


[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /home/devel/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /home/devel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
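
To show how the score behaves without downloading anything, here is a sketch of the same ratio on a tiny made-up lexicon. The word sets `POSITIVE` and `NEGATIVE`, the regex tokenizer, and the function name `toy_sentiment_score` are all stand-ins for illustration. The real cell uses NLTK's `opinion_lexicon` and `word_tokenize`, which will give different results.

```python
import re

# Hypothetical six-word lexicon, standing in for NLTK's opinion_lexicon.
POSITIVE = {"love", "happy", "sweet"}
NEGATIVE = {"sorry", "broken", "dark"}

def toy_sentiment_score(text):
    """Share of positive words among all lexicon words; 0.5 if there are none."""
    # naive tokenizer: runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    pos_count = sum(1 for word in words if word in POSITIVE)
    neg_count = sum(1 for word in words if word in NEGATIVE)
    total = pos_count + neg_count
    if total == 0:
        return 0.5
    return pos_count / total

print(toy_sentiment_score("I love sweet dreams"))  # 1.0 (2 positive, 0 negative)
print(toy_sentiment_score("la la la"))  # 0.5 (no lexicon words)
```

The score ranges from 0 (only negative words) to 1 (only positive words). It ignores how many neutral words a song contains, so a song with a single positive word scores the same as one made up entirely of positive words.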


## Saving the Data

Finally, we save the cleaned data to its new location.

In [12]:
cleaned_data_folder = "../data/cleaned/"
cleaned_file = cleaned_data_folder + "all_songs_data_cleaned.csv"
os.makedirs(cleaned_data_folder, exist_ok=True)
cleaned_df.to_csv(cleaned_file, index=False)