# Cleaning Data

This notebook describes our process for cleaning the data: removing unwanted features, adding new features, and normalizing the result.

## Dropping Unimportant Features

This section describes which features from the original data we remove and why.

In [1]:
## importing libraries

import os
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
## loading in data

data_folder = "../data/raw/"
file = data_folder + "all_songs_data_raw.csv"

# concat all csvs into one dataframe (currently a single file)
df_list = [pd.read_csv(file)]

raw_df = pd.concat(df_list, ignore_index=True)

We start by removing features whose form makes them unusable, such as "Media" and "Writers". These columns, along with the album and song URLs, hold links or other embedded metadata that are neither useful nor interesting to look at.

In [3]:
# copying soon-to-be-cleaned data
cleaned_df = raw_df.copy()

# dropping writers and media columns
cleaned_df = cleaned_df.drop(columns=["Media", "Writers"])

# also drop album and song urls
cleaned_df = cleaned_df.drop(columns=["Album URL", "Song URL"])

We then drop features that are simply redundant, like "Release Date", since all we care about is predicting the year.

In [4]:
cleaned_df = cleaned_df.drop(columns=["Release Date"])

## Cleaning Data Values

This section will describe our process for cleaning up values in the data.

The first values we will clean are all the values in the "Year" column. They are currently floats, when they can easily be ints.

In [5]:
cleaned_df["Year"] = cleaned_df["Year"].astype(int)
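
Note that `astype(int)` raises an error if the column contains missing values. The following is a minimal sketch of one way to guard against that, assuming rows without a year can simply be discarded. The toy dataframe `toy_df` is hypothetical and not part of our data.

```python
import pandas as pd

# hypothetical toy data: the second row has no year
toy_df = pd.DataFrame({"Year": [1965.0, None, 2003.0]})

# drop rows with a missing year, then cast the remaining floats to ints
toy_df = toy_df.dropna(subset=["Year"])
toy_df["Year"] = toy_df["Year"].astype(int)

print(toy_df["Year"].tolist())
```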

Next we clean the "Featured Artists" column. Right now it holds a string representation of a list of metadata dictionaries, when we want the names of the artists only.

In [6]:
import ast

combined_artists_list = []

for i, row in cleaned_df.iterrows():
    artist = row["Artist"]
    featured_artists = row["Featured Artists"]

    # the raw value is a string representation of a list of dicts
    if isinstance(featured_artists, str) and featured_artists != "[]":
        # parse the string into an actual list
        featured_artists = ast.literal_eval(featured_artists)

    if isinstance(featured_artists, list) and featured_artists:

        # get all names
        featured_artists_names = [fa['name'] for fa in featured_artists if isinstance(fa, dict) and 'name' in fa]

        # append featured names that do not already appear in the main artist string
        combined_artists = [artist]
        for fa_name in featured_artists_names:
            if fa_name.lower() not in artist.lower():
                combined_artists.append(fa_name)

        # store the combined names as one comma-separated string
        combined_artists_list.append(", ".join(combined_artists))
    else:
        combined_artists_list.append(artist)

cleaned_df["Artists"] = combined_artists_list
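
To make the merging rule easier to check, here is a sketch of the same logic as a standalone function run on a few made-up rows. The function name `combine_artists`, the example artists, and the `id` key are illustrative assumptions, not values taken from the dataset.

```python
import ast

def combine_artists(artist, featured_artists):
    """Return the main artist plus any featured names not already in it."""
    # parse the string representation of a list of dicts, if that is what we got
    if isinstance(featured_artists, str) and featured_artists != "[]":
        featured_artists = ast.literal_eval(featured_artists)

    # anything that is not a non-empty list (e.g. NaN or "[]") has no features
    if not (isinstance(featured_artists, list) and featured_artists):
        return artist

    names = [fa["name"] for fa in featured_artists
             if isinstance(fa, dict) and "name" in fa]

    combined = [artist]
    for name in names:
        # skip names that already appear inside the main artist string
        if name.lower() not in artist.lower():
            combined.append(name)
    return ", ".join(combined)

# hypothetical example rows
print(combine_artists("Santana", "[{'name': 'Rob Thomas', 'id': 1}]"))
print(combine_artists("Santana feat. Rob Thomas", "[{'name': 'Rob Thomas'}]"))
print(combine_artists("Santana", "[]"))
```

Because the duplicate check is a substring match, a featured artist whose name happens to be contained in the main artist's name is also skipped.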

There is still one more thing to be done with this column, though. Many bands are named "[Main Vocalist] and the [Band Name]", especially in the earlier years of this dataset. This should be treated as one artist, if possible. This is different from many songs today that have 2 or more artists separated by "and" in the format "[1st Artist] and [2nd Artist]". There needs to be a way to distinguish these two, as this may affect the model.

A note: after many attempts at separating these two cases, we have to concede that this will very likely have to be done manually or through a method that we do not have the time to pursue right now. Unfortunately, the model will have to use this flawed data.

Now that the "Artists" column is finalized, there is no need for the "Artist" and "Featured Artists" columns, so we drop them.

In [7]:
cleaned_df = cleaned_df.drop(columns=["Artist", "Featured Artists"])

## Removing Non-Applicable Data Values

Now we will remove any instrumental songs. Their "lyrics" are represented unpredictably and we cannot do sentiment analysis on them, so they must be removed. Oddly enough, there is no definitive marker for instrumentals, but from looking at the data, removing any row with a missing "Lyrics", "Verbs", "Nouns", "Adverbs", or "Corpus" value will do.

In [8]:
cleaned_df = cleaned_df.dropna(subset=["Lyrics", "Verbs", "Nouns", "Adverbs", "Corpus"], how='any')
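
As a small illustration of what `how='any'` does, the sketch below drops a row as soon as one of the listed columns is missing. The two-row dataframe and its song titles are hypothetical.

```python
import pandas as pd

# hypothetical toy data: the second song has no lyrics or verbs
toy_df = pd.DataFrame({
    "Song Title": ["Vocal Song", "Instrumental Song"],
    "Lyrics": ["la la la", None],
    "Verbs": ["sing", None],
})

# a row is removed if ANY of the subset columns is missing
kept = toy_df.dropna(subset=["Lyrics", "Verbs"], how="any")

print(kept["Song Title"].tolist())
```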

As the output below shows, the ten songs with the smallest word counts all contain actual words instead of symbols or placeholders that would signify an instrumental.

In [9]:
print(cleaned_df.nsmallest(10, 'Word Counts')[['Lyrics', 'Word Counts']])

                                                 Lyrics  Word Counts
5952  I love it when you call me A-nita cause it's m...           19
1504  Spoken: Grand piano Reed and pipe organ Glocke...           26
1849  Guess mine is not the first heart broken My ey...           28
23    Sorry, sorry, oh so sorry SPOKEN: Uh-oh! **I r...           33
1686  Baby face, youve got the cutest little baby fa...           55
340   (Roy Orbison)  Sweet dream baby Sweet dream ba...           62
845   Here he comes now I've got to tell him somehow...           62
1318  DAY BY DAY GODSPELL Day by day (solo voice) Da...           65
1007  This brand new album is called Hawaii Five-O P...           66
2492   The dark side's callin' now Nothin' is real S...           66


## Creating New Features

The code below creates a new feature, the "Repetition Ratio", which is the ratio of the total number of words to the number of unique words. For example, if a song has 180 words and 90 of them are unique, the repetition ratio is 2, meaning each word is said twice on average.

Note: This feature ended up not being helpful to the model.

In [10]:
#cleaned_df["Repetition_Ratio"] = cleaned_df["Word Counts"] / cleaned_df["Unique Word Counts"]
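
The same ratio can be sketched directly on a string. This version assumes words are whitespace-separated and compared case-insensitively, which may differ from how the dataset's "Word Counts" and "Unique Word Counts" columns were computed; the function name `repetition_ratio` is illustrative.

```python
def repetition_ratio(lyrics):
    """Total words divided by unique words (0.0 for empty text)."""
    words = lyrics.lower().split()
    unique_words = set(words)
    return len(words) / len(unique_words) if unique_words else 0.0

# 6 words in total, 2 of them unique
print(repetition_ratio("na na na na hey hey"))
```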

The next feature we can create is the sentiment score for the corpus of the lyrics. In theory this method could be applied to the verbs, adverbs, nouns, etc. of the song, but this is unlikely to provide useful results. However, the model could use the sentiment score of the lyrics of a song to help predict its year, if there is a pattern.

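The following code uses the Natural Language Toolkit (NLTK) opinion lexicon to score these lyrics. The score is the share of positive words among all positive and negative words found, with 0.5 used when no opinion words are present.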

In [11]:
from nltk.corpus import opinion_lexicon
from nltk import download

# download lexicons
download('opinion_lexicon')

# set of pos and neg words
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

# function to apply to each song's token list
def lexicon_sentiment_score(tokens):

    # treat missing or malformed input as neutral
    if not isinstance(tokens, list):
        return 0.5

    # count positive and negative opinion words
    pos_count = sum(token in positive_words for token in tokens)
    neg_count = sum(token in negative_words for token in tokens)
    total = pos_count + neg_count

    # sentiment score: fraction of opinion words that are positive (0.5 if none)
    return 0.5 if total == 0 else pos_count / total

cleaned_df["Tokens"] = cleaned_df["Corpus"].str.lower().str.split()  # pretokenize text
cleaned_df["Lyrics_Sentiment"] = cleaned_df["Tokens"].apply(lexicon_sentiment_score)  # apply sentiment analysis
cleaned_df = cleaned_df.drop(columns="Tokens")


[nltk_data] Downloading package opinion_lexicon to
[nltk_data]     /home/devel/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
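
To show how the score behaves without downloading anything, here is a sketch that uses two tiny hand-made word sets in place of the NLTK opinion lexicon. The sets `toy_positive` and `toy_negative` are assumptions for illustration and are far smaller than the real lexicon.

```python
# tiny hypothetical lexicons standing in for the NLTK opinion lexicon
toy_positive = {"love", "sweet", "happy"}
toy_negative = {"sorry", "broken", "cry"}

def toy_sentiment_score(tokens):
    """Fraction of opinion words that are positive; 0.5 when there are none."""
    pos_count = sum(token in toy_positive for token in tokens)
    neg_count = sum(token in toy_negative for token in tokens)
    total = pos_count + neg_count
    return 0.5 if total == 0 else pos_count / total

print(toy_sentiment_score("sweet love so happy".split()))  # only positive words
print(toy_sentiment_score("sorry my love".split()))        # one positive, one negative
print(toy_sentiment_score("la la la".split()))             # no opinion words
```

A score of 0.5 is therefore ambiguous: it can mean either a balanced song or a song with no words from the lexicon at all.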


Further analysis suggested that more features would be helpful, so we apply the same sentiment analysis to the album names and the song titles.

Note: These features ended up not being helpful to the model.

In [12]:
'''
cleaned_df["Tokens"] = cleaned_df["Album"].str.lower().str.split()
cleaned_df["Album_Sentiment"] = cleaned_df["Tokens"].apply(lexicon_sentiment_score)
cleaned_df = cleaned_df.drop(columns="Tokens", axis=1)

cleaned_df["Tokens"] = cleaned_df["Song Title"].str.lower().str.split()
cleaned_df["Song_Sentiment"] = cleaned_df["Tokens"].apply(lexicon_sentiment_score)
cleaned_df = cleaned_df.drop(columns="Tokens", axis=1)
'''

The following code calculates the average number of words per sentence in the lyrics. We create this and the following features to give the model more to work from.

Note: This feature ended up not being helpful to the model.

In [13]:
import re

def average_sentence_length(text):
    # split into sentences
    sentences = re.split(r'[.!?]', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    # count total words
    total_words = sum(len(sentence.split()) for sentence in sentences)
    # avg sentence length
    return total_words / len(sentences) if sentences else 0

#cleaned_df['Avg_Sentence_Length'] = cleaned_df['Lyrics'].apply(average_sentence_length)
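
A quick sketch of how this function behaves on two hypothetical inputs. It repeats the function in compact form so that it runs on its own.

```python
import re

def average_sentence_length(text):
    # split on sentence-ending punctuation and discard empty pieces
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences) if sentences else 0

# two sentences with 2 and 3 words
print(average_sentence_length("Hello world. How are you?"))

# no punctuation at all: the whole text counts as a single sentence
print(average_sentence_length("la la la la la la"))
```

Lyrics that contain no sentence-ending punctuation are treated as one long sentence, so for such songs this feature equals the word count. That may be part of why it added little.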

The average word length may also be a helpful feature. The following code creates this feature.

In [14]:
def average_word_length(lyrics):
    words = lyrics.split()
    total_length = sum(len(word) for word in words)
    return total_length / len(words) if words else 0

cleaned_df['Avg_Word_Length'] = cleaned_df['Lyrics'].apply(average_word_length)
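
A short sketch of the same calculation on two hypothetical strings. It shows that punctuation attached to a word is counted as part of its length, because the text is only split on whitespace.

```python
def average_word_length(lyrics):
    words = lyrics.split()
    return sum(len(word) for word in words) / len(words) if words else 0

# word lengths 2, 2 and 4
print(average_word_length("la la land"))

# "Sorry," has 6 characters because the comma is included
print(average_word_length("Sorry, sorry"))
```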


We will add the Flesch Reading Ease score, computed with `textstat.flesch_reading_ease`, again to give the model more to work with. Higher scores indicate text that is easier to read.

Note: This feature ended up not being helpful to the model.

In [15]:
import textstat

def calculate_readability_score(text):
    return textstat.flesch_reading_ease(text)
    
#cleaned_df['Readability_Score'] = cleaned_df['Lyrics'].apply(calculate_readability_score)
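
For reference, the Flesch Reading Ease formula is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below applies it with a deliberately crude syllable counter that treats each run of vowels as one syllable. That counter and both helper names are assumptions for illustration; `textstat` counts syllables differently, so its scores will not match these exactly.

```python
import re

def naive_syllable_count(word):
    # crude assumption: each run of consecutive vowels is one syllable
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def naive_flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(naive_syllable_count(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

simple = naive_flesch_reading_ease("The cat sat. The dog ran.")
dense = naive_flesch_reading_ease(
    "Incomprehensible administrative regulations proliferate."
)

# short sentences with short words score higher (easier to read)
print(simple > dense)
```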

We will also add the number of verbs, nouns, and adverbs.

Note: This section was added after inspecting feature importance with a Random Forest model, which showed that the only standout feature was Word Counts. We hoped that adding similar features, like the number of verbs, nouns, and adverbs, would be helpful to the model.

Another note: This seemed to have a positive impact; however, we believed that more features of a similar nature could help the model further.

Final note: These features were ultimately not helpful to the model. They have been removed.

In [16]:

#cleaned_df['Verb_Count'] = cleaned_df['Verbs'].apply(lambda x: len(x.split()))
#cleaned_df['Noun_Count'] = cleaned_df['Nouns'].apply(lambda x: len(x.split()))
#cleaned_df['Adverb_Count'] = cleaned_df['Adverbs'].apply(lambda x: len(x.split()))

#cleaned_df['Verb to Noun'] = cleaned_df['Verb_Count'] / cleaned_df['Noun_Count']
#cleaned_df['Noun to Adverb'] = cleaned_df['Noun_Count'] / cleaned_df['Adverb_Count']
#cleaned_df['Adverb to Verb'] = cleaned_df['Adverb_Count'] / cleaned_df['Verb_Count']


## Standardizing Non-Numeric Features

This section standardizes the non-numeric features so the model, again, has more to work with. We use label encoding because it is memory efficient and avoids the extremely high dimensionality that a separate column per unique artist would create. This has its drawbacks, but we want to give the model all of the data it needs to find a pattern.

The features we label encode are Album and Artists. We then standardize the encoded values to have a mean of 0 and a standard deviation of 1.

In [17]:
from sklearn.preprocessing import LabelEncoder, StandardScaler

label_encoder = LabelEncoder()
cleaned_df['Album_Encoded'] = label_encoder.fit_transform(cleaned_df['Album'])
cleaned_df['Artists_Encoded'] = label_encoder.fit_transform(cleaned_df['Artists'])


scaler = StandardScaler()
cleaned_df['Artists_Normalized'] = scaler.fit_transform(cleaned_df[['Artists_Encoded']])
cleaned_df['Album_Normalized'] = scaler.fit_transform(cleaned_df[['Album_Encoded']])
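
The sketch below reproduces both steps in plain Python on four hypothetical album names. `LabelEncoder` assigns integer codes in sorted label order, and `StandardScaler` subtracts the mean and divides by the population standard deviation. The album names and variable names are illustrative.

```python
import statistics

# hypothetical album column
albums = ["Thriller", "Abbey Road", "Thriller", "Nevermind"]

# label encoding: each unique label gets its index in sorted order
classes = sorted(set(albums))
codes = [classes.index(album) for album in albums]
print(codes)

# standardization: subtract the mean, divide by the population std
mean = statistics.fmean(codes)
std = statistics.pstdev(codes)
standardized = [(code - mean) / std for code in codes]
print([round(value, 3) for value in standardized])
```

The standardized values are centered on 0 and can be negative, so they are not confined to the range 0 to 1. The codes also impose an alphabetical ordering on albums and artists that carries no musical meaning, which is one of the drawbacks of label encoding.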

## Standardizing Numeric Features

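Rank, where lower values are better, needs to be put on a scale comparable to the other features. The disabled code below uses StandardScaler, which rescales to a mean of 0 and a standard deviation of 1 rather than to the range 0 to 1. Note: This feature ended up not being helpful to the model.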

In [18]:
scaler = StandardScaler()
#cleaned_df['Rank'] = scaler.fit_transform(cleaned_df[['Rank']])

We will continue to standardize features, now focusing on numerical features. We will standardize these to have a mean of 0, and a standard deviation of 1 for best results. Thus, we will use StandardScaler.

In [19]:
# original features
'''
numerical_features = ['Word Counts', 'Unique Word Counts', 'Repetition_Ratio', 
                      'Lyrics_Sentiment', 'Song_Sentiment', 'Album_Sentiment',
                      'Avg_Sentence_Length', 'Avg_Word_Length', 'Readability_Score', 
                      'Verb_Count', 'Adverb_Count', 'Noun_Count',
                      'Verb to Noun', 'Noun to Adverb', 'Adverb to Verb'
                     ]
'''
# trimmed to the features that proved important
numerical_features = ['Word Counts', 'Lyrics_Sentiment', 'Avg_Word_Length', 'Artists_Normalized', 'Album_Normalized']

scaler = StandardScaler()
cleaned_df[numerical_features] = scaler.fit_transform(cleaned_df[numerical_features])

## Final Dropping of Features

We will now drop the source columns from which we derived the numerical and standardized features. We no longer need them, and the model cannot use them in their raw form.

In [20]:
columns_to_drop = ["Album", "Lyrics", "Rank", "Song Title", 
                   "Verbs", "Nouns", "Adverbs", "Corpus", 
                   "Artists", "Album_Encoded", "Artists_Encoded"]
cleaned_df = cleaned_df.drop(columns=columns_to_drop)

## Saving the Data

Finally we save the cleaned data to its new location.

In [21]:
cleaned_data_folder = "../data/cleaned/"
cleaned_file = cleaned_data_folder + "all_songs_data_cleaned.csv"
os.makedirs(cleaned_data_folder, exist_ok=True)
cleaned_df.to_csv(cleaned_file, index=False)