# Cleaning Data

This notebook describes how we clean the data: dropping unwanted features, cleaning up values, removing rows that do not apply, and saving the result.

## Dropping Unimportant Features

This section describes which features from the original data we remove and why.

In [1]:
## importing libraries

import os
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
## loading in data

data_folder = "../data/raw/"
file = data_folder + "all_songs_data_processed.csv"

# concat csvs into one dataframe (currently a single file)
df_list = [pd.read_csv(file)]

raw_df = pd.concat(df_list, ignore_index=True)
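
The loading cell wraps a single file in a list before calling `pd.concat`, so the same pattern would extend to several files. A minimal sketch of that extension follows; the `load_all_csvs` helper and the `*.csv` pattern are assumptions for illustration and are not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_all_csvs(folder, pattern="*.csv"):
    """Read every CSV matching `pattern` in `folder` into one dataframe (hypothetical helper)."""
    # sort the paths so the row order is reproducible between runs
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern} in {folder}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True rebuilds a clean 0..n-1 index across all files
    return pd.concat(frames, ignore_index=True)
```

With a helper like this, the loading step would reduce to `raw_df = load_all_csvs(data_folder)`, assuming every CSV in the folder shares the same columns.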

We start by removing features whose form makes them unusable for analysis, such as "Media" and "Writers". These columns hold URLs or other embedded information rather than values we can analyze, and the same goes for the album and song URL columns.

In [3]:
# copying soon-to-be-cleaned data
cleaned_df = raw_df.copy()

# dropping writers and media columns
cleaned_df = cleaned_df.drop(columns=["Media", "Writers"])

# also drop album and song urls
cleaned_df = cleaned_df.drop(columns=["Album URL", "Song URL"])

We then drop features that are simply redundant, such as "Release Date": all we care about is predicting the year, and the "Year" column already holds it.

In [4]:
cleaned_df = cleaned_df.drop(columns=["Release Date"])
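
The three drop calls above could also be written as a single call. The sketch below shows this on a toy frame; the placeholder values and the "Song Title" column are hypothetical, and only the dropped column names come from the cells above.

```python
import pandas as pd

# toy frame with placeholder values; "Song Title" is a hypothetical column
toy = pd.DataFrame({
    "Song Title": ["Example Song"],
    "Year": [1965.0],
    "Media": ["placeholder"],
    "Writers": ["placeholder"],
    "Album URL": ["placeholder"],
    "Song URL": ["placeholder"],
    "Release Date": ["1965-01-01"],
})

unwanted = ["Media", "Writers", "Album URL", "Song URL", "Release Date"]

# errors="ignore" skips any listed column that is absent instead of raising
trimmed = toy.drop(columns=unwanted, errors="ignore")
print(list(trimmed.columns))
```

Note that `errors="ignore"` also hides typos in column names, so the default behavior of raising may be preferable when the columns are known to exist.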

## Cleaning Data Values

This section will describe our process for cleaning up values in the data.

The first values we clean are those in the "Year" column. They are currently stored as floats, but years are whole numbers, so we cast them to ints.

In [5]:
cleaned_df["Year"] = cleaned_df["Year"].astype(int)
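
One caveat with this cast: `astype(int)` raises an error if the column contains missing values. Whether the real "Year" column has any is not shown here, so the following is only a sketch on toy data of how pandas' nullable `Int64` dtype would handle that case.

```python
import pandas as pd

# toy data; the real "Year" column is assumed to hold whole-number floats
toy = pd.DataFrame({"Year": [1965.0, 1999.0, None]})

# check for missing values before casting, since astype(int) fails on NaN
has_missing = toy["Year"].isna().any()

# the nullable "Int64" dtype keeps missing values as <NA> instead of failing
toy["Year"] = toy["Year"].astype("Int64")
```

If `has_missing` is false for the real data, the plain `astype(int)` used above is enough.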

Next we clean the "Featured Artists" column. Each entry is currently a stringified list of metadata dicts, when all we want is the artists' names. We combine those names with the main artist into a new "Artists" column.

In [6]:
import ast

combined_artists_list = []

for i, row in cleaned_df.iterrows():
    artist = row["Artist"]
    featured_artists = row["Featured Artists"]

    # ensure featured artists is a string of a list of dicts
    if isinstance(featured_artists, str) and featured_artists != "[]":
        # convert if needed
        featured_artists = ast.literal_eval(featured_artists)

    if isinstance(featured_artists, list) and featured_artists:
        #print(f"Artist: {artist}")

        # get all names
        featured_artists_names = [fa['name'] for fa in featured_artists if isinstance(fa, dict) and 'name' in fa]
        
        #print(f"Featured Artists Names: {featured_artists_names}")

        # combine names if they dont match
        combined_artists = [artist]
        for fa_name in featured_artists_names:
            if fa_name.lower() not in artist.lower():
                combined_artists.append(fa_name)

        #print(f"Combined Artists List: {combined_artists}")
        #print()

        # store the main artist plus any featured artists as one string
        combined_artists_list.append(", ".join(combined_artists))
    else:
        # no featured artists: keep the main artist only
        combined_artists_list.append(artist)
        
cleaned_df["Artists"] = combined_artists_list
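
The per-row logic in the loop can be pulled out into a small function, which makes it easy to try on individual values. This is a sketch of the same steps; the `combine_artists` name and the example artist names are made up for illustration.

```python
import ast


def combine_artists(artist, featured_artists):
    """Return the main artist plus any featured names not already in it."""
    # the raw value is assumed to be a stringified list of dicts,
    # e.g. "[{'name': 'Artist B'}]"
    if isinstance(featured_artists, str) and featured_artists != "[]":
        featured_artists = ast.literal_eval(featured_artists)

    # missing values, empty lists, and "[]" all fall back to the main artist
    if not (isinstance(featured_artists, list) and featured_artists):
        return artist

    names = [fa["name"] for fa in featured_artists
             if isinstance(fa, dict) and "name" in fa]
    combined = [artist]
    for name in names:
        # skip names already contained in the main artist string
        if name.lower() not in artist.lower():
            combined.append(name)
    return ", ".join(combined)


print(combine_artists("Artist A", "[{'name': 'Artist B'}]"))             # Artist A, Artist B
print(combine_artists("Artist A & Artist B", "[{'name': 'Artist B'}]"))  # Artist A & Artist B
print(combine_artists("Artist A", "[]"))                                 # Artist A
```

The containment check is a plain substring match, so a short featured name that happens to appear inside a longer main artist name would also be skipped.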

Now that "Artists" is a column, the "Artist" and "Featured Artists" columns are no longer needed, so we drop them.

In [7]:
cleaned_df = cleaned_df.drop(columns=["Artist", "Featured Artists"])

## Removing Non-Applicable Data Values

Now we remove any instrumental songs. Their "lyrics" are represented unpredictably and we cannot run sentiment analysis on them, so they must be removed. There is no definitive marker for an instrumental, but from looking at the data, dropping any row with a missing value in the "Lyrics", "Verbs", "Nouns", "Adverbs", or "Corpus" column will do.

In [8]:
cleaned_df = cleaned_df.dropna(subset=["Lyrics", "Verbs", "Nouns", "Adverbs", "Corpus"], how='any')
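
Keep in mind that `dropna` only removes true missing values, not empty or whitespace-only strings. Whether the real data contains such strings is not shown here, so the following is a sketch on toy data (with a shortened column list) of how the two cases differ.

```python
import pandas as pd

text_cols = ["Lyrics", "Verbs"]  # shortened column list for the toy example

toy = pd.DataFrame({
    "Lyrics": ["la la la", None, "   "],
    "Verbs": ["sing", "run", "walk"],
})

# dropna removes the row with a real missing value; the whitespace-only row survives
only_nan_dropped = toy.dropna(subset=text_cols, how="any")

# treat missing and blank strings alike, then keep rows with no empty text column
is_empty = toy[text_cols].apply(lambda col: col.fillna("").str.strip().eq(""))
fully_cleaned = toy[~is_empty.any(axis=1)]
```

Here `only_nan_dropped` keeps two rows, while `fully_cleaned` keeps only the first.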

As the output below shows, the songs with the lowest word counts all contain actual words rather than symbols or placeholders that would signify an instrumental.

In [17]:
print(cleaned_df.nsmallest(20, 'Word Counts')[['Lyrics', 'Word Counts']])

                                                 Lyrics  Word Counts
5952  I love it when you call me A-nita cause it's m...           19
1504  Spoken: Grand piano Reed and pipe organ Glocke...           26
1849  Guess mine is not the first heart broken My ey...           28
23    Sorry, sorry, oh so sorry SPOKEN: Uh-oh! **I r...           33
1686  Baby face, youve got the cutest little baby fa...           55
340   (Roy Orbison)  Sweet dream baby Sweet dream ba...           62
845   Here he comes now I've got to tell him somehow...           62
1318  DAY BY DAY GODSPELL Day by day (solo voice) Da...           65
1007  This brand new album is called Hawaii Five-O P...           66
2492   The dark side's callin' now Nothin' is real S...           66
1619  Doo, doo, doot, doo, doo, doot Doo, doo, doo, ...           70
750   Aww, I'm itchy,and I don't know where to scrat...           71
1036   From Atlanta, Georgia, to the Gulf Stream wat...           72
4474   I know there's something in

## Saving the Data

Finally, we save the cleaned data to its new location.

In [9]:
cleaned_data_folder = "../data/cleaned/"
cleaned_file = cleaned_data_folder + "all_songs_data_cleaned.csv"
os.makedirs(cleaned_data_folder, exist_ok=True)
cleaned_df.to_csv(cleaned_file, index=False)
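
A quick way to confirm the save worked is to read the file back and compare it with the dataframe in memory. The sketch below does this on a toy frame in a temporary directory, so it does not touch the real data folders; the file name and values are hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Year": [1965, 1999],
    "Artists": ["Artist A", "Artist A, Artist B"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "cleaned")
    os.makedirs(out_dir, exist_ok=True)
    out_file = os.path.join(out_dir, "toy_cleaned.csv")

    # index=False keeps the row index out of the file, as in the cell above
    toy.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# same shape and same column order after the round trip
round_trip_ok = (reloaded.shape == toy.shape
                 and list(reloaded.columns) == list(toy.columns))
```

The "Artists" values contain commas, which `to_csv` quotes automatically, so they survive the round trip as single fields.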