# Converting the Spotify Dataset
This Jupyter Notebook converts the dataset in `dataset/songs_normalize.csv` into an updated version of the dataset. The main updates are listed below.

- Append a new column for each genre found in the original `genre` column. The value of each column is True if the song belongs to that genre and False otherwise.
- Append a new column called `hasFeature` indicating whether the song title credits a featured artist. A song is classified as having a "feature" if its title contains one of the following keywords: `'feature', 'feat', 'ft', 'featuring', '(with', 'vs', 'vs.'`.

The updated dataset is stored in `dataset/songs_updated.csv`.

In [1]:
import pandas as pd

In [2]:
# Load dataset
song_df = pd.read_csv('../dataset/songs_normalize.csv')


## Adding the `isGenre` columns
To create a new column for each genre, we must first find every genre that appears in the dataset. The original `genre` column contains a comma-separated list of genres for each song, so we split each value and collect the individual genres in a `set`.

Then we create a dictionary whose keys are the genre names and whose values are lists of booleans, one per song, indicating whether that song belongs to the genre.

There are a few caveats to note:
- A few data points have no genre; their value is the literal string `set()`. We do not create a column for this value.
- We prefix the new column names with `is`. As an example, `ispop` is one of the columns.
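
To see how many songs are affected by the first caveat, the placeholder rows can be counted directly. This is a minimal sketch on a hypothetical stand-in for `song_df['genre']`; the values in `genre_col` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for song_df['genre']; the real column comes from the CSV
genre_col = pd.Series(['pop', 'rock, pop', 'set()', 'hip hop, pop, R&B'])

# Rows whose genre string is the literal placeholder 'set()'
no_genre_mask = genre_col.str.strip() == 'set()'
n_missing = int(no_genre_mask.sum())

print(f'{n_missing} song(s) without a genre')
```

Because no `isset()` column is created, such songs simply end up with False in every genre column.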

In [3]:
# Get list of all genres
genre_set = set()
for genres in song_df['genre'].values:
    genre_list = genres.split(',')
    for genre in genre_list:
        genre_set.add(genre.strip())
print(genre_set)

{'easy listening', 'rock', 'jazz', 'classical', 'latin', 'country', 'Folk/Acoustic', 'blues', 'Dance/Electronic', 'set()', 'metal', 'World/Traditional', 'R&B', 'hip hop', 'pop'}


In [4]:
# Create a copy of the set to add a prefix 'is' to columns
# but make sure to ignore set() as this means there is no
# genre associated with it
is_genre_set = set()
for genre in genre_set:
    if genre == 'set()':
        continue
    is_genre_set.add('is' + genre)

In [5]:
# Populate a dictionary with the prefixed genre names as keys and a list of
# True/False values (one per song) as values, depending on whether the song
# is of that genre or not. Each song's genre string is split on commas so
# that genres are compared exactly rather than by substring.
# Make sure to ignore 'set()'
genre_dict = dict.fromkeys(is_genre_set)
for genre in genre_set:
    if genre == 'set()':
        continue
    genre_dict['is' + genre] = [
        genre in [g.strip() for g in genres.split(',')]
        for genres in song_df['genre'].values
    ]

In [6]:
# Check output of genre_dict and convert to a data frame
for genre, is_genre in genre_dict.items():
    print(f'{genre}: {is_genre[:5]}')

genres_df = pd.DataFrame(genre_dict)
genres_df.head()

isclassical: [False, False, False, False, False]
isjazz: [False, False, False, False, False]
isDance/Electronic: [False, False, False, False, False]
isR&B: [False, False, False, False, False]
isblues: [False, False, False, False, False]
isrock: [False, True, False, True, False]
islatin: [False, False, False, False, False]
iscountry: [False, False, True, False, False]
isFolk/Acoustic: [False, False, False, False, False]
iship hop: [False, False, False, False, False]
isWorld/Traditional: [False, False, False, False, False]
iseasy listening: [False, False, False, False, False]
ismetal: [False, False, False, True, False]
ispop: [True, True, True, False, True]


|  | isclassical | isjazz | isDance/Electronic | isR&B | isblues | isrock | islatin | iscountry | isFolk/Acoustic | iship hop | isWorld/Traditional | iseasy listening | ismetal | ispop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | False | False | False | False | False | False | False | False | False | False | False | False | False | True |
| 1 | False | False | False | False | False | True | False | False | False | False | False | False | False | True |
| 2 | False | False | False | False | False | False | False | True | False | False | False | False | False | True |
| 3 | False | False | False | False | False | True | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False | False | False | True |
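
As an alternative to building the dictionary by hand, pandas can produce the same kind of indicator columns with `Series.str.get_dummies`. The sketch below is not the approach used in this notebook; it runs on a hypothetical `toy_df` with made-up genre strings, and `genres_alt_df` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for song_df; only the 'genre' column matters here
toy_df = pd.DataFrame({'genre': ['pop', 'rock, pop', 'set()', 'hip hop, pop, R&B']})

# Remove whitespace around commas so get_dummies splits into clean tokens
normalized = toy_df['genre'].str.replace(r'\s*,\s*', ',', regex=True).str.strip()

genres_alt_df = (
    normalized.str.get_dummies(sep=',')       # one 0/1 column per distinct genre
    .drop(columns='set()', errors='ignore')   # drop the "no genre" placeholder
    .astype(bool)                             # 0/1 -> False/True
    .add_prefix('is')                         # e.g. 'pop' -> 'ispop'
)

print(genres_alt_df)
```

Column names such as `iship hop` contain spaces (and others contain slashes), so they need bracket access, for example `genres_alt_df['iship hop']`, rather than attribute access.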


In [7]:
# Append the data frame to the existing one
song_updated_df = song_df.copy()
song_updated_df = pd.concat([song_updated_df, genres_df], axis=1)
song_updated_df.head()


|  | artist | song | duration_ms | explicit | year | popularity | danceability | energy | key | loudness | ... | isblues | isrock | islatin | iscountry | isFolk/Acoustic | iship hop | isWorld/Traditional | iseasy listening | ismetal | ispop |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Britney Spears | Oops!...I Did It Again | 211160 | False | 2000 | 77 | 0.751 | 0.834 | 1 | -5.444 | ... | False | False | False | False | False | False | False | False | False | True |
| 1 | blink-182 | All The Small Things | 167066 | False | 1999 | 79 | 0.434 | 0.897 | 0 | -4.918 | ... | False | True | False | False | False | False | False | False | False | True |
| 2 | Faith Hill | Breathe | 250546 | False | 1999 | 66 | 0.529 | 0.496 | 7 | -9.007 | ... | False | False | False | True | False | False | False | False | False | True |
| 3 | Bon Jovi | It's My Life | 224493 | False | 2000 | 78 | 0.551 | 0.913 | 0 | -4.063 | ... | False | True | False | False | False | False | False | False | True | False |
| 4 | *NSYNC | Bye Bye Bye | 200560 | False | 2000 | 65 | 0.614 | 0.928 | 8 | -4.806 | ... | False | False | False | False | False | False | False | False | False | True |
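
`pd.concat` with `axis=1` lines rows up by index label, not by position. In this notebook both frames carry the default 0..n-1 index, so the rows match one-to-one. The sketch below illustrates what would happen otherwise; `left`, `right`, and `shifted` are hypothetical frames made up for the example.

```python
import pandas as pd

left = pd.DataFrame({'song': ['a', 'b', 'c']})        # default index 0, 1, 2
right = pd.DataFrame({'ispop': [True, False, True]})  # default index 0, 1, 2

# Same index on both sides: rows line up one-to-one
aligned = pd.concat([left, right], axis=1)

# A shifted index (hypothetical) no longer lines up; concat keeps the union
# of the index labels and fills the gaps with NaN
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)

print(aligned.shape)
print(misaligned.shape)
```

If a frame has been filtered or re-sorted before concatenation, calling `reset_index(drop=True)` on it first restores the positional alignment.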


## Adding the `hasFeature` column
Next, we search each song title for a list of keywords to determine whether the song has a feature. The result is then appended to the dataset as a new column.

In [8]:
# Add a hasFeature column
# A bare 'with' would match every title that contains the word 'with',
# so use '(with' instead, which seems to be more accurate.
# Note: this is a case-sensitive substring match, so short keywords such
# as 'ft' can also match inside other words.
feature_keywords = {'feature', 'feat', 'ft', 'featuring', '(with', 'vs', 'vs.'}

has_feature = []
for song_title in song_df['song']:
    has_feature.append(any(keyword in song_title for keyword in feature_keywords))

In [9]:
# Append new data to existing data frame
song_updated_df['hasFeature'] = has_feature
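
The keyword check above is a case-sensitive substring match, so a short keyword such as `ft` also matches inside ordinary words. A stricter variant, sketched here as an alternative and not as the method used to build `songs_updated.csv`, matches the keywords only as whole words with a regular expression. The titles in `toy_titles` and the name `feature_pattern` are made up for illustration.

```python
import re

# Hypothetical titles, made up for illustration
toy_titles = [
    'Left Outside Alone',
    'Example Song (feat. Someone)',
    'Another Track (with Someone)',
    'Stay With Me',
    'A vs. B',
]

# Substring check, as in the cell above
feature_keywords = {'feature', 'feat', 'ft', 'featuring', '(with', 'vs', 'vs.'}
substring_flags = [any(k in t for k in feature_keywords) for t in toy_titles]

# Whole-word, case-insensitive check; '(with' keeps its opening parenthesis
feature_pattern = re.compile(
    r'\b(?:featuring|feature|feat|ft|vs)\b|\(with\b',
    re.IGNORECASE,
)
regex_flags = [bool(feature_pattern.search(t)) for t in toy_titles]

for title, sub, rgx in zip(toy_titles, substring_flags, regex_flags):
    print(f'{title!r}: substring={sub}, regex={rgx}')
```

The two checks disagree only on the first title, where `ft` occurs inside the word "Left". Switching to the stricter check would change the resulting `hasFeature` values, so it is worth comparing both on the real titles before adopting it.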

## Saving the new dataset
The dataset now contains both updates. We can save it under a new name, `songs_updated.csv`.

In [10]:
# Write updated data frame to CSV
song_updated_df.to_csv('../dataset/songs_updated.csv', index=False)
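
A quick way to confirm the export is to read the file back and inspect it. The sketch below does this on a hypothetical `toy_updated_df` written to a temporary directory, so it does not touch the real `dataset` folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for song_updated_df
toy_updated_df = pd.DataFrame({
    'song': ['Track A', 'Track B'],
    'ispop': [True, False],
    'hasFeature': [False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'songs_updated.csv')
    # index=False keeps the row index out of the file, which would otherwise
    # come back as an extra unnamed column when the CSV is read again
    toy_updated_df.to_csv(out_path, index=False)
    reloaded_df = pd.read_csv(out_path)

print(list(reloaded_df.columns))
print(reloaded_df.dtypes)
```

Columns that contain only True and False are read back as booleans, so the genre and `hasFeature` columns keep their type across the round trip.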