# Extracting Features From the Original Dataset

**Disclaimer: this notebook is based on the notebook that can be found [here](https://github.com/enjuichang/PracticalDataScience-ENCA/tree/main).**

This notebook extracts features from the original dataset, which contains only limited information about each song. We use the `ari.py` script to retrieve a set of features for each song, together with the popularity of both the artist and the song, and the associated genres.

In [1]:
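# feature-extraction helper defined in scripts/ari.py
from scripts.ari import ari_to_features
import pandas as pd  # DataFrames for storing and manipulating the data
from tqdm import tqdm  # progress bars for the long-running extraction loops
import re  # regular expressions, used to shorten the track URIs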

Load the raw data into a DataFrame. It is a subset of the [Million Playlist Dataset](https://www.aicrowd.com/challenges/spotify-million-playlist-dataset-challenge).

In [2]:
# load the raw data from the repo
dataPath = 'data/raw_data_train.csv'
df = pd.read_csv(dataPath)
df.head()  # show the first 5 entries

| | Unnamed: 0 | pos | artist_name | track_uri | artist_uri | track_name | album_uri | duration_ms | album_name | name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | Missy Elliott | spotify:track:0UaMYEvWZi0ZqiDOoHU3YI | spotify:artist:2wIVse2owClT7go1WT98tk | Lose Control (feat. Ciara & Fat Man Scoop) | spotify:album:6vV5UrXcfyQD1wu4Qo2I9K | 226863 | The Cookbook | Throwbacks |
| 1 | 1 | 1 | Britney Spears | spotify:track:6I9VzXrHxO9rA9A5euc8Ak | spotify:artist:26dSoYclwsYLMAKD3tpOr4 | Toxic | spotify:album:0z7pVBGOD7HCIB7S8eLkLI | 198800 | In The Zone | Throwbacks |
| 2 | 2 | 2 | Beyoncé | spotify:track:0WqIKmW4BTrj3eJFmnCKMv | spotify:artist:6vWDO969PvNqNYHIOW5v0m | Crazy In Love | spotify:album:25hVFAxTlDvXbx2X2QkUkE | 235933 | Dangerously In Love (Alben für die Ewigkeit) | Throwbacks |
| 3 | 3 | 3 | Justin Timberlake | spotify:track:1AWQoqb9bSvzTjaLralEkT | spotify:artist:31TPClRtHm23RisEBtV3X7 | Rock Your Body | spotify:album:6QPkyl04rXwTGlGlcYaRoW | 267266 | Justified | Throwbacks |
| 4 | 4 | 4 | Shaggy | spotify:track:1lzr43nnXAijIGYnCT8M8H | spotify:artist:5EvFsr3kj42KNv97ZEnqij | It Wasn't Me | spotify:album:6NmFmPX56pcLBOFMhIiKvF | 227600 | Hot Shot | Throwbacks |


The columns of the DataFrame show which information is stored for each song.

In [3]:
df.columns

Index(['Unnamed: 0', 'pos', 'artist_name', 'track_uri', 'artist_uri',
       'track_name', 'album_uri', 'duration_ms', 'album_name', 'name'],
      dtype='object')

Each column adds information to a track:

- Unnamed: 0: the index over all tracks in the dataset. The column has this unusual name because its header was empty in the .csv file, and pandas assigns such columns the name `Unnamed: <position>`.
- pos: the index of the track in the playlist it belongs to
- artist_name: the name of the artist of the track
- track_uri: unique identifier of the track ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- artist_uri: unique identifier of the artist ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- track_name: name of the track
- album_uri: unique identifier of the album ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- duration_ms: the duration of the track in milliseconds
- album_name: the name of the album the track was published with
- name: the name of the playlist the track belongs to
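
How an `Unnamed: 0` column comes about can be reproduced on a small example. The sketch below uses a made-up two-row frame (not the playlist data) and an in-memory buffer instead of a file. It shows that `to_csv` writes the index as a first column with an empty header by default, and that `index_col=0` is one way to read it back as the index.

```python
import io

import pandas as pd

# Toy frame standing in for the playlist data (values are illustrative only).
toy = pd.DataFrame({"pos": [0, 1], "track_name": ["Toxic", "Crazy In Love"]})

# By default, to_csv writes the index as a first column with an empty header.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Reading the text back without further options produces "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

Whether the raw data should be read with `index_col=0` depends on whether later steps refer to the `Unnamed: 0` column, so this is shown only as an illustration.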

In [4]:
# convert the track URIs to a more usable format:
# strip the "spotify:track:" prefix so that only the track ID remains (overwrites the column)
df["track_uri"] = df["track_uri"].apply(lambda x: re.findall(r'\w+$', x)[0])
df["track_uri"]

0        0UaMYEvWZi0ZqiDOoHU3YI
1        6I9VzXrHxO9rA9A5euc8Ak
2        0WqIKmW4BTrj3eJFmnCKMv
3        1AWQoqb9bSvzTjaLralEkT
4        1lzr43nnXAijIGYnCT8M8H
                  ...          
60771    2nJSRl7z2N66NJFBhIalCy
60772    1P3MmQNUtdleRmaAElZFA1
60773    2e6Tbu1hiYjUR7kgMIZpkO
60774    3wF0rpX1njF2FLFGc45rxV
60775    2JK1AI0SfgdAnmhieLt43Z
Name: track_uri, Length: 60776, dtype: object
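
The same conversion can also be written with pandas string methods. The sketch below is an alternative, not the notebook's own approach. It relies on Spotify URIs having the form `spotify:<type>:<id>` and checks on two URIs from the table above that both variants agree.

```python
import re

import pandas as pd

# Two URIs taken from the first rows of the dataset.
uris = pd.Series([
    "spotify:track:0UaMYEvWZi0ZqiDOoHU3YI",
    "spotify:track:6I9VzXrHxO9rA9A5euc8Ak",
])

# Variant used in the notebook: take the trailing run of word characters.
ids_regex = uris.apply(lambda x: re.findall(r"\w+$", x)[0])

# Alternative: split on ":" and keep the last field.
ids_split = uris.str.split(":").str[-1]

print(ids_split.tolist())
print(ids_regex.equals(ids_split))
```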

As a next step, we use the [ari.py](scripts/ari.py) script to extract the features from a song using its URI.

In [5]:
testDF = df
# ari_to_features is defined in scripts/ari.py and requires Spotify API credentials in secret.txt
feature = ari_to_features(df["track_uri"])
feature_df_test = pd.DataFrame(feature)
feature_df_test.head()

SpotifyOauthError: error: invalid_client, error_description: Invalid client
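
An `invalid_client` error generally means that Spotify rejected the client ID or client secret, for example because of a typo or a truncated value. The sketch below is a hypothetical pre-flight check and is not part of `ari.py`. It assumes that `secret.txt` holds the client ID on its first non-empty line and the client secret on the second, and that both are 32-character hexadecimal strings, which is the usual shape of Spotify credentials. The names `validate_credentials` and `load_credentials` are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: Spotify client IDs and secrets are 32 lowercase hex characters.
HEX32 = re.compile(r"[0-9a-f]{32}")


def validate_credentials(client_id, client_secret):
    """Return a list of problems found; an empty list means the shape looks fine."""
    problems = []
    for label, value in (("client ID", client_id), ("client secret", client_secret)):
        if not HEX32.fullmatch(value):
            problems.append(f"{label} has {len(value)} characters or non-hex content")
    return problems


def load_credentials(path):
    """Read ID and secret from the first two non-empty lines (assumed file layout)."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected two non-empty lines: client ID, then client secret")
    problems = validate_credentials(lines[0], lines[1])
    if problems:
        raise ValueError("; ".join(problems))
    return lines[0], lines[1]


# Placeholder values only; real credentials should never be printed or committed.
print(validate_credentials("a" * 32, "b" * 31))
```

A check like this only catches malformed values. Credentials that are well-formed but revoked or mistyped will still be rejected by the API.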

## Included Features

The code cell below gives an example of the features extracted from each track, showing the kind of information that is used to cluster the data further on.

In [None]:
# Test the feature extraction script and display the features of the first track
ari_to_features(df["track_uri"][0])

## Extraction

We now extract the features of every track through the Spotify API, using the track's URI. Because this process has an extremely long runtime, it is split into three parts. The results are collected in a DataFrame.

In [None]:
# compute the unique URIs once, then split them into three parts
unique_uris = df["track_uri"].unique()
first_part = unique_uris[:10000]
second_part = unique_uris[10000:20000]
third_part = unique_uris[20000:]
dataLIST = [first_part, second_part, third_part]
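
The fixed slice boundaries above can be generalised to any chunk size. The following sketch uses a hypothetical helper, `split_into_chunks`, and made-up URIs to show the idea. It is not taken from the original notebook.

```python
import numpy as np


def split_into_chunks(values, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [values[start:start + chunk_size]
            for start in range(0, len(values), chunk_size)]


# Made-up URIs standing in for df["track_uri"].unique().
toy_uris = np.array([f"track_{n}" for n in range(25)])
chunks = split_into_chunks(toy_uris, 10)

print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

With a helper like this, every chunk can be processed in its own cell without writing out each boundary by hand.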

In [None]:
featureLIST = []

for i in tqdm(dataLIST[0]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # skip tracks whose features cannot be retrieved
        continue


In [None]:
for i in tqdm(dataLIST[1]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # skip tracks whose features cannot be retrieved
        continue

In [None]:
for i in tqdm(dataLIST[2]):
    try:
        featureLIST.append(ari_to_features(i))
    except Exception:  # skip tracks whose features cannot be retrieved
        continue
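
The loops above skip any track for which the extraction fails, so the failures leave no trace. One possible variation, sketched here under assumptions, records the failing URIs and their errors so that they can be inspected or retried later. `extract_with_log` and `fake_fetch` are hypothetical names. `fake_fetch` merely stands in for `ari_to_features`, so the example runs without API access.

```python
def extract_with_log(uris, fetch):
    """Apply fetch to each URI, collecting results and (uri, error) pairs separately."""
    features, failed = [], []
    for uri in uris:
        try:
            features.append(fetch(uri))
        except Exception as error:  # keep going, but remember what went wrong
            failed.append((uri, repr(error)))
    return features, failed


def fake_fetch(uri):
    """Stand-in for the real extraction; fails for one made-up URI."""
    if uri == "bad":
        raise ValueError("no features available")
    return {"id": uri, "tempo": 120.0}


features, failed = extract_with_log(["a", "bad", "b"], fake_fetch)
print(len(features), failed)
```

Comparing the number of collected features with the number of unique URIs then shows how many tracks were lost.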

In [None]:
# Build the feature DataFrame and preview it
featureDF = pd.DataFrame(featureLIST)
featureDF

## Finalising and Export

Finally, we merge the feature DataFrame with the original dataset, which also contains useful information such as the artist name and track name. The result is exported as our processed data.

In [None]:
new_df = pd.merge(testDF, featureDF, left_on="track_uri", right_on="id")

In [None]:
new_df.to_csv('../data/processed_data.csv')
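
`pd.merge` performs an inner join by default, so playlist rows whose track has no extracted features drop out of the result. The sketch below uses made-up toy frames, not the real data, to illustrate this. It also shows one way to list the affected tracks, using a left join with `indicator=True` and `validate="many_to_one"`.

```python
import pandas as pd

# Toy playlist rows: track "a" appears in two playlists, "c" has no features.
tracks = pd.DataFrame({
    "track_uri": ["a", "b", "a", "c"],
    "name": ["Throwbacks", "Throwbacks", "Party", "Party"],
})
# Toy feature table with one row per track ID.
features = pd.DataFrame({"id": ["a", "b"], "tempo": [120.0, 95.0]})

# Default inner join: the row for track "c" is dropped.
inner = pd.merge(tracks, features, left_on="track_uri", right_on="id")

# Left join that keeps every playlist row and flags rows without features.
# validate raises an error if a track ID occurs more than once in features.
checked = pd.merge(tracks, features, left_on="track_uri", right_on="id",
                   how="left", validate="many_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "track_uri"].tolist()

print(len(tracks), len(inner), missing)
```

Whether dropped rows matter depends on how the processed data is used later on. The check only makes the loss visible.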