## Data Collection 

### Obtaining Data from Spotify:
To analyze your Spotify listening history, you can request your data directly from Spotify. Spotify provides users with the option to download their personal data, including detailed listening history and user interactions.

For more information, see Spotify's support page:

[Get your data](https://support.spotify.com/us/article/data-rights-and-privacy-settings/)



**Data Contents**:

-   The JSON file includes detailed information such as:
    -   `ts`: Timestamp (UTC) indicating when the track stopped playing.
    -   `username`: Your Spotify username.
    -   `platform`: Platform used (e.g., Linux, Windows, mobile).
    -   `ms_played`: Duration the track was played (in milliseconds).
    -   `master_metadata_track_name`: The name of the track.
    -   `master_metadata_album_artist_name`: The artist of the track.
    -   `spotify_track_uri`: Unique Spotify URI for the track.
    -   `skipped`: Boolean indicating whether the track was skipped.
    -   Other metadata such as `reason_start`, `reason_end`, `shuffle`, and `offline`, among others.

This JSON data serves as the foundation for building a model of your music listening behavior. By exploring attributes such as `ms_played` and `skipped`, you can estimate which songs you liked or disliked and train a model to learn your preferences.
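
To make the field list concrete, here is a minimal sketch that reads a single record using only the standard library. The values are copied from the first row of the sample output in this notebook, except the track URI, which is the illustrative one from the code comment further down and is not necessarily this track. The helper name `summarize_play` is hypothetical.

```python
import json
from datetime import datetime, timezone

# One streaming-history record. Field names follow the list above; the URI is
# an illustrative placeholder, not necessarily the URI of this track.
raw = """
{
  "ts": "2023-01-16T20:04:01Z",
  "platform": "linux",
  "ms_played": 262960,
  "master_metadata_track_name": "Last Christmas",
  "master_metadata_album_artist_name": "Wham!",
  "spotify_track_uri": "spotify:track:3ZFTkvIE7kyPt6Nu3PEa7V",
  "reason_start": "trackdone",
  "reason_end": "trackdone",
  "shuffle": true,
  "skipped": false,
  "offline": false
}
"""

def summarize_play(record):
    """Return (UTC datetime, 'Artist - Track', seconds played) for one record."""
    # `ts` is an ISO 8601 timestamp; the trailing "Z" marks UTC.
    played_at = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    title = f'{record["master_metadata_album_artist_name"]} - {record["master_metadata_track_name"]}'
    seconds = record["ms_played"] / 1000  # milliseconds -> seconds
    return played_at, title, seconds

record = json.loads(raw)
played_at, title, seconds = summarize_play(record)
print(played_at.isoformat(), title, f"{seconds:.1f} s")
```

Converting `ms_played` to seconds early makes thresholds such as "30 seconds" easier to read later on.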

In [53]:
import pandas as pd
import json

# Load JSON data from a file (the export is UTF-8 encoded JSON)
with open('path to your extended streaming history from spotify', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Normalize JSON data into a DataFrame
df = pd.json_normalize(data)
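
The export may arrive as several JSON files rather than one. A minimal sketch for combining them, assuming the files sit in one folder and each contains a JSON array of records. The function name `load_history` and the `*.json` pattern are assumptions, so adjust them to the file names in your own download.

```python
import json
from pathlib import Path

def load_history(folder, pattern="*.json"):
    """Read every matching JSON file in `folder` and return one list of records."""
    records = []
    # Sort the paths so the load order is reproducible.
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, "r", encoding="utf-8") as f:
            chunk = json.load(f)
        # Each file is assumed to hold a JSON array of play records.
        if isinstance(chunk, list):
            records.extend(chunk)
    return records
```

The combined list can then be passed to `pd.json_normalize(...)` in place of `data`.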

In [54]:
# Define a threshold for "liked" based on ms_played (30 seconds = 30000 ms)
threshold = 30000

# Create a 'liked' column: 1 if the play was not skipped and lasted longer than
# the threshold, else 0. Missing 'skipped' values are treated as "not skipped".
not_skipped = ~df['skipped'].eq(True)
df['liked'] = (not_skipped & (df['ms_played'] > threshold)).astype(int)
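
To see how the rule behaves on edge cases, the same logic can be written as a plain function and tried on a few hypothetical records. This is only a sketch: `is_liked` is not part of the notebook, and treating a missing `skipped` value as "not skipped" is an assumption.

```python
THRESHOLD_MS = 30_000  # 30 seconds, the same cutoff used above

def is_liked(record, threshold_ms=THRESHOLD_MS):
    """Return 1 if the play was not skipped and lasted longer than the threshold."""
    skipped = record.get("skipped") is True  # None or a missing key counts as not skipped
    return int(not skipped and record["ms_played"] > threshold_ms)

# Hypothetical records chosen to exercise each branch of the rule.
examples = [
    {"ms_played": 262_960, "skipped": False},  # played through
    {"ms_played": 45_000, "skipped": True},    # long enough, but skipped
    {"ms_played": 12_000, "skipped": False},   # too short
    {"ms_played": 30_000, "skipped": None},    # exactly at the threshold: not strictly greater
]
print([is_liked(r) for r in examples])  # [1, 0, 0, 0]
```

Note that the comparison is strict, so a play of exactly 30 seconds is not counted as liked.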

In [55]:
df.head()

| Unnamed: 0 | ts | username | platform | ms_played | conn_country | ip_addr_decrypted | user_agent_decrypted | master_metadata_track_name | master_metadata_album_artist_name | master_metadata_album_album_name | ... | episode_show_name | spotify_episode_uri | reason_start | reason_end | shuffle | skipped | offline | offline_timestamp | incognito_mode | liked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2023-01-16T20:04:01Z | micheellemendezz | linux | 262960 | EC | 186.70.158.146 | unknown | Last Christmas | Wham! | LAST CHRISTMAS | ... |  |  | trackdone | trackdone | True | False | False | 1673899178 | False | 1 |
| 1 | 2023-01-16T20:06:03Z | micheellemendezz | linux | 121526 | EC | 186.70.158.146 | unknown | Holly Jolly Christmas | Michael Bublé | Christmas - Deluxe Special Edition | ... |  |  | trackdone | trackdone | True | False | False | 1673899442 | False | 1 |
| 2 | 2023-01-16T20:11:51Z | micheellemendezz | linux | 347426 | EC | 186.70.158.146 | unknown | Clair de lune | Claude Debussy | Träumerei - Liebestraum - Für Elise - Clair de... | ... |  |  | trackdone | trackdone | True | False | False | 1673899564 | False | 1 |
| 3 | 2023-01-16T20:14:35Z | micheellemendezz | linux | 162637 | EC | 186.70.158.146 | unknown | La Bachata | Manuel Turizo | La Bachata | ... |  |  | trackdone | trackdone | True | False | False | 1673899912 | False | 1 |
| 4 | 2023-01-16T20:17:52Z | micheellemendezz | linux | 196600 | EC | 186.70.158.146 | unknown | Hey Mor | Ozuna | OzuTochi | ... |  |  | trackdone | trackdone | True | False | False | 1673900075 | False | 1 |


This analysis also needs the audio features of each track (for example danceability, energy, and tempo), which are not part of the export. We retrieve them from the Spotify Web API, looking each track up by its track ID.
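
The track ID is the last part of the `spotify_track_uri` value. The following sketch shows one way to extract it while rejecting anything that is not a track URI. The helper `track_id_from_uri` is a hypothetical name, and the episode URI in the example is made up for illustration.

```python
def track_id_from_uri(uri):
    """Return the ID from a 'spotify:track:<id>' URI, or None for anything else."""
    if not isinstance(uri, str):
        return None  # missing value, e.g. a row that has no track URI
    parts = uri.split(":")
    if len(parts) == 3 and parts[0] == "spotify" and parts[1] == "track":
        return parts[2]
    return None

print(track_id_from_uri("spotify:track:3ZFTkvIE7kyPt6Nu3PEa7V"))  # 3ZFTkvIE7kyPt6Nu3PEa7V
print(track_id_from_uri("spotify:episode:abc123"))                # None (hypothetical episode URI)
print(track_id_from_uri(None))                                    # None
```

Checking the `track` segment explicitly avoids sending non-track IDs to the audio-features lookup.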

In [57]:
import spotipy
from spotipy.oauth2 import SpotifyOAuth
import pandas as pd

# Replace with your Spotify API credentials
SPOTIPY_CLIENT_ID = "your client id"
SPOTIPY_CLIENT_SECRET = "your client secret"
SPOTIPY_REDIRECT_URI = "http://localhost/"

# Set up Spotify client
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(
    client_id=SPOTIPY_CLIENT_ID,
    client_secret=SPOTIPY_CLIENT_SECRET,
    redirect_uri=SPOTIPY_REDIRECT_URI,
    scope='user-library-read'
))

# Extract track IDs from the 'spotify_track_uri' column by splitting the string
# The 'spotify_track_uri' column should contain values like "spotify:track:3ZFTkvIE7kyPt6Nu3PEa7V"
df['track_id'] = df['spotify_track_uri'].apply(lambda x: x.split(':')[-1] if pd.notnull(x) else None)

# Drop rows with no track_id (for example, episodes that have no corresponding track data)
df = df.dropna(subset=['track_id'])

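# Create a list of unique track IDs. The same track usually appears many times
# in the history; requesting it once avoids redundant API calls and prevents
# duplicated rows in the merge below.
track_ids = df['track_id'].unique().tolist()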

# Fetch audio features for the track IDs in batches
audio_features = []
batch_size = 50  # Spotify accepts up to 100 track IDs per audio-features request

# Each request carries one batch of IDs, which keeps it within the per-request limit
for i in range(0, len(track_ids), batch_size):
    batch_ids = track_ids[i:i + batch_size]
    features = sp.audio_features(batch_ids)
    
    # Filter out any None values from the features list
    features = [f for f in features if f is not None]
    audio_features.extend(features)

# Convert the filtered list of features into a DataFrame
features_df = pd.DataFrame(audio_features)

# Combine the audio features with the original data
df_final = df.merge(features_df, left_on='track_id', right_on='id', how='left')

# Save the resulting DataFrame to a CSV for later use
df_final.to_csv('spotify_data_with_Efeatures.csv', index=False)

print("Audio features extraction complete. Data saved as 'spotify_data_with_Efeatures.csv'.")


Audio features extraction complete. Data saved as 'spotify_data_with_Efeatures.csv'.
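
A listening history usually contains the same track many times, so the number of API calls depends on how the IDs are deduplicated and batched. The helper below is a sketch, not part of spotipy: `unique_batches` is a hypothetical name, and the default of 100 reflects the per-request limit mentioned in the code comment above.

```python
def unique_batches(ids, batch_size=100):
    """Yield lists of at most `batch_size` distinct IDs, keeping first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique_ids = list(dict.fromkeys(i for i in ids if i is not None))
    for start in range(0, len(unique_ids), batch_size):
        yield unique_ids[start:start + batch_size]

# Toy IDs standing in for real track IDs.
plays = ["a", "b", "a", "c", None, "b", "d"]
print(list(unique_batches(plays, batch_size=3)))  # [['a', 'b', 'c'], ['d']]
```

Each yielded batch could then be passed to `sp.audio_features(batch)` in the loop.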


The resulting CSV file, `spotify_data_with_Efeatures.csv`, contains the streaming history together with the audio features and is ready for data analysis and further model development.
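
Before moving on, it can help to confirm that the feature columns actually made it into the file. A minimal sketch using only the standard library: the function name `missing_feature_columns` is hypothetical, and the column list is a subset of the fields the audio-features endpoint is expected to return.

```python
import csv

# A subset of the columns the audio-features lookup is expected to add.
EXPECTED_FEATURES = ("danceability", "energy", "valence", "tempo", "loudness")

def missing_feature_columns(csv_path, expected=EXPECTED_FEATURES):
    """Return the expected feature columns that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # first row, or [] for an empty file
    return [col for col in expected if col not in header]
```

Calling `missing_feature_columns('spotify_data_with_Efeatures.csv')` should return an empty list if the merge added every expected column.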