# TuneWeaver: Data Collection and Preparation for Playlist Recommendation Analysis

**Project:** TuneWeaver - Enhanced A/B/n Testing & Production-Ready Pipeline for Serendipity Playlists
**Notebook Purpose:** This notebook covers the first phase of the TuneWeaver project: **Data Collection and Preparation**. It collects metadata and audio features for 5,000+ music tracks from the Spotify Web API, then cleans and normalizes that data for the downstream tasks of algorithm development, simulation, and A/B/n testing analysis.

**Key Activities in this Notebook:**
1.  **Spotify API Interaction:** Utilizing the Spotipy library to connect to the Spotify Web API.
2.  **Data Acquisition:** Fetching data for over 5,000 tracks across a diverse set of genres. This includes:
    * Track metadata (ID, name, artist, album, popularity).
    * Audio features (e.g., danceability, energy, valence, tempo).
3.  **Initial Data Storage:** Saving the raw, collected data into a `songs.csv` file.
4.  **Detailed Pre-processing:**
    * Loading and inspecting the raw dataset.
    * Missing-value analysis and imputation.
    * Normalization of numerical features to ensure consistent scaling for subsequent analysis and modeling.
5.  **Final Processed Data Storage:** Saving the cleaned and transformed dataset as `songs_processed.csv`.

**Outcome:** A well-structured, clean, and normalized dataset (`songs_processed.csv`) ready for the next stages of the TuneWeaver project, particularly for implementing and evaluating different playlist recommendation algorithms.
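
As a rough illustration of the pre-processing steps listed above, the sketch below fills missing numeric values with the column median and rescales each column to [0, 1] with min-max normalization. Both techniques, the column names, and the toy data are assumptions made for illustration; the imputation and scaling strategy actually used later in the notebook may differ.

```python
import pandas as pd

# Hypothetical column names; the real songs.csv schema may differ.
NUMERIC_COLS = ["danceability", "energy", "valence", "tempo", "popularity"]


def preprocess(df: pd.DataFrame, numeric_cols=NUMERIC_COLS) -> pd.DataFrame:
    """Median-impute and min-max scale the given numeric columns."""
    out = df.copy()  # leave the raw frame untouched
    for col in numeric_cols:
        # Assumed imputation strategy: replace missing values with the median.
        out[col] = out[col].fillna(out[col].median())
        lo, hi = out[col].min(), out[col].max()
        # A constant column has no spread; map it to 0.0 instead of dividing by zero.
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out


# Toy data standing in for the raw dataset.
raw = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "danceability": [0.2, None, 0.8],
    "energy": [0.5, 0.5, 0.5],
    "valence": [0.1, 0.4, 0.7],
    "tempo": [90.0, 120.0, 150.0],
    "popularity": [10, 55, 100],
})
processed = preprocess(raw)
print(processed)
```

Applied to the real files, this would look something like `preprocess(pd.read_csv("songs.csv")).to_csv("songs_processed.csv", index=False)`, assuming the raw CSV contains the columns named in `NUMERIC_COLS`.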

## Part 1: Data Collection using Spotify API (Spotipy)

This section outlines the process of collecting music track data. We will use the [Spotipy](https://spotipy.readthedocs.io/) library, a lightweight Python client for the Spotify Web API.

**Goal:** To compile a dataset of at least 5,000 unique songs, including their audio features and metadata, covering a variety of genres.
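
The collection flow described above can be sketched roughly as follows: request recommendations per genre seed in batches, de-duplicate by track ID, then attach audio features in batches of up to 100 IDs. This is a minimal sketch, not the notebook's actual collection code. The helper names (`chunked`, `collect_tracks`), the output column names, and the `FakeSpotify` stand-in are hypothetical; `sp` is assumed to behave like a `spotipy.Spotify` client exposing `recommendations` and `audio_features`.

```python
import time


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_tracks(sp, genres, per_genre, batch_size=100, pause=0.0):
    """Collect unique tracks per genre seed, then attach audio features."""
    tracks = {}  # track ID -> row; keying by ID de-duplicates across calls
    calls_per_genre = -(-per_genre // batch_size)  # ceiling division
    for genre in genres:
        for _ in range(calls_per_genre):
            result = sp.recommendations(seed_genres=[genre], limit=batch_size)
            for t in result["tracks"]:
                tracks.setdefault(t["id"], {
                    "track_id": t["id"],
                    "name": t["name"],
                    "artist": t["artists"][0]["name"],
                    "album": t["album"]["name"],
                    "popularity": t["popularity"],
                    "genre_seed": genre,
                })
            time.sleep(pause)  # optional pause between API calls
    # Audio features are requested in batches of up to 100 track IDs.
    for ids in chunked(list(tracks), 100):
        for feats in sp.audio_features(ids):
            if feats:  # an entry can be None when no features are available
                tracks[feats["id"]].update(
                    {k: feats[k] for k in ("danceability", "energy", "valence", "tempo")}
                )
    return list(tracks.values())


class FakeSpotify:
    """Offline stand-in used only to dry-run the sketch without network access."""

    def recommendations(self, seed_genres, limit):
        g = seed_genres[0]
        return {"tracks": [
            {"id": f"{g}-{i}", "name": f"{g} song {i}",
             "artists": [{"name": "Some Artist"}],
             "album": {"name": "Some Album"}, "popularity": 50}
            for i in range(min(limit, 3))
        ]}

    def audio_features(self, tracks):
        return [{"id": tid, "danceability": 0.5, "energy": 0.5,
                 "valence": 0.5, "tempo": 120.0} for tid in tracks]


rows = collect_tracks(FakeSpotify(), ["pop", "jazz"], per_genre=200)
print(len(rows))  # 6: repeated calls return the same 3 tracks per genre
```

With real credentials, the client would be created with `spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=..., client_secret=...))` and passed in place of `FakeSpotify()`. Because repeated recommendation calls can return overlapping tracks, `per_genre` is an upper bound on what each genre contributes, not a guarantee.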

In [None]:
# Cell 3: Imports and Configuration for Data Collection

# Core libraries
import os  # Read API credentials from environment variables
import time

import pandas as pd

# Spotify API client
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# --- Spotify API Credentials Configuration ---
# BEST PRACTICE: Store your API credentials as environment variables
# and load them using os.getenv(). Avoid hardcoding them directly in your script/notebook.
#
# Example of setting environment variables (in your terminal/shell before launching Jupyter):
# export SPOTIPY_CLIENT_ID='your_actual_client_id'
# export SPOTIPY_CLIENT_SECRET='your_actual_client_secret'
#
# The os.getenv() calls below then pick them up.
#
# If you don't have them set as environment variables, you can temporarily
# assign them below FOR DEVELOPMENT PURPOSES ONLY.
# **REMEMBER TO REMOVE OR REPLACE WITH os.getenv() BEFORE COMMITTING TO GITHUB.**

client_id = os.getenv('SPOTIPY_CLIENT_ID')
client_secret = os.getenv('SPOTIPY_CLIENT_SECRET')

# Fallback for development if environment variables are not set (NOT FOR PRODUCTION/GITHUB)
# If you uncomment the lines below to hardcode, replace 'YOUR_CLIENT_ID' and 'YOUR_CLIENT_SECRET'
if not client_id or not client_secret:
    print("⚠️ Spotify API credentials not found in environment variables.")
    print("   Please set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET environment variables for secure credential management.")
    print("   For temporary local development only, you can uncomment and fill the lines below.")
    print("   Ensure these are NOT committed to version control if hardcoded.")
    # client_id = 'YOUR_CLIENT_ID'  # <<< TEMPORARY: REPLACE WITH YOUR CLIENT ID
    # client_secret = 'YOUR_CLIENT_SECRET' # <<< TEMPORARY: REPLACE WITH YOUR CLIENT SECRET

# --- Data Collection Parameters ---
# Filename for the raw collected data from Spotify
raw_output_filename = 'songs.csv'

# Target number of unique tracks to collect.
# The project plan specifies 5,000+ tracks.
total_tracks_target = 5500 # Aiming slightly above 5000 for a buffer

# Number of tracks to attempt to fetch per genre.
# Spotify's `recommendations` endpoint returns at most 100 tracks per call,
# so 400 tracks per genre takes at least 4 calls per genre.
# Across the 20 genres below this gives up to 8,000 candidates, which leaves
# headroom over `total_tracks_target` once duplicates are removed.
num_tracks_per_genre = 400

# Define a diverse list of genres to fetch tracks from.
# These serve as seeds for Spotify's recommendation engine.
genres = [
    "pop", "rock", "hip-hop", "electronic", "classical", "jazz", "r-n-b",
    "country", "folk", "metal", "blues", "funk", "soul", "indie",
    "latin", "reggae", "punk", "alternative", "dance", "ambient"
]

# --- Initial Check and Confirmation ---
# This check is crucial to ensure credentials are set before proceeding.
if not client_id or client_id == 'YOUR_CLIENT_ID' or \
   not client_secret or client_secret == 'YOUR_CLIENT_SECRET':
    print("="*80)
    print("❌ CRITICAL ERROR: Spotify API credentials are not properly set.")
    print("   This notebook requires valid Spotify API credentials to collect data.")
    print("   Option 1 (Recommended): Set them as environment variables in your system:")
    print("     export SPOTIPY_CLIENT_ID='your_actual_client_id'")
    print("     export SPOTIPY_CLIENT_SECRET='your_actual_client_secret'")
    print("   Option 2 (Temporary for Local Development):")
    print("     Uncomment the `client_id` and `client_secret` lines in this cell and replace")
    print("     the placeholder values with your actual credentials.")
    print("     IMPORTANT: If you use Option 2, DO NOT commit these hardcoded credentials to GitHub.")
    print("   The script cannot proceed without valid credentials.")
    print("="*80)
    # Optionally, raise an error to halt execution if you prefer the notebook to stop here:
    # raise ValueError("Spotify API credentials not set. Halting execution.")
else:
    print("✅ Configuration loaded successfully.")
    print(f"   Targeting approximately {total_tracks_target} unique tracks.")
    print(f"   Raw data will be saved to: '{raw_output_filename}'")
    print(f"   Attempting to fetch up to {num_tracks_per_genre} tracks per genre from {len(genres)} genres.")
    print("   Note: Actual Spotify API credentials are not displayed for security.")
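
The credential check above prints a message but lets execution continue unless the `raise` line is uncommented. Where failing fast is preferred, a stricter variant might look like the sketch below. The helper name `load_spotify_credentials` is hypothetical; the environment variable names and placeholder strings are the ones used in this cell.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret); raise if either is missing or a placeholder."""
    placeholders = {"YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"}
    client_id = env.get("SPOTIPY_CLIENT_ID")
    client_secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not client_id or not client_secret or {client_id, client_secret} & placeholders:
        raise ValueError("Spotify API credentials not set. Halting execution.")
    return client_id, client_secret


# Dry run with an explicit mapping instead of the real environment.
try:
    load_spotify_credentials({})
except ValueError as exc:
    print(exc)
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary, without touching real credentials.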