# Song Hotness Predictor: Data Acquisition & Preprocessing

In this notebook, we will:
1. Download/Access the Million Song Dataset
2. Load required libraries and the dataset
3. Understand the dataset structure
4. Preprocess the data (handle missing values, address data imbalance, and scale features).

In [None]:
import os
import pandas as pd
import numpy as np
import tarfile
import requests
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from tqdm import tqdm

## Step 1: Data Acquisition

The full Million Song Dataset (MSD) is very large, so this notebook works with a subset of it rather than the complete collection. We obtained the data through the download page on the MSD website: http://millionsongdataset.com/pages/getting-dataset/

In [2]:
dataset_url = "http://labrosa.ee.columbia.edu/~dpwe/tmp/millionsongsubset.tar.gz"
local_filename = "millionsongsubset.tar.gz"
extraction_path = "data"


In [5]:
def download_dataset(url, filename):
    """Stream the file at `url` to `filename`, showing a progress bar."""
    response = requests.get(url, stream=True, timeout=30)
    # Fail early on HTTP errors (e.g. 404) instead of saving an error page
    response.raise_for_status()
    # Content-Length may be missing; 0 means the expected size is unknown
    total_size = int(response.headers.get('content-length', 0))
    block_size = 1024  # Bytes read per chunk

    with open(filename, 'wb') as file, tqdm(
        total=total_size, unit='iB', unit_scale=True, desc="Downloading"
    ) as bar:
        for data in response.iter_content(block_size):
            bar.update(len(data))
            file.write(data)
    # Check that the file on disk is at least as large as the server reported
    actual_size = os.path.getsize(filename)
    if actual_size < total_size:
        raise Exception(
            f"Download incomplete: expected {total_size} bytes but got {actual_size} bytes"
        )
    print("Download complete.")

# Download only if the archive is not already present locally
if not os.path.exists(local_filename):
    try:
        download_dataset(dataset_url, local_filename)
    except Exception as e:
        print("Error during download:", e)
        # Optionally, delete the incomplete file:
        if os.path.exists(local_filename):
            os.remove(local_filename)
else:
    print("Dataset already downloaded.")

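# Extract the dataset (skipped if the target directory already has content)
if os.path.isdir(extraction_path) and os.listdir(extraction_path):
    print("Dataset already extracted.")
elif not os.path.exists(local_filename):
    # The download above failed or was removed, so there is nothing to extract
    print("Archive not found:", local_filename)
else:
    os.makedirs(extraction_path, exist_ok=True)
    try:
        with tarfile.open(local_filename, "r:gz") as tar:
            tar.extractall(path=extraction_path)
        print("Extraction complete. Files are available in:", extraction_path)
    except (tarfile.TarError, EOFError) as e:
        print("Error during extraction:", e)
        print("The archive may be corrupted. Consider re-downloading the file.")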


Downloading: 100%|██████████| 1.98G/1.98G [06:29<00:00, 5.09MiB/s]   


Download complete.
Extraction complete. Files are available in: data
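
The extraction step above calls `tar.extractall` with no restrictions on member paths. As an optional hardening, the sketch below uses tarfile's `"data"` extraction filter when the running Python provides it and falls back to plain extraction otherwise. The helper name `extract_archive` is an assumption made for illustration and is not part of the original notebook.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir (hypothetical helper).

    Uses the 'data' extraction filter when this Python version supports it.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # The 'data' filter refuses members that would land outside
            # dest_dir (absolute paths, '..' traversal, links pointing
            # outside) as well as special files such as devices.
            tar.extractall(path=dest_dir, filter="data")
        else:
            # Older Python without extraction filters: plain extraction.
            tar.extractall(path=dest_dir)
    return dest_dir
```

With the variables defined earlier, this could be called as `extract_archive(local_filename, extraction_path)`.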

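
Before moving on to the dataset structure, it can help to confirm that extraction actually produced files. The following is a minimal sketch that walks a directory and counts files per extension. The function name `count_files_by_extension` is hypothetical, and the `"data"` path assumes the `extraction_path` used above.

```python
import os
from collections import Counter


def count_files_by_extension(root):
    """Walk `root` recursively and count files per lowercased extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Files without an extension are grouped under "(none)"
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
    return counts


# Assumes the archive was extracted into "data" as in the cell above
if os.path.isdir("data"):
    for ext, n in count_files_by_extension("data").most_common():
        print(f"{ext}: {n}")
```

MSD track files are stored in HDF5 format, so `.h5` is the extension one would expect to dominate the counts. An empty result suggests the extraction did not complete.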