# Spotify Data Reduction

We decided to model Spotify data in terms of Tracks, Albums, Artists, Genres and Charts.

As a starting point, we downloaded a dataset from Kaggle that contains all the songs in Spotify's Daily Top 200 charts for 35 countries plus the global chart, over a period of more than 3 years (2017-2020):

**https://www.kaggle.com/pepepython/spotify-huge-database-daily-charts-over-3-years?select=Database+to+calculate+popularity.csv**


Since this initial dataset is very large (1.53 GB), we reduce the initial CSV file to a smaller one that is suitable for our project.

## Setup

We import the necessary libraries and set the paths to the input and output files.
* **input file** must be called "spotifyCharts.csv"
* **output file** will be called "reducedSpotifyCharts.csv"

In [None]:
# imports
import os
from pathlib import Path
import pandas as pd
from datetime import datetime, timedelta

In [None]:
# Get absolute path
absPath = str(Path(os.path.abspath(os.getcwd())).absolute())

# All the datasets must be placed in a single folder called "datasets"
datasetsPath = os.path.join(absPath, "datasets")

# Create the datasets directory if it does not exist
os.makedirs(datasetsPath, exist_ok=True)

# Setup datasets paths
spotifyChartsPath = os.path.join(datasetsPath, "spotifyCharts.csv")
spotifyReducedChartsPath = os.path.join(datasetsPath, "reducedSpotifyCharts.csv")

## Open the file

In [None]:
# Load Spotify Charts
trackCharts = pd.read_csv(spotifyChartsPath, sep=",", index_col=0)

# Drop rows that contain NaN values
trackCharts = trackCharts.dropna()

# Print track charts info
trackCharts.info()
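
Because the input file is large, loading it in one call can use a lot of memory. As a hedged alternative, the sketch below reads a CSV in chunks and filters each chunk before keeping it. It runs on a tiny in-memory CSV whose values and `title` column are invented for illustration; only the `date` and `position` column names come from this notebook, and `top100` is a hypothetical name.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (values invented for illustration)
csv_text = (
    ",date,position,title\n"
    "0,02/01/2020,1,A\n"
    "1,02/01/2020,150,B\n"
    "2,01/01/2020,1,C\n"
    "3,01/01/2020,,D\n"
)

chunks = []
# With chunksize set, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), index_col=0, chunksize=2):
    chunk = chunk.dropna()                          # drop rows with missing values
    chunks.append(chunk[chunk["position"] <= 100])  # keep the top 100 only

# Combine the filtered chunks into a single DataFrame
top100 = pd.concat(chunks)
print(top100)
```

With the real file, `io.StringIO(csv_text)` would be replaced by `spotifyChartsPath` and a much larger `chunksize`.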

## Sampling 

We define a function to reduce the dataframe using two parameters:
* **onlyFirst** selects only the first *x* tracks of each chart. For example, with *onlyFirst = 50*, only the top 50 songs are kept. A value of 0 or less keeps every position.
* **daysRange** samples the charts every *x* days. For example, with *daysRange = 7*, only one chart per week is kept.
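
To make the two parameters concrete, here is a minimal sketch on toy data. It is a vectorized illustration of the same idea, not the function used in this notebook; the names `toy`, `only_first` and `days_range` are hypothetical and the data is invented.

```python
import pandas as pd

# Toy chart data (invented): 3 positions per day over 14 days
days = pd.date_range("2020-01-01", periods=14, freq="D")
toy = pd.DataFrame(
    [(d.strftime("%d/%m/%Y"), pos) for d in days for pos in (1, 2, 3)],
    columns=["date", "position"],
)

only_first = 2   # keep positions 1 and 2
days_range = 7   # keep one chart every 7 days

# Parse the date strings, then keep the days whose distance from the
# earliest date is a multiple of days_range
parsed = pd.to_datetime(toy["date"], format="%d/%m/%Y")
offset_days = (parsed - parsed.min()).dt.days
sampled = toy[(offset_days % days_range == 0) & (toy["position"] <= only_first)]

print(sampled)
```

Here the sampled dates are 01/01/2020 and 08/01/2020, each with its top 2 positions.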

In [None]:
def selectSamples(trackCharts, onlyFirst=-1, daysRange=7):
    # First and final date in the csv: the code assumes that the rows go
    # from the most recent date (first row) to the oldest one (last row)
    firstDateStr = trackCharts.iloc[-1]["date"]
    endDateStr = trackCharts.iloc[0]["date"]

    # Initialize for the while
    actualDate = datetime.strptime(firstDateStr, "%d/%m/%Y").date()
    endDate = datetime.strptime(endDateStr, "%d/%m/%Y").date()

    # Collect one slice per sampled date and concatenate once at the end
    samples = []
    while actualDate < endDate:
        mask = trackCharts["date"] == actualDate.strftime("%d/%m/%Y")
        if onlyFirst > 0:
            # Keep only the first `onlyFirst` positions of that day's charts
            mask = mask & (trackCharts["position"] <= onlyFirst)
        samples.append(trackCharts.loc[mask])

        actualDate = actualDate + timedelta(days=daysRange)

    if not samples:
        return pd.DataFrame()
    return pd.concat(samples, ignore_index=True)


#### For our project we decided to retrieve a ***weekly TOP 100***, obtaining a smaller CSV file (109 MB vs 1.53 GB)

In [None]:
# Reduce the chart tracks
reducedTrackCharts = selectSamples(trackCharts, onlyFirst=100, daysRange=7)

# Print DataFrame info
reducedTrackCharts.info()

In [None]:
# Save the DataFrame to file
reducedTrackCharts.to_csv(spotifyReducedChartsPath)
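
Since `to_csv` writes the DataFrame index as an unnamed first column by default, the reduced file can be read back with `index_col=0`, as was done for the input file. The sketch below illustrates this round trip on a small in-memory buffer; `reduced` and `restored` are hypothetical names and the values are invented.

```python
import io
import pandas as pd

# Small stand-in for reducedTrackCharts (values invented for illustration)
reduced = pd.DataFrame({"date": ["01/01/2020", "08/01/2020"], "position": [1, 1]})

buffer = io.StringIO()
reduced.to_csv(buffer)   # the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 turns that first column back into the index
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```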