Load the data saved by notebook 01_load_and_clean.

In [2]:
import pandas as pd
# Load the processed Spotify UK dataset
uk = pd.read_csv("../data/processed/spotify_uk.csv")
uk.head()

Unnamed: 0,title,rank,date,artist,url,region,chart,trend,streams,is_top100


Sort the dataset. To calculate trends and days-on-chart, we must ensure the rows are in the correct order.

In [3]:
# Convert the 'date' column to datetime format and sort the DataFrame
uk["date"] = pd.to_datetime(uk["date"])
# Sort by 'title' and 'date' columns
uk = uk.sort_values(["title", "date"]).reset_index(drop=True)


Feature 1: How long has a song been on the chart?

In [4]:
# Running count of rows per title, starting at 1 (relies on the sort above)
uk["days_on_chart"] = uk.groupby("title").cumcount() + 1
uk[["title", "date", "days_on_chart"]].head(10)

Unnamed: 0,title,date,days_on_chart


For each song, count how many days it has appeared on the chart so far: 1 for the first appearance, 2 for the second ...
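
A minimal sketch of the counting logic on toy data, to make the numbering concrete. The titles and dates below are made up for illustration and are not taken from the Spotify dataset.

```python
import pandas as pd

# Toy data (hypothetical titles and dates, deliberately out of order)
demo = pd.DataFrame({
    "title": ["Song A", "Song B", "Song A", "Song A", "Song B"],
    "date": pd.to_datetime(
        ["2021-01-02", "2021-01-01", "2021-01-01", "2021-01-03", "2021-01-02"]
    ),
})

# Sort by title and date, then number each title's rows starting at 1
demo = demo.sort_values(["title", "date"]).reset_index(drop=True)
demo["days_on_chart"] = demo.groupby("title").cumcount() + 1

print(demo)
```

Note that grouping by title alone counts two different songs that share a title as one. If that turns out to matter, grouping by both "title" and "artist" may be a safer key, assuming the artist column is filled in consistently.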

Feature 2: Did the song move up, move down, or stay in the same position?

In [5]:
# Numeric encoding for each trend label
trend_map = {
    "MOVE_UP": 1,
    "MOVE_DOWN": -1,
    "SAME_POSITION": 0
}
# Map the trend labels to numbers (labels missing from trend_map become NaN)
uk["trend_num"] = uk["trend"].map(trend_map)
uk[["trend", "trend_num"]].head()

Unnamed: 0,trend,trend_num
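
Since `Series.map` returns NaN for any label that is not a key in the dictionary, it may be worth checking whether the trend column contains labels the mapping does not cover. A small sketch on a toy column follows; the "NEW_ENTRY" label is an assumption used only to trigger the check, and the real data may or may not contain it.

```python
import pandas as pd

trend_map = {"MOVE_UP": 1, "MOVE_DOWN": -1, "SAME_POSITION": 0}

# Toy column; "NEW_ENTRY" is an assumed extra label, for illustration only
trend = pd.Series(["MOVE_UP", "SAME_POSITION", "NEW_ENTRY", "MOVE_DOWN"])
trend_num = trend.map(trend_map)

# Labels that exist in the data but are missing from the mapping
unmapped = trend[trend_num.isna() & trend.notna()].unique()
print(unmapped)
```

If the same check on `uk["trend"]` returns anything, those labels need either their own entry in `trend_map` or an explicit decision about how to treat the resulting NaN values.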


Feature 3: Where did the song rank yesterday?

In [6]:
# Rank from the same title's previous row (NaN on its first appearance)
uk["prev_rank"] = uk.groupby("title")["rank"].shift(1)
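
One caveat: `shift(1)` returns the previous row for the same title, which is yesterday's rank only when the song charted on consecutive days. A rough sketch of how the gap could be checked, on toy data; the column names `days_since_prev` and `prev_is_yesterday` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy data: the song is missing from the chart on 2021-01-03 and 2021-01-04
demo = pd.DataFrame({
    "title": ["Song A", "Song A", "Song A"],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-05"]),
    "rank": [40, 35, 60],
})

demo["prev_rank"] = demo.groupby("title")["rank"].shift(1)

# Days elapsed since the same title's previous row (NaN on first appearance)
demo["days_since_prev"] = demo.groupby("title")["date"].diff().dt.days

# True only when the previous row really is the day before
demo["prev_is_yesterday"] = demo["days_since_prev"] == 1

print(demo)
```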

Feature 4: How fast is the song moving?

In [7]:
# Previous rank minus current rank: positive means the song moved up the chart
uk["rank_change"] = uk["prev_rank"] - uk["rank"]
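
To make the sign convention concrete, here is a small sketch on toy ranks (made-up values). Because rank 1 is the top of the chart, a song moving from 10 to 7 gets a positive change.

```python
import pandas as pd

# Toy data, already ordered by title and date
demo = pd.DataFrame({
    "title": ["Song A"] * 3 + ["Song B"] * 2,
    "rank": [10, 7, 9, 50, 50],
})

demo["prev_rank"] = demo.groupby("title")["rank"].shift(1)
demo["rank_change"] = demo["prev_rank"] - demo["rank"]

# Song A: NaN, then +3 (10 -> 7), then -2 (7 -> 9); Song B: NaN, then 0
print(demo)
```

The first row of every title has no previous rank, so its `rank_change` is NaN and has to be filled or dropped before modelling.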

Feature 5: Normalization of large differences in stream numbers.

In [8]:
import numpy as np
# Log-transform stream counts with log1p, i.e. log(1 + x), to compress their range for the ML model
uk["log_streams"] = np.log1p(uk["streams"])
uk[["streams", "log_streams"]].head()


Unnamed: 0,streams,log_streams
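
A short sketch of what `np.log1p` does to values of very different size, using made-up stream counts. It also shows that `np.expm1` reverses the transform, which is useful for converting predictions back to stream counts.

```python
import numpy as np

# Made-up stream counts spanning several orders of magnitude
streams = np.array([0, 1_000, 1_000_000])

# log1p(x) = log(1 + x), so a count of 0 maps to 0 instead of -inf
log_streams = np.log1p(streams)

# expm1 is the inverse: exp(x) - 1
recovered = np.expm1(log_streams)

print(log_streams)
print(recovered)
```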


Save the processed dataset

In [9]:
# Save the updated DataFrame with new features
uk.to_csv("../data/processed/spotify_uk_features.csv", index=False)
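
CSV does not store dtypes, so the datetime conversion applied to 'date' is lost on reload unless the column is parsed again. A minimal sketch of the round trip, using an in-memory buffer and toy data instead of the real file:

```python
import io

import pandas as pd

demo = pd.DataFrame({
    "title": ["Song A"],
    "date": pd.to_datetime(["2021-01-01"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: 'date' comes back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# With parse_dates: 'date' comes back as datetime64
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain.dtypes)
print(parsed.dtypes)
```

A later notebook reading spotify_uk_features.csv could pass `parse_dates=["date"]` in the same way.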
