# Preprocessing v2

### Steps:

1. **Load Datasets:**
   - Load 'charts_processed.csv'.

2. **Align Datasets:**
   - Rename 'id' in audio features to 'track_id'.
   - Keep only rows from the 'top200' chart.

3. **Add Columns:**
   - Extract 'track_id' from chart URLs.

4. **Remove Columns:**
   - Drop unnecessary columns ('url', 'chart', 'trend').

5. **Calculate Streams Percentage:**
   - Create 'streams_percentage' in charts.
   - For each row, divide its streams by the total streams of its region-date group (a fraction between 0 and 1).

6. **Validate Data:**
   - For 200 sampled date-region pairs, check that 'streams_percentage' sums to approximately 1.0.

7. **Save CSV:**
   - Save preprocessed data as 'charts_processed_v2.csv'.

Note: This version ensures dataset alignment, calculates streams percentage and validates data integrity. The result is saved for future use.
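
As a small worked example of steps 5 and 6, the sketch below computes each row's share of its region-date total on a made-up frame. The data values and the names `toy_df` and `group_totals` are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two region-date groups with made-up stream counts
toy_df = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-01-01"] * 5),
    "streams": [300, 100, 50, 30, 20],
})

# Total streams per (region, date), broadcast back to every row of the group
group_totals = toy_df.groupby(["region", "date"])["streams"].transform("sum")

# Each row's share of its group total (a fraction between 0 and 1)
toy_df["streams_percentage"] = toy_df["streams"] / group_totals

print(toy_df)
```

Despite the column name, the values are fractions, so each group sums to 1.0 rather than 100.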


In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas()

In [None]:
KAGGLE = True

In [None]:
if KAGGLE:
    CHARTS_PATH = '/kaggle/input/regionalrhythms/charts_processed.csv'
    PATH_TO_SAVE = '/kaggle/working/'
else:
    CHARTS_PATH = "../../data/charts_processed.csv"
    PATH_TO_SAVE = "../../data/"

In [None]:
# Load the charts dataset into a DataFrame (date_format requires pandas >= 2.0)
charts_df = pd.read_csv(CHARTS_PATH, parse_dates=['date'], date_format='%Y-%m-%d')
charts_df.head()

In [None]:
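# For now, restrict the dataset to the top200 chart for stream/ranking analysis.
# .copy() avoids pandas' SettingWithCopyWarning when columns are added below.
charts_df = charts_df[charts_df["chart"] == "top200"].copy()
# The track ID is the last path segment of the URL
charts_df["track_id"] = charts_df["url"].apply(lambda x: x.split("/")[-1])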

# drop the url, chart and trend columns
charts_df.drop(columns=["url", "chart", "trend"], inplace=True)

# Total streams for each region-date combination, aligned to every row of its group
total_streams = charts_df.groupby(['region', 'date'])['streams'].transform('sum')

# Each row's share of its region-date total (a fraction, so each group sums to 1.0)
charts_df['streams_percentage'] = charts_df['streams'] / total_streams

charts_df.head()
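
A quick sanity check on the extracted IDs can catch malformed URLs early. The sketch below assumes the URLs look like `https://open.spotify.com/track/<id>` and that a valid ID is a 22-character alphanumeric string. Both are assumptions about the dataset, and the example URLs and the names `urls`, `track_ids`, and `is_valid` are made up for illustration.

```python
import pandas as pd

# Hypothetical URLs; the second has a trailing slash, so its last path segment is empty
urls = pd.Series([
    "https://open.spotify.com/track/" + "a" * 22,
    "https://open.spotify.com/track/" + "b" * 22 + "/",
    "https://open.spotify.com/track/short",
])

# Same extraction rule as the notebook: take everything after the last "/"
track_ids = urls.str.split("/").str[-1]

# Assumed ID format: exactly 22 letters or digits
is_valid = track_ids.str.fullmatch(r"[0-9A-Za-z]{22}")

print(pd.DataFrame({"track_id": track_ids, "is_valid": is_valid}))
```

Rows flagged as invalid would be worth inspecting before the `url` column is dropped, since the original URL is no longer available afterwards.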

In [None]:
# Build the list of all (date, region) combinations
date_region_combinations = list(charts_df.groupby(["date", "region"]).groups.keys())

# Sample up to 200 distinct combinations (without replacement)
n_samples = min(200, len(date_region_combinations))
indices = np.random.choice(len(date_region_combinations), n_samples, replace=False)
sampled_date_region_combinations = [date_region_combinations[i] for i in indices]

# Check that streams_percentage sums to 1.0 within floating-point tolerance,
# and report every sampled group where it does not
for date, region in tqdm(sampled_date_region_combinations):
    df = charts_df[(charts_df["date"] == date) & (charts_df["region"] == region)]
    total = df["streams_percentage"].sum()
    if not np.isclose(total, 1.0):
        print(f"Sum is not 1.0 for {date}, {region}: {total}")
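
If checking only a sample is not enough, every group can be validated in a single pass. This is a minimal sketch on made-up data, not part of the original notebook; `toy_df`, `group_sums`, and `bad_groups` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; region "B" is deliberately broken so the check has something to report
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"] * 4),
    "region": ["A", "A", "B", "B"],
    "streams_percentage": [0.75, 0.25, 0.5, 0.3],
})

# Sum the shares within every (date, region) group in one pass
group_sums = toy_df.groupby(["date", "region"])["streams_percentage"].sum()

# Keep only the groups whose sum is not within floating-point tolerance of 1.0
bad_groups = group_sums[~np.isclose(group_sums, 1.0)]

print(bad_groups)
```

Applied to `charts_df`, an empty `bad_groups` would indicate that every date-region group passes the check.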

In [None]:
# save the csv file
charts_df.to_csv(PATH_TO_SAVE + "charts_processed_v2.csv", index=False)