# **Global Music Streaming Trends**
Luca-Andrei Codorean, 30233-1 CTI-RO @2025

This project implements a multilayer perceptron (MLP) classifier that predicts a listener's preferred listening time (morning, afternoon, or night) from the other features of the Global Music Streaming Trends dataset.
The dataset is available at: https://www.kaggle.com/datasets/atharvasoundankar/global-music-streaming-trends-and-listener-insights

To run the solution, install the dependencies listed in ``requirements.txt`` with the following command: ```pip install -r requirements.txt```

## Data Preprocessing

The very first step is data preprocessing and visualization. The first function takes one of the three datasets produced by running the ```src/data_loader.py``` script.
The ```data_loader``` script splits the dataset into three subsets: train, test, and val.
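
The ```data_loader``` script itself is not shown here, so the following is only a minimal sketch of how such a three-way split could be done. The function name `split_dataset`, the 70/15/15 ratios, the stratification on the target column, and the toy data frame are all assumptions made for illustration, not the project's actual implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "Listening Time (Morning/Afternoon/Night)"


def split_dataset(df: pd.DataFrame, target: str, seed: int = 42):
    """Split df into train/val/test parts (hypothetical 70/15/15 ratios)."""
    # First cut: keep 70% for training, set 30% aside.
    train_df, rest_df = train_test_split(
        df, test_size=0.30, stratify=df[target], random_state=seed
    )
    # Second cut: divide the remaining 30% equally into val and test.
    val_df, test_df = train_test_split(
        rest_df, test_size=0.50, stratify=rest_df[target], random_state=seed
    )
    return train_df, val_df, test_df


# Toy frame standing in for the Kaggle CSV (values are made up).
toy_df = pd.DataFrame({
    "Age": range(120),
    TARGET: ["Morning", "Afternoon", "Night"] * 40,
})
train_df, val_df, test_df = split_dataset(toy_df, TARGET)
```

Stratifying on the target keeps the class proportions similar across the three subsets, which matters when the classes are imbalanced.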

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

def preprocess_data(dataset_path: str):
    dataset_df = pd.read_csv(dataset_path)

    scaler  = StandardScaler()
    le_dict = {}

    for column in dataset_df.select_dtypes(include=["object"]).columns:
        le = LabelEncoder() 
        dataset_df[column] = le.fit_transform(dataset_df[column])  
        le_dict[column] = le  

    y = dataset_df["Listening Time (Morning/Afternoon/Night)"]
    X = dataset_df.drop(columns=["Listening Time (Morning/Afternoon/Night)"])

    X_scaled = scaler.fit_transform(X)
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
    
    return X_scaled_df, y, le_dict, scaler
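
Note that ```preprocess_data``` fits new label encoders and a new scaler on whichever file it receives, so the validation set is encoded and scaled independently of the training set. A common alternative is to fit these objects on the training split only and reuse them for the other splits. The sketch below illustrates that idea under stated assumptions: `fit_preprocessors` and `apply_preprocessors` are hypothetical helper names, the toy frames are made up, and non-numeric columns are detected with `is_numeric_dtype` rather than `select_dtypes`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.preprocessing import LabelEncoder, StandardScaler

TARGET = "Listening Time (Morning/Afternoon/Night)"


def fit_preprocessors(train_df: pd.DataFrame):
    """Fit one LabelEncoder per non-numeric column and one scaler, on training data only."""
    encoded = train_df.copy()
    le_dict = {}
    for column in encoded.columns:
        if not is_numeric_dtype(encoded[column]):
            le_dict[column] = LabelEncoder().fit(encoded[column])
            encoded[column] = le_dict[column].transform(encoded[column])
    scaler = StandardScaler().fit(encoded.drop(columns=[TARGET]))
    return le_dict, scaler


def apply_preprocessors(df: pd.DataFrame, le_dict: dict, scaler: StandardScaler):
    """Encode and scale any split with the already-fitted objects."""
    encoded = df.copy()
    for column, le in le_dict.items():
        encoded[column] = le.transform(encoded[column])
    y = encoded[TARGET]
    X = encoded.drop(columns=[TARGET])
    X_scaled = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)
    return X_scaled, y


# Toy data (made up) to show the same mapping being applied to both splits.
toy_train = pd.DataFrame({
    "Country": ["RO", "US", "JP", "US"],
    "Age": [20, 30, 40, 50],
    TARGET: ["Morning", "Night", "Afternoon", "Night"],
})
toy_val = pd.DataFrame({
    "Country": ["US", "JP"],
    "Age": [25, 35],
    TARGET: ["Night", "Morning"],
})
toy_le_dict, toy_scaler = fit_preprocessors(toy_train)
toy_X_val, toy_y_val = apply_preprocessors(toy_val, toy_le_dict, toy_scaler)
```

With this layout, a category that appears only in the validation file raises an error in `transform` instead of silently receiving a different integer code.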

### Plotting the histograms and class distribution

An issue regarding the histograms was identified. Early in development, the columns containing strings instead of numbers could not be plotted as histograms. They were therefore plotted as class-distribution diagrams, first as bar charts, then as pie charts.

The meaning of these columns was also hard to interpret once encoded, so they were mapped accordingly; see the ```preprocess_data``` function, whose label encoders are used to restore the original category names before plotting.

In short, histograms are used for numerical columns, whereas class-distribution diagrams are used for all other columns.

In [2]:
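import os

import pandas as pd
import matplotlib.pyplot as plt

def plot_visualization(X, le_dict, scaler, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)

    # Undo the standard scaling so the plots show the original value ranges.
    temp = scaler.inverse_transform(X)
    df = pd.DataFrame(temp, columns=X.columns)

    # Map label-encoded columns back to their original category names.
    # Round first: inverse scaling can leave values such as 1.9999999.
    for key in le_dict:
        if key in df.columns:
            label_encoder = le_dict[key]
            df[key] = label_encoder.inverse_transform(df[key].round().astype(int))

    for column in df.columns:
        plt.figure(figsize=(8, 6))
        if pd.api.types.is_numeric_dtype(df[column]):
            # Numerical column: histogram.
            plt.title(f"Histogram of {column}")
            plt.xlabel(column)
            df[column].plot(kind='hist', bins=30, color='skyblue', edgecolor='black')
            plt.ylabel("Frequency")
        else:
            # Categorical column: class distribution as a pie chart.
            plt.title(f"Class distribution of {column}")
            category_counts = df[column].value_counts()
            category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90)
            plt.ylabel("")

        plt.tight_layout()
        # Avoid nesting double quotes inside the f-string (a syntax error before Python 3.12).
        plt.savefig(os.path.join(output_dir, f"{column}.png"))
        plt.close()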


In [3]:
import os

dataset_formatted_file_path = "/home/luca/SI/Project/data/preprocessed/"

train_dataset_path = "/home/luca/SI/Project/data/raw/train.csv"
dataset_formatted_file_name = "train.csv"

validation_dataset_path = "/home/luca/SI/Project/data/raw/val.csv"
val_formatted_file_name = "val.csv"

test_dataset_path = "/home/luca/SI/Project/data/raw/test.csv"
test_formatted_file_name = "test.csv"

(X, y, le_dict, scaler) = preprocess_data(dataset_path=train_dataset_path)
(X_val, y_val, _, _)    = preprocess_data(dataset_path=validation_dataset_path)  

# plot_visualization(X=X, le_dict=le_dict, scaler=scaler, output_dir="/home/luca/SI/Project/outputs/data_vizualization")
# Make sure the output directory exists before writing the preprocessed features.
os.makedirs(dataset_formatted_file_path, exist_ok=True)
X.to_csv(os.path.join(dataset_formatted_file_path, dataset_formatted_file_name), index=False)
X_val.to_csv(os.path.join(dataset_formatted_file_path, val_formatted_file_name), index=False)

## Constructing the model and the training process
 
After the data has been pre-processed, the first step is to wrap the output of the pre-processing phase in a ```StreamingPreferencesDataset``` object, which makes it easier to structure a multi-layer neural network (MLNN). Its implementation can be found in ```src/data_set.py```.
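
The contents of ```src/data_set.py``` are not reproduced here. As a rough illustration only, a dataset wrapper of this kind usually converts the features and labels to tensors and exposes `__len__` and `__getitem__`. The class name `TabularDatasetSketch` and the toy inputs below are hypothetical, and the real ```StreamingPreferencesDataset``` may differ.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class TabularDatasetSketch(Dataset):
    """Hypothetical stand-in for a tabular dataset wrapper."""

    def __init__(self, X: pd.DataFrame, y: pd.Series):
        # Features as float32, labels as int64 (what nn.CrossEntropyLoss expects).
        self.features = torch.tensor(X.to_numpy(), dtype=torch.float32)
        self.labels = torch.tensor(y.to_numpy(), dtype=torch.long)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, index: int):
        return self.features[index], self.labels[index]


# Toy inputs (made up) shaped like the output of the preprocessing step.
X_toy = pd.DataFrame({"a": [0.1, 0.2, 0.3], "b": [1.0, 2.0, 3.0]})
y_toy = pd.Series([0, 2, 1])
toy_dataset = TabularDatasetSketch(X_toy, y_toy)
toy_features, toy_label = toy_dataset[0]
```

An object with this interface can be passed directly to `torch.utils.data.DataLoader`, which handles batching and shuffling.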

Once the dataset object is set up, its attributes can be used to initialize the ```StreamingPreferencesDatasetMLP``` object, which implements the model to be trained. Its implementation is available in ```src/model.py```.
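
Since ```src/model.py``` is not included in this document, the sketch below only approximates the model. It mirrors the layer sizes visible in the printed model summary further down (10 inputs, hidden widths 128, 64, 32, 16, and 3 outputs, each hidden layer followed by batch normalization, ReLU, and dropout). The class name `FunnelMLPSketch` and the constructor defaults are assumptions.

```python
import torch
from torch import nn


class FunnelMLPSketch(nn.Module):
    """Hypothetical funnel-shaped MLP: Linear -> BatchNorm -> ReLU -> Dropout blocks."""

    def __init__(self, in_features=10, hidden=(128, 64, 32, 16),
                 num_classes=3, dropout_percentage=0.3):
        super().__init__()
        layers = []
        previous = in_features
        for width in hidden:
            layers += [
                nn.Linear(previous, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout_percentage),
            ]
            previous = width
        # Final layer outputs raw logits, one per class.
        layers.append(nn.Linear(previous, num_classes))
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)


sketch_model = FunnelMLPSketch()
sketch_model.eval()  # disable dropout and use running batch-norm statistics
with torch.no_grad():
    sketch_logits = sketch_model(torch.randn(4, 10))
```

The network returns logits rather than probabilities because `nn.CrossEntropyLoss` applies the log-softmax internally.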

An important hyperparameter for the training process is the ```batch_size``` used by the data loader. The model should be trained with several values of ```BATCH_SIZE``` to find the one that gives the best results. The same goes for the ```LEARNING_RATE``` parameter.
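
One simple way to try several batch sizes and learning rates is an exhaustive grid. The sketch below shows only the bookkeeping: `grid_search` is a hypothetical helper, and `dummy_train_and_evaluate` is a placeholder that returns an artificial score instead of training anything. In practice it would be replaced by a function that trains the model and returns its validation loss.

```python
from itertools import product


def grid_search(train_and_evaluate, batch_sizes, learning_rates):
    """Evaluate every (batch_size, learning_rate) pair and return the best one."""
    results = {}
    for batch_size, learning_rate in product(batch_sizes, learning_rates):
        results[(batch_size, learning_rate)] = train_and_evaluate(batch_size, learning_rate)
    # Lower is better, as with a validation loss.
    best = min(results, key=results.get)
    return best, results


def dummy_train_and_evaluate(batch_size, learning_rate):
    # Placeholder only: an artificial score, NOT a real training result.
    return abs(batch_size - 32) / 32 + abs(learning_rate - 1e-3)


best_config, all_results = grid_search(
    dummy_train_and_evaluate,
    batch_sizes=[16, 32, 64],
    learning_rates=[1e-4, 1e-3, 1e-2],
)
```

Logging each configuration to its own TensorBoard run directory makes the resulting curves easy to compare side by side.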

The ```EPOCHS``` parameter denotes the maximum number of passes the algorithm makes over the training set; early stopping may end training sooner.
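
The training function later in this document stops early once the validation loss has not improved for a given number of epochs, but never during the first epochs. The standalone sketch below isolates that rule so it can be checked without training a model. The helper name `early_stop_epoch` and the toy loss sequence are assumptions; the 30-epoch warm-up matches the `epoch > 30` condition in the training loop.

```python
def early_stop_epoch(val_losses, patience, min_epochs=30):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        # Same rule as the training loop: only stop after the warm-up period.
        if epoch > min_epochs and epochs_without_improvement >= patience:
            return epoch + 1
    return None


# Toy sequence: the loss improves for 35 epochs, then stays flat.
toy_losses = [1 / (i + 1) for i in range(35)] + [1.0] * 10
stopped_at = early_stop_epoch(toy_losses, patience=5)
```

With the values used in this project (```EARLY_STOPPING_PATIENCE = 200``` and ```EPOCHS = 1000```), training ends early only after 200 consecutive epochs without a new best validation loss.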

Thus, the first code fragment handles the initialization of the different hyperparameters and of the model.
A further analysis will observe the model's reaction to different optimizers, such as Adam, SGD, and SGD with momentum.
The ```scheduler``` provides the model with a ready-made learning-rate scheduler. Its purpose is to lower the learning rate during training (here, when the validation loss stops improving) in order to get better results.
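
To make the scheduler's behavior concrete, the following sketch feeds a constant, non-improving loss to `ReduceLROnPlateau` and records the learning rate after each step. The tiny `nn.Linear` model, the starting learning rate of 0.1, and `patience=2` are chosen only to keep the demonstration short; they are not the project's settings.

```python
from torch import nn, optim

demo_model = nn.Linear(2, 1)
demo_optimizer = optim.SGD(demo_model.parameters(), lr=0.1, momentum=0.9)
demo_scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    demo_optimizer, mode="min", factor=0.7, patience=2
)

lr_history = []
for fake_val_loss in [1.0, 1.0, 1.0, 1.0, 1.0]:  # made-up, non-improving losses
    # The scheduler is stepped with the monitored metric, once per epoch.
    demo_scheduler.step(fake_val_loss)
    lr_history.append(demo_optimizer.param_groups[0]["lr"])
```

The first value counts as an improvement over the initial best; once more than `patience` consecutive epochs fail to improve, the learning rate is multiplied by `factor`.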


In [None]:
from torch.utils.data   import DataLoader
from sklearn.utils.class_weight import compute_class_weight
from src.data_set       import StreamingPreferencesDataset
from src.model          import StreamingPreferencesDatasetMLP

import torch
import torch.optim as optim
import torch.nn as nn
import numpy as np
 
BATCH_SIZE                  = 16
DROPOUT_PERCENTAGE          = 0.4
LEARNING_RATE               = 2e-5
OPTIMIZER_STEP_SIZE         = 50
EARLY_STOPPING_PATIENCE     = 200
EPOCHS                      = 1000
DEVICE                      = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

train_dataset   = StreamingPreferencesDataset(X, y)
val_dataset     = StreamingPreferencesDataset(X_val, y_val)
train_loader    = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader      = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)  # order does not matter for evaluation


all_labels = [int(label) for _, label in train_dataset]
classes = np.unique(all_labels)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=all_labels
)
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(DEVICE)


model           = StreamingPreferencesDatasetMLP(dropout_percentage=DROPOUT_PERCENTAGE).to(DEVICE)
print(model)
criterion       = nn.CrossEntropyLoss(weight=class_weights_tensor, label_smoothing=0.3)  

optimizer       = optim.SGD(model.parameters(), momentum=0.9, lr=LEARNING_RATE) 
# optimizer     = optim.AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
# optimizer       = optim.RMSprop(model.parameters(), lr=LEARNING_RATE, alpha=0.9, eps=1e-8)

scheduler      = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, mode='min', factor=0.7, patience=20, threshold_mode='rel',threshold=1e-4)
# scheduler     = optim.lr_scheduler.StepLR(step_size=OPTIMIZER_STEP_SIZE, optimizer=optimizer, gamma=0.1)
# scheduler       = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-6)



StreamingPreferencesDatasetMLP(
  (model): Sequential(
    (0): Linear(in_features=10, out_features=128, bias=True)
    (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.3, inplace=False)
    (4): Linear(in_features=128, out_features=64, bias=True)
    (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.3, inplace=False)
    (8): Linear(in_features=64, out_features=32, bias=True)
    (9): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.3, inplace=False)
    (12): Linear(in_features=32, out_features=16, bias=True)
    (13): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): Dropout(p=0.3, inplace=False)
    (16): Linear(in_features=16, out_features=3, bias=True)
  )
)
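
The setup cell above passes class weights to the loss function. With `class_weight='balanced'`, scikit-learn computes each weight as `n_samples / (n_classes * count_of_that_class)`, so rarer classes receive larger weights. The sketch below checks this on a toy label vector; the labels are made up and do not reflect the real class distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

toy_labels = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])  # made-up labels
toy_classes = np.unique(toy_labels)

sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=toy_classes, y=toy_labels
)
# Same formula written out by hand.
manual_weights = len(toy_labels) / (len(toy_classes) * np.bincount(toy_labels))
```

Here class 1 has half as many samples as the other two classes, so its weight is twice as large.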


In [12]:
from sklearn.metrics            import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.tensorboard    import SummaryWriter

writer = SummaryWriter(log_dir="runs/64funnel,CrossEntropyLoss,dp=0.3,ReduceLROnPlateau,SGD(momentum=0.9, lr=1e-4),batch_size=16,EPOCHS=1000")

def compute_metrics(all_preds, all_labels, phase, epoch):
    acc  = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds,   average='macro', zero_division=0)
    rec  = recall_score(all_labels, all_preds,      average='macro', zero_division=0)
    f1   = f1_score(all_labels, all_preds,          average='macro', zero_division=0)

    writer.add_scalar(f"{phase}/Accuracy", acc, epoch)
    writer.add_scalar(f"{phase}/Precision", prec, epoch)
    writer.add_scalar(f"{phase}/Recall", rec, epoch)
    writer.add_scalar(f"{phase}/F1-Score", f1, epoch)

    return acc, prec, rec, f1
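
The metrics above use `average='macro'`, which computes the score for each class separately and then takes the unweighted mean, so every class counts equally regardless of its size. A small sketch on made-up predictions illustrates this for precision:

```python
import numpy as np
from sklearn.metrics import precision_score

toy_true = [0, 0, 0, 0, 1, 1, 2, 2]  # made-up ground truth
toy_pred = [0, 0, 0, 1, 1, 2, 2, 2]  # made-up predictions

per_class_precision = precision_score(toy_true, toy_pred, average=None, zero_division=0)
macro_precision = precision_score(toy_true, toy_pred, average="macro", zero_division=0)
```

The macro value equals `per_class_precision.mean()`, which makes it more sensitive to poor performance on small classes than plain accuracy.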

In [13]:
import torch
import os


def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs, patience):
    best_loss = float('inf')
    epochs_without_improvement = 0 

    for epoch in range(epochs):

        model.train()
        running_loss, total = 0.0, 0
        train_preds, train_labels = [], []

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels.long())  # same label dtype as in the validation pass
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)
            total += labels.size(0)

            preds = torch.argmax(outputs, dim=1)  
            train_preds.extend(preds.cpu().numpy())
            train_labels.extend(labels.cpu().numpy())

        epoch_loss = running_loss / total
        compute_metrics(train_preds, train_labels, "train", epoch)
        writer.add_scalar('Loss/train', epoch_loss, epoch)
        
        model.eval()
        val_loss, val_total = 0.0, 0
        val_preds, val_labels = [], []

        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
                outputs = model(inputs)
                loss = criterion(outputs, labels.long())

                val_loss += loss.item() * inputs.size(0)
                val_total += labels.size(0)

                preds = torch.argmax(outputs, dim=1)  
                val_preds.extend(preds.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())

        val_epoch_loss = val_loss / val_total
        compute_metrics(val_preds, val_labels, "val", epoch)

        scheduler.step(val_epoch_loss)

        current_lr = optimizer.param_groups[0]['lr']
        writer.add_scalar('LearningRate', current_lr, epoch)
        writer.add_scalar('Loss/val', val_epoch_loss, epoch)

        print(f"Epoch [{epoch+1}/{epochs}] Training Loss: {epoch_loss:.4f} Validation Loss: {val_epoch_loss:.4f}")

        if val_epoch_loss < best_loss:
            best_loss = val_epoch_loss
            epochs_without_improvement = 0
            os.makedirs("./models", exist_ok=True)
            torch.save(model.state_dict(), "./models/predat_model_best.pth")
        else:
            epochs_without_improvement += 1
        if epoch > 30 and epochs_without_improvement >= patience:
            print(f"Early stopping at epoch {epoch+1}")
            break

    torch.save(model.state_dict(), "./models/codiax_model_final.pth")


# train_model already saves the final weights to ./models/codiax_model_final.pth
train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, EPOCHS, EARLY_STOPPING_PATIENCE)

Epoch [1/1000] Training Loss: 1.1633 Validation Loss: 1.1252
Epoch [2/1000] Training Loss: 1.1636 Validation Loss: 1.1211
Epoch [3/1000] Training Loss: 1.1555 Validation Loss: 1.1220
Epoch [4/1000] Training Loss: 1.1595 Validation Loss: 1.1211
Epoch [5/1000] Training Loss: 1.1488 Validation Loss: 1.1218
Epoch [6/1000] Training Loss: 1.1556 Validation Loss: 1.1213
Epoch [7/1000] Training Loss: 1.1490 Validation Loss: 1.1192
Epoch [8/1000] Training Loss: 1.1498 Validation Loss: 1.1214
Epoch [9/1000] Training Loss: 1.1510 Validation Loss: 1.1187
Epoch [10/1000] Training Loss: 1.1459 Validation Loss: 1.1176
Epoch [11/1000] Training Loss: 1.1475 Validation Loss: 1.1173
Epoch [12/1000] Training Loss: 1.1438 Validation Loss: 1.1179
Epoch [13/1000] Training Loss: 1.1513 Validation Loss: 1.1164
Epoch [14/1000] Training Loss: 1.1454 Validation Loss: 1.1150
Epoch [15/1000] Training Loss: 1.1436 Validation Loss: 1.1135
Epoch [16/1000] Training Loss: 1.1374 Validation Loss: 1.1137
Epoch [17/1000] T