# **Global Music Streaming Trends**
Luca-Andrei Codorean, 30233-1 CTI-RO @2025

This projects consists of an implementation of a text classifer that wishes to succesfully predict cases of heart failure.
The used dataset can be found at: https://www.kaggle.com/datasets/atharvasoundankar/global-music-streaming-trends-and-listener-insights

In order to proceed with the solution, the dependencies found in ``requirements.txt`` should be installed, using the following command ```pip install -r requirements.
txt```

## Data Preprocessing

The really first step is realted to data-preprocessing and visualization. The first function will just take one of the three datasets obtained after ```scr/data_loader.py``` script has been run.
The ```data_loader``` script splitted the dataset in three datasets as follows: train, test, and val. 

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def preprocess_data(dataset_path: str):
    dataset_df = pd.read_csv(dataset_path)

    scaler  = MinMaxScaler()
    le_dict = {}

    for column in dataset_df.select_dtypes(include=["object"]).columns:
        le = LabelEncoder() 
        dataset_df[column] = le.fit_transform(dataset_df[column])  
        le_dict[column] = le  

    y = dataset_df["Listening Time (Morning/Afternoon/Night)"]
    X = dataset_df.drop(columns=["Listening Time (Morning/Afternoon/Night)"])

    X_scaled = scaler.fit_transform(X)
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
    
    return X_scaled_df, y, le_dict, scaler

## Plotting the histograms and class distribution

An issue reagrading plotting the histograms has been identified. In the early stages of the development, the columns containing strings instead of numbers were unable to be plotted as histograms. For that, they were plotted as class distribution diagrams, firstly bars, then pie charts. 

It's been a problem with understanding the meaning of these columns so they were mapped accordingly. See ```preprocess_data``` function.

Basically, the histograms are used for numerical columns whereas the class distribution diagrams are used for the other columns.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

def plot_visualization(X, le_dict, scaler, output_dir: str):


    temp = scaler.inverse_transform(X)
    df = pd.DataFrame(temp, columns=X.columns)

    
    for key  in le_dict:
        if key in df.columns:
            label_encoder = le_dict[key]
            df[key] = label_encoder.inverse_transform(df[key].astype(int))

        
    for column in df.columns:
        plt.figure(figsize=(8, 6))
        if pd.api.types.is_numeric_dtype(df[column]):
            plt.title(f"Histogram of {column}")
            plt.xlabel(column)
            df[column].plot(kind='hist', bins=30, color='skyblue', edgecolor='black')
            plt.ylabel("Frequency")
        else:
            plt.title(f"Class distribution of {column}")
            category_counts = df[column].value_counts()
            category_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90)
            plt.ylabel("")

        plt.tight_layout()
        plt.savefig(f"{output_dir+"/"}{column}.png")
        plt.close()


In [3]:
import os

dataset_formatted_file_path = "/home/luca/SI/Project/data/preprocessed/"

train_dataset_path = "/home/luca/SI/Project/data/raw/train.csv"
dataset_formatted_file_name = "train.csv"

validation_dataset_path = "/home/luca/SI/Project/data/raw/val.csv"
val_formtted_file_name = "val.csv"

test_dataset_path = "/home/luca/SI/Project/data/raw/test.csv"
test_formtted_file_name = "test.csv"

(X, y, le_dict, scaler) = preprocess_data(dataset_path=train_dataset_path)
(X_val, y_val, _, _)    = preprocess_data(dataset_path=validation_dataset_path)  
(X_test, y_test, _, _)  = preprocess_data(dataset_path=test_dataset_path)

# plot_visualization(X=X, le_dict=le_dict, scaler=scaler, output_dir="/home/luca/SI/Project/outputs/data_vizualization")
X.to_csv(os.path.join(dataset_formatted_file_path, dataset_formatted_file_name), index=False)
X_val.to_csv(os.path.join(dataset_formatted_file_path, val_formtted_file_name), index=False)
X_test.to_csv(os.path.join(dataset_formatted_file_path, test_formtted_file_name), index=False)

## Constructing the model and the training process
 



### Training using a custom neuronal network

After the data has been pre-processed, the very first step is to combine the output of the pre-processing phase into a ```StreamingPreferencesDataset``` object. This way, we will be able to structure a MLNN easier. For this, the ```StreamingPreferencesDatasetMLP``` class has been created. It's implementation can be found in ```src.data_set.py```.

Once the dataset object is set-up, it's attributes can be used to inialized the ```HeartFailureMLP``` object that is responsible to implement the training model. Its implementation is available in ```src.model.py```. 

An important hyperparameter for the training process is the ```batch_size``` used by the dataloader. The model should be tested using multiple values for the ```BATCH_SIZE``` parameter in order to get the best results. Same goes for the ```LEARNINIG_RATE``` parameter.

```EPOCHS``` parameter denotes the number of times the algorithm goes through the dataset.

Thus, the first code fragment will handle initalization of diferent hyperparameters and of the model.
Another analisys will be done in order to observe model's reaction to different optimizers such as Adams, SDG, SDG with momentum.
The ```scheduler``` is used to provide the model with an already-implemented learning-rate scheduler. Its purpose is to reduce LR's value in order to get better results.

In [4]:
from sklearn.metrics            import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.tensorboard    import SummaryWriter

def compute_metrics(writer, all_preds, all_labels, phase, epoch):
    acc  = accuracy_score(all_labels, all_preds)
    prec = precision_score(all_labels, all_preds,   average='macro', zero_division=0)
    rec  = recall_score(all_labels, all_preds,      average='macro', zero_division=0)
    f1   = f1_score(all_labels, all_preds,          average='macro', zero_division=0)

    writer.add_scalar(f"{phase}/Accuracy", acc, epoch)
    writer.add_scalar(f"{phase}/Precision", prec, epoch)
    writer.add_scalar(f"{phase}/Recall", rec, epoch)
    writer.add_scalar(f"{phase}/F1-Score", f1, epoch)

    return acc, prec, rec, f1

In [5]:
from torch.utils.data   import DataLoader
from sklearn.utils.class_weight import compute_class_weight
from src.data_set       import StreamingPreferencesDataset
from src.model          import StreamingPreferencesDatasetMLP

import torch
import torch.optim as optim
import torch.nn as nn
import numpy as np
 
BATCH_SIZE                  = 64
DROPOUT_PERCENTAGE          = 0.2
LEARNING_RATE               = 1e-2
OPTIMIZER_STEP_SIZE         = 70
EARLY_STOPPING_PATIENCE     = 1000
EPOCHS                      = 300
DEVICE                      = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

train_dataset   = StreamingPreferencesDataset(X, y)
val_dataset     = StreamingPreferencesDataset(X_val, y_val)
train_loader    = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader      = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True)

writer = SummaryWriter(log_dir="runs/temp")


all_labels = [int(label) for _, label in train_dataset]
classes = np.unique(all_labels)
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=classes,
    y=all_labels
)
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float).to(DEVICE)


model           = StreamingPreferencesDatasetMLP(dropout_percentage=DROPOUT_PERCENTAGE).to(DEVICE)
print(model)

criterion       = nn.CrossEntropyLoss(weight=class_weights_tensor, label_smoothing=0.3)  
optimizer       = optim.SGD(model.parameters(), lr=LEARNING_RATE) #momentum=0.9, 
# optimizer     = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
# optimizer       = optim.RMSprop(model.parameters(), lr=LEARNING_RATE, alpha=0.9, eps=1e-8)

scheduler      = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, mode='min', factor=0.7, patience=10, threshold_mode='rel',threshold=1e-4)
# scheduler     = optim.lr_scheduler.StepLR(step_size=OPTIMIZER_STEP_SIZE, optimizer=optimizer, gamma=0.4)
# scheduler       = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-6)



StreamingPreferencesDatasetMLP(
  (model): Sequential(
    (0): Linear(in_features=10, out_features=128, bias=True)
    (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.2, inplace=False)
    (4): Linear(in_features=128, out_features=64, bias=True)
    (5): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.2, inplace=False)
    (8): Linear(in_features=64, out_features=32, bias=True)
    (9): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.2, inplace=False)
    (12): Linear(in_features=32, out_features=16, bias=True)
    (13): ReLU()
    (14): Linear(in_features=16, out_features=3, bias=True)
  )
)


In [6]:
import matplotlib.pyplot as plt
import numpy as np
import torch
from sklearn.metrics import confusion_matrix
from torch.utils.tensorboard import SummaryWriter

def log_confusion_matrix_tensorboard(y_true, y_pred, label_map, writer, global_step):
 
   
    if torch.is_tensor(y_true):
        y_true = y_true.cpu().numpy()
    if torch.is_tensor(y_pred):
        y_pred = y_pred.cpu().numpy()

 
    cm = confusion_matrix(y_true, y_pred)
    classes = [label_map[i] for i in range(len(label_map))]


    fig, ax = plt.subplots(figsize=(8, 8))
    im = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    ax.figure.colorbar(im, ax=ax)
    ax.set(
        xticks=np.arange(cm.shape[1]),
        yticks=np.arange(cm.shape[0]),
        xticklabels=classes,
        yticklabels=classes,
        ylabel='True label',
        xlabel='Predicted label',
        title='Confusion Matrix'
    )
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")


    fmt = 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()

    writer.add_figure("Confusion Matrix", fig, global_step)
    plt.close(fig)


In [7]:
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs, writer):

    for epoch in range(epochs):

        model.train()
        running_loss, total = 0.0, 0
        train_preds, train_labels = [], []

        for inputs, labels in train_loader:
            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * inputs.size(0)
            total += labels.size(0)

            preds = torch.argmax(outputs, dim=1)  
            train_preds.extend(preds.cpu().numpy())
            train_labels.extend(labels.cpu().numpy())

        epoch_loss = running_loss / total
        compute_metrics(writer, train_preds, train_labels, "train", epoch)
        writer.add_scalar('Loss/train', epoch_loss, epoch)
        
        model.eval()
        val_loss, val_total = 0.0, 0
        val_preds, val_labels = [], []

        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
                outputs = model(inputs)
                loss = criterion(outputs, labels)

                val_loss += loss.item() * inputs.size(0)
                val_total += labels.size(0)

                preds = torch.argmax(outputs, dim=1)  
                val_preds.extend(preds.cpu().numpy())
                val_labels.extend(labels.cpu().numpy())

        val_epoch_loss = val_loss / val_total
        compute_metrics(writer, val_preds, val_labels, "val", epoch)

        scheduler.step(val_epoch_loss)

        current_lr = optimizer.param_groups[0]['lr']
        writer.add_scalar('LearningRate', current_lr, epoch)
        writer.add_scalar('Loss/val', val_epoch_loss, epoch)

        print(f"Epoch [{epoch+1}/{epochs}] Training Loss: {epoch_loss:.4f} Validation Loss: {val_epoch_loss:.4f}")

# train_model(model, train_loader, val_loader, criterion=criterion, optimizer=optimizer, scheduler=scheduler, epochs=EPOCHS, writer=writer)
# torch.save(model.state_dict(), "./models/codiax_model_final.pth")


In [8]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from torchmetrics.classification import MulticlassConfusionMatrix

def log_confusion_matrix_tensorboard(y_true, y_pred, label_map, writer, global_step=0):
    num_classes = len(label_map)
    print(num_classes)

    cm_metric = MulticlassConfusionMatrix(num_classes=3)
    confmat = cm_metric(y_pred, y_true)  


    confmat = confmat.cpu().numpy()
    labels = list(label_map.values())
    df_cm = pd.DataFrame(confmat, index=labels, columns=labels)

 
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(df_cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title("Confusion Matrix")

    
    writer.add_figure("ConfusionMatrix", fig, global_step=global_step)
    plt.close(fig)


In [9]:
from collections import Counter
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader

def evaluate_model(model, dataset, batch_size, device, label_map, y_true=None):

    model.to(device)
    model.eval()

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    all_preds = []
    with torch.no_grad():
        for inputs, _ in loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            preds = torch.argmax(outputs, dim=1)
            all_preds.append(preds.cpu())
    
    final_preds = torch.cat(all_preds)
    final_preds_np = final_preds.numpy()


    decoded_preds = [label_map[int(p)] for p in final_preds_np]
    label_counts = Counter(decoded_preds)
    total = sum(label_counts.values())
    label_percentages = {label: (count / total) * 100 for label, count in label_counts.items()}

    df_percentages = pd.DataFrame({
        'Label': list(label_percentages.keys()),
        'Percentage': list(label_percentages.values())
    }).sort_values(by='Label')

    accuracy = None
    if y_true is not None:
        y_true_np = y_true.cpu().numpy() if torch.is_tensor(y_true) else y_true
        accuracy = accuracy_score(y_true_np, final_preds_np)

    return df_percentages, accuracy, final_preds_np, final_preds


In [10]:
label_map = {0: "Morning", 1: "Afternoon", 2: "Night"}

test_dataset   = StreamingPreferencesDataset(X_test, y_test)
                                                                 
df_percentages, acc, final_preds_np, final_preds = evaluate_model(model, test_dataset, BATCH_SIZE, DEVICE, label_map, y_true=y_test)
print(f"Accuracy: {acc:.4f}")

all_labels = []
loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
for _, labels in loader:
    all_labels.append(labels)
y_true_tensor = torch.cat(all_labels).to(DEVICE) 

y_pred_np = final_preds.cpu().numpy() if torch.is_tensor(final_preds) else final_preds
y_pred_tensor = torch.tensor(y_pred_np, dtype=torch.long)

log_confusion_matrix_tensorboard(y_true=y_true_tensor, y_pred=y_pred_tensor, label_map=label_map, writer=writer, global_step=0)


Accuracy: 0.3490
3


RuntimeError: Encountered different devices in metric calculation (see stacktrace for details). This could be due to the metric class not being on the same device as input. Instead of `metric=MulticlassConfusionMatrix(...)` try to do `metric=MulticlassConfusionMatrix(...).to(device)` where device corresponds to the device of the input.

### Training using Logistic Regression

Logistic Regression is a classic method of classification in machine learning. The difference between Logistic Regression and Liniar Regression is the capability of predicting continous outcomes. For multi-class clasification such as this one, Logistic Regression uses the softmax function. In this training process ```CrossEntropyLoss``` will be used as criterion. This removes the need of the explicity softmax layer. 

In [None]:
from src.logistic_regression import LogisticRegression

model           = LogisticRegression().to(DEVICE)

LEARNING_RATE = 1e-2
criterion      = nn.CrossEntropyLoss(label_smoothing=0.1)  
# optimizer      = optim.SGD(model.parameters(), lr=LEARNING_RATE)
optimizer     = optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=1e-4)
scheduler      = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, mode='min', factor=0.7, patience=10, threshold_mode='rel',threshold=1e-4)

logisticWriter = SummaryWriter(log_dir="runs/temp2")


train_model(model=model, train_loader=train_loader, val_loader=val_loader, criterion=criterion, optimizer=optimizer, scheduler=scheduler, epochs=EPOCHS, writer=logisticWriter)
torch.save(model.state_dict(), "./models/logistic_model_final.pth")

df_percentages, acc, final_preds_np, final_preds = evaluate_model(model, test_dataset, BATCH_SIZE, DEVICE, label_map, y_true=y_test)
print(df_percentages)
print(f"Accuracy: {acc:.4f}")

all_labels = []
loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
for _, labels in loader:
    all_labels.append(labels)
y_true_tensor = torch.cat(all_labels).to(DEVICE) 

y_pred_np = final_preds.cpu().numpy() if torch.is_tensor(final_preds) else final_preds
y_pred_tensor = torch.tensor(y_pred_np, dtype=torch.long)

log_confusion_matrix_tensorboard(y_true=y_true_tensor, y_pred=y_pred_tensor, label_map=label_map, writer=logisticWriter, global_step=0)


### Ensemble methods



In [None]:
from src.bagging_ensemble import BaggingEnsemble
from src.logistic_regression import LogisticRegression

X_train_tensor = torch.tensor(X.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y.values, dtype=torch.long)
X_test_tensor  = torch.tensor(X_test.values, dtype=torch.float32)

logisticRegressionEnsemble = BaggingEnsemble(
    base_model_class=LogisticRegression,
    n_models=100,
    lr=LEARNING_RATE,
    epochs=EPOCHS,
)

logisticEnsambleWriter = SummaryWriter("runs/logistic_ensamble")

logisticRegressionEnsamble.fit(X_train_tensor, y_train_tensor, logisticEnsambleWriter)
y_pred = logisticRegressionEnsamble.predict(X_test_tensor)
acc = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {acc:.4f}")

MLPEnsemble = BaggingEnsemble(
    base_model_class=StreamingPreferencesDatasetMLP,
    n_models=100,
    lr=LEARNING_RATE,
    epochs=EPOCHS,
)

logisticEnsambleWriter = SummaryWriter("runs/logistic_ensamble")

logisticRegressionEnsamble.fit(X_train_tensor, y_train_tensor, logisticEnsambleWriter)
y_pred = logisticRegressionEnsamble.predict(X_test_tensor)
acc = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {acc:.4f}")

Acuratețe ensemble: 0.3490
