## **Visual Information Processing and Management**
---
---

Università degli Studi Milano Bicocca \
MSc in Computer Science, A.Y. 2025/2026

---
---

#### **Group members:**
— Oleksandra Golub (856706) \
— Andrea Spagnolo (879254)

## **Libraries**

In [95]:
# load libraries
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
import torch  # required by the device check below
from PIL import Image
from collections import Counter


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


## **Dataset paths and structure analysis**



In [71]:
BASE1 = "/kaggle/input/datasets/oleksandragolub/visual-exam-dataset-original/exam_dataset/"

TRAIN1 = os.path.join(BASE1, "train/train")
VAL1   = os.path.join(BASE1, "valid/valid")
TEST1  = os.path.join(BASE1, "test/test")
TESTD1 = os.path.join(BASE1, "test_degradato/test_degradato")

LABELS1 = os.path.join(BASE1, "sports_labels.csv")

In [72]:
BASE2 = "/kaggle/input/datasets/andreaspagnolo/visual-exam-dataset/visual_dataset/"

TRAIN2 = os.path.join(BASE2, "train")
VAL2   = os.path.join(BASE2, "valid")
TEST2  = os.path.join(BASE2, "test")
TESTD2 = os.path.join(BASE2, "test_degradato")

LABELS2 = os.path.join(BASE2, "sports_labels.csv")

In [73]:
BASE3 = "/kaggle/input/datasets/andreaspagnolo/visual-exam-dataset-2/visual_dataset/"

TRAIN3 = os.path.join(BASE3, "train")
VAL3   = os.path.join(BASE3, "valid")
TEST3  = os.path.join(BASE3, "test")
TESTD3 = os.path.join(BASE3, "test_degradato")

LABELS3 = os.path.join(BASE3, "sports_labels.csv")

In [74]:
DATASETS = {
    "original": {
        "base": BASE1,
        "train": TRAIN1, "valid": VAL1, "test": TEST1, "testd": TESTD1,
        "labels": LABELS1
    },
    "db2": {
        "base": BASE2,
        "train": TRAIN2, "valid": VAL2, "test": TEST2, "testd": TESTD2,
        "labels": LABELS2
    },
    "db3": {
        "base": BASE3,
        "train": TRAIN3, "valid": VAL3, "test": TEST3, "testd": TESTD3,
        "labels": LABELS3
    }
}

In [75]:
def check_dataset(train, val, test, testd):
    print("Train exists:", os.path.exists(train))
    print("Valid exists:", os.path.exists(val))
    print("Test exists:", os.path.exists(test))
    print("Test degradato exists:", os.path.exists(testd))
    print("-" * 40)

print("DATASET 1")
check_dataset(TRAIN1, VAL1, TEST1, TESTD1)

print("DATASET 2")
check_dataset(TRAIN2, VAL2, TEST2, TESTD2)

print("DATASET 3")
check_dataset(TRAIN3, VAL3, TEST3, TESTD3)

DATASET 1
Train exists: True
Valid exists: True
Test exists: True
Test degradato exists: True
----------------------------------------
DATASET 2
Train exists: True
Valid exists: True
Test exists: True
Test degradato exists: True
----------------------------------------
DATASET 3
Train exists: True
Valid exists: True
Test exists: True
Test degradato exists: True
----------------------------------------


In [76]:
def count_images_per_class(directory):
    class_counts = {}

    for class_name in sorted(os.listdir(directory)):
        class_path = os.path.join(directory, class_name)
        if os.path.isdir(class_path):
            images = [
                f for f in os.listdir(class_path)
                if f.lower().endswith(('.jpg', '.jpeg', '.png'))
            ]
            class_counts[class_name] = len(images)

    return class_counts


def print_class_distribution(directory, name):
    counts = count_images_per_class(directory)

    print(f"\n{name}")
    print("=" * 50)

    for cls, n in counts.items():
        print(f"{cls:25s} : {n}")

    print("-" * 50)
    print("Total:", sum(counts.values()))
    print("Number of classes:", len(counts))
    print("=" * 50)


In [77]:
def print_totals_only(paths, dataset_name):
    print(f"\n{dataset_name}")
    print("-" * 40)
    for split, path in paths.items():
        counts = count_images_per_class(path)
        print(f"{split.upper():15s} -> {sum(counts.values())} images")
    print("-" * 40)
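
As a quick sanity check of the counting helper, the sketch below builds a throwaway directory tree and runs the same logic on it. The folder and file names are made up for illustration; the function body mirrors `count_images_per_class`. Only files directly inside each class folder are counted, and only by extension.

```python
import os
import tempfile

def count_images_per_class(directory):
    """Count image files (by extension) in each first-level class folder."""
    class_counts = {}
    for class_name in sorted(os.listdir(directory)):
        class_path = os.path.join(directory, class_name)
        if os.path.isdir(class_path):
            images = [
                f for f in os.listdir(class_path)
                if f.lower().endswith(('.jpg', '.jpeg', '.png'))
            ]
            class_counts[class_name] = len(images)
    return class_counts

# Hypothetical toy layout: two classes, one stray non-image file
with tempfile.TemporaryDirectory() as root:
    layout = {
        "archery": ["a.jpg", "b.PNG", "notes.txt"],
        "bowling": ["c.jpeg"],
    }
    for cls, files in layout.items():
        os.makedirs(os.path.join(root, cls))
        for name in files:
            open(os.path.join(root, cls, name), "wb").close()

    demo_counts = count_images_per_class(root)

print(demo_counts)  # {'archery': 2, 'bowling': 1}
```

The extension check is case-insensitive, so `b.PNG` is counted, while `notes.txt` is ignored.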


In [78]:
splits = ["train", "valid", "test", "testd"]
tables_by_split = {}

for split in splits:
    cols = []
    for key in ["original", "db2", "db3"]:
        counts = count_images_per_class(DATASETS[key][split])
        cols.append(pd.Series(counts, name=key))
    table = pd.concat(cols, axis=1).fillna(0).astype(int).sort_index()
    table.loc["TOTAL"] = table.sum(axis=0)
    tables_by_split[split] = table
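
The table construction relies on `pd.concat(..., axis=1)` aligning the per-dataset Series on their index (the class names), so a class that is missing from one dataset becomes `NaN` and then `0`. A minimal sketch with made-up counts (the class names and numbers below are illustrative, not taken from the datasets):

```python
import pandas as pd

# Illustrative counts only: "curling" is absent from the second dataset
counts_a = {"archery": 3, "bowling": 2, "curling": 4}
counts_b = {"archery": 3, "bowling": 5}

cols = [pd.Series(counts_a, name="a"), pd.Series(counts_b, name="b")]

# Outer join on the class names, then fill the gaps with 0
table = pd.concat(cols, axis=1).fillna(0).astype(int).sort_index()
table.loc["TOTAL"] = table.sum(axis=0)

print(table.to_string())
```

Because the `TOTAL` row is stored in the same table, any later per-class comparison also reports the difference in totals.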

In [79]:
print("=== TRAIN ===")
print(tables_by_split["train"].to_string())

=== TRAIN ===
                       original    db2    db3
air hockey                  112    112    191
ampute football             112    112    191
archery                     132    132    191
arm wrestling                99     99    191
axe throwing                113    113    191
balance beam                147    147    191
barell racing               123    123    191
baseball                    174    174    191
basketball                  169    169    191
baton twirling              108    108    191
bike polo                   110    110    191
billiards                   145    145    191
bmx                         140    140    191
bobsled                     138    138    191
bowling                     120    120    191
boxing                      116    116    191
bull riding                 149    149    191
bungee jumping              125    125    191
canoe slamon                164    164    191
cheerleading                131    131    191
chuckwagon racing   

In [80]:
print("=== VALID ===")
display(tables_by_split["valid"])

=== VALID ===


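| | original | db2 | db3 |
|---|---|---|---|
| air hockey | 5 | 5 | 5 |
| ampute football | 5 | 5 | 5 |
| archery | 5 | 5 | 5 |
| arm wrestling | 5 | 5 | 5 |
| axe throwing | 5 | 5 | 5 |
| ... | ... | ... | ... |
| weightlifting | 5 | 5 | 5 |
| wheelchair basketball | 5 | 5 | 5 |
| wheelchair racing | 5 | 5 | 5 |
| wingsuit flying | 5 | 5 | 5 |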


In [81]:
print("=== TEST ===")
display(tables_by_split["test"])

=== TEST ===


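| | original | db2 | db3 |
|---|---|---|---|
| air hockey | 5 | 5 | 5 |
| ampute football | 5 | 5 | 5 |
| archery | 5 | 5 | 5 |
| arm wrestling | 5 | 5 | 5 |
| axe throwing | 5 | 5 | 5 |
| ... | ... | ... | ... |
| weightlifting | 5 | 5 | 5 |
| wheelchair basketball | 5 | 5 | 5 |
| wheelchair racing | 5 | 5 | 5 |
| wingsuit flying | 5 | 5 | 5 |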


In [82]:
print("=== TEST_DEGRADATO ===")
display(tables_by_split["testd"])

=== TEST_DEGRADATO ===


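| | original | db2 | db3 |
|---|---|---|---|
| air hockey | 5 | 5 | 5 |
| ampute football | 5 | 5 | 5 |
| archery | 5 | 5 | 5 |
| arm wrestling | 5 | 5 | 5 |
| axe throwing | 5 | 5 | 5 |
| ... | ... | ... | ... |
| weightlifting | 5 | 5 | 5 |
| wheelchair basketball | 5 | 5 | 5 |
| wheelchair racing | 5 | 5 | 5 |
| wingsuit flying | 5 | 5 | 5 |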


In [87]:
ds_a = "original"
ds_b = "db2"

# per-class count differences (ds_a - ds_b) for each split
for split in ["train", "valid", "test", "testd"]:

    table = tables_by_split[split]

    diff = table[ds_a] - table[ds_b]
    diff_nonzero = diff[diff != 0]

    print("\n" + "="*60)
    print(f"DIFFERENCES {ds_a} - {ds_b} in {split.upper()}")
    print("="*60)

    if len(diff_nonzero) == 0:
        print("No differences!")
    else:
        print(diff_nonzero.to_string())


DIFFERENCES original - db2 in TRAIN
sky surfing   -41
TOTAL         -41

DIFFERENCES original - db2 in VALID
No differences!

DIFFERENCES original - db2 in TEST
No differences!

DIFFERENCES original - db2 in TESTD
No differences!


In [91]:
ds_a = "original"
ds_b = "db3"

# per-class count differences (ds_a - ds_b) for each split
for split in ["train", "valid", "test", "testd"]:

    table = tables_by_split[split]

    diff = table[ds_a] - table[ds_b]
    diff_nonzero = diff[diff != 0]

    print("\n" + "="*60)
    print(f"DIFFERENCES {ds_a} - {ds_b} in {split.upper()}")
    print("="*60)

    if len(diff_nonzero) == 0:
        print("No differences!")
    else:
        print(diff_nonzero.to_string())



DIFFERENCES original - db3 in TRAIN
air hockey                -79
ampute football           -79
archery                   -59
arm wrestling             -92
axe throwing              -78
balance beam              -44
barell racing             -68
baseball                  -17
basketball                -22
baton twirling            -83
bike polo                 -81
billiards                 -46
bmx                       -51
bobsled                   -53
bowling                   -71
boxing                    -75
bull riding               -42
bungee jumping            -66
canoe slamon              -27
cheerleading              -60
chuckwagon racing         -71
cricket                   -62
croquet                   -57
curling                   -50
disc golf                 -68
fencing                   -56
field hockey              -34
figure skating men        -63
figure skating pairs      -40
figure skating women      -34
fly fishing               -57
formula 1 racing           -1
fris

In [130]:
def find_min_max_classes(datasets_dict, splits=("train","valid","test","testd")):

    for key, info in datasets_dict.items():
        print("\n" + "="*70)
        print(f"DATASET: {key}")
        print("="*70)

        for split in splits:

            counts = count_images_per_class(info[split])

            if len(counts) == 0:
                print(f"\n{split.upper()} -> No images found")
                continue

            # with tied counts, min() and max() both return the first class in sorted order
            min_class = min(counts, key=counts.get)
            max_class = max(counts, key=counts.get)

            print(f"\n{split.upper()}")
            print("-"*40)
            print(f"Class MIN: {min_class} -> {counts[min_class]} images")
            print(f"Class MAX: {max_class} -> {counts[max_class]} images")

find_min_max_classes(DATASETS)


DATASET: original

TRAIN
----------------------------------------
Class MIN: sky surfing -> 59 images
Class MAX: football -> 191 images

VALID
----------------------------------------
Class MIN: air hockey -> 5 images
Class MAX: air hockey -> 5 images

TEST
----------------------------------------
Class MIN: air hockey -> 5 images
Class MAX: air hockey -> 5 images

TESTD
----------------------------------------
Class MIN: air hockey -> 5 images
Class MAX: air hockey -> 5 images

DATASET: db2

TRAIN
----------------------------------------
Class MIN: ultimate -> 97 images
Class MAX: football -> 191 images

VALID
----------------------------------------
Class MIN: air hockey -> 5 images
Class MAX: air hockey -> 5 images

TEST
----------------------------------------
Class MIN: air hockey -> 5 images
Class MAX: air hockey -> 5 images

TESTD
----------------------------------------
Class MIN
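
When every class has the same count, as in the valid and test splits, `min` and `max` both return the first key they encounter, which is why `air hockey` is reported as both extremes. A small illustration with invented counts, plus a tie-aware alternative:

```python
# Invented counts: all classes tied at 5 images
counts = {"air hockey": 5, "archery": 5, "bowling": 5}

min_class = min(counts, key=counts.get)
max_class = max(counts, key=counts.get)
print(min_class, max_class)  # air hockey air hockey

# Tie-aware alternative: list every class sharing the extreme value
lowest = min(counts.values())
tied_min = [c for c, n in counts.items() if n == lowest]
print(tied_min)  # ['air hockey', 'archery', 'bowling']
```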

## **Image analysis**



In [104]:
def build_summary_table_from_paths(datasets_dict, splits=("train","valid","test","testd"), sample_size=500, seed=42):
    # local progress counter, so re-running the cell restarts from 1
    total_steps = len(datasets_dict) * len(splits)
    current_step = 0
    rows = []

    for key, info in datasets_dict.items():
        for split in splits:
            current_step += 1
            print(f"[{current_step}/{total_steps}] {key} - {split}")

            # analyze_sizes_sampled must already be defined when this cell runs
            stats = analyze_sizes_sampled(info[split], sample_size=sample_size, seed=seed)
            rows.append({"dataset": key, "split": split, **stats})

    print("Analysis complete!")
    return pd.DataFrame(rows)

summary = build_summary_table_from_paths(DATASETS, sample_size=500, seed=42)
summary

[1/12] original - train
[2/12] original - valid
[3/12] original - test
[4/12] original - testd
[5/12] db2 - train
[6/12] db2 - valid
[7/12] db2 - test
[8/12] db2 - testd
[9/12] db3 - train
[10/12] db3 - valid
[11/12] db3 - test
[12/12] db3 - testd
Analysis complete!


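| | dataset | split | checked | bad_files | unique_sizes | mode_size | mode_pct | w_min | w_max | w_mean | h_min | h_max | h_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | train | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 1 | original | valid | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 2 | original | test | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 3 | original | testd | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 4 | db2 | train | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 5 | db2 | valid | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 6 | db2 | test | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 7 | db2 | testd | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 8 | db3 | train | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |
| 9 | db3 | valid | 500 | 0 | 1 | 224x224 | 100.0 | 224 | 224 | 224 | 224 | 224 | 224 |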
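
This notebook calls `analyze_sizes_sampled` (and, in later cells, `sample_image_paths`), but their definitions are not part of this section. The sketch below is one possible implementation, reconstructed only from how they are called and from the columns of the summary table. It is an assumption, and the actual helpers may differ, for example in how files are ordered or in whether `checked` counts failed reads.

```python
import os
import random
import tempfile
from collections import Counter

from PIL import Image

IMAGE_EXTS = ('.jpg', '.jpeg', '.png')

def sample_image_paths(directory, sample_size=500, seed=42):
    """Return up to `sample_size` image paths under `directory`, sampled reproducibly."""
    paths = sorted(
        os.path.join(root, f)
        for root, _, files in os.walk(directory)
        for f in files
        if f.lower().endswith(IMAGE_EXTS)
    )
    if len(paths) <= sample_size:
        return paths
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(paths, sample_size)

def analyze_sizes_sampled(directory, sample_size=500, seed=42):
    """Summarize image dimensions over a random sample (hypothetical reconstruction)."""
    sizes, bad_files = [], 0
    for p in sample_image_paths(directory, sample_size, seed):
        try:
            with Image.open(p) as img:
                sizes.append(img.size)  # (width, height)
        except Exception:
            bad_files += 1  # unreadable or corrupt file

    if not sizes:
        return {"checked": 0, "bad_files": bad_files}

    (mode_w, mode_h), mode_n = Counter(sizes).most_common(1)[0]
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]
    return {
        "checked": len(sizes),
        "bad_files": bad_files,
        "unique_sizes": len(set(sizes)),
        "mode_size": f"{mode_w}x{mode_h}",
        "mode_pct": 100.0 * mode_n / len(sizes),
        "w_min": min(widths), "w_max": max(widths), "w_mean": sum(widths) / len(widths),
        "h_min": min(heights), "h_max": max(heights), "h_mean": sum(heights) / len(heights),
    }

# Usage on a throwaway folder: 3 valid images and 1 corrupt file
with tempfile.TemporaryDirectory() as root:
    class_dir = os.path.join(root, "archery")
    os.makedirs(class_dir)
    for i in range(3):
        Image.new("RGB", (224, 224)).save(os.path.join(class_dir, f"{i}.png"))
    with open(os.path.join(class_dir, "broken.jpg"), "wb") as fh:
        fh.write(b"not an image")
    demo_stats = analyze_sizes_sampled(root, sample_size=10)

print(demo_stats["checked"], demo_stats["bad_files"], demo_stats["mode_size"])  # 3 1 224x224
```

Sorting the paths before sampling keeps the sample stable across runs, since `os.walk` does not guarantee a fixed order.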


In [105]:
for key, info in DATASETS.items():
    n = sum(
        1 for root,_,files in os.walk(info["train"])
        for f in files if f.lower().endswith(('.jpg','.jpeg','.png'))
    )
    print(key, "train images:", n, "| path:", info["train"])

original train images: 13492 | path: /kaggle/input/datasets/oleksandragolub/visual-exam-dataset-original/exam_dataset/train/train
db2 train images: 13533 | path: /kaggle/input/datasets/andreaspagnolo/visual-exam-dataset/visual_dataset/train
db3 train images: 19100 | path: /kaggle/input/datasets/andreaspagnolo/visual-exam-dataset-2/visual_dataset/train


In [119]:
def analyze_rgb_stats(directory, sample_size=500, seed=42):
    paths = sample_image_paths(directory, sample_size, seed)
    
    means = []
    stds = []
    
    for p in paths:
        try:
            img = Image.open(p).convert("RGB")
            arr = np.array(img) / 255.0
            
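            means.append(arr.mean(axis=(0,1)))  # per-channel mean (R, G, B)
            stds.append(arr.std(axis=(0,1)))    # per-channel std (R, G, B)

        except Exception:
            # skip unreadable or corrupt files
            continue

    if len(means) == 0:
        # no readable image in the sample
        return {k: None for k in ("mean_R", "mean_G", "mean_B", "std_R", "std_G", "std_B")}

    means = np.array(means)
    stds = np.array(stds)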
    
    return {
        "mean_R": means[:,0].mean(),
        "mean_G": means[:,1].mean(),
        "mean_B": means[:,2].mean(),
        "std_R": stds[:,0].mean(),
        "std_G": stds[:,1].mean(),
        "std_B": stds[:,2].mean(),
    }

def analyze_blur(directory, sample_size=500, seed=42):
    paths = sample_image_paths(directory, sample_size, seed)
    
    blur_values = []
    
    for p in paths:
        try:
            img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
            
            if img is None:
                continue
                
            # variance of the Laplacian: a sharpness score (lower values mean more blur)
            lap = cv2.Laplacian(img, cv2.CV_64F)
            blur = lap.var()

            if not np.isnan(blur):
                blur_values.append(blur)

        except Exception:
            continue

    if len(blur_values) == 0:
        print("No blur value computed")
        return {"mean_blur": None, "std_blur": None, "n_samples": 0}
    
    return {
        "mean_blur": float(np.mean(blur_values)),
        "std_blur": float(np.std(blur_values)),
        "n_samples": len(blur_values)
    }

def analyze_brightness(directory, sample_size=500, seed=42):
    paths = sample_image_paths(directory, sample_size, seed)
    
    brightness_vals = []
    
    for p in paths:
        try:
            img = Image.open(p).convert("L")
            arr = np.array(img) / 255.0
            brightness_vals.append(arr.mean())
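        except Exception:
            continue

    if len(brightness_vals) == 0:
        return {"mean_brightness": None, "std_brightness": None}

    return {
        "mean_brightness": float(np.mean(brightness_vals)),
        "std_brightness": float(np.std(brightness_vals))
    }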

def analyze_edge_density(directory, sample_size=500, seed=42):
    paths = sample_image_paths(directory, sample_size, seed)
    densities = []

    for p in paths:
        try:
            img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue

            edges = cv2.Canny(img, 100, 200)  # commonly used thresholds
            density = (edges > 0).mean()      # fraction of edge pixels
            densities.append(density)

        except Exception:
            continue

    if len(densities) == 0:
        return {"mean_edge_density": None, "std_edge_density": None, "n_samples": 0}

    return {
        "mean_edge_density": float(np.mean(densities)),
        "std_edge_density": float(np.std(densities)),
        "n_samples": len(densities)
    }

def analyze_texture_energy(directory, sample_size=500, seed=42):
    paths = sample_image_paths(directory, sample_size, seed)
    energies = []

    for p in paths:
        try:
            img = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
            if img is None:
                continue

            gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
            gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
            mag = np.sqrt(gx**2 + gy**2)

            energies.append(float(mag.mean()))
        except Exception:
            continue

    if len(energies) == 0:
        return {"mean_texture": None, "std_texture": None, "n_samples": 0}

    return {
        "mean_texture": float(np.mean(energies)),
        "std_texture": float(np.std(energies)),
        "n_samples": len(energies)
    }

In [133]:
def full_analysis_table(dataset_key, sample_size=500):
    rows = []
    
    for split in ["train", "valid", "test", "testd"]:
        
        rgb = analyze_rgb_stats(DATASETS[dataset_key][split], sample_size)
        blur = analyze_blur(DATASETS[dataset_key][split], sample_size)
        bright = analyze_brightness(DATASETS[dataset_key][split], sample_size)
        edges = analyze_edge_density(DATASETS[dataset_key][split], sample_size)
        tex = analyze_texture_energy(DATASETS[dataset_key][split], sample_size)
        
        # blur, edges and tex all return "n_samples"; the last unpacked value wins
        rows.append({
            "dataset": dataset_key,
            "split": split,
            **rgb,
            **blur,
            **bright,
            **edges,
            **tex
        })
    
    return pd.DataFrame(rows)

db1_summary = full_analysis_table("original")
db2_summary = full_analysis_table("db2")
db3_summary = full_analysis_table("db3")

print("========== ORIGINAL ==========")
display(db1_summary)

print("========== DB2 ==========")
display(db2_summary)

print("========== DB3 ==========")
display(db3_summary)



| | dataset | split | mean_R | mean_G | mean_B | std_R | std_G | std_B | mean_blur | std_blur | n_samples | mean_brightness | std_brightness | mean_edge_density | std_edge_density | mean_texture | std_texture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | original | train | 0.466415 | 0.461975 | 0.442781 | 0.241691 | 0.231724 | 0.23151 | 3593.597907 | 3298.996925 | 500 | 0.461113 | 0.163182 | 0.129332 | 0.059916 | 90.461685 | 33.655799 |
| 1 | original | valid | 0.468186 | 0.464925 | 0.45538 | 0.235633 | 0.224704 | 0.226245 | 3498.27585 | 2844.941061 | 500 | 0.464811 | 0.161967 | 0.124531 | 0.058555 | 87.31034 | 31.507624 |
| 2 | original | test | 0.468685 | 0.470484 | 0.458403 | 0.238841 | 0.22942 | 0.23041 | 3739.712116 | 2863.493243 | 500 | 0.468567 | 0.154473 | 0.127463 | 0.060545 | 88.853241 | 33.171461 |
| 3 | original | testd | 0.464544 | 0.476067 | 0.458719 | 0.232141 | 0.233136 | 0.22624 | 3513.27295 | 2941.878631 | 500 | 0.470648 | 0.161743 | 0.118836 | 0.061833 | 83.843109 | 33.668226 |




| | dataset | split | mean_R | mean_G | mean_B | std_R | std_G | std_B | mean_blur | std_blur | n_samples | mean_brightness | std_brightness | mean_edge_density | std_edge_density | mean_texture | std_texture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | db2 | train | 0.477233 | 0.471769 | 0.446523 | 0.237263 | 0.227986 | 0.229021 | 3545.785188 | 3675.556769 | 500 | 0.470527 | 0.157572 | 0.125781 | 0.063486 | 87.612245 | 34.853319 |
| 1 | db2 | valid | 0.468186 | 0.464925 | 0.45538 | 0.235633 | 0.224704 | 0.226245 | 3498.27585 | 2844.941061 | 500 | 0.464811 | 0.161967 | 0.124531 | 0.058555 | 87.31034 | 31.507624 |
| 2 | db2 | test | 0.468685 | 0.470484 | 0.458403 | 0.238841 | 0.22942 | 0.23041 | 3739.712116 | 2863.493243 | 500 | 0.468567 | 0.154473 | 0.127463 | 0.060545 | 88.853241 | 33.171461 |
| 3 | db2 | testd | 0.464544 | 0.476067 | 0.458719 | 0.232141 | 0.233136 | 0.22624 | 3513.27295 | 2941.878631 | 500 | 0.470648 | 0.161743 | 0.118836 | 0.061833 | 83.843109 | 33.668226 |




| | dataset | split | mean_R | mean_G | mean_B | std_R | std_G | std_B | mean_blur | std_blur | n_samples | mean_brightness | std_brightness | mean_edge_density | std_edge_density | mean_texture | std_texture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | db3 | train | 0.464936 | 0.468809 | 0.460996 | 0.236903 | 0.230082 | 0.23276 | 3385.410147 | 3324.633869 | 500 | 0.466755 | 0.156222 | 0.124838 | 0.060324 | 87.788532 | 33.982834 |
| 1 | db3 | valid | 0.468186 | 0.464925 | 0.45538 | 0.235633 | 0.224704 | 0.226245 | 3498.27585 | 2844.941061 | 500 | 0.464811 | 0.161967 | 0.124531 | 0.058555 | 87.31034 | 31.507624 |
| 2 | db3 | test | 0.468685 | 0.470484 | 0.458403 | 0.238841 | 0.22942 | 0.23041 | 3739.712116 | 2863.493243 | 500 | 0.468567 | 0.154473 | 0.127463 | 0.060545 | 88.853241 | 33.171461 |
| 3 | db3 | testd | 0.464544 | 0.476067 | 0.458719 | 0.232141 | 0.233136 | 0.22624 | 3513.27295 | 2941.878631 | 500 | 0.470648 | 0.161743 | 0.118836 | 0.061833 | 83.843109 | 33.668226 |
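
Each table shows a single `n_samples` column even though `analyze_blur`, `analyze_edge_density`, and `analyze_texture_energy` all return that key. When dictionaries are merged with `**`, a repeated key keeps the position of its first occurrence and the value of its last. A minimal sketch with invented values, plus one way to keep the counts apart by prefixing the key (the `prefixed` helper is hypothetical, not part of the notebook):

```python
# Invented values, chosen so the three sample counts differ
blur = {"mean_blur": 3500.0, "n_samples": 500}
edges = {"mean_edge_density": 0.12, "n_samples": 498}
tex = {"mean_texture": 88.0, "n_samples": 497}

row = {**blur, **edges, **tex}
print(row)
# {'mean_blur': 3500.0, 'n_samples': 497, 'mean_edge_density': 0.12, 'mean_texture': 88.0}

def prefixed(stats, prefix):
    """Rename 'n_samples' to '<prefix>_n_samples' so merged rows keep every count."""
    return {(f"{prefix}_{k}" if k == "n_samples" else k): v for k, v in stats.items()}

row_all = {**prefixed(blur, "blur"), **prefixed(edges, "edge"), **prefixed(tex, "texture")}
print(list(row_all))
```

In the tables above all three counts are 500, so the overwrite does not change any reported value; it would only matter if some files failed to load in one analysis but not in another.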


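The degraded test set (`testd`), compared with the clean test set:
- shows no global shift
- is not noticeably blurred (Laplacian variance 3513 vs 3740, within the range of the other splits)
- is not darker (mean brightness 0.471 vs 0.469)
- has no color shift
- has less micro-texture (texture energy 83.8 vs 88.9)
- has fewer edges (edge density 0.119 vs 0.127)
- has less high-frequency content

This is a fine-grained structural degradation.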

Dataset structure:

- The valid/test/test_degradato splits are equivalent across the three datasets analyzed (their statistics are identical). The differences between the datasets are concentrated mainly in the training split.

Differences in training:

- The training split of original shows higher edge density and higher texture energy than those of db2 and db3 (edge density 0.129 vs 0.126 and 0.125; texture energy 90.5 vs 87.6 and 87.8), suggesting a slightly different distribution of local detail in the training images. This could be due to the use of data augmentation techniques.

Explanation of the drop on the degraded test set:

- Compared with the training splits, the degraded test set has lower edge density (about 5–8%, depending on the training set) and lower texture energy (about 4–7%), which indicates a loss of fine visual structure. This kind of local (not global) shift can significantly affect multi-class classification.
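
The percentages quoted above can be recomputed from the tables. The sketch below hard-codes the `mean_edge_density` and `mean_texture` values reported for each training split, for the clean test split, and for `testd`, and prints the relative drop of `testd` against each reference. The `relative_drop` helper is introduced here only for illustration.

```python
# Values copied from the summary tables in this notebook
testd = {"edge": 0.118836, "texture": 83.843109}
references = {
    "original train": {"edge": 0.129332, "texture": 90.461685},
    "db2 train":      {"edge": 0.125781, "texture": 87.612245},
    "db3 train":      {"edge": 0.124838, "texture": 87.788532},
    "test (clean)":   {"edge": 0.127463, "texture": 88.853241},
}

def relative_drop(reference, value):
    """Percentage by which `value` falls below `reference`."""
    return 100.0 * (reference - value) / reference

drops = {
    name: {metric: relative_drop(ref[metric], testd[metric]) for metric in testd}
    for name, ref in references.items()
}

for name, d in drops.items():
    print(f"{name:15s} edge: {d['edge']:.1f}%  texture: {d['texture']:.1f}%")
```

With these values the drop in edge density ranges from about 4.8% (against db3 train) to about 8.1% (against original train), and the drop in texture energy from about 4.3% to about 7.3%. Against the clean test split the drops are about 6.8% and 5.6%.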