<!--source of study https://www.kaggle.com/code/ryanholbrook/mutual-information -->
### Mutual Information
[**From Wikipedia https://en.wikipedia.org/wiki/Mutual_information**]

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats or hartleys) obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair $(X,Y)$ is from the product of the marginal distributions of $X$ and $Y$. MI is the expected value of the pointwise mutual information (PMI).

The quantity was defined and analyzed by Claude Shannon in his landmark paper "A Mathematical Theory of Communication", although he did not call it "mutual information". This term was coined later by Robert Fano. Mutual Information is also known as information gain.
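
For reference, the standard definition for two discrete variables makes the comparison between the joint distribution and the product of the marginals explicit (here $p(x,y)$ is the joint probability and $p(x)$, $p(y)$ are the marginals):

$$
I(X;Y) \;=\; \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \;=\; H(X) - H(X \mid Y)
$$

With a base-2 logarithm the result is in bits (shannons). $I(X;Y) = 0$ exactly when $p(x,y) = p(x)\,p(y)$ for every pair $(x,y)$, that is, when $X$ and $Y$ are independent.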

### Why is it different from Correlation?
[**From ChatGPT**]

| Feature                                     | Correlation (Pearson)                  | Mutual Information                  |
| ------------------------------------------- | -------------------------------------- | ----------------------------------- |
| Detects linear dependence                   | ✅                                      | ✅                                   |
| Detects non-linear dependence               | ❌ (not reliably)                       | ✅                                   |
| Works for numeric+categorical               | ❌                                      | ✅                                   |
| Zero means independence                     | ❌ (only rules out linear dependence)   | ✅ (true independence)               |
| Invariant to linear rescaling               | ✅ (up to sign)                         | ✅                                   |
| Invariant to monotonic transforms           | ❌                                      | ✅ (strictly monotonic transforms)   |
| Requires normality assumptions              | Often (for interpretation)             | ❌                                   |
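
To make the "zero means independence" row concrete, here is a minimal sketch using only the standard library. The helper names `mutual_information_bits` and `pearson` are made up for this illustration, and the data is a toy example in which $Y = X^2$: the Pearson correlation is exactly zero while the mutual information is clearly positive.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys):
    """MI (in bits) of two discrete sequences, computed from empirical frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


# Toy data: Y is fully determined by X, but the relationship is not linear
xs = [-1, 0, 1] * 100
ys = [x * x for x in xs]

corr = pearson(xs, ys)
mi = mutual_information_bits(xs, ys)
print(f"correlation = {corr:.3f}")          # 0.000
print(f"mutual information = {mi:.3f} bits")  # about 0.918 bits, the entropy of Y
```

Since $Y$ is a function of $X$, the MI equals the entropy of $Y$, which is why it is the same as the entropy of a (1/3, 2/3) split.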


In [44]:
import importlib
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression

import start_utils
# Reload before the from-import: names already bound by `from start_utils import ...`
# are not refreshed by importlib.reload()
importlib.reload(start_utils)
from start_utils import (
    create_features, read_train_data, read_test_data,
    simple_train, build_pipe, train_regularization,
    #greedy_feature_selection, random_bucket_feature_search,
    predict_and_submit
    )
start_utils  # echo the module to confirm which file was loaded



<module 'start_utils' from '/Users/giacomobonanni/Desktop/FDS_Kaggle_Competition/start_utils.py'>

In [None]:
# Set Matplotlib defaults
#plt.style.use("seaborn-whitegrid")
#import seaborn as sns
sns.set_theme(style="whitegrid")




# Load data
COMPETITION_NAME = 'fds-pokemon-battles-prediction-2025'
DATA_PATH = os.path.join('input', COMPETITION_NAME)

train_file_path = os.path.join(DATA_PATH, 'train.jsonl')
test_file_path = os.path.join(DATA_PATH, 'test.jsonl')

train_data = read_train_data(train_file_path)
test_data = read_test_data(test_file_path)

#2 create features
# Create feature DataFrames for both training and test sets
print("Processing training data...")
df = create_features(train_data)

#df = pd.read_csv("../input/fe-course-data/ames.csv")


# Utility functions from Tutorial
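def make_mi_scores(X, y):
    X = X.copy()
    # Label encoding from the tutorial, disabled here (the feature columns are
    # presumably already numeric):
    # for colname in X.select_dtypes(["object", "category"]):
    #     X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    # NOTE: player_won is a binary label, for which scikit-learn provides
    # mutual_info_classif; mutual_info_regression (kept from the tutorial)
    # treats the target as continuous.
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores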


def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.rc("figure", autolayout=True)
    plt.rc(
        "axes",
        labelweight="bold",
        labelsize="large",
        titleweight="bold",
        titlesize=14,
        titlepad=10,
    )
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
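
Since `player_won` is a binary label, a variant of `make_mi_scores` built on `mutual_info_classif` may be a better fit. The sketch below is only an illustration: the function name `make_mi_scores_classif` and the toy data are assumptions, not part of the tutorial or of `start_utils`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def make_mi_scores_classif(X, y, random_state=0):
    """Hypothetical variant of make_mi_scores for a discrete (class) target."""
    # Integer-typed columns are treated as discrete, as in the tutorial helper
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    scores = mutual_info_classif(
        X, y, discrete_features=discrete_features, random_state=random_state
    )
    scores = pd.Series(scores, name="MI Scores", index=X.columns)
    return scores.sort_values(ascending=False)


# Toy data (not the competition data): one informative column, one pure-noise column
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=1000)
X_toy = pd.DataFrame({
    "informative": y_toy + rng.normal(scale=0.3, size=1000),
    "noise": rng.normal(size=1000),
})

toy_scores = make_mi_scores_classif(X_toy, y_toy)
print(toy_scores)
```

The returned Series has the same shape as the one produced by `make_mi_scores`, so it could be passed to `plot_mi_scores` unchanged.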

In [None]:
df.columns

In [None]:
# battle_id is left in on purpose: it is obviously related to player_won because the
# records are sorted by battle_id (first all the rows with player_won=1, then the ones
# with player_won=0)
step = 5
for i in range(0,len(df.columns), step):
    features = df.columns[i:i+step]
    sns.relplot(
        x="value", y="player_won", col="variable", data=df.melt(id_vars="player_won", value_vars=features), facet_kws=dict(sharex=False),
    );

In [None]:
X = df.copy()
#print(X.isna().all())
for col in X.columns:
    if(X[col].isna().sum() > 0):
        print(col,X[col].isna().sum())
    #print(col, X[col].shape, X[col].isna().sum())



In [None]:
X = df.copy()
X.pop('battle_id')
y = X.pop('player_won')
# print(X.shape)
# step = 3
# mi_scores = None
# for i in range(0, len(X.columns), step):
#     X_cols = X.columns[i:i+step]
#     #print(f"Making MI Score for {X_cols}")
#     mi_scores = make_mi_scores(X[X_cols], y)
mi_scores = make_mi_scores(X, y)

In [None]:
print("mi_scores: ",mi_scores,mi_scores.shape)
print("\nhead:",mi_scores.head(20))
print(mi_scores.tail(20))  # bottom 20


step = 15
for i in range(0,len(mi_scores),step):
    chunk = mi_scores.iloc[i:i+step]
    # Scores are sorted in descending order, so stop at the first all-zero chunk
    # (checked before plt.figure so that no empty figure is created)
    if (chunk == 0).all():
        break
    plt.figure(dpi=100, figsize=(8, 5))
    plot_mi_scores(chunk)
# plot_mi_scores(mi_scores.tail(20))  # uncomment to plot the bottom 20

In [None]:
#I want to check how the accuracy score changes as the features with the highest MI scores are added one at a time
top_mi_scores = mi_scores  # use mi_scores.head(20) to restrict the check to the top 20
print(len(mi_scores))
for i in range(0,len(top_mi_scores)):
    print(f"Calculating for i={i}")
    selected = top_mi_scores.iloc[0:i+1].index
    X_selected = X[selected]
    final_pipe = train_regularization(X_selected,y)
#exit()
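
One caveat with the loop above: the MI ranking is computed on all training rows before any cross-validation happens, so the selection step has already seen the labels of every fold and the CV score may come out slightly optimistic. A minimal sketch of one way around this is to put the selection inside a scikit-learn `Pipeline`, so it is refit on each training fold. `train_regularization` lives in `start_utils` and is not shown here, so this does not reproduce it; a plain `LogisticRegression` is a stand-in, and the data is synthetic.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 columns, only column 0 carries signal
rng = np.random.default_rng(0)
n_rows = 400
y_demo = rng.integers(0, 2, size=n_rows)
X_demo = rng.normal(size=(n_rows, 10))
X_demo[:, 0] += 2.0 * y_demo

pipe = Pipeline([
    # MI-based selection is refit on the training part of every CV split
    ("select", SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=3)),
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"CV mean: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

The number of kept features `k` can then be tuned like any other hyperparameter, for example with `GridSearchCV` over `select__k`.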



In [24]:
#Train on the 76 features with the highest MI scores and create the submission file
top_mi_scores = mi_scores
# selected = top_mi_scores[0:5].keys()
# X_selected = X[selected]
# #print(f"selected shape={X_selected.shape}")
# final_pipe = train_regularization(X_selected,y)

# for i in range(0,5):
#     print(f"Calculating for i={i}")
#     selected = top_mi_scores[0:i+1].keys()
#     #print(selected)
#     X_selected = X[selected]
#     #print(f"selected shape={X_selected.shape}")
#     final_pipe = train_regularization(X_selected,y)
# top_mi_scores = mi_scores
# selected = top_mi_scores[0:50].keys()
# X_selected = X[selected]
# #print(f"selected shape={X_selected.shape}")
# final_pipe = train_regularization(X_selected,y)

selected = top_mi_scores[0:76].keys()
X_selected = X[selected]
#print(f"selected shape={X_selected.shape}")
final_pipe = train_regularization(X_selected,y)
test_df = create_features(test_data)
predict_and_submit(test_df, selected, final_pipe)

train_regularization!!
Best params: {'logreg__C': 10, 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Best CV mean: 0.8444 ± 0.0082


Extracting features:   0%|          | 0/5000 [00:00<?, ?it/s]

Generating predictions on the test set...

'submission.csv' file created successfully!


### Flattening Data

In [25]:
#Flattening data into numeric features
import json
import re
import time

import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm import tqdm

start = time.time()


def flatten_dict(d, parent_key="", sep="."):
    """Recursively flatten dicts/lists into a single-level dict."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        elif isinstance(v, list):
            # Flatten short semantic lists like effects or types
            if len(v) > 0 and all(isinstance(x, (str, int, float, bool, type(None))) for x in v):
                for i, val in enumerate(v):
                    items.append((f"{new_key}{sep}{i}", val))
            #skip long or nested lists (like nested dicts)
            else:
                continue
        else:
            items.append((new_key, v))
    return dict(items)



VALID_POKEMON_KEYS = {
    "p1_pokemon_state", "p2_pokemon_state",
    "p1_move_details", "p2_move_details",
}

all_timeline_keys = set()
all_effects = set() 

with open("train.jsonl") as f:
    for line in tqdm(f, desc="Scanning columns"):
        data = json.loads(line.strip())
        if "battle_timeline" not in data:
            continue
        for i, turn in enumerate(data["battle_timeline"], start=1):
            if not isinstance(turn, dict):
                continue
            for key, value in flatten_dict(turn, f"battle_timeline.{i}").items():
                # Skip invalid base-level objects like p1_pokemon_state
                if any(key.endswith(f"{valid}") for valid in VALID_POKEMON_KEYS):
                    continue
                # Collect all effects that appear
                if "effects" in key and isinstance(value, str):
                    all_effects.add(value.strip())
                all_timeline_keys.add(key)

# Normalize keys so turn numbers become a template placeholder
template_keys = {
    re.sub(r"battle_timeline\.\d+\.", "battle_timeline.{turn}.", key)
    for key in all_timeline_keys
}

for side in ["p1", "p2"]:
    for eff in all_effects:
        template_keys.add(
            f"battle_timeline.{{turn}}.{side}_pokemon_state.effects.0.{eff}"
        )


template_keys = sorted(template_keys)
print(f"Collected {len(template_keys)} unique timeline feature templates, including {len(all_effects)} effects.")

records = []
with open("train.jsonl") as f:
    for line in tqdm(f, desc="Flattening"):
        data = json.loads(line.strip())
        flat_record = {}

        #Flatten team details
        if "p1_team_details" in data and isinstance(data["p1_team_details"], list):
            for i, poke in enumerate(data["p1_team_details"]):
                for key, value in poke.items():
                    if isinstance(value, list):
                        for j, v in enumerate(value):
                            flat_record[f"p1_team_details.{i}.{key}.{j}"] = v
                    else:
                        flat_record[f"p1_team_details.{i}.{key}"] = value

        #Flatten p2 lead details
        if "p2_lead_details" in data and isinstance(data["p2_lead_details"], dict):
            for key, value in data["p2_lead_details"].items():
                if isinstance(value, list):
                    for j, v in enumerate(value):
                        flat_record[f"p2_lead_details.{key}.{j}"] = v
                else:
                    flat_record[f"p2_lead_details.{key}"] = value

        #Flatten timeline turns (1–30)
        for turn in range(1, 31):
            if "battle_timeline" in data and len(data["battle_timeline"]) >= turn:
                turn_data = data["battle_timeline"][turn - 1]
                if isinstance(turn_data, dict):
                    for key, value in flatten_dict(turn_data, f"battle_timeline.{turn}").items():
                        if any(valid in key for valid in VALID_POKEMON_KEYS):
                            if any(key.endswith(f"{valid}") for valid in VALID_POKEMON_KEYS):
                                continue
                            flat_record[key] = value
            # Ensure *every* timeline key exists for this turn
            for tmpl in template_keys:
                key = tmpl.format(turn=turn)
                if key not in flat_record:
                    flat_record[key] = None

        # --- Copy top-level keys ---
        for k, v in data.items():
            if k not in ["p1_team_details", "p2_lead_details", "battle_timeline"]:
                flat_record[k] = v
        records.append(flat_record)

# ---------- 3️⃣ Build DataFrame ----------
df = pd.DataFrame(records)
df.columns = [c.replace('.', '_') for c in df.columns]
print(f"✅ Final shape: {df.shape}")




####OLD
bool_cols = df.select_dtypes(include=["bool"]).columns

df[bool_cols] = df[bool_cols].astype(int)

print(f"Converted {len(bool_cols)} boolean columns to integers.")


unique_types = df.dtypes.unique().tolist()

print("Unique column data types:")
print(unique_types)
type_cols = [col for col in df.columns if "type" in col]
df = pd.get_dummies(df, columns=type_cols, drop_first=True)

#60
cat_cols = [c for c in df.columns if "move_details_category" in c]
df = pd.get_dummies(df, columns=cat_cols, prefix=cat_cols, drop_first=True)

#status
status_cols = [c for c in df.columns if "state_status" in c]

df = pd.get_dummies(df, columns=status_cols, prefix=status_cols, drop_first=True)

#effect
effects_cols = [c for c in df.columns if "state_effects" in c]
print(f"Found {len(effects_cols)} state_effects columns:")
print(effects_cols[:10])


# Identify state_effects columns

# only base effect columns (exclude already expanded ones)
effects_cols = [
    c for c in df.columns
    if "state_effects" in c and re.search(r"state_effects_\d+$", c)
]
print(f"Transforming {len(effects_cols)} base state_effects columns...")

mlb = MultiLabelBinarizer()


#Normalize a cell into a list of strings (defined once, outside the loop)
def normalize_effects(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return []
    if isinstance(x, str):
        return [x.strip()]
    if isinstance(x, (list, tuple)):
        out = []
        for el in x:
            if el is None or (isinstance(el, float) and np.isnan(el)):
                continue
            if isinstance(el, str):
                out.append(el.strip())
            else:
                out.append(str(el))
        return out
    return [str(x)]


for c in tqdm(effects_cols, desc="Encoding effects"):
    col_values = df[c].apply(normalize_effects)

    #Skip columns with no effects at all
    if sum(len(v) for v in col_values) == 0:
        continue

    #Convert to list-of-lists, then fit and transform in one step
    col_list = col_values.tolist()
    transformed = mlb.fit_transform(col_list)

    #Create encoded DataFrame
    encoded = pd.DataFrame(
        transformed,
        columns=[f"{c}_{cls}" for cls in mlb.classes_],
        index=df.index
    )

    #Merge 
    df = pd.concat([df.drop(columns=[c]), encoded], axis=1)


print("✅ All state_effects columns transformed into binary features safely!")

#Drop the move names: they just add noise
move_name_cols = [c for c in df.columns if "move_details_name" in c]
df = df.drop(columns=move_name_cols)
print(f"Dropped {len(move_name_cols)} move_details_name columns (redundant with move stats).")


#####JOIN


#"team details" columns for player 1
p1_team_cols = [c for c in df.columns if c.startswith("p1_team_details_")]

# Extract before battle Pokémon dictionaries for player 1
def build_pokemon_dict(row, prefix="p1_team_details_"):
    poke_dict = {}
    for i in range(6):  # team slots 0–5
        name_col = f"{prefix}{i}_name"
        if name_col in row and pd.notna(row[name_col]):
            # capture all base_* stats for this slot
            stat_cols = [c for c in df.columns if c.startswith(f"{prefix}{i}_base_")]
            stats = {stat.split("_base_")[-1]: row[stat] for stat in stat_cols}
            poke_dict[row[name_col]] = stats
    return poke_dict


#timeline Pokémon name columns
timeline_name_cols = [c for c in df.columns if "pokemon_state_name" in c]

#collect all new columns (avoid fragmentation)
new_columns = {}

#Iterate battles (rows)
for idx, row in df.iterrows():
    # Build Pokémon -> stats dictionaries
    p1_map = build_pokemon_dict(row, prefix="p1_team_details_")
    p2_map = build_pokemon_dict(row, prefix="p2_team_details_") if any(
        c.startswith("p2_team_details_") for c in df.columns
    ) else {}

    #P2 lead details if present
    if pd.notna(row.get("p2_lead_details_name")):
        p2_map[row["p2_lead_details_name"]] = {
            k.replace("p2_lead_details_base_", ""): v
            for k, v in row.items()
            if k.startswith("p2_lead_details_base_")
        }

    # timeline stats
    for col in timeline_name_cols:
        name = row[col]
        if pd.isna(name):
            continue
        side = "p1" if "_p1_" in col else "p2"
        stat_map = p1_map if side == "p1" else p2_map

        if name in stat_map:
            for stat, value in stat_map[name].items():
                new_col = col.replace("_name", f"_base_{stat}")
                if new_col not in new_columns:
                    new_columns[new_col] = [np.nan] * len(df)
                new_columns[new_col][idx] = float(value)

# new df with all new stat columns
df_new = pd.DataFrame(new_columns, index=df.index)

#Join
df = pd.concat([df, df_new], axis=1)

#drop original pokemon name columns (redundant)
df = df.drop(columns=timeline_name_cols + [c for c in df.columns if c.endswith("_name")])

#defragment memory layout
df = df.copy()

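#no duplicate columns
#NOTE: duplicated() keeps the first occurrence by default. The binarized effect columns
#appear to share their names with the all-None placeholders created from template_keys,
#and the placeholders come first, so the placeholders are the ones kept here (they show
#up as object columns below). Passing keep="last" would keep the encoded columns instead.
df = df.loc[:, ~df.columns.duplicated()].copy()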
df.columns = [c.replace('.', '_') for c in df.columns]

print(f"✅ Final shape: {df.shape}")
dupes = df.columns[df.columns.duplicated()]
if len(dupes):
    print("⚠️ Still duplicate columns:", dupes.tolist())
else:
    print("✅ No duplicate columns remain!")

print("✅ Stats injected successfully!")
unique_types = df.dtypes.unique().tolist()

print("Unique column data types:")
print(unique_types)

#object (dtype 'O') columns
object_cols = df.select_dtypes(include=["object"]).columns.tolist()

print(f"Found {len(object_cols)} object columns:")#381
print(object_cols)

Scanning columns: 10000it [00:05, 1680.30it/s]


Collected 47 unique timeline feature templates, including 8 effects.


Flattening: 10000it [00:10, 952.52it/s]


✅ Final shape: (10000, 1482)
Converted 1 boolean columns to integers.
Unique column data types:
[dtype('O'), dtype('int64'), dtype('float64')]
Found 480 state_effects columns:
['battle_timeline_1_p1_pokemon_state_effects_0', 'battle_timeline_1_p2_pokemon_state_effects_0', 'battle_timeline_1_p1_pokemon_state_effects_0_clamp', 'battle_timeline_1_p1_pokemon_state_effects_0_confusion', 'battle_timeline_1_p1_pokemon_state_effects_0_firespin', 'battle_timeline_1_p1_pokemon_state_effects_0_noeffect', 'battle_timeline_1_p1_pokemon_state_effects_0_reflect', 'battle_timeline_1_p1_pokemon_state_effects_0_substitute', 'battle_timeline_1_p1_pokemon_state_effects_0_wrap', 'battle_timeline_1_p2_pokemon_state_effects_0_clamp']
Transforming 60 base state_effects columns...


Encoding effects: 100%|██████████| 60/60 [00:05<00:00, 10.35it/s]


✅ All state_effects columns transformed into binary features safely!
Dropped 60 move_details_name columns (redundant with move stats).
✅ Final shape: (10000, 2829)
✅ No duplicate columns remain!
✅ Stats injected successfully!
Unique column data types:
[dtype('int64'), dtype('float64'), dtype('O'), dtype('bool')]
Found 450 object columns:
['battle_timeline_1_p1_pokemon_state_effects_0_clamp', 'battle_timeline_1_p1_pokemon_state_effects_0_confusion', 'battle_timeline_1_p1_pokemon_state_effects_0_firespin', 'battle_timeline_1_p1_pokemon_state_effects_0_noeffect', 'battle_timeline_1_p1_pokemon_state_effects_0_reflect', 'battle_timeline_1_p1_pokemon_state_effects_0_substitute', 'battle_timeline_1_p1_pokemon_state_effects_0_wrap', 'battle_timeline_1_p2_pokemon_state_effects_0_clamp', 'battle_timeline_1_p2_pokemon_state_effects_0_confusion', 'battle_timeline_1_p2_pokemon_state_effects_0_firespin', 'battle_timeline_1_p2_pokemon_state_effects_0_noeffect', 'battle_timeline_1_p2_pokemon_state_eff
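
The object columns listed above have the same names as the binarized effect columns (`..._effects_0_clamp`, `..._effects_0_confusion`, ...), which suggests that the de-duplication step kept the all-`None` placeholders rather than the encoded values. The toy sketch below (column name shortened, data made up) shows how the `keep` argument of `Index.duplicated` decides which copy survives.

```python
import pandas as pd

# All-None placeholder, like the columns pre-created from the templates (object dtype)
placeholder = pd.DataFrame({"effects_0_clamp": [None, None, None]})
# Binarized column with the same name, appended later
encoded = pd.DataFrame({"effects_0_clamp": [0, 1, 0]})

both = pd.concat([placeholder, encoded], axis=1)

# Default keep="first": the placeholder survives
keep_first = both.loc[:, ~both.columns.duplicated()]
# keep="last": the encoded column survives
keep_last = both.loc[:, ~both.columns.duplicated(keep="last")]

print(keep_first.dtypes.tolist())              # object dtype
print(keep_last["effects_0_clamp"].tolist())   # [0, 1, 0]
```

If that is what happens in the cell above, the effect features are removed again when the object columns are dropped in the next cell, so it may be worth checking before relying on them.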

In [26]:
#(10000, 2829)
#df = df.drop(columns=object_cols)
df = df[[col for col in df.columns if col not in object_cols]].copy()

df.shape

(10000, 2379)

In [27]:
#once
#X.pop('battle_id')
#y = X.pop('player_won')
X = df[[col for col in df.columns if col not in ["battle_id", "player_won"]]].copy()
X = X.fillna(0)

#print(X.isna().all())
for col in X.columns:
    if(X[col].isna().sum() > 0):
        print(col,X[col].isna().sum())
    #print(col, X[col].shape, X[col].isna().sum())


y = df['player_won']

In [28]:

#Mi score Calculation took 42.52199697494507 time
print(X.shape)
start_out = time.time()
# step = 3
# mi_scores = None
# for i in range(0, len(X.columns), step):
#     start = time.time()
#     X_cols = X.columns[i:i+step]
#     print(f"Making MI Score for {X_cols}")
#     mi_scores = make_mi_scores(X[X_cols], y)
#     end = time.time()
#     print(f"Mi score {i}:{i+step} took {end-start} time")
mi_scores = make_mi_scores(X, y)
end_out = time.time()
print(f"Mi score Calculation took {end_out-start_out} time")

(10000, 2377)
Mi score Calculation took 42.34749126434326 time


In [46]:
mi_scores.head(100)
for i in [100,200,500,1000, 1500, 2000, 2377]:
    selected = mi_scores.head(i).keys()
    #print(selected)
    X_selected = X[selected]
    #print(f"selected shape={X_selected.shape}")
    final_pipe = train_regularization(X_selected,y, USE_PCA=True, n_components=min(i,300))
# selected = top_mi_scores[0:i+1].keys()
# #print(selected)
# X_selected = X[selected]
# #print(f"selected shape={X_selected.shape}")
# final_pipe = train_regularization(X_selected,y)

train_regularization!!
Best params: {'logreg__C': 0.1, 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Best CV mean: 0.7580 ± 0.0093
train_regularization!!
Best params: {'logreg__C': 0.01, 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Best CV mean: 0.7810 ± 0.0070
train_regularization!!
Best params: {'logreg__C': 0.01, 'logreg__penalty': 'l2', 'logreg__solver': 'liblinear'}
Best CV mean: 0.7927 ± 0.0132
train_regularization!!
Best params: {'logreg__C': 0.1, 'logreg__penalty': 'l1', 'logreg__solver': 'liblinear'}
Best CV mean: 0.8024 ± 0.0098
train_regularization!!


KeyboardInterrupt: 

In [None]:
print("mi_scores: ",mi_scores,mi_scores.shape)
print("\nhead:",mi_scores.head(20))
print(mi_scores.tail(20))  # bottom 20


step = 15
for i in range(0,len(mi_scores),step):
    chunk = mi_scores.iloc[i:i+step]
    # Scores are sorted in descending order, so stop at the first all-zero chunk
    # (checked before plt.figure so that no empty figure is created)
    if (chunk == 0).all():
        break
    plt.figure(dpi=100, figsize=(8, 5))
    plot_mi_scores(chunk)
# plot_mi_scores(mi_scores.tail(20))  # uncomment to plot the bottom 20