# Early vs Late Fusion
Using the model that performed best so far, we implement both early fusion and late fusion and compare which approach gives better results.

### Early Fusion
Early fusion combines multiple input modalities or feature sets before feeding them into the model. This typically involves concatenating raw features from different sources into a single feature vector.
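
In its simplest form, early fusion is a column-wise concatenation followed by a single model. A minimal sketch on synthetic arrays (the two feature blocks and the target are invented for illustration and are not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical feature blocks describing the same 200 samples
academic = rng.normal(size=(200, 2))
social = rng.normal(size=(200, 3))

# Synthetic target that depends on both blocks
y = academic[:, 0] + 0.5 * social[:, 1] + rng.normal(scale=0.1, size=200)

# Early fusion: join the blocks into one feature matrix, then fit ONE model
X_early = np.hstack([academic, social])
model = LinearRegression().fit(X_early, y)

print(X_early.shape)  # (200, 5)
```

The cells below follow the same idea, except that the "concatenation" is simply keeping all columns in one DataFrame and passing it through a single pipeline.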

In [1]:
from DATASET import clean_df
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load dataset
df = clean_df.copy()
df = df.dropna(subset=['nota_assignatura'])
# Separate current year data (to predict)
df_train = df[df['curs_academic'] != '2023/24'].copy()
df_pred_target = df[df['curs_academic'] == '2023/24'].copy()


                              Estudi Curs acadèmic  \
0  Graduat en Enginyeria Informàtica       2020/21   
1  Graduat en Enginyeria Informàtica       2020/21   
2  Graduat en Enginyeria Informàtica       2020/21   
3  Graduat en Enginyeria Informàtica       2020/21   
4  Graduat en Enginyeria Informàtica       2020/21   

                          Id Anonim  Sexe                 Assignatura  \
0  1DFB71F2B000D1421808D0B3F67B335E  Home                     Àlgebra   
1  1DFB71F2B000D1421808D0B3F67B335E  Home                      Càlcul   
2  1DFB71F2B000D1421808D0B3F67B335E  Home  Electricitat i Electrònica   
3  1DFB71F2B000D1421808D0B3F67B335E  Home      Fonaments d'Enginyeria   
4  1DFB71F2B000D1421808D0B3F67B335E  Home     Fonaments d'Informàtica   

   Codi assignatura  Nota_assignatura Curs acadèmic accés estudi  \
0            103801               0.0                    2020/21   
1            103802               0.0                    2020/21   
2            102771             

In [2]:
X_train = df_train.drop(columns=['nota_assignatura'])
y_train = df_train['nota_assignatura']

X_pred = df_pred_target.drop(columns=['nota_assignatura'])
y_pred = df_pred_target['nota_assignatura']  # actual 2023/24 grades (ground truth, not model output)

# Select categorical columns to encode
categorical_cols = X_train.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'  # Keep non-categorical columns
)
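
Since the training frame excludes `2023/24`, every row to be predicted carries a `curs_academic` value the encoder has never seen. With `handle_unknown='ignore'`, such a value is encoded as an all-zero block instead of raising an error. A small sketch on a toy column (the toy frames are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the training years and the year to predict
train = pd.DataFrame({'curs_academic': ['2020/21', '2021/22', '2022/23']})
new = pd.DataFrame({'curs_academic': ['2023/24']})

encoder = OneHotEncoder(handle_unknown='ignore').fit(train)

# The unseen category gets no active column: one row of three zeros
encoded = encoder.transform(new).toarray()
print(encoded)
```

In practice this means the model receives no academic-year signal for the 2023/24 rows.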



In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Define ensemble with reduced complexity for speed
ensemble_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', VotingRegressor(estimators=[
        ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
    ]))
])

# Fit ensemble
ensemble_pipeline.fit(X_train, y_train)

# Predict on 2023/24
df_pred_target['predicted_nota_assignatura_ensemble'] = ensemble_pipeline.predict(X_pred)


In [4]:
# Feature selection placeholder: for now we keep every available feature
best_features = ['assignatura', 'codi_assignatura', 'curs_academic', 'discapacitat', 
                 'estudis_mare', 'estudis_pare', 'nota_d_acces', 'sexe', 'taxa_exit', 'via_acces_estudi']

# Divide df_train into X and y
X = df_train[best_features]
y = df_train['nota_assignatura']

# Split train/val to validate performance before predicting 2023/24
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:

categorical_cols = X.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ],
    remainder='passthrough'
)

ensemble_early = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', VotingRegressor([
        ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
        ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
    ]))
])

ensemble_early.fit(X_train, y_train)
y_pred_early = ensemble_early.predict(X_val)

print("Early Fusion - R²:", r2_score(y_val, y_pred_early))
# mean_squared_error returns the MSE (not its square root), so it is labeled as MSE
print("Early Fusion - MSE:", mean_squared_error(y_val, y_pred_early))


Early Fusion - R²: 0.4742565397311099
Early Fusion - MSE: 5.392016780491727
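
Because `mean_squared_error` returns the mean squared error by default, the error figure printed above is in squared grade points. A minimal sketch of the conversion to RMSE, using only the value reported in the output above (the variable names are illustrative):

```python
import math

# Value printed by the early-fusion cell (output of mean_squared_error)
mse_early = 5.392016780491727

# RMSE is the square root of the MSE, in the same units as the grade
rmse_early = math.sqrt(mse_early)
print(f"Early Fusion - RMSE: {rmse_early:.3f}")  # about 2.322
```

The same conversion applies to the late-fusion cells, which use the same metric call.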


### Late Fusion
Late fusion trains a separate model for each modality or feature group and then combines their outputs (e.g., predictions or probabilities) afterward. In the cells below, the per-group predictions are combined by simple averaging.

Three feature groups:

In [8]:
# Define feature groups (adjust as needed)
group1 = ['assignatura', 'codi_assignatura', 'curs_academic']
group2 = ['discapacitat', 'estudis_mare', 'estudis_pare']
group3 = ['nota_d_acces', 'sexe', 'taxa_exit', 'via_acces_estudi']

feature_groups = [group1, group2, group3]

models = []
preds_test = []

for group in feature_groups:
    X_train_g = X_train[group]
    X_test_g = X_val[group]

    cat_cols_g = X_train_g.select_dtypes(include='object').columns.tolist()
    preprocessor_g = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_g)
    ], remainder='passthrough')

    model_g = Pipeline([
        ('preprocessor', preprocessor_g),
        ('regressor', VotingRegressor([
            ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
            ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
        ]))
    ])

    model_g.fit(X_train_g, y_train)
    models.append(model_g)

    pred_g = model_g.predict(X_test_g)
    preds_test.append(pred_g)

# Combine predictions by simple averaging
y_pred_late = np.mean(preds_test, axis=0)

print("Late Fusion - R²:", r2_score(y_val, y_pred_late))
# mean_squared_error returns the MSE, so it is labeled as MSE
print("Late Fusion - MSE:", mean_squared_error(y_val, y_pred_late))


Late Fusion - R²: 0.27155372557612933
Late Fusion - MSE: 7.470933700956228


Four feature groups:

In [None]:
# Academic data (defined for reference; the loop below uses the groups instead)
X_academic = df_train[['assignatura', 'codi_assignatura', 'curs_academic', 'nota_d_acces', 'taxa_exit', 'via_acces_estudi']]

# Sociodemographic data (likewise not used by the loop below)
X_social = df_train[['discapacitat', 'estudis_mare', 'estudis_pare', 'sexe']]

# Define feature groups (adjust as needed)
group1 = ['assignatura', 'codi_assignatura', 'curs_academic']
group2 = ['discapacitat', 'estudis_mare', 'estudis_pare']
group3 = [ 'sexe', 'taxa_exit' ]
group4 = ['nota_d_acces',  'via_acces_estudi']  

feature_groups = [group1, group2, group3, group4]

models = []
preds_test = []

for group in feature_groups:
    X_train_g = X_train[group]
    X_test_g = X_val[group]
    
    cat_cols_g = X_train_g.select_dtypes(include='object').columns.tolist()
    preprocessor_g = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_g)
    ], remainder='passthrough')
    
    model_g = Pipeline([
        ('preprocessor', preprocessor_g),
        ('regressor', VotingRegressor([
            ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
            ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
        ]))
    ])
    
    model_g.fit(X_train_g, y_train)
    models.append(model_g)
    
    pred_g = model_g.predict(X_test_g)
    preds_test.append(pred_g)

# Combine predictions by simple averaging
y_pred_late = np.mean(preds_test, axis=0)

print("Late Fusion - R²:", r2_score(y_val, y_pred_late))
# mean_squared_error returns the MSE, so it is labeled as MSE
print("Late Fusion - MSE:", mean_squared_error(y_val, y_pred_late))

Late Fusion - R²: 0.6952073694551999
Late Fusion - MSE: 3.127768594326613


In [14]:
# Define feature groups (adjust as needed)
group1 = ['assignatura', 'codi_assignatura', 'curs_academic']
group2 = ['discapacitat', 'estudis_mare', 'estudis_pare']
group3 = [ 'sexe', 'taxa_exit' ]
group4 = ['nota_d_acces',  'via_acces_estudi']  

feature_groups = [group1, group2, group3, group4]

models = []
preds_test = []

for group in feature_groups:
    X_train_g = X_train[group]
    X_test_g = X_val[group]
    
    cat_cols_g = X_train_g.select_dtypes(include='object').columns.tolist()
    preprocessor_g = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_g)
    ], remainder='passthrough')
    
    model_g = Pipeline([
        ('preprocessor', preprocessor_g),
        ('regressor', VotingRegressor([
            ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
            ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
        ]))
    ])
    
    model_g.fit(X_train_g, y_train)
    models.append(model_g)
    
    pred_g = model_g.predict(X_test_g)
    preds_test.append(pred_g)

# Combine predictions by simple averaging
y_pred_late = np.mean(preds_test, axis=0)

print("Late Fusion - R²:", r2_score(y_val, y_pred_late))
# mean_squared_error returns the MSE, so it is labeled as MSE
print("Late Fusion - MSE:", mean_squared_error(y_val, y_pred_late))

Late Fusion - R²: 0.22040590859271847
Late Fusion - MSE: 7.995504919243982
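
The late-fusion cells repeat the same loop with different `feature_groups`. One way to make the comparisons easier to rerun is to wrap the loop in a function. The sketch below is an assumption about how that could look, not code from the notebook: `make_group_model` and `late_fusion_scores` are hypothetical helpers, and the toy frame stands in for `df_train`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeRegressor


def make_group_model(X_group):
    """One-hot encode non-numeric columns, then fit the voting ensemble."""
    # exclude='number' selects string columns regardless of pandas version;
    # the notebook itself uses include='object'
    cat_cols = X_group.select_dtypes(exclude='number').columns.tolist()
    return Pipeline([
        ('preprocessor', ColumnTransformer(
            [('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
            remainder='passthrough')),
        ('regressor', VotingRegressor([
            ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
            ('dt', DecisionTreeRegressor(max_depth=8, random_state=42)),
        ])),
    ])


def late_fusion_scores(feature_groups, X_train, X_val, y_train, y_val):
    """Train one model per group, average the predictions, return (R², MSE)."""
    preds = []
    for group in feature_groups:
        model = make_group_model(X_train[group]).fit(X_train[group], y_train)
        preds.append(model.predict(X_val[group]))
    y_pred = np.mean(preds, axis=0)
    return r2_score(y_val, y_pred), mean_squared_error(y_val, y_pred)


# Toy data with made-up columns, only to show the call
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'subject': rng.choice(['A', 'B', 'C'], size=300),
    'entry_grade': rng.uniform(5, 14, size=300),
    'sex': rng.choice(['F', 'M'], size=300),
})
toy_y = (0.5 * toy['entry_grade'] + 1.5 * (toy['subject'] == 'A')
         + rng.normal(scale=0.5, size=300))

Xtr, Xva, ytr, yva = train_test_split(toy, toy_y, test_size=0.2, random_state=42)
r2, mse = late_fusion_scores([['subject'], ['entry_grade', 'sex']], Xtr, Xva, ytr, yva)
print(f"R²: {r2:.3f}  MSE: {mse:.3f}")
```

With a helper like this, each grouping becomes a single call on the same `X_train`, `X_val`, `y_train`, `y_val`, which makes it harder for results to depend on the order in which cells were run.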


Each feature in its own group:

In [7]:
# Define feature groups (adjust as needed)
group1 = ['assignatura']
group2 = ['discapacitat']
group3 = [ 'taxa_exit' ]
group4 = [ 'via_acces_estudi']  
group5 = ['codi_assignatura']
group6 = ['nota_d_acces']
group7 = ['curs_academic']
group8 = ['estudis_mare']
group9 = ['estudis_pare']
group10 = ['sexe']


feature_groups = [group1, group2, group3, group4, group5, group6, group7, group8, group9, group10]

models = []
preds_test = []

for group in feature_groups:
    X_train_g = X_train[group]
    X_test_g = X_val[group]
    
    cat_cols_g = X_train_g.select_dtypes(include='object').columns.tolist()
    preprocessor_g = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols_g)
    ], remainder='passthrough')
    
    model_g = Pipeline([
        ('preprocessor', preprocessor_g),
        ('regressor', VotingRegressor([
            ('rf', RandomForestRegressor(n_estimators=30, max_depth=10, random_state=42)),
            ('gb', GradientBoostingRegressor(n_estimators=50, random_state=42)),
            ('dt', DecisionTreeRegressor(max_depth=8, random_state=42))
        ]))
    ])
    
    model_g.fit(X_train_g, y_train)
    models.append(model_g)
    
    pred_g = model_g.predict(X_test_g)
    preds_test.append(pred_g)

# Combine predictions by simple averaging
y_pred_late = np.mean(preds_test, axis=0)

print("Late Fusion - R²:", r2_score(y_val, y_pred_late))
# mean_squared_error returns the MSE, so it is labeled as MSE
print("Late Fusion - MSE:", mean_squared_error(y_val, y_pred_late))

Late Fusion - R²: 0.11886146320174007
Late Fusion - MSE: 9.036943177427638
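
One possible reading of this drop, not verified on this dataset, is that simple averaging gives every group model the same weight, so many weak single-feature models dilute the few informative ones. The sketch below illustrates that effect on synthetic data and compares it with a weighted average; the weighting scheme (clipped R² on a separate split) is an assumption chosen for illustration, not something the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic data: column 0 is informative, columns 1-4 are pure noise
X = rng.normal(size=(1000, 5))
y = 3.0 * X[:, 0] + rng.normal(size=1000)

# Three splits: fit the models, choose the weights, evaluate
X_fit, X_wgt, X_val = X[:600], X[600:800], X[800:]
y_fit, y_wgt, y_val = y[:600], y[600:800], y[800:]

val_preds, val_r2, wgt_r2 = [], [], []
for j in range(X.shape[1]):
    model = LinearRegression().fit(X_fit[:, [j]], y_fit)  # one model per feature
    val_preds.append(model.predict(X_val[:, [j]]))
    val_r2.append(r2_score(y_val, val_preds[-1]))
    wgt_r2.append(r2_score(y_wgt, model.predict(X_wgt[:, [j]])))

# Equal-weight late fusion, as in the cells above
r2_equal = r2_score(y_val, np.mean(val_preds, axis=0))

# Weighted late fusion: weights proportional to R² on the weighting split
weights = np.clip(wgt_r2, 0.0, None)
weights = weights / weights.sum()
r2_weighted = r2_score(y_val, np.average(val_preds, axis=0, weights=weights))

print("best single model R²:", max(val_r2))
print("equal-weight fusion R²:", r2_equal)
print("weighted fusion R²:", r2_weighted)
```

On this toy setup the equal-weight average scores well below the best single model, while the weighted average recovers most of it. Whether the same holds for the student data would have to be checked separately.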


## Results:

The error values below are mean squared errors, as returned by `mean_squared_error`.

Results of early fusion are:

Early Fusion - R²: 0.4742565397311099
Early Fusion - MSE: 5.392016780491727

The model explains about 47% of the variance in grades on the validation split. The remaining 53% is unexplained.


Best results of late fusion are:
Late Fusion - R²: 0.6952073694551999
Late Fusion - MSE: 3.127768594326613

With these groups:
group1 = ['assignatura', 'codi_assignatura', 'curs_academic']
group2 = ['discapacitat', 'estudis_mare', 'estudis_pare']
group3 = [ 'sexe', 'taxa_exit' ]
group4 = ['nota_d_acces',  'via_acces_estudi']  

In this run the model explains almost 70% of the variance, with a lower error than early fusion (MSE 3.13 vs 5.39). A plausible reason is that each group is internally coherent and carries enough predictive information on its own. Note, however, that the cell that produced this result is marked In [None], and cell In [14], which uses the same four groups, reported R² 0.22, so the figure should be confirmed by re-running the notebook from a clean state.

Taken at face value, separating the sources of information can improve the model's predictive capacity.
