# **Time Series**

In this notebook we cover the following time series topics:
- Time series forecasting in Python.
- Statistical models (AR, ARIMA, SARIMA, Exponential Smoothing).
- Recursive forecasting (Random Forest, Gradient Boosting Regression).
- Multivariate forecasting and ensemble modeling.

First we import all the libraries we will use, installing them beforehand if necessary.

In [51]:
# Import the required libraries

# Data manipulation
import pandas as pd

# Data preparation
from sklearn.preprocessing import LabelEncoder, scale
from sklearn.model_selection import train_test_split

# Time series models
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Machine learning models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Evaluation metrics
from sklearn.metrics import mean_squared_error

The next step is to load the cleaned data.

In [52]:
datos = pd.read_csv('../../data/partidos_limpio.csv')
datos.head()

Unnamed: 0,Season,Round,Day,Date,Results,Home,Country (Home),Points (Home),Score (Home),Score (Away),...,MP_away,Starts_away,Gls_away,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away
0,2023-2024,Round of 16,Tue,2024-02-13,A,RB Leipzig,Germany,88.736698,0,1,...,10.0,110.0,20.0,17.0,37.0,20.0,0.0,1.0,18.0,0.0
1,2023-2024,Round of 16,Tue,2024-02-13,A,FC Copenhagen,Denmark,80.431647,1,3,...,10.0,110.0,28.0,20.0,48.0,25.0,3.0,3.0,10.0,0.0
2,2023-2024,Round of 16,Wed,2024-02-14,H,Paris S-G,France,114.33458,2,0,...,8.0,88.0,8.0,5.0,13.0,8.0,0.0,1.0,18.0,0.0
3,2023-2024,Round of 16,Wed,2024-02-14,H,Lazio,Italy,99.943311,1,0,...,10.0,110.0,18.0,14.0,32.0,16.0,2.0,2.0,13.0,1.0
4,2023-2024,Round of 16,Tue,2024-02-20,D,PSV Eindhoven,The Netherlands,98.784903,1,1,...,10.0,110.0,15.0,12.0,27.0,14.0,1.0,1.0,16.0,0.0


### Data preparation

Let us look at the general information about our data.

In [53]:
datos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 598 entries, 0 to 597
Data columns (total 39 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Season          598 non-null    object 
 1   Round           598 non-null    object 
 2   Day             598 non-null    object 
 3   Date            598 non-null    object 
 4   Results         598 non-null    object 
 5   Home            598 non-null    object 
 6   Country (Home)  598 non-null    object 
 7   Points (Home)   598 non-null    float64
 8   Score (Home)    598 non-null    int64  
 9   Score (Away)    598 non-null    int64  
 10  Points (Away)   598 non-null    float64
 11  Country (Away)  598 non-null    object 
 12  Away            598 non-null    object 
 13  Venue           598 non-null    object 
 14  Referee         598 non-null    object 
 15  # Pl_home       540 non-null    float64
 16  Age_home        540 non-null    float64
 17  MP_home         540 non-null    flo

We see that some rows still contain null values. This was not a concern while cleaning the data, but the models in this notebook cannot be fitted with missing values, so we drop those rows.

In [54]:
# Drop the rows that contain null values
datos = datos.dropna()

# Check that the change was applied correctly
datos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 540 entries, 0 to 597
Data columns (total 39 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Season          540 non-null    object 
 1   Round           540 non-null    object 
 2   Day             540 non-null    object 
 3   Date            540 non-null    object 
 4   Results         540 non-null    object 
 5   Home            540 non-null    object 
 6   Country (Home)  540 non-null    object 
 7   Points (Home)   540 non-null    float64
 8   Score (Home)    540 non-null    int64  
 9   Score (Away)    540 non-null    int64  
 10  Points (Away)   540 non-null    float64
 11  Country (Away)  540 non-null    object 
 12  Away            540 non-null    object 
 13  Venue           540 non-null    object 
 14  Referee         540 non-null    object 
 15  # Pl_home       540 non-null    float64
 16  Age_home        540 non-null    float64
 17  MP_home         540 non-null    flo

We observe that some variables are categorical; we convert them to numeric so they can be included in our models. In the same step we scale the newly encoded variables and store the scaled value directly in the mapping dictionary. The target `Results` is encoded but left unscaled.

In [55]:
# Columns to transform
cols = ['Season', 'Round', 'Day', 'Date', 'Results', 'Home', 'Away', 'Country (Home)', 'Country (Away)', 'Venue', 'Referee', 'Year', 'Month', 'Number Day']

# Initialize the label encoder
label_encoder = LabelEncoder()

# Dictionary that stores the mapping from original values to encoded (and scaled) values
mapping = {}

# Convert the 'Date' column to datetime
datos['Date'] = pd.to_datetime(datos['Date'])

# Iterate over the columns and transform them
for col in cols:
    # Split the 'Date' column into year, month and day of the month
    if col == 'Date':
        datos['Year'] = datos['Date'].dt.year
        datos['Month'] = datos['Date'].dt.month
        datos['Number Day'] = datos['Date'].dt.day
        continue
    
    # Store the original unique values before the transformation
    unique_values = datos[col].unique()
    
    datos[col] = label_encoder.fit_transform(datos[col])

    if col != 'Results':
        # Scale the encoded values; the target 'Results' keeps its integer codes
        datos[col] = scale(datos[col])
    
    # Map each original value to its transformed value (both are in order of first appearance)
    mapping[col] = dict(zip(unique_values, datos[col].unique()))

# Drop the 'Date' column
datos.drop('Date', axis=1, inplace=True)

# Check the changes
datos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 540 entries, 0 to 597
Data columns (total 41 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Season          540 non-null    float64
 1   Round           540 non-null    float64
 2   Day             540 non-null    float64
 3   Results         540 non-null    int32  
 4   Home            540 non-null    float64
 5   Country (Home)  540 non-null    float64
 6   Points (Home)   540 non-null    float64
 7   Score (Home)    540 non-null    int64  
 8   Score (Away)    540 non-null    int64  
 9   Points (Away)   540 non-null    float64
 10  Country (Away)  540 non-null    float64
 11  Away            540 non-null    float64
 12  Venue           540 non-null    float64
 13  Referee         540 non-null    float64
 14  # Pl_home       540 non-null    float64
 15  Age_home        540 non-null    float64
 16  MP_home         540 non-null    float64
 17  Starts_home     540 non-null    flo
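
To make the mapping step concrete, here is a minimal sketch on a toy column. The five values are illustrative and simply follow the order of the first rows shown earlier. `LabelEncoder` assigns integer codes following the sorted order of the classes, and pairing the first-appearance order of the original values with the first-appearance order of the codes gives the original-to-encoded dictionary.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column (illustrative values only)
results = pd.Series(['A', 'A', 'H', 'H', 'D'])

encoder = LabelEncoder()
encoded = pd.Series(encoder.fit_transform(results))

# unique() preserves the order of first appearance in both series,
# so zipping them pairs each original value with its code
toy_mapping = {k: int(v) for k, v in zip(results.unique(), encoded.unique())}

print(list(encoder.classes_))  # classes are stored in sorted order
print(toy_mapping)
```

The same pairing still holds after scaling, because scaling changes the values but not the order in which they first appear.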

### Models

We begin the time series modeling by splitting the data into training and test sets.

In [56]:
# Define our x and y variables
x = datos.drop(labels=['Results', 'Score (Home)', 'Score (Away)', 'Referee'], axis=1)
y = datos['Results']

# Standardize the data

# Columns to standardize
cols = ['Points (Home)', 'Points (Away)', '# Pl_home','Age_home','MP_home','Starts_home','Gls_home','Ast_home','G+A_home','G-PK_home','PK_home','PKatt_home','CrdY_home','CrdR_home','# Pl_away','Age_away','MP_away','Starts_away','Gls_away','Ast_away','G+A_away','G-PK_away','PK_away','PKatt_away','CrdY_away','CrdR_away']

# Scale all the listed columns at once (scale standardizes each column independently)
x[cols] = scale(x[cols])

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
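
Note that `train_test_split` shuffles the rows by default, so the training and test sets are not separated in time. For models that depend on the order of the observations, a chronological split is a common alternative. The sketch below is illustrative only: the toy DataFrame and the helper name `chronological_split` are assumptions, not part of this notebook.

```python
import pandas as pd

def chronological_split(df, date_col, test_size=0.3):
    """Hypothetical helper: sort by date and hold out the most recent rows."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = int(round(len(ordered) * test_size))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data (illustrative only): ten matches in arbitrary order
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2024-02-20', '2024-02-13', '2024-03-05',
                            '2024-02-14', '2024-03-12', '2024-04-09',
                            '2024-04-30', '2024-05-01', '2024-05-07',
                            '2024-05-08']),
    'Results': [1, 0, 2, 2, 1, 0, 1, 1, 2, 2],
})

train, test = chronological_split(toy, 'Date', test_size=0.3)
print(len(train), len(test))
print(train['Date'].max() < test['Date'].min())  # every training row precedes the test rows
```

When the rows are already sorted by date, `train_test_split(..., shuffle=False)` gives a similar positional split.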

Next we fit the statistical time series models, followed by the machine learning models.

In [57]:
# AR
model_ar = AutoReg(y_train, lags=5)
fit_ar = model_ar.fit()
predictions_ar = fit_ar.predict(start=len(y_train), end=len(y_train) + len(y_test) - 1)

# ARIMA
model_arima = ARIMA(y_train, order=(5,1,0))
fit_arima = model_arima.fit()
predictions_arima = fit_arima.forecast(steps=len(y_test))

# SARIMA
model_sarima = SARIMAX(y_train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
fit_sarima = model_sarima.fit()
predictions_sarima = fit_sarima.forecast(steps=len(y_test))

# Exponential Smoothing
model_exp = ExponentialSmoothing(y_train, seasonal='add', seasonal_periods=12)
fit_exp = model_exp.fit()
predictions_exp = fit_exp.forecast(steps=len(y_test))

# Machine learning models
# Random Forest
model_rf = RandomForestRegressor()
model_rf.fit(x_train, y_train)
predictions_rf = model_rf.predict(x_test)

# Gradient Boosting Regression
model_gb = GradientBoostingRegressor()
model_gb.fit(x_train, y_train)
predictions_gb = model_gb.predict(x_test)
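
Recursive forecasting, listed among the topics at the top, feeds each prediction back in as an input for the next step. The cell above predicts directly from match features instead, so the following is only a sketch of the recursive idea on a synthetic series. The helper names `make_lagged` and `recursive_forecast` and the choice of three lags are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lagged(series, n_lags):
    """Build (X, y) where each row of X holds the n_lags values preceding y."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.array(series[n_lags:])
    return X, y

def recursive_forecast(model, history, n_lags, steps):
    """Predict one step at a time, feeding each prediction back into the window."""
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(steps):
        next_value = float(model.predict(np.array(window).reshape(1, -1))[0])
        forecasts.append(next_value)
        # Drop the oldest value and append the newest prediction
        window = window[1:] + [next_value]
    return forecasts

# Synthetic series (illustrative only)
series = [float(np.sin(i / 3.0)) for i in range(60)]
n_lags = 3

X, y = make_lagged(series, n_lags)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

forecast = recursive_forecast(model, series, n_lags, steps=5)
print(forecast)
```

Because each step consumes earlier predictions, errors can accumulate as the horizon grows.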



We compute the mean squared error of each model to evaluate its performance.

In [58]:
# Performance evaluation
mse_ar = mean_squared_error(y_test, predictions_ar)
mse_arima = mean_squared_error(y_test, predictions_arima)
mse_sarima = mean_squared_error(y_test, predictions_sarima)
mse_exp = mean_squared_error(y_test, predictions_exp)
mse_rf = mean_squared_error(y_test, predictions_rf)
mse_gb = mean_squared_error(y_test, predictions_gb)

print("Mean Squared Error (AR):", mse_ar)
print("Mean Squared Error (ARIMA):", mse_arima)
print("Mean Squared Error (SARIMA):", mse_sarima)
print("Mean Squared Error (Exponential Smoothing):", mse_exp)
print("Mean Squared Error (Random Forest):", mse_rf)
print("Mean Squared Error (Gradient Boosting):", mse_gb)

Mean Squared Error (AR): 0.7806611753927005
Mean Squared Error (ARIMA): 0.9670572125985409
Mean Squared Error (SARIMA): 0.7870244234157282
Mean Squared Error (Exponential Smoothing): 0.7963452295931766
Mean Squared Error (Random Forest): 0.4735512345679013
Mean Squared Error (Gradient Boosting): 0.4964359369577438


The Mean Squared Error (MSE) is the average of the squared differences between the model's predictions and the actual values. A lower MSE indicates better predictive ability, and a higher MSE indicates worse predictive ability.
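
As a small worked example of this metric, the sketch below computes the MSE by hand on made-up values. The function name `mean_squared_error_manual` is only illustrative; the notebook itself uses `sklearn.metrics.mean_squared_error`.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Illustrative values using the integer codes of the target (0, 1, 2)
y_true = [0, 1, 2, 2]
y_pred = [1.0, 1.0, 1.0, 2.0]

# Squared errors are 1, 0, 1, 0, so the mean is 2 / 4 = 0.5
print(mean_squared_error_manual(y_true, y_pred))
```

A model that always predicts 1 is off by one class whenever the actual code is 0 or 2, and each such miss adds 1 to the sum of squared errors.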

Comparing the MSE of each model, we conclude that the best performers (that is, those that predict best) are Random Forest and Gradient Boosting, although their MSE is still fairly high for a target that only takes the values 0, 1 and 2.

### Predictions

Once the models are trained and evaluated, let us put them to the test. We will see which teams in the following DataFrame advance to the final.

#### Semi-finals

In [59]:
# Define a list named 'semis' containing the semi-final match data
semis = [['2023-2024', 'Semi-finals', 'Tue', 'Bayern Munich', 'Germany', 107.882298136646, 114.5545351473923, 'Spain', 'Real Madrid', 'Allianz Arena', 23, 28.3, 10, 110, 18, 14, 32, 16, 2, 2, 13.0, 1.0, 22, 28.0, 10, 110, 20, 17, 37, 20, 0, 1, 18.0, 0.0, 2024, 4, 30],
        ['2023-2024', 'Semi-finals', 'Wed', 'Dortmund', 'Germany', 91.17303312629399, 114.33458049886625, 'France', 'Paris S-G', 'Signal Iduna Park', 23, 28.0, 10, 110, 15, 12, 27, 14, 1, 1, 16.0, 0.0, 21, 25.3, 10, 110, 19, 12, 31, 16, 3, 3, 27.0, 0.0, 2024, 5, 1],
        ['2023-2024', 'Semi-finals', 'Tue', 'Paris S-G', 'France', 114.33458049886625, 91.17303312629399, 'Germany', 'Dortmund', 'Parc des Princes', 21, 25.3, 10, 110, 19, 12, 31, 16, 3, 3, 27.0, 0.0, 23, 28.0, 10, 110, 15, 12, 27, 14, 1, 1, 16.0, 0.0, 2024, 5, 7],
        ['2023-2024', 'Semi-finals', 'Wed', 'Real Madrid', 'Spain', 114.5545351473923, 107.882298136646, 'Germany', 'Bayern Munich', 'Estadio Santiago Bernabéu', 22, 28.0, 10, 110, 20, 17, 37, 20, 0, 1, 18.0, 0.0, 23, 28.3, 10, 110, 18, 14, 32, 16, 2, 2, 13.0, 1.0, 2024, 5, 8]]

# Take the relevant columns of the 'datos' DataFrame to use as column names for the 'semis' DataFrame
partidos_cols = datos.drop(labels=['Results', 'Score (Home)', 'Score (Away)', 'Referee'], axis=1).columns

# Create the 'semis' DataFrame from the 'semis' list, using the columns taken from 'datos'
semis = pd.DataFrame(semis, columns=(partidos_cols))

semis.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,2023-2024,Semi-finals,Tue,Bayern Munich,Germany,107.882298,114.554535,Spain,Real Madrid,Allianz Arena,...,17,37,20,0,1,18.0,0.0,2024,4,30
1,2023-2024,Semi-finals,Wed,Dortmund,Germany,91.173033,114.33458,France,Paris S-G,Signal Iduna Park,...,12,31,16,3,3,27.0,0.0,2024,5,1
2,2023-2024,Semi-finals,Tue,Paris S-G,France,114.33458,91.173033,Germany,Dortmund,Parc des Princes,...,12,27,14,1,1,16.0,0.0,2024,5,7
3,2023-2024,Semi-finals,Wed,Real Madrid,Spain,114.554535,107.882298,Germany,Bayern Munich,Estadio Santiago Bernabéu,...,14,32,16,2,2,13.0,1.0,2024,5,8


We convert all the categorical columns to numeric.

In [60]:
data = semis.copy()

# Apply the mapping to the columns
for col, col_mapping in mapping.items():
    if col in data.columns:
        data[col] = data[col].map(col_mapping)
    else:
        if col == 'Squad':
            data['Home'] = data['Home'].map(col_mapping)
            data['Away'] = data['Away'].map(col_mapping)
        elif col == 'Country':
            data['Country (Home)'] = data['Country (Home)'].map(col_mapping)
            data['Country (Away)'] = data['Country (Away)'].map(col_mapping)

data.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,1.680782,1.725816,-0.467123,-1.13818,-0.481396,107.882298,114.554535,1.29397,1.227972,-1.517567,...,17,37,20,0,1,18.0,0.0,1.680782,0.644724,2.074771
1,1.680782,1.725816,0.751459,-0.667073,-0.481396,91.173033,114.33458,-0.76983,0.933521,0.722487,...,12,31,16,3,3,27.0,0.0,1.680782,1.551367,-1.822424
2,1.680782,1.725816,-0.467123,0.92291,-0.777134,114.33458,91.173033,-0.475002,-0.656518,0.535815,...,12,27,14,1,1,16.0,0.0,1.680782,1.551367,-1.016108
3,1.680782,1.725816,0.751459,1.217352,1.293032,114.554535,107.882298,-0.475002,-1.127641,-0.615323,...,14,32,16,2,2,13.0,1.0,1.680782,1.551367,-0.881722


We scale the columns that were not modified by the mapping.

In [61]:
# Columns to standardize
cols = ['Points (Home)', 'Points (Away)', '# Pl_home','Age_home','MP_home','Starts_home','Gls_home','Ast_home','G+A_home','G-PK_home','PK_home','PKatt_home','CrdY_home','CrdR_home','# Pl_away','Age_away','MP_away','Starts_away','Gls_away','Ast_away','G+A_away','G-PK_away','PK_away','PKatt_away','CrdY_away','CrdR_away']

# Scale all the listed columns at once (scale standardizes each column independently)
data[cols] = scale(data[cols])

data.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,1.680782,1.725816,-0.467123,-1.13818,-0.481396,0.094187,0.795424,1.29397,1.227972,-1.517567,...,1.588203,1.473911,1.60591,-1.341641,-0.904534,-0.095783,-0.57735,1.680782,0.644724,2.074771
1,1.680782,1.725816,0.751459,-0.667073,-0.481396,-1.661918,0.772307,-0.76983,0.933521,0.722487,...,-0.855186,-0.210559,-0.229416,1.341641,1.507557,1.628305,-0.57735,1.680782,1.551367,-1.822424
2,1.680782,1.725816,-0.467123,0.92291,-0.777134,0.772307,-1.661918,-0.475002,-0.656518,0.535815,...,-0.855186,-1.333539,-1.147079,-0.447214,-0.904534,-0.478913,-0.57735,1.680782,1.551367,-1.016108
3,1.680782,1.725816,0.751459,1.217352,1.293032,0.795424,0.094187,-0.475002,-1.127641,-0.615323,...,0.122169,0.070186,-0.229416,0.447214,0.301511,-1.053609,1.732051,1.680782,1.551367,-0.881722
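
Calling `scale` on the four semi-final rows standardizes them against their own mean and standard deviation rather than against the training data, so the same raw value can end up with a different scaled value than it had during training. One possible alternative, sketched below with made-up numbers, is to fit a `StandardScaler` on the training features and reuse it for new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training feature (one column) and two new rows
train_points = np.array([[80.0], [90.0], [100.0], [110.0], [120.0]])
new_points = np.array([[100.0], [120.0]])

scaler = StandardScaler()
scaler.fit(train_points)                    # learn mean and std from the training data only
new_scaled = scaler.transform(new_points)   # apply the same parameters to the new rows

print(scaler.mean_, scaler.scale_)
print(new_scaled.ravel())
```

Here the new value 100.0 maps to 0 because it equals the training mean, regardless of which other rows are scored alongside it.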


We make our predictions and display them.

In [62]:
# Predictions
pred_ar_semis = fit_ar.predict(start=len(y_train), end=len(y_train) + len(data) - 1)
pred_arima_semis = fit_arima.forecast(steps=len(data))
pred_sarima_semis = fit_sarima.forecast(steps=len(data))
pred_exp_semis = fit_exp.forecast(steps=len(data))
pred_rf_semis = model_rf.predict(data)
pred_gb_semis = model_gb.predict(data)

# Variables used to display the results
X_home = semis['Home'].tolist()
X_away = semis['Away'].tolist()
predictions_ar_lst = pred_ar_semis.tolist()
predictions_arima_lst = pred_arima_semis.tolist()
predictions_sarima_lst = pred_sarima_semis.tolist()
predictions_exp_lst = pred_exp_semis.tolist()
predictions_rf_lst = pred_rf_semis.tolist()
predictions_gb_lst = pred_gb_semis.tolist()

# Create a DataFrame with the predicted values
res = pd.DataFrame({'Home': X_home, 'Away': X_away, 'AR': predictions_ar_lst, 'ARIMA': predictions_arima_lst, 'SARIMA': predictions_sarima_lst, 'Exponential Smoothing': predictions_exp_lst, 'Random Forest': predictions_rf_lst, 'Gradient Boosting': predictions_gb_lst})

# Show the results
res



Unnamed: 0,Home,Away,AR,ARIMA,SARIMA,Exponential Smoothing,Random Forest,Gradient Boosting
0,Bayern Munich,Real Madrid,1.363581,0.901447,1.162919,1.354937,1.24,1.647804
1,Dortmund,Paris S-G,1.331972,0.748488,1.09265,1.322639,1.14,1.039123
2,Paris S-G,Dortmund,1.1929,0.713214,1.145572,1.387128,1.57,1.587373
3,Real Madrid,Bayern Munich,1.196364,0.639334,0.841413,1.064729,1.59,1.467369


These raw values do not directly tell us the winner of each match. Let us round them to the nearest integer and convert each one back to the corresponding label found in mapping['Results'].

In [63]:
aprox_ar = [round(valor) for valor in predictions_ar_lst]
aprox_arima = [round(valor) for valor in predictions_arima_lst]
aprox_sarima = [round(valor) for valor in predictions_sarima_lst]
aprox_exp = [round(valor) for valor in predictions_exp_lst]
aprox_rf = [round(valor) for valor in predictions_rf_lst]
aprox_gb = [round(valor) for valor in predictions_gb_lst]

# Build the inverse of the 'Results' mapping (code -> label)
reverse_mapping = {value: key for key, value in mapping['Results'].items()}

# Map the rounded values back to labels
for i in range(len(aprox_ar)):
    aprox_ar[i] = reverse_mapping.get(aprox_ar[i], None)
    aprox_arima[i] = reverse_mapping.get(aprox_arima[i], None)
    aprox_sarima[i] = reverse_mapping.get(aprox_sarima[i], None)
    aprox_exp[i] = reverse_mapping.get(aprox_exp[i], None)
    aprox_rf[i] = reverse_mapping.get(aprox_rf[i], None)
    aprox_gb[i] = reverse_mapping.get(aprox_gb[i], None)

# Update the 'res' DataFrame with the rounded, relabeled values
res['AR'] = aprox_ar
res['ARIMA'] = aprox_arima
res['SARIMA'] = aprox_sarima
res['Exponential Smoothing'] = aprox_exp
res['Random Forest'] = aprox_rf
res['Gradient Boosting'] = aprox_gb

# Show the results
res

Unnamed: 0,Home,Away,AR,ARIMA,SARIMA,Exponential Smoothing,Random Forest,Gradient Boosting
0,Bayern Munich,Real Madrid,D,D,D,D,D,H
1,Dortmund,Paris S-G,D,D,D,D,D,D
2,Paris S-G,Dortmund,D,D,D,D,H,H
3,Real Madrid,Bayern Munich,D,D,D,D,H,D
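
The rounding step can be sketched as follows. Python's built-in `round` sends exact halves to the nearest even integer, and a regressor is not guaranteed to stay inside the 0 to 2 range, so this sketch clips the rounded value before looking it up. The helper name `to_label` is an assumption; the label codes are the ones stored in `mapping['Results']`.

```python
# Codes as stored in mapping['Results']
results_mapping = {'A': 0, 'H': 2, 'D': 1}
reverse_mapping = {code: label for label, code in results_mapping.items()}

def to_label(value, low=0, high=2):
    """Round a continuous prediction to the nearest valid code and return its label."""
    code = min(max(round(value), low), high)
    return reverse_mapping[code]

# Two values taken from the table above plus two illustrative edge cases
predictions = [1.24, 1.647804, 0.3, 2.7]
print([to_label(v) for v in predictions])
```

Without the clipping, a value such as 2.7 would round to 3 and the dictionary lookup with `.get(..., None)` would return `None`.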


We see that these models tend to predict draws for our matches. We therefore cannot determine which teams advance to the last round of the tournament, except with the Random Forest and Gradient Boosting models.

Random Forest indicates that the final would be Real Madrid vs Paris S-G, whereas Gradient Boosting points to Bayern Munich vs Paris S-G.

#### Final

In [64]:
final1 = [['2023-2024', 'Final', 'Sat', 'Real Madrid', 'Spain', 114.5545351473923, 114.33458049886625, 'France', 'Paris S-G', 'Wembley Stadium', 22, 28.0, 10, 110, 20, 17, 37, 20, 0, 1, 18.0, 0.0, 21, 25.3, 10, 110, 19, 12, 31, 16, 3, 3, 27.0, 0.0, 2024, 6, 1]]
final1 = pd.DataFrame(final1, columns=(partidos_cols))

final1.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,2023-2024,Final,Sat,Real Madrid,Spain,114.554535,114.33458,France,Paris S-G,Wembley Stadium,...,12,31,16,3,3,27.0,0.0,2024,6,1


In [65]:
final2 = [['2023-2024', 'Final', 'Sat', 'Bayern Munich', 'Germany', 107.882298136646, 114.33458049886625, 'France', 'Paris S-G', 'Wembley Stadium', 23, 28.3, 10, 110, 18, 14, 32, 16, 2, 2, 13.0, 1.0, 21, 25.3, 10, 110, 19, 12, 31, 16, 3, 3, 27.0, 0.0, 2024, 6, 1]]
final2 = pd.DataFrame(final2, columns=(partidos_cols))

final2.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,2023-2024,Final,Sat,Bayern Munich,Germany,107.882298,114.33458,France,Paris S-G,Wembley Stadium,...,12,31,16,3,3,27.0,0.0,2024,6,1


We build two datasets, one for each candidate final, and convert the columns to numeric using the same mapping. We also scale the columns that the mapping left unchanged.

In [66]:
data1 = final1.copy()

# Apply the mapping to the columns
for col, col_mapping in mapping.items():
    if col in data1.columns:
        data1[col] = data1[col].map(col_mapping)
    else:
        if col == 'Squad':
            data1['Home'] = data1['Home'].map(col_mapping)
            data1['Away'] = data1['Away'].map(col_mapping)
        elif col == 'Country':
            data1['Country (Home)'] = data1['Country (Home)'].map(col_mapping)
            data1['Country (Away)'] = data1['Country (Away)'].map(col_mapping)

# A one-row DataFrame cannot be standardized on its own, so we reuse the standardized values from the semi-finals DataFrame 'data'

# Columns to standardize
cols_h = ['Points (Home)', '# Pl_home','Age_home','MP_home','Starts_home','Gls_home','Ast_home','G+A_home','G-PK_home','PK_home','PKatt_home','CrdY_home','CrdR_home']
cols_a = ['Points (Away)', '# Pl_away','Age_away','MP_away','Starts_away','Gls_away','Ast_away','G+A_away','G-PK_away','PK_away','PKatt_away','CrdY_away','CrdR_away']

# Take the standardized values from the semi-finals
# NOTE: in 'data', row 0 is Bayern Munich at home and row 2 has Dortmund away; the rows that match
# this final (Real Madrid at home, Paris S-G away) are 3 and 1. The indices are kept as originally
# run so that the outputs below remain consistent with the code.
datos_estandarizados_h = data.loc[0, cols_h]
datos_estandarizados_a = data.loc[2, cols_a]

# Replace the values in the 'data1' DataFrame
data1[cols_h] = datos_estandarizados_h[cols_h]
data1[cols_a] = datos_estandarizados_a[cols_a]

# Check the changes
data1.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,1.680782,-2.549141,-4.122872,1.217352,1.293032,0.094187,-1.661918,-0.76983,0.933521,1.655842,...,-0.855186,-1.333539,-1.147079,-0.447214,-0.904534,-0.478913,-0.57735,1.680782,2.458009,-1.822424


In [67]:
data2 = final2.copy()

# Apply the mapping to the columns
for col, col_mapping in mapping.items():
    if col in data2.columns:
        data2[col] = data2[col].map(col_mapping)
    else:
        if col == 'Squad':
            data2['Home'] = data2['Home'].map(col_mapping)
            data2['Away'] = data2['Away'].map(col_mapping)
        elif col == 'Country':
            data2['Country (Home)'] = data2['Country (Home)'].map(col_mapping)
            data2['Country (Away)'] = data2['Country (Away)'].map(col_mapping)

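# A one-row DataFrame cannot be standardized on its own, so we reuse the standardized values from the semi-finals DataFrame 'data'

# Take the standardized values from the semi-finals
# NOTE: in 'data', row 3 is Real Madrid at home and row 2 has Dortmund away; the rows that match
# this final (Bayern Munich at home, Paris S-G away) are 0 and 1. The indices are kept as originally
# run so that the outputs below remain consistent with the code.
datos_estandarizados_h = data.loc[3, cols_h]
datos_estandarizados_a = data.loc[2, cols_a]

# Replace the values in the 'data2' DataFrame
data2[cols_h] = datos_estandarizados_h[cols_h]
data2[cols_a] = datos_estandarizados_a[cols_a]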

# Check the changes
data2.head()

Unnamed: 0,Season,Round,Day,Home,Country (Home),Points (Home),Points (Away),Country (Away),Away,Venue,...,Ast_away,G+A_away,G-PK_away,PK_away,PKatt_away,CrdY_away,CrdR_away,Year,Month,Number Day
0,1.680782,-2.549141,-4.122872,-1.13818,-0.481396,0.795424,-1.661918,-0.76983,0.933521,1.655842,...,-0.855186,-1.333539,-1.147079,-0.447214,-0.904534,-0.478913,-0.57735,1.680782,2.458009,-1.822424
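
Picking semi-final rows by position is easy to get wrong, since the position has to match the team that plays at home or away in the final. A lookup by team name avoids hard-coded indices. The sketch below uses a reduced stand-in for the raw `semis` and scaled `data` DataFrames, with the scaled `Points` values shown earlier, and a hypothetical helper `scaled_row_for`.

```python
import pandas as pd

# Reduced stand-ins for the raw and scaled semi-final DataFrames
semis_toy = pd.DataFrame({
    'Home': ['Bayern Munich', 'Dortmund', 'Paris S-G', 'Real Madrid'],
    'Away': ['Real Madrid', 'Paris S-G', 'Dortmund', 'Bayern Munich'],
})
scaled_toy = pd.DataFrame({
    'Points (Home)': [0.094187, -1.661918, 0.772307, 0.795424],
    'Points (Away)': [0.795424, 0.772307, -1.661918, 0.094187],
})

def scaled_row_for(team, side, raw, scaled, cols):
    """Hypothetical helper: scaled values in `cols` from the row where `team` plays on `side`."""
    matches = raw.index[raw[side] == team]
    if len(matches) == 0:
        raise KeyError(f"{team} does not appear as {side}")
    return scaled.loc[matches[0], cols]

home_vals = scaled_row_for('Real Madrid', 'Home', semis_toy, scaled_toy, ['Points (Home)'])
away_vals = scaled_row_for('Paris S-G', 'Away', semis_toy, scaled_toy, ['Points (Away)'])
print(home_vals['Points (Home)'], away_vals['Points (Away)'])
```

With this lookup, the home-side values for Real Madrid come from the row where Real Madrid hosted, and the away-side values for Paris S-G come from the row where Paris S-G travelled.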


With the data prepared, we predict the outcome of the final with our models.

In [69]:
pred_rf_final = model_rf.predict(data1)
pred_gb_final = model_gb.predict(data2)

We display and analyze the results.

In [72]:
X_home = []
X_away = []
preds = []
pred_vals = [pred_rf_final[0], pred_gb_final[0]]

# Variables used to display the results
for i in range(2):
    if i == 0:
        h = final1['Home'][0]
        a = final1['Away'][0]
        pred = 'RF'
    else:
        h = final2['Home'][0]
        a = final2['Away'][0]
        pred = 'GB'

    X_home.append(h)
    X_away.append(a)
    preds.append(pred)

# Create a DataFrame with the predicted values
res = pd.DataFrame({'Home': X_home, 'Away': X_away, 'Prediction': preds, 'Value': pred_vals})

# Show the results
res

Unnamed: 0,Home,Away,Prediction,Value
0,Real Madrid,Paris S-G,RF,1.41
1,Bayern Munich,Paris S-G,GB,1.338778


Even without rounding the values, we can see that both results are close to 1, which corresponds to a draw ('D'). Let us confirm this by checking the mapping dictionary.

In [73]:
mapping['Results']

{'A': 0, 'H': 2, 'D': 1}

Indeed, both finals are predicted as draws, so with this time series approach we cannot conclude who will win the tournament.