# Project 1: Optimizing Flight Planning and Customer Experience Using the "Air Passengers" Dataset

## Installing libraries

In this project we use PyCaret, an AutoML library, to model the time series.

In [None]:
#!pip install pycaret==2.2.3
!pip install pycaret
!pip install -U scikit-learn==0.23.2
!pip install plotly==5.1.0 
!pip install plotly-express==0.4.1
!pip install pygwalker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pycaret
  Downloading pycaret-3.0.2-py3-none-any.whl (483 kB)
Collecting pyod>=1.0.8 (from pycaret)
  Downloading pyod-1.0.9.tar.gz (149 kB)
  Preparing metadata (setup.py) ... done
Collecting category-encoders>=2.4.0 (from pycaret)
  Downloading category_encoders-2.6.1-py2.py3-none-any.whl (81 kB)
Collecting importlib-metadata>=4.12.0 (from pycaret)
  Downloading importlib_metadata-6.6.0-py3-none-any.whl (22 kB)
Collecting deprecation>=2.1.0 (from pycaret)
  Downloading deprecation-2.1.0-py2.py3-none-an

In [6]:
import pycaret
import sklearn
import pandas as pd
import numpy as np
import plotly.express as px

print(pycaret.__version__)
print(sklearn.__version__)

1.2.2


## Importing the data

In [2]:
# Mount Google Drive in Google Colab
from google.colab import drive
drive.mount('/content/drive/')
# Change to the project folder in Drive
%cd '/content/drive/My Drive/Proyectos/1_Proyecto/'
# List the files in that folder
!ls

Mounted at /content/drive/
/content/drive/My Drive/Proyectos/1_Proyecto
1_Proyecto.ipynb  AirPassengers.csv  logs.log


In [7]:
df = pd.read_csv('AirPassengers.csv')

# Convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
df.head()

Unnamed: 0,Date,Passengers
0,1949-01-01,112
1,1949-02-01,118
2,1949-03-01,132
3,1949-04-01,129
4,1949-05-01,121


## Exploratory data analysis

https://www.datamasteryacademy.com/blog/pygwalker-tutorial-a-tableau-like-python-library-for-interactive-data-exploration-and-visualization

In [9]:
import pygwalker as pyg
pyg.walk(df)


In [None]:
df.tail()

Unnamed: 0,Date,Passengers
139,1960-08-01,606
140,1960-09-01,508
141,1960-10-01,461
142,1960-11-01,390
143,1960-12-01,432


In [None]:
# Dimensions of the DataFrame
df.shape

(144, 2)

In [None]:
# Function that counts the missing values in each column
def valores_faltantes(df):
    miss_values_count = df.isnull().sum()
    # Keep only the columns that have at least one missing value
    miss_values_count = miss_values_count[miss_values_count != 0]
    print(f"Number of columns with missing values: {miss_values_count.shape[0]}")
    if miss_values_count.shape[0]:
        print("Null count per column:")
        for name, miss_vals in miss_values_count.items():
            plural = miss_vals > 1
            print(f"  - Column '{name}' is missing {miss_vals} "
                  f"value{'s' if plural else ''}.")

# Call the function
valores_faltantes(df)

Number of columns with missing values: 0


In [None]:
# Check the data type of each column in the DataFrame
df.dtypes

Date          datetime64[ns]
Passengers             int64
dtype: object

The dataset has 144 monthly records dated from January 1949 to December 1960. There are no missing values, and each column has an appropriate data type.
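These properties can be checked in one place with pandas. The following is a minimal sketch, assuming a frame with the same layout as `df` (a `Date` column of month-start timestamps and a `Passengers` column). The `demo` frame and its placeholder values are synthetic stand-ins, not the real data.

```python
import pandas as pd

# Synthetic stand-in with the same layout as df (placeholder passenger values).
demo = pd.DataFrame({
    "Date": pd.date_range("1949-01-01", periods=144, freq="MS"),
    "Passengers": range(144),
})

n_rows = len(demo)
first, last = demo["Date"].min(), demo["Date"].max()
# infer_freq returns a frequency string when the dates are evenly spaced
# ("MS" means month start) and None when there are gaps.
freq = pd.infer_freq(pd.DatetimeIndex(demo["Date"]))
n_missing = int(demo.isnull().sum().sum())

print(n_rows, first.date(), last.date(), freq, n_missing)
```

Replacing `demo` with `df` runs the same checks on the loaded data.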

In [None]:
fig = px.line(df, x="Date", y=["Passengers"], template = 'plotly_dark')

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Passengers",
    title="Number of passengers over time"
)

fig.show()

The plot shows that demand grows over time, with a cycle that repeats roughly every 12 months. To see the underlying trend more clearly, we now plot the moving average:

In [None]:
# 12-month moving average
df['MA12'] = df['Passengers'].rolling(12).mean()
df

Unnamed: 0,Date,Passengers,MA12
0,1949-01-01,112,
1,1949-02-01,118,
2,1949-03-01,132,
3,1949-04-01,129,
4,1949-05-01,121,
...,...,...,...
139,1960-08-01,606,463.333333
140,1960-09-01,508,467.083333
141,1960-10-01,461,471.583333
142,1960-11-01,390,473.916667


In [None]:
fig = px.line(df, x="Date", y=["Passengers", "MA12"], template = 'plotly_dark')

fig.update_layout(
    xaxis_title="Date",
    yaxis_title="Passengers",
    title="Number of passengers and 12-month moving average over time"
)

fig.show()

The data are seasonal: within each year there are two small increases, followed by a pronounced peak and then a decline. As noted above, there is also an upward trend.
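One rough way to quantify this seasonal shape is to divide each observation by its moving average and average the ratios by calendar month. The sketch below is an illustration under stated assumptions: `seasonal_profile` is a hypothetical helper, not part of PyCaret, and the demo data are synthetic. Because the moving average is trailing rather than centered, the ratios are biased upward in a trending series, so the result is only a first look.

```python
import numpy as np
import pandas as pd

def seasonal_profile(frame, date_col="Date", value_col="Passengers", window=12):
    """Mean ratio of each value to its trailing moving average, per calendar month."""
    trailing_mean = frame[value_col].rolling(window).mean()
    ratio = frame[value_col] / trailing_mean  # NaN for the first window-1 rows
    return ratio.groupby(frame[date_col].dt.month).mean()

# Synthetic example (not the real data): flat level of 100 with a yearly sine pattern.
dates = pd.date_range("2000-01-01", periods=48, freq="MS")
values = 100 * (1 + 0.3 * np.sin(2 * np.pi * (dates.month.to_numpy() - 1) / 12))
demo = pd.DataFrame({"Date": dates, "Passengers": values})

profile = seasonal_profile(demo)
print(profile.round(2))
print("Peak month:", profile.idxmax())
```

Calling `seasonal_profile(df)` at this point in the notebook, while `Date` is still a column, applies the same calculation to the passenger data.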

### Data transformation

The auxiliary 'MA12' column is dropped and the 'Date' column is set as the index.

In [None]:
df.drop(['MA12'], axis=1, inplace=True)
df.set_index('Date', inplace=True)

### Train/test split

### Data preprocessing

In [None]:
from pycaret.time_series import *
exp_name = setup(data = df, target = 'Passengers',  fh = 12)

Unnamed: 0,Description,Value
0,session_id,5251
1,Target,Passengers
2,Approach,Univariate
3,Exogenous Variables,Not Present
4,Original data shape,"(144, 1)"
5,Transformed data shape,"(144, 1)"
6,Transformed train set shape,"(132, 1)"
7,Transformed test set shape,"(12, 1)"
8,Rows with missing values,0.0%
9,Fold Generator,ExpandingWindowSplitter
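The `fh = 12` argument reserves the last 12 months as the test set, which is why the table reports 132 training rows and 12 test rows. The sketch below illustrates that split and an expanding-window layout using plain pandas. It does not call PyCaret, and the three-fold layout is an assumption chosen to match the `cutoff` values reported in the cross-validation tables of this notebook.

```python
import pandas as pd

# Monthly periods covering the dataset's range (144 months).
idx = pd.period_range("1949-01", "1960-12", freq="M")
fh, n_folds = 12, 3

# Hold-out split: the last `fh` periods form the test set.
train, test = idx[:-fh], idx[-fh:]

# Expanding-window folds inside the training set: each fold trains on
# everything up to a cutoff and validates on the next `fh` periods.
cutoffs = []
for k in range(n_folds, 0, -1):
    cutoff_pos = len(train) - k * fh - 1  # position of the last training period
    cutoffs.append(str(train[cutoff_pos]))

print(len(train), len(test))  # 132 12
print(cutoffs)                # ['1956-12', '1957-12', '1958-12']
```

Each fold's training window grows by 12 months, so later folds see more history than earlier ones.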


In [None]:
# Differencing plots (diff) with the autocorrelation function (ACF) and the partial autocorrelation function (PACF)
plot_model(plot="diff", data_kwargs={"order_list": [1, 2], "acf": True, "pacf": True})

In [None]:
plot_model(plot="diff", data_kwargs={"lags_list": [[1], [1, 12]], "acf": True, "pacf": True})
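The `lags_list` values `[1]` and `[1, 12]` correspond to a lag-1 difference and a lag-1 difference followed by a lag-12 difference. As a minimal sketch of what those operations do, the example below applies them with pandas to a synthetic series (a linear trend plus a 12-month sine pattern, not the real data).

```python
import numpy as np
import pandas as pd

t = np.arange(48)
# Synthetic series: linear trend plus a pattern that repeats every 12 steps.
y = pd.Series(100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / 12))

d1 = y.diff(1)              # lag-1 difference: removes the linear trend
d1_12 = y.diff(1).diff(12)  # lag-1 then lag-12: also removes the seasonal pattern

# After both differences nothing but floating-point noise remains.
residual = float(d1_12.dropna().abs().max())
print(float(d1.dropna().std()), residual)
```

On real data the doubly differenced series is not zero, but if its ACF and PACF decay quickly, that suggests the trend and the yearly seasonality have been removed.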

In [None]:
plot_model(plot = 'decomp', data_kwargs = {'seasonal_period': 12})

In [None]:
plot_model(plot = 'decomp', data_kwargs = {'type' : 'multiplicative'})

### Model training

We start with an ARIMA model as a baseline and review its performance.

In [None]:
arima = create_model('arima')
plot_model(plot = 'ts')

Unnamed: 0,cutoff,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,1956-12,0.4462,0.4933,13.0286,16.1485,0.0327,0.0334,0.9151
1,1957-12,0.5983,0.5993,18.292,20.3442,0.0506,0.0491,0.8916
2,1958-12,1.0044,0.928,28.6999,30.1669,0.0671,0.0697,0.7964
Mean,NaT,0.683,0.6735,20.0069,22.2199,0.0501,0.0507,0.8677
SD,NaT,0.2356,0.1851,6.5117,5.8746,0.0141,0.0148,0.0513



In [None]:
plot_model(estimator = arima, plot = 'forecast', data_kwargs = {'fh' : 24})

In [None]:
tuned_arima = tune_model(arima)
plot_model([arima, tuned_arima], data_kwargs={"labels": ["Baseline", "Tuned"]})

Unnamed: 0,cutoff,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,1956-12,0.4998,0.5596,14.5955,18.3176,0.0364,0.0374,0.8908
1,1957-12,0.6077,0.6076,18.5794,20.629,0.0515,0.0499,0.8885
2,1958-12,0.6812,0.6804,19.4658,22.117,0.0447,0.046,0.8906
Mean,NaT,0.5963,0.6159,17.5469,20.3545,0.0442,0.0444,0.89
SD,NaT,0.0745,0.0497,2.1181,1.5632,0.0062,0.0052,0.001



Fitting 3 folds for each of 10 candidates, totalling 30 fits




We add the EVS metric (explained variance score) and compare the available models, ranking them by MAE.

In [None]:
# Add a custom metric to use in cross-validation
from sklearn.metrics import explained_variance_score
add_metric('evs', 'EVS', explained_variance_score)

best = compare_models(sort = 'MAE')

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2,EVS,TT (Sec)
exp_smooth,Exponential Smoothing,0.5852,0.6105,17.1926,20.1633,0.0435,0.0439,0.8918,0.9522,0.15
ets,ETS,0.5931,0.6212,17.4165,20.5102,0.044,0.0445,0.8882,0.9507,0.24
et_cds_dt,Extra Trees w/ Cond. Deseasonalize & Detrending,0.6581,0.7221,19.4195,23.8902,0.0483,0.0482,0.8482,0.9116,0.7
arima,ARIMA,0.683,0.6735,20.0069,22.2199,0.0501,0.0507,0.8677,0.9705,0.2033
huber_cds_dt,Huber w/ Cond. Deseasonalize & Detrending,0.6813,0.7866,20.0334,25.967,0.0491,0.0499,0.8113,0.8873,0.77
lr_cds_dt,Linear w/ Cond. Deseasonalize & Detrending,0.7004,0.7702,20.6084,25.4401,0.0509,0.0514,0.8215,0.9003,0.42
ridge_cds_dt,Ridge w/ Cond. Deseasonalize & Detrending,0.7004,0.7703,20.6086,25.4405,0.0509,0.0514,0.8215,0.9003,0.51
en_cds_dt,Elastic Net w/ Cond. Deseasonalize & Detrending,0.7029,0.7732,20.6816,25.5362,0.0511,0.0516,0.8201,0.8994,0.3833
llar_cds_dt,Lasso Least Angular Regressor w/ Cond. Deseasonalize & Detrending,0.7048,0.7751,20.7366,25.6009,0.0512,0.0517,0.8192,0.8987,0.4367
lasso_cds_dt,Lasso w/ Cond. Deseasonalize & Detrending,0.7048,0.7751,20.7373,25.6005,0.0512,0.0517,0.8193,0.8988,0.3667
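In the table, ARIMA has a noticeably higher EVS (0.9705) than R2 (0.8677). The two metrics differ in how they treat bias: explained variance measures only the scatter of the residuals around their own mean, while R2 also penalizes a constant offset. The following sketch reimplements both formulas with NumPy on made-up numbers to show the difference; it is an illustration, not the code PyCaret runs.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); unaffected by a constant forecast bias."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; penalizes bias as well as scatter."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values: the forecast has the right shape but is always 10 too low.
y = np.array([100.0, 120.0, 140.0, 160.0])
biased = y - 10.0

print(explained_variance(y, biased))  # 1.0
print(r2(y, biased))                  # 0.8
```

A large gap between EVS and R2 may therefore indicate that a model's forecasts are systematically too high or too low.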



Evaluating the best model on the test data:

In [None]:
prediction_Extra_tree = predict_model(best)

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2,EVS
0,Exponential Smoothing,0.3382,0.4575,10.2997,15.8074,0.0221,0.0216,0.9549,0.959
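MASE puts the MAE on a relative scale by dividing it by the in-sample MAE of a naive forecast, so values below 1 mean the model beats that naive baseline. The sketch below is a minimal NumPy version, assuming the naive forecast repeats the value from 12 months earlier (a seasonal period of 12); the `mase` helper and the toy numbers are illustrative, and the exact scaling PyCaret applies may differ.

```python
import numpy as np

def mase(y_true, y_pred, y_train, sp=12):
    """MAE of the forecast divided by the in-sample MAE of a seasonal naive forecast."""
    y_true, y_pred, y_train = (np.asarray(a, float) for a in (y_true, y_pred, y_train))
    mae = np.mean(np.abs(y_true - y_pred))
    # Seasonal naive error: each training value minus the value `sp` steps earlier.
    naive_mae = np.mean(np.abs(y_train[sp:] - y_train[:-sp]))
    return mae / naive_mae

# Toy numbers: the training series rises by 1 per step, so the naive error is 12.
y_train = np.arange(24, dtype=float)
y_true = np.array([24.0, 25.0, 26.0])
y_pred = np.array([21.0, 25.0, 29.0])

score = mase(y_true, y_pred, y_train)
print(round(score, 4))
```

Under this definition, the reported MAE of 10.2997 and MASE of 0.3382 imply a naive in-sample error of roughly 30 passengers.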


The model performs well on the test set: its R2 of 0.95 is close to 1, and its MAE is small, about 10 passengers (a MAPE of about 2.2%). Next, we plot the forecast:

In [None]:
plot_model(estimator = best, plot = 'forecast', data_kwargs = {'fh' : 24})

In [None]:
prediction_Extra_tree

Unnamed: 0,y_pred
1960-01,417.281
1960-02,394.0567
1960-03,462.4373
1960-04,448.5887
1960-05,471.8593
1960-06,539.8763
1960-07,623.8054
1960-08,631.1408
1960-09,515.5723
1960-10,449.8958


In [None]:
final_best = finalize_model(best)

`finalize_model` retrains the model on all the data, including the test set. We then use the finalized model to forecast 24 future periods:

In [None]:
pred_unseen = predict_model(final_best, fh = 24)
pred_unseen

Unnamed: 0,y_pred
1961-01,445.2424
1961-02,418.2253
1961-03,465.3098
1961-04,494.9512
1961-05,505.4759
1961-06,573.3127
1961-07,663.5964
1961-08,654.904
1961-09,546.761
1961-10,488.4468
