# Activity 1
Browse the MLflow documentation and find three pages that catch your attention.
- Explain why you chose those pages.



**1. [MLflow Documentation Overview](https://mlflow.org/docs/3.2.0/ml/)**
I chose this page because it gives a complete overview of what MLflow is, its main components, and how the tool is structured. Starting here seemed important because it helps to understand the big picture before moving on to practical examples. It also centralizes links to other key sections of the documentation, which gives a clear map of the available resources.



**2. [Tracking Quickstart](https://mlflow.org/docs/3.2.0/ml/tracking/quickstart/)**
This page caught my attention because it shows, step by step, how to start using MLflow Tracking. It is especially useful for learning to log experiments and metrics in a hands-on way. I chose it because it introduces simple Python code that can be run directly in a notebook, which makes it easy to experiment and see results quickly without advanced prior knowledge.



**3. [Logging the First Model](https://mlflow.org/docs/3.2.0/ml/getting-started/logging-first-model/)**
I decided to include this page because it explains how to save a machine learning model in MLflow for the first time, which is a fundamental step in any ML workflow. It is valuable because it connects theory with practice: the model is not only run, but also documented and stored for later reference or deployment. That makes it a key resource for understanding the integration with scikit-learn.

---


# Activity 2
Follow the tutorial steps to use MLflow in a Jupyter notebook.
- Analyze the section that generates synthetic data and see whether the proposed algorithm can be improved.

[MLflow documentation](https://mlflow.org/docs/3.2.0/ml/getting-started/logging-first-model/notebooks/logging-first-model/)

In [17]:
import warnings
warnings.filterwarnings('ignore')

In [18]:
# NOTE: review the links mentioned above for guidance on connecting to a managed tracking server, such as the Databricks Managed MLflow
from mlflow import MlflowClient

client = MlflowClient(tracking_uri="http://127.0.0.1:8080")

In [19]:
# Search experiments without providing query terms behaves effectively as a 'list' action

all_experiments = client.search_experiments()

print(all_experiments)

[<Experiment: artifact_location='file:C:/code/2025UVG/ML/taller3/mlruns/1', creation_time=1756858162002, experiment_id='1', last_update_time=1756858162002, lifecycle_stage='active', name='Apple_Models', tags={'mlflow.note.content': 'This is the grocery forecasting project. This '
                        'experiment contains the produce models for apples.',
 'project_name': 'grocery-forecasting',
 'project_quarter': 'Q3-2023',
 'store_dept': 'produce',
 'team': 'stores-ml'}>, <Experiment: artifact_location='file:C:/code/2025UVG/ML/taller3/mlruns/0', creation_time=1756858117919, experiment_id='0', last_update_time=1756858117919, lifecycle_stage='active', name='Default', tags={}>]


In [20]:
from pprint import pprint

# Extract the experiment name and lifecycle_stage
default_experiment = [
  {"name": experiment.name, "lifecycle_stage": experiment.lifecycle_stage}
  for experiment in all_experiments
  if experiment.name == "Default"
][0]

pprint(default_experiment)

{'lifecycle_stage': 'active', 'name': 'Default'}


In [21]:
experiment_description = (
  "This is the grocery forecasting project. "
  "This experiment contains the produce models for apples."
)

experiment_tags = {
  "project_name": "grocery-forecasting",
  "store_dept": "produce",
  "team": "stores-ml",
  "project_quarter": "Q3-2023",
  "mlflow.note.content": experiment_description,
}

experiment_name = "Apple_Models"
existing_experiment = client.get_experiment_by_name(experiment_name)
if existing_experiment is None:
    produce_apples_experiment = client.create_experiment(name=experiment_name, tags=experiment_tags)
else:
    produce_apples_experiment = existing_experiment.experiment_id

In [22]:
# Use search_experiments() to search on the project_name tag key

apples_experiment = client.search_experiments(
  filter_string="tags.`project_name` = 'grocery-forecasting'"
)

pprint(apples_experiment[0])

<Experiment: artifact_location='file:C:/code/2025UVG/ML/taller3/mlruns/1', creation_time=1756858162002, experiment_id='1', last_update_time=1756858162002, lifecycle_stage='active', name='Apple_Models', tags={'mlflow.note.content': 'This is the grocery forecasting project. This '
                        'experiment contains the produce models for apples.',
 'project_name': 'grocery-forecasting',
 'project_quarter': 'Q3-2023',
 'store_dept': 'produce',
 'team': 'stores-ml'}>


In [23]:
print(apples_experiment[0].tags["team"])



stores-ml


In [24]:
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def generate_apple_sales_data_with_promo_adjustment(base_demand: int = 1000, n_rows: int = 5000):
  """
  Generates a synthetic dataset for predicting apple sales demand with seasonality and inflation.

  This function creates a pandas DataFrame with features relevant to apple sales.
  The features include date, average_temperature, rainfall, weekend flag, holiday flag,
  promotional flag, price_per_kg, and the previous day's demand. The target variable,
  'demand', is generated based on a combination of these features with some added noise.

  Args:
      base_demand (int, optional): Base demand for apples. Defaults to 1000.
      n_rows (int, optional): Number of rows (days) of data to generate. Defaults to 5000.

  Returns:
      pd.DataFrame: DataFrame with features and target variable for apple sales prediction.

  Example:
      >>> df = generate_apple_sales_data_with_promo_adjustment(base_demand=1200, n_rows=6000)
      >>> df.head()
  """

  # Set seed for reproducibility
  np.random.seed(9999)

  # Create date range
  dates = [datetime.now() - timedelta(days=i) for i in range(n_rows)]
  dates.reverse()

  # Generate features
  df = pd.DataFrame(
      {
          "date": dates,
          "average_temperature": np.random.uniform(10, 35, n_rows),
          "rainfall": np.random.exponential(5, n_rows),
          "weekend": [(date.weekday() >= 5) * 1 for date in dates],
          "holiday": np.random.choice([0, 1], n_rows, p=[0.97, 0.03]),
          "price_per_kg": np.random.uniform(0.5, 3, n_rows),
          "month": [date.month for date in dates],
      }
  )

  # Introduce inflation over time (years)
  df["inflation_multiplier"] = 1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03

  # Incorporate seasonality due to apple harvests
  df["harvest_effect"] = np.sin(2 * np.pi * (df["month"] - 3) / 12) + np.sin(
      2 * np.pi * (df["month"] - 9) / 12
  )

  # Modify the price_per_kg based on harvest effect
  df["price_per_kg"] = df["price_per_kg"] - df["harvest_effect"] * 0.5

  # Adjust promo periods to coincide with periods lagging peak harvest by 1 month
  peak_months = [4, 10]  # months following the peak availability
  df["promo"] = np.where(
      df["month"].isin(peak_months),
      1,
      np.random.choice([0, 1], n_rows, p=[0.85, 0.15]),
  )

  # Generate target variable based on features
  base_price_effect = -df["price_per_kg"] * 50
  seasonality_effect = df["harvest_effect"] * 50
  promo_effect = df["promo"] * 200

  df["demand"] = (
      base_demand
      + base_price_effect
      + seasonality_effect
      + promo_effect
      + df["weekend"] * 300
      + np.random.normal(0, 50, n_rows)  # adding random noise
  ) * df["inflation_multiplier"]  # scale by the yearly inflation factor

  # Add previous day's demand
  df["previous_days_demand"] = df["demand"].shift(1)
  df["previous_days_demand"] = df["previous_days_demand"].bfill()  # fill the first row

  # Drop temporary columns
  df.drop(columns=["inflation_multiplier", "harvest_effect", "month"], inplace=True)

  return df

In [25]:
# Generate the dataset!

data = generate_apple_sales_data_with_promo_adjustment(base_demand=1_000, n_rows=1_000)

data[-20:]

| | date | average_temperature | rainfall | weekend | holiday | price_per_kg | promo | demand | previous_days_demand |
|---|---|---|---|---|---|---|---|---|---|
| 980 | 2025-08-14 18:11:08.592437 | 34.130183 | 1.454065 | 0 | 0 | 1.449177 | 0 | 999.30629 | 1029.418398 |
| 981 | 2025-08-15 18:11:08.592437 | 32.353643 | 9.462859 | 0 | 0 | 2.856503 | 0 | 842.129427 | 999.30629 |
| 982 | 2025-08-16 18:11:08.592436 | 18.816833 | 0.39147 | 1 | 0 | 1.326429 | 0 | 1317.616709 | 842.129427 |
| 983 | 2025-08-17 18:11:08.592436 | 34.533012 | 2.120477 | 1 | 0 | 0.970131 | 0 | 1395.802075 | 1317.616709 |
| 984 | 2025-08-18 18:11:08.592435 | 23.057202 | 2.365705 | 0 | 0 | 1.049931 | 0 | 1019.486305 | 1395.802075 |
| 985 | 2025-08-19 18:11:08.592435 | 34.810165 | 3.089005 | 0 | 0 | 2.035149 | 0 | 1002.564672 | 1019.486305 |
| 986 | 2025-08-20 18:11:08.592434 | 29.208905 | 3.673292 | 0 | 0 | 2.518098 | 0 | 1086.143402 | 1002.564672 |
| 987 | 2025-08-21 18:11:08.592433 | 16.428676 | 4.077782 | 0 | 0 | 1.268979 | 0 | 1093.207186 | 1086.143402 |
| 988 | 2025-08-22 18:11:08.592433 | 32.067512 | 2.734454 | 0 | 0 | 0.762317 | 0 | 1069.939894 | 1093.207186 |
| 989 | 2025-08-23 18:11:08.592432 | 31.938203 | 13.883486 | 1 | 0 | 1.153301 | 0 | 1321.40954 | 1069.939894 |
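
Reviewing the generator suggests a few possible improvements. The two sine terms in `harvest_effect` are half a period apart (`sin(x) + sin(x - π)`), so they cancel and the seasonality contribution is numerically zero. `datetime.now()` is evaluated once per row, so the dates carry drifting microseconds and change on every run. Finally, `np.random.seed` mutates NumPy's global state. The sketch below is one way to address these points, not the tutorial's method: the function name `generate_apple_sales_data_fixed`, the `end_date` and `seed` parameters, and the two-peak cosine used in `harvest_effect` are assumptions introduced here for illustration.

```python
import numpy as np
import pandas as pd


def harvest_effect(month):
    """Seasonal term with two peaks per year, at months 3 and 9 (assumed shape)."""
    return np.cos(2 * np.pi * (month - 3) / 6)


def generate_apple_sales_data_fixed(
    base_demand=1000, n_rows=5000, end_date="2025-01-01", seed=9999
):
    """Variant of the tutorial generator with a fixed calendar and a local RNG."""
    # Local generator: leaves NumPy's global random state untouched
    rng = np.random.default_rng(seed)

    # Fixed, midnight-aligned daily dates instead of datetime.now()
    dates = pd.date_range(end=end_date, periods=n_rows, freq="D")

    df = pd.DataFrame(
        {
            "date": dates,
            "average_temperature": rng.uniform(10, 35, n_rows),
            "rainfall": rng.exponential(5, n_rows),
            "weekend": (dates.dayofweek >= 5).astype(int),
            "holiday": rng.choice([0, 1], n_rows, p=[0.97, 0.03]),
            "price_per_kg": rng.uniform(0.5, 3, n_rows),
        }
    )

    month = df["date"].dt.month
    inflation_multiplier = 1 + (df["date"].dt.year - df["date"].dt.year.min()) * 0.03
    season = harvest_effect(month)

    # Prices dip around harvest peaks; promos follow the peaks by one month
    df["price_per_kg"] = df["price_per_kg"] - season * 0.5
    df["promo"] = np.where(
        month.isin([4, 10]), 1, rng.choice([0, 1], n_rows, p=[0.85, 0.15])
    )

    noise = rng.normal(0, 50, n_rows)
    df["demand"] = (
        base_demand
        - df["price_per_kg"] * 50
        + season * 50
        + df["promo"] * 200
        + df["weekend"] * 300
        + noise
    ) * inflation_multiplier

    # Lag feature; assignment avoids the deprecated fillna(method=..., inplace=True)
    df["previous_days_demand"] = df["demand"].shift(1).bfill()
    return df


data_fixed = generate_apple_sales_data_fixed(base_demand=1_000, n_rows=1_000)
```

With a fixed `end_date` and `seed`, two calls with the same arguments should return identical frames, which makes runs logged to MLflow easier to compare.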


In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

import mlflow

# Use the fluent API to set the tracking uri and the active experiment
mlflow.set_tracking_uri("http://127.0.0.1:8080")

# Sets the current active experiment to the "Apple_Models" experiment and returns the Experiment metadata
apple_experiment = mlflow.set_experiment("Apple_Models")

# Define a run name for this iteration of training.
# If this is not set, a unique name will be auto-generated for your run.
run_name = "apples_rf_test"

# Define an artifact path that the model will be saved to.
artifact_path = "rf_apples"

In [28]:
# Split the data into features and target and drop irrelevant date field and target field
X = data.drop(columns=["date", "demand"])
y = data["demand"]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

params = {
  "n_estimators": 100,
  "max_depth": 6,
  "min_samples_split": 10,
  "min_samples_leaf": 4,
  "bootstrap": True,
  "oob_score": False,
  "random_state": 888,
}

# Train the RandomForestRegressor
rf = RandomForestRegressor(**params)

# Fit the model on the training data
rf.fit(X_train, y_train)

# Predict on the validation set
y_pred = rf.predict(X_val)

# Calculate error metrics
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_val, y_pred)

# Assemble the metrics we're going to write into a collection
metrics = {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Initiate the MLflow run context
with mlflow.start_run(run_name=run_name) as run:
  # Log the parameters used for the model fit
  mlflow.log_params(params)

  # Log the error metrics that were calculated during validation
  mlflow.log_metrics(metrics)

  # Log an instance of the trained model for later use
  mlflow.sklearn.log_model(sk_model=rf, input_example=X_val, name=artifact_path)

🏃 View run apples_rf_test at: http://127.0.0.1:8080/#/experiments/1/runs/2147cfbe92d24704aed2fb9616150fb9
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/1
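
One thing to consider about the validation above: `previous_days_demand` is a one-day lag of the target, and `train_test_split` shuffles rows by default, so a validation day's demand can appear as a feature value in a training row. Below is a minimal sketch of a time-ordered split, assuming the rows can be sorted by `date`. The helper name `chronological_split` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def chronological_split(df, target="demand", date_col="date", test_size=0.2):
    """Hold out the most recent rows for validation instead of a random sample."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    X = ordered.drop(columns=[date_col, target])
    y = ordered[target]
    # shuffle=False keeps row order, so the validation set is the final block of days
    return train_test_split(X, y, test_size=test_size, shuffle=False)


# Toy frame standing in for the `data` DataFrame built earlier
toy = pd.DataFrame(
    {
        "date": pd.date_range("2025-01-01", periods=10, freq="D"),
        "price_per_kg": [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 0.8, 1.4, 1.2, 1.1],
        "demand": [1000, 980, 1020, 990, 970, 1005, 1040, 960, 985, 995],
    }
)
X_train, X_val, y_train, y_val = chronological_split(toy)
```

Metrics from a split like this are usually less optimistic than those from a shuffled split, so runs evaluated both ways should not be compared directly.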


---

# Activity 3
Analyze the code that shows how to connect MLflow with a scikit-learn Pipeline.
- Find the section of the documentation that explains how the two tools can be used together.
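
As a starting point for this activity, here is a hedged sketch of how a scikit-learn `Pipeline` might be logged with the same calls used in Activity 2 (`log_params`, `log_metric`, `mlflow.sklearn.log_model`). It is not taken from the documentation section the activity refers to. The helper names `build_pipeline` and `train_and_log`, the run name `apples_pipeline_test`, the model name `rf_pipeline`, and the random demo data are all assumptions. MLflow is imported only when a tracking URI is passed, so the fitting part runs without a server.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(random_state=888):
    """Scaling followed by a random forest, wrapped in a single estimator."""
    return Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("rf", RandomForestRegressor(n_estimators=50, max_depth=6, random_state=random_state)),
        ]
    )


def train_and_log(X, y, tracking_uri=None):
    """Fit the pipeline and, if a tracking URI is given, log the run to MLflow."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline = build_pipeline().fit(X_train, y_train)
    mae = mean_absolute_error(y_val, pipeline.predict(X_val))

    if tracking_uri is not None:
        import mlflow  # imported here so the local fit works without MLflow installed

        mlflow.set_tracking_uri(tracking_uri)
        mlflow.set_experiment("Apple_Models")
        with mlflow.start_run(run_name="apples_pipeline_test"):  # hypothetical run name
            mlflow.log_params(pipeline.named_steps["rf"].get_params())
            mlflow.log_metric("mae", mae)
            # The whole Pipeline is logged, so preprocessing travels with the model
            mlflow.sklearn.log_model(
                sk_model=pipeline, name="rf_pipeline", input_example=X_val[:5]
            )

    return pipeline, mae


# Small random regression problem standing in for the apple sales features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(0, 0.1, 200)

# Pass tracking_uri="http://127.0.0.1:8080" to log against the local server
pipeline, mae = train_and_log(X_demo, y_demo)
```

Logging the pipeline rather than only the final estimator means the model loaded back from MLflow applies the same scaling that was used during training.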