# Lab: Basic ML Model with Weather Dataset + MLflow Integration

Welcome to this lab! Here you will learn how to:

1. **Load and prepare a weather dataset**, with temperature and humidity data.
2. **Train a Machine Learning model** using Scikit-learn, a powerful tool for Machine Learning in Python, to predict rain.
3. **Evaluate the model** by computing metrics that show how well it makes predictions on new data.
4. **Integrate MLflow**, one of the most widely used tools for tracking metrics, parameters, and model versions.

We will follow a guided approach with detailed explanations at each step.  
The first part focuses on Scikit-learn and the weather dataset. The second part extends the existing code with MLflow.

---

## Part 1: From Data to Machine Learning Model (Supervised Learning with Scikit-learn)  

### Objective  
Build a **classification model** that predicts whether it will rain, using **temperature** and **humidity** as input features. The model will be trained with **Scikit-learn**.

### 1. Preparing the Dataset  

Before training a Machine Learning model, it is essential to clean the data, as missing or incorrect values can compromise predictions. A well-prepared dataset allows the model to learn better and provide more accurate results.  

For this lab, we will use an example dataset:  
[Weather Test Data](https://raw.githubusercontent.com/boradpreet/Weather_dataset/refs/heads/master/Weather%20Test%20Data.csv)  

The **Weather Test Data** dataset contains meteorological information collected at different times. Each row represents an observation with parameters such as **temperature**, **humidity**, **atmospheric pressure**, and other weather variables.  

The goal of this dataset is to analyze weather patterns and use them to train a Machine Learning model capable of predicting future conditions, such as the probability of rain or temperature variations.

In [None]:
import pandas as pd

# Dataset URL 
url = "https://raw.githubusercontent.com/boradpreet/Weather_dataset/refs/heads/master/Weather%20Test%20Data.csv"

# Load the dataset in a Pandas dataframe
df = pd.read_csv(url)

# Show first 5 rows
df.head(5)



### 2. Data Exploration and Cleaning  

To ensure our model works correctly, we first need to examine and prepare the dataset. Here are the key steps:  

1. **Check for missing data**: Identify if there are any missing values, as they could compromise model training. If necessary, we can remove them or replace them with appropriate values.  
2. **Convert the `RainToday` column**: Transform categorical values (*No* and *Yes*) into numerical values (0 for *No*, 1 for *Yes*), so the model can interpret them correctly.  
3. **Select key features**: Choose only the most relevant columns (e.g., temperature and humidity) to simplify the model and improve its performance.


In [None]:
# 1. Remove missing values
df = df.dropna()

# 2. Transform column 'RainToday' into numerical values
df['RainToday'] = df['RainToday'].apply(lambda x: 1 if x == 'Yes' else 0)

# 3. Feature selection 
features = ['MinTemp', 'MaxTemp', 'Humidity3pm', 'Humidity9am']

X = df[features]
y = df['RainToday']
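
Before dropping rows, it can help to see how many values are missing and which labels the target column actually holds. The sketch below uses a tiny hypothetical frame (`toy`, with made-up values) in place of the real dataset; only the column names are borrowed from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the weather data (all values are made up)
toy = pd.DataFrame({
    "MinTemp": [12.0, np.nan, 9.5, 15.1],
    "Humidity3pm": [40.0, 55.0, np.nan, 80.0],
    "RainToday": ["No", "Yes", "No", "Yes"],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows, then encode the target with an explicit mapping
toy_clean = toy.dropna().copy()
toy_clean["RainToday"] = toy_clean["RainToday"].map({"No": 0, "Yes": 1})
print(toy_clean["RainToday"].value_counts())
```

One design difference from the `lambda` used above: `map` with an explicit dictionary turns any unexpected label into `NaN` instead of silently counting it as 0, which makes data problems easier to spot.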


### 3. Splitting the Dataset into Training and Test Sets  

To properly train and evaluate the model, we first separate the dataset into features and target:  

- **X (features)**: Contains the information we will use for predictions, such as **temperature** and **humidity**.  
- **y (target)**: Represents the variable we want to predict, i.e., whether it rained (`RainToday` = 1) or not (`RainToday` = 0).  

We split the data into **80% training set** and **20% test set** for the following reasons:  

1. **Model Training**  
   - 80% of the data is used to teach the model to recognize patterns between features and the target variable.  

2. **Model Evaluation**  
   - The remaining 20% of the data is not used in training but serves to test the model on unseen data.  
   - This helps us understand whether the model can make accurate predictions on new data.  

3. **Avoid Overfitting**  
   - If we tested the model on the same data it was trained on, we might get deceptively good results, as the model would have simply memorized them.  
   - Using separate test data helps verify whether the model can generalize its predictions to real-world data.  

This split is a crucial step in building a reliable model capable of making accurate predictions on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

# Dataset split (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)} rows")
print(f"Test set size: {len(X_test)} rows")
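
Rain is usually the rarer class, so a purely random split can leave the test set with a different share of rainy days than the training set. A possible refinement, sketched here on hypothetical toy arrays (`X_toy`, `y_toy`), is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 dry days, 20 rainy days
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy preserves the 20% share of rainy days in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Rain share in training set: {y_tr.mean():.2f}")
print(f"Rain share in test set: {y_te.mean():.2f}")
```

In the lab's own cell, the equivalent change would be adding `stratify=y` to the existing call.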

### 4. Building and Training the Model  

Now that we have prepared the data, we can build and train a Machine Learning model. For this, we will use a classifier called [**RandomForestClassifier**](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), one of the most widely used techniques for classification problems.  

#### Why use **Random Forest**?  
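- It is a model based on **decision trees**, which split the data step by step on feature thresholds until they reach a decision.  
- It is **robust** and handles numerical features directly; categorical features also work once they are encoded as numbers, since Scikit-learn expects numeric input.  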
- It is less sensitive to noisy data than a single decision tree because it combines multiple trees to improve accuracy.  

The model will be trained using **temperature** and **humidity** data to predict whether there will be **rain** or not. After training, we will test it on new data to evaluate its accuracy.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create and train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
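
Once a forest is trained, its `feature_importances_` attribute gives a rough view of which inputs the trees relied on. The sketch below is an illustration on synthetic data: the frame `toy_X` and the rule "rain when afternoon humidity exceeds 70" are invented for the example and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic features (made-up ranges)
toy_X = pd.DataFrame({
    "MaxTemp": rng.uniform(10, 35, n),
    "Humidity3pm": rng.uniform(10, 100, n),
})
# Invented rule: rain whenever afternoon humidity exceeds 70
toy_y = (toy_X["Humidity3pm"] > 70).astype(int)

toy_model = RandomForestClassifier(n_estimators=50, random_state=42)
toy_model.fit(toy_X, toy_y)

# Importances are normalized so that they sum to 1
importances = pd.Series(toy_model.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

For the lab's model, the same two last lines with `model` and `features` would show how the four weather columns compare.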

### 5. Model Evaluation  

After training the model, we need to verify how accurate its predictions are. To do this, we will calculate **accuracy** and other evaluation metrics.  

#### Why is model evaluation important?  
A Machine Learning model is not useful if we do not know how reliable it is. Evaluation helps us understand:  
- **Whether the model is learning correctly from the data** or simply memorizing answers (overfitting).  
- **Whether it can be used on new data** and make realistic predictions.  

#### Confusion Matrix  
Besides accuracy, we will use the **confusion matrix**, a table (plotted as a heatmap) that shows where the model makes correct predictions and where it makes mistakes.  
- It helps identify **false positives** and **false negatives**, which are critical errors in many real-world scenarios.  
- It is useful for improving the model, for example, by adjusting decision thresholds or balancing input data.  

With these analyses, we can determine whether our model is ready for use or needs improvement.

In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Prediction on test set
y_pred = model.predict(X_test)

# Compute accuracy and f1-score
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print(f"Accuracy of the model: {accuracy:.2f}")
print(f"F1-score of the model: {f1:.2f}")

# Print classification report
print(classification_report(y_test, y_pred))
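
A quick way to check for the memorization effect described above is to compare accuracy on the training set with accuracy on the test set. The following is a minimal sketch on synthetic, deliberately noisy data from `make_classification`; the variable names are hypothetical and the numbers it prints are not results for the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y=0.2 assigns 20% of the labels at random (noise)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

syn_model = RandomForestClassifier(n_estimators=100, random_state=42)
syn_model.fit(X_syn_train, y_syn_train)

# A large gap between the two scores is a typical sign of overfitting
train_acc = accuracy_score(y_syn_train, syn_model.predict(X_syn_train))
test_acc = accuracy_score(y_syn_test, syn_model.predict(X_syn_test))
print(f"Training accuracy: {train_acc:.2f}")
print(f"Test accuracy: {test_acc:.2f}")
```

Applied to the lab, the same comparison would use `model.predict(X_train)` next to the existing `y_pred`.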

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Create a heatmap with Seaborn
target_names = ["No Rain", "Rain"]
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names, yticklabels=target_names)

# Add titles
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")

# Show the plot
plt.show()
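
The four cells of the heatmap map directly onto precision and recall. As a small worked example, with hand-written labels that are purely illustrative, the counts can be unpacked and the two metrics computed by hand:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels (0 = no rain, 1 = rain), not taken from the dataset
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Precision: share of predicted rain that was real rain
precision = tp / (tp + fp)
# Recall: share of real rain that the model caught
recall = tp / (tp + fn)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")

# The manual values agree with Scikit-learn's helpers
assert precision == precision_score(y_true_toy, y_pred_toy)
assert recall == recall_score(y_true_toy, y_pred_toy)
```

Here one false positive and one false negative each cost a quarter of the respective metric, which accuracy alone (8 correct out of 10) would not reveal.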

### Conclusion (Part 1)  

In this first part, we followed a step-by-step process to build a Machine Learning model capable of predicting rain. Here’s what we did:  

1. **Loaded the weather dataset** to analyze temperature, humidity, and other variables.  
2. **Cleaned and prepared the data**, handling missing values and converting the target variable into a format the model can understand.  
3. **Split the dataset** into training (80%) and test (20%) sets to properly train and evaluate the model.  
4. **Built a classification model** using **RandomForestClassifier**, a powerful and robust algorithm.  
5. **Evaluated the model’s performance** by calculating accuracy and analyzing the confusion matrix to identify errors.  

Now that we have built the base model, in the next part, we will explore how to integrate **MLflow** to track experiments and further improve performance.

---

## Tech: Installing and Configuring MLflow  

### Objective  
Set up a local **MLflow** instance to log experiments, monitor metrics, and manage Machine Learning models in an organized way.  

### 1. Starting MLflow  

If MLflow is not installed yet, install it with `pip install mlflow`. Then, to start MLflow locally, run the following command in the terminal:  

```bash
    mlflow ui
```

Once started, the graphical interface will be accessible at:  

```
    http://127.0.0.1:5000
```

This setup allows **saving experiments and models locally**, enabling tracking of different model versions, comparing evaluation metrics, and optimizing training processes.  

In the next sections, we will see how to log parameters, metrics, and models directly within MLflow.

### MLflow in Production  

In production environments, including client deployments, **MLflow is typically not run locally** but is integrated into a more robust and scalable infrastructure. This prevents issues related to manual experiment management and data persistence.  

Common solutions include:  

- **Docker Compose**  
  - MLflow is started using a `docker-compose.yml` file, which configures a backend database and remote storage for saving experiments.  
  - This approach is useful for controlled environments where a quick and reproducible setup is needed.  
  - An example implementation is available in the internal repository:  
    [kiratech/mlops-service-portfolio](https://github.com/kiratech/mlops-service-portfolio/tree/main).  

- **Kubernetes (K8s)**  
  - MLflow is deployed on a **Kubernetes cluster**, allowing scalable and centralized experiment management.  
  - This approach is ideal for enterprise environments that require high levels of reliability, security, and scalability.  

Both solutions rely on a **multi-container architecture**, which includes:  
- **A persistent database** (e.g., PostgreSQL or MySQL) to store experiment metadata.  
- **S3-compatible object storage** (e.g., Amazon S3 or MinIO) to save models and artifacts, ensuring secure and scalable data management.  

These approaches ensure that MLflow can be reliably used in production, integrating with cloud or on-premise infrastructures for effective Machine Learning model management.

---

## Part 2: MLflow Integration  

Now, we will extend the existing code to **track our experiments** using **MLflow**. This will allow us to monitor the model training process, compare different configurations, and manage model versions in a structured way.  

### Why integrate MLflow?  
With MLflow, we can:  
- **Log training parameters** (e.g., `n_estimators` for Random Forest) to compare different configurations.  
- **Save evaluation metrics** (e.g., accuracy, F1-score) to monitor the model’s performance.  
- **Store the trained model** to easily reload and reuse it in the future without retraining.  

### Objective  
Integrate MLflow into the existing code to **track and version models**, logging parameters, metrics, and artifacts in a structured way.  

### 1. MLflow Configuration  
Before we start tracking experiments, let's set up the necessary variables to use MLflow in this project.  

In [None]:
import mlflow
import mlflow.sklearn

# Set name of the experiment and tracking URI of local instance
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("weather_classification_experiment")

Open the [web interface](http://127.0.0.1:5000) and verify that the new experiment is visible.

### 2. Logging Parameters, Metrics, and Model  

With **MLflow**, we can save and track various information during model training. This helps compare performance across different configurations and easily retrieve the best models.  

Here’s what we can log:  

- **Parameters** → Values used to configure the model, such as `n_estimators` (number of trees in Random Forest) and other hyperparameters.  
- **Metrics** → Performance indicators of the model, such as **accuracy**, **F1-score**, precision, and recall.  
- **Model** → The trained model version, which can be reloaded and reused without retraining.  

By logging these elements, we can analyze and compare different model versions in a structured and reproducible way.

In [None]:
# Use the same type of model used in Part 1
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from mlflow.models.signature import infer_signature

# Execute 4 runs, one per value of n_estimators
n_estimators = [1, 10, 100, 500]

for n_e in n_estimators:
    # Create a new MLFlow run
    with mlflow.start_run():
        # Log Param
        mlflow.log_param("n_estimators", n_e)

        # Create and train the model instance
        rf_model = RandomForestClassifier(n_estimators=n_e, random_state=42)
        rf_model.fit(X_train, y_train)

        # Compute metrics
        y_pred = rf_model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        mlflow.log_metric("accuracy", accuracy)
        mlflow.log_metric("f1", f1)

        # Create heatmap with Seaborn
        target_names = ["No Rain", "Rain"]
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(6, 4))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=target_names, yticklabels=target_names)

        # Add titles
        plt.xlabel("Predicted Label")
        plt.ylabel("True Label")
        plt.title("Confusion Matrix")

        # Save the plot as PNG
        os.makedirs("dev", exist_ok=True)
        plt.savefig("dev/confusion_matrix.png")
        plt.close()
        # Log the confusion matrix to MLflow as an artifact
        mlflow.log_artifact("dev/confusion_matrix.png")

        # Log the model to MLflow, inferring the signature from real inputs and outputs
        signature = infer_signature(X_train, rf_model.predict(X_train))
        mlflow.sklearn.log_model(rf_model, "random_forest_model", signature=signature)

        print(f"Run finished. Logged accuracy: {accuracy:.2f}")

### 3. Viewing and Comparing Results  

After logging parameters, metrics, and models, we can use **MLflow** to explore and compare different experiments.  

MLflow provides a web interface accessible at:  

```
    http://127.0.0.1:5000
```

In the **Experiments** section of this interface, you can:  
- **Examine the parameters** used in each experiment.  
- **Compare metrics** across different model configurations.  
- **View and download saved models**, making reuse and deployment easier.  

This feature allows monitoring the model's performance evolution and quickly identifying the best configurations.

### 4. Loading a Saved Model with MLflow  

MLflow allows saving and reloading trained models easily, avoiding the need to retrain them every time.  

To retrieve a saved model in MLflow, you need to copy the **run ID** of the run that produced it. This ID uniquely identifies each logged run and allows loading the corresponding model for future predictions.  

This feature is particularly useful for:  
- **Reusing a trained model** without repeating the training process.  
- **Comparing different model versions** to choose the most effective one.  
- **Integrating the model into applications or APIs**, without rebuilding it from scratch.  

The following cell shows how to perform this process in practice.

In [None]:
import mlflow.sklearn
from sklearn.metrics import accuracy_score

# Insert a real run ID copied from the MLflow UI
RUN_ID = "<run_id_of_your_experiment>"

loaded_model = mlflow.sklearn.load_model(f"runs:/{RUN_ID}/random_forest_model")

# Verify accuracy
y_loaded_pred = loaded_model.predict(X_test)
acc_loaded = accuracy_score(y_test, y_loaded_pred)
print(f"Accuracy of the loaded model: {acc_loaded:.2f}")

## Conclusions  

In this lab, we followed a complete process to build and monitor a Machine Learning model applied to weather data. Specifically, we:  

1. **Created a classification model** using **Scikit-learn**, leveraging temperature and humidity to predict rain.  
2. **Integrated MLflow** to track training parameters, log evaluation metrics, and manage model versions in a structured way.  
3. **Explored and compared results** through the MLflow UI, reviewing different configurations and loading a saved model for future predictions.  

This approach allows us to improve the Machine Learning model development process, making it more organized, reproducible, and scalable.

## Next Steps  

Now that we have built and tracked our model, we can explore further improvements and integrate the work into a more advanced workflow.  

- **Hyperparameter Optimization**: Test different configurations of `n_estimators`, `max_depth`, and other model parameters, logging results in **MLflow** to identify the best combination.  
- **Automation with CI/CD**: Integrate a **Continuous Integration/Continuous Deployment (CI/CD)** system to automatically train and deploy new model versions, reducing the risk of manual errors.  
- **Model Monitoring in Production**: Implement a **model drift monitoring** system to detect any drops in accuracy over time and determine when retraining with new data is necessary.  

These steps help transform the developed model into a robust and reliable system, ready for real-world applications.
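
As a starting point for the hyperparameter optimization step, here is a minimal sketch using Scikit-learn's `GridSearchCV` on synthetic data. The grid values and the data are assumptions chosen to keep the example fast; they are not recommendations for the weather dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)

# Small illustrative grid: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# Each combination is scored with 3-fold cross-validation
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

To keep the search traceable, the winning combination could be logged inside an MLflow run with `mlflow.log_params(search.best_params_)`, following the same pattern as the loop in Part 2.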