<img src="https://global.utsa.edu/tec-partnership/images/logos/logotipo-horizontal-azul-transparente.png"  width="600">

## **Milestone 4: Proyecto Integrador**
## Interactive AR visualization of air quality in mobile applications, with AI- and ML-based analysis and forecasting
### **TC5035 - Proyecto Integrador (Group 10)**
### **Team #56**
#### Tecnológico de Monterrey
---
*   NAME: Paulina Escalante Campbell
*   STUDENT ID: A01191962


### **Objective**
---


### **Initial Dataset**
---
Global Air Quality Dataset 🌍

Comprehensive Air Quality Measurements from Major Cities Worldwide 🌍

https://www.kaggle.com/datasets/sazidthe1/global-air-pollution-data/data

### Variable dictionary for the air quality dataset

| Column                | Description                                                                                 |
|-----------------------|---------------------------------------------------------------------------------------------|
| `country_name`        | Name of the Country                                                                         |
| `city_name`           | Name of the City                                                                            |
| `aqi_value`           | Overall AQI value of the city                                                               |
| `aqi_category`        | Overall AQI category of the city                                                            |
| `co_aqi_value`        | AQI value of Carbon Monoxide of the city                                                    |
| `co_aqi_category`     | AQI category of Carbon Monoxide of the city                                                 |
| `ozone_aqi_value`     | AQI value of Ozone of the city                                                              |
| `ozone_aqi_category`  | AQI category of Ozone of the city                                                           |
| `no2_aqi_value`       | AQI value of Nitrogen Dioxide of the city                                                  |
| `no2_aqi_category`    | AQI category of Nitrogen Dioxide of the city                                               |
| `pm2.5_aqi_value`     | AQI value of Particulate Matter (≤ 2.5 micrometers) of the city                             |
| `pm2.5_aqi_category`  | AQI category of Particulate Matter (≤ 2.5 micrometers) of the city                          |



### **Load files, imports, and Google Drive**

In [None]:
# Initial project setup with GPU and Google Drive; connect to a T4 GPU runtime
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

import psutil

ram_gb = psutil.virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

from google.colab import drive
drive.mount('/content/drive')

# Make sure the data has been copied to this Google Drive directory
import os
DIR = "/content/drive/MyDrive/Colab Notebooks/ProyectoIntegrador"
os.chdir(DIR)

Mon Oct 13 01:32:09 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# Imports for data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px

# Networking and utility imports
import requests
import time
import unicodedata

# Preprocessing, feature selection, dimensionality reduction, and model selection
from scipy import stats
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer, LabelEncoder
)
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, learning_curve

# Models
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.dummy import DummyRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
import pickle

# Feature selection and feature importance
from sklearn.feature_selection import (
    f_regression, mutual_info_regression,
    RFE, SequentialFeatureSelector
)
from sklearn.inspection import permutation_importance

# Classification metrics
from sklearn.metrics import (
    confusion_matrix, classification_report, accuracy_score
)

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Use the final dataset from Milestone 2 (with selected features); the data was already preprocessed in Milestones 1 and 2
df_selected = pd.read_csv('data_features_selected_clean.csv')  # 6 selected features + target (7 columns)

# Explore the model with two candidate features
df = df_selected
df

Unnamed: 0,co_aqi_value,ozone_aqi_value,distance_from_equator,pm25_no2_ratio,ozone_co_ratio,country_mean_aqi,aqi_value
0,-0.257657,-1.010310,-1.609334,-0.429844,-1.493037,-0.836505,-0.609049
1,-0.257657,0.014283,0.441423,-0.399228,0.213691,-0.452705,-0.247782
2,-0.257657,-0.136393,1.461118,-0.440050,-0.037298,-0.663941,-0.710203
3,-0.257657,-0.106258,1.220933,-0.531899,0.012900,-0.750846,-0.305584
4,-0.257657,-0.287068,1.336271,-0.684980,-0.288288,-0.828714,-0.276683
...,...,...,...,...,...,...,...
16754,0.898454,3.660628,0.148227,-0.042038,2.271804,0.994122,1.110582
16755,-0.257657,0.014283,1.206799,-0.123682,0.213691,-0.328210,-0.175528
16756,0.898454,3.479818,-0.205231,0.403599,2.121211,1.673678,1.457398
16757,-0.257657,-0.558284,1.188464,-0.678176,-0.740068,-0.661446,-0.478993


In [None]:
df_selected.head(1)

Unnamed: 0,co_aqi_value,ozone_aqi_value,distance_from_equator,pm25_no2_ratio,ozone_co_ratio,country_mean_aqi,aqi_value
0,-0.257657,-1.01031,-1.609334,-0.429844,-1.493037,-0.836505,-0.609049


In [None]:
df.shape

(16759, 7)

# **Section 1**
---

The target variable for the model is **aqi_value**.



In [None]:
df = pd.read_csv('data_features_selected_clean.csv')

target = 'aqi_value'
features = [col for col in df.columns if col != target]

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


# Load the baseline Random Forest model
with open('baseline_model.pkl', 'rb') as f:
    rf_model = pickle.load(f)

print("\nDataset:")
print(f"   - Total features: {len(features)}")
print(f"   - Train: {X_train.shape[0]} samples")
print(f"   - Test: {X_test.shape[0]} samples")
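
With the split and the loaded model in place, a natural next step is to score the model on the held-out set. The sketch below is illustrative only: it trains a small `RandomForestRegressor` on synthetic data as a stand-in for `rf_model` (the pickled baseline is not reproduced here), and `evaluate_regressor` is a hypothetical helper name, not part of the project code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_eval, y_eval):
    """Return R2, MAE, and RMSE of an already fitted regressor on an evaluation set."""
    y_pred = model.predict(X_eval)
    return {
        "r2": r2_score(y_eval, y_pred),
        "mae": mean_absolute_error(y_eval, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_eval, y_pred))),
    }


# Synthetic stand-in data (assumption): 6 features, mirroring the notebook's feature count.
X_demo, y_demo = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Stand-in model (assumption): replaces the pickled baseline for this demo only.
demo_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

metrics = evaluate_regressor(demo_model, X_te, y_te)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In the notebook itself the equivalent call would be `evaluate_regressor(rf_model, X_test, y_test)`. Because the values of `aqi_value` in `data_features_selected_clean.csv` appear to be standardized, MAE and RMSE computed this way would be in scaled units rather than AQI points.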

### **Section 1.1**

In [1]:
# Test

# **Section 2**
---

The target variable for the model is **aqi_value**.

Now that Random Forest has been selected as the baseline, we would like to explore other algorithms.
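
One way to explore alternatives is to score several regressors under the same K-fold split and compare their mean R². The following is a minimal sketch, assuming synthetic data in place of `X_train` and `y_train`; the candidate list and hyperparameters are illustrative choices, not the ones selected for the project.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 6 features, like the selected feature set.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

# Illustrative candidates; the dummy model gives a floor to compare against.
candidates = {
    "dummy_mean": DummyRegressor(strategy="mean"),
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# The same shuffled 5-fold split is reused for every candidate so scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    cv_r2[name] = scores.mean()

# Print candidates from best to worst mean R2.
for name, mean_r2 in sorted(cv_r2.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>14}: mean R2 = {mean_r2:.3f}")
```

Scale-sensitive models such as `KNeighborsRegressor`, `SVR`, or `MLPRegressor` generally need standardized inputs, which the preprocessed dataset appears to provide already.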


In [2]:
# Test

# **Section 3**
---
N/A


In [3]:
#Test

### **Section 3.1**

In [4]:
# Test

# **Section 4**
---
Test

In [5]:
#Test

## **Performance**
---
A minimum target performance is established (from a historical benchmark where one exists, or defined from scratch where none does), and the proposed baseline model is checked against it to verify that it reaches an acceptable level.

```
1. ENVIRONMENTAL PREDICTION (Air Quality):
   - Minimum acceptable R²: 0.60-0.70
   - Good R²: 0.70-0.85
   - Excellent R²: > 0.85

   Source: Environmental Modelling & Software (2020)
   "Air Quality Prediction Models: A Review"

2. PUBLIC HEALTH APPLICATIONS:
   - Acceptable MAE: < 10 AQI points
   - Good MAE: < 5 AQI points
   - Excellent MAE: < 3 AQI points

   Source: Journal of Environmental Management (2021)
   "Machine Learning for Air Quality Forecasting"

3. ENVIRONMENTAL ALERT SYSTEMS:
   - Categorization accuracy: > 75%
   - Critical false negatives: < 5%

```
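
The R² and MAE bands above can be encoded as a small lookup so that every model's metrics are graded the same way. This is only a sketch: `grade_r2` and `grade_mae` are hypothetical helper names, the treatment of exact boundary values (for example, R² = 0.70 counted as "good") is an assumption, and the MAE bands assume the error is expressed in AQI points, so errors from a model trained on a standardized target would need to be converted back to the original scale first.

```python
def grade_r2(r2: float) -> str:
    """Map an R2 score to the environmental-prediction bands listed above."""
    if r2 > 0.85:
        return "excellent"
    if r2 >= 0.70:  # boundary handling is an assumption
        return "good"
    if r2 >= 0.60:
        return "minimum acceptable"
    return "below minimum"


def grade_mae(mae_aqi_points: float) -> str:
    """Map an MAE (in AQI points, not scaled units) to the public-health bands."""
    if mae_aqi_points < 3:
        return "excellent"
    if mae_aqi_points < 5:
        return "good"
    if mae_aqi_points < 10:
        return "acceptable"
    return "not acceptable"


# Example values are made up for illustration; they are not project results.
print(grade_r2(0.78), "/", grade_mae(7.5))
```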

# **Conclusions**

---

The minimum performance was established based on:
* (1) a naive baseline (DummyRegressor, R²≈0)
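
To illustrate why the naive baseline sits at R² ≈ 0: a `DummyRegressor` that always predicts the training mean explains none of the variance, so its test R² is zero or slightly negative. The sketch below uses synthetic data as a stand-in for the project's dataset, so the printed numbers are not project results.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 6 features, 80/20 split as in the notebook.
X_demo, y_demo = make_regression(n_samples=1000, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The naive baseline ignores the features and always predicts the training mean.
naive = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_pred = naive.predict(X_te)

naive_r2 = r2_score(y_te, y_pred)
naive_mae = mean_absolute_error(y_te, y_pred)
print(f"Naive baseline R2:  {naive_r2:.4f}")
print(f"Naive baseline MAE: {naive_mae:.4f}")
```

Any candidate model should clear this floor by a wide margin before its metrics are compared against the literature-based thresholds.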
