# Training (train)

## 0) Dataset

We use the Kaggle dataset [Water Quality](https://www.kaggle.com/datasets/adityakadiwal/water-potability).

## 1) Data preparation
- Discretize the columns `ph`, `Sulfate`, and `Trihalomethanes` by equal frequency and by equal width.
  <br>Use `duplicates='drop', retbins=True` so the bin edges can be saved afterwards.
  <br>Name the saved edges `saved_bins_ph`, `saved_bins_sulfate`, and `saved_bins_trihalomethanes`.

- Save the discretizations:
    ```
    import pickle

    with open('saved_bins_ph.pickle', 'wb') as handle:
        pickle.dump(saved_bins_ph, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open('saved_bins_sulfate.pickle', 'wb') as handle:
        pickle.dump(saved_bins_sulfate, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open('saved_bins_trihalomethanes.pickle', 'wb') as handle:
        pickle.dump(saved_bins_trihalomethanes, handle, protocol=pickle.HIGHEST_PROTOCOL)
    ```

- Add a `desconocido` category for the NaN values of the 3 columns mentioned above.

- Apply one-hot encoding with `get_dummies`.
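The preparation steps above can be sketched on toy data as follows. This is a minimal sketch rather than the expected solution: the sample values, the choice of 3 bins, the `_qcut`/`_cut` column suffixes, and the dictionary layout of `saved_bins_ph` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); the real column comes from water_potability.csv.
df = pd.DataFrame({"ph": [1.0, 3.0, 5.0, 7.0, 9.0, 14.0, np.nan]})

# Equal frequency: each bin receives roughly the same number of rows.
df["ph_qcut"], bins_qcut = pd.qcut(df["ph"], q=3, duplicates="drop", retbins=True)

# Equal width: each bin spans the same range of values.
df["ph_cut"], bins_cut = pd.cut(df["ph"], bins=3, include_lowest=True, retbins=True)

# Keep both sets of edges so they can be pickled and reused later.
saved_bins_ph = {"qcut": bins_qcut, "cut": bins_cut}

# Turn each interval into its string label and mark missing values as 'desconocido'.
for col in ["ph_qcut", "ph_cut"]:
    df[col] = df[col].astype(object).map(
        lambda v: "desconocido" if pd.isna(v) else str(v)
    )

# One-hot encode the discretized columns.
df_ohe = pd.get_dummies(df[["ph_qcut", "ph_cut"]])
print(df_ohe.columns.tolist())
```

The row with a missing `ph` ends up with `ph_qcut_desconocido` and `ph_cut_desconocido` set, so the missing value is treated as its own category instead of being dropped.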

## 2) Classification
- The target variable to classify is `Potability`.

- Remember to comment out and NOT use the following cell:
    ```
    data_x = data_x.values
    data_y = data_y.values
    ```

- Use 30% of the dataset for testing.

- For the Random Forest, use the parameters `n_estimators = 1000` and `random_state = 99`.

- Save the model as `rf.pkl`.

- Save the column names:
  ```
    import pickle

    # Save the x columns (without Potability)
    with open('categories_ohe.pickle', 'wb') as handle:
        pickle.dump(data_x.columns, handle, protocol=pickle.HIGHEST_PROTOCOL)
  ```

# Calling the API (call_api.py)
- Use this data for the `data` field of the request:
    ```
    data = {
        'ph': 0,
        'Hardness': 204.890455,
        'Solids': 20791.318981,
        'Chloramines': 7.300212,
        'Sulfate': 368.516441,
        'Conductivity': 564.308654,
        'Organic_carbon': 10.379783,
        'Trihalomethanes': 86.990970,
        'Turbidity': 2.963135,
    }
    ```
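One possible shape for the call itself is sketched below. It uses only the standard library so that it runs without extra packages; the helper names `build_request` and `call_api` are hypothetical, and the address assumes the API is served locally on port 8000 with the `/prediccion` endpoint described later in this document.

```python
import json
import urllib.request

# Assumed local address; adjust it to wherever the API is actually served.
API_URL = "http://127.0.0.1:8000/prediccion"

data = {
    "ph": 0,
    "Hardness": 204.890455,
    "Solids": 20791.318981,
    "Chloramines": 7.300212,
    "Sulfate": 368.516441,
    "Conductivity": 564.308654,
    "Organic_carbon": 10.379783,
    "Trihalomethanes": 86.990970,
    "Turbidity": 2.963135,
}


def build_request(url: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def call_api(url: str, payload: dict) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_request(url, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


request = build_request(API_URL, data)
print(request.get_method(), request.full_url)
# With the server running: print(call_api(API_URL, data))
```

Separating request construction from sending makes it possible to inspect the payload before any server exists.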


# Building the API (main.py)

- Load the column names the same way you load the model.
<br>Place these lines of code below the block that loads the model.
    ```
    # Columns
    COLUMNS_PATH = "model/categories_ohe.pickle"
    with open(COLUMNS_PATH, 'rb') as handle:
        ohe_tr = pickle.load(handle)
    ```

- Also load the bins of the discretizations.
  ```
    # Bins
    BINS_PH = 'model/saved_bins_ph.pickle'
    with open(BINS_PH, 'rb') as handle:
        new_saved_bins_ph = pickle.load(handle)

    BINS_SULFATE = 'model/saved_bins_sulfate.pickle'
    with open(BINS_SULFATE, 'rb') as handle:
        new_saved_bins_sulfate = pickle.load(handle)

    BINS_TRIHALOMETHANES = 'model/saved_bins_trihalomethanes.pickle'
    with open(BINS_TRIHALOMETHANES, 'rb') as handle:
        new_saved_bins_trihalomethanes = pickle.load(handle)
  ```

- For Pydantic: remember that all 9 columns are floats (`float`).

- For the `/prediccion` endpoint, call your `predict_water_potability` function.

- The input data has to match what the model expects, so you need to add or reformat the column names.
    ```
    # Create the DataFrame
    single_instance = pd.DataFrame.from_dict(answer_dict)

    # Apply the saved cut points (bins)
    single_instance["ph"] = single_instance["ph"].astype(float)
    single_instance["ph"] = pd.cut(single_instance['ph'],
                                     bins=new_saved_bins_ph,
                                     include_lowest=True)
    
    single_instance["Sulfate"] = single_instance["Sulfate"].astype(float)
    single_instance["Sulfate"] = pd.cut(single_instance['Sulfate'],
                                     bins=new_saved_bins_sulfate,
                                     include_lowest=True)

    single_instance["Trihalomethanes"] = single_instance["Trihalomethanes"].astype(float)
    single_instance["Trihalomethanes"] = pd.cut(single_instance['Trihalomethanes'],
                                     bins=new_saved_bins_trihalomethanes,
                                     include_lowest=True)

    # One hot encoding
    single_instance_ohe = pd.get_dummies(single_instance).reindex(columns = ohe_tr).fillna(0)
    
    prediction = model.predict(single_instance_ohe)
    ```
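To see why the `reindex` step matters, the sketch below applies the same binning and encoding to a tiny "training" frame and to a single incoming record. The bin edges, the two-column layout, and the helper `discretize` are hypothetical; only the `pd.cut`, `get_dummies`, and `reindex` sequence mirrors the snippet above. Note that `pd.DataFrame.from_dict` needs list-like values (or an index) to build a one-row frame, so the values in `answer_dict` are wrapped in lists here.

```python
import numpy as np
import pandas as pd

# Hypothetical bin edges standing in for the unpickled `new_saved_bins_ph`.
new_saved_bins_ph = np.array([0.0, 5.0, 9.0, 14.0])


def discretize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the saved ph bins, then one-hot encode."""
    out = df.copy()
    out["ph"] = pd.cut(
        out["ph"].astype(float), bins=new_saved_bins_ph, include_lowest=True
    )
    return pd.get_dummies(out)


# "Training" side: the column order that would be saved as categories_ohe.pickle.
train = pd.DataFrame({"ph": [2.0, 7.0, 12.0], "Hardness": [150.0, 200.0, 250.0]})
ohe_tr = discretize(train).columns

# "API" side: one incoming record, built as a one-row DataFrame.
answer_dict = {"Hardness": [204.89], "ph": [7.2]}
single_instance = pd.DataFrame.from_dict(answer_dict)

# reindex puts the columns in training order and fills any missing one with 0.
single_instance_ohe = discretize(single_instance).reindex(
    columns=ohe_tr, fill_value=0
)
print(single_instance_ohe.T)
```

Whatever order the request fields arrive in, the model receives exactly the columns it was trained on, in the same order.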

---
# Testing the API

Now that you know your API is working, vary the `data` field of the request and check what comes back in the response.

**Request 1**
```
data = {
    'ph': 0,
    'Hardness': 204.890455,
    'Solids': 20791.318981,
    'Chloramines': 7.300212,
    'Sulfate': 368.516441,
    'Conductivity': 564.308654,
    'Organic_carbon': 10.379783,
    'Trihalomethanes': 86.990970,
    'Turbidity': 2.963135,
}
```
**Response 1**
```
{'score': 0}
```

**Request 2**
```
data = {
    'ph': 7.7984536762012135,
    'Hardness': 188.39494231709176,
    'Solids': 32704.569285770576,
    'Chloramines': 11.078872478914501,
    'Sulfate': 258.1911841475428,
    'Conductivity': 507.1786882733106,
    'Organic_carbon': 18.272439235274646,
    'Trihalomethanes': 85.17766213336226,
    'Turbidity': 4.107267203260775,
}
```

**Response 2**
```
{'score': 1}
```

1) Load the dataset

In [1]:
!pip install -q gdown
import gdown

folder_id = "1cQ31xAkg6vmYXc_K6tBlUbEiuN8gBR-D"
gdown.download_folder(id=folder_id, quiet=False, use_cookies=False)

!ls

Retrieving folder contents


Processing file 1lJwC6HVocUnD1UuopUIV5bOScuOF67fD Api.ipynb
Processing file 1oKgao1wJHM9c5nAd1myObFyGRPvCKgcr train.py
Processing file 1auSKVSGpfCb8N3725ZjjfDq_RLOuiKuP water_potability.csv


sample_data  tp7





In [2]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path

In [3]:
import pandas as pd
df = pd.read_csv("tp7/water_potability.csv")
df.head()

|  | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.99097 | 2.963135 | 0 |
| 1 | 3.71608 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
| 2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
| 3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
| 4 | 9.092223 | 181.101509 | 17978.986339 | 6.5466 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |


DATA PREPARATION

In [4]:
cols = ["ph", "Sulfate", "Trihalomethanes"]
Q = 5  # number of bins used for both discretizations

# One bundle of bin edges per column, filled in by the loop below
saved_bins_ph = {}
saved_bins_sulfate = {}
saved_bins_trihalomethanes = {}

df_disc = df.copy()

for col in cols:
    series = df_disc[col]

    # Equal frequency (quantile) bins
    try:
        df_disc[f"{col}_qcut"], bins_q = pd.qcut(
            series, q=Q, duplicates="drop", retbins=True
        )
    except ValueError:
        # Fallback if qcut fails: equally spaced edges between min and max
        bins_q = np.linspace(series.min(), series.max(), num=Q+1)
        df_disc[f"{col}_qcut"] = pd.cut(series, bins=bins_q, include_lowest=True)

    # Equal width bins
    df_disc[f"{col}_cut"], bins_c = pd.cut(
        series, bins=Q, include_lowest=True, retbins=True
    )

    # Keep both sets of edges so they can be pickled and reused by the API
    bundle = {"qcut": bins_q, "cut": bins_c}
    if col == "ph":
        saved_bins_ph = bundle
    elif col == "Sulfate":
        saved_bins_sulfate = bundle
    else:
        saved_bins_trihalomethanes = bundle



2) Add the 'desconocido' category to the NaN values of the 3 discretized columns

In [5]:
for col in cols:
    for kind in ("qcut", "cut"):
        c = f"{col}_{kind}"
        df_disc[c] = (
            df_disc[c]
            .astype("object")   # Interval -> object
            .astype(str)        # "(a, b]" -> string; NaN -> "nan"
            .replace("nan", "desconocido")
        )

3) Save the bins for reuse in production

In [6]:
with open("saved_bins_ph.pickle", "wb") as handle:
    pickle.dump(saved_bins_ph, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("saved_bins_sulfate.pickle", "wb") as handle:
    pickle.dump(saved_bins_sulfate, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open("saved_bins_trihalomethanes.pickle", "wb") as handle:
    pickle.dump(saved_bins_trihalomethanes, handle, protocol=pickle.HIGHEST_PROTOCOL)

4) One-Hot Encoding

In [7]:
target = "Potability"
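# Only the discretized columns are one-hot encoded. Together with the target they form
# the modeling table, so the other numeric columns (Hardness, Solids, ...) are left out.
cols_to_dummify = [c for c in df_disc.columns if c.endswith("_qcut") or c.endswith("_cut")]
base = pd.concat([df[[target]], df_disc[cols_to_dummify]], axis=1)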

df_ohe = pd.get_dummies(base, drop_first=False)

print("Shape OHE:", df_ohe.shape)
df_ohe.head()

Shape OHE: (3276, 37)


|  | Potability | ph_qcut_(-0.001, 5.822] | ph_qcut_(5.822, 6.702] | ph_qcut_(6.702, 7.437] | ph_qcut_(7.437, 8.311] | ph_qcut_(8.311, 14.0] | ph_qcut_desconocido | ph_cut_(-0.015, 2.8] | ph_cut_(11.2, 14.0] | ph_cut_(2.8, 5.6] | ... | Trihalomethanes_qcut_(62.656, 70.446] | Trihalomethanes_qcut_(70.446, 79.701] | Trihalomethanes_qcut_(79.701, 124.0] | Trihalomethanes_qcut_desconocido | Trihalomethanes_cut_(0.614, 25.39] | Trihalomethanes_cut_(25.39, 50.043] | Trihalomethanes_cut_(50.043, 74.695] | Trihalomethanes_cut_(74.695, 99.348] | Trihalomethanes_cut_(99.348, 124.0] | Trihalomethanes_cut_desconocido |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | False | False | False | False | False | True | False | False | False | ... | False | False | True | False | False | False | False | True | False | False |
| 1 | 0 | True | False | False | False | False | False | False | False | True | ... | False | False | False | False | False | False | True | False | False | False |
| 2 | 0 | False | False | False | True | False | False | False | False | False | ... | True | False | False | False | False | False | True | False | False | False |
| 3 | 0 | False | False | False | False | True | False | False | False | False | ... | False | False | True | False | False | False | False | False | True | False |
| 4 | 0 | False | False | False | False | True | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |


5) CLASSIFICATION

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
import pickle


In [9]:
target = 'Potability'
data_y = df_ohe[target]
data_x = df_ohe.drop(columns=[target])

# Do not use these lines (keep them commented out):
# data_x = data_x.values
# data_y = data_y.values

# Train/test split (70/30)
X_train, X_test, y_train, y_test = train_test_split(
    data_x, data_y, test_size=0.30, random_state=99, stratify=data_y
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")


Train shape: (2293, 36), Test shape: (983, 36)
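The `stratify=data_y` argument used in the split keeps the class proportions the same in both partitions. A small sketch on a synthetic target (the 60/40 class mix and the single dummy feature are assumptions for illustration) shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target (hypothetical): 600 zeros and 400 ones.
y = pd.Series([0] * 600 + [1] * 400, name="Potability")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=99, stratify=y
)

# With stratify=y, the share of class 1 is preserved in both splits.
print(f"train share of class 1: {y_train.mean():.3f}")
print(f"test share of class 1:  {y_test.mean():.3f}")
```

Without stratification the two shares can drift apart by chance, which makes test metrics harder to compare with training behaviour on an imbalanced target.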


TRAIN THE MODEL

In [10]:
rf = RandomForestClassifier(
    n_estimators=1000,
    random_state=99,
    n_jobs=-1
)
rf.fit(X_train, y_train)

In [11]:
y_pred = rf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"✅ Accuracy: {acc:.4f}\n")
print("📊 Classification report:")
print(classification_report(y_test, y_pred))

✅ Accuracy: 0.5656

📊 Classification report:
              precision    recall  f1-score   support

           0       0.62      0.74      0.67       600
           1       0.42      0.30      0.35       383

    accuracy                           0.57       983
   macro avg       0.52      0.52      0.51       983
weighted avg       0.54      0.57      0.55       983



In [12]:
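# Note: the model is trained on df_ohe, which was built before this cell, so this
# median imputation of df does not change the results below.
df.fillna(df.median(), inplace=True)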


In [13]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=99)

# Convert boolean columns to integers
data_x_numeric = data_x.astype(int)

# Note: X_res and y_res are not used by the cells below, which still fit on X_train and y_train.
X_res, y_res = sm.fit_resample(data_x_numeric, data_y)

In [14]:
rf = RandomForestClassifier(
    n_estimators=1000,
    max_depth=8,
    min_samples_leaf=5,
    random_state=99,
    n_jobs=-1
)


In [15]:
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"✅ Accuracy: {acc:.4f}\n")
print("📊 Classification report:")
print(classification_report(y_test, y_pred))

✅ Accuracy: 0.6277

📊 Classification report:
              precision    recall  f1-score   support

           0       0.63      0.92      0.75       600
           1       0.58      0.17      0.26       383

    accuracy                           0.63       983
   macro avg       0.60      0.54      0.51       983
weighted avg       0.61      0.63      0.56       983
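To put the two accuracy figures in context, it can help to compare them with a majority-class baseline, that is, a trivial model that always predicts the most frequent class. The sketch below uses only the class supports (600 and 383) and the accuracies printed in the two reports above; the baseline comparison itself is an addition and not part of the original notebook.

```python
# Class supports taken from the classification reports above.
support = {0: 600, 1: 383}
total = sum(support.values())

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(support.values()) / total

# Accuracies as printed by the two training runs.
reported = {"first model": 0.5656, "second model": 0.6277}

print(f"majority baseline: {majority_baseline:.4f}")
for name, acc in reported.items():
    print(f"{name}: {acc:.4f} ({acc - majority_baseline:+.4f} vs. baseline)")
```

The baseline comes out at roughly 0.61, so the first run sits below it and the second run only slightly above it, which matches the low recall for class 1 in both reports.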



SAVE THE MODEL

In [16]:
with open('rf.pkl', 'wb') as f:
    pickle.dump(rf, f, protocol=pickle.HIGHEST_PROTOCOL)

print("💾 Model saved as 'rf.pkl'")

💾 Model saved as 'rf.pkl'


In [17]:
with open('categories_ohe.pickle', 'wb') as handle:
    pickle.dump(data_x.columns, handle, protocol=pickle.HIGHEST_PROTOCOL)
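
A quick round trip can confirm that the saved column names load back unchanged. The two column names and the temporary folder below are hypothetical stand-ins for `data_x.columns` and the working directory.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical column names standing in for data_x.columns.
columns = pd.Index(["ph_qcut_desconocido", "Sulfate_cut_desconocido"])

path = Path(tempfile.mkdtemp()) / "categories_ohe.pickle"
with open(path, "wb") as handle:
    pickle.dump(columns, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as handle:
    ohe_tr = pickle.load(handle)

# The object comes back as a pandas Index; list(...) gives plain strings.
print(type(ohe_tr).__name__, list(ohe_tr))
```

Because the pickle holds a pandas `Index`, the process that loads it also needs pandas installed, which is the case for the API in this document.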

Creating, calling, and testing the API.


In [18]:
# === SINGLE CELL FOR COLAB: API ===
import subprocess, time, requests
from google.colab import output

main_py = r'''
from fastapi import FastAPI, Body
from pydantic import BaseModel, Field
import pickle
import pandas as pd

app = FastAPI(title="API - Water Potability")

# --- ARTIFACT PATHS ---
MODEL_PATH = "rf.pkl"
COLUMNS_PATH = "categories_ohe.pickle"
BINS_PH = "saved_bins_ph.pickle"
BINS_SULFATE = "saved_bins_sulfate.pickle"
BINS_TRIHALOMETHANES = "saved_bins_trihalomethanes.pickle"

# --- Load the model ---
with open(MODEL_PATH, "rb") as handle:
    model = pickle.load(handle)

# --- Load "ohe_tr" as a list of columns ---
with open(COLUMNS_PATH, "rb") as handle:
    _raw_ohe = pickle.load(handle)

def _to_columns(ohe_obj):
    # Normalize the pickled object into (list of column names, description of its type)
    if isinstance(ohe_obj, (list, tuple)):
        return list(ohe_obj), str(type(ohe_obj))
    if isinstance(ohe_obj, pd.Index):
        return list(ohe_obj), "pandas.Index"
    if hasattr(ohe_obj, "tolist"):
        return list(ohe_obj.tolist()), "numpy.ndarray"
    if hasattr(ohe_obj, "get_feature_names_out"):
        return list(ohe_obj.get_feature_names_out()), f"{type(ohe_obj)}"
    if isinstance(ohe_obj, dict):
        for key in ["columns", "feature_names", "ohe_columns"]:
            if key in ohe_obj:
                return list(ohe_obj[key]), f"dict[{key}]"
    return [], f"UNSUPPORTED_TYPE: {type(ohe_obj)}"

ohe_tr, ohe_type = _to_columns(_raw_ohe)

# --- Load the bins ---
with open(BINS_PH, "rb") as handle:
    bins_ph = pickle.load(handle)
with open(BINS_SULFATE, "rb") as handle:
    bins_sulfate = pickle.load(handle)
with open(BINS_TRIHALOMETHANES, "rb") as handle:
    bins_trihalo = pickle.load(handle)

# --- Input/Output models ---
class WaterData(BaseModel):
    ph: float
    Hardness: float
    Solids: float
    Chloramines: float
    Sulfate: float
    Conductivity: float
    Organic_carbon: float
    Trihalomethanes: float
    Turbidity: float

class Prediction(BaseModel):
    score: int = Field(..., description="0 = not potable, 1 = potable")

# --- Swagger examples ---
EXAMPLE_BASE = {
    "ph": 0,
    "Hardness": 204.890455,
    "Solids": 20791.318981,
    "Chloramines": 7.300212,
    "Sulfate": 368.516441,
    "Conductivity": 564.308654,
    "Organic_carbon": 10.379783,
    "Trihalomethanes": 86.990970,
    "Turbidity": 2.963135
}
EXAMPLE_REQ1 = {
    "ph": 7.7984536762012135,
    "Hardness": 188.39494231709176,
    "Solids": 32704.569285770576,
    "Chloramines": 11.078872478914501,
    "Sulfate": 258.1911841475428,
    "Conductivity": 507.1786882733106,
    "Organic_carbon": 18.272439235274646,
    "Trihalomethanes": 85.17766213336226,
    "Turbidity": 4.107267203260775
}

@app.get("/")
def root():
    return {"status": "ok", "ohe_type": ohe_type, "ohe_len": len(ohe_tr)}

def _extract_bins_dict(bins_obj):
    """
    Return (bins_cut, bins_qcut), with None for whichever is missing from the pickle.
    The pickle can be:
      - a list/ndarray of bins (treated as cut only)
      - a dict with the keys 'cut' and/or 'qcut'
    """
    bins_cut = None
    bins_qcut = None
    if isinstance(bins_obj, dict):
        # common key names
        for k in ["cut", "bins_cut"]:
            if k in bins_obj:
                bins_cut = bins_obj[k]
                break
        for k in ["qcut", "bins_qcut"]:
            if k in bins_obj:
                bins_qcut = bins_obj[k]
                break
    else:
        # Not a dict: assume it holds only the cut bins
        bins_cut = bins_obj
    return bins_cut, bins_qcut

def _apply_binnings(df: pd.DataFrame, col: str, bins_obj):
    """
    Create the *_cut and/or *_qcut columns with the same names that ohe_tr expects,
      e.g. 'ph_cut_(5.6, 8.4]' and 'ph_qcut_(7.437, 8.311]'
    """
    bins_cut, bins_qcut = _extract_bins_dict(bins_obj)

    if bins_cut is not None:
        df[f"{col}_cut"] = pd.cut(df[col].astype(float), bins=bins_cut, include_lowest=True)
        df[f"{col}_cut"] = df[f"{col}_cut"].astype("object").astype(str).replace("nan", "desconocido")

    if bins_qcut is not None:
        df[f"{col}_qcut"] = pd.cut(df[col].astype(float), bins=bins_qcut, include_lowest=True)
        df[f"{col}_qcut"] = df[f"{col}_qcut"].astype("object").astype(str).replace("nan", "desconocido")

def _preprocess(df_in: pd.DataFrame) -> pd.DataFrame:
    df = df_in.copy()

    # 1) Build the binned features with the expected prefixes
    _apply_binnings(df, "ph", bins_ph)
    _apply_binnings(df, "Sulfate", bins_sulfate)
    _apply_binnings(df, "Trihalomethanes", bins_trihalo)

    # IMPORTANT: if the original columns were dropped during training, do the same here
    # (adjust to match how you trained)
    for col in ["ph", "Sulfate", "Trihalomethanes"]:
        if col in df.columns:
            df.drop(columns=[col], inplace=True)

    # 2) One-hot encode and reindex to the expected columns
    df_ohe = pd.get_dummies(df).reindex(columns=ohe_tr, fill_value=0)
    return df_ohe

@app.post("/prediccion", response_model=Prediction, summary="Predict Water Potability")
def predict_water_potability(
    data: WaterData = Body(
        ...,
        examples={
            "baseline": {"summary": "Baseline example", "value": EXAMPLE_BASE},
            "request1": {"summary": "Request 1", "value": EXAMPLE_REQ1},
        },
    )
):
    X = _preprocess(pd.DataFrame([data.dict()]))
    pred = int(model.predict(X)[0])
    return {"score": pred}



'''
with open("main.py","w") as f:
    f.write(main_py)

# --- Restart Uvicorn in the background
subprocess.run(["pkill","-9","-f","uvicorn"], check=False, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
proc = subprocess.Popen(
    ["uvicorn","main:app","--host","0.0.0.0","--port","8000","--no-access-log"],
    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
)

# --- Wait for the server
BASE_LOCAL = "http://127.0.0.1:8000"
for _ in range(60):
    try:
        if requests.get(BASE_LOCAL, timeout=0.5).ok:
            break
    except Exception:
        time.sleep(0.25)

# --- Swagger URL
proxy_url = output.eval_js("google.colab.kernel.proxyPort(8000)")
print("Open Swagger here:", proxy_url + "/docs")

# --- Quick tests
baseline = {
    "ph": 0,
    "Hardness": 204.890455,
    "Solids": 20791.318981,
    "Chloramines": 7.300212,
    "Sulfate": 368.516441,
    "Conductivity": 564.308654,
    "Organic_carbon": 10.379783,
    "Trihalomethanes": 86.990970,
    "Turbidity": 2.963135
}
request1 = {
    "ph": 7.7984536762012135,
    "Hardness": 188.39494231709176,
    "Solids": 32704.569285770576,
    "Chloramines": 11.078872478914501,
    "Sulfate": 258.1911841475428,
    "Conductivity": 507.1786882733106,
    "Organic_carbon": 18.272439235274646,
    "Trihalomethanes": 85.17766213336226,
    "Turbidity": 4.107267203260775
}


print("Pred baseline ->", requests.post(BASE_LOCAL + "/prediccion", json=baseline, timeout=10).json())
print("Pred request1 ->", requests.post(BASE_LOCAL + "/prediccion", json=request1, timeout=10).json())



Open Swagger here: https://8000-m-s-3g9jnjr3rkhh3-a.asia-east1-0.prod.colab.dev/docs
Pred baseline -> {'score': 0}
Pred request1 -> {'score': 1}
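
One behaviour of the binning step is worth checking when varying the request: `pd.cut` returns NaN for any value outside the saved edges, so such a value ends up in the `desconocido` category just like a missing one. The sketch below illustrates this with hypothetical edges and a hypothetical helper `bin_label`; the real edges come from the pickled bins.

```python
import numpy as np
import pandas as pd

# Hypothetical edges standing in for the saved ph bins.
bins_ph = np.array([0.0, 5.0, 9.0, 14.0])


def bin_label(value: float) -> str:
    """Return the interval label for value, or 'desconocido' if it falls in no bin."""
    interval = pd.cut(
        pd.Series([float(value)]), bins=bins_ph, include_lowest=True
    ).iloc[0]
    return "desconocido" if pd.isna(interval) else str(interval)


# In-range values get an interval; out-of-range and missing values do not.
for value in [0.0, 7.2, 15.0, float("nan")]:
    print(value, "->", bin_label(value))
```

A request with, say, `ph` above the largest training value is therefore scored as if `ph` were unknown, which may or may not be the intended behaviour.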
