# :chart_with_upwards_trend: Customer Churn Model: Feature Engineering

### Run the first Notebook in full before running this second Notebook.

### First, add the `imbalanced-learn`, `snowflake-ml-python`, `altair`, `pandas`, and `numpy` packages from the package picker at the top right. We will use these packages later in the Notebook.

To prepare our data for the model, we need to address class imbalance by upsampling (oversampling) the minority class in our dataset.

For this, we will use the `SMOTE` algorithm from the `imblearn` package (installed as `imbalanced-learn`).
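
To make the idea concrete, here is a minimal, dependency-free sketch of the interpolation step at the heart of SMOTE: each synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The helper name `smote_like_samples` and the toy points are illustrative assumptions, not part of `imblearn`; in practice the library's `SMOTE` class should be preferred.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=111):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Hypothetical helper for illustration only; not imblearn's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority-class neighbors of the base point.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Made-up minority-class points with two features each.
minority_class = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority_class, n_new=3)
for point in new_points:
    print(point)
```

Because every synthetic point is an interpolation between two existing samples, the new points stay inside the region already covered by the minority class rather than being exact duplicates.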

In [None]:
import pandas as pd
import numpy as np
import streamlit as st
import altair as alt
from imblearn.over_sampling import SMOTE 

import warnings
warnings.filterwarnings("ignore")

from snowflake.snowpark.context import get_active_session
session = get_active_session()
session.query_tag = {"origin":"sf_sit-is", 
                     "name":"churn_prediction", 
                     "version":{"major":1, "minor":0},
                     "attributes":{"is_quickstart":1, "source":"notebook"}}

# Load the TELCO_CHURN_PDF table from Snowflake into a pandas DataFrame
telco_churn_pdf = session.sql("SELECT * FROM TELCO_CHURN_PDF").to_pandas()

# Extract the training features
features_names = [col for col in telco_churn_pdf.columns if col not in ['Churn']]
features = telco_churn_pdf[features_names]

In [None]:
# extract the target
target = telco_churn_pdf['Churn']
st.markdown("## Let's balance the dataset.")
# upsample the minority class in the dataset
upsampler = SMOTE(random_state = 111)
features, target = upsampler.fit_resample(features, target)
st.dataframe(features.head())

st.markdown("## Upsampled data.")
upsampled_data = pd.concat([features, target], axis=1)
upsampled_data.reset_index(inplace=True)
upsampled_data.rename(columns={'index': 'INDEX'}, inplace=True)
st.dataframe(upsampled_data.head())

st.markdown("## Preview of upsampled data.")
upsampled_data = session.create_dataframe(upsampled_data)
# Get the list of column names from the dataset
feature_names_input = [c for c in upsampled_data.columns if c != '"Churn"' and c != "INDEX"]
upsampled_data[feature_names_input]

With the class imbalance handled, we will use scikit-learn to preprocess our data into the format the model expects. This means scaling our features and splitting our data into training and test sets.

We can apply `StandardScaler` preprocessing either through sklearn, which processes the data in memory, or through Snowpark ML, which pushes the computation down to Snowflake.
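
As a rough illustration of what standard scaling computes, the sketch below standardizes a single column with the Python standard library: subtract the mean, then divide by the population standard deviation. The function names and the sample values are assumptions made for this example; the notebook itself relies on the library scalers.

```python
import statistics


def fit_standard_scaler(column):
    """Return the mean and population standard deviation of a numeric column."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (ddof=0)
    return mean, std


def transform(column, mean, std):
    """Apply z = (x - mean) / std; constant columns are only centered."""
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]


monthly_charge = [30.0, 40.0, 50.0, 60.0, 70.0]  # made-up values
mean, std = fit_standard_scaler(monthly_charge)
scaled = transform(monthly_charge, mean, std)
print(scaled)

# A new row must be scaled with the parameters learned from the training data.
new_customer_scaled = transform([37.0], mean, std)
print(new_customer_scaled)
```

After scaling, the column has mean 0 and standard deviation 1, which keeps features measured in different units on a comparable scale.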

## Scikit-learn preprocessing with pandas DataFrames

In [None]:
import sklearn.preprocessing as pp_original
# Initialize a scikit-learn StandardScaler and pull the feature columns into pandas
scaler = pp_original.StandardScaler()
features_pdf = upsampled_data[feature_names_input].to_pandas()

# Fit the scaler to the dataset
scaler.fit(features_pdf)

# Transform the dataset using the fitted scaler
scaled_features = scaler.transform(features_pdf)
scaled_features = pd.DataFrame(scaled_features, columns = features_names)
scaled_features

## Snowpark ML preprocessing with Snowpark DataFrames

Note how similar the sklearn and Snowpark ML APIs are.

In [None]:
import snowflake.ml.modeling.preprocessing as pp

# Initialize a StandardScaler object with input and output column names
scaler = pp.StandardScaler(
    input_cols=feature_names_input,
    output_cols=feature_names_input
)

# Fit the scaler to the dataset
scaler.fit(upsampled_data)

# Transform the dataset using the fitted scaler
scaled_features = scaler.transform(upsampled_data)
scaled_features

## Let's split the data into training and test sets using an 80/20 ratio.

In [None]:
# Split the scaled_features dataset into training and testing sets with an 80/20 ratio
training, testing = scaled_features.random_split(weights=[0.8, 0.2], seed=111)
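
The following sketch shows one way a weighted random split can work, assuming each row is assigned independently to a partition with probability proportional to its weight. Under that assumption the resulting sizes are only approximately 80/20. The function below is a hypothetical pure-Python stand-in and does not reproduce Snowpark's implementation.

```python
import random


def random_split(rows, weights, seed=111):
    """Assign each row to one partition with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last partition also catches any floating-point leftovers.
            if draw < bound or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


rows = list(range(1000))
training_rows, testing_rows = random_split(rows, weights=[0.8, 0.2])
print(len(training_rows), len(testing_rows))
```

Fixing the seed makes the split reproducible, which matters when comparing models trained on the same data.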

# Model training: Random Forest Classifier

Today's model is a [Random Forest Classifier](https://towardsdatascience.com/understanding-random-forest-58381e0602d2). I won't go into detail about how it works, but in short, it builds an ensemble of smaller models (decision trees) that each make a prediction on the same data. The prediction with the most votes becomes the model's final prediction.
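
The voting idea can be sketched in a few lines. The votes below are made up, and the helper `majority_vote` is hypothetical. Note that scikit-learn's random forest averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one customer (1 = churn, 0 = no churn).
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # 1, since three of the five trees voted for churn
```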

In [None]:
from snowflake.ml.modeling.ensemble import RandomForestClassifier

# Define the target variable (label) column name
label = ['"Churn"']

# Define the output column name for the predicted label
output_label = ['"predicted_churn"']

# Initialize a RandomForestClassifier object with input, label, and output column names
model = RandomForestClassifier(
    input_cols=feature_names_input,
    label_cols=label,
    output_cols=output_label,
)

# Train the RandomForestClassifier model using the training set
_ = model.fit(training)

# Predict the target variable (churn) for the testing set using the trained model
results = model.predict(testing)

# Display the test rows together with their predicted labels
results

# Model evaluation

Model evaluation means checking how well our machine learning model performs by comparing its predictions with the actual outcomes.
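
As a small, self-contained illustration of that comparison, the sketch below counts the four confusion-matrix outcomes for a binary churn label and derives accuracy, precision, and recall. The label lists are invented for the example and the helper name is an assumption. On real data, `sklearn.metrics` provides equivalent functions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


# Made-up labels: 1 = churned, 0 = stayed.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(actual_labels, predicted_labels)
accuracy = (tp + tn) / len(actual_labels)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For churn, recall is often the metric to watch, since a missed churner (a false negative) usually costs more than a false alarm.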

In [None]:
# return only the predicted churn values
# to_pandas() drops the double quotes from quoted identifiers, so use the bare column name
predictions = results.to_pandas().sort_values("INDEX")[["predicted_churn"]].astype(int).to_numpy().flatten()
actual = testing.to_pandas().sort_values("INDEX")[['Churn']].to_numpy().flatten()

## Feature importance

Feature importance tells us which input variables are the real MVPs when our machine learning model makes predictions. We will find the most important features by looking at how much each one contributes to the model's overall performance.
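
One way to measure a feature's contribution to performance is permutation importance: shuffle one column, re-score the model, and record how much accuracy drops. The sketch below uses a hypothetical rule-based `toy_model` and made-up rows. It illustrates a different technique from the one used in the cell that follows, because scikit-learn's `feature_importances_` for random forests is impurity-based rather than permutation-based.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, column, n_repeats=20, seed=111):
    """Average drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def toy_model(row):
    """Hypothetical model: predicts churn from service calls, ignores account weeks."""
    calls, _weeks = row
    return 1 if calls >= 4 else 0


# Made-up rows of (customer_service_calls, account_weeks).
rows = [(0, 10), (1, 50), (5, 20), (6, 80), (2, 35), (7, 60), (4, 15), (3, 90)]
labels = [toy_model(r) for r in rows]

importance_calls = permutation_importance(toy_model, rows, labels, column=0)
importance_weeks = permutation_importance(toy_model, rows, labels, column=1)
print(importance_calls, importance_weeks)
```

Shuffling a feature the model ignores leaves accuracy unchanged, so its importance is zero, while shuffling a feature the model relies on degrades accuracy.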

In [None]:
rf = model.to_sklearn()
importances = pd.DataFrame(
    list(zip(features.columns, rf.feature_importances_)),
    columns=["feature", "importance"],
)

bar_chart = alt.Chart(importances).mark_bar().encode(
    x="importance:Q",
    y=alt.Y("feature:N", sort="-x")
)
st.altair_chart(bar_chart, use_container_width=True)

## Predicting churn for a new user

Using our trained random forest model, we can predict whether a new customer will churn.

In [None]:
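# Example inputs for a new customer (numeric values rather than strings)
account_weeks = 10
data_usage = 1.7
mins_per_month = 82
daytime_calls = 67
customer_service_calls = 4
monthly_charge = 37
roam_mins = 0
overage_fee = 9.5
# Use real booleans: a non-empty string such as "false" is always truthy
renewed_contract = True
has_data_plan = True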
user_vector = np.array([
    account_weeks,
    1 if renewed_contract else 0,
    1 if has_data_plan else 0,
    data_usage,
    customer_service_calls,
    mins_per_month,
    daytime_calls,
    monthly_charge,
    overage_fee,
    roam_mins,
]).reshape(1,-1)

user_dataframe = pd.DataFrame(user_vector, columns=[f'"{_}"' for _ in features.columns])

#### Input dataframe for the new user

In [None]:
st.markdown("#### New user")
# A bare expression is only rendered when it is the last line of a cell
st.dataframe(user_dataframe)
user_vector = scaler.transform(user_dataframe)
st.markdown("#### Churn prediction")
model.predict(user_vector)[['"predicted_churn"']].values

In [None]:
st.markdown("#### Scaled dataframe for new user")
st.dataframe(user_vector)
st.markdown("#### Prediction")
predicted_value = model.predict(user_vector)[['"predicted_churn"']].values.astype(int).flatten()
user_probability = model.predict_proba(user_vector)
probability_of_prediction = max(user_probability[user_probability.columns[-2:]].values[0]) * 100
prediction = 'churn' if predicted_value[0] == 1 else 'not churn'
st.markdown(f"{prediction} (probability: {probability_of_prediction:.1f}%)")

In [None]:
col1, col2 = st.columns(2)

with col1: 
    account_weeks = st.slider("Weeks as a customer", int(features["AccountWeeks"].min()), int(features["AccountWeeks"].max()))
    data_usage = st.slider("Data usage", int(features["DataUsage"].min()), int(features["DataUsage"].max()))
    mins_per_month = st.slider("Daytime minutes", int(features["DayMins"].min()), int(features["DayMins"].max()))
    daytime_calls = st.slider("Daytime calls", int(features["DayCalls"].min()), int(features["DayCalls"].max()))
    #renewed_contract = st.selectbox("Contract renewed?", ('true', 'false'))

with col2:
    monthly_charge = st.slider("Monthly charge", int(features["MonthlyCharge"].min()), int(features["MonthlyCharge"].max()))
    roam_mins = st.slider("Roaming minutes", int(features["RoamMins"].min()), int(features["RoamMins"].max()))
    customer_service_calls = st.slider("Customer service calls", int(features["CustServCalls"].min()), int(features["CustServCalls"].max()))
    overage_fee = st.slider("Overage fee", int(features["OverageFee"].min()), int(features["OverageFee"].max()))
    #has_data_plan = st.selectbox("Has data plan?", ('true', 'false'))

user_vector = np.array([
    account_weeks,
    1 if renewed_contract else 0,
    1 if has_data_plan else 0,
    data_usage,
    customer_service_calls,
    mins_per_month,
    daytime_calls,
    monthly_charge,
    overage_fee,
    roam_mins,
]).reshape(1,-1)

user_dataframe = pd.DataFrame(user_vector, columns=[f'"{_}"' for _ in features.columns])
user_vector = scaler.transform(user_dataframe)
with col1: 
    st.markdown("#### Input dataframe for new user")
    st.dataframe(user_dataframe)
with col2:
    st.markdown("#### Scaled dataframe for new user")
    st.dataframe(user_vector)

st.markdown("#### Prediction")
predicted_value = model.predict(user_vector)[['"predicted_churn"']].values.astype(int).flatten()
user_probability = model.predict_proba(user_vector)
probability_of_prediction = max(user_probability[user_probability.columns[-2:]].values[0]) * 100
prediction = 'churn' if predicted_value[0] == 1 else 'not churn'
st.markdown(f"{prediction} (probability: {probability_of_prediction:.1f}%)")

## Exporting Model with Timestamp

In [None]:
import pickle
import datetime
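# Use a filesystem-safe timestamp (no spaces or colons) in the file name
filename = f'telco-eda-model-{datetime.datetime.now().strftime("%Y%m%d-%H%M%S")}.pkl'

# Open the file in a context manager so it is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)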
print(f"Saved to {filename}")
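
To check that an exported file can be read back, a round trip like the one below can help. It is a minimal sketch that pickles a plain dictionary as a stand-in for the trained model and writes to the system temp directory. The stand-in object, the path, and the timestamp format are assumptions made for this example.

```python
import datetime
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable object works the same way.
stand_in_model = {"name": "churn_model", "n_estimators": 100}

# Timestamp without spaces or colons so the name is valid on all platforms.
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
path = os.path.join(tempfile.gettempdir(), f"telco-eda-model-{timestamp}.pkl")

with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)

# Clean up the temporary file created by this example.
os.remove(path)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.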

Congratulations on making it to the end of this Lab, where we explored churn modeling using Snowflake Notebooks! We learned how to load data into Snowflake, train a Random Forest model, visualize predictions, build an interactive data app, and make predictions for new users.
