# 📘 Telecom X – Part 2: Churn Prediction

## Index
1. **DATA PREPARATION**
2. **CORRELATION AND FEATURE SELECTION**
3. **PREDICTIVE MODEL**
4. **INTERPRETATION AND CONCLUSIONS**
5. **FINAL REPORT**


## DATA PREPARATION

In [None]:
import pandas as pd
from pathlib import Path

# Candidate files, in order of preference (the already-encoded file first).
candidates = [
    'TelecomX_Data_Modelo_Encoded.csv',
    '/mnt/data/TelecomX_Data_Modelo_Encoded.csv',
    'TelecomX_Data_Modelo_Base.csv',
    '/mnt/data/TelecomX_Data_Modelo_Base.csv',
    'TelecomX_Data_Estandarizado.csv',
    '/mnt/data/TelecomX_Data_Estandarizado.csv',
]
# Use the first candidate that exists on disk.
path = next((p for p in candidates if Path(p).exists()), None)
if path is None:
    raise FileNotFoundError("No valid CSV found.")
print("Using:", path)

df = pd.read_csv(path)
if 'Evasion' not in df.columns:
    raise KeyError("Target column 'Evasion' not found in the CSV.")

# Separate features and target; astype(int) assumes 'Evasion' is already numeric.
X = df.drop(columns=['Evasion'])
y = df['Evasion'].astype(int)

# One-hot encode on the fly if any text columns remain.
cat_cols = X.select_dtypes(include=['object']).columns.tolist()
if cat_cols:
    print("One-hot encoding on the fly for:", cat_cols)
    X = pd.get_dummies(X, columns=cat_cols, drop_first=True)

print("X:", X.shape, " y:", y.shape)
# Class proportions of the target (shows how imbalanced it is).
y.value_counts(normalize=True).round(3)

In [None]:
from sklearn.model_selection import train_test_split

# 70/30 split; stratify=y keeps the churn rate the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
(X_train.shape, X_test.shape)
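
The effect of `stratify=y` in the split above can be checked on synthetic data. This is a minimal sketch: `X_demo` and `y_demo` are made-up arrays with a 25% positive rate, not the Telecom X data, and the names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 25% positives.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 250 + [0] * 750)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep (almost exactly) the original positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

Running the same comparison on `y_train` and `y_test` from the notebook is a quick sanity check before fitting any model.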

## CORRELATION AND FEATURE SELECTION

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Pearson correlation between every numeric feature and the target.
corr = pd.concat([X, y], axis=1).corr(numeric_only=True)
# Rank features by the absolute value of their correlation with 'Evasion'.
target_corr = corr['Evasion'].drop(labels=['Evasion']).sort_values(key=lambda s: s.abs(), ascending=False)
print("Top correlations with Evasion:")
print(target_corr.head(15))

# Heatmap of the full matrix; a fixed [-1, 1] range makes the colors comparable.
plt.figure(figsize=(10,7))
plt.imshow(corr, aspect='auto', cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
plt.yticks(range(len(corr.index)), corr.index)
plt.title('Correlation matrix')
plt.tight_layout()
plt.show()

## PREDICTIVE MODEL

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Logistic regression is fitted on standardized features; the scaler lives in the
# pipeline so it is fitted on the training data only.
pipe_logreg = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500, class_weight='balanced', solver='liblinear'))
])
# Random forest is fitted on the unscaled features.
rf = RandomForestClassifier(n_estimators=300, random_state=42, class_weight='balanced', n_jobs=-1)

pipe_logreg.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Hard predictions and churn probabilities (class 1) on the test set.
y_pred_log = pipe_logreg.predict(X_test)
y_proba_log = pipe_logreg.predict_proba(X_test)[:,1]
y_pred_rf  = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:,1]

'models trained'

In [None]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, classification_report, confusion_matrix,
)
import numpy as np
import matplotlib.pyplot as plt

def evaluar_modelo(nombre, y_true, y_pred, y_proba):
    """Print the main classification metrics and plot the confusion matrix."""
    print(f"\n=== {nombre} ===")
    print("Accuracy:", round(accuracy_score(y_true, y_pred), 3))
    print("Precision:", round(precision_score(y_true, y_pred, zero_division=0), 3))
    print("Recall:", round(recall_score(y_true, y_pred, zero_division=0), 3))
    print("F1-score:", round(f1_score(y_true, y_pred, zero_division=0), 3))
    # AUC uses the predicted probabilities, not the hard labels.
    print("AUC-ROC:", round(roc_auc_score(y_true, y_proba), 3))
    print("\nReport:\n", classification_report(y_true, y_pred, digits=3, zero_division=0))

    # Rows are actual classes, columns are predicted classes.
    cm = confusion_matrix(y_true, y_pred)
    fig = plt.figure(figsize=(5,4))
    ax = fig.add_subplot(111)
    im = ax.imshow(cm, interpolation='nearest')
    fig.colorbar(im)
    ax.set_title(f"Confusion Matrix - {nombre}")
    ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
    ax.set_xticks([0,1]); ax.set_xticklabels(['0','1'])
    ax.set_yticks([0,1]); ax.set_yticklabels(['0','1'])
    # Write the count inside each cell.
    for (i,j), v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha='center', va='center')
    plt.tight_layout(); plt.show()

evaluar_modelo('Logistic Regression', y_test, y_pred_log, y_proba_log)
evaluar_modelo('Random Forest', y_test, y_pred_rf, y_proba_rf)
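
To make the printed metrics easier to read against the confusion matrix, the sketch below derives accuracy, precision, recall and F1 directly from the four cell counts. For binary labels `[0, 1]`, scikit-learn's `confusion_matrix(...).ravel()` returns them in the order `tn, fp, fn, tp`. The helper `metrics_from_counts` is hypothetical and the counts are illustrative only; they are not results from the Telecom X models.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the headline metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only (assumption), NOT output of the notebook's models.
demo = metrics_from_counts(tn=80, fp=20, fn=10, tp=40)
print({name: round(value, 3) for name, value in demo.items()})
```

In a churn setting, recall (the share of actual churners that are caught) is often the metric to watch, since a missed churner typically costs more than an unnecessary retention offer; whether that holds here depends on campaign costs not covered in this notebook.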

## INTERPRETATION AND CONCLUSIONS

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display

# Signed logistic-regression coefficients, ranked by absolute value.
log_coefs = pd.Series(pipe_logreg.named_steps['clf'].coef_[0], index=X_train.columns).sort_values(key=np.abs, ascending=False)
# Impurity-based feature importances from the random forest.
rf_imp = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

print("Top 15 | Coefficients ranked by absolute value (LogReg):")
display(log_coefs.head(15).to_frame('coefficient'))

print("\nTop 15 | Importances (Random Forest):")
display(rf_imp.head(15).to_frame('importance'))
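
Because the logistic regression is fitted on standardized features, each coefficient is the change in the log-odds of churn for a one-standard-deviation increase in that feature, with the others held fixed. Exponentiating gives an odds ratio, which is often easier to communicate. The sketch below uses hypothetical coefficient values and feature names; they are not the fitted values from `pipe_logreg`.

```python
import math

# Hypothetical standardized coefficients, for illustration only (assumption).
demo_coefs = {"feature_a": -0.8, "feature_b": 0.5}

# exp(beta) is the multiplicative change in churn odds per +1 standard deviation.
odds_ratios = {name: math.exp(beta) for name, beta in demo_coefs.items()}

for name, ratio in odds_ratios.items():
    direction = "lowers" if ratio < 1 else "raises"
    print(f"{name}: {direction} the odds of churn (x{ratio:.2f} per +1 SD)")
```

A ratio below 1 means the feature is associated with lower churn odds, above 1 with higher. The same transformation could be applied to `log_coefs` with `np.exp(log_coefs)`.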

## FINAL REPORT

**Executive summary**
- Two models were trained and evaluated: Logistic Regression (with feature scaling) and Random Forest (without scaling).
- The target class is imbalanced, so the train/test split was stratified and both models were fitted with `class_weight='balanced'`.
- The importance analysis suggests that **tenure (MesesContrato)**, **contract type** and **monthly charges** are relevant.
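
For reference, `class_weight='balanced'` in scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`, so the rarer class counts for more during training. The sketch below reproduces that formula with NumPy on a made-up label vector; `balanced_class_weights` and `y_demo` are hypothetical names, and the 75/25 split is an assumption, not the Telecom X class ratio.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights as n_samples / (n_classes * class_count), per class label."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Made-up labels (assumption): 75 non-churners, 25 churners.
y_demo = np.array([0] * 75 + [1] * 25)
weights = balanced_class_weights(y_demo)
print(weights)
```

With these labels the minority class receives three times the weight of the majority class, which offsets its three-times-smaller count.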

**Recommendations**
- Early retention efforts for customers with short tenure and month-to-month contracts.
- Plan or price adjustments for customers with high charges.
- Promote value-added services (Security/Support).
- Deploy the churn score to drive proactive campaigns.
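
One way the last recommendation could be put into practice is to rank customers by predicted churn probability and target the top slice. The sketch below is only an illustration under stated assumptions: `select_top_risk`, the customer ids, the probabilities and the targeted fraction are all hypothetical, and in the notebook the scores would come from `predict_proba(...)[:, 1]`.

```python
import numpy as np

def select_top_risk(customer_ids, churn_proba, fraction=0.10):
    """Return the ids of the highest-scoring customers, highest risk first."""
    customer_ids = np.asarray(customer_ids)
    churn_proba = np.asarray(churn_proba, dtype=float)
    # Always select at least one customer.
    n_select = max(1, int(np.ceil(fraction * len(churn_proba))))
    # Sort by descending probability; 'stable' keeps input order on ties.
    order = np.argsort(-churn_proba, kind="stable")
    return customer_ids[order[:n_select]].tolist()

# Made-up ids and scores (assumption), not model output.
demo_ids = ["C01", "C02", "C03", "C04", "C05"]
demo_proba = [0.12, 0.87, 0.45, 0.91, 0.30]
targets = select_top_risk(demo_ids, demo_proba, fraction=0.40)
print(targets)
```

The fraction to target is a business decision (campaign budget, cost of an offer versus value of a retained customer) rather than something the model determines.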
