## Telco Customer Churn Prediction with XGBoost

This notebook demonstrates how to build a customer churn prediction model using the XGBoost algorithm. The dataset used is the Telco Customer Churn dataset, which contains information about telecom customers and whether they churned (stopped using the service).

### 1. Data Loading and Exploration

First, we load the dataset and explore its structure and content.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Display the first few rows of the dataframe
print(df.head())

# Get information about the columns and their data types
print(df.info())

# Get descriptive statistics of the numerical columns
print(df.describe())

# Check for missing values
print(df.isnull().sum())

   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0           1  Female              0     Yes         No       1           No   
1           2    Male              0      No         No      34          Yes   
2           3    Male              0      No         No       2          Yes   
3           4    Male              0      No         No      45           No   
4           5  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract Pape

### 2. Data Preprocessing

Next, we preprocess the data. This includes handling missing values, encoding categorical features, and scaling numerical features.

In [2]:
# Handle missing values (if any)
# For example, fill missing "TotalCharges" with the mean
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce") # Convert to numeric, coercing errors
df["TotalCharges"].fillna(df["TotalCharges"].mean(), inplace=True)

# Convert categorical columns to numerical using one-hot encoding
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Separate features (X) and target (y)
# Ensure "Churn_Yes" is the correct target column name after one-hot encoding. If your original "Churn" column was Yes/No, 
# and drop_first=True, then "Churn_Yes" would be the column. If it was 0/1, adjust accordingly.
# It"s also possible that the target column is simply "Churn" if it was already binary and not part of get_dummies.
# We will assume "Churn_Yes" for now based on the previous script. Verify this if errors occur.
if "Churn_Yes" in df.columns:
    X = df.drop("Churn_Yes", axis=1)
    y = df["Churn_Yes"]
elif "Churn" in df.columns: # Fallback if "Churn_Yes" is not present and original Churn was binary
    X = df.drop("Churn", axis=1)
    y = df["Churn"]
else:
    # Attempt to find a likely target column if "Churn_Yes" or "Churn" is not directly found after dummification
    possible_target_cols = [col for col in df.columns if "Churn" in col]
    if possible_target_cols:
        target_col_name = possible_target_cols[0] # Take the first match
        print(f"Warning: Target column 'Churn_Yes' or 'Churn' not found. Using {target_col_name} as target.")
        X = df.drop(target_col_name, axis=1)
        y = df[target_col_name]
    else:
        raise KeyError("Target column related to 'Churn' not found in dataframe after preprocessing. Please check column names.")


# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numerical features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Identify numerical columns to scale (excluding binary/dummy variables created by get_dummies)
# This step might need adjustment based on the actual column names after one-hot encoding
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ["int64", "float64"] and X_train[col].nunique() > 2] 
if numerical_cols:
    X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
    X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df["TotalCharges"].fillna(df["TotalCharges"].mean(), inplace=True)


### 3. Model Training and Prediction

Now, we train the XGBoost model and make predictions.

In [3]:
import xgboost as xgb

# Initialize and train the XGBoost classifier
# For newer versions of XGBoost, use_label_encoder=False is deprecated and handled automatically.
# If you encounter a warning, you can remove it. If "use_label_encoder" causes an error, remove it.
model = xgb.XGBClassifier(objective="binary:logistic", eval_metric="logloss", use_label_encoder=False if int(xgb.__version__.split(".")[0]) < 2 else None)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

### 4. Model Evaluation

Finally, we evaluate the model"s performance.

In [4]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Display confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.80

Classification Report:
              precision    recall  f1-score   support

       False       0.84      0.89      0.87      1036
        True       0.64      0.53      0.58       373

    accuracy                           0.80      1409
   macro avg       0.74      0.71      0.72      1409
weighted avg       0.79      0.80      0.79      1409


Confusion Matrix:
[[926 110]
 [177 196]]


This notebook provides a basic framework. You may need to adjust the preprocessing steps, hyperparameter tuning, and model evaluation based on your specific requirements and findings.