About the data:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                 |
|----------------------|-------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                        |
| `telecom_partner`    | The telecom partner associated with the customer.           |
| `gender`             | The gender of the customer.                                 |
| `age`                | The age of the customer.                                    |
| `state`              | The Indian state in which the customer is located.          |
| `city`               | The city in which the customer is located.                  |
| `pincode`            | The pincode of the customer's location.                     |
| `registration_event` | When the customer registered with the telecom partner.      |
| `num_dependents`     | The number of dependents (e.g., children) the customer has. |
| `estimated_salary`   | The customer's estimated salary.                            |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable      | Description                                                                         |
|---------------|-------------------------------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                                                |
| `calls_made`  | The number of calls made by the customer.                                           |
| `sms_sent`    | The number of SMS messages sent by the customer.                                    |
| `data_used`   | The amount of data used by the customer.                                            |
| `churn`       | Binary variable indicating whether the customer has churned (1 = churned, 0 = not). |


In [77]:
# Import libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [78]:
# Read csvs
demographics = pd.read_csv("telecom_demographics.csv")
usage = pd.read_csv("telecom_usage.csv")

# Join the two DataFrames on customer_id and compute the churn rate
churn_df = demographics.merge(usage, on="customer_id")
churn_rate = churn_df["churn"].value_counts() / len(churn_df)
print(churn_rate)

0    0.799538
1    0.200462
Name: churn, dtype: float64
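
The same class proportions can be obtained with `value_counts(normalize=True)`, and the merge can be asked to check that `customer_id` is unique on both sides. The sketch below uses two small made-up frames (`toy_demographics`, `toy_usage`) rather than the real CSVs, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; values are invented.
toy_demographics = pd.DataFrame(
    {"customer_id": [1, 2, 3, 4, 5], "age": [25, 40, 31, 58, 46]}
)
toy_usage = pd.DataFrame(
    {"customer_id": [1, 2, 3, 4, 5], "churn": [0, 0, 1, 0, 0]}
)

# validate="one_to_one" raises a MergeError if customer_id repeats on either side.
toy_df = toy_demographics.merge(toy_usage, on="customer_id", validate="one_to_one")

# normalize=True returns proportions instead of raw counts.
toy_rate = toy_df["churn"].value_counts(normalize=True)
print(toy_rate)
```

Whether the real files are strictly one row per customer is an assumption; if they are not, the `validate` argument would surface that early instead of silently duplicating rows.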


In [79]:
# Determine categorical variables
print(churn_df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 761.7+ KB
None


In [80]:
# One Hot Encoding for Categorical Variables
churn_df = pd.get_dummies(churn_df, columns=["telecom_partner", "gender", "state", "city", "registration_event"])
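
One-hot encoding creates one column per distinct value, so columns with many unique values (possibly `city` or `registration_event` here) can widen the frame considerably. A minimal sketch, on an invented toy frame, of checking cardinality before encoding:

```python
import pandas as pd

# Invented toy data; the column names mirror the notebook, the values do not.
toy = pd.DataFrame(
    {
        "gender": ["M", "F", "F", "M"],
        "registration_event": ["2020-01-01", "2020-03-05", "2021-07-19", "2022-11-30"],
        "age": [25, 40, 31, 58],
    }
)
cat_cols = ["gender", "registration_event"]

# Number of distinct values per categorical column.
cardinality = toy[cat_cols].nunique()
print(cardinality)

# Each distinct value becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=cat_cols)
print(encoded.shape)
```

In this toy case every registration date is unique, so the encoded column carries no shared signal between rows; it may be worth running the same `nunique()` check on the real `churn_df` before encoding.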

In [81]:
# Standard Scaling
scaler = StandardScaler()
features = churn_df.drop(["customer_id", "churn"], axis=1)
features_scaled = scaler.fit_transform(features)

# Define target column
target = churn_df["churn"]

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, random_state=42, test_size=0.2)
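
The cell above fits the scaler on all rows before splitting, which lets test-set statistics influence the training features. A common alternative, sketched here on synthetic data (`X_toy`, `y_toy` are made up), is to split first, fit the scaler on the training rows only, and stratify the split so both parts keep the same churn ratio. This is a variation on the notebook's approach, not what produced the outputs shown in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and an 80/20 imbalanced target, for illustration only.
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_toy = np.array([0] * 160 + [1] * 40)

# Split first; stratify keeps the class ratio equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on training rows only, then reuse it on the test rows.
toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```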

In [82]:
# Instantiate logistic regression model
log_reg_model = LogisticRegression(random_state=42)

# Fit logistic regression model
log_reg_model.fit(X_train, y_train)

# Predict
logreg_pred = log_reg_model.predict(X_test)

In [83]:
# Instantiate random forest model
rf_model = RandomForestClassifier(random_state=42)

# Fit random forest model
rf_model.fit(X_train, y_train)

# Predict
rf_pred = rf_model.predict(X_test)

In [84]:
# Assessing the models
# Confusion matrices (rows = actual class, columns = predicted class)
logreg_cm = confusion_matrix(y_test, logreg_pred)
rf_cm = confusion_matrix(y_test, rf_pred)
print(logreg_cm, rf_cm)

[[927 100]
 [248  25]] [[1026    1]
 [ 273    0]]
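
To make the matrices easier to read, the sketch below unpacks the printed values into accuracy and the number of churners each model actually caught. The arrays are copied from the output above; the helper name `summarize` is made up for this example.

```python
import numpy as np

# Values copied from the printed output (rows = actual, columns = predicted).
logreg_cm_printed = np.array([[927, 100], [248, 25]])
rf_cm_printed = np.array([[1026, 1], [273, 0]])


def summarize(cm):
    """Return accuracy, churners caught, and total churners for a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()
    return accuracy, int(tp), int(tp + fn)


for name, cm in [("logreg", logreg_cm_printed), ("random forest", rf_cm_printed)]:
    acc, caught, churners = summarize(cm)
    print(f"{name}: accuracy={acc:.2f}, churners caught={caught} of {churners}")
```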


In [85]:
# Classification reports
logreg_report = classification_report(y_test, logreg_pred)
rf_report = classification_report(y_test, rf_pred)
print(logreg_report)

              precision    recall  f1-score   support

           0       0.79      0.90      0.84      1027
           1       0.21      0.10      0.14       273

    accuracy                           0.73      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.73      0.69      1300



In [86]:
print(rf_report)

              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300



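The random forest has the higher accuracy (0.79 vs. 0.73 for logistic regression). That figure matches the share of non-churners in the test set (1027 of 1300), and the confusion matrix shows the forest flags only one customer as churned and identifies none of the 273 actual churners, so accuracy alone overstates its usefulness here.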
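
Because roughly 80% of customers in this data did not churn, it can help to compare any accuracy figure against a majority-class baseline. A minimal sketch using scikit-learn's `DummyClassifier`, with labels built from the class supports shown in the reports (1027 and 273); the feature array is a placeholder, since this strategy ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels reconstructed from the reported supports; not the real y_test order.
y_like_test = np.array([0] * 1027 + [1] * 273)
X_placeholder = np.zeros((len(y_like_test), 1))  # ignored by this strategy

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_like_test)
baseline_acc = baseline.score(X_placeholder, y_like_test)

print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

In the notebook itself the baseline would be fit on `y_train` and scored on `y_test`; fitting and scoring on the same array here is only to keep the example self-contained.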

In [87]:
higher_accuracy = "RandomForest"