![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep appearing, and many customers switch between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding the factors that drive churn, and predicting which customers are likely to leave, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you will explore customer behavior and demographics in the Indian telecom sector to predict customer churn, using two datasets from four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                      |
|----------------------|--------------------------------------------------|
| `customer_id`          | Unique identifier for each customer.             |
| `telecom_partner`      | The telecom partner associated with the customer.|
| `gender`               | The gender of the customer.                      |
| `age`                  | The age of the customer.                         |
| `state`                | The Indian state in which the customer is located.|
| `city`                 | The city in which the customer is located.       |
| `pincode`              | The pincode of the customer's location.          |
| `registration_event` | When the customer registered with the telecom partner.|
| `num_dependents`      | The number of dependents (e.g., children) the customer has.|
| `estimated_salary`     | The customer's estimated salary.                 |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


Does Logistic Regression or Random Forest produce a higher accuracy score in predicting telecom churn in India?

- Load the two CSV files into separate DataFrames and merge them into a DataFrame named `churn_df`. Calculate and print the churn rate, and identify the categorical variables in `churn_df`.
- One-hot encode the categorical features in `churn_df`, then scale the features and store the result in `features_scaled`. Define the scaled features and the target variable for the churn prediction model.
- Split the processed data into training and testing sets named `X_train`, `X_test`, `y_train`, and `y_test`, using an 80-20 split and a random state of 42 for reproducibility.
- Train Logistic Regression and Random Forest Classifier models with a random seed of 42. Store the model predictions in `logreg_pred` and `rf_pred`.
- Assess the models on the test data. Assign the name of the model with the higher accuracy (`"LogisticRegression"` or `"RandomForest"`) to `higher_accuracy`.

In [14]:
# Import required libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
# OneHotEncoder is not needed if using pd.get_dummies()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
telco_demog = pd.read_csv('telecom_demographics.csv')
telco_usage = pd.read_csv('telecom_usage.csv')

telco_demog.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
dtypes: int64(5), object(5)
memory usage: 507.9+ KB


In [3]:
telco_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   customer_id  6500 non-null   int64
 1   calls_made   6500 non-null   int64
 2   sms_sent     6500 non-null   int64
 3   data_used    6500 non-null   int64
 4   churn        6500 non-null   int64
dtypes: int64(5)
memory usage: 254.0 KB


In [4]:
merged_df = pd.merge(telco_demog, telco_usage, on='customer_id')
merged_df.head()

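|   | customer_id | telecom_partner | gender | age | state | city | pincode | registration_event | num_dependents | estimated_salary | calls_made | sms_sent | data_used | churn |
|---|-------------|-----------------|--------|-----|-------|------|---------|--------------------|----------------|------------------|------------|----------|-----------|-------|
| 0 | 15169 | Airtel | F | 26 | Himachal Pradesh | Delhi | 667173 | 2020-03-16 | 4 | 85979 | 75 | 21 | 4532 | 1 |
| 1 | 149207 | Airtel | F | 74 | Uttarakhand | Hyderabad | 313997 | 2022-01-16 | 0 | 69445 | 35 | 38 | 723 | 1 |
| 2 | 148119 | Airtel | F | 54 | Jharkhand | Chennai | 549925 | 2022-01-11 | 2 | 75949 | 70 | 47 | 4688 | 1 |
| 3 | 187288 | Reliance Jio | M | 29 | Bihar | Hyderabad | 230636 | 2022-07-26 | 3 | 34272 | 95 | 32 | 10241 | 1 |
| 4 | 14016 | Vodafone | M | 45 | Nagaland | Bangalore | 188036 | 2020-03-11 | 4 | 34157 | 66 | 23 | 5246 | 1 |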
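
The merge above joins the two tables on `customer_id`. As a hedged sketch on made-up toy rows (the frames `demog` and `usage` are illustrative, not the notebook's data), passing `validate="one_to_one"` makes pandas raise an error if either side contains duplicate keys, which guards against silently duplicated customers:

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values invented for illustration).
demog = pd.DataFrame({"customer_id": [1, 2, 3], "gender": ["F", "M", "F"]})
usage = pd.DataFrame({"customer_id": [1, 2, 3], "churn": [1, 0, 0]})

# Inner join on the shared key; validate raises MergeError on duplicate keys.
merged = pd.merge(demog, usage, on="customer_id", how="inner", validate="one_to_one")

print(merged)
```

If the row count after merging matches the row count of both inputs, as it does in the notebook (6500 rows before and after), every customer was matched exactly once.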



### Question 1

Calculate and print churn rate, and identify the categorical variables in churn_df.

In [5]:
merged_df['churn'].value_counts()

churn
0    5197
1    1303
Name: count, dtype: int64

In [6]:
churn_rate = merged_df['churn'].value_counts() / len(merged_df)
churn_rate

churn
0    0.799538
1    0.200462
Name: count, dtype: float64
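
Because `churn` is coded as 0/1, the churn rate can also be read off as a single number: the column mean. A minimal sketch on a toy column (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in for the merged data: 1 = churned, 0 = not churned.
toy = pd.DataFrame({"churn": [0, 0, 0, 0, 1]})

# The mean of a 0/1 column is the share of 1s, i.e. the churn rate.
churn_rate = toy["churn"].mean()

# value_counts(normalize=True) gives the same shares, one per class.
shares = toy["churn"].value_counts(normalize=True)

print(f"Churn rate: {churn_rate:.1%}")
```

On the notebook's data the same calculation is 1303 / 6500 ≈ 0.2005, which agrees with the output above.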

### Question 2

One-hot encode the categorical features in churn_df, then scale the features and store the result in features_scaled. Define the scaled features and the target variable for the churn prediction model.

In [7]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   telecom_partner     6500 non-null   object
 2   gender              6500 non-null   object
 3   age                 6500 non-null   int64 
 4   state               6500 non-null   object
 5   city                6500 non-null   object
 6   pincode             6500 non-null   int64 
 7   registration_event  6500 non-null   object
 8   num_dependents      6500 non-null   int64 
 9   estimated_salary    6500 non-null   int64 
 10  calls_made          6500 non-null   int64 
 11  sms_sent            6500 non-null   int64 
 12  data_used           6500 non-null   int64 
 13  churn               6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 711.1+ KB


In [None]:
# Select the categorical columns
categorical_columns = merged_df.select_dtypes(include='object').columns.to_list()
categorical_columns

['telecom_partner', 'gender', 'state', 'city', 'registration_event']

In [11]:
# OneHot encoding
churn_df = pd.get_dummies(merged_df, columns=categorical_columns)
churn_df

,customer_id,age,pincode,num_dependents,estimated_salary,calls_made,sms_sent,data_used,churn,telecom_partner_Airtel,...,registration_event_2023-04-24,registration_event_2023-04-25,registration_event_2023-04-26,registration_event_2023-04-27,registration_event_2023-04-28,registration_event_2023-04-29,registration_event_2023-04-30,registration_event_2023-05-01,registration_event_2023-05-02,registration_event_2023-05-03
0,15169,26,667173,4,85979,75,21,4532,1,True,...,False,False,False,False,False,False,False,False,False,False
1,149207,74,313997,0,69445,35,38,723,1,True,...,False,False,False,False,False,False,False,False,False,False
2,148119,54,549925,2,75949,70,47,4688,1,True,...,False,False,False,False,False,False,False,False,False,False
3,187288,29,230636,3,34272,95,32,10241,1,False,...,False,False,False,False,False,False,False,False,False,False
4,14016,45,188036,4,34157,66,23,5246,1,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6495,78836,54,125785,4,124805,-2,39,5000,0,True,...,False,False,False,False,False,False,False,False,False,False
6496,146521,69,923076,1,65605,20,31,3562,0,False,...,False,False,False,False,False,False,False,False,False,False
6497,40413,19,152201,0,28632,73,14,65,0,True,...,False,False,False,False,False,False,False,False,False,False
6498,64961,26,782127,3,119757,52,8,6835,0,False,...,False,False,False,False,False,False,False,False,False,False
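
The output above shows one `registration_event_<date>` column per distinct date. A quick way to see how many columns one-hot encoding will create is to count the unique values per categorical column first. The sketch below uses an invented toy frame, not the notebook's data:

```python
import pandas as pd

# Toy frame (values invented): one low-cardinality and one date-like column.
toy = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "registration_event": ["2020-03-16", "2022-01-16", "2022-01-11", "2022-07-26"],
    "age": [26, 74, 54, 29],
})
categorical = ["gender", "registration_event"]

# Number of distinct values per column = number of dummy columns it produces.
cardinality = toy[categorical].nunique()

# One-hot encode: 'age' is kept, each category value becomes its own column.
encoded = pd.get_dummies(toy, columns=categorical)

print(cardinality)
print(encoded.shape)
```

A date stored as a string behaves like a category with as many levels as there are distinct dates, so it can dominate the feature count.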


### Question 3

Split the processed data into training and testing sets named X_train, X_test, y_train, and y_test, using an 80-20 split and a random state of 42 for reproducibility.

In [13]:
scaler = StandardScaler()
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)

target = churn_df['churn']


X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
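
The cell above scales all rows before splitting, as the task asks. A common variation, shown here only as a sketch on synthetic data and not as the notebook's method, is to split first and fit the scaler on the training rows alone, so that test-set statistics do not influence the scaling. The `stratify=y` argument is an additional assumption that keeps the class ratio equal in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and an 80/20 imbalanced target (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# Split first; stratify keeps the 80/20 class ratio in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```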

### Question 4

Train Logistic Regression and Random Forest Classifier models, setting a random seed of 42. Store model predictions in logreg_pred and rf_pred.

In [15]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)

In [16]:
print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))

[[911 116]
 [243  30]]
              precision    recall  f1-score   support

           0       0.79      0.89      0.84      1027
           1       0.21      0.11      0.14       273

    accuracy                           0.72      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.72      0.69      1300





**📌 Confusion matrix:**  
$$
\begin{bmatrix}
911 & 116 \\
243 & 30
\end{bmatrix}
$$
- **TP (True Positives, correctly predicted "1")** = 30  
- **TN (True Negatives, correctly predicted "0")** = 911  
- **FP (False Positives, wrongly predicted "1")** = 116  
- **FN (False Negatives, wrongly predicted "0")** = 243  

**📊 Metrics:**
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (Negative) | 0.79 | 0.89 | 0.84 | 1027 |
| 1 (Positive) | 0.21 | 0.11 | 0.14 | 273 |

🔹 **Accuracy:** **72%**  
🔹 **Precision for class 1:** **21%** → When the model predicts "1", it is right only 21% of the time.  
🔹 **Recall for class 1:** **11%** → It detects only 11% of all actual "1" cases, so it **identifies the positive class poorly**.  
🔹 **Macro avg F1-score:** **0.49** → The model is **lopsided**: it predicts "0" much better than "1".  

💡 **Conclusion:**  
Logistic regression predicts class "0" **far better** than class "1". The model **struggles to detect the positive class (1)** and often labels those cases as "0".
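
The metrics quoted for logistic regression follow directly from the four cells of the confusion matrix. A short worked sketch that recomputes them from the printed counts:

```python
import numpy as np

# Confusion matrix printed for logistic regression (rows: true, cols: predicted).
cm = np.array([[911, 116],
               [243, 30]])

# scikit-learn's layout for binary labels is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of all correct predictions
precision = tp / (tp + fp)               # how often a predicted "1" is right
recall = tp / (tp + fn)                  # share of actual "1" cases found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Rounded to two decimals, these reproduce the 0.72, 0.21, 0.11, and 0.14 from the classification report.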




In [17]:
forest = RandomForestClassifier(random_state=42)
forest.fit(X_train, y_train)
# Store the predictions as rf_pred, the name the task asks for
rf_pred = forest.predict(X_test)

In [19]:
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

[[1026    1]
 [ 273    0]]
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300





**📌 Confusion matrix:**  
$$
\begin{bmatrix}
1026 & 1 \\
273 & 0
\end{bmatrix}
$$
- **TP (True Positives, correctly predicted "1")** = **0** ❌  
- **TN (True Negatives, correctly predicted "0")** = **1026** ✅  
- **FP (False Positives, wrongly predicted "1")** = **1**  
- **FN (False Negatives, wrongly predicted "0")** = **273** ❌  

**📊 Metrics:**  
| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (Negative) | 0.79 | 1.00 | 0.88 | 1027 |
| 1 (Positive) | 0.00 | 0.00 | 0.00 | 273 |

🔹 **Accuracy:** **79%**  
🔹 **Precision for class 1:** **0%** → The model predicted "1" only once, and that prediction was **wrong** 😱  
🔹 **Recall for class 1:** **0%** → The model **does not detect a single actual "1"**.  
🔹 **Macro avg F1-score:** **0.44** → **Worse than logistic regression** (0.49).  

💡 **Conclusion:**  
- The model **effectively ignores class "1"**.  
- It predicts class "0" for **all but one** of the 1300 test cases.  
- It scores well on class "0" but **fails completely on class "1"**.  
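
To put the 79% accuracy in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below rebuilds label vectors from the class counts printed earlier (5197/1303 overall and 1027/273 in the test set, hence 4170/1030 in training) and scores scikit-learn's `DummyClassifier`; the all-zero feature matrices are placeholders, since this baseline ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label vectors rebuilt from the counts shown in the notebook outputs.
y_train = np.array([0] * 4170 + [1] * 1030)
y_test = np.array([0] * 1027 + [1] * 273)

# Placeholder features: the most_frequent strategy never looks at them.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Baseline that always predicts the most frequent training class ("0").
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Random Forest accuracy from its confusion matrix: (TN + TP) / total.
rf_acc = (1026 + 0) / 1300

print(f"baseline={baseline_acc:.4f} random_forest={rf_acc:.4f}")
```

With these counts the Random Forest does not beat the always-"0" baseline, which supports reading its accuracy with caution.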



## **🔎 Why does Random Forest behave this way?**  

1️⃣ **Class imbalance** (Imbalanced Data)  
- Class "0" makes up **~79% of the test samples**, while class "1" makes up **only ~21%**.  
- Random Forest can learn to **always guess "0"** and still reach 79% accuracy.  

2️⃣ **Default hyperparameter settings**  
- With default settings every sample carries equal weight, so splits chosen by the default **split criterion** (`gini`) can end up favoring the majority class.  
- Setting `class_weight='balanced'` may be needed to give class "1" more weight.  

3️⃣ **Overfitting**  
- Random Forest may be overfitted to the **majority class** (0) and ignore the minority class (1).  
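
As a hedged illustration of the `class_weight='balanced'` option mentioned above, the sketch below trains a weighted Random Forest on synthetic imbalanced data from `make_classification` (not the telecom data). Whether class weights help on the real dataset depends on how much signal the features carry, so treat this as a starting point rather than a fix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly an 80/20 class split (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" reweights samples inversely to class frequency.
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_balanced.fit(X_train, y_train)
rf_pred = rf_balanced.predict(X_test)

# Recall on the minority class is the metric the analysis above cares about.
print("Recall for class 1:", recall_score(y_test, rf_pred))
```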


### Question 5

Assess the models on the test data. Assign the name of the model with the higher accuracy ("LogisticRegression" or "RandomForest") to higher_accuracy.
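
In the notebook itself, `accuracy_score(y_test, logreg_pred)` and `accuracy_score(y_test, rf_pred)` would give the two numbers to compare. The self-contained sketch below instead rebuilds equivalent label arrays from the two confusion matrices printed above (the helper `labels_from_confusion` is hypothetical, written for this example) and derives `higher_accuracy` from the comparison rather than hardcoding it:

```python
import numpy as np
from sklearn.metrics import accuracy_score


def labels_from_confusion(cm):
    """Rebuild (y_true, y_pred) arrays that reproduce a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
    y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)
    return y_true, y_pred


# Confusion matrices as printed for the two models.
logreg_acc = accuracy_score(*labels_from_confusion([[911, 116], [243, 30]]))
rf_acc = accuracy_score(*labels_from_confusion([[1026, 1], [273, 0]]))

# Pick the model name by comparing the two test accuracies.
higher_accuracy = "RandomForest" if rf_acc > logreg_acc else "LogisticRegression"

print(f"logreg={logreg_acc:.4f} rf={rf_acc:.4f} -> {higher_accuracy}")
```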

# Solution

In [None]:
# Import required libraries and methods/functions
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder 
# OneHotEncoder is not needed if using pd.get_dummies()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load data
telco_demog = pd.read_csv('telecom_demographics.csv')
telco_usage = pd.read_csv('telecom_usage.csv')

# Join data
churn_df = telco_demog.merge(telco_usage, on='customer_id')

# Identify churn rate
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)

# Identify categorical variables (info() prints its report itself and returns None)
churn_df.info()

# One Hot Encoding for categorical variables
churn_df = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 'city', 'registration_event'])

# Feature Scaling
scaler = StandardScaler()

# 'customer_id' is not a feature
features = churn_df.drop(['customer_id', 'churn'], axis=1)
features_scaled = scaler.fit_transform(features)

# Target variable
target = churn_df['churn']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)

# Instantiate the Logistic Regression
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

# Logistic Regression predictions
logreg_pred = logreg.predict(X_test)

# Logistic Regression evaluation
print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))

# Instantiate the Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Random Forest predictions
rf_pred = rf.predict(X_test)

# Random Forest evaluation
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

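# Which accuracy score is higher? LogisticRegression or RandomForest
# The reports above show test accuracy of 0.79 for Random Forest vs 0.72 for Logistic Regression
higher_accuracy = "RandomForest"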