The telecommunications sector in India is evolving rapidly, with new businesses entering the market and customers frequently switching between providers. **Churn** refers to customers ceasing to use a company's services or products. Predicting and minimizing churn is a key challenge for telecom companies, because retaining customers is critical to business growth.

As a data scientist for a telecom company, your task is to **predict customer churn** using demographic and usage data from four major telecom partners: *Airtel, Reliance Jio, Vodafone, and BSNL*. You will explore the key factors contributing to customer churn and build predictive models to help the company reduce churn rates.

The dataset contains two CSV files:

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                 |
|----------------------|-------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                        |
| `telecom_partner`    | The telecom partner associated with the customer.           |
| `gender`             | The gender of the customer.                                 |
| `age`                | The age of the customer.                                    |
| `state`              | The Indian state in which the customer is located.          |
| `city`               | The city in which the customer is located.                  |
| `pincode`            | The pincode of the customer's location.                     |
| `registration_event` | When the customer registered with the telecom partner.      |
| `num_dependents`     | The number of dependents (e.g., children) the customer has. |
| `estimated_salary`   | The customer's estimated salary.                            |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [138]:
# Useful libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn import tree


In [139]:
# Load both CSV files and merge them on the shared customer_id column

telecom_demographics = pd.read_csv('telecom_demographics.csv')
telecom_usage = pd.read_csv('telecom_usage.csv')

merged_data = pd.merge(telecom_demographics, telecom_usage, on='customer_id')

print(merged_data)

      customer_id telecom_partner gender  age             state       city  \
0           15169          Airtel      F   26  Himachal Pradesh      Delhi   
1          149207          Airtel      F   74       Uttarakhand  Hyderabad   
2          148119          Airtel      F   54         Jharkhand    Chennai   
3          187288    Reliance Jio      M   29             Bihar  Hyderabad   
4           14016        Vodafone      M   45          Nagaland  Bangalore   
...           ...             ...    ...  ...               ...        ...   
6495        78836          Airtel      M   54            Odisha    Chennai   
6496       146521            BSNL      M   69    Andhra Pradesh  Hyderabad   
6497        40413          Airtel      M   19           Gujarat  Hyderabad   
6498        64961        Vodafone      M   26         Meghalaya    Chennai   
6499        60427    Reliance Jio      M   19           Mizoram  Hyderabad   

      pincode registration_event  num_dependents  estimated_sal
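
`pd.merge` defaults to an inner join, so customers present in only one file are dropped silently, and duplicated keys would multiply rows. The sketch below shows one way to check both conditions before relying on the merged frame. It runs on small made-up frames (`demographics`, `usage`, and their values are illustrative stand-ins, not the real CSV contents); the `validate` and `indicator` arguments are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are made up for illustration).
demographics = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "telecom_partner": ["Airtel", "BSNL", "Vodafone"],
})
usage = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "calls_made": [75, 35, 70],
    "churn": [0, 1, 0],
})

# An outer join with indicator=True adds a `_merge` column that records
# whether each customer_id appeared in both frames or in only one of them.
# validate="one_to_one" raises MergeError if customer_id is duplicated.
checked = pd.merge(
    demographics, usage,
    on="customer_id", how="outer",
    validate="one_to_one", indicator=True,
)
match_counts = checked["_merge"].value_counts()
print(match_counts)

# The default inner join keeps only customers found in both frames.
inner = pd.merge(demographics, usage, on="customer_id", validate="one_to_one")
print(len(inner))
```

If every row is reported as `both`, the inner join used in the cell above loses no customers.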

In [140]:
# Drop customer_id, pincode, and registration_event
# Count the missing values in each remaining column

columns_to_drop = ['customer_id', 'pincode', 'registration_event']
merged_data = merged_data.drop(columns=columns_to_drop)

missing_values_per_column = merged_data.isnull().sum()
print(missing_values_per_column)



telecom_partner     0
gender              0
age                 0
state               0
city                0
num_dependents      0
estimated_salary    0
calls_made          0
sms_sent            0
data_used           0
churn               0
dtype: int64
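
`isnull()` only detects missing entries; it does not flag values that are present but implausible. The encoded output further down shows a `calls_made` value of `-2`, for example. A minimal sketch of a range check on the usage columns follows, using a made-up frame named `toy`. Dropping the affected rows is only one possible treatment and is an assumption here, not something the notebook does.

```python
import pandas as pd

# Toy frame with one negative call count (values are illustrative).
toy = pd.DataFrame({
    "calls_made": [75, 35, -2, 95],
    "sms_sent": [21, 38, 39, 32],
    "data_used": [4532, 723, 5000, 10241],
})
usage_columns = ["calls_made", "sms_sent", "data_used"]

# Count negative entries per usage column; counts cannot be below zero.
negative_counts = (toy[usage_columns] < 0).sum()
print(negative_counts)

# One option among several: keep only rows where every usage value is >= 0.
cleaned = toy[(toy[usage_columns] >= 0).all(axis=1)]
print(len(cleaned))
```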


In [141]:
# Encode the categorical variables in two ways: one-hot and ordinal

categorical_variables = ['telecom_partner', 'city', 'state', 'gender']

x = merged_data[categorical_variables]
y = merged_data['churn']

# One-hot encoding: one binary column per category; drop='first' removes the
# first category of each variable to avoid redundant columns
ohe = OneHotEncoder(sparse_output=False, drop='first')
x_ohe = ohe.fit_transform(x)
ohe_df = pd.DataFrame(x_ohe, columns=ohe.get_feature_names_out(categorical_variables))

# All non-categorical columns; note that this still includes the churn target
numerical_features = merged_data.drop(categorical_variables, axis=1).reset_index(drop=True)
merged_data_ohe = pd.concat([numerical_features, ohe_df], axis=1)

print(merged_data_ohe)

# Ordinal encoding: each category is replaced by an integer code
ord_enc = OrdinalEncoder()
x_ord_enc = ord_enc.fit_transform(x)
ordinal_df = pd.DataFrame(x_ord_enc, columns=categorical_variables)

merged_data_ordinal = pd.concat([numerical_features, ordinal_df], axis=1)

print(merged_data_ordinal)



      age  num_dependents  estimated_salary  calls_made  sms_sent  data_used  \
0      26               4             85979          75        21       4532   
1      74               0             69445          35        38        723   
2      54               2             75949          70        47       4688   
3      29               3             34272          95        32      10241   
4      45               4             34157          66        23       5246   
...   ...             ...               ...         ...       ...        ...   
6495   54               4            124805          -2        39       5000   
6496   69               1             65605          20        31       3562   
6497   19               0             28632          73        14         65   
6498   26               3            119757          52         8       6835   
6499   19               3            119368          90        12       5531   

      churn  telecom_partner_BSNL  tele