<a href="https://colab.research.google.com/github/kishon45229/Customer-Churn-Prediction-in-Telecom-Industry/blob/main/Data_preprocessing_part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Customer Churn Prediction in Telecom Industry

This repository focuses on predicting customer churn in the telecom industry. The data preprocessing section is key to ensuring the accuracy of the predictive model. It involves handling missing values, encoding categorical variables into numerical formats, and performing feature engineering to create and modify features that enhance model performance. Additionally, attribute subset selection techniques, such as Recursive Feature Elimination (RFE), are applied to identify the most relevant features. These preprocessing steps are essential for building a reliable churn prediction model, helping telecom companies proactively retain their customers.

ITBIN-2110-0031

TF.FATHIMA

# Data preprocessing part 2

# Import Necessary Libraries

In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import MinMaxScaler

# Add Dataset
This notebook continues from the previous section, Data preprocessing part 1, so I read the CSV file that was saved at the end of that part.

In [19]:
df = pd.read_csv('/Section 2 finished dataset.csv')

In [20]:
df.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_bin,MonthlyCharges_bin,Cluster,PCA1,PCA2
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,Yes,Electronic check,29.85,29.85,No,0-12,21-40,0,-2.186195,-0.654456
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,No,Mailed check,56.95,1889.5,No,25-36,41-60,2,-0.115181,0.583276
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,Yes,Mailed check,53.85,108.15,Yes,0-12,41-60,2,-1.311383,1.343579
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,No,Bank transfer (automatic),42.3,1840.75,No,37-48,41-60,0,-0.570548,-1.533997
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,Yes,Electronic check,70.7,151.65,Yes,0-12,61-80,2,-0.992743,1.557438


# Reduction
## 2. Cube Aggregation

Cube aggregation is a technique used in data warehousing and multidimensional databases to summarize and aggregate data along various dimensions. It involves creating a data cube, which is a multi-dimensional array of values, and then performing operations to summarize the data at different levels of granularity.

In the context of our dataset, cube aggregation helps summarize customer data by aggregating measures such as MonthlyCharges and TotalCharges along dimensions such as tenure_bin and Contract.

We performed the cube aggregation with groupby(), computing the mean, sum, and count of MonthlyCharges and TotalCharges for each combination of tenure_bin and Contract.

In [21]:
cube_aggregation = df.groupby(['tenure_bin', 'Contract'])[['MonthlyCharges', 'TotalCharges']].agg(['mean', 'sum', 'count'])

Next, we reset the index so that tenure_bin and Contract become ordinary columns, which makes the result easier to read.

In [22]:
cube_aggregation = cube_aggregation.reset_index()
cube_aggregation

Unnamed: 0,tenure_bin,Contract,MonthlyCharges (mean),MonthlyCharges (sum),MonthlyCharges (count),TotalCharges (mean),TotalCharges (sum),TotalCharges (count)
0,0-12,Month-to-month,58.217904,116086.5,1994,276.69343,551726.7,1994
1,0-12,One year,35.928455,4419.2,123,303.171545,37290.1,123
2,0-12,Two year,28.766379,1668.45,58,217.846552,12635.1,58
3,13-24,Month-to-month,69.309566,51081.15,737,1257.884193,927060.65,737
4,13-24,One year,44.87868,8841.1,197,863.225381,170055.4,197
5,13-24,Two year,32.306667,2907.6,90,624.129444,56171.65,90
6,25-36,Month-to-month,74.326235,36122.55,486,2238.959979,1088134.55,486
7,25-36,One year,58.0988,14524.7,250,1777.2444,444311.1,250
8,25-36,Two year,40.745313,3911.55,96,1285.418229,123400.15,96
9,37-48,Month-to-month,78.422468,24781.5,316,3308.503639,1045487.15,316


The output is a DataFrame of summary statistics for MonthlyCharges and TotalCharges for each combination of tenure_bin and Contract. It helps in understanding patterns such as:

- The mean values indicate the average monthly and total charges for customers in each tenure bin and contract type.
- The sum values give the total charges billed to customers in each category.
- The count values show how many customers fall into each category.
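
The idea of summarizing at different levels of granularity can be sketched with a pivot table that also carries roll-up totals. This is a minimal illustration on toy data, not the notebook's own cell; the `toy` frame and its values are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy data standing in for the churn dataset (values are illustrative only)
toy = pd.DataFrame({
    "tenure_bin": ["0-12", "0-12", "13-24", "13-24"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
    "MonthlyCharges": [50.0, 30.0, 70.0, 40.0],
})

# margins=True adds an "All" row and column that hold the roll-up
# of the measure over each dimension and over the whole table
cube = pd.pivot_table(
    toy,
    values="MonthlyCharges",
    index="tenure_bin",
    columns="Contract",
    aggfunc="mean",
    margins=True,
    margins_name="All",
)
print(cube)
```

Each inner cell is the finest level (one tenure bin and one contract type), while the "All" row and column give the coarser levels obtained by rolling up one dimension at a time.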

## 3. Attribute Subset Selection/Feature Selection
Attribute subset selection is the process in machine learning and data preprocessing of selecting a subset of relevant features (attributes) for building a model. The goal is to improve the model's performance by removing irrelevant, redundant, or noisy features.

We need to define the features (X) and the target (y). Every column except the target is used as a feature. We chose Churn as the target because churn prediction is a key business objective for this dataset. Churn refers to customers who stop using the company's services; predicting it helps the company identify customers at risk of leaving and take proactive measures to retain them.

In [23]:
df['Churn'] = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

X = df.drop(['Churn'], axis=1)
y = df['Churn']

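The code below converts any categorical variables into numerical dummy (one-hot) columns. With drop_first=True, the first category of each variable is dropped to avoid a redundant column.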

In [24]:
X = pd.get_dummies(X, drop_first=True)
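
The effect of `drop_first=True` can be seen on a small example. The sketch below uses a made-up three-row frame; the `customerID` values are hypothetical. It also drops the identifier before encoding, since one-hot encoding a unique ID would add one column per customer. Whether to do the same in the cell above is a judgment call this sketch does not settle.

```python
import pandas as pd

toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],  # hypothetical identifiers
    "Contract": ["Month-to-month", "One year", "Two year"],
    "tenure": [1, 34, 45],
})

# Drop the identifier first: encoding it would create a column per customer
features = toy.drop(columns=["customerID"])

# Numeric columns pass through; each categorical column becomes
# (number of categories - 1) indicator columns
encoded = pd.get_dummies(features, drop_first=True)
print(list(encoded.columns))
```

The dropped category ("Month-to-month" here) is implied when both remaining indicators are 0.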

Next, we applied a feature selection method to identify the most relevant features. The code below uses SelectKBest with the ANOVA F-test (f_classif) to keep the 20 highest-scoring features.

In [25]:
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)

We then retrieve the names of the selected features. The selected subset should still be evaluated against model performance before it is treated as final.

In [26]:
selected_features = X.columns[selector.get_support()]
selected_features

Index(['tenure', 'MonthlyCharges', 'TotalCharges', 'Cluster', 'PCA2',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_No internet service', 'OnlineSecurity_Yes',
       'OnlineBackup_No internet service',
       'DeviceProtection_No internet service',
       'TechSupport_No internet service', 'TechSupport_Yes',
       'StreamingTV_No internet service',
       'StreamingMovies_No internet service', 'Contract_One year',
       'Contract_Two year', 'PaperlessBilling_Yes',
       'PaymentMethod_Electronic check', 'tenure_bin_61-72'],
      dtype='object')

Based on the output, the column names listed above are the selected features.
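
To see why particular columns were chosen, the fitted selector's scores can be inspected. The sketch below runs on synthetic data from `make_classification` rather than the churn data, and the names `X_toy`, `y_toy`, and `selector_toy` are placeholders introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for X and y (not the churn data)
X_arr, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector_toy = SelectKBest(score_func=f_classif, k=3).fit(X_toy, y_toy)

# scores_ and pvalues_ hold the F-statistic and p-value computed per column
ranking = pd.DataFrame(
    {"F": selector_toy.scores_, "p_value": selector_toy.pvalues_},
    index=X_toy.columns,
).sort_values("F", ascending=False)

print(ranking)
print(list(X_toy.columns[selector_toy.get_support()]))
```

The selected columns are the top `k` rows of the ranking. Note that the F-test scores each feature on its own, so strongly correlated columns can all be selected together.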

We created a new DataFrame, df_selected, from the selected features and added the target column Churn to it.

In [27]:
df_selected = pd.DataFrame(X_selected, columns=selected_features)
df_selected['Churn'] = y.values
df_selected

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn
0,1.0,29.85,29.85,0.0,-0.654456,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0
1,34.0,56.95,1889.50,2.0,0.583276,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
2,2.0,53.85,108.15,2.0,1.343579,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3,45.0,42.30,1840.75,0.0,-1.533997,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
4,2.0,70.70,151.65,2.0,1.557438,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7027,24.0,84.80,1990.50,2.0,1.131844,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0
7028,72.0,103.20,7362.90,1.0,-0.941804,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0
7029,11.0,29.60,346.45,0.0,-0.885263,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0
7030,4.0,74.40,306.60,2.0,2.249267,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1


The new DataFrame offers benefits such as:

- Focusing on the most relevant features, so the model can make more accurate predictions
- Simplifying the model, making it easier to interpret and faster to train
- Reducing the risk of overfitting, giving better generalization on new data
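
The introduction mentions Recursive Feature Elimination (RFE), while the cells above use SelectKBest. For comparison, here is a minimal RFE sketch on synthetic data. The choice of LogisticRegression as the estimator and of three features to keep are assumptions made for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the feature matrix and target (not the churn data)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# RFE repeatedly fits the estimator and removes the weakest feature
# until only n_features_to_select remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_toy, y_toy)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```

Unlike the univariate F-test, RFE judges features jointly through the fitted model, at the cost of refitting the estimator many times.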

## 4. Numerosity Reduction using Sampling

Numerosity reduction is a technique that reduces the size of the dataset while maintaining its statistical properties. This can improve computational efficiency and make the data easier to manage.

As a first step, we need to identify the size and structure of the dataset and determine the percentage or number of samples to retain.

To get basic information about the dataset, we used info() and describe().

In [28]:
df_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 21 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   tenure                                7032 non-null   float64
 1   MonthlyCharges                        7032 non-null   float64
 2   TotalCharges                          7032 non-null   float64
 3   Cluster                               7032 non-null   float64
 4   PCA2                                  7032 non-null   float64
 5   InternetService_Fiber optic           7032 non-null   float64
 6   InternetService_No                    7032 non-null   float64
 7   OnlineSecurity_No internet service    7032 non-null   float64
 8   OnlineSecurity_Yes                    7032 non-null   float64
 9   OnlineBackup_No internet service      7032 non-null   float64
 10  DeviceProtection_No internet service  7032 non-null   float64
 11  TechSupport_No in

In [29]:
df_selected.describe()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn
count,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,...,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0,7032.0
mean,32.421786,64.798208,2283.300441,1.075796,-6.769961e-17,0.440273,0.216155,0.216155,0.286547,0.216155,...,0.216155,0.290102,0.216155,0.216155,0.209329,0.239619,0.592719,0.33632,0.200085,0.265785
std,24.54526,30.085974,2266.771362,0.825786,1.273071,0.496455,0.41165,0.41165,0.45218,0.41165,...,0.41165,0.453842,0.41165,0.41165,0.406858,0.426881,0.491363,0.472483,0.400092,0.441782
min,1.0,18.25,18.8,0.0,-2.3341,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9.0,35.5875,401.45,0.0,-0.9724254,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,29.0,70.35,1397.475,1.0,-0.3524832,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,55.0,89.8625,3794.7375,2.0,1.255467,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
max,72.0,118.75,8684.8,2.0,2.673849,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Based on the output of info() and describe(), we decided to keep 20% of the dataset as the sample.

In [30]:
sample_fraction = 0.2

Then we performed random sampling.

In [31]:
df_sampled = df_selected.sample(frac=sample_fraction, random_state=42)
df_sampled

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn
2476,61.0,25.00,1501.75,0.0,-1.344356,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0
6773,19.0,24.70,465.85,0.0,-1.118123,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0
6116,13.0,102.25,1359.00,2.0,1.634487,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3047,37.0,55.05,2030.75,0.0,-1.228665,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4092,6.0,29.45,161.45,0.0,-0.770908,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1423,1.0,50.45,50.45,2.0,1.324805,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1
1728,1.0,19.05,19.05,0.0,-0.793133,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0
5241,12.0,94.55,1173.55,2.0,1.572096,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
5456,26.0,56.05,1553.20,2.0,0.761492,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0


The output shows the random sample drawn from the full dataset.

You can check the size and summary statistics of the sample with the shape attribute and describe().

In [32]:
df_sampled.shape

(1406, 21)

Based on the output, the sample contains 1406 rows and 21 columns.

In [33]:
df_sampled.describe()

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn
count,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,...,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0,1406.0
mean,32.739687,65.229623,2315.607006,1.093883,0.016562,0.443812,0.203414,0.203414,0.283073,0.203414,...,0.203414,0.291607,0.203414,0.203414,0.199858,0.239687,0.598862,0.329303,0.200569,0.266003
std,24.569818,29.730212,2274.809905,0.821251,1.270084,0.49701,0.402681,0.402681,0.450652,0.402681,...,0.402681,0.454664,0.402681,0.402681,0.400036,0.427044,0.490303,0.470127,0.400569,0.442023
min,1.0,18.7,18.8,0.0,-2.3341,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,9.25,39.9625,416.75,0.0,-0.971426,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,29.0,70.35,1424.75,1.0,-0.290471,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,56.0,89.85,3882.4875,2.0,1.264018,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
max,72.0,118.6,8670.1,2.0,2.665985,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The summary statistics are close to those of the full dataset (for example, a mean tenure of 32.74 versus 32.42), so the sample is a reduced, yet informative, subset of the data that makes further analysis and modeling more efficient.
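
Simple random sampling keeps the class balance only approximately. If the churn ratio must be preserved exactly, one option is to sample within each class. The sketch below uses a toy frame whose 26.5% churn rate is chosen to resemble the rate reported by describe(); the frame itself and the name `stratified` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame with a 26.5% churn rate (illustrative values only)
toy = pd.DataFrame({
    "tenure": np.arange(1000) % 72 + 1,
    "Churn": [1] * 265 + [0] * 735,
})

# Draw 20% of the rows within each Churn class so the class ratio is preserved
stratified = toy.groupby("Churn", group_keys=False).sample(frac=0.2, random_state=42)

print(stratified.shape)
print(toy["Churn"].mean(), stratified["Churn"].mean())
```

Comparing the two means printed at the end is a quick check that the sample's churn rate matches the full data.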

# Transformation
## 1. Normalization

Normalization scales the data to a standard range, typically [0, 1].

To normalize the data, first initialize MinMaxScaler() and apply it to the selected features (every column except Churn).

In [36]:
scaler = MinMaxScaler()

normalized_features = scaler.fit_transform(df_sampled.drop(columns=['Churn']))
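
Min-max scaling maps each value x to (x - min) / (max - min) per column. As a worked check, the sketch below takes the sample's MonthlyCharges minimum (18.7) and maximum (118.6) from the describe() output and the first sampled customer's value (25.00), and compares a manual calculation against MinMaxScaler. The three-value array is an illustrative stand-in for the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# MonthlyCharges: sample minimum, one customer's value, sample maximum
charges = np.array([[18.7], [25.0], [118.6]])

# Manual min-max formula, applied to the single column
manual = (charges - charges.min()) / (charges.max() - charges.min())

# The same transformation done by scikit-learn
scaled = MinMaxScaler().fit_transform(charges)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The middle value comes out at roughly 0.0631, which matches the normalized MonthlyCharges shown for the first row of the sample.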

We created a DataFrame with the normalized features and added the target variable back to it.

In [37]:
df_normalized = pd.DataFrame(normalized_features, columns=df_sampled.columns[:-1])
df_normalized['Churn'] = df_sampled['Churn'].values
df_normalized

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn
0,0.845070,0.063063,0.171414,0.0,0.197946,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0
1,0.253521,0.060060,0.051674,0.0,0.243191,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,0.169014,0.836336,0.154913,1.0,0.793704,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3,0.507042,0.363864,0.232560,0.0,0.221083,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4,0.070423,0.107608,0.016489,0.0,0.312633,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1401,0.000000,0.317818,0.003658,1.0,0.731769,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1
1402,0.000000,0.003504,0.000029,0.0,0.308188,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0
1403,0.154930,0.759259,0.133477,1.0,0.781226,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1404,0.352113,0.373874,0.177361,1.0,0.619108,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0


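## 2. Attribute Selection and Deriving New Attributes

We derived a new attribute that might be useful: MonthlyTenure, the ratio of TotalCharges to MonthlyCharges. Because the ratio is computed here on the normalized columns, it is a ratio of scaled values rather than a tenure in months. Rows where the scaled MonthlyCharges is 0 produce infinite or undefined ratios, which we replace with 0.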

In [40]:
df_normalized['MonthlyTenure'] = df_normalized['TotalCharges'] / df_normalized['MonthlyCharges']

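# Division by a scaled MonthlyCharges of 0 gives inf or NaN; replace those with 0.
# Assigning the result back avoids chained in-place modification of a column.
df_normalized['MonthlyTenure'] = (
    df_normalized['MonthlyTenure'].replace([np.inf, -np.inf], np.nan).fillna(0)
)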

df_normalized

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,TechSupport_Yes,StreamingTV_No internet service,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn,MonthlyTenure
0,0.845070,0.063063,0.171414,0.0,0.197946,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0,2.718129
1,0.253521,0.060060,0.051674,0.0,0.243191,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0.860377
2,0.169014,0.836336,0.154913,1.0,0.793704,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,0.185228
3,0.507042,0.363864,0.232560,0.0,0.221083,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0.639141
4,0.070423,0.107608,0.016489,0.0,0.312633,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.153231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1401,0.000000,0.317818,0.003658,1.0,0.731769,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1,0.011511
1402,0.000000,0.003504,0.000029,0.0,0.308188,0.0,1.0,1.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0.008248
1403,0.154930,0.759259,0.133477,1.0,0.781226,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0,0.175799
1404,0.352113,0.373874,0.177361,1.0,0.619108,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0,0.474386


As the output shows, we kept the selected features and derived a new one called MonthlyTenure.
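
For comparison, the same ratio computed on raw (unscaled) charges is directly interpretable as an approximate number of months. The sketch below uses the raw values of the first two rows shown by df.head(); the `raw` frame and the zero guard are illustrative choices, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Raw (unscaled) values of the first two rows shown by df.head()
raw = pd.DataFrame({
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": [29.85, 1889.5],
})

# Turn zero denominators into NaN so the division never yields inf,
# then fill the undefined ratios with 0
denominator = raw["MonthlyCharges"].replace(0, np.nan)
raw["MonthlyTenure"] = (raw["TotalCharges"] / denominator).fillna(0)

print(raw)
```

On these two rows the ratio lands close to the recorded tenure (1 and 34 months), which is the relationship the derived attribute is meant to capture.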

## 3. Discretization

Discretization replaces raw continuous values with interval labels. We discretized the MonthlyCharges and TotalCharges columns into five equal-width bins with pd.cut().

In [41]:
df_normalized['MonthlyChargesBinned'] = pd.cut(df_normalized['MonthlyCharges'], bins=5, labels=False)
df_normalized['TotalChargesBinned'] = pd.cut(df_normalized['TotalCharges'], bins=5, labels=False)

df_normalized[['MonthlyCharges', 'MonthlyChargesBinned', 'TotalCharges', 'TotalChargesBinned']]

Unnamed: 0,MonthlyCharges,MonthlyChargesBinned,TotalCharges,TotalChargesBinned
0,0.063063,0,0.171414,0
1,0.060060,0,0.051674,0
2,0.836336,4,0.154913,0
3,0.363864,1,0.232560,1
4,0.107608,0,0.016489,0
...,...,...,...,...
1401,0.317818,1,0.003658,0
1402,0.003504,0,0.000029,0
1403,0.759259,3,0.133477,0
1404,0.373874,1,0.177361,0


The output shows that the continuous values were converted into interval indices from 0 to 4.
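
To see where the bin boundaries fall, pd.cut can also return the edges it computed. The sketch below reuses a few normalized MonthlyCharges values from the output above and adds 0.0 and 1.0 so that the range matches a fully normalized column; the `values` series is otherwise an illustrative stand-in.

```python
import pandas as pd

# Normalized values from the output above, plus 0.0 and 1.0 to pin the range
values = pd.Series([0.0, 0.063063, 0.363864, 0.759259, 0.836336, 1.0])

# retbins=True also returns the bin edges chosen by pd.cut
codes, edges = pd.cut(values, bins=5, labels=False, retbins=True)

print(edges)           # six edges delimiting five equal-width intervals
print(codes.tolist())  # bin index (0 to 4) for each value
```

Equal-width bins can be very unevenly populated when a column is skewed, as TotalCharges is here. Quantile-based binning with pd.qcut is a common alternative when balanced bins matter.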

## 4. Concept Hierarchy Generation

For concept hierarchy generation, we converted low-level attributes into higher-level ones by mapping MonthlyChargesBinned and TotalChargesBinned to named categories.

In [42]:
concept_hierarchy = {
    0: 'Very Low',
    1: 'Low',
    2: 'Medium',
    3: 'High',
    4: 'Very High'
}

df_normalized['MonthlyChargesCategory'] = df_normalized['MonthlyChargesBinned'].map(concept_hierarchy)
df_normalized['TotalChargesCategory'] = df_normalized['TotalChargesBinned'].map(concept_hierarchy)

df_normalized[['MonthlyChargesBinned', 'MonthlyChargesCategory', 'TotalChargesBinned', 'TotalChargesCategory']]

Unnamed: 0,MonthlyChargesBinned,MonthlyChargesCategory,TotalChargesBinned,TotalChargesCategory
0,0,Very Low,0,Very Low
1,0,Very Low,0,Very Low
2,4,Very High,0,Very Low
3,1,Low,1,Low
4,0,Very Low,0,Very Low
...,...,...,...,...
1401,1,Low,0,Very Low
1402,0,Very Low,0,Very Low
1403,3,High,0,Very Low
1404,1,Low,0,Very Low
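
As a side note, binning and naming can be combined in one step by passing the labels to pd.cut. This is a sketch of that alternative on illustrative values, not the approach used in the cells above.

```python
import pandas as pd

levels = ["Very Low", "Low", "Medium", "High", "Very High"]
values = pd.Series([0.0, 0.063063, 0.363864, 0.759259, 0.836336, 1.0])

# Passing labels to pd.cut yields an ordered categorical in one step
category = pd.cut(values, bins=5, labels=levels)

print(category.tolist())
print(category.cat.ordered)
```

Because the result is an ordered categorical, comparisons such as "at least High" remain meaningful without keeping a separate integer column.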


Based on the output, the binned values were converted into higher-level conceptual categories. However, we do not use the categorical columns MonthlyChargesCategory and TotalChargesCategory in later steps, so we removed them from the DataFrame.

In [43]:
df_normalized = df_normalized.drop(['MonthlyChargesCategory', 'TotalChargesCategory'], axis=1)
df_normalized

Unnamed: 0,tenure,MonthlyCharges,TotalCharges,Cluster,PCA2,InternetService_Fiber optic,InternetService_No,OnlineSecurity_No internet service,OnlineSecurity_Yes,OnlineBackup_No internet service,...,StreamingMovies_No internet service,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Electronic check,tenure_bin_61-72,Churn,MonthlyTenure,MonthlyChargesBinned,TotalChargesBinned
0,0.845070,0.063063,0.171414,0.0,0.197946,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0,2.718129,0,0
1,0.253521,0.060060,0.051674,0.0,0.243191,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0,0.860377,0,0
2,0.169014,0.836336,0.154913,1.0,0.793704,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1,0.185228,4,0
3,0.507042,0.363864,0.232560,0.0,0.221083,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0,0.639141,1,1
4,0.070423,0.107608,0.016489,0.0,0.312633,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.153231,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1401,0.000000,0.317818,0.003658,1.0,0.731769,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1,0.011511,1,0
1402,0.000000,0.003504,0.000029,0.0,0.308188,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0,0.008248,0,0
1403,0.154930,0.759259,0.133477,1.0,0.781226,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0,0.175799,3,0
1404,0.352113,0.373874,0.177361,1.0,0.619108,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0,0.474386,1,0


Up to this point, we have completed the data preprocessing, and the final DataFrame df_normalized is ready for modeling.

In [44]:
# Save the finalized, preprocessed DataFrame (df_normalized), not the original df
df_normalized.to_csv('Section 3 finished dataset.csv', index=False)

# Conclusion

The dataset underwent essential preprocessing steps, including handling missing values, normalizing features, and performing attribute subset selection. These steps ensured the dataset was well prepared for modeling, reducing data dimensionality while retaining important information. By cleaning and transforming the data effectively, the foundation is set for building a robust model that can accurately predict customer churn in the telecom industry.

# Next Steps in Data Mining

With the data preprocessing and feature selection complete, the next steps in the data mining process involve building and training predictive models. This stage focuses on applying machine learning algorithms to the cleaned and prepared dataset to uncover patterns and make predictions.