# Feature Engineering

This notebook creates new features and transforms existing ones in the dataset, in preparation for the future model.


We import libraries and review basic information.

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('../data/processed/cleaned_churn_data.csv', header=0)
df.head()
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7032 non-null   object 
 1   gender            7032 non-null   object 
 2   SeniorCitizen     7032 non-null   int64  
 3   Partner           7032 non-null   int64  
 4   Dependents        7032 non-null   int64  
 5   tenure            7032 non-null   int64  
 6   PhoneService      7032 non-null   int64  
 7   MultipleLines     7032 non-null   int64  
 8   InternetService   7032 non-null   object 
 9   OnlineSecurity    7032 non-null   int64  
 10  OnlineBackup      7032 non-null   int64  
 11  DeviceProtection  7032 non-null   int64  
 12  TechSupport       7032 non-null   int64  
 13  StreamingTV       7032 non-null   int64  
 14  StreamingMovies   7032 non-null   int64  
 15  Contract          7032 non-null   object 
 16  PaperlessBilling  7032 non-null   int64  


After reviewing the basic information and features, we build the following engineered features:

- **Tenure-based**.
- **Total service count**.
- **Charges ratio**.
- **Encoding categorical values**.

tenure_years is the number of years a customer has been with the company, calculated by dividing the 'tenure' 
in months by 12 and rounding to two decimal places. This feature helps in understanding customer loyalty and retention
patterns over time.

long_term_customer is a binary feature indicating whether a customer has been with the company for 2 years or more.
This feature is useful for identifying loyal customers who may have different churn behaviors compared to newer customers.

In [7]:
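# Tenure-based features.
# Convert tenure from months to years, rounded to two decimals.
df['tenure_years'] = (df['tenure'] / 12).round(2)

# Flag customers who have stayed for 2 years or more (1 = yes, 0 = no).
df['long_term_customer'] = (df['tenure_years'] >= 2).astype(int)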
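
To make the two tenure features concrete, here is a minimal sketch on a hypothetical toy frame (the `toy` name and its values are made up for illustration and are not taken from the churn data). It applies the same two expressions as the cell above.

```python
import pandas as pd

# Hypothetical toy data: tenure in months for four customers.
toy = pd.DataFrame({'tenure': [1, 12, 24, 71]})

# Same expressions as in the notebook cell.
toy['tenure_years'] = (toy['tenure'] / 12).round(2)
toy['long_term_customer'] = (toy['tenure_years'] >= 2).astype(int)

print(toy)
```

For whole-month tenures, the flag should match a direct check of `tenure >= 24`, so the rounding step does not move any customer across the 2-year threshold.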

total_service_count is the total number of services a customer has subscribed to.
This feature can help in understanding customer engagement and how the likelihood of churn varies with the number of services.

In [9]:
services = ['PhoneService','MultipleLines', 'OnlineSecurity','OnlineBackup','DeviceProtection',
              'TechSupport','StreamingTV','StreamingMovies']

df['total_service_count'] = df[services].sum(axis=1)
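
The row-wise sum only counts services if every column in the list is coded as 0 or 1. The sketch below checks that assumption on a hypothetical toy frame before summing (the `toy` data is invented, and the 0/1 coding is an assumption suggested by the int64 dtypes above, not something the output confirms).

```python
import pandas as pd

services = ['PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
            'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

# Hypothetical customers: one service, all services, no services.
toy = pd.DataFrame(
    [[1, 0, 0, 0, 0, 0, 0, 0],
     [1, 1, 1, 1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0, 0, 0, 0]],
    columns=services,
)

# Assumption check: every service flag is 0 or 1, so the sum is a count.
assert toy[services].isin([0, 1]).all().all()

# axis=1 sums across the columns of each row.
toy['total_service_count'] = toy[services].sum(axis=1)

print(toy['total_service_count'])
```

InternetService is left out of the list because it is a categorical (object) column at this point, not a 0/1 flag.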


charges_ratio is the current monthly charge divided by the total amount charged so far.
If total charges are roughly the monthly charge multiplied by the tenure, this ratio is close to 1 / tenure,
so high values mostly point to recent customers, who may show different churn behaviors.

average_monthly_charges is the average monthly amount a customer has paid over their tenure,
calculated by dividing the total charges by the tenure in months.

The difference between the two is that average_monthly_charges is a long-term average of monthly payments
over the entire tenure, expressed in currency units, while charges_ratio is a unitless ratio that compares
the current monthly charge to the accumulated total.

In [11]:
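# Current monthly charge relative to the accumulated total.
df['charges_ratio'] = (df['MonthlyCharges']/df['TotalCharges']).round(2)

# Average amount paid per month of tenure.
df['average_monthly_charges'] = (df['TotalCharges'] / df['tenure']).round(2)
# A tenure of 0 would produce inf or NaN; replace those values with 0.
df['average_monthly_charges'] = df['average_monthly_charges'].replace([np.inf, -np.inf], np.nan).fillna(0)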
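
A small worked example may help separate the two features. The sketch below uses hypothetical values and a hypothetical helper, `safe_ratio`, that applies the same inf/NaN handling to both divisions. Extending that guard to charges_ratio is an assumption on our part, since the cell above only guards average_monthly_charges.

```python
import numpy as np
import pandas as pd

# Hypothetical customers: brand new, 10 months with a price increase, zero tenure.
toy = pd.DataFrame({
    'tenure': [1, 10, 0],
    'MonthlyCharges': [50.0, 80.0, 20.0],
    'TotalCharges': [50.0, 760.0, 0.0],
})

def safe_ratio(numerator, denominator):
    """Divide two Series, round to 2 decimals, and map inf/NaN results to 0."""
    ratio = (numerator / denominator).round(2)
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

toy['charges_ratio'] = safe_ratio(toy['MonthlyCharges'], toy['TotalCharges'])
toy['average_monthly_charges'] = safe_ratio(toy['TotalCharges'], toy['tenure'])

print(toy)
```

The second customer currently pays 80 per month but has averaged 76 over 10 months, while the ratio of 0.11 is close to 1 / 10. Filling undefined values with 0 follows the notebook's choice. Keeping them as NaN would be an alternative if the downstream model can handle missing values.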


For categorical variables, we can use one-hot encoding to convert them into numerical features.
This allows us to represent categorical data in a format that can be provided to machine learning algorithms.

pd.get_dummies is a function in pandas that converts categorical variable(s) into dummy/indicator variables.
Indicator variables are binary (0 or 1) variables that represent the presence or absence of a category.

In [12]:
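# One-hot encode the multi-category columns; drop_first=True drops the first level of each to avoid a redundant column.
df = pd.get_dummies(df, columns=['Contract', 'PaymentMethod', 'InternetService'], drop_first=True)
# Encode gender as a single binary column (Male = 1, Female = 0).
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})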
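
To show what drop_first=True does, here is a minimal sketch on a hypothetical toy frame. The category labels are assumed for illustration and may not match the actual values in the data. The `dtype=int` argument is our addition to get 0/1 columns, because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical rows; the category labels are assumed for illustration.
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'gender': ['Male', 'Female', 'Male'],
})

# drop_first=True drops the first category in sorted order ('Month-to-month'),
# which becomes the baseline encoded as all zeros.
encoded = pd.get_dummies(toy, columns=['Contract'], drop_first=True, dtype=int)

# Values missing from the mapping would become NaN, so the mapping must cover all labels.
encoded['gender'] = encoded['gender'].map({'Male': 1, 'Female': 0})

print(encoded)
```

A customer on the baseline contract is identified by zeros in both remaining Contract columns, so no information is lost by dropping one level.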


Finally, we save the engineered dataset.

In [13]:
df.to_csv('../data/processed/engineered_churn_data.csv', index=False)