# **Feature Engineering & Preprocessing (Churn Project)**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('/content/projectk.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


# **STEP 1 — Data Cleaning & Type Fixes**

**(1) Convert TotalCharges to numeric (handle blanks) ?**

In [4]:
# errors='coerce' turns any unparseable text (such as blank strings) into NaN (missing values)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
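
Before relying on the converted column, it can help to count how many entries failed to parse. The sketch below is a minimal illustration on a made-up frame; `toy`, `converted`, and `n_coerced` are hypothetical names, and the values are not taken from the project file.

```python
import pandas as pd

# Toy stand-in for the real data: the middle entry is a blank string.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Entries that were present as text but became NaN during conversion.
n_coerced = int(converted.isna().sum() - toy["TotalCharges"].isna().sum())
print("Values coerced to NaN:", n_coerced)
```

Running the same count on the real column shows how many rows the later missing-value step has to deal with.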

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


**(2) Drop duplicates (if any) ?**

In [6]:
# Drop duplicate rows from the dataset
df.drop_duplicates(inplace=True)
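
To answer the "if any" part of the question, the duplicates can be counted before they are dropped. A minimal sketch on a made-up frame (the `toy` data and variable names are assumptions for illustration only):

```python
import pandas as pd

# Toy frame in which the third row repeats the first one exactly.
toy = pd.DataFrame({"customerID": ["A1", "B2", "A1"], "tenure": [1, 34, 1]})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_dupes} duplicate row(s); {len(deduped)} remain.")
```

In this notebook the row count is still 7043 after the drop, which suggests the dataset had no fully duplicated rows.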

**(3) Handle missing values (strategy: drop few rows or impute) ?**

In [7]:
# Fill missing TotalCharges with 0 (assuming these are new customers with tenure = 0)
# Assigning the result back avoids the pandas chained-assignment warning raised by
# df['TotalCharges'].fillna(0, inplace=True), which will stop working in pandas 3.0
df['TotalCharges'] = df['TotalCharges'].fillna(0)
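
Filling with 0 rests on the assumption that every missing TotalCharges belongs to a customer with tenure 0. The sketch below shows one way to check that assumption before filling; it uses a made-up frame, and `toy`, `missing`, and `assumption_holds` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy data: both rows with a missing total have zero tenure.
toy = pd.DataFrame({
    "tenure": [0, 5, 0, 12],
    "TotalCharges": [np.nan, 250.0, np.nan, 900.0],
})

missing = toy["TotalCharges"].isna()

# True only if every row with a missing total also has tenure == 0.
assumption_holds = bool((toy.loc[missing, "tenure"] == 0).all())

if assumption_holds:
    # Assign back to the column instead of using inplace=True on a selection.
    toy["TotalCharges"] = toy["TotalCharges"].fillna(0)

print(assumption_holds, toy["TotalCharges"].tolist())
```

If the check came back False on the real data, dropping or imputing those rows differently would be worth considering.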


**(4) Standardize target: Churn → binary (Yes=1, No=0) ?**

In [8]:
# Machine learning models work with numbers (1/0), not text (Yes/No)
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})

print(df.info())
print(df['Churn'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
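
One thing to watch with `map` and a dictionary: any label that is not a key in the dictionary silently becomes NaN. The sketch below shows a quick way to surface such labels. The `toy` series and its lowercase "no" entry are invented for illustration and are not known to occur in the project data.

```python
import pandas as pd

# Toy target with one inconsistently cased label.
toy = pd.Series(["Yes", "No", "no", "Yes"], name="Churn")

mapped = toy.map({"Yes": 1, "No": 0})

# Original labels that did not match any dictionary key.
unmapped = toy[mapped.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

An empty list means every label was converted and the target is safe to use.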


In [9]:
df.shape

(7043, 21)

In [10]:
df.describe()

       SeniorCitizen     tenure  MonthlyCharges  TotalCharges     Churn
count         7043.0     7043.0          7043.0        7043.0    7043.0
mean        0.162147  32.371149       64.761692   2279.734304   0.26537
std         0.368612  24.559481       30.090047    2266.79447  0.441561
min              0.0        0.0           18.25           0.0       0.0
25%              0.0        9.0            35.5        398.55       0.0
50%              0.0       29.0           70.35       1394.55       0.0
75%              0.0       55.0           89.85        3786.6       1.0
max              1.0       72.0          118.75        8684.8       1.0


# **STEP 2 — Feature Engineering**

**(1) Tenure buckets (0–12, 13–24, 25–48, 49+) ?**

In [11]:
# Group tenure (in months) into buckets: 0-12, 13-24, 25-48, 49+

def tenure_group(t):
    if t <= 12:
        return '0-12'
    elif t <= 24:
        return '13-24'
    elif t <= 48:
        return '25-48'
    else:
        return '49+'

df['Tenure_Grouping'] = df['tenure'].apply(tenure_group)
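
The same buckets can also be produced without a row-by-row function. The sketch below uses `pd.cut` as an alternative, not as the notebook's method; the `tenure` sample values and the `buckets` name are assumptions chosen to hit every bucket edge.

```python
import numpy as np
import pandas as pd

# Sample tenures placed on both sides of each bucket boundary.
tenure = pd.Series([0, 12, 13, 24, 25, 48, 49, 72])

# Bins are right-closed by default: (-1, 12], (12, 24], (24, 48], (48, inf).
buckets = pd.cut(
    tenure,
    bins=[-1, 12, 24, 48, np.inf],
    labels=["0-12", "13-24", "25-48", "49+"],
)

print(buckets.astype(str).tolist())
# ['0-12', '0-12', '13-24', '13-24', '25-48', '25-48', '49+', '49+']
```

The lower edge is set to -1 so that a tenure of 0 falls inside the first bucket. `pd.cut` returns an ordered categorical, which keeps the bucket order if the column is later sorted or plotted.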

**(2) HighMonthlyChargeFlag (MonthlyCharges > median) ?**

In [12]:
# Calculate the median (middle value) first
median_val = df['MonthlyCharges'].median()

# Create a new column: 1 if the charge is above the median, else 0
df['HighMonthlyChargeFlag'] = (df['MonthlyCharges'] > median_val).astype(int)
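
Because the comparison is a strict `>`, a customer whose charge equals the median is flagged 0, not 1. A minimal sketch with made-up charges (`toy_charges`, `toy_median`, and `toy_flag` are hypothetical names):

```python
import pandas as pd

# Five toy charges; the middle one is exactly the median.
toy_charges = pd.Series([20.0, 50.0, 70.0, 90.0, 110.0])

toy_median = toy_charges.median()
toy_flag = (toy_charges > toy_median).astype(int)

print(toy_median, toy_flag.tolist())
# 70.0 [0, 0, 0, 1, 1]
```

Switching to `>=` would move the median customer into the "high" group, so the choice is worth stating whenever the flag is interpreted.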

**(3) ContractLengthScore (Month-to-month < One year < Two year) ?**

In [13]:
# Rank contracts by length: Month-to-month is lowest (1), Two year is highest (3)
contract_mapping = {'Month-to-month': 1, 'One year': 2, 'Two year': 3}
df['ContractLenghtScore'] = df['Contract'].map(contract_mapping)

**(4) Drop leakage/unhelpful columns: customerID ?**

In [14]:
# customerID is unique to each customer and carries no predictive signal for churn
df.drop(columns=['customerID'], inplace=True, errors='ignore')

In [15]:
# verify the new columns
print(df[['Tenure_Grouping', 'HighMonthlyChargeFlag', 'ContractLenghtScore']].head())

  Tenure_Grouping  HighMonthlyChargeFlag  ContractLenghtScore
0            0-12                      0                    1
1           25-48                      0                    2
2            0-12                      0                    1
3           25-48                      0                    2
4            0-12                      1                    1


# **STEP 3 — Encoding (Model-Friendly)**

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

**(1) One-hot encode categorical features: Contract, InternetService, PaymentMethod, etc ?**

In [17]:
# Identify Categorical Columns
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'InternetService', 'OnlineSecurity',
                    'OnlineBackup', 'DeviceProtection', 'TechSupport',
                    'StreamingTV', 'StreamingMovies', 'Contract',
                    'PaperlessBilling', 'PaymentMethod']

**(2) Avoid dummy trap (drop one category) ?**

In [18]:
# drop_first=True drops one category per feature to avoid the dummy variable trap
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True, dtype=int)
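
To see what `drop_first=True` actually removes, the sketch below encodes a single toy column both ways. The three-row `toy` frame is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy, columns=["Contract"], dtype=int)

# Reduced encoding: the first category (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())

# In the full encoding every row sums to 1, so any one column is redundant.
print((full.sum(axis=1) == 1).all())
```

In the reduced encoding, a row with zeros in both remaining columns means "Month-to-month", so no information is lost.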

**(3) Check the results ?**

In [19]:
print("Shape after encoding : ", df_encoded.shape)
print(df_encoded.head())

Shape after encoding :  (7043, 34)
   SeniorCitizen  tenure  MonthlyCharges  TotalCharges  Churn Tenure_Grouping  \
0              0       1           29.85         29.85      0            0-12   
1              0      34           56.95       1889.50      0           25-48   
2              0       2           53.85        108.15      1            0-12   
3              0      45           42.30       1840.75      0           25-48   
4              0       2           70.70        151.65      1            0-12   

   HighMonthlyChargeFlag  ContractLenghtScore  gender_Male  Partner_Yes  ...  \
0                      0                    1            0            1  ...   
1                      0                    2            1            0  ...   
2                      0                    1            1            0  ...   
3                      0                    2            1            0  ...   
4                      1                    1            0            0  ...  

# **STEP 4 — Train/Test Split**

In [20]:
from sklearn.model_selection import train_test_split


In [21]:
# Define X (features) and y (target)
# Drop 'Churn' because it is the target.
# Drop 'Tenure_Grouping' because it is text (the numeric 'tenure' column is kept).
# 'customerID' was already dropped in Step 2; errors='ignore' keeps this cell safe to re-run.
X = df_encoded.drop(columns=['Churn', 'Tenure_Grouping', 'customerID'], errors='ignore')
y = df_encoded['Churn']

**(1) Split data: 80% train / 20% test, and use stratify=y to preserve the churn ratio ?**

In [22]:
# 2. Split the data (80% Train, 20% Test)
# stratify=y keeps the share of churners (1) and non-churners (0) nearly identical in both sets.
# random_state=42 makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [23]:
# 3. Verify the shapes (Check if split worked)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

Training Data Shape: (5634, 32)
Testing Data Shape: (1409, 32)
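
The shapes confirm the 80/20 split, but not the stratification. A short check of the churn rate in each part can confirm that too. The sketch below runs on synthetic data with a roughly similar imbalance; `X_toy`, `y_toy`, and the derived names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target: 730 non-churners and 270 churners.
y_toy = pd.Series([0] * 730 + [1] * 270, name="Churn")
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the positive rate should match closely in both parts.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train churn rate: {train_rate:.3f}, test churn rate: {test_rate:.3f}")
```

On the notebook's own data, the equivalent check is to compare `y_train.mean()` with `y_test.mean()`.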


# **STEP 5 — Scaling**

**(1) Scale numeric features: tenure, MonthlyCharges, TotalCharges ?**

In [24]:
# We only scale continuous numeric columns. We do not scale 0/1 columns (like 'gender_Male')
cols_to_scale = ['tenure', 'MonthlyCharges', 'TotalCharges']


**(2) Use StandardScaler (or MinMaxScaler) ?**

In [25]:
# Initialize the Scaler
scaler = StandardScaler()

**(3) Fit scaler on train only (important) ?**

In [26]:
# fit_transform on Train: calculates the mean and standard deviation from Train, then applies them
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])

In [28]:
# transform on Test: uses the same mean and standard deviation learned from Train to adjust Test
# We do NOT fit on Test data
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
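
The sketch below illustrates what "fit on train only" means numerically: each test value is standardized as (x - train mean) / train standard deviation. The data is made up, and `toy_train`, `toy_test`, and `toy_scaler` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({"tenure": [1.0, 10.0, 30.0, 60.0]})
toy_test = pd.DataFrame({"tenure": [5.0, 72.0]})

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# Manual version; StandardScaler uses the population std (ddof=0).
train_mean = toy_train["tenure"].mean()
train_std = toy_train["tenure"].std(ddof=0)
manual = (toy_test["tenure"].to_numpy() - train_mean) / train_std

print(np.allclose(test_scaled.ravel(), manual))
```

The scaled training column has mean 0 and standard deviation 1 by construction. The scaled test column generally does not, which is expected rather than a sign of an error.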

**(4) Check the result**

In [29]:
# The values should now be small numbers (mostly between -3 and +3)
print("Scaled Training Data (First 5 rows) :")
print(X_train[cols_to_scale].head())

Scaled Training Data (First 5 rows) :
        tenure  MonthlyCharges  TotalCharges
3738  0.102371       -0.521976     -0.262257
3151 -0.711743        0.337478     -0.503635
4860 -0.793155       -0.809013     -0.749883
3867 -0.263980        0.284384     -0.172722
3810 -1.281624       -0.676279     -0.989374


# **STEP 6 — Class Imbalance Check**

**(1) Check churn ratio in train set ?**

In [30]:
# Check the balance of churn in the Training set
# normalize=True returns proportions (decimals) instead of counts

print("Churn Distribution in Training Set :")
print(y_train.value_counts(normalize=True))

Churn Distribution in Training Set :
Churn
0    0.734647
1    0.265353
Name: proportion, dtype: float64


**(2) Just note imbalance in markdown ?**

**Class Imbalance Observation:** The training set is imbalanced. About 73% of customers are "No Churn" (Class 0) and only about 27% are "Churn" (Class 1).

**Impact:** A model trained on this data may become biased toward predicting "No Churn", because that is the label it sees most often. Accuracy alone can therefore look good even when most churners are missed.

**Simple Explanation:**

Imagine a classroom with 73 boys and 27 girls. If I close my eyes and guess "Boy" for every student, I will be right 73% of the time, even though I never looked at anyone. That is the problem with imbalanced data: a model can earn a high score simply by guessing the majority class (No Churn).
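
The classroom analogy can be reproduced in a few lines. The sketch below uses scikit-learn's `DummyClassifier` as the "eyes closed" guesser on synthetic labels with a 73/27 split. The data and the names `X_toy`, `y_toy`, and `baseline` are assumptions, and the example is a baseline, not a proposed model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 73 non-churners (0) and 27 churners (1), mirroring the analogy.
y_toy = np.array([0] * 73 + [1] * 27)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

acc = accuracy_score(y_toy, pred)  # share of correct guesses
rec = recall_score(y_toy, pred)    # share of actual churners that were caught
print(f"accuracy: {acc:.2f}, churn recall: {rec:.2f}")
# accuracy: 0.73, churn recall: 0.00
```

Any real model should be compared against this baseline, and metrics such as recall for the churn class give a more honest picture than accuracy alone.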