## **Task 1: Data Cleaning and Preprocessing**

This project follows five steps:

1) Loading the data

2) Handling missing values

3) Converting categorical variables

4) Final data checking

5) Saving the cleaned data

**Step 1 Load the Data**

In [5]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('/content/Telco_Customer_Churn_Dataset .csv')

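# Display basic info (df.info() prints its report itself and returns None,
# so wrapping it in print() would add a stray "None" line)
df.info()
print(df.head())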

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
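
Note that `info()` counts a blank string as a non-null value, so a column can look complete here and still contain gaps, which is why Step 2 starts by converting blanks in TotalCharges. The sketch below illustrates this on a small toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values): a numeric column that was read in as text
toy = pd.DataFrame({
    "tenure": [0, 5, 12],
    "TotalCharges": [" ", "250.5", "1200.0"],
})

# Every entry counts as non-null, including the blank one
non_null_count = toy["TotalCharges"].notna().sum()

# Whitespace-only entries are the hidden missing values
blank_count = toy["TotalCharges"].str.strip().eq("").sum()

print("non-null entries:", non_null_count)
print("blank entries:", blank_count)
```

Running the same two checks on the real column before conversion is a quick way to see how many rows Step 2 will need to fill.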


**Step 2 Handle Missing Values**

In [6]:
# Convert blank (single-space) strings in TotalCharges to NaN, then cast to float
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan).astype(float)

# For customers with tenure=0, fill TotalCharges with 0 (new customers, not billed yet)
# Any other gaps are filled further down with the median for the same tenure
mask = (df['tenure'] == 0) & (df['TotalCharges'].isna())
df.loc[mask, 'TotalCharges'] = 0

# For remaining missing values, fill with median by tenure group
df['TotalCharges'] = df.groupby('tenure')['TotalCharges'].transform(lambda x: x.fillna(x.median()))
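
To see what the two filling rules above do, here is a minimal sketch on a toy frame. The numbers are invented for illustration only; the column names match the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two gaps in TotalCharges
toy = pd.DataFrame({
    "tenure": [0, 2, 2, 2, 5],
    "TotalCharges": [np.nan, 100.0, np.nan, 140.0, 300.0],
})

# Rule 1: zero-tenure customers with a gap get 0
new_customer = (toy["tenure"] == 0) & toy["TotalCharges"].isna()
toy.loc[new_customer, "TotalCharges"] = 0.0

# Rule 2: remaining gaps take the median of rows with the same tenure
toy["TotalCharges"] = (
    toy.groupby("tenure")["TotalCharges"]
       .transform(lambda s: s.fillna(s.median()))
)

print(toy)
```

The gap at tenure 2 becomes 120.0, the median of 100.0 and 140.0. One caveat: if every row in a tenure group is missing, the group median is itself NaN and the gap stays, so the missing-value check in Step 4 is still needed.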

**Step 3 Convert Categorical Variables**

In [7]:
# Convert binary categorical variables (Yes/No) to 1/0
binary_cols = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']
for col in binary_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

# SeniorCitizen is already encoded as 0/1; the cast only makes the integer dtype explicit
df['SeniorCitizen'] = df['SeniorCitizen'].astype(int)

# For columns with "No phone service" or "No internet service", we can treat them as "No"
service_cols = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

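# map() returns integer columns directly and avoids the downcasting
# FutureWarning that replace() raises in recent pandas versions
service_map = {'No phone service': 0, 'No internet service': 0, 'No': 0, 'Yes': 1}
for col in service_cols:
    df[col] = df[col].map(service_map)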

# One-hot encode remaining categorical variables
categorical_cols = ['gender', 'InternetService', 'Contract', 'PaymentMethod']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Drop customerID as it's not useful for modeling
df = df.drop('customerID', axis=1)
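
The encoding steps above can be sketched on a small toy frame to show what the resulting columns look like. The rows are made up for illustration, and `dtype=int` is an optional choice added here so the dummy columns print as 0/1 rather than True/False.

```python
import pandas as pd

# Toy data (hypothetical rows) with one column of each kind
toy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "OnlineSecurity": ["No internet service", "Yes", "No"],
    "Contract": ["Month-to-month", "One year", "Two year"],
})

# Binary column: Yes/No -> 1/0
toy["Partner"] = toy["Partner"].map({"Yes": 1, "No": 0})

# Service column: "No internet service" is treated the same as "No"
toy["OnlineSecurity"] = toy["OnlineSecurity"].map(
    {"Yes": 1, "No": 0, "No internet service": 0}
)

# Multi-level column: one-hot encode; drop_first=True drops the first
# level in sorted order ("Month-to-month"), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(encoded)
```

A row with 0 in both `Contract_One year` and `Contract_Two year` is a month-to-month customer, so no information is lost by dropping the first level.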


**Step 4 Final Data Check**

In [8]:
# Check for any remaining missing values
print(df.isnull().sum())

# Check data types
print(df.dtypes)

# Verify binary columns were properly encoded
print(df[['Partner', 'Dependents', 'Churn']].head())

SeniorCitizen                            0
Partner                                  0
Dependents                               0
tenure                                   0
PhoneService                             0
MultipleLines                            0
OnlineSecurity                           0
OnlineBackup                             0
DeviceProtection                         0
TechSupport                              0
StreamingTV                              0
StreamingMovies                          0
PaperlessBilling                         0
MonthlyCharges                           0
TotalCharges                             0
Churn                                    0
gender_Male                              0
InternetService_Fiber optic              0
InternetService_No                       0
Contract_One year                        0
Contract_Two year                        0
PaymentMethod_Credit card (automatic)    0
PaymentMethod_Electronic check           0
PaymentMeth

**Step 5 Save Cleaned Data**

In [9]:
# Save cleaned data to new CSV file
df.to_csv('Telco_Customer_Churn_Cleaned.csv', index=False)
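
A quick way to confirm the save worked is to read the file back and compare it with the frame in memory. The sketch below does this with a toy frame and a temporary file so that it runs on its own; the file name `cleaned_sample.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in (hypothetical values) for the cleaned DataFrame
cleaned = pd.DataFrame({
    "tenure": [0, 5],
    "TotalCharges": [0.0, 250.5],
    "Churn": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.csv")  # hypothetical file name
    cleaned.to_csv(path, index=False)   # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

# The reloaded frame should match the original and contain no gaps
round_trip_ok = reloaded.equals(cleaned)
has_missing = bool(reloaded.isnull().any().any())

print("round trip ok:", round_trip_ok)
print("missing values after reload:", has_missing)
```

The same check can be applied to the real output by reading `Telco_Customer_Churn_Cleaned.csv` back with `pd.read_csv` and comparing it with `df`.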

**Conclusion**

We have cleaned and preprocessed the Telco customer churn dataset so that it is ready for analysis. Here is a summary of what we did:

1. Fixed Missing Data: Filled the gaps in TotalCharges. New customers (zero tenure) got a 0, and any remaining gaps were filled with the median TotalCharges of customers with the same tenure.

2. Simplified Categories: Turned "Yes/No" answers into 1s and 0s, and treated "No internet service" and "No phone service" as "No" (0) to avoid redundancy.

3. Encoded Text Data: Converted the remaining categorical features (gender, internet service, contract type, and payment method) into numeric columns using one-hot encoding so that models can use them.

4. Removed Irrelevant Data: Dropped the customerID column, since a unique identifier doesn’t help predict churn.

In short, we took messy, real-world data and made it neat and usable, like organizing a cluttered desk so you can actually find what you need.