<div style="background: linear-gradient(to right, #4F46E5, #7C3AED); padding: 30px; border-radius: 15px; margin-bottom: 30px;">
    <h1 style="color: white; font-size: 2.5em; margin-bottom: 15px;">Customer Churn Prediction with Deep Learning</h1>
    <p style="color: rgba(255, 255, 255, 0.9); font-size: 1.1em; line-height: 1.6;">
        Telecom Industry | Data Science | Machine Learning
    </p>
</div>

<div style="background: white; padding: 25px; border-radius: 10px; box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); margin-bottom: 30px;">
    <h2 style="color: #4F46E5; margin-bottom: 20px;">Project Overview</h2>
    <p style="color: #374151; font-size: 1.1em; line-height: 1.6;">
        As a data scientist at a digital services company that helps telecom operators reduce subscriber loss,
        you have been assigned to a new client. <span style="color: #4F46E5; font-weight: 500;">TelcoNova</span> wants to anticipate 
        customer departures (<span style="color: #4F46E5; font-weight: 500;">churn</span>) so that it can optimize its retention campaigns.
    </p>
</div>

<div style="display: grid; grid-template-columns: repeat(2, 1fr); gap: 20px; margin-bottom: 30px;">
    <div style="background: #F3F4F6; padding: 20px; border-radius: 10px; border-left: 4px solid #4F46E5;">
        <h3 style="color: #1F2937; margin-bottom: 10px;">📅 Timeline</h3>
        <p style="color: #4B5563;">3 days to deliver a working prototype</p>
    </div>
    <div style="background: #F3F4F6; padding: 20px; border-radius: 10px; border-left: 4px solid #4F46E5;">
        <h3 style="color: #1F2937; margin-bottom: 10px;">🔄 Dataset</h3>
        <p style="color: #4B5563;">Pre-cleaned Telco Customer Churn data</p>
    </div>
</div>

<div style="position: relative; margin-bottom: 30px;">
    <img src="https://images.pexels.com/photos/3861969/pexels-photo-3861969.jpeg?auto=compress&cs=tinysrgb&w=1260&h=750&dpr=2" 
         style="width: 100%; height: 300px; object-fit: cover; border-radius: 10px;">
    <div style="position: absolute; bottom: 0; left: 0; right: 0; background: rgba(0,0,0,0.7); padding: 15px; border-bottom-left-radius: 10px; border-bottom-right-radius: 10px;">
        <p style="color: white; text-align: center; margin: 0;">
            Using neural networks to predict customer behavior in the telecom industry
        </p>
    </div>
</div>

<div style="background: #F8FAFC; padding: 25px; border-radius: 10px; border: 1px solid #E2E8F0;">
    <h2 style="color: #1F2937; margin-bottom: 20px;">Project Goals</h2>
    <ul style="list-style-type: none; padding: 0;">
        <li style="display: flex; align-items: center; margin-bottom: 15px;">
            <span style="background: #4F46E5; color: white; width: 24px; height: 24px; border-radius: 12px; display: flex; align-items: center; justify-content: center; margin-right: 10px;">1</span>
            <span style="color: #4B5563;">Early identification of customers likely to churn</span>
        </li>
        <li style="display: flex; align-items: center; margin-bottom: 15px;">
            <span style="background: #4F46E5; color: white; width: 24px; height: 24px; border-radius: 12px; display: flex; align-items: center; justify-content: center; margin-right: 10px;">2</span>
            <span style="color: #4B5563;">Optimization of retention campaign targeting</span>
        </li>
        <li style="display: flex; align-items: center;">
            <span style="background: #4F46E5; color: white; width: 24px; height: 24px; border-radius: 12px; display: flex; align-items: center; justify-content: center; margin-right: 10px;">3</span>
            <span style="color: #4B5563;">Analysis of key factors influencing customer departures</span>
        </li>
    </ul>
</div>


In [4]:
# Reduce TensorFlow log verbosity (must be set before TensorFlow is imported)
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, auc, classification_report
import tensorflow as tf
from tensorflow.keras import layers, models

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

## 1. Data Loading and Initial Exploration

In [14]:
# Load the dataset
df = pd.read_csv('Dataset.csv')

# Display basic information about the dataset
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.")
print("\nDataset Info:")
df.info()

# Display first few rows
print("\nFirst few rows of the dataset:")
df.head(2)

The dataset has 7043 rows and 21 columns.

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract 

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
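
Since the goal is to predict the `Churn` column, a quick look at how its Yes/No labels are distributed is a natural part of initial exploration. The sketch below runs on a small invented frame, not on the TelcoNova data; `toy` and `churn_rate` are illustrative names, and the same two expressions could be applied to `df` instead.

```python
import pandas as pd

# Hypothetical toy data standing in for df; the values are invented for illustration
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No"]})

# Share of each label in the target column
print(toy["Churn"].value_counts(normalize=True))

# Fraction of customers labeled as churned
churn_rate = (toy["Churn"] == "Yes").mean()
print(f"Churn rate: {churn_rate:.2%}")
```

A strongly imbalanced target would be worth keeping in mind later when splitting the data and reading accuracy figures.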


## 2. Data cleaning

In [15]:
# 1. Strip whitespace from column names
df.columns = df.columns.str.strip()

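# 2. Trim whitespace in all string cells
#    (Series.map per column avoids the deprecated DataFrame.applymap)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# 3. Convert empty strings to NaN so that blank cells are counted as missing
df.replace('', np.nan, inplace=True)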

# 4. Show the number of duplicates and number of rows before dropping
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")
print(f"Number of rows before dropping duplicates: {df.shape[0]}")

# 5. Drop duplicate rows
df.drop_duplicates(inplace=True)

# 6. Show the number of rows after dropping duplicates
print(f"Number of rows after dropping duplicates: {df.shape[0]}")

# 7. Show the number of missing values for the columns which have them
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
print(f"\nNumber of missing values per column if any:\n{missing_values}")

# 8. Drop rows with missing values
df.dropna(inplace=True)

# 9. (Optional) See the result
print("\nAfter cleaning, the dataset has:")
print(f"Number of rows: {df.shape[0]}")

# 10. Display the first few rows of the cleaned dataset
print("\nFirst few rows of the cleaned dataset:")
print(df.head(2))
# 11. Display the data types of the columns
print("\nData types of the columns:")
print(df.dtypes)
# 12. Display the number of unique values in each column
print("\nNumber of unique values in each column:")
print(df.nunique())



Number of duplicate rows: 0
Number of rows before dropping duplicates: 7043
Number of rows after dropping duplicates: 7043

Number of missing values per column if any:
TotalCharges    11
dtype: int64

After cleaning, the dataset has:
Number of rows: 7032

First few rows of the cleaned dataset:
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   

  TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling  \
0          No          No              No  Month-to-month              Yes   
1          No          No              No        One year               No   
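
The 11 missing values in `TotalCharges` only show up once blank strings have been converted to NaN, which suggests the column was read as text rather than as numbers. The sketch below replays the same steps on an invented three-row frame. The final `pd.to_numeric` call is an assumed extra step that the cell above does not perform, and `toy` and `n_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: TotalCharges stored as strings, with one blank cell
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Strip whitespace from string cells, column by column
toy = toy.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# Blank strings become NaN, so they are counted as missing
toy = toy.replace("", np.nan)
n_missing = int(toy["TotalCharges"].isnull().sum())

# Drop incomplete rows, then convert the text column to floats (assumed extra step)
toy = toy.dropna()
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"])

print(n_missing, len(toy), toy["TotalCharges"].dtype)
```

Converting the column to a numeric dtype would also keep it out of any later selection of `object` columns.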




In [20]:
import pandas as pd
import numpy as np
import plotly.express as px

# — 1. Clean categorical columns —
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
df[categorical_columns] = df[categorical_columns].apply(
    lambda col: col.str.strip() if col.dtype == "object" else col
)

# — 2. Plot with Plotly —
max_categories = 50  # adjust as needed

for col in categorical_columns:
    # compute counts and limit to top N
    vc = df[col].value_counts()
    top_vc = vc.iloc[:max_categories]

    # build a horizontal bar chart
    fig = px.bar(
        x=top_vc.values,
        y=top_vc.index,
        orientation='h',
        labels={'x': 'Count', 'y': col},
        title=f"Distribution of {col} (Top {len(top_vc)} of {df[col].nunique()} categories)"
    )

    # invert y-axis so the largest bar is on top
    fig.update_layout(yaxis={'autorange':'reversed'})

    # annotate bars with their counts
    fig.update_traces(text=top_vc.values, textposition='outside')

    fig.show()


ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed
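
The error above is raised by `fig.show()` and says that `nbformat>=4.2.0` is required but not installed. Installing it in the notebook's environment (for example with `pip install nbformat`) and restarting the kernel is the usual remedy. The helper below is a hypothetical pre-flight check, not part of plotly. It uses only the standard library to test whether a sufficient `nbformat` is present, and its version parsing is deliberately simplified.

```python
from importlib import metadata


def nbformat_ok(min_version=(4, 2, 0)):
    """Return True if nbformat is installed at min_version or newer (hypothetical helper)."""
    try:
        raw = metadata.version("nbformat")
    except metadata.PackageNotFoundError:
        return False
    # Keep the leading numeric components, e.g. "5.9.2" -> (5, 9, 2)
    parts = tuple(int(p) for p in raw.split(".")[:3] if p.isdigit())
    # Pad short versions such as "4.2" to three components before comparing
    parts = parts + (0,) * (3 - len(parts))
    return parts >= min_version


if not nbformat_ok():
    print("nbformat>=4.2.0 not found; install it before calling fig.show()")
```

If installing the package is not an option, writing each chart to a standalone file with `fig.write_html(...)` is one possible alternative to inline rendering.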