# Data Mining Portfolio

## Project Overview
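This portfolio addresses customer churn in the telecom sector. It uses a publicly available IBM telecom churn dataset from Kaggle to build a binary classification model that identifies customers at risk of leaving.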

## Stage 1: Problem Definition and Data Scoping
### Business Problem
Customer churn is a significant challenge for telecom service providers because acquiring new customers is typically more costly than retaining existing ones. UK telecom operators such as Three UK, EE, and Vodafone operate in a highly competitive market where short-term contracts and similar pricing structures make it easy for customers to switch providers. Identifying customers who are at risk of leaving is therefore critical for improving retention and protecting recurring revenue.

### Data Mining Objective
The objective of this project is to develop a predictive model that can identify customers who are likely to churn based on their historical service usage, contract details, and billing information. This is framed as a binary classification problem, where the output variable indicates whether a customer churns or remains with the service.
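
As a minimal sketch of this framing, the snippet below maps a Yes/No churn label to a 0/1 target. The `Churn` column and its `Yes`/`No` values match the dataset explored in Stage 2; the three-row frame and the `churn_flag` column name are illustrative assumptions.

```python
import pandas as pd

# Tiny illustrative frame; the real data is loaded in Stage 2
sample = pd.DataFrame({"Churn": ["No", "Yes", "No"]})

# Encode the label as a binary target: 1 = churned, 0 = retained
# ("churn_flag" is a hypothetical column name used only for this sketch)
sample["churn_flag"] = sample["Churn"].map({"No": 0, "Yes": 1})

print(sample)
```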

### Dataset Selection and Justification
This study uses a publicly available telecom customer churn dataset sourced from Kaggle and originally provided by IBM. The dataset contains customer-level information including tenure, contract type, monthly charges, payment method, subscribed services, and a churn indicator. Although the dataset does not originate from a UK provider, it covers the kinds of attributes that telecom operators typically record about their customers, and it is widely used for churn prediction research. It is therefore appropriate for demonstrating data mining techniques and evaluating predictive performance, although real-world deployment would require operator-specific data.

## Stage 2: Exploratory Data Analysis and Pre-processing
- Data overview
- Missing values and outliers
- Visualisations
- Feature engineering

In [2]:
import pandas as pd
import numpy as np

# Load the telecom churn dataset
df = pd.read_csv("../data/telecom_churn.csv")

# Report the dataset size, then list column names, non-null counts, and dtypes
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")
df.info()

Dataset contains 7043 rows and 21 columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 ...

The dataset consists of 7,043 customer records with 21 variables. Most features
are categorical, reflecting customer demographics, service subscriptions, and
contract details. The numerical variables include tenure, MonthlyCharges, and
TotalCharges. However, TotalCharges is stored as an object dtype rather than a
numeric one, so it must be converted before modelling.
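
One way to see why a charges column ends up as `object` is to look for entries that fail numeric conversion. The sketch below uses a small made-up frame in which one value is a blank string. That the real column fails for the same reason is an assumption to verify against the data, not something the inspection above establishes.

```python
import pandas as pd

# Made-up values for illustration; the blank string stands in for a bad entry
sample = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" converts anything unparseable to NaN instead of raising
converted = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Rows whose original value was present but could not be parsed as a number
bad_rows = sample[converted.isna() & sample["TotalCharges"].notna()]

print(bad_rows)
print(converted.dtype)
```

Listing the offending rows before converting makes it easier to decide whether they should be dropped or imputed.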

In [3]:
df["Churn"].value_counts()

Churn
No     5174
Yes    1869
Name: count, dtype: int64

In [4]:
df["Churn"].value_counts(normalize=True) * 100

Churn
No     73.463013
Yes    26.536987
Name: proportion, dtype: float64
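
The class split above also sets a reference point for later evaluation: a model that always predicts `No` would already be correct for every non-churner. A minimal sketch of that arithmetic, using the counts reported above (the variable names are illustrative):

```python
# Class counts taken from the value_counts() output above
n_stay = 5174
n_churn = 1869
total = n_stay + n_churn

# Accuracy of a trivial model that always predicts the majority class ("No")
baseline_accuracy = n_stay / total

# Imbalance ratio: non-churners per churner
imbalance_ratio = n_stay / n_churn

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
print(f"Non-churners per churner: {imbalance_ratio:.2f}")
```

Because such a baseline identifies no churners at all, accuracy on its own can look strong while the model is useless for retention, so class-aware metrics such as recall or F1 are worth reporting alongside it.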

In [5]:
# Convert TotalCharges to numeric; entries that cannot be parsed become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Count missing values per column
df.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

In [6]:
# Remove rows with missing TotalCharges (the only column with missing values)
df = df.dropna(subset=["TotalCharges"])

df.shape

(7032, 21)

In [7]:
df.groupby("Churn")["tenure"].mean()

Churn
No     37.650010
Yes    17.979133
Name: tenure, dtype: float64
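
The gap in mean tenure suggests comparing churn rates across tenure bands. The sketch below shows one way to do that with `pd.cut`. The band edges, the `tenure_band` column name, and the eight-row sample are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Made-up sample; in the notebook this would be the cleaned df
sample = pd.DataFrame({
    "tenure": [1, 5, 10, 14, 30, 40, 60, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
})

# Bin tenure (months) into bands; the edges here are arbitrary choices
bins = [0, 12, 24, 48, 72]
labels = ["0-12", "13-24", "25-48", "49-72"]
sample["tenure_band"] = pd.cut(
    sample["tenure"], bins=bins, labels=labels, include_lowest=True
)

# Share of customers in each band who churned
churn_rate = (
    sample["Churn"].eq("Yes")
    .groupby(sample["tenure_band"], observed=True)
    .mean()
)
print(churn_rate)
```

A banded tenure feature like this could also serve as an engineered input for modelling, if the bands turn out to separate churners well on the real data.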

## Stage 3: Data Mining / Machine Learning
- Technique(s) used
- Justification
- Model training and validation

## Stage 4: Evaluation, Recommendations, and Reflection
- Model evaluation
- Business implications
- Limitations and future work
- Ethical, privacy, and security considerations



## References
(UWE Harvard format)