# Telecom Customer Churn Prediction

## 1. Title and Introduction

The telecom customer churn prediction project aims to develop a predictive model that can accurately identify customers who are likely to churn. By predicting churn in advance, telecom companies can implement strategies to retain at-risk customers and minimize revenue loss.


## 2. Data Import and Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, recall_score, f1_score, roc_auc_score, precision_score
from sklearn.pipeline import Pipeline

# Load dataset
df = pd.read_csv('Call_Details_Data.csv')

# Display first few rows
display(df.head())

# Data overview
df.info()
display(df.describe())

## 3. Data Preprocessing

In [None]:
# Check for missing values
display(df.isnull().sum())

# Handling duplicates
df = df.drop_duplicates()

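# Encoding categorical variables (one-hot, dropping the first level of each)
# Note: this assumes 'Churn' is not stored as text. An object-typed target would be
# expanded into dummy columns here, and df['Churn'] below would raise a KeyError.
categorical_features = [col for col in df.columns if df[col].dtype == 'object']
df = pd.get_dummies(df, columns=categorical_features, drop_first=True)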

# Splitting features and target variable
y = df['Churn']
X = df.drop(columns=['Churn'])

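# Standardize features (zero mean, unit variance)
# Caveat: fitting the scaler on all rows lets test-set statistics influence training.
# Tree-based models such as random forests are also largely insensitive to feature scaling.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)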
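
The encoding and scaling steps above can also be wrapped in a `Pipeline`, so that they are fitted on the training rows only. The following is a minimal sketch under stated assumptions: the toy frame and its column names (`day_minutes`, `service_calls`, `intl_plan`) are hypothetical stand-ins for `Call_Details_Data.csv`, and `ColumnTransformer` is swapped in for `pd.get_dummies`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative, not from the real file
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'day_minutes': rng.normal(180, 50, n),
    'service_calls': rng.integers(0, 8, n),
    'intl_plan': rng.choice(['yes', 'no'], n),
    'Churn': rng.integers(0, 2, n),
})
X_toy = toy.drop(columns=['Churn'])
y_toy = toy['Churn']

# Route numeric columns to the scaler and everything else to the encoder
numeric_cols = X_toy.select_dtypes(include='number').columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude='number').columns.tolist()

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Split first, then fit: the scaler and encoder only ever see training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
pipe.fit(X_tr, y_tr)
toy_pred = pipe.predict(X_te)
print(toy_pred[:10])
```

Because the pipeline is a single estimator, it can be passed directly to `cross_val_score`, which refits the preprocessing inside each fold.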

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Churn distribution
sns.countplot(x='Churn', data=df)
plt.title('Churn Distribution')
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.show()
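
A full heatmap can be hard to read once one-hot encoding has produced many columns. As a rough complement, the sketch below prints the overall churn rate and ranks features by the absolute value of their correlation with the target. The data is synthetic and the column names are hypothetical; the link between `service_calls` and churn is built in purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names are illustrative assumptions
rng = np.random.default_rng(0)
n = 500
calls = rng.integers(0, 8, n)
toy_eda = pd.DataFrame({
    'service_calls': calls,
    'day_minutes': rng.normal(180, 50, n),
    # Churn is made more likely as service calls increase (illustration only)
    'Churn': (rng.random(n) < 0.05 + 0.08 * calls).astype(int),
})

# Share of customers who churned
churn_rate = toy_eda['Churn'].mean()

# Correlation of each feature with the target, strongest first (sign preserved)
target_corr = (
    toy_eda.corr()['Churn']
    .drop('Churn')
    .sort_values(key=np.abs, ascending=False)
)

print(f'Churn rate: {churn_rate:.1%}')
print(target_corr)
```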

## 5. Model Selection

In [None]:
# Splitting dataset (stratified so train and test keep the same churn rate)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

# Choosing a classification model
model = RandomForestClassifier(n_estimators=100, random_state=42)
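
The cell above fixes a single model. One way to make the selection step explicit is to compare a few candidates under the same cross-validation folds. This is only a sketch: the data comes from `make_classification`, the `LogisticRegression` baseline is an added assumption rather than part of this project, and ROC AUC is used because churn data is typically imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data (about 15% positives)
X_sel, y_sel = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Candidate models; the logistic regression baseline is an illustrative addition
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same folds for every candidate so the scores are comparable
cv_sel = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
comparison = {
    name: cross_val_score(est, X_sel, y_sel, cv=cv_sel, scoring='roc_auc').mean()
    for name, est in candidates.items()
}

for name, score in comparison.items():
    print(f'{name}: mean ROC AUC = {score:.3f}')
```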

## 6. Model Training with Cross-Validation

In [None]:
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy')

print('Cross-Validation Accuracy Scores:', cv_scores)
print('Mean CV Accuracy:', np.mean(cv_scores))

# Train model
model.fit(X_train, y_train)
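
Accuracy alone can look good on imbalanced churn data even when few churners are caught. A possible extension, sketched here on synthetic data, is `cross_validate` with several scorers at once. The dataset and variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the training split
X_cv, y_cv = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

metrics = ['accuracy', 'recall', 'f1', 'roc_auc']
cv_multi = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate returns one array of fold scores per metric, keyed 'test_<metric>'
multi_scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_cv, y_cv, cv=cv_multi, scoring=metrics,
)

for metric in metrics:
    fold_scores = multi_scores[f'test_{metric}']
    print(f'{metric}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}')
```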

## 7. Model Evaluation

In [None]:
# Predictions
y_pred = model.predict(X_test)

# Performance Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
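# ROC AUC is computed from predicted probabilities of the positive class, not hard labels
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])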

print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'ROC AUC Score: {roc_auc:.4f}')
print(classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
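
The metrics above use the default 0.5 decision threshold. For retention campaigns it can be reasonable to lower the threshold and trade precision for recall. The sketch below illustrates this on synthetic data and also contrasts ROC AUC computed from probabilities with ROC AUC computed from hard labels. The thresholds 0.5 and 0.3 are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_ev, y_ev = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_ev_tr, X_ev_te, y_ev_tr, y_ev_te = train_test_split(
    X_ev, y_ev, test_size=0.2, random_state=42, stratify=y_ev
)
clf_ev = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_ev_tr, y_ev_tr)

# Probability of the positive (churn) class for each test row
churn_proba = clf_ev.predict_proba(X_ev_te)[:, 1]

# AUC from probabilities uses the full ranking; AUC from labels uses one cut-off
auc_from_proba = roc_auc_score(y_ev_te, churn_proba)
auc_from_labels = roc_auc_score(y_ev_te, clf_ev.predict(X_ev_te))
print(f'AUC (probabilities): {auc_from_proba:.3f}')
print(f'AUC (hard labels):   {auc_from_labels:.3f}')

# Precision and recall at two illustrative thresholds
recall_at = {}
for threshold in (0.5, 0.3):
    flagged = (churn_proba >= threshold).astype(int)
    recall_at[threshold] = recall_score(y_ev_te, flagged)
    precision = precision_score(y_ev_te, flagged, zero_division=0)
    print(f'threshold={threshold}: precision={precision:.3f}, recall={recall_at[threshold]:.3f}')
```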

## 8. Results Visualization

In [None]:
# Feature Importance
feature_importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
feature_importance.plot(kind='bar', figsize=(10, 5))
plt.title('Feature Importance')
plt.show()
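
Impurity-based importances from `feature_importances_` are computed on the training data and can favor high-cardinality features. As a cross-check, permutation importance on held-out data is one commonly used alternative. The sketch below uses synthetic data with hypothetical feature names and is not part of the original analysis.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names
X_arr, y_pi = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_arr.shape[1])]
X_pi = pd.DataFrame(X_arr, columns=feature_names)

X_pi_tr, X_pi_te, y_pi_tr, y_pi_te = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=42, stratify=y_pi
)
clf_pi = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_pi_tr, y_pi_tr)

# Shuffle each column of the held-out set and measure the drop in ROC AUC
result = permutation_importance(
    clf_pi, X_pi_te, y_pi_te, n_repeats=10, random_state=42, scoring='roc_auc'
)
perm_importance = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)
print(perm_importance)
```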

## 9. Conclusion
- The model achieved an accuracy of XX%.
- The most important factors affecting churn are [mention top features].
- Stratified 5-fold cross-validation on the training split was used to check that performance is stable across folds.
- Future work could explore additional ML techniques and deeper feature engineering.

## 10. References
- Scikit-learn documentation
- Seaborn visualization guide

## 11. Code Comments and Documentation
- Comments are added throughout the code to explain each step.

## 12. Reproducibility
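- Place `Call_Details_Data.csv` in the working directory before running.
- All required libraries are imported in the first code cell.
- The train/test split, cross-validation folds, and random forest all use `random_state=42`.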
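
Recording the library versions alongside the results also helps others reproduce a run. A minimal sketch, assuming the packages are installed under their usual distribution names:

```python
import platform
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the libraries imported in the first code cell
packages = ['pandas', 'numpy', 'matplotlib', 'seaborn', 'scikit-learn']


def installed_versions(names):
    """Return a {package: version} mapping, marking missing packages."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = 'not installed'
    return found


env = installed_versions(packages)
print('python', platform.python_version())
for name, ver in env.items():
    print(name, ver)
```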