**<span style="color:#0F52BA;font-family:serif; font-size:34px;"> TELCO CUSTOMER CHURN PREDICTION </span>**

Churn prediction models are essential tools for businesses seeking to retain customers, optimize resources, and make data-driven decisions to improve overall performance, profitability, and customer experience.

**<span style="font-family:serif; font-size:24px;"> TABLE OF CONTENTS </span>**

1. **<span style="font-family:serif; font-size:12px;"> [TABLE OF CONTENTS](#1) </span>**
    * [What is Customer Churn?](#1)
2. **<span style="font-family:serif; font-size:12px;"> [IMPORT IMPORTANT LIBRARIES](#2) </span>**
3. **<span style="font-family:serif; font-size:12px;"> [EDA](#2) </span>**
4. **<span style="font-family:serif; font-size:12px;"> [DATA ENGINEERING](#3) </span>**
5. **<span style="font-family:serif; font-size:12px;"> [MACHINE LEARNING MODEL](#4) </span>**
6. **<span style="font-family:serif; font-size:12px;"> [WHY CHOOSE THIS MODEL / MODELS](#6) </span>**

<a id = "1" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> What is Customer Churn? </span>**

Customer churn, also known as customer attrition or customer turnover, refers to the phenomenon where customers discontinue their relationship with a company or stop using its products or services.

The reasons for customer churn can vary and may include factors such as dissatisfaction with the product or service, better offers from competitors, changes in customer needs, poor customer service, or a negative customer experience.

Customer churn is a critical metric for businesses, as it directly impacts their revenue, profitability, and overall success. A high churn rate can indicate issues with customer satisfaction, loyalty, or product/service quality, while a low churn rate suggests a high level of customer retention and satisfaction.
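
As a rough illustration, churn rate is commonly computed as the share of customers at the start of a period who left during that period. The sketch below assumes that definition; the function name `churn_rate` and the figures are hypothetical and are not taken from the dataset used in this notebook.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate for a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_lost <= customers_at_start:
        raise ValueError("customers_lost must be between 0 and customers_at_start")
    return 100 * customers_lost / customers_at_start


# Hypothetical figures: 1,000 customers at the start of the month, 50 of whom left.
monthly_rate = churn_rate(1000, 50)
print(f"Monthly churn rate: {monthly_rate:.1f}%")
```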

To mitigate churn, businesses often employ various strategies, such as improving customer service, personalizing offers, offering loyalty programs, and using data-driven approaches like churn prediction models (like this one) to identify and target high-risk customers for proactive retention efforts. By understanding the reasons for churn and implementing effective retention strategies, businesses can improve customer loyalty, reduce customer acquisition costs, and enhance their overall growth and profitability.

<a id = "2" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> Import important libraries </span>**

This code imports the required libraries for the machine learning model and data analysis. 

It includes the libraries to work with Spark, perform feature engineering (StringIndexer, VectorAssembler), build a RandomForestClassifier, create a pipeline, and visualize the data using matplotlib.

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.sql.functions import count, when
import matplotlib.pyplot as plt
import numpy as np


import warnings
warnings.filterwarnings('ignore')

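##### **Read the data**

In [None]:
# Load the first 1,000 rows of the churn table from the Lakehouse.
# Hosted notebook environments usually predefine `spark`;
# getOrCreate() reuses that session or creates one if none exists.
spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT * FROM Telco_churn_LH.Telco_cust_churn_data LIMIT 1000")
#display(df)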

<a id = "2" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> Exploratory Data Analysis </span>**

In [None]:
# View column data types
df.printSchema()

The following code explores the relationship between "Churn" and "gender".

It groups the data by "gender" and "Churn", counts each group, and expresses each count as a percentage of that gender's total:

In [None]:
# Clustered Column Chart comparing churn and gender
churn_gender_counts = df.groupBy('gender', 'Churn').agg(count('*').alias('count')).orderBy('gender', 'Churn')
churn_gender_total = df.groupBy('gender').agg(count('*').alias('total')).orderBy('gender')
churn_gender_joined = churn_gender_counts.join(churn_gender_total, on='gender', how='left')
churn_gender_joined = churn_gender_joined.withColumn('percentage', churn_gender_joined['count'] / churn_gender_joined['total'] * 100)

churn_gender_joined_pd = churn_gender_joined.toPandas()

Pivot the data to create the clustered column chart:

In [None]:
# Pivot the data to create the clustered column chart
pivot_data = churn_gender_joined_pd.pivot(index='gender', columns='Churn', values='percentage').reset_index()
pivot_data['Total'] = churn_gender_joined_pd.groupby('gender')['percentage'].sum().values

bar_width = 0.2
index = np.arange(len(pivot_data['gender']))

plt.figure(figsize=(10, 6))
plt.bar(index, pivot_data['No'], width=bar_width, label='No Churn')
plt.bar(index + bar_width, pivot_data['Yes'], width=bar_width, label='Churn')
plt.bar(index + 2 * bar_width, pivot_data['Total'], width=bar_width, label='Total', alpha=0.3)

plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.title('Churn vs Gender')
plt.xticks(index + bar_width, pivot_data['gender'])
plt.legend()


# Add numbers on top of the bars
for i, value in enumerate(pivot_data['No']):
    plt.text(i, value + 2, f"{value:.1f}%", ha='center', va='bottom', fontsize=10)

for i, value in enumerate(pivot_data['Yes']):
    plt.text(i + bar_width, value + 2, f"{value:.1f}%", ha='center', va='bottom', fontsize=10)

for i, value in enumerate(pivot_data['Total']):
    plt.text(i + 2 * bar_width, value + 2, f"{value:.1f}%", ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Pie Chart for Churn distribution
churn_distribution = df.groupBy('Churn').agg(count('*').alias('count')).orderBy('Churn')
churn_distribution_pd = churn_distribution.toPandas()

plt.figure(figsize=(6, 6))
plt.pie(churn_distribution_pd['count'], labels=churn_distribution_pd['Churn'], autopct='%1.1f%%', startangle=140)
plt.title('Churn Distribution')
plt.show()


#### Class imbalance



The chart above depicts class imbalance: one class (the minority class) has significantly fewer instances than the other class or classes (the majority class or classes).

For example, consider a binary classification problem where the task is to predict whether an email is spam or not spam (ham). If the dataset contains 99% non-spam (ham) emails and only 1% spam emails, it is a class-imbalanced dataset because the spam class is the minority class, and the ham class is the majority class.

The main challenge with imbalanced datasets is that standard machine learning algorithms tend to be biased towards the majority class since they are optimized to maximize overall accuracy.
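
To make the accuracy problem concrete, here is a minimal sketch that mirrors the spam example above. The labels are hypothetical (99 ham, 1 spam) and the "model" simply predicts the majority class every time; it is not the classifier trained in this notebook.

```python
# Hypothetical labels mirroring the spam example: 99 ham (0) and 1 spam (1).
y_true = [0] * 99 + [1]

# A trivial "model" that always predicts the majority class (ham).
y_pred = [0] * len(y_true)

# Accuracy: share of all predictions that are correct.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: share of actual spam that was caught.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

The accuracy looks excellent even though the model never identifies a single minority-class instance, which is why accuracy alone can be misleading on imbalanced data.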


Several techniques can be used to address class imbalance. In this case we used an ensemble technique:

Ensemble methods such as Random Forest or boosting combine predictions from multiple models, which can help improve performance on the minority class.
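
Class weighting is another commonly used option that can be combined with an ensemble. It is not used in this notebook; the sketch below only illustrates the idea with the usual "balanced" inverse-frequency formula. The helper name `balanced_class_weights` and the label counts are hypothetical. In recent Spark versions, weights like these could be attached to each row and passed through the classifier's `weightCol` parameter.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical label distribution: 75 non-churners and 25 churners.
labels = ["No"] * 75 + ["Yes"] * 25
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it count for more during training.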


<a id = "3" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> Data Engineering </span>**

##### **Data Preprocessing:**

In [None]:
# Convert "Churn" column from string to numeric label
indexer = StringIndexer(inputCol="Churn", outputCol="label")
data = indexer.fit(df).transform(df)

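Here, the "Churn" column, which contains the target variable, is converted from string labels ("Yes" and "No") to numeric labels using the StringIndexer transformer. By default, StringIndexer assigns indices in order of descending frequency, so the more frequent value "No" becomes 0 and "Yes" becomes 1.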

The resulting DataFrame data contains the new "label" column.
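
The frequency-based ordering can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: the helper `frequency_desc_index` and the sample values are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical sample in which "No" is the more frequent label.
churn_values = ["No", "Yes", "No", "No", "Yes", "No"]
mapping = frequency_desc_index(churn_values)
print(mapping)
```

Because the mapping depends on the data, a sample in which "Yes" is more frequent would flip the labels. If a fixed mapping is preferred, StringIndexer's `stringOrderType` parameter can be set to `"alphabetAsc"`, which orders "No" before "Yes" regardless of frequency.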

In [None]:

# Remove unnecessary columns
data = data.drop('Index', 'customerID', 'Churn')

# Convert categorical variables to numerical using StringIndexer
categorical_cols = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService',
                    'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
                    'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']

indexers = [StringIndexer(inputCol=col, outputCol=col+"_index") for col in categorical_cols]
pipeline = Pipeline(stages=indexers)
data_transformed = pipeline.fit(data).transform(data)

This part of the code converts categorical variables in the DataFrame into numerical values using StringIndexer.

The list categorical_cols contains the names of columns that need to be converted.

The code creates one StringIndexer transformer per categorical column, each writing its output to a new column with the suffix "_index", and builds a Pipeline to apply these transformers to the data. The transformed DataFrame is stored in data_transformed.

<a id = "4" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> Machine Learning (Random Forest Classifier) </span>**

This step creates a feature vector for the machine learning model. 

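The VectorAssembler combines the indexed columns created by StringIndexer (the "_index" columns) into a single column named "features". Because feature_cols lists only these indexed categorical columns, any other columns in the table are not part of the feature vector.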

The DataFrame data_final now contains the "features" column along with the "label" column, which will be used for model training.

In [None]:
# Create a vector of features using VectorAssembler
feature_cols = [col+"_index" for col in categorical_cols]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data_final = assembler.transform(data_transformed)


#### Train Test Split

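The data is split into training and testing sets using an 80-20 ratio: about 80% of the data is used for training, and about 20% is used for testing the model's performance. randomSplit assigns rows at random, so the resulting proportions are approximate, and the seed makes the split reproducible.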

In [None]:
# Split the data into training and testing sets
train_data, test_data = data_final.randomSplit([0.8, 0.2], seed=42)
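
The reason a random split gives only approximate proportions can be sketched in plain Python. This is a simplified illustration under the assumption that each row is assigned independently at random; it is not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test independently with a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))  # stand-in for 1,000 DataFrame rows
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

The two sizes add up to the total but are rarely exactly 800 and 200, and reusing the same seed reproduces the same split.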

The Random Forest Classifier is created with the number of trees set to 100. The model is then trained on the training data.

In [None]:
# Create and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
model = rf_classifier.fit(train_data)

The trained model is used to make predictions on the test data.

In [None]:
# Make predictions on the test data
predictions = model.transform(test_data)

The predictions (along with the true labels) from the test data are saved to a CSV file named "predictions.csv" for further analysis.

In [None]:
# Save the predictions to a CSV file
predictions.select('label', 'prediction').toPandas().to_csv('predictions.csv', index=False)

In [None]:
# Output the predicted results
#predictions.select('label', 'prediction', 'probability').show(10, truncate=False)

The model's performance is evaluated using the BinaryClassificationEvaluator, which by default calculates the area under the ROC curve (AUC). AUC is a ranking metric and is not the same as accuracy. The AUC value is then printed to the console.

In [None]:
# Evaluate the model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# areaUnderROC is the evaluator's default metric; it is set explicitly for clarity
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)

### Model Performance


In [None]:
print("Area under ROC:", auc)
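
Since the evaluator reports the area under the ROC curve, it may help to see how that quantity differs from accuracy. The sketch below computes AUC as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, and compares it with accuracy at a 0.5 threshold. The helper names and the labels and scores are hypothetical, and the pairwise method is an illustration rather than the way Spark computes the metric.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are needed to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy_at(labels, scores, threshold=0.5):
    """Accuracy after turning scores into 0/1 predictions at the threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


# Hypothetical labels (1 = churn) and predicted churn probabilities.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.45, 0.40, 0.35]

print("AUC:", roc_auc(labels, scores))
print("Accuracy at 0.5:", accuracy_at(labels, scores))
```

On this toy data the two numbers differ: AUC measures how well the scores rank churners above non-churners, while accuracy depends on the chosen threshold.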

<a id = "6" ></a>

**<span style="color:#0F52BA;font-family:serif; font-size:24px;"> Why This Model (Random Forest Classifier) </span>**

I chose the Random Forest Classifier for this model for the following reasons:

- Random Forest is an ensemble method that combines multiple decision trees to make predictions. It mitigates the risk of overfitting and improves the model's generalization ability.

- Random Forest can effectively handle non-linear relationships between features and the target variable, making it suitable for various real-world problems where linear models might not be sufficient.

- Random Forest provides a feature importance score, which helps identify the most influential features in making predictions. This information can be valuable for feature selection and understanding the data.

- Random Forest is less sensitive to outliers compared to individual decision trees, as it aggregates predictions from multiple trees.

- Implementing and training a Random Forest model is relatively easy. It requires minimal hyperparameter tuning compared to other complex models.

- Power BI currently uses Spark, which can efficiently handle large datasets and distributed processing. A Spark-based Random Forest can therefore scale to big data environments and provide faster model training and prediction times.

- Random Forest often works well on data with class imbalance, as discussed in the class imbalance section above.
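
The feature importance point above can be illustrated with a small ranking helper. This is a sketch under stated assumptions: `rank_features` is a hypothetical helper, and the importance values are made up for illustration only and are not results from this notebook. In the notebook, the same helper could presumably be called with `feature_cols` and `model.featureImportances.toArray()`.

```python
def rank_features(feature_names, importances, top_n=3):
    """Pair feature names with importance scores and return the top_n, highest first."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical importance values for illustration only; not results from this notebook.
names = ["Contract_index", "TechSupport_index", "gender_index", "PaymentMethod_index"]
scores = [0.40, 0.25, 0.05, 0.30]

for name, score in rank_features(names, scores):
    print(f"{name}: {score:.2f}")
```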