# Big Data Churn Prediction with Spark

In this project, we will develop a big data classification model using the Spark library to predict customer churn.

Customer churn refers to customers who stop doing business with a company or cancel a service. It is a critical business metric: a high churn rate can signal customer dissatisfaction or better alternatives in the market, and it directly affects revenue and growth potential. Churn prediction models aim to identify which customers are likely to leave so that companies can take proactive steps to retain them.

<img src= 'https://miro.medium.com/v2/resize:fit:1400/1*TgciopaOk-C8fwtPmmet3w.png' >

### Import Libraries and Dataset

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [2]:
# Step 1: Initialize Spark
spark = SparkSession.builder.appName("ChurnClassification").getOrCreate()

In [3]:
# Step 2: Load the dataset
data = spark.read.csv("churn.csv", inferSchema=True, header=True)

In [4]:
data.show(5)

+---+----------------+----+--------------+---------------+-----+---------+-----+
|_c0|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|Churn|
+---+----------------+----+--------------+---------------+-----+---------+-----+
|  0|Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|    1|
|  1|   Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|    1|
|  2|     Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|    1|
|  3|   Phillip White|42.0|       8010.76|              0| 6.71|     10.0|    1|
|  4|  Cynthia Norton|37.0|       9191.58|              0| 5.56|      9.0|    1|
+---+----------------+----+--------------+---------------+-----+---------+-----+
only showing top 5 rows



In [5]:
# Print summary statistics one column at a time
for col in data.columns:
    data.describe([col]).show()

+-------+------------------+
|summary|               _c0|
+-------+------------------+
|  count|               900|
|   mean|             449.5|
| stddev|259.95191863111916|
|    min|                 0|
|    max|               899|
+-------+------------------+

+-------+-------------+
|summary|        Names|
+-------+-------------+
|  count|          900|
|   mean|         NULL|
| stddev|         NULL|
|    min|   Aaron King|
|    max|Zachary Walsh|
+-------+-------------+

+-------+-----------------+
|summary|              Age|
+-------+-----------------+
|  count|              900|
|   mean|41.81666666666667|
| stddev|6.127560416916251|
|    min|             22.0|
|    max|             65.0|
+-------+-----------------+

+-------+-----------------+
|summary|   Total_Purchase|
+-------+-----------------+
|  count|              900|
|   mean|10062.82403333334|
| stddev|2408.644531858096|
|    min|            100.0|
|    max|         18026.01|
+-------+-----------------+

+-------+------

### ML Classification

In [6]:
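# Step 3: Prepare the data for training
# Every column except "Names" and the "Churn" label becomes a feature.
# Note: this includes "_c0", which looks like a row index (0 to 899) rather than a customer attribute.
feature_columns = [col for col in data.columns if col not in ["Names", "Churn"]]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="Features")
data = assembler.transform(data).select("Features", "Churn")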
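
The feature list above keeps every column except `Names` and `Churn`, so it also keeps `_c0`. Judging by its summary statistics (count 900, min 0, max 899), `_c0` appears to be a row index saved with the CSV, and a row index can leak the label if the file happens to be ordered by `Churn`. A minimal sketch of one way to exclude it follows. The helper name `select_feature_columns` and its `excluded` default are hypothetical and not part of the original notebook.

```python
def select_feature_columns(columns, excluded=("_c0", "Names", "Churn")):
    """Return the columns to assemble, dropping identifiers and the label.

    The default `excluded` tuple is an assumption: it treats `_c0` as a
    row index and `Names` as an identifier, neither of which describes
    customer behaviour.
    """
    return [name for name in columns if name not in excluded]


# Column names as printed by data.show(5) earlier in the notebook.
columns = [
    "_c0", "Names", "Age", "Total_Purchase",
    "Account_Manager", "Years", "Num_Sites", "Churn",
]

feature_columns = select_feature_columns(columns)
print(feature_columns)
```

In the notebook, the result could be passed to `VectorAssembler(inputCols=feature_columns, outputCol="Features")` in place of the list built above. The score reported later was obtained with `_c0` included, so a rerun without it may give a different number.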


In [7]:
# Step 4: Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.7, 0.3], seed=42)
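
`randomSplit` assigns each row to a split at random according to the weights, so the resulting sizes are only approximately 70% and 30%, and the `seed` makes the assignment repeatable. The plain-Python sketch below illustrates that idea on 900 hypothetical row ids. The function `bernoulli_split` is an illustrative stand-in and is not Spark's implementation.

```python
import random


def bernoulli_split(rows, train_weight=0.7, seed=42):
    """Send each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


# Hypothetical row ids standing in for the 900 customers.
rows = list(range(900))
train_rows, test_rows = bernoulli_split(rows)

# The sizes are close to 630 and 270 but usually not exactly equal to them.
print(len(train_rows), len(test_rows))
```

Because rows are sampled independently, the share of churners in each split can also drift slightly from the overall rate, so it may be worth checking the class counts in `train_data` and `test_data` after splitting.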

In [8]:
# Step 5: Train a gradient-boosted trees (GBT) classifier
gbt = GBTClassifier(labelCol="Churn", featuresCol="Features")
model = gbt.fit(train_data)

In [9]:
# Step 6: Make predictions on the test data
predictions = model.transform(test_data)

In [10]:
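# Step 7: Evaluate the model
# BinaryClassificationEvaluator defaults to areaUnderROC, so this score is AUC, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="Churn", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)

Area under ROC: 0.9973262032085561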
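
The score printed above comes from `BinaryClassificationEvaluator`, whose default metric is area under the ROC curve (AUC) rather than accuracy. The two can differ substantially. The plain-Python sketch below computes both on a tiny set of made-up labels and scores. The pairwise AUC formula used here is the textbook definition and is not necessarily how Spark computes the metric internally.

```python
def accuracy_at_threshold(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def area_under_roc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, made up for illustration only.
labels = [1, 1, 0, 0, 0]
scores = [0.45, 0.40, 0.30, 0.20, 0.10]

# Both churners score below 0.5, so they are misclassified at that threshold,
# yet they still rank above every non-churner.
print("Accuracy at 0.5:", accuracy_at_threshold(labels, scores))
print("Area under ROC:", area_under_roc(labels, scores))
```

Here the ranking is perfect (AUC of 1.0) while accuracy at the 0.5 threshold is only 0.6. In Spark ML, an accuracy figure would typically come from `MulticlassClassificationEvaluator` with `metricName="accuracy"`.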


In [11]:
spark.stop()

### Conclusion

In this project, we built a customer churn prediction model with Spark. We loaded a 900-row customer dataset, used VectorAssembler to combine the numeric columns into a single feature vector, and trained a GBTClassifier (Gradient Boosted Trees) on a 70/30 train/test split. We evaluated the model with BinaryClassificationEvaluator, whose default metric is area under the ROC curve rather than accuracy, and it scored about 0.997 on the test set. The dataset used here is small, but the pipeline is written against Spark's distributed DataFrame and ML APIs, so the same steps are intended to scale to much larger datasets.