<a href="https://colab.research.google.com/github/mdabushad/Customer-Churn-Analysis/blob/main/Customer_Churn_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Customer Churn Analysis Using PySpark**

## Import the import libraries


In [1]:
# install PySpark
!pip install pyspark


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m310.8/310.8 MB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=737ee52e2e6e1b726276324039cd22224fb47334c3245d217919c0a80c66aeb6
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [4]:
#import the modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px


In [9]:
#pyspark SQL functions
from pyspark.sql.functions import col, when, count

#pyspark datapreprocessing modules
from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler, OneHotEncoder, StandardScaler


#pyspark data modeling and model evaluation modules
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

## Build Spark Session


In [14]:
#import the spark session module 
from pyspark.sql import SparkSession

# create spark session object
spark = SparkSession.builder.appName("Customer_Churn_Prediction").getOrCreate()
spark

## Downloading Dataset From Kaggle


In [15]:
#Setting- Up Kagggle API Credentials
import os
os.environ['KAGGLE_USERNAME'] = "mdabushad"
os.environ['KAGGLE_KEY'] = "512195e4116213d6fa87d3986c6708de"

# Downloading the Telco Customer Churn data
!kaggle datasets download -d blastchar/telco-customer-churn


Downloading telco-customer-churn.zip to /content
  0% 0.00/172k [00:00<?, ?B/s]
100% 172k/172k [00:00<00:00, 72.1MB/s]


In [17]:
# Unzip the downloaded dataset
!unzip /content/telco-customer-churn.zip 

Archive:  /content/telco-customer-churn.zip
  inflating: WA_Fn-UseC_-Telco-Customer-Churn.csv  


##Load the Breast Cancer Dataset

In [21]:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true") \
    .load("/content/WA_Fn-UseC_-Telco-Customer-Churn.csv")

In [25]:
#preview the dataset
df.show()

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|7590-VHVEG|Female|            0|    Yes|        No|     1|  

In [26]:
#print schema of data
df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



In [28]:
# print number of rows and columns 
num_rows = df.count()
num_columns = len(df.columns)
print("Number of rows: ", num_rows)
print("Number of columns: ",num_columns)

Number of rows:  7043
Number of columns:  21


## Exploratory Data Analysis


*   Distribution Analysis
*   Correlation Analysis
*   Univariate Analysis
*   Finding Missing Values






In [31]:
from pyspark.sql.functions import col, sum

# Check for null values in each column
null_counts = df.select([sum(col(column).isNull().cast("int")).alias(column) for column in df.columns])

# Display the null counts for each column
null_counts.show()


+----------+------+-------------+-------+----------+------+------------+-------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------+----------------+-------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|MultipleLines|InternetService|OnlineSecurity|OnlineBackup|DeviceProtection|TechSupport|StreamingTV|StreamingMovies|Contract|PaperlessBilling|PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+-------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------+----------------+-------------+--------------+------------+-----+
|         0|     0|            0|      0|         0|     0|           0|            0|              0|             0|           0|               0|          0|          0|              0|       0|               0| 