
We are building a logistic regression model to help the managers of a marketing agency identify which customers will churn and which will remain clients. The data used by the model is attached to the notebook; re-attach it whenever you restart the Google Colaboratory runtime. The company can then run the model against incoming data for future customers to predict which ones will churn and assign them an account manager.

In [1]:
# install Java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [2]:
# download Spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

In [3]:
# extract the Spark archive into the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz


In [4]:
# point the JAVA_HOME and SPARK_HOME environment variables at the installed folders
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

In [5]:
# install findspark using pip
!pip install -q findspark

In [6]:
import findspark
findspark.init()

In [7]:
from pyspark.sql import SparkSession

In [8]:
spark = SparkSession.builder.appName('logisticregresioncustomerchurn').getOrCreate()

In [9]:
data = spark.read.csv('/customer_churn.csv',inferSchema=True,
                     header=True)

In [10]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [11]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|       Onboard_date|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+-------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|               null|                null|                null|0.16666666666666666|
| stddev| 

In [12]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [13]:
# format the data for machine learning: only the numeric columns are kept as features; string columns are dropped

from pyspark.ml.feature import VectorAssembler

In [14]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'],outputCol='features')

In [15]:
output = assembler.transform(data)

In [16]:
final_data = output.select('features','churn')
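
Conceptually, `VectorAssembler` packs the chosen numeric columns of each row into a single vector, in the order given by `inputCols`. The plain-Python sketch below illustrates that idea without Spark; the row values are made up for illustration and `assemble` is a hypothetical helper, not part of PySpark.

```python
# Column order mirrors the inputCols list passed to VectorAssembler above.
input_cols = ['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites']

# A made-up row (illustrative values, not taken from customer_churn.csv).
row = {
    'Names': 'Example Customer',
    'Age': 42.0,
    'Total_Purchase': 11000.0,
    'Account_Manager': 1,
    'Years': 5.0,
    'Num_Sites': 8.0,
    'Company': 'Example Ltd',
    'Churn': 0,
}

def assemble(row, cols):
    """Hypothetical helper: gather the listed columns into one list of floats."""
    return [float(row[c]) for c in cols]

features = assemble(row, input_cols)
print(features)
```

The string columns (`Names`, `Company`, and so on) are simply left out of the vector, which is how they end up excluded from the model's inputs.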

In [17]:
# split the data into training and test sets (70% / 30%)
train_churn,test_churn = final_data.randomSplit([0.7,0.3])

In [18]:
# fit the model

from pyspark.ml.classification import LogisticRegression

In [19]:
lr_churn = LogisticRegression(labelCol='churn')

In [20]:
fitted_churn_model = lr_churn.fit(train_churn)

In [22]:
training_sum = fitted_churn_model.summary

In [23]:
training_sum.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                636|                636|
|   mean|0.16981132075471697|0.13364779874213836|
| stddev| 0.3757624843688367| 0.3405413409452437|
|    min|                0.0|                0.0|
|    max|                1.0|                1.0|
+-------+-------------------+-------------------+
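
Because both columns in this summary contain only 0 and 1, each mean can be turned back into a row count. A quick arithmetic sketch using the numbers printed above:

```python
# Values copied from the describe() output above.
n_rows = 636
mean_churn = 0.16981132075471697       # share of rows labelled as churned
mean_prediction = 0.13364779874213836  # share of rows predicted as churned

# For a 0/1 column, mean * count is the number of ones.
actual_churners = round(n_rows * mean_churn)
predicted_churners = round(n_rows * mean_prediction)

print(actual_churners, predicted_churners)
```

On this training split the model flags fewer customers as churners (85) than actually churned (108).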



In [24]:
# evaluate the model on the test data
from pyspark.ml.evaluation import BinaryClassificationEvaluator


In [25]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [26]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+--------------------+----------+
|            features|churn|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|[22.0,11254.38,1....|    0|[4.81405794433994...|[0.99195045780226...|       0.0|
|[27.0,8628.8,1.0,...|    0|[5.48979961667903...|[0.99588830443167...|       0.0|
|[29.0,12711.15,0....|    0|[6.03610259880077...|[0.99761484365697...|       0.0|
|[30.0,6744.87,0.0...|    0|[3.62783530934458...|[0.97411423328275...|       0.0|
|[30.0,8403.78,1.0...|    0|[5.94776186146236...|[0.99739512381911...|       0.0|
|[30.0,8677.28,1.0...|    0|[4.13169202112987...|[0.98419802238141...|       0.0|
|[30.0,10960.52,1....|    0|[2.40643044883760...|[0.91731634487096...|       0.0|
|[30.0,12788.37,0....|    0|[2.96341472713700...|[0.95089368991660...|       0.0|
|[31.0,5387.75,0.0...|    0|[2.56001503107025...|[0.92824345892530...|       0.0|
|[31.0,7073.61,0
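
For a binary logistic regression, the first entry of `probability` is the logistic (sigmoid) function applied to the first entry of `rawPrediction`. The sketch below checks this against the first row printed above; the digits are the truncated values shown in the table, so the match is approximate.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First row of the output above (digits as displayed, truncated by show()).
raw_class0 = 4.81405794433994
prob_class0_shown = 0.99195045780226

prob_class0 = sigmoid(raw_class0)
print(prob_class0)

# Probability of churn (class 1) is the complement; with the default
# 0.5 threshold this row is therefore predicted as 0.0 (no churn).
prob_class1 = 1.0 - prob_class0
predicted = 1.0 if prob_class1 > 0.5 else 0.0
```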

In [27]:
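# use the area under the ROC curve (AUC) to evaluate the model
# note: rawPredictionCol is set to the hard 0/1 'prediction' column here, so the
# AUC reflects that single decision threshold; the evaluator's default column
# is 'rawPrediction', which scores the full ranking
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='churn')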

In [28]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [29]:
auc

0.7445302445302446
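
The evaluator in this notebook is given the hard 0/1 `prediction` column, so the AUC it reports summarises a single decision threshold rather than the full ranking of customers. The plain-Python sketch below, on made-up labels and scores, shows how the two numbers can differ; `pairwise_auc` is a hypothetical helper that uses the pairwise-ranking definition of AUC, not Spark's implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up example data (not from the churn dataset).
labels = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.45, 0.90]

# Hard predictions at a 0.5 threshold, like the 'prediction' column.
hard = [1.0 if s > 0.5 else 0.0 for s in scores]

auc_scores = pairwise_auc(labels, scores)  # uses the full ranking
auc_hard = pairwise_auc(labels, hard)      # uses only the 0/1 decision
print(auc_scores, auc_hard)
```

In PySpark, the score-based figure corresponds to leaving `rawPredictionCol` at its default, `'rawPrediction'`.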

Now we apply the model to data that was not used to build it: a file of new customers. The model is refit on the full labelled dataset and then predicts whether each new customer will churn.

In [30]:
final_lr_model = lr_churn.fit(final_data)

In [31]:
new_customers = spark.read.csv('/new_customers.csv',inferSchema=True,
                              header=True)

In [32]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [33]:
test_new_customers = assembler.transform(new_customers)

In [34]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [35]:
final_results = final_lr_model.transform(test_new_customers)

In [36]:
final_results.select('Company','prediction').show()

+----------------+----------+
|         Company|prediction|
+----------------+----------+
|        King Ltd|       0.0|
|   Cannon-Benson|       1.0|
|Barron-Robertson|       1.0|
|   Sexton-Golden|       1.0|
|        Wood LLC|       0.0|
|   Parks-Robbins|       1.0|
+----------------+----------+
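
To act on these predictions, the companies predicted to churn can be pulled out as the list that needs an account manager. A small plain-Python sketch over the rows printed above (in PySpark the same selection could be written with `DataFrame.filter`):

```python
# Rows copied from the output above: (Company, prediction).
results = [
    ('King Ltd', 0.0),
    ('Cannon-Benson', 1.0),
    ('Barron-Robertson', 1.0),
    ('Sexton-Golden', 1.0),
    ('Wood LLC', 0.0),
    ('Parks-Robbins', 1.0),
]

# Keep only the companies the model predicts will churn.
needs_account_manager = [company for company, pred in results if pred == 1.0]
print(needs_account_manager)
```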

