# Assignment 6 - Logistic Regression

## Binary Customer Churn

A company has many customers who use its service to produce ads for their websites. The company has noticed considerable churn among its clients. Account managers are currently assigned more or less at random, but the company wants a machine learning model that predicts which customers will churn (stop buying the service), so that it can assign account managers to the customers most at risk. Fortunately, historical data is available. Create a classification algorithm that predicts whether or not a customer churned.

The data is saved as historical_data.csv. Here are the fields and their definitions:

    Names: Name of the latest contact at the company
    Age: Customer age
    Total_Purchase: Total ads purchased
    Account_Manager: Binary, 0 = no manager, 1 = account manager assigned
    Years: Total years as a customer
    Num_Sites: Number of websites that use the service
    Onboard_date: Date that the latest contact was onboarded
    Location: Client HQ address
    Company: Name of the client company
    
**NB: Create the model and evaluate it.**

In [1]:
#Step 1: Install Dependencies
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
!pip install -q findspark

#Step 2: Add environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

#Step 3: Initialize Pyspark
import findspark
findspark.init()

In [2]:
#creating spark context
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
sc

In [3]:
from google.colab import files
uploaded = files.upload()

Saving historical_data.csv to historical_data.csv


In [19]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('CustomerChurn').getOrCreate()

In [243]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import (VectorAssembler,OneHotEncoder,StringIndexer)

In [261]:
data = spark.read.csv("historical_data.csv",inferSchema=True,header=True)

In [262]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: integer (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [263]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [264]:
# Keep only the numeric feature columns and the label, then drop rows with missing values
my_columns = data.select(['Age', 'Total_Purchase', 'Account_Manager', 'Years', 'Num_Sites', 'Churn'])
final_data = my_columns.na.drop()

In [265]:
final_data

DataFrame[Age: int, Total_Purchase: double, Account_Manager: int, Years: double, Num_Sites: int, Churn: int]

In [266]:
assembler = VectorAssembler(
    inputCols=["Age", "Total_Purchase", 
               "Account_Manager","Years","Num_Sites"],
    outputCol="features")

In [267]:
log_reg_final_data = LogisticRegression(featuresCol='features',labelCol='Churn')
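
To make the estimator above less of a black box, here is a minimal plain-Python sketch of how a fitted logistic regression turns one feature vector into a churn probability and a 0/1 prediction. The weights, intercept, and customer values are hypothetical and chosen only for illustration; they are not the coefficients this notebook's model learns. The 0.5 cutoff matches the default threshold of Spark's `LogisticRegression`.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_churn(features, weights, intercept, threshold=0.5):
    """Return (probability, label) for a single customer."""
    # Linear score: intercept plus the weighted sum of the features.
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1.0 if probability > threshold else 0.0
    return probability, label

# Hypothetical coefficients, NOT the fitted model's values.
# Feature order: Age, Total_Purchase, Account_Manager, Years, Num_Sites
weights = [0.05, 0.0, 0.4, 0.6, 1.2]
intercept = -18.0

# Hypothetical customer record in the same feature order.
customer = [42.0, 11000.0, 1.0, 6.5, 12.0]
probability, label = predict_churn(customer, weights, intercept)
print(f"P(churn) = {probability:.3f}, prediction = {label}")
```

Training amounts to choosing the weights and intercept so that these probabilities agree with the observed `Churn` labels as closely as possible.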

In [268]:
pipeline = Pipeline(stages=[
    assembler,log_reg_final_data])

In [269]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])
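
The split above is drawn without a fixed seed, so each run produces a different train/test partition and therefore a slightly different score. `randomSplit` accepts an optional `seed` argument for reproducibility. The plain-Python sketch below is only an analogy for per-row random assignment, not Spark's actual implementation; the helper `random_split` and the integer rows are hypothetical stand-ins.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split independently, with probability
    proportional to the weights, so split sizes are only approximate."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for index, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point edge cases in the cumulative sum.
            splits[-1].append(row)
    return splits

rows = list(range(1000))  # hypothetical stand-in for customer records

# Same seed, same partition on every run.
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)
print(len(train_a), len(test_a), train_a == train_b)
```

With a fixed seed the evaluation becomes repeatable, which makes it easier to tell whether a change in the score comes from the model or from the split.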

In [270]:
fit_model = pipeline.fit(train_data)

In [271]:
results = fit_model.transform(test_data)

In [272]:
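# Note: 'prediction' holds hard 0/1 labels, so the AUC computed by this evaluator
# is based on thresholded predictions; the evaluator's default column is 'rawPrediction'.
Evaluation = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                           labelCol='Churn')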

In [273]:
results.select('Churn','prediction').show()

+-----+----------+
|Churn|prediction|
+-----+----------+
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       0.0|
|    1|       0.0|
|    1|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    1|       1.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows
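
As a quick sanity check, the 20 rows displayed above can be tallied into a confusion matrix by hand. The sketch below copies those (Churn, prediction) pairs into a plain Python list; it covers only these 20 rows, not the full test set, so the figures it prints are illustrative rather than the model's overall performance.

```python
from collections import Counter

# (Churn, prediction) pairs copied from the 20 rows displayed above.
rows = [
    (0, 0.0), (0, 0.0), (0, 0.0), (1, 0.0), (1, 0.0),
    (1, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0),
    (0, 0.0), (0, 0.0), (1, 1.0), (0, 0.0), (0, 0.0),
    (0, 0.0), (1, 1.0), (0, 0.0), (0, 0.0), (0, 0.0),
]

# Count each (actual, predicted) combination.
counts = Counter((label, int(pred)) for label, pred in rows)
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

accuracy = (tp + tn) / len(rows)
recall = tp / (tp + fn)        # share of actual churners that were caught
precision = tp / (tp + fp)     # share of predicted churners that were right

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

In this sample the model never raises a false alarm but misses three of the five churners, which matters here because the stated goal is to find at-risk customers.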



In [274]:
AUC = Evaluation.evaluate(results)
AUC

0.7894989211220331
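
Because the evaluator was given the `prediction` column, the value above is an AUC computed from hard 0/1 labels. In that case the ROC curve has a single interior point, and the area reduces to the average of the true positive rate and the true negative rate (balanced accuracy). The plain-Python sketch below illustrates the difference between scoring with continuous model outputs and with thresholded labels. The labels and scores are hypothetical, not taken from this dataset, and the `auc` helper is a simple pairwise implementation written for this illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive outscores a random negative (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical true labels and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.45, 0.3, 0.2, 0.1, 0.05]

# Thresholding at 0.5 discards the ranking information in the scores.
hard = [1.0 if s > 0.5 else 0.0 for s in scores]

auc_scores = auc(labels, scores)
auc_hard = auc(labels, hard)

# Balanced accuracy of the hard predictions.
tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / labels.count(0)
balanced = (tpr + tnr) / 2

print(f"AUC from scores:      {auc_scores:.4f}")
print(f"AUC from hard labels: {auc_hard:.4f}")
print(f"Balanced accuracy:    {balanced:.4f}")
```

To score the ranking quality of the model itself in Spark, the evaluator can be left at its default `rawPredictionCol='rawPrediction'`; the resulting number would generally differ from the one reported above.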