## SVM Model

The [ML package](http://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#) (`pyspark.ml`) is Spark's newer, DataFrame-based machine learning library. It provides an API for chaining data transformers, estimators and model selectors into pipelines.

**Import dependencies** 

In [1]:
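# Import only the functions used below; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import col, when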
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

In [2]:
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Load Data

In [4]:
df = spark.read.csv('../data/Telco-Customer-Churn.csv', header = True, inferSchema = True)

In [5]:
df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



**Replace/Drop Missing Values**

In [6]:
# Replace blank TotalCharges entries with null, cast the column to float, then drop rows containing nulls
dfWithEmptyReplaced = df.withColumn('TotalCharges', when(col('TotalCharges') == ' ', None).otherwise(col('TotalCharges')).cast("float"))
dfWithEmptyReplaced = dfWithEmptyReplaced.na.drop()
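
The clean-up step above can be mirrored in plain Python on a few made-up values. This is only a sketch: the helper name `parse_total_charges` and the sample rows are illustrative assumptions, and it treats any whitespace-only string as missing, which is slightly broader than the single-space check in the notebook.

```python
from typing import Optional

def parse_total_charges(raw: str) -> Optional[float]:
    """Return None for a blank entry, otherwise the value as a float."""
    if raw.strip() == "":
        return None
    return float(raw)

# Hypothetical sample rows: (customerID, TotalCharges as read from the CSV)
rows = [("A", "29.85"), ("B", " "), ("C", "1889.5")]

# Parse each value, then drop rows whose TotalCharges is missing
parsed = [(cid, parse_total_charges(tc)) for cid, tc in rows]
cleaned = [(cid, tc) for cid, tc in parsed if tc is not None]
print(cleaned)
```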

In [7]:
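# Replace 'No internet service' with 'No' in the following columns
replace_cols = [ 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport','StreamingTV', 'StreamingMovies']

In [8]:
# Chain the replacements on one DataFrame so that every column is updated.
# Restarting from dfWithEmptyReplaced on each iteration would keep only the
# last column's replacement; the 28-feature outputs recorded further down
# appear to come from such a run, so a rerun will give a shorter vector.
dfwithNo = dfWithEmptyReplaced
for col_name in replace_cols:
    dfwithNo = dfwithNo.withColumn(col_name, when(col(col_name) == "No internet service", "No").otherwise(col(col_name)))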

In [9]:
dfwithNo.createOrReplaceTempView("datawrangling")

In [10]:
# Use Spark SQL to bucket tenure into categories (tenure_group)
df_wrangling = spark.sql("""
select distinct 
         customerID
        ,gender
        ,SeniorCitizen
        ,Partner
        ,Dependents
        ,tenure
        ,case when (tenure<=12) then "Tenure_0-12"
              when (tenure>12 and tenure <=24) then "Tenure_12-24"
              when (tenure>24 and tenure <=48) then "Tenure_24-48"
              when (tenure>48 and tenure <=60) then "Tenure_48-60"
              when (tenure>60) then "Tenure_gt_60"
        end as tenure_group
        ,PhoneService
        ,MultipleLines
        ,InternetService
        ,OnlineSecurity
        ,OnlineBackup
        ,DeviceProtection
        ,TechSupport
        ,StreamingTV
        ,StreamingMovies
        ,Contract
        ,PaperlessBilling
        ,PaymentMethod
        ,MonthlyCharges
        ,TotalCharges
        ,Churn
    from datawrangling
""")


In [11]:
# Select the categorical columns from the dataset
categoricalColumns = ['gender','SeniorCitizen','Partner','Dependents','PhoneService','MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract','PaperlessBilling','PaymentMethod']
stages = [] # stages in our Pipeline

In [12]:
for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Use OneHotEncoderEstimator (renamed OneHotEncoder in Spark 3.0) to convert the indexed categories into binary SparseVectors
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages.  These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]
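
To make the two stages concrete, the following plain-Python sketch imitates what they do to a single column: labels are indexed by descending frequency, and the one-hot vector drops the last category (the default `dropLast=True` behaviour), so a column with k categories yields k - 1 entries. The function names, the alphabetical tie-break and the sample values are assumptions of this sketch; Spark also emits sparse vectors rather than dense lists.

```python
from collections import Counter

def fit_string_index(values):
    """Assign index 0 to the most frequent label, 1 to the next, and so on.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot_drop_last(index, num_categories):
    """Encode an index as a dense 0/1 list of length num_categories - 1.

    The last category maps to all zeros, as with dropLast=True.
    """
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec

# Hypothetical column values
contract = ["Month-to-month", "Two year", "Month-to-month", "One year",
            "Month-to-month", "Two year"]
mapping = fit_string_index(contract)
encoded = [one_hot_drop_last(mapping[v], len(mapping)) for v in contract]
print(mapping)
print(encoded[:4])
```

In this example the least frequent label, "One year", receives the highest index and is therefore encoded as all zeros.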

In [13]:
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="Churn", outputCol="label")
stages += [label_stringIdx]

**Transforming all features into a vector using VectorAssembler**

In [14]:
# Transform all features into a vector using VectorAssembler
numericCols = ['MonthlyCharges', 'TotalCharges']
assemblerInputs = numericCols + [c + "classVec" for c in categoricalColumns]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
IDcols = ['customerID']

**Create a pipeline to transform dataset**

In [15]:
# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(df_wrangling)
dataset = pipelineModel.transform(df_wrangling)
# Keep relevant columns
selectedcols= ["label", "features"] + IDcols
dataset = dataset.select(selectedcols)

In [16]:
dataset.printSchema()

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = true)
 |-- customerID: string (nullable = true)



In [17]:
dataset.show(5)

+-----+--------------------+----------+
|label|            features|customerID|
+-----+--------------------+----------+
|  0.0|(28,[0,1,3,6,7,10...|6497-TILVL|
|  1.0|(28,[0,1,3,5,6,7,...|0691-JVSYA|
|  0.0|(28,[0,1,3,4,5,6,...|8544-GOQSH|
|  0.0|(28,[0,1,2,3,4,5,...|5172-MIGPM|
|  0.0|(28,[0,1,2,3,5,6,...|4312-KFRXN|
+-----+--------------------+----------+
only showing top 5 rows
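
The `features` column is printed in sparse form: the vector size, then the indices of the non-zero entries, then their values (truncated in the table above). A small sketch of how that triple expands to a dense vector, using a made-up 6-element example and a hypothetical helper name:

```python
def sparse_to_dense(size, indices, values):
    """Expand (size, indices, values) into a dense list of floats."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Hypothetical 6-element vector with three non-zero entries
print(sparse_to_dense(6, [0, 1, 4], [29.85, 29.85, 1.0]))
# [29.85, 29.85, 0.0, 0.0, 1.0, 0.0]
```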



### Create Training and Test Set

In [18]:
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=200)
trainingData.createOrReplaceTempView("train")
print(trainingData.count())
testData.createOrReplaceTempView("test")
print(testData.count())

4956
2076
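
The two counts add up to 7032 rows, split roughly 70.5% / 29.5% rather than exactly 70/30. That is expected: `randomSplit` treats the weights as sampling probabilities, so the resulting sizes are only approximate. The sketch below illustrates the idea in plain Python; it is a simplification, not Spark's actual sampling code, and `random_split` is a name made up for this example.

```python
import random

def random_split(rows, train_weight, seed):
    """Send each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))  # stand-in for 1000 records
train, test = random_split(rows, 0.7, seed=200)
# Sizes are close to 700/300 but not guaranteed to match exactly
print(len(train), len(test))
```

Fixing the seed makes the split reproducible across runs, which is why the notebook passes `seed=200`.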


### SVM

In [19]:
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Fit the model
lsvcModel = lsvc.fit(trainingData)

# Print the coefficients and intercept for linear SVC
print("Coefficients: " + str(lsvcModel.coefficients))
print("Intercept: " + str(lsvcModel.intercept))

Coefficients: [0.00042724081633790673,-8.573555217473973e-05,-0.04471306481844391,-0.416574209785905,0.019049304116676744,-0.0398179141438049,-0.2453784281612095,-0.08096360598197436,-0.00818476483508787,0.24027995845221925,-0.331431377675376,0.10702837207860726,-0.13714519445870801,0.06826010312380384,-0.17506651072989457,0.039217561480580775,-0.10021572538598325,0.09670638629962823,-0.09994433479982602,-0.014495104111961787,0.007117860759516185,-0.22556579617520578,0.13694887080229212,-0.24081489359760241,0.05630268377120266,0.11547533860162605,-0.11734667382467004,-0.10817768800674225]
Intercept: -0.07407941165624629
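
The coefficients and intercept define a linear decision function: a row is assigned class 1.0 when the weighted sum of its features plus the intercept exceeds the threshold (0.0 by default in `LinearSVC`). A minimal sketch of that rule, using a small hypothetical model rather than the fitted values above:

```python
def linear_svc_predict(coefficients, intercept, features, threshold=0.0):
    """Return 1.0 if w.x + b exceeds the threshold, else 0.0."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    return 1.0 if margin > threshold else 0.0

# Hypothetical 3-feature model and inputs (not the fitted values above)
w = [0.5, -1.0, 0.25]
b = -0.1
print(linear_svc_predict(w, b, [1.0, 0.0, 0.0]))  # margin = 0.4  -> 1.0
print(linear_svc_predict(w, b, [0.0, 1.0, 0.0]))  # margin = -1.1 -> 0.0
```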


In [20]:
predictions = lsvcModel.transform(testData)

**Model Evaluation**

In [21]:
# BinaryClassificationEvaluator's default metric is areaUnderROC, not accuracy
evaluator = BinaryClassificationEvaluator()
auc = evaluator.evaluate(predictions)

In [22]:
print('Area under ROC on test data:', auc)

Area under ROC on test data: 0.8232954857111511
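
Note that `BinaryClassificationEvaluator` reports the area under the ROC curve by default, which measures how well the raw scores rank positives above negatives and can differ noticeably from accuracy. The plain-Python sketch below computes both on a handful of made-up labels and margins to show the gap. The pairwise formula is one standard way to compute AUC and is not necessarily how Spark computes it internally.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties in score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.0):
    """Fraction of rows whose thresholded score matches the label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical labels and raw SVM margins
labels = [1, 1, 0, 0, 0]
scores = [0.8, -0.2, -0.5, -0.9, 0.1]
print(area_under_roc(labels, scores))  # 5 of 6 pairs ranked correctly
print(accuracy(labels, scores))        # 3 of 5 predictions correct
```

To report accuracy in Spark itself, one option is the `MulticlassClassificationEvaluator` imported earlier, constructed with `metricName="accuracy"` and evaluated on the same `predictions` DataFrame.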
