```
Prakash Dhimal
Manav Garkel
George Mason University
CS 657 Mining Massive Datasets
Final Project: Sparkify
```

#### Customer Churn prediction
The goal of this project is to build a customer churn prediction model using user interaction logs with an imaginary music streaming service called Sparkify.

#### Data

Sparkify is an imaginary digital music service similar to Spotify. The dataset contains 12GB of user interactions with this fictitious music streaming service.

This notebook provides a step-by-step explanation of our work leading to customer churn prediction using four different models (plus a hybrid model). Let's get to it.

In [1]:
from pyspark.sql import SparkSession

Initialize a spark session

In [2]:
spark = SparkSession.builder.master("local[*]").appName("sparkify").getOrCreate()
spark

Read the data

In [3]:
data = spark.read.json("../../../data/sparkify_event_data.json")

Look at the columns

In [4]:
data.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



Look at the data. We use a vertical display so that each column is printed on its own line.

In [5]:
data.show(vertical=True, n=2)

-RECORD 0-----------------------------
 artist        | Popol Vuh            
 auth          | Logged In            
 firstName     | Shlok                
 gender        | M                    
 itemInSession | 278                  
 lastName      | Johnson              
 length        | 524.32934            
 level         | paid                 
 location      | Dallas-Fort Worth... 
 method        | PUT                  
 page          | NextSong             
 registration  | 1533734541000        
 sessionId     | 22683                
 song          | Ich mache einen S... 
 status        | 200                  
 ts            | 1538352001000        
 userAgent     | "Mozilla/5.0 (Win... 
 userId        | 1749042              
-RECORD 1-----------------------------
 artist        | Los Bunkers          
 auth          | Logged In            
 firstName     | Vianney              
 gender        | F                    
 itemInSession | 9                    
 lastName      | Miller  

The most important column in the data set seems to be the `page` column. It holds all Sparkify pages that the customers have visited.

In [6]:
data.select("page").dropDuplicates().show(20, False)

+-------------------------+
|page                     |
+-------------------------+
|Cancel                   |
|Submit Downgrade         |
|Thumbs Down              |
|Home                     |
|Downgrade                |
|Roll Advert              |
|Logout                   |
|Save Settings            |
|Cancellation Confirmation|
|About                    |
|Submit Registration      |
|Settings                 |
|Login                    |
|Register                 |
|Add to Playlist          |
|Add Friend               |
|NextSong                 |
|Thumbs Up                |
|Help                     |
|Upgrade                  |
+-------------------------+
only showing top 20 rows



From the list above we can use the `Cancellation Confirmation` page to indicate whether a particular user churned or not. We are building a model that predicts whether a user is going to churn given their history of interactions with the service.

Therefore, it is important to understand the customers who reached the `Cancellation Confirmation` page. These are the customers who churned, and the task is to build a prediction model that can recognize them.

## Data cleaning

Now that we have identified our "goal", we can let go of some of the columns that are not needed for further analysis.

In [7]:
data = data.drop(*['firstName', 'lastName', 'id_copy'])
data.show(vertical=True, n=2)

-RECORD 0-----------------------------
 artist        | Popol Vuh            
 auth          | Logged In            
 gender        | M                    
 itemInSession | 278                  
 length        | 524.32934            
 level         | paid                 
 location      | Dallas-Fort Worth... 
 method        | PUT                  
 page          | NextSong             
 registration  | 1533734541000        
 sessionId     | 22683                
 song          | Ich mache einen S... 
 status        | 200                  
 ts            | 1538352001000        
 userAgent     | "Mozilla/5.0 (Win... 
 userId        | 1749042              
-RECORD 1-----------------------------
 artist        | Los Bunkers          
 auth          | Logged In            
 gender        | F                    
 itemInSession | 9                    
 length        | 238.39302            
 level         | paid                 
 location      | San Francisco-Oak... 
 method        | PUT     

Since we are building a model that focuses on the user, we remove any rows with a null, NaN, or empty `userId`.

In [8]:
from pyspark.sql.functions import isnan, isnull

In [9]:
data.filter((isnan(data['userId'])) | (data['userId'].isNull()) | (data['userId'] == "")).count()

0

In [10]:
data.filter((data['userId'] == "")).count()

0

In [11]:
data = data.filter(data["userId"] != "") 
data.filter((isnan(data['userId'])) | (data['userId'].isNull()) | (data['userId'] == "")).count()

0

The times in the `registration` and `ts` columns are given in milliseconds; we will convert them to seconds. Before that, we remove any rows where `registration` is null.

In [12]:
data = data.filter(data['registration'].isNotNull())

In [13]:
from pyspark.sql.types import DoubleType
import pyspark.sql.functions as F

In [14]:
# Convert milliseconds to seconds with native column arithmetic (no Python UDF needed)
data = data.withColumn("registration", F.col("registration") / 1000) \
    .withColumn("ts", F.col("ts") / 1000)

data.select('registration', 'ts').show()

+-------------+-------------+
| registration|           ts|
+-------------+-------------+
|1.533734541E9|1.538352001E9|
|1.537500318E9|1.538352002E9|
|1.536414505E9|1.538352002E9|
| 1.53438666E9|1.538352003E9|
|1.537381415E9|1.538352003E9|
| 1.53760256E9|1.538352004E9|
|1.536563853E9|1.538352005E9|
|1.538069376E9|1.538352006E9|
|1.536455539E9|1.538352006E9|
|1.533220062E9|1.538352006E9|
|1.534393835E9|1.538352006E9|
|1.537618545E9|1.538352006E9|
|1.537868758E9|1.538352007E9|
|1.534635513E9|1.538352007E9|
|1.531817572E9|1.538352008E9|
|1.528964849E9|1.538352008E9|
| 1.53676124E9|1.538352008E9|
|1.537790336E9|1.538352008E9|
| 1.53421749E9| 1.53835201E9|
|1.536693084E9| 1.53835201E9|
+-------------+-------------+
only showing top 20 rows



#### Label

As mentioned above, we derive the label `churn` from the `page` column. A user has either churned or not, so this is a binary column: 1 for churned and 0 for not churned.

In [15]:
from pyspark.sql.types import IntegerType

In [16]:
cancelation_event_udf = F.udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())
data = data.withColumn("churn", cancelation_event_udf("page"))
data.filter(data['userId'] == 1749042).describe('churn').show()

+-------+--------------------+
|summary|               churn|
+-------+--------------------+
|  count|                1223|
|   mean|8.176614881439084E-4|
| stddev| 0.02859478078502978|
|    min|                   0|
|    max|                   1|
+-------+--------------------+



The problem here is that a particular user has multiple entries, since this is a log of user activity. A user may have visited a number of other pages before reaching the `Cancellation Confirmation` page. For a user who has churned, we want every one of their log entries to record 1 in the `churn` column. For this we use the `Window` class from `pyspark.sql.window`.

For example, user `1749042` above has entries labeled as both churned (1) and not churned (0). We need to make sure that all of his/her entries are labeled as churned. It only takes that one "Cancellation Confirmation" to make a user churn.
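
The idea can be illustrated without Spark. The sketch below is plain Python over a handful of made-up events (the user ids `"A"` and `"B"` and the helper `label_churned_users` are hypothetical). It flags a user as churned if any of their events is a `Cancellation Confirmation`, then copies that flag onto every event of that user, which is the effect we want from the window aggregation.

```python
def label_churned_users(events):
    """Map each user id to 1 if any of their events is a cancellation, else 0."""
    labels = {}
    for user_id, page in events:
        is_cancel = int(page == "Cancellation Confirmation")
        labels[user_id] = max(labels.get(user_id, 0), is_cancel)
    return labels


# Made-up event log: (userId, page)
events = [
    ("A", "NextSong"),
    ("A", "Cancel"),
    ("A", "Cancellation Confirmation"),
    ("B", "NextSong"),
    ("B", "Thumbs Up"),
]

labels = label_churned_users(events)
# Attach the per-user flag to every row of that user
labelled_events = [(user_id, page, labels[user_id]) for user_id, page in events]
print(labels)
```

Every row of user `"A"` ends up with the flag 1, including the rows logged before the cancellation.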

In [17]:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as Fsum

In [18]:
window = Window.partitionBy("userId") \
        .rangeBetween(Window.unboundedPreceding,
                      Window.unboundedFollowing)
data = data.withColumn("churn", Fsum("churn").over(window))


In [19]:
data.filter(data['userId'] == 1749042).describe('churn').show()

+-------+-----+
|summary|churn|
+-------+-----+
|  count| 1223|
|   mean|  1.0|
| stddev|  0.0|
|    min|    1|
|    max|    1|
+-------+-----+



Let's see how many unique users we have

In [20]:
data.select("userID").dropDuplicates().count()

22277

The total number of log entries belonging to churned users (again, each user has many entries):

In [21]:
data.filter(data['churn'] == 1).count()

5382467

Now let's create a label dataframe that has one entry per user

In [22]:
from pyspark.sql.functions import col

In [23]:
label = data \
    .select('userId', col('churn').alias('label')) \
    .dropDuplicates()
label.show()

+-------+-----+
| userId|label|
+-------+-----+
|1000280|    1|
|1002185|    0|
|1017805|    0|
|1030587|    0|
|1033297|    0|
|1057724|    0|
|1059049|    0|
|1069552|    0|
|1071308|    1|
|1076191|    1|
|1083324|    0|
|1102913|    0|
|1114507|    0|
|1133196|    0|
|1142513|    0|
|1151194|    0|
|1156065|    1|
|1178731|    0|
|1180406|    0|
|1190352|    0|
+-------+-----+
only showing top 20 rows



## Feature engineering

Users who have been using the service for a long time tend to stay with it, even as paying customers, more than those who recently signed up.

Every company selling products and services to customers has an idea of the "lifetime value" of a customer. With that in mind we create our first feature.

We converted the times to seconds above. After computing the lifetime of a user in seconds, we convert it to days.
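
As a quick check of the unit handling, the following plain-Python sketch (not part of the Spark pipeline; `lifetime_days` is a hypothetical helper) computes the elapsed days for a single event, using the `registration` and `ts` values of the first record displayed earlier. The actual feature takes the maximum of this quantity over all of a user's events.

```python
MS_PER_SECOND = 1000
SECONDS_PER_DAY = 3600 * 24


def lifetime_days(registration_ms, ts_ms):
    """Days elapsed between registration and an event, both given in milliseconds."""
    registration_s = registration_ms / MS_PER_SECOND
    ts_s = ts_ms / MS_PER_SECOND
    return (ts_s - registration_s) / SECONDS_PER_DAY


# registration and ts of the first record shown by data.show(vertical=True, n=2)
days = lifetime_days(1533734541000, 1538352001000)
print(round(days, 2))  # roughly 53.44 days
```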

In [24]:
from pyspark.sql.functions import sum as Fsum, col

In [25]:
time_since_registration = data \
    .select('userId', 'registration', 'ts') \
    .withColumn('lifetime', (data.ts - data.registration)) \
    .groupBy('userId') \
    .agg({'lifetime': 'max'}) \
    .withColumnRenamed('max(lifetime)', 'lifetime') \
    .select('userId', (col('lifetime') / 3600 / 24).alias('lifetime'))

In [26]:
time_since_registration.describe('lifetime').show()

+-------+-------------------+
|summary|           lifetime|
+-------+-------------------+
|  count|              22277|
|   mean|  83.38872055383476|
| stddev|  40.88235193199312|
|    min|-19.427465277777777|
|    max| 410.25917824074077|
+-------+-------------------+



Businesses big and small rely not only on their customers coming back for more goods and services, but also on those customers referring the business to their friends and family. It is in our nature to recommend things that we like. With this idea in mind we create another feature.

Again we use the `page` column, specifically looking for the `Add Friend` page.

In [27]:
referring_friends = data \
    .select('userID', 'page') \
    .where(data.page == 'Add Friend') \
    .groupBy('userID') \
    .count() \
    .withColumnRenamed('count', 'add_friend')

In [28]:
referring_friends.describe('add_friend').show()

+-------+-----------------+
|summary|       add_friend|
+-------+-----------------+
|  count|            20305|
|   mean|18.79655257325782|
| stddev|20.74770411629507|
|    min|                1|
|    max|              222|
+-------+-----------------+



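The more songs a user listens to, the more likely they are to enjoy the streaming service and keep their subscription. Thus we add a `total_songs` feature. Note that the aggregation below counts every log row for a user, including rows where `song` is null, so it measures total activity rather than songs alone; filtering on `page == 'NextSong'` before grouping would restrict it to songs played.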

In [29]:
# total songs listened
total_songs_listened = data \
    .select('userID', 'song') \
    .groupBy('userID') \
    .count() \
    .withColumnRenamed('count', 'total_songs')

In [30]:
total_songs_listened.describe('total_songs').show()

+-------+------------------+
|summary|       total_songs|
+-------+------------------+
|  count|             22277|
|   mean|1143.8129011985457|
| stddev|1321.2139656987083|
|    min|                 1|
|    max|             13591|
+-------+------------------+



The more songs a user likes on the streaming service, the more it suggests that they enjoy their subscription and the value they get from it, and the more likely they are to keep it. Thus we create a feature that counts the number of songs a user gives a thumbs up to.

In [31]:
# thumbs up
thumbs_up = data \
    .select('userID', 'page') \
    .where(data.page == 'Thumbs Up') \
    .groupBy('userID') \
    .count() \
    .withColumnRenamed('count', 'num_thumb_up')

In [32]:
thumbs_up.describe('num_thumb_up').show()

+-------+------------------+
|summary|      num_thumb_up|
+-------+------------------+
|  count|             21732|
|   mean|52.984769004233385|
| stddev| 64.86699983998629|
|    min|                 1|
|    max|               836|
+-------+------------------+



Similarly, the more songs a user dislikes, the less likely they are to enjoy their subscription and the more likely they are to cancel it.

In [33]:
# thumbs down
thumbs_down = data \
    .select('userID', 'page') \
    .where(data.page == 'Thumbs Down') \
    .groupBy('userID') \
    .count() \
    .withColumnRenamed('count', 'num_thumb_down')

In [34]:
thumbs_down.describe('num_thumb_down').show()

+-------+------------------+
|summary|    num_thumb_down|
+-------+------------------+
|  count|             20031|
|   mean|11.942089760870651|
| stddev| 12.75272884784001|
|    min|                 1|
|    max|               154|
+-------+------------------+



Playlist length counts the number of times a user visited the `Add to Playlist` page, indicating that they added a song to a playlist. A long playlist implies that a user enjoys several songs and wants frequent access to them, which likely leads to them keeping their subscription. Thus, we create a feature for the length of the playlist.

In [35]:
playlist_length = data.select('userID', 'page') \
    .where(data.page == 'Add to Playlist') \
    .groupby('userID').count() \
    .withColumnRenamed('count', 'playlist_length')

In [36]:
playlist_length.describe('playlist_length').show()

+-------+-----------------+
|summary|  playlist_length|
+-------+-----------------+
|  count|            21260|
|   mean|28.12422389463782|
| stddev|32.27499039023109|
|    min|                1|
|    max|              340|
+-------+-----------------+



Average songs per session measures how many songs a user plays, on average, each time they open the application and start a session. The longer the session, the more songs the user is listening to. This could imply that the user is enjoying the application and is less likely to cancel.
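
The two-level aggregation (count songs per session, then average those counts per user) can be sketched in plain Python as follows. The events and the helper `avg_songs_per_session` are made up for illustration. Users with no `NextSong` event never enter the result, which may be why the summary for this feature covers slightly fewer users than the full user count.

```python
from collections import Counter, defaultdict


def avg_songs_per_session(events):
    """events: iterable of (user_id, session_id, page) tuples."""
    # Step 1: number of NextSong events in each (user, session) pair
    songs_in_session = Counter(
        (user_id, session_id)
        for user_id, session_id, page in events
        if page == "NextSong"
    )
    # Step 2: average the per-session counts for each user
    per_user = defaultdict(list)
    for (user_id, _session_id), n_songs in songs_in_session.items():
        per_user[user_id].append(n_songs)
    return {user_id: sum(counts) / len(counts) for user_id, counts in per_user.items()}


# Made-up log: user "A" plays 3 songs in session 1 and 1 song in session 2;
# user "B" only visits the Home page.
events = [
    ("A", 1, "NextSong"), ("A", 1, "NextSong"), ("A", 1, "Home"), ("A", 1, "NextSong"),
    ("A", 2, "NextSong"),
    ("B", 7, "Home"),
]
averages = avg_songs_per_session(events)
print(averages)
```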

In [37]:
avg_songs_played = data.where('page == "NextSong"') \
    .groupby(['userId', 'sessionId']) \
    .count() \
    .groupby(['userId']) \
    .agg({'count': 'avg'}) \
    .withColumnRenamed('avg(count)', 'avg_songs_played')

In [38]:
avg_songs_played.describe('avg_songs_played').show()

+-------+-----------------+
|summary| avg_songs_played|
+-------+-----------------+
|  count|            22261|
|   mean|67.28930119633605|
| stddev|42.00146132153543|
|    min|              1.0|
|    max|            579.0|
+-------+-----------------+



Artist count measures the number of distinct artists the user listens to. A user listening to a wide variety of artists could potentially be enjoying the music application more and is likely to keep their subscription. On the other hand, a user listening to only a handful of artists may not be happy with the current subscription and would be more likely to churn.

In [39]:
artist_count = data \
    .filter(data.page == "NextSong") \
    .select("userId", "artist") \
    .dropDuplicates() \
    .groupby("userId") \
    .count() \
    .withColumnRenamed("count", "artist_count")

In [40]:
artist_count.describe('artist_count').show()

+-------+-----------------+
|summary|     artist_count|
+-------+-----------------+
|  count|            22261|
|   mean|645.0307263824626|
| stddev|602.2479741901458|
|    min|                1|
|    max|             4368|
+-------+-----------------+



Total number of sessions counts how many times the user has started a session on the app. More sessions mean more visits to the application, suggesting the user is less likely to churn.

In [41]:
num_sessions = data \
    .select("userId", "sessionId") \
    .dropDuplicates() \
    .groupby("userId") \
    .count() \
    .withColumnRenamed('count', 'num_sessions')

In [42]:
num_sessions.describe('num_sessions').show()

+-------+------------------+
|summary|      num_sessions|
+-------+------------------+
|  count|             22277|
|   mean|13.334964312968532|
| stddev|13.027024992429286|
|    min|                 1|
|    max|               147|
+-------+------------------+



#### Feature table

Now that we have our features, we will join them to create a table of features.
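
Some feature tables contain fewer users than others (for example, only users who visited `Add Friend` at least once appear in `add_friend`), so the join has to keep users that are missing from a table and fill the gap with 0. The plain-Python sketch below illustrates that outer-join-then-fill logic on two tiny tables; `outer_join_fill_zero` is a hypothetical helper, and the values are taken from rows displayed in this notebook. Since the `add_friend` summary has a minimum of 1, a value of 0 in the joined table can only come from the fill step.

```python
def outer_join_fill_zero(feature_tables):
    """feature_tables: {feature_name: {user_id: value}}.

    Returns {user_id: {feature_name: value}}, using 0 where a user
    is missing from a feature table.
    """
    all_users = set()
    for table in feature_tables.values():
        all_users.update(table)
    return {
        user_id: {name: table.get(user_id, 0) for name, table in feature_tables.items()}
        for user_id in sorted(all_users)
    }


feature_tables = {
    "num_sessions": {"1000280": 22, "1076191": 3},
    "add_friend": {"1000280": 14},  # user 1076191 never visited Add Friend
}
joined = outer_join_fill_zero(feature_tables)
print(joined["1076191"])
```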

In [43]:
cols = ["lifetime",
        "total_songs",
        "num_thumb_up",
        'num_thumb_down',
        'add_friend',
        'playlist_length',
        'avg_songs_played',
        'artist_count',
        'num_sessions']
features = [time_since_registration,
            total_songs_listened,
            thumbs_up,
            thumbs_down,
            referring_friends,
            playlist_length,
            avg_songs_played,
            artist_count,
            num_sessions]

In [44]:
data = features.pop()
while len(features) > 0:
    data = data.join(features.pop(), 'userID', 'outer')

data = data.join(label, 'userID', 'outer').fillna(0)
data.show(vertical=True, n=2)

-RECORD 0------------------------------
 userId           | 1000280            
 num_sessions     | 22                 
 artist_count     | 767                
 avg_songs_played | 48.666666666666664 
 playlist_length  | 25                 
 add_friend       | 14                 
 num_thumb_down   | 33                 
 num_thumb_up     | 53                 
 total_songs      | 1317               
 lifetime         | 77.30377314814815  
 label            | 1                  
-RECORD 1------------------------------
 userId           | 1002185            
 num_sessions     | 17                 
 artist_count     | 1205               
 avg_songs_played | 104.58823529411765 
 playlist_length  | 49                 
 add_friend       | 25                 
 num_thumb_down   | 14                 
 num_thumb_up     | 92                 
 total_songs      | 2080               
 lifetime         | 65.75105324074075  
 label            | 0                  
only showing top 2 rows



In [45]:
data.limit(10).toPandas()

| | userId | num_sessions | artist_count | avg_songs_played | playlist_length | add_friend | num_thumb_down | num_thumb_up | total_songs | lifetime | label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000280 | 22 | 767 | 48.666667 | 25 | 14 | 33 | 53 | 1317 | 77.303773 | 1 |
| 1 | 1002185 | 17 | 1205 | 104.588235 | 49 | 25 | 14 | 92 | 2080 | 65.751053 | 0 |
| 2 | 1017805 | 3 | 223 | 83.333333 | 5 | 13 | 4 | 7 | 320 | 54.266389 | 0 |
| 3 | 1030587 | 11 | 1071 | 163.555556 | 46 | 23 | 16 | 66 | 1752 | 131.597523 | 0 |
| 4 | 1033297 | 5 | 215 | 47.2 | 7 | 4 | 3 | 10 | 299 | 116.144549 | 0 |
| 5 | 1057724 | 40 | 2157 | 98.641026 | 135 | 76 | 29 | 200 | 4669 | 96.044549 | 0 |
| 6 | 1059049 | 5 | 454 | 186.333333 | 16 | 10 | 6 | 29 | 662 | 133.714074 | 0 |
| 7 | 1069552 | 12 | 389 | 37.916667 | 11 | 7 | 6 | 26 | 582 | 126.268843 | 0 |
| 8 | 1071308 | 18 | 1007 | 82.882353 | 26 | 31 | 12 | 74 | 1693 | 63.393692 | 1 |
| 9 | 1076191 | 3 | 47 | 15.666667 | 1 | 0 | 1 | 4 | 64 | 28.905035 | 1 |


In [46]:
data.printSchema()

root
 |-- userId: string (nullable = true)
 |-- num_sessions: long (nullable = true)
 |-- artist_count: long (nullable = true)
 |-- avg_songs_played: double (nullable = false)
 |-- playlist_length: long (nullable = true)
 |-- add_friend: long (nullable = true)
 |-- num_thumb_down: long (nullable = true)
 |-- num_thumb_up: long (nullable = true)
 |-- total_songs: long (nullable = true)
 |-- lifetime: double (nullable = false)
 |-- label: long (nullable = true)



In [47]:
data.groupby('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0|17274|
|    1| 5003|
+-----+-----+



### Vector Assembler and Scaling

There are a few more steps before we can feed the data into a model. First we use a vector assembler to combine the features into a single feature column.

In [48]:
# Vector assembler
from pyspark.ml.feature import StandardScaler, VectorAssembler
assembler = VectorAssembler(inputCols=cols, outputCol="unScaled_features")
data = assembler.transform(data)

We no longer need the individual feature columns, so we keep only the columns we need.

In [49]:
data = data.select('userID', 'unScaled_features', 'label')
data.show(vertical=True, n=2)

-RECORD 0---------------------------------
 userID            | 1000280              
 unScaled_features | [77.3037731481481... 
 label             | 1                    
-RECORD 1---------------------------------
 userID            | 1002185              
 unScaled_features | [65.7510532407407... 
 label             | 0                    
only showing top 2 rows



The min and max values reported for our features above span very different ranges (for example, `total_songs` reaches 13591 while `avg_songs_played` tops out at 579). We use a scaler to put them on a comparable scale.
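
With `withStd=True` and the default `withMean=False`, `StandardScaler` divides each feature by its standard deviation without subtracting the mean. A minimal plain-Python sketch of that operation follows. `scale_to_unit_std` is a hypothetical helper, and the use of the sample (n - 1) standard deviation is our understanding of Spark's behaviour rather than something verified in this notebook. As a sanity check, dividing the `lifetime` of user 1000280 by the `stddev` reported for `lifetime` gives about 1.8909, which agrees with the first component of `features` displayed after the scaling cell.

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation (no mean-centering)."""
    sigma = stdev(values)
    return [v / sigma for v in values]


# Toy feature column (made-up values); its sample standard deviation is 10
toy = [10.0, 20.0, 30.0]
scaled = scale_to_unit_std(toy)

# Cross-check with numbers shown in this notebook:
# lifetime of user 1000280 divided by the reported stddev of `lifetime`
lifetime_scaled = 77.30377314814815 / 40.88235193199312
print(scaled, round(lifetime_scaled, 4))
```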

In [50]:
# scale the features
scaler = StandardScaler(inputCol="unScaled_features", outputCol="features", withStd=True)
scalerModel = scaler.fit(data)
data = scalerModel.transform(data)
data.show(vertical=True, n=2)

-RECORD 0---------------------------------
 userID            | 1000280              
 unScaled_features | [77.3037731481481... 
 label             | 1                    
 features          | [1.89088370641545... 
-RECORD 1---------------------------------
 userID            | 1002185              
 unScaled_features | [65.7510532407407... 
 label             | 0                    
 features          | [1.60829918371908... 
only showing top 2 rows



### The split

Now that the data is ready for the models, we split it into a training set and a test set using an 80-20 split.
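
Note that `randomSplit` assigns rows at random, so the split sizes are only approximately 80/20 and differ between runs unless a seed is given (PySpark's `randomSplit` accepts an optional `seed` argument). The plain-Python sketch below illustrates that behaviour; `random_split` is a hypothetical helper and not Spark's implementation.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test independently with a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

# Same seed, same split; sizes are close to 800/200 but not guaranteed to be exact
print(len(train_a), len(test_a), train_a == train_b)
```

Fixing the seed makes the reported accuracy and F1 scores reproducible from run to run.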

In [51]:
# train test split
trainTest = data.randomSplit([0.8, 0.2])
trainingDF = trainTest[0]
testDF = trainTest[1]

## Modelling
Four different models were picked:
  * Gradient Boosted Trees
  * Logistic Regression
  * Linear SVC
  * Random Forest

We used accuracy and F1 score to evaluate our models, and we are primarily looking for the highest F1 score. Results are shown below.
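
To make the two metrics concrete, here is a plain-Python sketch that computes accuracy and a class-weighted F1 score. To our understanding, the `f1` metric of Spark's `MulticlassClassificationEvaluator` is such a weighted average of per-class F1 scores; the helper names below are hypothetical. The ten (label, prediction) pairs match the sample rows printed for the Gradient Boosted Trees model further down, so the numbers describe only that small sample, not the full test set.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def f1_for_class(labels, predictions, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, predictions) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, predictions) if y == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(labels, predictions):
    """Per-class F1 averaged with weights equal to each class's share of the labels."""
    return sum(
        labels.count(cls) / len(labels) * f1_for_class(labels, predictions, cls)
        for cls in sorted(set(labels))
    )


labels = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
predictions = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
sample_accuracy = accuracy(labels, predictions)
sample_f1 = weighted_f1(labels, predictions)
print('Accuracy: {:.2f}'.format(sample_accuracy))
print('F1 Score: {:.2f}'.format(sample_f1))
```

On this sample the churn class has perfect precision but only half of the churned users are found, which is why the F1 score falls below the accuracy.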

In [52]:
from pyspark.ml.classification import GBTClassifier, LogisticRegression, LinearSVC, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Gradient Boosted Trees

(Note: for our actual predictions we run all of our models through cross validation; that code lives in the source at `../src/sparkify.py`. The cells below fit each model with its default parameters.)

In [53]:
# initialize classifier
GradBoostTree = GBTClassifier()

# Fit the model
cvModel_GradBoostTree = GradBoostTree.fit(trainingDF)

# Make Predictions
results_GradBoostTree = cvModel_GradBoostTree.transform(testDF)
results_GradBoostTree = results_GradBoostTree.select('userID', 'label', 'prediction')
results_GradBoostTree.show(10)

# Get Results
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(results_GradBoostTree, {evaluator.metricName: "accuracy"})
f1Score = evaluator.evaluate(results_GradBoostTree, {evaluator.metricName: "f1"})
print('Gradient Boosted Trees Metrics:')
print('Accuracy: {:.2f}'.format(accuracy))
print('F1 Score: {:.2f}'.format(f1Score))

+-------+-----+----------+
| userID|label|prediction|
+-------+-----+----------+
|1057724|    0|       0.0|
|1059049|    0|       0.0|
|1133196|    0|       0.0|
|1151194|    0|       0.0|
|1156065|    1|       0.0|
|1311711|    1|       1.0|
|1396378|    1|       0.0|
|1507765|    0|       0.0|
|1519090|    0|       0.0|
|1528396|    1|       1.0|
+-------+-----+----------+
only showing top 10 rows

Gradient Boosted Trees Metrics:
Accuracy: 0.83
F1 Score: 0.81


Logistic Regression Classifier

In [54]:
# initialize classifier
lgr = LogisticRegression()

# Fit the model
cvModel_lgr = lgr.fit(trainingDF)

# Make Predictions
results_lgr = cvModel_lgr.transform(testDF).select('userID', 'label', 'prediction')
results_lgr.show(10)

# Get Results
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(results_lgr, {evaluator.metricName: "accuracy"})
f1Score = evaluator.evaluate(results_lgr, {evaluator.metricName: "f1"})
print('Logistic Regression Metrics:')
print('Accuracy: {:.2f}'.format(accuracy))
print('F1 Score: {:.2f}'.format(f1Score))

+-------+-----+----------+
| userID|label|prediction|
+-------+-----+----------+
|1057724|    0|       0.0|
|1059049|    0|       0.0|
|1133196|    0|       0.0|
|1151194|    0|       0.0|
|1156065|    1|       0.0|
|1311711|    1|       0.0|
|1396378|    1|       0.0|
|1507765|    0|       0.0|
|1519090|    0|       0.0|
|1528396|    1|       1.0|
+-------+-----+----------+
only showing top 10 rows

Logistic Regression Metrics:
Accuracy: 0.78
F1 Score: 0.72


Support Vector Machine

In [55]:
# initialize classifier
svc = LinearSVC()

# Fit the model
cvModel_svc = svc.fit(trainingDF)

# Make Predictions
results_svc = cvModel_svc.transform(testDF).select('userID', 'label', 'prediction')
results_svc.show(10)

# Get Results
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(results_svc, {evaluator.metricName: "accuracy"})
f1Score = evaluator.evaluate(results_svc, {evaluator.metricName: "f1"})
print('Support Vector Machine Metrics:')
print('Accuracy: {:.2f}'.format(accuracy))
print('F1 Score: {:.2f}'.format(f1Score))

+-------+-----+----------+
| userID|label|prediction|
+-------+-----+----------+
|1057724|    0|       0.0|
|1059049|    0|       0.0|
|1133196|    0|       0.0|
|1151194|    0|       0.0|
|1156065|    1|       0.0|
|1311711|    1|       0.0|
|1396378|    1|       0.0|
|1507765|    0|       0.0|
|1519090|    0|       0.0|
|1528396|    1|       0.0|
+-------+-----+----------+
only showing top 10 rows

Support Vector Machine Metrics:
Accuracy: 0.77
F1 Score: 0.66


Finally, Random Forest Classifier

In [56]:
rf_classifier = RandomForestClassifier()
cvModel_rf= rf_classifier.fit(trainingDF)

# Make Predictions
results_rf = cvModel_rf.transform(testDF).select('userID', 'label', 'prediction')
results_rf.show(10)

# Get Results
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(results_rf, {evaluator.metricName: "accuracy"})
f1Score = evaluator.evaluate(results_rf, {evaluator.metricName: "f1"})
print('Random Forest Metrics:')
print('Accuracy: {:.2f}'.format(accuracy))
print('F1 Score: {:.2f}'.format(f1Score))

+-------+-----+----------+
| userID|label|prediction|
+-------+-----+----------+
|1057724|    0|       0.0|
|1059049|    0|       0.0|
|1133196|    0|       0.0|
|1151194|    0|       0.0|
|1156065|    1|       0.0|
|1311711|    1|       0.0|
|1396378|    1|       0.0|
|1507765|    0|       0.0|
|1519090|    0|       0.0|
|1528396|    1|       1.0|
+-------+-----+----------+
only showing top 10 rows

Random Forest Metrics:
Accuracy: 0.83
F1 Score: 0.80


From the four models above, Gradient Boosted Trees and Random Forest came out on top with:
  * Accuracy: 0.83 for both
  * F1 Score: 0.81 and 0.80, respectively

## Hybrid model

Ensemble models sometimes produce great results. We will create a hybrid model that combines our four models above.

Let's combine the results first.

In [57]:
results_GradBoostTree = results_GradBoostTree.withColumnRenamed("prediction", "prediction_GBT")
results_lgr = results_lgr.select('userId', 'prediction').withColumnRenamed("prediction", "prediction_LGR")
results_svc = results_svc.select('userId', 'prediction').withColumnRenamed("prediction", "prediction_SVC")
results_rf = results_rf.select('userId', 'prediction').withColumnRenamed("prediction", "prediction_RF")
results = results_GradBoostTree.join(results_lgr, 'userID', 'outer')
results = results.join(results_svc, 'userID', 'outer').join(results_rf, 'userID', 'outer')
results.show()

+-------+-----+--------------+--------------+--------------+-------------+
| userID|label|prediction_GBT|prediction_LGR|prediction_SVC|prediction_RF|
+-------+-----+--------------+--------------+--------------+-------------+
|1057724|    0|           0.0|           0.0|           0.0|          0.0|
|1059049|    0|           0.0|           0.0|           0.0|          0.0|
|1133196|    0|           0.0|           0.0|           0.0|          0.0|
|1151194|    0|           0.0|           0.0|           0.0|          0.0|
|1156065|    1|           0.0|           0.0|           0.0|          0.0|
|1311711|    1|           1.0|           0.0|           0.0|          0.0|
|1396378|    1|           0.0|           0.0|           0.0|          0.0|
|1507765|    0|           0.0|           0.0|           0.0|          0.0|
|1519090|    0|           0.0|           0.0|           0.0|          0.0|
|1528396|    1|           1.0|           1.0|           0.0|          1.0|
|1537210|    0|          

Our hybrid model will simply be an OR function. This means that if any one of the four models predicts that a user will churn, the hybrid model labels that user as churned.
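
The rule itself is small enough to check by hand. The plain-Python sketch below applies it to three prediction rows that appear in the combined results table (columns in the order GBT, LGR, SVC, RF); `hybrid_or` is a hypothetical stand-in for the UDF defined in the next cell.

```python
def hybrid_or(*predictions):
    """Return 1.0 if at least one model predicts churn (1.0), else 0.0."""
    return 1.0 if any(p == 1.0 for p in predictions) else 0.0


# (prediction_GBT, prediction_LGR, prediction_SVC, prediction_RF)
rows = [
    (0.0, 0.0, 0.0, 0.0),  # no model predicts churn
    (1.0, 0.0, 0.0, 0.0),  # only GBT predicts churn
    (1.0, 1.0, 0.0, 1.0),  # three of four models predict churn
]
hybrid_predictions = [hybrid_or(*row) for row in rows]
print(hybrid_predictions)
```

An OR rule can only add churn predictions, so it trades precision for recall compared with any single model.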

In [58]:
def hybrid_function(prediction_GBT, prediction_LGR, prediction_SVC, prediction_RF):
    # Logical OR: flag the user as churned if at least one model predicts churn
    sum_predictions = prediction_GBT + prediction_LGR + prediction_SVC + prediction_RF
    if sum_predictions >= 1:
        return 1.0
    else:
        return 0.0

udf_hybrid_calc_function = F.udf(hybrid_function, DoubleType())
# Pass the columns in the same order as the function parameters
results = results.withColumn("prediction",
                             udf_hybrid_calc_function(
                                 "prediction_GBT",
                                 "prediction_LGR",
                                 "prediction_SVC",
                                 "prediction_RF"))

evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
accuracy = evaluator.evaluate(results, {evaluator.metricName: "accuracy"})
f1Score = evaluator.evaluate(results, {evaluator.metricName: "f1"})
results.show()
print('Hybrid Metrics:')
print('Accuracy: {:.2f}'.format(accuracy))
print('F1 Score: {:.2f}'.format(f1Score))

+-------+-----+--------------+--------------+--------------+-------------+----------+
| userID|label|prediction_GBT|prediction_LGR|prediction_SVC|prediction_RF|prediction|
+-------+-----+--------------+--------------+--------------+-------------+----------+
|1057724|    0|           0.0|           0.0|           0.0|          0.0|       0.0|
|1059049|    0|           0.0|           0.0|           0.0|          0.0|       0.0|
|1133196|    0|           0.0|           0.0|           0.0|          0.0|       0.0|
|1151194|    0|           0.0|           0.0|           0.0|          0.0|       0.0|
|1156065|    1|           0.0|           0.0|           0.0|          0.0|       0.0|
|1311711|    1|           1.0|           0.0|           0.0|          0.0|       1.0|
|1396378|    1|           0.0|           0.0|           0.0|          0.0|       0.0|
|1507765|    0|           0.0|           0.0|           0.0|          0.0|       0.0|
|1519090|    0|           0.0|           0.0|         

The F1 score and the accuracy stayed the same as those of the best individual model.