# Predicting User Churn in Digital Music Services

Notebook to document data exploration and development of ML algorithm to identify at risk customers in digital music services.

### Data Definition

From Exploratory Data Analysis (EDA): 
#### Useful:
- *location*: location of user, seems to append each new state (location, state)
- *gender*: user gender (M/F/None)

- *page*: what page the user is on during event (pages)
- *level*: subscription level check uniqueness (free or paid)
- *auth*: authenication (logged in/out)
- *length*: time spent on page, max 50 mins on NextSong (if song paused??)

- *registration*: unknown (registration unixtime)
- *ts*: timestamp of event in ms (event unixtime)

- *userId*: unique (userId val)
- *sessionId*: unique sessionId per user?
- *itemInSession*: lcounter for the number of items in a single session (item listened to in session)


#### Not Useful:
- *firstName*: users first name (not important, remove)
- *lastName*: users lastname
- *artist*: song artist
- *song*: songname
- *userAgent*: device/browser (not important for us, remove)
- *method*: API PUT/GET http request (not important for us, remove)
- *status*: http status

### Imports

In [20]:
# imports
#import ibmos2spark

# pyspark sql
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import from_unixtime, udf, col, when, isnan, desc
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql import functions as F

# pyspark ml
from pyspark.ml.feature import VectorAssembler, Normalizer, StringIndexer, OneHotEncoder
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline

# python
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

### setup

In [21]:
# # config
# # @hidden_cell
# credentials = {
#     'endpoint': 'https://s3.eu-geo.objectstorage.service.networklayer.com',
#     'service_id': 'iam-ServiceId-147e1161-7da9-41fe-ac00-c144730def00',
#     'iam_service_endpoint': 'https://iam.cloud.ibm.com/oidc/token',
#     'api_key': 'kAtvjdC8VIYYUmU3gDaOYIK2fCvP3nkjYYlDiNuu4gw6'
# }

# configuration_name = 'os_76774389dfa04fb5acbb1640b3e11704_configs'
# cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

In [22]:
# Build Spark session
spark = SparkSession.builder.appName("user_churn").getOrCreate()

# Read in data from IBM Cloud
# data_df = spark.read.json(cos.url('medium-sparkify-event-data.json', 'sparkify-donotdelete-pr-fnqu5byx41gcai'))
data_df = spark.read.parquet("../data/04_primary/medium-sparkify-event-data-features.parquet")

# Data Setup for ML Algorithm

In [23]:
data_df.printSchema()

root
 |-- label: long (nullable = true)
 |-- userId: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- usStateAbbr: string (nullable = true)
 |-- avg_item_in_session: double (nullable = true)
 |-- avg_session_length: double (nullable = true)
 |-- num_good_recc: long (nullable = true)
 |-- num_bad_recc: long (nullable = true)
 |-- num_bad_sys: long (nullable = true)



In [24]:
data_df.head()

Row(label=1, userId='100012', gender='M', usStateAbbr='WI', avg_item_in_session=25.15568862275449, avg_session_length=242.23231660714296, num_good_recc=9, num_bad_recc=2, num_bad_sys=3)

## Create Features

### Onehot Encode Categorical Variables

In [25]:
## https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
cols = ['gender', 'usStateAbbr']

indexers = [
    StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
    for c in cols
]

encoders = [
    OneHotEncoder(
        inputCol=indexer.getOutputCol(),
        outputCol="{0}_encoded".format(indexer.getOutputCol())) 
    for indexer in indexers
]

assembler = VectorAssembler(
    inputCols=[encoder.getOutputCol() for encoder in encoders],
    outputCol="features2"
)

pipeline = Pipeline(stages=indexers + encoders + [assembler])
data_df = pipeline.fit(data_df).transform(data_df)


### Create Features Vector

In [26]:
# this vector is created in prep for ml
assembler = VectorAssembler(inputCols=["avg_item_in_session",
                                       "avg_session_length",
                                       "num_good_recc",
                                       "num_bad_recc",
                                       "num_bad_sys",
                                       'gender_indexed',
                                       'usStateAbbr_indexed'],
                            outputCol="raw_features",
                           handleInvalid="skip")
data_df = assembler.transform(data_df)

In [39]:
# apply scaler
scaler = Normalizer(inputCol="raw_features", outputCol="features")
ml_df_prepped = scaler.transform(data_df)

### Create custom evaluator

In [49]:
def evaluate_model(results):
    """Customer function to evaluate result.
    """
    # Generic evaluation
    total_results = results.count()
    correct_pred = results.filter(results.label == results.prediction).count()
    incorrect_pred = results.filter(results.label != results.prediction).count()
    print('Total user events predicted correctly: {}'.format(correct_pred))
    print('Total user events predicted wrongly: {}'.format(incorrect_pred))
    print("Percentage predicted correct (%): {} \n".format((correct_pred/total_results)))
    
    # Correct churn predictions
    churn_correct = results.filter((results.label == 1) & (results.prediction == 1)).count()
    actual_churned_users = results.filter(results.label == 1).count()
    print('User churned and predicted to churn: {}'.format(churn_correct))
    print('User churned : {}'.format(actual_churned_users))
    print('Percent churned user events predicted correctly(%): {}\n'.format((churn_correct/actual_churned_users)))
    
    # Incorrect churn predictions
    print('Number of events predicted to churn but didnt: {}'.format(results.filter((results.label == 0) & (results.label == 1)).count()))
    churn_incorrect = results.filter((results.label == 0) & (results.prediction == 1)).count()
    print('User did not churn and predicted to: {}'.format(churn_incorrect))
    print('Percent churned user events predicted correctly(%): {}\n'.format(churn_incorrect/total_results))
    
    


# Train ML Model

In [41]:
# train test split for ML validation
train, test =  ml_df_prepped.randomSplit([0.6, 0.4], seed=42)  # more equal fit to combat overfitting
train.head(1)

[Row(label=0, userId='100002', gender='F', usStateAbbr='CA', avg_item_in_session=31.25423728813559, avg_session_length=268.38730715328467, num_good_recc=11, num_bad_recc=2, num_bad_sys=3, gender_indexed=1.0, usStateAbbr_indexed=0.0, gender_indexed_encoded=SparseVector(1, {}), usStateAbbr_indexed_encoded=SparseVector(44, {0: 1.0}), features2=SparseVector(45, {1: 1.0}), raw_features=DenseVector([31.2542, 268.3873, 11.0, 2.0, 3.0, 1.0, 0.0]), features=DenseVector([0.1156, 0.9924, 0.0407, 0.0074, 0.0111, 0.0037, 0.0]))]

In [42]:
print("{} churned user events".format(train.filter(train['label']==1).count()))

print("{} non-churned user events".format(train.filter(train['label']==0).count()))

print("{} ratio of churned/non-churned user events".format(train.filter(train['label']==1).count()/train.filter(train['label']==0).count()))

62399 churned user events
254906 non-churned user events
0.24479219790824852 ratio of churned/non-churned user events


## Baseline ML Model

baseline binary Logisitc Regression Model

In [50]:
results = LogisticRegression().fit(train).transform(test)
evaluate_model(results)

Total user events predicted correctly: 170373
Total user events predicted wrongly: 40327
Percentage predicted correct (%): 0.8086046511627907 

User churned and predicted to churn: 3941
User churned : 41247
Percent churned user events predicted correctly(%): 0.09554634276432225

Number of events predicted to churn but didnt: 0
User did not churn and predicted to: 3021
Percent churned user events predicted correctly(%): 0.014337921214997627



In [51]:
# Fit and calculate predictions
results = RandomForestClassifier().fit(train).transform(test)
evaluate_model(results)

Total user events predicted correctly: 184519
Total user events predicted wrongly: 26181
Percentage predicted correct (%): 0.8757427622211675 

User churned and predicted to churn: 15066
User churned : 41247
Percent churned user events predicted correctly(%): 0.36526292821296097

Number of events predicted to churn but didnt: 0
User did not churn and predicted to: 0
Percent churned user events predicted correctly(%): 0.0



In [52]:
# Fit and calculate predictions
results = GBTClassifier().fit(train).transform(test)
evaluate_model(results)

Total user events predicted correctly: 204272
Total user events predicted wrongly: 6428
Percentage predicted correct (%): 0.9694921689606075 

User churned and predicted to churn: 34819
User churned : 41247
Percent churned user events predicted correctly(%): 0.8441583630324629

Number of events predicted to churn but didnt: 0
User did not churn and predicted to: 0
Percent churned user events predicted correctly(%): 0.0



## Optimised ML Model

In [53]:
# pipeline, just running it on classifier no transformations

gbt_model = GBTClassifier()

pipeline = Pipeline(stages=[gbt_model])

# set up param grid to iterate over
paramGrid = ParamGridBuilder() \
.addGrid(gbt_model.maxDepth, [2, 4, 7]) \
.addGrid(gbt_model.maxBins, [15, 40, 50]) \
.addGrid(gbt_model.stepSize, [0.02, 0.2]) \
.build()

# set up crossvalidator to tune parameters and optimize
crossval = CrossValidator(estimator=pipeline,
                         estimatorParamMaps=paramGrid,
                         evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                         numFolds=3)

In [54]:
cvModel = crossval.fit(train)  # train model
results = cvModel.transform(test)  # apply model on test data

In [55]:
evaluate_model(results)

Total user events predicted correctly: 210351
Total user events predicted wrongly: 349
Percentage predicted correct (%): 0.998343616516374 

User churned and predicted to churn: 40898
User churned : 41247
Percent churned user events predicted correctly(%): 0.991538778577836

Number of events predicted to churn but didnt: 0
User did not churn and predicted to: 0
Percent churned user events predicted correctly(%): 0.0



In [56]:
cvModel.avgMetrics  # look at model scoring metrics

[0.7735991524077506,
 0.8307600265233073,
 0.7902243411999164,
 0.852009308215375,
 0.8040890373611671,
 0.8615093841220003,
 0.8525021171831484,
 0.9609473971428757,
 0.8689646251652048,
 0.9673013776843362,
 0.8469729699292978,
 0.9677221883693633,
 0.9286818765501423,
 0.9945904015326863,
 0.9505490507083667,
 0.9987713149098558,
 0.946510784643793,
 0.9989807612709432]

In [57]:
cvModel.bestModel

PipelineModel_610c3d5c90b4