# Predicting User Churn in Digital Music Services

This notebook documents the data exploration and the development of an ML algorithm to identify at-risk customers in digital music services.

### Data Definition

From Exploratory Data Analysis (EDA): 
#### Useful:
- *location*: location of user, seems to append each new state (location, state)
- *gender*: user gender (M/F/None)

- *page*: page the user is on during the event (pages)
- *level*: subscription level (free or paid)
- *auth*: authentication status (logged in/out)
- *length*: time spent on page, max 50 mins on NextSong (possibly when a song is paused)

- *registration*: registration time (unixtime in ms)
- *ts*: timestamp of event (unixtime in ms)

- *userId*: unique user identifier (userId val)
- *sessionId*: session identifier, possibly unique only per user
- *itemInSession*: counter for the number of items in a single session (items listened to in session)


#### Not Useful:
- *firstName*: user's first name (not important, remove)
- *lastName*: user's last name
- *artist*: song artist
- *song*: songname
- *userAgent*: device/browser (not important for us, remove)
- *method*: API PUT/GET http request (not important for us, remove)
- *status*: http status

# Apache Spark on IBM Watson Setup

### Imports

In [1]:
# imports
import ibmos2spark

# pyspark sql
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import from_unixtime, udf, col, when, isnan, desc
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql import functions as F

# pyspark ml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# python
import datetime
import matplotlib.pyplot as plt
import seaborn as sns

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200626194806-0003
KERNEL_ID = 71c997e1-caf0-4c6b-8802-8bb377aaf082


### Setup

In [2]:
# The code was removed by Watson Studio for sharing.

In [None]:
# Build Spark session
spark = SparkSession.builder.appName("User Churn").getOrCreate()

# Read in data from IBM Cloud
data_df = spark.read.json(cos.url('medium-sparkify-event-data.json', 'sparkify-donotdelete-pr-fnqu5byx41gcai'))

# Exploratory Data Analysis

In [None]:
data_df.printSchema()

In [None]:
data_df.head(1)

In [None]:
data_df.toPandas().describe(include='all')

# ...

## Exploratory Data Analysis (EDA) - using PySpark SQL

In [None]:
# create temp sql table to explore data
data_df.createOrReplaceTempView("user_log_table")

### Metadata: No. of Users in data

In [None]:
# how many users in the dataset, unique userId
spark.sql("SELECT COUNT(DISTINCT(userId)) FROM user_log_table LIMIT 10").show()

### Feature: Types of Pages

In [None]:
# look at unique pages
spark.sql("SELECT DISTINCT(page) FROM user_log_table LIMIT 100").collect()

From here we can see that we want to identify at-risk customers by predicting these pages:
- Cancel
- Submit Downgrade
- Downgrade
- Cancellation Confirmation


### Feature: Types of level

In [None]:
# unique levels
spark.sql("SELECT DISTINCT(level) FROM user_log_table LIMIT 100").collect()

### Feature: authentication levels 

In [None]:
spark.sql("SELECT DISTINCT(auth) FROM user_log_table LIMIT 100").collect()

### Feature: User Locations

In [None]:
spark.sql("SELECT DISTINCT(location) FROM user_log_table LIMIT 1000").collect()

#                               ...

# Data Wrangling

### Remove non-useful columns and drop missing values

In [None]:
def clean_df(data_df):
    """Remove columns and rows that are not useful for modelling.
    """
    # remove the columns that data exploration suggested will not be useful
    cols_to_drop = ['firstName', 'lastName', 'artist', 'song', 'method', 'status', 'userAgent']
    user_log_df = data_df.drop(*cols_to_drop)
    # drop rows with a missing userId or sessionId
    user_log_valid = user_log_df.dropna(how="any", subset=["userId", "sessionId"])
    return user_log_valid

user_log_valid = clean_df(data_df)

### Convert UNIX timestamps to Datetime

In [None]:
def unix_to_datetime(user_log_valid):
    """ Convert unix timestamps to datetime
    """
    # event unix to datetime
    user_log_valid = user_log_valid.withColumn("timestamp_datetime",
                                         from_unixtime(user_log_valid.ts/1000,
                                                       format='yyyy-MM-dd HH:mm:ss'))
    # registration unix to datetime
    user_log_valid = user_log_valid.withColumn("registration_datetime",
                                         from_unixtime(user_log_valid.registration/1000,
                                                       format='yyyy-MM-dd HH:mm:ss'))
    return user_log_valid


user_log_valid = unix_to_datetime(user_log_valid)
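
The `ts` and `registration` fields are in milliseconds, which is why both are divided by 1000 before `from_unixtime`. Below is a minimal plain-Python sketch of the same conversion, assuming UTC (Spark's `from_unixtime` uses the session time zone, so the cluster output may differ by an offset). The helper name `ms_to_datetime_str` and the sample timestamp are illustrative and not taken from the dataset.

```python
from datetime import datetime, timezone


def ms_to_datetime_str(ts_ms):
    """Format a millisecond Unix timestamp as 'yyyy-MM-dd HH:mm:ss' in UTC."""
    seconds = ts_ms / 1000  # the log stores milliseconds, datetime expects seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# hypothetical event timestamp, chosen only for illustration
print(ms_to_datetime_str(1538352000000))  # 2018-10-01 00:00:00
```

Forgetting the division by 1000 would place every event tens of thousands of years in the future, so a quick check like this on one value is a cheap sanity test.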

### Creating US State Feature for Visualisation

In [None]:
# missing values cause issue with split
user_log_valid.filter(user_log_valid["location"].isNull()).count()

In [None]:
def create_us_states(user_log_valid):
    """Create US states column from location
    """
    # we don't really want to drop these rows as the col isn't vital 
    # so replace missing values to allow split
    user_log_valid = user_log_valid.fillna({'location':''})
    # create state column
    loc_split = udf(lambda x: x.split(', ')[-1], StringType())
    # States seem to be appended, so take the latest
    state_split = udf(lambda x: x.split('-')[-1], StringType())

    # apply udfs
    user_log_valid = user_log_valid.withColumn("usstate_abbr",
                                         when(user_log_valid.location.isNotNull(),
                                              loc_split(user_log_valid.location)).otherwise(''))
    user_log_valid = user_log_valid.withColumn("usstate_abbr",
                                         when(user_log_valid.usstate_abbr.isNotNull(),
                                              state_split(user_log_valid.usstate_abbr)).otherwise(''))
    return user_log_valid

user_log_valid = create_us_states(user_log_valid)

In [None]:
# take a look
user_log_valid.head(1)
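
The two UDFs in `create_us_states` first take the text after the last `', '` and then the text after the last `'-'`. A plain-Python sketch of that parsing is given here so the rule can be checked without a Spark session. The function name `state_from_location` and the sample strings are hypothetical and only illustrate the "city, state" format described in the data definition.

```python
def state_from_location(location):
    """Return the last state abbreviation from a 'City, ST' or 'City-City, ST-ST' string."""
    if not location:
        # missing locations were replaced by '' in the notebook, so keep '' here
        return ""
    states = location.split(", ")[-1]  # e.g. 'NY-NJ-PA'
    return states.split("-")[-1]       # keep only the last state listed


# illustrative inputs, not rows from the dataset
print(state_from_location("Los Angeles-Long Beach-Anaheim, CA"))      # CA
print(state_from_location("New York-Newark-Jersey City, NY-NJ-PA"))  # PA
```

Taking the last state is a simplification: for a multi-state area it keeps one state and discards the others, which is acceptable for visualisation but loses information.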

# ML Feature Engineering

### Flag user Cancellations and Create Phase

In [None]:
def create_phase_feature(user_log_valid):
    """Use the cancellation to identify churned users.
    """
    # flag the cancellation event itself
    flag_cancellation_event = udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())
    user_log_valid = user_log_valid.withColumn("churn", flag_cancellation_event("page"))
    # per user, newest event first: sum the flags from the newest event down to the current one,
    # so every event at or before a cancellation gets label 1
    windowval = Window.partitionBy("userId").orderBy(desc("ts")).rangeBetween(Window.unboundedPreceding, 0)
    user_log_valid = user_log_valid.withColumn("label", Fsum("churn").over(windowval))
    return user_log_valid

user_log_valid = create_phase_feature(user_log_valid)

In [None]:
user_log_valid.head()

In [None]:
user_log_valid.filter(user_log_valid['userId']==100010).head(50000)
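
The window used for `label` orders each user's events newest first and sums the `churn` flag from the newest event down to the current one. In other words, an event's label counts the user's cancellations with a timestamp greater than or equal to its own. The sketch below reproduces that rule in plain Python on a few made-up events; `label_events` and the sample rows are hypothetical.

```python
def label_events(events):
    """Return one label per event: the number of cancellations by the same
    user at or after the event's timestamp."""
    # collect cancellation timestamps per user
    cancel_ts = {}
    for e in events:
        if e["page"] == "Cancellation Confirmation":
            cancel_ts.setdefault(e["userId"], []).append(e["ts"])
    # an event is labelled by the cancellations that happen at or after it
    return [
        sum(1 for t in cancel_ts.get(e["userId"], []) if t >= e["ts"])
        for e in events
    ]


# made-up events: user "a" cancels, user "b" does not
events = [
    {"userId": "a", "ts": 1, "page": "NextSong"},
    {"userId": "a", "ts": 2, "page": "Cancellation Confirmation"},
    {"userId": "b", "ts": 1, "page": "NextSong"},
]
print(label_events(events))  # [1, 1, 0]
```

Because a cancellation is normally a user's last event, this effectively marks every event of a churned user with 1 and every event of a retained user with 0.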

### Calculate Hours Since Registration

In [None]:
def hours_since_reg(user_log_valid):

    # hours since registration
    user_log_valid = user_log_valid.withColumn('hours_since_registration',
                                         (user_log_valid['ts'] - user_log_valid['registration']) / (1000 *3600))
    return user_log_valid.withColumn("hours_since_registration", user_log_valid["hours_since_registration"].cast(IntegerType()))

user_log_valid = hours_since_reg(user_log_valid)
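
The arithmetic in `hours_since_reg` is a millisecond difference divided by the number of milliseconds in an hour, then truncated to a whole number by the integer cast. A small plain-Python sketch of the same calculation follows; the function name and the sample values are illustrative only.

```python
MS_PER_HOUR = 1000 * 3600  # milliseconds in one hour


def hours_since_registration(ts_ms, registration_ms):
    """Whole hours between registration and an event, truncated toward zero."""
    return int((ts_ms - registration_ms) / MS_PER_HOUR)


registration = 1538352000000                 # hypothetical registration time in ms
event = registration + 90 * 60 * 1000        # an event 90 minutes later
print(hours_since_registration(event, registration))  # 1
```

Truncation means an event 90 minutes after registration falls in hour 1, not hour 2, so the feature reads as "completed hours since registration".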

### Calculate Hour in the Day of Event

In [None]:
def hour_in_day(user_log_valid):

    # hour in the day of event
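    # declare the return type so the hour column is an integer rather than the default string
    get_hour = udf(lambda x: int(datetime.datetime.fromtimestamp(x / 1000.0).hour), IntegerType())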
    user_log_valid = user_log_valid.withColumn("hour", get_hour(user_log_valid.ts))
    return user_log_valid

user_log_valid = hour_in_day(user_log_valid)

In [None]:
def avg_user_items_in_sesh(user_log_valid):
    # calculate rolling average of items in session per user
    windowval = Window.partitionBy("userId").orderBy("ts").rangeBetween(Window.unboundedPreceding, 0)
    return user_log_valid.withColumn('itemInSession_rolling_average', F.avg("itemInSession").over(windowval))
    
user_log_valid = avg_user_items_in_sesh(user_log_valid)

In [None]:
def avg_user_listening_time(user_log_valid):
    # calculate average listening time
    windowval = Window.partitionBy("userId").orderBy("ts").rangeBetween(Window.unboundedPreceding, 0)
    return user_log_valid.withColumn('length_rolling_average', F.avg("length").over(windowval))

user_log_valid = avg_user_listening_time(user_log_valid)

In [None]:
user_log_valid.filter(user_log_valid['userId']==293).select("sessionId","length","length_rolling_average").head(5)
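
Both rolling-average features use a window that runs from the user's first event up to the current one, so each row holds the mean of everything seen so far. The sketch below shows that expanding mean in plain Python for a single user's values already sorted by `ts`. It assumes distinct timestamps (a range frame in Spark also includes rows that share the current timestamp) and skips missing values in the same way `F.avg` ignores nulls. The name `expanding_mean` and the sample numbers are hypothetical.

```python
def expanding_mean(values):
    """Running mean over values in time order, skipping None entries."""
    total, count, out = 0.0, 0, []
    for v in values:
        if v is not None:
            total += v
            count += 1
        # before the first non-missing value there is nothing to average
        out.append(total / count if count else None)
    return out


# made-up song lengths in seconds, with one event that has no length
print(expanding_mean([200.0, None, 100.0]))  # [200.0, 200.0, 150.0]
```

Since the mean at each row only uses that row and earlier ones, the feature does not leak future behaviour into past events.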

In [None]:
def num_pos_user_events(user_log_valid):
    # Flag positive events
    return user_log_valid.withColumn("positive_event",
                                         when((user_log_valid["page"] == 'Add to Playlist') |\
                                              (user_log_valid["page"] == 'Add Friend') |\
                                              (user_log_valid["page"] == 'Thumbs Up'),
                                              1).otherwise(0))

user_log_valid = num_pos_user_events(user_log_valid)

In [None]:
def num_neg_user_events(user_log_valid):
    # Flag negative events
    return user_log_valid.withColumn("negative_event",
                                         when((user_log_valid["page"] == 'Thumbs Down') |\
                                              (user_log_valid["page"] == 'Help') |\
                                              (user_log_valid["page"] == 'Error'),
                                              1).otherwise(0))

user_log_valid = num_neg_user_events(user_log_valid)

In [None]:
user_log_valid.head(1)

In [None]:
# features_df is not defined in the cells shown; user_log_valid is used here as an assumption
pd_features = user_log_valid.toPandas()

In [None]:
fig = plt.figure(figsize=(30,25))
ax = fig.gca()
h = pd_features.hist(ax=ax)

# Data Setup for ML Algorithm

In [None]:
ml_df_prep=user_log_valid

## Create Features

### Onehot Encode Categorical Variables

In [None]:
## https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml
from pyspark.ml import Pipeline
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x renamed it to OneHotEncoder
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

indexer = StringIndexer(inputCol="level", outputCol="levelIndex")
inputs = [indexer.getOutputCol()]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["levelVec"])

pipeline = Pipeline(stages=[indexer, encoder])
ml_df_prep = pipeline.fit(ml_df_prep).transform(ml_df_prep)
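
With default settings, `StringIndexer` gives the most frequent category index 0 and the one-hot encoder drops the last category, so a two-valued column such as `level` becomes a vector of length one. The sketch below imitates that behaviour in plain Python with dense lists (Spark produces sparse vectors). The function `index_and_encode` and the sample values are hypothetical, and the example avoids ties in frequency.

```python
from collections import Counter


def index_and_encode(values, drop_last=True):
    """Index categories by descending frequency, then one-hot encode them."""
    counts = Counter(values)
    # most frequent category gets index 0 (no ties in this example)
    ordered = [cat for cat, _ in counts.most_common()]
    index = {cat: i for i, cat in enumerate(ordered)}
    # dropping the last category avoids a redundant column
    size = len(ordered) - 1 if drop_last else len(ordered)
    encoded = []
    for v in values:
        vec = [0.0] * size
        if index[v] < size:
            vec[index[v]] = 1.0
        encoded.append(vec)
    return index, encoded


# made-up subscription levels
index, encoded = index_and_encode(["paid", "free", "paid", "paid"])
print(index)    # {'paid': 0, 'free': 1}
print(encoded)  # [[1.0], [0.0], [1.0], [1.0]]
```

The dropped category is represented by the all-zero vector, which is enough for a linear model such as logistic regression to tell the two levels apart.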

### Create Features Vector

In [None]:
# this vector is created in prep for ml
assembler = VectorAssembler(inputCols=["sessionId",
                                       "itemInSession",
                                       "hours_since_registration",
                                       "levelVec"],
                            outputCol="features",
                           handleInvalid="skip")
ml_df_prep = assembler.transform(ml_df_prep)

In [None]:
# normalise each feature vector to unit L2 norm (Normalizer rescales rows, not individual columns)
scaler = Normalizer(inputCol="features", outputCol="ScaledFeatures")
ml_df_prep = scaler.transform(ml_df_prep)

In [None]:
ml_df_prep.head(1)

In [None]:
ml_df = ml_df_prep.select("label","features")
ml_df.head()
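
Note that `Normalizer` works per row: with its default `p=2` it divides each feature vector by its own L2 norm, which is different from scaling each column to a common range. A plain-Python sketch of that row-wise operation is shown below; `l2_normalize` and the sample vector are illustrative only.

```python
import math


def l2_normalize(vector):
    """Divide a vector by its L2 norm so the result has unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        # a zero vector cannot be rescaled; this sketch leaves it unchanged
        return list(vector)
    return [x / norm for x in vector]


print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

Because features such as `sessionId` and `hours_since_registration` have very different magnitudes, the largest one dominates a row-normalised vector. If per-feature scaling is the goal, a column-wise scaler such as `StandardScaler` from `pyspark.ml.feature` may be a better fit. Also note that the `select` above passes the unscaled `features` column to the model, not `ScaledFeatures`.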

# Train ML Model

In [None]:
# train test split for ML validation
train, test = ml_df.randomSplit([0.8, 0.2], seed=42)  # 80/20 split with a fixed seed for reproducibility
train.head(50)

In [None]:
train.filter(train['label']==1).count()

In [None]:
train.filter(train['label']==0).count()

In [None]:
train.filter(train['label']==1).count()/train.filter(train['label']==0).count()

In [None]:
# estimators
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

### Baseline

Baseline binary Logistic Regression model

In [None]:
lrmodel = lr.fit(train)

In [None]:
lr_results = lrmodel.transform(test) 

In [None]:
lr_results.filter(lr_results["prediction"]==1).head(10)

In [None]:
lrmodel.summary.accuracy

In [None]:
lrmodel.summary.fMeasureByLabel()

In [None]:
lrmodel.summary.precisionByLabel
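
Accuracy alone can look good on imbalanced labels, which is why the per-label precision and F-measure are inspected as well. The sketch below computes these metrics by hand in plain Python for the positive (churn) label. The function `binary_metrics` and the toy labels are hypothetical and are not results from this notebook.

```python
def binary_metrics(labels, predictions, positive=1):
    """Accuracy plus precision, recall and F1 for the positive label."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# toy example: 8 retained events, 2 churn events, model predicts "no churn" for all
toy_labels = [0] * 8 + [1] * 2
toy_predictions = [0] * 10
print(binary_metrics(toy_labels, toy_predictions))
# {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

In the toy case the model is 80% accurate while finding no churned users at all, so F1 for the churn label is the more informative number.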

### Optimised

In [None]:
# pipeline containing only the classifier, with no transformation stages
pipeline = Pipeline(stages=[lr])

In [None]:
# set up param grid to iterate over
paramGrid = ParamGridBuilder() \
.addGrid(lr.regParam, [0.0, 0.1]) \
.build()

In [None]:
# set up crossvalidator to tune parameters and optimize
crossval = CrossValidator(estimator=pipeline,
                         estimatorParamMaps=paramGrid,
                         evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                         numFolds=2)
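
The parameter grid is the Cartesian product of the values given for each parameter, and the cross-validator fits every combination once per fold before refitting the best combination on the full training set. The plain-Python sketch below counts those fits for the grid used here. `build_param_grid` is a hypothetical stand-in for `ParamGridBuilder`, not its actual implementation.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


param_maps = build_param_grid({"regParam": [0.0, 0.1]})
num_folds = 2
# one fit per (combination, fold), plus a final refit of the best combination
total_fits = len(param_maps) * num_folds + 1
print(param_maps)   # [{'regParam': 0.0}, {'regParam': 0.1}]
print(total_fits)   # 5
```

Each entry of `cvModel.avgMetrics` corresponds to one combination in the grid, averaged over the folds. Adding more parameters multiplies the number of fits, which is worth keeping in mind on an event log of this size.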

In [None]:
cvModel = crossval.fit(train)  # train model

In [None]:
results = cvModel.transform(test)  # apply model on test data

In [None]:
cvModel.avgMetrics  # look at model scoring metrics

In [None]:
results.count()  # total number of events in the test set

In [None]:
print(results.filter(results.label == results.prediction).count())  # check how many were predicted correctly

In [None]:
results.filter(results.label == results.prediction).count()/results.count()   # fraction predicted correctly (accuracy)

In [None]:
results.filter(results["prediction"]==1).head(5)

In [None]:
results.filter(results["label"]==1).head(50)