# Predicting User Churn in Digital Music Services

This notebook documents the data exploration and the development of an ML model that identifies at-risk customers of a digital music service.

### Data Definition

From Exploratory Data Analysis (EDA):
#### Useful:
- *location*: location of user as "city, state"; multi-state areas seem to append each state (e.g. NY-NJ-PA)
- *gender*: user gender (M/F/None)

- *page*: which page the user is on during the event (see the list of pages below)
- *level*: subscription level (free or paid)
- *auth*: authentication status (Logged In, Logged Out, Guest, Cancelled)
- *length*: duration in seconds, only present on NextSong events (most likely the song length); max about 50 mins (if song paused??)

- *registration*: registration time of the user (Unix time in ms)
- *ts*: timestamp of the event (Unix time in ms)

- *userId*: unique user identifier
- *sessionId*: session identifier (unique per user?)
- *itemInSession*: counter for the number of items in a single session


#### Not Useful:
- *firstName*: user's first name (not important, remove)
- *lastName*: user's last name
- *artist*: song artist
- *song*: song name
- *userAgent*: device/browser (not important for us, remove)
- *method*: HTTP request method, PUT/GET (not important for us, remove)
- *status*: HTTP status code

# Apache Spark on IBM Watson Setup

In [1]:
# imports
import ibmos2spark

# pyspark sql
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import from_unixtime, udf, col, when, isnan, desc
from pyspark.sql.functions import sum as Fsum
from pyspark.sql.types import IntegerType, StringType

# pyspark ml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Normalizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.pipeline import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# python
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200621184655-0005
KERNEL_ID = 4e3e8129-b3d1-4f06-94f0-3ea706b93dd4


In [2]:
# The code was removed by Watson Studio for sharing.

In [3]:
# Build Spark session
spark = SparkSession.builder.appName("User Churn").getOrCreate()

# Read in data from IBM Cloud
data_df = spark.read.json(cos.url('medium-sparkify-event-data.json', 'sparkify-donotdelete-pr-fnqu5byx41gcai'))

# Exploratory Data Analysis

In [4]:
data_df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [5]:
data_df.head(1)

[Row(artist='Martin Orford', auth='Logged In', firstName='Joseph', gender='M', itemInSession=20, lastName='Morales', length=597.55057, level='free', location='Corpus Christi, TX', method='PUT', page='NextSong', registration=1532063507000, sessionId=292, song='Grand Designs', status=200, ts=1538352011000, userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='293')]

In [6]:
data_df.toPandas().describe(include='all')

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
count,432877,543705,528005,528005,543705.0,528005,432877.0,543705,528005,543705,543705,528005.0,543705.0,432877,543705.0,543705.0,528005,543705.0
unique,21247,4,345,2,,275,,2,192,2,22,,,80292,,,71,449.0
top,Kings Of Leon,Logged In,Joseph,M,,Reed,,paid,"New York-Newark-Jersey City, NY-NJ-PA",PUT,NextSong,,,You're The One,,,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4...",
freq,3497,527906,13108,302612,,12767,,428597,40156,495143,432877,,,2219,,,46082,15700.0
mean,,,,,107.306291,,248.664593,,,,,1535523000000.0,2040.814353,,210.018291,1540965000000.0,,
std,,,,,116.723508,,98.41267,,,,,3078725000.0,1434.338931,,31.471919,1482057000.0,,
min,,,,,0.0,,0.78322,,,,,1509854000000.0,1.0,,200.0,1538352000000.0,,
25%,,,,,26.0,,199.3922,,,,,1534368000000.0,630.0,,200.0,1539720000000.0,,
50%,,,,,68.0,,234.00444,,,,,1536556000000.0,1968.0,,200.0,1541005000000.0,,
75%,,,,,147.0,,276.79302,,,,,1537612000000.0,3307.0,,200.0,1542177000000.0,,


# ...

## Exploratory Data Analysis (EDA) using Spark SQL

In [7]:
# create temp sql table to explore data
data_df.createOrReplaceTempView("user_log_table")

### Metadata: No. of Users in data

In [8]:
# how many users in the dataset, unique userId
spark.sql("SELECT COUNT(DISTINCT(userId)) FROM user_log_table LIMIT 10").show()

+----------------------+
|count(DISTINCT userId)|
+----------------------+
|                   449|
+----------------------+



### Feature: Types of Pages

In [9]:
# look at unique pages
spark.sql("SELECT DISTINCT(page) FROM user_log_table LIMIT 100").collect()

[Row(page='Cancel'),
 Row(page='Submit Downgrade'),
 Row(page='Thumbs Down'),
 Row(page='Home'),
 Row(page='Downgrade'),
 Row(page='Roll Advert'),
 Row(page='Logout'),
 Row(page='Save Settings'),
 Row(page='Cancellation Confirmation'),
 Row(page='About'),
 Row(page='Submit Registration'),
 Row(page='Settings'),
 Row(page='Login'),
 Row(page='Register'),
 Row(page='Add to Playlist'),
 Row(page='Add Friend'),
 Row(page='NextSong'),
 Row(page='Thumbs Up'),
 Row(page='Help'),
 Row(page='Upgrade'),
 Row(page='Error'),
 Row(page='Submit Upgrade')]

From this list, the pages that identify at-risk customers, and that we therefore want to predict, are:
- Cancel
- Submit Downgrade
- Downgrade
- Cancellation Confirmation


### Feature: Types of level

In [10]:
# unique levels
spark.sql("SELECT DISTINCT(level) FROM user_log_table LIMIT 100").collect()

[Row(level='free'), Row(level='paid')]

### Feature: authentication levels 

In [11]:
spark.sql("SELECT DISTINCT(auth) FROM user_log_table LIMIT 100").collect()

[Row(auth='Logged Out'),
 Row(auth='Cancelled'),
 Row(auth='Guest'),
 Row(auth='Logged In')]

### Feature: User Locations

In [12]:
spark.sql("SELECT DISTINCT(location) FROM user_log_table LIMIT 1000").collect()

[Row(location='Atlantic City-Hammonton, NJ'),
 Row(location='Gainesville, FL'),
 Row(location='Richmond, VA'),
 Row(location='Oskaloosa, IA'),
 Row(location='Tucson, AZ'),
 Row(location='Deltona-Daytona Beach-Ormond Beach, FL'),
 Row(location='San Diego-Carlsbad, CA'),
 Row(location='Cleveland-Elyria, OH'),
 Row(location='Medford, OR'),
 Row(location='Kingsport-Bristol-Bristol, TN-VA'),
 Row(location='New Haven-Milford, CT'),
 Row(location='Birmingham-Hoover, AL'),
 Row(location='Corpus Christi, TX'),
 Row(location='Mobile, AL'),
 Row(location='Dubuque, IA'),
 Row(location='Las Vegas-Henderson-Paradise, NV'),
 Row(location='Killeen-Temple, TX'),
 Row(location='Ottawa-Peru, IL'),
 Row(location='Boise City, ID'),
 Row(location='Bremerton-Silverdale, WA'),
 Row(location='Urban Honolulu, HI'),
 Row(location='Cedar City, UT'),
 Row(location='Indianapolis-Carmel-Anderson, IN'),
 Row(location='Durham-Chapel Hill, NC'),
 Row(location='Seattle-Tacoma-Bellevue, WA'),
 Row(location='Fort Smith, A

#                               ...

# Data Wrangling

### Remove non-useful columns and drop missing values

In [14]:
# lets remove some of the columns we don't think will be useful from data exploration
cols_to_drop = ['firstName', 'lastName','artist', 'song', 'method', 'status', 'userAgent']
user_log_df = data_df.drop(*cols_to_drop)

In [15]:
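# drop rows with a null userId or sessionId (describe() above shows neither column has nulls, so this is a safeguard;
# the 15,700 rows without user details appear to carry an empty-string userId and are kept)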
user_log_valid = user_log_df.dropna(how = "any", subset = ["userId", "sessionId"])

### Convert UNIX Timestamps to Datetime

In [16]:
# event unix to datetime
user_log_valid = user_log_valid.withColumn("timestamp_datetime",
                                     from_unixtime(user_log_valid.ts/1000,
                                                   format='yyyy-MM-dd HH:mm:ss'))

In [17]:
# registration unix to datetime
user_log_valid = user_log_valid.withColumn("registration_datetime",
                                     from_unixtime(user_log_valid.registration/1000,
                                                   format='yyyy-MM-dd HH:mm:ss'))
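
For reference, the same conversion in plain Python, as a minimal sketch. The helper name `ms_to_datetime_str` is made up for illustration, and the sketch assumes UTC, whereas `from_unixtime` uses the Spark session time zone. The two inputs are the `ts` and `registration` values of the first row shown earlier.

```python
from datetime import datetime, timezone

def ms_to_datetime_str(ts_ms):
    """Format a Unix timestamp in milliseconds as 'yyyy-MM-dd HH:mm:ss' (UTC assumed)."""
    seconds = ts_ms / 1000  # the log stores milliseconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# event and registration timestamps of the first row in the dataset
event_str = ms_to_datetime_str(1538352011000)
registration_str = ms_to_datetime_str(1532063507000)
print(event_str, "|", registration_str)
```

Both strings match the `timestamp_datetime` and `registration_datetime` values printed by `head(1)` later in the notebook, which suggests the Spark session ran in UTC.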

### Creating US State Feature for Visualisation

In [18]:
# missing values cause issues with split
user_log_valid.filter(user_log_valid["location"].isNull()).count()

15700

In [19]:
# we don't really want to drop these rows as the column isn't vital,
# so replace missing values with '' to allow the split
user_log_valid = user_log_valid.fillna({'location':''})
# create state column: take the text after the last ', ' (e.g. 'TX' or 'NY-NJ-PA')
loc_split = udf(lambda x: x.split(', ')[-1], StringType())
# states seem to be appended, so take the last one
state_split = udf(lambda x: x.split('-')[-1], StringType())

# apply udfs
user_log_valid = user_log_valid.withColumn("usstate_abbr",
                                     when(user_log_valid.location.isNotNull(),
                                          loc_split(user_log_valid.location)).otherwise(''))
user_log_valid = user_log_valid.withColumn("usstate_abbr",
                                     when(user_log_valid.usstate_abbr.isNotNull(),
                                          state_split(user_log_valid.usstate_abbr)).otherwise(''))
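
What the two UDFs do to a single location string, sketched in plain Python (the function name `state_abbr` is hypothetical, used only for this illustration):

```python
def state_abbr(location):
    """Return the last state abbreviation of a 'City, ST' or 'City-City, ST-ST' string."""
    states = location.split(", ")[-1]  # 'TX' or 'NY-NJ-PA'
    return states.split("-")[-1]       # keep the last state listed

examples = [
    "Corpus Christi, TX",
    "New York-Newark-Jersey City, NY-NJ-PA",
    "",  # missing locations were replaced by an empty string
]
for loc in examples:
    print(repr(loc), "->", repr(state_abbr(loc)))
```

An empty location passes through as an empty string, which is why the missing values are filled with `''` before the split.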

In [20]:
# take a look
user_log_valid.head(1)

[Row(auth='Logged In', gender='M', itemInSession=20, length=597.55057, level='free', location='Corpus Christi, TX', page='NextSong', registration=1532063507000, sessionId=292, ts=1538352011000, userId='293', timestamp_datetime='2018-10-01 00:00:11', registration_datetime='2018-07-20 05:11:47', usstate_abbr='TX')]

# Feature Engineering

### Flag User Downgrades and Create Label

In [21]:
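# flag each 'Submit Downgrade' event with 1
flag_downgrade_event = udf(lambda x: 1 if x == "Submit Downgrade" else 0, IntegerType())
user_log_valid = user_log_valid.withColumn("downgraded", flag_downgrade_event("page"))

# per user, newest event first: running sum of the downgrade flags
windowval = Window.partitionBy("userId").orderBy(desc("ts")).rangeBetween(Window.unboundedPreceding, 0)

# label = number of downgrades by this user at or after the event (can be greater than 1)
user_log_valid = user_log_valid.withColumn("label", Fsum("downgraded").over(windowval))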
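
To make the window logic concrete, here is a plain-Python sketch of the same idea on a made-up toy log. The function `downgrade_labels` and the toy events are hypothetical, and the sketch assumes each user's timestamps are distinct (a Spark range frame also includes rows that tie on `ts`).

```python
from collections import defaultdict

def downgrade_labels(events):
    """events: list of (user_id, ts, page). Returns {(user_id, ts): label}.

    label = number of 'Submit Downgrade' events by the same user whose
    timestamp is greater than or equal to the event's own timestamp.
    """
    by_user = defaultdict(list)
    for user_id, ts, page in events:
        by_user[user_id].append((ts, page))

    labels = {}
    for user_id, rows in by_user.items():
        running = 0
        # walk from newest to oldest, like orderBy(desc("ts"))
        for ts, page in sorted(rows, reverse=True):
            running += 1 if page == "Submit Downgrade" else 0
            labels[(user_id, ts)] = running
    return labels

# toy log: user "a" downgrades twice, user "b" never does
toy_events = [
    ("a", 1, "NextSong"), ("a", 2, "Submit Downgrade"), ("a", 3, "NextSong"),
    ("a", 4, "Submit Downgrade"), ("a", 5, "Home"), ("b", 1, "NextSong"),
]
toy_labels = downgrade_labels(toy_events)
for key in sorted(toy_labels):
    print(key, toy_labels[key])
```

Events before the first downgrade of user "a" get label 2, events between the two downgrades get 1, and events after the last one get 0, so the label is a count rather than a 0/1 flag.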

### Calculate Hours Since Registration

In [22]:
# hours since registration: millisecond difference divided by the milliseconds per hour
user_log_valid = user_log_valid.withColumn('hours_since_registration',
                                     (user_log_valid['ts'] - user_log_valid['registration']) / (1000 * 3600))
# the cast to IntegerType truncates to whole hours
user_log_valid = user_log_valid.withColumn("hours_since_registration", user_log_valid["hours_since_registration"].cast(IntegerType()))
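
A quick arithmetic check of this feature in plain Python (a sketch; the helper name is made up). The inputs are the `ts` and `registration` values of the sample row for user 100010 printed later in the notebook.

```python
MS_PER_HOUR = 1000 * 3600

def hours_since_registration(ts_ms, registration_ms):
    """Whole hours between registration and the event, truncated like the integer cast."""
    return int((ts_ms - registration_ms) / MS_PER_HOUR)

# 975,052,000 ms between registration and event, i.e. about 270.85 hours
hours = hours_since_registration(1538991392000, 1538016340000)
print(hours)
```

The result, 270, agrees with `hours_since_registration=270` in that row.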

### Calculate Hour in the Day of Event

In [23]:
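# hour in the day of event; fromtimestamp uses the local time zone of the worker
# no returnType is passed to udf, so the column is a string (e.g. hour='9')
get_hour = udf(lambda x: int(datetime.datetime.fromtimestamp(x / 1000.0).hour))
user_log_valid = user_log_valid.withColumn("hour", get_hour(user_log_valid.ts))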
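
The hour of day depends on the time zone used for the conversion. The sketch below makes the zone explicit; `hour_of_day` is a hypothetical helper and the UTC-5 offset is only an example, not something the dataset specifies.

```python
from datetime import datetime, timedelta, timezone

def hour_of_day(ts_ms, tz=timezone.utc):
    """Hour (0-23) of a millisecond Unix timestamp in the given time zone."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=tz).hour

ts = 1538991392000  # event time of the sample row for user 100010
utc_hour = hour_of_day(ts)
offset_hour = hour_of_day(ts, timezone(timedelta(hours=-5)))
print(utc_hour, offset_hour)
```

In UTC the event falls in hour 9, matching `hour='9'` in the notebook output, while the same event falls in hour 4 at UTC-5. Since users are spread across US states, a per-user local hour would need the user's own time zone, which is not in the log.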

In [24]:
from pyspark.sql import functions as F
# rolling average of itemInSession per user, ordered by event time
windowval = Window.partitionBy("userId").orderBy("ts").rangeBetween(Window.unboundedPreceding, 0)
user_log_valid = user_log_valid.withColumn('itemInSession_rolling_average', F.avg("itemInSession").over(windowval))
user_log_valid.filter(user_log_valid['userId']==293).select("sessionId","itemInSession","itemInSession_rolling_average").head(5)

[Row(sessionId=292, itemInSession=20, itemInSession_rolling_average=20.0),
 Row(sessionId=292, itemInSession=21, itemInSession_rolling_average=20.5),
 Row(sessionId=292, itemInSession=22, itemInSession_rolling_average=21.0),
 Row(sessionId=292, itemInSession=23, itemInSession_rolling_average=21.5),
 Row(sessionId=292, itemInSession=24, itemInSession_rolling_average=22.0)]

In [25]:
# rolling average of length (listening time) per user; avg ignores null lengths
windowval = Window.partitionBy("userId").orderBy("ts").rangeBetween(Window.unboundedPreceding, 0)
user_log_valid = user_log_valid.withColumn('length_rolling_average', F.avg("length").over(windowval))
user_log_valid.filter(user_log_valid['userId']==293).select("sessionId","length","length_rolling_average").head(5)

[Row(sessionId=292, length=597.55057, length_rolling_average=597.55057),
 Row(sessionId=292, length=180.50567, length_rolling_average=389.02812),
 Row(sessionId=292, length=268.59057, length_rolling_average=348.88227),
 Row(sessionId=292, length=None, length_rolling_average=348.88227),
 Row(sessionId=292, length=232.88118, length_rolling_average=319.8819975)]
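
The rolling average above is a running mean from the user's first event up to the current one, with null values left out. A plain-Python sketch of that behaviour (the function name is made up, and it assumes the values arrive in event order with distinct timestamps):

```python
def rolling_average(values):
    """Running mean over values in event order; None entries are skipped."""
    total, count, out = 0.0, 0, []
    for v in values:
        if v is not None:
            total += v
            count += 1
        out.append(total / count if count else None)
    return out

# the first five length values of user 293 from the output above
lengths = [597.55057, 180.50567, 268.59057, None, 232.88118]
averages = rolling_average(lengths)
print(averages)
```

The fourth entry repeats the previous average because the missing length adds nothing to the sum or the count, which is the pattern seen in the Spark output.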

In [26]:
# flag positive events: 1 if the page is a positive interaction, else 0
user_log_valid = user_log_valid.withColumn("positive_event",
                                     when((user_log_valid["page"] == 'Add to Playlist') |\
                                          (user_log_valid["page"] == 'Add Friend') |\
                                          (user_log_valid["page"] == 'Thumbs Up'),
                                          1).otherwise(0))

In [27]:
# flag negative events: 1 if the page is a negative interaction, else 0
user_log_valid = user_log_valid.withColumn("negative_event",
                                     when((user_log_valid["page"] == 'Thumbs Down') |\
                                          (user_log_valid["page"] == 'Help') |\
                                          (user_log_valid["page"] == 'Error'),
                                          1).otherwise(0))
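
The two flags amount to set membership on the page name. A plain-Python sketch of the rule, with the page groups copied from the cells above and a hypothetical helper name:

```python
POSITIVE_PAGES = {"Add to Playlist", "Add Friend", "Thumbs Up"}
NEGATIVE_PAGES = {"Thumbs Down", "Help", "Error"}

def event_flags(page):
    """Return (positive_event, negative_event) as 0/1 integers for a page name."""
    return int(page in POSITIVE_PAGES), int(page in NEGATIVE_PAGES)

for page in ["Thumbs Up", "Error", "NextSong"]:
    print(page, event_flags(page))
```

In PySpark the chained `|` comparisons could likely be shortened with `Column.isin`, which keeps the page groups in one list each.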

# Data Setup for ML Algorithm

In [28]:
ml_df_prep = user_log_valid

In [29]:
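# how many event rows have label == 1, i.e. exactly one downgrade at or after the event
# (this counts events, not users or downgrade events)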
ml_df_prep.filter(ml_df_prep["label"]==1).count()

107526

## Create Features

### Onehot Encode Categorical Variables

In [32]:
## https://stackoverflow.com/questions/32277576/how-to-handle-categorical-features-with-spark-ml
from pyspark.ml import Pipeline
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; Spark 3.x calls it OneHotEncoder
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

indexer = StringIndexer(inputCol="level", outputCol="levelIndex")
inputs = [indexer.getOutputCol()]
encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["levelVec"])

pipeline = Pipeline(stages=[indexer, encoder])
ml_df_prep = pipeline.fit(ml_df_prep).transform(ml_df_prep)
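
How the index and the one-hot vector come about, sketched in plain Python. This assumes the default settings as I understand them: `StringIndexer` gives index 0 to the most frequent category, and the encoder drops the last category so that it becomes an all-zero vector. The helper names and the four-row sample are made up.

```python
from collections import Counter

def index_by_frequency(values):
    """Most frequent category gets index 0.0 (ties are not handled in this sketch)."""
    ordered = [cat for cat, _ in Counter(values).most_common()]
    return {cat: float(i) for i, cat in enumerate(ordered)}

def one_hot_drop_last(index, n_categories):
    """One-hot vector of length n_categories - 1; the last category is all zeros."""
    vec = [0.0] * (n_categories - 1)
    if index < n_categories - 1:
        vec[int(index)] = 1.0
    return vec

levels = ["paid", "paid", "free", "paid"]  # toy sample; 'paid' is also the most frequent level in the data
mapping = index_by_frequency(levels)
encoded = {level: one_hot_drop_last(idx, len(mapping)) for level, idx in mapping.items()}
print(mapping, encoded)
```

With two levels the encoded vector has a single slot: `paid` maps to `[1.0]` and `free` to `[0.0]`, consistent with `levelIndex=1.0` and the empty `SparseVector(1, {})` shown for a free user later in the notebook.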

### Create Features Vector

In [33]:
# this vector is created in prep for ml
assembler = VectorAssembler(inputCols=["sessionId",
                                       "itemInSession",
                                       "hours_since_registration",
                                       "levelVec"],
                            outputCol="features",
                            handleInvalid="skip")
ml_df_prep = assembler.transform(ml_df_prep)

In [34]:
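# apply Normalizer: rescales each row vector to unit L2 norm (not a per-column scaler)
# note: the model below is trained on "features", so "ScaledFeatures" is not used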
scaler = Normalizer(inputCol="features", outputCol="ScaledFeatures")
ml_df_prep = scaler.transform(ml_df_prep)
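
What `Normalizer` does to a single row, as a plain-Python sketch assuming the default p = 2 norm. The input is the `features` vector of the sample row printed in the next cell.

```python
import math

def l2_normalize(vector):
    """Divide every component by the vector's Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

row = [62.0, 0.0, 270.0, 0.0]
normalized = [round(x, 4) for x in l2_normalize(row)]
print(normalized)
```

This reproduces the `ScaledFeatures` values for that row. Each row is scaled by its own length, so columns are not brought onto a common scale; a per-column scaler such as `StandardScaler` or `MinMaxScaler` would be the usual choice for that.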

In [45]:
ml_df_prep.head(1)

[Row(auth='Logged In', gender='F', itemInSession=0, length=226.08934, level='free', location='Bridgeport-Stamford-Norwalk, CT', page='NextSong', registration=1538016340000, sessionId=62, ts=1538991392000, userId='100010', timestamp_datetime='2018-10-08 09:36:32', registration_datetime='2018-09-27 02:45:40', usstate_abbr='CT', downgraded=0, label=0, hours_since_registration=270, hour='9', itemInSession_rolling_average=0.0, length_rolling_average=226.08934, positive_event=0, negative_event=0, levelIndex=1.0, levelVec=SparseVector(1, {}), features=DenseVector([62.0, 0.0, 270.0, 0.0]), ScaledFeatures=DenseVector([0.2238, 0.0, 0.9746, 0.0]))]

In [44]:
ml_df = ml_df_prep.select("label","features")
ml_df.head()

Row(label=0, features=DenseVector([166.0, 67.0, 343.0, 0.0]))

## Train ML Model

In [37]:
# train test split for ML validation
train, test = ml_df.randomSplit([0.6, 0.4], seed=42)  # 60/40 split; the larger test share is intended to combat overfitting
train.head(1)

[Row(label=0, features=DenseVector([2.0, 4.0, 590.0, 0.0]))]

In [38]:
# estimators
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

### Baseline

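Baseline logistic regression model. Because `label` is a running sum of downgrade flags it can exceed 1, and the four-element `rawPrediction` and `probability` vectors in the output below show that Spark fitted a four-class (multinomial) model rather than a binary one.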

In [39]:
lrmodel = lr.fit(train)

In [40]:
lr_results = lrmodel.transform(test) 

In [43]:
lr_results.filter(lr_results["label"]==1).head(50)

[Row(label=1, features=DenseVector([477.0, 0.0, 1121.0, 0.0]), rawPrediction=DenseVector([2.7007, 1.4516, -0.8783, -3.274]), probability=DenseVector([0.7592, 0.2177, 0.0212, 0.0019]), prediction=0.0),
 Row(label=1, features=DenseVector([477.0, 6.0, 1121.0, 0.0]), rawPrediction=DenseVector([2.705, 1.4476, -0.8782, -3.2744]), probability=DenseVector([0.7606, 0.2163, 0.0211, 0.0019]), prediction=0.0),
 Row(label=1, features=DenseVector([477.0, 8.0, 1121.0, 0.0]), rawPrediction=DenseVector([2.7065, 1.4462, -0.8781, -3.2746]), probability=DenseVector([0.7611, 0.2158, 0.0211, 0.0019]), prediction=0.0),
 Row(label=1, features=DenseVector([477.0, 9.0, 1121.0, 0.0]), rawPrediction=DenseVector([2.7072, 1.4456, -0.8781, -3.2747]), probability=DenseVector([0.7614, 0.2156, 0.0211, 0.0019]), prediction=0.0),
 Row(label=1, features=DenseVector([477.0, 10.0, 1121.0, 0.0]), rawPrediction=DenseVector([2.7079, 1.4449, -0.8781, -3.2747]), probability=DenseVector([0.7616, 0.2154, 0.0211, 0.0019]), predicti

In [None]:
lrmodel.summary.accuracy

In [None]:
lrmodel.summary.precisionByLabel

### Optimised

In [None]:
# pipeline, just running it on classifier no transformations
pipeline = Pipeline(stages=[lr])

In [None]:
# set up param grid to iterate over
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .build()

In [None]:
# set up crossvalidator to tune parameters and optimize
crossval = CrossValidator(estimator=pipeline,
                         estimatorParamMaps=paramGrid,
                         evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                         numFolds=2)
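
The cross-validator ranks parameter settings by the `f1` metric, which to my understanding is the support-weighted average of the per-class F1 scores. A plain-Python sketch of that metric on made-up labels (the function name and the toy data are hypothetical):

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Per-class F1, averaged with weights equal to each class's share of the true labels."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += f1 * n_true / total
    return score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]  # a model that never predicts the minority class
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because the majority class dominates the weights, a model that never predicts a downgrade can still reach a moderate weighted F1, so it is worth looking at the per-class numbers as well.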

In [None]:
cvModel = crossval.fit(train)  # train model

In [None]:
results = cvModel.transform(test)  # apply model on test data

In [None]:
cvModel.avgMetrics  # look at model scoring metrics

In [None]:
results.count()  # total number of rows in the test set

In [None]:
print(results.filter(results.label == results.prediction).count())  # check how many were predicted correctly

In [None]:
results.filter(results.label == results.prediction).count()/results.count()   # fraction predicted correctly (accuracy)
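
Accuracy alone can hide a model that misses every at-risk event, as the baseline output above hints (rows with `label=1` predicted as `0.0`). A plain-Python sketch with made-up labels, comparing accuracy with the recall of class 1; both helper names are hypothetical:

```python
def accuracy(labels, predictions):
    """Share of rows where the prediction equals the label."""
    return sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)

def recall_for(labels, predictions, cls):
    """Share of rows with true label `cls` that were predicted as `cls`."""
    relevant = [p for y, p in zip(labels, predictions) if y == cls]
    return sum(1 for p in relevant if p == cls) / len(relevant) if relevant else 0.0

toy_labels = [0] * 8 + [1] * 2
toy_predictions = [0] * 10  # always predicts "no downgrade"
acc = accuracy(toy_labels, toy_predictions)
rec = recall_for(toy_labels, toy_predictions, 1)
print(acc, rec)
```

Here accuracy is 0.8 while recall for class 1 is 0.0, so checking how many `label == 1` rows are actually predicted as 1 (as the last cells do) is the more telling test for churn.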

In [None]:
results.filter(results["prediction"]==1).head(5)

In [None]:
results.filter(results["label"]==1).head(50)