# Predicting User Churn in a Digital Music Service

JSON data can be semi-structured and carry additional metadata, 
so the DataFrame layout may not come out as expected after loading.
See the documentation of 'SparkSession.read' (a property that returns a DataFrameReader) 
for the options available to adjust data loading.
PySpark documentation: 
http://spark.apache.org/docs/2.0.2/api/python/pyspark.sql.html 
pyspark.sql.DataFrameReader.json

Notes:
all times (*ts*, *registration*) are Unix timestamps in milliseconds

### Data Definition
#### Useful:
- *location*: location of the user as "city, state"; areas spanning several states seem to append each state with a hyphen, e.g. 'Kingsport-Bristol-Bristol, TN-VA'
- *gender*: user gender (M/F/None)

- *page*: the page the user is on during the event (see the list of pages below)
- *level*: subscription level (free or paid)
- *auth*: authentication status (Logged In, Logged Out, Guest, Cancelled)
- *length*: time spent on the page in seconds; the maximum is about 50 minutes, on NextSong (possibly because a song was paused?)

- *registration*: registration time of the user (Unix time in ms)
- *ts*: timestamp of the event (Unix time in ms)

- *userId*: unique user identifier
- *sessionId*: unique sessionId per user?
- *itemInSession*: counter for the number of items in a single session (item listened to in session)


#### Not Useful:
- *firstName*: user's first name (not important, remove)
- *lastName*: user's last name
- *artist*: song artist
- *song*: songname
- *userAgent*: device/browser (not important for us, remove)
- *method*: API PUT/GET http request (not important for us, remove)
- *status*: http status

# Setup

In [1]:
# imports
import ibmos2spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import col, when

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200618175721-0000
KERNEL_ID = bff3e845-648a-42ea-be84-091938adfce3


In [2]:
# config
# @hidden_cell
credentials = {
    'endpoint': 'https://s3.eu-geo.objectstorage.service.networklayer.com',
    'service_id': 'iam-ServiceId-147e1161-7da9-41fe-ac00-c144730def00',
    'iam_service_endpoint': 'https://iam.cloud.ibm.com/oidc/token',
    'api_key': 'kAtvjdC8VIYYUmU3gDaOYIK2fCvP3nkjYYlDiNuu4gw6'
}

configuration_name = 'os_76774389dfa04fb5acbb1640b3e11704_configs'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

In [3]:
# Build Spark session
spark = SparkSession.builder.getOrCreate()

In [4]:
# Read in data from IBM Cloud
data_df = spark.read.json(cos.url('medium-sparkify-event-data.json', 'sparkify-donotdelete-pr-fnqu5byx41gcai'))

# Data Exploration

In [5]:
data_df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [6]:
data_df.head(1)

[Row(artist='Martin Orford', auth='Logged In', firstName='Joseph', gender='M', itemInSession=20, lastName='Morales', length=597.55057, level='free', location='Corpus Christi, TX', method='PUT', page='NextSong', registration=1532063507000, sessionId=292, song='Grand Designs', status=200, ts=1538352011000, userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='293')]

In [7]:
# get dataset metadata
# create temp sql table
data_df.createOrReplaceTempView("user_log_table")

### Metadata: No. of Users

In [8]:
# how many users in the dataset, unique userId
spark.sql("SELECT COUNT(DISTINCT(userId)) FROM user_log_table LIMIT 10").show()

+----------------------+
|count(DISTINCT userId)|
+----------------------+
|                   449|
+----------------------+



### Feature: Pages

In [9]:
# look at unique pages
spark.sql("SELECT DISTINCT(page) FROM user_log_table LIMIT 100").collect()

[Row(page='Cancel'),
 Row(page='Submit Downgrade'),
 Row(page='Thumbs Down'),
 Row(page='Home'),
 Row(page='Downgrade'),
 Row(page='Roll Advert'),
 Row(page='Logout'),
 Row(page='Save Settings'),
 Row(page='Cancellation Confirmation'),
 Row(page='About'),
 Row(page='Submit Registration'),
 Row(page='Settings'),
 Row(page='Login'),
 Row(page='Register'),
 Row(page='Add to Playlist'),
 Row(page='Add Friend'),
 Row(page='NextSong'),
 Row(page='Thumbs Up'),
 Row(page='Help'),
 Row(page='Upgrade'),
 Row(page='Error'),
 Row(page='Submit Upgrade')]

### Feature: length

In [10]:
# max and min length, converted from seconds to minutes
spark.sql("SELECT MAX(length)/60,  MIN(length)/60 FROM user_log_table LIMIT 100").collect()

[Row((max(length) / CAST(60 AS DOUBLE))=50.4110945, (min(length) / CAST(60 AS DOUBLE))=0.013053666666666667)]

In [11]:
# the 20 longest events by length (in minutes) and the page they occurred on
spark.sql("SELECT page, length/60 FROM user_log_table ORDER BY length DESC  LIMIT 20").collect()

[Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=50.4110945),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=49.34312166666667),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=45.7181965),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=45.537951666666665),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=45.153951666666664),
 Row(page='NextSong', (length / CAST(60 AS DOUBLE))=43.774251),
 Ro

### Feature: level

In [12]:
# unique levels
spark.sql("SELECT DISTINCT(level) FROM user_log_table LIMIT 100").collect()


[Row(level='free'), Row(level='paid')]

### Feature: gender

In [13]:
# unique genders
spark.sql("SELECT DISTINCT(gender) FROM user_log_table LIMIT 100").collect()

[Row(gender='F'), Row(gender=None), Row(gender='M')]

### Feature: authentication

In [14]:
spark.sql("SELECT DISTINCT(auth) FROM user_log_table LIMIT 100").collect()

[Row(auth='Logged Out'),
 Row(auth='Cancelled'),
 Row(auth='Guest'),
 Row(auth='Logged In')]

### Feature: location

In [15]:
spark.sql("SELECT DISTINCT(location) FROM user_log_table LIMIT 1000").collect()

[Row(location='Atlantic City-Hammonton, NJ'),
 Row(location='Gainesville, FL'),
 Row(location='Richmond, VA'),
 Row(location='Oskaloosa, IA'),
 Row(location='Tucson, AZ'),
 Row(location='Deltona-Daytona Beach-Ormond Beach, FL'),
 Row(location='San Diego-Carlsbad, CA'),
 Row(location='Cleveland-Elyria, OH'),
 Row(location='Medford, OR'),
 Row(location='Kingsport-Bristol-Bristol, TN-VA'),
 Row(location='New Haven-Milford, CT'),
 Row(location='Birmingham-Hoover, AL'),
 Row(location='Corpus Christi, TX'),
 Row(location='Mobile, AL'),
 Row(location='Dubuque, IA'),
 Row(location='Las Vegas-Henderson-Paradise, NV'),
 Row(location='Killeen-Temple, TX'),
 Row(location='Ottawa-Peru, IL'),
 Row(location='Boise City, ID'),
 Row(location='Bremerton-Silverdale, WA'),
 Row(location='Urban Honolulu, HI'),
 Row(location='Cedar City, UT'),
 Row(location='Indianapolis-Carmel-Anderson, IN'),
 Row(location='Durham-Chapel Hill, NC'),
 Row(location='Seattle-Tacoma-Bellevue, WA'),
 Row(location='Fort Smith, A

### Feature: userAgent

In [16]:
spark.sql("SELECT DISTINCT(userAgent) FROM user_log_table LIMIT 100").collect()

[Row(userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"'),
 Row(userAgent='"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"'),
 Row(userAgent='Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:31.0) Gecko/20100101 Firefox/31.0'),
 Row(userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"'),
 Row(userAgent='"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"'),
 Row(userAgent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:31.0) Gecko/20100101 Firefox/31.0'),
 Row(userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'),
 Row(userAgent='Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'),
 Row(userAgent='Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW6

# Data Cleaning

In [49]:
# let's remove the columns that data exploration suggests will not be useful
cols_to_drop = ['firstName', 'lastName', 'artist', 'song', 'method', 'status', 'userAgent']
user_log_df = data_df.drop(*cols_to_drop)

In [50]:
# event unix to datetime
user_log_df = user_log_df.withColumn("timestamp_datetime",
                                     from_unixtime(data_df.ts/1000,
                                                   format='yyyy-MM-dd HH:mm:ss'))

In [51]:
# registration unix to datetime
user_log_df = user_log_df.withColumn("registration_datetime",
                                     from_unixtime(data_df.registration/1000,
                                                   format='yyyy-MM-dd HH:mm:ss'))

In [52]:
# take a look
user_log_df.head()

Row(auth='Logged In', gender='M', itemInSession=20, length=597.55057, level='free', location='Corpus Christi, TX', page='NextSong', registration=1532063507000, sessionId=292, ts=1538352011000, userId='293', timestamp_datetime='2018-10-01 00:00:11', registration_datetime='2018-07-20 05:11:47')
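
The two converted values in the row above can be cross-checked outside Spark. The following is a minimal sketch using only the Python standard library; the helper name `unix_ms_to_str` is hypothetical, and it assumes the Spark session time zone was UTC, which the printed values appear to agree with.

```python
from datetime import datetime, timezone


def unix_ms_to_str(ms):
    """Format a Unix timestamp in milliseconds as 'yyyy-MM-dd HH:mm:ss' in UTC.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')


# ts and registration values copied from the first row of the dataset
event_time = unix_ms_to_str(1538352011000)
registration_time = unix_ms_to_str(1532063507000)

print(event_time)
print(registration_time)
```

If the session time zone were not UTC, `from_unixtime` would render the same instants shifted by the local offset, so a mismatch here would point at the time zone setting rather than at the data.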

# Feature Engineering

In [21]:
# time since registration
user_log_df = user_log_df.withColumn('seconds_since_registration',
                                     (user_log_df['ts'] - user_log_df['registration']) / 1000 )
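
To get a feel for the scale of this feature, the sketch below repeats the same arithmetic in plain Python for the first row of the dataset and also expresses the result in days. The constant and variable names are illustrative assumptions, not part of the notebook.

```python
MS_PER_SECOND = 1000
SECONDS_PER_DAY = 24 * 60 * 60

# values copied from the first row of the dataset (Unix time in ms)
ts = 1538352011000
registration = 1532063507000

# same formula as the Spark column: (ts - registration) / 1000
seconds_since_registration = (ts - registration) / MS_PER_SECOND
days_since_registration = seconds_since_registration / SECONDS_PER_DAY

print(seconds_since_registration)
print(round(days_since_registration, 2))
```

Values in the millions of seconds are hard to read at a glance, so a days-based version of the column may be easier to inspect during exploration.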

In [None]:
# scratch: additional missing-value conditions considered for the filter below
# user_log_df["location"].isNull() | isnan(user_log_df["location"])

In [61]:
# count rows with a missing location; None values break the split used below
from pyspark.sql.functions import isnan
user_log_df.filter(user_log_df["location"].isNull()).count()

15700

In [68]:
# replace missing values to allow split
user_log_df = user_log_df.fillna({'location':''})

In [69]:
# create state column: keep the text after the last ', ' in location
state_abbr = udf(lambda x: x.split(', ')[-1], StringType())
user_log_df = user_log_df.withColumn("usstate_abbr",
                                     when(user_log_df.location.isNotNull(),
                                          state_abbr(user_log_df.location)).otherwise(''))

In [71]:
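# States seem to be appended (e.g. 'TN-VA'), so take the latest
# NOTE: this udf is applied to the full location string, not to usstate_abbr,
# so it returns whatever follows the last '-': a value without a hyphen such as
# 'Corpus Christi, TX' comes back unchanged (see the output below)
state_abbr = udf(lambda x: x.split('-')[-1], StringType())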
user_log_df = user_log_df.withColumn("usstate_abbr",
                                     when(user_log_df.location.isNotNull(),
                                          state_abbr(user_log_df.location)).otherwise(''))

In [72]:
# take a look
user_log_df.head(1)

[Row(auth='Logged In', gender='M', itemInSession=20, length=597.55057, level='free', location='Corpus Christi, TX', page='NextSong', registration=1532063507000, sessionId=292, ts=1538352011000, userId='293', timestamp_datetime='2018-10-01 00:00:11', registration_datetime='2018-07-20 05:11:47', usstate_abbr='Corpus Christi, TX')]
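
The output above shows `usstate_abbr` still holding the full location for this row. One possible way to end up with a single state code is to chain both splits: first take the text after the last ', ', then the text after the last '-'. The sketch below shows that logic as a plain Python function (the name `latest_state` is hypothetical); it could be wrapped with `udf(latest_state, StringType())` in the same way as the cells above. Taking the last listed state follows the notebook's stated choice and is an assumption, not a verified rule about the data.

```python
def latest_state(location):
    """Return the last state abbreviation found in a 'city, ST' or 'city, ST-ST' string.

    Hypothetical helper; returns '' for missing or empty locations.
    """
    if not location:
        return ''
    # 'Kingsport-Bristol-Bristol, TN-VA' -> 'TN-VA'
    states = location.split(', ')[-1]
    # 'TN-VA' -> 'VA'; a single state such as 'TX' is returned unchanged
    return states.split('-')[-1]


# location strings taken from the exploration output, plus the two missing-value cases
examples = ['Corpus Christi, TX', 'Kingsport-Bristol-Bristol, TN-VA',
            'Atlantic City-Hammonton, NJ', '', None]
for loc in examples:
    print(repr(loc), '->', repr(latest_state(loc)))
```

Handling `None` and the empty string inside the function also means the result does not depend on `fillna` having been run first.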

In [None]:
# calculate average listening time
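
The cell above is left empty in the source. As one possible reading of "average listening time", the sketch below averages `length` per user over NextSong events in plain Python, using a few made-up rows. In Spark the same idea would be a filter on `page` followed by `groupBy('userId')` and `avg('length')`. The sample rows, the function name `average_length_per_user`, and the choice to restrict to NextSong are all assumptions.

```python
from collections import defaultdict


def average_length_per_user(events):
    """Average 'length' (seconds) per userId over NextSong events.

    Hypothetical helper; events is an iterable of dicts with the keys
    'userId', 'page' and 'length'.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for event in events:
        # skip pages other than NextSong and rows without a length
        if event['page'] != 'NextSong' or event['length'] is None:
            continue
        totals[event['userId']] += event['length']
        counts[event['userId']] += 1
    return {user: totals[user] / counts[user] for user in totals}


# made-up rows for illustration only; they are not taken from the dataset
sample_events = [
    {'userId': '1', 'page': 'NextSong', 'length': 200.0},
    {'userId': '1', 'page': 'NextSong', 'length': 300.0},
    {'userId': '1', 'page': 'Home', 'length': None},
    {'userId': '2', 'page': 'NextSong', 'length': 180.0},
]

averages = average_length_per_user(sample_events)
print(averages)
```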