# Sparkify Project Workspace
This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster are included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

In [80]:
# import libraries
from pyspark.sql import SparkSession
from pathlib import Path # better file paths
from pyspark.sql.functions import countDistinct, col, when, lit

In [2]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName('Sparkify') \
    .getOrCreate()
spark

In [3]:
# what is the project folder?
import os; os.getcwd()

'/Users/jas/github/udacity-data-scientist-nanodegree-capstone-project'

# Load and Clean Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [4]:
# reading in data from local data folder which is gitignored due to large file size
event_data = Path.cwd() / "data" / "mini_sparkify_event_data.json"
df = spark.read.json(str(event_data))
df.head()

Row(artist='Martha Tilston', auth='Logged In', firstName='Colin', gender='M', itemInSession=50, lastName='Freeman', length=277.89016, level='paid', location='Bakersfield, CA', method='PUT', page='NextSong', registration=1538173362000, sessionId=29, song='Rockpools', status=200, ts=1538352117000, userAgent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0', userId='30')

In [26]:
print(df.columns) # this way prints horizontally

['artist', 'auth', 'firstName', 'gender', 'itemInSession', 'lastName', 'length', 'level', 'location', 'method', 'page', 'registration', 'sessionId', 'song', 'status', 'ts', 'userAgent', 'userId']


In [5]:
df.limit(5).toPandas()

Unnamed: 0,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,Martha Tilston,Logged In,Colin,M,50,Freeman,277.89016,paid,"Bakersfield, CA",PUT,NextSong,1538173362000,29,Rockpools,200,1538352117000,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,30
1,Five Iron Frenzy,Logged In,Micah,M,79,Long,236.09424,free,"Boston-Cambridge-Newton, MA-NH",PUT,NextSong,1538331630000,8,Canada,200,1538352180000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",9
2,Adam Lambert,Logged In,Colin,M,51,Freeman,282.8273,paid,"Bakersfield, CA",PUT,NextSong,1538173362000,29,Time For Miracles,200,1538352394000,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,30
3,Enigma,Logged In,Micah,M,80,Long,262.71302,free,"Boston-Cambridge-Newton, MA-NH",PUT,NextSong,1538331630000,8,Knocking On Forbidden Doors,200,1538352416000,"""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebK...",9
4,Daft Punk,Logged In,Colin,M,52,Freeman,223.60771,paid,"Bakersfield, CA",PUT,NextSong,1538173362000,29,Harder Better Faster Stronger,200,1538352676000,Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) G...,30
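
The cleaning step asked for above (records without userids or sessionids) could be sketched as a single SQL-style filter, the same string form of `filter` used elsewhere in this notebook. This is a minimal sketch, not a definitive rule: it assumes a missing `userId` shows up as null or as an empty/whitespace-only string and a missing `sessionId` as null. The names `VALID_ID_FILTER`, `drop_rows_without_ids` and `df_clean` are hypothetical.

```python
# Assumption: a missing userId is null or an empty/whitespace-only string,
# and a missing sessionId is null. Adjust the condition if your data differs.
VALID_ID_FILTER = (
    "userId IS NOT NULL AND trim(userId) != '' AND sessionId IS NOT NULL"
)


def drop_rows_without_ids(events_df):
    """Return only the rows of a Spark DataFrame that carry both IDs."""
    return events_df.filter(VALID_ID_FILTER)


# Usage in this notebook (df_clean is a hypothetical variable name):
# df_clean = drop_rows_without_ids(df)
# print(df.count() - df_clean.count(), "rows dropped")
```

Comparing the row counts before and after the filter shows how much data the rule removes, which is worth checking before committing to it.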


# Exploratory Data Analysis
When you're working with the full dataset, perform EDA by loading a small subset of the data and doing basic manipulations within Spark. In this workspace, you are already provided a small subset of data you can explore.

### Define Churn

Once you've done some preliminary analysis, create a column `Churn` to use as the label for your model. I suggest using the `Cancellation Confirmation` events to define your churn, which happen for both paid and free users. As a bonus task, you can also look into the `Downgrade` events.

### Explore Data
Once you've defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.

## Define Churn

* Which column contains `Cancellation Confirmation`? Probably `page`.

In [28]:
df.groupby('page').count().sort('count', ascending=False).limit(50).toPandas()

Unnamed: 0,page,count
0,NextSong,228108
1,Home,14457
2,Thumbs Up,12551
3,Add to Playlist,6526
4,Add Friend,4277
5,Roll Advert,3933
6,Login,3241
7,Logout,3226
8,Thumbs Down,2546
9,Downgrade,2055


* How many total rows (in this sample dataset)?

In [33]:
num_rows = df.count()
num_rows

286500

* How many total users?

In [53]:
#num_users = df.select(countDistinct("userId")).toPandas().values[0]
# note: countDistinct also counts the blank ('') userId as a user
num_users = df.select(countDistinct("userId")).toPandas().iloc[0, 0]
num_users

226

In [55]:
num_users_cancelled = df.filter('page == "Cancellation Confirmation"').select(countDistinct("userId")).toPandas().iloc[0, 0]
num_users_cancelled

52

In [59]:
round(num_users_cancelled/num_users * 100, 1) # percentage of users that cancelled

23.0

In [None]:
# actually update the data with churn value

* We only need to find the IDs of the users who churned; all remaining users did not churn.

In [94]:
cancelled_ids = df.filter('page == "Cancellation Confirmation"').select("userId").distinct()
# Convert to list to be used to filter later
cancelled_ids = cancelled_ids.toPandas()['userId'].tolist()
cancelled_ids[:5]

['125', '51', '54', '100014', '101']

In [68]:
len(cancelled_ids)  # cancelled_ids is a Python list at this point, so use len()

52

* When the userId matches a cancelled user, we assign a value of **1**; otherwise the value is **0**.

In [98]:
# integer labels (1 = churned, 0 = stayed) so the column can be used directly as a model label
df = df.withColumn("Churn", when(col("userId").isin(cancelled_ids), lit(1)).otherwise(lit(0)))
df.show()

+--------------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+---------------+-------------+---------+--------------------+------+-------------+--------------------+------+-----+
|              artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|           page| registration|sessionId|                song|status|           ts|           userAgent|userId|Churn|
+--------------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+---------------+-------------+---------+--------------------+------+-------------+--------------------+------+-----+
|      Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|       NextSong|1538173362000|       29|           Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|    0|
|    Five Iron Frenzy|Logged In|    Micah|     M|           79|    L

In [101]:
df.groupby('Churn').agg(countDistinct("userId")).toPandas()

Unnamed: 0,Churn,count(DISTINCT userId)
0,0,174
1,1,52


## Explore Data

### How many columns and what type of data?

In [6]:
df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [7]:
## Descriptive stats
df.describe().toPandas()

Unnamed: 0,summary,artist,auth,firstName,gender,itemInSession,lastName,length,level,location,method,page,registration,sessionId,song,status,ts,userAgent,userId
0,count,228108,286500,278154,278154,286500.0,278154,228108.0,286500,278154,286500,286500,278154.0,286500.0,228108,286500.0,286500.0,278154,286500.0
1,mean,551.0852017937219,,,,114.41421291448516,,249.1171819778372,,,,,1535358834085.557,1041.526554973822,Infinity,210.05459685863875,1540956889810.4714,,59682.02278593872
2,stddev,1217.7693079161374,,,,129.76726201141085,,99.23517921058324,,,,,3291321616.328068,726.7762634630834,,31.50507848842202,1507543960.8187113,,109091.9499991052
3,min,!!!,Cancelled,Adelaida,F,0.0,Adams,0.78322,free,"Albany, OR",GET,About,1521380675000.0,1.0,ÃÂg ÃÂtti GrÃÂ¡a ÃÂsku,200.0,1538352117000.0,"""Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10)...",
4,max,ÃÂlafur Arnalds,Logged Out,Zyonna,M,1321.0,Wright,3024.66567,paid,"Winston-Salem, NC",PUT,Upgrade,1543247354000.0,2474.0,ÃÂau hafa sloppiÃÂ° undan ÃÂ¾unga myrkursins,404.0,1543799476000.0,Mozilla/5.0 (compatible; MSIE 9.0; Windows NT ...,99.0


## artist

* Who are the most popular artists?

In [8]:
df.groupby('artist').count().sort('count', ascending=False).limit(5).toPandas() 

Unnamed: 0,artist,count
0,,58392
1,Kings Of Leon,1841
2,Coldplay,1813
3,Florence + The Machine,1236
4,Dwight Yoakam,1135


## auth

In [9]:
df.groupby('auth').count().sort('count', ascending=False).limit(5).toPandas() 

Unnamed: 0,auth,count
0,Logged In,278102
1,Logged Out,8249
2,Guest,97
3,Cancelled,52


## Create a function to do the same thing for each column

In [10]:
def top5_values(column_name):
    """
    Take a column name and find the most frequent values
    """
    return df.groupby(column_name).count().sort('count', ascending=False).limit(5).toPandas()

In [11]:
for column in df.columns:
    top_values = top5_values(column)
    print(top_values, "\n")

artist  count
0                    None  58392
1           Kings Of Leon   1841
2                Coldplay   1813
3  Florence + The Machine   1236
4           Dwight Yoakam   1135 

         auth   count
0   Logged In  278102
1  Logged Out    8249
2       Guest      97
3   Cancelled      52 

  firstName  count
0    Payton   9632
1      None   8346
2     Riley   7970
3    Lucero   6880
4    Emilia   5732 

  gender   count
0      F  154578
1      M  123576
2   None    8346 

   itemInSession  count
0              0   3278
1              1   3125
2              2   3067
3              3   3013
4              4   2977 

   lastName  count
0  Campbell  14060
1      Reed   9284
2  Williams   8410
3      None   8346
4    Taylor   7230 

      length  count
0        NaN  58392
1  239.30730   1205
2  348.57751   1037
3  201.79546    908
4  655.77751    730 

  level   count
0  paid  228162
1  free   58338 

                                location  count
0     Los Angeles-Long Beach-Anaheim, C

### Findings
* Rows without a **userId** need special handling, since we can't aggregate their data across sessions. If only a few hits within an otherwise identified session are missing the ID, we may be able to fill it in from that session; otherwise those rows should be dropped.

* Exclude columns that are not relevant for modeling, such as first and last name; **userId** is all we need to identify a user for now.

* How many user agents does each user have? Can we separate out browser and OS?

* Can we extract the user's state from the location to create a new variable with less cardinality?
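
The last two findings could be explored with small helpers like the ones below. This is a rough sketch under assumptions: locations are taken to look like `City, ST` or `City-City, ST-ST` (as in the rows shown earlier), only the first state code is kept, and the OS detection is a simple substring heuristic rather than a full user-agent parser. All names (`first_state`, `os_family`, `add_state_column`, `OS_MARKERS`) are hypothetical.

```python
import re

# Assumption: locations look like "City, ST" or "City-City, ST-ST";
# the pattern keeps the first two-letter state code after the comma.
STATE_PATTERN = r", ([A-Z]{2})"


def first_state(location):
    """Return the first state code in a location string, or None."""
    if not location:
        return None
    match = re.search(STATE_PATTERN, location)
    return match.group(1) if match else None


# Ordered (marker, family) pairs; a rough heuristic, not a full UA parser.
# iPhone/iPad come first because their user agents also mention "Mac OS X",
# and Android comes before Linux because Android user agents mention Linux.
OS_MARKERS = [
    ("iPhone", "iOS"), ("iPad", "iOS"), ("Android", "Android"),
    ("Windows", "Windows"), ("Macintosh", "Mac"), ("Linux", "Linux"),
]


def os_family(user_agent):
    """Return a coarse OS family for a user-agent string, or None if missing."""
    if not user_agent:
        return None
    for marker, family in OS_MARKERS:
        if marker in user_agent:
            return family
    return "Other"


def add_state_column(events_df):
    """Spark version: add a `state` column using the same regular expression.

    Note: Spark's regexp_extract yields '' (not null) when nothing matches.
    """
    return events_df.selectExpr(
        "*", f"regexp_extract(location, '{STATE_PATTERN}', 1) AS state"
    )


# Usage in this notebook:
# add_state_column(df).groupby("state").count().toPandas()
```

The pure-Python helpers make it easy to spot-check the rules on a few strings before applying them to the whole DataFrame; `os_family` could be wrapped with `pyspark.sql.functions.udf` if a Spark column is needed.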

## Gender

* gender - we expect each user to have exactly one value

In [24]:
df.groupBy('userId')\
.agg(countDistinct("gender"))\
.withColumnRenamed("count(DISTINCT gender)", 'gender_count')\
.filter('gender_count != 1').show()

+------+------------+
|userId|gender_count|
+------+------------+
|      |           0|
+------+------------+



# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, follow the steps below.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

In [None]:
### How many unique sessions per user?

In [None]:
### Favorite singer

In [None]:
### Favorite song
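
The placeholder cells above could be filled in along these lines. This is a sketch under assumptions, not the final feature set: `events` is a hypothetical temp-view name, rows with a blank `userId` are excluded, a play is taken to be a row whose `page` is `NextSong`, and ties for the favorite artist are broken alphabetically. The helper name `build_user_features` is introduced here for illustration.

```python
# Assumptions: `events` is a hypothetical temp-view name, rows with a blank
# userId are excluded, and a "play" is a row whose page is 'NextSong'.
SESSIONS_PER_USER_SQL = """
SELECT userId, COUNT(DISTINCT sessionId) AS session_count
FROM events
WHERE userId != ''
GROUP BY userId
"""

# Count plays per (user, artist), rank artists within each user,
# and keep the top-ranked one. Ties are broken alphabetically by artist.
FAVORITE_ARTIST_SQL = """
SELECT userId, artist, plays
FROM (
    SELECT userId, artist, plays,
           ROW_NUMBER() OVER (
               PARTITION BY userId ORDER BY plays DESC, artist ASC
           ) AS rn
    FROM (
        SELECT userId, artist, COUNT(*) AS plays
        FROM events
        WHERE page = 'NextSong' AND userId != ''
        GROUP BY userId, artist
    ) AS counts
) AS ranked
WHERE rn = 1
"""


def build_user_features(spark, events_df):
    """Return (sessions per user, favorite artist per user) as DataFrames."""
    events_df.createOrReplaceTempView("events")
    return spark.sql(SESSIONS_PER_USER_SQL), spark.sql(FAVORITE_ARTIST_SQL)


# Usage in this notebook:
# sessions_df, favorite_artist_df = build_user_features(spark, df)
# sessions_df.limit(5).toPandas()
```

The favorite song follows the same pattern with `song` in place of `artist`. Both results have one row per user, so they can be joined on `userId` into a single feature table.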

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
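
To see why F1 is suggested over accuracy here, consider a baseline that predicts "stayed" for every user. The sketch below uses plain Python and the user counts from the Churn table above (174 stayed, 52 churned); the function name `f1_and_accuracy` is hypothetical, and zero denominators are treated as a score of 0.0 by assumption.

```python
def f1_and_accuracy(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels, 1 = churned."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumption: an undefined ratio (zero denominator) is scored as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, precision, recall, f1


# User counts from the Churn table above: 174 stayed, 52 churned.
y_true = [0] * 174 + [1] * 52
always_stay = [0] * len(y_true)

accuracy, precision, recall, f1 = f1_and_accuracy(y_true, always_stay)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.770, f1=0.000
```

The baseline looks respectable on accuracy while identifying no churners at all, which is exactly the failure F1 exposes.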

# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meet all expectations. Remember, this includes thorough documentation in a README file in a GitHub repository, as well as a web app or blog post.