# SPARKIFY PROJECT - SPECIAL EMR NOTEBOOK
This notebook explores the full 12GB dataset: [s3n://udacity-dsnd/sparkify/sparkify_event_data.json](s3n://udacity-dsnd/sparkify/sparkify_event_data.json)

The goal is to reuse everything that was built on a smaller machine with a smaller dataset and:
* see if it scales well with more data
* collect more data (the small dataset contains only 225 users)

As described in the [AWS documentation](https://aws.amazon.com/fr/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/), run the cell below to install the pandas package (and do not forget to select the `PySpark` kernel!).

In [1]:
sc.install_pypi_package("pandas==0.25.1") #Install pandas version 0.25.1 

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
3,application_1571908068699_0004,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Collecting pandas==0.25.1
  Using cached https://files.pythonhosted.org/packages/73/9b/52e228545d14f14bb2a1622e225f38463c8726645165e1cb7dde95bfe6d4/pandas-0.25.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting python-dateutil>=2.6.1
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Installing collected packages: python-dateutil, pandas
Successfully installed pandas-0.25.1 python-dateutil-2.8.0

In [20]:
# Needed to save CSV files into S3
sc.install_pypi_package("s3fs")

Collecting s3fs
  Downloading https://files.pythonhosted.org/packages/3e/0d/d5f7ef4a3c2237b71e7e406ed6f512815874bfc5c042ebd680bd29d7d06e/s3fs-0.3.5.tar.gz (46kB)
Collecting boto3>=1.9.91
  Downloading https://files.pythonhosted.org/packages/b9/25/0d28ef2ec3459f8399fb45f0578f21a937b90bd3b00aea07d35d6e39c041/boto3-1.10.1-py2.py3-none-any.whl (128kB)
Collecting botocore>=1.12.91
  Downloading https://files.pythonhosted.org/packages/62/32/d7ff8ad00fcdfa235dba4b2d0f1605beac5a8287e7e8f040d6bfffb8a1a8/botocore-1.13.1-py2.py3-none-any.whl (5.3MB)
Collecting fsspec>=0.2.2
  Downloading https://files.pythonhosted.org/packages/54/18/59f1850336568168144a746afa7199a0311f1a89bdb7b0d8e8b50d2c3d93/fsspec-0.5.2.tar.gz (64kB)
Collecting s3transfer<0.3.0,>=0.2.0
  Downloading https://files.pythonhosted.org/packages/16/8a/1fc3dba0c4923c2a76e1ff0d52b305c44606da63f718d14d3231e21c51b0/s3transfer-0.2.1-py2.py3-none-any.whl (70kB)
Collecting docutils<0.16,>=0.10
  Downloading https://files.pythonhosted.org/pac

All intermediate CSV files will be stored under this Amazon S3 folder:

In [22]:
S3_SAVE_PATH = "s3://aws-emr-resources-604676609121-eu-west-1/notebooks/"

## Import libraries, init Spark and load dataset

In [2]:
import pyspark
from pyspark.sql import SparkSession, Window

from pyspark.sql.functions import udf, desc, isnan, when, count, col, lit
from pyspark.sql.functions import max as Fmax
from pyspark.sql.types import IntegerType, FloatType

import numpy as np
import pandas as pd

from datetime import datetime
from datetime import timedelta
import re

In [3]:
# Create or retrieve a Spark session
spark = SparkSession.builder.appName("dsnd-p7-sparkify").getOrCreate()

In [4]:
event_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
df = spark.read.json(event_data)
df.show(3)

+-----------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+-------+
|     artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent| userId|
+-----------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+-------+
|  Popol Vuh|Logged In|    Shlok|     M|          278| Johnson|524.32934| paid|Dallas-Fort Worth...|   PUT|NextSong|1533734541000|    22683|Ich mache einen S...|   200|1538352001000|"Mozilla/5.0 (Win...|1749042|
|Los Bunkers|Logged In|  Vianney|     F|            9|  Miller|238.39302| paid|San Francisco-Oak...|   PUT|NextSong|1537500318000|    20836|         MiÃ

In [5]:
print("Loaded pyspark dataframe has shape ({}, {})".format(df.count(), len(df.columns)))

Loaded pyspark dataframe has shape (26259199, 18)

We now have more than 26 million rows to deal with!

# CLEAN: remove empty users and build the `churn` target feature

In [6]:
df_clean_users = df.filter((~(isnan(df['userId']))) & (df['userId'].isNotNull()) & (df['userId'] != ""))
print("Cleaned pyspark dataframe has now shape ({}, {})".format(df_clean_users.count(), len(df_clean_users.columns)))

# Define the UDF (do not forget to specify the return type, otherwise String is used by default)
user_has_churned = udf(lambda x: 1 if x == "Cancellation Confirmation" else 0, IntegerType())

# Apply this function on a specific column of the whole dataset
# (made with the help of: https://gist.github.com/zoltanctoth/2deccd69e3d1cde1dd78
# and https://docs.databricks.com/spark/latest/spark-sql/udf-python.html)
df_users_with_churn = df_clean_users.withColumn("churn", user_has_churned("page"))

print("Pyspark dataframe has now shape ({}, {})".format(df_users_with_churn.count(), len(df_users_with_churn.columns)))

df_users_with_churn_full = df_users_with_churn.withColumn("churn", Fmax('churn').over(Window.partitionBy("userId")))

# Count the distinct users who churned in the whole dataset
df_users_with_churn_full.filter(df_users_with_churn_full['churn'] == 1).select('userId').dropDuplicates().count()

Cleaned pyspark dataframe has now shape (26259199, 18)
Pyspark dataframe has now shape (26259199, 19)
5003

***Note:*** 5003 users in the full dataset have churned, which gives the classifier many more positive examples than the small dataset did.
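
The window step in the cell above gives every event of a user the maximum of that user's per-event `churn` flag, so a single `Cancellation Confirmation` event marks the whole history of that user. A minimal plain-Python sketch of that logic, on a few made-up events (the user ids and the event sequence are illustrative assumptions, only the page names come from the dataset):

```python
from collections import defaultdict

# Hypothetical (userId, page) events, not taken from the dataset.
events = [
    ("u1", "NextSong"),
    ("u1", "Cancellation Confirmation"),
    ("u2", "NextSong"),
    ("u2", "Thumbs Up"),
]

# Step 1: per-event flag, same rule as the `user_has_churned` UDF.
flags = [(user, 1 if page == "Cancellation Confirmation" else 0) for user, page in events]

# Step 2: per-user maximum, the plain-Python counterpart of
# Fmax('churn').over(Window.partitionBy('userId')).
churn_by_user = defaultdict(int)
for user, flag in flags:
    churn_by_user[user] = max(churn_by_user[user], flag)

# Step 3: write the user-level value back on every event of that user.
labelled = [(user, page, churn_by_user[user]) for user, page in events]
print(labelled)
```

Every event of `u1` ends up with `churn = 1`, including the `NextSong` event that happened before the cancellation.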

---
# 1. FEATURE ENGINEERING
In the previous notebook, the observations and plots led to the conclusion that a few new features could help the model detect churn. As a reminder, they were:
* `level` of subscription (paid or not), transformed into binary 0/1
* `gender`, dummy-encoded (binary 0/1 as well)
* `registration` time for the user
* `engagement` of the user: number of artists and songs, total length of music listened to, number of additions to a playlist
* `social interactions` with likes/dislikes, friends, etc
* `upgrade/downgrade` the subscription level
* `user operating system` which could help us to identify users of a version that does not give entire satisfaction
* `errors_encountered` which could help us to identify users who had several issues and then maybe quit

## 1.1. Keep only useful columns

In [7]:
df_filtered = df_users_with_churn_full.select(['artist', 'gender', 'length', 'level', 'page', 'registration', 'sessionId', 'song', 'status', 'ts', 'userAgent', 'userId', 'churn'])
print("Filtered dataframe has shape ({}, {})".format(df_filtered.count(), len(df_filtered.columns)))

Filtered dataframe has shape (26259199, 13)

In [8]:
# Define the UDFs (do not forget to specify the return type, otherwise String is used by default)
to_dummy_level = udf(lambda x: 1 if x == "paid" else 0, IntegerType())
to_dummy_gender = udf(lambda x: 1 if x == "M" else 0, IntegerType())

df_filtered = df_filtered.withColumn("level", to_dummy_level("level"))
# Use the gender UDF here (to_dummy_level would map every gender to 0)
df_filtered = df_filtered.withColumn("gender", to_dummy_gender("gender"))
df_filtered.show(5)

+-----------+------+---------+-----+-----------+-------------+---------+---------+------+-------------+--------------------+-------+-----+
|     artist|gender|   length|level|       page| registration|sessionId|     song|status|           ts|           userAgent| userId|churn|
+-----------+------+---------+-----+-----------+-------------+---------+---------+------+-------------+--------------------+-------+-----+
|Les Nubians|     0|296.35873|    0|   NextSong|1535470939000|     9784|  Saravah|   200|1539526990000|Mozilla/5.0 (Wind...|1000280|    1|
|       Stoa|     0|353.48853|    0|   NextSong|1535470939000|     9784|     Stoa|   200|1539527286000|Mozilla/5.0 (Wind...|1000280|    1|
|       Fate|     0|329.87383|    0|   NextSong|1535470939000|     9784|    Toxic|   200|1539527639000|Mozilla/5.0 (Wind...|1000280|    1|
|       null|     0|     null|    0|Roll Advert|1535470939000|     9784|     null|   200|1539527726000|Mozilla/5.0 (Wind...|1000280|    1|
|  Anathallo|     0|307.904

In [23]:
df_users = df_filtered.select('userId', 'churn', 'gender').groupby('userId').agg({'churn': 'max', 'gender': 'max'}).withColumnRenamed('max(churn)', 'churn').withColumnRenamed('max(gender)', 'gender')

In [24]:
df_pd_users = df_users.toPandas()

In [25]:
df_pd_users.to_csv(S3_SAVE_PATH + 'df_final_set12G_users.csv', header=1, index=False)

## 1.2. Transform `level` and `gender` into binary 0/1

In [9]:
df_lvl = df_filtered.select('userId', 'level').groupby('userId').agg({'level': 'max'}).withColumnRenamed('max(level)', 'level')

In [10]:
df_pd_lvl = df_lvl.toPandas()

In [21]:
df_pd_lvl.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_level.csv', header=1, index=False)

## 1.3. Count number of days the service has been used by users

In [26]:
# Define the UDF (do not forget to specify the return type, otherwise String is used by default)
to_delta_in_days = udf(lambda x: timedelta(seconds=x).days if x else 0, IntegerType())

# Build a time diff column
df_time_delta = df_filtered.select('userId', 'registration', 'ts').withColumn('timedelta', (df_filtered.ts - df_filtered.registration)/1000)
df_time_delta = df_time_delta.withColumn("timedelta", to_delta_in_days("timedelta"))

# Keep only the max per user
df_time_delta = df_time_delta.select('userId', 'timedelta').groupBy('userId').agg({'timedelta': 'max'}).withColumnRenamed('max(timedelta)', 'timedelta')
df_time_delta.show(5)

+-------+---------+
| userId|timedelta|
+-------+---------+
|1390009|       35|
|1519090|       64|
|1394508|       92|
|1178731|       93|
|1351489|       74|
+-------+---------+
only showing top 5 rows
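
Both `registration` and `ts` are epoch timestamps in milliseconds, which is why the difference is divided by 1000 before being converted to days. As a worked example of that arithmetic for a single event, here is a small sketch using the `registration` and `ts` values of the first row printed by `df_filtered.show(5)` earlier (the helper name `ms_to_days` is an assumption made for this illustration):

```python
from datetime import timedelta


def ms_to_days(registration_ms, ts_ms):
    """Whole days between two epoch timestamps given in milliseconds."""
    seconds = (ts_ms - registration_ms) / 1000
    # Same guard as the UDF: a missing or zero delta counts as 0 days.
    return timedelta(seconds=seconds).days if seconds else 0


# registration = 1535470939000, ts = 1539526990000 (user 1000280, first row shown above)
days = ms_to_days(1535470939000, 1539526990000)
print(days)  # 46
```

The notebook then keeps the maximum of this value per user, so the stored `timedelta` is the number of days between registration and the user's latest event.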

In [27]:
df_pd_time_delta = df_time_delta.toPandas()

In [28]:
df_pd_time_delta.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_timedelta.csv', header=1, index=False)

## 1.4. Measure engagement of the user with the number of artists/songs, total length, etc.

In [29]:
def count_nb_artist_songs(df, col_name, count_unique):
    """
    Count the number of artists or songs listened per user (the count is distinct if count_unique is True)
    """
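    # Drop rows where the column is null (e.g. page views without a song); filter on the `df` argument, not on a global
    innerdf = df.filter(df[col_name].isNotNull()).select('userId', col_name)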
    if count_unique:
        innerdf = innerdf.dropDuplicates()
    return innerdf.groupBy('userId').count().withColumnRenamed('count', 'nb_{}_{}s'.format('unique' if count_unique else 'total', col_name))
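
The only difference between the "unique" and the "total" variant is whether duplicate `(userId, value)` pairs are dropped before counting. A plain-Python sketch of the two counts on made-up events (the user ids and the event list are assumptions for illustration; `None` stands for rows without a song):

```python
from collections import defaultdict

# Hypothetical (userId, song) events.
events = [
    ("u1", "Toxic"),
    ("u1", "Toxic"),
    ("u1", "Stoa"),
    ("u1", None),      # e.g. an advert or a page view: no song attached
    ("u2", "Saravah"),
]

total = defaultdict(int)    # every play counts
unique = defaultdict(set)   # duplicates collapse, like dropDuplicates()

for user, song in events:
    if song is None:
        continue  # same effect as filtering out null values first
    total[user] += 1
    unique[user].add(song)

nb_total_songs = dict(total)
nb_unique_songs = {user: len(songs) for user, songs in unique.items()}
print(nb_total_songs, nb_unique_songs)
```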

In [30]:
def count_nb_page(df, page_value, total_col_name):
    """
    Build a new dataframe filtered on a certain page value and count the number of times each user has seen this page
    """
    return df.filter(df['page'] == page_value).select(['page', 'userId']).groupBy('userId').agg({'page': 'count'}).withColumnRenamed('count(page)', total_col_name)

In [31]:
df_users_unique_songs = count_nb_artist_songs(df_filtered, 'song', True)
df_users_total_songs = count_nb_artist_songs(df_filtered, 'song', False)
df_users_unique_artists = count_nb_artist_songs(df_filtered, 'artist', True)
df_users_total_artists = count_nb_artist_songs(df_filtered, 'artist', False)

df_total_length = df_filtered.select('userId', 'length').groupBy('userId').agg({'length': 'sum'}).withColumnRenamed('sum(length)', 'total_length')

In [32]:
df_add_playlist = count_nb_page(df_filtered, 'Add to Playlist', 'total_add_playlist')

Transform into pandas DataFrames and save them:

In [33]:
df_pd_users_unique_songs = df_users_unique_songs.toPandas()

In [34]:
df_pd_users_total_songs = df_users_total_songs.toPandas()

In [35]:
df_pd_users_unique_artists = df_users_unique_artists.toPandas()

In [36]:
df_pd_total_length = df_total_length.toPandas()

In [37]:
df_pd_add_playlist = df_add_playlist.toPandas()

In [38]:
df_pd_users_unique_songs.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_unique_songs.csv', header=1, index=False)
df_pd_users_total_songs.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_total_songs.csv', header=1, index=False)
df_pd_users_unique_artists.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_unique_artists.csv', header=1, index=False)
df_pd_total_length.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_total_length.csv', header=1, index=False)
df_pd_add_playlist.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_add_to_playlist.csv', header=1, index=False)

## 1.5. Measure social interactions with likes/dislikes, add friend, etc.

In [40]:
df_add_friend = count_nb_page(df_filtered, 'Add Friend', 'total_add_friend')
df_thumbs_up = count_nb_page(df_filtered, 'Thumbs Up', 'total_thumbs_up')
df_thumbs_down = count_nb_page(df_filtered, 'Thumbs Down', 'total_thumbs_down')
df_rolling_ads = count_nb_page(df_filtered, 'Roll Advert', 'total_ads')

In [41]:
df_pd_add_friend = df_add_friend.toPandas()

In [42]:
df_pd_thumbs_up = df_thumbs_up.toPandas()

In [43]:
df_pd_thumbs_down = df_thumbs_down.toPandas()

In [44]:
df_pd_rolling_ads = df_rolling_ads.toPandas()

In [45]:
df_pd_add_friend.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_add_friend.csv', header=1, index=False)
df_pd_thumbs_up.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_thumbs_up.csv', header=1, index=False)
df_pd_thumbs_down.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_thumbs_down.csv', header=1, index=False)
df_pd_rolling_ads.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_rolling_ads.csv', header=1, index=False)

## 1.6. `Upgrade/Downgrade`

In [46]:
df_view_upgrade = count_nb_page(df_filtered, 'Upgrade', 'think_upgrade')
df_count_upgrade = count_nb_page(df_filtered, 'Submit Upgrade', 'has_upgraded')
df_view_downgrade = count_nb_page(df_filtered, 'Downgrade', 'think_downgrade')
df_count_downgrade = count_nb_page(df_filtered, 'Submit Downgrade', 'has_downgraded')

In [47]:
df_pd_view_upgrade = df_view_upgrade.toPandas()

In [48]:
df_pd_count_upgrade = df_count_upgrade.toPandas()

In [49]:
df_pd_view_downgrade = df_view_downgrade.toPandas()

In [50]:
df_pd_count_downgrade = df_count_downgrade.toPandas()

In [51]:
df_pd_view_upgrade.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_view_upgrade.csv', header=1, index=False)
df_pd_count_upgrade.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_real_upgrade.csv', header=1, index=False)
df_pd_view_downgrade.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_view_downgrade.csv', header=1, index=False)
df_pd_count_downgrade.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_real_downgrade.csv', header=1, index=False)

## 1.7. Extract Operating System information

In [52]:
# Define the regex (regex101 is your friend to validate it works!)
def extract_systeminfo(txt):
    if txt:
        # Raw string so that \s, \( and \) reach the regex engine unchanged
        matches = re.match(r".*Mozilla/[0-9.]+\s\(([a-zA-Z0-9\s.]+)(;|\))", txt)
        if matches:
            return matches.group(1)
        else:
            return "Unknown"
    else:
        return "Unknown"

to_os = udf(lambda x: extract_systeminfo(x))
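
To see what the regex captures before running it on the cluster, the same function can be tried locally on a few user-agent strings. The strings below are typical examples written for this sketch (assumptions, not rows copied from the dataset); the function body mirrors `extract_systeminfo` above.

```python
import re

PATTERN = r".*Mozilla/[0-9.]+\s\(([a-zA-Z0-9\s.]+)(;|\))"


def extract_systeminfo(txt):
    """Return the first token inside the parentheses after 'Mozilla/x.y', else 'Unknown'."""
    if txt:
        matches = re.match(PATTERN, txt)
        if matches:
            return matches.group(1)
    return "Unknown"


# Hypothetical user agents; the leading quote imitates how the column is printed by show().
samples = [
    '"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0"',
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    None,
]
results = [extract_systeminfo(s) for s in samples]
print(results)
# ['Windows NT 6.1', 'Macintosh', 'X11', 'compatible', 'Unknown']
```

Values such as `X11` and `compatible` are not operating system names, which is why a renaming step is still needed afterwards.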

In [53]:
df_os = df_filtered.select('userId', 'userAgent').withColumn("os", to_os("userAgent"))

In [54]:
def rename_os_col(val_to_replace, replaced_value):
    """
    Rename a value within the 'os' column by another one
    """
    return df_os.withColumn("os", when(df_os.os == val_to_replace, lit(replaced_value)).otherwise(df_os.os))

In [55]:
df_os = rename_os_col('compatible', 'Windows NT 6.1')
df_os = rename_os_col('X11', 'Linux')
df_os = rename_os_col('Windows NT 5.1', 'Windows XP')
df_os = rename_os_col('Windows NT 6.0', 'Windows Vista')
df_os = rename_os_col('Windows NT 6.1', 'Windows Seven')
df_os = rename_os_col('Windows NT 6.2', 'Windows 8')
df_os = rename_os_col('Windows NT 6.3', 'Windows 81')
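
The order of these calls matters: `compatible` is first rewritten to `Windows NT 6.1`, and that value is in turn rewritten to `Windows Seven` a few lines later. A plain-Python sketch of the same sequential replacement (the list `RENAMES` and the helper `normalise_os` are names introduced for this illustration only):

```python
# (value to replace, replacement), in the same order as the calls above.
RENAMES = [
    ("compatible", "Windows NT 6.1"),
    ("X11", "Linux"),
    ("Windows NT 5.1", "Windows XP"),
    ("Windows NT 6.0", "Windows Vista"),
    ("Windows NT 6.1", "Windows Seven"),
    ("Windows NT 6.2", "Windows 8"),
    ("Windows NT 6.3", "Windows 81"),
]


def normalise_os(value):
    """Apply every rename in order, so an earlier replacement can feed a later one."""
    for old, new in RENAMES:
        if value == old:
            value = new
    return value


print(normalise_os("compatible"))  # Windows Seven
print(normalise_os("X11"))         # Linux
print(normalise_os("Macintosh"))   # Macintosh (no rule applies)
```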

In [57]:
df_os_tmp = df_os.groupby('userId').agg({'os': 'max'}).withColumnRenamed('max(os)', 'os')
os_list = df_os_tmp.select('os').distinct().rdd.flatMap(lambda x:x).collect()

In [59]:
# How to build dummies? Found something on SO: https://stackoverflow.com/questions/46528207/dummy-encoding-using-pyspark
exprs = [when(col('os') == os, 1).otherwise(0).alias(str(os)) for os in os_list]
df_os_tmp = df_os_tmp.select(exprs + df_os_tmp.columns)
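
The list comprehension above builds one 0/1 column per distinct `os` value and keeps the original columns next to them. The same dummy encoding, sketched in plain Python on a made-up mapping of users to operating systems (the user ids and values are assumptions for illustration):

```python
# Hypothetical user -> os mapping.
user_os = {"u1": "Windows Seven", "u2": "Macintosh", "u3": "Linux"}

# Distinct values, the counterpart of os_list in the notebook.
os_list = sorted(set(user_os.values()))

# One row per user: a 0/1 column per os value, plus the original columns.
rows = [
    {**{name: int(os_name == name) for name in os_list}, "userId": user, "os": os_name}
    for user, os_name in user_os.items()
]
for row in rows:
    print(row)
```

Each row has exactly one dummy column set to 1, and the number of columns is `len(os_list) + 2`.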

In [60]:
df_pd_os = df_os_tmp.toPandas()

In [61]:
df_pd_os.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_os.csv', header=1, index=False)

## 1.8. Count number of errors per user

In [62]:
df_errors = count_nb_page(df_filtered, 'Error', 'nb_404')

In [63]:
df_pd_errors = df_errors.toPandas()

In [64]:
df_pd_errors.to_csv(S3_SAVE_PATH + 'df_final_set12G_user_view_errors.csv', header=1, index=False)

## 1.9. Merge everything into a single final dataframe

In [None]:
# Reload them from the S3 storage
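
This cell was left as a placeholder, and the following cells use the DataFrames that are still in memory. One possible way to fill it, assuming the CSV files written by the `to_csv` cells above are still present under `S3_SAVE_PATH` and that `pandas` and `s3fs` are installed as at the top of the notebook, is sketched below. The helper names `feature_paths` and `reload_features` are hypothetical.

```python
S3_SAVE_PATH = "s3://aws-emr-resources-604676609121-eu-west-1/notebooks/"

# File names used by the to_csv cells of this notebook, in the order of df_list.
FEATURE_FILES = [
    "df_final_set12G_users.csv",
    "df_final_set12G_user_level.csv",
    "df_final_set12G_user_timedelta.csv",
    "df_final_set12G_user_unique_songs.csv",
    "df_final_set12G_user_total_songs.csv",
    "df_final_set12G_user_unique_artists.csv",
    "df_final_set12G_user_total_length.csv",
    "df_final_set12G_user_add_to_playlist.csv",
    "df_final_set12G_user_add_friend.csv",
    "df_final_set12G_user_thumbs_up.csv",
    "df_final_set12G_user_thumbs_down.csv",
    "df_final_set12G_user_rolling_ads.csv",
    "df_final_set12G_user_view_upgrade.csv",
    "df_final_set12G_user_real_upgrade.csv",
    "df_final_set12G_user_view_downgrade.csv",
    "df_final_set12G_user_real_downgrade.csv",
    "df_final_set12G_user_view_errors.csv",
    "df_final_set12G_user_os.csv",
]


def feature_paths(base=S3_SAVE_PATH, names=FEATURE_FILES):
    """Full S3 path of every intermediate CSV file."""
    return [base + name for name in names]


def reload_features(paths):
    """Read each CSV into a pandas DataFrame (needs pandas, plus s3fs for s3:// paths)."""
    import pandas as pd  # imported here so the path helper works without pandas
    return [pd.read_csv(path) for path in paths]


# Usage on the cluster (not executed here): df_list = reload_features(feature_paths())
```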

In [65]:
df_list = [df_pd_users, df_pd_lvl, df_pd_time_delta, df_pd_users_unique_songs, df_pd_users_total_songs, 
           df_pd_users_unique_artists, df_pd_total_length, df_pd_add_playlist, 
           df_pd_add_friend, df_pd_thumbs_up, df_pd_thumbs_down, df_pd_rolling_ads, df_pd_view_upgrade, 
           df_pd_count_upgrade, df_pd_view_downgrade, df_pd_count_downgrade, df_pd_errors, df_pd_os]
for a_df in df_list:
    print(a_df.shape)

(22278, 3)
(22278, 2)
(22278, 2)
(22261, 2)
(22261, 2)
(22261, 2)
(22278, 2)
(21260, 2)
(20305, 2)
(21732, 2)
(20031, 2)
(20068, 2)
(16151, 2)
(12082, 2)
(15209, 2)
(5103, 2)
(11273, 2)
(22278, 12)

In [66]:
df_pd_final = df_pd_users
for a_df in df_list[1:]:
    df_pd_final = df_pd_final.merge(a_df, on='userId', how='left')
# Finally, drop userId and the raw os column (already dummy-encoded) and fill the remaining NaN with 0
df_pd_final = df_pd_final.drop(['userId', 'os'], axis=1)
df_pd_final = df_pd_final.fillna(0)
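
The shapes printed above explain why a left join followed by `fillna(0)` is used: `df_pd_users` has 22278 rows, while a table such as `df_pd_count_downgrade` has only 5103, because users who never visited a page are absent from its count table. A plain-Python sketch of that "left join, then fill with 0" behaviour, on made-up tables (user ids and counts are assumptions for illustration):

```python
# Hypothetical per-user tables: `base` lists every user,
# the count tables only list users who visited the page at least once.
base = {"u1": {"churn": 1}, "u2": {"churn": 0}, "u3": {"churn": 0}}
feature_tables = {
    "total_thumbs_up": {"u1": 4, "u3": 1},
    "has_downgraded": {"u2": 2},
}

final = {}
for user, row in base.items():
    merged = dict(row)
    for name, table in feature_tables.items():
        # Left join on userId: keep the user even when the table has no row,
        # and use 0 for the missing count (the role of fillna(0)).
        merged[name] = table.get(user, 0)
    final[user] = merged

print(final)
```

No user from the base table is lost, and a missing count is read as "the user never did this".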

In [67]:
df_pd_final.shape

(22278, 28)

In [68]:
df_pd_final.head(10)

  gender  churn  level  ...  Windows 81  Windows XP  Windows Seven
0      M      1      1  ...           0           1              0
1      F      0      1  ...           0           0              1
2      F      0      1  ...           0           0              0
3      F      0      1  ...           1           0              0
4      M      0      1  ...           0           1              0
5      M      0      1  ...           0           0              0
6      M      0      1  ...           0           0              1
7      F      0      1  ...           0           0              1
8      F      1      1  ...           0           0              1
9      M      1      0  ...           0           0              1

[10 rows x 28 columns]

In [69]:
# Save to S3 so that it can be reused later
df_pd_final.to_csv(S3_SAVE_PATH + 'df_final_total_12GB.csv', header=1, index=False)
