# SPARKIFY PROJECT - DATA UNDERSTANDING
This notebook explores a small subset (128 MB) of the full dataset (12 GB).  
Both can be retrieved here:
* 128MB subset: [s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json](s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json)
* full 12GB dataset: [s3n://udacity-dsnd/sparkify/sparkify_event_data.json](s3n://udacity-dsnd/sparkify/sparkify_event_data.json)

The goal of this notebook is to compute a few statistics per feature in order to better understand the data we are dealing with.

## Import libraries, init Spark and load dataset

In [2]:
import pyspark
from pyspark.sql import SparkSession

from pyspark.sql.functions import desc, isnan, when, count, col

from datetime import datetime

In [3]:
# It is useful to know which version we are using when reading the pyspark documentation
pyspark.__version__

'2.4.3'

In [4]:
# Create or retrieve a Spark session
spark = SparkSession.builder.appName("dsnd-p7-sparkify").getOrCreate()

In [5]:
df = spark.read.json("mini_sparkify_event_data.json")
df.show(3)

+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+-----------------+------+-------------+--------------------+------+
|          artist|     auth|firstName|gender|itemInSession|lastName|   length|level|            location|method|    page| registration|sessionId|             song|status|           ts|           userAgent|userId|
+----------------+---------+---------+------+-------------+--------+---------+-----+--------------------+------+--------+-------------+---------+-----------------+------+-------------+--------------------+------+
|  Martha Tilston|Logged In|    Colin|     M|           50| Freeman|277.89016| paid|     Bakersfield, CA|   PUT|NextSong|1538173362000|       29|        Rockpools|   200|1538352117000|Mozilla/5.0 (Wind...|    30|
|Five Iron Frenzy|Logged In|    Micah|     M|           79|    Long|236.09424| free|Boston-Cambridge-...|   PUT|NextSong|1538331630000|        8|   

In [6]:
print("Loaded pyspark dataframe has shape ({}, {})".format(df.count(), len(df.columns)))

Loaded pyspark dataframe has shape (286500, 18)


---
# DATA UNDERSTANDING
To get a better understanding of the data we are dealing with and of their types, let's first analyze the dataset feature by feature.

# 1. Dataset schema

In [7]:
df.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



We have 18 features to analyze. Let's go!

# 2. `Artist` feature
This clearly refers to the artist the user is currently listening to. But are there empty rows? Depending on the type of event collected, the value may legitimately be empty. Let's see.

## 2.1. Feature statistics

In [8]:
df.describe('artist').show()

+-------+------------------+
|summary|            artist|
+-------+------------------+
|  count|            228108|
|   mean| 551.0852017937219|
| stddev|1217.7693079161374|
|    min|               !!!|
|    max| ÃÂlafur Arnalds|
+-------+------------------+



***Observations:*** we have 228108 non-null values. Other statistics are not relevant as this is a categorical feature; `min` and `max` only reflect alphabetical order.

## 2.2. Number of unique values

In [9]:
df.select(['artist']).distinct().count()

17656

***Observations:*** we have roughly 17.7K distinct artists (the count of 17656 includes the `null` value). Let's see who is the favourite one!

## 2.3. Which artist is the most represented in this subset?

In [10]:
df.groupby(['artist']).agg({'artist': 'count'}).withColumnRenamed("count(artist)", "count").sort(desc("count")).show()

+--------------------+-----+
|              artist|count|
+--------------------+-----+
|       Kings Of Leon| 1841|
|            Coldplay| 1813|
|Florence + The Ma...| 1236|
|       Dwight Yoakam| 1135|
|            BjÃÂ¶rk| 1133|
|      The Black Keys| 1125|
|                Muse| 1090|
|       Justin Bieber| 1044|
|        Jack Johnson| 1007|
|              Eminem|  953|
|           Radiohead|  884|
|     Alliance Ethnik|  876|
|               Train|  854|
|        Taylor Swift|  840|
|         OneRepublic|  828|
|         The Killers|  822|
|         Linkin Park|  787|
|         Evanescence|  781|
|            Harmonia|  729|
|       Guns N' Roses|  713|
+--------------------+-----+
only showing top 20 rows



## 2.4. Missing value analysis

#### How many?

In [11]:
# Made with https://stackoverflow.com/questions/44627386/how-to-find-count-of-null-and-nan-values-for-each-column-in-a-pyspark-dataframe?rq=1
df.select([count(when(isnan('artist') | col('artist').isNull(), 'artist')).alias('artist')]).show()

+------+
|artist|
+------+
| 58392|
+------+



I expected to find ```286500 - 228108 = 58392```. That's good!

#### When it is empty, what are the values for the feature `page`?

In [12]:
df.filter(isnan('artist') | col('artist').isNull()).groupby('page').agg({'page': 'count'}).withColumnRenamed("count(page)", "count").sort(desc("count")).show()

+--------------------+-----+
|                page|count|
+--------------------+-----+
|                Home|14457|
|           Thumbs Up|12551|
|     Add to Playlist| 6526|
|          Add Friend| 4277|
|         Roll Advert| 3933|
|               Login| 3241|
|              Logout| 3226|
|         Thumbs Down| 2546|
|           Downgrade| 2055|
|                Help| 1726|
|            Settings| 1514|
|               About|  924|
|             Upgrade|  499|
|       Save Settings|  310|
|               Error|  258|
|      Submit Upgrade|  159|
|    Submit Downgrade|   63|
|              Cancel|   52|
|Cancellation Conf...|   52|
|            Register|   18|
+--------------------+-----+
only showing top 20 rows



***Observations:*** the visited page takes many different values when `artist` is empty. Most of the time the user is on the home page, interacting with their account (`Login`, `Logout`, `Settings`) or using the "social" part of the app (`Add Friend`, `Thumbs Up` or `Thumbs Down`).

# 3. `Auth` feature
First, we build a few helper functions to analyze each feature:

In [69]:
def show_feature_stats(df, col_name):
    """
    Print summary statistics for the given column. For a numerical feature the mean and standard deviation
    are meaningful; for a categorical one only the count, min and max (alphabetical order) are.
    :param df: (pyspark DataFrame) the data to analyze
    :param col_name: (string) the column name
    :return: None, the statistics are printed
    """
    df.describe(col_name).show()

def value_counts(df, col_name):
    """
    Print each value of the given column with its number of occurrences, most frequent first
    (similar to pandas value_counts()). Nulls are reported with a count of 0 because count() ignores them.
    :param df: (pyspark DataFrame) the data to analyze
    :param col_name: (string) the column name
    :return: None, the counts are printed (top 20 rows only)
    """
    df.groupby([col_name]).agg({col_name: 'count'}).withColumnRenamed("count({})".format(col_name), "count").sort(desc("count")).show()

def count_missing_values(df, col_name):
    """
    Print the number of missing (null or NaN) values for the given column
    :param df: (pyspark DataFrame) the data to analyze
    :param col_name: (string) the column name
    :return: None, the count is printed
    """
    df.select([count(when(isnan(col_name) | col(col_name).isNull(), col_name)).alias(col_name)]).show()

def get_other_col_value_counts_when_col_is_null(df, null_col_name, other_col):
    """
    Print the value counts of another column, restricted to the rows where the given column is missing
    :param df: (pyspark DataFrame) the data to analyze
    :param null_col_name: (string) the name of the column with missing values
    :param other_col: (string) the name of the other column whose values are counted
    :return: None, the counts are printed
    """
    value_counts(df.filter(isnan(null_col_name) | col(null_col_name).isNull()), other_col)

## 3.1. Feature statistics

In [70]:
show_feature_stats(df, 'auth')

+-------+----------+
|summary|      auth|
+-------+----------+
|  count|    286500|
|   mean|      null|
| stddev|      null|
|    min| Cancelled|
|    max|Logged Out|
+-------+----------+



***Observations:*** we have no null values, other statistics are not relevant as this is a categorical feature. `min` and `max` are related to alphabetical order.

## 3.2. Number of unique values

In [54]:
value_counts(df, 'auth')

+----------+------+
|      auth| count|
+----------+------+
| Logged In|278102|
|Logged Out|  8249|
|     Guest|    97|
| Cancelled|    52|
+----------+------+



***Observations:*** this feature seems to describe the user's authentication state. 97% of the rows have the value `Logged In`, so I am not sure it is worth keeping. We'll see.
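
To put numbers on that imbalance, the shares can be recomputed from the counts printed above. This is a plain-Python sketch that needs no Spark session; `auth_counts` simply copies the table, and `share_percent` is a helper name introduced here for illustration.

```python
# Counts copied from the value_counts(df, 'auth') output above.
auth_counts = {"Logged In": 278102, "Logged Out": 8249, "Guest": 97, "Cancelled": 52}

def share_percent(counts):
    """Return each value's share of the total, in percent, rounded to 2 decimals."""
    total = sum(counts.values())
    return {value: round(100 * n / total, 2) for value, n in counts.items()}

auth_shares = share_percent(auth_counts)
print(auth_shares)
```

The three other values together account for less than 3% of the rows, which is why the feature looks weak on its own.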

# 4. `firstName` and `lastName` features
These two features clearly describe the user. Same question: are there empty rows? Let's see.

## 4.1. Features statistics

In [71]:
show_feature_stats(df, 'firstName')

+-------+---------+
|summary|firstName|
+-------+---------+
|  count|   278154|
|   mean|     null|
| stddev|     null|
|    min| Adelaida|
|    max|   Zyonna|
+-------+---------+



***Observations:*** we have a few missing values (286500 - 278154 = 8346); other statistics are not relevant as this is a categorical feature. `min` and `max` are related to alphabetical order.  
We will not count the number of distinct values as it would not be meaningful: several users can have the same name.

## 4.2. Missing values
We reuse the helper functions defined earlier to analyze missing values:

#### How many?

In [55]:
count_missing_values(df, 'firstName')

+---------+
|firstName|
+---------+
|     8346|
+---------+



In [56]:
count_missing_values(df, 'lastName')

+--------+
|lastName|
+--------+
|    8346|
+--------+



#### When it is empty, what are the values for the feature `page`?

In [63]:
get_other_col_value_counts_when_col_is_null(df, 'firstName', 'page')

+-------------------+-----+
|               page|count|
+-------------------+-----+
|               Home| 4375|
|              Login| 3241|
|              About|  429|
|               Help|  272|
|           Register|   18|
|              Error|    6|
|Submit Registration|    5|
+-------------------+-----+



In [64]:
get_other_col_value_counts_when_col_is_null(df, 'lastName', 'page')

+-------------------+-----+
|               page|count|
+-------------------+-----+
|               Home| 4375|
|              Login| 3241|
|              About|  429|
|               Help|  272|
|           Register|   18|
|              Error|    6|
|Submit Registration|    5|
+-------------------+-----+



***Observations:*** 
* We have exactly the same figures for both `firstName` and `lastName`, that's a good point!
* `firstName` and `lastName` are empty on pages such as `Home`, `Login`, `About` or `Help`, which makes sense: at that point we do not know who the visitor is. Anyway, as names bring no information for our churn prediction problem, it is not relevant to keep those 2 features.

# 5. `Gender` feature
It should be the user's gender. The number of missing values should then be more or less the same as what we saw earlier for `firstName`. Let's see.

## 5.1. Feature statistics

In [73]:
show_feature_stats(df, 'gender')

+-------+------+
|summary|gender|
+-------+------+
|  count|278154|
|   mean|  null|
| stddev|  null|
|    min|     F|
|    max|     M|
+-------+------+



***Observations:*** as expected, we have the same number of missing values as we found when analyzing `firstName`.

## 5.2. Most represented gender?
We have to be careful with duplicates, since each user appears in many rows, and must count distinct users. For that, instead of `firstName` I will use the `userId` feature.

In [59]:
value_counts(df.select('userId', 'gender').dropDuplicates(), 'gender')

+------+-----+
|gender|count|
+------+-----+
|     M|  121|
|     F|  104|
|  null|    0|
+------+-----+



In [30]:
# Check this information by counting the number of different userId
df.select('userId').dropDuplicates().count()

226

Hmm, there is a small difference: 226 distinct user ids, whereas summing the 'M' and 'F' genders gives 225.  
The extra id probably belongs to rows without user information, which would explain the `null` gender above (its count shows 0 because `count` ignores null values).

# 6. `itemInSession` feature
So far I am not sure what it means, so let's analyze it in detail.

## 6.1. Feature statistics

In [74]:
show_feature_stats(df, 'itemInSession')

+-------+------------------+
|summary|     itemInSession|
+-------+------------------+
|  count|            286500|
|   mean|114.41421291448516|
| stddev|129.76726201140994|
|    min|                 0|
|    max|              1321|
+-------+------------------+



Based on the feature name, I would say that it contains the number of elements played within a single session.  
***Observations:***
* We have no null values and the column is numerical, so we can also look at other statistics such as the `mean` or the `standard deviation`.  
* Note that the `min` is 0. If my assumption is right, we should have as many 0s as missing values for `firstName`.  
* There is not enough information so far to say whether the `max` value is an outlier. We'll see that later during the EDA phase (perhaps with a boxplot, if it helps).

## 6.2. Number of 0's

In [40]:
df.filter(df['itemInSession'] == 0).count()

3278

In [35]:
df.filter(df['itemInSession'] == 0).select([count(when(isnan('firstName') | col('firstName').isNull(), 'firstname')).alias('firstname')]).show()

+---------+
|firstname|
+---------+
|      589|
+---------+



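***Note:*** no, the numbers do not match: 3278 rows have `itemInSession` equal to 0, and only 589 of them have a missing `firstName` (out of 8346 missing values overall). Perhaps this will become clearer with some plots.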

# 7. `length` feature
As for `itemInSession`, I am not sure what it means, so let's analyze it in detail.

## 7.1. Feature statistics

In [75]:
show_feature_stats(df, 'length')

+-------+-----------------+
|summary|           length|
+-------+-----------------+
|  count|           228108|
|   mean|249.1171819778458|
| stddev|99.23517921058361|
|    min|          0.78322|
|    max|       3024.66567|
+-------+-----------------+



***Observations:***
* 228108 is the same count we found earlier for the non-null `artist` values, so those two features are clearly connected.
* The column is numerical, so its statistics are worth a look. Based on the mean and even the max, I guess that this is the length (in seconds) of the song that has been played.

## 7.2. Confirm it is the song's length
For that I will pick a few songs that appear more than once in the dataset. The length for a given song should always be the same. If that is not the case, the feature means something else.

In [60]:
value_counts(df, 'song')

+--------------------+-----+
|                song|count|
+--------------------+-----+
|      You're The One| 1153|
|                Undo| 1026|
|             Revelry|  854|
|       Sehr kosmisch|  728|
|Horn Concerto No....|  641|
|Dog Days Are Over...|  574|
|             Secrets|  466|
|        Use Somebody|  459|
|              Canada|  435|
|             Invalid|  424|
|    Ain't Misbehavin|  409|
|       ReprÃÂ©sente|  393|
|SinceritÃÂ© Et J...|  384|
|Catch You Baby (S...|  373|
|              Yellow|  343|
|    Somebody To Love|  343|
|    Hey_ Soul Sister|  334|
|            The Gift|  327|
|           Fireflies|  312|
|          Love Story|  309|
+--------------------+-----+
only showing top 20 rows



In [48]:
songs_to_check = ['Use Somebody', 'Canada', 'Yellow', 'Love Story', 'Fireflies', 'Somebody To Love']
df.filter(df['song'].isin(songs_to_check)).select('song', 'length').dropDuplicates().show()

+----------------+---------+
|            song|   length|
+----------------+---------+
|          Yellow|174.75873|
|      Love Story|179.46077|
|          Canada|413.28281|
|       Fireflies|208.97914|
|      Love Story|199.52281|
|          Yellow|218.14812|
|Somebody To Love|211.53914|
|       Fireflies|195.02975|
|          Canada|236.09424|
|Somebody To Love|473.23383|
|          Yellow|268.30322|
|    Use Somebody|231.26159|
|Somebody To Love|179.53914|
|    Use Somebody|231.81016|
|      Love Story|233.89995|
|      Love Story|312.73751|
|      Love Story|236.01587|
|Somebody To Love|220.89098|
|       Fireflies|225.17506|
|      Love Story|245.36771|
+----------------+---------+
only showing top 20 rows



***Note:*** we can see different values for the same song title, so this does not look like the fixed length of the song that has been played. It could instead be the **listening duration for this song**.
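
One caveat: the check above matches songs by title only, and two artists can release different songs under the same title, which would also produce several lengths per title. The sketch below shows how the two groupings differ. It runs on made-up records in plain Python; the artist names, lengths and the `distinct_lengths` helper are all hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical play events (artist, song, length); NOT rows from the dataset.
events = [
    ("Artist A", "Yellow", 266.0),
    ("Artist A", "Yellow", 266.0),
    ("Artist B", "Yellow", 174.5),
    ("Artist C", "Canada", 413.2),
]

def distinct_lengths(rows, key):
    """Map each key (built from artist and song) to the set of lengths seen for it."""
    lengths = defaultdict(set)
    for artist, song, length in rows:
        lengths[key(artist, song)].add(length)
    return dict(lengths)

by_title = distinct_lengths(events, key=lambda artist, song: song)
by_artist_and_title = distinct_lengths(events, key=lambda artist, song: (artist, song))

# Keys for which more than one length was observed.
ambiguous_titles = [k for k, v in by_title.items() if len(v) > 1]
ambiguous_pairs = [k for k, v in by_artist_and_title.items() if len(v) > 1]
print(ambiguous_titles, ambiguous_pairs)
```

If grouping the real data by both `artist` and `song` still showed several lengths per pair, the listening-duration reading would be better supported; if not, `length` would simply be the track length.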

# 8. `level` feature
Still no real clue about what it is, so let's analyze it in detail.

## 8.1. Feature statistics

In [76]:
show_feature_stats(df, 'level')

+-------+------+
|summary| level|
+-------+------+
|  count|286500|
|   mean|  null|
| stddev|  null|
|    min|  free|
|    max|  paid|
+-------+------+



***Observations:***
* This is a categorical column and corresponds to the level of service the user has subscribed to.
* There are no missing values (even for users without `firstName`? How are they classified then... as free?)

## 8.2. How many different levels available?

In [61]:
value_counts(df, 'level')

+-----+------+
|level| count|
+-----+------+
| paid|228162|
| free| 58338|
+-----+------+



In [66]:
# When first name is null, what are the possible values for 'level'?
get_other_col_value_counts_when_col_is_null(df, 'firstName', 'level')

+-----+-----+
|level|count|
+-----+-----+
| paid| 5729|
| free| 2617|
+-----+-----+



In [67]:
# Strange, let's now try with 'userId':
get_other_col_value_counts_when_col_is_null(df, 'userId', 'level')

+-----+-----+
|level|count|
+-----+-----+
+-----+-----+



In [68]:
# OK: when firstName is null, userId is not null but an empty string! That's why the 8346 rows
# without firstName still have a level above
get_other_col_value_counts_when_col_is_null(df, 'firstName', 'userId')

+------+-----+
|userId|count|
+------+-----+
|      | 8346|
+------+-----+
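
The null/NaN test used by the helper functions therefore misses this case. As a plain-Python sketch of a broader rule (the name `is_missing` and the sample values are introduced here for illustration, they are not part of the notebook), a value can be treated as missing when it is `None`, NaN, or a blank string:

```python
import math

def is_missing(value):
    """True for None, float NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False

# Illustrative values only: two real-looking ids and three kinds of "missing".
sample_user_ids = ["30", "", None, "   ", "100"]
missing_flags = [is_missing(v) for v in sample_user_ids]
print(missing_flags)
```

In PySpark the same idea would presumably be expressed by also comparing the column to the empty string in the filter condition.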



# 9. `location` feature
Given the name, I guess this is the user's location. Let's see if I am right.

## 9.1. Feature statistics

In [77]:
show_feature_stats(df, 'location')

+-------+-----------------+
|summary|         location|
+-------+-----------------+
|  count|           278154|
|   mean|             null|
| stddev|             null|
|    min|       Albany, OR|
|    max|Winston-Salem, NC|
+-------+-----------------+



***Observations:***
* This is a categorical column and corresponds to the users' locations.
* It seems that all users are from the United States.
* There is the same number of missing values as for `firstName` and the other user-related features.
* We could try to plot a map showing where users tend to churn more, but for our prediction problem I think I will not keep this feature.

## 9.2. How many different locations?

In [78]:
value_counts(df, 'location')

+--------------------+-----+
|            location|count|
+--------------------+-----+
|Los Angeles-Long ...|30131|
|New York-Newark-J...|23684|
|Boston-Cambridge-...|13873|
|Houston-The Woodl...| 9499|
|Charlotte-Concord...| 7780|
|Dallas-Fort Worth...| 7605|
|Louisville/Jeffer...| 6880|
|Philadelphia-Camd...| 5890|
|Chicago-Napervill...| 5114|
|    St. Louis, MO-IL| 4858|
|Phoenix-Mesa-Scot...| 4846|
|Vineland-Bridgeto...| 4825|
|          Wilson, NC| 4659|
|Denver-Aurora-Lak...| 4453|
|           Ionia, MI| 4428|
|San Antonio-New B...| 4373|
|        Danville, VA| 4257|
|Atlanta-Sandy Spr...| 4236|
|New Haven-Milford...| 4007|
|         Jackson, MS| 3839|
+--------------------+-----+
only showing top 20 rows



***Note:*** this supports the observation that the locations are all cities or metropolitan areas in the United States.

# 10. `method` feature
No real clue about what it is, so let's analyze it in detail.

## 10.1. Feature statistics

In [79]:
show_feature_stats(df, 'method')

+-------+------+
|summary|method|
+-------+------+
|  count|286500|
|   mean|  null|
| stddev|  null|
|    min|   GET|
|    max|   PUT|
+-------+------+



***Observations:***
* This is a categorical column
* "GET" and "PUT" are HTTP methods, so this feature probably holds the HTTP method used for the user's action.
* There are no missing values (which seems logical if my assumption is correct).

## 10.2. How many different methods?

In [80]:
value_counts(df, 'method')

+------+------+
|method| count|
+------+------+
|   PUT|261064|
|   GET| 25436|
+------+------+



***Note:*** we have only 2 possible values, so let's say I am right. It would have been more convincing to also see values such as "POST", but it does not matter because this feature will most likely not be kept in the end.

# 11. `page` feature
As explained in the project overview video, this is related to the user's action.

## 11.1. Feature statistics

In [81]:
show_feature_stats(df, 'page')

+-------+-------+
|summary|   page|
+-------+-------+
|  count| 286500|
|   mean|   null|
| stddev|   null|
|    min|  About|
|    max|Upgrade|
+-------+-------+



***Observations:***
* This is a categorical column
* There are no missing values

## 11.2. How many different pages?

In [82]:
value_counts(df, 'page')

+--------------------+------+
|                page| count|
+--------------------+------+
|            NextSong|228108|
|                Home| 14457|
|           Thumbs Up| 12551|
|     Add to Playlist|  6526|
|          Add Friend|  4277|
|         Roll Advert|  3933|
|               Login|  3241|
|              Logout|  3226|
|         Thumbs Down|  2546|
|           Downgrade|  2055|
|                Help|  1726|
|            Settings|  1514|
|               About|   924|
|             Upgrade|   499|
|       Save Settings|   310|
|               Error|   258|
|      Submit Upgrade|   159|
|    Submit Downgrade|    63|
|              Cancel|    52|
|Cancellation Conf...|    52|
+--------------------+------+
only showing top 20 rows



In [84]:
df.select("page").dropDuplicates().count()

22

***Observations:***
* We have 228108 rows for 'NextSong', the same as the number of non-null values for `artist`. This is by far the most represented page in the dataset.
* There are 22 distinct values but only 20 are displayed in the table above. This is OK, as the 20th has only 52 rows.
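
As a quick sanity check on the two pages that are not displayed, the rows they represent can be derived from the figures already printed. This is a plain-Python sketch that only copies numbers from the outputs above.

```python
# Counts copied from the value_counts(df, 'page') output above (top 20 rows).
displayed_page_counts = [
    228108, 14457, 12551, 6526, 4277, 3933, 3241, 3226, 2546, 2055,
    1726, 1514, 924, 499, 310, 258, 159, 63, 52, 52,
]
total_rows = 286500  # df.count() reported when the dataset was loaded

# Rows that belong to the 2 pages missing from the top-20 table.
hidden_rows = total_rows - sum(displayed_page_counts)
print(hidden_rows)
```

The result is consistent with the `Register` (18) and `Submit Registration` (5) pages seen in the missing-`firstName` breakdown, which presumably are the two hidden values.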

In our churn prediction problem, we will classify as "churned" any user who reached the `Cancellation Confirmation` page. It means that **in our dataset 52 of the 225 users have churned (around 23%)**.
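
The labelling rule can be sketched in plain Python on a handful of made-up events. The event list, the `CHURN_PAGE` constant and the `churn_rate` helper are hypothetical names used only for illustration; the real labelling will be done on the Spark DataFrame.

```python
# Hypothetical (userId, page) events; NOT rows from the dataset.
events = [
    ("1", "NextSong"), ("1", "Cancel"), ("1", "Cancellation Confirmation"),
    ("2", "NextSong"), ("2", "Thumbs Up"),
    ("3", "Home"), ("3", "NextSong"),
    ("", "Login"),  # traffic with an empty userId is ignored
]

CHURN_PAGE = "Cancellation Confirmation"

def churn_rate(rows):
    """Return (set of churned user ids, share of identified users who churned)."""
    users = {user_id for user_id, _ in rows if user_id != ""}
    churned = {user_id for user_id, page in rows
               if page == CHURN_PAGE and user_id != ""}
    return churned, len(churned) / len(users)

churned_users, rate = churn_rate(events)

# Rate for the figures quoted above: 52 churned users out of 225.
reported_rate = round(100 * 52 / 225, 1)
print(churned_users, rate, reported_rate)
```

Note that the 23% figure assumes each of the 52 `Cancellation Confirmation` rows belongs to a different user.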

# 12. `registration` feature
The schema showed that this is numeric, let's see what it is exactly.

## 12.1. Feature statistics

In [85]:
show_feature_stats(df, 'registration')

+-------+--------------------+
|summary|        registration|
+-------+--------------------+
|  count|              278154|
|   mean|1.535358834084427...|
| stddev| 3.291321616327586E9|
|    min|       1521380675000|
|    max|       1543247354000|
+-------+--------------------+



***Observations:***
* We have the same number of missing values as for `firstName` and the other user features, so this feature is related to the user.
* Values seem to be timestamps (in milliseconds), so given the column name I guess it is the date the user registered.

## 12.2. Date range for this dataset?

In [104]:
begin = datetime.fromtimestamp(df.agg({"registration": "min"}).collect()[0][0]/1000)
end = datetime.fromtimestamp(df.agg({"registration": "max"}).collect()[0][0]/1000)
print("Date range is from {} to {}".format(begin, end))

Date range is from 2018-03-18 13:44:35 to 2018-11-26 15:49:14


***Note:*** our users registered within an 8-month period.
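
Note that `datetime.fromtimestamp` without a timezone returns local time, so the printed range depends on the machine running the notebook. A minimal sketch that pins the conversion to UTC follows; `ms_to_utc` is a helper name introduced here, and the two values are the `min` and `max` from the statistics above.

```python
from datetime import datetime, timezone

def ms_to_utc(ms):
    """Convert a Unix timestamp in milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

first_registration = ms_to_utc(1521380675000)  # min from the describe() output
last_registration = ms_to_utc(1543247354000)   # max from the describe() output
print(first_registration.isoformat(), last_registration.isoformat())
```

The UTC values match the range printed above, which suggests the notebook was run on a machine set to UTC.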

# 13. `sessionId` feature
This should be a technical id for the user's session. Let's confirm that thought.

## 13.1. Feature statistics

In [105]:
show_feature_stats(df, 'sessionId')

+-------+-----------------+
|summary|        sessionId|
+-------+-----------------+
|  count|           286500|
|   mean|1041.526554973822|
| stddev|726.7762634630741|
|    min|                1|
|    max|             2474|
+-------+-----------------+



***Observations:***
* We have no missing values.
* `min` is 1 and `max` is 2474. Let's see whether 2474 is also the number of unique sessions; that would mean the session id is a counter incremented by 1 for each new session.

## 13.2. How many unique values?

In [107]:
df.select('sessionId').dropDuplicates().count()

2354

***Observations:*** the number of distinct sessions (2354) is close to the `max` value (2474), so the ids look like an incrementing counter with some values absent from this subset. This feature is probably not relevant as-is for our churn prediction problem.

# 14. `song` feature
Self-explanatory: this is the song currently played by the user. We should have the same number of missing values as for `artist`.

## 14.1. Feature statistics

In [108]:
show_feature_stats(df, 'song')

+-------+--------------------+
|summary|                song|
+-------+--------------------+
|  count|              228108|
|   mean|            Infinity|
| stddev|                 NaN|
|    min|ÃÂg ÃÂtti Gr...|
|    max|ÃÂau hafa slopp...|
+-------+--------------------+



***Note:*** we have the same number of missing values, as expected.

## 14.2. What is the most played song?

In [109]:
value_counts(df, 'song')

+--------------------+-----+
|                song|count|
+--------------------+-----+
|      You're The One| 1153|
|                Undo| 1026|
|             Revelry|  854|
|       Sehr kosmisch|  728|
|Horn Concerto No....|  641|
|Dog Days Are Over...|  574|
|             Secrets|  466|
|        Use Somebody|  459|
|              Canada|  435|
|             Invalid|  424|
|    Ain't Misbehavin|  409|
|       ReprÃÂ©sente|  393|
|SinceritÃÂ© Et J...|  384|
|Catch You Baby (S...|  373|
|              Yellow|  343|
|    Somebody To Love|  343|
|    Hey_ Soul Sister|  334|
|            The Gift|  327|
|           Fireflies|  312|
|          Love Story|  309|
+--------------------+-----+
only showing top 20 rows



# 15. `status` feature
This is ambiguous so let's analyze it.

## 15.1. Feature statistics

In [110]:
show_feature_stats(df, 'status')

+-------+------------------+
|summary|            status|
+-------+------------------+
|  count|            286500|
|   mean|210.05459685863875|
| stddev| 31.50507848842214|
|    min|               200|
|    max|               404|
+-------+------------------+



***Observations:***
* We have no missing values.
* `min` and `max` do not tell us much so far.

## 15.2. How many unique values?

In [111]:
value_counts(df, 'status')

+------+------+
|status| count|
+------+------+
|   200|259812|
|   307| 26430|
|   404|   258|
+------+------+



Okay, this is the HTTP status code (remember that we also have the HTTP method in the dataset, so why not the status code?). Here are the meanings:
* 200: OK (request worked without error)
* 404: Not Found (page not found)
* 307: Temporary Redirect

For more details, please refer to this [page](https://developer.mozilla.org/fr/docs/Web/HTTP/Status).
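
The same lookup can be done without leaving Python: the standard library's `http.HTTPStatus` enum carries the reason phrase of each code. A small sketch for the three codes observed in this dataset:

```python
from http import HTTPStatus

# Status codes observed in the value_counts(df, 'status') output above.
observed_codes = [200, 307, 404]

# Map each code to its standard reason phrase.
status_phrases = {code: HTTPStatus(code).phrase for code in observed_codes}
print(status_phrases)
```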

# 16. `ts` feature
"ts" is a common abbreviation for "timestamp".

## 16.1. Feature statistics

In [112]:
show_feature_stats(df, 'ts')

+-------+--------------------+
|summary|                  ts|
+-------+--------------------+
|  count|              286500|
|   mean|1.540956889810483...|
| stddev|1.5075439608226302E9|
|    min|       1538352117000|
|    max|       1543799476000|
+-------+--------------------+



## 16.2. Date range for this dataset?

In [113]:
begin = datetime.fromtimestamp(df.agg({"ts": "min"}).collect()[0][0]/1000)
end = datetime.fromtimestamp(df.agg({"ts": "max"}).collect()[0][0]/1000)
print("Date range is from {} to {}".format(begin, end))

Date range is from 2018-10-01 00:01:57 to 2018-12-03 01:11:16


***Note:*** we have 2 months of data (remember that our users registered within an 8-month period, from March to November).
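
Both durations can be checked directly from the raw millisecond values, independently of any timezone. This is a plain-Python sketch; `span_in_days` is a helper name introduced here, and the inputs are the `min` and `max` values printed in the statistics for `ts` and `registration`.

```python
from datetime import timedelta

def span_in_days(start_ms, end_ms):
    """Length of the interval between two millisecond timestamps, in days."""
    return timedelta(milliseconds=end_ms - start_ms) / timedelta(days=1)

activity_days = span_in_days(1538352117000, 1543799476000)      # ts min / max
registration_days = span_in_days(1521380675000, 1543247354000)  # registration min / max
print(round(activity_days, 1), round(registration_days, 1))
```

That is about 63 days of activity and about 253 days between the first and last registration, in line with the "2 months" and "8 months" quoted above.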

# 17. `userAgent` feature
I guess this is the user-agent string of the user's browser. More details about user agents [here](https://en.wikipedia.org/wiki/User_agent).

## 17.1. Feature statistics

In [114]:
show_feature_stats(df, 'userAgent')

+-------+--------------------+
|summary|           userAgent|
+-------+--------------------+
|  count|              278154|
|   mean|                null|
| stddev|                null|
|    min|"Mozilla/5.0 (Mac...|
|    max|Mozilla/5.0 (comp...|
+-------+--------------------+



***Observations:*** we have as many missing values as for features related to users (`firstName`, `lastName`, etc)

## 17.2. How many unique values?

In [115]:
df.select('userAgent').dropDuplicates().count()

57

In [119]:
value_counts(df, 'userAgent')

+--------------------+-----+
|           userAgent|count|
+--------------------+-----+
|"Mozilla/5.0 (Win...|22751|
|"Mozilla/5.0 (Mac...|19611|
|"Mozilla/5.0 (Mac...|18448|
|"Mozilla/5.0 (Mac...|17348|
|Mozilla/5.0 (Wind...|16700|
|"Mozilla/5.0 (Win...|15395|
|"Mozilla/5.0 (Win...|14598|
|Mozilla/5.0 (Maci...|10300|
|"Mozilla/5.0 (iPa...| 8912|
|Mozilla/5.0 (comp...| 8624|
|"Mozilla/5.0 (Mac...| 8094|
|"Mozilla/5.0 (Win...| 7923|
|"Mozilla/5.0 (Mac...| 7906|
|"Mozilla/5.0 (Win...| 7624|
|"Mozilla/5.0 (iPh...| 6417|
|Mozilla/5.0 (Wind...| 5989|
|"Mozilla/5.0 (Mac...| 5716|
|"Mozilla/5.0 (Win...| 5238|
|"Mozilla/5.0 (Win...| 4917|
|Mozilla/5.0 (Wind...| 4663|
+--------------------+-----+
only showing top 20 rows



***Note:*** it seems that we could extract the user's OS (Windows, Mac) or device (iPad, iPhone). This might be used to analyze whether users of a given OS tend to churn more than others (which might mean that the app is unsatisfying or buggy on that platform).
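
A very rough way to do this extraction is plain substring matching. The sketch below is only an assumption about how it could be done: `parse_user_agent` is a hypothetical helper, the categories are deliberately coarse, and the sample strings are illustrative user agents rather than values copied from the dataset.

```python
def parse_user_agent(user_agent):
    """Return a coarse (os, device) pair guessed from a user-agent string."""
    # Some values in the dataset are wrapped in double quotes; strip them.
    ua = (user_agent or "").strip('"')
    # iOS user agents contain "like Mac OS X", so they are checked before the Mac rule.
    if "iPad" in ua:
        return "iOS", "iPad"
    if "iPhone" in ua:
        return "iOS", "iPhone"
    if "Windows" in ua:
        return "Windows", "desktop"
    if "Mac OS X" in ua:
        return "Mac", "desktop"
    if "Linux" in ua:
        return "Linux", "desktop"
    return "unknown", "unknown"

# Illustrative user-agent strings, not copied from the dataset.
samples = [
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"',
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0",
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53"',
]
parsed = [parse_user_agent(s) for s in samples]
print(parsed)
```

With only 57 distinct user agents in this subset, the resulting categories could then be turned into dummy variables.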

# 18. `userId` feature
Must be a technical id for the user. We have already seen many times that there are 8346 missing values for features related to users.

## 18.1. Feature statistics

In [120]:
show_feature_stats(df, 'userId')

+-------+-----------------+
|summary|           userId|
+-------+-----------------+
|  count|           286500|
|   mean|59682.02278593872|
| stddev|109091.9499991047|
|    min|                 |
|    max|               99|
+-------+-----------------+



***Observations:*** we can see the empty string for the user id here. At first glance there seem to be no missing values, but there actually are: 8346 rows have an empty `userId`.

## 18.2. How many unique values?

In [121]:
value_counts(df, 'userId')

+------+-----+
|userId|count|
+------+-----+
|    39| 9632|
|      | 8346|
|    92| 7230|
|   140| 6880|
|300011| 5732|
|   124| 4825|
|300021| 4659|
|300017| 4428|
|    85| 4370|
|    42| 4257|
|200023| 3769|
|     6| 3761|
|    29| 3603|
|    54| 3437|
|   100| 3214|
|     9| 3191|
|   126| 3102|
|300015| 3051|
|    91| 3014|
|    98| 2891|
+------+-----+
only showing top 20 rows



# 19. CONCLUSION
Here is a summary of this first phase of Data Exploration:

| Column name | Type | Data Definition | Decision |
|-------------|-----------------|----------|----------|
| `artist`    | Categorical     | Name of the artist the user is currently listening to | **Drop**, it will not help us to identify when user will churn |
| `auth`    | Categorical     | Login status of the user | **Drop**, it will not help us to identify when user will churn |
| `firstName`    | Categorical     | First name of the user | **Drop**, it will not help us to identify when user will churn |
| `lastName`    | Categorical     | Last name of the user | **Drop**, it will not help us to identify when user will churn |
| `gender`    | Categorical     | Gender of the user | **Drop**, it will not help us to identify when user will churn |
| `itemInSession`    | Numerical     | Number of elements played in the same session | **Keep**, indicates user engagement with the service |
| `length`    | Numerical     | Number of seconds of the song listened by user | **Keep**, indicates user engagement with the service |
| `level`    | Categorical     | Free or paid user? | **Keep**, indicates user engagement with the service. Will be transformed as dummy |
| `location`    | Categorical     | User's location (in United States) | **Drop**, it will not help us to identify when user will churn |
| `method`    | Categorical     | HTTP method used for the action | **Drop**, it will not help us to identify when user will churn |
| `page`    | Categorical     | User's action | **Keep**, needs to be investigated more as it indicates user engagement with the service. Target will be generated from one specific value of this feature |
| `registration`    | Numerical     | User's registration timestamp | **Keep**, needs to be investigated more as it indicates user engagement with the service |
| `sessionId`    | Numerical     | Session id | **Keep** to be transformed into something else such as number of times user came back |
| `song`    | Categorical     | Song currently played by user | **Drop**, it will not help us to identify when user will churn |
| `status`    | Numerical     | HTTP code for user's action | **Drop**, it will not help us to identify when user will churn |
| `ts`    | Numerical     | User's action timestamp | **Keep** to be transformed into something else such as number of actions within a time window |
| `userAgent`    | Categorical     | User's browser user-agent | **Keep**, maybe we can extract OS and device and dummy that |
| `userId`    | Categorical     | User technical id (stored as a string) | **Drop**, it will not help us to identify when user will churn |
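
For later reuse, the decisions in the table can be captured as a simple lookup. This is just a sketch in plain Python; the `decisions` dictionary copies the "Decision" column and the variable names are introduced here for illustration.

```python
# Decisions copied from the summary table above.
decisions = {
    "artist": "drop", "auth": "drop", "firstName": "drop", "lastName": "drop",
    "gender": "drop", "itemInSession": "keep", "length": "keep", "level": "keep",
    "location": "drop", "method": "drop", "page": "keep", "registration": "keep",
    "sessionId": "keep", "song": "drop", "status": "drop", "ts": "keep",
    "userAgent": "keep", "userId": "drop",
}

columns_to_keep = [name for name, decision in decisions.items() if decision == "keep"]
columns_to_drop = [name for name, decision in decisions.items() if decision == "drop"]
print(columns_to_keep)
```

That leaves 8 of the 18 columns as candidates for feature engineering.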

The basic exploration of the dataset is now over. Next, we will explore in detail a clean dataset with the target defined, in this [notebook about EDA](2_Sparkify_Data_Exploration.ipynb).