# Sparkify Project 

This workspace contains a tiny subset (128MB) of the full dataset available (12GB). Feel free to use this workspace to build your project, or to explore a smaller subset with Spark before deploying your cluster on the cloud. Instructions for setting up your Spark cluster are included in the last lesson of the Extracurricular Spark Course content.

You can follow the steps below to guide your data analysis and model building portion of this project.

## Table of Contents

**Target**: Predict users at risk of churning, either by downgrading from the premium to the free tier or by cancelling their service altogether.

`1.` Load Dataset <br>
`2.` Explore and Clean Data. <br>
`3.` Feature Engineering. <br>
`4.` Build Model. <br>

In [233]:
# import libraries
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import udf, sum, desc, lit, date_format, from_unixtime, isnan, when, count, col, substring
from pyspark.sql.types import IntegerType, DateType, StringType
from datetime import datetime

In [2]:
# create a Spark session
spark = SparkSession \
    .builder \
    .appName("Sparkify") \
    .getOrCreate()

# Load Dataset
In this workspace, the mini-dataset file is `mini_sparkify_event_data.json`. Load and clean the dataset, checking for invalid or missing data - for example, records without userids or sessionids. 

In [3]:
# set data path and read in
data_path = "mini_sparkify_event_data.json"
mini_data = spark.read.json(data_path)

In [None]:
# display the table content
mini_data.show(1)

In [None]:
mini_data.take(1)

In [None]:
# count rows in this dataset
mini_data.count()

In [12]:
# register the mini data as a SQL temporary view
mini_data.createOrReplaceTempView("df_mini")

In [13]:
# Print the schema in a tree format
mini_data.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [14]:
# count distinct user_ids
spark.sql("select count(distinct(userId)) from df_mini").show()

+----------------------+
|count(DISTINCT userId)|
+----------------------+
|                   226|
+----------------------+



In [35]:
# print out null sums in each column 
[(c, spark.sql("select * from df_mini where {} is Null".format(c)).count()) for c in mini_data.columns]

[('artist', 58392),
 ('auth', 0),
 ('firstName', 8346),
 ('gender', 8346),
 ('itemInSession', 0),
 ('lastName', 8346),
 ('length', 58392),
 ('level', 0),
 ('location', 8346),
 ('method', 0),
 ('page', 0),
 ('registration', 8346),
 ('sessionId', 0),
 ('song', 58392),
 ('status', 0),
 ('ts', 0),
 ('userAgent', 8346),
 ('userId', 0)]

In [40]:
# print out '' sums in each column 
[(c, spark.sql("select * from df_mini where {} = ''".format(c)).count()) for c in mini_data.columns]

[('artist', 0),
 ('auth', 0),
 ('firstName', 0),
 ('gender', 0),
 ('itemInSession', 0),
 ('lastName', 0),
 ('length', 0),
 ('level', 0),
 ('location', 0),
 ('method', 0),
 ('page', 0),
 ('registration', 0),
 ('sessionId', 0),
 ('song', 0),
 ('status', 0),
 ('ts', 0),
 ('userAgent', 0),
 ('userId', 8346)]

In [43]:
# check which pages have a non-null artist
spark.sql("select distinct page from df_mini where artist is not Null").show()

+--------+
|    page|
+--------+
|NextSong|
+--------+



In [50]:
# check the song values for records with an empty userId
spark.sql("select song from df_mini where userId = ''").show(2)

+----+
|song|
+----+
|null|
|null|
+----+
only showing top 2 rows



**Findings:** <br>

    There are two types of missing values in this dataset.
    * Nulls: artist/song/length -- 58,392. These occur because the logged page is not 'NextSong'.
    * Empty strings (''): userId -- 8,346. These occur when the user is not logged in. In that case, user-related fields such as first name, last name, and location are null.
    
    The next step is to keep logged-in users only, i.e. remove the records where userId is ''.
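
The two kinds of missing values can be told apart with a small check. The sketch below uses plain Python on a few made-up records so that it runs without a Spark session; the `records` list and the `count_missing` helper are hypothetical and only mirror the `is Null` and `= ''` queries above.

```python
# Hypothetical sample records; the real data lives in the Spark DataFrame.
records = [
    {"userId": "30", "artist": "Adele", "page": "NextSong"},
    {"userId": "30", "artist": None, "page": "Home"},
    {"userId": "", "artist": None, "page": "Login"},
]

def count_missing(rows, column):
    """Return (null_count, empty_string_count) for one column."""
    nulls = sum(1 for r in rows if r[column] is None)
    empties = sum(1 for r in rows if r[column] == "")
    return nulls, empties

for column in ("userId", "artist"):
    print(column, count_missing(records, column))

# Keep logged-in users only, as the notebook does with a filter on userId.
logged_in = [r for r in records if r["userId"] != ""]
```

Counting nulls and empty strings separately matters here because the two have different causes: a null artist is expected on non-song pages, while an empty userId marks a logged-out visitor.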

In [48]:
# Filter out records with an empty userId
mini_data_non0 = mini_data.filter(mini_data.userId != '')

In [55]:
mini_data_non0.count()

278154

In [52]:
mini_data_non0.createOrReplaceTempView("df_mini_non0")

In [53]:
# print out null sums in each column 
[(c, spark.sql("select * from df_mini_non0 where {} is Null".format(c)).count()) for c in mini_data_non0.columns]

[('artist', 50046),
 ('auth', 0),
 ('firstName', 0),
 ('gender', 0),
 ('itemInSession', 0),
 ('lastName', 0),
 ('length', 50046),
 ('level', 0),
 ('location', 0),
 ('method', 0),
 ('page', 0),
 ('registration', 0),
 ('sessionId', 0),
 ('song', 50046),
 ('status', 0),
 ('ts', 0),
 ('userAgent', 0),
 ('userId', 0)]

In [85]:
def dispaly_vara_cnt(vara):
    """ Show a column's unique-value count; if there are 10 or fewer unique values, show the counts per value.
    INPUT:
    vara -- (string) name of the column to count.
    OUTPUT:
    None; prints the total unique count or shows the group-by counts.
    """
    ttl_uniq_cnt = mini_data_non0.select([vara]).dropDuplicates().count()
    if ttl_uniq_cnt > 10:
        print("In this dataset, there are total {} unique {}".format(ttl_uniq_cnt, vara))
        
    elif vara in ['gender','firstName','lastName','location']:
        spark.sql("select {}, count(*) as unique_cnt from \
          (select distinct userId, {} from df_mini_non0) \
          group by {}".format(vara, vara, vara)).show()
        
    else:
        spark.sql("select {}, count(level) as unique_cnt from df_mini_non0\
           group by {} order by unique_cnt desc".format(vara, vara)).show()

In [86]:
dispaly_vara_cnt('userId')

In this dataset, there are total 225 unique userId


In [87]:
dispaly_vara_cnt('gender')

+------+----------+
|gender|unique_cnt|
+------+----------+
|     F|       104|
|     M|       121|
+------+----------+



In [88]:
dispaly_vara_cnt('artist')

In this dataset, there are total 17656 unique artist


In [89]:
dispaly_vara_cnt('itemInSession')

In this dataset, there are total 1311 unique itemInSession


In [90]:
dispaly_vara_cnt('length')

In this dataset, there are total 14866 unique length


In [91]:
dispaly_vara_cnt('level')

+-----+----------+
|level|unique_cnt|
+-----+----------+
| paid|    222433|
| free|     55721|
+-----+----------+



In [92]:
dispaly_vara_cnt('location')

In this dataset, there are total 114 unique location


In [93]:
dispaly_vara_cnt('method')

+------+----------+
|method|unique_cnt|
+------+----------+
|   PUT|    257818|
|   GET|     20336|
+------+----------+



In [94]:
dispaly_vara_cnt('page')

In this dataset, there are total 19 unique page


In [95]:
dispaly_vara_cnt('registration')

In this dataset, there are total 225 unique registration


In [96]:
dispaly_vara_cnt('song')

In this dataset, there are total 58481 unique song


In [97]:
dispaly_vara_cnt('status')

+------+----------+
|status|unique_cnt|
+------+----------+
|   200|    254718|
|   307|     23184|
|   404|       252|
+------+----------+



In [98]:
dispaly_vara_cnt('userAgent')

In this dataset, there are total 56 unique userAgent


**For now, keep gender, level, location, page, method, registration, status, and userAgent.**

# Exploratory Data Analysis
When you're working with the full dataset, perform EDA by loading a small subset of the data and doing basic manipulations within Spark. In this workspace, you are already provided a small subset of data you can explore.

### Define Churn

Once you've done some preliminary analysis, create a column `Churn` to use as the label for your model. I suggest using the `Cancellation Confirmation` events to define your churn, which happen for both paid and free users. As a bonus task, you can also look into the `Downgrade` events.
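
As a rough illustration of that labelling, the sketch below flags `Cancellation Confirmation` events and collects the users who have at least one. It uses plain Python on made-up events rather than Spark, so the logic can be checked in isolation; the `events` list and the variable names are hypothetical.

```python
CHURN_PAGE = "Cancellation Confirmation"

# Hypothetical (userId, page) events.
events = [
    ("18", "NextSong"),
    ("18", "Cancellation Confirmation"),
    ("32", "NextSong"),
    ("32", "Downgrade"),
]

# Event-level flag: 1 on the cancellation event, 0 elsewhere.
event_flags = [(user, page, int(page == CHURN_PAGE)) for user, page in events]

# User-level label: a user churned if any of their events is flagged.
churned_users = {user for user, _, flag in event_flags if flag == 1}
labels = {user: int(user in churned_users) for user, _ in events}
print(labels)
```

The event-level flag and the user-level label are kept separate because a model trained per user needs one label per user, not one per log row.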

### Explore Data
Once you've defined churn, perform some exploratory data analysis to observe the behavior for users who stayed vs users who churned. You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.

### Define Target Churn

In [132]:
# Define churn
# Count the unique users who reached the Cancellation Confirmation page
mini_data_non0.select(['userId']).where(mini_data_non0.page.isin(['Cancellation Confirmation']))\
            .dropDuplicates().count()


52

In [133]:
# Count the unique users who reached the Cancellation Confirmation or Downgrade page
mini_data_non0.select(['userId']).where(mini_data_non0.page.isin(['Cancellation Confirmation','Downgrade']))\
                .dropDuplicates().count()


171

In [101]:
# Total users
mini_data_non0.select(['userId']).dropDuplicates().count()

225

**Findings**

1. There are 225 unique users in this dataset.
2. Of these 225 users, 52 reached the `Cancellation Confirmation` page and 171 reached either the `Cancellation Confirmation` or the `Downgrade` page.

In [103]:
# Create new column to label cancellation or not
# define churn function
cancellation_event = udf(lambda x: 1 if x == 'Cancellation Confirmation' else 0, IntegerType())
# apply churn function
mini_data_non0 = mini_data_non0.withColumn("churn", cancellation_event("page"))

In [None]:
# fill churn user na as 0s
mini_data_non0 = mini_data_non0.na.fill({'churn_user': 0})

In [104]:
# get hour from ts
ts_to_hour = udf(lambda x: datetime.fromtimestamp(x / 1000.0).hour)
mini_data_non0 = mini_data_non0.withColumn("hour",ts_to_hour("ts"))

In [105]:
ts_to_day = udf(lambda x: datetime.fromtimestamp(x / 1000.0).day)
mini_data_non0 = mini_data_non0.withColumn("day",ts_to_day("ts"))

In [108]:
mini_data_non0 = mini_data_non0.withColumn("date",from_unixtime(mini_data_non0.ts/1000).cast(DateType()))
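
`datetime.fromtimestamp` called without a time zone uses the machine's local zone, so `hour` and `day` can differ between a laptop and a cluster. A minimal sketch of the same conversion with an explicit zone is shown below; choosing UTC is an assumption, and `ts_parts` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def ts_parts(ts_ms):
    """Split a millisecond epoch timestamp into (hour, day, ISO date) in UTC."""
    moment = datetime.fromtimestamp(ts_ms / 1000.0, tz=timezone.utc)
    return moment.hour, moment.day, moment.date().isoformat()

print(ts_parts(1538398800000))  # (13, 1, '2018-10-01')
```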

**Explore which columns to keep when comparing the churn and non-churn groups**

**For now, keep gender, level, location, page, method, registration, status, and userAgent.**

In [111]:
mini_data_non0.createOrReplaceTempView("df_mini_non0")

In [120]:
# distribution of churn dates among churned users
spark.sql("select date, count(date) as churn_user \
           from df_mini_non0 where churn = 1 \
           group by date order by date").show()

+----------+----------+
|      date|churn_user|
+----------+----------+
|2018-10-01|         1|
|2018-10-02|         1|
|2018-10-04|         2|
|2018-10-05|         1|
|2018-10-07|         2|
|2018-10-08|         1|
|2018-10-11|         1|
|2018-10-12|         2|
|2018-10-13|         2|
|2018-10-15|         2|
|2018-10-16|         2|
|2018-10-17|         2|
|2018-10-19|         1|
|2018-10-20|         3|
|2018-10-22|         2|
|2018-10-23|         2|
|2018-10-24|         1|
|2018-10-26|         1|
|2018-10-30|         1|
|2018-11-01|         2|
+----------+----------+
only showing top 20 rows



In [129]:
"""
# Count the unique (date, userId) rows -- considering defining churn on a rolling weekly basis
spark.sql("select date, userId from df_mini_non0 \
           where date <= '2018-11-25' \
           group by date, userId").count()

# This yields 2,845 records for predicting next-week churn
"""

2845

**Thoughts:** <br>
To observe monthly churn, I select 2018-11-01 as the benchmark date, 2018-10-01 to 2018-10-31 as the observation period, and the customers still active on 2018-10-31 as the base customer count. Customers who churned between 2018-11-01 and 2018-11-30 are the target churned customers. All customers who churned before 2018-11-01 are removed from the analysis; in other words, only customers who were active on 2018-10-31 are kept.
* Baseline: (214 - (52 - 22)) = 184.
* Churned in the following month: 22.
* Monthly churn rate: 22/184 ≈ 12%.

I also checked a rolling weekly base for predicting next-week churn: that would generate 2,845 rows instead of 195 rows.
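
The baseline and churn-rate arithmetic above can be reproduced as follows. The figure 214 is taken from the text as-is; reading it as the number of users seen during the observation period is an assumption, and the variable names are hypothetical.

```python
users_in_observation = 214   # from the text above; the interpretation is an assumption
churned_total = 52           # users with a Cancellation Confirmation event
churned_in_target = 22       # of those, users who churned during 2018-11

churned_before_target = churned_total - churned_in_target
baseline = users_in_observation - churned_before_target
monthly_churn_rate = churned_in_target / baseline

print(baseline)                     # 184
print(f"{monthly_churn_rate:.1%}")  # 12.0%
```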

**Next Steps**:<br>
1) Filter out the logs of users who churned before 2018-11-01. <br>
2) Aggregate user activities during the observation period (2018-10-01 to 2018-10-31). <br>
3) Compare the differences between the churn and stay groups. <br>
4) If time allows, try the rolling weekly method.
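
The first three steps can be sketched on a toy event list before running them in Spark. This is a plain-Python illustration under stated assumptions: the `events` data, the `churners` helper, and the half-open date windows are hypothetical choices, not the notebook's own code.

```python
from datetime import date

OBS_START, TARGET_START, TARGET_END = date(2018, 10, 1), date(2018, 11, 1), date(2018, 12, 1)
CHURN_PAGE = "Cancellation Confirmation"

# Hypothetical (userId, date, page) events.
events = [
    ("1", date(2018, 10, 5), "NextSong"),
    ("1", date(2018, 10, 9), CHURN_PAGE),      # churned during observation
    ("2", date(2018, 10, 7), "NextSong"),
    ("2", date(2018, 11, 3), CHURN_PAGE),      # churned during target
    ("3", date(2018, 10, 20), "Thumbs Up"),
]

def churners(rows, start, end):
    """Users with a churn event on a date in [start, end)."""
    return {u for u, d, p in rows if p == CHURN_PAGE and start <= d < end}

early = churners(events, OBS_START, TARGET_START)
late = churners(events, TARGET_START, TARGET_END)

# Keep observation-period events of users who did not churn early,
# then label each remaining user by whether they churned in the target period.
observed = [(u, d, p) for u, d, p in events
            if OBS_START <= d < TARGET_START and u not in early]
labels = {u: int(u in late) for u, _, _ in observed}
print(labels)
```

Separating the observation window from the target window keeps the features free of any information from the month in which churn is being predicted.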

`1.` Get the observation-period data

In [210]:
mini_data_obv = spark.sql("select * from df_mini_non0 \
                          where date >= '2018-10-01' and date < '2018-11-01'")

`2.` Remove users who churned between 2018-10-01 and 2018-11-01

In [211]:
# get unique churn users
churn_user_obv = mini_data_obv.select(['userId']).where(mini_data_obv.churn == 1).dropDuplicates()
# assign new churn user label to churn users
churn_user_obv = churn_user_obv.withColumn("churn_user_obv", lit(1))
# join churn users back to the original table to get a label
mini_data_obv = mini_data_obv.join(churn_user_obv, "userId", how = 'outer')
# remove records of users who churned between 2018-10-01 and 2018-11-01
mini_data_obv = mini_data_obv.where(col('churn_user_obv').isNull())

`3.` Get the unique userIds who churned during the target period (2018-11-01 to 2018-12-01), then join them back to the observation-period data.

In [212]:
# Churn period data
mini_data_target = spark.sql("select * from df_mini_non0 \
                          where date >= '2018-11-01' and date < '2018-12-01'")

# unique users churned during the target period
churn_user_target = mini_data_target.select(['userId']).where(mini_data_target.churn == 1).dropDuplicates()

# assign new churn user label to churn users
churn_user_target = churn_user_target.withColumn("churn_user_target", lit(1))

# join churn users back to the observation table to get a label
mini_data_obv = mini_data_obv.join(churn_user_target, "userId", how = 'outer')

# fill na with 0s in the observational dataset
mini_data_obv = mini_data_obv.fillna(0, subset = ['churn_user_target'])

# Drop some columns
mini_data_obv = mini_data_obv.drop('churn').drop('churn_user_obv').drop('month')

In [213]:
mini_data_obv.createOrReplaceTempView("df_mini_non0_obv")

In [214]:
# churn/not churn counts
spark.sql("select churn_user_target, count(*) as cnt from df_mini_non0_obv\
           group by churn_user_target order by churn_user_target").show()

+-----------------+------+
|churn_user_target|   cnt|
+-----------------+------+
|                0|108206|
|                1| 20168|
+-----------------+------+



In [215]:
# unique churn/nonchurn user counts
spark.sql("select churn_user_target, count(*) as cnt from \
          (select distinct userId,  churn_user_target from df_mini_non0_obv)\
          group by churn_user_target order by churn_user_target").show()

+-----------------+---+
|churn_user_target|cnt|
+-----------------+---+
|                0|162|
|                1| 22|
+-----------------+---+



`4.` Transform the dataframe into a user-by-feature format. The result should have 184 rows (one per user) and many columns.

In [221]:
# gender
df_gender = spark.sql("select distinct userId, gender, churn_user_target from df_mini_non0_obv")
df_gender.count()

184

In [278]:
# location
df_location = spark.sql("select distinct userId, location, churn_user_target from df_mini_non0_obv")
# extract State -- 37
substr_state = udf(lambda x: x[-2:], StringType())
df_location = df_location.withColumn("state", substr_state("location"))
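
Taking the last two characters works when a location ends in a single state code, but it keeps only the final code when several are listed. The sketch below compares that rule with an alternative that returns every code; the sample strings and the `City, ST` style they follow are assumptions, and both helper names are hypothetical.

```python
def last_state(location):
    """Same rule as the UDF above: the final two characters."""
    return location[-2:]

def all_states(location):
    """Alternative: every state code after the last comma, split on '-'."""
    return location.rsplit(", ", 1)[-1].split("-")

# Hypothetical location strings in a "City, ST" style.
for loc in ("Bakersfield, CA", "New York-Newark-Jersey City, NY-NJ-PA"):
    print(last_state(loc), all_states(loc))
```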

In [280]:
# userAgent
df_useAgen = spark.sql("select distinct userId, userAgent, churn_user_target from df_mini_non0_obv")
remove_qot = udf(lambda x: x.replace(u'"',''))
substr_agent = udf(lambda x: x[13:21], StringType())
df_useAgen = df_useAgen.withColumn("opr_system", substr_agent(remove_qot("userAgent")))
df_useAgen.select(['opr_system']).dropDuplicates().show()

+----------+
|opr_system|
+----------+
|  iPad; CP|
|  Windows |
|  compatib|
|  Macintos|
|  iPhone; |
|  X11; Ubu|
|  X11; Lin|
+----------+
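
The fixed slice `x[13:21]` cuts every token at eight characters, which is why values such as `Macintos` and `compatib` appear above. A possible alternative is to take the first token inside the parentheses with a regular expression. This is only a sketch: the sample user-agent strings and the `platform_token` helper are hypothetical.

```python
import re

def platform_token(user_agent):
    """Return the first token inside the first parentheses, or None."""
    match = re.search(r"\(([^;)]+)", user_agent.replace('"', ""))
    return match.group(1).strip() if match else None

# Hypothetical user-agent strings.
samples = [
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36"',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X)',
]
print([platform_token(ua) for ua in samples])
```

Returning `None` when no parenthesised token is found avoids the index errors or empty strings a fixed slice would produce on unusually short agents.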



In [284]:
# Total songs played during the month
df_song = spark.sql("select userId, count(song) as ttl_song\
                     from df_mini_non0_obv \
                     where song is not Null \
                     group by userId")

df_song.describe('ttl_song').show()

+-------+-----------------+
|summary|         ttl_song|
+-------+-----------------+
|  count|              183|
|   mean|572.1584699453551|
| stddev| 649.292211448894|
|    min|                1|
|    max|             5127|
+-------+-----------------+



In [297]:
# Pages
# reshape page counts -- userId as row, page as column, count as value
df_page = mini_data_obv.groupby('userId').pivot('page').count()
df_page.show(2)

+------+-----+----------+---------------+---------+-----+----+----+------+--------+-----------+-------------+--------+----------------+--------------+-----------+---------+-------+
|userId|About|Add Friend|Add to Playlist|Downgrade|Error|Help|Home|Logout|NextSong|Roll Advert|Save Settings|Settings|Submit Downgrade|Submit Upgrade|Thumbs Down|Thumbs Up|Upgrade|
+------+-----+----------+---------------+---------+-----+----+----+------+--------+-----------+-------------+--------+----------------+--------------+-----------+---------+-------+
|200002|    2|         4|              6|        3| null|   1|  14|     3|     267|          7|         null|       3|            null|             1|          6|       15|      2|
|100010| null|         3|              2|     null| null|   1|   6|     2|     120|         22|         null|    null|            null|          null|          1|        6|      1|
+------+-----+----------+---------------+---------+-----+----+----+------+--------+-----------+-------------+--------+----------------+--------------+-----------+---------+-------+
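
The `null` cells in the pivot mean the user never visited that page, so they would usually be treated as zero counts before modelling. The same user-by-page reshaping can be sketched in plain Python; the `events` list is made up and the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (userId, page) events.
events = [("200002", "Home"), ("200002", "NextSong"), ("200002", "NextSong"),
          ("100010", "NextSong"), ("100010", "Thumbs Up")]

counts = defaultdict(Counter)
for user, page in events:
    counts[user][page] += 1

pages = sorted({page for _, page in events})
# One row per user; pages a user never visited get 0 instead of null.
table = {user: [counts[user][page] for page in pages] for user in counts}
print(pages)
print(table)
```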

In [208]:
mini_data_obv.printSchema()

root
 |-- userId: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- day: string (nullable = true)
 |-- date: date (nullable = true)
 |-- churn_user_target: integer (nullable = true)



In [308]:
# length
df_song_length = spark.sql("select userId, round(avg(length)) as avg_song_length\
                        from df_mini_non0_obv \
                        where song is not Null \
                        group by userId")

df_song_length.describe('avg_song_length').show()

+-------+------------------+
|summary|   avg_song_length|
+-------+------------------+
|  count|               183|
|   mean|248.80327868852459|
| stddev| 7.295534549021886|
|    min|             222.0|
|    max|             280.0|
+-------+------------------+



In [312]:
# level -- idea: use a window function to mark changes (free as 0, paid as 1) and count the days on free vs. paid
df_level = spark.sql("select distinct userId, level, churn_user_target from df_mini_non0_obv sort by userId")
df_level.show()

+------+-----+-----------------+
|userId|level|churn_user_target|
+------+-----+-----------------+
|   146| free|                0|
|    38| paid|                0|
|    88| free|                0|
|   126| paid|                0|
|   132| paid|                0|
|   104| paid|                0|
|300021| paid|                0|
|    82| free|                0|
|300018| free|                0|
|    10| paid|                0|
|   120| paid|                0|
|300011| paid|                0|
|    81| paid|                0|
|300019| free|                0|
|    40| paid|                0|
|    70| free|                1|
|   127| paid|                0|
|     8| free|                0|
|200017| paid|                1|
|    28| free|                1|
+------+-----+-----------------+
only showing top 20 rows



In [320]:
spark.sql("select userId, count(*) as cnt from \
         (select distinct userId, level, churn_user_target from df_mini_non0_obv)\
           group by userId sort by cnt desc").show()

+------+---+
|userId|cnt|
+------+---+
|200002|  2|
|100010|  1|
|     7|  1|
|   124|  1|
|    54|  2|
|    15|  1|
|   132|  2|
|100014|  1|
|    11|  2|
|   138|  2|
|300017|  1|
|    29|  2|
|    69|  2|
|100021|  1|
|    42|  2|
|   112|  1|
|200010|  1|
|    64|  1|
|    30|  2|
|   113|  2|
+------+---+
only showing top 20 rows
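
Users with a count of 2 appear under both `free` and `paid` during the month. One way to turn that into features, following the idea of counting the days spent on each level, is sketched below in plain Python. The `events` data and the `free_days` / `paid_days` feature names are hypothetical, and counting distinct active days is an assumption about how "days" should be measured.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (userId, date, level) events.
events = [
    ("54", date(2018, 10, 1), "free"),
    ("54", date(2018, 10, 1), "free"),
    ("54", date(2018, 10, 2), "paid"),
    ("54", date(2018, 10, 3), "paid"),
    ("7", date(2018, 10, 4), "free"),
]

# Distinct active days per (user, level).
days = defaultdict(set)
for user, day, level in events:
    days[(user, level)].add(day)

features = defaultdict(lambda: {"free_days": 0, "paid_days": 0})
for (user, level), active in days.items():
    features[user][f"{level}_days"] = len(active)
print(dict(features))
```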



In [326]:
# method -- TBD
df_method = spark.sql("select distinct userId, method, churn_user_target from df_mini_non0_obv sort by userId")
df_method.count()

366

In [328]:
spark.sql("select distinct method from df_mini_non0_obv").show()

+------+
|method|
+------+
|   PUT|
|   GET|
+------+



In [330]:
# registration -- drop, since each user has a distinct registration timestamp
df_regi = spark.sql("select distinct userId, registration, churn_user_target from df_mini_non0_obv sort by userId")
df_regi.show()

+------+-------------+-----------------+
|userId| registration|churn_user_target|
+------+-------------+-----------------+
|    45|1536398117000|                0|
|   118|1537893493000|                0|
|300016|1534622171000|                0|
|    11|1532554781000|                0|
|    13|1533192032000|                0|
|300012|1530306321000|                0|
|300019|1536158069000|                0|
|    36|1533908361000|                0|
|    90|1533995214000|                0|
|    57|1535062159000|                0|
|300011|1538336771000|                0|
|    76|1538065863000|                0|
|    92|1536403972000|                0|
|300022|1534461078000|                0|
|    60|1537014411000|                0|
|     8|1533650280000|                0|
|    88|1536663902000|                0|
|    62|1531804365000|                0|
|200008|1533670697000|                0|
|200006|1536963671000|                0|
+------+-------------+-----------------+
only showing top 20 rows

In [333]:
# sessionId -- drop, since it has too many distinct values
df_session = spark.sql("select distinct userId, sessionId, churn_user_target from df_mini_non0_obv sort by userId")
df_session.count()

1619

In [335]:
# status -- need to figure out how the status changes for each user
df_status = spark.sql("select distinct userId, status, churn_user_target from df_mini_non0_obv sort by userId")
df_status.count()

435

In [337]:
spark.sql("select distinct status from df_mini_non0_obv").show()

+------+
|status|
+------+
|   307|
|   404|
|   200|
+------+



**Next Steps** <br>

* Combine the separate steps into one pipeline (SQL + substring + pivot table)
* Compare the feature distributions of the two classes
* Fit an ML model
* Move to an AWS cluster

In [290]:
spark.sql("select userId, page, count(*) as page_cnt from df_mini_non0_obv \
           group by userId, page").count()

2247

In [294]:
mini_data_obv.groupby('userId').pivot('page').count().count()

184

In [None]:
# gender counts
spark.sql("select gender, churn_user, count(*) as cnt from \
          (select distinct userId, gender, churn_user from df_mini_non0)\
          group by gender, churn_user order by churn_user, gender").show()

In [None]:
# level counts
spark.sql("select level, churn_user, count(*) as cnt \
           from df_mini_non0 group by level, churn_user order by churn_user, level").show()

In [None]:
# page counts
spark.sql("select page, churn_user, count(*) as cnt \
           from df_mini_non0 group by page, churn_user order by churn_user, page").show()

In [None]:
# song counts
spark.sql("select churn_user, count(song) as song_cnt \
           from df_mini_non0 group by churn_user order by churn_user").show()

In [None]:
# method
spark.sql("select method, churn_user, count(*) as cnt \
           from df_mini_non0 group by method, churn_user order by churn_user, method").show()

In [None]:
mini_data_non0.select("hour").show()

You can start by exploring aggregates on these two groups of users, observing how much of a specific action they experienced per a certain time unit or number of songs played.

Rows: one per unique user <br>
Columns:
1. number of songs played
2. actions taken and the number of each action
3. gender
4. level
5. user agent (extracted)
6. hour and day (via UDF)
7. location
8. method

Open questions:
1. How should the time unit be defined?
2. How should activity be aggregated within each time unit?

# Feature Engineering
Once you've familiarized yourself with the data, build out the features you find promising to train your model on. To work with the full dataset, you can follow the steps below.
- Write a script to extract the necessary features from the smaller subset of data
- Ensure that your script is scalable, using the best practices discussed in Lesson 3
- Try your script on the full data set, debugging your script if necessary

If you are working in the classroom workspace, you can just extract features based on the small subset of data contained here. Be sure to transfer over this work to the larger dataset when you work on your Spark cluster.

In [None]:
# Gender table
user_gender = spark.sql("SELECT distinct userId, gender from mini_FE")

In [None]:
# Location
user_location = spark.sql("SELECT distinct userId, location from mini_FE")

# Modeling
Split the full dataset into train, test, and validation sets. Test out several of the machine learning methods you learned. Evaluate the accuracy of the various models, tuning parameters as necessary. Determine your winning model based on test accuracy and report results on the validation set. Since the churned users are a fairly small subset, I suggest using F1 score as the metric to optimize.
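
To see why F1 is a better fit than accuracy here, consider the 162 stayed / 22 churned split found during exploration. The sketch below is a worked example in plain Python; the `f1_score` helper is a hypothetical name, returning 0.0 when there are no true positives is a convention chosen for this sketch, and the second model's counts are invented for illustration.

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

stayed, churned = 162, 22   # user counts from the exploration above

# Baseline that predicts "stay" for everyone.
accuracy_all_stay = stayed / (stayed + churned)
f1_all_stay = f1_score(tp=0, fp=0, fn=churned)

# Hypothetical model: finds 11 of the 22 churners with 11 false alarms.
f1_model = f1_score(tp=11, fp=11, fn=11)
print(round(accuracy_all_stay, 3), f1_all_stay, f1_model)
```

The always-stay baseline reaches about 88% accuracy while finding no churners at all, which is why its F1 is zero.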

# Final Steps
Clean up your code, adding comments and renaming variables to make the code easier to read and maintain. Refer to the Spark Project Overview page and Data Scientist Capstone Project Rubric to make sure you are including all components of the capstone project and meeting all expectations. Remember, this includes thorough documentation in a README file in a GitHub repository, as well as a web app or blog post.