Churn Prediction Sparkify

Overview

Predicting user churn for a music streaming service, on a local machine and on AWS EMR.

Predicting user churn (cancellation) is an important predictive task. This project tackles it for a music streaming service, Sparkify. By exploring Sparkify usage data, the project identifies features for model training. For computational efficiency, a tiny dataset (240 MB), which is a sample of the full dataset (12 GB), is used for initial data exploration, feature engineering and modelling experiments on a local machine (workflow in Sparkify_local.ipynb).

The initial work on the tiny dataset identifies the most suitable model and hyper-parameters for training the final model on the full dataset. Once the features and model are identified, they are used to model the full dataset on AWS EMR (workflow in Sparkify_AWS_EMR.ipynb).

The actionable outcome of churn prediction is to identify users who are likely to churn and send them offers that may keep them from clicking the cancellation confirmation.
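As a rough illustration of the churn definition above, the sketch below labels a user as churned once any of their events is a cancellation confirmation. It uses plain Python for readability; the notebooks use PySpark, and the field names `userId` and `page`, the page value `"Cancellation Confirmation"`, and the helper `label_churn` are assumptions made for this example rather than details confirmed by this README.

```python
from collections import defaultdict

# Assumed page name that marks a cancellation (not confirmed by this README).
CHURN_PAGE = "Cancellation Confirmation"


def label_churn(events):
    """Return {userId: 1 if the user ever reached the churn page, else 0}.

    `events` is an iterable of dicts with assumed keys "userId" and "page".
    """
    labels = defaultdict(int)  # every user starts as not churned (0)
    for event in events:
        user = event["userId"]
        is_churn_event = int(event["page"] == CHURN_PAGE)
        labels[user] = max(labels[user], is_churn_event)
    return dict(labels)


# Hypothetical events, made up for illustration only.
sample_events = [
    {"userId": "u1", "page": "NextSong"},
    {"userId": "u1", "page": "Cancellation Confirmation"},
    {"userId": "u2", "page": "NextSong"},
    {"userId": "u2", "page": "Thumbs Up"},
]

churn_labels = label_churn(sample_events)
print(churn_labels)
```

Labelling per user rather than per event keeps one target value per user, which is the unit that would receive a retention offer.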

My Medium post provides a more detailed explanation of this project.

Requirements:

Python 3
PySpark
Pandas
Matplotlib
Seaborn
Jupyter Notebook

Instructions:

Data:
Tiny: 240 MB
Big: 12 GB (s3a://udacity-dsnd/sparkify/sparkify_event_data.json)

To run Sparkify_local.ipynb, open it in Jupyter Notebook and run it.
To run Sparkify_AWS_EMR.ipynb, spin up an AWS EMR cluster and create the Sparkify_AWS_EMR.ipynb notebook on it.

Results

Exploratory data analysis

Training results

All features are adopted for modelling the large dataset.

Test data evaluation results

On the tiny dataset on a local machine:
Evaluation result:
+---------+------+------+--------+
|precision|recall|    f1|accuracy|
+---------+------+------+--------+
|   0.8611|0.8662|0.8622|  0.8662|
+---------+------+------+--------+

On the large dataset on AWS:
Evaluation result:
+------+------+---------+--------+
|    f1|recall|precision|accuracy|
+------+------+---------+--------+
|0.7254|0.7868|   0.7444|  0.7908|
+------+------+---------+--------+
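The README does not say how the four metrics above were computed. One plausible reading is that precision, recall and F1 are class-weighted averages, as reported by Spark's `MulticlassClassificationEvaluator` with the metric names `weightedPrecision`, `weightedRecall` and `f1`. The sketch below works through that definition in plain Python on a hypothetical confusion matrix; the counts are made up and are not the project's results, and the function name `weighted_metrics` is chosen for this example.

```python
def weighted_metrics(confusion):
    """Compute support-weighted precision, recall, F1 and plain accuracy.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])  # samples whose true class is c
        predicted = sum(confusion[i][c] for i in range(n_classes))
        tp = confusion[c][c]
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total  # each class weighted by its true frequency
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": correct / total,
    }


# Hypothetical counts: rows are the true class (0 = stayed, 1 = churned).
example = [[80, 5],
           [9, 6]]

metrics = weighted_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this weighted definition, recall equals accuracy by construction, which matches the tiny-dataset table. The large-dataset table reports slightly different values for the two, so it may have been computed differently.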

jiewwantan/Churn_prediction_Sparkify
