This repository contains the Capstone Project (Spark project 'Sparkify') of the Udacity Data Science Nanodegree course.
The Capstone Project is documented in this Medium Blog Post: https://medium.com/@t.mw/predicting-churn-with-apache-spark-and-pyspark-ml-429c3ad79670
There is nothing to install; just run the Jupyter notebook Sparkify.ipynb.
The notebook imports and uses the following libraries:
- numpy
- pandas
- pyspark.sql
- pyspark.ml
- scipy.stats
- matplotlib.pyplot
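If the notebook is run outside an environment where these libraries are already present, a quick check like the one below can confirm that they are importable before the notebook is started. This is only a minimal sketch: the helper name `missing_modules` is hypothetical and not part of the notebook.

```python
import importlib.util

# Modules the notebook imports, as listed in this README.
REQUIRED = [
    "numpy",
    "pandas",
    "pyspark.sql",
    "pyspark.ml",
    "scipy.stats",
    "matplotlib.pyplot",
]


def missing_modules(names=REQUIRED):
    """Return the subset of `names` that cannot be found for import.

    Hypothetical helper for illustration; the notebook does not define it.
    """
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if the parent package of a dotted name is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    if absent:
        print("Missing modules:", ", ".join(absent))
    else:
        print("All required modules are importable.")
```

An empty result means all listed modules were found; any names that are printed would have to be installed before the notebook can run.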
The notebook uses the mini data set 'mini_sparkify_event_data.json' (128 MB) that comes with the Udacity workspace. The data is not included in this repository; it is expected to be in the same folder as the notebook.
The Spark project was chosen as the Capstone Project because it is an opportunity to get to know Apache Spark and Big Data machine learning methods that had not been covered by the course so far.
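As a rough illustration of how the data set could be located and loaded, the sketch below first checks that the file sits next to the notebook and then reads it with PySpark. The function names `find_dataset` and `load_events`, and the application name passed to Spark, are assumptions made for this example; the notebook may load the data differently. It also assumes the file is line-delimited JSON, which is what `spark.read.json` expects by default.

```python
from pathlib import Path

# File name as given in this README.
DATA_FILE = "mini_sparkify_event_data.json"


def find_dataset(notebook_dir="."):
    """Return the path to the data set, or raise if it is not in `notebook_dir`.

    Hypothetical helper for illustration; not part of the notebook.
    """
    path = Path(notebook_dir) / DATA_FILE
    if not path.is_file():
        raise FileNotFoundError(
            f"{DATA_FILE} not found in {Path(notebook_dir).resolve()}; "
            "place it in the same folder as the notebook."
        )
    return path


def load_events(path):
    """Read the event log into a Spark DataFrame (requires pyspark)."""
    # Imported here so that the file check above works without pyspark.
    from pyspark.sql import SparkSession

    # The application name is an arbitrary choice for this sketch.
    spark = SparkSession.builder.appName("Sparkify").getOrCreate()
    return spark.read.json(str(path))


# Example usage (commented out because it needs the data file and pyspark):
# events = load_events(find_dataset("."))
# events.printSchema()
```

Checking for the file first gives a clear error message if the data set has not been copied next to the notebook.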
- Jupyter Notebook: Sparkify.ipynb
- Readme
- License
- Provide the data set 'mini_sparkify_event_data.json' in the same folder as the notebook.
- Run the notebook and step through its cells.
This repository and the notebook are licensed under the GNU General Public License v3.0.