Udacity_Sparkify

Udacity Data Science Nanodegree capstone project

Problem Statement

The aim of the project is to build a machine learning model to predict whether customers of 'Sparkify' will 'churn', i.e. cancel their account. Sparkify is a fictional music streaming platform, similar to Spotify. The raw data consists of web hits to the Sparkify platform, including fields such as page name, timestamp and userId.

The dataset has been provided by Udacity for use in the capstone project of their Data Science Nanodegree. The full dataset is very large and would require powerful server hardware for the EDA and model building. For the purposes of this project a small sample dataset has been used so that everything can run on lower-spec hardware. However, the aim is a methodology that could scale to the full dataset, so a key principle has been to rely on PySpark rather than pandas and scikit-learn.
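For reference, loading the event log looks roughly like this. This is a minimal sketch, assuming the sample file sits alongside the notebook; the exact session configuration used in Sparkify.ipynb may differ:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Each line of the JSON file is one web hit (page, timestamp, userId, ...)
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()
print("events:", df.count())
```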

There is an accompanying blog post discussing the project on Medium: https://medium.com/@nealedenton_87598/predicting-customer-churn-ee3b3a0bb370

Files Required

  • Sparkify.ipynb : The notebook with the analysis
  • README.md: This file
  • mini_sparkify_event_data.json : (NOT INCLUDED) this can be obtained from Udacity if you enroll in the program

Dependencies

  • Apache Spark 2.4.3
  • Jupyter Notebook environment

Python Packages

  • pyspark.sql
  • pyspark.ml
  • sklearn.metrics
  • pandas
  • numpy
  • matplotlib.pyplot
  • seaborn
  • datetime

Analysis

  1. Import the necessary libraries
  2. Load the sample dataset
  3. Perform initial Exploratory Data Analysis, including:
    1. Identifying the label the model should predict (a visit to the 'Cancellation Confirmation' page)
    2. Cleaning the data by removing rows with a missing userId
  4. Feature engineering: aggregate by userId and create relevant features (see the first sketch below)
  5. Modelling (see the second sketch below):
    1. Split the data into training and test sets
    2. Logistic Regression, with grid search and cross-validation to tune parameters
    3. Decision Tree Classification, with grid search and cross-validation to tune parameters
  6. Evaluation and Conclusion
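The labelling, cleaning and aggregation in steps 3 and 4 look roughly like this. The churn definition follows the notebook ('Cancellation Confirmation' page visits), but the two aggregate features shown are illustrative stand-ins for the fuller feature set built in Sparkify.ipynb:

```python
from pyspark.sql import functions as F

# Remove hits with a missing or empty userId (e.g. logged-out sessions)
df_clean = df.filter(F.col("userId").isNotNull() & (F.col("userId") != ""))

# A user churns if they ever reach the 'Cancellation Confirmation' page
churn_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)

# Aggregate per user: the churn label plus per-user activity features
user_features = (df_clean
    .withColumn("churn_event", churn_event)
    .groupBy("userId")
    .agg(F.max("churn_event").alias("label"),              # 1 if the user cancelled
         F.count("*").alias("total_events"),               # overall activity level
         F.countDistinct("sessionId").alias("sessions")))  # distinct sessions
```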
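The modelling stage (step 5) for the Logistic Regression case, sketched with pyspark.ml. The grid values, fold count and evaluator here are assumptions for illustration, and the Decision Tree variant follows the same pattern with DecisionTreeClassifier swapped in:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# pyspark.ml expects the features packed into a single vector column
assembler = VectorAssembler(inputCols=["total_events", "sessions"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

# Hold out a test set, then grid-search the regularisation strength
# with 3-fold cross-validation on the training set
train, test = user_features.randomSplit([0.8, 0.2], seed=42)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(labelCol="label")  # areaUnderROC
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)

model = cv.fit(train)
print("test AUC:", evaluator.evaluate(model.transform(test)))
```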

References and Acknowledgements

Many of the techniques I used in this project were new to me, particularly PySpark. I was introduced to them in the extracurricular content of the Udacity Nanodegree program.

Udacity Nanodegree

The PySpark documentation was also very helpful:

PySpark 2.4.3 Docs
