Udacity - Big Data - Apache Spark

Installation

The following was used for the environment setup:

Python 3.5
PySpark SQL
PySpark ML

Project Motivation

For a fictional music streaming service (Sparkify), we are interested in seeing if we can predict which users are at risk of cancelling their service. The goal is identify these users before they leave so they can hypothetically receive discounts and incentives to stay.

File Descriptions

Sparkify.ipynb - Notebook used for initial analysis / training. Used a small subset (128MB) for this analysis. When performed on the IBM platform, the entire dataset (12GB) will be used.
README.md - The file you are reading now

Feature Engineering

As a result of using Logistc Regression, I was able to get a training accuracy of around 99.99% and a testing accuracy of around 99.98%. This was due to the following transformations to the data:

String Indexer:

Gender_indexer (inputCol="gender")
User_indexer (inputCol="userAgent")
Page_indexer (inputCol="page")
indexer (inputCol="Churn")

One Hot Encoding:

Gender_encoder (inputCol='Gender_Index')
User_encoder (inputCol='User_Index')
Page_encoder (inputCol='Page_Index')

Vector Assembler:

assembler (inputCols=["Gender_Vec", "User_Vec", "Page_Vec", "status"])

Acknowledgements

The data was taken from the following (https://s3.amazonaws.com/video.udacity-data.com/topher/2018/December/5c1d6681_medium-sparkify-event-data/medium-sparkify-event-data.json)

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitattributes		.gitattributes
Capstone.docx		Capstone.docx
IBM-Cloud_Workspace.ipynb		IBM-Cloud_Workspace.ipynb
README.md		README.md
Udacity-Spark-Workspace.ipynb		Udacity-Spark-Workspace.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Udacity - Big Data - Apache Spark

Installation

Project Motivation

File Descriptions

Feature Engineering

Acknowledgements

About

Releases

Packages

Languages

greulist137/Udacity---Big-Data---Apache-Spark

Folders and files

Latest commit

History

Repository files navigation

Udacity - Big Data - Apache Spark

Installation

Project Motivation

File Descriptions

Feature Engineering

Acknowledgements

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages