marinelin/sf_bike_share_spark

Predicting bike share availability with H2O and Spark on AWS EMR

Tools Used:

  • Spark SQL
  • Spark Machine Learning
  • H2O ML, Deep Learning and AutoML
  • AWS EMR Clusters
  • Plotly

Data Source: https://www.kaggle.com/benhamner/sf-bay-area-bike-share

Data Size: 4 GB

  • station.csv - Data about each station where users can pick up or return bikes.
  • status.csv - Data about the number of bikes and docks available at a given station for each minute.
  • trips.csv - Data about individual bike trips.
  • weather.csv - Data about the weather on a specific day for certain zip codes.

Predict the number of bikes available at a given station using:

  • station information
  • weather condition
  • type of day (weekday/weekend)
  • hour of the day
  • population of the neighborhood
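
Two of the features above, type of day and hour of the day, can be derived from a status timestamp. The following is a minimal plain-Python sketch rather than the project's Spark code; the function name `time_features`, the output keys, and the timestamp format are assumptions made for illustration, so adjust `fmt` to match the actual data.

```python
from datetime import datetime


def time_features(timestamp: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> dict:
    """Derive hour-of-day and weekday/weekend features from a timestamp string.

    The default format is an assumption; pass the format used by the data.
    """
    moment = datetime.strptime(timestamp, fmt)
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    day_type = "weekend" if moment.weekday() >= 5 else "weekday"
    return {"hour": moment.hour, "day_type": day_type}


# Illustrative timestamps only, not rows from the dataset.
saturday_morning = time_features("2014-03-01 08:15:00")
monday_evening = time_features("2014-03-03 17:45:00")
```

In a Spark pipeline the same logic would normally be expressed with built-in column functions so that it runs on the cluster instead of in a Python loop.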

Data Pipeline:

Prediction Result:

  • Among the Spark ML models, Random Forest performed best, with an RMSE of 2.7.
  • Among the H2O models, AutoML's XGBoost model achieved an RMSE of 2.38.
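
RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual bike counts, so it is expressed in bikes. The sketch below shows the calculation in plain Python; the function name `rmse` and the sample values are made up for illustration and are not the project's evaluation code or data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up example: true bike counts versus model predictions at three time points.
example_error = rmse([10, 5, 0], [8, 5, 2])
```

Read this way, an RMSE of 2.38 means the typical prediction error is a little over two bikes, with larger misses weighted more heavily than small ones.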

Run Time Comparison on different AWS EMR Clusters:

Group members: Esther Liu, Marine Lin, Akankasha, Lexie
