# Your friend just called: she is having people over at her house in Berkeley and wants you to come.
#### How do you get there? Take BART? Take Uber? Drive yourself?
- This project tries to help with the first option, taking BART, by predicting the daily capacity of each BART station.
- **Final output here http://bart-capacity-predictions.com.s3-website-us-east-1.amazonaws.com/**
- Sample below <img src ="images/sample_predictions.png">

###### Data Sources
- 1) Pull historic BART ridership information per station from http://www.bart.gov/about/reports/ridership
- 2) Pull historic weather information from https://www.wunderground.com/history/?MR=1
- 3) Ingest currently arriving BART trains every 10 minutes from http://api.bart.gov/docs/overview/index.aspx
- 4) Ingest the current weather in San Francisco every 10 minutes from http://openweathermap.org/api

###### Analysis Procedure
- 1) Pull live BART data into S3 (which acts as a data lake)
- 2) Upload the historic data to S3
- 3) Use Spark on EMR to normalize the data to third normal form and push it back to S3
- 4) Output the normalized data to MongoDB
- 5) Create a gradient boosting model from the daily normalized data to predict station ridership
- 6) Output the predictions to an S3 bucket acting as a website
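
A minimal sketch of the normalization idea in step 3, written in plain Python rather than Spark so it runs anywhere. The field names (`station_code`, `station_name`, `date`, `exits`) and all sample values are assumptions made up for illustration; they are not the project's actual schema or data. The sketch moves attributes that depend only on the station into their own table, which is one step toward third normal form.

```python
# Hypothetical flat rows: the station name repeats on every ridership record.
RAW_ROWS = [
    {"station_code": "DBRK", "station_name": "Downtown Berkeley", "date": "2016-10-01", "exits": 100},
    {"station_code": "DBRK", "station_name": "Downtown Berkeley", "date": "2016-10-02", "exits": 120},
    {"station_code": "EMBR", "station_name": "Embarcadero", "date": "2016-10-01", "exits": 300},
]


def normalize(rows):
    """Split flat rows into a station table and a ridership table."""
    stations = {}   # keyed by station_code, stored once per station
    ridership = []  # one row per (station_code, date)
    for row in rows:
        stations[row["station_code"]] = {"station_name": row["station_name"]}
        ridership.append(
            {
                "station_code": row["station_code"],
                "date": row["date"],
                "exits": row["exits"],
            }
        )
    return stations, ridership


stations, ridership = normalize(RAW_ROWS)
print(stations)
print(ridership)
```

After the split, renaming a station means updating one row in the station table instead of every ridership record.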


##### Overview of scalability and robustness of the system
Real-time BART arrival, capacity, and forecast



> *Robustness and fault tolerance*
- Spark is very fault tolerant thanks to the directed acyclic graph structure that it employs to handle nodes going down during computation. This graph keeps a record of each transformation made to the underlying dataset, so these transformations can be re-created if need be.
- For version 2.0, additional EC2 instances can be spun up for the webserver in case our server can't handle excess web traffic.
- Separation of concerns, by having different EC2 instances, ensures that there is no single point of failure.
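
The lineage idea in the first bullet can be illustrated with a toy class. This is a sketch only, not Spark's implementation or API: `LineageDataset` and its methods are hypothetical names invented here. The dataset stores its source plus the list of transformations, so a lost result can always be rebuilt by replaying them.

```python
class LineageDataset:
    """Toy dataset that records transformations instead of storing results."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)

    def map(self, fn):
        # Return a new dataset with one more recorded step; nothing is computed yet.
        return LineageDataset(self._source, self._steps + (fn,))

    def compute(self):
        # Replay every recorded step against the original source.
        data = list(self._source)
        for fn in self._steps:
            data = [fn(x) for x in data]
        return data


dataset = LineageDataset([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
first_result = dataset.compute()
# Pretend the node holding first_result went down; replay the lineage to rebuild it.
rebuilt_result = dataset.compute()
print(first_result, rebuilt_result)
```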

> *Low latency reads and updates*
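- Spark keeps intermediate data in memory, which keeps processing latency low.
- Amazon Kinesis can stream data toward S3 with low latency; the AWS documentation recommends a polling interval of about one second. http://docs.aws.amazon.com/streams/latest/dev/kinesis-low-latency.html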
- For the next version, Spark Streaming can be implemented which can have latency as low as 


> *Scalability*
- Amazon Kinesis is great for data scalability, as it can handle a large inflow of data that varies over time. Setting up Kinesis requires two parameters:
    - 1) How much time to allow between writes to S3
    - 2) How large each file written to S3 should be
- Once these parameters are set, a new file is written to S3 whenever either threshold is hit.
- Kinesis is also very useful for maintaining a file directory in S3, as it automatically creates a year, month, day, and hour folder structure.
- Checks such as autoscaling (http://docs.aws.amazon.com/autoscaling/latest/userguide/WhatIsAutoScaling.html) can ensure that the EC2 instance hosting the website does not go down: when additional traffic comes to the website, additional servers can spin up to handle the load. I currently have alarms set up, and autoscaling will be implemented in version 2.0.
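
The two-threshold behavior described above can be sketched in plain Python. This is an illustration under assumptions, not the Kinesis API: `BufferedWriter`, `hour_prefix`, and the `batch-N` file names are hypothetical, and the `flushed` list stands in for files written to S3. Unlike the real service, this toy only checks its thresholds when a record arrives.

```python
from datetime import datetime, timezone


def hour_prefix(epoch_seconds):
    """Build a year/month/day/hour folder prefix (UTC) for a timestamp."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y/%m/%d/%H/")


class BufferedWriter:
    """Buffer records and flush when either the time or size threshold is hit."""

    def __init__(self, max_seconds, max_bytes):
        self.max_seconds = max_seconds
        self.max_bytes = max_bytes
        self.records = []
        self.size = 0
        self.opened_at = None
        self.flushed = []  # (key, payload) pairs standing in for S3 objects

    def add(self, record, now):
        if self.opened_at is None:
            self.opened_at = now  # the buffer window starts with its first record
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.flush(now)

    def flush(self, now):
        if not self.records:
            return
        key = hour_prefix(now) + "batch-{}".format(len(self.flushed))
        self.flushed.append((key, b"".join(self.records)))
        self.records, self.size, self.opened_at = [], 0, None


writer = BufferedWriter(max_seconds=60, max_bytes=10)
writer.add(b"12345", now=0.0)    # below both thresholds, stays buffered
writer.add(b"67890", now=1.0)    # size threshold reached, flushes
writer.add(b"a", now=100.0)      # starts a new buffer window
writer.add(b"b", now=161.0)      # time threshold reached, flushes
print(writer.flushed)
```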

> *Generalization*
- Right now, this system has been optimized to predict ridership per station per day.
- However, with the data that we have stored, the following questions can be asked and answered fairly easily:
    - Can the weather be predicted from yesterday's ridership per station, along with the day of the month and the month number?
    - Which stations on average are the most crowded?
    - How many train stops are made per day across BART?
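
As a rough illustration of the second question, the sketch below ranks stations by average daily ridership in plain Python. The `(station, riders)` record shape, the function name, and the sample numbers are assumptions for illustration only; the project's stored data may be shaped differently.

```python
from collections import defaultdict


def average_daily_ridership(records):
    """Return (station, average riders per day) pairs, busiest first."""
    totals = defaultdict(int)
    days = defaultdict(int)
    for station, riders in records:
        totals[station] += riders
        days[station] += 1
    averages = {station: totals[station] / days[station] for station in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up daily counts, one tuple per station per day.
sample = [("A", 10), ("A", 20), ("B", 5)]
ranking = average_daily_ridership(sample)
print(ranking)
```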


> *Extensibility*
- This system can run on any EC2 instance. In addition, a Vagrantfile
can be used to keep the configuration ready to spin up another server.

> *Ad hoc queries*
- With all of the data stored in MongoDB, a user can write custom queries
to ask different questions of the historical data than are being asked currently. In version 2.0, this database will be exposed through a Flask app to allow queries from the web instead of logging into an Ubuntu server. (http://blog.dwyer.co.za/2013/10/a-basic-web-app-using-flask-and-mongodb.html)

> *Minimal Maintenance*
- With Airflow, emails can be sent when a task fails to complete.
- In addition, Amazon allows you to set up alerts in case your server experiences any errors.
<img src ="images/mongo_db_error.png">


> *Debuggability*
- Data structures are immutable in this system. Therefore, it is easier
to debug when things go wrong.
- Airflow keeps all log files from the tasks run in a centralized location which makes it easier to parse through.


### Below is an overview of the architecture, as well as some technologies used in this project

# Architecture of the system
<img src="images/data_architecture.png">

# Overview of Airflow
<img src="images/airflow_overview.png">
##### Airflow comes with an awesome UI for understanding what tasks are running on your server
- The main page here shows which tasks have succeeded, are running, or have failed. For tasks that have failed, Airflow will send you an email outlining the error encountered.

<img src ="images/airflow_dependencies.png">
#### The dependencies show you the directed acyclic graph of your tasks.
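
To make the idea of a task DAG concrete, here is a small sketch using Python's standard `graphlib` module (Python 3.9+) rather than Airflow itself. The task names and the exact dependencies between them are assumptions loosely based on the analysis procedure above, not the project's real DAG definition.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "normalize_with_spark": {"pull_live_bart_data", "upload_historic_data"},
    "load_into_mongodb": {"normalize_with_spark"},
    "train_gradient_boost_model": {"normalize_with_spark"},
    "publish_predictions": {"train_gradient_boost_model"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

A scheduler can run tasks in this order, or run tasks with no mutual dependency in parallel; a cycle in the graph would raise `graphlib.CycleError` instead of producing an order.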

# Schema in Third Normal Form
<img src ="images/final_schema.png">
