
Analyzing the Chicago taxi trips dataset using Spark Streaming, with a real-time reporting dashboard built using Flask.


jacobceles/ChicagoTaxiTrips-SparkStreaming-RealTimeDashboard


Chicago Taxi Trips Streaming Analysis with Real Time Dashboard

In this project, we analyze the well-known Chicago taxi trips dataset using Spark Structured Streaming and report the resulting metrics on a dashboard built with Flask and HTML. The dataset is open for public use and covers trips from 2013 onward, amounting to about 200 million rows, with each record around 1024 bytes in size.

Features

Here are some features of this implementation:

  • Retrieves data from a folder and processes it in a streaming fashion.
  • A full-fledged dashboard built using Flask and Chart.js to visualize the metrics.
  • Uses Spark to support big data analysis; tested on more than 100 million records.
  • Uses Structured Streaming and can therefore be ported easily to similar use cases.
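To make the folder-based streaming concrete, here is a minimal sketch of a Structured Streaming file source. It is not the code in streaming.py: the column names in `TRIP_SCHEMA_DDL`, the function names, and the `maxFilesPerTrigger` value are assumptions chosen for illustration, and the schema must be adjusted to match the actual order and format of the columns in the source files.

```python
# Hypothetical subset of columns; adjust names, order and types to the real files.
TRIP_SCHEMA_DDL = (
    "trip_id STRING, taxi_id STRING, "
    "trip_start TIMESTAMP, trip_end TIMESTAMP, "
    "trip_seconds INT, trip_miles DOUBLE, "
    "fare DOUBLE, tips DOUBLE, payment_type STRING, company STRING"
)


def read_trip_stream(spark, source_dir="source"):
    """Return a streaming DataFrame over the CSV files that land in source_dir."""
    return (
        spark.readStream
        .schema(TRIP_SCHEMA_DDL)          # streaming file sources need an explicit schema
        .option("header", True)           # treat the first row of each file as a header
        .option("maxFilesPerTrigger", 1)  # process at most one new file per micro-batch
        # Set the "timestampFormat" option here if the files use a non-ISO format.
        .csv(source_dir)
    )


def start_console_query(spark, source_dir="source"):
    """Print each micro-batch to the console; handy for a first smoke test."""
    trips = read_trip_stream(spark, source_dir)
    return trips.writeStream.format("console").outputMode("append").start()
```

To try it, create a session with `SparkSession.builder.appName("taxi-trips-sketch").getOrCreate()`, pass it to `start_console_query`, and call `awaitTermination()` on the returned query. Dropping another CSV file into the folder should then trigger a new micro-batch.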

Problem Statements

  1. How has the rate of tipping changed over the years?
  2. Which days are the most popular for taxi trips?
  3. How has the mode of payment changed over the years?
  4. What is the total distance travelled, in miles, each year?
  5. What is the total time travelled each year?
  6. Which company makes the most trips each year?
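As an illustration of how one of these questions could translate into code, the sketch below computes a yearly tipping rate. The README does not define "rate of tipping", so the definition used here (the share of trips in a year with a tip greater than zero) is an assumption, as are the column names `trip_start` and `tips`. The plain-Python function pins down the metric on a small sample; the second function expresses the same aggregation on a Spark DataFrame.

```python
from collections import defaultdict
from datetime import datetime


def tipping_rate_by_year(rows):
    """rows: iterable of (trip_start, tips) pairs. Returns {year: share of tipped trips}."""
    tipped = defaultdict(int)
    total = defaultdict(int)
    for trip_start, tips in rows:
        total[trip_start.year] += 1
        if tips > 0:
            tipped[trip_start.year] += 1
    return {year: tipped[year] / total[year] for year in total}


def tipping_rate_by_year_spark(trips):
    """Same metric on a (streaming or batch) DataFrame with trip_start and tips columns."""
    from pyspark.sql import functions as F  # imported lazily so the helper above has no Spark dependency

    return (
        trips
        .groupBy(F.year("trip_start").alias("year"))
        # Averaging a 0/1 flag gives the fraction of trips that included a tip.
        .agg(F.avg((F.col("tips") > 0).cast("int")).alias("tipping_rate"))
    )


sample = [
    (datetime(2013, 1, 5), 2.0),
    (datetime(2013, 3, 1), 0.0),
    (datetime(2014, 7, 4), 1.5),
]
print(tipping_rate_by_year(sample))  # {2013: 0.5, 2014: 1.0}
```

When the Spark version runs on a streaming DataFrame without a watermark, the aggregated result has to be written with the `complete` or `update` output mode, since `append` is not supported for such aggregations.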

Implementation

High Level Design

Folder Structure

├── Documents                                   # Holds info about the project
├── dashboard                                   # Code for the Flask application
│   ├── static                                      # Holds static files associated with the dashboard
│   │   ├── css                                         # Holds styling files for the dashboard
│   │   └── js                                          # Holds javascript files used for the dashboard
│   ├── templates                                   # Holds the HTML template used for the dashboard UI
│   └── app.py                                      # Flask application which defines and triggers the endpoints; entry point for the dashboard
├── source                                      # Source folder for the streaming application
│   └── 1.csv                                       # A sample source file
├── README.md                                   # Read this first
└── streaming.py                                # Streaming application which reads the files and executes operations in PySpark using Structured Streaming
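The README does not describe how streaming.py hands its results to the dashboard, so the following is only one possible shape for app.py, sketched under stated assumptions. The endpoint paths `/update/<name>` and `/data`, the `MetricsStore` class, and the `create_app` factory are all hypothetical names. The idea is that the streaming job posts each metric as JSON and the Chart.js front end polls for the latest values.

```python
import threading


class MetricsStore:
    """Thread-safe holder for the most recent value of each metric (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._metrics = {}

    def update(self, name, payload):
        with self._lock:
            self._metrics[name] = payload

    def snapshot(self):
        # Return a shallow copy so callers cannot mutate the stored mapping.
        with self._lock:
            return dict(self._metrics)


def create_app(store):
    """Build a Flask app exposing the store; the endpoint names are assumptions."""
    from flask import Flask, jsonify, request  # imported lazily; requires Flask to be installed

    app = Flask(__name__)

    @app.route("/update/<name>", methods=["POST"])
    def update(name):
        # The streaming job would POST a JSON body with the latest metric values.
        store.update(name, request.get_json(force=True))
        return jsonify(status="ok")

    @app.route("/data")
    def data():
        # The dashboard's JavaScript would poll this endpoint to refresh its charts.
        return jsonify(store.snapshot())

    return app
```

Running `create_app(MetricsStore()).run()` would start a development server. Keeping the store separate from the routes makes the state handling testable without Flask.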

Output

A full demo, including a code walkthrough and a live demonstration, can be found here (redirects to YouTube).

Dashboard Screenshot


Contributors