Analyzing millions of NYC Parking Violations

STA9760 - Big Data Technologies

The purpose of this project is to conduct data visualization analysis to understand Parking and Camera Violations in NYC. We will go through containerization, terminal navigation, Python scripting, artifact deployment, and AWS EC2 provisioning.

NYC Open Data

Dataset

The Open Parking and Camera Violations dataset was initially loaded with all violations contained in the mainframe database as of May 2016 (initial dataset loaded 05/14/2016). New or open violations are updated weekly (every Sunday). (Source: NYC Open Data)

Install & Quickstart

You will need to install Docker and Docker Compose, register a Docker Hub account, create an AWS account, set up an AWS EC2 instance, and create an App Token from NYC Open Data.

Overview

The project is broken into three parts.

  1. Part1: Python Scripting (Docker & AWS EC2)
  2. Part2: Loading into ElasticSearch
  3. Part3: Visualizing and Analysis on Kibana

Part1: Python Scripting

Develop a Python command-line interface that connects to the OPCV API and demonstrates that the data is accessible via Python.
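
As a rough illustration, the API access might look like the snippet below. This is a minimal sketch only: the Socrata dataset id (nc67-uf89), the APP_KEY environment variable name, and the query parameters are assumptions here, not taken verbatim from the repo.

     import os
     import requests

     # Assumed Socrata endpoint for the Open Parking and Camera Violations dataset.
     API_URL = "https://data.cityofnewyork.us/resource/nc67-uf89.json"

     def fetch_page(page_size, offset):
         """Request one page of violation records from the OPCV API."""
         response = requests.get(
             API_URL,
             params={"$limit": page_size, "$offset": offset},
             headers={"X-App-Token": os.environ["APP_KEY"]},
         )
         response.raise_for_status()
         return response.json()

     if __name__ == "__main__":
         records = fetch_page(page_size=10, offset=0)
         print(f"Fetched {len(records)} records")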

Build an image from the Dockerfile and push it to Docker Hub. Lastly, run the image on an AWS EC2 server.

Docker Architecture (credit: https://devopedia.org/docker)

Using the Dockerfile & Running on AWS EC2

  1. Build an image from the Dockerfile on your computer:

     $ docker build -t image_name . 
    
  2. Create a container and run the Python script locally:

     $ docker run -v $(pwd):/app -e APP_KEY=API_Token -t image_name python main.py --page_size=1000 --num_pages=4 --output=results.json
    

Key Arguments

  • --page_size: This command line argument is required. It specifies how many records to request from the API per call.
  • --num_pages: This command line argument is optional. If not provided, your script should continue requesting data until the entire dataset has been exhausted. If it is provided, query for data num_pages times.
  • --output: This command line argument is optional. If not provided, your script should simply print results to stdout. If provided, your script should write the data to the named file (in this case, results.json). A sketch of how these flags might be wired up is shown below.
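
For illustration, the flag behavior described above could be implemented roughly as follows. This is a sketch only; fetch_page is the hypothetical helper from the snippet above and the module name opcv_client is invented for this example, not taken from the repo.

     import argparse
     import json
     import sys

     from opcv_client import fetch_page  # hypothetical module holding the fetch_page sketch above

     def parse_args():
         parser = argparse.ArgumentParser(description="Download OPCV records")
         parser.add_argument("--page_size", type=int, required=True,
                             help="records to request from the API per call")
         parser.add_argument("--num_pages", type=int, default=None,
                             help="pages to fetch; omit to keep going until the data is exhausted")
         parser.add_argument("--output", default=None,
                             help="file to write results to; omit to print to stdout")
         return parser.parse_args()

     def run():
         args = parse_args()
         out = open(args.output, "w") if args.output else sys.stdout
         page = 0
         while args.num_pages is None or page < args.num_pages:
             records = fetch_page(args.page_size, offset=page * args.page_size)
             if not records:  # an empty page means the dataset is exhausted
                 break
             for record in records:
                 out.write(json.dumps(record) + "\n")
             page += 1
         if args.output:
             out.close()

     if __name__ == "__main__":
         run()
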
  3. Deploying via Docker Hub:

     $ docker push your_username/image_name
    
  4. Pulling the image from Docker Hub and running it on AWS EC2

     ~$ sudo docker pull your_username/image_name
     ~$ sudo docker run -it your_username/image_name /bin/bash
     ~$ sudo docker run -e APP_KEY=API_token -it your_username/image_name python main.py --page_size=1000 --num_pages=4 --output=results.json
    

Part2: Loading into ElasticSearch

In this part, you will leverage docker-compose to bring up a service that encapsulates your bigdata1 container and an Elasticsearch container, and ensures that they can interact.

You must update your original script (from Part_1) so that it not only downloads the data but also loads it into the Elasticsearch instance.

Updating the Script:

Under the Part_2/bigdata1 folder:

  1. create docker-compose.yml - include three images/services (python, elasticsearch, kibana) so they can interact with your container
  2. requirements.txt - add elasticsearch
  3. main.py - add a 'push_elastic' argument and pass it to the call_api function (see the sketch after this list)
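
Item 3 might look roughly like this in main.py. It is a sketch only; the import path and call_api's keyword arguments are assumptions, not the repo's actual signature.

     import argparse

     from bigdata1.callapi import call_api  # hypothetical import path

     def str_to_bool(value):
         """Make --push_elastic=True / --push_elastic=False behave as expected."""
         return str(value).lower() in ("true", "1", "yes")

     parser = argparse.ArgumentParser()
     parser.add_argument("--page_size", type=int, required=True)
     parser.add_argument("--num_pages", type=int, default=None)
     parser.add_argument("--output", default=None)
     parser.add_argument("--push_elastic", type=str_to_bool, default=False)
     args = parser.parse_args()

     call_api(page_size=args.page_size, num_pages=args.num_pages,
              output=args.output, push_elastic=args.push_elastic)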

Under the Part2/bigdata1/src/bigdata1 folder:

  • callapi.py - add:
    • import Elasticsearch, plus datetime and date
    • a create_update_index() function - creates an elasticsearch index to store the data
    • a format_push_data() function - formats the Python data types and pushes each record to elasticsearch
    • an updated call_api function - it now takes the new argument from the interface (main.py); a sketch of the two helpers is shown below
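
A minimal sketch of the two helpers, assuming the elasticsearch 7.x Python client, an index named opcv, a docker-compose service host called elasticsearch, and the field names used below (all assumptions, not taken from the repo):

     from datetime import datetime
     from elasticsearch import Elasticsearch

     # Assumed host name: the elasticsearch service defined in docker-compose.yml.
     es = Elasticsearch(hosts=["http://elasticsearch:9200"])

     def create_update_index(index_name="opcv"):
         """Create the Elasticsearch index if it does not already exist."""
         if not es.indices.exists(index=index_name):
             es.indices.create(index=index_name)

     def format_push_data(record, index_name="opcv"):
         """Coerce string fields to numeric/date types, then index one record."""
         doc = dict(record)
         for field in ("fine_amount", "penalty_amount", "reduction_amount", "payment_amount", "amount_due"):
             if field in doc:
                 doc[field] = float(doc[field])
         if "issue_date" in doc:
             # Assumed MM/DD/YYYY format for the issue_date field.
             doc["issue_date"] = datetime.strptime(doc["issue_date"], "%m/%d/%Y").date().isoformat()
         es.index(index=index_name, id=doc.get("summons_number"), body=doc)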

Use Docker-Compose to Run ElasticSearch and Kibana Locally

  1. Build the Python service image (Elasticsearch and Kibana run from prebuilt images, so only the pyth service needs a build):

     $ docker-compose build pyth
    
  2. Run ElasticSearch and Kibana (Run Services):

     $ docker-compose up -d
    
  3. Interact via the command-line interface:

     $ docker-compose run -e APP_KEY=API_token -v ${PWD}:/app pyth python -m main --page_size=100 --num_pages=10 --output=results.json --push_elastic=True
    
  4. Query ElasticSearch

     $ curl -o output.txt "http://localhost:9200/opcv/_search?q=state:NY&size=10"
    
  • Returns 10 records where state is NY (the URL is quoted so the shell does not treat & as a background operator)
  • output.txt will be generated under Part_2/bigdata1
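
The same query can also be run through the Python client, for example from a notebook or script (a sketch, assuming the index is named opcv and the elasticsearch 7.x client is installed):

     from elasticsearch import Elasticsearch

     es = Elasticsearch(hosts=["http://localhost:9200"])
     # q and size mirror the curl query string above.
     result = es.search(index="opcv", q="state:NY", size=10)
     for hit in result["hits"]["hits"]:
         print(hit["_source"].get("summons_number"), hit["_source"].get("state"))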

You have successfully pushed to ElasticSearch

Part3: Visualizing and Analysis on Kibana

You can now visualize and analyze the data in Kibana by accessing http://localhost:5601/app/kibana

Then go to the Management tab, define your index pattern using the indices, and set the Time Filter field to issue_date.

Discover Tab to See Your Data


Visualize Tab to Create Visualizations

Here are some data visualization examples:

Horizontal Bar Chart - Average Reduction Amount by County

Pie Chart - Top 10 Violation Types

Bar Chart - Number of Violations by County

Line Chart - Number of Violations per Month

Dashboard Tab to Combine Your Data Visualizations

