# Scraping Images from traffic cam URLs to do ML

A Node.js script that scrapes traffic cam images from Exelis’ Helios Weather Platform's open data API and streams them to an AWS S3 bucket.

A weather forecast is linked to each image using datapoint-js and piped to a DynamoDB table.
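To make the flow concrete, here is a minimal sketch of the scrape-and-store step, assuming the aws-sdk v2 and request modules; `scrapeOne`, `imageUrl`, and `forecast` are hypothetical names, not the repo's actual code:

```js
// Hypothetical sketch (not the repo's actual code): fetch one camera image,
// stream it to S3, then store its metadata in DynamoDB.
const AWS = require('aws-sdk');
const request = require('request');

AWS.config.update({ region: process.env.REGION });
const s3 = new AWS.S3();
const db = new AWS.DynamoDB.DocumentClient();

function scrapeOne(cameraId, imageUrl, forecast) {
  const scrapeTime = new Date().toISOString();
  return s3.upload({
    Bucket: process.env.BUCKET,
    Key: `${cameraId}_${scrapeTime}.jpg`,   // naming pattern described below
    Body: request(imageUrl),                // pipe the HTTP response straight to S3
  }).promise()
    .then(() => db.put({
      TableName: process.env.TABLE,
      Item: { cameraId, scrapeTime, forecast }, // forecast comes from datapoint-js
    }).promise());
}
```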

To process the images and do some machine learning, I use TensorFlow and a Jupyter notebook.

Have a look at the CADL project! I found it very helpful for getting started with ML in Python.

## Keys

Disclaimer: the following environment variables need to be set up:

- AWS API keys
- HELIOS API keys
- DATAPOINT API keys

So you need an account on each of those platforms.

Personally, I put all those keys in a credentials.sh file with other recurring data:

```bash
#!/usr/bin/env bash

export HELIOS_API_ID=
export HELIOS_API_SECRET=

export DATAPOINT_KEY=

export REGION=
export AWS_ACCESS_KEY=
export AWS_SECRET_ACCESS_KEY=
export BUCKET=
export TABLE=

export CSV_FILE=
```

Obviously you have to fill this file with your own keys.

Don't forget to source it before running the main code (src/index.js):

```bash
. ./credentials.sh
```
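Optionally, the script could fail fast when a variable is missing. This guard is my own suggestion, not code from the repo:

```js
// Suggested guard (not in the repo): fail fast if a required variable is missing.
['HELIOS_API_ID', 'HELIOS_API_SECRET', 'DATAPOINT_KEY', 'REGION',
 'AWS_ACCESS_KEY', 'AWS_SECRET_ACCESS_KEY', 'BUCKET', 'TABLE', 'CSV_FILE']
  .forEach((name) => {
    if (!process.env[name]) throw new Error(`Missing environment variable: ${name}`);
  });
```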

## Installation

You need to install the modules listed in package.json.

Just type the following command in your terminal:

```bash
npm install
```

For more information on npm install, see the npm documentation.

## Set up before running index.js

### credentials.sh

To do it my way, create and fill in the entire credentials.sh file, including:

- The region of your AWS server
- The name of the S3 bucket where you want the images to be stored
- The name of the DynamoDB table where you want the weather data to be put
- The name of the CSV file you are currently working with

There are two CSV files: webcams.csv and webcams1.csv.

The first one contains the URL, latitude, and longitude of every traffic camera.

The second one contains only one camera.

Both of them have the same header.

Again, don't forget to set up your environment variables.
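To see how the camera list might be read, here is a hedged sketch using the csv-parser module; the column names and the resources/ path prefix are my assumptions, so check the actual header of webcams.csv:

```js
// Hedged sketch: load the camera list from the CSV named in $CSV_FILE.
// The column names (url, latitude, longitude) and the resources/ prefix
// are assumptions, not taken from the repo.
const fs = require('fs');
const csv = require('csv-parser');

const cameras = [];
fs.createReadStream(`resources/${process.env.CSV_FILE}`)
  .pipe(csv())
  .on('data', (row) => cameras.push(row))   // row is an object keyed by the header
  .on('end', () => console.log(`Loaded ${cameras.length} cameras`));
```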

**ATTENTION:**

Scraping on your own machine can take a very long time with the whole cams file. You might want to split it. If you do, make sure to save the new file as a CSV in the resources folder.

### AWS

You have to create your own bucket and your own table on AWS.

For the DynamoDB table, I chose to use the camera ID as the partition key. I also added a sort key: the time the image was scraped. That way, every item is unique.

The names of the images in S3 follow this pattern: `cameraId_scrapeTime.jpg`.
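As a one-off setup sketch (my own illustration, not the repo's code), a table matching that key schema could be created like this with aws-sdk v2; the throughput values are placeholders:

```js
// One-off setup sketch (my illustration, not the repo's code): create a table
// with cameraId as the partition key and scrapeTime as the sort key.
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB({ region: process.env.REGION });

dynamodb.createTable({
  TableName: process.env.TABLE,
  AttributeDefinitions: [
    { AttributeName: 'cameraId', AttributeType: 'S' },
    { AttributeName: 'scrapeTime', AttributeType: 'S' },
  ],
  KeySchema: [
    { AttributeName: 'cameraId', KeyType: 'HASH' },    // partition key
    { AttributeName: 'scrapeTime', KeyType: 'RANGE' }, // sort key
  ],
  ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 }, // placeholders
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Created table', data.TableDescription.TableName);
});
```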

### index.js

If you want to set up a cron job, uncomment lines 270 to 273 (and comment out line 268). With the cron module, the six-field pattern `'* * * * * *'` fires every second; see its documentation for other time settings:

```js
//        go());

        new CronJob('* * * * * *', function() {
            go()
        }, null, true, 'Europe/London')
    );
```

## How to run index.js

It uses Node.js, so run it from the main directory with this command:

```bash
node src/index.js
```

## ML thanks to CADL

The session-1-traffic-cams.ipynb file is my own (not entirely completed!) version of the session-1.ipynb file from the CADL project.

For this ML tutorial, you will need a set of 100 traffic cam images. A dataset of images is already prepared in the resources/Images folder.

The other sessions of the CADL project are the next steps for this ML project.

## To go further...

### Lambda function

You can avoid using your own machine for the cron job by using the AWS Lambda service. That would fix the CSV file size problem and allow you to scrape anytime, anywhere.

You can use a command line tool called kappa; it should make deploying, updating, and testing your Lambda functions easier.

### Get data from AWS

The get_data.ipynb file lets you check the data you've stored on AWS.

It uses boto3.

Choose an existing cameraId and an existing scrapeTime from your DB. The image and the metadata should be displayed in your notebook.
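The notebook itself does this in Python with boto3; to keep this README's examples in one language, here is the same flow sketched in Node.js with aws-sdk v2. The cameraId and scrapeTime values are placeholders for items that actually exist in your table:

```js
// Same flow as get_data.ipynb, sketched in Node.js with aws-sdk v2.
// cameraId and scrapeTime are placeholders for values in your table.
const AWS = require('aws-sdk');
AWS.config.update({ region: process.env.REGION });
const db = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

const cameraId = 'some-camera-id';          // an existing partition key
const scrapeTime = '2016-08-18T12:00:00Z';  // an existing sort key

db.get({ TableName: process.env.TABLE, Key: { cameraId, scrapeTime } }).promise()
  .then((res) => {
    console.log('Metadata:', res.Item);
    return s3.getObject({
      Bucket: process.env.BUCKET,
      Key: `${cameraId}_${scrapeTime}.jpg`,
    }).promise();
  })
  .then((obj) => console.log('Image:', obj.Body.length, 'bytes'));
```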

### Scrape the URLs

As you've probably noticed, to scrape the traffic cam images I first had to scrape the traffic cam URLs to create the webcams.csv file.

To do so, have a look at the src/scrapingURL folder.

Keys need to be set up in the .js file. You can open the .html file in your browser (I use Firefox). You will see an empty cell asking you "Go?". Wait for the scrape to be over, then type "YES" to download the CSV file. It will be downloaded with a random name.