In [1]:
import requests
from bs4 import BeautifulSoup, Comment

import json
from geojson import Feature, FeatureCollection, Point, dump
import pandas as pd

# SELFIE_NYC --- Scraping the NYC DOT Website

The purpose of this project was to insert myself into the NYC DOT live traffic camera "database" by creating an app that would allow you to take a traffic cam "selfie" and save it to your phone.

All data in this file was scraped from the NYC DOT website at https://webcams.nyctmc.org/.

Since different aspects of the cameras are split up in different parts of the website, I scraped a few urls in order to piece together the final dataset. The urls I used for scraping are saved as variables below:

In [2]:
# url variables

# webpage with the list of all cameras
alllist_url = 'https://webcams.nyctmc.org/multiview2.php'
# the webpage containing json data where the map data is stored for all cameras
map_url = 'https://webcams.nyctmc.org/new-data.php?query='
# the partial link from the above page to the individual camera page containing the live feed
camlist_url = 'https://webcams.nyctmc.org/multiview2.php?listcam='

### 1. Getting All Active Camera Names and Numbers

In this section, I am scraping the DOT website to get the locations or "names" of each camera and their associated number. This number will later be used to access the individual url that hosts the feed for that camera.

The webpage I'm scraping in this section shows a table where users may view the feed for one or multiple cameras by checking the box next to the camera name and clicking the "View" button.

In [3]:
# scraping the 'list camera' page
response = requests.get(alllist_url)
doc_all = BeautifulSoup(response.text, 'html.parser')

Scrolling through the page, there appear to be a few camera names without a checkbox next to them that instead say "Inactive". A comment found in the html mentions the total number of cameras, the number of active cameras, and the number of inactive cameras. This stats comment appears to be updated along with the actual number of available feeds.

In [4]:
# gathering all the comments from the 'list camera' page
comments = doc_all.find_all(string=lambda text: isinstance(text, Comment))

# finding and printing the 'stats' comment
for comment in comments:
    # the comment body is itself html, so parse it to reach its table cells
    text_html = BeautifulSoup(comment, 'html.parser')
    total_comment = text_html.find_all('td')
    for string in total_comment:
        print(string.text)

Stats

Total Cameras: 756Active Cameras: 699Inactive Cameras: 57


The total number of camera attributes collected from the same page below should match the number of cameras noted in the comment above. It appears that the actual number of cameras (in addition to the number of "active" feeds) changes from day to day. Unfortunately, this means that regular scraping will be necessary in order to maintain an up-to-date list of functioning camera feeds.

In [5]:
# getting all attributes (names and numbers) for all cameras listed
cam_attr = doc_all.find_all('tr', id = 'repCam__ctl0_trCam')

len(cam_attr)

756
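
The comparison between the stats comment and the scraped row count can also be made explicit. Below is a minimal sketch, assuming the stats text always follows the "Total Cameras: N / Active Cameras: N / Inactive Cameras: N" pattern printed above; `parse_stats` and `scraped_rows` are illustrative names that are not part of the original notebook.

```python
import re

# Text copied from the printed stats comment above
stats_text = "Total Cameras: 756Active Cameras: 699Inactive Cameras: 57"


def parse_stats(text):
    """Turn the stats text into a dict of integer counts keyed by label."""
    pairs = re.findall(r"(Total|Active|Inactive) Cameras:\s*(\d+)", text)
    return {label: int(count) for label, count in pairs}


stats = parse_stats(stats_text)

# Stand-in for len(cam_attr) so the sketch runs on its own
scraped_rows = 756

# The scraped table should have one row per camera counted in the comment
counts_match = stats["Total"] == scraped_rows
# Active and inactive cameras should add up to the total
parts_add_up = stats["Active"] + stats["Inactive"] == stats["Total"]

print(stats, counts_match, parts_add_up)
```

If either check fails on a later scrape, that is a hint that the page layout or the row id has changed.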

The output below shows the raw html data associated with two different cameras. In the first, the camera name as it appears on the original webpage is "1 Ave @ 110 St", and the camera number is stored in the input element's attribute value="368". The second camera, "10 Ave @ 42 St", does not have a checkbox and is therefore missing the "input" element. Instead, the cell just shows the text "Inactive".

In [6]:
cam_attr[0]

<tr bgcolor="#E6E6E7" id="repCam__ctl0_trCam">
<td align="right" class="OTopTitle" height="25" id="repCam__ctl0_lbNum" width="15">
<input id="id[]" name="id[]" onclick="check_count(this)" type="checkbox" value="368"/>
</td>
<td width="235">
<span class="OTopTitle">1 Ave @ 110 St</span>
</td>
</tr>

In [7]:
cam_attr[11]

<tr bgcolor="#E6E6E7" id="repCam__ctl0_trCam">
<td align="right" class="OTopTitle" height="25" id="repCam__ctl0_lbNum" width="15">
			Inactive																	</td>
<td width="235">
<span class="OTopTitle">10 Ave @ 42 St</span>
</td>
</tr>

Since I'm not sure when these cameras may become listed/active and therefore scrapable, I don't want to drop them entirely from the final list. Below I collect all listed camera names and id numbers in a json format. Cameras that do not have a displayed number will show up as "inactive" and will be filtered out later before the feeds are scraped.

In [8]:
# collecting all camera names and numbers into a json format
cam_list = []

for cam in cam_attr:
    item = {}
    # getting just camera name
    name = cam.find('span', class_ ='OTopTitle')
    item['name'] = name.text
    # checking for an input element to get camera number
    num = cam.find('input')
    # getting camera number
    if num is not None:
        item['num'] = num.attrs['value']
    # for cameras without feeds, put "inactive"
    else:
        item['num'] = 'inactive'
    
    cam_list.append(item)

# total number of listed cameras (including active and inactive)
len(cam_list)

756
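
Because cameras without a checkbox carry the placeholder string 'inactive', separating them from the scrapable ones is a simple comparison. The sketch below shows one way to do that split; `split_by_status` and `cam_list_sample` are hypothetical names, and the two sample records are the cameras inspected earlier.

```python
# Two records in the same shape as the entries of cam_list
cam_list_sample = [
    {'name': '1 Ave @ 110 St', 'num': '368'},
    {'name': '10 Ave @ 42 St', 'num': 'inactive'},
]


def split_by_status(cams):
    """Return (active, inactive) lists based on the 'num' placeholder."""
    active = [cam for cam in cams if cam['num'] != 'inactive']
    inactive = [cam for cam in cams if cam['num'] == 'inactive']
    return active, inactive


active_cams, inactive_cams = split_by_status(cam_list_sample)
print(len(active_cams), len(inactive_cams))
```

Keeping the inactive list around, rather than discarding it, makes it easy to see which cameras come back online between scrapes.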

In [9]:
cam_list[0]

{'name': '1 Ave @ 110 St', 'num': '368'}

In [10]:
cam_list[11]

{'name': '10 Ave @ 42 St', 'num': 'inactive'}

Above are two examples of the final dataset to be used for scraping the camera feeds. At this point, this is all the data required in order to continue on to scraping the actual images for every camera feed. However, since the NYC DOT website has also provided a map showing the locations of all cameras, I decided to scrape this as well for lat/long coordinates so that users can locate cameras on a map.

### 2. Getting the Map Coordinates for all Camera Locations

Interestingly, the map that DOT provides does not necessarily show the locations of all cameras listed on the "camera list" page. The coordinates, names, and id numbers for the cameras that are shown can be scraped from the url that is queried when the page is refreshed.

In [11]:
# scraping the json page that is requested when the map loads
json_data = requests.get(map_url).json()
markers_list = json_data['markers']

# total number of listed markers
len(markers_list)

750

The marker data in the marker json file seems to match up with the camera names and ids that were scraped above, even though some cameras are missing. This will make matching the cameras in the larger dataset with those in the smaller one much easier.

In [12]:
markers_list[0]

{'id': '368',
 'latitude': '40.79142677512476',
 'longitude': '-73.93807411193848',
 'title': 'images/camera1.png',
 'icon': 'images/camera1.png',
 'content': '1 Ave @ 110 St'}

In [13]:
cam_list[0]

{'name': '1 Ave @ 110 St', 'num': '368'}

In [14]:
# using pandas to merge the two datasets
cam_df = pd.DataFrame(cam_list, columns=['name', 'num'])
markers_df = pd.DataFrame(markers_list, columns=['id', 'latitude', 'longitude', 'title', 'icon', 'content'])

# creating a 'key' column in both dataframes
cam_df['key'] = cam_df['name'] + cam_df['num']
markers_df['key'] = markers_df['content'] + markers_df['id']

# merging the two datasets and filtering out unmatched cams
all_df = pd.merge(markers_df, cam_df, how = 'left', on='key')
filter_df = all_df.loc[all_df['id'] == all_df['num']]
filter_df = filter_df.drop(['title', 'key', 'content', 'icon', 'num'], axis = 1)

filter_df.head()

|   | id | latitude | longitude | name |
|---|---|---|---|---|
| 0 | 368 | 40.79142677512476 | -73.93807411193848 | 1 Ave @ 110 St |
| 1 | 360 | 40.80042614416932 | -73.93155097961426 | 1 Ave @ 124 St |
| 2 | 1189 | 40.731361 | -73.982486 | 1 Ave @ 14 St |
| 3 | 361 | 40.7359741672444 | -73.97828578948975 | 1 Ave @ 23 St |
| 4 | 550 | 40.74803725830298 | -73.9694881439209 | 1 Ave @ 42 St |
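
To see which records fall out of the match, the same name-plus-id key can be compared with plain Python sets. This is only a sketch on a tiny sample, not the notebook's merge: the first three records come from the outputs above, while the last marker is a made-up entry included purely to show the unmatched case. Using a (name, id) tuple as the key, rather than a concatenated string, is a design choice of this sketch that avoids two different name/id pairs joining into the same text.

```python
# Sample cameras in the shape of cam_list
cams = [
    {'name': '1 Ave @ 110 St', 'num': '368'},
    {'name': '1 Ave @ 124 St', 'num': '360'},
    {'name': '10 Ave @ 42 St', 'num': 'inactive'},
]

# Sample markers in the shape of markers_list (only the fields used here)
markers = [
    {'id': '368', 'content': '1 Ave @ 110 St'},
    {'id': '360', 'content': '1 Ave @ 124 St'},
    # Hypothetical marker, invented for illustration only
    {'id': '9999', 'content': 'Example St @ Sample Ave'},
]

cam_keys = {(cam['name'], cam['num']) for cam in cams}
marker_keys = {(marker['content'], marker['id']) for marker in markers}

matched = cam_keys & marker_keys              # present in both datasets
cams_without_marker = cam_keys - marker_keys  # listed, but not on the map
markers_without_cam = marker_keys - cam_keys  # on the map, but not listed

print(len(matched), cams_without_marker, markers_without_cam)
```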


### 3. Grabbing the Traffic Camera Still

In [15]:
# getting a list from the final pandas dataframe
id_list = filter_df['id'].tolist()

In [16]:
# this list will hold all the scraped webpages with active camera feeds
url_list = []

# scraping the individual camera webpages
for num in id_list:
    item = {}
    # going to the individual camera page
    response = requests.get(camlist_url + num)
    # grabbing the html from each page
    doc = BeautifulSoup(response.text, 'html.parser')
    item['id'] = num
    item['doc'] = doc
    
    url_list.append(item)

# final list of all html text for each camera page
len(url_list)

694
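
A loop over several hundred pages is exposed to timeouts and dropped connections. The sketch below shows one possible way to keep going when a single page fails and to pause between requests. It assumes a caller-supplied `fetch` function that returns the html for one camera id; `fetch_pages`, `fetch`, and `delay_seconds` are hypothetical names, not part of the notebook.

```python
import time


def fetch_pages(ids, fetch, delay_seconds=0.5):
    """Fetch one page per id, collecting failures instead of stopping."""
    pages, failures = [], []
    for cam_id in ids:
        try:
            pages.append({'id': cam_id, 'html': fetch(cam_id)})
        except Exception as err:  # broad on purpose: any failure is recorded
            failures.append({'id': cam_id, 'error': str(err)})
        # short pause so the server is not hit with back-to-back requests
        time.sleep(delay_seconds)
    return pages, failures


# Stand-in fetcher so the sketch runs without network access
def fake_fetch(cam_id):
    if cam_id == '360':
        raise TimeoutError('simulated timeout')
    return '<html>camera ' + cam_id + '</html>'


pages, failures = fetch_pages(['368', '360'], fake_fetch, delay_seconds=0)
print(len(pages), len(failures))
```

With the notebook's variables, `fetch` could be something like `lambda cam_id: requests.get(camlist_url + cam_id, timeout=10).text`, and the ids in `failures` could be retried in a second pass.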

The final step is to pull the url of the still image for each camera feed out of the scraped pages.

In [17]:
# looping through the list to pull the url for each camera feed
for cam in url_list:
    # searching for the image on each page where "alt" tag is the same as the camera id number
    img_tags = cam['doc'].find('img', alt = cam['id'])
    # isolating the url where the feed is stored
    img_link = img_tags.get('src')
    # adding url to the dataset
    cam['url'] = img_link
    cam['icon'] = 'assets/camera.png'

# taking a peek at the first url in the list
url_list[0]['url']

'https://jpg.nyctmc.org/cctv261.jpg'

Going to the URL above shows an isolated frame from the camera feed for the first camera in the dataset. Every time the page is refreshed, the image is updated, meaning that the feed urls will only need to be scraped again when the number of cameras being serviced changes.
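
Since each feed url returns the current frame, saving a "selfie" amounts to downloading that image at the right moment. Below is a rough sketch using only the standard library; `still_filename`, `save_still`, the `stills` folder, and the file naming scheme are all assumptions made for illustration, not the app's actual implementation.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import urlopen


def still_filename(cam_id, when):
    """Build a timestamped file name such as cam368_20200102_030405.jpg."""
    return f"cam{cam_id}_{when:%Y%m%d_%H%M%S}.jpg"


def save_still(url, cam_id, folder="stills"):
    """Download the current frame from a feed url and save it to disk."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / still_filename(cam_id, datetime.now())
    with urlopen(url, timeout=10) as response:
        target.write_bytes(response.read())
    return target


# File name for a frame captured at a fixed example time
example_name = still_filename('368', datetime(2020, 1, 2, 3, 4, 5))
print(example_name)
```

Calling `save_still('https://jpg.nyctmc.org/cctv261.jpg', '368')` would then write whatever frame the server returns at that moment, assuming the url is still live.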

In [18]:
# getting url data into a pandas dataframe
url_df = pd.DataFrame(url_list, columns=['id', 'doc', 'url', 'icon'])

# merging into the final dataset
final_df = pd.merge(filter_df, url_df, how = 'left', on='id')
final_df = final_df.drop(['doc'], axis = 1)

final_df.head()

|   | id | latitude | longitude | name | url | icon |
|---|---|---|---|---|---|---|
| 0 | 368 | 40.79142677512476 | -73.93807411193848 | 1 Ave @ 110 St | https://jpg.nyctmc.org/cctv261.jpg | assets/camera.png |
| 1 | 360 | 40.80042614416932 | -73.93155097961426 | 1 Ave @ 124 St | https://jpg.nyctmc.org/cctv254.jpg | assets/camera.png |
| 2 | 1189 | 40.731361 | -73.982486 | 1 Ave @ 14 St | https://jpg.nyctmc.org/cctv1083.jpg | assets/camera.png |
| 3 | 361 | 40.7359741672444 | -73.97828578948975 | 1 Ave @ 23 St | https://jpg.nyctmc.org/cctv253.jpg | assets/camera.png |
| 4 | 550 | 40.74803725830298 | -73.9694881439209 | 1 Ave @ 42 St | https://jpg.nyctmc.org/cctv490.jpg | assets/camera.png |


In [19]:
# saving df as a list of dictionaries (one per camera)
final_dict = final_df.to_dict('records')

In [20]:
# building the geojson file
features = []
# looping through the dictionary holding all camera data
for cam in final_dict:
    geometry = {}
    geometry['type'] = 'Point'
    geometry['coordinates'] = [float(cam['longitude']), float(cam['latitude'])]
    
    properties = {}
    properties['id'] = cam['id']
    properties['name'] = cam['name']
    properties['url'] = cam['url']
    properties['icon'] = cam['icon']
    
    # building the geojson structure and adding cam data to it
    item = {}
    item['geometry'] = geometry
    item['type'] = 'Feature'
    item['properties'] = properties
    features.append(item)

len(features)

694

The final count for all available camera feeds is shown above. Below is an example of one camera feature from the geojson data. After the data is exported as a geojson file, users will be able to click on the icons displayed on the map to access each of the camera feeds.

In [21]:
features[0]

{'geometry': {'type': 'Point',
  'coordinates': [-73.93807411193848, 40.79142677512476]},
 'type': 'Feature',
 'properties': {'id': '368',
  'name': '1 Ave @ 110 St',
  'url': 'https://jpg.nyctmc.org/cctv261.jpg',
  'icon': 'assets/camera.png'}}

In [22]:
# saving feature list as a feature collection for the geojson file
feature_collection = FeatureCollection(features)

# writing out file
with open('assets/data.geojson', 'w') as f:
    dump(feature_collection, f)
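
GeoJSON stores coordinates as [longitude, latitude], which is easy to get backwards. A quick sanity check before publishing the file could look like the sketch below. The bounding box is a rough assumption about where New York City cameras should fall, not an official boundary, and `check_feature` is a hypothetical helper.

```python
# Rough, assumed bounding box around New York City
LON_RANGE = (-75.0, -73.0)
LAT_RANGE = (40.0, 41.5)


def check_feature(feature):
    """Return a list of problems found in one GeoJSON feature."""
    problems = []
    lon, lat = feature['geometry']['coordinates']
    if not LON_RANGE[0] <= lon <= LON_RANGE[1]:
        problems.append('longitude outside expected range')
    if not LAT_RANGE[0] <= lat <= LAT_RANGE[1]:
        problems.append('latitude outside expected range')
    for key in ('id', 'name', 'url', 'icon'):
        if not feature['properties'].get(key):
            problems.append('missing property: ' + key)
    return problems


# The first feature from the output above
good = {
    'geometry': {'type': 'Point',
                 'coordinates': [-73.93807411193848, 40.79142677512476]},
    'type': 'Feature',
    'properties': {'id': '368', 'name': '1 Ave @ 110 St',
                   'url': 'https://jpg.nyctmc.org/cctv261.jpg',
                   'icon': 'assets/camera.png'},
}

# Same feature with latitude and longitude accidentally swapped
swapped = dict(good, geometry={'type': 'Point',
                               'coordinates': [40.79142677512476,
                                               -73.93807411193848]})

good_problems = check_feature(good)
swapped_problems = check_feature(swapped)
print(good_problems, swapped_problems)
```

Running `check_feature` over every item in `features` and printing only the non-empty results would flag any camera whose coordinates need a second look.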