# Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data
## Part 1: Reading and Writing Files in Python

Mark Bauer

Table of Contents
=================

   * [Getting Started](#-Getting-Started:-Accessing-the-Building-Footprints-Dataset)
       * [1. Search NYC Open Data in Google](##1.-Search-NYC-Open-Data-in-Google)
       * [2. Search "Building Footprints" in NYC Open Data search bar](##-2.-Search-"Building-Footprints"-in-NYC-Open-Data-search-bar)
       * [3. Select "Building Footprints" Dataset](##3.-Select-"Building-Footprints"-Dataset)
       * [4. The Building Footprints Dataset Page](##4.-The-Building-Footprints-Dataset-Page)
       
       
   * [1. Reading In Data](##-1.-Reading-In-Data)
       * [1.1 Reading in data as csv in static form](##-1.1-Reading-in-data-as-csv-in-static-form)
       * [1.2 Reading in data as json in static form](##-1.2-Reading-in-data-as-json-in-static-form)
       * [1.3 Reading in shapefile data](##-1.3-Reading-in-shapefile-data)
       * [1.4 Unzipping and reading in data as a csv in memory](##-1.4-Unzipping-and-reading-in-data-as-a-csv-in-memory)
       * [1.5 Unzipping and reading in data as csv to local folder](##-1.5-Unzipping-and-reading-in-data-as-csv-to-local-folder)
       * [1.6 Unzipping and reading in data as csv from local folder](##-1.6-Unzipping-and-reading-in-data-as-csv-from-local-folder)
       * [1.7 Reading in data from Socrata Open Data API (SODA)](##-1.7-Reading-in-data-from-Socrata-Open-Data-API-%28SODA%29)
       
       
   * [2. Writing Out Data](#-2.-Writing-Out-Data)
       * [2.1 Writing to a CSV file](##-2.1-Writing-to-a-CSV-file)
       * [2.2 Writing to an Excel (xlsx) file](##-2.2-Writing-to-an-Excel-%28xlsx%29-file)
       * [2.3 Writing to a JSON file](##-2.3-Writing-to-a-JSON-file)
       
       
   * [3. Reading In Data from Local Folder](#-3.-Reading-In-Data-from-Local-Folder)
       * [3.1 Reading in a CSV file](##-3.1-Reading-in-a-CSV-file)
       * [3.2 Reading in an Excel file](##-3.2-Reading-in-an-Excel-file)
       * [3.3 Reading in a JSON file](##-3.3-Reading-in-a-JSON-file)
       
       
   * [4. Conclusion](#-4.-Conclusion)

**Goal:** In this notebook, we will review various ways to read (load) and write (save) data from NYC Open Data. Specifically, we will focus on reading our data into a pandas dataframe.

**Main Library:** [pandas](https://pandas.pydata.org/) is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

In [196]:
# importing libraries
import pandas as pd
import numpy as np
import geopandas as gpd
from fiona.crs import from_epsg
import matplotlib.pyplot as plt
import os
import urllib.request
import json
import requests
from io import BytesIO
from sodapy import Socrata
import zipfile
from zipfile import ZipFile
from os.path import basename
from openpyxl import Workbook

Printing versions of Python modules and packages with **watermark**, the IPython magic extension.

In [197]:
%load_ext watermark

The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark


In [198]:
%watermark -v -p numpy,pandas,geopandas,matplotlib.pyplot,json,requests,sodapy

CPython 3.7.1
IPython 7.20.0

numpy 1.19.2
pandas 1.2.1
geopandas 0.8.1
matplotlib.pyplot 3.3.2
json 2.0.9
requests 2.25.1
sodapy 2.0.0


Documentation for installing watermark: https://github.com/rasbt/watermark

# Getting Started: Accessing the Building Footprints Dataset

## 1. Search NYC Open Data in Google

![building_footprints](images/1.nyc-open-data-google.png)

## 2. Search "Building Footprints" in NYC Open Data search bar

![building_footprints](images/2.building-footprints-opendata-search.png)

## 3. Select "Building Footprints" Dataset

![building_footprints](images/3.building-footprints-dataset-link.png)

## 4. The Building Footprints Dataset Page

![building_footprints](images/4.data-homepage.png)

Dataset Link: 
https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh

Documentation/Metadata: 
https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_BuildingFootprints.md

**Building Footprints Dataset Identification**

> **Here are a few things to note about the data:**
>
> - **Purpose:** This feature class is used by the NYC DOITT GIS group to maintain and distribute an accurate 'basemap' for NYC. The basemap provides the foundation upon virtually all other geospatial data with New York.
> - **Description:** Building footprints represent the full perimeter outline of each building as viewed from directly above. Additional attribute information maintained for each feature includes: Building Identification Number (BIN); Borough, Block, and Lot information(BBL); ground elevation at building base; roof height above ground elevation; construction year, and feature type.
> - **Source(s):** Annually captured aerial imagery, NYC Research of DOB records, or other image resources.
> - **Publication Dates:** **Data**: 05/03/16<br>
> - **Last Update:** Weekly<br>
> - **Metadata:** 12/22/2016<br>
> - **Update Frequency:** Features are updated daily by DoITT staff and a public release is available weekly on NYC Open Data. Every four years a citywide review is made of the building footprints and features are updated photogrammetrically.
> - **Available Formats:** File Geodatabase Feature Class as part of the Planimetrics geodatabase and individual shapefile on the [NYC Open Data Portal](https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh)
> - **Use Limitations:** Open Data policies and restrictions apply. See [Terms of Use](http://www.nyc.gov/html/data/terms.html)
> - **Access Rights:** Public
> - **Links:** https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh
> - **Tags:** Buildings, Building footprint, BIN, Structure

**Source:** 
https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_BuildingFootprints.md

# 1. Reading In Data

## 1.1 Reading in data as csv in static form

![building_footprints_csv](images/building-footprints-csv.png)

# NOTE: 
The building footprints `dataset identifier` changes weekly, and so does the data API path. Click on the API Docs page and verify the correct dataset identifier. If you're not working with the correct identifier, you will receive an `HTTP Error`. Screenshots below:

**Click on API Docs**
![building_footprints_csv](images/api_docs.png)

**Grab the updated dataset identifier**
![building_footprints_csv](images/api_docs_dataset_id.png)

The `dataset identifier` is inserted into the api path below:  
url = https://data.cityofnewyork.us/api/views/DATASET_IDENTIFIER/rows.csv?accessType=DOWNLOAD
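
As a small sketch of that substitution, the helper below builds the download URL from an identifier. The function name `build_csv_url` and its `domain` parameter are illustrative only and are not part of any NYC Open Data tooling.

```python
def build_csv_url(dataset_identifier, domain='data.cityofnewyork.us'):
    """Return the static CSV download URL for a dataset identifier."""
    return f'https://{domain}/api/views/{dataset_identifier}/rows.csv?accessType=DOWNLOAD'


# 'uic8-njst' is the identifier used in this notebook;
# verify the current one on the API Docs page before running
url = build_csv_url('uic8-njst')
print(url)
```

Keeping the identifier in one place makes the weekly update a one-line change.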

In [199]:
# reading in data as a url from NYC Open Data
url = 'https://data.cityofnewyork.us/api/views/uic8-njst/rows.csv?accessType=DOWNLOAD'

# saving data as a pandas dataframe named 'building_footprints_csv'
building_footprints_csv = pd.read_csv(url)

In [200]:
# previewing the first five rows 
building_footprints_csv.head()

Unnamed: 0,NAME,CNSTRCT_YR,BIN,the_geom,LSTMODDATE,LSTSTATYPE,DOITT_ID,HEIGHTROOF,FEAT_CODE,GROUNDELEV,SHAPE_AREA,SHAPE_LEN,BASE_BBL,MPLUTO_BBL,GEOMSOURCE
0,,2009.0,3394646,MULTIPOLYGON (((-73.87129515296562 40.65717370...,08/22/2017 12:00:00 AM +0000,Constructed,1212853,21.61,2100.0,18.0,854.66,125.08,3044520815.0,3044520815.0,Photogramm
1,,1930.0,4548330,MULTIPOLYGON (((-73.87670970144625 40.71425234...,08/17/2017 12:00:00 AM +0000,Constructed,1226227,10.36,5110.0,122.0,217.59,60.23,4030640041.0,4030640041.0,Photogramm
2,,1960.0,4460479,MULTIPOLYGON (((-73.85195485799383 40.66235471...,08/22/2017 12:00:00 AM +0000,Constructed,581946,29.81,2100.0,10.0,946.43,123.14,4139430001.0,4139430001.0,Photogramm
3,,1920.0,3355684,MULTIPOLYGON (((-73.94029215265738 40.64108287...,08/17/2017 12:00:00 AM +0000,Constructed,858061,11.2,5110.0,32.0,248.68,63.94,3049720006.0,3049720006.0,Photogramm
4,,1915.0,3131737,MULTIPOLYGON (((-73.98998983552244 40.62383804...,08/22/2017 12:00:00 AM +0000,Constructed,568078,24.98,2100.0,44.0,1163.23,165.61,3055100055.0,3055100055.0,Photogramm


In [201]:
# printing the dimensions (i.e. rows, columns) of the data
building_footprints_csv.shape

(1084824, 15)

In [202]:
rows = f'{building_footprints_csv.shape[0]:,}'
columns = building_footprints_csv.shape[1]

print('This dataset has {} rows and {} columns.'.format(rows, columns))

This dataset has 1,084,824 rows and 15 columns.


**Sanity check**

We use the pandas `.head()` method to preview the first five rows of the dataframe.

We use the pandas `.shape` attribute to print the dimensions of the dataframe (i.e. number of rows, number of columns).

We will use both throughout the examples.
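
To see both checks on something small, here is a minimal sketch with a made-up dataframe (the values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# a tiny, made-up dataframe used only to illustrate the two checks
toy = pd.DataFrame({
    'BIN': [1, 2, 3, 4, 5, 6],
    'CNSTRCT_YR': [2009, 1930, 1960, 1920, 1915, 2001],
})

# .head() is called with parentheses and returns the first five rows by default
preview = toy.head()

# .shape is read without parentheses and returns (rows, columns)
n_rows, n_cols = toy.shape

print(preview)
print(n_rows, n_cols)
```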

## 1.2 Reading in data as json in static form

![building_footprints_csv](images/building-footprints-json.png)

In [203]:
url = 'https://data.cityofnewyork.us/api/views/uic8-njst/rows.json?accessType=DOWNLOAD'

# loading the json response as a python object
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())

In [204]:
data.keys()

dict_keys(['meta', 'data'])

In [205]:
len(data['data'])

1084824

In [206]:
# previewing the first row in our data section of our python object
data['data'][0]

['row-cs6h.qik8.dv3m',
 '00000000-0000-0000-8A9B-77885D496B59',
 0,
 1613504955,
 None,
 1613504955,
 None,
 '{ }',
 None,
 '2009',
 '3394646',
 'MULTIPOLYGON (((-73.87129515296562 40.65717370043455, -73.87135858020156 40.65714663518705, -73.87143322008981 40.6572480836196, -73.87136979278591 40.6572751498085, -73.87129515296562 40.65717370043455)))',
 1503360000,
 'Constructed',
 '1212853',
 '21.60850812',
 '2100',
 '18',
 '854.66243317866',
 '125.0797955584',
 '3044520815',
 '3044520815',
 'Photogramm']

In [207]:
data['meta'].keys()

dict_keys(['view'])

In [208]:
keys = data['meta']['view'].keys()

for key in keys:
    print(key)

id
name
assetType
averageRating
createdAt
displayType
downloadCount
hideFromCatalog
hideFromDataJson
newBackend
numberOfComments
oid
provenance
publicationAppendEnabled
publicationDate
publicationGroup
publicationStage
rowsUpdatedAt
rowsUpdatedBy
tableId
totalTimesRated
viewCount
viewLastModified
viewType
approvals
columns
grants
owner
query
rights
tableAuthor
flags


In [209]:
# locating our columns (i.e. field names) and saving as a new variable called 'cols'
cols = data['meta']['view']['columns']

# previewing first five
cols[:5]

[{'id': -1,
  'name': 'sid',
  'dataTypeName': 'meta_data',
  'fieldName': ':sid',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'id',
  'dataTypeName': 'meta_data',
  'fieldName': ':id',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'position',
  'dataTypeName': 'meta_data',
  'fieldName': ':position',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_at',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_at',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']},
 {'id': -1,
  'name': 'created_meta',
  'dataTypeName': 'meta_data',
  'fieldName': ':created_meta',
  'position': 0,
  'renderTypeName': 'meta_data',
  'format': {},
  'flags': ['hidden']}]

In [210]:
for item in cols:
    print(item['fieldName'])

:sid
:id
:position
:created_at
:created_meta
:updated_at
:updated_meta
:meta
name
cnstrct_yr
bin
the_geom
lstmoddate
lststatype
doitt_id
heightroof
feat_code
groundelev
shape_area
shape_len
base_bbl
mpluto_bbl
geomsource


In [211]:
# mapping each field name to its column metadata
fieldName = {x['fieldName']: x for x in cols}

# printing the field names in our data
for key in fieldName.keys():
    print(key)

:sid
:id
:position
:created_at
:created_meta
:updated_at
:updated_meta
:meta
name
cnstrct_yr
bin
the_geom
lstmoddate
lststatype
doitt_id
heightroof
feat_code
groundelev
shape_area
shape_len
base_bbl
mpluto_bbl
geomsource


In [212]:
# saving our field names in a list
columns = [*fieldName]
building_footprints_json = pd.DataFrame(data['data'], columns=columns)

# identifying columns not required for analysis
drop_columns = [':sid', ':id', ':position', ':created_at', ':created_meta', ':updated_at', 
                ':updated_meta', ':meta']

# dropping columns not required for analysis
building_footprints_json.drop(drop_columns, axis=1, inplace=True)

In [213]:
# previewing the first five rows
building_footprints_json.head()

Unnamed: 0,name,cnstrct_yr,bin,the_geom,lstmoddate,lststatype,doitt_id,heightroof,feat_code,groundelev,shape_area,shape_len,base_bbl,mpluto_bbl,geomsource
0,,2009,3394646,MULTIPOLYGON (((-73.87129515296562 40.65717370...,1503360000,Constructed,1212853,21.60850812,2100,18,854.66243317866,125.0797955584,3044520815,3044520815,Photogramm
1,,1930,4548330,MULTIPOLYGON (((-73.87670970144625 40.71425234...,1502928000,Constructed,1226227,10.36,5110,122,217.59424346169,60.22585821856,4030640041,4030640041,Photogramm
2,,1960,4460479,MULTIPOLYGON (((-73.85195485799383 40.66235471...,1503360000,Constructed,581946,29.81157033,2100,10,946.42747637737,123.14194057237,4139430001,4139430001,Photogramm
3,,1920,3355684,MULTIPOLYGON (((-73.94029215265738 40.64108287...,1502928000,Constructed,858061,11.2,5110,32,248.67816852809,63.94081721089,3049720006,3049720006,Photogramm
4,,1915,3131737,MULTIPOLYGON (((-73.98998983552244 40.62383804...,1503360000,Constructed,568078,24.98,2100,44,1163.227668698,165.60876340496,3055100055,3055100055,Photogramm
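
Note that `lstmoddate` appears here as an integer rather than the date string seen in the CSV preview. Assuming these integers are Unix timestamps in seconds (an inference from comparing the two previews, not something stated in the dataset documentation quoted in this notebook), a single value can be converted like this:

```python
from datetime import datetime, timezone

# value taken from the first row of the JSON preview
lstmoddate = 1503360000

# interpret the integer as seconds since 1970-01-01 UTC
converted = datetime.fromtimestamp(lstmoddate, tz=timezone.utc)
print(converted.date())
```

For a whole column, `pd.to_datetime(building_footprints_json['lstmoddate'], unit='s')` should perform the same conversion, provided the column holds numeric values.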


In [214]:
# printing dimensions of data
building_footprints_json.shape

(1084824, 15)
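
The steps in this section can be condensed into a short sketch. The payload below is made up and only mimics the `meta` / `data` layout of the download; it assumes, as in the cells of this section, that system fields are the ones prefixed with `:`.

```python
import pandas as pd

# a made-up payload that mimics the 'meta' / 'data' layout of the download
payload = {
    'meta': {'view': {'columns': [
        {'fieldName': ':sid'},
        {'fieldName': 'bin'},
        {'fieldName': 'cnstrct_yr'},
    ]}},
    'data': [
        ['row-1', '3394646', '2009'],
        ['row-2', '4548330', '1930'],
    ],
}

# field names come from the metadata, in the same order as the values in each row
field_names = [col['fieldName'] for col in payload['meta']['view']['columns']]

df = pd.DataFrame(payload['data'], columns=field_names)

# dropping the system fields, which are prefixed with ':'
df = df.drop(columns=[name for name in field_names if name.startswith(':')])
print(df)
```

Filtering on the `:` prefix avoids typing out each system field by hand, at the cost of relying on that naming pattern.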

## 1.3 Reading in shapefile data

![building_footprints_csv](images/building-footprints-shp.png)

In [None]:
url = 'https://data.cityofnewyork.us/api/geospatial/nqwf-w8eh?method=export&format=Shapefile'

# reading in data as a geodataframe
building_footprints_shp = gpd.read_file(url)

# printing the first five rows
building_footprints_shp.head()

In [None]:
# printing dimensions of data
building_footprints_shp.shape

Another popular dataset is NYC's PLUTO dataset. We will use this one because it comes in a zip file.

Description: Extensive land use and geographic data at the tax lot level in comma–separated values (CSV) file format. The PLUTO files contain more than seventy fields derived from data maintained by city agencies.

Dataset Link: https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page

Data Dictionary: https://www1.nyc.gov/assets/planning/download/pdf/data-maps/open-data/pluto_datadictionary.pdf?v=20v1

![building_footprints_csv](images/pluto-csv.png)

## 1.4 Unzipping and reading in data as a csv in memory

In [None]:
url = 'https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_20v1_csv.zip'

# reading in our zipfile data in-memory
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))

# printing files in our zipfile
for item in zf.namelist():
    print("File in zip: "+ item)

In [None]:
# read our csv data into a dataframe from our zipfile
pluto_data = pd.read_csv(zf.open('pluto_20v1.csv'))

# previewing the first five rows of data
pluto_data.head()

In [None]:
# printing dimensions of our data
pluto_data.shape
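
The same in-memory pattern can be tried without a network connection. This sketch builds a small zip archive in memory to stand in for the downloaded bytes; the file name `sample.csv` and its contents are made up.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# building a small zip archive in memory to stand in for the downloaded bytes
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.csv', 'borough,lot\nMN,1\nBK,2\n')

# reopening the same bytes for reading, as we would with content.content
with ZipFile(BytesIO(buffer.getvalue())) as zf:
    names = zf.namelist()
    # reading the csv straight from the archive, without writing to disk
    sample = pd.read_csv(zf.open('sample.csv'))

print(names)
print(sample)
```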

## 1.5 Unzipping and reading in data as csv to local folder

We will retrieve, unzip and read in data in our downloads folder.

In [None]:
url = 'https://www1.nyc.gov/assets/planning/download/zip/data-maps/open-data/nyc_pluto_20v1_csv.zip'

# a path to our downloads folder 
downloads_path = '../../Downloads/'

# a path to our file from our downloads path
fullfilename = os.path.join(downloads_path, 'PLUTO.gz')

# retrieving data 
urllib.request.urlretrieve(url, fullfilename)

In [None]:
# a path to our file from our downloads folder
file_path = '../../Downloads/PLUTO.gz'

# opening the zipfile and saving the archive object
items = zipfile.ZipFile(file_path)

# available files in the container
print(items.namelist())

In [None]:
# opening zipfile using 'with' keyword in read mode
with zipfile.ZipFile(file_path, 'r') as file:
    file.extractall(downloads_path)

In [None]:
# read our data into a dataframe from our downloads path
pluto_data = pd.read_csv(downloads_path + 'pluto_20v1.csv')

In [None]:
# previewing the first five rows in data
pluto_data.head()

In [None]:
# printing dimensions of data 
pluto_data.shape
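
The relative path `'../../Downloads/'` only resolves correctly when the notebook runs from a particular folder. As one alternative, assuming the downloads folder sits directly under the user's home directory, the path can be built with `pathlib`:

```python
from pathlib import Path

# assumption: the downloads folder sits directly under the home directory
downloads_path = Path.home() / 'Downloads'

# same file name used in this section
fullfilename = downloads_path / 'PLUTO.gz'

print(fullfilename)
print(fullfilename.exists())
```

`urllib.request.urlretrieve`, `zipfile.ZipFile` and `pd.read_csv` all accept a `Path` object in place of a string.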

## 1.6 Unzipping and reading in data as csv from local folder


**Manually saving the first 20 rows of the data as a csv file and compressing it into a zip file.**

In [None]:
# saving first twenty rows of our data as a new csv
building_footprints_csv.head(20).to_csv('data/sample_buildings.csv', index=False)

In [None]:
file_path = 'data/sample-buidlings.zip'

# create a zipfile
with zipfile.ZipFile(file_path, 'w') as file:
    # 'w' mode overwrites any existing zip file at this path
    # the csv must already exist on disk before it is added to the zip
    file.write('data/sample_buildings.csv', basename('data/sample_buildings.csv'))

In [None]:
# seeing if a file is a zipfile
print(zipfile.is_zipfile(file_path))

In [None]:
# list items in this file path
%ls data/

In [None]:
# opening the zipfile and saving the archive object
items = zipfile.ZipFile(file_path)

# available files in the container
print(items.namelist())

**Extracting the csv file of the data from the zipped file.**

In [None]:
file_name = 'data/sample-buidlings.zip'

# opening zip using 'with' keyword in read mode
with zipfile.ZipFile(file_name, 'r') as file:
    # extracting all items in our zipfile
    file.extractall('data/unzipped-data')

In [None]:
# list files in this file path
%ls data/unzipped-data/

In [None]:
# read data as a dataframe
sample_buidlings = pd.read_csv('data/unzipped-data/sample_buildings.csv')

# previewing first five rows of data
sample_buidlings.head()

In [None]:
# printing dimensions of data
sample_buidlings.shape
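
When a zip archive contains a single csv, `pd.read_csv` can also read it directly, because compression is inferred from the `.zip` extension. The sketch below uses a temporary folder and made-up rows so that it does not touch the `data/` folder.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'sample_buildings.csv')
    zip_path = os.path.join(tmp_dir, 'sample_buildings.zip')

    # made-up rows standing in for the sample of building footprints
    toy = pd.DataFrame({'BIN': [3394646, 4548330], 'CNSTRCT_YR': [2009, 1930]})
    toy.to_csv(csv_path, index=False)

    # zipping the csv, storing it under its base name
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.write(csv_path, os.path.basename(csv_path))

    # reading the zipped csv without extracting it first
    sample = pd.read_csv(zip_path)

print(sample)
```

This shortcut skips the extraction step, but it only applies when the archive holds exactly one data file.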

## 1.7 Reading in data from Socrata Open Data API (SODA)

**Note: If you haven't signed up for an app token, there is a 1,000 rows limit.**

![building_footprints_csv](images/building-footprints-soda-api.png)

In [None]:
# Enter the information from those sections here
socrata_domain = 'data.cityofnewyork.us' # nyc open data domain
socrata_dataset_identifier = 'uic8-njst' # building footprints dataset identifier

# App Tokens can be generated by creating an account at https://opendata.socrata.com/signup
# Tokens are optional (`None` can be used instead), though requests will be rate limited.
#
# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

Source: https://github.com/xmunoz/sodapy/blob/master/examples/basic_queries.ipynb

In [None]:
# The main class that interacts with the SODA API. Sample usage:
    # from sodapy import Socrata
    # client = Socrata("opendata.socrata.com", None)
client = Socrata(socrata_domain, socrata_token)

print("Domain: {domain:}\nSession: {session:}\nURI Prefix: {uri_prefix:}".format(**client.__dict__))

We are setting the **limit** at **2,000,000 rows**, which is more than the 1,084,824 rows counted earlier in this notebook, so the full dataset is returned.

In [None]:
# retrieving data as a list of dictionaries (one dictionary per row)
results = client.get(socrata_dataset_identifier, limit=2000000)

# creating a dataframe from our list of records
building_footprints_soda_api = pd.DataFrame.from_records(results)

# printing first five rows of data
building_footprints_soda_api.head()

In [None]:
# printing dimensions of our data
building_footprints_soda_api.shape
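
Requesting everything in one call can be slow for a dataset of this size. A possible alternative is to page through the results with `limit` and `offset`. The sketch below is a generic pager; `fetch_all` and `fake_get` are hypothetical names, and the fake records stand in for the API so the example runs offline.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect rows page by page until a short or empty page is returned."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        # a page shorter than page_size means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return rows


# stand-in for the API: 2,500 fake records
fake_records = [{'bin': str(i)} for i in range(2500)]


def fake_get(limit, offset):
    """Return one slice of the fake records, like a paged API call would."""
    return fake_records[offset:offset + limit]


all_rows = fetch_all(fake_get)
print(len(all_rows))
```

With sodapy, `fetch_page` could be something like `lambda **kwargs: client.get(socrata_dataset_identifier, order=':id', **kwargs)`. The `order` argument is an assumption added here to keep pages from overlapping; check the SODA paging documentation before relying on it.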

In [None]:
# closing the client session
client.close()

**Useful resources:**
    
API Docs: https://dev.socrata.com/foundry/data.cityofnewyork.us/i62d-kjv8

Sign up for app token: https://data.cityofnewyork.us/profile/edit/developer_settings

Python client for the Socrata Open Data API: https://github.com/xmunoz/sodapy

Examples: https://github.com/xmunoz/sodapy/tree/master/examples

# 2. Writing Out Data

### For simplicity, we're only exporting buildings built from 2010 to 2020

In [None]:
# saving only buildings built between 2010 and 2020 as a new dataframe
building_footprints_after_2010 = building_footprints_csv[building_footprints_csv['CNSTRCT_YR'].between(2010, 2020)]

# reset our index
building_footprints_after_2010.reset_index(drop=True, inplace=True)

In [None]:
# previewing first five rows of data
building_footprints_after_2010.head()

In [None]:
# printing dimensions of our data
building_footprints_after_2010.shape
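
For reference, `.between()` includes both endpoints by default, and missing values do not match. A minimal sketch with made-up construction years:

```python
import pandas as pd

# made-up construction years, including both boundary values and a missing value
toy = pd.DataFrame({'CNSTRCT_YR': [2009.0, 2010.0, 2015.0, 2020.0, 2021.0, None]})

# True where 2010 <= CNSTRCT_YR <= 2020; the missing value evaluates to False
mask = toy['CNSTRCT_YR'].between(2010, 2020)

# keeping matching rows and renumbering the index from zero
subset = toy[mask].reset_index(drop=True)

print(subset)
```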

In [None]:
# display float types as two decimals
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# sorting our construction year values and printing the unique values
building_footprints_after_2010.sort_values(by='CNSTRCT_YR').CNSTRCT_YR.unique()

In [None]:
# list items in data folder
%ls data/

## 2.1 Writing to a CSV file

In [None]:
# writing files as a csv
building_footprints_after_2010.to_csv('data/building_after_2010.csv', index=False)

# listing items in data folder
%ls data/

## 2.2 Writing to an Excel (xlsx) file

In [None]:
# writing files as an excel file
building_footprints_after_2010.to_excel('data/building_after_2010.xlsx', index=False)

# listing items in data folder
%ls data/

## 2.3 Writing to a JSON file

In [None]:
# writing files as json
building_footprints_after_2010.to_json('data/building_after_2010.json')

# listing items in data folder
%ls data/

# 3. Reading In Data from Local Folder

In [None]:
# listing items in data folder
%ls data/

## 3.1 Reading in a CSV file

In [None]:
# read data as a dataframe
building_footprints_after_2010 = pd.read_csv('data/building_after_2010.csv')

# previewing first five rows in data
building_footprints_after_2010.head()

In [None]:
# printing dimensions of data
building_footprints_after_2010.shape

## 3.2 Reading in an Excel file

In [None]:
# read data as a dataframe
building_footprints_after_2010 = pd.read_excel('data/building_after_2010.xlsx')

# previewing first five rows in data
building_footprints_after_2010.head()

In [None]:
# printing dimensions of data
building_footprints_after_2010.shape

## 3.3 Reading in a JSON file

In [None]:
# read data as a dataframe
building_footprints_after_2010 = pd.read_json('data/building_after_2010.json')

# previewing first five rows in data
building_footprints_after_2010.head()

In [None]:
# printing dimensions of data
building_footprints_after_2010.shape
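
A quick way to check that a format preserves the data is a round trip: write, read back, and compare. The sketch below does this in memory with made-up rows, so no files are created. Column types can change between formats, so it is worth comparing `dtypes` as well as the shape.

```python
from io import StringIO

import pandas as pd

# made-up rows standing in for the exported buildings
original = pd.DataFrame({
    'BIN': [3394646, 4548330],
    'CNSTRCT_YR': [2010.0, 2015.0],
    'LSTSTATYPE': ['Constructed', 'Constructed'],
})

# writing to text in memory, then reading it back
from_csv = pd.read_csv(StringIO(original.to_csv(index=False)))
from_json = pd.read_json(StringIO(original.to_json()))

# comparing dimensions with the original
print(from_csv.shape == original.shape)
print(from_json.shape == original.shape)

# inspecting the column types after each round trip
print(from_csv.dtypes)
print(from_json.dtypes)
```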

# 4. Conclusion

In this notebook, we reviewed various ways to read (load) and write (save) data from NYC Open Data. Specifically, we focused on reading our data into a pandas dataframe. We also went over common file formats that you might encounter: csv, json, shapefiles, and zip files. In Part 2, we will focus on basic data inspection and wrangling techniques in the data analysis workflow.