# Downloading data

This data comes from the City of New York and is documented here:

https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9

The actual data download link is: 

https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD

The following code (commented out by default) downloads the data to your computer. The file is about 18 gigabytes, so the download may take an hour or more, depending on your connection speed.

In [None]:
#import urllib.request
#urllib.request.urlretrieve("https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD", "./data/cityofnewyork.us/311-Service-Requests-from-2010-to-Present.csv")
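`urlretrieve` raises an error if the target directory does not exist yet. A minimal sketch of an alternative, using only the standard library, creates the directory first and then streams the response to disk in chunks. The helper name `download_file` and the 1 MiB chunk size are illustrative choices, not part of the original notebook.

```python
import os
import shutil
import urllib.request


def download_file(url, dest, chunk_size=1024 * 1024):
    """Stream `url` to `dest` in chunks and return the number of bytes written."""
    # Create the target directory if it does not exist yet
    os.makedirs(os.path.dirname(os.path.abspath(dest)), exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj copies chunk_size bytes at a time,
        # so the whole file is never held in memory
        shutil.copyfileobj(response, out, chunk_size)
    return os.path.getsize(dest)


# Example call (left commented out because the file is very large):
# download_file(
#     "https://data.cityofnewyork.us/api/views/erm2-nwe9/rows.csv?accessType=DOWNLOAD",
#     "./data/cityofnewyork.us/311-Service-Requests-from-2010-to-Present.csv",
# )
```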

# Displaying raw data
First, look at the first few lines of the comma-separated values (CSV) file, including its header row:

In [None]:
#!head -n 4 ./data/cityofnewyork.us/311-Service-Requests-from-2010-to-Present.csv
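The `!head` command needs a Unix-style shell. As a portable alternative, the sketch below reads the first lines in plain Python without loading the rest of the file. The helper name `head` is hypothetical. It counts physical lines, so a quoted CSV field that contains a line break would span more than one of them.

```python
from itertools import islice


def head(path, n=4, encoding="utf-8"):
    """Return the first `n` lines of a text file without reading the rest."""
    with open(path, "r", encoding=encoding, newline="") as f:
        # islice stops after n lines, so a multi-gigabyte file is not loaded
        return [line.rstrip("\r\n") for line in islice(f, n)]


# Example call (assumes the file has been downloaded to this path):
# for line in head("./data/cityofnewyork.us/311-Service-Requests-from-2010-to-Present.csv"):
#     print(line)
```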

# Loading the data into a database

We will use `duckdb` to load the data so that we can query it with Structured Query Language (SQL).

In [1]:
# Load duckdb, which lets us efficiently load large files
import duckdb

# Load pandas, which lets us manipulate dataframes
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

# Configure jupysql to return query results as Pandas DataFrames and to simplify the output printed to the notebook.
%config SqlMagic.autopandas = True

%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

# Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Connect either to an in-memory DuckDB or to a file-backed database.
%sql duckdb:///:memory:

In [2]:
import pandas as pd

df = pd.read_csv('/Users/brenstockdale/Downloads/Data Downloads/adi-download/PA_2021_ADI_9 Digit Zip Code_v4.csv')
df.info()
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2421529 entries, 0 to 2421528
Data columns (total 7 columns):
 #   Column        Dtype  
---  ------        -----  
 0   INDEX         int64  
 1   GISJOIN       object 
 2   TYPE          object 
 3   ADI_NATRANK   object 
 4   ADI_STATERNK  object 
 5   BENE_ZIP_CD   int64  
 6   FIPS          float64
dtypes: float64(1), int64(2), object(4)
memory usage: 129.3+ MB


Unnamed: 0,INDEX,GISJOIN,TYPE,ADI_NATRANK,ADI_STATERNK,BENE_ZIP_CD,FIPS
0,1,G42000706047001,,96,10,150010001,420076000000.0
1,2,G42000706047001,,96,10,150010002,420076000000.0
2,3,G42000706047001,,96,10,150010003,420076000000.0
3,4,G42000706047001,,96,10,150010004,420076000000.0
4,5,G42000706047001,,96,10,150010005,420076000000.0


In [4]:
%%sql
SELECT * 
FROM read_csv('/Users/brenstockdale/Downloads/Data Downloads/adi-download/PA_2021_ADI_9 Digit Zip Code_v4.csv',
    header=True,
    delim=',',
    quote='"',
    columns={
        'INDEX': 'INT',
        'GISJOIN': 'VARCHAR',
        'TYPE': 'VARCHAR',
        'ADI_NATRANK': 'VARCHAR',
        'ADI_STATERNK': 'VARCHAR',
        'BENE_ZIP_CD': 'INT',
        'FIPS': 'FLOAT'
    }
)

Unnamed: 0,INDEX,GISJOIN,TYPE,ADI_NATRANK,ADI_STATERNK,BENE_ZIP_CD,FIPS
0,1,G42000706047001,,96,10,150010001,4.200761e+11
1,2,G42000706047001,,96,10,150010002,4.200761e+11
2,3,G42000706047001,,96,10,150010003,4.200761e+11
3,4,G42000706047001,,96,10,150010004,4.200761e+11
4,5,G42000706047001,,96,10,150010005,4.200761e+11
...,...,...,...,...,...,...,...
2421524,2421525,G42001100006002,P,P,P,196129622,4.201100e+11
2421525,2421526,G42001100006002,P,P,P,196129660,4.201100e+11
2421526,2421527,G42001100006002,P,P,P,196129991,4.201100e+11
2421527,2421528,G42001100006002,P,P,P,196129992,4.201100e+11
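One detail in the `columns` mapping is worth checking: the `FLOAT` type for `FIPS`. DuckDB's `FLOAT` is a 4-byte single-precision type (the Polars schema printed later in the notebook reports `Float32`), which keeps only about seven significant decimal digits, fewer than a 12-digit code needs. The sketch below uses the standard library to round-trip a value through 32 bits. The number `420076047001` is a hypothetical 12-digit identifier chosen for illustration, not a value read from the file.

```python
import struct


def through_float32(value):
    """Round-trip a number through a 4-byte IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


code = 420076047001            # hypothetical 12-digit identifier
as_float32 = through_float32(code)
as_float64 = float(code)

print(int(as_float32) == code)  # False: the low digits are rounded away
print(int(as_float64) == code)  # True: a 64-bit float holds 12 digits exactly
```

If the identifier has to survive exactly, declaring the column as `DOUBLE` or `VARCHAR` is one option. This is an assumption about intent, since the notebook does not say how `FIPS` is used afterwards.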


# Saving the database to a parquet file

Parquet is a columnar file format designed for compact storage and fast analytical reads. `duckdb` can write a query result directly to a Parquet file.
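To make "columnar" concrete, the toy sketch below stores the same three records first row by row and then column by column, using plain Python containers. It only illustrates the layout idea and is not how Parquet is implemented. The sample values are copied from the table output shown earlier.

```python
# Row-oriented: each record is stored together (like lines of a CSV file)
rows = [
    {"INDEX": 1, "ADI_NATRANK": "96", "BENE_ZIP_CD": 150010001},
    {"INDEX": 2, "ADI_NATRANK": "96", "BENE_ZIP_CD": 150010002},
    {"INDEX": 3, "ADI_NATRANK": "96", "BENE_ZIP_CD": 150010003},
]

# Column-oriented: all values of one column are stored together
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one column touches a single list instead of every record
zip_codes = columns["BENE_ZIP_CD"]
print(zip_codes)  # [150010001, 150010002, 150010003]

# Repeated values sit next to each other, which helps
# column-wise compression such as ZSTD
print(columns["ADI_NATRANK"])  # ['96', '96', '96']
```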

In [5]:
%%sql
COPY (SELECT * 
FROM read_csv('/Users/brenstockdale/Downloads/Data Downloads/adi-download/PA_2021_ADI_9 Digit Zip Code_v4.csv',
    header=True,
    delim=',',
    quote='"',
    columns={
        'INDEX': 'INT',
        'GISJOIN': 'VARCHAR',
        'TYPE': 'VARCHAR',
        'ADI_NATRANK': 'VARCHAR',
        'ADI_STATERNK': 'VARCHAR',
        'BENE_ZIP_CD': 'INT',
        'FIPS': 'FLOAT'
    }
)) TO '/Users/brenstockdale/Downloads/Data Downloads/adi-download/PA_2021_ADI_9 Digit Zip Code_v4.parquet' (COMPRESSION ZSTD);

Unnamed: 0,Count
0,2421529


# Visualizing the data

We will use the `altair` library to visualize the data. `altair` is built on the `vega-lite` visualization grammar, a high-level grammar that compiles to `vega`, a lower-level grammar that is in turn built on the `d3` visualization library.

VegaFusion evaluates a chart's data transformations in the Python kernel before rendering, which lets `altair` charts in Jupyter notebooks handle larger datasets. Here it is configured to run those transformations with `duckdb`.

In [1]:
import vegafusion as vf
import polars as pl
import altair as alt
alt.data_transformers.disable_max_rows()
alt.renderers.enable('html')

# Configure DuckDB connection
vf.runtime.set_connection("duckdb")

# Enable Mime Renderer
vf.enable(row_limit=100000000)

vegafusion.enable(mimetype='html', row_limit=100000000, embed_options=None)

In [2]:
# Load the Parquet file into a Polars DataFrame (the name `phone_calls` is left over from the 311 example)
phone_calls = pl.read_parquet("/Users/brenstockdale/Downloads/Data Downloads/adi-download/PA_2021_ADI_9 Digit Zip Code_v4.parquet")

In [None]:
# Load the data from the public datathinking.org Amazon S3 bucket
# The file is about 2 gigabytes, so this may take a few minutes (5-10 minutes is normal depending on internet speed!)
# You can also download this file on the command line using wget (`wget https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet`)
# You can also download this file using a web browser by visiting https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet
#phone_calls = pl.read_parquet("https://public.datathinking.org/cityofnewyork.us%2F311-Service-Requests-from-2010-to-Present.parquet")

In [3]:
print(phone_calls.schema)

{'INDEX': Int32, 'GISJOIN': Utf8, 'TYPE': Utf8, 'ADI_NATRANK': Utf8, 'ADI_STATERNK': Utf8, 'BENE_ZIP_CD': Int32, 'FIPS': Float32}


In [4]:
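# Create a bar chart. 'Agency' is a column of the 311 dataset, not of the ADI file loaded above,
# so this cell fails as shown below until the field is replaced with one from the schema (NEEDS TO BE CHANGED FOR DIFFERENT DATASET)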
alt.Chart(phone_calls).mark_bar().encode(
    x='Agency:N',
    y='count()',
)

thread '<unnamed>' panicked at 'Failed to get node value: DataFusionError(SchemaError(FieldNotFound { field: Column { relation: None, name: "Agency" }, valid_fields: [Column { relation: None, name: "_vf_order" }, Column { relation: None, name: "INDEX" }, Column { relation: None, name: "GISJOIN" }, Column { relation: None, name: "TYPE" }, Column { relation: None, name: "ADI_NATRANK" }, Column { relation: None, name: "ADI_STATERNK" }, Column { relation: None, name: "BENE_ZIP_CD" }, Column { relation: None, name: "FIPS" }] }), ErrorContext { contexts: [] })', /Users/runner/work/vegafusion/vegafusion/vegafusion-runtime/src/task_graph/runtime.rs:775:18
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


PanicException: Failed to get node value: DataFusionError(SchemaError(FieldNotFound { field: Column { relation: None, name: "Agency" }, valid_fields: [Column { relation: None, name: "_vf_order" }, Column { relation: None, name: "INDEX" }, Column { relation: None, name: "GISJOIN" }, Column { relation: None, name: "TYPE" }, Column { relation: None, name: "ADI_NATRANK" }, Column { relation: None, name: "ADI_STATERNK" }, Column { relation: None, name: "BENE_ZIP_CD" }, Column { relation: None, name: "FIPS" }] }), ErrorContext { contexts: [] })
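The panic above is raised because the chart refers to `Agency`, which is not among the columns listed in `valid_fields`. A small pre-flight check can surface this before the chart is rendered. The sketch below is a simplified, hypothetical helper (`missing_fields` is not part of Altair or VegaFusion). It only understands plain `field:type` shorthands and skips aggregate expressions such as `count()`.

```python
def missing_fields(shorthands, available):
    """Return field names used in simple Altair-style shorthands
    (e.g. 'Agency:N') that are absent from `available`."""
    missing = []
    for spec in shorthands:
        if "(" in spec:
            # Skip aggregates such as 'count()'; this sketch does not parse them
            continue
        # Drop an optional one-letter type suffix such as ':N' or ':Q'
        field = spec.rsplit(":", 1)[0] if spec[-2:-1] == ":" else spec
        if field not in available:
            missing.append(field)
    return missing


# Column names taken from the schema printed earlier in the notebook
schema = ["INDEX", "GISJOIN", "TYPE", "ADI_NATRANK", "ADI_STATERNK", "BENE_ZIP_CD", "FIPS"]

print(missing_fields(["Agency:N", "count()"], schema))       # ['Agency']
print(missing_fields(["ADI_NATRANK:N", "count()"], schema))  # []
```

With the ADI data, replacing `x='Agency:N'` with a column from the schema, for example `x='ADI_NATRANK:N'`, would give the chart a field that exists.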

In [None]:
# Filter out phone calls that don't have a location (311 dataset only: the ADI schema above has no Latitude or Longitude columns)
phone_calls = phone_calls.filter(
    pl.col("Latitude").is_not_null() & pl.col("Longitude").is_not_null()
)

In [None]:
# Plot the phone calls on a map
alt.Chart(phone_calls[:10000]).mark_circle().encode(
    longitude='Longitude:Q',
    latitude='Latitude:Q',
    size='count()',
    color='count()',
    tooltip=['Agency:N', 'Complaint Type:N', 'Descriptor:N', 'Location Type:N', 'Incident Zip:N', 'City:N', 'Borough:N']
).project(
    type='albersUsa'
)