# Learning goals
After today's lesson you should be able to:
- Get Census data from the IPUMS/NHGIS data portal
- Get Census data with cenpy
- Use the Socrata API


Some of today's lessons borrow from: 
- [The cenpy documentation](https://github.com/cenpy-devs/cenpy)
- [The Socrata SODA API documentation](https://dev.socrata.com/consumers/getting-started.html)

In [None]:
# We are going to start importing the libraries we need
# all in one cell. 
# It is a good practice to keep all the imports in one cell so that
# we can easily see what libraries we are using in the notebook.
import pandas as pd
import numpy as np
import geopandas as gpd
import os

import matplotlib.pyplot as plt
import seaborn as sns

## The set_context() function is really useful!
## It allows us to set the size of the fonts in our plots based on whether 
## we are making a poster, a talk, a notebook, etc.

## If you are only presenting these figures in your jupyter notebook, 
## there is no need to set the context to be "talk" or "poster"
## But, I sometimes set my context to be "talk" or "poster" even for articles
## because I like the fonts to be bigger.
sns.set_context(context='paper')

# we use the inline backend to generate the plots within the browser
%matplotlib inline

os.getcwd()


# 0. Census Data: Census survey and statistical boundaries

## 0.1 Census Surveys
The United States Census Bureau has been collecting information on residents of the country since 1790 through surveys sent by mail (since 2020, you can submit your survey by phone, mail, or online). Census data is used for a variety of governmental purposes including: provision of housing, infrastructure, and public amenities; making districting decisions for schools, precincts, and elections; and more generally, to understand the population, socio-economic, and demographic characteristics of residents in the country. [Did you know that the punch card machine (a prototype for the computer) was created for the 1890 Census?](https://en.wikipedia.org/wiki/Tabulating_machine)

The US Census has historically been taken every 10 years. Every household in the U.S. is sent a Census survey (and you are legally required to respond.) In 2005, the Census Bureau created the American Community Survey (ACS), which is collected every month on a sample of households.

Since 2010, the Census has contained only about 10 questions (historically called the "short form census") on topics such as age, sex, race, Hispanic origin, and owner/renter status. The ACS contains a larger set of questions on topics such as employment, education, and transportation.

Because the ACS is more frequent, it is often used for more current census needs; however, because it is also a sample, we generally need a longer time span to get a robust sample. This is why we will often use the **5-yr ACS** (for ex: 2012 - 2016 ACS) to represent the year (here, 2014).
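The "representative year" arithmetic above can be sketched in a couple of lines. This is only an illustration; the function name `acs_midpoint_year` is made up for this example and is not part of any library.

```python
def acs_midpoint_year(start_year: int, end_year: int) -> int:
    """Return the middle year of a multi-year ACS span (hypothetical helper)."""
    return (start_year + end_year) // 2


# The 2012-2016 ACS stands in for 2014; the 2017-2021 ACS stands in for 2019
print(acs_midpoint_year(2012, 2016))
print(acs_midpoint_year(2017, 2021))
```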

Census data is often the baseline survey dataset in the area of urban planning because it provides racial, socio-economic, housing, etc. information that is often the highlight or backdrop of a study.

## 0.2 Census Geographies
There are different, often nested Census geographic regions used for different administrative scales. The most commonly used regions are statistical areas, typically nested within each other, whose boundaries are defined by certain physical, administrative, and population constraints. For instance, a **Census block** is bounded by physical features such as streets and administrative boundaries such as city limits and school districts. **Block groups**, the smallest unit of analysis that is still mostly statistically robust, are collections of Census blocks (hence the name) that generally have between 600 and 3,000 people. **Census tracts** generally have between 1,200 and 8,000 people. [Here's more information](https://pitt.libguides.com/uscensus/understandinggeography) about Census geographies if you're curious.

See the image below for how these regions nest within one another.

<figure>
<img src="https://www.dropbox.com/s/8w69pibhwffgoc0/qgis_censusgeography.png?dl=1" alt="drawing" width="500" style="display: block; margin: 0 auto"/>
</figure>
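The nesting is also visible in the identifiers themselves. Below is a minimal sketch, assuming the standard 15-digit block GEOID layout (2-digit state, 3-digit county, 6-digit tract, 4-digit block, where the first block digit is the block group). The helper name and the example GEOID are made up for illustration.

```python
def split_block_geoid(geoid: str) -> dict:
    """Split a 15-digit Census block GEOID into its nested parent IDs (hypothetical helper)."""
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError("expected a 15-digit block GEOID")
    return {
        "state": geoid[:2],         # 2 digits
        "county": geoid[:5],        # state + 3-digit county
        "tract": geoid[:11],        # county + 6-digit tract
        "block_group": geoid[:12],  # tract + first digit of the block
        "block": geoid,             # full 15 digits
    }


# A made-up block GEOID, used only to show the slicing
parts = split_block_geoid("360610001001000")
print(parts)
```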


## 0.3 [Social Explorer](https://www-socialexplorer-com.proxy.library.cornell.edu/ezproxy)
This is a great tool for looking at Census and ACS data visually. They also have datasets beyond just Census Bureau data. You can also output images and shareable links to the map. I encourage you to sign up (through Cornell it's free) and explore this tool on your own time.



# 1. IPUMS

I have found that the easiest way to query and download a large Census dataset is to use a service provided by [IPUMS](https://www.ipums.org/).

You will see that IPUMS provides data from various sources, including the Census Bureau, the Bureau of Labor Statistics, the National Science Foundation, the National Center for Health Statistics, the Centers for Disease Control, and the National Aeronautics and Space Administration. According to IPUMS, there is also census or survey data available for over 100 countries. For US data, there is Census data going back to 1790.

Here, we are using Census data from IPUMS's [National Historical Geographic Information System](https://www.nhgis.org/) (NHGIS).

## 1.1 Getting started with IPUMS

First you'll need to register an account here:
https://uma.pop.umn.edu/nhgis/user/new

IPUMS will also send you an email verification.

From the [NHGIS website](https://www.nhgis.org/), click on **Get Data**. This should take you to a page like this:

<figure>
<img src="https://www.dropbox.com/s/jv9aciqkfemgkzt/Screen%20Shot%202023-02-18%20at%2011.51.26%20AM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

## 1.2 Getting data with NHGIS
We are going to use the **2017 - 2021** 5-year ACS at the **tract level** in the U.S. to understand the **relationship between educational attainment and income**. (Remember: we use the 5-year ACS to represent the median year, here 2019.)

**The aim: get a geospatial dataset of tracts, education, and income**

There are two main sections on the page:
- **APPLY FILTERS** allows you to choose which datasets you want and at what level of granularity. (The default year is 2019, but we can change this.)
- **SELECT DATA** lets you choose specific tables and columns in your dataset once you've chosen your dataset.

Now, let's make the following selections:
1. In geographic levels, select `TRACT` amongst the different levels and hit **SUBMIT**
<figure>
<img src="https://www.dropbox.com/s/xoz9tfsff7ahezv/Screen%20Shot%202023-02-18%20at%2011.48.16%20AM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

2. In years, select `2017-2021` and hit **SUBMIT**
3. In topics, select **POPULATION** on the left hand panel, and select `Educational Attainment` and hit **SUBMIT**


Now, we can see that our **SELECT DATA** table has been populated by our filter with the relevant scale, topics, and years. A good first step is to sort by Popularity. You should see something like this.
<figure>
<img src="https://www.dropbox.com/s/f7yobykeef23n8v/Screen%20Shot%202023-02-18%20at%2011.51.54%20AM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

Next to the table name `Educational Attainment for the Population 25 Years and Over`, select the **plus**.

4. In topics again, now **de-select** `Educational Attainment`, select `Per Capita Income`, and hit **SUBMIT**. 
Then select the `Per Capita Income in the Past 12 Months (in 2021 Inflation-Adjusted Dollars)` table. 

<figure>
<img src="https://www.dropbox.com/s/a76xdvajo488gf2/Screen%20Shot%202023-02-18%20at%2012.06.18%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>


After selecting our data, we need to select our Census tract geometries that go along with it.
<figure>
<img src="https://www.dropbox.com/s/fr5i9rqeay139tp/Screen%20Shot%202023-02-18%20at%2012.07.19%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

Under **SELECT DATA**, go to the third tab and select the 2021 Census Tract.

Your data cart in the upper right should show:
- `2 SOURCE TABLE`
- `0 TIMES SERIES TABLES`
- `1 GIS FILE`

Confirm this is what you have, then click **Continue**. You'll be taken to a Data Options page, where you click the **Continue** button again.

In the description box, you can write anything. I recommend including text that lists the tables you selected, the ACS vintage, and the scale. (For ex: `2017-2021 ACS educational attainment, per capita income, tract level`) Then hit **Submit**. (If you haven't logged in already, you might have to do that first.)

You'll be taken to your Extracts History. It might take a minute, but soon the **Download Data** column should be populated with two buttons that allow you to download the tables and GIS files.
<figure>
<img src="https://www.dropbox.com/s/5bf3uv5d1tmgbgs/Screen%20Shot%202023-02-18%20at%2012.11.05%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

You should have two zipped files called: 
- `nhgisXXXX_csv.zip`, where `XXXX` is the job number. 
- `nhgisXXXX_shape.zip`, where `XXXX` is the job number. 

Save your `TABLES` and `GIS FILES` to your folder for this class and unzip them. 

## 1.3 Read in the data
Finally, let's take a look at the data. 

In [None]:
# tracts = gpd.read_file('nhgis0124_shape/nhgis0124_shapefile_tl2021_us_tract_2021.zip')
tracts = gpd.read_file('nhgis0001_shape/nhgis0001_shapefile_tl2021_us_tract_2021.zip')

In [None]:
## I got a UnicodeDecodeError when I tried to read the csv without specifying 
## that the encoding should be latin-1.
## Latin-1 encoding is different from UTF-8, which is the default encoding in Python.
## Encoding is how characters are converted into a binary format (bytes)
## to be stored or transmitted.

## Specifying the encoding allows the file to be read correctly when we decode it back to characters.
# acs_data = pd.read_csv('nhgis0124_csv/nhgis0124_ds254_20215_tract.csv',encoding='latin-1')

acs_data = pd.read_csv('nhgis0001_csv/nhgis0001_ds254_20215_tract.csv',encoding='latin-1')
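The encoding issue described in the comments above can be reproduced in miniature without any files. This is a small sketch using a place name with a non-ASCII character; it is not taken from the NHGIS file itself.

```python
# Encode a place name containing a non-ASCII character as Latin-1 bytes
raw_bytes = "Doña Ana County".encode("latin-1")

# Decoding those bytes as UTF-8 fails, because the byte for "ñ" in Latin-1
# is not a valid UTF-8 sequence on its own
try:
    raw_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding that was used to write the bytes works
text = raw_bytes.decode("latin-1")
print(utf8_failed, text)
```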

In [None]:
# tracts[tracts['STATEFP'] == '36'].to_file('ny_tracts.geojson', driver='GeoJSON')
# acs_data[acs_data['STATEA'] == 36].to_csv('ny_acs.csv')

If you had trouble opening the shapefile for the entire country, I have created two files for just NY state, which are smaller and will be easier to handle: 
- `ny_tracts.geojson` (download [here](https://www.dropbox.com/s/e7u3fphsjp7fqd9/ny_tracts.geojson?dl=0))
- `ny_acs.csv` (download [here](https://www.dropbox.com/s/vvht4wpr954nvj9/ny_acs.csv?dl=0))

You can still use your `xxxx_codebook.txt` file. 

#### Tracts
Let's take a look at the tracts dataset. 

In [None]:
tracts.shape

In [None]:
tracts.head()

In [None]:
tracts[tracts['STATEFP']=='01'].plot(figsize=(10,10))

Notice that we only have geographical characteristics in our shapefile. 

In [None]:
tracts.columns

#### Census data
Now, let's take a look at our Census data. 

In [None]:
acs_data.shape

In [None]:
acs_data.head()

Notice that 
- Most of the column names are coded in a way that we'll need the `xxxx_codebook.txt` file to parse. 
- The `GISJOIN` column is the geometry ID that allows us to join the shapefiles to the data. 
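For reference, a tract-level `GISJOIN` is closely related to the Census Bureau's own tract GEOID. The sketch below assumes the layout described in the NHGIS documentation (a leading `G`, the 2-digit state code, a `0`, the 3-digit county code, a `0`, then the 6-digit tract code). The helper name and the example ID are made up for illustration.

```python
def gisjoin_to_geoid(gisjoin: str) -> str:
    """Convert a tract-level NHGIS GISJOIN to an 11-digit Census GEOID (hypothetical helper).

    Assumed layout: "G" + SS + "0" + CCC + "0" + TTTTTT (14 characters).
    """
    if len(gisjoin) != 14 or not gisjoin.startswith("G"):
        raise ValueError("expected a 14-character tract GISJOIN starting with 'G'")
    state = gisjoin[1:3]
    county = gisjoin[4:7]
    tract = gisjoin[8:14]
    return state + county + tract


# A made-up tract ID, used only to show the slicing
print(gisjoin_to_geoid("G3600610000100"))
```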

Let's first resolve the column name issue. Open up your codebook, which should be in the same folder as your CSV file. 

<figure>
<img src="https://www.dropbox.com/s/0tseofmb7c2sl79/Screen%20Shot%202023-02-18%20at%2012.49.23%20PM.png?dl=1" alt="drawing" width="900" style="display: block; margin: 0 auto"/>
</figure>

Notice that you have many different "context" columns that allow you to select by different geographies. If you scroll down, you can see what each column name translates into: 

<figure>
<img src="https://www.dropbox.com/s/gqos37pt7goayt7/Screen%20Shot%202023-02-18%20at%2012.55.24%20PM.png?dl=1" alt="drawing" width="900" style="display: block; margin: 0 auto"/>
</figure>


#### Column selection 
Notice that our original data selections from the NHGIS portal have been broken down into "tables". 
- Table 1 contains columns related to educational attainment, where each column represents the **number of people in a tract whose highest educational attainment is X**. 
- Table 2 is our per capita income for each tract. 

Let's use **Bachelor's degree and above** as our higher ed proxy. Since these are counts and we probably also want percentages, we'll need the **Total** population 25 years and over. And we'll need the **Per capita income**: 
- AOP8E001:    Total
- AOP8E022:    Bachelor's degree
- AOP8E023:    Master's degree
- AOP8E024:    Professional school degree
- AOP8E025:    Doctorate degree
- AORME001:    Per capita income in the past 12 months (in 2021 inflation-adjusted dollars)
 

As we saw above, there are 95 columns in our `acs_data` df! I don't really need all of those to proceed as I have a clear, targeted question. Therefore, I will filter my columns in `acs_data` to only contain those I need for my analysis. 

In [None]:
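## I also need the GISJOIN so I can join this to my shapefiles
cols_need = ['GISJOIN','AOP8E001','AOP8E022','AOP8E023','AOP8E024','AOP8E025','AORME001']
## .copy() gives us an independent dataframe, so adding columns later
## does not trigger pandas' SettingWithCopyWarning
acs_data_new = acs_data[cols_need].copy()
acs_data_new.head()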

Let's rename our column names. 

In [None]:
acs_data_new = acs_data_new.rename(columns={'AOP8E001':'population_over_25',
                                            'AOP8E022':'ba',
                                            'AOP8E023':'ma',
                                            'AOP8E024':'prof',
                                            'AOP8E025':'doctorate',
                                            'AORME001':'per_capita_income'})

And create a new column that is the share of people whose highest level of education is a bachelor's degree or above. 

In [None]:
acs_data_new['perc_highered'] = (acs_data_new['ba']+acs_data_new['ma']+acs_data_new['prof']+acs_data_new['doctorate'])/acs_data_new['population_over_25']

In [None]:
acs_data_new.head()

Finally, let's join this DF back to the tracts data. 
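Before running the join, it can help to see how an inner join behaves when the two sides do not match perfectly. The sketch below uses two tiny made-up tables (the IDs and values are invented) and the `indicator=True` option of `merge` to count matched and unmatched rows.

```python
import pandas as pd

# Two toy tables sharing a GISJOIN-style key (all values are made up)
toy_shapes = pd.DataFrame({"GISJOIN": ["G01", "G02", "G03"],
                           "name": ["a", "b", "c"]})
toy_table = pd.DataFrame({"GISJOIN": ["G01", "G02", "G04"],
                          "per_capita_income": [30000, 45000, 52000]})

# The default (inner) join keeps only keys present on both sides
toy_inner = toy_shapes.merge(toy_table, on="GISJOIN")

# An outer join with indicator=True adds a _merge column showing
# whether each row came from the left side, the right side, or both
toy_check = toy_shapes.merge(toy_table, on="GISJOIN", how="outer", indicator=True)
merge_counts = toy_check["_merge"].value_counts()

print(len(toy_inner))
print(merge_counts)
```

If `left_only` or `right_only` is larger than expected on the real data, some tracts are being dropped silently by the inner join.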

In [None]:
## I'm going to use the default join method, which is an inner join
## So there is no need to specify how='inner'
tracts_acs = tracts.merge(acs_data_new, on='GISJOIN')

In [None]:
tracts_acs.head()

## 1.4 Descriptive statistics
Let's take a look at some descriptive statistics. I'm only going to select a subset of columns I'm interested in for my descriptive analysis. For instance, `ALAND` (the land area of the tract) is not really relevant in this analysis. 

In [None]:
tracts_acs.columns

In [None]:
cols_using = ['population_over_25', 'ba', 'ma', 'prof', 'doctorate',
       'per_capita_income', 'perc_highered']

In [None]:
tracts_acs[cols_using].describe()

In [None]:
tracts_acs[cols_using].corr()

In [None]:
## Even plotting 7 columns took about 11 seconds for me!
## I've set the transparency to be really low at 0.2, so that we can see the density of the points
sns.pairplot(tracts_acs[cols_using],
            markers='+',diag_kind='kde',
            plot_kws={'alpha':0.2,'s':30,'color':'red'},
            diag_kws={'color':'red'})

Oh, pretty!

## Q.1 (2 pts)
In a markdown cell, list at least three reasons why the figure above is not clear. Can you improve it by focusing only on New York State and plotting the percent with higher education?

In [None]:
## Just plotting NY state so that the map doesn't take so long to render
tracts_acs[tracts_acs['STATEFP']=='36'].plot(column='perc_highered',figsize=(10,10),legend=True)

# 2. `cenpy`

`cenpy` is a Python library that allows us to easily use the [US Census Bureau's API](http://www.census.gov/data/developers/data-sets.html) to programmatically read the publicly available data sets into a dataframe or geodataframe. This includes not only the Decennial Census and American Community Survey, but also other data products from the Bureau such as the Longitudinal Employer-Household Dynamics dataset, the Commodity Flow Survey, Survey of Business Owners, etc. 



#### APIs
What is an API? API stands for Application Programming Interface. It is essentially a tool that allows our computers to communicate with and use an API host's (in this case, the Census Bureau's) "servers" (computers) in a regulated manner. There are different kinds of APIs, though most APIs that you will use are called REST APIs. With a REST API, we (the "client") send a request to the Census Bureau (the "server"); its API reads the request and returns the outputs we asked for, which is the Census data in this case. `cenpy` acts as a facilitator here, since the Census's API is a little clunky. 
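To make the "request" idea concrete, here is a sketch of what a request URL to the Census Bureau's API can look like. It assumes the endpoint layout from the Census Bureau's public API documentation (`/data/<year>/acs/acs5` with `get`, `for`, and `in` parameters). The helper name is made up, and the requests that `cenpy` sends internally may be structured differently. Nothing is downloaded here; the code only builds the URL string.

```python
def census_acs5_url(year, variables, state_fips, county_fips):
    """Build a tract-level ACS 5-year request URL (hypothetical helper, no network call)."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    get = ",".join(["NAME"] + list(variables))
    return (f"{base}?get={get}"
            f"&for=tract:*&in=state:{state_fips}&in=county:{county_fips}")


# State 36 is New York; county 061 is New York County (Manhattan)
example_url = census_acs5_url(2019, ["B15003_001E", "B15003_022E"], "36", "061")
print(example_url)
```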

In [None]:
import cenpy 

## 2.1 Getting ACS data from Cenpy
`cenpy.products` is the tool that we will mostly be interacting with. Here let's explore `cenpy.products.ACS()`

In [None]:
## This creates the "connection" to the latest 5-year ACS data in the API, which is 2019
## cenpy.products.ACS(2018) would give us the 2014-2018 ACS data
acs_cp= cenpy.products.ACS()

We then need to tell cenpy how to extract the data based on geography. We can use one of the following methods:
- `from_place()`
- `from_county()`
- `from_state()`
- `from_csa()`
- `from_msa()`

We will also need to tell cenpy which [ACS Table Lists and Shells](https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2019.html#list-tab-LO1F1MU1CQP3YOHD2T) we want. Luckily, we already have this from the NHGIS codebook. 

Notice in your codebook the line **Source code: B15003**. The "source" here is the original Census table. 
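The NHGIS column codes and the Census API variable names refer to the same table lines. As a sketch, assuming the usual Census naming pattern of table ID, underscore, 3-digit line number, and `E` for "estimate", the variable names for the lines we used earlier can be generated like this (the helper name is made up):

```python
def acs_estimate_name(table_id: str, line: int) -> str:
    """Build a Census API estimate variable name, e.g. B15003_022E (hypothetical helper)."""
    return f"{table_id}_{line:03d}E"


# Lines 1 (total), 22 (bachelor's), 23 (master's), 24 (professional), 25 (doctorate)
higher_ed_vars = [acs_estimate_name("B15003", n) for n in (1, 22, 23, 24, 25)]
print(higher_ed_vars)
```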

Lastly, we need to indicate the geography level.

In [None]:
## Note, because this a free API, it is limited to
acs_cp_data = acs_cp.from_place('New York, NY',
                                level='tract',
                                variables=['B15003'])

In [None]:
acs_cp_data.head()

In [None]:
## They are all in the same order as the NHGIS data, so I can just use the same columns
acs_cp_data['perc_highered'] = (acs_cp_data['B15003_022E']+acs_cp_data['B15003_023E']+acs_cp_data['B15003_024E']+acs_cp_data['B15003_025E'])/acs_cp_data['B15003_001E']

In [None]:
acs_cp_data.plot(column='perc_highered',figsize=(10,10),legend=True)

Pretty neat! There are certain limitations here (for instance, we can't easily get the whole country at once, given the API's limitations), but cenpy is an easy-to-use, out-of-the-box tool for quick case studies. 

## Q.2 (5 pts)
Use the [ACS table](https://www.census.gov/programs-surveys/acs/technical-documentation/table-shells.2019.html#list-tab-LO1F1MU1CQP3YOHD2T) lookup page to download the "2019 ACS Detailed Table Shells", then:
- Find the table ID (in the format of "B00000") for the **HISPANIC OR LATINO ORIGIN BY RACE** table 
- For the state of Massachusetts, use the `.from_state()` function with your `acs_cp` engine. 

Call this geodataframe `ma`.


(This query took about 1 minute for me.)

In [None]:
ma = ### INSERT YOUR CODE HERE

Now calculate the percentage of "Non Hispanic or Latino, White Alone" in the state.

In [None]:
ma['perc_white'] = ### INSERT YOUR CODE HERE

And create a plot of the non-hispanic White percentage across the state. 

In [None]:
fig, ax = plt.subplots(figsize=(15,15))

ma.plot(### INSERT YOUR CODE HERE)
ax.set_axis_off()

## Use tight_layout to remove the white space around the plot
plt.tight_layout()

## I forgot to show you all how to save down your plots!
fig.savefig('MA_perc_white.png')   # save the figure to file

# 3. Socrata and Socrata APIs
Many government open data portals were built by the same company, Socrata (acquired a few years back by Tyler Technologies), which created the infrastructure and front-end interface to access open government data. 

Here, we are going to re-visit our NYCHA developments dataset [here](https://data.cityofnewyork.us/Housing-Development/Map-of-NYCHA-Developments/i9rv-hdr5).


You may have noticed that, when we go to export data, that there is a **SODA API** section: 
<figure>
<img src="https://www.dropbox.com/s/7pvi2f0jnbrlwdt/Screen%20Shot%202023-02-18%20at%205.57.09%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>

SODA is Socrata's API, which gives users, from researchers to (more often) people building tools and applications, access to open-portal data. This is most useful when you have to programmatically connect your data export to something else. For instance, if you're running a website that needs to update data in real-time or if you don't want to download an updated dataset each time, you can connect your notebook or app to this API. Click to expand the **SODA API** section: 
<figure>
<img src="https://www.dropbox.com/s/ure1ep5y7ussvxs/Screen%20Shot%202023-02-18%20at%205.57.44%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>


**Copy the API endpoint URL**. 

## 3.1 API endpoint to GeoDataFrame

We can pretty easily turn this JSON file into a geodataframe. FYI, JSON stands for "JavaScript Object Notation" and is a file format that was designed for the JavaScript language, but is easily translated to other formats that we know well. 

The good thing is that pandas has a `pd.read_json()` function that will allow us to read this JSON as a DF and eventually turn it into a geodataframe. 
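To see what JSON looks like once it is parsed, here is a small offline sketch using Python's built-in `json` module. The record below is made up for illustration; it assumes, as with GeoJSON-style data, that the geometry is stored as a nested object with `type` and `coordinates` keys.

```python
import json

# A made-up JSON payload shaped like one record from an open data endpoint
payload = """
[
  {
    "developmen": "EXAMPLE HOUSES",
    "borough": "MANHATTAN",
    "the_geom": {
      "type": "Polygon",
      "coordinates": [[[-73.94, 40.79], [-73.93, 40.79], [-73.93, 40.80], [-73.94, 40.79]]]
    }
  }
]
"""

records = json.loads(payload)     # JSON array  -> Python list
first = records[0]                # JSON object -> Python dict
geom = first["the_geom"]          # nested object -> nested dict

print(type(records), type(first), geom["type"])
```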

In [None]:
nycha = pd.read_json('https://data.cityofnewyork.us/resource/5j2e-zhmb.json')

In [None]:
nycha.head()

In [None]:
nycha.shape

Notice that there is a **the_geom** column that looks like it might have geometry information. 

In [None]:
## Ignore the warnings 
nycha['the_geom'].head()

We are going to turn these GeoJSON-like entries into Shapely geometries, which is the only piece missing before we can turn this dataframe into a GeoDataFrame. 

In [None]:
from shapely.geometry import shape

## the apply method applies the function to each row of the dataframe
nycha['the_geom'] = nycha['the_geom'].apply(shape)

## I'm going to use the GeoDataFrame method to create a GeoDataFrame
nycha_geo = gpd.GeoDataFrame(nycha,geometry='the_geom')

In [None]:
nycha_geo.head()

In [None]:
## Faint, but these are our buildings

nycha_geo.plot(figsize=(10,10))

## 3.2 Filtering
The SODA API allows us to filter data from the endpoint url. Why might we want to do this? For one, there are very large datasets such as the [311 Service Requests dataset](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) (with 32 million rows) or the [Open Parking and Camera Violations](https://data.cityofnewyork.us/City-Government/Open-Parking-and-Camera-Violations/nc67-uf89) (with 93 million rows!) that are difficult to work with due to their size. 

There are two ways to filter data using the SODA API: 
- [Simple Filters](https://dev.socrata.com/docs/filtering.html)
- [SoQL Queries](https://dev.socrata.com/docs/queries/)


**Both of these filters are text we append to the original endpoint URL.**

### 3.2.1 Simple Filters
Any column in the dataset can be used to filter for specific values within that column. The filter takes the format:

`http://yourendpointurl.json?col_name=element_name`

In [None]:
nycha_url_orig = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json"

## Note, this query is CASE-SENSITIVE! 
## If the column name is in all caps, it must be in all caps here
## If the value of interest is in all caps, it must be in all caps here
nycha_url_mh = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json?borough=MANHATTAN"

In [None]:
nycha_mh = pd.read_json(nycha_url_mh)
nycha_mh['the_geom'] = nycha_mh['the_geom'].apply(shape)
nycha_mh_geo = gpd.GeoDataFrame(nycha_mh,geometry='the_geom')

In [None]:
nycha_mh_geo.plot()

You can join multiple queries with an `&`. 
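Instead of typing the `?` and `&` by hand, the standard library can assemble the query string from a dictionary. This is a minimal sketch; `simple_filter_url` is a made-up helper name, and only the URL string is built (no request is sent).

```python
from urllib.parse import urlencode

NYCHA_ENDPOINT = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json"


def simple_filter_url(base, **filters):
    """Append column=value filters to an endpoint URL (hypothetical helper)."""
    return f"{base}?{urlencode(filters)}"


# Keyword names are the column names; values are the values to match
filtered_url = simple_filter_url(NYCHA_ENDPOINT, borough="MANHATTAN", developmen="JEFFERSON")
print(filtered_url)
```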

In [None]:
nycha_url_mh_jeff = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json?borough=MANHATTAN&developmen=JEFFERSON"
nycha_mh_jeff = pd.read_json(nycha_url_mh_jeff)
nycha_mh_jeff['the_geom'] = nycha_mh_jeff['the_geom'].apply(shape)
nycha_mh_jeff_geo = gpd.GeoDataFrame(nycha_mh_jeff,geometry='the_geom')

In [None]:
nycha_mh_jeff_geo.plot()

### 3.2.2 SoQL Queries
The “Socrata Query Language” (SoQL) is a simple, SQL-like query language specifically designed for making it easy to work with data on the web. If you're familiar with SQL, the following may be familiar. And even if you're not, this will seem pretty intuitive. 

Here are all the different parameters that you can use in this query: 
<figure>
<img src="https://www.dropbox.com/s/r4edgdtyzm2vrxn/Screen%20Shot%202023-02-19%20at%2010.09.27%20AM.png?dl=1" alt="drawing" width="800" style="display: block; margin: 0 auto"/>
</figure>


**One key formatting difference here is the use of white space, which is allowed in the query but must be translated into `%20` for URL purposes, since no white spaces are allowed in a URL.** I am using the `.replace("to_be_replaced_str","new_str")` function to replace spaces with `%20`.
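An alternative to `.replace()` is to let `urllib.parse` do the escaping. The sketch below is one possible approach, not the only one: it keeps `$`, `'`, and `=` unescaped so that the result matches what the `.replace(" ", "%20")` approach produces. The helper name `soql_url` is made up.

```python
from urllib.parse import quote, urlencode

NYCHA_ENDPOINT = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json"


def soql_url(base, where):
    """Build a URL with a $where clause, escaping spaces as %20 (hypothetical helper)."""
    query = urlencode({"$where": where}, safe="$'=", quote_via=quote)
    return f"{base}?{query}"


where_clause = "borough='MANHATTAN' and developmen='JEFFERSON'"
escaped_url = soql_url(NYCHA_ENDPOINT, where_clause)

# The same URL built with the .replace() approach, for comparison
replaced_url = (NYCHA_ENDPOINT + "?$where=" + where_clause).replace(" ", "%20")
print(escaped_url == replaced_url)
```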

The same filtering for Manhattan and the Jefferson Development we did above would look like this: 


In [None]:
## Note the use of single vs double quotes here, since I need to include a single quote in the query
nycha_url_mh_soql = "https://data.cityofnewyork.us/resource/5j2e-zhmb.json?$where=borough='MANHATTAN' and developmen='JEFFERSON'".replace(" ", "%20")

nycha_mh_jeff2 = pd.read_json(nycha_url_mh_soql)
nycha_mh_jeff2['the_geom'] = nycha_mh_jeff2['the_geom'].apply(shape)
nycha_mh_jeff2_geo = gpd.GeoDataFrame(nycha_mh_jeff2,geometry='the_geom')

In [None]:
nycha_mh_jeff2_geo.plot()

### 3.2.3 A more complex SoQL query

Let's say we wanted to look at the [311 Service Requests data](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9). Here are the ways I want to filter the dataset based on the columns available: 
- **Created Date** is since Feb 2023
- **Complaint Type**  is `Noise - Residential`
- **Descriptor** is `Loud Music/Party` 

Looking at the [311 API docs](https://dev.socrata.com/foundry/data.cityofnewyork.us/erm2-nwe9) will give you some example queries and will also show you the correct column names for the API. You can also find the column names when you click on each column in the "Columns in the Dataset" section of the data homepage. 

<figure>
<img src="https://www.dropbox.com/s/wlrh8jzes9dcsvv/Screen%20Shot%202023-02-19%20at%2011.55.08%20AM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>


In [None]:
servicereq_url = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$where=created_date between '2023-02-01T0:00:00.000' and '2023-02-19T0:00:00.000' and complaint_type='Noise - Residential' and descriptor='Loud Music/Party'".replace(" ", "%20")
servicereq = pd.read_json(servicereq_url)

Let's turn this into a GeoDataFrame

In [None]:
servicereq_geo = gpd.GeoDataFrame(servicereq, 
                                  geometry=gpd.points_from_xy(servicereq['longitude'], 
                                                              servicereq['latitude']))

In [None]:
servicereq_geo.plot(markersize=2)

## 3.3 `offset` and `limit`
The issue with using this endpoint is that, by default, it returns at most 1,000 rows per query. You will see the documentation refer to this as "pages" sometimes.


In [None]:
servicereq.shape

What to do? 

One way to get around this is to use the `limit` and `offset` parameters. From the SODA documentation: 

>The $offset parameter is most often used in conjunction with $limit to page through a dataset. The $offset is the number of records into a dataset that you want to start, indexed at 0. For example, to retrieve the “4th page” of records (records 151 - 200) where you are using $limit to page 50 records at a time, you’d ask for an $offset of 150.

In [None]:
servicereq_url_offset = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$limit=50&$offset=150&$where=created_date between '2023-02-01T0:00:00.000' and '2023-02-19T0:00:00.000' and complaint_type='Noise - Residential' and descriptor='Loud Music/Party'".replace(" ", "%20")
servicereq_offset = pd.read_json(servicereq_url_offset)

This is now 50 entries of the "4th page".
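The arithmetic behind that offset is simple enough to write down. A small sketch (the function name is made up): with pages numbered from 1, the offset for a page is the number of records on all the pages before it.

```python
def page_offset(page: int, limit: int) -> int:
    """Return the $offset for a 1-indexed page of `limit` records (hypothetical helper)."""
    if page < 1:
        raise ValueError("pages are numbered starting at 1")
    return (page - 1) * limit


# The "4th page" with 50 records per page starts at offset 150
print(page_offset(4, 50))
```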

In [None]:
servicereq_offset

So, to get all the data, what we can do is run a loop to change that offset amount iteratively. 

OR

If we are getting the data just once, we can use the filter function, accessible through the  "View Data" button on the dataset's home page. 

<figure>
<img src="https://www.dropbox.com/s/oz26ti7y164pm8r/Screen%20Shot%202023-02-19%20at%2012.35.21%20PM.png?dl=1" alt="drawing" width="1000" style="display: block; margin: 0 auto"/>
</figure>



### 3.3.1 A short review of loops

In [None]:
my_counter = np.arange(0,1000,50)
print(my_counter)

In [None]:
# The for loop will iterate through each value in the list
# The {} is a placeholder for the value in the list within a string

for i in my_counter: 
    print("Current Counter is now at {}".format(i))

In [None]:
## reset i to 0
i = 0
## The while loop will continue to run until the condition is no longer true
while i < 1000:
    print("Current Counter is now at {}".format(i))
    
    ## This is an example of an incrementer
    ## An incrementer is a variable that is used to increment a value
    ## After each iteration, the value of i will increase by 50
    i = i + 50

In [None]:
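## A for loop advances i by itself, so no incrementer is needed here.
## break exits the loop early once the condition is met.
for i in np.arange(0,100000,50):
    print("Current Counter is now at {}".format(i))

    if i >= 1000:
        print("We are done")
        break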

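To programmatically run different queries, I am just going to loop over a list of offsets, request one page per offset, and collect the resulting dataframes in a list. 

This might take a while to run and might not work at all given our limit of 1,000 requests an hour. :/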

In [None]:
## I actually don't know what the upper range is for my dataset, but I will just use 100,000
# offset_list = np.arange(0,100000,50)

# I'm actually going to use a smaller list for demo and not overloading the API
offset_list_smaller = np.arange(0,200,50)

list_of_dfs = []

for offset in offset_list_smaller:
    servicereq_url_offset = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$limit=50&$offset={}&$where=created_date between '2023-02-01T0:00:00.000' and '2023-02-19T0:00:00.000' and complaint_type='Noise - Residential' and descriptor='Loud Music/Party'".replace(" ", "%20").format(offset)
    servicereq_offset = pd.read_json(servicereq_url_offset)

    ## Here I am creating a list of dataframes by appending each dataframe to the list
    list_of_dfs.append(servicereq_offset)

I now have a list of dataframes.

In [None]:
list_of_dfs

In [None]:
## pd.concat will concatenate the dataframes in the list
## to create a single dataframe
servicereq_final = pd.concat(list_of_dfs)

If I were to really try and get all this data, I'd put a `sleep()` call from the library `time` to pause my code from running the next line for a certain amount of time. 
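The stopping rule for such a loop can be tried out offline first. In the sketch below, `fake_fetch` is a stand-in for the API call (it just slices a made-up list), and `fetch_all` is a hypothetical helper that keeps requesting pages until one comes back with fewer rows than the limit, which signals the last page.

```python
def fetch_all(fetch_page, limit, max_pages=100):
    """Collect rows page by page until a short page signals the end (hypothetical helper)."""
    rows = []
    for page_number in range(max_pages):
        page = fetch_page(limit=limit, offset=page_number * limit)
        rows.extend(page)
        # A page with fewer than `limit` rows (including an empty one) is the last page
        if len(page) < limit:
            break
    return rows


# Stand-in for the API: 120 made-up "records"
FAKE_ROWS = list(range(120))


def fake_fetch(limit, offset):
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all(fake_fetch, limit=50)
print(len(all_rows))
```

With a real endpoint, `fetch_page` would build the URL with `$limit` and `$offset` and read it, and a pause between requests would go inside the loop.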

In [None]:
import time

list_of_dfs = []

for offset in offset_list_smaller:
    servicereq_url_offset = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$limit=50&$offset={}&$where=created_date between '2023-02-01T0:00:00.000' and '2023-02-19T0:00:00.000' and complaint_type='Noise - Residential' and descriptor='Loud Music/Party'".replace(" ", "%20").format(offset)
    servicereq_offset = pd.read_json(servicereq_url_offset)

    ## Here I am creating a list of dataframes by appending each dataframe to the list
    list_of_dfs.append(servicereq_offset)
    
    ## Stop once a page comes back with fewer than 50 rows: that was the last page
    if servicereq_offset.shape[0] < 50:
        print("We are done")
        break

    ## I am adding a sleep timer to avoid overloading the API
    ## The sleep timer will pause the code for 3 minutes
    ## This gives me 3 min/run for each 50 records = 20 queries per hour = 1000 records per hour
    time.sleep(180)

servicereq_final = pd.concat(list_of_dfs)

Lastly! Don't think this means we can just get all the data at once. Each query we make "costs" the API provider resources. To ensure that everyone is able to use the API, the provider will limit your capacity to query. 

>## Throttling and Application Tokens
>Hold on a second! Before you go storming off to make the next great open data app, you should understand how SODA handles throttling. You can make a certain number of requests without an application token, but they come from a shared pool and you’re eventually going to get cut off.
>
>If you want more requests, sign up for a Socrata account, then register for an application token and your application will be granted up to 1000 requests per rolling hour period. If you need even more than that, special exceptions are made by request. You can contact our support team here.

In [None]:
servicereq_url_offset = "https://data.cityofnewyork.us/resource/erm2-nwe9.json?$limit=50&$offset=5000000&$where=created_date between '2023-02-01T0:00:00.000' and '2023-02-19T0:00:00.000' and complaint_type='Noise - Residential' and descriptor='Loud Music/Party'".replace(" ", "%20")
servicereq_offset = pd.read_json(servicereq_url_offset)

In [None]:
servicereq_offset.shape

## Q.3 Querying and Concatenating (5 pts)
- Use the [Film Permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) dataset to retrieve two dataframes: 
    1. The **StartDateTime** should be after July 1, 2022
    2. The **StartDateTime** should be after July 1, 2022 & The **Category** should be `Television`. 
- Create a list of two dataframes with 50 rows per "page"
- Concatenate these two dataframes together into one dataframe
- Show the first 5 rows of the new dataframe.



Use the [Film Permits](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) dataset to retrieve two dataframes: 
1. The **StartDateTime** should be after July 1, 2022
2. The **StartDateTime** should be after July 1, 2022 & The **Category** should be `Television`. 

In [None]:
film_url1 = ## INSERT YOUR CODE HERE
film1 = pd.read_json(film_url1)

film_url2 = ## INSERT YOUR CODE HERE
film2 = pd.read_json(film_url2)

Concatenate these two dataframes together into one dataframe

In [None]:
## INSERT YOUR CODE HERE

Show the first 5 rows of the new dataframe.


In [None]:
## INSERT YOUR CODE HERE