<a href="https://colab.research.google.com/github/odu-cs625-datavis/public/blob/main/Va_Open_Data_Portal_API_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Using the SODA API to access data from the [Virginia Open Data Portal](https://data.virginia.gov)

For more information on using the API, visit the dataset page you're interested in and click the **API** button.  It should provide links to the API docs for that dataset and the [Developer Portal (Socrata.com)](https://dev.socrata.com/).

In this example, we'll access the VDH COVID-19 Public Use Dataset - Cases, available at <https://data.virginia.gov/Government/VDH-COVID-19-PublicUseDataset-Cases/bre9-aqqr>.  The dataset name, which we'll need for the API, is the last part of the URI, `bre9-aqqr`.

As of Sep 29, 2021, this dataset had 74.7k rows and 7 columns, with each row representing the overall count of COVID-19 cases, hospitalizations, deaths for each locality in Virginia by report date since reporting began for this dataset.

The following example is based the API docs for this dataset, available at https://dev.socrata.com/foundry/data.virginia.gov/bre9-aqqr

First, install the `sodapy` package to use the API in Python

In [1]:
!pip install sodapy

Collecting sodapy
  Downloading sodapy-2.1.0-py2.py3-none-any.whl (14 kB)
Installing collected packages: sodapy
Successfully installed sodapy-2.1.0


Then import `pandas` and the `Socrata` package from the `sodapy` library.

Next, we need to specify that we'll be using the data.virginia.gov client.  We are accessing public datasets, so we don't need to provide an authentication token.

In [2]:
import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.virginia.gov", None)



Then we specify which dataset we want to access.  This corresponds to the set of characters at end of the dataset URI.

Note that the `get()` function only returns 1000 results by default.  To access more, use the `limit=` option.

And then we'll convert that to a Pandas data frame for ease of use.

In [9]:
# data = client.get("bre9-aqqr")         # Gets 1000 records
data = client.get("bre9-aqqr", limit=75000)  # We know there are 74.7k records, set max at 75,000
df = pd.DataFrame.from_records(data)   # Convert to pandas DataFrame

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74746 entries, 0 to 74745
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   report_date          74746 non-null  object
 1   fips                 74746 non-null  object
 2   locality             74746 non-null  object
 3   vdh_health_district  74746 non-null  object
 4   total_cases          74746 non-null  object
 5   hospitalizations     74746 non-null  object
 6   deaths               74746 non-null  object
dtypes: object(7)
memory usage: 4.0+ MB


In [11]:
# convert to proper datatypes
df['report_date'] = pd.to_datetime(df['report_date'])
df['total_cases'] = pd.to_numeric(df['total_cases'])
df['hospitalizations'] = pd.to_numeric(df['hospitalizations'])
df['deaths'] = pd.to_numeric(df['hospitalizations'])

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74746 entries, 0 to 74745
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   report_date          74746 non-null  datetime64[ns]
 1   fips                 74746 non-null  object        
 2   locality             74746 non-null  object        
 3   vdh_health_district  74746 non-null  object        
 4   total_cases          74746 non-null  int64         
 5   hospitalizations     74746 non-null  int64         
 6   deaths               74746 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 4.0+ MB


In [13]:
df.head(10)

Unnamed: 0,report_date,fips,locality,vdh_health_district,total_cases,hospitalizations,deaths
0,2021-09-29,51001,Accomack,Eastern Shore,3768,286,286
1,2021-09-29,51003,Albemarle,Blue Ridge,7504,307,307
2,2021-09-29,51005,Alleghany,Alleghany,1938,79,79
3,2021-09-29,51007,Amelia,Piedmont,1299,59,59
4,2021-09-29,51009,Amherst,Central Virginia,4071,198,198
5,2021-09-29,51011,Appomattox,Central Virginia,2217,104,104
6,2021-09-29,51013,Arlington,Arlington,17750,906,906
7,2021-09-29,51015,Augusta,Central Shenandoah,9396,265,265
8,2021-09-29,51017,Bath,Central Shenandoah,416,14,14
9,2021-09-29,51019,Bedford,Central Virginia,9143,353,353
