
# Using the SODA API to access data from the Virginia Open Data Portal

For more information on using the API, visit the dataset page you're interested in and click the API button. It should provide links to the API docs for that dataset and the Developer Portal.

In this example, we'll access the VDH COVID-19 Public Use Dataset - Cases, available at https://data.virginia.gov/Government/VDH-COVID-19-PublicUseDataset-Cases/bre9-aqqr. The dataset identifier, which we'll need for the API, is the last part of the URI, `bre9-aqqr`.
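
The identifier can also be pulled out of the page address programmatically. The following is a minimal sketch using base R's `basename()`; the variable names `datasetPage` and `datasetId` are made up for this illustration and are not used elsewhere in the notebook.

```r
# Dataset page URI from the text above
datasetPage <- "https://data.virginia.gov/Government/VDH-COVID-19-PublicUseDataset-Cases/bre9-aqqr"

# basename() returns everything after the last "/", which here is the dataset identifier
datasetId <- basename(datasetPage)
print(datasetId)
```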

As of Sep 29, 2021, this dataset had 74.7k rows and 7 columns. Each row gives the overall count of COVID-19 cases, hospitalizations, and deaths for one locality in Virginia on one report date, going back to when reporting began for this dataset.

The following example is based on the API docs for this dataset, available at https://dev.socrata.com/foundry/data.virginia.gov/bre9-aqqr, and the RSocrata repo at https://github.com/Chicago/RSocrata.

First, we have to install and load the `RSocrata` library.

In [38]:
install.packages("RSocrata")
library(RSocrata)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



Then, we specify the dataset we want to access. This corresponds to the set of characters at the end of the dataset URI.

In [27]:
vdhDataset <- "bre9-aqqr"

By default, the API returns only 1000 results. To access more, we have to set the `$limit` parameter.

In [None]:
fetchLimit <- 80000   # the dataset had about 74.7k rows when this was written, so request more than that

Then, we construct the URI to request the data. There's nothing you need to change here.

In [None]:
uri <- paste0("https://data.virginia.gov/resource/", vdhDataset, ".json?$limit=", fetchLimit)
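
One thing to watch if you raise the limit: when R converts a number to text, it can switch to scientific notation for some large round values (100000 is a common example), which would put something other than plain digits into the `$limit` parameter. The sketch below is one way to guard against that with `format()`. The variable name `limitText` is introduced only for this illustration.

```r
vdhDataset <- "bre9-aqqr"
fetchLimit <- 80000

# format(..., scientific = FALSE) forces plain digits; trim = TRUE drops any padding
limitText <- format(fetchLimit, scientific = FALSE, trim = TRUE)

# Same URI as before, built from the plain-digit text version of the limit
uri <- paste0("https://data.virginia.gov/resource/", vdhDataset, ".json?$limit=", limitText)
print(uri)
```

For the value used in this notebook the result is the same as the simpler construction; the extra step only matters for larger limits.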

Finally, we read in the data. Again, there's nothing for you to change.

In [28]:
df <- read.socrata(uri)
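
If you only need part of the dataset, the SODA API also accepts a `$where` filter in the request. The following is a hedged sketch of how such a URI might be built for a single locality. The names `whereClause`, `encodedWhere`, and `filteredUri` are hypothetical, the column name `locality` and the value `Accomack` are taken from this dataset's `str()` output, and the request itself has not been run here.

```r
vdhDataset <- "bre9-aqqr"

# SoQL filter on the locality column
whereClause <- "locality = 'Accomack'"

# Percent-encode spaces, quotes, and "=" so the clause is safe inside a query string
encodedWhere <- URLencode(whereClause, reserved = TRUE)

filteredUri <- paste0("https://data.virginia.gov/resource/", vdhDataset,
                      ".json?$where=", encodedWhere)
print(filteredUri)

# Assumption: the resulting URI could then be passed to read.socrata(filteredUri)
```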

In [37]:
str(df)

'data.frame':	74879 obs. of  7 variables:
 $ report_date        : Date, format: "2020-03-17" "2020-03-17" ...
 $ fips               : chr  "51001" "51003" "51005" "51007" ...
 $ locality           : chr  "Accomack" "Albemarle" "Alleghany" "Amelia" ...
 $ vdh_health_district: chr  "Eastern Shore" "Thomas Jefferson" "Alleghany" "Piedmont" ...
 $ total_cases        : num  0 0 0 0 0 0 13 0 0 0 ...
 $ hospitalizations   : num  0 0 0 0 0 0 1 0 0 0 ...
 $ deaths             : num  0 0 0 0 0 0 0 0 0 0 ...


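To be able to use the data, we'll want to make sure each column has the proper datatype (`Date` for the report date, numeric for the counts). The output above already shows these types because that cell was run after the conversion below. Running the conversion on columns that already have the right type leaves them unchanged.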

In [36]:
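# Convert the count columns to numeric (going through character also handles factor input)
df$total_cases <- as.numeric(as.character(df$total_cases))
df$hospitalizations <- as.numeric(as.character(df$hospitalizations))
df$deaths <- as.numeric(as.character(df$deaths))
# Convert the report date to R's Date class
df$report_date <- as.Date(df$report_date)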

In [39]:
str(df)

'data.frame':	74879 obs. of  7 variables:
 $ report_date        : Date, format: "2020-03-17" "2020-03-17" ...
 $ fips               : chr  "51001" "51003" "51005" "51007" ...
 $ locality           : chr  "Accomack" "Albemarle" "Alleghany" "Amelia" ...
 $ vdh_health_district: chr  "Eastern Shore" "Thomas Jefferson" "Alleghany" "Piedmont" ...
 $ total_cases        : num  0 0 0 0 0 0 13 0 0 0 ...
 $ hospitalizations   : num  0 0 0 0 0 0 1 0 0 0 ...
 $ deaths             : num  0 0 0 0 0 0 0 0 0 0 ...
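
To see what the conversion step does when the columns really do arrive as text, here is a small self-contained sketch. The data frame `sampleDf` and its three rows are invented for illustration and are not drawn from the VDH dataset.

```r
# Hypothetical three-row sample in which every column is still character data
sampleDf <- data.frame(
  report_date = c("2020-03-17", "2020-03-18", "2020-03-19"),
  total_cases = c("0", "13", "27"),
  stringsAsFactors = FALSE
)

# Same conversions as in the notebook, applied to the sample
sampleDf$total_cases <- as.numeric(as.character(sampleDf$total_cases))
sampleDf$report_date <- as.Date(sampleDf$report_date)

# Report the class of each column after conversion
columnClasses <- sapply(sampleDf, class)
print(columnClasses)

# NA values here would indicate strings that could not be parsed as numbers
badValues <- sum(is.na(sampleDf$total_cases))
print(badValues)
```

Checking for `NA` values after `as.numeric()` is a cheap way to catch entries that were not valid numbers, since R turns those into `NA` with a warning rather than stopping.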
