# Creating Covid-19 Maps from Online Data

As the Covid-19 pandemic evolves, you have no doubt seen a lot of maps showing its impact. You might have wondered how difficult it is to create these maps. In this section you will learn the entire process, from downloading data to creating a map.


In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')


## Reminder

<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>



## Setup

To start you have to import the specific Python modules you will need. You will learn more about these packages in other Hour of CI lessons, so let's just import everything we need right now.

Click the Run button ( <img src="supplementary/play-button.png" alt="Run button picture" style="display: inline-block;">) below to import our packages (wait until you see the printed message to continue).


In [None]:
import pandas
import geopandas
from matplotlib import pyplot

print("Python modules imported!")

## Download Covid-19 Data
First, we have to find the data. There are lots of sources of Covid-19 data online. For this segment we will use US county-level data released by the New York Times. It's found here: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv.

The code below uses a utility called **wget** to download the data from the URL and save it to a local file called "us-counties.csv" (the -O option sets the name of the output file). Click the Run button ( <img src="supplementary/play-button.png" alt="Run button picture" style="display: inline-block;">) below.

In [None]:
!wget -O us-counties.csv https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

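If **wget** is not available, the same download step can be sketched in Python with the standard library. This is a minimal sketch, not part of the lesson's code: the helper name `download_file` is hypothetical, and the example call is commented out so the block itself makes no network request.

```python
import urllib.request
from pathlib import Path


def download_file(url, destination):
    """Fetch url and save it to destination, similar to `wget -O destination url`.

    `download_file` is a hypothetical helper written for this illustration.
    """
    saved_path, _headers = urllib.request.urlretrieve(url, destination)
    return Path(saved_path)


# Example call (commented out on purpose):
# download_file(
#     "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
#     "us-counties.csv",
# )
```

Either approach leaves a file called "us-counties.csv" in the current folder, which is what the following steps read.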
## Read and view the data
Once you have downloaded the data file, you have to read it using Python. To do that, we'll convert the downloaded file into a data format that our Python program can use. Here we're going to use Dataframes provided by the **pandas** module we just imported.

**Dataframes** can be thought of as spreadsheets for tabular data organized in rows and columns. See an example below.

| Column 1 | Column 2 | Column 3 |
|:---------|:---------|:---------|
|First     |Record    |1         |
|Second    |Record    |2         |

If you want to learn more about Dataframes you can look at the Pandas documentation <a href="https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented">here</a>.

## Read and view the data

The function we will use to read the data is **pandas.read_csv**. To view the data we will use the **head** function, which displays the first 5 data records. Here we can see that each record contains a date, a county, a state, a numerical code for the county and state called a FIPS code, and the number of cases and deaths. We get one record per county for each day on which there is at least one case or death. Click the Run button ( <img src="supplementary/play-button.png" alt="Run button picture" style="display: inline-block;">) below.

In [None]:
# Read the data that we downloaded from the NYT into a dataframe
covid_counties = pandas.read_csv('./us-counties.csv')

# View the first 5 records
covid_counties.head(5)

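To see what **pandas.read_csv** does without the full download, here is a small sketch that reads a few made-up rows from a string. The `deaths` column name and all values are assumptions for illustration. The `dtype` and `parse_dates` arguments are optional extras that the lesson does not use: reading `fips` as text keeps leading zeros in codes such as 01001.

```python
import io

import pandas

# Made-up rows for illustration only; the 'deaths' column name is an assumption.
sample_csv = io.StringIO(
    "date,county,state,fips,cases,deaths\n"
    "2020-03-01,Autauga,Alabama,01001,1,0\n"
    "2020-03-02,Autauga,Alabama,01001,3,0\n"
    "2020-03-02,Unknown,Alabama,,2,0\n"
)

# dtype keeps the FIPS code as text; parse_dates converts the date column to datetimes.
sample = pandas.read_csv(sample_csv, dtype={"fips": str}, parse_dates=["date"])

print(sample.head(2))  # first two records
print(sample.dtypes)   # the type pandas chose for each column
```

Without the `dtype` argument, pandas would read the FIPS codes as numbers, so a code like 01001 would lose its leading zero.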
## Count the number of records

How many records do we have? Let's take a look using pandas' **count** function, which reports the number of non-missing values in each column.


In [None]:
covid_counties.count()

Whoa! That is a lot of records. Too many records actually... We are getting a record for each county for each day on which there is at least one case.

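Because **count** works column by column, its numbers can differ between columns when some values are missing. The sketch below uses a tiny made-up table (not the lesson's data) to compare it with `len`, which counts rows.

```python
import pandas

# Made-up records; the last one has no FIPS code.
records = pandas.DataFrame(
    {
        "county": ["Autauga", "Autauga", "Unknown"],
        "fips": ["01001", "01001", None],
        "cases": [1, 3, 2],
    }
)

per_column = records.count()  # non-missing values in each column
total_rows = len(records)     # number of rows, missing values or not

print(per_column)
print("rows:", total_rows)
```

If the counts printed for the lesson's data differ between columns, the smaller numbers point to columns with missing values.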


## Aggregate records

We need to combine or **aggregate** the Covid-19 records to map them. Let's map the total number of cases for each US county.

To do this we will use the **groupby** function. We will group _daily cases_ by _county_. Since some county names are found in more than one state, we have to group by _county_ and _state_ (as well as a special code called the _FIPS_ code). We will add them all up using the **sum** function.

Go to the next slide to see the code.

## Aggregate records

In [None]:
# First group cases by county and state using groupby
covid_grouped = covid_counties.groupby(['fips','county','state'])['cases']

# Second, add up all the Covid-19 cases using sum
covid_total = covid_grouped.sum()

# View the result
covid_total

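The following sketch shows why grouping by county name alone is not enough. It uses made-up numbers for two counties that share a name, and it assumes each row holds the new cases for one day.

```python
import pandas

# Made-up daily counts for two counties that share a name.
daily = pandas.DataFrame(
    {
        "date": ["2020-03-01", "2020-03-02", "2020-03-01", "2020-03-02"],
        "county": ["Washington", "Washington", "Washington", "Washington"],
        "state": ["Minnesota", "Minnesota", "Oregon", "Oregon"],
        "cases": [1, 2, 5, 1],
    }
)

# Grouping by name only mixes the two counties together.
by_name_only = daily.groupby("county")["cases"].sum()

# Grouping by name and state keeps them apart.
by_name_and_state = daily.groupby(["county", "state"])["cases"].sum()

print(by_name_only)
print(by_name_and_state)
```

One caveat worth checking against the data's documentation: adding rows up gives a total only if each row is a daily count. The New York Times repository describes its case and death counts as cumulative, in which case the most recent value per county (for example with `max` instead of `sum`) would be the total.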
## Get the geography
Though this Covid-19 data includes columns for county and state, which is **geospatial information,** it does not have the geometry data that will allow you to plot a map. So we need to get additional data that has the geometry data that defines the outline of each county. 

Good news! We have already obtained that data for you, "counties_geometry.geojson", and stored it on disk, so now we'll load it into a GeoDataFrame, which is a dataframe that also contains a geospatial column for the geometry.

In [None]:
counties_geojson = geopandas.read_file("./supplementary/counties_geometry.geojson")
counties_geojson.head(5)

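A GeoJSON file is plain text in which each feature pairs a set of properties with a geometry. When geopandas reads the file, the properties become ordinary columns and the geometry goes into the geometry column. The sketch below parses a single made-up feature with Python's built-in `json` module to show that structure. The square outline and the county name are invented for illustration, and the property names follow the columns mentioned in this lesson.

```python
import json

# One made-up feature: a square "county" with two properties.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"NAME": "Example", "state_name": "Minnesota"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [[-94.0, 45.0], [-93.0, 45.0], [-93.0, 46.0], [-94.0, 46.0], [-94.0, 45.0]]
        ]
      }
    }
  ]
}
"""

collection = json.loads(geojson_text)
feature = collection["features"][0]
attributes = feature["properties"]               # becomes ordinary columns
outline = feature["geometry"]["coordinates"][0]  # becomes the geometry

print(attributes["NAME"], attributes["state_name"])
print(feature["geometry"]["type"], "with", len(outline), "points")
```

The outline starts and ends at the same point, which is how a polygon ring is closed.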
## Merging Data

Now we have two datasets:
1. The New York Times Covid-19 case data for every county, with no geometry data
2. The county geometry file, which contains the geometry and population data but no Covid-19 case data

We need to **merge** these data by matching the county and state names. However, if you look at the two dataframes above, the Covid-19 data has columns "county" and "state", while the geometry data has columns "NAME" and "state_name", so we have to specify which columns to match up. 

In [None]:
# Merge geography (counties_geojson) and covid cases (covid_total)
merged = pandas.merge(counties_geojson, covid_total, how='left',
                      left_on=['NAME', 'state_name'], right_on=['county', 'state'])

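Here is a small sketch of the same kind of merge on made-up tables. It uses plain dataframes instead of a GeoDataFrame to keep the example short, and it adds the optional `indicator=True` argument (not used in the lesson) to show which rows found no match.

```python
import pandas

# Made-up stand-ins for the geometry table and the case totals.
geography = pandas.DataFrame(
    {
        "NAME": ["Washington", "Washington", "Example"],
        "state_name": ["Minnesota", "Oregon", "Minnesota"],
    }
)
cases = pandas.DataFrame(
    {
        "county": ["Washington", "Washington"],
        "state": ["Minnesota", "Oregon"],
        "cases": [3, 6],
    }
)

# how='left' keeps every row of the left table, matched or not.
joined = pandas.merge(
    geography,
    cases,
    how="left",
    left_on=["NAME", "state_name"],
    right_on=["county", "state"],
    indicator=True,  # adds a '_merge' column describing each row's match
)

# Rows marked 'left_only' have geography but no case data.
unmatched = joined[joined["_merge"] == "left_only"]

print(joined)
print(unmatched["NAME"].tolist())
```

With a left merge, counties without a match keep their row and get a missing value for cases, so they still appear in the result.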
## Success!

Now we have a merged dataframe with geometry, population, and Covid-19 data. Let's view the data and the columns in it.

In [None]:
merged.columns

In [None]:
merged

## Mapping the data

Now that we have a combined dataset, making a map is easy! There are a lot of options, but we'll just use one of them.

In [None]:
# scheme='fisher_jenks' requires the mapclassify package to be installed
merged.plot(figsize=(15, 15), column='cases', cmap='OrRd', scheme='fisher_jenks', legend=True,
            legend_kwds={'loc': 'lower left', 'title': 'Number of Confirmed Cases'})
pyplot.title("Number of Confirmed Cases")

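The `scheme` option tells the plot to sort the case counts into a handful of classes, each drawn in its own shade. The sketch below illustrates that idea with made-up numbers. Note that it uses quantile breaks from `pandas.qcut` as a simpler stand-in, not the Fisher-Jenks method used in the map above.

```python
import pandas

# Made-up case counts, from very small to very large.
cases = pandas.Series([1, 2, 3, 5, 8, 40, 120, 900, 1500, 20000], name="cases")

# Split the values into 5 classes holding roughly equal numbers of counties.
# labels=False returns the class number (0 to 4) for each value.
classes = pandas.qcut(cases, q=5, labels=False)

print(pandas.DataFrame({"cases": cases, "class": classes}))
```

Classifying the values this way keeps a few very large counts from washing out the colors of all the other counties.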
## Congratulations, you made a map!

Of course, it's not a pretty map, but it's a map! Congratulations, you've made a map using cyberinfrastructure and geospatial technologies!

## How is this cyberinfrastructure?

I wrote a few lines of Python code. How is this cyberinfrastructure? 

To answer that question, let's take a step back.

<a href="gateway-5.ipynb">Click here to move to the next section to learn more!</a>