In [None]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci


# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')

## Big Data

### Part 2 of 5
# Show Me the Data!!

## The Pandas module
Pandas is a Python package for reading and processing data. This notebook will show you how to:
- Import the Pandas module
- Use Pandas data structures (Series, including Series built from a dictionary, and DataFrame)
- Import data
- Calculate data

## Reminder
<a href="#/slide-2-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

<br>
</br>
<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [None]:
# Import the Pandas Module
import pandas as pd
pd.__version__

### Series
A Series is a one-dimensional array of indexed data. A Series object contains a sequence of values and an associated index; by default, the index numbers the values in order, starting at 0. We can use the Series object to store some data.

In [None]:
# Create Series from a list of values
data1 = [1,2,3,6,7]
s1 = pd.Series(data1)
print('Values in series s1:', s1.values)
print('The 4th value of s1:', s1[3])
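
To make the idea of an index concrete, here is a minimal sketch that reuses the same five values. It shows the default index, position-based access with `iloc`, and two simple operations on the values. The variable name `s1` mirrors the cell above; nothing else is assumed.

```python
import pandas as pd

# Same values as the lesson's s1; the default index is 0, 1, 2, ...
s1 = pd.Series([1, 2, 3, 6, 7])

print(list(s1.index))        # the index labels of the series
print(s1.iloc[3])            # the 4th value, selected by position
print(s1.sum())              # the sum of all values
print(s1[s1 > 2].tolist())   # only the values greater than 2
```

Using `iloc` makes it explicit that the lookup is by position, which matters once a Series has labels instead of the default numbers.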

### Dictionary
Instead of the default numeric index, a Series can use labels. Passing a dictionary to `pd.Series` creates one label-value pair for each item, with the dictionary keys becoming the index labels. In the example below, we create a population series that uses the state abbreviation as the label and stores the state's population as the value.

In [None]:
# Create Population series from a dictionary (population in millions)
pop_data = {'CA': 39.5,'TX': 29,'NY': 19.45,'FL': 21.48,
            'IL': 12.67}
population = pd.Series(pop_data)
population

Similarly, we can create an area series.

In [None]:
# Create Area series
area_data = {'CA': 155779.22, 'TX': 261231.71, 'NY': 47126.40,
             'FL': 53624.76, 'IL': 55518.93}
area = pd.Series(area_data)
area
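
Because both series are labeled by state, Pandas can combine them label by label. The sketch below uses a subset of the lesson's values and assumes the populations are in millions of people and the areas in square miles (the lesson does not state units). The name `density` is introduced here only for illustration.

```python
import pandas as pd

# A subset of the lesson's values; units are an assumption (millions, square miles)
population = pd.Series({'CA': 39.5, 'TX': 29, 'FL': 21.48})
# The labels are deliberately listed in a different order here
area = pd.Series({'TX': 261231.71, 'FL': 53624.76, 'CA': 155779.22})

# Look up a single value by its label
print(population['CA'])

# Arithmetic between two Series matches values by label, not by position
density = population * 1_000_000 / area   # people per square mile (assumed units)
print(density.round(1))
print(density.idxmax())                   # label of the largest value
```

The result is correct even though the two dictionaries list the states in different orders, because the division is aligned on the index labels.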

### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns. You can think of it as a spreadsheet or SQL table where each column has a name and each row has a label (a row number by default). The column names and row labels are known as the column index and the row index.

The DataFrame is a fundamental Pandas data structure in which each column can hold a different value type (numeric, string, boolean, etc.). A data set can first be read into a DataFrame, and then various operations (e.g. indexing, grouping, aggregation) can be easily applied to it.

In [None]:
# Create DataFrame from dictionary of Series
df2 = pd.DataFrame({'population':population,
                   'land area':area})
df2['drought'] = ['Yes','Yes','No','Yes','No']
df2
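
Once the data is in a DataFrame, columns and rows can be selected by name. The following is a small sketch using three of the states from the cell above; the name `df_states` is hypothetical and chosen so it does not clash with the lesson's `df2`.

```python
import pandas as pd

# Three of the lesson's states, rebuilt here so the example runs on its own
df_states = pd.DataFrame({
    'population': {'CA': 39.5, 'TX': 29.0, 'IL': 12.67},
    'land area': {'CA': 155779.22, 'TX': 261231.71, 'IL': 55518.93},
})
df_states['drought'] = ['Yes', 'Yes', 'No']

print(df_states['population'])           # one column, returned as a Series
print(df_states.loc['TX'])               # one row, selected by its label
print(df_states.loc['TX', 'land area'])  # a single cell
print(df_states.dtypes)                  # each column has its own value type

# Keep only the rows where the drought column equals 'Yes'
dry = df_states[df_states['drought'] == 'Yes']
print(dry)
```

Selecting with `loc` uses the row labels (here the state abbreviations) rather than row numbers.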

## Import Data
### Read from a csv
The read_csv() function reads CSV (comma-separated values) files. It accepts many parameters; you can read more about them in the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
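
As a rough illustration of a few of those parameters, the sketch below reads a tiny CSV from a text string instead of a URL, so it runs without an internet connection. The rows are made up for the example, and the column names are assumptions modeled on the US daily report.

```python
import io
import pandas as pd

# Made-up sample text standing in for a downloaded CSV file
csv_text = """Province_State,Incident_Rate,Testing_Rate
Alpha,100.0,2000.0
Beta,250.0,
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # value type detected for each column

# usecols keeps only the named columns; nrows limits how many rows are read
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=['Province_State', 'Incident_Rate'],
                     nrows=1)
print(subset)
```

The empty field at the end of the second data row is read as a missing value, which is how gaps in real files usually show up after loading.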

In [None]:
# Read the data from csv (read_csv already returns a DataFrame)
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports_us/05-24-2022.csv'
df = pd.read_csv(url)
df

### Explore the JHU COVID-19 csv data
Let's explore some big data that is publicly available. One data source is the Johns Hopkins University (JHU) COVID-19 data repository:
https://github.com/CSSEGISandData/COVID-19.
In the example above, we read the csv data from the COVID-19 daily report for the US.

**Can you find the URL to the COVID-19 daily report of countries around the world?** 
- Find the folder of worldwide daily reports and pick any date in it (e.g. 01-01-2022.csv)
- Open the csv for the date of your choice and click the "Raw" button (look at the right side)
- You are on the **right** track if the csv looks messy in your browser
- Click the address bar of your web browser and copy the URL (it should start with "https://raw.githubusercontent.com...")
- In the cell below, paste the URL between the quotes after `url =`, replacing whatever is there
- Run the code
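
The steps above can also be sketched in code. The daily report file names follow a month-day-year pattern, so the raw URL can be assembled from a date. The helper name `daily_report_url` is hypothetical, and the sketch assumes a report exists for the date you choose.

```python
from datetime import date

# Folder of worldwide daily reports, taken from the example URL in this lesson
BASE = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
        'csse_covid_19_data/csse_covid_19_daily_reports/')

def daily_report_url(day):
    """Hypothetical helper: build the raw URL of one worldwide daily report."""
    # File names look like 01-01-2022.csv (month-day-year)
    return BASE + day.strftime('%m-%d-%Y') + '.csv'

url = daily_report_url(date(2022, 1, 1))
print(url)
```

The resulting string can be passed to `pd.read_csv(url)`, which needs an internet connection and fails if no file exists for that date.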

In [None]:
# Read the data from csv (read_csv already returns a DataFrame)
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2022.csv'
df2 = pd.read_csv(url)
df2

# example: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2022.csv

## Calculate Data
Examine the US daily report of COVID-19 data and note the two fields documenting the incident rate and the testing rate. What are the field names?

Hypothetically, one can explore the effectiveness of COVID-19 testing by dividing the incident rate by the testing rate:

**Effectiveness of Testing = Incident Rate / Testing Rate**

Run the code below to store the result of this calculation in a new field called "Eff_Testing".

In [None]:
# Add column and calculate
df['Eff_Testing'] = df['Incident_Rate']/df['Testing_Rate']
df
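
The same calculation can be tried on a small table to see how missing values behave. This is a sketch with made-up numbers; the name `toy` is hypothetical, and only the column names come from the lesson.

```python
import numpy as np
import pandas as pd

# Made-up values; np.nan marks a missing testing rate
toy = pd.DataFrame({
    'Incident_Rate': [100.0, 250.0, 80.0],
    'Testing_Rate': [2000.0, np.nan, 1600.0],
})

# Column arithmetic is applied row by row
toy['Eff_Testing'] = toy['Incident_Rate'] / toy['Testing_Rate']
print(toy)

# Count the rows where the result is missing
print(toy['Eff_Testing'].isna().sum())

# Keep only the rows where the new column has a value
complete = toy.dropna(subset=['Eff_Testing'])
print(complete)
```

A missing input produces a missing result, so checking with `isna()` before interpreting a calculated column is a reasonable habit.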

## Further Exploration ##
### Congratulations!! This is IT!! ###
You have done the following: 
- Learned what big data is
- Learned about the 'V's of big data and their relevance to its applications
- Explored some big data on the Internet
- Loaded COVID-19 data into a table using Pandas
- Parsed the data and calculated new columns 

Here are some pointers for further exploration: 
- Notice that some calculations return the value "NaN". What does that mean?
- Explore more county-level COVID-19 data from The New York Times at: https://github.com/nytimes/covid-19-data
- Load the mask use data: https://github.com/nytimes/covid-19-data/tree/master/mask-use

If you are interested, feel free to check out the intermediate lesson. It introduces more techniques to process, analyze and visualize big data! 



<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" 
href="bigdata-4.ipynb">Click here to go to the next notebook.</a></font>