# Lab 04 Functions and Visualization

<i>Elements of Data Science</i><br><br>
Welcome to lab 4!
This week, we will focus on functions and visualization. <br>Functions are described in [Chapter 8](https://inferentialthinking.com/chapters/08/Functions_and_Tables.html) of the Inferential Thinking text. <br>Visualization is covered in [Chapter 7](https://inferentialthinking.com/chapters/07/Visualization.html).
<br>**<center>Learning Goals**
|Area|Concept|
|---|---|
|Tables|Load and analyze data sets. |
|Time Trends|Use the EDS module to examine and plot time trends in datascience Tables|
|Visualization|Line plots and scatter plots using matplotlib and `EDS.ptrend`|
|Functions|Learn to define your own functions and apply them to arrays and Table columns|

First, set up the tests and imports by running the cell below.

In [None]:
# Enter your name as a string
name = ...

In [None]:
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import matplotlib.dates as mdates
from matplotlib import ticker
from gofer.ok import check # This line loads the tests.
import os
user = os.getenv('JUPYTERHUB_USER')
import EDS

### Let's explore the most recent COVID data from the New York Times
This data is updated and stored at GitHub: https://github.com/nytimes/covid-19-data <br>
US rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv <br>
US States rolling average: https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us-states.csv <br>
US Wastewater Surveillance: [https://covid.cdc.gov/covid-data-tracker/#wastewater-surveillance](https://covid.cdc.gov/covid-data-tracker/#wastewater-surveillance) 

In [None]:
# Set a variable to contain the path to the data file on the internet.
# us.csv is a CSV file, which stands for "comma-separated values," a
# common human-readable format for data files.
COVID_data = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/rolling-averages/us.csv'

In [None]:
# Read the data file into a data table assigned the name "COVID"
COVID = Table.read_table(COVID_data)

In [None]:
# Use the show() method to display the first three rows of the data table
COVID.show(3)

Now we can sort the data by date. Since the data starts at the beginning of the pandemic we see very few cases.

In [None]:
# Sort the data on the date column
# Display oldest data first using the descending=False option
COVID.sort("date", descending=False)

Try sorting the data by cases to find the date with the highest number of cases. Hint: use the above code as a model and replace the "date" with "cases" and adjust the descending=? to get the largest number at the top.

In [None]:
COVID.sort(...)

### <font color=blue> **What do you observe?** </font>
Replace the ... lines with as many lines of text as you need for your answer. Leave the lines with """ unchanged.

In [None]:
q1_answer = """
...
"""

In [None]:
check('tests/q1_open_ended.py')

### Use where to select data from November - December 2021
Here are the possible arguments for the <i>where</i> Table method:<br>

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|

In [None]:
# Apply the where() method to filter the data table. Only rows where 
# the condition is true are returned.
COVID.where("deaths",are.between(0,1))
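
To make the boundary rule for `are.between` concrete, here is a minimal plain-Python sketch of the same test. It does not use the `datascience` library: the helper `between` and the sample values are made up for illustration only.

```python
# Plain-Python analogy for are.between(low, high).
# `between` is a hypothetical helper, not part of the datascience library.
def between(low, high):
    """Return a test that is True when low <= value < high."""
    return lambda value: low <= value < high

sample_deaths = [0, 0.5, 1, 2, 10]  # made-up values
is_between_0_and_1 = between(0, 1)

# The lower bound (0) is kept; the upper bound (1) is not.
kept = [v for v in sample_deaths if is_between_0_and_1(v)]
print(kept)
```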

One way to select particular dates is to use `where` with a predicate that matches part of the date string. For example, the code below selects January of 2021. Let's think about how this works:

**We want only the year 2021,** so we can use the `are.containing()` predicate on the "date" column to return only the rows where the date string contains "2021".

In [None]:
COVID.where('date', are.containing("2021"))

But this returns all of the dates in the year 2021. **We only want January.** So what if we try this?

In [None]:
COVID.where('date', are.containing("-01-"))

This returned all of the rows in the table containing "-01-" so we got the month of January, but all the years!
We need both filters. **We can "chain" the filters together.** They will be executed from left to right. 
* First, we filter the date column for the year 2021.
* Second, we filter the result for the month of January.

In [None]:
# Return only the rows with dates in January, 2021.
COVID.where('date', are.containing("2021")).where('date', are.containing("-01-"))
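
The same two-step idea can be sketched in plain Python, without a Table. This is only an illustration: the list of dates is made up, and it assumes the dates are strings in year-month-day form, as the "-01-" filter above suggests.

```python
# Made-up date strings in the same year-month-day style as the "date" column.
dates = ["2020-01-15", "2021-01-03", "2021-01-28", "2021-02-01", "2022-01-09"]

# Step 1: keep only the dates that contain "2021".
in_2021 = [d for d in dates if "2021" in d]

# Step 2: filter that result again for dates that contain "-01-".
january_2021 = [d for d in in_2021 if "-01-" in d]

print(in_2021)
print(january_2021)
```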

#### Time Trends and Dates with Data Science Tables

We will use the EDS module to handle dates in Tables. The EDS module is just a collection of predefined functions
to save you writing the same code over and over. Later on, you'll learn how to write your own functions.

There are two tasks.
1. Filter the Table between two dates: use `EDS.FilterTdate(tbl_variable,'01/01/2020','02/01/2021')`
2. Plot a time trend: use `EDS.ptrend(tbl_variable,"date","deaths_avg_per_100k",fmtdate="%b-%Y")`

In [None]:
# import the module so we can use its functions:
import EDS

##### Examples
Using the 5 year Google Trend search volume for Chemistry, Biology, and Nobel Prize

In [None]:
# Like before, we read a CSV file.
Nobel = Table.read_table("ChemBioNobelTrend.csv") # 5 years data
Nobel

Filter for October 2020 - October 2021

In [None]:
# Use the FilterTdate function from the EDS module.
# The function expects you to provide a table name along with the start and end dates.
# It will return a new table with only the rows in the date range you specified.
Nobel_October = EDS.FilterTdate(Nobel, '10/01/2020', '10/31/2021')

In [None]:
# Use ptrend (short for plot trend) to plot this data.
# This function expects you to provide a table name, and the x and y column names
# Optionally, you can format the dates on the x-axis
# fmtdate="%b-%Y" tells it to use short-version month names (%b), a hyphen, then the year (%Y).
EDS.ptrend(Nobel_October,"Week","Nobel Prize: (United States)",fmtdate="%b-%Y")

### <font color=blue> **Question 1. Tools for examining time trend data** </font>
In preparing to look at COVID data, we will first plot Chemistry and Biology [Google Trend](https://trends.google.com/trends/) search volumes for the five-year period included in the Nobel data above. Google Trend data gives the relative search volume by day or week over a time period. For example, the figure below shows Google Trend data for the search terms Turkey, Thanksgiving, and Football.

![Turkey Google Trend](turkey_trend.png "Turkey Google Trend")

Examine this data for the Nobel prize and Biology. Nobel prizes are announced in early October each year and awarded December 10 at 7:00 AM, on the anniversary of Alfred Nobel's death.

**Chemistry**

In [None]:
# Replace the ... with column name.
# Use the example above as a model.
EDS.ptrend(Nobel,"Week",...,fmtdate="%b-%Y")

**Biology**

In [None]:
# Now try filling in all of the missing function arguments.
# This time, plot the Biology search volume.
EDS.ptrend(...)

Next we will create a new table, a subset of the original Table containing only data for 2023.

In [None]:
# Note the use of the FilterTdate() function to create a new table with
# just the specified date range.
Nobel_2023 = EDS.FilterTdate(Nobel,'01/01/2023','12/31/2023')
Nobel_2023

**Biology in 2023**
Now plot the data just for 2023 using the new table `Nobel_2023`

In [None]:
# Fill in all three arguments: table, x-column name and y-column name
testcheck = EDS.ptrend(...,...,...,fmtdate="%b-%Y")

In [None]:
check('tests/q1new.py')

### <font color=blue> **Question 2. Back to COVID Data Analysis** </font><br />
Now let's create and look at late 2021 COVID data. Use the Nobel example above to define a subset of the data to examine trends during all of November and December of 2021.

In [None]:
COVID # Complete COVID data from above

In [None]:
# Follow the example above but filter on the date range
# from the start of November to the end of December.
Late2021 = EDS.FilterTdate(...)
Late2021

In [None]:
check('tests/q2new.py')

### Plot
If we attempt to plot using the 'date' column, the bottom axis shows strange numbers that are multiplied by 1e9 ($1\cdot 10^9$), as shown in the lower right corner of the plot. These are the number of seconds since the epoch (January 1, 1970). Another name for this time unit is UNIX time, which has an [interesting history](https://en.wikipedia.org/wiki/Unix_time#:~:text=History,-Learn%20more&text=The%20earliest%20versions%20of%20Unix,two%20and%20a%20quarter%20years.). Storing time as an integer is convenient for computers but not at all convenient for us as data scientists! Using the `EDS.ptrend` function will alleviate this problem.
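
To see what such a number means, the sketch below converts one example timestamp into a calendar date using Python's standard `datetime` module. The timestamp value is chosen here for illustration and is not taken from the COVID table.

```python
from datetime import datetime, timezone

# An example timestamp: seconds elapsed since January 1, 1970 (UTC).
seconds_since_epoch = 1635724800

# Convert the integer into a date and time in the UTC time zone.
as_date = datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)
print(as_date)
print(as_date.strftime("%b-%Y"))

# Converting back gives the original number of seconds.
round_trip = as_date.timestamp()
print(round_trip)
```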

In [None]:
# This example uses the built-in plot() method of data tables.
# The date is shown in seconds -- not very human-friendly.
# The datascience module wasn't created with time series data in mind.
Late2021.plot("date", "cases_avg_per_100k")

#### Now use EDS.ptrend

In [None]:
# The ptrend() method using dates -- much nicer!
EDS.ptrend(Late2021,"date","cases_avg_per_100k",fmtdate="%b-%Y")

#### Now change format to get days
The format is set by the last argument, `fmtdate="%b-%Y"`. This type of format string is common in coding, and this particular one uses a variety of codes for the parts of a date.
|Code|What|Example|
|---|---|---|
|%b|Month abbreviated|Dec|
|%B|Month|December|
|%y|Year abbreviated|24|
|%Y|Year|2024|
|%d|Day of month|06|
|%a|Weekday abbreviated|Fri|
|%A|Weekday|Friday|
|%j|Day number 001-366|143|
|%W|Week number of year|41|
|%H|Hour of day (24)|17|
|%I|Hour of day (12)|05|


**Now get ptrend to plot using month-day-year format**. Since we are now looking at only two months, the day becomes more important.
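
If you want to experiment with the codes in the table, one option is Python's standard `datetime` module, which understands the same kind of format string. The sketch below uses a made-up date and a few combinations of codes. It assumes `fmtdate` follows the same conventions, which the lab does not state explicitly.

```python
from datetime import datetime

# A made-up moment in time: December 6, 2024 at 17:30.
moment = datetime(2024, 12, 6, 17, 30)

# Try a few format strings built from the codes in the table.
codes = ["%b-%Y", "%d %B %y", "%A", "%j", "%H"]
formatted = {code: moment.strftime(code) for code in codes}

for code, text in formatted.items():
    print(code, "->", text)
```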

In [None]:
# Use the table above to figure out the correct date format.
EDS.ptrend(Late2021,"date","cases_avg_per_100k",fmtdate="...")

### Histogram
Many of the matplotlib plots are available as methods of data tables. A histogram, for example, is produced by appending `.hist('column name')` to the table name.

In [None]:
# Data tables have a built-in histogram method.
Late2021.hist('deaths')

We can also access summary statistics for the datascience table

In [None]:
Late2021.stats()

### <font color=blue> **Question 3.** </font><br />
Construct a histogram and stats for November - December 2020 and compare them to those from November - December 2021 in a markdown cell below the histogram and statistics.

In [None]:
# Filter the original data for this date range,
# just like you did for 2021
Late2020 = ...
Late2020

The `.hist()` Table method has additional arguments to get more bins and change the range for the histogram among other arguments as shown below.

In [None]:
Late2020.hist('deaths',bins = 30, range =[0,4000])
Late2021.hist('deaths', bins = 30, range =[0,4000])

#### <font color=blue> Write your comparison between Late2020 and Late2021 histograms in the cell below. </font>
Replace the ... lines with as many lines of text as you need for your answer. Leave the lines with """ unchanged.

In [None]:
q3_answer = """
...
...
"""

In [None]:
check('tests/q3_open_ended.py')

### <font color=blue> **Question 4.** </font><br />
Now use the `EDS.ptrend()` function to plot your Late2020 data and then your Late2021 data.

In [None]:
# Plot Late 2020 and late 2021 data
...

#### <font color=blue> Write your comparison in the cell below. </font>
In the cell below, describe the differences in the line graphs between 2020 and 2021.  
Replace the ... lines with as many lines of text as you need for your answer. Leave the lines with """ unchanged.

In [None]:
q4_answer = """
...
...
"""

In [None]:
check('tests/q4_open_ended.py')

## 2. Defining functions

A function returns one or more values computed from one or more input variables, called arguments. In algebra we develop the concept of functions such as the following:
$$ f(x) = 3 \cdot x-5 $$
If we substitute the value:
$$ x = 3$$
$$f(3) = 3 \cdot 3-5 = 4$$

This function can be coded in Python in the following straightforward way:
```python
def f(x):
    result = 3*x - 5
    return result
```
To compute the value of the function f(x) at x = 3:
```python
f(3)
4
```

### <font color=blue> **Question 5A.** </font>

Define a Python function for the following algebraic function:
$$ f(x) = 2 \cdot x + 5 $$

In [None]:
def f(x):
    ...
    return ...

Use your function to evaluate the function f(x) at x = 4

In [None]:
f(...) # test function

In [None]:
check('tests/q5a.py')

In [None]:
def f(x):
    return 2 * x + 5

Note something special. If you call the function with an input parameter that contains multiple values, such as an array, it will compute the function on all of those values automatically. How cool is that?

In [None]:
some_numbers = make_array(1, 3, 5, 7, 9)
f(some_numbers)

### Visualizing functions
We can visualize the function above with matplotlib. The EDS module has a function for plotting functions, `EDS.fplot`, which takes three arguments: the function name, the x-axis range, and the y-axis range. Each axis range is given as a list with the minimum and maximum values; for example, [-5,5] gives the range from -5 to 5.

In [None]:
EDS.fplot(f,[-10,10],[-10,10])

In [None]:
EDS.fplot(f,[-20,20],[-20,20])

#### New function
The below function, g(x), is a quadratic function whose plotted shape is known as a parabola. See the [OpenStax Algebra text](https://openstax.org/books/algebra-and-trigonometry-2e/pages/5-1-quadratic-functions) for more detailed information.
$$ g(x) = x^2 $$

In [None]:
def g(x):
    result = x**2
    return result

In [None]:
EDS.fplot(g,[-10,10],[-10,10])

<font color='green'>Now create a new version of the above g(x) function that shifts the function downward by 4 units on the y-axis as in the figure below.</font><br>
<img src='parabola.png' style="width:550px;"/>


In [None]:
def g(x):
    result = ...
    return result

In [None]:
g(...) # test function

In [None]:
EDS.fplot(g,[-10,10],[-10,10])

In [None]:
check('tests/q5ab.py')

### Simple Percentage Function Example
Let's write a very simple function that converts a proportion to a percentage by multiplying it by 100.  For example, the value of `to_percentage(.5)` should be the number 50.  (No percent sign.)

A function definition has a few parts.

##### `def`
It always starts with `def` (short for **def**ine):

    def

##### Name
Next comes the name of the function.  Let's call our function `to_percentage`.
    
    def to_percentage

##### Signature
Next comes something called the *signature* of the function.  This tells Python how many arguments your function should have, and what names you'll use to refer to those arguments in the function's code.  `to_percentage` should take one argument, and we'll call that argument `proportion` since it should be a proportion.

    def to_percentage(proportion)

We put a colon after the signature to tell Python it's over.

    def to_percentage(proportion):

##### Documentation
Functions can do complicated things, so you should write an explanation of what your function does.  For small functions, this is less important, but it's a good habit to learn from the start.  Conventionally, Python functions are documented by writing a triple-quoted string:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
    
    
##### Body
Now we start writing code that runs when the function is called.  This is called the *body* of the function.  We can write anything we could write anywhere else.  First let's give a name to the number we multiply a proportion by to get a percentage. Note that all of the lines of code in the body of the function must be indented by the same amount, typically four spaces. Python knows it has reached the end of the function when the indentation stops.

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100

##### `return`
The special instruction `return` in a function's body tells Python to make the value of the function call equal to whatever comes right after `return`. The function will *return* that value to the main body of the code. We want the value of `to_percentage(.5)` to be the proportion .5 times the factor 100, so we write:

    def to_percentage(proportion):
        """Converts a proportion to a percentage."""
        factor = 100
        return proportion * factor

### <font color=blue> **Question 5B.** </font>

Define `to_percentage` in the cell below.  Call your function to convert the proportion 0.20 to a percentage.  Name that percentage `twenty_percent`.

In [None]:
def ...
    """ ... """
    ... = ...
    return ...

In [None]:
twenty_percent = ...
twenty_percent

In [None]:
check('tests/q5b.py')

Write a second function that computes the density of an ideal gas from a given pressure, temperature, and molecular mass. 
$$ PV = nRT $$
$$ \frac{n}{V} = \frac{P}{RT} $$
To convert from number of moles to grams we use the molecular mass, $ M $. <br>Water has a molecular mass of $ M = 18.0 \frac{g}{mol}$. <br>
Density is given the symbol $\rho $ and has units of $ \frac{g}{L} $. <br><br>
$$  \rho = \frac{M\cdot P}{R\cdot T} $$
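
As a worked example of the formula, consider a different gas than the one in the exercise. The inputs here are assumptions chosen for illustration, not values from the lab: nitrogen gas with $ M = 28.0 \frac{g}{mol} $ at $ P = 1 $ atm and $ T = 273.15 $ K, with $ R = 0.082057 \frac{L \cdot atm}{mol \cdot K} $. Multiplying $ \frac{n}{V} $ by $ M $ converts moles per liter to grams per liter:

$$ \rho = M \cdot \frac{n}{V} = \frac{M\cdot P}{R\cdot T} = \frac{28.0 \cdot 1}{0.082057 \cdot 273.15} \approx 1.25 \frac{g}{L} $$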



In [None]:
def density(P, T, M):
    """  ...    """   
    R = 0.082057
    ...
    ...
    return ...

Test the function by calculating the density of water vapor (gas, $ M = 18.0 \frac{g}{mol}$ ) at 1 atm and 298 K. R is the gas constant.<br> $$ R = 0.082057 \frac{L \cdot atm}{mol \cdot K} $$

In [None]:
# Substitute values for P, T, and M
density(...,...,...)

Now create an array with temperatures in Kelvin from freezing, 273.15, to 313 ($40^\circ$ C) in 1.0 degree steps. With this array, create a new array using the above `density` function. Make a scatter plot of these arrays.

Matplotlib can generate a variety of figures and, given how we import it in this lab, is referenced with the `plt.` prefix. The traditional `.plot()` produces what is also known as a line plot. A scatter plot is useful for exploring the relationship between two or more variables and uses markers for data points. Matplotlib takes one or more numpy arrays as arguments to plot. Matplotlib can also use lists as arguments.  An example is shown below.

In [None]:
x = np.array([3, 8, 5, 6, 1, 9, 6, 7, 2, 1, 8])
y = np.array([4, 5, 2, 4, 6, 1, 4, 6, 5, 2, 3])
color = "red"
plt.scatter(x, y, c = color, label = color)

Same scatter plot with lists:

In [None]:
x = [3, 8, 5, 6, 1, 9, 6, 7, 2, 1, 8]
y = [4, 5, 2, 4, 6, 1, 4, 6, 5, 2, 3]
color = "orange"
plt.scatter(x, y, c = color, label = color)

Use the same approach to plot the density computed with your function versus the temperature. Hint: you can create an array of temperatures and pass this to your density function to compute densities for given temperatures.

In [None]:
plt.scatter(..., ...)

### <font color=blue> **Question 6.** </font>

Now define another function which takes the ratio of two numbers and then uses the *'to_percentage'* function above to convert it into a percentage. One issue is that when the denominator is zero, we can get a result that is not a number, or `nan` in Python. This can be changed to a zero as a placeholder with a little trick, shown below, that can be incorporated as two lines of your code.

In [None]:
# First approach to deal with dividing by zero
from math import nan

z = nan
print("First: ", z)
# Use this part in your function
if z != z:  # if conditional statement
    z = 0
# Up to here
print("Now: ", z)
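
Why does the test `z != z` work? A minimal sketch, using only the standard library, is shown below. It illustrates that `nan` is not equal to itself, while an ordinary number always is. The value 1.5 is just an arbitrary example.

```python
import math

nan = float("nan")

# nan is not equal to itself, which is why the test `z != z` detects it.
print(nan != nan)

# An ordinary number is always equal to itself.
print(1.5 != 1.5)

# math.isnan() asks the same question more explicitly.
print(math.isnan(nan))
```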

In [None]:
# Now your function...
def ratio(x1,x2):
    """ Computes a ratio of x1 to x2 """
    ...
    r = to_percentage(z)
    return r

In [None]:
check('tests/q6a.py')

### COVID cases leading to bad outcomes
Now we will apply the function to our COVID data. Here we will need to use the *with_columns* method of a Table object to add the result of applying the ratio function with two columns as arguments. These columns will be *deaths* and *cases*. The percentage returned by the function will create a new column.<br>
<br>See Inferential Thinking 8.1.1 for inspiration: <br> https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html

### <font color=blue> **Question 7.** </font> 

Now apply your function to create a new column, *deathrate*. Examine the histogram for deathrate. Now plot the trend for *deathrate* for the entire time period of the dataset. Remember the special codes from above to define the x ('date') and y ('deathrate') data to plot. Discuss the results in the markdown cell below.

In [None]:
COVID = COVID.with_columns("deathrate",...).sort("deathrate")
COVID

In [None]:
# Check that there are no nan...
np.isnan(COVID.column('deathrate')).any()

Okay, this line of code also needs explanation. Removing nans (not a number) is a common problem when cleaning up a dataset, so pay attention. 
Again, let's go step by step. 

Let's say we have a numpy array that has some nans.

In [None]:
unclean_array = make_array(5, 7, np.nan, 9, np.nan, 13, 4)
unclean_array

Let's see what happens when we pass this array to np.isnan()

In [None]:
np.isnan(unclean_array)

We get back an array of the same length with a True or False value for each element of our original array: True for each nan element, False otherwise.

`np.any()` checks an array to see if there are any True values.

In [None]:
np.any(np.isnan(unclean_array))

Because there was at least one True value np.any() returns True.

In [None]:
np.isnan(unclean_array).any()

This way of writing it does the same thing. It is called "chaining functions." `np.isnan()` is applied first, then `any()` is chained onto the end using "." so that it is applied to the result of the `isnan()` call. Look closely at the previous two cells to see if you understand why they do the same thing. If you still don't understand, ask a CA or instructor.

We have a way to see if an array has any nans. How do we apply this to a column in our data table? The column() data table method takes a column name and returns all of the values in that column of the data table as an array.

In [None]:
COVID.column('deathrate')

Putting it all together:
1. Use column() to extract the data in the 'deathrate' column of the COVID table.
2. Use isnan() to return True or False for each element of the array depending on whether it is a nan.
3. Use any() to return True if any element is, in fact, a nan. If we get back False our array is clean: no nans.

You are seeing this for the first time, so it is bound to be a bit confusing. It will become more familiar with practice but PLEASE ask when there are steps you don't understand, and remember, you can always add cells to your notebook to try things, or to break code into smaller steps so you can see how it works.

In [None]:
# So, once again check that there are no nan values...
np.isnan(COVID.column('deathrate')).any()

If our data is clean, we can proceed with the visualization.

In [None]:
# Histogram
...

Use `EDS.ptrend()` to plot the deathrate.

In [None]:
check('tests/q7a.py')

#### <font color=blue> Your discussion of results from question 7:</font>
 Replace the ... lines with as many lines of text as you need for your answer. Leave the lines with """ unchanged.

In [None]:
q7_answer = """
...
...
"""

In [None]:
check('tests/q7_open_ended.py')

### <font color=blue> **Question 8.** </font>

At the end of each lab, please include a reflection. 
* How did this lab go? 
* What aspects of visualization or functions do you find confusing?
* Were there questions you found especially challenging you would like your instructor to review in class? 
* How long did the lab take you to complete?

Share your feedback so we can continue to improve this class!

**Insert a markdown cell below this one and write your reflection on this lab.**

**Congratulations**, you're done with lab 4! Be sure to
* run all the tests and verify that they all pass (the next cell has a shortcut for that),
* Save and Checkpoint from the File menu,
* run the last two cells for partial grading. Comments and markdown will be graded separately. 

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import glob
from gofer.ok import check
tests = ['q1_open_ended', 'q1new','q2new', 'q3_open_ended', 'q4_open_ended', 'q5a', 'q5ab', 'q5b', 'q6a', 'q7a']
correct = 0
for x in tests:
    print('Testing question {}: '.format(x))
    g = check('tests/{}.py'.format(x))
    if g.grade == 1.0:
        print("Passed")
        correct += 1
    else:
        print('Failed')
        display(g)

# Report the fraction of tests passed.
print('Grade:  {}'.format(correct/len(tests)))

In [None]:
print("Nice work ",name, user)
import time
localtime = time.asctime( time.localtime(time.time()) )
print("Submitted @ ", localtime)