# `ES 157` Section 3: Modeling COVID cases

This week we will use what we have learned in class so far about systems and the Fourier transform to build a simple model of the daily COVID cases. As a reminder from the first notebook, this is the data that we will work with.

![covid_mass](https://i.imgur.com/bI1umxq.png)

At the end of this notebook you will
1. have used `pandas` and `python` to clean up data,
2. have used the Fourier transform to analyze data, and
3. have created a simple model for the daily new cases.

As always, let us import the libraries we need.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as signal
import pandas as pd

## Loading up COVID data 💉

Along with this week's notebook you should have downloaded some COVID data. In the introductory video, Demba mentioned the COVID tracker of [Johns Hopkins University](https://coronavirus.jhu.edu/data/new-cases-50-states). That tracker, like most COVID dashboards, uses data from [this](https://github.com/CSSEGISandData/COVID-19) GitHub repo. The data we provide were downloaded from the same repo a few weeks ago. Let's load the data from our favorite state.

In [1]:
# load the csv

#check the data

We can see that the data have lots of columns that we probably don't need for our simple analysis. In fact, we see that the first `11` columns have extraneous information. Let's extract the data only for a specific state, and calculate the number of daily cases.

We see that the data is actually split into the numbers reported by each county. When trying to find the total number of daily cases, we have to aggregate the reported numbers by all the counties.

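Another thing we need to consider is that each day's column reports the _total_ number of cases so far, not the new cases for that day. So, again, we have to account for that in order to find the number of daily cases: the new cases on a given day are the difference between that day's total and the previous day's total.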
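
As a rough illustration of these two steps (summing over counties, then differencing the running total), here is a minimal sketch on a tiny made-up table. The column names `Admin2` and `Province_State`, the date labels, and all the numbers are assumptions for the example; the real file has more leading columns, so adjust `n_meta` accordingly.

```python
import pandas as pd

# Hypothetical miniature of the county-level table: two metadata columns
# followed by one column per date holding the *cumulative* case count.
toy = pd.DataFrame(
    {
        "Admin2": ["Suffolk", "Middlesex", "Cook"],
        "Province_State": ["Massachusetts", "Massachusetts", "Illinois"],
        "1/22/20": [0, 1, 2],
        "1/23/20": [2, 1, 5],
        "1/24/20": [5, 4, 9],
        "1/25/20": [5, 8, 9],
    }
)

n_meta = 2  # number of leading non-date columns in this toy table
date_cols = toy.columns[n_meta:]

# Keep one state's counties and sum them to get the state-wide cumulative curve.
state_rows = toy[toy["Province_State"] == "Massachusetts"]
cumulative = state_rows[date_cols].sum(axis=0)

# Differencing consecutive days turns cumulative totals into daily new cases.
# The first day has no predecessor, so we keep its cumulative value as-is.
daily = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)
print(daily)
```

The same pattern should carry over to the real data once the state name and the number of metadata columns are set to match the file.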

In [2]:
# extract the list of columns

# restrict to your state's data

# find the new daily cases

## Plotting the data and the FFT

Now that we've cleaned up the data a little bit, let us try and plot them to make sure we get a sensible image.

**Optional**: create a sensible time axis using `pandas`/`matplotlib` (or, you might want to look into `plotly`; it has a lot of ways to manage plots with dates.)
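
For the optional time axis, one possible starting point is to parse the date column labels into real dates. This sketch assumes the labels are strings of the form `m/d/yy` (an assumption about the file; check your own column names), and the labels listed here are made up for the example.

```python
import pandas as pd

# Hypothetical date labels in the assumed m/d/yy format.
labels = ["1/22/20", "1/23/20", "1/24/20", "1/25/20"]

# Parse the labels into a DatetimeIndex that plotting libraries understand.
dates = pd.to_datetime(labels, format="%m/%d/%y")

# The day of the week comes for free once the labels are real dates.
weekdays = dates.day_name()
print(list(zip(labels, weekdays)))
```

Passing `dates` as the x-values to `plt.plot` should give date ticks on the horizontal axis instead of bare sample indices.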

In [3]:
# plot the daily new cases

Okay, this looks good! It seems to be aligned with the image from the Johns Hopkins tracker. Let's now examine the frequency content of the daily new cases by using the _Fourier transform_.
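
A minimal sketch of computing an FFT together with a matching frequency axis is shown here. The signal is synthetic (a constant level plus a weekly cycle, both made up), standing in for the daily new cases.

```python
import numpy as np

# Synthetic stand-in for the daily new cases: a constant level plus a weekly cycle.
n_days = 280
t = np.arange(n_days)
x = 100 + 20 * np.cos(2 * np.pi * t / 7)

X = np.fft.fft(x)

# d=1.0 means one sample per day, so the frequencies are in cycles per day.
freqs = np.fft.fftfreq(n_days, d=1.0)

# fft/fftfreq put zero frequency first and negative frequencies last.
# fftshift reorders both so that zero sits in the middle, which plots nicely.
X_shifted = np.fft.fftshift(X)
freqs_shifted = np.fft.fftshift(freqs)

print(freqs_shifted[0], freqs_shifted[-1])
```

Plotting `np.abs(X_shifted)` against `freqs_shifted` then gives the usual two-sided magnitude spectrum.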

In [4]:
# plot the FFT of the daily new cases

# create frequency axis

# plot the fft

## Analyzing the Fourier transform

We see a very strong DC component in the Fourier transform of the signal. As we've talked about before, this is expected and encodes the mean of the signal. As a first analysis step, let's remove it and see if the image becomes clearer.
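
To make the link between the DC component and the mean concrete, here is a small sketch on a made-up noisy signal. It checks that the zero-frequency FFT coefficient equals the number of samples times the mean, and that zeroing it is the same as subtracting the mean in the time domain.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 50 + rng.normal(size=128)  # toy signal with a large mean

X = np.fft.fft(x)

# The zero-frequency coefficient is the sum of the samples, i.e. N * mean.
dc_matches_mean = np.isclose(X[0].real, x.size * x.mean())
print(dc_matches_mean)

# Zero out the DC coefficient and go back to the time domain.
X_no_dc = X.copy()
X_no_dc[0] = 0
x_no_dc = np.fft.ifft(X_no_dc).real
```

Because of this equivalence, subtracting `x.mean()` before taking the FFT is an equally valid way of removing the DC component.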

In [5]:
# remove the dc component from the FFT

# plot the fft

The two symmetric peaks are now a little clearer. What do they correspond to? Remember, our time axis is measured in _days_, so the frequency axis is in cycles per day; a peak at frequency $x$ therefore corresponds to an event that repeats every $\frac{1}{x}$ days (the period is the inverse of the frequency). Do the peaks make sense?
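
As an illustration of reading a period off a peak, this sketch locates the strongest non-DC peak of a synthetic signal and inverts its frequency. The weekly cycle is built into the toy signal on purpose; it is not a statement about what the real data contains.

```python
import numpy as np

n_days = 280
t = np.arange(n_days)
x = 100 + 20 * np.cos(2 * np.pi * t / 7)  # made-up signal with a 7-day cycle

# Remove the mean first so the DC term does not dominate the search.
X = np.fft.fft(x - x.mean())
freqs = np.fft.fftfreq(n_days, d=1.0)  # cycles per day

# Search only the positive-frequency half (bins 1 .. N/2 - 1).
half = n_days // 2
k = 1 + np.argmax(np.abs(X[1:half]))
peak_freq = freqs[k]
period_days = 1.0 / peak_freq
print(peak_freq, period_days)

# For a real-valued signal the spectrum is conjugate-symmetric, so the same
# magnitude shows up at the mirrored bin N - k (the negative frequency).
mirror_magnitude = np.abs(X[n_days - k])
```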

Now, let's try the following: reconstruct the signal using the inverse FFT, but zero out everything except the DC component and the interval $[0.12, 0.17]$ (and its mirror image at negative frequencies).
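
One way this masking could look is sketched here on a synthetic noisy signal (the level, the weekly cycle, and the noise are all made up). The mask keeps the DC bin plus every bin whose absolute frequency falls in $[0.12, 0.17]$, which handles the positive band and its mirror image at once.

```python
import numpy as np

n_days = 280
t = np.arange(n_days)
rng = np.random.default_rng(1)
x = 100 + 20 * np.cos(2 * np.pi * t / 7) + 5 * rng.normal(size=n_days)

X = np.fft.fft(x)
freqs = np.fft.fftfreq(n_days, d=1.0)

# Keep DC plus the band [0.12, 0.17] on both the positive and negative side.
in_band = (np.abs(freqs) >= 0.12) & (np.abs(freqs) <= 0.17)
keep = (freqs == 0) | in_band
X_kept = np.where(keep, X, 0)

# The mask is symmetric and x is real, so the inverse FFT is real up to
# round-off; we drop the negligible imaginary part.
x_rec_complex = np.fft.ifft(X_kept)
x_rec = x_rec_complex.real

rel_error = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(keep.sum(), "of", n_days, "coefficients kept; relative error", rel_error)
```

Keeping the mask symmetric matters: if only the positive band were kept, the reconstructed signal would come out complex-valued.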

In [6]:

# find the indices of interest

# zero out everything except the parts of interest

# reconstruct the signal

# plot the reconstructed signal

It turns out that our intuition was correct; the DC component and the frequency content we identified give us a pretty good approximation of the original curve. Obviously, some information is lost since we zeroed out most of the Fourier transform; however, we get a fairly sensible image with only a few coefficients.

This idea is the basis of _compression_; we identify the most important coefficients in a basis and we keep _only_ those coefficients.
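
A toy version of this idea is to keep only the $k$ largest-magnitude Fourier coefficients and discard the rest. The signal and the choice `k = 3` are made up for the example; the sketch uses `np.fft.rfft`, which stores only the non-negative frequencies of a real signal, so that keeping a coefficient automatically keeps its mirror image.

```python
import numpy as np

n = 256
t = np.arange(n)
# Made-up signal: a constant plus two sinusoids, so exactly three
# rfft coefficients are non-negligible.
x = 3 + 2 * np.sin(2 * np.pi * 8 * t / n) + 0.5 * np.sin(2 * np.pi * 20 * t / n)

X = np.fft.rfft(x)

k = 3  # number of coefficients to keep (an arbitrary choice for this toy case)
largest = np.argsort(np.abs(X))[-k:]

# Zero everything except the k largest coefficients.
X_compressed = np.zeros_like(X)
X_compressed[largest] = X[largest]

x_rec = np.fft.irfft(X_compressed, n=n)
print(sorted(largest.tolist()), np.max(np.abs(x - x_rec)))
```

Real data will not be this sparse, so in practice the reconstruction error shrinks gradually as `k` grows rather than dropping to round-off.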

## Trying to model COVID

Let's try and make a simple model for the COVID data. We will base our model on two observations
- first, we saw that the number of daily cases exhibits some sort of periodicity, and
- second, we see a Gaussian-shaped component.

The simplest periodic signal we know is a _sine_; therefore, we will try to model the daily new cases by using the product of a sine and a Gaussian distribution.

First, let us create a sine wave of the appropriate frequency.

In [7]:
# create a time variable

# choose an appropriate frequency/period

# create the sine

# plot the sine

Hmm, but our sine takes negative values. To get a clearer picture on our graph, we could make sure we only keep the parts that are above zero.
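
Keeping only the positive part of a sine amounts to clipping it at zero. A minimal sketch, assuming a 7-day period and eight weeks of daily samples (both illustrative choices):

```python
import numpy as np

t = np.arange(0, 56)  # eight weeks of daily samples
period = 7.0          # assumed period in days
s = np.sin(2 * np.pi * t / period)

# Elementwise maximum with 0 leaves positive values untouched
# and replaces negative values with 0.
s_pos = np.maximum(s, 0)
print(s.min(), s_pos.min())
```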

In [8]:
# keep only the positive part of the sine

# plot the sine

We're slowly making progress. 🤓 Let's move on to the Gaussian. As a reminder, the equation of a Gaussian distribution is given by
<center>$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$.</center>
Around what day does the Gaussian seem to be centered? And what would you say is its standard deviation?

Also, note that we need to account for the height of the distribution. Notice that when $x = \mu$, the Gaussian simplifies to
<center>$f(\mu) = \frac{1}{\sigma \sqrt{2 \pi}}$.</center>
So, we should look at the desired height at that peak, and scale our distribution by an adjustment term so that its peak reaches that height. If the desired height is $h$, the adjustment term is $h \, \sigma \sqrt{2 \pi}$.
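
The height adjustment can be sketched as follows. The function name `scaled_gaussian` and the values of `mu`, `sigma`, and `height` are hypothetical placeholders, not estimates from the data; the point is only that multiplying the density by `height * sigma * sqrt(2 * pi)` makes the peak value equal to `height`.

```python
import numpy as np

def scaled_gaussian(x, mu, sigma, height):
    """Gaussian density rescaled so that its peak value equals `height`."""
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    # At x = mu the density equals 1 / (sigma * sqrt(2 * pi)), so multiplying
    # by height * sigma * sqrt(2 * pi) brings the peak to `height`.
    adjustment = height * sigma * np.sqrt(2 * np.pi)
    return adjustment * pdf

t = np.arange(0, 200)
g = scaled_gaussian(t, mu=90.0, sigma=25.0, height=3000.0)  # placeholder guesses
print(t[np.argmax(g)], g.max())
```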

In [9]:
# make a guess for the mean and the std

# find the adjustment term

# create the Gaussian

# plot the Gaussian

Okay, so now we have our Gaussian and we have our sine wave. Let's multiply them and plot the product in the same plot as the COVID data and see how we did.
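
Putting the two pieces together could look like the sketch here. The period, center, width, and height are placeholder guesses rather than fitted values, and the Gaussian is written directly in its peak-normalized form.

```python
import numpy as np

t = np.arange(200)

# Positive part of a sine with an assumed 7-day period.
sine_pos = np.maximum(np.sin(2 * np.pi * t / 7), 0)

# Gaussian envelope with peak height 3000, centered on day 90 with a
# standard deviation of 25 days (all placeholder guesses).
envelope = 3000 * np.exp(-0.5 * ((t - 90) / 25) ** 2)

# The model is the pointwise product: weekly bumps shaped by the envelope.
model = sine_pos * envelope
print(model.max())
```

Since `sine_pos` lies between 0 and 1, the product never exceeds the envelope, which is why the envelope height should be matched to the tallest spikes in the data rather than to their average.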

In [1]:
# multiply the signals

# plot the product signal

# plot the covid data

It's obviously not perfect, but it's a simple approximation that captures the essential components of the data. I hope you enjoyed playing a bit with some real data and getting some experience with data science. Please take a minute to fill out this week's feedback [form]().

**Optional**: pay attention to exactly which _day_ of the week the spikes in the daily new cases are reported on. Can you think of any plausible reasons why this happens?


**Optional**: can you play around and create a better model for the COVID data? An idea would be to make sure that we don't have the valleys of zero in our product signal. Another thing you could try could involve convolution.
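
As one possible starting point for both ideas, the sketch here replaces the clipped sine with an offset sine that never reaches zero, and then smooths the result by convolving it with a 7-day moving average. The offset, ripple amplitude, kernel length, and envelope parameters are all arbitrary choices for illustration.

```python
import numpy as np

t = np.arange(200)
envelope = 3000 * np.exp(-0.5 * ((t - 90) / 25) ** 2)  # placeholder envelope

# An offset sine oscillating between 0.6 and 1.0 avoids the valleys of zero.
ripple = 0.8 + 0.2 * np.sin(2 * np.pi * t / 7)
model = envelope * ripple

# Convolution with a length-7 box kernel is a 7-day moving average;
# mode="same" keeps the output the same length as the input.
kernel = np.ones(7) / 7
smoothed = np.convolve(model, kernel, mode="same")
print(model.min(), smoothed.shape)
```

Note that with `mode="same"` the first and last few samples are averaged against implicit zeros, so the edges of `smoothed` are biased low.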