In [1]:
import os
import datetime
import random
from bokeh.io import show
from bokeh.plotting import figure
from bokeh.io import output_notebook
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource, Band
from bokeh.embed import components

import numpy as np

In [2]:
output_notebook()

In [3]:
import math
import scipy.optimize as optim
import pandas as pd

# Introduction to Curve Fitting

## Goals for this module

* Understand exponential functions
* Understand that SSR is used to fit a curve to a function
* Understand what parameters are
* Understand the null hypothesis for an exponential function and how to determine the p value.
* Understand the rt values for a pandemic

Most of the time, you should use linear regression to find patterns in data. Only in special circumstances should you try to fit your data to a nonlinear function.

One such case is modeling Covid-19, because a pandemic grows exponentially.

## Exponential Functions

An exponential function has an *initial* value and changes according to a *ratio*. Consider a function that has an initial value of 1 and a ratio of 2. The first value, $x_1$, is 1. To get the next value in the series, we multiply the previous value by the ratio, so $x_2 = 2 \cdot 1 = 2$. To get $x_3$, we multiply $x_2$ by 2 to get 4. The first 6 numbers in this series are [1, 2, 4, 8, 16, 32].

The *initial* value and *ratio* are known as parameters.
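
As a minimal sketch of the rule above, the series can be generated by repeatedly multiplying the previous value by the ratio. The helper name `exponential_series` is made up for this illustration and is not used elsewhere in the module.

```python
def exponential_series(initial, ratio, n):
    """Return the first n values of a series that starts at `initial`
    and multiplies by `ratio` at every step."""
    values = [initial]
    for _ in range(n - 1):
        values.append(values[-1] * ratio)
    return values

series = exponential_series(initial=1, ratio=2, n=6)
print(series)
```

Changing `ratio` to a value below 1 (for example 0.5) produces a series that shrinks instead of grows.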

## Finding the parameters of a function.
Normally, we don't know the parameters. Instead, given a set of numbers, we wish to determine the parameters. As with linear regression, we find the parameters that minimize the sum of squared residuals (SSR), where $y_i$ is an observed value and $\hat{y}_i$ is the value the function predicts:

$SSR = \sum_{i=1}^n (y_i - \hat{y}_i)^2$

We can't use linear algebra to minimize this expression directly. Instead, we use a function that keeps trying different parameter values and narrows in on a solution. Let's do an example.
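
To make the idea of "trying different values" concrete, here is a small sketch that computes the SSR for a handful of candidate ratios and keeps the best one. It is only an illustration: the helper names `exp_model` and `ssr` and the list of candidates are assumptions for this example, and a real optimizer searches far more cleverly than this.

```python
def exp_model(x, initial, ratio):
    # The value at x = 1 is `initial`; each step multiplies by `ratio`.
    return initial * ratio ** (x - 1)

def ssr(xs, ys, initial, ratio):
    # Sum of squared residuals between the observed ys and the predictions.
    return sum((y - exp_model(x, initial, ratio)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 2, 4, 8, 16, 32]

# Try a few candidate ratios while holding the initial value at 1.
candidates = [1.5, 1.8, 2.0, 2.2, 2.5]
scores = {r: ssr(xs, ys, initial=1, ratio=r) for r in candidates}
best_ratio = min(scores, key=scores.get)
print('best ratio is {r} with SSR {s}'.format(r=best_ratio, s=scores[best_ratio]))
```

The candidate with the smallest SSR is the one whose curve passes closest to the points.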

### Risk and Alcohol data

The following dataset consists of two columns: BAC, the blood alcohol concentration, and risk, the relative risk of an accident. As the BAC increases, the risk of an accident increases, and it does so in an exponential fashion.


In [5]:
DF = pd.read_csv(os.path.join('data', 'alchohol_risk.csv'))
DF.head()

Unnamed: 0,BAC,risk
0,0.0,1.0
1,0.01,1.03
2,0.03,1.06
3,0.05,1.38
4,0.07,2.09


In [27]:
def resample(l):
    final = []
    for i in range(len(l)):
        final.append(random.choice(l))
    return final

In [6]:
def exp_func(x, initial, ratio):
    return initial * np.power(ratio, x - 1)

We are already acquainted with the resample function. The exp_func function takes the x values, an initial value, and a ratio, and returns the initial value multiplied by the ratio raised to the power x - 1. Let's graph the data.

In [11]:
def make_scatter(df, title = None):
    p  = figure(title = title)
    p.circle(x = df['BAC'], y = df['risk'])
    return p
show(make_scatter(DF))

Let's fit a curve to these points by finding the parameters.

In [71]:
popt, pcov = optim.curve_fit(f = exp_func, xdata =DF['BAC'], ydata = DF['risk'])
print('ratio is {r}'.format(r = popt[1]))
print('initial value is {i}'.format(i = popt[0]))

ratio is 57700487606.35652
initial value is 32241472356.689198
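
The fitted numbers look enormous, but that appears to be a matter of scale rather than a failed fit. Because exp_func computes `initial * ratio ** (x - 1)`, the ratio describes a change of one full unit of BAC, while the BAC values shown earlier are in hundredths. The sketch below, which only reuses the two values printed above, rewrites them on a more readable scale. The variable names are assumptions made for this illustration.

```python
import math

# Fitted values copied from the output above.
initial = 32241472356.689198
ratio = 57700487606.35652

# initial * ratio ** (x - 1) equals (initial / ratio) * ratio ** x,
# so the predicted risk at BAC = 0 is initial / ratio.
risk_at_zero = initial / ratio

# Growth factor for a 0.01 increase in BAC instead of a full unit.
ratio_per_hundredth = math.exp(0.01 * math.log(ratio))

print('predicted risk at BAC 0 is {r:.2f}'.format(r=risk_at_zero))
print('risk multiplier per 0.01 BAC is {m:.2f}'.format(m=ratio_per_hundredth))
```

Read this way, the fit says that each additional 0.01 of BAC multiplies the predicted risk by a modest factor, which is easier to interpret than the raw ratio.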


In [72]:
# let's generate 30 evenly spaced x points
new_x = np.linspace(min(DF['BAC']), max(DF['BAC']), 30)
y_hat = [exp_func(initial = popt[0], ratio = popt[1], x = x) for x in new_x]
p = make_scatter(DF, title = None)
p.line(x = new_x, y = y_hat)
show(p)

## Determine a P value
How good is our fit? To answer this question, we need to state a null hypothesis and determine a p value. For an exponential function, the null hypothesis is that the ratio is 1. If the data are just random, each y value will sometimes be greater than the previous one and sometimes smaller, so on average the ratio is 1. If the resampled ratios are consistently less than 1 or consistently greater than 1, we can reject the null hypothesis.

In [74]:
def get_ratios(x, y, num_iter = 100):
    zip_obj = list(zip(x, y))
    ratios = []
    for i in range(num_iter):
        new_ = resample(zip_obj)
        new_ = sorted(new_, key = lambda x: x[0])
        x_ = [x[0] for x in new_]
        y_ = [x[1] for x in new_]
        popt, pcov = optim.curve_fit(f = exp_func, xdata =np.array(x_), ydata = np.array(y_))
        ratios.append(popt[1])
    return ratios

ratios = get_ratios(x = DF['BAC'].tolist(), y = DF['risk'].tolist())
print(len([x for x in ratios if x > 1]))
print(len([x for x in ratios if x < 1]))

  


100
0


We resampled our points and then determined the ratio for each resample. Every single result was greater than 1. We can formulate our p value as:

In [76]:
p_value = 1 - len([x for x in ratios if x > 1])/len(ratios)
print('p value is {p}'.format(p = p_value))


p value is 0.0


With 100 resamples, a result of 0.0 means our p value is less than .01. We reject the null hypothesis that the ratio is 1.

In [94]:
def test_random():
    x = [random.randrange(100) for x in range(30)]
    y = [random.randrange(100) for x in range(30)]
    p = figure()
    p.circle(x = x, y = y)
    popt, pcov = optim.curve_fit(f = exp_func, xdata =np.array(x), ydata = np.array(y))
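    # evaluate the fitted exponential (not a straight line) on sorted x values
    x_sorted = sorted(x)
    y_hat = [exp_func(x_, *popt) for x_ in x_sorted]
    p.line(x = x_sorted, y = y_hat)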
    print('ratio is {r}'.format(r = popt[1]))
    resamples =  get_ratios(x, y)
    print('number greater than 1 is {g}'.format(g = len([x for x in resamples if x > 1])))
    show(p)
test_random()


ratio is 0.9954079462688262
number greater than 1 is 14


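The above function generates random x and y values and fits an exponential function to them. After resampling, you will find that the ratios fall on both sides of 1: here 14 of the 100 resampled ratios were greater than 1, which is not a small number, so we cannot reject the null hypothesis. (Run the function several times to see how the counts vary.)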
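
One way to turn such counts into a p value is to measure the fraction of resampled ratios that land on the opposite side of 1 from the fitted ratio. The sketch below is a hedged illustration of that idea: the function name `one_sided_p_value` is an assumption, and the list of ratios is a placeholder that only reproduces the counts printed above, not the actual resampled values.

```python
def one_sided_p_value(ratios, fitted_ratio):
    """Fraction of resampled ratios on the opposite side of 1 from the fit."""
    if fitted_ratio > 1:
        opposite = [r for r in ratios if r <= 1]
    else:
        opposite = [r for r in ratios if r >= 1]
    return len(opposite) / len(ratios)

# Placeholder ratios: 14 of 100 above 1, as in the printed output.
# The individual values are made up for illustration.
placeholder = [1.01] * 14 + [0.99] * 86
p_value = one_sided_p_value(placeholder, fitted_ratio=0.9954)
print('p value is {p}'.format(p=p_value))
```

A p value this large is far above the usual .01 or .05 thresholds, which matches what we expect from random data.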

## RT value for Covid-19
A pandemic such as Covid-19 grows exponentially. In this module, we refer to the ratio as the rt value.

For example, imagine that there are 2,500 cases one week and 3,500 cases the next week. The rt is $3,500/2,500 = 1.4$. Since the rt is above 1, the number of infections increases. In fact, 1.4 is very high: in a short amount of time, the pandemic will be out of control. Here is an example of calculating the rt value for Seattle, Washington, using daily case counts for the last 14 days.
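
The arithmetic in the example above can be sketched in a few lines. The projection assumes, purely for illustration, that the same ratio holds unchanged for several more weeks, which real outbreaks rarely do.

```python
cases_last_week = 2500
cases_this_week = 3500

# The ratio of this week's cases to last week's cases.
rt = cases_this_week / cases_last_week

# Project weekly cases forward, starting from this week,
# under the hypothetical assumption that rt stays constant.
projection = [round(cases_this_week * rt ** week) for week in range(5)]

print('rt is {r}'.format(r=rt))
print(projection)
```

Even over this short horizon the weekly count nearly quadruples, which is why a ratio of 1.4 is considered very high.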

In [103]:
# number of cases per day
seattle = [160.0, 158.0, 202.0, 111.0, 124.0, 145.0, 167.0, 192.0, 134.0, 200.0, 129.0, 148.0, 81.0, 149.0]
# note: because this is a time series, the x values are simply the day numbers,
# starting at 0 and increasing by 1 for each y
seattle_x = range(len(seattle))
popt, pcov = optim.curve_fit(f = exp_func, xdata =np.array(seattle_x), ydata = np.array(seattle))
print('rt is {r}'.format(r = popt[1]))
resamples =  get_ratios(seattle_x, seattle)
p_value = 1- len([x for x in resamples if x < 1])/len(resamples)
print('p value is {p}'.format(p = p_value))

rt is 0.9861979922862778
p value is 0.12


The rt value for Seattle is less than 1, which suggests the pandemic is decreasing. However, the p value is too large, so we cannot reject the null hypothesis that the rt is equal to 1. The decrease is not statistically significant.