<h1>SI 305 Discussion 10: Statistical Analysis</h1>

This week we'll be covering statistical analysis in Python. There are lots of packages you can use to analyze relationships in your data. In this discussion we'll focus on <a href = "https://www.statsmodels.org/stable/index.html">statsmodels</a>. Another useful package that we won't cover today is <a href = "https://scikit-learn.org/stable/">scikit-learn</a> (imported as `sklearn`).

We'll be using the wastewater scan data. This dataset measures the amount of COVID-19, influenza, and other viruses in wastewater. For more details, see the <a href = "https://docs.google.com/document/d/1vmTYziZxRMxANLVG0ly1c4-3scamvvXEkpsISZ_dF3E/edit">data dictionary</a>.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

from statsmodels.formula.api import ols

In [None]:
df = pd.read_csv("https://docs.google.com/spreadsheets/d/1kQ6oEeNNntgQ0V2rv21_osuMO9Xr_ni6but7tVtrYjw/gviz/tq?tqx=out:csv")

In [None]:
df['collection_date'] = pd.to_datetime(df['collection_date'])
df['collection_month'] = df['collection_date'].dt.month

In [None]:
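def assign_month(s):
    """Map a month number (1-12) to a season label."""
    # December, January, and February count as winter
    if s < 3 or s == 12:
        return 'winter'
    elif s <= 5:
        return 'spring'
    elif s <= 8:
        return 'summer'
    else:
        return 'fall'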

df['season'] = df['collection_month'].apply(assign_month)

In [None]:
df.head()

<h2>Live Coding Example: Do we detect higher levels of COVID-19 in certain seasons?</h2>

We went ahead and created a new variable called `season` for you. For the sake of this exercise, we'll assume:

* Winter = December, January, February
* Spring = March, April, May
* Summer = June, July, August
* Fall = September, October, November
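
A quick way to confirm the month-to-season rule listed above is to apply the helper to every month number. This is only a sanity-check sketch; `month_to_season` is a name introduced here for illustration, and `assign_month` is repeated so the cell runs on its own.

```python
def assign_month(s):
    # Same rule as the notebook's helper: month number -> season label
    if s < 3 or s == 12:
        return 'winter'
    elif s <= 5:
        return 'spring'
    elif s <= 8:
        return 'summer'
    else:
        return 'fall'

# Build a lookup of month number -> season so the mapping is easy to inspect
month_to_season = {m: assign_month(m) for m in range(1, 13)}
print(month_to_season)
```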

<h3>1.A: Create a plot that shows the relationship between COVID-19 and season</h3>
Before we go through the effort of doing a full statistical analysis, it can be helpful to plot our data to see if there's any relationship. Let's create a plot that shows the relationship between COVID-19 and season.

There are a few different ways COVID-19 levels are measured in this dataset. We'll use the variable `N_Gene_gc_g_dry_weight`.
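
One possible way to look at a numeric variable across seasons is a box plot with one box per season. The sketch below uses a small made-up DataFrame called `toy` (an assumption for illustration, not the real wastewater data) so that it runs without downloading anything; with the real data you would use `df` instead. The log scale on the y-axis is also an assumption that tends to help with skewed concentration values.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for the wastewater data (made-up numbers, NOT real measurements)
rng = np.random.default_rng(0)
seasons = ['winter', 'spring', 'summer', 'fall']
toy = pd.DataFrame({
    'season': np.repeat(seasons, 25),
    'N_Gene_gc_g_dry_weight': rng.lognormal(mean=10, sigma=1, size=100),
})

# One array of values per season, in calendar order
groups = [
    toy.loc[toy['season'] == s, 'N_Gene_gc_g_dry_weight'].to_numpy()
    for s in seasons
]

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticklabels(seasons)
ax.set_yscale('log')  # assumption: concentrations are right-skewed
ax.set_xlabel('Season')
ax.set_ylabel('N_Gene_gc_g_dry_weight')
ax.set_title('COVID-19 (N gene) by season')
```

A bar chart of seasonal means would also work, but a box plot shows the spread within each season, which matters when comparing groups.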

<h3>1.B: Analysis of Variance</h3>

We are interested in how each individual season impacts COVID-19 levels. This means a Type II ANOVA is most appropriate. For a refresher on which type of ANOVA to use, see <a href = "https://md.psych.bio.uni-goettingen.de/mv/unit/lm_cat/lm_cat_unbal_ss_explained.html">this page</a>.

<h3>1.C: Interpretation</h3>

One of the most important parts of doing statistical analysis is interpreting your results *in the context of the problem.* Let's write 1-2 sentences explaining what these results mean.

Your answer here

<h2>Question 1: What is the relationship between COVID-19 levels and influenza levels?</h2>

Use the column `Influenza_A_gc_g_dry_weight` to measure influenza levels.

<h3>1.A: Create a plot that shows the relationship between COVID-19 and influenza</h3>

<h3>1.B: Calculate the correlation between these two variables</h3>
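
For reference, here is a minimal sketch of computing a Pearson correlation between two columns with pandas. The DataFrame `toy` and its values are made up for illustration (an assumption, not the real data); `Series.corr` defaults to Pearson and skips rows where either value is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wastewater data (made-up numbers, NOT real measurements)
rng = np.random.default_rng(0)
covid = rng.normal(loc=100, scale=10, size=50)
flu = 0.5 * covid + rng.normal(loc=0, scale=5, size=50)
toy = pd.DataFrame({
    'N_Gene_gc_g_dry_weight': covid,
    'Influenza_A_gc_g_dry_weight': flu,
})

# Pearson correlation coefficient, always between -1 and 1
r = toy['N_Gene_gc_g_dry_weight'].corr(toy['Influenza_A_gc_g_dry_weight'])
print(r)
```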

<h3>1.C: Interpret in context</h3>

Your answer here

<h3>1.D: Linear Regression</h3>

Let's predict influenza levels based on COVID-19 levels and the season.
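
A regression with one numeric and one categorical predictor could be set up with the formula interface roughly as follows. This is a sketch on a made-up DataFrame `toy` (an assumption for illustration, not the real data); the variable name `model` is also just a placeholder.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy stand-in for the wastewater data (made-up numbers, NOT real measurements)
rng = np.random.default_rng(0)
seasons = ['winter', 'spring', 'summer', 'fall']
covid = rng.normal(loc=100, scale=10, size=100)
toy = pd.DataFrame({
    'season': np.repeat(seasons, 25),
    'N_Gene_gc_g_dry_weight': covid,
    'Influenza_A_gc_g_dry_weight': 0.5 * covid + rng.normal(0, 5, size=100),
})

# Outcome on the left of ~, predictors on the right; C() marks a categorical
model = ols(
    'Influenza_A_gc_g_dry_weight ~ N_Gene_gc_g_dry_weight + C(season)',
    data=toy,
).fit()
print(model.summary())
```

With four seasons, the model estimates three season coefficients, each measured relative to a reference season that is left out of the output.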

<h3>1.E: Interpret in context</h3>

There are lots of interesting results from our linear regression! You don't have to interpret everything, but here are some things to pay attention to when reading your regression results:

* <a href = "https://people.duke.edu/~rnau/rsquared.htm">R-squared</a>
* <a href = "https://stats.oarc.ucla.edu/spss/library/spss-libraryunderstanding-and-interpreting-parameter-estimates-in-regression-and-anova/">Parameter estimates and their p-value</a>
* <a href = "https://stats.oarc.ucla.edu/stata/output/regression-analysis/">F statistic</a>
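
The quantities in the list above can also be read directly from a fitted statsmodels result instead of the printed summary. The sketch below fits a simple model on made-up data (the DataFrame `toy` is an assumption for illustration only) and pulls out each value by attribute.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy stand-in for the wastewater data (made-up numbers, NOT real measurements)
rng = np.random.default_rng(0)
covid = rng.normal(loc=100, scale=10, size=100)
toy = pd.DataFrame({
    'N_Gene_gc_g_dry_weight': covid,
    'Influenza_A_gc_g_dry_weight': 0.5 * covid + rng.normal(0, 5, size=100),
})

model = ols(
    'Influenza_A_gc_g_dry_weight ~ N_Gene_gc_g_dry_weight', data=toy
).fit()

r_squared = model.rsquared    # share of variance explained by the model
coefficients = model.params   # parameter estimates (Intercept and slope)
p_values = model.pvalues      # one p-value per parameter
f_stat = model.fvalue         # overall F statistic
f_p_value = model.f_pvalue    # p-value for the F statistic

print(r_squared, f_stat, f_p_value)
print(coefficients)
print(p_values)
```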

<h2>Bonus Question</h2>
This is a bonus question for you to attempt individually. This can only help your grade, so try your best! Make sure to submit your completed notebook to Vocareum. 

Create a linear regression model that predicts `Noro_G2_gc_g_dry_weight` based on `Influenza_A_gc_g_dry_weight`, the year the sample was collected, and the `dilution` field. Save your fitted model to the variable `bonus_model`.