# Introduction. 
What is the context of the work? What research question are you trying to answer? What are your main findings? Include a brief summary of your results.

Introduction: The introduction should be the exposition of the article where you can use less rigorous language. Your language should be generally accessible. Aim for this to be readable by someone who hasn't taken this class (maybe your roommate, your family, or you at the start of the semester). It should still be formal, but someone should come to the end and want to read more. 538 articles might be a good baseline tone for this.

Advanced introductions will immediately tell us what the setting is, what you found, and why it matters. They will add details as they are needed. Language will be polished and free from errors (Note: if your group does not include a native English speaker, make a note of that). Beginning writeups will be less focused and organized. They may jump to technical details without explaining why results are important. They may have spelling and grammatical errors, or awkward or incomplete sentences, indicating that they were written in haste and never reviewed.



# Data description. 
This should be inspired by the format presented in Gebru et al, 2018. Answer any relevant questions from sections 3.1-3.5 of the Gebru et al. article, especially the following questions:

- What are the observations (rows) and the attributes (columns)?
- Why was this dataset created?
- Who funded the creation of the dataset?
- What processes might have influenced what data was observed and recorded and what was not?
- What preprocessing was done, and how did the data come to be in the form that you are using?
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted in a Cornell Google Drive or Cornell Box).

# Preregistration statement.
List the two analyses you promised to perform in this final report from your Phase III submission.

# Data analysis.
Use summary functions like mean and standard deviation along with visual displays like scatterplots and histograms to describe data.

Provide at least one model showing patterns or relationships between variables that addresses your research question. 
    This could be a regression or clustering, or something else that measures some property of the dataset.

# Evaluation of significance. 
Use hypothesis tests, simulation, randomization, or any other techniques we have learned to compare the patterns you observe in the dataset to simple randomness.

# Interpretation and conclusions.
What did you find over the course of your data analysis, and how confident are you in these conclusions? Detail your results more than you did in the introduction, now that the reader is familiar with your methods and analysis. Interpret these results in the wider context of the real-life application from where your data hails.

# Limitations.
What are the limitations of your study? What are the biases in your data or assumptions of your analyses that specifically affect the conclusions you're able to draw?

# Source code.
Provide a link to your Github repository (or other file hosting site) that has all of your project code beyond what's in the notebook itself (if applicable). For example, you might include web scraping code or data filtering and aggregation code.

# Acknowledgments.
Recognize any people or online resources that you found helpful. These can be tutorials, software packages, Stack Overflow questions, peers, and data sources. Showing gratitude is a great way to feel happier! But it also has the nice side-effect of reassuring us that you're not passing off someone else's work as your own. Crossover with other courses is permitted and encouraged, but it must be clearly stated, and it must be obvious what parts were and were not done for 2950. Copying without attribution robs you of the chance to learn, and wastes our time investigating.




# Appendix: 
## Data cleaning description.
Submit an updated version of your data cleaning description from Phase II that describes all data cleaning steps performed on your raw data to turn it into the analysis-ready dataset submitted with your final project. The data cleaning description should be a separate Jupyter notebook with executed cells, and it should output the dataset you submit as part of your project (e.g. written as a .csv file).

## (Optional) Other appendices.
You will almost certainly feel that you have done a lot of work that didn't end up in the final report. We want you to edit and focus, but we also want to make sure that there's a place for work that didn't work out or that didn't fit in the final presentation. You may include any analyses you tried but were tangential to the final direction of your main report. Graders may briefly look at these appendices, but they also may not. You want to make your final report interesting enough that the graders don't feel the need to look at other things you tried. "Interesting" doesn't necessarily mean that the results in your final report were all statistically significant; it could be that your results were not significant but you were able to interpret them in an interesting and informed way.



# Introduction

This report examines the relationships among several US county-level variables at two points during the pandemic. The variables are COVID-19 death rate, internet speed, and income. We examine whether there is a significant relationship between a county's death rate (deaths divided by cases) and each of the latter two variables. We compare models of the two relationships during the third and fourth peaks, which occurred before and after the vaccination effort, and show that **DISCUSS RESULTS BRIEFLY**.

## FINISH
---

# Data description.

## Time-Series data
This dataset covers January 2020 through November 2021. Each observation represents one county on one day and includes the following attributes: `cases`, `deaths`, `tot_deaths`, `tot_cases`, `death_rate`, and `dr_rating`.

- `cases`: new cases
- `deaths`: new deaths
- `tot_cases`: cumulative cases
- `tot_deaths`: cumulative deaths
- `death_rate`: our COVID-19 response metric, calculated as cumulative deaths / cumulative cases
- `dr_rating`: categorical variable with six bins, derived from `death_rate`
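
The derived columns described above could be computed roughly as follows. This is a minimal sketch on made-up rows: the `fips` and `date` key columns, the helper name `add_death_rate`, and the use of equal-width bins for `dr_rating` are all assumptions, since the binning rule is not stated here.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are invented, and `fips` and
# `date` are assumed key columns.
ts = pd.DataFrame({
    "fips": ["01001", "01001", "01003", "01003"],
    "date": pd.to_datetime(["2021-01-07", "2021-01-08"] * 2),
    "cases": [10, 5, 0, 20],
    "deaths": [1, 0, 0, 2],
})


def add_death_rate(df, n_bins=6):
    """Add cumulative totals, death_rate, and a binned dr_rating (hypothetical helper)."""
    out = df.sort_values(["fips", "date"]).copy()

    # Cumulative totals per county, in date order.
    out["tot_cases"] = out.groupby("fips")["cases"].cumsum()
    out["tot_deaths"] = out.groupby("fips")["deaths"].cumsum()

    # death_rate = cumulative deaths / cumulative cases. Counties with no
    # recorded cases yet get NaN instead of a division-by-zero result.
    out["death_rate"] = out["tot_deaths"] / out["tot_cases"].replace(0, np.nan)

    # Six bins coded 0..5. Equal-width edges are an assumption; quantile
    # bins (pd.qcut) would be another reasonable choice.
    out["dr_rating"] = pd.cut(out["death_rate"], bins=n_bins, labels=False)
    return out


ts = add_death_rate(ts)
```

Rows with a missing `death_rate` also get a missing `dr_rating`, so they drop out of any plot or model that groups by rating.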

## Static Data
Each observation represents a county, identified by its FIPS code. Each county has four attributes: two relate to internet speed and two relate to income. These data are merged with the time-series data for a single date.
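
The merge with a single date of the time-series data might look like the sketch below. The frames are toy data with invented values, and the key column names `fips` and `date` and the helper `snapshot_with_static` are assumptions made for illustration.

```python
import pandas as pd

# Toy frames for illustration only; values are invented.
time_series = pd.DataFrame({
    "fips": ["01001", "01003", "01001", "01003"],
    "date": pd.to_datetime(
        ["2021-01-08", "2021-01-08", "2021-09-13", "2021-09-13"]
    ),
    "death_rate": [0.012, 0.018, 0.015, 0.011],
})
static = pd.DataFrame({
    "fips": ["01001", "01003"],
    "AverageMbps": [45.2, 80.1],
    "med_income": [58000, 62000],
})


def snapshot_with_static(ts, static_df, date):
    """Return one row per county: time-series values on `date` plus static attributes."""
    day = ts[ts["date"] == pd.Timestamp(date)]
    # validate="one_to_one" raises if a county appears more than once on
    # either side, which would silently duplicate rows otherwise.
    return day.merge(static_df, on="fips", how="inner", validate="one_to_one")


pre_vax = snapshot_with_static(time_series, static, "2021-01-08")
```

An inner join drops counties that are missing from either table, so comparing `len(pre_vax)` with the number of counties in each input is a quick way to see how many were lost.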

### `AverageMbps` and `avg_mbps_rating`
`AverageMbps` is the 12-month average of internet speed test results recorded by Measurement Lab (M-Lab) for a given county in 2020. `avg_mbps_rating` is a categorical variable with six bins, derived from `AverageMbps`.

### `med_income` and `income_rating`
`med_income` is the 2019 median household income for a county. `income_rating` is a categorical variable with six bins, derived from `med_income`.


# PROVIDE SOURCES TO RAW DATA
---

# Preregistered Analyses
In our preregistration, we committed to analyzing the following questions:
- What is the relationship between access to Wi-Fi and COVID-19 transmission?
- Did vaccinations or transmission have a more significant relationship with travel?

We have since moved from vaccinations and travel to income data. While we could have spent more time on the time-series vaccination and travel data, we felt the analysis would be more fruitful if we spent our time comparing two relationships involving the static variables: median income versus death rate, and internet speed versus death rate.

---

Dates:
1. 2021-01-08 - Pre-Vaccination Peak
2. 2021-09-13 - Post-Vaccination Peak


# Data analysis

## Summary Info

- Cases and deaths over time lineplot
    - include vertical lines for the dates we're looking at
    
- Choropleths
    - Internet speed
    - Income

- Summary stats for static data
    - mean, std dev, hist for internet speed, income
    - explain bins
    

- FOR EACH SELECTED DATE

    - Correlation heatmap
    - Show displots, boxplots
    - deaths choropleth??

    - Comparisons of variables to death rate
        - boxplot, side-by-side
            - y axis: shared, death rate
            - x axis: separate, income rating, internet rating
        - displot, stacked
            - x axis: shared, death rate
            - y axis: separate, income rating, internet rating    
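
The summary statistics and the correlation matrix behind the heatmap could be computed along these lines for each selected date. This is only a sketch: the snapshot below is randomly generated stand-in data, not our results, and `describe_snapshot` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one date's merged snapshot; values are random and
# carry no meaning beyond illustrating the calls.
rng = np.random.default_rng(0)
n = 200
snapshot = pd.DataFrame({
    "AverageMbps": rng.gamma(shape=4.0, scale=15.0, size=n),
    "med_income": rng.normal(55000, 12000, size=n),
    "death_rate": rng.beta(2, 100, size=n),
})

COLS = ["AverageMbps", "med_income", "death_rate"]


def describe_snapshot(df, cols=COLS):
    """Return (mean/std table, Pearson correlation matrix) for the given columns."""
    summary = df[cols].agg(["mean", "std"])
    corr = df[cols].corr()
    return summary, corr


summary, corr = describe_snapshot(snapshot)
```

The `corr` frame can be passed directly to a heatmap function such as seaborn's `heatmap`, and the same helper can be called once per selected date to keep the two snapshots comparable.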

## Models

- FOR EACH DATE
   
- Side-by-side regression/scatterplots for income and internet


- Permutations on reg plot
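
One way to carry out the permutation step for a regression is sketched below, assuming a simple one-predictor linear fit. The function names are hypothetical, and the data are synthetic stand-ins generated only to exercise the code.

```python
import numpy as np


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.polyfit(x, y, deg=1)[0]


def permutation_slopes(x, y, n_perm=1000, seed=0):
    """Slopes obtained after shuffling y, which breaks any x-y relationship."""
    rng = np.random.default_rng(seed)
    return np.array([slope(x, rng.permutation(y)) for _ in range(n_perm)])


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided p-value: share of shuffled slopes at least as extreme as observed."""
    observed = slope(x, y)
    null = permutation_slopes(x, y, n_perm=n_perm, seed=seed)
    # The +1 terms count the observed slope itself, so p is never exactly 0.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, p


# Synthetic example: one predictor related to the outcome, one unrelated.
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y_related = -0.5 * x + rng.normal(size=300)
y_unrelated = rng.normal(size=300)

obs_rel, p_rel = permutation_p_value(x, y_related)
obs_unrel, p_unrel = permutation_p_value(x, y_unrelated)
```

The shuffled slopes from `permutation_slopes` can also be drawn as faint lines behind the fitted line on the regression plot, which shows visually how far the observed slope sits from what randomness alone produces.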

## Evaluation of significance


# Limitations

- Internet data
    - When do people usually run an internet speed test? Often when their internet isn't working as they expect, so the recorded speeds may understate typical performance.