# **COVID Tracker - Project 2**
### Analyzing the linear relationship between two quantitative variables.

# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called **modules** that add extra features to the basic setup. The name of each module follows the `import` statement, and its purpose is described in a non-code comment after the hashtag (#).



In [1]:
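import pandas as pd                  # Data tables (DataFrames) and CSV import
import numpy as np                   # Numerical arrays and math functions
import plotly.express as px          # Interactive plots such as scatterplots
from IPython.display import Image    # Displays images inside the notebook
import statsmodels.api as sm         # Statistical models, including linear regression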


In [2]:
# Assigns the URL of the image to display to the name 'image_url'.
image_url = 'https://cdn.who.int/media/images/default-source/mca/mca-covid-19/coronavirus-2.tmb-1920v.jpg?sfvrsn=4dba955c_19'

# Display the image
Image(url=image_url, width=600)

# **Context**

When reporting about COVID-19, the Associated Press (AP) used data collected by the Johns Hopkins University Center for Systems Science and Engineering as a source for outbreak caseloads and death counts for the United States and globally.

The Johns Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests (and the ability to turn around test results quickly) rather than actual disease spread or true infection rates.

The data you will be analyzing is from the Johns Hopkins dashboard (link below) that is updated throughout the day. Like all organizations dealing with data, Johns Hopkins is constantly refining and cleaning up their data feed, so there may be brief moments where data does not appear correctly. You can find the Johns Hopkins daily data reports at https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data.

The AP updates their dataset hourly at 45 minutes past the hour.
To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, visit https://www.ap.org/content/formats/data/.

Attribution: Johns Hopkins University COVID-19 tracking project
* Dashboard: https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6





# **About the Dataset**

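This dataset contains 141 rows (numbered 0 through 140 in the preview below), each corresponding to one county in a random sample of U.S. counties. A total of 9 columns are provided: an unnamed index column plus the 8 variables listed below: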

| Variable Name(s)      | Description                             |
|:----------------------|:----------------------------------------|
| county_name           | Name of the county                      |
| state                 | State in which the county is located    |
| nchs_urbanization     | Urban-Rural category                    |
| total_population      | County population size                  |
| confirmed             | Number of confirmed cases in the county |
| confirmed_per_100000  | Population adjusted confirmed COVID-19 cases per 100000 people |
| deaths                | Number of deaths in county due to COVID-19 |
| deaths_per_100000     | Population adjusted COVID-19 deaths per 100000 people |

* For more information about nchs_urbanization, visit https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf
* Note on "population adjusted": Is 1500 a lot of cases? It is certainly more significant in a population of 1500 people than in a population of 10 million people. One way to compare populations of different sizes is to calculate the rate "as if" there were only 100,000 people. Do that adjustment for all populations you want to compare. In class we learned to use relative frequency when comparing groups of different sizes. The population adjusted method is another way to do that.
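
The population adjustment described in the note above can be sketched in a few lines of Python. The helper name `rate_per_100k` is an assumption made for illustration; the counts and populations are copied from two rows of the data preview shown later in this notebook.

```python
def rate_per_100k(count, population):
    """Scale a raw count to what it would be in a population of 100,000."""
    return count / population * 100_000

# Values copied from two rows of the data preview in this notebook.
lowndes_rate = rate_per_100k(3251, 10236)      # Lowndes, Alabama
ontario_rate = rate_per_100k(25821, 109472)    # Ontario, New York

# Rounded to two decimals, these match the confirmed_per_100000 column
# for the same two counties (31760.45 and 23586.85).
print(round(lowndes_rate, 2))
print(round(ontario_rate, 2))
```

Ontario has far more confirmed cases than Lowndes, yet its population-adjusted rate is lower, which is the kind of comparison the adjustment makes possible.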






# **A Snippet of the Data**

Let's take a look at the data. To do this, first we import it directly from the url.

In [3]:
# Assigns the URL where the data file is stored to 'url'.
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/COVID%20Cases.csv'

# Reads in the CSV data file and assigns it to the DataFrame 'df'.
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll first call the *pd.set_option* function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

Unnamed: 0,county_name,state,nchs_urbanization,total_population,confirmed,confirmed_per_100000,deaths,deaths_per_100000
0,Lowndes,Alabama,Medium metro,10236,3251,31760.45,80,781.56
1,Ontario,New York,Large fringe metro,109472,25821,23586.85,212,193.66
2,Waukesha,Wisconsin,Large fringe metro,398879,137985,34593.20,1216,304.85
3,Escambia,Florida,Medium metro,311522,96194,30878.72,1452,466.10
4,Greenbrier,West Virginia,Non-core,35347,12633,35739.95,182,514.90
...,...,...,...,...,...,...,...,...
137,Bourbon,Kentucky,Medium metro,20144,7688,38165.21,73,362.39
138,Schley,Georgia,Micropolitan,5211,1387,26616.77,11,211.09
139,Neshoba,Mississippi,Non-core,29376,12475,42466.64,247,840.82
140,Wise,Virginia,Micropolitan,39025,13596,34839.21,225,576.55
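
Before analyzing a DataFrame, it can help to confirm how many rows and columns were read in. The sketch below assumes a small stand-in table named `preview`, built by hand from three rows of the output above; in the notebook itself the same attributes would be read from `df`.

```python
import pandas as pd

# Hypothetical mini-table built from three rows of the preview above;
# the real notebook reads the full CSV into `df` instead.
preview = pd.DataFrame({
    "county_name": ["Lowndes", "Ontario", "Waukesha"],
    "state": ["Alabama", "New York", "Wisconsin"],
    "total_population": [10236, 109472, 398879],
    "confirmed": [3251, 25821, 137985],
})

# .shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = preview.shape
print(f"{n_rows} rows, {n_cols} columns")

# .dtypes lists the data type stored in each column.
print(preview.dtypes)
```

The `Unnamed: 0` column in the output above typically appears when a CSV file was saved with its row index included. Passing `index_col=0` to `pd.read_csv` is one way to treat that column as the index instead.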


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace ellipsis (...) with the relevant names or code.
* For problems that require a written response, double click the text box to start typing.
* Reference the 3 tutorials from activity for assistance.
* Attend office hours if you still need help.

This assignment is intended to explore the linear relationship between two quantitative variables in the data. Choose the dependent and independent variables based on your last name.

| Last Name | Dependent Variable (y)        | Independent Variable (x) |
|-----------|-------------------------------|--------------------------|
| A-L       | Number of confirmed cases     | County population size   |
| M-Z       | Number of deaths due to COVID | County population size   |




Based on your last name, which variable is explanatory and which is the response?

Explanatory: ...

Response: ...

What do you think the direction (positive/negative) and strength (weak/moderate/strong) will be?

# ...

# **QUESTION 1**
## **Scatterplot**

**1.1)** Construct a scatterplot that could be used to show the relationship between the variables stated above.

In [5]:
# Create a scatter plot using Plotly Express

# Scatter plot for A-L
# STUDENTS: replace ... as stated
scatter_plot_AL = px.scatter(df,
                 # Replace ... with the explanatory variable name
                 x='total_population',
                 # Replace ... with the response variable name
                 y='confirmed',
                         # Replace ... with a better x-axis label
                 labels={'total_population': 'County Population Size',
                         # Replace ... with a better y-axis label
                         'confirmed': 'Number of Confirmed COVID Cases'})

# Updating layout
# STUDENTS: Do not change anything in this block of code.
scatter_plot_AL.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot A-L',   # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=575,                                 # Sets the width of the plot.
    height=400                                 # Sets the height of the plot for portrait mode.
)

# Show the plot:
scatter_plot_AL.show()

# Scatter plot for M-Z
# Create a scatter plot using Plotly Express
scatter_plot_MZ = px.scatter(df,
                 x='total_population', #explanatory variable name
                 y='deaths', #response variable name
                 labels={'total_population': 'County Population Size', #updates axis labels
                         'deaths': 'Number of Deaths Due to COVID'})

# Updating layout: Students do not change anything in this block of code.
scatter_plot_MZ.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot M-Z',   # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=575,                                 # Sets the width of the plot.
    height=400                                 # Sets the height of the plot for portrait mode.
)

# Show the plot:
scatter_plot_MZ.show()

**1.2)** Describe the relationship between the two variables and include context, and identify the direction, strength, and form.

**A-L:** There is a strong, positive, linear relationship between the total population size and the number of confirmed cases.

**M-Z:** There is a strong, positive, linear relationship between the total population size and the number of deaths due to COVID.

Is this what you expected?

Answers will vary.

# **QUESTION 2**

## Correlation Coefficient

**2.1)** Calculate the value of the correlation coefficient.

In [6]:
# Calculate the correlation coefficient

# STUDENTS: Replace ... as stated
# Replace the 1st ... with the response variable name.
# Replace the 2nd ... with the explanatory variable name
correlation_AL = df['confirmed'].corr(df['total_population'])
correlation_MZ = df['deaths'].corr(df['total_population'])

# Print the correlation coefficient
print(f"A-L: Correlation coefficient between pop size and confirmed is: r = {correlation_AL}")
print(f"\nM-Z: Correlation coefficient between pop size and deaths is : r = {correlation_MZ}")


A-L: Correlation coefficient between pop size and confirmed is: r = 0.9898086770189443

M-Z: Correlation coefficient between pop size and deaths is : r = 0.9031179334465986
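
To see what the `.corr` call computes (pandas uses the Pearson correlation by default), the sketch below works out r by hand from z-scores and compares it with NumPy's built-in function. It uses only the first five rows of the data preview, so the value it prints is not expected to match the output above; the variable names are assumptions made for illustration.

```python
import numpy as np

# Populations and confirmed counts copied from the first five preview rows.
# This is NOT the full data set, so r will differ from the output above.
x = np.array([10236, 109472, 398879, 311522, 35347], dtype=float)
y = np.array([3251, 25821, 137985, 96194, 12633], dtype=float)

# Standardize each variable using the sample standard deviation (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# r is the sum of the z-score products divided by n - 1.
r_manual = np.sum(zx * zy) / (len(x) - 1)

# NumPy's built-in Pearson correlation, for comparison.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point rounding, since both follow the same definition of r.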


**2.2)** Interpret the correlation coefficient.

**A-L:** Because r ≈ 0.99 is positive and close to 1, there is a strong, positive, linear relationship between the total population size and the number of confirmed cases.

**M-Z:** Because r ≈ 0.90 is positive and close to 1, there is a strong, positive, linear relationship between the total population size and the number of deaths due to COVID.

# **QUESTION 3**

## Least Squares Regression

**3.1)** Calculate the linear model and add the regression line to the scatterplot.

In [7]:
# Calculate the LSRL and add the regression line to the scatterplot.

# A-L
# Extract the variables from the dataframe
# STUDENTS: Replace ... as stated below.
# Replace ... with the explanatory variable name
X = df['total_population']
# Replace ... with the response variable name
Y = df['confirmed']

# Add a constant to the explanatory variable for the regression model.
# This forces Statsmodels to calculate the y-intercept.
# STUDENTS: Do not change anything in this block of code.
X_with_const = sm.add_constant(X)

# Fit the linear regression model (calculate slope and y-intercept)
# STUDENTS: Do not change anything in this block of code.
model_AL = sm.OLS(Y, X_with_const).fit()

# If we repeatedly run this section of code then a new regression line is
# added each run, unless we recreate the scatterplot each iteration.
# STUDENTS: Replace ... as stated below.
scatter_plot_AL = px.scatter(df,
                 # Replace ... with the explanatory variable name
                 x='total_population',
                 # Replace ... with the response variable name
                 y='confirmed',
                         # Replace ... with a better x-axis label
                 labels={'total_population': 'County Population Size',
                         # Replace ... with a better y-axis label
                         'confirmed': 'Number of Confirmed COVID Cases'})

# Updating layout
# STUDENTS: Do not change anything in this block of code.
scatter_plot_AL.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot A-L',   # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=600,                                 # Sets the width of the plot.
    height=400,                                # Sets the height of the plot for portrait mode.
    showlegend=False                           # Disables the legend (key) display
)

# Add the regression line to the scatterplot
scatter_plot_AL.add_scatter(x=X, y=model_AL.predict(X_with_const),
                         mode='lines', name='Regression Line')

# Show the plot
scatter_plot_AL.show()

# M-Z
# Extract the variables from the dataframe
X = df['total_population']
Y = df['deaths']

# Add a constant to the explanatory variable for the regression model.
# This forces Statsmodels to calculate the y-intercept.
X_with_const = sm.add_constant(X)

# Fit the linear regression model (calculate slope and y-intercept)
model_MZ = sm.OLS(Y, X_with_const).fit()

# If we repeatedly run this section of code then a new regression line is
# added each run, unless we recreate the scatterplot each iteration.
scatter_plot_MZ = px.scatter(df,
                 x='total_population', #explanatory variable name
                 y='deaths', #response variable name
                 labels={'total_population': 'County Population Size', #updates axis labels
                         'deaths': 'Number of Deaths Due to COVID'})

# Updating layout: Students do not change anything in this block of code.
scatter_plot_MZ.update_layout(
    plot_bgcolor='rgba(255,255,255,1)',        # Sets the background color of the plot area to white with full opacity.
    xaxis=dict(
        showline=True,                         # Displays a line on the x-axis.
        showgrid=False,                        # Hides the grid lines on the x-axis.
        linecolor='black'                      # Sets the color of the x-axis line to black.
    ),
    yaxis=dict(
        showline=True,                         # Displays a line on the y-axis.
        showgrid=False,                        # Hides the grid lines on the y-axis.
        linecolor='black'                      # Sets the color of the y-axis line to black.
    ),
    title={
        'text': 'Scatter Plot M-Z',   # Sets the title text.
        'y': 0.9,                              # Positions the title 90% of the way up the plot.
        'x': 0.5,                              # Centers the title horizontally.
        'xanchor': 'center',                   # Anchors the title at its center on the x-axis.
        'yanchor': 'top',                      # Anchors the title at the top on the y-axis.
        'font': dict(
            size=16                            # Sets the title font size to 16 (smaller than default).
        ),
    },
    width=600,                                 # Sets the width of the plot.
    height=400,                                # Sets the height of the plot for portrait mode.
    showlegend=False                           # Disables the legend (key) display
)

# Add the regression line to the scatterplot
scatter_plot_MZ.add_scatter(x=X, y=model_MZ.predict(X_with_const),
                         mode='lines', name='Regression Line')

# Show the plot
scatter_plot_MZ.show()

**3.2)** Print the coefficients (Intercept and Slope)

In [8]:
# Print the coefficients (Intercept and Slope)

# STUDENTS: Do not change the code below.
# The coefficients were calculated when each regression model was fit in 3.1
# and stored in 'model_AL' and 'model_MZ'. This assigns the y-intercept and
# slope of each model to a named Series, then prints the values.

# A-L
coefficients_AL = model_AL.params
coefficients_AL.name = 'Coefficients'

# M-Z
coefficients_MZ = model_MZ.params
coefficients_MZ.name = 'Coefficients'

# Print the coefficient Series for each group
print("A-L Group:")
print(coefficients_AL)
print("\nM-Z Group:")
print(coefficients_MZ)

A-L Group:
const               1135.492447
total_population       0.290335
Name: Coefficients, dtype: float64

M-Z Group:
const               53.501191
total_population     0.002894
Name: Coefficients, dtype: float64
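
The slope and intercept of a least squares line can also be computed directly from summary statistics: slope = r · (s_y / s_x) and intercept = ȳ − slope · x̄. The sketch below checks those formulas against `np.polyfit`, which is swapped in here for Statsmodels only to keep the example self-contained. It uses the first five rows of the data preview, so its coefficients are not expected to match the output above.

```python
import numpy as np

# First five preview rows only (NOT the full data set).
x = np.array([10236, 109472, 398879, 311522, 35347], dtype=float)   # total_population
y = np.array([3251, 25821, 137985, 96194, 12633], dtype=float)      # confirmed

# Slope from the correlation and the two sample standard deviations.
r = np.corrcoef(x, y)[0, 1]
slope = r * y.std(ddof=1) / x.std(ddof=1)

# The least squares line always passes through the point (x-bar, y-bar).
intercept = y.mean() - slope * x.mean()

# A degree-1 polynomial fit is the same least squares line;
# polyfit returns the highest-degree coefficient (the slope) first.
fit_slope, fit_intercept = np.polyfit(x, y, 1)

print(f"formulas: y-hat = {intercept} + {slope}x")
print(f"polyfit:  y-hat = {fit_intercept} + {fit_slope}x")
```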


**3.3)** Write the equation of the line that predicts your dependent variable from county population size. To type 𝑦̂ you can type “y-hat” or “predicted y”. Do NOT round.

**A-L:** y-hat = 1135.492447 + 0.290335x

**M-Z:** y-hat = 53.501191 + 0.002894x

# **QUESTION 4**

## Slope
Interpret the slope

**NOTE:** units are not required, as they are implied by the variable names.

**A-L:** We expect the number of confirmed cases to increase by 0.29 (cases) when total population size increases by 1 (person).

**M-Z:** We expect the number of deaths to increase by 0.003 (deaths) when the total population size increases by 1 (person).
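
Because the model is linear, the same slope can be restated for a larger change in population, which is often easier to picture than a change of one person. The sketch below is one way to do that; the helper name `expected_change` and the choice of 1,000 people are assumptions made for illustration, and the slopes are the rounded values printed in 3.2.

```python
# Slopes as printed in 3.2 (rounded to six decimal places).
slope_AL = 0.290335   # confirmed cases per additional person
slope_MZ = 0.002894   # deaths per additional person

def expected_change(slope, delta_x):
    """Predicted change in y when x changes by delta_x on a straight line."""
    return slope * delta_x

# Expected change in each response for a county with 1,000 more people.
change_AL = expected_change(slope_AL, 1000)
change_MZ = expected_change(slope_MZ, 1000)

print(f"A-L: about {change_AL:.1f} more confirmed cases per 1,000 more people")
print(f"M-Z: about {change_MZ:.1f} more deaths per 1,000 more people")
```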

# **QUESTION 5**
## Y-Intercept

Interpret the y-intercept of the least squares regression line in the context of this study. State whether the interpretation is reasonable.

**NOTE:** units are not required, as they are implied by the variable names.

**A-L:** We expect the number of confirmed cases to be 1135.492 (cases) when the total population size is 0 (people).

This does not make sense: a county with no people cannot have confirmed cases.

**M-Z:** We expect the number of deaths to be 53.501 (deaths) when the total population size is 0 (people).

This does not make sense: a county with no people cannot have deaths.
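
One way to judge whether an intercept (or any prediction) is reasonable is to ask whether the x value falls inside the range of x values that were observed. The sketch below assumes a hypothetical helper, `is_extrapolation`, and uses only the nine populations visible in the data preview, not the full data set.

```python
# Populations from the nine rows visible in the data preview (NOT the full data).
preview_populations = [10236, 109472, 398879, 311522, 35347,
                       20144, 5211, 29376, 39025]

def is_extrapolation(x_value, observed_x):
    """Return True when x_value lies outside the observed range of x."""
    return x_value < min(observed_x) or x_value > max(observed_x)

print(is_extrapolation(0, preview_populations))       # True: below the smallest county
print(is_extrapolation(13331, preview_populations))   # False: inside the observed range
```

A population of 0 lies below every county in the preview, so the intercept describes a situation the data never observed.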

# **QUESTION 6**
## Prediction

Show work to predict the value of your dependent variable for the county listed in the table below, using the actual value of the independent variable shown. Round to two (2) decimal places.

| Last Name      | Independent Variable (x)               |
|----------------|----------------------------------------|
| **A-L**        | total population = 13,331 (Grundy, TN) |
| **M-Z**        | total population = 8,827 (Jenkins, GA) |


**A-L:** y-hat = 1135.492447369846 + 0.29033548996786296(13331) = 5005.95 cases

**M-Z:** y-hat = 53.501190634198174 + 0.0028941338674035046(8827) = 79.05 deaths

In [9]:
# This is NOT in the student notebook.
# A-L prediction
intercept_AL = model_AL.params['const']
slope_AL = model_AL.params['total_population']
predicted_AL = (intercept_AL + slope_AL * 13331)
print(f"A-L: y-hat = {intercept_AL} + {slope_AL}(13331) = {predicted_AL}")

# M-Z prediction
intercept_MZ = model_MZ.params['const']
slope_MZ = model_MZ.params['total_population']
predicted_MZ = (intercept_MZ + slope_MZ * 8827)
print(f"\nM-Z: y-hat = {intercept_MZ} + {slope_MZ}(8827) = {predicted_MZ}")


A-L: y-hat = 1135.492447369846 + 0.29033548996786296(13331) = 5005.954864131427

M-Z: y-hat = 53.501190634198174 + 0.0028941338674035046(8827) = 79.04771028176891


# **QUESTION 7**
## Compare
**7.1)** Look up the actual value of your dependent variable for the value of the independent variable above. Be sure to type the value of the independent variable listed in Question 6 into the code below.

In [10]:
# Filter the DataFrame to find the row where your independent variable
#   equals the value from the table in Question 6.

#A-L
# STUDENTS:
# Replace the 1st ... with the explanatory variable name
# Replace the 2nd ... with the value from the table in Question 6
# Replace the 3rd ... with the response variable name
actual_dependent_AL = df[df['total_population']== 13331]['confirmed']
actual_AL = int(actual_dependent_AL.values[0])
print(f"A-L: The actual value of the dependent variable is: {actual_AL}")

#M-Z
actual_dependent_MZ = df[df['total_population']== 8827]['deaths']
actual_MZ = int(actual_dependent_MZ.values[0])
print(f"\nM-Z: The actual value of the dependent variable is: {actual_MZ}")

A-L: The actual value of the dependent variable is: 4552

M-Z: The actual value of the dependent variable is: 68
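
The lookup above relies on exactly one row matching the filter: `.values[0]` raises an `IndexError` when nothing matches and silently takes the first row when several match. The sketch below shows one possible guard. The helper `lookup_value` and the three-row table `preview` are assumptions made for illustration, not part of the assignment code.

```python
import pandas as pd

def lookup_value(frame, filter_col, filter_value, target_col):
    """Return target_col for the single row where filter_col == filter_value.

    Raises ValueError if zero rows or more than one row match.
    """
    matches = frame.loc[frame[filter_col] == filter_value, target_col]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 matching row, found {len(matches)}")
    return matches.iloc[0]

# Hypothetical mini-table built from three rows of the data preview.
preview = pd.DataFrame({
    "total_population": [10236, 109472, 398879],
    "confirmed": [3251, 25821, 137985],
})

found = lookup_value(preview, "total_population", 109472, "confirmed")
print(f"Confirmed cases for population 109472: {found}")
```

Calling the helper with a population that is not in the table raises a `ValueError` with a readable message instead of an `IndexError`.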


**7.2)** Compare the predicted value of your dependent variable from Question 6 to the actual value from 7.1. Use the actual name of your dependent variable, as modeled in activity.


**A-L:** The actual number of confirmed cases is lower than predicted.

**M-Z:** The actual number of deaths is lower than predicted.
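
The comparison can also be expressed as a number: the residual, defined as actual minus predicted. The sketch below reuses the predicted values printed in Question 6 and the actual values printed in 7.1; the variable names are assumptions made for illustration.

```python
# Predicted values as printed in Question 6 and actual values from 7.1.
predicted_AL, actual_AL = 5005.954864131427, 4552   # confirmed cases, Grundy, TN
predicted_MZ, actual_MZ = 79.04771028176891, 68     # deaths, Jenkins, GA

# Residual = actual - predicted. A negative residual means the regression
# line overestimated the actual value.
residual_AL = actual_AL - predicted_AL
residual_MZ = actual_MZ - predicted_MZ

print(f"A-L residual: {residual_AL:.2f} cases")
print(f"M-Z residual: {residual_MZ:.2f} deaths")
```

Both residuals are negative, which agrees with the written answers: each actual value is lower than its prediction.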

# **QUESTION 8**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 8a or 8b, but not both.

**8a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

...

--OR--

**8b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
Note 1: You do not have to select Print Preview. You can print directly from the notebook.
Note 2: Image and graph sizes have been set so you should be able to see them correctly without making any changes to the browser width or the layout (portrait vs landscape).
1. Run all code one last time and make sure your graphs can be seen.
2. File -> Print (or Ctrl+P / Cmd+P)
3. Change the "Destination" to PDF.
4. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
**Watch the "GradeScope Submission" video for help.**
1. Login to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the name of the assignment that matches your data set
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right. Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"

#### **Still need help? Your STAT 108 team is here to help. Take your laptop to office hours.**
