<a href="https://colab.research.google.com/github/ksuaray/LAEP_S24/blob/Covid/Covid_Tracker_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Covid-19 Case Tracker**

#**Importing Necessary Python Modules**

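Python incorporates a variety of open-source add-ins called **modules** that extend what the language can do. The modules imported below handle data loading and manipulation (pandas, NumPy), plotting (Plotly Express), image display (IPython), and regression modeling (statsmodels).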

In [2]:
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.display import Image
import statsmodels.api as sm


#**Context**

The Associated Press (AP) is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.

This data is from the [Hopkins dashboard](https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6) that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At [this link](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data), you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP updates this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go [here](https://www.ap.org/en-us/formats/data-journalism) or email kromano@ap.org.

Attribution: Johns Hopkins University COVID-19 tracking project

In [3]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://t3.ftcdn.net/jpg/03/33/03/90/240_F_333039083_vkw5CtEmwFSv4ibMFpOpmv49pPTawcDh.jpg'

# Display the image
Image(url=image_url)

#**About the Dataset**

This dataset contains 142 rows, each corresponding to a county in a random sample. A total of 8 variables are provided, as listed below:

**Variables**

| Variable Name(s)  | Description |
|-------------------|-------------|
| county_name       | The name of the county |
| state             | State in which the county is located |
| nchs_urbanization | Urban-Rural category. For more details see [CDC Urban-Rural Classification Scheme for Counties](https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf) |
| total_population  | County population |
| confirmed         | Number of confirmed covid-19 cases in county |
| confirmed_per_100000 | Population adjusted confirmed covid-19 case rate per 100000 people |
| deaths            | Number of deaths in county due to covid-19 |
| deaths_per_100000 | Population adjusted covid-19 death rate per 100000 people |



*Attribution:  FiveThirtyEight.com*
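
The two `_per_100000` columns are population-adjusted rates. As a minimal sketch, assuming the usual definition (count divided by population, times 100,000), the calculation looks like this. The helper name `rate_per_100k` is hypothetical, and the input values are copied from the first two rows of the data preview shown later in this notebook.

```python
def rate_per_100k(count, population):
    """Return a population-adjusted rate per 100,000 people."""
    return count / population * 100_000

# Confirmed cases and total population for two counties in the data preview
lowndes_rate = rate_per_100k(3251, 10236)      # Lowndes, Alabama
ontario_rate = rate_per_100k(25821, 109472)    # Ontario, New York

print(round(lowndes_rate, 2), round(ontario_rate, 2))
```

Under that assumption, the results should agree with the `confirmed_per_100000` values listed for the same counties.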

We can view a snippet of the data by first importing it directly from the URL below.

**Data**

In [4]:
file_path = "https://raw.githubusercontent.com/ksuaray/LAEP_S24/Covid/covid_cases23.csv"
df = pd.read_csv(file_path)


Next, we can display the data by typing the name of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [5]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

Unnamed: 0,county_name,state,nchs_urbanization,total_population,confirmed,confirmed_per_100000,deaths,deaths_per_100000
0,Lowndes,Alabama,Medium metro,10236,3251,31760.45,80,781.56
1,Ontario,New York,Large fringe metro,109472,25821,23586.85,212,193.66
2,Waukesha,Wisconsin,Large fringe metro,398879,137985,34593.20,1216,304.85
3,Escambia,Florida,Medium metro,311522,96194,30878.72,1452,466.10
4,Greenbrier,West Virginia,Non-core,35347,12633,35739.95,182,514.90
...,...,...,...,...,...,...,...,...
137,Bourbon,Kentucky,Medium metro,20144,7688,38165.21,73,362.39
138,Schley,Georgia,Micropolitan,5211,1387,26616.77,11,211.09
139,Neshoba,Mississippi,Non-core,29376,12475,42466.64,247,840.82
140,Wise,Virginia,Micropolitan,39025,13596,34839.21,225,576.55


#**ASSIGNMENT 1 - Correlation and Regression Analysis**

**INSTRUCTIONS**

This assignment is intended to explore the univariate linear relationships between quantitative variables in the data. Use SPSS to analyze the relationship between the two variables and complete each of the following questions. As appropriate, copy the SPSS output and paste it into the correct part below. For problems that require a written response, type the answer below.

##**QUESTION 1**

Construct two scatterplots, each with independent variable TOTAL_POPULATION (x) and dependent variables based on your last name:

| LAST NAME | Dependent Variable 1 (y) | Dependent Variable 2 (y) |
|-----------|--------------------------|--------------------------|
| A-M       | CONFIRMED                | CONFIRMED_PER_100000     |
| N-Z       | DEATHS                   | DEATHS_PER_100000        |



In [6]:
# Scatter plot of confirmed vs. total_population (last names A-M)

scatter_al = px.scatter(
    df, 
    x='total_population', 
    y='confirmed',
    title='Scatter Plot of confirmed by total_population for A-M'
)

scatter_al.show()
# Scatter plot of confirmed_per_100000 vs. total_population (last names A-M)

scatter_al2 = px.scatter(
    df, 
    x='total_population', 
    y='confirmed_per_100000',
    title='Scatter Plot of confirmed_per_100000 by total_population for A-M'
)

scatter_al2.show()


In [7]:
# Scatter plot of deaths vs. total_population (last names N-Z)
scatter_mz = px.scatter(
    df, 
    x='total_population', 
    y='deaths',
    title='Scatter Plot of deaths by total_population for N-Z'
)

scatter_mz.show()
# Scatter plot of deaths_per_100000 vs. total_population (last names N-Z)

scatter_mz2 = px.scatter(
    df, 
    x='total_population', 
    y='deaths_per_100000',
    title='Scatter Plot of deaths_per_100000 by total_population for N-Z'
)

scatter_mz2.show()

##**QUESTION 2**

 Compute and interpret the values of the correlation coefficient between the two pairs of variables represented in question 1 above.
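
For reference, the correlation coefficient asked for here is Pearson's r. The sketch below computes it from its definition and cross-checks the result against `np.corrcoef`. The function name `pearson_r` and the toy values are made up for illustration and are not taken from the county data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r: sum of cross-deviations divided by the root of the
    product of the summed squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Made-up toy values, not from the county data
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]

r_toy = pearson_r(x_toy, y_toy)
r_check = np.corrcoef(x_toy, y_toy)[0, 1]   # NumPy's built-in calculation
print(r_toy, r_check)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.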

In [9]:
# Function to compute regression statistics and return them in a DataFrame
def compute_regression_stats(df, independent_var, dependent_var):
    X = df[independent_var]
    Y = df[dependent_var]
    X = sm.add_constant(X)  # Adds a constant term to the predictor

    # Fit the OLS model
    model = sm.OLS(Y, X).fit()
    predictions = model.predict()
    residuals = Y - predictions
    see = np.sqrt((residuals**2).sum() / (len(Y) - 2))
    # Compute the correlation coefficient
    correlation = X[independent_var].corr(Y)

    # Compile the statistics
    stats_df = pd.DataFrame({
        'R': [correlation],
        'R Squared': [model.rsquared],
        'Adjusted R Squared': [model.rsquared_adj],
        'Std. Error of the Estimate': [see]
    })
    
    return stats_df


In [13]:
# Calculate statistics for A-M group
stats_am_confirmed = compute_regression_stats(df, 'total_population', 'confirmed')
# Display the results
print("A-M Group: confirmed")
display(stats_am_confirmed)

A-M Group: confirmed


Unnamed: 0,R,R Squared,Adjusted R Squared,Std. Error of the Estimate
0,0.989809,0.979721,0.979576,8255.189622
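
With a single predictor, R Squared is the square of R, and Adjusted R Squared follows from R Squared and the sample size. The sketch below checks both against the output above, assuming n = 142 (the row count stated in "About the Dataset") and the standard adjusted R-squared formula. The variable names are illustrative only.

```python
n = 142          # number of counties, as stated in "About the Dataset"
k = 1            # one predictor: total_population
r = 0.989809     # R from the output above

r_squared = r ** 2
# Standard adjustment for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(r_squared, 6), round(adj_r_squared, 6))
```

Because R is itself rounded in the output, the recomputed values can differ from the reported ones in the last decimal place.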


• There is a strong, positive, linear relationship between the total population size and the number of confirmed cases.
• (Explanation of whether it was what the student expected.)

In [12]:
stats_am_confirmed_per_100000 = compute_regression_stats(df, 'total_population', 'confirmed_per_100000')
print("\nA-M Group: confirmed_per_100000")
display(stats_am_confirmed_per_100000)


A-M Group: confirmed_per_100000


Unnamed: 0,R,R Squared,Adjusted R Squared,Std. Error of the Estimate
0,0.045733,0.002091,-0.005036,6149.666741


There is a (very) weak, positive, linear relationship between the total population size and the number of confirmed cases per 100,000.
• (Explanation of whether it was what the student expected.)

In [14]:
# Calculate statistics for N-Z group
stats_nz_deaths = compute_regression_stats(df, 'total_population', 'deaths')
print("\nN-Z Group: Deaths")
display(stats_nz_deaths)


N-Z Group: Deaths


Unnamed: 0,R,R Squared,Adjusted R Squared,Std. Error of the Estimate
0,0.903118,0.815622,0.814305,271.948043


There is a strong, positive, linear relationship between the total population size and the number of deaths due to COVID.
• (Explanation of whether it was what the student expected.)

In [15]:
stats_nz_deaths_per_100000 = compute_regression_stats(df, 'total_population', 'deaths_per_100000')
print("\nN-Z Group: Deaths_per_100000")
display(stats_nz_deaths_per_100000)


N-Z Group: Deaths_per_100000


Unnamed: 0,R,R Squared,Adjusted R Squared,Std. Error of the Estimate
0,-0.246867,0.060943,0.054236,182.133379


There is a weak, negative, linear relationship between the total population size and the number of deaths due to COVID per 100,000.
• (Explanation of whether it was what the student expected.)

## **QUESTION 3**

Which relationship appears strongest? Is this what you expected?


A-M: • The relationship between total population size and the number of confirmed cases.
• (What you expected?)

N-Z:  The relationship between total population size and the number of deaths due to COVID.
• (What you expected?)

## **QUESTION 4**

Compute the least squares regression line using the dependent (y) variable and independent (x) variables indicated below. Add the regression line to the scatterplot. Paste the new scatterplot and output table below. Then type out the prediction equation.

| LAST NAME | Independent Variable (x) | Dependent Variable (y)   |
|-----------|--------------------------|--------------------------|
| A-L       | TOTAL_POPULATION         | CONFIRMED_PER_100000     |
| M-Z       | TOTAL_POPULATION         | DEATHS_PER_100000        |
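
As a reminder of what "least squares" computes, the sketch below derives the slope and intercept from the textbook formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) and compares them with `np.polyfit`. The function name `least_squares_line` and the demo values are hypothetical and are not taken from the county data.

```python
import numpy as np

def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    slope = (dx * dy).sum() / (dx ** 2).sum()     # Sxy / Sxx
    intercept = y.mean() - slope * x.mean()       # line passes through the means
    return intercept, slope

# Made-up demo values
x_demo = [10, 20, 30, 40, 50]
y_demo = [12, 25, 31, 48, 54]

intercept, slope = least_squares_line(x_demo, y_demo)
np_slope, np_intercept = np.polyfit(x_demo, y_demo, 1)   # degree-1 fit: [slope, intercept]
print(intercept, slope)
```

The statsmodels and Plotly calls used in this notebook should produce the same line for the same data; the manual version only makes the arithmetic visible.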


In [16]:
# Scatter plot with regression line for A-L
scatter_al = px.scatter(
    df, 
    x='total_population', 
    y='confirmed_per_100000',
    title='Scatter Plot of confirmed_per_100000 by total_population for A-L',
    trendline='ols'  # Adds OLS regression line
)


scatter_al.show()

In [17]:
# Scatter plot with regression line for M-Z
scatter_mz = px.scatter(
    df, 
    x='total_population', 
    y='deaths_per_100000',
    title='Scatter Plot of deaths_per_100000 by total_population for M-Z',
    trendline='ols'  # Adds OLS regression line
)


scatter_mz.show()

In [23]:
def regression_results_to_dataframe(df, y_col, x_col):
    """
    Fits an OLS regression model and returns a DataFrame containing the coefficients
    and their standard errors.

    Parameters:
    df: DataFrame containing the data.
    y_col: The name of the response variable column.
    x_col: The name of the predictor variable column.

    Returns:
    A pandas DataFrame with coefficients and standard errors, indexed by the model terms.
    """
    # Add a constant to the predictor variable
    X = sm.add_constant(df[x_col])
    
    # Fit the OLS model
    model = sm.OLS(df[y_col], X).fit()
    
    # Extract parameters and standard errors
    params = model.params
    bse = model.bse
    
    # Create a DataFrame with these values
    results_df = pd.DataFrame({
        'Coefficients': params.values,
        'Standard Error': bse.values
    }, index=['(Constant)', x_col])
    
    return results_df


In [24]:
results_al_df = regression_results_to_dataframe(df,'confirmed_per_100000', 'total_population')

# Output the DataFrames
print("A-L Group:")
print(results_al_df)


A-L Group:
                  Coefficients  Standard Error
(Constant)        29471.325634      580.321475
total_population      0.001425        0.002630


Predicted y = 29471.326 + 0.001x

In [25]:
results_mz_df = regression_results_to_dataframe(df,'deaths_per_100000', 'total_population')


print("\nM-Z Group:")
print(results_mz_df)


M-Z Group:
                  Coefficients  Standard Error
(Constant)          455.696466       17.187258
total_population     -0.000235        0.000078


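Predicted y = 455.696 - 0.000235x
NOTE: the slope is negative and is kept to six decimal places because it rounds to 0.000 at three.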
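
When a slope is very small, rounding it by hand can hide both its sign and its size. One option is to build the prediction equation directly from the fitted coefficients. The helper below is a hypothetical sketch (the name `format_prediction_equation` and the default of six decimals are assumptions), applied to the coefficients printed in the two tables above.

```python
def format_prediction_equation(intercept, slope, decimals=6):
    """Format 'Predicted y = b0 + b1x', keeping the sign of the slope."""
    sign = "-" if slope < 0 else "+"
    return f"Predicted y = {intercept:.{decimals}f} {sign} {abs(slope):.{decimals}f}x"

# Coefficients copied from the regression output tables above
eq_al = format_prediction_equation(29471.325634, 0.001425)
eq_mz = format_prediction_equation(455.696466, -0.000235)

print(eq_al)
print(eq_mz)
```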

## **QUESTION 5**

Interpret the slope of the least squares regression line in the context of this study.

A-L: We expect the number of confirmed cases per 100,000 to increase by 0.001 (confirmed cases per 100,000) when total population size increases by 1 (person).
NOTE: units are not required, as they are implied by the variable name.

M-Z: We expect the number of deaths per 100,000 to decrease by 0.000235 (deaths per 100,000) when the total population size increases by 1 (person).
NOTE: units are not required, as they are implied by the variable name.
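
A change of one person is a very small step for county populations, so the slopes can be easier to read when scaled to a larger increment. The sketch below rescales the slopes from the coefficient tables above to a step of 10,000 people. The step size is an arbitrary choice made for illustration.

```python
# Slopes copied from the regression output tables above
slope_confirmed = 0.001425    # confirmed_per_100000 on total_population
slope_deaths = -0.000235      # deaths_per_100000 on total_population

step = 10_000                 # assumed population increment, chosen for readability

change_confirmed = slope_confirmed * step   # expected change in cases per 100,000
change_deaths = slope_deaths * step         # expected change in deaths per 100,000

print(round(change_confirmed, 2), round(change_deaths, 2))
```

Read this way, each additional 10,000 residents is associated with roughly 14 more confirmed cases per 100,000 and roughly 2 fewer deaths per 100,000, on average.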

## **QUESTION 6**

Interpret the y-intercept of the least squares regression line in the context of this study. State whether the interpretation is reasonable.

A-L: • We expect the number of confirmed cases per 100,000 to be 29,471.326 (confirmed cases per 100,000) when the total population size is 0 (people).
• This does not make sense. (Can’t have cases if the population is empty.)
NOTE: units are not required, as they are implied by the variable name.

M-Z: • We expect the number of deaths per 100,000 to be 455.696 (deaths per 100,000) when the total population size is 0 (people).
• This does not make sense.
NOTE: units are not required, as they are implied by the variable name.

## **QUESTION 7**

Predict the value of your dependent variable from question 4 for a county with independent variable value as stated below. Type your work below.

| Last Name | Independent Variable (x)                               |
|-----------|---------------------------------------------------------|
| A-L       | TOTAL_POPULATION x=13331 (COUNTY_NAME = Grundy, TN)    |
| M-Z       | TOTAL_POPULATION x=8827 (COUNTY_NAME = Jenkins, GA)    |


A-L: Predicted y = 29471.326 + 0.001(13,331)
= 29,484.657 confirmed cases per 100,000

M-Z: Predicted y = 455.696 - 0.000235(8,827)
= 453.622 deaths per 100,000
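
The same predictions can be computed in Python from the unrounded coefficients. This is a sketch: the `predict` helper is hypothetical, the coefficients come from the question 4 output tables, and the x values are the ones used in the worked answers above. Results will differ slightly from hand calculations that round the slope first.

```python
def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x

# A-L: confirmed_per_100000 for a county with total_population = 13331
pred_al = predict(29471.325634, 0.001425, 13331)

# M-Z: deaths_per_100000 for a county with total_population = 8827
pred_mz = predict(455.696466, -0.000235, 8827)

print(round(pred_al, 3), round(pred_mz, 3))
```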

## **QUESTION 8**

Look up, in the Excel or SPSS file, the actual county named in the table in question 7 above. Compare your answer in question 7 (the predicted value of your dependent variable) to the actual value of your dependent variable.


A-L: (actual = 34,145.98)
The actual number of cases per 100,000 is higher than the equation predicted.

M-Z: (actual = 770.36)
The actual number of deaths per 100,000 is higher than the equation predicted.
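
The comparison in this question is a residual: the actual value minus the predicted value. A positive residual means the line under-predicted and a negative one means it over-predicted. The sketch below uses the A-L figures quoted above; the `residual` helper is a hypothetical name.

```python
def residual(actual, predicted):
    """Residual = actual - predicted; positive means the model under-predicted."""
    return actual - predicted

# A-L: actual and predicted confirmed cases per 100,000 for Grundy, TN
res_al = residual(34145.98, 29484.657)
direction = "higher" if res_al > 0 else "lower"

print(round(res_al, 3), f"actual is {direction} than predicted")
```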

## **QUESTION 9**

Generate a paragraph of at least 100 words to address one of the following questions:

### **QUESTION 9a**

Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

### **QUESTION 9b**

Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

### **QUESTION 9c**
Discuss how analyzing your chosen data set using statistical methods could help you be aware of social issues, contribute to society, and advocate for marginalized communities.