<a href="https://colab.research.google.com/github/ksuaray/LAEP_S24/blob/Covid/Covid_Tracker_Assignment_Key.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Covid-19 Case Tracker**

# **Importing Necessary Python Modules**

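Python incorporates a variety of open-source add-ins called **modules** that extend what the language can do. The modules imported below let us load and manipulate tabular data (pandas), perform numerical computations (NumPy), and build interactive charts (Plotly Express).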

In [45]:
import pandas as pd
import numpy as np
import plotly.express as px
from IPython.display import Image


# **Context**

The Associated Press (AP) is using data collected by the Johns Hopkins University Center for Systems Science and Engineering as our source for outbreak caseloads and death counts for the United States and globally.

The Hopkins data is available at the county level in the United States. The AP has paired this data with population figures and county rural/urban designations, and has calculated caseload and death rates per 100,000 people. Be aware that caseloads may reflect the availability of tests -- and the ability to turn around test results quickly -- rather than actual disease spread or true infection rates.
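
As a rough illustration of the population adjustment described above, the sketch below recomputes the per-100,000 rates for a single county. The helper name `rate_per_100k` is hypothetical, and the inputs are the Lowndes, Alabama values that appear in the data preview later in this notebook.

```python
def rate_per_100k(count, population):
    """Return a population-adjusted rate per 100,000 people."""
    return count / population * 100_000

# Lowndes, Alabama: 3,251 confirmed cases, 80 deaths, population 10,236
confirmed_rate = rate_per_100k(3251, 10236)
death_rate = rate_per_100k(80, 10236)

print(round(confirmed_rate, 2))  # 31760.45, matching confirmed_per_100000
print(round(death_rate, 2))      # 781.56, matching deaths_per_100000
```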

This data is from the [Hopkins dashboard](https://www.arcgis.com/apps/dashboards/bda7594740fd40299423467b48e9ecf6) that is updated regularly throughout the day. Like all organizations dealing with data, Hopkins is constantly refining and cleaning up their feed, so there may be brief moments where data does not appear correctly. At [this link](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data), you’ll find the Hopkins daily data reports, and a clean version of their feed.

The AP updates this dataset hourly at 45 minutes past the hour.

To learn more about AP's data journalism capabilities for publishers, corporations and financial institutions, go [here](https://www.ap.org/en-us/formats/data-journalism) or email kromano@ap.org.

Attribution: Johns Hopkins University COVID-19 tracking project

# **About the Dataset**

This dataset contains 142 rows corresponding to a random sample of counties. A total of 8 variables are provided as listed below:



| **Variable Name(s)** | **Description** |
|----------------------|-----------------|
| county_name          | The name of the county |
| state                | State in which the county is located |
| nchs_urbanization    | Urban-Rural category. For more details see [here](https://www.cdc.gov/nchs/data/series/sr_02/sr02_166.pdf) |
| total_population     | County population |
| confirmed            | Number of confirmed COVID-19 cases in the county |
| confirmed_per_100000 | Population-adjusted confirmed COVID-19 case rate per 100,000 people |
| deaths               | Number of deaths in the county due to COVID-19 |
| deaths_per_100000    | Population-adjusted COVID-19 death rate per 100,000 people |



*Attribution:  FiveThirtyEight.com*

We can view a snippet of the data by first importing it directly from the URL below.

**Data**

In [47]:
file_path = "https://raw.githubusercontent.com/ksuaray/LAEP_S24/Covid/covid_cases23.csv"
df = pd.read_csv(file_path)


Next, we can display the data by typing the name of the DataFrame. To ensure we can see all columns, we'll use the *pd.set_option* function.

In [48]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

Unnamed: 0,county_name,state,nchs_urbanization,total_population,confirmed,confirmed_per_100000,deaths,deaths_per_100000
0,Lowndes,Alabama,Medium metro,10236,3251,31760.45,80,781.56
1,Ontario,New York,Large fringe metro,109472,25821,23586.85,212,193.66
2,Waukesha,Wisconsin,Large fringe metro,398879,137985,34593.20,1216,304.85
3,Escambia,Florida,Medium metro,311522,96194,30878.72,1452,466.10
4,Greenbrier,West Virginia,Non-core,35347,12633,35739.95,182,514.90
...,...,...,...,...,...,...,...,...
137,Bourbon,Kentucky,Medium metro,20144,7688,38165.21,73,362.39
138,Schley,Georgia,Micropolitan,5211,1387,26616.77,11,211.09
139,Neshoba,Mississippi,Non-core,29376,12475,42466.64,247,840.82
140,Wise,Virginia,Micropolitan,39025,13596,34839.21,225,576.55


# **ASSIGNMENT 1 - Descriptive Statistics: Graphical and Numerical Summary**

**INSTRUCTIONS**

Use Python to analyze the data set and complete each of the following. As appropriate, copy the output and paste it in the correct part below. For problems that require a written response, type the answer below.

## **QUESTION 1**

Determine whether the three variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable         | Qual or Quant | Dis, Con, or Neither |
|------------------|---------------|----------------------|
| **nchs_urbanization**              | Qual or Quant  | Dis, Con, or Neither              |
| **confirmed**           | Qual or Quant   | Dis, Con, or Neither              |
| **deaths_per_100000**           | Qual or Quant  | Dis, Con, or Neither           |
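
One way to start on this question, sketched under the assumption that the columns load with the types suggested by the data preview above, is to ask pandas whether each column is numeric. The three-row DataFrame below is a hand-copied stand-in for `df`. A numeric dtype only hints that a variable is quantitative; deciding between discrete and continuous still requires thinking about what the values measure.

```python
import pandas as pd

# Hand-copied stand-in for the first rows of the course DataFrame (assumption)
sample = pd.DataFrame({
    "nchs_urbanization": ["Medium metro", "Large fringe metro", "Large fringe metro"],
    "confirmed": [3251, 25821, 137985],
    "deaths_per_100000": [781.56, 193.66, 304.85],
})

# True suggests a quantitative variable; False suggests a qualitative one
is_numeric = {col: pd.api.types.is_numeric_dtype(sample[col]) for col in sample.columns}
print(is_numeric)
```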


## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of nchs_urbanization. State any fact that jumps out to you.

In [49]:
#Frequency table
freq_table = df['nchs_urbanization'].value_counts()
freq_table

Non-core               61
Micropolitan           25
Large fringe metro     19
Medium metro           17
Small metro            17
Large central metro     3
Name: nchs_urbanization, dtype: int64

In [50]:
#Relative frequency table
freq_table/len(df)

Non-core               0.429577
Micropolitan           0.176056
Large fringe metro     0.133803
Medium metro           0.119718
Small metro            0.119718
Large central metro    0.021127
Name: nchs_urbanization, dtype: float64

In [65]:
# Bar chart of nchs_urbanization
fig = px.bar(df, x='nchs_urbanization',
             title='Frequency Distribution Bar Chart of nchs_urbanization')
fig.show()

In [52]:
# Pie Chart of nchs_urbanization

fig = px.pie(df, names = 'nchs_urbanization',
             title='Pie Chart of nchs_urbanization')
fig.show()

Fact that stands out:

NON-CORE IS THE GREATEST PROPORTION, LARGE CENTRAL METRO THE LEAST.

## **QUESTIONS 3-6**

For questions 3-6: Find your variable based on your last name and use that variable when answering questions #3 to #6.  

| Last Name | Variable                  |
|-----------|---------------------------|
| A-F       | confirmed                 |
| G-M       | confirmed_per_100000      |
| N-S       | deaths                    |
| T-Z       | deaths_per_100000         |


### **QUESTION 3**

Construct a histogram for your variable. Use Number of Intervals = 12.

In [53]:
# Histogram of confirmed
# Skewed Right

fig = px.histogram(x=df['confirmed'], labels={'x':'confirmed', 'y':'Frequency'}, nbins = 12)
fig.show()

In [54]:
# Histogram of confirmed_per_100000
# Symmetric

fig = px.histogram(x=df['confirmed_per_100000'], labels={'x':'confirmed_per_100000', 'y':'Frequency'}, nbins = 12)
fig.show()

In [55]:
# Histogram of deaths
# Skewed Right

fig = px.histogram(x=df['deaths'], labels={'x':'deaths', 'y':'Frequency'}, nbins = 12)
fig.show()

In [56]:
# Histogram of deaths_per_100000
# Skewed Right

fig = px.histogram(x=df['deaths_per_100000'], labels={'x':'deaths_per_100000', 'y':'Frequency'}, nbins = 12)
fig.show()
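
Plotly appears to treat `nbins` as a maximum rather than an exact count, so the rendered histograms may show fewer than 12 bars. If exactly 12 equal-width intervals are needed, one option is to compute the bins with NumPy first. This is only a sketch, not the method the assignment requires; the nine values below are the `confirmed` entries visible in the data preview, standing in for the full column.

```python
import numpy as np

# 'confirmed' values from the rows visible in the data preview
# (stand-in for df['confirmed'])
confirmed = np.array([3251, 25821, 137985, 96194, 12633, 7688, 1387, 12475, 13596])

# bins=12 splits the range [min, max] into exactly 12 equal-width intervals
counts, edges = np.histogram(confirmed, bins=12)

# Print each interval's left edge, right edge, and frequency
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>10.1f} to {right:>10.1f}: {n}")
```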

### **QUESTION 4**

Construct a boxplot for your variable.  

In [57]:
# Boxplot of Confirmed
# Skewed Right

px.box(x=df['confirmed'], title="1-D Boxplot of Confirmed")

In [58]:
# Boxplot of confirmed_per_100000
# Symmetric

px.box(x=df['confirmed_per_100000'], title="1-D Boxplot of confirmed_per_100000")

In [59]:
# Boxplot of deaths
# Skewed Right

px.box(x=df['deaths'], title="1-D Boxplot of deaths")

In [60]:
# Boxplot of deaths_per_100000
# Skewed Right

px.box(x=df['deaths_per_100000'], title="1-D Boxplot of deaths_per_100000")

### **QUESTION 5**

Calculate the following summary statistics for your variable: minimum, maximum, mean, median, standard deviation, Q1, and Q3. Paste the output below.

In [61]:
# Summary of confirmed

df[['confirmed']].describe(include='all')

Unnamed: 0,confirmed
count,142.0
mean,30437.591549
std,57764.440625
min,102.0
25%,2458.5
50%,6738.5
75%,24554.0
max,289187.0


In [62]:
# Summary of confirmed_per_100000

df[['confirmed_per_100000']].describe(include='all')

Unnamed: 0,confirmed_per_100000
count,142.0
mean,29615.097183
std,6134.238741
min,13065.33
25%,25873.87
50%,30349.51
75%,33534.0925
max,43987.6


In [63]:
# Summary of deaths

df[['deaths']].describe(include='all')

Unnamed: 0,deaths
count,142.0
mean,345.591549
std,631.082085
min,1.0
25%,39.25
50%,103.5
75%,304.5
max,4032.0


In [64]:
# Summary of deaths_per_100000

df[['deaths_per_100000']].describe(include='all')

Unnamed: 0,deaths_per_100000
count,142.0
mean,432.001972
std,187.282874
min,39.71
25%,301.105
50%,426.55
75%,534.175
max,1094.86


### **QUESTION 6**

Use information from questions #3, #4, and #5 to describe your variable in terms of shape, center, spread, and outliers. Interpret your findings.

**NOTE: OUTLIERS ARE HARD TO COUNT, GRADE WITH A +/-4 TOLERANCE**

THE DISTRIBUTION OF **confirmed** IS SKEWED RIGHT. THE MEDIAN IS 6,738.5 CASES AND THE IQR IS 22,095.5. THERE ARE 15 OUTLIERS IN THE RIGHT TAIL.

THE DISTRIBUTION OF **confirmed_per_100000** IS SYMMETRIC. THE MEAN IS 29,615.10 CASES PER 100,000 AND THE STANDARD DEVIATION IS 6,134.24. THERE ARE 2 OUTLIERS IN THE LEFT TAIL.

THE DISTRIBUTION OF **deaths** IS SKEWED RIGHT. THE MEDIAN IS 103.5 AND THE IQR IS 265.25. THERE ARE 15 OUTLIERS IN THE RIGHT TAIL.

THE DISTRIBUTION OF **deaths_per_100000** IS SKEWED RIGHT. THE MEDIAN IS 426.55 AND THE IQR IS 233.07. THERE ARE 15 OUTLIERS IN THE RIGHT TAIL.
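
Because outliers are hard to count by eye, a small helper can apply the usual 1.5 × IQR rule directly. The function name `count_iqr_outliers` is hypothetical and the input is toy data, not the course dataset; this is a sketch of one common convention rather than the only way to define an outlier.

```python
import pandas as pd

def count_iqr_outliers(values):
    """Count values outside the 1.5 * IQR fences; return (low, high) counts."""
    s = pd.Series(values)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return int((s < lower_fence).sum()), int((s > upper_fence).sum())

# Toy data (not from the course dataset): 100 lies far above the upper fence
low, high = count_iqr_outliers([1, 2, 3, 4, 5, 100])
print(low, high)  # 0 1
```

With the course data, a call such as `count_iqr_outliers(df['confirmed'])` would return the left-tail and right-tail counts. Plotly's box plot may compute quartiles slightly differently from pandas, so these counts can differ a little from what the chart shows.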

## **QUESTION 7**

Calculate and report the median confirmed_per_100000 and median deaths_per_100000 for “Large fringe metro” regions. Do the same for “Small metro” regions. Compare the results. Note: These are categories of the nchs_urbanization variable.

| **nchs_urbanization** | **Statistic** | confirmed_per_100000 | deaths_per_100000 |
|----------------------|-----------------| ----------- | -------- |
|Large central metro| N | 3 | 3 |
| | Median | 33467.0200 | 311.0200|
|Large fringe metro| N | 18 | 18|
| | Median | 29264.3400 | 266.1650|
|Medium metro | N | 16 | 16 |
| | Median | 31669.8150 | 367.2450 |
|Micropolitan | N | 32 | 32 |
| | Median | 30285.0550 | 434.8950 |
| Non-core | N | 60 | 60 |
| | Median | 30480.0700 | 452.0100 |
| Small metro | N | 13 | 13 |
| | Median | 32661.1700 | 425.0100 |
| Total | N | 142 | 142 |
| | Median | 30492.6100 | 410.1450|


THE SMALL METRO REGIONS HAD LARGER MEDIAN VALUES THAN THE LARGE FRINGE METRO REGIONS FOR BOTH VARIABLES.
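
A table like the one above can be produced with a pandas `groupby`. The sketch below uses five rows copied from the data preview as a stand-in for the full DataFrame, so its medians will not match the table; with the course data, replace `sample` with `df`.

```python
import pandas as pd

# Rows copied from the data preview (stand-in for the full DataFrame)
sample = pd.DataFrame({
    "nchs_urbanization": ["Medium metro", "Large fringe metro", "Large fringe metro",
                          "Medium metro", "Medium metro"],
    "confirmed_per_100000": [31760.45, 23586.85, 34593.20, 30878.72, 38165.21],
    "deaths_per_100000": [781.56, 193.66, 304.85, 466.10, 362.39],
})

rate_columns = ["confirmed_per_100000", "deaths_per_100000"]
grouped = sample.groupby("nchs_urbanization")[rate_columns]

medians = grouped.median()    # one row per urbanization category
group_sizes = grouped.size()  # N for each category

print(medians)
print(group_sizes)
```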


## **QUESTION 8**

Generate a paragraph of at least 100 words to address one of the following questions:

### **QUESTION 8a**

Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

### **QUESTION 8b**

Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

### **QUESTION 8c**

Discuss how analyzing your chosen data set using statistical methods could help you be aware of social issues, contribute to society, and advocate for marginalized communities.