# Project 1 Jupyter Notebook

# Step 1: Load the data

This project explores the `total_cases` column of U.S. county-level COVID-19 data. The data was released by the White House COVID-19 Team, Joint Coordination Cell, Data Strategy and Execution Workgroup (https://healthdata.gov/dataset/COVID-19-Community-Profile-Report-County-Level/di4u-7yu6/about_data).

- [📊 Download Dataset (CSV)](COVID-19_Community_Profile_Report_-_County-Level.csv)

In [1]:
import pandas as pd

# Read the CSV
df = pd.read_csv("COVID-19_Community_Profile_Report_-_County-Level.csv")

# Preview the shape and the first 10 rows
print(df.shape)
df.head(10)


(3294, 39)


| | fips | county | state | fema_region | date | cases_last_7_days | cases_per_100k_last_7_days | total_cases | cases_pct_change_from_prev_week | deaths_last_7_days | ... | pct_icu_beds_used_avg_last_7_days | pct_icu_beds_used_abs_change_from_prev_week | pct_icu_beds_used_covid_avg_last_7_days | pct_icu_beds_used_covid_abs_change_from_prev_week | pct_vents_used_avg_last_7_days | pct_vents_used_abs_change_from_prev_week | pct_vents_used_covid_avg_last_7_days | pct_vents_used_covid_abs_change_from_prev_week | pct_fully_vacc_total_pop | pct_fully_vacc_65_and_older |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1000 | Unallocated, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 0.0 | NaN | 0.0 | NaN | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1001 | Autauga County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 8.0 | 14.319 | 19913.0 | -0.333 | 0.0 | ... | 0.717 | -0.15 | 0.013 | -0.011 | NaN | NaN | NaN | NaN | 0.461 | 0.744 |
| 2 | 1003 | Baldwin County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 45.0 | 20.158 | 70521.0 | -0.196 | 0.0 | ... | 0.836 | -0.016 | 0.007 | -0.009 | NaN | NaN | NaN | NaN | 0.534 | 0.888 |
| 3 | 1005 | Barbour County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 3.0 | 12.153 | 7582.0 | -0.25 | 0.0 | ... | 0.75 | -0.057 | 0.026 | 0.01 | NaN | NaN | NaN | NaN | 0.474 | 0.747 |
| 4 | 1007 | Bibb County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 4.0 | 17.862 | 8149.0 | -0.556 | 0.0 | ... | 0.85 | 0.01 | 0.009 | -0.003 | NaN | NaN | NaN | NaN | 0.365 | 0.639 |
| 5 | 1009 | Blount County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 16.0 | 27.669 | 18872.0 | 1.0 | 0.0 | ... | 0.85 | 0.01 | 0.009 | -0.003 | NaN | NaN | NaN | NaN | 0.329 | 0.555 |
| 6 | 1011 | Bullock County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 1.0 | 9.9 | 3057.0 | -0.75 | 0.0 | ... | 0.717 | -0.15 | 0.013 | -0.011 | NaN | NaN | NaN | NaN | 0.568 | 0.845 |
| 7 | 1013 | Butler County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 2.0 | 10.284 | 6617.0 | NaN | 0.0 | ... | 0.5 | 0.381 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.41 | 0.677 |
| 8 | 1015 | Calhoun County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 19.0 | 16.725 | 41931.0 | 0.727 | 0.0 | ... | 0.798 | 0.008 | 0.021 | 0.008 | NaN | NaN | NaN | NaN | 0.49 | 0.832 |
| 9 | 1017 | Chambers County, AL | AL | 4.0 | 05/10/2023 12:00:00 AM | 7.0 | 21.05 | 10935.0 | 1.333 | 0.0 | ... | 0.717 | 0.055 | 0.007 | -0.007 | NaN | NaN | NaN | NaN | 0.332 | 0.529 |
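The `date` values in the preview are plain strings such as `05/10/2023 12:00:00 AM`. If date arithmetic or filtering is needed later, they can be converted to timestamps. The following is a minimal sketch, assuming every row uses the same month/day/year 12-hour format seen in the preview; the `dates` series is a stand-in, not the real column.

```python
import pandas as pd

# Two date strings copied from the preview; the real column has one per row.
dates = pd.Series(["05/10/2023 12:00:00 AM", "05/10/2023 12:00:00 AM"])

# An explicit format avoids per-value inference and raises on unexpected input.
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")

print(parsed.dt.date.unique())
```

On the real data the same call would be applied to `df["date"]`, assuming the format holds for all rows.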


# Step 2: Identify numeric columns

In [2]:
# Inspect dtypes and non-null counts to choose a numeric column to analyze
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3294 entries, 0 to 3293
Data columns (total 39 columns):
 #   Column                                                       Non-Null Count  Dtype  
---  ------                                                       --------------  -----  
 0   fips                                                         3294 non-null   int64  
 1   county                                                       3294 non-null   object 
 2   state                                                        3294 non-null   object 
 3   fema_region                                                  3291 non-null   float64
 4   date                                                         3294 non-null   object 
 5   cases_last_7_days                                            3289 non-null   float64
 6   cases_per_100k_last_7_days                                   3221 non-null   float64
 7   total_cases                                                  3278 non-null   f
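`df.info()` lists every dtype, but the numeric columns can also be pulled out programmatically. The sketch below uses a small stand-in frame built from a few preview values (the name `sample` is hypothetical) and pandas' `select_dtypes`.

```python
import pandas as pd

# Stand-in frame: column names mirror the dataset, rows come from the preview.
sample = pd.DataFrame(
    {
        "fips": [1001, 1003],
        "county": ["Autauga County, AL", "Baldwin County, AL"],
        "state": ["AL", "AL"],
        "total_cases": [19913.0, 70521.0],
    }
)

# include="number" keeps integer and float columns and drops object columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(numeric_cols)
```

Note that `fips` is stored as an integer but is an identifier, not a measurement, so a numeric dtype alone does not make a column worth summarizing.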

# Step 3: Compute mean, median, mode

In [22]:
mean_pandas = df["total_cases"].mean()
median_pandas = df["total_cases"].median()
mode_pandas = df["total_cases"].mode()[0]

print("Mean:", mean_pandas)
print("Median:", median_pandas)
print("Mode:", mode_pandas)

Mean: 31945.17754728493
Median: 8045.5
Mode: 0.0


# Data Visualization

In [23]:
state_cases = df.groupby("state")["total_cases"].sum().sort_values(ascending=False)

print("COVID-19 Total Cases by State as of 2023/5/10")
print("(Each 🦠 represents approximately 100,000 total cases)\n")
for state, value in state_cases.items():
    # One symbol per 100,000 cases, capped at 100 symbols so the largest states don't overflow the line.
    bar = "🦠" * min(int(value) // 100000, 100)
    # Below the cap, each state's bar length is proportional to its total.
    print(f"{state:5s}{bar}")
    # Scroll the output cell to see every state.

COVID-19 Total Cases by State as of 2023/5/10
(Each 🦠 represents approximately 100,000 total cases)

CA   ­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда
TX   ­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда
FL   ­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда­Ъда
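The scale and the cap of a text bar chart are easy to get out of sync with its legend. One way to keep them together is a small helper that takes both as parameters. This is only a sketch: the function name `text_bar`, its defaults, and the totals in `totals` are hypothetical and chosen to exercise the logic.

```python
def text_bar(value, unit=100_000, cap=100, symbol="#"):
    """Return one symbol per `unit`, truncated at `cap` symbols."""
    return symbol * min(int(value) // unit, cap)


# Made-up totals: one above the cap, one mid-range, one below a single unit.
totals = {"AA": 12_300_000, "BB": 850_000, "CC": 40_000}

for name, value in totals.items():
    print(f"{name:5s}{text_bar(value)}")
```

Any value at or above `unit * cap` produces the same bar length, so capped bars should not be compared with each other.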

# The hard way

In [12]:
import csv
import os

# Collect the parseable total_cases values
values = []
# The CSV is expected in the current working directory
BASE_DIR = os.getcwd()
CSV_PATH = os.path.join(
    BASE_DIR, "COVID-19_Community_Profile_Report_-_County-Level.csv"
)
with open(CSV_PATH, newline="") as f:
    read_data = csv.DictReader(f)
    for row in read_data:
        try:
            value = float(row["total_cases"])
            values.append(value)
        except (TypeError, ValueError):
            pass  # skip missing or non-numeric entries

# Mean
mean_manual = sum(values) / len(values)

# Median
values_sorted = sorted(values)
n = len(values)
if n % 2 == 1:
    median_manual = values_sorted[n // 2]
else:
    median_manual = (values_sorted[n // 2 - 1] + values_sorted[n // 2]) / 2

# Mode
mode_manual = max(set(values), key=values.count)

print("Mean of total_cases:", mean_manual)
print("Median of total_cases:", median_manual)
print("Mode of total_cases:", mode_manual)

Mean of total_cases: 31945.17754728493
Median of total_cases: 8045.5
Mode of total_cases: 0.0
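Between pandas and the fully manual loop there is a middle option: Python's standard-library `statistics` module. The sketch below is a cross-check on a short illustrative list (a few preview values plus two zeros), not on the full column.

```python
import statistics

# Illustrative sample only; the real list would come from the CSV loop.
sample_values = [0.0, 0.0, 3057.0, 6617.0, 7582.0, 8149.0]

mean_std = statistics.mean(sample_values)
median_std = statistics.median(sample_values)  # averages the two middle values
mode_std = statistics.mode(sample_values)      # most frequent value

print("Mean:", mean_std)
print("Median:", median_std)
print("Mode:", mode_std)
```

`statistics.mode` scans the data once, whereas `max(set(values), key=values.count)` rescans the list for every distinct value, which can matter on larger inputs.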


# Analysis and Data Interpretation

The dataset, “COVID-19 Community Profile Report – County-Level,” contains cumulative COVID-19 case counts by county across the United States.  
The `total_cases` column is the total number of confirmed cases recorded for each county from the start of the pandemic through 2023/5/10. The data was released by the White House COVID-19 Team, Joint Coordination Cell, Data Strategy and Execution Workgroup.

## Statistical Summary
- Mean total cases per county: 31945.17754728493
- Median total cases per county: 8045.5
- Mode total cases per county: 0

## Interpretation
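The mean (about 31,945) is roughly four times the median (8,045.5), which indicates a right-skewed distribution: a small number of counties with very large case counts pull the mean upward.  
At the state level, CA and TX have the highest totals, while many individual counties report comparatively few cases.  
This suggests that the pandemic’s case burden was unevenly distributed and likely concentrated in populous areas, although raw counts alone cannot separate this from differences in population size. The mode is 0 even though most totals are large. The preview in Step 1 includes an “Unallocated, AL” row with `total_cases` of 0.0, so the zeros likely come from placeholder rows or counties without valid reported data rather than from places COVID-19 never reached.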
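The gap between mean and median can be checked directly, and pandas also provides a sample skewness estimate through `Series.skew()`. The sketch below uses only the ten `total_cases` values visible in the Step 1 preview, so its numbers describe that small sample, not the full dataset.

```python
import pandas as pd

# total_cases for the ten preview rows (all from Alabama).
preview_cases = pd.Series(
    [0.0, 19913.0, 70521.0, 7582.0, 8149.0,
     18872.0, 3057.0, 6617.0, 41931.0, 10935.0]
)

sample_mean = preview_cases.mean()
sample_median = preview_cases.median()
sample_skew = preview_cases.skew()  # positive values indicate a longer right tail

print("Mean:", sample_mean)
print("Median:", sample_median)
print("Skewness:", sample_skew)
```

A mean above the median together with a positive skewness is the pattern expected from a right-skewed column.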

## Key Assumptions
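- Missing or unparseable `total_cases` values were excluded: pandas skips NaN by default, and the manual version skips values that fail `float()`.  
- “Total cases” are influenced by county population size; a more meaningful comparison would normalize by population (for example, cases per 100,000 residents).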
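The normalization mentioned above can be sketched with columns that already appear in the preview. No population column is visible in this excerpt, so the sketch assumes population can be approximated from `cases_last_7_days` and `cases_per_100k_last_7_days`. That is an assumption, not a documented property of the dataset, and the new column name `total_cases_per_100k` is hypothetical.

```python
import numpy as np
import pandas as pd

# Three preview rows, restricted to the columns needed here.
sample = pd.DataFrame(
    {
        "county": ["Unallocated, AL", "Autauga County, AL", "Baldwin County, AL"],
        "cases_last_7_days": [0.0, 8.0, 45.0],
        "cases_per_100k_last_7_days": [np.nan, 14.319, 20.158],
        "total_cases": [0.0, 19913.0, 70521.0],
    }
)

# ASSUMPTION: implied population = weekly cases / weekly rate * 100,000.
# Zero weekly cases give no usable estimate, so treat them as missing.
weekly_cases = sample["cases_last_7_days"].replace(0, np.nan)
implied_pop = weekly_cases / sample["cases_per_100k_last_7_days"] * 100_000

sample["total_cases_per_100k"] = sample["total_cases"] / implied_pop * 100_000

print(sample[["county", "total_cases_per_100k"]])
```

Because the weekly rate is rounded, the implied population is only approximate, and counties with no cases in the last week get no estimate. An external population table would be a more reliable denominator.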

## Conclusion
The project highlights disparities in how case counts are distributed across counties. The state-level visualization, together with the county-level summary statistics, makes these disparities easier to see.
