# Analysis of COVID-19 Deaths
**Joseph Jones and Julia Gui**

## Setting Up
**1. Choosing Dataset**
<br>For this assignment, we will be using [Conditions Contributing to COVID-19 Deaths, by State and Age, Provisional 2020-2023](https://data.cdc.gov/NCHS/Conditions-Contributing-to-COVID-19-Deaths-by-Stat/hk9y-quqm) from the CDC. This dataset contains the number of COVID-19 deaths for a variety of intervals, regions, contributing conditions, and age groups.


**2. Preparing Questions**
<br>We can analyze this data to answer the following questions:
1. Was there an increase in the number of deaths during certain times of the year?
2. What is the percentage change of COVID-19 deaths every month?
3. For each year, which three states have the highest and lowest number of COVID-19 deaths?
4. Which age groups experienced the most deaths? Has this changed since the beginning of the pandemic?
5. How are contributing conditions distributed across the different age groups?
6. What is the longest streak of fewer than 10 deaths in a month? When and in what state did this streak occur?


**3. Creating DataFrame**
<br>We need to import Matplotlib, Pandas, and the dataset into the Python interpreter:

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('conditions_contributing_to.csv')
df

Unnamed: 0,Data As Of,Start Date,End Date,Group,Year,Month,State,Condition Group,Condition,ICD10_codes,Age Group,COVID-19 Deaths,Number of Mentions,Flag
0,09/24/2023,01/01/2020,09/23/2023,By Total,,,United States,Respiratory diseases,Influenza and pneumonia,J09-J18,0-24,1569.0,1647.0,
1,09/24/2023,01/01/2020,09/23/2023,By Total,,,United States,Respiratory diseases,Influenza and pneumonia,J09-J18,25-34,5804.0,6029.0,
2,09/24/2023,01/01/2020,09/23/2023,By Total,,,United States,Respiratory diseases,Influenza and pneumonia,J09-J18,35-44,15080.0,15699.0,
3,09/24/2023,01/01/2020,09/23/2023,By Total,,,United States,Respiratory diseases,Influenza and pneumonia,J09-J18,45-54,37414.0,38878.0,
4,09/24/2023,01/01/2020,09/23/2023,By Total,,,United States,Respiratory diseases,Influenza and pneumonia,J09-J18,55-64,82668.0,85708.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
620995,09/24/2023,05/01/2023,05/31/2023,By Month,2023.0,5.0,Puerto Rico,COVID-19,COVID-19,U071,All Ages,67.0,67.0,
620996,09/24/2023,06/01/2023,06/30/2023,By Month,2023.0,6.0,Puerto Rico,COVID-19,COVID-19,U071,All Ages,122.0,122.0,
620997,09/24/2023,07/01/2023,07/31/2023,By Month,2023.0,7.0,Puerto Rico,COVID-19,COVID-19,U071,All Ages,114.0,114.0,
620998,09/24/2023,08/01/2023,08/31/2023,By Month,2023.0,8.0,Puerto Rico,COVID-19,COVID-19,U071,All Ages,78.0,78.0,


With all of these different categories, we must be careful to not overcount deaths. For example, one COVID-19 death may have had several contributing conditions, so it would have been added to the number of deaths for each condition. If we only want the total number of COVID-19 deaths, we must specify that the condition is COVID-19.
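As a minimal sketch of the overcounting problem, the toy frame below (made-up numbers, not rows from the CDC file) lists two COVID-19 deaths that are also counted under contributing conditions. Summing every row inflates the total, while filtering on the condition does not.

```python
import pandas as pd

# Toy rows for illustration only: 2 deaths, each also tallied under other conditions.
toy = pd.DataFrame({
    'Condition': ['COVID-19', 'Influenza and pneumonia', 'Diabetes'],
    'Deaths': [2.0, 2.0, 1.0],
})

# Summing every row counts the same deaths more than once.
naive_total = toy['Deaths'].sum()

# Restricting to the 'COVID-19' condition gives the actual number of deaths.
covid_total = toy.loc[toy['Condition'] == 'COVID-19', 'Deaths'].sum()

print(naive_total, covid_total)
```

The same filter, `df['Condition']=='COVID-19'`, appears in most of the queries in this notebook.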



## Cleaning and Preprocessing
**1. Removing Null or Duplicate Rows**
<br>We will address null values later; the code below counts them for each column. We also found no duplicate rows, which can be checked with `df.duplicated().sum()`:

In [2]:
df.isnull().sum()

Data As Of                 0
Start Date                 0
End Date                   0
Group                      0
Year                   12420
Month                  62100
State                      0
Condition Group            0
Condition                  0
ICD10_codes                0
Age Group                  0
COVID-19 Deaths       183449
Number of Mentions    177577
Flag                  437551
dtype: int64
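A duplicate check could be sketched as follows. The toy frame is an assumption made for illustration; on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only).
toy = pd.DataFrame({
    'State': ['Alabama', 'Alabama', 'Alaska'],
    'Deaths': [5.0, 5.0, 3.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```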

**2. Removing Unuseful Columns**
<br>To better understand the data we are working with, we can display the unique values of each column:

In [3]:
for col in df:
	print(col, ': ', df[col].unique())

Data As Of :  ['09/24/2023']
Start Date :  ['01/01/2020' '01/01/2021' '01/01/2022' '01/01/2023' '02/01/2020'
 '03/01/2020' '04/01/2020' '05/01/2020' '06/01/2020' '07/01/2020'
 '08/01/2020' '09/01/2020' '10/01/2020' '11/01/2020' '12/01/2020'
 '02/01/2021' '03/01/2021' '04/01/2021' '05/01/2021' '06/01/2021'
 '07/01/2021' '08/01/2021' '09/01/2021' '10/01/2021' '11/01/2021'
 '12/01/2021' '02/01/2022' '03/01/2022' '04/01/2022' '05/01/2022'
 '06/01/2022' '07/01/2022' '08/01/2022' '09/01/2022' '10/01/2022'
 '11/01/2022' '12/01/2022' '02/01/2023' '03/01/2023' '04/01/2023'
 '05/01/2023' '06/01/2023' '07/01/2023' '08/01/2023' '09/01/2023']
End Date :  ['09/23/2023' '12/31/2020' '12/31/2021' '12/31/2022' '01/31/2020'
 '02/29/2020' '03/31/2020' '04/30/2020' '05/31/2020' '06/30/2020'
 '07/31/2020' '08/31/2020' '09/30/2020' '10/31/2020' '11/30/2020'
 '01/31/2021' '02/28/2021' '03/31/2021' '04/30/2021' '05/31/2021'
 '06/30/2021' '07/31/2021' '08/31/2021' '09/30/2021' '10/31/2021'
 '11/30/2021' '01/31

We are going to drop:
* the 'Data As Of' column, because it contains the same value for each row;
* the 'Start Date' and 'End Date' columns, because the numerical format of 'Year' and 'Month' will be easier to work with;
* the 'Group' column, because whether a row is by month, by year, or in total can be deduced from the presence or absence of null values in 'Year' and 'Month';
* the 'Condition Group' column, because 'Condition' is more specific;
* the 'ICD10_codes' column, where each code uniquely identifies a condition, because conditions are already uniquely and more recognizably identified by their names;
* and the 'Number of Mentions' column, because it is not entirely clear what the difference is between a “death” and a “mention”.
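To illustrate why 'Group' is redundant, here is a small sketch that rebuilds it from the null pattern of 'Year' and 'Month'. The helper `infer_group` is hypothetical, and the label 'By Year' is an assumption: only 'By Total' and 'By Month' are visible in the preview above.

```python
import numpy as np
import pandas as pd

# Toy rows: a total, a yearly record, and a monthly record.
toy = pd.DataFrame({
    'Year': [np.nan, 2020.0, 2020.0],
    'Month': [np.nan, np.nan, 3.0],
})

def infer_group(row):
    """Hypothetical helper: derive the grouping level from missing values."""
    if pd.isna(row['Year']):
        return 'By Total'
    if pd.isna(row['Month']):
        return 'By Year'  # assumed label, not confirmed by the preview
    return 'By Month'

toy['Group'] = toy.apply(infer_group, axis=1)
print(toy)
```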

The columns that are essential to our analysis are 'Year', 'Month', 'State', 'Condition', 'Age Group', and 'COVID-19 Deaths'. We will deal with the 'Flag' column in a moment. For now, we can remove the columns we do not need and shorten the name of 'COVID-19 Deaths':

In [4]:
df.drop(columns=['Data As Of', 'Start Date', 'End Date', 'Group', 'Condition Group', 'ICD10_codes', 'Number of Mentions'], inplace=True)
df.rename(columns={'COVID-19 Deaths':'Deaths'}, inplace=True)
df

Unnamed: 0,Year,Month,State,Condition,Age Group,Deaths,Flag
0,,,United States,Influenza and pneumonia,0-24,1569.0,
1,,,United States,Influenza and pneumonia,25-34,5804.0,
2,,,United States,Influenza and pneumonia,35-44,15080.0,
3,,,United States,Influenza and pneumonia,45-54,37414.0,
4,,,United States,Influenza and pneumonia,55-64,82668.0,
...,...,...,...,...,...,...,...
620995,2023.0,5.0,Puerto Rico,COVID-19,All Ages,67.0,
620996,2023.0,6.0,Puerto Rico,COVID-19,All Ages,122.0,
620997,2023.0,7.0,Puerto Rico,COVID-19,All Ages,114.0,
620998,2023.0,8.0,Puerto Rico,COVID-19,All Ages,78.0,


**3. Filling Missing Values**
<br>This code finds our missing values:

In [5]:
df.isnull().sum()

Year          12420
Month         62100
State             0
Condition         0
Age Group         0
Deaths       183449
Flag         437551
dtype: int64

There are null values in 'Year' and 'Month', but these are supposed to be here.

The rest of the missing values are in 'Deaths' and 'Flag'. When we looked at the unique values of each column, we found that rows are flagged when one or more data cells with counts between 1 and 9 have been “suppressed”.

The following code checks for unflagged rows with a missing value in 'Deaths', and finds nothing:

In [6]:
df.loc[(df['Deaths'].isnull()) & (df['Flag'].isnull())]

Unnamed: 0,Year,Month,State,Condition,Age Group,Deaths,Flag


Therefore, each null value in 'Deaths' represents a suppressed count between 1 and 9. Considering how large the other numbers in the table are, we can replace the nulls with zero without significantly affecting the data. We will also remove the 'Flag' column, since we are done with it.

In [7]:
df.drop(columns=['Flag'], inplace=True)
df.fillna({'Deaths':0}, inplace=True)
df

Unnamed: 0,Year,Month,State,Condition,Age Group,Deaths
0,,,United States,Influenza and pneumonia,0-24,1569.0
1,,,United States,Influenza and pneumonia,25-34,5804.0
2,,,United States,Influenza and pneumonia,35-44,15080.0
3,,,United States,Influenza and pneumonia,45-54,37414.0
4,,,United States,Influenza and pneumonia,55-64,82668.0
...,...,...,...,...,...,...
620995,2023.0,5.0,Puerto Rico,COVID-19,All Ages,67.0
620996,2023.0,6.0,Puerto Rico,COVID-19,All Ages,122.0
620997,2023.0,7.0,Puerto Rico,COVID-19,All Ages,114.0
620998,2023.0,8.0,Puerto Rico,COVID-19,All Ages,78.0
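One way to gauge how much the zero fill matters is to bound a total from both sides, since each suppressed cell hides a count between 1 and 9. The sketch below uses made-up numbers and is only an illustration of that reasoning, not a result from the dataset.

```python
import numpy as np
import pandas as pd

# Toy counts: two suppressed cells (NaN) among two reported ones.
toy = pd.DataFrame({'Deaths': [120.0, np.nan, np.nan, 45.0]})

# The approach used in this notebook: treat suppressed counts as zero.
zero_fill_total = toy['Deaths'].fillna(0).sum()

# Smallest and largest totals consistent with suppression of counts 1-9.
lower_bound = toy['Deaths'].fillna(1).sum()
upper_bound = toy['Deaths'].fillna(9).sum()

print(zero_fill_total, lower_bound, upper_bound)
```

The gap between the bounds grows with the number of suppressed cells, so the zero fill is least reliable for slices made up mostly of small counts.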


**4. Removing Unuseful Rows**
<br>While testing the data, we found that the 'United States' category of the 'State' column sums the deaths in the fifty states, District of Columbia, and New York City, but excludes the deaths from Puerto Rico. You can experiment with the code below:

In [8]:
# the count for the united states is 1146242
df.loc[(df['Year'].isnull()) & (df['State']=='United States') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages')]['Deaths'].sum()

# the count for the fifty states, puerto rico, district of columbia, and new york city is 1152658
# df.loc[(df['Year'].isnull()) & (df['State']!='United States') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages')]['Deaths'].sum()

# the count for the fifty states, district of columbia, and new york city is 1146242
# df.loc[(df['Year'].isnull()) & (df['State']!='United States') & (df['State']!='Puerto Rico') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages')]['Deaths'].sum()

1146242.0

Since we are more interested in the fifty states, we will drop all of the rows from Puerto Rico and District of Columbia, then collapse each New York City row into the corresponding New York row.

In [9]:
columns = df.columns[:-1].tolist() #[:-1] excludes the 'Deaths' column
df = df.loc[(df['State']!='Puerto Rico') & (df['State']!='District of Columbia')]
df = df.replace('New York City', 'New York')
df = df.groupby(columns, dropna=False).sum()
df.reset_index(inplace=True)
df.sort_values(by=columns, na_position='first', inplace=True)
df

Unnamed: 0,Year,Month,State,Condition,Age Group,Deaths
574770,,,Alabama,Adult respiratory distress syndrome,0-24,0.0
574771,,,Alabama,Adult respiratory distress syndrome,25-34,35.0
574772,,,Alabama,Adult respiratory distress syndrome,35-44,65.0
574773,,,Alabama,Adult respiratory distress syndrome,45-54,159.0
574774,,,Alabama,Adult respiratory distress syndrome,55-64,283.0
...,...,...,...,...,...,...
563035,2023.0,9.0,Wyoming,Vascular and unspecified dementia,65-74,0.0
563036,2023.0,9.0,Wyoming,Vascular and unspecified dementia,75-84,0.0
563037,2023.0,9.0,Wyoming,Vascular and unspecified dementia,85+,0.0
563038,2023.0,9.0,Wyoming,Vascular and unspecified dementia,All Ages,0.0
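The drop-and-collapse step can be seen more easily on a toy frame. The numbers below are invented for illustration; the calls mirror the ones used in the cell above.

```python
import pandas as pd

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    'State': ['New York', 'New York City', 'Puerto Rico', 'Texas'],
    'Condition': ['COVID-19'] * 4,
    'Deaths': [10.0, 7.0, 3.0, 20.0],
})

# Drop the region we are not analysing.
toy = toy.loc[toy['State'] != 'Puerto Rico']

# Relabel New York City so that it shares a key with New York.
toy = toy.replace({'State': {'New York City': 'New York'}})

# Rows with identical keys are now summed into one.
merged = toy.groupby(['State', 'Condition'], as_index=False)['Deaths'].sum()
print(merged)
```

Restricting `replace` to the 'State' column, as done here, is a small design choice that avoids touching the same string if it ever appeared in another column.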


**5. Data Types**
<br>If we quickly check the data types, we can see that they are suitable for our analysis:

In [10]:
df.dtypes

Year         float64
Month        float64
State         object
Condition     object
Age Group     object
Deaths       float64
dtype: object
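'Year' and 'Month' are stored as floats because the standard integer type cannot hold missing values. If whole-number display were preferred, pandas' nullable `Int64` type is one option. This is an optional sketch on toy data, and the analysis below does not depend on it.

```python
import numpy as np
import pandas as pd

# Toy frame with the same null pattern as the real 'Year' and 'Month' columns.
toy = pd.DataFrame({
    'Year': [np.nan, 2020.0, 2023.0],
    'Month': [np.nan, np.nan, 8.0],
})

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>.
as_nullable = toy.astype({'Year': 'Int64', 'Month': 'Int64'})
print(as_nullable.dtypes)
```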

## Analysis and Visualization
**Question 1**
<br>Was there an increase in the number of deaths during certain times of the year?

Using for loops, we will create a list for each year containing the number of COVID-19 deaths per month. We will graph these lists, then examine the graphs for any months that show a significant spike in deaths every year.

In [11]:
month_list = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
year_list = [2020, 2021, 2022, 2023]

deaths_bymonth = {}

for year in year_list:
    deaths = []
    for month in month_list:
        deaths.append(df.loc[(df['Year']==year) & (df['Month']==month) & (df['State']=='United States') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages')]['Deaths'].sum())
    deaths_bymonth[year] = deaths

figs, axes = plt.subplots(2,2)
figs.suptitle('COVID-19 Deaths by Month, By Year')

axes[0,0].set_title(2020)
axes[0,0].bar(month_list.values(), deaths_bymonth[2020], color='blue')

axes[0,1].set_title(2021)
axes[0,1].bar(month_list.values(), deaths_bymonth[2021], color='darkorange')

axes[1,0].set_title(2022)
axes[1,0].bar(month_list.values(), deaths_bymonth[2022], color='green')

axes[1,1].set_title(2023)
axes[1,1].bar(month_list.values(), deaths_bymonth[2023], color='red')

for ax in axes.flat:
    ax.set(xlabel='Month', ylabel='Number of Deaths')
    ax.tick_params('x', labelrotation=90)
    plt.setp(ax, ylim=(0,120000))
    ax.label_outer()

Looking at these graphs, there seems to be a small spike in deaths around July, August, or September, and then a large spike in December.
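The per-year monthly lists can also be arranged as a month-by-year table with `pivot_table`, which makes it easy to read off each year's peak month. The sketch below uses invented numbers for two months of two years.

```python
import pandas as pd

# Toy monthly national counts (illustrative values only).
monthly = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021],
    'Month': [1, 2, 1, 2],
    'Deaths': [5.0, 20.0, 100.0, 50.0],
})

# Rows are months, columns are years.
table = monthly.pivot_table(index='Month', columns='Year', values='Deaths', aggfunc='sum')

# idxmax() returns, for each year, the month with the most deaths.
peak_month = table.idxmax()
print(table)
print(peak_month)
```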

The December spike is likely because of the colder weather. When it is cold outside, people spend more time together indoors, facilitating the transmission of COVID-19 and other illnesses. December is also a part of the holiday season, and many people interact (and share illness) with extended family and friends.

We wonder whether this pattern will hold steady or change for the rest of 2023 and the next few years.


**Question 2**
<br>What is the percentage change of COVID-19 deaths every month?

We can loop through each month with complete data and calculate the percentage change in COVID-19 deaths since the previous month, using the formula `(new value - old value) / (old value) * 100`. The result is undefined if the previous month is not in the dataset or if its number of deaths is zero. In those cases we set the percentage change to zero, which is acceptable for our goal of visualization.

After we have found the percentage change for each month, we will graph the percentage change over the years.

In [None]:
def prevmonth_func(year, month):
    if (month==1):
        return year-1, 12
    else:
        return year, month-1

percent_bymonth = {}

for year in year_list:
    deaths = []
    for month in month_list:
        if (year==2023) and (month>8):
            percent = 0
        else:
            prevyear, prevmonth = prevmonth_func(year,month)
            if prevyear not in year_list:
                percent = 0
            elif (deaths_bymonth[prevyear][prevmonth-1]==0):
                percent = 0
            else:
                percent = (deaths_bymonth[year][month-1] - deaths_bymonth[prevyear][prevmonth-1])/(deaths_bymonth[prevyear][prevmonth-1]) * 100
        deaths.append(percent)
    percent_bymonth[year] = deaths

figs, axes = plt.subplots(2,2)
figs.suptitle('Percentage Change of COVID-19 Deaths from Previous Month')

axes[0,0].set_title(2020)
axes[0,0].bar(month_list.values(), percent_bymonth[2020], color='blue')

axes[0,1].set_title(2021)
axes[0,1].bar(month_list.values(), percent_bymonth[2021], color='darkorange')

axes[1,0].set_title(2022)
axes[1,0].bar(month_list.values(), percent_bymonth[2022], color='green')

axes[1,1].set_title(2023)
axes[1,1].bar(month_list.values(), percent_bymonth[2023], color='red')

for ax in axes.flat:
    ax.set(xlabel='Month', ylabel='Percentage Change')
    ax.tick_params('x', labelrotation=90)
    plt.setp(ax, ylim=(-100,100))
    ax.label_outer()

You can adjust the line `plt.setp(ax, ylim=(-100,100))` to change the scale. The maximum percentage change in deaths was an increase of roughly 37,600% between February and March 2020, with 19 and 7160 deaths, respectively. After all, March 2020 was when COVID-19 began spreading rapidly through the United States.
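The percentage-change formula and its two special cases can be checked in isolation. The helper `percent_change` below is a hypothetical name used only for this sketch; the February and March 2020 counts are the ones quoted above.

```python
def percent_change(old, new):
    """Percentage change from old to new; 0.0 when the change is undefined."""
    # No previous month, or a previous month with zero deaths: fall back to 0.
    if old is None or old == 0:
        return 0.0
    return (new - old) / old * 100

# February 2020 (19 deaths) to March 2020 (7160 deaths).
feb_to_mar = percent_change(19, 7160)
print(round(feb_to_mar, 1))
```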

In 2020, most months experienced an increase in deaths, but in the following years, more months experienced a decrease in deaths than an increase. These results also confirm our conclusion in Question 1, that there appears to be a small increase in deaths sometime during the summer and a large increase in deaths in December.

It would be interesting to compare the percentage change between each month to the timeline of the pandemic: what happened when lockdown restrictions were placed and lifted, when COVID-19 vaccines were introduced, when new variants were discovered, and so on.


**Question 3**
<br>For each year, which three states have the highest and lowest number of COVID-19 deaths?

We will make a copy of the original DataFrame that contains only the deaths for each state for each year. We can sort this new DataFrame, then use `loc` and `iloc` to pull the first and last three rows of data for each year, which will correspond to the states with the highest and lowest number of deaths.

In [None]:
df2 = df.copy().loc[(df['Month'].isnull()) & (df['State']!='United States') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages')][['Year', 'State', 'Deaths']]
df2.sort_values(by=['Year', 'Deaths'], ascending=False, inplace=True)

for year in year_list:
    df3 = df2.loc[(df2['Year']==year)]

    print(int(year))

    print('Most deaths:')
    print('1. ', df3.iloc[0]['State'], ' (', int(df3.iloc[0]['Deaths']), ')', sep='')
    print('2. ', df3.iloc[1]['State'], ' (', int(df3.iloc[1]['Deaths']), ')', sep='')
    print('3. ', df3.iloc[2]['State'], ' (', int(df3.iloc[2]['Deaths']), ')', sep='')

    print('Least deaths:')
    print('1. ', df3.iloc[-1]['State'], ' (', int(df3.iloc[-1]['Deaths']), ')', sep='')
    print('2. ', df3.iloc[-2]['State'], ' (', int(df3.iloc[-2]['Deaths']), ')', sep='')
    print('3. ', df3.iloc[-3]['State'], ' (', int(df3.iloc[-3]['Deaths']), ') \n', sep='')

The states with the most and least COVID-19 deaths have been displayed for each year. In 2020, the most deaths were in California, New York, and Texas, and Florida has replaced New York in every year since. As for the states with the lowest number of deaths, Alaska appears each year, and Hawaii and Vermont eventually give way to Wyoming and North Dakota.
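As an alternative to sorting and indexing with `iloc`, pandas offers `nlargest` and `nsmallest`, which return the top and bottom rows directly. This is a sketch on placeholder state names and counts, not the method used above.

```python
import pandas as pd

# Placeholder states and counts for one year (illustrative only).
toy = pd.DataFrame({
    'State': ['A', 'B', 'C', 'D', 'E'],
    'Deaths': [50.0, 10.0, 30.0, 40.0, 20.0],
})

# Three states with the most deaths, highest first.
top3 = toy.nlargest(3, 'Deaths')['State'].tolist()

# Three states with the fewest deaths, lowest first.
bottom3 = toy.nsmallest(3, 'Deaths')['State'].tolist()

print(top3, bottom3)
```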

We thought our results were interesting but unsurprising. A more in-depth analysis might include calculating and comparing the ratio of COVID-19 deaths to population for each state.


**Question 4**
<br>Which age groups experienced the most deaths? Has this changed since the beginning of the pandemic?

We will calculate and graph the distribution of COVID-19 deaths across the different age groups for each year.

In [None]:
age_list = df['Age Group'].unique().tolist()
age_list.remove('All Ages')

deaths_byage = {}

for year in year_list:
    deaths = []
    for age in age_list:
        deaths.append(df.loc[(df['Year']==year) & (df['Month'].isnull()) & (df['State']=='United States') & (df['Condition']=='COVID-19') & (df['Age Group']==age)]['Deaths'].sum())
    deaths_byage[year] = deaths

figs, axes = plt.subplots(2,2)
figs.suptitle('COVID-19 Deaths by Age Group, By Year')

axes[0,0].set_title(2020)
axes[0,0].bar(age_list, deaths_byage[2020], color='blue')

axes[0,1].set_title(2021)
axes[0,1].bar(age_list, deaths_byage[2021], color='darkorange')

axes[1,0].set_title(2022)
axes[1,0].bar(age_list, deaths_byage[2022], color='green')

axes[1,1].set_title(2023)
axes[1,1].bar(age_list, deaths_byage[2023], color='red')

for ax in axes.flat:
    ax.set(xlabel='Age Group', ylabel='Number of Deaths')
    ax.tick_params('x', labelrotation=90)
    plt.setp(ax, ylim=(0,125000))
    ax.label_outer()

As age increases, the number of COVID-19 deaths increases.

We can also create pie charts to visualize each age group's percentage of deaths.

In [None]:
color_list = ['#6E5E4D', '#887561', '#A08C77', '#BFAB95', '#E7DA61', '#9AB8C8', '#7392BD', '#535E84', 'black']

figs, axes = plt.subplots(2,2)
figs.suptitle('Percentage of COVID-19 Deaths by Age Group, By Year')

axes[0,0].set_title(2020)
axes[0,0].pie(deaths_byage[2020], labels=age_list, autopct='%1.1f%%', colors=color_list)

axes[0,1].set_title(2021)
axes[0,1].pie(deaths_byage[2021], labels=age_list, autopct='%1.1f%%', colors=color_list)

axes[1,0].set_title(2022)
axes[1,0].pie(deaths_byage[2022], labels=age_list, autopct='%1.1f%%', colors=color_list)

axes[1,1].set_title(2023)
axes[1,1].pie(deaths_byage[2023], labels=age_list, autopct='%1.1f%%', colors=color_list)

plt.show()

The labels are a little messy, but it is clear that COVID-19 deaths occur most frequently in people aged 65 and older, and most frequently of all in people aged 85+.
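The percentages that the pie charts display can also be computed directly, which sidesteps the crowded labels. The counts below are invented and cover only some of the age groups; they are meant to show the calculation, not the real distribution.

```python
import pandas as pd

# Invented counts for a subset of age groups (illustrative only).
counts = pd.Series({'0-24': 10.0, '55-64': 40.0, '65-74': 50.0, '75-84': 50.0, '85+': 50.0})

# Each group's share of all deaths, in percent.
shares = (counts / counts.sum() * 100).round(1)

# Combined share of the groups aged 65 and older.
share_65_plus = shares[['65-74', '75-84', '85+']].sum()

print(shares)
print(share_65_plus)
```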

This makes sense. Your immune system weakens as you age, so older people are more vulnerable to COVID-19. Policymakers and healthcare providers have used this information to develop prevention and treatment plans that protect at-risk populations.


**Question 5**
<br>How are contributing conditions distributed across the different age groups?

We will create a new DataFrame that, for each condition, contains the condition's total number of deaths and the proportion of those deaths that falls in each age group.

In [None]:
condition_list = df['Condition'].unique().tolist()

total_bycondition = {}

df2 = df.loc[(df['Year'].isnull()) & (df['State']=='United States') & (df['Age Group']!='All Ages')][['Condition', 'Age Group', 'Deaths']]

for condition in condition_list:
    deaths = df2.loc[(df2['Condition']==condition)]['Deaths'].sum()
    total_bycondition[condition] = deaths

new_data = {'Condition':condition_list, 'Total Deaths':list(total_bycondition.values())}

for age in age_list:
    proportions = []
    for condition in condition_list:
        deaths = df2.loc[(df2['Condition']==condition) & (df2['Age Group']==age)]['Deaths'].sum()
        proportions.append(deaths / total_bycondition[condition])

    if (age=='Not stated'):
        key = 'Proportion of Unstated Age'
    else:
        key = 'Proportion Aged ' + age

    new_data[key] = proportions

new_df = pd.DataFrame(new_data)
new_df

We can use this new DataFrame to examine the distribution of each condition across the different age groups. For example, if we sort the new DataFrame by the 'Proportion Aged 0-24' column, we can see which condition has the largest share of its deaths in the youngest age group:

In [None]:
new_df.sort_values('Proportion Aged 0-24').tail(1)
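The same proportions can be obtained without explicit loops by pivoting and dividing each row by its total. This is a sketch on invented counts for two conditions and two age groups, offered as an alternative rather than as the approach used above.

```python
import pandas as pd

# Invented counts (illustrative only).
toy = pd.DataFrame({
    'Condition': ['Diabetes', 'Diabetes', 'Obesity', 'Obesity'],
    'Age Group': ['0-24', '85+', '0-24', '85+'],
    'Deaths': [1.0, 9.0, 4.0, 6.0],
})

# One row per condition, one column per age group.
wide = toy.pivot_table(index='Condition', columns='Age Group', values='Deaths', aggfunc='sum')

# Divide each row by its own total so that every row sums to 1.
proportions = wide.div(wide.sum(axis=1), axis=0)

# Condition with the largest share of deaths in the youngest group.
youngest_heavy = proportions['0-24'].idxmax()
print(proportions)
print(youngest_heavy)
```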

**Question 6**
<br>What is the longest streak of fewer than 10 deaths in a month? When and in what state did this streak occur?

We will create a new DataFrame containing only the monthly records for the states with a death count less than 10 (a threshold chosen because of the missing values we filled in earlier). Then we can work through each row of the new DataFrame, stepping back one month at a time to find the earliest consecutive month that is still in the table.

If no streaks have been found yet, the new streak will be saved. If the new streak is larger than the old streak, the new streak will be saved. If the new streak is equal to the old streak, the new streak will be saved with the old streak.

In [None]:
df2 = df.copy().loc[(df['Year'].notnull()) & ((df['Year']!=2023) | ((df['Year']==2023) & (df['Month']<9))) & (df['State']!='United States') & (df['Condition']=='COVID-19') & (df['Age Group']=='All Ages') & (df['Deaths']<10)][['Year', 'Month', 'State', 'Deaths']]
df2.sort_values(by=['Year', 'Month'], ascending=False, inplace=True)
df2

def length_func(currentyear, currentmonth, state, length):
    # step back one month and look for a low-death record for the same state
    prevyear, prevmonth = prevmonth_func(currentyear, currentmonth)
    prevrecord = df2.loc[(df2['Year']==prevyear) & (df2['Month']==prevmonth) & (df2['State']==state)]

    # no record for the previous month means the streak starts here
    if prevrecord.empty:
        return currentyear, currentmonth, length

    return length_func(prevyear, prevmonth, state, length+1)

streak_list = []

# for i in range(len(df2)):
for i in range(len(df2.loc[(df2['Year']==2023)])):
    endyear, endmonth, state = df2.iloc[i][['Year', 'Month', 'State']]
    startyear, startmonth, length = length_func(endyear, endmonth, state, 1)

    new_streak = {'state':state, 'start year':startyear, 'start month': startmonth, 'end year':endyear, 'end month':endmonth, 'length':length}

    if not streak_list:
        streak_list.append(new_streak)

    elif length > streak_list[0]['length']:
        # a longer streak replaces everything saved so far
        streak_list = [new_streak]

    elif length == streak_list[0]['length']:
        streak_list.append(new_streak)

streak_list

We have two lines of code starting with `for i in range(len(df2`. The first one is commented out, but would find the longest streak of fewer than 10 COVID-19 deaths per month from any year. The second line searches for the longest streak that ends in 2023.

If we test both lines of code, we see that Wyoming had the longest streak overall, with fewer than 10 COVID-19 deaths per month from January to August 2020. The longest streak ending in 2023 was in Alaska and lasted from March to August 2023.
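The underlying idea, finding the longest run of consecutive months below a threshold, can be sketched for a single state with a plain loop. The function `longest_streak` and the monthly counts are hypothetical and serve only to illustrate the technique; this is not the recursive approach used above.

```python
def longest_streak(flags):
    """Return (start_index, length) of the longest run of True values."""
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    for i, flag in enumerate(flags):
        if flag:
            # Start a new run, or extend the current one.
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            # A month at or above the threshold ends the run.
            run_len = 0
    return best_start, best_len

# Invented monthly counts for one state, in chronological order.
monthly_deaths = [3, 0, 12, 5, 7, 2, 40]
low = [d < 10 for d in monthly_deaths]
result = longest_streak(low)
print(result)
```

Because the comparison is a strict `>`, the earliest run wins when two runs have the same length. The notebook's version keeps all tied streaks instead.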

Once again, these findings are probably too dependent on population size to be much use. Still, we are intrigued by our results, and pleased that the code works!



## Conclusion
We identified some questions we could answer with this dataset, then cleaned the data for analysis.

We used Python with the Pandas and Matplotlib libraries to create graphs, tables, and printed summaries displaying the distribution of COVID-19 deaths over different years, months, states, conditions, and age groups.

From these visualizations, we drew conclusions about when COVID-19 deaths increase, which age groups are particularly at risk, and how contributing conditions are spread across age groups. We also identified points of interest for further analysis, such as using population data to calculate and compare COVID-19 deaths per capita across the states.

In summary, we applied several data science practices to answer our questions and gain insight into the pandemic.