In [None]:
Murali Krishna Enugula
HDS 5230 - 07
Week 05 - Dask Programming Assignment

1. Creating a Dask DataFrame for US States

In [2]:
import dask.dataframe as dd

# Loading dataset using Dask 
df = dd.read_csv('C://Users//drmur//Downloads//timeseries.csv', dtype={'cases': 'float64', 'deaths': 'float64', 'population': 'float64'})

# Converting date column to datetime
df['date'] = dd.to_datetime(df['date'], errors='coerce')

# Filtering for US states
df_us = df[df['country'] == 'United States']

# Selecting relevant columns
df_us = df_us[['state', 'date', 'population', 'cases', 'deaths']]

Parallelization is not necessary here: filtering rows by country is a simple, single-pass operation that Pandas handles efficiently even on fairly large datasets, as long as the data fits in memory.
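For comparison, the same filtering steps can be sketched in plain Pandas. This is a minimal illustration on a made-up three-row frame (the `toy` data and its values are assumptions, not rows from timeseries.csv), showing that the date coercion and the country filter carry over unchanged.

```python
import pandas as pd

# Hypothetical stand-in for timeseries.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["United States", "United States", "Canada"],
    "state": ["Missouri", "Illinois", "Ontario"],
    "date": ["2020-03-01", "2020-03-01", "not a date"],
    "cases": [10.0, 20.0, 5.0],
})

# Unparseable dates become NaT instead of raising an error
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Boolean-mask filter for US rows, then keep only the relevant columns
us_only = toy[toy["country"] == "United States"][["state", "date", "cases"]]
print(us_only)
```

Unlike the Dask version, every step here runs eagerly, so no `.compute()` call is needed.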

2. Computing Per-Capita Mortality

In [3]:
# Filling missing values
df_us['deaths'] = df_us['deaths'].fillna(0)
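# Passing meta tells Dask the output name and dtype, so it does not have to guess (and warn)
df_us['population'] = df_us['population'].fillna(df_us.groupby('state')['population'].transform('mean', meta=('population', 'float64')))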

# Grouping by state and computing total deaths and average population
state_mortality = df_us.groupby('state').agg({'deaths': 'sum', 'population': 'mean'})

# Computing per-capita mortality
state_mortality['Per-Capita Mortality'] = state_mortality['deaths'] / state_mortality['population']

# Result
state_mortality = state_mortality.compute().sort_values(by='Per-Capita Mortality', ascending=False)


Since the computation involves only about 50 states, Pandas can aggregate the data and compute per-capita mortality efficiently without parallelization. However, if the same calculation were performed for thousands of counties or cities, a parallelized framework like Dask would help handle the much larger number of groups.
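To make the arithmetic concrete, here is a small Pandas-only sketch of the same per-capita calculation. The two states `A` and `B` and all of their numbers are hypothetical; the steps mirror the Dask cell above (fill missing values, aggregate per state, divide, sort).

```python
import pandas as pd

# Hypothetical data: two states, two rows each, with some missing values
toy = pd.DataFrame({
    "state": ["A", "A", "B", "B"],
    "deaths": [1.0, 3.0, 3.0, None],
    "population": [1000.0, None, 500.0, 500.0],
})

# Missing deaths count as zero; missing population falls back to the state's mean
toy["deaths"] = toy["deaths"].fillna(0)
toy["population"] = toy["population"].fillna(
    toy.groupby("state")["population"].transform("mean")
)

# Total deaths and average population per state
per_state = toy.groupby("state").agg({"deaths": "sum", "population": "mean"})

# Per-capita mortality, highest first
per_state["Per-Capita Mortality"] = per_state["deaths"] / per_state["population"]
per_state = per_state.sort_values(by="Per-Capita Mortality", ascending=False)
print(per_state)
```

In this toy case state B (3 deaths per 500 people) ranks above state A (4 deaths per 1000 people).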

3. Computing Monthly Case Fatality Rate (CFR) Using WHO Guidelines

In [4]:
# Converting 'date' column to datetime format
df['date'] = dd.to_datetime(df['date'], errors='coerce')

# Filtering dataset for US states
df_us = df[df['country'] == 'United States']

# Keeping relevant columns
df_us = df_us[['state', 'date', 'cases', 'deaths']]

# Filling missing values in deaths and cases
df_us['cases'] = df_us['cases'].fillna(0)
df_us['deaths'] = df_us['deaths'].fillna(0)

# Extracting year-month for grouping
df_us['year_month'] = df_us['date'].dt.to_period('M')

# Grouping by state and month, then resetting the index to retain 'state' as a column
monthly_stats = df_us.groupby(['state', 'year_month']).agg({'cases': 'sum', 'deaths': 'sum'}).reset_index()

# Computing CFR using WHO approach
monthly_stats['CFR'] = (monthly_stats['deaths'] / monthly_stats['cases']) * 100

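# Replacing infinite values (deaths with zero cases) with NaN.
# float('nan') is used because replace(..., None) can trigger fill behavior in some pandas versions.
monthly_stats['CFR'] = monthly_stats['CFR'].replace([float('inf'), -float('inf')], float('nan'))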

# Computing to bring Dask DataFrame into Pandas format
monthly_stats = monthly_stats.compute()

# Pivoting the data into a state × month matrix (one row per state, one column per month)
cfr_matrix = monthly_stats.pivot(index='state', columns='year_month', values='CFR')

# Displaying the final CFR matrix
print(cfr_matrix.head())

year_month      2020-01  2020-02   2020-03   2020-04   2020-05   2020-06  \
state                                                                      
Alabama             NaN      NaN  0.532313  2.830899  3.889270  2.962907   
Alaska              NaN      NaN  0.335008  2.314519  2.196905  1.303247   
American Samoa      NaN      NaN       NaN       NaN       NaN       NaN   
Arizona             0.0      0.0  0.000000  1.486545  1.992175  0.211513   
Arkansas            NaN      NaN  0.915656  1.911450  2.129628  1.515155   

year_month       2020-07  
state                     
Alabama         2.381771  
Alaska          1.207417  
American Samoa       NaN  
Arizona         0.973523  
Arkansas        1.274360  


For state-level monthly CFR calculations (only a few hundred state-month groups), Pandas is sufficient. However, if CFR were computed from daily cases across thousands of counties, Dask could significantly speed up the grouping and aggregation by distributing the work across multiple processor cores.
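The CFR formula used above is deaths divided by cases, times 100, per state and month. The following Pandas sketch walks through it on invented numbers (states `A` and `B` are hypothetical) and shows how a month with deaths but zero reported cases ends up as NaN rather than infinity.

```python
import pandas as pd

# Hypothetical daily records; state B reports a death with zero cases
toy = pd.DataFrame({
    "state": ["A", "A", "A", "B"],
    "date": pd.to_datetime(["2020-03-05", "2020-03-20", "2020-04-02", "2020-03-10"]),
    "cases": [50.0, 150.0, 100.0, 0.0],
    "deaths": [1.0, 3.0, 5.0, 1.0],
})

# Monthly totals per state
toy["year_month"] = toy["date"].dt.to_period("M")
monthly = (
    toy.groupby(["state", "year_month"])
    .agg({"cases": "sum", "deaths": "sum"})
    .reset_index()
)

# CFR in percent; division by zero cases yields inf, which is mapped to NaN
monthly["CFR"] = monthly["deaths"] / monthly["cases"] * 100
monthly["CFR"] = monthly["CFR"].replace([float("inf"), -float("inf")], float("nan"))

# One row per state, one column per month
cfr = monthly.pivot(index="state", columns="year_month", values="CFR")
print(cfr)
```

State A gets 4 / 200 × 100 = 2% for March and 5 / 100 × 100 = 5% for April, while every entry for state B is NaN.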

4. Ranking States Based on CFR Changes Over Time

In [5]:
# Computing month-to-month CFR changes
cfr_changes = cfr_matrix.diff(axis=1)  # This is now a Pandas DataFrame

# Computing total absolute change across months
state_cfr_variability = cfr_changes.abs().sum(axis=1)

# Ranking states based on CFR fluctuations
state_cfr_ranking = state_cfr_variability.sort_values(ascending=False)

# Displaying the ranking
print(state_cfr_ranking.head(10))  # Show the top 10 states with the highest CFR variability

state
Michigan                    8.688709
Northern Mariana Islands    7.932137
New Jersey                  7.849657
Connecticut                 7.686037
Massachusetts               7.437951
Washington                  6.695390
Pennsylvania                6.512038
Wisconsin                   6.470096
New Hampshire               6.167981
Missouri                    5.982125
dtype: float64


This operation works on a small state × month matrix: month-to-month CFR changes are computed and their absolute values are summed across months. Because the matrix is so small, parallelization is unnecessary and Pandas is the natural choice for quick computation. For larger datasets (e.g., county-level trends), Dask could be considered, but it is not required for state-level analysis.
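As a rough illustration of the ranking metric, the sketch below applies the same `diff`, `abs`, and `sum` steps to a tiny hand-made CFR matrix. The states and values are hypothetical and chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical CFR matrix (percent); state B has no value for the last month
cfr = pd.DataFrame(
    {"2020-03": [1.0, 2.0], "2020-04": [3.0, 2.5], "2020-05": [2.0, None]},
    index=pd.Index(["A", "B"], name="state"),
)

# Month-to-month change; the first column is NaN because it has no predecessor
changes = cfr.diff(axis=1)

# Sum of absolute changes per state (NaN entries are skipped by sum)
variability = changes.abs().sum(axis=1)

# Highest variability first
ranking = variability.sort_values(ascending=False)
print(ranking)
```

State A scores |3 - 1| + |2 - 3| = 3.0 and state B scores |2.5 - 2| = 0.5. Because `sum` skips NaN, a state with missing months contributes fewer terms, which may understate its variability relative to states with complete data.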