In [11]:
import dask.dataframe as dd
import pandas as pd
import numpy as np

In [12]:
dtypes = {
    'population': 'float64',
    'cases': 'float64',
    'deaths': 'float64'
}

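# Read the CSV lazily; nothing is loaded until .compute() is called
covid_df = dd.read_csv('timeseries.csv', dtype=dtypes)

# Keep only state-level rows for the United States
us_states = covid_df[covid_df['country'] == 'United States']
us_states = us_states[us_states['level'] == 'state']

us_states['date'] = dd.to_datetime(us_states['date'])

# Restrict to the study window (both endpoints inclusive)
mask = (us_states['date'] >= '2020-01-01') & (us_states['date'] <= '2021-02-28')
us_states = us_states[mask]

# Deaths are cumulative, so max - min gives the deaths that occurred within the window
total_deaths = us_states.groupby('state')['deaths'].max() - us_states.groupby('state')['deaths'].min()
avg_population = us_states.groupby('state')['population'].mean()
mortality_per_capita = (total_deaths / avg_population).compute()

# After .compute() the result is a pandas Series, so ranking runs in pandas
ranked_states_mortality = mortality_per_capita.sort_values(ascending=False)
print(ranked_states_mortality)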

state
New Jersey                      0.001712
New York                        0.001280
Connecticut                     0.001216
Massachusetts                   0.001187
Rhode Island                    0.000903
Washington, D.C.                0.000791
Louisiana                       0.000706
Michigan                        0.000623
Illinois                        0.000553
Maryland                        0.000536
Pennsylvania                    0.000527
Delaware                        0.000520
Indiana                         0.000392
Mississippi                     0.000373
Colorado                        0.000295
New Hampshire                   0.000274
Georgia                         0.000269
Minnesota                       0.000253
Ohio                            0.000248
New Mexico                      0.000244
Arizona                         0.000231
Iowa                            0.000228
Virginia                        0.000217
Alabama                         0.000205
Washington

In [13]:
# Calculate monthly CFR
monthly_data = us_states.groupby(['state', us_states['date'].dt.to_period('M')]).agg({
    'cases': 'max',
    'deaths': 'max'
}).compute()

# Calculate monthly new cases and deaths
monthly_data['new_cases'] = monthly_data.groupby('state')['cases'].diff()
monthly_data['new_deaths'] = monthly_data.groupby('state')['deaths'].diff()

# Calculate CFR (using new deaths divided by new cases)
monthly_data['cfr'] = monthly_data['new_deaths'] / monthly_data['new_cases'] * 100

# Reshape to get state x month matrix
cfr_matrix = monthly_data['cfr'].unstack()
print(cfr_matrix)

date                           2020-03    2020-04    2020-05    2020-06  \
state                                                                     
Virginia                           NaN   3.699390   2.861121   2.139745   
Washington                    3.970528   6.605625   3.716834   1.765157   
Alabama                            NaN   4.252492   3.301930   1.592594   
Alaska                             NaN        NaN   0.884956   0.306279   
Arizona                            NaN        NaN   4.744466   0.472351   
Arkansas                           NaN   1.969528   1.889764   1.054576   
California                         NaN   4.430410   3.471603   1.608304   
Colorado                           NaN   5.846825   5.850422   3.868014   
Connecticut                        NaN   8.793001  11.779081   8.815299   
Delaware                           NaN   3.216308   4.540632   7.189542   
Washington, D.C.                   NaN   5.616510   5.404198   5.570118   
Florida                  

In [14]:
# Calculate month-to-month changes in CFR
cfr_changes = cfr_matrix.diff(axis=1)

# Aggregate changes (sum of absolute changes)
total_cfr_change = cfr_changes.abs().sum(axis=1)

# Rank states by total CFR change
ranked_states_cfr_change = total_cfr_change.sort_values(ascending=False)
print(ranked_states_cfr_change)

state
United States Virgin Islands    66.666667
Alaska                          40.884956
Rhode Island                    31.083942
New Jersey                      26.936583
Pennsylvania                    20.208528
New Hampshire                   11.869711
Connecticut                     11.571058
Delaware                        10.729251
Missouri                        10.614211
Michigan                        10.267054
Washington                       8.684911
Massachusetts                    7.813686
New York                         7.656158
Louisiana                        7.334429
Vermont                          7.191781
Arizona                          7.116321
Ohio                             6.788839
Illinois                         6.506730
Oklahoma                         6.013869
Mississippi                      5.750529
Florida                          5.698160
New Mexico                       5.463665
Wisconsin                        5.162849
Maine                       

Reading and Initial Filtering Operations:
Reading the CSV benefits from parallelism because the time series dataset is large. Dask splits the file into partitions and processes the resulting chunks simultaneously. Filtering for US states and selecting a date range are row-wise operations, so each partition can be filtered without reference to any other, which makes this step embarrassingly parallel. Initial loading and filtering are therefore well suited to distributed computing.
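
To make the independence of the filtering step concrete, here is a minimal sketch that uses plain pandas DataFrames as stand-ins for Dask partitions. The rows, the two-way split, and the helper name `filter_partition` are all hypothetical and chosen only for illustration; this is not how Dask is invoked, only the per-partition logic it can apply.

```python
import pandas as pd

# Hypothetical toy rows standing in for timeseries.csv; the values are made up.
rows = pd.DataFrame({
    "country": ["United States", "United States", "Canada", "United States"],
    "level": ["state", "county", "state", "state"],
    "state": ["Ohio", "Ohio", "Ontario", "Iowa"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-01", "2021-03-15"]),
})

# Split into two "partitions", much as a large CSV would be split into blocks.
partitions = [rows.iloc[:2], rows.iloc[2:]]

def filter_partition(part):
    """Row-wise filter that needs no data from any other partition."""
    keep = (
        (part["country"] == "United States")
        & (part["level"] == "state")
        & (part["date"] >= "2020-01-01")
        & (part["date"] <= "2021-02-28")
    )
    return part[keep]

# Each call is independent, so the calls could run on separate workers.
filtered = pd.concat([filter_partition(p) for p in partitions])

# Filtering the whole table at once selects the same rows.
assert filtered.equals(filter_partition(rows))
print(filtered["state"].tolist())
```

Because no partition needs rows from another, the result does not depend on how the data is split.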

Per-capita Mortality Calculation:
The per-capita mortality calculation needs two aggregations, total deaths and average population, before the division. The groupby operations start in parallel but end with a step that combines results across partitions: each partition computes its local maximums and minimums, plus the sums and counts behind the means, and these are then consolidated into one value per state. The division runs only after aggregation is complete, on a single small result, so it has little left to parallelize. The benefit of distribution here depends on data size: grouped data that fits comfortably in memory is better handled without distribution.
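
The two-stage pattern described above can be sketched in plain pandas. The two partitions, their values, and the helper name `partial_aggregate` are hypothetical; the sketch only illustrates how per-partition results can be combined and is not a description of Dask's internals.

```python
import pandas as pd

# Hypothetical cumulative figures; both partitions contain rows for both states.
part_a = pd.DataFrame({
    "state": ["Ohio", "Iowa", "Ohio"],
    "deaths": [10.0, 5.0, 30.0],
    "population": [100.0, 50.0, 100.0],
})
part_b = pd.DataFrame({
    "state": ["Iowa", "Ohio", "Iowa"],
    "deaths": [20.0, 70.0, 45.0],
    "population": [50.0, 100.0, 50.0],
})

def partial_aggregate(part):
    """Per-partition step: local max/min of deaths, sum/count of population."""
    return part.groupby("state").agg(
        deaths_max=("deaths", "max"),
        deaths_min=("deaths", "min"),
        pop_sum=("population", "sum"),
        pop_count=("population", "count"),
    )

# Combine step: a max of maxes, a min of mins, and sums of sums and counts.
partials = pd.concat([partial_aggregate(part_a), partial_aggregate(part_b)])
combined = partials.groupby(level="state").agg(
    deaths_max=("deaths_max", "max"),
    deaths_min=("deaths_min", "min"),
    pop_sum=("pop_sum", "sum"),
    pop_count=("pop_count", "sum"),
)

# Local means cannot simply be averaged; the mean is rebuilt from sum / count.
total_deaths = combined["deaths_max"] - combined["deaths_min"]
avg_population = combined["pop_sum"] / combined["pop_count"]
mortality_per_capita = total_deaths / avg_population
print(mortality_per_capita)
```

The combine step touches only one row per state per partition, which is why it is cheap but also why it cannot be spread across many workers.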

Monthly CFR Calculations:
The CFR calculation combines steps with different degrees of parallelism. The initial state-month grouping parallelizes well because each partition can aggregate its own rows. Computing month-to-month new cases and deaths, however, requires walking each state's months in order, and this sequential dependency limits parallelism within a state. The CFR itself, new deaths divided by new cases, is an element-wise operation, although it needs both values to be present for each state-month pair.
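
As a small illustration of the within-state dependency, the sketch below computes monthly differences and CFR on made-up month-end totals. The data and the use of string month labels are assumptions made for brevity, and the replacement of zero new cases with NaN is an added safeguard that is not part of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end cumulative totals for two states.
monthly = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Ohio", "Iowa", "Iowa", "Iowa"],
    "month": ["2020-03", "2020-04", "2020-05"] * 2,
    "cases": [100.0, 300.0, 700.0, 50.0, 50.0, 150.0],
    "deaths": [2.0, 12.0, 20.0, 1.0, 1.0, 4.0],
}).set_index(["state", "month"]).sort_index()

# diff() is sequential within a state, but each state is independent of the others.
monthly["new_cases"] = monthly.groupby(level="state")["cases"].diff()
monthly["new_deaths"] = monthly.groupby(level="state")["deaths"].diff()

# Element-wise ratio; months with zero new cases become NaN rather than inf or 0/0.
new_cases = monthly["new_cases"].replace(0, np.nan)
monthly["cfr"] = monthly["new_deaths"] / new_cases * 100
print(monthly["cfr"])
```

The first month of every state is NaN because there is no earlier month to subtract, which matches the NaN entries in the first column of the printed matrix.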

CFR Change Matrix and Rankings:
Building the state-by-month matrix and computing monthly changes requires the complete time series for each state. The matrix can be built for multiple states in parallel, but the difference calculation needs each state's months in sequence. Ranking compares all states at once, so it is effectively a sequential operation. Distribution offers little benefit in this phase unless the number of states or time periods is very large.
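
The following sketch repeats the change-and-rank steps on a tiny made-up matrix to show which parts are row-independent and which need every state at once. The CFR values are hypothetical and are not taken from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly CFR values (percent); rows are states, columns are months.
cfr_matrix = pd.DataFrame(
    {
        "2020-04": [4.0, 8.0, np.nan],
        "2020-05": [3.0, 11.0, 1.0],
        "2020-06": [2.0, 9.0, 0.5],
    },
    index=pd.Index(["Alabama", "Connecticut", "Alaska"], name="state"),
)

# Differences run along the month axis, so each row (state) is independent.
cfr_changes = cfr_matrix.diff(axis=1)

# sum() skips NaN by default, so a state with missing months is summed
# over fewer differences than a state with a full series.
total_cfr_change = cfr_changes.abs().sum(axis=1)

# Sorting needs every state's total at once; with a few dozen rows this is trivial.
ranked = total_cfr_change.sort_values(ascending=False)
print(ranked)
```

With only one row per state, the matrix is small enough that running these steps in pandas after `.compute()` is a reasonable choice.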

Data Storage and Memory Considerations:
Dask remains a good choice for managing memory even where parallelization adds little. Its lazy evaluation and its ability to work on data larger than available memory keep the computation stable and scalable. However many workers an operation spans, Dask can still process datasets that exceed RAM capacity.
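
The out-of-core idea can be illustrated without Dask by reading a CSV in chunks with pandas and folding each chunk into a running result. This is a simplified analogy, not Dask's scheduler; the CSV text and the chunk size of two rows are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file too large to load at once.
csv_text = """state,deaths
Ohio,10
Iowa,5
Ohio,30
Iowa,20
Ohio,70
"""

running_max = None

# Only one two-row chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    local_max = chunk.groupby("state")["deaths"].max()
    if running_max is None:
        running_max = local_max
    else:
        # Merge the chunk's result into the running result, keeping the larger value.
        running_max = pd.concat([running_max, local_max]).groupby(level=0).max()

print(running_max)
```

Peak memory is bounded by the chunk size plus one row per state, regardless of how long the file is.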

Data loading and filtering gain the most from parallel execution, whereas the final aggregation and ranking steps gain very little. The decision to use distributed computing should weigh the parallelization benefit against factors such as dataset size, available resources, and network overhead. A hybrid approach would work best for this analysis: process in parallel up to the grouping and ranking steps, then run those sequentially.