In [3]:
# Azure storage access info
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""  # Leave empty if not using SAS token

# Allow SPARK to read from Blob remotely
wasbs_path = f'wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/{blob_relative_path}'
spark.conf.set(f'fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net', blob_sas_token)

# SPARK read parquet
df = spark.read.parquet(wasbs_path)

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView('bing_covid19_data')

# Display the first 20 rows
display(spark.sql('SELECT * FROM bing_covid19_data LIMIT 20'))


Let's check the size of the dataset and the column types:

In [4]:
print( df.count(), len(df.columns) )
df.printSchema()


4766736 17
root
 |-- id: integer (nullable = true)
 |-- updated: date (nullable = true)
 |-- confirmed: integer (nullable = true)
 |-- confirmed_change: integer (nullable = true)
 |-- deaths: integer (nullable = true)
 |-- deaths_change: short (nullable = true)
 |-- recovered: integer (nullable = true)
 |-- recovered_change: integer (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- iso2: string (nullable = true)
 |-- iso3: string (nullable = true)
 |-- country_region: string (nullable = true)
 |-- admin_region_1: string (nullable = true)
 |-- iso_subdivision: string (nullable = true)
 |-- admin_region_2: string (nullable = true)
 |-- load_time: timestamp (nullable = true)



In [5]:
display(spark.sql(
    """
    WITH total_count AS (
        SELECT COUNT(*) AS total FROM bing_covid19_data
    )
    SELECT
        country_region,
        COUNT(*) as num_records,
        ROUND((COUNT(*) * 100.0 / (SELECT total FROM total_count)),2) AS percent_records
    FROM
        bing_covid19_data
    GROUP BY
        country_region
    ORDER BY
        num_records DESC
    LIMIT 10;
    """
))


About 80% of the records belong to the three largest contributors: the United States (66%), India (7.5%), and Germany (6.2%). These counts mix several levels of hierarchical data, which needs some clarification. The dataset does not contain duplicate entries, but it *does* contain aggregated data as additional rows. For instance, the Covid statistics for the individual counties of New York are present alongside a row for the state of New York itself, so blindly aggregating over every row belonging to New York would double count. The dataset also seems more granular for the United States than for other countries.

So if you want to aggregate over...

1. **State**: Check where admin_region_1 is *not* NULL and admin_region_2 is NULL (otherwise the national rows are included as well).

2. **Country**: Check where *both* admin_region_1 and admin_region_2 are NULL.
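
The two rules above can be sketched on a handful of made-up rows. This is a minimal illustration using Python's built-in `sqlite3` rather than Spark; the table name `toy` and all numbers are hypothetical, and only the column names mirror the dataset. The state-level filter here also requires `admin_region_1 IS NOT NULL` so that the national row is not picked up.

```python
import sqlite3

# Hypothetical toy rows: (country_region, admin_region_1, admin_region_2, confirmed).
# The numbers are invented purely to illustrate the hierarchy.
rows = [
    ("United States", None, None, 100),            # national total
    ("United States", "New York", None, 60),       # state total
    ("United States", "New York", "Albany", 25),   # county rows
    ("United States", "New York", "Kings", 35),
    ("United States", "Texas", None, 40),          # state total, no county rows
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE toy (country_region TEXT, admin_region_1 TEXT, "
    "admin_region_2 TEXT, confirmed INTEGER)"
)
con.executemany("INSERT INTO toy VALUES (?, ?, ?, ?)", rows)


def total(where="1 = 1"):
    """Sum `confirmed` over the rows matching a WHERE clause."""
    return con.execute(f"SELECT SUM(confirmed) FROM toy WHERE {where}").fetchone()[0]


naive = total()  # every row, all hierarchy levels mixed together
national = total("admin_region_1 IS NULL AND admin_region_2 IS NULL")
state = total("admin_region_1 IS NOT NULL AND admin_region_2 IS NULL")
county = total("admin_region_2 IS NOT NULL")

print(naive, national, state, county)
```

The naive sum counts the same cases once per hierarchy level, whereas each filtered sum touches a single level only.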

Now let's re-compute the number of records, but only at a national level and show the first and last dates of data submission:

In [6]:
display(spark.sql(
    """
    WITH total_count AS (
        SELECT 
            COUNT(*) AS total 
        FROM
            bing_covid19_data
        WHERE
            admin_region_1 is NULL
            AND admin_region_2 is NULL
    )
    SELECT
        country_region,
        COUNT(*) as num_records,
        ROUND((COUNT(*) * 100.0 / (SELECT total FROM total_count)),2) AS percent_records,
        MIN(updated) as start,
        MAX(updated) as end
    FROM
        bing_covid19_data
    WHERE
        admin_region_1 is NULL
        AND admin_region_2 is NULL
    GROUP BY
        country_region
    ORDER BY
        num_records DESC
    LIMIT 15;
    """
))


That looks a lot more even, with many countries having a similar number of records (since we're now just counting the number of days). Oddly, the Federated States of Micronesia has noticeably more records than the rest, even though its records start a year after everyone else's.

Nevertheless, to confirm that the dataset is being navigated properly, let's extract the national-level U.S. records that correspond to the first day of every month (i.e. dates of the form 202X-YY-01):

In [21]:
display(spark.sql(
"""
SELECT
    *
FROM
    bing_covid19_data
WHERE
    country_region = 'United States'
    AND admin_region_1 IS NULL
    AND admin_region_2 IS NULL
    AND updated LIKE '202_-__-01'
ORDER BY
    updated DESC
        
"""
))


Okay, the records were extracted correctly, although some numbers seem quite dubious (7.1 *million fewer* confirmed positive cases on 1 March 2022 than in the previous record?). These are exactly what the records state. Let's look more closely at that time range to see what's up:

In [25]:
display(spark.sql(
    """
    SELECT
        *
    FROM
        bing_covid19_data
    WHERE
        country_region = 'United States'
        AND admin_region_1 IS NULL
        AND admin_region_2 IS NULL
        AND (updated <= '2022-03-01'
            AND updated >= '2022-02-15')
    ORDER BY
        updated
    """
))


That is... interesting. There seems to have been a giant addition error of about +7.24M in the record for 20 Feb 2022, which gets corrected in the 1 March 2022 record with a -7.20M. The numbers in 'confirmed_change' for 21 - 26 Feb 2022 are also a little too similar, which makes them look more like errors than genuine values.

We should check how many records have a negative confirmed_change, deaths_change, or recovered_change to get an estimate of how many inaccurate rows there might be:

In [28]:
display(spark.sql(
    """
    SELECT
        COUNT(*) as neg_confirmed_change
    FROM
        bing_covid19_data
    WHERE
        confirmed_change < 0
    """
))

display(spark.sql(
    """
    SELECT
        COUNT(*) as neg_deaths_change
    FROM
        bing_covid19_data
    WHERE
        deaths_change < 0
    """
))

display(spark.sql(
    """
    SELECT
        COUNT(*) as neg_recovered_change
    FROM
        bing_covid19_data
    WHERE
        recovered_change < 0
    """
))


Okay, the number is not large enough to be a big issue. Aggregating over all dates in the dataset should be fine (since the errors and their corrections are summed together), but any time-interval-specific analysis or subsequent non-linear ML function should be wary and clean this data further before use. Honestly, fitting a spline or piecewise linear trend and replacing the huge values with the predictions should suffice. But that's for a different notebook; onto the questions!
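
As a rough sketch of the clean-up idea mentioned above (not something this notebook actually does), the snippet below flags daily changes that are negative or far above the median and replaces them by linear interpolation between the nearest unflagged neighbours. The series, the threshold factor of 10, and the function name `impute_spikes` are all assumptions made for illustration.

```python
from statistics import median


def impute_spikes(changes, factor=10.0):
    """Replace suspicious daily changes by linear interpolation.

    A value is flagged when it is negative or larger than `factor` times the
    median of the non-negative values. `factor` is an arbitrary choice here.
    """
    typical = median(c for c in changes if c >= 0)
    ok = [0 <= c <= factor * typical for c in changes]
    good = [i for i, flag in enumerate(ok) if flag]
    if not good:
        return list(changes)

    cleaned = list(changes)
    for i, flag in enumerate(ok):
        if flag:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None:
            cleaned[i] = changes[right]
        elif right is None:
            cleaned[i] = changes[left]
        else:
            # Straight line between the two nearest trusted points.
            weight = (i - left) / (right - left)
            cleaned[i] = changes[left] + weight * (changes[right] - changes[left])
    return cleaned


# Invented daily changes: one huge spike and a matching negative correction.
daily = [100, 110, 7_000_000, 120, 130, -6_999_000, 140]
print(impute_spikes(daily))
```

Interpolating the changes does not preserve the cumulative total, so the cumulative columns would have to be rebuilt from the cleaned changes afterwards.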

## Questions:

**Q1: Which countries had the largest counts in death, infection, and recovery?**

**Q2: Which countries had the lowest and highest fatality rates?**

### Q1: Which countries had the largest counts in death, infection, and recovery?

In [32]:
display(spark.sql(
    """
    SELECT
        country_region,
        MAX(confirmed) as max_confirmed,
        MAX(deaths) as max_deaths,
        MAX(recovered) as max_recovered
    FROM
        bing_covid19_data
    WHERE
        admin_region_1 IS NULL
        AND admin_region_2 IS NULL
    GROUP BY
        country_region
    ORDER BY
        max_confirmed DESC
    LIMIT 15
    """
))


For the largest numbers of confirmed cases, deaths, and recoveries, the table shows the following countries:

1. **Most confirmed cases:** United States (103M), India (45M), and France (39M)

2. **Most deaths:** United States (1.1M), Brazil (699K), and India (530K)

3. **Most recovered:** India (44M), United States (22M), and Brazil (20M).

### Q2: Which countries had the lowest and highest fatality rates?

The fatality rate can be calculated as the number of deaths divided by the number of confirmed cases. We'll look at the last record uploaded for each country (so we don't have to worry about significant numerical errors) and calculate this to see which had the highest and lowest rates:

In [45]:
display(spark.sql(
    """
    WITH last_dates as (
        SELECT
            country_region,
            MAX(updated) as final_date
        FROM
            bing_covid19_data
        WHERE
            admin_region_1 IS NULL
            AND admin_region_2 IS NULL
        GROUP BY
            country_region
    )
    SELECT
        b.country_region,
        b.updated,
        b.confirmed,
        b.deaths,
        b.recovered,
        b.deaths / b.confirmed as fatality_rate
    FROM
        bing_covid19_data b
    JOIN
        last_dates ld
    ON
        b.country_region = ld.country_region
        AND b.updated = ld.final_date
    WHERE
        admin_region_1 IS NULL
        AND admin_region_2 IS NULL
        AND deaths > 0
    ORDER BY
        fatality_rate DESC
    """
))


As of the latest records, the countries with the highest and lowest fatality rates are:

1. **Highest fatality rate:** Yemen (18%), but this is with small-population statistics.

2. **Lowest fatality rate:** Nauru (0.02%) - but also with small-population statistics.
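
The latest-record logic of the query above can also be sketched in plain Python. The records below are invented placeholders (the country names `A`, `B`, `C` and all counts are hypothetical); the point is only to show picking the most recent national row per country and dividing deaths by confirmed cases.

```python
from datetime import date

# Hypothetical national-level records: (country_region, updated, confirmed, deaths).
records = [
    ("A", date(2022, 1, 1), 1_000, 10),
    ("A", date(2022, 6, 1), 2_000, 30),
    ("B", date(2022, 1, 1), 50, 5),
    ("B", date(2022, 3, 1), 80, 8),
    ("C", date(2022, 2, 1), 500, 0),  # no deaths: excluded, as in the SQL
]


def latest_fatality_rates(rows):
    """Return {country: deaths / confirmed} using each country's latest row."""
    latest = {}
    for country, updated, confirmed, deaths in rows:
        # Keep only the most recent row seen so far for this country.
        if country not in latest or updated > latest[country][0]:
            latest[country] = (updated, confirmed, deaths)
    return {
        country: deaths / confirmed
        for country, (_, confirmed, deaths) in latest.items()
        # `confirmed > 0` is an extra guard against division by zero.
        if deaths > 0 and confirmed > 0
    }


rates = latest_fatality_rates(records)
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {rate:.2%}")
```

In this toy data, country `B` tops the ranking on the strength of only 80 confirmed cases, which is the kind of small-count effect the caveats above refer to.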

## Future work:

Easy time series analysis is hampered by the giant errors (clerical or otherwise) that can occur. We'll do a follow-up notebook where we impute some of these large values and see if we can do some forecasting.

## Tableau

In addition to this, a Tableau dashboard of the average monthly confirmed Covid-19 cases in the counties of New York was made, which can be accessed at https://public.tableau.com/app/profile/james.edmond1269/viz/ny_counties/Dashboard1. Screenshots are included below:

*The full dashboard for April 2020*

![image](images/tableau_nycounties_full_april2020.png)

*Full dashboard for July 2022*

![image](images/tableau_nycounties_full_july2022.png)

*The visual with New York county selected in July 2022:*

![image](images/tableau_nycounties_newyork_july2022.png)

*... And with Montgomery county selected in July 2022*

![image](images/tableau_nycounties_montgomery_july2022.png)