# CMSC320 - Introduction to Data Science
## Final Tutorial
#### David Martin
---

The purpose of this tutorial is to perform an analysis of COVID data across the state of Virginia.

In [67]:
import requests
import pandas as pd
import json
import datetime

## Data Collection

Let's start by pulling the data from the "VDH-COVID-19-PublicUseDataset-Cases" dataset, available on data.virginia.gov, and taking a look at what it contains:

https://data.virginia.gov/Government/VDH-COVID-19-PublicUseDataset-Cases/bre9-aqqr

I uploaded the .csv to my GitHub at the following link:

https://github.com/martindavid1995/Data-Science-Tutorial

In [68]:
# Pull data on COVID cases across Virginia
covid_data = pd.read_csv("https://raw.githubusercontent.com/martindavid1995/Data-Science-Tutorial/master/Covid_VA.csv")
covid_data.head()


Unnamed: 0,Report Date,FIPS,Locality,VDH Health District,Total Cases,Hospitalizations,Deaths
0,03/17/2020,51001,Accomack,Eastern Shore,0,0,0
1,03/17/2020,51003,Albemarle,Thomas Jefferson,0,0,0
2,03/17/2020,51005,Alleghany,Alleghany,0,0,0
3,03/17/2020,51007,Amelia,Piedmont,0,0,0
4,03/17/2020,51009,Amherst,Central Virginia,0,0,0


## Data Management

Now that we have our dataset imported, let's start looking at what we have to work with. Before we visualize the data, we should find out how much of it there is.

In [69]:
# len(covid_data.index) counts rows, not columns
print("Total number of rows in the dataset: ",len(covid_data.index))

Total number of rows in the dataset:  103607


As we can see, we have a fairly large dataset: over 100,000 rows, seemingly with one entry per locality for each date in the covered range. Let's see what the date range looks like:

In [70]:
def printMinMax(df, column):
    print("Oldest entry: ",df[column].min())
    print("Most recent entry: ",df[column].max())
    
printMinMax(covid_data, "Report Date")

Oldest entry:  01/01/2021
Most recent entry:  12/31/2021


The results of the cell above reveal our first issue with this dataset. The range shows January 1, 2021 as the earliest recorded date, yet the head output further up clearly contains rows dated back in 2020. The cause is that the "Report Date" column holds plain strings in MM/DD/YYYY form, so min and max compare them character by character, starting with the month, rather than chronologically. Let's convert the dates into datetime.date objects so we can find our actual date range and manipulate and visualize our data more easily.

In [74]:
def dateToDateTime(date: str):
    # Parse an MM/DD/YYYY string into a datetime.date
    month, day, year = (int(part) for part in date.split("/"))
    return datetime.date(year, month, day)


def convertDateTime(df):
    # Add a comparable DateTime column built from the "Report Date" strings
    df['DateTime'] = df["Report Date"].apply(dateToDateTime)
    return df

covid_data = convertDateTime(covid_data)
covid_data.head()

Unnamed: 0,Report Date,FIPS,Locality,VDH Health District,Total Cases,Hospitalizations,Deaths,DateTime
0,03/17/2020,51001,Accomack,Eastern Shore,0,0,0,2020-03-17
1,03/17/2020,51003,Albemarle,Thomas Jefferson,0,0,0,2020-03-17
2,03/17/2020,51005,Alleghany,Alleghany,0,0,0,2020-03-17
3,03/17/2020,51007,Amelia,Piedmont,0,0,0,2020-03-17
4,03/17/2020,51009,Amherst,Central Virginia,0,0,0,2020-03-17
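
As an aside, pandas can also parse the whole column in one call instead of converting each string separately. The following is only a sketch on a made-up three-row frame (the `sample` name and its values are illustrative, not taken from the dataset), assuming the dates are always in MM/DD/YYYY form as in the output above:

```python
import pandas as pd

# Made-up rows in the same MM/DD/YYYY format as the "Report Date" column
sample = pd.DataFrame({"Report Date": ["12/31/2021", "03/17/2020", "05/04/2022"]})

# Parse the whole column at once; an explicit format avoids day/month ambiguity
sample["DateTime"] = pd.to_datetime(sample["Report Date"], format="%m/%d/%Y")

# Optional: keep plain datetime.date objects, matching what dateToDateTime returns
sample["Date"] = sample["DateTime"].dt.date

print(sample["DateTime"].min(), sample["DateTime"].max())
```

Note that this variant stores pandas Timestamp values rather than datetime.date objects, so printed values would include a time component unless the `.dt.date` step is used.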


With the new DateTime column added, let's check what the actual date range of this dataset looks like.

In [72]:
printMinMax(covid_data, "DateTime")

Oldest entry:  2020-03-17
Most recent entry:  2022-05-04
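
The gap between the two reported ranges comes down to how the values are compared. Here is a minimal sketch using four hand-picked date strings (chosen for illustration, not read from the dataset) that contrasts string comparison with date comparison:

```python
from datetime import datetime

# Hand-picked sample in the dataset's MM/DD/YYYY format
raw = ["03/17/2020", "01/01/2021", "12/31/2021", "05/04/2022"]

# Strings compare character by character, so the month decides before the year
print(min(raw), max(raw))        # 01/01/2021 12/31/2021

# Parsed dates compare chronologically
parsed = [datetime.strptime(s, "%m/%d/%Y").date() for s in raw]
print(min(parsed), max(parsed))  # 2020-03-17 2022-05-04
```

Because "01" sorts before "03" as text, the January 2021 string wins the minimum even though March 2020 is earlier in time.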


Now we have a column of dates that we can actually compare in our data analysis. Our report dates span from March 17, 2020 to May 4, 2022. Let's see what counties we have, how many dates are recorded, and whether each county has a report for each date.

In [73]:
# Get a list of the unique counties
counties = covid_data["Locality"].unique()
# Display how many unique counties are in our dataset
print("There are ",len(counties)," unique counties")

# Figure out how many unique dates we have data for
dates = covid_data["DateTime"].unique()
# Display how many unique dates we have
print("There are ",len(dates)," unique dates")

# Get the number of rows recorded for each date in our dataset
date_counts = covid_data["Report Date"].value_counts()
# Collapse to the distinct per-date row counts; a single value here means every date has the same number of rows
unique_counts = date_counts.unique()
# Display the per-date row count
print("Each date has ",unique_counts[0]," unique entries")

There are  133  unique counties
There are  779  unique dates
Each date has  133  unique entries


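The cell above gives us a lot of useful information. We now know that this dataset contains COVID reporting data for 133 Virginia localities, a count that matches the state's 95 counties plus its 38 independent cities (the code loosely calls them all counties). The numbers also indicate that each of the 133 localities has a report for each of the 779 recorded dates: 133 × 779 = 103,607, which is exactly the number of rows in the table. With this information, we can now move on to visualizing our data.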
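
As an optional sanity check, the completeness claim can be tested directly rather than inferred from the counts. The helper below is a hypothetical sketch (the name `is_complete_panel` and the toy frame are made up for illustration); it assumes "complete" means exactly one row for every combination of locality and date:

```python
import pandas as pd

def is_complete_panel(df, key_col, date_col):
    """Return True if every key has exactly one row for every date."""
    n_keys = df[key_col].nunique()
    n_dates = df[date_col].nunique()
    # No (key, date) pair may appear twice...
    no_duplicates = not df.duplicated([key_col, date_col]).any()
    # ...and the row count must equal keys x dates
    return no_duplicates and len(df) == n_keys * n_dates

# Toy frame: two localities, two dates, all four combinations present
toy = pd.DataFrame({
    "Locality": ["A", "B", "A", "B"],
    "Report Date": ["03/17/2020", "03/17/2020", "03/18/2020", "03/18/2020"],
})

print(is_complete_panel(toy, "Locality", "Report Date"))           # True
print(is_complete_panel(toy.iloc[:3], "Locality", "Report Date"))  # False
```

On the real data, the equivalent call would be is_complete_panel(covid_data, "Locality", "DateTime"), assuming locality names are unique across the state.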

## Exploratory Data Analysis

Now that we know a bit more about our dataset and have made a small modification to make it easier to work with, let's try to visualize our COVID data.