UnitedStates_COVID_19_dataset


This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).


Coronaviruses are a family of viruses that cause illness ranging from the common cold and cough to more severe disease. Middle East Respiratory Syndrome (MERS-CoV) and Severe Acute Respiratory Syndrome (SARS-CoV) are severe examples that the world has already faced. SARS-CoV-2 (novel coronavirus) is a new member of the coronavirus family, first discovered in 2019, that had not previously been identified in humans. It is a contagious virus that was first reported in Wuhan in December 2019 and was later declared a pandemic by the WHO because of its rapid spread throughout the world. As of 20 May 2020, it had caused more than 300K deaths across the globe, including more than 90K deaths in the USA alone. The dataset is provided to identify the deaths and recovered cases.



https://github.com/dsrscientist/COVID_19_Datasets/blob/master/csse_covid_19_daily_reports_us.csv



USA daily state reports (csse_covid_19_daily_reports_us)



This table contains aggregated state-level data for each US state.



File naming convention


Field description


Province_State - The name of the State within the USA.


Country_Region - The name of the Country (US).


Last_Update - The most recent date the file was pushed.


Lat - Latitude.


Long_ - Longitude.


Confirmed - Aggregated confirmed case count for the state.


Deaths - Aggregated death count for the state.


Recovered - Aggregated recovered case count for the state.


Active - Aggregated confirmed cases that have not been resolved (Active = Confirmed - Recovered - Deaths).


FIPS - Federal Information Processing Standards code that uniquely identifies counties within the USA.


Incident_Rate - Confirmed cases per 100,000 persons.


People_Tested - Total number of people who have been tested.


People_Hospitalized - Total number of people hospitalized.


Mortality_Rate - Number of recorded deaths * 100 / number of confirmed cases.


UID - Unique Identifier for each row entry.


ISO3 - Officially assigned country code identifiers.


Testing_Rate - Total number of people tested per 100,000 persons.


Hospitalization_Rate - Total number of people hospitalized * 100 / number of confirmed cases.
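The derived fields in the list above (Active, Mortality_Rate, Hospitalization_Rate) can be recomputed from the raw counts, which is a handy sanity check on the file. The following is a minimal sketch, assuming the formulas exactly as stated in the field descriptions; the two-row sample and the helper name `add_derived_fields` are hypothetical and not part of the dataset or the original notebook.

```python
import pandas as pd

# Hypothetical two-state sample; the numbers are made up for illustration only.
sample = pd.DataFrame({
    "Province_State": ["State A", "State B"],
    "Confirmed": [1000, 400],
    "Deaths": [50, 10],
    "Recovered": [300, 100],
    "People_Hospitalized": [200, 40],
})


def add_derived_fields(frame):
    """Return a copy with the derived columns recomputed from the raw counts."""
    out = frame.copy()
    # Active = Confirmed - Recovered - Deaths
    out["Active"] = out["Confirmed"] - out["Recovered"] - out["Deaths"]
    # Mortality_Rate = deaths * 100 / confirmed cases (a percentage)
    out["Mortality_Rate"] = out["Deaths"] * 100 / out["Confirmed"]
    # Hospitalization_Rate = hospitalized * 100 / confirmed cases (a percentage)
    out["Hospitalization_Rate"] = out["People_Hospitalized"] * 100 / out["Confirmed"]
    return out


derived = add_derived_fields(sample)
print(derived[["Province_State", "Active", "Mortality_Rate", "Hospitalization_Rate"]])
```

Comparing these recomputed columns against the ones shipped in the CSV is one way to spot rows where a reported rate and its underlying counts disagree.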






Field description


FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.


Admin2: County name. US only.


Province_State: Province, state or dependency name.


Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.


Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).


Lat and Long_: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.


Confirmed: Confirmed cases include presumptive positive cases and probable cases, in accordance with CDC guidelines as of April 14.


Deaths: Death totals in the US include confirmed and probable, in accordance with CDC guidelines as of April 14.


Recovered: Recovered cases outside China are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from COVID Tracking Project.


Active: Active cases = total confirmed - total recovered - total deaths.


Combined_Key: Admin2 + Province_State + Country_Region.


Incidence_Rate: = confirmed cases per 100,000 persons.


Case-Fatality Ratio (%): = number of recorded deaths / number of confirmed cases.


US Testing Rate: = total test results per 100,000 persons. The "total test results" is equal to "Total test results (Positive + Negative)" from COVID Tracking Project.


US Hospitalization Rate (%): = Total number hospitalized / Number confirmed cases. The "Total number hospitalized" is the "Hospitalized - Cumulative" count from COVID Tracking Project. The "hospitalization rate" and "Hospitalized - Cumulative" data are only presented for those states which provide cumulative hospital data.

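The Last Update field above is described as MM/DD/YYYY HH:mm:ss in UTC. The following is a minimal sketch of parsing such strings into timezone-aware timestamps with pandas, assuming the values really follow that layout; the two sample strings are hypothetical, and the actual CSV may use a different layout, in which case the `format` argument would need to change.

```python
import pandas as pd

# Hypothetical timestamps written in the documented MM/DD/YYYY HH:mm:ss layout.
raw = pd.Series(["05/20/2020 02:32:19", "05/21/2020 14:05:00"])

# An explicit format avoids day/month ambiguity; utc=True marks the result as UTC.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", utc=True)
print(parsed)
```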
In [None]:
# Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore')
%matplotlib inline

In [None]:
# Load the dataset into the notebook environment

df = pd.read_csv("Project 9 (COVID 19 Dataset).csv")
df.head()

In [None]:
# Drop timestamp, location and identifier columns that are not used in the analysis
df.drop(['Last_Update', 'Lat', 'Long_', 'Country_Region', 'FIPS', 'UID'], inplace=True, axis=1)
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
# Fill missing values in each numeric column with that column's mean.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions deprecate.
cols_with_nulls = ['Recovered', 'Incident_Rate', 'People_Tested', 'People_Hospitalized',
                   'Mortality_Rate', 'Testing_Rate', 'Hospitalization_Rate']
df[cols_with_nulls] = df[cols_with_nulls].fillna(df[cols_with_nulls].mean())
df.isnull().sum()

In [None]:
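# Correlation heatmap of the numeric columns (string columns such as Province_State are excluded)
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, linecolor='black', fmt='.2f')
plt.show()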

In [None]:
# Check the skewness of the numeric columns
df.skew(numeric_only=True)

In [None]:
df.head()