# Cleaning
I will begin by cleaning the data with pandas DataFrames, removing the columns that are not needed for the time series analysis.

In [1]:
# Imports
import pandas as pd

In [2]:
# Read csv
BING_data = pd.read_csv('../data/Bing-COVID19-Data.csv')

In [3]:
# Check data types
BING_data.dtypes

ID                   int64
Updated             object
Confirmed            int64
ConfirmedChange    float64
Deaths             float64
DeathsChange       float64
Recovered          float64
RecoveredChange    float64
Latitude           float64
Longitude          float64
ISO2                object
ISO3                object
Country_Region      object
AdminRegion1        object
AdminRegion2        object
dtype: object

'Confirmed' is the only count column stored as int64. Casting it to float64 keeps it consistent with the other numeric columns, which are all float64.

In [4]:
# Swap dtype to float64
BING_data['Confirmed'] = BING_data['Confirmed'].astype('float64')

In [5]:
# Separate worldwide dataset
WWBING_data = BING_data[BING_data['Country_Region']=='Worldwide']

In [6]:
# Columns to be dropped; we are only looking at confirmed cases and confirmed deaths
dropcols=['ID', 'Recovered', 'RecoveredChange', 'Latitude', 'Longitude', 'ISO2', 'ISO3', 
          'Country_Region', 'AdminRegion1', 'AdminRegion2']

In [7]:
WWBING_data = WWBING_data.drop(columns=dropcols)

In [8]:
WWBING_data

| | Updated | Confirmed | ConfirmedChange | Deaths | DeathsChange |
|---|---|---|---|---|---|
| 0 | 01/21/2020 | 262.0 | NaN | 0.0 | NaN |
| 1 | 01/22/2020 | 313.0 | 51.0 | 0.0 | 0.0 |
| 2 | 01/23/2020 | 578.0 | 265.0 | 0.0 | 0.0 |
| 3 | 01/24/2020 | 841.0 | 263.0 | 0.0 | 0.0 |
| 4 | 01/25/2020 | 1320.0 | 479.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 114 | 05/14/2020 | 4434590.0 | 97617.0 | 301937.0 | 5685.0 |
| 115 | 05/15/2020 | 4531811.0 | 97221.0 | 307001.0 | 5064.0 |
| 116 | 05/16/2020 | 4626632.0 | 94821.0 | 311363.0 | 4362.0 |
| 117 | 05/17/2020 | 4710614.0 | 83982.0 | 315023.0 | 3660.0 |


In [9]:
# We are only looking at data from March to present
WWBING_data = WWBING_data[WWBING_data['Updated']>='03/01/2020']
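
The filter above compares strings, not dates. It gives the right rows here only because every value is zero-padded `MM/DD/YYYY` and falls within 2020. A minimal sketch of a sturdier variant, assuming the same `Updated` format; the toy frame and the `cutoff` name are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the notebook's data (values are made up).
toy = pd.DataFrame({
    'Updated': ['12/31/2019', '02/28/2020', '03/01/2020', '03/02/2020'],
    'Confirmed': [0.0, 10.0, 20.0, 30.0],
})

# String comparison orders by month first, so the December 2019 row slips through.
by_string = toy[toy['Updated'] >= '03/01/2020']

# Parsing first means the comparison is made on actual dates.
parsed = pd.to_datetime(toy['Updated'], format='%m/%d/%Y')
cutoff = pd.Timestamp('2020-03-01')
by_date = toy[parsed >= cutoff]

print(len(by_string), len(by_date))
```

Parsing only for the comparison leaves the `Updated` column itself unchanged, so a later merge on the string column still works.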

In [10]:
# Now for US data
USBING_data = BING_data[BING_data['Country_Region']=='United States']

In [11]:
# Keep only the country-level rows (no state or county breakdown)
USBING_data = USBING_data[USBING_data['AdminRegion1'].isnull()]

In [12]:
USBING_data = USBING_data.drop(columns=dropcols)

In [13]:
USBING_data = USBING_data[USBING_data['Updated']>='03/01/2020']

At this point, I can merge the data sets and update their headings.

In [14]:
# Suffix each column with its scope (WW/US); this makes Country_Region redundant,
# hence the prior removal. Assigning the result of rename (rather than using
# inplace=True on a filtered slice) avoids pandas' SettingWithCopyWarning.
WWBING_data = WWBING_data.rename(columns={'Confirmed': 'ConfirmedWW',
                                          'ConfirmedChange': 'ConfirmedChangeWW',
                                          'Deaths': 'DeathsWW',
                                          'DeathsChange': 'DeathsChangeWW'})

USBING_data = USBING_data.rename(columns={'Confirmed': 'ConfirmedUS',
                                          'ConfirmedChange': 'ConfirmedChangeUS',
                                          'Deaths': 'DeathsUS',
                                          'DeathsChange': 'DeathsChangeUS'})

In [15]:
BING_df = pd.merge(WWBING_data, USBING_data, on='Updated', how='outer')

In [16]:
# Should be noted that WW data is missing the most recent row
BING_df

| | Updated | ConfirmedWW | ConfirmedChangeWW | DeathsWW | DeathsChangeWW | ConfirmedUS | ConfirmedChangeUS | DeathsUS | DeathsChangeUS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 03/01/2020 | 87137.0 | 1734.0 | 2977.0 | 53.0 | 42.0 | 2.0 | 4.0 | 1.0 |
| 1 | 03/02/2020 | 88948.0 | 1811.0 | 3043.0 | 66.0 | 57.0 | 15.0 | 8.0 | 4.0 |
| 2 | 03/03/2020 | 90869.0 | 1921.0 | 3112.0 | 69.0 | 85.0 | 28.0 | 11.0 | 3.0 |
| 3 | 03/04/2020 | 93091.0 | 2222.0 | 3198.0 | 86.0 | 111.0 | 26.0 | 13.0 | 2.0 |
| 4 | 03/05/2020 | 95324.0 | 2233.0 | 3281.0 | 83.0 | 175.0 | 64.0 | 13.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | 05/15/2020 | 4531811.0 | 97221.0 | 307001.0 | 5064.0 | 1432899.0 | 25382.0 | 81423.0 | 1524.0 |
| 76 | 05/16/2020 | 4626632.0 | 94821.0 | 311363.0 | 4362.0 | 1457426.0 | 24527.0 | 82654.0 | 1231.0 |
| 77 | 05/17/2020 | 4710614.0 | 83982.0 | 315023.0 | 3660.0 | 1477157.0 | 19731.0 | 83439.0 | 785.0 |
| 78 | 05/18/2020 | 4786672.0 | 76058.0 | 317695.0 | 2672.0 | 1498264.0 | 21107.0 | 84231.0 | 792.0 |
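
To check which dates appear in only one of the two frames (for example, a most recent row that is missing on one side), `pd.merge` accepts `indicator=True`, which adds a `_merge` column to the result. A small sketch on made-up values; the `ww`, `us`, and `unmatched` names are illustrative:

```python
import pandas as pd

# Toy frames: the worldwide side lacks the last date (values are made up).
ww = pd.DataFrame({'Updated': ['05/16/2020', '05/17/2020'],
                   'ConfirmedWW': [100.0, 110.0]})
us = pd.DataFrame({'Updated': ['05/16/2020', '05/17/2020', '05/18/2020'],
                   'ConfirmedUS': [30.0, 33.0, 35.0]})

# _merge is 'both', 'left_only', or 'right_only' for each row.
merged = pd.merge(ww, us, on='Updated', how='outer', indicator=True)

# Rows that came from only one of the two frames.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['Updated', '_merge']])
```

With `how='outer'` the unmatched rows are kept and their missing columns are filled with NaN, so this check only reports them; it does not change the merge result apart from the extra column.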


# EDA

Next, the data should be explored.

In [17]:
# Install if needed (newer releases of this library are published as ydata-profiling)
# !pip install --user pandas_profiling

In [18]:
import pandas_profiling as pp

In [19]:
pp.ProfileReport(BING_df)



Abbreviations 
- Confirmed (C)
- Confirmed Change (CC)
- Deaths (D)
- Deaths Change (DC)
- Worldwide (WW)
- United States (US)

This report is quite detailed, which I tend to prefer over the `describe` function from pandas. The things to note here are the variable details, interactions, and correlations. While there is a lot that should probably be ignored (e.g., the mean and sum of CWW are meaningless because the series is cumulative), there is also a lot of important information to gain. The range of CWW shows the total magnitude of change, which you would expect to line up with the sum of CCWW, yet the two are slightly off. This likely means some form of estimation was used. The means of CC and DC give us some insight into the daily rate of cases and deaths. In this case, we see that for every `~60,000` cases there are `~4,000` deaths (a `~6.67%` death rate).
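
Both checks in this paragraph can be reproduced directly. The sketch below uses made-up numbers (not the Bing data) to compare the range of a cumulative column with the sum of its daily changes, and to take the ratio of mean daily deaths to mean daily cases. One thing it makes visible: even with perfectly consistent data, the range and the sum differ by the first row's change, because that change refers to the day before the window starts.

```python
import pandas as pd

# Toy, internally consistent series (values are made up).
toy = pd.DataFrame({
    'ConfirmedWW':       [100.0, 160.0, 230.0, 310.0],
    'ConfirmedChangeWW': [40.0, 60.0, 70.0, 80.0],
    'DeathsChangeWW':    [2.0, 4.0, 5.0, 5.0],
})

# Range of the cumulative column versus the sum of its daily changes.
value_range = toy['ConfirmedWW'].max() - toy['ConfirmedWW'].min()
change_sum = toy['ConfirmedChangeWW'].sum()
gap = change_sum - value_range  # equals the first row's change

# Ratio of mean daily deaths to mean daily cases.
ratio = toy['DeathsChangeWW'].mean() / toy['ConfirmedChangeWW'].mean()

# The approximate figures quoted in the text.
text_ratio = 4_000 / 60_000

print(gap, ratio, round(text_ratio, 4))
```

If the gap in the real data is larger or smaller than the first row's change, the remainder would be the part that needs another explanation, such as estimation.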

We can see in the interactions that the change rates tend to level off, or in other words, as C increases, CC hits an upper bound according to some hyperbolic function. In fact, I would guess that as the series expands we will see that CC was actually a skewed parabolic function the whole time (increasing sharply, plateauing, then decreasing over time). The same can be seen in D and DC across both US and WW data sets.

Finally, through correlations we can see that most of the data is highly correlated, meaning that as one variable increases, so do the others. When looking at heatmaps, it's important to filter either everything above or everything below the diagonal stretching from the top left (CCWW) to the bottom right (DCUS) so that we don't see redundant data. This is really just something to confirm there isn't something off in the data. If the series were expanded, I would expect that the 'Change' variables become less and less correlated with their counterparts as the rates should decrease over time.
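
Hiding one triangle of a correlation matrix can be sketched as follows, on made-up values. This is only an illustration of the idea, not how the profiling report builds its own heatmap; the `upper` and `lower_only` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the merged column names (values are made up).
toy = pd.DataFrame({
    'ConfirmedWW': [1.0, 2.0, 4.0, 7.0],
    'DeathsWW':    [0.0, 1.0, 1.0, 3.0],
    'ConfirmedUS': [0.0, 1.0, 3.0, 4.0],
})
corr = toy.corr()

# True on and above the diagonal: those cells either repeat the lower
# triangle or are the trivial self-correlation of 1.
upper = np.triu(np.ones(corr.shape, dtype=bool))

# Replace the masked cells with NaN, keeping one value per pair of variables.
lower_only = corr.mask(upper)
print(lower_only)
```

If the heatmap is drawn with seaborn, the same boolean array can be passed to `heatmap` through its `mask` argument instead of blanking the values first.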


In [20]:
import matplotlib.pyplot as plt

%matplotlib inline

In [25]:
# Initially, I got an error for there being no numeric data to plot
BING_df['Updated'] = pd.to_datetime(BING_df['Updated'])

In [26]:
BING_df.filter(regex='mr').plot(figsize=(16,9), linewidth=3, fontsize=20, cmap='Set1');
plt.xlabel('Date', fontsize=20);
plt.ylabel('Population', fontsize=20);

TypeError: no numeric data to plot
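
The `TypeError` above most likely comes from the `filter` call rather than from the dtypes: `filter(regex='mr')` keeps only columns whose names contain the substring `mr`, and none of the merged columns do, so `.plot` receives an empty frame. A sketch of one possible fix, using a few rows copied from the merged table and assuming the intent was to plot the cumulative confirmed counts (the choice of columns is an assumption):

```python
import pandas as pd

# First rows of the merged table, with 'Updated' already parsed.
toy = pd.DataFrame({
    'Updated': pd.to_datetime(['03/01/2020', '03/02/2020', '03/03/2020'],
                              format='%m/%d/%Y'),
    'ConfirmedWW': [87137.0, 88948.0, 90869.0],
    'ConfirmedChangeWW': [1734.0, 1811.0, 1921.0],
    'ConfirmedUS': [42.0, 57.0, 85.0],
})

# No column name contains 'mr', so nothing is selected.
empty = toy.filter(regex='mr')

# Put the date on the index and select the cumulative confirmed columns only.
to_plot = toy.set_index('Updated').filter(regex=r'^Confirmed(WW|US)$')

print(empty.shape, list(to_plot.columns))
```

Calling `to_plot.plot(figsize=(16, 9), linewidth=3, fontsize=20)` would then draw both series against the date index.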

In [30]:
BING_df.astype(float)

TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]
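
This cast fails because `Updated` is now a datetime column, and `astype(float)` is applied to every column. A sketch of two ways around it, on a couple of rows copied from the merged table: move the date into the index before casting, or select only the numeric columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'Updated': pd.to_datetime(['03/01/2020', '03/02/2020'], format='%m/%d/%Y'),
    'ConfirmedWW': [87137, 88948],   # integers on purpose, to show the cast
    'DeathsWW': [2977.0, 3043.0],
})

# Option 1: the date becomes the index, so only value columns are cast.
as_float = toy.set_index('Updated').astype(float)

# Option 2: keep the frame as is and pick out the numeric columns.
numeric_only = toy.select_dtypes(include='number')

print(as_float.dtypes)
print(list(numeric_only.columns))
```

The first option also gives the plot a date axis for free, since pandas plots against the index by default.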