<a href="https://colab.research.google.com/github/niltontac/EspAnalise-EngDados/blob/master/Covid_19_Analysis_and_Predictions%20-%20In%20Progress.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#About this Dataset

#####These datasets, provided by Johns Hopkins University, contain daily updated counts of confirmed Covid-19 cases, deaths and recoveries. Note that these are time series data, so the number of cases reported for a given day is a cumulative total.
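
Because the counts are cumulative, the new cases for each day can be recovered by differencing consecutive days. A minimal sketch, assuming a small hypothetical series (the values are illustrative and not taken from the datasets):

```python
import pandas as pd

# Hypothetical cumulative confirmed counts for four consecutive days (illustrative values)
cumulative = pd.Series(
    [10, 15, 15, 23],
    index=pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04']),
    name='Confirmed',
)

# diff() subtracts the previous day's total; the first day has no predecessor,
# so its cumulative count is used as that day's new cases
daily_new = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

print(daily_new.tolist())  # [10, 5, 0, 8]
```

Summing the daily values again with `cumsum()` returns the original cumulative series, which is a quick way to sanity-check the conversion.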

---

#####Source (Datasets):
##### https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
##### https://github.com/niltontac/EspAnalise-EngDados/tree/master/data/Novel_Corona_Virus_2019_Dataset

---

#####Analyst: Nilton Thiago de Andrade Coura


# Covid-19 - Exploratory Analysis and Predictions

![alt text](https://cdn.cnn.com/cnnnext/dam/assets/200130165125-corona-virus-cdc-image-super-tease.jpg)

In [1]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go

# Loading dataset
# Last dataset update 03/31/2020

covid19confirmed = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv')

covid19deaths = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')

covid19recovered = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv')

covid19 = pd.read_csv('https://raw.githubusercontent.com/niltontac/EspAnalise-EngDados/master/data/Novel_Corona_Virus_2019_Dataset/covid_19_data.csv', parse_dates=['ObservationDate', 'Last Update'])



In [0]:
last_date_update = '3/31/20'

Checking the last five rows of each dataset to confirm when it was last updated:

In [3]:
print('covid19confirmed:')
print(covid19confirmed.tail())
####
print('covid19deaths:')
print(covid19deaths.tail())
####
print('covid19recovered:')
print(covid19recovered.tail())
####
print('covid19:')
print(covid19.tail())

covid19confirmed:
               Province/State  Country/Region  ...  3/30/20  3/31/20
251  Turks and Caicos Islands  United Kingdom  ...        5        5
252                       NaN      MS Zaandam  ...        2        2
253                       NaN        Botswana  ...        3        4
254                       NaN         Burundi  ...        0        2
255                       NaN    Sierra Leone  ...        0        1

[5 rows x 74 columns]
covid19deaths:
               Province/State  Country/Region  ...  3/30/20  3/31/20
251  Turks and Caicos Islands  United Kingdom  ...        0        0
252                       NaN      MS Zaandam  ...        0        0
253                       NaN        Botswana  ...        0        1
254                       NaN         Burundi  ...        0        0
255                       NaN    Sierra Leone  ...        0        0

[5 rows x 74 columns]
covid19recovered:
               Province/State  Country/Region  ...  3/30/20  3/31/20
237  T
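
The same check can be made programmatically. In the wide time-series files, every column after the first four holds the counts for one date, so the latest date column indicates the last update. A sketch, assuming a hypothetical two-row frame with the same column layout as `covid19confirmed` (the `Lat`/`Long` values are placeholders):

```python
import pandas as pd

# Hypothetical miniature of the wide time-series layout
# (Lat/Long are placeholders; the counts mirror the Botswana and Burundi rows printed above)
sample = pd.DataFrame({
    'Province/State': [None, None],
    'Country/Region': ['Botswana', 'Burundi'],
    'Lat': [0.0, 0.0],
    'Long': [0.0, 0.0],
    '3/30/20': [3, 0],
    '3/31/20': [4, 2],
})

# Columns after the four metadata columns are dates in month/day/two-digit-year form
date_columns = pd.to_datetime(sample.columns[4:], format='%m/%d/%y')
latest = date_columns.max()

print(latest.strftime('%Y-%m-%d'))  # 2020-03-31
```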

In [0]:
# Rename column 'ObservationDate' to 'Date'

covid19 = covid19.rename(columns={'ObservationDate' : 'Date'})

Dimensions of the datasets (rows, columns):

In [5]:
print('covid19confirmed:')
print(covid19confirmed.shape)
####
print('covid19deaths:')
print(covid19deaths.shape)
####
print('covid19recovered:')
print(covid19recovered.shape)
####
print('covid19:')
print(covid19.shape)

covid19confirmed:
(256, 74)
covid19deaths:
(256, 74)
covid19recovered:
(242, 74)
covid19:
(10671, 8)


Checking for null or missing values:

In [6]:
print('covid19confirmed:')
print(pd.DataFrame(covid19confirmed.isnull().sum()))
####
print('covid19deaths:')
print(pd.DataFrame(covid19deaths.isnull().sum()))
####
print('covid19recovered:')
print(pd.DataFrame(covid19recovered.isnull().sum()))
####
print('covid19:')
print(pd.DataFrame(covid19.isnull().sum()))

covid19confirmed:
                  0
Province/State  177
Country/Region    0
Lat               0
Long              0
1/22/20           0
...             ...
3/27/20           0
3/28/20           0
3/29/20           0
3/30/20           0
3/31/20           0

[74 rows x 1 columns]
covid19deaths:
                  0
Province/State  177
Country/Region    0
Lat               0
Long              0
1/22/20           0
...             ...
3/27/20           0
3/28/20           0
3/29/20           0
3/30/20           0
3/31/20           0

[74 rows x 1 columns]
covid19recovered:
                  0
Province/State  178
Country/Region    0
Lat               0
Long              0
1/22/20           0
...             ...
3/27/20           0
3/28/20           0
3/29/20           0
3/30/20           0
3/31/20           0

[74 rows x 1 columns]
covid19:
                   0
SNo                0
Date               0
Province/State  4956
Country/Region     0
Last Update        0
Confirmed          0
Deat

The datasets have missing (null) values in the "Province/State" column.
Let's replace them with 'unknow':

In [0]:
# Replacing missing values

covid19confirmed = covid19confirmed.fillna('unknow')
covid19deaths = covid19deaths.fillna('unknow')
covid19recovered = covid19recovered.fillna('unknow')
covid19 = covid19.fillna('unknow')
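
Calling `fillna` on a whole DataFrame writes the placeholder into every column that has a gap. A more targeted variant, sketched here on a hypothetical three-row frame with illustrative values, fills only the "Province/State" column so that numeric columns keep their dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical sample using the same column names as the datasets above (illustrative values)
sample = pd.DataFrame({
    'Province/State': ['Hubei', np.nan, np.nan],
    'Country/Region': ['China', 'Italy', 'Spain'],
    'Confirmed': [100.0, 50.0, 20.0],
})

# Fill only the column known to contain missing values
sample['Province/State'] = sample['Province/State'].fillna('unknow')

print(sample['Province/State'].tolist())  # ['Hubei', 'unknow', 'unknow']
print(sample['Confirmed'].dtype)          # float64
```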

In [8]:
# Checking for null or missing values again

print('covid19confirmed:')
print(pd.DataFrame(covid19confirmed.isnull().sum()))
####
print('covid19deaths:')
print(pd.DataFrame(covid19deaths.isnull().sum()))
####
print('covid19recovered:')
print(pd.DataFrame(covid19recovered.isnull().sum()))
####
print('covid19:')
print(pd.DataFrame(covid19.isnull().sum()))

covid19confirmed:
                0
Province/State  0
Country/Region  0
Lat             0
Long            0
1/22/20         0
...            ..
3/27/20         0
3/28/20         0
3/29/20         0
3/30/20         0
3/31/20         0

[74 rows x 1 columns]
covid19deaths:
                0
Province/State  0
Country/Region  0
Lat             0
Long            0
1/22/20         0
...            ..
3/27/20         0
3/28/20         0
3/29/20         0
3/30/20         0
3/31/20         0

[74 rows x 1 columns]
covid19recovered:
                0
Province/State  0
Country/Region  0
Lat             0
Long            0
1/22/20         0
...            ..
3/27/20         0
3/28/20         0
3/29/20         0
3/30/20         0
3/31/20         0

[74 rows x 1 columns]
covid19:
                0
SNo             0
Date            0
Province/State  0
Country/Region  0
Last Update     0
Confirmed       0
Deaths          0
Recovered       0


#Plotly Visualizations:

All records, including confirmed cases, deaths and recoveries:

In [9]:
# All confirmed, deaths and recovered cases

# Select the columns with a list; tuple-style indexing on a groupby is deprecated
cases_growth = covid19.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].sum()
cases_growth = cases_growth.reset_index()
cases_growth = cases_growth.sort_values('Date', ascending=False)

fig = go.Figure()
fig.update_layout(template='plotly_dark')

fig.add_trace(go.Scatter(x=cases_growth['Date'], 
                        y=cases_growth['Confirmed'], 
                        mode='lines+markers',
                        name='Confirmed',
                        line=dict(color='Yellow', width=2)))

fig.add_trace(go.Scatter(x=cases_growth['Date'], 
                        y=cases_growth['Deaths'], 
                        mode='lines+markers',
                        name='Deaths',
                        line=dict(color='red', width=2)))

fig.add_trace(go.Scatter(x=cases_growth['Date'], 
                        y=cases_growth['Recovered'], 
                        mode='lines+markers',
                        name='Recovered',
                        line=dict(color='green', width=2)))

fig.show()



Death and recovery rates and percentage increase in confirmed cases:

In [10]:
cases_rate = covid19.groupby(['Date']).agg({'Deaths': ['sum'],'Recovered': ['sum'],'Confirmed': ['sum']})
cases_rate.columns = ['Global_Deaths','Global_Recovered','Global_Confirmed']
cases_rate = cases_rate.reset_index()
cases_rate['Increase_cases_per_day']=cases_rate['Global_Confirmed'].diff().shift(-1)

cases_rate['Global_Deaths_rate_%'] = cases_rate.apply(lambda row: ((row.Global_Deaths)/(row.Global_Confirmed))*100 , axis=1)
cases_rate['Global_Recovered_rate_%'] = cases_rate.apply(lambda row: ((row.Global_Recovered)/(row.Global_Confirmed))*100 , axis=1)
cases_rate['Global_Growth_rate_%']=cases_rate.apply(lambda row: row.Increase_cases_per_day/row.Global_Confirmed*100, axis=1)
cases_rate['Global_Growth_rate_%']=cases_rate['Global_Growth_rate_%'].shift(+1)



fig = go.Figure()
fig.update_layout(template='plotly_dark')
fig.add_trace(go.Scatter(x=cases_rate['Date'], 
                         y=cases_rate['Global_Deaths_rate_%'],
                         mode='lines+markers',
                         name='Death rate %',
                         line=dict(color='red', width=2)))

fig.add_trace(go.Scatter(x=cases_rate['Date'], 
                         y=cases_rate['Global_Recovered_rate_%'],
                         mode='lines+markers',
                         name='Recovery rate %',
                         line=dict(color='Green', width=2)))

fig.add_trace(go.Scatter(x=cases_rate['Date'], 
                         y=cases_rate['Global_Growth_rate_%'],
                         mode='lines+markers',
                         name='Growth rate confirmed %',
                         line=dict(color='Yellow', width=2)))

fig.show()

In [11]:
cases_rate.tail()

| | Date | Global_Deaths | Global_Recovered | Global_Confirmed | Increase_cases_per_day | Global_Deaths_rate_% | Global_Recovered_rate_% | Global_Growth_rate_% |
|---|---|---|---|---|---|---|---|---|
| 65 | 2020-03-27 | 27198.0 | 130915.0 | 593291.0 | 67415.0 | 4.58426 | 22.0659 | 12.02815 |
| 66 | 2020-03-28 | 30652.0 | 139415.0 | 660706.0 | 59411.0 | 4.63928 | 21.100913 | 11.362889 |
| 67 | 2020-03-29 | 33925.0 | 149082.0 | 720117.0 | 62248.0 | 4.71104 | 20.702469 | 8.992048 |
| 68 | 2020-03-30 | 37582.0 | 164566.0 | 782365.0 | 75122.0 | 4.80364 | 21.034428 | 8.644151 |
| 69 | 2020-03-31 | 42107.0 | 178034.0 | 857487.0 | NaN | 4.910512 | 20.762297 | 9.601912 |
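
The growth rate on a given day works out to the day-over-day percentage change in the cumulative confirmed total, and the death rate is deaths divided by confirmed cases. As a cross-check, the sketch below recomputes both from the totals in the table above using `pct_change`; this is an equivalent formulation offered for comparison, not the code used in the cell above:

```python
import pandas as pd

# Global confirmed totals for the last five days, copied from the table above
confirmed = pd.Series(
    [593291.0, 660706.0, 720117.0, 782365.0, 857487.0],
    index=pd.to_datetime(['2020-03-27', '2020-03-28', '2020-03-29',
                          '2020-03-30', '2020-03-31']),
)
deaths_latest = 42107.0  # global deaths on 2020-03-31, from the same table

# Day-over-day growth in percent: (today - yesterday) / yesterday * 100
growth_pct = confirmed.pct_change() * 100

# Death rate in percent on the last day
death_rate_pct = deaths_latest / confirmed.iloc[-1] * 100

print(growth_pct.round(6))
print(round(death_rate_pct, 6))
```

The first value is `NaN` because the sample has no preceding day; the remaining four should agree with the last four `Global_Growth_rate_%` entries in the table.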


Confirmed cases, deaths, recoveries and active cases in each affected country:

In [12]:
# Sum the latest cumulative counts per country; columns align on the 'Country/Region' index
cases_temp = covid19confirmed.groupby('Country/Region')[last_date_update].sum().to_frame('Confirmed')
cases_temp['Recovered'] = covid19recovered.groupby('Country/Region')[last_date_update].sum()
cases_temp['Deaths'] = covid19deaths.groupby('Country/Region')[last_date_update].sum()
# Active cases are confirmed cases that have neither recovered nor died
cases_temp['Active'] = cases_temp['Confirmed'] - cases_temp['Recovered'] - cases_temp['Deaths']
cases_temp = cases_temp.sort_values(by='Confirmed', ascending=False)

cases_temp.style.background_gradient(cmap='Reds')

| Country/Region | Confirmed | Recovered | Deaths | Active |
|---|---|---|---|---|
| US | 188172 | 7024 | 3873 | 177275 |
| Italy | 105792 | 15729 | 12428 | 77635 |
| Spain | 95923 | 19259 | 8464 | 68200 |
| China | 82279 | 76206 | 3309 | 2764 |
| Germany | 71808 | 16100 | 775 | 54933 |
| France | 52827 | 9513 | 3532 | 39782 |
| Iran | 44605 | 14656 | 2898 | 27051 |
| United Kingdom | 25481 | 179 | 1793 | 23509 |
| Switzerland | 16605 | 1823 | 433 | 14349 |
| Turkey | 13531 | 243 | 214 | 13074 |

Report in progress for the next few days...