
#About this Dataset

#####This dataset contains daily information on the number of confirmed cases, deaths, and recoveries from the 2019 novel coronavirus. Note that this is time-series data, so the number of cases on any given day is a cumulative total.
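Because the counts are cumulative, per-day increments have to be derived rather than read directly. The sketch below shows one way to do that with `diff()`; the small DataFrame and the `NewConfirmed` column name are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical cumulative counts for a single region (illustrative values only).
cumulative = pd.DataFrame({
    'ObservationDate': pd.to_datetime(['2020-01-22', '2020-01-23', '2020-01-24']),
    'Confirmed': [1, 3, 6],
})

# diff() turns cumulative totals into per-day increments. The first day has no
# predecessor, so its cumulative value is kept as that day's count.
cumulative['NewConfirmed'] = (
    cumulative['Confirmed'].diff().fillna(cumulative['Confirmed']).astype(int)
)

print(cumulative)
```

On the real dataset the same step would need to be applied per region (for example after a `groupby`), since each region carries its own cumulative series.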

---

#####Source (Dataset): https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset

---

#####Analyst: Nilton Thiago de Andrade Coura


In [0]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading dataset

dataset = pd.read_csv('https://raw.githubusercontent.com/niltontac/EspAnalise-EngDados/master/data/Novel_Corona_Virus_2019_Dataset/covid_19_data.csv', parse_dates=['ObservationDate', 'Last Update'])

In [2]:
# Dataset Dimension (rows x columns)

dataset.shape

(4935, 8)

In [3]:
# Dataset Information

dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 8 columns):
SNo                4935 non-null int64
ObservationDate    4935 non-null datetime64[ns]
Province/State     3120 non-null object
Country/Region     4935 non-null object
Last Update        4935 non-null datetime64[ns]
Confirmed          4935 non-null int64
Deaths             4935 non-null int64
Recovered          4935 non-null int64
dtypes: datetime64[ns](2), int64(4), object(2)
memory usage: 308.6+ KB


In [4]:
# Dataset describe

dataset.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SNo | 4935.0 | 2468.0 | 1424.756119 | 1.0 | 1234.5 | 2468.0 | 3701.5 | 4935.0 |
| Confirmed | 4935.0 | 577.61459 | 4971.492694 | 0.0 | 1.0 | 9.0 | 93.0 | 67773.0 |
| Deaths | 4935.0 | 17.694833 | 192.348513 | 0.0 | 0.0 | 0.0 | 1.0 | 3046.0 |
| Recovered | 4935.0 | 201.01155 | 2179.79852 | 0.0 | 0.0 | 1.0 | 14.0 | 49134.0 |


In [5]:
# Dataset (columns data types) 

dataset.dtypes

SNo                         int64
ObservationDate    datetime64[ns]
Province/State             object
Country/Region             object
Last Update        datetime64[ns]
Confirmed                   int64
Deaths                      int64
Recovered                   int64
dtype: object

In [6]:
# Checking the last 5 cases to see when the dataset was updated

dataset.tail().style.background_gradient(cmap='PRGn')

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 4930 | 4931 | 2020-03-11 00:00:00 | Mississippi | US | 2020-03-10 02:33:04 | 0 | 0 | 0 |
| 4931 | 4932 | 2020-03-11 00:00:00 | North Dakota | US | 2020-03-10 02:33:04 | 0 | 0 | 0 |
| 4932 | 4933 | 2020-03-11 00:00:00 | West Virginia | US | 2020-03-10 02:33:04 | 0 | 0 | 0 |
| 4933 | 4934 | 2020-03-11 00:00:00 | Wyoming | US | 2020-03-10 02:33:04 | 0 | 0 | 0 |
| 4934 | 4935 | 2020-03-11 00:00:00 | | occupied Palestinian territory | 2020-03-11 20:53:02 | 0 | 0 | 0 |


In [7]:
# The countries list

", ".join(dataset['Country/Region'].unique().tolist())

"Mainland China, Hong Kong, Macau, Taiwan, US, Japan, Thailand, South Korea, Singapore, Philippines, Malaysia, Vietnam, Australia, Mexico, Brazil, Colombia, France, Nepal, Canada, Cambodia, Sri Lanka, Ivory Coast, Germany, Finland, United Arab Emirates, India, Italy, UK, Russia, Sweden, Spain, Belgium, Others, Egypt, Iran, Israel, Lebanon, Iraq, Oman, Afghanistan, Bahrain, Kuwait, Austria, Algeria, Croatia, Switzerland, Pakistan, Georgia, Greece, North Macedonia, Norway, Romania, Denmark, Estonia, Netherlands, San Marino,  Azerbaijan, Belarus, Iceland, Lithuania, New Zealand, Nigeria, North Ireland, Ireland, Luxembourg, Monaco, Qatar, Ecuador, Azerbaijan, Czech Republic, Armenia, Dominican Republic, Indonesia, Portugal, Andorra, Latvia, Morocco, Saudi Arabia, Senegal, Argentina, Chile, Jordan, Ukraine, Saint Barthelemy, Hungary, Faroe Islands, Gibraltar, Liechtenstein, Poland, Tunisia, Palestine, Bosnia and Herzegovina, Slovenia, South Africa, Bhutan, Cameroon, Costa Rica, Peru, Serbia

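The list above contains "Azerbaijan" twice, once with a leading space, so the two spellings would be treated as different countries in any grouping. A minimal sketch of one possible cleanup, using a hypothetical sample Series rather than the real column:

```python
import pandas as pd

# Hypothetical sample reproducing the stray leading space seen in the list above.
countries = pd.Series(['San Marino', ' Azerbaijan', 'Azerbaijan', 'Belarus'])

# str.strip() removes leading and trailing whitespace, so both spellings
# collapse into a single value.
cleaned = countries.str.strip()

unique_before = countries.nunique()
unique_after = cleaned.nunique()

print(unique_before, unique_after)
```

Applied to the notebook's data, the same call would be `dataset['Country/Region'].str.strip()`, assuming the column contains only strings.
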
In [0]:
# Rename some countries so that they adhere to ISO (alpha-3) standards,
# which the plotly map visualization needs in order to work

dataset['Country/Region'] = dataset['Country/Region'].replace('US', 'USA')
dataset['Country/Region'] = dataset['Country/Region'].replace('South Korea', 'KOR')
dataset['Country/Region'] = dataset['Country/Region'].replace('Russia', 'RUS')
dataset['Country/Region'] = dataset['Country/Region'].replace('UK', 'GBR')


# Rename the columns

dataset = dataset.rename(columns={'Country/Region': 'Country', 'ObservationDate': 'Date'})
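The four `replace` calls above can also be written as a single call with a mapping, which keeps all substitutions in one place. This is only a sketch on a hypothetical sample; `country_map` and `sample` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample of the 'Country/Region' column.
sample = pd.DataFrame(
    {'Country/Region': ['US', 'South Korea', 'Russia', 'UK', 'Brazil']}
)

# The same four substitutions as above, expressed as one mapping.
country_map = {'US': 'USA', 'South Korea': 'KOR', 'Russia': 'RUS', 'UK': 'GBR'}

# Values not present in the mapping (e.g. 'Brazil') are left unchanged.
sample['Country/Region'] = sample['Country/Region'].replace(country_map)

print(sample)
```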

In [9]:
# Removing unnecessary columns

dataset = dataset.drop(['SNo', 'Last Update'], axis=1)

# Checking

dataset.tail().style.background_gradient(cmap='PRGn')

| | Date | Province/State | Country | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 4930 | 2020-03-11 00:00:00 | Mississippi | USA | 0 | 0 | 0 |
| 4931 | 2020-03-11 00:00:00 | North Dakota | USA | 0 | 0 | 0 |
| 4932 | 2020-03-11 00:00:00 | West Virginia | USA | 0 | 0 | 0 |
| 4933 | 2020-03-11 00:00:00 | Wyoming | USA | 0 | 0 | 0 |
| 4934 | 2020-03-11 00:00:00 | | occupied Palestinian territory | 0 | 0 | 0 |


In [10]:
# Checking for null or missing values

pd.DataFrame(dataset.isnull().sum()).T

| | Date | Province/State | Country | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | 0 | 1815 | 0 | 0 | 0 | 0 |


The dataset has 1815 missing values in the "Province/State" column.
Let's replace them with 'unknow'.

In [0]:
# Replacing missing data

dataset = dataset.fillna('unknow')
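`dataset.fillna('unknow')` fills every column, which works here because only "Province/State" has missing values. If the numeric columns ever contained gaps, they would be filled with a string as well. A hedged alternative is to restrict the fill to the one column, sketched below on a hypothetical sample:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing province (illustrative values only).
sample = pd.DataFrame({
    'Province/State': ['Hubei', np.nan, 'Wyoming'],
    'Country': ['Mainland China', 'Brazil', 'USA'],
    'Confirmed': [10, 1, 0],
})

missing_before = int(sample['Province/State'].isnull().sum())

# Fill only the column that is expected to have gaps, using the same
# placeholder value as the notebook.
sample['Province/State'] = sample['Province/State'].fillna('unknow')

missing_after = int(sample.isnull().sum().sum())

print(missing_before, missing_after)
```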

In [12]:
# Checking for null or missing values again

pd.DataFrame(dataset.isnull().sum()).T

| | Date | Province/State | Country | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [13]:
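# Table with the per-date totals of the Confirmed, Deaths and Recovered columns
# (select the columns with a list: tuple-style selection on a groupby fails in pandas >= 1.0)

cases_temp = dataset.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']].sum()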
cases_temp = cases_temp.reset_index()
cases_temp = cases_temp.sort_values('Date', ascending=False)
cases_temp.head().style.background_gradient(cmap='PRGn')

| | Date | Confirmed | Deaths | Recovered |
|---|---|---|---|---|
| 49 | 2020-03-11 00:00:00 | 125865 | 4615 | 67003 |
| 48 | 2020-03-10 00:00:00 | 118582 | 4262 | 64404 |
| 47 | 2020-03-09 00:00:00 | 113582 | 3996 | 62512 |
| 46 | 2020-03-08 00:00:00 | 109835 | 3803 | 60695 |
| 45 | 2020-03-07 00:00:00 | 105836 | 3558 | 58359 |
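Since each row already holds a cumulative count for one region, summing across regions for a given date yields the overall total for that date, and the first row after the descending sort is the most recent total. A minimal sketch of the same aggregation on hypothetical data (the `records` frame and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical cumulative records for two regions over two days.
records = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-10', '2020-03-10', '2020-03-11', '2020-03-11']),
    'Country': ['A', 'B', 'A', 'B'],
    'Confirmed': [10, 5, 12, 7],
    'Deaths': [1, 0, 1, 1],
    'Recovered': [2, 1, 3, 2],
})

# Sum the three count columns per date, then put the newest date first.
totals = (
    records.groupby('Date')[['Confirmed', 'Deaths', 'Recovered']]
    .sum()
    .reset_index()
    .sort_values('Date', ascending=False)
)

# The first row now holds the totals for the latest date.
latest = totals.iloc[0]

print(totals)
```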


report in progress for the next few days...