<img src="https://raw.githubusercontent.com/andre-marcos-perez/ebac-course-utils/main/media/logo/newebac_logo_black_half.png" alt="ebac-logo">

---

# **Module** | Data Analysis: COVID-19 Dashboard
**Exercise** Notebook<br> 
Instructor [André Perez](https://www.linkedin.com/in/andremarcosperez/)

---

# **Topics**

<ol type="1">
  <li>Introduction;</li>
  <li>Exploratory Data Analysis;</li>
  <li>Interactive Data Visualization;</li>
  <li>Storytelling.</li>
</ol>


---

# **COVID Dashboard**

## 1\. Context

In this project we extract the data on COVID-19 cases around the world that Johns Hopkins University compiled and published daily on [GitHub](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports). The reports have been updated since January 2020; we use only the data for Brazil.

This dataframe is extracted, cleaned and, at the end, organized into the following columns before being loaded:

 - **date**: Reference date;
 - **state**: State;
 - **country**: Country; 
 - **population**: Estimated population;
 - **confirmed**: Number of confirmed cases (cumulative);
 - **confirmed_1d**: Daily number of confirmed cases;
 - **confirmed_moving_avg_7d**: 7-day moving average of the daily number of confirmed cases;
 - **confirmed_moving_avg_7d_rate_14d**: 7-day moving average divided by the 7-day moving average from 14 days earlier;
 - **confirmed_trend**: Trend derived from the ratio above (`downward`, `stable` or `upward`);
 - **deaths**: Number of deaths (cumulative);
 - **deaths_1d**: Daily number of deaths;
 - **deaths_moving_avg_7d**: 7-day moving average of the daily number of deaths;
 - **deaths_moving_avg_7d_rate_14d**: 7-day moving average divided by the 7-day moving average from 14 days earlier;
 - **deaths_trend**: Trend derived from the ratio above (`downward`, `stable` or `upward`);
 - **month**: Reference month;
 - **year**: Reference year.
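
The derived columns in the list above can be illustrated on a single synthetic series. This is only a sketch: the counts are made up, and the code simply mirrors the definitions (daily difference, 7-day moving average rounded up, and the ratio to the value from 14 days earlier).

```python
import numpy as np
import pandas as pd

# Made-up daily increments: 14 days with 10 new cases, then 7 days with 20.
daily_new = [10] * 14 + [20] * 7
toy = pd.DataFrame({'confirmed': pd.Series([100] + daily_new).cumsum()})

# Daily count is the difference between consecutive cumulative values.
toy['confirmed_1d'] = toy['confirmed'].diff(periods=1)

# 7-day moving average of the daily count, rounded up.
toy['confirmed_moving_avg_7d'] = np.ceil(toy['confirmed_1d'].rolling(window=7).mean())

# Ratio between the current moving average and the one from 14 days earlier.
toy['confirmed_moving_avg_7d_rate_14d'] = (
    toy['confirmed_moving_avg_7d'] / toy['confirmed_moving_avg_7d'].shift(periods=14)
)

print(toy.tail(1))
```

In this toy series the ratio only becomes available on the last row (index 21), because the difference, the 7-day window and the 14-day shift each consume leading rows.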

The second dataframe holds the vaccination data, taken from the Our World in Data project of the University of Oxford; the data have been updated daily since January 2020. The project can be found at this [link](https://ourworldindata.org/) and the vaccination data at this [link](https://covid.ourworldindata.org/data/owid-covid-data.csv). 

This dataframe is extracted, cleaned and, at the end, organized into the following columns before being loaded:

 - **date**: Reference date;
 - **country**: Country;
 - **population**: Estimated population;
 - **total**: Cumulative number of doses administered;
 - **one_shot**: Cumulative number of people with one dose;
 - **one_shot_perc**: Cumulative share of the population with one dose;
 - **two_shots**: Cumulative number of people with two doses;
 - **two_shots_perc**: Cumulative share of the population with two doses;
 - **three_shots**: Cumulative number of people with three doses;
 - **three_shots_perc**: Cumulative share of the population with three doses;
 - **month**: Reference month;
 - **year**: Reference year.
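
As a small worked example of the relative columns: each share is the cumulative count divided by the estimated population, rounded to four decimal places. The figures below are the Brazil values for 2022-12-31 shown in the final vaccination output of this notebook; the variable names are only illustrative.

```python
# Values copied from the 2022-12-31 row of the vaccination output.
population = 215_313_504
one_shot = 188_553_932
two_shots = 174_887_915

# Share of the population covered by each dose, rounded to 4 decimals.
one_shot_perc = round(one_shot / population, 4)
two_shots_perc = round(two_shots / population, 4)

print(one_shot_perc, two_shots_perc)  # 0.8757 0.8122
```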

## 2\. Packages and libraries

In [1]:
import math
from typing import Iterator
from datetime import datetime, timedelta

import pandas as pd
import numpy as np

## 3\. Extração

#### **3\.1 Extracting the COVID case data:**

In [2]:
# generator over a date interval, used to fetch the daily reports from GitHub
# (the interval is half-open: end_date itself is not yielded)
def date_range(star_date: datetime, end_date: datetime) -> Iterator[datetime]:
    date_range_days: int = (end_date - star_date).days
    for lag in range(date_range_days):
        yield star_date + timedelta(lag)

# defining our date interval
start_date = datetime(2021, 1, 1)
end_date = datetime(2022, 12, 31)
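
To make the behaviour of the date generator concrete, here is a minimal sketch on a three-day interval. It restates the same logic with slightly different parameter names (`start`, `end`), which are assumptions of this example, and shows how each date becomes a daily report file name in the `MM-DD-YYYY` pattern.

```python
from datetime import datetime, timedelta
from typing import Iterator


def date_range(start: datetime, end: datetime) -> Iterator[datetime]:
    """Yield each day from start (inclusive) up to end (exclusive)."""
    for lag in range((end - start).days):
        yield start + timedelta(days=lag)


days = list(date_range(datetime(2021, 1, 1), datetime(2021, 1, 4)))

# File names follow the month-day-year pattern used by the daily reports.
file_names = [day.strftime('%m-%d-%Y') + '.csv' for day in days]
print(file_names)  # ['01-01-2021.csv', '01-02-2021.csv', '01-03-2021.csv']

# Number of days produced for the interval used in this notebook.
n_days = (datetime(2022, 12, 31) - datetime(2021, 1, 1)).days
print(n_days)  # 729
```

Because the end date is excluded, the interval from 2021-01-01 to 2022-12-31 produces 729 daily files, the last one being 2022-12-30.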


In [3]:
# read each daily CSV straight from GitHub and gather everything into a single DataFrame
daily_frames = []

for date in date_range(star_date=start_date, end_date=end_date):

    date_str = date.strftime('%m-%d-%Y')
    date_source_url = f'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{date_str}.csv'
    
    case = pd.read_csv(date_source_url, sep=',')

    # keep only the columns we need and only the rows for Brazil
    case = case.drop(['FIPS', 'Admin2', 'Last_Update', 'Lat', 'Long_', 'Recovered', 'Active', 'Combined_Key', 'Case_Fatality_Ratio'], axis=1)
    case = case.query("Country_Region == 'Brazil'").reset_index(drop=True)
    case['Date'] = pd.to_datetime(date.strftime('%Y-%m-%d'))

    daily_frames.append(case)

# DataFrame.append was removed in pandas 2.0; concatenating once is also faster
cases = pd.concat(daily_frames, ignore_index=True)

In [4]:
# checking our dataframe
df_cases = cases
df_cases.head()

Unnamed: 0,Province_State,Country_Region,Confirmed,Deaths,Incident_Rate,Date
0,Acre,Brazil,41689,796,4726.992352,2021-01-01
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01
2,Amapa,Brazil,68361,926,8083.066602,2021-01-01
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01


In [5]:
df_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19683 entries, 0 to 19682
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Province_State  19683 non-null  object        
 1   Country_Region  19683 non-null  object        
 2   Confirmed       19683 non-null  int64         
 3   Deaths          19683 non-null  int64         
 4   Incident_Rate   19683 non-null  float64       
 5   Date            19683 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 922.8+ KB


In [6]:
# Now we reshape the data to make it easier to read
# Renaming the columns
df_cases = df_cases.rename(columns={
    'Province_State': 'state',
    'Country_Region': 'country', 
})

# lower-casing every column name in a single call
df_cases = df_cases.rename(columns=str.lower)

df_cases.head()

Unnamed: 0,state,country,confirmed,deaths,incident_rate,date
0,Acre,Brazil,41689,796,4726.992352,2021-01-01
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01
2,Amapa,Brazil,68361,926,8083.066602,2021-01-01
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01


In [7]:
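# Renaming the states to their accented Portuguese names
# note: 'Maranhao' has no entry below; if the source files use that spelling,
# the state will not match 'Maranhão' in the population table built later
states_map = {
    'Amapa': 'Amapá',
    'Ceara': 'Ceará',
    'Espirito Santo': 'Espírito Santo',
    'Goias': 'Goiás',
    'Para': 'Pará',
    'Paraiba': 'Paraíba',
    'Parana': 'Paraná',
    'Piaui': 'Piauí',
    'Rondonia': 'Rondônia',
    'Sao Paulo': 'São Paulo'
}

# keep the original name whenever it is not in the map
df_cases['state'] = df_cases['state'].apply(lambda state: states_map.get(state, state))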

df_cases.head()

Unnamed: 0,state,country,confirmed,deaths,incident_rate,date
0,Acre,Brazil,41689,796,4726.992352,2021-01-01
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01
2,Amapá,Brazil,68361,926,8083.066602,2021-01-01
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01


In [8]:
# Adding new month and year columns
df_cases['month'] = df_cases['date'].apply(lambda date: date.strftime('%Y-%m'))
df_cases['year'] = df_cases['date'].apply(lambda date: date.strftime('%Y'))
df_cases.head()

Unnamed: 0,state,country,confirmed,deaths,incident_rate,date,month,year
0,Acre,Brazil,41689,796,4726.992352,2021-01-01,2021-01,2021
1,Alagoas,Brazil,105091,2496,3148.928928,2021-01-01,2021-01,2021
2,Amapá,Brazil,68361,926,8083.066602,2021-01-01,2021-01,2021
3,Amazonas,Brazil,201574,5295,4863.536793,2021-01-01,2021-01,2021
4,Bahia,Brazil,494684,9159,3326.039611,2021-01-01,2021-01,2021


In [9]:
# collecting each state's population from the IBGE website
import requests
from lxml import html

# list of state abbreviations; the loop below builds one URL per abbreviation
# note: 'pr' (Paraná) and 'pe' (Pernambuco) are not in this list, so no population is collected for them
estados_br = ['ac', 'al', 'ap', 'am', 'ba', 'ce', 'df', 'es', 'go', 'ma', 'mt', 'ms', 'mg', 'pa', 'pb','pi', 'rj', 'rn', 'rs', 'ro', 'rr', 'sc', 'sp', 'se', 'to']
pop_estados = []

for sigla in estados_br:
    
    # fetching the state's page from IBGE
    url = f'https://www.ibge.gov.br/cidades-e-estados/{sigla}.html'
    response = requests.get(url)
    html_content = response.content

    # locating the population figure in the page
    tree = html.fromstring(html_content)
    populacao = tree.xpath('//*[@id="responseMunicipios"]/div[2]/div[2]/ul/li[2]/div/p/text()')[0]

    # the figure uses '.' as thousands separator, e.g. '906.876'
    pop = int(populacao.replace('.',''))
    pop_estados.append({
        'estados': sigla,
        'populacao': pop
    })
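
The scraping cell converts a string such as `906.876` into an integer by removing the dots. A minimal sketch of that conversion as a standalone helper follows; `parse_br_integer` is a hypothetical name introduced here, and the explicit error for unexpected text is an assumption about how one might want a page-layout change to surface.

```python
def parse_br_integer(text: str) -> int:
    """Convert a string with '.' as thousands separator (e.g. '14.985.284') to int."""
    cleaned = text.strip().replace('.', '')
    if not cleaned.isdigit():
        # Fail loudly if the scraped text is not a plain number.
        raise ValueError(f'not a plain integer: {text!r}')
    return int(cleaned)


print(parse_br_integer('14.985.284'))  # 14985284
print(parse_br_integer(' 906.876 '))   # 906876
```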


In [10]:
df_pop = pd.DataFrame(pop_estados)
df_pop.head()

Unnamed: 0,estados,populacao
0,ac,906876
1,al,3365351
2,ap,877613
3,am,4269995
4,ba,14985284


In [11]:
# mapping each abbreviation to its state name
brazil_states = {
    'ac': 'Acre',
    'al': 'Alagoas',
    'ap': 'Amapá',
    'am': 'Amazonas',
    'ba': 'Bahia',
    'ce': 'Ceará',
    'df': 'Distrito Federal',
    'es': 'Espírito Santo',
    'go': 'Goiás',
    'ma': 'Maranhão',
    'mt': 'Mato Grosso',
    'ms': 'Mato Grosso do Sul',
    'mg': 'Minas Gerais',
    'pa': 'Pará',
    'pb': 'Paraíba',
    'pr': 'Paraná',
    'pe': 'Pernambuco',
    'pi': 'Piauí',
    'rj': 'Rio de Janeiro',
    'rn': 'Rio Grande do Norte',
    'rs': 'Rio Grande do Sul',
    'ro': 'Rondônia',
    'rr': 'Roraima',
    'sc': 'Santa Catarina',
    'sp': 'São Paulo',
    'se': 'Sergipe',
    'to': 'Tocantins'
}
df_pop['state'] = df_pop['estados'].apply(lambda state: brazil_states[state])
df_pop = df_pop.drop('estados', axis=1)
df_pop.head()

Unnamed: 0,populacao,state
0,906876,Acre
1,3365351,Alagoas
2,877613,Amapá
3,4269995,Amazonas
4,14985284,Bahia


In [12]:
# Merging the updated population into the main dataset
df_final = pd.merge(left=df_cases, right=df_pop, how='left', on='state')
df_final = df_final.rename(columns={'populacao':'population'})
df_final = df_final.drop('incident_rate', axis=1)
df_final['population'] = df_final['population'].astype('Int64') 
df_final.head()

Unnamed: 0,state,country,confirmed,deaths,date,month,year,population
0,Acre,Brazil,41689,796,2021-01-01,2021-01,2021,906876
1,Alagoas,Brazil,105091,2496,2021-01-01,2021-01,2021,3365351
2,Amapá,Brazil,68361,926,2021-01-01,2021-01,2021,877613
3,Amazonas,Brazil,201574,5295,2021-01-01,2021-01,2021,4269995
4,Bahia,Brazil,494684,9159,2021-01-01,2021-01,2021,14985284


In [13]:
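# checking whether anything was lost or looks wrong
# note: population has 17496 non-null values out of 19683 rows, so 2187 rows
# (3 states x 729 days) found no match in df_pop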
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19683 entries, 0 to 19682
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   state       19683 non-null  object        
 1   country     19683 non-null  object        
 2   confirmed   19683 non-null  int64         
 3   deaths      19683 non-null  int64         
 4   date        19683 non-null  datetime64[ns]
 5   month       19683 non-null  object        
 6   year        19683 non-null  object        
 7   population  17496 non-null  Int64         
dtypes: Int64(1), datetime64[ns](1), int64(2), object(4)
memory usage: 1.4+ MB
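
When a left merge leaves nulls in the joined column, one way to find the keys that failed to match is the `indicator` parameter of `merge`. The sketch below uses made-up frames (`cases_demo`, `pop_demo`) and placeholder population values; it is not the notebook's data, only an illustration of the check.

```python
import pandas as pd

# Made-up frames: two state names have no counterpart in the population table.
cases_demo = pd.DataFrame({
    'state': ['Acre', 'Maranhao', 'Paraná'],
    'confirmed': [1, 2, 3],
})
pop_demo = pd.DataFrame({
    'state': ['Acre', 'Maranhão'],
    'population': [1000, 2000],  # placeholder values
})

# indicator=True adds a '_merge' column telling where each row came from.
merged = cases_demo.merge(pop_demo, how='left', on='state', indicator=True)

# Rows flagged 'left_only' had no match on the right-hand side.
unmatched = sorted(merged.loc[merged['_merge'] == 'left_only', 'state'].unique())
print(unmatched)  # ['Maranhao', 'Paraná']
```

A spelling difference (missing accent) and a key that is absent from the right-hand table both show up the same way, as `left_only` rows.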


#### **3\.2 Extracting the vaccination data:**

In [14]:
df_vaccines = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv', sep=',', parse_dates=['date'])
df_vaccines.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,population,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-01-03,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
1,AFG,Asia,Afghanistan,2020-01-04,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
2,AFG,Asia,Afghanistan,2020-01-05,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
3,AFG,Asia,Afghanistan,2020-01-06,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,
4,AFG,Asia,Afghanistan,2020-01-07,,0.0,,,0.0,,...,,37.746,0.5,64.83,0.511,41128772.0,,,,


In [15]:
df_vaccines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293805 entries, 0 to 293804
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype         
---  ------                                      --------------   -----         
 0   iso_code                                    293805 non-null  object        
 1   continent                                   279806 non-null  object        
 2   location                                    293805 non-null  object        
 3   date                                        293805 non-null  datetime64[ns]
 4   total_cases                                 257985 non-null  float64       
 5   new_cases                                   285216 non-null  float64       
 6   new_cases_smoothed                          283952 non-null  float64       
 7   total_deaths                                237667 non-null  float64       
 8   new_deaths                                  285288 non-null  float64      

In [16]:
# selecting only the rows for Brazil and only the columns we need
df_vaccines = df_vaccines.query('location == "Brazil"').reset_index(drop=True)
df_vaccines = df_vaccines[['location', 'population', 'total_vaccinations', 'people_vaccinated', 'people_fully_vaccinated', 'total_boosters', 'date']]

In [17]:
df_vaccines.head()

Unnamed: 0,location,population,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,date
0,Brazil,215313504.0,,,,,2020-01-03
1,Brazil,215313504.0,,,,,2020-01-04
2,Brazil,215313504.0,,,,,2020-01-05
3,Brazil,215313504.0,,,,,2020-01-06
4,Brazil,215313504.0,,,,,2020-01-07


In [18]:
# filling missing values with the most recent valid value before them
# (rows before the first valid value stay empty)
df_vaccines = df_vaccines.ffill()
df_vaccines.head()

Unnamed: 0,location,population,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,date
0,Brazil,215313504.0,,,,,2020-01-03
1,Brazil,215313504.0,,,,,2020-01-04
2,Brazil,215313504.0,,,,,2020-01-05
3,Brazil,215313504.0,,,,,2020-01-06
4,Brazil,215313504.0,,,,,2020-01-07
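
The first rows are still empty after the forward fill because a forward fill can only copy a value that has already appeared. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up cumulative series with gaps at the start and in the middle.
total = pd.Series([np.nan, np.nan, 5.0, np.nan, 8.0, np.nan])

# Forward fill propagates the last valid value; leading gaps have nothing to copy.
filled = total.ffill()
print(filled.tolist())  # [nan, nan, 5.0, 5.0, 8.0, 8.0]
```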


In [19]:
# filtering the vaccination data to the same period as the case data
df_vaccines = df_vaccines[(df_vaccines['date'] >= '2021-01-01') & (df_vaccines['date'] <= '2022-12-31')].reset_index(drop=True)
df_vaccines.head()

Unnamed: 0,location,population,total_vaccinations,people_vaccinated,people_fully_vaccinated,total_boosters,date
0,Brazil,215313504.0,,,,,2021-01-01
1,Brazil,215313504.0,,,,,2021-01-02
2,Brazil,215313504.0,,,,,2021-01-03
3,Brazil,215313504.0,,,,,2021-01-04
4,Brazil,215313504.0,,,,,2021-01-05
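
The filter above uses `>=` and `<=`, so both boundary dates are kept. A small sketch on a made-up calendar frame shows how many rows such an inclusive filter keeps for this period:

```python
import pandas as pd

# Made-up frame with one row per day, slightly wider than the target period.
calendar = pd.DataFrame({'date': pd.date_range('2020-12-30', '2023-01-02', freq='D')})

# Inclusive on both ends, as in the cell above.
mask = (calendar['date'] >= '2021-01-01') & (calendar['date'] <= '2022-12-31')
window = calendar[mask].reset_index(drop=True)

print(len(window))  # 730
print(window['date'].iloc[-1].date())  # 2022-12-31
```

The vaccination frame therefore covers 730 days, one more than the 729 daily reports per state in the case data (19683 rows for 27 states), since the case extraction stops at 2022-12-30.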


## 4\. Transformation

#### **4\.1 Enriching the collected COVID case data:**

In [20]:
# computing the 7-day moving average and the 14-day stability ratio of cases and deaths for each state
# helper that classifies the moving average as stable, rising or falling


def get_trend(rate: float) -> str:

    if np.isnan(rate):
        return np.nan
    
    # a ratio below 0.75 means the average fell; above 1.15 means it rose
    if rate < 0.75:
        status = 'downward'
    elif rate > 1.15:
        status = 'upward'
    else:
        status = 'stable'

    return status


# building a new dataset with the averages

state_frames = []

for state in df_final['state'].drop_duplicates():
    cases_state = df_final.query(f'state == "{state}"').reset_index(drop=True)
    cases_state = cases_state.sort_values(by=['date'])

    cases_state['confirmed_1d'] = cases_state['confirmed'].diff(periods=1)
    cases_state['confirmed_moving_avg_7d'] = np.ceil(cases_state['confirmed_1d'].rolling(window=7).mean())
    cases_state['confirmed_moving_avg_7d_rate_14d'] = cases_state['confirmed_moving_avg_7d'] / cases_state['confirmed_moving_avg_7d'].shift(periods=14)
    cases_state['confirmed_trend'] = cases_state['confirmed_moving_avg_7d_rate_14d'].apply(get_trend)

    cases_state['deaths_1d'] = cases_state['deaths'].diff(periods=1)
    cases_state['deaths_moving_avg_7d'] = np.ceil(cases_state['deaths_1d'].rolling(window=7).mean())
    cases_state['deaths_moving_avg_7d_rate_14d'] = cases_state['deaths_moving_avg_7d'] / cases_state['deaths_moving_avg_7d'].shift(periods=14)
    cases_state['deaths_trend'] = cases_state['deaths_moving_avg_7d_rate_14d'].apply(get_trend)

    state_frames.append(cases_state)

# DataFrame.append was removed in pandas 2.0, so the per-state frames are concatenated once
cases = pd.concat(state_frames, ignore_index=True)
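
The per-state loop can also be expressed with `groupby`, which restarts each calculation inside every state without building intermediate frames. This is an alternative sketch on made-up data, not the notebook's method; `make_state` and the toy values are assumptions of the example.

```python
import numpy as np
import pandas as pd


def make_state(name: str, daily_new: list) -> pd.DataFrame:
    """Build a made-up cumulative series for one state (illustrative data only)."""
    confirmed = pd.Series([100] + daily_new).cumsum()
    return pd.DataFrame({
        'state': name,
        'day': range(len(confirmed)),
        'confirmed': confirmed,
    })


demo = pd.concat(
    [make_state('A', [10] * 10), make_state('B', [30] * 10)],
    ignore_index=True,
).sort_values(['state', 'day'])

# diff() restarts inside every group, so the first row of each state is NaN.
demo['confirmed_1d'] = demo.groupby('state')['confirmed'].diff(periods=1)

# Rolling mean per state; transform keeps the original row alignment.
demo['confirmed_moving_avg_7d'] = np.ceil(
    demo.groupby('state')['confirmed_1d'].transform(lambda s: s.rolling(window=7).mean())
)

# Last available moving average for each state.
last_avg = demo.groupby('state')['confirmed_moving_avg_7d'].last()
print(last_avg)
```

Grouping first ensures that the daily difference for the first row of state B is not computed against the last row of state A.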




In [21]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19683 entries, 0 to 19682
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   state                             19683 non-null  object        
 1   country                           19683 non-null  object        
 2   confirmed                         19683 non-null  int64         
 3   deaths                            19683 non-null  int64         
 4   date                              19683 non-null  datetime64[ns]
 5   month                             19683 non-null  object        
 6   year                              19683 non-null  object        
 7   population                        17496 non-null  Int64         
 8   confirmed_1d                      19656 non-null  float64       
 9   confirmed_moving_avg_7d           19494 non-null  float64       
 10  confirmed_moving_avg_7d_rate_14d  19116 non-nu

In [22]:
# casting the columns to the proper dtypes
cases = cases.astype({
    'population': 'Int64',
    'confirmed_1d': 'Int64',
    'confirmed_moving_avg_7d': 'Int64',
    'deaths_1d': 'Int64',
    'deaths_moving_avg_7d': 'Int64',
})
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19683 entries, 0 to 19682
Data columns (total 16 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   state                             19683 non-null  object        
 1   country                           19683 non-null  object        
 2   confirmed                         19683 non-null  int64         
 3   deaths                            19683 non-null  int64         
 4   date                              19683 non-null  datetime64[ns]
 5   month                             19683 non-null  object        
 6   year                              19683 non-null  object        
 7   population                        17496 non-null  Int64         
 8   confirmed_1d                      19656 non-null  Int64         
 9   confirmed_moving_avg_7d           19494 non-null  Int64         
 10  confirmed_moving_avg_7d_rate_14d  19116 non-nu

In [23]:
# ordering the columns of our DataFrame
cases = cases[['date', 'country', 'state', 'population', 'confirmed', 'confirmed_1d', 'confirmed_moving_avg_7d', 'confirmed_moving_avg_7d_rate_14d', 'confirmed_trend', 'deaths', 'deaths_1d', 'deaths_moving_avg_7d', 'deaths_moving_avg_7d_rate_14d', 'deaths_trend', 'month', 'year']]
cases.head()

Unnamed: 0,date,country,state,population,confirmed,confirmed_1d,confirmed_moving_avg_7d,confirmed_moving_avg_7d_rate_14d,confirmed_trend,deaths,deaths_1d,deaths_moving_avg_7d,deaths_moving_avg_7d_rate_14d,deaths_trend,month,year
0,2021-01-01,Brazil,Acre,906876,41689,,,,,796,,,,,2021-01,2021
1,2021-01-02,Brazil,Acre,906876,41941,252.0,,,,798,2.0,,,,2021-01,2021
2,2021-01-03,Brazil,Acre,906876,42046,105.0,,,,802,4.0,,,,2021-01,2021
3,2021-01-04,Brazil,Acre,906876,42117,71.0,,,,806,4.0,,,,2021-01,2021
4,2021-01-05,Brazil,Acre,906876,42170,53.0,,,,808,2.0,,,,2021-01,2021


#### **4\.2 Enriching the collected vaccination data:**

In [24]:
# renaming the columns
df_vaccines = df_vaccines.rename(
    columns={
    'location': 'country',
    'total_vaccinations': 'total',
    'people_vaccinated': 'one_shot',
    'people_fully_vaccinated': 'two_shots',
    'total_boosters': 'three_shots',
    }
)
df_vaccines.head()

Unnamed: 0,country,population,total,one_shot,two_shots,three_shots,date
0,Brazil,215313504.0,,,,,2021-01-01
1,Brazil,215313504.0,,,,,2021-01-02
2,Brazil,215313504.0,,,,,2021-01-03
3,Brazil,215313504.0,,,,,2021-01-04
4,Brazil,215313504.0,,,,,2021-01-05


In [25]:
# adding two time keys (month and year)
df_vaccines['month'] = df_vaccines['date'].apply(lambda date: date.strftime('%Y-%m'))
df_vaccines['year'] = df_vaccines['date'].apply(lambda date: date.strftime('%Y'))
df_vaccines.head()

Unnamed: 0,country,population,total,one_shot,two_shots,three_shots,date,month,year
0,Brazil,215313504.0,,,,,2021-01-01,2021-01,2021
1,Brazil,215313504.0,,,,,2021-01-02,2021-01,2021
2,Brazil,215313504.0,,,,,2021-01-03,2021-01,2021
3,Brazil,215313504.0,,,,,2021-01-04,2021-01,2021
4,Brazil,215313504.0,,,,,2021-01-05,2021-01,2021


In [26]:
# adding the share of the population covered by each dose
df_vaccines['one_shot_perc'] = round(df_vaccines['one_shot'] / df_vaccines['population'], 4)
df_vaccines['two_shots_perc'] = round(df_vaccines['two_shots'] / df_vaccines['population'], 4)
df_vaccines['three_shots_perc'] = round(df_vaccines['three_shots'] / df_vaccines['population'], 4)
df_vaccines.head()

Unnamed: 0,country,population,total,one_shot,two_shots,three_shots,date,month,year,one_shot_perc,two_shots_perc,three_shots_perc
0,Brazil,215313504.0,,,,,2021-01-01,2021-01,2021,,,
1,Brazil,215313504.0,,,,,2021-01-02,2021-01,2021,,,
2,Brazil,215313504.0,,,,,2021-01-03,2021-01,2021,,,
3,Brazil,215313504.0,,,,,2021-01-04,2021-01,2021,,,
4,Brazil,215313504.0,,,,,2021-01-05,2021-01,2021,,,


In [27]:
# some columns are stored as 'float64' but should be 'Int64', so we convert them
df_vaccines = df_vaccines.astype({
    'population': 'Int64',
    'total': 'Int64',
    'one_shot': 'Int64',
    'two_shots': 'Int64',
    'three_shots': 'Int64',
})
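
The capitalized `'Int64'` is the pandas nullable integer dtype, which is what allows the conversion to keep missing values. A minimal sketch on a made-up series, contrasted with the plain NumPy `'int64'` cast:

```python
import numpy as np
import pandas as pd

# Made-up float column with a missing value, as produced by the CSV reader.
doses = pd.Series([np.nan, 1200.0, 3400.0])

# Nullable integer dtype: whole numbers are kept and the gap becomes <NA>.
as_nullable = doses.astype('Int64')
print(as_nullable.dtype)           # Int64
print(as_nullable.isna().tolist())  # [True, False, False]

# The plain NumPy integer dtype cannot represent the missing value.
try:
    doses.astype('int64')
    plain_cast_ok = True
except ValueError:
    plain_cast_ok = False
print(plain_cast_ok)  # False
```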

In [28]:
# reordering the columns
df_vaccines = df_vaccines[['date', 'country', 'population', 'total', 'one_shot', 'one_shot_perc', 'two_shots', 'two_shots_perc', 'three_shots', 'three_shots_perc', 'month', 'year']]

In [29]:
df_vaccines.tail()

Unnamed: 0,date,country,population,total,one_shot,one_shot_perc,two_shots,two_shots_perc,three_shots,three_shots_perc,month,year
725,2022-12-27,Brazil,215313504,480310839,188549744,0.8757,174881292,0.8122,122616211,0.5695,2022-12,2022
726,2022-12-28,Brazil,215313504,480310839,188549744,0.8757,174881292,0.8122,122616211,0.5695,2022-12,2022
727,2022-12-29,Brazil,215313504,480331769,188552661,0.8757,174886102,0.8122,122629436,0.5695,2022-12,2022
728,2022-12-30,Brazil,215313504,480332769,188553047,0.8757,174886846,0.8122,122629436,0.5695,2022-12,2022
729,2022-12-31,Brazil,215313504,480333910,188553932,0.8757,174887915,0.8122,122629436,0.5695,2022-12,2022


## 5\. Loading

In [30]:
# Saving the cases file for use in Google Data Studio
cases.to_csv('covid_cases.csv', sep=',', index=False)

In [31]:
# Saving the vaccination file for use in Google Data Studio
df_vaccines.to_csv('covid_vaccines.csv', sep=',', index=False)
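
CSV files do not store dtypes, so anyone reading the exported files back with pandas has to restate them. The sketch below round-trips a tiny made-up frame through an in-memory buffer instead of a file; the column selection and values are illustrative only.

```python
import io

import pandas as pd

# Tiny made-up frame with a date column and a nullable integer column.
frame = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'state': ['Acre', 'Acre'],
    'confirmed_1d': pd.array([None, 252], dtype='Int64'),
})

# Write to an in-memory buffer with the same options used for the exports.
buffer = io.StringIO()
frame.to_csv(buffer, sep=',', index=False)
buffer.seek(0)

# Dates and nullable integers must be requested explicitly when reading back.
restored = pd.read_csv(buffer, parse_dates=['date'], dtype={'confirmed_1d': 'Int64'})
print(restored.dtypes)
```

Missing values are written as empty fields and come back as `<NA>` when the column is read as `Int64`.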