# **COVID-19 DATA ANALYSIS PROJECT FOR THE STATE OF SÃO PAULO**

This project analyzes COVID-19 case data for the state of São Paulo from February 2020 to September 2021.

The data are available at:

https://www.seade.gov.br/coronavirus/#

https://github.com/seade-R/dados-covid-sp

https://www.seade.gov.br/

# **Importing the Data**

## **Importing CSV files**

In [1]:
# Import the required libraries

import numpy as np
import pandas as pd

In [2]:
# Suppress the pandas chained-assignment warning (SettingWithCopyWarning)

pd.options.mode.chained_assignment = None  # default='warn'

In [3]:
# Reading a .csv file:
# table_variable = pd.read_csv('file_path', sep='separator_used_in_file', encoding='encoding_name')

# encoding: character encoding; common choices are iso-8859-1 (also known as latin-1) and utf-8

# Forward slashes avoid the invalid '\d' escape sequence and work on every operating system
covid_sp = pd.read_csv('./dados/dados_covid_sp.csv', sep=';', encoding='utf-8')
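
As a minimal sketch of what `sep` does, the snippet below parses a small semicolon-separated text held in memory instead of the real file. The sample rows and the `decimal=','` argument are assumptions added for illustration; the notebook itself converts decimal commas by hand in a later cell.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the source file:
# ';' separates the fields and ',' is the decimal mark.
raw = "nome_munic;casos;casos_pc\nAdamantina;10;29,5\nAdolfo;3;87,03\n"

# encoding only matters when bytes are read from disk, so it is omitted here.
sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(sample.dtypes)
```

With `decimal=','` the `casos_pc` column is parsed as a float column straight away, rather than being kept as text.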

In [4]:
# Display the contents of the .csv file:
# table_variable.head(rows_to_show)

covid_sp.head(60)

# object.method() --> accepts parameters (inside the parentheses)

Unnamed: 0,nome_munic,codigo_ibge,dia,mes,datahora,casos,casos_novos,casos_pc,casos_mm7d,obitos,...,nome_drs,cod_drs,pop,pop_60,area,map_leg,map_leg_s,latitude,longitude,semana_epidem
0,Adamantina,3500105,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Marília,5,33894,7398,41199,0,8.0,-216820,-510737,9
1,Adolfo,3500204,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São José do Rio Preto,15,3447,761,21106,0,8.0,-212325,-496451,9
2,Aguaí,3500303,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São João da Boa Vista,14,35608,5245,47455,0,8.0,-220572,-469735,9
3,Águas da Prata,3500402,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,São João da Boa Vista,14,7797,1729,14267,0,8.0,-219319,-467176,9
4,Águas de Lindóia,3500501,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Campinas,3,18374,3275,6013,0,8.0,-224733,-466314,9
5,Águas de Santa Bárbara,3500550,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Bauru,12,5931,1106,40446,0,8.0,-228812,-492421,9
6,Águas de São Pedro,3500600,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Piracicaba,11,3122,764,361,0,8.0,-225977,-478734,9
7,Agudos,3500709,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Bauru,12,36134,5524,96671,0,8.0,-224694,-489863,9
8,Alambari,3500758,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Sorocaba,6,5779,830,1596,0,8.0,-235503,-478980,9
9,Alfredo Marcondes,3500808,25,2,2020-02-25,0,0,"0,00000000000000e+00",0,0,...,Presidente Prudente,2,3927,907,11892,0,8.0,-219527,-514140,9


In [5]:
# When the number of rows exceeds the display limit (60), only the first 5 and last 5 of the requested rows are shown

covid_sp.head(61)

Unnamed: 0,nome_munic,codigo_ibge,dia,mes,datahora,casos,casos_novos,casos_pc,casos_mm7d,obitos,...,nome_drs,cod_drs,pop,pop_60,area,map_leg,map_leg_s,latitude,longitude,semana_epidem
0,Adamantina,3500105,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Marília,5,33894,7398,41199,0,8.0,-216820,-510737,9
1,Adolfo,3500204,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,São José do Rio Preto,15,3447,761,21106,0,8.0,-212325,-496451,9
2,Aguaí,3500303,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,São João da Boa Vista,14,35608,5245,47455,0,8.0,-220572,-469735,9
3,Águas da Prata,3500402,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,São João da Boa Vista,14,7797,1729,14267,0,8.0,-219319,-467176,9
4,Águas de Lindóia,3500501,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Campinas,3,18374,3275,6013,0,8.0,-224733,-466314,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56,Barão de Antonina,3505005,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Bauru,12,3383,637,15314,0,8.0,-236284,-495634,9
57,Barbosa,3505104,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Araçatuba,7,7284,1122,20521,0,8.0,-212657,-499518,9
58,Bariri,3505203,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Bauru,12,33993,6061,44441,0,8.0,-220730,-487438,9
59,Barra Bonita,3505302,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,...,Bauru,12,34914,6605,15012,0,8.0,-224909,-485583,9
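
The truncation above is controlled by pandas display options. The sketch below, which uses a made-up 100-row frame, shows one way to change those limits temporarily; the specific values 20 and 10 are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'n': range(100)})

# Frames longer than max_rows are truncated, and only min_rows rows
# (split between the head and the tail) are printed.
with pd.option_context('display.max_rows', 20, 'display.min_rows', 10):
    short_view = repr(df)

# max_rows=None removes the limit, so every row is printed.
with pd.option_context('display.max_rows', None):
    full_view = repr(df)

print(short_view)
```

Using `pd.option_context` keeps the change local to the `with` block, so the notebook's default display settings are restored afterwards.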


In [6]:
# Check how many records (rows) and variables (columns) the file has:
# table_variable.shape

covid_sp.shape

# object.attribute --> takes no parameters

(374034, 26)

## **Importing Excel files**

In [7]:
# Reading an .xlsx file:
# table_variable = pd.read_excel('file_path')

# Excel files take longer to load than CSV files.

## **Importing from a URL**

In [8]:
# url_variable = "url_address"

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

In [9]:
# If the columns have no names, you can name them:
# columns_variable = ['column_name_1', 'column_name_2', ...]

colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

In [10]:
# table_variable = pd.read_csv(url_variable, names=columns_variable)

iris = pd.read_csv(url, names=colnames)

In [11]:
iris.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
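
To see why `names=` matters for a header-less file, the following sketch parses two hand-copied rows from an in-memory string rather than from the URL. The variables `wrong` and `right` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Two rows in the header-less layout of iris.data.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Without names=, the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw))

# With names=, every line is treated as data.
right = pd.read_csv(io.StringIO(raw), names=colnames)

print(len(wrong), len(right))
```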


In [12]:
type(iris)

pandas.core.frame.DataFrame

In [13]:
# Use for quick inspection

iris.shape

(150, 5)

In [14]:
# Use inside code (e.g. for, while, functions, etc.)

len(iris)

150

## **Datasets bundled with Python libraries**

### **Statsmodels**

https://www.statsmodels.org/stable/datasets/index.html

In [15]:
import statsmodels.api as sm

In [16]:
# Import a data table:
# variable = sm.datasets.table_name.load_pandas().data

cancer = sm.datasets.cancer.load_pandas().data
cancer.head()

Unnamed: 0,cancer,population
0,1.0,445.0
1,0.0,559.0
2,3.0,677.0
3,4.0,681.0
4,3.0,746.0


In [17]:
type(cancer)

pandas.core.frame.DataFrame

In [18]:
cancer.shape

(301, 2)

### **Scikit-learn**

https://scikit-learn.org/stable/datasets/toy_dataset.html

In [19]:
import sklearn
from sklearn import datasets

In [20]:
# Import a data table:
# dataset_variable = datasets.load_datasetName()

iris = datasets.load_iris()

In [21]:
# Displays a dictionary-like object (a Bunch) with the dataset's information

iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [22]:
# Displays an array with the data

iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [23]:
# Displays the targets (e.g. the plant classifications)

iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [24]:
# Displays the target names (e.g. the names of the plant classifications)

iris.target_names

#           0             1            2  
# array(['setosa' , 'versicolor', 'virginica'], dtype='<U10')

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
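
The integer codes in `iris.target` are positions in `iris.target_names`. As a small sketch, using hand-typed arrays that stand in for those two attributes, the codes can be turned into labels by indexing one array with the other:

```python
import numpy as np

# Hand-typed stand-ins for iris.target_names and a few iris.target codes.
target_names = np.array(['setosa', 'versicolor', 'virginica'])
target = np.array([0, 0, 1, 2, 2])

# Indexing the names array with the integer codes yields one label per row.
labels = target_names[target]

print(labels)
```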

# **Organizing the Data**

## **Renaming variables (columns)**

In [25]:
# Show all column names
# table_name.columns

covid_sp.columns

Index(['nome_munic', 'codigo_ibge', 'dia', 'mes', 'datahora', 'casos',
       'casos_novos', 'casos_pc', 'casos_mm7d', 'obitos', 'obitos_novos',
       'obitos_pc', 'obitos_mm7d', 'letalidade', 'nome_ra', 'cod_ra',
       'nome_drs', 'cod_drs', 'pop', 'pop_60', 'area', 'map_leg', 'map_leg_s',
       'latitude', 'longitude', 'semana_epidem'],
      dtype='object')

In [26]:
# Reassigning the table (rename returns a new DataFrame):
# table_name = table_name.rename(columns={'current_name': 'new_name'})

covid_sp = covid_sp.rename(columns={'nome_munic': 'municipio'})

In [27]:
# Without reassigning (inplace=True modifies the table directly):
# table_name.rename(columns={'current_name': 'new_name'}, inplace=True)

covid_sp.rename(columns={'datahora': 'data'}, inplace=True)

In [28]:
# Rename several columns at once:
# table_name.rename(columns={'current_name_1': 'new_name_1', 'current_name_2': 'new_name_2', ...}, inplace=True)

covid_sp.rename(columns={'map_leg': 'rotulo_mapa','map_leg_s':'codigo_mapa'}, inplace=True)

In [29]:
covid_sp.columns

Index(['municipio', 'codigo_ibge', 'dia', 'mes', 'data', 'casos',
       'casos_novos', 'casos_pc', 'casos_mm7d', 'obitos', 'obitos_novos',
       'obitos_pc', 'obitos_mm7d', 'letalidade', 'nome_ra', 'cod_ra',
       'nome_drs', 'cod_drs', 'pop', 'pop_60', 'area', 'rotulo_mapa',
       'codigo_mapa', 'latitude', 'longitude', 'semana_epidem'],
      dtype='object')
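
The difference between the two styles used above can be checked on a one-row frame. This is a minimal sketch with made-up data; `renamed` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'nome_munic': ['Adamantina'], 'datahora': ['2020-02-25']})

# rename returns a new DataFrame; df itself is unchanged until reassigned.
renamed = df.rename(columns={'nome_munic': 'municipio'})

# inplace=True changes df directly and returns None.
result = df.rename(columns={'datahora': 'data'}, inplace=True)

print(list(renamed.columns), list(df.columns), result)
```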

## **Dropping variables (columns)**

In [30]:
covid_sp.shape

(374034, 26)

In [31]:
# Drop by name (assigning the result to a new table):
# altered_table = table_name.drop(columns=['column_name'])

covid_sp_alterado = covid_sp.drop(columns=['cod_ra'])

In [32]:
# Drop by position (reassigning):
# altered_table = altered_table.drop(altered_table.columns[[column_position]], axis=0_or_1)

# axis = 0 (rows) or 1 (columns)

covid_sp_alterado = covid_sp_alterado.drop(covid_sp_alterado.columns[[1]], axis=1)

In [33]:
# Drop more than one variable (in place, without reassigning):
# 1 --> altered_table.drop(columns=['column_name_1', 'column_name_2', ...], inplace=True)
# 2 --> altered_table.drop(altered_table.columns[[position_a, position_b, position_c, ...]], axis=0_or_1, inplace=True)

covid_sp_alterado.drop(columns=['rotulo_mapa', 'codigo_mapa', 'cod_drs'], inplace=True)
covid_sp_alterado.drop(covid_sp_alterado.columns[[13, 14, 18, 19]], axis=1, inplace=True)

In [34]:
covid_sp_alterado.shape

(374034, 17)
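
When dropping by position, keep in mind that positions shift after every drop, which is why the position lists above refer to the table as it stands at that moment. A small sketch on a made-up four-column frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

# Drop by name: the remaining columns are a, c, d.
by_name = df.drop(columns=['b'])

# Drop by position: position 1 now refers to 'c', not 'b'.
by_position = by_name.drop(columns=by_name.columns[[1]])

print(list(by_position.columns))
```

Dropping by name is usually less fragile, since it does not depend on the current column order.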

## **Creating and changing column (variable) values**

In [35]:
covid_sp_alterado.shape

(374034, 17)

In [36]:
# Example of changing an entire variable (column):
# table_name['column_name'] = table_name['column_name']/100

covid_sp_alterado['area'] = covid_sp_alterado['area']/100
# or
# covid_sp_alterado['area'] = covid_sp_alterado.area/100

In [37]:
# Example of creating a new variable (column):
# table_name['new_column'] = table_name['column_name_a'] / table_name['column_name_b']

covid_sp_alterado['densidade'] = covid_sp_alterado['pop'] / covid_sp_alterado['area']
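
Rows with an area of zero need care in this division: 0/0 yields NaN, while a positive population divided by zero yields infinity. The sketch below uses made-up rows (the third one is hypothetical) and shows one possible way to make every zero-area row come out as missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'pop': [33894, 0, 10], 'area': [411.99, 0.0, 0.0]})

# Plain division: 0/0 -> NaN, 10/0 -> inf.
raw_density = df['pop'] / df['area']

# Treat a zero area as missing so that every such row becomes NaN.
safe_density = df['pop'] / df['area'].replace(0, np.nan)

print(safe_density)
```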

In [38]:
# Creating a column of indices (starting at 1):
# lista = list(range(1, number_of_records + 1))
# table_name = pd.DataFrame(lista, columns=['indice'])

lista = list(range(1, covid_sp_alterado.shape[0] + 1))
df = pd.DataFrame(lista, columns=['indice'])
df

Unnamed: 0,indice
0,1
1,2
2,3
3,4
4,5
...,...
374029,374030
374030,374031
374031,374032
374032,374033


In [39]:
# Join two DataFrames:
# table_name = pd.concat([table_a, table_b], axis=0_or_1)

covid_sp_alterado = pd.concat([covid_sp_alterado, df], axis=1)  # axis=1 --> join column-wise

In [40]:
# Move the last column to the front:
# table_name = table_name.reindex(columns=['column_name'] + list(table_name.columns[:-1]))

covid_sp_alterado = covid_sp_alterado.reindex(columns=['indice'] + list(covid_sp_alterado.columns[:-1]))
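
The three steps above (build a list, concatenate, reorder) work because both frames share the same default row labels, which `pd.concat(..., axis=1)` uses for alignment. As an alternative sketch, not the notebook's own approach, `DataFrame.insert` can place a 1-based counter in the first position in one call:

```python
import pandas as pd

df = pd.DataFrame({'municipio': ['Adamantina', 'Adolfo', 'Aguaí']})

# Insert a 1-based counter as the first column (position 0), in place.
df.insert(0, 'indice', range(1, len(df) + 1))

print(df)
```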

In [41]:
# Select specific columns:
# table_name[['column_1', 'column_2']]

covid_sp_alterado[['indice', 'area', 'densidade']]

Unnamed: 0,indice,area,densidade
0,1,411.99,82.268987
1,2,211.06,16.331849
2,3,474.55,75.035297
3,4,142.67,54.650592
4,5,60.13,305.571262
...,...,...,...
374029,374030,0.00,
374030,374031,0.00,
374031,374032,0.00,
374032,374033,0.00,


## **Counting Records per Variable (Column)**

In [42]:
# Count the values in a column:
# table_name['column_name'].value_counts()

covid_sp_alterado['semana_epidem'].value_counts()

semana_epidem
35    9044
24    9044
36    9044
10    9044
34    9044
33    9044
32    9044
31    9044
30    9044
29    9044
28    9044
27    9044
26    9044
25    9044
23    9044
38    9044
22    9044
21    9044
20    9044
19    9044
18    9044
17    9044
16    9044
15    9044
14    9044
13    9044
12    9044
11    9044
37    9044
9     7752
39    4522
51    4522
7     4522
6     4522
5     4522
4     4522
3     4522
2     4522
1     4522
53    4522
52    4522
50    4522
40    4522
49    4522
48    4522
47    4522
46    4522
45    4522
44    4522
43    4522
42    4522
41    4522
8     4522
Name: count, dtype: int64

In [43]:
# Sort by index:
# table_name['column_name'].value_counts().sort_index()

covid_sp_alterado['semana_epidem'].value_counts().sort_index()

semana_epidem
1     4522
2     4522
3     4522
4     4522
5     4522
6     4522
7     4522
8     4522
9     7752
10    9044
11    9044
12    9044
13    9044
14    9044
15    9044
16    9044
17    9044
18    9044
19    9044
20    9044
21    9044
22    9044
23    9044
24    9044
25    9044
26    9044
27    9044
28    9044
29    9044
30    9044
31    9044
32    9044
33    9044
34    9044
35    9044
36    9044
37    9044
38    9044
39    4522
40    4522
41    4522
42    4522
43    4522
44    4522
45    4522
46    4522
47    4522
48    4522
49    4522
50    4522
51    4522
52    4522
53    4522
Name: count, dtype: int64

In [44]:
# Count using the Counter class:
# Counter(table_name.column_name)

from collections import Counter

Counter(covid_sp_alterado.semana_epidem)

Counter({10: 9044,
         11: 9044,
         12: 9044,
         13: 9044,
         14: 9044,
         15: 9044,
         16: 9044,
         17: 9044,
         18: 9044,
         19: 9044,
         20: 9044,
         21: 9044,
         22: 9044,
         23: 9044,
         24: 9044,
         25: 9044,
         26: 9044,
         27: 9044,
         28: 9044,
         29: 9044,
         30: 9044,
         31: 9044,
         32: 9044,
         33: 9044,
         34: 9044,
         35: 9044,
         36: 9044,
         37: 9044,
         38: 9044,
         9: 7752,
         39: 4522,
         40: 4522,
         41: 4522,
         42: 4522,
         43: 4522,
         44: 4522,
         45: 4522,
         46: 4522,
         47: 4522,
         48: 4522,
         49: 4522,
         50: 4522,
         51: 4522,
         52: 4522,
         53: 4522,
         1: 4522,
         2: 4522,
         3: 4522,
         4: 4522,
         5: 4522,
         6: 4522,
         7: 4522,
         8: 4522})
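
The three counting approaches can be compared side by side on a short, made-up series of week numbers. This is only a sketch of how the results differ in type and ordering:

```python
from collections import Counter

import pandas as pd

weeks = pd.Series([9, 10, 10, 11, 10, 9], name='semana_epidem')

counts = weeks.value_counts()    # Series ordered by frequency, highest first
by_week = counts.sort_index()    # same counts, ordered by week number
as_counter = Counter(weeks)      # plain dict-like mapping of value -> count

print(by_week.to_dict(), dict(as_counter))
```

`value_counts` keeps the result inside pandas, which is convenient for further filtering or plotting, while `Counter` returns a standard-library object.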

In [45]:
# Relating two columns:
# table_name.query('condition')['column_to_count'].value_counts()

covid_sp_alterado.query('obitos_novos > 50')['municipio'].value_counts()
# that is,
# how many times each municipality recorded more than 50 new deaths in a single day

municipio
São Paulo                295
Guarulhos                  9
São Bernardo do Campo      3
Taubaté                    3
Itapetininga               2
Campinas                   1
Sorocaba                   1
Name: count, dtype: int64
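
`query` is a string-based shorthand for boolean filtering. The sketch below, on four made-up rows, shows the equivalent filter written with a boolean mask and `.loc`:

```python
import pandas as pd

df = pd.DataFrame({
    'municipio': ['São Paulo', 'São Paulo', 'Guarulhos', 'Campinas'],
    'obitos_novos': [120, 80, 55, 10],
})

# The same filter expressed in two ways.
via_query = df.query('obitos_novos > 50')['municipio'].value_counts()
via_mask = df.loc[df['obitos_novos'] > 50, 'municipio'].value_counts()

print(via_query)
```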

## **Selecting Variables (Columns) by Index**

In [46]:
# x = table_name.iloc[rows, columns]

x = covid_sp_alterado.iloc[:, 5:13]
x

# NOTE: [:, :] --> a ':' on its own selects all rows or all columns

Unnamed: 0,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d
0,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000
1,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000
2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000
3,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000
4,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000
...,...,...,...,...,...,...,...,...
374029,1956,-477,"0,00000000000000e+00",-931428571428571445,7,-1,"0,00000000000000e+00",-0285714285714286
374030,1414,-542,"0,00000000000000e+00",-1038714285714285779,7,0,"0,00000000000000e+00",-0571428571428571
374031,962,-452,"0,00000000000000e+00",-348000000000000000,6,-1,"0,00000000000000e+00",-1714285714285714
374032,557,-405,"0,00000000000000e+00",-774428571428571445,1,-5,"0,00000000000000e+00",-2428571428571428


In [47]:
type(x)

pandas.core.frame.DataFrame

In [48]:
y = covid_sp_alterado.iloc[:, 1]  # A single column is returned as a Series
y

0               Adamantina
1                   Adolfo
2                    Aguaí
3           Águas da Prata
4         Águas de Lindóia
                ...       
374029            Ignorado
374030            Ignorado
374031            Ignorado
374032            Ignorado
374033            Ignorado
Name: municipio, Length: 374034, dtype: object

In [49]:
type(y)

pandas.core.series.Series

In [50]:
y = covid_sp_alterado.iloc[:, 1].values  # .values returns a NumPy array
y

array(['Adamantina', 'Adolfo', 'Aguaí', ..., 'Ignorado', 'Ignorado',
       'Ignorado'], dtype=object)

In [51]:
type(y)

numpy.ndarray
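
The type returned by `iloc` depends on how the columns are selected. The following sketch, on a made-up two-row frame, summarizes the cases; `to_numpy()` and `tolist()` are shown as alternatives to `.values`, not as the notebook's own calls.

```python
import pandas as pd

df = pd.DataFrame({
    'indice': [1, 2],
    'municipio': ['Adamantina', 'Adolfo'],
    'casos': [0, 3],
})

frame = df.iloc[:, 1:3]      # a slice of columns  -> DataFrame
series = df.iloc[:, 1]       # a single integer    -> Series
array = series.to_numpy()    # the same data as a NumPy array
names = series.tolist()      # a plain Python list

print(type(frame).__name__, type(series).__name__, type(array).__name__, names)
```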

In [52]:
# Convert the array to a list

lista_y = list(y.flatten())
lista_y

['Adamantina',
 'Adolfo',
 'Aguaí',
 'Águas da Prata',
 'Águas de Lindóia',
 'Águas de Santa Bárbara',
 'Águas de São Pedro',
 'Agudos',
 'Alambari',
 'Alfredo Marcondes',
 'Altair',
 'Altinópolis',
 'Alto Alegre',
 'Alumínio',
 'Álvares Florence',
 'Álvares Machado',
 'Álvaro de Carvalho',
 'Alvinlândia',
 'Americana',
 'Américo Brasiliense',
 'Américo de Campos',
 'Amparo',
 'Analândia',
 'Andradina',
 'Angatuba',
 'Anhembi',
 'Anhumas',
 'Aparecida',
 "Aparecida d'Oeste",
 'Apiaí',
 'Araçariguama',
 'Araçatuba',
 'Araçoiaba da Serra',
 'Aramina',
 'Arandu',
 'Arapeí',
 'Araraquara',
 'Araras',
 'Arco-Íris',
 'Arealva',
 'Areias',
 'Areiópolis',
 'Ariranha',
 'Artur Nogueira',
 'Arujá',
 'Aspásia',
 'Assis',
 'Atibaia',
 'Auriflama',
 'Avaí',
 'Avanhandava',
 'Avaré',
 'Bady Bassitt',
 'Balbinos',
 'Bálsamo',
 'Bananal',
 'Barão de Antonina',
 'Barbosa',
 'Bariri',
 'Barra Bonita',
 'Barra do Chapéu',
 'Barra do Turvo',
 'Barretos',
 'Barrinha',
 'Barueri',
 'Bastos',
 'Batatais',
 '

In [53]:
# Create a DataFrame from a list:
# df = pd.DataFrame(lista, columns=['column_name_for_list'])

df = pd.DataFrame(lista_y, columns=['municipio_lista'])
df

Unnamed: 0,municipio_lista
0,Adamantina
1,Adolfo
2,Aguaí
3,Águas da Prata
4,Águas de Lindóia
...,...
374029,Ignorado
374030,Ignorado
374031,Ignorado
374032,Ignorado


## **Dropping, Filtering and Replacing Records (Rows)**

In [54]:
covid_sp_alterado['indice']

0              1
1              2
2              3
3              4
4              5
           ...  
374029    374030
374030    374031
374031    374032
374032    374033
374033    374034
Name: indice, Length: 374034, dtype: int64

In [55]:
# Dropping rows by index (individual positions):
# table_name = table_name.drop(table_name.index[[position_a, position_b, ...]])

covid_sp_alterado_2 = covid_sp_alterado.drop(covid_sp_alterado.index[[1, 3]])
covid_sp_alterado_2['indice']

0              1
2              3
4              5
5              6
6              7
           ...  
374029    374030
374030    374031
374031    374032
374032    374033
374033    374034
Name: indice, Length: 374032, dtype: int64

In [56]:
# Dropping rows by index (range of positions):
# table_name = table_name.drop(table_name.index[position_a:position_b])

# Note: the labels here come from the original covid_sp_alterado index, so labels 4, 5 and 6 are dropped
covid_sp_alterado_2 = covid_sp_alterado_2.drop(covid_sp_alterado.index[4:7])
covid_sp_alterado_2['indice']

0              1
2              3
7              8
8              9
9             10
           ...  
374029    374030
374030    374031
374031    374032
374032    374033
374033    374034
Name: indice, Length: 374029, dtype: int64

In [57]:
# Reset the (default) row index:
# table_name = table_name.reset_index(drop=True_or_False)

covid_sp_alterado_2 = covid_sp_alterado_2.reset_index(drop=True)  # drop=True discards the previous index
covid_sp_alterado_2['indice']

0              1
1              3
2              8
3              9
4             10
           ...  
374024    374030
374025    374031
374026    374032
374027    374033
374028    374034
Name: indice, Length: 374029, dtype: int64
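
After rows are dropped, a row's position and its label no longer coincide until the index is reset. The sketch below traces this on a made-up six-row frame; the intermediate names `step1`, `step2` and `clean` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'indice': [1, 2, 3, 4, 5, 6]})

# index[[1, 3]] picks the labels at positions 1 and 3 (here also labels 1 and 3).
step1 = df.drop(df.index[[1, 3]])

# step1.index is now [0, 2, 4, 5], so positions 2:4 select labels 4 and 5.
step2 = step1.drop(step1.index[2:4])

# reset_index(drop=True) renumbers the labels from 0 and discards the old ones.
clean = step2.reset_index(drop=True)

print(list(step1.index), list(step2.index), list(clean.index))
```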

In [58]:
# Select the rows that have a specific value:
# variable = table_name.loc[table_name.column_name == 'specific_value']

ignorado = covid_sp_alterado.loc[covid_sp_alterado.municipio == 'Ignorado']
ignorado

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
373455,373456,Ignorado,25,2,2020-02-25,-1,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,0,0,0.0,9,
373456,373457,Ignorado,26,2,2020-02-26,0,1,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,0,0,0.0,9,
373457,373458,Ignorado,27,2,2020-02-27,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,0,0,0.0,9,
373458,373459,Ignorado,28,2,2020-02-28,-1,-1,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,0,0,0.0,9,
373459,373460,Ignorado,29,2,2020-02-29,0,1,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,0,0,0.0,9,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
374029,374030,Ignorado,21,9,2021-09-21,1956,-477,"0,00000000000000e+00",-931428571428571445,7,-1,"0,00000000000000e+00",-0285714285714286,0003578732106339468,0,0,0.0,38,
374030,374031,Ignorado,22,9,2021-09-22,1414,-542,"0,00000000000000e+00",-1038714285714285779,7,0,"0,00000000000000e+00",-0571428571428571,0004950495049504951,0,0,0.0,38,
374031,374032,Ignorado,23,9,2021-09-23,962,-452,"0,00000000000000e+00",-348000000000000000,6,-1,"0,00000000000000e+00",-1714285714285714,0006237006237006237,0,0,0.0,38,
374032,374033,Ignorado,24,9,2021-09-24,557,-405,"0,00000000000000e+00",-774428571428571445,1,-5,"0,00000000000000e+00",-2428571428571428,0001795332136445242,0,0,0.0,38,


In [59]:
ignorado.shape  # (x, y)
                # x --> number of rows in which 'Ignorado' appears

(579, 19)

In [60]:
# Select the rows that do not have a specific value.
# This also serves to remove that value from the table:
# variable_or_table_name = table_name.loc[table_name.column_name != 'specific_value']

covid_sp_alterado = covid_sp_alterado.loc[covid_sp_alterado.municipio != 'Ignorado']
covid_sp_alterado

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
0,1,Adamantina,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,33894,7398,411.99,9,82.268987
1,2,Adolfo,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,3447,761,211.06,9,16.331849
2,3,Aguaí,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,35608,5245,474.55,9,75.035297
3,4,Águas da Prata,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,7797,1729,142.67,9,54.650592
4,5,Águas de Lindóia,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,18374,3275,60.13,9,305.571262
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373450,373451,Votorantim,25,9,2021-09-25,11670,1,"9,73928428361597e+03",2571428571428572,515,0,"4,29797035652290e+02",0142857142857143,0044130248500428446,119824,16378,183.52,38,652.920663
373451,373452,Votuporanga,25,9,2021-09-25,16147,9,"1,75969921534438e+04",7571428571428571,453,0,"4,93679163034002e+02",0142857142857143,0028054747011828824,91760,17203,42.07,38,2181.126694
373452,373453,Zacarias,25,9,2021-09-25,268,0,"1,04687500000000e+04",0000000000000000,10,0,"3,90625000000000e+02",0000000000000000,0037313432835820892,2560,481,319.06,38,8.023569
373453,373454,Chavantes,25,9,2021-09-25,1388,0,"1,13556410046633e+04",0142857142857143,48,0,"3,92702282582018e+02",0000000000000000,0034582132564841501,12223,2098,188.73,38,64.764478


In [61]:
# Example --> analysis of Guarulhos

guarulhos = covid_sp_alterado.loc[covid_sp_alterado.municipio == 'Guarulhos']
guarulhos

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
212,213,Guarulhos,25,2,2020-02-25,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
857,858,Guarulhos,26,2,2020-02-26,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
1502,1503,Guarulhos,27,2,2020-02-27,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
2147,2148,Guarulhos,28,2,2020-02-28,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
2792,2793,Guarulhos,29,2,2020-02-29,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370442,370443,Guarulhos,21,9,2021-09-21,63334,28,"4,68698081441602e+03",102571428571428569,4857,3,"3,59438308264417e+02",2571428571428572,0076688666435090161,1351275,162662,318.68,38,4240.225304
371087,371088,Guarulhos,22,9,2021-09-22,63351,17,"4,68823888549703e+03",105000000000000000,4861,4,"3,59734324989362e+02",2428571428571428,0076731227604931257,1351275,162662,318.68,38,4240.225304
371732,371733,Guarulhos,23,9,2021-09-23,63368,17,"4,68949695657805e+03",107285714285714292,4862,1,"3,59808329170598e+02",2000000000000000,0076726423431384930,1351275,162662,318.68,38,4240.225304
372377,372378,Guarulhos,24,9,2021-09-24,63387,19,"4,69090303602154e+03",61285714285714285,4863,1,"3,59882333351834e+02",2142857142857143,0076719201098016943,1351275,162662,318.68,38,4240.225304
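
A filtered subset like this one can trigger pandas' chained-assignment warning when it is modified afterwards, which is the warning the notebook silences at the top. As a hedged alternative, calling `.copy()` on the subset makes it an independent DataFrame; the data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'municipio': ['Guarulhos', 'Ignorado', 'Guarulhos'],
    'casos': [0, -1, 5],
})

# .copy() makes the subset independent of df, so later edits
# do not raise chained-assignment warnings and do not touch df.
subset = df.loc[df['municipio'] == 'Guarulhos'].copy()
subset['casos'] = subset['casos'] * 2

print(subset)
```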


In [62]:
guarulhos.drop(columns=['data', 'municipio'], inplace=True)
guarulhos

Unnamed: 0,indice,dia,mes,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
212,213,25,2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
857,858,26,2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
1502,1503,27,2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
2147,2148,28,2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
2792,2793,29,2,0,0,"0,00000000000000e+00",0000000000000000,0,0,"0,00000000000000e+00",0000000000000000,0000000000000000000,1351275,162662,318.68,9,4240.225304
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
370442,370443,21,9,63334,28,"4,68698081441602e+03",102571428571428569,4857,3,"3,59438308264417e+02",2571428571428572,0076688666435090161,1351275,162662,318.68,38,4240.225304
371087,371088,22,9,63351,17,"4,68823888549703e+03",105000000000000000,4861,4,"3,59734324989362e+02",2428571428571428,0076731227604931257,1351275,162662,318.68,38,4240.225304
371732,371733,23,9,63368,17,"4,68949695657805e+03",107285714285714292,4862,1,"3,59808329170598e+02",2000000000000000,0076726423431384930,1351275,162662,318.68,38,4240.225304
372377,372378,24,9,63387,19,"4,69090303602154e+03",61285714285714285,4863,1,"3,59882333351834e+02",2142857142857143,0076719201098016943,1351275,162662,318.68,38,4240.225304


In [63]:
# Replace values in a table column (using a dictionary):
# table_name['column_name'] = table_name['column_name'].replace({current_value_1: new_value_1, current_value_2: new_value_2, ...})

guarulhos['semana_epidem'] = guarulhos['semana_epidem'].replace({9: 'nove', 10: 'dez'})
guarulhos['semana_epidem']

212       nove
857       nove
1502      nove
2147      nove
2792      nove
          ... 
370442      38
371087      38
371732      38
372377      38
373022      38
Name: semana_epidem, Length: 579, dtype: object

In [64]:
# Replace values in a table column (using lists):
# table_name['column_name'] = table_name['column_name'].replace([current_value_1, current_value_2, ...], [new_value_1, new_value_2, ...])

guarulhos['semana_epidem'] = guarulhos['semana_epidem'].replace([11, 12, 13], ['onze', 'doze', 'treze'])

In [65]:
# Select the records whose value in column x is in the list:
# table_name.loc[table_name[x].isin(['value_1', 'value_2'])]

guarulhos.loc[guarulhos['semana_epidem'].isin(['onze', 'doze', 'treze'])]

Unnamed: 0,indice,dia,mes,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
7952,7953,8,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
8597,8598,9,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
9242,9243,10,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
9887,9888,11,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
10532,10533,12,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
11177,11178,13,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
11822,11823,14,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,onze,4240.225304
12467,12468,15,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,doze,4240.225304
13112,13113,16,3,0,0,"0,00000000000000e+00",0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,doze,4240.225304
13757,13758,17,3,1,1,740041812362398e-02,142857142857143,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,doze,4240.225304
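
The `isin` call returns a boolean mask, so the same mask can also be negated to keep the rows outside the list. A small sketch under that assumption, with a toy table (`tabela`, `mascara`, `dentro` and `fora` are hypothetical names, and the values are illustrative):

```python
import pandas as pd

# Toy table with a mixed 'semana_epidem' column, as produced by the replacements above.
tabela = pd.DataFrame({
    'semana_epidem': ['onze', 'doze', 38, 'treze', 37],
    'casos': [0, 1, 5, 2, 4],
})

# Boolean mask: True where the value is in the list.
mascara = tabela['semana_epidem'].isin(['onze', 'doze', 'treze'])

dentro = tabela.loc[mascara]    # rows whose value is in the list
fora = tabela.loc[~mascara]     # rows whose value is NOT in the list

print(dentro['casos'].tolist())
print(fora['casos'].tolist())
```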


In [66]:
# Python uses the English number format (a . as the decimal separator instead of a , ), so:
# nome_tabela['nome_coluna'] = nome_tabela['nome_coluna'].apply(lambda x: x.replace(',', '.'))

guarulhos['casos_pc'] = guarulhos['casos_pc'].apply(lambda x: x.replace(',', '.'))
guarulhos['casos_pc']

212       0.00000000000000e+00
857       0.00000000000000e+00
1502      0.00000000000000e+00
2147      0.00000000000000e+00
2792      0.00000000000000e+00
                  ...         
370442    4.68698081441602e+03
371087    4.68823888549703e+03
371732    4.68949695657805e+03
372377    4.69090303602154e+03
373022    4.69171708201513e+03
Name: casos_pc, Length: 579, dtype: object
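
Note that the column is still `object` after the replacement: the values are strings that merely look like numbers. As a hedged alternative to `apply` with a lambda, the vectorized `.str.replace` accessor followed by `pd.to_numeric` does the replacement and the conversion in one go. The toy values below copy the notation seen in `casos_pc`; the variable names are hypothetical:

```python
import pandas as pd

# Toy values in the same decimal-comma scientific notation seen in 'casos_pc'.
casos_pc = pd.Series(['0,00000000000000e+00', '4,69090303602154e+03'])

# Vectorized string replacement followed by numeric conversion.
casos_pc_num = pd.to_numeric(casos_pc.str.replace(',', '.', regex=False))

print(casos_pc_num.dtype)  # float64
print(casos_pc_num.tolist())
```

`pd.read_csv` also has a `decimal` parameter that may handle this at load time; whether it parses every column of this particular file correctly has not been checked here.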

In [67]:
guarulhos.shape

(579, 17)

In [68]:
# Create the starting date for a date column:
# data = np.array('data_início', dtype='datetime64[D]')

data = np.array('2020-02-25', dtype='datetime64[D]')
data

array('2020-02-25', dtype='datetime64[D]')

In [69]:
# Build an array running from data_início to the final date, with num_datas entries:
# data = data + np.arange(num_datas)

data = data + np.arange(guarulhos.shape[0])
data

array(['2020-02-25', '2020-02-26', '2020-02-27', '2020-02-28',
       '2020-02-29', '2020-03-01', '2020-03-02', '2020-03-03',
       '2020-03-04', '2020-03-05', '2020-03-06', '2020-03-07',
       '2020-03-08', '2020-03-09', '2020-03-10', '2020-03-11',
       '2020-03-12', '2020-03-13', '2020-03-14', '2020-03-15',
       '2020-03-16', '2020-03-17', '2020-03-18', '2020-03-19',
       '2020-03-20', '2020-03-21', '2020-03-22', '2020-03-23',
       '2020-03-24', '2020-03-25', '2020-03-26', '2020-03-27',
       '2020-03-28', '2020-03-29', '2020-03-30', '2020-03-31',
       '2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04',
       '2020-04-05', '2020-04-06', '2020-04-07', '2020-04-08',
       '2020-04-09', '2020-04-10', '2020-04-11', '2020-04-12',
       '2020-04-13', '2020-04-14', '2020-04-15', '2020-04-16',
       '2020-04-17', '2020-04-18', '2020-04-19', '2020-04-20',
       '2020-04-21', '2020-04-22', '2020-04-23', '2020-04-24',
       '2020-04-25', '2020-04-26', '2020-04-27', '2020-

In [70]:
# Transform the date array into a DataFrame:

data = pd.DataFrame(data)
data

Unnamed: 0,0
0,2020-02-25
1,2020-02-26
2,2020-02-27
3,2020-02-28
4,2020-02-29
...,...
574,2021-09-21
575,2021-09-22
576,2021-09-23
577,2021-09-24


In [71]:
# Name the date column:
# data.columns = ['nome_coluna']

data.columns = ['data']
data.head()

Unnamed: 0,data
0,2020-02-25
1,2020-02-26
2,2020-02-27
3,2020-02-28
4,2020-02-29
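
The three steps above (start date, `np.arange` offset, DataFrame with a named column) can also be sketched in a single call with `pd.date_range`. This is an alternative, not what the notebook does; `datas` is a hypothetical name and 579 is the row count reported by `guarulhos.shape`:

```python
import pandas as pd

# 579 matches guarulhos.shape[0] shown earlier; one row per consecutive day.
datas = pd.DataFrame({
    'data': pd.date_range(start='2020-02-25', periods=579, freq='D')
})

print(len(datas))
print(datas['data'].iloc[0].date(), datas['data'].iloc[-1].date())
```

The column comes out as `datetime64[ns]` directly, so no later `pd.to_datetime` call is needed for it.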


In [72]:
# This will not come out right, because the indices of 'guarulhos' have not been reset

guarulhos_2 = pd.concat([data, guarulhos], axis=1)
guarulhos_2.head()

Unnamed: 0,data,indice,dia,mes,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
0,2020-02-25,,,,,,,,,,,,,,,,,
1,2020-02-26,,,,,,,,,,,,,,,,,
2,2020-02-27,,,,,,,,,,,,,,,,,
3,2020-02-28,,,,,,,,,,,,,,,,,
4,2020-02-29,,,,,,,,,,,,,,,,,


In [73]:
guarulhos = guarulhos.reset_index(drop=True)

In [74]:
# Now it works!

guarulhos_2 = pd.concat([data, guarulhos], axis=1)
guarulhos_2.head()

Unnamed: 0,data,indice,dia,mes,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
0,2020-02-25,213,25,2,0,0,0.0,0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,nove,4240.225304
1,2020-02-26,858,26,2,0,0,0.0,0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,nove,4240.225304
2,2020-02-27,1503,27,2,0,0,0.0,0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,nove,4240.225304
3,2020-02-28,2148,28,2,0,0,0.0,0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,nove,4240.225304
4,2020-02-29,2793,29,2,0,0,0.0,0,0,0,"0,00000000000000e+00",0,0,1351275,162662,318.68,nove,4240.225304


In [75]:
guarulhos_2.shape

(579, 18)
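
The reason the first attempt produced only NaN is that `pd.concat(..., axis=1)` pairs rows by index label, not by position. A minimal sketch with toy frames (`esquerda` and `direita` are hypothetical names; the index labels mimic the ones kept by the filtered `guarulhos` table):

```python
import pandas as pd

# 'esquerda' has the default 0..2 index; 'direita' keeps the original
# row labels of a filtered table (illustrative values).
esquerda = pd.DataFrame({'data': ['2020-02-25', '2020-02-26', '2020-02-27']})
direita = pd.DataFrame({'casos': [0, 0, 1]}, index=[212, 857, 1502])

# Labels do not match, so the result is the union of both indices, padded with NaN.
desalinhado = pd.concat([esquerda, direita], axis=1)

# After resetting the index, rows are paired position by position.
alinhado = pd.concat([esquerda, direita.reset_index(drop=True)], axis=1)

print(desalinhado.shape)  # (6, 2)
print(alinhado.shape)     # (3, 2)
```

Checking that the shape still has the expected number of rows, as done above with `guarulhos_2.shape`, is a quick way to catch this kind of misalignment.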

## **Missing Values (NaN)**

In [76]:
# COUNT OF MISSING VALUES

# Number of missing values in each column:
# nome_tabela.isnull().sum()

covid_sp_alterado.isnull().sum()

indice           0
municipio        0
dia              0
mes              0
data             0
casos            0
casos_novos      0
casos_pc         0
casos_mm7d       0
obitos           0
obitos_novos     0
obitos_pc        0
obitos_mm7d      0
letalidade       0
pop              0
pop_60           0
area             0
semana_epidem    0
densidade        0
dtype: int64

In [77]:
# COUNT OF MISSING VALUES

# Number of missing values in one specific column:
# nome_tabela['nome_coluna'].isnull().sum()

covid_sp_alterado['casos'].isnull().sum()  

0

In [78]:
# COUNT OF MISSING VALUES

covid_sp.isnull().sum()

municipio          0
codigo_ibge        0
dia                0
mes                0
data               0
casos              0
casos_novos        0
casos_pc           0
casos_mm7d         0
obitos             0
obitos_novos       0
obitos_pc          0
obitos_mm7d        0
letalidade         0
nome_ra          579
cod_ra             0
nome_drs         579
cod_drs            0
pop                0
pop_60             0
area               0
rotulo_mapa      579
codigo_mapa      579
latitude           0
longitude          0
semana_epidem      0
dtype: int64

In [79]:
# Drop every row that contains a MISSING VALUE:
# nome_tabela = nome_tabela.dropna()

covid_sp_2 = covid_sp.dropna()
covid_sp_2.isnull().sum()

municipio        0
codigo_ibge      0
dia              0
mes              0
data             0
casos            0
casos_novos      0
casos_pc         0
casos_mm7d       0
obitos           0
obitos_novos     0
obitos_pc        0
obitos_mm7d      0
letalidade       0
nome_ra          0
cod_ra           0
nome_drs         0
cod_drs          0
pop              0
pop_60           0
area             0
rotulo_mapa      0
codigo_mapa      0
latitude         0
longitude        0
semana_epidem    0
dtype: int64
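
Calling `dropna()` with no arguments removes every row that has a missing value in any column. When only some columns matter, the `subset` and `axis` parameters give finer control. A small sketch with a toy table (names and values are illustrative):

```python
import numpy as np
import pandas as pd

tabela = pd.DataFrame({
    'municipio': ['A', 'B', 'C'],
    'casos': [1, 2, 3],
    'nome_drs': ['X', np.nan, 'Z'],
})

# Default: drop any row with at least one missing value.
sem_linhas_incompletas = tabela.dropna()

# Drop rows only when 'casos' is missing; the row with a missing 'nome_drs' is kept.
por_coluna = tabela.dropna(subset=['casos'])

# Drop columns that contain any missing value instead of dropping rows.
sem_colunas_incompletas = tabela.dropna(axis=1)

print(len(sem_linhas_incompletas), len(por_coluna))
print(list(sem_colunas_incompletas.columns))
```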

In [80]:
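# Fill the MISSING VALUES with the MEDIAN:
# (median() needs the parentheses, so that the computed value, not the method itself, is passed)
# nome_tabela['nome_coluna'] = nome_tabela['nome_coluna'].fillna(nome_tabela['nome_coluna'].median())

covid_sp['obitos_novos'] = covid_sp['obitos_novos'].fillna(covid_sp['obitos_novos'].median())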

In [81]:
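# Fill the MISSING VALUES with the MEAN:
# (mean() needs the parentheses, so that the computed value, not the method itself, is passed)
# nome_tabela['nome_coluna'] = nome_tabela['nome_coluna'].fillna(nome_tabela['nome_coluna'].mean())

covid_sp['obitos_novos'] = covid_sp['obitos_novos'].fillna(covid_sp['obitos_novos'].mean())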

In [82]:
# Fill the MISSING VALUES with ANY OTHER VALUE:
# nome_tabela['nome_coluna'] = nome_tabela['nome_coluna'].fillna(qualquer_valor)

covid_sp['obitos_novos'] = covid_sp['obitos_novos'].fillna(10)
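
Since `obitos_novos` has no missing values in this dataset, the fills above have no visible effect. A toy Series makes the difference between the median and mean fills concrete (the values are illustrative only):

```python
import numpy as np
import pandas as pd

obitos = pd.Series([0.0, 2.0, np.nan, 4.0, 10.0])

# Both statistics skip NaN by default.
mediana = obitos.median()   # median of 0, 2, 4, 10
media = obitos.mean()       # mean of 0, 2, 4, 10

preenchido_mediana = obitos.fillna(mediana)
preenchido_media = obitos.fillna(media)

print(mediana, media)
print(preenchido_mediana.tolist())
print(preenchido_media.tolist())
```

The median is less sensitive to the outlier (10.0) than the mean, which is one common reason to prefer it for skewed counts.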

## **Classifying and changing attribute types**

Attribute types:

- object: strings

- int64: integers

- float64: real numbers

- complex: complex numbers

In [83]:
# Check the column types of a table:
# nome_tabela.dtypes

covid_sp_alterado.dtypes

indice             int64
municipio         object
dia                int64
mes                int64
data              object
casos              int64
casos_novos        int64
casos_pc          object
casos_mm7d        object
obitos             int64
obitos_novos       int64
obitos_pc         object
obitos_mm7d       object
letalidade        object
pop                int64
pop_60             int64
area             float64
semana_epidem      int64
densidade        float64
dtype: object

In [84]:
# Convert a variable (column) to another type:
# nome_tabela['nome_coluna'] = nome_tabela['nome_coluna'].astype(tipo_atributo)

# covid_sp_alterado['casos_pc'] = covid_sp_alterado['casos_pc'].astype(float)

# This raises an error, because 'casos_pc' is written with "," as the decimal separator

In [85]:
covid_sp_alterado['casos_pc'] = covid_sp_alterado['casos_pc'].apply(lambda x: x.replace(',', '.'))

In [86]:
# Now it works!

covid_sp_alterado['casos_pc'] = covid_sp_alterado['casos_pc'].astype(float)
covid_sp_alterado.dtypes

indice             int64
municipio         object
dia                int64
mes                int64
data              object
casos              int64
casos_novos        int64
casos_pc         float64
casos_mm7d        object
obitos             int64
obitos_novos       int64
obitos_pc         object
obitos_mm7d       object
letalidade        object
pop                int64
pop_60             int64
area             float64
semana_epidem      int64
densidade        float64
dtype: object

In [87]:
# Apply the same treatment to every variable that needs it

covid_sp_alterado['casos_mm7d'] = covid_sp_alterado['casos_mm7d'].apply(lambda x: x.replace(',', '.'))
covid_sp_alterado['obitos_pc'] = covid_sp_alterado['obitos_pc'].apply(lambda x: x.replace(',', '.'))
covid_sp_alterado['obitos_mm7d'] = covid_sp_alterado['obitos_mm7d'].apply(lambda x: x.replace(',', '.'))
covid_sp_alterado['letalidade'] = covid_sp_alterado['letalidade'].apply(lambda x: x.replace(',', '.'))
covid_sp_alterado.head(1)

Unnamed: 0,indice,municipio,dia,mes,data,casos,casos_novos,casos_pc,casos_mm7d,obitos,obitos_novos,obitos_pc,obitos_mm7d,letalidade,pop,pop_60,area,semana_epidem,densidade
0,1,Adamantina,25,2,2020-02-25,0,0,0.0,0.0,0,0,0.0,0.0,0.0,33894,7398,411.99,9,82.268987


In [88]:
# Change the type of every variable that needs it

covid_sp_alterado['casos_mm7d'] = covid_sp_alterado['casos_mm7d'].astype(float)
covid_sp_alterado['obitos_pc'] = covid_sp_alterado['obitos_pc'].astype(float)
covid_sp_alterado['obitos_mm7d'] = covid_sp_alterado['obitos_mm7d'].astype(float)
covid_sp_alterado['letalidade'] = covid_sp_alterado['letalidade'].astype(float)
covid_sp_alterado.dtypes

indice             int64
municipio         object
dia                int64
mes                int64
data              object
casos              int64
casos_novos        int64
casos_pc         float64
casos_mm7d       float64
obitos             int64
obitos_novos       int64
obitos_pc        float64
obitos_mm7d      float64
letalidade       float64
pop                int64
pop_60             int64
area             float64
semana_epidem      int64
densidade        float64
dtype: object
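
The two cells above repeat the same pair of operations for each column. A possible way to shorten this is a loop over the column names. The sketch below also uses `pd.to_numeric(..., errors='coerce')`, which turns unparseable entries into NaN instead of raising; that is a design choice assumed here, not something the notebook does. The toy table and the invalid `'n/d'` entry are illustrative:

```python
import pandas as pd

tabela = pd.DataFrame({
    'casos_mm7d': ['0,5', '1,25'],
    'letalidade': ['0,01', 'n/d'],   # 'n/d' is a deliberately invalid entry
})

for coluna in ['casos_mm7d', 'letalidade']:
    # Swap the decimal comma for a point, then convert to float.
    texto = tabela[coluna].str.replace(',', '.', regex=False)
    tabela[coluna] = pd.to_numeric(texto, errors='coerce')

print(tabela.dtypes)
print(tabela)
```

With `errors='coerce'`, it is worth checking `isnull().sum()` afterwards to see how many values could not be converted.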

In [89]:
# Convert to datetime:
# nome_tabela['nome_coluna'] = pd.to_datetime(nome_tabela['nome_coluna'])

covid_sp_alterado['data'] = pd.to_datetime(covid_sp_alterado['data'])
covid_sp_alterado.dtypes

indice                    int64
municipio                object
dia                       int64
mes                       int64
data             datetime64[ns]
casos                     int64
casos_novos               int64
casos_pc                float64
casos_mm7d              float64
obitos                    int64
obitos_novos              int64
obitos_pc               float64
obitos_mm7d             float64
letalidade              float64
pop                       int64
pop_60                    int64
area                    float64
semana_epidem             int64
densidade               float64
dtype: object
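
Once a column is `datetime64[ns]`, the `.dt` accessor and date arithmetic become available. A short sketch with two dates taken from the date column built earlier (the `format` argument is optional and assumed here to match the `YYYY-MM-DD` strings):

```python
import pandas as pd

datas = pd.Series(['2020-02-25', '2021-09-24'])
convertidas = pd.to_datetime(datas, format='%Y-%m-%d')

# Components of each date through the .dt accessor.
meses = convertidas.dt.month.tolist()

# Subtracting two timestamps gives a Timedelta.
intervalo = (convertidas.iloc[1] - convertidas.iloc[0]).days

print(meses)
print(intervalo)
```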

# **Saving (Exporting) the Cleaned DataFrame**

In [90]:
# nome_tabela.to_csv('caminho_arquivo', sep='separador_no_arquivo', encoding='tipo_encoding', index=True_False)

# index --> Whether the row index is written to the file as an extra column (not needed here, because pandas creates a default index again when the file is read)

covid_sp_alterado.to_csv('./dados/covid_sp_tratado.csv', sep=';', encoding='utf-8', index=False)
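
A CSV file does not store column types, so the datetime column comes back as text unless it is parsed again on reading. The round trip below sketches this with an in-memory buffer in place of a file on disk (the toy table and the names `buffer` and `relida` are hypothetical):

```python
import io

import pandas as pd

tabela = pd.DataFrame({
    'municipio': ['Adamantina', 'Adolfo'],
    'data': pd.to_datetime(['2020-02-25', '2020-02-25']),
    'casos_pc': [0.0, 1.5],
})

# Write without the index, using the same separator as the notebook.
buffer = io.StringIO()
tabela.to_csv(buffer, sep=';', index=False)

# Read it back, asking pandas to parse 'data' as dates again.
buffer.seek(0)
relida = pd.read_csv(buffer, sep=';', parse_dates=['data'])

print(list(relida.columns))
print(relida.dtypes)
```

Because `index=False` was used, no extra unnamed index column shows up when the data is read back.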

# **Challenge**

Turn the "iris" dataset into a DataFrame

In [91]:
# The iris dataset ships with scikit-learn
from sklearn import datasets

iris = datasets.load_iris()

In [92]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [93]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [94]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [95]:
iris.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

## **Approach 1**

In [96]:
iris_data = pd.DataFrame(iris.data)

iris_data.rename(
    columns={0: 'sepal-length(cm)', 1: 'sepal-width(cm)', 2: 'petal-length(cm)', 3: 'petal-width(cm)'},
    inplace=True
)

iris_target = pd.DataFrame(iris.target)
iris_target.rename(
    columns={0: 'classification'},
    inplace=True
)

iris_df = pd.concat([iris_data, iris_target], axis=1)
iris_df['classification'] = iris_df['classification'].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

In [97]:
iris_df.head()

Unnamed: 0,sepal-length(cm),sepal-width(cm),petal-length(cm),petal-width(cm),classification
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [98]:
iris_df['classification'].value_counts()

classification
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

## **Approach 2 (better)**

In [99]:
dados = pd.DataFrame(iris.data)
classes = pd.DataFrame(iris.target)

iris_df_2 = pd.concat([dados, classes], axis=1)

# Build a new list so that iris.feature_names itself is not modified
colnames = list(iris.feature_names) + ['classification']

iris_df_2.columns = colnames

nomes = iris.target_names

iris_df_2['classification'] = iris_df_2['classification'].replace({0: nomes[0],
                                                                   1: nomes[1],
                                                                   2: nomes[2]})

In [100]:
iris_df_2.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),classification
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [101]:
iris_df_2['classification'].value_counts()

classification
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
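
As a possible third variation, the integer codes can be mapped to names without `replace` by indexing the array of names with the array of codes. The sketch below uses short toy arrays as stand-ins for `iris.target` and `iris.target_names`, so it runs without scikit-learn; `codigos` and `classificacao` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for iris.target and iris.target_names.
codigos = np.array([0, 0, 1, 2, 2, 2])
nomes = np.array(['setosa', 'versicolor', 'virginica'])

# Indexing the array of names with the integer codes maps every code at once.
classificacao = pd.Series(nomes[codigos], name='classification')

print(classificacao.tolist())
print(classificacao.value_counts().to_dict())
```

scikit-learn's `load_iris` also accepts `as_frame=True`, which returns the data as pandas objects directly; the class column would still hold integer codes in that case.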