<a href="https://colab.research.google.com/github/isacNepo/aulas_faculdade/blob/master/aula_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Pandas

Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free _software_ released under the BSD license. The name is derived from the term "panel data", used in statistics and econometrics for data sets that include several sampling units (individuals, companies, etc.) followed over time. (Wikipédia, 2022)

## Pandas Documentation

https://pandas.pydata.org/pandas-docs/stable/

## Important Definitions

#### Tabular Data

* First row: header
* Each column: a variable
* Each row: an observation
* Each table/file: one level of observation

![](https://github.com/storopoli/ciencia-de-dados/blob/main/notebooks/images/dados-tabulares.png?raw=1)

## Pandas Elements

* *DataFrame*: rectangular table of data
    - A collection of *Series*
    - All sharing the same *index*
* *Series*: a column of the *DataFrame*
    - 1-D *arrays*
    - Made up of:
        - A sequence of values
            - *numeric*
            - *string*
            - *bool*
        - A sequence of *index* labels
        
![](https://media.geeksforgeeks.org/wp-content/cdn-uploads/creating_dataframe1.png)

![](https://storage.googleapis.com/lds-media/images/series-and-dataframe.width-1200.png)
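The relationship between a *DataFrame* and its *Series* can be sketched with a tiny table. The column names and values here are made up for illustration and do not come from any dataset used in this lesson.

```python
import pandas as pd

# Hypothetical toy data: three observations of two variables
df = pd.DataFrame(
    {
        "city": ["Recife", "Manaus", "Curitiba"],
        "population_millions": [1.5, 2.1, 1.8],
    },
    index=["a", "b", "c"],
)

# Selecting a single column returns a Series
population = df["population_millions"]
print(type(population).__name__)

# Every column shares the DataFrame's index
print(population.index.equals(df.index))
print(df["city"].index.equals(df.index))

# A Series pairs each index label with one value
print(population["b"])
```

Selecting with double brackets, as in `df[["city"]]`, would return a one-column *DataFrame* instead of a *Series*.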

## Data Types

| Pandas dtype      | Python type    | NumPy type                                | Use                                              |
|-------------------|----------------|-------------------------------------------|--------------------------------------------------|
| `object`          | `str` or mixed | *string_*, *unicode_*, mixed              | Text, or a mix of numeric and non-numeric values |
| `int64`           | `int`          | *int_*, `int8`, `int16`, `int32`, `int64` | Integers                                         |
| `float64`         | `float`        | *float_*, `float16`, `float32`, `float64` | Real numbers                                     |
| `bool`            | `bool`         | `bool`                                    | True or false                                    |
| `datetime64`      | NA             | `datetime64[ns]`                          | Date and time                                    |
| `timedelta64[ns]` | NA             | NA                                        | Difference between two `datetimes`               |
| `category`        | NA             | NA                                        | Finite list of text values                       |
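To see these types in practice, the sketch below builds a small table and converts some columns. All names and values are hypothetical; the exact dtype shown for text columns and the time resolution of the date columns may vary with the installed pandas version.

```python
import pandas as pd

# Hypothetical toy data; dates and categories start out as plain text
df = pd.DataFrame(
    {
        "name": ["Ana", "Bruno", "Carla"],
        "age": [23, 35, 41],
        "height_m": [1.62, 1.80, 1.75],
        "enrolled": [True, False, True],
        "start": ["2020-01-22", "2020-02-15", "2020-03-01"],
        "shift": ["morning", "night", "morning"],
    }
)
print(df.dtypes)

# Text dates become a datetime64 column with pd.to_datetime
df["start"] = pd.to_datetime(df["start"])

# Subtracting datetimes produces a timedelta64 column
df["elapsed"] = df["start"] - df["start"].min()

# A text column with few distinct values can be stored as category
df["shift"] = df["shift"].astype("category")

print(df.dtypes)
```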

## Importing/Exporting Data

<img src="https://github.com/storopoli/ciencia-de-dados/blob/main/notebooks/images/pandas-io.svg?raw=1" alt="" style="background-color: white;"/>


| Format    | Input                 | Output            | Notes                                          |
| --------- | --------------------- | ----------------- | ---------------------------------------------- |
| CSV       | `pd.read_csv()`       | `.to_csv()`       | Text file: CSV, TSV, etc.                      |
| XLS/XLSX  | `pd.read_excel()`     | `.to_excel()`     | Spreadsheet                                    |
| HDF       | `pd.read_hdf()`       | `.to_hdf()`       | HDF5 database                                  |
| SQL       | `pd.read_sql()`       | `.to_sql()`       | SQL table                                      |
| JSON      | `pd.read_json()`      | `.to_json()`      | JavaScript Object Notation                     |
| MSGPACK   | `pd.read_msgpack()`   | `.to_msgpack()`   | Portable binary format (removed in pandas 1.0) |
| HTML      | `pd.read_html()`      | `.to_html()`      | HTML code                                      |
| GBQ       | `pd.read_gbq()`       | `.to_gbq()`       | Google BigQuery format                         |
| DTA       | `pd.read_stata()`     | `.to_stata()`     | Stata                                          |
| Parquet   | `pd.read_parquet()`   | `.to_parquet()`   | Apache Parquet                                 |
| Feather   | `pd.read_feather()`   | `.to_feather()`   | Apache Arrow                                   |
| Any       | `pd.read_clipboard()` | `.to_clipboard()` | E.g., from an HTML page                        |
| Any       | `pd.read_pickle()`    | `.to_pickle()`    | (Structured) Python object                     |
## Most Commonly Used Import Types

### `CSV` (.CSV)

Pay attention to the following arguments of [`pd.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):

* Path (`filepath_or_buffer`, the first argument)
* `sep`: `','` by default; for European/Brazilian files use `';'`
* `decimal`: `'.'` by default; for European/Brazilian files use `','`
* `header`: `'infer'` by default, so `pandas` tries to guess the header row
* `index_col`: `None` by default, but it can be a column of the file, counted from zero (e.g., for the 2nd column use `index_col=1`)
* `names`: `None` by default, but it can be a list with the names of the variables (columns)
* `skiprows`: `None` by default (rows to skip)
* `na_values`: `None` by default, but it can be any string (e.g., `'NA'`)
* `thousands`: `None` by default, but it can be `','` or `'.'`
* `encoding`
    - `'utf8'`: default
    - `'latin1'`: for files saved in Latin-1 that contain accented characters (ç à é î ã)
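As an illustration of how these arguments combine, the sketch below reads a small text in the Brazilian convention. The content is invented for the example and is read from memory; for a real file, the path or URL would replace the `io.StringIO(...)` argument.

```python
import io

import pandas as pd

# Hypothetical content in the Brazilian convention:
# ';' separates fields, ',' marks decimals, '.' groups thousands
raw = """cidade;populacao;renda_media
Recife;1.488.920;2.345,67
Manaus;2.063.547;NA
"""

df = pd.read_csv(
    io.StringIO(raw),  # a path or URL would go here for a real file
    sep=";",
    decimal=",",
    thousands=".",
    na_values=["NA"],
    index_col=0,  # first column (counted from zero) becomes the index
)

print(df)
print(df.dtypes)
```

Without `decimal` and `thousands`, the two numeric columns would be read as text (`object`), which is a common sign that a file uses this convention.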
    
### `Excel` (.XLS and .XLSX)

Pay attention to the following arguments of [`pd.read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html):
* Path (`io`, the first argument)
* `sheet_name`: `0` by default, but it can be any `string` or `int`
    - `sheet_name=0`: first sheet of the workbook
    - `sheet_name=2`: third sheet of the workbook
    - `sheet_name='Plan1'`: the sheet named `Plan1`, typically the first sheet in workbooks created by Portuguese-language Excel
    - `sheet_name='name_chosen_by_the_user'`

---
# Working with Pandas

In [1]:
# Importing libraries (Pandas)
import pandas as pd  # "as" gives the library an alias

In [2]:
# Let's declare a variable to store the external data.
# When the data is stored, it is turned into a DataFrame.
# "d(ata)f(rame)": the "df" prefix signals a DataFrame.

# Loading the data remotely
df_covid = pd.read_csv("http://www.edsonmelo.com.br/datasets/covid_19_data.csv")

# Inspecting a DataFrame

This is the first step when working with data, because it is essential to know **what** we are dealing with: the quality, the quantity, the distribution, the missing data, etc.

In [3]:
# Finding out about: columns, data types, number of rows
# Remember that from now on all the data is in the variable "df_covid"
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98252 entries, 0 to 98251
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   SNo              98252 non-null  int64  
 1   ObservationDate  98252 non-null  object 
 2   Province/State   67099 non-null  object 
 3   Country/Region   98252 non-null  object 
 4   Last Update      98252 non-null  object 
 5   Confirmed        98252 non-null  float64
 6   Deaths           98252 non-null  float64
 7   Recovered        98252 non-null  float64
dtypes: float64(3), int64(1), object(4)
memory usage: 6.0+ MB
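In the output above, `Province/State` has 67099 non-null values out of 98252 rows, so 31153 values are missing. One way to get that count directly is `.isna().sum()`. The sketch below uses a small hypothetical stand-in for `df_covid` so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_covid with one incomplete column
df = pd.DataFrame(
    {
        "Province/State": ["Anhui", np.nan, "Beijing", np.nan],
        "Country/Region": ["Mainland China", "Brazil", "Mainland China", "Italy"],
        "Confirmed": [1.0, 0.0, 14.0, 2.0],
    }
)

# Missing values per column: total rows minus the "Non-Null Count" of .info()
missing = df.isna().sum()
print(missing)

# The same information as a percentage of the rows
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_pct)
```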


In [4]:
# Checking the number of rows and columns
df_covid.shape

(98252, 8)

In [5]:
# Reading the values individually
print(df_covid.shape[0])
print(df_covid.shape[1])

98252
8
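Since `.shape` is an ordinary Python tuple, it can also be unpacked into two variables. A minimal sketch on a hypothetical three-row table:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# .shape is a plain tuple: (rows, columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() of a DataFrame is its number of rows
print(len(df) == n_rows)

# The number of columns is also the length of .columns
print(len(df.columns) == n_cols)
```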


In [6]:
# Showing some of the DataFrame's data
# Let's start at the beginning (the top, or head)
# This command shows the first five rows of the DataFrame
df_covid.head()

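|   | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|-----|-----------------|----------------|----------------|-------------|-----------|--------|-----------|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |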


In [7]:
# We can pass a parameter to head(n_rows)
df_covid.head(10)

|   | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|-----|-----------------|----------------|----------------|-------------|-----------|--------|-----------|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |
| 5 | 6 | 01/22/2020 | Guangdong | Mainland China | 1/22/2020 17:00 | 26.0 | 0.0 | 0.0 |
| 6 | 7 | 01/22/2020 | Guangxi | Mainland China | 1/22/2020 17:00 | 2.0 | 0.0 | 0.0 |
| 7 | 8 | 01/22/2020 | Guizhou | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 8 | 9 | 01/22/2020 | Hainan | Mainland China | 1/22/2020 17:00 | 4.0 | 0.0 | 0.0 |
| 9 | 10 | 01/22/2020 | Hebei | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |


In [8]:
# If I can see the top, then I can also see the end (tail)
df_covid.tail()

|   | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|-----|-----------------|----------------|----------------|-------------|-----------|--------|-----------|
| 98247 | 98248 | 08/29/2020 | Zaporizhia Oblast | Ukraine | 2020-08-30 04:28:22 | 1520.0 | 25.0 | 883.0 |
| 98248 | 98249 | 08/29/2020 | Zeeland | Netherlands | 2020-08-30 04:28:22 | 1048.0 | 72.0 | 0.0 |
| 98249 | 98250 | 08/29/2020 | Zhejiang | Mainland China | 2020-08-30 04:28:22 | 1277.0 | 1.0 | 1268.0 |
| 98250 | 98251 | 08/29/2020 | Zhytomyr Oblast | Ukraine | 2020-08-30 04:28:22 | 3155.0 | 61.0 | 1837.0 |
| 98251 | 98252 | 08/29/2020 | Zuid-Holland | Netherlands | 2020-08-30 04:28:22 | 18774.0 | 1344.0 | 0.0 |


In [9]:
df_covid.tail(10)

|   | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|-----|-----------------|----------------|----------------|-------------|-----------|--------|-----------|
| 98242 | 98243 | 08/29/2020 | Yukon | Canada | 2020-08-30 04:28:22 | 15.0 | 0.0 | 15.0 |
| 98243 | 98244 | 08/29/2020 | Yunnan | Mainland China | 2020-08-30 04:28:22 | 199.0 | 2.0 | 191.0 |
| 98244 | 98245 | 08/29/2020 | Zabaykalsky Krai | Russia | 2020-08-30 04:28:22 | 4541.0 | 56.0 | 4207.0 |
| 98245 | 98246 | 08/29/2020 | Zacatecas | Mexico | 2020-08-30 04:28:22 | 5195.0 | 470.0 | 3970.0 |
| 98246 | 98247 | 08/29/2020 | Zakarpattia Oblast | Ukraine | 2020-08-30 04:28:22 | 7385.0 | 259.0 | 3220.0 |
| 98247 | 98248 | 08/29/2020 | Zaporizhia Oblast | Ukraine | 2020-08-30 04:28:22 | 1520.0 | 25.0 | 883.0 |
| 98248 | 98249 | 08/29/2020 | Zeeland | Netherlands | 2020-08-30 04:28:22 | 1048.0 | 72.0 | 0.0 |
| 98249 | 98250 | 08/29/2020 | Zhejiang | Mainland China | 2020-08-30 04:28:22 | 1277.0 | 1.0 | 1268.0 |
| 98250 | 98251 | 08/29/2020 | Zhytomyr Oblast | Ukraine | 2020-08-30 04:28:22 | 3155.0 | 61.0 | 1837.0 |
| 98251 | 98252 | 08/29/2020 | Zuid-Holland | Netherlands | 2020-08-30 04:28:22 | 18774.0 | 1344.0 | 0.0 |


In [10]:
# But what if I want to see a random slice?
df_covid.sample(5)

|   | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|-----|-----------------|----------------|----------------|-------------|-----------|--------|-----------|
| 67561 | 67562 | 07/19/2020 | Moscow Oblast | Russia | 2020-07-20 05:34:40 | 61743.0 | 1058.0 | 43367.0 |
| 69617 | 69618 | 07/22/2020 | Gansu | Mainland China | 2020-07-23 05:15:04 | 167.0 | 2.0 | 165.0 |
| 87479 | 87480 | 08/15/2020 | Huanuco | Peru | 2020-08-16 04:27:42 | 8904.0 | 285.0 | 0.0 |
| 64565 | 64566 | 07/15/2020 | Los Lagos | Chile | 2020-07-16 04:44:59 | 2456.0 | 24.0 | 1960.0 |
| 97155 | 97156 | 08/28/2020 | Kabardino-Balkarian Republic | Russia | 2020-08-29 04:28:19 | 6439.0 | 77.0 | 6043.0 |
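Because `.sample()` draws rows at random, running the same cell again usually shows different rows. When a repeatable draw is needed, the `random_state` parameter can be fixed. A minimal sketch on a hypothetical 100-row table (the seed 42 is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({"value": range(100)})

# Without random_state, each call may return different rows
print(df.sample(5))

# With a fixed random_state, the draw is repeatable
first = df.sample(5, random_state=42)
second = df.sample(5, random_state=42)
print(first.equals(second))

# frac draws a fraction of the rows instead of a fixed count
tenth = df.sample(frac=0.1, random_state=42)
print(len(tenth))
```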


# Summary

To inspect a DataFrame, try to __always__ use the following commands, in this order:
* `.info()`
* `.head()` - 5 is the default
* `.tail()` - 5 is the default
* `.sample(5)` - the default is a single row, so specify the quantity
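One possible way to make this routine a habit is to wrap the four commands in a small helper. The function name `inspect_df` and the demo table are hypothetical, not part of pandas or of this lesson's datasets.

```python
import pandas as pd


def inspect_df(df: pd.DataFrame, n: int = 5) -> None:
    """Print the inspection steps in the order listed above."""
    df.info()
    print(df.head(n))
    print(df.tail(n))
    # sample() cannot draw more rows than exist, so cap n for small tables
    print(df.sample(min(n, len(df))))


# Hypothetical demo table with 8 rows
df_demo = pd.DataFrame({"x": range(8), "y": list("abcdefgh")})
inspect_df(df_demo, n=3)
```

With the helper defined, a call such as `inspect_df(df_covid)` would run the whole sequence on the dataset loaded earlier.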

# Practice activity

1. Load each of the datasets available at [http://www.edsonmelo.com.br/datasets](http://www.edsonmelo.com.br/datasets) into a different variable.
2. Apply the DataFrame inspection to each of the datasets.


In [12]:
import pandas as pd

In [16]:
df_petra = pd.read_csv("http://www.edsonmelo.com.br/datasets/PETR4.csv")
df_byline = pd.read_excel("http://www.edsonmelo.com.br/datasets/byline2.xlsx")
df_coronaLimpo = pd.read_csv("http://www.edsonmelo.com.br/datasets/corona_limpo.csv")
df_countriesOfWorld = pd.read_csv("http://www.edsonmelo.com.br/datasets/countries%20of%20the%20world.csv")
df_dadosAlunos = pd.read_csv("http://www.edsonmelo.com.br/datasets/dados_alunos.csv")
df_dadosLimpos = pd.read_csv("http://www.edsonmelo.com.br/datasets/dados_limpos.csv")
df_imoveisData = pd.read_csv("http://www.edsonmelo.com.br/datasets/imoveis_data.csv")
df_indiceAlunos = pd.read_excel("http://www.edsonmelo.com.br/datasets/indice_alunos.xlsx")
df_mtcars = pd.read_csv("http://www.edsonmelo.com.br/datasets/mtcars.csv")
df_notas = pd.read_csv("http://www.edsonmelo.com.br/datasets/notas.csv")
df_notasAlunos = pd.read_csv("http://www.edsonmelo.com.br/datasets/notas_alunos.csv")
df_pesquisaAgrupada = pd.read_csv("http://www.edsonmelo.com.br/datasets/pesquisa_agrupada.csv")
df_pesquisaAFS = pd.read_csv("http://www.edsonmelo.com.br/datasets/pesquisa_agrupada_filtro_status.csv")
df_pesquisaCompleta = pd.read_csv("http://www.edsonmelo.com.br/datasets/pesquisa_completa.csv")
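With this many files, an alternative to one variable per dataset is a dictionary filled in a loop, choosing the reader from the file extension. The sketch below is only one possible arrangement: `READERS` and `load_all` are hypothetical names, and the demo writes two tiny made-up CSV files to a temporary folder so that it runs offline.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One reader per file extension (only the two formats used in the cell above)
READERS = {".csv": pd.read_csv, ".xlsx": pd.read_excel}


def load_all(sources: dict) -> dict:
    """Return {name: DataFrame}, choosing the reader from each file extension."""
    frames = {}
    for name, location in sources.items():
        suffix = Path(str(location)).suffix.lower()
        frames[name] = READERS[suffix](location)
    return frames


# Demo on temporary local files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    path_a = Path(tmp) / "notas.csv"
    path_b = Path(tmp) / "carros.csv"
    pd.DataFrame({"Nome": ["Ana"], "Nota1": [7.5]}).to_csv(path_a, index=False)
    pd.DataFrame({"model": ["A", "B"], "mpg": [21.0, 30.5]}).to_csv(path_b, index=False)
    dfs = load_all({"notas": path_a, "carros": path_b})

for name, frame in dfs.items():
    print(name, frame.shape)
```

The same function should accept the URLs from the cell above as values, for example `load_all({"petr4": "http://www.edsonmelo.com.br/datasets/PETR4.csv"})`, and the inspection can then run inside a single `for` loop over the dictionary.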

In [17]:
df_petra.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       1599 non-null   object 
 1   Open       1599 non-null   float64
 2   High       1599 non-null   float64
 3   Low        1599 non-null   float64
 4   Close      1599 non-null   float64
 5   Adj Close  1599 non-null   float64
 6   Volume     1599 non-null   int64  
dtypes: float64(5), int64(1), object(1)
memory usage: 87.6+ KB


In [18]:
df_byline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20097 entries, 0 to 20096
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   au_total_in_article  20097 non-null  int64  
 1   au_author_byline     20097 non-null  float64
 2   au_author_position   20097 non-null  int64  
 3   theory               20097 non-null  float64
 4   methodology          20097 non-null  float64
 5   logistic             20097 non-null  float64
 6   au_h_index           20095 non-null  float64
dtypes: float64(5), int64(2)
memory usage: 1.1 MB


In [19]:
df_coronaLimpo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223 entries, 0 to 222
Data columns (total 28 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   ObservationDate                     223 non-null    object 
 1   Country/Region                      223 non-null    object 
 2   Last Update                         223 non-null    object 
 3   Confirmed                           223 non-null    float64
 4   Deaths                              223 non-null    float64
 5   Recovered                           223 non-null    float64
 6   Confirmed + Deaths                  223 non-null    float64
 7   Death by Cases                      212 non-null    float64
 8   Country                             187 non-null    object 
 9   Region                              187 non-null    object 
 10  Population                          187 non-null    float64
 11  Area (sq. mi.)                      187 non-n

In [20]:
df_countriesOfWorld.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Country                             227 non-null    object
 1   Region                              227 non-null    object
 2   Population                          227 non-null    int64 
 3   Area (sq. mi.)                      227 non-null    int64 
 4   Pop. Density (per sq. mi.)          227 non-null    object
 5   Coastline (coast/area ratio)        227 non-null    object
 6   Net migration                       224 non-null    object
 7   Infant mortality (per 1000 births)  224 non-null    object
 8   GDP ($ per capita)                  226 non-null    object
 9   Literacy (%)                        209 non-null    object
 10  Phones (per 1000)                   223 non-null    object
 11  Arable (%)                          225 non-null    object

In [21]:
df_dadosAlunos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24456 entries, 0 to 24455
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             24456 non-null  int64 
 1   identificador  24456 non-null  object
 2   cursosem       24456 non-null  object
 3   modalidade     24444 non-null  object
 4   bolsista       24456 non-null  object
 5   chipvivo       24456 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 1.1+ MB


In [22]:
df_dadosLimpos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24456 entries, 0 to 24455
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   matricula   24456 non-null  object
 1   bolsista    24456 non-null  object
 2   ingresso    24456 non-null  int64 
 3   campus      24456 non-null  object
 4   curso       24456 non-null  object
 5   semestre    24456 non-null  int64 
 6   modalidade  24456 non-null  object
 7   estrutura   24456 non-null  object
 8   chipvivo    24456 non-null  int64 
dtypes: int64(3), object(6)
memory usage: 1.7+ MB


In [23]:
df_imoveisData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

In [24]:
df_indiceAlunos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Nome     50 non-null     object 
 1   Indice1  50 non-null     float64
 2   Indice2  50 non-null     float64
 3   Indice3  50 non-null     float64
 4   Indice4  50 non-null     float64
 5   Indice5  50 non-null     float64
dtypes: float64(5), object(1)
memory usage: 2.5+ KB


In [25]:
df_mtcars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   model   32 non-null     object 
 1   mpg     32 non-null     float64
 2   cyl     32 non-null     int64  
 3   disp    32 non-null     float64
 4   hp      32 non-null     int64  
 5   drat    32 non-null     float64
 6   wt      32 non-null     float64
 7   qsec    32 non-null     float64
 8   vs      32 non-null     int64  
 9   am      32 non-null     int64  
 10  gear    32 non-null     int64  
 11  carb    32 non-null     int64  
dtypes: float64(5), int64(6), object(1)
memory usage: 3.1+ KB


In [26]:
df_notas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Abel Marques  49 non-null     object 
 1   3.78          49 non-null     float64
 2   9.00          49 non-null     float64
 3   2.00          49 non-null     float64
 4   3.00          49 non-null     float64
 5   9.00.1        49 non-null     float64
dtypes: float64(5), object(1)
memory usage: 2.4+ KB
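The column names in the output above (`Abel Marques`, `3.78`, `9.00`, ...) look like data, and there are 49 entries where `notas_alunos.csv` has 50. This suggests that `notas.csv` has no header row, so its first record was consumed as the header. A sketch of the usual fix with `header=None` and `names`, on in-memory text: the second row and the column labels are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical headerless content shaped like notas.csv
raw = """Abel Marques,3.78,9.00,2.00,3.00,9.00
Bia Souza,7.50,6.00,8.25,5.00,10.00
"""

# Default: the first data row is consumed as column names
wrong = pd.read_csv(io.StringIO(raw))
print(list(wrong.columns))
print(len(wrong))

# header=None keeps every row as data; names supplies the labels
# (these labels are assumed, mirroring the columns of notas_alunos.csv)
cols = ["Nome", "Nota1", "Nota2", "Nota3", "Nota4", "Nota5"]
right = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(right)
print(len(right))
```

The repeated value `9.00` is renamed `9.00.1` in the first read because pandas makes duplicate column names unique, which matches the last column of the output above.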


In [27]:
df_notasAlunos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Nome    50 non-null     object 
 1   Nota1   50 non-null     float64
 2   Nota2   50 non-null     float64
 3   Nota3   50 non-null     float64
 4   Nota4   50 non-null     float64
 5   Nota5   50 non-null     float64
dtypes: float64(5), object(1)
memory usage: 2.5+ KB


In [28]:
df_pesquisaAgrupada.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1672 entries, 0 to 1671
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   status_saude      1672 non-null   int64  
 1   sexo              1672 non-null   int64  
 2   idade             1672 non-null   int64  
 3   imc               1672 non-null   float64
 4   fumante           1672 non-null   int64  
 5   ingere_alcool     1672 non-null   int64  
 6   ingestao_fvl      1672 non-null   int64  
 7   atividade_fisica  1672 non-null   int64  
 8   horas_sono        1672 non-null   int64  
 9   doenca_familia    1672 non-null   int64  
 10  doenca_pessoal    1672 non-null   int64  
 11  dores             1672 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 156.9 KB


In [29]:
df_pesquisaAFS.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450 entries, 0 to 449
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   status_saude      450 non-null    int64  
 1   sexo              450 non-null    int64  
 2   idade             450 non-null    int64  
 3   imc               450 non-null    float64
 4   fumante           450 non-null    int64  
 5   ingere_alcool     450 non-null    int64  
 6   ingestao_fvl      450 non-null    int64  
 7   atividade_fisica  450 non-null    int64  
 8   horas_sono        450 non-null    int64  
 9   doenca_familia    450 non-null    int64  
 10  doenca_pessoal    450 non-null    int64  
 11  dores             450 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 42.3 KB


In [30]:
df_pesquisaCompleta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1672 entries, 0 to 1671
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   status_saude       1672 non-null   int64  
 1   sexo               1672 non-null   int64  
 2   idade              1672 non-null   int64  
 3   peso               1672 non-null   int64  
 4   altura             1672 non-null   int64  
 5   imc                1672 non-null   float64
 6   fumante            1672 non-null   int64  
 7   ingere_alcool      1672 non-null   int64  
 8   ingestao_fvl       1672 non-null   int64  
 9   atividade_fisica   1672 non-null   int64  
 10  horas_sono         1672 non-null   int64  
 11  doenca_familia_1   1672 non-null   int64  
 12  doenca_familia_2   1672 non-null   int64  
 13  doenca_familia_3   1672 non-null   int64  
 14  doenca_familia_4   1672 non-null   int64  
 15  doenca_familia_5   1672 non-null   int64  
 16  doenca_familia_6   1672 