# 📊 Exploratory Data Analysis of the Sonda Network 🌦️

## 📌 Introduction
This notebook performs an exploratory analysis of meteorological data collected at several stations. The goal is to understand the structure of the data, assess its quality, and identify relevant patterns.

## 📂 Data Source
- Formatted CSV files stored on the FTP server
- They contain meteorological and solarimetric measurements, as well as camera data.

## 🔍 Analysis Objectives
1. **Load and explore the data**: check where the data is stored, its format, and its structure.
2. **Size and available variables**: understand the file sizes and the number of records and columns.
3. **Temporal analysis of the available data**: identify the period covered and any temporal gaps.
4. **Visualization of the spatial distribution of the stations**: check the geographic coverage of the measurements.
5. **Initial exploration of distributions**: histograms and basic statistics of the variables.
6. **Data quality analysis** *(final step)*: identify missing values, inconsistencies, and quality flags.

### 1. Load and Explore the Data
We start by listing the size of the datasets stored in the FTP directory.

In [1]:
# Directory where the files are located
DIRETORIO = '../sonda/dados_formatados/'

In [2]:
# Show the disk usage of each subdirectory (one per station), sorted by size in descending order
!du -h --max-depth=1 {DIRETORIO} | sort -rh

11G	../sonda/dados_formatados/
1,2G	../sonda/dados_formatados/BRB
1011M	../sonda/dados_formatados/PTR
989M	../sonda/dados_formatados/FLN
959M	../sonda/dados_formatados/PMA
759M	../sonda/dados_formatados/JOI
722M	../sonda/dados_formatados/CPA
714M	../sonda/dados_formatados/SMS
686M	../sonda/dados_formatados/SLZ
613M	../sonda/dados_formatados/NAT
536M	../sonda/dados_formatados/CGR
464M	../sonda/dados_formatados/SBR
413M	../sonda/dados_formatados/TMA
365M	../sonda/dados_formatados/MCL
349M	../sonda/dados_formatados/ORN
300M	../sonda/dados_formatados/UBE
284M	../sonda/dados_formatados/BJL
175M	../sonda/dados_formatados/TLG
174M	../sonda/dados_formatados/CAI
171M	../sonda/dados_formatados/CTB
55M	../sonda/dados_formatados/CBA
196K	../sonda/dados_formatados/TRI
196K	../sonda/dados_formatados/SPK
196K	../sonda/dados_formatados/SCR
196K	../sonda/dados_formatados/RLM
196K	../sonda/dados_formatados/OPO
196K	../sonda/dados_formatados/MDS
196K	../sonda/dados_formatados/LEB
196K	../sonda/dados_form
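For environments where `du` is not available, a similar listing can be approximated in pure Python. The sketch below is only an illustration: the helper names `directory_size` and `sizes_by_subdirectory` are hypothetical, and it sums apparent file sizes, whereas `du` reports disk usage, so the numbers may differ somewhat from the output above.

```python
import os
import tempfile

def directory_size(path: str) -> int:
    """Return the total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted
            if os.path.isfile(file_path) and not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total

def sizes_by_subdirectory(directory: str) -> list:
    """List (subdirectory name, size in bytes) pairs, largest first."""
    sizes = [
        (entry.name, directory_size(entry.path))
        for entry in os.scandir(directory)
        if entry.is_dir()
    ]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)

# Demo on a throwaway directory tree with two made-up station folders
with tempfile.TemporaryDirectory() as tmp:
    for station, n_bytes in [("AAA", 10), ("BBB", 300)]:
        os.makedirs(os.path.join(tmp, station))
        with open(os.path.join(tmp, station, "data.csv"), "wb") as f:
            f.write(b"x" * n_bytes)
    result = sizes_by_subdirectory(tmp)

print(result)
```

In the notebook, the same function could be called as `sizes_by_subdirectory(DIRETORIO)`.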

There are three types of data:
- Meteorological data
- Solarimetric data
- Anemometric data

Below, the files of each type are collected into a list to simplify the analysis.

In [3]:
import glob

# List all Meteorological files, using glob to match only .csv files
dados_metereologicos = glob.glob(DIRETORIO + "*/Meteorologicos/**/*.csv", recursive=True)
# Drop files whose path contains 'YYYY_MM' (e.g. 'YYYY_MM_MD_DQC')
dados_metereologicos = [arquivo for arquivo in dados_metereologicos if 'YYYY_MM' not in arquivo]

# List all Solarimetric files, using glob to match only .csv files
dados_solarimetricos = glob.glob(DIRETORIO + "*/Solarimetricos/**/*.csv", recursive=True)
# Drop files whose path contains 'YYYY_MM' (e.g. 'YYYY_MM_MD_DQC')
dados_solarimetricos = [arquivo for arquivo in dados_solarimetricos if 'YYYY_MM' not in arquivo]

# List all Anemometric files, using glob to match only .csv files
dados_anemometricos = glob.glob(DIRETORIO + "*/Anemometricos/**/*.csv", recursive=True)
# Drop files whose path contains 'YYYY_MM' (e.g. 'YYYY_MM_MD_DQC')
dados_anemometricos = [arquivo for arquivo in dados_anemometricos if 'YYYY_MM' not in arquivo]

In [4]:
# Print the number of files in each category
print(f"Quantidade de arquivos Meteorologicos: {len(dados_metereologicos)}")
print(f"Quantidade de arquivos Solarimetricos: {len(dados_solarimetricos)}")
print(f"Quantidade de arquivos Anemometricos: {len(dados_anemometricos)}")

Quantidade de arquivos Meteorologicos: 1036
Quantidade de arquivos Solarimetricos: 1022
Quantidade de arquivos Anemometricos: 0
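Beyond the totals per category, it can be useful to see how the files are spread across stations. The sketch below assumes the station acronym is the first directory below the data root, as in the path `../sonda/dados_formatados/SBR/Solarimetricos/2009/SBR_2009_11_SD_formatado.csv` that appears later in this notebook. The helper `files_per_station` and the other two example paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

def files_per_station(paths, root):
    """Count files per station, taking the first folder below `root` as the station."""
    counts = Counter()
    for path in paths:
        relative = PurePosixPath(path).relative_to(root)
        counts[relative.parts[0]] += 1
    return counts

ROOT = "../sonda/dados_formatados/"
example_paths = [
    # Path taken from an error message shown later in this notebook
    ROOT + "SBR/Solarimetricos/2009/SBR_2009_11_SD_formatado.csv",
    # Hypothetical paths, for illustration only
    ROOT + "SBR/Solarimetricos/2009/SBR_2009_12_SD_formatado.csv",
    ROOT + "FLN/Solarimetricos/2019/FLN_2019_04_SD_formatado.csv",
]

counts = files_per_station(example_paths, ROOT)
for station, n_files in counts.most_common():
    print(station, n_files)
```

In the notebook, `files_per_station(dados_solarimetricos, DIRETORIO)` would give the per-station breakdown of the solarimetric files.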


We will use the duckdb library to analyze the data. DuckDB is an in-process SQL database that can run entirely in memory, which makes it convenient for analyzing large volumes of data.

In [5]:
# Import DuckDB and the supporting libraries used for data handling
import duckdb
import os
import pandas as pd
from multiprocessing import Pool

In [6]:
# Connect to an in-memory DuckDB database; `con` is a module-level variable shared by the functions below
con = duckdb.connect(database=':memory:')

# Remove any leftover temporary directory
!rm -rf .tmp

In [7]:
# Create the table in the database
def criar_base(dados, base, arquivo):

    # If the Parquet file already exists, load the table from it
    if os.path.exists(arquivo):
        con.execute(f"CREATE TABLE {base} AS SELECT * FROM read_parquet('{arquivo}')")
    else:
        try:
            # Read the first CSV file, skipping the line right after the header
            df = pd.read_csv(dados[0], skiprows=[1])
            # Coerce values to numeric; where conversion fails, keep the original value
            df = df.apply(pd.to_numeric, errors='coerce').fillna(df)
            # Convert the timestamp column to datetime
            df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
            # Register the DataFrame in DuckDB so it can be queried by name
            con.register('df', df)
            # Create an empty table with the same columns and types, using a query that returns no rows
            con.execute(f"CREATE TABLE {base} AS SELECT * FROM df WHERE 1=0")
            print(f"Banco de dados criado com sucesso: {base}")
        except Exception as e:
            print(e)

In [8]:
# Paths of the Parquet files that store each table
ARQV_METEOROLOGICO = '../sonda/dados_meteorologicos.parquet'
ARQV_SOLARIMETRICA = '../sonda/dados_solarimetricos.parquet'

# Table names
BASE_METEOROLOGICO = 'base_meteorologica'
BASE_SOLARIMETRICA = 'base_solarimetrica'

In [9]:
# Create the meteorological table
criar_base(dados_metereologicos, BASE_METEOROLOGICO, ARQV_METEOROLOGICO)
# Create the solarimetric table
criar_base(dados_solarimetricos, BASE_SOLARIMETRICA, ARQV_SOLARIMETRICA)

Banco de dados criado com sucesso: base_meteorologica
Banco de dados criado com sucesso: base_solarimetrica


In [10]:
# Inspect the contents of the base_meteorologica table (still empty at this point)
con.execute(f"SELECT * FROM {BASE_METEOROLOGICO} LIMIT 5").fetch_df()

Unnamed: 0,acronym,timestamp,year,day,min,tp_sfc,humid_sfc,press,rain,ws10_avg,ws10_std,wd10_avg,wd10_std


In [11]:
# Inspect the contents of the base_solarimetrica table (still empty at this point)
con.execute(f"SELECT * FROM {BASE_SOLARIMETRICA} LIMIT 5").fetch_df()

Unnamed: 0,acronym,timestamp,year,day,min,glo_avg,glo_std,glo_max,glo_min,dif_avg,...,dir_min,lw_avg,lw_std,lw_max,lw_min,temp_glo,temp_dir,temp_dif,temp_dome,temp_case


In [12]:
# Populate a table with the contents of one CSV file
def inserir_dados(args):
    # Unpack the arguments
    base, arquivo, linha_linha = args
    # Read the new data from the CSV, skipping the second line but keeping the header
    new_data = pd.read_csv(arquivo, skiprows=[1])
    # Measurement columns (everything after the first five identifier columns)
    variaveis = new_data.columns[5:]
    # Replace decimal commas with dots in the measurement columns
    new_data[variaveis] = new_data[variaveis].replace({',': '.'}, regex=True)
    # Replace any run of digits followed by '-' with '0'
    new_data[variaveis] = new_data[variaveis].replace({r'\d+-': '0'}, regex=True)
    # Coerce values to numeric; where conversion fails, keep the original value
    new_data = new_data.apply(pd.to_numeric, errors='coerce').fillna(new_data)
    # Convert the timestamp column to datetime
    new_data['timestamp'] = pd.to_datetime(new_data['timestamp'], errors='coerce')
    # Take the station acronym from the first row of the file
    estacao = new_data['acronym'][0]
    # Earliest and latest timestamps in the file
    tempo = new_data['timestamp'].min(), new_data['timestamp'].max()
    # Register new_data in DuckDB so it can be queried by name
    con.register('new_data', new_data)
    # Row-by-row mode
    if linha_linha:
        # For each row and each variable, insert only if the value is not in the table yet
        for i in range(new_data.shape[0]):
            for v in variaveis:
                if con.execute(f"SELECT COUNT(*) FROM {base} WHERE acronym = '{estacao}' AND timestamp = '{new_data['timestamp'][i]}' AND {v} = {new_data[v][i]};").fetch_df().values[0][0] == 0:
                    try:
                        con.execute(f"INSERT INTO {base} SELECT * FROM new_data WHERE acronym = '{estacao}' AND timestamp = '{new_data['timestamp'][i]}' AND {v} = {new_data[v][i]};")
                    except Exception as e:
                        print(f"Erro ao inserir dados da variável {v} na base {base} do arquivo {arquivo}")
                        print(e)
    else:  # Insert the whole file at once
        # Skip the file if the table already has rows for this station and time range
        if con.execute(f"SELECT COUNT(*) FROM {base} WHERE acronym = '{estacao}' AND timestamp BETWEEN '{tempo[0]}' AND '{tempo[1]}';").fetch_df().values[0][0] > 0:
            return
        try:
            con.execute(f"INSERT INTO {base} SELECT * FROM new_data")
        except Exception as e:
            print(f"Erro ao inserir dados na base {base} do arquivo {arquivo}")
            print(e)

In [13]:
# Insert data in parallel
# Note: each worker process works with its own copy of the in-memory connection,
# so rows inserted by the workers may not be visible through this notebook's `con`
def inserir_dados_paralelo(base, arquivos, linha_linha=False):
    print(f"Inserindo dados na base {base} de forma paralela")
    with Pool() as pool:
        pool.map(inserir_dados, [(base, arquivo, linha_linha) for arquivo in arquivos])

# Insert data sequentially
def inserir_dados_sequencial(base, arquivos, linha_linha=False):
    print(f"Inserindo dados na base {base} de forma sequencial")
    for arquivo in arquivos:
        inserir_dados((base, arquivo, linha_linha))

In [16]:
# Populate the meteorological table sequentially
inserir_dados_sequencial(BASE_METEOROLOGICO, dados_metereologicos, linha_linha=False)

Inserindo dados na base base_meteorologica de forma sequencial


In [18]:
# Populate the solarimetric table sequentially
inserir_dados_sequencial(BASE_SOLARIMETRICA, dados_solarimetricos, linha_linha=False)

Inserindo dados na base base_solarimetrica de forma sequencial
Erro ao inserir dados na base base_solarimetrica do arquivo ../sonda/dados_formatados/SBR/Solarimetricos/2009/SBR_2009_11_SD_formatado.csv
Invalid Input Error: Failed to cast value: Could not convert string '-' to DOUBLE
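The error above indicates that at least one value in that file is a lone `-`, which cannot be cast to DOUBLE. One possible way to handle such placeholders, sketched below under assumptions, is to convert each measurement column strictly and let anything unparseable become NaN. This differs from the notebook's cleaning step, which falls back to the original value when conversion fails. The helper `to_numeric_strict` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up raw values mimicking the problems seen in the files:
# decimal commas, a lone '-' placeholder, and regular numbers
raw = pd.DataFrame({"glo_avg": ["1,5", "-", "2.25", "-3,0"]})

def to_numeric_strict(series: pd.Series) -> pd.Series:
    """Convert a text column to float; unparseable entries become NaN."""
    cleaned = (
        series.astype(str)
        .str.strip()
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    return pd.to_numeric(cleaned, errors="coerce")

raw["glo_avg"] = to_numeric_strict(raw["glo_avg"])
print(raw["glo_avg"].tolist())
```

Whether a lone `-` should be treated as missing is a judgment call that depends on how the source files encode gaps, so this is a sketch rather than a recommended fix.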


In [22]:
# Save the meteorological table to a Parquet file if the file does not exist yet
if not os.path.exists(ARQV_METEOROLOGICO):
    con.execute(f"COPY {BASE_METEOROLOGICO} TO '{ARQV_METEOROLOGICO}' (FORMAT 'parquet')")

# Save the solarimetric table to a Parquet file if the file does not exist yet
if not os.path.exists(ARQV_SOLARIMETRICA):
    con.execute(f"COPY {BASE_SOLARIMETRICA} TO '{ARQV_SOLARIMETRICA}' (FORMAT 'parquet')")

With the tables created and loaded in memory, we can begin the exploratory data analysis.

In [17]:
# Inspect the contents of the base_meteorologica table
con.execute(f"SELECT * FROM {BASE_METEOROLOGICO} LIMIT 5").fetch_df()

Unnamed: 0,acronym,timestamp,year,day,min,tp_sfc,humid_sfc,press,rain,ws10_avg,ws10_std,wd10_avg,wd10_std
0,FLN,2019-03-01 00:00:00,2019,60,0,22.38,97.7,1010.284,0.72,5.2698,0.923294,23.422166,4.079194
1,FLN,2019-03-01 00:10:00,2019,60,10,21.22,101.1,1010.503,0.0,4.6318,0.352908,29.199733,1.963143
2,FLN,2019-03-01 00:20:00,2019,60,20,21.15,101.1,1010.476,0.18,3.9738,0.71351,33.52802,2.825004
3,FLN,2019-03-01 00:30:00,2019,60,30,21.08,101.1,1010.474,0.0,3.144,0.646525,22.548946,3.690781
4,FLN,2019-03-01 00:40:00,2019,60,40,21.2,101.1,1010.629,0.0,0.8769,0.440114,58.076207,8.97748


In [21]:
# Inspect the contents of the base_solarimetrica table
con.execute(f"SELECT * FROM {BASE_SOLARIMETRICA} LIMIT 5").fetch_df()

Unnamed: 0,acronym,timestamp,year,day,min,glo_avg,glo_std,glo_max,glo_min,dif_avg,...,dir_min,lw_avg,lw_std,lw_max,lw_min,temp_glo,temp_dir,temp_dif,temp_dome,temp_case
0,FLN,2019-04-01 00:00:00,2019,91,0,-1.786,0.135,-1.397,-2.196,-2.651,...,-0.479,390.7,0.126,391.0,390.5,23.93,22.8,109.3,7999,13.01
1,FLN,2019-04-01 00:01:00,2019,91,1,-1.808,0.124,-1.397,-2.196,-2.647,...,-0.479,391.4,0.249,391.8,390.9,23.93,22.8,109.3,7999,13.01
2,FLN,2019-04-01 00:02:00,2019,91,2,-1.815,0.145,-1.397,-2.196,-2.647,...,-0.479,392.0,0.075,392.1,391.8,23.94,22.8,109.3,7999,13.01
3,FLN,2019-04-01 00:03:00,2019,91,3,-1.8,0.136,-1.397,-2.196,-2.647,...,-0.479,392.3,0.107,392.4,392.0,23.94,22.79,109.3,7999,13.01
4,FLN,2019-04-01 00:04:00,2019,91,4,-1.797,0.163,-1.397,-2.196,-2.666,...,-0.479,392.4,0.048,392.5,392.3,23.93,22.77,109.3,7999,13.01


In [23]:
# Keep only the meteorological measurement columns (skip the first five identifier columns)
colunas_meteoro = con.execute(f"SELECT * FROM {BASE_METEOROLOGICO} LIMIT 1").description
colunas_meteoro = [c[0] for c in colunas_meteoro[5:]]
print(f"Colunas de dados meteorológicos: {colunas_meteoro}")

# Keep only the solarimetric measurement columns (skip the first five identifier columns)
colunas_solar = con.execute(f"SELECT * FROM {BASE_SOLARIMETRICA} LIMIT 1").description
colunas_solar = [c[0] for c in colunas_solar[5:]]
print(f"Colunas de dados solarimétricos: {colunas_solar}")

Colunas de dados meteorológicos: ['tp_sfc', 'humid_sfc', 'press', 'rain', 'ws10_avg', 'ws10_std', 'wd10_avg', 'wd10_std']
Colunas de dados solarimétricos: ['glo_avg', 'glo_std', 'glo_max', 'glo_min', 'dif_avg', 'dif_std', 'dif_max', 'dif_min', 'par_avg', 'par_std', 'par_max', 'par_min', 'lux_avg', 'lux_std', 'lux_max', 'lux_min', 'dir_avg', 'dir_std', 'dir_max', 'dir_min', 'lw_avg', 'lw_std', 'lw_max', 'lw_min', 'temp_glo', 'temp_dir', 'temp_dif', 'temp_dome', 'temp_case']


The first analysis checks the temporal coverage of the data, to identify the period covered and any temporal gaps.

In [26]:
def verifica_temporal(base):
    # Group by acronym and by month (timestamp truncated to the month) and count the records per month
    query = f"""
    SELECT acronym, DATE_TRUNC('month', timestamp) AS data, COUNT(*) AS registros
    FROM {base}
    GROUP BY acronym, data
    ORDER BY data
    """
    return con.execute(query).fetch_df()

In [27]:
# Check the temporal coverage of the meteorological data
verifica_temporal(BASE_METEOROLOGICO)

Unnamed: 0,acronym,data,registros
0,SBR,2004-01-01,4464
1,SBR,2004-02-01,4176
2,CGR,2004-02-01,4176
3,CGR,2004-03-01,4464
4,SBR,2004-03-01,4464
...,...,...,...
1029,BRB,2020-03-01,4464
1030,BRB,2020-04-01,4320
1031,BRB,2020-05-01,4464
1032,BRB,2020-06-01,4320
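To read these monthly counts as coverage, each one can be compared with the number of records a complete month would hold. The sketch below assumes one meteorological record every 10 minutes, which is inferred from the sample timestamps (00:00, 00:10, 00:20, ...) and is not stated by the data provider. The helper names are hypothetical, and the sample rows are copied from the output above.

```python
import calendar
from datetime import date

# Assumption: one record every 10 minutes, i.e. 144 records per day
RECORDS_PER_DAY = 24 * 60 // 10

def expected_records(month_start: date) -> int:
    """Number of records in a complete month at the assumed 10-minute cadence."""
    days_in_month = calendar.monthrange(month_start.year, month_start.month)[1]
    return days_in_month * RECORDS_PER_DAY

def completeness(month_start: date, observed: int) -> float:
    """Fraction of the expected records actually present in that month."""
    return observed / expected_records(month_start)

# Rows taken from the monthly counts printed above
rows = [
    ("SBR", date(2004, 1, 1), 4464),
    ("SBR", date(2004, 2, 1), 4176),
    ("BRB", date(2020, 4, 1), 4320),
]

for acronym, month_start, observed in rows:
    share = completeness(month_start, observed)
    print(f"{acronym} {month_start:%Y-%m}: {share:.1%} of expected records")
```

Under this assumption the three sample months are complete. Note that a full row count does not by itself mean the measurements are valid, since rows can still hold invalid-data codes.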


In [28]:
def verificar_dados_invalidos(base, colunas):
    query = f"""
    SELECT acronym, COUNT(*) AS total_dados,
    """
    # For each column in `colunas`, count the rows holding each invalid-data code
    for coluna in colunas:
        query += f"""
        SUM(CASE WHEN \"{coluna}\" = 3333.0 THEN 1 ELSE 0 END) AS {coluna}_3333,
        SUM(CASE WHEN \"{coluna}\" = -5555.0 THEN 1 ELSE 0 END) AS {coluna}_minus_5555,
        """
    # Strip trailing whitespace first, then the final comma
    query = query.rstrip().rstrip(",")
    # Append the FROM and GROUP BY clauses
    query += f"""
    FROM \"{base}\"
    GROUP BY acronym
    """
    # Run the query
    df = con.execute(query).fetch_df()
    return df

In [29]:
verificar_dados_invalidos(BASE_METEOROLOGICO, colunas_meteoro)

Unnamed: 0,acronym,total_dados,tp_sfc_3333,tp_sfc_minus_5555,humid_sfc_3333,humid_sfc_minus_5555,press_3333,press_minus_5555,rain_3333,rain_minus_5555,ws10_avg_3333,ws10_avg_minus_5555,ws10_std_3333,ws10_std_minus_5555,wd10_avg_3333,wd10_avg_minus_5555,wd10_std_3333,wd10_std_minus_5555
0,TLG,65664,3627.0,0.0,3627.0,0.0,3627.0,0.0,0.0,65664.0,0.0,65664.0,0.0,65664.0,0.0,65664.0,0.0,65664.0
1,BJL,105120,8839.0,0.0,8839.0,0.0,8839.0,0.0,0.0,105120.0,0.0,105120.0,0.0,105120.0,0.0,105120.0,0.0,105120.0
2,CAI,65808,7397.0,0.0,7397.0,0.0,7397.0,0.0,0.0,65808.0,0.0,65808.0,0.0,65808.0,0.0,65808.0,0.0,65808.0
3,MCL,157680,10728.0,0.0,10728.0,0.0,10728.0,0.0,0.0,148896.0,0.0,148896.0,0.0,148896.0,0.0,148896.0,0.0,148896.0
4,PTR,324432,31342.0,0.0,31343.0,0.0,31342.0,0.0,31339.0,0.0,31340.0,0.0,31340.0,0.0,31339.0,0.0,31339.0,0.0
5,SMS,302400,26949.0,0.0,22509.0,0.0,22476.0,0.0,22479.0,0.0,23129.0,0.0,23129.0,0.0,22480.0,0.0,22480.0,0.0
6,TMA,166176,7354.0,0.0,7354.0,0.0,7354.0,0.0,0.0,135468.0,0.0,135468.0,0.0,135468.0,0.0,135468.0,0.0,135468.0
7,CTB,65664,5705.0,0.0,5705.0,0.0,5705.0,0.0,0.0,65664.0,5987.0,0.0,5987.0,0.0,5705.0,0.0,5705.0,0.0
8,FLN,411552,18812.0,0.0,18810.0,0.0,18801.0,0.0,18801.0,0.0,42172.0,8784.0,42172.0,8784.0,16906.0,8784.0,16906.0,8784.0
9,NAT,263088,12335.0,0.0,12339.0,0.0,12335.0,0.0,12335.0,0.0,12335.0,0.0,12335.0,0.0,12335.0,0.0,12335.0,0.0
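Raw counts are hard to compare across stations with different record totals, so it may help to express them as a share of `total_dados`. The sketch below uses a few values copied from the table above. The helper `invalid_share` and the dictionary layout are hypothetical.

```python
def invalid_share(count: float, total: float) -> float:
    """Percentage of rows holding a given invalid-data code."""
    return 100.0 * count / total if total else 0.0

# Values copied from the meteorological table above (two stations, two columns)
sample = {
    "TLG": {"total_dados": 65664, "tp_sfc_3333": 3627, "rain_minus_5555": 65664},
    "FLN": {"total_dados": 411552, "tp_sfc_3333": 18812, "rain_minus_5555": 0},
}

shares = {
    station: {
        column: invalid_share(value, row["total_dados"])
        for column, value in row.items()
        if column != "total_dados"
    }
    for station, row in sample.items()
}

for station, columns in shares.items():
    for column, pct in columns.items():
        print(f"{station} {column}: {pct:.2f}%")
```

For TLG, every `rain` value carries the -5555 code, which suggests the variable is absent for that station throughout the period rather than failing intermittently.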


In [30]:
verificar_dados_invalidos(BASE_SOLARIMETRICA, colunas_solar)

Unnamed: 0,acronym,total_dados,glo_avg_3333,glo_avg_minus_5555,glo_std_3333,glo_std_minus_5555,glo_max_3333,glo_max_minus_5555,glo_min_3333,glo_min_minus_5555,...,temp_glo_3333,temp_glo_minus_5555,temp_dir_3333,temp_dir_minus_5555,temp_dif_3333,temp_dif_minus_5555,temp_dome_3333,temp_dome_minus_5555,temp_case_3333,temp_case_minus_5555
0,CGR,2548800,342819.0,0.0,342819.0,0.0,342818.0,0.0,342819.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,TLG,656640,36263.0,0.0,36263.0,0.0,36261.0,0.0,36261.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BJL,1051200,137063.0,0.0,137063.0,0.0,121013.0,0.0,121013.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,CAI,658080,73906.0,0.0,73906.0,0.0,73906.0,0.0,73906.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,PMA,4563328,537022.0,0.0,537021.0,0.0,536870.0,0.0,537020.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,SLZ,3241440,91784.0,0.0,91784.0,0.0,91781.0,0.0,91783.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,SBR,2325255,394703.0,0.0,392358.0,0.0,392341.0,0.0,394704.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,CPA,2672640,61813.0,0.0,61813.0,0.0,61791.0,0.0,61791.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,BRB,3287520,355789.0,0.0,354468.0,0.0,355781.0,0.0,355789.0,0.0,...,4.0,0.0,4.0,0.0,4.0,0.0,4.0,0.0,4.0,0.0
9,UBE,1270080,197516.0,0.0,197516.0,0.0,197516.0,0.0,197516.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
