#Introduction




This project analyzes the historical record of residential solar panel use in Brazil and forecasts future demand. It uses open datasets published by ANEEL (Agência Nacional de Energia Elétrica, the Brazilian electricity regulatory agency) at the following site:

https://dadosabertos.aneel.gov.br/organization/agencia-nacional-de-energia-eletrica

Using the "Relação de empreendimentos de Geração Distribuída" dataset, we can characterize the current state of residential solar energy use in Brazil and forecast its use over the next 4 years.

##Importing libraries

In [1]:
!pip install pycaret[full]

Collecting dask<2023.7.1,>=2022.9.0 (from dask[dataframe,distributed]<2023.7.1,>=2022.9.0; python_version >= "3.8" and extra == "dask"->fugue[dask]>=0.8.0; extra == "full"->pycaret[full])
  Using cached dask-2023.7.0-py3-none-any.whl.metadata (3.6 kB)
Collecting packaging (from deprecation>=2.1.0->pycaret[full])
  Using cached packaging-21.3-py3-none-any.whl (40 kB)
Collecting distributed==2023.7.0 (from dask[dataframe,distributed]<2023.7.1,>=2022.9.0; python_version >= "3.8" and extra == "dask"->fugue[dask]>=0.8.0; extra == "full"->pycaret[full])
  Using cached distributed-2023.7.0-py3-none-any.whl.metadata (3.3 kB)
Using cached dask-2023.7.0-py3-none-any.whl (1.2 MB)
Using cached distributed-2023.7.0-py3-none-any.whl (981 kB)
Installing collected packages: packaging, dask, distributed
  Attempting uninstall: packaging
    Found existing installation: packaging 23.2
    Uninstalling packaging-23.2:
      Successfully uninstalled packaging-23.2
  Attempting uninstall: dask
    Found ex



In [2]:
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt

#Exploring the collected data

Let us check which information is available in the dataset and clean it by dropping the columns that will not be used.

In [3]:
df = pd.read_csv("empreendimento-geracao-distribuida.csv", encoding='latin1', sep=";", low_memory=False)   # read the CSV file
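
Since only a few columns are used later, one option is to let `read_csv` select them and parse the comma decimal separator directly. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for the ANEEL file. The column names come from the dataset, while `sample_csv`, `needed_cols`, and `sample` are illustrative names, and the ISO date format is an assumption.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the ANEEL CSV (values are made up for illustration).
sample_csv = io.StringIO(
    "SigTipoGeracao;DscClasseConsumo;DthAtualizaCadastralEmpreend;MdaPotenciaInstaladaKW;NomAgente\n"
    "UFV;Residencial;2023-01-05;3,25;A\n"
    "UFV;Comercial;2023-01-06;10,00;B\n"
    "EOL;Residencial;2023-01-07;2,50;C\n"
)

# Columns actually used in the rest of the analysis.
needed_cols = [
    "SigTipoGeracao",
    "DscClasseConsumo",
    "DthAtualizaCadastralEmpreend",
    "MdaPotenciaInstaladaKW",
]

sample = pd.read_csv(
    sample_csv,
    sep=";",
    decimal=",",            # parse "3,25" as 3.25
    usecols=needed_cols,    # skip columns that are never used
    parse_dates=["DthAtualizaCadastralEmpreend"],
)
print(sample.dtypes)
```

Loading fewer columns reduces memory use on a file with over two million rows, and `decimal=","` removes the need for a separate string-to-float conversion step.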

In [4]:
df.shape    # number of rows and columns

(2349997, 33)

In [5]:
df

Unnamed: 0,DatGeracaoConjuntoDados,AnmPeriodoReferencia,NumCNPJDistribuidora,SigAgente,NomAgente,CodClasseConsumo,DscClasseConsumo,CodSubGrupoTarifario,DscSubGrupoTarifario,codUFibge,...,QtdUCRecebeCredito,SigTipoGeracao,DscFonteGeracao,DscPorte,MdaPotenciaInstaladaKW,NumCoordNEmpreendimento,NumCoordEEmpreendimento,NomSubEstacao,NumCoordESub,NumCoordNSub
0,2024-02-04,02/2024,4.065033e+12,ELETROACRE,ENERGISA ACRE - DISTRIBUIDORA DE ENERGIA S.A,2,Comercial,11,B3,12.0,...,1,UFV,Radiação solar,Microgeracao,3250,-6785,-996,,,
1,2024-02-04,02/2024,4.065033e+12,ELETROACRE,ENERGISA ACRE - DISTRIBUIDORA DE ENERGIA S.A,1,Residencial,9,B1,12.0,...,1,UFV,Radiação solar,Microgeracao,400,-7078,-815,,,
2,2024-02-04,02/2024,4.065033e+12,ELETROACRE,ENERGISA ACRE - DISTRIBUIDORA DE ENERGIA S.A,2,Comercial,11,B3,12.0,...,1,UFV,Radiação solar,Microgeracao,200,,,,,
3,2024-02-04,02/2024,4.065033e+12,ELETROACRE,ENERGISA ACRE - DISTRIBUIDORA DE ENERGIA S.A,1,Residencial,9,B1,12.0,...,1,UFV,Radiação solar,Microgeracao,200,-6786,-995,,,
4,2024-02-04,02/2024,4.065033e+12,ELETROACRE,ENERGISA ACRE - DISTRIBUIDORA DE ENERGIA S.A,1,Residencial,9,B1,12.0,...,1,UFV,Radiação solar,Microgeracao,500,-6787,-996,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349992,2024-02-04,02/2024,2.508603e+13,ETO,ENERGISA TOCANTINS DISTRIBUIDORA DE ENERGIA S.A.,1,Residencial,9,B1,17.0,...,1,UFV,Radiação solar,Microgeracao,1000,,,,,
2349993,2024-02-04,02/2024,2.508603e+13,ETO,ENERGISA TOCANTINS DISTRIBUIDORA DE ENERGIA S.A.,3,Rural,10,B2,17.0,...,1,UFV,Radiação solar,Microgeracao,4000,,,,,
2349994,2024-02-04,02/2024,2.508603e+13,ETO,ENERGISA TOCANTINS DISTRIBUIDORA DE ENERGIA S.A.,1,Residencial,9,B1,17.0,...,1,UFV,Radiação solar,Microgeracao,770,,,,,
2349995,2024-02-04,02/2024,2.508603e+13,ETO,ENERGISA TOCANTINS DISTRIBUIDORA DE ENERGIA S.A.,1,Residencial,9,B1,17.0,...,1,UFV,Radiação solar,Microgeracao,2000,,,,,


In [6]:
df = df[df['SigTipoGeracao'] == 'UFV']    # keep only solar photovoltaic (UFV) units
df = df[df['DscClasseConsumo'] == 'Residencial']    # keep only residential consumers
df = df.fillna('')    # replace missing values with empty strings
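
One side effect of `fillna('')` is worth keeping in mind: once missing values are replaced by empty strings, `isnull()` no longer reports them, so a later NaN check finds nothing. The toy example below (made-up values, illustrative names) shows this, along with one possible alternative based on `pd.to_numeric(..., errors="coerce")`.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up data).
toy = pd.DataFrame({"MdaPotenciaInstaladaKW": ["3,25", np.nan, "5,00"]})
filled = toy.fillna("")

# Missing values are visible before filling, but not after.
missing_before = int(toy["MdaPotenciaInstaladaKW"].isnull().sum())
missing_after = int(filled["MdaPotenciaInstaladaKW"].isnull().sum())

# Coercing to numeric turns the empty string back into NaN.
as_float = pd.to_numeric(
    filled["MdaPotenciaInstaladaKW"].str.replace(",", ".", regex=False),
    errors="coerce",
)
print(missing_before, missing_after, int(as_float.isna().sum()))
```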

In [7]:
df.columns

Index(['DatGeracaoConjuntoDados', 'AnmPeriodoReferencia',
       'NumCNPJDistribuidora', 'SigAgente', 'NomAgente', 'CodClasseConsumo',
       'DscClasseConsumo', 'CodSubGrupoTarifario', 'DscSubGrupoTarifario',
       'codUFibge', 'SigUF', 'codRegiao', 'NomRegiao', 'CodMunicipioIbge',
       'NomMunicipio', 'CodCEP', 'SigTipoConsumidor', 'NumCPFCNPJ',
       'NomeTitularEmpreendimento', 'CodEmpreendimento',
       'DthAtualizaCadastralEmpreend', 'SigModalidadeEmpreendimento',
       'DscModalidadeHabilitado', 'QtdUCRecebeCredito', 'SigTipoGeracao',
       'DscFonteGeracao', 'DscPorte', 'MdaPotenciaInstaladaKW',
       'NumCoordNEmpreendimento', 'NumCoordEEmpreendimento', 'NomSubEstacao',
       'NumCoordESub', 'NumCoordNSub'],
      dtype='object')

In [8]:
# Keep only the 'DthAtualizaCadastralEmpreend' and 'MdaPotenciaInstaladaKW' columns
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
df = df[['DthAtualizaCadastralEmpreend', 'MdaPotenciaInstaladaKW']].copy()

# Convert 'DthAtualizaCadastralEmpreend' to datetime
df['DthAtualizaCadastralEmpreend'] = pd.to_datetime(df['DthAtualizaCadastralEmpreend'])

# Check whether 'MdaPotenciaInstaladaKW' is stored as strings
is_string = df['MdaPotenciaInstaladaKW'].dtype == 'object'

# Check whether 'MdaPotenciaInstaladaKW' contains any NaN
has_nan = df['MdaPotenciaInstaladaKW'].isnull().values.any()

# If 'MdaPotenciaInstaladaKW' is not stored as strings, convert it to string
if not is_string:
    df['MdaPotenciaInstaladaKW'] = df['MdaPotenciaInstaladaKW'].astype(str)

# If 'MdaPotenciaInstaladaKW' contains any NaN, drop the rows with NaN values
if has_nan:
    df = df.dropna()

# Replace decimal commas with points and convert 'MdaPotenciaInstaladaKW' to float
df['MdaPotenciaInstaladaKW'] = df['MdaPotenciaInstaladaKW'].str.replace(',', '.').astype(float)

# Remove every record dated 2022-09-01
df = df[df['DthAtualizaCadastralEmpreend'].dt.date != pd.to_datetime('2022-09-01').date()]


In [9]:
df.describe()

Unnamed: 0,MdaPotenciaInstaladaKW
count,1838767.0
mean,6.715687
std,11.43666
min,0.0
25%,3.24
50%,5.0
75%,7.5
max,5000.0


#**PyCaret**

In [10]:
!pip install --upgrade packaging

Collecting packaging
  Using cached packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Using cached packaging-23.2-py3-none-any.whl (53 kB)
Installing collected packages: packaging
  Attempting uninstall: packaging
    Found existing installation: packaging 21.3
    Uninstalling packaging-21.3:
      Successfully uninstalled packaging-21.3
Successfully installed packaging-23.2


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlflow 1.30.1 requires packaging<22, but you have packaging 23.2 which is incompatible.


In [11]:
!pip install packaging



In [12]:
from pycaret.time_series import *
import pandas as pd

# Use the date column as the index
df.set_index('DthAtualizaCadastralEmpreend', inplace=True)

In [13]:
# Ensure the index is a DatetimeIndex
df.index = pd.to_datetime(df.index)

In [14]:
# Aggregate the data to daily totals
df_daily = df.resample('D').sum()
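
Note that `resample('D').sum()` creates a row for every calendar day in the range, and days without any record get a total of 0 rather than a missing value. The toy example below (made-up values) illustrates this. The `min_count=1` variant is shown only as one possible way to keep empty days as NaN.

```python
import pandas as pd

# Three records on two different days; 2023-01-02 has no record at all.
idx = pd.to_datetime(["2023-01-01 09:00", "2023-01-01 17:00", "2023-01-03 12:00"])
toy = pd.DataFrame({"MdaPotenciaInstaladaKW": [3.0, 5.0, 2.0]}, index=idx)

# Default behavior: empty days sum to 0.
daily_sum = toy.resample("D").sum()

# With min_count=1, days without records become NaN instead of 0.
daily_nan = toy.resample("D").sum(min_count=1)

print(daily_sum)
print(daily_nan)
```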

In [15]:
!pip install --upgrade packaging distributed

Collecting distributed
  Using cached distributed-2024.1.1-py3-none-any.whl.metadata (3.4 kB)
Collecting dask==2024.1.1 (from distributed)
  Using cached dask-2024.1.1-py3-none-any.whl.metadata (3.7 kB)
Using cached distributed-2024.1.1-py3-none-any.whl (1.0 MB)
Using cached dask-2024.1.1-py3-none-any.whl (1.2 MB)
Installing collected packages: dask, distributed
  Attempting uninstall: dask
    Found existing installation: dask 2023.7.0
    Uninstalling dask-2023.7.0:
      Successfully uninstalled dask-2023.7.0
  Attempting uninstall: distributed
    Found existing installation: distributed 2023.7.0
    Uninstalling distributed-2023.7.0:
      Successfully uninstalled distributed-2023.7.0
Successfully installed dask-2024.1.1 distributed-2024.1.1


In [16]:
# Set up the experiment
s = setup(data = df_daily, target = 'MdaPotenciaInstaladaKW', fh = 12, fold = 3, session_id = 123)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,MdaPotenciaInstaladaKW
2,Approach,Univariate
3,Exogenous Variables,Not Present
4,Original data shape,"(5341, 1)"
5,Transformed data shape,"(5341, 1)"
6,Transformed train set shape,"(5329, 1)"
7,Transformed test set shape,"(12, 1)"
8,Rows with missing values,0.0%
9,Fold Generator,ExpandingWindowSplitter


In [17]:
# Compare models and select the best one
best_model = compare_models()

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2,TT (Sec)
exp_smooth,Exponential Smoothing,4.4602,2.2165,1840.1745,2946.2138,0.9864,0.2653,0.6934,0.3733
ets,ETS,4.5359,2.2423,1871.4015,2980.5406,0.9871,0.2673,0.6861,1.95
snaive,Seasonal Naive Forecaster,4.6439,2.2336,1916.1314,2968.853,0.5604,0.2886,0.6786,1.3167
omp_cds_dt,Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending,4.6555,2.1248,1920.2569,2824.0454,0.6043,0.3024,0.7142,0.2167
huber_cds_dt,Huber w/ Cond. Deseasonalize & Detrending,4.6791,2.2135,1930.7807,2942.3053,0.562,0.2972,0.6828,0.2467
arima,ARIMA,4.6841,2.2501,1932.7621,2990.8284,0.5629,0.2886,0.6733,0.11
stlf,STLF,4.718,2.2789,1949.09,3031.0288,0.7286,0.2906,0.6675,0.1167
lr_cds_dt,Linear w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9915,2753.1553,0.7764,0.3935,0.7286,0.5367
lasso_cds_dt,Lasso w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9911,2753.155,0.7764,0.3935,0.7286,0.2333
ridge_cds_dt,Ridge w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9915,2753.1553,0.7764,0.3935,0.7286,0.48


In [18]:
# Tune the hyperparameters of the best model
tuned_model = tune_model(best_model)

Unnamed: 0,cutoff,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,2023-12-16,5.3627,2.8299,2194.9808,3740.9965,1.2514,0.3411,0.628
1,2023-12-28,3.9756,2.1963,1639.9005,2921.8287,1.4916,0.2633,0.6945
2,2024-01-09,4.0422,1.6233,1685.6424,2175.8164,0.2163,0.1915,0.7576
Mean,NaT,4.4602,2.2165,1840.1745,2946.2138,0.9864,0.2653,0.6934
SD,NaT,0.6388,0.4928,251.5799,639.2147,0.5533,0.0611,0.0529


Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  12 | elapsed:    9.4s remaining:    1.8s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    9.4s finished


In [19]:
# Generate predictions
predictions = predict_model(tuned_model)

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,Exponential Smoothing,10.9329,4.3569,4592.9297,5853.0581,60.3467,0.8138,-2.8642
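
The holdout R2 of -2.8642 is negative, which happens whenever the squared forecast errors exceed those of simply predicting the mean of the actual values. MAPE, in turn, can become very large when some actual values are close to zero. A minimal NumPy sketch with made-up numbers follows; the helper names `r2` and `mape` are illustrative.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def mape(y_true, y_pred):
    """Mean absolute percentage error, expressed as a fraction."""
    return np.mean(np.abs((y_true - y_pred) / y_true))


# A constant forecast far from the mean gives a negative R2.
r2_value = r2(np.array([1.0, 2.0, 3.0]), np.array([3.0, 3.0, 3.0]))

# A single near-zero actual value dominates MAPE.
mape_value = mape(np.array([0.1, 100.0]), np.array([10.0, 100.0]))

print(r2_value, mape_value)
```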


In [20]:
# Resample the predictions to monthly totals
predictions_monthly = predictions.resample('M').sum()

# Resample the predictions to annual totals
predictions_annual = predictions.resample('A').sum()
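
With `fh = 12`, the prediction frame holds only 12 daily values, so monthly or annual sums cover partial periods. The toy series below illustrates this. The start date 2024-01-22 and the constant values are assumptions for illustration, and grouping by `to_period` is used here in place of `resample`.

```python
import pandas as pd

# Twelve daily "predictions" of 1.0 each, starting on a hypothetical date.
idx = pd.date_range("2024-01-22", periods=12, freq="D")
toy_pred = pd.Series(1.0, index=idx, name="y_pred")

# Group the daily values by calendar month.
by_month = toy_pred.groupby(toy_pred.index.to_period("M"))
monthly = by_month.sum()
days_covered = by_month.size()

print(monthly)
print(days_covered)
```

Here January contributes 10 days and February only 2, so neither monthly total represents a full month.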



In [21]:
# Plot the model's performance
plot_model(tuned_model)

In [22]:
# Plot the model's forecast
plot_model(tuned_model, plot='forecast', data_kwargs = {'fh': 730})
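
The introduction targets a four-year forecast, while `'fh': 730` covers roughly two years of daily steps. The number of daily steps for a given span can be computed from dates, as sketched below. The value of `last_observation` is a hypothetical date chosen for illustration, not read from the data.

```python
import pandas as pd

# Hypothetical date of the last observation in the daily series.
last_observation = pd.Timestamp("2024-02-02")

# End of a four-year forecast window, counted in calendar years.
target_end = last_observation + pd.DateOffset(years=4)

# Number of daily steps needed to reach that date (leap days included).
horizon_days = (target_end - last_observation).days
print(horizon_days)
```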

In [23]:
# Plot the model's residuals
plot_model(tuned_model, plot='residuals')

In [24]:
# Set the outlier date
outlier_date = '2022-09-01'  # replace with the date of the outlier

# Remove the outlier
df_daily = df_daily[df_daily.index != outlier_date]
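
Dropping a row from a daily series leaves a gap in the index. The sketch below, with made-up values and illustrative names, shows the gap reappearing as NaN after `asfreq("D")` and one possible way to fill it by linear interpolation. Whether interpolation is appropriate for this dataset is a judgment call.

```python
import pandas as pd

# Four consecutive days with an obvious spike on 2022-09-01 (made-up values).
idx = pd.date_range("2022-08-30", periods=4, freq="D")
daily = pd.Series([10.0, 12.0, 500.0, 14.0], index=idx, name="MdaPotenciaInstaladaKW")

# Drop the outlier day; the index now skips 2022-09-01.
outlier_day = pd.Timestamp("2022-09-01")
dropped = daily[daily.index != outlier_day]

# Restoring a regular daily frequency brings the day back as NaN.
regular = dropped.asfreq("D")

# One option: fill the gap by linear interpolation between neighbors.
filled = regular.interpolate()
print(filled)
```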

In [25]:
# Aggregate the data to daily totals
# (this recomputes df_daily from df, so the row dropped in the previous cell is restored)
df_daily = df.resample('D').sum()

# Set the frequency of the datetime index
df_daily = df_daily.asfreq('D')

# Set up the experiment
s = setup(data = df_daily, target = 'MdaPotenciaInstaladaKW', fh = 12, fold = 3, session_id = 123)

# Rest of the code...

Unnamed: 0,Description,Value
0,session_id,123
1,Target,MdaPotenciaInstaladaKW
2,Approach,Univariate
3,Exogenous Variables,Not Present
4,Original data shape,"(5341, 1)"
5,Transformed data shape,"(5341, 1)"
6,Transformed train set shape,"(5329, 1)"
7,Transformed test set shape,"(12, 1)"
8,Rows with missing values,0.0%
9,Fold Generator,ExpandingWindowSplitter


In [26]:
# Compare models and select the best one
best = compare_models()

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2,TT (Sec)
exp_smooth,Exponential Smoothing,4.4602,2.2165,1840.1745,2946.2138,0.9864,0.2653,0.6934,0.3033
ets,ETS,4.5359,2.2423,1871.4015,2980.5406,0.9871,0.2673,0.6861,0.16
snaive,Seasonal Naive Forecaster,4.6439,2.2336,1916.1314,2968.853,0.5604,0.2886,0.6786,0.0567
omp_cds_dt,Orthogonal Matching Pursuit w/ Cond. Deseasonalize & Detrending,4.6555,2.1248,1920.2569,2824.0454,0.6043,0.3024,0.7142,0.18
huber_cds_dt,Huber w/ Cond. Deseasonalize & Detrending,4.6791,2.2135,1930.7807,2942.3053,0.562,0.2972,0.6828,0.1767
arima,ARIMA,4.6841,2.2501,1932.7621,2990.8284,0.5629,0.2886,0.6733,0.1033
stlf,STLF,4.718,2.2789,1949.09,3031.0288,0.7286,0.2906,0.6675,0.0833
lr_cds_dt,Linear w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9915,2753.1553,0.7764,0.3935,0.7286,0.6833
lasso_cds_dt,Lasso w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9911,2753.155,0.7764,0.3935,0.7286,0.1467
ridge_cds_dt,Ridge w/ Cond. Deseasonalize & Detrending,5.1756,2.0714,2134.9915,2753.1553,0.7764,0.3935,0.7286,0.39


In [27]:
# Tune the hyperparameters of the best model
tuned = tune_model(best)

Unnamed: 0,cutoff,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,2023-12-16,5.3627,2.8299,2194.9808,3740.9965,1.2514,0.3411,0.628
1,2023-12-28,3.9756,2.1963,1639.9005,2921.8287,1.4916,0.2633,0.6945
2,2024-01-09,4.0422,1.6233,1685.6424,2175.8164,0.2163,0.1915,0.7576
Mean,NaT,4.4602,2.2165,1840.1745,2946.2138,0.9864,0.2653,0.6934
SD,NaT,0.6388,0.4928,251.5799,639.2147,0.5533,0.0611,0.0529


Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  12 | elapsed:    0.8s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  12 out of  12 | elapsed:    0.9s finished


In [28]:
predict = predict_model(tuned)

Unnamed: 0,Model,MASE,RMSSE,MAE,RMSE,MAPE,SMAPE,R2
0,Exponential Smoothing,10.9329,4.3569,4592.9297,5853.0581,60.3467,0.8138,-2.8642


In [29]:
plot_model(tuned, plot = 'forecast', data_kwargs = {'fh': 365})