# ☀️ Renewable Power Generation Prediction

## Problem Statement
The goal is to predict renewable energy generation from weather conditions. Because weather directly drives generation, we treat this as a regression problem on the weather features rather than as a time series problem.
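One way to read the regression framing: each record is treated as an independent sample, with the weather columns as features and `Energy delta[Wh]` as the target, so rows can be shuffled instead of being split chronologically. The sketch below is a minimal illustration on a toy frame. The column names come from the dataset, but the values are invented and the 80/20 split is an assumption, not necessarily what the notebook does later.

```python
import pandas as pd

# Toy frame using a few of the dataset's column names; the values are made up.
toy = pd.DataFrame({
    'Time': pd.date_range('2017-01-01', periods=10, freq='15min').astype(str),
    'GHI': [0.0, 0.0, 1.2, 5.4, 20.1, 40.3, 35.0, 12.2, 2.1, 0.0],
    'temp': [1.6, 1.6, 1.7, 2.0, 3.1, 4.5, 4.4, 3.9, 3.0, 2.5],
    'Energy delta[Wh]': [0, 0, 5, 40, 310, 720, 600, 150, 24, 0],
})

target = 'Energy delta[Wh]'
X = toy.drop(columns=['Time', target])  # the timestamp is not used as a feature
y = toy[target]

# Rows are shuffled before splitting: no temporal ordering is assumed.
train_idx = X.sample(frac=0.8, random_state=42).index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.drop(index=train_idx), y.drop(index=train_idx)
print(X_train.shape, X_test.shape)
```

A time series treatment would instead keep the rows in order and hold out the most recent period.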

# 📚 1. Import Libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.float_format', lambda x:'%.3f' % x)
import warnings
warnings.simplefilter(action="ignore", category=Warning) # Ignore warnings

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sb
# Configure default settings for plots
sb.set(style='ticks')
sb.set_palette('Paired')

# 📖 2. Load Data

In [2]:
raw_data = pd.read_csv('./Data/Raw/Renewable.csv')
print(f'The dataset has {raw_data.shape[0]} rows and {raw_data.shape[1]} columns')

The dataset has 196776 rows and 17 columns


In [3]:
# Creating a copy of the dataframe in case we need the raw data in the next sections
df = raw_data.copy()

## 2.1. Data Overview

In [4]:
df.head()

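| | Time | Energy delta[Wh] | GHI | temp | pressure | humidity | wind_speed | rain_1h | snow_1h | clouds_all | isSun | sunlightTime | dayLength | SunlightTime/daylength | weather_type | hour | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-01-01 00:00:00 | 0 | 0.0 | 1.6 | 1021 | 100 | 4.9 | 0.0 | 0.0 | 100 | 0 | 0 | 450 | 0.0 | 4 | 0 | 1 |
| 1 | 2017-01-01 00:15:00 | 0 | 0.0 | 1.6 | 1021 | 100 | 4.9 | 0.0 | 0.0 | 100 | 0 | 0 | 450 | 0.0 | 4 | 0 | 1 |
| 2 | 2017-01-01 00:30:00 | 0 | 0.0 | 1.6 | 1021 | 100 | 4.9 | 0.0 | 0.0 | 100 | 0 | 0 | 450 | 0.0 | 4 | 0 | 1 |
| 3 | 2017-01-01 00:45:00 | 0 | 0.0 | 1.6 | 1021 | 100 | 4.9 | 0.0 | 0.0 | 100 | 0 | 0 | 450 | 0.0 | 4 | 0 | 1 |
| 4 | 2017-01-01 01:00:00 | 0 | 0.0 | 1.7 | 1020 | 100 | 5.2 | 0.0 | 0.0 | 100 | 0 | 0 | 450 | 0.0 | 4 | 1 | 1 |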


In [5]:
df.tail()

| | Time | Energy delta[Wh] | GHI | temp | pressure | humidity | wind_speed | rain_1h | snow_1h | clouds_all | isSun | sunlightTime | dayLength | SunlightTime/daylength | weather_type | hour | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 196771 | 2022-08-31 16:45:00 | 118 | 23.7 | 18.6 | 1023 | 57 | 3.8 | 0.0 | 0.0 | 52 | 1 | 780 | 825 | 0.95 | 3 | 16 | 8 |
| 196772 | 2022-08-31 17:00:00 | 82 | 15.6 | 18.5 | 1023 | 61 | 4.2 | 0.0 | 0.0 | 60 | 1 | 795 | 825 | 0.96 | 3 | 17 | 8 |
| 196773 | 2022-08-31 17:15:00 | 51 | 8.0 | 18.5 | 1023 | 61 | 4.2 | 0.0 | 0.0 | 60 | 1 | 810 | 825 | 0.98 | 3 | 17 | 8 |
| 196774 | 2022-08-31 17:30:00 | 24 | 2.1 | 18.5 | 1023 | 61 | 4.2 | 0.0 | 0.0 | 60 | 1 | 825 | 825 | 1.0 | 3 | 17 | 8 |
| 196775 | 2022-08-31 17:45:00 | 0 | 0.0 | 18.5 | 1023 | 61 | 4.2 | 0.0 | 0.0 | 60 | 0 | 0 | 825 | 0.0 | 3 | 17 | 8 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196776 entries, 0 to 196775
Data columns (total 17 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Time                    196776 non-null  object 
 1   Energy delta[Wh]        196776 non-null  int64  
 2   GHI                     196776 non-null  float64
 3   temp                    196776 non-null  float64
 4   pressure                196776 non-null  int64  
 5   humidity                196776 non-null  int64  
 6   wind_speed              196776 non-null  float64
 7   rain_1h                 196776 non-null  float64
 8   snow_1h                 196776 non-null  float64
 9   clouds_all              196776 non-null  int64  
 10  isSun                   196776 non-null  int64  
 11  sunlightTime            196776 non-null  int64  
 12  dayLength               196776 non-null  int64  
 13  SunlightTime/daylength  196776 non-null  float64
 14  weather_type        

🔎 **Observations:**

# ✅ 3. Sanity Check

- Check for missing values
- Check for duplicates
- Check data types
- Check the number of unique values in each column
- Check summary statistics of the dataset
- Check the categories present in each categorical column

## 3.1. Checking for Missing Values

In [10]:
missing = df.isnull().sum()  # null count per column, computed once
count = missing[missing > 0]
percentage = (count / df.shape[0]) * 100

print(count.shape[0], 'columns have missing values')
print('-'*50)
print(pd.DataFrame({'Count':count, 'Percentage %':percentage}))

0 columns have missing values
--------------------------------------------------
Empty DataFrame
Columns: [Count, Percentage %]
Index: []


In [8]:
# Inspect a random sample for placeholder values that stand in for nulls and that isnull() would miss.
df.sample(20, random_state=101)

| | Time | Energy delta[Wh] | GHI | temp | pressure | humidity | wind_speed | rain_1h | snow_1h | clouds_all | isSun | sunlightTime | dayLength | SunlightTime/daylength | weather_type | hour | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7204 | 2017-03-22 01:00:00 | 0 | 0.0 | 2.9 | 1014 | 86 | 2.6 | 0.0 | 0.0 | 100 | 0 | 0 | 735 | 0.0 | 4 | 1 | 3 |
| 165767 | 2021-09-29 17:45:00 | 0 | 0.0 | 12.2 | 1017 | 84 | 4.9 | 1.0 | 0.0 | 99 | 0 | 0 | 705 | 0.0 | 5 | 17 | 9 |
| 65423 | 2018-11-18 11:45:00 | 65 | 14.3 | 3.3 | 1031 | 94 | 1.8 | 0.0 | 0.0 | 100 | 1 | 330 | 510 | 0.65 | 4 | 11 | 11 |
| 46618 | 2018-05-06 14:30:00 | 2269 | 139.8 | 19.8 | 1027 | 51 | 3.1 | 0.0 | 0.0 | 0 | 1 | 690 | 930 | 0.74 | 1 | 14 | 5 |
| 116994 | 2020-05-09 16:30:00 | 435 | 52.6 | 18.3 | 1014 | 50 | 2.0 | 0.0 | 0.0 | 13 | 1 | 810 | 930 | 0.87 | 2 | 16 | 5 |
| 164243 | 2021-09-13 20:45:00 | 0 | 0.0 | 13.9 | 1019 | 90 | 1.3 | 0.0 | 0.0 | 100 | 0 | 0 | 765 | 0.0 | 4 | 20 | 9 |
| 194491 | 2022-08-07 22:45:00 | 0 | 0.0 | 14.9 | 1024 | 76 | 2.6 | 0.0 | 0.0 | 69 | 0 | 0 | 915 | 0.0 | 3 | 22 | 8 |
| 7887 | 2017-03-29 03:45:00 | 0 | 0.0 | 2.2 | 1017 | 80 | 3.8 | 0.0 | 0.0 | 0 | 0 | 0 | 765 | 0.0 | 1 | 3 | 3 |
| 169981 | 2021-11-14 15:15:00 | 0 | 0.0 | 6.5 | 1027 | 82 | 2.8 | 0.0 | 0.0 | 100 | 0 | 0 | 510 | 0.0 | 4 | 15 | 11 |
| 61838 | 2018-10-12 03:30:00 | 0 | 0.0 | 13.2 | 1021 | 84 | 4.8 | 0.0 | 0.0 | 0 | 0 | 0 | 660 | 0.0 | 1 | 3 | 10 |


🔎 **Observations:** The dataset appears to have no missing values: no column contains nulls, and the sampled records show no obvious placeholder values.
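The visual spot check can be complemented with a programmatic one. The sketch below counts placeholder strings in the non-numeric columns. The `PLACEHOLDERS` list and the `count_placeholders` helper are hypothetical names introduced for illustration, and the toy frame contains a deliberately planted `'?'`.

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust to whatever the data source uses.
PLACEHOLDERS = ['', '?', '-', 'NA', 'N/A', 'null', 'None']

def count_placeholders(frame: pd.DataFrame) -> pd.Series:
    """Count placeholder strings in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude='number')
    stripped = text_cols.apply(lambda col: col.astype(str).str.strip())
    return stripped.isin(PLACEHOLDERS).sum()

# Toy data: no nulls, but one timestamp is a placeholder.
toy = pd.DataFrame({
    'Time': ['2017-01-01 00:00:00', '?', '2017-01-01 00:30:00'],
    'temp': [1.6, 1.6, 1.7],
})
print(toy.isnull().sum().sum())   # total nulls reported by isnull()
print(count_placeholders(toy))    # placeholder count per text column
```

Numeric sentinels such as -999 would need a separate check against each column's plausible range.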

## 3.2. Checking for Duplicates

In [9]:
df.duplicated().sum()

0

🔎 **Observations:** There are no duplicate records.
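`df.duplicated()` only flags rows that match in every column, so two records sharing a timestamp but differing in a reading would pass. A minimal sketch of the stricter check on a toy frame (the values are invented for illustration):

```python
import pandas as pd

# Toy data: the last two rows share a timestamp but differ in temp.
toy = pd.DataFrame({
    'Time': ['2017-01-01 00:00:00', '2017-01-01 00:15:00', '2017-01-01 00:15:00'],
    'temp': [1.6, 1.6, 1.7],
})

full_row_dups = toy.duplicated().sum()                  # identical in every column
timestamp_dups = toy.duplicated(subset=['Time']).sum()  # repeated timestamps only
print(full_row_dups, timestamp_dups)
```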

## 3.3. Checking for Data Type

In [11]:
df.dtypes

Time                       object
Energy delta[Wh]            int64
GHI                       float64
temp                      float64
pressure                    int64
humidity                    int64
wind_speed                float64
rain_1h                   float64
snow_1h                   float64
clouds_all                  int64
isSun                       int64
sunlightTime                int64
dayLength                   int64
SunlightTime/daylength    float64
weather_type                int64
hour                        int64
month                       int64
dtype: object

## 3.4. Checking the number of unique values of each column

In [12]:
df.nunique()

Time                      196776
Energy delta[Wh]            4556
GHI                         2277
temp                         503
pressure                      71
humidity                      79
wind_speed                   136
rain_1h                      311
snow_1h                      129
clouds_all                   101
isSun                          2
sunlightTime                  69
dayLength                     39
SunlightTime/daylength       101
weather_type                   5
hour                          24
month                         12
dtype: int64

🔎 **Observations:**

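- `isSun`, `weather_type`, `month` and `hour` are stored as integers but take only a few discrete values (2, 5, 12 and 24 respectively), so they are categorical and should be converted into object columns.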
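The conversion could look like the sketch below, shown on a toy frame with invented values. Casting to `object` follows the observation above. The `category` dtype is an alternative offered here as an assumption, not something the notebook states it uses.

```python
import pandas as pd

# Toy frame with the four integer-coded columns plus one numeric feature.
toy = pd.DataFrame({
    'isSun': [0, 1, 1, 0],
    'weather_type': [4, 3, 1, 4],
    'hour': [0, 11, 14, 22],
    'month': [1, 11, 5, 8],
    'temp': [1.6, 3.3, 19.8, 14.9],
})

categorical_cols = ['isSun', 'weather_type', 'hour', 'month']

# Option 1: plain object columns, as the observation suggests.
as_object = toy.astype({col: 'object' for col in categorical_cols})

# Option 2 (an assumption): pandas' dedicated category dtype.
as_category = toy.astype({col: 'category' for col in categorical_cols})

print(as_object.dtypes)
print(as_category.dtypes)
```

Either way, `temp` keeps its float dtype and `describe()` stops reporting means for the coded columns.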

## 3.5. Summary Statistics

In [13]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Energy delta[Wh] | 196776.0 | 573.008 | 1044.824 | 0.0 | 0.0 | 0.0 | 577.0 | 5020.0 |
| GHI | 196776.0 | 32.597 | 52.172 | 0.0 | 0.0 | 1.6 | 46.8 | 229.2 |
| temp | 196776.0 | 9.791 | 7.995 | -16.6 | 3.6 | 9.3 | 15.7 | 35.8 |
| pressure | 196776.0 | 1015.293 | 9.586 | 977.0 | 1010.0 | 1016.0 | 1021.0 | 1047.0 |
| humidity | 196776.0 | 79.811 | 15.604 | 22.0 | 70.0 | 84.0 | 92.0 | 100.0 |
| wind_speed | 196776.0 | 3.938 | 1.822 | 0.0 | 2.6 | 3.7 | 5.0 | 14.3 |
| rain_1h | 196776.0 | 0.066 | 0.279 | 0.0 | 0.0 | 0.0 | 0.0 | 8.09 |
| snow_1h | 196776.0 | 0.007 | 0.07 | 0.0 | 0.0 | 0.0 | 0.0 | 2.82 |
| clouds_all | 196776.0 | 65.974 | 36.629 | 0.0 | 34.0 | 82.0 | 100.0 | 100.0 |
| isSun | 196776.0 | 0.52 | 0.5 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [14]:
print("From Time : ",df['Time'].min())
print("To Time   : ",df['Time'].max())

From Time :  2017-01-01 00:00:00
To Time   :  2022-08-31 17:45:00
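`Time` is still an `object` column, so `min()` and `max()` above compare strings. That gives the right answer here only because the `YYYY-MM-DD HH:MM:SS` format sorts lexicographically in chronological order. A minimal sketch, on toy timestamps with one deliberate gap, of parsing the column and checking the 15-minute spacing seen in `df.head()`:

```python
import pandas as pd

# Toy timestamps in the dataset's format, with one deliberate gap.
toy = pd.DataFrame({'Time': [
    '2017-01-01 00:00:00', '2017-01-01 00:15:00',
    '2017-01-01 00:30:00', '2017-01-01 01:15:00',
]})

# Parse the strings into real timestamps.
time = pd.to_datetime(toy['Time'], format='%Y-%m-%d %H:%M:%S')
print('From Time : ', time.min())
print('To Time   : ', time.max())

# Steps between consecutive records; anything other than 15 minutes is a gap.
steps = time.sort_values().diff().dropna()
gaps = steps[steps != pd.Timedelta(minutes=15)]
print(f'{len(gaps)} irregular step(s)')
```

Run on the full dataset, the same check would show whether the 15-minute grid is complete over the whole period.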


# 📊 4. Exploratory Data Analysis (EDA) and Visualization