# Prediction of Solar Energy Potential Based on Weather and Location Data

### Import Libraries and Settings

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

pd.options.display.max_columns = 100
pd.set_option('display.max_colwidth', None)

# Requirements
print('numpy version : ',np.__version__)
print('pandas version : ',pd.__version__)
print('seaborn version : ',sns.__version__)

numpy version :  1.26.4
pandas version :  2.2.0
seaborn version :  0.13.2


#### Load Dataset (Jupyter Notebook)

In [2]:
df = pd.read_csv('Pasion et al dataset.csv')

#### Load Dataset (Google Colab)

In [3]:
# from google.colab import drive
# drive.mount('/content/drive')

# df = pd.read_csv('Pasion et al dataset.csv')

# Initial Data Understanding and Pre-Processing

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21045 entries, 0 to 21044
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Location       21045 non-null  object 
 1   Date           21045 non-null  int64  
 2   Time           21045 non-null  int64  
 3   Latitude       21045 non-null  float64
 4   Longitude      21045 non-null  float64
 5   Altitude       21045 non-null  int64  
 6   YRMODAHRMI     21045 non-null  float64
 7   Month          21045 non-null  int64  
 8   Hour           21045 non-null  int64  
 9   Season         21045 non-null  object 
 10  Humidity       21045 non-null  float64
 11  AmbientTemp    21045 non-null  float64
 12  PolyPwr        21045 non-null  float64
 13  Wind.Speed     21045 non-null  int64  
 14  Visibility     21045 non-null  float64
 15  Pressure       21045 non-null  float64
 16  Cloud.Ceiling  21045 non-null  int64  
dtypes: float64(8), int64(7), object(2)
memory usage: 2

In [9]:
df.sample(20)

| | Location | Date | Time | Latitude | Longitude | Altitude | YRMODAHRMI | Month | Hour | Season | Humidity | AmbientTemp | PolyPwr | Wind.Speed | Visibility | Pressure | Cloud.Ceiling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 611 | Camp Murray | 20180629 | 1100 | 47.11 | -122.57 | 84 | 201806000000.0 | 6 | 11 | Summer | 27.46582 | 34.18907 | 24.84334 | 3 | 10.0 | 1008.9 | 100 |
| 15467 | Peterson | 20180813 | 1500 | 38.82 | -104.71 | 1879 | 201808000000.0 | 8 | 15 | Summer | 15.28931 | 38.7738 | 16.66081 | 9 | 10.0 | 811.7 | 722 |
| 7336 | Kahului | 20180406 | 1000 | 20.89 | -156.44 | 2 | 201804000000.0 | 4 | 10 | Spring | 99.34082 | 23.17162 | 1.19625 | 13 | 2.5 | 1017.5 | 8 |
| 12872 | Offutt | 20180804 | 1115 | 41.13 | -95.75 | 380 | 201808000000.0 | 8 | 11 | Summer | 96.44775 | 24.45816 | 4.63899 | 10 | 10.0 | 975.1 | 50 |
| 1318 | Grissom | 20171220 | 1200 | 40.67 | -86.15 | 239 | 201712000000.0 | 12 | 12 | Winter | 50.68359 | 9.08257 | 3.94127 | 9 | 10.0 | 990.2 | 722 |
| 14398 | Peterson | 20180203 | 1200 | 38.82 | -104.71 | 1879 | 201802000000.0 | 2 | 12 | Winter | 4.99268 | 34.89403 | 9.89062 | 11 | 10.0 | 806.0 | 722 |
| 7595 | Kahului | 20180525 | 1400 | 20.89 | -156.44 | 2 | 201805000000.0 | 5 | 14 | Spring | 56.35376 | 30.29671 | 24.73989 | 24 | 10.0 | 1016.1 | 722 |
| 20970 | USAFA | 20180912 | 1330 | 38.95 | -104.83 | 1947 | 201809000000.0 | 9 | 13 | Fall | 8.69751 | 47.7494 | 16.19746 | 16 | 10.0 | 797.3 | 722 |
| 7816 | Malmstrom | 20171210 | 1400 | 47.52 | -111.18 | 1043 | 201712000000.0 | 12 | 14 | Winter | 15.17944 | 12.93716 | 2.35719 | 26 | 10.0 | 904.6 | 140 |
| 18495 | USAFA | 20170527 | 1100 | 38.95 | -104.83 | 1947 | 201705000000.0 | 5 | 11 | Spring | 28.87573 | 23.17917 | 3.70381 | 0 | 10.0 | 799.5 | 43 |


This first look already gives several observations that will make further analysis and feature transformation easier:
- Date has the wrong type: it is stored as an integer (e.g. 20180629) and should be a datetime
- Time is meant to be hour:minute, but because it is stored as an integer the separator is lost and it reads as one number (e.g. 10:00 -> 1000)
- Latitude and Longitude are useful for geographical plots in Tableau or similar tools, but here the same information is already represented by Location
- YRMODAHRMI (year, month, day, hour, minute) overlaps with Date but carries more detailed time information; its actual values still need to be inspected
- Month and Hour are also available as separate features, so we can use them directly even though the same information could be extracted from YRMODAHRMI
- Categorical features such as Location and Season could be encoded with one-hot encoding
- Several features are physically expected to correlate strongly with each other, for example Altitude with Pressure and Humidity; we will check this in detail later in the bivariate analysis
- PolyPwr is the target variable; we could move it to the last column of the dataframe (personal preference)
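
The last two transformation ideas in the list (one-hot encoding the categorical features and moving the target to the end) can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual pipeline: `toy`, `encoded` and `ordered` are hypothetical names, and the values are copied from the sample rows shown above.

```python
import pandas as pd

# Toy frame reusing the dataset's column names (hypothetical subset of columns)
toy = pd.DataFrame({
    "Location": ["Camp Murray", "Peterson", "Kahului"],
    "Season": ["Summer", "Summer", "Spring"],
    "PolyPwr": [24.84334, 16.66081, 1.19625],
    "Humidity": [27.46582, 15.28931, 99.34082],
})

# One-hot encode the two categorical columns; each category becomes its own column
encoded = pd.get_dummies(toy, columns=["Location", "Season"])

# Move the target column to the last position, keeping the other columns in order
feature_cols = [col for col in encoded.columns if col != "PolyPwr"]
ordered = encoded[feature_cols + ["PolyPwr"]]

print(ordered.columns.tolist())
```

On the full dataset, Location has 12 categories and Season has 4, so the same call would add 16 indicator columns in place of the two original ones.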

In [None]:
# # Initial data transformation :  converting column name to lowercase
# df.columns = df.columns.str.lower()

In [None]:
#Transform and drop date and time features


In [13]:
#Checking missing values
df.isna().sum()

Location         0
Date             0
Time             0
Latitude         0
Longitude        0
Altitude         0
YRMODAHRMI       0
Month            0
Hour             0
Season           0
Humidity         0
AmbientTemp      0
PolyPwr          0
Wind.Speed       0
Visibility       0
Pressure         0
Cloud.Ceiling    0
dtype: int64

In [12]:
#Checking number of duplicated rows
df.duplicated().sum()

0

There are no missing values or duplicated rows in this dataset.

# Descriptive Statistics

In [10]:
#Describe numerical columns
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Date | 21045.0 | 20177200.0 | 4579.585 | 20170520.0 | 20171110.0 | 20180320.0 | 20180620.0 | 20181000.0 |
| Time | 21045.0 | 1267.484 | 167.6028 | 1000.0 | 1100.0 | 1300.0 | 1400.0 | 1545.0 |
| Latitude | 21045.0 | 38.21382 | 6.323761 | 20.89 | 38.16 | 38.95 | 41.15 | 47.52 |
| Longitude | 21045.0 | -108.5937 | 16.36413 | -156.44 | -117.26 | -111.18 | -104.71 | -80.11 |
| Altitude | 21045.0 | 798.8437 | 770.6818 | 1.0 | 2.0 | 458.0 | 1370.0 | 1947.0 |
| YRMODAHRMI | 21045.0 | 201771800000.0 | 45798460.0 | 201705000000.0 | 201711000000.0 | 201803000000.0 | 201806000000.0 | 201810000000.0 |
| Month | 21045.0 | 6.565883 | 2.983958 | 1.0 | 4.0 | 7.0 | 9.0 | 12.0 |
| Hour | 21045.0 | 12.62785 | 1.672952 | 10.0 | 11.0 | 13.0 | 14.0 | 15.0 |
| Humidity | 21045.0 | 37.12194 | 23.82301 | 0.0 | 17.5293 | 33.12378 | 52.59399 | 99.98779 |
| AmbientTemp | 21045.0 | 29.28512 | 12.36682 | -19.98177 | 21.91528 | 30.28915 | 37.47467 | 65.73837 |


In [15]:
#Describe categorical columns
df.select_dtypes('object').describe().transpose()

| | count | unique | top | freq |
|---|---|---|---|---|
| Location | 21045 | 12 | Travis | 2746 |
| Season | 21045 | 4 | Summer | 8208 |
