# Capstone proposal by Jean-Pierre Braun 

# Use of weather data to predict the energy production of the NIST photovoltaic arrays

This notebook is a template for your project proposal.

The details are outlined in the **Proposal** unit on the platform - you should address all points from those instructions with as many markdown/code cells as needed. This should include code, observations, discussions and the planned steps.

In [1]:
import webbrowser
import numpy as np
import pandas as pd

## 1) The problem

A cost-benefit analysis of a photovoltaic (PV) installation requires knowledge of both the type of solar panels used and the weather conditions on site: the value of the electricity generated over the life span of the installation must exceed the cost of its deployment.

Such a study needs weather data for the site, such as irradiation, ambient temperature and typical weather patterns. It also needs performance data for the PV array, such as the power generated under given weather conditions.

To facilitate such studies, the NIST (National Institute of Standards and Technology) in the USA has installed three fully instrumented photovoltaic arrays as well as an ancillary weather station on its Gaithersburg (MD) campus. The data recorded between August 2014 and July 2017 is freely available on the NIST web site. 

The aim of this project is to apply the machine learning techniques taught in the course to the NIST data. More precisely, the plan is to create a model that links the power generated to the weather data from the meteorological station. It will then be possible to predict the energy generated over a given time period for known weather conditions.

Detailed information about the complete system is available from the following links (run the corresponding cell):

In [2]:
# Short overview of the NIST system (Three arrays and weather station)
webbrowser.open_new_tab('https://www.nist.gov/el/energy-and-environment-division-73200/heat-transfer-alternative-energy-systems/photovoltaic')

True

In [3]:
# Description of the NIST PV arrays and their instrumentation
webbrowser.open_new_tab('https://dx.doi.org/10.6028/NIST.TN.1896#page=13')

True

In [4]:
# Description of the NIST ancillary weather station and its instrumentation
webbrowser.open_new_tab('https://dx.doi.org/10.6028/NIST.TN.1913#page=13')

True

## 2) The data

### (a) Clear overview of your data

The data available are divided into five groups:

- Parking lot canopy array
- Ground mount array
- Roof tilted array
- Rooftop weather station
- Rooftop module test station

The data is available for a period of four years starting January 2015 and ending December 2018. The data can be chosen with timestamp intervals of either 1 second (instantaneous values) or 1 minute (averaged values). The data is contained in zip files, each covering one group over a period of one year. Each zip file has 12 directories (one per month) containing csv files that each cover a 24-hour period. Each 1-minute csv file contains 1440 rows (the number of minutes in a day) and 102 (array) or 49 (weather station) columns. The files containing the instantaneous values have 86400 rows (the number of seconds in a day).

Also available are the images taken by various surveillance cameras observing the complete installation as well as the sky. The pictures are updated every minute.

The content of these files can be summarised as follows (see the data dictionary in cell 6 for more information):

- Array data: Irradiance, ambient temperature, module temperature, wind, DC and AC electrical, etc.
- Weather station: Irradiance (various types), snow depth, wind speed and direction, humidity, precipitation, barometric pressure, hail count, ambient temperature, etc.

-----------------------

The data can be interactively previewed on the NIST web site (next cell). Select the following parameters (on the web page) for a first test:
- Location: Canopy
- Table: OneMin
- Field: RefCell1_Wm2_Avg, AmbTemp_C_Avg, InvPAC_kW_Avg
- Start Data: As desired
- Camera: As desired

These settings will display the irradiation, temperature and power produced. The data dictionary (see below) can help in selecting other parameters.

In [2]:
# Visualisation of data
webbrowser.open_new_tab('https://pvdata.nist.gov/')

True

In [6]:
# Data dictionary (if required)
webbrowser.open_new_tab('https://www.nist.gov/system/files/documents/2017/10/04/datadictionary_supplementalcontent.pdf')

True

In [7]:
# Import a canopy array sample file into Python
df_canopy = pd.read_csv('onemin-Canopy-2018-01-01.csv')
print(df_canopy.shape)
df_canopy.head()

(1440, 102)


Unnamed: 0,TIMESTAMP,Pyra1_Wm2_Avg,Pyra2_Wm2_Avg,Pyra3_Wm2_Avg,RECORD,CR1000Temp_C_Avg,DoorOpen_Min,RefCell1_Wm2_Avg,RefCell2_Wm2_Avg,AmbTemp_C_Avg,...,RTD_C_Avg_1,RTD_C_Avg_2,RTD_C_Avg_3,RTD_C_Avg_4,RTD_C_Avg_5,RTD_C_Avg_6,RTD_C_Avg_7,RTD_C_Avg_8,RTD_C_Avg_9,RTD_C_Avg_10
0,2018-01-01 00:00:00-05:00,-4.661181,-11.089881,-4.321073,320454.0,-9.96,0.0,-0.204,0.051,-10.54,...,-14.14,-14.36,-14.59,-14.02,-13.9,-14.46,-14.15,-14.32,-12.71,-7.353
1,2018-01-01 00:01:00-05:00,-4.661181,-11.089881,-4.321073,320455.0,-9.96,0.0,-0.204,0.051,-10.51,...,-14.13,-14.33,-14.43,-13.86,-13.76,-14.33,-14.06,-14.19,-12.66,-7.417
2,2018-01-01 00:02:00-05:00,-4.661181,-11.089881,-4.436302,320456.0,-9.96,0.0,-0.204,0.051,-10.6,...,-14.14,-14.33,-14.43,-13.86,-13.79,-14.36,-14.06,-14.15,-12.73,-7.417
3,2018-01-01 00:03:00-05:00,-4.661181,-11.089881,-4.321073,320457.0,-9.96,0.0,-0.204,0.051,-10.58,...,-14.14,-14.4,-14.43,-13.94,-13.83,-14.41,-14.06,-14.2,-12.77,-7.417
4,2018-01-01 00:04:00-05:00,-4.661181,-11.089881,-4.436302,320458.0,-9.96,0.0,-0.204,0.051,-10.64,...,-14.17,-14.45,-14.52,-14.05,-13.89,-14.46,-14.08,-14.24,-12.79,-7.417


In [8]:
# Import a weather station sample file into Python
df_WS = pd.read_csv('onemin-WS_1-2018-01-01.csv')
print(df_WS.shape)
df_WS.head()

(1440, 49)


Unnamed: 0,TIMESTAMP,Pyrh1_Wm2_Avg,Pyrad1_Wm2_Avg,Pyra1_Wm2_Avg,Pyrg1_Wm2_Avg,UVA_Wm2_Avg,UVB_Wm2_Avg,Pyrg1_downwell_Wm2_Avg,RECORD,SolarTime_hr,...,SixInOneHeatStateID_Avg,WindValid_Avg,Battery_V_Min,Battery_A_Avg,Load_A_Avg,ChgState_Min,ChgSource_Min,CkBatt_Max,Qloss_Ah_Max,RelayState_Min
0,2018-01-01 00:00:00-05:00,0.0,-1.041667,-5.764146,-97.77283,0.025379,0.392921,175.04398,296118,23.8,...,3.0,-1.0,13.08,0.009,3.018,3.0,1.0,0.0,0.0,15
1,2018-01-01 00:01:00-05:00,0.0,-1.041667,-5.764146,-97.995544,0.024244,0.391798,174.81833,296119,23.81,...,3.0,-1.0,13.08,0.009,3.019,3.0,1.0,0.0,0.0,15
2,2018-01-01 00:02:00-05:00,0.0,-1.041667,-5.764146,-97.995544,0.024554,0.41927,174.81833,296120,23.83,...,3.0,-1.0,13.08,0.008,3.014,3.0,1.0,0.0,0.0,15
3,2018-01-01 00:03:00-05:00,0.0,-1.041667,-5.764146,-97.995544,0.02507,0.476405,174.69394,296121,23.85,...,3.0,-1.0,13.08,0.009,3.013,3.0,1.0,0.0,0.0,15
4,2018-01-01 00:04:00-05:00,0.0,-1.041667,-5.764146,-97.995544,0.025689,0.436405,174.65259,296122,23.86,...,3.0,-1.0,13.08,0.009,3.016,3.0,1.0,0.0,0.0,15
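
Since the goal is to link array power to weather data, the two daily files will have to be aligned on their common `TIMESTAMP` column. The following is a minimal sketch of that step, not the final pipeline. It uses small synthetic stand-ins for the two files (made-up values); the column names are taken from the previews and the field list above, while the `prepare` helper and the `WS_` prefix are illustrative choices.

```python
import pandas as pd

# Synthetic stand-ins for the two daily files (made-up values, real column names).
stamps = [f"2018-01-01 00:0{i}:00-05:00" for i in range(5)]
canopy = pd.DataFrame({
    "TIMESTAMP": stamps,
    "RefCell1_Wm2_Avg": [-0.2, -0.2, -0.2, -0.2, -0.2],
    "AmbTemp_C_Avg": [-10.54, -10.51, -10.60, -10.58, -10.64],
    "InvPAC_kW_Avg": [0.0, 0.0, 0.0, 0.0, 0.0],
})
ws = pd.DataFrame({
    "TIMESTAMP": stamps[:3] + stamps[4:],  # minute 00:03 is deliberately missing
    "Pyra1_Wm2_Avg": [-5.76, -5.76, -5.76, -5.76],
    "UVA_Wm2_Avg": [0.025, 0.024, 0.025, 0.026],
})


def prepare(df, columns):
    """Index a daily frame by UTC timestamp and keep only the named columns."""
    out = df.copy()
    out["TIMESTAMP"] = pd.to_datetime(out["TIMESTAMP"], utc=True)
    return out.set_index("TIMESTAMP")[columns]


canopy_sel = prepare(canopy, ["RefCell1_Wm2_Avg", "AmbTemp_C_Avg", "InvPAC_kW_Avg"])
# Both files contain a Pyra1_Wm2_Avg column, so the weather columns get a prefix.
ws_sel = prepare(ws, ["Pyra1_Wm2_Avg", "UVA_Wm2_Avg"]).add_prefix("WS_")

# Inner join: keep only the minutes present in both sources.
merged = canopy_sel.join(ws_sel, how="inner")
print(merged.shape)
```

An inner join silently drops minutes that are missing from either source, which is convenient for modelling but hides gaps; the gaps themselves should be detected separately during the clean-up.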


### (b) Plan to manage and process the data

The data supplied by NIST is well organised, but in its present form it is not suitable for analysis. The main problem is that the data is spread over 1500 spreadsheets, each corresponding to one physical location and one calendar day. The first task will be to merge all these data into a database, so that the relevant data can then be easily extracted into DataFrames suitable for analysis. A second problem is the high time resolution of the data combined with the four-year period over which it was collected. It is unlikely that all the data will fit in a PC's memory, or that it can be processed in a reasonable amount of time. This calls for multiple analyses over shorter time intervals, which the use of a database will greatly facilitate.

The creation of the database will require a high degree of automation of the data extraction, clean up and storage. The database will have the following table structure:
- Ground array data
- Canopy array data
- Roof array data
- Weather station array data
- Weather station data

All the tables will contain only the 1-minute average data. At this stage, it is not envisaged to use the instantaneous values. Each table will hold a continuous time series over the full four years. This implies that each table will contain over two million rows (60 minutes * 24 hours * 365 days * 4 years). Shorter time intervals can then be extracted into DataFrames.
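
A minimal sketch of how daily files could be appended to a database table and a shorter interval read back. It assumes SQLite (through Python's built-in `sqlite3` module) as the database engine, which has not been decided yet. The helper names `append_day`, `read_window` and `fake_day`, the table name `canopy` and the synthetic days are all hypothetical.

```python
import sqlite3

import pandas as pd

# Row count implied by the text: one row per minute over four 365-day years.
rows_per_table = 60 * 24 * 365 * 4
print(rows_per_table)


def append_day(conn, table, day_df):
    """Append one day of 1-minute data, storing timestamps as sortable UTC strings."""
    out = day_df.copy()
    ts = pd.to_datetime(out["TIMESTAMP"], utc=True)
    out["TIMESTAMP"] = ts.dt.strftime("%Y-%m-%d %H:%M:%S")
    out.to_sql(table, conn, if_exists="append", index=False)


def read_window(conn, table, start, end):
    """Read rows with start <= TIMESTAMP < end (UTC strings) into a DataFrame."""
    query = (
        f'SELECT * FROM "{table}" '
        'WHERE "TIMESTAMP" >= ? AND "TIMESTAMP" < ? ORDER BY "TIMESTAMP"'
    )
    return pd.read_sql_query(query, conn, params=(start, end), parse_dates=["TIMESTAMP"])


def fake_day(date):
    """Build a synthetic daily frame (1440 rows of zero power) for the demo."""
    ts = pd.date_range(f"{date} 00:00", periods=1440, freq="min", tz="UTC")
    return pd.DataFrame({"TIMESTAMP": ts.astype(str), "InvPAC_kW_Avg": 0.0})


conn = sqlite3.connect(":memory:")
for date in ("2018-01-01", "2018-01-02"):
    append_day(conn, "canopy", fake_day(date))

# Extract a 24-hour window that straddles the two daily files.
window = read_window(conn, "canopy", "2018-01-01 12:00:00", "2018-01-02 12:00:00")
print(len(window))
```

Storing the timestamps as fixed-width UTC strings keeps the range query a plain string comparison; an index on the timestamp column would probably be worth adding once the tables reach their full size.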

In short, the automation will consist of sequentially opening all the data files, cleaning them up and storing them in the database. Only clean data is allowed into the database. The clean-up consists of a search for missing data, a verification of the continuity of the time series and a search for invalid data formats. The data is expected to be of high quality, as the NIST system is fully automatic, but no assumption is made at this stage. It is essential that the data clean-up can be fully automated.
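
The three checks described above could be automated along the following lines. This is a sketch under assumptions: the function name `check_day` and the indicators it returns are illustrative, the sample frame is synthetic, and the real files may need additional rules once they have been inspected.

```python
import pandas as pd


def check_day(df, expected_rows=1440):
    """Return simple quality indicators for one daily 1-minute file."""
    ts = pd.to_datetime(df["TIMESTAMP"], utc=True, errors="coerce")
    data = df.drop(columns=["TIMESTAMP"])
    numeric = data.apply(pd.to_numeric, errors="coerce")
    steps = ts.sort_values().diff().dropna()
    return {
        "row_count_ok": len(df) == expected_rows,
        # Timestamps that could not be parsed at all.
        "bad_timestamps": int(ts.isna().sum()),
        "duplicate_timestamps": int(ts.duplicated().sum()),
        # Consecutive timestamps that are not exactly one minute apart.
        "time_gaps": int((steps != pd.Timedelta(minutes=1)).sum()),
        "missing_values": int(data.isna().sum().sum()),
        # Cells that hold something, but not a number.
        "non_numeric_values": int((numeric.isna() & data.notna()).sum().sum()),
    }


# Synthetic sample: minute 00:03 is absent, one value is missing, one is text.
sample = pd.DataFrame({
    "TIMESTAMP": [f"2018-01-01 00:0{i}:00-05:00" for i in (0, 1, 2, 4, 5)],
    "AmbTemp_C_Avg": [-10.5, None, "bad", -10.6, -10.6],
})
report = check_day(sample, expected_rows=6)
print(report)
```

A file would only be stored in the database when every indicator is clean; otherwise it would be set aside for inspection or repair.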

## 3) Exploratory data analysis (EDA)

### (a) Preliminary EDA

The first aim of the EDA in this project is to identify the most relevant data for the modelling effort. The NIST data set is large and includes power generation, weather information and other technical data. The task is to determine which parameters most influence the generation of electrical power.
The following two publications provide some initial information:

- Solar PV Generation Forecast Model Based on the Most Effective Weather Parameters, Muhammad Asim Munir, Abraiz Khattak, Kashif Imran, Abasin Ulasyar and Adeel Khan, 2019 International Conference on Electrical, Communication, and Computer Engineering (ICECCE)
- Assessing the Utility of Weather Data for Photovoltaic Power Prediction, Reza Zafarani, Sara Eftekharnejad and Urvi Patel, 2018 arXiv

The following preliminary EDA steps are envisaged at this time (this list is likely to grow with data familiarisation):
- Statistical analysis and visualisation of weather data (min, max, mean, standard deviation, correlation, etc.)
- Grouping of energy generation into classes of 'typical days'. The days will be classified by the energy produced (data exploration)
- Grouping of the weather situations into classes of typical weather (data exploration)
- Investigation of whether these two groupings can be used to form an initial baseline
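
As an illustration of the 'typical days' grouping, the sketch below integrates 1-minute power into daily energy and splits the days into three classes. The power profile is synthetic, and the use of three quantile-based classes (`pd.qcut`) is an assumption made for the example; the actual number of classes and the classification rule remain to be determined during the EDA.

```python
import numpy as np
import pandas as pd

# Synthetic 1-minute AC power for six days (made-up values, not NIST data).
idx = pd.date_range("2018-06-01", periods=6 * 1440, freq="min")
minute_of_day = (idx.hour * 60 + idx.minute).to_numpy()
# Half-sine daylight profile between 06:00 and 18:00, zero at night.
clear_sky = np.clip(np.sin(np.pi * (minute_of_day - 360) / 720), 0, None)
# One scale factor per day, mimicking clear and cloudy days.
day_scale = np.repeat([1.0, 0.2, 0.6, 0.9, 0.3, 0.5], 1440)
power_kw = pd.Series(100 * clear_sky * day_scale, index=idx, name="InvPAC_kW_Avg")

# Energy per day: each 1-minute average in kW contributes 1/60 kWh.
daily_kwh = power_kw.resample("D").sum() / 60

# Split the days into three equally populated classes by energy produced.
day_class = pd.qcut(daily_kwh, q=3, labels=["low", "medium", "high"])

summary = pd.DataFrame({"energy_kWh": daily_kwh.round(1), "class": day_class})
print(summary)
```

The same pattern would apply to the weather side, with daily aggregates of irradiance, temperature or humidity taking the place of the daily energy.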

### (b) How does the EDA inform your project plan?

My experience so far in this course is that familiarisation with the data is an essential ingredient for the proper use of the various models covered. The blind use of models is, in my opinion, a recipe for disaster (the ‘danger zone’ shown in the data science skillset presented in the first lesson). Therefore, at this stage, before some ‘wrestling’ with the data has been done, it is difficult to say how the EDA will inform my project plan. I will only be able to answer this question once I have acquired a better feel for the data.

### (c) What further EDA do you plan for project?

The power generated by the arrays is driven primarily by the irradiance on the PV solar cells (energy conversion). According to the following paper, this relationship is deterministic:

Evaluation and validation of equivalent circuit photovoltaic solar cell performance models, M. T. Boyd, S. A. Klein, D. T. Reindl, B. P. Dougherty, Journal of Solar Energy Engineering, Vol 133, May 2011

Therefore, the relationship between irradiance and power generated could be explored during the EDA phase through polynomial fitting. If this investigation leads to usable results, the overall project would boil down to predicting the irradiance from the weather data.
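
The polynomial fitting mentioned above could start as simply as the sketch below. The data is synthetic: the quadratic relationship and the noise level are invented for the demonstration and say nothing about the real arrays. With the NIST data, `irradiance` and `power` would presumably come from columns such as `RefCell1_Wm2_Avg` and `InvPAC_kW_Avg`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic irradiance (W/m2) and power (kW); the quadratic "truth" is invented.
irradiance = rng.uniform(0, 1000, size=500)
power = 0.25 * irradiance - 5e-5 * irradiance**2 + rng.normal(0, 2.0, size=500)

# Fit polynomials of increasing degree and compare the residual error.
rmse_by_degree = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(irradiance, power, deg=degree)
    fitted = np.polyval(coeffs, irradiance)
    rmse_by_degree[degree] = float(np.sqrt(np.mean((power - fitted) ** 2)))
    print(f"degree {degree}: RMSE = {rmse_by_degree[degree]:.2f} kW")
```

Because the error is computed on the same points used for the fit, it can only decrease as the degree increases; the choice of degree should therefore be confirmed on held-out data.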


## 4) Machine learning 

### (a) Phrase your project goal as a clear machine learning question

The machine learning question is whether a model can be established that relates weather conditions (sun irradiation, temperature, humidity, wind, etc.) to the power generated by the PV arrays. This model should then be able to predict the power generation for known weather conditions with good accuracy. The large amount of data available in this project provides ample material to train and test machine learning algorithms. Such a model could ultimately predict energy production from weather forecasts.

### (b) What models are you planning to use and why?

- K-nearest neighbours

K-nearest neighbours appears to be an easy way to start the modelling, and it is expected that both the weather data and the array data could fit into clusters.

- Support vector machine (SVM)

SVM and ANN are often reported in the literature as the machine learning techniques used for this specific problem. It is expected that SVM may refine the results from the K-nearest neighbours. New features and kernels may be used to accommodate the nonlinearity of the data.

- Artificial neural networks (ANN)

ANN are well suited to highly nonlinear data such as weather. It is also a technique that can be started with a simple network and made progressively more sophisticated.

### (c) Please tell us your detailed machine learning strategy 

At this stage, my strategy is very embryonic, as this project is rather overwhelming. The strategy will therefore be an incremental one, starting with a limited number of parameters over a limited time interval. The three machine learning algorithms will be explored in parallel, and the models will be progressively refined as the project progresses. It will be very much a trial-and-error approach.
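
A first iteration of this incremental strategy could look like the sketch below, which trains the three model families on a handful of features and compares them against a trivial baseline. Everything in it is an assumption made for illustration: the data is synthetic, scikit-learn is used as the library, the hyperparameters are arbitrary starting points, and the chronological 80/20 split and the mean absolute error are one possible evaluation choice among others.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic, time-ordered features: irradiance (W/m2), ambient temperature (C), wind (m/s).
n = 600
X = np.column_stack([
    rng.uniform(0, 1000, n),
    rng.uniform(-10, 35, n),
    rng.uniform(0, 10, n),
])
# Invented target: power rises with irradiance and drops slightly when it is hot.
y = 0.2 * X[:, 0] * (1 - 0.004 * (X[:, 1] - 25)) + rng.normal(0, 3.0, n)

# Chronological split: train on the first 80 %, test on the last 20 %.
split = int(0.8 * n)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

# All three models are distance- or gradient-based, so the features are standardised first.
models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=10)),
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
    "ANN": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), learning_rate_init=0.01,
                     max_iter=2000, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {scores[name]:.2f} kW")
```

Keeping a trivial baseline in the comparison makes it obvious whether a model has learned anything at all, and the same loop can be rerun unchanged as more features and longer time intervals are added.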

## 5) Additional information

From the experience gained with the various projects of this course, a deeper familiarisation with the data generally occurs at the ‘time of doing’, and this may influence the various phases of the project. Therefore, while the goals and objectives will remain identical, the details of the implementation may diverge from those outlined in the proposal.