## Data Collection Notebook

This is the data collection notebook for Project 4 of the GA DSI project, by: <br>
Andres Aguilar <br>
Martijn de Vries <br>
William Lopez

In [1]:
import sys, os
import pandas as pd
import numpy as np
import time

Let's write a function that downloads one year of data from the NSRDB (National Solar Radiation Database). The code is largely adapted from:
https://developer.nrel.gov/docs/solar/nsrdb/python-examples/

In [13]:
def get_nsrdb_data(year, api_key='', your_name='Martijn+de+Vries', your_email='martijndevries91@gmail.com'):
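    """
    Construct the request URL and read NSRDB data into a pandas DataFrame.
    Most of this code is copied from https://developer.nrel.gov/docs/solar/nsrdb/python-examples/

    Args:
        year: year to download, as a string or int
        api_key: NREL developer API key
        your_name: full name sent with the request, with '+' in place of spaces
        your_email: email address sent with the request
    Returns:
        pandas DataFrame with weather attributes and GHI for the specified year
    """
    # Define the latitude and longitude of the location
    lat, lon = 34.0522, -118.243683  # coordinates within LA county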

    # Set the attributes to extract (e.g., dhi, ghi, etc.), separated by commas.
    attributes = 'air_temperature,clearsky_dhi,clearsky_dni,clearsky_ghi,cloud_type,dew_point,dhi,dni,fill_flag,ghi,relative_humidity,solar_zenith_angle,'\
        'surface_albedo,surface_pressure,total_precipitable_water,wind_direction,wind_speed'

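    # Request the leap day only for leap years (full Gregorian rule, not just divisibility by 4)
    yr = int(year)
    if yr % 4 == 0 and (yr % 100 != 0 or yr % 400 == 0):
        leap_year = 'true'
    else:
        leap_year = 'false'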

    # Set time interval in minutes, i.e., '30' is half hour intervals. Valid intervals are 30 & 60.
    interval = '30'

    # Specify Coordinated Universal Time (UTC), 'true' will use UTC, 'false' will use the local time zone of the data.
    # NOTE: In order to use the NSRDB data in SAM, you must specify UTC as 'false'. SAM requires the data to be in the
    # local time zone.
    utc = 'false'

    # Your reason for using the NSRDB.
    reason_for_use = 'private+project'
    # Your affiliation
    your_affiliation = 'General+Assembly'

    # Please join our mailing list so we can keep you up-to-date on new developments.
    mailing_list = 'false'

    # Declare url string
    url = 'https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-download.csv?wkt=POINT({lon}%20{lat})&names={year}&leap_day={leap}&interval={interval}&utc={utc}&full_name={name}&email={email}&affiliation={affiliation}&mailing_list={mailing_list}&reason={reason}&api_key={api}&attributes={attr}'.format(year=year, lat=lat, lon=lon, leap=leap_year, interval=interval, utc=utc, name=your_name, email=your_email, mailing_list=mailing_list, affiliation=your_affiliation, reason=reason_for_use, api=api_key, attr=attributes)
    
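    # Read the CSV from the URL; the first two rows hold site metadata, so skip them
    df = pd.read_csv(url, skiprows=2)
    # Build a datetime index from the Year, Month, Day, Hour and Minute columns
    df['datetime'] = pd.to_datetime(df[['Year', 'Month', 'Day', 'Hour', 'Minute']])
    df.set_index('datetime', inplace=True)
    # Alternative (not used): build the index with pd.date_range; this assumes a 365-day year
    #df = df.set_index(pd.date_range('1/1/{yr}'.format(yr=year), freq=interval+'Min', periods=525600/int(interval)))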
    
    return df
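The long `.format()` call above is easy to get wrong when a parameter changes. As a minimal sketch, the same query string could be assembled from a dictionary with `urllib.parse.urlencode`. The helper name `build_nsrdb_url`, the placeholder name, email and key, and the short attribute list are all hypothetical, and the sketch assumes the server decodes percent-encoded values (such as `%2C` for commas) in the usual way.

```python
from urllib.parse import quote, urlencode

# Endpoint copied from get_nsrdb_data above.
BASE_URL = "https://developer.nrel.gov/api/nsrdb/v2/solar/psm3-download.csv"


def build_nsrdb_url(year, lat, lon, api_key, attributes, full_name, email,
                    affiliation="General Assembly", reason="private project",
                    interval=30, leap_day=True, utc=False, mailing_list=False):
    """Hypothetical helper: assemble the download URL from a parameter dict."""
    params = {
        "wkt": f"POINT({lon} {lat})",  # longitude first, then latitude
        "names": year,
        "leap_day": str(leap_day).lower(),
        "interval": interval,
        "utc": str(utc).lower(),
        "full_name": full_name,
        "email": email,
        "affiliation": affiliation,
        "mailing_list": str(mailing_list).lower(),
        "reason": reason,
        "api_key": api_key,
        "attributes": ",".join(attributes),
    }
    # quote_via=quote encodes spaces as %20, matching the POINT(...) part
    # of the hand-written URL; other reserved characters are escaped too.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# Placeholder values for illustration only.
url = build_nsrdb_url(
    2016, 34.0522, -118.243683,
    api_key="YOUR_API_KEY",
    attributes=["ghi", "dhi", "dni"],
    full_name="Jane Doe",
    email="jane@example.com",
)
print(url)
```

Letting `urlencode` handle the escaping means names and affiliations can be written with ordinary spaces instead of hand-inserted `+` signs.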

In order to retrieve the data, an API key is required. This can be obtained here: <br>
https://developer.nrel.gov/signup/
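One way to keep the key out of the notebook altogether is to read it from an environment variable. The sketch below is only an illustration: the variable name `NREL_API_KEY` and the helper `load_api_key` are assumptions, not something the NREL API prescribes.

```python
import os


def load_api_key(var_name="NREL_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before requesting data."
        )
    return key


# Usage, once the variable has been set in the shell:
# api_key = load_api_key()
```

Failing early with a clear message avoids sending a request with an empty key.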

In [10]:
api_key = '' #Private key edited out

In order to capture seasonal weather patterns, we would like to retrieve several years of data. In the code block below, we download the data for 2016 through 2020 and concatenate the yearly results into a single dataframe.

In [14]:
frames = []
for year in range(2016, 2021):
    time.sleep(10)  # pause between requests to be polite to the API
    print(year)
    frames.append(get_nsrdb_data(str(year), api_key=api_key))

# Concatenate all years once, rather than growing the dataframe inside the loop
df = pd.concat(frames)
print(df.shape)


2016
2017
2018
2019
2020
(87696, 22)
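The row count can be checked by hand: at 30-minute intervals there are 48 rows per day, 2016 and 2020 are leap years with 366 days, and the other three years have 365. A small sketch of that arithmetic, assuming the leap day is included for leap years (the helper name `expected_rows` is hypothetical):

```python
import calendar


def expected_rows(first_year, last_year, interval_minutes=30):
    """Number of rows expected for a range of full years (inclusive)."""
    rows_per_day = 24 * 60 // interval_minutes
    total_days = sum(
        366 if calendar.isleap(year) else 365
        for year in range(first_year, last_year + 1)
    )
    return total_days * rows_per_day


print(expected_rows(2016, 2020))  # 87696, matching df.shape[0]
```

The 22 columns are the five date and time fields plus the 17 requested attributes.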


In [15]:
df.head()

datetime,Year,Month,Day,Hour,Minute,Temperature,Clearsky DHI,Clearsky DNI,Clearsky GHI,Cloud Type,...,DNI,Fill Flag,GHI,Relative Humidity,Solar Zenith Angle,Surface Albedo,Pressure,Precipitable Water,Wind Direction,Wind Speed
2016-01-01 00:00:00,2016,1,1,0,0,5.0,0,0,0,0,...,0,0,0,49.0,168.95,0.128,990,0.472,55.7,4.0
2016-01-01 00:30:00,2016,1,1,0,30,5.0,0,0,0,0,...,0,0,0,49.0,166.74,0.128,990,0.477,55.7,4.0
2016-01-01 01:00:00,2016,1,1,1,0,5.0,0,0,0,0,...,0,0,0,48.97,162.23,0.128,990,0.482,55.7,4.1
2016-01-01 01:30:00,2016,1,1,1,30,5.0,0,0,0,0,...,0,0,0,48.97,156.74,0.128,990,0.489,55.7,4.1
2016-01-01 02:00:00,2016,1,1,2,0,5.0,0,0,0,0,...,0,0,0,48.98,150.83,0.128,990,0.496,56.0,4.2


That looks great! Now we can save the data to a CSV file.

In [17]:
# index expects a boolean; index_label sets the header of the index column
df.to_csv('../data/NSRDB_data.csv', index=True, index_label='datetime')
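
When the saved file is loaded later, the `datetime` column comes back as plain text unless pandas is asked to parse it. The sketch below shows one way to restore the index, using a tiny made-up frame and an in-memory buffer in place of the real file so that it runs on its own; the GHI values are illustrative, not actual NSRDB data.

```python
import io

import pandas as pd

# Small stand-in for the real dataframe (values are made up).
sample = pd.DataFrame(
    {"GHI": [0, 0, 12]},
    index=pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 00:30", "2016-01-01 01:00"]
    ),
)
sample.index.name = "datetime"

# Write to an in-memory buffer instead of '../data/NSRDB_data.csv'.
buffer = io.StringIO()
sample.to_csv(buffer)
buffer.seek(0)

# index_col selects the saved index column; parse_dates=True parses it as dates.
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=True)
print(type(restored.index).__name__)
```

With the real file, the same `read_csv` arguments would be applied to the CSV path instead of the buffer.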