ER190C: Data, Environment and Society.

Lecture 2: September 3, 2019

In this notebook, we'll do a brief tour of the data set we'll be working with from the California Independent System Operator.

In [None]:
import requests # this is a really useful library for pulling data from the web
import csv # this helps us work with csv files
import numpy as np # numpy is something like a matlab replacement for python.  Numeric and scientific computing.
import pandas as pd # we'll learn more about this soon

California ISO is the system operator for the California grid.  They tell generators when and how much to produce.  

They publish renewable production data [here](http://content.caiso.com/green/renewrpt/).

That page links to files giving production for the *day* in question.  

Let's look at Aug 21, 2017, the day you'll explore in the homework.

In [None]:
# figure out what the url should be and enter it here:
url = 'http://content.caiso.com/green/renewrpt/20170821_DailyRenewablesWatch.txt' # do this in lecture
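
The file name in that url follows a date pattern (`YYYYMMDD` followed by `_DailyRenewablesWatch.txt`). As a minimal sketch, assuming the pattern holds for other days, the url can be built from a date object. The helper name `daily_report_url` is hypothetical and not part of the original notebook.

```python
import datetime

# Hypothetical helper (not part of the original notebook): build the
# report url for one day from the pattern seen in the cell above.
BASE_URL = 'http://content.caiso.com/green/renewrpt/'
TAIL = '_DailyRenewablesWatch.txt'

def daily_report_url(day):
    """Return the Daily Renewables Watch url for a datetime.date."""
    # strftime('%Y%m%d') zero-pads the month and day, e.g. 20170821
    return BASE_URL + day.strftime('%Y%m%d') + TAIL

example_url = daily_report_url(datetime.date(2017, 8, 21))
print(example_url)
```

Using `strftime` avoids padding single-digit months and days by hand.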

Let's "tab into" `requests` to see how we can get data from the url.

Some cool 'help' features of Jupyter:
1. Pressing tab after typing an object's name and a dot shows you the methods and attributes available on that object.
2. Pressing shift-tab repeatedly gives you progressively more detailed help.
3. Typing a question mark before a command pulls up the full help file.
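
Outside Jupyter, plain Python offers rough counterparts to these shortcuts. The sketch below uses a string as a stand-in object (an arbitrary choice for illustration); the same calls work on `requests` or any other object.

```python
# Plain-Python counterparts to the Jupyter shortcuts above.
sample = 'caiso'  # any object works; a string keeps the example small

# dir() lists attribute and method names, similar to what tab completion shows
public_methods = [name for name in dir(sample) if not name.startswith('_')]
print(public_methods[:5])

# The docstring is part of what shift-tab and ? display
print(str.upper.__doc__)
```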

In [None]:
caiso_data = requests.get(url) # do this in lecture

In [None]:
?requests.get # do this in lecture

In [None]:
# let's see what we got
caiso_data

'Response' is the object returned by requests.  The request has completed, but displaying the object only shows its status code, not the text of the file.  

Let's look at the requests documentation to figure out what to do.  (Search for python requests in your favorite search engine and see what you find.)

Looks like we can use the `.text` attribute of the object to get the data as a string.

In [None]:
caiso_data.text # do this in lecture

Ack!  That's pretty ugly!  What are we looking at?

<br>
<br>
<br>

(a tab delimited file)
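
To see what "tab delimited" means in practice, here is a minimal sketch that splits a small string on tabs with the `csv` module. The sample text is made up for illustration and is not real CAISO output.

```python
import csv
import io

# Made-up sample text, NOT real CAISO output: two tab-separated columns.
sample_text = 'Hour\tSOLAR\n1\t0\n2\t5\n'

# csv.reader splits each line on the delimiter we give it
rows = list(csv.reader(io.StringIO(sample_text), delimiter='\t'))
header, records = rows[0], rows[1:]
print(header)
print(records)
```

Every value comes back as a string, so numeric columns still need converting before any arithmetic.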

I wrote a function that will pull a date range and massage it into the form we want:

In [None]:
import datetime # helps us to work with dates and times in different formats
import os # helps us talk to the operating system command line
def CAISOrenewables(year, month, start_day, end_day, production = False, matrix = False):
    """Scrape CAISO's daily renewable watch .txt files and 
    convert to a DataFrame or Numpy record array. Will only scrape
    a range of days in a given month.
    
    Arguments:
    year -- year of the dates to scrape
    month -- month of the dates to scrape
    start_day -- starting day of month to scrape
    end_day -- ending day of month to scrape (inclusive)
    production -- If False, will collect hourly breakdown of renewable resources.
                  If True, will scrape hourly breakdown of total production by resource type.
    matrix -- If False, function will return a Pandas DataFrame
              If True, will return numpy recarray
    """
    base_url = 'http://content.caiso.com/green/renewrpt/'
    tail = '_DailyRenewablesWatch.txt'
    
    rv = pd.DataFrame()
    
    for day in range(start_day, end_day + 1):
        #format date and URL to pull
        if month < 10:
            str_month = '0' + str(month)
        else:
            str_month = str(month)
        if day < 10:
            str_day = '0'+ str(day)
        else:
            str_day = str(day)
            
        str_m_day = str_month + str_day
        url = base_url + str(year) + str_m_day + tail

        #Write scraped file to drive
        caiso_data = requests.get(url).text
        txt_filename = str(year) +str_m_day + '.txt'
        csv_filename = str(year) + str_m_day + '.csv'
    
        with open(txt_filename, 'w') as f:
            f.write(str(caiso_data))
    
        #Convert the .txt file to a csv.
        with open(txt_filename) as txtfile, open(csv_filename,'w') as new_csv:
            for line in txtfile: 
                new_csv.write(line.replace('\t',','))

        #Get day of year for dataframe index
        date = datetime.date(year, month, day)
    
        #Load data to dataframe.
        data = pd.read_csv(csv_filename, delimiter='\t')
        
        if not production:
            data = data.iloc[range(0, 25)]
        else:
            data = data.iloc[range(28, 53)].reset_index(drop=True)
    
        #Get column names
        columns = [i for i in np.array2string(data.iloc[0].values).split(',') if len(i)>3]
    
        #Grab first row of data to put in a dictionary then append the rest.
        first_row = [[int(i)] for i in np.array2string(data.iloc[1].values).split(',') if i.isdigit()]
        df_data = dict(zip(columns, first_row))
    
        #Do the same for the rest of the rows
        for row in range(2, data.shape[0]):
            vals = [int(i) for i in np.array2string(data.iloc[row].values).split(',') if i.isdigit()]
            for item in range(len(columns)):
                df_data[columns[item]].append(vals[item])
    
        #create DataFrame with collected data
        d_df = pd.DataFrame(df_data, [date]*24)[columns]
        rv = pd.concat([rv, d_df]) # DataFrame.append was removed in pandas 2.0
        
        os.remove(txt_filename)
        os.remove(csv_filename)
        
    if matrix:
        return rv.to_records(index=True)
    
    return rv

Ok, now we can pull whatever data we want for renewables production from the CAISO website.  

Here we'll pull CAISO renewables data for August 20 through 22, 2017.

In [None]:
caiso_data = CAISOrenewables(2017, 8, 20, 22) # do this in lecture

In [None]:
caiso_data # this shows the data frame

Now let's use the `.loc` indexer in pandas to look at an individual data column (more on pandas next time).

In [None]:
caiso_data.loc[:,'SOLAR PV'] #do this in lecture

In [None]:
import matplotlib.pyplot as plt # this gives us libraries to plot nice figures.

Let's plot the solar generation data using `plt.plot` and the `.loc` indexer.

In [None]:
plt.plot(caiso_data.loc[:,'SOLAR PV']) # do this in lecture

The problem is that the "index" of the data frame holds the same value (the date) for every row of a given day, so all of a day's data get plotted at a single x location.  

Let's fix the index with a list comprehension. Replace the current indexes with [0, 1, 2, ...]

In [None]:
caiso_data.index = [i for i in range(0,len(caiso_data.index))] # do this in lecture
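
The same fix can be checked on a toy data frame. This is a minimal sketch with made-up values (not CAISO data), where each row is indexed by its date in the way the function above builds its index.

```python
import datetime
import pandas as pd

# Toy frame with made-up values (not CAISO data): two days, three rows each,
# every row indexed by its date.
days = [datetime.date(2017, 8, 20)] * 3 + [datetime.date(2017, 8, 21)] * 3
toy = pd.DataFrame({'value': [0, 4, 1, 0, 5, 2]}, index=days)

# The date labels repeat, so the index is not unique
had_unique_index = toy.index.is_unique
print(had_unique_index)

# Replace the repeated labels with one integer per row
toy.index = range(len(toy.index))
print(list(toy.index))
```

Assigning a `range` directly gives the same result as the list comprehension, and `toy.reset_index(drop=True)` is another common way to do it.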

In [None]:
# Now we can plot according to unique indexes:
plt.plot(caiso_data.loc[:,'SOLAR PV']) # do this in lecture

In [None]:
# alternatively we can plot by hour of day to see things overlap
plt.plot(caiso_data.loc[:,'Hour'],caiso_data.loc[:,'SOLAR PV']) # do this in lecture