# Scraping CENACE

The goal is to scrape the CENACE day-ahead market energy generation predictions for the Mexican national electrical system, which are provided at a per-node resolution at http://www.cenace.gob.mx/SIM/VISTA/REPORTES/H_RepCantAsignadas.aspx?N=59&opc=divCssCantAsig&site=Cantidades%20asignadas/MDA/De%20Energ%C3%ADa%20El%C3%A9ctrica%20por%20Zona%20de%20Carga&tipoArch=C&tipoUni=SIN&tipo=De%20Energ%C3%ADa%20El%C3%A9ctrica%20por%20Zona%20de%20Carga&nombrenodop=MDA.


In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "http://www.cenace.gob.mx/SIM/VISTA/REPORTES/H_RepCantAsignadas.aspx?N=59&opc=divCssCantAsig&site=Cantidades%20asignadas/MDA/De%20Energ%C3%ADa%20El%C3%A9ctrica%20por%20Zona%20de%20Carga&tipoArch=C&tipoUni=SIN&tipo=De%20Energ%C3%ADa%20El%C3%A9ctrica%20por%20Zona%20de%20Carga&nombrenodop=MDA"

In [None]:
postdata = {
    'ctl00$ContentPlaceHolder1$toolkit':'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal',
    # This value changes too (it holds the checked tree nodes; see __EVENTARGUMENT below):
    'ctl00_ContentPlaceHolder1_treePrincipal_ClientState': '{"expandedNodes":[],"collapsedNodes":[],"logEntries":[],"selectedNodes":[],"checkedNodes":["0","0:0"],"scrollPosition":0}',
    'ctl00$ContentPlaceHolder1$HiddenOpcMenu': '',
    'ctl00_ContentPlaceHolder1_ListViewNodos_ClientState': '',
    'ctl00_ContentPlaceHolder1_NotifAvisos_ClientState': '',
    'ctl00$ContentPlaceHolder1$NotifAvisos$hiddenState': '',
    'ctl00_ContentPlaceHolder1_NotifAvisos_XmlPanel_ClientState': '',
    'ctl00_ContentPlaceHolder1_NotifAvisos_TitleMenu_ClientState': '',
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$treePrincipal',
    '__EVENTARGUMENT': '{"commandName":"Check","index":"0:0"}', # TODO: this changes!!
    '__VIEWSTATEGENERATOR': '658B03D3',
    '__ASYNCPOST': 'true'
}

In [None]:
# Do the GET request; the response page holds the __VIEWSTATE and __EVENTVALIDATION vars
r = requests.get(url)

In [None]:
soup = BeautifulSoup(r.text, 'html5lib')

In [None]:
inputs = soup.find_all('input')

In [None]:
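# Collect the hidden ASP.NET state fields by name.
# Some <input> tags may have no name attribute, so use .get() to avoid a KeyError.
hidden_fields = {}
for tag in inputs:
    name = tag.get('name')
    if name in ('__VIEWSTATE', '__EVENTVALIDATION'):
        hidden_fields[name] = tag.get('value', '')

In [None]:
# Look the values up by name instead of relying on their order in the page
postdata['__VIEWSTATE'] = hidden_fields['__VIEWSTATE']
postdata['__EVENTVALIDATION'] = hidden_fields['__EVENTVALIDATION']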

In [None]:
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3344.0 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept-Encoding': 'gzip, deflate',
}
# Send the POST request; r2 holds the result
r2 = requests.post(url, data=postdata, headers=headers)
# The useful HTML runs from line 2 of the response up to the line that starts with
# |<somenumber>|hiddenField|__VIEWSTATE|. Store that __VIEWSTATE value and the
# __EVENTVALIDATION value, because the next request needs them.


## Getting the new vars

To get the new vars, find the single line where the HTML ends (the one that starts with `|<somenumber>|hiddenField|__VIEWSTATE...`).

The parameters continue past that line but both `__EVENTVALIDATION` and `__VIEWSTATE` are in this line so you only need the one.

Once you have that line, split it on the vertical bar `|`, which acts as the separator.

Take groups of four elements out of the resulting array:

1. The first element is a number (it appears to be the length of the value)
2. The second element is the type of field to generate (e.g. `hiddenField` or `formAction`)
3. The third element is the key of the value, or blank if it's not a new variable
4. The fourth element is the actual value of the variable

Both `__VIEWSTATE` and `__EVENTVALIDATION` should be in the first 5 groups of four (look in the third element for the names of the vars). Discard the rest (they don't change).
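
Splitting on `|` only works as long as no value contains a vertical bar itself. Below is a minimal sketch of a more defensive parser, assuming that the leading number of each group is the character length of the value (the usual layout of ASP.NET AJAX partial responses is `length|type|id|content|`, but this has not been verified against the CENACE response). The function name `parse_delta` and the sample string are hypothetical, made up for illustration.

```python
def parse_delta(text):
    """Parse 'length|type|id|content|' groups into (type, id, content) tuples.

    Assumption: `length` is the number of characters in `content`.
    """
    groups = []
    pos = 0
    while pos < len(text):
        # Locate the separators that close the length, type and id fields
        end_len = text.index('|', pos)
        end_type = text.index('|', end_len + 1)
        end_id = text.index('|', end_type + 1)
        length = int(text[pos:end_len])
        start = end_id + 1
        # Read exactly `length` characters, so a '|' inside the value is harmless
        content = text[start:start + length]
        groups.append((text[end_len + 1:end_type], text[end_type + 1:end_id], content))
        # Skip the content and its trailing separator
        pos = start + length + 1
    return groups


# Hypothetical sample: the first value contains a '|' that a plain split would break on
sample = "3|hiddenField|__VIEWSTATE|a|b|5|hiddenField|__EVENTVALIDATION|hello|"
fields = {key: value for kind, key, value in parse_delta(sample) if kind == 'hiddenField'}
print(fields)
```

If the length assumption holds, a parser like this would have to run on the raw `r2.text`, because stripping or re-joining lines changes the character counts.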

In [None]:
# Separate HTML from new server garbage

# Split into lines
lines = r2.text.split('\r\n')
# Strip whitespace and ignore the first line
lines = [x.strip() for x in lines[1:]]
# Remove empty lines
lines = [x for x in lines if x]



In [None]:
# Search for the garbage now
import re
garbageLineIndex = -1
# Raw string with every | escaped; an unescaped | would act as regex alternation
searchRegex = re.compile(r'\|[0-9]+\|hiddenField\|__VIEWSTATE')
for index, line in enumerate(lines):
    if searchRegex.match(line):
        garbageLineIndex = index
        break
print(garbageLineIndex)  # -1 means the line was not found
    

In [None]:
# Store the HTML in a single soup (this is where we'll find the links to the CSVs)
# Note: it's a strong assumption that everything above this line is useful HTML.
# You might want to wrap this in a try/except statement
htmlLines = '\n'.join(lines[0:garbageLineIndex]) # end index is non-inclusive
soup = BeautifulSoup(htmlLines, 'html5lib')

In [None]:
## Get the new vars
# Taken from https://stackoverflow.com/questions/752308/split-list-into-smaller-lists
def splitlist(arr, size):
    """Split arr into consecutive chunks of `size` elements (the last may be shorter)."""
    arrs = []
    while len(arr) > size:
        piece = arr[:size]
        arrs.append(piece)
        arr = arr[size:]
    arrs.append(arr)
    return arrs

garbageLine = lines[garbageLineIndex]
tmpresult = garbageLine.split("|")
# The first element of the split is blank because the string starts with the separator
newvars = splitlist(tmpresult[1:], 4)



In [None]:
for var in newvars:
    if len(var) > 3 and (var[2] == "__VIEWSTATE" or var[2] == "__EVENTVALIDATION"):
        print(var)
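
The loop above only prints the two groups. Since the new values have to be stored for the next request, here is a sketch of how they could be folded back into the POST payload. The helper name `extract_state` and the sample values are hypothetical; in the notebook you would pass `newvars` to it and update `postdata` with the result.

```python
def extract_state(groups, names=("__VIEWSTATE", "__EVENTVALIDATION")):
    """Return {name: value} for the wanted variables found in groups of four."""
    state = {}
    for group in groups:
        # A complete group is [number, field type, key, value]
        if len(group) > 3 and group[2] in names:
            state[group[2]] = group[3]
    return state


# Hypothetical sample in the same shape as `newvars`
sample_groups = [
    ["5", "hiddenField", "__VIEWSTATE", "abcde"],
    ["0", "hiddenField", "__EVENTTARGET", ""],
    ["3", "hiddenField", "__EVENTVALIDATION", "xyz"],
    [""],  # leftover produced by the trailing separator
]
sample_postdata = {"__ASYNCPOST": "true"}
sample_postdata.update(extract_state(sample_groups))
print(sample_postdata)
```

Looking the values up by key rather than by position keeps the payload correct even if the server changes the order of the groups.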