# IS-Academia Analysis
This is an analysis of the IS-Academia data that is publicly accessible, without authentication.

**Goal**:
* Find out how long EPFL Computer Science students need to complete their Bachelor's degree.
* Do a similar analysis for the Master's degree.

---

## Collecting the Data

Before analysing the data, we first have to extract it from the IS-Academia website. By looking at this [page](http://isa.epfl.ch/imoniteur_ISAP/%21gedpublicreports.htm?ww_i_reportmodel=133685247), we can extract the names and possible values of the HTML `<input>` fields using *Beautiful Soup*. We will then be able to generate a valid request to get the wanted data.

### Analysing the requests using Postman

With the *Postman interceptor*, we can intercept requests when submitting the form we are interested in. 
A valid request URL looks like this:

`http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2016-2017&ww_x_PERIODE_ACAD=355925344&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok`

Some of this information is redundant, for example `zz_x_UNITE_ACAD=Informatique` and `ww_x_UNITE_ACAD=249847`. One of the two can indeed be dropped: the request still works without any of the `zz_x_*` parameters. On closer inspection, the `ww_x_*` parameters correspond to the `value` attributes of the options in the HTML dropdowns.

So this much simpler and shorter URL is also a valid request:

`http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286&dummy=ok`
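
The same simplification can be sketched programmatically. The snippet below is a minimal illustration using only the standard library; the helper name `strip_display_params` is hypothetical and not part of this notebook. It drops every query parameter whose name starts with `zz_` from the intercepted URL and leaves the others untouched.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_display_params(url, prefix='zz_'):
    """Return `url` without the query parameters whose name starts with `prefix`.

    Hypothetical helper, for illustration only.
    """
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as ww_c_langue=
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in pairs if not name.startswith(prefix)]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# The request URL intercepted with Postman
full_url = (
    'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter'
    '?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue='
    '&ww_i_reportModelXsl=133685270'
    '&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847'
    '&zz_x_PERIODE_ACAD=2016-2017&ww_x_PERIODE_ACAD=355925344'
    '&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108'
    '&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok'
)

short_url = strip_display_params(full_url)
print(short_url)
```

The printed URL should match the shortened request shown above, since only the `zz_x_*` pairs are removed and the order of the remaining parameters is preserved.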

The URL of the empty form page is
`http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportmodel=133685247`
which differs slightly from the IS-Academia link above, because that page uses frames, and the HTML of the frames is not included in the base page.

From here, you can already guess the form parameters we are going to use, but the goal is to extract them rather than hardcode them. At least we now have the base URL and the format of the request.

IS-Academia then generates a link that you click to get the actual data. By inspecting the requests, we can see that the URL required to get the data contains all the previous parameters, plus one additional parameter found in the HTML source of the link. We will need to extract it to get the final data.
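
As a rough sketch of that extraction step, assume the link's `onclick` attribute embeds the parameter as a single-quoted `name=value` chunk. The attribute content and the helper name `parse_onclick_param` below are hypothetical placeholders, not values taken from IS-Academia.

```python
def parse_onclick_param(onclick):
    """Extract the (name, value) pair from an onclick string shaped like ...'name=value'...

    Hypothetical helper, for illustration only.
    """
    # The parameter is assumed to be the first single-quoted chunk of the attribute
    quoted = onclick.split("'")[1]
    # Split on the first '=' only, in case the value itself contains one
    name, value = quoted.split('=', 1)
    return name, value

# Made-up attribute value, only here to show the expected shape
sample_onclick = "return openReport('some_param=12345');"

name, value = parse_onclick_param(sample_onclick)
print(name, value)
```

The resulting pair can then be added to the dictionary of request parameters before asking for the data.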


### Getting the parameters using Beautiful Soup

The principle used here is to locate each parameter through the text description shown in the HTML page, then navigate the DOM to extract the wanted values. The parameters are returned as a dictionary.

In [None]:
# Import Requests and Beautiful Soup
import requests as rq
from bs4 import BeautifulSoup

# Define the IS-Academia page
empty_form_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportmodel=133685247'

# Get the page by doing an HTTP request, and stop here if it failed
empty_form = rq.get(empty_form_url)
empty_form.raise_for_status()

# Get the soup out of it
form_soup = BeautifulSoup(empty_form.text, 'html.parser')

def get_parameters(format, filters):
    """
    Returns a dictionary containing the request parameters
    
    format : 'xls' or 'html', selects the data format
    filters : list of the option labels you want to select in the dropdowns
    """
    # Parameters dictionary
    parameters = {}
    
    # First, get the checkbox parameters
    # (`string=` is the current name of the older `text=` argument)
    checkbox = form_soup.find(string=format).parent
    parameters[checkbox['name']] = checkbox['value']
    
    # Then get the dropdown parameters: each filter is the label of an <option>,
    # whose parent <select> holds the parameter name
    for f in filters:
        html_option = form_soup.find(string=f).parent
        param_name = html_option.parent['name']
        param_value = html_option['value']
        parameters[param_name] = param_value
        
    # Don't forget the hidden fields
    hidden = form_soup.find_all('input', type='hidden')
    
    # Ignore the zz_* fields, they are useless for what we want to do, 
    # also ignore ww_i_reportmodel, already contained in the base URL
    for field in hidden:
        if not (field['name'].startswith('z') or field['name'] == 'ww_i_reportmodel'):
            parameters[field['name']] = field['value']
            
    return parameters

### Getting the actual data

With the parameters, we can request the page that contains the link to the data. After a little bit of work on this page, we can "simulate" a click on the link by generating the right request. The data is then saved to a file, and the `get_data` function returns its path.

In [None]:
def get_data(parameters, format, filename):
    """
    Downloads the data and returns the path of the saved file
    
    parameters : parameters of the request, as returned by get_parameters
    format : 'html' or 'xls'
    filename : path of the output file, without extension
    """
    
    # Get the webpage with the data link, and stop here if the request failed
    link_page = rq.get(empty_form_url, parameters)
    link_page.raise_for_status()
    
    # Get the soup out of it
    link_soup = BeautifulSoup(link_page.text, 'html.parser')

    # The interesting link is the second one on the page; the parameter sits in its onclick attribute: onclick="...'name=value'..."
    link_param = link_soup.find_all('a')[1]['onclick'].split('\'')[1].split('=')
    parameters[link_param[0]] = link_param[1]

    # The request needs a capitalized xls
    format = 'XLS' if format == 'xls' else format
    
    url = empty_form_url.replace('filter', format)
        
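    # Thanks to http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py
    # stream=True avoids loading the whole file in memory before writing it chunk by chunk
    r = rq.get(url, parameters, stream=True)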
    filename = filename + '.%s' % format
    
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024): 
            if chunk: # Filter out keep-alive new chunks
                f.write(chunk) 
    return filename
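
To make the final request more concrete, here is a minimal sketch of roughly the URL that `get_data` ends up asking for, rebuilt with the standard library instead of *Requests*. The helper `build_download_url` and the `some_param` entry are hypothetical; the other values come from the request analysed earlier.

```python
from urllib.parse import urlencode

def build_download_url(form_url, parameters, data_format):
    """Return the download URL for the given parameters (hypothetical helper)."""
    # 'xls' must be upper-cased; any other format is used as-is, as in get_data
    endpoint = 'XLS' if data_format == 'xls' else data_format
    url = form_url.replace('filter', endpoint)
    # form_url already carries a query string, so the parameters are appended with '&'
    return url + '&' + urlencode(parameters)

form_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportmodel=133685247'
sample_parameters = {
    'ww_b_list': '1',
    'ww_x_UNITE_ACAD': '249847',
    'some_param': '12345',  # hypothetical stand-in for the parameter read from the link
}

download_url = build_download_url(form_url, sample_parameters, 'xls')
print(download_url)
```

Only the endpoint name changes between the form page (`filter`) and the download (`XLS` or `html`); the query parameters are otherwise the same, plus the one read from the link.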

We are now ready to fetch the desired data and save it to a file:

In [None]:
format = 'xls'
filters = ['Informatique', 'Bachelor semestre 1', '2016-2017', 'Semestre d\'automne']
filename = 'test'

# Get the request parameters
params = get_parameters(format, filters)
    
get_data(params, format, filename)