# Data from the Web

As explained in the README file, we should scrape data from the EPFL [portal](http://is-academia.epfl.ch/page-6228.html). 

As one selects items from the drop-down menus, the parameters in the web address change accordingly. The suggested way to read the search parameters is [Postman Interceptor](http://www.getpostman.com), and that's what we'll do.
 
When accessing http://isa.epfl.ch/imoniteur_ISAP/%21gedpublicreports.htm?ww_i_reportmodel=133685247, we can see in Postman which parameters can be selected, with the base URL: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247.
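A minimal sketch, using only the standard library (`urllib.parse`), of how such an address splits into a base URL and its query parameters. The variable names are chosen just for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

filter_url = ("http://isa.epfl.ch/imoniteur_ISAP/"
              "!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")

parts = urlsplit(filter_url)
base = f"{parts.scheme}://{parts.netloc}{parts.path}"  # everything before the '?'
query = parse_qs(parts.query)                           # parameter name -> list of values

print(base)
print(query)
```

Each drop-down selection simply adds more `name=value` pairs to that query string.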


Let's first access this page with a *GET request* and parse it with BeautifulSoup.

In [132]:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import urllib.request as ur 
import requests
import json


# This URL does not contain the list of names yet, just the list of options

base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36" }

req = ur.Request(base_url, headers = headers)
page = ur.urlopen(req).read()
soup = BeautifulSoup(page,'lxml') 

print (soup.prettify()[1000:10000])  # Taking a look at it

adio" value="133685270"/>
       html
      </td>
     </tr>
     <tr>
      <td>
       <input name="ww_i_reportModelXsl" type="radio" value="133685271"/>
       xls
      </td>
     </tr>
    </table>
    <h1>
    </h1>
    <table border="0" id="filtre">
     <tr>
      <th>
       Unité académique
      </th>
      <td>
       <input name="zz_x_UNITE_ACAD" type="hidden" value=""/>
       <select name="ww_x_UNITE_ACAD" onchange="document.f.zz_x_UNITE_ACAD.value=document.f.ww_x_UNITE_ACAD.options[document.f.ww_x_UNITE_ACAD.selectedIndex].text">
        <option value="null">
        </option>
        <option value="942293">
         Architecture
        </option>
        <option value="246696">
         Chimie et génie chimique
        </option>
        <option value="943282">
         Cours de mathématiques spéciales
        </option>
        <option value="637841336">
         EME (EPFL Middle East)
        </option>
        <option value="942623">
         Génie civil
        </opti

The drop-down menus use the 'select' tag, so I'll search for all of them.

In [5]:
find_select = soup.find_all('select')

len(find_select) 

4

We see there are four drop-down menus. Looking at *find_select*, the attribute we want is _"name"_. Let's list them:

In [6]:
for elem in find_select:
    print (elem.attrs['name'])

ww_x_UNITE_ACAD
ww_x_PERIODE_ACAD
ww_x_PERIODE_PEDAGO
ww_x_HIVERETE


Next, inside each _select_ tag we need the _option_ tags: their _value_ attribute and, just for clarity, their displayed names:

In [109]:
for elem in find_select:
    options = elem.find_all('option')
    print (elem.attrs['name'] + ":")
    print('')
    for opt in options:
        print ("{} - {}".format(opt.attrs['value'], opt.text))  
    print('')
    print('') 

ww_x_UNITE_ACAD:

null - 
942293 - Architecture
246696 - Chimie et génie chimique
943282 - Cours de mathématiques spéciales
637841336 - EME (EPFL Middle East)
942623 - Génie civil
944263 - Génie mécanique
943936 - Génie électrique et électronique 
2054839157 - Humanités digitales
249847 - Informatique
120623110 - Ingénierie financière
946882 - Management de la technologie
944590 - Mathématiques
945244 - Microtechnique
945571 - Physique
944917 - Science et génie des matériaux
942953 - Sciences et ingénierie de l'environnement
945901 - Sciences et technologies du vivant
1574548993 - Section FCUE
946228 - Systèmes de communication


ww_x_PERIODE_ACAD:

null - 
355925344 - 2016-2017
213638028 - 2015-2016
213637922 - 2014-2015
213637754 - 2013-2014
123456101 - 2012-2013
123455150 - 2011-2012
39486325 - 2010-2011
978195 - 2009-2010
978187 - 2008-2009
978181 - 2007-2008


ww_x_PERIODE_PEDAGO:

null - 
249108 - Bachelor semestre 1
249114 - Bachelor semestre 2
942155 - Bachelor semestre 3
94216
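The same listing can also be collected into a dictionary, one entry per drop-down menu. The sketch below runs on a small inline fragment (values copied from the output above) rather than on the live page, and uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies. The class name `SelectOptionCollector` is made up for this example.

```python
from html.parser import HTMLParser

class SelectOptionCollector(HTMLParser):
    """Collect (value, label) pairs for every <option> of every <select>."""

    def __init__(self):
        super().__init__()
        self.menus = {}       # select name -> list of (value, label)
        self._select = None   # name of the <select> currently open
        self._value = None    # value of the <option> currently open
        self._text = []       # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._select = attrs.get("name")
            self.menus[self._select] = []
        elif tag == "option" and self._select is not None:
            self._value = attrs.get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            label = "".join(self._text).strip()
            self.menus[self._select].append((self._value, label))
            self._value = None
        elif tag == "select":
            self._select = None

fragment = """
<select name="ww_x_UNITE_ACAD">
 <option value="null"></option>
 <option value="942293">Architecture</option>
 <option value="249847">Informatique</option>
</select>
<select name="ww_x_PERIODE_ACAD">
 <option value="null"></option>
 <option value="355925344">2016-2017</option>
</select>
"""

collector = SelectOptionCollector()
collector.feed(fragment)
collector.close()
menus = collector.menus
print(menus)
```

Keeping the label next to each value makes it easier to check later which code stands for which section or year.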

In [201]:
# I'll write the output into a data frame:
param_df = []


for elem in find_select:
    options = elem.find_all('option')
    #columns.append(elem.attrs['name'])
    column = elem.attrs['name']
    row = []
    for opt in options:
        row.append(opt.attrs['value'])
    df = pd.DataFrame(row, columns=[column])
    param_df.append(df)
param_df = pd.concat(param_df, axis=1)    
    
param_df.fillna(0, inplace=True)  # replacing NaN with zeros will make things easier
param_df.head(8)

Unnamed: 0,ww_x_UNITE_ACAD,ww_x_PERIODE_ACAD,ww_x_PERIODE_PEDAGO,ww_x_HIVERETE
0,,,,
1,942293.0,355925344.0,249108.0,2936286.0
2,246696.0,213638028.0,249114.0,2936295.0
3,943282.0,213637922.0,942155.0,0.0
4,637841336.0,213637754.0,942163.0,0.0
5,942623.0,123456101.0,942120.0,0.0
6,944263.0,123455150.0,2226768.0,0.0
7,943936.0,39486325.0,942175.0,0.0


We want to start with the Informatique (computer science) students from the first semester of 2007-2008. This corresponds to the selection:

- ww_x_UNITE_ACAD: 249847 (Informatique)
- ww_x_PERIODE_ACAD: 978181 (2007-2008)
- ww_x_PERIODE_PEDAGO: 249108 (Bachelor semestre 1)
- ww_x_HIVERETE: 2936286 (Semestre d'automne)
    
After making this selection and asking for the output in html and xls, Interceptor shows the following URLs, respectively:

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_PERIODE_PEDAGO=249108&ww_x_UNITE_ACAD=249847&ww_i_reportModel=133685247&ww_x_GPS=71297531&ww_x_PERIODE_ACAD=978181&ww_i_reportModelXsl=133685270&ww_x_HIVERETE=2936286


http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.XLS?ww_x_PERIODE_PEDAGO=249108&ww_x_UNITE_ACAD=249847&ww_i_reportModel=133685247&ww_x_GPS=71297531&ww_x_PERIODE_ACAD=978181&ww_i_reportModelXsl=133685271&ww_x_HIVERETE=2936286
    
There is another parameter that changes with the selection: 'ww_x_GPS'. I could not find which values it takes. Someone suggested that setting it to -1 solves the issue, although I couldn't figure out why.
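Parsing the two intercepted URLs makes it easy to check which parameters they share and where they differ. This is a small sketch with the standard library only; `query_dict` is a helper name made up here.

```python
from urllib.parse import urlsplit, parse_qsl

html_url = ("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?"
            "ww_x_PERIODE_PEDAGO=249108&ww_x_UNITE_ACAD=249847"
            "&ww_i_reportModel=133685247&ww_x_GPS=71297531"
            "&ww_x_PERIODE_ACAD=978181&ww_i_reportModelXsl=133685270"
            "&ww_x_HIVERETE=2936286")
xls_url = ("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.XLS?"
           "ww_x_PERIODE_PEDAGO=249108&ww_x_UNITE_ACAD=249847"
           "&ww_i_reportModel=133685247&ww_x_GPS=71297531"
           "&ww_x_PERIODE_ACAD=978181&ww_i_reportModelXsl=133685271"
           "&ww_x_HIVERETE=2936286")

def query_dict(url):
    """Return the query string of `url` as a {name: value} dictionary."""
    return dict(parse_qsl(urlsplit(url).query))

html_params = query_dict(html_url)
xls_params = query_dict(xls_url)

# Parameters whose value is not the same in both URLs
differing = {
    name: (html_params.get(name), xls_params.get(name))
    for name in sorted(set(html_params) | set(xls_params))
    if html_params.get(name) != xls_params.get(name)
}
print(differing)
```

For these two URLs only `ww_i_reportModelXsl` differs in the query string (besides the `.html` / `.XLS` ending of the path), so the output format is the only thing that changed between them.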
    
So below we write a small dictionary with the parameters to make the request for this selection. The result is a table, which we load into a pandas data frame.

In [89]:
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?"
params = {
    'ww_x_GPS': 71297531, 
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_UNITE_ACAD': 249847,
    'ww_x_PERIODE_ACAD': 978181,
    'ww_x_PERIODE_PEDAGO': 249108,
    'ww_x_HIVERETE': 2936286 # 'null' works too
}

soup_html = requests.get(base_url, params = params)

result_soup = BeautifulSoup(soup_html.text, "lxml") # read the HTML table
result_table = result_soup.find_all('table')[0]


df = pd.read_html(result_table.decode())[0]
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,"Informatique, 2007-2008, Bachelor semestre 1 ...",,,,,,,,,,,
1,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,
2,Monsieur,Arévalo Christian,,,,,,Présent,,,169569,
3,Monsieur,Aubelle Flavien,,,,,,Présent,,,174905,
4,Monsieur,Badoud Morgan,,,,,,Présent,,,173922,
5,Monsieur,Baeriswyl Jonathan,,,,,,Présent,,,179406,
6,Monsieur,Barroco Michael,,,,,,Présent,,,179428,
7,Monsieur,Belfis Nicolas,,,,,,Présent,,,179324,
8,Monsieur,Beliaev Stanislav,,,,,,Présent,,,174597,
9,Monsieur,Bindschaedler Vincent,,,,,,Présent,,,179449,
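In the table above, row 0 holds the report title and row 1 holds the real column names. Here is a minimal sketch, in plain Python and with made-up rows standing in for the scraped ones, of promoting that second row to a header. The function name `rows_to_records` is hypothetical, and only three of the columns are kept for brevity.

```python
def rows_to_records(rows):
    """Skip the title row, use the next row as header, return one dict per student."""
    header = rows[1]
    records = []
    for row in rows[2:]:
        # Pair each cell with its column name; columns without a name are dropped
        record = {name: value for name, value in zip(header, row) if name is not None}
        records.append(record)
    return records

rows = [
    ["Informatique, 2007-2008, Bachelor semestre 1 ...", None, None, None],  # title row
    ["Civilité", "Statut", "No Sciper", None],                               # header row
    ["Monsieur", "Présent", "100001", None],                                 # made-up student
    ["Madame", "Présent", "100002", None],                                   # made-up student
]

records = rows_to_records(rows)
for record in records:
    print(record)
```

The same idea applies to the data frame: the second row becomes the column labels and the first two rows are dropped.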


Now we should collect the data of all the Bachelor students from 2007 to 2016. We are going to iterate over the values of the parameter data frame produced previously.

In [202]:
for col in param_df:
    param = col
    print (param)
    for elem in param_df[col]:
        if elem != 'null' and elem != 0:
            print(elem)


ww_x_UNITE_ACAD
942293
246696
943282
637841336
942623
944263
943936
2054839157
249847
120623110
946882
944590
945244
945571
944917
942953
945901
1574548993
946228
ww_x_PERIODE_ACAD
355925344
213638028
213637922
213637754
123456101
123455150
39486325
978195
978187
978181
ww_x_PERIODE_PEDAGO
249108
249114
942155
942163
942120
2226768
942175
2226785
2230106
942192
2230128
2230140
2335667
2335676
2063602308
249127
3781783
953159
2754553
953137
2226616
983606
2226626
2227132
ww_x_HIVERETE
2936286
2936295
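One way to enumerate the requests still to be made is to take the Cartesian product of the option values. The sketch below does this for a small subset of the values listed above and only builds the URLs, without sending anything. `build_queries` is a hypothetical helper. Setting `ww_x_GPS` to -1 and `ww_x_HIVERETE` to 'null' follows the suggestions mentioned earlier and remains an assumption.

```python
from itertools import product
from urllib.parse import urlencode

BASE_URL = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html"

# Parameters kept fixed for every request (Informatique, html output)
FIXED = {
    "ww_x_GPS": -1,  # assumption: -1 works for any selection
    "ww_i_reportModel": 133685247,
    "ww_i_reportModelXsl": 133685270,
    "ww_x_UNITE_ACAD": 249847,
}

# Small subset of the options listed above
periods = {"2007-2008": 978181, "2008-2009": 978187}
semesters = {"Bachelor semestre 1": 249108, "Bachelor semestre 2": 249114}

def build_queries(periods, semesters):
    """Return (period name, semester name, url) for every combination."""
    queries = []
    for (p_name, p_id), (s_name, s_id) in product(periods.items(), semesters.items()):
        params = dict(FIXED,
                      ww_x_PERIODE_ACAD=p_id,
                      ww_x_PERIODE_PEDAGO=s_id,
                      ww_x_HIVERETE="null")
        queries.append((p_name, s_name, BASE_URL + "?" + urlencode(params)))
    return queries

queries = build_queries(periods, semesters)
for p_name, s_name, url in queries:
    print(p_name, "|", s_name, "|", url)
```

Keeping the human-readable names next to each URL makes it possible to label the resulting tables afterwards.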


In [598]:
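base_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?"

# Selection for this request. The values are the ones used earlier:
# Informatique, 2007-2008, Bachelor semestre 1
unity = 249847
period = 978181
pedag = 249108

params = {
    'ww_x_UNITE_ACAD': unity,
    'ww_x_PERIODE_ACAD': period,
    'ww_x_PERIODE_PEDAGO': pedag,
    'ww_x_GPS': -1, 
    'ww_i_reportModel': 133685247,
    'ww_i_reportModelXsl': 133685270,
    'ww_x_HIVERETE': 'null' 
}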



soup_html = requests.get(base_url, params = params)

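result_soup = BeautifulSoup(soup_html.text, "lxml") # Read the HTML table
result_table = result_soup.find_all('table')[0]
find_student = result_table.find_all('td')  # assumption: the cells scanned for student numbers below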

'["169569", "174905", "173922", "179406", "179428", "179324", "174597", "179449", "178553", "179426", "178271", "182433", "180731", "171619", "179837", "179157", "179864", "174590", "178843", "178711", "178786", "179567", "176282", "178656", "181445", "178718", "175466", "173882", "181612", "181232", "178706", "180284", "181121", "170509", "175379", "180570", "178604", "175190", "178660", "181248", "179163", "181181", "181244", "175685", "169731", "175001", "181424", "181259", "178433", "181460", "181298", "181513", "175478", "176459", "175014", "181514", "179355", "181076", "175576", "181115", "180094", "180853", "178726", "181017", "175031", "179194", "175754", "179980", "179988", "174187", "180959", "171195", "178620", "180979", "180980", "178948", "169795", "178684", "180241", "180982", "181291", "175280", "179053", "180854", "171568", "174120", "180185", "175834", "174340", "178682"]'

In [602]:
import pandas as pd
import numpy as np

students_list = []
for student in find_student:
    # keep only the cells whose text contains the digit '1'
    if student.string is not None and '1' in student.string:
        students_list.append(student.string)
json.dumps(students_list)

with open("list_stu.json", "w") as writeJSON:
    json.dump(students_list, writeJSON)  # save the list to disk as JSON


In [648]:
students_list = []
for student in find_student:
    # keep only the cells whose text contains the digit '1'
    if student.string is not None and '1' in student.string:
        students_list.append(student.string)
            
json.dumps(students_list)

with open("list_stu.json", "w") as writeJSON:
    json.dump(students_list, writeJSON)  # save the list to disk as JSON
    
# Now read this list back as a data frame

df = pd.read_json("list_stu.json")
df.columns = ['Student Number']
df.head()

Unnamed: 0,Student Number
0,169569
1,174905
2,173922
3,179406
4,179428


In [676]:
students_list = []
for student in find_student:
    # keep only the cells whose text contains the digit '1'
    if student.string is not None and '1' in student.string:
        students_list.append(student.string)
            
print(students_list)


['169569', '174905', '173922', '179406', '179428', '179324', '174597', '179449', '178553', '179426', '178271', '182433', '180731', '171619', '179837', '179157', '179864', '174590', '178843', '178711', '178786', '179567', '176282', '178656', '181445', '178718', '175466', '173882', '181612', '181232', '178706', '180284', '181121', '170509', '175379', '180570', '178604', '175190', '178660', '181248', '179163', '181181', '181244', '175685', '169731', '175001', '181424', '181259', '178433', '181460', '181298', '181513', '175478', '176459', '175014', '181514', '179355', '181076', '175576', '181115', '180094', '180853', '178726', '181017', '175031', '179194', '175754', '179980', '179988', '174187', '180959', '171195', '178620', '180979', '180980', '178948', '169795', '178684', '180241', '180982', '181291', '175280', '179053', '180854', '171568', '174120', '180185', '175834', '174340', '178682']
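Testing for the digit '1' is a crude filter: it would also keep any other cell that happens to contain a 1. Here is a sketch of a stricter test, assuming (from the output above) that a student number is always exactly six digits. `SCIPER_RE` and `extract_scipers` are names made up for this example.

```python
import re

SCIPER_RE = re.compile(r"\d{6}")  # assumption: a student number is exactly six digits

def extract_scipers(cell_texts):
    """Keep only the cell texts that consist of six digits (ignoring spaces around)."""
    return [text.strip() for text in cell_texts
            if text is not None and SCIPER_RE.fullmatch(text.strip())]

# Cell texts as they could come out of the table; None stands for an empty cell
cells = ["Monsieur", None, "Présent", "169569", "Bachelor semestre 1", " 174905 "]

scipers = extract_scipers(cells)
crude = [text for text in cells if text is not None and '1' in text]

print(scipers)
print(crude)
```

On this sample the crude test also lets "Bachelor semestre 1" through, while the pattern keeps only the two numbers.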


In [678]:
#Here  I'm going to read the main part of the page and add the options for the parameters:
import glob 

names_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=71297531&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD='
# ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286'

# Now iterate over all the possible academic periods
for child in find_period:
    if child['value'] != 'null': # the first entry is null, exclude it
        
        req = ur.Request(names_url+str(child['value'])+str('&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286'), headers = headers)
        page = ur.urlopen(req).read()
        soup_p = BeautifulSoup(page,'lxml') 
        
        find_student = soup_p.find_all('td')  # cells of the page just downloaded
        students_list= []
        for student in find_student:
            if student.string is not None:
                if '1' in student.string:
                    students_list.append(student.string)
            
        json.dumps(students_list)
                    
        with open("list_stu.json", "w") as writeJSON:
            json.dump((students_list), writeJSON)  # it actually kind of works
        
        df = pd.read_json("list_stu.json")
        df.columns = [child['value']]
        
            
       # appended_df = []
        #for infile in glob.glob("*.json"):
         #   df = pd.read_json(infile)
          #  appended_df = pd.append(df)
           # appended_df = pd.concat(appended_df, axis=1)
            
#appended_df.head()
    
    


In [868]:

names_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=71297531&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD='
# ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286'

# Now iterate over all the possible academic periods
aca_period = pd.Series([0]*(len(find_period)-1)) # the first entry is null and we skip it
period_name = pd.Series([0]*(len(find_period)-1)) # period names, to label the data frame
find_student = soup.find_all('td')
students_list =  pd.DataFrame(index=range(1,500),columns=period_name) # create a df with columns as the period

i = 0
for child in find_period:  # iterating over the academic periods
    if child['value'] != 'null': # the first entry is null, exclude it
        aca_period[i] = child['value']
        period_name[i] = str(child.get_text())
        req = ur.Request(names_url+str(child['value'])+str('&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286'), headers = headers)
        page = ur.urlopen(req).read()
        soup_p = BeautifulSoup(page,'lxml') 
        
        find_student = soup_p.find_all('td')
        j = 1
        for student in find_student: # reading all students from an academic period
            if student.string is not None:
                if '2' in student.string:  # keeps only the cells containing the digit '2'
                    students_list.loc[j,str(child.get_text())] = student.string
                    print (student.string)
                    print (str(child.get_text()))
                    j = j + 1
        i = i + 1  # one slot per academic period

173922
2007-2008
179428
2007-2008
179324
2007-2008
179426
2007-2008
178271
2007-2008
182433
2007-2008
176282
2007-2008
173882
2007-2008
181612
2007-2008
181232
2007-2008
180284
2007-2008
181121
2007-2008
181248
2007-2008
181244
2007-2008
181424
2007-2008
181259
2007-2008
181298
2007-2008
178726
2007-2008
178620
2007-2008
180241
2007-2008
180982
2007-2008
181291
2007-2008
175280
2007-2008
174120
2007-2008
178682
2007-2008


In [861]:
df2 = pd.DataFrame(index=range(1,500),columns=period_name)
#df2.name = 'Period'
#df2.index.name = 'Student'
df2.loc[1,'2007-2008'] = 'va'

df2.head()

Unnamed: 0,2016-2017,2015-2016,2014-2015,2013-2014,2012-2013,2011-2012,2010-2011,2009-2010,2008-2009,2007-2008
1,,,,,,,,,,va
2,,,,,,,,,,
3,,,,,,,,,,
4,,,,,,,,,,
5,,,,,,,,,,
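Instead of preallocating 500 rows, the numbers can first be gathered in a dictionary keyed by period name and padded only at the end. This is a sketch in plain Python. The 2007-2008 values are the first ones printed above, the 2008-2009 entry is a made-up placeholder, and the variable names are chosen for this example.

```python
from itertools import zip_longest

# One list of student numbers per academic period (lists may have different lengths)
students_by_period = {
    "2007-2008": ["173922", "179428", "179324"],
    "2008-2009": ["100001"],  # made-up placeholder
}

periods = list(students_by_period)

# Pad the shorter lists with None so that every row has one cell per period
rows = list(zip_longest(*(students_by_period[p] for p in periods), fillvalue=None))

print(periods)
for row in rows:
    print(row)
```

The table then has exactly as many rows as the largest period, and the column names are known before any cell is written.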


In [713]:
serie = pd.Series([0]*(len(find_period)-1)) # we have seen that the first one is null and we are skipping it
i = 0
for child in find_period:
    if child['value'] != 'null':
        #print (child['value'])
        serie[i] = child['value'] 
        i = 1 + i
        
print (serie)
        
#for item in range(1, len(find_period)):
 #   print (find_period['item'])
#type (find_period)
#find_period.attrs

#for item in find_period.stripped_strings:
 #   print (item)

    #find_period.option

0    355925344
1    213638028
2    213637922
3    213637754
4    123456101
5    123455150
6     39486325
7       978195
8       978187
9       978181
dtype: int64
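The same values can be gathered without a preallocated series and a manual counter. A small sketch on the first few (value, label) pairs from the listing above; the variable names are chosen for this example.

```python
# (value, label) pairs as read from the <option> tags; the first one is the empty choice
options = [
    ("null", ""),
    ("355925344", "2016-2017"),
    ("213638028", "2015-2016"),
    ("213637922", "2014-2015"),
]

# Skip the 'null' entry and map each period name to its numeric code
period_ids = {label: int(value) for value, label in options if value != "null"}

print(period_ids)
```

A dictionary also keeps the period name attached to each code, which the series above loses.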


In [182]:
find_options = soup.find_all("option")



for v in find_options:
    print(v['value'])

null
942293
246696
943282
637841336
942623
944263
943936
2054839157
249847
120623110
946882
944590
945244
945571
944917
942953
945901
1574548993
946228
null
355925344
213638028
213637922
213637754
123456101
123455150
39486325
978195
978187
978181
null
249108
249114
942155
942163
942120
2226768
942175
2226785
2230106
942192
2230128
2230140
2335667
2335676
2063602308
249127
3781783
953159
2754553
953137
2226616
983606
2226626
2227132
null
2936286
2936295


In [745]:
for child in find_period:  # iterating over the academic periods
    if child['value'] != 'null': # the first entry is null, exclude it
        print(child.get_text())


2016-2017
2015-2016
2014-2015
2013-2014
2012-2013
2011-2012
2010-2011
2009-2010
2008-2009
2007-2008


for child in find_menu.children:
    print(child)

In [144]:
for value in find_menu.stripped_strings:
    print (value)

Semestre d'automne
Semestre de printemps


In [164]:
find_option = soup('option')[21] # soup('option') is equivalent to soup.find_all('option')
print(find_option)

<option value="355925344">2016-2017</option>


In [98]:
base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=71297531&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286' 
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36" }

req = ur.Request(base_url, headers = headers)
page = ur.urlopen(req).read()
soup = BeautifulSoup(page,'lxml') 

print (soup.prettify()[0:1000])

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="gedpublicreports.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css"/>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <fieldset style="text-align:right; width:40%; position:relative; margin-right: 10px;float:right; border: 0; padding: 0 0 8px 0;">
   <a href="!GEDREPORTS.html?ww_x_GPS=71297531&amp;ww_i_reportModel=133685247&amp;ww_i_reportModelXsl=133685270&amp;ww_x_UNITE_ACAD=249847&amp;ww_x_PERIODE_ACAD=978181&amp;ww_x_PERIODE_PEDAGO=249108&amp;ww_x_HIVERETE=2936286" style="color:#990033;">
    Identification pour accéder aux e-mails
    <br/>
    Login to access email adresses
   </a>
  </fieldset>
  <script>
   function mailList(x) {
   var vtop = (screen.height-200)/2;
   var vleft=(screen.width-600)/2;
   var w=open("", "emaillist", "Scrollbars=1,resizable=1,width=600,height=200,top="+vtop+",lef

