# Time taken to complete degree
Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both Bachelor semestre 1 and Bachelor semestre 6. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?

In [1]:
import requests as req
import urllib
import pandas as pd
from bs4 import BeautifulSoup as bes
import inspect
import copy
from tqdm import tqdm

#Add the base url where the form is
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!gedpublicreports.htm?ww_i_reportmodel=133685247"
full_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=&ww_x_PERIODE_ACAD=213638028&zz_x_PERIODE_PEDAGO=&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=&ww_x_HIVERETE=2936286&dummy=ok"

Applying the filters for Bachelor semestre 1 in Informatique for 2015-2016 gives the following URL:
```
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=&ww_x_PERIODE_ACAD=213638028&zz_x_PERIODE_PEDAGO=&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=&ww_x_HIVERETE=2936286&dummy=ok
```
Feeding this URL to Postman Interceptor gives us the following parameter values:
```
ww_b_list:1
ww_i_reportmodel:133685247
ww_c_langue:
ww_i_reportModelXsl:133685270
zz_x_UNITE_ACAD:
ww_x_UNITE_ACAD:249847
zz_x_PERIODE_ACAD:
ww_x_PERIODE_ACAD:213638028
zz_x_PERIODE_PEDAGO:
ww_x_PERIODE_PEDAGO:249108
zz_x_HIVERETE:
ww_x_HIVERETE:2936286
dummy:ok
```
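
The same parameter list can also be recovered without Postman. The following is a minimal sketch using only the standard library's `urllib.parse`; the variable names `url` and `url_params` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlsplit, parse_qsl

# The filter URL shown above, split across lines for readability
url = ("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?"
       "ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue="
       "&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=&ww_x_UNITE_ACAD=249847"
       "&zz_x_PERIODE_ACAD=&ww_x_PERIODE_ACAD=213638028"
       "&zz_x_PERIODE_PEDAGO=&ww_x_PERIODE_PEDAGO=249108"
       "&zz_x_HIVERETE=&ww_x_HIVERETE=2936286&dummy=ok")

# keep_blank_values=True preserves parameters such as ww_c_langue that have no value
url_params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))

for name, value in url_params.items():
    print(f"{name}:{value}")
```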
We used the browser's Inspect Element tool to get the URL of the page that displays only the data table, without the form. The URL was as follows:
```
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=1897032870&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=213638028&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286
```
Feeding this URL to Postman Interceptor gives us the following parameter values:
```
ww_x_GPS:1897032870
ww_i_reportModel:133685247
ww_i_reportModelXsl:133685270
ww_x_UNITE_ACAD:249847
ww_x_PERIODE_ACAD:213638028
ww_x_PERIODE_PEDAGO:249108
ww_x_HIVERETE:2936286
```
Looking at the HTML code, we can figure out what each of the parameters stands for:
+ **ww_x_GPS**: Several lists may match our search; this parameter specifies which one to open. The value -1 selects "tous" (all), so we set it to -1.
+ **ww_i_reportModel and ww_i_reportModelXsl**: Select whether the output is HTML or Excel. We always want HTML, so we fix them to the values captured above with Postman Interceptor.
+ **ww_x_UNITE_ACAD**: We only consider 'Informatique', so we fix it to the value for Informatique captured above.
+ **ww_x_PERIODE_ACAD**: The academic year, which we need to vary. We build a dictionary that maps each academic year to its value by parsing the HTML source with BeautifulSoup.
+ **ww_x_PERIODE_PEDAGO**: The semester. We also need a dictionary for this, just as for ww_x_PERIODE_ACAD.
+ **ww_x_HIVERETE**: Same as above.
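
To make the roles of these parameters concrete, here is a small sketch that assembles the table URL for one query from the values listed above. It only builds the URL string and does not contact the server; `table_endpoint`, `query_params`, and `table_url` are names chosen for this illustration.

```python
from urllib.parse import urlencode

# Endpoint of the table-only page found with Inspect Element
table_endpoint = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html"

query_params = {
    "ww_x_GPS": "-1",                    # -1 selects 'tous' (all lists)
    "ww_i_reportModel": "133685247",     # fixed
    "ww_i_reportModelXsl": "133685270",  # fixed (HTML output)
    "ww_x_UNITE_ACAD": "249847",         # Informatique
    "ww_x_PERIODE_ACAD": "213638028",    # 2015-2016
    "ww_x_PERIODE_PEDAGO": "249108",     # Bachelor semestre 1
    "ww_x_HIVERETE": "2936286",          # Semestre d'automne
}

# urlencode joins the pairs as name=value&name=value
table_url = table_endpoint + "?" + urlencode(query_params)
print(table_url)
```

Varying only the `ww_x_PERIODE_ACAD` and `ww_x_PERIODE_PEDAGO` entries then addresses a different year or semester.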

Now we need to get the dictionaries for the required fields.

In [2]:
r = req.get(full_url)
soup = bes(r.text, 'lxml')  # parse the HTML with BeautifulSoup
#print (soup.prettify())

Each 'select' element corresponds to a dropdown menu in the form, and each 'option' element to one of its entries. We build a dictionary that maps every entry of every dropdown menu to the value sent to the server in the query.

In [3]:
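dicti = {}
for select in soup.find_all('select'):
    # One sub-dictionary per dropdown menu, keyed by the parameter name
    name = select['name'].strip()
    dicti[name] = {}
    for option in select.find_all('option'):
        # Skip the placeholder entry whose value is the string 'null'
        if option['value'] != 'null':
            strng = option.string.strip()
            dicti[name][strng] = option['value'].strip()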

This gives us the following dictionary.

In [4]:
dicti

{'ww_x_HIVERETE': {"Semestre d'automne": '2936286',
  'Semestre de printemps': '2936295'},
 'ww_x_PERIODE_ACAD': {'2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344'},
 'ww_x_PERIODE_PEDAGO': {'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785',
  'Master semestre 1': '2230106',
  'Master semestre 2': '942192',
  'Master semestre 3': '2230128',
  'Master semestre 4': '2230140',
  'Mineur semestre 1': '2335667',
  'Mineur semestre 2': '2335676',
  'Mise à niveau': '2063602308',
  'Projet Master automne': '249127',
  'Projet Master printemps': '3781783

We know from Postman Interceptor that the parameters in the above dictionary are not enough to make the request to the server. We are missing three parameters, which are fixed. We now add them to the dictionary.

In [5]:
gps = 'ww_x_GPS'
r1 = 'ww_i_reportModel'
r2 = 'ww_i_reportModelXsl'
dicti[gps] = {'tous': '-1'}        # fixed: -1 selects 'tous' (all lists)
dicti[r1] = {'html': '133685247'}  # fixed
dicti[r2] = {'xls': '133685270'}   # fixed: despite the 'xls' label, this value selects HTML output

We restrict the dictionary to keep only values we are interested in.

In [6]:
# 'ww_x_HIVERETE' is redundant
del dicti['ww_x_HIVERETE'] 

#keep only informatique for 'ww_x_UNITE_ACAD'
for key in list(dicti['ww_x_UNITE_ACAD']): 
    if(key != 'Informatique'):
        del dicti['ww_x_UNITE_ACAD'][key]
        
#keep only the bachelor semesters
for key in list(dicti['ww_x_PERIODE_PEDAGO']):
    if not key.startswith('Bachelor'):
        del dicti['ww_x_PERIODE_PEDAGO'][key]

Our dictionary now has only the entries that we need.

In [7]:
dicti

{'ww_i_reportModel': {'html': '133685247'},
 'ww_i_reportModelXsl': {'xls': '133685270'},
 'ww_x_GPS': {'tous': '-1'},
 'ww_x_PERIODE_ACAD': {'2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344'},
 'ww_x_PERIODE_PEDAGO': {'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785'},
 'ww_x_UNITE_ACAD': {'Informatique': '249847'}}

Also, we don't need the labels of the values anymore, so we change the dictionary to hold the values simply as lists. This makes it easier to iterate over all possibilities for the parameters.

In [8]:
# Keep only the option values for each parameter, dropping the labels
new_dict = {}
for key, options in dicti.items():
    new_dict[key] = list(options.values())

Now we have a dictionary that we can use to generate the requests. A query corresponds to one possible set of parameters. We generate a list of all the possible sets of parameters that we are interested in for this question.

Note that we need to vary only 'ww_x_PERIODE_PEDAGO' and 'ww_x_PERIODE_ACAD', as 'ww_x_HIVERETE' is redundant and the other values are fixed.

In [9]:
new_dict

{'ww_i_reportModel': ['133685247'],
 'ww_i_reportModelXsl': ['133685270'],
 'ww_x_GPS': ['-1'],
 'ww_x_PERIODE_ACAD': ['978181',
  '213637922',
  '213637754',
  '123455150',
  '355925344',
  '978195',
  '123456101',
  '39486325',
  '213638028',
  '978187'],
 'ww_x_PERIODE_PEDAGO': ['249108',
  '2226768',
  '2226785',
  '942175',
  '942155',
  '942163',
  '942120',
  '249114'],
 'ww_x_UNITE_ACAD': ['249847']}

The following code builds, from the above dictionary, the parameter sets that we pass to requests.get to generate the necessary queries. We take every combination of the possible values of each parameter, so each combination yields one query.

In [82]:
import itertools

# For every combination of option values, one single-entry dict per parameter
combinations = [[{key: value} for (key, value) in zip(new_dict, values)]
                for values in itertools.product(*new_dict.values())]

# Merge each combination's single-entry dicts into one parameter set per query
params = []
for combination in combinations:
    merged = {}
    for single in combination:
        merged.update(single)
    params.append(merged)
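
As a small worked example of this Cartesian product, the sketch below uses a shortened dictionary (two academic years and two semesters, taken from the values above). The names `option_lists` and `queries` are illustrative only.

```python
import itertools

# Shortened option lists: one fixed parameter, two years, two semesters
option_lists = {
    'ww_x_GPS': ['-1'],
    'ww_x_PERIODE_ACAD': ['978181', '978187'],    # 2007-2008, 2008-2009
    'ww_x_PERIODE_PEDAGO': ['249108', '942175'],  # Bachelor semestre 1 and 6
}

# Pair the parameter names with each tuple of values produced by product()
queries = [dict(zip(option_lists, values))
           for values in itertools.product(*option_lists.values())]

print(len(queries))  # 1 * 2 * 2 = 4
print(queries[0])
```

With the full dictionary above (10 academic years, 8 Bachelor semesters, and one value for every other parameter), the same product yields 10 × 8 = 80 parameter sets.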

In [12]:
def get_key(dicti, value):
    """Return the first key of dicti whose value equals value (None if there is none)."""
    for k, v in dicti.items():
        if v == value:
            return k
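
Because `get_key` scans the dictionary on every call, an alternative is to invert the mapping once and look labels up directly. This is only a sketch of that option, assuming the values are unique (as they are in the dictionaries shown above); `year_codes` and `year_labels` are hypothetical names.

```python
# Excerpt of the academic-year dictionary shown above
year_codes = {'2007-2008': '978181', '2008-2009': '978187', '2009-2010': '978195'}

# Build the value -> label mapping once instead of scanning on every lookup
year_labels = {code: label for label, code in year_codes.items()}

print(year_labels['978187'])  # 2008-2009
```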

Now we have the parameters for all the queries we are interested in as a list. We now run the queries to get a list of dataframes.

In [91]:
yr = 'ww_x_PERIODE_ACAD'
sem = 'ww_x_PERIODE_PEDAGO'

df_list = []
query = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?'
for param in tqdm(params):
    r = req.get(query, params=param)
    try:
        # r.url is the full query URL; read_html fetches it and parses its tables
        df = pd.read_html(r.url, header=0, skiprows=1)[0]
    except Exception:
        # Skip combinations for which no table could be read
        continue

    # Keep only the student rows and drop columns that contain missing values
    df = df[df['Civilité'].isin(['Monsieur', 'Madame'])]
    df = df.dropna(axis=1)

    # Label the rows with the year and semester this query was for
    df['Year'] = get_key(dicti[yr], param[yr])
    df['Semester'] = get_key(dicti[sem], param[sem])
    df_list.append(df)



In [92]:
#Concatenate the dataframes to get one big dataframe
df_bachelor = pd.concat(df_list,ignore_index=True)

In [93]:
#Convert 'No Sciper' to int
df_bachelor['No Sciper'] = df_bachelor['No Sciper'].astype(int)

In [94]:
#final table
df_bachelor.index = [df_bachelor['Civilité'] , df_bachelor['No Sciper']]
df_bachelor = df_bachelor.sort_index()
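
For the task stated at the top, the next filtering step is to keep only the students who have an entry for both Bachelor semestre 1 and Bachelor semestre 6. The sketch below shows one way this might be done with pandas; the `records` dataframe and its Sciper numbers are made-up stand-ins for `df_bachelor`, not real data.

```python
import pandas as pd

# Hypothetical records in the same shape as df_bachelor (values are made up)
records = pd.DataFrame({
    'No Sciper': [1001, 1001, 1002, 1003, 1003],
    'Civilité': ['Madame', 'Madame', 'Monsieur', 'Monsieur', 'Monsieur'],
    'Semester': ['Bachelor semestre 1', 'Bachelor semestre 6',
                 'Bachelor semestre 1',
                 'Bachelor semestre 1', 'Bachelor semestre 6'],
})

required = {'Bachelor semestre 1', 'Bachelor semestre 6'}

# Collect the set of semesters seen for each student
semesters_per_student = records.groupby('No Sciper')['Semester'].agg(set)

# Keep the students whose set contains both required semesters
has_both = semesters_per_student.apply(required.issubset)
complete = semesters_per_student[has_both].index

subset = records[records['No Sciper'].isin(complete)]
print(subset)
```

Here student 1002 is dropped because only a semestre 1 entry exists for that number.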

In [95]:
df_bachelor

| Civilité (index) | No Sciper (index) | Civilité | Nom Prénom | Statut | No Sciper | Year | Semester |
|---|---|---|---|---|---|---|---|
| Madame | 154157 | Madame | Andriambololona Riana Miarantsoa | Présent | 154157 | 2007-2008 | Bachelor semestre 5 |
| Madame | 159998 | Madame | Jesse Julia | Présent | 159998 | 2007-2008 | Bachelor semestre 6 |
| Madame | 159998 | Madame | Jesse Julia | Présent | 159998 | 2007-2008 | Bachelor semestre 5 |
| Madame | 161091 | Madame | Grivet Ekaterina | Présent | 161091 | 2007-2008 | Bachelor semestre 6 |
| Madame | 161091 | Madame | Grivet Ekaterina | Congé | 161091 | 2007-2008 | Bachelor semestre 5 |
| Madame | 170888 | Madame | Mozuasadila Jennifer | Présent | 170888 | 2007-2008 | Bachelor semestre 3 |
| Madame | 172680 | Madame | Maman Rodriguez Cuellar Alexandra | Présent | 172680 | 2007-2008 | Bachelor semestre 3 |
| Madame | 173604 | Madame | Javanmardy Khameneh Maryam | Présent | 173604 | 2007-2008 | Bachelor semestre 6 |
| Madame | 173604 | Madame | Javanmardy Khameneh Maryam | Présent | 173604 | 2007-2008 | Bachelor semestre 5 |
| Madame | 173604 | Madame | Javanmardy Khameneh Maryam | Présent | 173604 | 2009-2010 | Bachelor semestre 5 |
