First, we import the packages we need.
We create a list "cols" with the attributes that we need to answer the questions below.
We create a DataFrame ALL_DATA that will hold all the student data we collect and use to answer the questions. The variable "url" holds the URL from which we retrieve the students.

In [55]:
import requests
import pandas as pd 
from bs4 import BeautifulSoup

# Import packages for the test statistic
import scipy.stats as stats
import math
import numpy as np

url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?'
cols = ['Période académique', 'Période pédagogique', 'Civilité', 'Nom et prénom']
ALL_DATA = pd.DataFrame(columns=cols)

We define the function arrange_student, which takes a student as a list of tags and returns a dictionary with the columns that we are interested in (civility, and last name and first name).

Example :

Input : 

[<td style="white-space:nowrap">Monsieur</td>, <td style="white-space:nowrap">Albrecht Pablo</td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap">Présent</td>, <td style="white-space:nowrap"></td>, <td style="white-space:nowrap"></td>, <td>212726</td>, <td style="white-space:nowrap"></td>]

Output : 

{'Civilité': 'Monsieur', 'Nom et prénom': 'Albrecht\xa0Pablo'}



In [56]:
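def arrange_student(student_tags):
    # Return the civility and the full name of a student from the <td> tags of its row.
    # An empty dictionary is returned for rows without any <td> tag.
    student = {}

    if len(student_tags) != 0:
        student['Civilité'] = student_tags[0].contents[0]
        student['Nom et prénom'] = student_tags[1].contents[0]

    return student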
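
The example above can be reproduced without network access. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's html.parser, and the class TdCollector, the helper arrange_cells and the sample row are hypothetical. It assumes that the page separates the last name and the first name with a non-breaking space (&nbsp;), which would explain the '\xa0' in the output above.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_td = True
            self.cells.append('')  # empty cells stay as ''

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def arrange_cells(cells):
    """Same idea as arrange_student, but on plain strings."""
    student = {}
    if len(cells) != 0:
        student['Civilité'] = cells[0]
        student['Nom et prénom'] = cells[1]
    return student


# Sample row modelled on the example above; the &nbsp; is an assumption.
row = ('<tr><td style="white-space:nowrap">Monsieur</td>'
       '<td style="white-space:nowrap">Albrecht&nbsp;Pablo</td>'
       '<td style="white-space:nowrap"></td></tr>')

collector = TdCollector()
collector.feed(row)
collector.close()

student = arrange_cells(collector.cells)
# Replace the non-breaking space when a readable name is needed.
readable_name = student['Nom et prénom'].replace('\xa0', ' ')
```

The notebook compares names that all come from the same page, so the non-breaking space does not affect the matching; replacing it only matters for display.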


In the next cell, we retrieve the encoded values of the 4 filters present on the page.
We store them in a dictionary named "filters", whose keys are the names of the filters (e.g. ww_x_PERIODE_ACAD). Each value is itself a dictionary whose keys are all the possible values that the filter can take and whose values are their encodings
(e.g. 2012-2013 => 123456101: '2012-2013' is the key and '123456101' is its corresponding value).

In [57]:
url1 = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247'
r1 = requests.get(url1)
soup1 = BeautifulSoup(r1.content,"lxml")

filters = {}

for filt in soup1.findAll("select"):
    
    filter_values = {}
        
    for option in filt.findAll("option"):
       
        if(option['value']!='null') : # skip options whose value is 'null' (e.g. the empty first entry)
            filter_values[option.contents[0]] = option['value']
    
    filter_values[''] = 'null'
    filters[filt['name']] = filter_values


In [58]:
#Values of filters
filters

{'ww_x_HIVERETE': {'': 'null',
  "Semestre d'automne": '2936286',
  'Semestre de printemps': '2936295'},
 'ww_x_PERIODE_ACAD': {'': 'null',
  '2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344'},
 'ww_x_PERIODE_PEDAGO': {'': 'null',
  'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785',
  'Master semestre 1': '2230106',
  'Master semestre 2': '942192',
  'Master semestre 3': '2230128',
  'Master semestre 4': '2230140',
  'Mineur semestre 1': '2335667',
  'Mineur semestre 2': '2335676',
  'Mise à niveau': '2063602308',
  'Projet Master automne': '2491
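
To make the structure of "filters" concrete, here is a small sketch built on a hand-copied excerpt of the output above. The names filters_excerpt and decode are hypothetical and not part of the notebook. It shows the lookup from a readable label to its encoding, and the reverse lookup.

```python
# Hand-copied excerpt of the dictionary printed above.
filters_excerpt = {
    'ww_x_HIVERETE': {
        '': 'null',
        "Semestre d'automne": '2936286',
        'Semestre de printemps': '2936295',
    },
    'ww_x_PERIODE_ACAD': {
        '': 'null',
        '2012-2013': '123456101',
        '2016-2017': '355925344',
    },
}

# Label -> encoding: the value to send for the academic year 2012-2013.
code = filters_excerpt['ww_x_PERIODE_ACAD']['2012-2013']


def decode(filter_name, encoding):
    """Reverse lookup: return the label whose encoding matches, else None."""
    for label, value in filters_excerpt[filter_name].items():
        if value == encoding:
            return label
    return None


label = decode('ww_x_HIVERETE', '2936295')
```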

Then, we define the fixed parameters that we pass through the URL (ww_i_reportModel and ww_i_reportModelXsl, found thanks to Postman) and a new function get_students_bysemester that returns all the students who were enrolled in the semester "periode_pedagogique", across all academic years.

In [59]:
#Parameters used for filters
fixed_params = {}
fixed_params['ww_x_GPS'] = '-1'
fixed_params['ww_i_reportModel'] = '133685247'
fixed_params['ww_i_reportModelXsl'] = '133685270'
fixed_params['ww_x_UNITE_ACAD'] = filters['ww_x_UNITE_ACAD']['Informatique']


def get_students_bysemester(periode_pedagogique) :

    students = []
    
    for periode_academique in filters['ww_x_PERIODE_ACAD'].keys() : 

        if(periode_academique != '') :
            param = fixed_params.copy()
            param['ww_x_PERIODE_PEDAGO'] = filters['ww_x_PERIODE_PEDAGO'][periode_pedagogique]
            param['ww_x_PERIODE_ACAD'] = filters['ww_x_PERIODE_ACAD'][periode_academique]
            
            req_b = requests.get(url,params=param)
            soup_b = BeautifulSoup(req_b.content,"lxml")
            
            for row in soup_b.find('table').contents[2:] : # The first lines correspond to the headers, so we skip them
       
                s = arrange_student(row.findAll('td'))
                s['Période académique'] = periode_academique
                s['Période pédagogique'] = periode_pedagogique
                students.append(s)
            
                
    return students
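
The "params" dictionary passed to requests.get ends up in the URL as a query string. The sketch below rebuilds such a URL with the standard library's urllib.parse.urlencode, only to show roughly what is sent; it is not the notebook's code. The encodings are copied from the "filters" output above. ww_x_UNITE_ACAD is left out because its encoding is not visible in that output, and build_url is a hypothetical helper.

```python
from urllib.parse import urlencode

base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?'

# Parameters that never change between requests.
fixed = {
    'ww_x_GPS': '-1',
    'ww_i_reportModel': '133685247',
    'ww_i_reportModelXsl': '133685270',
}


def build_url(periode_acad_code, periode_pedago_code):
    """Return the URL for one academic year and one semester (illustrative)."""
    params = fixed.copy()  # copy so the shared dictionary is never modified
    params['ww_x_PERIODE_PEDAGO'] = periode_pedago_code
    params['ww_x_PERIODE_ACAD'] = periode_acad_code
    return base_url + urlencode(params)


# 2016-2017 => 355925344, Bachelor semestre 1 => 249108
full_url = build_url('355925344', '249108')
print(full_url)
```

Copying the fixed parameters before adding the two period filters mirrors the fixed_params.copy() call in get_students_bysemester: each request starts from the same base dictionary.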

In [60]:
#We recover the students thanks to the function defined above
student_bachelor1 = get_students_bysemester('Bachelor semestre 1')
student_bachelor5 = get_students_bysemester('Bachelor semestre 5')
student_bachelor6 = get_students_bysemester('Bachelor semestre 6')

We want to keep only the students for whom we have an entry for both "Bachelor semestre 1" and "Bachelor semestre 6". We add those students to our DataFrame ALL_DATA:

In [61]:
# Now that we have the Bachelor 1 and Bachelor 6 students, we take the intersection of the two
for s1 in student_bachelor1 :
    for s6 in student_bachelor6 :
        if s1['Nom et prénom'] == s6['Nom et prénom'] :

            ALL_DATA = ALL_DATA.append(pd.Series(s1),ignore_index=True)
            ALL_DATA = ALL_DATA.append(pd.Series(s6),ignore_index=True)
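
The nested loop above compares every Bachelor 1 entry with every Bachelor 6 entry, so a student with several entries is appended several times. As an alternative sketch (not the notebook's method), the same selection can be written as a set intersection on the names, which keeps each entry once. The two short lists and the helper entries_in_both are made up for illustration.

```python
# Made-up entries in the same shape as the output of get_students_bysemester.
bachelor1 = [
    {'Nom et prénom': 'Student A', 'Période académique': '2012-2013',
     'Période pédagogique': 'Bachelor semestre 1'},
    {'Nom et prénom': 'Student A', 'Période académique': '2013-2014',
     'Période pédagogique': 'Bachelor semestre 1'},
    {'Nom et prénom': 'Student B', 'Période académique': '2013-2014',
     'Période pédagogique': 'Bachelor semestre 1'},
]
bachelor6 = [
    {'Nom et prénom': 'Student A', 'Période académique': '2015-2016',
     'Période pédagogique': 'Bachelor semestre 6'},
    {'Nom et prénom': 'Student C', 'Période académique': '2015-2016',
     'Période pédagogique': 'Bachelor semestre 6'},
]


def entries_in_both(first, second):
    """Keep the entries of students whose name appears in both lists."""
    names = ({s['Nom et prénom'] for s in first}
             & {s['Nom et prénom'] for s in second})
    return [s for s in first + second if s['Nom et prénom'] in names]


kept = entries_in_both(bachelor1, bachelor6)
```

Here only "Student A" appears in both lists, so "kept" holds that student's three entries and nothing else.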
            

We also add the "Bachelor semestre 5" entries of these students. We need this information to determine whether a student obtained the Bachelor in 7 semesters instead of 6.

In [62]:
names = set(ALL_DATA['Nom et prénom'].tolist())

for s5 in student_bachelor5 : 
    if s5['Nom et prénom'] in names : 
        ALL_DATA = ALL_DATA.append(pd.Series(s5),ignore_index=True)

The nested loop over student_bachelor1 and student_bachelor6 produces many duplicates. We drop them and set the index of the big DataFrame to the students' names, "Nom et prénom".

In [63]:
ALL_DATA = ALL_DATA.drop_duplicates()
ALL_DATA = ALL_DATA.set_index(['Nom et prénom'])

To answer the first question, how long a student takes to complete the bachelor, we create a DataFrame named "result" with the attributes of "cols_result". We assume that a student can complete the bachelor in semester 5 (i.e. the last semester they are registered in is BA5).

In the loop, we iterate over all the students and, at each iteration, store the current student's entries in a new DataFrame "temp_frame". Then, we select the latest and the earliest academic period (max and min also work on strings). We take the end year of the latest period and the start year of the earliest one, convert them to int, subtract them, and store the result in "diff". Moreover, if the number of "Bachelor semestre 5" entries is greater than the number of "Bachelor semestre 6" entries, we subtract one semester (6 months) from the student's total.

In [64]:
cols_result = ['Name','Sex','First year of bachelor', 'Last year of bachelor', 'Time_Bachelor (months)']
result = pd.DataFrame(columns=cols_result)

for name in ALL_DATA.index.drop_duplicates() : 
    
    temp_frame = ALL_DATA.loc[name]
    max_year = str(temp_frame['Période académique'].max()).split("-")[1]
    min_year = str(temp_frame['Période académique'].min()).split("-")[0]

    diff = int(max_year)-int(min_year)
    
    pen=0
    numb_sem5 = temp_frame['Période pédagogique'].values.tolist().count('Bachelor semestre 5')
    numb_sem6 = temp_frame['Période pédagogique'].values.tolist().count('Bachelor semestre 6')
    
    if(numb_sem5>numb_sem6) :
        pen = 6
    
    temp = {'Name' : name,'Sex' : str(temp_frame['Civilité'].values[0]) ,'First year of bachelor' :temp_frame['Période académique'].min(),
            'Last year of bachelor' :temp_frame['Période académique'].max() , 'Time_Bachelor (months)' : 12*diff-pen}
    
    
    result = result.append(temp, ignore_index=True)


In [65]:
#We print the 10 first entries of result 
result.head(10)

   Name                           Sex       First year of bachelor  Last year of bachelor  Time_Bachelor (months)
0  Alfonso Peterssen Alfonso      Monsieur  2013-2014               2015-2016              36.0
1  Baraschi Zoé                   Madame    2012-2013               2016-2017              54.0
2  Birchmeier Alain Dominique     Monsieur  2013-2014               2015-2016              36.0
3  Boissaye Arnaud Didier Marie   Monsieur  2013-2014               2016-2017              42.0
4  Bonfils Nils Pascal            Monsieur  2013-2014               2015-2016              36.0
5  Bonnome Hugo                   Monsieur  2012-2013               2015-2016              48.0
6  Bordenca Tobias                Monsieur  2012-2013               2015-2016              48.0
7  Bouron Justinien Gérard Alain  Monsieur  2013-2014               2015-2016              36.0
8  Breitenstein Yannick Lucas     Monsieur  2013-2014               2016-2017              42.0
9  Casademont Nicolas             Monsieur  2013-2014               2015-2016              36.0
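
The duration rule can be checked by hand against the values above. The function below is a simplified, pure-Python sketch of the same computation; months_to_bachelor is a hypothetical name, and the two entry lists are made up so that they reproduce the 36-month and 42-month cases. It is not a replacement for the notebook's loop.

```python
def months_to_bachelor(entries):
    """entries: list of (academic period, semester) pairs for one student."""
    periods = [period for period, _ in entries]
    semesters = [semester for _, semester in entries]

    # '2012-2013' < '2013-2014' also holds for strings, so min/max work.
    first_year = int(min(periods).split('-')[0])
    last_year = int(max(periods).split('-')[1])
    diff = last_year - first_year

    # More BA5 than BA6 entries: the last registered semester is a BA5,
    # so half a year (6 months) is subtracted.
    n_sem5 = semesters.count('Bachelor semestre 5')
    n_sem6 = semesters.count('Bachelor semestre 6')
    penalty = 6 if n_sem5 > n_sem6 else 0

    return 12 * diff - penalty


# Made-up student who finishes in three academic years.
on_time = [
    ('2013-2014', 'Bachelor semestre 1'),
    ('2015-2016', 'Bachelor semestre 5'),
    ('2015-2016', 'Bachelor semestre 6'),
]

# Made-up student who repeats semester 5 in a fourth academic year.
extra_semester = on_time + [('2016-2017', 'Bachelor semestre 5')]

print(months_to_bachelor(on_time))         # 36
print(months_to_bachelor(extra_semester))  # 42
```

For the second student, the span 2013 to 2017 gives 4 years, i.e. 48 months, and the extra "Bachelor semestre 5" entry removes 6 months, which gives 42.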


In [66]:
#Average time to complete the bachelor
result.mean()

Time_Bachelor (months)    42.725441
dtype: float64

In [71]:
#Average time by sex 

result_by_sex = result.groupby('Sex')
result_by_sex.mean()


Sex       Time_Bachelor (months)
Madame    40.758621
Monsieur  42.880435


In [68]:
#In those variables, we collect in a numpy array the time (in months) for men/women to obtain the bachelor
men = np.array(result[result.Sex=='Monsieur']['Time_Bachelor (months)'])
women = np.array(result[result.Sex=='Madame']['Time_Bachelor (months)'])

We perform a two-sample t-test without assuming that both samples (men and women) have the same variance (Welch's t-test), since we have no information about the variances. The null hypothesis is that the means of both groups are the same. We choose a significance level of 0.05.

In [72]:
stats.ttest_ind(a= men,
                b= women,
                equal_var=False) 

Ttest_indResult(statistic=1.3437005678090845, pvalue=0.18785555340784144)
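
With equal_var=False, ttest_ind performs Welch's t-test. As a sketch of what is computed, the statistic and the Welch-Satterthwaite degrees of freedom can be written with the standard library alone. The function name welch_t and the two tiny samples are made up for illustration; the p-value, which requires the Student t distribution, is left to scipy.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) of Welch's t-test."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)

    # Sample variances (n - 1 in the denominator), each divided by its sample size.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)

    t = (mean_a - mean_b) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up samples: means 2 and 3, both sample variances equal to 1.
t, df = welch_t([1, 2, 3], [2, 3, 4])
print(t, df)  # t is about -1.2247, df is 4
```

The sign of the statistic only reflects the order of the arguments: swapping the two samples flips the sign of t and leaves the two-sided p-value unchanged.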

We have a p-value of 0.1879, which is greater than 0.05. Therefore we can't reject the null hypothesis.
Since we failed to reject the null hypothesis, we may be making a Type II error (a 'false negative'), and the test alone cannot rule this out. However, even with a higher significance level such as 0.10 (i.e. a lower confidence level), we are still in the same case: we can't reject the null hypothesis. The data therefore give no evidence that the mean time to complete the bachelor differs between the two groups.