# 1. HTTP Request with Postman
<br>
Querying IS-Academia for "Informatique, 2007-2008, Bachelor semestre 1" gives the following parameters in Postman:<br>
ww_x_GPS : 71297531<br>
ww_i_reportModel : 133685247<br>
ww_i_reportModelXsl : 133685270<br>
ww_x_UNITE_ACAD : 249847<br>
ww_x_PERIODE_ACAD : 978181<br>
ww_x_PERIODE_PEDAGO : 249108<br>
ww_x_HIVERETE : null<br>


The parameters we are mostly interested in are:<br>
ww_x_UNITE_ACAD  <- Informatique<br>
ww_x_PERIODE_ACAD  <- 2007 - 2016<br>
ww_x_PERIODE_PEDAGO  <- Bachelor semestre 1 and Bachelor semestre 6<br>
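
As a rough sketch of how these parameters map onto a request, the snippet below assembles the query string for the "Informatique, 2007-2008, Bachelor semestre 1" report from the values listed above. It only builds the URL and sends nothing. The endpoint is the one used later in this notebook, and the names `postman_params` and `report_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Endpoint used later in this notebook for the HTML report.
base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'

# Parameter values observed in Postman for
# "Informatique, 2007-2008, Bachelor semestre 1".
postman_params = {
    'ww_x_GPS': '71297531',
    'ww_i_reportModel': '133685247',
    'ww_i_reportModelXsl': '133685270',
    'ww_x_UNITE_ACAD': '249847',
    'ww_x_PERIODE_ACAD': '978181',
    'ww_x_PERIODE_PEDAGO': '249108',
    'ww_x_HIVERETE': 'null',
}

# Hypothetical name: the full GET URL, built without sending a request.
report_url = base_url + '?' + urlencode(postman_params)
print(report_url)
```

Changing only `ww_x_PERIODE_ACAD` and `ww_x_PERIODE_PEDAGO` in this dictionary selects another year or semester for the same section.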

In [271]:
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import scipy.stats as stats
from bs4 import BeautifulSoup

sns.set_context('notebook')

In [272]:
form_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter"
base_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'
get_parameters = {
    'ww_i_reportModel': '133685247',  # Report Model for registered students by section and semester
    'ww_i_reportModelXsl': '133685270',  # HTML output
}
r  = requests.get(form_url, get_parameters)
soup = BeautifulSoup(r.text, 'html.parser')

In [273]:
# Extract the appropriate parameters from the html
academic_unit = {'ww_x_UNITE_ACAD': soup.find('option', string='Informatique')['value']}
print('Academic unit:', academic_unit, '\n')

academic_period_select = soup.find('select', attrs={'name': 'ww_x_PERIODE_ACAD'})
academic_period_dict = {option.string: option['value']
                        for option in academic_period_select
                        if option.string is not None}
print('Academic periods:', academic_period_dict, '\n')

pedag_period_select = soup.find('select', attrs={'name': 'ww_x_PERIODE_PEDAGO'})
searched_pedag_periods = {'Bachelor semestre 1', 'Bachelor semestre 6'}
pedag_period = {option.string: option['value']
                for option in pedag_period_select
                if option.string in searched_pedag_periods}
print('Pedagogic period:', pedag_period)

Academic unit: {'ww_x_UNITE_ACAD': '249847'} 

Academic periods: {'2014-2015': '213637922', '2008-2009': '978187', '2016-2017': '355925344', '2010-2011': '39486325', '2007-2008': '978181', '2011-2012': '123455150', '2013-2014': '213637754', '2012-2013': '123456101', '2015-2016': '213638028', '2009-2010': '978195'} 

Pedagogic period: {'Bachelor semestre 6': '942175', 'Bachelor semestre 1': '249108'}


In [274]:
get_parameters.update(academic_unit)  # Add academic unit to get parameters
get_parameters.update({'ww_x_GPS': '-1'})  # This parameter represents the "Tous" ("All") link returned by the form.

In [275]:
def build_dataframe(pedagogic_period: str) -> pd.DataFrame:
    """Take a pedagogic period (eg: 'Bachelor semestre 1') and build a dataframe with
    all concerned students over every academic period in `academic_period_dict`
    (eg: '2007-2008', '2008-2009', ...).
    """
    df = pd.DataFrame()
    for i, academic_period in enumerate(sorted(academic_period_dict.keys())):  # 2007 until 2016
        # Request GET parameters
        request_params = {**get_parameters,
                          'ww_x_PERIODE_ACAD': academic_period_dict.get(academic_period),
                          'ww_x_PERIODE_PEDAGO': pedag_period.get(pedagogic_period)}
        r = requests.get(base_url, request_params)
        temp_df = pd.read_html(r.text, header=1, index_col=10)[0]  # User sciper nº as index
        temp_df = temp_df[['Civilité', 'Nom Prénom']].copy()  # Keep relevant columns only (copy avoids SettingWithCopyWarning)
        temp_df[pedagogic_period] = i + 2007  # Annotate the corresponding year for the pedagogic period
        df = pd.concat([df, temp_df])
    return df

# Load all CS students that did their first and last bachelor semesters
starting = build_dataframe('Bachelor semestre 1')
ending = build_dataframe('Bachelor semestre 6')

In [276]:
starting = starting[~starting.index.duplicated(keep='first')]  # Ignore repeated first years
ending = ending[~ending.index.duplicated(keep='last')]  # Keep last 6th semester only

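# Merge both dataframes on their shared columns ('Civilité' and 'Nom Prénom').
# Note that the sciper index is not preserved by this merge.
students = pd.merge(starting, ending, how='inner')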
# The 6th semester is always in spring (year + 1)
students['Bachelor semestre 6'] = students['Bachelor semestre 6'] + 1
students.sample(10)

Unnamed: 0,Civilité,Nom Prénom,Bachelor semestre 1,Bachelor semestre 6
303,Monsieur,Pradignac Nicolas,2012,2016
237,Monsieur,Rudelle Matthieu François Edgard,2011,2014
294,Monsieur,Milliet Alain Georges Paul,2012,2016
347,Monsieur,Dupont Costedoat Yann Olivier François Marie,2013,2016
124,Monsieur,Viaccoz Thierry,2009,2012
259,Monsieur,Brousse Cyriaque Gilles Guillaume,2012,2015
256,Monsieur,Bonnome Hugo,2012,2016
88,Monsieur,Blanvillain Olivier Eric Paul,2009,2012
261,Monsieur,Chatelain Bastien Ludovic,2012,2016
290,Monsieur,Lottaz Timothée,2012,2016
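
To illustrate the de-duplication step used before the merge, here is a small sketch on made-up data (the sciper numbers and years are invented for the example). `keep='first'` retains a student's earliest row and `keep='last'` the latest one, assuming rows are appended in chronological order as `build_dataframe` does.

```python
import pandas as pd

# Toy data: student 1001 appears twice because the semester was repeated.
toy = pd.DataFrame(
    {'year': [2007, 2008, 2008]},
    index=pd.Index([1001, 1001, 1002], name='No Sciper'),
)

# Keep the first occurrence of each sciper (earliest year).
first_rows = toy[~toy.index.duplicated(keep='first')]
# Keep the last occurrence of each sciper (latest year).
last_rows = toy[~toy.index.duplicated(keep='last')]

print(first_rows)
print(last_rows)
```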


In [277]:
students['Delta'] = (students['Bachelor semestre 6'] - students['Bachelor semestre 1']) * 12 # months in a year
print('Male (Monsieur):', students[students['Civilité'] == 'Monsieur'].shape[0])
print('Female (Madame):', students[students['Civilité'] == 'Madame'].shape[0])
students.groupby('Civilité')[['Delta']].mean()

Male (Monsieur): 368
Female (Madame): 29


Unnamed: 0_level_0,Delta
Civilité,Unnamed: 1_level_1
Madame,39.724138
Monsieur,41.771739


Statistical test:
We have two independent samples (men and women), and our null hypothesis is that there is no real difference between them in the average time spent at EPFL as a bachelor student. We therefore use a two-sample t-test. We pass `equal_var=False` (Welch's t-test) so that the two groups are not assumed to have equal variances.
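
For reference, the statistic that a two-sample t-test without the equal-variance assumption (Welch's variant) computes can be sketched by hand. The samples below are made up for illustration, and `welch_t` is a hypothetical helper, not part of SciPy.

```python
import math
import statistics


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # Squared standard errors of each sample mean (sample variance / size).
    se_a = statistics.variance(a) / n_a
    se_b = statistics.variance(b) / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, dof


# Made-up durations in months for two groups of different sizes.
group_a = [36, 36, 48, 48, 36, 60]
group_b = [36, 48, 36, 36]

t_stat, dof = welch_t(group_a, group_b)
print(t_stat, dof)  # t is 1.0, with about 7.99 degrees of freedom
```

The p-value is then read from a Student t distribution with `dof` degrees of freedom, which is the part SciPy handles.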

In [278]:
stats.ttest_ind(a=students[students['Civilité'] == 'Monsieur'].Delta,
               b=students[students['Civilité'] == 'Madame'].Delta, equal_var=False)

Ttest_indResult(statistic=1.5831651359439409, pvalue=0.12191236829650401)

The p-value (about 0.12) is above the usual 0.05 significance level, so we cannot reject the null hypothesis: the data show no statistically significant difference between men and women in the average time spent at EPFL as a bachelor student.

# 2

In this task we compute the time spent so far by each master student at EPFL. We therefore consider not only students who have finished their master, but also students who are currently pursuing it.
We assume that every semester entry corresponds to six months spent at EPFL. For each master student, we therefore compute the total number of semesters spent at EPFL, then multiply it by six to get the duration of the stay in months.
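
The counting rule can be sketched on made-up entries as follows. The student names and registrations are invented, and `MONTHS_PER_SEMESTER` is a name introduced only for this example.

```python
from collections import Counter

# Made-up enrolment entries: one (student, semester) pair per registration.
entries = [
    ('Student A', 'Master semestre 1'),
    ('Student A', 'Master semestre 2'),
    ('Student A', 'Master semestre 3'),
    ('Student A', 'Projet Master printemps'),
    ('Student B', 'Master semestre 1'),
    ('Student B', 'Master semestre 1'),  # a repeated semester counts twice
    ('Student B', 'Master semestre 2'),
]

MONTHS_PER_SEMESTER = 6

# Number of semester entries per student.
semesters = Counter(name for name, _ in entries)
# Duration of the stay in months.
months = {name: count * MONTHS_PER_SEMESTER for name, count in semesters.items()}
print(months)  # {'Student A': 24, 'Student B': 18}
```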

In [279]:
# Same as before, but with the master pedagogic periods
# Parameters of the master periods
master_searched_pedag_periods = {'Master semestre 1', 'Master semestre 2', 'Master semestre 3', 'Projet Master automne', 'Projet Master printemps'}
master_pedag_period = {option.string: option['value']
                for option in pedag_period_select
                if option.string in master_searched_pedag_periods}
# print('Master pedagogic period:', master_pedag_period)

# Redefine build_dataframe with the new columns that we are interested in
def build_master_dataframe(pedagogic_period: str) -> pd.DataFrame:
    """Take a pedagogic period (eg: 'Master semestre 1') and build a dataframe with
    all concerned students over every academic period in `academic_period_dict`
    (eg: '2007-2008', '2008-2009', ...). Periods that return no entries are skipped.
    """
    df = pd.DataFrame()
    for i, academic_period in enumerate(sorted(academic_period_dict.keys())):  # 2007 until 2016
        # Request GET parameters
        request_params = {**get_parameters,
                          'ww_x_PERIODE_ACAD': academic_period_dict.get(academic_period),
                          'ww_x_PERIODE_PEDAGO': master_pedag_period.get(pedagogic_period)}
        r = requests.get(base_url, request_params)
        if 'Civilité' in r.text:  # Check if there is a header, i.e. any entries
            temp_df = pd.read_html(r.text, header=1, index_col=10)[0]  # User sciper nº as index
            temp_df = temp_df[['Nom Prénom', 'Spécialisation']].copy()  # Keep relevant columns only
            temp_df[pedagogic_period] = i + 2007  # Annotate the corresponding year for the pedagogic period
            df = pd.concat([df, temp_df])
    return df

# Load all CS students registered in each master semester and master project period
ma_1 = build_master_dataframe('Master semestre 1')
ma_2 = build_master_dataframe('Master semestre 2')
ma_3 = build_master_dataframe('Master semestre 3')
pdm_1 = build_master_dataframe('Projet Master automne')
pdm_2 = build_master_dataframe('Projet Master printemps')

In [280]:
ma_1.head()

Unnamed: 0_level_0,Nom Prénom,Spécialisation,Master semestre 1
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
153066,Aeberhard François-Xavier,,2007
180027,Agarwal Megha,,2007
152232,Anagnostaras David,,2007
177395,Auroux Damien,,2007
161970,Awalebo Joseph,,2007


Here, we count the number of entries for each student in each semester dataframe, then combine and sum them to find the total number of semesters per student.

In [281]:
# Merge dataframes.

ma_1_count = ma_1.groupby('Nom Prénom').count()['Master semestre 1']
ma_2_count = ma_2.groupby('Nom Prénom').count()['Master semestre 2']
ma_3_count = ma_3.groupby('Nom Prénom').count()['Master semestre 3']
pdm_1_count = pdm_1.groupby('Nom Prénom').count()['Projet Master automne']
pdm_2_count = pdm_2.groupby('Nom Prénom').count()['Projet Master printemps']

#Concatenate data from all semesters
students = pd.concat([ma_1_count, ma_2_count, ma_3_count, pdm_1_count, pdm_2_count], axis = 1)

#Sum all columns to find the total number of semesters per student
students_sem_count = students.sum(axis = 1).to_frame('Total_semesters')
students_sem_count.index.name='Nom Prénom'

#Drop students who only have one semester entry, as they correspond to students
#who are in their first semester as master students at EPFL
students_sem_count=students_sem_count[students_sem_count.Total_semesters != 1]

students_sem_count.head()

Unnamed: 0_level_0,Total_semesters
Nom Prénom,Unnamed: 1_level_1
Abbadi Hajar,3.0
Abelenda Diego,4.0
Abi Akar Nora,3.0
Aeberhard François-Xavier,6.0
Aeby Prisca,3.0


We then average over all students to find the average number of semesters, and multiply by 6 to find the average number of months spent at EPFL for master students.

In [282]:
population_months = students_sem_count*6
population_avg = students_sem_count.mean()*6
print(population_avg)

Total_semesters    20.143939
dtype: float64


In [283]:
# Keep only the rows that have a specialization, dropping the year column
ma_1_spec = ma_1.drop(columns='Master semestre 1').dropna()
ma_2_spec = ma_2.drop(columns='Master semestre 2').dropna()
ma_3_spec = ma_3.drop(columns='Master semestre 3').dropna()
pdm_1_spec = pdm_1.drop(columns='Projet Master automne').dropna()
pdm_2_spec = pdm_2.drop(columns='Projet Master printemps').dropna()

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
names = pd.concat([ma_1_spec, ma_2_spec, ma_3_spec, pdm_1_spec, pdm_2_spec])
names.drop_duplicates(inplace=True)
names.set_index('Nom Prénom', inplace=True)
names.head()

Unnamed: 0_level_0,Spécialisation
Nom Prénom,Unnamed: 1_level_1
Campora Simone,Internet computing
Hofer Thomas,Foundations of Software
Kwanga Rodrigue,Biocomputing
Muriel Hugo Marcelo,Internet computing
Pakzad Pooya,Internet computing


In [284]:
students_sem_count.reset_index(inplace=True)
names.reset_index(inplace=True)

In [285]:
spec_count = pd.merge(students_sem_count, names, on='Nom Prénom', how='inner')
# Sum only the numeric column so that student names are not aggregated
semester_count = spec_count.groupby('Spécialisation')[['Total_semesters']].sum()
semester_count.reset_index(inplace = True)

student_count = spec_count.groupby('Spécialisation').count().drop(columns='Nom Prénom')
student_count.columns = ['Total_students']
student_count.reset_index(inplace = True)

In [286]:
data = pd.merge(semester_count, student_count, how = 'outer')
data

Unnamed: 0,Spécialisation,Total_semesters,Total_students
0,Biocomputing,21.0,6
1,Computer Engineering - SP,76.0,21
2,Computer Science Theory,3.0,1
3,Data Analytics,17.0,6
4,Foundations of Software,249.0,63
5,Information Security - SP,25.0,7
6,Internet Information Systems,3.0,1
7,Internet computing,378.0,100
8,Service science,19.0,5
9,"Signals, Images and Interfaces",136.0,34


In [287]:
average = (data.Total_semesters / data.Total_students)*6
spec_avg = pd.DataFrame({
        'Spécialisation' : semester_count['Spécialisation'],
        'Average time' : average
    })
spec_avg.set_index('Spécialisation', inplace=True)
spec_avg

Unnamed: 0_level_0,Average time
Spécialisation,Unnamed: 1_level_1
Biocomputing,21.0
Computer Engineering - SP,21.714286
Computer Science Theory,18.0
Data Analytics,17.0
Foundations of Software,23.714286
Information Security - SP,21.428571
Internet Information Systems,18.0
Internet computing,22.68
Service science,22.8
"Signals, Images and Interfaces",24.0

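
As a quick sanity check of the arithmetic, the averages can be recomputed in plain Python from the totals in the `data` table above: total semesters divided by total students, times six months. The dictionary name `totals` is an assumption made for this sketch.

```python
# (total semesters, total students) per specialization, copied from the table above.
totals = {
    'Biocomputing': (21, 6),
    'Computer Engineering - SP': (76, 21),
    'Computer Science Theory': (3, 1),
    'Data Analytics': (17, 6),
    'Foundations of Software': (249, 63),
    'Information Security - SP': (25, 7),
    'Internet Information Systems': (3, 1),
    'Internet computing': (378, 100),
    'Service science': (19, 5),
    'Signals, Images and Interfaces': (136, 34),
}

# Average number of months = average number of semesters * 6.
avg_months = {spec: sem / n * 6 for spec, (sem, n) in totals.items()}

for spec, value in avg_months.items():
    print(f'{spec}: {value:.6f}')
```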

Statistical test:
We want to know whether the average time spent at EPFL differs significantly between students with a given specialization and the general population. To this end, we use a one-sample t-test, testing the durations of the whole population against each specialization's average.
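
For reference, the one-sample t statistic compares a sample mean $\bar{x}$ with a fixed reference value $\mu_0$. In the cells that follow, the sample is `population_months` and $\mu_0$ is a specialization's average time, passed as `popmean`. The symbols $s$ (sample standard deviation) and $n$ (sample size) are the usual textbook notation, not names from this notebook.

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
\qquad
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\qquad
\text{with } n - 1 \text{ degrees of freedom.}
```

A negative statistic therefore means that the population mean lies below the specialization's average, and a positive one that it lies above.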

In [292]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Biocomputing', 'Average time'])

Ttest_1sampResult(statistic=array([-3.89669231]), pvalue=array([ 0.00010577]))

In [293]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Computer Engineering - SP', 'Average time'])

Ttest_1sampResult(statistic=array([-7.14804114]), pvalue=array([  2.00308538e-12]))

In [294]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Computer Science Theory', 'Average time'])

Ttest_1sampResult(statistic=array([ 9.75897278]), pvalue=array([  2.54311520e-21]))

In [295]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Data Analytics', 'Average time'])

Ttest_1sampResult(statistic=array([ 14.31086114]), pvalue=array([  1.76052055e-41]))

In [296]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Foundations of Software', 'Average time'])

Ttest_1sampResult(statistic=array([-16.25181787]), pvalue=array([  1.83863912e-51]))

In [297]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Information Security - SP', 'Average time'])

Ttest_1sampResult(statistic=array([-5.84750161]), pvalue=array([  7.29779673e-09]))

In [298]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Internet Information Systems', 'Average time'])

Ttest_1sampResult(statistic=array([ 9.75897278]), pvalue=array([  2.54311520e-21]))

In [299]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Internet computing', 'Average time'])

Ttest_1sampResult(statistic=array([-11.54386476]), pvalue=array([  1.34006809e-28]))

In [300]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Service science', 'Average time'])

Ttest_1sampResult(statistic=array([-12.09009136]), pvalue=array([  5.34384068e-31]))

In [301]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Signals, Images and Interfaces', 'Average time'])

Ttest_1sampResult(statistic=array([-17.5523574]), pvalue=array([  1.69195354e-58]))

In [302]:
stats.ttest_1samp(a=population_months,
               popmean=spec_avg.loc['Software Systems', 'Average time'])

Ttest_1sampResult(statistic=array([-3.09341789]), pvalue=array([ 0.00204822]))

The p-value for Software Systems (about 0.002) is the highest of all p-values, which means it is the specialization whose average differs least significantly from the overall population. Since even this value is well below the usual 0.05 significance level, every test rejects the null hypothesis: under these tests, the average time spent at EPFL as a master student differs significantly from the overall average for each specialization.