# Task 2
Perform a similar operation to what is described above, this time for Master students. Note that this data is trickier, as many records are missing from the IS-Academia database. Therefore, try to estimate how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2. If the Mineur field is not empty, the student should also appear registered in Master semestre 3. Last but not least, don't forget to check whether the student also has an entry in the Projet Master tables. Once you can handle this data well, compute the "average stay at EPFL" for master students. Now extract all the students with a Spécialisation and compute the "average stay" for each category of that attribute. Compared to the general average, can you find any specialization for which the difference in average is statistically significant?

In [1]:
import requests
import urllib
import pandas as pd
from bs4 import BeautifulSoup

#Add the base url where the form is
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!gedpublicreports.htm?ww_i_reportmodel=133685247"
full_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=&ww_x_PERIODE_ACAD=213638028&zz_x_PERIODE_PEDAGO=&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=&ww_x_HIVERETE=2936286&dummy=ok"

## Creating a dictionary for all the parameters

We will **follow the same procedure as in Task 1**, where we worked with the data of the bachelor students.
We keep the following parameter values fixed, because they do not change across the URLs we request and therefore across the reports:

* ww_x_GPS=-1                    "For 'tous' fixed"
* ww_i_reportModel=133685247     "fixed"
* ww_i_reportModelXsl=133685270  "fixed to HTML"
* ww_x_UNITE_ACAD=249847         "fixed to informatique"

In [2]:
ww_x_GPS = '-1'  #for 'tous' fixed
ww_i_reportModel = '133685247'  #fixed
ww_i_reportModelXsl = '133685270'  #fixed to HTML
ww_x_UNITE_ACAD = '249847'  #fixed to informatique
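
As a rough sketch of how the fixed values above combine with the three varying ones into a report URL, the snippet below uses `urllib.parse.urlencode`. The helper `build_report_url` and the name `fixed_params` are illustrative and not part of the notebook; the varying values are the ones that appear in `full_url`.

```python
from urllib.parse import urlencode

# Fixed parameter values, as listed above.
fixed_params = {
    'ww_x_GPS': '-1',
    'ww_i_reportModel': '133685247',
    'ww_i_reportModelXsl': '133685270',
    'ww_x_UNITE_ACAD': '249847',
}

def build_report_url(base, fixed, varying):
    """Return `base` followed by the URL-encoded fixed and varying parameters."""
    query = {**fixed, **varying}
    return base + urlencode(query)

report_base = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?'
example_url = build_report_url(
    report_base,
    fixed_params,
    # Varying values copied from full_url.
    {'ww_x_PERIODE_ACAD': '213638028',
     'ww_x_PERIODE_PEDAGO': '249108',
     'ww_x_HIVERETE': '2936286'},
)
print(example_url)
```

Passing a dictionary as `params` to `requests.get` produces the same kind of query string, so this is only meant to make the URL structure visible.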

In [3]:
r = requests.get( full_url )
soup = BeautifulSoup(r.text, 'lxml')

In [4]:
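# Map each <select> name to a {option label: option value} dictionary.
# Note: the name `dict` shadows the Python built-in; it is kept because later cells refer to it.
dict = {}
for select in soup.findAll('select'):
    name = select['name'].strip()
    dict[name] = {}
    for option in select.findAll('option'):
        # Skip the placeholder option whose value is 'null'
        if option['value'] != 'null':
            strng = option.string.strip()
            dict[name][strng] = option['value'].strip()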
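
To make the shape of the resulting nested dictionary concrete, here is a minimal, self-contained sketch that builds the same kind of mapping from a small inline HTML fragment. It swaps BeautifulSoup for the standard-library `html.parser` so that it runs without a network connection or extra packages. The class `SelectOptionParser`, the fragment, and the option values `111` and `222` are made up for illustration.

```python
from html.parser import HTMLParser

class SelectOptionParser(HTMLParser):
    """Collect {select name: {option label: option value}}, skipping 'null' options."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._select = None  # name of the <select> currently open
        self._value = None   # value of the <option> currently open
        self._text = []      # text fragments of the current <option>

    def handle_starttag(self, tag, attrs):
        # Build the attribute mapping without calling dict(), since the
        # notebook rebinds that name.
        attributes = {key: value for key, value in attrs}
        if tag == 'select':
            self._select = attributes.get('name', '').strip()
            self.options[self._select] = {}
        elif tag == 'option' and self._select is not None:
            self._value = attributes.get('value')
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'option' and self._value is not None:
            if self._value != 'null':
                label = ''.join(self._text).strip()
                self.options[self._select][label] = self._value.strip()
            self._value = None
        elif tag == 'select':
            self._select = None

# Made-up fragment imitating one drop-down of the form.
sample_html = """
<select name="ww_x_HIVERETE">
<option value="null"></option>
<option value="111">Semestre d'automne</option>
<option value="222">Semestre de printemps</option>
</select>
"""

parser = SelectOptionParser()
parser.feed(sample_html)
options = parser.options
print(options)
```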

## Creating a new dictionary for "Période Pédagogique"

We need a separate dictionary for the parameter "Période Pédagogique", because we focus only on Master students and that parameter contains values for both master and bachelor. We derive it from the general dictionary ("dict") and keep only the following values:
    
* Master semestre 1
* Master semestre 2
* Master semestre 3
* Master semestre 4
* Mineur semestre 1
* Mineur semestre 2
* Projet Master automne
* Projet Master printemps
* Stage printemps master

In [5]:
master_period_keys=['Master semestre 1','Master semestre 2','Master semestre 3','Master semestre 4',
               'Mineur semestre 1','Mineur semestre 2','Projet Master automne',
               'Projet Master printemps','Stage printemps master']
ww_x_PERIODE_PEDAGO_masters={key: dict['ww_x_PERIODE_PEDAGO'][key] for key in master_period_keys}

## Obtaining the DataFrames

We will iterate over all combinations of the three parameters we are interested in (the other four, as mentioned before, are fixed):

* ww_x_PERIODE_ACAD "Période académique"
* ww_x_PERIODE_PEDAGO "Période pédagogique"
* ww_x_HIVERETE "Type de semestre"

For each combination we build the corresponding URL and then check whether the report contains any tables; combinations without tables are skipped so that the loops do not raise an error.
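
As an aside, the same traversal can be written with `itertools.product` instead of three nested loops. The sketch below is only an illustration: the three small dictionaries and their values (`A`, `P`, `M1`, ...) are made up and stand in for the scraped option dictionaries.

```python
from itertools import product

# Made-up option dictionaries (label -> form value).
semester_types = {"Semestre d'automne": 'A', 'Semestre de printemps': 'P'}
teaching_periods = {'Master semestre 1': 'M1', 'Master semestre 2': 'M2'}
academic_periods = {'2012-2013': 'Y1', '2013-2014': 'Y2'}

combinations = []
for (sem_key, sem_val), (ped_key, ped_val), (acad_key, acad_val) in product(
        semester_types.items(), teaching_periods.items(), academic_periods.items()):
    combinations.append({
        # Labels that would later be stored as extra columns
        'labels': (acad_key, ped_key, sem_key),
        # Values that would be sent as query parameters
        'params': {
            'ww_x_HIVERETE': sem_val,
            'ww_x_PERIODE_PEDAGO': ped_val,
            'ww_x_PERIODE_ACAD': acad_val,
        },
    })

print(len(combinations))  # 2 * 2 * 2 = 8 combinations
print(combinations[0])
```

The iteration order is the same as with nested loops: the last dictionary passed to `product` varies fastest.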

In [6]:
count_df = 0          # number of non-empty reports collected so far
list_df_masters = []  # one DataFrame per non-empty report

# Query parameters for the report URL: four are fixed, as mentioned before,
# and three are overwritten on every iteration of the loops below.
params = {
    'ww_x_GPS': ww_x_GPS,                        # fixed
    'ww_i_reportModel': ww_i_reportModel,        # fixed
    'ww_i_reportModelXsl': ww_i_reportModelXsl,  # fixed
    'ww_x_UNITE_ACAD': ww_x_UNITE_ACAD,          # fixed
    'ww_x_PERIODE_ACAD': 'null',                 # varies
    'ww_x_PERIODE_PEDAGO': 'null',               # varies
    'ww_x_HIVERETE': 'null',                     # varies
}

for semester_type_key, semester_type_value in dict['ww_x_HIVERETE'].items():  # "Type de semestre"
    for teaching_period_key, teaching_period_value in ww_x_PERIODE_PEDAGO_masters.items():  # "Période pédagogique"
        for academic_period_key, academic_period_value in dict['ww_x_PERIODE_ACAD'].items():  # "Période académique"
            # Set the three varying parameters to obtain the URL of the corresponding report
            params['ww_x_HIVERETE'] = semester_type_value
            params['ww_x_PERIODE_PEDAGO'] = teaching_period_value
            params['ww_x_PERIODE_ACAD'] = academic_period_value
            urls = requests.get('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?', params)
            try:
                # read_html raises if the report contains no table
                df_report = pd.read_html(urls.url, header=0, skiprows=1, flavor='lxml')[0]
            except Exception:
                continue  # empty report: no DataFrame is created, move on to the next combination
            # Tag every row with the combination that produced this report
            df_report['Periode academique'] = academic_period_key
            df_report['Periode pedagogique'] = teaching_period_key
            df_report['Type de semestre'] = semester_type_key
            list_df_masters.append(df_report)
            count_df += 1

In [7]:
#Finally, we concatenate all the DataFrames in the list into one single DataFrame with all the information
df_masters=pd.concat(list_df_masters,ignore_index=True)
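
For readers unfamiliar with `pd.concat`, here is a small sketch of what the concatenation does to the row index. The two toy reports and their Sciper numbers are made up.

```python
import pandas as pd

# Two made-up reports, each tagged with the teaching period it came from.
report_a = pd.DataFrame({'No Sciper': [111111, 222222]})
report_a['Periode pedagogique'] = 'Master semestre 1'
report_b = pd.DataFrame({'No Sciper': [111111]})
report_b['Periode pedagogique'] = 'Master semestre 2'

# ignore_index=True discards the per-report indices (0, 1 and 0)
# and renumbers the rows 0..n-1 in the combined DataFrame.
combined = pd.concat([report_a, report_b], ignore_index=True)
print(combined)
```

Note that `pd.concat` raises a `ValueError` when given an empty list, so the step above assumes that at least one report contained a table.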

In [12]:
#Dataframe with all the columns
df_masters

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Unnamed: 11,Periode academique,Periode pedagogique,Type de semestre
0,Monsieur,Aranibar Casas Ivan Wilson,,,,,"Mineur en Management, technologie et entrepren...",Présent,,,225434,,2012-2013,Master semestre 1,Semestre d'automne
1,Monsieur,Aubelle Flavien,,,,,"Mineur en Management, technologie et entrepren...",Présent,,,174905,,2012-2013,Master semestre 1,Semestre d'automne
2,Monsieur,Augsburger Damien,,,,,,Présent,,,186595,,2012-2013,Master semestre 1,Semestre d'automne
3,Monsieur,Bace Mihai,,,,,,Présent,,,224520,,2012-2013,Master semestre 1,Semestre d'automne
4,Madame,Bai Yingkun,,,,,Mineur en Ingénierie financière,Présent,,,221503,,2012-2013,Master semestre 1,Semestre d'automne
5,Madame,Balmau Oana Maria,,,,,,Présent,,,192757,,2012-2013,Master semestre 1,Semestre d'automne
6,Monsieur,Baqapuri Afroze Ibrahim,,,,,,Présent,,,222427,,2012-2013,Master semestre 1,Semestre d'automne
7,Monsieur,Barben Loïc,,,,,,Présent,,,189517,,2012-2013,Master semestre 1,Semestre d'automne
8,Monsieur,Bindal Ashish Kishore,,,,,,Présent,,,212968,,2012-2013,Master semestre 1,Semestre d'automne
9,Monsieur,Blanc Yoan Pierre Michel,,,,,,Présent,,,213552,,2012-2013,Master semestre 1,Semestre d'automne
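
With the columns "Periode academique" and "Type de semestre" attached to every row, the distance in months between two semesters can be estimated. The sketch below is one possible approach, not a definitive one, and it rests on assumptions that are not in the data: an autumn semester is taken to start in September of the first year of the academic period, a spring semester in February of the second year, and every semester is counted as six months. The function names and the label "Semestre de printemps" are also assumptions.

```python
def semester_start(academic_period, semester_type):
    """Return (year, month) of the assumed start of a semester.

    Assumption: autumn starts in September of the first year of the
    academic period; any other semester type is treated as spring and
    starts in February of the second year.
    """
    first_year, second_year = (int(year) for year in academic_period.split('-'))
    if semester_type == "Semestre d'automne":
        return first_year, 9
    return second_year, 2

def months_between(start, end):
    """Number of months from the (year, month) pair `start` to `end`."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])

def stay_in_months(semesters, semester_length=6):
    """Estimate a stay from the first semester start to the end of the last one.

    `semesters` is a list of (academic period, semester type) pairs for one
    student; `semester_length` is an assumed duration in months.
    """
    starts = sorted(semester_start(period, kind) for period, kind in semesters)
    return months_between(starts[0], starts[-1]) + semester_length

# Made-up enrolment history of a single student.
semesters_example = [
    ('2012-2013', "Semestre d'automne"),
    ('2012-2013', 'Semestre de printemps'),
    ('2013-2014', "Semestre d'automne"),
]
stay = stay_in_months(semesters_example)
print(stay)  # 18
```

Because only the earliest and latest semesters are used, a missing record in between does not change the estimate, which is convenient given the gaps in the database.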


In [11]:
#DataFrame without the columns that contain at least one NaN value
df_masters.dropna(axis=1)

Unnamed: 0,Civilité,Nom Prénom,Statut,No Sciper,Periode academique,Periode pedagogique,Type de semestre
0,Monsieur,Aranibar Casas Ivan Wilson,Présent,225434,2012-2013,Master semestre 1,Semestre d'automne
1,Monsieur,Aubelle Flavien,Présent,174905,2012-2013,Master semestre 1,Semestre d'automne
2,Monsieur,Augsburger Damien,Présent,186595,2012-2013,Master semestre 1,Semestre d'automne
3,Monsieur,Bace Mihai,Présent,224520,2012-2013,Master semestre 1,Semestre d'automne
4,Madame,Bai Yingkun,Présent,221503,2012-2013,Master semestre 1,Semestre d'automne
5,Madame,Balmau Oana Maria,Présent,192757,2012-2013,Master semestre 1,Semestre d'automne
6,Monsieur,Baqapuri Afroze Ibrahim,Présent,222427,2012-2013,Master semestre 1,Semestre d'automne
7,Monsieur,Barben Loïc,Présent,189517,2012-2013,Master semestre 1,Semestre d'automne
8,Monsieur,Bindal Ashish Kishore,Présent,212968,2012-2013,Master semestre 1,Semestre d'automne
9,Monsieur,Blanc Yoan Pierre Michel,Présent,213552,2012-2013,Master semestre 1,Semestre d'automne
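
Note that `dropna(axis=1)` removes every column with at least one missing value, which is why columns such as Mineur disappear from the view above even though some rows have a value there. The toy example below contrasts this with `how='all'`, which removes only the columns that are entirely empty. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'Nom Prénom': ['Student A', 'Student B'],
    'Mineur': ['Mineur en Ingénierie financière', None],  # partly missing
    'Type Echange': [None, None],                         # entirely missing
})

# Default how='any': drops 'Mineur' and 'Type Echange'
any_nan_dropped = toy.dropna(axis=1)
# how='all': drops only 'Type Echange'
all_nan_dropped = toy.dropna(axis=1, how='all')

print(list(any_nan_dropped.columns))
print(list(all_nan_dropped.columns))
```

Since neither call is assigned back, the original DataFrame keeps all its columns in both cases.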
