In [1]:
import pandas as pd
import scipy.stats as stats
import os
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

## Data loading

In [2]:
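# requires os and pandas
def concatFiles(direc, fileType):
    """Read every file of type fileType in direc and stack them into one DataFrame."""
    if fileType != 'csv':
        # only CSV is handled; fail loudly instead of hitting an undefined reader
        raise ValueError('unsupported file type: ' + fileType)
    # skip the macOS '.DS_Store' metadata file
    files = [f for f in os.listdir(direc) if f != '.DS_Store']
    # read each file (first row is the header) and concatenate once at the end
    frames = [pd.read_csv(os.path.join(direc, f), header=0) for f in files]
    return pd.concat(frames, axis=0)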

In [3]:
info = concatFiles('data/', 'csv')
# Print the head of the big DataFrame
info.head()

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
0,Monsieur,Bouaziz Sofien,,,,,,Présent,Erasmus,Ecole Supérieure d'Ingénieurs en Electronique ...,179749,Informatique,2007-2008,Semestre automne
1,Monsieur,Charles Christian,,,,,,Présent,Erasmus,Ecole Supérieure de Chimie Physique Electroniq...,180104,Informatique,2007-2008,Semestre automne
2,Monsieur,Dagand Pierre-Evariste,,,,,,Présent,Erasmus,Ecole Normale Supérieure de Cachan,181031,Informatique,2007-2008,Semestre automne
3,Monsieur,Grataloup Olivier,,,,,,Présent,Erasmus,Ecole Supérieure de Chimie Physique Electroniq...,179911,Informatique,2007-2008,Semestre automne
4,Monsieur,Grignard Arnaud,,,,,,Présent,Erasmus,Ecole Supérieure de Chimie Physique Electroniq...,179934,Informatique,2007-2008,Semestre automne


# Data exploration
## *Semestre Automne* and *Semestre Printemps* values

The first thing to do before analysing the data is to understand it. Most fields and values are well described by their names, but two are problematic: the values *Semestre automne* and *Semestre printemps* of the field *Semester*.

Because the whole study focuses on how much time students spend at EPFL, the *Semester* field is very important and we cannot simply discard these values.

What do these values mean?

In [4]:
info.Semester.unique()

array(['Semestre automne', 'Semestre printemps', 'Admission automne',
       'Admission printemps', 'Bachelor semestre 1', 'Bachelor semestre 2',
       'Bachelor semestre 3', 'Bachelor semestre 4', 'Bachelor semestre 5',
       'Bachelor semestre 6', 'Master semestre 1', 'Master semestre 2',
       'Master semestre 3', 'Projet Master automne',
       'Projet Master printemps'], dtype=object)

To get a first idea of what these values could mean, we display the whole career of every student who has this kind of semester and try to build an intuition.

In [5]:
def career_with_Aut_Prin(df):
    # Find all students with at least one 'Semestre automne' or 'Semestre printemps' value in
    # their 'Semester' field
    sciper_Aut_Prin = df.loc[df['Semester'].isin(['Semestre automne','Semestre printemps'])]['No Sciper'].unique()
    # Select the whole careers of these students
    Aut_Prin = df.loc[df['No Sciper'].isin(sciper_Aut_Prin)]
    
    return Aut_Prin

Aut_Prin = career_with_Aut_Prin(info).sort_values(by='No Sciper')
Aut_Prin.head()

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
890,Monsieur,Martin Damien,,,,,,Présent,,,121367,Informatique,2016-2017,Semestre automne
886,Monsieur,Essellak Radouane,,,,,,Présent,,,129094,Informatique,2016-2017,Semestre automne
1126,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2013-2014,Semestre printemps
828,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2016-2017,Projet Master automne
975,Monsieur,Amiguet Jérôme,,,,,,Stage,,,166075,Informatique,2015-2016,Master semestre 2


In many cases, it seems that these students appear only twice and are part of an exchange program. Let's see whether that explains the whole "Automne and Printemps" thing.

In [6]:
# Count how many occurrences of the values 'Semestre automne' or 'Semestre printemps' appear in the data set.
Num_sem = info.loc[info['Semester'].isin(['Semestre automne','Semestre printemps'])]
num_sem_student = Num_sem['No Sciper'].unique()
print("There are", Num_sem.shape[0], "occurrences of Semestre automne or Semestre printemps in the data set (distributed among", num_sem_student.shape[0], "different students)")

# Count how many of them occur together with an exchange program.
Num_sem_in_exchange = info.loc[info['Semester'].isin(['Semestre automne','Semestre printemps']) & info['Type Echange'].notnull()]
num_sem_in_exchange_student = Num_sem_in_exchange['No Sciper'].unique()
print(Num_sem_in_exchange.shape[0], "of them belong to exchange students (distributed among", num_sem_in_exchange_student.shape[0], "students)")

There are 287 occurrences of Semestre automne or Semestre printemps in the data set (distributed among 222 different students)
220 of them belong to exchange students (distributed among 193 students)


Our intuition is correct and explains a large part of the `Semestre automne` and `Semestre printemps` values: 220 of the 287 rows (about 77%) and 193 of the 222 students (about 87%). We can suppose that these values are used for every student doing something that doesn't "fit" in the IS-Academia system. **They are then "garbage" values, used just to enter these students in the system without more precision.**

Let's remove the cases we now understand and try to understand the remaining ones.

If we look at "Amiguet Jérôme" in the table below, we see that the values `Semestre automne` and `Semestre printemps` run in parallel with other semesters: they appear during the same year. We can imagine that he took some extracurricular classes that didn't fit in the official program, and that these classes were stored under this denomination.

In [7]:
# display the career of Amiguet, who has 'Semestre automne' and 'Semestre printemps' values
# but was not on exchange
Aut_Prin.loc[Aut_Prin['Nom Prénom'].str.contains('Amiguet')]

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
1126,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2013-2014,Semestre printemps
828,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2016-2017,Projet Master automne
975,Monsieur,Amiguet Jérôme,,,,,,Stage,,,166075,Informatique,2015-2016,Master semestre 2
843,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2015-2016,Master semestre 1
1204,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Semestre printemps
1198,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Semestre automne
965,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Master semestre 2
861,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Master semestre 1
1123,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2013-2014,Semestre automne


In [8]:
# from the original dataframe, remove all exchange semesters
dataframe_without_exchange = info.loc[info['Type Echange'].isnull()]
# select students having a "semestre automne" or "semestre printemps" in their career
# (this time without exchange students)
Aut_Prin = career_with_Aut_Prin(dataframe_without_exchange)

print('There are', Aut_Prin['No Sciper'].unique().shape[0], 'students who have Semestre automne or Semestre printemps values but were not exchange students')
Aut_Prin.head()

There are 29 students who have Semestre automne or Semestre printemps values but were not exchange students


Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
587,Monsieur,Gaba Foli Kodjo,,,,,,Présent,,,182435,Informatique,2008-2009,Bachelor semestre 3
590,Monsieur,Gaba Foli Kodjo,,,,,,Présent,,,182435,Informatique,2008-2009,Bachelor semestre 4
632,Monsieur,Atitallah Samir,,,,,,Présent,,,196669,Informatique,2009-2010,Semestre automne
633,Monsieur,Cartier Sebastian,,,,,,Présent,,,192857,Informatique,2009-2010,Semestre automne
634,Monsieur,Gaba Foli Kodjo,,,,,,Présent,,,182435,Informatique,2009-2010,Semestre automne


In [9]:
# compute a table without 'Semestre automne' and 'Semestre printemps' rows
Aut_Prin2 = Aut_Prin.loc[~Aut_Prin['Semester'].isin(['Semestre automne', 'Semestre printemps'])]

# compute a table only with 'Semestre automne' and 'Semestre printemps' rows
Aut_Prin3 = Aut_Prin.loc[Aut_Prin['Semester'].isin(['Semestre automne', 'Semestre printemps'])]
print("There are", Aut_Prin3['No Sciper'].unique().shape[0], "students who have a Semestre automne or Semestre printemps value but are not exchange students")

# search for students who were enrolled in 'Semestre automne' or 'Semestre printemps' and in another
# semester during the same period
array_sciper = []
# select the two columns by name rather than by tuple position
for sciper, period in Aut_Prin3[['No Sciper', 'Period']].itertuples(index=False):
    if not Aut_Prin2.loc[(Aut_Prin2['No Sciper'] == sciper) & (Aut_Prin2['Period'] == period)].empty:
        array_sciper.append(sciper)


print(np.unique(array_sciper).shape[0], "students were doing another, regular semester in parallel.")

# display their whole career
info.loc[info['No Sciper'].isin(array_sciper)].sort_values(by='No Sciper').head()

There are 29 students who have a Semestre automne or Semestre printemps value but are not exchange students
15 students were doing another, regular semester in parallel.


Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
1204,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Semestre printemps
1198,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Semestre automne
843,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2015-2016,Master semestre 1
965,Monsieur,Amiguet Jérôme,,,,,,Présent,,,166075,Informatique,2014-2015,Master semestre 2
975,Monsieur,Amiguet Jérôme,,,,,,Stage,,,166075,Informatique,2015-2016,Master semestre 2


According to our calculation, 14 students (29 - 15) remain whose "semestre automne" or "semestre printemps" values we still don't understand.

In [10]:
# find the SCIPER of the remaining students, i.e. students who don't fall into the first two categories we created.
array_remaining_student = set(Aut_Prin3['No Sciper'].unique()) - set(array_sciper)
# find their careers
careers_of_remaining_student = info.loc[info['No Sciper'].isin(array_remaining_student)].sort_values(by='No Sciper')
# display how many semesters they spent at EPFL
careers_of_remaining_student.groupby('No Sciper').size()

No Sciper
121367    1
129094    1
175428    1
192857    4
192899    4
193798    4
203367    1
222261    1
225493    1
225837    1
226599    1
248205    1
268141    1
273605    1
dtype: int64

Of the 14 remaining students, 11 appear only once in the IS-Academia table. They did only one semester at EPFL and we don't know the reason.
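
The split between students with a single row and the others can be read directly from the group sizes. A minimal sketch, assuming a toy table with invented SCIPER numbers (the names `toy`, `rows_per_student`, `single_semester` and `several_semesters` are illustrative):

```python
import pandas as pd

# Toy careers: students 10 and 30 have one row, student 20 has four.
toy = pd.DataFrame({
    'No Sciper': [10, 20, 20, 20, 20, 30],
    'Semester': ['Semestre automne',
                 'Semestre automne', 'Semestre printemps',
                 'Master semestre 1', 'Master semestre 2',
                 'Semestre printemps'],
})

# number of rows (semesters) recorded for each student
rows_per_student = toy.groupby('No Sciper').size()

# students who appear exactly once, and students who appear several times
single_semester = rows_per_student[rows_per_student == 1].index.tolist()
several_semesters = rows_per_student[rows_per_student > 1].index.tolist()
print(single_semester, several_semesters)
```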

In [11]:
# display the three remaining unexplained students
info.loc[info['No Sciper'].isin([192857, 192899, 193798])].sort_values(by='No Sciper')

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
633,Monsieur,Cartier Sebastian,,,,,,Présent,,,192857,Informatique,2009-2010,Semestre automne
638,Monsieur,Cartier Sebastian,,,,,,Présent,,,192857,Informatique,2009-2010,Semestre printemps
537,Monsieur,Cartier Sebastian,,,,,,Présent,,,192857,Informatique,2010-2011,Master semestre 1
634,Monsieur,Cartier Sebastian,,,,,,Présent,,,192857,Informatique,2010-2011,Master semestre 2
635,Monsieur,Monnier Alex,,,,,,Présent,,,192899,Informatique,2009-2010,Semestre automne
640,Monsieur,Monnier Alex,,,,,,Présent,,,192899,Informatique,2009-2010,Semestre printemps
581,Monsieur,Monnier Alex,,,,,,Présent,,,192899,Informatique,2010-2011,Master semestre 1
684,Monsieur,Monnier Alex,,,,,,Présent,,,192899,Informatique,2010-2011,Master semestre 2
636,Monsieur,Schmid Jonas,,,,,,Présent,,,193798,Informatique,2009-2010,Semestre automne
641,Monsieur,Schmid Jonas,,,,,,Présent,,,193798,Informatique,2009-2010,Semestre printemps


The last three students have exactly the same career: it begins in 2009 with one year of "Automne" and "Printemps" semesters, followed by one year of master. It's hard to guess exactly why, but we can suppose that this was some special admission program.

## Conclusion

There are 287 occurrences of the "Semestre automne" and "Semestre printemps" values in the field "Semester", and the careers of 222 students contain them.

These 222 careers are distributed as follows:
* 193 exchange students
* 15 students doing out-of-program classes
* 11 students doing only one semester at EPFL
* 3 students doing some special admission program
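
As a quick sanity check, the four categories should add up to the 222 students counted earlier, and the three non-exchange categories to the 29 non-exchange students. The figures below are the ones reported in this notebook; the dictionary and its keys are just an illustrative way to hold them.

```python
# Counts taken from the analysis above
breakdown = {
    'exchange students': 193,
    'out-of-program classes': 15,
    'single semester at EPFL': 11,
    'special admission program': 3,
}

total_students = sum(breakdown.values())
non_exchange = total_students - breakdown['exchange students']

print(total_students)  # 222
print(non_exchange)    # 29
```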

Our analysis confirms our earlier supposition. **These values are used as "garbage" values, just to register students in the IS-Academia system without more precision.** As we saw, they are mostly used for exchange students doing a semester at EPFL, but also for other special cases.

## How to deal with it?

As the objective of the project is to look at the time students spend on their bachelor and master, there is no need to take these special cases into account. We will therefore simply remove them when analysing the data in exercises 1 & 2.
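
Removing these rows could look like the following sketch. It runs on an invented toy table rather than on `info`, and `drop_placeholder_semesters` and `PLACEHOLDER_SEMESTERS` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Toy stand-in for the `info` DataFrame; the rows are invented.
toy = pd.DataFrame({
    'No Sciper': [1, 1, 2, 3],
    'Semester': ['Bachelor semestre 1', 'Semestre automne',
                 'Semestre printemps', 'Master semestre 1'],
})

PLACEHOLDER_SEMESTERS = ['Semestre automne', 'Semestre printemps']

def drop_placeholder_semesters(df):
    """Return a copy of df without the 'Semestre automne/printemps' rows."""
    mask = df['Semester'].isin(PLACEHOLDER_SEMESTERS)
    return df.loc[~mask].copy()

cleaned = drop_placeholder_semesters(toy)
print(cleaned)
```

Returning a copy leaves the original table untouched, so the unfiltered data remains available if the special cases need another look later.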