In [1]:
import pandas as pd
import scipy.stats as stats
import os

# Preparation of the data

First, we need to prepare the data. The very first step is to concatenate the information from all the files we scraped from IsAcademia. (Reminder: we have one file per year.) The function `concatFiles` opens all the files and returns a single DataFrame.

In [2]:
# requires os
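def concatFiles(direc, fileType):
    """Read every file in `direc` and stack them into one DataFrame."""
    if fileType == 'csv':
        r = pd.read_csv
    else:
        # Only CSV is handled; fail early instead of using an undefined reader
        raise ValueError('Unsupported file type: ' + fileType)
    # Skip the macOS metadata file if it is present
    files = [f for f in os.listdir(direc) if f != '.DS_Store']
    # Read each file, then concatenate once at the end
    frames = [r(os.path.join(direc, f), header=0) for f in files]
    return pd.concat(frames, axis=0)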

In [3]:
info = concatFiles('data/', 'csv')
# Print the head of the big DataFrame
info.head()

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
0,Monsieur,Caesar Holger,,,,,,Présent,Erasmus,Karlsruher Institut für Technologie,215268,Informatique,2011-2012,Semestre automne
1,Monsieur,Cerezo Luna Alfredo,,,,,,Présent,Erasmus,Universidad Complutense de Madrid,214433,Informatique,2011-2012,Semestre automne
2,Monsieur,Gracia Diego,,,,,,Présent,Erasmus,Universidad de Zaragoza,214469,Informatique,2011-2012,Semestre automne
3,Monsieur,Herraez Concejo Borja Javier,,,,,,Présent,Erasmus,Universidad Politecnica de Valencia,214428,Informatique,2011-2012,Semestre automne
4,Monsieur,Järnberg Mathias,,,,,,Présent,Erasmus,"Royal Institute of Technology, (KTH) Stockholm",214299,Informatique,2011-2012,Semestre automne


In this exercise, we are only interested in Bachelor students. We therefore write a function that checks whether the *Semester* field contains the word *Bachelor*.

In [4]:
def locator_ba(s):
    return s.find('Bachelor') != -1

In [5]:
ba = info.loc[info['Semester'].apply(locator_ba)]
ba.head()

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
16,Monsieur,Aiulfi Loris Sandro,,,,,,Présent,,,202293,Informatique,2011-2012,Bachelor semestre 1
17,Monsieur,Akiba David,,,,,,Présent,,,206418,Informatique,2011-2012,Bachelor semestre 1
18,Monsieur,Albasini Romain,,,,,,Présent,,,198197,Informatique,2011-2012,Bachelor semestre 1
19,Monsieur,Albrecht Pablo,,,,,,Présent,,,212726,Informatique,2011-2012,Bachelor semestre 1
20,Monsieur,Alonso Seisdedos Florian,,,,,,Présent,,,215576,Informatique,2011-2012,Bachelor semestre 1
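
The same filter can also be written without a helper function. The sketch below is only an illustration on made-up data: the `toy` DataFrame is hypothetical, and `na=False` is an assumption about how missing *Semester* values should be treated (as "not Bachelor").

```python
import pandas as pd

# Toy stand-in for the real `info` DataFrame (values are made up)
toy = pd.DataFrame({
    'No Sciper': [1, 2, 3, 4],
    'Semester': ['Bachelor semestre 1', 'Master semestre 1',
                 'Semestre automne', None],
})

# Vectorized substring test; na=False treats missing values as "no match"
mask = toy['Semester'].str.contains('Bachelor', regex=False, na=False)
toy_ba = toy.loc[mask]
print(toy_ba)
```

Unlike `locator_ba`, this version does not fail on a missing value, because it never calls `.find` on a non-string.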


The next step is to select a student's rows by *sciper* number. It is easier to work with a unique number than with a full name.

In [6]:
def locSciper(df, sciper):
    return df.loc[df['No Sciper'] == sciper]

The next function returns a Boolean telling us whether a specific student has an entry for both *Bachelor semestre 1* and *Bachelor semestre 6*.

We consider that a student **without** these two entries did not complete the Bachelor (at least not in the data we have), so we don't take them into account. For the moment, that is our **only** assumption. (Later, we will add another one.)

In [7]:
def isOneSix(df):
    """ Take a DataFrame and check that there is an entry for both BA1 and BA6
    
    Args:
        df (DataFrame): a DataFrame, typically a .loc on a specific student
        
    Returns:
        bool: True if it finds both BA1 and BA6, False otherwise
    """
    # Any matching row in the Semester column is enough
    one = (df['Semester'] == 'Bachelor semestre 1').any()
    six = (df['Semester'] == 'Bachelor semestre 6').any()
    return bool(one and six)

We also collect the *Gender*. 

In [8]:
def getGender(df):
    """ Take a DataFrame and check the gender
    
    Args:
        df (DataFrame): a DataFrame, typically a .loc on a specific student
        
    Returns:
        int: 1 if the student is a woman ('Madame'), 0 otherwise
    """
    # Any row marked 'Madame' identifies the student as a woman
    return int((df['Civilité'] == 'Madame').any())

Now, we create a dictionary to collect the information. For each student, we need the **sciper**, the **gender** and the **number of occurrences** of this sciper. The latter tells us how many semesters the student took to go from *Bachelor semestre 1* to *Bachelor semestre 6*.

In [9]:
dico = {'sciper': [], 'gender': [], 'length': []}
# Go through all the rows in the Bachelor students DataFrame
for row in ba.iterrows():
    o = row[1]
    sciper = o['No Sciper']
    # Making sure we consider unique scipers
    if sciper not in dico['sciper']:
        df = locSciper(ba, sciper)
        # Check that the student has an entry for Bachelor 1 and Bachelor 6
        if isOneSix(df):
            dico['sciper'].append(sciper)
            dico['gender'].append(getGender(df))
            # Calculating length of stay by nbr of rows
            dico['length'].append(len(df))
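
The loop above filters the whole DataFrame once per row, which can become slow on larger data. As a possible alternative, here is a sketch that builds the same three columns with `groupby`. It runs on a made-up `toy_ba` DataFrame; the names `toy_ba`, `summary` and `data_toy` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the `ba` DataFrame (values are made up)
toy_ba = pd.DataFrame({
    'No Sciper': [1, 1, 1, 2, 2, 3],
    'Civilité': ['Madame', 'Madame', 'Madame',
                 'Monsieur', 'Monsieur', 'Monsieur'],
    'Semester': ['Bachelor semestre 1', 'Bachelor semestre 2',
                 'Bachelor semestre 6', 'Bachelor semestre 1',
                 'Bachelor semestre 6', 'Bachelor semestre 1'],
})

grouped = toy_ba.groupby('No Sciper')
summary = pd.DataFrame({
    # 1 if any row of the student is marked 'Madame', 0 otherwise
    'gender': grouped['Civilité'].apply(lambda s: int((s == 'Madame').any())),
    # Number of rows, i.e. number of recorded semesters
    'length': grouped.size(),
    'has_ba1': grouped['Semester'].apply(lambda s: (s == 'Bachelor semestre 1').any()),
    'has_ba6': grouped['Semester'].apply(lambda s: (s == 'Bachelor semestre 6').any()),
})

# Keep only the students with both a BA1 and a BA6 entry
data_toy = summary[summary['has_ba1'] & summary['has_ba6']]
print(data_toy[['gender', 'length']])
```

Here the sciper ends up in the index rather than in a column; `data_toy.reset_index()` would turn it back into a column if needed.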

We transform the dictionary into a pandas DataFrame.

In [10]:
data = pd.DataFrame(dico)
data.head()

Unnamed: 0,gender,length,sciper
0,0,12,202293
1,0,11,215576
2,0,8,213618
3,0,8,215623
4,0,10,212464


We use the `describe()` function to check for inconsistencies.

In [11]:
data['length'].describe()

count    397.000000
mean       7.083123
std        1.524428
min        4.000000
25%        6.000000
50%        6.000000
75%        8.000000
max       12.000000
Name: length, dtype: float64

We notice that the minimum length is **4 semesters**, which should not be possible for a 6-semester Bachelor. Let's see how many students have fewer than 6 entries.

In [12]:
data.loc[data.length < 6]

Unnamed: 0,gender,length,sciper
64,0,4,204222


We see that there's only one sciper. We can now use the Bachelor DataFrame to get more information about this person.

In [13]:
locSciper(ba, 204222)

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,Subject,Period,Semester
165,Monsieur,Séguy Louis Marie James,,,,,,Présent,,,204222,Informatique,2011-2012,Bachelor semestre 1
290,Monsieur,Séguy Louis Marie James,,,,,,Présent,,,204222,Informatique,2011-2012,Bachelor semestre 2
721,Monsieur,Séguy Louis Marie James,,,,,,Présent,,,204222,Informatique,2014-2015,Bachelor semestre 5
840,Monsieur,Séguy Louis Marie James,,,,,,Présent,,,204222,Informatique,2014-2015,Bachelor semestre 6


So... Either this person is a genius and didn't have to do his 2nd year or IsAcademia is missing some data. The second option seems more realistic. Therefore, we remove him from the DataFrame. 

In [14]:
data = data.drop(data.loc[data.length < 6].index)
data.loc[data.length < 6]


Unnamed: 0,gender,length,sciper


Data are ready! =)

# Do some statistics

Now, we want to see whether men and women complete their Bachelor in the same amount of time or whether there is a significant difference.

First, we create the men DataFrame.

In [15]:
men = data.loc[data.gender == 0]
men.length.describe()

count    367.000000
mean       7.114441
std        1.530379
min        6.000000
25%        6.000000
50%        6.000000
75%        8.000000
max       12.000000
Name: length, dtype: float64

Now, we can create the women DataFrame.

In [16]:
women = data.loc[data.gender == 1]
women.length.describe()

count    29.000000
mean      6.793103
std       1.346406
min       6.000000
25%       6.000000
50%       6.000000
75%       8.000000
max      11.000000
Name: length, dtype: float64

Let's talk about these data.

We can see that there are 367 men and "only" 29 women, so the two groups are very unbalanced in size. The min, 25% and 50% quantiles are the same for both groups. This is expected, since at least half of the students complete their Bachelor in 6 semesters (3 years).

We can now use a statistical test to see whether the difference in length is statistically significant. We use a **two-sample t-test** because we want to compare two populations (men and women). We do not assume equal variances (`equal_var=False`, also known as Welch's t-test), since the std values above differ between men and women. We mainly look at the p-value: if it is less than 0.05 (5%), we consider the difference between men and women significant.
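
To make the unequal-variance test more concrete, here is a small sketch that recomputes the t statistic by hand from the `describe()` values shown above (mean, std, count). The helper `welch_t` is hypothetical and only meant as an illustration; it also returns the Welch-Satterthwaite approximation of the degrees of freedom.

```python
import math

def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    v1 = std1 ** 2 / n1  # squared standard error of group 1
    v2 = std2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof

# Women first, men second, using the describe() values shown above
t, dof = welch_t(6.793103, 1.346406, 29, 7.114441, 1.530379, 367)
print(round(t, 3), round(dof, 1))
```

The statistic should come out close to the one reported by `stats.ttest_ind` with `equal_var=False`. The p-value additionally requires the t distribution, which we leave to scipy.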

In [17]:
stats.ttest_ind(a=women.length, b=men.length, equal_var=False)

Ttest_indResult(statistic=-1.2242690386586714, pvalue=0.22927095396453942)

We can see that the p-value is about 0.23, well above 0.05. Therefore, this test does not show a significant difference between men and women in the length of their Bachelor.

# Add some assumptions

Now, we restrict the data a bit. If we keep the students who started in '2013-2014', we only see those who finished their Bachelor in 6 semesters, which introduces a huge bias. For the same reason, we can remove the students who started in '2011-2012' and '2012-2013', because we would miss those who needed more time. Normally, the maximum length of the Bachelor is 6 years (12 semesters). So, if we remove all the students who started their Bachelor after '2010-2011', we remove the bias due to students who repeated one or more years.

On the other hand, if we keep all the students who appear in first year in '2007-2008', we cannot tell those who really started that year from those who started in '2006-2007' and repeated their first year. Therefore, we remove the students whose only first-year entry is in '2007-2008'. (If a student did the first year in both '2007-2008' and '2008-2009', we keep them, because we know they repeated the first year.)

The function `restrict_years` returns true if a student had an entry for *Bachelor semestre 1* in the years '2008-2009', '2009-2010' or '2010-2011'. It returns false otherwise.

In [18]:
def restrict_years(df):
    """ Take a DataFrame and check that there is a BA1 entry in one of the allowed years
    
    Args:
        df (DataFrame): a DataFrame, typically a .loc on a specific student
        
    Returns:
        bool: True if student started in the correct years, False otherwise
    """  
    good_years = ['2008-2009', '2009-2010', '2010-2011']

    try:
        periods_ba1 = df.loc[df['Semester'] == 'Bachelor semestre 1'].Period
        if any(years in good_years for years in periods_ba1):
            return True
        else:
            return False
    except KeyError:
        return False
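
As a rough illustration of this rule, here is a sketch on made-up data that expresses the same check with `Series.isin`. The two toy DataFrames and the helper name `started_in_good_years` are hypothetical and not part of the notebook.

```python
import pandas as pd

GOOD_YEARS = ['2008-2009', '2009-2010', '2010-2011']

def started_in_good_years(student):
    """Return True if any BA1 entry falls in one of the allowed years."""
    ba1_periods = student.loc[student['Semester'] == 'Bachelor semestre 1', 'Period']
    return bool(ba1_periods.isin(GOOD_YEARS).any())

# Made-up student whose only first-year entry is in 2007-2008
only_2007 = pd.DataFrame({
    'Semester': ['Bachelor semestre 1', 'Bachelor semestre 2'],
    'Period': ['2007-2008', '2007-2008'],
})
# Made-up student who repeated the first year in 2008-2009
repeated = pd.DataFrame({
    'Semester': ['Bachelor semestre 1', 'Bachelor semestre 1'],
    'Period': ['2007-2008', '2008-2009'],
})

print(started_in_good_years(only_2007))
print(started_in_good_years(repeated))
```

The first student is dropped and the second one is kept, which matches the two cases described in the text.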

Now, we create another dictionary with the restricted years.

In [19]:
dico_restricted = {'sciper': [], 'gender': [], 'length': []}
for row in ba.iterrows():
    o = row[1]
    sciper = o['No Sciper']
    # Making sure we consider unique scipers
    if sciper not in dico_restricted['sciper']:
        df = locSciper(ba, sciper)
        if isOneSix(df) and restrict_years(df):
            dico_restricted['sciper'].append(sciper)
            dico_restricted['gender'].append(getGender(df))
            # Calculating length of stay by nbr of rows
            dico_restricted['length'].append(len(df))

We transform this dictionary into a DataFrame and describe the length to check whether anything looks wrong.

In [20]:
data_restricted = pd.DataFrame(dico_restricted)
data_restricted.length.describe()

count    147.000000
mean       7.414966
std        1.755007
min        6.000000
25%        6.000000
50%        6.000000
75%        8.000000
max       12.000000
Name: length, dtype: float64

As we can see, we now have 147 students instead of 396, so this restriction removed a lot of data. The student with only 4 recorded semesters is also gone (the minimum is now 6). Therefore, we can directly create the DataFrame for the men and describe it.

In [21]:
men_restricted = data_restricted.loc[data_restricted.gender == 0]
men_restricted.length.describe()

count    138.000000
mean       7.471014
std        1.776538
min        6.000000
25%        6.000000
50%        7.000000
75%        8.000000
max       12.000000
Name: length, dtype: float64

We can do the same for the women.

In [22]:
women_restricted = data_restricted.loc[data_restricted.gender == 1]
women_restricted.length.describe()

count    9.000000
mean     6.555556
std      1.130388
min      6.000000
25%      6.000000
50%      6.000000
75%      6.000000
max      9.000000
Name: length, dtype: float64

As we can see, we only have 9 women now. However, the mean values have changed: the women now seem to finish their Bachelor earlier than the men. We can again use a **two-sample t-test** to check whether this is the case.

In [23]:
stats.ttest_ind(a=women_restricted.length, b=men_restricted.length, equal_var=False)

Ttest_indResult(statistic=-2.2547599154275426, pvalue=0.045997368363103121)

And now, we have a really interesting result! The p-value (about 0.046) is now just under 0.05 (5%), which means that the difference is statistically significant at the 5% level.



# Conclusion:

Women (~6.5 semesters) seem to take about one semester less than men (~7.5 semesters) to finish their Bachelor. The t-test indicates that this difference is statistically significant at the 5% level. **But** we have to be careful with this result, because we have very few entries for women (only 9 in the restricted data).