In [1]:
import pandas as pd
import numpy as np
from IPython.display import HTML
import base64

The following function creates a link to download the generated sample files from the server.

In [2]:
def create_download_link(df, filename):  
    title = f'Download {filename}'
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)
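
The encoding step can be checked without a notebook. The sketch below is illustrative only: it builds the same kind of `data:` URI from a small hand-written CSV string (the two rows are made up) and decodes it back, to confirm that the payload survives the round trip.

```python
import base64

# Hypothetical two-row CSV, standing in for df.to_csv(index=False)
csv_text = 'Id,Choice\n0,1\n1,0\n'

# Same steps as in create_download_link: bytes -> base64 -> ASCII string
payload = base64.b64encode(csv_text.encode()).decode()
href = f'data:text/csv;base64,{payload}'

# Decoding the part after the first comma should give back the original CSV
decoded = base64.b64decode(href.split(',', 1)[1]).decode()
print(decoded == csv_text)
```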

Load the full population and display information about it

In [3]:
url = ('https://raw.githubusercontent.com/michelbierlaire/mooc-discrete-choice/master/'
       'syntheticPopulationWithChoice.zip')
population = pd.read_csv(url)

In [4]:
population.describe()

Unnamed: 0,Id,MarginalCostPT,WaitingTimePT,CostCarCHF,NbTransf,distance_km,TimePT,TimeCar,OccupStat,LangCode,CarAvail,Education,TripPurpose,Prob0,Prob1,Prob2,Choice
count,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0
mean,499999.5,11.130081,13.12591,5.76096,2.00925,40.380471,107.915374,40.702842,1.923853,1.744393,1.102572,4.15305,1.656476,0.286302,0.649073,0.06462583,0.779402
std,288675.278932,16.310957,22.341342,8.404421,2.200499,63.054669,88.125821,48.109134,0.869789,0.436202,0.441161,1.518441,0.474885,0.283246,0.271691,0.1078408,0.549619
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,3.0,1.0,2e-06,0.0,5.856389e-111,0.0
25%,249999.75,2.533611,0.0,1.451197,0.0,8.844138,46.343927,13.346889,1.0,1.0,1.0,3.0,1.0,0.084735,0.537786,2.228475e-06,0.0
50%,499999.5,5.473843,5.212542,2.969225,1.621549,18.861824,85.299806,26.206421,2.0,2.0,1.0,3.0,2.0,0.181506,0.710326,0.004427529,1.0
75%,749999.25,12.756591,17.591064,6.292116,3.281843,43.464997,143.598715,49.411055,3.0,2.0,1.0,6.0,2.0,0.368227,0.855592,0.08545803,1.0
max,999999.0,275.9058,470.223169,81.148874,16.785921,622.584149,893.873634,592.32768,3.0,2.0,3.0,7.0,2.0,1.0,0.999997,0.9720215,2.0


Calculate the population size, and set the sample size.

In [5]:
populationSize = population.shape[0]
sampleSize = 1000

# Simple random sample

## Calculation of the sampling probabilities

Simple random sampling assigns each individual the same probability of being sampled. We store it in the column <code>SRS</code>.

In [6]:
population['SRS'] = sampleSize / populationSize

We use the <code>sample</code> method of pandas.

In [7]:
srsData = population.sample(n=sampleSize)
srsData.shape

(1000, 18)
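
Because <code>sample</code> draws at random, each run of the cell above yields a different sample. If a reproducible draw is needed, pandas accepts a `random_state` seed. A minimal sketch on a toy data frame (the data and the seed value 42 are arbitrary assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy population of 100 individuals (made-up data)
toy = pd.DataFrame({'Id': range(100)})

# Two draws with the same seed select the same rows
first = toy.sample(n=10, random_state=42)
second = toy.sample(n=10, random_state=42)

print(first['Id'].tolist() == second['Id'].tolist())
# Sampling is without replacement by default, so the Ids are unique
print(first['Id'].is_unique)
```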

You can obtain the sample from the following link.

In [8]:
create_download_link(srsData, 'srsData.csv')

# Sampling function

We first define a generic function for stratified sampling. It takes as arguments: 

- the name of the sampling scheme, that is used to create new columns in the database,
- the list of masks identifying the entries in the database that belong to each group,
- the target shares for each group in the sample (the length of the list must be the number of groups).

The function returns $W$, the share of each group in the population, and the sample.

If the name of the sampling scheme is SS, say, two columns are added to the database: 

- the column <code>SS</code> contains the sampling probability of each individual,
- the column <code>SSGroup</code> contains the ID of the group of each individual.

Note the inner function <code>sampleStratum</code>, which takes a stratum as argument and samples from it. All rows of a stratum share the same group ID, so the statement <code>x[groupname].mean()</code> retrieves that ID from the column <code>SSGroup</code>, and the requested group size in the sample can then be looked up. Two special cases need to be addressed: 

- if the requested size is zero, the function returns <code>None</code>,
- if the requested size exceeds the number of individuals actually present in this group in the population, the whole group is returned.

In [9]:
def sample(name, mask, H):
    # name: name of the sampling scheme
    # mask: list of masks identifying the strata
    # H: target shares for each stratum
    groupname = f'{name}Group'
    # Calculate the share of each group in the population
    W = [None] * len(mask)
    for i, m in enumerate(mask):
        W[i] = population[m].shape[0] / populationSize
        population.loc[m, groupname] = i
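    # Sampling probabilities. Empty strata are skipped to avoid a division
    # by zero. The value is capped at 1, since a stratum smaller than the
    # requested size is returned in full.
    for s, m in enumerate(mask):
        if W[s] > 0:
            population.loc[m, name] = min(
                1.0, H[s] * sampleSize / (W[s] * populationSize)
            )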
    # Sampling
    groupsize = np.array(H) * sampleSize

    def sampleStratum(x):
        # The statement 'int(x[groupname].mean())' retrieves the index of
        # the group.
        size = int(groupsize[int(x[groupname].mean())])
        print(f'Sample {size} out of {x.shape[0]}')
        if size == 0:
            return None
        if size > x.shape[0]:
            print(
                'Warning: not enough individuals in stratum '
                'to reach the requirements'
            )
            return x
        return x.sample(n=size)

    return (
        W,
        (
            population.groupby(groupname)
            .apply(sampleStratum)
            .reset_index(drop=True)
        ),
    )
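
The two special cases can be illustrated on a toy population. The sketch below is a simplified, self-contained variant that loops over the strata instead of calling `groupby(...).apply`; the data, the `requested` sizes and the seed are all made up for the illustration.

```python
import pandas as pd

# Toy population: stratum 0 has 6 rows, stratum 1 has 2 rows, stratum 2 has 4 rows
toy = pd.DataFrame({'Id': range(12),
                    'Group': [0] * 6 + [1] * 2 + [2] * 4})
requested = {0: 3, 1: 5, 2: 0}  # hypothetical requested sizes per stratum

parts = []
for group, stratum in toy.groupby('Group'):
    size = requested[group]
    if size == 0:
        continue                   # nothing requested: skip the stratum
    if size > len(stratum):
        parts.append(stratum)      # not enough rows: keep the whole stratum
    else:
        parts.append(stratum.sample(n=size, random_state=0))

toy_sample = pd.concat(parts).reset_index(drop=True)
print(toy_sample.groupby('Group').size().to_dict())
```

Stratum 0 contributes 3 rows, stratum 1 contributes its 2 rows instead of the 5 requested, and stratum 2 contributes none, so the toy sample has 5 rows instead of 8.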


# Exogenously stratified sample

The strata are defined based on trip purpose and car availability. As the choice is not involved in the definition of strata, the stratified sampling is *exogenous*.

In [10]:
mask = [None]*4

Group 0: TripPurpose = work (1), CarAvail = yes (1) 

In [11]:
mask[0] = (population['TripPurpose'] == 1) & (population['CarAvail'] == 1)

Group 1: TripPurpose = work (1), CarAvail = no (not 1) 

In [12]:
mask[1] = (population['TripPurpose'] == 1) & (population['CarAvail'] != 1)

Group 2: TripPurpose = other (not 1), CarAvail = yes (1) 

In [13]:
mask[2] = (population['TripPurpose'] != 1) & (population['CarAvail'] == 1)

Group 3: TripPurpose = other (not 1), CarAvail = no (not 1) 

In [14]:
mask[3] = (population['TripPurpose'] != 1) & (population['CarAvail'] != 1)

Target shares in the sample: 25% for each stratum

In [15]:
H = [0.25]*4

Sampling

In [16]:
W, sampleXSS = sample('XSS', mask, H)

Sample 250 out of 322501
Sample 250 out of 21023
Sample 250 out of 626213
Sample 250 out of 30263


In [17]:
W

[0.322501, 0.021023, 0.626213, 0.030263]

Checksum: they should add up to one.

In [18]:
sum(W)

1.0
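
A sum of one is only a necessary condition: overlapping masks and uncovered rows could in principle cancel out. A stricter check, sketched below on made-up data, verifies that every row is selected by exactly one mask. The helper name `is_partition` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def is_partition(masks):
    # Hypothetical helper: count, for each row, how many masks select it.
    # The masks form a partition when every count equals one.
    counts = sum(m.astype(int) for m in masks)
    return bool((counts == 1).all())

toy = pd.DataFrame({'TripPurpose': [1, 1, 2, 2],
                    'CarAvail': [1, 3, 1, 3]})
good = [
    (toy['TripPurpose'] == 1) & (toy['CarAvail'] == 1),
    (toy['TripPurpose'] == 1) & (toy['CarAvail'] != 1),
    (toy['TripPurpose'] != 1) & (toy['CarAvail'] == 1),
    (toy['TripPurpose'] != 1) & (toy['CarAvail'] != 1),
]
# Overlapping variant: the second mask also selects the rows of the first
bad = [good[0], toy['TripPurpose'] == 1, good[2], good[3]]

print(is_partition(good), is_partition(bad))
```

On the real data, the same call would be `is_partition(mask)` with the list of masks defined in the notebook.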

We verify the sample size

In [19]:
sampleXSS.shape[0]

1000

We verify the share of each group in the sample

In [20]:
actualShares = list(sampleXSS.groupby('XSSGroup').size() / sampleSize)
actualShares

[0.25, 0.25, 0.25, 0.25]
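
To see what equal sample shares imply for individuals, the sampling probability of each stratum, $H_s n / (W_s N)$, can be recomputed from the population shares reported above. The sketch below is a standalone illustration; the inverse-probability weight it prints is a common convention added here as an assumption, not something the notebook computes.

```python
population_size = 1_000_000
sample_size = 1000
W = [0.322501, 0.021023, 0.626213, 0.030263]  # population shares reported above
H = [0.25] * 4                                 # target sample shares

probs = []
for s, (w, h) in enumerate(zip(W, H)):
    prob = h * sample_size / (w * population_size)  # sampling probability
    weight = 1 / prob                                # inverse-probability weight (assumption)
    probs.append(prob)
    print(f'Stratum {s}: probability {prob:.6f}, weight {weight:.1f}')
```

Individuals in the small strata are much more likely to be drawn: the probability in stratum 1 is roughly 15 times the probability in stratum 0.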

You can obtain the sample from the following link.

In [21]:
create_download_link(sampleXSS, 'sampleXSS.csv')

## Endogenously stratified sample 1

The strata are defined based on car availability and the chosen alternative.  As the choice is involved in the definition of strata, the stratified sampling is *endogenous*.

In [22]:
mask = [None] * 6

Group 0: CarAvail = yes (1), Choice = public transportation (0) 

In [23]:
mask[0] = (population['CarAvail'] == 1) & (population['Choice'] == 0)

Group 1: CarAvail = yes (1), Choice = car (1) 

In [24]:
mask[1] = (population['CarAvail'] == 1) & (population['Choice'] == 1)

Group 2: CarAvail = yes (1), Choice = slow mode (2) 

In [25]:
mask[2] = (population['CarAvail'] == 1) & (population['Choice'] == 2)

Group 3: CarAvail = no (not 1), Choice = public transportation (0) 

In [26]:
mask[3] = (population['CarAvail'] != 1) & (population['Choice'] == 0)

Group 4: CarAvail = no (not 1), Choice = car (1) 

In [27]:
mask[4] = (population['CarAvail'] != 1) & (population['Choice'] == 1)

Note: there is no individual in group 4. Indeed, if no car is available, car cannot be chosen.

Group 5: CarAvail = no (not 1), Choice = slow mode (2) 

In [28]:
mask[5] = (population['CarAvail'] != 1) & (population['Choice'] == 2)

Target shares in the sample: there are 5 non-empty strata, each with a share of 20%

In [29]:
H = [0.2, 0.2, 0.2, 0.2, 0.0, 0.2]

In [30]:
W, sampleESS = sample('ESS', mask, H)

Sample 200 out of 234751
Sample 200 out of 649256
Sample 200 out of 64707
Sample 200 out of 50920
Sample 200 out of 366


In [31]:
W

[0.234751, 0.649256, 0.064707, 0.05092, 0.0, 0.000366]

Checksum: they should add up to one.

In [32]:
sum(W)

0.9999999999999999

We verify the sample size

In [33]:
sampleESS.shape[0]

1000

We verify the share of each group in the sample

In [34]:
actualShares = list(sampleESS.groupby('ESSGroup').size() / sampleSize)
actualShares

[0.2, 0.2, 0.2, 0.2, 0.2]

You can obtain the sample from the following link.

In [35]:
create_download_link(sampleESS, 'sampleESS.csv')

## Endogenously stratified sample 2

The strata are defined based on car availability, the language and the chosen alternative. As the choice is involved in the definition of strata, the stratified sampling is *endogenous*.

In [36]:
mask = [None] * 12

Group 0: CarAvail = yes (1), language = French (1), Choice = public transportation (0) 

In [37]:
mask[0] = (population['CarAvail'] == 1) & (population['LangCode'] == 1) & (population['Choice'] == 0)

Group 1: CarAvail = yes (1), language = French (1), Choice = car (1) 

In [38]:
mask[1] = (population['CarAvail'] == 1) & (population['LangCode'] == 1) & (population['Choice'] == 1)

Group 2: CarAvail = yes (1), language = French (1), Choice = slow mode (2) 

In [39]:
mask[2] = (population['CarAvail'] == 1) & (population['LangCode'] == 1) & (population['Choice'] == 2)

Group 3: CarAvail = no (not 1), language = French (1), Choice = public transportation (0) 

In [40]:
mask[3] = (population['CarAvail'] != 1) & (population['LangCode'] == 1) & (population['Choice'] == 0)

Group 4: CarAvail = no (not 1), language = French (1), Choice = car (1) 

In [41]:
mask[4] = (population['CarAvail'] != 1) & (population['LangCode'] == 1) & (population['Choice'] == 1)

There is no individual in group 4. Indeed, if no car is available, car cannot be chosen.

Group 5: CarAvail = no (not 1), language = French (1), Choice = slow mode (2) 

In [42]:
mask[5] = (population['CarAvail'] != 1) & (population['LangCode'] == 1) & (population['Choice'] == 2)

Group 6: CarAvail = yes (1), language = German (2), Choice = public transportation (0) 

In [43]:
mask[6] = (population['CarAvail'] == 1) & (population['LangCode'] == 2) & (population['Choice'] == 0)

Group 7: CarAvail = yes (1), language = German (2), Choice = car (1) 

In [44]:
mask[7] = (population['CarAvail'] == 1) & (population['LangCode'] == 2) & (population['Choice'] == 1)

Group 8: CarAvail = yes (1), language = German (2), Choice = slow mode (2) 

In [45]:
mask[8] = (population['CarAvail'] == 1) & (population['LangCode'] == 2) & (population['Choice'] == 2)

Group 9: CarAvail = no (not 1), language = German (2), Choice = public transportation (0) 

In [46]:
mask[9] = (population['CarAvail'] != 1) & (population['LangCode'] == 2) & (population['Choice'] == 0)

Group 10: CarAvail = no (not 1), language = German (2), Choice = car (1) 

In [47]:
mask[10] = (population['CarAvail'] != 1) & (population['LangCode'] == 2) & (population['Choice'] == 1)

There is no individual in group 10. Indeed, if no car is available, car cannot be chosen.

Group 11: CarAvail = no (not 1), language = German (2), Choice = slow mode (2) 

In [48]:
mask[11] = (population['CarAvail'] != 1) & (population['LangCode'] == 2) & (population['Choice'] == 2)

Target shares in the sample: there are 10 non-empty strata, each with a share of 10%

In [49]:
H = [0.1, 0.1, 0.1, 0.1, 0.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.0, 0.1]

In [50]:
W, sampleESS2 = sample('ESS2', mask, H)

Sample 100 out of 38401
Sample 100 out of 201857
Sample 100 out of 9632
Sample 100 out of 5707
Sample 100 out of 10
Sample 100 out of 196350
Sample 100 out of 447399
Sample 100 out of 55075
Sample 100 out of 45213
Sample 100 out of 356


In [51]:
W

[0.038401,
 0.201857,
 0.009632,
 0.005707,
 0.0,
 1e-05,
 0.19635,
 0.447399,
 0.055075,
 0.045213,
 0.0,
 0.000356]

Checksum: they should add up to one. The sum is rounded to absorb floating-point error.

In [52]:
round(sum(W), 6)

1.0

We verify the sample size. In this case, we did not manage to sample 100 individuals from group 5, as this group is composed of only 10 individuals in the population. We are therefore missing 90 individuals in the sample, compared to the original request.

In [53]:
sampleESS2.shape[0]

910

We verify the share of each stratum in the sample. In this case, group 5 (the fifth stratum listed) did not contain enough individuals to fulfill the requirement.

In [54]:
actualShares = list(sampleESS2.groupby('ESS2Group').size() / sampleSize)
actualShares

[0.1, 0.1, 0.1, 0.1, 0.01, 0.1, 0.1, 0.1, 0.1, 0.1]

You can obtain the sample from the following link.

In [55]:
create_download_link(sampleESS2, 'sampleESS2.csv')
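
A shortfall of this kind can be anticipated before sampling, because a stratum cannot contribute more individuals than it contains. The sketch below is a standalone illustration using the stratum sizes of this scheme (groups 4 and 10 are empty, and one stratum contains only 10 individuals); the variable names are assumptions made for the example.

```python
sample_size = 1000
# Stratum sizes in the population for groups 0 to 11 (groups 4 and 10 are empty)
counts = [38401, 201857, 9632, 5707, 0, 10,
          196350, 447399, 55075, 45213, 0, 356]
H = [0.1, 0.1, 0.1, 0.1, 0.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.0, 0.1]

requested = [round(h * sample_size) for h in H]
# A stratum cannot contribute more individuals than it contains
realized = [min(r, c) for r, c in zip(requested, counts)]

print(sum(requested), sum(realized), sum(requested) - sum(realized))
```

The request totals 1000 individuals, of which 910 can be delivered, which leaves a shortfall of 90.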