# Participant Selection for the HGSFP Winter School 2019

This notebook documents the participant selection procedure for the HGSFP Winter School 2019.

Disclaimer: This Jupyter notebook is based on a tutorial notebook for the participant selection for Python in Astronomy 2017 by Daniela Huppenkothen, which is available at https://github.com/dhuppenkothen/PyAstro17ParticipantSelection. Vast portions of the present notebook have been adopted from there, and some parts were retained verbatim. We are very thankful to Daniela Huppenkothen and the pyastro17 SOC for providing their insightful notebook and for making their selection process transparent and comprehensible.

---------------------------

For privacy reasons, this notebook uses data that has been completely randomized within categories, so that no candidate is individually identifiable (names and other markers of identity have been removed completely).

For this reason, the results of this procedure do not exactly mirror the results of our participant selection: the candidates in our data set here are random combinations that follow the input distributions of our real data, and not actual people. We felt it was important to be both transparent about and accountable for our selection procedure. This notebook is designed to give the reader an overview of the procedure from start to finish, and we have added our reasoning for certain choices where those were part of the selection. The notebook is also an example of what this kind of procedure can look like, and thus a kind of tutorial for other conference organizers.

Our procedure for admitting participants is constantly evolving as we tweak it, make mistakes, and learn from them. If you have suggestions for future procedures (or thoughts about participant selection more generally), we would love to hear from you, either via an issue on this repository or by e-mail to **winterschool2019@physi.uni-heidelberg.de**.

## Asking The Right Questions

Designing the application form was perhaps the most difficult task, and it is at this stage that conference organizers will already want to put serious thought into the goals of the workshop and the ideal mix of participants to achieve those goals. It should be obvious, but it bears repeating: you will only be able to include categories in your selection that you actually ask for! 

## Pre-selection
Excluding speakers, we have 52 spots for the meeting.
Our participant selection proceeded in two parts. In the first part, we rejected outright all entries that were either (1) duplicates or (2) from candidates who had informed us that they would not be able to come.

Two spots were reserved for the HGSFP representatives. 

Finally, we pre-selected the organizing committee, whose members need to be present at the school. Thus, a total of 8 participants (6 organizers, 2 representatives) were pre-selected.

We then anonymized our applicant pool by replacing names and other identifying information with a unique identifier. 
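
The anonymization step can be sketched as follows. This is a minimal illustration rather than the code used for the real data: the record layout, the helper name `anonymize`, and the choice of `uuid4` identifiers are assumptions made for the example.

```python
import uuid

# Hypothetical identifying fields; the real form had more columns.
IDENTIFYING = {"firstname", "lastname", "email"}

def anonymize(records):
    """Split records into an anonymized list and a private id -> identity key."""
    anonymized, key = [], {}
    for rec in records:
        uid = uuid.uuid4().hex  # random identifier, unrelated to the person
        key[uid] = {k: v for k, v in rec.items() if k in IDENTIFYING}
        public = {k: v for k, v in rec.items() if k not in IDENTIFYING}
        public["id"] = uid
        anonymized.append(public)
    return anonymized, key

# Made-up records for illustration
records = [
    {"firstname": "A", "lastname": "Example", "email": "a@example.org", "gender": "Female"},
    {"firstname": "B", "lastname": "Sample", "email": "b@example.org", "gender": "Male"},
]
anon, key = anonymize(records)
```

Only `anon` would enter the selection; `key` stays aside until unblinding.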

## Participant Selection

For the remaining 52 - 8 = 44 slots, we used `Entrofy` to optimize our participant set based on a set of well-defined criteria on which the organizers agreed. It's worth noting here that this discussion took place _before_ performing the selection, which then depended entirely on the _goals_ for the selection and was independent of the input data set. 

### Imports

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import entrofy
import pandas as pd
import datetime

## Data Loading

Pandas to the rescue!

In [None]:
applicants = pd.read_csv("../data/applicants.csv", sep=",")

Rename some columns with lengthy names

In [None]:
columnsRename = {
    'Unnamed: 0':'No',
    'Username':'email',
    'HGSFP Branch':'branch',
    'First Name':'firstname',
    'Last Name':'lastname',
    'Matriculation Number (if enrolled)':'matrikel',
    'Have you attended an HGSFP Winter School before?':'alrPart',
    'Gender':'gender',
    'When did you start your PhD?':'startP',
    'Poster Abstract':'Abstract',
    'Affiliation of Authors':'Affiliation',
    'Names of Authors':'Autor',
    'Name of your Supervisor/Professor/PI':'SuperWork',
    'Thesis subject':'ThesisSub',
    'Poster Title ':'TitleOfPoster',
    'Residential Address':'address',
    'Date of Birth':'birth',
    'Do you plan to bring your own skiing/snowboarding equipment?':'equipment',
    'Mobile Number':'mobil',
    'I have noted that during the school activities pictures may be taken, which may be used for promotion purposes of the HGSFP (e.g., website). I will contact the organizers directly if I do not agree with the usage of my pictures.* ':'duty',
    'Please describe your motivation for attending the winter school in up to two sentences.':'motivation'
}
applicants.rename(index=str, columns=columnsRename, inplace=True)

Let's have a look at the data:

In [None]:
applicants.head()

Check for duplicates based on the email addresses

In [None]:
len(applicants) == applicants.email.nunique()
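
The check above only reports whether duplicates exist. To see which addresses repeat, something like the following plain-Python sketch can help. The addresses are made up, `find_duplicates` is a hypothetical helper, and normalizing case and whitespace is an assumption (the check above compares raw strings).

```python
from collections import Counter

def find_duplicates(emails):
    """Return addresses that occur more than once, ignoring case and whitespace."""
    counts = Counter(e.strip().lower() for e in emails)
    return sorted(addr for addr, n in counts.items() if n > 1)

# Made-up addresses for illustration
emails = ["a@example.org", "b@example.org", "A@example.org ", "c@example.org"]
duplicates = find_duplicates(emails)
print(duplicates)
```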

Add the organizers and other pre-selected participants to the list of accepted participants.
The 'rejected' candidates include applicants who have declined to participate.

In [None]:
# lists of rejected/pre-selected participants (random names from listofrandomnames.com)
rejected = []
organizers = ['Parkman', 'Morant', 'Cerna', 'Wellman', 'Knaack', 'Deemer']
representatives = ['Mueller','Mayer']
# speakers = []

applicants.loc[applicants.lastname.isin(rejected), 'rejected'] = 1
applicants.loc[applicants.lastname.isin(organizers), 'accepted'] = 1
applicants.loc[applicants.lastname.isin(representatives), 'accepted'] = 1
SOC_idx = list(applicants[(applicants.lastname.isin(organizers))
                          | (applicants.lastname.isin(representatives))].index)
# applicants.loc[applicants.lastname.isin(speakers), 'accepted'] = 1

print("""After pre-selection, we have {} accepted participants from {} applicants. 
      {} are marked rejected/declined.""".format(int(applicants.accepted.sum()), len(applicants), int(applicants.rejected.sum())))

Let's convert some columns to reasonable data types:

In [None]:
applicants["rejected"] = applicants["rejected"].astype("str")
applicants["alrPart"] = applicants["alrPart"].astype("str")
applicants["branch"] = applicants['branch'].astype('str')
applicants['gender'] = applicants['gender'].astype('str')
applicants['startP'] = pd.to_datetime(applicants['startP'])

The columns encode the following information:

* `gender`: The gender identity as stated by the applicant
* `alrPart`: responses to the question _"Have you attended an HGSFP winter school before?"_
* `branch`: The HGSFP branch the applicant is affiliated with
* `startP`: The date of the start of a person's PhD 

### Prepare anonymized table for entrofy
We keep the `applicants` table for later reference. For the steps that follow, we work on a copy from which we drop all fields that could identify a person, as well as fields that entrofy does not use.

In [None]:
preselect_idx = applicants[applicants['accepted'] == 1].index
# 'rejected' was cast to str above, so compare numerically
rejected_idx = applicants[pd.to_numeric(applicants['rejected'], errors='coerce') == 1].index

In [None]:
anonym = applicants.copy(deep=True)

Convert the PhD start dates into the time elapsed until today (in days):


In [None]:
deltaPhd = datetime.datetime.now() - anonym['startP']
# whole days elapsed; casting to 'timedelta64[D]' is rejected by recent pandas versions
anonym['phdDur'] = deltaPhd.dt.days

Drop the columns that are not needed for the optimization procedure:

In [None]:
anonym = anonym.drop(['No','Abstract','Affiliation','Autor','Institute','SuperWork',
                      'ThesisSub', 'TitleOfPoster','address','birth','duty',
                      'email','firstname','lastname','equipment','matrikel',
                      'mobil', 'accepted','Timestamp',
                      'rejected', 'startP', 'motivation','notes'], axis=1)

In [None]:
anonym

----------------------------

## Setting Up Entrofy

Okay, now we're ready to set up entrofy for selection. This will involve lots and lots of dictionaries!

There are two important decisions to make for each category (=column): (1) set its weight and (2) set the relative target fractions for each possible answer within a category.

The weight essentially sets the relative importance of the questions we asked compared to each other. The target fractions decide for each category what fraction of participants should ideally have that characteristic. One example would be a split of, say 0.1 for participants with previous winter school participation, and 0.9 for participants without. 
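
To make the roles of weights and target fractions concrete, here is a toy score. It is not the objective that `entrofy` optimizes. The function `toy_score` and the example selections are assumptions, chosen only to show how a weight scales the contribution of a category and how a target caps the credit given to each answer.

```python
def toy_score(selected, targets, weights):
    """Weighted sum over categories of min(achieved fraction, target fraction)."""
    n = len(selected)
    score = 0.0
    for category, category_targets in targets.items():
        for answer, target in category_targets.items():
            achieved = sum(1 for person in selected if person[category] == answer) / n
            # Credit saturates once the target fraction is reached
            score += weights[category] * min(achieved, target)
    return score

targets = {"alrPart": {"Yes": 0.1, "No": 0.9}}
weights = {"alrPart": 1.0}

# Two made-up selections of ten people each
balanced = [{"alrPart": "No"}] * 9 + [{"alrPart": "Yes"}]
skewed = [{"alrPart": "No"}] * 5 + [{"alrPart": "Yes"}] * 5
print(toy_score(balanced, targets, weights), toy_score(skewed, targets, weights))
```

In this toy picture the selection that matches the 0.1/0.9 split scores higher than the skewed one, and doubling the weight of a category doubles its contribution.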

**Note**: Setting target fractions is the single most important part of the selection procedure. It is here that a discussion about the goals of the workshop is of crucial importance, because those goals will necessarily inform the target fractions to be set.

At first, we set all weights to 1, giving equal weight to all categories:

In [None]:
weights = dict([(c, 1.0) for c in anonym.columns])

`entrofy` works with objects called mappers. These essentially map target fractions to the possible values within a category and contain a lot of information about how the code makes choices. For columns with discrete, unordered responses, we can use the `CategoricalMapper` class to construct mappers. For continuous inputs, there is a `ContinuousMapper` class.

In [None]:
# phdDur is continuous; all other columns hold discrete, unordered answers
datatypes = {
    'phdDur' : 'continuous',
    'alrPart' : 'categorical',
    'branch' :'categorical',
    'gender' :'categorical'
}

In [None]:
mappers = entrofy.core.construct_mappers(anonym, weights, datatypes=datatypes)

`entrofy` has some plotting capabilities. In particular, it can make a corner plot that displays the relative distributions in the input data set, as well as correlations between different input categories.

**Note**: We will plot our data here as an example, but it is generally inadvisable to do this *before* having decided on target fractions, because the targets should be a function of the *goals* of the workshop, rather than the input data set.

In [None]:
fig, axes = entrofy.plotting.plot_triangle(anonym, weights, mappers=mappers)
# fix axis ticks
for axess in axes:
    for a in axess:
        xticks = a.xaxis.get_ticklabels()
        a.xaxis.set_ticklabels(xticks, rotation=90)

## Targets

Now we can define some targets. Each category (e.g. "already participated") has a discrete, finite number of possible outcomes (e.g. "yes" and "no"). The targets define the fraction of participants in the final output set who share the same value (e.g. 10% of participants should be in "yes"). 
The target fractions must sum to at most 1.0 for each category. If they sum to less than one, the algorithm will try to fill each answer to *at least* the given fraction and will ignore that category for the rest of the optimization procedure. The resulting mix of participants in the final set for this category will thus be a combination of the target fractions and the distribution in the input sample, conditioned on the constraints set by the remaining categories.
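
A small sanity check along these lines can catch typos in the target dictionaries before running the optimization. This is a sketch only: `check_targets` is a hypothetical helper and not part of `entrofy`.

```python
import math

def check_targets(targets, tol=1e-9):
    """Raise ValueError if a fraction is negative or the fractions sum to more than 1."""
    if any(t < 0 for t in targets.values()):
        raise ValueError("target fractions must be non-negative")
    total = math.fsum(targets.values())
    if total > 1.0 + tol:
        raise ValueError(f"target fractions sum to {total:.4f} > 1")
    return total

total = check_targets({"Yes": 0.1, "No": 0.9})
partial = check_targets({"Yes": 0.1})  # under-specified targets are allowed
```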

Below, we will go through each category one by one and lay out our reasoning for the categories chosen. The justification for our choices is an abbreviated version of a longer discussion the organizing committee had before starting the selection procedure. We should note at this point that there is no "correct" way to choose target fractions; the target fractions must necessarily always be a function of the objectives and goals of the workshop, as defined by the organizers, and may also depend on how the organizers see the role of the workshop in the larger community.

## Selection Goals
Broadly, the goals we defined for participant selection at the HGSFP Winter School 2019 are the following:
* Enable every HGSFP student to attend one winter school during their PhD:
    * => strongly favor applicants who have not attended an HGSFP winter school before
    * => favor applicants who are further into their PhD (since the clock is ticking...)
* Reflect the student numbers of the different HGSFP branches
* Increase the participation of underrepresented minorities (in our case this translates to an effort for gender equality)

#### HGSFP branch
For the branch attribute, we aim to reflect the overall distribution of branch affiliations.

In [None]:
anonym["branch"].unique()

In [None]:
fig, ax = plt.subplots(1, 1)
entrofy.plotting.plot_distribution(anonym, "branch", ax=ax)
xticks = ax.xaxis.get_ticklabels()
ax.xaxis.set_ticklabels(xticks, rotation=90);

In [None]:
branch_targets = {
    'Fundamental Interactions and Cosmology' : 0.25,
    'Astronomy and Cosmic Physics' : 0.25,
    'Quantum Dynamics and Complex Quantum Systems' : 0.25,
    'Complex Classical Systems' : 0.08333,
    'Mathematical Physics' : 0.08333,
    'Environmental Physics' : 0.08333
        }
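
The fractions above correspond to a 3:1 ratio between the three larger and the three smaller branches (3/12 = 0.25 and 1/12 ≈ 0.08333). The following sketch shows how relative sizes can be turned into target fractions. The relative sizes are illustrative stand-ins, not actual enrolment numbers, and `normalize` is a hypothetical helper.

```python
def normalize(relative_sizes):
    """Scale non-negative relative sizes so that they sum to 1."""
    total = sum(relative_sizes.values())
    return {name: size / total for name, size in relative_sizes.items()}

# Illustrative relative sizes (assumed, not real student counts)
relative_sizes = {
    "Fundamental Interactions and Cosmology": 3,
    "Astronomy and Cosmic Physics": 3,
    "Quantum Dynamics and Complex Quantum Systems": 3,
    "Complex Classical Systems": 1,
    "Mathematical Physics": 1,
    "Environmental Physics": 1,
}
example_targets = normalize(relative_sizes)
```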

In [None]:
mappers["branch"].targets = branch_targets

Since this category is not directly connected to our top-priority requirement of enabling every HGSFP student to attend at least one winter school, we give it a weight of less than 1:

In [None]:
weights["branch"] = 0.7

#### Previous Winter School Attendance

Following from our top requirement, accepting applicants who have attended a winter school before should be an exception. Whether we admit previous attendees at all depends on how oversubscribed the school is. Since oversubscription was not very high, we decided to accept applicants with previous attendance only via the waiting list. We enforce this criterion further below and do not solve for this parameter.

In [None]:
# already participated applicants only for waiting list
mappers["alrPart"].targets["Yes"] = 0.
mappers["alrPart"].targets["No"] = 1.

In [None]:
weights["alrPart"] = 999.

#### Gender Identity

Any social engineering involving gender is necessarily subject to scrutiny. 
Our choices here reflect our beliefs about what we would like the Winter School to be:

* We recognize that minorities are particularly underrepresented in physics, which is reflected in the number of non-male PhD students.
* We also recognize studies showing that diverse groups outperform groups lacking diversity along several axes.
* Representation is important: we believe that minority participants might feel more comfortable participating if they do not feel singled out based on their gender.

Since equal representation of genders cannot be achieved given the input set, we set a target fraction of female participants slightly higher than the corresponding share in the HGSFP and allow a sufficient margin for the option "Don't identify with either".

In [None]:
mappers["gender"].targets = {"Female": 0.4, "Male": 0.5, "Don't identify with either" : 0.1}

#### PhD Duration

We aim to give senior PhD students who have not participated in a Winter School before an advantage in the selection, since they have fewer or no opportunities to re-apply next year.

We have to turn the continuous variable "PhD Duration" into a binned quantity. We do this by dividing the duration into 1-year bins, with a final bin for durations greater than 3 years:

In [None]:
boundaries = [0., 365., 730., 1095., max(anonym.phdDur)]
column_names = ['1st', '2nd', '3rd', '4th']
targets = {'1st' : 0.15, '2nd' : 0.15, '3rd' : 0.30, '4th' : 0.40}
mappers['phdDur'] = entrofy.mappers.ContinuousMapper(anonym['phdDur'], n_out=4,
                                                    boundaries=boundaries, targets=targets,
                                                    column_names=column_names)
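
For intuition, the binning can be reproduced in a few lines of plain Python. This sketch only mirrors the bin edges and labels used above. How `ContinuousMapper` treats values that fall exactly on an edge is not shown in this notebook, so the left-closed convention used here is an assumption.

```python
from bisect import bisect_right

EDGES = [365.0, 730.0, 1095.0]          # inner boundaries in days
LABELS = ["1st", "2nd", "3rd", "4th"]   # year of the PhD

def phd_year(duration_days):
    """Map a PhD duration in days to its year bin; bins are [lower, upper)."""
    return LABELS[bisect_right(EDGES, duration_days)]

# Made-up durations for illustration
durations = [30, 365, 800, 2000]
years = [phd_year(d) for d in durations]
```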

### Running Entrofy

We are now almost ready to run the code. 

Because some categories have the same responses (e.g. "Yes" and "No"), we need to add prefixes to the mappers so that answers that appear in multiple columns get attributed correctly:

In [None]:
for key in mappers.keys():
    mappers[key].prefix = key + "_"
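
The effect of the prefixes can be illustrated without `entrofy`. The sketch below only shows why the prefix makes identical answers distinguishable; how the library uses prefixes internally is not covered here. `prefixed_labels` is a hypothetical helper, and `equipment` serves merely as a second Yes/No column for the example.

```python
def prefixed_labels(columns):
    """Build unique indicator names of the form '<column>_<answer>'."""
    return [f"{col}_{answer}" for col, answers in columns.items() for answer in answers]

# 'equipment' is used here only as a second Yes/No column for illustration
columns = {"alrPart": ["Yes", "No"], "equipment": ["Yes", "No"]}
labels = prefixed_labels(columns)
print(labels)
```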

Exclude applicants outside the SOC (here: organizers and representatives) who participated before:

In [None]:
optout = list(anonym[(anonym.alrPart == "Yes") & (~anonym.index.isin(SOC_idx))].index)
optout

In [None]:
candidates4opt = len(applicants[(applicants.alrPart == "No") | (applicants.index.isin(SOC_idx))])
print("We have {} candidates joining the optimization: these are SOC members plus candidates without previous participation.".format(candidates4opt))

Now we're actually ready to run entrofy. We will select for 52 participants, using the pre-selected candidates as a starting point for the optimization. They are included in the procedure so that their attributes will explicitly count towards the total fractions in each category. 

In [None]:
idx, max_score = entrofy.core.entrofy(anonym, 52, 
                                      pre_selects=preselect_idx,
                                      opt_outs=optout,
                                      mappers=mappers,
                                      weights=weights, seed=20)

In [None]:
max_score

Let's make a data set with just the output set:

In [None]:
# idx holds index labels (not positions), so select with .loc
df_out = anonym.loc[idx]

Here are the distributions of the output set:

In [None]:
fig, ax  = entrofy.plotting.plot_triangle(df_out, weights,
                                          mappers=mappers,
                                          cat_type="violin")
# fix axis ticks
for axes in ax:
    for a in axes:
        xticks = a.xaxis.get_ticklabels()
        a.xaxis.set_ticklabels(xticks, rotation=90)

We can also visualize the results as bar plots for each category. In the following, blue bars represent the fraction of candidates with a particular attribute. Green bars represent the fraction of participants in the output set with that attribute, and dashed black lines show the user-defined targets. This allows easy comparison between the input and output samples, and shows how closely the output set matches the targets.

In [None]:
for c in anonym.columns:
    _, _ = entrofy.plotting.plot_fractions(anonym[c], idx,
                                       c, mappers[c])
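
The numbers behind such a bar plot are simple fractions. A minimal sketch with made-up answers for one category follows; `fractions` is a hypothetical helper, and the pool and selection are not taken from our data.

```python
from collections import Counter

def fractions(values):
    """Fraction of entries per distinct value."""
    counts = Counter(values)
    return {value: n / len(values) for value, n in counts.items()}

# Made-up answers for one category
pool = ["Male"] * 6 + ["Female"] * 4
chosen = ["Male"] * 2 + ["Female"] * 2
targets = {"Female": 0.4, "Male": 0.5}

# One row per answer: input fraction, output fraction, target
for answer, target in targets.items():
    print(answer, fractions(pool)[answer], fractions(chosen)[answer], target)
```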

These are, of course, not the exact numbers, since the data set used in this notebook only resembles the real sample in the aggregate.

### Unblinding
At this point, we finally un-blinded ourselves and printed out the names and e-mail addresses for the accepted sample set so that we could start sending out acceptance e-mails.

Aside from the organizers and representatives, the entire procedure was performed without names, based only on the candidates' responses and on the optimization of the participant set with respect to the goals of our selection.

In [None]:
applicants.loc[idx, 'accepted'] = 1.
accepted = applicants.loc[idx]

In [None]:
len(accepted)

Check if all SOC members are in the accepted list

In [None]:
accepted[accepted.index.isin(SOC_idx)].lastname

Check if there are any non-SOC members who participated before

In [None]:
np.any((accepted.alrPart == 'Yes') & (~accepted.index.isin(SOC_idx)))

In [None]:
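# accepted.to_csv("../data/accepted.csv")
# Reload with the saved index as row labels; cast to str so it matches the index of `anonym`
accepted = pd.read_csv("../data/accepted.csv", index_col=0)
accepted.index = accepted.index.astype(str)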

We also saved the remaining participants in a waitlist file.

In [None]:
waitlist = applicants.loc[~applicants.index.isin(idx)]

In [None]:
waitlist.to_csv("../data/waitlist.csv")

In [None]:
len(waitlist)

Format the list of email addresses so that an email client can use it:

In [None]:
emailAddresses = [s for s in accepted.email.values]
print(', '.join(emailAddresses))

Same for the waitlist:

In [None]:
emailAddressesWaitlist = [s for s in waitlist.email.values]
print(', '.join(emailAddressesWaitlist))

### Email notifications
At this point, all applicants were informed about the outcome and applicants in the 'accepted' list were asked to confirm their attendance within a specified period.

In [None]:
applicants.to_csv('../data/applicants.csv')

We externally marked confirmed/rejected participants.

In [None]:
applicants = pd.read_csv('../data/applicants.csv')

In [None]:
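# The marker may be read back as a number or as a string, so compare numerically
confirmed = applicants[pd.to_numeric(applicants.confirmed, errors='coerce') == 1]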

In [None]:
len(confirmed)

## After the selection

Not all participants accepted our invitation in the first round. At the time of preparing this notebook, it was not yet clear how we would fill the remaining spots. In the following, we demonstrate one option: re-running the algorithm.


To continue the selection procedure as we had done above, we removed those that declined from the set, and re-ran entrofy with those that had accepted as pre-selects. 
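
The bookkeeping for such a second round can be sketched with plain lists. The ids and the helper `second_round_inputs` are made up for illustration and do not appear elsewhere in this notebook.

```python
def second_round_inputs(pool, accepted, declined, n_total):
    """Return (remaining pool, pre-selected ids, number of open spots)."""
    declined = set(declined)
    remaining = [i for i in pool if i not in declined]
    pre_selects = [i for i in accepted if i not in declined]
    open_spots = n_total - len(pre_selects)
    return remaining, pre_selects, open_spots

pool = list(range(10))   # made-up applicant ids
accepted = [0, 1, 2, 3]  # first-round selection
declined = [1, 7]        # one accepted and one waitlisted applicant declined
remaining, pre_selects, open_spots = second_round_inputs(pool, accepted, declined, n_total=4)
```

The remaining pool and the pre-selected ids are what the optimizer receives in the second run; the open spots are filled from the waitlist.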

In the following, we pick randomly from the set of accepted participants, since this is a simulated data set. For the real sample this was of course not the case: there, the IDs corresponded to actual participants.

In [None]:
declined_accepted = np.random.choice(accepted.index, replace=False, size=11)

In [None]:
declined_waitlist = np.random.choice(waitlist.index, replace=False, size=2)

Now we make a new data frame with the set of participants who accepted:

In [None]:
accepted_new = accepted.drop(declined_accepted)

We also drop these from our original data frame, since they no longer matter to our selection procedure:

In [None]:
anonym_new = anonym.drop(np.hstack([declined_accepted, declined_waitlist]))

In [None]:
len(anonym_new)

Let's run entrofy again for new waitlist picks. 

In [None]:
idx_new, max_score_new = entrofy.core.entrofy(anonym_new, 55, 
                                      pre_selects=accepted_new.index,
                                      mappers=mappers,
                                      weights=weights, seed=25)

In [None]:
accepted_secondrun = anonym_new.loc[idx_new]

In [None]:
idx_fromwaitlist = accepted_secondrun.drop(accepted_new.index).index

In [None]:
idx_new

At this point, we would print out the names of the newly selected participants and e-mail those as well.

In [None]:
from_waitlist_secondrun = anonym_new.loc[idx_fromwaitlist]

Let's save the results of this run to file:

In [None]:
from_waitlist_secondrun.to_csv("../data/secondrun_fromwaitlist.csv", sep="\t")

In [None]:
df_out_new = anonym_new.loc[idx_new]

And we can plot the results of our selection again to see whether any categories notably changed: 

In [None]:
for c in mappers.keys():
    _, _ = entrofy.plotting.plot_fractions(anonym_new[c], idx_new,
                                       c, mappers[c])

## Conclusion

This notebook is an attempt to be transparent to our participants as well as those who did not get accepted this year, and as mentioned above, we welcome any feedback about the process as we attempt to learn from our experiences and improve for next year.

Martin Schlecker, January 2019
schlecker@mpia.de