# Matching Students to Projects

## Constraints

We impose the following strict requirements:
- every team has a minimum and a maximum size ($\operatorname{MIN\_TEAM\_SIZE}$ and $\operatorname{MAX\_TEAM\_SIZE}$, shared by all projects)
- each team must contain at least one data science skilled student
- each team must contain at least one project management skilled student

We classify a student as data science (resp. project management) skilled based on their data science (resp. project management) score. The thresholds at or above which a student counts as skilled can be set in the parameter section below.

We model this problem as a mixed integer linear program. Let $S$ be the set of students and $P$ be the set of projects. We define $x_{i,j} = 1$ if student $i \in S$ is assigned to project $j \in P$, and $x_{i,j} = 0$ otherwise. Hence, the entries of $x \in \{0,1\}^{S \times P}$ are the decision variables. We clearly need the constraints:
- $\sum_{j \in P} x_{i,j} = 1$ for all $i \in S$, which guarantees that every student is assigned to precisely one project
- $\sum_{i \in S} x_{i,j} \leq \operatorname{MAX\_TEAM\_SIZE}$ for all $j \in P$, which guarantees that every project is assigned at most $\operatorname{MAX\_TEAM\_SIZE}$ students
- $\sum_{i \in S} x_{i,j} \geq \operatorname{MIN\_TEAM\_SIZE}$ for all $j \in P$, which guarantees that every project is assigned at least $\operatorname{MIN\_TEAM\_SIZE}$ students
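
Taken together, these constraints can only be satisfied if the number of students fits into the combined capacity of all projects. The following is a minimal sketch of that necessary check in plain Python. The function name `team_sizes_feasible` is hypothetical and not part of the notebook, and passing the check does not guarantee that the full model (including the skill requirements below) is feasible.

```python
def team_sizes_feasible(n_students, n_projects, min_team_size, max_team_size):
    """Necessary condition for the team-size constraints to admit a solution.

    Every student joins exactly one project and every project needs between
    min_team_size and max_team_size students, so the total head count must
    lie between the combined minimum and the combined maximum.
    """
    return n_projects * min_team_size <= n_students <= n_projects * max_team_size


# Illustrative numbers: 6 projects with team sizes between 3 and 5
print(team_sizes_feasible(20, 6, 3, 5))  # 18 <= 20 <= 30 holds
print(team_sizes_feasible(40, 6, 3, 5))  # 40 > 30, too many students
```

Running such a check before building the model can help distinguish a badly chosen `MIN_TEAM_SIZE` / `MAX_TEAM_SIZE` from a genuinely infeasible matching.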

To model the data science / project management skilled constraints, we define $d_i \in \{0,1\}$ for every $i \in S$, indicating whether student $i$ is data science skilled (in which case $d_i = 1$) or not (in which case $d_i = 0$). Similarly, we define $p_i \in \{0,1\}$ for every $i \in S$, indicating whether student $i$ is project management skilled (in which case $p_i = 1$) or not (in which case $p_i = 0$). Note that $d \in \{0,1\}^S$ and $p \in \{0,1\}^S$ can be obtained from the data provided and the thresholds set. We hence impose inequalities:
- $\sum_{i \in S} x_{i,j} d_i \geq 1$ for all $j \in P$
- $\sum_{i \in S} x_{i,j} p_i \geq 1$ for all $j \in P$
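
To make the strict requirements concrete, the sketch below checks a candidate assignment against all of them in plain Python. It is an illustration only: the function `check_strict_requirements` and the toy data are made up and do not appear in the notebook. Because the assignment is stored as one project per student, the "precisely one project" constraint holds by construction.

```python
def check_strict_requirements(assignment, ds_skilled, pm_skilled, projects,
                              min_team_size, max_team_size):
    """Return a list of violated requirements for a candidate assignment.

    assignment: dict mapping student id -> project name
    ds_skilled, pm_skilled: dicts mapping student id -> 0 or 1 (d_i and p_i)
    """
    violations = []
    for project in projects:
        team = [s for s, p in assignment.items() if p == project]
        if len(team) < min_team_size:
            violations.append(f"{project}: fewer than {min_team_size} students")
        if len(team) > max_team_size:
            violations.append(f"{project}: more than {max_team_size} students")
        if sum(ds_skilled[s] for s in team) < 1:
            violations.append(f"{project}: no data science skilled student")
        if sum(pm_skilled[s] for s in team) < 1:
            violations.append(f"{project}: no project management skilled student")
    return violations


# Toy data (made up for illustration): 4 students, 2 projects
assignment = {0: "A", 1: "A", 2: "B", 3: "B"}
ds_skilled = {0: 1, 1: 0, 2: 0, 3: 0}
pm_skilled = {0: 0, 1: 1, 2: 1, 3: 1}
violations = check_strict_requirements(
    assignment, ds_skilled, pm_skilled, ["A", "B"], 2, 3
)
print(violations)  # project B has no data science skilled student
```

An empty list means that the candidate assignment satisfies every strict requirement.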

## Objective

The goal is to find a matching such that every student is assigned to a project they enjoy (based on the ranking provided by the students) and such that the teams are diverse in terms of gender. Hence, the objective consists of two parts: priority and gender diversity.

### Priority Objective
Each student assigned a score from 1 (most desired project) to 6 (least desired project) to every project. Let $c_{i,j}$ be the score student $i \in S$ assigned to project $j \in P$. To make the students happy, we hence wish to minimise
$$a = \sum_{i \in S} \sum_{j \in P} c_{i,j} x_{i,j}.$$

### Gender Diversity Objective
For every project $j \in P$, we define a gender diversity coefficient $g_j$, which is simply the absolute value of the difference between the number of females and the number of males assigned to that project. Let us define a binary parameter $m_i \in \{0,1\}$ for every student $i \in S$, indicating whether student $i$ is male (then $m_i = 1$) or not (then $m_i = 0$). Then we can compute the gender diversity coefficient as follows:
$$g_j = \left| \sum_{i \in S} x_{i,j} m_i - \sum_{i \in S} x_{i,j} (1 - m_i) \right|.$$
Note that the minuend is simply the number of males assigned to project $j$ and the subtrahend is the number of females assigned to project $j$.

We aim to minimise 
$$b = \sum_{j \in P} g_j.$$

> Note: The above expression for $g_j$ is not linear and hence cannot be used directly in our MIP. We therefore treat $g_j$ as a continuous variable and impose the following, slightly weaker, pair of constraints:
> $$g_j \geq \sum_{i \in S} x_{i,j} m_i - \sum_{i \in S} x_{i,j} (1 - m_i)$$
> $$g_j \geq -\left( \sum_{i \in S} x_{i,j} m_i - \sum_{i \in S} x_{i,j} (1 - m_i) \right)$$
> Since we minimise the sum of the $g_j$, every optimal solution pushes each $g_j$ down to the larger of the two right-hand sides, which is exactly the absolute value. The two formulations therefore have the same optimum.
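
The argument in the note can be checked numerically: for fixed team counts, the smallest value that satisfies both linear constraints is the larger of the two right-hand sides, which coincides with the absolute difference. The helper `smallest_feasible_g` below is a hypothetical name used only for this plain-Python illustration.

```python
def smallest_feasible_g(n_males, n_females):
    """Smallest g_j allowed by the two linear constraints for fixed team counts."""
    difference = n_males - n_females
    # g_j >= difference and g_j >= -difference, so a minimiser
    # settles on the larger of the two lower bounds
    return max(difference, -difference)


# Compare with the non-linear definition on a few made-up team compositions
for n_males, n_females in [(3, 1), (1, 3), (2, 2)]:
    g = smallest_feasible_g(n_males, n_females)
    assert g == abs(n_males - n_females)
    print(n_males, n_females, g)
```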

### The Complete Objective
We hence aim to minimise the combined objective
$$a + b = \sum_{i \in S} \sum_{j \in P} c_{i,j} x_{i,j} + \sum_{j \in P} g_j$$
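
As a worked example, the sketch below evaluates $a$, $b$ and $a + b$ for two candidate assignments in plain Python. The function `combined_objective` and the toy scores are assumptions made for illustration; the notebook itself lets the solver search over all assignments instead of evaluating them one by one.

```python
def combined_objective(assignment, scores, is_male, projects):
    """Evaluate (a, b, a + b) for a candidate assignment.

    assignment: dict student id -> project name
    scores:     dict student id -> dict project name -> score c_{i,j}
                (1 = most desired project)
    is_male:    dict student id -> 0 or 1 (m_i)
    """
    # priority part a: the score each student gave to their assigned project
    a = sum(scores[s][p] for s, p in assignment.items())

    # gender diversity part b: |#males - #females| summed over all projects
    b = 0
    for project in projects:
        team = [s for s, p in assignment.items() if p == project]
        n_males = sum(is_male[s] for s in team)
        n_females = len(team) - n_males
        b += abs(n_males - n_females)
    return a, b, a + b


# Toy data (made up for illustration): 4 students, 2 projects
scores = {
    0: {"A": 1, "B": 2},
    1: {"A": 2, "B": 1},
    2: {"A": 1, "B": 2},
    3: {"A": 2, "B": 1},
}
is_male = {0: 1, 1: 1, 2: 0, 3: 0}
mixed = combined_objective({0: "A", 1: "B", 2: "A", 3: "B"}, scores, is_male, ["A", "B"])
split = combined_objective({0: "A", 1: "A", 2: "B", 3: "B"}, scores, is_male, ["A", "B"])
print(mixed)  # (4, 0, 4)
print(split)  # (6, 4, 10)
```

In this toy instance the mixed teams win on both parts of the objective. In general the two parts can pull in different directions, and the unweighted sum $a + b$ trades one unit of priority score against one unit of gender imbalance.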


In [1]:
import pandas as pd
from mip import Model, MINIMIZE, BINARY, CONTINUOUS, xsum

In [2]:
#### PARAMETER DEFINITION ####

MAX_TEAM_SIZE = 5
MIN_TEAM_SIZE = 3
DS_SKILLED_THRESHOLD = 4
PM_SKILLED_THRESHOLD = 2.5
# NOTE: Project names must coincide with the column names in the matching_input.xlsx table
PROJECTS = ['Helvetas', 'Rega', 'GIZ I', 'GIZ II', 'MSF', 'IMPACT']

In [3]:
#### DATA PREPARATION ####

# read data
students = pd.read_excel('matching_input.xlsx')

# decide whether students are data science / project management skilled (based on thresholds)
students['data_science_skilled'] = 0
students.loc[students['ds_skill'] >= DS_SKILLED_THRESHOLD, 'data_science_skilled'] = 1
students['project_management_skilled'] = 0
students.loc[students['pm_skill'] >= PM_SKILLED_THRESHOLD, 'project_management_skilled'] = 1

# create column is_male
students['is_male'] = 0
students.loc[students['gender'] == 'Male', 'is_male'] = 1

# calculate number of students
number_students = students.shape[0]

students.head(3)

Unnamed: 0,ID,ds_skill,pm_skill,nationality,gender,Helvetas,Rega,GIZ I,GIZ II,MSF,IMPACT,data_science_skilled,project_management_skilled,is_male
0,16,4.666667,3.0,French,Male,3,4,3,4,6,5,1,1,1
1,38,4.25,2.75,Indian,Male,1,5,6,3,4,2,1,1,1
2,27,4.333333,3.0,Italy,Male,3,1,3,3,1,2,1,1,1


In [4]:
#### SET UP MODEL ####

m = Model(name="matching_ip", sense=MINIMIZE)

In [5]:
#### DECLARE DECISION VARIABLES ####

# set up assignment variables, where 
# x[i][p] = 1 if student i is assigned to project p
# x[i][p] = 0 if student i is not assigned to project p
x = {}
for i in range(number_students):
    x[i] = {p: m.add_var(f'x_{i}_{p}', var_type=BINARY) for p in PROJECTS}

# set up diversity coefficient for each project
gender_diversity = {project: m.add_var(f'gd_{project}', lb=0, var_type=CONTINUOUS) for project in PROJECTS}

In [6]:
#### IMPOSE STRICT REQUIREMENTS ####

# every student is assigned to precisely one project
for i in range(number_students):
    m += xsum(x[i][project] for project in PROJECTS) == 1

for project in PROJECTS:
    # each project gets assigned at least MIN_TEAM_SIZE students
    m += xsum(x[i][project] for i in range(number_students)) >= MIN_TEAM_SIZE

    # each project gets assigned at most MAX_TEAM_SIZE students
    m += xsum(x[i][project] for i in range(number_students)) <= MAX_TEAM_SIZE

    # each project must have at least one data science skilled student
    m += xsum(students.loc[i, 'data_science_skilled'] * x[i][project] for i in range(number_students)) >= 1

    # each project must have at least one project management skilled student
    m += xsum(students.loc[i, 'project_management_skilled'] * x[i][project] for i in range(number_students)) >= 1

In [7]:
#### IMPOSE OBJECTIVE FUNCTION ####

# this part of the objective function makes sure students get assigned to a project they like
priority_objective_function = xsum(
    students.iloc[i][project]*x[i][project] for i in range(number_students) for project in PROJECTS
)

# bound the gender diversity coefficient of every project from below by
# |number_of_males - number_of_females| via two linear constraints
# (the constraints must be added to the model with `m +=`, otherwise they have no effect)
for project in PROJECTS:
    number_of_males = xsum(x[i][project] * students.loc[i, 'is_male'] for i in range(number_students))
    number_of_females = xsum(x[i][project] * (1-students.loc[i, 'is_male']) for i in range(number_students))
    m += gender_diversity[project] >= number_of_males - number_of_females
    m += gender_diversity[project] >= number_of_females - number_of_males


# this part of the objective function makes sure that we favour gender-diverse teams
gender_diversity_objective_function = xsum(
    gender_diversity[project] for project in PROJECTS
)


m += priority_objective_function + gender_diversity_objective_function

In [8]:
#### SOLVE THE MIP ####

m.optimize()

<OptimizationStatus.OPTIMAL: 0>

In [9]:
#### WRITE OUTPUT ####

# write assigned project in a new column in the dataframe
students['assigned_project'] = ''
for i in range(number_students):
    for project in PROJECTS:
        # solver values are floats, so compare with a tolerance instead of == 1
        if x[i][project].x >= 0.99:
            students.loc[i, 'assigned_project'] = project

# export to Excel
students.to_excel('matching_output.xlsx')