In [2]:
using HTTP, JSON, PrettyTables, JLD, DotEnv, WordCloud, Random, Dates
institutions = load("institutions.jld")["institutions"]
academic_list = load("academic_list.jld")["academic_list"]
adjacency = load("adjacency_2023.jld")["out"];

This loads the adjacency matrix of placements from 2023 only. To verify how the loaded objects relate to one another, inspect their properties. For example:

In [6]:
println(" Total number of organizations is ", length(institutions),
"\n Total number of academic institutions is ", length(academic_list),
"\n The dimension of the adjacency matrix is ", size(adjacency),
"\n The number of placements analyzed is ", sum(adjacency))

 Total number of organizations is 991
 Total number of academic institutions is 412
 The dimension of the adjacency matrix is (991, 412)
 The number of placements analyzed is 1596


The variables `institutions` and `academic_list` are both Julia `sets`. All elements of a set are distinct (if you try to add an element that is already there, the addition is ignored). Sets have no inherent order; these ones are nevertheless indexed. The set of all institutions is created by concatenating the academic list with a list of all the sinks, so each academic institution keeps the same index in both. The following shows how the indexing is preserved.

In [7]:
println("The 47th element of academic_list is ", academic_list[47],
"\nwhile the 47th element of institutions is ", institutions[47])

The 47th element of academic_list is Cranfield University
while the 47th element of institutions is Cranfield University


The adjacency matrix is constructed so that its 47th row coincides with the 47th element of `institutions`.  The 47th row is 

In [8]:
adjacency[47, :]

412-element Vector{Int32}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

The matrix is sparse (which means most of its cells have 0 in them).  We can count the number of hires by

In [10]:
sum(adjacency[47, :])

0

That means that Cranfield didn't hire in 2023. Since the relation between indices and actual entities is preserved between `institutions` and `academic_list`, column 47 also refers to Cranfield, and Cranfield must have graduated a student in 2023. This computation verifies it:

In [11]:
sum(adjacency[:, 47])

1

This means we can do computations using only matrices, since we don't need the labelling features of DataFrames or dictionaries.
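
As a small illustration of the row and column reading used above, here is a minimal sketch on a made-up matrix. The name `toy` and its entries are hypothetical and are not taken from `adjacency_2023.jld`; rows stand for hiring organizations and columns for graduating institutions, as in `adjacency`.

```julia
# Hypothetical toy data: 4 hiring organizations (rows), 3 academic institutions (columns).
toy = Int32[0 1 0;
            2 0 0;
            0 0 0;
            0 0 1]

hires_by_row     = vec(sum(toy, dims=2))   # placements received by each organization
graduates_by_col = vec(sum(toy, dims=1))   # placements sent out by each academic institution
share_zero       = count(iszero, toy) / length(toy)   # fraction of empty cells

println("hires per organization:     ", hires_by_row)
println("graduates per institution:  ", graduates_by_col)
println("share of zero cells:        ", share_zero)
```

Both sets of totals add up to the same number of placements, which is a cheap consistency check to run on the real matrix as well.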

## Communities

We'll start with the assumption that we have chosen the members of 5 distinct academic communities $\{ c_1 , c_2 , c_3 , c_4, c_5 \}$.  We can write $c_i^j$ to represent the $j^{\text{th}}$ member of community $i$.

Each member of community $i$ interacts with all the $991$ institutions for which we have data.  We'll maintain the assumption here that each academic community coincides with a distinct hiring community.  Given the members we have assigned to each community we have to find the likelihood that we would see the adjacency matrix `adjacency` that we found above. 

This consists of two parts.  First we need to find a vector of placement rates $\lambda$ that describes how the communities are related.  The number of elements of this vector depends on the number of hiring communities we use.  Here we'll use 11: 5 academic hiring communities and 6 sinks from the public and private sector.  This means that for each academic community there are 11 rates at which this community places students with each of the hiring communities.  So in this example, there are 55 placement rates to be estimated.
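
The count of 55 can be laid out as a matrix of rates, one per pair of hiring and academic communities. This is only a sketch of the bookkeeping: the variable names are assumptions, and the zeros are placeholders rather than estimates.

```julia
# Community counts taken from the text; the names are illustrative.
n_academic_communities = 5
n_sinks = 6
n_hiring_communities = n_academic_communities + n_sinks   # 11 hiring communities

# One rate per (hiring community, academic community) pair, oriented like `adjacency`:
# rows are hiring communities, columns are academic communities. Zeros are placeholders.
λ = zeros(n_hiring_communities, n_academic_communities)

println("shape of the rate matrix: ", size(λ))
println("rates to estimate:        ", length(λ))
```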

The placement rates that maximize likelihood are actually very easy to find.  To simplify, suppose we observe 3 members of one community whose placements go to 5 different institutions in another community.  Since all placements are supposed to be independently drawn, this gives a small portion of the adjacency matrix above that looks like
<table><tr><td>3</td><td>0</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>4</td><td>2</td></tr>
<tr><td>1</td><td>3</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>1</td></tr>
</table>
If $\lambda$ is the rate at which we think members of our 3 graduating institutions place graduates at each of our 5 hiring institutions, then we take each element $k$ of the adjacency matrix and compute its probability according to $\lambda$ as
$$
 \frac{e^{-\lambda}\lambda^{k}}{k!}
$$
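
The Poisson probability of a single cell can be sketched directly in Julia. The function name `poisson_pmf` and the rate `λ = 1.0` are illustrative assumptions, not values estimated from the data.

```julia
# Poisson probability of observing k placements in one cell when the rate is λ.
# factorial(k) on an Int overflows past k = 20, which is ample for placement counts.
poisson_pmf(k::Integer, λ::Real) = exp(-λ) * λ^k / factorial(k)

λ = 1.0   # illustrative rate only
for k in 0:4
    println("P(k = $k) = ", poisson_pmf(k, λ))
end
```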

To find the probability of the numbers in the table above, we have to multiply the probabilities of each of the cells.  Let's do it symbolically, with $k_{ij}$ being the number recorded in each cell.  The general formula is
$$
\prod_{i=1}^3 \prod_{j=1}^5 \frac{e^{-\lambda}\lambda^{k_{ij}}}{k_{ij}!}
$$
Since the log transformation is monotonically increasing, we can find the value for $\lambda$ that maximizes this expression by maximizing the log which is given by 
$$
\sum_{i=1}^3 \sum_{j=1}^5 \left( -\lambda+k_{ij}\log(\lambda)-\log(k_{ij}!) \right)
$$
This is 
$$ -15\lambda+\sum_{i=1}^3\sum_{j=1}^5 k_{ij}\log(\lambda)-\sum_{i=1}^3\sum_{j=1}^5\log(k_{ij}!) $$
To maximize this, do the usual: take the derivative with respect to $\lambda$ and set it to zero.  Since $\lambda$ only occurs in the first two terms, the first-order condition is $-15 + \frac{1}{\lambda}\sum_{i=1}^3\sum_{j=1}^5 k_{ij} = 0$, so the best estimate is just
$$
\lambda^\ast = \frac{\sum_{i=1}^3\sum_{j=1}^5 k_{ij}}{15} $$
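
For the table above, this estimate can be checked numerically. The sketch below reads the table as 5 hiring institutions (rows) by 3 graduating institutions (columns); the names `k`, `loglik`, and `λ_star` are illustrative.

```julia
# The 5×3 block from the table: rows are hiring institutions, columns are graduating ones.
k = [3 0 1;
     0 0 1;
     1 4 2;
     1 3 1;
     0 0 1]

# Log-likelihood of the block when every cell shares the same Poisson rate λ.
loglik(λ, k) = sum(-λ .+ k .* log(λ) .- log.(factorial.(k)))

# Closed-form maximizer: total placements divided by the number of cells.
λ_star = sum(k) / length(k)

println("λ* = ", λ_star)
println("log-likelihood at λ*:  ", loglik(λ_star, k))
println("log-likelihood at 1.0: ", loglik(1.0, k))
println("log-likelihood at 1.4: ", loglik(1.4, k))
```

The table holds 18 placements across 15 cells, so the estimate is 18/15, and the log-likelihood at nearby rates should come out lower.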

We'll have to substitute that value back into the likelihood function itself because we are eventually going to choose community members to maximize this likelihood value.
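
A minimal sketch of that substitution for a single block follows. The function name `profile_loglik` and the alternative grouping are hypothetical, and the real computation over all 55 blocks may be organized differently.

```julia
# Maximized Poisson log-likelihood of one block: plug λ* = sum(k) / length(k) back in.
function profile_loglik(k::AbstractMatrix{<:Integer})
    total = sum(k)
    total == 0 && return 0.0          # λ* = 0: an all-zero block has probability 1
    λ_star = total / length(k)
    # Summing -λ* over all cells gives -total; the log(λ*) term is weighted by total.
    return -total + total * log(λ_star) - sum(log.(factorial.(k)))
end

# The block from the table above.
k = [3 0 1;
     0 0 1;
     1 4 2;
     1 3 1;
     0 0 1]

pooled_ll = profile_loglik(k)                                      # one rate for the whole block
split_ll  = profile_loglik(k[:, 1:2]) + profile_loglik(k[:, 3:3])  # hypothetical regrouping of columns

println("pooled block:      ", pooled_ll)
println("columns regrouped: ", split_ll)
```

Different groupings of the same cells give different maximized values, and those values are what a search over community members would compare.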