In [2]:
using HTTP, JSON, PrettyTables, JLD, DotEnv,  Random, Dates
institutions = load("institutions.jld")["institutions"]
academic_list = load("academic_list.jld")["academic_list"]
adjacency = load("adjacency_2023.jld")["out"];

This loads the adjacency matrix of placements from 2023 only. To verify how the loaded objects relate to one another, inspect their basic properties. For example:

In [3]:
println(" Total number of organizations is ", length(institutions),
"\n Total number of academic institutions is ", length(academic_list),
"\n The dimension of the adjacency matrix is ", size(adjacency),
"\n The number of placements analyzed is ", sum(adjacency))

 Total number of organizations is 991
 Total number of academic institutions is 412
 The dimension of the adjacency matrix is (991, 412)
 The number of placements analyzed is 1596


The variables `institutions` and `academic_list` are both Julia `sets`.  All elements of a set are distinct (so if you try to add an element that is already there, the set will ignore you).  Sets have no order; however, they are indexed.  The set of all institutions is created by concatenating the academic list with a list of all the sinks.  The following shows how the indexing is preserved.

In [4]:
println("The 47th element of academic_list is ", academic_list[47],
"\nwhile the 47th element of institutions is ", institutions[47])

The 47th element of academic_list is Cranfield University
while the 47th element of institutions is Cranfield University
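
The reason the indices line up can be sketched on toy data. This is a minimal illustration that uses plain vectors as stand-ins (an assumption; the notebook's own objects are loaded from the `.jld` files), with the hypothetical names `toy_academic`, `toy_sinks` and `toy_institutions`:

```julia
# Hypothetical toy data; the real lists are loaded from the .jld files above.
toy_academic = ["Univ A", "Univ B", "Univ C"]
toy_sinks = ["Public sector", "Private sector"]

# Concatenation puts the academic institutions first, so their indices carry over.
toy_institutions = vcat(toy_academic, toy_sinks)

# The first length(toy_academic) entries of the combined list are the academic list itself.
same_prefix = toy_institutions[1:length(toy_academic)] == toy_academic
println(same_prefix)
```

Because the sinks are appended at the end, any index that is valid for the academic list refers to the same institution in the combined list.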


The adjacency matrix is constructed so that its 47th row coincides with the 47th element of `institutions`.  The 47th row is 

In [5]:
adjacency[47, :]

412-element Vector{Int32}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

The matrix is sparse (which means most of its cells have 0 in them).  We can count the number of hires by

In [6]:
sum(adjacency[47, :])

0

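That means that Cranfield didn't hire in 2023.  Cranfield appears in the 2023 data without hiring, so it must have placed a graduate that year.  Since the relation between indices and actual entities is preserved between `institutions` and `academic_list`, column 47 of the matrix also belongs to Cranfield, and its sum counts the graduates Cranfield placed.  This computation verifies it: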

In [7]:
sum(adjacency[:, 47])

1

This means we can do computations using only matrices; we don't need the labelling features of DataFrames or dictionaries.
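
As a small sketch of what matrix-only computation looks like, the row and column sums used above can be taken for every institution at once. The matrix `toy_adjacency` below is made up for illustration and is not part of the data:

```julia
# Hypothetical 4×2 adjacency matrix: rows are hiring institutions,
# columns are graduating universities.
toy_adjacency = Int32[0 1;
                      2 0;
                      0 0;
                      1 3]

# Row sums: how many people each institution hired.
hires_by_institution = vec(sum(toy_adjacency, dims=2))

# Column sums: how many graduates each university placed.
placements_by_university = vec(sum(toy_adjacency, dims=1))

println(hires_by_institution)
println(placements_by_university)
```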

## Communities

We'll start with the assumption that we have chosen the members of 5 distinct academic communities $\{ c_1 , c_2 , c_3 , c_4, c_5\}$.  We can write $c_i^j$ to represent the $j^{\text{th}}$ member of community $i$.

Each member of community $i$ interacts with all the $991$ institutions for which we have data.  We'll maintain the assumption here that each academic community coincides with a distinct hiring community.  Given the members we have assigned to each community we have to find the likelihood that we would see the adjacency matrix `adjacency` that we found above. 

This consists of two parts.  First we need to find a vector of placement rates $\lambda$ that describes how the communities are related.  The number of elements of this vector depends on the number of hiring communities we use.  Here we'll use 11: 5 academic hiring communities and 6 sinks from the public and private sector.  This means that for each academic community there are 11 rates at which this community places students with each of the hiring communities.  So in this example, there are 55 placement rates to be estimated.

The placement rates that maximize likelihood are actually very easy to find.  To simplify, suppose we observe 3 members of one community whose placements go to 5 different institutions in another community.  Since all placements are supposed to be independently drawn, this gives a small portion of the adjacency matrix above that looks like
<table><tr><td>3</td><td>0</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>1</td></tr>
<tr><td>1</td><td>4</td><td>2</td></tr>
<tr><td>1</td><td>3</td><td>1</td></tr>
<tr><td>0</td><td>0</td><td>1</td></tr>
</table>
If $\lambda$ is the rate at which we think members of our 3 graduating institutions place graduates at each of our 5 hiring institutions, then we take each element $k$ of the adjacency matrix and compute its probability according to $\lambda$ as the Poisson probability
$$
 \frac{e^{-\lambda}\lambda^{k}}{k!}
$$
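
The cell probability can be evaluated directly. The helper name `poisson_pmf` is introduced here only for illustration, and the rate 1.2 is an arbitrary value chosen for the example:

```julia
# Poisson probability of observing k placements in one cell when the rate is λ.
poisson_pmf(k::Integer, λ::Real) = exp(-λ) * λ^k / factorial(k)

p0 = poisson_pmf(0, 1.2)   # probability of an empty cell
p3 = poisson_pmf(3, 1.2)   # probability of a cell containing 3 placements

# Summing over many values of k should give a total very close to 1.
total = sum(poisson_pmf(k, 1.2) for k in 0:20)
println((p0, p3, total))
```

The sum stops at 20 because `factorial` on a 64-bit integer overflows beyond that; for small rates the remaining tail is negligible.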

To find the probability of the numbers in the table above, we have to multiply the probabilities of each of the cells.  Let's do it symbolically, with $k_{ij}$ being the number recorded in each cell.  The general formula is
$$
\prod_{i=1}^3 \prod_{j=1}^5 \frac{e^{-\lambda}\lambda^{k_{ij}}}{k_{ij}!}
$$
Since the log transformation is monotonically increasing, we can find the value of $\lambda$ that maximizes this expression by maximizing its log, which is given by
$$
\sum_{i=1}^3 \sum_{j=1}^5\left\{-\lambda+k_{ij}\log(\lambda)-\log(k_{ij}!)\right\}
$$
This is
$$ -15\lambda+\sum_{i=1}^3\sum_{j=1}^5 k_{ij}\log(\lambda)-\sum_{i=1}^3\sum_{j=1}^5\log(k_{ij}!) $$
To maximize this you would do the usual: take the derivative with respect to $\lambda$ and set it to zero.  Since $\lambda$ only occurs in the first two terms, the first-order condition is $-15+\frac{1}{\lambda}\sum_{i=1}^3\sum_{j=1}^5 k_{ij}=0$, which means the best estimate is just
$$
\lambda^\ast = \frac{\sum_{i=1}^3\sum_{j=1}^5 k_{ij}}{15}= \frac{18}{15} $$
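
The closed-form estimate can be checked numerically on the table from the text. This is only a sanity-check sketch: the names `loglik`, `λ_star`, `grid` and `λ_grid` are introduced for illustration, and the coarse grid search is just one simple way to confirm where the maximum lies:

```julia
# The 5×3 table from the text: rows are hiring institutions,
# columns are graduating institutions.
k = [3 0 1;
     0 0 1;
     1 4 2;
     1 3 1;
     0 0 1]

# Log-likelihood of the whole table under a common rate λ.
loglik(λ) = sum(-λ + kij * log(λ) - log(factorial(kij)) for kij in k)

# Closed-form maximizer: total placements divided by the number of cells.
λ_star = sum(k) / length(k)

# A coarse grid search should land on the same value.
grid = 0.05:0.05:3.0
λ_grid = grid[argmax(loglik.(grid))]
println((λ_star, λ_grid))
```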

We'll have to substitute that value back into the likelihood function itself because we are eventually going to choose community members to maximize this likelihood value.

Notice that in this objective, the three graduating institutions represented along the top of the table are all assumed to be members of the same community.  The hiring institutions represented by the rows are all in the same community as well, so if we called these communities $c$  and $c^\prime$  then we could equally write this part of the likelihood function as
$$
\prod_{i\in c} \prod_{j\in c^\prime }\Biggl\{ \frac{e^{-\lambda_{cc^\prime}}\lambda_{cc^\prime}^{k_{ij}}}{k_{ij}!}\Biggr\}
$$
where $\lambda_{cc^\prime}$ is the common placement rate  between the two communities.

To get the probability of the entire sample we should multiply these products together for all the different $(c,c^\prime)$ pairs in our community structure.  If $C$ is the collection of graduating communities and $C^\prime$ is the collection of hiring communities, then the expression would look like
$$
\prod_{c\in C}\prod_{c^\prime \in C^\prime}\Biggl(\prod_{i\in c} \prod_{j\in c^\prime }\Biggl\{ \frac{e^{-\lambda_{cc^\prime}}\lambda_{cc^\prime}^{k_{ij}}}{k_{ij}!}\Biggr\} \Biggr)
$$

The thing to notice about this is that each $k_{ij}$ is just an entry in some cell of the adjacency matrix.  As we do this product across all the communities, the product in the denominator will just be a product of the factorial of each of the cells in the adjacency matrix.  The point is that this is true no matter what community structure we choose.  So when we compare likelihoods for different community structures, we can do so using only the numerators of the likelihoods - the denominators will be the same.
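
Taking logs makes the separation explicit. Written out with the same symbols as the product above, the log of the full likelihood splits into a part that depends on the community structure and a part that does not:
$$
\sum_{c\in C}\sum_{c^\prime \in C^\prime}\sum_{i\in c} \sum_{j\in c^\prime }\Bigl\{ -\lambda_{cc^\prime}+k_{ij}\log(\lambda_{cc^\prime})\Bigr\} \;-\; \sum_{i}\sum_{j}\log(k_{ij}!)
$$
The last double sum runs over every cell of the adjacency matrix exactly once, whichever way the institutions are grouped, so it is a constant when comparing community structures.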

A community structure can be written in a different way.  Let $I$ be the number of hiring institutions, while $A$ is the number of academic institutions. We'll let $\mathcal{A}$ be the adjacency matrix. $k$ is the number of academic tiers while $K$ is the number of hiring tiers.  We'll create a matrix $C$ of dimension $(K,I)$ whose entries are either $0$ or $1$.  The entry in cell $(i,j)$ of this matrix is $1$ if and only if institution $j$ is a member of hiring community $i$.  The row sums indicate the size of each hiring community.  The matrix $C$ describes a community structure for hiring institutions.

The matrix product $C\mathcal{A}$ is a $(K,A)$ matrix whose element $(i,j)$ represents the number of graduates university $j$ placed in hiring tier $i$.

We can do the same thing and represent the academic community structure with an $(A,k)$ matrix $T$ whose $i,j$th element is $1$ if and only if university $i$ is in academic tier $j$.  The matrix product $C\mathcal{A}T$ is the tier-to-tier adjacency matrix whose $i,j$th element is the total number of placements of graduates from tier $j$ who got jobs in tier $i$.  Since the column sums of $T$, say $t$, give the total number of institutions in each graduating tier, while the row sums of $C$, say $c$, give the total number of hiring institutions in each hiring tier, we can reduce the likelihood calculation to an iteration across a $K$ by $k$ matrix, where $K$ is the total number of hiring tiers, while $k$ is the number of graduating tiers.
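
The construction can be sketched on toy data. Everything below is hypothetical: a made-up 5 by 3 adjacency matrix, tier assignments `hiring_tier` and `academic_tier` chosen by hand, and $K = 3$, $k = 2$:

```julia
# Hypothetical toy adjacency matrix: 5 hiring institutions (rows),
# 3 graduating universities (columns).
A = [1 0 2;
     0 1 0;
     3 0 1;
     0 2 0;
     1 1 1]

hiring_tier = [1, 1, 2, 2, 3]   # hiring tier of each institution (K = 3)
academic_tier = [1, 2, 2]       # academic tier of each university (k = 2)
K, k = 3, 2

# C is K×I: entry (tier, inst) is 1 when institution inst is in that hiring tier.
C = zeros(Int, K, size(A, 1))
for (inst, tier) in enumerate(hiring_tier)
    C[tier, inst] = 1
end

# T is A×k: entry (univ, tier) is 1 when university univ is in that academic tier.
T = zeros(Int, size(A, 2), k)
for (univ, tier) in enumerate(academic_tier)
    T[univ, tier] = 1
end

tier_counts = C * A * T          # K×k tier-to-tier placement counts
c = vec(sum(C, dims=2))          # number of institutions in each hiring tier
t = vec(sum(T, dims=1))          # number of universities in each academic tier
println(tier_counts)
```

Compression only regroups placements, so the total of `tier_counts` equals the total of `A`.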

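Referring to our simple example above, the block of cells for graduating tier $j$ and hiring tier $i$ contains $c[i]t[j]$ cells and $(C\mathcal{A}T)[i,j]$ placements, so its estimated rate is $(C\mathcal{A}T)[i,j]/(c[i]t[j])$.  Substituting that rate into the numerator of the log-likelihood, we can do the likelihood calculation for placements by graduating tier $j$ in hiring tier $i$ as
$$
(C\mathcal{A}T)[i,j]\left(\log\left(\frac{(C\mathcal{A}T)[i,j]}{c[i]\,t[j]}\right)-1\right)
$$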
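
To see that the compressed expression agrees with the cell-by-cell sum, here is a check on the 5 by 3 table from the simple example, treated as a single tier-to-tier block. The variable names are illustrative only:

```julia
# The 5×3 table from the earlier example, treated as one block.
block = [3 0 1;
         0 0 1;
         1 4 2;
         1 3 1;
         0 0 1]

m = sum(block)        # total placements in the block
n = length(block)     # number of cells in the block, c[i] * t[j] = 5 * 3
λ_star = m / n        # estimated placement rate for the block

# Cell-by-cell numerator of the log-likelihood, evaluated at λ_star.
cellwise = sum(-λ_star + kij * log(λ_star) for kij in block)

# Compressed form: one term per block instead of one term per cell.
compressed = m * (log(m / n) - 1)
println((cellwise, compressed))
```

The two agree because $-n\lambda^\ast = -m$, so the whole block collapses to a single term that needs only the block total and the block size.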

In [13]:
#create design matrices
C = zeros(Int32, 12, length(institutions));
T = zeros(Int32, length(academic_list), 5);
#get an allocation
est_alloc = load("est_alloc.jld")["est_alloc"];

In [14]:
for i in 1:length(institutions)
    C[est_alloc[i], i] = 1
end

In [15]:
sum(C)

991

In [16]:
for i in 1:length(academic_list)
    T[i, est_alloc[i]] = 1
end

In [18]:
compressed_adjacency = C*adjacency*T

12×5 Matrix{Int32}:
 79  20   0  22   30
  0  42   4   0   63
  0   0   0   0    0
 18   8   8  96   50
 10  28   0  27  145
  2  24  11  40   76
  0  30   6  23   87
 17   0  11  43   10
  0   4   3   1    4
 26  54  76  47   63
  4   8   0  16   24
 17  63  28  41   87

In [19]:
sum(compressed_adjacency)

1596

In [23]:
t = sum(T,dims = 1)

1×5 Matrix{Int64}:
 119  63  128  64  38

In [27]:
c = sum(C, dims=2)

12×1 Matrix{Int64}:
 119
  63
 128
  64
  38
  66
  69
 188
  50
   1
  13
 192

In [30]:
rates = zeros(12,5)
for i in 1:12, j in 1:5
    rates[i,j] = compressed_adjacency[i,j]/(c[i]*t[j])
end
rates

12×5 Matrix{Float64}:
 0.0055787    0.00266773  0.0          0.00288866  0.00663423
 0.0          0.010582    0.000496032  0.0         0.0263158
 0.0          0.0         0.0          0.0         0.0
 0.00236345   0.00198413  0.000976562  0.0234375   0.0205592
 0.00221141   0.0116959   0.0          0.011102    0.100416
 0.000254647  0.00577201  0.00130208   0.0094697   0.030303
 0.0          0.00690131  0.000679348  0.00520833  0.0331808
 0.000759878  0.0         0.000457114  0.0035738   0.00139978
 0.0          0.00126984  0.00046875   0.0003125   0.00210526
 0.218487     0.857143    0.59375      0.734375    1.65789
 0.00258565   0.00976801  0.0          0.0192308   0.048583
 0.000744048  0.00520833  0.00113932   0.00333659  0.0119243

In [33]:
# negative log-likelihood (numerator terms only);
# rates below .001 are floored at .001 to avoid log(0)
l = 0.0
for i in 1:12,j in 1:5
    l +=  -compressed_adjacency[i,j]*(log(max(rates[i,j],.001)) - 1)
end
l

7635.759517257309
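
Since the eventual goal is to compare allocations, the steps above could be wrapped in one function. The following is a sketch under stated assumptions, not the notebook's own implementation: the name `neg_loglik`, its arguments, and the toy data are all hypothetical, the `rate_floor` keyword mirrors the `.001` floor used in the loop above, and skipping empty tiers is an added guard that the original loop does not have:

```julia
# Negative log-likelihood (numerator terms only) for a given tier assignment.
# A: adjacency matrix (hiring institutions × graduating universities).
# hiring_tier / academic_tier: tier index of each row / column of A.
function neg_loglik(A, hiring_tier, academic_tier, K, k; rate_floor=0.001)
    C = zeros(Int, K, size(A, 1))
    for (inst, tier) in enumerate(hiring_tier)
        C[tier, inst] = 1
    end
    T = zeros(Int, size(A, 2), k)
    for (univ, tier) in enumerate(academic_tier)
        T[univ, tier] = 1
    end
    counts = C * A * T
    cells = sum(C, dims=2) * sum(T, dims=1)   # K×k matrix of block sizes c[i] * t[j]
    total = 0.0
    for i in 1:K, j in 1:k
        cells[i, j] == 0 && continue          # added guard: skip empty tiers
        rate = counts[i, j] / cells[i, j]
        total += -counts[i, j] * (log(max(rate, rate_floor)) - 1)
    end
    return total
end

# Hypothetical toy data: compare two hiring-tier allocations.
A = [1 0 2; 0 1 0; 3 0 1; 0 2 0; 1 1 1]
nll_a = neg_loglik(A, [1, 1, 2, 2, 3], [1, 2, 2], 3, 2)
nll_b = neg_loglik(A, [1, 2, 1, 2, 3], [1, 2, 2], 3, 2)
println((nll_a, nll_b))
```

Because the sign is flipped, a smaller value means a higher likelihood, so the allocation with the lower result fits the toy data better.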