# Stochastic Classification of Economics Departments by Placement Rates

James Yu, 16 November 2021

In [1]:
using Distributions, JSON
# may need to run import Pkg; Pkg.add("name_of_each_package") if using for the first time

This notebook contains an algorithm for classifying Economics departments into a series of types based on placement outcome data.

The placement data are stored as a dictionary keyed by the year in which each applicant started applying, with each year holding the outcomes of that year's applicants. In total, the data span 2003 to 2021. The procedure for compiling the data is provided in a separate Python file.
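As a rough illustration of this layout, the sketch below builds a tiny dictionary with the same nesting (year, then applicant id, then outcome). The field names are the ones the code in this notebook reads; the applicant ids, institution names, position title "Economist", and the `recruiter_type` value 1 are hypothetical placeholders, not values taken from the data.

```julia
# Hypothetical records illustrating the assumed shape: year => applicant id => outcome.
# All ids, names, and the recruiter_type value 1 are made up for illustration.
toy_year_by_year = Dict(
    "2015" => Dict(
        "a1" => Dict("position_name" => "Assistant Professor",
                     "from_institution_name" => "University A",
                     "to_name" => "University B",
                     "recruiter_type" => 1),
        "a2" => Dict("position_name" => "Economist",
                     "from_institution_name" => "University A",
                     "to_name" => "Central Bank C",
                     "recruiter_type" => 5),
    ),
    "2016" => Dict(
        "a3" => Dict("position_name" => "Assistant Professor",
                     "from_institution_name" => "University B",
                     "to_name" => "University A",
                     "recruiter_type" => 1),
    ),
)

# Years are string keys, so sorting is lexicographic; this matches numeric
# order only because every key has four digits.
toy_years = sort(collect(keys(toy_year_by_year)))

# Count the outcomes recorded in each year.
toy_counts = Dict(y => length(toy_year_by_year[y]) for y in toy_years)
```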

In [2]:
year_by_year = JSON.parsefile("to_from_by_year.json")
sort(collect(keys(year_by_year)))

19-element Vector{String}:
 "2003"
 "2004"
 "2005"
 "2006"
 "2007"
 "2008"
 "2009"
 "2010"
 "2011"
 "2012"
 "2013"
 "2014"
 "2015"
 "2016"
 "2017"
 "2018"
 "2019"
 "2020"
 "2021"

For our estimation, we consider only placements from 2015 to 2020, as the number of reported graduates on EconJobMarket was the most stable and even over this period. We put all placement outcomes for Assistant Professor positions in one set, and all other placement outcomes in another set. The other placement outcomes are those where applicants were hired by institutions that do not graduate Ph.D. students; this includes the public sector, the private sector, government, non-academic positions at academic departments, and teaching universities, among others.

In [3]:
i = 0

academic_builder = Set{}()
sink_builder = Set{}()

for key in 2015:2020
    data = year_by_year[string(key)]
    println(key, " has ", length(keys(data)), " entries")
    for aid in keys(data)
        outcome = data[aid]
        if outcome["position_name"] == "Assistant Professor"
            push!(academic_builder, outcome)
        else
            push!(sink_builder, outcome)
        end
    end
end
println(length(academic_builder), " total assistant professor outcomes")
println(length(sink_builder), " other outcomes")

2015 has 512 entries
2016 has 1145 entries
2017 has 837 entries
2018 has 912 entries
2019 has 1002 entries
2020 has 847 entries
3608 total assistant professor outcomes
1647 other outcomes


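This piece of code deals with teaching universities. A hiring institution is treated as a teaching university if it never appears as the graduating institution of any Assistant Professor outcome, meaning it hired Ph.D. graduates but is never observed graduating one: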

In [4]:
academic = Set{}()
academic_to = Set{}()
for outcome in academic_builder
    push!(academic, outcome["from_institution_name"])
    push!(academic_to, outcome["to_name"])
end

tch_sink = Set{}() # sink of teaching universities that do not graduate PhDs
for key in academic_to
    if !(key in academic)
        push!(tch_sink, key)
    end
end

println(length(academic))
println(length(academic_to))
println(length(tch_sink))

361
801
494
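The loop that fills `tch_sink` is a set difference. A minimal sketch on hypothetical institution names, using `setdiff` from Julia's standard library:

```julia
# Toy sets standing in for `academic` (graduating institutions) and
# `academic_to` (hiring institutions); the names are hypothetical.
toy_academic = Set(["University A", "University B"])
toy_academic_to = Set(["University A", "College C", "College D"])

# Hiring institutions that never appear as a graduating institution.
toy_tch_sink = setdiff(toy_academic_to, toy_academic)
```

Applied to the counts printed above, 801 hiring institutions minus 494 teaching sinks leaves 307 institutions that both hire and graduate Ph.D. students.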


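The next piece of code keeps only the sink outcomes whose graduating institution is in the academic set, then sorts the hiring institutions (except teaching universities, which are dealt with above) into academic, government, and private-sector sinks: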

In [5]:
filter_sink_builder = Set{}()
for outcome in sink_builder
    if outcome["from_institution_name"] in academic
        push!(filter_sink_builder, outcome)
    end
end

acd_sink = Set{}()
gov_sink = Set{}()
pri_sink = Set{}()

for outcome in filter_sink_builder
    # label each sink by its recruiter_type so that names stay distinct across categories
    if outcome["recruiter_type"] in [6, 7]
        # private sector: for and not for profit
        push!(pri_sink, string(outcome["to_name"], " (private sector)"))
    elseif outcome["recruiter_type"] == 5
        # government institution
        push!(gov_sink, string(outcome["to_name"], " (public sector)"))
    else
        # everything else including terminal academic positions
        push!(acd_sink, string(outcome["to_name"], " (academic sink)"))
    end
end

println(length(acd_sink))
println(length(gov_sink))
println(length(pri_sink))

205
91
102
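The same labelling rule is needed again when the placement matrix is built, so it could be factored into a small helper. The sketch below is one way to do that; `sink_label` is a hypothetical name that does not appear in the original notebook, and the toy outcomes are made up.

```julia
# Hypothetical helper (not in the original notebook) that maps an outcome
# to its labelled sink name, using the same recruiter_type codes as above.
function sink_label(outcome)
    rtype = outcome["recruiter_type"]
    suffix = if rtype in (6, 7)
        " (private sector)"      # for-profit and not-for-profit
    elseif rtype == 5
        " (public sector)"       # government institution
    else
        " (academic sink)"       # everything else
    end
    return string(outcome["to_name"], suffix)
end

# Toy outcome with a made-up institution name.
toy_outcome = Dict("to_name" => "Central Bank C", "recruiter_type" => 5)
toy_label = sink_label(toy_outcome)
```

With such a helper, the classification and the matrix construction would call one function instead of repeating the `if`/`elseif` chain.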


Now that we have five sets, one for each category of department, we can construct a matrix representing the placements between these departments. Rows correspond to hiring institutions and columns to graduating departments:

In [6]:
institutions = vcat(collect(academic), collect(acd_sink), collect(gov_sink), collect(pri_sink), collect(tch_sink))

out = zeros(UInt8, length(institutions), length(collect(academic)))
i = 0
for outcome in academic_builder
    i += 1
    out[findfirst(isequal(outcome["to_name"]), institutions), findfirst(isequal(outcome["from_institution_name"]), institutions)] += 1
end
for outcome in filter_sink_builder
    i += 1
    keycheck = ""
    if outcome["recruiter_type"] in [6, 7]
        keycheck = string(outcome["to_name"], " (private sector)")
    elseif outcome["recruiter_type"] == 5
        keycheck = string(outcome["to_name"], " (public sector)")
    else
        keycheck = string(outcome["to_name"], " (academic sink)")
    end
    out[findfirst(isequal(keycheck), institutions), findfirst(isequal(outcome["from_institution_name"]), institutions)] += 1
end
println("Total number of outcomes: ", i)

Total number of outcomes: 5156
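To make the layout of the matrix concrete, here is a small sketch on hypothetical data. It uses a name-to-index dictionary in place of repeated `findfirst` calls, which is an alternative and not the approach of the cell above; all names and placements are made up.

```julia
# Toy institution list: graduating departments first, then sinks, as in `institutions`.
toy_institutions = ["University A", "University B", "Central Bank C (public sector)"]
toy_n_academic = 2

# Name => index lookup, built once.
toy_index = Dict(name => k for (k, name) in enumerate(toy_institutions))

# (hiring institution, graduating institution) pairs; hypothetical data.
toy_placements = [
    ("University B", "University A"),
    ("University B", "University A"),
    ("Central Bank C (public sector)", "University B"),
]

# Rows are hiring institutions, columns are graduating departments.
toy_out = zeros(Int, length(toy_institutions), toy_n_academic)
for (to, from) in toy_placements
    toy_out[toy_index[to], toy_index[from]] += 1
end
```

The sketch stores counts as `Int`. A `UInt8` matrix holds at most 255 per cell, which is enough only as long as no pair of institutions exceeds that many placements.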


We now turn to the estimator. We assume that the observed number of placements between any pair of departments is drawn from a distribution common to the "type" of the hiring department and the "type" of the graduating department. Here this distribution is assumed to be Poisson, in line with classical stochastic block models used for similar estimations in Karrer and Newman (2011) and Peixoto (2014).

Given a particular assignment of departments to types, and given the placement outcomes, a single round of estimation computes the mean number of applicants from any single type $t$ department that would be hired at a single type $t^\prime$ department and measures the probability that each independent observation was drawn from its corresponding mean. When summed together, the logarithms of the probabilities form a log-likelihood which can be used for maximum likelihood estimation.
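In symbols, the log-likelihood described above can be written as follows. The notation is introduced here for illustration and is not taken from the source: $A_{ij}$ is the number of graduates of department $j$ hired by institution $i$, $\tau(i)$ is the type assigned to institution $i$, and $\lambda_{t^\prime t}$ is the Poisson rate for hires by a type $t^\prime$ institution from a type $t$ department.

$$
\ln \mathcal{L} = \sum_{i} \sum_{j} \left[ A_{ij} \ln \lambda_{\tau(i)\tau(j)} - \lambda_{\tau(i)\tau(j)} - \ln\left(A_{ij}!\right) \right]
$$

The estimator returns the negative of this sum, so a lower value corresponds to a better fit.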

In [7]:
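function bucket_estimate(assign::Array{UInt8}, A::Matrix{UInt8}, num::UInt8)
    # assign[k] is the type of institution k; rows of A are hiring institutions,
    # columns of A are graduating departments (the first size(A, 2) institutions)
    b = zeros(UInt8, size(A, 1), size(A, 2)) # block index of each (i, j) pair
    T = zeros(Int64, num + 1, num)            # placements summed within each block
    count = zeros(num + 1, num)               # pairs per block (tallied, not used below)
    for i in 1:size(A, 1), j in 1:size(A, 2)
        # column-major linear index into T: row = hiring type, column = graduating type
        @inbounds val = (num + 1) * (assign[j] - 1) + assign[i]
        @inbounds b[i, j] = val
        @inbounds T[val] += A[i, j]
        @inbounds count[val] += 1
    end
    # negative log-likelihood, using each block's entry of T as the Poisson rate
    L = 0.0
    for i in eachindex(A)
        @inbounds L += logpdf(Poisson(T[b[i]]), A[i])
    end
    return -L, T
end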

bucket_estimate (generic function with 1 method)
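The cell above tallies the number of department pairs in each block (`count`) but passes the block total `T` to `Poisson`. If the per-pair mean described in the text is wanted as the rate, one possibility is to divide each block total by its pair count. The sketch below shows that variant on toy data; `block_means` and `poisson_logpdf` are hypothetical names, the code avoids external packages, and it is not the function used to produce the results in this notebook.

```julia
# Hypothetical variant (not the notebook's function): per-pair block means.
# assign[i] is the type of institution i; the first size(A, 2) institutions
# are the graduating departments, as in the placement matrix.
function block_means(assign::Vector{Int}, A::Matrix{Int}, num::Int)
    totals = zeros(Int, num + 1, num)   # placements summed within each block
    pairs = zeros(Int, num + 1, num)    # number of (hiring, graduating) pairs per block
    for i in axes(A, 1), j in axes(A, 2)
        totals[assign[i], assign[j]] += A[i, j]
        pairs[assign[i], assign[j]] += 1
    end
    # Empty blocks have a total of zero, so dividing by 1 gives a mean of 0.0.
    return totals ./ max.(pairs, 1)
end

# Poisson log-probability of observing x when the rate is lambda (lambda > 0).
poisson_logpdf(x, lambda) = x * log(lambda) - lambda - sum(log, 1:x; init = 0.0)

# Toy data: 3 institutions (2 graduating departments plus 1 sink), 1 academic type.
toy_A = [0 2; 1 0; 3 1]
toy_assign = [1, 1, 2]      # type 2 = num + 1 is the sink type
toy_means = block_means(toy_assign, toy_A, 1)

# Log-likelihood of the toy data under the per-pair means.
toy_ll = sum(poisson_logpdf(toy_A[i, j], toy_means[toy_assign[i], toy_assign[j]])
             for i in axes(toy_A, 1), j in axes(toy_A, 2))
```

In the toy example the academic block has a total of 3 placements over 4 pairs and the sink block has 4 placements over 2 pairs, so the two means are 0.75 and 2.0.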

Finally, we compute the maximum-likelihood estimated Poisson means by stochastically re-allocating departments to types and saving likelihood-improving re-allocations until no further re-allocations are found.

In [8]:
function doit(sample, academic_institutions, asink, gsink, psink, tsink, all_institutions, num)
    # some initial states
    current_allocation = Array{UInt8}(undef, length(all_institutions))
    cur_objective = Inf
    best_mat = nothing
    cursor = 1
    for inst in academic_institutions
        current_allocation[cursor] = rand(1:num)
        cursor += 1
    end
    # sinks are never reallocated: every sink category is pinned to type num + 1
    # (this was built to support several sink types, but here we only use one)
    for key in asink
        current_allocation[cursor] = num + 1
        cursor += 1
    end
    for key in gsink
        current_allocation[cursor] = num + 1
        cursor += 1
    end
    for key in psink
        current_allocation[cursor] = num + 1
        cursor += 1
    end
    for key in tsink
        current_allocation[cursor] = num + 1
        cursor += 1
    end
    blankcount = 0

    # BEGIN MONTE CARLO REALLOCATION ROUTINE
    while true
        # attempt to reallocate between 1 and 3 academic institutions to a random spot
        temp_allocation = copy(current_allocation)
        @simd for k in rand(1:length(academic_institutions), rand(1:3))
            @inbounds temp_allocation[k] = rand(delete!(Set(1:num), temp_allocation[k]))
        end
        # check if the new assignment is better
        test_objective, estimated_means = bucket_estimate(temp_allocation, sample, num)
        if test_objective < cur_objective
            print(test_objective, " ")
            blankcount = 0
            cur_objective = test_objective
            best_mat = estimated_means
            current_allocation = temp_allocation
        else
            blankcount += 1
            if blankcount % 1000 == 0
                print(blankcount, " ")
            end
        end
        if blankcount == 30000
            return cur_objective, best_mat, current_allocation
        end
    end
end

SELECT_COUNT = 4 # four type allocation used here
est_obj, est_mat, est_alloc = doit(out, collect(academic), collect(acd_sink), collect(gov_sink), collect(pri_sink), collect(tch_sink), institutions, UInt8(SELECT_COUNT))

2.4118297217959332e8 2.40543574561001e8 2.4047169235279968e8 2.401042366743375e8 2.3951076378860226e8 2.3926799517481914e8 2.3896726379694253e8 2.3886911578678766e8 2.3801897645823237e8 2.3777572127469677e8 2.3768105528893906e8 2.3766602930881116e8 2.3757798206901884e8 2.3754131873905963e8 2.3749386379074585e8 2.373921223705648e8 2.3735171566464347e8 2.3733277741148934e8 2.3716486983169702e8 2.3700171837897235e8 2.3692693492904973e8 2.367264995513212e8 2.3652730430570006e8 2.3634492713820848e8 2.3626205650970742e8 2.360923320415815e8 2.3586368019704893e8 2.3583650213257694e8 2.3540410763038445e8 2.353513254878959e8 2.352766382105916e8 2.3481703504745877e8 2.347590390792922e8 2.3457610298275772e8 2.3425462019436213e8 2.3421274132697544e8 2.342064510039025e8 2.3405058046109077e8 2.3401366485793886e8 2.3338939102030227e8 2.3335670276644993e8 2.3302621904883555e8 2.3300075917361316e8 2.314682954006086e8 2.3138809904369104e8 2.3087683738034913e8 2.3081443451544783e8 2.3071048766202402e8 2.3

(1.5290250195126066e8, [118 163 41 363; 66 158 20 327; … ; 31 74 10 460; 433 757 165 1369], UInt8[0x04, 0x03, 0x01, 0x02, 0x03, 0x02, 0x03, 0x02, 0x01, 0x03  …  0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05])
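The proposal step inside the loop can be read in isolation. The sketch below restates it as a standalone function on a toy allocation; `propose` and its `rng` keyword are hypothetical additions for reproducibility and are not part of the original routine.

```julia
using Random

# Hypothetical standalone version of the proposal step: move between one and
# three randomly chosen academic departments to a different type.
function propose(alloc::Vector{Int}, n_academic::Int, num::Int; rng = Random.default_rng())
    new_alloc = copy(alloc)
    for k in rand(rng, 1:n_academic, rand(rng, 1:3))
        # Draw uniformly from the types other than the department's current one.
        new_alloc[k] = rand(rng, setdiff(1:num, new_alloc[k]))
    end
    return new_alloc
end

toy_alloc = [1, 2, 3, 4, 5, 5]          # four academic departments, two sinks (type 5)
toy_proposal = propose(toy_alloc, 4, 4; rng = MersenneTwister(1))
```

Indices are drawn with replacement, so the same department can be picked more than once in a single proposal. Because the routine only accepts improvements and stops after 30000 consecutive rejections, the result is a local optimum that may depend on the random starting allocation; running it from several seeds and keeping the lowest objective is a common precaution.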

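The estimated matrix `T` returned by the estimator is shown below. Rows index the type of the hiring department (the last row is the sink type), columns index the type of the graduating department, and each entry is the value used as the Poisson rate for that pair of types: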

In [9]:
est_mat

5×4 Matrix{Int64}:
 118  163   41   363
  66  158   20   327
  84  168   94   255
  31   74   10   460
 433  757  165  1369
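One way to read this matrix is by column: for each graduating type, what share of its placements goes to each hiring type. The sketch below copies the printed matrix by hand and computes those shares; the variable names are introduced here for illustration.

```julia
# The 5×4 matrix printed above, copied by hand.
toy_est_mat = [118 163  41  363;
                66 158  20  327;
                84 168  94  255;
                31  74  10  460;
               433 757 165 1369]

# Total placements from each graduating type (column sums).
col_totals = sum(toy_est_mat, dims = 1)

# Share of each graduating type's placements going to each hiring type.
shares = toy_est_mat ./ col_totals
```

The entries sum to 5156, which matches the total number of outcomes counted when the placement matrix was built.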

The type allocation associated with these means is:

In [10]:
for j in 1:SELECT_COUNT
    println("TYPE ", j)
    println()
    i = 1
    for entry in est_alloc
        if entry == j
            println(institutions[i])
        end
        i += 1
    end
    println()
    println()
end

TYPE 1

University of Massachusetts, Amherst
Peking University
University of Lausanne (Université de Lausanne)
Georgetown University
Hong Kong Baptist University
Simon Fraser University
Colorado State University
Université catholique de Louvain
Stellenbosch University
London Business School
University of Amsterdam (Universiteit van Amsterdam)
George Mason University
Concordia University
University of Kentucky
Auburn University
ETH Zurich (Swiss Federal Institute of Technology; Eidgenössische Technische Hochschule)
Trinity College Dublin, University of Dublin
University of Edinburgh
University of Bonn (Rheinische Friedrich-Wilhelms-Universität Bonn)
University of California, San Diego
Western University
Freie Universität Berlin
Australian National University
Ecole Polytechnique (IP Paris)
University of Copenhagen (Københavns Universitet)
Carleton University
Université Laval
University of Nebraska, Lincoln
Katholieke Universiteit Leuven (KU Leuven)
University of North Carolina, Chapel Hi

The numerical form of the type allocation, to be used for estimating the deep parameters, is:

In [11]:
print(est_alloc)

UInt8[0x04, 0x03, 0x01, 0x02, 0x03, 0x02, 0x03, 0x02, 0x01, 0x03, 0x01, 0x01, 0x03, 0x03, 0x03, 0x02, 0x03, 0x03, 0x03, 0x03, 0x02, 0x01, 0x01, 0x01, 0x01, 0x01, 0x03, 0x03, 0x01, 0x01, 0x03, 0x03, 0x04, 0x01, 0x03, 0x03, 0x03, 0x01, 0x01, 0x03, 0x03, 0x01, 0x03, 0x03, 0x03, 0x01, 0x01, 0x01, 0x03, 0x03, 0x03, 0x03, 0x03, 0x01, 0x04, 0x03, 0x02, 0x03, 0x02, 0x04, 0x01, 0x01, 0x04, 0x03, 0x03, 0x03, 0x04, 0x01, 0x02, 0x02, 0x01, 0x03, 0x01, 0x01, 0x01, 0x03, 0x03, 0x01, 0x03, 0x03, 0x04, 0x02, 0x04, 0x04, 0x03, 0x03, 0x03, 0x02, 0x03, 0x04, 0x01, 0x01, 0x01, 0x01, 0x03, 0x03, 0x01, 0x03, 0x02, 0x01, 0x03, 0x04, 0x03, 0x01, 0x03, 0x03, 0x03, 0x03, 0x03, 0x01, 0x03, 0x03, 0x04, 0x03, 0x03, 0x01, 0x01, 0x03, 0x03, 0x01, 0x03, 0x03, 0x03, 0x03, 0x03, 0x01, 0x01, 0x03, 0x03, 0x03, 0x03, 0x03, 0x02, 0x04, 0x03, 0x02, 0x02, 0x04, 0x03, 0x03, 0x03, 0x04, 0x03, 0x02, 0x04, 0x02, 0x01, 0x03, 0x03, 0x03, 0x02, 0x03, 0x02, 0x03, 0x03, 0x03, 0x03, 0x02, 0x03, 0x03, 0x01, 0x04, 0x01, 0x01, 0x03, 0x03
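A quick sanity check on an allocation is to tally how many departments fall into each type. The sketch below does this for the first ten entries printed above only, since the full vector is not reproduced here; the variable names are hypothetical.

```julia
# First ten entries of the allocation printed above (the full vector is longer).
toy_alloc_head = UInt8[0x04, 0x03, 0x01, 0x02, 0x03, 0x02, 0x03, 0x02, 0x01, 0x03]

# Number of departments assigned to each type among these entries.
type_counts = Dict(t => count(==(t), toy_alloc_head) for t in sort(unique(toy_alloc_head)))
```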

## References

Karrer, B., and M. E. J. Newman (2011): "Stochastic blockmodels and community structure in networks," Physical Review E, 83(1).

Peixoto, T. (2014): "Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models," Physical Review E, 89(1).