In [1]:
using HTTP, JSON, PrettyTables, JLD, DotEnv, Distributions, LinearAlgebra, Dates, MySQL

In [2]:
cfg = DotEnv.config("../.env")
files_path = cfg["files_path"]
adjacency = load(files_path*"placement_rates.jld")["placement_rates"]
classification_properties = load(files_path*"classification_properties.jld")["classification_properties"]
refresh_mysql_db = true
classification_properties

Dict{String, Any} with 6 entries:
  "data_loaded"              => DateTime("2024-06-18T18:21:53.103")
  "unmatched_index"          => 10
  "number_of_academic_types" => 5
  "institution_counts"       => [20, 58, 180, 334, 522, 152, 227, 598, 413, 1, …
  "algorithm_run_id"         => 5
  "num_years"                => 24

# Likelihood Ratios

The objective is to calculate the probability of each individual university's profiles of hires and placements given the placement rates estimated for each of the tiers in the classification.  The probabilities are then compared across tiers by calculating the ratio of each of these probabilities to the probability they have given the tier in which the university was placed by the classification algorithm.  If a university is classified as tier 1, then the probabililty of its hires and placements should he bigher using the rates estimated for tier 1 than they are using the rates estimated for tier 2.

We'll actually do it for hires and placements separately as well to check how restrictive is the assumption that universities who are the best at placing students are also best at hiring them.


In [3]:
# tier hiring rates
# this is just for illustration - the hiring rate should be per year per university
# adjacency is the matrix returned by the classification algorithm 
phr = zeros(size(adjacency)[1], size(adjacency)[2]);
for i in 1:size(adjacency)[1], j in 1:size(adjacency)[2]
    phr[i,j] = 
    adjacency[i,j]/(classification_properties["num_years"]*classification_properties["institution_counts"][i])
end
phr

12×5 Matrix{Float64}:
  2.4875       0.704167     0.358333     0.0895833    0.00833333
  0.682471     0.612787     0.225575     0.0653736    0.00718391
  0.24838      0.319676     0.228241     0.0337963    0.00787037
  0.0336826    0.0631238    0.0562625    0.051023     0.00411677
  0.0          0.00335249   0.00423052   0.00223499   0.0130109
  0.155702     0.149397     0.078125     0.015625     0.00630482
  0.142621     0.0809471    0.0422173    0.0100954    0.00477239
  0.0059922    0.00905797   0.00940635   0.0039019    0.00188127
  0.00343019   0.00595238   0.00696126   0.0041364    0.00484262
 15.2917      24.9583      31.3333      17.2083      14.2083
  0.230263     0.180921     0.0910088    0.0350877    0.00986842
  0.0155763    0.0244678    0.0272586    0.0103193    0.00447819

The third to last row above aggragates all the placements that ended up in intitution 574 (ocean and crow). 
Each of these placements was either a applicant who never left a trail to show where they were hired, or an applicant
who got a job in an institution that was not included in the econjobmarket list of institutions.

This matrix says that each tier 1 university hired almost 2 and a half graduates every year from other tier 1 universities

These tended to be institutions who never placed an ad on econjobmarket, and who were never mentioned as a graduating institution
by any applicant who registered on econjobmarket.  Basically these were people who failed to get jobs on the international job market.
This is not completely accurate, but no definition of the scope of a market is ever going to be good.

The next computation just does the same thing for placements.  The third to last row

In [4]:
# tier placement rates 
ppr = zeros(size(adjacency)[1], size(adjacency)[2]);
for i in 1:size(adjacency)[1], j in 1:size(adjacency)[2]
    ppr[i,j] = 
    adjacency[i,j]/(classification_properties["num_years"]*classification_properties["institution_counts"][j])
end
ppr

12×5 Matrix{Float64}:
 2.4875     0.242816   0.0398148  0.00536427  0.000319285
 1.97917    0.612787   0.0726852  0.0113523   0.000798212
 2.23542    0.992098   0.228241   0.0182136   0.00271392
 0.5625     0.363506   0.104398   0.051023    0.0026341
 0.0        0.0301724  0.0122685  0.00349301  0.0130109
 1.18333    0.391523   0.0659722  0.00711078  0.00183589
 1.61875    0.31681    0.0532407  0.00686128  0.00207535
 0.179167   0.0933908  0.03125    0.00698603  0.00215517
 0.0708333  0.0423851  0.0159722  0.00511477  0.00383142
 0.764583   0.430316   0.174074   0.051522    0.027219
 0.4375     0.118534   0.019213   0.00399202  0.000718391
 0.5        0.270833   0.0972222  0.0198353   0.00550766

When doing the probability calculations, the number of years over which data is collected is constant for all 
observations.  So whenever we do ratios it will just cancel out.  By recomputing rates to exclude it we can use the poisson
distribution directly for university level probability calculations.

In [5]:
hiring_rates = zeros(size(adjacency)[1], size(adjacency)[2]);
for i in 1:size(adjacency)[1], j in 1:size(adjacency)[2]
    hiring_rates[i,j] = 
    adjacency[i,j]/classification_properties["institution_counts"][i]
end
placement_rates = zeros(size(adjacency)[1], size(adjacency)[2]);
for i in 1:size(adjacency)[1], j in 1:size(adjacency)[2]
    placement_rates[i,j] = 
    adjacency[i,j]/classification_properties["institution_counts"][j]
end

The next task is to load the academic placements and the tier information

In [6]:
#academic_builder = load(files_path*"academic_builder.jld")["academic_builder"];
filtered_data = load(files_path*"filtered_data.jld")["filtered_data"];
sinks = load(files_path*"sinks.jld")["sinks"];
id_to_type_api = JSON.parsefile(files_path*"id_to_type_api.json");


In [7]:
function hiring_outcomes(filtered_data, id_to_type_api, num_types, ejm_id)
    cnt = zeros(Int64,num_types)
    #println("ejm_is is ",typeof(ejm_id), "\n")
    for placement in filtered_data
        if placement["to_institution_id"] ==  ejm_id
            #println(placement["from_institution_id"], " has type ", typeof(placement["from_institution_id"]))
            t = type_lookup(id_to_type_api,placement["from_institution_id"])
            #hack because from_institution_id = 1021 return nulls (Tech University of Brauschweig)
            # and this is the only missing id 1021 is type 5
            if t == 0 
                t = 5
            end
            #println(placement["from_institution_id"], " returns ", t, " which has type ",typeof(t))
            cnt[t] += 1
        end
    end
    return cnt
end

function placement_outcomes(filtered_data, sinks, id_to_type_api, num_types, ejm_id)
    cnt = zeros(Int64, num_types)
    for placement in filtered_data
        if placement["from_institution_id"] ==  string(ejm_id) || placement["from_institution_id"] ==  ejm_id
            #println(placement["to_name"])
            k = placement["to_institution_id"]
            t = type_lookup(id_to_type_api,string(k))
            #println(placement["from_institution_id"], " returns ", t, " which has type ",typeof(t))
            if t > 0
                cnt[t] += 1
            else
                p = placement["to_name"]
                #println("checking ", p)
                n = 1
                for sink in sinks
                    if p in sink
                        cnt[5+n] += 1 
                        break 
                    else
                        n += 1
                    end
                end
            end
        end
    end
    return cnt
end

function type_lookup(id_to_type_api, ejm_id)
    for x in id_to_type_api
        if x["institution_id"] == ejm_id
            return x["type"]
        end
    end
    return 0
end

function to_type_lookup(id_to_type_api, ejm_id)
    for x in id_to_type_api
        if x["institution_id"] == string(ejm_id)
            return x["type"]
        end
    end
    return 0
end

function name_lookup(id_to_type_api, ejm_id)
    for x in id_to_type_api
        if x["institution_id"] == string(ejm_id)
            return x["name"]
        end
    end
end

function institution_lookup(id_to_type_api, ejm_id)
    for x in id_to_type_api
        if x["institution_id"] == string(ejm_id)
            return x
        end
    end
end        

institution_lookup (generic function with 1 method)

In [8]:
all_data = load(files_path*"current_estimates.jld")["all_data"]
expected_offers = load(files_path*"expected_offers.jld")["expected_offers"]
offer_values = []
for j in 1:length(expected_offers)
    if j < classification_properties["unmatched_index"]-1
        push!(offer_values, expected_offers[j])
    elseif j == classification_properties["unmatched_index"]-1
        push!(offer_values, expected_offers[j])
        push!(offer_values, 0.0)
    else
        push!(offer_values,expected_offers[j-1])
    end
end
tier_values = []
for j in 1:5
    push!(tier_values, all_data[j])
end

In [14]:
function likelihood_ratios(filtered_data, sinks, id_to_type_api, hiring_rates, 
        placement_rates, offer_values, tier_values, institution_id)
    """
        compute likelihood ratios by calculating the mutinomial probability of hires and placements using
        the ml poission rates for each of the tiers.  Normally, the tier to which they are assigned should
        have ratio 1, but with small numbers of placements or hires this breaks down
    
        In cases with no recorded hires, rates are set arbirarily to 1 for the tier, zero for everything else
        Not sure yet what happens when placements or hires are small, eg 1 or 2
    """
    
    institution = institution_lookup(id_to_type_api, institution_id)
    r = zeros(size(hiring_rates)[2])
    p = zeros(size(hiring_rates)[2])
    q = zeros(size(hiring_rates)[2])
    euclid = zeros(size(hiring_rates)[2])
    hiring = zeros(size(hiring_rates)[2])
    placing = zeros(size(hiring_rates)[2])
    t = institution["type"]
    name = institution["name"]
    #println(t, " ", name)
    
    # start with hiring
    a = hiring_outcomes(filtered_data, id_to_type_api, 
        size(hiring_rates)[2], parse(Int64, institution["institution_id"]))
    placements = placement_outcomes(filtered_data, sinks, id_to_type_api, 
        size(hiring_rates)[1], institution["institution_id"])
    if iszero(a)
        for k in 1:size(hiring_rates)[2]
            if k == t
                r[k] = 1
                break
            end
        end
        return Dict("name" => name, "id" => institution["institution_id"],
            "tier" => t, "ratios" => r , "hires" => a, "placements" => placements,
            "hiring_value" => 0, "placement_value" => transpose(offer_values)*placements)
    end 
    for i in 1:size(hiring_rates)[2]
        prod = 1
        for j in 1:length(a)
            prod = pdf(Poisson(hiring_rates[i,j]), a[j])*prod
        end
        p[i] = prod
    end
    for j in 1:size(hiring_rates)[2]
        hiring[j]=p[j]/p[t]
    end
    
    for i in 1:size(hiring_rates)[2]
        prod = 1
        for j in 1:length(placements)
            prod = pdf(Poisson(placement_rates[j,i]), placements[j])*prod
        end
        q[i] = prod
    end
    for j in 1:size(hiring_rates)[2]
        placing[j]=q[j]/q[t]
    end
    for j in 1:size(hiring_rates)[2]
        r[j]=(p[j]*q[j])/(p[t]*q[t])
    end
    for i in 1:size(hiring_rates)[2]
        euclid[i] = norm(vcat(a,placements)-vcat(hiring_rates[i,:],placement_rates[:,i]))
    end
    value_of_hires = transpose(tier_values)*a
    value_of_placements = transpose(offer_values)*placements

    return  Dict("name" => name, "id" => institution["institution_id"],
        "tier" => t, "ratios" => r, "hiring_ratios" => hiring,
        "placement_ratios" => placing, "hires" => a, "placements" => placements, "euclidian" => euclid,
        "hiring_value" => value_of_hires, "placement_value" => value_of_placements)
end

likelihood_ratios (generic function with 1 method)

In [10]:
# run various tests
a = likelihood_ratios(filtered_data, sinks, id_to_type_api, hiring_rates, placement_rates,
    offer_values, tier_values, 361
)
#a = hiring_outcomes(filtered_data, id_to_type_api, size(placement_rates)[2], 1765)
#a = placement_outcomes(filtered_data, sinks, id_to_type_api, size(placement_rates)[1], 32)
#id_to_type_api[9]["institution_id"]
#type_lookup(id_to_type_api,"350")

Dict{String, Any} with 11 entries:
  "tier"             => 1
  "placement_ratios" => [1.0, 2.26108, 6.35263e-89, 3.67024e-228, 0.0]
  "hires"            => [37, 23, 10, 2, 0]
  "placement_value"  => 66.5736
  "name"             => "Cornell University"
  "euclidian"        => [59.616, 39.4036, 67.0445, 76.6616, 78.3494]
  "id"               => "361"
  "hiring_ratios"    => [1.0, 0.000912685, 3.71989e-19, 1.26297e-66, 0.0]
  "ratios"           => [1.0, 0.00206365, 2.36311e-107, 0.0, 0.0]
  "placements"       => [17, 26, 40, 9, 0, 13, 24, 4, 2, 22, 6, 16]
  "hiring_value"     => 54.4528

In [15]:


#save the calculations for all institutions
rep = []
for institution in id_to_type_api
    try
        s = likelihood_ratios(filtered_data, sinks, id_to_type_api, hiring_rates,
            placement_rates, offer_values, tier_values, institution["institution_id"])
        push!(rep, s)
    catch e
        rethrow(e)
    end
    
end
  

In [16]:
open(files_path*"likelihood_ratios.json", "w") do f
    write(f, JSON.json(rep))
end;

In [13]:
n = 0
m = 0
# set this to the tier you want to view
tier = 2
# set the next variable to zero to see all universities in the tier
# set it to 1 to see only those whose likelihood ratios don't support their tier assignment
anomalous = 0
for r in rep
    if r["tier"] == tier
        m += 1
        if maximum(r["ratios"]) > r["ratios"][r["tier"]]
            n +=1
            println(r["name"], " ", r["id"], " ", r["tier"])
            println("Anomalous ratio")
            for k in keys(r)
                println(k, " => ", r[k])
            end
            println("\n\n")
        else
            if anomalous == 1
                continue
            end
            println(r["name"], " ", r["id"], " ", r["tier"])
            for k in keys(r)
                println(k, " => ", r[k])
            end
            println("\n\n")
        end
    end
end
println("There were ", n, " anomalies")
println("The proportion ", n/m, " were anomalous") 

Arizona State University 370 2
tier => 2
placement_ratios => [7.597365052077695e-57, 1.0, 1.6841684826066406e-15, 1.0107217114045204e-67, 1.330269284837765e-121]
hires => [17, 24, 6, 2, 0]
placement_value => 25.073260748050433
name => Arizona State University
euclidian => [98.22662571828475, 15.998457862661423, 30.772205672900807, 41.15349946825895, 43.13013226827297]
id => 370
hiring_ratios => [1.1547374874898252e-9, 1.0, 1.2205804747109686e-7, 5.009636910941316e-36, 0.0]
ratios => [8.772962231779202e-66, 1.0, 2.0556631659932653e-22, 5.063348792141861e-103, 0.0]
placements => [2, 10, 22, 5, 0, 2, 5, 1, 0, 13, 2, 13]
hiring_value => 33.425535954436725



Bocconi University 1034 2
tier => 2
placement_ratios => [0.0, 1.0, 7.916917886748075e-50, 1.6457763790526557e-134, 1.5656397299841477e-236]
hires => [12, 28, 11, 7, 1]
placement_value => 40.41215478402265
name => Bocconi University
euclidian => [91.35463863428063, 22.441931939023586, 43.46359623329464, 52.7729832773643, 54.966808820694