## Mapinator Estimation Routines

The five notebooks in this directory can be used to create a classification and then estimate the value of graduates from each tier.
Each notebook uses data saved to file by earlier steps, so you can create a classification without assembling the data (just read it from file),
and you can estimate values without running a classification.

In logical order, start with:

1. [mike_assemble_data](./mike_assemble_data.ipynb) - this cleans and assembles data using a routine written by James Yu and Silas Kwok
1. [mike_create_classification](./mike_create_classification.ipynb) - creates a classification using code written by James Yu
1. [mike_adjust_adjacency](./mike_adjust_adjacency.ipynb) - adjusts the adjacency matrix to account for bias in sampling (needs MySQL and access to a server).
1. [mike_hacked_estimator](./mike_hacked_estimator.ipynb) - estimates graduate values using the adjusted adjacency matrix (based on code from James Yu)
1. [mike_fine_grained](./mike_fine_grained.ipynb) - an experimental effort to assign types to individual institutions based on the value of their hires (this is unfinished)

## Contents of Estimates and Files

The files in this folder are a mix of JSON files, Julia-encoded files that produce fully formed Julia datatypes when they are read, and an assortment of `.tex` and `.png` files that can be used when writing papers.

### Data assembly

The [mike_assemble_data](./mike_assemble_data.ipynb) notebook creates a clean dataset by calling the API at [support.econjobmarket.org](https://support.econjobmarket.org).  (The link itself doesn't load the data.  The appropriate endpoint is in the notebook file.)

This cleaned dataset is saved as JSON in the file [to_from_by_year.json](./current_estimates_and_files/to_from_by_year.json).  Subsequent notebooks process the data by loading it from this file.

### The Classification Algorithm

The notebook [mike_create_classification](./mike_create_classification.ipynb) runs the classification algorithm.

It begins by setting many of the parameters that are used in all the notebooks.  For example, the number of academic types is set here.  In order of appearance, they are:

1. `show_tier_members` is set to 1 if you want the classification to list all the members of each tier at the end of the notebook output.  You might want to do this when you are experimenting, but generally there are better ways to do that, described later.

2. `NUMBER_OF_TYPES` sets the number of academic types to use in the classification.  Sinks are set separately.

3. `sinks_to_include` manually selects the sinks to be included in the adjacency matrices.  Each element of the vector is the variable name of one of the sets declared in the previous lines.  If a sink is not named in `sinks_to_include`, its data is simply left out; you can also comment out its set declaration, but you don't have to.  No sink name can be added to this vector unless a corresponding set has been initialized with that name.  Treat the sinks created in the previous lines as fixed: declaring a new set and adding its name to `sinks_to_include` should trigger an exception.

4. `algorithm_run_id` is the only other parameter that can be set.  This number is set in the mysql database and is intended to give more details about the algorithm.  Code on the webserver can request information, such as the list of universities in the top tier, by specifying which run of the algorithm it wants.

Finally, a comment on the `unmatched` sink.  It includes all the applicants who ended up getting jobs at places that didn't have an id number at econjobmarket, plus all the people who disappeared from the internet, in the sense that no evidence of any job they had accepted could be found.

Roughly speaking, this list is probably the best available guess about market failures.  An institution that is not listed on econjobmarket is one that has never advertised there and has never been listed by any applicant as their graduating institution.

The best way to describe these institutions is that they do not participate in the international job market.  As such, graduates who take jobs with them are likely not to have received job offers on the international market.

`Teaching Institutions` is a sink that is created automatically.  It consists of all clearly academic institutions (Vassar is an example) that do not produce graduates (which means there is no record of a placement by any graduate from that institution).

The algorithm reads the file `to_from_by_year.json` created by the data assembly notebook.  It does not fetch new data from the API.  This is primarily to keep the data used by all the processing files consistent.

#### Files Created by the Classification algorithm

One of the other notebooks disaggregates the placements to compare individual institutions.  For this reason the algorithm creates two filtered copies of the data.  The algorithm saves a file called `filtered_data.jld`.  When this file is read back into a Julia program it comes out directly as a valid Julia object.  In this case, loading the file produces a Julia Vector with elements of type Any.  The actual elements are placements recorded in JSON, all of them filtered according to the settings described above.  With this method, the `filtered_data` file produces a Julia type that is the same across all notebooks that use it.

The program creates another `.jld` file called `sinks.jld`.  This loads as a Julia Set of institution id numbers.  It is used in `mike_likelihood_ratios.ipynb`.

The next file saved is called `configuration_options.jld`.  It loads as a Julia Dict.  This is an example:
```
Dict{String, Any} with 4 entries:
  "data_loaded"        => DateTime("2024-06-18T18:21:53.103")
  "institution_counts" => [20, 58, 180, 334, 522, 152, 227, 598, 413, 1, 38, 64…
  "algorithm_run_id"   => 5
  "num_years"          => 24
```
The `data_loaded` entry is created by looking at the date on the file itself.  The `num_years` entry is the number of years for which data are included in the `filtered_data` file.

The program creates a `.jld` file called `placement_rates.jld`, which loads as a Julia Matrix of type Int32 and provides the adjacency matrix associated with the best estimate of the classification.  The command `size(placement_rates)` produces a tuple.  The first element is the number of academic tiers plus the number of configured sinks.  The second element is the number of academic tiers.  These parameters are set at the beginning of the create classification notebook, and the matrix dimensions are how they are passed on to other notebooks that use the data.
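
As a rough illustration of how the matrix dimensions carry the configuration, here is a small Python sketch (the notebooks themselves work with Julia objects) that recovers the tier and sink counts from a toy nested list.  The function name `split_dimensions` and the toy counts are assumptions made up for this example; they are not taken from the notebooks.

```python
def split_dimensions(matrix):
    """Return (number_of_tiers, number_of_sinks) for a rows-by-columns matrix.

    Rows are academic tiers followed by sinks; columns are academic tiers only.
    """
    n_rows = len(matrix)
    n_tiers = len(matrix[0])
    if any(len(row) != n_tiers for row in matrix):
        raise ValueError("all rows must have the same length")
    if n_rows < n_tiers:
        raise ValueError("expected at least as many rows as columns")
    return n_tiers, n_rows - n_tiers


# Toy stand-in for placement_rates: 2 tiers plus 2 sinks (made-up counts).
toy_rates = [
    [10, 2],
    [4, 8],
    [3, 5],
    [1, 6],
]
number_of_tiers, number_of_sinks = split_dimensions(toy_rates)
print(number_of_tiers, number_of_sinks)
```

With five academic tiers and seven sinks, the same logic would expect a matrix with 12 rows and 5 columns.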

A file called `id_to_type_api.json` is created.  Reading it in as a JSON file produces an array of dicts.

Finally, the program writes a LaTeX file called `nice_adjacency_table.tex`, which can be read into a LyX or LaTeX file when writing up results.

### Summarizing Data

The notebook [mike_likelihood_ratios](./mike_likelihood_ratios.ipynb) illustrates how the saved data from the classification can be referenced and used.  The only thing it does at the moment is create a file called `likelihood_ratios.json` containing a summary of properties gleaned from the classification.

Here is an example of an entry created by the likelihood_ratios program:
```
Duke University 17 1
tier => 1
placement_ratios => [1.0, 9.845451827189389e-51, 3.0202207413977538e-202, 0.0, 0.0]
hires => [30, 7, 8, 1, 0]
name => Duke University
euclidian => [41.077609472801605, 65.81126854708437, 90.85235750489655, 97.73231057567419, 98.88364168558734]
id => 17
hiring_ratios => [1.0, 249.31085584965248, 7.5449214669479e-6, 6.197428718406091e-35, 0.0]
offer_value => 106.27431420583007
ratios => [1.0, 2.454578021263111e-48, 2.2787328306693214e-207, 0.0, 0.0]
placements => [37, 49, 43, 12, 0, 32, 37, 4, 1, 18, 5, 17]
hiring_value => 37.37897492023241
```
The same entry appears in `likelihood_ratios.json`, encoded as JSON.

The core bits of information included in this report are the `hires` and `placements` vectors.  The `hires` vector lists, ordered by tier, the number of graduates the university has hired from each tier.  For example, Duke has hired 30 tier 1 graduates according to the mapinator data.

The `placements` vector shows the number of its graduates that have been hired by each tier.  The first 5 entries are the numbers hired by tiers 1 through 5.  The last seven entries are graduates that were hired by the following collection of institutions (listed in the same order as they appear in the vector):
```
Public Sector
Private Sector
Postdocs
Lecturers
Unmatched
Other Groups
Teaching Universities
```
The `unmatched` sink records graduates who got jobs at institutions that didn't have id numbers in the econjobmarket database, or who could not be traced on the internet.  The logic of this is that since the graduates registered on econjobmarket, they were trying for jobs on the international job market, but did not get offers.
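
To make the layout of the `placements` vector concrete, the following Python sketch attaches a label to each entry, using the Duke entry shown above.  The helper name `label_placements` and the "Tier k" labels are assumptions for illustration; the sink labels follow the order listed above.

```python
import json

# Sink labels in the order listed above.
SINK_NAMES = [
    "Public Sector",
    "Private Sector",
    "Postdocs",
    "Lecturers",
    "Unmatched",
    "Other Groups",
    "Teaching Universities",
]


def label_placements(placements, number_of_tiers=5):
    """Pair each count with a label: academic tiers first, then sinks."""
    names = [f"Tier {k}" for k in range(1, number_of_tiers + 1)] + SINK_NAMES
    if len(placements) != len(names):
        raise ValueError("placements length does not match tiers plus sinks")
    return dict(zip(names, placements))


# A cut-down version of the example entry, parsed from JSON text.
entry = json.loads(
    '{"name": "Duke University",'
    ' "placements": [37, 49, 43, 12, 0, 32, 37, 4, 1, 18, 5, 17]}'
)
labelled = label_placements(entry["placements"])
print(labelled["Tier 1"], labelled["Unmatched"])
```

Read this way, 37 of the recorded graduates were hired by tier 1 institutions and 18 ended up in the `unmatched` sink.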

The `ratios` property holds likelihood ratios: the likelihood of the institution's hires and placements, evaluated using the estimated rates of each tier, divided by the likelihood under the estimated placement rates of the tier to which the institution is assigned.  The `hiring_ratios` and `placement_ratios` properties do the same thing for hires and placements independently.
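
The idea of a likelihood ratio relative to the assigned tier can be sketched in Python as follows.  This sketch assumes a multinomial likelihood with made-up tier rates; the notebook's actual likelihood may be specified differently, and the function names here are hypothetical.

```python
import math


def log_likelihood(counts, rates):
    """Multinomial log-likelihood of counts under rates.

    The multinomial coefficient is dropped because it cancels in ratios.
    """
    total = 0.0
    for n, p in zip(counts, rates):
        if n == 0:
            continue
        if p == 0:
            return -math.inf  # observed something this tier never does
        total += n * math.log(p)
    return total


def likelihood_ratios(counts, tier_rates, assigned_tier):
    """Likelihood under each tier divided by that of the assigned tier (1-based)."""
    logs = [log_likelihood(counts, rates) for rates in tier_rates]
    base = logs[assigned_tier - 1]
    if base == -math.inf:
        raise ValueError("assigned tier gives the observed counts zero likelihood")
    return [math.exp(value - base) for value in logs]


# Made-up example: two outcome categories, three tiers, institution in tier 1.
toy_counts = [2, 0]
toy_tier_rates = [[0.5, 0.5], [0.25, 0.75], [0.0, 1.0]]
ratios = likelihood_ratios(toy_counts, toy_tier_rates, assigned_tier=1)
print(ratios)
```

By construction the ratio for the assigned tier is exactly 1, which matches the leading `1.0` in the `ratios` vector of the tier 1 example entry above.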

The `euclidian` property (spelled this way in the file) holds the Euclidean distance between the university's hires and placements and the average hires and placements of each of the tiers.  Generally that distance should be lowest for the tier to which the university was assigned by the classification algorithm.
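
A minimal Python sketch of this check is given below.  It assumes that a university's hires and placements are combined into one profile vector and compared with one average profile per tier; the toy profile, the tier averages and the function names are invented for illustration.

```python
import math


def euclidean_distances(profile, tier_averages):
    """Distance from one institution's profile to each tier's average profile."""
    return [math.dist(profile, average) for average in tier_averages]


def nearest_tier(distances):
    """1-based index of the tier with the smallest distance."""
    return distances.index(min(distances)) + 1


# Toy profile and tier averages (made-up numbers).
toy_distances = euclidean_distances([3.0, 4.0], [[0.0, 0.0], [3.0, 4.0]])
print(toy_distances, nearest_tier(toy_distances))

# The distances reported for Duke University in the example entry above.
duke_distances = [
    41.077609472801605,
    65.81126854708437,
    90.85235750489655,
    97.73231057567419,
    98.88364168558734,
]
print(nearest_tier(duke_distances))
```

For the Duke entry the smallest distance is the first one, which agrees with its assignment to tier 1.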

Finally, the `offer_value` and `hiring_value` properties are estimated using the logic that appears in the paper [The Mapinator Classification](https://montoya.econ.ubc.ca/papers/markets/markets.pdf).  We estimated the values of the graduates of each tier using a directed search model.  The `hiring_value` property is just the total value of all the graduates the university hired.  The `offer_value` is similar: it multiplies the estimated mean value of the offers made by each of 11 hiring tiers by the university's own placement vector.  This is a measure of the total value of the graduates that the university produced.
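
Both properties amount to a weighted sum of counts.  The Python sketch below shows that arithmetic for `hiring_value`, using Duke's `hires` vector and a set of hypothetical per-tier graduate values.  The real values come from the directed search model in the paper, so this sketch is not expected to reproduce the reported number.

```python
def total_value(counts, values):
    """Sum of count times value over matching positions."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    return sum(count * value for count, value in zip(counts, values))


# Hires by tier from the example entry above.
hires = [30, 7, 8, 1, 0]

# Hypothetical value of one graduate from each tier (NOT the estimated values).
hypothetical_tier_values = [1.0, 0.5, 0.25, 0.125, 0.0]

hiring_value = total_value(hires, hypothetical_tier_values)
print(hiring_value)
```

The `offer_value` calculation would use the same weighted sum, with the placement vector as the counts and the estimated mean offer values as the weights.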

### Update and Save the Data

The final utility, [mike_fine_grained](mike_fine_grained.ipynb), calculates quantiles for the hiring and placement values in the likelihood ratios file.  This has to be done separately because all the values have to be calculated before the quantiles can be computed.

The `likelihood_ratios.json` file is rewritten to add these quantile values; the new file overwrites the old one.  The results are also saved to a mysql database.
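
The reason for the second pass can be seen in a small Python sketch: a quantile can only be assigned once every value is known.  The sketch assumes the quantile of an institution is the share of institutions whose value is less than or equal to its own; the key name `hiring_quantile`, the function name and the toy entries are hypothetical.

```python
from bisect import bisect_right


def add_quantiles(entries, value_key, quantile_key):
    """Add to each entry the share of entries with a value at or below its own."""
    ordered = sorted(entry[value_key] for entry in entries)
    for entry in entries:
        rank = bisect_right(ordered, entry[value_key])
        entry[quantile_key] = rank / len(ordered)
    return entries


# Toy entries with made-up hiring values.
toy_entries = [
    {"name": "A", "hiring_value": 10.0},
    {"name": "B", "hiring_value": 30.0},
    {"name": "C", "hiring_value": 20.0},
    {"name": "D", "hiring_value": 40.0},
]
add_quantiles(toy_entries, "hiring_value", "hiring_quantile")
for entry in toy_entries:
    print(entry["name"], entry["hiring_quantile"])
```

The same function could be applied a second time with the offer value key to obtain the placement side.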

The notebook also produces figures that show the distributions of hiring and placement values for each tier.