Sherlock detects potential cheating on multiple-choice exams using statistical analysis.
- Minimal data requirements — no answer key, candidate metadata, or seating charts needed
- Built-in evaluation — measure detection accuracy against your own exam data before trusting the results
Performance depends heavily on the structure of the exam and the extent of copying.
Cheating is easier to detect when:
- Cheaters copy a larger share of answers
- There are few cheaters relative to the exam population. (Large groups of copiers can resemble a legitimate cluster of high-ability candidates.)
For reference, detection accuracy is benchmarked on a specific realistic scenario: two cheaters copying 150 of 200 answers among ~6000 candidates.
See the Evaluate Accuracy section to learn how to measure the model's accuracy in your specific setting.
Sherlock looks for unusually high numbers of matching answers between test-takers and flags suspicious cases for investigation. While matching answers don't prove cheating occurred, they provide a focused starting point for further review.
Determining what counts as "unusually high" depends on the exam and test-takers. For instance, easy questions that everyone answers correctly create legitimate matches. The key question is: What's the likelihood that two candidates of similar ability would naturally answer the same questions identically?
To answer this, Sherlock builds a statistical model from the answer patterns, then simulates how often candidates would have similar answers based purely on their ability and question difficulty. Using these simulations, it calculates the probability of observing the actual degree of similarity and flags cases where this probability falls below a set threshold.
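To make that idea concrete, here is a minimal sketch of the simulate-and-flag logic. This is not Sherlock's actual implementation — the real pipeline fits an IRT model to the answer data — and it assumes known per-item correctness probabilities and uniformly distributed wrong answers:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_null_matches(p_i, p_j, n_choices, n_sims=2000):
    """Simulate how many answers two INDEPENDENT candidates would share.

    p_i, p_j: per-item probabilities that each candidate answers correctly.
    Wrong answers are assumed uniform over the remaining choices -- a
    simplification of the IRT model the real pipeline fits.
    """
    n_items = len(p_i)
    correct_i = rng.random((n_sims, n_items)) < p_i
    correct_j = rng.random((n_sims, n_items)) < p_j
    both_correct = correct_i & correct_j  # both correct => identical answer
    both_wrong = ~correct_i & ~correct_j  # both wrong => match with prob 1/(k-1)
    same_wrong = both_wrong & (rng.random((n_sims, n_items)) < 1 / (n_choices - 1))
    return (both_correct | same_wrong).sum(axis=1)

# Two mid-ability candidates on a 50-item, 4-choice exam.
p = np.full(50, 0.6)
null_matches = simulate_null_matches(p, p, n_choices=4)

observed = 41  # matching answers actually observed for this pair
p_value = np.mean(null_matches >= observed)
print(f"p = {p_value:.4f}")  # flag the pair for review if p is below the threshold
```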
The analysis does not require you to pass in the correct answers or any metadata about candidates. This is helpful when that information isn't available (e.g. you didn't conduct the exam), or when the exam authority wants to limit data access during exam processing.
However, if you do have access to the correct answers, providing them may improve the quality of the underlying statistical model and, in turn, help you detect cheaters with greater accuracy.
The project is still in its early stages.
To verify that the model can recover latent parameters and predict empirical distributions, regenerate the synthetic data and run the diagnostics:

```
uv --directory backend run python scripts/refresh_synthetic_data --preset [preset-name]
uv --directory backend run python scripts/run_model_diagnostics [preset-name]
```

Define sampling distributions at `backend/src/analysis_service/synthetic_data/params/`. See `baseline.yaml` for an example.
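Purely as an illustrative sketch — every field name below is a hypothetical assumption, not the real schema; check `baseline.yaml` for the actual format — a preset might specify distributions along these lines:

```yaml
# Hypothetical preset -- field names are illustrative assumptions,
# not the actual schema. See baseline.yaml for the real format.
num_candidates: 6000
num_items: 200
n_categories: 4
ability:          # latent candidate ability
  distribution: normal
  mean: 0.0
  std: 1.0
difficulty:       # per-item difficulty
  distribution: normal
  mean: 0.0
  std: 1.0
```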
To check whether an exam has any likely cheaters, first start the backend:

```
docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend
```

Then, with the server running, analyze a preset:

```
uv --directory backend run python scripts/run_analysis.py [preset-name] [additional-args]
```

Measure detection accuracy on synthetic data with known cheaters injected.
From a hypothetical distribution:

```
uv --directory backend run python scripts/evaluate_accuracy.py --preset [preset-name] [additional-args]
```

From an existing dataset (matches empirical answer distributions):

```
uv --directory backend run python scripts/evaluate_accuracy.py --data [csv-path] [additional-args]
```

Planned next steps:

- Calculate precision/recall under a wide range of simulated distributions.
- Develop a desktop application (written in TypeScript in `frontend/`) for non-technical users.
See `CONTRIBUTING.md` for a guide to setting up your local development environment.
From the `backend/` directory, install dependencies and start the API server:

```
uv sync
uv run python -m analysis_service.api
```

The server starts at http://127.0.0.1:8000 by default.
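Once the server is up, a quick way to confirm it is responding is the health endpoint (example assumes the default address):

```
curl http://127.0.0.1:8000/api/v1/health
```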
Build and run from the repository root:

```
docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend
```

Configure via environment variables prefixed with `SHERLOCK_`:
```
SHERLOCK_HOST=0.0.0.0 SHERLOCK_PORT=9000 uv run python -m analysis_service.api
```

| Variable | Default | Description |
|---|---|---|
| `SHERLOCK_HOST` | `127.0.0.1` | Server bind address |
| `SHERLOCK_PORT` | `8000` | Server port |
| `SHERLOCK_MAX_CANDIDATES` | `10000` | Max candidates per request |
| `SHERLOCK_MAX_ITEMS` | `500` | Max exam items per request |
| `SHERLOCK_MAX_CONCURRENT_JOBS` | `4` | Max parallel analysis jobs |
| `SHERLOCK_JOB_TTL_SECONDS` | `3600` | How long completed jobs are retained (seconds) |
All endpoints live under `/api/v1`.

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/health` | Health check |
| `POST` | `/api/v1/analysis` | Submit an analysis job |
| `GET` | `/api/v1/analysis/{job_id}` | Poll job status and results |
| `DELETE` | `/api/v1/analysis/{job_id}` | Cancel a job |
Analysis is asynchronous: you submit a job, then poll for results.
Request — POST /api/v1/analysis
```json
{
  "exam_dataset": {
    "candidate_ids": ["C001", "C002", "C003"],
    "response_matrix": [
      ["A", "B", "C", "D", "B"],
      ["A", "B", "C", "D", "B"],
      ["D", "C", "B", "A", "*"]
    ],
    "n_categories": 4,
    "correct_answers": ["A", "B", "C", "D", "C"],
    "missing_values": ["*"]
  },
  "significance_level": 0.05,
  "num_monte_carlo": 100,
  "random_seed": 42
}
```

- `response_matrix`: each row is a candidate's answers as strings. Use a missing-value string (default `"*"`) for unanswered items.
- `n_categories`: number of distinct non-missing answer choices (minimum 2).
- `correct_answers` (optional): the answer key as strings, used to improve model estimation.
- `missing_values` (optional): set of strings treated as missing (default `["*"]`).
- `significance_level`: p-value cutoff for flagging suspects (default `0.05`).
- `num_monte_carlo`: number of Monte Carlo simulations (default `100`).
- `threshold` (optional): if provided, uses a fast fixed-threshold pipeline instead of the full IRT-based analysis.
The response returns a job ID:
{ "job_id": "a1b2c3d4e5f6" }Poll GET /api/v1/analysis/{job_id} until status is "completed" or "failed".
```json
{
  "job_id": "a1b2c3d4e5f6",
  "status": "completed",
  "progress": null,
  "result": {
    "suspects": [
      {
        "candidate_id": "C002",
        "detection_threshold": 4.0,
        "observed_similarity": 5,
        "p_value": 0.03
      }
    ],
    "model_version": null,
    "pipeline_type": "automatic"
  },
  "error": null,
  "created_at": "2026-01-28T12:00:00+00:00",
  "completed_at": "2026-01-28T12:00:07+00:00"
}
```

- `status`: one of `"pending"`, `"running"`, `"completed"`, `"failed"`.
- `suspects`: candidates flagged for unusually high answer similarity.
- `observed_similarity`: the candidate's maximum matching-answer count against any other candidate.
- `detection_threshold`: the similarity count above which a candidate is flagged.
- `p_value`: probability of observing this similarity by chance (`null` when using the threshold pipeline).
- `pipeline_type`: `"automatic"` (full IRT-based) or `"threshold"` (fixed cutoff).
- While a job is running, the `progress` field reports the current phase (`"similarity"`, `"irt_fitting"`, `"threshold_calibration"`, or `"monte_carlo"`) and step counts.
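A minimal Python client for this submit-then-poll flow might look like the following sketch (assuming the server is running at the default address and the third-party `requests` package is installed):

```python
import time

import requests  # third-party: pip install requests

BASE = "http://127.0.0.1:8000/api/v1"

# Submit a job using the request schema documented above.
job = requests.post(f"{BASE}/analysis", json={
    "exam_dataset": {
        "candidate_ids": ["C001", "C002", "C003"],
        "response_matrix": [
            ["A", "B", "C", "D", "B"],
            ["A", "B", "C", "D", "B"],
            ["D", "C", "B", "A", "*"],
        ],
        "n_categories": 4,
    },
    "significance_level": 0.05,
    "num_monte_carlo": 100,
}).json()

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/analysis/{job['job_id']}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)

if status["status"] == "completed":
    for suspect in status["result"]["suspects"]:
        print(suspect["candidate_id"], suspect["p_value"])
else:
    print("analysis failed:", status["error"])
```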
The `run_analysis.py` script provides a convenient way to submit a CSV file and view results from the command line. Start the server first, then from the `backend/` directory:
```
uv run python scripts/run_analysis.py [path-to-csv]
```

The CSV must have two columns, `candidate_id` and `answer_string`, where each character in `answer_string` is one question's response. Missing responses are represented by `*` (configurable via `--missing-value`).
```
candidate_id,answer_string
C0001,ABCD*BCA
C0002,ABC*ABCA
```

| Option | Default | Description |
|---|---|---|
| `--url` | `http://127.0.0.1:8000` | Server base URL |
| `--threshold` | None | Use fixed-threshold pipeline |
| `--significance-level` | `0.05` | p-value cutoff |
| `--num-monte-carlo` | `100` | Monte Carlo simulations |
| `--num-threshold-samples` | `100` | Threshold calibration samples |
| `--random-seed` | None | Random seed for reproducibility |
| `--missing-value` | `*` | String representing missing responses |
| `--output-dir` | `reports/detection/` | Directory for JSON report output |
| `--poll-interval` | `10.0` | Seconds between status polls |
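For example, to run the fixed-threshold pipeline with a reproducible seed (the file name and threshold value below are hypothetical):

```
uv run python scripts/run_analysis.py data.csv --threshold 140 --random-seed 42
```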
The script displays a progress spinner while the job runs and prints a table of flagged suspects on completion. Full results are saved as JSON to the output directory.