Sherlock detects potential cheating on multiple-choice exams using statistical analysis.
- Minimal data requirements — no answer key, candidate metadata, or seating charts needed
- Built-in evaluation — measure detection accuracy against your own exam data before trusting the results
Performance depends heavily on the structure of the exam and the extent of copying.
Cheating is easier to detect when:
- Cheaters copy a larger share of answers
- There are few cheaters relative to the exam population. (Large groups of copiers can resemble a legitimate cluster of high-ability candidates.)
For reference, detection accuracy is benchmarked on a specific realistic scenario: two cheaters copying 150 of 200 answers among ~6000 candidates.
See the Evaluate Accuracy section to learn how to measure the model's accuracy in your specific setting.
Sherlock looks for unusually high numbers of matching answers between test-takers and flags suspicious cases for investigation. While matching answers don't prove cheating occurred, they provide a focused starting point for further review.
Determining what counts as "unusually high" depends on the exam and test-takers. For instance, easy questions that everyone answers correctly create legitimate matches. The key question is: What's the likelihood that two candidates of similar ability would naturally answer the same questions identically?
To answer this, Sherlock builds a statistical model from the answer patterns, then simulates how often candidates would have similar answers based purely on their ability and question difficulty. Using these simulations, it calculates the probability of observing the actual degree of similarity and flags cases where this probability falls below a set threshold.
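To make that idea concrete, here is a minimal sketch of the simulate-and-flag logic. This is not Sherlock's actual implementation — the real pipeline fits an IRT model to the answer data — and it assumes known per-item correctness probabilities and uniformly distributed wrong answers:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_null_matches(p_i, p_j, n_choices, n_sims=2000):
    """Simulate how many answers two INDEPENDENT candidates would share.

    p_i, p_j: per-item probabilities that each candidate answers correctly.
    Wrong answers are assumed uniform over the remaining choices -- a
    simplification of the IRT model the real pipeline fits.
    """
    n_items = len(p_i)
    correct_i = rng.random((n_sims, n_items)) < p_i
    correct_j = rng.random((n_sims, n_items)) < p_j
    both_correct = correct_i & correct_j  # both correct => identical answer
    both_wrong = ~correct_i & ~correct_j  # both wrong => match with prob 1/(k-1)
    same_wrong = both_wrong & (rng.random((n_sims, n_items)) < 1 / (n_choices - 1))
    return (both_correct | same_wrong).sum(axis=1)

# Two mid-ability candidates on a 50-item, 4-choice exam.
p = np.full(50, 0.6)
null_matches = simulate_null_matches(p, p, n_choices=4)

observed = 41  # matching answers actually observed for this pair
p_value = np.mean(null_matches >= observed)
print(f"p = {p_value:.4f}")  # flag the pair for review if p is below the threshold
```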
The analysis does not require you to pass in the correct answers or any metadata about candidates. This is helpful when that information isn't available (e.g. you didn't conduct the exam), or when the exam authority wants to limit data access during exam processing.
However, if you do have access to the correct answers, providing them may improve the quality of the underlying statistical model and, in turn, help you detect cheaters with greater accuracy.
The project is still in its early stages.
To verify that the model can recover latent parameters and predict empirical distributions, regenerate the synthetic data and run the diagnostics:

```
uv --directory backend run python scripts/refresh_synthetic_data --preset [preset-name]
uv --directory backend run python scripts/run_model_diagnostics [preset-name]
```

Define sampling distributions at `backend/src/analysis_service/synthetic_data/params/`. See `baseline.yaml` for an example.
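Purely as an illustrative sketch — every field name below is a hypothetical assumption, not the real schema; check `baseline.yaml` for the actual format — a preset might specify distributions along these lines:

```yaml
# Hypothetical preset -- field names are illustrative assumptions,
# not the actual schema. See baseline.yaml for the real format.
num_candidates: 6000
num_items: 200
n_categories: 4
ability:          # latent candidate ability
  distribution: normal
  mean: 0.0
  std: 1.0
difficulty:       # per-item difficulty
  distribution: normal
  mean: 0.0
  std: 1.0
```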
To check whether an exam has any likely cheaters, first start the backend:

```
docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend
```

Then, with the server running, analyze a preset:

```
uv --directory backend run python scripts/run_analysis.py [preset-name] [additional-args]
```

Measure detection accuracy on synthetic data with known cheaters injected.
From a hypothetical distribution:

```
uv --directory backend run python scripts/evaluate_accuracy.py --preset [preset-name] [additional-args]
```

From an existing dataset (matches empirical answer distributions):

```
uv --directory backend run python scripts/evaluate_accuracy.py --data [csv-path] [additional-args]
```

Planned next steps:

- Calculate precision/recall under a wide range of simulated distributions.
- Develop a desktop application (written in TypeScript in `frontend/`) for non-technical users.
See `CONTRIBUTING.md` for a guide to setting up your local development environment.
From the `backend/` directory, install dependencies and start the API server:

```
uv sync
uv run python -m analysis_service.api
```

The server starts at http://127.0.0.1:8000 by default.
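Once the server is up, a quick way to confirm it is responding is the health endpoint (example assumes the default address):

```
curl http://127.0.0.1:8000/api/v1/health
```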
Build and run from the repository root:

```
docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend
```

Configure via environment variables prefixed with `SHERLOCK_`:
```
SHERLOCK_HOST=0.0.0.0 SHERLOCK_PORT=9000 uv run python -m analysis_service.api
```

| Variable | Default | Description |
|---|---|---|
| `SHERLOCK_HOST` | `127.0.0.1` | Server bind address |
| `SHERLOCK_PORT` | `8000` | Server port |
| `SHERLOCK_MAX_CANDIDATES` | `10000` | Max candidates per request |
| `SHERLOCK_MAX_ITEMS` | `500` | Max exam items per request |
| `SHERLOCK_MAX_CONCURRENT_JOBS` | `4` | Max parallel analysis jobs |
| `SHERLOCK_JOB_TTL_SECONDS` | `3600` | How long completed jobs are retained (seconds) |
All endpoints live under `/api/v1`.

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/v1/health` | Health check |
| `POST` | `/api/v1/analysis` | Submit an analysis job |
| `GET` | `/api/v1/analysis/{job_id}` | Poll job status and results |
| `DELETE` | `/api/v1/analysis/{job_id}` | Cancel a job |
Analysis is asynchronous: you submit a job, then poll for results.
Request — POST /api/v1/analysis
```json
{
  "exam_dataset": {
    "candidate_ids": ["C001", "C002", "C003"],
    "response_matrix": [
      ["A", "B", "C", "D", "B"],
      ["A", "B", "C", "D", "B"],
      ["D", "C", "B", "A", "*"]
    ],
    "n_categories": 4,
    "correct_answers": ["A", "B", "C", "D", "C"],
    "missing_values": ["*"]
  },
  "significance_level": 0.05,
  "num_monte_carlo": 100,
  "random_seed": 42
}
```

- `response_matrix`: each row is a candidate's answers as strings. Use a missing-value string (default `"*"`) for unanswered items.
- `n_categories`: number of distinct non-missing answer choices (minimum 2).
- `correct_answers` (optional): the answer key as strings, used to improve model estimation.
- `missing_values` (optional): set of strings treated as missing (default `["*"]`).
- `significance_level`: p-value cutoff for flagging suspects (default `0.05`).
- `num_monte_carlo`: number of Monte Carlo simulations (default `100`).
- `threshold` (optional): if provided, uses a fast fixed-threshold pipeline instead of the full IRT-based analysis.
The response returns a job ID:
{ "job_id": "a1b2c3d4e5f6" }Poll GET /api/v1/analysis/{job_id} until status is "completed" or "failed".
```json
{
  "job_id": "a1b2c3d4e5f6",
  "status": "completed",
  "progress": null,
  "result": {
    "suspects": [
      {
        "candidate_id": "C002",
        "detection_threshold": 4.0,
        "observed_similarity": 5,
        "p_value": 0.03
      }
    ],
    "model_version": null,
    "pipeline_type": "automatic"
  },
  "error": null,
  "created_at": "2026-01-28T12:00:00+00:00",
  "completed_at": "2026-01-28T12:00:07+00:00"
}
```

- `status`: one of `"pending"`, `"running"`, `"completed"`, `"failed"`.
- `suspects`: candidates flagged for unusually high answer similarity.
- `observed_similarity`: the candidate's maximum matching-answer count against any other candidate.
- `detection_threshold`: the similarity count above which a candidate is flagged.
- `p_value`: probability of observing this similarity by chance (`null` when using the threshold pipeline).
- `pipeline_type`: `"automatic"` (full IRT-based) or `"threshold"` (fixed cutoff).
- While a job is running, the `progress` field reports the current phase (`"similarity"`, `"irt_fitting"`, `"threshold_calibration"`, or `"monte_carlo"`) and step counts.
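A minimal Python client for this submit-then-poll flow might look like the following sketch (assuming the server is running at the default address and the third-party `requests` package is installed):

```python
import time

import requests  # third-party: pip install requests

BASE = "http://127.0.0.1:8000/api/v1"

# Submit a job using the request schema documented above.
job = requests.post(f"{BASE}/analysis", json={
    "exam_dataset": {
        "candidate_ids": ["C001", "C002", "C003"],
        "response_matrix": [
            ["A", "B", "C", "D", "B"],
            ["A", "B", "C", "D", "B"],
            ["D", "C", "B", "A", "*"],
        ],
        "n_categories": 4,
    },
    "significance_level": 0.05,
    "num_monte_carlo": 100,
}).json()

# Poll until the job reaches a terminal state.
while True:
    status = requests.get(f"{BASE}/analysis/{job['job_id']}").json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(10)

if status["status"] == "completed":
    for suspect in status["result"]["suspects"]:
        print(suspect["candidate_id"], suspect["p_value"])
else:
    print("analysis failed:", status["error"])
```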
The `run_analysis.py` script provides a convenient way to submit a CSV file and view results from the command line. Start the server first, then from the `backend/` directory:
```
uv run python scripts/run_analysis.py [path-to-csv]
```

The CSV must have two columns, `candidate_id` and `answer_string`, where each character in `answer_string` is one question's response. Missing responses are represented by `*` (configurable via `--missing-value`).
```
candidate_id,answer_string
C0001,ABCD*BCA
C0002,ABC*ABCA
```

| Option | Default | Description |
|---|---|---|
| `--url` | `http://127.0.0.1:8000` | Server base URL |
| `--threshold` | None | Use fixed-threshold pipeline |
| `--significance-level` | `0.05` | p-value cutoff |
| `--num-monte-carlo` | `100` | Monte Carlo simulations |
| `--num-threshold-samples` | `100` | Threshold calibration samples |
| `--random-seed` | None | Random seed for reproducibility |
| `--missing-value` | `*` | String representing missing responses |
| `--output-dir` | `reports/detection/` | Directory for JSON report output |
| `--poll-interval` | `10.0` | Seconds between status polls |
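For example, to run the fixed-threshold pipeline with a reproducible seed (the file name and threshold value below are hypothetical):

```
uv run python scripts/run_analysis.py data.csv --threshold 140 --random-seed 42
```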
The script displays a progress spinner while the job runs and prints a table of flagged suspects on completion. Full results are saved as JSON to the output directory.