Sherlock

Sherlock detects potential cheating on multiple-choice exams using statistical analysis.

Key Features

  • Minimal data requirements — no answer key, candidate metadata, or seating charts needed
  • Built-in evaluation — measure detection accuracy against your own exam data before trusting the results

How Well Does It Work?

Performance depends heavily on the structure of the exam and the extent of copying.

Cheating is easier to detect when:

  • Cheaters copy a larger share of answers
  • There are few cheaters relative to the exam population. (Large groups of copiers can resemble a legitimate cluster of high-ability candidates.)

For reference, here is the detection accuracy in a specific realistic scenario:

[Confusion matrix: two cheaters copying 150 of 200 answers, among ~6000 candidates]

See the Evaluate Accuracy section below to measure the model's accuracy in your own setting.

How It Works

Sherlock looks for unusually high numbers of matching answers between test-takers and flags suspicious cases for investigation. While matching answers don't prove cheating occurred, they provide a focused starting point for further review.

Determining what counts as "unusually high" depends on the exam and the test-takers. For instance, easy questions that everyone answers correctly create legitimate matches. The key question is: how likely is it that two candidates of similar ability would naturally give the same answers?

To answer this, Sherlock builds a statistical model from the answer patterns, then simulates how often candidates would have similar answers based purely on their ability and question difficulty. Using these simulations, it calculates the probability of observing the actual degree of similarity and flags cases where this probability falls below a set threshold.
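The core idea can be sketched in plain Python. This is an illustrative toy, not Sherlock's actual implementation: the function names are made up, and chance agreement is modeled as a single per-question match probability rather than a fitted IRT model.

```python
import random

def similarity(a, b, missing="*"):
    """Count questions two candidates answered identically, ignoring missing responses."""
    return sum(1 for x, y in zip(a, b) if x == y and x != missing)

def monte_carlo_p_value(observed, n_items, p_match, n_sims=1000, seed=42):
    """Estimate how often two independent candidates would match at least
    `observed` times, given a per-question chance-match probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        simulated = sum(1 for _ in range(n_items) if rng.random() < p_match)
        if simulated >= observed:
            hits += 1
    return hits / n_sims

# The scenario from above: 150 matches out of 200 questions, when two
# similar-ability candidates would agree on ~35% of answers by chance.
p = monte_carlo_p_value(observed=150, n_items=200, p_match=0.35)
# p is effectively zero, so this pair would be flagged at any reasonable threshold.
```

In the real pipeline, the per-question match probability varies with question difficulty and each candidate's estimated ability, which is what the IRT model provides.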

The analysis does not require an answer key or any candidate metadata. This is helpful when that information isn't available (e.g. you didn't conduct the exam yourself), or when the exam authority wants to limit data access during exam processing.

However, if you do have the answer key, supplying it can improve the underlying statistical model, which in turn may improve detection accuracy.

Current Status

The project is still in its early stages.

Run model diagnostics

Verify that the model can recover latent parameters and predict empirical distributions.

uv --directory backend run python scripts/refresh_synthetic_data.py --preset [preset-name]
uv --directory backend run python scripts/run_model_diagnostics.py [preset-name]

Define sampling distributions at backend/src/analysis_service/synthetic_data/params/. See baseline.yaml for an example.

Analyze an exam

Check whether an exam has any likely cheaters.

docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend
uv --directory backend run python scripts/run_analysis.py [preset-name] [additional-args]

Evaluate accuracy

Measure detection accuracy on synthetic data with known cheaters injected.

From a hypothetical distribution:

uv --directory backend run python scripts/evaluate_accuracy.py --preset [preset-name] [additional-args]

From an existing dataset (matches empirical answer distributions):

uv --directory backend run python scripts/evaluate_accuracy.py --data [csv-path] [additional-args]

Roadmap

  1. Calculate precision/recall under a wide range of simulated distributions.
  2. Develop a desktop application (written in TypeScript in frontend/) for non-technical users.

Getting Started

See CONTRIBUTING.md for a guide to getting your local development environment set up.

Running the Server

From the backend/ directory, install dependencies and start the API server:

uv sync
uv run python -m analysis_service.api

The server starts at http://127.0.0.1:8000 by default.

Docker

Build and run from the backend/ directory:

docker build -t sherlock-backend backend/
docker run -p 8000:8000 sherlock-backend

Configure via environment variables prefixed with SHERLOCK_:

SHERLOCK_HOST=0.0.0.0 SHERLOCK_PORT=9000 uv run python -m analysis_service.api
Variable                      Default    Description
SHERLOCK_HOST                 127.0.0.1  Server bind address
SHERLOCK_PORT                 8000       Server port
SHERLOCK_MAX_CANDIDATES       10000      Max candidates per request
SHERLOCK_MAX_ITEMS            500        Max exam items per request
SHERLOCK_MAX_CONCURRENT_JOBS  4          Max parallel analysis jobs
SHERLOCK_JOB_TTL_SECONDS      3600       How long completed jobs are retained

API Overview

All endpoints live under /api/v1.

Method   Endpoint                   Description
GET      /api/v1/health             Health check
POST     /api/v1/analysis           Submit an analysis job
GET      /api/v1/analysis/{job_id}  Poll job status and results
DELETE   /api/v1/analysis/{job_id}  Cancel a job

Submitting an Analysis

Analysis is asynchronous: you submit a job, then poll for results.

Request — POST /api/v1/analysis

{
  "exam_dataset": {
    "candidate_ids": ["C001", "C002", "C003"],
    "response_matrix": [
      ["A", "B", "C", "D", "B"],
      ["A", "B", "C", "D", "B"],
      ["D", "C", "B", "A", "*"]
    ],
    "n_categories": 4,
    "correct_answers": ["A", "B", "C", "D", "C"],
    "missing_values": ["*"]
  },
  "significance_level": 0.05,
  "num_monte_carlo": 100,
  "random_seed": 42
}
  • response_matrix: each row is a candidate's answers as strings. Use a missing value string (default "*") for unanswered items.
  • n_categories: number of distinct non-missing answer choices (minimum 2).
  • correct_answers (optional): the answer key as strings, used to improve model estimation.
  • missing_values (optional): set of strings treated as missing (default ["*"]).
  • significance_level: p-value cutoff for flagging suspects (default 0.05).
  • num_monte_carlo: number of Monte Carlo simulations (default 100).
  • threshold (optional): if provided, uses a fast fixed-threshold pipeline instead of the full IRT-based analysis.

The response returns a job ID:

{ "job_id": "a1b2c3d4e5f6" }
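The same request can be assembled and submitted from Python. This is hypothetical client code, not part of the repo; it uses only the standard library and assumes a server running at the default address from the section above.

```python
import json
from urllib import request

def build_payload(candidate_ids, response_matrix, n_categories=4,
                  significance_level=0.05, num_monte_carlo=100, random_seed=42):
    """Assemble the POST /api/v1/analysis request body."""
    return {
        "exam_dataset": {
            "candidate_ids": candidate_ids,
            "response_matrix": response_matrix,
            "n_categories": n_categories,
            "missing_values": ["*"],
        },
        "significance_level": significance_level,
        "num_monte_carlo": num_monte_carlo,
        "random_seed": random_seed,
    }

def submit(payload, base_url="http://127.0.0.1:8000"):
    """POST the job and return the job_id from the server's response."""
    req = request.Request(
        base_url + "/api/v1/analysis",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["job_id"]

payload = build_payload(
    candidate_ids=["C001", "C002", "C003"],
    response_matrix=[["A", "B", "C", "D", "B"],
                     ["A", "B", "C", "D", "B"],
                     ["D", "C", "B", "A", "*"]],
)
# job_id = submit(payload)  # requires a running server
```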

Reading Results

Poll GET /api/v1/analysis/{job_id} until status is "completed" or "failed".

{
  "job_id": "a1b2c3d4e5f6",
  "status": "completed",
  "progress": null,
  "result": {
    "suspects": [
      {
        "candidate_id": "C002",
        "detection_threshold": 4.0,
        "observed_similarity": 5,
        "p_value": 0.03
      }
    ],
    "model_version": null,
    "pipeline_type": "automatic"
  },
  "error": null,
  "created_at": "2026-01-28T12:00:00+00:00",
  "completed_at": "2026-01-28T12:00:07+00:00"
}
  • status: one of "pending", "running", "completed", "failed".
  • suspects: candidates flagged for unusually high answer similarity.
    • observed_similarity: the candidate's maximum matching-answer count against any other candidate.
    • detection_threshold: the similarity count above which a candidate is flagged.
    • p_value: probability of observing this similarity by chance (null when using the threshold pipeline).
  • pipeline_type: "automatic" (full IRT-based) or "threshold" (fixed cutoff).
  • While a job is running, the progress field reports the current phase ("similarity", "irt_fitting", "threshold_calibration", or "monte_carlo") and step counts.
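A polling loop along these lines works for any client. This is an illustrative sketch: the `fetch_status` callable stands in for an HTTP GET against the job endpoint, so the loop can be shown (and tested) without a live server.

```python
import time

def poll_job(fetch_status, poll_interval=1.0, timeout=600.0):
    """Call fetch_status() until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_status()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("analysis job did not finish in time")

# Stubbed status sequence standing in for real HTTP responses:
states = iter([
    {"status": "pending"},
    {"status": "running", "progress": {"phase": "monte_carlo"}},
    {"status": "completed", "result": {"suspects": []}},
])
result = poll_job(lambda: next(states), poll_interval=0.0)
```

In real use, `fetch_status` would issue `GET /api/v1/analysis/{job_id}` and decode the JSON body.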

CLI Script

The run_analysis.py script provides a convenient way to submit a CSV file and view results from the command line. Start the server first, then from the backend/ directory:

uv run python scripts/run_analysis.py [path-to-csv]

The CSV must have two columns: candidate_id and answer_string, where each character in answer_string is one question's response. Missing responses are represented by * (configurable via --missing-value).

candidate_id,answer_string
C0001,ABCD*BCA
C0002,ABC*ABCA
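A file in this format expands into the API's response_matrix with a few lines of Python. This helper is illustrative only, not part of the repo:

```python
import csv
import io

def load_answer_csv(text, missing_value="*"):
    """Expand candidate_id/answer_string rows into (ids, response_matrix).

    Each character of answer_string becomes one entry in the candidate's row;
    missing responses stay as the missing-value string.
    """
    reader = csv.DictReader(io.StringIO(text))
    ids, matrix = [], []
    for row in reader:
        ids.append(row["candidate_id"])
        matrix.append(list(row["answer_string"]))
    return ids, matrix

ids, matrix = load_answer_csv(
    "candidate_id,answer_string\nC0001,ABCD*BCA\nC0002,ABC*ABCA\n"
)
```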
Option                   Default                Description
--url                    http://127.0.0.1:8000  Server base URL
--threshold              None                   Use fixed-threshold pipeline
--significance-level     0.05                   p-value cutoff
--num-monte-carlo        100                    Monte Carlo simulations
--num-threshold-samples  100                    Threshold calibration samples
--random-seed            None                   Random seed for reproducibility
--missing-value          *                      String representing missing responses
--output-dir             reports/detection/     Directory for JSON report output
--poll-interval          10.0                   Seconds between status polls

The script displays a progress spinner while the job runs and prints a table of flagged suspects on completion. Full results are saved as JSON to the output directory.
