kenrinzero/ai-code-audit-taxonomy


AI Code Audit Taxonomy

Browse the taxonomy site

A curated taxonomy of code defect patterns that AI assistants produce with distinctive frequency, form, or mechanism in Python. Each entry documents the defective shape, explains why language models produce it, links to real-world incidents, and provides mechanical detection cues. The current release covers 24 patterns — the set where evidence met the inclusion rule, not a target count.

The patterns are AI-amplified, not AI-exclusive. Most are ordinary defects — swallowed exceptions, off-by-one errors, missing timeouts — that human programmers also write. What makes them worth cataloging separately is that AI-generated code produces them at characteristic densities, in characteristic forms, and through mechanisms tied to how language models generate code. The honest claim is AI-shaped, not AI-only.

Why this exists

AI coding assistants produce correct code most of the time. The difficult part is staying alert across hundreds of correct outputs and catching the broken one. Knowing the characteristic shapes to watch for makes the difference.

This taxonomy is a reference for anyone who reads AI-generated code: reviewers, auditors, developers using AI assistants, or teams building review workflows. It is organized around two questions:

  1. What does the defect look like? Detection cues you can grep for or spot on visual scan.
  2. Why does the model produce it? Mechanism explanations that connect the defect to how language models work — not just what to look for, but why it keeps appearing.

Each entry also documents false-positive shapes — what looks like the pattern but isn't — so you can triage confidently rather than over-flag.

How to use this

If you review AI-generated code: scan the Detection cues sections. Most cues are grep-able; examples include except Exception: pass, requests.get( without timeout=, and logger.info(f". Start with the entries rated difficulty: low, which have the highest signal-to-effort ratio.
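
As a rough illustration of what "grep-able" can mean in practice, the three cues named above can be turned into regular expressions and run over a file's text. This is a minimal sketch, not the taxonomy's own tooling: the `CUES` table, the regexes, and the `scan` function are all assumptions made for this example, and the regexes are deliberately naive (for instance, nested parentheses inside a `requests.get(` call will confuse the timeout check).

```python
import re

# Hypothetical cue table: the slugs come from the taxonomy's entry names,
# but the regexes are illustrative guesses, not the entries' own cues.
CUES = {
    "swallowed-exceptions": re.compile(r"except\s+Exception\s*:\s*pass\b"),
    "missing-network-timeout": re.compile(r"requests\.get\((?![^)]*timeout=)"),
    "f-string-in-logger-call": re.compile(r"logger\.\w+\(\s*f[\"']"),
}


def scan(source: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, cue_name) pairs for every regex hit."""
    hits = []
    for name, pattern in CUES.items():
        for match in pattern.finditer(source):
            # Line number = newlines before the match start, plus one.
            line_no = source.count("\n", 0, match.start()) + 1
            hits.append((line_no, name))
    return sorted(hits)


# Sample text to scan; it is only read as a string, never executed.
SAMPLE = '''\
import logging
import requests

logger = logging.getLogger(__name__)

def fetch(url):
    try:
        return requests.get(url)
    except Exception:
        pass

def fetch_ok(url):
    logger.info("fetching %s", url)
    return requests.get(url, timeout=5)

def report(n):
    logger.info(f"got {n} rows")
'''

for line_no, cue in scan(SAMPLE):
    print(f"line {line_no}: {cue}")
```

Regex hits are candidates, not verdicts: each one still needs a check against the entry's false-positive shapes before it is flagged.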

If you build or configure AI coding tools: the Mechanism sections explain why each pattern recurs. The cross-cutting note codified-guidance-is-insufficient documents why CLAUDE.md / AGENTS.md conventions alone don't prevent these — enforcement via lint rules and CI is the cure.

If you maintain ruff, bandit, or similar linters: many entries map to existing rules (ruff BLE001, B006, G004, SIM115, PLC0415, RUF029; bandit B101, B113, B202, B602, B608). The taxonomy documents the AI-amplified densities at which these rules fire and the false-positive shapes that help triage.
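
One way to turn the rule IDs listed above into enforcement is a small gate script that a CI job could call. The sketch below is an assumption about how such a gate might look, not something this repository ships: `build_commands` and `run_gate` are hypothetical names, and depending on the installed ruff and bandit versions some rule IDs may be unavailable or may require preview mode.

```python
import shutil
import subprocess

# Rule IDs copied from the paragraph above; everything else is illustrative.
RUFF_RULES = ["BLE001", "B006", "G004", "SIM115", "PLC0415", "RUF029"]
BANDIT_TESTS = ["B101", "B113", "B202", "B602", "B608"]


def build_commands(path: str) -> list[list[str]]:
    """Build one ruff and one bandit command restricted to the listed rules."""
    return [
        ["ruff", "check", "--select", ",".join(RUFF_RULES), path],
        ["bandit", "-r", path, "-t", ",".join(BANDIT_TESTS)],
    ]


def run_gate(path: str) -> int:
    """Run each installed linter; return how many exited with a non-zero status."""
    failures = 0
    for cmd in build_commands(path):
        if shutil.which(cmd[0]) is None:
            print(f"skipping {cmd[0]}: not installed")
            continue
        result = subprocess.run(cmd, check=False)
        if result.returncode != 0:
            failures += 1
    return failures
```

A CI step could then fail the build when `run_gate("src")` returns a non-zero count, where `"src"` stands in for whatever directory the project actually uses.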

How language models generate code — a brief primer

The Mechanism sections use a small vocabulary from how language models work. Three ideas cover most of it.

Token-level prediction. A language model writes code one small piece at a time, in roughly word-sized chunks called tokens, choosing each next token from the most likely continuations of what it has produced so far. It commits as it goes; it does not draft a whole function and then refine.

The training corpus. The body of text the model learned from: Stack Overflow answers, tutorials, GitHub repositories, documentation, blog posts. The shapes that occur most frequently per token in the training corpus are the ones the model produces most fluently. Under-represented shapes are under-produced even when they are the right answer for the situation.

Local attention. The model decides each next token mostly from the surrounding code — recent lines, the current function signature, nearby imports — not by re-reading the whole file or project. Conventions documented elsewhere (style guides, lint rules, project docs) sit outside this local window unless they are explicitly pulled in.

Most defects in the taxonomy come from these three forces together: the corpus-fluent shape wins the per-token decision, and consistency that would require looking outside the local window is not enforced at generation time. No further ML background is needed.
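
The interaction of these forces can be caricatured with a toy bigram model. This is a deliberately tiny sketch under loud assumptions: real language models are neural networks that condition on far more context and usually sample rather than always taking the top choice, and the four-sequence `CORPUS` below is invented for illustration. It shows only the shape of the argument: the most frequent continuation wins each step, and a one-token window cannot see what came earlier.

```python
from collections import Counter, defaultdict

# Invented toy corpus: the swallowing shape is over-represented 3 to 1.
CORPUS = [
    ["except", "Exception", ":", "pass"],
    ["except", "Exception", ":", "pass"],
    ["except", "Exception", ":", "pass"],
    ["except", "ValueError", ":", "raise"],
]


def train_bigrams(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for seq in corpus:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_tokens=10):
    """Greedy generation: always append the most frequent next token."""
    out = [start]
    while len(out) < max_tokens and out[-1] in counts:
        nxt, _ = counts[out[-1]].most_common(1)[0]
        out.append(nxt)
    return out


counts = train_bigrams(CORPUS)
print(generate(counts, "except"))      # the corpus-fluent shape wins
print(generate(counts, "ValueError"))  # the window no longer sees "ValueError" at ":"
```

In the second call the model has already emitted `ValueError`, which the corpus only ever pairs with `raise`, yet it still ends in `pass`: at the `:` token the one-token window contains nothing that distinguishes the two cases.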

The entries

By surface (category)

| Category | Entries |
| --- | --- |
| error-handling | swallowed-exceptions, inconsistent-error-handling, brittle-error-detection |
| structure | near-identical-siblings, unjustified-lazy-import |
| control-flow | off-by-one, swapped-args |
| security | string-built-sql, shell-true-subprocess-injection, tarfile-extractall-without-filter |
| observability | print-instead-of-logging, f-string-in-logger-call |
| reliability | missing-network-timeout, resource-leak-no-context-manager |
| async | async-await-mismatch, sleep-based-synchronization |
| language-pitfall | mutable-default-arguments, assert-for-runtime-validation |
| configuration | hardcoded-config-values |
| consistency | convention-drift |
| documentation | narrating-comments |
| testing | weak-test-assertion |
| defensive-programming | unreachable-defensive-guard |
| library-usage | wrong-tool-for-job |
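
The entries themselves live on the taxonomy site and are not reproduced here. As a hedged illustration of what one name in the table refers to, the sketch below shows the standard Python behavior behind mutable-default-arguments; the function names are hypothetical and the code is not taken from the entry.

```python
def add_tag_bad(tag, tags=[]):
    # Defect: the default list is created once, when the function is defined,
    # and is shared by every call that omits `tags`.
    tags.append(tag)
    return tags


def add_tag_good(tag, tags=None):
    # Fix: use None as a sentinel and build a fresh list on each call.
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```

Calling `add_tag_bad("a")` and then `add_tag_bad("b")` returns a list containing both tags on the second call, because state leaks between calls; `add_tag_good` returns a one-element list each time.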

By mechanism (cross-cutting notes)

Each entry has a category (the surface where the defect appears) and may participate in one or more cross-cutting notes (the mechanism connecting it to other entries). The two axes are independent.

| Note | What it observes | Entries |
| --- | --- | --- |
| ai-pedagogical-bias | Model treats production code as tutorial code | 6 entries |
| same-project-knows-right-pattern | Same codebase uses the right pattern at one site, wrong at another | 10 entries |
| codified-guidance-is-insufficient | Documented conventions don't prevent the violations; enforcement is the cure | 16+ entries |
| surface-failure-modes-explicitly | Typed-exception meta-family: surface failure modes through the type system | 4 entries |
| defensive-choice-with-justifying-comment | Defensive choices paired with comments justifying constraints that don't survive verification | 9+ entries |
| partial-fix-propagation | A prior fix addressed some sites; sibling sites retain the wrong pattern | 3 entries |

Entry format

Each taxonomy entry follows the same structure:

  • Code example — minimal defective code; the bug should be visible to someone who knows Python.
  • Mechanism — why a language model produces this shape. Connects to the primer above.
  • Evidence / incident — real-world GitHub issues, PRs, or CVEs where the pattern was found in AI-generated code. Every entry requires concrete evidence of an AI-vs-human frequency or form differential.
  • Detection cues — what to grep for or spot on visual scan. Mechanical enough to use as a checklist.
  • Notes — false-positive shapes, connections to cross-cutting notes, difficulty rating.
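
Where a text search is too coarse, a detection cue can sometimes be expressed over the syntax tree instead. The sketch below is one possible mechanical check for the missing-network-timeout shape using the standard-library `ast` module; the function name is hypothetical, and it only recognizes the literal spelling `requests.get(...)`, which is an assumption about what a checklist-grade cue might cover.

```python
import ast


def calls_missing_timeout(source: str) -> list[int]:
    """Return line numbers of requests.get(...) calls with no timeout= keyword."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_get = (
            isinstance(func, ast.Attribute)
            and func.attr == "get"
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
        )
        # A **kwargs argument has kw.arg == None, so it does not count as timeout=.
        has_timeout = any(kw.arg == "timeout" for kw in node.keywords)
        if is_requests_get and not has_timeout:
            lines.append(node.lineno)
    return sorted(lines)


snippet = (
    "import requests\n"
    "requests.get('https://example.com')\n"
    "requests.get('https://example.com', timeout=5)\n"
)
print(calls_missing_timeout(snippet))
```

A call that forwards `**kwargs` which may carry a timeout, or a wrapper that applies a default timeout elsewhere, would be reported here even though it may be fine; those are the kind of false-positive shapes the Notes section of an entry is meant to record.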

All current entries are classified generation: evergreen, meaning the pattern is stable across model generations. The entry format also supports a current-model value for patterns tied to specific model families, but no such pattern has met the inclusion rule yet. Longitudinal tracking of which generations introduce or reduce specific patterns is a future goal.

What this is (and isn't)

This is a reference taxonomy — a structured catalog of patterns with evidence and mechanism. It is not:

  • An exhaustive list. 24 entries is a starting point, not a ceiling.
  • A claim that AI is bad at coding. The stance is neutral and practical: assistants are useful tools with characteristic distributional properties.
  • A frozen document. The patterns AI produces will shift as models change; the taxonomy is meant to be updated alongside them.

Inclusion rule

A pattern enters the taxonomy only if there is concrete evidence of a frequency or form differential between AI-generated and human-written code — not just a plausible mechanism story. Each entry carries an evidence grade: reproduced (independently verified), observed (captured from real projects), reported (documented by others), or analogical (structurally predicted from a confirmed mechanism).
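
For teams that want to mirror this metadata in their own tooling, the four grades map naturally onto an enumeration. The sketch below is an assumption about how that could be modeled, not the repository's schema: `EvidenceGrade`, `EntryMeta`, and the field names are hypothetical, and the grade assigned in the usage line is arbitrary rather than the entry's real grade.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceGrade(Enum):
    REPRODUCED = "reproduced"  # independently verified
    OBSERVED = "observed"      # captured from real projects
    REPORTED = "reported"      # documented by others
    ANALOGICAL = "analogical"  # structurally predicted from a confirmed mechanism


# The two generation values the README mentions.
GENERATIONS = {"evergreen", "current-model"}


@dataclass(frozen=True)
class EntryMeta:
    slug: str
    category: str
    grade: EvidenceGrade
    generation: str = "evergreen"

    def __post_init__(self):
        # Reject metadata that falls outside the documented vocabulary.
        if not isinstance(self.grade, EvidenceGrade):
            raise TypeError("grade must be an EvidenceGrade")
        if self.generation not in GENERATIONS:
            raise ValueError(f"unknown generation: {self.generation!r}")


# Grade chosen arbitrarily for illustration only.
example = EntryMeta("off-by-one", "control-flow", EvidenceGrade("observed"))
print(example.grade.name, example.generation)
```

Parsing a grade with `EvidenceGrade("observed")` raises `ValueError` for any string outside the four documented values, which gives a cheap validation step when loading entry metadata from files.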

Evidence methodology

The taxonomy's evidence base draws from three streams:

  1. GitHub issues and PRs from AI-coded open-source projects, identified by CLAUDE.md/AGENTS.md presence, AI-attributed commit trailers, or bot-authored audit frameworks. Across the entries, 75 specimens are drawn from 65+ distinct repositories. A small number of repositories contribute specimens to multiple entries, and individual entries may have narrower provenance; the per-entry Evidence sections are transparent about sourcing.
  2. Community lint rules (ruff, bandit, pylint, SonarCloud) that independently flag the same patterns — evidence that the broader Python community recognizes these as defect classes regardless of authorship.
  3. Academic cross-validation — Zhu, Tsantalis & Rigby (2026), "AI-Generated Smells" (arXiv:2605.02741), provides statistical evidence on structural code smells in AI-generated Python code, cross-validating the near-identical-siblings entry and the broader claim that AI-generated code has measurable distributional properties.

Evidence specimens referenced in entries link to the original GitHub issues. Local specimen files (detailed research notes) are not included in this repository.

Sources and background

License

MIT — see LICENSE.
