feat(engine): near-duplicate-blocks engine and layered architecture #28
Duplicated from chore/samples PR for independence. After chore/samples merges, this commit can be dropped on rebase (the files will be identical).
…x.exs
Project-level tooling: dialyzer suppression list, codeqa analysis config, pre-commit hooks, devenv environment, and mix.exs with new deps (gen_stage, yamerl, etc.) required by the engine layer.
…rvers
- Engine layer: Analyzer, Collector, Pipeline, Registry, Parallel, FileContext
- AST subsystem: lexing (tokens + normalizer), signal-based parsing, classification (node protocol, type detector), enrichment (compound nodes)
- 40+ language definitions organized by category (native, scripting, vm, web, markup, config, data)
- Per-run OTP supervisors: BehaviorConfigServer, FileContextServer, FileMetricsServer, RunSupervisor
- Near-duplicate-block detection with bigram-based similarity scoring
- Block impact analysis with codebase/file impact and refactoring potentials
- Combined metrics system with scalar applier and YAML behavior configs
- Health report subsystem with formatters, grader, delta, top blocks
Tests cover: AST lexing/parsing/classification/enrichment, signal registry, language fixtures for all 40+ languages, block impact analysis, combined metrics, engine pipeline, health report subsystems, near-duplicate-block detection (file and codebase level), OTP analysis servers, and CLI.
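The commit message above mentions bigram-based similarity scoring but does not show the formula. As a rough illustration only, the sketch below scores two blocks by Jaccard overlap of their token bigrams. The module name `BigramSimilaritySketch`, the whitespace tokenizer, and the choice of Jaccard are all assumptions made for this example; the engine in this PR may tokenize and score differently.

```elixir
defmodule BigramSimilaritySketch do
  @moduledoc "Hypothetical illustration; not the module added in this PR."

  # Whitespace split is a stand-in for the engine's real lexer (assumption).
  def tokens(block), do: String.split(block)

  # Set of adjacent token pairs; a block with fewer than two tokens yields an empty set.
  def bigrams(tokens) do
    tokens
    |> Enum.chunk_every(2, 1, :discard)
    |> MapSet.new()
  end

  # Jaccard similarity over bigram sets: |A ∩ B| / |A ∪ B|, in the range 0.0..1.0.
  # Two blocks without any bigrams score 0.0 rather than dividing by zero.
  def similarity(block_a, block_b) do
    a = block_a |> tokens() |> bigrams()
    b = block_b |> tokens() |> bigrams()
    union_size = MapSet.size(MapSet.union(a, b))

    if union_size == 0 do
      0.0
    else
      MapSet.size(MapSet.intersection(a, b)) / union_size
    end
  end
end
```

With this definition, `BigramSimilaritySketch.similarity("a b c d", "a b c e")` evaluates to `0.5`: the two blocks share two of four distinct bigrams.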
…rage
- sync-behavior-coverage: new workflow to keep behavior YAML coverage in sync
- compare, dialyzer, health-report, test: updated for new engine structure
- action.yml: updated action definition
- scripts/run.sh: updated run helper
Documents new mix tasks (health, diagnose, sample-report, signal-debug), metric categories, and updated architecture overview.
Removes modules that were replaced by the layered engine architecture:
- lib/codeqa/analyzer.ex → lib/codeqa/engine/analyzer.ex
- lib/codeqa/collector.ex → lib/codeqa/engine/collector.ex
- lib/codeqa/pipeline.ex → lib/codeqa/engine/pipeline.ex
- lib/codeqa/comparator.ex, formatter.ex, summarizer.ex, telemetry.ex,
stopwords.ex → functionality moved into engine subsystems
- lib/codeqa/metrics/{codebase_metric,file_metric,token_normalizer}.ex
→ moved to lib/codeqa/metrics/{codebase,file}/
- lib/codeqa/cli/{compare,stopwords}.ex → commands removed from CLI
- test/codeqa/{cli_compare,formatter}_test.exs → tests removed with commands
…gine/ subdirs
Removes old flat module structure that was reorganized:
- lib/codeqa/metrics/*.ex → moved to lib/codeqa/metrics/file/
- lib/codeqa/parallel.ex → moved to lib/codeqa/engine/parallel.ex
- lib/codeqa/registry.ex → moved to lib/codeqa/engine/registry.ex
- test/codeqa/metrics/{branching,function_metrics}_test.exs
→ moved to test/codeqa/metrics/file/
🟠 Code Health: C (61/100)
%%{init: {'theme': 'neutral'}}%%
xychart-beta
title "Code Health Scores"
x-axis ["Readability", "Complexity", "Structure", "Duplication", "Naming", "Magic Numbers", "Combined Metrics"]
y-axis "Score" 0 --> 100
bar [95, 31, 86, 48, 96, 100, 63]
🔍 Top Likely Issues (cosine similarity)
🟢 Readability — A (95/100)
Codebase averages: flesch_adapted=97.99, fog_adapted=4.72, avg_tokens_per_line=9.31, avg_line_length=35.15
🔴 Complexity — D- (31/100)
Codebase averages: difficulty=41.09, effort=233830.77, volume=4030.66, estimated_bugs=1.34
🟢 Structure — A- (86/100)
Codebase averages: branching_density=0.14, mean_depth=3.92, avg_function_lines=8.68, max_depth=10.06, max_function_lines=20.73, variance=7.34, avg_param_count=1.13, max_param_count=1.98
🟠 Duplication — C- (48/100)
Codebase averages: redundancy=0.59, bigram_repetition_rate=0.55, trigram_repetition_rate=0.37
🟢 Naming — A (96/100)
Codebase averages: entropy=0.89, mean=6.66, variance=18.87, avg_sub_words_per_id=1.17
🟢 Magic Numbers — A (100/100)
Codebase averages: density=0.00
🔴 Combined Metrics — D (63/100)
🔴 Code Smells — D (40/100)
🟠 Consistency — C- (50/100)
🔴 Dependencies — D- (27/100)
🟡 Documentation — B+ (83/100)
🟢 Error Handling — A- (92/100)
🔴 File Structure — D+ (42/100)
🟡 Function Design — B+ (81/100)
🟡 Naming Conventions — B+ (80/100)
🔴 Scope And Assignment — D- (28/100)
🟡 Testing — B+ (83/100)
🟢 Type And Value — A- (89/100)
🟡 Variable Naming — B (74/100)
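The Duplication averages above include `bigram_repetition_rate` and `trigram_repetition_rate`, but the report does not define them. One plausible reading, shown here purely as an assumption, is the share of n-gram occurrences that repeat an n-gram already seen in the token stream. The module and function names below are hypothetical and are not taken from this PR.

```elixir
defmodule RepetitionRateSketch do
  @moduledoc "Hypothetical illustration of an n-gram repetition rate."

  # (total - unique) / total over sliding n-grams of the token list.
  # Returns 0.0 when the list is too short to form any n-gram.
  def repetition_rate(tokens, n) when is_list(tokens) and n > 0 do
    ngrams = Enum.chunk_every(tokens, n, 1, :discard)
    total = length(ngrams)

    if total == 0 do
      0.0
    else
      unique = ngrams |> Enum.uniq() |> length()
      (total - unique) / total
    end
  end
end
```

Under this assumed definition, `RepetitionRateSketch.repetition_rate(~w(a b a b a b), 2)` is `0.6`, since three of the five bigram occurrences repeat an earlier one.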
🔴 lib/codeqa/metrics/file/inflector.ex:32-45 — No Implicit Null Initial
Issues:
32 │ """
33 │ @spec detect_casing(String.t()) ::
34 │ :pascal_case | :camel_case | :snake_case | :macro_case | :kebab_case | :other
35 │ def detect_casing(identifier) do
36 │ cond do
37 │ identifier =~ ~r/^[A-Z][a-zA-Z0-9]*$/ -> :pascal_case
38 │ identifier =~ ~r/^[a-z][a-z0-9]*(?:[A-Z][a-zA-Z0-9]*)+$/ -> :camel_case
39 │ identifier =~ ~r/^[a-z]+(_[a-z0-9]+)*$/ -> :snake_case
40 │ identifier =~ ~r/^[A-Z]+(_[A-Z0-9]+)*$/ -> :macro_case
41 │ identifier =~ ~r/^[a-z]+(-[a-z0-9]+)*$/ -> :kebab_case
42 │ true -> :other
43 │ end
44 │ end
45 │ end
🟠 lib/codeqa/health_report/formatter/github.ex:54-73 — Single Responsibility
Issues:
54 │ """
55 │ @spec render_part_2(map(), keyword()) :: String.t()
56 │ def render_part_2(report, opts \\ []) do
57 │ detail = Keyword.get(opts, :detail, :default)
58 │ display_categories = merge_cosine_categories(report.categories)
59 │ worst_blocks = Map.get(report, :worst_blocks_by_category, %{})
60 │
61 │ [
62 │ top_issues_section(Map.get(report, :top_issues, []), detail),
63 │ category_sections(display_categories, detail, worst_blocks),
64 │ sentinel(2)
65 │ ]
66 │ |> List.flatten()
67 │ |> Enum.join("\n")
68 │ end
69 │
70 │ @doc """
71 │ Renders Part 3: blocks section (top 10 blocks with code).
72 │ Returns a list with a single part since blocks are now limited to top 10.
73 │ """
🟠 lib/codeqa/health_report/formatter/github.ex:14-25 — Single Responsibility
Issues:
14 │ [
15 │ pr_summary_section(Map.get(report, :pr_summary)),
16 │ header(report),
17 │ cosine_legend(),
18 │ delta_section(Map.get(report, :codebase_delta)),
19 │ if(chart?, do: mermaid_chart(display_categories), else: []),
20 │ progress_bars(display_categories),
21 │ top_issues_section(Map.get(report, :top_issues, []), detail),
22 │ blocks_section(Map.get(report, :top_blocks, [])),
23 │ category_sections(display_categories, detail, worst_blocks),
24 │ footer()
25 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:39-47 — Single Responsibility
Issues:
39 │ [
40 │ pr_summary_section(Map.get(report, :pr_summary)),
41 │ header(report),
42 │ cosine_legend(),
43 │ delta_section(Map.get(report, :codebase_delta)),
44 │ if(chart?, do: mermaid_chart(display_categories), else: []),
45 │ progress_bars(display_categories),
46 │ sentinel(1)
47 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:61-65 — Single Responsibility
Issues:
61 │ [
62 │ top_issues_section(Map.get(report, :top_issues, []), detail),
63 │ category_sections(display_categories, detail, worst_blocks),
64 │ sentinel(2)
65 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:98-105 — Single Responsibility
Issues:
98 │ combined = %{
99 │ type: :cosine_group,
100 │ key: "combined_metrics",
101 │ name: "Combined Metrics",
102 │ score: combined_score,
103 │ grade: grade_letter_from_score(combined_score),
104 │ categories: cosine
105 │ }
🟠 lib/codeqa/health_report/formatter/github.ex:128-133 — Single Responsibility
Issues:
128 │ [
129 │ "## #{emoji} Code Health: #{report.overall_grade} (#{report.overall_score}/100)",
130 │ "",
131 │ "> #{report.metadata.total_files} files · #{extract_project_name(report.metadata.path)} · #{format_date(report.metadata.timestamp)}",
132 │ ""
133 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:137-140 — Single Responsibility
Issues:
137 │ [
138 │ "> *Combined metric scores use cosine similarity: +1 = metric profile perfectly matches healthy pattern for this behavior, 0 = no signal, −1 = anti-pattern detected. Mapped to 0–100 using breakpoints (approx: ≥0.5→A, ≥0.2→B, ≥0.0→C, ≥−0.3→D, <−0.3→F); actual letter grades use the full 15-step scale.*",
139 │ ""
140 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:147-157 — Single Responsibility
Issues:
147 │ [
148 │ "```mermaid",
149 │ "%%{init: {'theme': 'neutral'}}%%",
150 │ "xychart-beta",
151 │ " title \"Code Health Scores\"",
152 │ " x-axis [#{names}]",
153 │ " y-axis \"Score\" 0 --> 100",
154 │ " bar [#{scores}]",
155 │ "```",
156 │ ""
157 │ ]
🟠 lib/codeqa/health_report/formatter/github.ex:162-164 — Single Responsibility
Issues:
162 │ Enum.reduce(categories, 0, fn cat, acc ->
163 │ max(acc, String.length(cat.name))
164 │ end)
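The cosine legend quoted in the report (github.ex:137-140) says combined-metric scores are cosine similarities graded by approximate breakpoints (≥0.5→A, ≥0.2→B, ≥0.0→C, ≥−0.3→D, <−0.3→F). The sketch below shows how such a score and its coarse letter could be computed. It is only an illustration: the module `CosineGradeSketch` and its functions are hypothetical, and the legend notes that real letter grades use a finer 15-step scale that is not reproduced here.

```elixir
defmodule CosineGradeSketch do
  @moduledoc "Hypothetical illustration of the approximate breakpoints in the legend."

  # Cosine similarity of two equal-length numeric lists.
  # Returns 0.0 ("no signal") when either vector has zero magnitude.
  def cosine(a, b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    norm_a = :math.sqrt(Enum.reduce(a, 0.0, fn x, acc -> acc + x * x end))
    norm_b = :math.sqrt(Enum.reduce(b, 0.0, fn x, acc -> acc + x * x end))

    if norm_a == 0.0 or norm_b == 0.0 do
      0.0
    else
      dot / (norm_a * norm_b)
    end
  end

  # Coarse letter only, following the approximate breakpoints quoted in the legend.
  def coarse_grade(sim) when sim >= 0.5, do: "A"
  def coarse_grade(sim) when sim >= 0.2, do: "B"
  def coarse_grade(sim) when sim >= 0.0, do: "C"
  def coarse_grade(sim) when sim >= -0.3, do: "D"
  def coarse_grade(_sim), do: "F"
end
```

For example, a metric profile identical to the healthy pattern scores `1.0` and maps to "A", while an exactly opposite profile scores `-1.0` and maps to "F".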
Summary
- lib/codeqa/metrics/*.ex → lib/codeqa/metrics/file/ and lib/codeqa/metrics/codebase/
Dependencies
- chore/samples (shared sample fixtures duplicated here for independence)
- tools/scalar_tuner UI is split into a separate PR
Merge notes
This branch includes the 791 sample fixtures from chore/samples for independence. After chore/samples merges:
- the chore(samples): include sample fixtures for independent PR commit can be dropped on rebase
Extra work done during split
Two extra cleanup commits were needed because pr-b-files.txt only listed files present on the original branch. Files deleted from main by the original branch were not in the list and had to be removed separately:
- refactor(engine): remove superseded modules from pre-engine architecture (15 files)
- refactor(engine): remove flat metrics modules moved into file/ and engine/ subdirs (23 files)
Test plan
- mix test passes (884 tests, 0 failures, verified locally)
- mix dialyzer clean (ignored warnings listed in .dialyzer_ignore.exs)
- mix codeqa.health runs on this repo