
Add PoC of gritql integration #12

Merged 15 commits into main on Apr 16, 2024
Conversation

transitive-bullshit
Collaborator

This PR adds an optional dependency on the Rust-based grit binary via the @getgrit/launcher npm package.

Goals of adding gritql:

  • enable file-scoped rules to optionally specify gritql patterns to extract more focused source code context for the LLM-based linter to work with
  • increase the accuracy of the linter by focusing the source context
  • increase the speed of the linter by passing less context
  • reduce the cost of the linter by passing less context
  • gritql wraps tree-sitter and a bunch of tree-sitter wasm grammars, which are generally a pain to deal with directly (see my exploration here for more details and alternatives). One potential upside of using gritql is that it will make cross-language and multi-file-context support a lot easier than using tree-sitter directly
  • gritql also natively supports specifying before & after code transforms in its pattern syntax via the => operator. When we eventually tackle autofix, this could be a useful, more deterministic tool to augment LLM-based code editing for many built-in rules
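
To make the last point concrete, here's a hypothetical GritQL pattern (illustrative only, not from this PR) showing both the filtering side and the `=>` rewrite operator in a single expression:

```gritql
language js

// `$msg` is a metavariable bound to whatever argument is matched.
// The left side filters matching code; the right side declares the rewrite.
`console.log($msg)` => `logger.info($msg)`
```

A rule that only needs filtering would use just the left-hand pattern; the `=>` rewrite would only come into play for autofix.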

Most of these goals still need to be quantified and tested in real-world repos as well as our evals before we consider merging. I'm not sure if we'll include this in the MVP given our goal of launching next week, but given how large an impact this could potentially have, I was excited to explore this direction to understand its viability and quantify the potential impact.

See also getgrit/gritql#214. If we end up adopting this approach, we'll likely want to use a more generally compatible solution as opposed to running grit as a subprocess, but this would be good enough for the purposes of this PoC for now.
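
As a rough sketch of the subprocess approach (the `apply` subcommand and argument shape here are assumptions for illustration, not lifted from this PR's code), the argv for a per-rule grit invocation might be built like this:

```typescript
import { execFile } from 'node:child_process'

// Build the argv for one per-rule grit invocation. The `apply` subcommand
// and positional-args shape are assumptions for illustration.
function buildGritArgs(pattern: string, filePaths: string[]): string[] {
  return ['apply', pattern, ...filePaths]
}

// Hypothetical usage: spawn the grit binary (from @getgrit/launcher) once per
// rule that declares a gritql pattern, e.g.:
// execFile('grit', buildGritArgs('`console.log($msg)`', ['src/']), callback)
console.log(buildGritArgs('`console.log($msg)`', ['src/']))
```

Spawning once per rule is the simplest thing that could work for a PoC, at the cost of re-scanning the codebase per rule.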

@transitive-bullshit transitive-bullshit added the enhancement New feature or request label Apr 14, 2024

@transitive-bullshit
Collaborator Author

transitive-bullshit commented Apr 15, 2024

OpenAI before & after tests

All results without grit were run with the --no-grit flag. I tried running the same tests with Anthropic Claude 3 Haiku & Opus, but ran into a lot of random Cloudflare 524 errors which corrupted the results.

OpenAI gptlint results with grit

Linter tasks { numTasks: 328, numTasksFilteredGrit: 147, numTasksDisabled: 1 }
Linter stats; total cost $2.52 {
  model: 'gpt-4-turbo-preview',
  weakModel: 'gpt-3.5-turbo',
  numModelCalls: 397,
  numModelCallsCached: 0,
  numPromptTokens: 628157,
  numCompletionTokens: 60081,
  numTotalTokens: 688238,
  lintDuration: '1m 5.2s'
}

lint errors: [
  {
    "model": "gpt-4-turbo-preview",
    "level": "error",
    "filePath": "bin/grit.ts",
    "language": "typescript",
    "message": "Use semantic variable names",
    "ruleName": "semantic-variable-names",
    "codeSnippet": "const partialFilesC = [...partialFilesMap.values()].filter(",
    "confidence": "high",
    "reasoning": "The variable name 'partialFilesC' is not descriptive and does not clearly convey the purpose or content of the variable. The use of 'C' is unclear and does not follow best practices for semantic variable names."
  }
]

OpenAI gptlint results without grit

Linter tasks { numTasks: 475, numTasksDisabled: 1 }
Linter stats; total cost $3.16 {
  model: 'gpt-4-turbo-preview',
  weakModel: 'gpt-3.5-turbo',
  numModelCalls: 521,
  numModelCallsCached: 0,
  numPromptTokens: 1197978,
  numCompletionTokens: 61159,
  numTotalTokens: 1259137,
  lintDuration: '54.6s'
}

lint errors: [
  {
    "model": "gpt-4-turbo-preview",
    "level": "error",
    "filePath": "bin/generate-evals.ts",
    "language": "typescript",
    "message": "Prefer types that always represent valid states",
    "ruleName": "prefer-types-always-valid-states",
    "codeSnippet": "state.isLoading = true\nstate.isLoading = false\nstate.error = '' + e",
    "confidence": "high",
    "reasoning": "The provided code snippet directly manipulates state properties in a way that could lead to invalid states, such as setting isLoading to true then to false, and setting an error without a clear transition state. This approach does not ensure that the state always represents valid and consistent states, violating the rule's intent to prefer types that always represent valid states."
  }
]

OpenAI eval results with grit

Linter stats; total cost $3.48 {
  model: 'gpt-4-turbo-preview',
  weakModel: 'gpt-3.5-turbo',
  numModelCalls: 428,
  numModelCallsCached: 0,
  numPromptTokens: 599655,
  numCompletionTokens: 57969,
  numTotalTokens: 657624
}

Eval results {
  numFiles: 292,
  numRules: 13,
  numUnexpectedErrors: 0,
  numFalseNegatives: 1,
  numFalsePositives: 0,
  numTrueNegatives: 145,
  numTruePositives: 146,
  precision: 1,
  recall: 0.9931972789115646,
  f1Score: 0.9965870307167235
}

OpenAI eval results without grit

Linter stats; total cost $3.59 {
  model: 'gpt-4-turbo-preview',
  weakModel: 'gpt-3.5-turbo',
  numModelCalls: 451,
  numModelCallsCached: 0,
  numPromptTokens: 635057,
  numCompletionTokens: 60404,
  numTotalTokens: 695461
}

Eval results {
  numFiles: 292,
  numRules: 13,
  numUnexpectedErrors: 0,
  numFalseNegatives: 1,
  numFalsePositives: 0,
  numTrueNegatives: 145,
  numTruePositives: 146,
  precision: 1,
  recall: 0.9931972789115646,
  f1Score: 0.9965870307167235
}
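
The precision/recall/F1 figures in both eval runs follow directly from the confusion counts above (a quick sanity check, not project code):

```typescript
// Recompute the eval metrics from the reported confusion-matrix counts.
const tp = 146 // numTruePositives
const fp = 0   // numFalsePositives
const fn = 1   // numFalseNegatives

const precision = tp / (tp + fp)                           // 1
const recall = tp / (tp + fn)                              // 146/147 ≈ 0.9932
const f1 = (2 * precision * recall) / (precision + recall) // 292/293 ≈ 0.9966

console.log({ precision, recall, f1 })
```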

Analysis

The eval results are roughly the same, which makes sense because all of the generated evals are purposefully minimal, and gritql is currently only being used to filter source files.

The more interesting comparison is from running the linter on the gptlint codebase itself. Here, using grit resulted in a 20% cost reduction ($3.16 → $2.52) but was ~15% slower (1m 5.2s vs 54.6s). The grit version correctly found one true positive, whereas the --no-grit version flagged one false positive and missed that true positive (one false negative). So the grit version is clearly more accurate for the current rules, though it's difficult to quantify by how much without more extensive testing.

Overall, these results aren't as significant as I was expecting, but filtering out source files with no matches does provide a significant cost reduction, and I expect this difference to be proportional to the size of the codebase.
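
Concretely, the task counts from the two linter runs above show the filtering effect: the gritql patterns excluded 147 of the 475 file×rule tasks (~31%) before any LLM calls were made:

```typescript
// Task counts from the two linter runs on the gptlint codebase.
const tasksWithoutGrit = 475    // numTasks in the --no-grit run
const tasksFilteredByGrit = 147 // numTasksFilteredGrit in the grit run

const tasksWithGrit = tasksWithoutGrit - tasksFilteredByGrit    // 328
const filteredFraction = tasksFilteredByGrit / tasksWithoutGrit // ≈ 0.31
```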

Also, many of the built-in rules are still sending too much context to the LLM, and several rules don't use gritql filters at all yet, so there is definitely still work to be done which I believe will increase these gains.

I was a bit surprised that the grit version is ~15% slower than the non-grit version, since the grit version made only 397 LLM calls versus 521 for --no-grit, and I would've expected the inference time for those 124 extra LLM calls to dominate the runtime compared to our local grit invocations. The discrepancy is likely due to invoking grit multiple times (once per rule with a gritql pattern) and re-scanning the codebase on each invocation. This can definitely be optimized, but I'd rather first confirm that this is the right direction before spending too much time optimizing the gritql integration.

@transitive-bullshit
Collaborator Author

Also interesting to note that the grit version used roughly half as many prompt tokens as the --no-grit version (628,157 vs 1,197,978), while both used approximately the same number of completion tokens (60,081 vs 61,159), which are more expensive per token than prompt tokens.
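
Those ratios check out against the stats from the two runs:

```typescript
// Stats copied from the two linter runs on the gptlint codebase.
const withGrit = { cost: 2.52, promptTokens: 628_157, modelCalls: 397 }
const withoutGrit = { cost: 3.16, promptTokens: 1_197_978, modelCalls: 521 }

const costReduction = 1 - withGrit.cost / withoutGrit.cost                // ≈ 0.20 (20%)
const promptTokenRatio = withoutGrit.promptTokens / withGrit.promptTokens // ≈ 1.91x
const extraModelCalls = withoutGrit.modelCalls - withGrit.modelCalls      // 124
```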

@transitive-bullshit transitive-bullshit merged commit 583073a into main Apr 16, 2024
5 checks passed
@transitive-bullshit transitive-bullshit deleted the feature/gritql branch April 16, 2024 00:58
@transitive-bullshit transitive-bullshit changed the title WIP Add PoC of gritql integration Add PoC of gritql integration Apr 24, 2024