Tools to analyze and capture metadata from an existing application's codebase into a database and then derive higher-level insights about the application. The tools let you generate a report describing the application's purpose and composition. A tool is also provided to let you ask ad hoc questions of the codebase. Where applicable, the tools use one of a choice of several AI LLMs to help accelerate insight generation. The tools rely on a MongoDB Atlas database (with its Vector Search capability) to store and retrieve insights.
The following screenshot shows part of an example report generated by these tools.
You can also view an example of a full report generated by the tools.
-
Ensure you have the following software installed on your workstation:
-
Install the project's required Node/NPM packages.
npm install
-
Ensure you have the codebase for a sample application ready to access on your local filesystem. You can optionally download a zip of some example projects for testing (for the subset of people who have link access). Note that the current version of these tools works better with Java-based codebases, but over time, many more programming languages will be supported equally.
-
Ensure you can leverage LLMs from the OpenAI/Azure GPT, GCP Vertex AI or AWS Bedrock APIs, with the following three model types available to use, along with appropriate LLM provider API keys / credentials:
- Embeddings model for generating vector embeddings
- Text completions 'primary' model, typically with a small token limit, for generating text and JSON content to process text inputs
- Text completions 'secondary' model, typically with a large token limit, for generating text and JSON content to process text inputs (acting as a backup in case the primary model errors for a particular piece of content)
For suggestions on which LLMs to run, see the section Demonstrated LLM Capabilities.
-
In a shell, from the root folder of this project, run the following command to copy an example environment configuration file to a new file in the same root folder called .env:
cp 'EXAMPLE.env' '.env'
Edit the .env file to:
- Set your LLM provider (e.g., "OpenAI", "VertexAIGemini", etc.)
- Set your MongoDB URL and codebase directory path
- Set the specific environment variables required for your chosen LLM provider

The system uses a manifest-driven approach: you only need to configure environment variables for your selected LLM provider. The application will automatically validate only the variables required for your chosen provider and provide clear error messages if any are missing. See the section LLM Authentication And URN Notes for help on determining the URNs to specify in the .env file.
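For illustration only, a populated .env might look something like the sketch below; the variable names shown here are assumptions for illustration, so use the exact names defined in EXAMPLE.env rather than copying these:

```
# Hypothetical variable names for illustration - use the real names from EXAMPLE.env
LLM=VertexAIGemini
MONGODB_URL=mongodb+srv://<user>:<password>@<your-cluster-host>/
CODEBASE_DIR_PATH=/path/to/your/sample/application
# Plus only the provider-specific variables required for the chosen LLM provider,
# e.g. API keys/credentials and the URNs of the embeddings, primary and secondary models
```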
-
Ensure you have a running MongoDB Atlas version 7.0 or greater dedicated cluster of any size/tier. You can even use an 'M0' free-tier version, although for some use cases, the free-tier storage limit of 512MB may be insufficient. Ensure the appropriate network and database access rights are configured in Atlas.
There are various tools you need to run in a specific order (shown in the next section). For each tool, you can easily debug using VS Code by following these steps:
- Open the project in VS Code
- In the Explorer, select the "src/cli/*.ts" or "src/cli/optional/*.ts" file you want to run
- From the Activity Bar (left panel), select the Run and Debug view
- Execute the pre-configured task Run and Debug TypeScript - this will run the TypeScript compiler first and then, if successful, run the program in debug mode, showing its output in the Debug Console (bottom panel).
Alternatively, you can also run the compiled JavaScript files (first compiled from TypeScript using the npm run build
command) from the terminal using the node
command. Main tools are located in ./dist/src/cli/
and optional tools are in ./dist/src/cli/optional/
. The command to run each tool is shown in the next section.
(for a summary of the purpose of these tools, see the section Analysis And Summary Capture Process)
-
BUILD THE PROJECT SOURCE CODE: First build the project (this compiles the TypeScript and moves some HTML template files to the project's executable path) by executing the following command.
npm run build
-
CAPTURE SOURCES: To capture LLM-generated metadata about every source file into the database, execute the following command from a terminal (or select the corresponding "src/cli/capture-codebase.ts" file in your VS Code IDE and choose "Run and Debug").
node ./dist/src/cli/capture-codebase.js
Note 1. If you receive LLM provider authentication/authorisation errors when you execute the task, see the section LLM Authentication And URN Notes for help on configuring LLM provider credentials correctly.
Note 2. The task may take a few minutes to tens of minutes to execute, depending on the complexity of the source project and the LLM used. This tool employs asynchronous IO and concurrency, but inevitably, the LLM will take time to respond to requests and may often apply throttling, which will be the main causes of slowdown. If you see messages with the character ? in the tool's output, this indicates that the LLM is returning an "overloaded" response, and hence, the tool will transparently pause each affected LLM request job and then retry after a short wait.
-
GENERATE INSIGHTS: Run the following command to generate insights (e.g., identified technology stack, business processes, DDD aggregates, potential microservices, etc.), leveraging the source file metadata previously captured in the database.
node ./dist/src/cli/generate-insights-from-db.js
-
CREATE REPORT: Run the following command to generate a static HTML-based report summarising the application from the previously captured source metadata and aggregated insights.
node ./dist/src/cli/produce-report.js
-
QUERY CODEBASE: To run ad hoc queries against the codebase (i.e., to "talk to your code", which uses MongoDB's Vector Search capability to search the database-captured metadata), place your questions in the file input/questions.prompts and then run the following command to execute the queries.
node ./dist/src/cli/optional/query-codebase.js
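As a rough illustration of the kind of query this tool performs behind the scenes, the following sketch uses the MongoDB Node.js driver's Atlas Vector Search aggregation stage; it is not the project's actual code, and the index and field names are assumptions for illustration:

```typescript
import { MongoClient } from "mongodb";

// Illustrative sketch only - index/field names are assumptions, not the tool's actual schema.
async function searchCodebase(question: string, queryVector: number[]): Promise<void> {
  const client = new MongoClient(process.env.MONGODB_URL ?? "mongodb://localhost:27017");
  try {
    const collection = client.db("codebase-analyzed").collection("sources");
    // $vectorSearch is the Atlas Vector Search aggregation stage.
    const results = await collection.aggregate([
      {
        $vectorSearch: {
          index: "summary_vector_index", // hypothetical index name
          path: "summaryVector",         // hypothetical embedding field
          queryVector,                   // embedding of the question, produced by the embeddings model
          numCandidates: 150,
          limit: 10,
        },
      },
      { $project: { filepath: 1, summary: 1, score: { $meta: "vectorSearchScore" } } },
    ]).toArray();
    console.log(`Top matches for "${question}":`, results);
  } finally {
    await client.close();
  }
}
```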
-
OTHER TOOLS: Other optional tools are provided in the "src/cli/optional" subfolder which you can explore and run. These are currently undocumented and may disappear in the future. They are primarily intended for use by this project's developers or for specialized use cases.
Execute the following command from the project's root folder:
npm run validate
During project setup, you will need to ensure you have enabled the required LLM for the required cloud region using the cloud provider's LLM configuration tool.
Specify your API key for your own OpenAI or Azure GPT service in .env.
Execute the following from a shell, assuming you have already installed and configured the Google Cloud CLI (gcloud).
gcloud auth login
gcloud auth application-default login
In the AWS Console, select the "Bedrock Configuration | Model Access" option and enable access for the models you require.
Execute the following from a shell, assuming you have already installed and configured the AWS CLI, aws (for a subset of people, you can use MDB's SSO to access the AWS account to obtain the SSO start URL):
aws configure sso
Then edit the file ~/.aws/config and rename the [profile ...] line of the newly generated profile section to [default] instead (see the illustrative snippet after the commands below), then run:
aws sso login
aws sts get-caller-identity # to test the CLI works
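For illustration only (the generated profile name and its settings will differ in your environment), the rename amounts to changing just the section header in ~/.aws/config, leaving the generated SSO settings underneath it unchanged:

```
# Before (hypothetical profile name generated by 'aws configure sso'):
[profile my-generated-sso-profile]

# After (same section, renamed so the AWS CLI and SDK use it by default):
[default]
```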
For some Bedrock-hosted models, AWS forces you to use the ARN of an AWS "inference profile" for the particular region as the model id. To view the region ARNs for models in Bedrock, run:
aws bedrock list-inference-profiles
From this output, configure the value shown for the inferenceProfileArn field in your .env file.
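As a rough illustration (the variable name below is a hypothetical placeholder, so use the name defined in EXAMPLE.env, and the region, account ID and profile ID will be specific to your setup), such an entry could look like:

```
# Hypothetical variable name for illustration only
BEDROCK_COMPLETIONS_MODEL_PRIMARY=arn:aws:bedrock:us-east-1:<account-id>:inference-profile/<inference-profile-id>
```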
The capture-codebase
tool captures metadata about all the files in the codebase, including LLM-generated summary data, in a series of collections in the codebase-analyzed
database. The main steps this tool conducts are:
-
Initialise database collections and indexes (including Atlas Vector Search indexes).
-
Loop through every nested file in the referenced codebase, capturing the following into the database collection
sources
:- the file's content (e.g., code)
- the file's metadata (e.g., file type, filepath)
- LLM-generated summary data for the specific file type (e.g., if the file is a Java source, captures information about the class, including its public methods with signatures, other classes referenced by this class, etc.)
- two separate vector embeddings, indexed via Atlas Vector Search (one for the file's content and one for the file's LLM-generated summary).
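As an aside, an Atlas Vector Search index over one of these embedding fields is defined with a document of the following general shape; the field path and dimension count below are illustrative assumptions, not the tool's actual settings:

```typescript
// Illustrative Atlas Vector Search index definition - path and numDimensions are assumptions
// (numDimensions must match the embeddings model's output size).
const summaryVectorIndexDefinition = {
  fields: [
    { type: "vector", path: "summaryVector", numDimensions: 1536, similarity: "cosine" },
  ],
};
```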
Below is an example of a record stored in the database representing an introspected Java file:
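(The sketch below is purely illustrative - the field names are assumptions rather than the tool's actual schema.)

```typescript
// Hypothetical shape of a 'sources' document for an introspected Java file - field names are assumptions.
const exampleSourceRecord = {
  filepath: "src/main/java/com/example/petstore/CustomerService.java", // hypothetical file
  type: "java",
  content: "package com.example.petstore;\n\npublic class CustomerService { /* ... */ }",
  summary: {
    classname: "CustomerService",
    purpose: "Manages customer accounts and profile updates.",
    publicMethods: [{ name: "findCustomer", signature: "Customer findCustomer(String id)" }],
    referencedClasses: ["Customer", "CustomerRepository"],
  },
  contentVector: [0.0123, -0.0456 /* ...remaining embedding dimensions... */],
  summaryVector: [0.0789, -0.0012 /* ... */],
};
```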
Using the captured metadata about the source files stored in the sources collection, the subsequent generate-insights-from-db tool uses an LLM to generate insights into various aspects of the application, which it then persists into the database collection appsummaries. These captured insights include an outline of the application's purpose and a list of its technologies, business processes, DDD bounded contexts, DDD aggregates, DDD entities, DDD repositories, and potential microservices.
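As a rough sketch only (the property names are assumptions derived from the insight categories listed above, not the tool's actual schema), an appsummaries document could be modelled along these lines:

```typescript
// Hypothetical shape of an 'appsummaries' document - property names are illustrative assumptions.
interface AppSummary {
  appPurpose: string;               // outline of the application's purpose
  technologies: string[];           // identified technology stack
  businessProcesses: string[];      // business processes discovered in the code
  boundedContexts: string[];        // DDD bounded contexts
  aggregates: string[];             // DDD aggregates
  entities: string[];               // DDD entities
  repositories: string[];           // DDD repositories
  potentialMicroservices: string[]; // candidate microservice decompositions
}
```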
The following table shows the results of using this project with various LLMs against the codebase of the legacy Sun/Oracle Java Petstore J2EE application (tested on 17-July-2024):
| LLM Hosting/API Provider | LLMs | Insight Quality ¹ | Speed ² | Average Error Rate ³ |
|---|---|---|---|---|
| Azure OpenAI | GPT4o | 4 | 1:47 mins | 0.4 % |
| GCP VertexAI | Gemini 2.5 Pro + Flash | 5 | 13:44 mins | 0.6 % |
| AWS Bedrock | Claude Sonnet 4.0 & 3.7 | 4 | 23:47 mins | 1.1 % |
| AWS Bedrock | Amazon Nova Pro 1.0 Pro + Lite | 3.5 | 9:30 mins | 0.8 % |
| AWS Bedrock | Meta Llama Instruct 3-3-70b + 2-90b | 2 | 39:53 mins | 43 % |
| AWS Bedrock | Mistral Large 2402 + 2407 | 2 | 15:12 mins | 24 % |
| AWS Bedrock | Deepseek R1 | 3.5 | 10:12 mins | 1 % |
-
Insight Quality: 1 - 5, 1 = Low, 5 = High
-
Speed: Time taken to extract insights from all source files, including waits/retries
-
Error Rate: Proportion of requests that the LLM could not process even after retries or where the prompt had to be truncated to fit the LLM's context window
The LLMRouter class (see src/llm/core/llm-router.ts) enables the LLMs you use to be pluggable. It performs various tasks to optimize the use of the configured LLMs and their APIs, and it abstracts away many of the undesirable non-functional behaviours which occur when using LLMs. These non-functional features include:
- Pluggable LLMs. Routes requests to the configured flavour of LLM (which may have a specific token size limit).
- Automatic LLM Timeouts. Proactively times out LLM invocations if the LLM is taking too long to respond.
- Automatic LLM Retries. Detects throttling and timeout events from the LLM provider and automatically retries, several times, with pauses between each attempt.
- Enforce JSON Response From LLM. Enforces a JSON format response, validated against a prescribed JSON schema, and automatically retries if a response is invalid JSON.
- Automatically Fails Over To A Backup LLM. If the primary LLM's API response is consistently struggling to correctly process a particular prompt (e.g. repeated timeouts, repeated token limits exceeded, repeated JSON formatting issues), it automatically sends the request to a secondary LLM from the same family (if one has been specified).
- Reactively Crop Excessively Large Prompts. If the LLM response error indicates the prompt is too large, it shortens the prompt to the size instructed by the LLM's API response, if specified. Otherwise, it estimates the appropriate shorter size and applies that. Note, it only crops a prompt if it has been unable to fulfill the full prompt with the backup secondary LLM first, in case that has a larger accommodating limit.
- Provide LLM Invocation Statistics. Gathers LLM invocation statistics (successes, failures, retries, truncations, etc.), logging each event and then providing a final summary table of LLM statistics at the end of the tool's run.
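To make the retry and failover behaviour concrete, here is a heavily simplified sketch of that pattern; it is not the actual LLMRouter implementation, and the function names and error handling are assumptions for illustration:

```typescript
// Simplified sketch of retry-with-backoff plus failover to a secondary LLM.
// Not the real LLMRouter code - names and error classification are illustrative assumptions.
type LLMCall = (prompt: string) => Promise<string>;

async function invokeWithRetriesAndFailover(
  prompt: string,
  primary: LLMCall,
  secondary?: LLMCall,
  maxAttempts = 3,
  pauseMs = 2000,
): Promise<string> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // A real router would also enforce a timeout and validate the JSON response against a schema here.
      return await primary(prompt);
    } catch (error) {
      console.warn(`Primary LLM attempt ${attempt} failed:`, error);
      if (attempt < maxAttempts) {
        // Pause with a growing delay before retrying (e.g. after throttling or timeouts).
        await new Promise((resolve) => setTimeout(resolve, pauseMs * attempt));
      }
    }
  }
  if (secondary) {
    // Fail over to the backup model, which typically has a larger token limit.
    return secondary(prompt);
  }
  throw new Error("LLM invocation failed after retries and no secondary LLM is configured");
}
```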