
Bug fixes #402

Merged
ianarawjo merged 11 commits into ianarawjo:ragforge from dxa204:ragforge
Dec 11, 2025

Conversation


@dxa204 dxa204 commented Nov 20, 2025

  • Fixed the chunkMethod is "(unspecified)" error for vector store methods in the Retriever by adding chunkMethod to the normal vars in the embeddings loops.
  • Fixed the OpenAI Tiktoken modal settings bug (see ChunkMethodSchema.tsx), and the 'model' field also supports tokenizers.
  • Fixed the "ENTITY TOO LARGE" error by sending the data as a file instead of plain JSON (mirrors Media Node).
  • Fixed the embedding method error bug — a meaningful message is now returned to the frontend instead of silently failing.
  • Fixed the Vis Node y-axis alignment bug.
  • Fixed the threshold bug for the semantic chunker — the default value (0.8) is now used instead of "auto".
  • Fixed the deleted file lingering bug in the backend by using MediaLookup.remove() (mirrors Media Node).
  • Fixed the Late Chunker bug by stripping add_special_tokens, as newer versions of sentence-transformers raise an error when it’s unsupported, and Chonkie’s late chunker sets it by default.
  • Fixed the cf-docs example flow to use {#query}, and updated the book Q&A example to use a supported chunking method, since spaCy is no longer built-in.
  • Fixed the Chonkie sentence splitter method; it now splits by sentence.
  • Set the default item of the retriever's inspector to 'retrievalMethod'
  • Regrouped the chunk method under new headings
  • Added tooltips to the chunk and retrieval methods
  • Deleted the SDPM Chunker, as it is deprecated; it is now integrated into the Semantic Chunker. skip_window = 0 runs the normal semantic chunker, skip_window = 1 runs SDPM.
  • Modified the pop-up that appears after the retrieval node is run to warn about the load size and the potentially long processing time when a vector store/embedding method is selected and the number of chunks is above a certain threshold.
  • Showed the runtime of each retrieval method.
  • Added a progress bar to the retrieval node.
  • Added a status indicator (green tick, loading spinner, etc.) to the retrieval node.
  • Added GitHub repo and documentation links in the settings under 'Advanced'.
  • Fixed the purple button in the tabular data node to collapse when it loses focus.
  • CSS tweak to the prompt node's uploaded-documents list: the scroll bar was overlapping with the trash icon.
  • Removed the loading-spinner screen that froze the retrieval node while running, and added a stop button.
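The Late Chunker fix above can be sketched as a small kwargs sanitizer (hypothetical helper name; the actual patch lives in chunkers.py):

```python
def sanitize_tokenizer_kwargs(kwargs):
    """Drop arguments that newer sentence-transformers versions reject.

    Chonkie's late chunker passes add_special_tokens by default, which
    newer tokenizer wrappers raise on; stripping it restores compatibility.
    """
    cleaned = dict(kwargs)  # copy so the caller's dict is untouched
    cleaned.pop("add_special_tokens", None)
    return cleaned


if __name__ == "__main__":
    opts = {"truncation": True, "add_special_tokens": True}
    print(sanitize_tokenizer_kwargs(opts))  # {'truncation': True}
```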

schema: {
type: "object",
required: ["max_tokens", "overlap_tokens"],
required: ["model", "chunk_size", "chunk_overlap"],
Owner:

I'm confused, why is gpt-3.5-turbo listed under tiktoken? Tiktoken models aren't ChatGPT, or am I wrong somehow?

Author:

It’s an OpenAI model name that tiktoken understands and maps to the correct tokenizer. See encoding_for_model() in chunkers.py
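That lookup can be illustrated with a tiny mapping sketch (an illustrative subset only; the real call is tiktoken's encoding_for_model(), which raises KeyError for unknown models, and the fallback encoding here is an assumption):

```python
def resolve_encoding_name(model_name: str) -> str:
    """Map an OpenAI model name to a tiktoken encoding name.

    Illustrative stand-in for tiktoken.encoding_for_model(); unknown
    models fall back to cl100k_base here rather than raising.
    """
    known = {
        "gpt-4": "cl100k_base",
        "gpt-3.5-turbo": "cl100k_base",
        "text-davinci-003": "p50k_base",
    }
    return known.get(model_name, "cl100k_base")
```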

const fullText = StringLookup.get(fileInfo.text) ?? "";
const textBlob = new Blob([fullText], { type: "text/plain" });

formData.append("document", textBlob);
Owner:

Has this been tested? I wonder how the upload node is passing files---is it doing a form submission, and the issue was isolated to the ChunkNode as here?

Author:

UploadNode uses MediaLookup.upload(file) and MediaLookup.getAsText(uid) to convert the uploaded file into plain text, which it stores in field.text. It doesn’t submit a form or send a "document" field.

The updated code wraps the string from StringLookup in a Blob and appends it as "document".

The earlier ENTITY TOO LARGE error was coming from the fetch request in ChunkNode failing, not from the upload node.

<IconUpload size={40} color="#888" />
<Text size="sm" color="dimmed">
Drag & drop files here or click to upload
Drag & drop files here or click to upload (.pdf, .docx, .txt, .md)
Owner:

Does .txt work?

Author:

Yes, heart_of_darkness is a .txt file.

const past_shortnames_counts: Dict<number> = {};
const shortnames: Dict<string> = {};
const max_lines = 8;
const max_lines = 4;
Owner:

This is a really big change... please revert. This number was carefully chosen.

const truncated = truncStr(label, maxChars) ?? "";
return addLineBreaks(truncated, maxCharsPerLine);
};

Owner:

So the y-axis labels should already be wrapping, due to max_chars_per_line. I'm not sure why this is required.

The issue that appeared wasn't about the wrapping, as I understood it, but about the gutter/y-margin not being sized correctly. It didn't have its width set for some reason.

const shortnames = genUniqueShortnames(names);

const yLabelNames = new Set(responses.map(resp_to_x));
const yLabelShortnames = genUniqueShortnames(yLabelNames);
Owner:

The shortnames are already y-axis names... I'm confused what this adds.

const yLabelShortnames = genUniqueShortnames(yLabelNames);

for (const name of names) {
const shortname = wrapYAxisLabel(shortnames[name]);
Owner:

Shortnames were already calculated and wrapped by genUnique.

tickvals: marginLabels,
ticktext: wrappedMarginLabels,
};
layout.margin.l = calcLeftPaddingForYLabels(wrappedMarginLabels);
Owner:

The left padding is indeed what needs to be considered, but it shouldn't be necessary to add a new custom function for it. genUnique already wraps the text.

@ianarawjo ianarawjo left a comment

See in-line comments.

settings[key] = float(value)
elif key in known_bool_params:
# Handle boolean conversion robustly
settings[key] = value.lower() in ['true', '1', 't', 'y', 'yes']
Owner:

'y', '1', and 't' should be removed here; these aren't valid identifiers for boolean expressions.
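One way to honor that request while making bad input loud (hypothetical helper; the actual conversion is an inline expression in flask_app.py, which treats anything unrecognized as False):

```python
def parse_bool(value: str) -> bool:
    """Strictly parse a boolean setting, accepting only 'true' and 'yes'.

    Shorthand forms like '1', 't', and 'y' are rejected rather than
    silently coerced, so typos fail loudly instead of flipping a flag.
    """
    normalized = value.strip().lower()
    if normalized in ("true", "yes"):
        return True
    if normalized in ("false", "no"):
        return False
    raise ValueError(f"Not a recognized boolean value: {value!r}")
```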

const shortnames = genUniqueShortnames(names);
const yLabelShortnames = genUniqueShortnames(
new Set(responses.map(resp_to_x)),
);
Author:

The issue here wasn’t padding — the y-axis labels weren’t wrapping and were getting clipped (falling out of the node) in categorical/open-ended plots.

In this code, names contains the categorical values, so genUniqueShortnames(names) only wrapped those category labels. The actual y-axis text comes from resp_to_x(r) (the full query strings), which were never passed through genUniqueShortnames.

yLabelShortnames applies the same wrapping logic to the y-labels themselves, so they wrap instead of being clipped.
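The wrapping being applied can be sketched as a pure function (an illustrative stand-in for what genUniqueShortnames does in VisNode.tsx; Plotly renders <br> inside tick text as a line break):

```python
def wrap_label(label: str, max_chars_per_line: int = 20) -> str:
    """Insert <br> breaks so a long axis label wraps instead of clipping."""
    words = label.split()
    lines, current = [], ""
    for word in words:
        candidate = f"{current} {word}".strip()
        # Start a new line once the current one would overflow.
        if len(candidate) > max_chars_per_line and current:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return "<br>".join(lines)
```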

Copilot AI review requested due to automatic review settings December 4, 2025 03:53

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive set of bug fixes and UX improvements for the ChainForge RAG (Retrieval Augmented Generation) system, focusing on chunking, embedding, and retrieval operations. The changes span both the React frontend and Flask backend, addressing issues ranging from minor UI tweaks to significant architectural improvements in how data is processed and displayed.

Key changes:

  • Consolidated the deprecated SDPM chunker into the Semantic chunker with a configurable skip_window parameter
  • Added real-time progress tracking and a stop button for retrieval operations with visual progress bars
  • Fixed the "ENTITY TOO LARGE" error by sending large text documents as file uploads instead of form fields
  • Improved error handling and user feedback throughout the retrieval pipeline

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 22 comments.

Summary per file:
  • tests/test_chunking.py: Removed SDPM chunker tests following consolidation into the semantic chunker
  • chainforge/react-server/src/VisNode.tsx: Fixed y-axis label alignment by applying shortnames to y-axis labels
  • chainforge/react-server/src/UploadNode.tsx: Added MediaLookup.remove() calls to properly clean up deleted files; fixed scroll overlap with trash icon
  • chainforge/react-server/src/RetrievalNode.tsx: Major refactor: added progress tracking, status indicators, stop button, and improved confirmation messages
  • chainforge/react-server/src/RetrievalMethodSchemas.tsx: Added descriptive tooltips for all retrieval methods
  • chainforge/react-server/src/RetrievalMethodListComponent.tsx: Added latency display badges for retrieval methods
  • chainforge/react-server/src/LLMResponseInspectorModal.tsx: Added defaultTableColVar prop passthrough
  • chainforge/react-server/src/LLMResponseInspector.tsx: Implemented defaultTableColVar logic for retrieval node inspector
  • chainforge/react-server/src/GlobalSettingsModal.tsx: Added GitHub repository and documentation links in advanced settings
  • chainforge/react-server/src/ChunkNode.tsx: Changed to send text as a file blob to avoid entity size limits
  • chainforge/react-server/src/ChunkMethodSchemas.tsx: Consolidated SDPM into semantic chunker, reorganized into semantic groups, updated defaults and descriptions
  • chainforge/react-server/src/ChunkMethodListComponent.tsx: Added tooltip support for chunk methods
  • chainforge/react-server/src/AiPopover.tsx: Fixed purple button to collapse when clicked out of focus
  • chainforge/rag/embeddings.py: Fixed OpenAI client initialization to properly pass API key
  • chainforge/rag/chunkers.py: Consolidated SDPM into semantic chunker, fixed tiktoken tokenizer fallback, patched sentence-transformers compatibility, updated sentence chunker defaults
  • chainforge/flask_app.py: Changed chunk endpoint to accept file uploads, added progress tracking endpoint, improved error messages, added latency tracking


setDataPropsForNode(id, { fields: [], output: [] });
setStatus(Status.READY);
}, [id, setDataPropsForNode]);
}, [fields, id, setDataPropsForNode]);
Copilot AI Dec 4, 2025

The fields dependency was added to the handleClearUploads callback (line 171), which means this callback will be recreated every time the fields array changes. Since this callback is likely passed to child components or used in event handlers, this could cause unnecessary re-renders. Consider using a ref to access the current fields value instead, or move the UID collection logic outside the callback.

Comment on lines +891 to +892
const latency =
props.methodResults?.[item.key]?.metavars?.latency;
Copilot AI Dec 4, 2025

The methodResults prop is now being used to display latency for individual retrieval methods (lines 891-892), but the retrieval method results structure uses latency_ms as a string (e.g., "123.45ms") in the metavars. The code correctly extracts this, but the naming is inconsistent - it's called latency_ms but already includes the "ms" suffix in the value. Consider either:

  1. Storing the numeric value and formatting it for display separately
  2. Renaming the field to just latency since the unit is already included

Comment on lines +184 to +207
pollIntervalRef.current = window.setInterval(async () => {
try {
const resp = await fetch(`${FLASK_BASE_URL}getRetrieveProgress`);
if (currentRunId !== runIdRef.current) return;
const data = await resp.json();

let currentProgress = 0;
if (typeof data === "number") {
currentProgress = data;
} else if (data && typeof data === "object") {
// Sum all values in the object (assuming they are numbers representing % completion)
const values = Object.values(data) as number[];
currentProgress = values.reduce(
(acc, val) => acc + (typeof val === "number" ? val : 0),
0,
);
}

// Clamp between 5 and 95 so it doesn't look finished until it actually is
setProgress(Math.min(95, Math.max(5, currentProgress)));
} catch (e) {
console.warn("Could not fetch progress", e);
}
}, 500);
Copilot AI Dec 4, 2025

There's no cleanup effect to clear the polling interval when the component unmounts. If the RetrievalNode is unmounted while a retrieval is running (e.g., user navigates away or deletes the node), the interval created on line 184 will continue running indefinitely, causing a memory leak. Add a useEffect cleanup function to clear pollIntervalRef.current on unmount.

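The aggregation and clamping in that polling loop can be mirrored as a pure function (Python here purely for illustration; the node itself is TypeScript):

```python
def aggregate_progress(data) -> float:
    """Sum per-method progress values and clamp to the 5-95 band.

    The clamp keeps the bar from reading 0% or 100% while work is
    still in flight, matching the RetrievalNode polling code.
    """
    if isinstance(data, (int, float)):
        total = float(data)
    elif isinstance(data, dict):
        # Sum only numeric values, ignoring anything malformed.
        total = sum(v for v in data.values() if isinstance(v, (int, float)))
    else:
        total = 0.0
    return min(95.0, max(5.0, total))
```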
@@ -190,8 +190,10 @@ export const ChonkieSentenceSchema: ModelSettingsDict = {
},
Copilot AI Dec 4, 2025

The default value for chunk_size has been changed from 512 to 1. While the description explains this is "to keep each chunk to a single sentence," this is a significant behavioral change that could affect existing users' workflows. Consider whether this breaking change is intentional and if it should be noted in migration documentation.

Suggested change
},
},
// BREAKING CHANGE: The default value for `chunk_size` has been changed from 512 to 1 to keep each chunk to a single sentence.
// This is a significant behavioral change and should be noted in migration documentation or the changelog.

console.error(err);
setRunTooltip("Error checking inputs.");
}
}, [pullInputData, id, status, methodItems]);
Copilot AI Dec 4, 2025

The function handleRunHover is defined with useCallback but doesn't include methodItems.length in the dependency array (line 434). Since the function references methodItems.length on line 419, it should be included in the dependencies to ensure the callback is updated when the number of methods changes.

Suggested change
}, [pullInputData, id, status, methodItems]);
}, [pullInputData, id, status, methodItems, methodItems.length]);

client = OpenAI()

print(f"Using OpenAI model: {model_name} for {len(texts)} texts")
import os
Copilot AI Dec 4, 2025

The OpenAI client is now constructed with the API key directly (line 118), which is correct. However, the code still imports os (line 110) which is used to get the environment variable, but the import statement appears after the try block starts. The import should be at the module level or before the try block for clarity and proper scoping.

Suggested change
import os

default: 1,
title: "Max tokens per chunk",
description:
"Default 1 keeps each chunk to a single sentence. Increase to group multiple sentences up to the given token count.",
Copilot AI Dec 4, 2025

The chunk_size default has changed from 512 to 1 for sentence chunking. This means by default, each "chunk" will be 1 token, not 1 sentence as the description suggests (line 196). The description says "Default 1 keeps each chunk to a single sentence" but the parameter is "Max tokens per chunk", which would split sentences into individual tokens.

Looking at the backend code (chunkers.py line 210), this parameter is used in the SentenceChunker constructor. You should verify whether Chonkie's SentenceChunker interprets chunk_size=1 as "1 sentence" or "1 token". If it means "1 token", the description is misleading.

Suggested change
"Default 1 keeps each chunk to a single sentence. Increase to group multiple sentences up to the given token count.",
"Default 1 splits text into single-token chunks. Increase to group more tokens (and potentially multiple sentences) per chunk.",

settings[key] = float(value)
elif key in known_bool_params:
# Handle boolean conversion robustly
settings[key] = value.lower() in ['true', 'yes']
Copilot AI Dec 4, 2025

The boolean conversion logic has been simplified (line 1425 and 1934) to only check for 'true' and 'yes', removing the checks for '1', 't', and 'y'. This is a breaking change if any existing code or saved configurations use these shortened forms. Consider maintaining backward compatibility by keeping all the original checks.

Suggested change
settings[key] = value.lower() in ['true', 'yes']
settings[key] = value.lower() in ['true', 'yes', '1', 't', 'y']

Comment on lines +545 to +555
let effectiveTableColVar = tableColVar;

if (
ignoreAndHideLLMField &&
!userSelectedTableCol &&
tableColVar === "$LLM"
) {
effectiveTableColVar = "retrievalMethod";
setTableColVar("retrievalMethod");
}

Copilot AI Dec 4, 2025

The logic for setting effectiveTableColVar (lines 545-554) updates the state variable tableColVar within the effect, which could trigger additional re-renders. Since this is conditional logic that runs in an effect, consider whether this should be computed as a derived value instead of setting state, or ensure the effect dependencies are correct to avoid infinite loops.

Suggested change
let effectiveTableColVar = tableColVar;
if (
ignoreAndHideLLMField &&
!userSelectedTableCol &&
tableColVar === "$LLM"
) {
effectiveTableColVar = "retrievalMethod";
setTableColVar("retrievalMethod");
}
// Compute effectiveTableColVar as a derived value to avoid unnecessary state updates
const effectiveTableColVar = useMemo(() => {
if (
ignoreAndHideLLMField &&
!userSelectedTableCol &&
tableColVar === "$LLM"
) {
return "retrievalMethod";
}
return tableColVar;
}, [ignoreAndHideLLMField, userSelectedTableCol, tableColVar]);

Comment on lines +1582 to +1586
def find_query_metadata(query_text, queries):
for q in queries:
if isinstance(q, dict) and q.get("text") == query_text:
return q
return {}
Copilot AI Dec 4, 2025

Variable find_query_metadata is not used.

Suggested change
def find_query_metadata(query_text, queries):
for q in queries:
if isinstance(q, dict) and q.get("text") == query_text:
return q
return {}

@ianarawjo ianarawjo merged commit e6cfb7a into ianarawjo:ragforge Dec 11, 2025
gauranshkumar pushed a commit to gauranshkumar/ChainForge that referenced this pull request Dec 30, 2025
* Bug fixes

* Small fix

* Reverted VisNode and added minimal tweak to fix y-axis alignment bug

* Bug fixes

* Chunk methods menu reordering and deleted SDPM Chunker as it depreciated

* Implementation of study feedback

* Retrieval pop up message fix

* Small fix

* message fix

* implement copilot suggestions

* Revert "implement copilot suggestions"

This reverts commit 6d14908.