fix: prevent temp file leak in document extractor on API failure#33246
fix: prevent temp file leak in document extractor on API failure#33246alvinttang wants to merge 1 commit intolanggenius:mainfrom
Conversation
…api fails The _extract_text_from_doc, _extract_text_from_ppt, _extract_text_from_pptx, and _extract_text_from_epub functions create temporary files with delete=False but only call os.unlink() on the success path. If partition_via_api() or the file open raises an exception, the temporary file is never cleaned up, causing a resource leak on disk. Move os.unlink() into a finally block so the temp file is always removed regardless of whether the API call succeeds or fails. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical issue in the document extractor node where temporary files were not properly cleaned up when the Unstructured API encountered an error. By refactoring the file deletion logic into Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a critical temporary file leak issue that occurs when API calls fail during document extraction. This fix is crucial as the temporary file leak could have led to a Denial of Service. My review suggests simplifying the temporary file handling by leveraging Python's tempfile context manager more idiomatically. This approach would make the code cleaner, more robust, and consistently apply the improved pattern across all affected document types (DOC, PPT, PPTX, and EPUB).
| with tempfile.NamedTemporaryFile(suffix=".doc", delete=False) as temp_file: | ||
| temp_file.write(file_content) | ||
| temp_file.flush() | ||
| try: | ||
| with open(temp_file.name, "rb") as file: | ||
| elements = partition_via_api( | ||
| file=file, | ||
| metadata_filename=temp_file.name, | ||
| api_url=unstructured_api_config.api_url, | ||
| api_key=api_key, | ||
| ) | ||
| finally: | ||
| os.unlink(temp_file.name) | ||
| return "\n".join([getattr(element, "text", "") for element in elements]) |
There was a problem hiding this comment.
While the try...finally block correctly fixes the temporary file leak, this can be simplified by using the tempfile.NamedTemporaryFile context manager with its default behavior (delete=True). This avoids manual file deletion and makes the code more concise and idiomatic.
By using the with statement for the temporary file to encompass the API call, the file will be automatically closed and deleted upon exiting the block, whether it's successful or an exception occurs.
with tempfile.NamedTemporaryFile(suffix=".doc") as temp_file:
temp_file.write(file_content)
temp_file.flush()
temp_file.seek(0)
elements = partition_via_api(
file=temp_file,
metadata_filename=temp_file.name,
api_url=unstructured_api_config.api_url,
api_key=api_key,
)
return "\n".join([getattr(element, "text", "") for element in elements])| with tempfile.NamedTemporaryFile(suffix=".ppt", delete=False) as temp_file: | ||
| temp_file.write(file_content) | ||
| temp_file.flush() | ||
| try: | ||
| with open(temp_file.name, "rb") as file: | ||
| elements = partition_via_api( | ||
| file=file, | ||
| metadata_filename=temp_file.name, | ||
| api_url=unstructured_api_config.api_url, | ||
| api_key=api_key, | ||
| ) | ||
| finally: | ||
| os.unlink(temp_file.name) |
There was a problem hiding this comment.
While the try...finally block correctly fixes the temporary file leak, this can be simplified by using the tempfile.NamedTemporaryFile context manager with its default behavior (delete=True). This avoids manual file deletion and makes the code more concise and idiomatic.
By using the with statement for the temporary file to encompass the API call, the file will be automatically closed and deleted upon exiting the block, whether it's successful or an exception occurs.
with tempfile.NamedTemporaryFile(suffix=".ppt") as temp_file:
temp_file.write(file_content)
temp_file.flush()
temp_file.seek(0)
elements = partition_via_api(
file=temp_file,
metadata_filename=temp_file.name,
api_url=unstructured_api_config.api_url,
api_key=api_key,
)| with tempfile.NamedTemporaryFile(suffix=".pptx", delete=False) as temp_file: | ||
| temp_file.write(file_content) | ||
| temp_file.flush() | ||
| try: | ||
| with open(temp_file.name, "rb") as file: | ||
| elements = partition_via_api( | ||
| file=file, | ||
| metadata_filename=temp_file.name, | ||
| api_url=unstructured_api_config.api_url, | ||
| api_key=api_key, | ||
| ) | ||
| finally: | ||
| os.unlink(temp_file.name) |
There was a problem hiding this comment.
While the try...finally block correctly fixes the temporary file leak, this can be simplified by using the tempfile.NamedTemporaryFile context manager with its default behavior (delete=True). This avoids manual file deletion and makes the code more concise and idiomatic.
By using the with statement for the temporary file to encompass the API call, the file will be automatically closed and deleted upon exiting the block, whether it's successful or an exception occurs.
with tempfile.NamedTemporaryFile(suffix=".pptx") as temp_file:
temp_file.write(file_content)
temp_file.flush()
temp_file.seek(0)
elements = partition_via_api(
file=temp_file,
metadata_filename=temp_file.name,
api_url=unstructured_api_config.api_url,
api_key=api_key,
)| with tempfile.NamedTemporaryFile(suffix=".epub", delete=False) as temp_file: | ||
| temp_file.write(file_content) | ||
| temp_file.flush() | ||
| try: | ||
| with open(temp_file.name, "rb") as file: | ||
| elements = partition_via_api( | ||
| file=file, | ||
| metadata_filename=temp_file.name, | ||
| api_url=unstructured_api_config.api_url, | ||
| api_key=api_key, | ||
| ) | ||
| finally: | ||
| os.unlink(temp_file.name) |
There was a problem hiding this comment.
While the try...finally block correctly fixes the temporary file leak, this can be simplified by using the tempfile.NamedTemporaryFile context manager with its default behavior (delete=True). This avoids manual file deletion and makes the code more concise and idiomatic.
By using the with statement for the temporary file to encompass the API call, the file will be automatically closed and deleted upon exiting the block, whether it's successful or an exception occurs.
with tempfile.NamedTemporaryFile(suffix=".epub") as temp_file:
temp_file.write(file_content)
temp_file.flush()
temp_file.seek(0)
elements = partition_via_api(
file=temp_file,
metadata_filename=temp_file.name,
api_url=unstructured_api_config.api_url,
api_key=api_key,
)
Summary
partition_via_api()raises an exceptionRoot Cause
The
_extract_text_from_doc,_extract_text_from_ppt,_extract_text_from_pptx, and_extract_text_from_epubfunctions inapi/dify_graph/nodes/document_extractor/node.pycreate temporary files withdelete=Falsebut only callos.unlink()on the success path.If
partition_via_api()raises an exception (e.g., network timeout, API error, malformed response), execution jumps to the outerexceptblock and theos.unlink()call is skipped. The temporary file is never cleaned up, leaking disk space.In long-running Dify deployments that process many documents, this can accumulate significant orphaned temp files in the system's temp directory.
Fix
Move
os.unlink()into afinallyblock so the temporary file is always removed regardless of whether the API call succeeds or fails.Before:
After:
Test plan
/tmpafter processing documents with API failures🤖 Generated with Claude Code