Description
I use this API to create a batch embedding task:
https://api.openai.com/v1/batches
After the task completes, I download the embedding result file by its output_file_id using the following API:
https://api.openai.com/v1/files/{file_id}/content
Recently, the project started losing data. On investigation, the cause turned out to be incompletely downloaded result files.
The program submits 5,000 items per batch, but the result file downloaded locally contained only a few hundred items, and the download failed with the following message:
Premature end of Content-Length delimited message body (expected: 391505535; received: 113446592)
I also tried to retrieve the result file in chunks, but that failed as well, and I could only get partial data.
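For anyone reproducing the chunked attempt, here is a minimal sketch of how the byte ranges could be planned. It only builds the `Range` header values; `RangePlanner` and `planRanges` are hypothetical names, and whether the files content endpoint honors `Range` requests is an assumption that the issue does not confirm. A server that supports ranges answers with 206 Partial Content, so a 200 response to a ranged request would mean the range was ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class RangePlanner {

    /**
     * Builds HTTP Range header values that cover [0, totalLength)
     * in pieces of at most chunkSize bytes. Range ends are inclusive.
     */
    public static List<String> planRanges(long totalLength, long chunkSize) {
        if (totalLength < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("totalLength must be >= 0 and chunkSize > 0");
        }
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < totalLength; start += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            long end = Math.min(start + chunkSize, totalLength) - 1;
            ranges.add("bytes=" + start + "-" + end);
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Small illustrative values: a 10-byte file fetched 4 bytes at a time.
        for (String range : planRanges(10, 4)) {
            System.out.println("Range: " + range);
        }
    }
}
```

Each value would be sent as the `Range` header of a separate request, and the response bodies appended in order. The total length could come from the `Content-Length` of an initial response.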
The following is the Java code I originally used to download the file. Until recently, this method downloaded the result file completely.
public static int downloadFile(String outputFilePath, String url, Map<String, String> headers) throws IOException {
    HttpGet request = new HttpGet(url);
    if (headers != null) {
        headers.forEach(request::addHeader);
    }
    CloseableHttpResponse execute = httpClient.execute(request);
    int statusCode = execute.getStatusLine().getStatusCode();
    HttpEntity entity = execute.getEntity();
    InputStream content = entity.getContent();
    try (FileOutputStream outputStream = new FileOutputStream(outputFilePath)) {
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = content.read(buffer)) != -1) {
            outputStream.write(buffer, 0, bytesRead);
        }
        System.out.println("File downloaded to: " + outputFilePath);
    } catch (IOException e) {
        System.err.println("Error writing to file: " + e.getMessage());
    } finally {
        try {
            content.close();
        } catch (IOException e) {
            System.err.println("Error closing input stream: " + e.getMessage());
        }
    }
    return statusCode;
}
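Note that the method above logs the IOException and still returns the status code, so the caller sees 200 even when the body was cut short. One possible way to make truncation visible is to count the bytes written and compare them with the expected length. The sketch below is not part of the original code: `VerifiedCopy` and `copyAndVerify` are hypothetical names, and it assumes the caller passes the response's `Content-Length` (or a negative value when it is unknown).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class VerifiedCopy {

    /**
     * Copies everything from in to out and returns the number of bytes copied.
     * If expectedLength is non-negative and differs from the number of bytes
     * actually copied, an IOException is thrown instead of returning normally.
     */
    public static long copyAndVerify(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
            total += bytesRead;
        }
        // A negative expectedLength means "unknown", so the check is skipped.
        if (expectedLength >= 0 && total != expectedLength) {
            throw new IOException("Incomplete download: expected " + expectedLength
                    + " bytes, received " + total);
        }
        return total;
    }
}
```

With this shape, any read error or length mismatch propagates to the caller, which can then delete the partial file and retry rather than treating it as a finished download.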
I also tested the download on the server with wget. Each attempt downloaded part of the file, then the connection closed and wget retried, repeating in a loop.
[root@scripts]# wget --header="Authorization: Bearer token" \
> -O file.jsonl \
> https://api.openai.com/v1/files/{fileId}/content
--2025-11-13 02:59:44-- https://api.openai.com/v1/files/{fileId}/content
Resolving api.openai.com (api.openai.com)... 162.159.140.245, 172.66.0.243
Connecting to api.openai.com (api.openai.com)|162.159.140.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 730814739 (697M) [application/octet-stream]
Saving to: ‘file.jsonl’
file.jsonl 10%[===============> ] 73.76M 25.1MB/s in 2.9s
2025-11-13 03:00:33 (25.1 MB/s) - Connection closed at byte 77340672. Retrying.
--2025-11-13 03:00:36-- (try: 2) https://api.openai.com/v1/files/{fileId}/content
Connecting to api.openai.com (api.openai.com)|162.159.140.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 730814739 (697M) [application/octet-stream]
Saving to: ‘file.jsonl’
file.jsonl 3%[====> ] 23.03M 24.7MB/s in 0.9s
2025-11-13 03:00:51 (24.7 MB/s) - Connection closed at byte 77340672. Retrying.
--2025-11-13 03:00:55-- (try: 3) https://api.openai.com/v1/files/{fileId}/content
Connecting to api.openai.com (api.openai.com)|162.159.140.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 730814739 (697M) [application/octet-stream]
Saving to: ‘file.jsonl’
file.jsonl 6%[========> ] 42.78M 19.3MB/s in 2.2s
2025-11-13 03:01:10 (19.3 MB/s) - Connection closed at byte 77340672. Retrying.
HTTP request sent, awaiting response... ^C
[root@scripts]# wc -l file.jsonl
851 file.jsonl
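The same record-count check can be done inside the program before the results are consumed. This is a rough sketch that assumes the result file holds one JSON record per line and that the expected count equals the number of items submitted in the batch (5,000 here); `JsonlLineCheck`, `countRecords`, and `isComplete` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsonlLineCheck {

    /** Counts the non-blank lines in a JSONL file. */
    public static long countRecords(Path file) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    count++;
                }
            }
            return count;
        }
    }

    /** Returns true when the file holds exactly the expected number of records. */
    public static boolean isComplete(Path file, long expectedRecords) throws IOException {
        return countRecords(file) == expectedRecords;
    }
}
```

This is a coarse check: a final line that was cut off mid-record is still counted, so it complements a byte-length comparison rather than replacing it.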