Reduce disk fragmentation in direct write mode on Windows #195
Note: this problem affects only Windows.
An NZBGet user has reported that downloaded files have high fragmentation when option DirectWrite is active. An investigation showed that sparse files tend to be highly fragmented, even when they are written sequentially.
In classic direct write mode NZBGet uses sparse files to avoid preallocating files with zeroes. Allocating a file with zeroes means that the same file segments are written to disk twice: first when allocating disk space and again when actually writing the downloaded data. Writing twice is of course a big performance disadvantage. That's why NZBGet uses sparse files in direct write mode.
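To illustrate the sparse-file idea, here is a minimal, hypothetical Python sketch (not NZBGet code): seeking past the end of a file and writing creates a "hole" that reads back as zeroes without those zeroes ever being physically written. On most Unix filesystems this happens automatically; on NTFS the file must be explicitly marked sparse, which is what NZBGet's direct write mode does.

```python
import os
import tempfile

# Hypothetical demo: create a file with a hole instead of writing zeroes.
path = os.path.join(tempfile.mkdtemp(), "sparse_demo.bin")
with open(path, "wb") as f:
    f.seek(1024 * 1024)       # jump 1 MiB past the start without writing
    f.write(b"segment-data")  # only these 12 bytes are physically written

with open(path, "rb") as f:
    head = f.read(16)         # the hole reads back as zeroes

print(os.path.getsize(path))  # logical size: 1 MiB + 12 bytes
print(head == b"\x00" * 16)   # True: unwritten region appears zero-filled
```

The logical file size covers the whole range, but the hole occupies no disk sectors until data is written into it, which is exactly what makes such files prone to fragmentation later.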
However, when using direct write mode with active article cache the files are typically written sequentially, at once, when all segments are downloaded (and stored in memory cache). In that case we don't really need output files to be sparse files.
After investigating that a bit I’ve found out that Windows (NTFS) stores two sizes for a file: normal file size and valid data size.
If an application reads beyond the valid data size pointer, it receives zeroes, as if they had been written to disk.
When an application writes to the file starting at the valid data size position, Windows writes to the preallocated disk sectors and advances the valid data size pointer. This avoids unnecessary zeroing.
If, however, an application writes somewhere beyond the valid data size pointer, Windows writes zeroes from the last valid data position up to the current write position (and moves the valid data size pointer there).
That means that preallocating the file and writing it from the beginning completely eliminates the zeroing stage, while still providing the advantage of unfragmented, preallocated disk space.
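To make the valid-data-size behavior concrete, here is a small, hypothetical Python model (a simulation, not the actual NTFS implementation) that tracks the pointer and counts how many bytes of zeroes the filesystem would have to write physically for each access pattern:

```python
class ValidDataModel:
    """Toy model of the NTFS valid data size (VDS) pointer
    for a preallocated, non-sparse file."""

    def __init__(self, allocated_size):
        self.allocated_size = allocated_size
        self.vds = 0             # valid data size pointer
        self.zeroes_written = 0  # bytes of zeroes physically written

    def write(self, offset, length):
        if offset > self.vds:
            # Writing beyond VDS: the gap is zero-filled on disk first.
            self.zeroes_written += offset - self.vds
        self.vds = max(self.vds, offset + length)

# Sequential writes from the start: VDS advances, no zeroing needed.
seq = ValidDataModel(100)
for off in range(0, 100, 10):
    seq.write(off, 10)
print(seq.zeroes_written)  # 0

# A random write far beyond VDS forces the gap to be zero-filled.
rnd = ValidDataModel(100)
rnd.write(90, 10)
print(rnd.zeroes_written)  # 90
```

The model shows why the write order matters: the same 10 bytes of data cost nothing extra when written in sequence, but trigger 90 bytes of zeroing when written at a random far position.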
The scenario "preallocate space, then write at random file positions" is bad performance-wise, because Windows will need to zero some data on disk; classic DirectWrite mode in NZBGet falls into this scenario. However, if the article cache is active there are no random writes: all segments are written after the file is complete, sequentially from the beginning of the file. This means we can use preallocation (without sparse files) in DirectWrite mode when the article cache is active.
In the ideal case the file is written sequentially from the first segment to the last. No zeroes are written in that case.
If, when the cache is flushed, some segments are not yet complete, the system will write zeroes to disk for them. Later, when those segments are downloaded, they will be written to the file. In this unfortunate case the same disk sectors are written twice: once with zeroes, then with actual data.
The question is which is worse: a fragmented sparse file, or an unfragmented file with occasional unnecessary double writes. I expect the latter to be the better strategy overall.
For example, suppose we are writing 90 segments of a 100-segment file (10 segments are stuck). We set the file pointer to the position of the first segment and write its data, then do the same for the second segment, and so on. At some point we need to skip a stuck segment, so we set the pointer to the next segment and write its data. The system sees that we skipped a segment, writes zeroes to disk for that segment, then writes our data. Since the whole file is unfragmented, the writing of the zeroes and of the real (next) segment will probably be performed by the disk as one operation. In other words, the zeroing comes at virtually no cost. Hopefully. Although we only needed to write 90 segments, we have written 100 (10 of them with zeroes). But because the file is unfragmented, the total time for this operation was likely less than writing 90 segments of a fragmented sparse file.
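The example above can be checked with a quick back-of-the-envelope calculation (the segment size is an assumption for illustration, not a measurement, and it assumes the stuck segments lie before later-written ones, so each skip triggers zero-filling):

```python
SEGMENT_SIZE = 500_000  # assumed segment size in bytes (hypothetical)
TOTAL, STUCK = 100, 10  # 100 segments in the file, 10 not yet downloaded

# Sequential flush over a preallocated (non-sparse) file: each stuck
# segment we skip past is zero-filled by the filesystem.
data_bytes = (TOTAL - STUCK) * SEGMENT_SIZE
zero_bytes = STUCK * SEGMENT_SIZE

print(data_bytes)               # 45000000 bytes of real data
print(zero_bytes)               # 5000000 bytes of zeroes
print(data_bytes + zero_bytes)  # 50000000 bytes written in total
```

So the overhead is about 11% extra bytes written, but all of it sequential; the bet is that this beats 90 scattered writes into a fragmented sparse file.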
That's all guesswork. A real test would be to download a big NZB (several gigabytes) with different settings multiple times and compare the results (download time, unpack time).