Sorting big files that do not fit in memory drives us to use exeternal sorting methods that some common and widely used utilities - such as Unix sort - have.
" External sorting is a term for a class of sorting algorithms that can handle massive amounts of data." (Source: Wikipedia)
So, basically, we are using a "divide and conquer" strategy, sorting small chunks of the input and storing them into the disk. After that, we can read the first N bytes of each chunk and merge them into the output file. A high level diagram of this strategy is available here. Example:
chunk-1: [a, b, c, k, z]
chunk-2: [d, e, f]
Having these two ordered chunks, we should now select, comparing them, what is the order to write them into the output file. In this case, viewing the algorithm in high-level, we should have something like this:
1. a < d? Write `a` and advance in `chunk-1`
2. b < d? Write `b` and advance in `chunk-1`
3. c < d? Write `c` and advance in `chunk-1`
4. k < d? Write `d` and advance in `chunk-2`
5. k < e? Write `e` and advance in `chunk-2`
6. k < f? Write `f` and advance in `chunk-2`
7. Write `k`and advance in `chunk-1`
8. Write `z`and advance in `chunk-1``
9. Close the output file
The solution to this problem is written in Java8 and does not use any external library. The program has a Main
class that takes two arguments: the first one is the input file and the second one is the file where the output should be written. The architeture of the solution is the following:
ExternalSort
=> just a wrapper for the two steps (divide and conquer) of the algorithm;FileSplitter
=> splits a given input file, sorts its contents in N chunks and writes them into disk;SortedFileBuffer
=> utility data structure that has the list of chunks that are opened, a relationship between these buffers and the last line that was read from them and a set of buffers that were completely read;FileMerger
=> Using the above class, selects the entry to write to the output buffer, from the list of open and sorted buffers, based on alphabetic comparation.
There is a file test.txt
in the root directory of the project. It's a blacklist from somewhere on the Internet. It has some lines that are not sorted, for test and demonstration purposes.
NOTE: this implementation should not be used in production. There are, certainly, a some optimizations and corrections to do. This is just a proof-of-concept.