
Implement memory-efficient in-file binary search for CDXJ indexes #604

Closed
ibnesayeed opened this issue Feb 1, 2019 · 28 comments · Fixed by #631

@ibnesayeed
Member

Currently, the complete index is loaded into memory for faster lookup, but this does not scale for large index files because memory usage grows linearly with the size of the index. We can instead implement an in-file binary search to keep memory usage constant for an arbitrarily large index file while still keeping lookups efficient due to the nature of binary search. A similar implementation is present in the newly introduced MementoMap repo.

@machawk1
Member

machawk1 commented Feb 1, 2019

Comparing the implementations here would be useful:

Current ipwb replay binsearch

with open(fullFilePath, 'r') as cdxjFile:
    lines = cdxjFile.read().split('\n')

    lineFound = binary_search(lines, surtURI, retIndex, onlyURI)

from bisect import bisect_left  # needed by this excerpt

def binary_search(haystack, needle, returnIndex=False, onlyURI=False):
    lBound = 0
    uBound = None

    surtURIsAndDatetimes = []

    cdxjObj = objectifyCDXJData(haystack, onlyURI)
    surtURIsAndDatetimes = cdxjObj['data']

    metaLineCount = len(cdxjObj['metadata'])

    uBound = len(surtURIsAndDatetimes)

    pos = bisect_left(surtURIsAndDatetimes, needle, lBound, uBound)

    if pos != uBound and surtURIsAndDatetimes[pos] == needle:
        if returnIndex:  # Index useful for adjacent line searching
            return pos + metaLineCount
        return haystack[pos + metaLineCount]
    else:
        return None

Proposed binsearch from MementoMap

with open(infile) as f:
    surtk, freq, *_ = f.readline().split(maxsplit=2)
    if self.debug:
        print(f"Matching FIRST> {key} {surtk} {freq}", file=sys.stderr)
    if key == surtk:
        return [surtk, freq]
    left = 0
    f.seek(0, 2)
    right = f.tell()
    while (left < right):
        mid = (right + left) // 2
        f.seek(mid)
        f.readline()
        surtk, freq, *_ = f.readline().split(maxsplit=2)
        if self.debug:
            print(f"Matching {left}:{mid}:{right}> {key} {surtk} {freq}", file=sys.stderr)
        if key == surtk:
            return [surtk, freq]
        elif key > surtk:
            left = mid + 1
        else:
            right = mid - 1

As verbally discussed, the latter is a byte-wise rather than a line-wise (former) approach. The latter may not perform an even line-wise bisection, but that is not important if the ultimate implementation is more memory-efficient. Speed should also be evaluated, which could be done well with both small and very large CDXJ files.

An outstanding TODO, as verbally described by @ibnesayeed, is the handling of multiple adjacent entries that match the criteria. The current implementation relies on the lines being in memory, which will not be available in the byte-wise approach.

@ibnesayeed
Member Author

The in-file binary search I implemented for MementoMap is showing pretty good speed of about one millisecond per lookup on average. This measurement was performed on an index with almost 30 million records by successively looking up over 3 million different URIs. The process was performed on a remotely mounted NFS drive and matching records were written to a file, which added some additional I/O overhead; otherwise, the lookup time could have been even lower.

@machawk1
Member

machawk1 commented Feb 6, 2019

@ibnesayeed Can you provide a concrete way (e.g., code snippets, timing commands, sample data) to compare your approach with the current one?

@ibnesayeed
Member Author

The data I worked on is not the CDXJ data that IPWB's current code expects. I do have large CDXJ files (though not ones with the locator key) that can be used to test it, but I cannot share those indexes publicly. The UK Web Archive, however, has made a handful of their CDX indexes publicly available online (but those are not CDXJ files). My understanding is that even attempting to load a large CDXJ file in the current IPWB replay would take up all the memory and perhaps crash.

@ibnesayeed
Member Author

@machawk1
Member

machawk1 commented Feb 6, 2019

The memory issue might also be good to test.

To get realistic benchmarks, how about we generate some IPWB-style CDXJ TimeMaps?

We generate some WARCs in testing, so we should also be able to generate CDXJ files in the style of data that the code currently expects.
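
For instance, a throwaway generator along these lines could fabricate a sorted CDXJ-like file of arbitrary size (a sketch only; the generate_cdxj name and the JSON field values are made up and only approximate the record layout ipwb expects):

import json
import random
import string


def generate_cdxj(path, count=500_000):
    # Fabricated, sorted CDXJ-like records purely for benchmarking
    lines = []
    for _ in range(count):
        host = "".join(random.choices(string.ascii_lowercase + string.digits, k=10))
        surt = f"com,{host})/page/{random.randint(1, 9)}"
        dt = f"{random.randint(1996, 2019)}0101{random.randint(0, 235959):06d}"
        record = {"locator": "urn:ipfs/QmHeaderHashStub/QmPayloadHashStub",
                  "mime_type": "text/html", "status_code": "200"}
        lines.append(f"{surt} {dt} {json.dumps(record)}")
    lines.sort()  # the index must be sorted for binary search to work
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


generate_cdxj("fabricated_sorted.cdxj")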

@ibnesayeed
Member Author

ibnesayeed commented Feb 6, 2019

To get realistic benchmarks, how about we generate some IPWB-style CDXJ TimeMaps?

We don't necessarily need to generate dummy data to test how it would behave with large indexes. We can run it against any large text file as the index, even if it is an invalid index, though we can internally use some real CDXJ datasets we have for testing. My guess is that the following line would be the one that chokes the memory.

lines = cdxjFile.read().split('\n')

@ibnesayeed
Member Author

ibnesayeed commented Feb 6, 2019

I just realized that a single binary lookup takes a fraction of a millisecond. In the case of MementoMap, the supplied URI is looked up many times iteratively with shorter and shorter sub-URI keys until one is found or the keys are exhausted. This means those 3 million lookups would turn into 8-9 times as many (a rough guess; the actual number could be different).
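
For illustration, assuming the key-generation scheme shown in the snippet later in this thread, a lookup for the SURT com,example/a/b would try keys roughly in this order: com,example/a/b, com,example/a/b/*, com,example/a/*, com,example/*, and com,*; hence the multiplier on the number of binary searches per URI.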

@machawk1
Member

machawk1 commented Feb 6, 2019

Ok. Per above, some code snippets, timing commands, and sample data would make this more empirically convincing as a better solution than what is currently implemented in ipwb.
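
For example, a minimal timing harness might look like the following (a sketch only; lookup_in_memory and lookup_in_file are hypothetical stand-ins for the current and proposed implementations, and large_sample.cdxj is a placeholder file name):

import random
import time


def benchmark(lookup_fn, keys, label):
    # One wall-clock measurement over a batch of lookups
    start = time.perf_counter()
    hits = sum(1 for k in keys if lookup_fn(k) is not None)
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(keys)} lookups, {hits} hits, "
          f"{elapsed:.3f}s total, {1000 * elapsed / len(keys):.3f} ms each")


# Sample keys from the index itself so that most lookups are hits
with open("large_sample.cdxj") as f:
    all_keys = [line.split(maxsplit=1)[0] for line in f if line.strip()]
keys = random.sample(all_keys, min(10_000, len(all_keys)))

benchmark(lookup_in_memory, keys, "in-memory bisect")     # hypothetical
benchmark(lookup_in_file, keys, "in-file binary search")  # hypothetical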

@machawk1
Member

machawk1 commented Feb 7, 2019

@machawk1
Member

I created a CDXJ file using the code in the aforementioned repo with 500,000 entries and tried replaying it in ipwb via ipwb replay sample_sorted.cdxj. Loading up localhost:5000 in the browser requires a bit of time before the browser responds.

...and that is not even searching for a URI.

@ibnesayeed
Member Author

Also, you should see a significant rise in memory consumption.

@machawk1
Member

@ibnesayeed 300+MB (with a spike) difference handling the request.

For the sake of Counting Mementos™, there might be a more memory-efficient, quicker (<O(n)) but naive way to get a count for display on the main replay interface, e.g., a CRLF count in the file.

The additional computation that has ipwb traversing the file for the extended display will eventually be moved to the admin/ interface per #178.

The above is a somewhat related aside from the primary task of this ticket (efficient search), but I figured I'd run a quick empirical test and report it here rather than create a separate ticket (for now).

@ibnesayeed
Member Author

ibnesayeed commented Feb 21, 2019

In a generic file, you can't count the number of lines without reading it all the way through. However, if you are looking for a rough estimate, you can heuristically calculate the average number of bytes per line and divide the file size by it; that operation is O(1). That said, counting lines can be done accurately by processing the I/O stream instead of loading the whole file into memory. Also, opening the file in binary mode for this purpose should be faster than text mode, as it skips decoding the content.
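
A minimal sketch of both ideas (function names are hypothetical; the estimate samples the first few lines to guess the average line length):

import os


def count_lines_streaming(path, chunk_size=1 << 20):
    # Exact newline count while reading the file in fixed-size binary chunks,
    # so memory stays constant regardless of file size
    count = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
    return count


def estimate_lines(path, sample_lines=100):
    # Rough estimate: file size divided by the average length of a few sampled lines
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lengths = [len(f.readline()) for _ in range(sample_lines)]
    lengths = [n for n in lengths if n]
    return int(size * len(lengths) / sum(lengths)) if lengths else 0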

If you want to discount the data transfer and rendering time, you can make a HEAD request in a terminal using cURL and prefix it with the time command.

machawk1 pinned this issue May 17, 2019
@machawk1
Member

machawk1 commented Jun 22, 2019

On a 500,000 entry CDXJ file, time curl http://localhost:5000/ > /dev/null requires 1m35s to resolve for me using latest master 63433f3 and ipwb replay sample.cdxj.

A related aside: getURIsAndDatetimesInCDXJ() is called twice when accessing the replay homepage. The algorithm there requires all lines to be traversed, given it is looking to list all URI-Ms/Rs, datetimes, titles, etc. This causes the full page to take a very long time to load, even while the main part of the page (the UI including the logo, search box, etc.) is already shown.

This information will eventually be moved to /admin/ per above but the same problem will still be exhibited. Maybe something AJAX-y or streaming can be used to enrich the relevant memento listings as they're processed.

EDIT (so as to not derail): The iteration via enumeration in getURIsAndDatetimesInCDXJ() seems like a good fit for either a generator or asynchronous (async/await?) processing.

EDIT: Created a separate issue in #612 to handle these efforts, which are independent of the binary search optimization that this issue (#604) should address.
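
A rough sketch of the generator idea mentioned above (a simplified stand-in, not the existing getURIsAndDatetimesInCDXJ(); the metadata-line prefix and field layout are assumptions):

import json


def iter_cdxj_records(cdxj_path):
    # Yield (surt, datetime, fields) one record at a time instead of
    # materializing the whole index in memory
    with open(cdxj_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("!"):  # assumed metadata prefix
                continue
            surt, datetime_str, json_blob = line.split(" ", 2)
            yield surt, datetime_str, json.loads(json_blob)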

@ikreymer

If you're looking for a way to do this in the browser, I would recommend looking at the IndexedDB APIs for storing CDXJ in the browser, which provide a way to do range queries and, I think, eliminate the need to implement this manually, e.g. https://developer.mozilla.org/en-US/docs/Web/API/IDBKeyRange

I started experimenting with this (but am not actually using it yet): https://github.com/webrecorder/wabac.js/blob/master/src/collIndex.js#L50

@ibnesayeed
Member Author

Thanks @ikreymer for the pointer to IndexedDB. I am sure we will find a good use-case for that. However, this ticket is about implementing binary search in CDXJ files on the server.

@machawk1
Member

machawk1 commented Feb 20, 2020

I extracted the relevant parts from the code to which @ibnesayeed linked from MementoMap and appended them below to be integrated into ipwb. This executes and returns the value in 0.03 seconds on a 500k line CDXJ.

import re  # sys is no longer needed once the debug prints are removed

def lookup_keys(surt):
    keyre = re.compile(r"(.+)([,/]).+")
    key = surt.split("?")[0].strip("/")
    keys = [key, f"{key}/*"]
    while "," in key:
        try:
            m = keyre.match(key)
            keys.append(f"{m[1]}{m[2]}*")
            key = m[1]
        except Exception as e:
            key = key.strip("/,")
    return keys

def bin_search(iter, key):
    # Check the very first line explicitly; the midpoint seeks below always
    # discard a partial line and therefore never examine it
    iter.seek(0)
    surtk, rest = iter.readline().split(maxsplit=1)
    if key == surtk:
        return [surtk, rest]  # `rest` replaces the undefined `freq` from the excerpt

    # If further searching required...
    left = 0
    # Go to end of seek stream
    iter.seek(0, 2)
    right = iter.tell()  # Current position
    while (right - left > 1):
        mid = (right + left) // 2
        iter.seek(mid)
        iter.readline()  # discard the (likely partial) line at the seek position...
        line = iter.readline()  # ...then read the next complete line; hence two readlines
        surtk, rest = line.split(maxsplit=1)

        if key == surtk[0:-1]:
            return line
        elif key > surtk:
            left = mid
        else:
            right = mid

def run_batchlookup(filename, surt):
    mobj = open(filename, "rb")
    iter = mobj

    # Iterating over fobj appears to be a leftover from MementoMap's batch lookup
    # (one query URI per line); `line` itself is unused here
    fobj = open(filename, "rb")
    for line in fobj:
        res = lookup(iter, surt)
        if res:
            return res
    fobj.close()
    mobj.close()

def lookup(iter, surt):
    for idx, key in enumerate(lookup_keys(surt)):
        res = bin_search(iter, key.encode())
        if res:
            return res


filename = 'test_sorted.cdxj'
surtKey = 'intuit,sr7ro5w7d6w3e2mmx4w/'

res = run_batchlookup(filename, surtKey)
print(res)

EDIT: A needed extension of this will be to test and return the relevant adjacent lines that match the key.
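
One possible shape for that extension (a sketch only, not the issue-604 branch code; it assumes a sorted index opened in binary mode and a bytes key):

def search_offset(f, key):
    # Bisect over byte offsets: find the first line whose SURT field sorts >= key
    f.seek(0, 2)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid > 0:
            f.readline()                 # discard the partial line at the seek point
        line = f.readline()
        if line and line.split(maxsplit=1)[0] < key:
            lo = mid + 1
        else:
            hi = mid
    f.seek(lo)
    if lo > 0:
        f.readline()                     # re-align to the start of the next full line
    return f.tell()


def all_matches(path, key):
    # Yield every contiguous line whose SURT field equals key
    with open(path, "rb") as f:
        f.seek(search_offset(f, key))
        for line in f:
            fields = line.split(maxsplit=1)
            if not fields or fields[0] != key:
                break
            yield line.decode()

Since records with the same SURT are adjacent in a sorted CDXJ, scanning forward from the leftmost match returns all mementos of that URI without holding the file in memory.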

@machawk1
Member

machawk1 commented Feb 21, 2020

The issue-604 branch uses the aforementioned binary search algorithm and is effective at displaying the initial UI with very large CDXJ files in a reasonable amount of time. As above, the logic to return the other relevant adjacent lines still needs to be implemented, and some code cleanup is needed, entailing the removal of redundant functions and semantic revisions of new function names.

EDIT: The lack of support for the multiple adjacent URIs also breaks some tests, which is good (that is their purpose).

@anatoly-scherbakov
Contributor

Please pardon my intrusion, gentlemen. This is a very interesting project, a lot of work has been done, and I only started digging into the code today, so I might be very wrong. But here goes.

This issue apparently relates to supporting very large CDXJ files. Loading such a file into RAM (which happens currently, and I even proposed an in-memory cache for that) can be unaffordable for a small server where an ipwb node is running.

As far as I can see, CDXJ files are essentially key → value maps. The goal is to find a value by an exactly given key, and binary search in such a file is only possible because the file is ordered lexicographically by key.

However, the key-value data model is not unique; it is well supported by many database engines out there, including open source software in the first ranks. Redis, MongoDB, CouchDB, Cassandra, ElasticSearch, PostgreSQL, MySQL - all of them either support key-value collections directly or allow modeling them.

Imagine ipwb has support for plugins which are essentially connectors to various database backends. For speed or temporary operations you can choose Redis and offload your CDXJ files there. For permanent storage, you are welcome to use MongoDB. And if you are operating on really huge indices you can offload them to Amazon S3 (which is essentially a key-value storage too, just the values can be very large).

Thus, every ipwb node could choose what part of the global Trusted Web to host, and to choose whichever backend is more suitable to host its indices.

This is more flexible than having one hard-coded, file-based implementation. Thoughts?

@ibnesayeed
Member Author

Thanks @anatoly-scherbakov for your interest in the project, contributions, and thoughts. We really appreciate it.

Yes, you are right about the index being a key/value store problem, which has been solved a million times by many systems already. We can use any K/V system that allows two levels of indexing (i.e., a primary index on URIs and a secondary one on the datetime of the capture). While it is sensible to support multiple index backends going forward if the project matures and attracts more users, it was initially wise to hard-code a file-based index for the following reasons:

  • the project started in a one-and-a-half-day hackathon, where we did not have much time in hand to architect it with a flexible index backend design; we wanted something working and initially borrowed some code from the PyWB project
  • CDX files, initially introduced by the Internet Archive (and the later, enhanced CDXJ files), were the de facto index format in the web archiving ecosystem, and there were tools that would index WARC files and create CDX/CDXJ files, which we could easily adapt to fit our needs
  • Text-based indexes are very good for data wrangling using Unix text processing tools for research and analysis purposes, which is very important for us as we are web science and web archiving researchers

In fact, other archival replay systems like OpenWayback and PyWB already support a handful of index backends such as BerkeleyDB, CDX/CDXJ, and OutbackCDX. So it is not unlikely that IPWB will eventually allow index drivers for different K/V systems as well, if there is enough demand to justify the investment of cycles. Also, we are currently experimenting with many attributes to support per-record encryption and other features, which would become messy if we were to support multiple backends before settling on the schema, while a hard-coded, text-based index gives us the flexibility to experiment more easily.

Binary search in CDX/CDXJ files is already a solved problem, but we are redoing it due to licensing issues. This project is released under MIT, which is a rather permissive license and is not compatible with strong copyleft licenses like the GPL chosen by the tools where binary CDX index search is implemented. We are already working on an independent binsearch implementation that can be used here when ready, but there are only a limited number of cycles available to spare on it, hence the slow progress.

@anatoly-scherbakov
Contributor

I see, thanks for the detailed explanation. What is the typical size of a CDXJ file in terms of line count and byte size? Would it be possible to just load it into RAM as a Python dict and plainly use key lookups?

@machawk1
Member

@anatoly-scherbakov Reiterating @ibnesayeed, thanks for your interest. While I am open to using other data sources for K/V lookup, one goal is for the CDXJ indexes to be shareable, with the sole dependency for replicating the lookup being IPFS rather than a centralized remote system. We also want the querying procedure to scale, hence the need for binsearch, which has mostly been implemented in #631, just not yet merged in.

As @ibnesayeed also said, ipwb was initially (literally) hacked together and progressively improved. Its design is what functioned at the time, so we are usually pretty open to improving it after the fact.

I hacked (see a theme?) together a CDXJ generation script some time back in an attempt to generate a large amount of fabricated data solely for evaluating the ipwb lookup procedure. The improvements in #631 show a very promising performance increase with 5 million records (arbitrarily chosen). We are very open to improving this evaluation procedure to make the tool more usable and systematic.

@ibnesayeed
Member Author

What is the typical size of a CDXJ file in terms of line count and byte size? Would it be possible to just load it into RAM as Python dict and plainly use key lookups?

Each line of the CDXJ file corresponds to one archived resource, with a few header lines that hold some metadata about the archive and the file itself. Each line contains a SURT (a sort-friendly and canonicalized transformation of the URI that is no longer than the original URI), a 14-digit datetime string, the original URI, the mime-type, the status code, two IPFS hashes (corresponding to the header and payload of the resource), and some additional attributes to support per-record encryption when needed. CDXJ is flexible about adding more attributes as needed.

If you want to store this information in a Python dictionary, you will essentially be creating a nested dict where the primary keys will be SURTs, each storing a list or dict of index record objects identified/keyed by their 14-digit datetime (there may be collisions too, so be careful about them). Again, if something like this is to be implemented, it will only work for smaller archives; for large archives, vanilla Python dictionaries will be a nightmare, even if you have sufficient memory available. As I mentioned earlier, it will require support for multiple index back ends that implement a uniform API, where this dictionary-based one could be just one of many possible back ends. The API will need to return:

  1. All mementos of a given URI
  2. First memento of a given URI
  3. Last memento of a given URI
  4. Closest memento of a given URI to a given datetime
  5. Immediate previous memento to the closest memento of a given URI to a given datetime
  6. Immediate next memento to the closest memento of a given URI to a given datetime
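
To illustrate the nested dict shape described above and the kinds of queries it has to answer (a sketch only; the SURT, datetimes, and record values are placeholders, and a list per datetime absorbs collisions):

index = {
    "com,example/": {
        "20190206123456": [{"locator": "..."}],  # list absorbs datetime collisions
        "20190315000000": [{"locator": "..."}],
    },
}

mementos = index["com,example/"]   # 1. all mementos of a URI
first = mementos[min(mementos)]    # 2. first memento
last = mementos[max(mementos)]     # 3. last memento
target = "20190301000000"
closest_dt = min(mementos, key=lambda dt: abs(int(dt) - int(target)))  # 4. closest
# 5 and 6 (prev/next) would come from the sorted datetime keys around closest_dt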

@anatoly-scherbakov
Contributor

This is a very detailed explanation, thank you; I agree that a dict itself would not exactly work. What do you think of a backend interface like this?

from __future__ import annotations  # lets the methods reference MementoSet itself

import datetime
from typing import Optional

# `Memento` is assumed to be defined elsewhere in the backend

class MementoSet:
    def by_uri(self, original_uri: str) -> MementoSet:
        ...

    def first(self) -> Optional[Memento]:
        ...

    def last(self) -> Optional[Memento]:
        ...

    def closest_to_datetime(self, datetime: datetime.datetime) -> MementoSet:
        # This returns a MementoSet, not a Memento. Reason: there might be multiple mementos with the same datetime, as you mentioned above.
        ...

    def after(self, memento: Memento) -> Optional[Memento]:
        ...

    def before(self, memento: Memento) -> Optional[Memento]:
        ...

If mementos is an instance of MementoSet, the use cases you provided will map to:

  1. mementos.by_uri(uri)
  2. mementos.by_uri(uri).first()
  3. mementos.by_uri(uri).last()
  4. mementos.by_uri(uri).closest_to_datetime(datetime).first()
  5. memento = mementos.by_uri(uri).closest_to_datetime(datetime).first()
     mementos.by_uri(uri).before(memento)
  6. memento = mementos.by_uri(uri).closest_to_datetime(datetime).first()
     mementos.by_uri(uri).after(memento)

I don't like cases 5 and 6; they are too verbose, but at this point I am not sure how to organize them better.

@ibnesayeed
Member Author

While I listed all the atomic requirements earlier, the actual API, based on the usage requirements, can be much simpler and more efficient with only two public functions:

get_all_mementos(urir) -> Iterator[chronologically_sorted_index_records_of_matching_urir]

get_nav_mementos(urir, datetime) -> {first, prev, closest, next, last}

Note: The value corresponding to each key in the returned dictionary of the second function can be a list of record objects, if we plan to support collisions.

In addition to these, the index API may also provide an overall iterator with a filter argument to iterate over all the matching records for the sake of record pinning or MementoMap generation.

iterate_all(filter) -> Iterator[filtered_sorted_index_records]
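
As Python stubs, that surface might look roughly like this (a sketch; the class name and the IndexRecord placeholder are hypothetical):

from typing import Any, Callable, Dict, Iterator, List

IndexRecord = Dict[str, Any]  # placeholder for one parsed CDXJ record


class IndexBackend:
    def get_all_mementos(self, urir: str) -> Iterator[IndexRecord]:
        # All index records of urir, sorted chronologically
        raise NotImplementedError

    def get_nav_mementos(self, urir: str, datetime: str) -> Dict[str, List[IndexRecord]]:
        # Keys: 'first', 'prev', 'closest', 'next', 'last'; lists allow datetime collisions
        raise NotImplementedError

    def iterate_all(self, filter: Callable[[IndexRecord], bool]) -> Iterator[IndexRecord]:
        # Filtered iterator over all records, e.g. for pinning or MementoMap generation
        raise NotImplementedError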

@anatoly-scherbakov
Contributor

I've created a separate issue to continue this discussion about an index backend interface which I find quite interesting.
