Skip to content

kenhktsui/open-information-retrieval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Open Information Retrieval

Build an open information retrieval system, which can be used for different downstream tasks:

  • open domain question answering
  • retrieval augmented large language model
  • factchecking

VectorDB uses Qdrant, which is written in Rust. It is a lightning fast and production ready vector DB.

Implementations so far include:

  • Retriever of REALM (13353718 data records)

More retrievers will be added in the future.

Installation

Qdrant Vector DB Initialisation

# init vector DB
docker pull qdrant/qdrant:v1.1.2
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant:v1.1.2

Data can be downloaded from link and unzipped into qdrant_storage folder. The data has been produced by running upload_vector.py, which loads wikipedia data and REALM vectors into Qdrant Vector DB. So you do not need to run it again.

Once you had unzipped into qdrant_storage folder. You can verify the data by running:

python util/verify_vector.py
### Output
collections=[CollectionDescription(name='wiki_collection')]
status=<CollectionStatus.GREEN: 'green'> optimizer_status=<OptimizersStatusOneOf.OK: 'ok'> vectors_count=13353718 indexed_vectors_count=13353718 points_count=13353718 segments_count=9 config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=128, distance=<Distance.DOT: 'Dot'>), shard_number=1, on_disk_payload=True), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0)) payload_schema={}

You will see there are 13353718 indexed vectors in the collection.

Client Installation

pip install -r requirements.txt

Usage

from oir.retriever import RealmRetriever


retriever = RealmRetriever()
result = retriever.search("Who is the first president of United States?", top_k=3, score_threshold=20.0)
print(result[0])
### Output
{
    'data': "Presidency of George Washington\n\nThe presidency of George Washington began on April 30, 1789, when Washington was inaugurated as the first President of the United States, and ended on March 4, 1797. Washington took office after the 1788–89 presidential election, the nation's first quadrennial presidential election, in which he was elected unanimously. Washington was re-elected unanimously in the 1792 presidential election, and chose to retire after two terms. He was succeeded by his vice president, John Adams of the Federalist Party. Washington had established his preeminence among the new nation's Founding Fathers through his service as Commander-in-Chief of the Continental Army during the American Revolutionary War and as President of the 1787 Constitutional Convention. Once the Constitution was approved, it was widely expected that Washington would become the first President of the United States, despite his own desire to retire from public life. In his first inaugural address, Washington expressed both his reluctance to accept the presidency and his inexperience with the duties of civil administration, but he proved an able leader. Washington presided over the establishment of the new federal governmentappointing all of the high-ranking officials in the executive and judicial branches, shaping numerous political practices, and establishing the site of the permanent capital of the United States.",
    'score': 25.925602
}

result = retriever.search("What is the tallest building in the world?", top_k=3, score_threshold=20.0)
print(result[0])
### Output
{
    'data': "Vanity height\n\nVanity height is defined by the Council on Tall Buildings and Urban Habitat (CTBUH) as the height difference between a skyscraper's pinnacle and the highest usable floor (usually observatory, office, restaurant, retail or hotel/residential). Because the CTBUH ranks the world's tallest buildings by height to pinnacle, a number of buildings appear higher in the rankings than they otherwise would due to extremely long spires. The controversy began when the Petronas Towers were named as the world's tallest buildings in 1998, despite having a roof 63.4\xa0m (208\xa0ft) lower than that of the Willis Tower. The current world's tallest building, Burj Khalifa, is officially 828 meters tall, but its highest usable floor is 585m above ground. Therefore, its vanity height is defined as 244 meters, or 29% of the building's total height. The likely next tallest building, Jeddah Tower (designed by the same architect), will be over 1,000 meters tall but its highest floor is 630m above ground. The top 370m (equivalent to an 85-story building) or 37% of the building's total height is unusable. When vanity height is excluded, the height progression of the world's tallest buildings looks much more modest in comparison.",
    'score': 26.24711
}

result = retriever.search("who creates Python?", top_k=3, score_threshold=20.0)
print(result[0])
### output
{
    'data': "Python (programming language)\n\nPython is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader in the language community. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library. Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of Python's other implementations. Python and CPython are managed by the non-profit Python Software Foundation. Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its implementation began in December 1989.",
    'score': 22.62619
}

About

Implementation of Production Ready Information Retrieval System

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages