# What this notebook is for
- Information extraction: transforming unstructured documents into structured tables (slow, comprehensive)
- Retrieval-augmented generation (RAG): answering questions over a corpus of unstructured documents (fast, less comprehensive)

# Load data

Data must have 3 columns:
- `document_id`: identifier for a document. Has to be unique.
- `document_name`: human-readable name for the document. Has to be unique.
- `document_text`: contains the actual text of the document
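
Before building a corpus, it can help to check a DataFrame against the three requirements above. The following is a minimal sketch, not part of `info_extract`: `check_corpus_frame` and the toy `example` frame are hypothetical names introduced here for illustration.

```python
import pandas as pd

REQUIRED_COLUMNS = ("document_id", "document_name", "document_text")


def check_corpus_frame(frame: pd.DataFrame) -> dict:
    """Report missing required columns and duplicate counts (hypothetical helper)."""
    missing = [col for col in REQUIRED_COLUMNS if col not in frame.columns]
    report = {"missing_columns": missing}
    # Count repeated values only for the columns that are actually present.
    for col in ("document_id", "document_name"):
        if col in frame.columns:
            report[f"duplicate_{col}"] = int(frame[col].duplicated().sum())
    return report


# Toy data: the second and third rows share a document_id.
example = pd.DataFrame(
    {
        "document_id": ["a1", "a2", "a2"],
        "document_name": ["first", "second", "third"],
        "document_text": ["text one", "text two", "text three"],
    }
)
report = check_corpus_frame(example)
print(report)
```

The same call could be run on the `df` loaded below to see whether any identifiers or names repeat.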

In [1]:
import os
import pandas as pd
from predibase import PredibaseClient
from time import perf_counter

df = pd.read_csv("s3://predibase-public-us-west-2/datasets/formatted_hotel_reviews.csv")
df

Unnamed: 0.1,Unnamed: 0,document_text,document_id,document_name
0,0,Title: Best Western Plus Hotel\nReview text: T...,AWE2FvX5RxPSIh2RscTK,Best Western Plus Hotel AWE2FvX5RxPSIh2RscTK
1,1,Title: Clean rooms at solid rates in the heart...,AVwcj_OhkufWRAb5wi9T,Clean rooms at solid rates in the heart of Car...
2,2,Title: Business\nReview text: Parking was horr...,AVwcj_OhkufWRAb5wi9T,Business AVwcj_OhkufWRAb5wi9T
3,3,Title: Very good\nReview text: Not cheap but e...,AVwcj_OhkufWRAb5wi9T,Very good AVwcj_OhkufWRAb5wi9T
4,4,Title: Low chance to come back here\nReview te...,AVwcj_OhkufWRAb5wi9T,Low chance to come back here AVwcj_OhkufWRAb5wi9T
...,...,...,...,...
9995,9995,Title: Very accommodating and friendly staff!\...,AVwdatg0ByjofQCxo5S5,Very accommodating and friendly staff! AVwdatg...
9996,9996,"Title: comfortable, friendly, clean, professio...",AVwdatg0ByjofQCxo5S5,"comfortable, friendly, clean, professional AVw..."
9997,9997,Title: Great location\nReview text: This Hampt...,AVwdatg0ByjofQCxo5S5,Great location AVwdatg0ByjofQCxo5S5
9998,9998,Title: Great Atmosphere!\nReview text: Awesome...,AV1thTgM3-Khe5l_OvT5,Great Atmosphere! AV1thTgM3-Khe5l_OvT5


In [2]:
# Print the full text of the first document
print()
print(df.iloc[0]["document_text"])


Title: Best Western Plus Hotel
Review text: This hotel was nice and quiet. Did not know, there was train track near by. But it was only few train passed during our stay. Best Western changed hotel classification. The Plus category are not the same as before.
Address: 5620 Calle Real
Country: US
City: Goleta
Date: 2018-01-01T00:00:00.000Z


# Information Extraction and Retrieval-Augmented Generation API

In [3]:
from info_extract import Corpus
from info_extract.endpoints import get_llm_endpoint
from info_extract.retrieval import get_retriever


# number of documents to work with in the corpus
num_documents = 10

# chunk size in characters
chunk_size = 2048

# name of the corpus
corpus_name = "demo-corpus"

# instantiate the Predibase client
pc = PredibaseClient(token="<YOUR PREDIBASE API TOKEN>")

# Use a Predibase LLM (e.g. llama-2-13b)
llm_endpoint = get_llm_endpoint(model_provider="predibase", model_name="llama-2-13b", predibase_client=pc)

# Use Predibase infrastructure for indexing and retrieval
retriever = get_retriever(retrieval_provider="predibase", index_name=f"{corpus_name}-{chunk_size}", predibase_client=pc, model_name="llama-2-13b")

# Create the corpus of documents and pass in the necessary resources (LLM and retriever)
corpus = Corpus(df.head(num_documents), name=corpus_name, llm_endpoint=llm_endpoint, retriever=retriever)



### Chunk the data
The first step is to turn documents into smaller chunks of text.

In [4]:
chunks = corpus.chunk(chunk_size)

In [5]:
chunks.df

Unnamed: 0,chunk_id,chunk_text,document_id
0,0,Title: Best Western Plus Hotel Review text: Th...,AWE2FvX5RxPSIh2RscTK
0,0,Title: Clean rooms at solid rates in the heart...,AVwcj_OhkufWRAb5wi9T
0,0,Title: Business Review text: Parking was horri...,AVwcj_OhkufWRAb5wi9T
0,0,Title: Very good Review text: Not cheap but ex...,AVwcj_OhkufWRAb5wi9T
0,0,Title: Low chance to come back here Review tex...,AVwcj_OhkufWRAb5wi9T
0,0,Title: Loved staying here Review text: This is...,AVweLARAByjofQCxv5vX
0,0,Title: Does not live up to its reputation Revi...,AVweLARAByjofQCxv5vX
0,0,Title: worst customer service ever Review text...,AV1thAoL3-Khe5l_Ott5
0,0,Title: Location Location Location Review text:...,AVz6h4Sb3D1zeR_xDHsu
0,0,Title: The worst place i've booked Review text...,AVwdo6WHByjofQCxrGaj
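
How `corpus.chunk` splits text internally is not shown in this notebook. As a rough illustration only, a fixed-size character chunker might look like the sketch below; `chunk_text` is a hypothetical function, and collapsing whitespace is an assumption based on the `chunk_text` column above, where newlines appear as spaces.

```python
from typing import List


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Collapse newlines and repeated whitespace into single spaces (assumption).
    flat = " ".join(text.split())
    return [flat[i:i + chunk_size] for i in range(0, len(flat), chunk_size)]


sample = "Title: Best Western Plus Hotel\nReview text: This hotel was nice and quiet."
short_chunks = chunk_text(sample, chunk_size=2048)  # text is shorter than one chunk
small_chunks = chunk_text(sample, chunk_size=32)    # forces several chunks
print(len(short_chunks), len(small_chunks))
```

Under this scheme, a review shorter than `chunk_size` characters yields a single chunk, which would be consistent with every row above having `chunk_id` 0.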


## 1. Extract information from all documents
- Define a `list` of questions (queries) that you'd like answered from the documents.
- Extraction runs every query against all documents, so it is slow. If you know that the information exists in only a subset of the documents, either use RAG (next section) or create a new `Corpus` from that subset.
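
Selecting a subset of documents is an ordinary pandas filter. The sketch below uses a toy DataFrame with made-up identifiers; the commented-out `Corpus` call mirrors the constructor used earlier in this notebook, and the corpus name in it is a hypothetical placeholder.

```python
import pandas as pd

# Toy stand-in for the reviews DataFrame; column names follow the "Load data" section.
reviews = pd.DataFrame(
    {
        "document_id": ["id-1", "id-2", "id-3"],
        "document_name": ["review one", "review two", "review three"],
        "document_text": ["Address: 1 Main St", "Great pool", "Address: 9 Oak Ave"],
    }
)

# Keep only the documents expected to contain the information of interest.
wanted_ids = ["id-1", "id-3"]
subset = reviews[reviews["document_id"].isin(wanted_ids)].reset_index(drop=True)
print(subset)

# The subset could then replace df.head(num_documents) when building the corpus, e.g.
# Corpus(subset, name="demo-corpus-subset", llm_endpoint=llm_endpoint, retriever=retriever)
```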

In [7]:
start_t = perf_counter()
extraction_result = corpus.extract(queries=["what is the address of the hotel?"])
print(f"took {perf_counter() - start_t}")

took 29.20573787500001


### Examine the results of the extraction

In [8]:
extraction_result.extractions

Unnamed: 0,document_id,query,answer
0,AWE2FvX5RxPSIh2RscTK,what is the address of the hotel?,The address of the hotel is:\n\n5620 Calle Real
1,AVz6h4Sb3D1zeR_xDHsu,what is the address of the hotel?,The address of the hotel is:\n\n2240 Buena Vis...
2,AVwdo6WHByjofQCxrGaj,what is the address of the hotel?,The address of the hotel is:\n\n1107 N Main St
3,AVweLARAByjofQCxv5vX,what is the address of the hotel?,The address of the hotel is:\n\n167 W Main St.
4,AV1thAoL3-Khe5l_Ott5,what is the address of the hotel?,The address of the hotel is:\n\n115 W Steve Wa...
5,AVwcj_OhkufWRAb5wi9T,what is the address of the hotel?,The address of the hotel is:\n\n5th And San Ca...


In [9]:
for _, row in extraction_result.extractions.iterrows():
    print(10 * "-")
    print(row["answer"])
    print()

----------
The address of the hotel is:

5620 Calle Real

----------
The address of the hotel is:

2240 Buena Vista Rd.

----------
The address of the hotel is:

1107 N Main St

----------
The address of the hotel is:

167 W Main St.

----------
The address of the hotel is:

115 W Steve Wariner Dr.

----------
The address of the hotel is:

5th And San Carlos, PO Box 3574, Carmel by the Sea, US.
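
The answers above all start with the same preamble followed by a blank line. If only the extracted value is needed, a small post-processing step can drop the preamble. This is a sketch that assumes the "preamble, blank line, value" shape seen in this output; `strip_preamble` is a hypothetical helper, and other models or queries may format answers differently.

```python
def strip_preamble(answer: str) -> str:
    """Return the text after the first blank line, or the whole answer if there is none."""
    _head, sep, tail = answer.partition("\n\n")
    return tail.strip() if sep and tail.strip() else answer.strip()


raw_answers = [
    "The address of the hotel is:\n\n5620 Calle Real",
    "The address of the hotel is:\n\n167 W Main St.",
    "No address given",  # no blank line, so it is returned unchanged
]
cleaned = [strip_preamble(a) for a in raw_answers]
print(cleaned)
```

On a DataFrame like `extraction_result.extractions`, the same function could be applied to the `answer` column with `Series.map`.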



### See which chunks the answer is coming from
Specify the (`query`, `document_id`) pair whose attributions you want to inspect.

In [10]:
query = "what is the address of the hotel?"
document_id = "AWE2FvX5RxPSIh2RscTK"

relevant_chunks = extraction_result.get_attribution(query=query, document_id=document_id)

In [11]:
print(relevant_chunks)

[<info_extract.info_extract.Chunk object at 0x7ff6e8036e20>]


In [12]:
for chunk in relevant_chunks:
    print("chunk.document_id", chunk.document_id)
    print("chunk.chunk_id", chunk.chunk_id)
    print()
    print("chunk.chunk_text:\n", chunk.chunk_text)
    print("\n\n")

chunk.document_id AWE2FvX5RxPSIh2RscTK
chunk.chunk_id 0

chunk.chunk_text:
 Title: Best Western Plus Hotel Review text: This hotel was nice and quiet. Did not know, there was train track near by. But it was only few train passed during our stay. Best Western changed hotel classification. The Plus category are not the same as before. Address: 5620 Calle Real Country: US City: Goleta Date: 2018-01-01T00:00:00.000Z





## 2. Retrieval-Augmented Generation
If the answer you're looking for is in one or a few documents, RAG is a more suitable (and faster) approach than extraction. Here's what's happening under the hood:
1. Create an index over the chunked documents.
2. Pass a query to the index. This triggers:
    - Retrieving the `K` most relevant chunks.
    - Combining these chunks to produce a final answer.
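
The retrieve-then-combine flow can be sketched with a toy retriever. The example below is an illustration only: it ranks chunks by bag-of-words cosine similarity, which is a stand-in and not the method used by the Predibase retriever, and it stops at building a prompt instead of calling an LLM. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


toy_chunks = [
    "Title: Best Western Plus Hotel Review text: Did not know, there was train track near by.",
    "Title: Business Review text: Parking was horrible.",
    "Title: Very good Review text: Not cheap but excellent location.",
]
question_text = "which hotel has train track noise?"

# Retrieval: pick the K most relevant chunks.
top = retrieve(question_text, toy_chunks, k=2)

# Combination: join the retrieved chunks into one prompt (the LLM call is omitted).
prompt = "\n\n".join(c for _, c in top) + "\n\nQuestion: " + question_text
print(top[0][1])
```

Because only `K` chunks reach the LLM, the cost of a query does not grow with the corpus the way full extraction does, but a relevant chunk that ranks below `K` is never seen.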

### If an index hasn't been created, create one

In [13]:
corpus.index()



### Query the corpus

In [14]:
question = "which hotel has train track noise?"

start_t = perf_counter()
rag_response = corpus.query(question)
duration = perf_counter() - start_t



### Print the answer

In [15]:
print(f"RAG answer\n{rag_response.answer}\n\nDuration: {duration} seconds")

RAG answer
Hello! Based on the information provided, the answer to your question is:

The Best Western Plus hotel has train track noise.

This information is provided in answer A1: MoreMore Address.

Duration: 36.13179716600001 seconds


### Print the relevant chunks

In [16]:
len(rag_response.chunk_answers)

2

In [17]:
for chunk in rag_response.chunk_answers:
    print(chunk.chunk_text)
    print("answer:", chunk.answer)
    print("\n\n------")

Title: Location Location Location Review text: MoreMore Address: 2240 Buena Vista Rd Country: US City: Lexington Date: 2017-06-15T00:00:00.000Z
answer: A1: MoreMore Address.


------
Title: Best Western Plus Hotel Review text: This hotel was nice and quiet. Did not know, there was train track near by. But it was only few train passed during our stay. Best Western changed hotel classification. The Plus category are not the same as before. Address: 5620 Calle Real Country: US City: Goleta Date: 2018-01-01T00:00:00.000Z
answer: A1: The hotel with train track noise is the Best Western Plus hotel.


------


In [18]:
# Inspect the first chunk answer as an example
print(rag_response.chunk_answers[0])

ChunkExtractionResult(document_id='AVz6h4Sb3D1zeR_xDHsu', chunk_id=0, chunk_text='Title: Location Location Location Review text: MoreMore Address: 2240 Buena Vista Rd Country: US City: Lexington Date: 2017-06-15T00:00:00.000Z', query='which hotel has train track noise?', answer='A1: MoreMore Address.', is_correct=True)
