# Using FAISS and Lamini for a RAG Pipeline

## Step 1: Import Libraries
Import the necessary libraries, including FAISS, Lamini, and jsonlines to handle embeddings and data processing.
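
A missing import only surfaces when a later cell runs, so it can help to check up front that the packages are installed. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the module names `faiss`, `lamini`, and `jsonlines` are assumed to match the imports this notebook relies on.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

# Module names assumed from the libraries listed in this step
required = ["faiss", "lamini", "jsonlines"]
missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```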

## Step 2: Set Parameters
Define the retrieval parameters, such as the number of nearest neighbors to return.

In [2]:
# Number of nearest chunks to return
k = 2

## Step 3: Initialize Variables
Set up placeholders for the index and corresponding plain text splits. Instantiate Lamini's embedding client.

In [4]:
# Import the libraries used throughout this notebook
import faiss
import jsonlines
import lamini

# The index holds the embeddings; splits holds the corresponding plain text
index = None
splits = []

# Instantiate Lamini's embedding client
embedding_client = lamini.Embedding()

## Step 4: Create Embeddings for Each Transcript
Load each 'transcript' item from a JSONL file, create an embedding for it, and add it to the index. The first cell below separately extracts raw text from a PDF with OCR.

In [3]:
# Extract text from a PDF with OCR
from pdf2image import convert_from_path
import pytesseract

def read_file(pdf_path: str) -> str:
    """
    Reads a PDF file with pytesseract OCR and returns the extracted text.
    Args:
        pdf_path (str): Path to the PDF file.
    Returns:
        str: Text of all pages joined by newlines; empty or partial if an error occurs.
    """
    all_text = ""  # Defined before the try block so the return below never fails
    try:
        # Render each PDF page as an image at 500 DPI
        pages = convert_from_path(pdf_path, 500)
        # Run OCR on each page image and accumulate the text
        for img_blob in pages:
            all_text += pytesseract.image_to_string(img_blob, lang='eng') + "\n"
    except Exception as e:
        print(f"Error reading file {pdf_path}: {e}")
    return all_text

# Execute PDF extraction and preview the first 500 characters
pdf_path = "documents/sample.pdf"
raw_text = read_file(pdf_path)
raw_text[:500]

'wan N Un\n\nCase 1:20-cv-02167-TJK Document 195 Filed 04/08/24 Page 1 of 9\nCESS AL FILED\nORIGINAL = 4PREEBour\n\nUNITED STATES DISTRICT COURT — APR _ 16 2094\nFOR THE DISTRICT OF COLUMBIA\n\nJOHN D. HADD\n, E\n| CLERK\n\n#122108\n\nCivil Action No. 20-2167 (TJK)\n\nTHE CHEROKEE NATION et al.,\n\nPlaintiffs,\nV.\n\nUNITED STATES DEPARTMENT OF THE\nINTERIOR et al.,\n\naanwwew\n\n-e2e-eee\n\nDefendants.\n\nCart rested\n\nORDER CERTIFYING QUESTION OF LAW\nTO THE SUPREME COURT OF OKLAHOMA\n\nJO THE SUPREME CUURLDLRy eee\n\nThe United S'
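
The extracted text is one long string, while the index stores one embedding per chunk. One simple way to bridge the two, sketched here as an assumption rather than this notebook's own method, is to cut the text into fixed-size, overlapping character windows. The names `chunk_text`, `size`, and `overlap`, and the values used, are hypothetical.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    return [text[start:start + size] for start in range(0, len(text), step)]

sample = "abcdefghij" * 3  # 30 characters of stand-in text
chunks = chunk_text(sample, size=12, overlap=4)
print(chunks)
```

Each chunk could then be embedded and appended to the splits list in the same way as a 'transcript' item.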

In [None]:
# Load each 'transcript' item and create embeddings
with jsonlines.open("data.jsonl", "r") as file:
    for item in file:
        transcript_embedding = embedding_client.generate(item['transcript'])
        if index is None:
            index = faiss.IndexFlatL2(transcript_embedding.size) # Set the size of the index based on model embedding size
        index.add(transcript_embedding)
        splits.append(item['transcript'])
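
The loop above assumes that data.jsonl holds one JSON object per line, each with a 'transcript' field. Here is a minimal sketch of that layout using only the standard library. The sample records and the temporary file are made up for illustration, and the notebook itself reads the file with jsonlines.

```python
import json
import os
import tempfile

# Made-up sample records; only the 'transcript' key is used by the notebook
records = [
    {"transcript": "First chunk of a call transcript."},
    {"transcript": "Second chunk of a call transcript."},
]

# Write one JSON object per line (the JSON Lines layout)
path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line and collect the plain-text splits
loaded_splits = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            loaded_splits.append(json.loads(line)["transcript"])

print(loaded_splits)
```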

## Step 5: Create Embedding for the Question
Create an embedding for the user's question to find the most relevant text from the index.

In [6]:
# Define the question and create its embedding
question = "What is TSMC's 2019 revenue in USD?"
question_embedding = embedding_client.generate(question)

## Step 6: Find Nearest Neighbors
Use FAISS to find the top k nearest neighbors for the question embedding.

In [7]:
# Find the k nearest neighbors and retrieve relevant data
distances, indices = index.search(question_embedding, k)
relevant_data = [splits[i] for i in indices[0] if i >= 0]
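
To make the search step concrete, here is a pure-Python sketch of what an exact (flat) L2 index computes. It takes the squared Euclidean distance from the query to every stored vector, sorts ascending, and keeps the first k. The function name `flat_l2_search` and the toy 2-D vectors are hypothetical; FAISS performs the same comparison over the real embeddings far more efficiently.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query.

    Distances are squared Euclidean distances, sorted ascending.
    """
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-D "embeddings" standing in for real model output
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_splits = ["chunk A", "chunk B", "chunk C"]

toy_distances, toy_indices = flat_l2_search(toy_vectors, [0.9, 1.2], k=2)
toy_relevant = [toy_splits[i] for i in toy_indices]
print(toy_relevant)
```

The `if i >= 0` filter in the cell above covers the case where fewer than k vectors are stored, since FAISS pads the missing results with an index of -1.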

## Step 7: Instantiate LLM Client
Initialize Lamini's LLM client for generating the final response based on the relevant data.

In [8]:
# Instantiate Lamini's LLM client
llm = lamini.Lamini(model_name="meta-llama/Meta-Llama-3.1-8B-Instruct")

## Step 8: Prepare the Prompt
Construct a prompt for Lamini's LLM using the retrieved relevant data and the question.

In [9]:
# Form the prompt using the retrieved data and the question
prompt = f"""
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{relevant_data}
{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
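
The cell above interpolates the list relevant_data directly, which inserts its Python repr, brackets and quotes included, into the prompt. Joining the chunks into plain text first is usually cleaner. Here is a minimal sketch, assuming the same Llama 3.1 chat markers as above; the helper name `build_prompt` is hypothetical.

```python
def build_prompt(chunks, question):
    """Join retrieved chunks and wrap them with the question in one user turn."""
    context = "\n\n".join(chunks)
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{context}\n"
        f"{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )

# Made-up chunks and question, for illustration only
example_prompt = build_prompt(
    ["Revenue guidance for 2019 was discussed.", "Margins were stable."],
    "What was the 2019 revenue guidance?",
)
print(example_prompt)
```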

## Step 9: Generate the Response
Use the constructed prompt to generate the answer to the user's question with Lamini's LLM.

In [14]:
# Generate the answer using Lamini's LLM
response = llm.generate(prompt)
print(response)



The text does not mention TSMC (Taiwan Semiconductor Manufacturing Company) at all. It appears to be a transcript of a conference call from a company that is not specified. However, based on the content, it seems to be a company in the semiconductor industry, possibly a supplier to TSMC.

The text mentions a revenue range of $550 million to $565 million for 2019, but it does not specify the company's name.
