1. Hugging Face Transformers
Hugging Face's Transformers library provides pre-trained language models for text generation and for computing embeddings.

In [2]:
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained transformer model
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

# Generate an embedding for a document by averaging (mean-pooling) its token vectors
text = "This is an example document."
inputs = tokenizer(text, return_tensors="pt")
embeddings = model(**inputs).last_hidden_state.mean(dim=1)
print("Embeddings shape:", embeddings.shape)

Embeddings shape: torch.Size([1, 384])
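
The cell above averages over every token position, which is fine for a single sentence. When several texts are batched with padding, the padded positions would also enter the average. A minimal sketch of mask-aware mean pooling in plain Python follows; the function name `masked_mean_pool` and the toy vectors are made up for illustration, and zeros stand in for padded positions (a real model's outputs at those positions are generally not zero).

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy input (illustrative values): three real tokens followed by one padding position
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
plain = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]
print(pooled)  # [3.0, 4.0]
print(plain)   # [2.25, 3.0]  (the padding position drags the average down)
```

With the tokenizer used above, the mask would come from `inputs["attention_mask"]`.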


2. FAISS
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. In a retrieval-augmented generation (RAG) system it commonly serves as the retrieval layer.

In [3]:
import faiss
import numpy as np

# Create a sample dataset of embeddings
dimension = 128
num_vectors = 1000
data = np.random.random((num_vectors, dimension)).astype('float32')

# Build an index
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) search using L2 (Euclidean) distance
index.add(data)  # Add data to the index

# Perform a search
query = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query, k=5)
print("Nearest neighbors:", indices)

Nearest neighbors: [[179 620 419 628 405]]
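
To make clear what a flat L2 index computes, here is a rough plain-Python sketch of the same idea: compare the query against every stored vector and keep the k smallest squared Euclidean distances (FAISS's `IndexFlatL2` also reports squared distances). The helper names `squared_l2` and `flat_l2_search` are hypothetical and the data is random; this is an illustration of the technique, not how FAISS is implemented internally.

```python
import heapq
import random

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, k):
    """Exhaustively score every vector and return (distances, indices) of the k closest."""
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = heapq.nsmallest(k, scored)
    return [d for d, _ in top], [i for _, i in top]

random.seed(0)
dimension = 8
vectors = [[random.random() for _ in range(dimension)] for _ in range(100)]
query = list(vectors[42])  # reuse a stored vector, so index 42 must rank first

distances, indices = flat_l2_search(vectors, query, k=5)
print(indices[0], distances[0])  # 42 0.0
```

The cost grows linearly with the number of stored vectors, which is why approximate index types exist for larger collections.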


3. LangChain

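LangChain is a high-level framework for composing RAG pipelines: it connects a retriever (for example, a vector store) to a language model that generates the answer. The examples in this section call OpenAI models, so an API key must be available in the OPENAI_API_KEY environment variable.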
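
Before looking at library calls, the retrieve-then-generate flow can be sketched in plain Python. Everything here is an assumption made for illustration: the word-overlap `score` function stands in for a real embedding similarity, the sample documents are invented, and the final prompt would be sent to a language model rather than printed.

```python
def score(question, document):
    """Hypothetical relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(document.lower().split()))

def retrieve(question, documents, k=2):
    """Return the k documents with the highest relevance score."""
    ranked = sorted(documents, key=lambda doc: score(question, doc), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Combine the retrieved documents and the question into one prompt string."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Invented sample documents
documents = [
    "FAISS performs similarity search over dense vectors.",
    "Pinecone is a managed vector database.",
    "TF-IDF weights terms by how rare they are.",
]
question = "What is a managed vector database?"

top_docs = retrieve(question, documents, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

A framework such as LangChain wraps these same steps (retrieve, assemble the prompt, call the model) behind a single chain object.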

In [19]:
import os
from openai import OpenAI

# The client needs an API key; read it from the environment rather than hard-coding it.
# OPENAI_ORG_ID and OPENAI_PROJECT_ID are optional and may be left unset.
client = OpenAI(
  api_key=os.environ["OPENAI_API_KEY"],
  organization=os.environ.get("OPENAI_ORG_ID"),
  project=os.environ.get("OPENAI_PROJECT_ID"),
)

In [18]:
import os
from openai import OpenAI

# Pass the key explicitly; OpenAI() raises OpenAIError when no key is given and OPENAI_API_KEY is unset
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
completion = client.chat.completions.create(
    model="gpt-4o",
    store=True,
    messages=[
        {"role": "user", "content": "write a haiku about ai"}
    ]
)
print(completion.choices[0].message.content)

In [17]:
# Written against langchain 0.2/0.3 with the langchain-openai, langchain-community and faiss-cpu packages installed.
# OPENAI_API_KEY must be set; without it OpenAIEmbeddings raises a ValidationError.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Set up FAISS vector store
docs = ["Document 1", "Document 2", "Document 3"]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(docs, embeddings)

# Create a retrieval-augmented QA chain
# (text-davinci-003 has been retired by OpenAI; gpt-4o is the model used earlier in this section)
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=ChatOpenAI(model="gpt-4o"), retriever=retriever)

# Ask a question
response = qa_chain.invoke({"query": "What is Document 1 about?"})
print("Response:", response["result"])

4. Pinecone
Pinecone is a managed vector database that provides low-latency vector search and similarity matching as a hosted service.

In [15]:
import os
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone (pinecone.init() is no longer available; create a Pinecone instance instead)
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

# Create an index if it does not already exist
index_name = "example-index"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=128,
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )

# Upsert and query vectors
# The "..." stands for the remaining values: each vector needs exactly 128 floats
index = pc.Index(index_name)
vectors = [{"id": "doc1", "values": [0.1, 0.2, 0.3, ...]}]
index.upsert(vectors=vectors)
query_result = index.query(vector=[0.1, 0.2, 0.3, ...], top_k=5)
print(query_result)


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample documents
docs = ["Document one", "Document two", "Document three"]

# TF-IDF Vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

# Compute cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)
print("Similarity Matrix:\n", similarity_matrix)

Similarity Matrix:
 [[1.         0.25861529 0.25861529]
 [0.25861529 1.         0.25861529]
 [0.25861529 0.25861529 1.        ]]
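
The off-diagonal value 0.25861529 can be reproduced by hand, which shows what the vectorizer is doing. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the variable names are chosen here for illustration.

```python
import math

n_docs = 3

# "document" appears in all 3 documents; "one", "two", "three" each appear in 1
idf_document = math.log((1 + n_docs) / (1 + 3)) + 1
idf_unique = math.log((1 + n_docs) / (1 + 1)) + 1

# Every document has term frequency 1 for "document" and 1 for its own unique word,
# so each row has the same L2 norm before normalization
norm = math.sqrt(idf_document ** 2 + idf_unique ** 2)
shared_weight = idf_document / norm

# Two different documents overlap only on "document", so the cosine similarity
# of their normalized rows is the product of that single shared weight
similarity = shared_weight * shared_weight
print(round(similarity, 8))  # 0.25861529
```

Because the word shared by all three documents gets the lowest IDF, it contributes relatively little weight, and the documents end up only weakly similar.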


In [14]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for sentences
sentences = ["This is a test.", "This is another test."]
embeddings = model.encode(sentences)
print("Embeddings shape:", embeddings.shape)

Embeddings shape: (2, 384)
