#### Task 1: Simple vector embedding generation

**Objective:**
Generate vector embeddings from text data.

**Task Description:**

- Load the Hugging Face embedding model (`model_name="sentence-transformers/all-mpnet-base-v2"`)
- Embed simple text queries

How to select the right embedding model: [MTEB - Massive Text Embedding Benchmark](https://huggingface.co/blog/mteb)

**Useful links:**

- [Langchain Chroma](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/)


In [1]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.documents import Document

# ADD HERE YOUR CODE
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



In [2]:
text = "This is a test document."

# ADD HERE YOUR CODE
# Embed the text query
query_vector = embedding_model.embed_query(text=text)

print(f"Embedding vector length: {len(query_vector)}")
print(query_vector[:10])

Embedding vector length: 768
[-0.048951808363199234, -0.039862047880887985, -0.021562788635492325, 0.009908556006848812, -0.03810393065214157, 0.012684349901974201, 0.04349448159337044, 0.07183390110731125, 0.009748606011271477, -0.006987082306295633]
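
A single embedding is hard to interpret on its own; it becomes useful when compared with another one. A minimal sketch of one common comparison, cosine similarity, is shown below. The toy 3-dimensional vectors are made up for illustration and stand in for the 768-dimensional vectors the model returns.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, not real model output).
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

print(cosine_similarity(vec_a, vec_b))  # same direction: close to 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal: 0
```

In the notebook, the same function could be applied to the results of two `embedding_model.embed_query(...)` calls to see how close two sentences are.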


#### Task 2: Generate embedding vectors with a custom dataset

**Objective:**
Load a custom dataset, preprocess it, and generate vector embeddings.

**Task Description:**

- Load the PDF document "AI_Book.pdf" with the LangChain document loader `PyPDFLoader`
- Use `RecursiveCharacterTextSplitter` to split the documents into chunks
- Generate embeddings for single documents

**RecursiveCharacterTextSplitter:**

This text splitter is the recommended one for generic text. It is parameterized by a list of separator strings and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (then lines, then words) together for as long as possible, since those tend to be the most semantically related pieces of text.
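
The idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, and the sketch ignores options the real class offers, such as chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the first separator, greedily merges neighbouring pieces that still
    fit, and recurses with the remaining separators on pieces that are too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

With a 30-character limit, the first and third paragraphs survive intact, while the second is too long and is split at a word boundary instead.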

**Useful links:**

- [Langchain PyPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html)
- [Langchain RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)


In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import re

pdf_doc = "./AI_Book.pdf"

# Create pdf data loader
# ADD HERE YOUR CODE
loader = PyPDFLoader(file_path=pdf_doc)

# Load and split documents in chunks
# ADD HERE YOUR CODE
pages = loader.load()
pages_chunked = RecursiveCharacterTextSplitter().split_documents(pages)

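# Clean a chunk's text: drop lone surrogate code points (invalid Unicode that
# breaks UTF-8 encoding), then drop all remaining non-ASCII characters.
def clean_document_text(chunk):
    text = chunk.page_content
    # Remove lone surrogate code points
    text = re.sub(r'[\ud800-\udfff]', '', text)
    # Remove non-ASCII characters (this also strips accented letters and
    # typographic quotes; skip this step if your use case needs them)
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # Document was imported from langchain_core.documents in the first cell
    return Document(page_content=text, metadata=chunk.metadata)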

pages_chunked_cleaned = [clean_document_text(chunk) for chunk in pages_chunked]

print(pages_chunked_cleaned[0])

page_content='Aurélien GéronHands-on Machine Learning with
Scikit-Learn, Keras, and
TensorFlow
Concepts, Tools, and Techniques to
Build Intelligent SystemsSECOND EDITION
Boston Farnham Sebastopol Tokyo Beijing Boston Farnham Sebastopol Tokyo Beijing' metadata={'source': './AI_Book.pdf', 'page': 2}
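
The non-ASCII filter in `clean_document_text` removes accented letters outright, so a name like "Aurélien" would lose a character. If that matters, one possible alternative is to decompose accented characters first and keep their base letters. The sketch below uses only the standard library; the helper name `fold_to_ascii` is hypothetical.

```python
import re
import unicodedata

def fold_to_ascii(text):
    """Drop lone surrogates, then reduce accented letters to their base letter."""
    text = re.sub(r"[\ud800-\udfff]", "", text)
    # NFKD splits "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")

name = "Aurélien Géron"
print(re.sub(r"[^\x00-\x7F]+", "", name))  # plain filter: accented letters vanish
print(fold_to_ascii(name))                 # folding: base letters are kept
```

Characters without an ASCII decomposition, such as typographic quotes, are still dropped by this approach.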


In [None]:
print(pages_chunked_cleaned[1])

In [None]:
print(f"Number of text chunks: {len(pages_chunked_cleaned)}")

#### Task 3: Store vector embeddings from a PDF document in ChromaDB

**Objective:**

Store the document chunks and their vector embeddings in ChromaDB so they can be retrieved later by vector search.

**Task Description:**

- Create a ChromaDB client
- Create a ChromaDB collection
- Create a LangChain Chroma vector store from the client
- Store the text chunks and their vector embeddings in the vector database

**Useful links:**

- [Langchain How To](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/#initialization-from-client)


In [19]:
from langchain_chroma import Chroma
import chromadb
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

client = chromadb.HttpClient(
    host="localhost",
    port=8000,
    ssl=False,
    headers=None,
    settings=Settings(allow_reset=True, anonymized_telemetry=False),
    tenant=DEFAULT_TENANT,
    database=DEFAULT_DATABASE,
)

# ADD HERE YOUR CODE
# Create a collection

collection_name = "AI_Book"

collection = client.get_or_create_collection(collection_name)

# ADD HERE YOUR CODE
# Create chromadb
vector_db_from_client = Chroma(
    client=client,
    collection_name=collection_name,
    embedding_function=embedding_model
)

In [20]:
from uuid import uuid4

uuids = [str(uuid4()) for _ in range(len(pages_chunked_cleaned[:50]))]

# ADD HERE YOUR CODE
vector_db_from_client.add_documents(documents=pages_chunked_cleaned[:50], ids=uuids)

['47ebb199-ac7a-4fe3-b3ae-a094756c8ea6',
 '61786197-c7d3-466e-8e40-9aa3a1134f1d',
 '0e550d96-f29a-4fcd-bd60-067d4c9e29a1',
 '386c718e-56a0-4860-85d7-f5d28a5c289f',
 '6c57dd8e-da9f-4f99-827f-a96cbca375d9',
 'ecae7bb7-ebd0-4b94-9d44-7e5c12ef3bd6',
 '0573a319-8eb7-4345-8a12-a19e90a5cf9b',
 '38d3ee78-9f9b-4510-8f44-f16e7602e62d',
 '950a321b-c4e7-4e4b-9b2a-5667765322c8',
 'edfb6e6e-9b90-4fb8-b13f-c361f15ae864',
 'f107498a-68db-4509-8ed0-5a97b36e9646',
 'b0835cde-6de9-42cf-850f-9231c87880bc',
 '199eda86-353c-444e-ab93-d0e685005433',
 '4ef73735-4dc7-4c63-b992-2e6dbcd3415d',
 '8b5e0347-1ac5-46b6-8eb7-0a9737027367',
 '87aca02e-d65b-4c4f-9fe1-ebc67489ec22',
 'fce61e03-40bf-4782-aa5d-a821b6dc0bf9',
 '0e05208a-d17f-4aa9-939b-1b254b0e1693',
 'e7e2a7b9-ee5c-4573-81fa-a5651e7d955d',
 'ccac0e9c-7903-4ae4-80fb-b26793a35e00',
 'e05fd145-7fa9-4ab3-9c51-24d1f9691bcf',
 'a8e16440-1152-4981-a1c7-45752bbe4f59',
 '821e0f88-a7d3-4418-a451-473ac03e31d6',
 '8695ac46-7ab2-4bf7-b7e0-cb2751f4f3d3',
 '626fae68-db48-

In [21]:
client.count_collections()

1

In [8]:
# Uncomment to delete the collection and start over
# client.delete_collection(collection_name)
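
Because `uuid4()` produces fresh random IDs on every run, re-running the ingestion cell adds the same chunks again under new IDs. One possible way around this is to derive each ID from the chunk itself. The sketch below is an illustration only: the helper `stable_chunk_id` and the ID format are assumptions, not part of LangChain or ChromaDB.

```python
import hashlib
import uuid

def stable_chunk_id(source, page, text):
    """Derive a repeatable UUID from a chunk's origin and content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # uuid5 is deterministic: the same name always maps to the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}#page={page}#{digest}"))

id_first = stable_chunk_id("./AI_Book.pdf", 2, "Types of Machine Learning Systems")
id_again = stable_chunk_id("./AI_Book.pdf", 2, "Types of Machine Learning Systems")
id_other = stable_chunk_id("./AI_Book.pdf", 3, "Types of Machine Learning Systems")

print(id_first == id_again)  # True: same chunk, same ID
print(id_first == id_other)  # False: different page, different ID
```

In the notebook, such IDs could be built from each chunk's `metadata["source"]`, `metadata["page"]`, and `page_content`, then passed as `ids=`. Assuming the vector store upserts by ID, a repeated run should then overwrite existing entries instead of duplicating them.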

#### Task 4: Access ChromaDB and perform vector search

**Objective:**

Use a query to perform a vector search against the ChromaDB vector database.

**Task Description:**

- Define a query
- Run the vector search
- Print the k=3 most similar documents

**Useful links:**

- [Langchain Query ChromaDB](https://python.langchain.com/v0.2/docs/integrations/vectorstores/chroma/#query-directly)
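
Conceptually, a vector search embeds the query and returns the stored vectors closest to it. The brute-force sketch below illustrates this with Euclidean distance on toy 3-dimensional vectors. The function `top_k`, the chunk names, and the values are made up for illustration. Vector databases typically use approximate nearest-neighbour indexes instead, and the distance metric depends on how the collection is configured.

```python
import heapq
import math

def top_k(query_vec, doc_vecs, k=3):
    """Return (distance, doc_id) pairs for the k vectors closest to query_vec."""
    scored = ((math.dist(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return heapq.nsmallest(k, scored)

# Toy index (assumed values): document ID -> embedding vector.
toy_index = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [0.8, 0.2, 0.1],
    "chunk-d": [-1.0, 0.0, 0.0],
}
query = [1.0, 0.0, 0.0]

for distance, doc_id in top_k(query, toy_index, k=3):
    print(doc_id, round(distance, 3))
```

The closest chunks come first. To inspect scores for real results, the LangChain vector store also provides `similarity_search_with_score`, which returns each document together with its score.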


In [24]:
search_query = "Types of Machine Learning Systems"

results = vector_db_from_client.similarity_search(
    search_query,
    k=3
)

for res in results:
    print(res.page_content)
    print(res.metadata)
    print("\n----------------\n")

Types of Machine Learning Systems
There are so many different types of Machine Learning systems that it is useful to
classify them in broad categories based on:
Whether or not they are trained with human supervision (supervised, unsuper
vised, semisupervised, and Reinforcement Learning)
Whether or not they can learn incrementally on the fly (online versus batch
learning)
Whether they work by simply comparing new data points to known data points,
or instead detect patterns in the training data and build a predictive model, much
like scientists do (instance-based versus model-based learning)
These criteria are not exclusive; you can combine them in any way you like. For
example, a state-of-the-art spam filter may learn on the fly using a deep neural net
work model trained using examples of spam and ham; this makes it an online, model-
based, supervised learning system.
Lets look at each of these criteria a bit more closely.
Supervised/Unsupervised Learning
Machine Learning systems can be