# **VectorDB, Hugging Face & Ollama | Assignment**

## (1) What is a Vector Database (VectorDB) and how is it different from traditional databases?

Ans:

A Vector Database (VectorDB) is a specialized type of database used to store and manage vector embeddings, which are numerical representations of data such as text, images, audio, or videos generated by machine learning models. These vectors capture the semantic meaning of the data, allowing the database to perform similarity-based searches rather than exact matches. Vector databases use mathematical distance measures like cosine similarity or Euclidean distance to find data points that are closest in meaning to a given query.

In contrast, traditional databases, whether relational or NoSQL, are designed to store structured or semi-structured records (rows and columns, documents, or key-value pairs) and are optimized for exact-match and range queries using SQL or key-based lookups. They rely on indexing methods like B-trees or hash indexes and work well for transactional operations, but these indexes are not efficient for nearest-neighbor search over high-dimensional vector data. A traditional query returns the records whose values satisfy an explicit condition, whereas a vector database returns the items most similar in meaning to the query, making VectorDBs especially suitable for modern AI applications such as semantic search, recommendation systems, chatbots, and retrieval-augmented generation (RAG).
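
The difference between an exact-match lookup and a similarity search can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors and the names `toy_embeddings`, `exact_match`, and `nearest` are made up for the example, and a real VectorDB would use an index rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}

def exact_match(key):
    """Traditional lookup: returns a value only if the key matches exactly."""
    return toy_embeddings.get(key)

def nearest(query_vector, k=2):
    """Similarity search: rank every stored vector by cosine similarity."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in toy_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

print(exact_match("feline"))     # None: no key is exactly "feline"
print(nearest([1.0, 0.1, 0.0]))  # ['cat', 'kitten']
```

The exact-match lookup fails because the key "feline" was never stored, while the similarity search still returns the two closest items.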

## (2) Explain the various types of VectorDBs available and describe their suitability for different use cases.

Ans:

Vector databases can be classified into different types based on how they store, index, and manage vector data, and each type is suitable for specific use cases. In-memory vector databases store vectors directly in RAM, which makes them extremely fast and ideal for small to medium-sized datasets or applications that require real-time responses, such as chatbots and interactive recommendation systems; however, they are limited by memory size and are not cost-effective for very large datasets. Disk-based vector databases store vectors on persistent storage and use optimized indexing techniques to balance speed and scalability, making them suitable for large-scale applications like enterprise search systems and document retrieval platforms where data durability is important.

Another category is standalone vector databases, which are built exclusively for vector similarity search and provide approximate nearest-neighbor (ANN) indexing algorithms such as HNSW or IVF; these are well suited for AI-driven applications like semantic search, image similarity, and RAG pipelines because they trade a small amount of recall for much faster queries on large collections. In contrast, hybrid vector databases combine traditional database capabilities with vector search, allowing both structured queries and semantic similarity searches in a single system; this makes them ideal for applications that need metadata filtering along with vector similarity, such as e-commerce product search or personalized content delivery. Finally, cloud-managed vector databases offer scalability, automatic indexing, and easy integration with AI services, making them suitable for startups and production-grade AI systems that require minimal infrastructure management and high availability.
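
The hybrid pattern described above (a structured filter combined with vector similarity) can be sketched as follows. This is a toy, brute-force illustration under stated assumptions: the `products` list, its two-dimensional vectors, and the `hybrid_search` helper are hypothetical and do not correspond to any particular database's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical product catalogue: each record has metadata and an embedding.
products = [
    {"id": "p1", "category": "shoes", "vector": [0.9, 0.1]},
    {"id": "p2", "category": "shoes", "vector": [0.2, 0.8]},
    {"id": "p3", "category": "bags", "vector": [0.95, 0.05]},
]

def hybrid_search(query_vector, category, k=1):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [p for p in products if p["category"] == category]
    candidates.sort(
        key=lambda p: cosine_similarity(query_vector, p["vector"]),
        reverse=True,
    )
    return [p["id"] for p in candidates[:k]]

# "p3" is the closest vector overall, but the category filter excludes it.
print(hybrid_search([1.0, 0.0], category="shoes"))  # ['p1']
```

Filtering before ranking is only one possible design; some systems rank first and filter afterwards, which behaves differently when few records pass the filter.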

## (3) Why is Chroma DB considered important in the context of AI/ML projects? Describe its key features.

Ans:

Chroma DB is considered important in AI/ML projects because it is a lightweight, developer-friendly vector database specifically designed to support LLM-based applications, semantic search, and Retrieval-Augmented Generation (RAG) workflows. It makes it easy to store, manage, and retrieve vector embeddings generated by machine learning models, enabling AI systems to access relevant context efficiently. Due to its simplicity and tight integration with Python and popular ML libraries, Chroma DB is widely used in prototyping, research, and small-to-medium production AI applications.

The key features of Chroma DB include efficient vector similarity search using distance metrics such as cosine similarity and Euclidean distance, support for metadata storage and filtering alongside vectors, and seamless integration with embedding models and frameworks like LangChain and LlamaIndex. Chroma DB can run locally or in-memory, making it ideal for fast experimentation without complex infrastructure, while also supporting persistent storage for longer-term projects. Its simple API, open-source nature, and focus on AI-native workflows make Chroma DB a practical and important tool for modern AI/ML projects, especially those involving chatbots, document Q&A systems, and semantic retrieval.

## (4) What are the benefits of using Hugging Face Hub for generative AI tasks?

Ans:

Hugging Face Hub provides several benefits for generative AI tasks by acting as a centralized platform for models, datasets, and tools used in modern AI development. It gives developers easy access to thousands of pre-trained generative models for tasks such as text generation, translation, summarization, image generation, and speech processing, which significantly reduces development time and computational cost. By reusing well-tested models, researchers and engineers can focus on fine-tuning and application building instead of training models from scratch.

Another major benefit of Hugging Face Hub is its strong ecosystem and interoperability. It integrates seamlessly with popular frameworks like PyTorch, TensorFlow, and JAX, and provides high-level libraries such as Transformers, Diffusers, and Datasets, making experimentation and deployment easier. The Hub also supports version control, model sharing, and collaboration, enabling reproducibility and community-driven improvements. Additionally, features like model cards, evaluation benchmarks, and inference APIs help ensure transparency, responsible AI usage, and scalable deployment, making Hugging Face Hub highly valuable for both research and production-level generative AI projects.

## (5) Describe the process and advantages of navigating and using pre-trained models from the Hugging Face Hub.

Ans:

The process of navigating and using pre-trained models from the Hugging Face Hub begins with browsing or searching the Hub based on task type such as text generation, translation, summarization, image generation, or speech processing. Users can filter models by framework, language, license, and popularity, and then review the model card, which provides essential details such as the model’s purpose, training data, architecture, evaluation metrics, and usage examples. Once a suitable model is identified, it can be easily loaded using Hugging Face libraries like Transformers or Diffusers with just a few lines of code, allowing developers to quickly integrate the model into their AI or ML applications.

The advantages of using pre-trained models from the Hugging Face Hub include significant time and cost savings, as there is no need to train large models from scratch. Many of these models are widely used and community-tested, and often perform well out of the box, although quality varies and the model card should be checked before relying on a model. The Hub also supports fine-tuning, enabling models to be adapted to domain-specific tasks with smaller datasets. Additional benefits include version control, reproducibility, and easy sharing, as well as seamless integration with popular ML frameworks and deployment tools. Overall, the Hugging Face Hub accelerates experimentation, encourages collaboration, and makes generative AI development more accessible and scalable.
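
As a rough illustration of the "few lines of code" step, the sketch below wraps the Transformers `pipeline` function in a small helper. The helper name `load_text_generator` and the choice of `"gpt2"` as the default model ID are assumptions made for this example; any text-generation model ID from the Hub could be substituted.

```python
def load_text_generator(model_id="gpt2"):
    """Return a text-generation pipeline for the given Hub model ID."""
    # Imported lazily so that defining this helper does not require
    # the transformers package to be installed.
    from transformers import pipeline
    return pipeline("text-generation", model=model_id)

# Example usage (downloads the model weights on the first call):
# generator = load_text_generator()
# result = generator("Vector databases are", max_new_tokens=20)
# print(result[0]["generated_text"])
```

The usage lines are left commented out because the first call downloads the model from the Hub, which needs network access and some disk space.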

In [2]:
# 6

!pip install chromadb sentence-transformers


In [3]:
import chromadb
from sentence_transformers import SentenceTransformer

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create Chroma client
client = chromadb.Client()

# Create the collection, or reuse it if this cell is run again
collection = client.get_or_create_collection(name="semantic_search_demo")

# Sample documents
documents = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks",
    "Python is widely used in data science",
    "Vector databases store embeddings for similarity search"
]

# Generate embeddings
embeddings = model.encode(documents).tolist()

# Add documents to Chroma DB
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print("Documents inserted successfully!")


Documents inserted successfully!


In [4]:
# Search query
query = "AI and neural networks"

# Convert query into vector
query_embedding = model.encode([query]).tolist()

# Perform similarity search
results = collection.query(
    query_embeddings=query_embedding,
    n_results=2
)

print("Search Results:")
for doc in results["documents"][0]:
    print("-", doc)


Search Results:
- Deep learning uses neural networks
- Machine learning is a subset of artificial intelligence


In [11]:
# 7

# Install required libraries (run once)
%pip install transformers datasets torch

# Import libraries
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset

# Load pre-trained GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# GPT-2 does not have a pad token, so reuse the EOS token as the pad token.
# No new token is added to the vocabulary, so the embeddings need no resizing.
tokenizer.pad_token = tokenizer.eos_token

# Create a small custom dataset
texts = [
    "Machine learning is transforming the world.",
    "Artificial intelligence helps machines think.",
    "Deep learning uses neural networks.",
    "Python is popular for AI development."
]

dataset = Dataset.from_dict({"text": texts})

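# Tokenization function: for causal LM fine-tuning, labels are a copy of input_ids.
# Note: padded positions are also scored by the loss here, because the labels are not masked.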
def tokenize_function(example):
    tokens = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=64
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# Tokenize dataset
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["text"])
tokenized_dataset.set_format("torch")

# Training configuration
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
    logging_steps=1,
    save_steps=10,
    report_to="none"
)

# Trainer setup
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset
)

# Fine-tune the model
trainer.train()

# Text generation using fine-tuned model
input_text = "Artificial intelligence"
# Tokenize with an attention mask and move the tensors to the model's device
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_length=30,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id  # set explicitly for open-ended generation
)

print("Generated Text:")
print(tokenizer.decode(output[0], skip_special_tokens=True))


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
1,7.7288
2,5.3376
3,3.1617
4,1.8039
5,0.9819
6,1.6638



Generated Text:
Artificial intelligence is a new way of thinking.
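
In the fine-tuning cell above, the labels are a plain copy of `input_ids`, and the pad token is the EOS token, so padded positions contribute to the training loss. A common remedy is to set the labels at padded positions to `-100`, the index that the PyTorch cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; the helper name `mask_padding_in_labels` and the sample token IDs are assumptions made for illustration.

```python
def mask_padding_in_labels(input_ids, attention_mask, ignore_index=-100):
    """Copy input_ids into labels, replacing padded positions with ignore_index.

    attention_mask holds 1 for real tokens and 0 for padding.
    """
    return [
        [token if keep == 1 else ignore_index for token, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]

# Hypothetical batch: two real tokens followed by two pad (EOS, id 50256) tokens.
batch_input_ids = [[15, 27, 50256, 50256]]
batch_attention_mask = [[1, 1, 0, 0]]

print(mask_padding_in_labels(batch_input_ids, batch_attention_mask))
# [[15, 27, -100, -100]]
```

Inside `tokenize_function`, this would replace the `.copy()` line with `tokens["labels"] = mask_padding_in_labels(tokens["input_ids"], tokens["attention_mask"])`. The training loss would then differ from the values recorded above, since padding would no longer be scored.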


In [15]:
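# 8

# Note: these commands require the Ollama CLI to be installed and on PATH,
# and `ollama pull` additionally needs the Ollama server to be running.
# The "command not found" output below shows it was not installed in this environment.
!ollama --version
!ollama pull llama2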

/bin/bash: line 1: ollama: command not found
/bin/bash: line 1: ollama: command not found
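
One way to avoid the "command not found" failure is to check for the `ollama` binary before calling it. The following is a minimal sketch using only the Python standard library; the helper names `ollama_available` and `pull_model` are hypothetical, and the download URL in the message is given as a pointer rather than as a tested installation procedure.

```python
import shutil
import subprocess

def ollama_available():
    """Return True if an `ollama` executable is found on PATH."""
    return shutil.which("ollama") is not None

def pull_model(model_name="llama2"):
    """Run `ollama pull` if the CLI exists, otherwise return a hint."""
    if not ollama_available():
        return "ollama not found on PATH; install it first (see https://ollama.com/download)"
    completed = subprocess.run(
        ["ollama", "pull", model_name],
        capture_output=True,
        text=True,
    )
    # Return whatever the CLI printed, on stdout or stderr
    return completed.stdout or completed.stderr

print("ollama installed:", ollama_available())
```

Calling `pull_model()` in an environment without Ollama returns the hint instead of failing; with Ollama installed and its server running, it starts downloading the model.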
