In [25]:
!pip install sentence-transformers langchain faiss-cpu openai transformers pypdf


### 1. Load a PDF file or web page

In [None]:
from langchain.document_loaders import PyPDFLoader

# PyPDFLoader returns one Document per PDF page (it relies on the pypdf package)
loader = PyPDFLoader("Interview Questions.pdf")
pages = loader.load()

[Document(page_content='Q u a l c o m m  G e n A I  P r e p a r a t i o n\nM o d e l  O p t i m i z a t i o n  &  D e p l o y m e n t\n1.  What ar e quantization and pruning? Ho w do t he y help optimiz e AI models?\nThese ar e model compr ession t echniques  used to reduce the siz e, latency, and \npower consumption of deep le arning models — especiall y important for deplo ying \nmodels on edge de vices lik e smartphones and Io T.\n🔹  Quantization\nQuantization r educes the pr ecision of model p ar amet ers (weights and \nactivations), t ypically from 32-bit flo ating-point (flo at32\ue082 to 8-bit int egers (int8\ue082,  \nwithout signific antly impacting accurac y.\n✅  Benef it s:\n•Smaller model siz e\n•Faster inference (especiall y on mobile/edge de vices)\n•Reduced memor y usage\n🛠  T ypes:\n• P ost -tr aining quantization:  Apply after training ( easy, minimal e xtra steps)\n• Quantization-aw ar e tr aining \ue081Q A T\ue082\ue092 Simulat es quantization dur ing training  \n(mo

### 2. Split long texts into chunks

In [9]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Chunks of up to 500 characters; neighbouring chunks share up to 50 characters
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
document = splitter.split_documents(pages)
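
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and newlines), and `sliding_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input: 1200 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.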

### 3. Convert to embeddings

In [27]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# Embed every chunk with a small sentence-transformers model, then index the vectors in FAISS
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_base = FAISS.from_documents(document, embedding_model)

### 4. Ask a question

In [22]:
query = "What is AWS?"
# Retrieve the 3 chunks whose embeddings are closest to the query embedding
docs = vector_base.similarity_search(query, k=3)
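
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the `k` stored chunks whose vectors lie closest to it. The brute-force sketch below shows that idea on toy 2-D vectors using Euclidean distance, which is, as far as I know, the default metric of the LangChain FAISS wrapper. The texts, vectors, and the `top_k` helper are made up for illustration; a real index searches 384-dimensional MiniLM embeddings far more efficiently.

```python
import math


def top_k(query_vec, items, k=3):
    """Return the k (text, distance) pairs closest to query_vec.

    items is a list of (text, vector) pairs.
    """
    scored = [(text, math.dist(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical toy "index": each text has a hand-picked 2-D vector
toy_index = [
    ("cloud platforms", (0.9, 0.1)),
    ("quantization", (0.1, 0.9)),
    ("serverless inference", (0.8, 0.3)),
    ("pruning", (0.2, 0.8)),
]
query_vec = (1.0, 0.0)

results = top_k(query_vec, toy_index, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```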

### 5. Generate an answer with an LLM (OpenAI or a free local model)

In [28]:
# Use a free LLM (FLAN-T5) for QA
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain

# Load FLAN-T5 for generative Q&A
generator = pipeline("text2text-generation", model="google/flan-t5-base", max_new_tokens=256)
llm = HuggingFacePipeline(pipeline=generator)

# QA chain (no-sources version, suitable for FLAN-T5).
# "stuff" places all retrieved chunks in a single prompt, so the prompt
# has to fit within the model's maximum input length.
chain = load_qa_chain(llm, chain_type="stuff")
response = chain.run(input_documents=docs, question=query)

# 6. Print the answer
print("Answer:", response)

Device set to use mps:0
Token indices sequence length is longer than the specified maximum sequence length for this model (614 > 512). Running this sequence through the model will result in indexing errors


Answer: Cloud Plat f or msAWS  fle xibility, GCP  AI-nativ e, Azure = enterprise MS int egration Sc aling AI Use data/model p arallelism, aut o-scaling, ser verless inference • Dat a Management  DVC, Delta Lake • Exper iment T r acking MLflow, Weights & Biases • Pipelines  Kubeflow, Airflow, TFX • Deplo yment Seldon, Bent oML, SageMak er • Monit or ing Prometheus, Graf ana, Fiddler  ML Ops ensur es r epr oducibilit y , r eliabilit y , and sc alabilit y of machine le arning systems in pr oduction en vironments.
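
The tokenizer warning in the output above (614 > 512) indicates that the stuffed prompt was longer than the maximum input length configured for FLAN-T5. One possible mitigation is to keep only as many retrieved chunks as fit a token budget before calling the chain. The sketch below is an assumption-laden illustration: `count_tokens` is a crude whitespace word count standing in for the model's real tokenizer, and `fit_to_budget` is a hypothetical helper, not a LangChain function.

```python
def count_tokens(text):
    # Crude stand-in for a tokenizer: count whitespace-separated words
    return len(text.split())


def fit_to_budget(texts, max_tokens, count=count_tokens):
    """Keep texts in their ranked order until the next one would exceed max_tokens."""
    kept, used = [], 0
    for text in texts:
        cost = count(text)
        if used + cost > max_tokens:
            break
        kept.append(text)
        used += cost
    return kept


# Three fake retrieved chunks of 200 "tokens" each
retrieved = ["alpha " * 200, "beta " * 200, "gamma " * 200]
kept = fit_to_budget(retrieved, max_tokens=450)
print(len(kept))  # 2
```

The same loop could be applied to the `docs` list by counting `d.page_content` for each document, with the budget set below 512 to leave room for the prompt template and the question. Lowering `k` or `chunk_size` would be a simpler alternative.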
