In [0]:
%pip install --upgrade PyMuPDF sentence-transformers faiss-cpu transformers accelerate sentencepiece 
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
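# STEP 1: Load the CV and extract its text
import fitz  # PyMuPDF

def load_pdf(path):
    """Return the text of every page in the PDF, joined by newlines."""
    with fitz.open(path) as doc:  # close the file once the text is extracted
        return "\n".join(page.get_text() for page in doc)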

cv_text = load_pdf("Data/CV_GustavoCaldas.pdf")
print(f"Loaded CV with {len(cv_text.split())} words.")


Loaded CV with 980 words.


In [0]:
# STEP 2: Split into chunks for embedding (up to 150 words each)
def split_into_chunks(text, max_words=150):
    """Split text into consecutive, non-overlapping chunks of at most max_words words."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_words):
        chunk = " ".join(words[i:i+max_words])
        chunks.append(chunk)
    return chunks

chunks = split_into_chunks(cv_text)
print(f"Split into {len(chunks)} chunks.")


Split into 7 chunks.
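
Fixed, non-overlapping windows can cut a sentence or a CV entry in half, so the two halves end up in different chunks. A common mitigation is to let consecutive chunks share a few words. The block below is a minimal sketch of that idea, not part of the original notebook; `split_with_overlap` and its `overlap` parameter are hypothetical names introduced here.

```python
def split_with_overlap(text, max_words=150, overlap=30):
    """Split text into word windows that share `overlap` words with the previous window."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks

# Tiny demo on ten "words" so the overlap is easy to see.
demo = split_with_overlap(" ".join(str(n) for n in range(10)), max_words=4, overlap=2)
print(demo)
```

With `overlap=0` the function behaves like `split_into_chunks`. A larger overlap produces more chunks, so the index grows accordingly.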


In [0]:
# STEP 3: Generate embeddings with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

embed_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embed_model.encode(chunks)
print("Embeddings created.")


Embeddings created.


In [0]:
# STEP 4: Build FAISS index for similarity search
import faiss

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.array(embeddings).astype('float32'))
print(f"FAISS index created with {index.ntotal} vectors.")


FAISS index created with 7 vectors.
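
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below illustrates that computation on toy vectors. It is for intuition only: `l2_search` and `toy_vectors` are hypothetical names, and FAISS itself uses optimized vectorized code rather than a Python loop.

```python
def l2_search(query, vectors, top_k=3):
    """Return (distances, indices) of the top_k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and this vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = l2_search([1.0, 1.0], toy_vectors, top_k=2)
print(distances, indices)
```

A smaller distance means a closer match, which is why the first index returned by `index.search` points to the most similar chunk.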


In [0]:
# STEP 5: Load lightweight LLM (flan-t5-small)
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "google/flan-t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.eval()
print(f"Loaded model: {model_name}")


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Loaded model: google/flan-t5-small


In [0]:
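# STEP 6: Define question-answer function
def ask_question(question, top_k=3, max_new_tokens=150):
    # Embed the question with the same model used for the chunks
    q_embed = embed_model.encode([question])

    # Retrieve the top_k nearest chunks (D: distances, I: chunk indices)
    D, I = index.search(np.array(q_embed).astype('float32'), top_k)
    context = "\n".join([chunks[i] for i in I[0]])

    # Build the prompt from the retrieved context and the question.
    # Note: the prompt is not truncated, so it can exceed the model's 512-token limit.
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

    # Generate the answer and decode it back to text
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer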


In [0]:
response = ask_question("What experience does Gustavo have with Databricks?")
print("Answer:\n", response)


Token indices sequence length is longer than the specified maximum sequence length for this model (1121 > 512). Running this sequence through the model will result in indexing errors


Answer:
 DATA ENGINEER NATIONALE-NEDERLANDEN
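
The tokenizer warning above (1121 > 512) shows that three 150-word chunks plus the question do not fit into the model's input. One possible way to handle this is to add retrieved chunks in rank order only while the full prompt stays within a token budget. The sketch below is an assumption-laden illustration, not the notebook's method: `build_prompt`, `count_tokens`, and `max_tokens` are hypothetical names, and the demo counts whitespace-separated words instead of real tokens so that it runs without a model.

```python
def build_prompt(question, ranked_chunks, count_tokens, max_tokens=512):
    """Add retrieved chunks, best first, while the whole prompt fits the budget."""
    def render(parts):
        return "Context: " + "\n".join(parts) + f"\nQuestion: {question}\nAnswer:"

    kept = []
    for chunk in ranked_chunks:
        if count_tokens(render(kept + [chunk])) > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
    return render(kept)

# Stand-in counter for the demo: one "token" per whitespace-separated word.
word_count = lambda text: len(text.split())
prompt = build_prompt("Who?", ["a b c", "d e f g", "h"], word_count, max_tokens=8)
print(prompt)
```

In the notebook, the counter could instead be something like `lambda s: len(tokenizer.encode(s))`, so that the budget is measured in the model's own tokens. Because the question is always rendered in full, it is never the part that gets dropped.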


In [0]:
response = ask_question("Which are Gustavo's strengths?")
print("Answer:\n", response)

Answer:
 he desarrollado un fuerte interés por el mundo de los datos, orientando mi aprendizaje y experiencia hacia la Ingeniera de Datos, el Machine Learning y la Visualización. I apasiona construir bases de datos sólidas, disear modelos de predicción y extraer information valiosa que aporte un impacto significativo a las empresas. Mis principales intereses giran en torno a transform


In [0]:
response = ask_question("EXPERIENCIA LABORAL")
print("Answer:\n", response)

Answer:
 educaciain y formacin 19/06/2023 – 11/12/2023 Madrid, Espaa BOOTCAMP EN MARKETING DATA ANALYTICS Universidad Internacional de La Rioja Web www.unir.net Nivel en el MEC Nivel 5 EQF-MEC Lengua(s) materna(s): PORTUGUÉS Otro(s) idioma(s): COMPRENSIN EXPRESIN ORAL EXPRESIN ESCRITA Comprensión auditiva Comp


In [0]:
response = ask_question(" Universidade do Minho")
print("Answer:\n", response)

Answer:
 Universidad Internacional de La Rioja Web www.unir.net Nivel en el MEC Nivel 5 EQF-MEC Lengua(s) materna(s): PORTUGUÉS Otro(s) idioma(s): COMPRENSIN EXPRESSIN ORAL EXPRESSIN ESCRITA Comprensión auditiva Comprensión lectora Producción oral


# Disclaimer

Due to computational resource constraints in my cloud environment (Databricks Free), this assistant uses a lightweight language model (`flan-t5-small`). The model is fast and easy to run, but it sacrifices accuracy and detail: as the answers above show, it often copies fragments of the retrieved context instead of answering the question. A larger, more capable model would give better results, but it typically requires GPU resources or paid API access.
