In [None]:
%pip install sentence-transformers transformers bitsandbytes accelerate datasets pandas scikit-learn

# Simple Retrieval-Augmented Generation (RAG) Pipeline
Here we implement a simple RAG pipeline without relying on a vector database: everything is done in memory. We *strongly* suggest running this code on a CUDA-compatible GPU.

In [31]:
import pandas as pd
import torch
import datasets
from sentence_transformers import SentenceTransformer
from transformers import pipeline
from sklearn.metrics.pairwise import cosine_similarity

device = "cuda" if torch.cuda.is_available() else "cpu"

<hr>

## Data
First, we need to load data containing textual information that we want to encode. If you are running this in Google Colab, upload the "isco-08-en.csv" dataset to the root of the project.

In [4]:
df = pd.read_csv("isco-08-en.csv", sep=",")

df.head(5)

Unnamed: 0,Level,ISCO 08 Code,Title EN,Definition,Tasks include,Included occupations
0,1,1,Managers,"Managers plan, direct, coordinate and evaluate...",Tasks performed by managers usually include: f...,Occupations in this major group are classified...
1,2,11,"Chief Executives, Senior Officials and Legisla...","Chief executives, senior officials and legisla...",Tasks performed by workers in this sub-major g...,Occupations in this sub-major group are classi...
2,2,12,Administrative and Commercial Managers,"Administrative and commercial managers plan, o...",Tasks performed by workers in this sub-major g...,Occupations in this sub-major group are classi...
3,2,13,Production and Specialized Services Managers,Production and specialized services managers p...,Tasks performed by workers in this sub-major g...,Occupations in this sub-major group are classi...
4,2,14,"Hospitality, Retail and Other Services Managers","Hospitality, shop and related services manager...",Tasks performed by workers in this sub-major g...,Occupations in this sub-major group are classi...


As we can see, there are different hierarchy levels and different textual columns in this dataset. How do we handle this? Usually, the more context the better, so we will use all the textual columns to create one descriptor for each job classification.

In [5]:
descriptor_template = """# Title: {title}
# Definition: {definition}
# Included tasks and occupations: {tasks} {occupations}"""

for i, row in df.iterrows():
    title = row["Title EN"]
    definition = row["Definition"]
    tasks = row["Tasks include"]
    occupations = row["Included occupations"]

    descriptor = descriptor_template.format(
        title=title, definition=definition, tasks=tasks, occupations=occupations
    )
    df.at[i, "Descriptor"] = descriptor
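
The template above assumes every text column is filled in. If a cell is empty, pandas reads it as `NaN` and the literal string "nan" ends up in the descriptor. Below is a minimal sketch of a more defensive builder that skips missing fields. The helper names (`is_missing`, `build_descriptor`) and the one-line-per-field layout are assumptions made for illustration, not part of the original pipeline.

```python
import math

# Hypothetical mapping from descriptor labels to CSV column names.
FIELDS = [
    ("Title", "Title EN"),
    ("Definition", "Definition"),
    ("Tasks", "Tasks include"),
    ("Included occupations", "Included occupations"),
]


def is_missing(value):
    """True for None, NaN (how pandas marks empty CSV cells) or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def build_descriptor(row):
    """Join the non-missing text fields of one row into a single descriptor."""
    lines = [
        f"# {label}: {str(row[column]).strip()}"
        for label, column in FIELDS
        if not is_missing(row.get(column))
    ]
    return "\n".join(lines)


# A plain dict stands in for one DataFrame row.
example_row = {
    "Title EN": "Managers",
    "Definition": "Managers plan, direct, coordinate and evaluate activities.",
    "Tasks include": float("nan"),  # simulated empty cell
    "Included occupations": "",
}
descriptor = build_descriptor(example_row)
print(descriptor)
```

Because a pandas row also supports `.get`, the same function could in principle be applied with `df.apply(build_descriptor, axis=1)`.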

## Sentence Embedding
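Now, we will load a pre-trained sentence transformer to encode each descriptor into a dense embedding vector. Note that this model truncates inputs longer than 256 word pieces, so very long descriptors are only partially encoded.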

In [None]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)

embeddings = model.encode(df["Descriptor"].tolist(), show_progress_bar=True)

Now we have a dense vector associated with each job classification; together, these vectors form our knowledge base. Next, we need to set up a retrieval system, which requires a similarity measure. We will use the cosine similarity between the user query (in this case, text from an enterprise's balance sheet) and each vector in the knowledge base.

In [7]:
def get_most_similar(query, embeddings, model, k=5):
    encoded_query = model.encode(query)
    cos_sims = cosine_similarity([encoded_query], embeddings)[0]
    most_similar_ids = cos_sims.argsort()[-k:][::-1]
    return most_similar_ids
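
To make the similarity measure concrete, here is a minimal NumPy sketch of the same idea: after scaling every vector to unit length, cosine similarity is just a dot product, and the top `k` results are the largest scores. The function name `top_k_cosine` and the toy 2-D vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, k=5):
    """Return indices and scores of the k rows of doc_matrix closest to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Scale to unit length so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    k = min(k, len(scores))
    # Sort by descending score and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy "embeddings": four documents in two dimensions.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
ids, sims = top_k_cosine(query, docs, k=2)
print(ids, sims.round(3))
```

Returning the scores alongside the indices makes it possible to apply a minimum-similarity threshold before passing candidates to the LLM.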

Now we can write a simple query and test the retrieval mechanism. The **get_most_similar** function returns the row positions of the `k` most similar job classifications, ordered from most to least similar.

In [8]:
query = "Our activity produced three thousands tons of metal in the year 2025."

most_similar_ids = get_most_similar(query, embeddings, model, 5)

for idx, identifier in enumerate(most_similar_ids):
    row = df.iloc[identifier]
    print(f'({idx+1}) {row["Title EN"]} - ISCO code {row["ISCO 08 Code"]}')

(1) Metal, Machinery and Related Trades Workers - ISCO code 72
(2) Craft and Related Trades Workers - ISCO code 7
(3) Labourers in Mining, Construction, Manufacturing and Transport - ISCO code 93
(4) Science and Engineering Professionals - ISCO code 21
(5) Building and Related Trades Workers (excluding Electricians) - ISCO code 71


## LLM Implementation
In RAG pipelines, the generative LLM is usually the last element in the chain: it receives the user query along with the retrieved relevant information. First, we need to load the LLM. We will use Google's [Gemma 3 (1B) instruction-tuned model](https://huggingface.co/unsloth/gemma-3-1b-it). We load it from [Unsloth](https://unsloth.ai/?ref=producthunt) because their copy does not require user authentication; it is the same model.

In [None]:
llm = pipeline("text-generation", model="unsloth/gemma-3-1b-it", device=device)

Let's define a prompt structure. This needs to be tweaked to suit your needs: for example, we can ask the model to output *only* the code, or the title, or format its response in a specific way.

In [10]:
prompt_template = """Given this text: "{query}"

Classify it into one of the following job classifications: {most_similar_jobs}"""
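
As an example of such a tweak, the sketch below builds a variant of the prompt that lists each candidate with its ISCO-08 code and asks for the code only. The template wording and the names `code_prompt_template` and `format_candidates` are illustrative assumptions; any system prompt used with it would need to ask for the same output format.

```python
# Hypothetical variant of the prompt: ask for the code instead of the title.
code_prompt_template = """Given this text: "{query}"

Choose the best matching job classification from the list below.
Answer with the ISCO-08 code only.

{candidates}"""


def format_candidates(rows):
    """Render (code, title) pairs as one 'code: title' line each."""
    return "\n".join(f"{code}: {title}" for code, title in rows)


# Pairs as they might come from the "ISCO 08 Code" and "Title EN" columns.
candidates = [
    (72, "Metal, Machinery and Related Trades Workers"),
    (7, "Craft and Related Trades Workers"),
]
prompt = code_prompt_template.format(
    query="Our activity produced three thousands tons of metal in the year 2025.",
    candidates=format_candidates(candidates),
)
print(prompt)
```

Asking for a short code rather than a long title also makes the model's answer easier to validate against the retrieved candidates.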

To help the model understand the task, we can also include a system prompt.

In [17]:
system_prompt = """You classify text from enterprises' balance sheets into one of the given job classifications.
You **ONLY** output the _exact_ name of the class, without any additional text."""

Now let's implement the full pipeline.

In [25]:
query = "Our activity produced three thousands tons of metal in the year 2025."

most_similar_ids = get_most_similar(query, embeddings, model, 5)

most_similar_jobs = "\n".join(
    [f"({idx+1}) " + df.iloc[identifier]["Title EN"] for idx, identifier in enumerate(most_similar_ids)]
)

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [{"type": "text", "text": prompt_template.format(query=query, most_similar_jobs=most_similar_jobs)}]},
]

# The pipeline returns the whole conversation; the last message is the model's reply.
response = llm(messages, max_new_tokens=20)[0]["generated_text"][-1]["content"]

print(response)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Metal, Machinery and Related Trades Workers



### Classification of multiple texts
Say we have multiple texts we want to classify. As the warning above suggests, calling the pipeline one text at a time does not make good use of the GPU. Here we store the queries in a Hugging Face `datasets` Dataset, which is very easy to set up and is particularly convenient when we need to classify a large number of texts.

In [47]:
queries = [
    "Our activity produced three thousands tons of metal in the year 2025.",
    "We invested ten million dollars to research new vaccines.",
    "During 2025, we introduced several new products including new pastas and salads options."
]

queries_dataset = datasets.Dataset.from_dict({"query": queries})

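The pipeline object accepts a list of conversations and returns one result per conversation, so this is as easy as passing the list of messages to the LLM. By default the inputs are still processed one at a time; passing a `batch_size` argument to the call lets the pipeline group them into batches.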

In [48]:
messages = []

for query in queries_dataset["query"]:
    most_similar_ids = get_most_similar(query, embeddings, model, 5)
    most_similar_jobs = "\n".join(
        [f"({idx+1}) " + df.iloc[identifier]["Title EN"] for idx, identifier in enumerate(most_similar_ids)]
    )
    message = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [{"type": "text", "text": prompt_template.format(query=query, most_similar_jobs=most_similar_jobs)}]},
    ]
    messages.append(message)

output = llm(messages, max_new_tokens=20)
# Each result holds the full conversation; the last message is the model's reply.
responses = [out[0]["generated_text"][-1]["content"] for out in output]
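
The message-building code is now repeated in two cells. One way to avoid this, sketched below, is to move it into a small helper. The name `build_messages` and its signature are assumptions for illustration, and the stand-in prompt strings exist only so that the sketch runs on its own.

```python
def build_messages(query, candidate_titles, system_prompt, prompt_template):
    """Build one chat conversation (system turn + user turn) for a single query."""
    numbered = "\n".join(
        f"({i}) {title}" for i, title in enumerate(candidate_titles, start=1)
    )
    user_text = prompt_template.format(query=query, most_similar_jobs=numbered)
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [{"type": "text", "text": user_text}]},
    ]


# Stand-in values; in the notebook these would be the real prompts and titles.
demo_system = "You classify text into one of the given job classifications."
demo_template = 'Given this text: "{query}"\n\nClassify it into one of: {most_similar_jobs}'
demo_messages = build_messages(
    "We invested ten million dollars to research new vaccines.",
    ["Health Professionals", "Science and Engineering Professionals"],
    demo_system,
    demo_template,
)
print(demo_messages[1]["content"][0]["text"])
```

With such a helper, the loop body reduces to retrieving the candidate titles and appending the result of one function call.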

Finally, we can inspect the results.

In [49]:
for query, response in zip(queries, responses):
    print(f'Query: "{query}"\nResponse: "{response}"\n')

Query: "Our activity produced three thousands tons of metal in the year 2025."
Response: "Metal, Machinery and Related Trades Workers
"

Query: "We invested ten million dollars to research new vaccines."
Response: "Health Professionals
"

Query: "During 2025, we introduced several new products including new pastas and salads options."
Response: "Food Preparation Assistants
"
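
The responses above end with a newline, and a small model will not necessarily reproduce a title character for character. A possible post-processing step, sketched here with the standard library's `difflib`, is to clean the raw text and map it back to one of the retrieved candidates. The function name `match_to_candidate` and the `0.8` cutoff are illustrative assumptions that would need tuning on real outputs.

```python
import difflib


def match_to_candidate(response, candidates, cutoff=0.8):
    """Map raw model output to one of the candidate titles, or None if nothing is close."""
    cleaned = response.strip().strip('"').strip()
    by_lower = {c.lower(): c for c in candidates}
    # Exact (case-insensitive) match first.
    if cleaned.lower() in by_lower:
        return by_lower[cleaned.lower()]
    # Otherwise fall back to the closest title above the similarity cutoff.
    close = difflib.get_close_matches(
        cleaned.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[close[0]] if close else None


candidates = [
    "Health Professionals",
    "Food Preparation Assistants",
    "Craft and Related Trades Workers",
]
print(match_to_candidate("Health Professionals\n", candidates))
print(match_to_candidate("food preparation assistant", candidates))
print(match_to_candidate("I cannot answer that.", candidates))
```

Returning `None` for unmatched responses makes it easy to flag texts that need a retry or manual review.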

