# Retrieving structured data from an undatasio-parsed PDF with qwen_agent

![](example_content/undatasio_example.png)

_By stay, Tech Enthusiast @Undatasio_
- - -

#### 🚀 Let's begin this example. 
    😃 😎 😝

📣 This notebook shows how to use the qwen_agent framework to retrieve structured data from a markdown file that the undatasio platform produced by parsing a PDF.

#### 📚 Below are the steps I took for this example:
- 📄 Upload the PDF file to be parsed to the undatasio platform.
  - _Install the undatasio Python library._
  - _Import environment variables._
  - _Use the undatasio library to obtain the URL of the markdown file parsed from the corresponding PDF file._
- 📝 Use the qwen-agent RAG module to call the qwen2.5-72b-instruct model for formatted data extraction.
  - _First, you need to install the corresponding qwen-agent library using pip._
  - _Next, configure the parameters for the qwen2.5-72b-instruct model._
  - _Finally, define the function that performs the extraction._
  - _Call the function, and you will get the data you want._


🎃 That is the entire process for this example. I hope you find it useful.

#### 📤 Upload the PDF and parse it with undatasio.

_Install undatasio using pip._

In [1]:
# install undatasio
!pip install -U -q undatasio

In [2]:
!conda install -c conda-forge python-dotenv -y -q

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



_Use the load_dotenv function to load local environment variables._

In [3]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

True

In [4]:
UNDATASIO_API_KEY = os.getenv("UNDATASIO_API_KEY")
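
`os.getenv` returns `None` without complaint when a variable is missing, so a typo in `.env` only shows up later as an authentication error. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of undatasio or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Fail early with a message that names the missing variable.
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value
```

With this helper, `UNDATASIO_API_KEY = require_env("UNDATASIO_API_KEY")` could stand in for the plain `os.getenv` call, and the same check would apply to any other variable read from `.env`.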

_Create an instance of the UnDatasIO class._

In [5]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO(UNDATASIO_API_KEY, task_name='demo')

_Get the URL of the markdown file parsed from the uploaded PDF._

In [6]:
document = undatasio_obj.get_result_to_files(
    file_name_list='NVIDIAAn.pdf',
    version='v1'
)
document.data[0]

'https://backend.undatas.io/static/pdfParser/a4292de6adc9424f85cd057f282dbf99/v1/e04d3c05b4de47a5b93c0237ed3fafd6/pdf.md'
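
Before handing the URL to the model, it can help to look at the first few hundred characters of the parsed markdown and confirm the PDF was converted as expected. The sketch below uses only the Python standard library; `preview_markdown` and its `max_chars` parameter are hypothetical names chosen for this example, and it assumes the file is UTF-8 text that can be read without authentication.

```python
from urllib.request import urlopen


def preview_markdown(url: str, max_chars: int = 500) -> str:
    """Download a text file and return its first max_chars characters."""
    with urlopen(url, timeout=30) as response:
        # Replace undecodable bytes instead of failing on them.
        text = response.read().decode("utf-8", errors="replace")
    return text[:max_chars]
```

In this notebook the call would look like `print(preview_markdown(document.data[0]))`.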

#### 🚩 Use qwen-agent to retrieve structured data.

_Install qwen-agent using pip._

In [7]:
# Quote the package spec so shells such as zsh do not treat the brackets as a glob pattern.
!pip install -q -U "qwen-agent[rag]"


_Configure the parameters of the LLM._

In [8]:
LLM_CONFIG = {
    "model": 'qwen2.5-72b-instruct',
    "model_server": os.getenv('DASHSCOPE_URL'),
    "api_key": os.getenv('DASHSCOPE_API_KEY')
}

_Define the function for retrieving formatted data._

In [9]:
from qwen_agent.agents import Assistant
from typing import Dict, List, Optional
import json


def get_values(keys: List[str], file: str) -> Optional[Dict]:
    # Build a JSON template with one extraction instruction per requested key.
    fields = ",\n".join(
        f'"{key}": "Extract the value of the key({key}) from the article. Return an empty string if the key is not found."'
        for key in keys
    )
    key_str = "{\n" + fields + "\n}"

    prompt = f"""Given a text passage, please extract specific information and return it in JSON format:

{key_str}

Requirements:
1. Return results in valid JSON format
2. Use empty string ("") if information is not found
3. Only extract relevant information - do not include unrelated context
4. Keep original phrasing when possible
5. Remove redundant text that doesn't add meaning
6. Return only the result data in JSON format.

Example input:
"NVIDIA reported strong growth in Q3, with revenue reaching $18.1 billion. Jensen Huang, who founded NVIDIA in 1993 and serves as CEO, commented that AI demand remains robust. The company expects Q4 revenue to hit $20 billion."

Example output:
{{
"third_quarter": "revenue reaching $18.1 billion",
  "founder_ceo": "Jensen Huang",
  "revenue_forecast": "expects Q4 revenue to hit $20 billion"
}}

Please provide the text passage to analyze."""
    bot = Assistant(llm=LLM_CONFIG)
    # Send the prompt together with the markdown file so the agent can read it.
    messages = [{'role': 'user', 'content': [{'text': prompt}, {'file': file}]}]
    rsp = bot.run_nonstream(messages)[0]['content']
    try:
        return json.loads(rsp)
    except json.JSONDecodeError:
        # The model did not return valid JSON.
        return None

_The prompt built for the keys `the third quarter` and `CEO of NVIDIA`:_

```
Given a text passage, please extract specific information and return it in JSON format:

{
"the third quarter": "Extract the value of the key(the third quarter) from the article. Return an empty string if the key is not found.",
"CEO of NVIDIA": "Extract the value of the key(CEO of NVIDIA) from the article. Return an empty string if the key is not found."
}

Requirements:
1. Return results in valid JSON format
2. Use empty string ("") if information is not found
3. Only extract relevant information - do not include unrelated context
4. Keep original phrasing when possible
5. Remove redundant text that doesn't add meaning
6. Return only the result data in JSON format.

Example input:
"NVIDIA reported strong growth in Q3, with revenue reaching $18.1 billion. Jensen Huang, who founded NVIDIA in 1993 and serves as CEO, commented that AI demand remains robust. The company expects Q4 revenue to hit $20 billion."

Example output:
{
"third_quarter": "revenue reaching $18.1 billion",
  "founder_ceo": "Jensen Huang",
  "revenue_forecast": "expects Q4 revenue to hit $20 billion"
}

Please provide the text passage to analyze.
```
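
Even with the instruction to return only JSON, a model may wrap its answer in a Markdown code fence or add a sentence around it, and `json.loads` cannot parse such a reply directly. The sketch below shows one tolerant way to recover the JSON object; `parse_json_reply` is a hypothetical helper written for this example, not something qwen-agent provides.

```python
import json
import re
from typing import Optional

# A Markdown code fence (three backticks), built programmatically for readability.
FENCE = "`" * 3


def parse_json_reply(text: str) -> Optional[dict]:
    """Extract a JSON object from a model reply; return None if none can be parsed."""
    # Prefer the body of a fenced block if the reply contains one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Keep only the outermost {...} span to drop surrounding commentary.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# A reply that wraps the JSON in a fence and adds a leading sentence.
reply = "Here is the result:\n" + FENCE + 'json\n{"CEO of NVIDIA": "Jensen Huang"}\n' + FENCE
print(parse_json_reply(reply))
```

Inside `get_values`, the string returned by the agent could be passed through this helper in place of a direct `json.loads` call.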

_Call the function._

In [10]:
values = get_values(['the third quarter', 'CEO of NVIDIA'], document.data[0])

In [11]:
values

{'the third quarter': 'revenue of $5.93 billion, down 17% from a year ago and down 12% from the previous quarter',
 'CEO of NVIDIA': 'Jensen Huang'}
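
Because the prompt asks the model to return an empty string for anything it cannot find, a quick check for empty values shows which keys were not extracted. A small sketch follows; `missing_keys` is a hypothetical helper, and the `CFO of NVIDIA` entry is an invented key added only to illustrate a value that was not found.

```python
from typing import Dict, List


def missing_keys(values: Dict[str, str], keys: List[str]) -> List[str]:
    """Return the requested keys that are absent or came back empty."""
    missing = []
    for key in keys:
        value = values.get(key)
        # Treat None, "" and whitespace-only strings as "not found".
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# The result from above plus one hypothetical key with an empty value.
result = {
    "the third quarter": "revenue of $5.93 billion, down 17% from a year ago and down 12% from the previous quarter",
    "CEO of NVIDIA": "Jensen Huang",
    "CFO of NVIDIA": "",
}
print(missing_keys(result, ["the third quarter", "CEO of NVIDIA", "CFO of NVIDIA"]))
# prints ['CFO of NVIDIA']
```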