![](example_content/undatasio_example.png)

_By **stay**, Tech Enthusiast @Undatasio_
- - -

#### 🚀 Let's begin this example. 😃 😎 😝

📣 This notebook demonstrates how to extract structured data from a PDF: the undatasio platform parses the PDF into a markdown file, and the qwen_agent framework retrieves the requested fields from that markdown.

#### 📚 Steps in this example:
- 📄 Upload the PDF file to the undatasio platform for parsing.
  - _Install the undatasio Python library._
  - _Load the environment variables._
  - _Use the undatasio library to get the URL of the markdown file parsed from the PDF._
- 📝 Use the qwen-agent RAG module to call the qwen2.5-72b-instruct model for structured data extraction.
  - _Install the qwen-agent library using pip._
  - _Configure the parameters for the qwen2.5-72b-instruct model._
  - _Define the extraction function._
  - _Call the function to get the data you want._


🎃 That is the entire process for this example. I hope you find it useful.

#### 📤 Upload the PDF and parse it with undatasio.

_Install undatasio using pip._

In [66]:
# install undatasio
!pip install -U -q undatasio

In [32]:
!conda install -c conda-forge python-dotenv -y -q

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [33]:
import os
from dotenv import load_dotenv

load_dotenv('.env')

True

In [34]:
UNDATASIO_API_KEY=os.getenv("UNDATASIO_API_KEY")
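
`os.getenv` returns `None` for an unset variable, so a missing key would only surface later as a failed API call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper written for this example, not part of undatasio or python-dotenv.

```python
import os
from typing import Dict, Iterable


def require_env(names: Iterable[str]) -> Dict[str, str]:
    """Return the requested environment variables, failing fast if any is unset or empty."""
    values = {name: os.getenv(name, "") for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

In this notebook it could be called right after `load_dotenv('.env')` as `require_env(["UNDATASIO_API_KEY", "DASHSCOPE_URL", "DASHSCOPE_API_KEY"])`.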

_Create an instance of the UnDatasIO class._

In [35]:
from undatasio.undatasio import UnDatasIO

undatasio_obj = UnDatasIO(UNDATASIO_API_KEY, task_name='demo')

_Please upload the PDF file you want to parse._

In [67]:
upload_res = undatasio_obj.upload(file_dir_path='./demo_files/')

In [68]:
upload_res.code

200

In [63]:
show_files = undatasio_obj.show_upload()
show_files.data

_Publish the parsing task. A return code of 200 indicates the task was published successfully._

In [73]:
parser_res = undatasio_obj.parser(file_name_list=['resume-example.pdf'], parameter='fast', lang='en')
parser_res.code

200
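
Since both `upload` and `parser` report their result through a `code` attribute, the manual check can be wrapped in a small guard. This is only a sketch: `ensure_ok` is a hypothetical helper, and the only thing it assumes about the response object is the `.code` attribute read above.

```python
from types import SimpleNamespace
from typing import Any


def ensure_ok(response: Any, action: str) -> Any:
    """Raise if a response object does not carry code 200; otherwise return it."""
    code = getattr(response, "code", None)
    if code != 200:
        raise RuntimeError(f"{action} failed with code {code!r}")
    return response


# Stand-in object with the same `.code` attribute the notebook reads.
fake_response = SimpleNamespace(code=200)
ensure_ok(fake_response, "parser")
```

With the real client, the call would look like `ensure_ok(parser_res, "parser")`.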

_Check the task version information to see whether the parsing task has completed._
_A status of "over" means the task is complete; "parsering" means it is still being processed._

In [75]:
version_info = undatasio_obj.show_version()
version_info.data

|   | title | version | count | file_name | status |
|---|---|---|---|---|---|
| 0 | 1 files | v3 | 1 | [resume-example.pdf] | over |
| 1 | 1 files | v2 | 1 | [resume-example.pdf] | over |
| 2 | 1 files | v1 | 1 | [resume-example.pdf] | over |
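
Instead of re-running the cell by hand until the status reads "over", the check can be polled. The sketch below is deliberately generic: `wait_until_done` is a hypothetical helper that takes any callable returning the current statuses, so it makes no assumptions about the undatasio API itself.

```python
import time
from typing import Callable, Iterable


def wait_until_done(
    get_statuses: Callable[[], Iterable[str]],
    done_status: str = "over",
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
) -> bool:
    """Poll until every reported status equals `done_status`; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        statuses = list(get_statuses())
        if statuses and all(status == done_status for status in statuses):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Assuming `show_version().data` is a pandas DataFrame with a `status` column, as the output above suggests, the callable could be `lambda: undatasio_obj.show_version().data["status"]`.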


_Get the URL of the markdown file parsed from the uploaded PDF._

In [76]:
document = undatasio_obj.get_result_to_files(
    file_name_list='resume-example.pdf',
    version='v1'
)
document.data[0]

'https://backend.undatas.io/static/pdfParser/d539f9ec49964970a58d0a801ce071ca/v1/996c87bb9c3e427b80a0d2a9c31eeba0/pdf.md'
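
Before handing the URL to the LLM, it can help to look at the parsed markdown yourself. The following is a minimal sketch using only the standard library; `fetch_text` and `preview` are hypothetical helpers, and the sketch assumes the URL is reachable without authentication.

```python
import urllib.request


def fetch_text(url: str, timeout_s: float = 30.0) -> str:
    """Download a text resource and decode it as UTF-8."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.read().decode("utf-8", errors="replace")


def preview(text: str, max_lines: int = 20) -> str:
    """Return the first `max_lines` non-empty lines of `text`."""
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(lines[:max_lines])
```

For example, `print(preview(fetch_text(document.data[0])))` would show the top of the parsed resume.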

#### 🚩 Use qwen-agent to retrieve structured data.

_Install qwen-agent using pip._

In [29]:
!pip install -q -U "qwen-agent[rag]"

_Configure the parameters of the LLM._

In [15]:
LLM_CONFIG = {
    "model": 'qwen2.5-72b-instruct',
    "model_server": os.getenv('DASHSCOPE_URL'),
    "api_key": os.getenv('DASHSCOPE_API_KEY')
}

_Define the function for retrieving structured data._

In [16]:
from qwen_agent.agents import Assistant
from typing import Dict, List, Optional
import json


def get_values(keys: List[str], file: str) -> Optional[Dict[str, str]]:
    # Build a JSON template that maps each key to its extraction instruction.
    instruction = "Extract the value of the key({key}) from the article. Return an empty string if the key is not found."
    key_str = json.dumps({key: instruction.format(key=key) for key in keys}, ensure_ascii=False)

    print(key_str)
    prompt = f"""Given a text passage, please extract specific information and return it in JSON format:

{key_str}

Requirements:
1. Return results in valid JSON format
2. Use empty string ("") if information is not found
3. Only extract relevant information - do not include unrelated context
4. Keep original phrasing when possible
5. Remove redundant text that doesn't add meaning
6. Return only the result data in JSON format.

Example input:
"NVIDIA reported strong growth in Q3, with revenue reaching $18.1 billion. Jensen Huang, who founded NVIDIA in 1993 and serves as CEO, commented that AI demand remains robust. The company expects Q4 revenue to hit $20 billion."

Example output:
{{
  "third_quarter": "revenue reaching $18.1 billion",
  "founder_ceo": "Jensen Huang",
  "revenue_forecast": "expects Q4 revenue to hit $20 billion"
}}

Please provide the text passage to analyze."""
    # Send the prompt together with the markdown file so the RAG module can retrieve from it.
    bot = Assistant(llm=LLM_CONFIG)
    messages = [{'role': 'user', 'content': [{'text': prompt}, {'file': file}]}]
    rsp = bot.run_nonstream(messages)[0]['content']
    # Return None if the model reply is not valid JSON.
    try:
        return json.loads(rsp)
    except json.JSONDecodeError:
        print(f"Model reply is not valid JSON: {rsp}")
        return None

In [19]:
values = get_values(['name', 'age', 'technology stack', 'graduate institutions', 'Graduate student or not', 'Have you studied abroad'], document.data[0])

{"name": "Extract the value of the key(name) from the article. Return an empty string if the key is not found.", "age": "Extract the value of the key(age) from the article. Return an empty string if the key is not found.", "technology stack": "Extract the value of the key(technology stack) from the article. Return an empty string if the key is not found.", "graduate institutions": "Extract the value of the key(graduate institutions) from the article. Return an empty string if the key is not found.", "Graduate student or not": "Extract the value of the key(Graduate student or not) from the article. Return an empty string if the key is not found.", "Have you studied abroad": "Extract the value of the key(Have you studied abroad) from the article. Return an empty string if the key is not found."}


2025-01-21 10:52:33,561 - split_query.py - 82 - INFO - Extracted info from query: {"information": ["text passage to analyze
2025-01-21 10:52:35,553 - memory.py - 113 - INFO - {"keywords_zh": ["文本段落", "分析"], "keywords_en": ["text passage", "analyze"], "text": "text passage to analyze"}
2025-01-21 10:52:35,611 - simple_doc_parser.py - 383 - INFO - Read parsed https://backend.undatas.io/static/pdfParser/d539f9ec49964970a58d0a801ce071ca/v1/3ce28e4466cd48dda5093511a7cf4e6f/pdf.md from cache.
2025-01-21 10:52:35,612 - doc_parser.py - 114 - INFO - Start chunking https://backend.undatas.io/static/pdfParser/d539f9ec49964970a58d0a801ce071ca/v1/3ce28e4466cd48dda5093511a7cf4e6f/pdf.md ()...
2025-01-21 10:52:35,612 - doc_parser.py - 132 - INFO - Finished chunking https://backend.undatas.io/static/pdfParser/d539f9ec49964970a58d0a801ce071ca/v1/3ce28e4466cd48dda5093511a7cf4e6f/pdf.md (). Time spent: 5.602836608886719e-05 seconds.
2025-01-21 10:52:35,613 - base_search.py - 56 - INFO - all tokens: 473
2

In [20]:
values

{'name': 'KATHY JAMES',
 'age': '',
 'technology stack': 'JavaScript, HTML/CSS, Django, SQL, REST APIs, Angular.js, React.js, Jest, Eclipse, Java',
 'graduate institutions': 'University of Alabama',
 'Graduate student or not': '',
 'Have you studied abroad': ''}
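
`get_values` passes the model reply straight to `json.loads`, which works when the model returns bare JSON as it did here. Chat models sometimes wrap JSON in a Markdown code fence or add a sentence around it, so a more tolerant parser can be useful. The sketch below is one possible approach; `parse_json_reply` is a hypothetical helper, not part of qwen-agent.

```python
import json
import re
from typing import Any, Dict, Optional

# Markdown code-fence marker, built programmatically to keep this block easy to embed.
FENCE = "`" * 3


def parse_json_reply(reply: str) -> Optional[Dict[str, Any]]:
    """Extract a JSON object from a model reply, or return None if there is none."""
    # Use the content of a Markdown code fence if the reply contains one.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    # Keep only the outermost {...} span to ignore prose around the object.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(candidate[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Inside `get_values`, the `json.loads(rsp)` call could be swapped for `parse_json_reply(rsp)` without changing the rest of the function.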