# No.1 accuracy in multiform table extraction 
- Convert documents to maximize RAG performance 
- LangChain provides powerful tools for text splitting and vectorization


![Layout Analyzer](./figures/la.png)

In [1]:
! pip3 install -qU  markdownify  langchain-upstage  requests  python-dotenv

In [2]:

%load_ext dotenv
%dotenv
# Loads UPSTAGE_API_KEY from a local .env file into the environment

In [3]:
import warnings

warnings.filterwarnings("ignore")

![Layout Analyzer](./figures/solar_sample.png)

In [4]:
import requests
import os

url = "https://api.upstage.ai/v1/document-ai/layout-analysis"
headers = {"Authorization": f"Bearer {os.getenv('UPSTAGE_API_KEY')}"}
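# Open the PDF in a context manager so the file handle is closed after the upload.
with open("pdfs/solar_sample.pdf", "rb") as f:
    response = requests.post(url, headers=headers, files={"document": f})
response.raise_for_status()  # fail early on HTTP errors such as an invalid API key
response_json = response.json()
response_json["html"]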

"<table id='0' style='font-size:14px'><tr><td>Model</td><td>Size</td><td>Type</td><td>H6 (Avg.)</td><td>ARC</td><td>HellaSwag</td><td>MMLU</td><td>TruthfulQA</td><td>Winogrande</td><td>GSM8K</td></tr><tr><td>SOLAR 10.7B-Instruct</td><td>~ 11B</td><td>Alignment-tuned</td><td>74.20</td><td>71.08</td><td>88.16</td><td>66.21</td><td>71.43</td><td>83.58</td><td>64.75</td></tr><tr><td>Qwen 72B</td><td>~ 72B</td><td>Pretrained</td><td>73.60</td><td>65.19</td><td>85.94</td><td>77.37</td><td>60.19</td><td>82.48</td><td>70.43</td></tr><tr><td>Mixtral 8x7B-Instruct-v0.1</td><td>~ 47B</td><td>Instruction-tuned</td><td>72.62</td><td>70.22</td><td>87.63</td><td>71.16</td><td>64.58</td><td>81.37</td><td>60.73</td></tr><tr><td>Yi 34B-200K</td><td>~ 34B</td><td>Pretrained</td><td>70.81</td><td>65.36</td><td>85.58</td><td>76.06</td><td>53.64</td><td>82.56</td><td>61.64</td></tr><tr><td>Yi 34B</td><td>~ 34B</td><td>Pretrained</td><td>69.42</td><td>64.59</td><td>85.69</td><td>76.35</td><td>56.23</td><td>8

![](figures/docai.png)

In [5]:
with open("pdfs/docai.pdf", "rb") as f:
    response = requests.post(url, headers=headers, files={"document": f})
response_json = response.json()
# Print the text recognized inside the first element classified as a figure.
for element in response_json["elements"]:
    if element["category"] == "figure":
        print(element["text"])  # or, element["html"]
        break

Performance - Medical expense statement Customer testimony
Percentage "Achieved a score exceeding 95% on the 5
types of documents, exceeding the
96
previous human works" - Hanwha Life
92~95
75~87 "Tested 7 difficult documents with highly
unstructured data, achieving over 95%
solid result" - Samsung Life
"We consistently use the Upstage universal
OCR model, which has close to 98%
accuracy" - KB Kookmin Bank
"We are introducing OCR tasks to innovate
franchise business opening and ID
verification processes" - Samsung
Upstage Human Competitors Securities / Samsung Card
Hanwha Life HYUNDAI SERVICE
GLOBAL
SAMSUNG posco
SAMSUNG
LIFE INSURANCE
Amass
. 퍼플스
Kb KB Financial Group
pay 손해보험
Shinhan Bank
삼성카드 s AMSUNG
SAMSUNG
삼성증권
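
The loop above stops at the first figure. A minimal sketch of a more general filter follows, assuming only what the cell above already relies on: a top-level `"elements"` list whose items carry `"category"` and `"text"` keys. The helper name `texts_by_category`, the sample response, and the category names other than `"figure"` are hypothetical and used only for illustration.

```python
from collections import Counter


def texts_by_category(response_json, category):
    """Return the text of every element whose category matches."""
    return [
        element.get("text", "")
        for element in response_json.get("elements", [])
        if element.get("category") == category
    ]


# Hypothetical, hand-written response; a real one comes from response.json().
sample_response = {
    "elements": [
        {"category": "heading1", "text": "Performance"},
        {"category": "figure", "text": "Upstage Human Competitors"},
        {"category": "paragraph", "text": "Customer testimony"},
        {"category": "figure", "text": "Hanwha Life"},
    ]
}

# Count how many elements of each category the response contains.
category_counts = Counter(e["category"] for e in sample_response["elements"])
figure_texts = texts_by_category(sample_response, "figure")

print(category_counts)
print(figure_texts)
```

Counting categories first is a quick way to see what a document contains before deciding which elements to pass on to a RAG pipeline.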


In [6]:
from IPython.display import display, HTML

# This cell sends a PNG image to the same endpoint instead of a PDF.
with open("figures/docai.png", "rb") as f:
    response = requests.post(url, headers=headers, files={"document": f})
response_json = response.json()

display(HTML(response_json["html"]))


In [7]:
from langchain_upstage import UpstageLayoutAnalysisLoader


layzer = UpstageLayoutAnalysisLoader("pdfs/solar_sample.pdf", output_type="html")
# For improved memory efficiency, consider using the lazy_load method to load documents page by page.
docs = layzer.load()  # or layzer.lazy_load()
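
As a rough sketch of why `lazy_load` can help with memory, the helper below consumes only the first few documents from any loader that exposes `lazy_load()` as an iterator, which is the LangChain loader convention. `take_pages` and `FakeLoader` are hypothetical names introduced here; `FakeLoader` stands in for the real loader so the example runs without an API key, and it yields plain strings rather than `Document` objects.

```python
from itertools import islice


def take_pages(loader, limit):
    """Consume at most `limit` documents from loader.lazy_load()."""
    return list(islice(loader.lazy_load(), limit))


class FakeLoader:
    """Hypothetical stand-in for a document loader (illustration only)."""

    def __init__(self, pages):
        self.pages = pages
        self.yielded = 0  # how many documents have actually been produced

    def lazy_load(self):
        for page in self.pages:
            self.yielded += 1
            yield page


fake = FakeLoader(["page 1", "page 2", "page 3"])
first_two = take_pages(fake, 2)

# Only two documents were produced; the third was never materialized.
print(first_two, fake.yielded)
```

With the real loader the call would look like `take_pages(layzer, 2)`. How much memory this saves depends on how the loader splits its output, which this sketch does not verify.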

In [8]:
from IPython.display import display, HTML

display(HTML(docs[0].page_content[:5000]))

0,1,2,3,4,5,6,7,8,9
Model,Size,Type,H6 (Avg.),ARC,HellaSwag,MMLU,TruthfulQA,Winogrande,GSM8K
SOLAR 10.7B-Instruct,⇠ 11B,Alignment-tuned,74.20,71.08,88.16,66.21,71.43,83.58,64.75
Qwen 72B,⇠ 72B,Pretrained,73.60,65.19,85.94,77.37,60.19,82.48,70.43
Mixtral 8x7B-Instruct-v0.1,⇠ 47B,Instruction-tuned,72.62,70.22,87.63,71.16,64.58,81.37,60.73
Yi 34B-200K,⇠ 34B,Pretrained,70.81,65.36,85.58,76.06,53.64,82.56,61.64
Yi 34B,⇠ 34B,Pretrained,69.42,64.59,85.69,76.35,56.23,83.03,50.64
Mixtral 8x7B-v0.1,⇠ 47B,Pretrained,68.42,66.04,86.49,71.82,46.78,81.93,57.47
Llama 2 70B,⇠ 70B,Pretrained,67.87,67.32,87.33,69.83,44.92,83.74,54.06
Falcon 180B,⇠ 180B,Pretrained,67.85,69.45,88.86,70.50,45.47,86.90,45.94
SOLAR 10.7B,⇠ 11B,Pretrained,66.04,61.95,84.60,65.48,45.04,83.66,55.50


In [9]:
from markdownify import markdownify as md
from IPython.display import display, Markdown

md_text = md(docs[0].page_content)
display(Markdown(md_text[:5000]))



| Model | Size | Type | H6 (Avg.) | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SOLAR 10.7B-Instruct | ⇠ 11B | Alignment-tuned | 74.20 | 71.08 | 88.16 | 66.21 | 71.43 | 83.58 | 64.75 |
| Qwen 72B | ⇠ 72B | Pretrained | 73.60 | 65.19 | 85.94 | 77.37 | 60.19 | 82.48 | 70.43 |
| Mixtral 8x7B-Instruct-v0.1 | ⇠ 47B | Instruction-tuned | 72.62 | 70.22 | 87.63 | 71.16 | 64.58 | 81.37 | 60.73 |
| Yi 34B-200K | ⇠ 34B | Pretrained | 70.81 | 65.36 | 85.58 | 76.06 | 53.64 | 82.56 | 61.64 |
| Yi 34B | ⇠ 34B | Pretrained | 69.42 | 64.59 | 85.69 | 76.35 | 56.23 | 83.03 | 50.64 |
| Mixtral 8x7B-v0.1 | ⇠ 47B | Pretrained | 68.42 | 66.04 | 86.49 | 71.82 | 46.78 | 81.93 | 57.47 |
| Llama 2 70B | ⇠ 70B | Pretrained | 67.87 | 67.32 | 87.33 | 69.83 | 44.92 | 83.74 | 54.06 |
| Falcon 180B | ⇠ 180B | Pretrained | 67.85 | 69.45 | 88.86 | 70.50 | 45.47 | 86.90 | 45.94 |
| SOLAR 10.7B | ⇠ 11B | Pretrained | 66.04 | 61.95 | 84.60 | 65.48 | 45.04 | 83.66 | 55.50 |
| Qwen 14B | ⇠ 14B | Pretrained | 65.86 | 58.28 | 83.99 | 67.70 | 49.43 | 76.80 | 58.98 |
| Mistral 7B-Instruct-v0.2 | ⇠ 7B | Instruction-tuned | 65.71 | 63.14 | 84.88 | 60.78 | 68.26 | 77.19 | 40.03 |
| Yi 34B-Chat | ⇠ 34B | Instruction-tuned | 65.32 | 65.44 | 84.16 | 74.90 | 55.37 | 80.11 | 31.92 |
| Mistral 7B | ⇠ 7B | Pretrained | 60.97 | 59.98 | 83.31 | 64.16 | 42.15 | 78.37 | 37.83 |

  
Table 2: Evaluation results in the Open LLM Leaderboard for SOLAR 10.7B and SOLAR 10.7B-Instruct along with  
other top-performing models. We report the scores for the six tasks mentioned in Sec. 4.1 along with the H6 score  
(average of six tasks). We also report the size of the models in units of billions of parameters. The type indicates the  
training stage of the model and is chosen from {Pretrained, Instruction-tuned, Alignment-tuned}. Models based on  
SOLAR 10.7B are colored purple. The best scores for H6 and the individual tasks are shown in bold.

MetaMathQA ( Yu et al. , 2023 ) dataset.

  
We reformatted the instruction datasets with an  
Alpaca-styled chat template. For datasets such as  
OpenOrca, which are derived from FLAN ( Long-  
pre et al. , 2023 ), we ﬁlter data that overlaps with  
the benchmark datasets (see Tab. 8 in Appendix. C  
for more information). The alignment datasets  
are in the {prompt, chosen, rejected} triplet for-  
mat. We preprocess the alignment datasets follow-  
ing Zephyr ( Tunstall et al. , 2023 ). We use Data-  
verse ( Park et al. , 2024 ) for data preprocessing.

  
Evaluation. In the HuggingFace Open LLM  
Leaderboard ( Beeching et al. , 2023 ), six types of  
evaluation methods are presented: ARC ( Clark  
et al. , 2018 ), HellaSWAG ( Zellers et al. , 2019 ),  
MMLU ( Hendrycks et al. , 2020 ), TruthfulQA ( Lin  
et al. , 2022 ), Winogrande ( Sakaguchi et al. , 2021 ),  
and GSM8K ( Cobbe et al. , 2021 ). We utilize these  
datasets as benchmarks for evaluation and also re-  
port the average scores for the six tasks, e.g., H6.  
We either submit directly to the Open LLM Leader-  
board or utilize Evalverse ( Kim et al. , 2024b ) for  
running evaluations locally.

  
Model merging. Model merging methods such  
as Yadav et al. ( 2023 ) can boost model perfor-  
mance without further training. We merge some  
of the models that we trained in both the instruc-  
tion and alignment tuning stages. We implement  
our own merging methods although popular open  
source also exist such as MergeKit 3 .

  
4.2 Main Results

  
We present evaluation results for our SOLAR  
10.7B and SOLAR 10.7B-Instruct models along

  
3 https://github.com/cg123/mergekit

  
with other top-performing models in Tab. 2 . SO-  
LAR 10.7B outperforms other pretrained models  
of similar sizes, such as Qwen 14B and Mistral  
7B, which shows that DUS is an effective method  
to up-scale base LLMs. Furthermore, despite the  
smaller size, SOLAR 10.7B-Instruct scores the  
highest in terms of H6, even surpassing the recent  
top-performing open-source LLM Mixtral 8x7B-  
Instruct-v0.1 or Qwen 72B. The above results indi-  
cate DUS can up-scale models that are capable of  
achieving state-of-the-art performance when ﬁne-  
tuned. We also report data contamination results  
for SOLAR 10.7B-Instruct in Appendix C .

  
4.3 Ablation Studies

  
We present ablation studies for both the instruction  
and alignment tuning stages. Note that the evalua-  
tion results for the following studies are ran locally  
and may vary from results obtained by submitting  
to the Open LLM Leaderboard.

  
4.3.1 Instruction Tuning

  
Ablation on the training datasets. We present  
ablation studies using different training datasets  
for the instruction tuning in Tab. 3 . The ablated  
models are preﬁxed with SFT for supervised ﬁne-  
tuning. ‘SFT v1’ only uses the Alpaca-GPT4  
dataset, whereas ‘SFT v2’ also uses the OpenOrca  
dataset. ‘SFT v3’ uses the Synth. Math-Instruct  
dataset along with the datasets us
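
Once the table is available as Markdown, individual cells can also be read without an LLM. The following is a minimal sketch, assuming a simple pipe table like the one rendered above; `parse_markdown_table` is a hypothetical helper, not part of `markdownify`, and the sample text reuses a few values from Table 2.

```python
def parse_markdown_table(markdown_text):
    """Parse the first pipe table in markdown_text into a list of dicts."""
    rows = []
    header = None
    for line in markdown_text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            if header is not None:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table line holds the column names
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as | --- | --- |
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample_md = """
| Model | ARC | MMLU |
| --- | --- | --- |
| Falcon 180B | 69.45 | 70.50 |
| SOLAR 10.7B | 61.95 | 65.48 |

Table 2: Evaluation results in the Open LLM Leaderboard
"""

rows = parse_markdown_table(sample_md)
scores = {row["Model"]: row for row in rows}
print(scores["Falcon 180B"]["ARC"])
```

In the notebook, `parse_markdown_table(md_text)` would be the equivalent call. A deterministic lookup like this can serve as a cross-check for the answers the LLM produces in the cells that follow.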

In [10]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_upstage import ChatUpstage


llm = ChatUpstage()

prompt_template = PromptTemplate.from_template(
    """
    Please provide the most correct answer from the following context.
    If the answer is not present in the context, please write "The information is not present in the context."
    ---
    Question: {question}
    ---
    Context: {Context}
    """
)
chain = prompt_template | llm | StrOutputParser()

In [11]:
chain.invoke({"question": "Explain Table 2?", "Context": docs})

'Table 2 presents the evaluation results for the Open LLM Leaderboard, which includes SOLAR 10.7B and SOLAR 10.7B-Instruct along with other top-performing models. The table reports the scores for six tasks mentioned in Sec. 4.1, including H6 (average of six tasks), as well as the size of the models in units of billions of parameters. The type indicates the training stage of the model and is chosen from {Pretrained, Instruction-tuned, Alignment-tuned}. Models based on SOLAR 10.7B are colored purple, and the best scores for H6 and the individual tasks are shown in bold.'

In [12]:
chain.invoke({"question": "What is MMLU scores of SOLAR 10.7B?", "Context": docs})

'The MMLU scores of SOLAR 10.7B is 65.48.'

In [17]:
chain.invoke({"question": "What is ARC of Falcon 180B?", "Context": docs})

'The ARC of Falcon 180B is 69.45.'

In [14]:
chain.invoke({"question": "What is MMLU scores of Mistral?", "Context": md_text})

'Mistral 7B-Instruct-v0.2 has an MMLU score of 60.78.'

# Exercise
Even when a table is provided in Markdown or HTML format, the Large Language Model (LLM) may still fail to extract the correct information. How can you fix this issue?

Hint: Consider chain-of-thought (CoT) prompting, few-shot examples, or a divide-and-conquer strategy.
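
One possible starting point for the exercise, not a reference solution, is to combine few-shot examples with step-by-step reasoning. The sketch below builds such a prompt as a plain string. The example questions and answers are taken from Table 2 above, while the helper name `build_table_prompt`, the instruction wording, and the reasoning text are assumptions made for illustration. The third instruction targets questions like the "Mistral" one above, where two rows match (Mistral 7B-Instruct-v0.2 with 60.78 and Mistral 7B with 64.16).

```python
# Worked examples taken from Table 2; the reasoning wording is illustrative.
FEW_SHOT_EXAMPLES = [
    {
        "question": "What is the ARC score of Falcon 180B?",
        "reasoning": (
            "Step 1: find the row whose Model cell is exactly 'Falcon 180B'. "
            "Step 2: find the column named 'ARC'. "
            "Step 3: read the cell where that row and column meet."
        ),
        "answer": "69.45",
    },
    {
        "question": "What is the MMLU score of SOLAR 10.7B?",
        "reasoning": (
            "Step 1: 'SOLAR 10.7B' and 'SOLAR 10.7B-Instruct' are different "
            "rows; use the exact match 'SOLAR 10.7B'. "
            "Step 2: find the column named 'MMLU'. "
            "Step 3: read the cell where that row and column meet."
        ),
        "answer": "65.48",
    },
]


def build_table_prompt(question, context, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot, step-by-step prompt for questions about a table."""
    parts = [
        "Answer the question using only the table in the context.",
        "Reason step by step as in the examples, then give the final answer.",
        "If several rows match the question, list every matching row.",
        "",
    ]
    for example in examples:
        parts += [
            f"Question: {example['question']}",
            f"Reasoning: {example['reasoning']}",
            f"Answer: {example['answer']}",
            "",
        ]
    # End with an open "Reasoning:" so the model continues with its own steps.
    parts += [f"Context: {context}", f"Question: {question}", "Reasoning:"]
    return "\n".join(parts)


prompt = build_table_prompt(
    question="What is the MMLU score of Mistral?",
    context="| Model | MMLU |\n| --- | --- |\n| Mistral 7B | 64.16 |",
)
print(prompt)
```

The resulting string could then be sent to the model, for example with `llm.invoke(prompt)`, passing `md_text` as the context. Whether this improves answers on a given table is something to measure rather than assume.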
