# Large Language Model Inference
- Once the text has been extracted from a table, we use a large language model (LLM) to interpret the table.

## Mistral Inference
- Efficient for simple prompts
- Limited capacity for handling large prompts
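
Because the prompt embeds the whole table as JSON, the serialization style affects how much of the limited prompt budget the table consumes. The sketch below compares pretty-printed and compact JSON. It assumes character count is an acceptable rough proxy for prompt size (the real token count depends on the model's tokenizer), and `prompt_size_report` is a hypothetical helper, not part of the notebook.

```python
import json


def prompt_size_report(table_json):
    """Compare the character length of pretty-printed vs. compact JSON.

    Character count is only a rough proxy for tokens (assumption); the
    real token count depends on the model's tokenizer.
    """
    pretty = json.dumps(table_json, indent=4)
    compact = json.dumps(table_json, separators=(",", ":"))
    return {"pretty_chars": len(pretty), "compact_chars": len(compact)}


sample = {
    "table": {
        "columns": ["Product ID", "Region"],
        "data": [{"Product ID": "101", "Region": "Asia"}],
    }
}
report = prompt_size_report(sample)
print(report)
```

Both forms decode to the same object, so switching to compact output only changes the prompt length, not the table content.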

In [None]:
!pip install ollama



In [None]:
import ollama

# Downloads the model; the Ollama server must already be running ("ollama serve").
ollama.pull("mistral")

In [1]:
# Run "ollama serve" in a terminal to start the local Ollama server
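
Since every later cell depends on the server being up, a quick reachability check can save a confusing stack trace. The following is a minimal sketch using only the standard library. It assumes the server listens on `http://localhost:11434` (the host used later in this notebook) and answers a plain GET on its root URL with status 200; `is_ollama_running` is a hypothetical helper.

```python
import urllib.request


def is_ollama_running(host="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at `host` with status 200."""
    try:
        with urllib.request.urlopen(host, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Refused connections, timeouts, and HTTP errors are OSError
        # subclasses; ValueError covers malformed URLs.
        return False


if not is_ollama_running():
    print('Ollama server not reachable; start it with "ollama serve".')
```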

In [None]:
import ollama
import json

# Draft prompt guidelines (not used in the prompt below):
# 2. Identify the primary elements that represent rows, columns, and cell content. Explain what each element represents and how the table data is organized.
# 4. If there are positional markers or other structural metadata (like "start_row", "end_row", "start_col", "end_col"), explain how they contribute to understanding the table's layout.

def infer_table_with_ollama(table_json):
    prompt_content = f"""
                You are given a table represented in JSON format. Your task is to interpret the table, which may have varied structures and schemas.

                Guidelines:
                1. Analyze the table structure: Identify the columns, data types (numeric, categorical, text), and overall organization.
                2. Summarize the key information: Describe the main content, highlight important columns, and identify any notable patterns or trends in the data.
                3. Infer relationships: Explain how columns relate to each other, and if possible, identify any correlations or hierarchical information present.
                4. Handle metadata and nested structures: If there is additional metadata or nested elements, infer their purpose and how they contribute to the table's context.

                Here is the JSON table:

                {json.dumps(table_json, indent=4)}

                Provide a concise interpretation of the table, including its structure, content summary, and any relevant insights. The response should be formatted in JSON.
            """
    

    messages = [
        {"role": "user", "content": prompt_content}
    ]

    response = ollama.chat(model="mistral", messages=messages)

    return response['message']['content']

file_path = '/Users/johri/Projects/JnJ capstone/table detection/SciTSR/train/structure/0705.0450v1.4.json'
with open(file_path, 'r') as file:
    table_json = json.load(file)

# NOTE: the hard-coded sample table below replaces the table loaded from file_path.
table_json = {
  "table": {
    "title": "Quarterly Sales Report",
    "columns": ["Product ID", "Product Name", "Region", "Sales Data"],
    "data": [
      {
        "Product ID": "101",
        "Product Name": "Laptop Pro 15",
        "Region": "North America",
        "Sales Data": {
          "Quarter": ["Q1", "Q2", "Q3", "Q4"],
          "Units Sold": [150, 200, 180, 220],
          "Revenue ($)": [300000, 400000, 360000, 440000]
        }
      },
      {
        "Product ID": "102",
        "Product Name": "Smartphone X",
        "Region": "Europe",
        "Sales Data": {
          "Quarter": ["Q1", "Q2", "Q3", "Q4"],
          "Units Sold": [500, 600, 550, 650],
          "Revenue ($)": [250000, 300000, 275000, 325000]
        }
      },
      {
        "Product ID": "103",
        "Product Name": "Tablet S",
        "Region": "Asia",
        "Sales Data": {
          "Quarter": ["Q1", "Q2", "Q3", "Q4"],
          "Units Sold": [300, 350, 330, 370],
          "Revenue ($)": [150000, 175000, 165000, 185000]
        }
      }
    ]
  }
}

table_inference = infer_table_with_ollama(table_json)
print("Table Inference:\n", table_inference)

Table Inference:
  {
       "Table Interpretation": {
           "Title": "Quarterly Sales Report",
           "Structure": "A table with four columns: 'Product ID', 'Product Name', 'Region', and 'Sales Data'. The 'Sales Data' column is further nested, containing sub-columns 'Quarter', 'Units Sold', and 'Revenue ($)'.",
           "Content Summary": {
               "Data Points": "Three data points, each representing a product (Laptop Pro 15, Smartphone X, Tablet S), their region of sale, and quarterly sales data for each product.",
               "Notable Columns": ["Product ID", "Product Name", "Region", "Quarter", "Units Sold", "Revenue ($)"],
               "Trends": "Increasing trend in the number of units sold and revenue over the quarters for all products. The 'Laptop Pro 15' has the highest sales in terms of both units and revenue."
           },
           "Relationships": {
               "Columns Relation": "'Product ID' and 'Product Name' are related to the nested 'Sales D

## Llama 3.2 Inference
- Inferences from Llama 3.2 were found to be more effective at identifying context
- It followed the output-format instructions strictly

In [1]:
import json
import pandas as pd
import requests
import torch
import transformers

In [2]:
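def query_ollama(prompt, model="llama3.2", host="http://localhost:11434"):
    """Send `prompt` to the Ollama chat endpoint and return the parsed JSON response.

    Relies on the global `system_prompt` defined in a later cell.
    Returns an empty dict if the server does not answer with status 200.
    """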
    url = f"{host}/api/chat"
    
    headers = {"Content-Type": "application/json"}
    
    data = json.dumps({
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "stream": False,
        "format": "json",
        "options": {
            "seed": 42
        }
    })
    
    response = requests.post(url, headers=headers, data=data)

    # print(response)
    
    if response.status_code == 200:
        return response.json() 
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return {}
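
`query_ollama` reads `system_prompt` from the notebook's global scope, so it only works once that cell has been run. One possible variation, sketched below, builds the request body in a separate function that takes the system prompt as an argument. `build_chat_payload` is a hypothetical helper; the field values are copied from the function above.

```python
import json


def build_chat_payload(prompt, system_prompt, model="llama3.2", seed=42):
    """Build the JSON body for the /api/chat request used above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        "stream": False,    # return one complete response
        "format": "json",   # ask the server for JSON-formatted output
        "options": {"seed": seed},
    }


payload = build_chat_payload("Table: ...", "Summarize the table.")
print(json.dumps(payload, indent=2))
```

The resulting dict could then be sent with `requests.post(url, json=payload, timeout=...)`, which serializes the body and sets the JSON content type without a manual `json.dumps` call.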

In [3]:
table = [
    ['Dat', 'ASCl', '', '', '', ''],
    ['Method', 'LLM', 'SNLI', 'ANLI', 'CQA', 'SVAMP'],
    ['STANDARD FINETUNING', 'NYA', '88.38', '43.58', '62.19', '62.63'],
    ['DISTILLING STEP-BY-STEP', '20B', '89.12', '48.15', '63.25', '63.00'],
    ['DISTILLING STEP-BY-STEP', '54OB', '89.51', '49.58', '63.29', '65.50']
]
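
The query cell later interpolates `table` into an f-string, so the model sees the Python `repr` of a list of lists. As an alternative, the rows could be rendered as delimited text first. The sketch below is one way to do that; `table_to_text` and its `sep` parameter are hypothetical, and whether this format yields better summaries has not been tested here.

```python
def table_to_text(rows, sep=" | "):
    """Join each row's cells with `sep`, one table row per output line.

    Newlines inside a cell are replaced by spaces so that each table
    row stays on a single line.
    """
    lines = []
    for row in rows:
        cells = [str(cell).replace("\n", " ").strip() for cell in row]
        lines.append(sep.join(cells))
    return "\n".join(lines)


sample_rows = [
    ["Method", "LLM", "SNLI"],
    ["STANDARD FINETUNING", "NYA", "88.38"],
]
print(table_to_text(sample_rows))
```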

In [4]:
text = """
Extracted Text:
Table 1: Distilling step-by-step works with different
sizes of LLMs. When rationales are extracted from a
20B GPT-NeoX model, Distilling step-by-step is still
able to provide performance lift compared to standard
finetuning on 220M TS models.

Table 2: Our proposed multi-task training framework
consistently leads to better performances than treating
rationale and label predictions as a single task. Single-
task training can at times lead to worse performance
than standard finetuning.

 

Dataset
ANLI CQA  SVAMP

STANDARD FINETUNING NIA 88.38 43.58 62.19 62.63
DISTILLING STEP-BY-STEP 20B 89.12 48.15 63.25 63.00
DISTILLING STEP-BY-STEP 540B 89.51 49.58 63.29 65.50

Method LLM_se-SNLI

 

 

using the full dataset and thus requires larger model
to close the performance gap.

4.4 Further ablation studies

So far, we have focused on showing the effective-
ness of Distilling step-by-step on reducing the train-
ing data required for finetuning or distilling smaller
task-specific models. In this section, we perform
further studies to understand the influence of dif-
ferent components in the Distilling step-by-step
framework. Specifically, we study (1) how differ-
ent LLMs, from which the rationales are extracted,
affect the effectiveness of Distilling step-by-step,
and (2) how the multi-task training approach com-
pares to other potential design choices in training
small task-specific models with LLM rationales.
Here, we fix the small task-specific models to be
220M T5 models, and utilize 100% of the data on
all datasets.

Distilling step-by-step works with different sizes
of decently trained LLMs. In addition to using
540B PaLM as the LLM, here we consider a rela-
tively smaller LLM, 20B GPT-NeoX model (Black
et al., 2022), from which we extract rationales for
Distilling step-by-step. In Table 1, we see that
when coupled with LLMs of different sizes, Distill-
ing step-by-step can still provide performance im-
provements compared to standard finetuning. How-
ever, the performance lift is smaller when rationales
are extracted from the 20B GPT-NeoX model in-
stead of from the 540B PaLM. This can be due
to the fact that the larger PaLM model provides
higher-quality rationales that are more beneficial
for learning the task.

Multi-task training is much more effective than
single-task rationale and label joint prediction.
There are different possible ways to train task-
specific models with LLM-rationales as output su-
pervisions. One straightforward approach is to con-
catenate the rationale 7; and label ; into a single

 

Dataset
Method e-SNLI ANLI CQA SVAMP
STANDARD FINETUNING = 88.38 = 43.58 62.19 62.63
SINGLE-TASK TRAINING 88.88 = 43.50 61.37. 63.00
MULTI-TASK TRAINING 89.51 49.58 63.29 65.50

 

sequence [?;,9,] and treat the entire sequence as
the target output in training small models, as con-
sidered in (Magister et al., 2022; Ho et al., 2022):

N
Lengle = 57 LMF (esis). G)

i=l

In Table 2, we compare this single-task training
approach to our proposed multi-task training ap-
proach for utilizing LLM-rationales. We see that
not only multi-task training consistently leads to
better performance, single-task training with LLM-
rationales can at times leads to worse performance
than standard finetuning, e.g., on ANLI and CQA.
In fact, similar results have also been observed
in (Wiegreffe et al., 2021; Magister et al., 2022;
Ho et al., 2022) that simply treating rationale and
label predictions as a single joint task may harm the
model’s performance on label prediction. This val-
idates our use of the multi-task training approach,
and highlights the need to treat the rationales care-
fully so as to unleash their actual benefits.

5 Discussion

We propose Distilling step-by-step to extract ra-
tionales from LLMs as informative supervision in
training small task-specific models. We show that
Distilling step-by-step reduces the training dataset
required to curate task-specific smaller models; it
also reduces the model size required to achieve,
and even surpass, the original LLM’s performance.
Distilling step-by-step proposes a resource-efficient
training-to-deployment paradigm compared to ex-
isting methods. Further studies demonstrate the
generalizability and the design choices made in
Distilling step-by-step. Finally, we discuss the lim-
itations, future directions and ethics statement of
our work below.
"""

In [5]:
system_prompt = """
You are given a table extracted from a document, along with associated text from the same source. \
If the table is meaningful then summarize its content else mention 'no meaningful content'. \
Response should be in JSON format with schema {'summary': '...'}.
"""

In [6]:
example_table1 = [
    ['Independent\nVariable', 'Dependent\nVariable', 'Dependent\nVariableType', 'Name'],
    ['Many', 'Single', '2Categories', 'Binary classifi-\ncation'],
    ['Many', 'Single', '>2Categories', 'Multi-class\nclassification'],
    ['Many', 'Many', '2 Categories\nper dependent\nvariable', 'Multi-label\nclassification'],
    ['Many', 'Many', '> 2 Categories\nper dependent\nvariable', 'Multi-output\nclassification'],
    ['Single', 'Single', 'Numeric', 'Simple Regres-\nsion'],
    ['Many', 'Single', 'Numeric', 'Multiple\nRegression'],
    ['Many', 'Many', 'Numeric', 'Multivariate\nMultiple\nRegression']
]

In [7]:
example_text1 = """
Reverse Principal Component Analysis for\nMulti-Output Regression\nAkshit Bhalla\nDepartment of Industrial \
Engineering and Management\nRV College of Engineering\nBengaluru, Karnataka, \
India\nakshitbhalla.im16@rvce.edu.in\nAbstract—The problem of multi-output regression deals with \
the methodology proposed in this paper significantly surpasses\npredicting more than one value given \
an observation.This paper the performance of the most commonly adopted techniques. A\nproposes a novel \
method to accomplish this task by using a simple 3 step procedure of transforming, predicting, and \
trans-\npopular technique named Principal Component Analysis (PCA).\nforming back proves to be excellent \
for making predictions,\nThe approach is to reduce the dimensions \
of the target data\nespecially when time is a concern.\nand make predictions on it, following which the predictions \
are\ntransformed to the higher dimension. This approach was com-\nII. \
LITERATUREREVIEW\npared against several existing approaches using publicly available\ndatasets. It was found to largely \
outperform other approaches. The problem of multi-output regression is not new. It is\nApplication areas include \
(not limited to) climatology, genetics, known by several names in literature including multi-target\nimage processing \
and computer vision, and medicine.\nregression, multi-response regression, and multivariate regres-\nIndex \
Terms—Principal Component Analysis, Multi-output\nsion. The following table summarizes the many prediction\nregression, \
Multivariate Regression, Machine Learning, Dimen-\nsionality Reduction, Multivariate Statistics, Data Science tasks.\nI. \
INTRODUCTION TABLEI\nThe objective of multi-output regression is to predict more\nCLASSIFICATION &REGRESSION PROBLEMS\nthan \
one unknown dependent numeric variable when a few\nindependent variables are known.There exist numerous appli- Independent \
Dependent Dependent Name\nVariable Variable VariableType\ncations for modeling and predicting several unknown variables\nMany \
Single 2Categories Binary classifi-\nsimultaneously. For example, predicting the coordinates and cation\nvelocities of \
celestial objects, multi-trait prediction in genet- Many Single >2Categories Multi-class\nclassification\nics [1], \
forecasting temperature and pressure conditions in\nMany Many 2 Categories Multi-label\nclimatology, and prediction for \
gas tank levels [2]. For the per dependent classification\nrecent COVID19 pandemic, John Hopkins University Center \
variable\nMany Many > 2 Categories Multi-output\nfor Systems Science and Engineering has been making data\nper dependent \
classification\navailable to the public [3]. Kaggle has been using this data variable\nto host weekly research code \
competitions to predict the Single Single Numeric Simple Regres-\ninfections and deaths for the following week [4]–[8]. \
This is sion\nMany Single Numeric Multiple\nalso a multi-output regression problem.\nRegression\nThe task of predicting \
multiple variables is interesting Many Many Numeric Multivariate\nbecause more often than not,even the variables to predict share \
Multiple\nRegression\nsome relationship with each other. Ever since its introduction\nby Karl Pearson in 1901, principal \
component analysis has\nbecome the technique of choice for reducing dimensions.From In general, classification is the case when \
the dependent\nbeing used in finance to reduce portfolio risk [9] to being variable is categorical and regression is the \
case when the\nused in image processing and computer vision to empower dependent variable is numeric. Just for the sake \
of clarity,\nautonomous cars [10], PCA has unlimited applications. The simple regression is the case when there is a single \
inde-\nbeauty of this technique over other techniques for reducing pendent variable and a single numeric dependent \
variable\ndimensions lies in the fact that it captures the essence of the to predict. Multiple regression is the case when \
there are\ndata and allows one to decide what dimension to reduce to many independent variables and a single numeric \
dependent\nand at how much loss of data. It might be counterintuitive to variabletopredict.Multivariate multiple \
regression is the case\nsome, that a technique for reducing dimensions can be used when there are many independent variables \
and many numeric\nfor modeling the multi-output regression problem. However, dependent variables to predict. For many \
decades several\n978-1-7281-6453-3/20/$31.00©2020IEEE\nAuthorized licensed use limited to: Columbia University Libraries. \
Downloaded on October 25,2024 at 03:56:04 UTC from IEEE Xplore. Restrictions apply.\n49
"""

In [8]:
example_summary1 = """
The table categorizes data types for classification and regression problems in machine learning. The categories are:

* Binary classification (single dependent variable)
* Multi-class classification (>2 categories per dependent variable)
* Multi-label classification (2 categories per dependent variable)
* Multi-output classification (> 2 categories per dependent variable)
* Simple regression (1 independent variable, 1 numeric dependent variable)
* Multiple regression (many independent variables, 1 numeric dependent variable)
* Multivariate multiple regression (many independent variables, many numeric dependent variables)

The text provides context on the problem of multi-output regression, which involves predicting more than one unknown \
dependent numeric variable when a few independent variables are known. It discusses how Principal Component Analysis \
(PCA) can be used to reduce dimensions and improve prediction accuracy. The paper proposes a novel method for multi-output \
regression using PCA and compares its performance against existing approaches.

The text also mentions the various application areas of multi-output regression, including climatology, genetics, image \
processing, and computer vision, as well as medicine. It highlights the importance of dimensionality reduction in machine \
learning problems and provides an overview of the key concepts and techniques used in classification and regression problems.
"""

In [9]:
example_table2 = [
    ['Observed\nPrediction'], 
    ['gpt-4']
]

In [10]:
example_text2 = """
OpenAI code base next word prediction \nBits per word\n6.0\n Observed\n Prediction\n 5.0 gpt-4\n4.0\n3.0\n2.0\n1.0\n100p \
10n 1µ 100µ 0.01 1\nCompute\nFigure1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset \
derived\nfrom our internal code base.This is a convenient, large dataset of code tokens which is not contained in\n the \
training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of \
training compute. A power law fit to the smaller models(excludingGPT-4) is\nshown as the dotted line; this fit accurately \
predicts GPT-4’s final loss.The x-axis is training compute\n normalized so that GPT-4 is 1.\nCapability prediction on 23 \
coding problems\n–Mean Log Pass Rate\n5\n Observed\nPrediction\n4 gpt-4\n3\n2\n1\n0\n1µ 10µ 100µ 0.001 0.01 0.1 \
1\nCompute\nFigure2.Performance of GPT-4 and smaller models.The metric is mean log pass rate on a subsetof\n the Human Eval \
dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts \
GPT-4’s performance.The x-axis is training compute normalized so that\nGPT-4 is 1.\n3
"""

In [11]:
example_summary2 = "no meaningful content"
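
The two worked examples defined above can also be passed to a chat model as alternating user and assistant turns instead of one long prompt. The sketch below shows that layout. `build_few_shot_messages` is a hypothetical helper and an assumption about how the examples might be used, not the approach the notebook runs.

```python
import json


def build_few_shot_messages(system_prompt, examples, table, text):
    """Return a chat message list with worked examples before the query.

    `examples` is a list of (table, text, summary) tuples.
    """
    def format_query(tbl, txt):
        return f"Table:\n{tbl}\n\nText:\n{txt}\n\nSummary:\n"

    messages = [{"role": "system", "content": system_prompt}]
    for ex_table, ex_text, ex_summary in examples:
        messages.append(
            {"role": "user", "content": format_query(ex_table, ex_text)}
        )
        # The assistant turn mirrors the requested schema {"summary": "..."}.
        messages.append(
            {"role": "assistant", "content": json.dumps({"summary": ex_summary})}
        )
    messages.append({"role": "user", "content": format_query(table, text)})
    return messages


demo_examples = [
    ([["A", "B"], ["1", "2"]], "Some context.", "A small two-column table."),
    ([["Observed"]], "Figure residue.", "no meaningful content"),
]
demo_messages = build_few_shot_messages(
    "Summarize the table.", demo_examples, [["X"], ["9"]], "Target text."
)
print([m["role"] for m in demo_messages])
```

With the notebook's variables, `examples` would be `[(example_table1, example_text1, example_summary1), (example_table2, example_text2, example_summary2)]`.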

In [16]:
# query = f"""
# Table: 
# ```
# {example_table1}
# ```

# Text:
# ```
# {example_text1}
# ```

# Summary:
# {{
#     'summary': '{example_summary1}'
# }}

# -----------------------------------------------

# Table: 
# ```
# {example_table2}
# ```

# Text:
# ```
# {example_text2}
# ```

# Summary:
# {{
#     'summary': '{example_summary2}'
# }}

# -----------------------------------------------

# Table:
# ```
# {table}
# ```

# Text:
# ```
# {text}
# ```

# Summary:
# """

In [19]:
query = f"""
Table:
```
{table}
```

Text:
```
{text}
```

Summary:
"""

In [20]:
res = query_ollama(query)

In [21]:
print(res["message"]["content"])

{"summary": "The table shows that Distilling step-by-step improves performance on various tasks when using different sizes of LLMs as input for extracting rationales. It outperforms standard finetuning in some cases, especially with smaller LLMs like the 20B GPT-NeoX model. Multi-task training is also more effective than single-task rationale and label joint prediction."}
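
The reply arrives as a JSON string, so downstream code still has to decode it and confirm that the expected `summary` key is present. A minimal sketch of such a check follows. `extract_summary` is a hypothetical helper, and returning `None` for malformed or truncated replies is an assumed convention.

```python
import json


def extract_summary(content):
    """Parse the model's reply and return its 'summary' string, or None."""
    try:
        parsed = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Truncated or non-JSON replies, or a missing reply (None).
        return None
    if not isinstance(parsed, dict):
        return None
    summary = parsed.get("summary")
    return summary if isinstance(summary, str) else None


reply = '{"summary": "no meaningful content"}'
print(extract_summary(reply))
```

In the notebook, the call would be `extract_summary(res["message"]["content"])`.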
