# GPT4Tools: Teaching LLM to Use Tools via Self-instruction

GPT4Tools uses GPT-3.5 to generate tool-related instruction-following data, which is then used to tune a language model.
The tuned model can access multi-modal information by invoking visual models.

## Dataset Construction
1. Data Generation

$Y \sim M_T(P_t \mid X_C)$

$Y$: the generated instruction-following data (a large number of samples)

$X_C$: image content

$M_T$: GPT-3.5

$P_t$: A tool-related prompt

$P_t$ comprises the system message, the tool definitions (`tool name : usage scenario, arguments`), and a suffix prompt that encourages $M_T$ to generate visual instructions and the desired outputs. $Y$, the output of $M_T$, consists of $N$ instruction-output pairs $\{y^1, y^2, \ldots, y^N\}$, where each $y^i$ has the format "`instruction, tool name, arguments`" and $N$ is the number of defined tools.
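As a rough illustration of how such a prompt could be assembled, here is a minimal sketch. The class `ToolDef`, the function `build_tool_prompt`, the `Image content:` label, and the example tool and caption are all hypothetical; the source only specifies which components the prompt contains, not their exact wording or order.

```python
from dataclasses import dataclass


@dataclass
class ToolDef:
    """One tool definition: `tool name : usage scenario, arguments`."""
    name: str
    usage: str
    arguments: str


def build_tool_prompt(system_message, tools, image_content, suffix):
    """Assemble P_t together with the image content X_C as one text prompt."""
    tool_lines = [f"{t.name} : {t.usage}, {t.arguments}" for t in tools]
    parts = [system_message, *tool_lines, f"Image content: {image_content}", suffix]
    return "\n".join(parts)


# Hypothetical tool and caption, for illustration only.
tools = [ToolDef("Example Detector", "useful for locating an object", "image_path, object name")]
prompt = build_tool_prompt(
    "You generate visual instructions for the tools below.",
    tools,
    "a dog lying on a sofa",
    "Write one instruction per tool as: instruction, tool name, arguments",
)
print(prompt)
```

The resulting string would be sent to $M_T$; parsing the returned instruction-output pairs is left out here because the source does not describe the raw output layout in detail.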

Without image content as a prior, GPT-3.5 tends to generate visual instructions about only a small subset of objects, which shows up in t-SNE visualizations as sparser clusters.

A language model tuned on image-conditioned data is more robust than one tuned on data generated without image content.

![jupyter](./figures/GPT4Tools_figure1.png)

2. Data Formation
   
   1. Filter the raw dataset by removing duplicate instructions, incorrectly formatted instructions, calls with incorrect tool names, and calls with incorrectly formatted tool arguments (70K items -> 41K).

   2. Transform the retained data into an instruction-response format using a standardized template, as shown in the bottom-left corner of Fig. 1. This procedure produces a new dataset, $Y^+_S$.
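The filtering in step 1 might look roughly like the sketch below. The sample layout (a dict with `instruction`, `tool`, and `arguments` keys) and the per-tool validator functions are assumptions made for illustration; the source lists the filtering criteria but not how they are checked.

```python
def filter_samples(samples, validators):
    """Keep well-formed, non-duplicate samples.

    `validators` maps each known tool name to a function that returns True
    when the argument string has the expected format (hypothetical helper).
    """
    seen = set()
    kept = []
    for s in samples:
        fields = ("instruction", "tool", "arguments")
        if not all(isinstance(s.get(k), str) and s[k].strip() for k in fields):
            continue  # incorrectly formatted sample
        if s["tool"] not in validators:
            continue  # incorrect tool name
        if not validators[s["tool"]](s["arguments"]):
            continue  # incorrect argument format
        key = s["instruction"].strip().lower()
        if key in seen:
            continue  # duplicate instruction
        seen.add(key)
        kept.append(s)
    return kept
```

Duplicates are detected here by case-insensitive exact match, which is a deliberately simple choice; a real pipeline may use a looser notion of similarity.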

$Y^+_S$ includes:

`prefix prompt`: encompasses the system message and the tool definitions

`image content`: the image content

`user input`: replaced with the generated visual instruction

Response:

`Thought`: the model's reasoning about whether a tool is needed

`Action`: the tool the model will use or the action it will take

`Action input`: the arguments of the selected tool

`Observation`: the outcome returned by the tool

![jupyter](./figures/GPT4Tools_figure2.png)
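To make the template concrete, the following sketch renders one retained sample into an instruction-response pair with the fields listed above. The function name `format_sample`, the `Image content:` and `Human:` labels, and the exact wording of the `Thought` line are assumptions; the authoritative template is the one shown in the figure.

```python
def format_sample(prefix_prompt, image_content, instruction, tool, arguments, observation):
    """Render one positive sample (tool is used, so the decision is 'Yes')."""
    prompt = (
        f"{prefix_prompt}\n"
        f"Image content: {image_content}\n"
        f"Human: {instruction}\n"
    )
    response = (
        "Thought: Do I need to use a tool? Yes\n"
        f"Action: {tool}\n"
        f"Action Input: {arguments}\n"
        f"Observation: {observation}\n"
    )
    return {"prompt": prompt, "response": response}


# Hypothetical values, for illustration only.
sample = format_sample(
    "SYSTEM MESSAGE AND TOOL DEFINITIONS",
    "a dog lying on a sofa",
    "Segment the dog in image/a.png",
    "Example Segmenter",
    "image/a.png",
    "image/a_seg.png",
)
print(sample["prompt"] + sample["response"])
```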

3. Data Augmentation

   Challenge: this simplistic format lacks complexity and depth in both instructions and responses.

   Negative samples. The generated instructions primarily focus on tool usage, i.e., the decision after the Thought is always "Yes". Consequently, there is a risk that the fine-tuned model overfits to this decision. When a user instruction is unrelated to tool usage, the fine-tuned model may erroneously execute irrelevant actions by invoking unnecessary tools. To mitigate this issue, we synthesize negative samples $Y_S^-$ by selecting conversation data from an existing dataset [https://arxiv.org/abs/2304.03277] and converting them into the required template, as illustrated in Figure 3 (b). By tuning with $Y_S^+ \cup Y_S^-$, the model learns to decide when to use tools.

   Context samples. The generated instructions adopt a standard and fixed `single-turn` format, which lacks a contextual structure. Thus, as shown in Figure 3 (c), we augment the dataset by cutting off the chain of actions. We also randomly select multiple instructions from $Y_S^+ \cup Y_S^-$ and reformat them into multi-turn conversation data. In this way, we synthesize the contextual instruction-following data $Y_S^c$, enabling the tuned model to call tools within a given context.


Total dataset: $$Y_S = Y_S^+ \cup Y_S^- \cup Y_S^c$$
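The two augmentations can be sketched as below, under stated assumptions: each sample is a dict with `user_input` and `response`, the `Human:` and `AI:` labels are hypothetical, and only the multi-turn merging is shown (cutting off an action chain is omitted because the source does not detail it).

```python
import random


def make_negative_sample(user_input, reply):
    """Wrap a tool-free conversation turn so the Thought decision is 'No'."""
    response = f"Thought: Do I need to use a tool? No\nAI: {reply}\n"
    return {"user_input": user_input, "response": response}


def make_context_sample(samples, num_turns, rng):
    """Merge randomly chosen single-turn samples into one multi-turn sample.

    Earlier turns become the conversation history; the response of the
    last turn is the training target.
    """
    turns = rng.sample(samples, num_turns)
    history = "".join(f"Human: {t['user_input']}\n{t['response']}" for t in turns[:-1])
    last = turns[-1]
    return {"context": history, "user_input": last["user_input"], "response": last["response"]}
```

Passing an explicit `random.Random` instance as `rng` keeps the sampling reproducible, which helps when regenerating the dataset.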

## Instruction Tuning

Tune the off-the-shelf language model using its original auto-regressive training objective.

Use LoRA optimization, which freezes the language model's weights and optimizes only the rank-decomposition components of the Transformer layers.
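For intuition, the standard LoRA forward pass for a single linear layer is $h = Wx + \frac{\alpha}{r} B A x$, where $W$ is frozen and only the low-rank matrices $A$ and $B$ are trained. The pure-Python sketch below illustrates that formula; the matrix values, rank $r$, and scale $\alpha$ are made up for the example and are not the settings used in the source.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable, shape (r, d_in)
    B: trainable, shape (d_out, r), usually initialized to zeros
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Illustrative values with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]
h_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=2.0, r=1)   # B = 0: same as W x
h_tuned = lora_forward(W, A, [[1.0], [2.0]], x, alpha=2.0, r=1)  # after B has changed
```

With $B$ at zero the layer reproduces the frozen model exactly, so tuning starts from the original behaviour.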

For a sequence with $L$ tokens, compute the probability of the target response $X_r$ by:
$$
p(X_r \mid X_C, X_{\text{inst}}) = \prod_{i=1}^{L} p_\theta(x_i \mid X_C, X_{\text{inst}}, x_{1:i-1}),
$$

$X_{\text{inst}}$: instruction tokens

$\theta$: trainable parameters
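Taking the logarithm turns the product above into a sum of per-token log-probabilities, and tuning with the auto-regressive objective amounts to minimizing the negative of that sum. The sketch below shows the arithmetic on made-up per-token probabilities; the function names are hypothetical and the source does not write the loss out explicitly.

```python
import math


def response_log_likelihood(token_probs):
    """Sum of log p(x_i | X_C, X_inst, x_{1:i-1}) over the L response tokens."""
    return sum(math.log(p) for p in token_probs)


def response_nll(token_probs):
    """Negative log-likelihood of the response (assumed training loss)."""
    return -response_log_likelihood(token_probs)


# Hypothetical per-token probabilities for a 3-token response.
probs = [0.5, 0.25, 0.8]
sequence_prob = math.exp(response_log_likelihood(probs))  # product of the three values
loss = response_nll(probs)
```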

## Evaluation Approach
Construct an evaluation dataset following the same procedure detailed in § 3.1 of the paper, and manually verify the accuracy of each item.

Evaluation dataset:
1. Validation set: has the same ingredients as the training set, encompassing 23 tools (defined in Visual ChatGPT).
   
   It validates whether the model adheres to user commands correctly after tuning with the training set.
2. Test set: comprises 8 novel tools absent from the training set.
   
   It verifies whether the model can generalize to new tools after tuning.



Based on the human-annotated evaluation dataset with $N$ instructions, a successful rate is designed to measure the model's performance from three aspects, together with an overall rate:


- **Successful Rate of Thought** ($SR_t$) measures whether the predicted decision matches the ground-truth decision. It is calculated as $$SR_t = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(\tau_i),$$ where $\tau_i$ signifies a singular process, i.e., the decision made for the $i$-th instruction. If the thought is correct, $\mathbb{I}(\tau_i)$ is equal to 1, and 0 otherwise.

- **Successful Rate of Action** ($SR_{act}$) measures whether the predicted tool name is in agreement with the name of the ground truth tool. It is calculated as $$SR_{act} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(\alpha_i),$$ where $\alpha_i$ denotes the matching process for the tool names. In cases where the predicted tool name matches the pre-defined name, $\mathbb{I}(\alpha_i)$ is equal to 1, and 0 otherwise.

- **Successful Rate of Arguments** ($SR_{args}$) evaluates whether the predicted arguments match the ground-truth arguments. It is calculated as
$$SR_{\text{args}} = \frac{1}{N} \sum_{i=1}^N \eta_i, \quad \text{where } \eta_i = \frac{1}{K} \sum_{j=1}^{K} \eta_{i,j}.$$

$\eta_i$ is the score of the $i$-th sequence of arguments, which encompasses both the image path and the input text.

$K$ is the number of arguments in that sequence. When the $j$-th argument is an image path, $\eta_{i,j}$ equals 1 if the predicted and ground-truth image paths share the same suffix, and 0 otherwise. When the argument is the input text, $\eta_{i,j}$ equals the BLEU score between the predicted and ground-truth text.

- **Successful Rate** ($SR$) measures whether a chain of actions is executed successfully, which requires the thought, the tool name, and the tool arguments to be correct at the same time:
$$SR = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(\tau_i) \cdot \mathbb{I}(\alpha_i) \cdot \mathbb{I}(\eta_i>0.5).$$
Additionally, when a procedure comprises two consecutive actions, the SR equals 100% only if both actions are executed correctly.
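The four rates above can be computed along the following lines. This is a sketch under several assumptions: the record layout and key names are hypothetical, "same suffix" is read as "same file extension", every record has at least one ground-truth argument, and `text_similarity` is a caller-supplied stand-in for BLEU (any function returning a score in $[0, 1]$). Procedures with two consecutive actions are not handled.

```python
import os


def argument_score(pred_args, gt_args, text_similarity):
    """eta_i: mean per-argument score over the K ground-truth arguments."""
    scores = []
    for name, gt_value in gt_args.items():
        pred_value = pred_args.get(name, "")
        if name == "image_path":
            # Assumption: "same suffix" is read as "same file extension".
            same = os.path.splitext(pred_value)[1] == os.path.splitext(gt_value)[1]
            scores.append(1.0 if same else 0.0)
        else:
            scores.append(text_similarity(pred_value, gt_value))
    return sum(scores) / len(scores)


def success_rates(records, text_similarity):
    """Return SR_t, SR_act, SR_args, and SR over N evaluation records."""
    n = len(records)
    thought = [1.0 if r["pred_decision"] == r["gt_decision"] else 0.0 for r in records]
    action = [1.0 if r["pred_tool"] == r["gt_tool"] else 0.0 for r in records]
    eta = [argument_score(r["pred_args"], r["gt_args"], text_similarity) for r in records]
    overall = [t * a * (1.0 if e > 0.5 else 0.0) for t, a, e in zip(thought, action, eta)]
    return {
        "SR_t": sum(thought) / n,
        "SR_act": sum(action) / n,
        "SR_args": sum(eta) / n,
        "SR": sum(overall) / n,
    }
```

Note that the threshold in the overall rate is strict: a record whose argument score is exactly 0.5 does not count as successful.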
