# Finetuning to follow instructions

<img src="https://camo.githubusercontent.com/6736ab7968f8da6bd6fc747de22ef9afa9d840373749005ce3e96fc6ead7ed8c/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830375f636f6d707265737365642f636861707465722d6f766572766965772d312e776562703f31" width=700>

## Stage 1: Preparing the dataset

### 1. Dataset download and preparation

In [1]:
import json
import os
import urllib.request  # "import urllib" alone does not guarantee the request submodule is loaded

In [2]:
def download_and_load_file(file_path, url):

    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    # The book originally contained this unnecessary "else" clause:
    #else:
    #    with open(file_path, "r", encoding="utf-8") as file:
    #        text_data = file.read()

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data


file_path = "instruction-data.json"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch"
    "/main/ch07/01_main-chapter-code/instruction-data.json"
)

data = download_and_load_file(file_path, url)
print("Number of entries:", len(data))

Number of entries: 1100


In [4]:
data[50:53]

[{'instruction': 'Identify the correct spelling of the following word.',
  'input': 'Ocassion',
  'output': "The correct spelling is 'Occasion.'"},
 {'instruction': "What type of figurative language is used in 'She is the apple of my eye'?",
  'input': '',
  'output': 'The figurative language used is a metaphor.'},
 {'instruction': 'Correct the spelling error in the sentence.',
  'input': 'I will atend the meeting tomorrow.',
  'output': "The correct spelling is 'attend', not 'atend'."}]
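The sample above shows that every entry has the keys `instruction`, `input`, and `output`, and that `input` may be an empty string. As a rough sketch of how one might check how common each case is, the snippet below counts both kinds of entries. The `sample` list and the helper name `summarize_inputs` are illustrative assumptions, not part of the original notebook; in the notebook you would pass `data` instead.

```python
from collections import Counter

# Hypothetical stand-in for `data`: the three entries shown above.
sample = [
    {
        "instruction": "Identify the correct spelling of the following word.",
        "input": "Ocassion",
        "output": "The correct spelling is 'Occasion.'",
    },
    {
        "instruction": "What type of figurative language is used in 'She is the apple of my eye'?",
        "input": "",
        "output": "The figurative language used is a metaphor.",
    },
    {
        "instruction": "Correct the spelling error in the sentence.",
        "input": "I will atend the meeting tomorrow.",
        "output": "The correct spelling is 'attend', not 'atend'.",
    },
]


def summarize_inputs(entries):
    """Count entries with and without a non-empty 'input' field."""
    counts = Counter(
        "with_input" if entry.get("input") else "without_input"
        for entry in entries
    )
    return dict(counts)


summary = summarize_inputs(sample)
print(summary)
```

Knowing this split matters for the prompt formatting that follows, since the `### Input:` section should only appear when there is something to put in it.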

---

<img src="https://camo.githubusercontent.com/56327c274257475f53fbb0a25fac50e703cbd67af30ff930eb3611c3356f6da6/68747470733a2f2f73656261737469616e72617363686b612e636f6d2f696d616765732f4c4c4d732d66726f6d2d736372617463682d696d616765732f636830375f636f6d707265737365642f70726f6d70742d7374796c652e776562703f31" width=700>

The left side shows the ***Alpaca*** prompt style; the right side shows the ***Phi-3*** prompt style developed by Microsoft.

We will use the Alpaca prompt style.

In [13]:
def format_input(entry):
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text

In [15]:
print(format_input(data[50]))
print(f"\n### Response:\n{data[50]['output']}")

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
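The two `print` calls above assemble the prompt and the response separately. A minimal sketch of combining them into a single string, for instance to build one complete training text per entry, could look like the following. The helper name `format_full_example` is a hypothetical addition, and `format_input` is repeated only so the block runs on its own.

```python
def format_input(entry):
    # Same Alpaca-style prompt as defined earlier, repeated for self-containment.
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


def format_full_example(entry):
    # Hypothetical helper: prompt followed by the response section.
    return format_input(entry) + f"\n\n### Response:\n{entry['output']}"


# Illustrative entry copied from the dataset sample shown earlier.
entry = {
    "instruction": "Identify the correct spelling of the following word.",
    "input": "Ocassion",
    "output": "The correct spelling is 'Occasion.'",
}

full_text = format_full_example(entry)
print(full_text)
```

The extra leading `\n` in the response part reproduces the blank line that the two separate `print` calls create between the prompt and `### Response:`.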


In [16]:
print(format_input(data[51]))
print(f"\n### Response:\n{data[51]['output']}")

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What type of figurative language is used in 'She is the apple of my eye'?

### Response:
The figurative language used is a metaphor.
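For comparison with the Alpaca output above, here is a rough sketch of what a Phi-3 style formatter might look like. It assumes the `<|user|>` and `<|assistant|>` markers associated with that style; the exact whitespace and the function names `format_input_phi3` and `format_response_phi3` are assumptions made for illustration, not something defined in this notebook.

```python
def format_input_phi3(entry):
    # Assumed layout: user marker, the instruction, then the optional input on its own line.
    user_text = f"<|user|>\n{entry['instruction']}"
    input_text = f"\n{entry['input']}" if entry["input"] else ""
    return user_text + input_text


def format_response_phi3(entry):
    # Assumed layout: assistant marker followed by the expected output.
    return f"\n<|assistant|>\n{entry['output']}"


# Illustrative entry copied from the dataset sample shown earlier (empty input).
entry = {
    "instruction": "What type of figurative language is used in 'She is the apple of my eye'?",
    "input": "",
    "output": "The figurative language used is a metaphor.",
}

phi3_text = format_input_phi3(entry) + format_response_phi3(entry)
print(phi3_text)
```

Compared with the Alpaca template, this variant drops the fixed preamble and the `###` section headers, so each formatted example is shorter.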
