# Data Extraction

This notebook extracts the instruction and output of each record in the Alpaca dataset and stores them as two fields: `question` and `answer`.
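
As a minimal sketch of that mapping, the snippet below applies it to a single hand-written record. The `sample_records` list is a hypothetical stand-in for the dataset file; the key names `instruction` and `output` are the ones the extraction function in this notebook reads.

```python
# Hypothetical sample in the Alpaca layout; the real file holds about 52K such records.
sample_records = [
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow.",
    },
]

# Keep the instruction as the question and the output as the answer.
sample_rows = [
    {"question": record["instruction"], "answer": record["output"]}
    for record in sample_records
]

print(sample_rows[0]["question"])
print(sample_rows[0]["answer"])
```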

In [4]:
# Files to clean
file_paths = [{'input_file': 'alpaca_data.json', 'output_file': 'alpaca_data_trained.json'}]

output_dir = './output'

In [5]:
!gdown 1dSqzV4HbdX4RzO7862sUFzJ_YzMwBuvR

Downloading...
From: https://drive.google.com/uc?id=1dSqzV4HbdX4RzO7862sUFzJ_YzMwBuvR
To: /content/alpaca_data.json
100% 22.8M/22.8M [00:00<00:00, 71.0MB/s]


## Stanford Alpaca: An Instruction-following LLaMA Model

Authors: Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto

Dataset: https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json

Year of Publication: 2023

### Overview (quoted from the Stanford Alpaca README)
The current Alpaca model is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by the techniques in the Self-Instruct paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite.

Alpaca is still under development, and there are many limitations that have to be addressed. Importantly, we have not yet fine-tuned the Alpaca model to be safe and harmless. We thus encourage users to be cautious when interacting with Alpaca, and to report any concerning behavior to help improve the safety and ethical considerations of the model.

Our initial release contains the data generation procedure, dataset, and training recipe. We intend to release the model weights if we are given permission to do so by the creators of LLaMA. For now, we have chosen to host a live demo to help readers better understand the capabilities and limits of Alpaca, as well as a way to help us better evaluate Alpaca's performance on a broader audience.

In [6]:
import json

def data_extraction(file_path: str) -> list:
    """
    Extracts the relevant data from the JSON file and
    organizes it into two columns: questions, answers.

    Parameters:
        file_path (str): path to the JSON file

    Returns:
        list_questions_answers (list): list of dictionaries
        with 'question' and 'answer' keys
    """
    # Read as UTF-8 explicitly so the result does not depend on the platform default
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    list_questions_answers = []

    for item in data:
        question = item['instruction']
        answer = item['output']
        # Skip records whose output is empty
        if answer:
            question_answer = {
                'question': question,
                'answer': answer
            }
            list_questions_answers.append(question_answer)

    print(f'Total number of questions: {len(list_questions_answers)}')

    return list_questions_answers
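
Records in `alpaca_data.json` also carry an `input` field, which `data_extraction` does not read. For instructions that depend on it, the extracted question loses its context. The sketch below shows one possible way to fold the input into the question; `build_question` and the blank-line separator are assumptions for illustration, not part of the original pipeline.

```python
def build_question(item: dict) -> str:
    """Hypothetical helper: append the optional 'input' text to the instruction."""
    instruction = item["instruction"].strip()
    # .get() tolerates records that have no 'input' key at all
    extra = item.get("input", "").strip()
    return f"{instruction}\n\n{extra}" if extra else instruction


# Hand-written records for illustration only
with_input = {
    "instruction": "Identify the odd one out.",
    "input": "Twitter, Instagram, Telegram",
    "output": "Telegram",
}
without_input = {
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "Eat well, exercise, and sleep enough.",
}

print(build_question(with_input))
print(build_question(without_input))
```

Whether to merge the two fields depends on how the question/answer pairs are used downstream; keeping `input` as a separate third column would be an equally reasonable choice.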


In [7]:
import os

file_path = file_paths[0]
input_file = file_path['input_file']
output_file = file_path['output_file']
output_dir = 'output'

print(f"Processing {input_file}...\n")
list_questions_answers = data_extraction(input_file)
print(f"Number of questions and answers in {input_file}: {len(list_questions_answers)}\n")
print("First 5 entries:\n")
# Slicing avoids an IndexError if fewer than 5 entries were extracted
for i, entry in enumerate(list_questions_answers[:5], start=1):
    print(f"Question {i}: {entry['question']}")
    print(f"Answer {i}: {entry['answer']}\n")

# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

output_file_path = os.path.join(output_dir, output_file)
with open(output_file_path, 'w', encoding='utf-8') as file:
    json.dump(list_questions_answers, file)

print(f"Data saved in {output_file_path}\n")
print("-" * 20)

Processing alpaca_data.json...

Total number of questions: 51974
Number of questions and answers in alpaca_data.json: 51974

First 5 entries:

Question 1: Give three tips for staying healthy.
Answer 1: 1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.

Question 2: What are the three primary colors?
Answer 2: The three primary colors are red, blue, and yellow.

Question 3: Describe the structure of an atom.
Answer 3: An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom.

Question 4: How can we reduce air pollution?
Answer 4: There are a number of way
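
A quick way to confirm the saved file is usable is to load it back and compare it with the in-memory list. The sketch below does this in a temporary directory with hand-written rows, so it runs without the dataset; `save_rows` is a hypothetical helper, and `ensure_ascii=False` is an optional choice that keeps non-ASCII characters readable in the file.

```python
import json
import os
import tempfile


def save_rows(rows: list, out_dir: str, file_name: str) -> str:
    """Hypothetical helper: write rows as JSON and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, file_name)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, ensure_ascii=False)
    return path


# Hand-written row for illustration only
rows = [{"question": "What is a café?", "answer": "A small restaurant."}]

with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_rows(rows, os.path.join(tmp, "output"), "sample.json")
    # Load the file back to check that nothing was lost on the way to disk
    with open(saved_path, "r", encoding="utf-8") as fh:
        reloaded = json.load(fh)

print(reloaded == rows)
```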

In [8]:
# Zip the output directory and download the archive (files.download works only in Google Colab)
import shutil
from google.colab import files
from datetime import datetime

file_name = 'dataset' + '_' + datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.make_archive(file_name, 'zip', 'output')
files.download(file_name + '.zip')
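
`files.download` is available only inside Google Colab. When running elsewhere, the archive step alone is enough, since `shutil.make_archive` returns the path of the file it creates. The sketch below wraps that step; `archive_directory` and the temporary directories are assumptions made so the example runs on its own.

```python
import os
import shutil
import tempfile
import zipfile
from datetime import datetime


def archive_directory(src_dir: str, dest_dir: str, prefix: str = "dataset") -> str:
    """Hypothetical helper: zip src_dir into dest_dir with a timestamped name."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    base_name = os.path.join(dest_dir, f"{prefix}_{stamp}")
    # make_archive appends '.zip' and returns the full archive path
    return shutil.make_archive(base_name, "zip", src_dir)


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway 'output' directory holding one small file
    src = os.path.join(tmp, "output")
    os.makedirs(src)
    with open(os.path.join(src, "sample.json"), "w", encoding="utf-8") as fh:
        fh.write("[]")

    archive_path = archive_directory(src, tmp)
    # List the archive contents to confirm the file was included
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```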
