Open-Instructions

A Pavilion of recent Open Source Generative Pre-trained Transformer (GPT) Projects for Decentralized AI.

Code License Data License Python 3.9+ Code style: yapf

Overview

[Ailurus logo image]

The recent surge in more efficient, open-source LLM projects has been nothing short of fervent, yet the various instruction-finetuned LLaMAs have left those genuinely interested in customized GPT, or even decentralized AI, feeling puzzled. Consequently, we built this project to consolidate existing resources on LLaMAs and other GPT variants. Since 🤗huggingface and peft have extensively integrated mature pre-training and fine-tuning pipelines, the majority of these open-source models and chatbots are compatible with the unified Trainer in transformers; the sole distinction lies in the datasets used for finetuning. Therefore, the paramount contribution of our open-source initiative is to help the community consolidate existing open-source instruction datasets, amalgamated and released as Open-Instructions (GoogleDrive). We evaluate their strengths and weaknesses, and will also release an open-source efficient model trained on all of this data, referred to as Ailurus, named after my favorite animal.

Notes: In this project, we inherit most of the code from Vicuna for data cleaning, finetuning and serving. A big shout out to the team for their fantastic work!
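To make the shared recipe concrete, below is a minimal, hedged sketch of instruction finetuning with the unified Trainer of transformers, assuming an Alpaca-style JSON file of instruction/input/output records. The checkpoint name, the file name open_instructions.json, and the prompt template are illustrative placeholders, not artifacts shipped by this repo.

```python
# Hedged sketch: instruction finetuning with the transformers Trainer.
# "gpt2" and "open_instructions.json" are placeholders for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # swap in the causal LM you are licensed to finetune
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Alpaca-style records: {"instruction": ..., "input": ..., "output": ...}
dataset = load_dataset("json", data_files="open_instructions.json")["train"]

def to_features(example):
    # Concatenate prompt and response into one causal-LM training string.
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(to_features, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language-modeling labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```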


Open-Instructions Summary

Here is a short summary of different open-source datasets for instruction finetuning.

| Dataset | Num of Samples (Lang) | Engine | Cost |
| --- | --- | --- | --- |
| Alpaca | 52K En | text-davinci-003 | <$500 |
| InstructionWild (Coati) | 52K En & 52K Zh | text-davinci-003 | $880 |
| ShareGPT-90K (Vicuna) | ~100K => 48K Multi-lingual | gpt-3.5-turbo | Scraped (free?) |
| GPT4ALL | ~806K => 437K Multi-lingual | gpt-3.5-turbo | $500 |
| GPT4LLM | 52K En & 52K Zh | gpt-4 | Est. >$880 |
| Dolly | ~15K En | Databricks employees | n/a |

Ailurus Checkpoints

We aim to open-source the entire training logs to receive feedback for improvements, but this is still ongoing. We follow Vicuna's data cleaning, training and inference pipeline to train Ailurus with LoRA. As of now, we only release one checkpoint at Google Drive, as we observe that training roughly saturates at ~10k steps. More is coming... Again, we are not distributing LLaMA weights.

Checkpoint
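For reference, the hedged sketch below shows how LoRA adapters can be attached with peft before training. The rank, alpha, and target modules are illustrative defaults, not Ailurus' exact configuration, and "gpt2" again stands in for the actual base checkpoint.

```python
# Hedged sketch: attaching LoRA adapters with peft (illustrative hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base checkpoint
lora_cfg = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection names differ per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```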

Current Development Plan

| Pipelines | Status |
| --- | --- |
| Instruction Data Analysis | 🔧 Developing |
| Instruction Tuning | ✅ Supported |
| Parameter-Efficient Tuning | ✅ Supported |
| Large Model Inference | 🔧 Developing |
| UI Serving | 🔧 Developing |
| Alignment Tuning | 🔧 Developing |

Alpaca

Stanford Alpaca is arguably the representative open-source lightweight GPT project that lit the fire. The dataset Alpaca was trained on contains 52K instruction-following examples generated in the style of Self-Instruct using text-davinci-003. The JSON file alpaca_data.json in their repo is a list of dictionaries; each dictionary contains the following fields (a small loading example follows the list):

  • instruction: str, describes the task the model should perform. Each of the 52K instructions is unique.
  • input: str, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
  • output: str, the answer to the instruction as generated by text-davinci-003.
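A minimal sketch of loading and inspecting the file, assuming alpaca_data.json has already been downloaded from the Stanford Alpaca repository into the working directory:

```python
import json

# Load the Alpaca instruction data (assumes alpaca_data.json is in the working directory).
with open("alpaca_data.json", "r", encoding="utf-8") as f:
    records = json.load(f)

print(len(records))  # ~52K dictionaries

sample = records[0]
print(sample["instruction"])  # unique task description
print(sample["input"])        # optional context; empty string for ~60% of examples
print(sample["output"])       # answer generated by text-davinci-003
```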

ColossalAI Chatbot

InstructionWild is the dataset ColossalAI's chatbot Coati was trained on. It inherits the same format as Alpaca for fast and easy usage; differently, its instructions have no input field. The developers scrape over 700 noisy instructions from Twitter, filter out the noisy ones, and subsequently pick 429 clean instructions to ensure high quality. They use a similar method as Alpaca to collect instructions; however, the generation does not require outputs for the instructions, thus avoiding human involvement. The prompts generated are more diverse and cover more topics compared to Alpaca's. They provide 5 prompts as examples for generating new instructions from the OpenAI API. After collecting prompts, they collect responses to these instructions from the OpenAI API. The English and Chinese datasets are generated separately. In total, $880 was spent to collect the dataset. There are 52K instructions for English (around 24M tokens) and 52K instructions for Chinese.

GPT4LLM

GPT-4-LLM follows exactly how Alpaca uses GPT to generate instructions, but utilizes the newer gpt-4 engine instead of text-davinci-003 to generate 52K instructions with the same prompts as in Alpaca. Additionally, the developers also release 52K instructions in Chinese.

Vicuna

Vicuna (not sure why everyone is so into the camelid family...) is created by fine-tuning a LLaMA base model on approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, the team converts the HTML back to markdown and filters out inappropriate or low-quality samples. Additionally, they divide lengthy conversations into smaller segments that fit the model's maximum context length (a hedged sketch of this step follows below). For detailed instructions on cleaning the ShareGPT data, check out here. Unfortunately, the developers of Vicuna choose not to release the data at the moment, but uncapped heroes have stepped up and released a pre-cleaned version of ShareGPT-90K (see discussions here), which one may clean manually using Vicuna's cleaning script.
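The sketch below illustrates the segment-splitting step with a hypothetical helper, not Vicuna's actual cleaning script; the tokenizer checkpoint and the 2048-token budget are assumptions.

```python
from transformers import AutoTokenizer

# Hypothetical helper (not Vicuna's script): greedily pack consecutive turns of a
# ShareGPT-style conversation into chunks that fit a maximum token budget.
def split_conversation(turns, tokenizer, max_len=2048):
    chunks, current, current_len = [], [], 0
    for turn in turns:
        n_tokens = len(tokenizer(turn["value"]).input_ids)
        if current and current_len + n_tokens > max_len:
            chunks.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += n_tokens
    if current:
        chunks.append(current)
    return chunks

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed tokenizer
conversation = [{"from": "human", "value": "Hi!"}, {"from": "gpt", "value": "Hello!"}]
print(len(split_conversation(conversation, tokenizer)))  # -> 1 chunk for this toy input
```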

GPT4All

GPT4ALL collected roughly one million prompt-response pairs using the GPT-3.5-Turbo OpenAI API between March 20 and March 26, 2023. To do this, they first gathered a diverse sample of questions/prompts by leveraging three publicly available datasets:

  • The unified chip2 subset of LAION OIG
  • Coding questions with a random sub-sample of Stackoverflow Questions
  • Instruction-tuning prompts from a sub-sample of Bigscience/P3, with substantial attention dedicated to data preparation and curation based on commentary in the Alpaca project.

Upon collection of the initial dataset of prompt-generation pairs, they loaded the data into Atlas for curation and cleaning. With Atlas, they removed all examples where GPT-3.5-Turbo failed to respond to prompts or produced malformed output. This reduced the total number of examples to 806,199 high-quality prompt-generation pairs. Interestingly, the developers decided to remove the entire Bigscience/P3 subset from the final training dataset due to its very low output diversity; P3 contains many homogeneous prompts which produce short and homogeneous responses from GPT-3.5-Turbo. This exclusion produces a final subset containing 437,605 prompt-generation pairs. The curated datasets are released by the GPT4All project.
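A hypothetical sketch of the filtering step described above, dropping pairs where the model failed to respond or returned malformed output. The file name and the "prompt"/"response" field names are assumptions for illustration; the actual curation was done interactively in Atlas.

```python
import json

# Keep only pairs with a non-empty, well-formed string response.
def is_valid(pair):
    response = pair.get("response")
    return isinstance(response, str) and bool(response.strip())

# "prompt_generation_pairs.jsonl" is a hypothetical file name for illustration.
with open("prompt_generation_pairs.jsonl", "r", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]

cleaned = [p for p in pairs if is_valid(p)]
print(f"kept {len(cleaned)} of {len(pairs)} pairs")
```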

LMFlow

LMFlow is a recent open-source toolkit for finetuning and inference of LLMs. They release a natural-instructions dataset, unfortunately without a detailed description of how it was collected. Additionally, they reformat three public medical-QA datasets, PubMedQA (ID), MedMCQA (ID), and MedQA-USMLE (OOD), for a domain-specific finetuned LLaMA, which also achieves competitive performance compared to ChatGPT.

Dolly

Databricks’ Dolly is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. Based on pythia-12b, Dolly is trained on ~15K instruction/response fine-tuning records generated by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA and summarization. dolly-v2-12b is not a state-of-the-art model, but does exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.

The model is available on Hugging Face as databricks/dolly-v2-12b.
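A hedged usage sketch following the Hugging Face pipeline pattern: trust_remote_code loads Dolly's custom instruction pipeline, device_map="auto" requires accelerate, and a GPU with enough memory for a 12B model is assumed. Check the model card for the exact, up-to-date snippet.

```python
import torch
from transformers import pipeline

# Load Dolly v2 through the standard pipeline API (requires accelerate for device_map).
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # pulls in Dolly's custom instruction-following pipeline
    device_map="auto",
)

res = generate_text("Explain the difference between nuclear fission and fusion.")
print(res)  # the custom pipeline returns the generated response
```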

Pros (InstructionWild)

  • The InstructionWild dataset improves the model's ability on Generation, Open QA, and mind-storm (brainstorming) instructions. This corresponds to its data collection process: the data was collected from Twitter, where users tend to share interesting prompts, mostly of the generation, open-QA, and mind-storm types.

Limitations for LLaMA-finetuned models

  • Both Alpaca and ColossalChat are based on LLaMA. It is hard to compensate for the missing knowledge in the pre-training stage.
  • Lack of counting ability: Cannot count the number of items in a list.
  • Lack of logic (reasoning and calculation).
  • Tend to repeat the last sentence (fail to produce the end token).
  • Poor multilingual results: LLaMA is mainly trained on English datasets (Generation performs better than QA).

Limitations of dataset

  • Lack of summarization ability: no such instructions in the finetuning datasets.
  • Lack of multi-turn chat and role-playing: no such instructions in the finetuning datasets.
  • Lack of self-recognition: no such instructions in the finetuning datasets.
  • Lack of safety:
    • When the input contains fake facts, the model makes up false facts and explanations.
    • Cannot abide by OpenAI's policy: because prompts and responses are generated via the OpenAI API, which always abides by its policy, no violation cases appear in the datasets, so the finetuned model never learns how to handle them.

A Preliminary Mindmap

```mermaid
mindmap
  root((mindmap))
    CHATGPT GENERATED INSTRUCTIONS
        SEED TASKS
            SELF INSTRUCT
            ALPACA
            WILD INSTRUCT
        MORE TASK TYPES
            GPT4ALL
                LAION OIG
                STACKOVERFLOW
                BIGSCIENCE
    RWD
        HUMAN INTERACTIONS
            VICUNA
            SHARE GPT 90K
        ACROSS DOMAINS
            LMFLOW
                NATURAL INSTRUCTIONS
                MEDICAL QA
            OTHERS
```
