maxserras/burra

Repository hosting the code for the semi-automatic cleaning of instruction datasets for Alpaca ES.


Burra - Somos NLP Hackathon 2023

This repo contains the code written by @zolastro, @luisetex, @coronadoh and @maxserras for their participation in the Somos NLP Hackathon 2023.

The main objective of this project was to clean and improve the Somos alpaca-es dataset in order to instruction-tune LLaMA in Spanish.

Context

One of the goals of the Somos NLP Hackathon 2023 was to train an Alpaca model in Spanish. To that end, the organizers translated the original Alpaca dataset, which can be found here.

This corpus had multiple issues, such as:

  • Hallucinations caused by unprocessable inputs (images, URLs, etc.)
  • Inconsistencies, such as empty outputs
  • Wrong answers (mostly to math problems)

In addition, almost no evaluation of bias, hate speech, etc. had been done on this corpus.

On top of that, when the corpus was translated into Spanish, we found other issues, such as:

  • Inconsistencies in the translation
  • Sentences that were not translated
  • Inability to propagate labels from the original EN corpus due to the lack of alignment
  • Mixed sentences

The goal of this project was therefore to clean and improve the dataset so that it could be used to train a LLaMA model in Spanish.

Outcomes

The main outcomes delivered by the Burra team are:

  • A SetFit model to detect unprocessable samples (images, URLs, etc.); a sketch of the approach follows this list
    • The same model built on top of the paraphrase-multilingual-mpnet-v2 model; see the Model Card
  • A LangID filtering algorithm to detect samples that were not correctly translated into Spanish (also sketched below)
  • An alignment algorithm between the Alpaca EN and Alpaca ES datasets
  • Evaluation of a bias detection algorithm and a hate speech detection algorithm over the Alpaca EN dataset, and propagation of this information to the Alpaca ES dataset
  • The resulting dataset, available in a Hugging Face Space and on HF Datasets
  • This repo, containing the experiments and code used
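
As an illustration, here is a minimal sketch of how such an unprocessable-sample detector can be trained with the setfit library. This is not the repo's actual training script: the labelled examples are invented, and the checkpoint name assumes the full sentence-transformers identifier of the backbone mentioned above.

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Invented toy data: 1 = unprocessable (refers to an image/URL the model
# cannot see), 0 = processable.
train_ds = Dataset.from_dict({
    "text": [
        "Describe la imagen adjunta.",
        "Resume el contenido de https://example.com/articulo",
        "Escribe un poema sobre el mar.",
        "Traduce al inglés: 'buenos días'.",
    ],
    "label": [1, 1, 0, 0],
})

# Assumed full checkpoint name for the backbone referenced above.
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,  # contrastive pairs generated per sample
)
trainer.train()

# Predict on new instructions: expected output along the lines of [1, 0].
print(model.predict(["¿Qué ves en esta foto?", "Suma 2 y 3."]))
```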
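
The LangID filtering idea can likewise be sketched with the langdetect package. This is one possible implementation under that assumption, not necessarily the exact filter used in this repo:

```python
from langdetect import DetectorFactory, LangDetectException, detect

DetectorFactory.seed = 0  # make langdetect deterministic across runs

def is_spanish(text: str) -> bool:
    """Return True if the text is detected as Spanish."""
    try:
        return detect(text) == "es"
    except LangDetectException:
        # Empty or undecidable inputs are treated as not Spanish.
        return False

# Keep only samples whose output was actually translated into Spanish.
samples = [
    {"output": "El sol sale por el este."},
    {"output": "The sun rises in the east."},
]
kept = [s for s in samples if is_spanish(s["output"])]
print(len(kept))  # 1
```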

Other experiments & Future work

In addition, we tried some other lines of work and research that we would like to share:

  1. Using ChatGPT 3.5 to identify inconsistent samples (a rough sketch follows this list):
  • We tried some prompt engineering, which can be found in the src/prompting folder.
  • Some results are available in the HF Argilla Space under the name "somos-alpaca-es-analysis-chatgpt3_5_turbo".
  2. First trials of fine-tuning the Bertin LLaMA: `src/llama/fine_tune_llama.ipynb`
    • Disclaimer: we dedicated almost zero time to this, so we cannot guarantee that either the script or the results are correct.
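
As a rough illustration of the ChatGPT consistency check, here is a minimal sketch using the pre-1.0 openai Python client. The prompt below is invented for the example; the real prompts live in src/prompting.

```python
import json

import openai

# Token taken from the config file described in the usage section below.
with open("config/envs.json") as f:
    openai.api_key = json.load(f)["OPENAI_TOKEN"]

# Illustrative prompt only; the actual prompts are in src/prompting.
PROMPT = (
    "Given the following sample from an instruction dataset, answer YES if "
    "the output is consistent with the instruction and input, and NO "
    "otherwise.\n\n"
    "Instruction: {instruction}\nInput: {input}\nOutput: {output}"
)

def is_consistent(sample: dict) -> bool:
    """Ask gpt-3.5-turbo whether a sample looks consistent."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(**sample)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().upper()
    return answer.startswith("YES")
```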

Other tasks remain as future work:

  • Generate more training samples using other openly licensed LLMs
  • Improve the prompting
  • Train self-correction / reflection over the Alpaca ES dataset
  • Train another SetFit model for math-problem detection

How to use this repo

Before running anything, export the PYTHONPATH:

export PYTHONPATH=$PYTHONPATH:$(pwd)/src

Installation works as usual:

pip install -r requirements.txt

Then, create config/envs.json with the required environment variables:

- HF_TOKEN: token used to access Hugging Face datasets, push models, etc.
- HUB_DATASET_NAME: name of the HF dataset to sync changes to
- OPENAI_TOKEN: token used by the OpenAI client
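
For example (all values below are placeholders, not real credentials or dataset names):

```json
{
  "HF_TOKEN": "hf_xxxxxxxxxxxxxxxx",
  "HUB_DATASET_NAME": "your-user/your-dataset",
  "OPENAI_TOKEN": "sk-xxxxxxxxxxxxxxxx"
}
```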

All the entrypoints then follow the same logic:

$ python main.py "command"

where the command can be one of:

- "train-unprocessable-setfit": train the SetFit model to detect unprocessable samples
- "predict-unprocessable-setfit": predict the unprocessable samples using the SetFit model
- "train-predict-setfit": train and predict the unprocessable samples using the SetFit model
- "save": save the progress from the Argilla Space to the HF Dataset
- "align": align the Alpaca EN dataset with the Alpaca ES dataset
- "enrich": enrich the Alpaca ES dataset with the results of the Bias Detection and Hate Speech Detection models
