GitHub - nicolay-r/arekit-ss: Low Resource Context Relation Sampler for contexts with relations for fact-checking and fine-tuning your LLM models, powered by AREkit

arekit-ss 0.24.0

arekit-ss [AREkit double "s"] is an object-pair context sampler for data sources, powered by AREkit.

NOTE: For custom text sampling, please follow the ARElight project.

Installation

Install dependencies:

pip install git+https://github.com/nicolay-r/arekit-ss.git@0.24.0

Download the AREkit-related data required by the sources:

python -m arekit.download_data

Usage

Example of composing prompts:

python -m arekit_ss.sample --writer csv --source rusentrel --sampler prompt \
  --prompt "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'" \
  --dest_lang en --docs_limit 1
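
To preview what one rendered sample could look like, the template from the command above can be filled in with Python's built-in `str.format`. This is only a sketch: the helper name `render_prompt` and the field values are made up for illustration, and arekit-ss may perform the substitution differently internally.

```python
# Template copied from the command above.
PROMPT = "For text: '{text}', the attitude between '{s_val}' and '{t_val}' is: '{label_val}'"


def render_prompt(template, text, s_val, t_val, label_val):
    """Fill the four documented placeholders using plain str.format."""
    return template.format(text=text, s_val=s_val, t_val=t_val, label_val=label_val)


# Hypothetical sample values, not taken from any real source.
sample = render_prompt(
    PROMPT,
    text="Alice praised Bob at the meeting",
    s_val="Alice",
    t_val="Bob",
    label_val="positive",
)
print(sample)
```

Checking a template this way before a full run can catch a misspelled placeholder early, since `str.format` raises `KeyError` for an unknown field name.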

Mind the case (issue #18): switching to another language may affect the amount of extracted data, because the terms_per_context parameter crops the context to a fixed, predefined number of words.
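
The effect can be illustrated with a toy model. The sketch below assumes that a pair is kept only when the number of whitespace-separated words between the two objects does not exceed the limit, and that the count is taken on the translated text. Both are simplifying assumptions; the helpers `terms_between` and `fits` are hypothetical and do not reproduce AREkit's actual tokenization or cropping.

```python
def terms_between(text, source, target):
    """Count whitespace-separated terms strictly between the first
    mentions of `source` and `target` (toy tokenization)."""
    tokens = text.split()
    lo, hi = sorted((tokens.index(source), tokens.index(target)))
    return hi - lo - 1


def fits(text, source, target, terms_per_context):
    """Assumed rule: the pair is kept if the gap is within the limit."""
    return terms_between(text, source, target) <= terms_per_context


# The same statement in two languages (illustrative sentences).
ru = "Россия раскритиковала санкции США"
en = "Russia criticized the sanctions of the USA"

limit = 3  # hypothetical terms_per_context value
print(terms_between(ru, "Россия", "США"), fits(ru, "Россия", "США", limit))
print(terms_between(en, "Russia", "USA"), fits(en, "Russia", "USA", limit))
```

Under these assumptions the Russian sentence passes the limit while its English translation does not, because the translation puts more words between the same two objects.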

Parameters

source -- source name from the list of the supported sources.
- terms_per_context -- number of words (terms) between the SOURCE and TARGET objects.
- object-source-types -- filter specific source object types
- object-target-types -- filter specific target object types
- relation_types -- list of relation types, separated by the | character; all types by default
- splits -- manual selection of the data splits to sample from; split names are separated by the ':' sign, for example: 'train:test'
sampler -- List of the supported samplers:
- nn -- CNN/LSTM architecture related, including frames annotation from RuSentiFrames.
  - no-vectorize -- flag applicable only to nn; it disables generating embeddings for features
- bert -- BERT-based, single-input sequence.
- prompt -- prompt-based sampler for LLM systems [prompt engineering guide]
  - prompt -- text of the prompt which includes the following parameters:
    - {text} is the original text of the sample
    - {s_val} and {t_val} are the values of the source and target of the pair, respectively
    - {label_val} is the value of the label
writer -- the output format of samples:
- csv -- for the AREnets framework;
- jsonl -- for the OpenNRE framework;
- sqlite -- SQLite-3.0 database.
mask_entities -- mask entity mode.
Text translation parameters:
- src_lang -- original language of the text.
- dest_lang -- target language of the text.
output_dir -- target directory for samples storing
Limiting the amount of documents from source:
- docs_limit -- number of documents to be considered for sampling from the whole source.
- doc_ids -- list of the document IDs.
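
As a sketch of how these parameters might be combined from a script, the helper below assembles an argument list for `python -m arekit_ss.sample`. The flags `--writer`, `--source`, `--sampler` and `--docs_limit` appear in the usage example; the spellings `--splits` and `--relation_types` are assumptions derived from the parameter names above, and `TYPE_A`/`TYPE_B` are placeholder relation types. The helper `build_command` itself is hypothetical.

```python
import shlex


def build_command(source, sampler, writer,
                  splits=None, relation_types=None, docs_limit=None):
    """Assemble an argv list for the sampler CLI (illustrative helper)."""
    cmd = ["python", "-m", "arekit_ss.sample",
           "--writer", writer, "--source", source, "--sampler", sampler]
    if splits:
        # Split names are joined with ':' as documented, e.g. 'train:test'.
        cmd += ["--splits", ":".join(splits)]  # flag spelling is an assumption
    if relation_types:
        # Relation types are joined with the '|' character.
        cmd += ["--relation_types", "|".join(relation_types)]  # assumption
    if docs_limit is not None:
        cmd += ["--docs_limit", str(docs_limit)]
    return cmd


cmd = build_command("rusentrel", "bert", "jsonl",
                    splits=["train", "test"],
                    relation_types=["TYPE_A", "TYPE_B"],  # placeholder names
                    docs_limit=1)
# shlex.join quotes arguments that are unsafe for a shell, such as those with '|'.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting altogether. When typing the command by hand, a value containing `|` has to be quoted, otherwise the shell treats it as a pipe.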

Powered by

AREkit framework

