Merge branch 'master' into iter_archive
severo committed Oct 14, 2021
2 parents 5528318 + f839338 commit d860b49
Showing 59 changed files with 5,720 additions and 232 deletions.
1 change: 1 addition & 0 deletions .circleci/deploy.sh
@@ -34,6 +34,7 @@ deploy_doc "master" master

# Example of how to deploy a doc on a certain commit (the commit doesn't have to be on the master branch).
# The following commit would live on huggingface.co/docs/datasets/v1.0.0
deploy_doc "38ec259" v1.13.0
deploy_doc "2c1fc9c" v1.12.1
deploy_doc "c65dccc" v1.12.0
deploy_doc "ea7f0b8" v1.11.0
31 changes: 31 additions & 0 deletions .github/workflows/test-audio.yml
@@ -0,0 +1,31 @@
name: Test audio

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.6"
      - name: Install OS dependencies
        run: |
          sudo apt-get update
          sudo apt-get install libsndfile1 sox
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install .[tests,audio]
          pip install pyarrow --upgrade
      - name: Test audio with pytest
        run: |
          HF_SCRIPTS_VERSION=master python -m pytest -n 2 -sv ./tests/features/test_audio.py
8 changes: 4 additions & 4 deletions datasets/biosses/biosses.py
@@ -67,8 +67,8 @@ class Biosses(datasets.GeneratorBasedBuilder):
    def _info(self):
        features = datasets.Features(
            {
-               "sentence 1": datasets.Value("string"),
-               "sentence 2": datasets.Value("string"),
+               "sentence1": datasets.Value("string"),
+               "sentence2": datasets.Value("string"),
                "score": datasets.Value("float32"),
            }
        )
@@ -93,7 +93,7 @@ def _generate_examples(self, filepath):
        df = pd.read_csv(filepath, sep="\t", encoding="utf-8")
        for idx, row in df.iterrows():
            yield idx, {
-               "sentence 1": row["sentence1"],
-               "sentence 2": row["sentence2"],
+               "sentence1": row["sentence1"],
+               "sentence2": row["sentence2"],
                "score": row["score"],
            }
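The renamed keys can be exercised with a small stand-alone sketch of the generator logic. The TSV content below is made-up stand-in data, not the real BIOSSES file, and `generate_examples` here is a simplified stdlib-only mirror of the builder's `_generate_examples` (which uses pandas):

```python
# Minimal sketch: the generator now yields "sentence1"/"sentence2" (no spaces),
# matching the TSV column names. Hypothetical data, stdlib csv instead of pandas.
import csv
import io

TSV = "sentence1\tsentence2\tscore\nA cell divides.\tA cell splits.\t3.8\n"

def generate_examples(fileobj):
    reader = csv.DictReader(fileobj, delimiter="\t")
    for idx, row in enumerate(reader):
        yield idx, {
            "sentence1": row["sentence1"],
            "sentence2": row["sentence2"],
            "score": float(row["score"]),
        }

examples = dict(generate_examples(io.StringIO(TSV)))
print(sorted(examples[0]))  # keys no longer contain spaces
```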
2 changes: 1 addition & 1 deletion datasets/biosses/dataset_infos.json
@@ -1 +1 @@
{"default": {"description": "BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The sentence pairs were evaluated by five different human experts that judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent).\n", "citation": "@article{souganciouglu2017biosses,\n title={BIOSSES: a semantic sentence similarity estimation system for the biomedical domain},\n author={So{\\u{g}}anc{\\i}o{\\u{g}}lu, Gizem and {\\\"O}zt{\\\"u}rk, Hakime and {\\\"O}zg{\\\"u}r, Arzucan},\n journal={Bioinformatics},\n volume={33},\n number={14},\n pages={i49--i58},\n year={2017},\n publisher={Oxford University Press}\n}\n", "homepage": "https://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html", "license": "BIOSSES is made available under the terms of The GNU Common Public License v.3.0.\n", "features": {"sentence 1": {"dtype": "string", "id": null, "_type": "Value"}, "sentence 2": {"dtype": "string", "id": null, "_type": "Value"}, "score": {"dtype": "float32", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "biosses", "config_name": "default", "version": {"version_str": "0.0.0", "description": null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 32783, "num_examples": 100, "dataset_name": "biosses"}}, "download_checksums": {"https://raw.githubusercontent.com/Markus-Zlabinger/ssts/fce78a649ab90269950aaf44ce20a36e94409392/data/biosses/all.tsv": {"num_bytes": 36324, "checksum": "e0f7b235e4bc9a76ad4bd170bf0da2f449ec6ea677a9a4b5dcb7be6687775906"}}, "download_size": 36324, "post_processing_size": null, "dataset_size": 32783, "size_in_bytes": 69107}}
{"default": {"description": "BIOSSES is a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The sentence pairs were evaluated by five different human experts that judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent).\n", "citation": "@article{souganciouglu2017biosses,\n title={BIOSSES: a semantic sentence similarity estimation system for the biomedical domain},\n author={So{\\u{g}}anc{\\i}o{\\u{g}}lu, Gizem and {\\\"O}zt{\\\"u}rk, Hakime and {\\\"O}zg{\\\"u}r, Arzucan},\n journal={Bioinformatics},\n volume={33},\n number={14},\n pages={i49--i58},\n year={2017},\n publisher={Oxford University Press}\n}\n", "homepage": "https://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html", "license": "BIOSSES is made available under the terms of The GNU Common Public License v.3.0.\n", "features": {"sentence1": {"dtype": "string", "id": null, "_type": "Value"}, "sentence2": {"dtype": "string", "id": null, "_type": "Value"}, "score": {"dtype": "float32", "id": null, "_type": "Value"}}, "post_processed": null, "supervised_keys": null, "task_templates": null, "builder_name": "biosses", "config_name": "default", "version": {"version_str": "0.0.0", "description": null, "major": 0, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 32783, "num_examples": 100, "dataset_name": "biosses"}}, "download_checksums": {"https://raw.githubusercontent.com/Markus-Zlabinger/ssts/fce78a649ab90269950aaf44ce20a36e94409392/data/biosses/all.tsv": {"num_bytes": 36324, "checksum": "e0f7b235e4bc9a76ad4bd170bf0da2f449ec6ea677a9a4b5dcb7be6687775906"}}, "download_size": 36324, "post_processing_size": null, "dataset_size": 32783, "size_in_bytes": 69107}}
Binary file modified datasets/biosses/dummy/0.0.0/dummy_data.zip
240 changes: 240 additions & 0 deletions datasets/greek_legal_code/README.md
@@ -0,0 +1,240 @@
---
pretty_name: Greek Legal Code
annotations_creators:
- found
language_creators:
- found
languages:
- el
licenses:
- cc-by-4.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- multi-class-classification
- topic-classification
---

# Dataset Card for Greek Legal Code

## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://doi.org/10.5281/zenodo.5528002
- **Repository:** https://github.com/christospi/glc-nllp-21
- **Paper:** TBA
- **Leaderboard:** N/A
- **Point of Contact:** [Christos Papaloukas](mailto:christospap@di.uoa.gr)

### Dataset Summary

Greek_Legal_Code (GLC) is a dataset consisting of approx. 47k legal resources from Greek legislation. The origin of GLC is “Permanent Greek Legislation Code - Raptarchis”, a collection of Greek legislative documents classified into multi-level (from broader to more specialized) categories.

**Topics**

GLC consists of 47 legislative volumes, each corresponding to a main thematic topic. Each volume is divided into thematic sub-categories called chapters, and each chapter in turn breaks down into subjects, which contain the legal resources. There are 389 chapters and 2285 subjects in total, forming an interlinked thematic hierarchy: the upper level (volume) has 47 classes, the next level (chapter) has 389 classes, and the innermost level (subject) has 2285 classes.

GLC classes are divided into three categories at each thematic level: frequent classes, which occur in more than 10 training documents and can be found in all three subsets (training, development and test); few-shot classes, which appear in 1 to 10 training documents and also appear in the development and test sets; and zero-shot classes, which appear in the development and/or test set but not in the training documents.


### Supported Tasks and Leaderboards

The dataset supports:

**Multi-class Text Classification:** Given the text of a document, a model predicts the corresponding class.

**Few-shot and Zero-shot learning:** As already noted, the classes can be divided into three groups: frequent, few-shot, and zero-shot, depending on whether they were assigned to more than 10, fewer than 10 but at least one, or no training documents, respectively.

| Level   | Total | Frequent | Few-Shot (<10) | Zero-Shot |
|---------|-------|----------|----------------|-----------|
| Volume  | 47    | 47       | 0              | 0         |
| Chapter | 389   | 333      | 53             | 3         |
| Subject | 2285  | 712      | 1431           | 142       |
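The grouping above can be sketched as a small helper. The function name and the toy labels are illustrative, not taken from the GLC codebase:

```python
from collections import Counter

def partition_classes(train_labels, all_labels):
    """Split classes into frequent (>10 train docs), few-shot (1-10 train
    docs) and zero-shot (absent from training), following the card's grouping."""
    counts = Counter(train_labels)
    frequent = {c for c in all_labels if counts[c] > 10}
    few_shot = {c for c in all_labels if 1 <= counts[c] <= 10}
    zero_shot = {c for c in all_labels if counts[c] == 0}
    return frequent, few_shot, zero_shot

# Toy example with hypothetical labels:
train = ["a"] * 12 + ["b"] * 3
freq, few, zero = partition_classes(train, {"a", "b", "c"})
print(freq, few, zero)
```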

### Languages

All documents are written in Greek.

## Dataset Structure

### Data Instances


```json
{
  "text": "179. ΑΠΟΦΑΣΗ ΥΠΟΥΡΓΟΥ ΜΕΤΑΦΟΡΩΝ ΚΑΙ ΕΠΙΚΟΙΝΩΝΙΩΝ Αριθ. Β-οικ. 68425/4765 της 2/17 Νοεμ. 2000 (ΦΕΚ Β΄ 1404) Τροποποίηση της 42000/2030/81 κοιν. απόφασης του Υπουργού Συγκοινωνιών «Κωδικοποίηση και συμπλήρωση καν. Αποφάσεων» που εκδόθηκαν κατ’ εξουσιοδότηση του Ν.Δ. 102/73 «περί οργανώσεως των δια λεωφορείων αυτοκινήτων εκτελουμένων επιβατικών συγκοινωνιών». ",
  "volume": 24
}
```

Here, `"volume": 24` is the class index of the volume "ΣΥΓΚΟΙΝΩΝΙΕΣ".

### Data Fields

The following data fields are provided for documents (`train`, `dev`, `test`):

`text`: (**str**) The full content of each document, represented by its `header` and `articles` (i.e., the `main_body`).\
`label`: (**class label**) Depending on the configuration, the volume-, chapter- or subject-level class of the document. For the volume-level configuration, the label is one of: ["ΚΟΙΝΩΝΙΚΗ ΠΡΟΝΟΙΑ",
"ΓΕΩΡΓΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΡΑΔΙΟΦΩΝΙΑ ΚΑΙ ΤΥΠΟΣ",
"ΒΙΟΜΗΧΑΝΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΥΓΕΙΟΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΠΟΛΕΜΙΚΟ ΝΑΥΤΙΚΟ",
"ΤΑΧΥΔΡΟΜΕΙΑ - ΤΗΛΕΠΙΚΟΙΝΩΝΙΕΣ",
"ΔΑΣΗ ΚΑΙ ΚΤΗΝΟΤΡΟΦΙΑ",
"ΕΛΕΓΚΤΙΚΟ ΣΥΝΕΔΡΙΟ ΚΑΙ ΣΥΝΤΑΞΕΙΣ",
"ΠΟΛΕΜΙΚΗ ΑΕΡΟΠΟΡΙΑ",
"ΝΟΜΙΚΑ ΠΡΟΣΩΠΑ ΔΗΜΟΣΙΟΥ ΔΙΚΑΙΟΥ",
"ΝΟΜΟΘΕΣΙΑ ΑΝΩΝΥΜΩΝ ΕΤΑΙΡΕΙΩΝ ΤΡΑΠΕΖΩΝ ΚΑΙ ΧΡΗΜΑΤΙΣΤΗΡΙΩΝ",
"ΠΟΛΙΤΙΚΗ ΑΕΡΟΠΟΡΙΑ",
"ΕΜΜΕΣΗ ΦΟΡΟΛΟΓΙΑ",
"ΚΟΙΝΩΝΙΚΕΣ ΑΣΦΑΛΙΣΕΙΣ",
"ΝΟΜΟΘΕΣΙΑ ΔΗΜΩΝ ΚΑΙ ΚΟΙΝΟΤΗΤΩΝ",
"ΝΟΜΟΘΕΣΙΑ ΕΠΙΜΕΛΗΤΗΡΙΩΝ ΣΥΝΕΤΑΙΡΙΣΜΩΝ ΚΑΙ ΣΩΜΑΤΕΙΩΝ",
"ΔΗΜΟΣΙΑ ΕΡΓΑ",
"ΔΙΟΙΚΗΣΗ ΔΙΚΑΙΟΣΥΝΗΣ",
"ΑΣΦΑΛΙΣΤΙΚΑ ΤΑΜΕΙΑ",
"ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΕΚΠΑΙΔΕΥΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΔΗΜΟΣΙΟ ΛΟΓΙΣΤΙΚΟ",
"ΤΕΛΩΝΕΙΑΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΣΥΓΚΟΙΝΩΝΙΕΣ",
"ΕΘΝΙΚΗ ΑΜΥΝΑ",
"ΣΤΡΑΤΟΣ ΞΗΡΑΣ",
"ΑΓΟΡΑΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΔΗΜΟΣΙΟΙ ΥΠΑΛΛΗΛΟΙ",
"ΠΕΡΙΟΥΣΙΑ ΔΗΜΟΣΙΟΥ ΚΑΙ ΝΟΜΙΣΜΑ",
"ΟΙΚΟΝΟΜΙΚΗ ΔΙΟΙΚΗΣΗ",
"ΛΙΜΕΝΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΑΣΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΠΟΛΙΤΙΚΗ ΔΙΚΟΝΟΜΙΑ",
"ΔΙΠΛΩΜΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΔΙΟΙΚΗΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΑΜΕΣΗ ΦΟΡΟΛΟΓΙΑ",
"ΤΥΠΟΣ ΚΑΙ ΤΟΥΡΙΣΜΟΣ",
"ΕΘΝΙΚΗ ΟΙΚΟΝΟΜΙΑ",
"ΑΣΤΥΝΟΜΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΑΓΡΟΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΕΡΓΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΠΟΙΝΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΕΜΠΟΡΙΚΗ ΝΟΜΟΘΕΣΙΑ",
"ΕΠΙΣΤΗΜΕΣ ΚΑΙ ΤΕΧΝΕΣ",
"ΕΜΠΟΡΙΚΗ ΝΑΥΤΙΛΙΑ",
"ΣΥΝΤΑΓΜΑΤΙΚΗ ΝΟΜΟΘΕΣΙΑ"
]

The label can also be the chapter-level or subject-level class of the document. The chapter-level (389 classes) and subject-level (2285 classes) label lists are omitted here due to their size.
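As a rough sketch, the integer labels map to names by list position, in the spirit of `datasets.ClassLabel`. The three-name list below is a truncated, hypothetical stand-in for the full 47-name list above:

```python
# Illustrative int <-> str label mapping; the real dataset uses a ClassLabel
# feature over all 47 volume names. This truncated list is hypothetical.
names = ["ΚΟΙΝΩΝΙΚΗ ΠΡΟΝΟΙΑ", "ΓΕΩΡΓΙΚΗ ΝΟΜΟΘΕΣΙΑ", "ΣΥΓΚΟΙΝΩΝΙΕΣ"]

def int2str(i):
    """Return the class name for an integer label."""
    return names[i]

def str2int(name):
    """Return the integer label for a class name."""
    return names.index(name)

print(int2str(2))  # last name in this toy list
```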

### Data Splits

| Split       | No. of Documents | Avg. words |
| ----------- | ---------------- | ---------- |
| Train       | 28,536           | 600        |
| Development | 9,511            | 574        |
| Test        | 9,516            | 595        |

## Dataset Creation

### Curation Rationale

The dataset was curated by Papaloukas et al. (2021) to support and encourage further research in NLP for the Greek language.

### Source Data

#### Initial Data Collection and Normalization

The ``Permanent Greek Legislation Code - Raptarchis`` is a thorough catalogue of Greek legislation from the creation of the Greek state in 1834 until 2015. It includes Laws, Royal and Presidential Decrees, Regulations and Decisions, retrieved from the Official Government Gazette, where Greek legislation is published. This collection is one of the official, publicly available sources of classified Greek legislation suitable for classification tasks.

Currently, the original catalogue is publicly offered in MS Word (.doc) format through the portal e-Themis, the legal database and management service of the Ministry of the Interior. E-Themis primarily provides legislation across a multitude of predefined thematic categories, as described in the catalogue, with the main goal of helping users find legislation of interest via the thematic index.

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset does not include personal or sensitive information.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

Papaloukas et al. (2021)

### Licensing Information

[More Information Needed]

### Citation Information

*Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina-Athanasia Pantazi and Manolis Koubarakis.*
*Multi-granular Legal Topic Classification on Greek Legislation.*
*Proceedings of the 3rd Natural Legal Language Processing (NLLP) Workshop, Punta Cana, Dominican Republic, 2021*
```
@inproceedings{papaloukas-etal-2021-glc,
title = "Multi-granular Legal Topic Classification on Greek Legislation",
author = "Papaloukas, Christos and Chalkidis, Ilias and Athinaios, Konstantinos and Pantazi, Despina-Athanasia and Koubarakis, Manolis",
booktitle = "Proceedings of the 3rd Natural Legal Language Processing (NLLP) Workshop",
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "",
url = "https://arxiv.org/abs/2109.15298",
doi = "",
pages = ""
}
```

### Contributions

Thanks to [@christospi](https://github.com/christospi) for adding this dataset.
1 change: 1 addition & 0 deletions datasets/greek_legal_code/dataset_infos.json

Large diffs are not rendered by default.


1 comment on commit d860b49

@github-actions

PyArrow==3.0.0


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008419 / 0.011353 (-0.002934) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003706 / 0.011008 (-0.007302) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029323 / 0.038508 (-0.009185) |
| read_batch_unformated after write_array2d | 0.031520 / 0.023109 (0.008411) |
| read_batch_unformated after write_flattened_sequence | 0.279482 / 0.275898 (0.003584) |
| read_batch_unformated after write_nested_sequence | 0.316590 / 0.323480 (-0.006890) |
| read_col_formatted_as_numpy after write_array2d | 0.007391 / 0.007986 (-0.000595) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004644 / 0.004328 (0.000316) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008523 / 0.004250 (0.004273) |
| read_col_unformated after write_array2d | 0.045000 / 0.037052 (0.007948) |
| read_col_unformated after write_flattened_sequence | 0.286068 / 0.258489 (0.027579) |
| read_col_unformated after write_nested_sequence | 0.324423 / 0.293841 (0.030582) |
| read_formatted_as_numpy after write_array2d | 0.021659 / 0.128546 (-0.106887) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007838 / 0.075646 (-0.067808) |
| read_formatted_as_numpy after write_nested_sequence | 0.234480 / 0.419271 (-0.184792) |
| read_unformated after write_array2d | 0.043084 / 0.043533 (-0.000449) |
| read_unformated after write_flattened_sequence | 0.275165 / 0.255139 (0.020026) |
| read_unformated after write_nested_sequence | 0.318836 / 0.283200 (0.035636) |
| write_array2d | 0.074954 / 0.141683 (-0.066729) |
| write_flattened_sequence | 1.495317 / 1.452155 (0.043163) |
| write_nested_sequence | 1.616519 / 1.492716 (0.123803) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.270083 / 0.018006 (0.252077) |
| get_batch_of_1024_rows | 0.546353 / 0.000490 (0.545863) |
| get_first_row | 0.004564 / 0.000200 (0.004364) |
| get_last_row | 0.000103 / 0.000054 (0.000049) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.032718 / 0.037411 (-0.004693) |
| shard | 0.020614 / 0.014526 (0.006088) |
| shuffle | 0.025222 / 0.176557 (-0.151335) |
| sort | 0.112214 / 0.737135 (-0.624921) |
| train_test_split | 0.025595 / 0.296338 (-0.270743) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.370861 / 0.215209 (0.155652) |
| read 50000 | 3.721305 / 2.077655 (1.643650) |
| read_batch 50000 10 | 1.627367 / 1.504120 (0.123247) |
| read_batch 50000 100 | 1.440308 / 1.541195 (-0.100887) |
| read_batch 50000 1000 | 1.509516 / 1.468490 (0.041026) |
| read_formatted numpy 5000 | 0.332795 / 4.584777 (-4.251982) |
| read_formatted pandas 5000 | 4.412442 / 3.745712 (0.666730) |
| read_formatted tensorflow 5000 | 0.890163 / 5.269862 (-4.379699) |
| read_formatted torch 5000 | 0.826536 / 4.565676 (-3.739141) |
| read_formatted_batch numpy 5000 10 | 0.041796 / 0.424275 (-0.382479) |
| read_formatted_batch numpy 5000 1000 | 0.004837 / 0.007607 (-0.002770) |
| shuffled read 5000 | 0.521316 / 0.226044 (0.295271) |
| shuffled read 50000 | 5.210336 / 2.268929 (2.941407) |
| shuffled read_batch 50000 10 | 2.257490 / 55.444624 (-53.187134) |
| shuffled read_batch 50000 100 | 1.917940 / 6.876477 (-4.958536) |
| shuffled read_batch 50000 1000 | 1.974619 / 2.142072 (-0.167453) |
| shuffled read_formatted numpy 5000 | 0.479769 / 4.805227 (-4.325458) |
| shuffled read_formatted_batch numpy 5000 10 | 0.101900 / 6.500664 (-6.398764) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.051618 / 0.075469 (-0.023851) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.357908 / 1.841788 (-0.483879) |
| map fast-tokenizer batched | 12.169950 / 8.074308 (4.095642) |
| map identity | 23.820521 / 10.191392 (13.629129) |
| map identity batched | 0.682934 / 0.680424 (0.002510) |
| map no-op batched | 0.464809 / 0.534201 (-0.069392) |
| map no-op batched numpy | 0.202114 / 0.579283 (-0.377169) |
| map no-op batched pandas | 0.447935 / 0.434364 (0.013571) |
| map no-op batched pytorch | 0.160318 / 0.540337 (-0.380020) |
| map no-op batched tensorflow | 0.169695 / 1.386936 (-1.217241) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.008628 / 0.011353 (-0.002725) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003664 / 0.011008 (-0.007344) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.028422 / 0.038508 (-0.010086) |
| read_batch_unformated after write_array2d | 0.032455 / 0.023109 (0.009346) |
| read_batch_unformated after write_flattened_sequence | 0.254660 / 0.275898 (-0.021238) |
| read_batch_unformated after write_nested_sequence | 0.294258 / 0.323480 (-0.029222) |
| read_col_formatted_as_numpy after write_array2d | 0.007508 / 0.007986 (-0.000477) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004790 / 0.004328 (0.000461) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008507 / 0.004250 (0.004257) |
| read_col_unformated after write_array2d | 0.044949 / 0.037052 (0.007897) |
| read_col_unformated after write_flattened_sequence | 0.251771 / 0.258489 (-0.006718) |
| read_col_unformated after write_nested_sequence | 0.294005 / 0.293841 (0.000164) |
| read_formatted_as_numpy after write_array2d | 0.021779 / 0.128546 (-0.106767) |
| read_formatted_as_numpy after write_flattened_sequence | 0.007627 / 0.075646 (-0.068020) |
| read_formatted_as_numpy after write_nested_sequence | 0.225486 / 0.419271 (-0.193785) |
| read_unformated after write_array2d | 0.042216 / 0.043533 (-0.001317) |
| read_unformated after write_flattened_sequence | 0.254729 / 0.255139 (-0.000410) |
| read_unformated after write_nested_sequence | 0.273239 / 0.283200 (-0.009961) |
| write_array2d | 0.082798 / 0.141683 (-0.058884) |
| write_flattened_sequence | 1.503944 / 1.452155 (0.051789) |
| write_nested_sequence | 1.640053 / 1.492716 (0.147337) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.436013 / 0.018006 (0.418007) |
| get_batch_of_1024_rows | 0.533820 / 0.000490 (0.533330) |
| get_first_row | 0.093687 / 0.000200 (0.093487) |
| get_last_row | 0.001505 / 0.000054 (0.001450) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.031461 / 0.037411 (-0.005951) |
| shard | 0.019058 / 0.014526 (0.004532) |
| shuffle | 0.027123 / 0.176557 (-0.149434) |
| sort | 0.111707 / 0.737135 (-0.625429) |
| train_test_split | 0.026508 / 0.296338 (-0.269830) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.360658 / 0.215209 (0.145449) |
| read 50000 | 3.604223 / 2.077655 (1.526568) |
| read_batch 50000 10 | 1.557324 / 1.504120 (0.053204) |
| read_batch 50000 100 | 1.370400 / 1.541195 (-0.170795) |
| read_batch 50000 1000 | 1.399101 / 1.468490 (-0.069389) |
| read_formatted numpy 5000 | 0.330974 / 4.584777 (-4.253803) |
| read_formatted pandas 5000 | 4.661074 / 3.745712 (0.915362) |
| read_formatted tensorflow 5000 | 0.905855 / 5.269862 (-4.364007) |
| read_formatted torch 5000 | 0.833279 / 4.565676 (-3.732398) |
| read_formatted_batch numpy 5000 10 | 0.041748 / 0.424275 (-0.382527) |
| read_formatted_batch numpy 5000 1000 | 0.004735 / 0.007607 (-0.002872) |
| shuffled read 5000 | 0.518413 / 0.226044 (0.292369) |
| shuffled read 50000 | 5.181240 / 2.268929 (2.912311) |
| shuffled read_batch 50000 10 | 2.200605 / 55.444624 (-53.244020) |
| shuffled read_batch 50000 100 | 1.844412 / 6.876477 (-5.032065) |
| shuffled read_batch 50000 1000 | 1.872178 / 2.142072 (-0.269894) |
| shuffled read_formatted numpy 5000 | 0.482797 / 4.805227 (-4.322430) |
| shuffled read_formatted_batch numpy 5000 10 | 0.104077 / 6.500664 (-6.396587) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.052933 / 0.075469 (-0.022536) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.362033 / 1.841788 (-0.479755) |
| map fast-tokenizer batched | 12.564077 / 8.074308 (4.489769) |
| map identity | 27.160574 / 10.191392 (16.969182) |
| map identity batched | 0.686919 / 0.680424 (0.006495) |
| map no-op batched | 0.511293 / 0.534201 (-0.022908) |
| map no-op batched numpy | 0.229093 / 0.579283 (-0.350190) |
| map no-op batched pandas | 0.513949 / 0.434364 (0.079585) |
| map no-op batched pytorch | 0.186765 / 0.540337 (-0.353573) |
| map no-op batched tensorflow | 0.208861 / 1.386936 (-1.178075) |
