Commit
Added tool to convert tars to pimarcs
Ran the tool on all test corpora in the test datasets dir. Test pipelines still work, confirming that pimarc corpus reading is working nicely and producing the same results as the old tar reading.
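The equivalence check described above (old tar reading versus the new reading path) could be sketched roughly as follows. This is only an illustration, not the project's test code: the pimarc reader's API is not shown in this commit view, so the "new" side is represented by any iterable of `(name, bytes)` pairs. The function names `iter_tar_documents`, `same_documents`, and `build_tar_in_memory` are hypothetical.

```python
import io
import tarfile


def iter_tar_documents(fileobj):
    """Yield (name, data) for every regular file in a tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


def same_documents(old_docs, new_docs):
    """True if both iterables yield identical (name, data) pairs in the same order."""
    return list(old_docs) == list(new_docs)


def build_tar_in_memory(docs):
    """Demo helper (hypothetical): pack (name, data) pairs into an in-memory tar."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in docs:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    docs = [("a.txt", b"hello"), ("b.txt", b"world")]
    # In a real check, the second argument would come from the converted archive.
    print(same_documents(iter_tar_documents(build_tar_in_memory(docs)), docs))
```

Comparing ordered lists also catches reordering of documents, which matters if downstream pipelines depend on corpus order.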
Showing 73 changed files with 86,650 additions and 0 deletions.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
YZRNSOCIMU 0 23 | ||
HBTLGGCVKT 104 127 | ||
ZIOWPURJRB 257 280 | ||
ODFTGFICEY 337 360 | ||
LZRREJCHPP 522 545 |
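The index lines above appear to hold three whitespace-separated fields: an entry name followed by two integer byte offsets (in this hunk the two offsets always differ by 23). A minimal parsing sketch under that assumption follows. The meaning of each offset is not stated in this view, so the field names `first_offset` and `second_offset` are neutral, hypothetical labels rather than the format's own terms.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    name: str
    first_offset: int   # assumed: a byte offset into the matching archive file
    second_offset: int  # assumed: a second, later byte offset for the same entry


def parse_index_line(line):
    """Split one index line into name and two integer offsets.

    Splitting from the right keeps any spaces inside the name intact.
    """
    name, first, second = line.strip().rsplit(None, 2)
    return IndexEntry(name, int(first), int(second))


def parse_index(text):
    """Parse every non-blank line of an index file's text."""
    return [parse_index_line(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    sample = "YZRNSOCIMU 0 23\nHBTLGGCVKT 104 127\nZIOWPURJRB 257 280\n"
    for entry in parse_index(sample):
        print(entry.name, entry.second_offset - entry.first_offset)
```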
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
KPAHKKKWVT 0 23 | ||
WZEIFIPOKK 104 127 | ||
BGCHCWJAFD 224 247 | ||
YMJPGIOAHG 328 351 | ||
ZMMBCFVOYV 408 431 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
VTEKNLGSQR 0 23 | ||
JOTLWDENTY 96 119 | ||
MJBBQMBDVP 240 263 | ||
JWZQEOQYKZ 280 303 | ||
LMWJZWJDWM 320 343 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
GTOIMDPEKW 0 23 | ||
HQUJFEGDHR 1849 1872 | ||
FLUSDLUITG 2800 2823 | ||
UTADTLWSBD 3789 3812 | ||
HMPSAIEWBR 4254 4277 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
NILGVJJJYV 0 23 | ||
CCNBQAYSMF 124 147 | ||
HOFMAPRCJY 1117 1140 | ||
CABYXXCFJM 2684 2707 | ||
MSNNCDATKD 2794 2817 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
IQJDSJREGL 0 23 | ||
TWFQDTZBHD 545 568 | ||
AISGQHSPQL 1354 1377 | ||
NXXEEHWAWV 2711 2734 | ||
VXRJUKPJEG 3828 3851 |
Binary file not shown.
Binary file renamed
BIN
+2.8 MB
...a/datasets/corpora/ids/data/archive-0.tar → ...a/datasets/corpora/ids/data/archive-0.prc
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 212355 212383 | ||
ep-01-01-17.txt 1017518 1017546 | ||
ep-01-01-18.txt 1884851 1884879 | ||
ep-01-01-31.txt 2314450 2314478 |
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
KUZYBPNLVV 0 23 | ||
PDBOVSCHVV 104 127 | ||
BWMWMWYWZP 273 296 | ||
XVEUCVZTNR 345 368 | ||
TVGUNJTIWM 441 464 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ZMJXPZABOV 0 23 | ||
NLSXQUOACY 153 176 | ||
ZMROJKPGRV 209 232 | ||
ILJMMUKTZQ 257 280 | ||
YPJUZWVHAI 418 441 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
WURADDXADK 0 23 | ||
FMVFPWDSRH 72 95 | ||
OQPJRNARGB 184 207 | ||
CNSUUPNNNB 248 271 | ||
MXYHJSBUNI 409 432 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
VLUGLWYLBD 0 23 | ||
ZJXTADMING 419 442 | ||
DXYKJJGBSS 1324 1347 | ||
OLAIELPEKR 2483 2506 | ||
KDHQMVPYEP 3782 3805 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
DAXDUNAKXW 0 23 | ||
PXZBEJGTDN 1291 1314 | ||
JHCRGTFWBH 1932 1955 | ||
PVASPTSVBT 2167 2190 | ||
LZLNXOVKJQ 3686 3709 |
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
NJHILGGEUQ 0 23 | ||
FYERTBVJIY 1089 1112 | ||
VIPOZPPPGM 1254 1277 | ||
JVAJXJEEHJ 3017 3040 | ||
PVUBUIEPWH 4386 4409 |
Binary file not shown.
Binary file renamed
BIN
+1.91 MB
...rpora/tokenized_longer/data/archive-2.tar → ...sets/corpora/tokenized/data/archive-0.prc
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file renamed
BIN
+1.91 MB
...sets/corpora/tokenized/data/archive-0.tar → ...rpora/tokenized_longer/data/archive-0.prc
Binary file not shown.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-0.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file renamed
BIN
+1.91 MB
...rpora/tokenized_longer/data/archive-0.tar → ...rpora/tokenized_longer/data/archive-1.prc
Binary file not shown.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-1.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file renamed
BIN
+1.91 MB
...rpora/tokenized_longer/data/archive-1.tar → ...rpora/tokenized_longer/data/archive-2.prc
Binary file not shown.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-2.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-3.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-3.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-4.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-4.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-5.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-5.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-6.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-6.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-7.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-7.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-8.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-8.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-9.prc
Large diffs are not rendered by default.
5 changes: 5 additions & 0 deletions
test/data/datasets/corpora/tokenized_longer/data/archive-9.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 145056 145084 | ||
ep-01-01-17.txt 695851 695879 | ||
ep-01-01-18.txt 1287532 1287560 | ||
ep-01-01-31.txt 1579162 1579190 |
Binary file not shown.
Binary file renamed
BIN
+1.88 MB
.../text_corpora/europarl/data/archive-0.tar → .../text_corpora/europarl/data/archive-0.prc
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-01-01-15.txt 0 28 | ||
ep-01-01-16.txt 142457 142485 | ||
ep-01-01-17.txt 683778 683806 | ||
ep-01-01-18.txt 1265116 1265144 | ||
ep-01-01-31.txt 1551495 1551523 |
Binary file renamed
BIN
+1000 KB
...text_corpora/europarl2/data/archive-0.tar → ...text_corpora/europarl2/data/archive-0.prc
Binary file not shown.
5 changes: 5 additions & 0 deletions
test/data/datasets/text_corpora/europarl2/data/archive-0.prci
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
ep-02-01-14.txt 0 28 | ||
ep-02-01-15.txt 2813 2841 | ||
ep-02-01-16.txt 74199 74227 | ||
ep-02-01-17.txt 608743 608771 | ||
ep-02-02-04.txt 831331 831359 |