Added tool to convert tars to pimarcs
Ran the tool on all test corpora in the test datasets dir. Test pipelines still work, confirming that pimarc corpus reading is working nicely and producing the same results as the old tar reading.
markgw committed Mar 31, 2020
1 parent d3a6aa6 commit 838d623
Showing 73 changed files with 86,650 additions and 0 deletions.
2 changes: 2 additions & 0 deletions src/python/pimlico/utils/pimarc/reader.py
@@ -13,6 +13,8 @@ class PimarcReader(object):
    """
    def __init__(self, archive_filename):
        self.archive_filename = archive_filename
        if not archive_filename.endswith(".prc"):
            raise IOError("pimarc files should have the extension '.prc'")
        self.index_filename = "{}i".format(archive_filename)

        self.archive_file = open(self.archive_filename, mode="rb")
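
The extension check and index naming in this hunk can be tried in isolation. This is a minimal sketch, not the project's code: `index_filename_for` is a hypothetical helper name that only mirrors the lines shown above (reject any path not ending in `.prc`, then append `i` to get the `.prci` index path).

```python
def index_filename_for(archive_filename):
    """Return the index path for a pimarc archive (hypothetical helper)."""
    # Same check as in the hunk: only '.prc' archives are accepted
    if not archive_filename.endswith(".prc"):
        raise IOError("pimarc files should have the extension '.prc'")
    # The index sits next to the archive, with 'i' appended: foo.prc -> foo.prci
    return "{}i".format(archive_filename)


print(index_filename_for("archive_00.prc"))
```

Passing a path such as `archive_00.tar` raises `IOError`, so a stale tar path fails early rather than being opened as an archive.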
5 changes: 5 additions & 0 deletions src/python/pimlico/utils/pimarc/tools.py
@@ -59,6 +59,10 @@ def from_tar(opts):
                    print(" Writing {}".format(name))
                    arc.write_file(data, name)

        if opts.delete:
            print("Deleting tar: {}".format(tar_path))
            os.remove(tar_path)


def no_subcommand(opts):
    print("Specify a subcommand: list, ...")
@@ -85,6 +89,7 @@ def run():
    subparser.set_defaults(func=from_tar)
    subparser.add_argument("tars", nargs="+", help="Path to the tar archive(s)")
    subparser.add_argument("--out-path", "-o", help="Directory to output files to. Defaults to same as input")
    subparser.add_argument("--delete", "-d", action="store_true", help="Delete the tar files after creating pimarcs")

    opts = parser.parse_args()
    opts.func(opts)
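
To see how the new `--delete` flag parses alongside the existing arguments, here is a small standalone sketch. The three `add_argument` calls are copied from the hunk above. The `build_parser` function and the subcommand name `fromtar` are assumptions for illustration, since the diff does not show how the subparser is registered.

```python
import argparse


def build_parser():
    """Build a cut-down parser with only the from_tar options (illustrative)."""
    parser = argparse.ArgumentParser(description="Sketch of the tar-to-pimarc options")
    subparsers = parser.add_subparsers(dest="command")
    # "fromtar" is an assumed subcommand name; the diff does not show the real one
    subparser = subparsers.add_parser("fromtar")
    subparser.add_argument("tars", nargs="+", help="Path to the tar archive(s)")
    subparser.add_argument("--out-path", "-o",
                           help="Directory to output files to. Defaults to same as input")
    subparser.add_argument("--delete", "-d", action="store_true",
                           help="Delete the tar files after creating pimarcs")
    return parser


opts = build_parser().parse_args(["fromtar", "a.tar", "b.tar", "-d"])
print(opts.tars, opts.out_path, opts.delete)
```

Because `--delete` uses `action="store_true"`, it defaults to `False`, so tars are only removed when the flag is given explicitly.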
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_list/data/archive_00.prci
@@ -0,0 +1,5 @@
YZRNSOCIMU 0 23
HBTLGGCVKT 104 127
ZIOWPURJRB 257 280
ODFTGFICEY 337 360
LZRREJCHPP 522 545
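
Each line of a `.prci` index, as listed above, holds a name followed by two integers. The diff does not say what the integers mean. The sketch below assumes they are byte offsets into the matching `.prc` file and only parses them. `parse_index_lines` is a hypothetical helper, not part of pimarc.

```python
def parse_index_lines(lines):
    """Parse 'name int int' index lines into (name, first, second) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split from the right so a name containing spaces would stay intact
        name, first, second = line.rsplit(None, 2)
        entries.append((name, int(first), int(second)))
    return entries


sample = ["YZRNSOCIMU 0 23", "HBTLGGCVKT 104 127", "ZIOWPURJRB 257 280"]
for name, first, second in parse_index_lines(sample):
    # Print the gap between the two numbers for each entry
    print(name, first, second, second - first)
```

In this file the two numbers always differ by 23. That would be consistent with a fixed-size block preceding each document's data, but this is an inference from the listing, not something the commit states.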
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_list/data/archive_01.prci
@@ -0,0 +1,5 @@
KPAHKKKWVT 0 23
WZEIFIPOKK 104 127
BGCHCWJAFD 224 247
YMJPGIOAHG 328 351
ZMMBCFVOYV 408 431
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_list/data/archive_02.prci
@@ -0,0 +1,5 @@
VTEKNLGSQR 0 23
JOTLWDENTY 96 119
MJBBQMBDVP 240 263
JWZQEOQYKZ 280 303
LMWJZWJDWM 320 343
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_lists/data/archive_00.prci
@@ -0,0 +1,5 @@
GTOIMDPEKW 0 23
HQUJFEGDHR 1849 1872
FLUSDLUITG 2800 2823
UTADTLWSBD 3789 3812
HMPSAIEWBR 4254 4277
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_lists/data/archive_01.prci
@@ -0,0 +1,5 @@
NILGVJJJYV 0 23
CCNBQAYSMF 124 147
HOFMAPRCJY 1117 1140
CABYXXCFJM 2684 2707
MSNNCDATKD 2794 2817
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/float_lists/data/archive_02.prci
@@ -0,0 +1,5 @@
IQJDSJREGL 0 23
TWFQDTZBHD 545 568
AISGQHSPQL 1354 1377
NXXEEHWAWV 2711 2734
VXRJUKPJEG 3828 3851
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/ids/data/archive-0.prci
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 212355 212383
ep-01-01-17.txt 1017518 1017546
ep-01-01-18.txt 1884851 1884879
ep-01-01-31.txt 2314450 2314478
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_list/data/archive_00.prci
@@ -0,0 +1,5 @@
KUZYBPNLVV 0 23
PDBOVSCHVV 104 127
BWMWMWYWZP 273 296
XVEUCVZTNR 345 368
TVGUNJTIWM 441 464
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_list/data/archive_01.prci
@@ -0,0 +1,5 @@
ZMJXPZABOV 0 23
NLSXQUOACY 153 176
ZMROJKPGRV 209 232
ILJMMUKTZQ 257 280
YPJUZWVHAI 418 441
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_list/data/archive_02.prci
@@ -0,0 +1,5 @@
WURADDXADK 0 23
FMVFPWDSRH 72 95
OQPJRNARGB 184 207
CNSUUPNNNB 248 271
MXYHJSBUNI 409 432
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_lists/data/archive_00.prci
@@ -0,0 +1,5 @@
VLUGLWYLBD 0 23
ZJXTADMING 419 442
DXYKJJGBSS 1324 1347
OLAIELPEKR 2483 2506
KDHQMVPYEP 3782 3805
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_lists/data/archive_01.prci
@@ -0,0 +1,5 @@
DAXDUNAKXW 0 23
PXZBEJGTDN 1291 1314
JHCRGTFWBH 1932 1955
PVASPTSVBT 2167 2190
LZLNXOVKJQ 3686 3709
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/int_lists/data/archive_02.prci
@@ -0,0 +1,5 @@
NJHILGGEUQ 0 23
FYERTBVJIY 1089 1112
VIPOZPPPGM 1254 1277
JVAJXJEEHJ 3017 3040
PVUBUIEPWH 4386 4409
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/corpora/tokenized/data/archive-0.prci
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-3.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-4.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-5.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-6.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-7.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-8.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
12,359 changes: 12,359 additions & 0 deletions test/data/datasets/corpora/tokenized_longer/data/archive-9.prc

Large diffs are not rendered by default.

@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 145056 145084
ep-01-01-17.txt 695851 695879
ep-01-01-18.txt 1287532 1287560
ep-01-01-31.txt 1579162 1579190
Binary file not shown.
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/text_corpora/europarl/data/archive-0.prci
@@ -0,0 +1,5 @@
ep-01-01-15.txt 0 28
ep-01-01-16.txt 142457 142485
ep-01-01-17.txt 683778 683806
ep-01-01-18.txt 1265116 1265144
ep-01-01-31.txt 1551495 1551523
Binary file not shown.
5 changes: 5 additions & 0 deletions test/data/datasets/text_corpora/europarl2/data/archive-0.prci
@@ -0,0 +1,5 @@
ep-02-01-14.txt 0 28
ep-02-01-15.txt 2813 2841
ep-02-01-16.txt 74199 74227
ep-02-01-17.txt 608743 608771
ep-02-02-04.txt 831331 831359
