<?xml version='1.0' ?>
<!--
==== TO DO: ====
Rules for encoding newlines and whitespace in filenames?
Change "algorithm" (crypto hash) to "function"? (Steffen Fritz email to jak)
-->
<!--See http://xml.resource.org/ for formatting tools that can deal with
this RFC2629 (and beyond) XML format.
-->
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY mdash '—' >
<!-- xml.resource.org often not responding?
?? try 168.143.123.173 or 194.146.105.14 ??
-->
<!ENTITY rfc1321 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1321.xml'>
<!ENTITY rfc2119 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'>
<!ENTITY rfc3174 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3174.xml'>
<!ENTITY rfc3629 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3629.xml'>
<!ENTITY rfc3986 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3986.xml'>
<!-- RFC 2119 entities - for convenience -->
<!ENTITY must 'MUST' >
<!ENTITY must-not 'MUST NOT' >
<!ENTITY required 'REQUIRED' >
<!ENTITY shall 'SHALL' >
<!ENTITY shall-not 'SHALL NOT' >
<!ENTITY should 'SHOULD' >
<!ENTITY should-not 'SHOULD NOT' >
<!ENTITY recommended 'RECOMMENDED' >
<!ENTITY may 'MAY' >
<!ENTITY optional 'OPTIONAL' >
<!-- The current bagit version, for convenience. -->
<!ENTITY current-bagit-version '0.98' >
]>
<?xml-stylesheet type="text/xsl" href="rfc2629xslt/rfc2629.xslt" ?>
<?rfc comments="no"?>
<?rfc inline="yes"?>
<?rfc symrefs="yes"?>
<?rfc toc="yes"?>
<!-- The next line is "on" by default, but the makefile turns it off
in favor of the line after it when preparing an IETF draft -->
<!--#if internet-draft dont --> <?rfc private="NDIIPP Content Transfer Project"?> <rfc>
<!-- reducing docName from URL to old-style draft name in hopes of clearing
the automatic I-D submission process; if it works, I'm not sure
automatic submission is worth depriving readers of the full URL -->
<!--#if internet-draft then <rfc ipr="trust200902" docName="draft-A-S-00.txt"> -->
<front>
<title abbrev="BagIt">
The BagIt File Packaging Format (V&current-bagit-version;)
</title>
<author initials="A." surname="Boyko"
fullname="Andy Boyko">
<address>
<postal>
<street>1438 Kingfisher Way</street>
<city>Sunnyvale</city> <region>CA</region>
<code>94087</code>
<country>USA</country>
</postal>
<email>andy@boyko.net</email>
</address>
</author>
<author initials="J." surname="Kunze"
fullname="John A. Kunze">
<organization>
California Digital Library
</organization>
<address>
<postal>
<street>415 20th St, 4th Floor</street>
<city>Oakland</city> <region>CA</region>
<code>94612</code>
<country>US</country>
</postal>
<email>jak@ucop.edu</email>
</address>
</author>
<author initials="J." surname="Littman"
fullname="Justin Littman">
<organization>
Library of Congress
</organization>
<address>
<postal>
<street>101 Independence Avenue SE</street>
<city>Washington</city> <region>DC</region>
<code>20540</code>
<country>USA</country>
</postal>
<email>jlit@loc.gov</email>
</address>
</author>
<author initials="L." surname="Madden"
fullname="Liz Madden">
<organization>
Library of Congress
</organization>
<address>
<postal>
<street>101 Independence Avenue SE</street>
<city>Washington</city> <region>DC</region>
<code>20540</code>
<country>USA</country>
</postal>
<email>emad@loc.gov</email>
</address>
</author>
<author initials="E." surname="Summers"
fullname="Ed Summers">
<organization>
Library of Congress
</organization>
<address>
<postal>
<street>101 Independence Avenue SE</street>
<city>Washington</city> <region>DC</region>
<code>20540</code>
<country>USA</country>
</postal>
<email>ehs@pobox.com</email>
</address>
</author>
<author initials="B." surname="Vargas"
fullname="Brian Vargas">
<address>
<postal>
<street>1354 Quincy St. NW</street>
<city>Washington</city> <region>DC</region>
<code>20011</code>
<country>USA</country>
</postal>
<email>brian@ardvaark.net</email>
</address>
</author>
<date day="20" month="March" year="2014" />
<abstract>
<t>
This document specifies BagIt, a hierarchical file packaging format for
storage and transfer of arbitrary digital content. A "bag" has just enough
structure to enclose descriptive "tags" and a "payload" but
does not require knowledge of the payload's internal semantics. This
BagIt format should be suitable for disk-based or network-based storage and
transfer.
</t>
</abstract>
</front>
<middle>
<section title="Introduction">
<section title="Purpose">
<t>
BagIt is a hierarchical file packaging format designed to support
disk-based or network-based storage and transfer of arbitrary digital
content. A bag consists of a "payload" and "tags". The content of the payload
is the custodial focus of the bag and is treated as semantically opaque.
The "tags" are metadata files intended to facilitate and document the storage
and transfer of the bag. The name, BagIt, is inspired by the "enclose and deposit" method
<xref target="ENCDEP" />, sometimes referred to as "bag it and tag it".
</t>
<!-- TODO: Move this section into the Interoperability section. -->
<t>
Implementors of BagIt tools should consider interoperability
between different platforms, operating systems, toolsets, and languages.
Differences in path separators, newline characters, reserved
file names, and maximum path lengths are all possible barriers to
moving bags between different systems. Discussion of these issues may be
found in the Interoperability section of this document.
</t>
</section> <!-- /Purpose -->
<section title="Requirements">
<t>
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in <xref target="RFC2119"/>.
</t>
<t>
An implementation is not compliant if it fails to satisfy one or
more of the MUST or REQUIRED level requirements for the protocols
it implements. An implementation that satisfies all the MUST or
REQUIRED level and all the SHOULD level requirements for its protocols
is said to be "unconditionally compliant"; one that satisfies all
the MUST level requirements but not all the SHOULD level requirements
for its protocols is said to be "conditionally compliant."
</t>
</section> <!-- /Requirements -->
<section title="Terminology">
<t>
This specification uses a number of terms to describe BagIt, some
of which are in common use, some of which are newly defined by this
specification, and others which may have meanings obvious only
to those in the community from which this spec arose. Terms defined
in this section are intended to clarify any ambiguity.
</t>
<t>
<list style="hanging">
<t hangText="bag">
A set of opaque data contained within the structure defined
by this specification.
</t>
<t hangText="bag declaration">
The tag file required to be in all bags conforming to this
specification. Contains tags necessary for bootstrapping the
reading and processing of the rest of a bag. See <xref target="sec-bag-decl"/>.
</t>
<t hangText="bag checksum algorithm">
A reference to a cryptographic checksum algorithm, such as MD5 or
SHA-1, with its name normalized for use in a manifest or tag
manifest file name. See <xref target="bag-checksum-algorithms" />.
</t>
<t hangText="complete">
A bag that contains every element required by this specification,
in which every file listed in a payload or tag manifest is present
and every payload file present is listed in at least one payload manifest. See
<xref target="sec-complete-valid" />.
</t>
<t hangText="payload">
The data encapsulated by the bag. The contents of the payload
are opaque to this specification, and are always considered as a
set of octet streams. See <xref target="sec-payload-dir" />.
</t>
<t hangText="serialized bag">
A bag that has been serialized into a single, monolithic file. See
<xref target="sec-serialization"/>.
</t>
<t hangText="tag directory">
A directory that contains one or more tag files.
</t>
<t hangText="tag file">
A file that contains metadata intended to facilitate and document
the storage and transfer of the bag.
</t>
<t hangText="valid">
A complete bag wherein every checksum in every payload manifest and
tag manifest can be successfully verified against the corresponding
payload or tag file. See <xref target="sec-complete-valid" />.
</t>
</list>
</t>
</section> <!-- /Terminology -->
<!-- TODO -->
<!--
<section title="Overview of Operation">
<t>
</t>
</section>
--> <!-- /Overview of Operation -->
</section> <!-- /Introduction -->
<section title="Structure">
<t>
A bag consists of a base directory containing (1) a set of required
and optional tag files; (2) a sub-directory named "data", called the payload
directory; and (3) a set of optional tag directories. The payload files in the
payload directory are an arbitrary file hierarchy
(see <xref target="sec-payload-dir" />).
The tag files in the base directory consist of one or more files named
"manifest-<spanx style="emph">algorithm</spanx>.txt"
(see <xref target="sec-payload-manifest" />), a file named "bagit.txt"
(see <xref target="sec-bag-decl" />), and zero or more additional tag
files (see <xref target="sec-optional-elements" />). The tag files in the
optional tag directories are arbitrary file hierarchies and the tag directories
&may; have any name that is not reserved for a file or directory in this specification.
</t>
<t>
The base directory &may; have any name.
</t>
<figure>
<artwork>
&lt;base directory&gt;/
|   bagit.txt
|   manifest-&lt;algorithm&gt;.txt
|   [optional additional tag files]
\--- data/
      |   [payload files]
\--- [optional tag directories]/
      |   [optional tag files]
</artwork>
</figure>
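<t>
As an illustration only, the layout above can be produced with a few lines
of Python. This is a minimal sketch, not a reference implementation: the
function name write_minimal_bag and its payload argument are hypothetical,
MD5 is chosen arbitrarily as the bag checksum algorithm, and no optional
tag files or tag directories are written.
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib
import os


def write_minimal_bag(base_dir: str, payload: dict[str, bytes]) -> None:
    """Write bagit.txt, manifest-md5.txt and data/ under base_dir.

    `payload` maps slash-separated names (relative to data/) to file content.
    Hypothetical helper for illustration; it does not handle fetch.txt,
    tag manifests, or file names that need percent-encoding.
    """
    data_dir = os.path.join(base_dir, "data")
    os.makedirs(data_dir, exist_ok=True)

    manifest_lines = []
    for relative_name, content in sorted(payload.items()):
        target = os.path.join(data_dir, *relative_name.split("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as stream:
            stream.write(content)
        checksum = hashlib.md5(content).hexdigest()
        # Manifest paths are relative to the base directory and use '/'.
        manifest_lines.append(f"{checksum} data/{relative_name}\n")

    declaration = "BagIt-Version: 0.98\nTag-File-Character-Encoding: UTF-8\n"
    with open(os.path.join(base_dir, "bagit.txt"), "w",
              encoding="utf-8", newline="\n") as stream:
        stream.write(declaration)
    with open(os.path.join(base_dir, "manifest-md5.txt"), "w",
              encoding="utf-8", newline="\n") as stream:
        stream.writelines(manifest_lines)
```
]]></artwork>
</figure>
<t>
Calling write_minimal_bag("mybag", {"notes/a.txt": b"hello"}) would leave
"bagit.txt", "manifest-md5.txt", and "data/notes/a.txt" beneath "mybag".
</t>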
<section title="Required Elements" anchor="sec-required-elements">
<section title="Bag Declaration: bagit.txt" anchor="sec-bag-decl">
<t>
The "bagit.txt" tag file &must; consist of exactly two lines:
<figure>
<artwork>
BagIt-Version: M.N
Tag-File-Character-Encoding: UTF-8
</artwork>
</figure>
where M.N identifies the BagIt major (M) and minor (N) version numbers,
and UTF-8 identifies the character set encoding of tag files. The bag
declaration &must; be encoded in UTF-8, and &must-not; contain a byte-order
mark (BOM).
<xref target="RFC3629"/>
</t>
<t>
The appropriate version for a bag that conforms to
this version of the specification is "&current-bagit-version;".
</t>
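<t>
A reader of "bagit.txt" might check these rules roughly as follows. This is
a sketch under stated assumptions: the function name parse_bag_declaration
is hypothetical, M and N are assumed to be decimal digit strings, and empty
lines are silently ignored, which is more lenient than the "exactly two
lines" rule strictly allows.
</t>
<figure>
<artwork><![CDATA[
```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
# Assumption: M and N are written as plain decimal digits.
_VERSION_RE = re.compile(r"BagIt-Version: (\d+)\.(\d+)")
_ENCODING_RE = re.compile(r"Tag-File-Character-Encoding: (\S+)")


def parse_bag_declaration(raw: bytes) -> tuple[int, int, str]:
    """Return (major, minor, tag_file_encoding) from the bytes of bagit.txt."""
    if raw.startswith(UTF8_BOM):
        raise ValueError("bagit.txt must not start with a byte-order mark")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if not UTF-8
    # Accept LF, CR, or CRLF terminators; empty pieces are dropped (lenient).
    lines = [line for line in re.split(r"\r\n|\r|\n", text) if line]
    if len(lines) != 2:
        raise ValueError("bagit.txt must consist of exactly two lines")
    version = _VERSION_RE.fullmatch(lines[0])
    encoding = _ENCODING_RE.fullmatch(lines[1])
    if version is None or encoding is None:
        raise ValueError("malformed bag declaration")
    return int(version.group(1)), int(version.group(2)), encoding.group(1)
```
]]></artwork>
</figure>
<t>
The returned encoding name is what a reader would then use to decode the
remaining text tag files.
</t>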
</section> <!-- /Bag Declaration -->
<section title="Payload Directory: data/" anchor="sec-payload-dir">
<t>
The base directory &must; contain a sub-directory named "data", called the
payload directory.
</t>
<t>
The payload directory contains the custodial content within the bag.
The files under the payload directory are called payload files, or
the payload.
The payload is treated as octet streams for all purposes relating to this
specification, and is not otherwise prescribed.
</t>
</section> <!-- /Payload Directory -->
<section title="Payload Manifest: manifest-&lt;alg&gt;.txt" anchor="sec-payload-manifest">
<!-- WARNING: This section should be kept in relative sync with the
section on Tag Manifests.
-->
<t>
A payload manifest is a tag file that lists payload files and checksums for those
payload files generated using a particular bag checksum algorithm.
Every bag &must; contain at least one payload manifest file.
A payload manifest file &must;
have a name of the form manifest-<spanx style="emph">algorithm</spanx>.txt, where
<spanx style="emph">algorithm</spanx> is a string specifying
the bag checksum algorithm used in that manifest, such as:
</t>
<figure>
<artwork>
manifest-md5.txt
manifest-sha1.txt
</artwork>
</figure>
<t>A bag &must-not; contain more than one payload manifest for a particular
bag checksum algorithm.</t>
<t>
Each line of a payload manifest file &must; be of the form:
</t>
<figure>
<artwork>
CHECKSUM FILENAME
</artwork>
</figure>
<t>
where FILENAME is the pathname of a file relative to the base directory
and CHECKSUM is a hex-encoded checksum calculated according to <spanx
style="emph">algorithm</spanx> over every octet in the file. The hex-encoded
checksum &may; use uppercase and/or lowercase letters. The slash
character ('/') &must; be used as a path separator in FILENAME. One
or more linear whitespace characters (spaces or tabs) &must; separate
CHECKSUM from FILENAME. An asterisk ('*') &may; precede FILENAME for
interoperability on some platforms (see <xref target="sec-checksum-tools"
/>). There is no limitation on the length of a pathname. The payload
manifest &must-not; reference files outside the payload directory. If
a FILENAME includes a newline (LF), a carriage return (CR), or a carriage
return plus newline (CRLF), those characters &must; be percent-encoded
<xref target="RFC3986"/>.
</t>
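<t>
The line format above could be parsed along the following lines. This is a
hedged sketch: parse_manifest_line is a hypothetical name, only the
percent-encoded forms of CR and LF are decoded (this specification does not
say whether other percent sequences are to be decoded), and a leading
asterisk is always treated as the optional marker rather than as part of
FILENAME.
</t>
<figure>
<artwork><![CDATA[
```python
import re

# CHECKSUM, one or more spaces or tabs, an optional asterisk, then FILENAME.
_MANIFEST_LINE = re.compile(r"([0-9A-Fa-f]+)[ \t]+\*?(.+)")


def parse_manifest_line(line: str) -> tuple[str, str]:
    """Split one payload manifest line into (checksum, filename).

    `line` must already have its line terminator removed.
    """
    match = _MANIFEST_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed manifest line: {line!r}")
    checksum, filename = match.groups()
    # A payload manifest must not reference files outside data/.
    if not filename.startswith("data/") or ".." in filename.split("/"):
        raise ValueError(f"not a payload path: {filename!r}")
    # Assumption: decode only %0A and %0D, in either letter case.
    filename = re.sub(
        r"%0[AaDd]",
        lambda m: chr(int(m.group(0)[1:], 16)),
        filename,
    )
    # Hex digits may be upper or lower case; normalize for comparison.
    return checksum.lower(), filename
```
]]></artwork>
</figure>
<t>
Because FILENAME is taken as everything after the separator, names that
contain spaces, such as "data/Seed List.txt", are kept intact.
</t>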
<t>
Payload manifests only include the pathnames of files. Because of this,
a payload manifest cannot reference empty directories. To account for
an empty directory, a bag creator may wish to include at least one file
in that directory; it suffices, for example, to include a zero-length
file named ".keep".
</t>
</section> <!-- /Payload Manifest -->
</section> <!-- /Required Elements -->
<section title="Optional Elements" anchor="sec-optional-elements">
<section title="Tag Manifest: tagmanifest-&lt;alg&gt;.txt">
<!-- WARNING: This section should be kept in relative sync with the
section on Payload Manifests.
-->
<t>
A tag manifest is a tag file that lists other tag files and checksums for
those tag files generated using a particular bag checksum algorithm.
A bag &may; contain one or more tag manifests.
A tag manifest file &must; have a name of the form
"tagmanifest-<spanx style="emph">algorithm</spanx>.txt", where
<spanx style="emph">algorithm</spanx> is a string specifying
the bag checksum algorithm used in that manifest, such as:
</t>
<figure>
<artwork>
tagmanifest-md5.txt
tagmanifest-sha1.txt
</artwork>
</figure>
<t>
A tag manifest file has the same form as the payload manifest
file described in <xref target="sec-payload-manifest" />,
but &must-not; list any payload files.
As a result, no FILENAME listed in a tag manifest begins "data/".
</t>
</section> <!-- /Tag Manifest -->
<section title="Bag Metadata: bag-info.txt">
<t>
The "bag-info.txt" file is a tag file that contains metadata elements
describing the bag and the payload. The metadata elements contained in
the "bag-info.txt" file are intended primarily for human readability.
All metadata elements are optional and &may; be repeated. Implementations
&should; assume that the ordering is significant and provide access to the
metadata elements in the order they are given in the "bag-info.txt" file.
</t>
<t>
A metadata element &must; consist of a label, a colon, and a value,
each separated by optional whitespace. It is &recommended; that
lines not exceed 79 characters in length. Long values may be continued
onto the next line by inserting a newline (LF), a carriage return (CR),
or carriage return plus newline (CRLF) and indenting the next line with
linear white space (spaces or tabs).
</t>
<t>
Reserved metadata element names are case-insensitive and defined as follows.
</t>
<t>
<list style="hanging">
<t hangText="Source-Organization">
Organization transferring the content.
</t>
<t hangText="Organization-Address">
Mailing address of the organization.
</t>
<t hangText="Contact-Name">
Person at the source organization who is responsible for the content
transfer.
</t>
<t hangText="Contact-Phone">
International format telephone number of person or position responsible.
</t>
<t hangText="Contact-Email">
Fully qualified email address of person or position responsible.
</t>
<t hangText="External-Description">
A brief explanation of the contents and provenance.
</t>
<t hangText="Bagging-Date">
Date (YYYY-MM-DD) that the content was prepared for delivery.
</t>
<t hangText="External-Identifier">
A sender-supplied identifier for the bag.
</t>
<t hangText="Bag-Size">
Size or approximate size of the bag being transferred, followed
by an abbreviation such as MB (megabytes), GB, or TB; for example,
42600 MB, 42.6 GB, or .043 TB. Compared to Payload-Oxum (described
next), Bag-Size is intended for human consumption.
</t>
<t hangText="Payload-Oxum">
The "octetstream sum" of the payload, namely, a two-part number
of the form "OctetCount.StreamCount", where OctetCount is the
total number of octets (8-bit bytes) across all payload file content
and StreamCount is the total number of payload files. Payload-Oxum
should be included in "bag-info.txt" if at all
possible. Compared to Bag-Size (above), Payload-Oxum is
intended for machine consumption.
</t>
<t hangText="Bag-Group-Identifier">
A sender-supplied identifier for the set, if any, of bags
to which it logically belongs.
This identifier must be unique across the sender's content, and if
recognizable as belonging to a globally unique scheme, the receiver
should make an effort to honor reference to it.
</t>
<t hangText="Bag-Count">
Two numbers separated by "of", in particular, "N of T",
where T is the total number of bags in a group of bags and N is the
ordinal number within the group; if T is not known, specify it as "?"
(question mark). Examples: 1 of 2, 4 of 4, 3 of ?, 89 of 145.
</t>
<t hangText="Internal-Sender-Identifier">
An alternate sender-specific identifier for the content
and/or bag.
</t>
<t hangText="Internal-Sender-Description">
A sender-local prose description of the contents of the
bag.
</t>
</list>
</t>
<t>
In addition to these metadata elements, other arbitrary metadata elements may also be present.
</t>
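<t>
Of the elements above, Payload-Oxum is the one most easily computed by
machine. A minimal sketch, assuming the bag is already on a local
filesystem and using the hypothetical function name payload_oxum:
</t>
<figure>
<artwork><![CDATA[
```python
import os


def payload_oxum(bag_dir: str) -> str:
    """Return "OctetCount.StreamCount" for the payload under bag_dir/data."""
    octet_count = 0
    stream_count = 0
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            octet_count += os.path.getsize(os.path.join(root, name))
            stream_count += 1
    return f"{octet_count}.{stream_count}"
```
]]></artwork>
</figure>
<t>
A receiver can compare this value with the one in "bag-info.txt" as a quick
check before computing any checksums; a mismatch shows that the payload
differs from what was sent, although a match proves nothing about content.
</t>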
<t>
Here is an example "bag-info.txt" file.
<figure>
<artwork>
Source-Organization: Spengler University
Organization-Address: 1400 Elm St., Cupertino, California, 95014
Contact-Name: Edna Janssen
Contact-Phone: +1 408-555-1212
Contact-Email: ej@spengler.edu
External-Description: Uncompressed greyscale TIFF images from the
Yoshimuri papers colle...
Bagging-Date: 2008-01-15
External-Identifier: spengler_yoshimuri_001
Bag-Size: 260 GB
Payload-Oxum: 279164409832.1198
Bag-Group-Identifier: spengler_yoshimuri
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/yoshimuri
Internal-Sender-Description: Uncompressed greyscale TIFFs created
from microfilm and are...
</artwork>
</figure>
</t>
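<t>
A file such as the one above could be read with the following sketch, which
keeps the elements in file order and keeps repeated labels. The function
name parse_bag_info is hypothetical, and joining continuation lines with a
single space is an assumption; this specification does not say how the
pieces of a continued value are to be rejoined.
</t>
<figure>
<artwork><![CDATA[
```python
import re


def parse_bag_info(text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs from decoded bag-info.txt text, in order."""
    elements: list[tuple[str, str]] = []
    for line in re.split(r"\r\n|\r|\n", text):
        if not line.strip():
            continue  # skip empty lines, including the one after the last EOL
        if line[0] in " \t":
            # Continuation line: append to the value of the previous element.
            if not elements:
                raise ValueError("continuation line before any element")
            label, value = elements[-1]
            elements[-1] = (label, (value + " " + line.strip()).strip())
        else:
            label, colon, value = line.partition(":")
            if not colon:
                raise ValueError(f"missing colon: {line!r}")
            elements.append((label.strip(), value.strip()))
    return elements
```
]]></artwork>
</figure>
<t>
A list of pairs is used rather than a dictionary so that ordering and
repetition survive; a caller matching reserved labels would compare them
case-insensitively.
</t>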
</section> <!-- /Bag Metadata -->
<section title="Fetch File: fetch.txt" anchor="sec-fetch-file">
<t>
For reasons of efficiency, a bag &may; be sent with a list of files to be
fetched and added to the payload before it can meaningfully be checked
for completeness. An &optional; tag file named "fetch.txt"
contains such a list. Each line of "fetch.txt" has the form
<figure>
<artwork>
URL LENGTH FILENAME
</artwork>
</figure>
where URL identifies the file to be fetched, LENGTH is the number of
octets in the file (or "-", to leave it unspecified), and FILENAME
identifies the corresponding payload file, relative to the base directory.
The slash character ('/') &must; be used as a path separator in FILENAME.
If FILENAME begins with a slash character, the destination &must; still be
treated as relative to the bag base directory.
One or more linear whitespace characters (spaces or tabs) &must; separate these
three values, and any such characters in the URL &must; be percent-encoded
<xref target="RFC3986"/>. There is no limitation on the length of any
of the fields in the "fetch.txt".
</t>
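<t>
One way to split a "fetch.txt" line into its three fields is sketched
below. The name parse_fetch_line is hypothetical, the URL is returned
without further validation, and LENGTH is assumed to be either "-" or a
string of ASCII digits.
</t>
<figure>
<artwork><![CDATA[
```python
import re
from typing import Optional


def parse_fetch_line(line: str) -> tuple[str, Optional[int], str]:
    """Split one fetch.txt line into (url, length_or_None, filename).

    `line` must already have its line terminator removed.
    """
    # Split on runs of spaces or tabs, at most twice, so that FILENAME
    # may itself contain spaces.
    parts = re.split(r"[ \t]+", line.strip(" \t"), maxsplit=2)
    if len(parts) != 3:
        raise ValueError(f"expected URL LENGTH FILENAME: {line!r}")
    url, length, filename = parts
    if length == "-":
        octets = None  # length left unspecified
    elif re.fullmatch(r"[0-9]+", length):
        octets = int(length)
    else:
        raise ValueError(f"bad LENGTH field: {length!r}")
    # A leading slash still means "relative to the bag base directory".
    return url, octets, filename.lstrip("/")
```
]]></artwork>
</figure>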
<t>
The "fetch.txt" file allows a bag to be transmitted with
"holes" in it, which can be practical for several reasons. For example,
it obviates the need for the sender to stage a large serialized copy of
the content while the bag is transferred to the receiver. Also, this
method allows a sender to construct a bag from components that are either
a subset of logically related components (e.g., the localized logical
object could be much larger than what is intended for export) or
assembled from logically distributed sources (e.g., the object components
for export are not stored locally under one filesystem tree).
</t>
</section> <!-- Fetch File -->
<section title="Other Tag Files" anchor="sec-other-tag-files">
<t>
A bag &may; contain other tag files that are not defined by this
specification.
Implementations &should; ignore the content of any unexpected tag files,
except when they are listed in a tag manifest.
When unexpected tag files are listed in a tag manifest, implementations
&must; only treat the content of those tag files as octet streams for the
purpose of checksum verification.
</t>
</section> <!-- /Other Tag Files -->
</section> <!-- /Optional Elements -->
<section title="Text Tag File Format" anchor="sec-tag-files">
<t>
All tag files specifically described in this specification &must; adhere to
the text tag file format described below. Other tag files &may; adhere to
the text tag file format described below.
</t>
<t>
Text tag files are line-oriented, and each line &must; be
terminated by a newline (LF), a carriage return (CR), or carriage return
plus newline (CRLF).
Text tag files &must; end in the extension ".txt".
</t>
<t>
In all text tag files except for the bag declaration file, text &must; be
encoded in the character encoding specified in the "bagit.txt" bag declaration
file. Text tag files except for the bag declaration file &may; include a
byte-order mark (BOM) only if the specified encoding requires it for
proper decoding. (Note that UTF-8 does not.)
</t>
<t>
As specified in <xref target="sec-bag-decl"/>, the bag declaration
file must be encoded in UTF-8 and must not include a byte-order mark.
</t>
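<t>
The line-termination rule has a practical consequence for implementors:
general-purpose line splitters often recognize more terminators than the
three listed here. The sketch below, using the hypothetical name
read_tag_lines, splits only on LF, CR, and CRLF. It avoids Python's
str.splitlines(), which also breaks on characters such as form feed and
U+2028.
</t>
<figure>
<artwork><![CDATA[
```python
import re


def read_tag_lines(raw: bytes, encoding: str) -> list[str]:
    """Decode a text tag file and return its lines without terminators.

    `encoding` is the value of Tag-File-Character-Encoding from bagit.txt.
    """
    text = raw.decode(encoding)
    # CRLF is listed first so that it is consumed as a single terminator.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # the empty piece after the final terminator
    return lines
```
]]></artwork>
</figure>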
<!-- TODO: Character escaping -->
<!--
<t>
The backslash character ('\', U+005C) escapes from special processing any
</t>
-->
</section> <!-- /Tags Files -->
<section title="Bag Checksum Algorithms" anchor="bag-checksum-algorithms">
<t>
The payload manifest and tag manifests assert integrity of the payload
and tags in a bag using checksum algorithms. The operation
of those algorithms, and the formatting of their output within a manifest
file, are generally beyond the scope of this specification, except that the
output format &must; be able to fit in the manifest format specified in
<xref target="sec-payload-manifest"/>.
</t>
<t>
The name of the checksum algorithm &must; be normalized for use in the
manifest's filename by lowercasing the common name of the algorithm and
removing all non-alphanumeric characters.
</t>
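<t>
The normalization rule can be expressed in a few lines. In this sketch the
function names are hypothetical, and "alphanumeric" is assumed to mean the
ASCII letters and digits.
</t>
<figure>
<artwork><![CDATA[
```python
import re


def normalize_algorithm_name(common_name: str) -> str:
    """Lowercase the name and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", common_name.lower())


def manifest_filename(common_name: str, tag: bool = False) -> str:
    """Build the payload or tag manifest file name for an algorithm."""
    prefix = "tagmanifest" if tag else "manifest"
    return f"{prefix}-{normalize_algorithm_name(common_name)}.txt"
```
]]></artwork>
</figure>
<t>
Under these assumptions "SHA-1" yields "manifest-sha1.txt" and "MD5" with
tag=True yields "tagmanifest-md5.txt".
</t>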
<t>
Implementors of tools that create and validate bags &should; support at
least two widely implemented checksum algorithms: "md5"
<xref target="RFC1321"/> and "sha1" <xref target="RFC3174"/>.
</t>
</section> <!-- /Bag Checksum Algorithms -->
</section> <!-- /Bag Structure -->
<section title="Complete, Incomplete, and Valid bags" anchor="sec-complete-valid">
<t>
A <spanx style="emph">complete</spanx> bag &must; have the following
attributes:
</t>
<t>
<list style="numbers">
<t>Every required element &must; be present
(<xref target="sec-required-elements" />).</t>
<t>Every file in every payload manifest &must; be present.</t>
<t>Every file in every tag manifest &must; be present.
Tag files not listed in a tag manifest &may; be present.</t>
<t>Every payload file &must; be listed in at least one manifest.
Payload files &may; be listed in more than one payload manifest.</t>
<t>Every element present &must; comply with this specification.</t>
</list>
</t>
<t>
A bag is <spanx style="emph">incomplete</spanx> when it exhibits any of
the following exceptions to the attributes of a complete bag:
</t>
<t>
<list style="numbers">
<t>One or more files in any payload manifest are absent.</t>
<t>One or more files in any tag manifest are absent.</t>
<t>A fetch.txt is present. Any files listed in
any payload manifest or any tag manifest which are
absent &must; be listed in the fetch.txt.</t>
</list>
</t>
<t>
A <spanx style="emph">valid</spanx> bag &must; have the following
attributes:
</t>
<t>
<list style="numbers">
<t>The bag &must; be complete.</t>
<t>Every CHECKSUM in every payload manifest and tag manifest
can be successfully verified against the contents of its
corresponding FILENAME.</t>
</list>
</t>
<t>
If a bag is neither valid, complete, nor incomplete, it is
<spanx style="emph">invalid</spanx>. Definitions for the various
ways a bag may be invalid are not covered by this specification.
</t>
<t>
Tag files that do not appear in a tag manifest can be modified, added
to, or removed from a bag without impacting the completeness or validity
of the bag.
</t>
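<t>
Part of these checks can be sketched as follows. The example covers only
one payload manifest: it reports listed files that are absent (a
completeness failure) and files whose checksum does not match (a validity
failure). It does not check tag manifests, "fetch.txt", or payload files
missing from every manifest. The name check_payload_manifest is
hypothetical, the manifest is assumed to be UTF-8, percent-encoded file
names are not decoded, and the normalized algorithm name is assumed to be
one that Python's hashlib accepts (true of "md5" and "sha1").
</t>
<figure>
<artwork><![CDATA[
```python
import hashlib
import os
import re


def check_payload_manifest(bag_dir: str,
                           algorithm: str = "md5") -> dict[str, list[str]]:
    """Return manifest entries that are missing or fail verification."""
    problems: dict[str, list[str]] = {"missing": [], "mismatched": []}
    manifest_path = os.path.join(bag_dir, f"manifest-{algorithm}.txt")
    # newline="" keeps CR, LF and CRLF untouched so we can split on them.
    with open(manifest_path, encoding="utf-8", newline="") as manifest:
        lines = [ln for ln in re.split(r"\r\n|\r|\n", manifest.read()) if ln]

    for line in lines:
        match = re.fullmatch(r"([0-9A-Fa-f]+)[ \t]+\*?(.+)", line)
        if match is None:
            raise ValueError(f"malformed manifest line: {line!r}")
        checksum, filename = match.groups()
        path = os.path.join(bag_dir, *filename.split("/"))
        if not os.path.isfile(path):
            problems["missing"].append(filename)
            continue
        digest = hashlib.new(algorithm)
        with open(path, "rb") as stream:
            # Read in 1 MiB chunks so large payload files fit in memory.
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != checksum.lower():
            problems["mismatched"].append(filename)
    return problems
```
]]></artwork>
</figure>
<t>
Two empty lists mean that this manifest raises no objection; they do not by
themselves establish that the bag is complete or valid.
</t>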
</section> <!-- Completeness and validity -->
<section title="Serialization" anchor="sec-serialization">
<t>
In some scenarios, it may be convenient to serialize the
bag's filesystem hierarchy (i.e., the base directory) into a
single-file archive format such as TAR or ZIP (the serialization) and then
later deserialize the serialization to recreate the filesystem hierarchy.
Several rules govern the serialization of a bag and apply equally
to all types of archive files:
</t>
<t>
<list style="numbers">
<t>
The top-level directory of a serialization &must; contain only one bag.
</t>
<t>
The serialization &should; have the same name as the bag's base directory,
but &must; have an extension added to identify the format. For example, the
receiver of "mybag.tar.gz" expects the corresponding base directory
to be created as "mybag".
</t>
<t>
A bag &must-not; be serialized from within its base directory, but from the
parent of the base directory (where the base directory appears as an
entry). Thus, after a bag is deserialized in an empty directory,
a listing of that directory shows exactly one entry. For example,
deserializing "mybag.zip" in an empty directory causes the creation
of the base directory "mybag" and, beneath "mybag", the creation of
all payload and tag files.
</t>
<t>
The deserialization of a bag &must; produce a single base directory
with the top-level structure described in this specification, without
requiring any additional un-archiving step. For example, after one
un-archiving step it would be an error for the "data/" directory to
appear as "data.tar.gz". TAR and ZIP files may appear inside the payload
beneath the "data/" directory, where they would be treated
as any other payload file.
</t>
</list>
</t>
<t>
When serializing a bag, care must be taken to
ensure that the archive format's restrictions on file naming, such as allowable
characters, length, or character encoding, will support the
requirements of the systems on which it will be used. See
<xref target="sec-interoperability" />.
</t>
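<t>
The naming and directory rules above can be followed with Python's standard
tarfile module, as in this sketch. The function name serialize_bag is
hypothetical, gzip-compressed TAR is only one possible format, and dest_dir
is assumed to lie outside the bag so that the archive does not include
itself.
</t>
<figure>
<artwork><![CDATA[
```python
import os
import tarfile


def serialize_bag(base_dir: str, dest_dir: str) -> str:
    """Write <base directory name>.tar.gz into dest_dir and return its path."""
    base_dir = os.path.abspath(base_dir)  # also drops any trailing separator
    name = os.path.basename(base_dir)
    archive_path = os.path.join(dest_dir, name + ".tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname makes the base directory the single top-level entry,
        # as if the bag had been archived from its parent directory.
        archive.add(base_dir, arcname=name)
    return archive_path
```
]]></artwork>
</figure>
<t>
Extracting the resulting "mybag.tar.gz" in an empty directory yields
exactly one entry, "mybag", with the payload and tag files beneath it.
</t>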
</section> <!-- /Serialization -->
<section title="Examples">
<section title="Example of a basic bag">
<t>
This is the layout of a basic bag containing an image and a companion
OCR file. Lines of file content are shown in parentheses beneath the
file name.
<!-- Note that the artwork looks funky on the version line, to account
for the fact that the entity value is much shorter than the entity
name. -->
<figure>
<artwork>
myfirstbag/
|
| manifest-md5.txt
| (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
| (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
|
| bagit.txt
| (BagIt-Version: 0.96 )
| (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
|
| 27613-h/images/q172.png
| (... image bytes ... )
|
| 27613-h/images/q172.txt
| (... OCR text ... )
....
</artwork>
</figure>
</t>
</section>
<!--
<section title="Optional file metadata">
<t>
The "bag-info.txt", if present, may contain optional file metadata
elements. It describes the bag contributions made by individual
files in the files-<spanx style="emph">algorithm</spanx>.txt
file. File contribution elements have the form
<figure>
<artwork>
file: <filepattern> | <prose description of contribution>
</artwork>
</figure>
where <spanx style="emph">filepattern</spanx> is either a literal
filename exactly as it appears in the manifest or a string (a pattern)
consisting of literal and wildcards '?' (matching any single character)
or '*' (matching any number of characters). For example,
<figure>
<artwork>
file: notes.txt | curatorial decision notes
file: *.lo_res.jpg | low resolution derivative images
file: *.tiff | high resolution master images
</artwork>
</figure>
</t>
</section>
-->
<section title="Another example bag">
<t>
The following example bag contains content from a web crawler.
As before, lines of file content are shown in parentheses beneath the
file name, with long lines continued indented on subsequent lines.
This bag is not complete until every
component listed in the "fetch.txt" file is retrieved.
<figure>
<artwork>
mysecondbag/
|
| manifest-md5.txt
| (93c53193ef96732c76e00b3fdd8f9dd3 data/Collection Overview.txt )
| (e9c5753d65b1ef5aeb281c0bb880c6c8 data/Seed List.txt )
| (61c96810788283dc7be157b340e4eff4 data/gov-20060601-050019.arc.gz)
| (55c7c80c6635d5a4c8fe76a940bf353e data/gov-20060601-100002.arc.gz)
|
| fetch.txt
| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-050019.arc.gz
| 26583985 data/gov-20060601-050019.arc.gz )
| (http://WB20.Stanford.Edu/gov-06-2006/gov-20060601-100002.arc.gz
| 99509720 data/gov-20060601-100002.arc.gz )
| ( ...............................................................)
|
| bag-info.txt
| (Source-organization: California Digital Library )
| (Organization-address: 415 20th St, 4th Floor, Oakland, CA 94612)
| (Contact-name: A. E. Newman )
| (Contact-phone: +1 510-555-1234 )
| (Contact-email: alfred@ucop.edu )
| (External-Description: The collection "Local Davis Flood Control )
| Collection" includes captured California State and local )
| websites containing information on flood control resources for )
| the Davis and Sacramento area. Sites were captured by UC Davis)
| curator Wrigley Spyder using the Web Archiving Service in )
| February 2007 and October 2007. )
| (Bag-date: 2008.04.15 )
| (External-identifier: ark:/13030/fk4jm2bcp )
| (Bag-size: about 22Gb )
| (Payload-Oxum: 21836794142.831 )
| (Internal-sender-identifier: UCDL )
| (Internal-sender-description: UC Davis Libraries )
|
| bagit.txt
| (BagIt-version: 0.96 )
| (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
|
| Collection Overview.txt
| (... narrative description ... )
|
| Seed List.txt
| (... list of crawler starting point URLs ... )
....
</artwork>
</figure>
</t>
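<t>
As a non-normative illustration, a receiver could find out which
components still need to be retrieved with a Python sketch such as the
following. The function name missing_fetch_entries is hypothetical, and
the sketch assumes one entry per line consisting of a URL, a length, and
a path relative to the bag directory, separated by whitespace, as in the
figure above.
</t>
<figure>
<artwork><![CDATA[
```python
from pathlib import Path


def missing_fetch_entries(bag_dir):
    """Return (url, length, path) tuples for files not yet in the bag.

    Hypothetical helper: assumes "<url> <length> <path>" on each line.
    """
    bag = Path(bag_dir)
    fetch = bag / "fetch.txt"
    if not fetch.is_file():
        return []
    missing = []
    for line in fetch.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split twice only, because the path itself may contain spaces.
        url, length, rel_path = line.split(None, 2)
        if not (bag / rel_path).is_file():
            missing.append((url, length, rel_path))
    return missing
```
]]></artwork>
</figure>
<t>
An empty result means that no listed component is absent. It does not
by itself show that the retrieved files match their checksums.
</t>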
</section>
</section> <!-- /Examples -->
<section title="Security Considerations" anchor="sec-security">
<section title="Special directory characters">
<!-- Added by Brian Vargas, 2009-04-09 -->
<t>
This specification does not prohibit special directory characters in the
paths given in the payload manifest, tag manifest, and "fetch.txt" file,
even though such characters might be significant on implementing systems.
Implementors &should; take care that files outside the bag directory
structure are not accessed when reading or writing files based on paths
specified in a bag.
</t>
<t>
For example, path components such as ".." or "~"
in a maliciously crafted "fetch.txt" file might cause a naive implementation to
overwrite critical system files.
</t>
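<t>
As a non-normative illustration, an implementation might resolve every
path taken from a bag against the bag directory and reject any result
that falls outside it. The Python sketch below shows one way to do this.
The function name resolve_inside_bag is hypothetical, and rejecting every
path that begins with "~" is a deliberately conservative assumption.
</t>
<figure>
<artwork><![CDATA[
```python
from pathlib import Path


def resolve_inside_bag(bag_dir, rel_path):
    """Return the absolute path for rel_path, or raise ValueError
    if it would fall outside bag_dir (hypothetical helper)."""
    base = Path(bag_dir).resolve()
    if rel_path.startswith("~"):
        # Conservative assumption: never allow home-directory syntax.
        raise ValueError("path starts with '~': %r" % rel_path)
    # resolve() collapses ".." components and follows symlinks; an
    # absolute rel_path replaces base entirely and is caught below.
    candidate = (base / rel_path).resolve()
    if base not in candidate.parents:
        raise ValueError("path escapes the bag: %r" % rel_path)
    return candidate
```
]]></artwork>
</figure>
<t>
A caller would apply such a check to each path before opening or
writing the file, for example before saving a download named in
"fetch.txt".
</t>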
</section>
<section title="Control of URLs in fetch.txt">
<t>
Implementors of tools that complete bags by retrieving URLs listed in a
"fetch.txt" file need to be aware that some of those URLs may point,
intentionally or unintentionally, to hosts that are not under the control
of the bag's sender. Checksums are intended as a reasonable guarantee against
corruption during transit, not as strong cryptographic protection against
intentional spoofing.
</t>
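<t>
As a non-normative illustration of one possible local policy, a tool
could compare each URL against a list of hosts that the receiver has
agreed to contact. This specification does not define such a policy. The
function name is_url_allowed and its parameters are hypothetical.
</t>
<figure>
<artwork><![CDATA[
```python
from urllib.parse import urlsplit


def is_url_allowed(url, allowed_hosts,
                   allowed_schemes=("http", "https")):
    """Return True if url uses an allowed scheme and host.

    Hypothetical local policy check, not part of the specification.
    """
    parts = urlsplit(url)
    # urlsplit lowercases the scheme and hostname for comparison.
    host = parts.hostname or ""
    hosts = {h.lower() for h in allowed_hosts}
    return parts.scheme in allowed_schemes and host in hosts
```
]]></artwork>
</figure>
<t>
A check like this limits which hosts are contacted. It does not show
that the content a host returns is authentic.
</t>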
</section>
<section title="File sizes in fetch.txt">
<!-- Added by Brian Vargas, 2009-04-09 -->
<t>
The size of a file, as optionally reported in the "fetch.txt" file, cannot be
guaranteed to match the actual size of the file to be downloaded. Implementors &should;
take care to appropriately handle cases where the actual file size does not
match the file size reported in the "fetch.txt" file. Implementors &should-not; use
the file size in the "fetch.txt" file for critical resource allocation, such as
buffer sizing or storage requisitioning.
</t>
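<t>
As a non-normative illustration, a download routine can copy data in
fixed-size chunks, count the bytes actually received, and only then
compare the count with the reported size. The Python sketch below takes
any readable binary stream. The function name copy_with_size_check and
the max_bytes limit are hypothetical. The limit stands for a cap chosen
by the receiver, independent of anything in "fetch.txt".
</t>
<figure>
<artwork><![CDATA[
```python
def copy_with_size_check(stream, out, declared_size=None,
                         max_bytes=None, chunk_size=65536):
    """Copy stream to out and return (received, size_matches).

    size_matches is None when no size was declared. The declared
    size is never used to size a buffer (hypothetical helper).
    """
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        received += len(chunk)
        # Enforce the receiver's own cap, not the sender's claim.
        if max_bytes is not None and received > max_bytes:
            raise ValueError("download exceeds local limit")
        out.write(chunk)
    if declared_size is None:
        return received, None
    return received, received == declared_size
```
]]></artwork>
</figure>
<t>
How a mismatch is handled is left to the caller, which might log it,
retry, or reject the file.
</t>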
</section>
</section> <!-- End Section: Security considerations -->
<section title="Practical Considerations (non-normative)">
<section title="Disk and network transfer">
<t>
When creating a bag on physical media (such as hard disk, CD-ROM, or
DVD) for transfer to another organization, the sender should select
and format the media in a manner compatible with both the content
requirements (e.g., file names and sizes) and the receiver's technical
infrastructure. If the receiver's infrastructure is not known or the
media needs to be compatible with a range of potential receivers,
consideration should be given to portability and common usage. For
example, a "lowest common denominator" for some potential receivers
could be USB disk drives formatted with the FAT32 filesystem.
</t>
<t>
Although overall bag size is unlimited in principle, network-based
transfers may involve constraints on the amount of bag data that a
receiver can receive at one time. It may be practical to split a
large bag into several smaller bags.
</t>
<t>
Transmitting a whole bag in serialized form as a single file will tend