<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!-- One method to get references from the online citation libraries.
There has to be one entity for each item to be referenced.
An alternate method (rfc include) is described in the references. -->
<!ENTITY RFC2119 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY RFC2629 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.2629.xml">
<!ENTITY RFC3552 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.3552.xml">
<!ENTITY RFC4506 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.4506.xml">
<!ENTITY RFC5226 SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml/reference.RFC.5226.xml">
<!ENTITY I-D.narten-iana-considerations-rfc2434bis SYSTEM "http://xml2rfc.tools.ietf.org/public/rfc/bibxml3/reference.I-D.narten-iana-considerations-rfc2434bis.xml">
]>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<!-- used by XSLT processors -->
<!-- For a complete list and description of processing instructions (PIs),
please see http://xml.resource.org/authoring/README.html. -->
<!-- Below are generally applicable Processing Instructions (PIs) that most I-Ds might want to use.
(Here they are set differently than their defaults in xml2rfc v1.32) -->
<?rfc strict="yes" ?>
<!-- give errors regarding ID-nits and DTD validation -->
<!-- control the table of contents (ToC) -->
<?rfc toc="yes"?>
<!-- generate a ToC -->
<?rfc tocdepth="4"?>
<!-- the number of levels of subsections in ToC. default: 3 -->
<!-- control references -->
<?rfc symrefs="yes"?>
<!-- use symbolic references tags, i.e, [RFC2119] instead of [1] -->
<?rfc sortrefs="yes" ?>
<!-- sort the reference entries alphabetically -->
<!-- control vertical white space
(using these PIs as follows is recommended by the RFC Editor) -->
<?rfc compact="yes" ?>
<!-- do not start each main section on a new page -->
<?rfc subcompact="no" ?>
<!-- keep one blank line between list items -->
<!-- end of list of popular I-D processing instructions -->
<rfc category="std" docName="draft-mbenjamin-nfsv4-pnfs-metastripe-03" ipr="trust200902">
<!-- category values: std, bcp, info, exp, and historic
ipr values: full3667, noModification3667, noDerivatives3667
you can add the attributes updates="NNNN" and obsoletes="NNNN"
they will automatically be output with "(if approved)" -->
<!-- ***** FRONT MATTER ***** -->
<front>
<!-- The abbreviated title is used in the page header - it is only necessary if the
full title is longer than 39 characters -->
<title abbrev="pNFS Metastripe">pNFS Metadata Striping</title>
<!-- add 'role="editor"' below for the editors if appropriate -->
<!-- Another author who claims to be an editor -->
<author fullname="Matt Benjamin" initials="M.W.B."
surname="Benjamin">
<organization abbrev="CohortFS, LLC">CohortFS, LLC</organization>
<address>
<postal>
<street>206 S. Fifth Ave, Suite 150</street>
<city>Ann Arbor</city>
<region>MI</region>
<code>48104</code>
<country>USA</country>
</postal>
<phone>+1 734 761 4689</phone>
<email>matt@cohortfs.com</email>
<!-- uri and facsimile elements may also be added -->
</address>
</author>
<author fullname="Casey Bodley"
surname="Bodley">
<address>
<email>casey@cohortfs.com</email>
</address>
</author>
<author fullname="Adam C. Emerson"
surname="Emerson">
<address>
<email>aemerson@cohortfs.com</email>
</address>
</author>
<author fullname="Pranoop Ersani"
surname="Ersani">
<organization abbrev="NetApp">NetApp</organization>
<address>
<email>Pranoop.Erasani@netapp.com</email>
</address>
</author>
<author fullname="Peter Honeyman"
surname="Honeyman">
<address>
<email>peter.honeyman@gmail.com</email>
</address>
</author>
<date year="2014" />
<!-- If the month and year are both specified and are the current ones, xml2rfc will fill
in the current day for you. If only the current year is specified, xml2rfc will fill
in the current day and month for you. If the year is not the current one, it is
necessary to specify at least a month (xml2rfc assumes day="1" if not specified for the
purpose of calculating the expiry date). With drafts it is normally sufficient to
specify just the year. -->
<!-- Meta-data Declarations -->
<area>Transport Area (tsv)</area>
<workgroup>NFSv4</workgroup>
<!-- WG name at the upperleft corner of the doc,
IETF is fine for individual submissions.
If this element is not present, the default is "Network Working Group",
which is used by the RFC Editor as a nod to the history of the IETF. -->
<keyword>NFSv4</keyword>
<keyword>pNFS</keyword>
<keyword>metastripe</keyword>
<!-- Keywords will be incorporated into HTML output
files in a meta tag but they have no effect on text or nroff
output. If you submit your draft to the RFC Editor, the
keywords will be used for the search engine. -->
<abstract>
<t>
This Internet-Draft describes a means to add metadata
striping to pNFS. The text of this draft is substantially
based on prior drafts by Eisler, M., with some departures.
The current draft attempts to define a somewhat lighter-weight
protocol; in particular, it seeks to permit striping for
"filehandle only" operations such as LOCK and OPEN + CLAIM_FH,
without requiring clients to obtain metadata layouts on regular
files. We gratefully acknowledge the primary contributions of
Mike Eisler, Pranoop Ersani, and others.
</t>
</abstract>
<note title="Internet Draft Comments">
<t>
Comments regarding this draft are solicited.
</t>
</note>
</front>
<middle>
<section title="Introduction and Motivation">
<t>
The NFSv4.1 specification describes pNFS <xref
target="NFSv4.1"/>. pNFS distributes (stripes) file data across
multiple storage devices. In NFSv4.1, parallel access is limited to
the data contents of regular files. Metadata is not distributed or
striped: the model presented in the NFSv4.1 specification is that of a
single metadata server. This document describes a means to add
metadata striping to pNFS, which includes the notion of multiple
metadata servers. With metadata striping, multiple metadata
servers may work together to provide higher parallel performance.
</t>
<t>
Two methods are described. The first, called filehandle striping,
directs metadata operations associated with a file handle to a
preferred metadata server. The second, called directory
striping, distributes directory operations across a collection
of metadata servers.
</t>
</section>
<section anchor="Changes" title="Short List of Protocol Changes
from Previous Drafts">
<section anchor="FilehandleStriping" title="File-system wide Striping
for Filehandle-Only
Operations">
<t>
Stripe hints redirect clients to a preferred metadata server
for filehandle-only operations (below), but are backed by a
single layout per file system, rather than per file as in
<xref target="METASTRIPE"/>. The new model is lighter
weight but, because it remains layout-based, it retains the
advantages of pNFS device indirection and garbage
collection.
</t>
</section>
<section anchor="UniformFilehandles" title="Uniform Filehandles">
<t>
<xref target="METASTRIPE"/> offers implementations the
option to propagate layout filehandles for all metadata
layout types. Since it would be impossible to reasonably
support this under the new proposed model for
filehandle-only operations, we propose instead that L-MDS
filehandles always be equivalent to I-MDS filehandles.
</t>
</section>
<section anchor="DeviceModel" title="Simplified Multipath Device
Model">
<t>
<xref target="METASTRIPE"/> defines two different methods
for encoding metadata server locations; only the "simple"
model uses the pNFS device mechanism. In this draft, we
propose a single model based on pNFS devices, in which there
is a one-to-one mapping between devices and L-MDS servers.
This approach facilitates sharing device addresses across
layouts which have servers in common and also minimizes the
difficulty of reclaiming devices no longer in use by any
metadata layout.
</t>
</section>
<section anchor="CookieModel" title="Cookie Model">
<t>
NFSv4 associates with each entry in a directory a unique
value of type cookie4, a 64-bit integer. <xref
target="METASTRIPE"/> involves cookies in stripe selection,
and imposes specific requirements on cookie values. In the
current proposal we treat cookies as opaque values except as
specified in ordinary NFSv4.1. We concur with <xref
target="METASTRIPE"/> that cookies MUST be unique within any
logical directory regardless of the striping pattern. As in
ordinary NFSv4.1, the behavior of READDIR (or PREADDIR,
below) when the cookie has a value previously returned to a
client by the same server, but no longer associated with any
directory entry, is not defined.
</t>
</section>
<section anchor="LayoutCommit" title="LAYOUTCOMMIT">
<t>
In this draft, we introduce layout-subtype specific data
for the LAYOUTCOMMIT operation.
</t>
</section>
<section anchor="Attributes" title="Recommended Attributes">
<t>
We propose two new recommended attributes.
</t>
<t>
<list style="hanging">
<t>
meta_stripe_deviceid (deviceid4)
</t>
<t>
meta_stripe_count (uint32_t)
</t>
</list>
</t>
<section anchor="meta_stripe_deviceid"
title="meta_stripe_deviceid (deviceid4)">
<t>
An attribute of type meta_stripe_deviceid represents a
filehandle stripe hint. This attribute MUST NOT be offered to
clients unless they hold a valid filehandle striping layout on
the containing file system.
</t>
</section>
<section anchor="meta_stripe_count" title="meta_stripe_count
(uint32_t)">
<t>
The meta_stripe_count attribute represents, for directory
objects, the directory's current stripe count, which may
help the client decide if it will request a directory striping
layout on the directory. This attribute MAY be offered only
to clients which hold a filehandle striping layout on the
containing file system.
</t>
</section>
</section>
<section anchor="PREADDIR_OP" title="PREADDIR (Operation)">
<t>
The NFSv4.1 READDIR operation carries insufficient information
to perform all of the enumerations required in the
proposed directory striping model. We propose a new PREADDIR
operation which takes, in addition to all the current
READDIR arguments, a controlling metadata layout
stateid and a stripe number.
</t>
</section>
</section>
<section anchor="Terminology" title="Terminology">
<t>
<list style="hanging">
<t>
Initial Metadata Server (I-MDS). The I-MDS is the
metadata server from which the client obtains a filehandle
prior to acquiring any layout on the file.
</t>
<t>
Layout Metadata Server (L-MDS). The L-MDS is the metadata
server from which the client obtains a filehandle
after redirection from a layout.
</t>
<t>
Regular file. An object of file type NF4REG or
NF4NAMEDATTR.
</t>
<t>
Filehandle striping. Hint-based indirection to a preferred MDS
for filehandle-based operations, backed by a
filesystem-wide metadata layout.
</t>
<t>
Directory striping. Fine-grained, layout-based indirection
for parallel operations on directories, using a striping
pattern.
</t>
</list>
</t>
</section>
<section anchor="scope" title="Scope of Metadata Layouts">
<t>
This proposal assumes a model where there are two or more
servers capable of supporting NFSv4.1 operations. At least
one server is an I-MDS, and the I-MDS should be thought of as
a normal NFSv4.1 server, with the additional capability of
granting metadata layouts on demand. The I-MDS might also be
capable of granting non-metadata layouts, but this is
orthogonal to the scope of metadata striping.
</t>
<t>
The model also requires at least one additional server, an
L-MDS, that is capable of supporting NFSv4.1 operations that
are directed to the server by the I-MDS. It is permissible
for an I-MDS to also be an L-MDS, and an L-MDS to also be an
I-MDS. Indeed, a simple submodel is for every NFSv4.1 server
in a set to be both an I-MDS and L-MDS.
</t>
<t>
For convenience, we divide NFSv4.1 metadata operations into
three classes:
<list>
<t>
Filehandle-only. These are operations that take only
filehandles as arguments, i.e. the current filehandle, or
both the current filehandle and the saved filehandle, and
no component names of files (e.g., LOCK, LAYOUTGET).
</t>
<t>
Name-based. These are operations that take one or two
filehandles (i.e. the current filehandle, or both the
current file handle and the saved filehandle) and one or
two component names of files (e.g., LINK, RENAME).
</t>
<t>
Directory-enumeration. These are operations that take
one filehandle and return the contents of a directory.
Currently, NFSv4 has only one such operation, READDIR.
This draft adds a second, PREADDIR.
</t>
</list>
</t>
<t>
Metadata striping applies to all of the foregoing NFSv4.x
operations, and is of two types:
<list>
<t>
filehandle striping uses hints (attribute-based indications)
backed by a filesystem-wide layout to direct clients to a
preferred MDS on which to perform filehandle-only
operations
</t>
<t>
directory striping uses fine-grained metadata layouts on
directories to support execution of name-based and
directory-enumeration operations (e.g., creates, directory
enumeration) on a set of MDS servers in parallel
</t>
</list>
</t>
<section anchor="FilehandleStriping2" title="Filehandle Striping">
<t>
To avoid an explosion of new client state, a coarse-grained
hinting mechanism is used to direct filehandle-only
operations to a preferred metadata server.
</t>
<t>
As specified in Section 5.12.1 of [NFSv4.1], when a client
encounters a file system which supports LAYOUT4_METADATA, it
can obtain a metadata layout of subtype LAYOUTMETA4_FILEHANDLE,
whose scope is the entire file system, using the LAYOUTGET
operation on any filehandle object in the file system which
it is permitted to access.
</t>
<t>
Then using ordinary READDIR and GETATTR requests, the client
can obtain for any object in the file system a
meta_stripe_deviceid attribute that indicates the preferred
device to send filehandle-only or name-based operations for
that object.
</t>
<t>
For example, suppose that after obtaining an ordinary
filehandle via OPEN, a LAYOUTMETA4_FILEHANDLE layout on the
containing file system, and a meta_stripe_deviceid hint from
a previous GETATTR, READDIR, or PREADDIR, the client wants
to get a byte range lock on the file. The client sends the
LOCK request to the network address (pNFS device, L-MDS)
indicated by the meta_stripe_deviceid attribute.
</t>
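<t>
As a non-normative illustration, the routing decision in this
example might be sketched as follows in Python. Every name here
(target_for_filehandle_op, device_addrs, imds_addr) is hypothetical,
the device table stands in for the results of earlier GETDEVICEINFO
calls, and each device is simplified to a flat list of addresses.
Falling back to the I-MDS when no usable hint is held is an
assumption of this sketch, not a requirement stated in this draft.
</t>
<figure align="left">
<artwork><![CDATA[
```python
from typing import Dict, List, Optional


def target_for_filehandle_op(
    hint: Optional[bytes],
    device_addrs: Dict[bytes, List[str]],
    imds_addr: str,
) -> str:
    """Pick the address for a filehandle-only operation such as LOCK."""
    if hint is None or hint not in device_addrs:
        # ASSUMPTION: with no usable hint, send to the I-MDS.
        return imds_addr
    addrs = device_addrs[hint]
    # Any address of the device is treated as equally valid here;
    # this sketch simply takes the first one.
    return addrs[0] if addrs else imds_addr


# Made-up device table and addresses (documentation address range).
devices = {b"dev-7": ["192.0.2.10:2049", "192.0.2.11:2049"]}
lock_target = target_for_filehandle_op(
    b"dev-7", devices, "192.0.2.1:2049")
```
]]></artwork>
</figure>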
</section>
<section anchor="DirectoryStriping2" title="Directory Striping">
<t>
For name-based and directory enumeration operations, a more
fine-grained, layout-based redirection mechanism is used.
</t>
<t>
When a client obtains a filehandle for an object that is of
type directory and wishes to take advantage of metadata
striping, the client first obtains a metadata layout of
subtype LAYOUTMETA4_DIRECTORY on the directory. The client is
provided with a directory-specific list of network addresses
(devices) to which to send requests specific to objects in
that directory.
</t>
<section anchor="NameBased" title="Name-Based Operations">
<t> For name-based operations, the directory striping layout
indicates the preferred destinations in the network to send name-based
operations for that directory (e.g., CREATE). The preferred
destinations MUST apply to the current filehandle that the operation
uses. In other words, for LINK and RENAME, which take both the saved
filehandle and the current filehandle as parameters, the pNFS client
would use the stripe hint of the target directory (indicated in the
current filehandle) for guidance on where to send the operation. Note
that if an L-MDS accepts a LINK or RENAME operation, the L-MDS MUST
perform the operation atomically. If it cannot, then the L-MDS MUST
return the error NFS4ERR_XDEV, and the client MUST send the operation
to the I-MDS. </t>
<t>
The choice of destination is a function of the name the
client is requesting. For example, after the client
obtains the filehandle of a directory via LOOKUP and the
metadata layout via LAYOUTGET, the client wants to open a
regular file within the directory. As with the
LAYOUT4_NFSV4_1_FILES layout type, the client has a list of
network addresses to which to send requests. With the
LAYOUT4_NFSV4_1_FILES layout, the choice of the index in
the list of network addresses is computed from the offset
of the read or write request. With the metadata layout,
the choice of the index is derived from the name passed to
OPEN (or by some other method, such as combining the name
with one or more attributes of the directory, such as the
filehandle or fileid, as described below).
</t>
</section>
<section anchor="DirectoryEnumeration" title="Directory
Enumeration">
<t>
For directory-enumeration operations, the directory striping
layout indicates the preferred destination in the network to send
(P)READDIR operations for that directory. For example, after
the client obtains the filehandle of a directory via LOOKUP and the
metadata layout via LAYOUTGET, the client wants to read the directory.
As with the LAYOUT4_NFSV4_1_FILES layout type, the client has a list of
network addresses to which to send requests. With the
LAYOUT4_NFSV4_1_FILES layout, the choice of the index in the list of
network addresses is computed from the offset of the read or write
request. For directory striping layouts, the index runs from 0
to the directory stripe count minus 1.
</t>
</section>
</section>
</section>
<section anchor="MetastripeLayout" title="The Metadata Striping
Layout">
<section anchor="NameOfLayout" title="Name">
<t>
The name of the metadata striping layout type is
LAYOUT4_METADATA.
</t>
</section>
<section anchor="ValueOfLayout" title="Value">
<t>
The value of the metadata striping layout type is TBD1.
</t>
</section>
<section anchor="DataDefinitions" title="Data Type Definitions">
<section anchor="DefLayoutHint" title="Layout Hint">
<figure align="center" anchor="fig_layout_hint">
<artwork><![CDATA[
/// %
/// %/* Encoded in the loh_body field of type layouthint4: */
/// %
/// struct md_dirsize_layouthint4 {
/// uint64_t *mdlh_min_est;
/// uint64_t *mdlh_avg_est;
/// uint64_t *mdlh_max_est;
/// uint32_t *mdlh_stripe_count;
/// uint32_t *mdlh_stripe_modulus;
/// };
]]></artwork>
</figure>
<t>
The layout-type specific layouthint4 content for the
LAYOUT4_METADATA layout type is composed of five fields,
each optional. Using some combination of the
mdlh_min_est, mdlh_avg_est, and mdlh_max_est fields, the
client can indicate the directory
workload it expects for a new directory. The client
may also suggest an explicit stripe count or modulus
preference in mdlh_stripe_count or mdlh_stripe_modulus,
which SHOULD be congruent if specified together.
</t>
</section>
<section anchor="DefDevices" title="Devices">
<figure align="left" anchor="fig_devices">
<artwork><![CDATA[
/// % /*
/// % * Encoded in the da_addr_body field of data type
/// % * device_addr4:
/// % */
/// struct md_layout_addr4 {
/// multipath_list4 mdla_multipath_list<>;
/// };
]]></artwork>
</figure>
</section>
<section anchor="DefMetadataLayout" title="Metadata Layout">
<figure align="left" anchor="fig_metadata_layout">
<artwork><![CDATA[
/// enum md_layout_subtype4 {
/// LAYOUTMETA4_FILEHANDLE = 0,
/// LAYOUTMETA4_DIRECTORY
/// };
///
/// enum md_namebased_alg4 {
/// MDN_ALG_CITYHASH64 = 0,
/// MDN_ALG_CEPHFRAG = 1,
/// /* XXX TBD2 */
/// };
///
/// typedef uint32_t cephfrag4;
///
/// struct cephfragsplit4 {
/// cephfrag4 frag;
/// uint32_t bits;
/// };
///
/// enum cephhash4 {
/// MDC_HASH_LINUX_DCACHE = 0,
/// MDC_HASH_RJENKINS = 1,
/// MDC_HASH_CITYHASH32 = 2,
/// };
///
/// struct md_namebased_alg_cephfrag4 {
/// enum cephhash4 hash;
/// cephfragsplit4 fragtree<>;
/// };
/// struct md_layout_directory {
/// switch(enum md_namebased_alg4 mdln_namebased_alg) {
/// case MDN_ALG_CITYHASH64:
/// uint32_t mdln_cityhash_seed;
/// case MDN_ALG_CEPHFRAG:
/// md_namebased_alg_cephfrag4 mdln_cephfrag;
/// };
///
/// deviceid4 mdln_devicelist<>;
/// uint32_t mdln_stripe_pattern<>;
/// };
/// struct md_layout4 {
/// union md_layout_type
/// switch (enum md_layout_subtype4 subtype) {
/// case LAYOUTMETA4_FILEHANDLE:
/// void;
/// case LAYOUTMETA4_DIRECTORY:
/// md_layout_directory mdl_layout;
/// };
/// };
]]></artwork>
</figure>
<t>
<!-- XXX comments on mdl_flags etc -->
</t>
</section>
<section anchor="Layoutupdate4" title="Layoutupdate4 lou_body">
<figure align="left" anchor="fig_lou_body">
<artwork><![CDATA[
///
/// struct md_directory_layoutupdate4 {
/// int32_t mdlu_entries_added;
/// int32_t mdlu_entries_removed;
/// nfstime4 mdlu_last_update;
/// };
///
/// % /*
/// % * Encoded in the lou_body field of data type
/// % * layoutupdate4:
/// % */
/// struct md_layout_update4 {
/// union md_layout_type switch (enum md_layout_subtype4 subtype) {
/// case LAYOUTMETA4_FILEHANDLE:
/// void;
/// case LAYOUTMETA4_DIRECTORY:
/// md_directory_layoutupdate4 mlu_directory;
/// };
/// };
]]></artwork>
<postamble>layoutupdate4 lou_body</postamble>
</figure>
</section>
</section>
<section anchor="SemanticsOfLayout" title="Metadata Layout
Semantics">
<t>
The reply to a successful LAYOUTGET request MUST contain
exactly one element in logr_layout. The element contains
the metadata layout.
</t>
<section anchor="SemanticsLayoutget" title="LAYOUTGET Argument
Conventions">
<t>
When a client requests a layout of type LAYOUT4_METADATA,
it specifies the desired subtype, which MUST be one of
LAYOUTMETA4_FILEHANDLE or LAYOUTMETA4_DIRECTORY, as the value of
the LAYOUTGET loga_iomode argument. Server
implementations should reject LAYOUTGET requests with
other values for loga_iomode.
</t>
<t>
The value provided for loga_stateid may be any valid
stateid for the related file or directory, or else the
anonymous stateid.
</t>
<t>
The values provided for loga_offset, loga_length, and
loga_minlength are not defined for metastripe layouts, and
server implementations MUST NOT interpret these values.
</t>
</section>
<section anchor="SemanticsFilehandleStriping" title="Filehandle Striping
Layouts">
<t>
If the requested layout is of subtype LAYOUTMETA4_FILEHANDLE,
the value of the layout is void. The filehandle redirection
information issued under auspices of the layout will be
entirely in the form of filehandle striping attribute hints.
</t>
<t>
As noted in Section 4, the scope of filehandle striping layouts
is an entire file system. The client can acquire the
(singleton) filehandle striping layout for a given file system
using any corresponding file handle which it happens to
hold, and whose object the client is permitted to access.
For example, the client could use the file handle of the
first directory it traverses on a given file system,
provided the file server is an NFSv4.x file server that
supports layouts of type LAYOUT4_METADATA.
</t>
<section anchor="SISStripeHints" title="Filehandle Stripe Hints">
<t>
Filehandle stripe hints are objects of type deviceid4, and
are the value of a new recommended, get-only attribute
meta_stripe_deviceid.
</t>
<t>
A client may successfully obtain the
meta_stripe_deviceid attribute on any file object if and
only if it has successfully obtained a filehandle striping
layout on the containing file system. Since the
meta_stripe_deviceid hint is an ordinary NFSv4
attribute, the client may acquire it from a GETATTR,
READDIR, or PREADDIR request. A server implementation
SHOULD interpret a PREADDIR operation (which has a
controlling metadata layout stateid) as a request for
just those attributes that are appropriate for the
layout stateid that has been presented.
</t>
<t>
In all cases, when a client holds a filehandle stripe hint
for a file object, it uses the GETDEVICEINFO operation
to map the hint value to a device address of data
type md_layout_addr4 in the ordinary pNFS manner.
</t>
<t>
The server ensures that each such device remains
accessible (unrecalled) for at least as long as any
filehandle striping layout exists for which the device has
been named in a hint.
</t>
</section>
</section>
<section anchor="SemanticsDirectoryStriping" title="Directory
Striping
Layouts">
<t>
If the requested layout is of subtype LAYOUTMETA4_DIRECTORY,
then the layout contains a <![CDATA[<]]>device list,
striping pattern, algorithm<![CDATA[>]]> triple enabling
the client to perform both parallel directory enumeration
operations and stripe-aware name-based operations, as
outlined in Section 4.
</t>
<t>
When the layout subtype is LAYOUTMETA4_DIRECTORY, the layout
content provides an integer identifying a hashing
algorithm, a list of deviceids, and a striping pattern.
The mdln_namebased_alg field identifies an algorithm that maps
a name, as a component4, to an integer. Each entry in the
mdln_devicelist specifies a set of metadata servers that
may be treated as equally valid for metadata requests to
the same block in the partitioned namespace. Each entry
in the stripe pattern is an index into the device list.
</t>
<t>
To perform a name-based operation, the client maps the
name to a number with the name-based algorithm and reduces
that number modulo the length of the stripe pattern to
select a stripe pattern entry. That entry is an index into
the device list, yielding a device id that may be
interpreted with GETDEVICEINFO in the ordinary pNFS
manner. After resolving the device id to a device address
of data type md_layout_addr4, the client sends the request
to any of the metadata servers specified by the
corresponding entry in the device list.
</t>
<section anchor="SDSNamebased" title="L-MDS Selection for Name-based
Operations">
<t>
Clients with layouts of type LAYOUTMETA4_DIRECTORY may use
the algorithm supplied in field mdln_namebased_alg of
the layout content to compute a preferred L-MDS to use
when performing name-based operations, as follows:
</t>
<figure align="left" anchor="fig_namebased_alg">
<artwork><![CDATA[
Let F be the function specified in mdln_namebased_alg;
Let X = (x1, x2, x3, ...) be the inputs to function F, such
that x1 SHOULD be the component name of the file, and x2, x3, ...
are any additional parameters required by the chosen F, all of
which are values available to the client;
Let stripe_unit_number = F(X);
Let stripe_count = number of elements in mdl_layout.mdln_stripe_pattern;
Let idx =
mdl_layout.mdln_stripe_pattern[stripe_unit_number % stripe_count];
Let deviceid = mdl_layout.mdln_devicelist[idx];
]]></artwork>
<postamble>pseudocode</postamble>
</figure>
<t>
The client then selects an L-MDS indicated by the deviceid
(using GETDEVICEINFO in the normal manner), and sends the
name-based operation to that server.
</t>
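<t>
As a non-normative illustration, the selection steps above can be
sketched in Python. The function and parameter names
(select_deviceid, hash_fn, and so on) are hypothetical and are not
part of the protocol. The hash is passed in as a callable because
the real function is whatever mdln_namebased_alg specifies; the
usage lines substitute CRC-32 from the Python standard library
purely so that the sketch runs, and CRC-32 is not CityHash. The
device ids and stripe pattern are made-up values.
</t>
<figure align="left">
<artwork><![CDATA[
```python
import zlib
from typing import Callable, Sequence


def select_deviceid(
    name: bytes,
    stripe_pattern: Sequence[int],
    devicelist: Sequence[bytes],
    hash_fn: Callable[[bytes], int],
) -> bytes:
    """Return the device id preferred for a name-based operation."""
    if not stripe_pattern:
        raise ValueError("stripe pattern must not be empty")
    # stripe_unit_number = F(X); here X is just the component name.
    stripe_unit_number = hash_fn(name)
    # Reduce modulo the stripe count to pick a stripe pattern entry.
    idx = stripe_pattern[stripe_unit_number % len(stripe_pattern)]
    # Each stripe pattern entry is an index into the device list.
    return devicelist[idx]


# Made-up layout contents and a stand-in hash (NOT CityHash).
devices = [b"dev-A", b"dev-B"]
pattern = [0, 1, 1, 0]
chosen = select_deviceid(b"report.txt", pattern, devices, zlib.crc32)
```
]]></artwork>
</figure>
<t>
The client would then resolve the returned device id with
GETDEVICEINFO before sending the operation, as described above.
</t>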
<section anchor="SDSNamebasedCity" title="MDN_ALG_CITYHASH64">
<t>
A layout with MDN_ALG_CITYHASH64 as the mdln_namebased_alg
indicates the use of the 64-bit CityHash non-cryptographic
hashing function <xref target="CITY"/> for directory
placement, with x1 the desired component name, and x2 the
32-bit seed value returned in the layout.
</t>
</section>
<section anchor="SDSNamebasedCeph" title="MDN_ALG_CEPHFRAG">
<t>
A layout with MDN_ALG_CEPHFRAG as the mdln_namebased_alg
indicates the use of Ceph's directory fragmentation
algorithm for directory placement.
</t>
<t>
Ceph uses a recursive algorithm to partition the hash space
of a directory into fragments, which are represented by
an ordered list of splits called the fragtree. Each fragment
is split into a power-of-two number of child fragments, and
each split stores the exponent in the field 'bits'.
</t>
<t>
Similarly, the cephfrag4 encodes in its high 8 bits the total
number of bits 'n' it has split from the root fragment. In the
next highest 'n' bits, it encodes its position in the hash
space. If a given hash value 'v' matches these 'n' bits, the
fragment is said to contain 'v'.
</t>
<t>
For example, starting with the root fragment root=0x00000000
and splitting by 2 bits, we generate the four fragments
f1=0x02000000, f2=0x02400000, f3=0x02800000 and
f4=0x02C00000. Further splitting f3 by 1 bit, we generate
two new fragments g1=0x03800000 and g2=0x03A00000. The
resulting fragtree for this structure would be
{ {0x00000000, 2}, {0x02800000, 1} }.
</t>
<t>
To place a given filename, calculate its hash value 'v'
using the hash function indicated by the 'hash' enum. Then,
starting with the root fragment f=0x00000000, follow these
steps recursively:
* Search for a split in the fragtree matching frag=f. If no
split is found, place the file in fragment f.
* Given a split of 'n' bits, find which of the 2^n child
fragments contains the hash value 'v'. Assign this child
fragment to 'f' and continue.
</t>
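<t>
The placement steps above can be sketched in Python as follows.
This is a non-normative illustration under stated assumptions:
the fragtree is held as a mapping from fragment to split 'bits'
(rather than an ordered list) for easy lookup, the hash value has
already been reduced to 24 bits, and the recursion is written as
a loop. The names place and child are hypothetical.
</t>
<figure align="left">
<artwork><![CDATA[
```python
def child(frag, nbits, v24):
    """Return the child of `frag`, after a split of `nbits` bits,
    that contains the 24-bit hash value v24."""
    bits = (frag >> 24) + nbits
    mask = (0xFFFFFF << (24 - bits)) & 0xFFFFFF   # top `bits` bits
    return (bits << 24) | (v24 & mask)


def place(fragtree, v24):
    """Walk the fragtree from the root; return the leaf fragment
    in which the hash value v24 is placed."""
    frag = 0x00000000
    while frag in fragtree:          # a split exists for this fragment
        frag = child(frag, fragtree[frag], v24)
    return frag                      # no split found: place here


# The fragtree from the earlier example, as {fragment: bits}.
fragtree = {0x00000000: 2, 0x02800000: 1}
print(hex(place(fragtree, 0xA12345)))
```
]]></artwork>
<postamble>illustrative sketch (non-normative)</postamble>
</figure>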
<section anchor="SDSNamebasedCephLinux" title="MDC_HASH_LINUX_DCACHE">
<t>
Specifies the use of the Linux dentry cache (needs reference)
hashing function.
</t>
</section>
<section anchor="SDSNamebasedCephJenkins" title="MDC_HASH_RJENKINS">
<t>
Specifies the use of Robert Jenkins' <xref target="JENKINS"/>
hashing function.
</t>
</section>
<section anchor="SDSNamebasedCephCity" title="MDC_HASH_CITYHASH32">
<t>
Specifies the use of the 32-bit CityHash <xref target="CITY"/>
hashing function.
</t>
</section>
</section>
</section>
<section anchor="SDSDirectoryEnum" title="Directory
Enumeration">
<t>
Clients with layouts of type LAYOUTMETA4_DIRECTORY may use
the following algorithm to enumerate a striped directory
at its preferred metadata servers, in parallel:
</t>
<figure align="left" anchor="fig_dir_enum_alg">
<artwork><![CDATA[
For stripe_number in 0 .. length(mdl_layout.mdln_stripe_pattern) - 1
do
Let stripe =
mdl_layout.mdln_stripe_pattern[stripe_number];
Let device = mdl_layout.mdln_devicelist[stripe];
<PREADDIR at device, layout_stateid, stripe_number>
done
]]></artwork>
<postamble>pseudocode</postamble>
</figure>
<t> That is, for each logical stripe in the directory, the
client notes the stripe number (merely the stripe's offset in the
sequence), and derives from it the corresponding index into
mdln_devicelist by indirection on mdln_stripe_pattern. The object at
mdln_devicelist[mdln_stripe_pattern[stripe_number]] is a device id,
which the client maps to an L-MDS using GETDEVICEINFO; the client then
performs a sequence of PREADDIR operations on that server. The PREADDIR
operation behaves exactly as the READDIR operation described in section
18.23.3 of [NFSv4.1], but takes, in addition to the arguments of
READDIR, a metadata layout stateid and a stripe number.
</t>
<t> As in ordinary NFSv4.1, to perform a full enumeration
of the directory entries at each component L-MDS, the client commences
iteration by sending a cookie argument of zero for the first PREADDIR
operation in the current stripe, and continues performing PREADDIR
operations, supplying as the cookie argument the last cookie
value returned by the prior PREADDIR operation in the same logical
(L-MDS) enumeration only, until a PREADDIR operation indicates that no
further entries are available. The client and server behavior for
subsequent re-traversals of a previously-enumerated logical directory
is exactly as in ordinary NFSv4.1, except with respect to entry and
cookie partitioning as described here. The client SHOULD present to a
component L-MDS only cookie values previously returned to that client
by that same L-MDS, or 0 to commence iteration. An L-MDS MAY reject
with NFS4ERR_BADCOOKIE PREADDIR operations using cookie values that
are valid cookies for the logical directory, but which are local to
another L-MDS segment. </t>
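<t>
The per-stripe cookie discipline described above can be sketched in
Python as follows. This is a non-normative illustration: preaddir is
a hypothetical callable standing in for the PREADDIR operation, the
fake server models cookies as list offsets (real cookies are
opaque), and the stripes are visited serially rather than in
parallel for brevity.
</t>
<figure align="left">
<artwork><![CDATA[
```python
def enumerate_directory(stripe_pattern, devicelist, preaddir):
    """Enumerate every stripe of a directory. `preaddir` is a
    hypothetical callable (device, stripe_number, cookie) ->
    (entries, last_cookie, eof)."""
    names = []
    for stripe_number, idx in enumerate(stripe_pattern):
        device = devicelist[idx]      # indirection via stripe pattern
        cookie, eof = 0, False        # cookie 0 commences iteration
        while not eof:
            # Cookies are only ever reused within the same stripe.
            entries, cookie, eof = preaddir(device, stripe_number,
                                            cookie)
            names.extend(entries)
    return names


def make_fake_preaddir(contents, page=2):
    """Build an in-memory stand-in for an L-MDS. `contents` maps
    (device, stripe_number) to a list of entry names."""
    def preaddir(device, stripe_number, cookie):
        names = contents.get((device, stripe_number), [])
        chunk = names[cookie:cookie + page]
        next_cookie = cookie + len(chunk)
        return chunk, next_cookie, next_cookie >= len(names)
    return preaddir


contents = {("devB", 0): ["a", "b", "c"], ("devA", 1): ["d"]}
result = enumerate_directory([1, 0, 1], ["devA", "devB"],
                             make_fake_preaddir(contents))
print(result)
```
]]></artwork>
<postamble>illustrative sketch (non-normative)</postamble>
</figure>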
</section>
</section>
</section>
<section anchor="LAYOUTCOMMIT" title="LAYOUTCOMMIT">
<t>
As filehandle striping layouts are effectively read-only,
clients SHOULD NOT attempt commits on filehandle striping
layouts. If a server implementation receives a LAYOUTCOMMIT
for a valid filehandle striping layout, it SHOULD return
NFS4ERR_OK.
</t>
<t>
For metastripe layouts of subtype LAYOUTMETA4_DIRECTORY, the layout
specific data for LAYOUTCOMMIT contains the signed count of items
added to and removed from the directory since the last LAYOUTCOMMIT
operation.
</t>
</section>
<section anchor="PREADDIR_OP_DET" title="Operation: PREADDIR - Parallel Read Directory">
<section title="ARGUMENTS">
<figure align="left" anchor="fig_PREADDIR4args">
<artwork><![CDATA[
/// struct PREADDIR4args {
/// /* CURRENT_FH: directory */
/// nfs_cookie4 cookie;
/// verifier4 cookieverf;
/// count4 dircount;
/// count4 maxcount;
/// bitmap4 attr_request;
/// stateid4 layout_stateid;
/// uint32_t stripe_number;
/// };
]]></artwork>
</figure>
</section>
<section title="RESULTS">
<figure align="left" anchor="fig_PREADDIR4res">
<artwork><![CDATA[
/// typedef struct READDIR4res PREADDIR4res;
]]></artwork>
</figure>
</section>
<section title="DESCRIPTION">
</section>
<section title="IMPLEMENTATION">
</section>
</section>
</section>
<section anchor="FurtherConsiderations" title="Further
Considerations">
<section anchor="FCStorageProtocols" title="Storage Access
Protocols">
<t>
The LAYOUT4_METADATA layout type uses NFSv4.1 operations
(and potentially, operations of higher minor versions of
NFSv4, subject to the definition of a minor version of
NFSv4) to access striped metadata. The LAYOUT4_METADATA
layout type does not affect access to storage devices, and indeed, in
the protocol described here, layouts of type
LAYOUT4_METADATA and ordinary pNFS layouts for parallel data
access (e.g., LAYOUT4_NFSV4_1_FILES, LAYOUT4_OSD2_OBJECTS,
or LAYOUT4_BLOCK_VOLUME, or a future flexible files
layout), are orthogonal.
</t>
</section>
<section anchor="FCRevocation" title="Revocation of Layouts">
<t>
Servers MAY revoke layouts of type LAYOUT4_METADATA. A
client detects that a layout has been revoked when an operation
is rejected with NFS4ERR_PNFS_NO_LAYOUT. In NFSv4.1, the
error NFS4ERR_PNFS_NO_LAYOUT could be returned only by READ
and WRITE. When the server returns a layout of type
LAYOUT4_METADATA, the set of operations that can return
NFS4ERR_PNFS_NO_LAYOUT is: ACCESS, CLOSE, COMMIT, CREATE,
DELEGRETURN, GETATTR, LINK, LOCK, LOCKT, LOCKU, LOOKUP,
LOOKUPP, NVERIFY, OPEN, OPENATTR, OPEN_DOWNGRADE, PREADDIR,
READ, READDIR, READLINK, REMOVE, RENAME, SECINFO, SETATTR,
VERIFY, WRITE, GET_DIR_DELEGATION, SECINFO_NO_NAME,
and WANT_DELEGATION.
</t>
</section>
<section anchor="FCStateids" title="Stateids">
<t>
The pNFS specification for LAYOUT4_NFSV4_1_FILES states that data
servers MUST be aware of the stateids granted by the MDS so that the
stateids passed to READ and WRITE can be properly validated.
Similarly, in layouts of type LAYOUT4_METADATA, the L-MDS MUST be
aware of layout stateids issued by the controlling I-MDS in the
corresponding layout.
</t>
<t>
In addition, the L-MDS MUST be aware of any non-layout
stateids granted by the I-MDS, if and only if the client is
in contact with the L-MDS under direction of a metadata layout
returned by the I-MDS, and the I-MDS has not recalled or
revoked that layout. In addition, because an L-MDS can
accept operations like OPEN and LOCK that create or modify
stateids, the I-MDS MUST be aware of stateids that an L-MDS
has returned to a client, if and only if the I-MDS granted
the client a metadata layout that directed the client to the
L-MDS.
</t>
<t>
In some cases, one L-MDS MUST be aware of a stateid
generated by another L-MDS. For example, a client can obtain
a stateid from the L-MDS serving as the destination of
name-based operations, which includes OPEN. However,
operations that use the stateid will be filehandle-only
operations, and the L-MDS the OPEN operation is sent to
might differ from the L-MDS the LOCK operation for the same
target file is sent to.
</t>
<t>
When a client obtains a non-layout stateid from an L-MDS, for
example, as the result of an OPEN operation, the stateid
is asserted to be valid at the issuing L-MDS, and also the
associated I-MDS, as noted above. In addition, if the client holds
a filehandle striping layout on the current file system, it SHOULD
request the associated stripe hint on the object, ideally in the