# Execution Internals
The storage execution layer provides an interface for higher-level MongoDB components, including
query, replication and sharding, to all storage engines compatible with MongoDB. It maintains a
catalog, in-memory and on-disk, of collections and indexes. It also implements a concurrency control
layer, in addition to whatever the storage engine implements, to safely modify the catalog while
keeping collection and index data formatting correct and consistent.
Execution facilitates reads and writes to the storage engine with various persistence guarantees,
builds indexes, supports replication rollback, manages oplog visibility, repairs data corruption
and inconsistencies, and much more.
The main code highlights are: the storage integration layer found in the [**storage/**][] directory;
the lock manager and lock helpers found in the [**concurrency/**][] directory; the catalog found in
the [**catalog/**][] directory; the index build code found in many directories; the various types of
index implementations found in the [**index/**][] directory; the sorter found in the
[**sorter/**][] directory; and the time-series bucket catalog found in the [**timeseries/**][]
directory.
[**storage/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/storage
[**concurrency/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/concurrency
[**catalog/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/catalog
[**index/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/index
[**sorter/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/sorter
[**timeseries/**]: https://github.com/mongodb/mongo/tree/master/src/mongo/db/timeseries
For more information on the Storage Engine API, see the [storage/README][].
For more information on time-series collections, see the [timeseries/README][].
[storage/README]: https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/README.md
[timeseries/README]: https://github.com/mongodb/mongo/blob/master/src/mongo/db/timeseries/README.md
# The Catalog
The catalog is where MongoDB stores information about the collections and indexes for a MongoDB
node. In some contexts we refer to this as metadata and to operations changing this metadata as
[DDL](#glossary) (Data Definition Language) operations. The catalog is persisted as a table with
BSON documents that each describe properties of a collection and its indexes. An in-memory catalog
caches the most recent catalog information for more efficient access.
## Durable Catalog
The durable catalog is persisted as a table with the `_mdb_catalog` [ident](#glossary). Each entry
in this table is indexed with a 64-bit `RecordId`, referred to as the catalog ID, and contains a
BSON document that describes the properties of a collection and its indexes. The `DurableCatalog`
class allows read and write access to the durable data.
Starting in v5.2, catalog entries for time-series collections have a new flag called
`timeseriesBucketsMayHaveMixedSchemaData` in the `md` field. Time-series collections upgraded from
versions earlier than v5.2 may have mixed-schema data in buckets. This flag gets set to `true` as
part of the upgrade process and is removed as part of the downgrade process through the
[collMod command](https://github.com/mongodb/mongo/blob/cf80c11bc5308d9b889ed61c1a3eeb821839df56/src/mongo/db/catalog/coll_mod.cpp#L644-L663).
**Example**: an entry in the durable catalog for a collection `test.employees` with an in-progress
index build on `{lastName: 1}`:
```
{'ident': 'collection-0--2147780727179663754',
 'idxIdent': {'_id_': 'index-1--2147780727179663754',
              'lastName_1': 'index-2--2147780727179663754'},
 'md': {'indexes': [{'backgroundSecondary': False,
                     'multikey': False,
                     'multikeyPaths': {'_id': Binary('\x00', 0)},
                     'ready': True,
                     'spec': {'key': {'_id': 1},
                              'name': '_id_',
                              'v': 2}},
                    {'backgroundSecondary': False,
                     'multikey': False,
                     'multikeyPaths': {'_id': Binary('\x00', 0)},
                     'ready': False,
                     'buildUUID': UUID('d86e8657-1060-4efd-b891-0034d28c3078'),
                     'spec': {'key': {'lastName': 1},
                              'name': 'lastName_1',
                              'v': 2}}],
        'ns': 'test.employees',
        'options': {'uuid': UUID('795453e9-867b-4804-a432-43637f500cf7')}},
 'ns': 'test.employees'}
```
### $listCatalog Aggregation Pipeline Operator
[$listCatalog](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/pipeline/document_source_list_catalog.h#L46) is an internal aggregation pipeline operator that may be used to inspect the contents
of the durable catalog on a running server. For catalog entries that refer to views, additional
information is retrieved from the enclosing database's `system.views` collection. The $listCatalog
is generally run on the admin database to obtain a complete view of the durable catalog, provided
the caller has the required
[administrative privileges](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/pipeline/document_source_list_catalog.cpp#L55).
Example command invocation and output from
[list_catalog.js](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/jstests/core/catalog/list_catalog.js#L98):
```
> const adminDB = db.getSiblingDB('admin');
> adminDB.aggregate([{$listCatalog: {}}]);
Collectionless $listCatalog: [
    {
        "db" : "test",
        "name" : "system.views",
        "type" : "collection",
        "md" : {
            "ns" : "test.system.views",
            "options" : {
                "uuid" : UUID("a132c4ee-a1f4-4251-8eb2-c9f4afbeb9c1")
            },
            "indexes" : [
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "_id" : 1
                        },
                        "name" : "_id_"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "_id" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                }
            ]
        },
        "idxIdent" : {
            "_id_" : "index-6-2245557986372974053"
        },
        "ns" : "test.system.views",
        "ident" : "collection-5-2245557986372974053"
    },
    {
        "db" : "list_catalog",
        "name" : "simple",
        "type" : "collection",
        "md" : {
            "ns" : "list_catalog.simple",
            "options" : {
                "uuid" : UUID("a86445c2-3e3c-42ae-96be-5d451c977ed6")
            },
            "indexes" : [
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "_id" : 1
                        },
                        "name" : "_id_"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "_id" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                },
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "a" : 1
                        },
                        "name" : "a_1"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "a" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                }
            ]
        },
        "idxIdent" : {
            "_id_" : "index-62-2245557986372974053",
            "a_1" : "index-63-2245557986372974053"
        },
        "ns" : "list_catalog.simple",
        "ident" : "collection-61-2245557986372974053"
    },
    {
        "db" : "list_catalog",
        "name" : "system.views",
        "type" : "collection",
        "md" : {
            "ns" : "list_catalog.system.views",
            "options" : {
                "uuid" : UUID("2f76dd14-1d9a-42e1-8716-c1165cdbb00f")
            },
            "indexes" : [
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "_id" : 1
                        },
                        "name" : "_id_"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "_id" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                }
            ]
        },
        "idxIdent" : {
            "_id_" : "index-65-2245557986372974053"
        },
        "ns" : "list_catalog.system.views",
        "ident" : "collection-64-2245557986372974053"
    },
    {
        "db" : "list_catalog",
        "name" : "simple_view",
        "type" : "view",
        "ns" : "list_catalog.simple_view",
        "_id" : "list_catalog.simple_view",
        "viewOn" : "simple",
        "pipeline" : [
            {
                "$project" : {
                    "a" : 0
                }
            }
        ]
    },
    ...
]
```
The `$listCatalog` also supports running on a specific collection. See example in
[list_catalog.js](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/jstests/core/catalog/list_catalog.js#L77).
This aggregation pipeline operator is primarily intended for internal diagnostics and applications that require information not
currently provided by [listDatabases](https://www.mongodb.com/docs/v6.0/reference/command/listDatabases/),
[listCollections](https://www.mongodb.com/docs/v6.0/reference/command/listCollections/), and
[listIndexes](https://www.mongodb.com/docs/v6.0/reference/command/listIndexes/). The three commands referenced are part of the
[Stable API](https://www.mongodb.com/docs/v6.0/reference/stable-api/) with a well-defined output format (see
[listIndexes IDL](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/list_indexes.idl#L225)).
#### Read Concern Support
In terms of accessing the current state of databases, collections, and indexes in a running server,
the `$listCatalog` provides a consistent snapshot of the catalog in a single command invocation
using the default or user-provided
[read concern](https://www.mongodb.com/docs/v6.0/reference/method/db.collection.aggregate/). The
[list_catalog_read_concern.js](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/jstests/noPassthrough/list_catalog_read_concern.js#L46)
test contains examples of using `$listCatalog` with a variety of read concern settings.
The traditional alternative would have involved a `listDatabases` command followed by a series of
`listCollections` and `listIndexes` calls, with the downside of reading the catalog at a different
point in time during each command invocation.
#### Examples of differences between listIndexes and $listCatalog results
The `$listCatalog` operator does not format its results with the IDL-derived formatters generated for the `listIndexes` command.
This has implications for applications that read the durable catalog using $listCatalog rather than the recommended listIndexes
command. Below are a few examples where the `listIndexes` results may differ from `$listCatalog`.
| Index Type | Index Option | createIndexes | listIndexes | $listCatalog |
| ---------- | ------------ | ------------- | ----------- | ------------ |
| [Sparse](https://www.mongodb.com/docs/v6.0/core/index-sparse/) | [sparse (safeBool)](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/list_indexes.idl#L84) | `db.t.createIndex({a: 1}, {sparse: 12345})` | `{ "v" : 2, "key" : { "a" : 1 }, "name" : "a_1", "sparse" : true }` | `{ "v" : 2, "key" : { "a" : 1 }, "name" : "a_1", "sparse" : 12345 }` |
| [TTL](https://www.mongodb.com/docs/v6.0/core/index-ttl/) | [expireAfterSeconds (safeInt)](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/list_indexes.idl#L88) | `db.t.createIndex({created: 1}, {expireAfterSeconds: 10.23})` | `{ "v" : 2, "key" : { "created" : 1 }, "name" : "created_1", "expireAfterSeconds" : 10 }` | `{ "v" : 2, "key" : { "created" : 1 }, "name" : "created_1", "expireAfterSeconds" : 10.23 }` |
| [Geo](https://www.mongodb.com/docs/v6.0/tutorial/build-a-2d-index/) | [bits (safeInt)](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/src/mongo/db/list_indexes.idl#L117) | `db.t.createIndex({p: '2d'}, {bits: 16.578})` | `{ "v" : 2, "key" : { "p" : "2d" }, "name" : "p_2d", "bits" : 16 }` | `{ "v" : 2, "key" : { "p" : "2d" }, "name" : "p_2d", "bits" : 16.578 }` |
#### $listCatalog in a sharded cluster
The `$listCatalog` operator supports running in a sharded cluster where the `$listCatalog`
result from each shard is combined at the router with additional identifying information similar
to the [shard](https://www.mongodb.com/docs/v6.0/reference/operator/aggregation/indexStats/#std-label-indexStats-output-shard) output field
of the `$indexStats` operator. The following is a sample test output from
[sharding/list_catalog.js](https://github.com/mongodb/mongo/blob/532c0679ef4fc8313a9e00a1334ca18e04ff6914/jstests/sharding/list_catalog.js#L40):
```
[
    {
        "db" : "list_catalog",
        "name" : "coll",
        "type" : "collection",
        "shard" : "list_catalog-rs0",
        "md" : {
            "ns" : "list_catalog.coll",
            "options" : {
                "uuid" : UUID("17886eb0-f157-45c9-9b63-efb8273f51da")
            },
            "indexes" : [
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "_id" : 1
                        },
                        "name" : "_id_"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "_id" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                }
            ]
        },
        "idxIdent" : {
            "_id_" : "index-56--2997668048670645427"
        },
        "ns" : "list_catalog.coll",
        "ident" : "collection-55--2997668048670645427"
    },
    {
        "db" : "list_catalog",
        "name" : "coll",
        "type" : "collection",
        "shard" : "list_catalog-rs1",
        "md" : {
            "ns" : "list_catalog.coll",
            "options" : {
                "uuid" : UUID("17886eb0-f157-45c9-9b63-efb8273f51da")
            },
            "indexes" : [
                {
                    "spec" : {
                        "v" : 2,
                        "key" : {
                            "_id" : 1
                        },
                        "name" : "_id_"
                    },
                    "ready" : true,
                    "multikey" : false,
                    "multikeyPaths" : {
                        "_id" : BinData(0,"AA==")
                    },
                    "head" : NumberLong(0),
                    "backgroundSecondary" : false
                }
            ]
        },
        "idxIdent" : {
            "_id_" : "index-55--2220352983339007214"
        },
        "ns" : "list_catalog.coll",
        "ident" : "collection-53--2220352983339007214"
    }
]
```
## Collection Catalog
The `CollectionCatalog` class holds in-memory state about all collections in all databases and is a
cache of the [durable catalog](#durable-catalog) state. It provides the following functionality:
* Register new `Collection` objects, taking ownership of them.
* Lookup `Collection` objects by their `UUID` or `NamespaceString`.
* Iterate over `Collection` objects in a database in `UUID` order.
* Deregister individual dropped `Collection` objects, releasing ownership.
* Allow closing/reopening the catalog while still providing limited `UUID` to `NamespaceString`
lookup to support rollback to a point in time.
* Ensure `Collection` objects are in sync with opened storage snapshots.
### Synchronization
Catalog access is synchronized using [multiversion concurrency control][], where readers operate on
immutable catalog, collection and index instances. Writes use [copy-on-write][] to create newer
versions of the catalog, collection and index instances to be changed; contents are copied from the
previous latest version. Readers holding on to a catalog instance will thus not observe any writes
that happen after requesting an instance. If a reader needs to observe writes while holding a
catalog instance, it must refresh the instance.
Catalog writes are handled with the `CollectionCatalog::write(callback)` interface, which provides
the necessary [copy-on-write][] abstractions. A writable catalog instance is created by making a
shallow copy of the existing catalog. The actual write is implemented in the supplied callback,
which is allowed to throw. Execution of the write callbacks is serialized and may run on a different
thread than the thread calling `CollectionCatalog::write`. Users should take care not to perform any
blocking operations in these callbacks, as doing so would block all other DDL writes in the system.

To avoid a bottleneck when the catalog contains a large number of collections (and is therefore slow
to copy), immutable data structures are used and concurrent writes are batched together. Any thread
that enters `CollectionCatalog::write` while a catalog instance is being copied, or while write
callbacks are executing, is enqueued. When the copy finishes, all enqueued write jobs are run on
that catalog instance by the copying thread.
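
As a concrete illustration, a catalog write under this scheme might look like the following minimal
sketch. The callback body and the `registerCollection` call are illustrative, and exact signatures
vary across server versions:

```
// Sketch of a batched copy-on-write catalog update. The callback receives the
// shallow-copied, writable catalog instance; it may throw, must not block, and
// may run on a different thread than the caller.
CollectionCatalog::write(opCtx, [&](CollectionCatalog& writableCatalog) {
    // Mutate the cloned catalog; this example registers a new Collection
    // (signature illustrative).
    writableCatalog.registerCollection(opCtx, std::move(ownedCollection), boost::none);
});
```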
### Collection objects
Objects of the `Collection` class provide access to a collection's properties between
[DDL](#glossary) operations that modify these properties. Modifications are synchronized using
[copy-on-write][]. Reads access immutable `Collection` instances. Writes, such as rename
collection, apply changes to a clone of the latest `Collection` instance and then atomically install
the new `Collection` instance in the catalog. It is possible for operations that read at different
points in time to use different `Collection` objects.
Notable properties of `Collection` objects are:
* catalog ID - to look up or change information from the DurableCatalog.
* UUID - Identifier that remains for the lifetime of the underlying MongoDB collection, even across
  DDL operations such as renames, and is consistent between different nodes and shards in a
  cluster.
* NamespaceString - The current name associated with the collection.
* Collation and validation properties.
* Decorations that are either `Collection` instance specific or shared between all `Collection`
objects over the lifetime of the collection.
In addition `Collection` objects have shared ownership of:
* An [`IndexCatalog`](#index-catalog) - an in-memory structure representing the `md.indexes` data
from the durable catalog.
* A `RecordStore` - an interface to access and manipulate the documents in the collection as stored
by the storage engine.
A writable `Collection` may only be requested in an active [WriteUnitOfWork](#writeunitofwork). The
new `Collection` instance is installed in the catalog when the storage transaction commits, as the
first of the `onCommit` [Changes](#changes) to run. This means that it is not allowed to perform any
modification to catalog, collection or index instances in `onCommit` handlers. Such modifications
would break the immutability property of these instances for readers. If the storage transaction
rolls back then the writable `Collection` object is simply discarded and no change is ever made to
the catalog.
A writable `Collection` is a clone of the existing `Collection`; members are either deep- or
shallow-copied. Notably, a shallow copy is made of the [`IndexCatalog`](#index-catalog).
The oplog `Collection` follows special rules: it does not use [copy-on-write][] or any other form of
synchronization. Modifications operate directly on the instance installed in the catalog. It is not
allowed to read the oplog `Collection` concurrently with writes to it.
Finally, there are two kinds of decorations on `Collection` objects. The `Collection` object derives
from `DecorableCopyable` and requires `Decoration`s to implement a copy-constructor. Collection
`Decoration`s are copied with the `Collection` when DDL operations occur. This is used to keep
versioned query information per `Collection` instance. Additionally, there are
`SharedCollectionDecorations` for storing index usage statistics and query settings that are shared
between `Collection` instances across DDL operations.
### Collection lifetime
The `Collection` object is brought into existence in two ways:
1. Any DDL operation is run. Non-create operations such as `collMod` clone the existing `Collection`
object.
2. Using an existing durable catalog entry to instantiate an existing collection. This happens when
we:
1. Load the `CollectionCatalog` during startup or after rollback.
2. When we need to instantiate a collection at an earlier point-in-time because the `Collection`
is not present in the `CollectionCatalog`, or the `Collection` is there, but incompatible with
the snapshot. See [here](#catalog-changes-versioning-and-the-minimum-valid-snapshot) how a
`Collection` is determined to be incompatible.
3. When we read at latest concurrently with a DDL operation that is also performing multikey
changes.
For (1) and (2.1) the `Collection` objects are stored as shared pointers in the `CollectionCatalog`
and available to all operations running in the database. These `Collection` objects are released
when the collection is dropped from the `CollectionCatalog` and there are no more operations holding
a shared pointer to the `Collection`. This can be during shutdown or when the `Collection` is dropped
and there are no more readers.
For (2.2) the `Collection` objects are stored as shared pointers in `OpenedCollections`, which is a
decoration on the `Snapshot`, meaning these `Collection` objects are only available to the operation
that instantiated them. When the snapshot is abandoned, such as during query yield, these
`Collection` objects are released. Multiple lookups from the `CollectionCatalog` will re-use the
previously instantiated `Collection` instead of performing the instantiation at every lookup for the
same operation.
(2.3) is an edge case where neither the latest nor the pending `Collection` matches the opened snapshot due to
concurrent multikey changes.
Users of `Collection` instances have a few responsibilities to keep the object valid.
1. Hold a collection-level lock.
2. Use an AutoGetCollection helper.
3. Explicitly hold a reference to the `CollectionCatalog`.
### Index Catalog
Each `Collection` object owns an `IndexCatalog` object, which in turn has shared ownership of
`IndexCatalogEntry` objects, each of which owns an `IndexDescriptor` containing an in-memory
representation of the data stored in the [durable catalog](#durable-catalog).
## Catalog Changes, Versioning and the Minimum Valid Snapshot
Every catalog change has a corresponding write with a commit time. When registered `OpObserver`
objects observe catalog changes, they set the minimum valid snapshot of the `Collection` to the
commit timestamp. The `CollectionCatalog` uses this timestamp to determine whether the `Collection`
is valid for a given point-in-time read. If not, a new `Collection` instance will be instantiated to
satisfy the read.
When performing a DDL operation on a `Collection` object, the `CollectionCatalog` uses a two-phase
commit algorithm. In the first phase (pre-commit phase) the `Namespace` and `UUID` are marked as
pending commit. This has the effect of reading from durable storage to determine whether the latest
instance or the pending-commit instance in the catalog matches, and returning the one that does.
This applies only to reads without a timestamp that come in during the pending-commit state. The
second phase (commit phase) removes the `Namespace` and `UUID` from the pending-commit state. With
this, we can guarantee that `Collection` objects are fully in sync with the storage snapshot.
With lock-free reads there may be ongoing concurrent DDL operations. In order to have a
`CollectionCatalog` that's consistent with the snapshot, the following is performed when setting up
a lock-free read:
* Get the latest version of the `CollectionCatalog`.
* Open a snapshot.
* Get the latest version of the `CollectionCatalog` again and check that it matches the one obtained
  earlier. If it does not, retry these steps; proceeding would otherwise leave us with a
  `CollectionCatalog` that is inconsistent with the opened snapshot. A simplified sketch of this
  loop follows.
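
This is a minimal sketch, assuming the `CollectionCatalog::latest` accessor and the `RecoveryUnit`
snapshot methods referenced elsewhere in this document; the real logic lives in
`AutoGetCollectionForReadLockFree` and also compares replication state:

```
// Retry until the latest catalog version and the opened storage snapshot agree.
auto before = CollectionCatalog::latest(opCtx);
while (true) {
    opCtx->recoveryUnit()->preallocateSnapshot();  // explicitly open the snapshot
    auto after = CollectionCatalog::latest(opCtx);
    if (before == after)
        break;                                     // catalog is consistent with the snapshot
    opCtx->recoveryUnit()->abandonSnapshot();      // mismatch: discard and retry
    before = std::move(after);
}
```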
## Collection Catalog and Multi-document Transactions
* When we start the transaction we open a storage snapshot and stash a `CollectionCatalog` instance,
  similar to a regular lock-free read (but holding the RSTL, as opposed to lock-free reads).
* User reads within this transaction lock the namespace and ensure we have a `Collection` instance
  consistent with the snapshot (same as above).
* User writes do an additional step after locking to check whether the collection instance obtained
  is the latest instance in the `CollectionCatalog`; if it is not, we treat this as a `WriteConflict`
  and the transaction is retried.
The `CollectionCatalog` contains a mapping of `Namespace` and `UUID` to the `catalogId` for
timestamps back to the oldest timestamp. These are used for efficient lookups into the durable
catalog, and are resilient to create, drop and rename operations.
Operations that use collection locks (in any [lock mode](#lock-modes)) can rely on the catalog
information of the collection not changing. However, when unlocking and then relocking, operations
must not only recheck catalog information to ensure it is still valid, but also abandon the storage
snapshot so that it is consistent with the in-memory catalog.
## Two-Phase Collection and Index Drop
Collections and indexes are dropped in two phases to ensure both that reads at points in time
earlier than the drop are still possible and that startup recovery and rollback via a stable
timestamp can find the correct and expected data. The first phase removes the catalog entry
associated with the collection or index: this delete is versioned by the storage engine, so earlier
PIT accesses continue to see the catalog data. The second phase drops the collection or index data:
this is not versioned, and no PIT access can see the data afterwards. WiredTiger versions document
writes, but not table drops: once a table is gone, the data is gone.
The first phase of drop clears the associated catalog entry, both in the in-memory catalog and in
the on-disk catalog, and then registers the collection or index's information with the reaper. No
new operations will find the collection or index in the catalog. Pre-existing operations with
references to the collection or index state may still be running and will retain their references
until they complete. The reaper receives the collection or index ident (identifier), along with a
reference to the in-memory collection or index state, and a drop timestamp.
The second phase of drop deletes the data. The reaper maintains a list of {ident, drop token, drop
timestamp} sets. A collection or index's data will be dropped when both the drop timestamp becomes
sufficiently persisted such that the catalog change will not be rolled back and no other reference
to the collection or index in-memory state (tracked via the drop token) remains. When no concurrent
readers of the collection or index are left, the drop token will be the only remaining reference to
the in-memory state. The drop timestamp must be both older than the timestamp of the last checkpoint
and the oldest_timestamp. Requiring the drop timestamp to reach the checkpointed time ensures that
startup recovery and rollback via recovery to a stable timestamp, which both recover to the last
checkpoint, will never be missing collection or index data that should still exist at the checkpoint
time that is less than the drop timestamp. Requiring the drop timestamp to pass (become older than)
the oldest_timestamp ensures that all reads, which are supported back to the oldest_timestamp,
successfully find the collection or index data.
There's a mechanism to delay an ident drop during the second phase via
`KVDropPendingIdentReaper::markIdentInUse()` when there are no more references to the drop token.
This is currently used when instantiating `Collection` objects at earlier points in time for reads,
and prevents the reaper from dropping the collection and index tables during the instantiation.
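
Put together, the reaper's drop decision can be sketched as below. The types and names are
illustrative, not the actual `KVDropPendingIdentReaper` interface; the drop token is modeled as a
`std::weak_ptr` whose expiry signals that no in-memory references remain:

```
struct DropPendingIdent {
    std::string ident;               // storage engine table identifier
    std::weak_ptr<void> dropToken;   // tracks remaining in-memory references
    Timestamp dropTimestamp;         // commit time of the catalog entry removal
};

// An ident may be reaped only when all three conditions described above hold.
bool canDropIdent(const DropPendingIdent& e, Timestamp checkpointTs, Timestamp oldestTs) {
    const bool noReferences = e.dropToken.expired();           // no concurrent readers remain
    const bool checkpointed = e.dropTimestamp < checkpointTs;  // survives recovery to a checkpoint
    const bool unreachable = e.dropTimestamp < oldestTs;       // no supported PIT read needs it
    return noReferences && checkpointed && unreachable;
}
```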
_Code spelunking starting points:_
* [_The KVDropPendingIdentReaper
class_](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/storage/kv/kv_drop_pending_ident_reaper.h)
* Handles the second phase of collection/index drop. Runs when notified.
* [_The TimestampMonitor and TimestampListener
classes_](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/storage/storage_engine_impl.h#L178-L313)
* The TimestampMonitor starts a periodic job to notify the reaper of the latest timestamp that is
okay to reap.
* [_Code that signals the reaper with a
timestamp_](https://github.com/mongodb/mongo/blob/r4.5.0/src/mongo/db/storage/storage_engine_impl.cpp#L932-L949)
# Storage Transactions
Through the pluggable [storage engine API](https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/README.md), MongoDB executes reads and writes on its storage engine
with [snapshot isolation](#glossary). The structure used to achieve this is the [RecoveryUnit
class](../storage/recovery_unit.h).
## RecoveryUnit
Each pluggable storage engine for MongoDB must implement `RecoveryUnit` as one of the base classes
for the storage engine API. Typically, storage engines satisfy the `RecoveryUnit` requirements with
some form of [snapshot isolation](#glossary) with [transactions](#glossary). Such transactions are
called storage transactions elsewhere in this document, to differentiate them from the higher-level
_multi-document transactions_ accessible to users of MongoDB. The RecoveryUnit controls what
[snapshot](#glossary) a storage engine transaction uses for its reads. In MongoDB, a snapshot is defined by a
_timestamp_. A snapshot consists of all data committed with a timestamp less than or equal to the
snapshot's timestamp. No uncommitted data is visible in a snapshot, and data changes in storage
transactions that commit after a snapshot is created, regardless of their timestamps, are also not
visible. Generally, one uses a `RecoveryUnit` to perform transactional reads and writes by first
configuring the `RecoveryUnit` with the desired
[ReadSource](https://github.com/mongodb/mongo/blob/b2c1fa4f121fdb6cdffa924b802271d68c3367a3/src/mongo/db/storage/recovery_unit.h#L391-L421)
and then performing the reads and writes using operations on `RecordStore` or `SortedDataInterface`,
and finally calling `commit()` on the `WriteUnitOfWork` (if performing writes).
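
For example, a timestamped read through this interface might be sketched as follows, assuming a
`RecordStore` is at hand; `setTimestampReadSource` and `getCursor` correspond to the APIs described
above, though the surrounding code is illustrative:

```
// Pin the snapshot to a specific timestamp, then read through the storage API.
opCtx->recoveryUnit()->setTimestampReadSource(RecoveryUnit::ReadSource::kProvided,
                                              readTimestamp);
auto cursor = recordStore->getCursor(opCtx);  // the first read opens the storage snapshot
while (auto record = cursor->next()) {
    // Only data committed at or before readTimestamp is visible here.
}
```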
## WriteUnitOfWork
A `WriteUnitOfWork` is the mechanism to control how writes are transactionally performed on the
storage engine. All the writes (and reads) performed within its scope are part of the same storage
transaction. After all writes have been staged, one must call `commit()` in order to atomically
commit the transaction to the storage engine. It is illegal to perform writes outside the scope of
a WriteUnitOfWork since there would be no way to commit them. If the `WriteUnitOfWork` falls out of
scope before `commit()` is called, the storage transaction is rolled back and all the staged writes
are lost. Reads can be performed outside of a `WriteUnitOfWork` block; storage transactions outside
of a `WriteUnitOfWork` are always rolled back, since there are no writes to commit.
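
The typical pattern looks like the following minimal sketch; the lock helper and the elided write
calls are illustrative:

```
void writeSomething(OperationContext* opCtx, const NamespaceString& nss) {
    AutoGetCollection autoColl(opCtx, nss, MODE_IX);  // writes need at least an IX collection lock
    WriteUnitOfWork wuow(opCtx);
    // ... stage reads and writes; all of them share one storage transaction ...
    wuow.commit();  // atomically commit; leaving scope without this rolls everything back
}
```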
## Lazy initialization of storage transactions
Note that storage transactions on WiredTiger are not started at the beginning of a `WriteUnitOfWork`
block. Instead, the transaction is started implicitly with the first read or write operation. To
explicitly start a transaction, one can use `RecoveryUnit::preallocateSnapshot()`.
## Changes
One can register a `Change` on a `RecoveryUnit` while in a `WriteUnitOfWork`. This allows extra
actions to be performed based on whether a `WriteUnitOfWork` commits or rolls back. These actions
will typically update in-memory state to match what was written in the storage transaction, in a
transactional way. Note that `Change`s are not executed until the destruction of the
`WriteUnitOfWork`, which can be long after the storage engine committed. Two-phase locking ensures
that all locks are held while a Change's `commit()` or `rollback()` function runs.
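
A sketch of a typical `Change`, which publishes staged in-memory state only if the storage
transaction commits; the `Counters` type is hypothetical, and the virtual method signatures have
varied across server versions:

```
class CountersChange final : public RecoveryUnit::Change {
public:
    explicit CountersChange(Counters* counters) : _counters(counters) {}
    void commit(OperationContext*, boost::optional<Timestamp>) override {
        _counters->apply();    // the storage write is committed; publish in-memory state
    }
    void rollback(OperationContext*) override {
        _counters->discard();  // the storage transaction rolled back; drop staged state
    }

private:
    Counters* _counters;  // hypothetical in-memory state to keep transactional
};

// While a WriteUnitOfWork is active:
opCtx->recoveryUnit()->registerChange(std::make_unique<CountersChange>(&counters));
```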
## StorageUnavailableException
`StorageUnavailableException` indicates that a storage transaction rolled back due to
resource contention in the storage engine. This exception is the base of exceptions related to
concurrency (`WriteConflict`) and to those related to cache pressure (`TemporarilyUnavailable` and
`TransactionTooLargeForCache`).
We recommend using the [writeConflictRetry](https://github.com/mongodb/mongo/blob/9381db6748aada1d9a0056cea0e9899301e7f70b/src/mongo/db/concurrency/exception_util.h#L140)
helper which transparently handles all exceptions related to this error category.
### WriteConflictException
Writers may conflict with each other when more than one operation stages an uncommitted write to the
same document concurrently. To force one or more of the writers to retry, the storage engine may
throw a WriteConflictException at any point, up to and including the call to commit(). This is
referred to as optimistic concurrency control because it allows uncontended writes to commit
quickly. Because of this behavior, most WUOWs are enclosed in a writeConflictRetry loop that retries
the write transaction until it succeeds, accompanied by a bounded exponential back-off.
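
In code, the pattern typically looks like this sketch (argument types of the helper have varied
slightly across versions):

```
writeConflictRetry(opCtx, "exampleWrite", nss, [&] {
    WriteUnitOfWork wuow(opCtx);
    // ... stage the write; a WriteConflictException thrown anywhere in here,
    //     up to and including commit(), restarts this lambda with backoff ...
    wuow.commit();
});
```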
### TemporarilyUnavailableException
When the server parameter `enableTemporarilyUnavailableExceptions` is enabled (on by default), a
TemporarilyUnavailableException may be thrown inside the server to indicate that an operation cannot
complete without blocking and must be retried. The storage engine may throw a
TemporarilyUnavailableException (converted to a TemporarilyUnavailable error for users) when an
operation is excessively rolled-back in the storage engine due to cache pressure or any reason that
would prevent the operation from completing without impacting concurrent operations. The operation
may be at fault for writing too much uncommitted data, or it may be a victim. That information is
not exposed. However, if this error is returned, it is likely that the operation was the cause of
the problem, rather than a victim.
Before 6.0, this type of error was returned as a WriteConflict and retried indefinitely inside a
writeConflictRetry loop. As of 6.0, MongoDB will retry the operation internally at most
`temporarilyUnavailableMaxRetries` times, backing off for `temporarilyUnavailableBackoffBaseMs`
milliseconds, with a linearly-increasing backoff on each attempt. After this point, the error will
escape the handler and be returned to the client.
If an operation receives a TemporarilyUnavailable error internally, a `temporarilyUnavailableErrors`
counter will be displayed in the slow query logs and in FTDC.
Notably, this behavior does not apply to multi-document transactions, which continue to return a
WriteConflict to the client in this scenario without retrying internally.
See
[wtRcToStatus](https://github.com/mongodb/mongo/blob/c799851554dc01493d35b43701416e9c78b3665c/src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp#L178-L183)
where we throw the exception in WiredTiger.
See [TemporarilyUnavailableException](https://github.com/mongodb/mongo/blob/c799851554dc01493d35b43701416e9c78b3665c/src/mongo/db/concurrency/temporarily_unavailable_exception.h#L39-L45).
### TransactionTooLargeForCacheException
A TransactionTooLargeForCacheException may be thrown inside the server to indicate that an operation
was rolled-back and is unlikely to ever complete because the storage engine cache is insufficient,
even in the absence of concurrent operations. This is determined by a simple heuristic: after a
rollback, the proportion of total dirty cache bytes the running transaction represents is checked
against a threshold beyond which the transaction is considered unfulfillable. The threshold can be
tuned with the `transactionTooLargeForCacheThreshold` parameter. Setting this threshold to its
maximum value (1.0) causes the check to be skipped and TransactionTooLargeForCacheException to be
disabled.
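
Expressed as code, the heuristic reduces to roughly the following sketch; the parameter access and
byte counters are named illustratively:

```
// After a rollback, decide whether the transaction is unlikely to ever fit.
bool tooLargeForCache(double txnDirtyBytes, double totalDirtyCacheBytes) {
    const double threshold = transactionTooLargeForCacheThreshold;  // server parameter (illustrative)
    if (threshold >= 1.0)
        return false;  // maximum value (1.0) disables the check entirely
    return txnDirtyBytes / totalDirtyCacheBytes > threshold;
}
```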
On replica sets, if an operation succeeds on a primary, it should also succeed on a secondary. It
would be possible to convert this exception to either a TemporarilyUnavailableException or a
WriteConflictException, as if TransactionTooLargeForCacheException were disabled. But on secondaries
the only difference between the two is the rate at which the operation is retried. Hence,
TransactionTooLargeForCacheException is always converted to a WriteConflictException, which retries
faster, to avoid stalling replication longer than necessary.
Prior to 6.3, or when TransactionTooLargeForCacheException is disabled, multi-document
transactions always return a WriteConflictException, which may result in drivers retrying an
operation indefinitely. For non-multi-document operations, there is a limited number of retries on
TemporarilyUnavailableException, but it might still be beneficial to not retry operations which are
unlikely to complete and are disruptive for concurrent operations.
# Read Operations
External reads via the find, count, distinct, aggregate and mapReduce commands do not take
collection MODE_IS locks (mapReduce does continue to take MODE_IX collection locks for its writes).
Internal operations continue to take collection locks. Lock-free reads (which take only the global
lock in MODE_IS)
achieve this by establishing consistent in-memory and storage engine on-disk state at the start of
their operations. Lock-free reads explicitly open a storage transaction while setting up consistent
in-memory and on-disk read state. Internal reads with collection level locks implicitly start
storage transactions later via the MongoDB integration layer on the first attempt to read from a
collection or index. Unless a read operation is part of a larger write operation, the transaction
is rolled-back automatically when the last GlobalLock is released, explicitly during query yielding,
or from a call to abandonSnapshot(). Lock-free read operations must re-establish consistent state
after a query yield, just as at the start of a read operation.
See
[WiredTigerCursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/wiredtiger/wiredtiger_cursor.cpp#L48),
[WiredTigerRecoveryUnit::getSession()](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp#L303-L305),
[~GlobalLock](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/concurrency/d_concurrency.h#L228-L239),
[PlanYieldPolicy::_yieldAllLocks()](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/query/plan_yield_policy.cpp#L182),
[RecoveryUnit::abandonSnapshot()](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/recovery_unit.h#L217).
## Collection Reads
Collection reads act directly on a
[RecordStore](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/record_store.h#L202)
or
[RecordCursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/record_store.h#L102).
The Collection object also provides [higher-level
accessors](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/catalog/collection.h#L279)
to the RecordStore.
## Index Reads
Index reads act directly on a
[SortedDataInterface::Cursor](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/storage/sorted_data_interface.h#L214).
Most readers create cursors rather than interacting with indexes through the
[IndexAccessMethod](https://github.com/mongodb/mongo/blob/r4.4.0-rc13/src/mongo/db/index/index_access_method.h#L142).
## Read Locks
### Locked Reads
The
[`AutoGetCollectionForRead`](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/db_raii.cpp#L89)
(`AGCFR`) RAII type is used by most client read operations. In addition to acquiring all necessary
locks in the hierarchy, it ensures that operations reading at points in time are respecting the
visibility rules of collection data and metadata.
`AGCFR` ensures that operations reading at a timestamp do not read at times later than metadata
changes on the collection (see
[here](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/db_raii.cpp#L158)).
### Lock-Free Reads (global MODE_IS lock only)
Lock-free reads use the
[`AutoGetCollectionForReadLockFree`](https://github.com/mongodb/mongo/blob/4363473d75cab2a487c6a6066b601d52230c7e1a/src/mongo/db/db_raii.cpp#L429)
helper, which, in addition to the logic of `AutoGetCollectionForRead`, will establish consistent
in-memory and on-disk catalog state and data view. Locks are avoided by comparing in-memory fetched
state before and after an on-disk storage snapshot is opened. If the in-memory state differs before
and after, then the storage snapshot is abandoned and the code will retry until before and after
match. Lock-free reads skip collection and RSTL locks, so
[the repl mode/state and collection state](https://github.com/mongodb/mongo/blob/4363473d75cab2a487c6a6066b601d52230c7e1a/src/mongo/db/db_raii.cpp#L97-L102)
are compared before and after. In general, lock-free reads work by acquiring all the 'versioned'
state needed for the read at the beginning, rather than relying on a collection-level lock to keep
the state from changing.
When setting up the read, `AutoGetCollectionForRead` will trigger the instantiation of a
`Collection` object when either:
* Reading at an earlier time than the minimum valid snapshot of the matching `Collection` from the `CollectionCatalog`.
* No matching `Collection` is found in the `CollectionCatalog`.
In versions earlier than v7.0 this would error with `SnapshotUnavailable`.
Sharding `shardVersion` checks still occur in the appropriate query plan stage code when/if the
shard filtering metadata is acquired. The `shardVersion` check after
`AutoGetCollectionForReadLockFree` sets up combined with a read request's `shardVersion` provided
before `AutoGetCollectionForReadLockFree` runs is effectively a before and after comparison around
the 'versioned' state setup. The sharding protocol obviates any special changes for lock-free
reads other than consistently using the same view of the sharding metadata
[(acquiring it once and then passing it where needed in the read code)](https://github.com/mongodb/mongo/blob/9e1f0ea4f371a8101f96c84d2ecd3811d68cafb6/src/mongo/db/catalog_raii.cpp#L251-L273).
### Consistent Data View with Operations Running Nested Lock-Free Reads
Currently the commands that support lock-free reads are `find`, `count`, `distinct`, `aggregate` and
`mapReduce` (`mapReduce` still takes collection IX locks for its writes). These commands may use
nested `AutoGetCollection*LockFree` helpers in sub-operations. Therefore, the first lock-free lock
helper will establish a consistent in-memory and on-disk metadata and data view, and sub-operations
will use the higher level state rather than establishing their own. This is achieved by the first
lock helper fetching a complete immutable copy of the `CollectionCatalog` and saving it on the
`OperationContext`. Subsequent lock free helpers will check an `isLockFreeReadsOp()` flag on the
`OperationContext` and skip establishing their own state: instead, the `CollectionCatalog` saved on
the `OperationContext` will be accessed and no new storage snapshot will be opened. Saving a
complete copy of the entire in-memory catalog provides flexibility for sub-operations that may need
access to any collection.
_Code spelunking starting points:_
* [_AutoGetCollectionForReadLockFree preserves an immutable CollectionCatalog_](https://github.com/mongodb/mongo/blob/dcf844f384803441b5393664e500008fc6902346/src/mongo/db/db_raii.cpp#L141)
* [_AutoGetCollectionForReadLockFree returns early if already running lock-free_](https://github.com/mongodb/mongo/blob/dcf844f384803441b5393664e500008fc6902346/src/mongo/db/db_raii.cpp#L108-L112)
* [_The lock-free operation flag on the OperationContext_](https://github.com/mongodb/mongo/blob/dcf844f384803441b5393664e500008fc6902346/src/mongo/db/operation_context.h#L298-L300)
## Secondary Reads
The oplog applier applies entries out-of-order to provide parallelism for data replication. This
exposes readers with no set read timestamp to the possibility of seeing inconsistent states of data.
Because of this, reads on secondaries are generally required to read at replication's
[lastApplied](../repl/README.md#replication-timestamp-glossary) optime instead (see
[SERVER-34192](https://jira.mongodb.org/browse/SERVER-34192)). LastApplied is used because on
secondaries it is only updated after each oplog batch, which is a known consistent state of data.
AGCFR provides the mechanism for secondary reads. This is implemented by switching the ReadSource of [eligible
readers](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/storage/snapshot_helper.cpp#L106)
to read at
[kLastApplied](https://github.com/mongodb/mongo/blob/58283ca178782c4d1c4a4d2acd4313f6f6f86fd5/src/mongo/db/storage/recovery_unit.h#L411).
# Write Operations
Operations that write to collections and indexes must take collection locks. Storage engines require
all operations to hold at least a collection IX lock to provide document-level concurrency.
Operations must perform writes in the scope of a WriteUnitOfWork.
## WriteUnitOfWork
All reads and writes in the scope of a WriteUnitOfWork (WUOW) operate on the same storage engine
snapshot, and all writes in the scope of a WUOW are transactional; they are either all committed or
all rolled-back. The WUOW commits writes that took place in its scope by a call to commit(). It
rolls-back writes when it goes out of scope and its destructor is called before a call to commit().
The WriteUnitOfWork has a [`groupOplogEntries` option](https://github.com/mongodb/mongo/blob/fa32d665bd63de7a9d246fa99df5e30840a931de/src/mongo/db/storage/write_unit_of_work.h#L67)
to replicate multiple writes transactionally. This option uses the [`BatchedWriteContext` class](https://github.com/mongodb/mongo/blob/9ab71f9b2fac1e384529fafaf2a819ce61834228/src/mongo/db/batched_write_context.h#L46)
to stage writes and to generate a single applyOps entry at commit, similar to what multi-document
transactions do via the [`TransactionParticipant` class](https://github.com/mongodb/mongo/blob/219990f17695b0ea4695f827a42a18e012b1e9cf/src/mongo/db/transaction/transaction_participant.h#L82).
Unlike a multi-document transaction, the applyOps entry lacks the `lsid` and `txnNumber`
fields. Callers must ensure that the WriteUnitOfWork does not generate more than 16MB of oplog,
otherwise the operation will fail with `TransactionTooLarge` code.
As of MongoDB 6.0, the `groupOplogEntries` mode is only used by the [BatchedDeleteStage](https://github.com/mongodb/mongo/blob/9676cf4ad8d537518eb1b570fc79bad4f31d8a79/src/mongo/db/exec/batched_delete_stage.h)
for efficient mass-deletes.
See
[WriteUnitOfWork](https://github.com/mongodb/mongo/blob/fa32d665bd63de7a9d246fa99df5e30840a931de/src/mongo/db/storage/write_unit_of_work.h).
## Collection and Index Writes
Collection write operations (inserts, updates, and deletes) perform storage engine writes to both
the collection's RecordStore and the relevant indexes' SortedDataInterfaces in the same storage
transaction, or WUOW. This ensures that completed indexes (those not still building) are always
consistent with collection data.
## Vectored Inserts
To support efficient bulk inserts, we provide an internal API on collections, insertDocuments, that
supports 'vectored' inserts. Writers that wish to insert a vector of many documents will
bulk-allocate OpTimes, one per document, into a vector and pass both vectors to insertDocuments. In
a single WriteUnitOfWork, for every document, a commit timestamp is set with the OpTime, and the
document is
inserted. The storage engine allows each document to appear committed at a different timestamp, even
though all writes took place in a single storage transaction.
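
The timestamping pattern can be sketched as follows; the insert call is elided, and `docs` and
`opTimes` stand in for the bulk-allocated vectors described above:

```
// One storage transaction; each document gets its own commit timestamp.
WriteUnitOfWork wuow(opCtx);
for (size_t i = 0; i < docs.size(); ++i) {
    uassertStatusOK(opCtx->recoveryUnit()->setTimestamp(opTimes[i].getTimestamp()));
    // ... insert docs[i] into the RecordStore and update the indexes ...
}
wuow.commit();  // all writes commit atomically, each visible at its own timestamp
```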
See
[insertDocuments](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/ops/write_ops_exec.cpp#L315)
and
[WiredTigerRecordStore::insertRecords](https://github.com/mongodb/mongo/blob/r4.4.0/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp#L1494).
# Concurrency Control
Theoretically, one could design a database that used only mutexes to maintain database consistency
while supporting multiple simultaneous operations; however, this solution would result in poor
performance and would be a strain on the operating system. Therefore, databases typically use a more
complex method of coordinating operations. This design consists of Resources (lockable entities),
some of which may be organized in a Hierarchy, and Locks (requests for access to a resource). A Lock
Manager is responsible for keeping track of Resources and Locks, and for managing each Resource's
Lock Queue. The Lock Manager identifies Resources with a ResourceId.
## Resource Hierarchy
In MongoDB, Resources are arranged in a hierarchy, in order to provide an ordering to prevent
deadlocks when locking multiple Resources, and also as an implementation of Intent Locking (an
optimization for locking higher level resources). The hierarchy of ResourceTypes is as follows:
1. Global (three - see below)
1. Database (one per database on the server)
1. Collection (one per collection on the server)
Each resource must be locked in order from the top. Therefore, if a Collection resource is desired
to be locked, one must first lock the one Global resource, and then lock the Database resource that
is the parent of the Collection. Finally, the Collection resource is locked.
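
In code, this top-down ordering corresponds to acquisitions like the following sketch; most call
sites use the `AutoGet*` helpers, which perform these acquisitions internally, and exact parameter
types vary across versions:

```
// Acquire intent locks from the top of the hierarchy down to the collection.
Lock::GlobalLock globalLock(opCtx, MODE_IX);
Lock::DBLock dbLock(opCtx, nss.dbName(), MODE_IX);
Lock::CollectionLock collLock(opCtx, nss, MODE_IX);
// Documents in the collection may now be written with document-level concurrency.
```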
In addition to these ResourceTypes, there also exists ResourceMutex, which is independent of this
hierarchy. One can use ResourceMutex instead of a regular mutex if one desires the features of the
lock manager, such as fair queuing and the ability to have multiple simultaneous lock holders.
## Lock Modes
The lock manager keeps track of each Resource's _granted locks_ and a queue of _waiting locks_.
Rather than the binary "locked-or-not" modes of a mutex, a MongoDB lock can have one of several
_modes_. Modes have different _compatibilities_ with other locks for the same resource. Locks with
compatible modes can be simultaneously granted to the same resource, while locks with modes that are
incompatible with any currently granted lock on a resource must wait in the waiting queue for that
resource until the conflicting granted locks are unlocked. The different types of modes are:
1. X (exclusive): Used to perform writes and reads on the resource.
2. S (shared): Used to perform only reads on the resource (thus, it is okay to Share with other
compatible locks).
3. IX (intent-exclusive): Used to indicate that an X lock is taken at a level in the hierarchy below
this resource. This lock mode is used to block X or S locks on this resource.
4. IS (intent-shared): Used to indicate that an S lock is taken at a level in the hierarchy below
this resource. This lock mode is used to block X locks on this resource.
## Lock Compatibility Matrix
This matrix answers the question, given a granted lock on a resource with the mode given, is a
requested lock on that same resource with the given mode compatible?
| Requested Mode ↓ / Granted Mode → | MODE_NONE | MODE_IS | MODE_IX | MODE_S | MODE_X |
|:----------------------------------|:---------:|:-------:|:-------:|:------:|:------:|
| MODE_IS                           |     Y     |    Y    |    Y    |   Y    |   N    |
| MODE_IX                           |     Y     |    Y    |    Y    |   N    |   N    |
| MODE_S                            |     Y     |    Y    |    N    |   Y    |   N    |
| MODE_X                            |     Y     |    N    |    N    |   N    |   N    |
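
The matrix can be read as a lookup table; the following sketch encodes it directly (an illustration,
not the lock manager's actual representation):

```
enum LockMode { MODE_NONE, MODE_IS, MODE_IX, MODE_S, MODE_X };

// Rows: requested mode; columns: granted mode (same order as the enum).
bool isCompatible(LockMode requested, LockMode granted) {
    static const bool kCompatible[5][5] = {
        /* MODE_NONE */ {true, true,  true,  true,  true},
        /* MODE_IS   */ {true, true,  true,  true,  false},
        /* MODE_IX   */ {true, true,  true,  false, false},
        /* MODE_S    */ {true, true,  false, true,  false},
        /* MODE_X    */ {true, false, false, false, false},
    };
    return kCompatible[requested][granted];
}
```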
Typically, locks are granted in the order they are queued, but some LockRequest behaviors can be
specially selected to break this rule. One behavior is _enqueueAtFront_, which allows important lock
acquisitions to cut to the front of the line, in order to expedite them. Currently, all mode X and S
locks for the three Global Resources (Global, FCV, RSTL) automatically use this option.
Another behavior is _compatibleFirst_, which allows compatible lock requests to cut ahead of others
waiting in the queue and be granted immediately; note that this mode might starve queued lock
requests indefinitely.
### Replication State Transition Lock (RSTL)
The Replication State Transition Lock is of ResourceType Global, so it must be locked prior to
locking any Database level resource. This lock is used to synchronize replica state transitions
(typically transitions between PRIMARY, SECONDARY, and ROLLBACK states).
More information on the RSTL is contained in the [Replication Architecture Guide](https://github.com/mongodb/mongo/blob/b4db8c01a13fd70997a05857be17548b0adec020/src/mongo/db/repl/README.md#replication-state-transition-lock)
### Global Lock
The resource known as the Global Lock is of ResourceType Global. It is currently used to
synchronize shutdown, so that all operations are finished with the storage engine before closing it.
Certain types of global storage engine operations, such as recoverToStableTimestamp(), also require
this lock to be held in exclusive mode.
### Tenant Lock
A resource of ResourceType Tenant is used when a database belongs to a tenant. It is used to synchronize
change streams enablement and disablement for a tenant operation with other operations associated with the tenant.
Enabling or disabling of change streams (by creating or dropping a change collection) for a tenant takes this lock
in exclusive (X) mode. Acquiring this resource with an intent lock is an indication that the operation is doing reads (IS)
or writes (IX) at the database or lower level.
### Database Lock
Any resource of ResourceType Database protects certain database-wide operations such as database
drop. These operations are being phased out, in the hopes that we can eliminate this ResourceType
completely.
### Collection Lock
Any resource of ResourceType Collection protects certain collection-wide operations, and in some
cases also protects the in-memory catalog structure consistency in the face of concurrent readers
and writers of the catalog. Acquiring this resource with an intent lock is an indication that the
operation is doing explicit reads (IS) or writes (IX) at the document level. There is no Document
ResourceType, as locking at this level is done in the storage engine itself for efficiency reasons.
### Document Level Concurrency Control
Each storage engine is responsible for locking at the document level. The WiredTiger storage engine
uses [multiversion concurrency control][] (MVCC) along with optimistic locking to provide
concurrency guarantees.
## Two-Phase Locking
The lock manager automatically provides _two-phase locking_ for a given storage transaction.
Two-phase locking consists of an Expanding phase where locks are acquired but not released, and a
subsequent Shrinking phase where locks are released but not acquired. By adhering to this protocol,
a transaction will be guaranteed to be serializable with other concurrent transactions. The
WriteUnitOfWork class manages two-phase locking in MongoDB. This results in the somewhat unexpected
behavior of the RAII locking types acquiring locks on resources upon their construction but not
unlocking the lock upon their destruction when going out of scope. Instead, the responsibility of
unlocking the locks is transferred to the WriteUnitOfWork destructor. Note this is only true for
transactions that do writes, and therefore only for code that uses WriteUnitOfWork.
# Indexes
An index is a storage engine data structure that provides efficient lookup on fields in a
collection's data set. Indexes map document fields, keys, to documents such that a full collection
scan is not required when querying on a specific field.
All user collections have a unique index on the `_id` field, which is required. The oplog and some
system collections do not have an `_id` index.
Also see [MongoDB Manual - Indexes](https://docs.mongodb.com/manual/indexes/).
## Index Constraints
### Unique indexes
A unique index maintains a constraint such that duplicate values are not allowed on the indexed
field(s).
To convert a regular index to unique, one has to follow a two-step process:

* The index has to first be set to the `prepareUnique` state using the `collMod` command with the
  index option `prepareUnique: true`. In this state, the index will start rejecting writes that
  introduce duplicate keys.
* The `collMod` command with the index option `unique: true` will then check for the uniqueness
  constraint and finally update the index spec in the catalog under a collection `MODE_X` lock.

If the index already has duplicate keys, the conversion in step two will fail and return all
violating documents' ids grouped by the keys. Step two can be retried to finish the conversion after
all violating documents are fixed. Otherwise, the conversion can be cancelled using the `collMod`
command with the index option `prepareUnique: false`.
The `collMod` option `dryRun: true` can also be specified to check for duplicates in the index