Notes and homework related to Week 4: Performance. 3 hours 45 min. of videos and notes, 45 min. of homework.
- Homework 4.1 Which of the following queries can utilize an index?
- Homework 4.2 What can you infer from the following
explain
output? - Homework 4.3 Making the blog fast
- Homework 4.4 Analyze a profile log taken from a mongoDB instance
- Creating an index:
db.students.ensureIndex({ student_id: 1, class: -1})
puts a composite index on fieldstudent_id
, in ascending order, and on fieldclass
, in descending order; order has no effect on searching, but makes a difference when sorting on these fields - Finding indexes: all with
db.system.indexes.find()
; for a given collection withdb.students.getIndexes()
- Dropping an index:
db.students.dropIndex({ student_id: 1})
- Multikey indexes: automatically created when inserting an array in a field participating in an index; for compound indexes, only one of the fields participating in the index can be an array (Mongo cannot index parallel arrays); multikey indexes can be defined for subdocuments also:
db.people.ensureIndex( {"adresses.tag": 1})
- Creating unique indexes:
db.students.ensureIndex({ student_id: 1}, { unique: true})
- … and removing duplicate keys:
db.students.ensureIndex({ student_id: 1}, { unique: true, dropDups: true})
; be very careful: there is No Way to determine which of the duplicates keys will be removed; and the removed documents are gone forever - Sparse indexes: if the key of an index is missing within a document, it maps to Null; if the index is unique, there can be only one null; it is however possible to create sparse indexes, that index only keys being set:
db.products.ensureIndex({ size: 1}, { [unique: true,] sparse: true})
; beware: when sorting on the key of a sparse index, documents without an entry won't appear in the results! it is an artifact of the sparse index, which gets used by the sort, therefore changing its semantics - Creating indexes in the background: by default, indexes are created in the foreground, which is faster, but is blocking all writers; with
db.people.ensureIndex( {…}, { background: true})
, it will be created in the background, not blocking other writers, however being slower; in a replica set, one server can be pulled out of the set to be indexed, and then put back again; a mongod instance can only build one background index at a time per database
db.foo.find({c:1}).explain()
"cursor": "BasicCursor"
--> full collection scan"millis": 3
--> time taken"cursor": "BtreeCursor a_1_b_1_c_1"
--> indexa_1_b_1_c_1
was used"isMultiKey": false
--> none of the values are arrays"n": 1
--> number of documents returned"nscannedObjects": 1
--> number of documents that were scanned"nscanned": 1
--> number of documents or index entries that were scanned"indexBounds": [ …]
--> strategy followed and boundary values"indexOnly": false
-->true
if everything queryied for is satisfied by the index, without retrieving the actual document
- When is an index used? When there are many indexes available for a query, MongoDB runs different query plans in parallel with the different indexes, and memorizes the one that returned first; this happens automatically behind the scenes; every hundred or so queries, it will forget what it learned and start over.
- Index Size: if an index can be kept in memory, it is blazing fast; use
db.students.stats()
to see how much place the data takes on disk;db.students.totalIndexSize()
to see how much place the indexes take - Index cardinality: in regular indexes, there is an index point for every key (#index points --1:1-- #documents); in sparse indexes, there could be less index points than documents, as null keys are dropped (#index points <= # documents); in multikey documents, there could be more index points than documents, as we have one for each value in arrays (#index points > #documents)
- Index selectivity: have them be very selective if possible, it improves performance; for instance, in a logging collection, with millions of documents having a timestamp and opcode, where the opcode represents a small set of operations, it is best to create an index on
(timestamp, opcode)
rather than(opcode, timestamp)
, because queries on a time interval with the first index will eliminate 4/5 of the collection's data upfront, leaving only 1/5 documents to scan within on the opcode - Providing hints to the database:
db.foo.find().hint({ a:1, b:1, c:1})
; or.hint( "$natural": 1)
to use no index (the «natural index») and have a cursor that goes thru all documents - … in Pymongo:
doc = db.foo.find(query).hint( [ ("c", pymongo.ASCENDING)])
(providing an array of tuples, as dictionaries are not ordered in Python -- as they are in Javascript and JSON) - Efficiency of index use:
$gt
,$lt
,$ne
operators are not particulary efficient to use with an index, as they leave a lot of documents to go thru; the same for regexps, which require generally to go thru all documents -- to the exception of^abc
regexps (matches strings starting with "abc"); so it is important to not only consider if and which index was used for a query, but also how it was used, if the database still had to go thru millions of documents
- What is the location closest to an (x,y) point?
db.places.ensureIndex({ location: "2d"})
- Finding with
db.places.find({ location: { "$near": [ x, y]}})
; will be return by order of increasing distance, so add a limit.limit( 3)
- To have geospatial indexes considering the curvature of the earth, in a spherical space model: you still create the index with
.ensureIndex({ stores: "2d"})
, but run a query withdb.runCommand({ geoNear: "stores"}, near:[ long, lat], spherical: true, maxDistance: rad})
-- order of the tuple (long, lat) matters, longitude comes first, then latitude
- Log slow queries is built in: above 100ms, Mongo automatically writes an entry within its logs (console or logfiles)
- Profiler levels: 0 = off; 1 = log slow queries; 2 = log all queries (debugging)
- Start mongod with
mongod --dbpath … --profile 1 --slowms 2
to have Mongo automatically profile slow queries running more than 2ms - Use
db.system.profile.find()
to see what was logged; notedb.system.profile
is a capped-collection, which rotates its contents - Use
db.system.profile.find({ ns:/school.students/}).sort({ ts: 1})
to find queries on collectionstudents
within DBschool
, sorting by the timestamp - Use
db.system.profile.find({ millis: { $gt: 1}}).sort({ ts: 1})
to find queries that were slower than 1ms - Turning the profiler from the Mongo shell:
db.getProfilingLevel()
to read the level,db.getProfilingStatus()
to read the status,db.setProfilingLevel( 1, 4)
to set profiling level to 1 and look for slow queries longer than 4ms;db.setProfilingLevel( 0)
to turn it off - Mongostat: most looked information is
idx miss %
--> index miss rate in memory, percentage of queries where Mongo has to go to disk; beware when it says 0: consider if either the query uses no index at all, or it resides entirely in memory - Mongotop: high-level view of where Mongo is spending its time;
mongotop 3
will display how much time MongoDB spent accessing/updating each collection in the repeating interval of the last 3 seconds
- Sharding is a technique for splitting up a large collection among multiple servers; in front of multiple
mongod
sharded servers, there is amongos
server, which is a router; the application accesses thismongos
server - About architecture: The
mongos
server is often colocated on the same machine as the app server. There can be manymongos
servers. - Then, the
mongod
servers are generally on different physical servers, and each of them can -- it is recommended -- be part of a replicat set, that is, another group ofmongod
servers, where data is kept in sync, so that if one server goes down, no data is lost; the replica set can be looked at as one shard - Mongo shards according to a shard key; when performing queries and updates, the shard key will be used to sent them to the right system
- An insert must include the shard key (the entire shard key if it's a compound key) to complete; so the developer has to be aware of what the shard key is
- For find/update/remove operations, to get higher performance, the shard key should be specified, if one knowns it; otherwise,
mongos
will have to broadcast the queries to themongod
servers, which will keep each of them busy - By the way, updates on sharded collections should be made multi
- Choosing a shard key is a topic in itself.
- Indexes are critical to performance
- Explain command to look how the database uses the indexes for particular queries
- Hint command to instruct the database which particular index to use
- Profiling to figure out which queries are slow, then use explain and hinting to improve them, possibly create new indexes.