- Run on 1000 machines, not 1
- 64MB chunks
- Replicates each chunk across machines
- By doing so, the system is impervious to a whole class of hardware failures:
    - Power supply failure
    - Loss of power to the rack
    - Network failure
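To make the replication point concrete, here is a toy rack-aware placement sketch; it is not the actual GFS/HDFS policy, just an illustration of why spreading replicas across racks survives rack-level power loss:

```python
def place_replicas(machines, n_replicas=3):
    """Pick replica hosts for a chunk, spreading across racks first.

    `machines` is a list of (host, rack) pairs; all names are hypothetical,
    and this is only an illustration, not the real GFS/HDFS placement policy.
    """
    chosen = []
    racks_used = set()
    # First pass: prefer machines on racks we haven't used yet, so losing
    # one rack's power can't take out every replica of the chunk.
    for host, rack in machines:
        if len(chosen) == n_replicas:
            break
        if rack not in racks_used:
            chosen.append(host)
            racks_used.add(rack)
    # Second pass: fill any remaining slots from already-used racks.
    for host, rack in machines:
        if len(chosen) == n_replicas:
            break
        if host not in chosen:
            chosen.append(host)
    return chosen

machines = [("m1", "rackA"), ("m2", "rackA"), ("m3", "rackB"), ("m4", "rackC")]
print(place_replicas(machines))  # ['m1', 'm3', 'm4'] -- three different racks
```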
- Map/Reduce
- Bigtable
- Not relational
- Modeled after Google’s Bigtable
- One big, massive primary-keyed table
- No transactions (maybe in the future)
- Scalable
- High random insert, update, and delete rate
    - Loaded 1TB into a 9-node Hypertable cluster and sustained random inserts at 1M inserts per second (quad-core Intel, 16GB RAM, SATA 3Gb/s)
- Sparse, 2D table with cell versions
- One table might have 2 columns and the next one 1M; that’s OK
- 4-part key:
    - Row
    - Column Family
    - Column Qualifier
    - Timestamp
- Tim O’Reilly walks in and looks around for a seat; they’re all taken
- Row key is 0-terminated
- Column family is a single byte (256 possible values)
- Column qualifier is 0-terminated
- Timestamp is big-endian one’s complement (so memcmp ordering puts more recent versions ahead of older ones)
- Bigtable uses copy-on-write
- Hypertable uses a form of MVCC (like CouchDB): deletes are inserted as delete records, and multiple versions are kept around
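A minimal sketch of how such a key can be serialized so that plain byte comparison (memcmp) yields ascending rows with the newest cell version first. The exact field widths and layout here are assumptions for illustration, not Hypertable’s actual on-disk format:

```python
import struct

MAX64 = 0xFFFFFFFFFFFFFFFF

def encode_key(row: bytes, family: int, qualifier: bytes, timestamp: int) -> bytes:
    """Serialize a 4-part key: 0-terminated row, single-byte column family,
    0-terminated qualifier, then the big-endian one's complement of the
    timestamp, so larger (newer) timestamps encode as smaller byte strings."""
    assert 0 <= family <= 255
    inverted_ts = struct.pack(">Q", (~timestamp) & MAX64)  # one's complement
    return row + b"\x00" + bytes([family]) + qualifier + b"\x00" + inverted_ts

newer = encode_key(b"com.example.www", 1, b"title", 2008)
older = encode_key(b"com.example.www", 1, b"title", 1999)
assert newer < older  # more recent version sorts ahead of the older one
assert encode_key(b"a", 1, b"q", 5) < encode_key(b"b", 1, b"q", 5)  # rows ascend
```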
- 65K blocks of compressed key/value pairs
- Bloom filter: booya!
- Hyperspace: a distributed lock manager plus a small metadata filesystem (built on Berkeley DB)
- Chubby is Google’s equivalent of Hyperspace
- The master’s function is to perform metadata operations (ALTER, CREATE, etc.)
- Clients can communicate directly with range servers
- The master can be down for a while with no one even noticing
- Hot standby design for availability
- Range servers: responsible for updating and scanning
- It all sits on top of a distributed FS: HDFS (Hadoop) or KFS (a GFS clone)
- Manages ranges of table data
- Caches updates in memory (CellCache)
- Spills (compacts) periodically to disk (CellStore)
- When updates come into a range server, they’re written to a commit log before the in-memory data structures are updated, so the log can be replayed after a crash
- When a range server does anything (moves, stops), it’s written into the log
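The log-then-memory write path above can be sketched roughly like this; class and field names are invented for illustration and this is not Hypertable’s implementation:

```python
import json

class RangeServerSketch:
    """Toy write path: append to a commit log first, then update the
    in-memory CellCache; a compaction spills the cache into a sorted,
    immutable CellStore. Purely illustrative."""

    def __init__(self):
        self.commit_log = []   # stand-in for the on-disk commit log
        self.cell_cache = {}   # in-memory updates not yet compacted
        self.cell_stores = []  # sorted, immutable spill files

    def update(self, key, value):
        # Durability first: the log record is written before memory changes,
        # which is what makes replay after a crash possible.
        self.commit_log.append(json.dumps({"key": key, "value": value}))
        self.cell_cache[key] = value

    def compact(self):
        # Spill the CellCache to a sorted CellStore and start a fresh cache.
        self.cell_stores.append(sorted(self.cell_cache.items()))
        self.cell_cache = {}

    def replay(self):
        # Crash recovery: rebuild the CellCache from the commit log.
        self.cell_cache = {}
        for line in self.commit_log:
            rec = json.loads(line)
            self.cell_cache[rec["key"]] = rec["value"]

rs = RangeServerSketch()
rs.update("row1", "a")
rs.update("row2", "b")
rs.cell_cache = {}   # simulate a crash wiping the in-memory state
rs.replay()
print(rs.cell_cache)  # {'row1': 'a', 'row2': 'b'} -- recovered from the log
```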
- The C++ client is the only one supported ATM:
    - You modify a table by creating a mutator
    - You scan a table by creating a scanner
- Thrift broker in the works
- Someone contributed a Hadoop Map/Reduce connector
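The mutator/scanner usage pattern might look roughly like the following; all class and method names here are hypothetical stand-ins, not the real Hypertable C++ API:

```python
class TableSketch:
    """Illustrates the mutator/scanner pattern only; the actual
    Hypertable client API differs."""
    def __init__(self):
        self.cells = {}

class Mutator:
    """Buffers writes and flushes them in batches, amortizing round trips."""
    def __init__(self, table, batch_size=2):
        self.table, self.batch_size, self.buffer = table, batch_size, []

    def set(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        for k, v in self.buffer:
            self.table.cells[k] = v
        self.buffer = []

class Scanner:
    """Iterates rows in a half-open key range [start, end) in sorted order."""
    def __init__(self, table, start, end):
        self.rows = sorted(k for k in table.cells if start <= k < end)

    def __iter__(self):
        return iter(self.rows)

t = TableSketch()
m = Mutator(t)
for k in ["a", "b", "c"]:
    m.set(k, 1)
m.flush()  # push anything still buffered
print(list(Scanner(t, "a", "c")))  # ['a', 'b']
```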
- CellStore: compressed key/value pairs
- Commit log: compressed blocks (optionally)
- Supported compression types:
    - zlib (fastest/best)
    - lzo (high decompression speed)
    - quicklz (fast decompression, high ratio)
    - bmz (longest common substring; good for data with lots of repetition)
    - none
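As a small illustration of block compression on repetitive key/value data, using Python’s stdlib zlib rather than Hypertable’s actual compressor framework:

```python
import zlib

# Build a block of key/value pairs with lots of repeated substrings --
# the kind of data these block compressors are aimed at. Keys and
# values here are made up for the example.
block = b"".join(
    b"com.example.www\x00page:%d\x00some repeated value text\n" % i
    for i in range(1000)
)
compressed = zlib.compress(block)
assert zlib.decompress(compressed) == block  # lossless round trip
print(len(block), len(compressed))           # repetitive data shrinks a lot
```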
- CellStore block size (of compressed KV pairs) is configurable
- Bloom filter (not finished implementing):
    - Caches results
    - Negative cache
    - Configurable K (number of hash functions)
    - Allows you to find out if you definitely *don’t* have the data
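A minimal Bloom filter sketch showing the configurable-K idea and the “definitely don’t have it” guarantee; this is a generic implementation, not Hypertable’s:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter with a configurable number of hash functions (K).
    A 'no' answer is definitive; a 'yes' may be a false positive."""

    def __init__(self, m_bits=1024, k=3):
        self.m, self.k = m_bits, k
        self.bits = 0  # bit array stored as one big integer

    def _positions(self, key: bytes):
        # Derive K independent bit positions by salting a hash with the index.
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False means the key was definitely never added.
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter(k=4)
bf.add(b"row123")
assert bf.might_contain(b"row123")  # no false negatives, ever
print(bf.might_contain(b"some-absent-key"))  # almost certainly False
```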
- Session table and crawl table
- Splits them all up into ranges, which go to range servers
- Just add more machines, and the system migrates data equally
- Balancing is questionable…
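One way to picture range assignment and rebalancing when machines are added; this round-robin scheme is a deliberate oversimplification, not the master’s actual algorithm:

```python
def assign_ranges(ranges, servers):
    """Round-robin a sorted list of row ranges across range servers;
    re-running after adding a server spreads the load evenly.
    Purely illustrative of the idea."""
    assignment = {s: [] for s in servers}
    for i, rng in enumerate(ranges):
        assignment[servers[i % len(servers)]].append(rng)
    return assignment

ranges = [("a", "f"), ("f", "m"), ("m", "s"), ("s", "~")]
before = assign_ranges(ranges, ["rs1", "rs2"])
after = assign_ranges(ranges, ["rs1", "rs2", "rs3"])
print(before)  # two ranges per server
print(after)   # adding rs3 lowers the per-server load
```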
- Control of physical layout: hybrid row/column orientation
- Improves performance by minimizing I/O
- Grouping columns allows physical storage control
- Makes faster updates possible
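The access-group idea (storing groups of column families in physically separate stores, so a scan touches only the columns it needs) can be sketched like this, with illustrative names only:

```python
class AccessGroupStore:
    """Sketch of grouping column families into physically separate stores,
    giving a hybrid row/column layout. Not Hypertable's implementation."""

    def __init__(self, groups):
        # groups: {group_name: [column_family, ...]}
        self.group_of = {cf: g for g, cfs in groups.items() for cf in cfs}
        self.stores = {g: {} for g in groups}

    def put(self, row, column_family, value):
        self.stores[self.group_of[column_family]][(row, column_family)] = value

    def scan_group(self, group):
        # Touches only one physical store; the other groups' data
        # never needs to be read, which is the I/O saving.
        return sorted(self.stores[group].items())

s = AccessGroupStore({"meta": ["title", "mime"], "content": ["raw"]})
s.put("r1", "title", "Home")
s.put("r1", "raw", "<html>...</html>")
print(s.scan_group("meta"))  # only the small metadata store is read
```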
- Can run on any distributed FS
- FUSE hooks
- Comparison to HBase (Java, yuck); C++ is much better:
    - The system is designed for async communication
    - Hypertable is CPU intensive
    - Java uses 2-3 times the memory for large memory maps
    - Poor processor cache performance
AOL Query logs
-
75,275,825 inserted cells
-
8-node cluster (1 1.8 Ghz Dual Core Opteron)
-
4GB RAM
-
3x 7200 SATA
-
-
Row Key 7B
-
Avg value 15B
-
Crap. Slide change
-
Another test yielded over 1M sustained inserts/s
- Range data is managed by a single range server
- No data loss, but if it goes down, bad bad
- Can be mitigated with a client-side cache or memcached
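A rough sketch of the client-side cache mitigation mentioned above, assuming a hypothetical lookup callable that raises while the range server is down:

```python
class CachedClient:
    """Read-through cache in front of a single range server, as a rough
    mitigation for its unavailability. All names are illustrative."""

    def __init__(self, rangeserver_lookup):
        self.lookup = rangeserver_lookup  # callable; raises when the server is down
        self.cache = {}

    def get(self, key):
        try:
            value = self.lookup(key)
            self.cache[key] = value  # refresh on every successful read
            return value
        except ConnectionError:
            # Server down: serve the (possibly stale) cached value if present.
            if key in self.cache:
                return self.cache[key]
            raise

store = {"row1": "v1"}
def lookup(key):
    if store is None:
        raise ConnectionError("range server down")
    return store[key]

c = CachedClient(lookup)
assert c.get("row1") == "v1"
store = None                   # simulate the range server going down
print(c.get("row1"))           # v1 -- stale but still served from the cache
```

Reads stay available at the cost of staleness; writes to the down range, of course, still have nowhere to go.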
- Alpha: 0.9.0.7 released
- Beta at the end of August
- Waiting on Hadoop JIRA 1700: a bug in Hadoop that doesn’t allow appending to existing files
- GPL 2
- Delete records get flushed out in a “major compaction”