
OSCON 2008, Session 3: Hypertable

GFS

  • Runs on 1000 machines, not 1

Filesystem

  • 64MB chunks

  • Replicates each chunk across machines

  • By doing so, the system is impervious to a whole class of hardware failures

    • Power supply

    • Power to the rack

    • Network failure

  • Map/Reduce

  • Bigtable

Hypertable

  • Not relational

  • Modeled after Google’s Bigtable

  • One massive primary-keyed table

  • No transactions, maybe in the future

  • Scalable

  • High random insert, update, and delete rates

  • Loaded 1TB into a 9-node Hypertable cluster, sustaining random inserts at 1M inserts per second (quad-core Intel, 16GB RAM, SATA 3Gb/s)

Data Model

  • Sparse, 2D table with cell versions

  • One table might have 2 columns, the next one 1M; that’s OK

  • 4-part key

    • Row

    • Column Family

    • Column Qualifier

    • Timestamp

  • Tim O’Reilly walks in and looks around for a seat, they’re all taken

Anatomy of a key

  • Row key is 0-terminated

  • Column family is a single byte (256 possible)

  • Column qualifier is 0-terminated

  • Timestamp is big-endian, one’s complement (so memcmp ordering puts more recent versions ahead of older ones; sketch below)
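
A minimal sketch of how a key with this layout could be serialized so that a plain memcmp gives the ordering just described (row, family, qualifier ascending; newest timestamp first). The CellKey struct and the serialize/key_less names are made up for illustration, not Hypertable's actual internals.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Illustrative only; not Hypertable's real key code.
    struct CellKey {
      std::string   row;        // 0-terminated in serialized form
      std::uint8_t  family;     // single byte: 256 possible column families
      std::string   qualifier;  // 0-terminated in serialized form
      std::uint64_t timestamp;
    };

    // Serialize so that memcmp over the bytes sorts by row, family, and
    // qualifier ascending, with the newest timestamp first within a cell.
    std::string serialize(const CellKey &k) {
      std::string buf;
      buf.append(k.row);
      buf.push_back('\0');
      buf.push_back(static_cast<char>(k.family));
      buf.append(k.qualifier);
      buf.push_back('\0');
      // One's complement + big-endian: larger (newer) timestamps become
      // smaller byte strings, so they sort ahead of older versions.
      std::uint64_t ts = ~k.timestamp;
      for (int shift = 56; shift >= 0; shift -= 8)
        buf.push_back(static_cast<char>((ts >> shift) & 0xff));
      return buf;
    }

    // Key comparison is then just memcmp over the serialized bytes.
    bool key_less(const std::string &a, const std::string &b) {
      int c = std::memcmp(a.data(), b.data(), std::min(a.size(), b.size()));
      return c < 0 || (c == 0 && a.size() < b.size());
    }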

Concurrency

  • Bigtable uses copy-on-write

  • Hypertable uses a form of MVCC (like CouchDB): deletes are inserted as delete records, and multiple versions are kept around (sketch below)
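
A tiny sketch of the “deletes are inserted as delete records” idea: a delete is just another versioned cell, and readers (and compactions) use it to mask older versions. The Version/CellVersions/read_latest names are invented for illustration.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    struct Version {
      bool        is_delete;  // tombstone flag: this record is a delete
      std::string value;
    };

    // Versions of one cell, keyed by timestamp, newest first.
    using CellVersions =
        std::map<std::uint64_t, Version, std::greater<std::uint64_t>>;

    // A read returns the newest version, unless a tombstone masks the cell.
    const Version *read_latest(const CellVersions &versions) {
      if (versions.empty())
        return nullptr;
      const Version &newest = versions.begin()->second;
      return newest.is_delete ? nullptr : &newest;
    }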

Cellstore

  • ~65KB blocks of compressed key/value pairs (lookup sketch below)

  • Bloom Filter - booya!
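
The slide only says “blocks of compressed KV pairs” plus a Bloom filter, so the following is an assumed, SSTable-style layout sketch: an index maps the last key of each block to its position in the file, and a lookup first finds the one block worth decompressing. BlockHandle, BlockIndex, and locate_block are illustrative names.

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    struct BlockHandle {
      std::uint64_t offset;             // where the compressed block starts
      std::uint32_t compressed_length;  // how many bytes to read
    };

    using BlockIndex = std::map<std::string, BlockHandle>;  // last key -> block

    // The block that could contain `key` is the first one whose last key
    // is >= key; if there is none, the key is not in this CellStore.
    std::optional<BlockHandle> locate_block(const BlockIndex &index,
                                            const std::string &key) {
      auto it = index.lower_bound(key);
      if (it == index.end())
        return std::nullopt;
      return it->second;  // caller reads, decompresses, and scans this block
    }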

System Overview

  • Hyperspace provides a distributed lock manager and a small metadata filesystem (built on Berkeley DB)

  • Chubby is Google’s equivalent of Hyperspace

  • Function of the master is to perform metadata operations (ALTER, CREATE, etc.)

  • Clients communicate directly with range servers

  • The master can be down for a while with no one even noticing

  • Hot standby design for availability

  • Range Servers: Responsible for UPDATING and SCANNING

  • It all sits on top of a distributed FS

  • Hadoop’s HDFS, or KFS (a GFS clone)

Range server

  • Manages ranges of table data

  • Caches updates in memory (CellCache)

  • Periodically spills (compacts) the cache to disk as CellStores

Write ahead commit log

  • When updates come into a range server, they’re written to a commit log first, and only then are the in-memory data structures updated, so the log can be replayed after a crash (sketch below)
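
A rough sketch of that write path: log first, then update the in-memory CellCache, and periodically spill to an on-disk CellStore. The class names echo the talk, but the details (thresholds, log format) are illustrative, not Hypertable's real code.

    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>

    class Range {
    public:
      explicit Range(const std::string &log_path)
          : commit_log_(log_path, std::ios::app | std::ios::binary) {}

      void update(const std::string &key, const std::string &value) {
        // 1. Durably log the update so it can be replayed after a crash.
        commit_log_ << key << '\t' << value << '\n';
        commit_log_.flush();
        // 2. Only then update the in-memory structure (the CellCache).
        cell_cache_[key] = value;
        cache_bytes_ += key.size() + value.size();
        // 3. Spill (compact) to a CellStore once the cache grows too large.
        if (cache_bytes_ > kSpillThreshold)
          spill_to_cellstore();
      }

    private:
      void spill_to_cellstore() {
        // Write cell_cache_ out as a sorted, compressed CellStore file,
        // then drop the cache; the logged updates can then be discarded.
        cell_cache_.clear();
        cache_bytes_ = 0;
      }

      static constexpr std::size_t kSpillThreshold = 64 * 1024 * 1024;  // made up
      std::ofstream commit_log_;
      std::map<std::string, std::string> cell_cache_;  // sorted in memory
      std::size_t cache_bytes_ = 0;
    };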

Range meta-operation log

  • Whenever a range server performs a range meta-operation (a range moves, a range is stopped), it’s written into this log

Client API

  • The C++ client is the only one supported at the moment

  • You modify a table by creating a mutator

  • You scan a table by creating a scanner

  • Thrift Broker in the works

  • Someone contributed a Hadoop Map/Reduce connector
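
The notes don't capture the real Hypertable C++ class names or signatures, so here is a toy, self-contained illustration of the access pattern described above: writes go through a mutator, reads go through a scanner. Every name here is a hypothetical stand-in, not the actual client API.

    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class Table {
    public:
      class Mutator {
      public:
        explicit Mutator(Table &t) : table_(t) {}
        void set(const std::string &row, const std::string &value) {
          pending_.emplace_back(row, value);
        }
        void flush() {  // writes become visible when flushed
          for (auto &kv : pending_) table_.cells_[kv.first] = kv.second;
          pending_.clear();
        }
      private:
        Table &table_;
        std::vector<std::pair<std::string, std::string>> pending_;
      };

      class Scanner {
      public:
        explicit Scanner(const Table &t)
            : it_(t.cells_.begin()), end_(t.cells_.end()) {}
        bool next(std::pair<std::string, std::string> &cell) {
          if (it_ == end_) return false;
          cell = *it_++;
          return true;
        }
      private:
        std::map<std::string, std::string>::const_iterator it_, end_;
      };

      Mutator create_mutator() { return Mutator(*this); }
      Scanner create_scanner() const { return Scanner(*this); }

    private:
      std::map<std::string, std::string> cells_;
    };

    int main() {
      Table table;
      Table::Mutator mutator = table.create_mutator();  // modify via a mutator
      mutator.set("com.example/index.html", "<html>...</html>");
      mutator.flush();

      Table::Scanner scanner = table.create_scanner();  // read via a scanner
      std::pair<std::string, std::string> cell;
      while (scanner.next(cell))
        std::cout << cell.first << " -> " << cell.second << "\n";
    }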

Compression

  • CellStore: compressed KV pairs

  • Commit log: Compressed blocks (optionally)

  • Supported types

    • zlib (fastest/best)

    • lzo (high decomp speed)

    • quicklz (fast decomp, high ratio)

    • bmz (longest common substring; good when there’s lots of repetition)

    • none
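
A sketch of the per-block scheme above using one of the listed codecs: each block of serialized KV pairs is compressed independently before it is written out. The zlib compress()/compressBound() calls are standard; wiring them into CellStore/commit-log blocks this way is an assumption about the pattern, not Hypertable's actual code.

    #include <stdexcept>
    #include <string>
    #include <vector>
    #include <zlib.h>  // link with -lz

    // Compress one block of serialized KV pairs with zlib before it is
    // written to a CellStore (or, optionally, to the commit log).
    std::vector<unsigned char> compress_block(const std::string &block) {
      uLongf dest_len = compressBound(block.size());
      std::vector<unsigned char> out(dest_len);
      int rc = compress(out.data(), &dest_len,
                        reinterpret_cast<const Bytef *>(block.data()),
                        block.size());
      if (rc != Z_OK)
        throw std::runtime_error("zlib compress failed");
      out.resize(dest_len);  // the compressed length goes in the block header
      return out;
    }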

Caching

Block Cache

  • Caches CellStore blocks of KV pairs (configurable)

Query cache

  • Implementation not yet finished

  • Caches results

Bloom Filter (!!)

  • Negative Cache

  • Configurable K

  • Allows you to find out if you definitely *don’t* have the data
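
A minimal Bloom-filter sketch of the “negative cache with configurable k” idea: k hash probes per key, and a miss on any probe means the key is definitely absent, so the disk read can be skipped. This is a generic illustration (double hashing via std::hash), not Hypertable's implementation.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    class BloomFilter {
    public:
      BloomFilter(std::size_t num_bits, std::size_t k)
          : bits_(num_bits, false), k_(k) {}

      void insert(const std::string &key) {
        for (std::size_t i = 0; i < k_; ++i)
          bits_[probe(key, i)] = true;
      }

      // "false" means definitely not present; "true" means maybe present.
      bool may_contain(const std::string &key) const {
        for (std::size_t i = 0; i < k_; ++i)
          if (!bits_[probe(key, i)])
            return false;  // definitely don't have the data
        return true;
      }

    private:
      std::size_t probe(const std::string &key, std::size_t i) const {
        // Double hashing (h1 + i*h2): a common way to derive k probes.
        std::size_t h1 = std::hash<std::string>{}(key);
        std::size_t h2 = std::hash<std::string>{}(key + "#") | 1;
        return (h1 + i * h2) % bits_.size();
      }

      std::vector<bool> bits_;
      std::size_t k_;  // configurable number of hash probes
    };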

Scaling

  • Session table and crawl table

  • Tables are split up into ranges, which go to range servers

  • Just add more machines, and the system migrates data to spread it out equally

  • Balancing is questionable…

Access Groups

  • Control over the physical layout: a hybrid row/column orientation

  • Improves performance by minimizing I/O

  • Grouping columns allows physical storage control

  • Makes faster updates possible
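
A small sketch of the idea in this list: column families that are read together are grouped into the same physical store, so a scan over one group never touches the others’ data. The AccessGroupLayout type and the group names are invented for illustration.

    #include <iostream>
    #include <map>
    #include <set>
    #include <string>

    struct AccessGroupLayout {
      // access group name -> column families stored together in it
      std::map<std::string, std::set<std::string>> groups;

      // Which physical stores must be read to answer a query over `columns`?
      std::set<std::string>
      groups_to_read(const std::set<std::string> &columns) const {
        std::set<std::string> needed;
        for (const auto &g : groups)
          for (const auto &col : columns)
            if (g.second.count(col)) { needed.insert(g.first); break; }
        return needed;
      }
    };

    int main() {
      AccessGroupLayout layout;
      layout.groups["meta"]    = {"title", "checksum"};
      layout.groups["content"] = {"body"};

      // Scanning only "title" reads the "meta" group; the big "content"
      // group (and its I/O) is skipped entirely.
      for (const auto &g : layout.groups_to_read({"title"}))
        std::cout << g << "\n";
    }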

FS Broker

  • Can run on any distributed FS

  • FUSE hooks

More

  • Comparison to HBase (Java, yuck); C++ is much better

  • System is designed for async communication

  • Hypertable is CPU intensive

  • Java uses 2-3 times the memory for large memmap

  • Poor processor cache performance

Performance

  • AOL Query logs

  • 75,275,825 inserted cells

  • 8-node cluster (1x 1.8 GHz dual-core Opteron)

    • 4GB RAM

    • 3x 7200 SATA

  • Row key: 7 bytes

  • Average value: 15 bytes

  • Crap. Slide change

  • Another test yielded over 1M sustained inserts/s

Weaknesses

  • Range data managed by a single rangeserver

    • No data loss, but if it goes down, those ranges are unavailable (bad)

    • Can be mitigated with client-side cache or memcached

Status

  • Alpha, 0.9.0.7 released

  • Beta at the end of August

  • Waiting on Hadoop JIRA 1700

    • Bug in Hadoop: it doesn’t allow appending to existing files

  • GPL 2

  • Delete records get flushed out during a major compaction