
RegionServer exits after long GC pauses on CentOS 6.6 #25

Closed
mattshma opened this issue May 10, 2016 · 9 comments
mattshma commented May 10, 2016

The error is as follows:

2016-05-09 19:38:05,741 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 50631ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-05-09 19:38:05,741 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 62195ms for sessionid 0x25475cdd7470285, closing socket connection and attempting reconnect
2016-05-09 19:38:05,742 INFO org.apache.hadoop.hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2016-05-09 19:38:05,742 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 62232ms for sessionid 0x15475cdd4733510, closing socket connection and attempting reconnect
2016-05-09 19:38:05,742 INFO org.apache.hadoop.hbase.ScheduledChore: Chore: bd15-013,60020,1462272507472-MemstoreFlusherChore missed its start time
2016-05-09 19:38:05,742 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49738ms
GC pool 'ParNew' had collection(s): count=1 time=48250ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1814ms
2016-05-09 19:38:07,127 WARN org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:700)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:611)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-09 19:38:07,127 INFO org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15475cdd4733510
2016-05-09 19:38:07,127 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server bd15-013,60020,1462272507472: regionserver:60020-0x25475cdd7470285, quorum=bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181, baseZNode=/hbase regionserver:60020-0x25475cdd7470285 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:700)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:611)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x25475cdd7470285 has expired, closing socket connection
2016-05-09 19:38:07,128 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15475cdd4733510 has expired, closing socket connection
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-05-09 19:38:07,135 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics as JSON on abort: {
  "beans" : [ {
    "name" : "java.lang:type=Memory",
    "modelerType" : "sun.management.MemoryImpl",
    "NonHeapMemoryUsage" : {
      "committed" : 100073472,
      "init" : 24313856,
      "max" : 136314880,
      "used" : 67207712
    },
    "ObjectPendingFinalizationCount" : 0,
    "HeapMemoryUsage" : {
      "committed" : 2075918336,
      "init" : 2147483648,
      "max" : 2075918336,
      "used" : 1078754696
    },
    "Verbose" : false,
    "ObjectName" : "java.lang:type=Memory"
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
    "modelerType" : "RegionServer,sub=IPC",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "queueSize" : 0,
    "numCallsInGeneralQueue" : 0,
    "numCallsInReplicationQueue" : 0,
    "numCallsInPriorityQueue" : 0,
    "numOpenConnections" : 0,
    "numActiveHandler" : 0,
    "TotalCallTime_num_ops" : 2403555,
    "TotalCallTime_min" : -51,
    ...
    "exceptions.NotServingRegionException" : 91,
    "sentBytes" : 5496771135,
    "exceptions.RegionTooBusyException" : 0,
    "receivedBytes" : 255496518,
    "exceptions.OutOfOrderScannerNextException" : 0,
    "exceptions.multiResponseTooLarge" : 0,
    "exceptions.UnknownScannerException" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
    "modelerType" : "RegionServer,sub=Replication",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "sink.appliedOps" : 0,
    "sink.appliedBatches" : 0,
    "sink.ageOfLastAppliedOp" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
    "modelerType" : "RegionServer,sub=Server",
    "tag.zookeeperQuorum" : "bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181",
    "tag.serverName" : "bd15-013,60020,1462272507472",
    "tag.clusterId" : "21861213-d1bc-4ae4-bd54-4dc8b593b4a6",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "regionCount" : 5388,
    "storeCount" : 5388,
    "hlogFileCount" : 2,
    "hlogFileSize" : 277,
    "storeFileCount" : 7897,
    "memStoreSize" : 2327936,
    "storeFileSize" : 6398421535012,
    "regionServerStartTime" : 1462272507472,
    "totalRequestCount" : 2404267,
    "readRequestCount" : 78521805,
    "writeRequestCount" : 2,
    "checkMutateFailedCount" : 0,
    "checkMutatePassedCount" : 0,
    "storeFileIndexSize" : 112044136,
    ...
    "Increment_num_ops" : 0,
    "Increment_min" : 0,
    "Increment_max" : 0,
    "Increment_mean" : 0.0,
    "Increment_median" : 0.0,
    "Increment_75th_percentile" : 0.0,
    "Increment_90th_percentile" : 0.0,
    "Increment_95th_percentile" : 0.0,
    "Increment_99th_percentile" : 0.0
  } ]
}
2016-05-09 19:38:07,160 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x25475cdd7470285, quorum=bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181, baseZNode=/hbase regionserver:60020-0x25475cdd7470285 received expired from ZooKeeper, aborting
2016-05-09 19:38:07,160 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-05-09 19:38:07,160 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2016-05-09 19:38:07,163 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer

Environment:
The system is CentOS 6.6, kernel version 2.6.32-504.el6.x86_64, JVM version 1.7.0_65.
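The log above shows the classic failure chain: a ~50 s ParNew pause outlasts the ZooKeeper session, the session expires, and the RegionServer aborts. While the GC itself is being diagnosed, one common stopgap is to raise the session timeout in hbase-site.xml. A sketch only; the 90000 ms value is an assumed example, not taken from this cluster, and ZooKeeper's own maxSessionTimeout must also allow it:

```xml
<!-- hbase-site.xml: give the RegionServer more headroom before ZooKeeper
     expires its session during a GC pause. 90000 ms is an assumed example
     value; ZooKeeper's maxSessionTimeout must be raised to match. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```

This only buys time; a pause long enough will still expire the session, as happened here.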

@mattshma mattshma added the HBase label May 10, 2016

mattshma commented May 10, 2016

I suspect the crash was caused by the futex_wait kernel bug. Even after running `jstack -F <pid>`, the problem persisted.

On other machines I also found long GC times; in CDH this usually comes with an oversized Compaction Queue Size and an alert like "Average time spent in garbage collection was 52.6 second(s) (87.60%) per minute over the previous 5 minute(s)". The HBase log shows:

2016-05-10 16:56:49,475 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1959ms
GC pool 'ParNew' had collection(s): count=1 time=1ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=2056ms
2016-05-10 16:56:51,864 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1888ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=2019ms

Judging from the source code, these log entries are all normal.

The GC pauses caused by CMS still deserve attention, though.
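To see how frequent and severe these pauses are across a whole log file, the JvmPauseMonitor lines can be summarized directly. A minimal sketch, assuming the log sits at `regionserver.log` (adjust LOG to the real path):

```shell
# Summarize "Detected pause" events from an HBase RegionServer log.
# LOG is a placeholder path; point it at the actual RegionServer log file.
LOG=${LOG:-regionserver.log}
grep 'Detected pause' "$LOG" 2>/dev/null \
  | sed 's/.*approximately \([0-9]*\)ms.*/\1/' \
  | awk '{ n++; sum += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d total_ms=%d max_ms=%d\n", n, sum, max }'
```

Run against the two entries above it reports 2 pauses totalling 3847 ms, which makes it easy to spot whether the 50-second outliers recur.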

@mattshma

Regarding the futex_wait bug, the affected systems are:

RHEL 5 (including CentOS 5 and Scientific Linux 5): all releases (up to and including 5.11) are unaffected.

RHEL 6 (including CentOS 6 and Scientific Linux 6): releases 6.0 through 6.5 are unaffected, but 6.6 has the defect; the 6.6.z releases are fixed.

RHEL 7 (including CentOS 7 and Scientific Linux 7): 7.1 is affected, and as of May 13, 2015 no 7.x fix was available.


mattshma commented May 11, 2016

Right now Hadoop and the other services are all healthy; only HBase shows the "GC duration" and "The web server of this role is responding with metrics" warnings. After roughly one day, the RegionServers that report the GC duration warning go down.

The yum repository currently carries two kernel versions: 2.6.32-504.el6 and 2.6.32-573.26.1.el6. 2.6.32-504.el6 should be the default CentOS 6.6 kernel. The current plan is to first upgrade the kernel to 2.6.32-573.26.1.el6.
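After installing the new kernel it is worth confirming that grub will actually boot it. A small helper, sketched under the assumption of the stock CentOS 6 grub.conf layout (`default=` index plus `title` entries), that prints the title of the default boot entry:

```shell
# Print the title of the default boot entry in a legacy grub.conf.
# Assumes the stock "default=N" + "title ..." layout of CentOS 6.
default_kernel() {
  awk -F= '/^default=/ { d = $2 }
           /^title/    { t[i++] = substr($0, 7) }
           END         { print t[d] }' "$1"
}
# Typical use on CentOS 6:
#   default_kernel /boot/grub/grub.conf
```

After the reboot, `uname -r` should then report 2.6.32-573.26.1.el6.x86_64.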


mattshma commented May 11, 2016

Update: CentOS 6.6 hit another kernel bug (or a Dell ACPI bug): the system keeps rebooting, with errors like:

Mar 10 23:45:37 jh02-024 kernel: [Firmware Bug]: No valid trip found
Mar 10 23:45:37 jh02-024 kernel: ERST: Error Record Serialization Table (ERST) support is initialized.
Mar 10 23:45:37 jh02-024 kernel: GHES: APEI firmware first mode is enabled by WHEA _OSC.
Mar 10 23:45:37 jh02-024 kernel: Non-volatile memory driver v1.3
Mar 10 23:45:37 jh02-024 kernel: Linux agpgart interface v0.103
Mar 10 23:45:37 jh02-024 kernel: crash memory driver: version 1.1

Both 2.6.32-504.el6 and 2.6.32-573.26.1.el6 appear to be affected; see "ERST: Error Record Serialization Table (ERST) support is initialized". The workaround is to upgrade the kernel (post-upgrade version: 2.6.32-573.26.1.el6.x86_64), or to add `noacpi=off` in grub.conf.
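For reference, the grub.conf change amounts to appending the parameter to the kernel line of the boot entry. A sketch with placeholder device and image names, keeping the `noacpi=off` spelling used above:

```
title CentOS (2.6.32-573.26.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.26.1.el6.x86_64 ro root=/dev/sda1 noacpi=off
        initrd /initramfs-2.6.32-573.26.1.el6.x86_64.img
```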


mattshma commented May 12, 2016

UPDATE:

On the machine that keeps rebooting, even after adding the `noacpi=off` parameter and upgrading the kernel, the machine still reboots.
On the machines whose RegionServer dies, RegionServers still die after the kernel upgrade (post-upgrade version: 2.6.32-573.26.1.el6.x86_64); because of a supervisor(?), they restart automatically after dying.

Time is tight and the cluster is still in the testing phase, so for now the approach is to downgrade the operating system to 6.4.

Note: the correct fix here would still be to change the kernel rather than swap the operating system. Whether a candidate kernel contains the fix can be confirmed with `rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'`.
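The same changelog test can also be pointed at whatever kernel is already installed. A sketch wrapping the grep above (the search string is the fix's changelog subject quoted in the note):

```shell
# Report whether a kernel changelog (fed on stdin) contains the futex_wait
# fix. Pair it with rpm's --changelog output for the installed kernel.
has_futex_fix() {
  if grep -q 'Ensure get_futex_key_refs() always implies a barrier'; then
    echo fixed
  else
    echo affected
  fi
}
# On a CentOS host:
#   rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix
```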

@mattshma mattshma changed the title from "RegionServer exits after long GC pauses" to "RegionServer exits after long GC pauses on CentOS 6.6" May 12, 2016

mattshma commented May 12, 2016

For the machines whose RegionServer died, the operating system has now been downgraded to 6.4. After the reinstall, CDH had to be set up again, but the data was still there: the HDFS data is fine, while the HBase data is inconsistent and has shrunk, with errors reported; see #28. The services have been running normally for a day.

Still, this was a hard lesson: before any online service goes into production, make sure it is thoroughly tested!

@mattshma

UPDATE: Operating System Known Issues

@mattshma

UPDATE:

  • After downgrading the HBase machines to CentOS 6.4, the errors stopped.
  • The rebooting machine (CentOS 6.6) still reboots occasionally after the kernel upgrade, just far less often. Its operating system will be switched to CentOS 6.4 next.

@mattshma

update: the rebooting machine turned out to be a hardware problem; after the hardware was replaced it has not rebooted again.
