
RegionServer exits after long GC pauses on CentOS 6.6 #25

Closed
mattshma opened this issue May 10, 2016 · 9 comments
mattshma commented May 10, 2016

The error is as follows:

2016-05-09 19:38:05,741 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 50631ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2016-05-09 19:38:05,741 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 62195ms for sessionid 0x25475cdd7470285, closing socket connection and attempting reconnect
2016-05-09 19:38:05,742 INFO org.apache.hadoop.hbase.ScheduledChore: Chore: CompactionChecker missed its start time
2016-05-09 19:38:05,742 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 62232ms for sessionid 0x15475cdd4733510, closing socket connection and attempting reconnect
2016-05-09 19:38:05,742 INFO org.apache.hadoop.hbase.ScheduledChore: Chore: bd15-013,60020,1462272507472-MemstoreFlusherChore missed its start time
2016-05-09 19:38:05,742 WARN org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49738ms
GC pool 'ParNew' had collection(s): count=1 time=48250ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=1814ms
2016-05-09 19:38:07,127 WARN org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:700)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:611)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-09 19:38:07,127 INFO org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15475cdd4733510
2016-05-09 19:38:07,127 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server bd15-013,60020,1462272507472: regionserver:60020-0x25475cdd7470285, quorum=bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181, baseZNode=/hbase regionserver:60020-0x25475cdd7470285 received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:700)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:611)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x25475cdd7470285 has expired, closing socket connection
2016-05-09 19:38:07,128 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15475cdd4733510 has expired, closing socket connection
2016-05-09 19:38:07,127 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-05-09 19:38:07,135 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics as JSON on abort: {
  "beans" : [ {
    "name" : "java.lang:type=Memory",
    "modelerType" : "sun.management.MemoryImpl",
    "NonHeapMemoryUsage" : {
      "committed" : 100073472,
      "init" : 24313856,
      "max" : 136314880,
      "used" : 67207712
    },
    "ObjectPendingFinalizationCount" : 0,
    "HeapMemoryUsage" : {
      "committed" : 2075918336,
      "init" : 2147483648,
      "max" : 2075918336,
      "used" : 1078754696
    },
    "Verbose" : false,
    "ObjectName" : "java.lang:type=Memory"
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
    "modelerType" : "RegionServer,sub=IPC",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "queueSize" : 0,
    "numCallsInGeneralQueue" : 0,
    "numCallsInReplicationQueue" : 0,
    "numCallsInPriorityQueue" : 0,
    "numOpenConnections" : 0,
    "numActiveHandler" : 0,
    "TotalCallTime_num_ops" : 2403555,
    "TotalCallTime_min" : -51,
    ...
    "exceptions.NotServingRegionException" : 91,
    "sentBytes" : 5496771135,
    "exceptions.RegionTooBusyException" : 0,
    "receivedBytes" : 255496518,
    "exceptions.OutOfOrderScannerNextException" : 0,
    "exceptions.multiResponseTooLarge" : 0,
    "exceptions.UnknownScannerException" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Replication",
    "modelerType" : "RegionServer,sub=Replication",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "sink.appliedOps" : 0,
    "sink.appliedBatches" : 0,
    "sink.ageOfLastAppliedOp" : 0
  } ],
  "beans" : [ {
    "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
    "modelerType" : "RegionServer,sub=Server",
    "tag.zookeeperQuorum" : "bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181",
    "tag.serverName" : "bd15-013,60020,1462272507472",
    "tag.clusterId" : "21861213-d1bc-4ae4-bd54-4dc8b593b4a6",
    "tag.Context" : "regionserver",
    "tag.Hostname" : "bd15-013",
    "regionCount" : 5388,
    "storeCount" : 5388,
    "hlogFileCount" : 2,
    "hlogFileSize" : 277,
    "storeFileCount" : 7897,
    "memStoreSize" : 2327936,
    "storeFileSize" : 6398421535012,
    "regionServerStartTime" : 1462272507472,
    "totalRequestCount" : 2404267,
    "readRequestCount" : 78521805,
    "writeRequestCount" : 2,
    "checkMutateFailedCount" : 0,
    "checkMutatePassedCount" : 0,
    "storeFileIndexSize" : 112044136,
    ...
    "Increment_num_ops" : 0,
    "Increment_min" : 0,
    "Increment_max" : 0,
    "Increment_mean" : 0.0,
    "Increment_median" : 0.0,
    "Increment_75th_percentile" : 0.0,
    "Increment_90th_percentile" : 0.0,
    "Increment_95th_percentile" : 0.0,
    "Increment_99th_percentile" : 0.0
  } ]
}
2016-05-09 19:38:07,160 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: regionserver:60020-0x25475cdd7470285, quorum=bd15-013:2181,bd15-011:2181,bd15-017:2181,bd15-019:2181,bd15-012:2181, baseZNode=/hbase regionserver:60020-0x25475cdd7470285 received expired from ZooKeeper, aborting
2016-05-09 19:38:07,160 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-05-09 19:38:07,160 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2016-05-09 19:38:07,163 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stopping infoServer

Environment:
The system is CentOS 6.6, kernel version 2.6.32-504.el6.x86_64, JVM version 1.7.0_65.
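The log above shows the classic failure chain: a ~50 s ParNew pause outlasts the ZooKeeper session, the session expires, and the RegionServer aborts. While the GC itself is being diagnosed, one common stopgap is to raise the session timeout in hbase-site.xml. A sketch only; the 90000 ms value is an assumed example, not taken from this cluster, and ZooKeeper's own maxSessionTimeout must also allow it:

```xml
<!-- hbase-site.xml: give the RegionServer more headroom before ZooKeeper
     expires its session during a GC pause. 90000 ms is an assumed example
     value; ZooKeeper's maxSessionTimeout must be raised to match. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```

This only buys time; a pause long enough will still expire the session, as happened here.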

@mattshma mattshma added the HBase label May 10, 2016

mattshma commented May 10, 2016

I suspect the crash was caused by the futex_wait kernel bug. Even after running `jstack -F <pid>`, the problem persisted.

On other machines I also found long GC times; in CDH this usually comes with an oversized Compaction Queue Size and an alert like "Average time spent in garbage collection was 52.6 second(s) (87.60%) per minute over the previous 5 minute(s)". The HBase log shows:

2016-05-10 16:56:49,475 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1959ms
GC pool 'ParNew' had collection(s): count=1 time=1ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=2056ms
2016-05-10 16:56:51,864 INFO org.apache.hadoop.hbase.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1888ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=2019ms

Judging from the source code, these log entries are all normal.

The GC pauses caused by CMS still deserve attention, though.
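To see how frequent and severe these pauses are across a whole log file, the JvmPauseMonitor lines can be summarized directly. A minimal sketch, assuming the log sits at `regionserver.log` (adjust LOG to the real path):

```shell
# Summarize "Detected pause" events from an HBase RegionServer log.
# LOG is a placeholder path; point it at the actual RegionServer log file.
LOG=${LOG:-regionserver.log}
grep 'Detected pause' "$LOG" 2>/dev/null \
  | sed 's/.*approximately \([0-9]*\)ms.*/\1/' \
  | awk '{ n++; sum += $1; if ($1 > max) max = $1 }
         END { printf "pauses=%d total_ms=%d max_ms=%d\n", n, sum, max }'
```

Run against the two entries above it reports 2 pauses totalling 3847 ms, which makes it easy to spot whether the 50-second outliers recur.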

@mattshma

Regarding the futex_wait bug, the affected systems are:

RHEL 5 (including CentOS 5 and Scientific Linux 5): all releases (up to and including 5.11) are unaffected.

RHEL 6 (including CentOS 6 and Scientific Linux 6): releases 6.0 through 6.5 are unaffected, but 6.6 has the defect; the 6.6.z releases are fixed.

RHEL 7 (including CentOS 7 and Scientific Linux 7): 7.1 is affected, and as of May 13, 2015 no 7.x fix was available.


mattshma commented May 11, 2016

Right now Hadoop and the other services are all healthy; only HBase shows the "GC duration" and "The web server of this role is responding with metrics" warnings. After roughly one day, the RegionServers that report the GC duration warning go down.

The yum repository currently carries two kernel versions: 2.6.32-504.el6 and 2.6.32-573.26.1.el6. 2.6.32-504.el6 should be the default CentOS 6.6 kernel. The current plan is to first upgrade the kernel to 2.6.32-573.26.1.el6.
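After installing the new kernel it is worth confirming that grub will actually boot it. A small helper, sketched under the assumption of the stock CentOS 6 grub.conf layout (`default=` index plus `title` entries), that prints the title of the default boot entry:

```shell
# Print the title of the default boot entry in a legacy grub.conf.
# Assumes the stock "default=N" + "title ..." layout of CentOS 6.
default_kernel() {
  awk -F= '/^default=/ { d = $2 }
           /^title/    { t[i++] = substr($0, 7) }
           END         { print t[d] }' "$1"
}
# Typical use on CentOS 6:
#   default_kernel /boot/grub/grub.conf
```

After the reboot, `uname -r` should then report 2.6.32-573.26.1.el6.x86_64.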


mattshma commented May 11, 2016

Update: CentOS 6.6 hit another kernel bug (or a Dell ACPI bug): the system keeps rebooting, with errors like:

Mar 10 23:45:37 jh02-024 kernel: [Firmware Bug]: No valid trip found
Mar 10 23:45:37 jh02-024 kernel: ERST: Error Record Serialization Table (ERST) support is initialized.
Mar 10 23:45:37 jh02-024 kernel: GHES: APEI firmware first mode is enabled by WHEA _OSC.
Mar 10 23:45:37 jh02-024 kernel: Non-volatile memory driver v1.3
Mar 10 23:45:37 jh02-024 kernel: Linux agpgart interface v0.103
Mar 10 23:45:37 jh02-024 kernel: crash memory driver: version 1.1

Both 2.6.32-504.el6 and 2.6.32-573.26.1.el6 appear to be affected; see "ERST: Error Record Serialization Table (ERST) support is initialized". The workaround is to upgrade the kernel (post-upgrade version: 2.6.32-573.26.1.el6.x86_64), or to add `noacpi=off` in grub.conf.
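For reference, the grub.conf change amounts to appending the parameter to the kernel line of the boot entry. A sketch with placeholder device and image names, keeping the `noacpi=off` spelling used above:

```
title CentOS (2.6.32-573.26.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-573.26.1.el6.x86_64 ro root=/dev/sda1 noacpi=off
        initrd /initramfs-2.6.32-573.26.1.el6.x86_64.img
```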


mattshma commented May 12, 2016

UPDATE:

On the machine that keeps rebooting, even after adding the `noacpi=off` parameter and upgrading the kernel, the machine still reboots.
On the machines whose RegionServer dies, RegionServers still die after the kernel upgrade (post-upgrade version: 2.6.32-573.26.1.el6.x86_64); because of a supervisor(?), they restart automatically after dying.

Time is tight and the cluster is still in the testing phase, so for now the approach is to downgrade the operating system to 6.4.

Note: the correct fix here would still be to change the kernel rather than swap the operating system. Whether a candidate kernel contains the fix can be confirmed with `rpm -qp --changelog kernel-2.6.32-504.16.2.el6.x86_64.rpm | grep 'Ensure get_futex_key_refs() always implies a barrier'`.
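The same changelog test can also be pointed at whatever kernel is already installed. A sketch wrapping the grep above (the search string is the fix's changelog subject quoted in the note):

```shell
# Report whether a kernel changelog (fed on stdin) contains the futex_wait
# fix. Pair it with rpm's --changelog output for the installed kernel.
has_futex_fix() {
  if grep -q 'Ensure get_futex_key_refs() always implies a barrier'; then
    echo fixed
  else
    echo affected
  fi
}
# On a CentOS host:
#   rpm -q --changelog "kernel-$(uname -r)" | has_futex_fix
```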

@mattshma mattshma changed the title from "RegionServer exits after long GC pauses" to "RegionServer exits after long GC pauses on CentOS 6.6" May 12, 2016

mattshma commented May 12, 2016

For the machines whose RegionServer died, the operating system has now been downgraded to 6.4. After the reinstall, CDH had to be set up again, but the data was still there: the HDFS data is fine, while the HBase data is inconsistent and has shrunk, with errors reported; see #28. The services have been running normally for a day.

Still, this was a hard lesson: before any online service goes into production, make sure it is thoroughly tested!

@mattshma

UPDATE: Operating System Known Issues

@mattshma

UPDATE:

  • After downgrading the HBase machines to CentOS 6.4, the errors stopped.
  • The rebooting machine (CentOS 6.6) still reboots occasionally after the kernel upgrade, just far less often. Its operating system will be switched to CentOS 6.4 next.

@mattshma

update: the rebooting machine turned out to be a hardware problem; after the hardware was replaced it has not rebooted again.
