distcp fails with "Mismatch in length" #76

Closed
mattshma opened this issue Dec 23, 2016 · 0 comments

mattshma commented Dec 23, 2016

The Hadoop version here is 2.6.0. When copying data with distcp, the job consistently fails with the following error:

Error: java.io.IOException: File copy failed: hftp://A:50070/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 --> webhdfs://B:50070/user/hive/warehouse/idmap.db/log/tt=2016091608/k1_.1473984000418
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
	at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hftp://A:50070/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 to webhdfs://B:50070/user/hive/warehouse/idmap.db/log/tt=2016091608/k1_.1473984000418
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
	at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
	... 10 more
Caused by: java.io.IOException: Mismatch in length of source:hftp://A:50070/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 and target:webhdfs://B:50070/user/hive/warehouse/idmap.db/log/tt=2016091608/.distcp.tmp.attempt_1470826899353_60634_m_000000_0
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareFileLengths(RetriableFileCopyCommand.java:194)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:127)
	at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
	at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
	... 11 more

Since the lengths differ, my first thought was to skip the length check the same way -skipcrccheck skips the CRC check, but the distcp usage lists no such option. Looking at RetriableFileCopyCommand.java:

  // From org.apache.hadoop.tools.mapred.RetriableFileCopyCommand:
  // the length comparison is unconditional; no option bypasses it.
  private void compareFileLengths(FileStatus sourceFileStatus, Path target,
                                  Configuration configuration, long targetLen)
                                  throws IOException {
    final Path sourcePath = sourceFileStatus.getPath();
    FileSystem fs = sourcePath.getFileSystem(configuration);
    if (fs.getFileStatus(sourcePath).getLen() != targetLen)
      throw new IOException("Mismatch in length of source:" + sourcePath
                + " and target:" + target);
  }
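
For contrast, the CRC check is exposed through the DistCp Java options while the length check is not. Below is a minimal sketch of driving DistCp programmatically with the CRC check disabled, against the Hadoop 2.x API (the class name is mine, and the paths are the ones from this issue; Hadoop 3 later replaced this DistCpOptions constructor with a builder):

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class DistCpSkipCrc {
      public static void main(String[] args) throws Exception {
        Path src = new Path("hftp://A:50070/user/hive/warehouse/sdk.db/log/tt=2016091608");
        Path dst = new Path("webhdfs://B:50070/user/hive/warehouse/idmap.db/log/tt=2016091608");

        DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
        options.setSyncFolder(true); // -update; skipping CRC is only valid together with it
        options.setSkipCRC(true);    // -skipcrccheck
        // No corresponding setter exists for the length check:
        // compareFileLengths() above always runs, so a still-open source
        // file keeps failing the copy no matter which options are set.

        new DistCp(new Configuration(), options).execute();
      }
    }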

The size comparison is baked into the source, so this approach is a dead end. Thinking about it a bit more, a file that was never closed would affect distcp in exactly this way. Check the file's status:

# sudo -u hdfs hdfs fsck -openforwrite /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418
Connecting to namenode via http://A:50070
FSCK started by hdfs (auth:SIMPLE) from /10.6.25.21 for path /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 at Fri Dec 23 12:00:41 CST 2016
/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 545955110 bytes, 5 block(s), OPENFORWRITE: Status: HEALTHY
 Total size:	545955110 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	5 (avg. block size 109191022 B)
 Minimally replicated blocks:	5 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		58
 Number of racks:		1
FSCK ended at Fri Dec 23 12:00:41 CST 2016 in 1 milliseconds


The filesystem under path '/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418' is HEALTHY

The file is indeed still open for write. Its owner is flume, so it was written by Flume, which recalls the recent incident where a DataNode failure left Flume unable to close files. So the fix is simply to get the file closed.

Hadoop provides recovery mechanisms for this: lease recovery triggers block recovery, and once the DataNodes finish block recovery the file is closed. So manually running a recoverLease operation should be enough. Before the real operation I made a backup copy of the source file, and during the copy unexpectedly noticed that the backup came out a different size:

# sudo -u hdfs hdfs dfs -cp /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418_new
# sudo -u hdfs hdfs dfs -ls /user/hive/warehouse/sdk.db/log/tt=2016091608/
Found 3 items
-rwxrwxr-x+  3 flume hive  545955110 2016-09-16 07:59 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418
-rw-r--r--   3 hdfs  hive  654612572 2016-12-23 11:57 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418_new
-rwxrwxr-x+  3 flume hive  656178310 2016-09-16 09:02 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473985936422

Now run the recoverLease operation:

# sudo -u flume hdfs debug recoverLease -path /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418
recoverLease returned false.
Giving up on recoverLease for /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 after 1 try.
# sudo -u hdfs hdfs dfs -ls /user/hive/warehouse/sdk.db/log/tt=2016091608/
Found 3 items
-rwxrwxr-x+  3 flume hive  654612572 2016-12-23 12:04 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418
-rw-r--r--   3 hdfs  hive  654612572 2016-12-23 11:57 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418_new
-rwxrwxr-x+  3 flume hive  656178310 2016-09-16 09:02 /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473985936422
# sudo -u hdfs hdfs fsck -openforwrite /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418
Connecting to namenode via http://bd15-001:50070
FSCK started by hdfs (auth:SIMPLE) from /10.6.25.21 for path /user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418 at Fri Dec 23 12:04:42 CST 2016
.Status: HEALTHY
 Total size:	654612572 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	5 (avg. block size 130922514 B)
 Minimally replicated blocks:	5 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		58
 Number of racks:		1
FSCK ended at Fri Dec 23 12:04:42 CST 2016 in 0 milliseconds


The filesystem under path '/user/hive/warehouse/sdk.db/log/tt=2016091608/k1_.1473984000418' is HEALTHY
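
The same recovery can also be driven from code rather than the debug CLI, polling until the NameNode reports the file closed. A minimal sketch against the Hadoop 2.x HDFS API (the class name and the 5-second poll interval are my own choices):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class RecoverLease {
      public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]);
        FileSystem fs = path.getFileSystem(new Configuration());
        if (!(fs instanceof DistributedFileSystem)) {
          throw new IllegalArgumentException(path + " is not on HDFS");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // recoverLease() only *initiates* lease recovery; it returns true
        // iff the file is already closed. Block recovery then completes
        // asynchronously on the DataNodes, which is also why the CLI above
        // printed "recoverLease returned false" even though it worked.
        boolean closed = dfs.recoverLease(path);
        while (!closed) {
          Thread.sleep(5000L);
          closed = dfs.isFileClosed(path);
        }
        System.out.println(path + " is now closed");
      }
    }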

Running distcp again after that, the copy succeeded.

From the above, a recoverLease-style recovery had evidently already taken place during the backup copy as well: the backup made at 11:57 already had the recovered length of 654612572 bytes, before recoverLease was run manually at 12:04.
