Custom HiveInputFormat

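After Snappy decompression, the file has the following layout:

[4-byte data length (big-endian)][data (protobuf-serialized)][4-byte data length (big-endian)][data (protobuf-serialized)]...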
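
The framing above can be walked with a small reader. The following is a minimal sketch, not code from this repository: the class name `LengthPrefixedRecords` and the method `readAll` are hypothetical, and the sketch assumes the input stream has already been Snappy-decompressed. It returns the raw payload bytes of each record and leaves protobuf decoding to the caller.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of this repository. */
final class LengthPrefixedRecords {

    private LengthPrefixedRecords() {}

    /**
     * Reads every [4-byte big-endian length][payload] record from a stream
     * that is assumed to be already Snappy-decompressed.
     */
    static List<byte[]> readAll(InputStream decompressed) throws IOException {
        DataInputStream in = new DataInputStream(decompressed);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // Probe one byte so a clean end of stream between records
            // can be told apart from a truncated record.
            int first = in.read();
            if (first < 0) {
                return records;
            }
            // Remaining three bytes of the big-endian length prefix;
            // readUnsignedByte throws EOFException if the prefix is cut short.
            int length = (first << 24)
                    | (in.readUnsignedByte() << 16)
                    | (in.readUnsignedByte() << 8)
                    | in.readUnsignedByte();
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            // readFully throws EOFException if the payload is truncated.
            in.readFully(payload);
            records.add(payload);
        }
    }
}
```

Probing a single byte before reading the rest of the prefix is a design choice of this sketch: it lets the reader stop quietly at a record boundary while still failing loudly on a file that ends mid-record.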

Reading the raw files through a Hive table

Optional columns (column names and types must match exactly)
- tableid (int)
- colspace (int)
- rowkey (string)
- colkey (string)
- value (string or binary)
- score (bigint)
- ttl (int)

Creating the Hive table

add jar /user/hadoop/udf/UDHiveInputFormat.jar;
CREATE EXTERNAL TABLE my_test(
  tableid int,
  colkey string,
  score bigint,
  rowkey string,
  value string)
STORED BY 'com.my.tutorial.MyStorageHandler';

Usage (queries and other operations)

add jar /user/hadoop/udf/UDHiveInputFormat.jar;
SELECT * FROM my_test LIMIT 10;
-- or
SELECT * FROM my_test WHERE tableid = 1 LIMIT 10;
...

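Notes:

Column names must be drawn from (tableid, colspace, rowkey, colkey, value, score, ttl), and each column's type must match the type listed above.
STORED BY must specify the corresponding class (com.my.tutorial.MyStorageHandler in the example above).
If you write data from my_test into another table, value needs further processing: it may contain special characters such as \r, \n, and \t, which need to be handled with regexp_replace or a similar function.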
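
Hive's regexp_replace takes Java regular expression syntax, so the cleanup of value can be prototyped in plain Java first. The sketch below is illustrative only: the class `ValueSanitizer` is hypothetical, and replacing each special character with a single space is an assumption (the original does not say what the replacement should be). The same character class could then be passed to Hive, for example as `regexp_replace(value, '[\\r\\n\\t]', ' ')`, though the escaping inside the HiveQL string literal may need adjusting for your setup.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of this repository. */
final class ValueSanitizer {

    // Matches a single carriage return, line feed, or tab.
    private static final Pattern SPECIAL_CHARS = Pattern.compile("[\\r\\n\\t]");

    private ValueSanitizer() {}

    /** Replaces each \r, \n, or \t in value with one space (assumed replacement). */
    static String sanitize(String value) {
        if (value == null) {
            return null;
        }
        return SPECIAL_CHARS.matcher(value).replaceAll(" ");
    }
}
```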

Run & Developer

$ mvn eclipse:eclipse
$ mvn package
