Add record encoder #4

Closed
harbby opened this issue Feb 5, 2021 · 0 comments · Fixed by #5
Labels: enhancement (New feature or request)


harbby commented Feb 5, 2021

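Add a record (data row) encoder/decoder feature

Background:

Distributed computing systems often transfer large amounts of data between nodes during shuffle. The IO bottleneck (disk/network) is also the biggest challenge in the shuffle stage.
Reducing the number of bytes transferred is a very effective way to improve IO. The common approach is to combine an efficient serializer with compression.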
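
To make the "serializer plus compression" idea concrete, here is a minimal sketch using only the JDK. The record layout (an int key and a long value), the class name `ShuffleBytesSketch`, and the choice of GZIP are assumptions for illustration; the issue does not say which compression codec the project uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

class ShuffleBytesSketch {
    // Encodes `count` hypothetical (int key, long value) records: 4 + 8 bytes each.
    static byte[] encodeRecords(int count) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (int i = 0; i < count; i++) {
                out.writeInt(i % 100); // key
                out.writeLong(i);      // value
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Compresses the encoded bytes; GZIP is only a stand-in codec here.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = encodeRecords(10_000);
        byte[] compressed = gzip(raw);
        System.out.println("encoded bytes:    " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The encoded size is fixed by the layout (12 bytes per record), while the compressed size depends on how repetitive the data is.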

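Features:

  • An efficient serializer (encoder/decoder) plus compression can significantly reduce the byte count and greatly improve IO.
  • It can also be used for byte storage. Compared with object storage, byte storage saves more space and can significantly improve the CPU cache hit rate. Examples of byte storage: Flink's BinaryRow, Spark's UnsafeRow, etc.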
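
The byte-storage point can be illustrated with a small sketch that keeps rows in one contiguous buffer instead of one object per row. The fixed-width layout (int id, double score) and the class `ByteRowStoreSketch` are hypothetical; this is not the actual layout of Flink's BinaryRow or Spark's UnsafeRow.

```java
import java.nio.ByteBuffer;

class ByteRowStoreSketch {
    // Hypothetical fixed-width row: int id (4 bytes) followed by double score (8 bytes).
    static final int ROW_SIZE = Integer.BYTES + Double.BYTES;

    private final ByteBuffer rows;
    private int count;

    ByteRowStoreSketch(int capacity) {
        this.rows = ByteBuffer.allocate(capacity * ROW_SIZE);
    }

    // Appends one row at the buffer's current position.
    void append(int id, double score) {
        rows.putInt(id).putDouble(score);
        count++;
    }

    // Absolute reads: the field offset is computed from the row index.
    int idAt(int row) {
        return rows.getInt(row * ROW_SIZE);
    }

    double scoreAt(int row) {
        return rows.getDouble(row * ROW_SIZE + Integer.BYTES);
    }

    int size() {
        return count;
    }

    // Scans the rows sequentially, touching one contiguous block of memory.
    double sumScores() {
        double sum = 0;
        for (int i = 0; i < count; i++) {
            sum += scoreAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteRowStoreSketch store = new ByteRowStoreSketch(4);
        store.append(1, 0.5);
        store.append(2, 1.5);
        System.out.println(store.size() + " rows, sum of scores = " + store.sumScores());
    }
}
```

Each row costs exactly `ROW_SIZE` bytes with no per-object header, and a scan reads adjacent memory, which is where the space and cache benefits come from.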

Design

All encoders/decoders inherit from Encoder<E> and are referenced through the Encoders class.
The interface is designed as follows:

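import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.Serializable;

public interface Encoder<E>
        extends Serializable
{
    // Writes the encoded form of value to output.
    public void encoder(E value, DataOutput output)
            throws IOException;

    // Reads one value back from input.
    public E decoder(DataInput input)
            throws IOException;
}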
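
As an illustration of how the interface might be implemented, the sketch below encodes a hypothetical `WordCount` record and round-trips it through a byte array. `WordCount`, `WordCountEncoder`, and the `toBytes`/`fromBytes` helpers are assumptions made for this example and are not part of the proposed API; the nested `Encoder<E>` mirrors the interface from this issue so the block compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.Serializable;
import java.io.UncheckedIOException;

class EncoderSketch {
    // Mirrors the Encoder<E> interface from this issue.
    interface Encoder<E> extends Serializable {
        void encoder(E value, DataOutput output) throws IOException;

        E decoder(DataInput input) throws IOException;
    }

    // Hypothetical record type, used only for illustration.
    static final class WordCount {
        final String word;
        final long count;

        WordCount(String word, long count) {
            this.word = word;
            this.count = count;
        }
    }

    // Hypothetical encoder: a length-prefixed UTF string followed by an 8-byte long.
    static final class WordCountEncoder implements Encoder<WordCount> {
        private static final long serialVersionUID = 1L;

        @Override
        public void encoder(WordCount value, DataOutput output) throws IOException {
            output.writeUTF(value.word);
            output.writeLong(value.count);
        }

        @Override
        public WordCount decoder(DataInput input) throws IOException {
            // Fields must be read in the same order they were written.
            String word = input.readUTF();
            long count = input.readLong();
            return new WordCount(word, count);
        }
    }

    static <E> byte[] toBytes(Encoder<E> encoder, E value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            encoder.encoder(value, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static <E> E fromBytes(Encoder<E> encoder, byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return encoder.decoder(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordCountEncoder encoder = new WordCountEncoder();
        byte[] bytes = toBytes(encoder, new WordCount("hello", 42L));
        WordCount decoded = fromBytes(encoder, bytes);
        System.out.println(bytes.length + " bytes -> " + decoded.word + "=" + decoded.count);
    }
}
```

Because the encoder knows the record's type up front, it writes only field data: no class descriptors or field names go on the wire.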

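Compatibility:

  • This patch does not introduce breaking API changes.
  • For now, a DataSet.encoder(Encoder) method is added on an experimental basis. It will be removed once the type inference feature (# ??? ) is complete.
  • This feature is a breaking change for the netty network transport backend; a corresponding decoder must be added to the backend.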
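
Once records travel as raw encoded bytes, the receiving side needs some way to find record boundaries before it can decode them. One common option is length-prefixed framing, sketched below with plain `java.io` streams. The frame layout (4-byte length, then payload) and the class `FrameSketch` are assumptions; the issue does not describe the project's actual wire format, and no Netty API is used here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class FrameSketch {
    // Assumed frame layout: 4-byte big-endian length, then the payload bytes.
    static void writeFrame(DataOutput out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    static byte[] readFrame(DataInput in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload); // blocks until the whole payload is available
        return payload;
    }

    // Writes all payloads back to back into one byte array.
    static byte[] frameAll(List<byte[]> payloads) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (byte[] payload : payloads) {
                writeFrame(out, payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads frames until the in-memory input is exhausted.
    static List<byte[]> unframeAll(byte[] bytes) {
        List<byte[]> payloads = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            while (in.available() > 0) {
                payloads.add(readFrame(in));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        records.add("a".getBytes(StandardCharsets.UTF_8));
        records.add("bcd".getBytes(StandardCharsets.UTF_8));
        byte[] wire = frameAll(records);
        System.out.println(wire.length + " bytes on the wire, "
                + unframeAll(wire).size() + " frames decoded");
    }
}
```

A network backend would apply the same idea to its own buffers; the `available()` loop here only works because the input is an in-memory array.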

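Effect:

  • This feature lets users set the serializer (Encoder) for a Record.
  • The serializer takes effect during shuffleMap write and shuffleReduce read, and significantly reduces the number of bytes transferred.
  • In tests with small Records it gives roughly a 10x improvement (test reference: ....)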
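
To see why small records benefit so much, the sketch below compares the size of one small record under default Java serialization with the size of the same fields written through `DataOutput`. The `Pair` type and the class `SmallRecordSizeSketch` are hypothetical, and this does not reproduce the test behind the figure quoted above; it only shows where the per-record overhead comes from.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class SmallRecordSizeSketch {
    // Hypothetical small record: one int and one long.
    static final class Pair implements Serializable {
        private static final long serialVersionUID = 1L;
        final int key;
        final long value;

        Pair(int key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Default Java serialization: stream header and class descriptor plus field data.
    static int javaSerializedSize(Pair pair) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(pair);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    // Hand-written encoding: only the field data, 4 + 8 bytes.
    static int encodedSize(Pair pair) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(pair.key);
            out.writeLong(pair.value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        Pair pair = new Pair(7, 123L);
        System.out.println("java serialization: " + javaSerializedSize(pair) + " bytes");
        System.out.println("encoder-style:      " + encodedSize(pair) + " bytes");
    }
}
```

The smaller the payload, the larger the share taken by serialization metadata, so the relative saving is greatest for small records.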
harbby added the enhancement label Feb 5, 2021
harbby changed the title from "Support Type Encoder" to "Add record encoder" Feb 5, 2021
harbby added this to To do in astarte-batch Feb 5, 2021
harbby moved this from To do to In progress in astarte-batch Feb 5, 2021
harbby added a commit that referenced this issue Feb 5, 2021
harbby linked a pull request Feb 5, 2021 that will close this issue
harbby mentioned this issue Feb 5, 2021
harbby self-assigned this Feb 6, 2021
harbby closed this as completed in #5 Feb 7, 2021
harbby added a commit that referenced this issue Feb 7, 2021
harbby reopened this Feb 7, 2021
harbby closed this as completed Feb 7, 2021