Pipeline

The pipeline refers to all the steps that chat data goes through before reaching the report user interface. This document provides a detailed description of each step to give you a general idea of how it works.

The following diagram gives an overview of the pipeline:

[Pipeline overview diagram]

Quick jump to sections:

  1. Input files
  2. Parsing
  3. Processing & Analysis
  4. Compression and Encoding
  5. Aggregate / Blocks
  6. About message serialization

1. Input files

The input files for the pipeline are chat exports from various platforms, each with its own format. Users can upload these files through the web application user interface, through the command-line interface, or programmatically via the npm package.

For each file, a FileInput object has to be created, which, along with metadata, contains the slice(start?: number, end?: number): ArrayBuffer function that must return a slice of the file in the specified range. This function is environment dependent. Since files may be large (several GBs), we don't read the entire file content into memory; instead, we let parsers stream it.
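
As a rough illustration, a Node.js implementation of FileInput might look like the following. The metadata fields shown (name, size) are assumptions; check the actual interface in the code.

import { openSync, readSync, fstatSync } from "fs";

// Hypothetical FileInput shape, reconstructed from the description above;
// the real interface may carry different metadata fields.
interface FileInput {
    name: string;
    size: number;
    slice(start?: number, end?: number): ArrayBuffer;
}

const nodeFileInput = (filepath: string): FileInput => {
    const fd = openSync(filepath, "r");
    const size = fstatSync(fd).size;
    return {
        name: filepath,
        size,
        slice(start = 0, end = size) {
            // Read only the requested byte range, never the whole file.
            const bytes = new Uint8Array(end - start);
            readSync(fd, bytes, 0, end - start, start);
            return bytes.buffer;
        },
    };
};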

2. Parsing

This step takes the input files and parses them into P-interfaces (PGuild, PChannel, PAuthor, PMessage). These form a common format for all platforms, which makes it easy to work with different and potentially new platforms. Each platform provides a class that extends the Parser class and implements the parse(file: FileInput) function. Check existing implementations for reference.

If you want to support a new platform, you can find guidance in the PARSER.md document. A rough sketch of what a parser might look like is shown below.
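
Note that this sketch is non-authoritative: the module paths, the P-interface fields and the emission mechanism (yielding P-objects from a generator) are all assumptions here; PARSER.md and the existing parsers are the reference.

// Illustrative only: paths, fields and the yield-based emission are assumed.
import { Parser } from "@pipeline/parse/Parser"; // illustrative path
import { FileInput } from "@pipeline/parse/File"; // illustrative path

export class MyPlatformParser extends Parser {
    async *parse(file: FileInput) {
        // Small exports can be decoded at once; large ones should be
        // sliced and streamed instead.
        const text = new TextDecoder().decode(file.slice());
        const data = JSON.parse(text);

        // Assumed P-interface shapes, for illustration only.
        yield { id: data.guild.id, name: data.guild.name }; // PGuild
        yield { id: data.channel.id, guildId: data.guild.id, name: data.channel.name }; // PChannel

        for (const msg of data.messages) {
            yield {
                id: msg.id,
                channelId: data.channel.id,
                authorId: msg.author,
                timestamp: msg.timestamp,
                textContent: msg.content,
            }; // PMessage
        }
    }
}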

3. Processing & Analysis

This step builds the Database object from the P-interfaces produced by the parsers. The class responsible for this is DatabaseBuilder.

During processing, incoming objects from the parser can be:

  • PGuild, PChannel and PAuthor: they are added to an IndexedMap, and an index is assigned to them if their ID hasn't been seen before. Subsequent processing looks up the index by ID, so only the index has to be stored (a minimal sketch follows this list).
  • PMessage: each message is added to a processing queue in its corresponding ChannelMessages instance (each channel has its own). This class handles duplicate messages and overlaps between input files. At some point, the queue is split into groups of messages sent by the same author (PMessageGroup) and passed to the MessageProcessor, which must return a Message array for each group.
    The MessageProcessor class handles tokenization, language detection and sentiment computation. It inserts the resulting data back into the data stores of the DatabaseBuilder instance (words, emoji, mentions, etc.).
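
The ID-to-index mapping described above can be illustrated with a minimal sketch (the real IndexedMap in the repository may differ):

class IndexedMap<T> {
    private indexByID = new Map<string, number>();
    readonly values: T[] = [];

    // Returns the existing index for this ID, or assigns a new one.
    set(id: string, value: T): number {
        let index = this.indexByID.get(id);
        if (index === undefined) {
            index = this.values.length;
            this.indexByID.set(id, index);
            this.values.push(value);
        }
        return index;
    }

    getIndex(id: string): number | undefined {
        return this.indexByID.get(id);
    }
}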

After all input files have been processed, the DatabaseBuilder sorts some data stores (IndexedMaps) by number of messages. Since this changes all indexes, the old indexes have to be remapped to the new ones. It also filters out some data we don't want to keep. Unfortunately, this means deserializing and serializing all messages again.

I hope the code is clear enough to understand how it works.

4. Compression and Encoding

After the Database object is ready, we store it in the standalone report HTML file. To minimize the file size, we use compression and encoding techniques. First, we compress it with fflate, then encode it with base91 using HTML-friendly characters. The resulting string is stored in the HTML file.

When the report HTML is loaded (via blob or file://), we do the inverse process to get the Database object back: decode the string with base91 and decompress it with fflate.
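
Conceptually, the round trip looks like this. The base91 functions below are stand-ins for the project's HTML-friendly codec (assumed names, not real imports), and gzipSync/gunzipSync are used for illustration; the project may use a different fflate mode.

import { gzipSync, gunzipSync } from "fflate";

// Stand-ins for the project's HTML-friendly base91 codec (assumed names).
declare function encodeBase91(data: Uint8Array): string;
declare function decodeBase91(text: string): Uint8Array;

// Serialized Database bytes → string embedded in the report HTML.
const pack = (databaseBytes: Uint8Array): string => encodeBase91(gzipSync(databaseBytes));

// String extracted from the report HTML → serialized Database bytes.
const unpack = (embedded: string): Uint8Array => gunzipSync(decodeBase91(embedded));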

5. Aggregate / Blocks

The last mile is to aggregate the database into useful "blocks", each summarizing some particular piece of information. This is done in the worker running in the report UI. Blocks must be defined inside pipeline/aggregate/blocks and imported in pipeline/aggregate/Blocks.ts.

A typical block implementation looks like this:

export interface MyBlockResult {
    totalMessages: number;
}

export interface MyBlockArgs {
    channelIndex: number;
}

const fn: BlockFn<MyBlockResult, MyBlockArgs> = (database, filters, common, args) => {
    let totalMessages = 0;

    const processMessage = (msg: MessageView) => {
        if (msg.channelIndex === args.channelIndex)
            totalMessages++;
    };

    filterMessages(processMessage, database, filters, {
        // whether to apply the current filters
        authors: true,
        channels: true,
        time: true,
    });

    return { totalMessages };
};

export default {
    key: "folder/my-block",
    triggers: ["authors", "channels", "time"],
    fn,
} as BlockDescription<"folder/my-block", MyBlockResult, MyBlockArgs>;

A block is implemented as a function that receives the Database, the current filters (as set in the UI), some common data (cached for performance) and the block arguments. It returns the block result.

In the UI, blocks can be requested using the useBlockData hook:

const data = useBlockData("folder/my-block", { channelIndex: 0 });
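
For example, a hypothetical React component could consume it like this (assuming data is undefined until the worker has computed the block):

const ChannelTotal = ({ channelIndex }: { channelIndex: number }) => {
    const data = useBlockData("folder/my-block", { channelIndex });
    if (data === undefined) return <>Loading...</>;
    return <>Total messages: {data.totalMessages}</>;
};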

Very neat, right?

6. About message serialization

Instead of storing messages as JS objects, they are serialized into a custom binary format. This reduces memory consumption and improves performance. It also allows us to store many more messages at once, which makes reports smaller.

Overview of serialization files:

  • MessageSerialization.ts: contains writeMessage and readMessage to serialize and deserialize a single message. They make heavy use of writeIndexCounts and readIndexCounts to serialize and deserialize IndexCounts (see below).
  • IndexCountsSerialization.ts: allows serialization of IndexCounts, which are pairs of [index, count]. We use this format to record which objects, and how many of each, are used in a given context (e.g. emoji in a message); a concrete example follows this list.
  • MessagesArray.ts: a useful abstraction to work with serialized messages as if they were a regular Array (push and iteration).
  • MessageView.ts: a class implementing the Message interface that deserializes data on demand. Useful if you don't need all the data at once. Used in aggregation.
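
To make the IndexCounts format concrete, here is a hypothetical value:

// A message that uses the emoji at index 3 twice and the emoji at
// index 7 once would produce these IndexCounts:
const emojiCounts: [number, number][] = [
    [3, 2],
    [7, 1],
];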