# Data Encoding, Decoding and Flow

## Apache Avro

Avro has the following types:

- null: no value
- boolean: a binary value
- int: 32-bit signed integer
- long: 64-bit signed integer
- float: single precision (32-bit) IEEE 754 floating-point number
- double: double precision (64-bit) IEEE 754 floating-point number
- bytes: sequence of 8-bit unsigned bytes
- string: Unicode character sequence
- record: ordered collection of named fields
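- enum: enumeration of named symbols
- array: ordered collection of values of a single item type
- map: collection of key-value pairs with string keys
- union: a value that may be any one of several listed types, such as `["null", "long"]`
- fixed: fixed-length sequence of bytes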

It has two schema languages: one (`Avro IDL`) intended for human editing, and one (based on JSON) that is more easily machine-readable.

### Encoding

In Avro IDL (conventionally stored in an `.avdl` file; `.avsc` is the extension for the JSON form), the schema for the previous example record looks like this:

```avro
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
```

The equivalent JSON representation of that schema is as follows:

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "userName", "type": "string" },
    { "name": "favoriteNumber", "type": ["null", "long"], "default": null },
    { "name": "interests", "type": { "type": "array", "items": "string" } }
  ]
}
```

The data encoded with this schema looks like this:

![avro](../assets/avro.png)

Note first that the schema has no tag numbers. Encoding our sample record with this schema yields an Avro binary encoding of just _32 bytes_, the most compact of all the encodings we've seen.

Examining the byte sequence, you can see that _nothing identifies the fields or their datatypes_. The encoding consists only of values concatenated together. A string is a length prefix followed by UTF-8 bytes, but nothing in the encoded data says that it is a string; it could just as well be an integer or any other datatype. An integer is encoded using a variable-length encoding (a zigzag-encoded varint).
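To make these two primitives concrete, here is a minimal sketch of encoding a `long` and a `string` by hand. It assumes the zigzag-plus-varint scheme described in the Avro specification; the function names `encode_long` and `encode_string` and the sample values are illustrative and do not come from any Avro library.

```python
def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag varint."""
    # Zigzag interleaves negative and positive numbers so that values
    # close to zero stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    # Emit 7 bits per byte, lowest bits first; the high bit of each byte
    # says whether another byte follows.
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode a string as its UTF-8 byte length (a long) plus the bytes."""
    utf8 = s.encode("utf-8")
    return encode_long(len(utf8)) + utf8


print(encode_long(1337).hex())         # f214
print(encode_string("Martin").hex())   # 0c4d617274696e
```

Nothing in either output says which type it holds: `0c` is the length prefix of a string here, but read as a standalone `long` the same byte would decode to 6.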

To parse the binary data, you walk through the fields in the order they appear in the schema and _refer to the schema_ to learn the datatype of each field. Consequently, the binary data can only be decoded correctly if the code reading it knows the exact schema the data was written with. Any mismatch between the schema the reader assumes and the one the writer actually used results in incorrectly decoded data.
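The sketch below decodes a byte string by walking the `Person` schema field by field. It is a hand-rolled illustration that handles only the types this schema uses (`null`, `long`, `string`, unions, arrays, records). The helpers `read_long` and `read_datum` and the sample record values are assumptions made for the example; real code would use an Avro library.

```python
import io

# The Person schema from above, as a Python structure.
PERSON_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}


def read_long(buf):
    """Read one zigzag-encoded varint from a binary stream."""
    result, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):   # high bit clear: last byte of this number
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)   # undo zigzag


def read_datum(buf, schema):
    """Decode one value; the schema alone decides how bytes are read."""
    if isinstance(schema, list):            # union: branch index, then value
        return read_datum(buf, schema[read_long(buf)])
    if isinstance(schema, dict) and schema["type"] == "record":
        # Fields are read strictly in schema order.
        return {f["name"]: read_datum(buf, f["type"]) for f in schema["fields"]}
    if isinstance(schema, dict) and schema["type"] == "array":
        items = []
        while True:
            count = read_long(buf)          # items in this block; 0 ends the array
            if count == 0:
                return items
            if count < 0:                   # negative count: a byte size follows
                count = -count
                read_long(buf)
            for _ in range(count):
                items.append(read_datum(buf, schema["items"]))
    if schema == "null":
        return None
    if schema == "long":
        return read_long(buf)
    if schema == "string":
        return buf.read(read_long(buf)).decode("utf-8")
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")


# Assumed sample record, written out byte by byte (32 bytes in total).
encoded = (
    b"\x0c" + b"Martin"          # userName: length 6, then UTF-8 bytes
    + b"\x02" + b"\xf2\x14"      # favoriteNumber: union branch 1 (long), 1337
    + b"\x04"                    # interests: a block of 2 items
    + b"\x16" + b"daydreaming"   # length 11
    + b"\x0e" + b"hacking"       # length 7
    + b"\x00"                    # end of array
)
record = read_datum(io.BytesIO(encoded), PERSON_SCHEMA)
print(record)
```

Swapping the first two fields in `PERSON_SCHEMA` while leaving `encoded` untouched makes the decoder misread the bytes, which is why the reader needs the writer's exact schema.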

With Avro, encoding and decoding involve two schemas: the `writer's schema`, used when the data is encoded, and the `reader's schema`, used when it is decoded. The two do not have to be identical; they only need to be compatible. When decoding, the Avro library compares the writer's schema with the reader's schema and resolves the differences between them.

The Avro specification defines how this resolution works. Fields are matched by name, so it does not matter if the writer's and reader's schemas list them in a different order. If the writer's schema has a field that the reader's schema lacks, that field is ignored. Conversely, if the reader's schema expects a field that the writer's schema does not contain, the missing field is filled in with the default value declared in the reader's schema. This allows schemas to evolve while the data stays readable.
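The field-matching rules can be sketched in a few lines of plain Python. This is a simplified illustration that operates on an already-decoded record rather than on bytes (a real Avro library applies the same rules while decoding). The function `resolve`, the `photoUrl` field, and the sample values are hypothetical.

```python
import copy


def resolve(written: dict, writer_schema: dict, reader_schema: dict) -> dict:
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Matched by name; position in either schema is irrelevant.
            resolved[name] = written[name]
        elif "default" in field:
            # The writer never wrote this field: use the reader's default.
            resolved[name] = copy.deepcopy(field["default"])
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields known only to the writer are never copied, so they are dropped.
    return resolved


writer_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "photoUrl", "type": "string"},          # unknown to the reader
]}
reader_schema = {"type": "record", "name": "Person", "fields": [
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "userName", "type": "string"},          # different order
    {"name": "interests", "type": {"type": "array", "items": "string"},
     "default": []},                                 # unknown to the writer
]}

written = {"userName": "Martin", "favoriteNumber": 1337,
           "photoUrl": "https://example.com/martin.png"}
resolved = resolve(written, writer_schema, reader_schema)
print(resolved)
```

The sketch raises an error when the reader expects a field that has neither a written value nor a default, which is why newly added fields need defaults if old data must stay readable.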

### Reading (Decoding) a File

Instead of demonstrating RPC, let's look at how to decode a file from a real-world dataset: genomic variation data for 1000 samples from the [OpenCGA](http://docs.opencb.org/display/opencga/Welcome+to+OpenCGA) project.

```python
import copy
import json
from pprint import pprint  # pretty-printer for nested structures

import fastavro
```

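```python
with open('../data/1k.variants.avro', 'rb') as f:
    reader = fastavro.reader(f)
    # Read every record into memory.
    genomic_var_1k = list(reader)
    # The file header carries the writer's schema, so none is supplied here.
    metadata = copy.deepcopy(reader.metadata)
    writer_schema = copy.deepcopy(reader.writer_schema)
    # The same schema is also stored as a JSON string in the file metadata.
    schema_from_file = json.loads(metadata['avro.schema'])
```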

```python
len(genomic_var_1k)
```

```text
1000
```

```python
pprint(writer_schema)
```

```text
{'fields': [{'doc': '* The variant ID.',
             'name': 'id',
             'type': ['null',
                      {'avro.java.string': 'String', 'type': 'string'}]},
            {'default': [],
             'doc': '* Other names used for this genomic variation.',
             'name': 'names',
             'type': {'items': {'avro.java.string': 'String', 'type': 'string'},
                      'type': 'array'}},
            {'doc': '* Chromosome where the genomic variation occurred.',
             'name': 'chromosome',
             'type': {'avro.java.string': 'String', 'type': 'string'}},
            {'doc': '* Normalized position where the genomic variation '
                    'starts.\n'
                    '         * <ul>\n'
                    '         * <li>SNVs have the same start and end '
                    'position</li>\n'
                    '         * <li>Insertions start in the last present '
                    'position: if the first nucleotide\n'
```

```python
pprint(schema_from_file)
```

```text
{'fields': [{'doc': '* The variant ID.',
             'name': 'id',
             'type': ['null',
                      {'avro.java.string': 'String', 'type': 'string'}]},
            {'default': [],
             'doc': '* Other names used for this genomic variation.',
             'name': 'names',
             'type': {'items': {'avro.java.string': 'String', 'type': 'string'},
                      'type': 'array'}},
            {'doc': '* Chromosome where the genomic variation occurred.',
             'name': 'chromosome',
             'type': {'avro.java.string': 'String', 'type': 'string'}},
            {'doc': '* Normalized position where the genomic variation '
                    'starts.\n'
                    '         * <ul>\n'
                    '         * <li>SNVs have the same start and end '
                    'position</li>\n'
                    '         * <li>Insertions start in the last present '
                    'position: if the first nucleotide\n'
```

```python
pprint(genomic_var_1k[0])
```

```text
{'alternate': 'T',
 'annotation': {'additionalAttributes': None,
                'alternate': 'T',
                'ancestralAllele': None,
                'chromosome': '22',
                'consequenceTypes': [{'biotype': 'lincRNA',
                                      'cdnaPosition': 0,
                                      'cdsPosition': 0,
                                      'codon': '',
                                      'ensemblGeneId': 'ENSG00000233866',
                                      'ensemblTranscriptId': 'ENST00000424770',
                                      'geneName': 'LA16c-4G1.3',
                                      'proteinVariantAnnotation': {'alternate': '',
                                                                   'features': [],
                                                                   'functionalDescription': None,
                                                                   'keywords': [],
```

```python
for f in schema_from_file["fields"]:
    print(f["name"])
```

```text
id
names
chromosome
start
end
reference
alternate
strand
sv
length
type
hgvs
studies
annotation
```
