# Lesson overview

In this lesson, we will cover the following topics:

1. Introduction to Data Encoding
2. Importance of Data Encoding in Machine Learning
3. Overview of Different Data Encoding Techniques
4. Hands-on Examples of Data Encoding
5. Best Practices for Data Encoding

By the end of this lesson, you should have a solid understanding of data encoding and its significance in the data preprocessing pipeline.


### Comparison of Encoding Formats

| Feature | Thrift | Protobuf | Avro | Parquet | ORC | Arrow |
|---|---|---|---|---|---|---|
| **Schema** | `.thrift` file | `.proto` file | `.avsc` file (JSON) or `.avdl` file (IDL) | Embedded in file | Embedded in file | Carried with the data (IPC stream or file) |
| **Data Types** | Rich type system | Rich type system | Rich type system | Columnar types | Columnar types | Columnar types |
| **Encoding** | Compact binary with field tags | Compact binary with field tags | Compact binary without tags | Columnar binary | Columnar binary | Columnar in-memory layout |
| **Schema Evolution** | Backward & forward compatible | Backward & forward compatible | Backward & forward compatible (reader/writer schema resolution) | Limited; handled by the reading engine | Limited; handled by the reading engine | Not a format-level feature |
| **RPC Support** | Yes | Yes (via gRPC) | Yes (Avro RPC protocol) | No | No | Via Arrow Flight (built on gRPC) |
| **Use Case** | RPC and data serialization | RPC and data serialization | Data serialization | Analytical workloads | Analytical workloads | In-memory analytics |

Each format has its strengths and is suited for specific use cases. For example, Thrift and Protobuf are typically used for RPC systems, Avro for row-oriented data serialization, Parquet and ORC for columnar storage in analytical workloads, and Arrow for in-memory analytics.

### Understanding Thrift in Plain Language

Thrift is a tool that helps different programs talk to each other, even if they are written in different programming languages. It does this by defining a common language, called a schema, that both programs can understand. This schema is written in a `.thrift` file.

Here’s how it works:

1. **Schema Definition**: You describe the structure of your data in a `.thrift` file. For example, you can define a `Person` with fields like `name`, `age`, and `hobbies`.

2. **Code Generation**: Thrift takes the `.thrift` file and generates code in your preferred programming language. This code helps you easily work with the data structure you defined.

3. **Data Encoding**: When you want to send data, Thrift converts it into a compact binary format. This format is small and efficient, making it faster to send over a network.

4. **Remote Procedure Calls (RPC)**: Thrift also allows you to define services in the `.thrift` file. A service is like a list of actions that one program can ask another program to perform. For example, a `School` service might have an action like `teachCourse` to add a course to a person’s list of hobbies.

5. **Cross-Language Communication**: Since Thrift generates code for many programming languages, you can have a Python program talk to a Java program, or any other combination, as long as they both use the same `.thrift` file.

In summary, Thrift is a powerful tool for making programs work together by providing a common way to define data and actions, and by efficiently encoding data for communication.

### Thrift Server Code - Line by Line Explanation

Let's break down the Thrift server code to understand what each line does:

```python
import thriftpy2
person_thrift = thriftpy2.load("./schema/person.thrift", module_name="person_thrift")

from thriftpy2.rpc import make_server

class School(object):
    def teachCourse(self, person, course):
        person.interests.append(course)
        return person

server = make_server(person_thrift.School, School(), client_timeout=None)
server.serve()
```

**Line 1:** `import thriftpy2`
- This imports the `thriftpy2` library, which is a Python implementation of Apache Thrift
- Thriftpy2 allows us to work with Thrift schemas and create RPC servers/clients

**Line 2:** `person_thrift = thriftpy2.load("./schema/person.thrift", module_name="person_thrift")`
- This line loads our `.thrift` schema file and converts it into Python classes
- `"./schema/person.thrift"` is the path to our schema file
- `module_name="person_thrift"` gives a name to the generated module
- After this line, we can access `person_thrift.Person` (the struct) and `person_thrift.School` (the service)

**Line 4:** `from thriftpy2.rpc import make_server`
- This imports the `make_server` function which helps us create an RPC server
- An RPC server listens for incoming requests and processes them

**Line 6:** `class School(object):`
- This defines a Python class that implements our Thrift service
- The class name `School` matches the service name in our `.thrift` file
- This class will contain the actual implementation of our service methods

**Line 7:** `def teachCourse(self, person, course):`
- This defines the `teachCourse` method that was declared in our `.thrift` file
- It takes two parameters: a `Person` object and a `course` string
- The method signature must match what we defined in the schema

**Line 8:** `person.interests.append(course)`
- This adds the new course to the person's list of interests
- We're modifying the `interests` field of the `Person` object
- `append()` adds the course to the end of the list

**Line 9:** `return person`
- This returns the modified `Person` object back to the client
- The return type must match what we declared in the `.thrift` file

**Line 11:** `server = make_server(person_thrift.School, School(), client_timeout=None)`
- This creates the actual RPC server
- `person_thrift.School` is the service interface from our schema
- `School()` is an instance of our implementation class
- `client_timeout=None` disables the timeout on client connections, so the server never drops a slow or idle client

**Line 12:** `server.serve()`
- This starts the server and makes it listen for incoming requests
- The server will run indefinitely, waiting for clients to connect
- When a client calls `teachCourse`, our implementation will be executed

### What are struct, service, and method? — linked to the notebook example

This short reference ties the abstract terms used in Thrift to the concrete example in the notebook (`schema/person.thrift` and the generated Python code).

- Struct (example: `Person`)
  - Plain meaning: a struct is a schema for a data record — a named collection of fields (like a small form or a typed dictionary).
  - In the notebook: `struct Person { 1: required string userName, 2: optional i64 favoriteNumber, 3: optional list<string> interests }` defines what data a Person contains.
  - At runtime: after loading the schema with Thrift (`thriftpy2.load(...)`) you get a Python class `person_thrift.Person` you can instantiate and send over the network.

- Service (example: `School`)
  - Plain meaning: a service is a named collection (an interface) of remote operations — think of it as the API surface that other programs can call.
  - In the notebook: `service School { Person teachCourse(1: required Person person, 2: required string course) }` declares the API the server will expose.
  - At runtime: Thrift exposes `person_thrift.School` as the service interface. You implement those methods in a Python class (commonly named `School`) and bind it to a server with `make_server(...)`.

- Method (example: `teachCourse`)
  - Plain meaning: a method is a single operation/remote procedure inside a service — a callable action that accepts arguments and returns a value.
  - In the notebook: `teachCourse` takes a `Person` and a `course` string and returns a `Person`.
  - At runtime: the client calls `client.teachCourse(person, "coding")`; the client stub serializes the arguments, sends them to the server, and the server implementation (the `teachCourse` method on your `School` class) runs and returns a `Person` object.

How they map together (quick flow):
1. Write `struct` + `service` in `schema/person.thrift`.
2. Load schema in Python: `person_thrift = thriftpy2.load("./schema/person.thrift", ...)` → gives `person_thrift.Person` and `person_thrift.School`.
3. Server: implement a Python class with methods matching the service (e.g., `def teachCourse(self, person, course): ...`) and run `make_server(person_thrift.School, School(), ...)`.
4. Client: create a client stub with `make_client(person_thrift.School, ...)`, build a `person_thrift.Person(...)`, and call `client.teachCourse(...)`.

Practical tips / gotchas
- Optional fields may be `None`, so server code should guard before using lists (for example, check `if person.interests is None:` before appending).
- The server mutates its own decoded copy of the object and returns it. The client's original object is not changed, so clients must use the returned value.
- The service name and the Python implementation class may share the same name (e.g., `School`) — that's normal but don't confuse the *service interface* (schema) with the *class that implements it* (runtime object).
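
The guard from the first tip can be sketched without a running server. This is a minimal illustration, assuming a plain dataclass as a hypothetical stand-in for the generated `person_thrift.Person` class; only the `teachCourse` body is meant to carry over to the real server.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """Hypothetical stand-in for the generated person_thrift.Person class."""
    userName: str
    favoriteNumber: Optional[int] = None
    interests: Optional[List[str]] = None  # optional field: may be None


class School:
    def teachCourse(self, person, course):
        # Guard: an optional list that was never set arrives as None.
        if person.interests is None:
            person.interests = []
        person.interests.append(course)
        return person


school = School()
newcomer = school.teachCourse(Person(userName="Ada"), "coding")
print(newcomer.interests)  # ['coding']
```

Without the guard, the same call would raise `AttributeError` because `None` has no `append` method.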

### What are `person.thrift` and `person_thrift_server.py` (purpose & how they work together)

- `person.thrift` (schema / contract)
  - Purpose: defines the data structures and RPC API contract. It contains the `struct` definitions (e.g., `Person`) and `service` definitions (e.g., `School` with methods like `teachCourse`).
  - Role: a language-agnostic specification that both client and server rely on. Field tags, types, and required/optional flags live here.
  - When used: tools or libraries (here `thriftpy2`) read this file and produce language-specific classes/interfaces you use at runtime.
  - Practical notes: maintain stable field tag numbers; make new fields optional for compatibility.

- `person_thrift_server.py` (server implementation)
  - Purpose: provides the runtime implementation of the service declared in the schema. It loads the schema, implements the service methods in a Python class (e.g., `class School`), and starts an RPC server.
  - Role: receives encoded requests from clients, decodes them into `Person` objects, executes the method logic (e.g., add a course, assign a grade), and returns encoded responses.
  - Key lines to look for in the file:
    - schema load: `thriftpy2.load("./schema/person.thrift", ...)` — imports the generated types and service interface
    - implementation: `class School(...): def teachCourse(self, person, course): ...` — the server-side logic
    - server start: `make_server(person_thrift.School, School(), ...)` and `server.serve()` — binds implementation to RPC and listens for clients
  - Practical notes: guard against missing optional fields (e.g., ensure `person.interests` exists before appending); ensure server and clients use the same schema file/version.

How they fit together (quick flow):
1. Author `person.thrift` (the contract).
2. Server loads the contract, implements methods, and starts serving (`person_thrift_server.py`).
3. Clients load the same contract, create `Person` instances, and call remote methods — data flows encoded across the network and is decoded by the other side.
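
The notes show the server file but not a client, so here is a sketch of step 3. It assumes the server from `person_thrift_server.py` is already running; the function name `run_client` is made up for this note, and the `host` and `port` defaults are assumptions that must match whatever the server actually binds to.

```python
def run_client(schema_path="./schema/person.thrift", host="localhost", port=9090):
    """Call teachCourse on a running School server and return the result.

    host and port are assumptions; they must match the server's address.
    """
    # Imported lazily so this sketch can be defined without thriftpy2 installed.
    import thriftpy2
    from thriftpy2.rpc import make_client

    # Load the same contract the server uses.
    person_thrift = thriftpy2.load(schema_path, module_name="person_thrift")
    client = make_client(person_thrift.School, host, port)

    martin = person_thrift.Person(
        userName="Martin",
        favoriteNumber=1337,
        interests=["daydreaming", "hacking"],
    )
    # The return value is the server's updated copy; `martin` itself is unchanged.
    return client.teachCourse(martin, "coding")
```

Calling `run_client()` in a second terminal while the server is running should return a `Person` whose `interests` list ends with the new course.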

### Annotated snippets — key lines from `schema/person.thrift` and `person_thrift_server.py`

#### `schema/person.thrift` (selected lines)

```thrift
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}

service School {
    Person teachCourse(1: required Person person, 2: required string course)
}
```

- `struct Person { ... }` — defines the data record shape (a typed container) that will be encoded/decoded.
- `1: required string userName` — field tag `1` is the stable numeric identifier used in the binary encoding; `required` means writers must set it.
- `2: optional i64 favoriteNumber` — optional numeric field; absent values are simply omitted from the encoding.
- `3: optional list<string> interests` — a repeated/collection field; may be None or an empty list at runtime.
- `service School { ... }` — declares the RPC interface (the API surface) other programs can call.
- `Person teachCourse(1: required Person person, 2: required string course)` — a method signature: inputs (tagged) and return type (`Person`).
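
To make the role of field tags concrete, here is a hand-written sketch of how Thrift's plain binary protocol lays out this `Person` struct. It is for illustration only: in practice the Thrift library does this for you, the helper names (`field_header`, `encode_string`, `encode_person`) are made up for this note, and the compact protocol uses a different, smaller layout.

```python
import struct

# Thrift type IDs used by the binary protocol.
T_STOP, T_I64, T_STRING, T_LIST = 0, 10, 11, 15


def field_header(type_id, field_id):
    # 1-byte type followed by the 2-byte big-endian field tag.
    return struct.pack(">bh", type_id, field_id)


def encode_string(value):
    raw = value.encode("utf-8")
    return struct.pack(">i", len(raw)) + raw  # 4-byte length, then the bytes


def encode_person(user_name, favorite_number=None, interests=None):
    out = field_header(T_STRING, 1) + encode_string(user_name)
    if favorite_number is not None:  # optional: omitted entirely when unset
        out += field_header(T_I64, 2) + struct.pack(">q", favorite_number)
    if interests is not None:
        out += field_header(T_LIST, 3)
        out += struct.pack(">bi", T_STRING, len(interests))  # element type + count
        out += b"".join(encode_string(item) for item in interests)
    return out + bytes([T_STOP])  # a stop byte ends the struct


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))  # 59
```

Field names such as `userName` never appear in the bytes, only the tags 1, 2, and 3. That is why tag numbers must stay stable while names can be changed freely.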

#### `person_thrift_server.py` (selected lines)

```python
import thriftpy2
person_thrift = thriftpy2.load("./schema/person.thrift", module_name="person_thrift")

from thriftpy2.rpc import make_server

class School(object):
    def teachCourse(self, person, course):
        person.interests.append(course)
        return person

server = make_server(person_thrift.School, School(), client_timeout=None)
server.serve()
```

- `import thriftpy2` — imports the Thrift runtime used to load schemas and run RPC.
- `thriftpy2.load(...)` — loads the schema file and provides Python classes like `person_thrift.Person` and the service interface `person_thrift.School`.
- `from thriftpy2.rpc import make_server` — helper to create a network server bound to a service implementation.
- `class School(object):` — the Python class that implements the service methods declared in the schema.
- `def teachCourse(self, person, course):` — the method that executes when clients call `teachCourse`; signatures must match the schema.
- `person.interests.append(course)` — server-side logic modifying the `Person` object in-place (guard against `None` in production code).
- `server = make_server(...)` and `server.serve()` — create and start the RPC server so clients can connect and call methods.

Tip: you can add small defensive checks (e.g., `if person.interests is None: person.interests = []`) to avoid runtime errors when fields are optional.

---

# Protocol Buffers (Protobuf)

### Protobuf: what `message`, `service`, and `rpc` (method) mean

- `message` (analogous to Thrift `struct`)
  - Purpose: defines a typed data record — a schema for the fields a record carries. In Protobuf, `message Person { ... }` declares the shape of a Person.
  - Runtime role: after compiling (`protoc`) the `.proto` file, each `message` becomes a language-specific class you can instantiate, serialize, and send over the network.
  - Practical note: fields are identified by numeric tags (e.g., `= 1`) and must be chosen and preserved carefully for schema evolution.

- `service` (the API surface)
  - Purpose: groups related remote procedures into a named API, e.g., `service School { ... }`.
  - Runtime role: code generation (for example, the gRPC plugin for `protoc`) creates a server base class and a client *stub*; the server implements the base class and clients call methods remotely through the stub.

- `rpc` (method)
  - Purpose: declares a single remote procedure (method) in the service with a specific request and response message type, e.g., `rpc teachCourse(CourseRequest) returns (Person)`.
  - Runtime role: the client calls the generated stub method; the runtime serializes the request message, sends it to the server, the server runs the implemented method and returns a serialized response.

Short annotated example (from the notebook):

```protobuf
message Person {
  string user_name = 1;         // field tag 1 — stable identifier in the encoding
  optional int64 favorite_number = 2; // `optional` in proto3 tracks whether the field was set
  repeated string interests = 3; // a list/array of strings
}

message CourseRequest {
  Person person = 1;            // nested message used as request wrapper
  string course = 2;
}

service School {
  rpc teachCourse(CourseRequest) returns (Person); // method: request -> response
}
```

Quick analogies & tips
- Think "message = typed record / struct", "service = interface / API", and "rpc = a function signature on that API". 
- With gRPC, the `.proto` drives both the data classes and the RPC bindings. `protoc` + plugins generate the server and client scaffolding.
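- Schema evolution: add fields with new tag numbers and don't reuse old tags. proto3 has no `required` keyword, and an unset scalar field reads back as its default value (0 or an empty string) unless it is declared `optional`.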
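
The tag-based encoding and the schema evolution tip can be sketched in plain Python. This is a simplified, hand-written version of the Protobuf wire format that handles only the two wire types this `Person` message needs (varint and length-delimited) and only non-negative integers; the function names are made up for this note, and real code would use the classes generated by `protoc`.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def key(field_number, wire_type):
    return encode_varint((field_number << 3) | wire_type)


def encode_string_field(field_number, text):
    raw = text.encode("utf-8")
    return key(field_number, 2) + encode_varint(len(raw)) + raw


def encode_person(user_name, favorite_number, interests):
    out = encode_string_field(1, user_name)
    out += key(2, 0) + encode_varint(favorite_number)
    for item in interests:  # repeated field: one tagged entry per element
        out += encode_string_field(3, item)
    return out


def read_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known_fields(buf, known):
    """Return {field_number: [values]} for tags in `known`, skipping the rest."""
    fields, pos = {}, 0
    while pos < len(buf):
        k, pos = read_varint(buf, pos)
        number, wire_type = k >> 3, k & 7
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError("wire type not handled in this sketch")
        if number in known:
            fields.setdefault(number, []).append(value)
    return fields


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))  # 33

# An "old" reader that only knows tags 1 and 2 simply skips tag 3.
old_view = decode_known_fields(encoded, known={1, 2})
print(old_view)  # {1: [b'Martin'], 2: [1337]}
```

Because every value is prefixed by its tag and wire type, a reader built from an older schema can step over fields it does not recognise. Reusing a tag number for a different field would break exactly this mechanism.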

### Protobuf/gRPC Client Code - Line by Line Explanation

Let's break down the complete Protobuf/gRPC client code to understand what each line does. This builds on the server code and schema we defined earlier.

#### Context: The Schema and Server Setup

First, recall our `.proto` schema (from the `%%writefile ../schema/person.proto` cell):
```protobuf
syntax = "proto3";

message Person {
  string user_name = 1;
  optional int64 favorite_number = 2;
  repeated string interests = 3;
}

message CourseRequest {
  Person person = 1;
  string course = 2;
}

service School {
  rpc teachCourse(CourseRequest) returns (Person) {}
}
```

And our server implementation (from the `%%writefile ../person_protobuf_server.py` cell):
```python
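# Excerpt only: imports and server startup are omitted here.
class School(person_pb2_grpc.SchoolServicer):
    def teachCourse(self, request, context):
        # Append the requested course to the repeated `interests` field.
        request.person.interests.append(request.course)
        return request.person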
```
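
The excerpt above leaves out how the server is started. A typical startup could look like the sketch below. It assumes `person_pb2_grpc` has been generated from `person.proto` and is importable; the function name `serve`, the worker count, and the bind address are choices made for this note, with the port picked to match the `localhost:50051` address the client uses.

```python
from concurrent import futures


def serve(port=50051, max_workers=10):
    """Start the School gRPC server and block until it is stopped."""
    # Imported lazily; person_pb2_grpc is the module generated from person.proto.
    import grpc
    import person_pb2_grpc

    class School(person_pb2_grpc.SchoolServicer):
        def teachCourse(self, request, context):
            request.person.interests.append(request.course)
            return request.person

    # A thread pool handles incoming calls concurrently.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=max_workers))
    person_pb2_grpc.add_SchoolServicer_to_server(School(), server)
    server.add_insecure_port(f"[::]:{port}")  # no TLS: local development only
    server.start()
    server.wait_for_termination()
```

Calling `serve()` blocks the process, so the client has to run in a separate terminal or notebook kernel.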

#### Client Setup and Imports

```python
import sys
sys.path.append('..')
import grpc
import person_pb2
import person_pb2_grpc
```

**Line 1-2:** `import sys` and `sys.path.append('..')`
- These lines modify Python's module search path to include the parent directory
- This is needed because the generated `person_pb2.py` and `person_pb2_grpc.py` files are in the parent directory
- Without this, Python wouldn't be able to find and import the generated Protobuf classes

**Line 3:** `import grpc`
- Imports the main gRPC library for Python
- This provides the networking infrastructure for making remote procedure calls
- Handles connection management, serialization, and communication protocols

**Line 4:** `import person_pb2`
- Imports the generated Python classes for our Protobuf messages
- This gives us access to `person_pb2.Person` and `person_pb2.CourseRequest` classes
- These are the data model classes (equivalent to structs) we can instantiate and populate

**Line 5:** `import person_pb2_grpc`
- Imports the generated gRPC service classes
- This gives us access to `person_pb2_grpc.SchoolStub` for making client calls
- Contains the generated stub and servicer classes, which delegate the actual networking to the `grpc` library

#### Helper Function

```python
def teach_course(stub, person, course):
    person = stub.teachCourse(person_pb2.CourseRequest(person=person, course=course))
    return person
```

**Line 1:** `def teach_course(stub, person, course):`
- Defines a helper function that wraps the gRPC call
- Takes three parameters:
  - `stub`: the gRPC client stub (the client-side proxy bound to a channel)
  - `person`: a `Person` message object
  - `course`: a string representing the course name

**Line 2:** `person = stub.teachCourse(person_pb2.CourseRequest(person=person, course=course))`
- **`person_pb2.CourseRequest(person=person, course=course)`**: Creates a new `CourseRequest` message
  - This wraps our `person` and `course` parameters into the request format expected by the server
  - The server method signature requires a single `CourseRequest` parameter, not separate arguments
- **`stub.teachCourse(...)`**: Makes the actual remote procedure call
  - `stub` is the client-side proxy that handles network communication
  - Serializes the `CourseRequest`, sends it to the server, waits for response
  - The server runs its `teachCourse` method and returns a `Person` message
  - The stub deserializes the response back into a Python `Person` object
- **`person = ...`**: Assigns the returned `Person` object to the `person` variable
  - This overwrites the original person with the modified version from the server

**Line 3:** `return person`
- Returns the modified `Person` object that came back from the server
- This person now has the new course added to their interests list

#### Main Client Logic

```python
with grpc.insecure_channel('localhost:50051') as channel:
    martin = person_pb2.Person(user_name='Martin', favorite_number=1337, interests=["daydreaming", "hacking"])
    course = "coding"
    stub = person_pb2_grpc.SchoolStub(channel)
    martin = teach_course(stub, martin, course)
    print(martin.interests)
```

**Line 1:** `with grpc.insecure_channel('localhost:50051') as channel:`
- **`grpc.insecure_channel('localhost:50051')`**: Creates a connection to the gRPC server
  - `'localhost:50051'` specifies the server address and port (must match the server's port)
  - `insecure_channel` means no encryption/TLS (fine for local development)
  - For production, you'd use `secure_channel` with proper certificates
- **`with ... as channel:`**: Uses a context manager to ensure the connection is properly closed
  - The connection will be automatically closed when the block exits
  - Even if an exception occurs, the connection cleanup is guaranteed

**Line 2:** `martin = person_pb2.Person(user_name='Martin', favorite_number=1337, interests=["daydreaming", "hacking"])`
- Creates a new `Person` message object using the generated `person_pb2.Person` class
- Sets the fields according to our schema:
  - `user_name='Martin'` → sets the field with tag 1 (string)
  - `favorite_number=1337` → sets the field with tag 2 (optional int64)
  - `interests=["daydreaming", "hacking"]` → sets the field with tag 3 (repeated string)
- This creates a local Python object that matches our Protobuf schema

**Line 3:** `course = "coding"`
- Simple string variable holding the course we want to add
- This will be passed to the server as part of the `CourseRequest`

**Line 4:** `stub = person_pb2_grpc.SchoolStub(channel)`
- Creates a client stub for the `School` service
- **`SchoolStub`**: Generated class that provides client-side methods for all service RPCs
  - Has a `teachCourse` method that matches our service definition
  - Handles serialization, network communication, and deserialization automatically
- **`(channel)`**: Associates the stub with our gRPC connection
  - All method calls on this stub will use this channel to reach the server

**Line 5:** `martin = teach_course(stub, martin, course)`
- Calls our helper function to make the actual RPC call
- Passes the stub, `martin` (the `Person` object), and `course` (a string)
- The function returns the modified martin with "coding" added to his interests
- **Behind the scenes flow**:
  1. Helper function creates `CourseRequest(person=martin, course="coding")`
  2. Stub serializes the request into binary format
  3. gRPC sends the binary data over the network to localhost:50051
  4. Server receives, deserializes, and processes the request
  5. Server calls `teachCourse` method, adds course to interests
  6. Server serializes and sends back the modified Person
  7. Client stub deserializes the response into a Python Person object

**Line 6:** `print(martin.interests)`
- Prints the updated interests list for martin
- Should now include the original interests plus "coding"
- Output would be something like: `['daydreaming', 'hacking', 'coding']`
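
The walkthrough notes that production code would use `secure_channel` instead of `insecure_channel`. A sketch of that switch follows. The helper name `open_channel` and the `root_cert_path` parameter are made up for this note, and the server would also need to be configured for TLS, which is not shown here.

```python
def open_channel(target="localhost:50051", root_cert_path=None):
    """Return an insecure channel, or a TLS channel when a CA file is given."""
    import grpc  # imported lazily so the sketch can be defined without grpcio

    if root_cert_path is None:
        return grpc.insecure_channel(target)  # local development only

    # Read the PEM-encoded root certificate used to verify the server.
    with open(root_cert_path, "rb") as f:
        credentials = grpc.ssl_channel_credentials(root_certificates=f.read())
    return grpc.secure_channel(target, credentials)
```

The returned channel can be used in the same `with ... as channel:` block as before, so the rest of the client code does not change.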

#### Key Concepts Summary

1. **Message Objects**: `person_pb2.Person` and `person_pb2.CourseRequest` are Python classes representing our Protobuf messages
2. **Service Stub**: `person_pb2_grpc.SchoolStub` is a client proxy that makes remote calls look like local method calls
3. **Channel**: `grpc.insecure_channel` manages the network connection to the server
4. **Request/Response Flow**: Client creates request → serializes → network → server processes → serializes response → network → client deserializes
5. **Type Safety**: The generated classes ensure we use the correct field names and types as defined in our schema

---

# Apache Avro

### Apache Avro File Reading Code - Line by Line Explanation

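Let's break down the Avro file reading code to understand what each line does, both technically and in plain language. The snippet assumes `copy`, `json`, and `fastavro` were imported earlier (for example, in a previous notebook cell):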

```python
with open('../data/1k.variants.avro', 'rb') as f:
    reader = fastavro.reader(f)
    genomic_var_1k = [sample for sample in reader]
    metadata = copy.deepcopy(reader.metadata)
    writer_schema = copy.deepcopy(reader.writer_schema)
    schema_from_file = json.loads(metadata['avro.schema'])
```

**Line 1:** `with open('../data/1k.variants.avro', 'rb') as f:`
- **Technical Explanation:**
  - Opens the Avro file in **binary read mode** (`'rb'`).
  - `'../data/1k.variants.avro'` is the path to a genomic variation dataset with 1000 samples.
  - Uses a context manager (`with`) to automatically close the file when done.
  - Avro files are binary format, so we must open in binary mode (not text mode).
- **Plain Language Explanation:**
  - Think of it like: "Open this data file like opening a book, but tell the computer it's written in a special binary language, not regular text."
  - Why 'rb'? Like how you need different glasses to read different languages, computers need to know if a file contains text or binary data.
  - The 'with' part: It's like saying "when I'm done reading this book, automatically put it back on the shelf for me."

**Line 2:** `reader = fastavro.reader(f)`
- **Technical Explanation:**
  - Creates an Avro reader object using the `fastavro` library.
  - The reader can parse the binary Avro format and extract both data and metadata.
  - `f` is the file handle from line 1.
  - This reader will allow us to iterate through all records in the file.
- **Plain Language Explanation:**
  - Think of it like: "Get me a translator who understands this special Avro language and can convert it to something I can understand."
  - Real-world analogy: Like hiring an interpreter when you visit a foreign country—they help you understand what's being said.

**Line 3:** `genomic_var_1k = [sample for sample in reader]`
- **Technical Explanation:**
  - **List comprehension** that reads all records from the Avro file into memory.
  - `reader` is iterable—each iteration gives you one record (sample).
  - `sample` represents one genomic variant record from the dataset.
  - `genomic_var_1k` becomes a Python list containing all 1000 genomic variant records.
  - Each sample is converted from Avro binary format into a Python dictionary.
- **Plain Language Explanation:**
  - Think of it like: "Go through every page in this book and copy all the information into my notebook."
  - Real-world analogy: Like making photocopies of every recipe in a cookbook so you can use them later.
  - What happens: The computer reads all 1000 genetic samples and puts them in a list you can work with.

**Line 4:** `metadata = copy.deepcopy(reader.metadata)`
- **Technical Explanation:**
  - Extracts metadata from the Avro file header.
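  - `reader.metadata`: Contains file-level information from the header, such as the schema (`avro.schema`) and the compression codec (`avro.codec`), plus any custom keys the writer added.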
  - `copy.deepcopy()`: Makes a complete independent copy of the metadata.
  - Without deep copy, changes to metadata could affect the original reader object.
  - Metadata includes information about how the file was created and encoded.
- **Plain Language Explanation:**
  - Think of it like: "Make me a complete copy of the book's table of contents and publication information."
  - Real-world analogy: Like photocopying the copyright page and index of a book for your records.
  - Why copy? So you have your own version that won't change if something happens to the original.

**Line 5:** `writer_schema = copy.deepcopy(reader.writer_schema)`
- **Technical Explanation:**
  - Extracts the **writer's schema**—the schema used when the file was written.
  - This is the original schema definition that was used to encode the data.
  - `copy.deepcopy()` again creates an independent copy.
  - The writer's schema tells us the structure and types of fields in each record.
  - Important for Avro because you need the exact schema to properly decode binary data.
- **Plain Language Explanation:**
  - Think of it like: "Give me a copy of the instruction manual that was used to organize this data."
  - Real-world analogy: Like getting a copy of the recipe format that was used to write down all the recipes in a cookbook.
  - Why important? You need to know how the data was organized to understand what each piece means.

**Line 6:** `schema_from_file = json.loads(metadata['avro.schema'])`
- **Technical Explanation:**
  - Extracts the schema from metadata and parses it as JSON.
  - `metadata['avro.schema']`: The schema is stored as a JSON string in the metadata.
  - `json.loads()`: Converts the JSON string into a Python dictionary.
  - This gives us a more accessible format to examine the schema structure.
  - The schema describes field names, types, and structure of each genomic variant record.
- **Plain Language Explanation:**
  - Think of it like: "Take this instruction manual, which is stored as one long string of text, and turn it into something easy to read and navigate."
  - Real-world analogy: Like unfolding a complex origami instruction sheet into a simple step-by-step guide.
  - Result: You get an easy-to-use dictionary that tells you what fields exist in your data and what types they are.

#### Key Concepts Summary

1. **Binary Format**: Avro files are binary, requiring `'rb'` mode and special parsing libraries
2. **Schema Dependency**: Unlike JSON, Avro data can't be read without knowing the schema
3. **Metadata Storage**: Avro embeds the schema and other metadata directly in the file header
4. **Memory Loading**: The list comprehension loads all records into RAM (fine for 1000 records, consider streaming for larger files)
5. **Deep Copying**: Ensures we have independent copies of metadata and schema for safe manipulation
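
For files too large to load at once, the records can be processed in batches instead of being collected into one list. The helper below is a sketch of that idea using only the standard library. `iter_batches` is a made-up name, and the generator of dictionaries is a stand-in for `fastavro.reader(f)` with a hypothetical `variant_id` field; any iterable of records works the same way.

```python
from itertools import islice


def iter_batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for `fastavro.reader(f)`: any iterable of dict records works.
fake_reader = ({"variant_id": i} for i in range(10))
batch_sizes = [len(batch) for batch in iter_batches(fake_reader, 4)]
print(batch_sizes)  # [4, 4, 2]
```

Only one batch is held in memory at a time, so memory use depends on `batch_size` and not on the size of the file.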

#### Real-world Context

This genomic dataset contains information about genetic variations across 1000 samples, commonly used in bioinformatics research. The Avro format is ideal here because:
- Compact binary encoding saves storage space
- Schema evolution allows adding new fields as research progresses
- Self-describing format includes metadata about the dataset
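
To see why the encoding is compact and why the schema is needed to read it, here is a hand-written sketch of Avro's binary encoding for a single record. It does not use the genomic dataset. It assumes a hypothetical Avro schema mirroring the `Person` example from the Thrift and Protobuf sections: a `string`, a `["null", "long"]` union, and an array of strings, in that order. It covers only the record body, not the container file with its header and sync markers.

```python
def zigzag_varint(n):
    """Avro int/long encoding: zigzag first, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # maps small negative numbers to small positives
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text):
    raw = text.encode("utf-8")
    return zigzag_varint(len(raw)) + raw  # length prefix, then the bytes


def encode_person(user_name, favorite_number, interests):
    # Fields are written in schema order with no tags and no names.
    out = encode_string(user_name)
    if favorite_number is None:
        out += zigzag_varint(0)  # union branch 0: null
    else:
        out += zigzag_varint(1) + zigzag_varint(favorite_number)  # branch 1: long
    if interests:
        out += zigzag_varint(len(interests))  # one block holding all items
        out += b"".join(encode_string(item) for item in interests)
    out += zigzag_varint(0)  # a zero count ends the array
    return out


encoded = encode_person("Martin", 1337, ["daydreaming", "hacking"])
print(len(encoded))  # 32
```

Nothing in these bytes says which value belongs to which field. A reader can only make sense of them by walking the writer's schema in the same order, which is why Avro files store that schema in the header.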