# SAI #01: Column Based vs. Row Based Storage, Kafka - Writing Data 

## 1. Row Based vs Column Based File Format
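
Before looking at concrete file formats, the difference between the two layouts can be sketched in plain Python. This is only an in-memory illustration, not how Avro, Parquet, or ORC encode bytes on disk; the helper names `to_row_layout` and `to_column_layout` are hypothetical.

```python
# Four records, the same ones used in the examples of this section.
records = [
    {"name": "Aurimas", "Age": 31, "Occupation": "MLOps", "No_of_dog": 2},
    {"name": "Thomas", "Age": 25, "Occupation": "DE", "No_of_dog": 0},
    {"name": "Suzan", "Age": 29, "Occupation": "MLE", "No_of_dog": 1},
    {"name": "Peter", "Age": 34, "Occupation": "SWE", "No_of_dog": 0},
]


def to_row_layout(rows):
    """Row based: all fields of one record are kept next to each other."""
    return [tuple(row.values()) for row in rows]


def to_column_layout(rows):
    """Column based: all values of one field are kept next to each other."""
    return {field: [row[field] for row in rows] for field in rows[0]}


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Fetching one complete record touches a single entry of the row layout.
first_record = row_layout[0]

# Aggregating one field touches a single list of the column layout.
average_age = sum(column_layout["Age"]) / len(column_layout["Age"])

print(first_record)
print(average_age)
```

Row layouts tend to suit workloads that write or read whole records, while column layouts tend to suit queries that scan a few fields across many records.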

### 1.1. Row Based: Avro as an example file format
Ref: https://avro.apache.org/docs/1.11.1/getting-started-python/

In [1]:
import avro.schema  # explicit import, since avro.schema.parse is used below
import json

from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

+ **Step1: Prepare the Avro schema (normally stored in an .avsc file)**

In [2]:
user_schema = {
    "namespace": "Company",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "Age", "type": "int"},
        {"name": "Occupation", "type": "string"},
        {"name": "No_of_dog", "type": "int"}
    ]
}
# schema = avro.schema.parse(open("user.avsc", "rb").read())
schema = avro.schema.parse(json.dumps(user_schema))
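
To make the role of the schema concrete, here is a simplified, standard-library-only check of a record against the schema dictionary. It is a sketch under stated assumptions: `matches_schema` and `AVRO_TO_PYTHON` are hypothetical names, only the `string` and `int` types used above are covered, and this is not the validation logic of the `avro` library itself.

```python
# Same schema dictionary as in Step1.
user_schema = {
    "namespace": "Company",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "Age", "type": "int"},
        {"name": "Occupation", "type": "string"},
        {"name": "No_of_dog", "type": "int"},
    ],
}

# Assumption: only the two primitive types used in this schema are mapped.
AVRO_TO_PYTHON = {"string": str, "int": int}


def matches_schema(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], AVRO_TO_PYTHON[field["type"]])
        for field in fields
    )


good = {"name": "Aurimas", "Age": 31, "Occupation": "MLOps", "No_of_dog": 2}
bad_type = {"name": "Thomas", "Age": "25", "Occupation": "DE", "No_of_dog": 0}
missing_field = {"name": "Suzan", "Age": 29, "Occupation": "MLE"}

print(matches_schema(good, user_schema))
print(matches_schema(bad_type, user_schema))
print(matches_schema(missing_field, user_schema))
```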

+ **Step2: Write the records to an .avro data file**

In [3]:
row_data = [
    {"name": "Aurimas", "Age": 31, "Occupation": "MLOps", "No_of_dog":2},
    {"name": "Thomas", "Age": 25, "Occupation": "DE", "No_of_dog":0},
    {"name": "Suzan", "Age": 29, "Occupation": "MLE", "No_of_dog":1},
    {"name": "Peter", "Age": 34, "Occupation": "SWE", "No_of_dog":0}
]
# Data files conventionally use the .avro extension; .avsc is reserved for the schema
writer = DataFileWriter(open("user.avro", "wb"), DatumWriter(), schema)
# Write row by row
for ele in row_data:
    writer.append(ele)
writer.close()

+ **Step3: Read the records back from the .avro file**

In [4]:
reader = DataFileReader(open("user.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()

{'name': 'Aurimas', 'Age': 31, 'Occupation': 'MLOps', 'No_of_dog': 2}
{'name': 'Thomas', 'Age': 25, 'Occupation': 'DE', 'No_of_dog': 0}
{'name': 'Suzan', 'Age': 29, 'Occupation': 'MLE', 'No_of_dog': 1}
{'name': 'Peter', 'Age': 34, 'Occupation': 'SWE', 'No_of_dog': 0}

+ **Step4: Append more data to the existing .avro file**

In [5]:
# No schema is passed here: in "ab+" mode the writer reuses the schema stored in the file
writer = DataFileWriter(open("user.avro", "ab+"), DatumWriter())
writer.append({"name": "David", "Age": 28, "Occupation": "SE", "No_of_dog":3})
writer.close()

### 1.2. Column Based: Parquet and ORC as example file formats

Ref: https://arrow.apache.org/docs/python/parquet.html

In [6]:
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa

+ **Step1: Prepare the column-based data as a Python dict and convert it to a PyArrow table**

In [7]:
column_data = {
    "name": ["Aurimas", "Thomas", "Suzan", "Peter"],
    "Age": [31, 25, 29, 34],
    "Occupation": ["MLOps", "DE", "MLE", "SWE"],
    "No_of_dog": [2, 0, 1, 0]
}
df = pd.DataFrame(column_data)
table = pa.Table.from_pandas(df)

+ **Step2: Write column data to the parquet file**

In [8]:
# Write the parquet file 
pq.write_table(table, 'user.parquet')

+ **Step3: Read the data from parquet file**

In [9]:
# Read the parquet file 
data = pq.read_table('user.parquet')
data

pyarrow.Table
name: string
Age: int64
Occupation: string
No_of_dog: int64
----
name: [["Aurimas","Thomas","Suzan","Peter"]]
Age: [[31,25,29,34]]
Occupation: [["MLOps","DE","MLE","SWE"]]
No_of_dog: [[2,0,1,0]]

**ORC** (Ref: https://arrow.apache.org/docs/python/orc.html)

In [10]:
from pyarrow import orc 
import pyarrow as pa

+ **Step1: Prepare the column-based data as a PyArrow table**

In [11]:
table = pa.table({
    "name": ["Aurimas", "Thomas", "Suzan", "Peter"],
    "Age": [31, 25, 29, 34],
    "Occupation": ["MLOps", "DE", "MLE", "SWE"],
    "No_of_dog": [2, 0, 1, 0]
})

+ **Step2: Write column data to ORC file**

In [12]:
# Write the orc file 
orc.write_table(table, 'user.orc')

+ **Step3: Read data from the orc file**

In [13]:
# Read data from the orc file 
orc.read_table('user.orc')

pyarrow.Table
name: string
Age: int64
Occupation: string
No_of_dog: int64
----
name: [["Aurimas","Thomas","Suzan","Peter"]]
Age: [[31,25,29,34]]
Occupation: [["MLOps","DE","MLE","SWE"]]
No_of_dog: [[2,0,1,0]]

## 2. Kafka: Producer and Consumer Examples

+ **Step1: Start the Kafka service, which provides the `my_topic` Kafka topic**


```bash
# From the /sai01 folder, run the command below
$ docker-compose up
```

+ **Step2: Start the Kafka consumer so that it keeps listening on the `my_topic` topic**


```bash
# From the /sai01 folder, run the command below
(venv)$ python run_kafka_consumer.py
```

+ **Step3: Run the cells below so the producer sends messages to the `my_topic` topic**

Running the cells below sends ten messages to the `my_topic` topic. The terminal running the Step2 script should then log the received messages like this:
```
b'message 0'
b'message 1'
b'message 2'
b'message 3'
b'message 4'
b'message 5'
b'message 6'
b'message 7'
b'message 8'
b'message 9'
```

In [None]:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])

In [None]:
# >>> Rerun this cell to resend message to kafka topic
for i in range(10):
    # Payload text matches the consumer log shown in this step
    message = "message {}".format(i).encode('utf-8')
    producer.send('my_topic', message)

producer.flush()
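
The `b'...'` prefix in the consumer log from Step3 appears because Kafka message values travel as raw bytes: the producer encodes text before sending, and a consumer that prints the value without decoding shows the bytes literal. A minimal standard-library sketch of that round trip, with hypothetical helper names `encode_message` and `decode_message`:

```python
def encode_message(text):
    """Producer side: turn text into the UTF-8 bytes that are sent to the topic."""
    return text.encode("utf-8")


def decode_message(payload):
    """Consumer side: turn the received bytes back into text."""
    return payload.decode("utf-8")


payload = encode_message("message 0")

# Printing the raw bytes gives the form seen in the consumer log.
print(repr(payload))

# Decoding recovers the original text.
print(decode_message(payload))
```

Decoding on the consumer side is optional; it only matters when the messages should be handled as text rather than as bytes.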