# Part 1: Producing the data
In this task, we will implement an Apache Kafka producer to simulate real-time data streaming. Spark is neither required nor allowed in this part, since the producer only simulates a streaming data source.

1. Your program should send one batch of applications every 5 seconds. Each batch consists of a random number of rows, between 100 and 500, from the application_data_stream dataset. Only the number of rows needs to be random; you can read the file sequentially.  
    For example, if the random numbers generated for the first and second batches are 100 and 400, the first batch will send records 1-100 from the CSV file, and the second batch will send records 101-500.  
    To conserve memory, the CSV shouldn’t be loaded into memory all at once (i.e. read rows as needed).  
2. Add an integer column named ‘ts’ to each row, holding a Unix timestamp in seconds since the epoch (UTC timezone). Spread the records of each batch evenly over the 5 seconds.  
    For example, if you send a batch of 100 records at 2023-12-31 13:00:00 UTC (ISO format: YYYY-MM-DD HH:MM:SS), which corresponds to ts = 1704027600:
    - Records 1-20: ts = 1704027600
    - Records 21-40: ts = 1704027601
    - Records 41-60: ts = 1704027602
    - ….
3. Send your batch to a Kafka topic with an appropriate name.
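
Steps 1 and 2 can be sketched without Kafka at all. The block below is only an illustration under stated assumptions: the helper names `iter_batches` and `spread_timestamps` are hypothetical, and a small in-memory CSV with made-up columns stands in for the real file. It pulls rows lazily from `csv.DictReader`, cuts them into randomly sized batches, and assigns each row an integer ‘ts’ so that a batch is spread evenly over the 5-second window.

```python
import csv
import io
import random
from itertools import islice


def iter_batches(rows, min_size=100, max_size=500, rng=random):
    """Yield lists of consecutive rows; each list length is drawn at random."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, rng.randint(min_size, max_size)))
        if not batch:
            return  # source exhausted
        yield batch


def spread_timestamps(batch, start_ts, window=5):
    """Return copies of the rows with an integer 'ts' spread evenly over `window` seconds."""
    n = len(batch)
    return [dict(row, ts=start_ts + (i * window) // n) for i, row in enumerate(batch)]


# Tiny in-memory CSV standing in for application_data_stream.csv (column names are made up).
sample = io.StringIO("id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 251)))
reader = csv.DictReader(sample)  # rows are pulled lazily, one at a time

batches = list(iter_batches(reader))
stamped = spread_timestamps(batches[0][:100], start_ts=1704027600)
print(stamped[0]["ts"], stamped[19]["ts"], stamped[20]["ts"], stamped[99]["ts"])
# prints: 1704027600 1704027600 1704027601 1704027604
```

With 100 records and a 5-second window, the integer division `(i * window) // n` places 20 records on each second, which matches the example in step 2.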

All data except the ‘ts’ column should be sent as strings, without conversion to other data types. In many stream-processing applications, the data sources have little to no processing power (e.g. sensors). To simulate this, the producer shouldn’t do any processing or transformation.
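
As a rough check of the string-only requirement, the sketch below shows that `csv.DictReader` already yields every field as a `str`, so the producer needs no casting; only ‘ts’ is added as an integer. The helper name `serialize_batch` and the sample columns are assumptions made for illustration. The helper mirrors what a JSON `value_serializer` would do before the bytes reach Kafka.

```python
import csv
import io
from json import dumps, loads


def serialize_batch(batch):
    """Encode a list of row dicts as UTF-8 JSON bytes, leaving the values untouched."""
    return dumps(batch).encode("utf-8")


# Made-up sample rows; the second row has a missing 'amount' value.
sample = io.StringIO("id,amount,flag\n1,202500.0,Y\n2,,N\n")
rows = [dict(row, ts=1704027600) for row in csv.DictReader(sample)]

payload = serialize_batch(rows)
decoded = loads(payload.decode("utf-8"))
print(decoded[0])
# prints: {'id': '1', 'amount': '202500.0', 'flag': 'Y', 'ts': 1704027600}
```

Note that the missing value stays an empty string rather than becoming a null or a number; any cleaning is left to the consumer side.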


In [None]:
from time import sleep
from json import dumps
from kafka3 import KafkaProducer
import csv
import random
import datetime as dt

hostip = "kafka"
topic = 'application_stream'

def read_csv(filename):
    """Yield rows one at a time so the whole CSV is never held in memory.

    When the end of the file is reached, reading restarts from the first row.
    """
    while True:
        empty = True
        with open(filename, 'r', newline='') as file:
            for line in csv.DictReader(file):
                empty = False
                yield line
        if empty:
            return  # nothing to stream; avoid looping forever on an empty file

def next_batch(rows, size):
    """Pull up to `size` rows from the row iterator."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            break
    return batch

def publish_message(producer, topic, data):
    try:
        producer.send(topic, data)
        print('Message published successfully. Data: ')
        for d in data:
            print(str(d))
    except Exception as ex:
        print('Exception while publishing to Kafka.')
        print(str(ex))

def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=[f'{hostip}:9092'],
                                  value_serializer=lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    # Returning outside the try block avoids swallowing KeyboardInterrupt
    return _producer

if __name__ == '__main__':

    rows = read_csv('application_data_stream.csv')
    print('Publishing records..')
    producer = connect_kafka_producer()
    if producer is None:
        raise SystemExit('Could not connect to Kafka.')

    try:
        while True:
            # Only the batch size is random; rows are read sequentially
            num_lines = random.randint(100, 500)
            to_send = next_batch(rows, num_lines)
            if not to_send:
                print('No rows to send.')
                break

            # Spread the batch evenly over the 5-second window:
            # record i gets base_ts + floor(i * 5 / batch size)
            base_ts = int(dt.datetime.now(dt.timezone.utc).timestamp())
            data = [dict(record, ts=base_ts + (i * 5) // len(to_send))
                    for i, record in enumerate(to_send)]

            publish_message(producer, topic, data)
            sleep(5)

    except KeyboardInterrupt:
        print('Stopping the data generation loop.')