## FIT3182: Assignment 2 Part B (Event Producer 2) 

### Name: Ashley Ooi Yan-Lin (ID: 31171095)

### Task 1: Processing Data Stream

### (b) Write a Python program that loads all the data from hotspot_AQUA_streaming.csv and randomly (with replacement) feeds the data to the stream every 2 seconds. AQUA is the NASA satellite that reports the latitude, longitude, confidence and surface temperature of a location. You will need to append additional information, such as producer information to identify the producer, and the created date & time.

First, we connect to MongoDB through MongoClient and retrieve the collection created in Part A, from which we will obtain the most recent date.

In [1]:
# import statements
import pymongo
from pymongo import MongoClient
from pprint import pprint
import pandas as pd
from datetime import datetime,timedelta

client = MongoClient()
db = client.fit3182_assignment_db
collection = db.partA
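
As an aside, the most recent date can also be fetched with a single `find_one` call instead of the aggregation pipeline used in `get_latest_date` further down. The following is only a minimal sketch: the helper name `latest_date_or_none` is hypothetical, and it assumes, as the rest of this notebook does, that every Part A document stores its date in a field called `date`.

```python
def latest_date_or_none(collection):
    """Return the newest 'date' value in the collection, or None if it is empty.

    `collection` is assumed to be a pymongo Collection (e.g. db.partA).
    The helper name is hypothetical and not part of the original notebook.
    """
    document = collection.find_one(
        {},                                # no filter: consider every document
        projection={"_id": 0, "date": 1},  # keep only the date field
        sort=[("date", -1)],               # descending order, newest first
    )
    # find_one returns None when the collection has no documents
    return document.get("date") if document else None
```

Calling `latest_date_or_none(collection)` on the Part A collection would give the same value as the pipeline, while making the empty-collection case explicit.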

Next, we write the producer program that sends our data to Kafka.
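
The producer advances a simulated clock by 17280 seconds per record, which is 24 hours divided by 5 records per day. The sketch below isolates that arithmetic so it can be checked without Kafka or MongoDB. The names `RECORDS_PER_DAY`, `STEP_SECONDS` and `simulated_timestamps`, and the start date, are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

RECORDS_PER_DAY = 5                                # five hotspot records per simulated day
STEP_SECONDS = 24 * 60 * 60 // RECORDS_PER_DAY     # 86400 / 5 = 17280 seconds (4 h 48 min)

def simulated_timestamps(start, count, step_seconds=STEP_SECONDS):
    """Return `count` timestamps starting at `start`, spaced `step_seconds` apart."""
    return [start + timedelta(seconds=i * step_seconds) for i in range(count)]

# Hypothetical start date; the notebook uses the latest Part A date plus one day.
start = datetime(2022, 1, 1)
for ts in simulated_timestamps(start, 6):
    print(ts.strftime("%d/%m/%Y %H:%M:%S"))
```

The sixth timestamp falls exactly on midnight of the following day, which confirms that five records fill one simulated day.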

In [None]:
from time import sleep
from json import dumps
from kafka import KafkaProducer
import random

# Reads hotspot_AQUA_streaming.csv, converts each row into a document (dict)
# and returns all the documents as a list.
def read_hotspot_AQUA_streaming():
    hotspot_AQUA_streaming = pd.read_csv('hotspot_AQUA_streaming.csv')

    data = []
    for _, aquaRow in hotspot_AQUA_streaming.iterrows():
        document = {}
        document['latitude'] = float(aquaRow['latitude'])
        document['longitude'] = float(aquaRow['longitude'])
        document['confidence'] = int(aquaRow['confidence'])
        document['surface_temperature_celcius'] = int(aquaRow['surface_temperature_celcius'])
        data.append(document)

    return data

# Gets the latest date in our collection (None if the collection is empty)
def get_latest_date():
    cursor = collection.aggregate([
                {"$sort":{"date":-1}},
                {"$project":{"_id":0,"date":1}},
                {"$limit":1}
                ])
    latest_date = None
    for document in cursor:  # the pipeline returns at most one document
        latest_date = document['date']
    return latest_date
     
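# Publishes a message to Kafka
def publish_message(producer_instance, topic_name, data):
    try:
        # send() is asynchronous: it queues the message and returns immediately,
        # so the line below reports that the message was handed to the producer
        producer_instance.send(topic_name, value=data)
        print('Message published successfully. Data: ' + str(data))
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))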
        
# Creates the Kafka producer; returns None if the connection fails
def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                                  value_serializer=lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    return _producer
    
if __name__ == '__main__':

    topic = 'PartB'
    print('Publishing records..')
    producer02 = connect_kafka_producer()
    data = read_hotspot_AQUA_streaming()  # All the documents built from hotspot_AQUA_streaming.csv
    # Start feeding data from the day after the latest date stored in Part A
    latest_date = get_latest_date() + timedelta(days=1)
    secondsPassed = 0  # Seconds to add to latest_date for the next record

    while True:
        # Randomly choose a document (with replacement); copy it so the source list is not modified
        chosenData = dict(random.choice(data))
        # Simulated creation time: the start date plus the seconds accumulated so far
        curr_date = latest_date + timedelta(seconds=secondsPassed)
        chosenData['producer'] = "hotspot_AQUA_streaming"
        chosenData["created_datetime"] = curr_date.strftime("%d/%m/%Y %H:%M:%S")
        publish_message(producer02, topic, chosenData)
        # 17280 seconds = 24 hours / 5. We send 5 hotspot AQUA records per simulated day,
        # so each record advances the simulated clock by one fifth of a day.
        secondsPassed += 17280
        # Sleep for 2 seconds: the climate streaming data is sent every 10 seconds,
        # so 5 hotspot records are sent for each climate record.
        sleep(2)

Message published successfully. Data: {'latitude': -35.8125, 'longitude': 142.1286, 'confidence': 75.0, 'surface_temperature_celcius': 109.0, 'created_datetime': '2022-05-20 03:28:21', 'producer_id': 'hotspot_aqua_producer'}
Message published successfully. Data: {'latitude': -37.0827, 'longitude': 143.8836, 'confidence': 72.0, 'surface_temperature_celcius': 47.0, 'created_datetime': '2022-05-20 04:26:39', 'producer_id': 'hotspot_aqua_producer'}
Message published successfully. Data: {'latitude': -37.8343, 'longitude': 143.6581, 'confidence': 72.0, 'surface_temperature_celcius': 46.0, 'created_datetime': '2022-05-20 08:50:36', 'producer_id': 'hotspot_aqua_producer'}
Message published successfully. Data: {'latitude': -35.9438, 'longitude': 145.0824, 'confidence': 78.0, 'surface_temperature_celcius': 52.0, 'created_datetime': '2022-05-20 14:17:49', 'producer_id': 'hotspot_aqua_producer'}
Message published successfully. Data: {'latitude': -37.1929, 'longitude': 143.8132, 'confidence': 59.0,