## PRODUCER

The code below implements the Kafka producer, the component that collects real-time sensor data from bike-sharing stations and publishes it to a Kafka topic.

The script first issues an HTTP GET request to the station-information URL, keeps only the stations whose `short_name` appears in `start_stations.txt`, and writes their IDs to `station_id.txt`.
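As a rough illustration of this filtering step, the sketch below applies the same `short_name` match to an in-memory payload nested like the station-information response (`data` then `stations`), without pandas or a network call. The function name `filter_station_ids` and the sample station values are hypothetical.

```python
def filter_station_ids(payload, short_names):
    """Return the station_id of every station whose short_name is wanted."""
    wanted = set(short_names)
    return [
        station["station_id"]
        for station in payload["data"]["stations"]
        if station.get("short_name") in wanted
    ]


# Hypothetical payload with the same nesting as the feed: data -> stations.
sample_payload = {
    "data": {
        "stations": [
            {"station_id": "id-a", "short_name": "31000"},
            {"station_id": "id-b", "short_name": "31001"},
            {"station_id": "id-c", "short_name": "31002"},
        ]
    }
}

selected_ids = filter_station_ids(sample_payload, ["31000", "31002"])
print(selected_ids)
```

The script does the same thing with a pandas `isin` filter and then persists the result to a file so the sending step can read it later.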

Next, the Kafka producer is created and pointed at the configured bootstrap servers (`localhost:9092` in this script), with a value serializer that encodes each message as UTF-8 JSON. This matters because downstream Kafka consumers can only interpret the messages if they decode them with the matching format.
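The serialization step can be tried on its own, without a broker. The sketch below reuses the script's `datetime_serializer` fallback and wraps the `json.dumps(...).encode('utf-8')` expression in a function; `serialize_value`, `deserialize_value`, and the sample message are hypothetical names and values added for illustration.

```python
import datetime
import json


def datetime_serializer(obj):
    # Same fallback as the script: format datetimes, reject anything else.
    if isinstance(obj, datetime.datetime):
        return obj.strftime("%Y-%m-%d %H:%M:%S")
    raise TypeError("Type not serializable")


def serialize_value(value):
    """Encode a message the way the producer's value_serializer does."""
    return json.dumps(value, default=datetime_serializer).encode("utf-8")


def deserialize_value(raw):
    """Hypothetical consumer-side counterpart: bytes -> dict."""
    return json.loads(raw.decode("utf-8"))


message = {"station_id": "id-a", "dt2": datetime.datetime(2024, 1, 2, 3, 4, 5)}
raw = serialize_value(message)
decoded = deserialize_value(raw)
print(raw)
print(decoded)
```

Note that a `datetime` value comes back as a plain string after the round trip, so a consumer that needs a timestamp object has to parse it again.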

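The core of the script is `send_station_status`, which publishes station updates to the Kafka topic `sensor_data`. It reads the station IDs of interest from a file, then polls the status URL, pausing one second between requests, and sends one message per selected station containing its bike, e-bike, and dock counts and its last-reported timestamp. The loop stops after 1,036,800 polls, roughly 12 days at one poll per second, after which `main()` sends a final `{'terminate': True}` message.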
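The polling logic can be sketched independently of the network and the broker by passing the fetch and send steps in as callables. In the sketch below, `poll_and_send`, `fake_fetch`, the `outbox` list, and the sample station values are hypothetical stand-ins for `requests.get(...).json()` and `producer.send`; the interval is set to zero only so the example finishes immediately.

```python
import time


def poll_and_send(fetch_status, send, station_ids, max_polls, interval_s=1.0):
    """Poll fetch_status() max_polls times and forward the selected stations."""
    wanted = set(station_ids)  # set membership is O(1) per station
    sent = 0
    for _ in range(max_polls):
        payload = fetch_status()
        for station in payload["data"]["stations"]:
            if station["station_id"] in wanted:
                send("sensor_data", {
                    "station_id": station["station_id"],
                    "num_bikes_available": station["num_bikes_available"],
                    "num_ebikes_available": station["num_ebikes_available"],
                    "num_docks_available": station["num_docks_available"],
                    "dt2": station["last_reported"],
                })
                sent += 1
        time.sleep(interval_s)  # pause between polls
    return sent


def fake_fetch():
    # Hypothetical status payload with one wanted and one unwanted station.
    return {"data": {"stations": [
        {"station_id": "id-a", "num_bikes_available": 3, "num_ebikes_available": 1,
         "num_docks_available": 9, "last_reported": 1700000000},
        {"station_id": "id-z", "num_bikes_available": 0, "num_ebikes_available": 0,
         "num_docks_available": 12, "last_reported": 1700000000},
    ]}}


outbox = []
total = poll_and_send(
    fake_fetch,
    lambda topic, value: outbox.append((topic, value)),
    ["id-a"],
    max_polls=2,
    interval_s=0,
)
print(total, outbox[0][0])
```

Separating the loop from its I/O in this way is a design choice of the sketch, not of the original script; it mainly makes the filtering and message shape easy to check.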

Finally, a `try`/`except`/`finally` block in `main()` reports any exception and closes the Kafka producer when execution ends, whether normally or after an error, so that the connection is released and no resources are leaked or left blocked.
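A minimal sketch of this cleanup pattern follows, using a stand-in class instead of a real `KafkaProducer`. One detail worth noting is that the `producer` name has to be bound before the `try` block; otherwise the `finally` clause itself fails when creating the producer raises. `FakeProducer`, `run`, `failing_factory`, and `failing_work` are hypothetical names used only for this illustration.

```python
class FakeProducer:
    """Hypothetical stand-in exposing only the close() method used here."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def run(create_producer, work):
    """Create a producer, run work(producer), and close whatever was created."""
    producer = None  # bound before try, so finally can test it safely
    try:
        producer = create_producer()
        work(producer)
        return "ok"
    except Exception as ex:
        return f"error: {ex}"
    finally:
        if producer is not None:
            producer.close()


def failing_factory():
    raise ConnectionError("broker unreachable")


def failing_work(producer):
    raise RuntimeError("send failed")


shared = FakeProducer()
result_create_fails = run(failing_factory, lambda p: None)
result_work_fails = run(lambda: shared, failing_work)
print(result_create_fails, result_work_fails, shared.closed)
```

In the first call no producer exists, so nothing is closed and no secondary error occurs; in the second, the producer is closed even though the work raised.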

In [3]:
!pip install requests pandas kafka-python



In [4]:
import pandas as pd
import requests
from kafka import KafkaProducer
import json
import time
import datetime

In [None]:
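def fetch_and_filter_stations(station_info_url, short_names_file):
    # Download the station catalogue; fail fast on HTTP errors or a hung connection.
    response = requests.get(station_info_url, timeout=10)
    response.raise_for_status()
    data = response.json()
    df = pd.DataFrame(data['data']['stations'])
    # One short name per line; blank lines are skipped.
    with open(short_names_file, 'r') as f:
        short_names = [line.strip() for line in f if line.strip()]
    filtered_df = df.loc[df['short_name'].isin(short_names)]
    # Persist the matching IDs, one per line, for send_station_status().
    filtered_df["station_id"].to_csv("station_id.txt", header=False, index=False)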

def datetime_serializer(obj):
    if isinstance(obj, datetime.datetime):
        return obj.strftime('%Y-%m-%d %H:%M:%S')
    raise TypeError("Type not serializable")

def create_kafka_producer(bootstrap_servers, value_serializer):
    return KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=value_serializer
    )

def send_station_status(producer, station_status_url, station_ids_file):
    # 1,036,800 polls at one per second is roughly 12 days (12 * 86,400 s).
    max_polls = 1036800
    with open(station_ids_file, 'r') as f:
        station_ids = set(f.read().splitlines())  # set gives O(1) membership tests
    for _ in range(max_polls):
        # Fetch the current status; fail fast on HTTP errors or a hung connection.
        response = requests.get(station_status_url, timeout=10)
        response.raise_for_status()
        data = response.json()
        for station in data['data']['stations']:
            if station['station_id'] in station_ids:
                sensor_data = {
                    'station_id': station['station_id'],
                    'num_bikes_available': station['num_bikes_available'],
                    'num_ebikes_available': station['num_ebikes_available'],
                    'num_docks_available': station['num_docks_available'],
                    'dt2': station['last_reported']
                }
                producer.send('sensor_data', sensor_data)
        time.sleep(1)  # pause one second between polls

def main():
    station_info_url = 'https://gbfs.capitalbikeshare.com/gbfs/en/station_information.json'
    station_status_url = 'https://gbfs.capitalbikeshare.com/gbfs/en/station_status.json'
    fetch_and_filter_stations(station_info_url, 'start_stations.txt')
    
    producer = None  # bound before try so the finally block can test it safely
    try:
        producer = create_kafka_producer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda v: json.dumps(v, default=datetime_serializer).encode('utf-8')
        )
        send_station_status(producer, station_status_url, 'station_id.txt')
        producer.send('sensor_data', {'terminate': True})  # Optionally add callbacks here
    except Exception as ex:
        print(f"Error connecting to the Kafka broker: {ex}")
    finally:
        if producer is not None:
            producer.close()

if __name__ == "__main__":
    main()
