## Sending data to a Kafka server

This notebook uses the [Python client for the Apache Kafka distributed stream processing system](http://kafka-python.readthedocs.io/en/master/index.html) to send messages to a Kafka server. 

* Sensor data is available from https://uv.ulb.ac.be/pluginfile.php/923479/course/section/165902/data.conv.txt.gz
* Sensor location is available from https://uv.ulb.ac.be/pluginfile.php/923479/course/section/165902/mote_locs.txt

In this example, Kafka is used to send messages containing the temperature data of sensors 1 and 24 and their closest neighbors, from 28/02 to 07/03.

You need to have Kafka and Zookeeper servers running to execute this notebook. If you use the Docker course container or work on the course cluster, these servers should already be running. Otherwise, you can start them on your machine with

```
nohup $KAFKA_PATH/bin/zookeeper-server-start.sh $KAFKA_PATH/config/zookeeper.properties  > $HOME/zookeeper.log 2>&1 &
nohup $KAFKA_PATH/bin/kafka-server-start.sh $KAFKA_PATH/config/server.properties > $HOME/kafka.log 2>&1 &
```

where `KAFKA_PATH` points to the folder containing Kafka. See https://kafka.apache.org/quickstart for how to install Kafka on your machine. 
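
Before running the rest of the notebook, it can help to confirm that both servers accept connections. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and the ports are assumptions (2181 is the usual Zookeeper client port, 9092 is the broker port used later in this notebook). It only checks that a TCP connection can be opened, not that the servers are healthy.

```python
import socket

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed default ports: 2181 for Zookeeper, 9092 for Kafka
for name, port in [("Zookeeper", 2181), ("Kafka", 9092)]:
    status = "reachable" if is_port_open("localhost", port) else "not reachable"
    print(f"{name} on localhost:{port} is {status}")
```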


### General import

In [6]:
from kafka import KafkaProducer
import time
import numpy as np
import pandas as pd
import math

### Get the n closest sensors

In [7]:
def getSensorsLoc(locations_file):
    """
    returns an array where each element is [sensor, x_coord, y_coord] based on a location file
    """
    sensors_loc = []
    with open(locations_file, "r") as f:
        for line in f:
            # split() handles repeated spaces and the trailing newline
            info = line.split()
            if not info:
                continue  # skip blank lines
            sensor = int(info[0])
            x = float(info[1])
            y = float(info[2])
            sensors_loc.append([sensor, x, y])
    sensors_loc = np.array(sensors_loc)
    return sensors_loc

def getNClosestNeighbors(sensorId, sensors_loc, n):
    """
    returns a list of n closest neighbors ordered from closest to furthest to the given sensorId
    """

    index_sensor_id = np.where(sensors_loc[:,0] == sensorId)[0][0]
    x_sensor = sensors_loc[index_sensor_id, 1]
    y_sensor = sensors_loc[index_sensor_id, 2]

    neighbors = []
    distances = []
    for i in range(len(sensors_loc)):
        if i!= index_sensor_id:
            id_neighbor = sensors_loc[i,0]
            x_neighbor = sensors_loc[i,1]
            y_neighbor = sensors_loc[i,2]
            x = x_sensor - x_neighbor
            y = y_sensor - y_neighbor
            distance = math.sqrt(math.pow(x,2) + math.pow(y,2))
            neighbors.append(id_neighbor)
            distances.append(distance)
    ar_neighbors = np.array(neighbors)
    ar_distances = np.array(distances)
    inds = ar_distances.argsort()
    sorted_neighbors = ar_neighbors[inds]
    sorted_distances = ar_distances[inds]

    return sorted_neighbors[:n]

In [8]:
DATA_LOCATION = "../data"
FILE = "data.conv.txt"
LOC = "mote_locs.txt"
data_file = "{}/{}".format(DATA_LOCATION, FILE)
data_loc = "{}/{}".format(DATA_LOCATION, LOC)

sensors_loc = getSensorsLoc(data_loc)
n = 10
closest_neighbors_1 = getNClosestNeighbors(1, sensors_loc,n)
closest_neighbors_24 = getNClosestNeighbors(24, sensors_loc,n)
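
The neighbor search above can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, not the notebook's own function: the name `n_closest_neighbors` and the toy coordinates are hypothetical. Unlike `getNClosestNeighbors`, it excludes the target by id rather than by row index and returns integer ids.

```python
import numpy as np

def n_closest_neighbors(sensor_id, sensors_loc, n):
    """Vectorized sketch: ids of the n sensors closest to sensor_id, nearest first."""
    sensors_loc = np.asarray(sensors_loc, dtype=float)
    ids = sensors_loc[:, 0]
    coords = sensors_loc[:, 1:3]
    target = coords[ids == sensor_id][0]
    # Euclidean distance from the target to every sensor
    distances = np.hypot(coords[:, 0] - target[0], coords[:, 1] - target[1])
    # Exclude the target itself, then sort the remaining sensors by distance
    others = ids != sensor_id
    order = np.argsort(distances[others], kind="stable")
    return ids[others][order][:n].astype(int)

# Hypothetical coordinates, for illustration only
toy_locs = [[1, 0.0, 0.0], [2, 1.0, 0.0], [3, 0.0, 3.0], [4, 2.0, 2.0]]
print(n_closest_neighbors(1, toy_locs, 2))
```

With these toy coordinates, sensor 2 is at distance 1 from sensor 1 and sensor 4 at about 2.83, so they come out ahead of sensor 3 (distance 3).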

### Load measurements, sort by Date/Time, add relative number of seconds since beginning

In [9]:

#Takes about one minute to load
data=pd.read_csv(data_file,header=None,sep=" ")
data.columns=["Date","Hour","Sensor","Value","Voltage"]
data=data.sort_values(['Date','Hour']).reset_index(drop=True)

In [10]:
data['datetime']=pd.to_datetime(data.Date+' '+data.Hour)
data['relative_datetime']=data['datetime']-data['datetime'][0]
data['seconds']=data['relative_datetime'].dt.total_seconds()

In [11]:
sensorId_type=data.Sensor.str.split("-",expand=True)
sensorId_type.columns=['SensorId','Type']
data['SensorId']=sensorId_type['SensorId'].astype(int)
data['Type']=sensorId_type['Type'].astype(int)


In [12]:
#Drop features not needed for the simulation
data=data.drop(['datetime','relative_datetime','Sensor','Date','Hour','Voltage'],axis=1)
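
To see what the preprocessing steps above produce, here is a small sketch on a hypothetical three-row table that follows the same column layout. The dates, hours, and values are made up for illustration; only the transformations mirror the cells above.

```python
import pandas as pd

# Hypothetical rows in the "Date Hour Sensor Value" layout used above
toy = pd.DataFrame({
    "Date": ["2017-02-28", "2017-02-28", "2017-03-01"],
    "Hour": ["00:00:30", "00:01:00", "00:00:30"],
    "Sensor": ["1-0", "1-1", "2-0"],
    "Value": [19.5, 40.2, 18.9],
})

toy["datetime"] = pd.to_datetime(toy["Date"] + " " + toy["Hour"])
# Seconds elapsed since the first (earliest) measurement
toy["seconds"] = (toy["datetime"] - toy["datetime"].iloc[0]).dt.total_seconds()
# "1-0" -> SensorId 1, Type 0
toy[["SensorId", "Type"]] = toy["Sensor"].str.split("-", expand=True).astype(int)

print(toy[["seconds", "SensorId", "Type", "Value"]])
```

The third row falls exactly one day after the first, so its `seconds` value is 86400, which is the boundary used later to split the data into relative days.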

### Select temperature data from sensors 1 and 24 and their 5 closest neighbors

In [16]:
# Per the printed output, the 5 closest neighbors of sensor 1 are sensors 33, 2, 3, 35, 37
# 5 closest neighbors of sensor 24 are sensors 22, 23, 25, 26, 27
print(closest_neighbors_1)
# Keep the 5 nearest of the n = 10 neighbors computed above (arrays are sorted by distance)
neighbors_1 = closest_neighbors_1[:5]
neighbors_24 = closest_neighbors_24[:5]
# Use isin() for element-wise membership; `Series in array` raises ValueError
temp = data[((data.SensorId==1) | data.SensorId.isin(neighbors_1) |
             (data.SensorId==24) | data.SensorId.isin(neighbors_24)) & (data.Type==0)]
temp=temp.reset_index(drop=True).drop(['Type'], axis=1)

[33.  2.  3. 35. 37. 34. 31.  4. 32. 36.]
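
Row-wise membership tests on a pandas column go through `Series.isin`. The Python `in` operator instead asks for a single truth value for the whole column, which is the kind of expression that fails with a length-mismatch `ValueError`. The sketch below illustrates the `isin` form on a hypothetical five-row table; the ids and the `neighbors` array are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, for illustration only
df = pd.DataFrame({"SensorId": [1, 2, 5, 33, 24], "Type": [0, 0, 0, 1, 0]})
neighbors = np.array([33.0, 2.0, 3.0])  # float ids, as produced by a float location array

# Element-wise membership: one boolean per row
mask = (df["SensorId"].eq(1) | df["SensorId"].isin(neighbors)) & df["Type"].eq(0)
selected = df[mask]
print(selected["SensorId"].tolist())
```

Sensor 33 is in the neighbor list but is dropped here because its row has `Type` 1, so only sensors 1 and 2 remain.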

### Create  Kafka producer

In [7]:
producer = KafkaProducer(bootstrap_servers='localhost:9092')

### Stream data

We simulate the streaming of data by sending, every `interval` seconds (40 in the code below), the set of measurements collected during one day. This speeds up the simulation: the 8 days from 28/02/2017 to 07/03/2017 are replayed in about 8*40=320 seconds.
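
The timing arithmetic behind this simulation can be sketched in isolation. The helper names below are hypothetical, and the sample values (`interval` of 40 seconds, 8 days) are taken from the code cell that follows; the functions only restate the modulo and day-bucket computations used there.

```python
def seconds_until_next_multiple(now, interval):
    """Seconds to wait so that sending starts at a multiple of `interval`."""
    return interval - now % interval

def relative_day(seconds, seconds_per_day=86400):
    """Index (0-based) of the day containing a relative timestamp in seconds."""
    return int(seconds // seconds_per_day)

def total_duration(n_days, interval):
    """Approximate wall-clock length of the simulation, one day per interval."""
    return n_days * interval

print(seconds_until_next_multiple(1003.5, 40))
print(relative_day(86399.0), relative_day(86400.0))
print(total_duration(8, 40))
```

Starting on a multiple of `interval` is what lets a receiver that uses the same rule stay aligned with the sender without any extra coordination.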


In [10]:
interval=40

#Start at relative day 0 (2017-02-28)
day=0

#For synchronization with receiver (for the sake of the simulation), starts at a number of seconds multiple of 'interval'
current_time=time.time()
time_to_wait=interval-current_time%interval
time.sleep(time_to_wait)

#Loop for sending messages to Kafka with the topic 'RLS'
for day in range(0,8):
    
    time_start=time.time()
    
    #Select sensor measurements for the corresponding relative day
    data_current_day=temp[(temp.seconds>=day*86400) & (temp.seconds<(day+1)*86400)]
    data_current_day=data_current_day.dropna()
    #For all measurements in that day
    for i in range(len(data_current_day)):
        #Get data
        current_data=list(data_current_day.iloc[i])
        #Transform list to string
        message=str(current_data)
        #Send
        producer.send('RLS',message.encode())
    
    time_to_send=time.time()-time_start
    print("Time to send "+str(len(data_current_day))+" measurements (day "+str(day)+" ) : "+str(time_to_send))

    #Wait for the rest of the interval (a negative duration would raise ValueError)
    time.sleep(max(0,interval-time_to_send))

#Make sure all buffered messages are delivered
producer.flush()

Time to send 7332 measurements (day 0 ) : 3.069990634918213
Time to send 3880 measurements (day 1 ) : 5.548675537109375
Time to send 3894 measurements (day 2 ) : 5.347285032272339
Time to send 3610 measurements (day 3 ) : 4.80565881729126
Time to send 3527 measurements (day 4 ) : 5.259955406188965
Time to send 3397 measurements (day 5 ) : 4.854121685028076
Time to send 4315 measurements (day 6 ) : 5.2760114669799805
Time to send 4503 measurements (day 7 ) : 6.032655954360962
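
The loop above sends each measurement as the string form of a Python list, which the receiver then has to parse. As a possible alternative, and not what this notebook does, the sketch below encodes a measurement as JSON. The function names are hypothetical, and the `[Value, seconds, SensorId]` order is inferred from the columns left in `temp`. With kafka-python, a function like `encode_measurement` could be passed as the `value_serializer` argument of `KafkaProducer`, so that `send` accepts the list directly.

```python
import json

def encode_measurement(row):
    """Encode one measurement (a sequence of numbers) as UTF-8 JSON bytes."""
    # float() normalizes ints and NumPy scalars to plain Python floats
    return json.dumps([float(v) for v in row]).encode("utf-8")

def decode_measurement(payload):
    """Inverse of encode_measurement: bytes -> list of floats."""
    return json.loads(payload.decode("utf-8"))

# Hypothetical measurement: [Value, seconds, SensorId]
payload = encode_measurement([19.5, 30.0, 1])
print(payload)
print(decode_measurement(payload))
```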
