# Producer notebook
This notebook simulates the producer of streaming data. It reads the data from the CSV files and, at each time interval, sends a batch of rows over a TCP socket to the consumer. The socket connection has to be established first.

We use host name `localhost` and port `9997` to communicate.

The parameters for the notebook are the batch size `pi`, i.e. the number of days of data sent each time, and `delta`, the time interval in seconds between two batches.
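
As a rough illustration of how the two parameters interact, the sketch below computes how many time gaps one batch covers and how long a full replay would take at minimum. The value of 96 gaps per day is the one set later in this notebook; `total_gaps` is a made-up figure used only for the arithmetic, not a property of the dataset.

```python
import math

gaps_per_day = 96   # value set later in this notebook
pi = 30             # batch size, in days
delta = 5           # seconds between two batches

# Hypothetical: highest `timegap` value in the data (illustration only).
total_gaps = 365 * gaps_per_day

gaps_per_batch = pi * gaps_per_day                   # time gaps covered by one batch
n_batches = math.ceil(total_gaps / gaps_per_batch)   # batches needed to send everything
replay_seconds = n_batches * delta                   # lower bound: ignores processing time

print(gaps_per_batch, n_batches, replay_seconds)
```

A larger `pi` means fewer but heavier batches, while `delta` only controls how often they are emitted.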

In [1]:
import time
import numpy as np
import socket
from pyspark.sql import SparkSession

## Setting up the socket connection

The following block of code opens the producer's end of the connection: it binds a listening socket and then blocks on `accept()` until a client connects. Here the client is the consumer, defined in the `consumer.ipynb` file.

Run the following block, then head to the `consumer.ipynb` file and start the consumer so that it connects to this socket.

In [2]:
import socket
  
# host name and port the producer listens on
host = 'localhost'
port = 9997

# create a TCP/IP socket
s = socket.socket(socket.AF_INET,
                  socket.SOCK_STREAM)

# bind the socket to the host name
# and port number
s.bind((host, port))

# start listening; 5 is the backlog of
# pending connections, not a connection limit
s.listen(5)

# block until a client connects
c, addr = s.accept()
  
# display client address
print("CONNECTION FROM:", str(addr))



CONNECTION FROM: ('127.0.0.1', 58880)
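
The `consumer.ipynb` file is not shown here, so the following is only a sketch of what the connecting side of such a socket can look like. It runs a throwaway listener in a background thread and connects to it from the same process. The helper `serve_once` and the use of port `0` (any free port) are assumptions made for the demo, not part of this notebook.

```python
import socket
import threading

def serve_once(server_sock, payload):
    """Accept one client, send it `payload`, then close that connection."""
    conn, _addr = server_sock.accept()
    with conn:
        conn.sendall(payload)

# Port 0 asks the OS for any free port, so the demo cannot clash with the notebook's port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("localhost", 0))
server.listen(1)
host, port = server.getsockname()

worker = threading.Thread(target=serve_once, args=(server, b"1,0.0,-1.0,CAT17\n"))
worker.start()

# Client side: connect, then read until the other end closes the connection.
chunks = []
with socket.create_connection((host, port)) as client:
    while True:
        chunk = client.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
received = b"".join(chunks)

worker.join()
server.close()
print(received.decode())
```

The listening side blocks in `accept()` until `create_connection` succeeds, which is why the producer cell above only prints the client address once the consumer has started.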


## Setting up the data source : Spark DataFrame

In this scenario, we consider a huge dataset from which we send the data. To manipulate it, we use Spark, so we need to set up a Spark session in our producer node. Note that the cells below load the data as an RDD rather than as a Spark DataFrame.

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master("local")\
    .appName("Producer")\
    .getOrCreate()

sc = spark.sparkContext

In [4]:
# Loading data as a single RDD

data = sc.textFile("CLEAN_DATA/*.csv")
data = data.map(lambda line : line.split(","))

In [5]:
data.count()

2725056

In [6]:
delta = 5 # seconds
pi = 30 # days
gaps_per_day = 96
gaps = pi*gaps_per_day

## Filtering the data at each stage
Each time the `delta` period elapses, a batch of data is emitted; this batch is a subset of the `data` RDD. The subset is selected on the `timegap` value of the rows (the first field of each row): at each emission we set a lower bound and an upper bound, and all rows whose `timegap` falls between those bounds are selected.

For instance, the bounds for the first iteration are `0` and `gaps_per_day*pi`. For the next iteration, the lower bound becomes the former upper bound, and the upper bound becomes the new lower bound `+gaps_per_day*pi`. With the lower bound exclusive and the upper bound inclusive, consecutive batches neither overlap nor skip a `timegap` value.

```python
def getNextBounds(lower, upper, batch_size=pi) :
    lower = upper
    upper = lower + batch_size*gaps_per_day
    return lower, upper
```

In [7]:
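def getNextBounds(lower, upper, batch_size=pi) :
    # NB: the default batch_size is the value of pi when this cell runs;
    # reassigning pi later does not change it
    lower = upper
    upper = lower + batch_size*gaps_per_day
    return lower, upper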

In [8]:
# Example
lower_example, upper_example = 0, gaps_per_day*pi
for i in range(7) :
    lower_example, upper_example = getNextBounds(lower_example, upper_example)
    print(lower_example, upper_example)

2880 5760
5760 8640
8640 11520
11520 14400
14400 17280
17280 20160
20160 23040
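
To check that successive windows cover every `timegap` exactly once, here is a small pure-Python sketch on toy data. It assumes the window is half-open, `lower < timegap <= upper`. The names `get_next_bounds` and `in_window` and the toy range of 288 values are illustrative only.

```python
gaps_per_day = 96

def get_next_bounds(lower, upper, batch_size):
    """Same rule as `getNextBounds`: the old upper bound becomes the new lower bound."""
    lower = upper
    upper = lower + batch_size * gaps_per_day
    return lower, upper

def in_window(timegap, lower, upper):
    """Half-open window (lower, upper]: excludes `lower`, includes `upper`."""
    return lower < timegap <= upper

timegaps = range(1, 3 * gaps_per_day + 1)   # toy stand-in for the timegap values: 1..288

# Half-open windows, one day per batch.
lower, upper = 0, gaps_per_day
seen = []
for _ in range(3):
    seen.extend(g for g in timegaps if in_window(g, lower, upper))
    lower, upper = get_next_bounds(lower, upper, batch_size=1)

# Same windows, but with both bounds strict.
lower, upper = 0, gaps_per_day
strict = []
for _ in range(3):
    strict.extend(g for g in timegaps if lower < g < upper)
    lower, upper = get_next_bounds(lower, upper, batch_size=1)
missing = sorted(set(timegaps) - set(strict))

print(seen == list(timegaps))   # every value sent exactly once, in order
print(missing)                  # values lost when both comparisons are strict
```

With both comparisons strict, the rows sitting exactly on a window boundary are never selected, so one of the two bounds has to be inclusive.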


In [9]:
# Setting initial bounds
lower, upper = 0, gaps_per_day*pi

### Selecting subset of the data
The data is stored in an RDD, so we can select a batch with its `filter` method.

In [10]:
lower, upper = getNextBounds(lower, upper)
print(f"Current lower and upper bounds : [{lower}, {upper}]")
data.filter(lambda x : float(x[0]) <= upper and float(x[0]) > lower).take(15)

Current lower and upper bounds : [2880, 5760]


[['2881', '0.0', '-1.0', 'CAT17'],
 ['2882', '0.0', '-1.0', 'CAT17'],
 ['2883', '0.0', '-1.0', 'CAT17'],
 ['2884', '0.0', '-1.0', 'CAT17'],
 ['2885', '0.0', '-1.0', 'CAT17'],
 ['2886', '0.0', '-1.0', 'CAT17'],
 ['2887', '0.0', '-1.0', 'CAT17'],
 ['2888', '0.0', '-1.0', 'CAT17'],
 ['2889', '0.0', '-1.0', 'CAT17'],
 ['2890', '0.0', '-1.0', 'CAT17'],
 ['2891', '0.0', '-1.0', 'CAT17'],
 ['2892', '0.0', '-1.0', 'CAT17'],
 ['2893', '0.0', '-1.0', 'CAT17'],
 ['2894', '0.0', '-1.0', 'CAT17'],
 ['2895', '0.0', '-1.0', 'CAT17']]

### Producing the data at a regular interval

In [11]:

# Setting parameters again
pi = 5
delta = 10
lower, upper = 0, pi*gaps_per_day


while True :

    # Select the rows of the current batch: lower < timegap <= upper
    restrained_data = data.filter(lambda line : float(line[0]) <= upper and float(line[0]) > lower)
    restrained_data = restrained_data.map(lambda line : ','.join(line))
    message = '\n'.join(restrained_data.collect()) + "\n"

    # Send message to the client
    try:
        print("Sending message")
        print(message)
        # sendall keeps writing until the whole message has been sent
        c.sendall(message.encode())
    except socket.error:
        # If failed, client is probably disconnected. Wait for another connection
        # (the current batch is not re-sent)
        c.close()
        c, addr = s.accept()

    lower, upper = getNextBounds(lower, upper, batch_size=pi)
    time.sleep(delta)

Sending message
1,0.0,-1.0,CAT17
2,0.0,-1.0,CAT17
3,0.0,-1.0,CAT17
4,0.0,-1.0,CAT17
5,0.0,-1.0,CAT17
6,0.0,-1.0,CAT17
7,0.0,-1.0,CAT17
8,0.0,-1.0,CAT17
9,0.0,-1.0,CAT17
10,0.0,-1.0,CAT17
11,0.0,-1.0,CAT17
12,0.0,-1.0,CAT17
13,0.0,-1.0,CAT17
14,0.0,-1.0,CAT17
15,0.0,-1.0,CAT17
16,0.0,-1.0,CAT17
17,0.0,-1.0,CAT17
18,0.0,-1.0,CAT17
19,0.0,-1.0,CAT17
20,0.0,-1.0,CAT17
21,0.0,-1.0,CAT17
22,0.0,-1.0,CAT17
23,0.0,-1.0,CAT17
24,0.0,-1.0,CAT17
25,0.0,-1.0,CAT17
26,0.0,-1.0,CAT17
27,0.0,-1.0,CAT17
28,0.0,-1.0,CAT17
29,0.0,-1.0,CAT17
30,0.0,-1.0,CAT17
31,0.0,-1.0,CAT17
32,0.0,-1.0,CAT17
33,0.0,-1.0,CAT17
34,0.0,-1.0,CAT17
35,0.0,-1.0,CAT17
36,0.0,-1.0,CAT17
37,0.0,-1.0,CAT17
38,0.0,-1.0,CAT17
39,0.0,-1.0,CAT17
40,0.0,-1.0,CAT17
41,0.0,-1.0,CAT17
42,0.0,-1.0,CAT17
43,0.0,-1.0,CAT17
44,0.0,-1.0,CAT17
45,0.0,-1.0,CAT17
46,0.0,-1.0,CAT17
47,0.0,-1.0,CAT17
48,0.0,-1.0,CAT17
49,0.0,-1.0,CAT17
50,0.0,-1.0,CAT17
51,0.0,-1.0,CAT17
52,0.0,-1.0,CAT17
53,0.0,-1.0,CAT17
54,0.0,-1.0,CAT17
55,0.0,-1.0,CAT17
56,

Sending message
481,0.0,-1.0,CAT17
482,0.0,-1.0,CAT17
483,0.0,-1.0,CAT17
484,0.0,-1.0,CAT17
485,0.0,-1.0,CAT17
486,0.0,-1.0,CAT17
487,0.0,-1.0,CAT17
488,0.0,-1.0,CAT17
489,0.0,-1.0,CAT17
490,0.0,-1.0,CAT17
491,0.0,-1.0,CAT17
492,0.0,-1.0,CAT17
493,0.0,-1.0,CAT17
494,0.0,-1.0,CAT17
495,0.0,-1.0,CAT17
496,0.0,-1.0,CAT17
497,0.0,-1.0,CAT17
498,0.0,-1.0,CAT17
499,0.0,-1.0,CAT17
500,0.0,-1.0,CAT17
501,0.0,-1.0,CAT17
502,0.0,-1.0,CAT17
503,0.0,-1.0,CAT17
504,0.0,-1.0,CAT17
505,0.0,-1.0,CAT17
506,0.0,-1.0,CAT17
507,0.0,-1.0,CAT17
508,0.0,-1.0,CAT17
509,0.0,-1.0,CAT17
510,0.0,-1.0,CAT17
511,0.0,-1.0,CAT17
512,0.0,-1.0,CAT17
513,0.0,-1.0,CAT17
514,0.0,-1.0,CAT17
515,0.0,-1.0,CAT17
516,0.0,-1.0,CAT17
517,0.0,-1.0,CAT17
518,0.0,-1.0,CAT17
519,0.0,-1.0,CAT17
520,0.0,-1.0,CAT17
521,0.0,-1.0,CAT17
522,0.0,-1.0,CAT17
523,0.0,-1.0,CAT17
524,0.0,-1.0,CAT17
525,0.0,-1.0,CAT17
526,0.0,-1.0,CAT17
527,0.0,-1.0,CAT17
528,0.0,-1.0,CAT17
529,0.0,-1.0,CAT17
530,0.0,-1.0,CAT17
531,0.0,-1.0,CAT17
532,0.0,-1.0,CA

Sending message
961,0.0,-1.0,CAT17
962,0.0,-1.0,CAT17
963,0.0,-1.0,CAT17
964,0.0,-1.0,CAT17
965,0.0,-1.0,CAT17
966,0.0,-1.0,CAT17
967,0.0,-1.0,CAT17
968,0.0,-1.0,CAT17
969,0.0,-1.0,CAT17
970,0.0,-1.0,CAT17
971,0.0,-1.0,CAT17
972,0.0,-1.0,CAT17
973,0.0,-1.0,CAT17
974,0.0,-1.0,CAT17
975,0.0,-1.0,CAT17
976,0.0,-1.0,CAT17
977,0.0,-1.0,CAT17
978,0.0,-1.0,CAT17
979,0.0,-1.0,CAT17
980,0.0,-1.0,CAT17
981,0.0,-1.0,CAT17
982,0.0,-1.0,CAT17
983,0.0,-1.0,CAT17
984,0.0,-1.0,CAT17
985,0.0,-1.0,CAT17
986,0.0,-1.0,CAT17
987,0.0,-1.0,CAT17
988,0.0,-1.0,CAT17
989,0.0,-1.0,CAT17
990,0.0,-1.0,CAT17
991,0.0,-1.0,CAT17
992,0.0,-1.0,CAT17
993,0.0,-1.0,CAT17
994,0.0,-1.0,CAT17
995,0.0,-1.0,CAT17
996,0.0,-1.0,CAT17
997,0.0,-1.0,CAT17
998,0.0,-1.0,CAT17
999,0.0,-1.0,CAT17
1000,0.0,-1.0,CAT17
1001,0.0,-1.0,CAT17
1002,0.0,-1.0,CAT17
1003,0.0,-1.0,CAT17
1004,0.0,-1.0,CAT17
1005,0.0,-1.0,CAT17
1006,0.0,-1.0,CAT17
1007,0.0,-1.0,CAT17
1008,0.0,-1.0,CAT17
1009,0.0,-1.0,CAT17
1010,0.0,-1.0,CAT17
1011,0.0,-1.0,CAT17
101

Sending message
1441,0.0,-1.0,CAT17
1442,0.0,-1.0,CAT17
1443,0.0,-1.0,CAT17
1444,0.0,-1.0,CAT17
1445,0.0,-1.0,CAT17
1446,0.0,-1.0,CAT17
1447,0.0,-1.0,CAT17
1448,0.0,-1.0,CAT17
1449,0.0,-1.0,CAT17
1450,0.0,-1.0,CAT17
1451,0.0,-1.0,CAT17
1452,0.0,-1.0,CAT17
1453,0.0,-1.0,CAT17
1454,0.0,-1.0,CAT17
1455,0.0,-1.0,CAT17
1456,0.0,-1.0,CAT17
1457,0.0,-1.0,CAT17
1458,0.0,-1.0,CAT17
1459,0.0,-1.0,CAT17
1460,0.0,-1.0,CAT17
1461,0.0,-1.0,CAT17
1462,0.0,-1.0,CAT17
1463,0.0,-1.0,CAT17
1464,0.0,-1.0,CAT17
1465,0.0,-1.0,CAT17
1466,0.0,-1.0,CAT17
1467,0.0,-1.0,CAT17
1468,0.0,-1.0,CAT17
1469,0.0,-1.0,CAT17
1470,0.0,-1.0,CAT17
1471,0.0,-1.0,CAT17
1472,0.0,-1.0,CAT17
1473,0.0,-1.0,CAT17
1474,0.0,-1.0,CAT17
1475,0.0,-1.0,CAT17
1476,0.0,-1.0,CAT17
1477,0.0,-1.0,CAT17
1478,0.0,-1.0,CAT17
1479,0.0,-1.0,CAT17
1480,0.0,-1.0,CAT17
1481,0.0,-1.0,CAT17
1482,0.0,-1.0,CAT17
1483,0.0,-1.0,CAT17
1484,0.0,-1.0,CAT17
1485,0.0,-1.0,CAT17
1486,0.0,-1.0,CAT17
1487,0.0,-1.0,CAT17
1488,0.0,-1.0,CAT17
1489,0.0,-1.0,CAT17
1490

Sending message
1921,0.0,-1.0,CAT17
1922,0.0,-1.0,CAT17
1923,0.0,-1.0,CAT17
1924,0.0,-1.0,CAT17
1925,0.0,-1.0,CAT17
1926,0.0,-1.0,CAT17
1927,0.0,-1.0,CAT17
1928,0.0,-1.0,CAT17
1929,0.0,-1.0,CAT17
1930,0.0,-1.0,CAT17
1931,0.0,-1.0,CAT17
1932,0.0,-1.0,CAT17
1933,0.0,-1.0,CAT17
1934,0.0,-1.0,CAT17
1935,0.0,-1.0,CAT17
1936,0.0,-1.0,CAT17
1937,0.0,-1.0,CAT17
1938,0.0,-1.0,CAT17
1939,0.0,-1.0,CAT17
1940,0.0,-1.0,CAT17
1941,0.0,-1.0,CAT17
1942,0.0,-1.0,CAT17
1943,0.0,-1.0,CAT17
1944,0.0,-1.0,CAT17
1945,0.0,-1.0,CAT17
1946,0.0,-1.0,CAT17
1947,0.0,-1.0,CAT17
1948,0.0,-1.0,CAT17
1949,0.0,-1.0,CAT17
1950,0.0,-1.0,CAT17
1951,0.0,-1.0,CAT17
1952,0.0,-1.0,CAT17
1953,0.0,-1.0,CAT17
1954,0.0,-1.0,CAT17
1955,0.0,-1.0,CAT17
1956,0.0,-1.0,CAT17
1957,0.0,-1.0,CAT17
1958,0.0,-1.0,CAT17
1959,0.0,-1.0,CAT17
1960,0.0,-1.0,CAT17
1961,0.0,-1.0,CAT17
1962,0.0,-1.0,CAT17
1963,0.0,-1.0,CAT17
1964,0.0,-1.0,CAT17
1965,0.0,-1.0,CAT17
1966,0.0,-1.0,CAT17
1967,0.0,-1.0,CAT17
1968,0.0,-1.0,CAT17
1969,0.0,-1.0,CAT17
1970

KeyboardInterrupt: 
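
Each batch goes out as newline-terminated CSV lines. TCP is a byte stream and does not preserve message boundaries, so the receiving side may get a batch in several pieces, and `socket.send` may transmit only part of its argument. The sketch below shows one way to handle both ends. The helpers `send_batch` and `read_lines` are hypothetical names, and `socket.socketpair()` stands in for the real producer/consumer connection.

```python
import socket

def send_batch(conn, rows):
    """Send rows as newline-terminated CSV lines; sendall retries until all bytes are written."""
    message = "".join(",".join(row) + "\n" for row in rows)
    conn.sendall(message.encode("utf-8"))

def read_lines(conn):
    """Yield complete lines from the stream, buffering partial chunks until EOF."""
    buffer = b""
    while True:
        chunk = conn.recv(8)          # deliberately tiny to force partial reads
        if not chunk:
            break
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")

# socketpair() returns two connected sockets, standing in for producer and consumer.
producer_end, consumer_end = socket.socketpair()
rows = [["1", "0.0", "-1.0", "CAT17"], ["2", "0.0", "-1.0", "CAT17"]]
send_batch(producer_end, rows)
producer_end.close()                  # signals EOF to the reader

received = [line.split(",") for line in read_lines(consumer_end)]
consumer_end.close()
print(received == rows)
```

Because every row ends with `\n`, the reader can rebuild rows regardless of how the bytes were split in transit.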

In [None]:
c.close()
s.close()
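
Since the producing loop is stopped by hand, the closing cell is easy to forget, which leaves the port bound until the kernel restarts. A possible alternative, sketched here with stand-in sockets, is to wrap the loop so that both sockets are closed however it ends. `run_producer` and `interrupted_loop` are hypothetical helpers, not part of this notebook.

```python
import socket

def run_producer(server_sock, conn, produce):
    """Run `produce(conn)` and close both sockets however the loop ends."""
    try:
        produce(conn)
    except KeyboardInterrupt:
        print("Producer interrupted, shutting down")
    finally:
        conn.close()
        server_sock.close()

# Demo with stand-in sockets and a loop that is "interrupted" straight away.
demo_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
demo_conn, demo_peer = socket.socketpair()

def interrupted_loop(conn):
    conn.sendall(b"1,0.0,-1.0,CAT17\n")
    raise KeyboardInterrupt  # simulates stopping the cell by hand

run_producer(demo_server, demo_conn, interrupted_loop)
demo_peer.close()

# A closed socket reports -1 as its file descriptor.
print(demo_server.fileno(), demo_conn.fileno())
```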