<a href="https://colab.research.google.com/github/nbpyth97/Exercise/blob/master/lab2/ps2022_lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processamento de Streams 2022
## Lab 2 - (Unstructured) Spark Streaming
---
### Colab Setup



In [21]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

'/usr/local/lib/python3.7/dist-packages/pyspark'

---
### Weblog Sender
The stream server is a small Python TCP server listening on port 7777 of localhost.

The stream consists of text lines taken from the output log of a web server.



In [23]:
!wget -q -O - https://github.com/smduarte/ps2022/raw/main/colab/logsender.tgz | tar xfz - 2> /dev/null

!nohup python logsender/server.py logsender/web.log 7777 > /dev/null 2> /dev/null &
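Before building a Spark pipeline, it can help to look at a few raw lines from the sender. The sketch below is only an illustration: `take_lines` and `peek_stream` are hypothetical helper names, and it assumes the server starts writing lines as soon as a client connects.

```python
import socket
from itertools import islice


def take_lines(reader, max_lines=5):
    """Return up to max_lines lines from a text reader, without line endings."""
    return [line.rstrip("\r\n") for line in islice(reader, max_lines)]


def peek_stream(host="localhost", port=7777, max_lines=5, timeout=5.0):
    """Connect to the log sender and return the first few lines it sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # Wrap the socket in a text reader so it can be iterated line by line.
        with conn.makefile("r", encoding="utf-8", errors="replace") as reader:
            return take_lines(reader, max_lines)
```

Calling `peek_stream()` in a cell and printing the result shows which space-separated field holds the source IP and which one holds the processing time.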

The Python code below shows the basics needed to process data from a socket source using PySpark.

The Spark Streaming Python documentation is available [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

In [24]:
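from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one for the socket receiver, one for processing.
sc = SparkContext("local[2]", "WebLogExample")

# Micro-batches of 1 second.
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 7777)

# 10-second window, evaluated every 5 seconds; drop empty lines.
result = lines.window(10, 5).filter(lambda line: len(line) > 0)
# The second space-separated field is the source IP: emit (ip, 1).
resultWords = result.map(lambda line: (line.split(' ')[1], 1))
# Count requests per IP and keep the IPs with more than 50.
count = resultWords.reduceByKey(lambda a, b: a + b).filter(lambda x: x[1] > 50)

count.pprint()

ssc.start()
ssc.awaitTermination(10)  # run for about 10 seconds
ssc.stop()                # also stops the SparkContext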

-------------------------------------------
Time: 2022-03-15 11:41:54
-------------------------------------------
('185.28.193.95', 93)
('120.52.73.98', 80)
('120.52.73.97', 100)

-------------------------------------------
Time: 2022-03-15 11:41:59
-------------------------------------------
('185.28.193.95', 93)
('192.241.151.220', 128)
('97.77.104.22', 138)
('120.52.73.98', 207)
('178.22.148.122', 189)
('120.52.73.97', 287)



---
# Exercises

## Exercise 1

In a denial-of-service event, it is important to identify the source IPs that might be attacking the system by issuing a large number of requests.

Write a program to find the source IPs that made more than 50 requests in the last 10 seconds, dumping this information every 5 seconds.


In [25]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
ssc = StreamingContext(sc, 1)

try:
  lines = ssc.socketTextStream("localhost", 7777)
  # Last 10 seconds of data, evaluated every 5 seconds.
  result = lines.window(10, 5).filter(lambda line: len(line) > 0)
  # (ip, 1) per request, then count per IP and keep IPs above 50 requests.
  resultWords = result.map(lambda line: (line.split(' ')[1], 1))
  count = resultWords.reduceByKey(lambda a, b: a + b).filter(lambda x: x[1] > 50)

  count.pprint()

  ssc.start()
  ssc.awaitTermination(10)
except Exception as e:
  print('Error:', e)
finally:
  # Always release the streaming and Spark contexts.
  ssc.stop()



-------------------------------------------
Time: 2022-03-15 11:42:07
-------------------------------------------
('185.28.193.95', 93)
('120.52.73.98', 80)
('120.52.73.97', 100)

-------------------------------------------
Time: 2022-03-15 11:42:12
-------------------------------------------
('185.28.193.95', 93)
('192.241.151.220', 156)
('97.77.104.22', 162)
('120.52.73.98', 248)
('178.22.148.122', 223)
('120.52.73.97', 341)



In [26]:
ssc.stop()

## Exercise 2

#### a)
Write a program to dump the number of requests, the minimum processing time, and the maximum processing time per request in the last 10 seconds, **for all** source IPs that performed more than 100 requests. Dump this information every 5 seconds.

In [28]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
ssc = StreamingContext(sc, 1)

def to_pair(line):
  # Second field is the source IP; the last field is taken as the processing time.
  fields = line.split()
  return ['({0},{1})'.format(fields[1], 1), fields[-1]]

try:
  lines = ssc.socketTextStream("localhost", 7777)
  result = lines.window(10, 5).filter(lambda line: len(line) > 0)
  # Intermediate step only: pairs each request with its processing time;
  # the per-IP count, minimum and maximum are not aggregated here.
  line = result.map(to_pair)

  line.pprint()

  ssc.start()
  ssc.awaitTermination(10)
except Exception as e:
  print('-----------Error-----------', e)
finally:
  ssc.stop()

-------------------------------------------
Time: 2022-03-15 11:43:15
-------------------------------------------
['(37.139.9.11,1)', '0.026']
['(178.22.148.122,1)', '0.088']
['(178.22.148.122,1)', '0.088']
['(37.139.9.11,1)', '0.057']
['(37.139.9.11,1)', '0.015']
['(185.28.193.95,1)', '0.056']
['(185.28.193.95,1)', '0.052']
['(185.28.193.95,1)', '0.055']
['(185.28.193.95,1)', '0.013']
['(37.139.9.11,1)', '0.039']
...

-------------------------------------------
Time: 2022-03-15 11:43:20
-------------------------------------------
['(37.139.9.11,1)', '0.026']
['(178.22.148.122,1)', '0.088']
['(178.22.148.122,1)', '0.088']
['(37.139.9.11,1)', '0.057']
['(37.139.9.11,1)', '0.015']
['(185.28.193.95,1)', '0.056']
['(185.28.193.95,1)', '0.052']
['(185.28.193.95,1)', '0.055']
['(185.28.193.95,1)', '0.013']
['(37.139.9.11,1)', '0.039']
...



Note: use `glom()` to turn the elements into a list.
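One possible way to finish part a) is to carry a `(count, min_time, max_time)` tuple per request and merge tuples per IP with `reduceByKey`. This is a minimal sketch rather than a reference solution: `parse_request`, `merge_stats` and `request_stats` are hypothetical names, and it assumes the second field of each line is the source IP and the last field is the processing time in seconds.

```python
def parse_request(line):
    """Map one log line to (ip, (count, min_time, max_time))."""
    fields = line.split()
    elapsed = float(fields[-1])  # assumed: last field is the processing time
    return fields[1], (1, elapsed, elapsed)


def merge_stats(a, b):
    """Combine two (count, min_time, max_time) tuples for the same IP."""
    return a[0] + b[0], min(a[1], b[1]), max(a[2], b[2])


def request_stats(lines, min_requests=100):
    """Build the per-IP statistics stream from a DStream of log lines."""
    return (lines.window(10, 5)                       # last 10 s, every 5 s
                 .filter(lambda line: len(line.strip()) > 0)
                 .map(parse_request)
                 .reduceByKey(merge_stats)
                 .filter(lambda kv: kv[1][0] > min_requests))
```

With a running `StreamingContext`, `request_stats(ssc.socketTextStream("localhost", 7777)).pprint()` would print one `(ip, (count, min, max))` record per qualifying IP. Keeping the merge in a plain function makes it easy to check outside Spark.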

#### b)

Write a program to dump the number of requests, the minimum processing time, and the maximum processing time per request in the last 10 seconds, **only if at least one** source IP has performed more than 100 requests. Dump this information every 5 seconds.
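The "only if at least one" condition can be read as a gate applied to each windowed result: either every record passes or none does. The sketch below follows that reading, which is an assumption about the exercise. It expects a DStream of `(ip, (count, min_time, max_time))` records, such as the output of a windowed `reduceByKey`; `gate_rdd` and `gated_stats` are hypothetical names.

```python
def gate_rdd(rdd, threshold=100):
    """Keep every record if some IP has count > threshold, else keep none."""
    # Evaluated on the driver once per window.
    heavy = not rdd.filter(lambda kv: kv[1][0] > threshold).isEmpty()
    # The flag is constant for this RDD, so the filter passes all or nothing.
    return rdd.filter(lambda kv: heavy)


def gated_stats(stats_stream, threshold=100):
    """Apply the gate to each RDD of a DStream of (ip, (count, min, max))."""
    return stats_stream.transform(lambda rdd: gate_rdd(rdd, threshold))
```

`transform` is used because the decision depends on the whole window rather than on a single record, which a plain `filter` on the DStream cannot express.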

## Exercise 3
Write a program to dump the IP sources that deviate most from the average in terms of the number of requests made in the last 30 seconds - dump this information every 5 seconds.
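A possible starting point, assuming "deviate most" means the largest absolute difference between an IP's request count and the mean count over all IPs in the window. The names `top_deviations` and `deviation_stream`, and the choice of reporting the top 3, are illustrative assumptions.

```python
def top_deviations(counts, top_n=3):
    """Return the top_n (ip, count, deviation) triples, largest deviation first.

    counts is a list of (ip, request_count) pairs; deviation is the absolute
    difference between an IP's count and the mean count over all IPs.
    """
    if not counts:
        return []
    mean = sum(n for _ip, n in counts) / len(counts)
    scored = [(ip, n, abs(n - mean)) for ip, n in counts]
    return sorted(scored, key=lambda t: t[2], reverse=True)[:top_n]


def deviation_stream(lines, top_n=3):
    """Build a DStream of the most deviating IPs from a DStream of log lines."""
    counts = (lines.window(30, 5)                     # last 30 s, every 5 s
                   .filter(lambda line: len(line.strip()) > 0)
                   .map(lambda line: (line.split()[1], 1))
                   .reduceByKey(lambda a, b: a + b))
    # collect() brings the per-IP counts to the driver; this assumes the
    # number of distinct IPs per window is small.
    return counts.transform(
        lambda rdd: rdd.context.parallelize(top_deviations(rdd.collect(), top_n)))
```

Doing the ranking in a plain function keeps the statistical part testable without a Spark cluster; a fully distributed version would compute the mean with a separate reduction instead of `collect()`.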

## Exercise 4

Run additional logsender servers for subsets of the logs (IPv4 and IPv6 logs), using the following commands.

```
!nohup python logsender/server.py logsender/webipv4.log 7778 > /dev/null 2> /dev/null &
!nohup python logsender/server.py logsender/webipv6.log 7779 > /dev/null 2> /dev/null &
```

Write a program that combines the two streams, dumping the number of requests made in the last 15 seconds - dump this information every 5 seconds.
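Combining the streams can be sketched with `union` followed by a windowed `count`. The function name `combined_request_count` is hypothetical, and the sketch assumes both senders run on localhost on the ports given above.

```python
def combined_request_count(ssc, host="localhost", ports=(7778, 7779)):
    """Return a DStream holding the number of requests received on all
    ports during the last 15 seconds, evaluated every 5 seconds."""
    streams = [ssc.socketTextStream(host, port) for port in ports]
    merged = streams[0]
    for other in streams[1:]:
        merged = merged.union(other)   # one stream with the lines of both
    return (merged.filter(lambda line: len(line.strip()) > 0)
                  .window(15, 5)
                  .count())            # one number per evaluated window
```

Each socket receiver occupies one thread, so with two receivers the `SparkContext` needs more than two local threads (for example `local[3]`); otherwise no thread is left to process the data.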

## Exercise 5

Write a program that combines the two streams from the previous exercise and dumps the proportion of IPv4 vs IPv6 requests in the last 20 seconds - dump this information every 5 seconds.
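One way to approach this is to tag each line with its stream of origin before the union, count per tag over the window, and normalise the counts. This is a sketch under that assumption; `proportions` and `protocol_shares` are hypothetical names.

```python
def proportions(counts):
    """Turn [(label, count), ...] into [(label, share), ...], sorted by label."""
    total = sum(n for _label, n in counts)
    if total == 0:
        return []
    return [(label, n / total) for label, n in sorted(counts)]


def protocol_shares(ipv4_lines, ipv6_lines):
    """Build a DStream of (label, share) from the two DStreams of log lines."""
    def tag(stream, label):
        # Replace each non-empty line by (label, 1) so it can be counted.
        return (stream.filter(lambda line: len(line.strip()) > 0)
                      .map(lambda _line: (label, 1)))

    tagged = tag(ipv4_lines, "ipv4").union(tag(ipv6_lines, "ipv6"))
    counts = tagged.window(20, 5).reduceByKey(lambda a, b: a + b)
    # At most two records per window, so collecting on the driver is cheap.
    return counts.transform(
        lambda rdd: rdd.context.parallelize(proportions(rdd.collect())))
```

The two input streams would come from `ssc.socketTextStream("localhost", 7778)` and `ssc.socketTextStream("localhost", 7779)`; calling `pprint()` on the result prints the shares for each evaluated window.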
