<a href="https://colab.research.google.com/github/nbpyth97/Exercise/blob/master/lab2/ps2022_lab2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Processamento de Streams 2022
## Lab 2 - (Unstructured) Spark Streaming
---
### Colab Setup



In [1]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#@title Install PySpark
!pip install pyspark findspark --quiet
import findspark
findspark.init()
findspark.find()

'/usr/local/lib/python3.7/dist-packages/pyspark'

---
### Weblog Sender
The stream server is a small Python TCP server listening on port 7777 (localhost).

The stream consists of text lines taken from the output log of a web server.



In [3]:
!wget -q -O - https://github.com/smduarte/ps2022/raw/main/colab/logsender.tgz | tar xfz - 2> /dev/null

!nohup python logsender/server.py logsender/web.log 7777 > /dev/null 2> /dev/null &
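
Before wiring the stream into Spark, it can help to look at a few raw lines. The sketch below is a plain-socket client, not part of the lab material; `peek_lines` and its parameters are hypothetical names, and it only assumes that the server sends newline-terminated text over TCP, as described above.

```python
import socket


def peek_lines(host="localhost", port=7777, n=5, timeout=5.0):
    """Return up to n text lines received from a TCP line server.

    Hypothetical helper: it assumes only that the server writes
    newline-terminated text to every client that connects.
    """
    lines = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # makefile() wraps the socket so it can be iterated line by line.
        with conn.makefile("r", encoding="utf-8", errors="replace") as stream:
            for line in stream:
                lines.append(line.rstrip("\r\n"))
                if len(lines) >= n:
                    break
    return lines
```

In the notebook, calling `peek_lines()` once the sender is running should return the first few log lines, which makes it easier to see which space-separated fields hold the source IP and the processing time.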

The Python code below shows the basics needed to process data from a socket source using PySpark.

The Spark Streaming Python documentation is available [here](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html).

In [20]:
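from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one for the socket receiver, one for processing.
sc = SparkContext("local[2]", "WebLogExample")

# Batch interval of 1 second.
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream("localhost", 7777)

# 10-second window, evaluated every 5 seconds; drop empty lines.
result = lines.window(10, 5).filter(lambda line: len(line) > 0)
# The second space-separated field is the source IP.
resultWords = result.map(lambda line: (line.split(' ')[1], 1))
# Count requests per IP and keep the IPs with more than 50 requests.
count = resultWords.reduceByKey(lambda a, b: a + b).filter(lambda x: x[1] > 50)

count.pprint()

ssc.start()
ssc.awaitTermination(10)  # run for about 10 seconds
ssc.stop()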

-------------------------------------------
Time: 2022-03-14 19:20:49
-------------------------------------------
('185.28.193.95', 93)
('120.52.73.98', 80)
('120.52.73.97', 100)

-------------------------------------------
Time: 2022-03-14 19:20:54
-------------------------------------------
('185.28.193.95', 93)
('192.241.151.220', 129)
('97.77.104.22', 138)
('120.52.73.98', 209)
('178.22.148.122', 191)
('120.52.73.97', 288)
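
To make the semantics of `window(10, 5)` followed by `reduceByKey` and `filter` more concrete, here is a minimal pure-Python simulation. It is only a sketch: `windowed_counts` is a hypothetical helper, and `batches[i]` is assumed to hold the keys (for example source IPs) that arrived during second `i`, matching the 1-second batch interval used above.

```python
from collections import Counter


def windowed_counts(batches, window=10, slide=5, threshold=50):
    """Simulate window(window, slide) + reduceByKey + filter on 1-second batches.

    Returns one (end_second, {key: count}) entry per emitted window.
    """
    results = []
    # A window is emitted every `slide` batches and covers the last `window` batches.
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window)
        counts = Counter()
        for batch in batches[start:end]:
            counts.update(batch)
        heavy = {key: c for key, c in counts.items() if c > threshold}
        results.append((end, heavy))
    return results


# 15 seconds of synthetic traffic: 20 requests/s from "a", 2 requests/s from "b".
batches = [["a"] * 20 + ["b"] * 2 for _ in range(15)]
print(windowed_counts(batches))
```

In this simulation the first window emitted after start covers only 5 seconds of data, while later ones cover the full 10 seconds. This is one reason counts can grow between the first and second printed results of a short run.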



---
# Exercises

## Exercise 1

In a denial-of-service event it is important to identify the IP sources that might be attacking the system, by issuing a large number of requests.

Write a program to find the IP sources that made more than 50 requests in the last 10 seconds -- dump this information every 5 seconds.


In [13]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
ssc = StreamingContext(sc, 1)

try:
  lines = ssc.socketTextStream("localhost", 7777)
  # Last 10 seconds, evaluated every 5 seconds, as the exercise asks.
  result = lines.window(10, 5).filter(lambda line: len(line) > 0)
  resultWords = result.map(lambda line: (line.split(' ')[1], 1))
  count = resultWords.reduceByKey(lambda a, b: a + b).filter(lambda x: x[1] > 50)

  count.pprint()

  ssc.start()
  ssc.awaitTermination(10)
except Exception as e:
  print('Error:', e)
finally:
  # Always stop the context so the cell can be re-run.
  ssc.stop()



-------------------------------------------
Time: 2022-03-14 19:12:46
-------------------------------------------
('185.28.193.95', 93)
('120.52.73.98', 80)
('120.52.73.97', 100)

-------------------------------------------
Time: 2022-03-14 19:12:51
-------------------------------------------
('185.28.193.95', 93)
('192.241.151.220', 128)
('97.77.104.22', 138)
('120.52.73.98', 208)
('178.22.148.122', 190)
('120.52.73.97', 288)



In [58]:
ssc.stop()

## Exercise 2

#### a)
Write a program to dump the number of requests, the minimum processing time, and the maximum processing time of the requests in the last 10 seconds, **for all** source IPs that performed more than 100 requests -- dump this information every 5 seconds.

In [60]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WebLogExample")
ssc = StreamingContext(sc, 1)

try:
  lines = ssc.socketTextStream("localhost", 7777)
  result = lines.window(10, 5).filter(lambda line: len(line) > 0)
  # Partial attempt: the key is (ip, 1) and the value is the last field
  # (the processing time) kept as a string, so reduceByKey concatenates
  # the strings instead of computing a count, a minimum and a maximum.
  line = result.map(lambda line: ((line.split()[1], 1), line.split()[-1])).reduceByKey(lambda a, b: a + b)

  line.pprint()

  ssc.start()
  ssc.awaitTermination(10)
except Exception as e:
  print('-----------Error-----------', e)
finally:
  ssc.stop()

-------------------------------------------
Time: 2022-03-14 19:54:10
-------------------------------------------
(('37.139.9.11', 1), '0.0260.0570.0150.0393.05141.512')
(('178.22.148.122', 1), '0.0880.0880.1020.0960.0910.5821.8781.8601.8431.9271.8320.1050.1190.1140.1780.1781.7601.8512.2140.0160.0660.3332.1440.1090.1863.1340.2930.1280.1170.20340.85541.7050.1700.203')
(('202.47.236.252', 1), '0.2270.2260.2082.2583.0630.1240.2570.21841.8940.1860.2350.275')
(('2a02:c207:2008:5497::1', 1), '0.2030.0540.141')
(('2a01:488:66:1000:5c33:8503:0:1', 1), '0.0680.2300.07640.8720.0910.1400.1084.2580.089')
(('120.52.73.97', 1), '2.9942.9972.9932.9872.9863.5033.1014.3012.9983.1873.1063.4873.5033.9872.9003.9883.9023.5013.9864.3014.0843.9004.0823.9883.9884.1640.1550.2320.1370.2250.3050.2550.2570.2490.1240.3100.1750.2760.2680.2513.6400.1900.1900.2720.2673.96446.84941.95941.84541.84941.76342.29542.28442.25742.46342.6820.2730.2680.2680.2390.2290.1970.1660.2270.2800.2560.5930.2350.2170.2880.1994.3354.3330.

Hint: use `glom()` to gather the elements into a list.
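
One possible way to get per-IP statistics for part a) is to carry a `(count, min, max)` tuple per request and merge the tuples with `reduceByKey`. The sketch below is not the reference solution. It assumes, as the code above does, that the second whitespace-separated field is the source IP and the last field is the processing time, and that this last field parses as a float. The names `to_stats`, `merge_stats` and `per_ip_stats` are hypothetical.

```python
def to_stats(line):
    """Map a log line to (ip, (count, min_time, max_time)).

    Assumes field 1 is the source IP and the last field is a numeric
    processing time; a malformed line would raise ValueError here.
    """
    fields = line.split()
    ip, elapsed = fields[1], float(fields[-1])
    return ip, (1, elapsed, elapsed)


def merge_stats(a, b):
    """Combine two (count, min_time, max_time) tuples."""
    return a[0] + b[0], min(a[1], b[1]), max(a[2], b[2])


def per_ip_stats(lines, window=10, slide=5, min_requests=100):
    """Build the windowed per-IP statistics from a DStream of log lines."""
    return (lines.window(window, slide)
                 .filter(lambda line: len(line) > 0)
                 .map(to_stats)
                 .reduceByKey(merge_stats)
                 .filter(lambda kv: kv[1][0] > min_requests))
```

With `lines = ssc.socketTextStream("localhost", 7777)`, calling `per_ip_stats(lines).pprint()` before `ssc.start()` would print `(ip, (count, min, max))` tuples every 5 seconds.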

#### b)

Write a program to dump the number of requests, the minimum processing time, and the maximum processing time of the requests in the last 10 seconds, **only if at least one** source IP has performed more than 100 requests -- dump this information every 5 seconds.
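
One reading of this exercise is that the statistics are global (over all requests in the window) and are printed only when some IP exceeds the threshold. Under that assumption, a minimal sketch is to gather the per-IP statistics of each window into a single list with `repartition(1)` and `glom()`, then summarize the list in plain Python. `summarize_if_heavy` and `gated_summary` are hypothetical names, and `per_ip_stream` is assumed to be a DStream of `(ip, (count, min_time, max_time))` tuples for the current window, before any per-IP filtering.

```python
def summarize_if_heavy(per_ip, min_requests=100):
    """per_ip: iterable of (ip, (count, min_time, max_time)).

    Returns [(total_requests, min_time, max_time)] if at least one IP
    made more than min_requests requests, otherwise [].
    """
    per_ip = list(per_ip)
    if not any(stats[0] > min_requests for _, stats in per_ip):
        return []
    total = sum(stats[0] for _, stats in per_ip)
    fastest = min(stats[1] for _, stats in per_ip)
    slowest = max(stats[2] for _, stats in per_ip)
    return [(total, fastest, slowest)]


def gated_summary(per_ip_stream, min_requests=100):
    """Emit one global summary per window, or nothing if no IP is heavy."""
    # repartition(1) + glom() turns each windowed RDD into one Python list.
    return (per_ip_stream.repartition(1)
                         .glom()
                         .flatMap(lambda rows: summarize_if_heavy(rows, min_requests)))
```

Moving everything to one partition is acceptable here because each window holds only one small tuple per source IP; it would not be a good choice for the raw log lines.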

## Exercise 3
Write a program to dump the IP sources that deviate most from the average in terms of the number of requests made in the last 30 seconds - dump this information every 5 seconds.
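
A possible starting point, assuming "the average" means the mean number of requests per source IP in the window and that "deviate most" means the largest absolute difference from that mean. The helper names and the choice of reporting the top `n = 3` sources are illustrative assumptions, not part of the exercise.

```python
def top_deviations(counts, n=3):
    """counts: iterable of (ip, request_count).

    Returns up to n (ip, count, deviation) tuples, ordered by the largest
    absolute deviation from the mean count.
    """
    counts = list(counts)
    if not counts:
        return []
    mean = sum(c for _, c in counts) / len(counts)
    ranked = sorted(counts, key=lambda kv: abs(kv[1] - mean), reverse=True)
    return [(ip, c, c - mean) for ip, c in ranked[:n]]


def deviating_sources(lines, window=30, slide=5, n=3):
    """Build the DStream of the n most deviating IPs per 30-second window."""
    counts = (lines.window(window, slide)
                   .filter(lambda line: len(line) > 0)
                   .map(lambda line: (line.split()[1], 1))
                   .reduceByKey(lambda a, b: a + b))
    # Gather the per-IP counts of each window into one list, then rank them.
    return (counts.repartition(1)
                  .glom()
                  .flatMap(lambda rows: top_deviations(rows, n)))
```

The mean needs all per-IP counts of the window at once, which is why the counts are gathered into a single list before ranking.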

## Exercise 4

Run additional logsender servers for subsets of the logs (IPv4 and IPv6 logs), using the following commands.

```
!nohup python logsender/server.py logsender/webipv4.log 7778 > /dev/null 2> /dev/null &
!nohup python logsender/server.py logsender/webipv6.log 7779 > /dev/null 2> /dev/null &
```

Write a program that combines the two streams, dumping the number of requests made in the last 15 seconds - dump this information every 5 seconds.
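
A minimal sketch of one way to combine the two sources, assuming the two servers above are running on ports 7778 and 7779 and that `ssc` is a `StreamingContext` with a 1-second batch interval. `combined_request_count` is a hypothetical helper name.

```python
def combined_request_count(ssc, host="localhost", ports=(7778, 7779),
                           window=15, slide=5):
    """Count the requests from both streams over a sliding window."""
    ipv4 = ssc.socketTextStream(host, ports[0])
    ipv6 = ssc.socketTextStream(host, ports[1])
    return (ipv4.union(ipv6)                           # merge the two DStreams
                .filter(lambda line: len(line) > 0)    # ignore empty lines
                .window(window, slide)                 # last 15 s, every 5 s
                .count())                              # one number per window
```

Calling `combined_request_count(ssc).pprint()` before `ssc.start()` would print one count every 5 seconds. Note that two socket receivers plus processing need more than two local threads, so a master such as `local[4]` is a safer choice than `local[2]` here.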

## Exercise 5

Write a program that combines the two streams from the previous exercise and dumps the proportion of IPv4 vs IPv6 requests in the last 20 seconds - dump this information every 5 seconds.
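
One way to approach this, sketched under the assumption that each request is tagged with the stream it came from before the union, so that a single `reduceByKey` yields the two totals. `proportions` and `ip_version_share` are hypothetical names; `ipv4_lines` and `ipv6_lines` are assumed to be the DStreams read from ports 7778 and 7779.

```python
def proportions(tagged_counts):
    """tagged_counts: iterable of (label, count).

    Returns {label: fraction of the total}, or {} when there are no requests.
    """
    tagged_counts = list(tagged_counts)
    total = sum(c for _, c in tagged_counts)
    if total == 0:
        return {}
    return {label: c / total for label, c in tagged_counts}


def ip_version_share(ipv4_lines, ipv6_lines, window=20, slide=5):
    """Build a DStream with the IPv4/IPv6 request shares per window."""
    v4 = ipv4_lines.filter(lambda l: len(l) > 0).map(lambda _: ("ipv4", 1))
    v6 = ipv6_lines.filter(lambda l: len(l) > 0).map(lambda _: ("ipv6", 1))
    return (v4.union(v6)
              .window(window, slide)
              .reduceByKey(lambda a, b: a + b)
              .repartition(1)      # bring both totals into one partition
              .glom()              # ...and into one list per window
              .map(proportions))
```

For example, `proportions([("ipv4", 30), ("ipv6", 10)])` returns `{"ipv4": 0.75, "ipv6": 0.25}`.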
