# EX2-SYS: Jupyter, Python processes, measuring performance

Your assignment: complete the `TODO`s and also include the **output of each cell**.

### Step 1: Download the HDFS log file (open data) and unzip it. Check with the professor: this file may be available internally.

In [None]:
!mkdir -p data

In [None]:
![ -e "data/hdfs/HDFS.log" ] || (wget https://zenodo.org/records/8196385/files/HDFS_v1.zip -P data/ && unzip -o data/HDFS_v1.zip -d data/hdfs && rm data/HDFS_v1.zip)

### Step 2: This exercise processes the file `data/hdfs/HDFS.log`. First, write a Python function that counts the number of lines in this file.

In [None]:
def count_lines(file_path):
    n = 0
    with open(file_path, 'r', encoding='utf-8') as file:
        for _ in file:
            n += 1
    return n

### Step 3: Test your function

In [None]:
file_path = 'data/hdfs/HDFS.log'
n = count_lines(file_path)
print('File %s has %d lines.' % (file_path, n))

### Step 4: Also, get the size of the input file (in bytes)

In [None]:
!ls -l data/hdfs/HDFS.log
!ls -l data/hdfs/HDFS.log | awk '{print $5}'

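The same value can be read from Python without parsing the `ls` output. This is a minimal sketch; `file_size_bytes` is a hypothetical helper name that is not part of the assignment.

```python
import os

def file_size_bytes(file_path):
    """Return the size of file_path in bytes, as reported by the filesystem."""
    return os.path.getsize(file_path)
```

Calling `file_size_bytes('data/hdfs/HDFS.log')` should print the same number as the fifth column of `ls -l`.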
### Step 5: List the running Python processes

In [None]:
!pgrep -af '[p]ython'

### Step 6: Python threads and also child/parent processes

In [None]:
!ps -eLf | head -1
!ps -eLf | grep -i '[p]ython'

### Step 7: Interpret and write down what the output of the last two commands actually means (process and thread hierarchy)
A: The `ps -eLf` command lists every process together with its threads. Filtering with `grep -i '[p]ython'` keeps only the entries that are running Python code.

The output shows two main Python processes:

- The `jupyter-lab` process (PID 1), running several threads (LWP 1, 9, 10, 20, 21).
- The notebook kernel process (`ipykernel_launcher`, PID 19), which is a child of the previous process and also has several threads (LWP 19, 22–26).

Each row is one thread of a process. The main thread has an LWP equal to the PID; the others are secondary threads (created by libraries or by the Python interpreter).
The hierarchy shows that the Python kernel (PID 19) was started by the Jupyter process (PID 1), which is the parent-child relationship between them.

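The identifiers that `ps -eLf` prints can also be inspected from inside the kernel itself. This is a minimal sketch, assuming Python 3.8 or later on Linux (needed for `threading.get_native_id()`); `describe_current_process` is a hypothetical helper name.

```python
import os
import threading

def describe_current_process():
    """Collect the identifiers that ps -eLf reports for this interpreter."""
    return {
        "pid": os.getpid(),                              # PID column
        "ppid": os.getppid(),                            # PPID column (parent process)
        "native_thread_id": threading.get_native_id(),   # LWP column of the calling thread
        # Only threads known to Python's threading module are listed here.
        "thread_names": [t.name for t in threading.enumerate()],
    }

info = describe_current_process()
print(info)
```

When run in a notebook cell, `ppid` should match the PID of the Jupyter server process. `threading.enumerate()` may list fewer threads than `ps`, since native threads created by C libraries are not visible to it.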
### Step 8: Write a function that categorizes the lines in `HDFS.log` into a nested dictionary (log type, then class name, then list of messages):

In [None]:
import re
from collections import defaultdict

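def get_logfile_data_as_dict(file_path):
    """Return {log_type: {class_name: [message, ...]}} for the given log file."""
    data = defaultdict(lambda: defaultdict(list))
    # Capture the log level, the logging class, and the message text after "class:"
    pattern = re.compile(r'\b(INFO|WARN|ERROR|DEBUG|TRACE|FATAL)\b\s+([\w\.]+):\s+(.*)')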

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            match = pattern.search(line)
            if match:
                log_type = match.group(1).lower()
                class_name = match.group(2)
                message = match.group(3).strip()
                data[log_type][class_name].append(message)
    return data

### Step 9: Measure the throughput of `count_lines()` and `get_logfile_data_as_dict()`

In [None]:
import time

def measure_time_count_lines(file_path):
    # perf_counter() is a monotonic, high-resolution clock intended for timing intervals
    start_time = time.perf_counter()
    count_lines(file_path)
    elapsed_time = time.perf_counter() - start_time
    return elapsed_time

def measure_time_get_logfile_data_as_dict(file_path):
    start_time = time.perf_counter()
    get_logfile_data_as_dict(file_path)
    elapsed_time = time.perf_counter() - start_time
    return elapsed_time

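The two functions above return elapsed time; throughput follows by dividing the amount of data processed by that time. The sketch below reports bytes per second, using the file size as the amount of data. `measure_throughput` is a hypothetical helper, and bytes per second is only one possible unit (lines per second would work the same way).

```python
import os
import time

def measure_throughput(func, file_path):
    """Run func(file_path) once and return (elapsed_seconds, bytes_per_second)."""
    size_bytes = os.path.getsize(file_path)
    start = time.perf_counter()
    func(file_path)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small files or coarse clocks.
    bytes_per_s = size_bytes / elapsed if elapsed > 0 else float("inf")
    return elapsed, bytes_per_s
```

For example, `measure_throughput(count_lines, 'data/hdfs/HDFS.log')` times one pass of `count_lines()` over the log file.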
### Step 10: Replication: repeat the previous functions a number of times, report each time

In [None]:
num_replications = 3
file_path = 'data/hdfs/HDFS.log'
times_lines = []
times_logs = []
print("== Replicating count_lines ==")
for i in range(num_replications):
    t = measure_time_count_lines(file_path)
    times_lines.append(t)
    print(f"Repetition {i+1}: {t:.4f} seconds")

print("\n== Replicating get_logfile_data_as_dict ==")
for i in range(num_replications):
    t = measure_time_get_logfile_data_as_dict(file_path)
    times_logs.append(t)
    print(f"Repetition {i+1}: {t:.4f} seconds")

### Step 11: Take the average, minimum, maximum and standard deviation of those runtime values

In [None]:
import statistics

def print_stats(times, func):
    print(f"\n== Stats for {func} ==")
    print(f"Average: {statistics.mean(times):.4f} seconds")
    print(f"Min    : {min(times):.4f} seconds")
    print(f"Max    : {max(times):.4f} seconds")
    print(f"Stdev  : {statistics.stdev(times):.4f} seconds" if len(times) > 1 else "Stdev  : N/A")

print_stats(times_lines, "count_lines")
print_stats(times_logs, "get_logfile_data_as_dict")

### Step 12: Response time

In [None]:
server_code = """import socket
import time
import random

def process_message(message):
    print(f"Received message: {message}")
    time.sleep(random.uniform(0, 1))  # sleep for a random time to emulate a varying service time
    if message == "stop":
        return True, f"Processed: {message}"
    else:
        return False, f"Processed: {message}"    

def start_server(host='0.0.0.0', port=12345):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_socket.bind((host, port))
        server_socket.listen(1)
        print(f"Listening on {host}:{port}...")

        stop = False

        while not stop:
        
            conn, addr = server_socket.accept()
            with conn:
                print(f"Connection from {addr}")
                data = conn.recv(1024).decode().strip()
                if data:
                    stop, response = process_message(data)
                    conn.sendall(response.encode())
                
if __name__ == "__main__":
    start_server()
"""

# Write the code to the file
server_file_path = '/tmp/server.py'
with open(server_file_path, "w") as file:
    file.write(server_code)

print(f"Python code written to {server_file_path}")

In [None]:
client_code = """import socket
import sys
import time

def send_message(host='127.0.0.1', port=12345, message='Hello, Server!'):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client_socket:
        client_socket.connect((host, port))
        client_socket.sendall(message.encode())
        response = client_socket.recv(1024).decode()
        print(f"Server response: {response}")

if __name__ == "__main__":
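    # Fall back to the default greeting when no argument is given (avoids an IndexError)
    message = sys.argv[1] if len(sys.argv) > 1 else 'Hello, Server!'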
    start_time = time.time()
    send_message(message=message)
    elapsed_time = time.time() - start_time
    print(f"Response time: {elapsed_time} seconds")
"""

# Write the code to the file
client_file_path = '/tmp/client.py'
with open(client_file_path, "w") as file:
    file.write(client_code)

print(f"Python code written to {client_file_path}")

### Step 13: TODO: Open two terminals:

- Start by running this in each terminal (pyenv). It selects a *specific* Python installation, with its own packages and versions:

```bash
source /app/.venv/bin/activate
```

- First one: run server
- Second one: run client (a few times)
- Include here the output

```text

8d5244dad408:/app/hostdir# python /tmp/server.py
Listening on 0.0.0.0:12345...
Connection from ('127.0.0.1', 34508)
Received message: Diogo
Connection from ('127.0.0.1', 37712)
Received message: Rafael
Connection from ('127.0.0.1', 37728)
Received message: Joao
Connection from ('127.0.0.1', 51978)
Received message: Arthur

Output:
8d5244dad408:/app/hostdir# python /tmp/client.py Diogo
Server response: Processed: Diogo
Response time: 0.22179007530212402 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py Rafael
Server response: Processed: Rafael
Response time: 0.9741623401641846 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py Joao
Server response: Processed: Joao
Response time: 0.4639458656311035 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py Arthur
Server response: Processed: Arthur
Response time: 0.04177045822143555 seconds
8d5244dad408:/app/hostdir# 
```

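The four response times above can be summarized in the same way as the runtimes in Step 11. Since the server sleeps for `random.uniform(0, 1)` seconds per request, the expected sleep is 0.5 s, and individual values should fall between roughly 0 and 1 s plus a small overhead. A minimal sketch, with the values copied from the client output:

```python
import statistics

# Response times copied from the client output above, in seconds.
response_times = [
    0.22179007530212402,
    0.9741623401641846,
    0.4639458656311035,
    0.04177045822143555,
]

summary = {
    "mean": statistics.mean(response_times),
    "min": min(response_times),
    "max": max(response_times),
    "stdev": statistics.stdev(response_times),  # sample standard deviation (n - 1)
}
for name, value in summary.items():
    print(f"{name:5s}: {value:.4f} s")
```

With only four samples the mean is a rough estimate; more client runs would be needed to get close to the expected 0.5 s.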
### Step 14: Modify server.py to:
1. Build an in-memory dict for the HDFS file using `get_logfile_data_as_dict()`.
2. Make `process_message()` take one of the log types (info, warn, error, etc.) as input and return the **total number of string characters** over all log lines of that type.
3. The code below contains the function that must be implemented.

In [None]:
server_code_modified = """import socket
import re
from collections import defaultdict


def get_logfile_data_as_dict(file_path):
    data = defaultdict(lambda: defaultdict(list))
    pattern = re.compile(r'\\b(INFO|WARN|ERROR|DEBUG|TRACE|FATAL)\\b\\s+([\\w\\.]+):\\s+(.*)')

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            match = pattern.search(line)
            if match:
                log_type = match.group(1).lower()
                class_name = match.group(2)
                message = match.group(3).strip()
                data[log_type][class_name].append(message)

    return data

log_data = get_logfile_data_as_dict('data/hdfs/HDFS.log')

def process_message(message):
    print(f"Received message: {message}")

    log_type = message.lower()
    if log_type == "stop":
        return True, f"Processed: {message}"

    # Sum the lengths of all messages stored under this log type (0 if the type is unknown)
    total_chars = 0
    for messages in log_data.get(log_type, {}).values():
        total_chars += sum(len(log_msg) for log_msg in messages)
    return False, f"Total characters for '{message}': {total_chars}"


def start_server(host='0.0.0.0', port=12345):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        server_socket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server_socket.bind((host, port))
        server_socket.listen(1)
        print(f"Listening on {host}:{port}...")

        stop = False

        while not stop:
            conn, addr = server_socket.accept()
            with conn:
                print(f"Connection from {addr}")
                data = conn.recv(1024).decode().strip()
                if data:
                    stop, response = process_message(data)
                    conn.sendall(response.encode())
                
if __name__ == "__main__":
    start_server()
"""

# Write the code to the file
server_file_path = '/tmp/server_modified.py'
with open(server_file_path, "w") as file:
    file.write(server_code_modified)

print(f"Python code written to {server_file_path}")

### Step 15: Measure response time with the modified version of server: `server_modified.py`

```text
Output:
8d5244dad408:/app/hostdir# python /tmp/client.py INFO
Server response: Total characters for 'INFO': 573129686
Response time: 0.8905811309814453 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py WARN
Server response: Total characters for 'WARN': 681065
Response time: 0.0015490055084228516 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py ERROR
Server response: Total characters for 'ERROR': 0
Response time: 0.0004227161407470703 seconds
8d5244dad408:/app/hostdir# python /tmp/client.py DEBUG
Server response: Total characters for 'DEBUG': 0
Response time: 0.0006170272827148438 seconds

8d5244dad408:/app/hostdir# python /tmp/server_modified.py
Listening on 0.0.0.0:12345...
Connection from ('127.0.0.1', 42904)
Received message: INFO
Connection from ('127.0.0.1', 42912)
Received message: WARN
Connection from ('127.0.0.1', 59646)
Received message: ERROR
Connection from ('127.0.0.1', 59662)
Received message: DEBUG
```
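In the measurements above, `INFO` takes roughly 0.89 s while the other types answer in about a millisecond or less. A likely reason is that `process_message()` walks every stored message on each request, and nearly all of the stored characters belong to `INFO`. One possible variation, not required by the assignment, is to compute the totals once at startup and answer each request with a dictionary lookup. In the sketch below, `precompute_char_totals` and `process_message_fast` are hypothetical names, and `sample` is a hand-made stand-in for the real log data.

```python
from collections import defaultdict

def precompute_char_totals(log_data):
    """Map each log type to the total number of characters in its messages."""
    totals = defaultdict(int)
    for log_type, classes in log_data.items():
        for messages in classes.values():
            totals[log_type] += sum(len(m) for m in messages)
    return dict(totals)

def process_message_fast(message, totals):
    """Variant of process_message() that answers from the precomputed totals."""
    key = message.lower()
    if key == "stop":
        return True, f"Processed: {message}"
    return False, f"Total characters for '{message}': {totals.get(key, 0)}"

# Tiny hand-made sample in the same shape that get_logfile_data_as_dict() returns.
sample = {
    "info": {"dfs.DataNode": ["abc", "de"], "dfs.FSNamesystem": ["f"]},
    "warn": {"dfs.DataNode": ["wxyz"]},
}
totals = precompute_char_totals(sample)
print(process_message_fast("INFO", totals))
```

The trade-off is a slightly longer startup in exchange for response times that no longer depend on how many lines a log type has.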