
# Tutorial 9 - Intro to Parallel Programming using Python


### INFO - Use the `multiprocess` package, not `multiprocessing`, if you're using JupyterLab/IPython.
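
One reason for this advice: `multiprocess` is a fork of the standard `multiprocessing` module that serializes with `dill`, which can usually ship functions defined in a notebook cell to worker processes. Below is a minimal sketch of an import that prefers `multiprocess` and falls back to the standard library, assuming the fallback only matters when the code runs as a plain script. `square` and `parallel_squares` are hypothetical names used for illustration.

```python
# Prefer the dill-based package; fall back to the standard library if it is missing.
try:
    import multiprocess as mp
except ImportError:  # assumption: acceptable when running as a plain .py script
    import multiprocessing as mp


def square(n):
    """Hypothetical stand-in for real per-item work."""
    return n * n


def parallel_squares(numbers, workers=2):
    """Map `square` over `numbers` using a pool of worker processes."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(square, numbers)
```

In a notebook, call `parallel_squares(range(5))` from a cell. In a script, call it under `if __name__ == "__main__":` so that child processes which re-import the module do not start pools of their own.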

### Question 1 - Image Filter Processing

Scenario:

* You work as a data engineer at a photo analysis company. Part of your pipeline involves applying a grayscale filter to user-uploaded images before analysis.

* The current implementation processes each image one at a time, which is too slow for large batches. You decide to speed this up using Python's multiprocessing module, as grayscale conversion is CPU-bound.

* Your job is to apply the filter in parallel using `multiprocess`, compare it to the sequential method, and report performance improvements.



Task:  
1. Use the function `apply_grayscale(image_path)` to simulate grayscale processing.

1. Create a list of 10 simulated image file paths.

1. Use `multiprocess.Pool` to process them in parallel.

1. Compare the time taken for sequential vs. parallel processing.

1. Print your conclusion.


Provided code snippet:

```
import time

def apply_grayscale(image_path):
    """Simulates a CPU-bound grayscale filter (takes ~1 second)."""
    print(f"Processing {image_path}")
    time.sleep(1)  # Stand-in for ~1 s of CPU work (sleep itself does not load the CPU)
    return f"{image_path} processed"


```
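
`apply_grayscale` only sleeps, so it stands in for the real work. For reference, a real grayscale conversion is often a per-pixel weighted sum of the red, green, and blue channels. The sketch below uses the common BT.601 luma weights (0.299, 0.587, 0.114) in integer arithmetic; `rgb_to_gray` and `grayscale_image` are hypothetical names, and representing an image as nested lists of tuples is an assumption made to keep the example dependency-free.

```python
def rgb_to_gray(r, g, b):
    """Weighted channel sum (BT.601 luma weights), rounded to the nearest integer."""
    return (299 * r + 587 * g + 114 * b + 500) // 1000


def grayscale_image(pixels):
    """Convert rows of (r, g, b) tuples into rows of gray levels (0-255)."""
    return [[rgb_to_gray(*pixel) for pixel in row] for row in pixels]


# A hypothetical 2x2 image: white, black, pure red, pure blue
tiny_image = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
print(grayscale_image(tiny_image))
```

This loop is genuinely CPU-bound, which is the kind of work a process pool is meant to spread across cores.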

In [3]:
%pip install multiprocess

Installing collected packages: dill, multiprocess
Successfully installed dill-0.4.0 multiprocess-0.70.18
Note: you may need to restart the kernel to use updated packages.





In [4]:
# Task 1: Use the function apply_grayscale(image_path) to simulate grayscale processing
# (Function already provided above)
import time
import multiprocess as mp

def apply_grayscale(image_path):
    """Simulates a CPU-bound grayscale filter (takes ~1 second)."""
    print(f"Processing {image_path}")
    time.sleep(1)  # Stand-in for ~1 s of CPU work (sleep itself does not load the CPU)
    return f"{image_path} processed"



In [5]:
# Task 2: Create a list of 10 simulated image file paths
image_paths = [f"image_{i}.jpg" for i in range(1, 11)]
print(f"Image paths: {image_paths}")

Image paths: ['image_1.jpg', 'image_2.jpg', 'image_3.jpg', 'image_4.jpg', 'image_5.jpg', 'image_6.jpg', 'image_7.jpg', 'image_8.jpg', 'image_9.jpg', 'image_10.jpg']


In [10]:
# Task 3: Use multiprocess.Pool to process them in parallel
# Import here so the cell does not depend on earlier cells having run
import time
import multiprocess as mp

print("\n--- Parallel Processing ---")
start_time = time.time()
# In a notebook __name__ is always '__main__'; the guard matters when run as a script
if __name__ == '__main__':
    # Pool() starts one worker per available CPU by default
    with mp.Pool() as pool:
        parallel_results = pool.map(apply_grayscale, image_paths)
parallel_time = time.time() - start_time
print(f"Parallel processing took: {parallel_time:.2f} seconds")

In [11]:
# Task 4: Compare the time taken for sequential vs. parallel processing

# Sequential processing
print("\n--- Sequential Processing ---")
start_time = time.time()
sequential_results = []
for image_path in image_paths:
    result = apply_grayscale(image_path)
    sequential_results.append(result)
sequential_time = time.time() - start_time
print(f"Sequential processing took: {sequential_time:.2f} seconds")


--- Sequential Processing ---
Processing image_1.jpg
Processing image_2.jpg
Processing image_3.jpg
Processing image_4.jpg
Processing image_5.jpg
Processing image_6.jpg
Processing image_7.jpg
Processing image_8.jpg
Processing image_9.jpg
Processing image_10.jpg
Sequential processing took: 10.07 seconds
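
The sequential run is close to the 10 s expected from ten 1-second tasks. For the parallel run, a rough best-case estimate is that tasks execute in rounds of `workers` at a time, so the wall time is about `ceil(n_tasks / workers) * task_seconds`, ignoring process startup and scheduling overhead. A small sketch of that arithmetic follows; `ideal_parallel_time` and `ideal_speedup` are hypothetical helper names.

```python
import math


def ideal_parallel_time(n_tasks, task_seconds, workers):
    """Best-case wall time when tasks run in rounds of `workers` at a time."""
    rounds = math.ceil(n_tasks / workers)
    return rounds * task_seconds


def ideal_speedup(n_tasks, task_seconds, workers):
    """Sequential time divided by the best-case parallel time."""
    sequential = n_tasks * task_seconds
    return sequential / ideal_parallel_time(n_tasks, task_seconds, workers)


# Ten 1-second tasks, as in this question, for a few hypothetical worker counts
for workers in (1, 2, 4, 8, 10):
    seconds = ideal_parallel_time(10, 1.0, workers)
    print(f"{workers:>2} workers: ~{seconds:.0f} s, speedup ~{ideal_speedup(10, 1.0, workers):.2f}x")
```

With 4 workers the estimate is 3 rounds, so about 3 s and a speedup near 3.3x. Measured numbers will be somewhat worse because of pool startup cost.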


In [12]:
# Task 5: Print your conclusion
# Needs sequential_time and parallel_time from the two timing cells above
print("\n--- Performance Comparison ---")
speedup = sequential_time / parallel_time
print(f"Sequential time: {sequential_time:.2f} seconds")
print(f"Parallel time: {parallel_time:.2f} seconds")
print(f"Speedup: {speedup:.2f}x")
print(f"Performance improvement: {((sequential_time - parallel_time) / sequential_time * 100):.1f}%")

print("\n--- Conclusion ---")
print(f"Parallel processing is {speedup:.2f} times faster than sequential processing.")
print("This demonstrates the effectiveness of multiprocessing for CPU-bound tasks like image processing.")

### Question 2 - Log File Analysis with Multiprocessing

Scenario:  
* You are a backend engineer at a cybersecurity company. Every day, your system generates log files that track user activity, errors, and suspicious behavior. Each file contains thousands of lines.

* Your job is to scan each log file for suspicious activity (e.g., the keyword "unauthorized access"). This task is currently done sequentially and takes too long when scanning dozens of log files.

* Your manager asks you to speed up the process using parallel programming.

Task:

1. Generate synthetic log files, each with 1,000 lines of random content.

1. Insert the keyword "unauthorized access" randomly in some files.

1. Write a function to scan a single log file for that keyword.

1. Use `multiprocess` to scan all files in parallel.

1. Compare the execution time of sequential vs. parallel scanning.

1. Report which files contain the keyword.

#### Provided code snippet

```
import random

# Generate a synthetic log file
def generate_log_file(num_lines=1000, keyword="unauthorized access", inject=False):
    lines = []
    for _ in range(num_lines):
        line = "User login successful" if random.random() > 0.01 else "System error occurred"
        lines.append(line)
    if inject:
        index = random.randint(0, num_lines - 1)
        lines[index] = f"Alert: {keyword} detected from IP 192.168.0.{random.randint(1,255)}"
    return lines
```
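
Because `generate_log_file` draws from the module-level `random` generator, every run produces different files. When comparing sequential and parallel results it can help to make the data reproducible. A minimal sketch, assuming a fixed seed per file is acceptable; `generate_seeded_log` is a hypothetical wrapper, and the generator function is repeated only so the block runs on its own.

```python
import random


def generate_log_file(num_lines=1000, keyword="unauthorized access", inject=False):
    """Same generator as in the provided snippet."""
    lines = []
    for _ in range(num_lines):
        line = "User login successful" if random.random() > 0.01 else "System error occurred"
        lines.append(line)
    if inject:
        index = random.randint(0, num_lines - 1)
        lines[index] = f"Alert: {keyword} detected from IP 192.168.0.{random.randint(1,255)}"
    return lines


def generate_seeded_log(seed, inject=False):
    """Hypothetical wrapper: seed the global generator, then build one log."""
    random.seed(seed)
    return generate_log_file(inject=inject)


first = generate_seeded_log(seed=42, inject=True)
second = generate_seeded_log(seed=42, inject=True)
print(first == second)  # identical content for the same seed
```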

Expected output:

- List of files where "unauthorized access" is detected

- Time taken for sequential and parallel processing

- Performance conclusion

In [13]:
import random

# Generate a synthetic log file
def generate_log_file(num_lines=1000, keyword="unauthorized access", inject=False):
    lines = []
    for _ in range(num_lines):
        line = "User login successful" if random.random() > 0.01 else "System error occurred"
        lines.append(line)
    if inject:
        index = random.randint(0, num_lines - 1)
        lines[index] = f"Alert: {keyword} detected from IP 192.168.0.{random.randint(1,255)}"
    return lines

In [14]:
import os

# Create multiple synthetic log files, randomly injecting the keyword

log_dir = "logs"
os.makedirs(log_dir, exist_ok=True)

num_files = 10
log_filenames = []
for i in range(num_files):
    inject = random.choice([True, False])  # Randomly decide to inject keyword
    filename = os.path.join(log_dir, f"log_{i+1}.txt")
    with open(filename, 'w') as f:
        log_lines = generate_log_file(num_lines=1000, keyword="unauthorized access", inject=inject)
        for line in log_lines:
            f.write(line + "\n")
    log_filenames.append(filename)
print(f"Generated log files: {log_filenames}")


Generated log files: ['logs\\log_1.txt', 'logs\\log_2.txt', 'logs\\log_3.txt', 'logs\\log_4.txt', 'logs\\log_5.txt', 'logs\\log_6.txt', 'logs\\log_7.txt', 'logs\\log_8.txt', 'logs\\log_9.txt', 'logs\\log_10.txt']


In [15]:
#Write function to scan single log file for keyword
def scan_log_file(file_path, keyword="unauthorized access"):
    """Scans a log file for a specific keyword."""
    with open(file_path, 'r') as f:
        for line in f:
            if keyword in line:
                return True
    return False
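
`scan_log_file` stops at the first match and returns only a boolean. If the report should also say where the keyword occurs, one possible variant collects line numbers instead. This is a sketch of an extension, not part of the original task; `find_keyword_lines` is a hypothetical name.

```python
def find_keyword_lines(file_path, keyword="unauthorized access"):
    """Return the 1-based line numbers of every line containing `keyword`."""
    with open(file_path, "r") as f:
        return [number for number, line in enumerate(f, start=1) if keyword in line]
```

An empty list means the keyword was not found, so `bool(find_keyword_lines(path))` gives the same answer as `scan_log_file(path)`.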

In [None]:
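# Use multiprocess to scan all log files in parallel
import multiprocess as mp  # imported here so the cell does not rely on earlier cells

def scan_logs_in_parallel(log_files, keyword="unauthorized access"):
    """Scans multiple log files in parallel for a specific keyword."""
    with mp.Pool() as pool:
        # starmap unpacks each (file, keyword) tuple into scan_log_file's arguments
        results = pool.starmap(scan_log_file, [(file, keyword) for file in log_files])
    return results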
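
`starmap` is needed above because each call takes two arguments. An alternative, sketched here under the assumption that the keyword is the same for every file, is to bind it once with `functools.partial` and use plain `pool.map`. `scan_logs_with_map` is a hypothetical name, and `scan_log_file` is restated compactly so the block runs on its own.

```python
from functools import partial

try:
    import multiprocess as mp
except ImportError:  # assumption: stdlib fallback for plain scripts
    import multiprocessing as mp


def scan_log_file(file_path, keyword="unauthorized access"):
    """Return True if any line of the file contains `keyword`."""
    with open(file_path, "r") as f:
        return any(keyword in line for line in f)


def scan_logs_with_map(log_files, keyword="unauthorized access", workers=None):
    """Scan files in parallel; results come back in the same order as `log_files`."""
    scan = partial(scan_log_file, keyword=keyword)  # one-argument callable for map
    with mp.Pool(processes=workers) as pool:
        return pool.map(scan, log_files)
```

Both forms return one boolean per file in input order, so either can feed a report that pairs results with file names.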

In [None]:
# Sequential baseline, used for comparison with the parallel scan
def scan_logs_sequential(log_files, keyword="unauthorized access"):
    """Scans multiple log files sequentially for a specific keyword."""
    results = []
    for file in log_files:
        result = scan_log_file(file, keyword)
        results.append(result)
    return results

In [16]:
# Report which files contain the keyword
def report_results(results, log_files):
    """Reports the results of the log scanning."""
    for file, found in zip(log_files, results):
        status = "Found" if found else "Not Found"
        print(f"{file}: {status}")

In [None]:
# Compare parallel vs. sequential scanning
import time  # imported here so the cell does not rely on earlier cells
def compare_scanning_methods(log_files, keyword="unauthorized access"):
    """Compares parallel and sequential log scanning methods."""
    print("\n--- Parallel Log Scanning ---")
    start_time = time.time()
    parallel_results = scan_logs_in_parallel(log_files, keyword)
    parallel_time = time.time() - start_time
    report_results(parallel_results, log_files)
    print(f"Parallel scanning took: {parallel_time:.2f} seconds")

    print("\n--- Sequential Log Scanning ---")
    start_time = time.time()
    sequential_results = scan_logs_sequential(log_files, keyword)
    sequential_time = time.time() - start_time
    report_results(sequential_results, log_files)
    print(f"Sequential scanning took: {sequential_time:.2f} seconds")

    speedup = sequential_time / parallel_time
    print("\n--- Performance Comparison ---")
    print(f"Sequential time: {sequential_time:.2f} seconds")
    print(f"Parallel time: {parallel_time:.2f} seconds")
    print(f"Speedup: {speedup:.2f}x")
    print(f"Performance improvement: {((sequential_time - parallel_time) / sequential_time * 100):.1f}%")
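
Because each file has only 1,000 lines, a single scan is very quick, so the cost of starting worker processes may outweigh the gain and the measured speedup can fall below 1. Below is a hedged sketch of a conclusion helper that handles both outcomes. `summarize_comparison` and its 5% tolerance are assumptions for illustration, not part of the original exercise.

```python
def summarize_comparison(sequential_time, parallel_time, tolerance=0.05):
    """Return (speedup, verdict) for two positive wall-clock timings in seconds."""
    if sequential_time <= 0 or parallel_time <= 0:
        raise ValueError("timings must be positive")
    speedup = sequential_time / parallel_time
    if speedup > 1 + tolerance:
        verdict = f"Parallel scanning was {speedup:.2f}x faster."
    elif speedup < 1 - tolerance:
        verdict = (
            f"Parallel scanning was slower ({speedup:.2f}x); "
            "process startup overhead likely dominates."
        )
    else:
        verdict = "Both approaches took about the same time."
    return speedup, verdict


# Hypothetical timings, not measured results
print(summarize_comparison(10.0, 2.5)[1])
```

To produce the actual report, call `compare_scanning_methods(log_filenames)` once the cells above have run, then pass the two measured times to a helper like this one.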