### The Problem: Log File Correlator

You are a backend engineer for a large web service. The service generates two distinct log files:

1.  **`requests.log`**: Records when a request is received.
      * Format: `TIMESTAMP_ISO8601,REQUEST_ID,USER_ID`
      * Example: `2025-09-07T12:15:01.123Z,req-abc,user-123`
2.  **`responses.log`**: Records when a response is sent.
      * Format: `TIMESTAMP_ISO8601,REQUEST_ID,STATUS_CODE`
      * Example: `2025-09-07T12:15:01.345Z,req-abc,200`

Your task is to write a Python script that **correlates** these two logs and produces a single, combined JSON output. For each request, the output should include the `request_id`, `user_id`, `status_code`, and a calculated `duration_ms`.
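
As a small illustration of the `duration_ms` calculation, the sketch below parses the two example lines above and subtracts their timestamps. The helper name `parse_timestamp` is made up for this example. The trailing `Z` is rewritten to `+00:00` because `datetime.fromisoformat` only accepts a bare `Z` from Python 3.11 onward.

```python
from datetime import datetime, timedelta

def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-09-07T12:15:01.123Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

request_line = "2025-09-07T12:15:01.123Z,req-abc,user-123"
response_line = "2025-09-07T12:15:01.345Z,req-abc,200"

start_str, request_id, user_id = request_line.split(",")
end_str, _, status_code = response_line.split(",")

# Integer division by one millisecond keeps the arithmetic exact
duration_ms = (parse_timestamp(end_str) - parse_timestamp(start_str)) // timedelta(milliseconds=1)

record = {
    "request_id": request_id,
    "user_id": user_id,
    "status_code": int(status_code),
    "duration_ms": duration_ms,
}
print(record)
```

Dividing the `timedelta` by `timedelta(milliseconds=1)` yields an integer directly, which sidesteps rounding questions that arise when converting through floating-point seconds.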

#### The Challenges (What makes it "hard")

  * **Scalability**: The log files are **too large to fit into memory**.
  * **Unordered Entries**: The logs are not guaranteed to be in chronological order. A response might be logged before its corresponding request.
  * **Orphaned Entries**: A `request_id` might appear in one file but not the other due to network errors or crashes.
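
To try the unordered and orphaned cases at a small scale, it can help to generate synthetic input. The helper below is a hypothetical sketch, not part of the original service: the function name, ID patterns, and value ranges are all assumptions chosen for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

def make_sample_logs(n_pairs=5, n_orphans=2, seed=0):
    """Return (request_lines, response_lines) with shuffled order and orphans."""
    rng = random.Random(seed)
    base = datetime(2025, 9, 7, 12, 15, 0, tzinfo=timezone.utc)

    def fmt(t):
        # Render as e.g. 2025-09-07T12:15:01.123Z
        return t.isoformat(timespec="milliseconds").replace("+00:00", "Z")

    requests, responses = [], []
    for i in range(n_pairs):
        start = base + timedelta(milliseconds=rng.randrange(60_000))
        end = start + timedelta(milliseconds=rng.randrange(1, 2_000))
        req_id = f"req-{i:04d}"
        requests.append(f"{fmt(start)},{req_id},user-{rng.randrange(1000)}")
        responses.append(f"{fmt(end)},{req_id},{rng.choice([200, 404, 503])}")

    for i in range(n_orphans):
        t = base + timedelta(milliseconds=rng.randrange(60_000))
        # Each iteration adds one request without a response and vice versa
        requests.append(f"{fmt(t)},req-orphan-req-{i},user-{rng.randrange(1000)}")
        responses.append(f"{fmt(t)},req-orphan-resp-{i},500")

    # Shuffle each file independently so neither is chronological
    rng.shuffle(requests)
    rng.shuffle(responses)
    return requests, responses

sample_requests, sample_responses = make_sample_logs()
```

Writing each list to a file with one entry per line gives a pair of small logs that any of the solutions below can be run against.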

-----

### Level 1: Junior Engineer Solution ("It Works")

This solution correctly solves the problem for small files but ignores the scalability and robustness constraints. It's a good starting point that demonstrates basic Python skills.

**Characteristics:**

  * Reads entire files into memory.
  * Uses multiple loops and basic dictionaries.
  * Minimal error handling.
  * Contained within a single script or function.

```python
import json
from datetime import datetime

def junior_correlator(requests_file, responses_file, output_file):
    requests_data = {}
    # 1. Read all requests into a dictionary
    with open(requests_file, 'r') as f:
        for line in f:
            timestamp_str, request_id, user_id = line.strip().split(',')
            requests_data[request_id] = {
                'start_time': datetime.fromisoformat(timestamp_str.replace('Z', '+00:00')),
                'user_id': user_id
            }

    responses_data = {}
    # 2. Read all responses into another dictionary
    with open(responses_file, 'r') as f:
        for line in f:
            timestamp_str, request_id, status_code = line.strip().split(',')
            responses_data[request_id] = {
                'end_time': datetime.fromisoformat(timestamp_str.replace('Z', '+00:00')),
                'status_code': int(status_code)
            }

    results = []
    # 3. Loop through requests to find matching responses
    for req_id, req_info in requests_data.items():
        if req_id in responses_data:
            resp_info = responses_data[req_id]
            duration = (resp_info['end_time'] - req_info['start_time']).total_seconds() * 1000

            results.append({
                'request_id': req_id,
                'user_id': req_info['user_id'],
                'status_code': resp_info['status_code'],
                # round() rather than int(): float error can leave e.g. 1000.9999...,
                # which int() would truncate to 1000
                'duration_ms': round(duration)
            })

    # 4. Write all results at once
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

# Example Usage (assuming you create these files):
# junior_correlator('requests.log', 'responses.log', 'output_junior.json')
```

-----

### Level 2: Mid-Level Engineer Solution ("It's Well-Crafted")

This solution addresses some of the junior version's shortcomings. It uses better data structures and practices, showing an understanding of code organization and efficiency, though it still has memory limitations.

**Characteristics:**

  * Uses a single pass over the second file to enrich data from the first.
  * Handles potential errors during parsing (e.g., a malformed line).
  * Code is broken down into logical functions.
  * Uses type hints and generators for better memory management of the *output*.

```python
import json
from datetime import datetime
from typing import Dict, Any, Iterator, Optional

def parse_log_line(line: str) -> Optional[Dict[str, Any]]:
    """Parses a single log line; returns None if the line is malformed."""
    try:
        parts = line.strip().split(',')
        if len(parts) != 3:
            # Both log formats have exactly three comma-separated fields
            return None
        timestamp = datetime.fromisoformat(parts[0].replace('Z', '+00:00'))
        return {'timestamp': timestamp, 'parts': parts[1:]}
    except (ValueError, IndexError):
        # Log this error in a real system
        return None

def mid_level_correlator(requests_file: str, responses_file: str) -> Iterator[Dict[str, Any]]:
    """Correlates logs and yields results one by one."""
    pending_requests = {}
    
    # Pass 1: Process all requests first
    with open(requests_file, 'r') as f:
        for line in f:
            parsed = parse_log_line(line)
            if parsed:
                req_id, user_id = parsed['parts']
                pending_requests[req_id] = {
                    'start_time': parsed['timestamp'],
                    'user_id': user_id
                }

    # Pass 2: Process responses and find matches
    with open(responses_file, 'r') as f:
        for line in f:
            parsed = parse_log_line(line)
            if not parsed:
                continue
            
            resp_id, status_code = parsed['parts']
            if resp_id in pending_requests:
                req_info = pending_requests[resp_id]
                duration = (parsed['timestamp'] - req_info['start_time']).total_seconds() * 1000

                yield {
                    'request_id': resp_id,
                    'user_id': req_info['user_id'],
                    'status_code': int(status_code),
                    # round() avoids truncating float results such as 1000.9999...
                    'duration_ms': round(duration)
                }
                # Remove to handle potential duplicate request_ids
                del pending_requests[resp_id]

def main():
    """Main function to run the correlation and write to a file."""
    with open('output_mid.json', 'w') as f:
        # Write the JSON array one record at a time so the generator
        # is consumed lazily instead of being collected into a list
        f.write('[\n')
        for i, record in enumerate(mid_level_correlator('requests.log', 'responses.log')):
            if i:
                f.write(',\n')
            f.write('  ' + json.dumps(record))
        f.write('\n]\n')

# if __name__ == "__main__":
#     main()
```
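
When the output itself may be large, a line-delimited format lets each record be written as soon as it is produced. The sketch below assumes downstream consumers accept JSON Lines (one JSON object per line) rather than a single JSON array, which is a departure from the task statement; `write_jsonl` is a hypothetical helper name.

```python
import io
import json
from typing import Any, Dict, Iterable, TextIO

def write_jsonl(records: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        count += 1
    return count

# Usage with a generator, so only one record is held at a time
buffer = io.StringIO()
written = write_jsonl(
    ({"request_id": f"req-{i}", "duration_ms": i} for i in range(3)),
    buffer,
)
print(written)
print(buffer.getvalue(), end="")
```

Because every line is a complete JSON document, a reader can also process the output incrementally without parsing the whole file.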

-----

### Level 3: Senior Engineer Solution ("It's Production-Ready")

This solution is designed for scalability and robustness. It correctly handles the "too large for memory" constraint by processing the files in a streaming fashion.

**Characteristics:**

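  * **Streaming approach**: Reads both files line by line and keeps only unmatched (*in-flight*) entries in memory. Memory stays small as long as matching entries sit at similar positions in the two files; orphaned entries remain pending until the end.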
  * **Object-Oriented Design**: Encapsulates logic in a class, making it maintainable and testable.
  * **Robust Error Handling**: Uses Python's `logging` module to report issues like orphaned entries.
  * **Configurability**: Uses `argparse` to accept file paths from the command line, making it a flexible tool.

```python
import json
import logging
import argparse
from datetime import datetime
from itertools import zip_longest
from typing import Dict, Any, Iterator

# Set up basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

class LogCorrelator:
    """
    Correlates request and response logs in a memory-efficient, streaming manner.
    """
    def __init__(self):
        # Unmatched entries from either file, keyed by request ID
        self.pending_requests: Dict[str, Dict[str, Any]] = {}

    def _complete(self, req_id: str, info: Dict[str, Any]) -> Dict[str, Any]:
        """Builds the output record once both halves of a pair are known."""
        duration = (info['end_time'] - info['start_time']).total_seconds() * 1000
        return {
            'request_id': req_id,
            'user_id': info['user_id'],
            'status_code': info['status_code'],
            'duration_ms': round(duration)  # round() avoids float truncation
        }

    def _process_request_line(self, line: str) -> Iterator[Dict[str, Any]]:
        try:
            timestamp_str, req_id, user_id = line.strip().split(',')
            info = {
                'start_time': datetime.fromisoformat(timestamp_str.replace('Z', '+00:00')),
                'user_id': user_id
            }
        except ValueError:
            logging.warning(f"Skipping malformed request line: {line.strip()}")
            return
        # If the response arrived first, complete and yield the record
        if 'end_time' in self.pending_requests.get(req_id, {}):
            yield self._complete(req_id, {**self.pending_requests.pop(req_id), **info})
        else:  # Otherwise, store the request as pending
            self.pending_requests[req_id] = info

    def _process_response_line(self, line: str) -> Iterator[Dict[str, Any]]:
        try:
            timestamp_str, resp_id, status_code = line.strip().split(',')
            info = {
                'end_time': datetime.fromisoformat(timestamp_str.replace('Z', '+00:00')),
                'status_code': int(status_code)
            }
        except ValueError:
            logging.warning(f"Skipping malformed response line: {line.strip()}")
            return
        # If the request arrived first, complete and yield the record
        if 'start_time' in self.pending_requests.get(resp_id, {}):
            # Pop to free memory
            yield self._complete(resp_id, {**self.pending_requests.pop(resp_id), **info})
        else:  # Otherwise, store the response as pending
            self.pending_requests[resp_id] = info

    def correlate(self, requests_file: str, responses_file: str) -> Iterator[Dict[str, Any]]:
        """
        Reads both log files in lockstep and yields correlated results.
        zip_longest keeps reading the longer file after the shorter one ends,
        so the files do not need to have the same number of lines.
        """
        logging.info("Starting log correlation.")
        with open(requests_file, 'r') as req_f, open(responses_file, 'r') as resp_f:
            for req_line, resp_line in zip_longest(req_f, resp_f):
                if req_line is not None:
                    yield from self._process_request_line(req_line)
                if resp_line is not None:
                    yield from self._process_response_line(resp_line)

        # After files are processed, log any remaining orphaned entries
        orphaned_count = len(self.pending_requests)
        if orphaned_count > 0:
            logging.warning(f"Found {orphaned_count} orphaned log entries.")
        logging.info("Correlation complete.")

def main():
    parser = argparse.ArgumentParser(description="Correlate web service log files.")
    parser.add_argument("requests_file", help="Path to the requests log file.")
    parser.add_argument("responses_file", help="Path to the responses log file.")
    parser.add_argument("output_file", help="Path for the JSON output file.")
    args = parser.parse_args()

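    correlator = LogCorrelator()
    count = 0
    with open(args.output_file, 'w') as f:
        # Stream the JSON array to disk instead of collecting results in a list
        f.write('[\n')
        for record in correlator.correlate(args.requests_file, args.responses_file):
            f.write((',\n' if count else '') + '  ' + json.dumps(record))
            count += 1
        f.write('\n]\n')
    logging.info(f"Successfully wrote {count} records to {args.output_file}")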

# To run from the command line:
# python your_script_name.py requests.log responses.log output_senior.json
# if __name__ == "__main__":
#     main()
```
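
When even the set of unmatched entries might outgrow memory, one option is to push the matching onto disk. The sketch below is one such approach, assuming SQLite (through the standard-library `sqlite3` module) is acceptable as a scratch store and that every input line is well formed. The function name `correlate_on_disk` and the table names are invented for this example.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Any, Dict, Iterable, Iterator

def correlate_on_disk(request_lines: Iterable[str],
                      response_lines: Iterable[str],
                      db_path: str = ":memory:") -> Iterator[Dict[str, Any]]:
    """Load both logs into SQLite tables and join them on request_id.

    Pass a file path as db_path to keep the tables on disk; ':memory:' is
    only the default so the example runs without leaving files behind.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE req (ts TEXT, request_id TEXT, user_id TEXT)")
    conn.execute("CREATE TABLE resp (ts TEXT, request_id TEXT, status_code INTEGER)")
    # executemany consumes the generators lazily, one row at a time
    conn.executemany("INSERT INTO req VALUES (?, ?, ?)",
                     (line.strip().split(",") for line in request_lines))
    conn.executemany("INSERT INTO resp VALUES (?, ?, ?)",
                     (line.strip().split(",") for line in response_lines))
    conn.execute("CREATE INDEX resp_id ON resp (request_id)")

    query = """
        SELECT req.request_id, req.user_id, resp.status_code, req.ts, resp.ts
        FROM req JOIN resp ON req.request_id = resp.request_id
    """
    for request_id, user_id, status_code, start, end in conn.execute(query):
        delta = (datetime.fromisoformat(end.replace("Z", "+00:00"))
                 - datetime.fromisoformat(start.replace("Z", "+00:00")))
        yield {
            "request_id": request_id,
            "user_id": user_id,
            "status_code": status_code,
            "duration_ms": delta // timedelta(milliseconds=1),
        }
    conn.close()
```

The inner join drops orphans on both sides and does not depend on the order of lines in either file. The trade-off is extra disk I/O and a second copy of the data in the scratch database.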

In [13]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("LogLakeZero") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# Create the SparkSession
spark = configure_spark_with_delta_pip(builder).getOrCreate()

print("Spark and Delta Lake are ready.")

:: loading settings :: url = jar:file:/Users/jesses_fables/Desktop/.venv/lib/python3.12/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/jesses_fables/.ivy2.5.2/cache
The jars for the packages stored in: /Users/jesses_fables/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e6984e08-91a7-429c-b435-b24be0c1202b;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.0 in central
	found io.delta#delta-storage;4.0.0 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: resolve 280ms :: artifacts dl 14ms
	:: modules in use:
	io.delta#delta-spark_2.13;4.0.0 from central in [default]
	io.delta#delta-storage;4.0.0 from central in [default]
	org.antlr#antlr4-runtime;4.13.1 from central in [default]

Spark and Delta Lake are ready.


In [30]:
import pandas as pd

# Load both logs with pandas (suitable only for files that fit in memory)
requests_df = pd.read_csv('requests.log', names=['TIMESTAMP_ISO8601', 'REQUEST_ID', 'USER_ID'])
responses_df = pd.read_csv('responses.log', names=['TIMESTAMP_ISO8601', 'REQUEST_ID', 'STATUS_CODE'])

# Create PySpark DataFrames from the pandas DataFrames
sparkDF_requests_df = spark.createDataFrame(requests_df)
# sparkDF_requests_df.printSchema()
# sparkDF_requests_df.show()

sparkDF_responses_df = spark.createDataFrame(responses_df)

In [31]:
sparkDF_requests_df.write.format("delta").mode("overwrite").save('files/requests_log')

25/09/07 13:10:10 WARN TaskSetManager: Stage 5 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
25/09/07 13:10:17 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

In [32]:
sparkDF_responses_df.write.format("delta").mode("overwrite").save('files/responses_log')

25/09/07 13:10:44 WARN TaskSetManager: Stage 9 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [52]:
sparkDF_requests_df.groupBy("REQUEST_ID") \
    .count() \
    .orderBy("count", ascending=False) \
    .head(5)

25/09/07 13:20:24 WARN TaskSetManager: Stage 46 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

[Row(REQUEST_ID='req-f0588781-cfd6-4abe-b2a2-adf65699a24e', count=1),
 Row(REQUEST_ID='req-86f954e5-cf6e-4921-8434-849c90f4ddc8', count=1),
 Row(REQUEST_ID='req-4b69dc47-c2fe-47b5-af20-ddc21de6c766', count=1),
 Row(REQUEST_ID='req-orphan-abd8ea83-4180-4715-b3f8-10b107da6618', count=1),
 Row(REQUEST_ID='req-3305f84d-a721-4f0e-abdb-fe6595c3fde0', count=1)]

In [112]:
sparkDF_responses_df.createOrReplaceTempView("responses")
sparkDF_requests_df.createOrReplaceTempView("requests")

orphan_requests = spark.sql("SELECT count(*) as result from responses where REQUEST_ID LIKE '%orphan%'")
orphan_requests_count=orphan_requests.collect()[0][0]
print(f"Total orphans (using collect): {orphan_requests_count:,}")


25/09/07 13:36:39 WARN TaskSetManager: Stage 95 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.


Total orphans (using collect): 5,000
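
The `LIKE '%orphan%'` filter relies on how the IDs in this data set happen to be named. Where IDs carry no such marker, orphans can be found by comparing the ID sets of the two logs. The following is a plain-Python sketch of that idea; `find_orphans` is a made-up name, and it assumes the ID sets (not the full logs) fit in memory.

```python
from typing import Iterable, Set, Tuple

def find_orphans(request_lines: Iterable[str],
                 response_lines: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (requests without a response, responses without a request)."""
    # The request ID is the second comma-separated field in both formats
    request_ids = {line.split(",")[1] for line in request_lines if line.strip()}
    response_ids = {line.split(",")[1] for line in response_lines if line.strip()}
    return request_ids - response_ids, response_ids - request_ids

unanswered, unrequested = find_orphans(
    ["2025-09-07T12:15:01.123Z,req-abc,user-123",
     "2025-09-07T12:15:02.000Z,req-lonely,user-9"],
    ["2025-09-07T12:15:01.345Z,req-abc,200",
     "2025-09-07T12:15:03.000Z,req-ghost,500"],
)
print(unanswered, unrequested)
```

Set difference in each direction separates requests that never received a response from responses with no recorded request.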


                                                                                

In [118]:
joined_table = spark.sql("""
SELECT a.TIMESTAMP_ISO8601 as start_time, b.TIMESTAMP_ISO8601 as end_time, a.REQUEST_ID, a.USER_ID, b.STATUS_CODE
FROM requests as a
LEFT JOIN responses as b on a.REQUEST_ID = b.REQUEST_ID
""")
joined_table.createOrReplaceTempView("joined_view")

In [139]:
hmm = spark.sql('SELECT * FROM joined_view UNION SELECT * FROM joined_view')

In [126]:
hmm.createOrReplaceTempView("hmm")

In [128]:
hmm3 = spark.sql("SELECT * FROM hmm where request_id like '%orphan%'")

In [132]:
hmm3.count()

25/09/07 13:48:05 WARN TaskSetManager: Stage 125 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
25/09/07 13:48:06 WARN TaskSetManager: Stage 126 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

5000

In [138]:
hmm.count()

25/09/07 13:49:26 WARN TaskSetManager: Stage 152 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
25/09/07 13:49:26 WARN TaskSetManager: Stage 153 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

210000

In [136]:
joined_table.count()

25/09/07 13:49:02 WARN TaskSetManager: Stage 143 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.
25/09/07 13:49:02 WARN TaskSetManager: Stage 144 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

105000

In [161]:
result_log = spark.sql("""
SELECT
request_id,
user_id,
status_code,
TIMESTAMPDIFF(MILLISECOND, start_time, end_time) as duration_ms
from joined_view
WHERE status_code IS NOT NULL
""")
result_log.createOrReplaceTempView("result_log_success")
result_log.show()

25/09/07 14:04:30 WARN TaskSetManager: Stage 253 contains a task of very large size (2110 KiB). The maximum recommended task size is 1000 KiB.
25/09/07 14:04:30 WARN TaskSetManager: Stage 254 contains a task of very large size (1941 KiB). The maximum recommended task size is 1000 KiB.

+--------------------+--------+-----------+-----------+
|          request_id| user_id|status_code|duration_ms|
+--------------------+--------+-----------+-----------+
|req-000272c0-2d42...| user-94|        200|       1357|
|req-00030e39-a5a5...| user-59|        200|       1857|
|req-0006f648-8d39...|user-209|        200|        629|
|req-0007b1d0-3373...| user-68|        503|       1977|
|req-0009006d-36ff...| user-73|        200|        296|
|req-000a83e4-c15a...|user-261|        200|       1517|
|req-000b3223-af78...| user-45|        200|        741|
|req-000d9994-48d8...|user-485|        404|       1323|
|req-000efa6e-45f2...|user-641|        404|        144|
|req-0012220b-15e4...|user-975|        200|       1301|
|req-0012f62c-9711...|user-110|        200|        367|
|req-00149896-4ada...|user-804|        200|        712|
|req-00159c3e-f544...| user-39|        200|        730|
|req-0016ff15-7075...|user-244|        200|        822|
|req-0017095d-e64f...| user-77|        200|     

                                                                                