**Pattern 1** — Key=Value Parsing (Logs)

**Problem:** You receive app logs like:

user=alice time=2025-01-01T12:00:00Z event=click duration=120 status=OK

**Task:** Parse each log line into a Python dict so you can access fields by key.

In [0]:
# user=alice time=2025-01-01T12:00:00Z event=click duration=120 status=OK

def parse_kv(line: str) -> dict:
    parts = line.split()          # split on whitespace
    out = {}
    for p in parts:
        k, v = p.split("=", 1)    # split key and value at first '='
        out[k] = v
    return out


line = "user=alice time=2025-01-01T12:00:00Z event=click duration=120 status=OK"
record = parse_kv(line)
print(record["user"])      # alice
print(record["duration"])  # 120

**Explanation:**

line.split() breaks the line into tokens: ["user=alice", "time=...", ...].

For each token, p.split("=", 1) splits only at the first =, giving key and value. A token with no = yields a single piece, so the two-name unpacking raises ValueError.

Store in a dict so lookups are O(1): out["user"], out["status"], etc.
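
Real log lines sometimes carry a stray token with no =, which makes the unpacking in parse_kv fail. A minimal sketch of a more forgiving variant follows, assuming malformed tokens should simply be skipped. The name parse_kv_lenient is hypothetical and not part of the original code.

```python
def parse_kv_lenient(line: str) -> dict:
    """Parse space-separated key=value tokens, skipping malformed ones."""
    out = {}
    for token in line.split():
        # partition always returns a 3-tuple; sep is "" when "=" is absent
        key, sep, value = token.partition("=")
        if not sep or not key:
            continue  # assumption: silently drop tokens like "garbage" or "=x"
        out[key] = value
    return out


record = parse_kv_lenient("user=alice garbage duration=120 note=a=b")
print(record)  # {'user': 'alice', 'duration': '120', 'note': 'a=b'}
```

Whether to skip, log, or reject a malformed token depends on the pipeline; skipping is only one possible policy.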

**Time & Space Complexity (per line):**

Let L = length of the string, k = number of key=value pairs.

split() scans the whole string once → O(L).

Inner split("=", 1) over each token still adds up to O(L) overall.

Time: O(L)

Space: O(L) for the dict + token list (we’re storing keys and values).

**Pattern 2** — CSV/TSV Parsing

Case A: Simple CSV (no quotes)

**Problem:** You have comma-separated values with no quotes:

alice,2025-01-01,click,120,OK

**Task:** Parse into columns.

In [0]:
line = "alice,2025-01-01,click,120,OK"
cols = line.split(",")
user, date, event, duration, status = cols

**Explanation:**

split(",") is enough when there are no commas inside fields and no quotes.

**Time & Space (per line):**

Time: O(L) to scan for commas.

Space: O(L) for the list of columns.

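Case B: Quoted CSV (fields may contain commas or quotes)

**Problem:** Fields can contain commas/quotes:

"alice","2025-01-01","clicked, scrolled","120","OK"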

**Task:** Parse safely respecting quotes.

In [0]:
import csv

## "alice","2025-01-01","clicked, scrolled","120","OK"

with open("events.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        # row is a list of column values correctly parsed
        print(row)


**Explanation:**

csv.reader handles:

Commas inside quoted fields

Escaped quotes

Different dialects (if configured).
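
To make the difference concrete, here is a small sketch that parses the sample line from memory instead of a file. Using io.StringIO in place of events.csv is an assumption made only so the example is self-contained.

```python
import csv
import io

raw = '"alice","2025-01-01","clicked, scrolled","120","OK"\n'

# Naive splitting breaks the quoted field in two and keeps the quote marks.
naive = raw.strip().split(",")

# csv.reader accepts any iterable of lines, including an in-memory buffer.
parsed = next(csv.reader(io.StringIO(raw)))

print(len(naive))  # 6
print(parsed)      # ['alice', '2025-01-01', 'clicked, scrolled', '120', 'OK']

# A doubled quote inside a quoted field is read as one literal quote.
escaped = next(csv.reader(io.StringIO('a,"say ""hi""",b\n')))
print(escaped)     # ['a', 'say "hi"', 'b']
```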

**Time & Space (per line):**

Time: O(L) per line in typical implementations.

Space: O(L) for the list of parsed fields.

**Pattern 3** — Log Timestamp Parsing

**Problem:** You have log lines with timestamps as strings:

2025-01-01 12:00:00,alice,click

**Task:** Convert the timestamp string into a datetime for comparisons and filtering.

In [0]:
## 2025-01-01 12:00:00,alice,click

from datetime import datetime

def parse_ts(ts_str: str) -> datetime:
    # Format: "YYYY-MM-DD HH:MM:SS"
    return datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S")


line = "2025-01-01 12:00:00,alice,click"
ts_str, user, event = line.split(",", 2)
ts = parse_ts(ts_str)

cutoff = datetime(2025, 1, 1, 0, 0, 0)
if ts >= cutoff:
    print("New year event:", line)


**Explanation:**

strptime converts a fixed-format timestamp into a datetime object.

Once parsed, you can do comparisons, sorting, time arithmetic.
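
The sketch below illustrates sorting and time arithmetic on parsed timestamps. The sample lines other than the first are made up for illustration, and skipping rows that fail to parse is an assumed policy, not something the original specifies.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

lines = [
    "2025-01-01 12:00:00,alice,click",
    "2024-12-31 23:59:59,bob,view",
    "not-a-timestamp,carol,click",      # malformed on purpose
    "2025-01-01 12:05:30,dave,scroll",
]

events = []
for line in lines:
    ts_str, user, event = line.split(",", 2)
    try:
        ts = datetime.strptime(ts_str, FMT)
    except ValueError:
        continue  # assumption: skip rows whose timestamp does not parse
    events.append((ts, user, event))

events.sort()                          # tuples sort by timestamp first
span = events[-1][0] - events[0][0]    # subtracting datetimes gives a timedelta
window_start = events[-1][0] - timedelta(minutes=10)
recent = [user for ts, user, _ in events if ts >= window_start]

print(span)    # 12:05:31
print(recent)  # ['alice', 'dave']
```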

**Time & Space (per timestamp):**

The format is fixed length, so parsing cost is effectively O(1) per timestamp.

Over n log lines, timestamp parsing is O(n) total.

Space: O(1) extra per parse (just one datetime).


**Pattern 4** — Extract Information Using Regex

**Problem:** You have access logs like:

GET /api/user/1234 HTTP/1.1 200

GET /api/user/9876 HTTP/1.1 500

**Task:** Extract the user ID (1234, 9876) from the URL path.

In [0]:
import re

pattern = re.compile(r"/api/user/(\d+)")

def extract_user_id(line: str) -> str | None:
    m = pattern.search(line)
    if m:
        return m.group(1)
    return None


line = "GET /api/user/1234 HTTP/1.1 200"
user_id = extract_user_id(line)
print(user_id)  # "1234"


**Explanation:**

Regex /api/user/(\d+):

/api/user/ literal path.

(\d+) capture one or more digits (user ID).

search scans the line; if it matches, group(1) is the captured ID.
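
When more than one field is needed, named groups keep the extraction readable. This is a sketch that assumes every relevant line has exactly the layout of the two sample lines (method, path, protocol, status). The third input line and the names LINE_RE and errors_by_user are hypothetical.

```python
import re
from collections import Counter

# Assumed layout: "<METHOD> /api/user/<id> HTTP/<x.y> <status>"
LINE_RE = re.compile(
    r"^(?P<method>[A-Z]+) /api/user/(?P<user_id>\d+) HTTP/\d\.\d (?P<status>\d{3})$"
)

lines = [
    "GET /api/user/1234 HTTP/1.1 200",
    "GET /api/user/9876 HTTP/1.1 500",
    "GET /healthz HTTP/1.1 200",   # does not match the pattern, so it is skipped
]

errors_by_user = Counter()
for line in lines:
    m = LINE_RE.match(line)
    if m is None:
        continue
    if m.group("status").startswith("5"):   # count server errors only
        errors_by_user[m.group("user_id")] += 1

print(errors_by_user)  # Counter({'9876': 1})
```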

**Time & Space (per line):**

Let L = length of the line.

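For a simple pattern like this (a literal prefix followed by one greedy digit run, with no nested quantifiers), Python's backtracking engine does little more than one left-to-right scan, so matching is O(L) in practice.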

Time: O(L)

Space: O(1) extra (just a match object).

**Pattern 5** — Clean Dirty Data

**Problem:** You get messy strings from users, such as "   Alice\t Smith   " (leading and trailing spaces, a tab in the middle).

**Task:** Normalize for consistent storage/processing:

Trim whitespace

Replace tabs with spaces

Collapse double spaces

Lowercase

In [0]:
raw = "   Alice\t Smith   "

def clean(s: str) -> str:
    return (
        s.strip()           # remove leading/trailing whitespace
         .replace("\t", " ")
         .replace("  ", " ")  # simple collapse of double spaces (could loop)
         .lower()
    )

raw = "   Alice\t Smith   "
print(clean(raw))  # "alice smith"


**Explanation:**

strip() trims start/end whitespace.

replace("\t", " ") standardizes tabs to single spaces.

replace("  ", " ") (two spaces to one) collapses each pair of spaces once, so a run of three or more spaces still leaves extras; repeat it in a loop or use a regex.

lower() makes comparisons case-insensitive.
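
A single replace pass does not fully collapse long runs of spaces. One common alternative, sketched here under the assumption that every kind of whitespace (spaces, tabs, newlines) should become a single space, is to split on whitespace and re-join. The name clean_all_whitespace is hypothetical.

```python
def clean_all_whitespace(s: str) -> str:
    # split() with no argument splits on any whitespace run and drops
    # leading/trailing whitespace, so join() rebuilds with single spaces.
    return " ".join(s.split()).lower()


# One replace pass turns four spaces into two, not one.
one_pass = "a    b".replace("  ", " ")
print(repr(one_pass))                                   # 'a  b'

print(repr(clean_all_whitespace("a    b")))             # 'a b'
print(repr(clean_all_whitespace("   Alice\t\t   Smith   ")))  # 'alice smith'
```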

**Time & Space:**

Each operation scans the string once → each is O(L).

A constant number of passes → still O(L) overall.

Time: O(L)

Space: O(L) for new cleaned string.

**Pattern 6** — Multi-Delimiter Splitting

**Problem:** You receive data where fields can be separated by commas, tabs, or pipes:

alice,2025-01-01|click\tOK (where \t is a tab character)

**Task:** Split the line into tokens on any of comma (,), tab (\t), or pipe (|).

In [0]:
## "alice,2025-01-01|click\tOK"

import re

def split_multi(line: str) -> list[str]:
    # split on comma OR tab OR pipe, treat 1+ delimiters as one
    return re.split(r"[,\t|]+", line)

line = "alice,2025-01-01|click\tOK"
print(split_multi(line))  # ['alice', '2025-01-01', 'click', 'OK']


**Explanation:**

Regex class [,\t|]+ means:

Any of comma, tab, or pipe.

+ = one or more in a row, so || or ,, is treated as a single split. The side effect is that empty fields between adjacent delimiters are dropped.
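
Whether adjacent delimiters should merge depends on whether empty fields carry meaning. A short sketch comparing the two behaviours, with an input line invented for illustration:

```python
import re

line = "alice,,click|OK"

merged = re.split(r"[,\t|]+", line)  # runs of delimiters count as one
strict = re.split(r"[,\t|]", line)   # every delimiter is a field boundary

print(merged)  # ['alice', 'click', 'OK']
print(strict)  # ['alice', '', 'click', 'OK']

# A leading delimiter still produces an empty first token in both forms.
print(re.split(r"[,\t|]+", ",alice"))  # ['', 'alice']
```

If column positions matter (for example, the second column is an optional date), the strict form keeps the columns aligned.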

**Time & Space (per line):**

Regex scan is O(L) for a simple character class.

Time: O(L)

Space: O(L) for list of tokens.

**Pattern 7** — String Normalization for Dedup

**Problem:** You have multiple variants of the same value, such as "Alice Smith", " alice  smith ", "ALICE SMITH", and "Alice smith!!!".

**Task:** Deduplicate names ignoring case, extra spaces, and punctuation.

In [0]:
names = [
    "Alice Smith",
    " alice  smith ",
    "ALICE SMITH",
    "Alice smith!!!",
]

import re

def normalize(s: str) -> str:
    s = s.lower().strip()
    # remove punctuation (keep letters, digits, spaces)
    s = re.sub(r"[^a-z0-9\s]", "", s)
    # collapse multiple spaces
    s = re.sub(r"\s+", " ", s)
    return s

unique = {}
for name in names:
    key = normalize(name)
    if key not in unique:
        unique[key] = name   # keep original representative

print(list(unique.values()))  # ['Alice Smith']


**Explanation:**

lower() and strip() for baseline normalization.

Remove punctuation via re.sub(r"[^a-z0-9\s]", "", s).

Collapse sequences of whitespace → single space via \s+.
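
The same normalized key can also be used to see which raw variants collapsed together, which helps when auditing a dedup step. A sketch, with the normalize function copied from the cell above and an extended, made-up name list:

```python
import re
from collections import defaultdict


def normalize(s: str) -> str:
    s = s.lower().strip()
    s = re.sub(r"[^a-z0-9\s]", "", s)   # drop punctuation
    return re.sub(r"\s+", " ", s)       # collapse whitespace runs


names = ["Alice Smith", " alice  smith ", "Bob Jones", "ALICE SMITH", "bob jones."]

# Map each normalized key to every raw spelling that produced it.
groups = defaultdict(list)
for name in names:
    groups[normalize(name)].append(name)

duplicates = {key: raw for key, raw in groups.items() if len(raw) > 1}
for key, raw in duplicates.items():
    print(key, "<-", raw)
```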

**Time & Space:**

Let L = max length of a string, n = number of strings.

normalize does a constant number of passes over L → O(L).

Dedup using dict/set is average O(1) lookup per item.

Total Time: O(n · L)

Space: O(n · L) to store normalized keys and representatives.

**Pattern 8** — Path Parsing

**Problem:** You have file paths like /var/log/app/events_2025-01-01.log

**Task:** Extract:

Base filename (events_2025-01-01.log)

Root without extension (events_2025-01-01)

Extension (.log)

In [0]:
path = "/var/log/app/events_2025-01-01.log"

import os

def parse_path(path: str):
    base = os.path.basename(path)       # "events_2025-01-01.log"
    root, ext = os.path.splitext(base)  # ("events_2025-01-01", ".log")
    return base, root, ext


path = "/var/log/app/events_2025-01-01.log"
base, root, ext = parse_path(path)
print(base)  # events_2025-01-01.log
print(root)  # events_2025-01-01
print(ext)   # .log


**Explanation:**

basename strips directory components; leaves only last path segment.

splitext splits on the last . to separate extension.
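
The standard library's pathlib offers the same pieces as attributes. The sketch below uses PurePosixPath, an assumption made so the result does not depend on the operating system running the code. The second path is invented to show how multi-part extensions behave.

```python
from pathlib import PurePosixPath

p = PurePosixPath("/var/log/app/events_2025-01-01.log")
print(p.name)    # events_2025-01-01.log
print(p.stem)    # events_2025-01-01
print(p.suffix)  # .log
print(p.parent)  # /var/log/app

# Only the last extension counts as the suffix; suffixes lists them all.
archive = PurePosixPath("/data/backup.tar.gz")
print(archive.suffix)    # .gz
print(archive.suffixes)  # ['.tar', '.gz']
```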

**Time & Space:**

Let L = length of path string.

Both basename and splitext are simple scans: O(L).

Time: O(L)

Space: O(L) for substrings (base, root, ext).

**Core String Operations**

In [0]:
## Lower/upper/strip
s.lower()
s.upper()
s.strip()

## Splitting
s.split(',')
s.split(maxsplit=1)

## Replacing
s.replace(" ", "_")

## Find/search
s.find("ERROR")
"ERROR" in s

## Starts/ends
s.startswith("2025")
s.endswith(".csv")

## Is-type checks
s.isdigit()
s.isalpha()

In [0]:
## Efficient String Building (Avoid +=)

## Bad
out = ""
for s in parts:
    out += s + ","

## Correct
out = ",".join(parts)

In [0]:
## Fast Membership Tests With Sets
banned = {"spam", "fraud", "ban"}
if word.lower() in banned:
    ...

In [0]:
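## Using partition: splits once at the first "=" (often slightly faster than split("=", 1))
## and returns (s, "", "") instead of raising when "=" is missing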

k, _, v = s.partition("=")
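
A short sketch of what partition returns in the two cases that matter for key=value parsing. The helper name split_kv is hypothetical.

```python
def split_kv(token: str):
    """Return (key, value), or None when the token has no '='."""
    key, sep, value = token.partition("=")
    return (key, value) if sep else None


print("a=b=c".partition("="))    # ('a', '=', 'b=c')
print("novalue".partition("="))  # ('novalue', '', '')
print(split_kv("status=OK"))     # ('status', 'OK')
print(split_kv("novalue"))       # None
```

Checking the middle element is what distinguishes "key with empty value" (sep is "=") from "no separator at all" (sep is "").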

In [0]:
## Safe Substring Checking
if "ERROR" in log_line:
    ...

In [0]:
## Regex for Named Groups
import re
m = re.search(r"user=(?P<user>\w+)", log)
if m:  # search returns None when nothing matches
    user = m.group("user")

In [0]:
## Tokenization With defaultdict

from collections import defaultdict
freq = defaultdict(int)

for word in text.split():
    freq[word.lower()] += 1

In [0]:
## Problem 1 — Parse key=value log lines

def parse_line(line):
    out = {}
    for p in line.split():
        k, v = p.split("=", 1)
        out[k] = v
    return out

In [0]:
## Problem 2 — Extract user ID from request path

path = "/api/user/12345/details"

import re
m = re.search(r"/user/(\d+)", path)
uid = int(m.group(1))

In [0]:
## Problem 3 — Normalize names

def norm(s):
    return " ".join(s.lower().strip().split())

In [0]:
## Problem 4 — Clean a CSV line with varying whitespace

cols = [c.strip() for c in line.split(",")]

In [0]:
## Problem 5 — Extract substring between brackets

import re
msg = "[ERROR] something failed"
level = re.search(r"\[(.*?)\]", msg).group(1)  # "ERROR"

In [0]:
## Problem 7 — Tokenize text and count words

from collections import Counter
Counter(text.lower().split())

In [0]:
## Problem 8 — Mask PII

import re
masked = re.sub(r"\b\d{16}\b", "****", line)  # credit cards

In [0]:
## Problem 9 — Validate date string

from datetime import datetime

try:
    datetime.strptime(s, "%Y-%m-%d")
except ValueError:
    ...

In [0]:
## Problem 10 — Extract domain from email

domain = email.split("@")[1]
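
Indexing with [1] raises IndexError when the string has no @. A slightly more defensive sketch follows, assuming that input without a usable local part or domain should yield None. The name email_domain is hypothetical, and this is not full email validation.

```python
from typing import Optional


def email_domain(email: str) -> Optional[str]:
    # rpartition splits at the LAST "@", so "a@b@example.com" -> "example.com"
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return None  # assumption: treat malformed input as "no domain"
    return domain.lower()


print(email_domain("Alice@Example.COM"))  # example.com
print(email_domain("no-at-sign"))         # None
```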

**Summary**

Strings are immutable → use join for building

split, strip, replace, partition, regex = must-master

Normalize → lowercase, strip, remove punctuation

Parsing patterns:

key=value

delimited fields

regex extractions

timestamps

cleaning messy data

DE-specific tasks:

log parsing

CSV cleaning

drift detection through string categories

file/path parsing

extracting IDs