# From Python to Production
## Notebook 6: Strings & Text Processing

By **Prerna Joshi** | #25DaysOfDataTech

"Almost every dataset is messy text: master strings, and you master real-world data cleaning."

---

### What you'll learn
- String basics & immutability, literal forms, escapes, raw strings
- Joining, splitting, slicing, searching, replacing
- Case handling (`lower`, `title`, `casefold`) and character tests (`isalpha`...)
- Unicode & normalization (NFC/NFKD), accent stripping
- Bytes vs `str`, encodings, safe file I/O
- Regex essentials (`re`): `search`, `findall`, groups, `sub`, flags
- Production patterns for text cleaning & extraction
- Performance tips for large text


> **Why this matters for data work**  
> A large share of real-world data wrangling is text cleanup. Knowing the right string tools saves hours and prevents subtle bugs (especially with encodings and Unicode).


## 1. Strings: Immutability

Python strings are immutable; operations create **new** strings. Prefer `''.join(...)` over repeated `+` in loops.


In [1]:
name = "  prerna joshi  "
clean = name.strip().title()
(name, clean, id(name) == id(clean))


('  prerna joshi  ', 'Prerna Joshi', False)
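A minimal sketch of what immutability means in practice (the variable names are illustrative, not from the notebook): item assignment raises `TypeError`, a "change" always builds a new string, and many pieces are best collected and joined once.

```python
word = "data"

# Strings do not support item assignment; this raises TypeError.
try:
    word[0] = "D"
    mutated = True
except TypeError:
    mutated = False

# "Changing" a string really means building a new one.
capitalized = "D" + word[1:]

# Building from many pieces: collect them, then join once.
pieces = [str(n) for n in range(5)]
csv_line = ",".join(pieces)

print(mutated, capitalized, csv_line)
```

The original `word` is untouched after all of this, which is why methods such as `strip()` and `title()` always return a new value.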

## 2. Literals, Escapes, Raw Strings

- Single `'...'`, double `"..."`, or triple quotes (`'''...'''` and `"""..."""`)
- Escape with `\n`, `\t`, `\\`, `\"`
- **Raw strings**: prefix with `r` to turn off escape processing (great for regex patterns and Windows paths)


In [2]:
a = "Line1\nLine2"
b = r"C:\Users\pj\projects\*\data"
c = "Multi-line\nstring with \"quotes\" and 'quotes'"

# Debug view (repr): Jupyter echoes only the LAST expression in a cell,
# so this tuple is evaluated but not displayed here (see the explicit repr() calls below)
(a, b, c.splitlines())

print("\n--- Human-readable (print) ---")
print("a:")
print(a)                 # newline renders as a new line
print("\nb:")
print(b)                 # backslashes are shown once
print("\nc, splitlines():")
for line in c.splitlines():
    print(line)

print("\n--- For comparison, explicit repr() ---")
print("repr(a):", repr(a))
print("repr(b):", repr(b))
print("repr(c.splitlines()):", repr(c.splitlines()))




--- Human-readable (print) ---
a:
Line1
Line2

b:
C:\Users\pj\projects\*\data

c, splitlines():
Multi-line
string with "quotes" and 'quotes'

--- For comparison, explicit repr() ---
repr(a): 'Line1\nLine2'
repr(b): 'C:\\Users\\pj\\projects\\*\\data'
repr(c.splitlines()): ['Multi-line', 'string with "quotes" and \'quotes\'']


## 3. Joining & f-Strings (Formatting)

- Use `separator.join(iterable)` for efficient concatenation
- Prefer **f-strings** for readable formatting


In [3]:
parts = ["Data", "AI", "Engineering"]
joined = " · ".join(parts)
role = "Data Engineer"; tasks = 3
msg = f"{role}: {tasks} key tasks today"
joined, msg


('Data · AI · Engineering', 'Data Engineer: 3 key tasks today')

## 4. Splitting / Partitioning / Stripping

- `split()` vs `rsplit()`; limit pieces with `maxsplit`
- `partition(sep)` returns `(head, sep, tail)` without losing separator
- `strip()/lstrip()/rstrip()` remove whitespace or given chars


In [4]:
s = "id=101;name=Prerna;role=Data Engineer"
left, sep, right = s.partition(";")
tokens = s.split(";")
user = "___pj___".strip("_")
(left, sep, right), tokens, user


(('id=101', ';', 'name=Prerna;role=Data Engineer'),
 ['id=101', 'name=Prerna', 'role=Data Engineer'],
 'pj')
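The cell above covers `partition`, `split`, and `strip`. As a complementary sketch (the sample strings are made up for illustration), this shows `maxsplit`, `rsplit`, the behaviour of `partition` when the separator is missing, and that `strip` takes a set of characters, not a prefix or suffix.

```python
path = "logs/2025/app.server.log"

# split from the left, at most one cut
first, rest = path.split("/", 1)

# rsplit cuts from the right: handy for file extensions
stem, ext = path.rsplit(".", 1)

# partition never raises when the separator is missing
key, sep, value = "no_equals_here".partition("=")

# strip() removes any of the given characters from both ends
trimmed = "xxhelloxyx".strip("xy")

print(first, rest, stem, ext, (key, sep, value), trimmed)
```

Because `partition` always returns three items, checking whether `sep` is empty is a simple way to detect malformed key/value pairs.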

## 5. Slicing & Indexing

`seq[start:stop:step]` supports negative indices and steps.


In [5]:
text = "PYTHON"
text[1:4], text[-3:], text[::-1]


('YTH', 'HON', 'NOHTYP')

## 6. Searching & Replacing

Use `in`, `find`, `index`, `count`, `replace`, `removeprefix`, `removesuffix`, and `translate` for character-level transforms.


In [6]:
title = "  Data—AI—Engineering  ".strip()
has_ai = "AI" in title
first = title.find("AI")
replaced = title.replace("—", " | ")
has_ai, first, replaced


(True, 5, 'Data | AI | Engineering')
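The cell above shows `in`, `find`, and `replace`. Here is a short sketch of the remaining methods from the list, using a made-up filename. Note that `removeprefix` and `removesuffix` require Python 3.9 or newer.

```python
filename = "raw_sales_2025.csv"

# find() returns -1 when the substring is missing; index() raises ValueError
missing = filename.find("xlsx")
try:
    filename.index("xlsx")
    raised = False
except ValueError:
    raised = True

underscores = filename.count("_")

# removeprefix/removesuffix drop an exact prefix or suffix only (Python 3.9+)
base = filename.removeprefix("raw_").removesuffix(".csv")

# translate applies a per-character mapping in one pass
spaced = base.translate(str.maketrans("_", " "))
# the third argument to maketrans lists characters to delete
no_digits = base.translate(str.maketrans("", "", "0123456789"))

print(missing, raised, underscores, base, spaced, no_digits)
```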

## 7. Case Handling & Character Tests

- `lower/upper/title/capitalize/swapcase`  
- `casefold` for aggressive, Unicode-aware case normalization (useful for search)
- `isalpha/isdigit/isalnum/isspace/isdecimal` etc.


In [7]:
s = "Straße"   # German sharp S
(s.lower(), s.casefold()), s.isalpha(), "123".isdigit()


(('straße', 'strasse'), True, True)
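To illustrate why `casefold` matters for search, and why the `is...` tests are not interchangeable, here is a small sketch. The helper `same_text` is a hypothetical name introduced for this example.

```python
def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: caseless comparison via casefold()."""
    return a.casefold() == b.casefold()

lower_equal = "STRASSE".lower() == "straße".lower()   # 'strasse' vs 'straße'
fold_equal = same_text("STRASSE", "straße")           # both fold to 'strasse'

# isdigit() is broader than isdecimal(): superscript two counts as a digit,
# but int() only accepts decimal characters.
sup_two = "\u00b2"
flags = (sup_two.isdigit(), sup_two.isdecimal())
try:
    int(sup_two)
    parsed = True
except ValueError:
    parsed = False

print(lower_equal, fold_equal, flags, parsed)
```

When validating input that will be passed to `int()`, `isdecimal()` is therefore the safer check of the two.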

## 8. Unicode & Normalization (NFC/NFKD)

Visually identical strings can have different codepoints (composed vs decomposed). Normalize before comparisons or hashing.


In [8]:
import unicodedata

s1 = "café"                          # composed 'é' (U+00E9)
s2 = "cafe\u0301"                    # 'e' + combining acute accent (U+0301)
eq_raw = (s1 == s2)
n1 = unicodedata.normalize("NFC", s1)
n2 = unicodedata.normalize("NFC", s2)
eq_norm = (n1 == n2)
eq_raw, eq_norm


(False, True)
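One way to apply "normalize before comparisons or hashing" is a single key function used wherever text is compared or deduplicated. The sketch below assumes NFC plus `casefold` is sufficient for the data at hand; `text_key` is a hypothetical helper name, and stricter caseless matching schemes exist.

```python
import unicodedata

def text_key(s: str) -> str:
    """Hypothetical helper: canonical form for comparing or hashing text."""
    return unicodedata.normalize("NFC", s).casefold()

# Escapes make the composed vs. decomposed forms explicit.
names = ["Caf\u00e9", "cafe\u0301", "CAF\u00c9", "Cafe"]
unique = {text_key(n) for n in names}

print(len(names), len(unique))
```

The first three spellings collapse to one key, while the unaccented `"Cafe"` stays distinct because normalization does not remove accents.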

## 9. Accent Stripping (when appropriate)

Normalize to `NFKD` and drop non-spacing marks.


In [9]:
import unicodedata

def strip_accents(s: str) -> str:
    nfkd = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in nfkd if unicodedata.category(ch) != "Mn")

strip_accents("Curaçao, São Paulo, München")


'Curacao, Sao Paulo, Munchen'
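The "when appropriate" caveat is worth seeing in code. In this sketch, `strip_accents` is repeated from the cell above so the block runs on its own. It shows that `NFKD` also rewrites compatibility characters such as ligatures and superscripts, and that letters without a decomposition (for example `ø`) pass through unchanged.

```python
import unicodedata

def strip_accents(s: str) -> str:
    # Same helper as in the cell above, repeated so this block runs on its own.
    nfkd = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in nfkd if unicodedata.category(ch) != "Mn")

ligature = strip_accents("\ufb01le")       # the 'fi' ligature is expanded to two letters
superscript = strip_accents("m\u00b2")     # superscript two is flattened to '2'
stroke = strip_accents("S\u00f8ren")       # 'o with stroke' has no decomposition, so it stays

print(ligature, superscript, stroke)
```

If only accents should be removed and compatibility characters preserved, normalizing with `NFD` instead of `NFKD` is a reasonable variant to try.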

## 10. Bytes vs `str`, Encodings

- `str` = Unicode text; `bytes` = raw 8-bit data
- Encode with `.encode('utf-8')`; decode with `.decode('utf-8')`
- When reading files, **always** set an explicit encoding (`utf-8` preferred)


In [10]:
t = "hello ✨"
b = t.encode("utf-8")
decoded = b.decode("utf-8")
t, b, decoded


('hello ✨', b'hello \xe2\x9c\xa8', 'hello ✨')

## 11. Safe File I/O (Explicit Encoding)

Use `with open(path, encoding="utf-8") as f:` and handle errors with `errors="replace"` or `"ignore"` when needed.


In [11]:
# Demo only; will create & read a temp file
path = "demo_text.txt"
with open(path, "w", encoding="utf-8") as f:
    f.write("line 1\nline 2 ✨")
with open(path, encoding="utf-8") as f:
    content = f.read()
content.splitlines()


['line 1', 'line 2 ✨']
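The `errors=` options are easiest to compare on raw bytes. `bytes.decode` accepts the same `errors` values as `open()`, so this sketch (with a made-up Latin-1 payload) shows what each choice would do to a file that is not actually UTF-8.

```python
raw = "caf\u00e9".encode("latin-1")     # b'caf\xe9' is not valid UTF-8

# Default (strict) decoding fails loudly.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

replaced = raw.decode("utf-8", errors="replace")   # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")     # bad byte is silently dropped
correct = raw.decode("latin-1")                    # the right codec recovers the text

print(strict_ok, repr(replaced), repr(ignored), repr(correct))
```

Both `"replace"` and `"ignore"` lose information, so they are best treated as a fallback once the real encoding cannot be determined.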

## 12. Regex Essentials (`re`)

- `re.search` (anywhere), `re.match` (start), `re.findall`, `re.sub`
- Use raw strings for patterns: `r"\d+"`
- Common flags: `re.I` (ignore case), `re.M` (multi-line), `re.S` (dot matches newline)
- Groups with `(...)`, named groups `(?P<name>...)`


In [12]:
import re

text = "Email me at prerna.joshi@example.com or pj@uta.edu"
pattern = re.compile(r"(?P<user>[\w\.-]+)@(?P<domain>[\w\.-]+)", re.I)
emails = pattern.findall(text)
masked = pattern.sub(lambda m: m.group("user") + "@***", text)
emails, masked


([('prerna.joshi', 'example.com'), ('pj', 'uta.edu')],
 'Email me at prerna.joshi@*** or pj@***')
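The difference between `re.match` and `re.search`, and the effect of the common flags, can be sketched on a small multi-line string (the log text is invented for this example).

```python
import re

log = "INFO start\nerror: disk full\nERROR: retry failed"

at_start = re.match(r"error", log)            # None: the string starts with "INFO"
anywhere = re.search(r"error", log)           # finds the first "error" on line 2

# re.M lets ^ and $ match at every line boundary; re.I ignores case.
messages = re.findall(r"^error: (.+)$", log, flags=re.I | re.M)

# Without re.S the dot stops at a newline, so a cross-line match fails.
one_line = re.search(r"start.*failed", log)
across_lines = re.search(r"start.*failed", log, flags=re.S)

print(at_start, anywhere.group(0), messages, one_line, bool(across_lines))
```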

## 13. Common Extraction Patterns (starter)

- **Integers / floats**: `r"-?\d+"`, `r"-?\d+(?:\.\d+)?"`
- **Dates (simple)**: `r"\b\d{4}-\d{2}-\d{2}\b"`
- **Words**: `r"\b\w+\b"`
> Use libraries (e.g., `dateutil`, `pandas`) for robust parsing in production.


In [13]:
import re
s = "Total: -12.50 on 2025-12-05; items: 3"
floats = re.findall(r"-?\d+(?:\.\d+)?", s)
date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", s).group(0)
floats, date


(['-12.50', '2025', '-12', '-05', '3'], '2025-12-05')
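Note that the float pattern above also picks up the pieces of the date (`'2025'`, `'-12'`, `'-05'`). One possible way around this, sketched here with a hypothetical helper `extract_dates_and_numbers`, is to pull out the dates first and search for numbers only in what remains.

```python
import re

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")

def extract_dates_and_numbers(s: str):
    """Hypothetical helper: find dates, blank them out, then find numbers."""
    dates = DATE.findall(s)
    remainder = DATE.sub(" ", s)          # replace each date with a space
    numbers = NUMBER.findall(remainder)
    return dates, numbers

dates, numbers = extract_dates_and_numbers("Total: -12.50 on 2025-12-05; items: 3")
print(dates, numbers)
```

This ordering trick only handles the one date format shown; for anything richer, the libraries mentioned above remain the safer choice.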

## 14. A Tiny Cleaning Pipeline (Composable)

Steps: normalize (NFKC) → casefold → strip accents (reusing `strip_accents` from §9) → replace a punctuation subset with spaces → collapse whitespace → tokenize.


In [14]:
import re, unicodedata

WS = re.compile(r"\s+")
PUNCT = str.maketrans({c: " " for c in ",.;:!?"})

def normalize_text(s: str) -> list[str]:
    s = unicodedata.normalize("NFKC", s)
    s = s.casefold()
    s = strip_accents(s)
    s = s.translate(PUNCT)
    s = WS.sub(" ", s).strip()
    return s.split()

normalize_text("Café—AI, Engineering!!  is FUN		Right?")


['cafe—ai', 'engineering', 'is', 'fun', 'right']
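In the output above the dash between "cafe" and "ai" survives because it is not in the small `PUNCT` subset. As a variation (an assumption about what you might want, not part of the original pipeline), punctuation can be detected by Unicode category instead. The function name `tokenize_drop_punct` is hypothetical, and accent stripping is left out to keep the block self-contained.

```python
import unicodedata

def tokenize_drop_punct(s: str) -> list[str]:
    """Hypothetical variant: treat every Unicode punctuation character as a space."""
    s = unicodedata.normalize("NFKC", s).casefold()
    # Categories starting with "P" cover dashes, commas, quotes, brackets, etc.
    s = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in s)
    return s.split()                      # split() also collapses runs of whitespace

tokens = tokenize_drop_punct("Caf\u00e9\u2014AI, Engineering!!  is FUN\t\tRight?")
print(tokens)
```

The trade-off is that apostrophes and hyphens are punctuation too, so words like "don't" or "well-known" would be split; a fixed subset gives more control.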

## 15. Performance Tips

- Build strings with `join`, not `+` in loops
- Pre-compile hot regexes: `pattern = re.compile(...)`
- Stream files line-by-line instead of `read()` for very large files
- Use generator pipelines to avoid large intermediates
- Normalize once up-front when doing many comparisons/hashes
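Several of these tips combine naturally. The sketch below uses a pre-compiled regex inside a generator that consumes one line at a time. `io.StringIO` stands in for a large file opened with `open(path, encoding="utf-8")`, and the function names are illustrative.

```python
import io
import re
from collections import Counter

WORD = re.compile(r"\w+")       # compiled once, reused for every line

def iter_tokens(lines):
    """Yield casefolded word tokens one line at a time."""
    for line in lines:
        yield from WORD.findall(line.casefold())

def count_tokens(lines) -> Counter:
    # Counter consumes the generator lazily; no full token list is built.
    return Counter(iter_tokens(lines))

# Stand-in for a real file object; iterating a file also yields lines.
fake_file = io.StringIO("Data AI data\nai AI engineering\n")
counts = count_tokens(fake_file)
print(counts.most_common(2))
```

Only the counter itself grows with the data, so memory use depends on the number of distinct tokens, not on the file size.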


## 16. Practice (Try first, then reveal solutions)

1. **smart_title**: Implement `smart_title(s)` that title-cases words but leaves known acronyms (`AI`, `ML`, `NLP`, `USA`) uppercase.  
2. **normalize_id**: Given a messy ID string, return lowercase alphanumerics only (drop other chars).  
3. **is_palindrome**: Case- & accent-insensitive palindrome check (ignore non-alphanumerics).  
4. **extract_emails**: Return a list of emails from text; mask domains (`user@***`).  
5. **word_counts**: Build a frequency dict of tokens using the cleaning pipeline in §14. Return top 5.  
6. **replace_many**: Implement multi-replace using `translate` with a mapping `{",": " ", ";": " ", "|": " "}`.  
7. **split_kv**: Parse `"a=1; b=2; c=3"` into a dict safely using `partition`.  
8. **truncate_ellipsis**: Truncate a string to `n` characters without breaking words; add `…` if truncated.  
9. **slugify**: Convert `"Hello, World!"` → `"hello-world"` using normalization + regex.  
10. **safe_open_read**: Read a file at `path` with `utf-8` and `errors="replace"`, returning the text.  
11. **regex_date_swap**: Using regex, transform `"2025-12-05"` to `"12/05/2025"`.  
12. **find_urls** *(bonus)*: Extract basic `http(s)://...` URLs with regex (keep it simple).


## 17. Practice Solutions  
*(Click to reveal after solving.)*

<details>
<summary><strong>Solution 1️⃣: smart_title</strong></summary>

```python
ACROS = {"AI","ML","NLP","USA"}
def smart_title(s: str) -> str:
    words = s.split()
    out = []
    for w in words:
        ww = w.upper()
        out.append(ww if ww in ACROS else w.title())
    return " ".join(out)
```
</details>

<details>
<summary><strong>Solution 2️⃣: normalize_id</strong></summary>

```python
import re
def normalize_id(s: str) -> str:
    return re.sub(r"[^a-z0-9]+", "", s.casefold())
```
</details>

<details>
<summary><strong>Solution 3️⃣: is_palindrome</strong></summary>

```python
import re, unicodedata

def _alnum_casefold_no_accent(s: str) -> str:
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if unicodedata.category(ch) != "Mn")
    s = s.casefold()
    return re.sub(r"[^a-z0-9]", "", s)

def is_palindrome(s: str) -> bool:
    t = _alnum_casefold_no_accent(s)
    return t == t[::-1]
```
</details>

<details>
<summary><strong>Solution 4️⃣: extract_emails</strong></summary>

```python
import re
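EMAIL = re.compile(r"(?P<user>[\w\.-]+)@(?P<domain>[\w\.-]+)")

def extract_emails(text: str) -> list[str]:
    # group(0) is the full address; findall() would return (user, domain) tuples
    return [m.group(0) for m in EMAIL.finditer(text)]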

def mask_domains(text: str):
    return EMAIL.sub(lambda m: m.group("user")+"@***", text)
```
</details>

<details>
<summary><strong>Solution 5️⃣: word_counts</strong></summary>

```python
from collections import Counter

def word_counts(text: str, top=5):
    toks = normalize_text(text)
    return Counter(toks).most_common(top)
```
</details>

<details>
<summary><strong>Solution 6️⃣: replace_many</strong></summary>

```python
def replace_many(s: str) -> str:
    table = str.maketrans({",":" ", ";":" ", "|":" "})
    return s.translate(table)
```
</details>

<details>
<summary><strong>Solution 7️⃣: split_kv</strong></summary>

```python
def split_kv(s: str) -> dict:
    out = {}
    for part in s.split(";"):
        part = part.strip()
        if not part:
            continue
        k, sep, v = part.partition("=")
        if sep:
            out[k.strip()] = v.strip()
    return out
```
</details>

<details>
<summary><strong>Solution 8️⃣: truncate_ellipsis</strong></summary>

```python
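def truncate_ellipsis(s: str, n: int) -> str:
    if len(s) <= n:
        return s
    head = s[:n]
    # If the cut lands mid-word, back up to the last space (when there is one).
    if not s[n].isspace() and " " in head:
        head = head.rsplit(" ", 1)[0]
    return head.rstrip() + "…"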
```
</details>

<details>
<summary><strong>Solution 9️⃣: slugify</strong></summary>

```python
import re, unicodedata

def slugify(s: str) -> str:
    s = unicodedata.normalize("NFKC", s)
    s = strip_accents(s).casefold()
    s = re.sub(r"[^a-z0-9]+", "-", s).strip("-")
    return s
```
</details>

<details>
<summary><strong>Solution 🔟: safe_open_read</strong></summary>

```python
def safe_open_read(path: str) -> str:
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()
```
</details>

<details>
<summary><strong>Solution 1️⃣1️⃣: regex_date_swap</strong></summary>

```python
import re
DATE = re.compile(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})")

def regex_date_swap(s: str) -> str:
    return DATE.sub(lambda m: f"{m.group('m')}/{m.group('d')}/{m.group('y')}", s)
```
</details>

<details>
<summary><strong>Solution 1️⃣2️⃣: find_urls</strong></summary>

```python
import re
URL = re.compile(r"https?://[^\s)]+")
def find_urls(text: str):
    return URL.findall(text)
```
</details>


## 18. Mini Cheatsheet

- Prefer `join` over `+` in loops
- Normalize (`NFC/NFKD`) before comparing/hashing
- Use `casefold` for case-insensitive compare
- Encode/decode explicitly (`utf-8`); set `encoding=` on `open()`
- Pre-compile regexes you reuse; use raw strings for patterns
- Build small, composable cleaning functions for clarity & tests
