# Chapter 8 — Strings: A Deeper Look


###  **Objectives**
- Understand **text processing** fundamentals.
- **Format** string content for clean output. 
- **Compare strings** using relational operators.   
- **Split** strings into tokens and **join** them back with separators.  
- Create and use **regular expressions (regex)** to match, search, replace, and split text.  
- Use regex **metacharacters**, **quantifiers**, **character classes**, and **grouping**.  
- Understand the role of **string manipulation** in **Natural Language Processing (NLP)**.  
- Learn the data science concepts of **data munging**, **data wrangling**, and **data cleaning**.  
- Use **regex** to clean and format messy data.


Strings are one of Python’s most powerful and frequently used data types.  
They are essential in **text analysis**, **data cleaning**, **web scraping**, and **natural language processing (NLP)**.

In this chapter, we’ll explore how Python provides a rich set of built-in **string methods** and the **`re` module** for regular expressions to make text manipulation efficient and expressive.


In [3]:
# Example: basic string creation and display
message = "Data Science with Python"
print(message)


Data Science with Python


Python strings are **immutable**, meaning their contents cannot be changed after creation.  
Operations that appear to modify a string actually create a **new string object**.


In [4]:
# Example: Immutability
s = "Hello"
s_upper = s.upper()

print(s)        # original string remains unchanged
print(s_upper)  # new string object is created


Hello
HELLO


# 8.1 Introduction

Strings are sequences of characters. They support many of the same operations as lists and tuples but are **immutable**.

This chapter explores:
- Advanced string manipulation
- Regular expressions (`re` module)
- Data cleaning (munging/wrangling)
- Text processing for data science and NLP

### Key Concepts

| Concept | Description | Example |
|----------|--------------|----------|
| **String immutability** | Strings cannot be changed in place. | `"cat".replace('c', 'b')` → `'bat'` |
| **String as sequence** | Strings support slicing, indexing, and iteration. | `"Python"[1:4]` → `'yth'` |
| **Regular expressions** | Patterns for searching, matching, or validating text. | `re.fullmatch(r'\d{4}', '2025')` |
| **Data munging/wrangling** | Cleaning and transforming text data. | Removing spaces, symbols, duplicates |
| **Text processing** | Extracting information or structure from text. | Tokenization, lemmatization |


### Importance

Text data dominates modern datasets.  
String manipulation and regex are critical for:
- Cleaning raw text
- Extracting structured information
- Preparing text for machine learning or NLP models

### String and NLP Applications

- Anagrams  
- Automated grading  
- Chatbots  
- Compilers and interpreters  
- Document classification and summarization  
- Machine translation  
- Natural language understanding  
- Opinion and sentiment analysis  
- Search engines  
- Spam detection  
- Speech-to-text and text-to-speech  
- Grammar and spell checking  
- Web scraping  
- Word clouds and games  
- Fraud detection  
- Medical and legal document processing 

### Connection to Data Science

Regular expressions and Pandas can be combined for:
- Cleaning messy datasets
- Validating data formats
- Extracting patterns (emails, IDs, prices)
- Text preprocessing for NLP

# 8.2 Formatting Strings

Proper text formatting makes data easier to read and understand.  
Python provides flexible tools for **string formatting** — especially with **f-strings**, which are powerful and readable.


## 8.2.1 Presentation Types

When you specify a placeholder in an f-string, Python assumes the value should be displayed as a **string**, unless another type is explicitly defined.

For example, formatting a float value to **two decimal places**:


In [5]:
f'{17.489:.2f}'

'17.49'

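The `f` presentation type applies to numeric values such as floats and `Decimal` values, and the precision (`.2`) sets the number of digits after the decimal point.  
Formatting is type dependent: if you use `.2f` on a string, you’ll get a `ValueError`.  
The letter at the end of the format specifier (after the colon `:`) is the **presentation type**; it tells Python how to display the value.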

### Integers

The `d` presentation type formats integer values as strings:


In [6]:
f'{10:d}'

'10'

Other integer presentation types include:

| Type | Description |
|-------|--------------|
| `b` | Binary |
| `o` | Octal |
| `x` / `X` | Hexadecimal (lowercase / uppercase) |
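
As a quick illustration of the integer presentation types in the table, the sketch below formats one value in each base. The value `255` and the variable names are chosen only for this example, and the `#` flag (which adds a base prefix) is an extra that the text above does not cover.

```python
value = 255  # arbitrary sample value for illustration

binary = f'{value:b}'      # base 2
octal = f'{value:o}'       # base 8
hex_lower = f'{value:x}'   # base 16, lowercase digits
hex_upper = f'{value:X}'   # base 16, uppercase digits
prefixed = f'{value:#x}'   # '#' adds the 0x prefix

print(binary, octal, hex_lower, hex_upper, prefixed)
```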


### Characters

The `c` presentation type formats an **integer character code** as the corresponding **Unicode character**:


In [7]:
f'{65:c} {97:c}'

'A a'

### Strings

The `s` presentation type formats a value as a string.  
If omitted, Python converts non-string values automatically to strings.


In [8]:
f'{"hello":s} {7}'

'hello 7'

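Note: Before Python 3.12, an expression inside an f-string placeholder could not reuse the quote character that delimits the f-string, which is why `"hello"` uses double quotes inside the single-quoted f-string.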

### Floating-Point and Decimal Values

The `f` presentation type is used for **floating-point** and **Decimal** values.  
For **very large or small numbers**, use **Exponential (scientific)** notation with `e` or `E`.


In [9]:
from decimal import Decimal

f'{Decimal("10000000000000000000000000.0"):.3f}'

'10000000000000000000000000.000'

In [10]:
f'{Decimal("10000000000000000000000000.0"):.3e}'

'1.000e+25'

Here, `1.000e+25` means \(1.000 \times 10^{25}\).  
If you prefer a capital `E`, use `E` instead of `e`.

## 8.2.2 Field Widths and Alignment

Field widths allow you to control how many **character positions** are used when displaying text or numbers.  
By default:
- Numbers are **right-aligned**.
- Strings are **left-aligned**.

In [11]:
f'[{27:10d}]'

'[        27]'

In [12]:
f'[{3.5:10f}]'

'[  3.500000]'

In [13]:
f'[{"hello":10}]'

'[hello     ]'

In the examples above:
- The number `27` and float `3.5` are right-aligned within a **10-character-wide field**.
- The string `"hello"` is left-aligned.
- Float values default to **six digits** after the decimal point.
- Extra spaces fill the unused positions in the field.


### Explicit Alignment

You can explicitly specify alignment using these symbols:

| Symbol | Alignment | Example |
|:--------|:-----------|:---------|
| `<` | Left align | `f'{27:<10d}'` |
| `>` | Right align | `f'{27:>10d}'` |
| `^` | Center align | `f'{27:^10d}'` |

In [14]:
f'[{27:<15d}]'

'[27             ]'

In [15]:
f'[{3.5:<15f}]'

'[3.500000       ]'

In [16]:
f'[{"hello":>15}]'

'[          hello]'

### Centering a Value in a Field

To **center-align** values, use `^`.  
If the number of padding characters is odd, the extra space is added to the **right**.

In [17]:
f'[{27:^7d}]'

'[  27   ]'

In [18]:
f'[{3.5:^7.1f}]'

'[  3.5  ]'

In [19]:
f'[{"hello":^7}]'

'[ hello ]'

## 8.2.3 Numeric Formatting

Python provides several options to control **how numbers are displayed** — including signs, zero-padding, spacing, and digit grouping.

### Formatting Positive Numbers with Signs

Use `+` before the field width to **display a sign** for positive and negative numbers.


In [20]:
f'[{27:+10d}]'

'[       +27]'

To **pad with zeros** instead of spaces, place a `0` before the field width (after the sign symbol, if used).


In [21]:
f'[{27:+010d}]'

'[+000000027]'

### Using a Space Where a `+` Sign Would Appear

A space before the field width reserves a space for positive numbers.  
This is useful for aligning positive and negative values vertically.


In [22]:
print(f'{27:d}\n{27: d}\n{-27: d}')

27
 27
-27


### Grouping Digits with Commas

You can format numbers with **thousands separators** using a comma (`,`).


In [23]:
f'{12345678:,d}'

'12,345,678'

In [24]:
f'{123456.78:,.3f}'

'123,456.780'
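
The alignment, sign, and grouping options can be combined in a single format specifier. The following is a minimal sketch with made-up sample data (the labels and amounts are hypothetical) that lines up a small report.

```python
# Hypothetical sample data: (label, amount) pairs
rows = [('Widgets', 1234567.891), ('Gadgets', -9876.5)]

# <10  : label left-aligned in a 10-character field
# >+16 : amount right-aligned in a 16-character field, always with a sign
# ,.2f : thousands separators and two digits after the decimal point
lines = [f'{label:<10}{amount:>+16,.2f}' for label, amount in rows]

for line in lines:
    print(line)
```

Because both fields have fixed widths, every line has the same length and the decimal points line up vertically.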

## 8.2.4 String’s `format` Method

Before Python 3.6 introduced **f-strings**, the primary way to format strings was using the **`format()` method**.  
This method is still commonly seen in older code and official documentation.


### Basic Usage

You call `format()` on a format string containing **curly brace `{}` placeholders**,  
optionally including format specifiers (after a colon `:`).


In [25]:
'{:.2f}'.format(17.489)

'17.49'

### Multiple Placeholders

You can include **multiple placeholders**.  
The arguments passed to `format()` are substituted **from left to right**.


In [26]:
'{} {}'.format('Amanda', 'Cyan')

'Amanda Cyan'

### Referencing Arguments by Position Number

You can explicitly refer to **positional arguments** using numbers that start from `0`.  
This allows reuse and reordering of values.


In [27]:
'{0} {0} {1}'.format('Happy', 'Birthday')

'Happy Happy Birthday'

### Referencing Keyword Arguments

You can pass **named (keyword) arguments** and refer to them using their keys inside the placeholders.


In [28]:
'{first} {last}'.format(first='Amanda', last='Gray')

'Amanda Gray'

In [29]:
'{last} {first}'.format(first='Amanda', last='Gray')

'Gray Amanda'
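
The format specifiers from Section 8.2 work the same way in `format()` placeholders. A small sketch, reusing the value `17.489` from earlier and otherwise using made-up names and values, shows that both styles produce the same text.

```python
price = 17.489  # same sample value used earlier in this section

# The specifier after the colon is identical in both styles
via_format = 'Total: {:>8.2f}'.format(price)
via_fstring = f'Total: {price:>8.2f}'

# Keyword placeholders can carry format specifiers too
labeled = '{item:<6}|{cost:6.1f}|'.format(item='tea', cost=3.5)

print(via_format)
print(via_fstring)
print(labeled)
```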

# 8.3 Concatenating and Repeating Strings

Strings in Python can be **concatenated** (joined together) using the `+` operator and **repeated** using the `*` operator.  
Since strings are **immutable**, every operation creates a **new string object** and assigns it to the variable.

You can also use **augmented assignment operators** (`+=` and `*=`) for these tasks.


In [30]:
# Concatenating strings using +=
s1 = 'happy'
s2 = 'birthday'
s1 += ' ' + s2  # same as s1 = s1 + ' ' + s2
print(s1)

happy birthday


In [31]:
# Repeating strings using *=
symbol = '>'
symbol *= 5  # same as symbol = symbol * 5
print(symbol)

>>>>>


# 8.4 Stripping Whitespace from Strings

Strings often contain unwanted **whitespace** at the beginning or end (such as spaces, tabs `\t`, or newlines `\n`).  
Python provides several methods to remove them.  

Because strings are **immutable**, these methods return **new strings**, leaving the original unchanged.

### Removing Leading and Trailing Whitespace
Use `.strip()` to remove **both leading and trailing** whitespace.

In [32]:
sentence = '\t \n This is a test string. \t\t \n'
print(sentence.strip())

This is a test string.


### Removing Leading Whitespace
Use `.lstrip()` to remove **only leading (left-side)** whitespace.


In [33]:
sentence = '\t \n This is a test string. .\t\t \n'
print(sentence.lstrip())

This is a test string. .		 



### Removing Trailing Whitespace
Use `.rstrip()` to remove **only trailing (right-side)** whitespace.

In [34]:
sentence = '\t \n. This is a test string. \t\t \n'
print(sentence.rstrip())

	 
. This is a test string.


All these methods remove **spaces**, **tabs (`\t`)**, and **newlines (`\n`)** from the corresponding sides of the string.
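
The three methods also accept an optional string of characters to remove instead of whitespace, which goes slightly beyond what is shown above. A minimal sketch with a made-up sample value:

```python
raw = '  \t--Total: 42--\n'  # made-up sample value

trimmed = raw.strip()             # whitespace removed from both ends
no_dashes = trimmed.strip('-')    # then '-' characters from both ends
left_only = trimmed.lstrip('-')   # '-' characters from the left side only

print(repr(trimmed))
print(repr(no_dashes))
print(repr(left_only))
print(repr(raw))  # the original string is unchanged
```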


# 8.5 Changing Character Case

Python provides several **string methods** to change the capitalization of text.  
Since strings are **immutable**, these methods return **new strings** instead of modifying the original.


### Converting to All Lowercase or Uppercase

You can use:
- `.lower()` → to make all letters lowercase  
- `.upper()` → to make all letters uppercase


In [35]:
print("Happy Birthday".lower())
print("Happy Birthday".upper())

happy birthday
HAPPY BIRTHDAY


### Capitalizing Only the First Character

The `.capitalize()` method returns a copy of the string with **only the first character capitalized**  
and the rest converted to lowercase.


In [36]:
print('happy birthday'.capitalize())

Happy birthday


### Capitalizing the First Character of Every Word

The `.title()` method capitalizes the **first character of each word**,  
useful for formatting titles or headings.


In [37]:
print('strings: a deeper look'.title())

Strings: A Deeper Look


# 8.6 Comparison Operators for Strings

Strings in Python can be compared using **comparison operators** such as:
`==`, `!=`, `<`, `<=`, `>`, and `>=`.

These comparisons are performed **lexicographically** (based on Unicode values of characters).


### Character Codes and Unicode Values

Each character has an underlying **integer Unicode value**.  Uppercase letters (e.g., `'A'`) have **smaller** numeric codes than lowercase letters (e.g., `'a'`).
You can check these numeric codes using the `ord()` function.

In [38]:
print(f'A: {ord("A")}; a: {ord("a")}')

A: 65; a: 97


#### Example: Comparing Strings

Let’s compare `'Orange'` and `'orange'`:


In [39]:
print('Orange' <  'orange')

True


#### Explanation

- `'O'` (Unicode 79) is **less than** `'o'` (Unicode 111).  
- Therefore, `'Orange' < 'orange'` evaluates to **True**.  
- String comparison in Python is **case-sensitive**.


In [40]:
print('Orange' == 'orange')

False


In [41]:
print('Orange' <= 'orange')

True


In [42]:
print('Orange' >  'orange')

False


In [43]:
print('Orange' >= 'orange')

False


In [44]:
print('Orange' != 'orange')

True
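
Because comparisons are case-sensitive, sorting mixed-case text places all uppercase-initial words first. One common workaround, sketched here with a made-up word list, is to compare lowercase copies of the strings.

```python
words = ['orange', 'Apple', 'banana', 'Orange']  # made-up sample list

# Default ordering compares Unicode values, so uppercase sorts first
default_order = sorted(words)

# key=str.lower compares lowercase copies; the originals are kept
ignore_case = sorted(words, key=str.lower)

# Case-insensitive equality check
same_word = 'Orange'.lower() == 'orange'.lower()

print(default_order)
print(ignore_case)
print(same_word)
```

`sorted()` is stable, so `'orange'` stays ahead of `'Orange'` in the second result because it came first in the input.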


# 8.7 Searching for Substrings

You can search for **substrings** (one or more adjacent characters) within a string to:

- Count occurrences  
- Locate positions  
- Check if a substring exists  
- Verify if a string starts or ends with specific text  

These methods are **case-sensitive** because they compare characters by their underlying Unicode values.


### Counting Occurrences

The `count()` method returns the number of times a substring occurs in the string.


In [45]:
sentence = 'to be or not to be that is the question'
print(sentence.count('to'))          

2


In [46]:
print(sentence.count('to', 12))       # (searches from index 12 to end)

1


In [47]:
print(sentence.count('that', 12, 25)) # (searches between index 12 and 25)

1


### Locating a Substring in a String

The `index()` method returns the **first index** where a substring is found.  
If not found, it raises a `ValueError`.


In [48]:
sentence = 'to be or not to be that is the question'
print(sentence.index('be'))

3


The `rindex()` method searches from the **end of the string** and returns the **last index** where the substring is found.


In [49]:
print(sentence.rindex('be'))

16


### Using `find()` and `rfind()`

These methods behave like `index()` and `rindex()`, but instead of raising an error,
they return **-1** if the substring is not found.


In [50]:
sentence = 'to be or not to be that is the question'
print(sentence.find('that'))   

19


In [51]:
print(sentence.find('xyz'))   

-1


In [52]:
print(sentence.rfind('be'))   

16
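
The difference between the two families matters when a substring may be absent. The helpers below (`locate` and `all_positions` are hypothetical names written for this sketch) show one way to handle the `ValueError` from `index()` and to use the `-1` result from `find()` as a loop condition.

```python
sentence = 'to be or not to be that is the question'

def locate(text, sub):
    """Return the first index of sub in text, or None if it is absent."""
    try:
        return text.index(sub)
    except ValueError:
        return None

def all_positions(text, sub):
    """Collect the index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError('sub must be a non-empty string')
    positions = []
    start = text.find(sub)
    while start != -1:  # find() returns -1 when nothing more is found
        positions.append(start)
        start = text.find(sub, start + len(sub))
    return positions

print(locate(sentence, 'that'))
print(locate(sentence, 'xyz'))
print(all_positions(sentence, 'be'))
```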


### Checking for Substring Presence

Use the `in` and `not in` operators to check if a substring exists in a string.


In [53]:
print('that' in sentence)

True


In [54]:
print('THAT' in sentence)

False


In [55]:
print('THAT' not in sentence)

True


### Checking Start or End of a String

Use `startswith()` and `endswith()` to check if a string begins or ends with a specific substring.


In [56]:
print(sentence.startswith('to'))

True


In [57]:
print(sentence.startswith('be')) 

False


In [58]:
print(sentence.endswith('question')) 

True


In [59]:
print(sentence.endswith('quest'))

False


# 8.8 Replacing Substrings

A common text manipulation task is **finding and replacing** substrings within a string.  
The `replace()` method takes two arguments, the **old substring** and the **new substring**, and replaces all occurrences of the old one with the new one.  

It returns a **new string**, leaving the original unchanged.


In [60]:
values = '1\t2\t3\t4\t5'
print(values.replace('\t', ','))

1,2,3,4,5


You can also provide an **optional third argument** to specify the **maximum number of replacements** to perform.


In [61]:
text = 'a b c b d b e'
print(text.replace('b', 'X', 2))

a X c X d b e


# 8.9 Splitting and Joining Strings

Strings can be **split** into parts (tokens) and **joined** back together using delimiters.  
These operations are essential for text parsing, data cleaning, and tokenization tasks.


### Splitting Strings
The `split()` method breaks a string into a list of substrings based on a **delimiter** (default: whitespace).


In [62]:
letters = 'A, B, C, D'
print(letters.split(', '))

['A', 'B', 'C', 'D']


You can limit the number of splits by providing a **second argument**, the maximum number of splits.


In [63]:
print(letters.split(', ', 2))

['A', 'B', 'C, D']


`rsplit()` works similarly but starts splitting from the **end** of the string.
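
A short sketch of the difference, reusing the `letters` string from above; the `path` value is a made-up example.

```python
letters = 'A, B, C, D'

from_left = letters.split(', ', 2)     # at most 2 splits, counted from the left
from_right = letters.rsplit(', ', 2)   # at most 2 splits, counted from the right

# A typical use: separate a file extension from the rest of a name
path = 'archive/2025/report.final.txt'  # hypothetical path
stem, extension = path.rsplit('.', 1)

print(from_left)
print(from_right)
print(stem, extension)
```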

### Joining Strings
The `join()` method concatenates elements of an **iterable** (like a list) into a single string, separated by the string on which `join()` is called.


In [64]:
letters_list = ['A', 'B', 'C', 'D']
print(','.join(letters_list))

A,B,C,D


Example with a **list comprehension**:

In [65]:
print(','.join([str(i) for i in range(10)]))

0,1,2,3,4,5,6,7,8,9


### String Methods `partition()` and `rpartition()`
`partition()` splits a string into a tuple of three parts:
1. The part before the separator,  
2. The separator itself,  
3. The part after the separator.


In [66]:
print('Amanda: 89, 97, 92'.partition(': '))

('Amanda', ': ', '89, 97, 92')
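
The tuple can be unpacked directly, and the remaining text can then be tokenized with `split()`. A minimal sketch that turns the record above into a name and an average (the variable names are chosen for this example):

```python
record = 'Amanda: 89, 97, 92'

# Unpack the three parts; the separator itself is not needed
name, _, grades_text = record.partition(': ')

# Tokenize the grades and convert each token to an integer
grades = [int(token) for token in grades_text.split(', ')]
average = sum(grades) / len(grades)

summary = f'{name}: {average:.1f}'
print(summary)
```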


`rpartition()` works the same way but searches from the **end** of the string.


In [67]:
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'
rest_of_url, separator, document = url.rpartition('/')

print(document)      
print(rest_of_url)    


table_of_contents.html
http://www.deitel.com/books/PyCDS


### String Method `splitlines()`
`splitlines()` splits a string at each newline (`\n`) character and returns a list of lines.


In [68]:
lines = """This is line 1
This is line2
This is line3"""

print(lines)

This is line 1
This is line2
This is line3


In [69]:
print(lines.splitlines())

['This is line 1', 'This is line2', 'This is line3']


Passing `True` keeps the newline characters in the result.


In [70]:
print(lines.splitlines(True))

['This is line 1\n', 'This is line2\n', 'This is line3']


# 8.10 Characters and Character-Testing Methods

In Python, **characters** are simply **one-character strings**.  
A variety of string methods exist to test whether a string matches certain properties, such as containing only digits, letters, or whitespace.

These methods are particularly useful for **input validation** and **text preprocessing**.


### Checking for Digits
`isdigit()` returns `True` if all characters in the string are digits (`0–9`).


In [71]:
print('-27'.isdigit())

False


In [72]:
print('27'.isdigit())

True


### Checking for Alphanumeric Characters
`isalnum()` returns `True` if the string contains only **letters and digits**, and `False` if it contains spaces or symbols.


In [73]:
print('A9876'.isalnum())

True


In [74]:
print('123 Main Street'.isalnum())

False


### Common Character-Testing Methods

| Method | Returns `True` If the String Contains Only... |
|:-------|:----------------------------------------------|
| `isalnum()` | Letters and digits |
| `isalpha()` | Letters only |
| `isdigit()` | Digits only |
| `isdecimal()` | Decimal characters only |
| `isnumeric()` | Numeric characters (including fractions, superscripts) |
| `isspace()` | Whitespace characters |
| `islower()` | Lowercase letters |
| `isupper()` | Uppercase letters |
| `istitle()` | Titlecase words (each word starts uppercase) |
| `isascii()` | ASCII characters only |

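Most of these methods return `False` if **any character** in the string fails to meet the condition. The exceptions are `islower()`, `isupper()`, and `istitle()`, which consider only cased characters, so `'hello world'.islower()` is `True` despite the space.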


#### Example Demonstrations


In [75]:
print('python'.isalpha())

True


In [76]:
print('12345'.isnumeric())

True


In [77]:
print('   '.isspace())

True


In [78]:
print('Hello World'.istitle())

True


In [79]:
print('hello world'.islower())

True


In [80]:
print('HELLO'.isupper())

True
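
Since `'-27'.isdigit()` is `False`, validating signed integers takes a little extra work. The function below is a hypothetical helper sketched for this purpose; it also calls `isascii()` because `isdigit()` accepts some non-ASCII digit characters such as `'²'`.

```python
def is_signed_int(text):
    """Return True if text is an optional sign followed by ASCII digits."""
    # Drop one leading sign character, if present
    digits = text[1:] if text[:1] in ('+', '-') else text
    # isascii() rejects characters such as superscript digits
    return digits.isascii() and digits.isdigit()

for candidate in ['-27', '27', '+5', '-', '', '2.7']:
    print(repr(candidate), is_signed_int(candidate))
```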


# 8.11 Raw Strings

In Python, **backslashes (`\`)** are used to introduce **escape sequences** such as:
- `\n` → newline  
- `\t` → tab  

If you want to include a literal backslash in a string, you normally need to **escape it** with another backslash (`\\`).


In [81]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt


This can make strings (especially file paths) difficult to read.  
To simplify this, Python provides **raw strings**, which are preceded by the letter **`r`**.
Raw strings treat backslashes as **literal characters**, not as escape indicators.


In [82]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt


#### Key Point:
Raw strings are especially useful when working with:
- **Windows file paths**
- **Regular expressions**, which often contain many backslashes.

The `repr` of a string that contains backslashes shows each one doubled, however the string was written, but the string itself holds single backslashes.  
Raw strings simply make your **source code cleaner and easier to read**.
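
A brief check, using a shortened version of the path above, that a raw string and its escaped equivalent hold exactly the same characters:

```python
escaped = 'C:\\MyFolder\\MyFile.txt'  # each backslash escaped by hand
raw = r'C:\MyFolder\MyFile.txt'       # same text written as a raw string

same_text = escaped == raw      # the two literals produce identical strings
backslashes = raw.count('\\')   # number of single backslashes in the string
shown = repr(raw)               # repr doubles each backslash for display

newline_len = len('\n')   # an actual newline character
raw_len = len(r'\n')      # a backslash followed by the letter n

print(same_text, backslashes, shown)
print(newline_len, raw_len)
```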


# 8.12 Introduction to Regular Expressions

Regular expressions (regex) are patterns that help recognize specific sequences of characters in text, such as phone numbers, emails, or ZIP codes. They’re useful for **data validation**, **text extraction**, **cleaning**, and **transformation**.

Python’s `re` module provides built-in support for working with regular expressions.



In [83]:
import re

### Matching Literal Characters with `re.fullmatch()`

The function `re.fullmatch(pattern, string)` checks if the *entire* string matches the given pattern.


In [84]:
pattern = '02215'
print('Match' if re.fullmatch(pattern, '02215') else 'No match')

Match


In [85]:
print('Match' if re.fullmatch(pattern, '51220') else 'No match')

No match


### Using Character Classes and Quantifiers

`re` provides special symbols (metacharacters) for matching groups of characters.  
For example, `\d` matches any digit (0–9).  
The quantifier `{5}` means *exactly five repetitions*.


In [86]:
print('Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid')

Valid


In [87]:
print('Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid')

Invalid


### Custom Character Classes

Square brackets `[]` define a set of valid characters.  
For example:
- `[A-Z]` matches any uppercase letter  
- `[a-z]` matches any lowercase letter  
- `[A-Z][a-z]*` matches a capitalized name (like "Wally")


In [88]:
print('Valid' if re.fullmatch('[A-Z][a-z]*', 'Wally') else 'Invalid')

Valid


In [89]:
print('Valid' if re.fullmatch('[A-Z][a-z]*', 'eva') else 'Invalid')

Invalid


### Negated Character Classes

A caret `^` at the start of the brackets negates the class, meaning *any character except those listed*.


In [90]:
print('Match' if re.fullmatch('[^a-z]', 'A') else 'No match')

Match


In [91]:
print('Match' if re.fullmatch('[^a-z]', 'a') else 'No match')

No match


### Matching Special Characters Literally

Inside character classes, most metacharacters lose their special meaning and are treated as literal characters.
For example, `[*+$]` matches any one of `*`, `+`, or `$`.


In [92]:
print('Match' if re.fullmatch('[*+$]', '*') else 'No match')

Match


In [93]:
print('Match' if re.fullmatch('[*+$]', '!') else 'No match')

No match


### Quantifiers: `*`, `+`, and `?`

- `*` → zero or more repetitions  
- `+` → one or more repetitions  
- `?` → zero or one repetition


In [94]:
# * allows empty lowercase part
print('Valid' if re.fullmatch('[A-Z][a-z]*', 'E') else 'Invalid')

Valid


In [95]:
# + requires at least one lowercase letter
print('Valid' if re.fullmatch('[A-Z][a-z]+', 'E') else 'Invalid')

Invalid


### Example: British vs. American spelling

The pattern `labell?ed` matches both `"labeled"` and `"labelled"`.


In [96]:
print('Match' if re.fullmatch('labell?ed', 'labelled') else 'No match')

Match


In [97]:
print('Match' if re.fullmatch('labell?ed', 'labeled') else 'No match')

Match


In [98]:
print('Match' if re.fullmatch('labell?ed', 'labellled') else 'No match')

No match


### Range Quantifiers `{n,}` and `{n,m}`

You can specify how many times a character or group should appear:
- `{n,}` → at least n times  
- `{n,m}` → between n and m times (inclusive)


In [99]:
# At least 3 digits
print('Match' if re.fullmatch(r'\d{3,}', '123') else 'No match')

Match


In [100]:
print('Match' if re.fullmatch(r'\d{3,}', '12') else 'No match')

No match


In [101]:
# Between 3 and 6 digits
print('Match' if re.fullmatch(r'\d{3,6}', '123') else 'No match')

Match


In [102]:
print('Match' if re.fullmatch(r'\d{3,6}', '1234567') else 'No match')

No match
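
Character classes and quantifiers can be combined into a small validator. The sketch below assumes a ZIP code is five digits optionally followed by a hyphen and four more digits; `ZIP_PATTERN` and `is_valid_zip` are hypothetical names, and the parentheses group the suffix so that `?` applies to all of it.

```python
import re

# Five digits, then an optional group: a hyphen followed by four digits
ZIP_PATTERN = re.compile(r'\d{5}(-\d{4})?')

def is_valid_zip(text):
    """Return True if the whole string looks like a ZIP or ZIP+4 code."""
    return ZIP_PATTERN.fullmatch(text) is not None

for candidate in ['02215', '02215-1234', '0221', '02215-12']:
    print(candidate, is_valid_zip(candidate))
```

Compiling the pattern once with `re.compile()` is a convenience when the same pattern is reused many times.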


## 8.12.2 Replacing Substrings and Splitting Strings

The `re` module provides two powerful functions for **string manipulation**:
- `re.sub()` — replaces substrings that match a given pattern.
- `re.split()` — splits a string into parts based on a pattern.


### Function `re.sub()` — Replacing Patterns

The syntax is:
```python
re.sub(pattern, replacement, string, count=0)
```


In [103]:
# Replace all tabs (\t) with commas
result = re.sub(r'\t', ', ', '1\t2\t3\t4')
print(result)

1, 2, 3, 4


In [104]:
# Replace only the first two tabs
result = re.sub(r'\t', ', ', '1\t2\t3\t4', count=2)
print(result)  # Output: '1, 2, 3\t4'

1, 2, 3	4
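
Unlike `str.replace()`, `re.sub()` can replace anything a pattern matches. A typical cleaning step, sketched here with made-up input, collapses every run of whitespace into a single space.

```python
import re

messy = '  Data \t Science\n\nwith   Python  '  # made-up messy input

# \s+ matches one or more whitespace characters of any kind
collapsed = re.sub(r'\s+', ' ', messy)
clean = collapsed.strip()  # remove the single spaces left at each end

# count limits the number of replacements, as with tabs above
masked = re.sub(r'\d', '#', 'Card 1234 5678', count=4)

print(repr(clean))
print(masked)
```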


### Function `re.split()` — Splitting Strings

The syntax is:
```python
re.split(pattern, string, maxsplit=0)
```


In [105]:
result = re.split(r',\s*', '1, 2, 3,4, 5,6,7,8')
print(result)

['1', '2', '3', '4', '5', '6', '7', '8']


In [106]:
# Limit the number of splits to 3
result = re.split(r',\s*', '1, 2, 3,4, 5,6,7,8', maxsplit=3)
print(result)

['1', '2', '3', '4, 5,6,7,8']


## 8.12.3 Other Search Functions and Accessing Matches

The `re` module provides several ways to **search for patterns** in strings and **access matches**:
- `search()` — finds the first match anywhere in the string  
- `match()` — matches only at the beginning of a string  
- `findall()` — returns all matches as a list  
- `finditer()` — returns an iterator of match objects  


In [107]:
# --- Function search() ---
result = re.search('Python', 'Python is fun')
print(result.group() if result else 'not found')

Python


In [108]:
result2 = re.search('fun!', 'Python is fun')
print(result2.group() if result2 else 'not found')

not found


### Ignoring Case

Use the optional `flags=re.IGNORECASE` argument to perform **case-insensitive searches**:


In [109]:
result3 = re.search('Sam', 'SAM WHITE', flags=re.IGNORECASE)
print(result3.group() if result3 else 'not found')

SAM


### Anchors: Beginning (^) and End ($)

- `^` → matches **start** of a string  
- `$` → matches **end** of a string  


In [110]:
result = re.search('^Python', 'Python is fun')
print(result.group() if result else 'not found')

Python


In [111]:
result = re.search('fun$', 'Python is fun')
print(result.group() if result else 'not found')

fun


### Function findall() — Find All Matches

`findall()` returns a list of **all matching substrings**.


In [112]:
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'
phones = re.findall(r'\d{3}-\d{3}-\d{4}', contact)
print(phones)

['555-555-1234', '555-555-4321']


### Function finditer() — Memory-Efficient Search

`finditer()` returns an **iterator of match objects**, useful for large datasets.


In [113]:
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    print(phone.group())

555-555-1234
555-555-4321
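
Each match object also records where the match was found. A minimal sketch using the same `contact` string and the match methods `start()` and `end()`:

```python
import re

contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

# Collect the matched text with its start and end positions
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer(r'\d{3}-\d{3}-\d{4}', contact)]

for text, start, end in spans:
    # Slicing with the reported positions recovers the matched text
    print(text, start, end, contact[start:end] == text)
```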


### Capturing Substrings with Parentheses ()

Parentheses `()` in a regular expression **capture** parts of a match into groups.


In [114]:
text = 'Charlie Cyan, e-mail: demo1@deitel.com'
pattern = r'([A-Z][a-z]+ [A-Z][a-z]+), e-mail: (\w+@\w+\.\w{3})'
result = re.search(pattern, text)

In [115]:
# View all captured groups
print(result.groups())  

('Charlie Cyan', 'demo1@deitel.com')


In [116]:
# Full match
print(result.group())  

Charlie Cyan, e-mail: demo1@deitel.com


In [117]:
# Access individual groups
print(result.group(1))

Charlie Cyan


In [118]:
print(result.group(2))

demo1@deitel.com
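
Groups can also be given names with the `(?P<name>...)` syntax, which the text above does not cover. The sketch below rewrites the same pattern with the hypothetical group names `name` and `email`.

```python
import re

text = 'Charlie Cyan, e-mail: demo1@deitel.com'

# Same pattern as above, with each group named
pattern = (r'(?P<name>[A-Z][a-z]+ [A-Z][a-z]+), e-mail: '
           r'(?P<email>\w+@\w+\.\w{3})')
result = re.search(pattern, text)

person = result.group('name')   # access a group by its name
fields = result.groupdict()     # all named groups as a dictionary

print(person)
print(fields)
```

Named groups still have numbers, so `result.group(1)` continues to work.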


# 8.13 Introduction to Data Science: Pandas, Regular Expressions and Data Munging

Raw data is often **incomplete**, **inconsistent**, or **incorrect**.  
Before analysis, data must be **cleaned** and **transformed**, a process known as **data munging** (or **data wrangling**).

Data scientists are often reported to spend **as much as 75%** of their time cleaning and transforming data.

### Common Data Cleaning Tasks
- Deleting observations with missing or invalid values  
- Replacing missing/bad values with reasonable ones  
- Removing or handling outliers  
- Eliminating duplicates  
- Correcting inconsistent formats  

>  **Note:** “Substituting reasonable values” must not be used to manipulate results or confirm hypotheses.

### Example: Handling Missing Data

Consider hospital data where patient temperature readings are recorded four times a day.  
A missing reading might appear as `0.0`, which can distort averages.


In [119]:
temps = ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]

In [120]:
# Compute average with the bad value
avg_with_bad = sum(temps[1:]) / 4

print("Average with 0.0 included:", round(avg_with_bad, 2))

Average with 0.0 included: 73.92


In [121]:
# Compute average ignoring the missing value
avg_cleaned = sum(temps[1:4]) / 3

print("Average with missing handled:", round(avg_cleaned, 2))

Average with missing handled: 98.57
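
Rather than hard-coding the slice, the missing readings can be filtered out before averaging. This sketch assumes, as in the example above, that `0.0` always marks a missing reading.

```python
temps = ['Brown, Sue', 98.6, 98.4, 98.7, 0.0]

# Keep only real readings; 0.0 is assumed to mean "missing"
readings = [t for t in temps[1:] if t != 0.0]

# Divide by the number of readings actually kept
average = sum(readings) / len(readings)

print(len(readings), round(average, 2))
```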


## 8.13.1 Data Validation with Pandas and Regular Expressions

Let's use **Pandas Series** and **regular expressions** to validate ZIP Codes.


In [122]:
import pandas as pd

zips = pd.Series({'Boston': '02215', 'Miami': '3310'})
print(zips)


Boston    02215
Miami      3310
dtype: object


In [123]:
# Validate ZIP codes: str.match() checks that each value starts with 5 digits
valid_zips = zips.str.match(r'\d{5}')
print(valid_zips)

Boston     True
Miami     False
dtype: bool


Invalid data can be fixed at source (if possible) or cleaned programmatically.
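
Note that pandas’ `str.match()` follows `re.match()` and anchors only at the start of each value, so a six-digit value would also pass. The plain-`re` sketch below, with made-up candidate values, shows the difference; recent pandas versions (1.1 and later) also provide `Series.str.fullmatch()` for whole-value checks.

```python
import re

candidates = ['02215', '3310', '022150']  # made-up sample values

# re.match only requires the pattern to match at the start
starts_ok = [bool(re.match(r'\d{5}', z)) for z in candidates]

# re.fullmatch requires the entire string to match
whole_ok = [bool(re.fullmatch(r'\d{5}', z)) for z in candidates]

print(starts_ok)
print(whole_ok)
```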


### Checking for Substring Patterns with `contains()`

Sometimes you need to know whether a value **contains** a match anywhere, rather than whether it matches from its first character.


In [124]:
cities = pd.Series(['Boston, MA 02215', 'Miami, FL 33101'])
print(cities)

0    Boston, MA 02215
1     Miami, FL 33101
dtype: object


In [125]:
# Check if city entries contain a valid state abbreviation
print(cities.str.contains(r' [A-Z]{2} '))

0    True
1    True
dtype: bool


In [126]:
# match() anchors at the start of each value, so neither entry matches
print(cities.str.match(r' [A-Z]{2} '))

0    False
1    False
dtype: bool


## 8.13.2 Reformatting (Transforming) Data

Now let’s **munge data into a new format**.  
Example: Convert unformatted 10-digit phone numbers into `###-###-####`.


In [127]:
contacts = [
    ['Mike Green', 'demo1@deitel.com', '5555555555'],
    ['Sue Brown', 'demo2@deitel.com', '5555551234']
]

contactsdf = pd.DataFrame(contacts, columns=['Name', 'Email', 'Phone'])
print(contactsdf)


         Name             Email       Phone
0  Mike Green  demo1@deitel.com  5555555555
1   Sue Brown  demo2@deitel.com  5555551234


We can apply a **custom function** to transform the `Phone` column using the `map()` method.


In [128]:
import re

def get_formatted_phone(value):
    result = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', value)
    return '-'.join(result.groups()) if result else value

formatted_phone = contactsdf['Phone'].map(get_formatted_phone)
print(formatted_phone)


0    555-555-5555
1    555-555-1234
Name: Phone, dtype: object


In [129]:
# Update the original DataFrame
contactsdf['Phone'] = formatted_phone
print(contactsdf)

         Name             Email         Phone
0  Mike Green  demo1@deitel.com  555-555-5555
1   Sue Brown  demo2@deitel.com  555-555-1234


The data is now reformatted properly.  
Data munging involves **cleaning**, **validating**, and **transforming** to make data ready for analysis.
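
Real phone data often arrives in several formats. As a possible extension, the hypothetical `normalize_phone` function below first strips every non-digit with `re.sub()` and then applies the same three-group pattern; it could be passed to `map()` in the same way as `get_formatted_phone`.

```python
import re

def normalize_phone(value):
    """Return value as ###-###-#### if it holds exactly 10 digits."""
    digits = re.sub(r'\D', '', value)  # \D matches any non-digit
    match = re.fullmatch(r'(\d{3})(\d{3})(\d{4})', digits)
    # Leave the value untouched when it cannot be interpreted
    return '-'.join(match.groups()) if match else value

for sample in ['5555551234', '(555) 555-1234', '555.555.1234', '555-1234']:
    print(sample, '->', normalize_phone(sample))
```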

# 8.14 Wrap-Up

In this chapter, we explored Python’s **string formatting** and **processing** capabilities.

### Key Topics Covered

- **String Formatting**
  - Using **f-strings** and the **`format()`** method for flexible formatting.
- **Augmented Assignments**
  - Concatenation (`+=`) and repetition (`*=`) of strings.
- **String Cleaning and Manipulation**
  - Removing whitespace and changing case.
  - Splitting strings and joining iterables of strings.
  - Using character-testing methods like `.isalpha()`, `.isdigit()`, etc.
- **Raw Strings**
  - Treat backslashes (`\`) as literal characters instead of escape sequences.
  - Especially useful in **regular expressions**.

### Regular Expressions and `re` Module

- Used **`re.fullmatch()`** to validate entire strings.  
- Applied **`re.sub()`** for search-and-replace operations.  
- Used **`re.split()`** to tokenize strings based on regex-defined delimiters.  
- Demonstrated pattern searching and accessing matched substrings.

### Data Science Context

- Introduced **data munging** (or **data wrangling**) — cleaning and transforming data for analysis.
- Demonstrated data cleaning and transformation with **Pandas**:
  - Validated ZIP Codes using regex.
  - Reformatted phone numbers using **`map()`** and custom functions.

### Next Steps

In the next chapter, we’ll extend these concepts by:
- **Reading and writing text files**
- Working with **CSV files** via the `csv` module
- Introducing **exception handling** to manage errors gracefully