&copy; 2022 by Pearson Education, Inc. All Rights Reserved. The content in this notebook is based on the book [**Python for Programmers**](https://amzn.to/2VvdnxE).

### Python Fundamentals LiveLessons Videos
* For a detailed presentation of the content in this notebook see **[Lesson 8](https://learning.oreilly.com/videos/python-fundamentals/9780135917411/9780135917411-PFLL_Lesson08_00)** on O'Reilly Online Learning

# 8. Strings: A Deeper Look 

# 8.1 Introduction
* Format string content.
* Manipulate strings with a few of the many string methods [(full list)](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str).
* Use regular expressions to match patterns in strings, replace substrings and validate data.

# 8.2 Formatting Strings
## 8.2.1 Presentation Types

### **`f` Presentation Type** for Floating-Point Numbers and `Decimal`s

In [1]:
f'{17.489:.2f}'

'17.49'

* Can also use the **`e` presentation type** for exponential (scientific) notation.
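
As a brief illustration of the `e` presentation type, the sketch below formats one large and one small value. The variable names and sample values are chosen only for this example.

```python
# Scientific notation with an explicit precision after the decimal point.
large = f'{12345678.9:.3e}'   # three digits of precision
small = f'{0.000123456:.2e}'  # two digits of precision

print(large)
print(small)
```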

### **`d` Presentation Type** for Integers
* There also are integer presentation types (`b`, `o` and `x` or `X`) that format integers using the binary, octal or hexadecimal number systems. 

In [2]:
f'{10:x}'

'a'
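
The following sketch places the `d` presentation type next to the other integer presentation types listed above. The value 255 and the variable names are illustrative only.

```python
value = 255  # sample integer for this example

decimal_text = f'{value:d}'  # base 10
binary_text = f'{value:b}'   # base 2
octal_text = f'{value:o}'    # base 8
hex_lower = f'{value:x}'     # base 16, lowercase digits
hex_upper = f'{value:X}'     # base 16, uppercase digits

print(decimal_text, binary_text, octal_text, hex_lower, hex_upper)
```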

### **`c` Presentation Type** Formats an Integer Character Code as a Character

In [3]:
f'{0x1F600:c}  {65:c}  {97:c}'

'😀  A  a'

### **`s` Presentation Type** for Strings Is the Default
* If specified explicitly, the value to format **must be a string**, an **expression that produces a string** or a **string literal**. 
* If you do not specify a presentation type, non-string values are converted to strings.

In [4]:
f'{"hello":s} {7}'

'hello 7'
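
To see why an explicit `s` requires a string, the sketch below applies `s` to an integer and catches the resulting `ValueError`. Converting the value with `str` first is one way to avoid the error. The names `number`, `error_message` and `converted` are made up for this example.

```python
number = 7  # a non-string value

try:
    f'{number:s}'  # explicit s with an int is not allowed
    error_message = None
except ValueError as error:
    error_message = str(error)

converted = f'{str(number):s}'  # convert to a string first, then format

print(error_message)
print(converted)
```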

## 8.2.2 Field Widths and Alignment
* Python **right-aligns numbers** and **left-aligns other values**.
* With the `f` presentation type and no explicit precision, Python formats `float` values with **six digits of precision** to the right of the decimal point.

In [5]:
f'[{27:10d}]'

'[        27]'

In [6]:
f'[{3.5:10f}]'  # no precision specified, so f uses the default six digits

'[  3.500000]'

In [7]:
f'[{"hello":10}]'

'[hello     ]'

### Centering a Value in a Field with `^` 

In [8]:
f'[{27:^7d}]'

'[  27   ]'
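
The default alignments can also be overridden. The sketch below uses the `<` (left-align) and `>` (right-align) specifiers from Python's format specification, which these notes do not cover, alongside `^`.

```python
left_number = f'[{27:<7d}]'      # left-align a number, overriding the default
right_text = f'[{"hello":>10}]'  # right-align a string, overriding the default
centered_text = f'[{"hi":^6}]'   # center a string in a six-character field

print(left_number)
print(right_text)
print(centered_text)
```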

# 8.6 Comparison Operators for Strings Use Lexicographical Comparisons
* All the comparison operators are supported.

In [9]:
print(f'A: {ord("A")}; a: {ord("a")}')

A: 65; a: 97


In [10]:
'Astronaut' == 'astronaut'

False

In [11]:
'Astronaut' != 'astronaut'

True

In [12]:
'Astronaut' < 'astronaut'

True
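
Because comparisons use the characters' numeric codes, every uppercase letter orders before every lowercase letter. The sketch below shows the effect on sorting and on a string that is a prefix of another. The sample words are arbitrary.

```python
words = ['banana', 'Apple', 'apple', 'Banana']

# sorted compares strings with <, so uppercase-initial words come first
ordered = sorted(words)

# a string that is a prefix of another compares as less than it
prefix_is_smaller = 'apple' < 'apples'

print(ordered)
print(prefix_is_smaller)
```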

# 8.7 Searching for Substrings

### Locating the **First** Occurrence of a Substring in a String with Method **`index`**
* Raises a `ValueError` if the substring is not found.

In [13]:
sentence = 'to be or not to be that is the question'

In [14]:
sentence

'to be or not to be that is the question'

In [15]:
sentence.index('be')

3

### Locating the **Last** Occurrence of a Substring in a String with Method **`rindex`**
* Various string methods provide "`r`" versions that work from the end of the string.

In [16]:
sentence.rindex('be')

16

### String Methods **`find`** and **`rfind`** 
* Like `index` and `rindex` but, if the substring is not found, return `-1` rather than causing a `ValueError`. 
* See the [full list of string methods](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) for additional searching capabilities.
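
A short sketch of `find` and `rfind` on the same sentence used above. The substring `'xyz'` is just a value that does not occur in it.

```python
sentence = 'to be or not to be that is the question'

first_be = sentence.find('be')   # index of the first occurrence
last_be = sentence.rfind('be')   # index of the last occurrence
missing = sentence.find('xyz')   # -1 instead of a ValueError

print(first_be, last_be, missing)
```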

### Determining Whether a String Contains a Substring (Case Sensitive)

In [17]:
sentence

'to be or not to be that is the question'

In [18]:
'that' in sentence

True

In [19]:
'THAT' in sentence

False
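
When case should not matter, one common approach (not part of the notes above) is to convert both strings to the same case before testing with `in`. This is a minimal sketch. For text beyond plain ASCII, `str.casefold` may be a better choice than `lower`.

```python
sentence = 'to be or not to be that is the question'

case_sensitive = 'THAT' in sentence                     # exact-case test
case_insensitive = 'THAT'.lower() in sentence.lower()   # compare lowercase copies

print(case_sensitive, case_insensitive)
```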

# 8.9 Splitting and Joining Strings

### Splitting Strings with Method `split`  
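* Tokenizes a string at whitespace by default, treating consecutive whitespace characters as one delimiter.
* To split at a different delimiter, pass a delimiter string as the argument.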

In [20]:
letters = 'A, B, C, D'

In [21]:
letters.split(', ')

['A', 'B', 'C', 'D']
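
For comparison, the sketch below calls `split` with no argument, so it splits at whitespace, and then with a second argument that limits the number of splits. The sample strings are illustrative.

```python
spaced = 'A  B\tC\nD'  # mixed spaces, a tab and a newline

default_tokens = spaced.split()             # split at runs of whitespace
limited = 'A, B, C, D'.split(', ', 2)       # perform at most two splits

print(default_tokens)
print(limited)
```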

### Joining Strings with Method **`join`** 
* Concatenates an iterable **containing only string values**. 

In [22]:
letters_list = ['A', 'B', 'C', 'D']

In [23]:
','.join(letters_list)  # ',' is the delimiter to place between the strings in letters_list

'A,B,C,D'

In [24]:
','.join([str(i) for i in range(10)])

'0,1,2,3,4,5,6,7,8,9'
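
The requirement that the iterable contain only strings can be seen by passing integers directly. The sketch below catches the resulting `TypeError`, then converts the values with `map(str, ...)` as an alternative to the list comprehension above. The variable names are made up for this example.

```python
numbers = [1, 2, 3]

try:
    ','.join(numbers)  # the items are ints, not strings
    join_failed = False
except TypeError:
    join_failed = True

joined = ','.join(map(str, numbers))  # convert each item to a string first

print(join_failed)
print(joined)
```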

# 8.11 Raw Strings
* Backslash characters in strings introduce **escape sequences**. 

In [25]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'  # Windows file path

In [26]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'

In [27]:
print(file_path)

C:\MyFolder\MySubFolder\MyFile.txt


* **Raw strings**—preceded by **`r`**—treat each backslash as a literal character.
* Particularly useful with **regular expressions**, which often contain backslashes.

In [28]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'

In [29]:
file_path

'C:\\MyFolder\\MySubFolder\\MyFile.txt'
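
The sketch below checks that the raw string and the escaped string are the same value, and uses `len` to show that `\n` is two characters in a raw string but a single newline character in a regular one.

```python
escaped = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
raw = r'C:\MyFolder\MySubFolder\MyFile.txt'

same_value = escaped == raw   # both spellings produce the same string

raw_length = len(r'\n')       # backslash and n: two characters
escape_length = len('\n')     # one newline character

print(same_value, raw_length, escape_length)
```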

# 8.12 Introduction to Regular Expressions
* **Regular expressions (regex)** recognize **patterns** in text&mdash;phone numbers, e-mail addresses, ZIP Codes, web page addresses, Social Security numbers, ...
* Useful for 
    * Validating data.
    * Extracting data from text (known as **scraping** or **mining**).
    * Cleaning data to prepare it for use. 
    * Transforming data into other formats.

### Repositories of Existing Regular Expressions
* [regex101.com](https://regex101.com) 
* [regexlib.com](http://www.regexlib.com)
* [regular-expressions.info](https://www.regular-expressions.info)

## 8.12.1 **`re` Module** and Function `fullmatch` 
* Function **`fullmatch`** checks whether the **entire string** in its second argument **matches the pattern** in its first argument. 

### Use `\d` Predefined Character Class and a Quantifier to Match **5 Digits**
* **`\d`** matches one digit (0–9). 
* A character class **matches _one_ character**. 

In [30]:
import re

In [31]:
'Valid' if re.fullmatch(r'\d{5}', '02215') else 'Invalid' 

'Valid'

In [32]:
'Valid' if re.fullmatch(r'\d{5}', '9876') else 'Invalid'

'Invalid'

### Predefined Character Classes

| Character class	| Matches
| :------------	| :------------
| `\d`	| Any digit (0–9).
| `\D`	| Any character that is _not_ a digit.
| `\s`	| Any whitespace character (such as spaces, tabs and newlines).
| `\S`	| Any character that is _not_ a whitespace character.
| `\w`	| Any **word character** (also called an **alphanumeric character**): any uppercase or lowercase letter, any digit or an underscore.
| `\W`	| Any character that is _not_ a word character.
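
As a quick check of the table, the sketch below uses `re.findall` (described in Section 8.12.3) to collect the characters each class matches. The sample strings are arbitrary.

```python
import re

digits = re.findall(r'\d', 'Room 42B')        # each digit
non_digits = re.findall(r'\D', 'a1b2')        # each non-digit
whitespace = re.findall(r'\s', 'a b\tc')      # each whitespace character
words = re.findall(r'\w+', 'hi there_1!')     # runs of word characters

print(digits, non_digits, whitespace, words)
```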

### Custom Character Classes Are Defined in `[]`
* `[aeiou]` matches a lowercase vowel
* `[A-Z]` matches an uppercase letter
* `[a-z]` matches a lowercase letter 
* `[a-zA-Z]` matches any lowercase or uppercase letter. 

### Quantifiers 
* **`*` quantifier** matches **zero or more occurrences** of the subexpression to its left (for example, **`[a-z]`** in `[a-z]*`).
* **`+` quantifier** is like `*` but matches **at least one occurrence** of a subexpression.
* **`?`** quantifier matches **zero or one occurrence** of a subexpression.
* **`{n,}`** quantifier matches **at least _n_ occurrences** of a subexpression. 
* **`{n,m}`** quantifier matches **between *n* and _m_ (inclusive) occurrences** of a subexpression.
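
The custom character classes and quantifiers above can be tried with `re.fullmatch`. The helper `is_match` and the sample patterns are hypothetical, written only for this sketch.

```python
import re

def is_match(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# * : an uppercase letter followed by zero or more lowercase letters
star_ok = is_match(r'[A-Z][a-z]*', 'Wally')
star_bad = is_match(r'[A-Z][a-z]*', 'eva')

# + : at least one lowercase letter is required after the uppercase letter
plus_bad = is_match(r'[A-Z][a-z]+', 'E')

# ? : the second 'l' is optional
optional_one = is_match(r'labell?ed', 'labeled')
optional_two = is_match(r'labell?ed', 'labelled')

# {n,} and {n,m} : minimum and bounded repetition counts
at_least_three = is_match(r'\d{3,}', '12')
three_to_six = is_match(r'\d{3,6}', '1234567')

print(star_ok, star_bad, plus_bad, optional_one, optional_two,
      at_least_three, three_to_six)
```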

## 8.12.2 Replacing Substrings and Splitting Strings

### Function `sub`—Replacing **All** Occurrences of a Pattern
* Three arguments&mdash;**pattern to match**, **replacement text** and **string to search**.

In [33]:
re.sub(r'\t', ', ', '1\t2\t3\t4')  # replace tabs with ', '

'1, 2, 3, 4'
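
`sub` replaces every occurrence by default. The sketch below uses its optional `count` keyword argument, which these notes do not mention, to limit the number of replacements.

```python
import re

text = '1\t2\t3\t4'

all_replaced = re.sub(r'\t', ', ', text)             # replace every tab
first_two = re.sub(r'\t', ', ', text, count=2)       # replace only the first two tabs

print(repr(all_replaced))
print(repr(first_two))
```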

### Function `split`—Tokenizing a String Using a Regex to Specify the Delimiter

In [34]:
# split at each comma followed by any number of whitespace characters
re.split(r',\s*', '1,  2,  3,4,    5,6,7,8')

['1', '2', '3', '4', '5', '6', '7', '8']
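
Similarly, `re.split` accepts an optional `maxsplit` keyword argument, not covered in these notes, that caps the number of splits and leaves the remainder of the string as the last element.

```python
import re

data = '1,  2,  3,4,    5,6,7,8'

# perform at most three splits; the remaining text stays intact
limited_tokens = re.split(r',\s*', data, maxsplit=3)

print(limited_tokens)
```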

## 8.12.3 Other Search Functions; Accessing Matches

### Function `search`—Finding the First Match Anywhere in a String
* Returns a **match object** (of type **`re.Match`**; named `SRE_Match` before Python 3.7) that contains the matching substring. 
* The match object’s **`group`** method returns the matching substring.

In [35]:
result = re.search('Python', 'Python is fun')

In [36]:
result.group() if result else 'not found'

'Python'

* Can search for a match only at the **beginning of a string** with function **`match`**. 
* `search` returns `None` if the string does **not** contain the pattern: 

In [37]:
result2 = re.search('fun!', 'Python is fun')

In [38]:
result2.group() if result2 else 'not found'

'not found'
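
The difference between `match` and `search` can be sketched with the same sample string: `match` succeeds only when the pattern occurs at the start of the string, while `search` looks anywhere in it.

```python
import re

text = 'Python is fun'

at_start = re.match('Python', text)      # pattern is at the beginning
not_at_start = re.match('fun', text)     # 'fun' is not at the beginning: None
anywhere = re.search('fun', text)        # search finds it later in the string

print(at_start.group() if at_start else 'not found')
print(not_at_start.group() if not_at_start else 'not found')
print(anywhere.group() if anywhere else 'not found')
```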

### Ignoring Case with `flags=re.IGNORECASE` 
* Matches are **case sensitive** by default.
* Many `re` module functions receive an **optional `flags` keyword argument** that changes how regular expressions are matched. 

In [39]:
result3 = re.search('Sam', 'SAM WHITE', flags=re.IGNORECASE)

In [40]:
result3.group() if result3 else 'not found'

'SAM'

### Function `findall` Returns a List of Matching Substrings

In [41]:
contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

In [42]:
re.findall(r'\d{3}-\d{3}-\d{4}', contact)

['555-555-1234', '555-555-4321']

### Function `finditer` Returns a **Lazy Iterable** of Match Objects
* Can save memory over `findall` because it returns one match at a time.

In [43]:
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    print(phone.group())

555-555-1234
555-555-4321
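
Each match object also records where the match was found. The sketch below uses the match object's `start` and `end` methods, which these notes do not cover, to collect each phone number together with its position in `contact`.

```python
import re

contact = 'Wally White, Home: 555-555-1234, Work: 555-555-4321'

found = []  # (matched text, start index, end index) for each phone number
for phone in re.finditer(r'\d{3}-\d{3}-\d{4}', contact):
    found.append((phone.group(), phone.start(), phone.end()))

for number, start, end in found:
    print(f'{number} at [{start}:{end}]')
```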


### More Info
* See Lesson 8 in [**Python Fundamentals LiveLessons** here on O'Reilly Online Learning](https://learning.oreilly.com/videos/python-fundamentals/9780135917411)
* See Chapter 8 in [**Python for Programmers** on O'Reilly Online Learning](https://learning.oreilly.com/library/view/python-for-programmers/9780135231364/)
* Interested in a print book? Check out:

| Python for Programmers | Intro to Python for Computer<br>Science and Data Science
| :------ | :------
| <a href="https://amzn.to/2VvdnxE"><img alt="Python for Programmers cover" src="../images/PyFPCover.png" width="150" border="1"/></a> | <a href="https://amzn.to/2LiDCmt"><img alt="Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud" src="../images/IntroToPythonCover.png" width="159" border="1"></a>

>Please **do not** purchase both books&mdash;_Python for Programmers_ is a subset of _Intro to Python for Computer Science and Data Science_
