# Chapter 21 - Python Regular Expressions
Regular expressions (regex) provide a powerful way to search and manipulate strings. Python's built-in `re` module allows for efficient searching, splitting, and replacement of text using regex patterns. This notebook introduces regular expressions in Python, covering basic patterns, special characters, and useful functions within the `re` module.

## 21.1 Basic Patterns: Matching Literal Strings
The simplest regular expression is a literal string, which matches exactly those characters. For example, the pattern `Python` finds the word 'Python' in a string.

In [None]:
import re
text = "Python is an amazing programming language."
match = re.search("Python", text)
if match:
    print("Found:", match.group())
else:
    print("Not found.")

Found: Python
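A Match object also records where the match was found. The sketch below reuses the sentence from the example above and searches for a different word, "amazing", chosen only for illustration:

```python
import re

text = "Python is an amazing programming language."
match = re.search("amazing", text)

# span() returns the (start, end) indices of the match in the original string
start, end = match.span()
print(start, end)        # 13 20
print(text[start:end])   # amazing
```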


## 21.2 Special Characters in Regular Expressions
Special characters are symbols that control how the search is performed. Here are a few:
- `.` (Dot): Matches any character except newline.
- `^`: Matches the start of a string.
- `$`: Matches the end of a string.
- `[abc]`: Matches a or b or c.
- `[a-zA-Z0-9]`: Matches any single letter (a to z or A to Z) or digit (0 to 9).
- `+`: Matches the preceding element one or more times.
- `*`: Matches the preceding element zero or more times.
- `?`: Matches the preceding element zero or one time.
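As a small illustration of combining a character class with a quantifier, the sketch below extracts runs of letters and digits. The sample string is made up for this example:

```python
import re

# Hypothetical sample text, not from the chapter
sample = "user_42 logged in at 09:15"

# [a-zA-Z0-9]+ matches runs of ASCII letters and digits;
# the underscore, the colon, and the spaces end each run
tokens = re.findall(r"[a-zA-Z0-9]+", sample)
print(tokens)  # ['user', '42', 'logged', 'in', 'at', '09', '15']
```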

In [None]:
import re

# Text for searching
text = "cat mat bat rat"

# Pattern to match any three-letter word ending with 'at'
pattern = ".at"

# Finding all matches
matches = re.findall(pattern, text)

# Printing the matches
print("Matches:", matches)


Matches: ['cat', 'mat', 'bat', 'rat']


In the above example, the pattern `.at` matches any single character followed by 'at', which here finds every three-letter word in the string "cat mat bat rat". The dot `.` matches any character except a newline, so it matches 'c', 'm', 'b', and 'r' in this case, making the full matches 'cat', 'mat', 'bat', and 'rat'.
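Because the dot accepts any character, it also matches digits and symbols. The sketch below extends the sample string with two made-up tokens to show this, then narrows the pattern with a character class and `\b` (a word boundary, which is not covered elsewhere in this chapter):

```python
import re

# The last two tokens are added here purely for illustration
text = "cat mat bat rat 9at @at"

# The dot matches letters, digits, and symbols alike
any_char = re.findall(r".at", text)
print(any_char)      # ['cat', 'mat', 'bat', 'rat', '9at', '@at']

# Restricting the first character to a lowercase letter and
# requiring word boundaries keeps only the three-letter words
letters_only = re.findall(r"\b[a-z]at\b", text)
print(letters_only)  # ['cat', 'mat', 'bat', 'rat']
```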

In [None]:
import re

# Text for searching
text = "Python is fun. Python is easy."

# Pattern to match 'Python' at the start of the string
pattern = "^Python"

# Checking if the pattern matches
match = re.search(pattern, text)

# Printing the result
if match:
    print("Match found at the start of the string.")
else:
    print("No match found at the start of the string.")


Match found at the start of the string.


In the above example, the pattern ^Python is used to check if the string "Python is fun. Python is easy." starts with 'Python'. The caret ^ asserts the position at the start of the string. Since the string indeed starts with 'Python', the output is "Match found at the start of the string."


In [None]:
import re

# Text for searching
text = "Python is fun. Python is easy."

# Pattern to match 'fun' at the start of the string
pattern = "^fun"

# Checking if the pattern matches
match = re.search(pattern, text)

# Printing the result
if match:
    print("Match found at the start of the string.")
else:
    print("No match found at the start of the string.")


No match found at the start of the string.


In the above example, the pattern `^fun` is used to check if the string "Python is fun. Python is easy." starts with 'fun'. The caret `^` asserts the position at the start of the string. Since the string doesn't start with 'fun', the result is "No match found at the start of the string."

In [None]:
import re

# Text for searching
text = "The end of the story"

# Pattern to match 'story' at the end of the string
pattern = "story$"

# Checking if the pattern matches
match = re.search(pattern, text)

# Printing the result
if match:
    print("Match found at the end of the string.")
else:
    print("No match found at the end of the string.")


Match found at the end of the string.


In the above example, the pattern `story$` is used to check if the string "The end of the story" ends with 'story'. The dollar sign `$` asserts the position at the end of the string. Since the string does indeed end with 'story', the output of the code is "Match found at the end of the string."
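The two anchors can be combined so that the pattern must cover the whole string. The sketch below uses a hypothetical helper, `is_all_digits`, to illustrate the idea:

```python
import re

def is_all_digits(s):
    # ^ and $ together require the digits to span the entire string
    # (is_all_digits is a hypothetical helper for this example)
    return bool(re.search(r"^\d+$", s))

print(is_all_digits("12345"))            # True
print(is_all_digits("123a5"))            # False
# Without anchors, a partial match is enough
print(bool(re.search(r"\d+", "123a5")))  # True
```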

In [None]:
re.search(r'[0-6]', 'Number: 5').group()

'5'

In [None]:
# [^5] matches any single character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

# The following finds no match: re.search() returns None,
# so calling .group() on the result raises an AttributeError
#re.search(r'Number: [^5]', 'Number: 5').group()

'Number: 0'
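Since `re.search()` returns `None` when nothing matches, it is safer to check the result before calling `.group()`. A minimal sketch, where `safe_group` is a hypothetical helper name:

```python
import re

def safe_group(pattern, text):
    # Returns the matched text, or None when there is no match
    # (safe_group is a hypothetical helper for this example)
    match = re.search(pattern, text)
    return match.group() if match else None

print(safe_group(r"Number: [^5]", "Number: 0"))  # Number: 0
print(safe_group(r"Number: [^5]", "Number: 5"))  # None
```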

In [None]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

In [None]:
# Matches 'C', then zero or more 'a', then zero or more 'o', then 'kie'
re.search(r'Ca*o*kie', 'Cookie').group()

'Cookie'

In [None]:
# 'u?' makes the 'u' optional, so the pattern matches both 'Color' and 'Colour'
re.search(r'Colou?r', 'Color').group()

'Color'
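To compare the three quantifiers side by side, the sketch below tests each against the same set of made-up words with zero, one, and two 'o' characters:

```python
import re

# One pattern per quantifier, applied to the letter 'o'
patterns = {"+": r"Co+kie", "*": r"Co*kie", "?": r"Co?kie"}
# Illustrative words with zero, one, and two 'o' characters
words = ["Ckie", "Cokie", "Cookie"]

results = {}
for symbol, pattern in patterns.items():
    # Keep the words in which the pattern is found
    results[symbol] = [w for w in words if re.search(pattern, w)]
    print(symbol, results[symbol])

# + ['Cokie', 'Cookie']
# * ['Ckie', 'Cokie', 'Cookie']
# ? ['Ckie', 'Cokie']
```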


- `{x}`: Repeats the preceding element exactly x times.
- `{x,}`: Repeats the preceding element at least x times.
- `{x,y}`: Repeats the preceding element at least x times but no more than y times (written without a space after the comma).

In [None]:
re.search(r'\d{9,10}', '0987654321').group()

'0987654321'
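The three forms can be compared on one input. The sketch below uses a made-up string of numbers and adds `\b` (a word boundary) so that each number is matched as a whole:

```python
import re

# Hypothetical sample with numbers of one to four digits
numbers = "7 42 123 4567"

exactly_three = re.findall(r"\b\d{3}\b", numbers)
at_least_two = re.findall(r"\b\d{2,}\b", numbers)
two_or_three = re.findall(r"\b\d{2,3}\b", numbers)

print(exactly_three)  # ['123']
print(at_least_two)   # ['42', '123', '4567']
print(two_or_three)   # ['42', '123']
```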

## 21.3 Functions in the `re` Module
The `re` module offers a set of functions to perform queries on an input string. Here are some commonly used ones:
- `re.search()`: Returns a Match object if there is a match anywhere in the string.
- `re.match()`: Returns a Match object if the string starts with the pattern.
- `re.findall()`: Returns a list of all non-overlapping matches in the string.
- `re.sub()`: Replaces one or many matches with a string.

In the following example:

- `re.search('fun', text)` searches for the substring 'fun' anywhere in the text. Since 'fun' is found, the code prints a message indicating its presence.
- `re.match('Python', text)` checks whether the text starts with 'Python'. Since it does, the code prints a confirmation message.
- `re.findall('Python', text)` finds all non-overlapping occurrences of 'Python' in the text. It finds three occurrences, and the code prints the count.
- `re.sub('Python', 'Programming', text)` replaces all occurrences of 'Python' with 'Programming', and the code prints the modified text: "Programming is fun. Programming is easy. Programming is powerful."

In [None]:
import re

# Sample text
text = 'Python is fun. Python is easy. Python is powerful.'

# re.search() example
search_result = re.search('fun', text)
if search_result:
    print('re.search(): Found "fun" in the text')
else:
    print('re.search(): "fun" not found')

# re.match() example
match_result = re.match('Python', text)
if match_result:
    print('re.match(): Text starts with "Python"')
else:
    print('re.match(): Text does not start with "Python"')

# re.findall() example
findall_result = re.findall('Python', text)
print('re.findall(): Found occurrences of "Python":', len(findall_result))

# re.sub() example
sub_result = re.sub('Python', 'Programming', text)
print('re.sub(): Replaced "Python" with "Programming":', sub_result)


re.search(): Found "fun" in the text
re.match(): Text starts with "Python"
re.findall(): Found occurrences of "Python": 3
re.sub(): Replaced "Python" with "Programming": Programming is fun. Programming is easy. Programming is powerful.
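Two details are easy to miss: `re.match()` only looks at the start of the string, and `re.sub()` accepts a `count` argument that limits the number of replacements. A short sketch on the same sample text:

```python
import re

text = 'Python is fun. Python is easy. Python is powerful.'

# 'fun' is in the text but not at the start, so re.match() returns None
match_result = re.match('fun', text)
print(match_result)          # None

# re.search() finds it and reports its position
search_span = re.search('fun', text).span()
print(search_span)           # (10, 13)

# count=1 replaces only the first occurrence
first_only = re.sub('Python', 'Programming', text, count=1)
print(first_only)  # Programming is fun. Python is easy. Python is powerful.
```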


## 21.4 Practical Example

You will work with the first part of "The Idiot" by Fyodor Dostoyevsky, available as a free e-book from Project Gutenberg. The novel is about Prince (Knyaz) Lev Nikolayevich Myshkin, a guileless man whose good, kind, simple nature mistakenly leads many to believe he lacks intelligence and insight. The title is an ironic reference to this young man.

You will write some regular expressions to parse the text and complete some exercises.


In [None]:
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Skips the front matter: the kept text starts right after the first match of "I."
    start = re.search(r"I\.", raw).end()
    # Stops at the first match of "II.", discarding everything from there on
    stop = re.search(r"II\.", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)

 towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages found themselves opposite each other. both were young fellows both were rather poorly dressed both had remarkable faces and both were evident
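To see what `preprocess` does without downloading the book, the sketch below applies the same function to a made-up sentence. Every run of characters other than letters, digits, and periods collapses into a single space, which is why "o'clock" becomes "o clock" in the output above:

```python
import re

def preprocess(sentence):
    # Same function as above: replace each run of characters that are not
    # letters, digits, or periods with one space, then lowercase
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Hypothetical sentence, used only for illustration
sample = "At nine o'clock, the train -- at full speed -- arrived!"
cleaned = preprocess(sample)
print(repr(cleaned))  # 'at nine o clock the train at full speed arrived '
```

The trailing "!" is also replaced by a space, so the result ends with one.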

Exercise: Count how many times "the" occurs in the corpus. Hint: Use the `len()` function. Note that the pattern below also counts "the" inside longer words such as "them" or "other".

In [None]:
len(re.findall(r"the", processed_book))

301
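The pattern `the` matches anywhere, including inside longer words. To count only the standalone word, one option is to wrap it in `\b` word boundaries. The sketch below shows the difference on a made-up sentence rather than on the book:

```python
import re

# Hypothetical sentence, used only for illustration
sample = "the other passengers gathered there, and then the train left."

# Counts every occurrence, including inside 'other', 'gathered', 'there', 'then'
substring_count = len(re.findall(r"the", sample))
# Counts only the standalone word 'the'
word_count = len(re.findall(r"\bthe\b", sample))

print(substring_count)  # 6
print(word_count)       # 2
```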

Exercise: Count the questions asked in the text by counting the question marks:

In [None]:
len(re.findall(r'\?', book))

38
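Beyond counting, a pattern can also pull out the questions themselves. The sketch below is one possible approach on a made-up sample: it treats a question as a run of characters without sentence-ending punctuation, followed by a question mark:

```python
import re

# Hypothetical sample, used only for illustration
sample = "Who is he? I do not know. Is he a prince?"

# \? matches a literal question mark
question_count = len(re.findall(r"\?", sample))

# [^.?!]+ matches text up to the next sentence-ending mark;
# the match is kept only when that mark is a question mark
questions = [q.strip() for q in re.findall(r"[^.?!]+\?", sample)]

print(question_count)  # 2
print(questions)       # ['Who is he?', 'Is he a prince?']
```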

## 21.5 Conclusion
Regular expressions are a powerful tool for text processing. Mastery of regex can help you perform complex text manipulations efficiently. Remember to use them judiciously, as complex patterns can become difficult to read and maintain.