# String Manipulations, RegEx

(Lecture 8)

In this lecture we continue our journey through **String Manipulation** after reviewing **list comprehensions**.


1. [List Comprehensions](#List-Comprehensions)
   * [Dictionary Comprehensions](#Dictionary-Comprehensions)
2. [String Operations](#String-Operations)
   * [Sundry Methods](#Sundry-Methods)
   * [Find and Replace](#Find-and-Replace)
3. [Regular Expressions](#Regular-Expressions)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#new module
import re

## List Comprehensions

List comprehensions are a convenient and widely used Python language feature. They allow you to concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one concise expression. They take the basic form:

`[expr for value in collection if condition]`

This is equivalent to the following for loop:

```
result = []
for value in collection:
    if condition:
        result.append(expr)
```

The filter condition can be omitted, leaving only the expression. For example, given a list of strings, we could filter out strings of length 3 or less and convert the remaining ones to uppercase like this:

In [None]:
strings = ["eye", "lived devil", "wow", "deed", "noon", "kayak"]

results = []
#convert those strings to upper case that are longer than 3 characters
for x in strings:
  if len(x) > 3:
    results.append(x.upper())

results

['LIVED DEVIL', 'DEED', 'NOON', 'KAYAK']

In [None]:
[x.upper() for x in strings if len(x) > 3]

['LIVED DEVIL', 'DEED', 'NOON', 'KAYAK']
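
The text above mentions that the filter condition can be omitted. A minimal sketch of that case, reusing the same `strings` list (the variable names `lengths` and `shouted` are only illustrative):

```python
strings = ["eye", "lived devil", "wow", "deed", "noon", "kayak"]

# No "if" clause: every element of the collection is transformed
lengths = [len(x) for x in strings]
shouted = [x.upper() for x in strings]

print(lengths)
print(shouted)
```

Without the condition, the result always has the same number of elements as the input collection.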

### Dictionary Comprehensions

A dictionary comprehension looks like this:

```
dict_comp = {key-expr: value-expr for value in collection
             if condition}
```

As a simple dictionary comprehension example, we could create a lookup map of these strings for their locations in the list:

In [None]:
results = {}
for i, x in enumerate(strings):
  results[x] = i

results

{'eye': 0, 'lived devil': 1, 'wow': 2, 'deed': 3, 'noon': 4, 'kayak': 5}

In [None]:
results = {x:i for i, x in enumerate(strings) if len(x) > 3}
results

{'lived devil': 1, 'deed': 3, 'noon': 4, 'kayak': 5}

Like list comprehensions, set and dictionary comprehensions are mostly conveniences, but they similarly can make code both easier to write and read. Consider the list of strings from before. Suppose we wanted a set containing just the lengths of the strings contained in the collection; we could easily compute this using a set comprehension:


In [None]:
unique_lengths = {len(x) for x in strings}
unique_lengths
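
As a rough check of what the set comprehension produces, the sketch below rebuilds the same set with `set(map(...))`, which should be equivalent here. The name `via_map` is only illustrative.

```python
strings = ["eye", "lived devil", "wow", "deed", "noon", "kayak"]

# Set comprehension: duplicate lengths collapse into one entry
unique_lengths = {len(x) for x in strings}

# The same result expressed with the built-in map function
via_map = set(map(len, strings))

# Sets are unordered, so sort before printing for a stable display
print(sorted(unique_lengths))
print(unique_lengths == via_map)
```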

## String Operations

[Built-in methods](https://github.com/markusloecher/Python-Workshop/blob/main/Lectures/Readme.md)

In [None]:
my_string = "A man, a plan, a canal: Panama"
my_string.lower()  # returns a new string; my_string itself is not modified

'a man, a plan, a canal: panama'

In [None]:
my_string

'A man, a plan, a canal: Panama'

In [None]:
my_string.count("a")#it is case sensitive

9
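
Because `count` is case sensitive, the leading "A" is not counted. One possible way to count both cases is to normalize the string first; this is a small sketch, not the only option:

```python
my_string = "A man, a plan, a canal: Panama"

# Case-sensitive: only lowercase "a" is counted
case_sensitive = my_string.count("a")

# Lowercase a copy first so that "A" and "a" are counted together
case_insensitive = my_string.lower().count("a")

print(case_sensitive, case_insensitive)
```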

#### Splitting

In [None]:
IndWords = my_string.split(sep = " ")#both the colon and the commas are still present
print(IndWords)

['A', 'man,', 'a', 'plan,', 'a', 'canal:', 'Panama']


In [None]:
print(my_string.split(sep=","))
print(my_string.split(maxsplit=2))
#Breaking at line boundaries
print(my_string.splitlines())

['A man', ' a plan', ' a canal: Panama']
['A', 'man,', 'a plan, a canal: Panama']
['A man, a plan, a canal: Panama']


In [None]:
my_string = """A man, a plan, a canal:
Panama"""
#Breaking at line boundaries
print(my_string.splitlines())

In [8]:
my_string = "A man, a plan, a canal\n Panama"
print(my_string)
print(my_string.splitlines())

A man, a plan, a canal
 Panama
['A man, a plan, a canal', ' Panama']


#### Joining

`str.join` concatenates the strings of an iterable, placing the separator string between them.


In [None]:
print(" ".join(IndWords))

A man, a plan, a canal: Panama
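
A short sketch of two further uses of `join`: a different separator, and an iterable of numbers. `join` only accepts strings, so the numbers are converted with `str` first. The lists `words` and `numbers` are made up for illustration.

```python
words = ["A", "man,", "a", "plan,", "a", "canal:", "Panama"]

# Any string can serve as the separator
dashed = "-".join(words)

# Non-string items must be converted before joining
numbers = [6, 25, 2023]
date_like = "-".join(str(n) for n in numbers)

print(dashed)
print(date_like)
```

Splitting on a separator and joining with the same separator returns the original string.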


#### Slicing

In [9]:
print(my_string[0:5])#the same for numpy arrays or lists !

A man


In [None]:
print(my_string[10:20:2])

ln  a


#### Palindromes

Which words are palindromes? A string is a palindrome if it equals its own reverse, which the slice `[::-1]` produces.

In [12]:
"eye kayak"[::-1]
"Berlin"[::-1]

'nilreB'

In [None]:
"edit tide"[::-1]

'edit tide'

In [None]:
print(my_string[::-1])
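
Reversing the full sentence does not reproduce it exactly, because case, spaces and punctuation differ. A minimal sketch of a check that ignores those differences follows; `is_palindrome` is a hypothetical helper, not a built-in.

```python
def is_palindrome(text):
    """Hypothetical helper: compare letters and digits only, ignoring case."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

sentence_result = is_palindrome("A man, a plan, a canal: Panama")
word_result = is_palindrome("Berlin")

print(sentence_result, word_result)
```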

### Find and Replace

In [None]:
my_string.find("Panama")#success

24

In [13]:
my_string.find("Berlin")#not found, failure

-1

In [None]:
my_string.index("Panama")

24

In [14]:
my_string.replace("Panama", "Suez")

'A man, a plan, a canal\n Suez'

In [None]:
my_string.replace("Berlin", "Suez")#failure mode
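
The three methods behave differently when the substring is missing. A small sketch of the failure cases, using a fresh copy of the example string:

```python
my_string = "A man, a plan, a canal: Panama"

# find signals a miss with -1
position = my_string.find("Berlin")

# index raises a ValueError instead of returning -1
try:
    my_string.index("Berlin")
    index_failed = False
except ValueError:
    index_failed = True

# replace silently returns an unchanged copy when nothing matches
unchanged = my_string.replace("Berlin", "Suez")

print(position, index_failed, unchanged == my_string)
```

Which behavior is preferable depends on whether a missing substring should be treated as an error or as a normal case.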

## Regular Expressions

<p>The <code class="w3-codespan">re</code> module offers a set of functions that allows
us to search a string for a match:</p>

<table class="ws-table-all notranslate">
<tr>
<th style="width:120px">Function</th>
<th>Description</th>
</tr>
<tr>
<td>findall</td>
<td>Returns a list containing all matches</td>
</tr>
<tr>
<td>search</td>
<td>Returns a Match object if there is a match anywhere in the string</td>
</tr>
<tr>
<td>split</td>
<td>Returns a list where the string has been split at each match</td>
</tr>
<tr>
<td>sub</td>
<td>Replaces one or many matches with a string</td>
</tr>
</table>
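
A brief sketch of `search`, `split` and `sub` from the table above, since the later examples mostly rely on `findall`. The sample strings are chosen only for illustration.

```python
import re

text = "Possible exam dates: 6-25 or 7-5"

# search: Match object for the first match, or None if nothing matches
match = re.search(r"\d+-\d+", text)
first_date = match.group()   # the matched text
first_span = match.span()    # (start, end) position of the match
no_match = re.search(r"Berlin", text)

# split: cut the string at every match of the pattern
parts = re.split(r"\s+or\s+", "6-25 or 7-5 or 2023-12-24")

# sub: replace every match with a replacement string
masked = re.sub(r"\d", "#", "User2-4")

print(first_date, first_span, no_match)
print(parts)
print(masked)
```

Since `search` returns `None` when there is no match, it is usually safer to test the result before calling `.group()`.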

<p>Metacharacters are characters with a special meaning:</p>

<table class="ws-table-all notranslate">
<tr>
<th style="width:120px">Character</th>
<th>Description</th>
<th style="width:120px">Example</th>
</tr>
<tr>
<td>[]</td>
<td>A set of characters</td>
<td>&quot;[a-m]&quot;</td>
</tr>
<tr>
<td>\</td>
<td>Signals a special sequence (can also be used to escape special characters)</td>
<td>&quot;\d&quot;</td>
</tr>
<tr>
<td>.</td>
<td>Any character (except newline character)</td>
<td>&quot;he..o&quot;</td>
</tr>
<tr>
<td>^</td>
<td>Starts with</td>
<td>&quot;^hello&quot;</td>
</tr>
<tr>
<td>*</td>
<td>Zero or more occurrences</td>
<td>&quot;he.*o&quot;</td>
</tr>
<tr>
<td>+</td>
<td>One or more occurrences</td>
<td>&quot;he.+o&quot;</td>
</tr>
<tr>
<td>?</td>
<td>Zero or one occurrences</td>
<td>&quot;he.?o&quot;</td>
</tr>
<tr>
<td>{}</td>
<td>Exactly the specified number of occurrences</td>
<td>&quot;he.{2}o&quot;</td>
</tr>
<tr>
<td>|</td>
<td>Either or</td>
<td>&quot;falls|stays&quot;</td>
</tr>
<tr>
<td>()</td>
<td>Capture and group</td>
<td>&nbsp;</td>
</tr>
</table>
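
The example patterns from the table can be tried directly with `findall`. The sketch below uses made-up sample strings and also adds one capture-group pattern, since the table gives no example for `()`.

```python
import re

text = "hello planet"

any_two = re.findall(r"he..o", text)       # "." stands for any single character
at_start = re.findall(r"^hello", text)     # "^" anchors the match at the start
in_set = re.findall(r"[a-m]", text)        # one character from the set a to m
exactly_two = re.findall(r"he.{2}o", text) # "{2}" repeats "." exactly twice

# "|" matches either alternative
either = re.findall(r"falls|stays", "The rain in Spain falls mainly in the plain!")

# With capture groups, findall returns the captured parts as tuples
pairs = re.findall(r"(\d+)-(\d+)", "6-25 or 7-5")

print(any_two, at_start, exactly_two)
print(in_set)
print(either)
print(pairs)
```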

In [None]:
#Find digits via \d
string="The winners are: User2-4, UserN, User1, UserMarkus"
re.findall(r"User\d", string)

['User2', 'User1']

In [None]:
#match a non-digit character via \D
re.findall(r"User\D", string)

['UserN', 'UserM']

In [None]:
#match a word character (letter, digit or underscore) via \w
re.findall(r"User\w", string)

['User2', 'UserN', 'User1', 'UserM']

In [None]:
#match a whitespace character via \s
string="Is it New York or  New-York?"
re.findall(r"New\sYork", string)

['New York']

In [None]:
#match a non-whitespace character via \S
string="Is it New York or  New-York?"
re.findall(r"New\SYork", string)

['New-York']

### Repetitions


In [None]:
string = "my password is password1234"
#re.search(r"\w{8}\d{4}", string)
re.findall(r"\w{8}\d{4}", string)

['password1234']

### Quantifiers

A quantifier specifies how many times the element immediately to **its left** must be matched:

- `+` once or more
- `*` zero times or more
- `?` zero times or once
- `{n,m}` n times at least, m times at most

In [None]:
text = "Possible exam dates: 6-25 or 7-5 or 2023-12-24"
re.findall(r"\d+-\d+", text)

['6-25', '7-5', '2023-12']

In [None]:
text = "Possible exam dates: 6-25 or 7-5 or 2023-12-24"
re.findall(r"\d+-\d+-*\d*", text)

['6-25', '7-5', '2023-12-24']

In [None]:
text = "Possible exam dates: 6-25 or 7-5 or 2023-12-24"
re.findall(r"\d+-\d+-?\d?", text)

['6-25', '7-5', '2023-12-2']
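
The `-*\d*` variant finds all three dates but is fairly permissive, because the dash and the digits are optional independently of each other. One possible tightening, offered as a sketch rather than the lecture's own solution, makes the whole "dash plus digits" part optional with a non-capturing group `(?:...)`:

```python
import re

text = "Possible exam dates: 6-25 or 7-5 or 2023-12-24"

# (?:-\d+)? : an optional third component, dash and digits together
strict = re.findall(r"\d+-\d+(?:-\d+)?", text)

# The looser pattern also accepts trailing dashes without digits
loose_on_noise = re.findall(r"\d+-\d+-*\d*", "1-2---")
strict_on_noise = re.findall(r"\d+-\d+(?:-\d+)?", "1-2---")

print(strict)
print(loose_on_noise, strict_on_noise)
```

The group is non-capturing so that `findall` still returns the full matched text rather than only the group content.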

"Escaping" the special meaning of `+`

In [None]:
phone_number = "Mobile: +49 1578-1378941 or (49)1578-1378941"
re.findall(r"\+\d{2}\s*\d{4}-\d{7}", phone_number)

['+49 1578-1378941']

"Escaping" the special meaning of `()`

In [None]:
re.findall(r"\(\d{2}\)\s*\d{4}-\d{7}", phone_number)

['(49)1578-1378941']

The or operator `|`

In [None]:
re.findall(r"\+\d{2}\s*\d{4}-\d{7}|\(\d{2}\)\s*\d{4}-\d{7}", phone_number)

['+49 1578-1378941', '(49)1578-1378941']

In [None]:
phone_number = "At HWR: 030-30877-1443 Mobile: 01578-1378941 or "
re.findall(r"\d{2,3}-\d{4,5}-\d{4}|\d{5}-\d{7}", phone_number)

['030-30877-1443', '01578-1378941']

### Special Characters


- `.` Match any character (except newlines)
- `^` Start of the String
- `$` End of the String
- `\` Escape special characters
- `|` OR operator
- `[]` set of characters (matches any single character listed inside)

In [None]:
my_links = "check out my blogs: https://markusloecher.github.io codeandstats.github.io"
re.findall(r"https://.+github.io", my_links)

['https://markusloecher.github.io codeandstats.github.io']
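
The result above spans both links because `.+` is greedy: it matches as much as possible, including the space. Two possible ways to stop at the first link are sketched below; both go slightly beyond the patterns shown so far, so treat them as optional extras.

```python
import re

my_links = "check out my blogs: https://markusloecher.github.io codeandstats.github.io"

# \S+ cannot cross whitespace, so the match ends with the first link
no_spaces = re.findall(r"https://\S+github\.io", my_links)

# .+? is the lazy version of .+ and matches as little as possible
lazy = re.findall(r"https://.+?github\.io", my_links)

print(no_spaces)
print(lazy)
```

Both variants also escape the dot in `github\.io` so that it matches a literal period.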

In [None]:
print(re.findall(r"^out", my_links))
re.findall(r"^check", my_links)

[]


['check']

In [None]:
re.findall(r"\w+\.github.io$", my_links)

['codeandstats.github.io']

In [None]:
my_links = "check out my blogs: https://markusloecher.github.io codeandstats.github.io"
re.findall(r"[a-z.]github.io", my_links)

['.github.io', '.github.io']
