<table border="0" style="width:100%">
 <tr>
    <td>
        <img src="https://static-frm.ie.edu/university/wp-content/uploads/sites/6/2022/06/IE-University-logo.png" width=150>
     </td>
    <td><div style="font-family:'Courier New'">
            <div style="font-size:25px">
                <div style="text-align: right"> 
                    <b> MASTER IN BIG DATA</b>
                    <br>
                    Python for Data Analysis II
                    <br><br>
                    <em> Daniel Sierra Ramos </em>
                </div>
            </div>
        </div>
    </td>
 </tr>
</table>

# **S02: REGULAR EXPRESSIONS AND VECTORIZED STRING OPERATIONS**

Suppose we have this string in a Python variable:

    text = "This is          a text"

We want to split it into words and discard the blank spaces. Splitting on a single space leaves an empty string for every extra space:

In [10]:
text = "This is          a text"

In [11]:
text.split(" ")

['This', 'is', '', '', '', '', '', '', '', '', '', 'a', 'text']

**A regular expression is a sequence of characters that specifies a _search pattern_ in a string**

The Python interface to regular expressions is contained in the built-in `re` module. The regular expression syntax is a small, specialized language for building search patterns. It is not a complex syntax, but it has many options.

Regular expressions solve the splitting problem above:

In [12]:
import re

regex = re.compile(r'\s+')
regex.split(text)

['This', 'is', 'a', 'text']

In this case, the search pattern is ``"\s+"``: ``\s`` is a special sequence that matches any whitespace character (space, tab, newline, etc.), and ``+`` means *one or more* of the entity preceding it. So, in essence, we are still splitting on whitespace, but this is more powerful because a run of **any number of whitespace characters** counts as a single separator.
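As a small sketch of the same idea, the snippet below compares `str.split()` without arguments, which also collapses runs of whitespace, with `re.split` on a pattern that accepts several delimiters at once. The `messy` string and the `[\s,;]+` pattern are illustrative assumptions, not part of the original example.

```python
import re

text = "This is          a text"

# str.split() with no argument already collapses runs of whitespace
words = text.split()
print(words)   # ['This', 'is', 'a', 'text']

# A regular expression can go further and treat several delimiters alike:
# here, any run of whitespace, commas, or semicolons (illustrative pattern)
messy = "alpha,  beta;gamma   delta"
parts = re.split(r'[\s,;]+', messy)
print(parts)   # ['alpha', 'beta', 'gamma', 'delta']
```

For plain whitespace the built-in method is enough; the regular expression pays off once the separator itself needs a pattern.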

Let's see another example

In [13]:
line = "The quick brown fox jumped over a lazy dog"

In [14]:
regex = re.compile(r'\s\w{3}\s')
regex.split(line)

['The quick brown', 'jumped over a lazy dog']

"The quick brown fox jumped over a lazy dog"

In this case, we're splitting the text by `\s\w{3}\s`, that is: a whitespace character (`\s`), three consecutive alphanumeric characters (`\w{3}`), and another whitespace character (`\s`). If you check the string again, the only substring that matches this pattern is **" fox "**.

In the `re` module, a regular expression can be compiled with `compile`, which turns it into a reusable `Pattern` object (module-level functions such as `re.split` compile the pattern for you):

In [15]:
type(regex)

re.Pattern

There are many string operations we can do with a regular expression. Check the full list with `regex.` tab completion. The most common are:
 - `findall` - Return a list of all non-overlapping matches of pattern in string.
 - `search` - Scan through string looking for a match, and return a corresponding match object instance
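The difference between the two methods can be sketched as follows, reusing the `line` string and pattern from the previous example. `search` returns a match object (or `None`), from which the matched text and its position can be read, while `findall` returns plain strings.

```python
import re

line = "The quick brown fox jumped over a lazy dog"
regex = re.compile(r'\s\w{3}\s')

# search returns a match object for the first match, or None if there is none
match = regex.search(line)
if match is not None:
    print(match.group())   # the matched text: ' fox '
    print(match.span())    # (start, end) positions: (15, 20)

# findall returns every non-overlapping match as a list of strings
all_matches = regex.findall(line)
print(all_matches)         # [' fox ']

# When nothing matches, search returns None and findall an empty list
no_match = regex.search("abc")
print(no_match)            # None
```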

### A more sophisticated example

In [16]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')  # a simple pattern for an email address

Using this, if we're given a line from a document, we can quickly extract things that look like email addresses

In [17]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email.findall(text)

['guido@python.org', 'guido@google.com']

In [18]:
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

In [19]:
# for the Obama example we need a pattern that allows a dot in the user name

In [20]:
email = re.compile(r'\w+\.\w+@\w+\.\w{2,3}')

In [21]:
email.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

### Basics of regular expression syntax

#### Simple strings are matched directly

If you build a regular expression on a simple string of characters or digits, it will match that exact string:

In [22]:
regex = re.compile('ion')
regex.findall('Great Expectations')

['ion']

#### Some characters have special meanings

While simple letters or numbers are direct matches, there are a handful of characters that have special meanings within regular expressions. They are:
```
. ^ $ * + ? { } [ ] \ | ( )
```
We will discuss the meaning of some of these momentarily.
In the meantime, you should know that if you'd like to match any of these characters directly, you can *escape* them with a backslash:

In [14]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

The ``r`` prefix in ``r'\$'`` indicates a *raw string*; in standard Python strings, the backslash is used to indicate special characters.
For example, a tab is indicated by ``"\t"``:

In [15]:
print('a\tb\tc')

a	b	c


Such substitutions are not made in a raw string:

In [16]:
print(r'a\tb\tc')

a\tb\tc


For this reason, whenever you use backslashes in a regular expression, it is good practice to use a raw string.
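A short sketch of what the raw string saves: without it, every backslash meant for the regex engine has to be doubled. The example also uses `re.escape`, a standard-library helper that adds the backslashes needed to match a piece of text literally. The sample sentence is an assumption made for illustration.

```python
import re

# The same pattern written as a raw string and as a normal string
pattern_raw = r'\$\d+'
pattern_escaped = '\\$\\d+'
print(pattern_raw == pattern_escaped)   # True

prices = re.findall(pattern_raw, "the cost is $20, not $5")
print(prices)                           # ['$20', '$5']

# re.escape builds a pattern that matches the given text literally
literal = re.escape('$20')
print(literal)                          # \$20
literal_hits = re.findall(literal, "the cost is $20, not $5")
print(literal_hits)                     # ['$20']
```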

#### Putting together several characters to build the search pattern

In [17]:
regex = re.compile(r'\w\s\w')
regex.findall('the fox is 9 years old')

['e f', 'x i', 's 9', 's o']

The following table lists a few of these characters that are commonly useful:

| Character | Description                                  | Character | Description                                   |
|-----------|----------------------------------------------|-----------|-----------------------------------------------|
| ``"\d"``  | Match any digit                              | ``"\D"``  | Match any non-digit                           |
| ``"\s"``  | Match any whitespace                         | ``"\S"``  | Match any non-whitespace                      |
| ``"\w"``  | Match any alphanumeric char (or underscore)  | ``"\W"``  | Match any char that is not alphanumeric or underscore |

This is *not* a comprehensive list or description; for more details, see Python's [regular expression syntax documentation](https://docs.python.org/3/library/re.html#re-syntax).
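To see each class from the table in action, the sketch below applies them to one sample string. The string is an assumption chosen so that it contains digits, an underscore, spaces, and punctuation.

```python
import re

sample = "Room_42 costs 9 EUR!"

digits = re.findall(r'\d', sample)
print(digits)          # ['4', '2', '9']

non_digits = re.findall(r'\D+', sample)
print(non_digits)      # ['Room_', ' costs ', ' EUR!']

# \w includes the underscore, so 'Room_42' stays in one piece
words = re.findall(r'\w+', sample)
print(words)           # ['Room_42', 'costs', '9', 'EUR']

non_word = re.findall(r'\W', sample)
print(non_word)        # [' ', ' ', ' ', '!']

non_space = re.findall(r'\S+', sample)
print(non_space)       # ['Room_42', 'costs', '9', 'EUR!']
```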

#### Square brackets match custom character groups

If the built-in character groups aren't specific enough for you, you can use square brackets to specify any set of characters you're interested in.
For example, the following will match any lower-case vowel:

In [18]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

Similarly, you can use a dash to specify a range: for example, ``"[a-z]"`` will match any lower-case letter, and ``"[1-3]"`` will match any of ``"1"``, ``"2"``, or ``"3"``.
For instance, you may need to extract from a document specific numerical codes that consist of a capital letter followed by a digit. You could do this as follows:

In [19]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6')

['G2', 'H6']

#### Wildcards match repeated characters

If you would like to match a string with, say, three alphanumeric characters in a row, it is possible to write, for example, ``"\w\w\w"``.
Because this is such a common need, there is a specific syntax to match repetitions – curly braces with a number:

In [20]:
regex = re.compile(r'\w{3}')
regex.findall('The quick brown fox')

['The', 'qui', 'bro', 'fox']

There are also markers available to match any number of repetitions – for example, the ``"+"`` character will match *one or more* repetitions of what precedes it:

In [21]:
regex = re.compile(r'\w+')
regex.findall('The quick brown fox')

['The', 'quick', 'brown', 'fox']

The following is a table of the repetition markers available for use in regular expressions:

| Character | Description | Example |
|-----------|-------------|---------|
| ``?`` | Match zero or one repetitions of preceding  | ``"ab?"`` matches ``"a"`` or ``"ab"`` |
| ``*`` | Match zero or more repetitions of preceding | ``"ab*"`` matches ``"a"``, ``"ab"``, ``"abb"``, ``"abbb"``... |
| ``+`` | Match one or more repetitions of preceding  | ``"ab+"`` matches ``"ab"``, ``"abb"``, ``"abbb"``... but not ``"a"`` |
| ``{n}`` | Match ``n`` repetitions of preceding | ``"ab{2}"`` matches ``"abb"`` |
| ``{m,n}`` | Match between ``m`` and ``n`` repetitions of preceding | ``"ab{2,3}"`` matches ``"abb"`` or ``"abbb"`` |
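The examples in the table can be checked with a short loop. This sketch uses `re.fullmatch`, which succeeds only when the whole string matches the pattern; the list of candidate strings is an assumption chosen to cover every row.

```python
import re

candidates = ['a', 'ab', 'abb', 'abbb', 'abbbb']
patterns = ['ab?', 'ab*', 'ab+', 'ab{2}', 'ab{2,3}']

# For each pattern, keep the candidates that it matches in full
accepted = {
    pattern: [s for s in candidates if re.fullmatch(pattern, s)]
    for pattern in patterns
}

for pattern, strings in accepted.items():
    print(pattern, '->', strings)
# ab? -> ['a', 'ab']
# ab* -> ['a', 'ab', 'abb', 'abbb', 'abbbb']
# ab+ -> ['ab', 'abb', 'abbb', 'abbbb']
# ab{2} -> ['abb']
# ab{2,3} -> ['abb', 'abbb']
```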

In [22]:
regex = re.compile(r'\w{3,4}')
regex.findall('The quick brown fox')

['The', 'quic', 'brow', 'fox']

With these basics in mind, let's return to our email address matcher:

In [23]:
email = re.compile(r'\w+@\w+\.[a-z]{3}')
email.findall('barack.obama@whitehouse.gov')

['obama@whitehouse.gov']

Our email regular expression does not handle addresses with a dot in the user name, so let's change the regex a little:

In [24]:
email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
email2.findall('barack.obama@whitehouse.gov')

['barack.obama@whitehouse.gov']

We have changed ``"\w+"`` to ``"[\w.]+"``, so we will match any alphanumeric character *or* a period.
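One detail worth a quick sketch: outside square brackets an unescaped `.` matches any character, while inside brackets it is a literal period, which is why `[\w.]` needs no backslash. The sample sentence with `first.last@example.com` is an assumption made for illustration.

```python
import re

# Outside brackets, '.' matches any character (except a newline)
any_char = re.findall(r'.', 'a.b')
print(any_char)        # ['a', '.', 'b']

# Inside brackets, '.' is just a period
literal_dot = re.findall(r'[.]', 'a.b')
print(literal_dot)     # ['.']

email2 = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
addresses = email2.findall('Write to first.last@example.com or guido@python.org')
print(addresses)       # ['first.last@example.com', 'guido@python.org']
```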

#### Parentheses indicate *groups* to extract

For compound regular expressions like our email matcher, we often want to extract their components rather than the full match. This can be done using parentheses to *group* the results:

In [25]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [26]:
text = "To email Guido, try guido@python.org or the older address guido@google.com."
email3.findall(text)

[('guido', 'python', 'org'), ('guido', 'google', 'com')]

As we see, this grouping actually extracts a list of the sub-components of the email address.

We can go a bit further and *name* the extracted components using the ``"(?P<name> )"`` syntax, in which case the groups can be extracted as a Python dictionary:

In [27]:
email4 = re.compile(r'(?P<user>\w+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
match = email4.match('guido@python.org')
match.groupdict()

{'user': 'guido', 'domain': 'python', 'suffix': 'org'}
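Named groups also combine well with `finditer`, which yields one match object per occurrence. The sketch below reuses `email4` on the longer sentence from earlier and shows why `match` was called on a bare address above: `match` only succeeds at the very start of the string, whereas `search` scans through it.

```python
import re

email4 = re.compile(r'(?P<user>\w+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
text = "To email Guido, try guido@python.org or the older address guido@google.com."

# One dictionary per address found in the text
found = [m.groupdict() for m in email4.finditer(text)]
print(found)

# match() anchors at the beginning of the string, so it fails here...
at_start = email4.match(text)
print(at_start)                 # None

# ...while search() finds the first address anywhere in the string
first = email4.search(text)
print(first.group('domain'))    # python
```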

### More resources on regular expressions
This is just the beginning. The regular expression language is much more extensive, but these are the most typical structures.
 - Cheatsheet - https://cheatography.com/davechild/cheat-sheets/regular-expressions/
 - Online regex validator - https://regex101.com/

# Vectorized String Operations with `pandas`

Pandas provides vectorized string operations that also handle missing data correctly, through the ``str`` attribute of Pandas Series and Index objects containing strings.
For example, suppose we create a Pandas Series from the data below.

### The `str` attribute

In [56]:
data = ["peter", "Paul", None, "MARY", "gUIDO"]

In [57]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

In [58]:
names.str

<pandas.core.strings.accessor.StringMethods at 0x7f19eeac8880>

In [59]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.
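To see what the ``str`` accessor saves us, here is a minimal sketch comparing it with a plain Python list comprehension on the same data. The comprehension fails on the missing value, while the vectorized method passes it through as missing.

```python
import pandas as pd

data = ["peter", "Paul", None, "MARY", "gUIDO"]

# Plain Python: None has no capitalize() method, so this raises an error
try:
    [s.capitalize() for s in data]
    plain_python_failed = False
except AttributeError as err:
    plain_python_failed = True
    print("plain Python fails:", err)

# Pandas: the missing value is skipped and stays missing in the result
names = pd.Series(data)
capitalized = names.str.capitalize()
print(capitalized)
```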

### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [60]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

In [61]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [62]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [63]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [64]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

### Methods using regular expressions

In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.search()`` on each element, returning the groups of the first match as strings. |
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |
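A brief sketch of three of the methods from the table, applied to the `monte` Series. The patterns are illustrative assumptions; `regex=True` is passed explicitly to `replace()` so the pattern is treated as a regular expression.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): does the pattern occur anywhere in each element?
is_terry = monte.str.contains(r'^Terry')
print(is_terry.tolist())       # [False, False, True, False, True, False]

# count(): how many lower-case vowels does each name have?
vowel_counts = monte.str.count(r'[aeiou]')
print(vowel_counts.tolist())   # [4, 4, 4, 2, 3, 5]

# replace(): swap first and last name using group references
swapped = monte.str.replace(r'^(\w+) (\w+)$', r'\2, \1', regex=True)
print(swapped.tolist()[:2])    # ['Chapman, Graham', 'Cleese, John']
```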

With these, you can do a wide range of interesting operations.
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:

In [65]:
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [66]:
monte.str.extract(r'([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:

In [68]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.
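As one possible cleaning step, the sketch below combines two ideas from this section: named groups with `str.extract()`, which by default returns a `DataFrame` with one column per group, and `str.match()` used as a boolean mask to filter the Series. The group names `first` and `last` are assumptions made for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Named groups become column names in the resulting DataFrame
name_parts = monte.str.extract(r'^(?P<first>\w+)\s+(?P<last>\w+)$')
print(name_parts)

# A boolean mask keeps only names that start and end with a consonant
mask = monte.str.match(r'^[^AEIOU].*[^aeiou]$')
consonant_names = monte[mask]
print(consonant_names.tolist())
# ['Graham Chapman', 'Terry Gilliam', 'Terry Jones', 'Michael Palin']
```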

### Vectorized string indexing and slicing


The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access within each string.
For example, we can get a slice of the first three characters of each string using ``str.slice(0, 3)``.
Note that this behavior is also available through Python's normal indexing syntax: for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]`` (here ``df`` stands for any Series of strings):

In [69]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise equivalent.
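The equivalences mentioned above can be confirmed with a short sketch on the `monte` Series:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# get(i) and [i] both pick the i-th character of every string
first_chars_get = monte.str.get(0)
first_chars_idx = monte.str[0]
print(first_chars_get.tolist())   # ['G', 'J', 'T', 'E', 'T', 'M']

# slice(0, 3) and [0:3] both take the first three characters
sliced_method = monte.str.slice(0, 3)
sliced_syntax = monte.str[0:3]
print(sliced_method.equals(sliced_syntax))   # True
```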

These ``get()`` and ``slice()`` methods also let you access elements of the lists returned by ``split()``.
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [71]:
monte

0    Graham Chapman
1       John Cleese
2     Terry Gilliam
3         Eric Idle
4       Terry Jones
5     Michael Palin
dtype: object

In [70]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
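An alternative sketch for the same task: passing ``expand=True`` to ``split()`` returns a ``DataFrame`` with one column per piece instead of a Series of lists, which can be convenient when both parts are needed. The column names `first` and `last` are assumptions made for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True spreads the pieces over columns (numbered 0, 1, ... by default)
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']
print(parts)

# The 'last' column agrees with the split() + get(-1) approach
last_names = monte.str.split().str.get(-1)
print(parts['last'].tolist() == last_names.tolist())   # True
```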