# import

In [1]:
import re

# What is it and what do we use it for?

* Regular expressions (or regex or regexp) are useful for <b> extracting information from any text </b> by searching for one or more matches of a specific search pattern
    * i.e. a specific sequence of characters
* They can be used for parsing, replacing strings, translating data to other formats, and web scraping
* Once you've learned the syntax you can use this tool in almost all programming languages!
    * though support for the most advanced features/syntax differs slightly between languages

# Basic Topics
## Anchors -  ^ and dollar signs (replaced with £)
* <b> ^The </b>: matches any string that starts with 'The'
* <b> end£ </b> : matches any string that ends with 'end'
* <b> ^The end£ </b>  :  exact string match (starts and ends with 'The end')
* <b> roar </b>  : matches any string that has the text 'roar' in it
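
A minimal sketch of the four anchor patterns above using Python's `re` module. Real code has to use the actual dollar sign `$` where these notes write £, and the sample strings are made up for illustration.

```python
import re

# In real code the end anchor is the dollar sign ($), written as £ in these notes.
samples = ["The end", "The beginning", "This is the end", "a lion's roar is loud"]

starts_with_the = [s for s in samples if re.search(r"^The", s)]
ends_with_end = [s for s in samples if re.search(r"end$", s)]
exact_match = [s for s in samples if re.search(r"^The end$", s)]
contains_roar = [s for s in samples if re.search(r"roar", s)]

print(starts_with_the)  # ['The end', 'The beginning']
print(ends_with_end)    # ['The end', 'This is the end']
print(exact_match)      # ['The end']
print(contains_roar)    # ["a lion's roar is loud"]
```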

## Quantifiers - * + ? and { }
```
abc*        matches a string that has ab followed by zero or more c 

abc+        matches a string that has ab followed by one or more c

abc?        matches a string that has ab followed by zero or one c

abc{2}      matches a string that has ab followed by 2 c

abc{2,}     matches a string that has ab followed by 2 or more c

abc{2,5}    matches a string that has ab followed by 2 up to 5 c

a(bc)*      matches a string that has a followed by zero or more copies of the sequence bc

a(bc){2,5}  matches a string that has a followed by 2 up to 5 copies of the sequence bc

```

* example
    * match
        * 'nooooo'  (an n followed by 5 o's)
            * `no*`
            * `no+`
            * `no{5}`
            * `no{1,5}`
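
The quantifiers can be checked against 'nooooo' with a short sketch. It uses `re.fullmatch`, which only succeeds when the pattern consumes the whole string; the extra patterns `no?`, `no{2}` and `(no)*` are added here purely for contrast.

```python
import re

text = "nooooo"  # 'n' followed by five 'o' characters

patterns = [r"no*", r"no+", r"no{5}", r"no{1,5}", r"no?", r"no{2}"]

# fullmatch succeeds only if the pattern consumes the whole string.
results = {p: re.fullmatch(p, text) is not None for p in patterns}
for pattern, matched in results.items():
    print(f"{pattern:8} -> {matched}")

# Parentheses make the quantifier apply to the whole sequence 'no'.
repeats_no = re.fullmatch(r"(no)*", "nonono") is not None
repeats_o = re.fullmatch(r"(no)*", "nooooo") is not None
print(repeats_no)  # True
print(repeats_o)   # False
```

Without parentheses a quantifier applies only to the single character before it, which is why `no{5}` means one n and five o's rather than five copies of 'no'.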
## OR Operator - | or [ ]

```
a(b|c)     matches a string that has a followed by b or c (and captures b or c)
a[bc]      same as previous, but without capturing b or c
```

* 'barn' | 'bark'
    * bar(n|k)
    * bar[nk]
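
A small sketch comparing the two forms on the 'barn' / 'bark' example. The word 'bard' is an added counter-example, not part of the original notes.

```python
import re

words = ["barn", "bark", "bard"]

# Both patterns accept 'barn' and 'bark' and reject 'bard'.
with_group = [w for w in words if re.fullmatch(r"bar(n|k)", w)]
with_class = [w for w in words if re.fullmatch(r"bar[nk]", w)]
print(with_group)  # ['barn', 'bark']
print(with_class)  # ['barn', 'bark']

# Only the parenthesised form records which alternative matched.
captured = re.fullmatch(r"bar(n|k)", "bark").group(1)
not_captured = re.fullmatch(r"bar[nk]", "bark").groups()
print(captured)      # k
print(not_captured)  # ()
```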
    
## Character classes - \d \w \s and .
```
\d         matches a single character that is a digit
\w         matches a word character (alphanumeric character plus underscore)
\s         matches a whitespace character (includes tabs and line breaks)
.          matches any character
```
* Use the `.` sparingly: a character class or negated character class is faster and more precise
* \d, \w and \s also have negated forms: \D, \W and \S respectively.
* For example, \D matches exactly the characters that \d does not.

` \D         matches a single non-digit character `
* dollar sign is replaced by £
* To be matched literally, the characters `^.[]£()|*+?{\` must be escaped with a backslash \ as they otherwise have a special meaning.

`\£\d       matches a string that has a £ before one digit `
* e.g. in a menu, it'll match the prices
* You can also match non-printable characters like tabs \t, new-lines \n, carriage returns \r.
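
A short sketch of the character classes and of escaping, using a made-up menu string. The code uses the real dollar sign, so the escaped form is `\$`.

```python
import re

menu = "Soup $4\tSalad $7\nCake $5"  # hypothetical menu text

prices = re.findall(r"\$\d", menu)       # escaped dollar sign, then one digit
digits = re.findall(r"\d", menu)
whitespace = re.findall(r"\s", menu)     # includes the tab and the newline
words = re.findall(r"\w+", menu)
non_digits = re.findall(r"\D+", "a1b22c")

print(prices)      # ['$4', '$7', '$5']
print(digits)      # ['4', '7', '5']
print(whitespace)  # [' ', '\t', ' ', '\n', ' ']
print(words)       # ['Soup', '4', 'Salad', '7', 'Cake', '5']
print(non_digits)  # ['a', 'b', 'c']
```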


## Flags 

* In many languages a regex is written in the form `/abc/`, where the search pattern is delimited by two slash characters /.
* After the closing slash we can specify one or more flags (they can be combined):
    * <b> g (global) </b> does not return after the first match; each subsequent search restarts from the end of the previous match
    * <b> m (multi-line) </b> when enabled, ^ and £ match the start and end of each line instead of the whole string
    * <b> i (insensitive)</b> makes the whole expression case-insensitive (for instance /aBc/i would match AbC)
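
Python's `re` module has no `/abc/g` literal syntax; flags are passed as arguments instead. The sketch below shows one reasonable mapping of the three flags, assuming `re.findall` as the stand-in for the global flag.

```python
import re

text = "abc\nABC\nxyz"

# g (global): re.findall / re.finditer return every match, not just the first.
global_matches = re.findall(r"b", "abcabc")
print(global_matches)  # ['b', 'b']

# i (insensitive): re.IGNORECASE, or re.I for short.
insensitive = re.findall(r"abc", text, flags=re.I)
print(insensitive)  # ['abc', 'ABC']

# m (multi-line): re.MULTILINE (re.M) makes ^ and $ match at each line boundary.
single_line = re.findall(r"^\w+$", text)
multi_line = re.findall(r"^\w+$", text, flags=re.M)
print(single_line)  # []
print(multi_line)   # ['abc', 'ABC', 'xyz']

# Flags combine with the | operator.
combined = re.findall(r"^abc$", text, flags=re.I | re.M)
print(combined)  # ['abc', 'ABC']
```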

In [2]:
## Basic: Practice 

# Intermediate topics 

## Grouping and capturing - ( )
```
a(bc)           parentheses create a capturing group with value bc 
a(?:bc)*        using ?: we disable the capturing group 
a(?<foo>bc)     using ?<foo> we put a name to the group 
```
* this operator is very useful when you need to extract information from strings/data using your preferred programming language
* multiple occurrences captured by several groups are exposed in the form of an array
    * the values can be accessed by specifying an index on the result of the match
* if we choose to name the groups (using `?<foo>...`, spelled `?P<foo>...` in Python's `re`) we can retrieve the group values from the match result like a dictionary whose keys are the names of the groups
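
A minimal sketch of capturing, named and non-capturing groups in Python. The date string is a made-up sample, and note that Python writes named groups as `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

sample = "released on 2021-03-15"  # hypothetical input

# Numbered groups: index 0 is the whole match, 1..n are the groups.
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)
print(match.group(0))  # 2021-03-15
print(match.groups())  # ('2021', '03', '15')
print(match.group(1))  # 2021

# Named groups behave like a dictionary keyed by group name.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", sample)
print(named.groupdict())  # {'year': '2021', 'month': '03'}
print(named["year"])      # 2021

# A non-capturing group still groups for the quantifier but stores nothing.
non_capturing = re.fullmatch(r"a(?:bc)*", "abcbc")
print(non_capturing.groups())  # ()
```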

## Bracket expressions - [ ]

```
[abc]            matches a string that has either an a or a b or a c -> is the same as a|b|c
[a-c]            same as previous
[a-fA-F0-9]      a string that represents a single hexadecimal digit, case insensitively
[0-9]%           a string that has a character from 0 to 9 before a % sign
[^a-zA-Z]        a string that has a character that is not a letter from a to z or from A to Z. In this case the ^ is used as negation of the expression
```
* similar to the OR operator from before
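
The bracket expressions above can be tried out with a few one-liners; the input strings are illustrative only.

```python
import re

range_matches = re.findall(r"[a-c]", "abcdef")
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1aF3") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "1aG3") is not None
percents = re.findall(r"[0-9]%", "5% off, then 20% off")
non_letters = re.findall(r"[^a-zA-Z]", "a1 b2!")

print(range_matches)  # ['a', 'b', 'c']
print(is_hex)         # True
print(not_hex)        # False
print(percents)       # ['5%', '0%']
print(non_letters)    # ['1', ' ', '2', '!']
```

`[0-9]%` matches only one digit before the percent sign, which is why '20%' yields '0%'; a quantifier such as `[0-9]+%` would be needed to capture the whole number.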

## Greedy and lazy match
* the quantifiers `* + {}` are greedy operators
    * they expand the match as far as they can through the provided text
* For example, `<.+>` matches `<div>simple div</div>` in `This is a <div>simple div</div> test`.
* In order to catch only the `div` tag we can use a `?` to make it lazy:
```
<.+?>            matches any character one or more times included inside < and >, expanding as needed
```
* to avoid the usage of `.` , use:
    ```
    <[^<>]+>         matches any character except < or > one or more times included inside < and >
    ```
    * the square brackets mean that you're referring to any one character in the list (or, with a leading `^`, any one character not in it)
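
The difference between the greedy, lazy and negated-class versions can be seen by running all three on the same sentence:

```python
import re

text = "This is a <div>simple div</div> test"

greedy = re.findall(r"<.+>", text)        # expands to the last '>'
lazy = re.findall(r"<.+?>", text)         # stops at the first '>'
negated = re.findall(r"<[^<>]+>", text)   # cannot cross a '<' or '>'

print(greedy)   # ['<div>simple div</div>']
print(lazy)     # ['<div>', '</div>']
print(negated)  # ['<div>', '</div>']
```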

# Advanced topics

## Boundaries - \b and \B

1. Finding first and all occurrences that match an expression pattern

```
# return the first match
re.search(pattern, text)

# return all the matches
re.findall(pattern, text)
```
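
A runnable sketch of the two calls, with a made-up `text` and `pattern`. `re.search` returns a match object (or `None` when nothing matches), while `re.findall` returns a list of strings.

```python
import re

text = "cat, cot, cut"  # hypothetical input
pattern = r"c.t"

# First match only: a match object carrying the text and its position.
first = re.search(pattern, text)
print(first.group())  # cat
print(first.span())   # (0, 3)

# Every non-overlapping match, as a list of strings.
all_matches = re.findall(pattern, text)
print(all_matches)  # ['cat', 'cot', 'cut']

# No match: search returns None, findall returns an empty list.
missing = re.search(r"dog", text)
print(missing)  # None
```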

In [None]:
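# example: 'a', then any three characters, then 's' as the last character
pattern = r'^a...s$'
test_string = 'abyss'
# re.match only tries to match at the start of the string
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")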