# In this notebook, let's look at some useful regular expressions
See references:
* [Basics - iterative tutorial](https://regexone.com/lesson/introduction_abcs)
* [Python RegEx w3schools](https://www.w3schools.com/python/python_regex.asp)
* [Python RegEx Doc](https://docs.python.org/2/howto/regex.html#regex-howto)

Note: a regular expression may run faster than equivalent hand-written Python code, but plain Python code is often more readable than a regular expression.
<table>
    <tr>
        <th>Operator</th>
        <th>Description</th>
    </tr>
    <tr>
        <td><b>.</b></td>
        <td>A metacharacter that matches any single character except a newline. To match a literal dot, escape it: <b>\.</b></td>
    </tr>
    <tr>
        <td><b>\ + char</b></td>
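        <td>Escapes a metacharacter so that it is matched literally (e.g. <b>\.</b>), or introduces a special sequence such as <b>\d</b> (digit) or <b>\w</b> (word character).</td>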
    </tr>
    <tr>
        <td><b>[abc]</b></td>
        <td>Matches any single character that is a, b or c.</td>
    </tr>
    <tr>
        <td><b>^</b></td>
        <td>Negation inside brackets: <b>[^abc]</b> matches any single character that is not a, b or c.</td>
    </tr>
    <tr>
        <td><b>-</b></td>
        <td>To match characters in a range, use a <b>dash (-)</b> between the first and the last element of the range. Ex: <b>[0-6]</b>. One special sequence is <b>\w</b>, which for ASCII text is equivalent to <b>[A-Za-z0-9_]</b> (note that the underscore is included).</td>
    </tr>
    <tr>
        <td><b>{}</b></td>
        <td>Repetition. Ex <b>a{3}</b> matches an a repeated exactly 3 times. Ex <b>a{1,3}</b> matches no fewer than 1 and no more than 3 repetitions.</td>
    </tr>
    <tr>
        <td><b>* +</b></td>
        <td>Matching an arbitrary number of repetitions. Ex <b>\d*</b> matches any number of digits (including none); <b>\d+</b> requires at least one digit. In practice, <b>*</b> means zero or more and <b>+</b> means one or more.</td>
    </tr>
    <tr>
        <td><b>?</b></td>
        <td>Optional character. Ex <b>ab?c</b> matches abc or ac because b is optional.</td>
    </tr>
    <tr>
        <td><b>\s</b></td>
        <td>Whitespace characters (space, tab, newline, etc.).</td>
    </tr>
    <tr>
        <td><b>^word</b></td>
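        <td>Start and end anchors. <b>^word</b> matches strings that begin with word; <b>word$</b> matches strings that end with word. With the <b>re.MULTILINE</b> flag the anchors apply to each line.</td>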
    </tr>
    <tr>
        <td><b>()</b></td>
        <td>Capture group. Ex <b>^(file\w*)\.pdf$</b> matches PDF file names that start with the word file and captures the name without the extension.</td>
    </tr>
    <tr>
        <td><b>(())</b></td>
        <td>Nested groups: a capture group inside another capture group.</td>
    </tr>
    <tr>
        <td><b>|</b></td>
        <td>Alternation (or). Ex <b>(a|b)</b> matches a or b.</td>
    </tr>
    <tr>
        <td><b>\D \W \S \b</b></td>
        <td>The opposites: \D matches a non-digit, \W a non-word character, \S a non-whitespace character. \b is different: it matches a word boundary, i.e. the position between a word character and a non-word character.</td>
    </tr>
</table>
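
A few of the operators in the table can be tried on a short string. The cell below is only a minimal sketch: the sample string and the variable names are made up for illustration and are not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise the operators above.
sample = "file_01.pdf cat cot ct a1 b22 c333"

# [ao] matches exactly one of the listed characters between 'c' and 't'.
char_class = re.findall(r"c[ao]t", sample)

# ? makes the preceding character optional; \b keeps the match on whole words.
optional = re.findall(r"\bca?t\b", sample)

# {2,3} asks for two or three digits after a letter in the range a-c.
repeated = re.findall(r"\b[a-c]\d{2,3}\b", sample)

# ( ) captures the file name without the extension; \. is a literal dot.
captured = re.search(r"^(file\w*)\.pdf", sample).group(1)

print(char_class)  # ['cat', 'cot']
print(optional)    # ['cat', 'ct']
print(repeated)    # ['b22', 'c333']
print(captured)    # file_01
```

Raw strings (`r"..."`) are used so that backslashes reach the `re` module unchanged.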

---


In [12]:
# The regular expression library
import re
import requests
import bs4 as bs

## Main functions of the re module
<table>
    <tr>
        <th>Function</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>findall</td>
        <td>Returns a list containing all matches</td>
    </tr>
     <tr>
        <td>search</td>
        <td>Returns a Match object if there is a match anywhere in the string</td>
    </tr>
     <tr>
        <td>split</td>
        <td>Returns a list where the string has been split at each match</td>
    </tr>
     <tr>
        <td>sub</td>
        <td>Replaces one or many matches with a string</td>
    </tr>
</table>


In [19]:
page_link = 'https://en.wikipedia.org/wiki/March_Comes_in_Like_a_Lion'
response = requests.get(page_link, timeout=5)
print('Status response:', response.status_code)


Status response: 200


In [20]:
page_content = bs.BeautifulSoup(response.content, 'html.parser') 


### re.findall()


In [21]:
term = 'Shogi'
x = re.findall(term,page_content.text)
print(x) 

['Shogi', 'Shogi', 'Shogi', 'Shogi']


### re.search()


In [26]:
term = 'Shogi'
match_obj = re.search(term, page_content.text)
print('The position of the first occurrence (%d, %d).' %(match_obj.start(), match_obj.end()))

The position of the first occurrence (1161, 1166).


### re.split()

In [29]:
term = 'Shogi'
list_nbhd = re.split(term, page_content.text)
print('Split at each occurrence of the term specified:', list_nbhd)

Split at each occurrence of the term specified: ['\n\n\n\nMarch Comes in Like a Lion - Wikipedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"March_Comes_in_Like_a_Lion","wgTitle":"March Comes in Like a Lion","wgCurRevisionId":879023528,"wgRevisionId":879023528,"wgArticleId":12467068,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with reference errors","CS1 uses Japanese-language script (ja)","CS1 Japanese-language sources (ja)","Pages with duplicate reference names","Articles containing Japanese-language text","Interlanguage link template link number","Articles with Japanese-language external links","Manga series","2007 manga","2016 anime television series","Anime series base
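
Because the output above is long, the behaviour of `re.split()` may be easier to see on a short string. This is a small sketch with a made-up sentence (an assumption for illustration, not text taken from the downloaded page).

```python
import re

# Hypothetical short text used instead of the downloaded page.
text = "Shogi is a board game. Rei plays Shogi professionally."

# Split at every occurrence of the term; the term itself is dropped.
parts = re.split("Shogi", text)
print(parts)  # ['', ' is a board game. Rei plays ', ' professionally.']

# maxsplit limits the number of splits performed.
first_split = re.split("Shogi", text, maxsplit=1)
print(first_split)  # ['', ' is a board game. Rei plays Shogi professionally.']
```

The leading empty string appears because the text starts with the term, so there is nothing before the first match.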

### re.sub()
Replaces the matches with some text.

In [30]:
text_sample = 'Hello it\'s me'
x = re.sub('me', 'you', text_sample)
print(x)
print(text_sample)

Hello it's you
Hello it's me


## The match object
The match object has some methods and attributes to retrieve content.
<table>
    <tr>
        <th>Method</th>
        <th>Description</th>
    </tr>
    <tr>
        <td>.span()</td>
        <td>returns a tuple (start, end) with the position of the match</td>
    </tr>
    <tr>
        <td>.string</td>
        <td>attribute (not a method) holding the string passed to the function</td>
    </tr>
    <tr>
        <td>.group()</td>
        <td>returns the part of the string where there was a match</td>
    </tr>
</table>
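
The entries in the table can be checked with a short cell. This is a minimal sketch using a made-up sentence; note that `re.search()` returns `None` when nothing matches, so the result is tested before use.

```python
import re

# Hypothetical sample sentence for illustration.
text = "The rain in Spain"

# Find the first word that starts with an uppercase S.
match_obj = re.search(r"\bS\w+", text)

if match_obj is not None:
    print(match_obj.span())    # (12, 17)
    print(match_obj.string)    # The rain in Spain
    print(match_obj.group())   # Spain
```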

### Extras
[[1] More exercises](https://regexone.com/problem/matching_decimal_numbers)
```
# An optional minus sign, then one or more digits, optionally followed by
# comma-separated groups of digits, then an optional decimal part that may
# itself end with an exponent (e.g. e10).
^-?\d+(,\d+)*(\.\d+(e\d+)?)?$
```
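
One way to check the expression above is to run it against a few strings. The candidate strings below are assumptions chosen for illustration; they are not claimed to be the exact cases used in the linked exercise.

```python
import re

# The pattern from the cell above, compiled once for reuse.
number_pattern = re.compile(r"^-?\d+(,\d+)*(\.\d+(e\d+)?)?$")

# Hypothetical test strings: the first five should match, the rest should not.
candidates = ["3.14529", "-255.34", "128", "1.9e10", "123,340.00",
              "720p", "1,.5", "abc"]

matched = [s for s in candidates if number_pattern.search(s)]
rejected = [s for s in candidates if not number_pattern.search(s)]

print(matched)   # ['3.14529', '-255.34', '128', '1.9e10', '123,340.00']
print(rejected)  # ['720p', '1,.5', 'abc']
```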