# RegEx (Ch7 of Automating Boring Stuff)

In [None]:
### **RUN THIS CELL FIRST SO THAT THE TARGET EXAMPLES ARE LOADED** 

import re

english_text = 'Ryokan: A traditional Japanese inn costing anywhere from ¥7,000 a person to upwards of ¥70,000 for a luxury stay.\nMinshuku: A family-run, Japanese-style bed and breakfast costing around ¥4,000 to ¥9,000 a person.\nBusiness hotel: A no-frills hotel room costing anywhere from ¥5,000 to ¥10,000 a night for a single room.\nCapsule hotel: Private, enclosed sleeping spaces and communal amenities for somewhere between ¥3000 and ¥5000 a night.\nAirBnB: Rent a room, or more, from a private homeowner for around ¥1,500 to ¥6,000 or more. Midway through 2018, a new Japanese law that required private lodgings to be registered with local authorities led to AirBnB removing thousands of unregistered listings from its website, but there are still many available.\nInternet cafes and "Manga Kissa": Sometimes a communal space, sometimes a private booth with a comfy reclining chair or bed-like floor for ¥1,500 to ¥3,000 for a 7- to 12-hour package. \n Love hotel: A private room for an adult couple that can cost around ¥3,000 for a “rest” of a few hours and about ¥7,000 to 12,000 for an overnight “stay,” depending on the area and type of hotel.'
japanese_text = '良品計画はこのほど、9月・10月に「ずっと、見直し。ずっと、良い値。」をテーマに価格見直しを行うことを発表した。\n同社ではこれまでもシーズンごとに価格の見直しを行ってきたが、生活者の想いにより一層応えるために、2020年秋冬シーズンは商品の「適正価格」をさらに推し進める。\n日常生活にもっとも必要な「くらしの基本」となる商品の価格をさらに見直し、「ずっと、見直し。(新価格)」として展開する。さらに2020年秋冬シーズン以前に価格を見直した商品(代表商品:シリコーン調理スプーン、ファイルボックス、ぽち菓子など)を「ずっと、良い値。」として改めて紹介していく。\n「ずっと、見直し。」9月、10月主な対象商品は、「えらべる・足なり直角靴下シリーズ(大人、子供)」「綿であったかインナーシリーズ(紳士、婦人)」「軽量ポケッタブルノーカラーダウンベスト(紳士)」「ウレタンフォーム三層バススポンジ」「掃除用品システム・カーペットクリーナー用替えテープ3本組」「世界の焼き菓子シリーズ」など。\n「えらべる・足なり直角靴下シリーズ(大人、子供)」は、足と同じ90度にかかとを編むことでかかとが収まり、動きによる「ずれ」を抑え、ずれ落ちにくい靴下。9月3日から、大人3足3足690円(旧価格790円)、子供3足690円(旧価格2足490円)で販売する。\n「綿であったかインナーシリーズ(紳士、婦人)」は、乾燥しにくく、静電気も起きにくい綿の機能インナー。9月8日から790円(旧価格990円)で販売。9月16日からは、「軽量ポケッタブルノーカラーダウンベスト(紳士)」を2,990円(旧価格3,990円)で販売する。'
emails_text = 'Press: If you are a member of the press, email us at press@google.com or testing123@g93901.ac.jp. visit our blog.\nBusinesses & Organizations: Grow your business with Google Ads\nSolve business challenges with Google Cloud\nWork together with G Suite\nEngage supporters and fundraise with Google for Nonprofits\n'


# How is RegEx useful?

In [None]:
message = 'Call me at 070-1234-5678 tomorrow. 080-9876-4321 is my office.'

def isPhoneNumber(text):
    if len(text) != 13:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 8):
        if not text[i].isdecimal():
            return False
    if text[8] != '-':
        return False
    for i in range(9, 13):
        if not text[i].isdecimal():
            return False
    return True

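found = False
number = None    # stays None if no phone number is found

# + 1 so that the final 13-character window is also checked
for i in range(len(message) - 13 + 1):
    if isPhoneNumber(message[i:i+13]):
        found = True
        number = message[i:i+13]
        break

print("Phone number found = ", found)
print("Number is =", number)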

Phone number found =  True
Number is = 070-1234-5678


In [None]:
# Match the '-' separators literally; a '.' here would accept any character as a separator
match_list = re.findall(r'\d{3}-\d{4}-\d{4}', message)
print(match_list)

['070-1234-5678', '080-9876-4321']


# RegEx in Python

Always start with import re, which loads Python's built-in regex library.

There are two ways to perform regex matching:  
*   creating a pattern object first with re.compile() and calling its methods
*   working directly with the text via module-level functions such as re.findall()



In [None]:
# Creating a pattern

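pattern = re.compile(r'¥[0-5]*')    # note the 'r' prefix, which makes this a raw string literal

# (What is a raw string? Python leaves backslashes alone in a raw string,
# so r'\n' is a backslash followed by 'n', not a line break.)

# [0-5]* matches zero or more digits from 0 to 5, so each match stops at the first other character
matches = pattern.finditer(english_text)

for m in matches:
    print(m.group())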

¥
¥
¥4
¥
¥5
¥10
¥3000
¥5000
¥1
¥
¥1
¥3
¥3
¥
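
The raw string note above can be checked directly. This is a minimal sketch; the variable names and the sample sentence are made up for illustration.

```python
import re

# In a normal string literal, '\n' is a single newline character.
normal = '\n'
# In a raw string literal, r'\n' is two characters: a backslash and 'n'.
raw = r'\n'
print(len(normal), len(raw))

# Without the r prefix the backslash has to be doubled to reach the regex engine.
same_pattern = ('\\d+' == r'\d+')
print(same_pattern)

# Hypothetical sample text, not one of the notebook's target examples.
digits = re.findall(r'\d+', 'Room 12, floor 3')
print(digits)
```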


In [None]:
# Directly working with text

re.findall(r'¥[,\d]+', english_text)


['¥7,000',
 '¥70,000',
 '¥4,000',
 '¥9,000',
 '¥5,000',
 '¥10,000',
 '¥3000',
 '¥5000',
 '¥1,500',
 '¥6,000',
 '¥1,500',
 '¥3,000',
 '¥3,000',
 '¥7,000']

# Formulating Regular Expressions

Every character is taken literally except for the following: 
*   \ --- escape character
*   ^ --- start of the string (or of each line with re.MULTILINE)
*   $ --- end of the string (or of each line with re.MULTILINE)
*   | --- logical 'or'
*   . --- wildcard, matches any character except a newline
*   ? --- zero or one of the preceding element
*   \* --- zero or more of the preceding element
*   \+ --- one or more of the preceding element
*   \[\] --- character class
    *     ^ --- as the first character inside \[\], negates the class
*   ( ) --- group for extraction
*   \{a,b\} --- between a and b repetitions of the preceding element (no space after the comma)
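
A few of the special characters listed above can be tried on a short string. This is only a sketch: the sample string is invented, and the comments describe what each pattern is meant to pick out.

```python
import re

# Hypothetical sample, chosen so that each pattern behaves differently.
sample = 'cat cot coat ct c.t'

dot = re.findall(r'c.t', sample)                 # . is a wildcard for one character
escaped = re.findall(r'c\.t', sample)            # \. is a literal period
optional = re.findall(r'co?t', sample)           # o? means zero or one 'o'
char_class = re.findall(r'c[ao]t', sample)       # [ao] means 'a' or 'o'
negated = re.findall(r'c[^ao]t', sample)         # [^ao] means anything but 'a' or 'o'
repeat = re.findall(r'c[a-z]{1,2}t', sample)     # one or two lowercase letters
either = re.findall(r'cat|coat', sample)         # | means 'or'
at_start = re.findall(r'^cat', sample)           # ^ anchors at the start
at_end = re.findall(r'c.t$', sample)             # $ anchors at the end

for name, result in [('dot', dot), ('escaped', escaped), ('optional', optional),
                     ('char_class', char_class), ('negated', negated),
                     ('repeat', repeat), ('either', either),
                     ('at_start', at_start), ('at_end', at_end)]:
    print(name, result)
```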



In [None]:

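# \Z only matches at the very end of the string, so nothing can follow it and this finds no matches.
# To find lines that end with a period, try r'^.+\.$' together with re.MULTILINE.
re.findall(r'\Z.+\.$', english_text)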


[]
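
The empty result above is what \Z produces whenever something is required after it. The sketch below compares it with a line-anchored pattern on a small invented sample; the three-line string is an assumption for illustration and is not one of the target examples.

```python
import re

# Hypothetical three-line sample; only the first and third lines end with a period.
sample = 'First line.\nSecond line without a period\nThird line.'

# \Z matches only at the very end of the string, so '.+' has nothing left to consume.
no_match = re.findall(r'\Z.+\.$', sample)

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
lines_with_period = re.findall(r'^.+\.$', sample, re.MULTILINE)

print(no_match)
print(lines_with_period)
```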

# Some Common Parameters for RegEx compiling

*   re.I or re.IGNORECASE
    *    Perform case-insensitive matching, so a pattern like [a-z] also matches uppercase letters.
*   re.M or re.MULTILINE
    *    When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). 
*   re.X or re.VERBOSE
    *    This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. 
*   re.S or re.DOTALL
    *    Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
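
The four flags above can be compared side by side. This is a minimal sketch on invented sample text; the phone pattern reuses the number format from the first example in this notebook.

```python
import re

# Hypothetical two-line sample.
text = 'Hello\nworld'

# re.IGNORECASE: lowercase 'hello' still matches 'Hello'.
ignorecase = re.findall(r'hello', text, re.IGNORECASE)

# re.DOTALL: without it '.' stops at the newline, with it '.' crosses the newline.
without_dotall = re.search(r'Hello.*', text).group()
with_dotall = re.search(r'Hello.*', text, re.DOTALL).group()

# re.MULTILINE: ^ matches at the start of every line, not only the start of the string.
line_starts = re.findall(r'^\w', text, re.MULTILINE)

# re.VERBOSE: whitespace and # comments inside the pattern are ignored.
phone = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{4}   # middle four digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)
numbers = phone.findall('Call me at 070-1234-5678 tomorrow.')

# Flags can be combined with the | operator.
combined = re.findall(r'^WORLD$', text, re.IGNORECASE | re.MULTILINE)

print(ignorecase, repr(without_dotall), repr(with_dotall), line_starts, numbers, combined)
```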




In [None]:
# Case-insensitive pattern: two consecutive letters from a to c, in upper or lower case

pattern = re.compile(r'[a-c]{2}', re.IGNORECASE)    # flags are passed as the second argument

pattern.findall(english_text)



['ac', 'ca', 'ab', 'ca', 'ac', 'ac', 'ca', 'ab']

# RegEx for Japanese Text

Japanese scripts can be matched with character ranges. For a list of ranges, refer to https://gist.github.com/terrancesnyder/1345094

In [None]:
# Match runs of hiragana

pattern = re.compile(r'[ぁ-ん]+')    # the range ぁ-ん covers the basic hiragana characters

pattern.findall(japanese_text)

['はこのほど',
 'に',
 'ずっと',
 'し',
 'ずっと',
 'い',
 'を',
 'に',
 'しを',
 'うことを',
 'した',
 'ではこれまでも',
 'ごとに',
 'の',
 'しを',
 'ってきたが',
 'の',
 'いにより',
 'えるために',
 'は',
 'の',
 'をさらに',
 'し',
 'める',
 'にもっとも',
 'な',
 'くらしの',
 'となる',
 'の',
 'をさらに',
 'し',
 'ずっと',
 'し',
 'として',
 'する',
 'さらに',
 'に',
 'を',
 'した',
 'ぽち',
 'など',
 'を',
 'ずっと',
 'い',
 'として',
 'めて',
 'していく',
 'ずっと',
 'し',
 'な',
 'は',
 'えらべる',
 'なり',
 'であったか',
 'え',
 'の',
 'き',
 'など',
 'えらべる',
 'なり',
 'は',
 'と',
 'じ',
 'にかかとを',
 'むことでかかとが',
 'まり',
 'きによる',
 'ずれ',
 'を',
 'え',
 'ずれ',
 'ちにくい',
 'から',
 'で',
 'する',
 'であったか',
 'は',
 'しにくく',
 'も',
 'きにくい',
 'の',
 'から',
 'で',
 'からは',
 'を',
 'で',
 'する']
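
The same range idea extends to the other scripts. The sketch below assumes the Unicode Katakana block (U+30A0 to U+30FF) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF) are good enough for katakana and kanji; rarer characters outside those blocks are not covered. The sample phrase is made up from words that appear in japanese_text.

```python
import re

# Hypothetical short sample mixing kanji, hiragana and katakana.
sample = '無印良品のシリコーン調理スプーン'

hiragana = re.findall(r'[ぁ-ん]+', sample)
# Assumption: the Katakana block U+30A0..U+30FF (includes the long vowel mark ー).
katakana = re.findall(r'[\u30A0-\u30FF]+', sample)
# Assumption: the main CJK Unified Ideographs block U+4E00..U+9FFF.
kanji = re.findall(r'[\u4E00-\u9FFF]+', sample)

print(hiragana)
print(katakana)
print(kanji)
```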

# Finding Emoji 

Reference: https://www.regextester.com/106421

Work in progress: the pattern below looks for literal text of the form <u+XXXX> rather than for the emoji characters themselves, so it finds nothing in emoji_text.

In [None]:
emoji_text = r'Thats great! 😆😆😆😛'

pattern = re.compile("(<u\\+\\w+?>)")

print(pattern.search(emoji_text))



None
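
One possible way to make this work is to match the emoji by their code points instead. This is a sketch under a loud assumption: it only covers the Unicode Emoticons block (U+1F600 to U+1F64F), which happens to contain the faces used here; many other emoji live in other blocks and would be missed.

```python
import re

emoji_text = 'Thats great! 😆😆😆😛'

# Assumption: only the Emoticons block U+1F600..U+1F64F is covered.
# A normal (non-raw) string is used so that Python expands the \U escapes itself.
emoticons = re.compile('[\U0001F600-\U0001F64F]')

found = emoticons.findall(emoji_text)
print(found)
print(len(found))
```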


1. What is the function that creates Regex objects?

    - re.compile(r'')

2. Why are raw strings often used when creating Regex objects?

    - Regexes use backslashes frequently, and raw strings avoid having to double every backslash to get it past Python's string parsing.

3. What does the search() method return?

    - returns None if the pattern is not found
    - returns a Match object for the first match otherwise

4. How do you get the actual strings that match the pattern from a Match object?

    - .group() method returns actual matched text

5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group 0 cover? Group 1? Group 2?

    - group 0 : 415-555-4242 (all matched text)
    - group 1 : 415
    - group 2 : 555-4242
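
The group numbering can be confirmed with a short sketch. The surrounding sentence is invented; the phone number is the one used in the answer above.

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

whole = mo.group(0)     # all matched text
first = mo.group(1)     # first pair of parentheses
second = mo.group(2)    # second pair of parentheses
both = mo.groups()      # every group at once, as a tuple

print(whole, first, second, both)
```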

6. Parentheses and periods have specific meanings in regular expression syntax. How would you specify that you want a regex to match actual parentheses and period characters?

    - escape characters \( and \) and \.

7. The findall() method returns a list of strings or a list of tuples of strings. What makes it return one or the other?

    - List of strings: if the regex has no groups (or exactly one group, in which case each string is that group's text)
    - List of tuples: if the regex has two or more groups
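
A quick sketch of how the number of groups changes what findall() returns. The sample text is invented for illustration.

```python
import re

# Hypothetical sample text.
text = 'Cell: 555-9999 Work: 555-0000'

no_groups = re.findall(r'\d{3}-\d{4}', text)        # list of whole matches
one_group = re.findall(r'(\d{3})-\d{4}', text)      # list of that group's text
two_groups = re.findall(r'(\d{3})-(\d{4})', text)   # list of tuples, one item per group

print(no_groups)
print(one_group)
print(two_groups)
```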

8. What does the | character signify in regular expressions?

    - Alternation: matches either the expression on its left or the one on its right

9. What two things does the ? character signify in regular expressions?

    - Zero or one of the preceding group (an optional pattern)
    - Non-greedy matching, when placed after a quantifier such as *, + or {3,5}

10. What is the difference between the + and * characters in regular expressions?

    - * : Zero or More
    - + : Once or More

11. What is the difference between {3} and {3,5} in regular expressions?

    - {3} : exactly 3 times
    - {3,5} : 3 to 5 times (greedy)

12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

    - \d : digit 0 to 9
    - \w : any letter, numeric digit, or underscore (word)
    - \s : any space, tab, or newline

13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

    - \D : NOT digit 0 to 9
    - \W : NOT letter, numeric digit, or underscore (word)
    - \S : NOT space, tab, or newline

14. What is the difference between .* and .*??

    - .* : greedy, matches as much text as possible
    - .*? : non-greedy, matches as little text as possible
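
The difference between .* and .*? shows up when the closing character occurs more than once. This is a minimal sketch on an invented sample string.

```python
import re

# Hypothetical sample with two closing angle brackets.
text = '<To serve man> for dinner.>'

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.search(r'<.*>', text).group()

# Non-greedy: .*? grabs as little as it can, so the match stops at the first '>'.
nongreedy = re.search(r'<.*?>', text).group()

print(greedy)
print(nongreedy)
```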

15. What is the character class syntax to match all numbers and lowercase letters?

    - [0-9a-z] (no surrounding slashes; those are regex delimiters in other languages, not in Python)

16. How do you make a regular expression case-insensitive?

    - pass re.IGNORECASE (or re.I) as the second argument to re.compile()

17. What does the . character normally match? What does it match if re.DOTALL is passed as the second argument to re.compile()?

    - Normally: any character except a newline (\n)
    - With re.DOTALL: any character, including a newline

18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '12 drummers, 11 pipers, five rings, 3 hens') return?

    - 'X drummers, X pipers, five rings, X hens'
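
The answer can be verified by running the call from the question as written.

```python
import re

numRegex = re.compile(r'\d+')

# Every run of digits is replaced; 'five' is left alone because it is not made of digits.
result = numRegex.sub('X', '12 drummers, 11 pipers, five rings, 3 hens')
print(result)
```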

19. What does passing re.VERBOSE as the second argument to re.compile() allow you to do?

    - add whitespace and comments to your regex; both are ignored when matching

20. How would you write a regex that matches a number with commas for every three digits? It must match the following:

'42'
'1,234'
'6,368,745'
but not the following:

'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)

    - ^\d{1,3}(,\d{3})*$ (the ^ and $ anchors are needed so that '1234' and '12,34,567' are rejected as a whole)
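
A sketch that checks an anchored version of this pattern against the examples listed in the question. The list names are only for illustration.

```python
import re

# Anchored with ^ and $ so that the whole string has to fit the pattern.
comma_number = re.compile(r'^\d{1,3}(,\d{3})*$')

should_match = ['42', '1,234', '6,368,745']
should_not_match = ['12,34,567', '1234']

matched = [s for s in should_match if comma_number.search(s)]
rejected = [s for s in should_not_match if not comma_number.search(s)]

# Without the anchors, search() is satisfied by part of the string.
partial = re.search(r'\d{1,3}(,\d{3})*', '1234').group()

print(matched)
print(rejected)
print(partial)
```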

21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:

'Haruto Watanabe'
'Alice Watanabe'
'RoboCop Watanabe'
but not the following:

'haruto Watanabe' (where the first name is not capitalized)
'Mr. Watanabe' (where the preceding word has a nonletter character)
'Watanabe' (which has no first name)
'Haruto watanabe' (where Watanabe is not capitalized)
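
No answer is recorded for this question. One possible answer, offered as a sketch rather than the official solution, is a capital letter followed by any letters, a whitespace character, then Watanabe. It is checked with fullmatch() so that part of a longer word cannot sneak through.

```python
import re

# Assumed answer: first name = one capital letter followed by zero or more letters.
name_regex = re.compile(r'[A-Z][A-Za-z]*\sWatanabe')

should_match = ['Haruto Watanabe', 'Alice Watanabe', 'RoboCop Watanabe']
should_not_match = ['haruto Watanabe', 'Mr. Watanabe', 'Watanabe', 'Haruto watanabe']

# fullmatch() requires the entire string to fit the pattern.
matched = [s for s in should_match if name_regex.fullmatch(s)]
rejected = [s for s in should_not_match if not name_regex.fullmatch(s)]

print(matched)
print(rejected)
```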

22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

'Alice eats apples.'
'Bob pets cats.'
'Carol throws baseballs.'
'Alice throws Apples.'
'BOB EATS CATS.'
but not the following:

'RoboCop eats apples.'
'ALICE THROWS FOOTBALLS.'
'Carol eats 7 cats.'
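
No answer is recorded for this question either. One possible answer, again a sketch and not the official solution, uses three alternation groups separated by whitespace, an escaped period, and re.IGNORECASE.

```python
import re

# Assumed answer: one alternation group per word, case-insensitive.
sentence_regex = re.compile(
    r'(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.',
    re.IGNORECASE,
)

should_match = ['Alice eats apples.', 'Bob pets cats.', 'Carol throws baseballs.',
                'Alice throws Apples.', 'BOB EATS CATS.']
should_not_match = ['RoboCop eats apples.', 'ALICE THROWS FOOTBALLS.', 'Carol eats 7 cats.']

# fullmatch() requires the entire sentence to fit the pattern.
matched = [s for s in should_match if sentence_regex.fullmatch(s)]
rejected = [s for s in should_not_match if not sentence_regex.fullmatch(s)]

print(matched)
print(rejected)
```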