# Regular Expressions

Video: https://www.youtube.com/watch?v=sHw5hLYFaIw

Notebook: https://github.com/codebasics/py/blob/master/Advanced/regex/regex_tutorial_python.ipynb

In [31]:
import re

## 1. Extract phone numbers

In [32]:
text = """
Elon musk's phone number is 9991116666, call him if you 
have any questions on dodgecoin. Tesla's revenue is 40 
billion Tesla's CFO number (999)-333-7777
"""

In [33]:
# regex for a DIGIT: \d

# regex for EXACTLY n repetitions: {n}
# (so \d{3} matches three consecutive digits)

# regex for OR: |
# use a raw string (r'...') so Python leaves the backslashes alone
pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

In [34]:
# find ALL matches
matches = re.findall(pattern, text)
matches

['9991116666', '(999)-333-7777']
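The same alternation can be easier to read when it is compiled once with `re.VERBOSE`, which lets each branch carry its own comment. This is only a sketch: the names `PHONE_PATTERN` and `sample` are illustrative and do not appear in the notebook.

```python
import re

# Illustrative name; the notebook simply calls this `pattern`.
PHONE_PATTERN = re.compile(
    r"""
    \(\d{3}\)-\d{3}-\d{4}   # form 1: (999)-333-7777
    |                       # OR
    \d{10}                  # form 2: ten consecutive digits
    """,
    re.VERBOSE,
)

sample = "call 9991116666 or (999)-333-7777"
phone_matches = PHONE_PATTERN.findall(sample)
print(phone_matches)  # ['9991116666', '(999)-333-7777']
```

In verbose mode, whitespace and `#` comments inside the pattern are ignored, so the layout does not change what is matched.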

In [39]:
chat = """
codebasics: you ask a lot of questions :) 1234567891, abc@xyz.com
codebasics: here it is: (123)-567-8912 email: abc@xyz.com
codebasics: yes, phone: 1234567891 email: abc@xyz.com abX_02@hg.pu
"""

In [36]:
matches = re.findall(pattern, chat)
matches

['1234567891', '(123)-567-8912', '1234567891']

In [42]:
# literal dot: \.
# any non-whitespace character: \S
# any whitespace character: \s
# one or more repetitions: +
# two or three of a: a{2,3}
# any character in range a-z, A-Z, and _: [a-zA-Z_]
# ({1} is the default repetition count, so it can be omitted)
pattern = r'\S+@\S+\.[a-zA-Z]{2,3}'

In [43]:
matches = re.findall(pattern, chat)
matches

['abc@xyz.com', 'abc@xyz.com', 'abc@xyz.com', 'abX_02@hg.pu']
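Because `\S` accepts any non-whitespace character, the email pattern can also pick up punctuation that touches an address. The sketch below compares the notebook's pattern (written without the redundant `{1}` quantifiers) with a tighter character-class variant. The `strict` pattern and the `sample` string are assumptions added for illustration, and neither pattern is a full email validator.

```python
import re

loose = r"\S+@\S+\.[a-zA-Z]{2,3}"            # notebook's pattern, {1} dropped
strict = r"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,3}"  # hypothetical tighter variant

sample = "write to <abc@xyz.com> or abX_02@hg.pu"

loose_matches = re.findall(loose, sample)
strict_matches = re.findall(strict, sample)

print(loose_matches)   # ['<abc@xyz.com', 'abX_02@hg.pu']
print(strict_matches)  # ['abc@xyz.com', 'abX_02@hg.pu']
```

The loose pattern keeps the leading `<` because it is a non-whitespace character; the stricter class only allows word characters, dots, plus signs, and hyphens before the `@`.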

## 2. Extract note titles

In [9]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.
'''

We want to capture each subtitle: the word "Note", the number after it, and the title of the note (all the text up to the newline character).

In [10]:
# regex for ANY CHAR EXCEPT: [^\n] (where \n is the 
# character you don't want to find)

# regex for ONE OR MORE CHAR: + (after character that
# may be repeated)

# regex for ZERO OR MORE CHAR: * (after character that
# may or may not exist, and may be repeated any number
# of times if it exists)

pattern = r'Note \d - [^\n]*'

In [11]:
# find ALL matches
matches = re.findall(pattern, text)
matches

['Note 1 - Overview', 'Note 2 - Summary of Significant Accounting Policies']
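Two limits of this pattern are worth keeping in mind: `\d` matches exactly one digit, and nothing ties the match to the start of a line. A possible variant, sketched below, uses `\d+` and the `re.MULTILINE` flag with `^` and `$` anchors. The `doc` string and its note titles are made up for the example.

```python
import re

# Hypothetical input: a two-digit note number and a mid-line mention.
doc = "Note 9 - Leases\nsee Note 2 - above for details\nNote 10 - Debt\n"

single_digit = re.findall(r"Note \d - [^\n]*", doc)
anchored = re.findall(r"^Note \d+ - .*$", doc, flags=re.MULTILINE)

print(single_digit)  # ['Note 9 - Leases', 'Note 2 - above for details']
print(anchored)      # ['Note 9 - Leases', 'Note 10 - Debt']
```

By default `.` does not match a newline, so `.*` stops at the end of the line just as `[^\n]*` does.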

Now we want only the titles of the notes (without the word "Note" and its number before it). 

In [12]:
# use the same pattern to match on, but put the part you
# want extracted in parentheses ()
# aka a CAPTURE GROUP: findall returns only what the group matched

pattern = r'Note \d - ([^\n]*)'

In [13]:
# find ALL matches
matches = re.findall(pattern, text)
matches

['Overview', 'Summary of Significant Accounting Policies']
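When a pattern has more than one capture group, `re.findall` returns a list of tuples, one item per group. A small sketch, assuming we also want the note number; the shortened `notes` string and the name `titles_by_number` are illustrative.

```python
import re

notes = (
    "Note 1 - Overview\n"
    "body text\n"
    "Note 2 - Summary of Significant Accounting Policies\n"
)

# Two groups: the note number and the title.
pairs = re.findall(r"Note (\d) - ([^\n]*)", notes)
print(pairs)
# [('1', 'Overview'), ('2', 'Summary of Significant Accounting Policies')]

# Groups are always strings, so convert the number explicitly.
titles_by_number = {int(number): title for number, title in pairs}
print(titles_by_number[2])  # Summary of Significant Accounting Policies
```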

## 3. Extract financial periods from a company's financial reporting

In [14]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion. 
FY2030 Q5 
fy2030 q3
'''

In [15]:
# to match on 1 OR 2 OR 3 OR 4:
# [1234] (or give range: [1-4])

pattern = r"FY(\d{4} Q[1-4])"

In [16]:
# find ALL matches
# ignore case: add flags argument to re.findall
matches = re.findall(pattern, text, flags=re.IGNORECASE)
matches

['2021 Q1', '2020 Q4', '2030 q3']
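Since `re.IGNORECASE` lets both `FY2021 Q1` and `fy2030 q3` through, the results come back in mixed case. One way to handle that, sketched below, is to compile the pattern with the flag and normalize each match with `str.upper()`. The names `period_pattern` and `report` are assumptions for this example.

```python
import re

# Compile once so the flag travels with the pattern.
period_pattern = re.compile(r"FY(\d{4} Q[1-4])", flags=re.IGNORECASE)

report = "FY2021 Q1 was strong, fy2030 q3 is a forecast, FY2030 Q5 is a typo."

# Normalize the case of whatever was captured.
periods = [p.upper() for p in period_pattern.findall(report)]
print(periods)  # ['2021 Q1', '2030 Q3']
```

`FY2030 Q5` is skipped because `[1-4]` only accepts a quarter between 1 and 4.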

## 4. Extract order numbers

In [44]:
text = """
codebasics: Hello, I am having an issue with my order # 412889912
codebasics: I have a problem with my order number 412889912
codebasics: My order 412889912 is having an issue, I was charged 300$ when 
online it says 280$
"""

In [50]:
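# return what's in parentheses ()
# NOT: a ^ at the start of a character set negates it,
# so [^\d] matches any character that is not a digit
pattern = r'order[^\d]*(\d*)'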

In [51]:
matches = re.findall(pattern, text)
matches

['412889912', '412889912', '412889912']
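One caveat: `\d*` also matches zero digits, so a message that mentions "order" without a number yields an empty string rather than no match. The sketch below contrasts it with `\d+`, and uses `\D` as shorthand for `[^\d]`. The sample messages are made up for illustration.

```python
import re

no_number = "codebasics: my order has not arrived yet"
with_number = "codebasics: my order # 412889912 is late"

# \d* accepts zero digits, so the group can be empty.
loose_result = re.findall(r"order[^\d]*(\d*)", no_number)
# \d+ requires at least one digit, so there is no match at all.
strict_result = re.findall(r"order\D*(\d+)", no_number)

print(loose_result)   # ['']
print(strict_result)  # []
print(re.findall(r"order\D*(\d+)", with_number))  # ['412889912']
```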

## 5. Extract full name, birthdate, birthplace, and age from Wikipedia

In [52]:
text = """
Official portrait, 2021
46th President of the United States
Incumbent
Assumed office
January 20, 2021
Vice President	Kamala Harris
Preceded by	Donald Trump
47th Vice President of the United States
In office
January 20, 2009 – January 20, 2017
President	Barack Obama
Preceded by	Dick Cheney
Succeeded by	Mike Pence
United States Senator
from Delaware
In office
January 3, 1973 – January 15, 2009
Preceded by	J. Caleb Boggs
Succeeded by	Ted Kaufman
Member of the New Castle County Council
from the 4th district
In office
January 5, 1971 – January 3, 1973
Preceded by	Lawrence T. Messick
Succeeded by	Francis R. Swift
Personal details
Born	Joseph Robinette Biden Jr.
November 20, 1942 (age 80)
Scranton, Pennsylvania, U.S.
Political party	Democratic (since 1969)
Other political
affiliations	Independent (before 1969)
Spouses	
Neilia Hunter
​
​(m. 1966; died 1972)​
Jill Jacobs ​(m. 1977)​
Children	
BeauHunterNaomiAshley
Relatives	Biden family
Residence	
White House
Alma mater	
University of Delaware (BA)
Syracuse University (JD)
Occupation	
Politicianlawyerauthor
Awards	List of honors and awards
Signature	Cursive signature in ink
Website	
Campaign website
White House website
"""

In [53]:
pattern = r'age (\d+)'

In [54]:
matches = re.findall(pattern, text)
matches

['80']

In [55]:
pattern = r'Born\s(.+)\n'
matches = re.findall(pattern, text)
matches

['Joseph Robinette Biden Jr.']

In [58]:
# alternative: capture the leading whitespace too, then strip() it
pattern = r'Born(.+)\n'
matches = re.findall(pattern, text)
matches[0].strip()

'Joseph Robinette Biden Jr.'

In [59]:
pattern = r'Born.*\n(.*)\(age'
matches = re.findall(pattern, text)
matches[0].strip()

'November 20, 1942'

In [60]:
pattern = r'\(age.*\n(.*)'
matches = re.findall(pattern, text)
matches[0].strip()

'Scranton, Pennsylvania, U.S.'
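The four fields sit on three consecutive lines, so they can also be pulled out in one pass with named groups and `re.search`. This is a sketch under the assumption that the "Born" block always has this three-line layout; the names `born_pattern`, `infobox`, and `details` are illustrative.

```python
import re

# Just the three relevant lines from the text above.
infobox = (
    "Born\tJoseph Robinette Biden Jr.\n"
    "November 20, 1942 (age 80)\n"
    "Scranton, Pennsylvania, U.S.\n"
)

born_pattern = re.compile(
    r"Born\s+(?P<name>.+)\n"                         # line 1: full name
    r"(?P<birth_date>.+?)\s*\(age (?P<age>\d+)\)\n"  # line 2: date and age
    r"(?P<birth_place>.+)"                           # line 3: birthplace
)

match = born_pattern.search(infobox)
details = match.groupdict() if match else {}
print(details["name"])        # Joseph Robinette Biden Jr.
print(details["birth_date"])  # November 20, 1942
```

`groupdict()` returns every named group as a string, so `details["age"]` is `'80'` and still needs `int()` if a number is wanted.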

## Define helper functions

In [61]:
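def get_pattern_match(pattern, text):
    # return the first match (the captured group if the pattern has one),
    # or None when nothing matches
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None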

In [63]:
def get_personal_information(text):
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.+)\n', text)
    birthplace = get_pattern_match(r'\(age.*\n(.*)', text)
    birthdate = get_pattern_match(r'Born.*\n(.*)\(age', text)
    
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birthdate.strip(),
        'birth_place': birthplace.strip()
    }

In [64]:
get_personal_information(text)

{'age': 80,
 'name': 'Joseph Robinette Biden Jr.',
 'birth_date': 'November 20, 1942',
 'birth_place': 'Scranton, Pennsylvania, U.S.'}
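`get_pattern_match` returns `None` when a pattern finds nothing, and `get_personal_information` would then fail on `int(None)` or `None.strip()`. A possible defensive variant is sketched below. The names `first_group` and `get_personal_information_safe`, and the `Jane Doe` input, are hypothetical and not part of the notebook.

```python
import re

def first_group(pattern, text, default=None):
    """Return the stripped first capture group, or `default` if no match."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default

def get_personal_information_safe(text):
    age = first_group(r"age (\d+)", text)
    return {
        # Only convert to int when an age was actually found.
        "age": int(age) if age is not None else None,
        "name": first_group(r"Born(.+)\n", text),
        "birth_date": first_group(r"Born.*\n(.*)\(age", text),
        "birth_place": first_group(r"\(age.*\n(.*)", text),
    }

# Hypothetical page that only lists a name.
partial = "Born\tJane Doe\n"
info = get_personal_information_safe(partial)
print(info)
# {'age': None, 'name': 'Jane Doe', 'birth_date': None, 'birth_place': None}
```

Missing fields come back as `None` instead of raising, which leaves it to the caller to decide how to treat incomplete pages.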