In [1]:
import re

This regular-expressions practice sheet was worked through alongside https://regex101.com/, an online regex tester.

In [2]:
# Return phone numbers, which may be written in two formats:
# ten plain digits, or (XXX)-XXX-XXXX

text='''
Elon Musk's phone number is 9991116666, call him if you have any questions on dogecoin.
Tesla's revenue is 40 billion. Tesla's CFO number is (999)-333-7777
'''

# Raw string (r'...') so that Python does not treat \( and \d as string escapes
pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

matches = re.findall(pattern,text)
matches

['9991116666', '(999)-333-7777']
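
A possible follow-up step, not part of the original sheet: once both formats are matched, stripping every non-digit character gives a single canonical form. The helper name `normalize_phone` and the shortened sample text are assumptions made for illustration.

```python
import re

# Shortened sample text (hypothetical), containing both phone formats
text = "Call 9991116666 or, for the CFO, (999)-333-7777."

pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

def normalize_phone(raw):
    # \D matches any character that is not a digit; replace each with nothing
    return re.sub(r'\D', '', raw)

normalized = [normalize_phone(m) for m in re.findall(pattern, text)]
print(normalized)
```

Both numbers come out as ten-digit strings, which makes them easier to compare or deduplicate.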

In [3]:
# From Tesla company filings 2021
# Return the titles next to "Note 1" and "Note 2" and so on

text = '''
Notes to Consolidated Financial Statements
(unaudited)
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop,
manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and
measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.
'''

# Capture everything after "Note <digit> - " up to the end of the line
pattern = r'Note \d - ([^\n]*)'
re.findall(pattern,text)

['Overview', 'Summary of Significant Accounting Policies']
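
The pattern above matches a single digit, so a heading such as "Note 12" would be missed, and it would also match "Note 2 - ..." appearing mid-sentence. A minimal sketch of one way to tighten it, assuming headings always start at the beginning of a line; the sample text here is invented for illustration and is not from the filing.

```python
import re

# Invented sample: one heading with a two-digit number, one mid-sentence mention
sample = '''Notes to Consolidated Financial Statements
Note 1 - Overview
See Note 2 - for details on accounting policies.
Note 12 - Income Taxes
'''

# ^ and $ anchor to line boundaries when re.MULTILINE is set;
# \d+ accepts one or more digits
pattern = r'^Note \d+ - (.*)$'
titles = re.findall(pattern, sample, flags=re.MULTILINE)
print(titles)
```

The mid-sentence mention on the third line is skipped because it does not start the line.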

## Extract financial periods from a company's financial reporting

In [4]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. FY2020 Q4 it was $3 billion. Here's a lowercase example: fy2020 Q4
'''

# The two patterns are equivalent: [1234] and [1-4] match the same digits
pattern1 = r'FY\d{4} Q[1234]'

pattern2 = r'FY\d{4} Q[1-4]'

# Pass a flag so that the match ignores upper and lower case
# re.I is the short alias for re.IGNORECASE

matches1 = re.findall(pattern1,text, flags=re.I) 
matches2 = re.findall(pattern2,text, flags=re.IGNORECASE)

print(f'pattern 1 output: {matches1}')
print(' ')
print(f'pattern 2 output: {matches2}')

pattern 1 output: ['FY2021 Q1', 'FY2020 Q4', 'fy2020 Q4']
 
pattern 2 output: ['FY2021 Q1', 'FY2020 Q4', 'fy2020 Q4']
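
To see what the flag contributes, the same pattern can be run without it. The sketch below also shows the inline form `(?i)`, which sets the flag inside the pattern itself; the shortened sample text is an assumption for illustration.

```python
import re

# Shortened sample text (hypothetical)
text = "FY2021 Q1 was $4.85 billion; FY2020 Q4 was $3 billion; lowercase: fy2020 Q4"

# Without a flag the match is case-sensitive, so 'fy2020 Q4' is skipped
case_sensitive = re.findall(r'FY\d{4} Q[1-4]', text)

# (?i) at the start of the pattern is the inline equivalent of flags=re.I
inline_flag = re.findall(r'(?i)FY\d{4} Q[1-4]', text)

print(case_sensitive)
print(inline_flag)
```

Note that ignoring case applies to the whole pattern, so a lowercase `q` would match as well.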


## What if I don't want the "FY" but just the "2021 Q4"?

In [5]:
# Wrap the part you want in parentheses (a capture group);
# findall then returns only the captured part of each match
pattern1 = r'FY(\d{4} Q[1234])'

pattern2 = r'FY(\d{4} Q[1-4])'

# Pass a flag so that the match ignores upper and lower case
# re.I is the short alias for re.IGNORECASE

matches1 = re.findall(pattern1,text, flags=re.I) 
matches2 = re.findall(pattern2,text, flags=re.IGNORECASE)

print(f'pattern 1 output: {matches1}')
print(' ')
print(f'pattern 2 output: {matches2}')

pattern 1 output: ['2021 Q1', '2020 Q4', '2020 Q4']
 
pattern 2 output: ['2021 Q1', '2020 Q4', '2020 Q4']
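
What `re.findall` returns depends on how many capture groups the pattern has. A small sketch of the three cases, using a shortened sample text (an assumption for illustration):

```python
import re

text = "FY2021 Q1 and FY2020 Q4"

# No groups: each result is the full match
no_group = re.findall(r'FY\d{4} Q[1-4]', text)

# One group: each result is that group's text
one_group = re.findall(r'FY(\d{4} Q[1-4])', text)

# Two or more groups: each result is a tuple of the groups
two_groups = re.findall(r'FY(\d{4}) (Q[1-4])', text)

# (?:...) groups without capturing, so the full match is returned again
non_capturing = re.findall(r'(?:FY)\d{4} Q[1-4]', text)

print(no_group, one_group, two_groups, non_capturing, sep='\n')
```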


## Instead of the financial periods, let's extract the actual values

In [6]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. Tesla's employee count is 5400.In previous quarter i.e. 
FY2020 Q4 it was $3 billion. Here's a lowercase example: fy2020 Q4
'''

# \$ matches a literal dollar sign; inside [...] the dot needs no escape
pattern = r'\$([0-9.]+)'
matches1 = re.findall(pattern,text) 

print(f'pattern output: {matches1}')

pattern output: ['4.85', '3']
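
The captured values are strings, and `[0-9.]+` would also accept a trailing sentence dot or a run such as `1.2.3`. A hedged sketch of a stricter alternative that requires digits on both sides of an optional decimal point and then converts to `float`; the sample text is invented for illustration.

```python
import re

# Invented sample: the second amount is followed directly by a full stop
text = "The cost was $4.85 billion. Before that it was $3. Headcount is 5400."

# \d+ then an optional non-capturing group: a dot followed by more digits
pattern = r'\$(\d+(?:\.\d+)?)'

values = [float(v) for v in re.findall(pattern, text)]
print(values)
```

The employee count is ignored because it is not preceded by a dollar sign.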


## Now let's extract both the financial periods and the values

In [7]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. Tesla's employee count is 5400.In previous quarter i.e. FY2020 Q4 it was $3 billion. Here's a lowercase example: fy2020 Q4
'''
# We want:
#2021 Q1 ==> 4.85
#2020 Q4 ==> 3



# Group 1: the period; [^$]+ skips ahead to the next dollar sign; group 2: the value
pattern = r'FY(\d{4} Q[1-4]) [^$]+\$([0-9.]+)'
matches = re.findall(pattern,text) 

print(f'{matches}')

[('2021 Q1', '4.85'), ('2020 Q4', '3')]
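
Since each result is a `(period, value)` tuple, the list converts directly into a dictionary keyed by period. A minimal sketch; the name `values_by_period` and the shortened text are assumptions for illustration.

```python
import re

# Shortened sample text (hypothetical)
text = "In FY2021 Q1 the cost was $4.85 billion. In FY2020 Q4 it was $3 billion."

pattern = r'FY(\d{4} Q[1-4]) [^$]+\$([0-9.]+)'
matches = re.findall(pattern, text)

# Map each period to its numeric value
values_by_period = {period: float(value) for period, value in matches}
print(values_by_period)
```

One caveat: `[^$]+` skips over any text up to the next dollar sign, so a period mentioned without a value would be paired with the next dollar amount that follows it.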


In [8]:
### Other than findall, there is also a function called search
# "findall" finds all occurrences of the pattern in the text, whereas "search" returns only the first occurrence

# Now let's extract both the financial periods and the values using search

text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. Tesla's employee count is 5400.In previous quarter i.e. FY2020 Q4 it was $3 billion. Here's a lowercase example: fy2020 Q4
'''
# We want:
#2021 Q1 ==> 4.85
#2020 Q4 ==> 3



pattern = r'FY(\d{4} Q[1-4]) [^$]+\$([0-9.]+)'
matches = re.search(pattern,text) 

print(f'{matches}')
matches.groups() # "search" returns a Match object (or None), so call .groups() to get the captured values

<re.Match object; span=(47, 66), match='FY2021 Q1 was $4.85'>


('2021 Q1', '4.85')
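
Two practical points about `search`: it returns `None` when nothing matches, so calling `.groups()` directly can raise an `AttributeError`, and it stops at the first match. A sketch of guarding against `None` and of using `re.finditer` to get a Match object for every occurrence; the shortened sample text is an assumption for illustration.

```python
import re

# Shortened sample text (hypothetical)
text = "In FY2021 Q1 the cost was $4.85 billion. In FY2020 Q4 it was $3 billion."
pattern = r'FY(\d{4} Q[1-4]) [^$]+\$([0-9.]+)'

# Guard against None before touching the match
first = re.search(pattern, text)
if first is not None:
    print(first.group(1), first.group(2))

# finditer yields one Match object per occurrence
pairs = [m.groups() for m in re.finditer(pattern, text)]
print(pairs)

# No match: search returns None rather than raising
missing = re.search(pattern, "no periods here")
print(missing)
```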