# Regular Expressions
A regular expression (often written as regex or regexp) is a pattern used to search, match, and manipulate text. Think of it as a tiny, powerful language for describing text patterns.
### What a regular expression actually does
It tells the computer what kind of text you’re looking for, not the exact text itself.

For example:
*   Find all email addresses in a document
*   Check if a password is strong
*   Extract dates, phone numbers, URLs
*   Replace or clean messy text

This is why regex is heavily used in NLP, data cleaning, preprocessing, and text mining.
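As a small illustration of describing a kind of text rather than the exact text, the sketch below pulls dates out of a sentence. The sample sentence and the name `date_pattern` are made up for this example, and the pattern assumes dates are written as YYYY-MM-DD.

```python
import re

# Hypothetical sample sentence, not taken from the notebook's data
sample = "Invoices were issued on 2021-09-30 and 2020-12-31."

# Four digits, a dash, two digits, a dash, two digits
date_pattern = r'\d{4}-\d{2}-\d{2}'

dates = re.findall(date_pattern, sample)
print(dates)  # ['2021-09-30', '2020-12-31']
```

The pattern never mentions a specific date; it only describes the shape every date has.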

To build and test patterns interactively, use [Regex101](https://regex101.com/).

In [13]:
import re

In [7]:
text='''Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. Tesla's revenue is 40 billion
Tesla's CFO number (999)-333-7777'''

In [8]:
text

"Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. Tesla's revenue is 40 billion \nTesla's CFO number (999)-333-7777"

In [14]:
# Raw string (r'...') so that backslashes reach the regex engine unchanged
pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

In [15]:
matches = re.findall(pattern, text)
matches

['9991116666', '(999)-333-7777']
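The pattern above packs two phone formats into one line. One way to make it easier to read is `re.VERBOSE`, which lets the pattern span several lines with comments. This is only a sketch of the same pattern laid out differently; the name `phone_pattern` is introduced here for illustration.

```python
import re

text = '''Elon musk's phone number is 9991116666, call him if you have any questions on dodgecoin. Tesla's revenue is 40 billion
Tesla's CFO number (999)-333-7777'''

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment
phone_pattern = re.compile(r'''
    \(\d{3}\)-\d{3}-\d{4}   # format 1: (999)-333-7777
    |                       # OR
    \d{10}                  # format 2: ten digits in a row
''', re.VERBOSE)

matches = phone_pattern.findall(text)
print(matches)  # ['9991116666', '(999)-333-7777']
```

The parentheses are escaped as `\(` and `\)` because unescaped parentheses would create a group instead of matching the characters themselves.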

In [16]:
text = '''
Note 1 - Overview
Tesla, Inc. (“Tesla”, the “Company”, “we”, “us” or “our”) was incorporated in the State of Delaware on July 1, 2003. We design, develop, manufacture and sell high-performance fully electric vehicles and design, manufacture, install and sell solar energy generation and energy storage
products. Our Chief Executive Officer, as the chief operating decision maker (“CODM”), organizes our company, manages resource allocations and measures performance among two operating and reportable segments: (i) automotive and (ii) energy generation and storage.
Beginning in the first quarter of 2021, there has been a trend in many parts of the world of increasing availability and administration of vaccines
against COVID-19, as well as an easing of restrictions on social, business, travel and government activities and functions. On the other hand, infection
rates and regulations continue to fluctuate in various regions and there are ongoing global impacts resulting from the pandemic, including challenges
and increases in costs for logistics and supply chains, such as increased port congestion, intermittent supplier delays and a shortfall of semiconductor
supply. We have also previously been affected by temporary manufacturing closures, employment and compensation adjustments and impediments to
administrative activities supporting our product deliveries and deployments.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
The consolidated balance sheet as of September 30, 2021, the consolidated statements of operations, the consolidated statements of
comprehensive income, the consolidated statements of redeemable noncontrolling interests and equity for the three and nine months ended September
30, 2021 and 2020 and the consolidated statements of cash flows for the nine months ended September 30, 2021 and 2020, as well as other information
disclosed in the accompanying notes, are unaudited. The consolidated balance sheet as of December 31, 2020 was derived from the audited
consolidated financial statements as of that date. The interim consolidated financial statements and the accompanying notes should be read in
conjunction with the annual consolidated financial statements and the accompanying notes contained in our Annual Report on Form 10-K for the year
ended December 31, 2020.
'''

In [19]:
# Raw string (r'...') so that backslashes reach the regex engine unchanged
pattern = r'Note \d - ([^\n]*)'

In [23]:
titles = re.findall(pattern, text)
titles

['Overview', 'Summary of Significant Accounting Policies']
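If the note number is needed as well as the title, a second capturing group can be added. The sketch below uses a shortened stand-in for the filing text and also anchors the match to the start of a line with `^` and `re.MULTILINE`. Both the sample and the name `note_pattern` are assumptions made for this example.

```python
import re

# Shortened stand-in for the filing text used above
sample = '''
Note 1 - Overview
Tesla, Inc. was incorporated in the State of Delaware on July 1, 2003.
Note 2 - Summary of Significant Accounting Policies
Unaudited Interim Financial Statements
'''

# ^ with re.MULTILINE matches at the start of every line;
# the two groups capture the note number and the title
note_pattern = r'^Note (\d+) - ([^\n]*)'

notes = re.findall(note_pattern, sample, flags=re.MULTILINE)
print(notes)
# [('1', 'Overview'), ('2', 'Summary of Significant Accounting Policies')]
```

With more than one group, `re.findall` returns a list of tuples, one tuple per match.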

In [31]:
text = '''
The gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
In previous quarter i.e. Fy2020 Q4 it was $3 billion.
'''

### Extract financial years

In [40]:
pattern = r'FY\d{4} Q[1-4]'

In [41]:
re.findall(pattern, text, flags=re.IGNORECASE)

['FY2021 Q1', 'Fy2020 Q4']

In [38]:
pattern = r'FY(\d{4} Q[1-4])'

In [39]:
re.findall(pattern, text, flags=re.IGNORECASE)

['2021 Q1', '2020 Q4']
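The two cells above differ only in the parentheses: without a group, `re.findall` returns the whole match, and with a group it returns only the captured part. To see both at once, `re.finditer` can be used, as in this sketch. The sample sentence is hypothetical and deliberately mixes upper and lower case.

```python
import re

# Hypothetical sentence with mixed-case prefixes
sample = "Costs rose in FY2021 Q1 compared with fy2020 Q4."

pattern = r'FY(\d{4} Q[1-4])'

# group(0) is the full match, group(1) is the first capturing group
results = [
    (m.group(0), m.group(1))
    for m in re.finditer(pattern, sample, flags=re.IGNORECASE)
]
print(results)
# [('FY2021 Q1', '2021 Q1'), ('fy2020 Q4', '2020 Q4')]
```

Note that `re.IGNORECASE` only affects matching; the returned text keeps the case it had in the input.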

### Extract financial numbers only

In [42]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. Tesla's employee count is 4532
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

In [50]:
pattern = r'\$([0-9\.]+)'
re.findall(pattern, text)

['4.85', '3']

**Explanation:**

\$([0-9\.]+)

This regex is used to extract the numeric part of a dollar amount.

Breaking it down:

\$ → matches a literal dollar sign.
     ($ is a special regex character, so it must be escaped.)

([0-9\.]+) → capturing group that extracts the number.
- [0-9\.] matches any digit (0–9) or a literal dot (.). Inside a character class the dot is already literal, so the backslash is optional.
- + means match one or more of these characters.
- (...) captures the matched number without the dollar sign.

This pattern matches:
- $4.85  → captures 4.85
- $3     → captures 3
- $100.50 → captures 100.50
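One limitation worth knowing: because `[0-9\.]+` accepts any run of digits and dots, it also swallows a full stop that directly follows an amount. The sketch below compares it with a stricter alternative, `\d+(?:\.\d+)?`, on a made-up sentence. The stricter pattern is a suggestion for illustration, not part of the original notebook.

```python
import re

# Hypothetical sentence where amounts are followed by a full stop
sample = "Revenue was $4.85. Costs were $3 and fees were $100.50."

# Original pattern: any run of digits and dots
loose = re.findall(r'\$([0-9\.]+)', sample)

# Stricter sketch: digits, optionally followed by one dot and more digits
strict = re.findall(r'\$(\d+(?:\.\d+)?)', sample)

# The strict captures can be converted to numbers safely
amounts = [float(s) for s in strict]

print(loose)    # ['4.85.', '3', '100.50.']
print(strict)   # ['4.85', '3', '100.50']
print(amounts)  # [4.85, 3.0, 100.5]
```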


### Extract both periods and financial numbers

In [54]:
pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'
data = re.findall(pattern, text)

In [55]:
data

[('2021 Q1', '4.85'), ('2020 Q4', '3')]

In [63]:
print(f"{data[0][0]} ==> {data[0][1]}")
print(f"{data[1][0]} ==> {data[1][1]}")

2021 Q1 ==> 4.85
2020 Q4 ==> 3
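Indexing `data[0]` and `data[1]` by hand works for two matches, but a loop scales to any number of them. As a sketch, the tuples returned by `re.findall` can be unpacked into a dictionary that maps each period to its amount. The name `costs` is introduced here for illustration.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion. Tesla's employee count is 4532
In previous quarter i.e. FY2020 Q4 it was $3 billion.
'''

pattern = r'FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)'

# Each match is a (period, amount) tuple; convert the amount to a float
costs = {period: float(amount) for period, amount in re.findall(pattern, text)}
# costs is {'2021 Q1': 4.85, '2020 Q4': 3.0}

for period, amount in costs.items():
    print(f"{period} ==> {amount}")
```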


In [65]:
# re.search returns only the first occurrence (as a Match object)
data = re.search(pattern, text)
data

<re.Match object; span=(51, 70), match='FY2021 Q1 was $4.85'>

In [69]:
data.groups()

('2021 Q1', '4.85')
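Unlike `re.findall`, `re.search` returns `None` when nothing matches, so calling `.groups()` on the result without a check can raise an `AttributeError`. The sketch below guards against that and uses named groups, `(?P<name>...)`, to make the result easier to read. The helper `first_period_amount` and the group names are hypothetical, chosen for this example.

```python
import re

# Same pattern as above, with named groups instead of numbered ones
pattern = r'FY(?P<period>\d{4} Q[1-4])[^\$]+\$(?P<amount>[0-9\.]+)'

def first_period_amount(text):
    """Return (period, amount) for the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group('period'), match.group('amount')

found = first_period_amount("Cost in FY2021 Q1 was $4.85 billion.")
missing = first_period_amount("No fiscal period mentioned here.")

print(found)    # ('2021 Q1', '4.85')
print(missing)  # None
```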

# Exercise

**1. Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters, i.e. A-Z, a-z, 0 to 9, and underscore _**

In [71]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

In [72]:
# Answer
# Capture everything up to the next comma or newline
pattern = r'https://twitter\.com/([^,\n]+)'
re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']

In [74]:
# Answer - best method
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'
re.findall(pattern, text)


['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']

**Explanation:**

([a-zA-Z0-9_]+)
This is the capturing group that extracts the username.

Breaking it down:

[a-zA-Z0-9_] → match any:
- lowercase letters
- uppercase letters
- digits
- underscore _

These are the only valid characters in a Twitter handle.

'+' → match one or more of those characters

(...) → capture the matched username so you can extract it

So this part matches usernames like:
- elonmusk
- teslarati
- dummy_tesla
- dummy_2_tesla
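A shorter way to write the same character class is `\w`. In Python 3, `\w` on a `str` pattern also matches non-ASCII letters by default, so the sketch below adds `re.ASCII` to restrict it to `[a-zA-Z0-9_]`. This is an alternative spelling for illustration, not a different answer.

```python
import re

text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''

# With re.ASCII, \w is equivalent to [a-zA-Z0-9_]
handles = re.findall(r'https://twitter\.com/(\w+)', text, flags=re.ASCII)
print(handles)
# ['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
```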


**2. Extract concentration risk types. Each one is the text that appears after "Concentration of Risk:" on the same line. In the example below, your regex should extract these two strings:**

    (1) Credit Risk
  
    (2) Supply Risk

In [75]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = r'Concentration of Risk: ([^\n]*)'

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']

In [76]:
pattern = r'Concentration of Risk: ([^\n]+)'
re.findall(pattern, text)

['Credit Risk', 'Supply Risk']

**Explanation:**

Concentration of Risk: ([^\n]*)

This regex is used to capture the text that appears after the phrase
"Concentration of Risk:" on the same line.

Breaking it down:

Concentration of Risk: → matches the literal phrase.

([^\n]*) → capturing group that extracts everything up to the end of the line.
- [^...] is a negated character class: it matches any character not listed inside the brackets.
- [^\n] means "match any character except a newline".
- '*' means match zero or more of those characters (first version).
- '+' means match one or more of those characters (second version), so an empty title does not match.
- (...) captures the matched text.

So this pattern extracts:
- "Credit Risk"
- "Supply Risk"

from the given text.
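On the text above, `*` and `+` give the same result because every heading has a title. The difference shows up only when a title is missing. The sketch below uses a made-up sample in which the first heading has nothing after the colon.

```python
import re

# Hypothetical sample: the first heading has no title after the colon
sample = "Concentration of Risk: \nConcentration of Risk: Supply Risk\n"

# '*' allows an empty capture, so the empty heading still matches
with_star = re.findall(r'Concentration of Risk: ([^\n]*)', sample)

# '+' requires at least one character, so the empty heading is skipped
with_plus = re.findall(r'Concentration of Risk: ([^\n]+)', sample)

print(with_star)  # ['', 'Supply Risk']
print(with_plus)  # ['Supply Risk']
```

Which one to prefer depends on whether an empty title should be reported or ignored.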


**3. Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like the one below. To extract both quarterly and semi-annual periods, you can use a regex as shown below.**

Hint: use a non-capturing group (?:...) to group the alternatives without adding an extra capture.

In [77]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

In [83]:
pattern = r'FY(\d{4} [QS]\d) [^$]+\$[0-9\.]+'
re.findall(pattern, text)

['2021 Q1', '2021 S1']

In [81]:
pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
re.findall(pattern, text)

['2021 Q1', '2021 S1']

**Explanation:**

FY(\d{4} (?:Q[1-4]|S[1-2]))

This regex matches fiscal periods such as "FY2021 Q1" or "FY2021 S1" and captures the part after "FY" (for example, "2021 Q1").

Breaking it down:

FY → matches the literal characters "FY".

(\d{4} (?:Q[1-4]|S[1-2])) → capturing group for the fiscal period.
- \d{4} → matches a 4‑digit year (e.g., 2021).
- (?:Q[1-4]|S[1-2]) → non‑capturing group that matches either:
    • Q1, Q2, Q3, Q4  (quarters)
    • S1, S2          (semesters, i.e. half-years)
- | → OR operator.
- (...) → the outer parentheses capture the entire year + period.

This pattern successfully matches:
- FY2021 Q1
- FY2021 S1
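To see why the inner group is written as `(?:...)`, it helps to compare it with an ordinary capturing group. In the sketch below, the only difference between the two patterns is the `?:`. The variable names are introduced here for illustration.

```python
import re

text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

# Inner group is non-capturing: findall returns one string per match
non_capturing = re.findall(r'FY(\d{4} (?:Q[1-4]|S[1-2]))', text)

# Inner group is capturing: findall returns a tuple with both groups
capturing = re.findall(r'FY(\d{4} (Q[1-4]|S[1-2]))', text)

print(non_capturing)  # ['2021 Q1', '2021 S1']
print(capturing)      # [('2021 Q1', 'Q1'), ('2021 S1', 'S1')]
```

The non-capturing form keeps the alternation grouped while leaving the output as a flat list of periods.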
