## All You Need to Know About Regular Expressions

#### What Is a Regular Expression and Why Is It Used in NLP?

* Regular expressions, often abbreviated as **regex**, are powerful tools used in **text processing and pattern matching**. A regex is a sequence of characters that defines a search pattern, allowing you to match and manipulate text based on specific rules.

* Regular expressions are widely used in Natural Language Processing (NLP) for various purposes:

1. **Text Cleaning**: Regular expressions are helpful in removing unwanted characters, special symbols, or punctuation marks from text data. They allow you to identify and replace patterns in the text, such as removing URLs, email addresses, or HTML tags.

2. **Tokenization**: Tokenization is the process of breaking text into smaller meaningful units called tokens. Regular expressions can be employed to split text into words, sentences, or other desired units based on specific patterns or delimiters.

3. **Pattern Matching**: Regular expressions enable you to search for specific patterns or sequences of characters within text data. This is useful for tasks like **named entity recognition**, finding email addresses, detecting phone numbers, or extracting dates from text.

4. **Text Validation**: Regular expressions help validate the format or structure of text data. For instance, you can use them to verify if an input matches a certain pattern, such as checking if an email address is valid or if a phone number follows a specific format.

5. **Information Extraction**: Regular expressions play a vital role in extracting specific information from text data. For example, you can use them to find and extract specific patterns like dates, addresses, or product codes from unstructured text.

Overall, regular expressions provide a flexible and powerful approach for text pattern matching and manipulation, making them a valuable tool in NLP tasks.
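As a rough illustration of the cleaning and tokenization uses listed above, the sketch below strips URLs from a sentence and then splits what is left into word tokens. The sample sentence and variable names are made up for illustration, and the URL pattern is deliberately simplistic.

```python
import re

# Hypothetical sample text, not taken from any real dataset
raw = "Check https://example.com/docs now!  Regex is handy, isn't it?"

# Text cleaning: drop anything that looks like an http(s) URL
no_urls = re.sub(r'https?://\S+', '', raw)

# Tokenization: keep runs of letters, digits, and apostrophes as tokens
tokens = re.findall(r"[A-Za-z0-9']+", no_urls)

print(tokens)
```

Real tokenizers handle many more cases (hyphens, numbers with separators, non-English scripts), so treat this only as a starting point.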

To learn more and try out patterns interactively, visit https://regex101.com/

In [9]:
import re
# Matches the order number
chat1='codebasics: Hello, I am having an issue with my order # 412889912 and my phone number is 8882221234'

pattern = r'order[^\d]*(\d*)'  # raw string, so \d is passed to the regex engine unchanged
matches = re.findall(pattern, chat1)
matches

['412889912']

In [10]:
# Matches the Phone Number
pattern = r'\d{10}'
matches=re.findall(pattern,chat1)
matches

['8882221234']

In [30]:
# When the sentence has two phone numbers
chat2='Hey hii janse can you send me your phone number as well your alternate number 8888881234,4356121232'
pattern = r'\d{10}'

In [31]:
re.findall(pattern,chat2)

['8888881234', '4356121232']
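One thing to watch with a bare `\d{10}` is that it also matches the first ten digits of any longer number. A possible refinement, sketched below on a made-up message, is to wrap the pattern in word boundaries (`\b`) so that only stand-alone 10-digit runs are returned.

```python
import re

# Hypothetical message: a 12-digit order number followed by a 10-digit phone number
msg = 'order 412889912345, phone 8882221234'

loose = re.findall(r'\d{10}', msg)       # also grabs the first 10 digits of the order number
strict = re.findall(r'\b\d{10}\b', msg)  # requires the 10 digits to stand alone

print(loose)   # ['4128899123', '8882221234']
print(strict)  # ['8882221234']
```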

In [54]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

In [62]:
pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

In [63]:
re.findall(pattern,chat1)

['1235678912']
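The same alternation can be checked against all three chat messages at once. The sketch below reuses the messages defined above; the names `chats`, `phone_pattern`, and `phones` are introduced here only for illustration.

```python
import re

chats = [
    'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com',
    'codebasics: here it is: (123)-567-8912, abc@xyz.com',
    'codebasics: yes, phone: 1235678912 email: abc@xyz.com',
]

# Either 10 consecutive digits, or the (ddd)-ddd-dddd layout
phone_pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'

phones = [re.findall(phone_pattern, chat) for chat in chats]
print(phones)
# [['1235678912'], ['(123)-567-8912'], ['1235678912']]
```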

### 1) Regex in customer support: retrieve the order number


In [36]:

chat1='codebasics: Hello, I am having an issue with my order # 412889912'

pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat1)
matches

['412889912']

In [37]:
chat2='codebasics: I have a problem with my order number 412889912'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat2)
matches

['412889912']

In [38]:
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
pattern = r'order[^\d]*(\d*)'
matches = re.findall(pattern, chat3)
matches

['412889912']
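A detail worth noting about the pattern used in these cells: because the group is `(\d*)`, it can match zero digits, so a message that mentions "order" without any number still yields a result, namely an empty string. The sketch below, using a made-up message, compares it with a `(\d+)` variant that requires at least one digit.

```python
import re

# Hypothetical message that mentions an order but contains no order number
msg = 'codebasics: I placed an order yesterday'

with_star = re.findall(r'order[^\d]*(\d*)', msg)  # group may be empty
with_plus = re.findall(r'order[^\d]*(\d+)', msg)  # group needs at least one digit

print(with_star)  # ['']
print(with_plus)  # []
```

Which behavior is preferable depends on how the caller treats empty results; an empty list is usually easier to test for than a list holding an empty string.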

### Retrieve email ID and phone number

In [39]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'

#### Email ID

In [65]:
# Use A-Z (not A-z) and require at least one character on each side of '@' and '.'
pattern = r'[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z]+'
re.findall(pattern,chat1)

['abc@xyz.com']

In [66]:
re.findall(pattern,chat2)

['abc@xyz.com']

In [67]:
re.findall(pattern,chat3)

['abc@xyz.com']
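A common pitfall when writing character classes for emails is the range `A-z`. In ASCII it spans more than the letters: it also covers the six punctuation characters that sit between `Z` and `a`. The short check below shows the difference; the variable names are only for illustration.

```python
import re

# The six ASCII characters that sit between 'Z' and 'a'
between = '[\\]^_`'

wide = re.findall(r'[A-z]', between)       # all six characters match
narrow = re.findall(r'[A-Za-z]', between)  # none of them match

print(wide)
print(narrow)
```

Writing the two ranges separately, as in `[A-Za-z]`, avoids accepting brackets, backslashes, or backticks inside an address.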

#### Age

In [81]:
text='''Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 66)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater	
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E.)
Occupation(s)	Chairman and MD, Reliance Industries
Spouse	Nita Ambani ​(m. 1985)​[3]
Children	3
Parent	
Dhirubhai Ambani (father)
Relatives	Anil Ambani (brother)
Tina Ambani (sister-in-law)'''

In [82]:
pattern = r'age (\d+)'
re.findall(pattern,text)

['66']

In [83]:
def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

In [84]:
get_pattern_match(pattern,text)

'66'

In [85]:
def extract_personal_information(text):
    # Raw strings keep \d, \n and \( intact for the regex engine
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }

In [86]:
extract_personal_information(text)

{'age': 66,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
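Note that `get_pattern_match` returns `None` when nothing matches, so calls such as `int(age)` or `full_name.strip()` would raise an exception on a page that lacks one of the fields. One possible way to guard against that is sketched below; `first_group` and `extract_age` are hypothetical helpers, not part of the notebook above.

```python
import re

def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default

def extract_age(text):
    """Return the age as an int, or None when no 'age N' fragment is present."""
    age = first_group(r'age (\d+)', text)
    return int(age) if age is not None else None

print(extract_age('19 April 1957 (age 66)'))  # 66
print(extract_age('no age given here'))       # None
```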

#### Python Regular Expression Tutorial Exercises

##### 1. Extract all Twitter handles from the following text. A Twitter handle is the text that appears after https://twitter.com/ and is a single word. It contains only alphanumeric characters (A-Z, a-z, 0 to 9) and the underscore _

In [96]:
text = '''
Follow our leader Elon musk on twitter here: https://twitter.com/elonmusk, more information 
on Tesla's products can be found at https://www.tesla.com/. Also here are leading influencers 
for tesla related news,
https://twitter.com/teslarati
https://twitter.com/dummy_tesla
https://twitter.com/dummy_2_tesla
'''
pattern = r'https://twitter\.com/([a-zA-Z0-9_]+)'

re.findall(pattern, text)

['elonmusk', 'teslarati', 'dummy_tesla', 'dummy_2_tesla']
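The class `[a-zA-Z0-9_]` can also be written with the `\w` shorthand. In Python 3, `\w` on `str` patterns matches Unicode word characters by default, so the sketch below passes `re.ASCII` to keep it limited to the same set; the shortened sample string is made up for illustration.

```python
import re

sample = 'https://twitter.com/elonmusk, https://twitter.com/dummy_2_tesla'

# \w covers letters, digits, and underscore; re.ASCII keeps it to [a-zA-Z0-9_]
handles = re.findall(r'https://twitter\.com/(\w+)', sample, flags=re.ASCII)
print(handles)  # ['elonmusk', 'dummy_2_tesla']
```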

##### 2. Extract the concentration risk types. A risk type is the text that appears after "Concentration of Risk:". In the example below, your regex should extract two strings: Credit Risk and Supply Risk

In [98]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = 'Concentration [a-zA-Z]* [a-zA-Z]*[a-zA-Z]*[^a-zA-Z]*[a-z]*([a-zA-Z]* [a-zA-Z]*)'

re.findall(pattern, text)

['Credit Risk', 'Supply Risk']

In [100]:
text = '''
Concentration of Risk: Credit Risk
Financial instruments that potentially subject us to a concentration of credit risk consist of cash, cash equivalents, marketable securities,
restricted cash, accounts receivable, convertible note hedges, and interest rate swaps. Our cash balances are primarily invested in money market funds
or on deposit at high credit quality financial institutions in the U.S. These deposits are typically in excess of insured limits. As of September 30, 2021
and December 31, 2020, no entity represented 10% or more of our total accounts receivable balance. The risk of concentration for our convertible note
hedges and interest rate swaps is mitigated by transacting with several highly-rated multinational banks.
Concentration of Risk: Supply Risk
We are dependent on our suppliers, including single source suppliers, and the inability of these suppliers to deliver necessary components of our
products in a timely manner at prices, quality levels and volumes acceptable to us, or our inability to efficiently manage these components from these
suppliers, could have a material adverse effect on our business, prospects, financial condition and operating results.
'''
pattern = 'Concentration of Risk:([^\n]*)'

re.findall(pattern, text)

[' Credit Risk', ' Supply Risk']
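The results above keep the space that follows the colon. If that is unwanted, one option is to consume the spaces before the capturing group, as in this sketch on a shortened, made-up sample.

```python
import re

sample = 'Concentration of Risk: Credit Risk\nSome body text.\nConcentration of Risk: Supply Risk\n'

# ' *' eats the spaces after the colon; '.' does not cross newlines by default
risk_types = re.findall(r'Concentration of Risk: *(.*)', sample)
print(risk_types)  # ['Credit Risk', 'Supply Risk']
```

Calling `.strip()` on each result would work as well; which is cleaner is largely a matter of taste.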

##### 3. Companies in Europe report their financial numbers on a semi-annual basis, so you can have a document like the one below. To extract both quarterly and semi-annual periods, you can use a regex as shown below

Hint: use a non-capturing group `(?:...)` for the alternation so that `re.findall` returns the whole enclosed period string rather than one entry per group.

In [102]:
text = '''
Tesla's gross cost of operating lease vehicles in FY2021 Q1 was $4.85 billion.
BMW's gross cost of operating vehicles in FY2021 S1 was $8 billion.
'''

pattern = r'FY(\d{4} (?:Q[1-4]|S[1-2]))'
matches = re.findall(pattern, text)
matches

['2021 Q1', '2021 S1']
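To see why the non-capturing group matters here, the sketch below runs the same pattern with and without `?:` on a shortened sample. When a pattern has more than one capturing group, `re.findall` returns a tuple per match instead of a plain string.

```python
import re

sample = 'FY2021 Q1 was $4.85 billion. FY2021 S1 was $8 billion.'

capturing = re.findall(r'FY(\d{4} (Q[1-4]|S[1-2]))', sample)
non_capturing = re.findall(r'FY(\d{4} (?:Q[1-4]|S[1-2]))', sample)

print(capturing)      # [('2021 Q1', 'Q1'), ('2021 S1', 'S1')]
print(non_capturing)  # ['2021 Q1', '2021 S1']
```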