## Regular Expressions (regex)
**Regex (short for regular expression)** is a sequence of characters that defines a search pattern. It is a powerful tool for text processing: with regex you can search for specific patterns within strings, extract information from text, and replace or modify text. In Python, the built-in `re` module provides support for regular expressions.

In [2]:
# To work with regular expressions (regex) we need to import the 're' module.
# 're' is part of the Python standard library and provides pattern matching.
import re
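
The three uses named in the introduction (search, extract, replace) can be sketched in a few lines. This is only an illustration: the sample sentence and the variable names are made up for the example, while `re.search`, `re.findall` and `re.sub` are standard functions of the `re` module.

```python
import re

# Hypothetical sample sentence, used only for illustration.
sample = "Order 66 shipped on 2021-03-04 to abc@example.com"

# Search: look for the first date-like pattern in the string.
found = re.search(r"\d{4}-\d{2}-\d{2}", sample)
date_text = found.group() if found else None

# Extract: collect every run of digits.
numbers = re.findall(r"\d+", sample)

# Replace: mask the part of the address before '@'.
masked = re.sub(r"\w+@", "***@", sample)

print(date_text)
print(numbers)
print(masked)
```

Note that `re.search` returns `None` when nothing matches, which is why the result is checked before calling `.group()`.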

* The **regex101** website lets you test your regular expressions: https://regex101.com/

In [3]:
# Chat list:
chat1 = "Welcome to the party. join us on 1029843984 or abc@gmail.com"
chat2 = "yesssss!, try one more (342)-983-9843, abc_302@xyz.com"
chat3 = "There will be other option. Try to contact with 9843738432 or abcd@yahoo.com"

In [5]:
# 're' has a function called 'findall()' that takes two main arguments: the pattern and the text to search.
# It returns a list of all non-overlapping matches.
Pattern = r"\d{10}"  # raw string, so the backslash reaches the regex engine unchanged
matches = re.findall(Pattern, chat1)
matches

['1029843984']
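
One thing to keep in mind: `\d{10}` matches any run of ten digits, including the first ten digits of a longer number. A minimal sketch of the difference, assuming a made-up sample string and using the word-boundary anchor `\b` to require that the ten digits stand alone:

```python
import re

# Hypothetical sample: a 12-digit ID followed by a 10-digit phone number.
sample = "id 123456789012, phone 1029843984"

# Without boundaries, the first ten digits of the ID also count as a match.
loose = re.findall(r"\d{10}", sample)

# With \b on both sides, only a standalone run of exactly ten digits matches.
strict = re.findall(r"\b\d{10}\b", sample)

print(loose)
print(strict)
```

Whether the stricter form is wanted depends on the data; for the short chat messages used here both patterns give the same result.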

In [6]:
# If we check the second chat with the same pattern, it will not find the phone number because the format is different:
re.findall(Pattern, chat2)

[]

In [8]:
# To catch the phone number written with brackets, we use a different pattern:
pattern = r"\(\d{3}\)-\d{3}-\d{4}"
re.findall(pattern, chat2)

['(342)-983-9843']

In [9]:
# To catch both types of numbers at the same time, we put the OR symbol '|' between the two expressions.
# Note: there are no spaces around '|', because a space inside the pattern is matched literally.
chat4 = "Welcome to the party. join us on 1029843984 or abc@gmail.com yesssss!, try one more (342)-983-9843, abc@xyz.com"
patt = r"\d{10}|\(\d{3}\)-\d{3}-\d{4}"
re.findall(patt, chat4)

['1029843984', '(342)-983-9843']
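
By default, whitespace inside a pattern is matched literally, so spaces placed around `|` become part of each alternative and show up in the results. If spacing is wanted for readability, the standard `re.VERBOSE` flag makes the engine ignore whitespace and allows `#` comments inside the pattern. A small sketch, with a shortened sample string assumed for the example:

```python
import re

# Shortened, hypothetical chat message for illustration.
chat = "join us on 1029843984 or try (342)-983-9843"

phone = re.compile(
    r"""
    \d{10}                    # ten digits in a row
    |                         # or
    \(\d{3}\)-\d{3}-\d{4}     # three digits in brackets, then 3 and 4 digits
    """,
    re.VERBOSE,
)

matches = phone.findall(chat)
print(matches)
```

With `re.VERBOSE`, a literal space would have to be written as `\ ` or `[ ]` if the pattern ever needed one.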

In [10]:
# An email ID has a bunch of characters, then the '@' symbol, then another bunch of characters, then '.', and then
# a final bunch of characters, which could be 'com', 'af', 'ir' or something else.
pat = r"[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.com"
re.findall(pat, chat4)

['abc@gmail.com', 'abc@xyz.com']

In [11]:
# In the cell above we only kept the 'com' domain. To keep all domains, we match one or more characters after the dot:
pat1 = r"[a-zA-Z0-9_]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+"
chat5 = "Welcome to the party. join us on 1029843984 or abc@gmail.af yesssss!, try one more (342)-983-9843, abc@xyz.com"
re.findall(pat1, chat5)

['abc@gmail.af', 'abc@xyz.com']
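
The character classes used for the email pattern do not allow dots or hyphens, so an address with a dotted name or a multi-part domain would not be captured whole. The following is a rough sketch of a more permissive pattern, not a complete email validator; the sample addresses and the name `email_pat` are made up for illustration.

```python
import re

# Hypothetical sample with a dotted local part and a multi-part domain.
sample = "write to first.last@mail.example.co.uk or abc_302@xyz.com."

# Local part: word characters, dots, plus signs, hyphens.
# Domain: one label followed by one or more ".label" groups.
# (?: ... ) is a non-capturing group, so findall returns the whole match.
email_pat = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(email_pat, sample)
print(emails)
```

The trailing full stop of the sentence is left out because every dot in the domain part must be followed by at least one more character.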

### Regex for Information Extraction

In [12]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship	
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title	
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)	
Justine Wilson
​
​(m. 2000; div. 2008)​
Talulah Riley
​
​(m. 2010; div. 2012)​
​
​(m. 2013; div. 2016)
'''

In [14]:
# Let's first write a helper that returns the first match of a pattern (or None if nothing matches):
def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

In [15]:
# To get age:
get_pattern_match(r'age (\d+)', text)

'50'

In [16]:
# To get name:
get_pattern_match(r'Born(.*)\n', text).strip()

'Elon Reeve Musk'

In [17]:
# To get birth date:
get_pattern_match(r'Born.*\n(.*)\(age', text).strip()

'June 28, 1971'

In [20]:
# To get birth place:
get_pattern_match(r'\(age.*\n(.*)', text)

'Pretoria, Transvaal, South Africa'

In [21]:
# Let's write a function that returns all the necessary information:
def extract_personal_information(text):
    # Raw strings keep the backslashes intact for the regex engine.
    age = get_pattern_match(r'age (\d+)', text)
    full_name = get_pattern_match(r'Born(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age),
        'name': full_name.strip(),
        'birth_date': birth_date.strip(),
        'birth_place': birth_place.strip()
    }

In [22]:
# So now we can easily just pass the text:
extract_personal_information(text)

{'age': 50,
 'name': 'Elon Reeve Musk',
 'birth_date': 'June 28, 1971',
 'birth_place': 'Pretoria, Transvaal, South Africa'}

In [23]:
# Let's try some different text:
text = '''
Born	Mukesh Dhirubhai Ambani
19 April 1957 (age 64)
Aden, Colony of Aden
(present-day Yemen)[1][2]
Nationality	Indian
Alma mater	
St. Xavier's College, Mumbai
Institute of Chemical Technology (B.E.)
Stanford University (drop-out)
Occupation	Chairman and MD, Reliance Industries
Spouse(s)	Nita Ambani ​(m. 1985)​[3]
Children	3
Parent(s)	
Dhirubhai Ambani (father)
Kokilaben Ambani (mother)
Relatives	Anil Ambani (brother)
Tina Ambani (sister-in-law)
'''

In [24]:
# Call the function for the new text:
extract_personal_information(text)

{'age': 64,
 'name': 'Mukesh Dhirubhai Ambani',
 'birth_date': '19 April 1957',
 'birth_place': 'Aden, Colony of Aden'}
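
The function above calls `int(age)` and `.strip()` on whatever `get_pattern_match` returns, and that is `None` when a pattern does not match, so text without a "Born" block would raise an error. Below is a sketch of a more defensive variant that uses a single pattern with named groups. The names `INFO_PATTERN` and `extract_personal_information_safe` are hypothetical, and the pattern assumes the same three-line layout (name, date with age, place) as the two examples.

```python
import re

# Hypothetical single pattern with named groups; assumes the layout
# "Born<name>" / "<date> (age NN)" / "<place>" on three consecutive lines.
INFO_PATTERN = re.compile(
    r"Born(?P<name>.*)\n(?P<birth_date>.*)\(age (?P<age>\d+)\)\n(?P<birth_place>.*)"
)

def extract_personal_information_safe(text):
    """Return a dict of fields, or None if the block is not found."""
    match = INFO_PATTERN.search(text)
    if match is None:
        return None
    # groupdict() maps each group name to its matched text.
    info = {key: value.strip() for key, value in match.groupdict().items()}
    info["age"] = int(info["age"])
    return info

sample = "\nBorn\tMukesh Dhirubhai Ambani\n19 April 1957 (age 64)\nAden, Colony of Aden\n"
result = extract_personal_information_safe(sample)
missing = extract_personal_information_safe("no infobox here")
print(result)
print(missing)
```

Returning `None` for unmatched text leaves it to the caller to decide how to handle missing data.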

* **That's all for this session.**