### **NLP: Regular Expressions (with Sundar Pichai Example)**

#### **(1) Regex in customer support**

**Retrieve ticket number**

Suppose we have customer support chats and want to extract ticket numbers from them.

In [34]:
import re

chat1 = "support: Hi, my ticket # 987654321 is not resolved yet."
chat2 = "support: The issue is with ticket number 987654321."
chat3 = "support: Please check ticket 987654321 for updates."

# The word 'ticket', then any non-digit characters, then the digits we capture
pattern = r'ticket[^\d]*(\d+)'
matches = re.findall(pattern, chat1)
matches

['987654321']

In [35]:
matches = re.findall(pattern, chat2)
matches

['987654321']

In [36]:
matches = re.findall(pattern, chat3)
matches

['987654321']
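Since all three chats share one pattern, the repeated cells can be collapsed into a loop. The following is a minimal sketch; `ticket_pattern`, `chats`, and `ticket_numbers` are names introduced here for illustration, and compiling the pattern first is an optional convenience rather than a requirement.

```python
import re

# Compile once, reuse for every chat.
ticket_pattern = re.compile(r'ticket[^\d]*(\d+)')

chats = [
    "support: Hi, my ticket # 987654321 is not resolved yet.",
    "support: The issue is with ticket number 987654321.",
    "support: Please check ticket 987654321 for updates.",
]

# findall returns a list of captured digit runs for each chat.
ticket_numbers = [ticket_pattern.findall(chat) for chat in chats]
print(ticket_numbers)
# [['987654321'], ['987654321'], ['987654321']]
```

The `[^\d]*` part is what absorbs the differing filler between the word and the number (` # `, ` number `, or a single space).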

#### **Reusable function for pattern matching**

Let's create a helper function to extract the first match.

In [37]:
def get_pattern_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]
    return None

In [38]:
get_pattern_match(r'ticket[^\d]*(\d+)', chat1)

'987654321'
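It is worth checking what the helper does when nothing matches: `re.findall` returns an empty list, so the function falls through and returns `None`. A small sketch, using a made-up chat line (`no_ticket`) that contains no ticket number:

```python
import re

def get_pattern_match(pattern, text):
    matches = re.findall(pattern, text)
    if matches:
        return matches[0]

# Hypothetical chat without a ticket number, invented for this example.
no_ticket = "support: Hello, I need help with my order."

result = get_pattern_match(r'ticket[^\d]*(\d+)', no_ticket)
print(result)  # None

# Callers should guard against None before using the value.
ticket = result if result is not None else "unknown"
print(ticket)  # unknown
```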

#### **Retrieve email id and phone**

Suppose the chats also contain email addresses and phone numbers.

In [39]:
chat1 = 'support: Contact me at 9876543210, sundar@google.com'
chat2 = 'support: My number is (987)-654-3210, email: sundar@google.com'
chat3 = 'support: phone: 9876543210 email: sundar@google.com'

**-----Email id-----**

In [40]:
# Extracting email address from chat1
get_pattern_match(r'([a-zA-Z0-9_]+@[a-z]+\.[a-zA-Z0-9]+)', chat1)

'sundar@google.com'

In [41]:
# Extracting email address from chat2
get_pattern_match(r'([a-zA-Z0-9_]+@[a-z]+\.[a-zA-Z0-9]+)', chat2)

'sundar@google.com'

In [42]:
# Extracting email address from chat3
get_pattern_match(r'([a-zA-Z0-9_]+@[a-z]+\.[a-zA-Z0-9]+)', chat3)

'sundar@google.com'
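The email pattern above is deliberately simple: the local part allows only letters, digits, and underscores, and the domain allows a single lowercase label before the dot. Addresses with dots in the local part or with subdomains are therefore truncated. The sketch below compares it with a looser pattern; `loose_pattern` and the `example.com` address are assumptions made for illustration, and neither pattern is a full validator of real-world email syntax.

```python
import re

strict_pattern = r'([a-zA-Z0-9_]+@[a-z]+\.[a-zA-Z0-9]+)'
# Looser sketch: dots, plus signs, and hyphens in the local part,
# and one or more dot-separated labels in the domain.
loose_pattern = r'([\w.+-]+@[\w-]+(?:\.[\w-]+)+)'

# Made-up chat line with a dotted local part and a subdomain.
chat = 'support: Reach me at first.last@mail.example.com please'

strict = re.findall(strict_pattern, chat)[0]
loose = re.findall(loose_pattern, chat)[0]

print(strict)  # last@mail.example
print(loose)   # first.last@mail.example.com
```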

**-----Phone number-----**

In [43]:
# Extracting phone number from chat1
# One capturing group wraps both formats, so the helper returns a plain string
get_pattern_match(r'(\d{10}|\(\d{3}\)-\d{3}-\d{4})', chat1)

'9876543210'

In [30]:
# Extracting phone number from chat2
get_pattern_match(r'(\d{10}|\(\d{3}\)-\d{3}-\d{4})', chat2)

'(987)-654-3210'

In [44]:
# Extracting phone number from chat3
get_pattern_match(r'(\d{10}|\(\d{3}\)-\d{3}-\d{4})', chat3)

'9876543210'
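When a pattern contains more than one capturing group, `re.findall` returns one tuple per match, with an empty string for every group that did not take part. That matters for alternations such as the two phone formats here. The sketch below compares one group per alternative with a single group around the whole alternation; the variable names `two_groups` and `one_group` are introduced for illustration.

```python
import re

chat2 = 'support: My number is (987)-654-3210, email: sundar@google.com'

# One capturing group per format: each match is a tuple.
two_groups = r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})'
# A single capturing group around both formats: each match is a string.
one_group = r'(\d{10}|\(\d{3}\)-\d{3}-\d{4})'

tuple_matches = re.findall(two_groups, chat2)
string_matches = re.findall(one_group, chat2)

print(tuple_matches)   # [('', '(987)-654-3210')]
print(string_matches)  # ['(987)-654-3210']
```

With the single-group form, the caller does not need to know in advance which phone format a chat contains.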

#### **(2) Regex for Information Extraction**

Let's extract structured information from a Wikipedia-style text about Sundar Pichai.

In [45]:
text = '''
Born    Sundar Pichai
June 10, 1972 (age 52)
Madurai, Tamil Nadu, India
Citizenship
India (1972-present)
United States (2017-present)
Education   Indian Institute of Technology Kharagpur (BTech)
Stanford University (MS)
University of Pennsylvania (MBA)
Title
CEO of Alphabet Inc. and Google
Spouse(s)
Anjali Pichai
Children\t2
'''


In [46]:
# Age extraction
get_pattern_match(r'age (\d+)', text)

'52'

In [47]:
# Name extraction
get_pattern_match(r'Born\s(.*)\n', text).strip()

'Sundar Pichai'

In [48]:
# Date of birth extraction
get_pattern_match(r'Born.*\n(.*)\s\(age', text)

'June 10, 1972'

In [49]:
# Birth place extraction
get_pattern_match(r'\(age.*\n(.*)', text)

'Madurai, Tamil Nadu, India'
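The four lookups above all anchor on the same three lines of the text, so they can also be expressed as one pattern with named groups. This is an alternative sketch rather than the approach used in this notebook; it relies on `re.search` and `groupdict()`, and `born_pattern` and the group names are chosen here for illustration. It assumes the name, date, and place appear on consecutive lines in that order.

```python
import re

text = '''
Born    Sundar Pichai
June 10, 1972 (age 52)
Madurai, Tamil Nadu, India
'''

born_pattern = re.compile(
    r'Born\s+(?P<name>.*)\n'                        # line 1: name
    r'(?P<birth_date>.*)\s\(age (?P<age>\d+)\)\n'   # line 2: date and age
    r'(?P<birth_place>.*)'                          # line 3: birth place
)

m = born_pattern.search(text)
info = m.groupdict() if m else {}
print(info)
# {'name': 'Sundar Pichai', 'birth_date': 'June 10, 1972', 'age': '52', 'birth_place': 'Madurai, Tamil Nadu, India'}
```

The trade-off is that a single combined pattern fails as a whole if any one line deviates, whereas separate patterns can each succeed or fail independently.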

#### **Function to extract all personal information**

In [50]:
def extract_personal_information(text):
    age = get_pattern_match(r'age (\d+)', text)
    # \s+ consumes all whitespace after 'Born', so the name has no leading spaces
    full_name = get_pattern_match(r'Born\s+(.*)\n', text)
    birth_date = get_pattern_match(r'Born.*\n(.*)\s\(age', text)
    birth_place = get_pattern_match(r'\(age.*\n(.*)', text)
    return {
        'age': int(age) if age else None,
        'full_name': full_name,
        'birth_date': birth_date,
        'birth_place': birth_place
    }

In [51]:
extract_personal_information(text)

{'age': 52,
 'full_name': 'Sundar Pichai',
 'birth_date': 'June 10, 1972',
 'birth_place': 'Madurai, Tamil Nadu, India'}
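Real biographies often omit fields, so it helps to see how this style of extraction behaves on incomplete input. The sketch below is a table-driven variant that maps each field to its pattern and returns `None` for anything not found. `FIELD_PATTERNS` and `extract_fields` are hypothetical names, `re.search` is swapped in for the `findall`-based helper, and the `partial` record is made-up placeholder data.

```python
import re

# Field name -> pattern with exactly one capturing group.
FIELD_PATTERNS = {
    'age': r'age (\d+)',
    'full_name': r'Born\s+(.*)\n',
    'birth_date': r'Born.*\n(.*)\s\(age',
    'birth_place': r'\(age.*\n(.*)',
}

def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field -> first captured value, or None when absent."""
    record = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        record[field] = match.group(1) if match else None
    return record

# Placeholder record with no date or age line.
partial = '''
Born    Jane Doe
Springfield
'''

record = extract_fields(partial)
print(record)
# {'age': None, 'full_name': 'Jane Doe', 'birth_date': None, 'birth_place': None}
```

Adding a new field then means adding one dictionary entry instead of another line in the function body.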

### **Summary**

- We used regular expressions to extract ticket numbers, emails, and phone numbers from customer support chats.
- We applied regex to extract structured information (age, name, birth date, birth place) from a Wikipedia-style biography.
- A reusable function helps automate the extraction process for different patterns.

**This approach is useful for information extraction tasks in NLP, especially when dealing with semi-structured text data.**