# String Operations Using Regular Expressions

Regular expressions (often abbreviated as **regex** or **regexp**) are sequences of characters used to define search patterns. They provide a concise and flexible way to match, search, and manipulate text based on specific patterns.

Regular expressions consist of a combination of literal characters and special characters called metacharacters. The metacharacters have special meanings and allow you to define complex patterns. Here are some commonly used metacharacters:

- `.` (dot): Matches any single character except a newline.
- `^` (caret): Matches the start of a string.
- `$` (dollar): Matches the end of a string.
- `*` (asterisk): Matches zero or more occurrences of the previous character or group.
- `+` (plus): Matches one or more occurrences of the previous character or group.
- `?` (question mark): Matches zero or one occurrence of the previous character or group.
- `[ ]` (square brackets): Matches any single character listed within the brackets.
- `[^ ]` (caret within square brackets): Matches any single character not listed in the brackets.
- `|` (pipe): Matches either the expression before or after the pipe.
- `()` (parentheses): Groups patterns together.
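- `\` (backslash): Escapes a metacharacter or introduces a special sequence such as the ones below.
- `\w`: Matches any word character (a letter, digit, or underscore).
- `\d`: Matches any digit.
- `\s`: Matches any whitespace character (space, tab, or newline).
- `\n`: Matches a newline.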
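
To make the list above concrete, here is a minimal sketch that runs several of these metacharacters against short sample strings. The sample strings and variable names are invented for illustration only.

```python
import re

# '.' matches any single character except a newline
dot_matches = re.findall(r'c.t', 'cat cot cut ct')
print(dot_matches)        # ['cat', 'cot', 'cut']

# '^' and '$' anchor the pattern to the start and end of the string
starts_with_hello = bool(re.search(r'^hello', 'hello world'))
ends_with_hello = bool(re.search(r'hello$', 'hello world'))
print(starts_with_hello, ends_with_hello)   # True False

# '*', '+', and '?' control how often the previous item may repeat
star = re.findall(r'ab*', 'a ab abb')               # zero or more 'b'
plus = re.findall(r'ab+', 'a ab abb')               # one or more 'b'
optional = re.findall(r'colou?r', 'color colour')   # optional 'u'
print(star, plus, optional)

# '[ ]' matches one character from a set; '[^ ]' matches one character outside it
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')
print(vowels, non_vowels)

# '|' means "or"; '\d' matches a single digit
pets = re.findall(r'cat|dog', 'cat, dog, bird')
digits = re.findall(r'\d', 'a1b22')
print(pets, digits)
```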

In [1]:
import re

In [2]:
s = 'this is a sample string'
pattern = 'is'

re.search(pattern, s)

<re.Match object; span=(2, 4), match='is'>

In [3]:
s[2:4]

'is'
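
`re.search()` returns a match object for the first match only, and `None` when nothing matches. The sketch below shows how to read the matched text and its position from the match object, and how to guard against the `None` case. The `'xyz'` pattern is just a made-up example of a pattern that does not occur.

```python
import re

s = 'this is a sample string'

match = re.search('is', s)
if match:
    print(match.group())               # the text that matched
    print(match.span())                # (start, end) indices into s
    print(match.start(), match.end())  # the same indices, separately

# A pattern that does not occur returns None rather than raising an error
no_match = re.search('xyz', s)
print(no_match)
```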

Use `findall()` to get every occurrence of a pattern in a string.


In [4]:
s2 = 'This is a generic sentence. Here is my sentence. My 3rd sentence is here.'

re.findall('sentence',s2)

['sentence', 'sentence', 'sentence']

In [5]:
re.findall('is',s2)

['is', 'is', 'is', 'is']
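
Note that one of the four matches above comes from inside the word `This`. If only the standalone word is wanted, one option is the word-boundary sequence `\b`, which is not in the metacharacter list above. A minimal sketch:

```python
import re

s2 = 'This is a generic sentence. Here is my sentence. My 3rd sentence is here.'

# Plain substring search: also matches the 'is' inside 'This'
substring_hits = re.findall('is', s2)

# \b marks a word boundary, so only the standalone word 'is' matches
whole_word_hits = re.findall(r'\bis\b', s2)

print(len(substring_hits))   # 4
print(len(whole_word_hits))  # 3
```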

In [10]:
emails = '''Bassel@gmail.com
            Jen@yahoo.com
            Suresh@outlook.net
            Jeremy@gmail.net
            Parth@health.gov'''


In [14]:
# capture all the words in the emails
pattern = re.compile(r'\w+')  # \w+ matches runs of letters, digits, and underscores

re.findall(pattern, emails)

['Bassel',
 'gmail',
 'com',
 'Jen',
 'yahoo',
 'com',
 'Suresh',
 'outlook',
 'net',
 'Jeremy',
 'gmail',
 'net',
 'Parth',
 'health',
 'gov']

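Find only the top-level domains (the part after the dot)

In [12]:
# pattern: word@word.word
# wrap the part we want returned in a capture group using ()
# the dot is escaped (\.) so that it matches a literal period rather than any character
pattern = re.compile(r'\w+@\w+\.(\w+)')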

re.findall(pattern, emails)

['com', 'com', 'net', 'net', 'gov']
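
When a pattern has more than one capture group, `findall()` returns a list of tuples with one item per group. The sketch below uses a shortened copy of the email list to illustrate this, and also shows why the dot is worth escaping: an unescaped `.` matches any character. The string `'a@bcd'` is a made-up input with no dot in it.

```python
import re

emails = '''Bassel@gmail.com
            Jen@yahoo.com
            Parth@health.gov'''

# Three capture groups: user name, domain name, top-level domain
pattern = re.compile(r'(\w+)@(\w+)\.(\w+)')
parts = re.findall(pattern, emails)
print(parts)

# An unescaped dot matches any character, so this input still "matches"
unescaped = re.findall(r'\w+@\w+.(\w+)', 'a@bcd')
# With the dot escaped, the same input correctly yields no match
escaped = re.findall(r'\w+@\w+\.(\w+)', 'a@bcd')
print(unescaped, escaped)
```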

The `split()` function splits a string wherever the pattern matches.

In [15]:
text = 'apples-bananas-oranges-grapes'

fruit_list = re.split('-', text)
fruit_list

['apples', 'bananas', 'oranges', 'grapes']

In [21]:
text = 'apples-bananas,oranges|grapes'

fruit_list = re.split('[-|,]', text)
fruit_list

['apples', 'bananas', 'oranges', 'grapes']
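
Two variations on the split examples above, sketched under the assumption that the input may contain stray spaces around the separators: `maxsplit` limits the number of splits, and `\s*` absorbs optional whitespace. The `messy` string is invented for illustration.

```python
import re

text = 'apples-bananas,oranges|grapes'

# maxsplit=1 stops after the first separator
first_split = re.split(r'[-|,]', text, maxsplit=1)
print(first_split)

# \s* on either side of the separator swallows surrounding whitespace
messy = 'apples ,  bananas;oranges'
clean_list = re.split(r'\s*[,;]\s*', messy)
print(clean_list)
```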

## Using RegEx with Digits

In [22]:
phone_nums = '''
            Bassel: 234-5689789 
            Mark: 284-5083211 
            mike: 234-5666323 
            '''

In [23]:
# raw string (r'...') so that Python passes the backslashes to the regex engine unchanged
pattern = re.compile(r'\d\d\d-\d\d\d\d\d\d\d')
re.findall(pattern, phone_nums)

['234-5689789', '284-5083211', '234-5666323']

Instead of repeating `\d`, you can use a quantifier such as `{3}` to state how many times it repeats

In [24]:
pattern = re.compile(r'\d{3}-\d{7}')
re.findall(pattern, phone_nums)

['234-5689789', '284-5083211', '234-5666323']
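
Quantifiers can also express ranges: `{m,n}` means between `m` and `n` repetitions, and `{m,}` means at least `m`. A minimal sketch on a made-up string, using `\b` word boundaries so that longer numbers are not matched in part:

```python
import re

sample = 'codes: 12, 345, 6789, 12345'

exactly_three = re.findall(r'\b\d{3}\b', sample)    # exactly 3 digits
two_to_four = re.findall(r'\b\d{2,4}\b', sample)    # between 2 and 4 digits
four_or_more = re.findall(r'\b\d{4,}\b', sample)    # at least 4 digits

print(exactly_three)
print(two_to_four)
print(four_or_more)
```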

Get the area code

In [25]:
# only the part inside the capture group () is returned
pattern = re.compile(r'(\d{3})-\d{7}')
re.findall(pattern, phone_nums)

['234', '284', '234']
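
If more than one piece of each record is needed, named groups are one way to keep the results readable. The sketch below pairs each name with its area code using `finditer()` and `groupdict()`; the group names `name`, `area`, and `number` are chosen here for illustration.

```python
import re

phone_nums = '''
            Bassel: 234-5689789
            Mark: 284-5083211
            mike: 234-5666323
            '''

# (?P<label>...) gives a capture group a name
pattern = re.compile(r'(?P<name>\w+):\s*(?P<area>\d{3})-(?P<number>\d{7})')

# finditer() yields one match object per record; groupdict() maps names to text
records = [m.groupdict() for m in pattern.finditer(phone_nums)]
for record in records:
    print(record['name'], record['area'])
```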

## Replace String Using `sub()`

**Exercise:** Clean the text below

In [26]:
text = ''' The BEST $mvie ever made about writer's block and one of the scariest tales ever made regarding cabin fever, 
        The Shining took a simple concept of a      haunted hotel and built it ~up into an unforgettable, 
        psychological ^horror mvie that will withstand the test of 
        time despite being slated by it's original creator. scary moovie ---!!!!'''

1. Replace bad spelling of movie

In [27]:
# use the alternation operator | to match multiple spellings
# \$ matches a literal dollar sign; the raw string keeps the backslash intact
text = re.sub(r'mvie|moovie|\$mvie', 'movie', text)
print(text)

 The BEST movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever, 
        The Shining took a simple concept of a      haunted hotel and built it ~up into an unforgettable, 
        psychological ^horror movie that will withstand the test of 
        time despite being slated by it's original creator. scary movie ---!!!!


2. Get rid of special characters

In [29]:
# match every character that is NOT a letter, digit, whitespace, period, or apostrophe
pattern = r"[^a-zA-Z0-9\s.']"

text = re.sub(pattern, '', text)
print(text)

 The BEST movie ever made about writer's block and one of the scariest tales ever made regarding cabin fever 
        The Shining took a simple concept of a      haunted hotel and built it up into an unforgettable 
        psychological horror movie that will withstand the test of 
        time despite being slated by it's original creator. scary movie 
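
The cleaned text still contains runs of spaces and line breaks. One possible follow-up step, sketched here on a shortened, made-up excerpt rather than the full text, is to collapse every run of whitespace into a single space:

```python
import re

excerpt = "a simple concept of a      haunted hotel\n        and built it up  "

# \s+ matches one or more whitespace characters (spaces, tabs, newlines);
# strip() then removes any leading or trailing space that remains
collapsed = re.sub(r'\s+', ' ', excerpt).strip()
print(collapsed)
```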


## Using RegEx in Pandas

`contains()`

In [30]:
import pandas as pd

In [35]:
data = {
    'names': ['Bassel'
                ,'Jen'
                ,'Suresh'
                ,'Jeremy'
                ,'Parth'],
    'emails': [
        'Bassel@gmail.com'
        ,'Jen@yahoo.com'
        ,'Suresh@outlook.net'
        ,'Jeremy@gmail.net'
        ,'Parth@health.gov'
    ]
}

df = pd.DataFrame(data)
df.head()

| | names | emails |
|---|---|---|
| 0 | Bassel | Bassel@gmail.com |
| 1 | Jen | Jen@yahoo.com |
| 2 | Suresh | Suresh@outlook.net |
| 3 | Jeremy | Jeremy@gmail.net |
| 4 | Parth | Parth@health.gov |


Feature Engineering: Build an indicator that states whether the person has a Gmail or Yahoo account.

In [36]:
df['has gmail or yahoo'] = df['emails'].str.contains('gmail|yahoo')
df

| | names | emails | has gmail or yahoo |
|---|---|---|---|
| 0 | Bassel | Bassel@gmail.com | True |
| 1 | Jen | Jen@yahoo.com | True |
| 2 | Suresh | Suresh@outlook.net | False |
| 3 | Jeremy | Jeremy@gmail.net | True |
| 4 | Parth | Parth@health.gov | False |
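
Real email columns often contain mixed case and missing values. The sketch below shows two `str.contains()` parameters that may help in that situation: `case=False` for case-insensitive matching and `na=False` to treat missing values as non-matches. The sample Series is made up for illustration, and the non-capturing group `(?:...)` is used because pandas warns when the pattern contains capture groups.

```python
import pandas as pd

emails = pd.Series(['Bassel@GMAIL.com', 'Jen@yahoo.com', None, 'Parth@health.gov'])

# Anchor the provider name between '@' and a literal dot;
# case=False ignores case, na=False maps missing values to False
mask = emails.str.contains(r'@(?:gmail|yahoo)\.', case=False, na=False)
print(mask.tolist())
```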


Using `split()`

Get the top-level domain (the part after the dot)

In [40]:
df['domain'] = df['emails'].str.split('.').str[1]

In [41]:
df

| | names | emails | has gmail or yahoo | domain |
|---|---|---|---|---|
| 0 | Bassel | Bassel@gmail.com | True | com |
| 1 | Jen | Jen@yahoo.com | True | com |
| 2 | Suresh | Suresh@outlook.net | False | net |
| 3 | Jeremy | Jeremy@gmail.net | True | net |
| 4 | Parth | Parth@health.gov | False | gov |


Using `expand=True` with `split()`

Build 2 columns Name and State from the information below

In [42]:
data ={'info':['Bassel-NY', 'Parth-CA', 'Mark-AZ']}
df = pd.DataFrame(data)
df.head()

| | info |
|---|---|
| 0 | Bassel-NY |
| 1 | Parth-CA |
| 2 | Mark-AZ |


In [43]:
df[['name','state']] = df['info'].str.split('-', expand=True)
df

| | info | name | state |
|---|---|---|---|
| 0 | Bassel-NY | Bassel | NY |
| 1 | Parth-CA | Parth | CA |
| 2 | Mark-AZ | Mark | AZ |
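
Splitting on `-` with `expand=True` produces more than two columns as soon as a name itself contains a hyphen. One possible alternative, sketched below, is `str.extract()` with named groups, which returns a DataFrame whose columns are named after the groups. The row `Mary-Ann-TX` is a made-up example, and the pattern assumes the state is always two capital letters at the end.

```python
import pandas as pd

df_info = pd.DataFrame({'info': ['Bassel-NY', 'Parth-CA', 'Mary-Ann-TX']})

# Greedy .+ takes everything up to the last hyphen;
# [A-Z]{2}$ requires a two-letter code at the end of the string
parts = df_info['info'].str.extract(r'(?P<name>.+)-(?P<state>[A-Z]{2})$')

# The group names become the new column names
df_info = df_info.join(parts)
print(df_info)
```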
