## A regular expression (regex) is a sequence of characters that defines a search pattern. You can use these text patterns to automate searching through and replacing elements within strings of text. This makes cleaning and working with text-based data sets much easier, saving you the trouble of searching through mountains of text by hand.

## Basic requirements: familiarity with key Python concepts such as if-else statements and while and for loops.
## By the end we will have: an introduction to how regex can be used in concert with pandas to work with large text corpora
## (a corpus is a data set of text).

### Practical applications of regex: data validation, text search and extraction, URL matching and routing, data cleaning and transformation, text manipulation, log analysis, programming language support

### The Impact of Dirty Data on Analysis
Dirty data negatively impacts analysis. Common data quality issues that regex helps address: missing values, duplicate records, inconsistent formatting (dates, names), and invalid entries.

Without cleaning, these issues can skew results.

### Advantages of Regex for Data Cleaning
Key regex benefits for data cleaning: flexible pattern matching, powerful search and replace, automation at scale, and language-agnostic patterns.
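
As a rough illustration of search and replace applied to inconsistent formatting, the sketch below rewrites a few date strings into one layout with `re.sub`. The sample values and the helper name `normalize_date` are made up for this example and are not taken from any data set used in this notebook.

```python
import re

# Hypothetical raw values with stray whitespace and mixed date separators
raw_values = ["  2023/01/15 ", "2023-02-03", "2023.03.27  "]

def normalize_date(value):
    """Trim whitespace and rewrite YYYY/MM/DD or YYYY.MM.DD as YYYY-MM-DD."""
    value = value.strip()
    # Capture year, month, and day; accept '/', '.', or '-' as the separator
    return re.sub(r'^(\d{4})[/.-](\d{2})[/.-](\d{2})$', r'\1-\2-\3', value)

cleaned = [normalize_date(v) for v in raw_values]
print(cleaned)  # ['2023-01-15', '2023-02-03', '2023-03-27']
```

Values that do not fit the pattern pass through unchanged, which makes it easy to spot entries that still need attention.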

In [6]:
import re

In [21]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern,test_string)
if result:
    print("result found")
else:
    print("no result found")

result found


In [23]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print('Search successful:{test_string}')
else:
  print('Search unsuccessful:{test_string}')	

Search successful:{test_string}


In [24]:
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print(f'Search successful:{test_string}')
else:
  print(f'Search unsuccessful:{test_string}')	

Search successful:abyss


In [48]:
regex = '^ab'
strings = 'abyss'
result = re.match(regex,strings)

if result:
    print(f'{strings}')
else:
    print(f'{strings}')

abyss


In [50]:
regex = r'^ab'
strings = ['abyss','abs','alias','an abacus']

for string in strings:
    if re.match(regex,strings):
        print('match found:{result}')
    else:
        print('match not found:{result}')
        

TypeError: expected string or bytes-like object

In [51]:
regex = r'^ab'
strings = ['abyss','abs','alias','an abacus']
        
for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

Matched: abyss
Matched: abs
Not matched: alias
Not matched: an abacus


In [52]:
regex = r'^a...s$'
strings = ['abyss','abs','alias','an abacus']
        
for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

Matched: abyss
Not matched: abs
Matched: alias
Not matched: an abacus
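
'an abacus' contains 'ab' but is never matched in the cells above because `re.match` only tries the pattern at the start of the string. A small sketch of the difference between `re.match` and `re.search`, reusing that same string:

```python
import re

text = 'an abacus'

# re.match only tries the pattern at position 0
at_start = re.match(r'ab', text)
# re.search scans forward until the pattern fits somewhere
anywhere = re.search(r'ab', text)
# With an explicit ^ anchor, re.search is tied to the start as well
anchored = re.search(r'^ab', text)

print(at_start)         # None
print(anywhere.span())  # (3, 5)
print(anchored)         # None
```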


In [12]:
s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"
match =re.search(r'Data',s)
print(match)

<re.Match object; span=(21, 25), match='Data'>


In [13]:
s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"
match =re.search(r'Data',s)
print('Start Index:',match.start())
print('End Index:',match.end())

Start Index: 21
End Index: 25
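
Besides `start()` and `end()`, a match object exposes the matched text and any captured groups. The sketch below uses the same sentence; the pattern `(\d+)\+ years` is only an illustrative choice for pulling out the number of years.

```python
import re

s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"

# Capture the digits in front of "+ years" as group 1
match = re.search(r'(\d+)\+ years', s)

if match:
    print(match.group())    # whole match: 3+ years
    print(match.group(1))   # first captured group: 3
    print(match.span())     # (start, end) as a tuple: (41, 49)
    # Slicing the original string with the span returns the matched text
    print(s[match.start():match.end()])
```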


In [19]:
s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"

# '.' matches any character except a newline, so the first match is the first character
match = re.search(r'.', s)
print(match)

# using '$' matches the end
match = re.search(r'$', s)
print(match)

<re.Match object; span=(0, 1), match='W'>
<re.Match object; span=(95, 95), match=''>


In [30]:
#Caret (^) symbol matches the beginning of the string i.e. checks whether the string starts 
#with the given character(s) or not.
regex = r'^The'
strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']
for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

Matched: The quick brown fox
Matched: The lazy dog
Not matched: A quick brown fox


In [26]:
#The dot (.) in the pattern represents any character
string = "Implementing Regex techniques to analyze job descriptions is a powerful way"
pattern = r"analyze.job"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")

Match found!


In [27]:
#regular expression to check if the string ends with “World!”
#Dollar($) symbol matches the end of the string

string = "Hello World! this is my world"
pattern = r"World!$"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")

Match not found.


In [29]:
#regular expression to check if the string ends with “World!”
#Dollar($) symbol matches the end of the string

string = "Hello World! this is my World!"
pattern = r"World!$"

match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")

Match found!
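
When the text spans several lines, `$` still refers to the end of the whole string unless the `re.MULTILINE` flag is passed. A minimal sketch with a made-up two-line string:

```python
import re

text = "Hello World!\nthis is my world"

# By default $ matches only at the very end of the string
# (or just before a single trailing newline)
default_result = re.search(r'World!$', text)
print(default_result)  # None

# With re.MULTILINE, $ also matches at the end of each line
m = re.search(r'World!$', text, flags=re.MULTILINE)
print(m.group(), m.span())  # World! (6, 12)
```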


In [7]:
job_description = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree."
regex_pattern = r'\b(?:Data Scientist|Python|Master\'s|3\+ years)\b'
matches = re.findall(regex_pattern, job_description)
print(matches)

['Data Scientist', '3+ years', 'Python', "Master's"]


In [32]:
# to find all the characters in the string that fall within the range of ‘a’ to ‘m’.
s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"
regex = "[a-m]"
match = re.findall(regex,s)
print(match)

['e', 'a', 'e', 'l', 'k', 'i', 'g', 'f', 'a', 'a', 'a', 'c', 'i', 'e', 'i', 'i', 'h', 'e', 'a', 'f', 'e', 'e', 'i', 'e', 'c', 'e', 'i', 'h', 'a', 'd', 'a', 'a', 'e', 'd', 'e', 'g', 'e', 'e']


In [33]:
# [^a-m] negates the set: every character except 'a' to 'm' is returned
s = "We are looking for a Data Scientist with 3+ years of experience in Python and a Master's degree"
regex = "[^a-m]"
match = re.findall(regex,s)
print(match)

['W', ' ', 'r', ' ', 'o', 'o', 'n', ' ', 'o', 'r', ' ', ' ', 'D', 't', ' ', 'S', 'n', 't', 's', 't', ' ', 'w', 't', ' ', '3', '+', ' ', 'y', 'r', 's', ' ', 'o', ' ', 'x', 'p', 'r', 'n', ' ', 'n', ' ', 'P', 'y', 't', 'o', 'n', ' ', 'n', ' ', ' ', 'M', 's', 't', 'r', "'", 's', ' ', 'r']


In [None]:
##test case on job descriptions to identify key requirements##

In [34]:
# Parentheses join the adjacent string literals into one string;
# without them only the first literal would be assigned to JD.
JD = (
    "Under direct supervision, provides remote technical support services to external and internal users of Landmark environment and applications on basic/routine issues via telephone, email and electronic channels while adhering to Customer Support operational processes and best practices."
    " Resolves the end user's service request by applying established problem solving techniques including trouble shooting, data quality review, replicating the end user's workflow, understanding how the software is functioning and proposing solutions that allow the end user to achieve their objectives."
    " Service requests are limited to basic questions regarding installations, configuration, data formatting and application functionality/workflows. Escalates all complex or novel issues to higher level Support Analysts as needed."
    " The nature of the support services provided requires knowledge of the domain science and knowledge of one to few software applications used within the domain."
    " Knowledge of domain software applications is acquired through structured training, self-guided learning, and on-the-job experiences. Requires an undergraduate degree."
    " No previous experience is required. Concentration in geoscience, engineering, or computer science is preferred."
)

# Define a regex pattern to extract relevant words/phrases
# This pattern looks for skills, qualifications, and experience levels.
regex = r'\b(Data Scientist|Python|R|SQL|Master\'s degree|Computer Science|machine learning|Tableau|3\+ years)\b'

# Find all matches in this job description (JD), ignoring case
# because the text writes "computer science" in lowercase
matches = re.findall(regex, JD, flags=re.IGNORECASE)

# Print the extracted words
print("Extracted Words/Phrases:")
for match in matches:
    print(match)

Extracted Words/Phrases:
computer science


In [37]:
from collections import Counter

# Parentheses join the adjacent string literals into one string;
# without them only the first literal would be assigned to JD.
JD = (
    "Under direct supervision, provides remote technical support services to external and internal users of Landmark environment and applications on basic/routine issues via telephone, email and electronic channels while adhering to Customer Support operational processes and best practices."
    " Resolves the end user's service request by applying established problem solving techniques including trouble shooting, data quality review, replicating the end user's workflow, understanding how the software is functioning and proposing solutions that allow the end user to achieve their objectives."
    " Service requests are limited to basic questions regarding installations, configuration, data formatting and application functionality/workflows. Escalates all complex or novel issues to higher level Support Analysts as needed."
    " The nature of the support services provided requires knowledge of the domain science and knowledge of one to few software applications used within the domain."
    " Knowledge of domain software applications is acquired through structured training, self-guided learning, and on-the-job experiences. Requires an undergraduate degree."
    " No previous experience is required. Concentration in geoscience, engineering, or computer science is preferred."
)

# Define a regex pattern to find words
regex = r'\b\w+\b'  # Matches any word

# Find all words in the text using regex
# The apostrophe is not a word character, so "user's" yields "user" and "s"
words = re.findall(regex, JD.lower())  # Convert to lowercase for uniform counting

# Count occurrences of each word
word_counts = Counter(words)

# Print the repetitive words and their counts
print("Repetitive Words and Their Counts:")
for word, count in word_counts.items():
    if count > 1:  # Only display words that appear more than once
        print(f"{word}: {count}")

Repetitive Words and Their Counts:
support: 4
services: 2
to: 6
and: 8
of: 5
applications: 3
on: 2
basic: 2
issues: 2
the: 9
end: 3
user: 3
s: 2
service: 2
data: 2
software: 3
is: 4
or: 2
requires: 2
knowledge: 3
domain: 3
science: 2


In [None]:
##r'\b\w+\b': This pattern matches any whole word (defined as a sequence of word characters).
##\b: Asserts a word boundary, ensuring whole words are matched.
##\w+: Matches one or more word characters (letters, digits, underscores).
##Counter(words): Counts the occurrences of each word in the list.
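
`Counter` can also rank the words directly with `most_common`, which avoids looping over every item. A short sketch on a made-up sentence (not taken from the job description):

```python
import re
from collections import Counter

text = "Support teams support users, and users rely on support."

# Tokenize into lowercase words, then count them
words = re.findall(r'\b\w+\b', text.lower())
word_counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs
print(word_counts.most_common(2))  # [('support', 3), ('users', 2)]
```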

In [None]:
##  now print unique words from JD##

In [38]:
from collections import Counter

# Parentheses join the adjacent string literals into one string;
# without them only the first literal would be assigned to JD.
JD = (
    "Under direct supervision, provides remote technical support services to external and internal users of Landmark environment and applications on basic/routine issues via telephone, email and electronic channels while adhering to Customer Support operational processes and best practices."
    " Resolves the end user's service request by applying established problem solving techniques including trouble shooting, data quality review, replicating the end user's workflow, understanding how the software is functioning and proposing solutions that allow the end user to achieve their objectives."
    " Service requests are limited to basic questions regarding installations, configuration, data formatting and application functionality/workflows. Escalates all complex or novel issues to higher level Support Analysts as needed."
    " The nature of the support services provided requires knowledge of the domain science and knowledge of one to few software applications used within the domain."
    " Knowledge of domain software applications is acquired through structured training, self-guided learning, and on-the-job experiences. Requires an undergraduate degree."
    " No previous experience is required. Concentration in geoscience, engineering, or computer science is preferred."
)

# Define a regex pattern to find words
regex = r'\b\w+\b'  # Matches any word

# Find all words in the text using regex
words = re.findall(regex, JD.lower())  # Convert to lowercase for uniform counting

# Use a set to find unique words
unique_words = set(words)

# Print the unique words
print("Unique Words:")
for word in sorted(unique_words):  # Sort for better readability
    print(word)

Unique Words:
achieve
acquired
adhering
all
allow
an
analysts
and
application
applications
applying
are
as
basic
best
by
channels
complex
computer
concentration
configuration
customer
data
degree
direct
domain
electronic
email
end
engineering
environment
escalates
established
experience
experiences
external
few
formatting
functionality
functioning
geoscience
guided
higher
how
in
including
installations
internal
is
issues
job
knowledge
landmark
learning
level
limited
nature
needed
no
novel
objectives
of
on
one
operational
or
practices
preferred
previous
problem
processes
proposing
provided
provides
quality
questions
regarding
remote
replicating
request
requests
required
requires
resolves
review
routine
s
science
self
service
services
shooting
software
solutions
solving
structured
supervision
support
technical
techniques
telephone
that
the
their
through
to
training
trouble
under
undergraduate
understanding
used
user
users
via
while
within
workflow
workflows


In [None]:
##practice on random questions##

In [None]:
#1# Define a function that takes no arguments

In [53]:
def python_def_keyword():
    print ("hello")
python_def_keyword()
    

hello


In [59]:
#2# Define a function that returns the difference of two numbers.

# function for subtraction of 2 numbers
def python_def_subnumbers(x, y):
    return x - y

# main code
a = 90
b = 40

# finding subtraction
result = python_def_subnumbers(a, b)

print("subtraction of ", a, " and ", b, " is = ", result)


subtraction of  90  and  40  is =  50


In [73]:
##1. Write a Python program to check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9).

import re

def is_valid_string(s):
    # Regex pattern for the allowed characters (a-z, A-Z, 0-9)
    pattern = r'^[a-zA-Z0-9]+$'

    # With ^ and $ in the pattern, re.match succeeds only if the string consists of allowed characters
    return re.match(pattern, s) is not None

# Test cases
test_strings = [
    "hello123",        # Valid
    "hello world!",    # Invalid (space and punctuation)
    "validstring123",  # Valid
    "123456",          # Valid
    "invalid@chars",   # Invalid (special character)
    "",                # Invalid (empty string)
]

# Check each test string
for test in test_strings:
    result = is_valid_string(test)
    print(f"'{test}': {'Valid' if result else 'Invalid'}")

'hello123': Valid
'hello world!': Invalid
'validstring123': Valid
'123456': Valid
'invalid@chars': Invalid
'': Invalid
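
One caveat with the `^...$` approach: `$` also matches just before a trailing newline, so a string such as "hello123\n" passes the check. A possible alternative is `re.fullmatch`, sketched below; the helper name `is_alphanumeric` is only for illustration.

```python
import re

def is_alphanumeric(s):
    # fullmatch requires the whole string to fit the pattern, so no ^ or $ is needed
    return re.fullmatch(r'[a-zA-Z0-9]+', s) is not None

with_newline = "hello123\n"

# The anchored re.match version accepts the trailing newline...
print(re.match(r'^[a-zA-Z0-9]+$', with_newline) is not None)  # True
# ...while fullmatch rejects it
print(is_alphanumeric(with_newline))                           # False
print(is_alphanumeric("hello123"))                             # True
```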


In [None]:
##2. Write a Python program that matches a string that has an a followed by zero or more b's.

import re

def match_string(s):
    # Regex pattern for "a" followed by zero or more "b"s, anchored at the start
    pattern = r'^a(b*)'

    # Use re.match to check if the string matches the pattern
    match = re.match(pattern, s)

    if match:
        return f"matched: '{match.group()}' with '{match.group(1)}' (zero or more 'b's)"
    else:
        return "no match"

# Test cases
test_strings = ["a", "ab", "abb", "xyz", "ab2", "aabb", "a b"]

# Check each test string
for test in test_strings:
    result = match_string(test)
    print(f"'{test}': {result}")
    

In [None]:
## now trying regex on some JDs##

##fh holds the text of the file: we open the file in read-only mode and read its contents.
Prefixing the path with r makes it a raw string, which avoids conflicts
caused by characters such as backslashes in directory paths on Windows.


In [3]:
# Read the whole file into fh; the with block closes the file afterwards
with open(r"jobs-JDs.txt", "r") as f:
    fh = f.read()

re.findall takes two arguments, in the form re.findall(pattern, string).
Here, pattern represents the substring we want to find,
and string represents the main string we want to find it in.
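
What `re.findall` returns depends on whether the pattern contains capturing groups. A small sketch on a made-up requirement line:

```python
import re

text = "Requires Python 3 and SQL; Python preferred, R optional."

# No groups: a list of the full matches
full_matches = re.findall(r'Python|SQL', text)
print(full_matches)  # ['Python', 'SQL', 'Python']

# Two groups: a list of tuples, one item per group
pairs = re.findall(r'(\w+) (preferred|optional)', text)
print(pairs)  # [('Python', 'preferred'), ('R', 'optional')]
```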

In [8]:
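regex = r'\b\w+\b'
# Note: this searches the literal string "jobs-JDs.txt" (the file name), not the file contents held in fh
words = re.findall(regex, "jobs-JDs.txt")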
unique_words = set(words)
print(unique_words)

{'txt', 'jobs', 'JDs'}


In [10]:
regex = r'\b\w+\b'
# The input is still the file name string, so only its three words come back
words = re.findall(regex, "jobs-JDs.txt")
unique_words = set(words)
print("unique_words:")
for word in sorted(unique_words):  # Sort for better readability
    print(word)

unique_words:
JDs
jobs
txt


\w matches alphanumeric characters, which means a-z, A-Z, and 0-9. It also matches the underscore, _, but not the dash, -. For str patterns Python also counts Unicode letters and digits.
\d matches digits, which means 0-9.
\s matches whitespace characters, which include the tab, new line, carriage return, and space characters.
\S matches non-whitespace characters.
. matches any character except the new line character \n.
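
A quick way to see these classes in action is to run each one over the same short string. The sample string below is made up for this sketch:

```python
import re

sample = "job_id-42\tdue 2024"

word_runs = re.findall(r'\w+', sample)
digit_runs = re.findall(r'\d+', sample)
spaces = re.findall(r'\s', sample)
non_space_runs = re.findall(r'\S+', sample)
lines = re.findall(r'.+', "line one\nline two")

print(word_runs)       # ['job_id', '42', 'due', '2024'] (the dash splits the first token)
print(digit_runs)      # ['42', '2024']
print(spaces)          # ['\t', ' ']
print(non_space_runs)  # ['job_id-42', 'due', '2024']
print(lines)           # ['line one', 'line two'] ('.' stops at the newline)
```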

In [27]:
for line in re.findall("skill.",fh):
    print(line)

skills
skills
skills
skills
skills
skills
skille
skills
skills
skills
skills
skills
skills
skills


In [35]:
match = re.findall("skill.*", fh)
print(match)

['skills, written and verbal.', 'skills.', 'skills', 'skills (ability to work and communicate with different departments/functions e.g. subsurface, offshore, operations, and external stakeholders).', 'skills, capacity and motivation to deliver in a timely manner.', 'skills.', 'skilled Data Analyst to join our dynamic team. In this role, you will work with large volumes of data to provide valuable business insights, supporting our Analytics team in making data-driven decisions. You will leverage tools like SQL, Tableau, or Power BI to conduct detailed analyses and deliver comprehensive reports that will guide performance assessment and process optimization across the business.', 'skills with an analytical mindset.', 'skills and the ability to explain analytical concepts to a non-technical audience.', 'skills, data manipulation capabilities and business acumen.', 'skills. Ability to lead large organizations through influence.', 'skills, including a proven ability to quickly analyze and d

The cell above and the cell below find the same matches; the cell below prints them one per line.

In [23]:
with open(r"jobs-JDs.txt", "r") as f:
    fh = f.read()
for line in re.findall("skill.*",fh):
    print(line)

skills, written and verbal.
skills.
skills
skills (ability to work and communicate with different departments/functions e.g. subsurface, offshore, operations, and external stakeholders).
skills, capacity and motivation to deliver in a timely manner.
skills.
skilled Data Analyst to join our dynamic team. In this role, you will work with large volumes of data to provide valuable business insights, supporting our Analytics team in making data-driven decisions. You will leverage tools like SQL, Tableau, or Power BI to conduct detailed analyses and deliver comprehensive reports that will guide performance assessment and process optimization across the business.
skills with an analytical mindset.
skills and the ability to explain analytical concepts to a non-technical audience.
skills, data manipulation capabilities and business acumen.
skills. Ability to lead large organizations through influence.
skills, including a proven ability to quickly analyze and develop data models, identify patter

In [46]:
match = re.findall("skill.*",fh)

for line in match:
    print(re.findall("skill.*", line))

['skills, written and verbal.']
['skills.']
['skills']
['skills (ability to work and communicate with different departments/functions e.g. subsurface, offshore, operations, and external stakeholders).']
['skills, capacity and motivation to deliver in a timely manner.']
['skills.']
['skilled Data Analyst to join our dynamic team. In this role, you will work with large volumes of data to provide valuable business insights, supporting our Analytics team in making data-driven decisions. You will leverage tools like SQL, Tableau, or Power BI to conduct detailed analyses and deliver comprehensive reports that will guide performance assessment and process optimization across the business.']
['skills with an analytical mindset.']
['skills and the ability to explain analytical concepts to a non-technical audience.']
['skills, data manipulation capabilities and business acumen.']
['skills. Ability to lead large organizations through influence.']
['skills, including a proven ability to quickly an
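
Each "skill.*" match above runs to the end of its line because `.` does not cross a newline and `*` is greedy. The sketch below, on a made-up two-line string rather than the real file, shows how a lazy quantifier or the `re.DOTALL` flag changes what is captured.

```python
import re

text = "Strong communication skills, written and verbal.\nAnalytical skills."

# '.' stops at a newline, so 'skill.*' runs only to the end of each line
per_line = re.findall(r'skill.*', text)
print(per_line)  # ['skills, written and verbal.', 'skills.']

# A lazy quantifier plus a delimiter stops at the first comma or period
lazy = re.findall(r'skill.*?[,.]', text)
print(lazy)  # ['skills,', 'skills.']

# re.DOTALL lets '.' cross newlines, so one greedy match takes the rest of the text
dotall = re.findall(r'skill.*', text, flags=re.DOTALL)
print(len(dotall))  # 1
```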

In [50]:
# Define a regex pattern to find words (case insensitive)
regex = r'\b\w+\b'  # Matches any word

# Find all words in the text using regex
words = re.findall(regex, fh.lower())  # Convert to lowercase for uniform counting

# Use a set to find unique words
unique_words = set(words)

# Print the unique words
print("Unique Words:")
for word in sorted(unique_words):  # Sort for better readability
    print(word)

Unique Words:
1
2
3
4
5
7
a
ability
able
about
above
abu
academia
academic
acceptance
account
accountabilities
accountability
accounting
accounts
accuracy
accurate
acquisition
across
actionable
active
activities
activity
acumen
ad
added
adding
additional
address
adekâ
adhere
administration
adobe
advanced
advantage
aems
agc
agile
agreed
ai
algorithms
all
also
an
analyses
analysis
analyst
analysts
analytical
analytics
analyze
analyzing
and
answers
apache
api
apis
applications
applied
applying
approaches
appropriately
approx
arab
architect
architecture
are
arrive
arts
as
asana
assess
assesses
assessment
assets
assist
assistance
associate
assurance
assure
at
atmosphere
attention
attire
audience
audits
automate
automated
automation
autonomously
available
aws
azure
aâ
b
baccalaureate
bachelors
bachelorâ
backbone
backend
based
basic
be
believes
belonging
benchmarking
best
between
bi
block
both
bp
bridge
broad
broader
build
builder
building
business
by
c
call
campuses
can
candidate
canva
capab

In [20]:
import re
from re import search
from re import findall
from re import split

from collections import Counter

In [21]:
with open(r"jobs-JDs.txt", "r") as f:
    fh = f.read()
match = search ('(skill)',fh)
if match:
    print("matched")
else:
    print("no match")

matched
