#Regular Expressions (RE)

A regular expression (RegEx) is a sequence of characters that defines a search pattern, used to find a string or a set of strings.

It can detect the presence or absence of text by matching it against a particular pattern, and a pattern can also be divided into one or more sub-patterns.

In [None]:
import re

s = 'Hello from CSE475'

match = re.search(r'Hello', s)

print('Start Index:', match.start())
print('End Index:', match.end())

Start Index: 0
End Index: 5


Here the r prefix in r'Hello' stands for raw, not regex. A raw string differs from a regular string in that Python does not interpret the \ character as an escape character. This matters because the regular expression engine uses the \ character for its own escaping.
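
The difference is easiest to see by comparing string lengths. The following is a minimal sketch; the sample strings (including the Windows-style path) are illustrative and not part of the original notebook.

```python
import re

# In a regular string, '\n' is a single newline character.
regular = '\n'
# In a raw string, r'\n' is two characters: a backslash and the letter n.
raw = r'\n'

print(len(regular), len(raw))

# To match one literal backslash, the regex engine needs an escaped
# backslash, which a raw string spells as r'\\'.
path = r'C:\temp'
match = re.search(r'\\', path)
print(match.start())
```

Without the r prefix, the same pattern would have to be written as '\\\\', because Python and the regex engine each consume one level of escaping.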

##re.findall()

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

In [None]:
import re
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
regex = r'\d+'

match = re.findall(regex, string)
print(match)

['123456789', '987654321']


This code uses a regular expression (\d+) to find all the sequences of one or more digits in the given string. It searches for numeric values and stores them in a list. In this example, it finds and prints the numbers “123456789” and “987654321” from the input string.
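
Two details of re.findall() are worth seeing in isolation: matches do not overlap, and the shape of the result changes when the pattern contains capturing groups. The sketch below uses made-up input strings to show both.

```python
import re

# Matches never overlap: after '12' is consumed, scanning resumes at '3',
# so the trailing '5' has no partner and is not reported.
pairs = re.findall(r'\d\d', '12345')
print(pairs)

# With two or more capturing groups, each match is returned as a tuple
# holding the text of every group instead of the whole match.
fields = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')
print(fields)
```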

##re.compile()

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [None]:
import re
p = re.compile('[a-e]')

print(p.findall("Aye, said Mr. Gibenson Stark"))

['e', 'a', 'd', 'b', 'e', 'a']


Understanding the Output:

The first match is 'e' in "Aye" and not 'A', because matching is case-sensitive.
The next matches are 'a' and 'd' in "said", followed by 'b' and 'e' in "Gibenson"; the last 'a' comes from "Stark".
The backslash metacharacter '\' plays an important role because it introduces various special sequences. To match a backslash without its special meaning as a metacharacter, use '\\'.
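
To see the effect of case sensitivity on the example above, the same character class can be compiled with the re.IGNORECASE flag. This is a small sketch that reuses the sentence from the notebook; the variable names are illustrative.

```python
import re

text = "Aye, said Mr. Gibenson Stark"

# Default behaviour: [a-e] only matches lowercase letters a to e.
case_sensitive = re.compile('[a-e]')
# With re.IGNORECASE the same class also matches the uppercase 'A' in "Aye".
case_insensitive = re.compile('[a-e]', re.IGNORECASE)

sensitive_hits = case_sensitive.findall(text)
insensitive_hits = case_insensitive.findall(text)

print(sensitive_hits)
print(insensitive_hits)
```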

In [38]:
import re
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']


The code uses regular expressions to list all single digits and all runs of consecutive digits in the input string. It finds single digits with \d and sequences of digits with \d+.

##re.split()

Splits a string at each occurrence of a character or pattern. The pieces between the matches, including whatever remains after the last match, are returned as a list.

In [39]:
from re import split

print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']


The first call splits on runs of non-word characters (here commas and spaces), returning the words: ['Words', 'words', 'Words']. The second shows that an apostrophe counts as a non-word character: ['Word', 's', 'words', 'Words']. The third again splits on non-word characters, so digits stay inside the tokens: ['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']. The fourth uses runs of digits as the delimiter instead: ['On ', 'th Jan ', ', at ', ':', ' AM'].
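
Two further options of re.split() are not used above but follow the same idea: the maxsplit argument limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. The sketch below uses the same sentence plus a short made-up string.

```python
import re

text = 'On 12th Jan 2016, at 11:02 AM'

# maxsplit=1 stops after the first split; the rest of the string stays whole.
first_split = re.split(r'\d+', text, maxsplit=1)
print(first_split)

# Wrapping the pattern in a capturing group keeps the matched delimiters.
with_delims = re.split(r'(\d+)', 'a1b22c')
print(with_delims)
```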

##re.search()

This method returns None if the pattern does not match anywhere in the string, or a re.Match object that contains information about the matching part of the string. It stops after the first match, so it is better suited to testing whether a pattern occurs than to extracting every occurrence.

In [None]:
import re
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24")
if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")

Match at index 14, 21
Full match: June 24
Month: June
Day: 24


This Python code uses the re module to search for a date format (a word followed by a number, like "June 24") in the string "I was born on June 24". It defines a regular expression pattern r"([a-zA-Z]+) (\d+)", which captures a word and a number. If a match is found, it prints the position of the match in the string, the full matched text, and the individual components (month and day). If no match is found, it notifies that the pattern does not match.
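
The numbered groups above can also be given names, which makes the extraction code easier to read. The following sketch is a variant of the same pattern; the group names month and day and the second test string are illustrative choices.

```python
import re

# Same pattern as above, with named groups instead of numbered ones.
pattern = re.compile(r"(?P<month>[a-zA-Z]+) (?P<day>\d+)")

match = pattern.search("I was born on June 24")
if match is not None:
    print(match.group("month"), match.group("day"))
    print(match.groupdict())

# search() returns None when the pattern occurs nowhere in the string.
no_match = pattern.search("no date here")
print(no_match)
```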


# Tokenization

##Word Tokenization

The nltk.tokenize.word_tokenize() function extracts word tokens from a string of characters. The example below works one level lower: it uses SyllableTokenizer, which returns the syllables of a single word. A single word can contain one or more syllables.

In [None]:
from nltk import SyllableTokenizer
from nltk import word_tokenize

# Create a SyllableTokenizer instance
tk = SyllableTokenizer()

# Create a string input
gfg = "Antidisestablishmentarianism"

# Use tokenize method
geek = tk.tokenize(gfg)

print(geek)


['Anti', 'di', 'ses', 'ta', 'blis', 'hmen', 'ta', 'ria', 'nism']


##Rule-based Tokenization

Rule-based tokenization is a technique where a set of rules is applied to the input text to split it into tokens. These rules can be based on different criteria, such as whitespace, punctuation, regular expressions, or language-specific rules.

In [None]:
# Step 1: Load the input text
text = "The quick brown fox jumps over the lazy dog."

# Step 2: Define the tokenization rules (split on whitespace)
tokens = text.split()

# Step 3: Output the tokens
print(tokens)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']


In [40]:
import re

#Load the input text
text = "Hello, I am working at Geeks-for-Geeks and my email is abc23@gfg.com."

#Define the regular expression pattern
p = r'([\w]+-[\w]+-[\w]+)|([\w.-]+@[\w]+\.[\w]+)'

# Find matches
matches = re.findall(p, text)
# print output
for match in matches:
    if match[0]:
        print(f"Company Name: {match[0]}")
    else:
        print(f"Email address: {match[1]}")

Company Name: Geeks-for-Geeks
Email address: abc23@gfg.com
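
The individual rules used so far (emails, hyphenated names, plain words, punctuation) can be combined into one ordered pattern, so that a single pass yields all tokens. This is a minimal sketch, not a complete tokenizer; TOKEN_RULES and rule_tokenize are hypothetical names, and the email rule is deliberately simplistic.

```python
import re

# Ordered alternation: at each position the first alternative that matches wins,
# so more specific rules are listed before more general ones.
TOKEN_RULES = re.compile(r"""
    [\w.-]+@\w+(?:\.\w+)+   # email-like strings
  | \w+(?:-\w+)+            # hyphenated words
  | \w+                     # plain words
  | [^\w\s]                 # any single punctuation mark
""", re.VERBOSE)


def rule_tokenize(text):
    """Return the tokens found by applying TOKEN_RULES from left to right."""
    return TOKEN_RULES.findall(text)


tokens = rule_tokenize(
    "Hello, I am working at Geeks-for-Geeks and my email is abc23@gfg.com."
)
print(tokens)
```

Placing the email and hyphen rules first matters: if the plain-word rule came first, "Geeks-for-Geeks" would be broken into three words and two hyphens.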


In [None]:
import re

# Load the input text
text = "Hello! How can I help you?"

# Define the regular expression pattern
# Matches one or more non-alphanumeric characters
pattern = r'\W+'

# Remove the punctuation and get the resulting string
result = re.sub(pattern, ' ', text)

# tokenize
tokens = re.findall(r'\b\w+\b|[^\w\s]', result)

# Print the result
print(tokens)

['Hello', 'How', 'can', 'I', 'help', 'you']


##Subword Tokenization

Subword tokenization is a natural language processing (NLP) technique in which a word is split into subwords, and these subwords are used as tokens. It is used in NLP tasks where a model would otherwise need a very large vocabulary or has to handle complex word structures. The idea is that frequently occurring words stay in the vocabulary as they are, whereas rare words are split into frequent subwords. For example, the word "unwanted" might be split into "un", "want", and "ed". The word "football" might be split into "foot" and "ball".
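
One simple way to apply a fixed subword vocabulary is a greedy longest-match split. The sketch below assumes a tiny hand-written vocabulary; greedy_subword_split and vocab are hypothetical names, and real tokenizers learn their vocabulary from data rather than listing it by hand.

```python
def greedy_subword_split(word, vocab):
    """Split word into the longest vocabulary pieces, scanning left to right.

    Characters that are not covered by any vocabulary entry are emitted
    as single-character pieces.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary chosen to reproduce the examples in the text.
vocab = {"un", "want", "ed", "foot", "ball"}

print(greedy_subword_split("unwanted", vocab))
print(greedy_subword_split("football", vocab))
```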

In [None]:
import re

test_str = """
GeeksforGeeks is a fantastic resource for geeks
who are looking to enhance their programming skills,
and if you're a geek who wants to become an expert programmer,
then GeeksforGeeks is definitely the go-to place for geeks like you.
"""
# printing original String
print("The original string is : " + str(test_str))
test_str=test_str.lower()
# using findall() to get all regex matches.
res = re.findall( r'\w+|[^\s\w]+', test_str)

# printing result
print("The converted string :\n" + str(res))

The original string is : 
GeeksforGeeks is a fantastic resource for geeks 
who are looking to enhance their programming skills, 
and if you're a geek who wants to become an expert programmer, 
then GeeksforGeeks is definitely the go-to place for geeks like you.

The converted string :
['geeksforgeeks', 'is', 'a', 'fantastic', 'resource', 'for', 'geeks', 'who', 'are', 'looking', 'to', 'enhance', 'their', 'programming', 'skills', ',', 'and', 'if', 'you', "'", 're', 'a', 'geek', 'who', 'wants', 'to', 'become', 'an', 'expert', 'programmer', ',', 'then', 'geeksforgeeks', 'is', 'definitely', 'the', 'go', '-', 'to', 'place', 'for', 'geeks', 'like', 'you', '.']


Because every distinct word becomes its own entry, the dictionary grows large, and word tokenization can suffer from an exploding-vocabulary problem. Tokenizing at the character level avoids this problem, since the set of characters is small. As a first step, we build a dictionary that records the frequency of each word after word tokenization, with the characters of each word separated by spaces.

In [None]:
from collections import OrderedDict

# Map each token, written as space-separated characters, to its frequency
res_dict = OrderedDict()
for token in res:
    spaced = ' '.join(token)
    res_dict[spaced] = res_dict.get(spaced, 0) + 1
res_dict

OrderedDict([('g e e k s f o r g e e k s', 2),
             ('i s', 2),
             ('a', 2),
             ('f a n t a s t i c', 1),
             ('r e s o u r c e', 1),
             ('f o r', 2),
             ('g e e k s', 2),
             ('w h o', 2),
             ('a r e', 1),
             ('l o o k i n g', 1),
             ('t o', 3),
             ('e n h a n c e', 1),
             ('t h e i r', 1),
             ('p r o g r a m m i n g', 1),
             ('s k i l l s', 1),
             (',', 2),
             ('a n d', 1),
             ('i f', 1),
             ('y o u', 2),
             ("'", 1),
             ('r e', 1),
             ('g e e k', 1),
             ('w a n t s', 1),
             ('b e c o m e', 1),
             ('a n', 1),
             ('e x p e r t', 1),
             ('p r o g r a m m e r', 1),
             ('t h e n', 1),
             ('d e f i n i t e l y', 1),
             ('t h e', 1),
             ('g o', 1),
             ('-', 1),
             ('p l a c e', 1
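
A frequency dictionary of space-separated characters like this one is a common starting point for merge-based subword vocabularies such as byte-pair encoding: the most frequent adjacent pair of symbols is merged into one symbol, and the step is repeated. The sketch below shows a single merge on a small made-up dictionary; count_pairs, merge_pair, and toy are hypothetical names and not part of the original notebook.

```python
from collections import Counter


def count_pairs(freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in freqs.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, freqs):
    """Return a new dictionary in which every occurrence of pair is merged."""
    merged = {}
    for word, freq in freqs.items():
        symbols = word.split()
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = ' '.join(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy input in the same shape as res_dict (assumed values for illustration).
toy = {'g e e k s': 2, 'g e e k': 1, 'f o r': 2}

pairs = count_pairs(toy)
best = max(pairs, key=pairs.get)   # one of the most frequent pairs
toy_after_merge = merge_pair(best, toy)
print(best, toy_after_merge)
```

Repeating the count-and-merge step grows the symbols from single characters into frequent subwords, which is how the vocabulary stays smaller than a full word list.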

#Lemmatization

Lemmatization in natural language processing (NLP) identifies words and transforms them into their base or dictionary forms, known as lemmas. It contributes to text normalization, facilitating more accurate language analysis and processing in various NLP applications.

In [None]:
# import these modules
import nltk
from nltk.stem import WordNetLemmatizer

try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')

In [26]:
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))


[nltk_data] Downloading package wordnet to /root/nltk_data...


rocks : rock
corpora : corpus
better : good


In [23]:
import spacy

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

# Define a sample text
text = "The quick brown foxes are jumping over the lazy dogs."

# Process the text using spaCy
doc = nlp(text)

# Extract lemmatized tokens
lemmatized_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence
lemmatized_text = ' '.join(lemmatized_tokens)

# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)


Original Text: The quick brown foxes are jumping over the lazy dogs.
Lemmatized Text: the quick brown fox be jump over the lazy dog .


#Stemming

Stemming is a text-processing method that strips prefixes and suffixes from words, reducing them to a fundamental or root form. The main objective of stemming is to standardize word variants, which makes natural language processing tasks more effective.
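
The core idea can be sketched with a deliberately naive suffix stripper. This is not the Porter, Snowball, or Lancaster algorithm; SUFFIXES, toy_stem, and the min_stem threshold are hypothetical choices made only to illustrate the principle.

```python
# Hypothetical suffix list; 'es' is listed before 's' so the longer one wins.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word


words = ["running", "jumps", "happily", "jumped", "foxes"]
stems = [toy_stem(w) for w in words]
print(stems)
```

The results "runn" and "happi" show that a stem need not be a dictionary word; the stemmers below apply many more rules to handle such cases better.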

##Porter’s Stemmer

It is one of the most popular stemming methods and was proposed in 1980. It is based on the idea that suffixes in the English language are made up of combinations of smaller and simpler suffixes. This stemmer is known for its speed and simplicity, and its main applications include data mining and information retrieval. However, it is limited to English words. Several related words may be mapped to the same stem, and the output stem is not necessarily a meaningful word. The algorithm is fairly lengthy and is one of the oldest stemmers.

In [27]:
from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
words = ["running", "jumps", "happily", "running", "happily"]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in words]

# Print the results
print("Original words:", words)
print("Stemmed words:", stemmed_words)


Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']


##Snowball Stemmer

The Snowball Stemmer, unlike the Porter Stemmer, is multilingual and can handle non-English words. It supports various languages and is based on the 'Snowball' programming language, which is designed for efficient processing of small strings.

The Snowball stemmer is more aggressive than the Porter Stemmer and is also referred to as the Porter2 Stemmer. Because of the improvements over the Porter Stemmer, the Snowball stemmer also has greater computational speed.

In [None]:
from nltk.stem import SnowballStemmer

# Choose a language for stemming, for example, English
stemmer = SnowballStemmer(language='english')

# Example words to stem
words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

# Apply Snowball Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)


##Lancaster Stemmer

The Lancaster stemmer is more aggressive and dynamic than the other two stemmers. It is fast, but its output can be confusing for short words, and it is not as efficient as the Snowball stemmer. The Lancaster stemmer stores its rules externally and uses an iterative algorithm.

In [28]:
from nltk.stem import LancasterStemmer

# Create a Lancaster Stemmer instance
stemmer = LancasterStemmer()

# Example words to stem
words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

# Apply Lancaster Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)


Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']


#Removing stop words

##Removing stop words with NLTK

In [32]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk

# Download the 'punkt_tab' and 'stopwords' resources if not found
for path, name in [('tokenizers/punkt_tab', 'punkt_tab'),
                   ('corpora/stopwords', 'stopwords')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(name)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [33]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = """This is a sample sentence,
				showing off the stop words filtration."""

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
# Case-insensitive filter: lowercase each token before checking stop_words
filtered_lower = [w for w in word_tokens if w.lower() not in stop_words]

# Case-sensitive filter: 'This' is kept because only 'this' is in stop_words
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)


['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


In the first step, the sample sentence, which reads "This is a sample sentence, showing off the stop words filtration," is tokenized into words using the word_tokenize function. The code then builds the filtered list twice: first with a case-insensitive check, which lowercases each token before looking it up in the set of English stopwords from NLTK, and then with a case-sensitive loop. Only the case-sensitive result is printed, which is why 'This' survives in the output even though 'this' is a stopword.
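
The effect of the lowercase conversion can be reproduced without NLTK. The sketch below assumes a tiny hand-written stop list and a simple regex tokenizer, so its stop list is much shorter than the one NLTK provides.

```python
import re

# Hypothetical miniature stop list; NLTK's English list is much longer.
stop_words = {"this", "is", "a", "the", "off"}

sentence = "This is a sample sentence, showing off the stop words filtration."
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Lowercasing before the lookup removes 'This' as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]
# Without lowercasing, 'This' does not equal 'this' and is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

print(case_insensitive)
print(case_sensitive)
```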

##Removing stop words with SpaCy

In [34]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "There is a pen on the table"

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
clean_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", clean_text)


Original Text: There is a pen on the table
Text after Stopword Removal: pen table


The provided Python code utilizes the spaCy library for natural language processing to remove stopwords from a sample text. Initially, the spaCy English model is loaded, and the sample text, “There is a pen on the table,” is processed using spaCy. Stopwords are then filtered out from the processed tokens, and the resulting non-stopword tokens are joined to create a clean version of the text.

##Removing stop words with SkLearn

In [35]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.tokenize import word_tokenize

# Another sample text
new_text = "The quick brown fox jumps over the lazy dog."

# Tokenize the new text using NLTK
new_words = word_tokenize(new_text)

# Remove stopwords using scikit-learn's built-in English stop word list
new_filtered_words = [
    word for word in new_words if word.lower() not in ENGLISH_STOP_WORDS]

# Join the filtered words to form a clean text
new_clean_text = ' '.join(new_filtered_words)

print("Original Text:", new_text)
print("Text after Stopword Removal:", new_clean_text)


Original Text: The quick brown fox jumps over the lazy dog.
Text after Stopword Removal: quick brown fox jumps lazy dog .


#POS(Parts-Of-Speech) Tagging

##NLTK

In [36]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# Importing the NLTK library
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample text
text = "NLTK is a powerful library for natural language processing."

# Tokenize the text into words
words = word_tokenize(text)

# Performing PoS tagging
pos_tags = pos_tag(words)

# Displaying the PoS tagged result in separate lines
print("Original Text:")
print(text)

print("\nPoS Tagging Result:")
for word, tag in pos_tags:
    print(f"{word}: {tag}")


The code imports NLTK and its tokenization module, tokenizes the input text into words with word_tokenize, and uses the pos_tag function from NLTK to perform part-of-speech tagging on the tokenized words. It then prints the original text, followed by each word and its corresponding part-of-speech tag on a separate line.

##Spacy

In [37]:
#importing libraries
import spacy

# Load the English language model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "SpaCy is a popular natural language processing library."

# Process the text with SpaCy
doc = nlp(text)

# Display the PoS tagged result
print("Original Text: ", text)
print("PoS Tagging Result:")
for token in doc:
	print(f"{token.text}: {token.pos_}")


Original Text:  SpaCy is a popular natural language processing library.
PoS Tagging Result:
SpaCy: PROPN
is: AUX
a: DET
popular: ADJ
natural: ADJ
language: NOUN
processing: NOUN
library: NOUN
.: PUNCT
