# word / sentence tokenization

word tokenization using nltk and spacy

In [1]:
sample_text = 'Product Allocation (PAL) in advanced Available-to-Promise (aATP) is a mechanism in SAP S/4HANA that helps avoid critical situations in demand and procurement. It allows the allocation of materials in short supply to specific regions and customers for a specific time period. This ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed. PAL helps in precise planning and control of material delivery to meet customer demands.'

In [2]:
# nltk - word tokenization

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
word_tokenize(sample_text) 

[nltk_data] Downloading package punkt to /Users/I748920/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Product',
 'Allocation',
 '(',
 'PAL',
 ')',
 'in',
 'advanced',
 'Available-to-Promise',
 '(',
 'aATP',
 ')',
 'is',
 'a',
 'mechanism',
 'in',
 'SAP',
 'S/4HANA',
 'that',
 'helps',
 'avoid',
 'critical',
 'situations',
 'in',
 'demand',
 'and',
 'procurement',
 '.',
 'It',
 'allows',
 'the',
 'allocation',
 'of',
 'materials',
 'in',
 'short',
 'supply',
 'to',
 'specific',
 'regions',
 'and',
 'customers',
 'for',
 'a',
 'specific',
 'time',
 'period',
 '.',
 'This',
 'ensures',
 'that',
 'the',
 'entire',
 'available',
 'quantity',
 'of',
 'a',
 'material',
 'is',
 'not',
 'allocated',
 'to',
 'a',
 'single',
 'customer',
 ',',
 'enabling',
 'subsequent',
 'order',
 'requirements',
 'from',
 'other',
 'customers',
 'to',
 'be',
 'confirmed',
 '.',
 'PAL',
 'helps',
 'in',
 'precise',
 'planning',
 'and',
 'control',
 'of',
 'material',
 'delivery',
 'to',
 'meet',
 'customer',
 'demands',
 '.']

In [3]:
# spacy - word tokenization

import spacy
# run this first
# !python3 -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
for i in nlp(sample_text):
    print(i)



Product
Allocation
(
PAL
)
in
advanced
Available
-
to
-
Promise
(
aATP
)
is
a
mechanism
in
SAP
S/4HANA
that
helps
avoid
critical
situations
in
demand
and
procurement
.
It
allows
the
allocation
of
materials
in
short
supply
to
specific
regions
and
customers
for
a
specific
time
period
.
This
ensures
that
the
entire
available
quantity
of
a
material
is
not
allocated
to
a
single
customer
,
enabling
subsequent
order
requirements
from
other
customers
to
be
confirmed
.
PAL
helps
in
precise
planning
and
control
of
material
delivery
to
meet
customer
demands
.


In [4]:
type(nlp(sample_text))

spacy.tokens.doc.Doc

In [5]:
doc = nlp(sample_text)  # parse once, then index the Doc
doc[0], doc[1], doc[2], doc[3]

(Product, Allocation, (, PAL)

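For word tokenization, either NLTK or spaCy works fine. The main difference on this sample is hyphen handling: NLTK keeps 'Available-to-Promise' as one token, while spaCy splits it into 'Available', '-', 'to', '-', 'Promise'.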
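
To make the hyphen difference concrete, here is a minimal regex-based sketch using only the standard library. It is not the algorithm used by NLTK or spaCy; the two patterns are assumptions chosen purely to illustrate the two tokenization choices.

```python
import re

text = "Available-to-Promise (aATP) is a mechanism in SAP S/4HANA."

# Variant 1: keep runs of word characters joined by "-" or "/" together,
# and emit every other non-space character as its own token.
keep_compounds = re.findall(r"\w+(?:[-/]\w+)*|[^\w\s]", text)

# Variant 2: treat every punctuation character, including "-" and "/",
# as a separate token.
split_compounds = re.findall(r"\w+|[^\w\s]", text)

print(keep_compounds)
print(split_compounds)
```

Which variant is preferable depends on the downstream task: keeping compounds preserves domain terms, while splitting them gives a smaller vocabulary.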

sentence tokenization

In [6]:
# nltk - sentence tokenization

nltk_sentence_tokens = nltk.sent_tokenize(sample_text)
nltk_sentence_tokens

['Product Allocation (PAL) in advanced Available-to-Promise (aATP) is a mechanism in SAP S/4HANA that helps avoid critical situations in demand and procurement.',
 'It allows the allocation of materials in short supply to specific regions and customers for a specific time period.',
 'This ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed.',
 'PAL helps in precise planning and control of material delivery to meet customer demands.']

In [7]:
# spacy - sentence tokenization

spacy_sentence_tokens = nlp(sample_text).sents
for i in spacy_sentence_tokens:
    print(i,'\n')

Product Allocation (PAL) in advanced Available-to-Promise (aATP) is a mechanism in SAP S/4HANA that helps avoid critical situations in demand and procurement. 

It allows the allocation of materials in short supply to specific regions and customers for a specific time period. 

This ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed. 

PAL helps in precise planning and control of material delivery to meet customer demands. 



Sentence tokenization takes a text and splits it into individual sentences. For literature, journalism, and formal documents, the sentence segmentation built into spaCy performs well, since the underlying model is trained on a corpus of formal English text. It performs poorly on text such as electronic health records, which feature abbreviations, medical terms, spatial measurements, and other forms not present in standard written English.

On this sample, NLTK and spaCy return the same four sentences. NLTK seems to be the better choice for sentence tokenization.
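
The abbreviation problem can be illustrated with a naive splitter. This is only a sketch under stated assumptions: the rule (split after ., ! or ? when followed by whitespace and an uppercase letter) and the clinical-style sentence are made up for illustration and are not how NLTK or spaCy segment text.

```python
import re

def naive_sent_split(text):
    # Split after ., ! or ? when followed by whitespace and an uppercase letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

formal = "PAL limits allocation. It protects later orders."
clinical = "Pt. seen by Dr. Lee at 3 p.m. BP 120/80 mm Hg."  # hypothetical note

formal_sents = naive_sent_split(formal)
clinical_sents = naive_sent_split(clinical)

print(formal_sents)
print(clinical_sents)
```

The formal text splits cleanly into two sentences, while the abbreviation "Dr." in the second text triggers a false sentence boundary.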

# stop words

Remove stop words:
Stop words are very common words that add little meaning to a sentence, such as “a”, “the”, “have”, and “an”, so they are often removed from the text before further processing.

The NLTK data package includes a pre-trained Punkt tokenizer for English; the stop-word list is a separate 'stopwords' corpus, downloaded below.

In [8]:
sample_text

'Product Allocation (PAL) in advanced Available-to-Promise (aATP) is a mechanism in SAP S/4HANA that helps avoid critical situations in demand and procurement. It allows the allocation of materials in short supply to specific regions and customers for a specific time period. This ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed. PAL helps in precise planning and control of material delivery to meet customer demands.'

In [9]:
# nltk - stop words

from nltk.corpus import stopwords
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /Users/I748920/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/I748920/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [11]:
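# build the stop-word set once instead of on every iteration
nltk_stopwords = set(stopwords.words('english'))

# the comparison is case-sensitive and split() keeps punctuation attached to words
x = [i for i in sample_text.split() if i not in nltk_stopwords]

x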

['Product',
 'Allocation',
 '(PAL)',
 'advanced',
 'Available-to-Promise',
 '(aATP)',
 'mechanism',
 'SAP',
 'S/4HANA',
 'helps',
 'avoid',
 'critical',
 'situations',
 'demand',
 'procurement.',
 'It',
 'allows',
 'allocation',
 'materials',
 'short',
 'supply',
 'specific',
 'regions',
 'customers',
 'specific',
 'time',
 'period.',
 'This',
 'ensures',
 'entire',
 'available',
 'quantity',
 'material',
 'allocated',
 'single',
 'customer,',
 'enabling',
 'subsequent',
 'order',
 'requirements',
 'customers',
 'confirmed.',
 'PAL',
 'helps',
 'precise',
 'planning',
 'control',
 'material',
 'delivery',
 'meet',
 'customer',
 'demands.']

In [12]:
# spacy - stop words

import spacy
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [66]:
# same case-sensitive check, this time against spaCy's stop-word set
x = [i for i in sample_text.split() if i not in spacy_stopwords]

x

['Product',
 'Allocation',
 '(PAL)',
 'advanced',
 'Available-to-Promise',
 '(aATP)',
 'mechanism',
 'SAP',
 'S/4HANA',
 'helps',
 'avoid',
 'critical',
 'situations',
 'demand',
 'procurement.',
 'It',
 'allows',
 'allocation',
 'materials',
 'short',
 'supply',
 'specific',
 'regions',
 'customers',
 'specific',
 'time',
 'period.',
 'This',
 'ensures',
 'entire',
 'available',
 'quantity',
 'material',
 'allocated',
 'single',
 'customer,',
 'enabling',
 'subsequent',
 'order',
 'requirements',
 'customers',
 'confirmed.',
 'PAL',
 'helps',
 'precise',
 'planning',
 'control',
 'material',
 'delivery',
 'meet',
 'customer',
 'demands.']

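With this simple whitespace split, spaCy's and NLTK's stop-word lists give the same result on this paragraph. Both comparisons are case-sensitive, so capitalized stop words such as 'It' and 'This' are kept, and punctuation stays attached to tokens such as 'procurement.' and 'customer,'. spaCy's list is larger than NLTK's, so results can differ on other texts. NLTK requires downloading the 'stopwords' corpus first, and the spaCy cells here rely on the downloaded 'en_core_web_sm' model.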
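
A case-insensitive filter that also strips surrounding punctuation can be sketched with the standard library alone. The tiny STOP_WORDS set and the function name remove_stop_words are assumptions for illustration; in practice the NLTK or spaCy list would be passed in instead.

```python
import string

# Tiny illustrative stop-word set; NOT the NLTK or spaCy list.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "this", "to"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    kept = []
    for raw in text.split():
        # Strip punctuation from both ends so "supply." compares as "supply".
        word = raw.strip(string.punctuation)
        # Compare in lower case so "It" and "This" are also caught.
        if word and word.lower() not in stop_words:
            kept.append(word)
    return kept

filtered = remove_stop_words("It allows the allocation of materials in short supply.")
print(filtered)
```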

# lemmatization

Lemmatization:
Lemmatization is a text normalization technique that reduces the inflected forms of a word to a common base form. For example, “counts”, “counting”, and “counted” all reduce to the base word “count”.

The resulting base word is referred to as a “lemma”.

Lemmatization techniques in natural language processing (NLP) identify and transform words into these base forms. This normalization supports more accurate language analysis and processing in various NLP applications.
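
As a rough illustration of the idea, a lemmatizer can be thought of as a lookup from (word form, part of speech) to lemma. The table below is a hypothetical mini-lexicon made up for this sketch; real lemmatizers such as those in spaCy or NLTK use far larger resources and additional rules.

```python
# Hypothetical mini-lexicon mapping (word form, part of speech) -> lemma.
LEMMA_TABLE = {
    ("is", "VERB"): "be",
    ("helps", "VERB"): "help",
    ("customers", "NOUN"): "customer",
    ("better", "ADJ"): "good",
    ("better", "VERB"): "better",
}

def lookup_lemma(word, pos):
    # Fall back to the word itself when the form is not in the table.
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word, pos in [("is", "VERB"), ("Customers", "NOUN"), ("better", "ADJ"), ("SAP", "PROPN")]:
    print(word, pos, "->", lookup_lemma(word, pos))
```

The two entries for "better" show why the part of speech matters: the same surface form can have different lemmas depending on its role in the sentence.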

In [74]:
# Process the text using spaCy
type(nlp(sample_text)[0]),type(nlp(sample_text))

(spacy.tokens.token.Token, spacy.tokens.doc.Doc)

In [73]:
# spacy - lemmatization

import spacy
nlp = spacy.load("en_core_web_sm")

x=[]
# for spacy need to do nlp(text) to create spacy doc / token objects
for i in nlp(sample_text):
    x.append(i.lemma_)
    print(i,i.lemma_)

x

Product Product
Allocation Allocation
( (
PAL PAL
) )
in in
advanced advanced
Available Available
- -
to to
- -
Promise promise
( (
aATP aATP
) )
is be
a a
mechanism mechanism
in in
SAP SAP
S/4HANA S/4HANA
that that
helps help
avoid avoid
critical critical
situations situation
in in
demand demand
and and
procurement procurement
. .
It it
allows allow
the the
allocation allocation
of of
materials material
in in
short short
supply supply
to to
specific specific
regions region
and and
customers customer
for for
a a
specific specific
time time
period period
. .
This this
ensures ensure
that that
the the
entire entire
available available
quantity quantity
of of
a a
material material
is be
not not
allocated allocate
to to
a a
single single
customer customer
, ,
enabling enable
subsequent subsequent
order order
requirements requirement
from from
other other
customers customer
to to
be be
confirmed confirm
. .
PAL pal
helps help
in in
precise precise
planning planning
and and
control control
o

['Product',
 'Allocation',
 '(',
 'PAL',
 ')',
 'in',
 'advanced',
 'Available',
 '-',
 'to',
 '-',
 'promise',
 '(',
 'aATP',
 ')',
 'be',
 'a',
 'mechanism',
 'in',
 'SAP',
 'S/4HANA',
 'that',
 'help',
 'avoid',
 'critical',
 'situation',
 'in',
 'demand',
 'and',
 'procurement',
 '.',
 'it',
 'allow',
 'the',
 'allocation',
 'of',
 'material',
 'in',
 'short',
 'supply',
 'to',
 'specific',
 'region',
 'and',
 'customer',
 'for',
 'a',
 'specific',
 'time',
 'period',
 '.',
 'this',
 'ensure',
 'that',
 'the',
 'entire',
 'available',
 'quantity',
 'of',
 'a',
 'material',
 'be',
 'not',
 'allocate',
 'to',
 'a',
 'single',
 'customer',
 ',',
 'enable',
 'subsequent',
 'order',
 'requirement',
 'from',
 'other',
 'customer',
 'to',
 'be',
 'confirm',
 '.',
 'pal',
 'help',
 'in',
 'precise',
 'planning',
 'and',
 'control',
 'of',
 'material',
 'delivery',
 'to',
 'meet',
 'customer',
 'demand',
 '.']

In [76]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/I748920/nltk_data...


True

In [79]:
# nltk - lemmatization

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
# "better" is lemmatized to its base form "good" when treated as an adjective (pos="a")

rocks : rock
corpora : corpus
better : good


# stemming

Stemming is the process of reducing morphological variants of a word to a common root or stem. Stemming programs are commonly referred to as stemming algorithms or stemmers. For example, the Porter stemmer used below reduces both “allocation” and “allocated” to the stem “alloc”, which is not itself a dictionary word.

Stemming: faster, but may produce unrecognizable stems and lose meaning. Reducing unrelated words to the same stem is known as “over-stemming”. Lemmatization: more accurate, preserves meaning and grammatical function, but slower. It is often preferred when the output must remain real words.
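
The trade-off can be seen with a toy suffix stripper. This is a deliberately crude sketch, not the Porter algorithm: the suffix list and the minimum stem length of three characters are assumptions chosen for illustration.

```python
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["allocation", "allocated", "customers", "planning", "is"]:
    print(w, "->", toy_stem(w))
```

The rules are fast and need no dictionary, but "allocation" and "allocated" end up with different stems ("alloc" and "allocat"), and "planning" becomes the non-word "plann".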

lemmatization vs stemming

https://stackoverflow.com/questions/49354665/should-i-perform-both-lemmatization-and-stemming
https://www.datacamp.com/tutorial/stemming-lemmatization-python
* think for now stick to lemmatization

In [83]:
# nltk - stemming

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
for word in sample_text.split():
    print(word,ps.stem(word))

Product product
Allocation alloc
(PAL) (pal)
in in
advanced advanc
Available-to-Promise available-to-promis
(aATP) (aatp)
is is
a a
mechanism mechan
in in
SAP sap
S/4HANA s/4hana
that that
helps help
avoid avoid
critical critic
situations situat
in in
demand demand
and and
procurement. procurement.
It it
allows allow
the the
allocation alloc
of of
materials materi
in in
short short
supply suppli
to to
specific specif
regions region
and and
customers custom
for for
a a
specific specif
time time
period. period.
This thi
ensures ensur
that that
the the
entire entir
available avail
quantity quantiti
of of
a a
material materi
is is
not not
allocated alloc
to to
a a
single singl
customer, customer,
enabling enabl
subsequent subsequ
order order
requirements requir
from from
other other
customers custom
to to
be be
confirmed. confirmed.
PAL pal
helps help
in in
precise precis
planning plan
and and
control control
of of
material materi
delivery deliveri
to to
meet meet
customer custom
demands. 

In [84]:
# second nltk method

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
 
ps = PorterStemmer()
 
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
 
# apply the stemmer to each word and join the results back into a string
stemmed_sentence = " ".join(ps.stem(word) for word in words)
 
print(stemmed_sentence)

programm program with program languag


In [85]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
print(ps.stem("running"))  # Output: run
print(ps.stem("happiness"))  # Output: happi

run
happi


In [None]:
# spacy - stemming

# spaCy does not provide a stemmer; it relies on lemmatization only, so NLTK is used for stemming in this notebook.

# regex exploration

https://www.youtube.com/watch?v=K8L6KVGG-7o

In [13]:
import re

Normally, Python uses backslashes as escape characters. Prefixing a string literal with 'r' creates a raw string, in which a backslash is an actual backslash and not the start of an escape sequence.
- usually \n means newline and \t means tab, but in a raw string they are kept as written, so the regex engine receives the backslash itself

An f-string is a way to format strings in Python. It was introduced in Python 3.6 and makes it easier to embed variables and apply formatting such as comma separators, zero padding, and date formats.
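
A short sketch of both string prefixes, using only the standard library. The variable names are illustrative; the last pattern combines the two prefixes (rf"...") to build a regex from a variable, which is one common use and not the only one.

```python
import re

# "\n" is one newline character; r"\n" is a backslash followed by "n".
normal_len = len("\n")
raw_len = len(r"\n")
print(normal_len, raw_len)

# f-string formatting: comma separator and zero padding.
count = 1234567
formatted = f"{count:,} tokens, id {7:03d}"
print(formatted)

# Raw f-string: \b stays a regex word boundary, {..} is interpolated.
word = "Ha"
pattern = re.compile(rf"\b{re.escape(word)}")
boundary_hits = pattern.findall("Ha HaHa")
print(boundary_hits)
```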

In [14]:
text1 = "Contact us at support@example.com or sales@example.com"
text = """
RegExr was created by gskinner.com.

Edit the Expression & Text to see matches. Roll over matches or the expression for details. PCRE & JavaScript flavors of RegEx are supported. Validate your expression with Tests mode.

The side bar includes a Cheatsheet, full Reference, and Help. You can also Save & Share with the Community and view patterns you create or favorite in My Patterns.

Explore results with the Tools below. Replace & List output custom results. Details lists capture groups. Explain describes your expression in plain English.
"""
emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text1)
print(emails)  # Output: ['support@example.com', 'sales@example.com']

# the slashes are JavaScript-style delimiters; in Python they are literal characters,
# so this looks for the text "/the/" and finds nothing
regex_test = re.findall(r"/the/",text)
print(regex_test)

# the alternative is to search for "@" and ".com" by hand, but that could also pick up
# things like home addresses or websites, so regex is a more powerful way to do this

['support@example.com', 'sales@example.com']
[]


In [35]:
text2 = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
mr rilwan
mR. nabil
MS. amanda
Ms. TAn
'''


In [19]:
pattern = re.compile(r'abc')
matches = pattern.finditer(text2)
for m in matches:
    print(m) # span tells you which index of the string gives that match, this is why finditer is useful

<re.Match object; span=(1, 4), match='abc'>


!! use regex to find patterns; for an exact match, regular Python string methods are enough

pattern.findall just returns all the actual string matches in a list
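
The difference between the two methods, and the plain-Python alternative for exact matches, can be sketched as follows. The sample string is a shortened piece of the test text above.

```python
import re

pattern = re.compile(r"Ha")
sample = "Ha HaHa"

# findall: just the matched strings.
strings = pattern.findall(sample)

# finditer: Match objects, which also carry the position of each match.
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(sample)]

# Exact substring: no regex needed.
literal_hits = sample.count("Ha")

print(strings)
print(spans)
print(literal_hits)
```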

In [20]:
pattern = re.compile(r'cba')
matches = pattern.finditer(text2)
for m in matches:
    print(m) 

In [21]:
pattern = re.compile(r'\bHa')
matches = pattern.finditer(text2)
for m in matches:
    print(m)

<re.Match object; span=(67, 69), match='Ha'>
<re.Match object; span=(70, 72), match='Ha'>


In [22]:
pattern = re.compile(r'\BHa')
matches = pattern.finditer(text2)
for m in matches:
    print(m)

<re.Match object; span=(72, 74), match='Ha'>


In [23]:
# note: the pattern ends with three digits, so the fourth digit of each number is not matched
pattern = re.compile(r'\d\d\d\D\d\d\d\D\d\d\d')
matches = pattern.findall(text2)
for m in matches:
    print(m)

321-555-432
123.555.123
123*555*123
800-555-123
900-555-123


or use . instead: in regex, . matches any character except a newline

In [24]:
pattern = re.compile(r'\d\d\d.\d\d\d.\d\d\d')
matches = pattern.findall(text2)
for m in matches:
    print(m)

321-555-432
123.555.123
123*555*123
800-555-123
900-555-123


to match only the numbers that use - as the separator, put it in a character class

In [25]:
pattern = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

pattern = re.compile(r'\d\d\d[*]\d\d\d[*]\d\d\d')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

pattern = re.compile(r'\d\d\d[-*.]\d\d\d[-*.]\d\d\d')
# matches any char in []
matches = pattern.findall(text2)
for m in matches:
    print(m)

321-555-432
800-555-123
900-555-123

123*555*123

321-555-432
123.555.123
123*555*123
800-555-123
900-555-123


note that in pattern = re.compile(r'\d\d\d[-*.]\d\d\d[-*.]\d\d\d')
the [-*.] matches exactly one character; even a larger class such as
[A-Za-z0-9.] still matches only one character
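
The one-character rule can be checked with a short sketch. The sample string is a shortened copy of the phone numbers above, and the {3} and {4} quantifiers are used so that the full four-digit ending is matched; the variable names are illustrative.

```python
import re

numbers = "321-555-4321 123.555.1234 123*555*1234 800-555-1234"

# {3} and {4} repeat \d; each [-.] still consumes exactly one character.
pattern = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
full_numbers = pattern.findall(numbers)
print(full_numbers)

# One class, one character: a doubled separator is not accepted.
doubled = pattern.findall("321--555-4321")
print(doubled)
```

To allow one or more separator characters, a quantifier would have to be added to the class itself, for example [-.]+.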

In [27]:
pattern = re.compile(r'[89]00[-]\d\d\d[-]\d\d\d')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

800-555-123
900-555-123



In [28]:
pattern = re.compile(r'[a-z]')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
u
r
t
u
v
w
x
y
z
a
a
a
e
t
a
h
a
r
a
c
t
e
r
s
e
e
d
t
o
b
e
e
s
c
a
p
e
d
c
o
r
e
y
m
s
c
o
m
r
c
h
a
f
e
r
r
m
i
t
h
s
a
v
i
s
r
s
o
b
i
n
s
o
n
r



In [29]:
pattern = re.compile(r'[A-Z]')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
H
H
H
M
C
N
M
S
M
S
M
D
M
R
M
T



In [30]:
pattern = re.compile(r'[1-5]')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

1
2
3
4
5
3
2
1
5
5
5
4
3
2
1
1
2
3
5
5
5
1
2
3
4
1
2
3
5
5
5
1
2
3
4
5
5
5
1
2
3
4
5
5
5
1
2
3
4



In [31]:
pattern = re.compile(r'[a-zA-Z]')
matches = pattern.findall(text2)
for m in matches:
    print(m)
print()

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
u
r
t
u
v
w
x
y
z
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
H
a
H
a
H
a
M
e
t
a
C
h
a
r
a
c
t
e
r
s
N
e
e
d
t
o
b
e
e
s
c
a
p
e
d
c
o
r
e
y
m
s
c
o
m
M
r
S
c
h
a
f
e
r
M
r
S
m
i
t
h
M
s
D
a
v
i
s
M
r
s
R
o
b
i
n
s
o
n
M
r
T



In [32]:
text3 = "cat mat pat bat dog"
# to get all words ending in "at" but not starting with b, use a negated class [^ ]
pattern = re.compile(r'[^b]at')
matches = re.findall(pattern,text3)
matches

['cat', 'mat', 'pat']

In [34]:
# another way to write re.compile(r'\d\d\d.\d\d\d.\d\d\d'), using the {n} quantifier

pattern = re.compile(r'\d{3}.\d{3}.\d{3}')
matches = re.findall(pattern,text2)
matches

['321-555-432', '123.555.123', '123*555*123', '800-555-123', '900-555-123']

In [42]:
print(text2)


abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

Ha HaHa

MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
mr rilwan
mR. nabil
MS. amanda
Ms. TAn



In [48]:
# try and get all the names + designation even tho some have . after Mr some dont etc
# example
# Mr. Schafer
# Mr Smith
# Ms Davis
# Mrs. Robinson
# Mr. T

pattern = re.compile(r'[mM][rRsS][sS]?[\.]?\s\w*')
matches = re.findall(pattern,text2)
matches

['Mr. Schafer',
 'Mr Smith',
 'Ms Davis',
 'Mrs. Robinson',
 'Mr. T',
 'mr rilwan',
 'mR. nabil',
 'MS. amanda',
 'Ms. TAn']

In [54]:
# try and get all the names + designation even tho some have . after Mr some dont etc
# example
# Mr. Schafer
# Mr Smith
# Ms Davis
# Mrs. Robinson
# Mr. T

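# inside [...] the characters ( | ) are literals, so [(r|s|rs)] matches a single
# character from the set ( r | s ) and is not an alternation
pattern = re.compile(r'[mM][(r|s|rs)]')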
matches = re.findall(pattern,text2)
matches

['ms', 'Mr', 'Mr', 'Ms', 'Mr', 'Mr', 'mr', 'Ms']

In [59]:
# try and get all the names + designation even tho some have . after Mr some dont etc
# example
# Mr. Schafer
# Mr Smith
# Ms Davis
# Mrs. Robinson
# Mr. T

pattern = re.compile(r'M(r|s|rs)\.?')
matches = re.findall(pattern,text2)
matches

['r', 'r', 's', 'r', 'r', 's']

This returns ['r', 'r', 's', 'r', 'r', 's'] because of the capturing group (r|s|rs). When a pattern contains one capturing group, re.findall returns only the text captured by that group (here "r" or "s"), not the whole match.
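
One way to keep the alternation and still get whole matches from re.findall is a non-capturing group, written (?:...). The sketch below uses a shortened copy of the names above; the [A-Z]\w* part for the surname is an assumption added for illustration.

```python
import re

names = "Mr. Schafer\nMr Smith\nMs Davis\nMrs. Robinson\nMr. T"

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"M(r|s|rs)\.?\s[A-Z]\w*", names)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"M(?:rs|r|s)\.?\s[A-Z]\w*", names)

print(capturing)
print(non_capturing)
```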



In [67]:
# try and get all the names + designation even tho some have . after Mr some dont etc
# example
# Mr. Schafer
# Mr Smith
# Ms Davis
# Mrs. Robinson
# Mr. T

# the [...] still matches only one character, so 'Mrs. Robinson' is missed
pattern = re.compile(r'[mM][(rR|sS|rs|rS|Rs)]\.?\s\w*')
matches = re.findall(pattern,text2)
matches

['Mr. Schafer',
 'Mr Smith',
 'Ms Davis',
 'Mr. T',
 'mr rilwan',
 'mR. nabil',
 'MS. amanda',
 'Ms. TAn']

In [71]:
# try catch the emails

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

pattern = re.compile(r'\w*@\w*\.[com|edu|net]')
matches = re.findall(pattern,emails)
matches

['CoreyMSchafer@gmail.c', 'schafer@university.e']

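[com|edu|net] is a character class: it matches a single character out of c, o, m, e, d, u, n, t, and |, which is why only the first letter after the dot is returned. The third address is not matched because \w does not match the hyphen in 'my-work'.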
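
The difference between a character class and an alternation can be shown side by side. This sketch uses the three addresses above on one line; the local-part and domain classes ([\w.+-] and [\w-]) are assumptions that happen to fit these samples and are not a general email validator.

```python
import re

emails = (
    "CoreyMSchafer@gmail.com "
    "corey.schafer@university.edu "
    "corey-321-schafer@my-work.net"
)

# Character class: exactly one character after the dot.
char_class = re.findall(r"[\w.+-]+@[\w-]+\.[com|edu|net]", emails)

# Alternation in a non-capturing group: one of the three whole suffixes.
alternation = re.findall(r"[\w.+-]+@[\w-]+\.(?:com|edu|net)", emails)

print(char_class)
print(alternation)
```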

In [92]:
# try catch the emails

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
'''

# pattern = re.compile(r'\w*@\w*\.(com|edu|net)')
# pattern = re.compile(r'\w+@\w+\.(com|edu|net)')
# pattern = re.compile(r'\w*@\w*\.(com)')
pattern = re.compile(r'[a-zA-Z.]+@[a-zA-Z]+\.(com|edu)')
 

matches = re.findall(pattern,emails)
matches

['com', 'edu']

When the pattern contains a capturing group, findall returns only what the group captured. To get the whole match, use finditer or make the group non-capturing with (?:...).

In [114]:
# try catch the emails

emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
corey-321+schafer@my-work.net
'''

# pattern = re.compile(r'\w*@\w*\.(com|edu|net)')
# pattern = re.compile(r'\w+@\w+\.(com|edu|net)')
# pattern = re.compile(r'\w*@\w*\.(com)')
pattern = re.compile(r'[a-zA-Z0-9.+-]+@[a-zA-Z-]+\.(com|edu|net)')

matches = [match for match in re.finditer(pattern,emails)]
matches

[<re.Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>,
 <re.Match object; span=(25, 53), match='corey.schafer@university.edu'>,
 <re.Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>,
 <re.Match object; span=(84, 113), match='corey-321+schafer@my-work.net'>]

In [94]:
# example that works with findall because (?:...) is a non-capturing group
 
text = 'bcacaca dcaca dbcaca'
pattern = 'b?(?:.a)*'

## Evaluate regex
result = re.findall(pattern, text)
result

['bcacaca', '', '', 'caca', '', '', 'bcaca', '']

In [126]:
import re

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://[a-zA-Z.]+')
matches = [match for match in re.finditer(pattern,urls)]
matches

[<re.Match object; span=(1, 23), match='https://www.google.com'>,
 <re.Match object; span=(24, 42), match='http://coreyms.com'>,
 <re.Match object; span=(43, 62), match='https://youtube.com'>,
 <re.Match object; span=(63, 83), match='https://www.nasa.gov'>]

or, with an optional www. group and \w+ for the name parts:

In [130]:
import re

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?\w+\.\w+')
matches = [match for match in re.finditer(pattern,urls)]
matches

[<re.Match object; span=(1, 23), match='https://www.google.com'>,
 <re.Match object; span=(24, 42), match='http://coreyms.com'>,
 <re.Match object; span=(43, 62), match='https://youtube.com'>,
 <re.Match object; span=(63, 83), match='https://www.nasa.gov'>]

the same pattern with the domain name and top-level domain in their own groups

In [131]:
import re

urls = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = [match for match in re.finditer(pattern,urls)]
matches

[<re.Match object; span=(1, 23), match='https://www.google.com'>,
 <re.Match object; span=(24, 42), match='http://coreyms.com'>,
 <re.Match object; span=(43, 62), match='https://youtube.com'>,
 <re.Match object; span=(63, 83), match='https://www.nasa.gov'>]

by putting each part in a group, the parts of each URL can be retrieved separately

In [133]:
for match in matches:
    print(match.group(0)) #returns entire match
    print(match.group(1)) #returns the matches for the first group meaning first ()
    print(match.group(2))
    print(match.group(3))
    print()
    print()

https://www.google.com
www.
google
.com


http://coreyms.com
None
coreyms
.com


https://youtube.com
None
youtube
.com


https://www.nasa.gov
www.
nasa
.gov




the groups can be referenced directly in the sub method as backreferences such as \2 and \3

In [136]:
# example here to sub the url with only the group 2 and 3 to shorten it

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r'\2\3',urls)
print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov



In [138]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_urls = pattern.sub(r'\2....\3',urls) # the dots are literal here: the replacement string is not compiled as a regex pattern
print(subbed_urls)


google.....com
coreyms.....com
youtube.....com
nasa.....gov
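
Numbered backreferences become hard to read as patterns grow. A hedged variation on the cells above uses named groups and, alternatively, a replacement function. The group names domain and tld are made up for this sketch.

```python
import re

urls = "https://www.google.com http://coreyms.com"
pattern = re.compile(r"https?://(?:www\.)?(?P<domain>\w+)(?P<tld>\.\w+)")

# \g<name> refers to a named group inside the replacement string.
by_name = pattern.sub(r"\g<domain>\g<tld>", urls)

# A function receives each Match object and returns the replacement text.
by_function = pattern.sub(lambda m: m.group("domain").upper(), urls)

print(by_name)
print(by_function)
```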



re.findall finds all matches and returns them as a list of strings
- if the pattern has one capturing group, it returns only the text captured by that group; with multiple groups it returns a list of tuples

re.match only matches at the beginning of the string; re.search returns the first match anywhere in the string
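
These differences can be summarized in one small sketch. The sentence and the email-like strings are made up for illustration.

```python
import re

sentence = "Start a sentence and then bring it to an end."

at_start = re.match(r"sentence", sentence)        # None: not at position 0
anywhere = re.search(r"sentence", sentence)       # first hit anywhere
whole = re.fullmatch(r"Start.*end\.", sentence)   # must cover the whole string

# Two capturing groups: findall returns a list of tuples.
pairs = re.findall(r"(\w+)@(\w+)\.com", "ann@example.com bob@test.com")

print(at_start)
print(anywhere.span())
print(whole is not None)
print(pairs)
```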

In [139]:
sentence = 'Start a sentence and the bring it to and end.'
pattern = re.compile('Start')
matches =  pattern.match(sentence)
print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [141]:
sentence = 'Start a sentence and the bring it to and end.'
pattern = re.compile('sentence')
matches =  pattern.search(sentence)
print(matches)

<re.Match object; span=(8, 16), match='sentence'>


flags, e.g. re.IGNORECASE for case-insensitive matching

In [142]:
sentence = 'Start a sentence and the bring it to and end.'
pattern = re.compile('start',re.IGNORECASE)
matches =  pattern.match(sentence)
print(matches)

<re.Match object; span=(0, 5), match='Start'>


### preprocessing tutorial
* follow this series: https://medium.com/@erhan_arslan/understanding-natural-language-processing-nlp-step-1-bd5030c5a1b2

# MT proj preprocessing code

In [87]:
sample_text

'Product Allocation (PAL) in advanced Available-to-Promise (aATP) is a mechanism in SAP S/4HANA that helps avoid critical situations in demand and procurement. It allows the allocation of materials in short supply to specific regions and customers for a specific time period. This ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed. PAL helps in precise planning and control of material delivery to meet customer demands.'

In [90]:
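import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_text(text):
    # lowercase, then remove digits and punctuation
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.strip()

    # tokenize and drop stop words
    tokens = word_tokenize(text, preserve_line=True)
    stop_words = set(stopwords.words('english'))
    tokens = [i for i in tokens if i not in stop_words]

    # lemmatize (WordNetLemmatizer treats every token as a noun by default)
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(i) for i in tokens]

    return tokens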

In [91]:
preprocess_text(sample_text)

'product allocation (pal) in advanced available-to-promise (aatp) is a mechanism in sap s/hana that helps avoid critical situations in demand and procurement. it allows the allocation of materials in short supply to specific regions and customers for a specific time period. this ensures that the entire available quantity of a material is not allocated to a single customer, enabling subsequent order requirements from other customers to be confirmed. pal helps in precise planning and control of material delivery to meet customer demands.'