# REGEX

## The \w character class

### \w matches a single word character: a letter, a digit, or an underscore.
### \w+ matches one or more word characters (typically a whole word).
### \w* matches zero or more word characters, so it can also match the empty string.
### \w{n} matches exactly n word characters.
### re.match only tries the pattern at the start of the string, while re.search scans the whole string for the first match.

In [15]:
import re

# Raw strings (r'...') keep Python from interpreting the backslash itself
word_regex = r'\w'
re.match(word_regex,'hi there!')

<re.Match object; span=(0, 1), match='h'>

In [16]:
word_regex = r'\w+'
re.match(word_regex,'hi there!')

<re.Match object; span=(0, 2), match='hi'>

In [17]:
word_regex = r'\w*'
re.match(word_regex,'hi there!')

<re.Match object; span=(0, 2), match='hi'>

In [18]:
word_regex = r'\w{2}'
re.match(word_regex,'hi there!')

<re.Match object; span=(0, 2), match='hi'>

In [39]:
word_regex = r'\w{5}'
# re.match returns None here (no output): the string does not start with five word characters
re.match(word_regex, 'hi there!')

In [41]:
word_regex = r'\w{5}'
re.search(word_regex, 'hi there!')

<re.Match object; span=(3, 8), match='there'>
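
The difference between the two cells above comes from the function, not the pattern. A minimal sketch comparing `re.match`, `re.search`, and `re.fullmatch` on the same kind of input (the `snake_case_42` sample is made up for illustration):

```python
import re

text = 'hi there!'
pattern = r'\w{5}'

# re.match only tries the pattern at position 0, so it fails here
at_start = re.match(pattern, text)

# re.search scans forward until the pattern fits somewhere
anywhere = re.search(pattern, text)

# re.fullmatch requires the whole string to fit the pattern;
# underscores and digits count as word characters too
whole = re.fullmatch(r'\w+', 'snake_case_42')

print(at_start)          # None
print(anywhere.group())  # there
print(whole.group())     # snake_case_42
```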

## The \d character class


### \d matches a single digit.
### \d+ matches one or more digits.
### \d* matches zero or more digits (including the empty string where no digits are present).
### \d{n} matches exactly n digits.

In [43]:
digit_regex = r'\d'
re.search(digit_regex,'I ordered 99 pizzas')

<re.Match object; span=(10, 11), match='9'>

In [10]:
digit_regex = r'\d+'
re.findall(digit_regex,'I ordered 99 pizzas')

['99']

In [11]:
digit_regex = r'\d*'
re.findall(digit_regex,'I ordered 99 pizzas')

['', '', '', '', '', '', '', '', '', '', '99', '', '', '', '', '', '', '', '']

In [46]:
digit_regex = r'\d{2}'
re.search(digit_regex,'I ordered 99 pizzas')

<re.Match object; span=(10, 12), match='99'>

In [45]:
digit_regex = r'\d{4}'
re.findall(digit_regex,'I ordered 99 pizzas')

[]
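
Because `\d*` is satisfied by zero digits, `re.findall` reports an empty match at almost every position. A small sketch of two ways to get at the digits only: filtering the empty strings, or switching to `\d+` with `re.finditer`, which also exposes where each match sits.

```python
import re

text = 'I ordered 99 pizzas'

# \d* also succeeds with zero digits, so most hits are empty strings
raw_hits = re.findall(r'\d*', text)
non_empty = [hit for hit in raw_hits if hit]

# finditer yields match objects, giving the position as well as the text
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(non_empty)  # ['99']
print(positions)  # [('99', (10, 12))]
```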

## Mixing Regex

In [49]:
my_string="Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [50]:
# Split my_string on sentence endings and print the result
sentence_endings = r"[.?!]"
print(re.split(sentence_endings, my_string))


["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']


In [51]:
# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))


['Let', 'RegEx', 'Won', 'Can', 'Or']


In [52]:
# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))



["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']


In [53]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))


['4', '19']
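
The splits above leave leading spaces and a trailing empty string, and `[A-Z]\w+` cuts contractions at the apostrophe. The following sketch shows one possible cleanup on a shortened copy of `my_string`; the two patterns are illustrative choices, not the only way to do it.

```python
import re

my_string = "Let's write RegEx!  Won't that be fun?  I sure think so."

# Split on sentence-ending punctuation plus any whitespace after it,
# then drop the empty piece left after the final full stop
sentences = [s for s in re.split(r"[.?!]\s*", my_string) if s]

# Allow one apostrophe inside a word so contractions stay in one piece
words = re.findall(r"\w+(?:'\w+)?", my_string)

print(sentences)  # ["Let's write RegEx", "Won't that be fun", 'I sure think so']
print(words[:4])  # ["Let's", 'write', 'RegEx', "Won't"]
```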


# Tokenization
### For tokenization we will use NLTK


## Basic tokenization

In [54]:
scene_one="SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  
That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"

In [55]:
scene_one

"SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a t

In [63]:
# Import necessary modules
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt tokenizer data used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/mohamed/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [64]:
# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)
print(sentences)

['SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!', '[clop clop clop] \nSOLDIER #1: Halt!', 'Who goes there?', 'ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.', 'King of the Britons, defeator of the Saxons, sovereign of all England!', 'SOLDIER #1: Pull the other one!', 'ARTHUR: I am, ...  and this is my trusty servant Patsy.', 'We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.', 'I must speak with your lord and master.', 'SOLDIER #1: What?', 'Ridden on a horse?', 'ARTHUR: Yes!', "SOLDIER #1: You're using coconuts!", 'ARTHUR: What?', "SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.", 'ARTHUR: So?', "We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?", 'ARTHUR: We found them.', 'SOLDIER #1: Found them?', 'In Mercea?', "The coconut's tropical!", 'ARTHUR: What 

In [65]:
# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])
print(tokenized_sent)

['ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.']


In [67]:
# Tokenize the entire scene into words: scene_tokens (duplicates are kept; wrap the list in set() to get unique tokens)
scene_tokens = word_tokenize(scene_one)
print(scene_tokens)

['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop', ']', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!', '[', 'clop', 'clop', 'clop', ']', 'SOLDIER', '#', '1', ':', 'Halt', '!', 'Who', 'goes', 'there', '?', 'ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.', 'King', 'of', 'the', 'Britons', ',', 'defeator', 'of', 'the', 'Saxons', ',', 'sovereign', 'of', 'all', 'England', '!', 'SOLDIER', '#', '1', ':', 'Pull', 'the', 'other', 'one', '!', 'ARTHUR', ':', 'I', 'am', ',', '...', 'and', 'this', 'is', 'my', 'trusty', 'servant', 'Patsy', '.', 'We', 'have', 'ridden', 'the', 'length', 'and', 'breadth', 'of', 'the', 'land', 'in', 'search', 'of', 'knights', 'who', 'will', 'join', 'me', 'in', 'my', 'court', 'at', 'Camelot', '.', 'I', 'must', 'speak', 'with', 'your', 'lord', 'and', 'master', '.', 'SOLDIER', '#', '1', ':', 'What', '?', 'Ridden', 'on', 'a', 'horse', '?', 'ARTHUR', ':', 'Yes', '!', 'SOLDIE
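
To get distinct tokens rather than every occurrence, a token list can be wrapped in `set(...)`, and `collections.Counter` gives frequencies. A minimal sketch, reusing the tokens of the fourth sentence copied from the `word_tokenize` output earlier so that it runs without NLTK:

```python
from collections import Counter

# Tokens of the fourth sentence, copied from the word_tokenize output
tokenized_sent = ['ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of',
                  'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.']

# A set keeps one copy of each token (order is not preserved)
unique_tokens = set(tokenized_sent)

# A Counter maps each token to the number of times it occurs
token_counts = Counter(tokenized_sent)

print(len(tokenized_sent), len(unique_tokens))  # 19 16
print(token_counts.most_common(2))              # [(',', 3), ('of', 2)]
```

Note that the set is case-sensitive, so 'ARTHUR' and 'Arthur' count as two different tokens.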

## Mixing Regex with Tokenization

In [71]:
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer

In [72]:
tweets=['This is the best #nlp exercise ive found online! #python',
 '#NLP is super fun! <3 #learning',
 'Thanks @MohamedHassanien :) #nlp #python']

In [73]:
# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"

In [74]:
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

['#nlp', '#python']


In [75]:
# Define a regex pattern to find mentions: pattern2
pattern2 = r"@\w+"

In [77]:
# Use the pattern on the last tweet in the tweets list
mentions = regexp_tokenize(tweets[2], pattern2)
print(mentions)

['@MohamedHassanien']


In [78]:
# Write a pattern that matches both mentions (@) and hashtags (#)
# Neither character needs escaping inside a character class
pattern3 = r"[@#]\w+"

In [79]:
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[2], pattern3)
print(mentions_hashtags)

['@MohamedHassanien', '#nlp', '#python']


In [80]:
# Use the TweetTokenizer to tokenize every tweet, giving one token list per tweet
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@MohamedHassanien', ':)', '#nlp', '#python']]
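
For comparison, the hashtag and mention extraction can also be sketched with the standard `re` module alone. The names `tag_pattern`, `tags_per_tweet`, and `unique_hashtags` are introduced here for illustration only.

```python
import re

tweets = ['This is the best #nlp exercise ive found online! #python',
          '#NLP is super fun! <3 #learning',
          'Thanks @MohamedHassanien :) #nlp #python']

# One compiled pattern for both mentions (@) and hashtags (#)
tag_pattern = re.compile(r"[@#]\w+")

# One list of tags per tweet
tags_per_tweet = [tag_pattern.findall(tweet) for tweet in tweets]

# Lowercase the hashtags so '#NLP' and '#nlp' collapse into one entry
unique_hashtags = sorted({tag.lower()
                          for tags in tags_per_tweet
                          for tag in tags
                          if tag.startswith('#')})

print(tags_per_tweet)
print(unique_hashtags)  # ['#learning', '#nlp', '#python']
```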


## Non-ASCII Tokenization

In [81]:
german_text='Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

In [82]:
# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']


In [83]:
# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))


['Wann', 'Pizza', 'Und', 'Über']
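
Listing accented capitals by hand (here only `Ü`) does not scale to other letters such as `Ä` or `Ö`. One alternative, sketched below with the standard `re` module, is to collect all words and keep those whose first character is uppercase according to `str.isupper`. Unlike `[A-ZÜ]\w+`, this version would also keep single-letter capitalized words.

```python
import re

german_text = 'Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕'

# \w is Unicode-aware for str patterns, so 'fährst' and 'Über' stay whole;
# the emoji are not word characters and are skipped
words = re.findall(r"\w+", german_text)

# str.isupper handles accented capitals without listing them explicitly
capitalized = [word for word in words if word[0].isupper()]

print(capitalized)  # ['Wann', 'Pizza', 'Und', 'Über']
```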


In [84]:
# Tokenize and print only emoji
# Inside [...] quotes and | are literal characters, so the ranges are listed back to back
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

['🍕', '🚕']
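
Inside a character class, quotes and `|` are treated as ordinary characters rather than as string delimiters or alternation. The sketch below contrasts a class that contains them with one that lists only the ranges; the `sample` string and the shortened range lists are made up for the demonstration.

```python
import re

# Quotes and | are ordinary characters inside [...], so this class matches them too
loose = "['\U0001F300-\U0001F5FF'|'\U0001F680-\U0001F6FF']"

# Listing the ranges back to back limits the class to the intended code points
strict = "[\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF]"

sample = "It's pizza | taxi time 🍕🚕"

print(re.findall(loose, sample))   # ["'", '|', '🍕', '🚕']
print(re.findall(strict, sample))  # ['🍕', '🚕']
```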


# Stemming and Lemmatization

## Normal Stemming and Lemmatization

In [93]:
# Import the required packages from nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/mohamed/nltk_data...


True

In [88]:
#Define our stemmer and our lemmatizer
porter = PorterStemmer()
WNlemmatizer = WordNetLemmatizer()

In [89]:
# Tokenize the scene_one string
tokens = word_tokenize(scene_one)

In [90]:
tokens

['SCENE',
 '1',
 ':',
 '[',
 'wind',
 ']',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'KING',
 'ARTHUR',
 ':',
 'Whoa',
 'there',
 '!',
 '[',
 'clop',
 'clop',
 'clop',
 ']',
 'SOLDIER',
 '#',
 '1',
 ':',
 'Halt',
 '!',
 'Who',
 'goes',
 'there',
 '?',
 'ARTHUR',
 ':',
 'It',
 'is',
 'I',
 ',',
 'Arthur',
 ',',
 'son',
 'of',
 'Uther',
 'Pendragon',
 ',',
 'from',
 'the',
 'castle',
 'of',
 'Camelot',
 '.',
 'King',
 'of',
 'the',
 'Britons',
 ',',
 'defeator',
 'of',
 'the',
 'Saxons',
 ',',
 'sovereign',
 'of',
 'all',
 'England',
 '!',
 'SOLDIER',
 '#',
 '1',
 ':',
 'Pull',
 'the',
 'other',
 'one',
 '!',
 'ARTHUR',
 ':',
 'I',
 'am',
 ',',
 '...',
 'and',
 'this',
 'is',
 'my',
 'trusty',
 'servant',
 'Patsy',
 '.',
 'We',
 'have',
 'ridden',
 'the',
 'length',
 'and',
 'breadth',
 'of',
 'the',
 'land',
 'in',
 'search',
 'of',
 'knights',
 'who',
 'will',
 'join',
 'me',
 'in',
 'my',
 'court',
 'at',
 'Camelot',
 '.',
 'I',
 'must',
 'speak',
 'with',
 'your',
 'lord',
 'and',
 'ma

In [94]:
import time

# Log the start time
start_time = time.time()

# Build a stemmed list
stemmed_tokens = [porter.stem(token) for token in tokens] 

# Log the end time
end_time = time.time()

print('Time taken for stemming in seconds: ', end_time - start_time)
print('Stemmed tokens: ', stemmed_tokens) 

Time taken for stemming in seconds:  0.0026874542236328125
Stemmed tokens:  ['scene', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop', ']', 'king', 'arthur', ':', 'whoa', 'there', '!', '[', 'clop', 'clop', 'clop', ']', 'soldier', '#', '1', ':', 'halt', '!', 'who', 'goe', 'there', '?', 'arthur', ':', 'it', 'is', 'i', ',', 'arthur', ',', 'son', 'of', 'uther', 'pendragon', ',', 'from', 'the', 'castl', 'of', 'camelot', '.', 'king', 'of', 'the', 'briton', ',', 'defeat', 'of', 'the', 'saxon', ',', 'sovereign', 'of', 'all', 'england', '!', 'soldier', '#', '1', ':', 'pull', 'the', 'other', 'one', '!', 'arthur', ':', 'i', 'am', ',', '...', 'and', 'thi', 'is', 'my', 'trusti', 'servant', 'patsi', '.', 'we', 'have', 'ridden', 'the', 'length', 'and', 'breadth', 'of', 'the', 'land', 'in', 'search', 'of', 'knight', 'who', 'will', 'join', 'me', 'in', 'my', 'court', 'at', 'camelot', '.', 'i', 'must', 'speak', 'with', 'your', 'lord', 'and', 'master', '.', 'soldier', '#', '1', ':', 'what', '?', '

In [95]:
import time

# Log the start time
start_time = time.time()

# Build a lemmatized list
lem_tokens = [WNlemmatizer.lemmatize(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for lemmatizing in seconds: ', end_time - start_time)
print('Lemmatized tokens: ', lem_tokens) 

Time taken for lemmatizing in seconds:  2.4541826248168945
Lemmatized tokens:  ['SCENE', '1', ':', '[', 'wind', ']', '[', 'clop', 'clop', 'clop', ']', 'KING', 'ARTHUR', ':', 'Whoa', 'there', '!', '[', 'clop', 'clop', 'clop', ']', 'SOLDIER', '#', '1', ':', 'Halt', '!', 'Who', 'go', 'there', '?', 'ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son', 'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle', 'of', 'Camelot', '.', 'King', 'of', 'the', 'Britons', ',', 'defeator', 'of', 'the', 'Saxons', ',', 'sovereign', 'of', 'all', 'England', '!', 'SOLDIER', '#', '1', ':', 'Pull', 'the', 'other', 'one', '!', 'ARTHUR', ':', 'I', 'am', ',', '...', 'and', 'this', 'is', 'my', 'trusty', 'servant', 'Patsy', '.', 'We', 'have', 'ridden', 'the', 'length', 'and', 'breadth', 'of', 'the', 'land', 'in', 'search', 'of', 'knight', 'who', 'will', 'join', 'me', 'in', 'my', 'court', 'at', 'Camelot', '.', 'I', 'must', 'speak', 'with', 'your', 'lord', 'and', 'master', '.', 'SOLDIER', '#', '1', ':', 'What'
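
The two timing cells repeat the same start/stop pattern, which could be factored into a small helper. This is a sketch only: `timed` and `toy_stem` are hypothetical names, and `toy_stem` is a deliberately crude stand-in (not the Porter algorithm) so the example runs without NLTK. `time.perf_counter` is used because it is intended for measuring short durations.

```python
import time

def timed(func, items):
    """Apply func to every item and return (results, elapsed_seconds)."""
    start = time.perf_counter()
    results = [func(item) for item in items]
    return results, time.perf_counter() - start

def toy_stem(token):
    # Hypothetical stand-in, NOT the Porter algorithm:
    # lowercase the token and strip one trailing "s" from longer words
    token = token.lower()
    return token[:-1] if token.endswith('s') and len(token) > 3 else token

tokens = ['Britons', 'Saxons', 'knights', 'castle', 'is']
stems, seconds = timed(toy_stem, tokens)

print(stems)  # ['briton', 'saxon', 'knight', 'castle', 'is']
print(f'Time taken in seconds: {seconds:.6f}')
```

With NLTK available, the same helper could be called as `timed(porter.stem, tokens)` and `timed(WNlemmatizer.lemmatize, tokens)`. Part of the lemmatizer's time in the run above is probably the one-off loading of WordNet on first use, so timing a second call would give a fairer comparison.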

## Stemming non-English data

In [105]:
import pandas as pd
# Specify the path to your CSV file
file_path = 'mixedlnaguage.csv'

# Load the dataset into a Pandas DataFrame
non_english_reviews = pd.read_csv(file_path)

# Display the DataFrame
print(non_english_reviews)

   review_id                                        review_text
0          1            Este producto es increíble, me encanta.
1          2                This product is amazing, I love it.
2          3             Ce produit est incroyable, je l'adore.
3          4         Me gusta mucho este artículo, es muy útil.
4          5  Dieses Produkt ist fantastisch, ich bin begeis...
5          6  Producto de buena calidad, cumple con las expe...
6          7     Il prodotto è di ottima qualità, lo consiglio.
7          8  Producto excelente, llegó a tiempo y en perfec...
8          9    J'adore ce produit, il fonctionne parfaitement.
9         10        Es un buen producto, pero podría ser mejor.


In [108]:
# Import the language detection package
import langdetect

In [109]:
# Loop over the rows of the dataset and append the detected languages of each review
languages = [] 
for i in range(len(non_english_reviews)):
    languages.append(langdetect.detect_langs(non_english_reviews.iloc[i, 1]))

In [110]:
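# Keep only the most likely language code: str(lang) looks like '[es:0.99...]',
# so take the text before the first ':' and drop the leading '['
languages = [str(lang).split(':')[0][1:] for lang in languages]
# Assign the list to a new column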
non_english_reviews['language'] = languages
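
The string cleanup in the cell above can be easier to follow on a concrete value. The sketch below applies the same expression to hypothetical examples of what `str(detect_langs(...))` looks like; the probabilities are invented and real output will differ.

```python
# Hypothetical string forms of detect_langs results (probabilities are made up)
raw_results = ['[es:0.9999961]',
               '[en:0.9999974]',
               '[fr:0.7142843, it:0.2857131]']

# Text before the first ':' is '[xx'; slicing off the '[' leaves the code
codes = [raw.split(':')[0][1:] for raw in raw_results]

print(codes)  # ['es', 'en', 'fr']
```

The string parsing can likely be avoided altogether: `langdetect.detect(text)` returns the top language code directly, and the objects returned by `detect_langs` expose `.lang` and `.prob` attributes.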

In [111]:
# Select the Spanish ones
filtered_reviews = non_english_reviews[non_english_reviews.language == 'es']
filtered_reviews

   review_id                                        review_text language
0          1            Este producto es increíble, me encanta.       es
3          4         Me gusta mucho este artículo, es muy útil.       es
5          6  Producto de buena calidad, cumple con las expe...       es
7          8  Producto excelente, llegó a tiempo y en perfec...       es
9         10        Es un buen producto, pero podría ser mejor.       es


In [116]:
# Import the required packages
from nltk.stem.snowball import SnowballStemmer

In [113]:
# Create a SnowballStemmer for Spanish
SpanishStemmer = SnowballStemmer("spanish")

In [114]:
# Create a list of tokens
tokens = [word_tokenize(review) for review in filtered_reviews.review_text] 

In [115]:
# Stem the list of tokens
stemmed_tokens = [[SpanishStemmer.stem(word) for word in token] for token in tokens]
print(stemmed_tokens[0])

['este', 'product', 'es', 'increibl', ',', 'me', 'encant', '.']


# Part-of-speech tagging

## POS tagging using spaCy

In [120]:
#import the needed library
import spacy

In [125]:
# Load the en_core_web_sm model
# If the model is missing, install it first with: python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

In [126]:
# Initialize the string
string = "Jane is an amazing guitartist"

In [127]:
#Create a doc object
doc=nlp(string)

In [133]:
# Generate a list of (token, POS tag) pairs
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'), ('amazing', 'ADJ'), ('guitartist', 'NOUN')]
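
Once the (token, tag) pairs exist, they can be summarized with plain Python. A minimal sketch, using the `pos` list copied from the output above so it runs without spaCy:

```python
from collections import Counter

# (token, POS tag) pairs copied from the spaCy output
pos = [('Jane', 'PROPN'), ('is', 'AUX'), ('an', 'DET'),
       ('amazing', 'ADJ'), ('guitartist', 'NOUN')]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in pos)

# Keep only the tokens tagged as common or proper nouns
nouns = [text for text, tag in pos if tag in {'NOUN', 'PROPN'}]

print(tag_counts['NOUN'], tag_counts['PROPN'])  # 1 1
print(nouns)                                    # ['Jane', 'guitartist']
```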
