##### In this exercise, there are two pieces of raw, unstructured text containing user reviews of a product, one in Persian and the other in English. The goal is to prepare these texts for a sentiment analysis task using the techniques learned in the first chapter and to observe the differences in processing between the two languages.

Persian text:

"کاربران می گویند: 'این هدست عالیه!!! ولی باتری آن خیلی زود تمام میشه. :( 
من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه.' 
این نظر در تاریخ 15/02/1404 ثبت شده. برای اطلاعات بیشتر به سایت ما مراجعه کنید: www.example.com"

English text:

"Users are saying: 'This headset is AMAZING!!! But the battery drains SO quickly. :( 
I'm honestly not satisfied with it... Although the sound quality is pretty good.'
This review was posted on 2024/05/15. Visit our website for more info: www.example.com"


#### Section 1 (Regular Expression)

##### 1-1

##### Regex for English text

In [26]:
r"[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]"

'[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]'

In [27]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

from parsivar import FindStems
# import stanza
# stanza.download('fa')

In [28]:
text = r"Users are saying: 'This headset is AMAZING!!! But the battery drains SO quickly. :( I'm honestly not satisfied with it... Although the sound quality is pretty good.'This review was posted on 2024/05/15. Visit our website for more info: www.example.com.کاربران می گویند: 'این هدست عالیه!!! ولی باتری آن خیلی زود تمام میشه. :( من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه.' این نظر در تاریخ 15/02/1404 ثبت شده. برای اطلاعات بیشتر به سایت ما مراجعه کنید: www.example.com"

pattern1 = r"[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]"
# pattern1 = r"\d{4}/\d{2}/\d{2}"

re.findall(pattern1, text)


['2024/05/15']
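
Only the English date is returned above because the pattern expects the year first, while the Persian review writes its date as day/month/year (15/02/1404). A minimal sketch of one pattern that accepts both layouts follows; the `sample` string and the name `date_pattern` are illustrative assumptions, not part of the original exercise.

```python
import re

# Hypothetical sample containing both date layouts used in the two reviews
sample = "posted on 2024/05/15 and registered on 15/02/1404"

# yyyy/mm/dd OR dd/mm/yyyy; the lookarounds keep a match from starting
# or ending in the middle of a longer run of digits
date_pattern = r"(?<!\d)(?:\d{4}/\d{2}/\d{2}|\d{2}/\d{2}/\d{4})(?!\d)"

dates = re.findall(date_pattern, sample)
print(dates)
```

Because `\d` matches any Unicode decimal digit in Python 3 string patterns, this sketch would also accept a date written with Persian digits such as ۱۵/۰۲/۱۴۰۴, which `[0-9]` would not.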

##### 1-2

In [29]:
pattern2 = r"[Ww]{3}\.[a-zA-Z0-9\-.]+\.[a-zA-Z]+"

re.findall(pattern2, text)

['www.example.com', 'www.example.com']

##### 1-3

In [30]:
pattern3 = r"[!]{3}|[.]{3}|[.]|[:][(]"
re.findall(pattern3, text)

['!!!',
 '.',
 ':(',
 '...',
 '.',
 '.',
 '.',
 '.',
 '.',
 '!!!',
 '.',
 ':(',
 '...',
 '.',
 '.',
 '.',
 '.']
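
The order of the alternatives in `pattern3` matters: Python's `re` tries them left to right, so the three-dot ellipsis has to appear before the single dot. The sketch below compares the original order with a reordered variant on a shortened, hypothetical sample string.

```python
import re

# Shortened, hypothetical sample with the same punctuation as the review
sample = "AMAZING!!! quickly. :( it... good."

ordered = r"[!]{3}|[.]{3}|[.]|[:][(]"    # ellipsis tried before the single dot
reordered = r"[!]{3}|[.]|[.]{3}|[:][(]"  # single dot tried first

matches_ordered = re.findall(ordered, sample)
matches_reordered = re.findall(reordered, sample)

print(matches_ordered)    # the ellipsis is kept as one token
print(matches_reordered)  # the ellipsis is broken into three single dots
```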

#### Section 2 & 4 (Tokenization & Sentence Segmentation & Stemming & Lemmatization)

In [31]:
comment_en = "This headset is AMAZING!!! But the battery drains SO quickly. :( I'm honestly not satisfied with it... Although the sound quality is pretty good."
comment_fa = "این هدست عالیه!!! ولی باتری آن خیلی زود تمام میشه. :( من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه."

sentences_en = sent_tokenize(comment_en)
print(sentences_en)

sentences_fa = sent_tokenize(comment_fa)
print(sentences_fa)

print("\n")

# comment_en = comment_en.lower()
tokens_en = word_tokenize(comment_en)
print(tokens_en)

tokens_fa = word_tokenize(comment_fa)
print(tokens_fa)

# sentences = sent_tokenize(text)
# tokens = []

# for sentence in sentences:
#     tokens.extend(word_tokenize(sentence))

# print("\n", tokens)
print("\n")
stemmer_en = PorterStemmer()
print("English stems: ")
english_stems = [stemmer_en.stem(w) for w in tokens_en]
print(english_stems)

print("\n")

stemmer_fa = FindStems()
print("Farsi stems: ")
farsi_stems = [stemmer_fa.convert_to_stem(w) for w in tokens_fa]
print(farsi_stems)

print("\n")

pos_tags = pos_tag(tokens_en)

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
        
lemmatizer = WordNetLemmatizer()
lemmas_en = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
# lemmas_en = [print(lemmatizer.lemmatize(w)) for w in tokens_en]
print(lemmas_en)

print("\n")

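# NOTE: WordNetLemmatizer only covers English, so the Persian tokens come back unchanged;
# a Persian-aware lemmatizer would be needed for a real comparison
lemmas_fa = [lemmatizer.lemmatize(w) for w in tokens_fa]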
print(lemmas_fa)


['This headset is AMAZING!!!', 'But the battery drains SO quickly.', ":( I'm honestly not satisfied with it...", 'Although the sound quality is pretty good.']
['این هدست عالیه!!!', 'ولی باتری آن خیلی زود تمام میشه.', ':( من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه.']


['This', 'headset', 'is', 'AMAZING', '!', '!', '!', 'But', 'the', 'battery', 'drains', 'SO', 'quickly', '.', ':', '(', 'I', "'m", 'honestly', 'not', 'satisfied', 'with', 'it', '...', 'Although', 'the', 'sound', 'quality', 'is', 'pretty', 'good', '.']
['این', 'هدست', 'عالیه', '!', '!', '!', 'ولی', 'باتری', 'آن', 'خیلی', 'زود', 'تمام', 'میشه', '.', ':', '(', 'من', 'که', 'واقعا', 'ازش', 'راضی', 'نیستم', '...', 'البته', 'کیفیت', 'صدا', 'خوبه', '.']


English stems: 
['thi', 'headset', 'is', 'amaz', '!', '!', '!', 'but', 'the', 'batteri', 'drain', 'so', 'quickli', '.', ':', '(', 'i', "'m", 'honestli', 'not', 'satisfi', 'with', 'it', '...', 'although', 'the', 'sound', 'qualiti', 'is', 'pretti', 'good', '.']


Farsi stem
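
In the sentence output above, NLTK's default `sent_tokenize` returns four English sentences but only three Persian ones: the last two Persian sentences stay joined after the ellipsis. As a rough sketch, assuming it is acceptable to split on whitespace that follows sentence-final punctuation, a small regex splitter separates them. The function name `split_sentences` is hypothetical, and this is not a replacement for a proper Persian sentence tokenizer.

```python
import re

comment_fa = "این هدست عالیه!!! ولی باتری آن خیلی زود تمام میشه. :( من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه."

def split_sentences(text):
    # Split on whitespace preceded by '.', '!', '?' or the Persian question mark
    return re.split(r"(?<=[.!?؟])\s+", text.strip())

sentences_fa_regex = split_sentences(comment_fa)
for sentence in sentences_fa_regex:
    print(sentence)
```

This heuristic would also split after abbreviations or decimal points followed by a space, so it only suits short, informal text like this review.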

#### Section 3 (Normalization, Case-folding)

#### Normalization in English and Persian

In English, normalization means:

1. Converting all letters to lowercase
2. Removing extra spaces
3. Expanding contractions (e.g., I'm → I am)

In Persian, normalization means:

1. Fixing half-spaces (zero-width non-joiners)
2. Removing unnecessary spaces

In [39]:
text_fa = "کاربران می گویند: 'این هدست عالیه!!! ولی باتری آن خیلی زود تمام می شه. :( من که واقعا ازش راضی نیستم... البته کیفیت صدا خوبه.' این نظر در تاریخ 15/02/1404 ثبت شده. برای اطلاعات بیشتر به سایت ما مراجعه کنید: www.example.com"
text_en = "Users are saying: 'This headset is AMAZING!!! But the battery drains SO quickly. :( I'm honestly not satisfied with it... Although the sound quality is pretty good.' This review was posted on 2024/05/15. Visit our website for more info: www.example.com"

# Normalization for farsi
def normalize_persian(text):
    
    # text = re.sub(r'\s+(‌‌ها|تر|ترین|‌ای)\b', '', text)
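    # remove the punctuation and emoticon characters ! . : ( '
    # (note: this also strips the dots from URLs such as www.example.com)
    text = re.sub(r"[!.:(']", "", text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # attach the verbal prefix "می" to the next word with a half-space (ZWNJ, U+200C)
    text = re.sub(r'\bمی\s+', 'می\u200c', text)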
    text = text.strip()
    return text

# Normalization for English
def normalize_english(text):
    # lowercase
    text = text.lower()
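    # expand a few common contractions (must run before apostrophes are removed)
    text = re.sub(r"\bi'm\b", "i am", text)
    text = re.sub(r"\bcan't\b", "cannot", text)
    text = re.sub(r"\bwon't\b", "will not", text)
    # remove the punctuation and emoticon characters ! . : ( '
    text = re.sub(r"[!.:(']", "", text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)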
    return text.strip()

norm_fa = normalize_persian(text_fa)
norm_en = normalize_english(text_en)


print("Farsi after normalization:\n", norm_fa)
print("\n")
print("English after normalization:\n", norm_en)


Farsi after normalization:
 کاربران می‌گویند این هدست عالیه ولی باتری آن خیلی زود تمام می‌شه من که واقعا ازش راضی نیستم البته کیفیت صدا خوبه این نظر در تاریخ 15/02/1404 ثبت شده برای اطلاعات بیشتر به سایت ما مراجعه کنید wwwexamplecom


English after normalization:
 users are saying this headset is amazing but the battery drains so quickly i am honestly not satisfied with it although the sound quality is pretty good this review was posted on 2024/05/15 visit our website for more info wwwexamplecom
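
Both normalized outputs turn `www.example.com` into `wwwexamplecom`, because the punctuation filter removes every dot. One possible workaround, sketched below, masks URLs before stripping punctuation and restores them afterwards. The helper name `strip_punct_keep_urls` and the `URLTOKEN` placeholder are assumptions made for this illustration; the URL pattern is the one from Section 1-2.

```python
import re

# Same URL pattern as pattern2 in Section 1-2
URL_RE = re.compile(r"[Ww]{3}\.[a-zA-Z0-9\-.]+\.[a-zA-Z]+")

def strip_punct_keep_urls(text):
    # Remember the URLs, then hide them behind a punctuation-free placeholder
    urls = URL_RE.findall(text)
    masked = URL_RE.sub("URLTOKEN", text)
    # Same character filter and whitespace cleanup as the normalizers above
    cleaned = re.sub(r"[!.:(']", "", masked)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Put each URL back in its original order
    for url in urls:
        cleaned = cleaned.replace("URLTOKEN", url, 1)
    return cleaned

result = strip_punct_keep_urls("Great!!! Visit: www.example.com. Thanks :(")
print(result)
```

The sketch assumes the input never contains the literal string `URLTOKEN`; a real pipeline would pick a placeholder that cannot occur in the data.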


#### Section 5 (Morphology)

##### 5-1

In [40]:
words = {
    "unsatisfied": ["un", "satisfy", "ed"],
    "happiness": ["happy", "ness"],
    "می‌گوید": ["می‌", "گو", "ید"],  # mi- (imperfective prefix), gu (present stem), -yad (3rd person singular ending)
    "رفته‌ام": ["رفت", "ه", "ام"]     # raft (past stem), -e (past participle suffix), -am (1st person singular)
}

# Display morphological breakdown
for word, parts in words.items():
    print(f" {word} → {' + '.join(parts)}")

 unsatisfied → un + satisfy + ed
 happiness → happy + ness
 می‌گوید → می‌ + گو + ید
 رفته‌ام → رفت + ه + ام
