# 3. Syntactic processing

Syntactic analysis, or parsing, is the process of analyzing strings of symbols in natural language according to the rules of a formal grammar. Its purpose is to check whether a text is well formed and to recover its grammatical structure, from which the exact (literal) meaning of the text can be derived. Example:
1. Delhi is the capital of India.
2. Is Delhi the of India capital.

The two sentences contain the same words, but only sentence 1 is syntactically correct and meaningful. The purpose of syntactic processing is to tell such sentences apart and to recover the structure of the well-formed one.

## 3.1. Part of speech (POS) tagging 

Part-of-speech (POS) tagging is the process of assigning each word in a text (corpus) a part of speech, based on the definition of the word and its context. It is an important step for analyzing the relationships between words and understanding the meaning of a text. POS tags are useful for building parse trees, which are in turn used for building named-entity recognizers (NERs) and extracting relations between words. They are also used for building lemmatizers (see 1.3).

Some POS tagging techniques:
- Lexical-based method: assigns each word the POS tag that occurs most frequently with it in the training corpus.
- Rule-based method: assigns POS tags based on a set of predefined rules and a dictionary.
- Probabilistic method: assigns POS tags based on the probability of a particular tag sequence occurring. Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs) are probabilistic approaches to POS tagging.
- Deep learning method: Recurrent Neural Networks (RNNs) can also be used for POS tagging.
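
The lexical-based method in the list above is simple enough to sketch in a few lines. The snippet below is a minimal illustration, not a production tagger: the tiny hand-tagged corpus, the function names `train_most_frequent_tag` and `tag_words`, and the fallback tag for unknown words are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Map each (lowercased) word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]

# Hypothetical toy corpus: "book" appears twice as a noun and once as a verb.
corpus = [
    [("the", "DT"), ("book", "NN"), ("is", "VB"), ("good", "JJ")],
    [("i", "PRP"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
    [("a", "DT"), ("book", "NN")],
]

model = train_most_frequent_tag(corpus)
tagged = tag_words(["Book", "a", "table"], model)
print(tagged)
```

Because this baseline ignores context entirely, it always tags "book" as a noun here, which is the limitation the probabilistic methods try to address.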

### Conditional Random Fields (CRFs)

Conditional Random Fields are discriminative models used for predicting sequences. They use contextual information from previous labels, which increases the amount of information the model has available to make a good prediction.
Like logistic regression, a discriminative classifier models the decision boundary between the classes directly.
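
In practice, a linear-chain CRF tagger usually scores each tag from features of the current token and its neighbours, combined with learned transitions between adjacent tags. The sketch below covers only the feature-extraction half, producing one feature dictionary per token. The feature names and the helper functions `token_features` and `sentence_features` are illustrative assumptions, not the API of any particular CRF library.

```python
def token_features(words, i):
    """Build a feature dictionary for the token at position i."""
    word = words[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    # Context features: the neighbouring words, or sentence-boundary flags.
    if i > 0:
        features["prev.lower"] = words[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(words) - 1:
        features["next.lower"] = words[i + 1].lower()
    else:
        features["EOS"] = True
    return features

def sentence_features(words):
    """Feature dictionaries for every token in a sentence."""
    return [token_features(words, i) for i in range(len(words))]

feats = sentence_features(["Delhi", "is", "the", "capital"])
print(feats[1])
```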

### Hidden Markov Models (HMMs)
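
As a rough sketch of how an HMM tagger decodes a sentence, the Viterbi algorithm below finds the most probable tag sequence given start, transition, and emission probabilities. All probabilities, the two-tag tag set, and the `UNSEEN` smoothing constant are invented for illustration; a real tagger would estimate them from a tagged corpus.

```python
import math

STATES = ["N", "V"]
# Toy probabilities, invented for this example only.
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT = {
    "N": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
    "V": {"dogs": 0.05, "bark": 0.6, "chase": 0.35},
}
UNSEEN = 1e-6  # small probability for word/tag pairs missing from EMIT

def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []

    def emit(state, word):
        return math.log(EMIT[state].get(word, UNSEEN))

    # best[i][s] = (log-probability of the best path ending in s at i, previous state)
    best = [{s: (math.log(START[s]) + emit(s, words[0]), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            score, prev = max(
                (best[-1][p][0] + math.log(TRANS[p][s]), p) for p in STATES
            )
            column[s] = (score + emit(s, word), prev)
        best.append(column)

    # Backtrack from the best final state to the start of the sentence.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a long sentence.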

## 3.2. Parsing
One of the most important parts of syntactic processing is parsing, which means breaking a given sentence down into its *grammatical components*. NLTK does not provide a pre-trained English grammar model, so we have to specify a grammar manually before parsing a sentence.
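
To illustrate what specifying a grammar by hand involves, here is a minimal sketch of a CYK recognizer written in plain Python (it does not use NLTK). The toy grammar in Chomsky normal form, the rule tables `BINARY` and `LEXICAL`, and the function `is_grammatical` are assumptions made for this example; the grammar covers only the two Delhi sentences from the start of this section.

```python
from collections import defaultdict

# Binary rules: (left child, right child) -> possible parents.
BINARY = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "Nom"): {"NP"},
    ("N", "PP"): {"Nom"},
    ("P", "NP"): {"PP"},
}
# Lexical rules: lowercased word -> possible categories.
LEXICAL = {
    "delhi": {"NP"},
    "india": {"NP"},
    "is": {"V"},
    "the": {"Det"},
    "capital": {"N", "Nom"},
    "of": {"P"},
}

def is_grammatical(tokens):
    """Return True if the toy grammar derives `tokens` from the start symbol S."""
    n = len(tokens)
    if n == 0:
        return False
    table = defaultdict(set)  # (start, end) -> categories spanning tokens[start:end]
    for i, token in enumerate(tokens):
        table[(i, i + 1)] = set(LEXICAL.get(token.lower(), ()))
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in table[(start, mid)]:
                    for right in table[(mid, end)]:
                        table[(start, end)] |= BINARY.get((left, right), set())
    return "S" in table[(0, n)]

print(is_grammatical("Delhi is the capital of India".split()))
print(is_grammatical("Is Delhi the of India capital".split()))
```

The recognizer only answers whether a sentence is derivable; a full parser would also store back-pointers in the table to rebuild the parse tree.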

# 4. NLP for Vietnamese language 

For the Vietnamese language, the available packages are not yet fully developed and lack detailed documentation. Some common packages are pyvi, vncorenlp and underthesea. These packages provide basic functions such as word tokenization, POS tagging and accent removal. pyvi and underthesea differ little, except that underthesea also provides pre-trained models for tasks such as NER, classification and sentiment analysis.

*Reference:* 
[Pyvi](https://pypi.org/project/pyvi/) and 
[Underthesea](https://underthesea.readthedocs.io/en/latest/readme.html)

In [4]:
from pyvi import ViTokenizer, ViPosTagger, ViUtils,ViDiac

In [5]:
ViTokenizer.tokenize("Hồ gươm là danh lam thắng cảnh Hà Nội").split(' ')

['Hồ', 'gươm', 'là', 'danh_lam', 'thắng_cảnh', 'Hà_Nội']

In [17]:
list_token, _ = ViTokenizer.spacy_tokenize("Tôi là sinh viên trường cao đẳng y tế hà tây")
list_token

['Tôi', 'là', 'sinh_viên', 'trường', 'cao_đẳng', 'y_tế', 'hà_tây']

In [7]:
ViPosTagger.postagging("Hồ gươm là danh lam thắng cảnh Hà Nội")

(['Hồ', 'gươm', 'là', 'danh', 'lam', 'thắng', 'cảnh', 'Hà', 'Nội'],
 ['N', 'N', 'V', 'N', 'N', 'V', 'N', 'Np', 'Np'])

In [8]:
ViUtils.remove_accents(u"Hồ gươm là danh lam thắng cảnh Hà Nội")

b'Ho guom la danh lam thang canh Ha Noi'

*Another way to remove accents is to use the unidecode package.*

In [45]:
import unidecode

unidecode.unidecode('Hồ gươm là danh lam thắng cảnh Hà Nội')

'Ho guom la danh lam thang canh Ha Noi'
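
If neither package is available, accents can also be stripped with the standard library alone. The sketch below is one possible approach under stated assumptions: it decomposes each character with `unicodedata.normalize("NFD", ...)`, drops the combining marks, and handles "đ"/"Đ" separately because that letter has no Unicode decomposition. The function name `strip_accents` is chosen for this example.

```python
import unicodedata

def strip_accents(text):
    """Remove Vietnamese diacritics using only the standard library."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining marks left behind by decomposition.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ" and "Đ" do not decompose, so map them explicitly.
    return stripped.replace("đ", "d").replace("Đ", "D")

print(strip_accents("Hồ gươm là danh lam thắng cảnh Hà Nội"))
```

Unlike `ViUtils.remove_accents`, which returns bytes in the output shown earlier, this function returns a regular `str`.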

In [12]:
import underthesea
from underthesea import sent_tokenize, word_tokenize, pos_tag

In [13]:
text = """Với xử lí tiếng việt, các thư viện chưa phát triển nhiều. Một số thư viện phổ biến là pyvi và underthesea"""
sent_tokenize(text)

['Với xử lí tiếng việt, các thư viện chưa phát triển nhiều.',
 'Một số thư viện phổ biến là pyvi và underthesea']

In [14]:
print(word_tokenize(text))

['Với', 'xử lí', 'tiếng', 'việt', ',', 'các', 'thư viện', 'chưa', 'phát triển', 'nhiều', '.', 'Một số', 'thư viện', 'phổ biến', 'là', 'pyvi', 'và', 'underthesea']


In [15]:
pos_tag(text)

[('Với', 'E'),
 ('xử lí', 'N'),
 ('tiếng', 'N'),
 ('việt', 'V'),
 (',', 'CH'),
 ('các', 'L'),
 ('thư viện', 'N'),
 ('chưa', 'R'),
 ('phát triển', 'V'),
 ('nhiều', 'A'),
 ('.', 'CH'),
 ('Một số', 'L'),
 ('thư viện', 'N'),
 ('phổ biến', 'V'),
 ('là', 'V'),
 ('pyvi', 'N'),
 ('và', 'C'),
 ('underthesea', 'M')]

## 1.1 Regex 

Some string processing techniques with regex were introduced in `2. [python] Classes`. Therefore, this section only introduces some common patterns in text processing.

In [18]:
import re

In [19]:
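# find characters outside the lowercase Vietnamese alphabet and whitespace
# (this also matches digits and uppercase letters)
pattern = r'[^aàảãáạăằẳẵắặâầẩẫấậbcdđeèẻẽéẹêềểễếệfghiìỉĩíịjklmnoòỏõóọôồổỗốộơờởỡớợpqrstuùủũúụưừửữứựvwxyỳỷỹýỵz\s]'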
string = """🔥𝑩𝑨𝑪𝑲 𝑻𝑶 𝑺𝑪𝑯𝑶𝑶𝑳 balo Japan classic™   balo đi học  balo laptop   balo thời trang   balo chống nước
💃 đầm trắng nữ cổ vuông eo chun váy nữ cộc tay chất đũi dáng xòe
🪵có sẵn set áo babydoll thô đũi viền ren kèm quần đùi 🪵
【JN】heybig spring and summer new Korean
🌈𝗡𝗘𝗪 𝗔𝗥𝗥𝗜𝗩𝗔𝗟💢 áo khoác kaki unisex 📽️ videoảnh thật a92
Quần jean nam trơn màu xanh 🔵 𝐅𝐑𝐄𝐄 𝐒𝐇𝐈𝐏 🔵 quần bò nam co giãn thời trang hpfashion"""

print(re.findall(pattern,string))

['🔥', '𝑩', '𝑨', '𝑪', '𝑲', '𝑻', '𝑶', '𝑺', '𝑪', '𝑯', '𝑶', '𝑶', '𝑳', 'J', '™', '💃', '\U0001fab5', '\U0001fab5', '【', 'J', 'N', '】', 'K', '🌈', '𝗡', '𝗘', '𝗪', '𝗔', '𝗥', '𝗥', '𝗜', '𝗩', '𝗔', '𝗟', '💢', '📽', '️', '9', '2', 'Q', '🔵', '𝐅', '𝐑', '𝐄', '𝐄', '𝐒', '𝐇', '𝐈', '𝐏', '🔵']
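
Finding these characters is usually a first step toward removing them. The sketch below shows one possible cleaning routine, assuming the same lowercase Vietnamese alphabet as the pattern above. It first applies NFKC normalization, which folds stylized letters such as "𝑩" into plain ASCII, then lowercases the text and replaces everything else with a space. The names `ALPHABET`, `NOT_ALLOWED` and `clean_text` are chosen for this example.

```python
import re
import unicodedata

# Lowercase Vietnamese alphabet, normalized so it matches NFKC-normalized input.
ALPHABET = unicodedata.normalize(
    "NFC",
    "aàảãáạăằẳẵắặâầẩẫấậbcdđeèẻẽéẹêềểễếệfghiìỉĩíịjklmnoòỏõóọôồổỗốộơờởỡớợpqrstuùủũúụưừửữứựvwxyỳỷỹýỵz",
)
NOT_ALLOWED = re.compile("[^" + ALPHABET + r"\s]")

def clean_text(text):
    """Fold stylized letters, lowercase, and drop characters outside the alphabet."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = NOT_ALLOWED.sub(" ", text)
    # Collapse the runs of whitespace left by the removed characters.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("🔥𝑩𝑨𝑪𝑲 𝑻𝑶 𝑺𝑪𝑯𝑶𝑶𝑳 balo Japan"))
```

Note that digits are removed as well, since they are not part of the alphabet; add `0-9` to the character class if they should be kept.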


In [20]:
# find single-character words (a rough proxy for short stop words such as "I" and "a")
pattern = r'(^|\s+)(\S(\s+|$))+'
sen = 'I need a doctor'

re.findall(pattern, sen)

[('', 'I ', ' '), (' ', 'a ', ' ')]

In [21]:
# find URLs
pattern = r'http\S+'
sen = 'Reference: https://regex101.com/ (regex online checking)'

re.findall(pattern, sen)

['https://regex101.com/']

In [30]:
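# find the episode or part number
# Python's `re` module only supports fixed-width look-behind, so the prefix is
# matched with a non-capturing group and the number with a capturing group
pattern = r'(?:phần\s|tập\s|t\.?)(\d+)'
sen = """thiên thần 1001 tập 19 
thiên thần 1001 tập 18 
phim trung quốc: hán sở tranh hùng-t.85 
phim trung quốc: hán sở tranh hùng-t86 
vụ án ngay bên bạn: bộ hài cốt bí ẩn-phần 7"""

re.findall(pattern, sen)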

In [23]:
# find the host name of a URL without its last dot-separated part (the top-level domain)
pattern = r'(?<=//)\S+(?=\.)'
sen = """https://www.google.ca/
https://id.zalo.me/account/outapp"""

re.findall(pattern, sen)

['www.google', 'id.zalo']
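
The regex above relies on greedy matching up to the last dot, so it can break when the path itself contains a dot. A more robust alternative, sketched below under that assumption, is to let `urllib.parse.urlparse` from the standard library isolate the host name and then split it into labels. The helper name `host_labels` is chosen for this example.

```python
from urllib.parse import urlparse

def host_labels(url):
    """Return the dot-separated labels of the URL's host name."""
    host = urlparse(url).hostname or ""
    return host.split(".") if host else []

urls = ["https://www.google.ca/", "https://id.zalo.me/account/outapp"]
for url in urls:
    print(host_labels(url))
```

Deciding which labels form the registrable domain (for example with suffixes like `co.uk`) requires the public suffix list, which this sketch does not handle.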

In [33]:
# find the item IDs in a URL (digit runs followed by "rf<digits>" or ".htm")
pattern = r'\d+(?=rf\d+|\.htm)'
url1 = 'https://soha.vn/giam-doc-bv-bach-mai-nguyen-quang-tuan-bi-khoi-to-bo-y-te-noi-gi-20211021185707803.htm'
url2 = 'https://soha.vn/phu-tho-ghi-nhan-them-17-ca-duong-tinh-voi-sars-cov-2-20211021121219895rf20211021185707803.htm'
re.findall(pattern, url1)
re.findall(pattern, url2)

['20211021121219895', '20211021185707803']

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*