# NLP Lab 1: Lexical Analysis

Objective: Perform tokenization and information extraction using regular expressions (regex) in Python, without any external NLP libraries.

In [1]:
import re

## Exercise 1: Basic Tokenization

Task: Split the text into words and punctuation marks. Handle the possessive apostrophe (for example `Nepal’s`).

In [2]:
text1 = "Nepal’s Home Minister Bhim Rawal on Sunday held a meeting with CPN-UML leaders, including Madhav Kumar Nepal, Jhalanath Khanal and Bam Dev Gautam, at the party headquarters in Balkhu. The meeting focused on the implementation of the three-point agreement signed between the major political parties and the Home Ministry’s security plan."

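# Regex pattern (alternatives are tried left to right, so the apostrophe form comes first):
# 1. Handle apostrophe: \w+’\w+  (typographic ’ only, as used in text1)
# 2. Handle words: \w+
# 3. Handle punctuation: [^\w\s]  (a hyphen is its own token, so CPN-UML becomes three tokens)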
pattern1 = r"\w+’\w+|\w+|[^\w\s]"
tokens1 = re.findall(pattern1, text1)
print("Tokens Exercise 1:", tokens1)

Tokens Exercise 1: ['Nepal’s', 'Home', 'Minister', 'Bhim', 'Rawal', 'on', 'Sunday', 'held', 'a', 'meeting', 'with', 'CPN', '-', 'UML', 'leaders', ',', 'including', 'Madhav', 'Kumar', 'Nepal', ',', 'Jhalanath', 'Khanal', 'and', 'Bam', 'Dev', 'Gautam', ',', 'at', 'the', 'party', 'headquarters', 'in', 'Balkhu', '.', 'The', 'meeting', 'focused', 'on', 'the', 'implementation', 'of', 'the', 'three', '-', 'point', 'agreement', 'signed', 'between', 'the', 'major', 'political', 'parties', 'and', 'the', 'Home', 'Ministry’s', 'security', 'plan', '.']
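
The output above splits `CPN-UML` and `three-point` into three tokens each. If hyphenated compounds should stay whole, one possible variant is sketched below. The names `WORD`, `pattern1_hyphen`, and `sample` are illustrative and not part of the lab, and accepting the straight apostrophe `'` alongside `’` is an added assumption.

```python
import re

# Assumed variant of pattern1: a word may carry one apostrophe suffix (’ or '),
# and words joined by hyphens are kept together as a single token.
WORD = r"\w+(?:[’']\w+)?"
pattern1_hyphen = rf"{WORD}(?:-{WORD})*|[^\w\s]"

# Shortened sample sentence, written for this sketch
sample = "Nepal’s Home Minister met CPN-UML leaders on the three-point agreement."
tokens = re.findall(pattern1_hyphen, sample)
print(tokens)
```

All groups are non-capturing, so `re.findall` returns whole matches rather than group contents. Whether a hyphenated compound counts as one token or three depends on the downstream task, so this is a design choice rather than a correction.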


## Exercise 2: Chunk Extraction

Task: Extract runs of consecutive capitalized words (titles, named entities).

In [3]:
# Regex for runs of capitalized words, each optionally followed by one whitespace character
pattern2 = r"(?:[A-Z][\w’]*\s?)+"
chunks = re.findall(pattern2, text1)
# Strip the trailing whitespace a match may carry. Single-word chunks are kept:
# the length check only guards against empty strings.
chunks_refined = [c.strip() for c in chunks if len(c.strip().split()) > 0]
print("Chunks Exercise 2:", chunks_refined)

Chunks Exercise 2: ['Nepal’s Home Minister Bhim Rawal', 'Sunday', 'CPN', 'UML', 'Madhav Kumar Nepal', 'Jhalanath Khanal', 'Bam Dev Gautam', 'Balkhu', 'The', 'Home Ministry’s']


## Exercise 3: Token Classification & Extraction

Task: Extract capitalized words and numbers (including decimal formats such as 84.477 or 2,21).

In [4]:
text2 = "The police are firing 50 rounds in the air. 84.477 people were affected."
text3 = "Việt Nam có dân số 99,6 triệu người. Tăng trưởng đạt 2,21%."

# Pattern for numbers (integer, or float with . or ,)
num_pattern = r"\d+(?:[.,]\d+)?"
# Pattern for Capitalized words
cap_pattern = r"[A-Z][a-z]+"

print("--- Text 2 Extraction ---")
print("Numbers:", re.findall(num_pattern, text2))
print("Capitalized:", re.findall(cap_pattern, text2))

print("\n--- Text 3 Extraction ---")
print("Numbers:", re.findall(num_pattern, text3))
# Đ extends the uppercase class; the range à-ỹ (U+00E0 to U+1EF9) takes in the accented Vietnamese letters, among other characters
print("Capitalized (sentence-initial words or proper nouns):", re.findall(r"[A-ZĐ][a-zà-ỹ]+", text3))

--- Text 2 Extraction ---
Numbers: ['50', '84.477']
Capitalized: ['The']

--- Text 3 Extraction ---
Numbers: ['99,6', '2,21']
Capitalized (sentence-initial words or proper nouns): ['Việt', 'Nam', 'Tăng']
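
The hand-written class `[A-ZĐ]` misses uppercase letters that carry diacritics, such as `Ấ` in `Ấn Độ`. A Unicode-aware alternative, still using only the standard library, is sketched below. The helper `capitalized_words` is a hypothetical name, and normalizing the input to NFC is an added assumption so that accented letters arrive as single code points.

```python
import re
import unicodedata

def capitalized_words(text):
    """Return words whose first letter is uppercase and whose remaining letters are lowercase."""
    # Compose base letters and combining marks so that \w sees whole letters
    text = unicodedata.normalize("NFC", text)
    # [^\W\d_]+ matches runs of letters only (word characters minus digits and underscore)
    words = re.findall(r"[^\W\d_]+", text)
    # "".islower() is False, so one-letter words are excluded, as with [A-Z][a-z]+
    return [w for w in words if w[0].isupper() and w[1:].islower()]

print(capitalized_words("Đà Nẵng và Việt Nam có dân số 99,6 triệu người."))
print(capitalized_words("The police met Bhim Rawal, a UML leader."))
```

Like `cap_pattern`, this skips all-caps tokens such as `UML`. Unlike it, the case test relies on `str.isupper` and `str.islower`, so no letter ranges need to be listed by hand.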
