<a href="https://colab.research.google.com/github/toche7/AI_ITM/blob/main/Lab11_Exampleofpythainlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pythainlp

Collecting pythainlp
  Downloading pythainlp-5.1.0-py3-none-any.whl.metadata (8.0 kB)
Downloading pythainlp-5.1.0-py3-none-any.whl (19.3 MB)
Installing collected packages: pythainlp
Successfully installed pythainlp-5.1.0


In [2]:
from pythainlp import word_tokenize, pos_tag
from pythainlp.corpus.common import thai_stopwords
from pythainlp.util import normalize

# Example Thai text
text = "สวัสดีครับ วันนี้อากาศดีมากกๆ"

# Cleaning: Normalize the text to ensure consistency
text_normalized = normalize(text)
text_normalized

'สวัสดีครับ วันนี้อากาศดีมากกๆ'

In [3]:
# Tokenization and word segmentation: Split text into words (default engine is 'newmm')
tokens = word_tokenize(text_normalized)
tokens

['สวัสดี', 'ครับ', ' ', 'วันนี้', 'อากาศ', 'ดี', 'มา', 'กก', 'ๆ']

In [None]:
# Remove stopwords
stopwords = thai_stopwords()
tokens_without_stopwords = [word for word in tokens if word not in stopwords]
tokens_without_stopwords

['สวัสดี', ' ', 'อากาศ', 'ดี', 'กก']
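
Note that the whitespace token `' '` survives stopword removal, because a space is not a stopword. A minimal sketch of one way to drop it in the same pass is shown here. It assumes the token list copied from the segmentation output above, and it uses a hand-picked stopword set (only the four tokens that were removed in this example) as a stand-in for `thai_stopwords()`, so that the snippet runs without PyThaiNLP installed. The name `content_tokens` is illustrative only.

```python
# Tokens copied from the word_tokenize output above
tokens = ['สวัสดี', 'ครับ', ' ', 'วันนี้', 'อากาศ', 'ดี', 'มา', 'กก', 'ๆ']

# Stand-in for thai_stopwords(): only the four tokens removed in this example
stopwords = frozenset({'ครับ', 'วันนี้', 'มา', 'ๆ'})

# Keep a token only if it is not whitespace-only and not a stopword
content_tokens = [
    word for word in tokens
    if word.strip() and word not in stopwords
]
print(content_tokens)
```

In the notebook itself, the same filter can be applied with the real `thai_stopwords()` set. Passing `keep_whitespace=False` to `word_tokenize` should be another way to avoid whitespace tokens in the first place.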

In [None]:
# Part-of-Speech Tagging
pos_tags = pos_tag(tokens_without_stopwords)
pos_tags

[('สวัสดี', 'NCMN'),
 (' ', 'PUNC'),
 ('อากาศ', 'NCMN'),
 ('ดี', 'VATT'),
 ('กก', 'ADVN')]
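
The tag names above appear to follow the ORCHID tag set, which is what `pos_tag` uses by default as far as we know. As a reading aid, the sketch below attaches a plain-English gloss to each tagged pair. The pairs are copied from the output above, and the `ORCHID_GLOSSES` dictionary is a hand-written helper covering only the four tags seen here, not part of PyThaiNLP.

```python
# Tagged pairs copied from the pos_tag output above
pos_tags = [
    ('สวัสดี', 'NCMN'),
    (' ', 'PUNC'),
    ('อากาศ', 'NCMN'),
    ('ดี', 'VATT'),
    ('กก', 'ADVN'),
]

# Glosses for the four tags that occur in this example (not the full tag set)
ORCHID_GLOSSES = {
    'NCMN': 'common noun',
    'VATT': 'attributive verb',
    'ADVN': 'adverb (normal form)',
    'PUNC': 'punctuation',
}

# Attach a gloss to every (word, tag) pair; unknown tags are marked explicitly
described = [
    (word, tag, ORCHID_GLOSSES.get(tag, 'unknown'))
    for word, tag in pos_tags
]
for word, tag, gloss in described:
    print(f'{word!r}\t{tag}\t{gloss}')
```

The `ADVN` tag on `'กก'` should be read with caution: that token is a fragment of the misspelled `มากก` in the input, not a real adverb.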

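The cells above perform the following steps:
* Normalize the text to a consistent form. In this example the output is identical to the input, so the doubled ก in "มากกๆ" is left unchanged.
* Tokenize the text into words. Because Thai is written without spaces between words, tokenization is mainly a word segmentation problem. The uncorrected typo is why "มากกๆ" is split into 'มา', 'กก', and 'ๆ'.
* Remove stopwords, which are very common words that carry little meaning on their own.
* Tag each remaining token with its part of speech.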
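
To illustrate why segmentation needs a dictionary when there are no spaces, here is a toy greedy longest-match segmenter. It is a simplified stand-in, not the `newmm` algorithm that PyThaiNLP actually uses, and both the function name `longest_match_tokenize` and the four-word vocabulary are assumptions made for this sketch.

```python
def longest_match_tokenize(text, vocabulary):
    """Greedy left-to-right segmentation.

    At each position, take the longest vocabulary entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(len(word) for word in vocabulary)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character when no entry matches
        # Try candidate lengths from longest to shortest (down to 2)
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Tiny illustrative vocabulary; real engines ship a large dictionary
words = ["วันนี้", "อากาศ", "ดี", "มาก"]
vocabulary = set(words)

# Thai text is written without spaces, so join the words directly
text = "".join(words)
tokens = longest_match_tokenize(text, vocabulary)
print(tokens)
```

With a correctly spelled input, every position lines up with a dictionary entry. A typo such as the extra ก in the notebook's example breaks that alignment, which is why the real tokenizer returns fragments for it.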

As mentioned earlier, stemming is not a typical step in Thai text processing. Thai words are not inflected the way English words are, so there are no tense or plural endings to strip. A typical preprocessing pipeline therefore ends at word segmentation and part-of-speech tagging.
