<a href="https://colab.research.google.com/github/ms624atyale/NLP_PictureBook_2025/blob/main/9_Token_CleanFunction_Lemma_PBL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📖  Tokenization
### 1️⃣ Sentence tokenization or sentence segmentation ⤵️

* Within a corpus, sentence boundaries can mostly be predicted from sentence delimiters ("!", "?", ".").
* The "." has many possible exceptions (e.g., IP 192.168.56.31, email account python@gmail.com, Ph.D, etc.)
* Language-specific conventions, special characters, and typos make it hard to find reliable rules.
* Use the sent_tokenize() function in the NLTK package.
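
To see why the "." exceptions matter, here is a minimal sketch of a naive splitter that only uses the standard `re` module. The rule itself (split wherever ".", "!" or "?" is followed by whitespace) is an illustrative assumption, not what NLTK does.

```python
import re

message = (
    "I'm actively looking for Ph.D. students, and you are a Ph.D student. "
    "Visit IP 192.168.56.31 and send the results to my email account. "
    "It's python@gmail.com."
)

# Naive rule (an assumption for illustration): a sentence ends at ".", "!" or "?"
# when that character is followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", message)

for number, sentence in enumerate(naive_sentences, start=1):
    print(number, sentence)
```

The naive rule wrongly breaks the first sentence after "Ph.D.", and the IP address and email address stay intact only because no whitespace follows their dots.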

>
### 2️⃣ Small-unit tokenization ⤵️

* **Corpus data (e.g., crawling) should be <font color = 'red'> preprocessed </font> before further analysis by means of <font color = 'red'> Cleaning, Tokenization, & Normalization</font>**.
>
* Simplest tokenization: remove punctuation, then split on whitespace
>
* [**English Tokenization**](https://wikidocs.net/21698)

  - <font color = 'red'> **Cleaning**</font>
    * Remove punctuation (e.g., ".", ",", "?", "!", ";", ":")
    * Remove special characters
    * Line markers and the like can be cleaned out as well

  - <font color = 'red'> **Tokenization**</font>

    * Tokenization: splitting a given corpus into units called tokens (e.g., word, phrase, strings with meaning)
    * The unit of a token varies with the situation, but a token is usually defined as a meaningful unit.
    * Depending on the function used to tokenize, apostrophes, hyphens, and the like may be kept inside tokens or removed.

  - <font color = 'red'> **Normalization**</font>
    * Stemming: am → 'am', having → 'hav'
    * Lemmatization: am → 'be', having → 'have'
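
The cleaning and simplest-tokenization steps above can be sketched with the standard library alone. The helper names `clean` and `tokenize` and the sample sentence are hypothetical, chosen only for illustration.

```python
import string


def clean(text: str) -> str:
    """Delete ASCII punctuation and collapse line breaks and repeated spaces."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punctuation.split())


def tokenize(text: str) -> list[str]:
    """Simplest tokenization: clean, lowercase, then split on whitespace."""
    return clean(text).lower().split()


sample = "Don't be fooled: Mr. Jones's state-of-the-art shop\nis cheap!"
tokens = tokenize(sample)
print(tokens)
```

Note how deleting every punctuation mark fuses "Don't" into "dont" and "state-of-the-art" into "stateoftheart". This is the apostrophe and hyphen problem mentioned above, and it is one reason dedicated tokenizers handle these characters with their own rules.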

### More about <font color = 'red'> **Normalization**</font>
* [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

The goal of both stemming and lemmatization is to reduce **inflectional** forms and sometimes **derivationally** related forms of a word to a **common base form**.

[Table1] organize (vt)

|Inflection|base|grammatical functions|derived forms|
|:--|--|--|--|
|1.|organize(vt.)|3rd per. sg| organizes|
|2.||progressive| organizing|
|3.||past | organized|
|4.||past participle|organized|

[Table2] car (noun)

|Inflection|base|grammatical functions|derived forms|
|:--|--|--|--|
|1.|car(noun)|plural|cars|
|2.||possessive| car's|

[Table3] big (adjective)

|Inflection|base|grammatical functions|derived forms|
|:--|--|--|--|
|1.|big(adjective)|comparative|bigger|
|2.||superlative| biggest|

[Table4] combine (vt)

|Derivation|base|grammatical category|derived forms|
|:--|--|--|--|
|1.|combine (verb)| verb| recombine|
|2.|| noun| combination|
|3.||adjective| combinational|

[Table5] be (vi)

|copular be verb|base|subject-verb agreement|forms|
|:--|--|--|--|
|1.|be (verb)| 1st sg. present/past| am/was|
|2.|| 2nd sg. & all pl. present/past|are/were|
|3.||3rd sg. present/past| is/was|
<p> </p>

### Mapping of text

>* The boy's cars are different colors $\Rightarrow$
the boy car be differ color


- **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the **<em>removal of derivational affixes</em>**.

- **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to **remove inflectional endings** only and to return the **base or dictionary form** of a word, which is known as the lemma.
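
The contrast between the two definitions can be sketched on the sentence from "Mapping of text" above. Both functions below are toy stand-ins: `SUFFIXES` and `LEMMAS` are hypothetical hand-written tables, not Porter's algorithm and not a real lemmatizer.

```python
# Toy stemmer: chop off the first matching suffix, no vocabulary involved.
SUFFIXES = ("'s", "ent", "ing", "es", "s")  # hypothetical suffix list


def crude_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: look the word up in a tiny hand-made dictionary.
LEMMAS = {
    "am": "be", "is": "be", "are": "be",
    "boy's": "boy", "cars": "car", "colors": "color", "having": "have",
}


def toy_lemma(word: str) -> str:
    return LEMMAS.get(word, word)


tokens = "the boy's cars are different colors".split()
stems = [crude_stem(token) for token in tokens]
lemmas = [toy_lemma(token) for token in tokens]
print(stems)
print(lemmas)
```

The suffix chopper cannot relate "are" to "be" and cuts the derivational ending of "different", while the dictionary lookup returns "be" and leaves "different" untouched. A real lemmatizer replaces the tiny dictionary with a full vocabulary and morphological analysis.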

If confronted with the token 'saw', stemming might return just 's', whereas lemmatization would attempt to return either 'see'(verb) or 'saw'(noun) depending on whether the use of the token was as a verb or a noun. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source. The most common algorithm for **stemming English**, and one that has repeatedly been shown to be empirically very effective, is **Porter's algorithm (Porter, 1980)**.

You can use a **lemmatizer**, a tool from Natural Language Processing which does full morphological analysis to <em>accurately identify the lemma for each word</em>. <font color = 'blue'> Doing full morphological analysis produces at most very modest benefits for retrieval</font>. It is hard to say more, because either form of normalization tends not to improve English information retrieval performance in aggregate - at least not by very much. While it helps a lot for some queries, it equally hurts performance a lot for others. <font color = 'red'> Stemming increases recall while harming precision.</font>

>The **Porter stemmer** stems all of the following words:
>
>>'operate', 'operating', 'operates', 'operation', 'operative', 'operatives', 'operational'
$\Rightarrow$ 'oper'.
>
> Defect $\Rightarrow$ We lose considerable precision on queries such as the following with Porter stemming:
>> "operational and research", "operating and system", "operative and dentistry"

For a case like this, moving to using a lemmatizer would not completely fix the problem because **particular inflectional forms are used in particular <font color = 'blue'>collocations</font>**: a sentence with the words operate and system is not a good match for the query operating and system. Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.
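
The recall and precision trade-off can be illustrated with a small retrieval sketch. It hard-codes the conflation to 'oper' quoted above instead of running a stemmer, and the two documents and the `matches` helper are hypothetical.

```python
# Words that, according to the list above, the Porter stemmer reduces to 'oper'.
OPER_FAMILY = {
    "operate", "operating", "operates", "operation",
    "operative", "operatives", "operational",
}


def conflate(token: str) -> str:
    """Stand-in for stemming: map the whole word family to 'oper'."""
    return "oper" if token in OPER_FAMILY else token


def terms(text: str, stem: bool) -> set[str]:
    tokens = text.lower().split()
    return {conflate(token) if stem else token for token in tokens}


def matches(query: str, document: str, stem: bool) -> bool:
    """Boolean AND retrieval: every query term must occur in the document."""
    return terms(query, stem) <= terms(document, stem)


documents = [
    "the operating system schedules processes",
    "nurses operate the hospital paging system",
]
query = "operating system"

exact_hits = [matches(query, document, stem=False) for document in documents]
stemmed_hits = [matches(query, document, stem=True) for document in documents]
print(exact_hits)
print(stemmed_hits)
```

Without conflation only the first document is returned. With conflation the second document matches as well: more documents are retrieved (higher recall), but one of them has nothing to do with operating systems (lower precision).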

# 📖  Tokenization

### 💡 **For your information**

The **'punkt'** resource is NLTK's pre-trained **sentence tokenizer** model. `sent_tokenize()` uses it to split text into sentences, and `word_tokenize()` relies on it to split text into sentences before splitting each sentence into words. Once the 'punkt' resource is downloaded, you can use the tokenizers in the **nltk.tokenize module** in your code.

### 🆘 What is PunktSentenceTokenizer?
* [For more information read the original article](https://www.askpython.com/python-modules/nltk-punkt)

In NLTK, PUNKT is an <font color = 'brown'> **unsupervised trainable model**</font>, which means it can be **trained on unlabeled data** (data that has not been tagged with information identifying its characteristics, properties, or categories).

It generates a list of sentences from a text by developing a model for **abbreviations, collocations, and words that start sentences** using an unsupervised technique. Before it can be used, it has to be trained on a sizable amount of plain text in the intended language.

🚯 Caution
* When you use nltk.sent_tokenize, sentence segmentation/tokenization is carried out with the punkt model. <font color = 'blue'> punkt is a model that has learned sentence structure: it has learned which "." belongs to an abbreviation (e.g., Ph.D.) and which "." is a sentence-final period.</font> <font color = 'brown'> It basically splits sentences at periods, but abbreviations such as Ph.D., Saint., and Professor. are learned as known abbreviations and treated as a single word.</font> However, the punkt model has not learned every abbreviation, so expressions such as Vol. 13 and Apr. 13, and complex abbreviations such as U.S. Pat. No. 134, are not known abbreviations and are all split apart.
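
The "known abbreviation" idea can be sketched without NLTK. This is a toy illustration only: Punkt learns its abbreviation list from data, whereas `KNOWN_ABBREVIATIONS` below is a hypothetical hand-written set, and `split_sentences` is not Punkt's actual algorithm.

```python
KNOWN_ABBREVIATIONS = {"ph.d.", "mr.", "dr."}  # hypothetical hand-written list


def split_sentences(text: str, abbreviations=KNOWN_ABBREVIATIONS) -> list[str]:
    """Split at tokens ending in '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing words that lack final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "She holds a Ph.D. in physics. See Vol. 13 for details."
result = split_sentences(text)
print(result)
```

"Ph.D." is in the list, so the first sentence stays whole, while "Vol." is unknown and the second sentence is wrongly cut in two. The same kind of error appears with real models whenever an abbreviation was not learned.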

In [3]:
#@markdown #### 🐹 **Student's Activity 0** ⤵️
#@markdown 👀🐾 **Install <font color = 'red'> NLTK</font> package and download  <font color = 'red'> punkt </font> package.**
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
#@markdown #### 🐹 **Student's Activity 1** ⤵️
#@markdown 👀🐾 **Call <font color = 'red'> sent_tokenize() </font> function**

#@markdown 🔎 **Exercise for Various Periods**

message = "I'm actively looking for Ph.D. students, \
and you are a Ph.D student. \
Visit IP 192.168.56.31 \
and send the results to my email account. \
It's python@gmail.com."

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(message)
print('Sentence tokenization: %s' % sentences)

Sentence tokenization: ["I'm actively looking for Ph.D. students, and you are a Ph.D student.", 'Visit IP 192.168.56.31 and send the results to my email account.', "It's python@gmail.com."]


# 📕 **Children's Picture Books**
### [**Project Gutenberg**](https://gutenberg.org/)
- **Beatrix Potter: Search by 'Beatrix Potter'**
    - **The Tale of Peter Rabbit**
    - **The Tale of Benjamin Bunny**
    - **The Tale of Jemima Puddle-Duck**
    - **The Tale of Mrs. Tiggy-Winkle**
    - **The Tale of Squirrel Nutkin**
    - **The Tale of Tom Kitten**

- **Leslie Brooke: Search by 'Leslie Brooke'**
    - **The Tailor and the Crow**
    - **The Golden Goose Book**
    - **Johnny Crow's Garden**
    - **A Nursery Rhyme Picture Book**

## **Step 1**

- **For each volume, take a look at a published version with the tab, "READ NOW."**
- **Download the Plain Text UTF-8 version to your machine.**
- **Open it in Notepad or TextEdit.**
- **Save it as plain text.**


## **Step 2**
- Create a new folder named "Data_Plain" in your GitHub repository.
- Upload the plain-text files to that folder.

# <font color = 'red'> 🐹🐾 **Final Script to prepare input text for further analysis (e.g., Common Core Words, Wordcloud, Lexical Diversity, etc.)**</font>

  - # <font color = 'blue'> 🐹🐾 **Important & Useful!**</font>
  - ### **This script takes the plain text of the 10 volumes listed above as input.**