<a href="https://colab.research.google.com/github/kim-ji-youn/Study-with-NLP-books/blob/main/1.%20Mastering%20Natural%20Language%20Processing%20with%20Python/1.%20NLPwithString/1.%20Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tokenization

## Sentence tokenization
Sentence tokenization splits a document made up of several sentences into its individual sentences.
* sent_tokenize  
```
from nltk.tokenize import sent_tokenize
sent_tokenize(text)
```
* PunktSentenceTokenizer  
```
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(text)
```
* Loading a language-specific `pickle` file: sentence tokenization for languages other than English
```
tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
tokenizer.tokenize(text)
```
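
To see why a trained model such as Punkt is used instead of a simple rule, here is a minimal sketch of a naive splitter built on the standard `re` module. It is not how `sent_tokenize` works; `naive_sent_split` is a hypothetical helper, and both sample strings are made up for illustration.

```python
import re

def naive_sent_split(text):
    """Hypothetical helper: split after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

simple = "Python is dynamically typed. It supports multiple paradigms."
tricky = "Dr. Kim teaches NLP. The class starts at 9 a.m. on Monday."

simple_parts = naive_sent_split(simple)
tricky_parts = naive_sent_split(tricky)

print(simple_parts)  # two sentences, as expected
print(tricky_parts)  # also cut after "Dr." and "a.m."
```

The rule works on plain prose but breaks at abbreviations, which is the kind of case a Punkt model is designed to handle.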


In [4]:
!pip install nltk



In [None]:
import nltk
nltk.download('all')

In [24]:
# sent_tokenize
from nltk.tokenize import sent_tokenize
text = """Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.
"""
sentences = sent_tokenize(text)
print("total sentences are", len(sentences), "sentences")
for i, sent in enumerate(sentences) :
  print(i+1, "\t", sent)

total sentences are 6 sentences
1 	 Python is an interpreted, high-level and general-purpose programming language.
2 	 Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
3 	 Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
4 	 Python is dynamically typed and garbage-collected.
5 	 It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming.
6 	 Python is often described as a "batteries included" language due to its comprehensive standard library.


## Word tokenization
* word_tokenize
  * The most commonly used tokenizer
```
from nltk import word_tokenize
word_tokenize(text)
```
* TreebankWordTokenizer
  * Follows the conventions of the Penn Treebank corpus
  * Splits contractions into separate tokens, e.g. "don't" -> "do", "n't"
```
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)
```
* WordPunctTokenizer
  * Splits at punctuation, e.g. "don't" -> "don", "'", "t"
  * Every run of punctuation becomes a separate token
```
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(text)
```

In [45]:
#word_tokenize
from nltk import word_tokenize
text = "Don't hesitate to ask questions. Please ask me anything!"
word_tokenize(text)

['Do',
 "n't",
 'hesitate',
 'to',
 'ask',
 'questions',
 '.',
 'Please',
 'ask',
 'me',
 'anything',
 '!']

In [46]:
sentences = sent_tokenize(text)
tokens = []
num_tokens = 0
for sent in sentences :
  sent_tokens = word_tokenize(sent)  # tokenize each sentence only once
  tokens.append(sent_tokens)
  num_tokens += len(sent_tokens)

print("Total tokens are", num_tokens, "tokens")
for i, sent in enumerate(tokens):
  print(f"=== {i+1} sentence ===")
  for j, token in enumerate(tokens[i]):
    print(j+1, "\t", token)

Total tokens are 12 tokens
=== 1 sentence ===
1 	 Do
2 	 n't
3 	 hesitate
4 	 to
5 	 ask
6 	 questions
7 	 .
=== 2 sentence ===
1 	 Please
2 	 ask
3 	 me
4 	 anything
5 	 !


In [47]:
#TreebankWordTokenizer
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print("Total number of tokens: ", len(tokens))
print(tokens)

Total number of tokens:  11
['Do', "n't", 'hesitate', 'to', 'ask', 'questions.', 'Please', 'ask', 'me', 'anything', '!']


In [48]:
#WordPunctTokenizer
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)
print("Total number of tokens: ", len(tokens))
print(tokens)

Total number of tokens:  13
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions', '.', 'Please', 'ask', 'me', 'anything', '!']
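
The 13-token result above can be reproduced with the standard library alone. NLTK documents `WordPunctTokenizer` as using the regular expression `\w+|[^\w\s]+`; the sketch below assumes that pattern and applies it with `re.findall`, so it is an approximation for illustration, not NLTK's own code.

```python
import re

text = "Don't hesitate to ask questions. Please ask me anything!"

# Assumed pattern: a run of word characters, or a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

tokens = WORD_PUNCT.findall(text)
print("Total number of tokens: ", len(tokens))
print(tokens)
```

Because the apostrophe is neither a word character nor whitespace, it always becomes its own token, which is why "Don't" ends up as three pieces.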


## Tokenization with regular expressions

1. Using the class: import RegexpTokenizer  
1-1. Split on whitespace and special characters
```
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")  # raw string avoids invalid-escape warnings
tokenizer.tokenize(text)
```
1-2. Split on whitespace only
```
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)  # the pattern matches the separators, not the tokens
tokenizer.tokenize(text)
```

2. Using the function: import regexp_tokenize
```
from nltk.tokenize import regexp_tokenize
print(regexp_tokenize(text, pattern = r'\w+|\$[\d\.]+|\S+'))
```
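
The `gaps` argument decides whether the pattern describes the tokens themselves or the separators between them. A rough standard-library sketch of the two modes, assuming that matching mode behaves like `re.findall` and gap mode like `re.split` with empty strings dropped:

```python
import re

text = "Don't hesitate to ask questions. Please ask me anything!"

# Pattern describes the tokens (comparable to gaps=False, the default).
match_tokens = re.findall(r"\w+", text)

# Pattern describes the separators (comparable to gaps=True);
# empty pieces are filtered out.
gap_tokens = [t for t in re.split(r"\s+", text) if t]

print(match_tokens)
print(gap_tokens)
```

Matching on `\w+` discards punctuation entirely, while splitting on `\s+` keeps it attached to the neighbouring word.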


In [54]:
# whitespace and special characters
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")
tokens = tokenizer.tokenize(text)
print(tokens)

['Don', 't', 'hesitate', 'to', 'ask', 'questions', 'Please', 'ask', 'me', 'anything']


In [55]:
# whitespace only
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps = True)
tokens = tokenizer.tokenize(text)
print(tokens)

["Don't", 'hesitate', 'to', 'ask', 'questions.', 'Please', 'ask', 'me', 'anything!']


In [51]:
# using the function
from nltk.tokenize import regexp_tokenize
print(regexp_tokenize(text, pattern = r'\w+|\$[\d\.]+|\S+'))

['Don', "'t", 'hesitate', 'to', 'ask', 'questions', '.', 'Please', 'ask', 'me', 'anything', '!']
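
The `"'t"` token in this output follows from the alternatives being tried left to right at each position. A small sketch with `re.findall`, standing in for `regexp_tokenize` on the assumption that both apply the pattern the same way; the second sentence is a made-up example that exercises the `\$[\d\.]+` branch:

```python
import re

PATTERN = r"\w+|\$[\d\.]+|\S+"

contraction = re.findall(PATTERN, "Don't")
price = re.findall(PATTERN, "It costs $3.50 today!")

print(contraction)
print(price)
```

`\w+` takes `Don`; at the apostrophe neither `\w+` nor `\$[\d\.]+` can match, so `\S+` consumes `'t` up to the next whitespace. In the second sentence the dollar amount stays together because the `\$[\d\.]+` branch is tried before `\S+`.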
