#### NLP preprocessing
- Tokenization
- Part-of-speech (PoS) tagging
- Stop word removal
- Text normalization
  - Spelling correction
  - Stemming (morphological analysis)
  - Lemmatization
- Named entity recognition (NER)
- Word sense disambiguation
- Sentence boundary detection

In [80]:
# tokenize

import nltk

nltk.download('punkt')
from nltk import word_tokenize

words = word_tokenize('I am reading NLP Fundamentals')
words

[nltk_data] Downloading package punkt to /home/shane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['I', 'am', 'reading', 'NLP', 'Fundamentals']
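
Sentence boundary detection, listed at the top of this section, usually runs before word tokenization. The cell below is a minimal stdlib sketch of the idea, not NLTK's Punkt algorithm: it splits at `.`, `!` or `?` when followed by whitespace and an uppercase letter, and skips splits after a known abbreviation. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical abbreviation list for this sketch; real detectors learn these from data.
ABBREVIATIONS = {'dr.', 'mr.', 'mrs.', 'ms.', 'prof.', 'e.g.', 'i.e.', 'etc.'}

def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace and an uppercase letter."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        candidate = text[start:match.start() + 1]
        # Do not split when the candidate ends with a known abbreviation.
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = 'Dr. Smith is reading NLP Fundamentals. It covers tokenization! Is it useful? Yes.'
sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```

A rule like this breaks on lowercase sentence starts and on abbreviations missing from the list, which is why trained models are normally preferred.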

In [81]:
# PoS (part-of-speech) tagging

nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

pos_tag(words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/shane/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('I', 'PRP'),
 ('am', 'VBP'),
 ('reading', 'VBG'),
 ('NLP', 'NNP'),
 ('Fundamentals', 'NNS')]

In [82]:
# removing stop words

nltk.download('stopwords')
from nltk.corpus import stopwords

en_stopwords = stopwords.words('english')
print(f'{en_stopwords[:5]}...{len(en_stopwords)}')

['i', 'me', 'my', 'myself', 'we']...179


[nltk_data] Downloading package stopwords to /home/shane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [83]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

sentence = 'Python was created in the early 1990s by Guido van Rossum at Stichting Mathematisch Centrum (CWI, see https://www.cwi.nl/) in the Netherlands as a successor of a language called ABC. Guido remains Python’s principal author, although it includes many contributions from others.'
print(f'{"original":-^30}')
print(sentence)

words = word_tokenize(sentence)
en_stopwords = stopwords.words('english')
print(f'{"stopwords":-^30}')
print(' '.join([word for word in words if word not in en_stopwords]))

-----------original-----------
Python was created in the early 1990s by Guido van Rossum at Stichting Mathematisch Centrum (CWI, see https://www.cwi.nl/) in the Netherlands as a successor of a language called ABC. Guido remains Python’s principal author, although it includes many contributions from others.
----------stopwords-----------
Python created early 1990s Guido van Rossum Stichting Mathematisch Centrum ( CWI , see https : //www.cwi.nl/ ) Netherlands successor language called ABC . Guido remains Python ’ principal author , although includes many contributions others .


[nltk_data] Downloading package punkt to /home/shane/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/shane/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
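
The filter in the cell above compares tokens case-sensitively against a lowercase stop-word list, and it keeps punctuation tokens such as `(` and `,`. A possible variant, sketched here with only the standard library, lowercases each token before the lookup and also drops tokens made up entirely of punctuation. The small `STOP_WORDS` set and the `remove_stop_words` helper are assumptions for illustration; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
import string

# Tiny hypothetical stop-word set so the sketch runs without NLTK data.
STOP_WORDS = {'the', 'was', 'in', 'by', 'at', 'a', 'of', 'as'}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words (case-insensitively) and tokens made only of punctuation."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        if all(char in string.punctuation for char in token):
            continue
        kept.append(token)
    return kept

tokens = ['The', 'language', 'was', 'created', 'in', 'the', 'Netherlands', '.']
filtered = remove_stop_words(tokens)
print(filtered)
```

Using a set rather than a list keeps each membership test constant-time, which matters once the token count grows.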


In [84]:
# spelling correction: tokenize -> autocorrect each token -> rejoin

import nltk
from nltk import word_tokenize
from autocorrect import Speller

# nltk.download('punkt')

sentence = 'Natureal Languaage Proccessing deals with the art of extractiing insights from Natyral Langualge'
spell = Speller(lang='en')
' '.join(map(spell, word_tokenize(sentence)))

'Natural Language Processing deals with the art of extracting insights from Natural Language'
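
To show roughly what a spelling corrector does, here is a stdlib sketch of the classic edit-distance approach: generate every string one edit away from the input and keep the ones found in a vocabulary. It is not claimed to be how the `autocorrect` package works internally. `VOCABULARY`, `edits1`, and `correct` are hypothetical names, and the vocabulary is deliberately tiny.

```python
import string

# Hypothetical vocabulary; real spell checkers rank candidates by word frequency.
VOCABULARY = {'natural', 'language', 'processing', 'extracting', 'insights'}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + letter + right[1:]
                for left, right in splits if right for letter in letters]
    inserts = [left + letter + right for left, right in splits for letter in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return a known word one edit away, or the input unchanged if none exists."""
    lowered = word.lower()
    if lowered in VOCABULARY:
        return lowered
    candidates = sorted(edits1(lowered) & VOCABULARY)
    return candidates[0] if candidates else word

corrected = [correct(word) for word in ['Natureal', 'Languaage', 'Proccessing', 'Natyral']]
print(corrected)
```

Unlike the cell above, this sketch lowercases its corrections and breaks ties alphabetically rather than by frequency.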

In [85]:
# stemming

import nltk
stemmer = nltk.stem.PorterStemmer()

stemmer.stem('waterings')

'water'
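
Stemming is rule-based suffix stripping, so the result need not be a dictionary word. The toy stemmer below illustrates that idea with a single pass over a hypothetical suffix list. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem.

```python
# Hypothetical suffix list, checked in order (longest first).
SUFFIXES = ('ings', 'ing', 'ed', 'es', 's')

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem_length:
            return lowered[:-len(suffix)]
    return lowered

stems = [toy_stem(word) for word in ['waterings', 'watering', 'watered', 'waters', 'water']]
print(stems)

# A naive rule set also produces non-words: 'running' loses 'ing' and becomes 'runn'.
print(toy_stem('running'))
```

The `min_stem_length` guard keeps short words such as 'sing' intact, at the cost of missing some legitimate suffixes.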

In [86]:
# lemmatization

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# Without a PoS hint, lemmatize() treats the word as a noun; pos='v' treats it as a verb.
lemmatizer.lemmatize('waterings'), lemmatizer.lemmatize('waterings', pos='v')

[nltk_data] Downloading package wordnet to /home/shane/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


('watering', 'water')
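
Because the lemma depends on the part of speech, a common pattern is to convert the Penn Treebank tags returned by `pos_tag` into the single-letter codes that `lemmatize()` accepts (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a minimal sketch of that mapping. `penn_to_wordnet` is a hypothetical name, and the tagged pairs are copied from the PoS tagging output earlier in this section.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # everything else falls back to noun, lemmatize()'s default

tagged = [('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'),
          ('NLP', 'NNP'), ('Fundamentals', 'NNS')]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=code)`.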

In [93]:
# Named Entity Recognition (NER)

import nltk
from nltk import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')

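# This sentence contains no named entities, so the result holds only (token, tag) pairs.
sentence = 'The quick brown fox that showed up last dark and stormy night jumps over the lazy dog.'
# binary=True labels every entity as NE instead of PERSON, ORGANIZATION, etc.
tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(sentence)), binary=True)
list(tree)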

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/shane/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/shane/nltk_data...
[nltk_data]   Package words is already up-to-date!


[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('that', 'WDT'),
 ('showed', 'VBD'),
 ('up', 'IN'),
 ('last', 'JJ'),
 ('dark', 'NN'),
 ('and', 'CC'),
 ('stormy', 'NN'),
 ('night', 'NN'),
 ('jumps', 'NNS'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN'),
 ('.', '.')]
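
Word sense disambiguation is the one item in the list at the top of this section without a cell of its own. The sketch below shows the idea behind the simplified Lesk approach: pick the sense whose dictionary gloss shares the most content words with the surrounding context. The `SENSES` inventory, the `STOP_WORDS` set, and both helper functions are hypothetical and hand-written for illustration. A real system would take its glosses from a resource such as WordNet.

```python
# Hypothetical mini sense inventory with hand-written glosses.
SENSES = {
    'bank': {
        'financial_institution': 'an institution that accepts deposits and lends money',
        'river_side': 'sloping land beside a river or lake',
    },
}

# Small hypothetical stop-word set so function words do not count as overlap.
STOP_WORDS = {'a', 'an', 'and', 'the', 'that', 'or', 'of', 'to', 'at', 'on', 'in', 'i', 'my', 'we'}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def simplified_lesk(word, context):
    """Return the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

money_sense = simplified_lesk('bank', 'I went to the bank to deposit money')
river_sense = simplified_lesk('bank', 'We sat on the bank of the river')
print(money_sense, river_sense)
```

Ties go to the first sense listed, and exact string matching means 'deposit' does not match 'deposits'; stemming or lemmatizing both sides first would reduce that problem.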