## NLTK Natural Language Processing Package

#### NLTK (Natural Language Toolkit): a Python package for natural language processing and document analysis, originally developed for teaching

#### Main features provided by the NLTK package

* Corpora
* Tokenization
* Morphological analysis
* Part-of-speech tagging


In [2]:
# ! pip install nltk
import nltk
nltk.__version__

'3.5'

### Downloading corpora

In [4]:
nltk.download("book", quiet=True)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


#### Sample works in the gutenberg corpus, which contains literary works whose copyright has expired

In [6]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

#### William Shakespeare's Hamlet (1599): the original text is included as-is

In [7]:
hamlet_raw = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
print(hamlet_raw[:924])

[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not a Mouse stirring

   Barn. Well, goodnight. If you do meet Horatio and
Marcellus, the Riuals of my Watch, bid them make hast.
Enter Horatio and Marcellus.

  Fran. I thinke I heare them. Stand: who's there?
  Hor. Friends to this ground

   Mar. And Leige-men to the Dane

   Fran. Giue you good night

   Mar. O farwel honest Soldier, who hath relieu'd you?
  Fra. Barnardo ha's my place: giue you goodnight.

Exit Fran.




#### Jane Austen's Emma (1816)

In [10]:
emma_raw = nltk.corpus.gutenberg.raw('austen-emma.txt')
print(emma_raw[:1302])

[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o

## Tokenization

#### A unit of a string is called a token, and splitting a string into tokens is called tokenizing.
#### A function that splits a string into tokens is called a tokenizer. A tokenizer takes a string as input and returns a list of token strings.

In [14]:
from nltk.tokenize import sent_tokenize
print(sent_tokenize(emma_raw[:1000])[3])

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.
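The sentence above contains the abbreviation `Mr.`, yet `sent_tokenize` returns it in one piece. As a rough illustration of why a dedicated sentence tokenizer helps, the sketch below splits on sentence-ending punctuation with a plain regular expression. The `naive_sent_split` helper and the sample text are made up for this comparison and are not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Hand-typed sample, not the exact corpus slice used above.
sample = ("Sixteen years had Miss Taylor been in Mr. Woodhouse's family. "
          "She was very fond of Emma.")

pieces = naive_sent_split(sample)
for piece in pieces:
    print(repr(piece))
```

The naive rule cuts the first sentence right after `Mr.`, so it yields three pieces instead of two.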


In [18]:
from nltk.tokenize import word_tokenize
word_tokenize(emma_raw[50:100])

['Emma',
 'Woodhouse',
 ',',
 'handsome',
 ',',
 'clever',
 ',',
 'and',
 'rich',
 ',',
 'with',
 'a']
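For comparison, a minimal sketch of the simplest possible tokenizer, Python's built-in `str.split()`, which splits only on whitespace. The input is a hand-typed string that mirrors the tokens shown above; it is an assumption, not the exact slice `emma_raw[50:100]`.

```python
# Hand-typed approximation of the text tokenized above (assumption).
text = "Emma Woodhouse, handsome, clever, and rich, with a"

# str.split() with no arguments splits on runs of whitespace only.
whitespace_tokens = text.split()
print(whitespace_tokens)
```

Here the commas stay attached to the preceding words (for example `'Woodhouse,'`), whereas `word_tokenize` emits each comma as its own token.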

In [19]:
# Tokenizing with a regular expression: RegexpTokenizer
from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"[\w]+")
# \w matches word characters (letters, digits, and underscore); for ASCII text it is the same as [a-zA-Z0-9_].

retokenize.tokenize(emma_raw[50:100])

['Emma', 'Woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a']
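One detail worth checking when using `\w`: for Python 3 string patterns it also matches non-ASCII letters and digits, and only reduces to `[a-zA-Z0-9_]` when the `re.ASCII` flag is set. The sketch below uses the standard `re` module directly rather than `RegexpTokenizer`, and the sample string is made up for the demonstration.

```python
import re

# "naïve café_1", written with escapes so the accented letters are unambiguous.
text = "na\u00efve caf\u00e9_1"

# Default (Unicode) matching: accented letters count as word characters.
unicode_tokens = re.findall(r"\w+", text)

# ASCII-only matching: accented letters act as separators.
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
```

For English-only text such as the Emma excerpt, the two modes give the same tokens.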

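## Morphological analysis
Morpheme: the smallest unit of language that carries meaning. In natural language processing, morphemes are commonly used as tokens. <br>
Morphological analysis: the task of identifying linguistic properties of a word, such as its root, prefixes, suffixes, and part of speech, and using them to find or process morphemes.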

* Stemming
* Lemmatization: recovering the lemma (dictionary form)
* Part-of-speech (POS) tagging

### Stemming

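Stemming removes suffixes or inflectional endings from inflected words to find a common base form for words that share a meaning. <br>
NLTK provides `PorterStemmer`, `LancasterStemmer`, and others. <br>
Because stemming simply strips endings, it does not always find the exact base form of a word.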

In [21]:
from nltk.stem import PorterStemmer,LancasterStemmer

st1 = PorterStemmer()
st2 = LancasterStemmer()

words = ["fly", "flies", "flying", "flew", "flown"]

print('PorterStemmer     :',[st1.stem(w) for w in words])
print('LancasterStemmer  :',[st2.stem(w) for w in words])


PorterStemmer     : ['fli', 'fli', 'fli', 'flew', 'flown']
LancasterStemmer  : ['fly', 'fli', 'fly', 'flew', 'flown']
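To make the "strip the ending" idea concrete, here is a toy suffix-stripping stemmer. It is a deliberately simplified sketch, not the Porter or Lancaster algorithm; the `SUFFIXES` tuple and the `toy_stem` function are hypothetical names chosen for this illustration.

```python
# Suffixes to try, in order (hypothetical rule set, far simpler than real stemmers).
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["fly", "flies", "flying", "flew", "flown"]
toy_stems = [toy_stem(w) for w in words]
print(toy_stems)
```

Rules of this kind never touch irregular forms such as `flew` and `flown`, and they can produce non-words such as `fli`, which is the limitation described above.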


### Lemmatization (extracting the lemma, or base form)
Lemmatization unifies the different forms of a word that share a meaning into a single dictionary form. <br>
Specifying the part of speech lets it find a more accurate base form.

- Stemming

   am → am

   the going → the go

   having → hav
   
<br>

- Lemmatization

  am → be

  the going → the going

  having → have

In [22]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
# pos="v" tells the lemmatizer to treat each word as a verb (the default is pos="n", noun)
[lm.lemmatize(w, pos="v") for w in words]

['fly', 'fly', 'fly', 'fly', 'fly']
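The role of the part of speech can be sketched with a tiny dictionary lookup. This is only a toy model of lemmatization under stated assumptions: the `TOY_LEMMAS` table and the `toy_lemmatize` function are invented for illustration and do not reflect how WordNet stores or looks up its data. The default of `pos="n"` mirrors the default of `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical lookup table keyed by (word form, part of speech).
TOY_LEMMAS = {
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("flying", "v"): "fly",
    ("flew", "v"): "fly",
    ("flown", "v"): "fly",
    ("saw", "v"): "see",   # "I saw it"
    ("saw", "n"): "saw",   # the cutting tool
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos); fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("flew"))        # no noun entry, so the word is returned as-is
print(toy_lemmatize("flew", "v"))   # verb entry maps to the base form
print(toy_lemmatize("saw", "n"), toy_lemmatize("saw", "v"))
```

The same surface form can map to different lemmas depending on the part of speech, which is why passing `pos` makes the result more accurate.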