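#### [NLP Preprocessing - Restoring Base Forms]
- Map words that share a meaning but differ in surface form to a single form
- This reduces the number of distinct tokens (the vocabulary size)
- Methods
	* Morphological approach: stemming (rule-based suffix stripping)
	* Dictionary-based approach: lemmatization (looking up the lemma)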
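
The two approaches listed above can be contrasted with a toy sketch. The suffix list and the lookup table below are hypothetical and chosen only for illustration; they are not the rules or data that NLTK uses.

```python
# Toy illustration only: the suffix rules and the lookup table are
# hypothetical, not what NLTK's stemmers or WordNet actually use.
SUFFIXES = ("ing", "ed", "es", "s")    # checked in this order
LEMMA_TABLE = {                        # tiny hand-written dictionary
    "working": "work", "worked": "work", "works": "work",
    "amusing": "amuse", "amused": "amuse", "amuses": "amuse",
}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

tokens = ["working", "worked", "works", "amusing", "amused", "amuses"]
stems = [toy_stem(t) for t in tokens]
lemmas = [toy_lemmatize(t) for t in tokens]

print(len(set(tokens)), "distinct tokens before")
print(sorted(set(stems)), "after toy stemming")
print(sorted(set(lemmas)), "after toy lemmatization")
```

Both variants collapse six distinct tokens into two. The stemmer may leave a non-word such as `amus`, while the lookup returns a dictionary form but only for words it knows.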

[1] Loading modules <hr>

In [None]:
## Load the stemming and lemmatization classes
from nltk.stem import LancasterStemmer, WordNetLemmatizer

[2] Stemming <hr>

In [13]:
## Create the stemmer instance
stemmer = LancasterStemmer()

## Stem each word
print('[working, worked, works]', end=' stemming ==> ')
print(stemmer.stem('working'), stemmer.stem('worked'), stemmer.stem('works'))

[working, worked, works] stemming ==> work work work


In [14]:
print('[amusing, amuses, amused]', end=' stemming ==> ')
print(stemmer.stem('amusing'), stemmer.stem('amuses'), stemmer.stem('amused'))

[amusing, amuses, amused] stemming ==> amus amus amus


In [15]:
print('[happier, happiest]', end=' stemming ==> ')
print(stemmer.stem('happier'), stemmer.stem('happiest'))

print('[fancier, fanciest]', end=' stemming ==> ')
print(stemmer.stem('fancier'), stemmer.stem('fanciest'))

[happier, happiest] stemming ==> happy happiest
[fancier, fanciest] stemming ==> fant fanciest
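
In the output above, each comparative and its superlative end up with different stems. One way to spot such cases across a larger word list is to group the inputs by the stem they produce. The sketch below is a minimal helper under that assumption. `group_by_stem` and `drop_er` are hypothetical names, and `drop_er` is only a stand-in so the block runs on its own.

```python
from collections import defaultdict

def group_by_stem(words, stem_fn):
    """Map each stem to the list of input words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem_fn(word)].append(word)
    return dict(groups)

# Stand-in stemmer so the sketch is self-contained (hypothetical rule:
# drop a trailing "er"). In the notebook, pass stemmer.stem instead.
def drop_er(word):
    return word[:-2] if word.endswith("er") else word

words = ["happier", "happiest", "fancier", "fanciest"]
for stem, members in group_by_stem(words, drop_er).items():
    print(stem, "<=", members)
```

Any group that contains a single word while its obvious variants sit in other groups is a candidate for closer inspection or for switching to lemmatization.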


[3] Lemmatization <hr>

In [19]:
## Create the lemmatizer instance
wnLemma = WordNetLemmatizer()

## Lemmatize each word ('v' = verb)
print('[working, worked, works]', end=' lemmatization ==> ')
print(wnLemma.lemmatize('working', 'v'), wnLemma.lemmatize('worked', 'v'), wnLemma.lemmatize('works', 'v'))

[working, worked, works] lemmatization ==> work work work


In [20]:
print('[amusing, amuses, amused]', end=' lemmatization ==> ')
print(wnLemma.lemmatize('amusing', 'v'), wnLemma.lemmatize('amuses', 'v'), wnLemma.lemmatize('amused', 'v'))

[amusing, amuses, amused] lemmatization ==> amuse amuse amuse


In [21]:
print('[happier, happiest]', end=' lemmatization ==> ')
print(wnLemma.lemmatize('happier', 'a'), wnLemma.lemmatize('happiest', 'a'))

print('[fancier, fanciest]', end=' lemmatization ==> ')
print(wnLemma.lemmatize('fancier', 'a'), wnLemma.lemmatize('fanciest', 'a'))

[happier, happiest] lemmatization ==> happy happy
[fancier, fanciest] lemmatization ==> fancy fancy


[4] Sentence-level lemmatization <hr>

In [33]:
# Text data
text = "What a Merry-Go-Round is the eighteenth collection by British fashion designer Alexander McQueen, made for the Autumn/Winter 2001 season of his fashion house Alexander McQueen. The collection drew on imagery of clowns and carnivals, inspired by McQueen's feelings about childhood and his experiences in the fashion industry. The designs were influenced by military chic, cinema such as Nosferatu (1922) and Cabaret (1972), 1920s flapper fashion, and the French Revolution. The palette comprised dark colours complemented with neutrals and muted greens. The show marked the first appearance of the skull motif that became a signature of the brand."

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import pos_tag

for sent in sent_tokenize(text):
    # Tag each token with its Penn Treebank part of speech
    pos_tokens = pos_tag(word_tokenize(sent))
    # print(pos_tokens)
    ## Keep adjectives (JJ*) and verbs (VB*); map them to WordNet 'a' / 'v'
    lemmas = [wnLemma.lemmatize(word, 'a' if pos.startswith('JJ') else 'v')
              for word, pos in pos_tokens if pos[:2] in ('JJ', 'VB')]
    print(lemmas)
	

['be', 'eighteenth', 'British', 'make']
['draw', 'inspire']
['be', 'influence', 'military', 'such', 'flapper', 'French']
['comprise', 'dark', 'complement', 'mute']
['mark', 'first', 'become']
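
The loop above handles only adjective and verb tags. A slightly more general mapping from Penn Treebank tag prefixes to the WordNet part-of-speech codes (`'a'`, `'v'`, `'n'`, `'r'`) could look like the sketch below. `penn_to_wordnet` is a hypothetical helper name, and the tagged pairs are hand-written examples rather than real `pos_tag` output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    prefix_map = {"JJ": "a", "VB": "v", "NN": "n", "RB": "r"}
    return prefix_map.get(tag[:2])

# Hand-written (word, tag) pairs for illustration only.
tagged = [("drew", "VBD"), ("darker", "JJR"), ("colours", "NNS"), ("the", "DT")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

In the notebook, tokens whose tag maps to `None` could be skipped or passed through unchanged, and the returned code could be given as the second argument to `wnLemma.lemmatize`.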
