# NLTK -- WordNet
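1. WordNet is an `English dictionary` (more precisely, a lexical database) created and maintained by Princeton University.
2. Wikipedia article (Chinese): https://zh.wikipedia.org/wiki/WordNet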

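- The usual way to use WordNet in Python is through NLTK (Natural Language Toolkit).
- NLTK is one of the most popular natural language processing libraries. It is written in Python and backed by a strong community.
- NLTK supports part-of-speech tagging, syntactic parsing, information extraction, and more.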

In [None]:
!pip install nltk

In [1]:
import nltk

# Download the WordNet data
nltk.download('wordnet')

from nltk.corpus import wordnet as wn

# Make sure the corpus is loaded
wn.ensure_loaded()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Looking up synsets (synonym sets)

In [2]:
'''
This returns five different synsets (synonym sets),
which means "car" has five different senses (five different synonym groups).
'''
# Look up all synsets for a word
wn.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]
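The synset names above follow the pattern `lemma.pos.sense_number`, for example `car.n.01` is the first noun sense of "car". As a minimal sketch of how to read such a name, the helper below splits it with plain string operations. The function name `parse_synset_name` is hypothetical and is not part of NLTK.

```python
def parse_synset_name(name):
    """Split a name such as 'car.n.01' into (lemma, pos, sense_number)."""
    # Split from the right so that any dots inside the lemma are preserved.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)

for name in ["car.n.01", "car.n.04", "cable_car.n.01"]:
    print(parse_synset_name(name))
```

Here `n` marks a noun, and the number distinguishes the different senses of the same lemma.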

In [3]:
car = wn.synset('car.n.01')
car.definition()  # definition (gloss) of this sense

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

In [4]:
car_2 = wn.synset('car.n.02')
car_2.definition()

'a wheeled vehicle adapted to the rails of railroad'

In [5]:
'''
lemma_names() shows which words WordNet treats as synonyms for this sense of "car":
auto, automobile, machine, and motorcar.
'''

car.lemma_names()

['car', 'auto', 'automobile', 'machine', 'motorcar']

### WordNet hypernyms: hypernym_paths()
The higher up in the path, the more abstract the concept; the lower down, the more specific.

In [6]:
car.hypernym_paths()[0]

[Synset('entity.n.01'),
 Synset('physical_entity.n.01'),
 Synset('object.n.01'),
 Synset('whole.n.02'),
 Synset('artifact.n.01'),
 Synset('instrumentality.n.03'),
 Synset('container.n.01'),
 Synset('wheeled_vehicle.n.01'),
 Synset('self-propelled_vehicle.n.01'),
 Synset('motor_vehicle.n.01'),
 Synset('car.n.01')]

### WordNet semantic similarity

In [7]:
car = wn.synset('car.n.01')
novel = wn.synset('novel.n.01')
dog = wn.synset('dog.n.01')
motorcycle = wn.synset('motorcycle.n.01')
cat = wn.synset('cat.n.01')

print('car/novel:', car.path_similarity(novel))
print('car/dog:', car.path_similarity(dog))
print('car/motorcycle:', car.path_similarity(motorcycle))
print('car/cat:', car.path_similarity(cat))

car/novel: 0.05555555555555555
car/dog: 0.07692307692307693
car/motorcycle: 0.3333333333333333
car/cat: 0.05555555555555555
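`path_similarity` scores two synsets as `1 / (1 + d)`, where `d` is the length of the shortest path between them in the hypernym hierarchy. The sketch below reproduces the car/motorcycle score on a toy taxonomy without using NLTK. The car chain is copied from the `hypernym_paths()[0]` output above. Placing `motorcycle` directly under `motor_vehicle` is an assumption that is consistent with the 0.333 score, and the single-parent dictionary is a simplification, since real synsets can have several hypernym paths.

```python
# Toy child -> parent map (simplified: one parent per node).
PARENT = {
    "car": "motor_vehicle",
    "motorcycle": "motor_vehicle",  # assumption, see the note above
    "motor_vehicle": "self-propelled_vehicle",
    "self-propelled_vehicle": "wheeled_vehicle",
    "wheeled_vehicle": "container",
    "container": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "whole",
    "whole": "object",
    "object": "physical_entity",
    "physical_entity": "entity",
}

def path_to_root(node):
    """Return the chain from a node up to the root (most abstract concept)."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def toy_path_similarity(a, b):
    """1 / (1 + number of edges on the path between a and b)."""
    steps_from_b = {node: i for i, node in enumerate(path_to_root(b))}
    for steps_from_a, node in enumerate(path_to_root(a)):
        if node in steps_from_b:  # first shared ancestor
            return 1 / (1 + steps_from_a + steps_from_b[node])
    return None  # no shared ancestor

print(toy_path_similarity("car", "motorcycle"))  # one step up, one step down
print(toy_path_similarity("car", "car"))
```

Car and motorcycle are two edges apart (both sit directly under `motor_vehicle`), which gives 1/3. The much lower scores for novel, dog, and cat reflect longer paths through more abstract ancestors.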


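# NLTK -- word_tokenize (word tokenization)
- The `punkt` package contains a number of pre-trained tokenizer models.
- Tokenizing means splitting a sentence into individual meaningful units (tokens).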

In [8]:
from nltk.tokenize import word_tokenize

nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
string = "NORAD regularly tracks Santa's trip around the world each Christmas, but this year is a bit different."

word_tokenize(string)

['NORAD',
 'regularly',
 'tracks',
 'Santa',
 "'s",
 'trip',
 'around',
 'the',
 'world',
 'each',
 'Christmas',
 ',',
 'but',
 'this',
 'year',
 'is',
 'a',
 'bit',
 'different',
 '.']
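To make the output above more concrete, here is a rough regular-expression approximation that happens to produce the same tokens for this sentence. It is only a sketch and not how `word_tokenize` works internally. The function name `rough_tokenize` is hypothetical, and the pattern ignores many cases that NLTK handles, such as other contractions and quotation marks.

```python
import re

def rough_tokenize(text):
    """Split text into the clitic 's, runs of word characters, and single punctuation marks."""
    return re.findall(r"'s|\w+|[^\w\s]", text)

string = "NORAD regularly tracks Santa's trip around the world each Christmas, but this year is a bit different."
tokens = rough_tokenize(string)
print(tokens)
print(len(tokens))
```

The two points to notice are that punctuation becomes its own token and that the possessive `'s` is separated from `Santa`.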

# NLTK -- pos_tag (part-of-speech tagging)
- Tag reference (Penn Treebank): https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [10]:
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [11]:
pos_tag(word_tokenize(string))

[('NORAD', 'NNP'),
 ('regularly', 'RB'),
 ('tracks', 'VBZ'),
 ('Santa', 'NNP'),
 ("'s", 'POS'),
 ('trip', 'NN'),
 ('around', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('each', 'DT'),
 ('Christmas', 'NNP'),
 (',', ','),
 ('but', 'CC'),
 ('this', 'DT'),
 ('year', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('bit', 'RB'),
 ('different', 'JJ'),
 ('.', '.')]
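The tagged result is a plain list of `(token, tag)` tuples, so it can be filtered with ordinary Python. As a small sketch, the snippet below keeps only the nouns, relying on the fact that Penn Treebank noun tags all start with `NN`. The `tagged` list is copied from the output above so the example runs without NLTK.

```python
# (token, tag) pairs copied from the pos_tag output above.
tagged = [
    ('NORAD', 'NNP'), ('regularly', 'RB'), ('tracks', 'VBZ'), ('Santa', 'NNP'),
    ("'s", 'POS'), ('trip', 'NN'), ('around', 'IN'), ('the', 'DT'),
    ('world', 'NN'), ('each', 'DT'), ('Christmas', 'NNP'), (',', ','),
    ('but', 'CC'), ('this', 'DT'), ('year', 'NN'), ('is', 'VBZ'),
    ('a', 'DT'), ('bit', 'RB'), ('different', 'JJ'), ('.', '.'),
]

# NN = common noun, NNP = proper noun; both start with "NN".
nouns = [token for token, tag in tagged if tag.startswith('NN')]
print(nouns)
```

The same pattern works for other tag groups, for example `VB` for verbs or `JJ` for adjectives.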

# NLTK -- stemming
Stemming reduces a word to its stem or root form (the result is not necessarily a complete, meaningful word).

In [12]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

[porter.stem(word) for word in word_tokenize(string)]

['norad',
 'regularli',
 'track',
 'santa',
 "'s",
 'trip',
 'around',
 'the',
 'world',
 'each',
 'christma',
 ',',
 'but',
 'thi',
 'year',
 'is',
 'a',
 'bit',
 'differ',
 '.']
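Stems such as `thi` and `christma` look odd because the Porter stemmer applies suffix rules mechanically and does not check a dictionary. The sketch below implements only the first rule group (step 1a, plural endings) of the original Porter algorithm; all later steps are omitted, so it is not a replacement for `PorterStemmer`. The function name `porter_step_1a` is hypothetical, and the guard for very short words is an assumption added so that `is` stays unchanged, as in the output above.

```python
def porter_step_1a(word):
    """Apply only step 1a (plural endings) of the original Porter algorithm."""
    word = word.lower()
    if len(word) <= 2:  # assumption: leave very short words untouched
        return word
    # Rules are tried in order; only the first matching suffix is rewritten.
    rules = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["tracks", "this", "Christmas", "is", "world"]:
    print(w, "->", porter_step_1a(w))
```

A trailing `s` is removed whether or not it is a plural marker, which is why `this` becomes `thi` and `Christmas` becomes `christma`.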

# NLTK -- sent_tokenize (sentence splitting)

In [13]:
from nltk.tokenize import sent_tokenize

strs = "NORAD regularly tracks Santa's trip around the world each Christmas, but this year is a bit different. On Wednesday (Dec. 23), the Federal Aviation Administration gave Santa and his reindeer-powered sleigh an official commercial space license for launches and landings"

sent_tokenize(strs)

["NORAD regularly tracks Santa's trip around the world each Christmas, but this year is a bit different.",
 'On Wednesday (Dec. 23), the Federal Aviation Administration gave Santa and his reindeer-powered sleigh an official commercial space license for launches and landings']
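Note that `sent_tokenize` did not split after the abbreviation `Dec.`. For comparison, the sketch below shows a naive splitter that breaks after every period followed by whitespace. It uses only the standard library, and the function name `naive_split` is hypothetical.

```python
import re

def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)

strs = "NORAD regularly tracks Santa's trip around the world each Christmas, but this year is a bit different. On Wednesday (Dec. 23), the Federal Aviation Administration gave Santa and his reindeer-powered sleigh an official commercial space license for launches and landings"

pieces = naive_split(strs)
for piece in pieces:
    print(repr(piece))
```

The naive version produces three pieces, wrongly cutting the second sentence at `Dec.`, while the pre-trained `punkt` model keeps it whole.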