##### Source: https://ithelp.ithome.com.tw/articles/10261285

## NLP Web

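To build patterns with regular expressions, we first import the re module. We then call re.sub() and pass its three required arguments:
- pattern: the regular expression; here we can use r"<.*?>"
- replacement_text (named repl in the re documentation): the string that replaces each match of the pattern; here, the empty string ''
- input (named string in the re documentation): the string to be searched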
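
The `?` in `<.*?>` makes the quantifier non-greedy, which is what lets each tag be matched on its own. A minimal sketch of the difference, using a made-up one-line snippet (the names `snippet`, `greedy` and `non_greedy` are illustrative and not part of the original notebook):

```python
import re

snippet = "<p>Hello <b>world</b></p>"

# Greedy: .* runs from the first "<" to the last ">", swallowing the text too.
greedy = re.sub(r"<.*>", "", snippet)

# Non-greedy: .*? stops at the nearest ">", so only the tags are removed.
non_greedy = re.sub(r"<.*?>", "", snippet)

print(repr(greedy))
print(repr(non_greedy))
```

The arguments are passed positionally in the order pattern, replacement, input string.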

In [1]:
import re
from nltk.tokenize import sent_tokenize  # NLTK's sent_tokenize() performs sentence segmentation

In [2]:
raw_text = """
<html>
   <head>
      <title>My Garden - Tomatoes</title>
   </head>
   <body>
   <h1>Garden Tomatoes</h1>
   <p>I decided to plant some tomatoes this spring. They're really taking off and I hope to have lots of tomatoes to give to all my friends and family this summer!</p>
   <p>Here are a few things I like about tomatoes:</p>
   <ol>
      <li>They taste great.</li>
      <li>They're good for me.</li>
      <li>They're easy to grow!</li>
   </ol>
   <p>Here's a picture of my garden:</p>
   <img src="http://www.mygardensite.com/images/my-garden-001.jpg" alt="a picture of my garden" />
   <p>Here's a <a href="http://www.welovetomatoes.com">link</a> to check out more interesting things about tomatoes!</p>
   </body>
</html>
"""


text_no_tags = re.sub(r"<.*?>", '', raw_text)
print(text_no_tags)



   
      My Garden - Tomatoes
   
   
   Garden Tomatoes
   I decided to plant some tomatoes this spring. They're really taking off and I hope to have lots of tomatoes to give to all my friends and family this summer!
   Here are a few things I like about tomatoes:
   
      They taste great.
      They're good for me.
      They're easy to grow!
   
   Here's a picture of my garden:
   
   Here's a link to check out more interesting things about tomatoes!
   




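## Cleaning Whitespace
We use the metacharacter \s, which matches spaces, tabs, and newlines. Since the meaningless gaps are two or more whitespace characters wide, the pattern can be written as \s{2,}. The code is as follows: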

In [53]:
# to remove redundant whitespaces
text_no_whitespace = re.sub(r"\s{2,}", ' ', text_no_tags)
text_no_whitespace
type(text_no_whitespace)

str
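
Note that `\s{2,}` leaves single whitespace characters alone, so a lone newline or tab survives, and the result can keep a leading or trailing space. A sketch of a stricter variant, assuming that behaviour is unwanted (`sample` is a made-up string, not taken from the notebook):

```python
import re

sample = "  My Garden\n   Tomatoes\tare\ngreat.  "

# The pattern used above: only runs of two or more whitespace characters collapse.
two_or_more = re.sub(r"\s{2,}", " ", sample)

# Stricter variant: collapse every whitespace run, then trim both ends.
normalised = re.sub(r"\s+", " ", sample).strip()

print(repr(two_or_more))
print(repr(normalised))
```

Which variant is preferable depends on whether single newlines carry meaning in the input.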

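## Sentence Segmentation

In Python, we use the Natural Language Toolkit (NLTK) to help with these processing tasks. The first step is to split the string above into individual sentences, a step known as sentence segmentation. A period (.) is a good indicator of where a sentence ends, but there are exceptions: periods used in abbreviations, such as Mr. Williams or Ph.D. The good news is that the tokenize module in NLTK already defines the function sent_tokenize() for sentence segmentation: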

In [52]:
# removing double quotes
text = re.sub(r"\"", '', text_no_whitespace)

# breaking text into sentences
text_sentences = sent_tokenize(text)
print(type(text_sentences))

# printing out sentences
for i, sent in enumerate(text_sentences):
    print("Sentence {}: {}".format(i + 1, sent), end = "\n\n")

<class 'list'>
Sentence 1: Facebook under fire over secret teen research
By Jane Wakefield
Technology reporter Published15 September 2021
Girl taking a selfie
IMAGE SOURCE,GETTY IMAGES
Image caption,
Teenage girls can be very conscious of body image - and Instagram can make them feel worse, the internal studies showed
Facebook-owned Instagram has been criticised for keeping secret its internal research into the effect social media had on teenager users.

Sentence 2: According to the Wall Street Journal, its studies showed teenagers blamed Instagram for increased levels of anxiety and depression.

Sentence 3: Campaign groups and MPs have said it is proof the company puts profit first.

Sentence 4: Instagram said the research showed its commitment to understanding complex and difficult issues.

Sentence 5: The Wall Street Journal's report, not disputed by Facebook, finds: A 2019 presentation slide said: We make body-image issues worse for one in three teenage girls
Another slide said tee
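
To see why a dedicated sentence tokenizer is useful, here is a sketch of the naive rule (split after every `.`, `!` or `?` that is followed by whitespace) using only the standard library. The sample sentence and the name `naive_split` are illustrative assumptions, not part of the original notebook:

```python
import re

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Mr. Williams holds a Ph.D. in botany. He grows tomatoes!"

for part in naive_split(sample):
    print(repr(part))
```

The naive rule cuts the text into four pieces instead of two, breaking after `Mr.` and `Ph.D.`; abbreviations like these are the kind of case sent_tokenize() is designed to handle.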

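## Tokenisation
Next we split each sentence into smaller units: words. Note that in English a word is usually regarded as the smallest unit that carries meaning, called a token. Splitting a string into tokens is called word segmentation, or tokenisation. For this we import another splitting function, word_tokenize():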

In [5]:
#List
from nltk.tokenize import word_tokenize

for i, sent in enumerate(text_sentences):
    print("Sentence {}: {}".format(i + 1, sent))
    tokens = word_tokenize(sent)
    print(tokens, end = "\n\n")

Sentence 1:  My Garden - Tomatoes Garden Tomatoes I decided to plant some tomatoes this spring.
['My', 'Garden', '-', 'Tomatoes', 'Garden', 'Tomatoes', 'I', 'decided', 'to', 'plant', 'some', 'tomatoes', 'this', 'spring', '.']

Sentence 2: They're really taking off and I hope to have lots of tomatoes to give to all my friends and family this summer!
['They', "'re", 'really', 'taking', 'off', 'and', 'I', 'hope', 'to', 'have', 'lots', 'of', 'tomatoes', 'to', 'give', 'to', 'all', 'my', 'friends', 'and', 'family', 'this', 'summer', '!']

Sentence 3: Here are a few things I like about tomatoes: They taste great.
['Here', 'are', 'a', 'few', 'things', 'I', 'like', 'about', 'tomatoes', ':', 'They', 'taste', 'great', '.']

Sentence 4: They're good for me.
['They', "'re", 'good', 'for', 'me', '.']

Sentence 5: They're easy to grow!
['They', "'re", 'easy', 'to', 'grow', '!']

Sentence 6: Here's a picture of my garden: Here's a link to check out more interesting things about tomatoes!
['Here', "'s"

In [55]:
#String
import nltk  # needed here because nltk.sent_tokenize() is called through the package name

sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python, \
Django and Data Ananlysis here. "

nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python, Django and Data Ananlysis here.']
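
Returning to word-level tokenisation, here is a rough sketch of the idea with a single regular expression from the standard library. This is not how word_tokenize() works internally; for instance, it splits `They're` into `They`, `'`, `re` rather than `They`, `'re`. The name `simple_tokenize` is hypothetical:

```python
import re

def simple_tokenize(sentence):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("They're easy to grow!"))
```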


### token.lower(): case conversion

In [6]:
tokenised = ["The", "spectators", "all", "stood", "and", "sang", "the", "national", "anthem"]
# lowercasing each token
tokens_lower = [token.lower() for token in tokenised] 
tokens_lower

['the',
 'spectators',
 'all',
 'stood',
 'and',
 'sang',
 'the',
 'national',
 'anthem']

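## Stemming
In linguistics, a word stem is the most basic, core form of a word. For example, friendships consists of friendship plus the affix -s, so friendship is its stem; friendship in turn consists of friend plus the affix -ship, so friend is its stem. Stemming can therefore give different results depending on the underlying approach or algorithm. We illustrate this with three common ones, the Porter, Lancaster, and Snowball stemming algorithms, and compare their output.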

In [7]:
# importing stemmer classes
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

tokens = ["the", "spectators", "all", "stood", "and", "sang", "the", "national", "anthem"]

# stemming
port = PorterStemmer()
stemmed_port = [port.stem(token) for token in tokens]

lanc = LancasterStemmer()
stemmed_lanc = [lanc.stem(token) for token in tokens]

snow = SnowballStemmer("english")
stemmed_snow = [snow.stem(token) for token in tokens]

# showing stemmed results
print("Porter: {}".format(stemmed_port)) 
print("Lancaster: {}".format(stemmed_lanc))
print("Snowball: {}".format(stemmed_snow))

Porter: ['the', 'spectat', 'all', 'stood', 'and', 'sang', 'the', 'nation', 'anthem']
Lancaster: ['the', 'spect', 'al', 'stood', 'and', 'sang', 'the', 'nat', 'anthem']
Snowball: ['the', 'spectat', 'all', 'stood', 'and', 'sang', 'the', 'nation', 'anthem']
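
To make the idea of stripping affixes concrete, here is a toy sketch that repeatedly removes a suffix from a small hard-coded list. It is far cruder than the Porter, Lancaster or Snowball algorithms, and `SUFFIXES` and `toy_stem` are names invented for this illustration:

```python
SUFFIXES = ("ship", "s")  # tiny illustrative list, not a real rule set

def toy_stem(word, min_len=3):
    """Strip listed suffixes until none applies or the stem would get too short."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[: -len(suffix)]
                stripped = True
                break
    return word

for w in ["friendships", "friendship", "friends", "friend"]:
    print(w, "->", toy_stem(w))
```

With this list, friendships is reduced in two passes (first -s, then -ship), mirroring the layered stems described above.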


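## Lemmatisation
Clearly, stemming alone does not fully meet our need to reduce inflection, so we turn to a form that better represents a word's base form: the lemma. For example, sings, singing, sang, and sung share the lemma sing. Below we use the WordNetLemmatizer class from the nltk.stem module to find lemmas. WordNet is a free, publicly available lexical database created at Princeton University.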

In [8]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jerrychien/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
from nltk.stem import WordNetLemmatizer

tokens = ["the", "spectators", "all", "stood", "and", "sang", "the", "national", "anthem"]

lemmatiser = WordNetLemmatizer()
lemmatised = [lemmatiser.lemmatize(token) for token in tokens]
print("lemmatised: {}".format(lemmatised))

lemmatised: ['the', 'spectator', 'all', 'stood', 'and', 'sang', 'the', 'national', 'anthem']
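
In the output above, stood and sang come back unchanged. A likely reason is that lemmatize() treats every token as a noun unless a part of speech is passed (for example `pos="v"`). The sketch below sidesteps WordNet entirely and uses a tiny hand-written lookup table, purely to show how a lemma lookup differs from suffix stripping; `LEMMA_TABLE` and `toy_lemmatise` are hypothetical names:

```python
# Hand-written lookup table for illustration only; WordNet is far larger.
LEMMA_TABLE = {
    "sings": "sing", "singing": "sing", "sang": "sing", "sung": "sing",
    "stood": "stand", "spectators": "spectator",
}

def toy_lemmatise(token):
    """Return the recorded lemma, or the token itself when it is unknown."""
    return LEMMA_TABLE.get(token, token)

tokens = ["the", "spectators", "all", "stood", "and", "sang"]
print([toy_lemmatise(t) for t in tokens])
```

Irregular forms such as sang cannot be reached by removing a suffix, which is why a dictionary-backed approach is needed for them.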


## Stopword Removal
Some words in a sentence contribute little to its meaning, such as a/an, the, and is/are. These are called stop words.

In [10]:
from nltk.corpus import stopwords
nltk.download("stopwords")

# defining stopwords in English
stop_words = set(stopwords.words("english"))

# removing stop words
words_no_stop = [word for word in lemmatised if word not in stop_words]
print("stop words removed: {}".format(words_no_stop))

stop words removed: ['spectator', 'stood', 'sang', 'national', 'anthem']


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jerrychien/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
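
One practical detail: the stop word entries are lowercase, so the membership test is case-sensitive. A sketch with a hand-picked subset standing in for the NLTK list (the set below is an assumption for illustration, not the real list):

```python
# A hand-picked subset for illustration; NLTK's English list is much longer.
stop_words = {"the", "a", "an", "is", "are", "and", "all"}

tokens = ["The", "spectators", "all", "stood", "and", "sang", "the", "anthem"]

# Comparing raw tokens misses "The" because the set holds lowercase entries.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token for the comparison removes it as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

This is one reason the tokens are lowercased before stop words are removed.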


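## POS Tagging
Part of Speech (POS) and Syntactic Analysis
In linguistics, words are classified into parts of speech (POS) according to their function and inflection. Common parts of speech include nouns, verbs, adjectives, adverbs, and prepositions. For example, the sentence "In God we trust." consists of a preposition (in) + noun (God) + pronoun (we) + verb (trust), in that order; its syntax differs from that of "We trust in God.", which consists of a pronoun, verb, preposition, and noun in that order. Starting from parts of speech and following the rules of grammar, we analyse the structure of a sentence; this process is called syntactic analysis.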
- https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [11]:
from nltk import pos_tag
nltk.download("averaged_perceptron_tagger")

tokenised_sent = ["their", "decision", "makes", "no", "economic", "sense"]

# POS tagging
pos_tagged_sent = pos_tag(tokenised_sent)
print("POS tagged sentence:\n{}".format(pos_tagged_sent))

POS tagged sentence:
[('their', 'PRP$'), ('decision', 'NN'), ('makes', 'VBZ'), ('no', 'DT'), ('economic', 'JJ'), ('sense', 'NN')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jerrychien/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
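
The tags in the output are Penn Treebank abbreviations. A small sketch that spells out the ones appearing above; the dictionary `TAG_MEANINGS` is written by hand for this illustration and covers only these five tags:

```python
# Descriptions follow the Penn Treebank tag set linked above.
TAG_MEANINGS = {
    "PRP$": "possessive pronoun",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
}

# Tagged sentence copied from the output above.
pos_tagged_sent = [("their", "PRP$"), ("decision", "NN"), ("makes", "VBZ"),
                   ("no", "DT"), ("economic", "JJ"), ("sense", "NN")]

for word, tag in pos_tagged_sent:
    print("{:<10}{:<6}{}".format(word, tag, TAG_MEANINGS.get(tag, "unknown")))
```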


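## Phrase Chunking
A description of sentence structure can be simple or complex, depending on how we chunk it. We can specify chunks according to the grammatical structure of phrases or clauses, and have a parser check the syntax step by step (matching with regular expressions) to produce a parse tree that describes the hierarchy. We demonstrate chunking with two simple grammatical structures, noun phrases and verb phrases: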

In [12]:
from nltk import RegexpParser

In [13]:
# noun phrase: given a word-tokenised sentence
tokenised_sent = ["their", "decision", "makes", "no", "economic", "sense"]

# POS tagging
pos_tagged_sent = pos_tag(tokenised_sent)

# specifying the formal grammar of a noun phrase: "grammar_name: {RegEx}"
np_chunk_grammar = "NP: {<DT>?<JJ>*<NN.?>}"
# building its parser
np_chunk_parser = RegexpParser(np_chunk_grammar)
# chunk parsing a sentence
np_chunked_sent = np_chunk_parser.parse(pos_tagged_sent)

# visualising parsing result
np_chunked_sent.draw()
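
To show what matching a chunk grammar against tags amounts to, here is a standard-library sketch that joins the tags into one string and runs an ordinary regular expression over it. It only imitates the idea behind RegexpParser and returns flat word groups, not a tree. The helper `find_chunks` and the verb-phrase pattern (a verb followed by a noun phrase) are assumptions made for this illustration:

```python
import re

tagged = [("their", "PRP$"), ("decision", "NN"), ("makes", "VBZ"),
          ("no", "DT"), ("economic", "JJ"), ("sense", "NN")]

# Same shape as the NP grammar above, written as a plain regex over "<TAG>" units.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN[^<>]?>")
# Hypothetical verb phrase: a verb tag followed by a noun phrase.
VP_PATTERN = re.compile(r"<VB[^<>]?>(?:<DT>)?(?:<JJ>)*<NN[^<>]?>")

def find_chunks(tagged_tokens, pattern):
    """Return the word groups whose tag sequence matches the pattern."""
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged_tokens)
    # Character offset at which each token's "<TAG>" unit starts.
    starts, offset = [], 0
    for _, tag in tagged_tokens:
        starts.append(offset)
        offset += len(tag) + 2
    chunks = []
    for match in pattern.finditer(tag_string):
        first = starts.index(match.start())
        last = first + match.group().count("<")
        chunks.append([word for word, _ in tagged_tokens[first:last]])
    return chunks

print(find_chunks(tagged, NP_PATTERN))
print(find_chunks(tagged, VP_PATTERN))
```

The noun-phrase pattern picks out "decision" and "no economic sense"; "their" is left out because PRP$ is not part of the pattern.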

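## Application Example: Retrieving Information from an Article
Having covered grammatical parsing, we next go through a news report and use a series of preprocessing, POS tagging, and chunk analysis techniques to find the key information in the text.
We prepared two modules in advance: tokenise_words.py and chunk_counters.py
The tokenise_words module splits the cleaned string into sentences and words (lowercasing → sentence segmentation → tokenisation).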

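## Bag-of-Words Model (BoW)
A brief look at the "bag" of words
The bag-of-words model represents text by how often each word occurs; it does not consider word order or even grammatical structure.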
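
A minimal sketch of the idea using `collections.Counter` from the standard library, reusing the token list from the earlier examples. This is one simple way to build a bag of words, not necessarily the approach taken in the original article:

```python
from collections import Counter

tokens = ["the", "spectators", "all", "stood", "and", "sang", "the", "national", "anthem"]

# Word order is discarded; only how often each token occurs is kept.
bag_of_words = Counter(tokens)

print(bag_of_words.most_common(3))

# Reversing the token order yields exactly the same bag.
print(Counter(reversed(tokens)) == bag_of_words)
```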