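#Task: Rewrite an article while preserving local context, for example by using trigrams
#Description: From an article we can extract every trigram. Given the word before and the word after,
#we randomly pick a middle word in proportion to how often it appears in that context and use it to replace the original middle word. This is the most basic way to modify an article with trigrams.
#Once the dictionary data structure is built, we replace words in the original text with some probability (for example 20%) and print both the spun text and the original.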

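#Extensions: use 5-grams or 7-grams to replace the middle word; use the first two words of a trigram to replace the third;
#add a part-of-speech (POS) match to improve readability; or even use Word2Vec, GloVe, or an RNN-based ...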
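
One of the extensions above, using the first two words of a trigram to choose the third, could look roughly like the sketch below. It is not part of the original program: the function name `build_forward_trigrams`, the toy sentence, and the use of whitespace splitting instead of `nltk` tokenization are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_forward_trigrams(tokens):
    """Map (w1, w2) -> Counter of the words observed right after that pair."""
    model = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        model[(w1, w2)][w3] += 1
    return model

# Toy input (hypothetical); the real program tokenizes review text with nltk.
toy_tokens = "the drive is fast and the drive is quiet".split()
forward = build_forward_trigrams(toy_tokens)

# Candidate third words after the pair ("drive", "is"), with their counts.
print(forward[("drive", "is")])
```

The counts can be normalized and sampled in the same way as the middle-word dictionary used in the rest of this notebook.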

#Example program file name: article_modifier_自動文件修改器.py
#Modules: sklearn, random, numpy, nltk, bs4
#Input file: ./electronics/positive.review
#Grading: plausibility and readability of the spun text


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days/project_6_trigram_application')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days/project_6_trigram_application


In [3]:
!python --version

Python 3.7.10


In [4]:

# Practice building a simple document generator using trigrams
from __future__ import print_function, division
from future.utils import iteritems
from builtins import range

import nltk
nltk.download('punkt')
nltk.download('wordnet')
import random
import numpy as np

from bs4 import BeautifulSoup


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [5]:
# load the reviews (the with-block closes the file once it has been read)
with open('positive.review', encoding='ISO-8859-1') as f:
    positive_reviews = BeautifulSoup(f.read(), "lxml")
positive_reviews = positive_reviews.find_all('review_text')


In [6]:

# Extract the trigrams and store them in a dictionary
# key: (w1, w3), value: list of observed middle words [w2]
trigrams = {}
for review in positive_reviews:
    s = review.text.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        k = (tokens[i], tokens[i+2])
        if k not in trigrams:
            trigrams[k] = [] 
        trigrams[k].append(tokens[i+1])
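
To make the shape of this dictionary concrete, here is a small sketch on a toy sentence. The helper name `build_middle_word_index` and the toy text are hypothetical, and plain whitespace splitting stands in for `nltk.tokenize.word_tokenize` so that the example needs no downloads.

```python
from collections import defaultdict

def build_middle_word_index(tokens):
    """Map (previous word, next word) -> list of middle words seen between them."""
    index = defaultdict(list)
    for i in range(len(tokens) - 2):
        index[(tokens[i], tokens[i + 2])].append(tokens[i + 1])
    return dict(index)

# Toy input (hypothetical), split on whitespace instead of using nltk.
toy_tokens = "i love this drive . i use this drive . i love this camera".split()
toy_index = build_middle_word_index(toy_tokens)

print(toy_index[("i", "this")])  # ['love', 'use', 'love']
```

Repeated middle words are kept in the list on purpose: the number of repeats is what later becomes the sampling probability.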


In [7]:

# Convert each list of middle words into a probability distribution
trigram_probabilities = {}
for k, words in iteritems(trigrams):
    # Build a word -> count dictionary
    if len(set(words)) > 1:
        # Only keep contexts that have more than one distinct middle word
        d = {}
        n = 0
        for w in words:
            if w not in d:
                d[w] = 0
            d[w] += 1
            n += 1
        for w, c in iteritems(d):
            d[w] = float(c) / n
        trigram_probabilities[k] = d
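
The counting and normalizing step can also be written with `collections.Counter`. The sketch below is an alternative formulation, not the code used in this notebook, and the name `to_probabilities` is an assumption.

```python
from collections import Counter

def to_probabilities(words):
    """Turn a list of observed words into a {word: relative frequency} dict."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# "love" was seen 3 times and "use" once in this hypothetical context.
probs = to_probabilities(["love", "use", "love", "love"])
print(probs)  # {'love': 0.75, 'use': 0.25}
```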


In [8]:

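def random_sample(d):
    # Draw one word from a {word: probability} dictionary: return the first
    # word whose cumulative probability exceeds a uniform random number r.
    r = random.random()
    cumulative = 0
    w = None
    for w, p in iteritems(d):
        cumulative += p
        if r < cumulative:
            return w
    # Floating-point rounding can leave the total slightly below 1;
    # fall back to the last word instead of returning None.
    return w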
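
On Python 3.6 and later, the same weighted draw is available from the standard library as `random.choices`. The sketch below assumes that route; the wrapper name `sample_word` and the seeded `random.Random(0)` generator are illustrative choices, used here only to check that the empirical frequency tracks the given probability.

```python
import random

def sample_word(probabilities, rng=random):
    """Draw one key from a {word: probability} dict, weighted by its value."""
    words = list(probabilities)
    weights = list(probabilities.values())
    return rng.choices(words, weights=weights, k=1)[0]

# Seeded generator so the check below is repeatable.
rng = random.Random(0)
dist = {"love": 0.75, "use": 0.25}
draws = [sample_word(dist, rng) for _ in range(10000)]

# The share of "love" should be close to 0.75.
print(draws.count("love") / len(draws))
```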


In [9]:

def test_spinner():
    review = random.choice(positive_reviews)
    s = review.text.lower()
    print("Original:", s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        if random.random() < 0.2: # 20% chance of replacement
            k = (tokens[i], tokens[i+2])
            if k in trigram_probabilities:
                w = random_sample(trigram_probabilities[k])
                tokens[i+1] = w
    print("Spun:")
    print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))
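
Note that `test_spinner` overwrites `tokens` in place, so a replaced word can become the context for the next position. A possible variant, sketched below under assumptions, always reads the context from the original tokens and takes the replacement probability and random generator as parameters. The name `spin_tokens` and the toy data are hypothetical, and this is one design option rather than the notebook's method.

```python
import random

def spin_tokens(tokens, middle_probs, replace_prob=0.2, rng=random):
    """Return a copy of tokens with some middle words resampled.

    Context keys are read from the original tokens, so earlier
    replacements never influence later ones.
    """
    original = list(tokens)
    spun = list(tokens)
    for i in range(len(original) - 2):
        key = (original[i], original[i + 2])
        if key in middle_probs and rng.random() < replace_prob:
            choices = middle_probs[key]
            spun[i + 1] = rng.choices(
                list(choices), weights=list(choices.values()), k=1
            )[0]
    return spun

# Toy distribution and sentence (hypothetical).
toy_probs = {("i", "this"): {"love": 0.5, "use": 0.5}}
toy_tokens = "i love this drive".split()
print(spin_tokens(toy_tokens, toy_probs, replace_prob=1.0, rng=random.Random(0)))
```

Returning the token list instead of printing it makes the function easier to test and leaves the joining of punctuation to the caller.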


In [10]:

if __name__ == '__main__':
    test_spinner()

Original: 
i have used the 9510i for about 2 years and have found it to be flawless in performance. the disaster recovery software provided with this drive is it's biggest star. you can format your hard drive, put in the restore disk that you create for your system, and in about 15 minutes your system is put back to exactly the way it was when you created the disk. or you can simply replace selected files if you want. it does not miss a single thing. fanstastic. the software is extremely user friendly and guides you through the process easier than an "abc" book. i would highly recommend this drive for performance and ease of use

Spun:
i have used the 9510i for about 2 years and have found it to be flawless in performance. the disaster recovery software provided with this drive is it's biggest star. you can format your hard drive, especially in the restore disk that you need for your system, put in about ten minutes your ipod is put back to make the way it was when you have the pen. or