In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import logging

logging.getLogger('jieba').setLevel(logging.WARNING)
logging.getLogger('fgclassifier').setLevel(logging.INFO)

import os
os.chdir('..')

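This notebook explores approaches to tag highlighting: splitting reviews into sentences, scoring the sentiment of each sentence, and marking noun chunks in both the English and the Chinese text.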

In [2]:
import config
from collections import defaultdict
from sklearn.model_selection import train_test_split

from fgclassifier.utils import read_data, get_dataset

df_en = get_dataset('train_en')
df_zh = get_dataset('train')

2018-12-02 22:16:10,333 [INFO] Reading /opt/storage/english_train.csv..
2018-12-02 22:16:10,511 [INFO] Reading /opt/storage/train/sentiment_analysis_trainingset.csv.tokenized.csv..
2018-12-02 22:16:12,807 [INFO] Reading /opt/storage/train/sentiment_analysis_trainingset.csv..


In [3]:
text_en = df_en['content_raw'][0]
text_zh = df_zh['content_raw'][0]
print(text_en)
print(text_zh)

First of all, the service, I asked the price to the waiter to see if this is 98, she said that it was then when she came to collect the money, but 108, asked her why she could not answer the question, then another person came over to collect the money and then gave it to me. 10 cheaper. Sure enough, Panyu is not a good thing, but it is actually directed. The cooperation between the waiters is poor, the wrong order is made, and the dish is messed up. Ok, say fried chicken. I ordered bt spicy and sweet. When the chicken comes up, I will sip it. Isn't it just a dish of fried sauce and sweet sauce on the fried chicken? Ok, I will bear it first, try it. Um... Too much powder, and it’s too fragile, the chicken inside is a bit dry. And I have eaten one of the chickens with a strange taste. I dare not say that using fresh chicken, it should be an old chilled chicken, it should still be 82 years. Potato chips... um... it’s just a pack of 3 in the supermarket. Potato wedges, good oil, no return 

In [6]:
from snownlp import SnowNLP

SnowNLP('实在是太恶心了').sentiments

0.02571608877754472

In [7]:
import spacy

nlp_zh = spacy.load('zh_core_web_sm')

In [73]:
from snownlp import SnowNLP

doc = SnowNLP(text_zh)
for sent in doc.sentences:
    print(sent, SnowNLP(sent).sentiments)
    for word in nlp_zh(sent):
        print(word, word.tag_)

吼吼吼 0.5834396182746914
吼吼 NN
吼 SFN
萌死人的棒棒糖 0.9586401818140926
萌死 NNP
人 NN
的 DEC
棒棒糖 NN
中了大众点评的霸王餐 0.6498323781194285
中 NN
了 AS
大众 NN
点评 NN
的 DEC
霸王餐 NN
太可爱了 0.6663463298769546
太 RB
可爱 JJ
了 UH
一直就好奇这个棒棒糖是怎么个东西 0.997882635536866
一直 RB
就 RB
好奇 JJ
这个 DT
棒棒糖 NN
是 VC
怎么 RB
个 NNB
东西 NN
大众点评给了我这个土老冒一个见识的机会 0.752802797601812
大众 NN
点评 VV
给 VV
了 AS
我 PRP
这个 DT
土老冒 NNP
一个 RB
见识 VV
的 DEC
机会 NN
看介绍棒棒糖是用德国糖做的 0.992540777499147
看 VV
介绍 VV
棒棒糖 NN
是 VC
用 VV
德国 NNP
糖 NN
做 VV
的 UH
不会很甜 0.7847423585226886
不会 MD
很甜 JJ
中间的照片是糯米的 0.6028834542837995
中间 NN
的 DEC
照片 NN
是 VC
糯米 VV
的 UH
能食用 0.5833333333333336
能 MD
食用 VV
真是太高端大气上档次了 0.9853961767368831
真是太 NN
高端 JJ
大气 NN
上档次 VV
了 UH
还可以买蝴蝶结扎口 0.7683977089874504
还 RB
可以 MD
买 VV
蝴蝶结 NN
扎口 JJ
送人可以买礼盒 0.5572971562741176
送人 VV
可以 MD
买 VV
礼盒 NN
我是先打的卖家电话 0.19276543154215775
我 PRP
是 VC
先 RB
打 VV
的 DEC
卖家 JJ
电话 NN
加了微信 0.638246931497096
加 VV
了 AS
微信 NNP
给卖家传的照片 0.5190102704406371
给 VV
卖家 NN
传 SFV
的 DEC
照片 NN
等了几天 0.38121739130434773
等 NN
了 AS
几天 NN
卖家就告诉我可以取货了 0.21622707986

In [40]:
import re

RE_SENTENCE = re.compile(r'.*?[。….？！?!；~～]+') 
RE_BLANK_AND_MARK = re.compile(r'\s+([。….？！?!；~～])')

text = 'Hahah! ! ! What are you Doing.? Aba Whats..'

text = RE_BLANK_AND_MARK.sub(r'\1', text)
print(text)

for sent in RE_SENTENCE.findall(text):
    print(sent, SnowNLP(sent).sentiments)

Hahah!!! What are you Doing.? Aba Whats..
Hahah!!! 0.5
 What are you Doing.? 0.4737672181921908
 Aba Whats.. 0.4737672181921908
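
Note that `RE_SENTENCE.findall` only returns runs that end in one of the listed marks, so a trailing fragment with no closing punctuation (such as the final "Potato wedges, good oil, no return" in `text_en`) would be dropped. A minimal sketch of one way to keep that tail, assuming an extra `|.+$` alternative is acceptable; `RE_SENTENCE_OR_TAIL` and `split_sentences` are hypothetical names, not part of `fgclassifier`:

```python
import re

# Same punctuation set as RE_SENTENCE and RE_BLANK_AND_MARK in the cell above.
RE_BLANK_AND_MARK = re.compile(r'\s+([。….？！?!；~～])')
# Hypothetical variant: the second alternative catches whatever is left
# after the last terminator. DOTALL lets a sentence span line breaks.
RE_SENTENCE_OR_TAIL = re.compile(r'.*?[。….？！?!；~～]+|.+$', re.DOTALL)


def split_sentences(text):
    """Split text into sentences, keeping an unterminated trailing fragment."""
    # Remove whitespace that sits directly before a terminator ("! !" -> "!!").
    text = RE_BLANK_AND_MARK.sub(r'\1', text)
    pieces = (piece.strip() for piece in RE_SENTENCE_OR_TAIL.findall(text))
    # Drop pieces that were whitespace only.
    return [piece for piece in pieces if piece]


print(split_sentences('Hahah! ! ! What are you Doing.? Aba Whats'))
```

Unlike the loop above, this sketch also strips the leading space from each sentence, which may or may not matter to the sentiment scorer.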


In [44]:
from textblob import TextBlob

blob = TextBlob(text_en)
for sent in blob.sentences:
    print(sent.string)

First of all, the service, I asked the price to the waiter to see if this is 98, she said that it was then when she came to collect the money, but 108, asked her why she could not answer the question, then another person came over to collect the money and then gave it to me.
10 cheaper.
Sure enough, Panyu is not a good thing, but it is actually directed.
The cooperation between the waiters is poor, the wrong order is made, and the dish is messed up.
Ok, say fried chicken.
I ordered bt spicy and sweet.
When the chicken comes up, I will sip it.
Isn't it just a dish of fried sauce and sweet sauce on the fried chicken?
Ok, I will bear it first, try it.
Um... Too much powder, and it’s too fragile, the chicken inside is a bit dry.
And I have eaten one of the chickens with a strange taste.
I dare not say that using fresh chicken, it should be an old chilled chicken, it should still be 82 years.
Potato chips... um... it’s just a pack of 3 in the supermarket.
Potato wedges, good oil, no return 

In [136]:
from fgclassifier.visualizer.highlight import highlight_noun_chunks

text_zh = """
很高兴和小伙伴们一起品鉴蜀老头烤鱼，此店位于花园路，在小天鹅东来顺的西边，店内环境古香古色，各种装饰道具让人有一种身在蜀中的韵味。包间里的家具都是原木色的实木桌櫈，记得小时候不管教室里还是家里都是这样造型的桌椅让我找到了童年的回忆，很温馨舒服。菜品更是丰富多彩，薯香辣翅中是最受欢迎的一道菜虽然看起来很多辣椒但是鸡翅里只融入了辣椒的香味还有烤得适中的地瓜条味道很不错。葱香清江鱼也是当场很卖座得一道菜，甜咸可口，酱汁浓郁。香辣清江鱼店家使用了五月梅花椒，是即带麻味又带香味的花椒再加上四川绵阳的干辣椒，真是红红火火满锅红。口味自然鲜辣无比，爱吃辣的同学一定不能错过。还有干锅辣鸭头，干锅牛蛙口味都不错而且量大实惠。唯一的缺点就是所有菜品口味有些重，而且越吃会越咸如果能改善一下这方面那就更完美了。
"""

print(highlight_noun_chunks(text_zh, 'zh'))
# print()
# print(highlight_noun_chunks(text_en, 'en'))


很高兴和小伙伴们一起品鉴蜀老头烤鱼，此店位于花园路，在小天鹅东来顺的西边，店内环境古香古色，各种装饰道具让人有一种身在蜀中的韵味。<span class="positive">包间里的家具</span>都是原木色的实木桌櫈，记得小时候不管教室里还是家里都是<span class="positive">这样造型的桌椅</span>让我找到了童年的回忆，很温馨舒服。菜品更是丰富多彩，薯香辣翅中是最受欢迎的一道菜虽然看起来很多辣椒但是鸡翅里只融入了辣椒的香味还有烤得<span class="positive">适中的地瓜条味道</span>很不错。葱香清江鱼也是当场很卖座得一道菜，甜咸可口，酱汁浓郁。香辣清江鱼店家使用了五月梅花椒，是即带麻味又带<span class="positive">香味的花椒</span>再加上四川绵阳的干辣椒，真是红红火火满锅红。口味自然鲜辣无比，爱吃辣的同学一定不能错过。还有<span class="positive">干锅辣</span>鸭头，干锅牛蛙口味都不错而且量大实惠。<span class="positive">唯一的缺点</span>就是所有菜品口味有些重，而且越吃会越咸如果能改善一下这方面那就更完美了。
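
The output wraps each noun chunk in a `<span>` whose class names a sentiment. A minimal sketch of how such markup could be produced from character offsets and a score in [0, 1]; this is not the `fgclassifier.visualizer.highlight` implementation, and the function names and the 0.4 / 0.6 thresholds are assumptions chosen only for illustration:

```python
from html import escape


def sentiment_class(score, low=0.4, high=0.6):
    """Map a score in [0, 1] to a CSS class name (thresholds are assumed)."""
    if score < low:
        return 'negative'
    if score > high:
        return 'positive'
    return 'neutral'


def wrap_spans(text, spans):
    """Wrap (start, end, score) character spans in <span> tags.

    `spans` must be sorted by start offset and must not overlap.
    """
    parts = []
    cursor = 0
    for start, end, score in spans:
        # Text between the previous span and this one stays unwrapped.
        parts.append(escape(text[cursor:start], quote=False))
        parts.append('<span class="%s">%s</span>' % (
            sentiment_class(score), escape(text[start:end], quote=False)))
        cursor = end
    # Remaining text after the last span.
    parts.append(escape(text[cursor:], quote=False))
    return ''.join(parts)


print(wrap_spans('I want grilled fish.', [(7, 19, 0.5)]))
```

Escaping only `<`, `>` and `&` keeps apostrophes readable while preventing review text from injecting markup.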



In [134]:
from fgclassifier.visualizer.highlight import spacy_load
from spacy.lang.zh import TAG_MAP


def zh_noun_chunks_iterator(obj):
    """
    Iterate over Chinese noun chunks.
    """
    labels = ['nmod', 'punct', 'obj', 'nsubj',
              'dobj', 'nsubjpass', 'pcomp', 'pobj', 'dative',
              'attr', 'ROOT']

    doc = obj.doc  # Ensure this works on both Doc and Span.
    np_deps = [doc.vocab.strings.add(label) for label in labels]
    conj = doc.vocab.strings.add('conj')
    np_label = doc.vocab.strings.add('NP')
    
    seen = set()
    exclude = {'，', ','}  # always exclude commas (full-width and ASCII)
    for i, word in enumerate(obj):
        print(word, '\t', word.left_edge, word.tag_, word.dep_)
        if word.tag_ not in ('NNP', 'NN', 'RB'):
            continue
        # Prevent nested chunks from being produced
        if word.i in seen or word.text in exclude:
            continue
        if word.dep in np_deps:
            # print([w for w in word.subtree])
            if any((w.i in seen or w.text in exclude) for w in word.subtree):
                continue
            seen.update(range(word.left_edge.i, word.i+1))
            yield word.left_edge.i, word.i+1, np_label
        elif word.dep == conj:
            head = word.head
            while head.dep == conj and head.head.i < head.i:
                head = head.head
            # If the head is an NP, and we're coordinated to it, we're an NP
            if head.dep in np_deps:
                if any(w.i in seen for w in word.subtree):
                    continue
                seen.update(range(word.left_edge.i, word.i+1))
                yield word.left_edge.i, word.i+1, np_label

def zh_noun_chunks(doc):
    doc.noun_chunks_iterator = zh_noun_chunks_iterator
    return doc.noun_chunks
    
nlp = spacy_load('zh')
doc = nlp('吼吼吼，萌死人的棒棒糖，中了大众点评的霸王餐，太可爱了。')
# doc = nlp('一直就好奇这个棒棒糖是怎么个东西，大众点评给了我这个土老冒一个见识的机会。')
# doc = nlp('看介绍棒棒糖是用德国糖做的，不会很甜，中间的照片是糯米的，能食用，真是太高端大气上档次了，还可以买蝴蝶结扎口，送人可以买礼盒。')
# doc = nlp('虽然连卖家的面都没见到，但是还是谢谢卖家送我这么可爱的东西，太喜欢了，这哪舍得吃啊。')

for x in zh_noun_chunks(doc):
    print(x)

吼吼 	 吼吼 NN appos
吼 	 吼 SFN nmod
， 	 吼吼 NNP nmod
萌死 	 萌死 IN case:suff
人 	 吼吼 NN det
的 	 的 DEC case:dec
棒棒糖 	 棒棒糖 NN nmod
， 	 吼吼 RB advmod
中 	 中 VV acl
了 	 了 AS case:aspect
大众 	 大众 NN nsubj
点评 	 大众 NN acl:relcl
的 	 的 DEC mark:relcl
霸王餐 	 大众 NN nmod
， 	 吼吼 RB nsubj
太 	 太 RB advmod
可爱 	 吼吼 JJ ROOT
了 	 了 AS case:aspect
。 	 。 FW obj
棒棒糖
大众
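
The `seen` set in `zh_noun_chunks_iterator` is what keeps nested or overlapping chunks out of the result. A minimal sketch of that bookkeeping on plain `(start, end)` token ranges, without spaCy; `drop_overlapping` is a hypothetical helper written only to illustrate the idea:

```python
def drop_overlapping(chunks):
    """Keep chunks in order, skipping any that reuse an already-claimed token.

    Each chunk is a (start, end) pair of token indices, end exclusive.
    """
    seen = set()
    kept = []
    for start, end in chunks:
        # Skip the chunk if any of its tokens belongs to an earlier chunk.
        if any(i in seen for i in range(start, end)):
            continue
        seen.update(range(start, end))
        kept.append((start, end))
    return kept


print(drop_overlapping([(0, 3), (2, 5), (5, 6)]))
```

Because earlier chunks win, the order in which candidates are visited decides which of two overlapping chunks survives.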


In [42]:
from fgclassifier.visualizer.highlight import highlight_noun_chunks

text = """
I want to eat grilled fish, I found out that there is a fish near the subway station, and I think that there is no fish or fish in Nanjing. I am going to eat this. [Location] It\u2019s really good for a person who came to this square for the first time. Difficult to find, I don't know if I still divide it into the east and west districts. I found it in the inside for half an hour. It was next to the hottest Guimanyu. [Environment] is the kind of lighting that is dark and iron-isolated. Quite a lot ~ [Grilled fish] Two people only ordered a tomato grilled fish, 3 kg of Qingjiang fish, added potatoes,
"""

highlight_noun_chunks(text)

'<span class="sentence">I want to eat <span class="neutral">grilled fish</span>, I found out that there is a fish near <span class="neutral">the subway station</span>, and I think that there is no fish or fish in Nanjing.</span><span class="sentence">I am going to eat this.</span><span class="sentence"><span class="positive">[Location</span>] It’s really good for a person who came to this square for <span class="positive">the first time</span>.</span><span class="sentence">Difficult to find, I don\'t know if I still divide it into <span class="negative">the east</span> and <span class="negative">west districts</span>.</span><span class="sentence">I found it in <span class="negative">the inside</span> for <span class="negative">half an hour</span>.</span><span class="sentence">It was next to <span class="neutral">the hottest Guimanyu</span>.</span><span class="sentence"><span class="positive">[Environment</span>] is <span class="positive">the kind</span> of lighting that is dark and iro