## Extracting aspects and opinions using dependency-based rules

Universal dependencies: http://universaldependencies.org/docsv1/en/dep/index.html

POS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [1]:
import pickle
import pandas as pd
import nltk
import random
import gensim
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models
import warnings
warnings.filterwarnings('ignore')



In [2]:
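# Load the pre-computed dependency parses; the context manager closes the file afterwards
with open("deps_en2.pkl", "rb") as f:
    all_depedencies = pickle.load(f)
print(len(all_depedencies))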

36000
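The cells that follow unpack each item as `gov, rel, dep` and read `gov[0]` and `gov[1]`, which suggests the pickle maps a document id to a list of sentences, each a list of triples whose governor and dependent are `(word, tag)` pairs. A minimal sketch of that assumed layout, using hypothetical toy data rather than the real file:

```python
# Hypothetical toy data; the real deps_en2.pkl may be organised differently in detail.
toy_dependencies = {
    1: [  # doc_id -> list of sentences
        [  # sentence -> list of (governor, relation, dependent) triples
            (("price", "N"), "amod", ("affordable", "A")),
            (("good", "A"), "nsubj", ("absorption", "N")),
        ],
    ],
}

relations = []
for doc_id, sentences in toy_dependencies.items():
    for sentence in sentences:
        for gov, rel, dep in sentence:
            # gov and dep are (word, coarse POS tag) pairs
            relations.append((doc_id, rel, gov[0], dep[0]))
print(relations)
```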


In [3]:
rel_dict = {}  # Count how often each dependency relation occurs
for sentences in all_depedencies.values():
    for sentence in sentences:
        for gov, rel, dep in sentence:
            rel_dict[rel] = rel_dict.get(rel, 0) + 1
df_freq = pd.DataFrame([(k, v) for (k, v) in rel_dict.items()], columns=["rel", "count"])
df_freq.sort_values("count", ascending=False)

Unnamed: 0,rel,count
3,punct,153874
0,nsubj,93575
6,det,79750
2,advmod,66476
7,amod,58212
1,cop,52706
19,case,45074
18,nmod,40146
5,dobj,38176
12,conj,37353


Dependencies that might be useful for extracting aspects and opinions:
<li>nsubj (nominal subject)</li>
<li>amod (adjectival modifier)</li>
<li>advmod (adverbial modifier)</li>
<br>
For opinions only:
<li>dobj (direct object)</li>
<li>neg (negation)</li>
<li>xcomp (open clausal complement)</li>
<li><s>cop</s> (already captured by nsubj)</li>
<li><s>aux</s> (already captured by nsubj)</li>
<br>
For aspects only:
<li>nmod (nominal modifier)</li>
<li>compound (noun compounds)</li>
<li>appos (appositional modifier)</li>
<br>
For expansion:
<li>conj (conjunction)</li>

In [4]:
def count_types(rel_type):
    types = {}
    for sentences in all_depedencies.values():
        for sentence in sentences:
            for gov, rel, dep in sentence:
                if rel == rel_type:
                    t = gov[1]+"+"+dep[1]
                    types[t] = types.get(t, 0) + 1
    df_count = pd.DataFrame([(k, v) for (k, v) in types.items()], columns=["combination", "count"])
    return df_count.sort_values("count", ascending=False)

In [5]:
count_types("amod")

Unnamed: 0,combination,count
1,N+A,52180
2,A+A,1852
5,N+V,1202
0,NNP+A,1083
3,CD+A,689
4,N+R,324
17,V+A,197
9,N+N,175
6,NNPS+A,98
18,NNP+V,62


amod(N, A) is clearly the dominant pattern, far ahead of every other tag combination.
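To put a number on that dominance, the N+A count can be divided by the total number of amod relations from the relation frequency table. This is a small arithmetic sketch using the two figures printed in the outputs above; the variable names are illustrative only.

```python
amod_total = 58212      # count of "amod" in the relation frequency table
amod_noun_adj = 52180   # count of the N+A combination for amod

share = amod_noun_adj / amod_total
print(f"amod(N, A) share of all amod relations: {share:.1%}")
```

Roughly nine in ten amod relations link a noun governor to an adjective dependent.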

In [18]:
count_types("advmod")

Unnamed: 0,combination,count
0,A+R,32974
1,V+R,18953
2,N+R,3866
4,R+R,2864
5,V+WRB,1946
6,V+A,1361
3,A+A,948
11,V+IN,466
8,A+WRB,421
14,CD+R,367


advmod(A, R) and advmod(V, R) only yield opinion words. advmod(N, R) yields both an aspect (the noun) and an opinion (the adverb).

In [6]:
count_types("nsubj")

Unnamed: 0,combination,count
1,A+N,24896
5,V+PRP,23593
4,V+N,15104
6,A+PRP,11705
2,N+N,2973
9,N+PRP,2797
3,V+NNP,2583
8,R+N,1984
0,A+NNP,1740
12,V+DT,551


nsubj(A, N) yields both an aspect (the noun subject) and an opinion (the adjective).

In [7]:
# Check some random nsubj(N, N) samples (the sample limit is tested once per document)
counter = 0
for sentences in all_depedencies.values():
    for sentence in sentences:
        for gov, rel, dep in sentence:
            if rel == "nsubj":
                if gov[1] == "N" and dep[1] == "N":
                    if random.randint(0,1) == 1:
                        print(gov, rel, dep)
                        counter += 1
    if counter > 4:
        break

('work', 'N') nsubj ('logistics', 'N')
('baby', 'N') nsubj ('summer', 'N')
('evaluation', 'N') nsubj ('following', 'N')
('leak', 'N') nsubj ('baby', 'N')
('super-penny', 'N') nsubj ('value-for-money', 'N')


<h4>Based on the above observations, I will focus on amod(N, A), nsubj(A, N), and advmod(N, R).</h4>
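The three chosen patterns can also be written as a small lookup table that records which side of the relation holds the aspect. This is only a sketch on toy triples: `RULES`, `extract_pairs`, and `toy_sentence` are hypothetical names, and the triples assume the `((word, tag), relation, (word, tag))` layout used throughout this notebook.

```python
# (relation, governor tag, dependent tag) -> which side carries the aspect
RULES = {
    ("amod", "N", "A"): "gov",    # noun governor is the aspect
    ("nsubj", "A", "N"): "dep",   # noun dependent (the subject) is the aspect
    ("advmod", "N", "R"): "gov",  # noun governor is the aspect
}

def extract_pairs(sentence):
    """Return (aspect, opinion, relation) tuples for triples matching a rule."""
    pairs = []
    for gov, rel, dep in sentence:
        aspect_side = RULES.get((rel, gov[1], dep[1]))
        if aspect_side is None:
            continue  # not one of the selected patterns
        if aspect_side == "gov":
            pairs.append((gov[0], dep[0], rel))
        else:
            pairs.append((dep[0], gov[0], rel))
    return pairs

# Toy sentence; the last triple matches no rule and is skipped.
toy_sentence = [
    (("brands", "N"), "amod", ("big", "A")),
    (("trustworthy", "A"), "nsubj", ("brands", "N")),
    (("work", "N"), "advmod", ("fast", "R")),
    (("like", "V"), "nsubj", ("I", "PRP")),
]
print(extract_pairs(toy_sentence))
```

Keeping the rules in a table makes it easy to add or drop a pattern without touching the loop.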

In [8]:
out = []
for i, sentences in all_depedencies.items():
    for sentence in sentences:
        for gov, rel, dep in sentence:
            if rel == "amod":
                if gov[1] == "N" and dep[1] == "A":
                    out.append([i, gov[0], dep[0], "amod"])
            elif rel == "nsubj":
                if gov[1] == "A" and dep[1] == "N":
                    out.append([i, dep[0], gov[0], "nsubj"])
            elif rel == "advmod":
                if gov[1] == "N" and dep[1] == "R":
                    out.append([i, gov[0], dep[0], "advmod"])
df_extract = pd.DataFrame(out, columns=["doc_id", "aspect", "opinion", "rel"])                    

In [9]:
df_extract

Unnamed: 0,doc_id,aspect,opinion,rel
0,1,odor,soft,amod
1,1,absorption,large,amod
2,1,brands,trustworthy,nsubj
3,1,brands,big,amod
4,1,work,fast,advmod
5,1,work,hard,amod
6,2,absorption,good,nsubj
7,2,price,affordable,nsubj
8,3,things,good,nsubj
9,3,packaging,good,nsubj


In [10]:
len(df_extract["aspect"].unique())

2261

In [11]:
df_aspect_count = df_extract["aspect"].value_counts()

In [12]:
df_aspect_count[df_aspect_count.gt(10)]

quality           5509
butt              4830
time              3240
ass               3187
price             2538
brand             2510
things            2365
packaging         2256
baby              2221
absorption        2171
breathability     2066
diaper            1623
delivery          1397
diapers           1362
brands            1210
times             1075
use                969
urine              957
pants              880
service            778
goods              769
store              729
product            641
purchase           615
logistics          550
day                533
shopping           521
taste              503
products           496
thing              487
                  ... 
sister              12
care                12
authenticity        12
village             12
powder              11
boy                 11
names               11
scorpion            11
oil                 11
m162                11
outfit              11
mothers             11
consumption

463 nouns appeared more than 10 times.
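A count like this can be obtained by summing the boolean mask instead of reading it off the printed Series. The sketch below uses a toy Series: three of the values are copied from the output above, and `rare_aspect` is a hypothetical entry added to show a value below the threshold.

```python
import pandas as pd

# Toy aspect counts; "rare_aspect" is a made-up entry for illustration.
toy_counts = pd.Series({"quality": 5509, "sister": 12, "powder": 11, "rare_aspect": 3})

# gt(10) gives a boolean mask; summing it counts the True entries.
n_frequent = int(toy_counts.gt(10).sum())
print(n_frequent)
```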

In [13]:
df_extract.to_csv(r"extract1.csv", index=False, encoding="utf-8")

#### Topic model

In [14]:
docs_words = []
for doc_id, group in df_extract.groupby("doc_id"):
    doc = []
    for i, row in group.iterrows():
        doc.append(row["aspect"])
        doc.append(row["opinion"])
    docs_words.append(doc)
print(len(docs_words))

29184


In [15]:
id2word = gensim.corpora.Dictionary(docs_words)
id2word.filter_extremes(no_below=3, no_above=0.2)
print(id2word)

Dictionary(1666 unique tokens: ['absorption', 'big', 'brands', 'fast', 'hard']...)


In [16]:
corpus = [id2word.doc2bow(doc) for doc in docs_words]

In [17]:
n_topics = 10
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=n_topics, id2word=id2word, passes=10)
lda_vis = pyLDAvis.gensim.prepare(lda, corpus, id2word, sort_topics=False)
pyLDAvis.display(lda_vis)

In [18]:
docs_phrases = []
for doc_id, group in df_extract.groupby("doc_id"):
    doc = []
    for i, row in group.iterrows():
        doc.append(row["opinion"]+" "+row["aspect"])
    docs_phrases.append(doc)
print(len(docs_phrases))

29184


In [19]:
id2word = gensim.corpora.Dictionary(docs_phrases)
id2word.filter_extremes(no_below=3, no_above=0.2)
print(id2word)

Dictionary(3115 unique tokens: ['big brands', 'hard work', 'large absorption', 'trustworthy brands', 'affordable price']...)


In [20]:
corpus = [id2word.doc2bow(doc) for doc in docs_phrases]

In [21]:
n_topics = 10
lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=n_topics, id2word=id2word, passes=10)
lda_vis = pyLDAvis.gensim.prepare(lda, corpus, id2word, sort_topics=False)
pyLDAvis.display(lda_vis)

Extracting aspects with topic modeling does not work well here. The likely reason is that a single review tends to cover several unrelated aspects (e.g., logistics, price, brand, quality), so words from different aspects co-occur in the same document and the topics do not separate cleanly.
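One way to check this explanation is to count the distinct aspects extracted per review. The sketch below uses only the ten rows shown in the `df_extract` preview, rebuilt as a toy frame named `toy_extract` (a hypothetical name); running the same `groupby` on the full `df_extract` would be the real test.

```python
import pandas as pd

# The first ten (doc_id, aspect) rows from the df_extract preview.
toy_extract = pd.DataFrame({
    "doc_id": [1, 1, 1, 1, 1, 1, 2, 2, 3, 3],
    "aspect": ["odor", "absorption", "brands", "brands", "work", "work",
               "absorption", "price", "things", "packaging"],
})

# Number of distinct aspects mentioned in each review.
aspects_per_doc = toy_extract.groupby("doc_id")["aspect"].nunique()
print(aspects_per_doc)
```

If most reviews mention several distinct aspects, document-level co-occurrence carries little signal about which words belong to the same aspect.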