In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re

In [2]:
os.chdir('../input')
os.getcwd()

'/kaggle/input'

# Aspect Based Sentiment Analysis on Car Reviews
## Taking Toyota Cars as an example

In [3]:
toy_rev = pd.read_csv('../input/Scrapped_Car_Reviews_Toyota.csv',engine='python',index_col=False)
toy_rev.head()

Unnamed: 0.1,Unnamed: 0,Review_Date,Author_Name,Vehicle_Title,Review_Title,Review,Rating
0,0,on 02/02/17 19:53 PM (PST),Ricardo,1997 Toyota Previa Minivan LE 3dr Minivan,"great vehicle, Toyota best design ever. thank you","there is no way back, enjoy what you have .",5.0
1,1,on 12/17/16 16:40 PM (PST),matt,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,"my 4th previa, best van ever made!",1st 95 went over 300k before being totalled b...,5.0
2,2,on 04/14/10 07:43 AM (PDT),Joel G,1997 Toyota Previa Minivan LE 3dr Minivan,Mom's Taxi Babies Ride,Sold 86 Toyota Van 285K miles to be replaced ...,5.0
3,3,on 11/12/08 17:31 PM (PST),Dennis,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,My Favorite Van Ever,"I have owned lots of vans, and the Previa is ...",4.875
4,4,on 04/14/08 22:47 PM (PDT),Alf Skrastins,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,Best Minivan ever,My 1997 AWD Previa is the third one that I ha...,5.0


## **Combining the review title and review body for the text corpus****

In [4]:
toy_rev['review']=toy_rev['Review_Title']+toy_rev['Review']

## **Using spaCy for dependency parsing which forms the crux of aspect extraction**

In [5]:
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_lg', parse=True, tag=True, entity=True)

 ## **Using spaCy's awesome displacy module to show the dependency relations**

In [6]:
txt = 'Great car and has long range'
doc = nlp(txt)
spacy.displacy.render(doc,style='dep',jupyter=True)

**from https://nlp.stanford.edu/software/dependencies_manual.pdf**
### AMOD - adjectival modifier
#### An adjectival modifier of a Noun is any adjectival phrase that serves to modify the meaning of the Noun
### ex - 'Great <--amod-- Car', 'Long <--amod-- range'

In [7]:
txt = 'Drives well has great handling'
doc = nlp(txt)
spacy.displacy.render(doc,style='dep',jupyter=True)

### ADVMOD - adverb modifier
#### An adverb modifier of a word is a (non-clausal) adverb or adverb-headed phrase that serves to modify
#### the meaning of the word
### ex - 'Drives --advmod--> well'

In [8]:
txt =  "wonderful to drive the camry "
doc = nlp(txt)
spacy.displacy.render(doc,style='dep',jupyter=True)

### XCOMP -  open clausal complement
#### An open clausal complement (xcomp) of a verb or an adjective is a predicative or clausal complement without its own subject
### ex - 'wonderful --xcomp--> drive'


In [9]:
txt =  "not wonderful to drive the camry "
doc = nlp(txt)
spacy.displacy.render(doc,style='dep',jupyter=True)

### NEG - self explanatory
### ex - not <--neg-- wonderful

### COMPOUND WORDS
#### Generally from a review standpoint, compound words often do not offer us sentiments per se, hence my code looks for possible compound word pairs and then checks with the aspect words extracted if it can add more detail to the extracted aspects - ex Outstanding passenger van gives *more context* than Outstanding van (which is what my code would have extracted without the compound word search) while the compound word search will identify passenger van as a compound word

In [10]:
competitors = ['Chevy','chevy','Ford','ford','Nissan','nissan','Honda','honda','Chevrolet','chevrolet','Volkswagen','volkswagen','benz','Benz','Mercedes','mercedes','subaru','Subaru','VW']

**Reason for using competitor name list is to remove potential misleading aspects-sentiments, since we are interested to acquire aspect info about Toyota and not any other brand. This is because a reviewer might be comparing a Benz saying it has superior handling when compared to the car the person is reviewing and this can lead to misclassifications******

In [11]:
aspect_terms = []
comp_terms = []
easpect_terms = []
ecomp_terms = []
enemy = []
for x in tqdm(range(len(toy_rev['review']))):
    amod_pairs = []
    advmod_pairs = []
    compound_pairs = []
    xcomp_pairs = []
    neg_pairs = []
    eamod_pairs = []
    eadvmod_pairs = []
    ecompound_pairs = []
    eneg_pairs = []
    excomp_pairs = []
    enemlist = []
    if len(str(toy_rev['review'][x])) != 0:
        lines = str(toy_rev['review'][x]).replace('*',' ').replace('-',' ').replace('so ',' ').replace('be ',' ').replace('are ',' ').replace('just ',' ').replace('get ','').replace('were ',' ').replace('When ','').replace('when ','').replace('again ',' ').replace('where ','').replace('how ',' ').replace('has ',' ').replace('Here ',' ').replace('here ',' ').replace('now ',' ').replace('see ',' ').replace('why ',' ').split('.')       
        for line in lines:
            enem_list = []
            for eny in competitors:
                enem = re.search(eny,line)
                if enem is not None:
                    enem_list.append(enem.group())
            if len(enem_list)==0:
                doc = nlp(line)
                str1=''
                str2=''
                for token in doc:
                    if token.pos_ is 'NOUN':
                        for j in token.lefts:
                            if j.dep_ == 'compound':
                                compound_pairs.append((j.text+' '+token.text,token.text))
                            if j.dep_ is 'amod' and j.pos_ is 'ADJ': #primary condition
                                str1 = j.text+' '+token.text
                                amod_pairs.append(j.text+' '+token.text)
                                for k in j.lefts:
                                    if k.dep_ is 'advmod': #secondary condition to get adjective of adjectives
                                        str2 = k.text+' '+j.text+' '+token.text
                                        amod_pairs.append(k.text+' '+j.text+' '+token.text)
                                mtch = re.search(re.escape(str1),re.escape(str2))
                                if mtch is not None:
                                    amod_pairs.remove(str1)
                    if token.pos_ is 'VERB':
                        for j in token.lefts:
                            if j.dep_ is 'advmod' and j.pos_ is 'ADV':
                                advmod_pairs.append(j.text+' '+token.text)
                            if j.dep_ is 'neg' and j.pos_ is 'ADV':
                                neg_pairs.append(j.text+' '+token.text)
                        for j in token.rights:
                            if j.dep_ is 'advmod'and j.pos_ is 'ADV':
                                advmod_pairs.append(token.text+' '+j.text)
                    if token.pos_ is 'ADJ':
                        for j,h in zip(token.rights,token.lefts):
                            if j.dep_ is 'xcomp' and h.dep_ is not 'neg':
                                for k in j.lefts:
                                    if k.dep_ is 'aux':
                                        xcomp_pairs.append(token.text+' '+k.text+' '+j.text)
                            elif j.dep_ is 'xcomp' and h.dep_ is 'neg':
                                if k.dep_ is 'aux':
                                        neg_pairs.append(h.text +' '+token.text+' '+k.text+' '+j.text)
            
            else:
                enemlist.append(enem_list)
                doc = nlp(line)
                str1=''
                str2=''
                for token in doc:
                    if token.pos_ is 'NOUN':
                        for j in token.lefts:
                            if j.dep_ == 'compound':
                                ecompound_pairs.append((j.text+' '+token.text,token.text))
                            if j.dep_ is 'amod' and j.pos_ is 'ADJ': #primary condition
                                str1 = j.text+' '+token.text
                                eamod_pairs.append(j.text+' '+token.text)
                                for k in j.lefts:
                                    if k.dep_ is 'advmod': #secondary condition to get adjective of adjectives
                                        str2 = k.text+' '+j.text+' '+token.text
                                        eamod_pairs.append(k.text+' '+j.text+' '+token.text)
                                mtch = re.search(re.escape(str1),re.escape(str2))
                                if mtch is not None:
                                    eamod_pairs.remove(str1)
                    if token.pos_ is 'VERB':
                        for j in token.lefts:
                            if j.dep_ is 'advmod' and j.pos_ is 'ADV':
                                eadvmod_pairs.append(j.text+' '+token.text)
                            if j.dep_ is 'neg' and j.pos_ is 'ADV':
                                eneg_pairs.append(j.text+' '+token.text)
                        for j in token.rights:
                            if j.dep_ is 'advmod'and j.pos_ is 'ADV':
                                eadvmod_pairs.append(token.text+' '+j.text)
                    if token.pos_ is 'ADJ':
                        for j in token.rights:
                            if j.dep_ is 'xcomp':
                                for k in j.lefts:
                                    if k.dep_ is 'aux':
                                        excomp_pairs.append(token.text+' '+k.text+' '+j.text)
        pairs = list(set(amod_pairs+advmod_pairs+neg_pairs+xcomp_pairs))
        epairs = list(set(eamod_pairs+eadvmod_pairs+eneg_pairs+excomp_pairs))
        for i in range(len(pairs)):
            if len(compound_pairs)!=0:
                for comp in compound_pairs:
                    mtch = re.search(re.escape(comp[1]),re.escape(pairs[i]))
                    if mtch is not None:
                        pairs[i] = pairs[i].replace(mtch.group(),comp[0])
        for i in range(len(epairs)):
            if len(ecompound_pairs)!=0:
                for comp in ecompound_pairs:
                    mtch = re.search(re.escape(comp[1]),re.escape(epairs[i]))
                    if mtch is not None:
                        epairs[i] = epairs[i].replace(mtch.group(),comp[0])
            
    aspect_terms.append(pairs)
    comp_terms.append(compound_pairs)
    easpect_terms.append(epairs)
    ecomp_terms.append(ecompound_pairs)
    enemy.append(enemlist)
toy_rev['compound_nouns'] = comp_terms
toy_rev['aspect_keywords'] = aspect_terms
toy_rev['competition'] = enemy
toy_rev['competition_comp_nouns'] = ecomp_terms
toy_rev['competition_aspects'] = easpect_terms
toy_rev.head()

 88%|████████▊ | 20078/22702 [23:57<03:46, 11.59it/s]

## We use vaderSentiment for sentiment analysis because of it's speed and simplicity. It offers 3 types of polarity -  positive, negative and neutral. As a result we can filter all aspects which have high neutral scores hence minimizing errors caused due to wrong extraction of aspects and stopwords

In [12]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [13]:
import operator
sentiment = []
for i in range(len(toy_rev)):
    score_dict={'pos':0,'neg':0,'neu':0}
    if len(toy_rev['aspect_keywords'][i])!=0: 
        for aspects in toy_rev['aspect_keywords'][i]:
            sent = analyser.polarity_scores(aspects)
            score_dict['neg'] += sent['neg']
            score_dict['pos'] += sent['pos']
        #score_dict['neu'] += sent['neu']
        sentiment.append(max(score_dict.items(), key=operator.itemgetter(1))[0])
    else:
        sentiment.append('NaN')
toy_rev['sentiment'] = sentiment
toy_rev.head()

Unnamed: 0.1,Unnamed: 0,Review_Date,Author_Name,Vehicle_Title,Review_Title,Review,Rating,review,compound_nouns,aspect_keywords,competition,competition_comp_nouns,competition_aspects,sentiment
0,0,on 02/02/17 19:53 PM (PST),Ricardo,1997 Toyota Previa Minivan LE 3dr Minivan,"great vehicle, Toyota best design ever. thank you","there is no way back, enjoy what you have .",5.0,"great vehicle, Toyota best design ever. thank ...",[],"[great vehicle, is back, best design]",[],[],[],pos
1,1,on 12/17/16 16:40 PM (PST),matt,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,"my 4th previa, best van ever made!",1st 95 went over 300k before being totalled b...,5.0,"my 4th previa, best van ever made! 1st 95 went...","[(captain chairs, chairs), (van value, value),...","[minor quirks, red light, fantastic vans, reli...",[],[],[],pos
2,2,on 04/14/10 07:43 AM (PDT),Joel G,1997 Toyota Previa Minivan LE 3dr Minivan,Mom's Taxi Babies Ride,Sold 86 Toyota Van 285K miles to be replaced ...,5.0,Mom's Taxi Babies Ride Sold 86 Toyota Van 285K...,"[(K miles, miles), (reserve weekend, weekend),...","[Use mostly, middle seat, 1st baby, glad to ha...",[],[],[],pos
3,3,on 11/12/08 17:31 PM (PST),Dennis,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,My Favorite Van Ever,"I have owned lots of vans, and the Previa is ...",4.875,My Favorite Van Ever I have owned lots of vans...,"[(Fuel mileage, mileage), (Toyota salesman, sa...","[constantly am, ever owned, Ever owned, Never ...",[],[],[],pos
4,4,on 04/14/08 22:47 PM (PDT),Alf Skrastins,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,Best Minivan ever,My 1997 AWD Previa is the third one that I ha...,5.0,Best Minivan ever My 1997 AWD Previa is the th...,"[(gas mileage, mileage)]","[previously had, even comes, much fun, reasona...",[],[],[],pos


In [14]:
int_sent = []
for sent in toy_rev['sentiment']:
    if sent is 'NaN':
        int_sent.append('NaN')
    elif sent is 'pos':
        int_sent.append('1')
    else:
        int_sent.append('0')
toy_rev['int_sent'] = int_sent
toy_rev.head()

Unnamed: 0.1,Unnamed: 0,Review_Date,Author_Name,Vehicle_Title,Review_Title,Review,Rating,review,compound_nouns,aspect_keywords,competition,competition_comp_nouns,competition_aspects,sentiment,int_sent
0,0,on 02/02/17 19:53 PM (PST),Ricardo,1997 Toyota Previa Minivan LE 3dr Minivan,"great vehicle, Toyota best design ever. thank you","there is no way back, enjoy what you have .",5.0,"great vehicle, Toyota best design ever. thank ...",[],"[great vehicle, is back, best design]",[],[],[],pos,1
1,1,on 12/17/16 16:40 PM (PST),matt,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,"my 4th previa, best van ever made!",1st 95 went over 300k before being totalled b...,5.0,"my 4th previa, best van ever made! 1st 95 went...","[(captain chairs, chairs), (van value, value),...","[minor quirks, red light, fantastic vans, reli...",[],[],[],pos,1
2,2,on 04/14/10 07:43 AM (PDT),Joel G,1997 Toyota Previa Minivan LE 3dr Minivan,Mom's Taxi Babies Ride,Sold 86 Toyota Van 285K miles to be replaced ...,5.0,Mom's Taxi Babies Ride Sold 86 Toyota Van 285K...,"[(K miles, miles), (reserve weekend, weekend),...","[Use mostly, middle seat, 1st baby, glad to ha...",[],[],[],pos,1
3,3,on 11/12/08 17:31 PM (PST),Dennis,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,My Favorite Van Ever,"I have owned lots of vans, and the Previa is ...",4.875,My Favorite Van Ever I have owned lots of vans...,"[(Fuel mileage, mileage), (Toyota salesman, sa...","[constantly am, ever owned, Ever owned, Never ...",[],[],[],pos,1
4,4,on 04/14/08 22:47 PM (PDT),Alf Skrastins,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,Best Minivan ever,My 1997 AWD Previa is the third one that I ha...,5.0,Best Minivan ever My 1997 AWD Previa is the th...,"[(gas mileage, mileage)]","[previously had, even comes, much fun, reasona...",[],[],[],pos,1


### Here we have arbitarily taken ratings greater than 3 as positive and everything else as negative

In [15]:
import math
pos = []
for i in range(len(toy_rev)):
    if not math.isnan(toy_rev['Rating'][i]):
        if int(toy_rev['Rating'][i])>3:
            pos.append('1')
        else:
            pos.append('0')
    else:
        pos.append('0')
toy_rev['Positive Review'] = pos
toy_rev.head()

Unnamed: 0.1,Unnamed: 0,Review_Date,Author_Name,Vehicle_Title,Review_Title,Review,Rating,review,compound_nouns,aspect_keywords,competition,competition_comp_nouns,competition_aspects,sentiment,int_sent,Positive Review
0,0,on 02/02/17 19:53 PM (PST),Ricardo,1997 Toyota Previa Minivan LE 3dr Minivan,"great vehicle, Toyota best design ever. thank you","there is no way back, enjoy what you have .",5.0,"great vehicle, Toyota best design ever. thank ...",[],"[great vehicle, is back, best design]",[],[],[],pos,1,1
1,1,on 12/17/16 16:40 PM (PST),matt,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,"my 4th previa, best van ever made!",1st 95 went over 300k before being totalled b...,5.0,"my 4th previa, best van ever made! 1st 95 went...","[(captain chairs, chairs), (van value, value),...","[minor quirks, red light, fantastic vans, reli...",[],[],[],pos,1,1
2,2,on 04/14/10 07:43 AM (PDT),Joel G,1997 Toyota Previa Minivan LE 3dr Minivan,Mom's Taxi Babies Ride,Sold 86 Toyota Van 285K miles to be replaced ...,5.0,Mom's Taxi Babies Ride Sold 86 Toyota Van 285K...,"[(K miles, miles), (reserve weekend, weekend),...","[Use mostly, middle seat, 1st baby, glad to ha...",[],[],[],pos,1,1
3,3,on 11/12/08 17:31 PM (PST),Dennis,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,My Favorite Van Ever,"I have owned lots of vans, and the Previa is ...",4.875,My Favorite Van Ever I have owned lots of vans...,"[(Fuel mileage, mileage), (Toyota salesman, sa...","[constantly am, ever owned, Ever owned, Never ...",[],[],[],pos,1,1
4,4,on 04/14/08 22:47 PM (PDT),Alf Skrastins,1997 Toyota Previa Minivan LE All-Trac 3dr Min...,Best Minivan ever,My 1997 AWD Previa is the third one that I ha...,5.0,Best Minivan ever My 1997 AWD Previa is the th...,"[(gas mileage, mileage)]","[previously had, even comes, much fun, reasona...",[],[],[],pos,1,1


In [16]:
d = {'sent':toy_rev['Positive Review'],'sent_pred':toy_rev['int_sent']}
metric_df = pd.DataFrame(data=d)
metric_df.head()

Unnamed: 0,sent,sent_pred
0,1,1
1,1,1
2,1,1
3,1,1
4,1,1


In [17]:
len(metric_df.sent)

22702

## Removing NaN values in the sentiment predictions

In [18]:
metric_df = metric_df[metric_df.sent_pred != 'NaN']
len(metric_df.sent)

17869

In [19]:
from sklearn.metrics import accuracy_score,auc,f1_score,recall_score,precision_score
print('accuracy')
print(accuracy_score(metric_df.sent, metric_df.sent_pred))
print('f1 score')
print(f1_score(metric_df.sent, metric_df.sent_pred,pos_label='1'))
print('recall')
print(recall_score(metric_df.sent, metric_df.sent_pred,pos_label='1'))
print('precision')
print(precision_score(metric_df.sent, metric_df.sent_pred,pos_label='1'))

accuracy
0.7869494655548716
f1 score
0.8695026222877319
recall
0.8829713171818435
precision
0.8564386521709771


## Possible improvements that can be made
*  Tricky situation of removing stopwords to reduce unwanted extractions of non-aspects but this can also affect spaCy's dependency parsing. Same goes with noun chunk merging as well. If someone can think of a better way to remove stopwords and still retain spaCy's dependency goodness it can greatly improve the accuracy

* This is not a ML task per se since we do more of parsing than ML. Although Bi-Directional LSTM have been very good at ABSA tasks in the past, unlike semeval tasks we do not have a fixed topic for our aspects to fall into. If someone can use the parsing aspect of the code to implement BLSTM in this case, that would be great

* better alternatives to vaderSentiment if available (unsupervised/ semi-supervised methods might be better here I think)

* The very definition of aspects can be a bit vague at times hence we do not have a valid metric to measure the aspect extraction's accuracy


## P.S This is my first Kaggle Kernel and I am fairly new to python programming as well, hence my non usage of list comprehensions and functions might be evident. I highly encourage everyone to fork my code and add your own twists to increase the accuracy of both aspect extractions and sentiment analysis.