
# Segment reviews in order to extract "opinion units"
When users write reviews, they may give their opinion on different aspects of the app (cf. __[this website](https://monkeylearn.com/blog/aspect-based-sentiment-analysis/)__). They might like one feature of the app and wish another one was better. To assess which topics are of highest interest, we'd like to segment app reviews in order to extract "opinion units" from users.

After reading some app reviews, a first way of extracting opinion units would be to:
* Use the default sentence segmentation from spaCy 
* Add a rule to the pipeline before the parser, to segment when "but" or "although" is used.
* Change the rule so that it does not segment on "so".
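
To make the plan above concrete without any model, here is a minimal regex-based sketch. It is only a rough approximation and not what the notebook uses: `rough_opinion_units` and both patterns are hypothetical names introduced for illustration, and a naive split on punctuation will mis-split abbreviations such as "e.g.".

```python
import re

# Hypothetical helper, not part of spaCy: a rough regex approximation of the plan above.
# Split after sentence-final punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Split right before a lowercase "but" / "although"; "so" is deliberately absent.
CONTRAST = re.compile(r"\s+(?=(?:but|although)\b)")


def rough_opinion_units(review):
    """Split a review into sentences, then split each sentence before 'but'/'although'."""
    units = []
    for sentence in SENTENCE_END.split(review.strip()):
        for unit in CONTRAST.split(sentence):
            unit = unit.strip()
            if unit:
                units.append(unit)
    return units


print(rough_opinion_units(
    "Having allergies is annoying but I'm glad to see this. "
    "i like the widget option so i can see it in my today view"
))
```

Such a baseline can be handy for sanity-checking the spaCy results, but it has no notion of syntax, which is why the notebook relies on the spaCy pipeline instead.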

In [1]:
import os
import pandas as pd

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Load reviews

In [3]:
path = os.getcwd()
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us.csv'
subfolder = '/../data/1_preprocessed_data/'

In [4]:
df = pd.read_csv(path+subfolder+filename, sep = ';')

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en
1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en
2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en
3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en
4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en


## Extract opinion unit

### Adding rules for sentence segmentation

In [6]:
doc = nlp(u'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.')

In [7]:
for sent in doc.sents:
    print(sent)

Having allergies is annoying but I’m glad to see this and know when to take extra precautions.


In [8]:
# add a new rule to the pipeline: force a sentence start at "but" and "although"
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ('but', 'although'):
            token.is_sent_start = True
    return doc

# the rule has to run before the parser, which respects boundaries that are already set
nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

In [9]:
doc2 = nlp(u'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.')
for sent in doc2.sents:
    print(sent)

Having allergies is annoying
but I’m glad to see this and know when to take extra precautions.
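
The cell above passes the function object to `nlp.add_pipe`, which is the spaCy v2 calling convention. In spaCy v3, components are registered by name with `@Language.component` and added by that name. The sketch below shows the v3 form under one loud assumption: a blank English pipeline with the rule-based `sentencizer` stands in for `en_core_web_sm`, so that no model download is needed. With the full model loaded, the last call would use `before="parser"` instead. The variable name `nlp_v3` is introduced here for illustration only.

```python
import spacy
from spacy.language import Language


@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Force a sentence start at "but" / "although", as in the cell above.
    for token in doc[:-1]:
        if token.text in ("but", "although"):
            token.is_sent_start = True
    return doc


# Assumption: blank pipeline + sentencizer instead of en_core_web_sm.
nlp_v3 = spacy.blank("en")
nlp_v3.add_pipe("sentencizer")
# The custom rule runs first; the sentencizer keeps boundaries that are already set.
nlp_v3.add_pipe("set_custom_boundaries", before="sentencizer")

doc = nlp_v3("Having allergies is annoying but I’m glad to see this and know when to take extra precautions.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```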


In [10]:
doc3 = nlp(u'What a great app , beautiful design, useful handling , perfect information 👍 5 stars and high recommended. What I miss ? Only 1 thing , a complication for the Siri watchface , it would be so cool when I see t by rough the day some cards with the air data on my Siri watchface 👍 any chance for this feature? Keep working on this app ist a 5 Star app 👍')
for sent in doc3.sents:
    print(sent)

What a great app , beautiful design, useful handling , perfect information 👍 5 stars and high recommended.
What I miss ?
Only 1 thing , a complication for the Siri watchface , it would be so cool when I see t by rough the day some cards with the air data on my Siri watchface 👍 any chance for this feature?
Keep working on this app ist a 5 Star app 👍


In [11]:
doc4 = nlp(u'Great app, use it a lot - often multiple times a day to check air quality conditions. Really like the predictions provided for the next few days, in order to assist in planning outdoor activities. 5 stars!')
for sent in doc4.sents:
    print(sent)

Great app, use it a lot - often multiple times a day to check air quality conditions.
Really like the predictions provided for the next few days, in order to assist in planning outdoor activities.
5 stars!


In [12]:
doc5 = nlp(u'I live in Seoul and I used this app all the time. However, I restarted my phone and this app will no longer download :(. It has to be because I’m in Seoul?  Every other app has redownloaded.')
for sent in doc5.sents:
    print(sent)

I live in Seoul and I used this app all the time.
However, I restarted my phone and this app will no longer download :(.
It has to be because I’m in Seoul?  
Every other app has redownloaded.


In [13]:
doc6 = nlp(u'Can track air quality of multiple cities easy and they get their information from multiple sources. Also shows helpful tips on what to do when air qualify is bad - like closings windows and when it’s best not to exercise outside.')
for sent in doc6.sents:
    print(sent)

Can track air quality of multiple cities easy and they get their information from multiple sources.
Also shows helpful tips on what to do when air qualify is bad - like closings windows and when it’s best not to exercise outside.


The segmentation seems to give good results on these examples. Let's explore it further with other reviews. Emojis and emoticons should also be converted during preprocessing, as they carry information about the users' sentiment.
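
One possible shape for the emoji and emoticon conversion mentioned above is sketched here. The two lookup tables and the `convert_emojis` helper are hypothetical and deliberately tiny; a real preprocessing step would likely need a much fuller mapping.

```python
import re

# Hypothetical, deliberately small lookup tables (assumptions for illustration).
EMOJI_TO_TEXT = {"👍": "thumbs_up", "👎": "thumbs_down"}
EMOTICON_TO_TEXT = {":(": "sad_face", ":)": "happy_face"}


def convert_emojis(text):
    """Replace known emojis and emoticons with plain-text tokens."""
    for symbol, label in {**EMOJI_TO_TEXT, **EMOTICON_TO_TEXT}.items():
        text = text.replace(symbol, f" {label} ")
    # Collapse runs of spaces introduced by the replacements (line breaks are kept).
    return re.sub(r" {2,}", " ", text).strip()


print(convert_emojis("this app will no longer download :(. Keep working on this app 👍"))
```

Replacing the symbols with word-like tokens keeps the sentiment signal available to later steps that only look at text.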

### "so"

In [14]:
# It seems that there's a segmentation with "so" followed by subject + verb as well.
s = "Great app.  The predictive forecast is great, although not completely accurate. It seems as weather changes ability to forecast air quality does as well so you have to take that with a grain of salt"
#s = "Great app.  The predictive forecast is great, although not completely accurate. It seems as weather changes ability to forecast air quality does as well you have to take that with a grain of salt"

doc7 = nlp(s)
for sent in doc7.sents:
    print(sent)

Great app.  
The predictive forecast is great,
although not completely accurate.
It seems as weather changes ability to forecast air quality
does as well
so you have to take that with a grain of salt


In [15]:
s = "love using this to check the air quality in my area and whenever I travel. i like that it has the widget option so i can see it in my today view as well on iphone"

doc8 = nlp(s)
for sent in doc8.sents:
    print(sent)

love using this to check the air quality in my area and whenever I travel.
i like that it has the widget option
so i can see it in my today view as well on iphone


In [16]:
s = "My cousin usually has acute pneumonia so this app has been very useful. We are able get access to the air quality wherever we go, and avoid unnecessary contact with fine dust."

doc9 = nlp(s)
for sent in doc9.sents:
    print(sent)

My cousin usually has acute pneumonia so this app has been very useful.
We are able get access to the air quality wherever we go, and avoid unnecessary contact with fine dust.


In [17]:
s = "Never followed this before but now I’m understanding what’s effecting my at times debilitating congestion is not allergies. Really accurate for my neighborhood so far (brooklyn, New York). You can even have the air quality number stay on top of the app the same way notifications usually appear. But seems like you have to open the app for it to be updated/refresh to the latest air quality number. I guess that’s the only thing that’s not the best so far."

doc10 = nlp(s)
for sent in doc10.sents:
    print(sent)

Never followed this before
but now I’m understanding what’s effecting my at times debilitating congestion is not allergies.
Really accurate for my neighborhood so far (brooklyn, New York).
You can even have the air quality number stay on top of the app the same way notifications usually appear.
But seems like you have to open the app for it to be updated/refresh to the latest air quality number.
I guess that’s the only thing that’s not the best so far.


"so" sometimes triggers a sentence split, although in the couple of reviews taken as examples there should be no split at that point. We could add a rule that prevents segmentation at "so".

In [18]:
# reset to the original
nlp = spacy.load('en_core_web_sm')

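# add new rules to the pipeline
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in ('but', 'although'):
            # force a new opinion unit at contrastive conjunctions
            token.is_sent_start = True
        elif token.text == 'so':
            # explicitly forbid a boundary at "so"
            token.is_sent_start = False
    return doc

# the rule has to run before the parser, which respects boundaries that are already set
nlp.add_pipe(set_custom_boundaries, before='parser')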

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

In [19]:
s = "love using this to check the air quality in my area and whenever I travel. i like that it has the widget option so i can see it in my today view as well on iphone"

doc8 = nlp(s)
for sent in doc8.sents:
    print(sent)

love using this to check the air quality in my area and whenever I travel.
i like that it has the widget option so i can see it in my today view as well on iphone


## "Except" (few cases)

In [20]:
s = "Working and living in South Korea is a joy EXCEPT for the air quality.\n\n While I can survive with irritated eyes, my wife is most sensitive to the air quality which is sadly lacking due to Yellow Dust from the Gobi Desert as well as industrial pollution making its way from China.\n\n We depend on the app daily to determine the extent of our outdoor exposure as well as turning on our indoor air filters."

doc11 = nlp(s)
for sent in doc11.sents:
    print(sent)

Working and living in South Korea is a joy EXCEPT for the air quality.

 
While I can survive with irritated eyes, my wife is most sensitive to the air quality which is sadly lacking due to Yellow Dust from the Gobi Desert as well as industrial pollution making its way from China.

 
We depend on the app daily to determine the extent of our outdoor exposure as well as turning on our indoor air filters.


In [21]:
s = u"This app is very good, only except that it does not cover enough cities, e.g. Gainesville FL. Hope your later version can cover more cities."

doc12 = nlp(s)
for sent in doc12.sents:
    print(sent)

This app is very good, only except that it does not cover enough cities, e.g. Gainesville FL.
Hope your later version can cover more cities.


A rule could be added to segment at "except" when it is followed by a structure such as subject + verb. However, since there are only a few such cases, they might be handled manually instead.
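
As a rough illustration of the "except + subject + verb" idea, the sketch below uses a regular expression rather than part-of-speech tags. It is a hypothetical heuristic: the pronoun list and the `split_on_except` helper are assumptions, and a version based on spaCy's tagger would be more robust.

```python
import re

# Hypothetical heuristic: treat "except" as a boundary only when it is followed by
# an optional "that" and a subject pronoun, i.e. when a clause is likely to follow.
EXCEPT_CLAUSE = re.compile(
    r"\s+(?=except\s+(?:that\s+)?(?:i|it|he|she|we|they|you|this|there)\b)",
    flags=re.IGNORECASE,
)


def split_on_except(sentence):
    """Split before 'except' when it introduces a clause; otherwise keep the sentence whole."""
    return [part.strip() for part in EXCEPT_CLAUSE.split(sentence) if part.strip()]


print(split_on_except("This app is very good, only except that it does not cover enough cities."))
print(split_on_except("Working and living in South Korea is a joy EXCEPT for the air quality."))
```

The first review is split before "except that it", while "EXCEPT for the air quality" is left intact because no subject follows.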

## Handling line breaks in case of empty lines

In [22]:
s = u"Working and living in South Korea is a joy EXCEPT for the air quality.\n\n While I can survive with irritated eyes, my wife is most sensitive to the air quality which is sadly lacking due to Yellow Dust from the Gobi Desert as well as industrial pollution making its way from China.\n\n We depend on the app daily to determine the extent of our outdoor exposure as well as turning on our indoor air filters."

doc13 = nlp(s)
for sent in doc13.sents:
    print(sent)

Working and living in South Korea is a joy EXCEPT for the air quality.

 
While I can survive with irritated eyes, my wife is most sensitive to the air quality which is sadly lacking due to Yellow Dust from the Gobi Desert as well as industrial pollution making its way from China.

 
We depend on the app daily to determine the extent of our outdoor exposure as well as turning on our indoor air filters.


In [23]:
doc14 = nlp(df.iloc[2147]['review'])
for sent in doc14.sents:
    print(sent)

Discovered this app a couple of months ago and find it very useful in evaluating air quality, ventilating my house, and considering how to improve air filtration in my house.
With continuing fires and air pollution in N California, I have developed bronchial irritation and increased congestion (which closely tracks air quality conditions reported by air visual).
I attribute the good info to lots of reporting stations in my area.

Of particular interest is knowing when to ventilate my house (during periods of good air reported by air visual).
Also, the app is helping me evaluate the performance of my whole house air filter system; my house filter helps
but need improvement.

As a result of the app's functionality, I'm considering buying a larger version of IQAir's home air purifier as an adjunct to my whole house filtering.

I give a 4 star rating because there is always room for improvement.


Once the opinion units have been extracted, we can test for empty sentences and discard them.
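
A minimal sketch of that clean-up step could look as follows. The helper name `non_empty_units` is hypothetical; it works on anything that converts to a string, so the sentences from `doc.sents` could be passed in the same way as the plain strings used here.

```python
def non_empty_units(sentences):
    """Strip surrounding whitespace and drop units that end up empty."""
    cleaned = (str(sentence).strip() for sentence in sentences)
    return [unit for unit in cleaned if unit]


# Usage with plain strings standing in for segmented sentences.
units = non_empty_units(["Great app.  ", "\n\n ", "I give a 4 star rating."])
print(units)
```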