<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Segment-reviews-in-order-to-extract-&quot;opinion-units&quot;" data-toc-modified-id="Segment-reviews-in-order-to-extract-&quot;opinion-units&quot;-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Segment reviews in order to extract "opinion units"</a></span><ul class="toc-item"><li><span><a href="#Load-reviews" data-toc-modified-id="Load-reviews-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load reviews</a></span></li><li><span><a href="#Modify-rules-for-sentence-segmentation" data-toc-modified-id="Modify-rules-for-sentence-segmentation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Modify rules for sentence segmentation</a></span></li><li><span><a href="#Removing-line-breaks" data-toc-modified-id="Removing-line-breaks-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Removing line breaks</a></span></li><li><span><a href="#Extract-opinion-units" data-toc-modified-id="Extract-opinion-units-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Extract opinion units</a></span></li><li><span><a href="#Create-new-dataframe-with-opinion-units" data-toc-modified-id="Create-new-dataframe-with-opinion-units-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Create new dataframe with opinion units</a></span></li><li><span><a href="#Export-file" data-toc-modified-id="Export-file-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Export file</a></span></li></ul></li></ul></div>

# Segment reviews in order to extract "opinion units"
When users write reviews, they may give their opinion on several aspects of the app (cf. __[this website](https://monkeylearn.com/blog/aspect-based-sentiment-analysis/)__). They might like one feature of the app and wish another one was better. To assess which topics are of highest interest, we'd like to segment app reviews in order to extract "opinion units" from users.

After reading some app reviews, we've identified how to modify the default rules from spaCy to extract opinion units from our app reviews:
* Use the default sentence segmentation from spaCy.
* Add a rule to the pipeline before the parser, to segment when "but" or "although" is used.
* Change the rule to never segment when "so" is encountered.

Empty lines should also be removed.
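
To make the "segment at *but* / *although*" rule concrete before involving spaCy, here is a toy sketch on whitespace-split words. It is only an illustration: `split_on_markers` and `MARKERS` are hypothetical names, and the notebook itself relies on spaCy tokens and its parser rather than on `str.split`.

```python
# Toy illustration only: whitespace-separated words stand in for spaCy tokens.
MARKERS = {"but", "although"}

def split_on_markers(text):
    """Start a new unit each time a marker word is met."""
    units, current = [], []
    for word in text.split():
        # close the running unit when a marker word begins a new one
        if word.lower() in MARKERS and current:
            units.append(" ".join(current))
            current = []
        current.append(word)
    if current:
        units.append(" ".join(current))
    return units

print(split_on_markers("I like the widget so I keep it but the map is slow"))
# ['I like the widget so I keep it', 'but the map is slow']
```

Note that "so" does not open a new unit here, which is the behaviour the third rule asks for.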

Here, since our dataset is not too large, we'll duplicate the review information and associate it with each opinion unit.

NB: After going through reviews to identify topics, tags and sentiment, I realized that for certain reviews most of the information is in the title. The review title is therefore added as an additional opinion unit. An index for the opinion units is added as well, as it may make tagging (topics, sentiment) easier.

In [1]:
import os
import pandas as pd

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Load reviews

In [3]:
path = os.getcwd()
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us.csv'
subfolder = '/../data/1_preprocessed_data/'

In [4]:
df = pd.read_csv(path+subfolder+filename, sep = ';')
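
As a side note, the same path can be built with `pathlib` instead of string concatenation. This is only a sketch of an alternative; `input_path` is a name introduced here, and the folder layout is the one used in the cells above.

```python
from pathlib import Path

# Same file name as in the notebook.
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us.csv'

# Path.cwd().parent plays the role of the '..' step; '/' joins the components.
input_path = Path.cwd().parent / 'data' / '1_preprocessed_data' / filename
print(input_path)

# pd.read_csv accepts a Path object directly, e.g.:
# df = pd.read_csv(input_path, sep=';')
```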

In [5]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en
1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en
2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en
3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en
4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en


## Modify rules for sentence segmentation

In [6]:
# Custom rule, inserted before the parser so that the parser keeps these boundaries.
# Note: passing the function itself to add_pipe is the spaCy v2 API.
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # segment at "but" and "although" (exact, lowercase match)
        if token.text in ('but', 'although'):
            token.is_sent_start = True
        # never segment at "so"
        elif token.text == 'so':
            token.is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before = 'parser')

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']
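
The pipe names above (`tagger`, `parser`, `ner`) correspond to a spaCy v2 pipeline. In spaCy v3, a custom component has to be registered under a name and added by that name. The sketch below assumes spaCy v3 is installed. It uses `spacy.blank("en")` so that no trained model is needed, and a separate, hypothetical variable `nlp_v3` so it does not clash with the notebook's `nlp`. It also matches the marker words case-insensitively, which differs slightly from the rule above.

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc:
        # segment at "but" and "although" (case-insensitive in this sketch)
        if token.lower_ in ("but", "although"):
            token.is_sent_start = True
        # never segment at "so"
        elif token.lower_ == "so":
            token.is_sent_start = False
    return doc

# Blank English pipeline: tokenizer only, so the custom rule is the sole
# source of sentence boundaries in this sketch.
nlp_v3 = spacy.blank("en")
nlp_v3.add_pipe("set_custom_boundaries")

# With a trained model, one would instead write:
# nlp_v3 = spacy.load("en_core_web_sm")
# nlp_v3.add_pipe("set_custom_boundaries", before="parser")

doc = nlp_v3("I like the widget so I keep it but the map is slow")
print([sent.text for sent in doc.sents])
```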

In [7]:
# check the application of the new rules
s = "love using this to check the air quality in my area and whenever I travel, but not at work. i like that it has the widget option so i can see it in my today view as well on iphone"

doc8 = nlp(s)
for sent in doc8.sents:
    print(sent.text)

love using this to check the air quality in my area and whenever I travel,
but not at work.
i like that it has the widget option so i can see it in my today view as well on iphone


## Removing line breaks

In [8]:
import re

In [9]:
clean_line_break = lambda s : ' '.join(re.findall('[^\n]+', s))
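
A quick check of what this helper does, on a made-up review string (the helper is restated so the snippet runs on its own). The pattern `[^\n]+` keeps every non-empty line, so joining the matches with a space removes both line breaks and empty lines.

```python
import re

# Same helper as in the notebook, restated so the snippet is self-contained.
clean_line_break = lambda s: ' '.join(re.findall('[^\n]+', s))

sample = "Great app.\n\nLove the widget\n"   # made-up review text
print(repr(clean_line_break(sample)))
# 'Great app. Love the widget'

# A possible alternative: str.split() with no argument also collapses
# tabs and repeated spaces, which may or may not be wanted.
print(repr(' '.join(sample.split())))
# 'Great app. Love the widget'
```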

## Extract opinion units

In [10]:
extract_sent = lambda s : [clean_line_break(sent.text) for sent in nlp(s).sents]

In [11]:
extract_sent(s)

['love using this to check the air quality in my area and whenever I travel,',
 'but not at work.',
 'i like that it has the widget option so i can see it in my today view as well on iphone']

In [12]:
s2 = df.iloc[2147]['review']

In [13]:
extract_sent(s2)

['Discovered this app a couple of months ago and find it very useful in evaluating air quality, ventilating my house, and considering how to improve air filtration in my house.',
 'With continuing fires and air pollution in N California, I have developed bronchial irritation and increased congestion (which closely tracks air quality conditions reported by air visual).',
 'I attribute the good info to lots of reporting stations in my area.',
 'Of particular interest is knowing when to ventilate my house (during periods of good air reported by air visual).',
 'Also, the app is helping me evaluate the performance of my whole house air filter system; my house filter helps',
 'but need improvement.',
 "As a result of the app's functionality, I'm considering buying a larger version of IQAir's home air purifier as an adjunct to my whole house filtering.",
 'I give a 4 star rating because there is always room for improvement.']

In [14]:
df['opinion_units'] = df['review'].apply(extract_sent)

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,opinion_units
0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,"[Having allergies is annoying, but I’m glad to..."
1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,[Easy to keep track on specific Local areas]
2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,[Full of good information!]
3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,[Tells you everything you need to know about t...
4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en,"[This is part of my daily planning., I love th..."


In [16]:
df.iloc[2143]['opinion_units']

['It is a very good app to visit!',
 'It shows you all the times and day where you can see directly when and where you should need to know what the air quality is!',
 'Awesome']

## Create new dataframe with opinion units

In [17]:
cols = df.columns.values.tolist()
cols.append('opi_id')
cols

['Unnamed: 0',
 'review_id',
 'rating',
 'title',
 'review_date',
 'user_name',
 'review',
 'response_id',
 'dev_response',
 'response_date',
 'lang',
 'opinion_units',
 'opi_id']

In [43]:
df_out = pd.DataFrame(columns = cols)

In [44]:
df_out

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,opinion_units,opi_id


In [22]:
# method to use the title as an additional opinion unit
def get_title_opinion(df, i, counter):
    
    title_r = {}
    
    # copy the info from all the columns except the last one
    for c in df.columns[:-1]:
        title_r[c] = df.iloc[i].loc[c]
        
    # extract the title review to fill the 'opinion_units' column
    opinion_c = df.columns[-1]
    title_r[opinion_c] = df.iloc[i].loc['title']
    
    # add the id of the opinion unit
    title_r['opi_id'] = counter
    
    return title_r

In [23]:
get_title_opinion(df, 0, 0)

{'Unnamed: 0': 3095,
 'review_id': 6121840341,
 'rating': 5,
 'title': 'Happy to finally see when and why I can’t breathe!',
 'review_date': '2020-06-25T23:18:57Z',
 'user_name': 'Abbsteroni',
 'review': 'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.',
 'response_id': nan,
 'dev_response': nan,
 'response_date': nan,
 'lang': 'en',
 'opinion_units': 'Happy to finally see when and why I can’t breathe!',
 'opi_id': 0}

In [24]:
# method to define the content of the row for the dataframe of opinion units
def new_row(df, i, j, counter):
    
    new_r = {}
    
    # copy the info from all the columns except the last one
    for c in df.columns[:-1]:
        new_r[c] = df.iloc[i].loc[c]
        
    # extract the j th item from the list in the 'opinion_units' column of df
    opinion_c = df.columns[-1]
    new_r[opinion_c] = df.iloc[i].loc[opinion_c][j]
    
    # add the id of the opinion unit
    new_r['opi_id'] = counter
    
    return new_r

In [25]:
new_row(df, 0, 0, 1)

{'Unnamed: 0': 3095,
 'review_id': 6121840341,
 'rating': 5,
 'title': 'Happy to finally see when and why I can’t breathe!',
 'review_date': '2020-06-25T23:18:57Z',
 'user_name': 'Abbsteroni',
 'review': 'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.',
 'response_id': nan,
 'dev_response': nan,
 'response_date': nan,
 'lang': 'en',
 'opinion_units': 'Having allergies is annoying',
 'opi_id': 1}

In [26]:
new_row(df, 0, 1, 2)

{'Unnamed: 0': 3095,
 'review_id': 6121840341,
 'rating': 5,
 'title': 'Happy to finally see when and why I can’t breathe!',
 'review_date': '2020-06-25T23:18:57Z',
 'user_name': 'Abbsteroni',
 'review': 'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.',
 'response_id': nan,
 'dev_response': nan,
 'response_date': nan,
 'lang': 'en',
 'opinion_units': 'but I’m glad to see this and know when to take extra precautions.',
 'opi_id': 2}

In [45]:
opinion_c = df.columns[-1]
counter = 0
rows = []

# build the dataframe of opinion units
# (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
for i in range(len(df)):

    # add a row to extract the review title as an opinion unit
    rows.append(get_title_opinion(df, i, counter))
    counter += 1

    # add row(s) from opinion units extracted from the review
    lst_len = len(df.iloc[i].loc[opinion_c])

    for j in range(lst_len):
        rows.append(new_row(df, i, j, counter))
        counter += 1

df_out = pd.DataFrame(rows, columns = cols)
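
For larger datasets, the row-by-row loop above could probably be replaced by `DataFrame.explode` (available since pandas 0.25). The sketch below runs on a made-up two-review frame; `toy`, `units` and `toy_out` are hypothetical names, and only three of the notebook's columns are reproduced.

```python
import pandas as pd

# Made-up two-review frame with the three relevant columns.
toy = pd.DataFrame({
    'review_id': [1, 2],
    'title': ['Great App', 'Super'],
    'opinion_units': [['Full of info!', 'but slow'], ['Easy to use']],
})

# Put the title in front of each review's list of units.
units = pd.Series(
    [[t] + u for t, u in zip(toy['title'], toy['opinion_units'])],
    index=toy.index,
)

# One row per unit; the other columns are duplicated automatically.
toy_out = (
    toy.assign(opinion_units=units)
       .explode('opinion_units')
       .reset_index(drop=True)
)

# Running index over all opinion units, like `counter` in the loop.
toy_out['opi_id'] = range(len(toy_out))
print(toy_out)
```

Each review contributes its title first and then its extracted units, which is the same ordering as the loop produces.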

In [50]:
# rename 'opinion_units' to the singular 'opinion_unit', since each row now holds a single unit
df_out.rename({'opinion_units':'opinion_unit'}, axis = 1, inplace = True)

In [46]:
df_out.head(30)

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,opinion_units,opi_id
0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Happy to finally see when and why I can’t brea...,0
1,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Having allergies is annoying,1
2,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,but I’m glad to see this and know when to take...,2
3,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,Super,3
4,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,Easy to keep track on specific Local areas,4
5,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,Great App,5
6,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,Full of good information!,6
7,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,Great app for filtering the air,7
8,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,Tells you everything you need to know about th...,8
9,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en,I look everyday,9


## Export file

In [47]:
export_filename = filename[:-4]+'_ou_v2.csv'
export_filename

'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us_ou_v2.csv'

In [48]:
export_subfolder = '/../data/2_opinion_units/'
export_subfolder

'/../data/2_opinion_units/'

In [49]:
df_out.to_csv(path+export_subfolder+export_filename)
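
The export call above uses the default comma separator and also writes the row index. When such a file is read back, the index shows up as an extra `Unnamed: 0` column, which is possibly where the `Unnamed: 0` column of the input file came from. Below is a small round-trip sketch on made-up data, assuming one wants to keep the `;` separator of the input and drop the index. An in-memory buffer stands in for the file.

```python
import io
import pandas as pd

# Made-up stand-in for df_out.
toy_out = pd.DataFrame({
    'review_id': [1, 1],
    'opinion_unit': ['Great App', 'Full of info!'],
})

buffer = io.StringIO()
# index=False avoids writing the row index as an extra unnamed column;
# sep=';' mirrors the separator used when the input file was read.
toy_out.to_csv(buffer, sep=';', index=False)

# Read the exported text back to check the columns.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=';')
print(reloaded.columns.tolist())
# ['review_id', 'opinion_unit']
```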