<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Segment-expanded-titles-and-reviews-in-order-to-extract-&quot;opinion-units&quot;" data-toc-modified-id="Segment-expanded-titles-and-reviews-in-order-to-extract-&quot;opinion-units&quot;-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Segment expanded titles and reviews in order to extract "opinion units"</a></span><ul class="toc-item"><li><span><a href="#Load-reviews" data-toc-modified-id="Load-reviews-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load reviews</a></span></li><li><span><a href="#Modify-rules-for-sentence-segmentation" data-toc-modified-id="Modify-rules-for-sentence-segmentation-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Modify rules for sentence segmentation</a></span></li><li><span><a href="#Removing-line-breaks" data-toc-modified-id="Removing-line-breaks-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Removing line breaks</a></span></li><li><span><a href="#Extract-opinion-units" data-toc-modified-id="Extract-opinion-units-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Extract opinion units</a></span></li><li><span><a href="#Create-new-dataframe-with-opinion-units" data-toc-modified-id="Create-new-dataframe-with-opinion-units-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Create new dataframe with opinion units</a></span></li><li><span><a href="#Export-file" data-toc-modified-id="Export-file-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Export file</a></span></li></ul></li></ul></div>

# Segment expanded titles and reviews in order to extract "opinion units"
When users write reviews, they may give their opinion on several aspects of the app (cf. __[this website](https://monkeylearn.com/blog/aspect-based-sentiment-analysis/)__). They might like one feature of the app and wish another one were better. To assess which topics are of highest interest, we'd like to segment app reviews in order to extract "opinion units" from users.

After reading some app reviews, we've identified how to modify the default rules from spaCy to extract opinion units from our app reviews:
* Use the default sentence segmentation from spaCy
* Add a rule to the pipeline, before the parser, to start a new segment when "but" or "although" is encountered
* Add a rule to never start a new segment when "so" is encountered

Empty lines should also be removed.
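
To make the rules above concrete, here is a minimal sketch that applies them to a plain list of tokens instead of a spaCy `Doc`. The function name `apply_boundary_rules` and the `default_starts` argument are hypothetical: `default_starts` simply stands in for the boundaries that spaCy's default segmentation would propose.

```python
def apply_boundary_rules(tokens, default_starts):
    """Group tokens into opinion units.

    tokens: list of str
    default_starts: set of token indices where the default segmenter
        (a stand-in for spaCy here) would start a new sentence.
    """
    units, current = [], []
    for i, tok in enumerate(tokens):
        starts_here = i in default_starts
        if tok in ("but", "although"):
            starts_here = True    # always start a new unit at "but"/"although"
        elif tok == "so":
            starts_here = False   # never start a new unit at "so"
        if starts_here and current:
            units.append(" ".join(current))
            current = []
        current.append(tok)
    if current:
        units.append(" ".join(current))
    return units


tokens = "very helpful app , but widget crashes . so i wait".split()
default_starts = {0, 8}  # assumption: the default segmenter would split before "so"
units = apply_boundary_rules(tokens, default_starts)
print(units)
```

In this toy input the "but" rule creates a boundary that the default segmentation did not propose, and the "so" rule removes one that it did.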

Here, since our dataset is not too large, we'll duplicate the review information and associate it with each opinion unit.

NB: After going through the reviews to identify topics, tags and sentiment, I realized that for certain reviews the information is mostly in the title. The review title is therefore added as an additional opinion unit. Each opinion unit also gets an index, as it may make tagging (topics, sentiment) easier.

In [1]:
import os
import pandas as pd

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Load reviews

In [3]:
path = os.getcwd()
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us_exp_abb.csv'
subfolder = '/../data/1_preprocessed_data/'

In [5]:
df = pd.read_csv(path+subfolder+filename)

In [6]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded,review_expanded
0,0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Happy to finally see when and why I ca n’t bre...,Having allergies is annoying but I ’m glad to ...
1,1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,Super,Easy to keep track on specific Local areas
2,2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,Great App,Full of good information !
3,3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,Great app for filtering the air,Tells you everything you need to know about th...
4,4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en,I look everyday,This is part of my daily planning . I love the...


## Modify rules for sentence segmentation

In [7]:
# custom rules, added before the parser so that the parser respects the boundaries set here
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        # always start a new segment at "but" and "although"
        if token.text in ('but', 'although'):
            token.is_sent_start = True
        # never start a new segment at "so"
        elif token.text == 'so':
            token.is_sent_start = False
    return doc
            
nlp.add_pipe(set_custom_boundaries, before = 'parser')

nlp.pipe_names

['tagger', 'set_custom_boundaries', 'parser', 'ner']

In [8]:
# check the application of the new rules
s = "love using this to check the air quality in my area and whenever I travel, but not at work. i like that it has the widget option so i can see it in my today view as well on iphone"

doc8 = nlp(s)
for sent in doc8.sents:
    print(sent.text)

love using this to check the air quality in my area and whenever I travel,
but not at work.
i like that it has the widget option so i can see it in my today view as well on iphone


## Removing line breaks

In [9]:
import re

In [10]:
clean_line_break = lambda s : ' '.join(re.findall('[^\n]+', s))
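
A small illustration of what this cleaning step does may help. The sketch below reimplements the lambda as a named function and compares it with a hypothetical variant, `normalize_unit`, that also collapses repeated spaces; the variant and the final filter on empty strings are assumptions, not part of the original pipeline. Note that a segment made only of a line break becomes an empty string, so empty units still have to be filtered out explicitly.

```python
import re


def clean_line_break(s):
    # same behaviour as the notebook's lambda: keep the chunks between
    # line breaks and join them with a single space
    return ' '.join(re.findall('[^\n]+', s))


def normalize_unit(s):
    # hypothetical variant: split on any whitespace (including line breaks)
    # and re-join, which also collapses repeated spaces
    return ' '.join(s.split())


raw = "LIFE SAVING , literally \n INDISPENSABLE way"
cleaned = clean_line_break(raw)      # spaces around the line break are kept
normalized = normalize_unit(raw)     # single spaces only

# a segment that only contains a line break turns into an empty string,
# so it is dropped with an explicit filter
segments = ["Very helpful !", "\n", "Super cool app"]
units = [u for u in (normalize_unit(seg) for seg in segments) if u]
print(repr(cleaned), repr(normalized), units)
```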

## Extract opinion units

In [11]:
extract_sent = lambda s : [clean_line_break(sent.text) for sent in nlp(s).sents]

In [12]:
extract_sent(s)

['love using this to check the air quality in my area and whenever I travel,',
 'but not at work.',
 'i like that it has the widget option so i can see it in my today view as well on iphone']

In [13]:
s2 = df.iloc[2147]['review_expanded']

In [None]:
extract_sent(s2)

In [14]:
df['opinion_units'] = df['review_expanded'].apply(extract_sent)

In [21]:
df.iloc[300:310]

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded,review_expanded,opinion_units
300,300,1817,5242021261,5,Super amazing app that helped during the CA wi...,2019-12-07T00:59:33Z,Davis-DD,"LIFE SAVING, literally\nINDISPENSABLE way to ...",,,,en,Super amazing app that helped during the Calif...,"LIFE SAVING , literally \n INDISPENSABLE way...","[LIFE SAVING , literally INDISPENSABLE way..."
301,301,1948,5240900262,5,Good app,2019-12-06T17:19:08Z,Nesta8,"Very helpful app, but Widget crashes on iphone xs",,,,en,Good app,"Very helpful app , but Widget crashes on iphon...","[Very helpful app ,, but Widget crashes on iph..."
302,302,480,5237483737,5,Helpful app for local air quality,2019-12-05T20:41:34Z,Hfrencijdbsijcen,This app has been helpful for looking up local...,,,,en,Helpful app for local air quality,This app has been helpful for looking up local...,[This app has been helpful for looking up loca...
303,303,801,5237080060,5,Very helpful app!,2019-12-05T17:55:48Z,Evpraxia,This app helps me prepare for the day and the ...,,,,en,Very helpful app !,This app helps me prepare for the day and the ...,[This app helps me prepare for the day and the...
304,304,2471,5236021069,5,Reza tavakoli khoo,2019-12-05T12:19:12Z,reza tavakoli khoo,Perfect app and useful,,,,en,Reza tavakoli khoo,Perfect app and useful,[Perfect app and useful]
305,305,2027,5235194088,5,So useful app,2019-12-05T07:32:49Z,Enkhamgalan,I’m glad to know this app. Super cool app,,,,en,So useful app,I ’m glad to know this app . Super cool app,"[I ’m glad to know this app ., Super cool app]"
306,306,135,5234638862,5,So accurate and a MUST HAVE!!!!!,2019-12-05T04:02:05Z,7207,I found this app from my Dr during the CA fire...,,,,en,So accurate and a MUST HAVE ! ! ! ! !,I found this app from my Doctor during the Cal...,[I found this app from my Doctor during the Ca...
307,307,1408,5234247246,5,Love that it keeps me informed!,2019-12-05T01:32:13Z,cfsama420,Very helpful! Let’s me know to stay inside on ...,,,,en,Love that it keeps me informed !,Very helpful ! Let ’s me know to stay inside o...,"[Very helpful !, Let ’s me know to stay inside..."
308,308,1210,5234243444,5,Seems great so far,2019-12-05T01:30:36Z,Summit mountain,Nice dashboard approach and specific multiple ...,,,,en,Seems great so far,Nice dashboard approach and specific multiple ...,[Nice dashboard approach and specific multiple...
309,309,2172,5234131774,5,Awesome App!,2019-12-05T00:43:56Z,Zuriwang,This app is extremely helpful 👍🏾,,,,en,Awesome App !,This app is extremely helpful 👍 🏾,[This app is extremely helpful 👍 🏾]


In [22]:
df.iloc[305]['opinion_units']

['I ’m glad to know this app .', 'Super cool app']

## Create new dataframe with opinion units

In [23]:
cols = df.columns.values.tolist()
cols.append('opi_id')
cols

['Unnamed: 0',
 'Unnamed: 0.1',
 'review_id',
 'rating',
 'title',
 'review_date',
 'user_name',
 'review',
 'response_id',
 'dev_response',
 'response_date',
 'lang',
 'title_expanded',
 'review_expanded',
 'opinion_units',
 'opi_id']

In [24]:
df_out = pd.DataFrame(columns = cols)

In [25]:
df_out

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded,review_expanded,opinion_units,opi_id


In [26]:
# function that builds a row using the review title as an additional opinion unit
def get_title_opinion(df, i, counter):
    
    title_r = {}
    
    # copy the info from all the columns except the last one
    for c in df.columns[:-1]:
        title_r[c] = df.iloc[i].loc[c]
        
    # extract the title review to fill the 'opinion_units' column
    opinion_c = df.columns[-1]
    title_r[opinion_c] = df.iloc[i].loc['title_expanded']
    
    # add the id of the opinion unit
    title_r['opi_id'] = counter
    
    return title_r

In [27]:
get_title_opinion(df, 0, 0)

{'Unnamed: 0': 0,
 'Unnamed: 0.1': 3095,
 'review_id': 6121840341,
 'rating': 5,
 'title': 'Happy to finally see when and why I can’t breathe!',
 'review_date': '2020-06-25T23:18:57Z',
 'user_name': 'Abbsteroni',
 'review': 'Having allergies is annoying but I’m glad to see this and know when to take extra precautions.',
 'response_id': nan,
 'dev_response': nan,
 'response_date': nan,
 'lang': 'en',
 'title_expanded': 'Happy to finally see when and why I ca n’t breathe !',
 'review_expanded': 'Having allergies is annoying but I ’m glad to see this and know when to take extra precautions .',
 'opinion_units': 'Happy to finally see when and why I ca n’t breathe !',
 'opi_id': 0}

In [28]:
# function that builds a row of the dataframe of opinion units from the j-th unit of review i
def new_row(df, i, j, counter):
    
    new_r = {}
    
    # copy the info from all the columns except the last one
    for c in df.columns[:-1]:
        new_r[c] = df.iloc[i].loc[c]
        
    # extract the j-th item from the list in the 'opinion_units' column of df
    opinion_c = df.columns[-1]
    new_r[opinion_c] = df.iloc[i].loc[opinion_c][j]
    
    # add the id of the opinion unit
    new_r['opi_id'] = counter
    
    return new_r

In [29]:
opinion_c = df.columns[-1]
counter = 0
rows = []

# collect the rows of the dataframe of opinion units
for i in range(len(df)):

    # add a row to extract the review title as an opinion unit
    rows.append(get_title_opinion(df, i, counter))
    counter += 1

    # add row(s) from opinion units extracted from the review
    for j in range(len(df.iloc[i].loc[opinion_c])):
        rows.append(new_row(df, i, j, counter))
        counter += 1

# DataFrame.append was removed in pandas 2.0 and copied the whole frame on each call,
# so the rows are collected in a list and the dataframe is built once
df_out = pd.DataFrame(rows, columns=cols)

In [30]:
# rename the 'opinion_units' column, since each row now holds a single opinion unit
df_out.rename({'opinion_units':'opinion_unit'}, axis = 1, inplace = True)

In [31]:
df_out.head(10)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded,review_expanded,opinion_unit,opi_id
0,0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Happy to finally see when and why I ca n’t bre...,Having allergies is annoying but I ’m glad to ...,Happy to finally see when and why I ca n’t bre...,0
1,0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Happy to finally see when and why I ca n’t bre...,Having allergies is annoying but I ’m glad to ...,Having allergies is annoying,1
2,0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en,Happy to finally see when and why I ca n’t bre...,Having allergies is annoying but I ’m glad to ...,but I ’m glad to see this and know when to tak...,2
3,1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,Super,Easy to keep track on specific Local areas,Super,3
4,1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en,Super,Easy to keep track on specific Local areas,Easy to keep track on specific Local areas,4
5,2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,Great App,Full of good information !,Great App,5
6,2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en,Great App,Full of good information !,Full of good information !,6
7,3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,Great app for filtering the air,Tells you everything you need to know about th...,Great app for filtering the air,7
8,3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en,Great app for filtering the air,Tells you everything you need to know about th...,Tells you everything you need to know about th...,8
9,4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en,I look everyday,This is part of my daily planning . I love the...,I look everyday,9
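
As an alternative to building the rows one by one, the same "one row per opinion unit" layout could be obtained with `DataFrame.explode`. This is only a sketch on a toy dataframe with three of the notebook's columns; the values are made up for illustration, and it assumes pandas 0.25 or later, where `explode` is available.

```python
import pandas as pd

# toy stand-in for df, with made-up values
df_demo = pd.DataFrame({
    "review_id": [1, 2],
    "title_expanded": ["Good app", "Super"],
    "opinion_units": [["Very helpful app ,", "but Widget crashes"], ["Easy to use"]],
})

# put the title in front of each list of opinion units
units = pd.Series(
    [[title] + ou for title, ou in zip(df_demo["title_expanded"], df_demo["opinion_units"])],
    index=df_demo.index,
)

df_units = (
    df_demo.assign(opinion_unit=units)
    .drop(columns="opinion_units")
    .explode("opinion_unit")      # one row per list item, other columns duplicated
    .reset_index(drop=True)
)

# running index over all opinion units
df_units["opi_id"] = range(len(df_units))
print(df_units)
```

The review columns are duplicated for every unit, as in the loop above, and the title comes first within each review.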


## Export file

In [32]:
export_filename = filename[:-4]+'_ou_v3.csv'
export_filename

'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us_exp_abb_ou_v3.csv'

In [33]:
export_subfolder = '/../data/2_opinion_units/'
export_subfolder

'/../data/2_opinion_units/'

In [34]:
df_out.to_csv(path+export_subfolder+export_filename)
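
The `Unnamed: 0` and `Unnamed: 0.1` columns seen in the tables above are what `read_csv` typically produces when a file was saved with its index and then read back without `index_col`. A possible way to avoid adding another such column at each step is sketched below on a toy dataframe written to a temporary folder; the file name `demo_ou.csv` is a placeholder, and `os.path.join` is used here instead of string concatenation.

```python
import os
import tempfile

import pandas as pd

# toy stand-in for df_out
df_demo = pd.DataFrame({
    "opinion_unit": ["Great App", "Full of good information !"],
    "opi_id": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "demo_ou.csv")   # placeholder file name
    df_demo.to_csv(out_path, index=False)         # do not write the index as a column
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

Whether to drop the index depends on the downstream notebooks: if they rely on the `Unnamed` columns, the current export should be kept as is.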