<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#[Preprocessing]-Replace-some-abbreviations" data-toc-modified-id="[Preprocessing]-Replace-some-abbreviations-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>[Preprocessing] Replace some abbreviations</a></span><ul class="toc-item"><li><span><a href="#Define-a-dictionary-of-main-abbreviations" data-toc-modified-id="Define-a-dictionary-of-main-abbreviations-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Define a dictionary of main abbreviations</a></span></li><li><span><a href="#Define-a-function-to-expand-specific-abbreviations" data-toc-modified-id="Define-a-function-to-expand-specific-abbreviations-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Define a function to expand specific abbreviations</a></span></li><li><span><a href="#Load-reviews" data-toc-modified-id="Load-reviews-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Load reviews</a></span></li><li><span><a href="#Apply-abbreviation-expansion" data-toc-modified-id="Apply-abbreviation-expansion-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Apply abbreviation expansion</a></span></li><li><span><a href="#Removing-line-breaks" data-toc-modified-id="Removing-line-breaks-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Removing line breaks</a></span></li><li><span><a href="#Export-file" data-toc-modified-id="Export-file-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Export file</a></span></li></ul></li></ul></div>

# [Preprocessing] Replace some abbreviations
After reading numerous app reviews, we identified a set of frequent abbreviations. To correctly identify certain locations (e.g. California) and specific words, we define a dictionary of these abbreviations, then use it to expand them in both the review title and the review text.

In [1]:
import os
import pandas as pd
import re

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

## Define a dictionary of main abbreviations

Based on observations from numerous app reviews (related to air quality), we build a dictionary that covers the most frequent abbreviations.

In [3]:
abbreviation_dict = {'SF':'San Francisco',
                     'LA':'Los Angeles',
                     'AQ':'air quality',
                     'AQI':'Air Quality Index',
                     'PG&E':'Pacific Gas and Electric',# only appears in one review
                     'Cal':'California',
                     'CA':'California',
                     'Ca':'California',
                     'Cali':'California',
                     'NorCal':'Northern California',
                     'Thx':'Thanks',
                     'hr':'hour',
                     'bc':'because',
                     '+':'and',
                     'Dr':'Doctor',
                     'gov':'government',
                     'Gov':'Government',
                     # spaCy splits 'V. good' into several tokens, so the token-level lookup below cannot match this key
                     'V. good':'Very good'}

## Define a function to expand specific abbreviations

In [4]:
def expand_abbreviations(s):
    """Return s with every token found in abbreviation_dict replaced by its expansion."""
    doc = nlp(s)
    # look each token up in the dictionary; keep the original token text if there is no match
    expanded_tokens = [abbreviation_dict.get(token.text, token.text) for token in doc]
    # return the result as a string made of expanded tokens separated by a whitespace
    return ' '.join(expanded_tokens)

In [5]:
# define a sample review to test the abbreviation expansion function
phrase = "I live in Cali, aka CA. Gov provides us info with AQ. AQI is displayed on an app only for NorCal made by PG&E, This is it!"

In [6]:
expand_abbreviations(phrase)

'I live in California , aka California . Government provides us info with air quality . Air Quality Index is displayed on an app only for Northern California made by Pacific Gas and Electric , This is it !'
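
The output above has a space before every punctuation mark, because the tokens are joined with a single whitespace. If the original spacing matters, one possible alternative is a regular-expression replacement that leaves everything except the abbreviations untouched. This is only a sketch, not the approach used in this notebook: it assumes whole-word matching is an acceptable stand-in for the spaCy tokenizer, and `abbreviations`, `pattern` and `expand_abbreviations_regex` are hypothetical names working on a small subset of the dictionary.

```python
import re

# Hypothetical subset of the notebook's dictionary, for illustration only.
abbreviations = {
    'Cali': 'California',
    'CA': 'California',
    'AQ': 'air quality',
    'AQI': 'Air Quality Index',
    'PG&E': 'Pacific Gas and Electric',
    'V. good': 'Very good',
}

# Try longer keys first, and require that a match is not glued to a word character.
keys = sorted(abbreviations, key=len, reverse=True)
pattern = re.compile(
    r'(?<!\w)(?:' + '|'.join(re.escape(k) for k in keys) + r')(?!\w)'
)

def expand_abbreviations_regex(text):
    """Replace whole-word abbreviations and leave all other characters untouched."""
    return pattern.sub(lambda m: abbreviations[m.group(0)], text)

sample = "I can't breathe in Cali, aka CA. V. good AQI info from PG&E!"
print(expand_abbreviations_regex(sample))
# I can't breathe in California, aka California. Very good Air Quality Index info from Pacific Gas and Electric!
```

Because the text is never split into tokens, contractions and punctuation keep their original form, and a key that contains a space can be matched. The trade-off is that the match is purely textual and does not benefit from spaCy's tokenization rules.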

## Load reviews

In [7]:
path = os.getcwd()
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us.csv'
subfolder = '../data/1_preprocessed_data/'

In [8]:
df = pd.read_csv(os.path.join(path, subfolder, filename), sep = ';')

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
0,3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en
1,982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en
2,1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en
3,797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en
4,962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en


## Apply abbreviation expansion

In [10]:
expand_abbreviations(df.iloc[0].loc['title'])

'Happy to finally see when and why I ca n’t breathe !'

In [11]:
df_test = df.iloc[300:310]
df_test

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
300,1817,5242021261,5,Super amazing app that helped during the CA wi...,2019-12-07T00:59:33Z,Davis-DD,"LIFE SAVING, literally\nINDISPENSABLE way to ...",,,,en
301,1948,5240900262,5,Good app,2019-12-06T17:19:08Z,Nesta8,"Very helpful app, but Widget crashes on iphone xs",,,,en
302,480,5237483737,5,Helpful app for local air quality,2019-12-05T20:41:34Z,Hfrencijdbsijcen,This app has been helpful for looking up local...,,,,en
303,801,5237080060,5,Very helpful app!,2019-12-05T17:55:48Z,Evpraxia,This app helps me prepare for the day and the ...,,,,en
304,2471,5236021069,5,Reza tavakoli khoo,2019-12-05T12:19:12Z,reza tavakoli khoo,Perfect app and useful,,,,en
305,2027,5235194088,5,So useful app,2019-12-05T07:32:49Z,Enkhamgalan,I’m glad to know this app. Super cool app,,,,en
306,135,5234638862,5,So accurate and a MUST HAVE!!!!!,2019-12-05T04:02:05Z,7207,I found this app from my Dr during the CA fire...,,,,en
307,1408,5234247246,5,Love that it keeps me informed!,2019-12-05T01:32:13Z,cfsama420,Very helpful! Let’s me know to stay inside on ...,,,,en
308,1210,5234243444,5,Seems great so far,2019-12-05T01:30:36Z,Summit mountain,Nice dashboard approach and specific multiple ...,,,,en
309,2172,5234131774,5,Awesome App!,2019-12-05T00:43:56Z,Zuriwang,This app is extremely helpful 👍🏾,,,,en


In [12]:
print(df_test['title'].apply(expand_abbreviations))

300    Super amazing app that helped during the Calif...
301                                             Good app
302                    Helpful app for local air quality
303                                   Very helpful app !
304                                   Reza tavakoli khoo
305                                        So useful app
306                So accurate and a MUST HAVE ! ! ! ! !
307                     Love that it keeps me informed !
308                                   Seems great so far
309                                        Awesome App !
Name: title, dtype: object


In [13]:
print(df_test['review'].apply(expand_abbreviations))

300    LIFE SAVING , literally \n INDISPENSABLE   way...
301    Very helpful app , but Widget crashes on iphon...
302    This app has been helpful for looking up local...
303    This app helps me prepare for the day and the ...
304                               Perfect app and useful
305          I ’m glad to know this app . Super cool app
306    I found this app from my Doctor during the Cal...
307    Very helpful ! Let ’s me know to stay inside o...
308    Nice dashboard approach and specific multiple ...
309                    This app is extremely helpful 👍 🏾
Name: review, dtype: object


## Removing line breaks

In [14]:
# apply abbreviation expansion to review title
df['title_expanded'] = df['title'].apply(expand_abbreviations)

In [15]:
# check the result
df.iloc[300:310]

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded
300,1817,5242021261,5,Super amazing app that helped during the CA wi...,2019-12-07T00:59:33Z,Davis-DD,"LIFE SAVING, literally\nINDISPENSABLE way to ...",,,,en,Super amazing app that helped during the Calif...
301,1948,5240900262,5,Good app,2019-12-06T17:19:08Z,Nesta8,"Very helpful app, but Widget crashes on iphone xs",,,,en,Good app
302,480,5237483737,5,Helpful app for local air quality,2019-12-05T20:41:34Z,Hfrencijdbsijcen,This app has been helpful for looking up local...,,,,en,Helpful app for local air quality
303,801,5237080060,5,Very helpful app!,2019-12-05T17:55:48Z,Evpraxia,This app helps me prepare for the day and the ...,,,,en,Very helpful app !
304,2471,5236021069,5,Reza tavakoli khoo,2019-12-05T12:19:12Z,reza tavakoli khoo,Perfect app and useful,,,,en,Reza tavakoli khoo
305,2027,5235194088,5,So useful app,2019-12-05T07:32:49Z,Enkhamgalan,I’m glad to know this app. Super cool app,,,,en,So useful app
306,135,5234638862,5,So accurate and a MUST HAVE!!!!!,2019-12-05T04:02:05Z,7207,I found this app from my Dr during the CA fire...,,,,en,So accurate and a MUST HAVE ! ! ! ! !
307,1408,5234247246,5,Love that it keeps me informed!,2019-12-05T01:32:13Z,cfsama420,Very helpful! Let’s me know to stay inside on ...,,,,en,Love that it keeps me informed !
308,1210,5234243444,5,Seems great so far,2019-12-05T01:30:36Z,Summit mountain,Nice dashboard approach and specific multiple ...,,,,en,Seems great so far
309,2172,5234131774,5,Awesome App!,2019-12-05T00:43:56Z,Zuriwang,This app is extremely helpful 👍🏾,,,,en,Awesome App !


In [16]:
# apply abbreviation expansion to review text
df['review_expanded'] = df['review'].apply(expand_abbreviations)

In [17]:
# check the result
df.iloc[1250:1255]

Unnamed: 0.1,Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang,title_expanded,review_expanded
1250,2128,3438109967,5,Thank you,2018-11-20T03:49:45Z,kak0901,Love this app,,,,en,Thank you,Love this app
1251,2275,3437771163,5,Good app,2018-11-20T01:31:03Z,Fantasydance,It does help a lot,,,,en,Good app,It does help a lot
1252,471,3437628149,5,The best air quality app! (I checked),2018-11-20T00:32:55Z,whenvoice,I started using this during the California fir...,,,,en,The best air quality app ! ( I checked ),I started using this during the California fir...
1253,655,3437584420,5,Best app for air quality,2018-11-20T00:15:10Z,Stev012,If your interested in know how healthy the air...,,,,en,Best app for air quality,If your interested in know how healthy the air...
1254,1822,3437516140,5,Great app and very informative,2018-11-19T23:46:21Z,aanaroz,Amazing how cool the app is. Thank you for all...,,,,en,Great app and very informative,Amazing how cool the app is . Thank you for al...
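
The heading of this section mentions line breaks, and review 300 in the previews above still contains a `\n` after expansion. The cells shown here do not include the removal step, so the following is only a sketch. It assumes that every line break, together with any surrounding run of whitespace, should become a single space; `remove_line_breaks` is a hypothetical helper name.

```python
import re

def remove_line_breaks(text):
    """Collapse line breaks and any other run of whitespace into one space."""
    return re.sub(r'\s+', ' ', text).strip()

sample = 'LIFE SAVING , literally \n INDISPENSABLE   way'
print(remove_line_breaks(sample))
# LIFE SAVING , literally INDISPENSABLE way
```

Under the same assumption, the helper could be applied to a whole column with `df['review_expanded'].apply(remove_line_breaks)`.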


## Export file

In [18]:
export_filename = filename[:-4]+'_exp_abb.csv'
export_filename

'app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us_exp_abb.csv'

In [19]:
export_subfolder = '../data/1_preprocessed_data/'
export_subfolder

'../data/1_preprocessed_data/'

In [20]:
df.to_csv(os.path.join(path, export_subfolder, export_filename))
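
The `Unnamed: 0` and `Unnamed: 0.1` columns in the previews above are what pandas typically produces when a DataFrame index is written by `to_csv` and read back as a regular column. Note also that the input file was read with `sep=';'`, while the export above uses the default comma; whether that difference is intended is not stated here. The sketch below illustrates the index behaviour on a tiny hypothetical table (`df_demo`, written to in-memory buffers), not on the review file itself.

```python
import io
import pandas as pd

# Tiny stand-in for the review table (hypothetical values).
df_demo = pd.DataFrame({'title': ['Good app', 'Super'], 'rating': [5, 5]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index, sep=';')
with_index.seek(0)
cols_with_index = pd.read_csv(with_index, sep=';').columns.tolist()
print(cols_with_index)      # ['Unnamed: 0', 'title', 'rating']

# index=False leaves the index out, so the columns survive a round trip.
without_index = io.StringIO()
df_demo.to_csv(without_index, sep=';', index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index, sep=';').columns.tolist()
print(cols_without_index)   # ['title', 'rating']
```

If later steps do not rely on the saved index, passing `index=False` (and the same `sep` used for reading) to `to_csv` would keep the exported file free of extra `Unnamed` columns.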