<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing,-exploration-and-export-of-app-reviews-[USA-appstore]" data-toc-modified-id="Preprocessing,-exploration-and-export-of-app-reviews-[USA-appstore]-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing, exploration and export of app reviews [USA appstore]</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Ratings" data-toc-modified-id="Ratings-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Ratings</a></span></li><li><span><a href="#Detect-language" data-toc-modified-id="Detect-language-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Detect language</a></span></li><li><span><a href="#Sort-data" data-toc-modified-id="Sort-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Sort data</a></span></li><li><span><a href="#Export-data" data-toc-modified-id="Export-data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Export data</a></span></li></ul></li></ul></div>

# Preprocessing, exploration and export of app reviews [USA appstore]

We have scraped reviews of a [specific app](https://apps.apple.com/fr/app/airvisual-qualit%C3%A9-de-lair/id1048912974#see-all/reviews) from the US App Store. This app is related to air quality. Our goal is to analyse these reviews to learn about:
* usages
* most relevant app features
* "missing" app features, or features that users would like the app to have
* technical issues.

Data preparation, such as sorting reviews according to selected criteria, will be key to analysing the reviews. It also gives us the opportunity to test some NLP tools as needed (language detection, sentiment analysis...).

In [1]:
import pandas as pd
from langdetect import detect
import warnings
warnings.filterwarnings('ignore')
import os

## Load data

In [2]:
path = os.getcwd()
filename = 'app_reviews_us.json'

In [3]:
df = pd.read_json(os.path.join(path, "..", "data", "0_scraped_data", filename))

In [4]:
df.head()

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date
0,1444976984,5,"Did my research, get it",2016-09-05T14:22:49Z,Grindelli,I have lived in China for 3 years and have a t...,,,
1,5106173514,5,Your AQI in an app,2019-11-08T19:14:02Z,Kido@PDX,This free app can help you with information an...,,,
2,4525953569,5,Potential Lifesaver,2019-07-27T13:37:32Z,ttu101,It is only just beginning to seep into the pub...,,,
3,3058376013,1,Forecast data is completely unreliable,2018-08-13T11:40:29Z,c0rlette,While current local condition air quality seem...,,,
4,5333716395,5,Peace of Mind,2019-12-29T08:29:19Z,TangoXray,The app is so helpful as a quick reference to ...,,,


In [5]:
df.columns

Index(['review_id', 'rating', 'title', 'review_date', 'user_name', 'review',
       'response_id', 'dev_response', 'response_date'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3097 entries, 0 to 3096
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review_id      3097 non-null   int64  
 1   rating         3097 non-null   int64  
 2   title          3097 non-null   object 
 3   review_date    3097 non-null   object 
 4   user_name      3097 non-null   object 
 5   review         3097 non-null   object 
 6   response_id    7 non-null      float64
 7   dev_response   7 non-null      object 
 8   response_date  7 non-null      object 
dtypes: float64(1), int64(2), object(6)
memory usage: 217.9+ KB


In [7]:
df['review']

0       I have lived in China for 3 years and have a t...
1       This free app can help you with information an...
2       It is only just beginning to seep into the pub...
3       While current local condition air quality seem...
4       The app is so helpful as a quick reference to ...
                              ...                        
3092    Easy to read and understand. Visually pleasing...
3093           Très bonne application, fiable et précise.
3094                                        best app ever
3095    Having allergies is annoying but I’m glad to s...
3096    I got this app a few weeks ago because it seem...
Name: review, Length: 3097, dtype: object

## Ratings

In [8]:
# assess the distribution of ratings
df['rating'].value_counts()

5    2518
4     383
1      89
3      71
2      36
Name: rating, dtype: int64

In [9]:
# assess mean rating
df['rating'].mean()

4.680658701969648
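
As a cross-check, the mean rating can be recomputed from the rating counts shown above. The sketch below uses plain Python with the counts copied from the `value_counts()` output; the variable names are illustrative and not part of the original notebook.

```python
# Rating counts copied from the value_counts() output above
rating_counts = {5: 2518, 4: 383, 3: 71, 2: 36, 1: 89}

total_reviews = sum(rating_counts.values())
weighted_sum = sum(rating * count for rating, count in rating_counts.items())

mean_rating = weighted_sum / total_reviews
five_star_share = rating_counts[5] / total_reviews

print(total_reviews)             # 3097
print(round(mean_rating, 4))     # 4.6807
print(f"{five_star_share:.1%}")  # 81.3%
```

The distribution is heavily skewed towards 5-star reviews: only 196 reviews have a rating of 3 or lower.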

## Detect language

In [10]:
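# Define a function to identify the language and handle detection failures
from langdetect import DetectorFactory

# Fix the seed once so that langdetect returns the same result on every run
DetectorFactory.seed = 0

def lang_detect(text):
    try:
        return detect(text)
    except Exception:
        # detect() raises when the text has no usable features (e.g. only emoji, punctuation or digits)
        return "language not detected"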

In [11]:
# Detect the language used in the reviews
df['lang-r'] = df['review'].apply(lang_detect)

In [12]:
# What are the detected languages?
df['lang-r'].unique()

array(['en', 'it', 'ro', 'no', 'ca', 'et', 'af', 'da', 'tr', 'sv', 'pt',
       'sq', 'sl', 'tl', 'de', 'nl', 'so', 'es', 'cs', 'fr',
       'language not detected', 'sw', 'fi', 'id', 'cy', 'vi', 'sk', 'hr',
       'pl', 'ko', 'ja', 'zh-cn', 'uk', 'th', 'ru', 'hu'], dtype=object)

In [13]:
# What is the distribution of the detected languages?
df['lang-r'].value_counts()

en                       2641
so                         69
vi                         67
af                         42
ko                         42
language not detected      28
ro                         25
zh-cn                      23
it                         19
ca                         15
es                         14
fr                         13
th                         10
nl                          8
da                          7
cy                          7
no                          6
sw                          6
sl                          6
de                          6
id                          6
tl                          5
sq                          4
et                          4
sv                          4
tr                          4
cs                          3
ja                          2
fi                          2
sk                          2
pt                          2
ru                          1
uk                          1
hu        

As expected, most reviews (2,641 of 3,097) are detected as being in English, since the reviews were collected from the US App Store.

In [14]:
# Look at reviews where the language could not be detected
df.loc[df['lang-r']=='language not detected']

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r
2240,4620151372,5,Breath better,2019-08-15T14:02:32Z,PERRYPOST RESTERAUNT,?,,,,language not detected
2271,5697230971,5,Great App for Determining Air Quality,2020-03-22T16:07:07Z,Teysha,....,,,,language not detected
2284,3060787039,5,Works well,2018-08-14T01:13:17Z,EvolutionXIII,👍🏼,,,,language not detected
2331,5410556931,5,Great and reliable!,2020-01-16T12:26:19Z,star.p_,👍🏽,,,,language not detected
2394,5418929112,5,Very useful,2020-01-18T13:21:55Z,Jack111188,..,,,,language not detected
2511,3649027847,5,So helpful !,2019-01-14T12:23:36Z,bammmmmmmm1551,❤️❤️,,,,language not detected
2534,3437121196,5,Best pollution detection app ever,2018-11-19T20:53:35Z,samuel_banapour,!!!😜,,,,language not detected
2567,4925884828,5,Phongtag,2019-10-12T00:17:38Z,phongtag,6,,,,language not detected
2618,3997524762,5,👍 Good stuff,2019-04-11T07:02:44Z,Intermiss,👍👍,,,,language not detected
2629,5308269360,5,Great!!,2019-12-23T11:18:18Z,Pixon99,5*,,,,language not detected


Most reviews where the language could not be detected consist only of emoji or punctuation. A few contain only digits or symbols (for example "6" or "5*").
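
One way to anticipate these failures is to check whether a review contains any letters before calling `detect`. The helper below is a hypothetical addition (`has_letters` is not part of the original notebook or of `langdetect`); it relies only on `str.isalpha`.

```python
def has_letters(text):
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)

# Review texts taken from the undetected rows above, plus one ordinary review
samples = ["?", "....", "👍🏼", "❤️❤️", "!!!😜", "6", "5*", "best app ever"]

for text in samples:
    print(repr(text), has_letters(text))
```

Only the last sample contains letters. Such a check could be used to label letter-free reviews without calling the detector, although it is not guaranteed to cover every case in which detection fails.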

Given the number of reviews, we'll export only those whose language was detected as English.

In [18]:
# keep only the reviews detected as English (copy to avoid modifying a view of df)
dfout = df.loc[df['lang-r']=='en'].copy()

In [19]:
dfout['lang'] = 'en'

In [20]:
del dfout['lang-r']

In [21]:
dfout.head()

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en
982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en
1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en
797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en
962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en


In [22]:
len(dfout)

2641

## Sort data

In [23]:
# sort by language, then rating, then review date, all in descending order
dfout.sort_values(by=['lang', 'rating', 'review_date'], ascending=False, inplace=True)

In [24]:
dfout

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang
3095,6121840341,5,Happy to finally see when and why I can’t brea...,2020-06-25T23:18:57Z,Abbsteroni,Having allergies is annoying but I’m glad to s...,,,,en
982,6114444527,5,Super,2020-06-24T02:12:59Z,WillJosue,Easy to keep track on specific Local areas,,,,en
1965,6114325210,5,Great App,2020-06-24T01:31:38Z,Nejinater,Full of good information!,,,,en
797,6111838742,5,Great app for filtering the air,2020-06-23T11:01:04Z,Jamieissad,Tells you everything you need to know about th...,,,,en
962,6104666348,5,I look everyday,2020-06-21T14:29:25Z,TorchPitchfork,This is part of my daily planning. I love the ...,,,,en
...,...,...,...,...,...,...,...,...,...,...
664,1606714251,1,App Didn't Recognize Air Quality Alert,2017-05-06T13:07:12Z,Tired of Being Hounded,National Weather Service had issued an Ozone A...,,,,en
7,1601019558,1,Horribly inaccurate.,2017-04-30T01:43:28Z,Galia Stauffer,I live in South Korea and have for a total of ...,,,,en
2828,1404336381,1,Bad App,2016-07-02T22:34:12Z,1nEden,The info is inaccurate. My location had a gree...,,,,en
2835,1396610393,1,Research Scientist,2016-06-18T18:57:39Z,Elicia P,This app is completely inaccurate for Philadel...,,,,en
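
`review_date` is stored as text (dtype `object`), so the sort above compares strings. In the rows shown, every timestamp follows the same ISO 8601 layout (`YYYY-MM-DDTHH:MM:SSZ`), in which case string order and chronological order coincide. The sketch below checks this on a few timestamps from the table, using only the standard library; the helper name `parse` is illustrative.

```python
from datetime import datetime

# Timestamps copied from the review_date column shown above
dates = [
    "2017-05-06T13:07:12Z",
    "2020-06-25T23:18:57Z",
    "2016-06-18T18:57:39Z",
    "2020-06-24T02:12:59Z",
]

def parse(ts):
    # Parse the fixed "YYYY-MM-DDTHH:MM:SSZ" layout into a datetime
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

sorted_as_text = sorted(dates, reverse=True)
sorted_as_time = sorted(dates, key=parse, reverse=True)

print(sorted_as_text == sorted_as_time)  # True
print(sorted_as_text[0])                 # 2020-06-25T23:18:57Z
```

If the layout varied between rows, converting the column with `pd.to_datetime` before sorting would be the safer option.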


## Export data

In [25]:
# export to a semicolon-separated CSV file (utf-8-sig writes a UTF-8 byte order mark)
dfout.to_csv('app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_us.csv', encoding='utf-8-sig', sep=';')
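
The `utf-8-sig` encoding writes a byte order mark (BOM) at the start of the file. A common reason to choose it, assumed here, is to help spreadsheet software recognise the file as UTF-8. Whoever reads the file back needs the same encoding and separator. The sketch below illustrates this with the standard library's `csv` module rather than pandas; the file name and rows are made up for the example.

```python
import csv
import os
import tempfile

rows = [["review_id", "rating", "review"], ["1", "5", "Très bonne application"]]

# Hypothetical temporary file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "example_reviews.csv")

# Write a semicolon-separated file with a UTF-8 byte order mark
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f, delimiter=";").writerows(rows)

# The raw bytes start with the BOM (EF BB BF)
with open(path, "rb") as f:
    starts_with_bom = f.read(3) == b"\xef\xbb\xbf"

# Reading with the same encoding and delimiter strips the BOM again
with open(path, encoding="utf-8-sig", newline="") as f:
    read_back = list(csv.reader(f, delimiter=";"))

print(starts_with_bom)    # True
print(read_back == rows)  # True
```

With pandas, the matching read would pass `sep=';'` and `encoding='utf-8-sig'` to `pd.read_csv`.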

NB: Other language detection tools are available and might yield better results.