<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing,-exploration-and-export-of-app-reviews" data-toc-modified-id="Preprocessing,-exploration-and-export-of-app-reviews-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing, exploration and export of app reviews</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Ratings" data-toc-modified-id="Ratings-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Ratings</a></span></li><li><span><a href="#Detect-language" data-toc-modified-id="Detect-language-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Detect language</a></span></li><li><span><a href="#Sort-data" data-toc-modified-id="Sort-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Sort data</a></span></li><li><span><a href="#Export-data" data-toc-modified-id="Export-data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Export data</a></span></li></ul></li></ul></div>

# Preprocessing, exploration and export of app reviews

We scraped App Store reviews of a [specific app](https://apps.apple.com/fr/app/airvisual-qualit%C3%A9-de-lair/id1048912974#see-all/reviews) related to air quality. The link points to the French store listing; the file loaded in this notebook contains the reviews from the Great Britain store. Our goal is to analyse these reviews to learn about:
* usages
* most relevant app features
* "missing" app features, or features that users would like the app to have
* technical issues.

Data preparation, such as sorting the reviews according to selected criteria, will be key to analysing them. It also gives us the opportunity to test some NLP tools as needed (language detection, sentiment analysis...).

In [2]:
import pandas as pd
from langdetect import detect
import warnings
warnings.filterwarnings('ignore')

## Load data

In [3]:
#filename = 'app_reviews_airvisual-air-quality-forecast_1048912974.json'
filename = 'app_reviews_en.json'  # from the Great Britain App Store

In [4]:
df = pd.read_json(filename)

In [5]:
df.head()

| | review_id | rating | title | review_date | user_name | review | response_id | dev_response | response_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4915894551 | 5 | A tragic reality | 2019-10-10T01:06:52Z | Si & Ro | Amazingly helpful app, both for health and as ... | | | |
| 1 | 5387399470 | 5 | Really useful information | 2020-01-11T01:19:08Z | Suez62 | We have an ‘eco’ woodburner and as an asthmati... | | | |
| 2 | 4057061182 | 5 | Air quality app | 2019-04-25T07:07:47Z | # alone at xmas | I was introduced to this app via a friend who ... | | | |
| 3 | 4858351034 | 5 | Accurate & Reliable | 2019-09-29T07:38:33Z | r2thebizel | Very clear, easy to use and engaging. Very rel... | | | |
| 4 | 3883883779 | 5 | Best aqi app yet | 2019-03-15T12:07:55Z | jhugs43 | I’ve tried several aqi apps over the years, mo... | | | |


In [6]:
df.columns

Index(['review_id', 'rating', 'title', 'review_date', 'user_name', 'review',
       'response_id', 'dev_response', 'response_date'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review_id      243 non-null    int64  
 1   rating         243 non-null    int64  
 2   title          243 non-null    object 
 3   review_date    243 non-null    object 
 4   user_name      243 non-null    object 
 5   review         243 non-null    object 
 6   response_id    2 non-null      float64
 7   dev_response   2 non-null      object 
 8   response_date  2 non-null      object 
dtypes: float64(1), int64(2), object(6)
memory usage: 17.2+ KB
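
The summary above shows that `review_date` is stored as plain strings (`object`). A minimal sketch of how such a column could be parsed into timezone-aware datetimes is given below. It uses a small stand-in frame (`sample`, a hypothetical name) built from two dates shown earlier and is not part of the original notebook.

```python
import pandas as pd

# Stand-in frame with two review_date values taken from the output above
sample = pd.DataFrame(
    {"review_date": ["2019-10-10T01:06:52Z", "2020-01-11T01:19:08Z"]}
)

# Parse the ISO 8601 strings; utc=True keeps the trailing "Z" as an explicit UTC timezone
sample["review_date"] = pd.to_datetime(sample["review_date"], utc=True)

# Datetime columns expose date parts through the .dt accessor
print(sample["review_date"].dt.year.tolist())
```

Parsed dates make it possible to filter or group the reviews by period; for plain sorting the ISO strings already order correctly.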


In [8]:
df['review']

0      Amazingly helpful app, both for health and as ...
1      We have an ‘eco’ woodburner and as an asthmati...
2      I was introduced to this app via a friend who ...
3      Very clear, easy to use and engaging. Very rel...
4      I’ve tried several aqi apps over the years, mo...
                             ...                        
238                                       Thanks the app
239    Living in Baoding, China this app is essential...
240                Good app gives a realistic assessment
241             Pretty good. Will recommend to everyone.
242                      Great App! Needs more publicity
Name: review, Length: 243, dtype: object

## Ratings

In [9]:
# assess the distribution of ratings
df['rating'].value_counts()

5    203
4     30
3      5
2      3
1      2
Name: rating, dtype: int64

In [10]:
# assess mean rating
df['rating'].mean()

4.765432098765432
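
To read the distribution as proportions rather than counts, `value_counts(normalize=True)` can be used. The sketch below rebuilds the rating column from the counts printed above so that it runs on its own; `counts`, `ratings`, `shares` and `low_share` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Rebuild the rating column from the counts printed above
counts = {5: 203, 4: 30, 3: 5, 2: 3, 1: 2}
ratings = pd.Series(
    [rating for rating, n in counts.items() for _ in range(n)], name="rating"
)

# Share of each rating, from 5 stars down to 1 star
shares = ratings.value_counts(normalize=True).sort_index(ascending=False)
print(shares.round(3))

# Share of clearly negative reviews (1 or 2 stars)
low_share = (ratings <= 2).mean()
print(round(low_share, 3))
```

With so few low ratings, the negative reviews are a small sample and may be worth reading individually.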

## Detect language

In [11]:
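# Define a function to identify language and catch exceptions
from langdetect import DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# fix the seed once so that language detection is deterministic
DetectorFactory.seed = 0

def lang_detect(text):
    try:
        return detect(text)
    except LangDetectException:
        # raised when the text has no usable features (e.g. emoji only)
        return "language not detected"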

In [12]:
# Detect the language used in the reviews
df['lang-r'] = df['review'].apply(lang_detect)

In [13]:
# What are the detected languages?
df['lang-r'].unique()

array(['en', 'af', 'nl', 'et', 'de', 'it', 'language not detected', 'tr',
       'so', 'fr', 'sk', 'ru', 'zh-cn', 'zh-tw', 'es', 'sv'], dtype=object)

In [14]:
# What is the distribution of the detected languages?
df['lang-r'].value_counts()

en                       218
so                         4
af                         3
language not detected      2
it                         2
zh-cn                      2
et                         2
fr                         2
sv                         1
ru                         1
sk                         1
es                         1
de                         1
zh-tw                      1
tr                         1
nl                         1
Name: lang-r, dtype: int64

As expected, most reviews are detected as being in English, since reviews were collected from the Great Britain appstore.

In [15]:
# Look at reviews where the language could not be detected
df.loc[df['lang-r']=='language not detected']

| | review_id | rating | title | review_date | user_name | review | response_id | dev_response | response_date | lang-r |
|---|---|---|---|---|---|---|---|---|---|---|
| 69 | 3734013494 | 5 | One of the realest | 2019-02-04T23:53:47Z | DeanBlunt | 🍑 | | | | language not detected |
| 123 | 5527186972 | 5 | makibg75 | 2020-02-13T21:17:28Z | makibg75 | 🙌 | | | | language not detected |


The two reviews whose language could not be detected consist only of an emoji.
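
Such reviews can also be flagged before language detection by checking whether the text contains any letter at all. The helper below (`has_letters`, a hypothetical name) is a small sketch of that idea using only the standard library.

```python
def has_letters(text):
    """Return True if the text contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in text)

# The two undetected reviews above, plus one short review for comparison
samples = ["🍑", "🙌", "Works well"]
flags = [has_letters(s) for s in samples]
print(flags)  # [False, False, True]
```

Applied to the `review` column, a check like this would separate letter-free reviews from genuine detection failures.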

In [17]:
# Look at the reviews where the detected language is neither English nor "language not detected"
df.loc[(df['lang-r']!='en')
       &(df['lang-r']!='language not detected')].sort_values(by='lang-r').head(23)

| | review_id | rating | title | review_date | user_name | review | response_id | dev_response | response_date | lang-r |
|---|---|---|---|---|---|---|---|---|---|---|
| 36 | 2599431784 | 5 | Very good | 2018-05-26T15:44:04Z | Theworldis! | Works well | | | | af |
| 41 | 5458614134 | 5 | Great app | 2020-01-28T13:45:19Z | UKNPN | Keep the good work! | | | | af |
| 126 | 4102003763 | 5 | V. Good | 2019-05-05T02:17:37Z | jagmes | V. Good | | | | af |
| 65 | 4812621500 | 5 | Great app | 2019-09-21T01:39:54Z | ChungChungvn | Fast. | | | | de |
| 141 | 3311259676 | 5 | Caracteristicas del aire oportuno. | 2018-10-17T07:41:13Z | CharliePT | Es de suma importancia saber con antelación la... | | | | es |
| 46 | 2516097380 | 3 | Useful but some glitches | 2018-05-07T13:03:03Z | qdgnjfdjydgkbb | See title | | | | et |
| 101 | 3593000020 | 5 | Great | 2018-12-30T21:25:27Z | Killmallock | V useful | | | | et |
| 104 | 5849975615 | 5 | Hi | 2020-04-22T22:20:20Z | stolarzuk | Super app | | | | fr |
| 236 | 5356697087 | 5 | Great app | 2020-01-03T19:41:39Z | SRD7142 | Gives you visual pollution rates. | | | | fr |
| 89 | 5725223295 | 5 | great | 2020-03-28T09:37:22Z | kisne4444 | so usefull | | | | it |


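When a review is short, and especially when it contains a typo or an abbreviation, the detected language is often wrong: "Works well" is tagged as Afrikaans and "Super app" as French, although both are English. We therefore remove only the reviews detected as Spanish, Russian, Simplified Chinese (`zh-cn`) or Traditional Chinese (`zh-tw`), and treat all remaining reviews as English.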
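
One way to make the short-review problem explicit is to trust the detector only above a minimum text length and fall back to a default language otherwise. The sketch below is an illustration, not the approach used in this notebook: the function name, the 20-character threshold and the `"en"` fallback are all assumptions, and `fake_detector` is a stand-in so that the example runs without `langdetect` installed.

```python
def detect_if_long_enough(text, detector, min_chars=20, fallback="en"):
    """Run `detector` only on texts with at least `min_chars` characters."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        # Too short to be reliable: assume the store's main language
        return fallback
    return detector(stripped)

# Stand-in detector for illustration only; in the notebook this would be lang_detect
def fake_detector(text):
    return "es" if text.startswith("Es de suma") else "en"

print(detect_if_long_enough("V. Good", fake_detector))
print(detect_if_long_enough("Es de suma importancia saber con antelación", fake_detector))
```

The threshold would need tuning on real data, since a short review can still be written in another language.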

In [19]:
# keep all reviews except those detected as Spanish, Russian or Chinese
# .copy() gives an independent frame, so adding columns later does not touch a view of df
dfout = df.loc[~df['lang-r'].isin(['es', 'ru', 'zh-cn', 'zh-tw'])].copy()

In [22]:
# the remaining reviews are treated as English
dfout['lang'] = 'en'

In [23]:
del dfout['lang-r']

In [24]:
dfout.head()

| | review_id | rating | title | review_date | user_name | review | response_id | dev_response | response_date | lang |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4915894551 | 5 | A tragic reality | 2019-10-10T01:06:52Z | Si & Ro | Amazingly helpful app, both for health and as ... | | | | en |
| 1 | 5387399470 | 5 | Really useful information | 2020-01-11T01:19:08Z | Suez62 | We have an ‘eco’ woodburner and as an asthmati... | | | | en |
| 2 | 4057061182 | 5 | Air quality app | 2019-04-25T07:07:47Z | # alone at xmas | I was introduced to this app via a friend who ... | | | | en |
| 3 | 4858351034 | 5 | Accurate & Reliable | 2019-09-29T07:38:33Z | r2thebizel | Very clear, easy to use and engaging. Very rel... | | | | en |
| 4 | 3883883779 | 5 | Best aqi app yet | 2019-03-15T12:07:55Z | jhugs43 | I’ve tried several aqi apps over the years, mo... | | | | en |


## Sort data

In [25]:
# sort by language, then rating, then review date, all in descending order
dfout.sort_values(by=['lang', 'rating', 'review_date'], inplace=True, ascending=False)

In [26]:
dfout

| | review_id | rating | title | review_date | user_name | review | response_id | dev_response | response_date | lang |
|---|---|---|---|---|---|---|---|---|---|---|
| 84 | 6102009673 | 5 | Klarkov | 2020-06-20T22:32:21Z | klarkov | Love this app | | | | en |
| 57 | 6090391442 | 5 | The only essential app | 2020-06-18T06:29:17Z | Micrian | Would like to see this data being used to brin... | | | | en |
| 96 | 6024828701 | 5 | Kate1290 | 2020-06-02T07:50:03Z | 1290Kate | As a mother with children in year 01 and year ... | | | | en |
| 85 | 5986928118 | 5 | Good | 2020-05-24T08:01:51Z | Tethanskronosdia | Good app | | | | en |
| 14 | 5952085335 | 5 | Simple, easy to use and very helpful | 2020-05-16T08:08:24Z | Super_sonic-saiyajin | So far so good!! This app is quite useful and ... | | | | en |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34 | 2340258010 | 2 | Poorly | 2018-03-24T08:00:33Z | Gosoo123 | It isn’t online if reads 12 times a day. | | | | en |
| 230 | 1643271194 | 2 | User experience? | 2017-06-13T18:46:29Z | Karl219 | How can you fail to make deleting a weather st... | | | | en |
| 124 | 1385080584 | 2 | Widget not working. | 2016-05-28T00:51:58Z | david171 | The widget does not work, no matter what diffe... | | | | en |
| 63 | 4902507071 | 1 | Accuracy concern | 2019-10-07T02:49:33Z | Long181 | I have high concern about accuracy of the meas... | | | | en |
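
The sort above uses descending order for every key. If the aim were instead to read the most critical reviews first, `sort_values` also accepts one `ascending` flag per column. The sketch below illustrates this on a small stand-in frame (`sample`, a hypothetical name) built from four rows of the output above. It returns a new frame rather than sorting in place.

```python
import pandas as pd

# Stand-in frame with ratings and dates taken from four rows of the output above
sample = pd.DataFrame(
    {
        "rating": [5, 1, 5, 2],
        "review_date": [
            "2020-06-18T06:29:17Z",
            "2019-10-07T02:49:33Z",
            "2020-06-20T22:32:21Z",
            "2018-03-24T08:00:33Z",
        ],
    },
    index=[57, 63, 84, 34],
)

# Lowest ratings first and, within a rating, the most recent review first
by_priority = sample.sort_values(
    by=["rating", "review_date"], ascending=[True, False]
)
print(by_priority.index.tolist())
```

Sorting the ISO 8601 strings directly works here because they share one format and timezone.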


## Export data

In [27]:
# export to a semicolon-separated CSV; utf-8-sig writes a BOM so that Excel detects the encoding
dfout.to_csv('app_reviews_airvisual-air-quality-forecast_1048912974_by_lang_gb.csv', encoding='utf-8-sig', sep=';')
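
When the exported file is read back, the same separator and encoding have to be passed, and the first column has to be declared as the index. The sketch below illustrates this round trip with a temporary file. The frame `sample`, the file name `sample_export.csv` and the second review text (made up to include a semicolon) are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Stand-in frame; the second review is invented to show that ';' inside text is quoted
sample = pd.DataFrame(
    {
        "rating": [5, 1],
        "review": ["Love this app", "Inaccurate; please fix"],
        "lang": ["en", "en"],
    },
    index=[84, 63],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_export.csv")
    sample.to_csv(path, encoding="utf-8-sig", sep=";")
    # Same separator and encoding as the export; column 0 holds the original index
    restored = pd.read_csv(path, encoding="utf-8-sig", sep=";", index_col=0)

print(restored)
```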