<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocessing,-exploration-and-export-of-app-reviews" data-toc-modified-id="Preprocessing,-exploration-and-export-of-app-reviews-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocessing, exploration and export of app reviews</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Ratings" data-toc-modified-id="Ratings-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Ratings</a></span></li><li><span><a href="#Detect-language" data-toc-modified-id="Detect-language-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Detect language</a></span></li><li><span><a href="#Sort-data" data-toc-modified-id="Sort-data-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Sort data</a></span></li><li><span><a href="#Export-data" data-toc-modified-id="Export-data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Export data</a></span></li></ul></li></ul></div>

# Preprocessing, exploration and export of app reviews

We scraped the reviews of a [specific app](https://apps.apple.com/fr/app/airvisual-qualit%C3%A9-de-lair/id1048912974#see-all/reviews) from the French App Store. The app reports on air quality. Our goal is to analyse these reviews to learn about:
* usages
* most relevant app features
* "missing" app features, or features that users would like the app to have
* technical issues.

Data preparation is key to this analysis, for example sorting reviews by selected criteria. It also gives us the opportunity to test some NLP tools as needed (language detection, sentiment analysis, etc.).

In [1]:
import pandas as pd
from langdetect import detect
import warnings
warnings.filterwarnings('ignore')

## Load data

In [2]:
filename = 'app_reviews_airvisual-air-quality-forecast_1048912974.json'

In [3]:
df = pd.read_json(filename)

In [4]:
df.head()

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date
0,1512229481,5,Parfait,2016-12-30T22:00:51Z,PaysDuMontBlancPollué,"Appli indispensable au pays du Mont-Blanc, et ...",,,
1,4004281790,5,Très utile !,2019-04-12T20:00:15Z,lucasfinck1211,Cela fait désormais une semaine que j’ai téléc...,,,
2,1513779072,5,Pays de Savoie -> utile!!!,2017-01-02T09:00:04Z,Pornawak333,"Habitant d Annecy, ville magnifique mais forte...",,,
3,1517811476,5,Indispensable,2017-01-08T11:50:03Z,pascalwirth,"Très bonne application, une démarche réseau af...",,,
4,1508935600,4,Recherche de ville,2016-12-25T20:42:43Z,YodaMoi,Très bonne appli. mais la recherche de ville m...,,,


In [5]:
df.columns

Index(['review_id', 'rating', 'title', 'review_date', 'user_name', 'review',
       'response_id', 'dev_response', 'response_date'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334 entries, 0 to 333
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review_id      334 non-null    int64  
 1   rating         334 non-null    int64  
 2   title          334 non-null    object 
 3   review_date    334 non-null    object 
 4   user_name      334 non-null    object 
 5   review         334 non-null    object 
 6   response_id    3 non-null      float64
 7   dev_response   3 non-null      object 
 8   response_date  3 non-null      object 
dtypes: float64(1), int64(2), object(6)
memory usage: 23.6+ KB


There are 334 reviews, and only 3 of them have a response from the developer.

To verify that this is not a scraping issue, we went back to the [see-all reviews page](https://apps.apple.com/fr/app/airvisual-qualit%C3%A9-de-lair/id1048912974#see-all/reviews). After loading all the reviews, the page showed only 3 developer responses (as of 26/06/2020).
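
The same check can be expressed directly in pandas rather than read off `df.info()`. The sketch below is a minimal illustration on a toy frame: the values in `toy` are made up, and only the column name `dev_response` comes from the dataset.

```python
import pandas as pd

# Toy frame standing in for the scraped reviews (illustrative values only)
toy = pd.DataFrame({
    'review': ['Parfait', 'Bug', 'Bien'],
    'dev_response': [None, 'Hi, thanks for the report', None],
})

# Count the reviews that received a developer response
n_responses = int(toy['dev_response'].notna().sum())

# Share of reviews with a response
response_share = n_responses / len(toy)

print(n_responses, round(response_share, 2))
```

Applied to the real data, the same two lines on `df` should give the count of 3 reported above.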

In [7]:
df['review']

0      Appli indispensable au pays du Mont-Blanc, et ...
1      Cela fait désormais une semaine que j’ai téléc...
2      Habitant d Annecy, ville magnifique mais forte...
3      Très bonne application, une démarche réseau af...
4      Très bonne appli. mais la recherche de ville m...
                             ...                        
329                                              Parfait
330                                              parfait
331    You should publish on Facebook . The governmen...
332                                                 Bien
333               Very usefull and seems to be accurate.
Name: review, Length: 334, dtype: object

## Ratings

In [8]:
# assess the distribution of ratings
df['rating'].value_counts()

5    230
4     75
1     14
3     11
2      4
Name: rating, dtype: int64

In [9]:
# assess mean rating
df['rating'].mean()

4.505988023952096

## Detect language

In [10]:
# Define a function to identify language and catch exceptions
from langdetect import DetectorFactory

# Fix the seed once so that language detection is deterministic
DetectorFactory.seed = 0

def lang_detect(text):
    try:
        return detect(text)
    except Exception:
        # detect() raises when the text has no usable features (e.g. emoji only)
        return "language not detected"

In [11]:
# Detect the language used in the reviews
df['lang-r'] = df['review'].apply(lang_detect)

In [12]:
# What are the detected languages?
df['lang-r'].unique()

array(['fr', 'it', 'en', 'language not detected', 'de', 'ca', 'nl', 'id',
       'pt', 'es', 'cy', 'ko', 'sq', 'af', 'sk', 'so'], dtype=object)

In [13]:
# What is the distribution of the detected languages?
df['lang-r'].value_counts()

fr                       231
en                        70
language not detected      6
ca                         5
id                         4
de                         3
it                         3
pt                         2
af                         2
cy                         2
ko                         1
es                         1
sk                         1
sq                         1
nl                         1
so                         1
Name: lang-r, dtype: int64

As expected, most reviews are detected as French, since they were collected from the French App Store. However, 70 reviews are detected as English.

In [14]:
# Look at reviews where the language could not be detected
df.loc[df['lang-r']=='language not detected']

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r
112,5666850190,4,Très bonne appli,2020-03-16T04:57:56Z,David1903,👌👌,,,,language not detected
147,1590022124,4,Bonne appli,2017-04-16T07:41:05Z,Jerba de Lyon,👌👍,,,,language not detected
165,3915545398,5,ᎤᏆᏍ ᎧᎿ,2019-03-23T09:04:09Z,EvanBench,ᎳᎹᎿ ᎢᎢᎾᏆᏍ ᏓᏜᏆᏍᏍ ᎢᎡᎠᎳ ᎢᎾᏀ ᏋᏓᏩᏝ ᎦᎹᎧᎴ ᎹᎾ ᏔᎾ ᏣᏌᏓᏝ ...,,,,language not detected
182,5414278660,5,Super,2020-01-17T10:08:48Z,matsdslp,❤️❤️,,,,language not detected
216,1381991531,1,N u l !,2016-05-21T15:14:24Z,16@5.3,0/10,,,,language not detected
325,3625199005,4,Très bien... je recommande,2019-01-08T05:55:13Z,zito22,...👍🏼,,,,language not detected


Most of the reviews whose language could not be detected consist only of emoji, digits or punctuation; one is written in a script the detector did not recognise. Detecting the language from the title could help. However, we are focusing on reviews that contain enough words to carry the information we are looking for, so we will discard the reviews whose language could not be detected.
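
To illustrate why these reviews give the detector nothing to work with, here is a small sketch that counts alphabetic characters in a text. The helper `count_letters` is hypothetical and is not part of the notebook or of `langdetect`; the samples are review bodies from the table above.

```python
def count_letters(text):
    """Return the number of alphabetic characters in text."""
    return sum(ch.isalpha() for ch in text)

# Review bodies taken from the undetected rows, plus one ordinary review
samples = ['👌👌', '0/10', '...👍🏼', 'Très bien']

for s in samples:
    print(repr(s), count_letters(s))
```

A review with zero letters could be filtered out before detection instead of relying on the exception handler, although this check alone would not catch text written in an unsupported script.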

In [15]:
# Look at the reviews whose detected language is neither French nor English (undetected ones excluded)
df.loc[(df['lang-r']!='fr')
       &(df['lang-r']!='en')
       &(df['lang-r']!='language not detected')].head(27)

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r
69,5305778459,5,Excellente application pour connaître la quali...,2019-12-22T19:38:07Z,pilt74,Excellente application,,,,it
116,5473537088,5,Bravo.,2020-02-01T08:48:52Z,SoulFetsih,Bien,,,,de
119,3567776246,4,Merci !,2018-12-24T15:51:44Z,Lóret,"Très utile, merci !",,,,ca
160,5694564037,5,Top,2020-03-22T02:46:32Z,pas mwa,Trop bien!!,,,,nl
162,3857994146,5,Très bien,2019-03-08T19:06:01Z,RueilMalmaison,Super.,,,,id
164,5791121182,5,Top !!,2020-04-10T13:38:31Z,Kuriboh14,Top !! j’adore 🥰,,,,pt
167,4822055315,5,Top,2019-09-22T06:41:57Z,Nuinuita,Merci,,,,es
168,3880071586,5,Super good,2019-03-14T11:49:33Z,amdo baby,Parfait,,,,cy
180,2376187657,5,Super,2018-04-02T21:39:01Z,Maurice1718,Super,,,,id
181,2168815632,5,Super,2018-02-05T05:11:55Z,j98champio,Super,,,,id


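When a review is short, and especially when it contains a typo, language detection is unreliable: French words such as "Super", "Merci" or "Parfait" are labelled Indonesian, Spanish or Welsh. Let's check whether the language can be detected from the title instead.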
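
The fallback idea can be sketched as a single function: try the review first, and use the title only when the review does not yield an accepted language. The names `detect_with_fallback` and `stub_detector` are hypothetical. The stub replaces `lang_detect` so that the example runs without `langdetect`, and its return values mirror what the notebook reports for rows 162 and 116.

```python
def detect_with_fallback(review, title, detector, accepted=('fr', 'en')):
    """Detect on the review first; fall back to the title if needed.

    Returns an accepted language code, or None if neither text yields one.
    """
    lang = detector(review)
    if lang in accepted:
        return lang
    lang = detector(title)
    return lang if lang in accepted else None


# Stub detector for illustration only; in the notebook this would be lang_detect
def stub_detector(text):
    known = {'Super.': 'id', 'Très bien': 'fr', 'Bien': 'de', 'Bravo.': 'sk'}
    return known.get(text, 'fr')


print(detect_with_fallback('Super.', 'Très bien', stub_detector))  # fr
print(detect_with_fallback('Bien', 'Bravo.', stub_detector))       # None
```

The notebook itself takes a column-wise route with the same effect: it stores the title language in `lang-t` and combines it with `lang-r` afterwards.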

In [16]:
df['lang-t'] = df.loc[(df['lang-r']!='fr')
       &(df['lang-r']!='en')
       &(df['lang-r']!='language not detected')]['title'].apply(lang_detect)

In [17]:
df_2 = df.loc[(df['lang-r']!='fr')
       &(df['lang-r']!='en')
       &(df['lang-r']!='language not detected')][['title', 'review', 'lang-r', 'lang-t']]

In [18]:
df_2.head()

Unnamed: 0,title,review,lang-r,lang-t
69,Excellente application pour connaître la quali...,Excellente application,it,fr
116,Bravo.,Bien,de,sk
119,Merci !,"Très utile, merci !",ca,it
160,Top,Trop bien!!,nl,en
162,Très bien,Super.,id,fr


In [19]:
df_2['lang-t'].value_counts()

fr    6
en    4
id    3
it    3
sk    2
de    1
pt    1
af    1
cy    1
ro    1
nl    1
lt    1
vi    1
fi    1
Name: lang-t, dtype: int64

In [20]:
df_2.sort_values(by='lang-t', inplace=True)

In [21]:
df_2.head(27)

Unnamed: 0,title,review,lang-r,lang-t
168,Super good,Parfait,cy,af
256,Newgyp,Ok,af,cy
332,Bien,Bien,de,de
279,Application Stable et Efficace,RAS,de,en
160,Top,Trop bien!!,nl,en
167,Top,Merci,es,en
329,Hugofinix82,Parfait,cy,en
219,John john69,Super,id,fi
258,Très belle application,Bravo,sk,fr
249,Très bien,Shume i mire,sq,fr


In the final dataframe, we'll keep only the reviews where French or English has been detected in either the review or the title.
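
This selection rule can be written compactly with `Series.where` and `Series.isin`. The following is a minimal sketch on a toy frame, not the notebook's own implementation: the rows in `toy` are illustrative, and only the column names `lang-r`, `lang-t` and `lang` follow the notebook.

```python
import pandas as pd

# Toy frame: review language and, where computed, title language
toy = pd.DataFrame({
    'lang-r': ['fr', 'it', 'de', 'en'],
    'lang-t': [None, 'fr', 'sk', None],
})

keep = ['fr', 'en']

# Use the review language when it is French or English, else the title language
toy['lang'] = toy['lang-r'].where(toy['lang-r'].isin(keep), toy['lang-t'])

# Keep only the rows that ended up with an accepted language
toy = toy[toy['lang'].isin(keep)]

print(toy)
```

Here the third row is dropped because neither its review nor its title was detected as French or English.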

In [22]:
len(df.loc[(df['lang-r']=='fr')|(df['lang-r']=='en')
          |(df['lang-t']=='fr')|(df['lang-t']=='en')])

311

In [23]:
# take an explicit copy so that later assignments do not write to a view of df
dfout = df.loc[(df['lang-r']=='fr')|(df['lang-r']=='en')
          |(df['lang-t']=='fr')|(df['lang-t']=='en')].copy()

In [24]:
# we define a new column 'lang': it's the language of the review if it's 
# in French or in English
dfout.loc[(df['lang-r']=='fr')|(df['lang-r']=='en'),'lang'] = df.loc[(df['lang-r']=='fr')|(df['lang-r']=='en'),'lang-r']

In [25]:
# check the data
dfout.head()

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r,lang-t,lang
0,1512229481,5,Parfait,2016-12-30T22:00:51Z,PaysDuMontBlancPollué,"Appli indispensable au pays du Mont-Blanc, et ...",,,,fr,,fr
1,4004281790,5,Très utile !,2019-04-12T20:00:15Z,lucasfinck1211,Cela fait désormais une semaine que j’ai téléc...,,,,fr,,fr
2,1513779072,5,Pays de Savoie -> utile!!!,2017-01-02T09:00:04Z,Pornawak333,"Habitant d Annecy, ville magnifique mais forte...",,,,fr,,fr
3,1517811476,5,Indispensable,2017-01-08T11:50:03Z,pascalwirth,"Très bonne application, une démarche réseau af...",,,,fr,,fr
4,1508935600,4,Recherche de ville,2016-12-25T20:42:43Z,YodaMoi,Très bonne appli. mais la recherche de ville m...,,,,fr,,fr


In [26]:
# check cases where the review language is not French or English
dfout.loc[(df['lang-r']!='fr')&(df['lang-r']!='en')]

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r,lang-t,lang
69,5305778459,5,Excellente application pour connaître la quali...,2019-12-22T19:38:07Z,pilt74,Excellente application,,,,it,fr,
160,5694564037,5,Top,2020-03-22T02:46:32Z,pas mwa,Trop bien!!,,,,nl,en,
162,3857994146,5,Très bien,2019-03-08T19:06:01Z,RueilMalmaison,Super.,,,,id,fr,
167,4822055315,5,Top,2019-09-22T06:41:57Z,Nuinuita,Merci,,,,es,en,
221,3467886745,5,Best app for air quality,2018-11-28T00:18:23Z,Wilko2505,Essential app !,,,,ca,fr,
245,5530739700,5,Cette application est vraiment ICÔNIC,2020-02-14T16:02:03Z,Anastasia7542,j’adore,,,,pt,fr,
249,5777624373,4,Très bien,2020-04-07T21:15:05Z,Marko albania,Shume i mire,,,,sq,fr,
258,4057049021,4,Très belle application,2019-04-25T07:03:07Z,Monalissa13,Bravo,,,,sk,fr,
279,3861390231,4,Application Stable et Efficace,2019-03-09T14:25:32Z,mook972,RAS,,,,de,en,
329,5280043693,5,Hugofinix82,2019-12-16T15:29:46Z,Hugofinix82,Parfait,,,,cy,en,


In [27]:
# else, the value for 'lang' is the language detected in the title (French or English)
dfout.loc[(df['lang-r']!='fr')&(df['lang-r']!='en'),'lang'] = dfout.loc[(df['lang-r']!='fr')&(df['lang-r']!='en'),'lang-t'] 

In [28]:
# check again the cases where the review language is not French or English
dfout.loc[(df['lang-r']!='fr')&(df['lang-r']!='en')]

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r,lang-t,lang
69,5305778459,5,Excellente application pour connaître la quali...,2019-12-22T19:38:07Z,pilt74,Excellente application,,,,it,fr,fr
160,5694564037,5,Top,2020-03-22T02:46:32Z,pas mwa,Trop bien!!,,,,nl,en,en
162,3857994146,5,Très bien,2019-03-08T19:06:01Z,RueilMalmaison,Super.,,,,id,fr,fr
167,4822055315,5,Top,2019-09-22T06:41:57Z,Nuinuita,Merci,,,,es,en,en
221,3467886745,5,Best app for air quality,2018-11-28T00:18:23Z,Wilko2505,Essential app !,,,,ca,fr,fr
245,5530739700,5,Cette application est vraiment ICÔNIC,2020-02-14T16:02:03Z,Anastasia7542,j’adore,,,,pt,fr,fr
249,5777624373,4,Très bien,2020-04-07T21:15:05Z,Marko albania,Shume i mire,,,,sq,fr,fr
258,4057049021,4,Très belle application,2019-04-25T07:03:07Z,Monalissa13,Bravo,,,,sk,fr,fr
279,3861390231,4,Application Stable et Efficace,2019-03-09T14:25:32Z,mook972,RAS,,,,de,en,en
329,5280043693,5,Hugofinix82,2019-12-16T15:29:46Z,Hugofinix82,Parfait,,,,cy,en,en


## Sort data

In [29]:
# sort by language, then rating, then review date, all in descending order
dfout.sort_values(by=['lang','rating','review_date'], inplace=True, ascending=False)

In [30]:
dfout

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r,lang-t,lang
133,6085970206,5,La Maison Blanche,2020-06-17T06:10:30Z,tsm energies services,L’appareil est très bien fait,,,,fr,,fr
253,6045921703,5,Top 👍,2020-06-07T08:24:01Z,Sécu2007,"Très bonne application, très utile pour connaî...",,,,fr,,fr
233,5974704788,5,Très bien,2020-05-21T13:31:13Z,Cjchavanne,Très bien,,,,fr,,fr
192,5935245643,5,Pratique,2020-05-12T10:32:33Z,ManuLes,Application très pratique pour savoir si l’air...,,,,fr,,fr
82,5928224600,5,Indispensable pour comprendre la planète !,2020-05-10T16:15:52Z,Zlyzinc,Application indispensable à donner - notamment...,,,,fr,,fr
...,...,...,...,...,...,...,...,...,...,...,...,...
148,2201840940,4,very useful,2018-02-13T05:16:18Z,MisterGD......,very useful,,,,en,,en
114,2164215003,4,Visual and clear,2018-02-04T01:47:15Z,Tycé,Good app with clear information,,,,en,,en
124,2434915979,3,little Tribu from Seoul South Korea,2018-04-17T23:58:52Z,little Tribu,Very useful for us when we decide to do sport:...,,,,en,,en
149,3688408352,1,Bug,2019-01-24T10:13:12Z,Mimi75,Bug sur l’Apple watch,7003136.0,"Hi Mimi75,\nWe are sorry you are experiencing ...",2019-01-25T03:21:34Z,en,,en


## Export data

In [31]:
# export to csv; 'utf-8-sig' writes a BOM so that Excel reads accented characters correctly
dfout.to_csv('app_reviews_airvisual-air-quality-forecast_1048912974_by_lang.csv', encoding='utf-8-sig', sep=';')