
# Preprocessing, exploration and export of app reviews [Sensio - USA appstore]

We have scraped reviews of a [specific app](https://apps.apple.com/us/app/airvisual-qualit%C3%A9-de-lair/id1048912974#see-all/reviews) from the US App Store. This app is related to air quality. Our goal is to analyse these reviews to learn about:
* usages
* most relevant app features
* "missing" app features, or features that users would like the app to have
* technical issues.

Data preparation, such as sorting the reviews according to selected criteria, will be key to analysing them. It will also give us the opportunity to test some NLP tools as needed (language detection, sentiment analysis...).

In [1]:
import pandas as pd
from langdetect import detect
import warnings
warnings.filterwarnings('ignore')
import os

## Load data

In [2]:
path = os.getcwd()
filename = 'app_reviews_sensio_us.json'

In [3]:
df = pd.read_json(path+"/../data/0_scraped_data/"+filename)

In [4]:
df.head()

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date
0,2579715523,4,More updates to come,2018-05-22T05:13:01Z,Samsoumitunes,I am sure this app will be like a fundamental ...,5175794.0,Thank you for your comment! We love getting yo...,2018-09-22T18:48:20Z
1,2714285310,3,Is not tracking my symptoms but is still helpful.,2018-06-17T22:01:32Z,p-money7,"I want to love this app, but I’ve been logging...",3893613.0,Thank you for your feedback! You need to log y...,2018-06-19T11:17:10Z
2,3650933831,4,"So far, so good!",2019-01-15T00:44:27Z,egretsrus,How can I delete the daily notifications? They...,6848026.0,"Thanks for your comment, buses and tubes are c...",2019-01-15T14:14:11Z
3,5201671987,1,FORCED TO CREATE AN ACCOUNT,2019-11-27T02:36:05Z,Topbridge,I HATE HATE H A T E apps that force you to sig...,,,
4,3174901787,3,"One day, this may become a great app.",2018-09-10T14:00:07Z,APColliflower,I wish that the app tracked my area. I live i...,6390681.0,"Hello, thank you very much for your feedback a...",2018-12-14T16:54:38Z


In [5]:
df.columns

Index(['review_id', 'rating', 'title', 'review_date', 'user_name', 'review',
       'response_id', 'dev_response', 'response_date'],
      dtype='object')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review_id      26 non-null     int64  
 1   rating         26 non-null     int64  
 2   title          26 non-null     object 
 3   review_date    26 non-null     object 
 4   user_name      26 non-null     object 
 5   review         26 non-null     object 
 6   response_id    17 non-null     float64
 7   dev_response   17 non-null     object 
 8   response_date  17 non-null     object 
dtypes: float64(1), int64(2), object(6)
memory usage: 2.0+ KB
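
The `df.info()` summary above lists `review_date` and `response_date` as `object` columns, so the timestamps are still plain strings. A minimal sketch of how they could be parsed, assuming the ISO 8601 format seen in `df.head()`. The two-row `sample` frame is an illustrative stand-in built from values shown above, and the `response_delay` column name is a hypothetical addition:

```python
import pandas as pd

# Illustrative stand-in: two rows copied from the df.head() output above
sample = pd.DataFrame({
    "review_id": [2579715523, 5201671987],
    "review_date": ["2018-05-22T05:13:01Z", "2019-11-27T02:36:05Z"],
    "response_date": ["2018-09-22T18:48:20Z", None],  # second review has no developer response
})

# Parse the ISO 8601 strings into timezone-aware timestamps; missing values become NaT
for col in ["review_date", "response_date"]:
    sample[col] = pd.to_datetime(sample[col], utc=True)

# Time elapsed between a review and the developer's response (hypothetical extra column)
sample["response_delay"] = sample["response_date"] - sample["review_date"]
print(sample.dtypes)
```

Once parsed, the date columns support date arithmetic and time-based filtering, which plain strings do not.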


In [7]:
df['review']

0     I am sure this app will be like a fundamental ...
1     I want to love this app, but I’ve been logging...
2     How can I delete the daily notifications? They...
3     I HATE HATE H A T E apps that force you to sig...
4     I wish that the app tracked my area.  I live i...
5     I wish that the app tracked my area.  I live i...
6     I wish that the app tracked my area.  I live i...
7     If yiu believe the adage that you can only imp...
8     Love the app, it really packs lots of features...
9     Have had too many glitchy problems getting inf...
10    Great product, very helpful to anybody that is...
11    What an incredible app!!! Not only is it easy ...
12    App seems like it would be great but is greatl...
13    Maybe I’m missing something? But the app doesn...
14                          Sets it apart from everyone
15    I live in Indiana, United States. There are on...
16                   Great way to monitor air quality !
17    This app makes no sense if you are unable 

## Ratings

In [8]:
# assess the distribution of ratings
df['rating'].value_counts()

5    7
2    6
1    6
3    5
4    2
Name: rating, dtype: int64

In [9]:
# assess mean rating
df['rating'].mean()

2.923076923076923
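
The counts and the mean above can be cross-checked and turned into shares, which are easier to compare across apps with different numbers of reviews. A small sketch, assuming only the counts printed by `value_counts()`; the names `shares` and `low_share` are illustrative:

```python
import pandas as pd

# Counts copied from the value_counts() output above (rating -> number of reviews)
counts = pd.Series({5: 7, 2: 6, 1: 6, 3: 5, 4: 2}, name="rating_count")

# Share of each rating, ordered from 1 to 5 stars
shares = (counts / counts.sum()).sort_index()

# Weighted mean of the ratings, which should match df['rating'].mean()
mean_rating = sum(rating * n for rating, n in counts.items()) / counts.sum()

# Share of clearly negative reviews (1 or 2 stars)
low_share = shares.loc[[1, 2]].sum()

print(shares.round(2))
print(round(mean_rating, 3), round(low_share, 3))
```

On the real DataFrame, `df['rating'].value_counts(normalize=True)` returns the shares directly.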

## Detect language

In [10]:
# Define a function to identify language and catch exceptions
from langdetect import DetectorFactory

# Fix the seed once so that language detection is deterministic
DetectorFactory.seed = 0

def lang_detect(text):
    try:
        return detect(text)
    except Exception:
        # detection fails on empty, featureless or non-string input
        return "language not detected"

In [11]:
# Detect the language used in the reviews
df['lang-r'] = df['review'].apply(lang_detect)

In [12]:
# What are the detected languages?
df['lang-r'].unique()

array(['en'], dtype=object)
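
All reviews are detected as English. The langdetect documentation notes that results can be unstable on texts that are very short or ambiguous, so a label assigned to a one-line review deserves less trust. A minimal sketch for flagging such reviews. The `MIN_CHARS` threshold and the `short_text` column name are hypothetical choices, and the three reviews are copied from the outputs in this notebook:

```python
import pandas as pd

MIN_CHARS = 20  # hypothetical threshold, to be tuned on the real data

# Three short reviews copied from the notebook outputs
reviews = pd.DataFrame({"review": [
    "No Anchorage?",
    "Sets it apart from everyone",
    "Great way to monitor air quality !",
]})

# Flag reviews whose text is shorter than the threshold
reviews["short_text"] = reviews["review"].str.len() < MIN_CHARS
flagged = reviews.loc[reviews["short_text"], "review"].tolist()
print(flagged)
```

Flagged rows could then be checked by hand rather than filtered out automatically.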

## Sort data

In [13]:
dfout = df.sort_values(by=['lang-r', 'rating', 'review_date'], ascending=False)

In [14]:
dfout

Unnamed: 0,review_id,rating,title,review_date,user_name,review,response_id,dev_response,response_date,lang-r
8,4125441347,5,Great way to keep track of my allergies,2019-05-09T20:56:28Z,Cylb,"Love the app, it really packs lots of features...",8699613.0,Thank you very much for your feedback!,2019-05-16T09:04:59Z,en
14,3630990329,5,You can write your symptoms,2019-01-09T18:44:04Z,professoor scripts,Sets it apart from everyone,6766230.0,"Hello, thank you for your 5 Stars Rating.",2019-01-10T09:13:29Z,en
11,2612985502,5,Awesome App!,2018-05-29T15:31:23Z,Momma Berta,What an incredible app!!! Not only is it easy ...,5175773.0,Thank you so much for your kind feedback!,2018-09-22T18:43:38Z,en
10,2609528181,5,Finally an app that can help me manage my alle...,2018-05-28T21:38:56Z,Richard jolio,"Great product, very helpful to anybody that is...",5175778.0,Thank you for your feedback Richard! Don't for...,2018-09-22T18:45:09Z,en
23,2592144865,5,Brilliant,2018-05-25T03:03:39Z,Ahmad muhaisen,This is a awesome!,,,,en
16,1795231875,5,Great App,2017-09-17T22:01:52Z,Tiger701,Great way to monitor air quality !,5175800.0,Thank you for your feedback!,2018-09-22T18:49:32Z,en
7,1775601497,5,A great idea that's well implemented,2017-09-05T19:56:20Z,David Haddad,If yiu believe the adage that you can only imp...,,,,en
2,3650933831,4,"So far, so good!",2019-01-15T00:44:27Z,egretsrus,How can I delete the daily notifications? They...,6848026.0,"Thanks for your comment, buses and tubes are c...",2019-01-15T14:14:11Z,en
0,2579715523,4,More updates to come,2018-05-22T05:13:01Z,Samsoumitunes,I am sure this app will be like a fundamental ...,5175794.0,Thank you for your comment! We love getting yo...,2018-09-22T18:48:20Z,en
20,5992518406,3,Poor Alaska,2020-05-25T15:49:12Z,TheOneThatMatters,No Anchorage?,,,,en
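
The sort above applies `ascending=False` to all three keys, so within each rating the most recent reviews come first. `sort_values` also accepts one flag per key. A sketch of a possible variant, assuming we wanted the best ratings first but the oldest review first within each rating; the four rows are copied from the output above:

```python
import pandas as pd

# Four rows copied from the sorted output above
sample = pd.DataFrame({
    "review_id": [4125441347, 1775601497, 3650933831, 5992518406],
    "rating": [5, 5, 4, 3],
    "review_date": [
        "2019-05-09T20:56:28Z",
        "2017-09-05T19:56:20Z",
        "2019-01-15T00:44:27Z",
        "2020-05-25T15:49:12Z",
    ],
})

# One flag per sort key: rating descending, review_date ascending.
# ISO 8601 strings in the same timezone sort chronologically as plain text.
variant = sample.sort_values(by=["rating", "review_date"], ascending=[False, True])
print(variant["review_id"].tolist())
```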


## Export data

In [15]:
export_filename = filename[:-5]+'.csv'
print(export_filename)

app_reviews_sensio_us.csv


In [16]:
# export to csv
dfout.to_csv(path+"/../data/1_preprocessed_data/"+export_filename, encoding='utf-8-sig', sep=';')
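
The export uses `;` as separator and the `utf-8-sig` encoding (UTF-8 with a byte order mark), so the file has to be read back with the same settings. A minimal round-trip sketch, assuming a temporary directory and a hypothetical file name instead of the project's data folder; the two rows are copied from the sorted output above:

```python
import os
import tempfile

import pandas as pd

# Illustrative stand-in for dfout: two rows copied from the sorted output above
sample = pd.DataFrame(
    {
        "rating": [5, 4],
        "title": ["Great way to keep track of my allergies", "So far, so good!"],
    },
    index=[8, 2],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample_reviews.csv")  # hypothetical file name
    sample.to_csv(csv_path, encoding="utf-8-sig", sep=";")
    # Same separator and encoding as the export; the first column holds the original index
    reloaded = pd.read_csv(csv_path, sep=";", encoding="utf-8-sig", index_col=0)

print(reloaded)
```

Passing `index_col=0` restores the original row labels and avoids an extra unnamed column in the reloaded frame.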