# NLP - Cleaning and Preprocessing Text Data of User Reviews in AppStore

In [1]:
# pandas
import pandas as pd
# natural language toolkit
import nltk
# string for punctuation list
import string
# to remove links, numbers
import re
# to get stopwords from smart stopword list link
from urllib.request import urlopen
# wordnet for part-of-speech lookup
from nltk.corpus import wordnet
from collections import Counter
# Tokenizer
from nltk.tokenize import RegexpTokenizer
# Lemmatizer
from nltk.stem import WordNetLemmatizer
#Stemmers
from nltk.stem.porter import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer
import numpy as np

##  CSV Read and DataFrame Creation

We load the CSV file into a DataFrame and check its shape. The dataset has 3101 rows and 13 columns, where each row is a distinct review posted on AppStore for one of 10 different apps.

In [2]:
def get_data(file):
    data = pd.read_csv(file)
    print(data.shape)
    return data

In [3]:
file = "gps_reannotation-full.csv"
df = get_data(file)
df.info()

(3101, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3101 entries, 0 to 3100
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   reviewId              3101 non-null   object
 1   userName              3101 non-null   object
 2   userImage             3101 non-null   object
 3   content               3101 non-null   object
 4   score                 3101 non-null   int64 
 5   thumbsUpCount         3101 non-null   int64 
 6   reviewCreatedVersion  2584 non-null   object
 7   at                    3101 non-null   object
 8   replyContent          428 non-null    object
 9   repliedAt             429 non-null    object
 10  app_name              3101 non-null   object
 11  cat1                  3101 non-null   object
 12  inex1                 2049 non-null   object
dtypes: int64(2), object(11)
memory usage: 315.1+ KB


In [4]:
# Get unique values of apps and raised ethical concerns of reviews
apps = df['app_name'].unique()
print('Apps:', ', '.join(apps))

concerns = df['cat1'].unique()
print('\nRaised Ethical Concerns: ', ', '.join(concerns))


Apps: tiktok, facebook, uber, zoom, vinted, alexa, googlehome, linkedin, instagram, youtube

Raised Ethical Concerns:  Addiction (internal), Other, inappropriate content (internal), accountability  (can be internal or external), privacy, discrimination (can be internal or external), spreading false information (internal), censorship (internal), Cyberbullying/toxicity (internal), safety (can be internal or external), identity theft (internal), scam  (can be internal or external), harmful advertising (internal), Noise, transparency  (can be internal or external), accessibility   (can be internal or external), privacy (can be external or internal), Content theft, transparency, accountability, scam, none, sustainability, safety, accessibility, censorship


## Remove links

In [5]:
def removeLink(text):
    # replace anything of the form scheme://... with a space, then collapse whitespace
    no_link = ' '.join(re.sub(r"(\w+://\S+)", " ", text).split())
    return no_link

In [6]:
df['clean_content'] = df['content'].apply(lambda x: removeLink(x))
df['clean_content']

0       This is yhr best app ever im littetally addict...
1       like tik tok because it gives people a chance ...
2                        I love this app like im addicted
3       This app allows pedophile acts, underage half ...
4       A very good app but I dont like how theres soo...
                              ...                        
3096    Great app if you don't mind getting ripped off...
3097    Myself and my colleagues wanted to car pool an...
3098    Its 2020 and dark mode is not even rolled out ...
3099    Instagram give a room to share my experiences ...
3100    CAUTION THIS APP DOESNT ALLOW YOU TO DELETE YO...
Name: clean_content, Length: 3101, dtype: object
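
The behaviour of a scheme-based link pattern can be sketched on two sample strings. This is a minimal illustration, assuming the intended expression is `\w+://\S+`; `strip_links` and the sample sentences are hypothetical and not part of the dataset.

```python
import re

# Assumed pattern: word characters, "://", then everything up to the next whitespace
LINK_PATTERN = re.compile(r"\w+://\S+")

def strip_links(text):
    """Replace scheme-prefixed URLs with a space and collapse repeated whitespace."""
    return " ".join(LINK_PATTERN.sub(" ", text).split())

with_scheme = strip_links("See https://example.com/help for details")
without_scheme = strip_links("See www.example.com for details")

print(with_scheme)      # the URL is removed
print(without_scheme)   # no "://", so the text is left unchanged
```

Links written without a scheme (for example `www.example.com`) are not matched by this pattern, so they would stay in the text unless a second pattern is added.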

## Remove numbers

In [7]:
def removeNumber(text):
    return ' '.join(re.sub(r'[0-9]',' ', text).split())

In [8]:
df['clean_content'] = df['clean_content'].apply(lambda x: removeNumber(x))

df['clean_content']

0       This is yhr best app ever im littetally addict...
1       like tik tok because it gives people a chance ...
2                        I love this app like im addicted
3       This app allows pedophile acts, underage half ...
4       A very good app but I dont like how theres soo...
                              ...                        
3096    Great app if you don't mind getting ripped off...
3097    Myself and my colleagues wanted to car pool an...
3098    Its and dark mode is not even rolled out for e...
3099    Instagram give a room to share my experiences ...
3100    CAUTION THIS APP DOESNT ALLOW YOU TO DELETE YO...
Name: clean_content, Length: 3101, dtype: object
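
A small sketch of what digit removal does to individual strings, using the same expression as `removeNumber`. The helper name `strip_digits` and the second sample sentence are hypothetical.

```python
import re

def strip_digits(text):
    """Replace every ASCII digit with a space, then collapse repeated whitespace."""
    return " ".join(re.sub(r"[0-9]", " ", text).split())

year_example = strip_digits("Its 2020 and dark mode is missing")
version_example = strip_digits("Broken since v18.1.3")

print(year_example)
print(version_example)  # the dots of the version string remain as separate fragments
```

Version strings leave stray fragments such as `v . .` behind. In this notebook those fragments are dealt with later, when punctuation is removed.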

## Remove Emojis

In [9]:
def deEmojify(text):
    return text.encode('ascii', 'ignore').decode('ascii')

In [10]:
df['clean_content'] = df['clean_content'].apply(lambda x: deEmojify(x))

#df['clean_content']
print(df.loc[450, ['content','clean_content']].values)

['janb mera ubar acont galti se band ho gya h plzz usy dobara bahl karva de'
 'janb mera ubar acont galti se band ho gya h plzz usy dobara bahl karva de']
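
Row 450 contains no emojis, so the effect is not visible above. The following sketch shows the same encode/decode round trip on two hypothetical strings (`strip_non_ascii` is an illustrative name):

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")

emoji_example = strip_non_ascii("love it \U0001F60D")   # text followed by an emoji
accent_example = strip_non_ascii("caf\u00e9 cr\u00e8me")  # accented Latin letters

print(repr(emoji_example))
print(repr(accent_example))
```

The approach removes all non-ASCII characters, not only emojis, so accented letters and non-Latin scripts are dropped as well. That may or may not be acceptable depending on the languages present in the reviews.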


## Converting all characters to lowercase

In [11]:
df['clean_content'] = df['clean_content'].apply(lambda x: x.lower())
df['clean_content']

0       this is yhr best app ever im littetally addict...
1       like tik tok because it gives people a chance ...
2                        i love this app like im addicted
3       this app allows pedophile acts, underage half ...
4       a very good app but i dont like how theres soo...
                              ...                        
3096    great app if you don't mind getting ripped off...
3097    myself and my colleagues wanted to car pool an...
3098    its and dark mode is not even rolled out for e...
3099    instagram give a room to share my experiences ...
3100    caution this app doesnt allow you to delete yo...
Name: clean_content, Length: 3101, dtype: object

## Remove stopwords
* nltk.corpus.stopwords.words('english') could also be used. However, it contains only 179 words, whereas the SMART stop-word list contains 571, including ‘i’, ‘me’, ‘my’, ‘myself’, ‘we’, ‘you’, ‘he’ and ‘his’, for instance.
* stpwrds can also be extended with the app names mentioned in the reviews, since an app's name tends to appear throughout the reviews that belong to it. The cells below use the list as downloaded.
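
One way to extend a stop-word list with app names can be sketched as follows. This is an illustration under assumptions: `base_stopwords` is a tiny hypothetical stand-in for the downloaded list, `app_names` is a subset of the apps in this dataset, and `drop_stopwords` is an illustrative helper.

```python
# Tiny hypothetical stand-in for the downloaded SMART list
base_stopwords = {"i", "this", "is", "the", "and", "my"}

# A few of the app names that occur in the dataset
app_names = ["tiktok", "uber", "zoom"]

# A set gives constant-time membership checks, unlike a list
stopword_set = base_stopwords | set(app_names)

def drop_stopwords(text, stopwords):
    """Keep only whitespace-separated words that are not in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stopwords)

result = drop_stopwords("i love tiktok and the uber app", stopword_set)
print(result)
```

Using a set rather than a list does not change the result, but it avoids scanning the whole list for every word.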

In [12]:
def generate_stopwords():
    stpwrd_url = "http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop"
    response = urlopen(stpwrd_url)
    stpwrds = response.read().decode('utf-8').replace("\n", " ").split()
    return stpwrds

In [13]:
def remove_stopwords(text, stpwrds):
    text = text.split(" ")
    words = [w for w in text if w not in stpwrds]
    return ' '.join(words)

In [14]:
stpwrds = generate_stopwords()
df['clean_content'] = df['clean_content'].apply(lambda x: remove_stopwords(x, stpwrds))
df['clean_content'] 

0       yhr app im littetally addicted things personal...
1       tik tok people chance share life stories sad h...
2                                    love app im addicted
3       app pedophile acts, underage half naked girls ...
4       good app dont soo tiktokers committing offense...
                              ...                        
3096    great app mind ripped driving global warming. ...
3097    colleagues wanted car pool respective places l...
3098    dark mode rolled everyone. imagine wasting pow...
3099    instagram give room share experiences coffee, ...
3100    caution app doesnt delete card details beware !!!
Name: clean_content, Length: 3101, dtype: object

In [15]:
#df['clean_content']
print(df.loc[400, ['content','clean_content']].values)

["Calculate more fare than usual and never get any solution of any problem or\ncouldn't report for driver's bad behavior"
 "calculate fare usual solution problem report driver's bad behavior"]


## Remove punctuation
Punctuation is removed by iterating over the characters of each review with a list comprehension and keeping only those that are not in __string.punctuation__. This constant, available through __import string__ at the top of the notebook, contains the ASCII punctuation marks.

In [16]:
def removePunctuation(text):
    no_punc = "".join([c for c in text if c not in string.punctuation])
    return no_punc

In [17]:
df['clean_content'] = df['clean_content'].apply(lambda x: removePunctuation(x))
df['clean_content']

0       yhr app im littetally addicted things personal...
1       tik tok people chance share life stories sad h...
2                                    love app im addicted
3       app pedophile acts underage half naked girls i...
4       good app dont soo tiktokers committing offense...
                              ...                        
3096    great app mind ripped driving global warming c...
3097    colleagues wanted car pool respective places l...
3098    dark mode rolled everyone imagine wasting powe...
3099    instagram give room share experiences coffee p...
3100       caution app doesnt delete card details beware 
Name: clean_content, Length: 3101, dtype: object
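
As an alternative sketch, the same filtering can be expressed with `str.translate`, which is a common idiom for deleting a fixed set of characters. This is not the notebook's method, only an equivalent formulation; `strip_punctuation` and the sample string are hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Delete ASCII punctuation characters from the text."""
    return text.translate(PUNCT_TABLE)

sample = "driver's bad behavior, beware !!!"
by_translate = strip_punctuation(sample)
# Character filter equivalent to the list comprehension in removePunctuation
by_filter = "".join(c for c in sample if c not in string.punctuation)

print(repr(by_translate))
```

Both versions only cover ASCII punctuation. Typographic characters such as curly quotes are not in `string.punctuation`, although here they have already been dropped by the earlier ASCII-only step.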

In [18]:
df['clean_content'] = df['clean_content'].apply(lambda x: remove_stopwords(x, stpwrds))
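
The second stop-word pass appears to be needed because a word with attached punctuation, such as `everyone.`, does not match its stop-word entry until the punctuation is gone. A minimal sketch of that effect, assuming a small hypothetical stop-word set and illustrative helper names:

```python
import string

# Hypothetical stand-in for the full stop-word list
stopwords = {"everyone", "is", "for"}

def drop_stopwords(text, stopwords):
    """Remove space-separated words that appear in the stop-word set."""
    return " ".join(w for w in text.split(" ") if w not in stopwords)

def strip_punctuation(text):
    """Delete ASCII punctuation characters."""
    return "".join(c for c in text if c not in string.punctuation)

text = "dark mode rolled everyone. imagine"
first_pass = drop_stopwords(text, stopwords)       # "everyone." is not an exact match
no_punct = strip_punctuation(first_pass)           # now the bare word "everyone" appears
second_pass = drop_stopwords(no_punct, stopwords)  # and can be removed

print(first_pass)
print(second_pass)
```

Removing punctuation before the first stop-word pass would make the second pass unnecessary, at the cost of changing the order of the steps in this notebook.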

## Tokenizing words

* __RegexpTokenizer__ is an NLTK class that splits a string into substrings using a regular expression. With the pattern used here, a token is a run of word characters, a dollar amount such as $3.50, or any other run of non-whitespace characters. Since digits and punctuation have already been removed from the reviews, this effectively splits them on whitespace.
* __discard\_empty__ is set to True so that any empty tokens produced by the tokenizer are removed from the output. According to the NLTK source, empty tokens can only be produced in gaps mode, so the setting should have no effect here.
(see https://www.nltk.org/_modules/nltk/tokenize/regexp.html)
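
To see which alternative of the pattern matches what, the expression can be tried with the standard `re` module. This sketch assumes that, when gaps mode is not enabled, the tokenizer returns the regex matches in order, which is how the linked source reads; the sample sentences are hypothetical.

```python
import re

# Same pattern as passed to RegexpTokenizer in the notebook
TOKEN_PATTERN = re.compile(r"\w+|\$[\d\.]+|\S+")

# Raw text: each alternative of the pattern gets a chance to match
raw_tokens = TOKEN_PATTERN.findall("fare was $3.50, driver's bad!")

# Cleaned text (no digits, no punctuation): only the \w+ alternative is ever used
clean_tokens = TOKEN_PATTERN.findall("calculate fare usual solution")

print(raw_tokens)
print(clean_tokens)
```

On raw text the pattern splits `driver's` into `driver` and `'s` and keeps punctuation as separate tokens. On the cleaned reviews it behaves like a plain whitespace split.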

In [19]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+', discard_empty=True)
df['clean_content'] = df['clean_content'].apply(lambda x: tokenizer.tokenize(x))

In [20]:
print(df['clean_content'])
print("\nOne particular review:")
print(df.loc[400, ['content','clean_content']].values)

0       [yhr, app, im, littetally, addicted, things, p...
1       [tik, tok, people, chance, share, life, storie...
2                               [love, app, im, addicted]
3       [app, pedophile, acts, underage, half, naked, ...
4       [good, app, dont, soo, tiktokers, committing, ...
                              ...                        
3096    [great, app, mind, ripped, driving, global, wa...
3097    [colleagues, wanted, car, pool, respective, pl...
3098    [dark, mode, rolled, imagine, wasting, power, ...
3099    [instagram, give, room, share, experiences, co...
3100    [caution, app, doesnt, delete, card, details, ...
Name: clean_content, Length: 3101, dtype: object

One particular review:
["Calculate more fare than usual and never get any solution of any problem or\ncouldn't report for driver's bad behavior"
 list(['calculate', 'fare', 'usual', 'solution', 'problem', 'report', 'drivers', 'bad', 'behavior'])]


## Lemmatizing

#### WordNet

In [21]:
def get_part_of_speech(word):
    # all WordNet synsets for the word (empty list if the word is unknown)
    probable_part_of_speech = wordnet.synsets(word)
    pos_counts = Counter()
    # count synsets per part of speech: verb, adjective, noun (adverbs are not counted)
    # insertion order matters: on a tie, including all zeros, the first key ("v") wins
    pos_counts["v"] = len([item for item in probable_part_of_speech if item.pos() == "v"])
    pos_counts["a"] = len([item for item in probable_part_of_speech if item.pos() == "a"])
    pos_counts["n"] = len([item for item in probable_part_of_speech if item.pos() == "n"])

    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech

In [22]:
def word_lemmatizer(text, lemmatizer):
    lem_text = [lemmatizer.lemmatize(i, get_part_of_speech(i)) for i in text]
    return lem_text
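
The voting step in `get_part_of_speech` can be illustrated without WordNet by passing in the part-of-speech tags directly. In this sketch `most_likely_pos` is a hypothetical helper and the tag lists are made up; only the `Counter` logic mirrors the function above.

```python
from collections import Counter

def most_likely_pos(synset_pos_tags):
    """Return the most frequent tag among 'v', 'a', 'n' (ties go to the earliest key)."""
    counts = Counter()
    for tag in ("v", "a", "n"):  # same insertion order as get_part_of_speech
        counts[tag] = sum(1 for t in synset_pos_tags if t == tag)
    return counts.most_common(1)[0][0]

noun_heavy = most_likely_pos(["n", "n", "v"])  # two noun senses, one verb sense
unknown_word = most_likely_pos([])             # word with no synsets at all
tie = most_likely_pos(["n", "v"])              # one noun sense, one verb sense

print(noun_heavy, unknown_word, tie)
```

Because `Counter.most_common` orders equal counts by first insertion, words that WordNet does not know (typos such as "littetally", for example) are treated as verbs. For such words the lemmatizer has nothing to look up anyway, which is consistent with them appearing unchanged in the outputs.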


In [23]:
wordnetlemma =  WordNetLemmatizer()
df['clean_content'] = df['clean_content'].apply(lambda x: word_lemmatizer(x, wordnetlemma))

In [24]:
print(df[['content','clean_content']])
print("\nOne particular review:")
print(df.loc[0, ['content','clean_content']].values)

                                                content  \
0     This is yhr best app ever im littetally addict...   
1     like tik tok because it gives people a chance ...   
2                      I love this app like im addicted   
3     This app allows pedophile acts, underage half ...   
4     A very good app but I dont like how theres soo...   
...                                                 ...   
3096  Great app if you don't mind getting ripped off...   
3097  Myself and my colleagues wanted to car pool an...   
3098  Its 2020 and dark mode is not even rolled out ...   
3099  Instagram give a room to share my experiences ...   
3100  CAUTION THIS APP DOESNT ALLOW YOU TO DELETE YO...   

                                          clean_content  
0     [yhr, app, im, littetally, addict, thing, pers...  
1     [tik, tok, people, chance, share, life, story,...  
2                               [love, app, im, addict]  
3     [app, pedophile, act, underage, half, naked, g...  
4

In [25]:
df['clean_content'] = [' '.join(x) for x in df['clean_content']]
df['clean_content']

0       yhr app im littetally addict thing personally ...
1       tik tok people chance share life story sad hap...
2                                      love app im addict
3       app pedophile act underage half naked girl ina...
4       good app dont soo tiktokers commit offense rep...
                              ...                        
3096    great app mind rip drive global warm carbon em...
3097    colleague want car pool respective place long ...
3098    dark mode roll imagine waste power contribute ...
3099    instagram give room share experience coffee pe...
3100         caution app doesnt delete card detail beware
Name: clean_content, Length: 3101, dtype: object

In [26]:
df['cat1']

0                                Addiction (internal)
1                                               Other
2                                Addiction (internal)
3                    inappropriate content (internal)
4       accountability  (can be internal or external)
                            ...                      
3096                                   sustainability
3097                                   sustainability
3098                                   sustainability
3099                                             none
3100                                          privacy
Name: cat1, Length: 3101, dtype: object

In [27]:
df['cat1_clean'] = df['cat1'].apply(lambda x: x.lower())
df['cat1_clean'] = df['cat1_clean'].str.extract(r'^(.*?)\(', expand=True)
# labels without parentheses were not matched above; fall back to the original label
df['cat1_clean'] = df['cat1_clean'].fillna(df['cat1'])
df['cat1_clean']

0                   addiction 
1                        Other
2                   addiction 
3       inappropriate content 
4             accountability  
                 ...          
3096            sustainability
3097            sustainability
3098            sustainability
3099                      none
3100                   privacy
Name: cat1_clean, Length: 3101, dtype: object

In [28]:
df['cat1_clean'] = df['cat1_clean'].str.strip()

df['cat1'] = df['cat1_clean']
df['cat1'].unique()

array(['addiction', 'Other', 'inappropriate content', 'accountability',
       'privacy', 'discrimination', 'spreading false information',
       'censorship', 'cyberbullying/toxicity', 'safety', 'identity theft',
       'scam', 'harmful advertising', 'Noise', 'transparency',
       'accessibility', 'Content theft', 'none', 'sustainability'],
      dtype=object)
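
The result above still mixes cases ('Other', 'Noise', 'Content theft'), because labels without parentheses are taken from the original column. A possible variant that lowercases every label is sketched below in plain Python. It is not what the notebook does: `normalize_label` is a hypothetical helper, and the label list is a small subset of the values printed earlier.

```python
def normalize_label(label):
    """Keep the text before the first '(', trim whitespace, and lowercase it."""
    head = label.split("(", 1)[0]
    return head.strip().lower()

labels = [
    "Addiction (internal)",
    "privacy",
    "privacy (can be external or internal)",
    "Other",
    "accountability  (can be internal or external)",
]

normalized = sorted({normalize_label(label) for label in labels})
print(normalized)
```

With a pandas column the same function could be applied through `Series.apply`. Whether 'Other' and 'other' should be merged is a modelling decision for the downstream analysis.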

In [29]:
df.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt,app_name,cat1,inex1,clean_content,cat1_clean
0,gp:AOqpTOF5UoM-6ovAjd8ULHKjifvZCeJyoJWi4F_IaPO...,Max Toon,https://play-lh.googleusercontent.com/a-/AOh14...,This is yhr best app ever im littetally addict...,5,0,18.1.3,2020-12-11 19:51:21,,,tiktok,addiction,internal,yhr app im littetally addict thing personally ...,addiction
1,gp:AOqpTOGZ4VkRpsQa_bVYJdVz65yIOLs5jsENbB_aDJe...,Sandy Mason,https://play-lh.googleusercontent.com/-iFGfy5d...,like tik tok because it gives people a chance ...,5,0,17.3.4,2020-08-21 21:32:37,,,tiktok,Other,,tik tok people chance share life story sad hap...,Other
2,gp:AOqpTOF1KZT5ggeQqGpl62-V6QzBxhROn0eutiZMm9l...,izzyiscool,https://play-lh.googleusercontent.com/a-/AOh14...,I love this app like im addicted,5,0,,2020-10-12 1:38:54,,,tiktok,addiction,internal,love app im addict,addiction
3,gp:AOqpTOFRuZB5C5PEpW09xVx3pts_63bcWm9DFf4rajR...,Kristopher Lyons,https://play-lh.googleusercontent.com/a-/AOh14...,"This app allows pedophile acts, underage half ...",1,17,17.9.5,2020-12-03 9:55:01,,,tiktok,inappropriate content,internal,app pedophile act underage half naked girl ina...,inappropriate content
4,gp:AOqpTOFf_z_S6J2LZIb70xQY4oWiR19R0HN4oIVV4MW...,Certified Skillz,https://play-lh.googleusercontent.com/a-/AOh14...,A very good app but I dont like how theres soo...,5,0,,2020-10-13 1:58:36,,,tiktok,accountability,internal,good app dont soo tiktokers commit offense rep...,accountability


In [30]:
print(df.loc[400, 'content'])
print(df.loc[400, 'clean_content'])

Calculate more fare than usual and never get any solution of any problem or
couldn't report for driver's bad behavior
calculate fare usual solution problem report driver bad behavior


In [31]:
%store df

Stored 'df' (DataFrame)
