## Data Preprocessing

Before feeding textual data to an ML model, it needs to be carefully preprocessed; otherwise the results will be poor. This is especially true for short, noisy data such as user reviews.

My data preprocessing consisted of the following steps:

- **Filtering out uninformative reviews**
- **Removing English and corpus-specific stopwords**
- **Basic spellcheck**
- **Lemmatization**

I am going to illustrate these steps using the investing app dataset.

## 1. Filtering out Uninformative Reviews

Reviews vary in quality. Some are extremely short and contain uninformative feedback for app developers (e.g., *This app sucks!*), while others are long and contain valuable information in the form of bug reports and feature requests. 

There is a plethora of methods in the literature for filtering out uninformative feedback. However, most of them rely on supervised techniques, which are not feasible for the current task. To keep things simple, I use the approach from [this paper](https://ieeexplore.ieee.org/abstract/document/9283933?casa_token=fG8wgt1iNNoAAAAA:YZbVCUM0kDatSSphbQdbs8Su4KeoqRe9kJF_oLSxF-q_Mfbzln9WBgJ7ZQG_sauC9OrpevhBCQ). The main idea behind this approach is to look for indicative keywords: reviews that contain such keywords are more likely to be informative. This lets me filter out uninformative feedback quickly and with minimal effort. In the cell below, the keyword match is combined with three further conditions: a score of 2 or lower, a date up to 2021-01-01, and an app with more than 10,000 reviews.

In [2]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd, numpy as np

pd.set_option('display.max_colwidth', None)

# read data
data = pd.read_csv('investing.csv')
data = data.drop(columns=['Unnamed: 0'])

# indicative keywords
markers = ['after', 'as soon as', 'before', 'every time', 'then', 'until', 'when', 'whenever', 'while', 'during'] 

markers_re = r"\b(" + '|'.join(markers) + r")\b"

# this column marks informative reviews: keyword match, low score (<= 2),
# dated up to 2021-01-01, and from an app with more than 10,000 reviews
app_counts = data['app'].value_counts()
popular_apps = app_counts[app_counts > 10000].index
data['for_analysis'] = (
    data['content'].str.contains(markers_re, regex=True)
    & (data['score'] <= 2)
    & (data['date'] <= '2021-01-01')
    & data['app'].isin(popular_apps)
)

print(data['for_analysis'].value_counts())

False    688290
True      20760
Name: for_analysis, dtype: int64


**~21k** reviews out of **~709k** passed the filter. This number could be increased by identifying a larger set of keywords, which I leave for future work.
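One low-effort thing to check before extending the keyword list is case sensitivity: `str.contains` is case-sensitive by default, so a review that opens with *When* or *Every time* would not match the lowercase markers. The sketch below uses a made-up three-review frame (not the investing dataset) to show the difference. Whether this changes the count noticeably on the real data is an assumption that has not been verified here.

```python
import pandas as pd

# Toy reviews, invented for illustration only
toy = pd.DataFrame({"content": [
    "When I open the app it crashes",
    "the app freezes every time I log in",
    "This app sucks!",
]})

markers = ["every time", "when"]
# A non-capturing group avoids the pandas warning about match groups
markers_re = r"\b(?:" + "|".join(markers) + r")\b"

case_sensitive = toy["content"].str.contains(markers_re, regex=True)
case_insensitive = toy["content"].str.contains(markers_re, regex=True, case=False)

print(case_sensitive.tolist())    # [False, True, False]
print(case_insensitive.tolist())  # [True, True, False]
```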

In [3]:
data.loc[data['for_analysis'] == True,['app']].value_counts()

app            
robinhood          7879
acorn              4343
stash              2445
etrade             1605
fidelity           1497
tdameritrade       1403
schwab             1079
personalcapital     509
dtype: int64

## 2. Removing English and Corpus-Specific Stopwords

To improve the performance of the NLP algorithm, I remove common English stop words from the reviews using **NLTK**. In addition, I remove corpus-specific stop words, which can be identified at a later stage, after fitting the model and examining the topics. Such words add no useful information to topic interpretation. Examples include *app*, *make*, and *get*.
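As a minimal sketch of this step, the two lists can be treated as sets and applied in one pass. The English set here is a tiny hand-written stand-in for NLTK's list (the real list is much longer), and `remove_stopwords` is a hypothetical helper name, not part of the pipeline.

```python
# Small stand-in for NLTK's English stop word list (illustrative only)
english_stopwords = {"the", "to", "my", "is", "this", "it", "a"}
# Corpus-specific stop words named in the text
corpus_stopwords = {"app", "make", "get"}

def remove_stopwords(tokens):
    """Drop tokens that appear in either stop word set."""
    blocked = english_stopwords | corpus_stopwords
    return [t for t in tokens if t not in blocked]

tokens = "this app will not let me get to my account".split()
print(remove_stopwords(tokens))  # ['will', 'not', 'let', 'me', 'account']
```

In the pipeline further down, the corpus-specific list is applied after lemmatization, so inflected forms such as *apps* are caught as well.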

## 3. Basic Spellcheck

User reviews are very informal and contain colloquial language. Therefore, I added a basic spelling-correction step to my NLP pipeline. Examples include *ppl* -> *people* and *hrs* -> *hours*. I found that such corrections improved topic quality.
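A correction step of this kind can be sketched as a plain dictionary lookup. The first two entries below come from the text; the third is a made-up example, and `fix_spelling` is a hypothetical helper name.

```python
# Illustrative correction table: "ppl" and "hrs" are from the text,
# "acct" is an invented example
corrections = {"ppl": "people", "hrs": "hours", "acct": "account"}

def fix_spelling(tokens):
    """Replace each token with its correction, if one is listed."""
    return [corrections.get(t, t) for t in tokens]

print(fix_spelling(["ppl", "wait", "hrs"]))  # ['people', 'wait', 'hours']
```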

## 4. Lemmatization

Different word forms can convey essentially the same meaning (e.g., *work* and *working*). However, the topic model might not be able to infer that, especially when the frequency of one form or the other is too low. Therefore, as a final step, the different word forms need to be collapsed into a single, most general form.

There are two main ways to do that: stemming and lemmatization. Stemming works by stripping suffixes to obtain a root form and is commonly performed using the Porter stemmer (`PorterStemmer` in NLTK). The main disadvantage of stemming is lost information: for example, the word *argue* would become *argu*, which is not very interpretable. Lemmatization, on the other hand, converts words into their dictionary form: *arguing* would become *argue*. The main disadvantage of lemmatization compared to stemming is that it collapses fewer forms, so more distinct words with similar meanings remain in the vocabulary of the topic model. The main advantage is that topics become more interpretable.
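The difference is easy to see on the example above. This is a small sketch using NLTK's `PorterStemmer` and `WordNetLemmatizer`. The lemmatizer needs the WordNet corpus (`nltk.download("wordnet")`), so the call is guarded in case the data is not installed; with WordNet available, the lemma of *arguing* as a verb should be *argue*.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stemming: rule-based suffix stripping, no dictionary needed
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in ["argue", "arguing"]]
print(stems)  # ['argu', 'argu']

# Lemmatization: dictionary lookup, here with an explicit verb POS tag
try:
    lemma = WordNetLemmatizer().lemmatize("arguing", pos="v")
except LookupError:
    lemma = None  # WordNet data is not installed
print(lemma)
```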

After experimenting with both stemming and lemmatization, I found that lemmatization produces significantly better topics. In the next cell I combine steps 2-4 into a single pipeline to produce the final review cohort for the topic model.

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
import nltk
import re

wordnet_map = {
    "N": wordnet.NOUN,
    "V": wordnet.VERB,
    "J": wordnet.ADJ,
    "R": wordnet.ADV,
    "S": wordnet.ADJ_SAT
}

# app names were added to stop word list to enhance generalizability
cohort_stopwords = data['app'].value_counts().index.tolist() + ['fargo'] + ['capitalone']
grammar_fix = {}

# adding the rest of the stop words:
# a line with one token is a stop word,
# a line with several tokens maps the first token to its correction
with open('stop_words.txt', 'r') as f:
    for line in f:
        s = line.split()
        if len(s) == 1:
            cohort_stopwords.append(s[0])
        elif len(s) > 1:
            grammar_fix[s[0]] = " ".join(s[1:])

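lem = WordNetLemmatizer()
english_stopwords = set(stopwords.words('english'))  # built once, not per word
cohort_stopwords = set(cohort_stopwords)

def preprocess(text):
    try:
        text = re.sub(r"\$", " money ", text)  # "$" -> "money"
        text = re.sub(r'(\d+)(\w+)', r'\1 \2', text)  # 2hrs -> 2 hrs
        # keep alphabetic tokens only, lowercased
        words = [word.lower() for word in word_tokenize(text) if word.isalpha()]
        # apply spelling corrections
        words = [grammar_fix.get(word, word) for word in words]
        words = [word for word in words if word not in english_stopwords]
        # lemmatize with the POS tag of each word (tagged in isolation), defaulting to noun
        words = [lem.lemmatize(word, wordnet_map.get(nltk.pos_tag([word])[0][1][0], wordnet.NOUN)) for word in words]
        words = [word for word in words if word not in cohort_stopwords]
        return words
    except Exception:
        # e.g. non-string input such as NaN
        return []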


In [6]:
preprocess("it takes 2hrs just to get my food. when it says 30mins.. don't get this app.")

['take', 'hour', 'food', 'say', 'minute']
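The two regular-expression substitutions at the start of `preprocess` can be tried in isolation. The sketch below wraps them in a hypothetical helper called `normalize`, which is not part of the pipeline, to show what they do to currency signs and to numbers glued to units.

```python
import re

def normalize(text):
    """Apply the two substitutions used at the start of preprocess."""
    text = re.sub(r"\$", " money ", text)         # "$" becomes the word "money"
    return re.sub(r"(\d+)(\w+)", r"\1 \2", text)  # "2hrs" -> "2 hrs"

print(normalize("2hrs"))    # 2 hrs
print(normalize("30mins"))  # 30 mins
print(normalize("30"))      # 3 0
```

Because `\w` also matches digits, a bare multi-digit number such as `30` is split as well. This is harmless in the pipeline, since the `isalpha()` filter drops all numeric tokens afterwards.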

In [12]:
# preprocessing all of the texts
texts = data.loc[data['for_analysis'], 'content'].values.tolist()
pr_texts = [preprocess(text) for text in texts]
data.loc[data['for_analysis'], 'content_processed'] = [" ".join(txt) for txt in pr_texts]

In [13]:
data.loc[data['for_analysis'], 'content_processed'].sample(5)

687369                                                                                                                                                                                                                                                                 download curiosity decides month later interact take fee bank account never investment plan question wait month inactive start charge fee email
309480    since around good start invest learn monthly fee ridiculous broker free set forget mentality isnt anymore either master option trading point normal broker sell equity right away price sell stuff random price day scam trading window stock etf absolutely reason trading like mutual fund yearly cost worth price give ebb flow market honestly waste time money slowly grown think swim want pay free do
281583                                                                                                                                                                                    

In [None]:
# saving the resulting texts into csv
data['content_processed'] = data['content_processed'].fillna("")
# reviews that end up empty after preprocessing are excluded from the analysis
data.loc[data['content_processed'].str.len() == 0, 'for_analysis'] = False
data.to_csv('dataset.csv')