In [64]:
# install datasets
!pip install datasets

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

train_dataset = dataset['train']
test_dataset = dataset['test']



In [65]:
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [66]:
from nltk.corpus import stopwords

In [67]:
import json

In [68]:
corpus = []

# Pre-processing

In this notebook, we analyze several pre-processing techniques and compare widely known architectures and pre-processing pipelines.

| Model | Vocabulary size |
| --- | --- |
| bert-base-uncased | 30522 |
| bert-base-cased | 28996 |
| bert-base-multilingual-cased | 119547 |
| xlm-mlm-en-2048 | 30145 |
| gpt2 | 50257 |
| roberta | 50265 |
| word2vec | 3000000 |
| glove | 400000 |

We will work on the first 1000 samples

# Steps with explanation

Does applying `lower()` harm the meaning of a review? **No**, so we convert all reviews to lower case.

Removing special characters can weaken some emotional signals, such as emoticons. However, sentences like these very likely contain other words that express the same emotion. For example:
* avesomeee! I liked it :)
* very bad taste of food. feeling angry to service :@

We also preserve a few stopwords that affect the meaning of a sentence, such as negations.

In [69]:
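all_stopwords = stopwords.words('english')
# Keep negation and contrast words: dropping them can flip a review's sentiment.
# Note: the cleaning cell below replaces apostrophes with spaces, so "won't"
# reaches the stopword filter as "won" and "t".
for keep in ('not', 'no', 'but', "won't"):
  all_stopwords.remove(keep)
# A set makes the membership test in the cleaning loop fast.
all_stopwords = set(all_stopwords)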

In [70]:

for row in train_dataset:
  # Replace every non-letter (digits, punctuation, newlines, tabs) with a space.
  text = re.sub('[^a-zA-Z]', ' ', row['text'])
  text = text.lower()
  # split() with no argument splits on any run of whitespace.
  words = [word for word in text.split() if word not in all_stopwords]
  corpus.append({'text': ' '.join(words), 'label': row['label']})

* re.sub(): replaces every character that is not a lowercase or uppercase letter with a space. This step removes special characters, numbers, and punctuation.
* .lower(): converts all characters to lowercase.
* .split(): splits the text on whitespace. This is not tokenization.
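
The three steps above can be sketched as a single function. This is a minimal illustration, not the notebook's cell: `clean_text` and the tiny `STOPWORDS` set are hypothetical names standing in for the NLTK list with the negation words already removed. The last call assumes the raw text may contain a literal backslash followed by `n` (two characters rather than a newline), which the `n`, `ni` and `nif` counts in the slang-frequency output further down seem to suggest; that is an inference, not something the notebook states.

```python
import re

# Hypothetical, tiny stopword set standing in for the NLTK list
# with "not", "no" and "but" already removed.
STOPWORDS = {"the", "was", "is", "a", "of", "to", "it", "i", "t"}

def clean_text(raw: str) -> str:
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOPWORDS)

# Negation survives because "not" is not in the stopword set.
print(clean_text("The food was NOT good :( 2 stars"))

# The emoticon disappears, but the emotional word remains.
print(clean_text("Awesomeee! I liked it :)"))

# A literal backslash + "n" leaves a stray "n" glued to or next to words.
print(clean_text("Nice place.\\n\\nWasn't busy"))
```

If the raw reviews do contain such literal sequences, replacing them with a space before the regular expression runs would avoid the stray `n` tokens.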

TODO: consider POS tagging.

Does the text contain any markup, or tokens that carry no content information, such as URLs?
There are many ways to check whether a string is a URL. We use the spaCy library, which we consider more reliable than alternatives such as hand-written regular expressions, although it is also the slowest option.

We also check whether a token is an e-mail address.
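
One point worth illustrating is the order of the steps. The sketch below uses two deliberately simple regular expressions as a stand-in for spaCy's `token.like_url` and `token.like_email`; the patterns and the helper name `drop_urls_and_emails` are assumptions made for this example only. It suggests that a URL or e-mail check is mostly useful on raw text, because once every non-letter has been replaced with a space, the pieces of an address no longer look like an address.

```python
import re

# Deliberately simple, hypothetical patterns; a stand-in for spaCy's
# token.like_url / token.like_email, not a reimplementation of them.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def drop_urls_and_emails(text: str) -> str:
    """Remove whitespace-separated tokens that look like URLs or e-mails."""
    kept = [tok for tok in text.split()
            if not (URL_RE.match(tok) or EMAIL_RE.match(tok))]
    return " ".join(kept)

raw = "Menu at http://example.com/menu or mail info@example.com today"

# On raw text the address-like tokens are found and dropped.
filtered_raw = drop_urls_and_emails(raw)
print(filtered_raw)

# After the letters-only cleaning, nothing is recognizable any more.
letters_only = re.sub(r"[^a-zA-Z]", " ", raw)
filtered_clean = drop_urls_and_emails(letters_only)
print(filtered_clean)
```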

In [71]:
import spacy
nlp = spacy.load("en_core_web_sm")

texts = [row['text'] for row in corpus]
labels = [row['label'] for row in corpus]

corpus_without_url = []
# Only the lexical token flags are needed, so the heavier pipeline components are disabled.
docs = nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"])
for doc, label in zip(docs, labels):
  # Drop tokens that look like URLs or e-mail addresses.
  words = [token.text for token in doc if not (token.like_url or token.like_email)]
  corpus_without_url.append({'text': ' '.join(words), 'label': label})

corpus = corpus_without_url

Given the dataset, it is fair to assume that users often write their reviews very informally, with misspellings, slang, and so on.
* A misspelling can change the meaning of a sentence.
* Slang often carries a meaning different from the one that first comes to mind.

Tokenizing this kind of review as-is can therefore lose information [1].

Some examples of slang:

* WD: Well done!
* Headdesk: Frustration

We therefore apply spell correction and use a slang dictionary to replace slang words with their actual meaning.

In [72]:
!pip install textblob



We inspected the TextBlob [3] results. However, it changes some words in very undesirable ways. For example:
* appetizer -> appetize
* garbage -> garage
* thrift -> thrifty

The harm to the meaning of the sentences outweighed the potential benefits.

In [73]:
# Disabled because the results did not satisfy us.
# from textblob import TextBlob
# for row in corpus:
#   corrected = str(TextBlob(row['text']).correct())
#   print("old")
#   print(row['text'])
#   print("new")
#   print(corrected)
#   row['text'] = corrected

According to [4], Gingerit [5] gives better results than the other tools available to us. However, it is not free and its API limits are very low. We therefore scraped https://www.noslang.com/ and built a slang dictionary from it.
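
The dictionary lookup itself is simple; a minimal sketch follows, with a hypothetical three-entry dictionary in place of the scraped `slang.json` and a hypothetical helper name `expand_slang`. Counting each replacement with `collections.Counter` makes it easier to spot entries that are also ordinary words (the frequency output below includes `cod` and `pita`, for instance), since a high count for such a term may point to unwanted replacements.

```python
from collections import Counter

# Hypothetical three-entry dictionary; the notebook loads its own from slang.json.
slang_dict = {"btw": "by the way", "thru": "through", "def": "definitely"}

def expand_slang(text: str, slang: dict, counts: Counter) -> str:
    """Replace each whitespace-separated slang term and count the replacements."""
    out = []
    for word in text.split():
        if word in slang:
            counts[word] += 1
        # Fall back to the word itself when it is not a slang term.
        out.append(slang.get(word, word))
    return " ".join(out)

counts = Counter()
expanded = expand_slang("btw drive thru def worth it", slang_dict, counts)
print(expanded)
print(counts.most_common())
```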

In [74]:
# Load the slang dictionary scraped from noslang.com (slang term -> expansion).
with open('slang.json', 'r') as file:
  slang_dict = json.load(file)

slang_dict_freq = {}  # how often each slang term was replaced
corpus_without_slang = []
for row in corpus:
  corrected_slang_words = []
  for word in row['text'].split():
    if word in slang_dict:
      slang_dict_freq[word] = slang_dict_freq.get(word, 0) + 1
      word = slang_dict[word]
    corrected_slang_words.append(word)

  corpus_without_slang.append({'text': ' '.join(corrected_slang_words), 'label': row['label']})

corpus = corpus_without_slang
print(slang_dict_freq)
with open('slang_freq.json', 'w') as file:
  json.dump(slang_dict_freq, file, indent=4)




{'nyc': 2521, 'st': 9305, 'n': 1027082, 'ni': 155947, 'kinda': 10960, 'btw': 2332, 'rep': 1595, 'atm': 1169, 'da': 1550, 'pics': 1509, 'nd': 5800, 'mins': 9249, 'admin': 107, 'na': 12405, 'tomorrow': 1962, 'tv': 12986, 'uber': 734, 'air': 9195, 'nah': 440, 'nif': 17050, 'ok': 58639, 'lol': 8932, 'nt': 34146, 'alright': 6793, 'min': 11224, 'u': 85837, 'gal': 1810, 'thru': 6847, 'app': 3553, 'hubby': 5222, 'def': 2554, 'pots': 865, 'apps': 2732, 'cod': 1783, 'og': 303, 'owned': 4501, 'poo': 269, 'wth': 782, 're': 840, 'lil': 1795, 'nin': 6906, 'usa': 601, 'promos': 98, 'bf': 5638, 'c': 12968, 'nps': 1243, 'bc': 2045, 'ah': 1233, 'bbq': 19348, 'meh': 8085, 'r': 5387, 'wanna': 2503, 'fyi': 1359, 'gf': 2570, 'info': 2267, 'yea': 1515, 'biz': 2772, 'gm': 829, 'tp': 168, 'rb': 52, 'deets': 17, 'fave': 1342, 'ta': 3989, 'moh': 52, 'noob': 68, 'k': 4396, 'snl': 40, 'np': 1451, 'sorta': 739, 'dfs': 2, 'nb': 384, 'pita': 5489, 'comp': 2481, 'abd': 56, 'ur': 1048, 'tho': 1741, 'xs': 1854, 'pf': 89

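Reducing words to their root form is also very beneficial, because it maps inflected variants to a shared form and helps represent the meaning of a sentence. Stemming, however, can sometimes produce *false-positive* matches by merging unrelated words [2]. The cells below use WordNet lemmatization rather than a stemmer.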
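
To make the false-positive risk concrete, here is a toy suffix stripper. It is a hypothetical rule set written for this illustration, not the Porter algorithm or any NLTK stemmer. It merges "tables" with "table", which is the intended effect, but it also merges "news" with "new", two words with different meanings.

```python
def toy_stem(word: str) -> str:
    """Strip one common suffix; a toy rule set, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for a, b in [("tables", "table"), ("news", "new")]:
    same = toy_stem(a) == toy_stem(b)
    print(f"{a!r} and {b!r} share a stem: {same}")
```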

In [75]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [76]:
lemmatizer = WordNetLemmatizer()
lemmatized_corpus = []
for row in corpus:
  # word_tokenize relies on the 'punkt' data downloaded above.
  words = nltk.word_tokenize(row['text'])
  # Without a POS argument, lemmatize() treats every word as a noun.
  words = [lemmatizer.lemmatize(word) for word in words]
  lemmatized_corpus.append({'text': ' '.join(words), 'label': row['label']})

corpus = lemmatized_corpus

We think spell checking should be done last, so we apply it to the already processed text.

In [77]:
!pip install autocorrect



In [78]:
from autocorrect import Speller

# One letter in a word should not be present more than twice in continuation
# (this rule is not applied in this cell; only spell correction runs here).
corpus_spell_checked = []
spell = Speller(lang='en', fast=True)
for row in corpus:
  corrected = spell(row['text'])
  corpus_spell_checked.append({'text': corrected, 'label': row['label']})

corpus = corpus_spell_checked
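
The spell-checking cell states that a letter should not appear more than twice in a row, but it does not implement that rule. A minimal sketch of one way to do it with `itertools.groupby` follows; the function name `cap_repeats` and its `max_run` parameter are hypothetical. Legitimate double letters such as the ones in "coffee" are left untouched.

```python
from itertools import groupby

def cap_repeats(text: str, max_run: int = 2) -> str:
    """Shorten every run of the same character to at most max_run characters."""
    return "".join("".join(list(group)[:max_run]) for _, group in groupby(text))

for sample in ("sooo goood", "awesomeee", "coffee"):
    print(sample, "->", cap_repeats(sample))
```

Running such a step before the spell checker would presumably give it shorter, more regular input, though whether that improves the corrections on this dataset has not been measured here.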

Save the CORPUS

In [79]:
with open('corpus_preprocessed.json','w') as file:
  json.dump(corpus, file, indent=4)

In [80]:
from google.colab import files
files.download('corpus_preprocessed.json')


### References
[1] https://www.semanticscholar.org/paper/Review-preprocessing-using-data-cleaning-and-Gharatkar-Ingle/7fc874f14491ee09ce84486aef6a9a485bea72bf

[2] Anjali Ganesh Jivani, "A Comparative Study of Stemming Algorithms", IJCTA, Volume 2, 2011, pp. 1930-1938.

[3] https://textblob.readthedocs.io/en/dev/

[4] https://medium.com/coinmonks/remaking-of-shortened-sms-tweet-post-slangs-and-word-contraction-into-sentences-nlp-7bd1bbc6fcff

[5] https://github.com/Azd325/gingerit

[6] https://www.semanticscholar.org/paper/Text-Mining-for-Yelp-Dataset-Challenge-Wang-Qiu/29107038293904776bb2ace25f9c2535a421916c

[7] https://www.semanticscholar.org/paper/Deep-Learning-for-Aspect-Based-Sentiment-Analysis-Cahyaningtyas-Fudholi/b9cf902da7d0bc02b593f406b0fc46e7e3bdc6aa

[8] https://www.semanticscholar.org/paper/Aspect-based-sentiment-analysis-for-Indonesian-Ekawati-Khodra/7aa6a0241bef1b6af640a4123d5fce8f43de82e2