Here we go over the text reviews and process those that correspond to the restaurants of interest. Processing consists of removing stop words, accented characters, and some special characters; expanding contractions; and replacing words with their lemmas. The main natural language processing libraries used are spaCy and NLTK.
The processing steps followed here were largely inspired by the following [tutorial](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72).

We first load the required packages and the violation dataset. 

In [1]:
import pandas as pd
import unicodedata
import re
from contractions import CONTRACTION_MAP
import spacy
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import json

Note that the imported contractions map was downloaded from [here](https://github.com/dipanjanS/practical-machine-learning-with-python/blob/master/bonus%20content/nlp%20proven%20approach/contractions.py).
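
To make the role of the contractions map concrete, here is a minimal sketch of dictionary-driven expansion. `SAMPLE_MAP` and `expand` are hypothetical stand-ins, assuming only that the real map is a dictionary from contracted to expanded forms; this is not the notebook's own function.

```python
import re

# Hypothetical three-entry stand-in for the imported CONTRACTION_MAP.
SAMPLE_MAP = {"can't": "cannot", "i'm": "i am", "won't": "will not"}

# Try longer keys first so a longer contraction wins over a shorter one.
pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(SAMPLE_MAP, key=len, reverse=True)),
    flags=re.IGNORECASE,
)

def expand(text, mapping=SAMPLE_MAP):
    """Replace each contraction found in text with its expanded form."""
    def _replace(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Keep the first character as written, so "I'm" becomes "I am".
        return found[0] + expanded[1:]
    return pattern.sub(_replace, text)

print(expand("I'm sure we won't wait, we can't."))
# I am sure we will not wait, we cannot.
```

Escaping the keys with `re.escape` is a defensive choice in this sketch, in case a key ever contains a regex metacharacter.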

In [2]:
data = pd.read_csv('dataset.csv')

We extract the IDs of the inspected businesses so that we can pull their text reviews from the reviews dataset.

In [3]:
ids_of_interest = set(data['business_id'])  # a set gives constant-time membership tests

In [4]:
with open('YELP/review.json', 'r', encoding='utf8') as f:
    reviews = []
    for line in f:
        # the file holds one JSON object per line
        review = json.loads(line)
        if review['business_id'] in ids_of_interest:
            reviews.append(review)
rw = pd.DataFrame(reviews, index=None)
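
The filtering step can be tried without the full Yelp file. The sketch below is an illustration under assumptions: the three records and the helper name `filter_reviews` are made up, and an in-memory `io.StringIO` stands in for the real file handle.

```python
import io
import json

# Made-up records standing in for lines of the reviews file.
sample_file = io.StringIO(
    '{"business_id": "a1", "text": "Great tacos"}\n'
    '{"business_id": "b2", "text": "Slow service"}\n'
    '{"business_id": "a1", "text": "Would return"}\n'
)

def filter_reviews(lines, wanted_ids):
    """Keep only the JSON-lines records whose business_id is wanted."""
    wanted = set(wanted_ids)  # set lookup is O(1) per review
    kept = []
    for line in lines:
        record = json.loads(line)
        if record["business_id"] in wanted:
            kept.append(record)
    return kept

kept = filter_reviews(sample_file, ["a1"])
print(len(kept))  # 2
```

Reading line by line keeps memory use proportional to the number of matching reviews rather than to the size of the file.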

We drop any reviews whose text is missing.

In [5]:
rw.drop(index=rw[rw.text.isnull()].index, inplace=True)

We next define the functions that we use to process the texts and then clean the text reviews.

In [6]:
# 'en' is a spaCy 2.x shortcut link; spaCy 3 requires a full package name such as 'en_core_web_sm'
nlp = spacy.load('en', parse=True, tag=True, entity=True)
tokenizer = ToktokTokenizer()
stopword_list = stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')
# extend adds each word separately; append would add the whole list as a single element
stopword_list.extend(['would', 'us'])

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

def remove_special_characters(text, remove_digits=True):
    # use A-Z, not A-z: the range A-z also matches [ \ ] ^ _ and the backtick
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text


def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

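def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        # the stopword list is lower case, so compare lower-cased tokens
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]

    filtered_text = ' '.join(filtered_tokens)
    return filtered_text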

def normalize_text(doc):
    
    doc = remove_accented_chars(doc)
    doc = expand_contractions(doc)
    doc = doc.lower()
    # replace line breaks with a single space ('|' inside a character class is a literal pipe)
    doc = re.sub(r'[\r\n]+', ' ', doc)
    doc = lemmatize_text(doc)
    # insert spaces between special characters to isolate them    
    # the hyphen is escaped so that it is matched literally instead of forming the range '(' to ')'
    special_char_pattern = re.compile(r'([{.()\-!}])')
    doc = special_char_pattern.sub(" \\1 ", doc)
    doc = remove_special_characters(doc)  
    # remove extra whitespace
    doc = re.sub(' +', ' ', doc)
    # remove stopwords
    doc = remove_stopwords(doc)
  
        
    return doc
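
Two of the steps above depend only on the standard library, so their behaviour is easy to check in isolation. The following is a sketch with illustrative names (`strip_accents`, `strip_specials`), not the notebook's functions; the last line shows how a character range written `A-z` differs from `A-Z`.

```python
import re
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into a base letter plus a combining
    # mark; encoding to ASCII with 'ignore' then drops the mark.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def strip_specials(text, remove_digits=True):
    # Keep letters and whitespace, and digits too when remove_digits is False.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

print(strip_accents("Café crème brûlée"))   # Cafe creme brulee
print(strip_specials("table_4 [ok]!"))      # table ok

# With the range A-z, the characters [ ] and _ fall inside the class and survive.
print(re.sub(r"[^a-zA-z\s]", "", "table_4 [ok]!"))  # table_ [ok]
```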

In [7]:
rw['clean_text'] = rw.text.apply(normalize_text)
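
To get a feel for what the cleaned column holds, here is a rough standard-library stand-in for the pipeline. It is a sketch under loud assumptions: `rough_clean` and `SAMPLE_STOPWORDS` are hypothetical, there is no lemmatization or contraction expansion, and the stopword set is a tiny sample rather than NLTK's list.

```python
import re

# Tiny hypothetical stopword set; "no" and "not" are deliberately absent
# so that negations are kept.
SAMPLE_STOPWORDS = {"the", "was", "and", "a", "it", "would", "us"}

def rough_clean(text):
    """Lower-case, drop line breaks and non-letters, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[\r\n]+", " ", text)
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = text.split()
    return " ".join(t for t in tokens if t not in SAMPLE_STOPWORDS)

reviews = ["The food was great,\nand it was not expensive!", "Service was slow."]
cleaned = [rough_clean(r) for r in reviews]
print(cleaned)  # ['food great not expensive', 'service slow']
```

Keeping "not" in the text matters for reviews, since "not expensive" and "expensive" carry opposite meanings.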

We finally save the obtained results.

In [8]:
rw.to_csv('rw.csv')