## Basic text pre-processing

This notebook walks through the basic text preprocessing steps listed below. A similar walkthrough is available on [Analytics Vidhya's website](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/).

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

In [1]:
# imports
import nltk
import numpy as np
import pandas as pd
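try:
    from pandas import json_normalize  # top-level location since pandas 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions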
import json

from textblob import TextBlob
from textblob import Word
from nltk.stem import PorterStemmer
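from nltk.corpus import stopwords
# The import is lazy and succeeds even when the corpus is missing,
# so fetch the stopword list explicitly (a no-op if already present).
nltk.download('stopwords', quiet=True)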

In [2]:
# data imports
with open('../../data/raw/yp_dukes-waikiki-honolulu-2_rws.json') as f:
    json_data = json.load(f)

dataset = json_normalize(json_data['reviews'])
dataset.head()

# take the third column as the raw text to preprocess
lines = dataset.iloc[:,2].values


## Text Preprocessing Methods

### Lower casing

In [3]:
def lower_line(line):
    # Lowercase every token; split/join also collapses repeated whitespace.
    return ' '.join(x.lower() for x in line.split())

### Punctuation Removal

In [4]:
import re
def punctuation_line(line):
    # str.replace is literal; re.sub removes non-word, non-space characters
    return re.sub(r'[^\w\s]', '', line)
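
To see why a regular-expression substitution is needed for this step, the sketch below compares `str.replace`, which treats its first argument as a literal substring, with `re.sub`, which interprets it as a pattern. The sample sentence is made up for illustration.

```python
import re

sample = "great view, slow service!!"

# str.replace searches for the literal text "[^\w\s]", which never occurs
# in the sample, so the string comes back unchanged.
literal_result = sample.replace(r'[^\w\s]', '')

# re.sub treats the same text as a pattern: every character that is neither
# a word character nor whitespace is removed.
regex_result = re.sub(r'[^\w\s]', '', sample)

print(literal_result)
print(regex_result)
```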

### Stopwords Removal

In [5]:
def stopwords_line(line):
    stopwords_list = set(stopwords.words('english'))
    line = ' '.join(x for x in line.split() if x not in stopwords_list)
    return(line)
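
The order of punctuation removal and stopword removal can matter for contractions. The sketch below uses a tiny hand-written stopword set (`toy_stopwords`, a stand-in for NLTK's much longer list) and a hypothetical helper `drop_stopwords` to show that stripping the apostrophe first can leave a token the stopword filter no longer recognizes.

```python
import re

# Hypothetical miniature stopword set; NLTK's English list is much longer.
toy_stopwords = {"you're", "the", "a", "if"}

def drop_stopwords(text, stopword_set):
    """Keep only whitespace-separated tokens that are not stopwords."""
    return ' '.join(tok for tok in text.split() if tok not in stopword_set)

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)

text = "if you're hungry try the katsu"

# Stopwords first: the contraction matches the set and is removed.
stop_then_punct = strip_punctuation(drop_stopwords(text, toy_stopwords))

# Punctuation first: "you're" becomes "youre", which is not in the set.
punct_then_stop = drop_stopwords(strip_punctuation(text), toy_stopwords)

print(stop_then_punct)
print(punct_then_stop)
```

Whether this matters in practice depends on the stopword list in use and on how much weight leftover tokens like `youre` carry downstream.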

### Frequent Words Removal

In [6]:
# Split the joined text into words so that words, not whole strings, are counted.
freq = pd.Series(' '.join(lines).split()).value_counts()[:10]
freq = list(freq.index)

def freqwords_line(line):
    line = " ".join(x for x in line.split() if x not in freq)
    return(line)
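
Word frequencies can also be counted without pandas. The sketch below uses `collections.Counter` on a made-up three-review corpus to pick the most common tokens and filter them out. The names `toy_lines`, `top_n`, and `remove_frequent` are illustrative and not part of the notebook.

```python
from collections import Counter

toy_lines = [
    "service was slow",
    "service was friendly",
    "view was great and service was fast",
]

# Count every whitespace-separated token across the whole corpus.
counts = Counter(word for line in toy_lines for word in line.split())

top_n = 2  # illustrative cut-off
frequent = {word for word, _ in counts.most_common(top_n)}

def remove_frequent(line, frequent_words):
    """Drop tokens that belong to the set of most frequent words."""
    return ' '.join(w for w in line.split() if w not in frequent_words)

cleaned = [remove_frequent(line, frequent) for line in toy_lines]
print(frequent)
print(cleaned)
```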

### Rare Words Removal

In [7]:
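# Split into words before counting, then keep the ten least frequent.
rare = pd.Series(' '.join(lines).split()).value_counts()[-10:]
rare = list(rare.index)

def rarewords_line(line):
    line = " ".join(x for x in line.split() if x not in rare)
    return(line)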
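
Taking the last ten entries of a frequency table selects an arbitrary ten when many words share the lowest count. One alternative, shown here as a sketch and not as the notebook's method, is to drop every word seen fewer than `min_count` times. The corpus, the threshold, and the helper name `remove_rare` are illustrative.

```python
from collections import Counter

toy_lines = [
    "mojito was refreshing",
    "mojito was sweet",
    "katsu was dry",
]

counts = Counter(word for line in toy_lines for word in line.split())

min_count = 2  # hypothetical threshold: keep words seen at least twice
rare = {word for word, n in counts.items() if n < min_count}

def remove_rare(line, rare_words):
    """Drop tokens whose corpus-wide count is below the threshold."""
    return ' '.join(w for w in line.split() if w not in rare_words)

cleaned = [remove_rare(line, rare) for line in toy_lines]
print(sorted(rare))
print(cleaned)
```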

### Spelling Correction

In [8]:
def spellcheck_line(line):
    return(str(TextBlob(line).correct()))

### Tokenization

In [9]:
def tokenize_line(line):
    return(" ".join(TextBlob(str(line)).words))

### Stemming

In [10]:
st = PorterStemmer()
def stemming_line(line):
    line = " ".join([st.stem(word) for word in line.split()])
    return(line)

### Lemmatization

In [11]:
def lemnatize_line(line):
    line = " ".join([Word(word).lemmatize() for word in line.split()])
    return(line)

## Implementation on sample dataset

In [12]:
nltk.download('punkt')
nltk.download('wordnet')
# apply the preprocessing steps to every review
lines_arr = []
for line in lines:
    # lower
    line = lower_line(line)
    # punctuation
    line = punctuation_line(line)
    # stopwords
    line = stopwords_line(line)
    # freq
    line = freqwords_line(line)
    # rare
    # line = rarewords_line(line)
    # spelling
    # line = spellcheck_line(line)
    # tokenize
    line = tokenize_line(line)
    # stemming
    line = stemming_line(line)
    # lemmatization
    line = lemnatize_line(line)
    
    lines_arr.append(line)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Anika\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Anika\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
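
The loop above toggles steps by commenting lines in and out. A minimal alternative sketch keeps the steps in a list and applies them in order, so enabling or disabling a step is a one-line change to the list. The three step functions here are simplified stand-ins defined only for this example; in the notebook the list would hold `lower_line`, `stopwords_line`, and so on.

```python
import re
from functools import reduce

def to_lower(line):
    return line.lower()

def strip_punctuation(line):
    return re.sub(r'[^\w\s]', '', line)

def collapse_whitespace(line):
    return ' '.join(line.split())

# Order matters: each step receives the output of the previous one.
steps = [to_lower, strip_punctuation, collapse_whitespace]

def run_pipeline(line, steps):
    """Apply each step to the line in sequence and return the final text."""
    return reduce(lambda text, step: step(text), steps, line)

result = run_pipeline("Our server, however,  was AMAZING!", steps)
print(result)
```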


In [13]:
print(lines[2])
print('--------')
print(lines_arr[2])

Stopped by on a weekday night and the place was absolutely packed with 50 minute wait times both for the dining area and beachside seating. Every after some rain/sprinkles which washed some people out from their tables, it still took them awhile to bus and refill tables. In addition the hostess on the beachside was out of wireless buzzers to call you, so we had to check back in every 10-15 minutes to finally get one and often he forgot who needed one next.

Our server however was absolutely amazing, very friendly, attentive and provided excellent drink recommendations. Without a doubt try the coconut mojito if you're looking a refreshing, not overly sweet cocktail to enjoy the view with. 

Food however was quite disappointing, at least for my meal, which was the macadamia crusted chicken katsu. The chicken was over cooked and dry, as was the crust itself which lacked any macadamia nuts. Rice had definitely sat under a heat lamp for awhile as it had formed a hard outer crust.... and why