## Basic text pre-processing

This notebook walks through the basic text preprocessing steps listed below. A similar walkthrough is available on [Analytics Vidhya's website](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/).

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

In [1]:
# imports
import nltk
import numpy as np
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import json

from textblob import TextBlob
from textblob import Word
from nltk.stem import PorterStemmer
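from nltk.corpus import stopwords
try:
    # the corpus loads lazily, so a missing download only surfaces on first access
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')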

In [2]:
# data imports
json_data = {}

with open('../../data/raw/yp_dukes-waikiki-honolulu-2_rws.json') as f:
    json_data = json.load(f)

dataset = json_normalize(json_data['reviews'])
dataset.head()

lines = dataset.iloc[:,2].values


## Text Preprocessing Methods

### Lower casing

In [3]:
def lower_line(line):
    line_arr = [x.lower() for x in line.split()]
    return(' '.join(line_arr))

### Punctuation Removal

In [4]:
import re
def punctuation_line(line):
    # str.replace would treat the pattern literally; use a regex substitution
    return re.sub(r'[^\w\s]', '', line)
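
One detail worth checking in this step is that Python's `str.replace` matches a literal substring, while `re.sub` interprets its first argument as a regular expression. The following minimal sketch, using a made-up sample sentence, shows the difference:

```python
import re

# Made-up sample sentence, for illustration only.
sample = "Great view, boring burger... $19?!"

# str.replace looks for the literal text "[^\w\s]", which never occurs,
# so the string comes back unchanged.
literal = sample.replace(r'[^\w\s]', '')

# re.sub interprets the pattern: drop every character that is neither
# a word character nor whitespace.
stripped = re.sub(r'[^\w\s]', '', sample)

print(literal)
print(stripped)
```

Only the `re.sub` call actually removes punctuation from a plain string.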

### Stopwords Removal

In [5]:
def stopwords_line(line):
    stopwords_list = set(stopwords.words('english'))
    line = ' '.join(x for x in line.split() if x not in stopwords_list)
    return(line)
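
Stopword filtering is case-sensitive, so the order of the steps matters. The sketch below uses a tiny stand-in stop list (hypothetical, not NLTK's full list, which is likewise all lowercase) to show why lower casing should come first:

```python
# Tiny stand-in stop list (hypothetical); NLTK's English list is also lowercase.
STOPWORDS = {"the", "was", "and", "it"}

def remove_stopwords(line, stop=STOPWORDS):
    # Keep only the words that are not in the stop list.
    return " ".join(w for w in line.split() if w not in stop)

raw = "The view was phenomenal"
without_lowering = remove_stopwords(raw)       # "The" survives: capital T
with_lowering = remove_stopwords(raw.lower())  # "the" is filtered out

print(without_lowering)
print(with_lowering)
```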

### Frequent Words Removal

In [6]:
# count individual words, not the single joined string
freq = pd.Series(' '.join(lines).split()).value_counts()[:10]
freq = list(freq.index)

def freqwords_line(line):
    line = " ".join(x for x in line.split() if x not in freq)
    return(line)
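
The same idea can be sketched with the standard library alone. The example below assumes a hypothetical three-sentence corpus and uses `collections.Counter` in place of pandas; the key point is that the text is split into words before counting:

```python
from collections import Counter

# Hypothetical mini corpus standing in for the review texts.
docs = ["the food was good", "the view was great", "the fries were stale"]

# Split every document into words before counting.
counts = Counter(word for doc in docs for word in doc.split())

# Keep the two most frequent words as the removal set.
top_words = {word for word, _ in counts.most_common(2)}

def drop_frequent(line, frequent=top_words):
    # Remove every word that belongs to the frequent set.
    return " ".join(w for w in line.split() if w not in frequent)

print(drop_frequent("the view was great"))
```

A set is used for the removal list so that each membership check is a constant-time lookup.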

### Rare Words Removal

In [7]:
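# count individual words and keep the ten least frequent ones
rare = pd.Series(' '.join(lines).split()).value_counts()[-10:]
rare = list(rare.index)

def rarewords_line(line):
    line = " ".join(x for x in line.split() if x not in rare)
    return(line)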
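
Taking the last ten entries of a frequency table picks somewhat arbitrarily among the many words that occur only once. One possible alternative, sketched here with a hypothetical corpus and a hypothetical `min_count` threshold, is to treat every word below a minimum count as rare:

```python
from collections import Counter

# Hypothetical mini corpus standing in for the review texts.
docs = ["good burger good view", "good fries stale fries", "burger view"]
counts = Counter(word for doc in docs for word in doc.split())

# Words seen fewer than min_count times are treated as rare (assumed threshold).
min_count = 2
rare_words = {w for w, c in counts.items() if c < min_count}

def drop_rare(line, rare=rare_words):
    # Remove every word that belongs to the rare set.
    return " ".join(w for w in line.split() if w not in rare)

print(drop_rare("good fries stale fries"))
```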

### Spelling Correction

In [8]:
def spellcheck_line(line):
    return(str(TextBlob(line).correct()))

### Tokenization

In [9]:
def tokenize_line(line):
    return(" ".join(TextBlob(str(line)).words))

### Stemming

In [10]:
st = PorterStemmer()
def stemming_line(line):
    line = " ".join([st.stem(word) for word in line.split()])
    return(line)

### Lemmatization

In [11]:
def lemnatize_line(line):
    line = " ".join([Word(word).lemmatize() for word in line.split()])
    return(line)

## Implementation on sample dataset

In [12]:
nltk.download('punkt')
nltk.download('wordnet')
# run every review through the preprocessing steps
lines_arr = []
for line in lines:
    # lower
    line = lower_line(line)
    # punctuation
    line = punctuation_line(line)
    # stopwords
    line = stopwords_line(line)
    # freq
    line = freqwords_line(line)
    # rare
    # line = rarewords_line(line)
    # spelling
    # line = spellcheck_line(line)
    # tokenize
    line = tokenize_line(line)
    # stemming
    line = stemming_line(line)
    # lemmatization
    line = lemnatize_line(line)
    
    lines_arr.append(line)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Anika\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Anika\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
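
The loop above applies the steps one after another by hand. As a rough sketch of an alternative, the steps can be kept in an ordered list and folded over each line. The three step functions below are simplified stand-ins (hypothetical, standard library only); the notebook's own functions could be listed in `STEPS` instead:

```python
from functools import reduce
import re

# Stand-in steps (hypothetical), each taking and returning a string.
def lower(line):
    return line.lower()

def strip_punct(line):
    return re.sub(r'[^\w\s]', '', line)

def squeeze_spaces(line):
    return " ".join(line.split())

# The order of this list is the order of execution.
STEPS = [lower, strip_punct, squeeze_spaces]

def preprocess(line, steps=STEPS):
    # Feed the output of each step into the next one, in order.
    return reduce(lambda text, step: step(text), steps, line)

print(preprocess("The  Fries... were STALE!"))
```

Switching a step on or off then means editing the list rather than commenting out lines in the loop.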


In [13]:
print(lines[2])
print('--------')
print(lines_arr[2])

My boyfriend and I went to Dukes on our first night in Maui. The waitress was really nice and the view was phenomenal... other than that I don't think there was much to this place. The cheeseburger was worse than one I have made myself... overdone meat... basic idea plus dripping with garlic aoil... very boring for a $19 dollar price tag. If something is going to be that expensive it needs to be a really good burger. A burger is so basic I don't understand how an upscale place like this can do it so bad. Also the fries could have been amazing I loveeeee tiny shoelace fries but they were stale! 

My boyfriend had the chicken and it was fairly good. He would give it 3 stars. It wasn't anything so exciting that we have to go back for it, but the mash potatoes were really good and the greens were done perfectly. Nothing super special about the chicken, but it wasn't dry.

My boyfriend wants to go back for Happy Hour to try it again so we probably will. Hopefully it's better.
--------
boyfr