## Basic text pre-processing

The following basic preprocessing will be examined in this notebook. A similar example for this study can be found at [Analytics Vidhya's Website](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/).

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

In [2]:
# imports
import nltk
import numpy as np
import pandas as pd
from pandas.io.json import json_normalize
import json

from textblob import TextBlob
from textblob import Word
from nltk.stem import PorterStemmer
try:
    from nltk.corpus import stopwords
except:
    nltk.download('stopwords')

In [5]:
# data imports
json_data = {}

with open('../../data/raw/yp_leilanis-lahaina-2_rws.json') as f:
    json_data = json.loads(f.read())

dataset = json_normalize(json_data['reviews'])
dataset.head()

lines = dataset.iloc[:,2].values


Unnamed: 0,author,datePublished,description,ratingValue
0,Giorgio C.,2018-10-13,"Try good service, beach front so a bit loud. M...",4
1,Maxx C.,2018-10-05,When we arrived they gave us a choice of eatin...,5
2,Al D.,2018-10-04,Stopped in here on a Tuesday evening around 8p...,4
3,Zachary D.,2018-09-29,Hawaiian chain type restaurant with pretty dec...,4
4,Chilly P.,2018-07-10,Oh my. Where do I even begin...\n\nLet's start...,1


## Text Preprocessing Methods

### Lower casing

In [6]:
def lower_line(line):
    line_arr = [x.lower() for x in line.split()]
    return(' '.join(line_arr))

### Punctuation Removal

In [7]:
def punctuation_line(line):
    return(line.replace('[^\w\s]',''))

### Stopwords Removal

In [8]:
def stopwords_line(line):
    stopwords_list = set(stopwords.words('english'))
    line = ' '.join(x for x in line.split() if x not in stopwords_list)
    return(line)

### Frequent Words Removal

In [9]:
freq = pd.Series(' '.join(lines)).value_counts()[:10]
freq = list(freq.index)

def freqwords_line(line):
    line = " ".join(x for x in line.split() if x not in freq)
    return(line)

### Rare Words Removal

In [10]:
rare = pd.Series(' '.join(lines)).value_counts()[-10:]

def rarewords_line(line):
    line = " ".join(x for x in line.split() if x not in rare)

### Spelling Correction

In [11]:
def spellcheck_line(line):
    return(str(TextBlob(line).correct()))

### Tokenization

In [12]:
def tokenize_line(line):
    return(" ".join(TextBlob(str(line)).words))

### Stemming

In [13]:
st = PorterStemmer()
def stemming_line(line):
    line = " ".join([st.stem(word) for word in line.split()])
    return(line)

### Lemmatization

In [14]:
def lemnatize_line(line):
    line = " ".join([Word(word).lemmatize() for word in line.split()])
    return(line)

##  Implementation on sample dataset

In [15]:

# tokenize
lines_arr = []
for line in lines:
    # lower
    line = lower_line(line)
    # punctuation
    line = punctuation_line(line)
    # stopwords
    line = stopwords_line(line)
    # freq
    line = freqwords_line(line)
    # rare
    # line = rarewords_line(line)
    # spelling
    # line = spelling_line(line)
    # tokenize
    line = tokenize_line(line)
    # stemming
    line = stemming_line(line)
    # lemmatization
    line = lemnatize_line(line)
    
    lines_arr.append(line)


**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  Searched in:
    - 'C:\\Users\\Anika/nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - 'C:\\Users\\Anika\\Anaconda3\\nltk_data'
    - 'C:\\Users\\Anika\\Anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Anika\\AppData\\Roaming\\nltk_data'
    - ''
**********************************************************************



MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.


In [53]:
print(lines[2])
print('--------')
print(lines_arr[2])

Stopped in here on a Tuesday evening around 8pm and didn't have a problem getting a table for two on the patio. Menu is a bit higher priced than I think it should be but still less than most of the hotel restaurants so not too much of a rip off. I ended up getting the fish tacos which were pretty good. The server we had was friendly and fast, and we enjoyed sitting and watching the sunset. 

This is a solid spot if you're looking for a simple menu and don't want to dine in at the hotel restaurants.
--------
stop tuesday even around 8pm problem get tabl two patio menu bit higher price think still le hotel restaur much rip off end get fish taco pretti good server friendli fast enjoy sit watch sunset solid spot look simpl menu want dine hotel restaur
