## Basic text pre-processing

The following basic preprocessing will be examined in this notebook. A similar example for this study can be found at [Analytics Vidhya's Website](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/).

- Lower casing
- Punctuation removal
- Stopwords removal
- Frequent words removal
- Rare words removal
- Spelling correction
- Tokenization
- Stemming
- Lemmatization

In [1]:
# imports
import nltk
import numpy as np
import pandas as pd

from textblob import TextBlob
from textblob import Word
from nltk.stem import PorterStemmer
try:
    from nltk.corpus import stopwords
except:
    nltk.download('stopwords')

In [2]:
dataset = pd.read_csv('../../data/processed/yp_leilanis-lahaina-2_rws.csv')
dataset.head()


Unnamed: 0,status,reviews
0,1,"Try good service, beach front so a bit loud. M..."
1,1,When we arrived they gave us a choice of eatin...
2,1,Stopped in here on a Tuesday evening around 8p...
3,1,Hawaiian chain type restaurant with pretty dec...
4,0,Oh my. Where do I even begin...\n\nLet's start...


In [3]:
reviews = dataset.iloc[:,1]
reviews.head()

0    Try good service, beach front so a bit loud. M...
1    When we arrived they gave us a choice of eatin...
2    Stopped in here on a Tuesday evening around 8p...
3    Hawaiian chain type restaurant with pretty dec...
4    Oh my. Where do I even begin...\n\nLet's start...
Name: reviews, dtype: object

## Text Preprocessing Methods

### Lower casing

In [45]:
def lower_line(line):
    line_arr = [x.lower() for x in line.split()]
    return(' '.join(line_arr))

### Punctuation Removal

In [72]:
def punctuation_line(line):
    line = line.replace("[^a-zA-Z#]", " ")
    # line = line.replace('[^\w\s]','')
    return(line)

### Short Words Removal

In [47]:
def shortwords_line(line):
    # line.apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
    return (' '.join([w for w in line.split() if len(w)>2]))

### Stopwords Removal

In [48]:
def stopwords_line(line):
    stopwords_list = set(stopwords.words('english'))
    line = ' '.join(x for x in line.split() if x not in stopwords_list)
    return(line)

### Frequent Words Removal

In [82]:
freq = pd.Series(' '.join(lines).split()).value_counts()[:10]

freq = list(freq.index)

def freqwords_line(line):
    line = " ".join(x for x in line.split() if x not in freq)
    return(line)

### Rare Words Removal

In [81]:
rare = pd.Series(' '.join(lines).split()).value_counts()[-10:]

rare = list(rare.index)

def rarewords_line(line):
    line = " ".join(x for x in line.split() if x not in rare)

### Spelling Correction

In [51]:
    def spellcheck_line(line):
        return(str(TextBlob(line).correct()))

### Tokenization

In [52]:
def tokenize_line(line):
    return(" ".join(TextBlob(str(line)).words))

### Stemming

In [53]:
st = PorterStemmer()
def stemming_line(line):
    line = " ".join([st.stem(word) for word in line.split()])
    return(line)

### Lemmatization

In [54]:
def lemnatize_line(line):
    line = " ".join([Word(word).lemmatize() for word in line.split()])
    return(line)

##  Implementation on sample dataset

In [80]:


line = "It started out really good.  Crab cakes were ok, \
beet salad ok even though I had no idea what the cheese was \
and then after 45 minutes, cold entree.  But you know how it is. \
You wait so long and then it's cold like the runner took his 15 minute \
break and forgot about you but your starving so you eat it anyways. \
That's how it was for me."
# lower
line = lower_line(line)
# punctuation
line = punctuation_line(line)
# stopwords
line = stopwords_line(line)
# freq
line = freqwords_line(line)
# shortwords
line = shortwords_line(line)
# rare
# line = rarewords_line(line)
# spelling
# line = spelling_line(line)
# tokenize
line = tokenize_line(line)
# stemming
line = stemming_line(line)
# lemmatization
line = lemnatize_line(line)

line

"start realli good crab cake ok beet salad even though idea chees minut cold entre know is wait long cold like runner took minut break forgot starv eat anyway that 's me"

In [83]:
# tokenize
lines_arr = []
for line in reviews:
    # lower
    line = lower_line(line)
    # punctuation
    line = punctuation_line(line)
    # stopwords
    line = stopwords_line(line)
    # freq
    line = freqwords_line(line)
    # shortwords
    line = shortwords_line(line)
    # rare
    line = rarewords_line(line)
    # spelling
    # line = spelling_line(line)
    # tokenize
    line = tokenize_line(line)
    # stemming
    line = stemming_line(line)
    # lemmatization
    line = lemnatize_line(line)
    
    lines_arr.append(line)

In [85]:
print(lines[2])
print('--------')
print(lines_arr[2])

Stopped in here on a Tuesday evening around 8pm and didn't have a problem getting a table for two on the patio. Menu is a bit higher priced than I think it should be but still less than most of the hotel restaurants so not too much of a rip off. I ended up getting the fish tacos which were pretty good. The server we had was friendly and fast, and we enjoyed sitting and watching the sunset. 

This is a solid spot if you're looking for a simple menu and don't want to dine in at the hotel restaurants.
--------
none


In [59]:
dataset['reviews'] = lines_arr
dataset.head()

Unnamed: 0,status,reviews
0,1,tri good servic beach front bit loud menu dive...
1,1,arriv gave choic eat din room patio there 's t...
2,1,stop tuesday even around 8pm problem get tabl ...
3,1,hawaiian chain type restaur pretti decent food...
4,0,my even begin let 's start posit dine fancier ...


In [60]:
dataset.to_csv('../../data/processed/yp_leilanis-lahaina-2_rws_1.csv', index=False)

In [62]:
dataset_new = pd.read_csv('../../data/processed/yp_leilanis-lahaina-2_rws_1.csv')
dataset_new.head()

Unnamed: 0,status,reviews
0,1,tri good servic beach front bit loud menu dive...
1,1,arriv gave choic eat din room patio there 's t...
2,1,stop tuesday even around 8pm problem get tabl ...
3,1,hawaiian chain type restaur pretti decent food...
4,0,my even begin let 's start posit dine fancier ...
