# Your Personal Trip Advisor
#### Recommending attractions in Yosemite National Park based on personal preferences, by referencing reviews written by users on Trip Advisor.

## Part 2 of 4

**Objective**
In this notebook, we clean the reviews scraped from Trip Advisor in several ways. The aim is to optimize the corpus for topic modeling, in support of the overall project objectives.

In [8]:
import re
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize, RegexpTokenizer, regexp_tokenize, WhitespaceTokenizer
from nltk.corpus import stopwords, wordnet
from nltk.tag import pos_tag
import spacy
from collections import Counter
from nltk import FreqDist  # used below to count word frequencies
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

#Custom python module to help clean the text here
import nlp_preprocessing

spacy_nlp = spacy.load('en')

pd.set_option("display.max_rows", 50)
pd.set_option("display.max_columns", None)
pd.set_option('display.max_colwidth', None)

%load_ext autoreload
%autoreload 2



In [9]:
df = pd.read_csv("../Data/attraction_point_reviews.csv")
df.sample()
df.shape

(10725, 11)

## Cleaning the Text 

#### To get an initial feel for the data, the following steps are undertaken:
1. All reviews are combined into one giant string corpus.
1. All words are converted to lowercase.
1. Website links and email addresses are dropped.
1. Many words are joined only by punctuation, so simply dropping the punctuation would fuse them together. Instead, the following punctuation characters are replaced with whitespace: `<.*?>;-!()/,:&—\`
1. Everything except letters and whitespace is then dropped, with no substitution.
1. The string is tokenized on whitespace.
1. The tokens are converted into a word counter using FreqDist to explore frequencies and inspect the words used.

In [12]:
all_reviews = ' '.join(df.review_text.tolist())
all_reviews = all_reviews.lower()
all_reviews = re.sub(r'http\S+', '', all_reviews)   # drop web links
all_reviews = re.sub(r'\S*@\S+', '', all_reviews)   # drop email addresses

all_reviews = re.sub(r'[<.*?>;\-!()/,:&—\\]+', ' ', all_reviews)
all_reviews = re.sub(r'[^A-Za-z\s]', '', all_reviews)
words = WhitespaceTokenizer().tokenize(all_reviews)
words_count = FreqDist(words)

long_words = [w for w in words_count if len(w) > 15]
small_words = [w for w in words_count if len(w) < 4]

The above process was carried out iteratively, so that as many long words as possible are captured correctly rather than being arbitrary combinations of words joined by punctuation.

**The cleaning strategy was applied to the reviews in the dataframe.**

In [6]:
df['reviews_basic_clean'] = df.review_text.map(nlp_preprocessing.cleaning)
df.reviews_basic_clean.sample()

1202    having visited all of the major parks in the us  trekking the himalayas   and touring europe  asia  and africa  this ranks in the top ten spots  parking can be difficult  we had a handicap parking permit  and drove another couple about a mile to their car  the parks department is missing a marketing opportunity  by not having cast models of half dome  or the vista  not crowded walking around  very limited cell phone service for those who need to be connected  with all the hiking trails  i would think this could be a safety issue  no sympathy for just those who are addicted to being connected however  
Name: reviews_basic_clean, dtype: object
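The `nlp_preprocessing.cleaning` helper itself is not shown in this notebook. As a rough illustration only, the sketch below assumes it applies the same per-review steps that were listed earlier for the combined corpus; the function name `basic_clean` and the sample sentence are hypothetical, and the real helper may differ.

```python
import re

def basic_clean(text):
    """Hypothetical stand-in for nlp_preprocessing.cleaning (assumed behaviour)."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)                   # drop web links
    text = re.sub(r'\S*@\S+', '', text)                   # drop email addresses
    text = re.sub(r'[<.*?>;\-!()/,:&—\\]+', ' ', text)    # listed punctuation -> space
    text = re.sub(r'[^a-z\s]', '', text)                  # keep letters and whitespace only
    return text

sample = "Great views/hikes! See http://example.com or mail me@example.com."
print(basic_clean(sample).split())
```

Replacing the listed punctuation with a space before dropping everything else is what keeps "views/hikes" from collapsing into a single token.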

After cleaning, the words are lemmatized to reduce the variety of word forms in the corpus. Additionally, all pronouns are dropped.

In [7]:
%%time
df['review_lemma'] = df.reviews_basic_clean.apply(nlp_preprocessing.spacy_lemmatization)

Wall time: 7min 8s


In [8]:
df[['review_lemma','reviews_basic_clean']].sample(2)

Unnamed: 0,review_lemma,reviews_basic_clean
8312,park car at the parking lot close to the village store and take a walk to the low waterfall to minimize walk can also take the free shuttle to bus stop be approx minute one way through pave but in some part snowy trail with awesome view especially from the small bridge where the flow river be partly freeze the fall be greatly pour and splash water to the surround area include the bridge despite the warning a few visitor be climb the rock try to get close to the fall,we parked our car at the parking lot close to the village store and took a walk to the lower waterfall to minimize walking you can also take the free shuttle to bus stop it was approx minutes one way through paved but in some parts snowy trail with awesome view especially from the small bridges where the flowing river was partly frozen the fall was greatly pouring and splashing water to the surrounding areas including the bridge despite the warning a few visitors were climbing the rocks trying to get close to the fall
7893,this be a difficult hike because the trail be full of large stone in most place still the reward from the top of the fall be magnificent and well worth do,this is a difficult hike because the trail is full of large stones in most places still the reward from the top of the falls is magnificent and well worth doing


### Removing Standard Stop Words

In [11]:
sklearn_stop_words = sorted(list(ENGLISH_STOP_WORDS))

# Dropping the words below from the stop list (so they stay in the corpus),
# since they can signal negative sentiment that is relevant to reviews
words_to_remove_from_stop_list = ['again', 'against', 'no', 'not']

sklearn_stop_words = [word for word in sklearn_stop_words if word not in words_to_remove_from_stop_list]

df['review_remove_stop_words'] = df.review_lemma.map(lambda x: nlp_preprocessing.remove_stopwords(x, sklearn_stop_words ))

df[['review_remove_stop_words','review_text']].sample(2)

Unnamed: 0,review_remove_stop_words,review_text
10659,wonderful hike snow melt away hike late spring worth bit challenging people not use altitude hiking just time step rail way,"a wonderful hike after the snow melts away. We have done this hike in late spring and it was well worth it. It is a bit challenging for people not used to altitude or hiking, just take your time, it has steps and railing most of the way up."
129,book tour glacier point driver guide collette knoweldgeable entertaining time great view point picnic just time eat picnic explore point coach trip,"WE booked a tour to Glacier Point. Our driver and guide, Collette was very knoweldgeable and entertaining at times!! Great views at the point - we took a picnic but only just had enough time to eat our picnic and explore the point before the coach trip back down."
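For reference, stop word removal on an already cleaned, space-separated review can be sketched as below. This is an assumption about what `nlp_preprocessing.remove_stopwords` does, not its actual code; the name `remove_stopwords_sketch` and the tiny stop list are made up for illustration.

```python
def remove_stopwords_sketch(text, stop_words):
    """Hypothetical version of remove_stopwords: drop tokens found in stop_words."""
    stop_set = set(stop_words)  # set lookup is faster than scanning a list per token
    return ' '.join(word for word in text.split() if word not in stop_set)

# Illustrative stop list; 'not' is deliberately left out so negation survives
example_stop_words = ['this', 'be', 'a', 'but', 'too']
result = remove_stopwords_sketch('this be a difficult hike but not too long', example_stop_words)
print(result)
```

Keeping 'not' out of the stop list is what preserves phrases such as "not long" in the filtered text.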


### Custom Stop Words

In [80]:
# Making a list of custom stop words from the names of the attractions, to remove from the corpus

landmarks = df.attraction_name.unique().tolist()
landmark_string = " ".join(landmarks)
landmark_words = landmark_string.lower().split()

landmark_words.remove('trail')
landmark_words.remove('view')
landmark_words.remove('falls')
landmark_words.remove('fall')

print(landmark_words)

['glacier', 'point', 'yosemite', 'valley', 'mariposa', 'grove', 'of', 'giant', 'sequoias', 'half', 'dome', 'tunnel', 'tioga', 'pass', 'el', 'capitan', 'mist', 'yosemite', 'vernal']
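Note that `list.remove` deletes only the first matching item and raises a `ValueError` if the word is absent, and the list above still contains duplicates such as 'yosemite'. A minimal alternative sketch using sets is shown here; the attraction names are made up for illustration and are not read from the dataframe.

```python
# Hypothetical attraction names standing in for df.attraction_name.unique()
example_landmarks = ['Mist Trail', 'Four Mile Trail', 'Yosemite Falls', 'Tunnel View']

# Generic words that should stay in the corpus
words_to_keep = {'trail', 'view', 'falls', 'fall'}

# Build a de-duplicated set of name words, then subtract the words to keep
name_words = {word for name in example_landmarks for word in name.lower().split()}
landmark_stop_words = sorted(name_words - words_to_keep)
print(landmark_stop_words)
```

The set difference removes every occurrence of a kept word, even when it appears in several attraction names, and never fails on a word that is not present.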


In [84]:
additional_removal_words = ['washburn', 'lower', 'upper', 'worth', 'muir', 'john', 'sentinel', \
                            'columbia',  'nevada', 'vernall', 'park', 'national', 'think', 'want', \
                            'feel', 'thing', 'say', 'year', 'pm', 'bridal' , 'like', 'veil', 'bit', 'san' , \
                            'redwood', 'sequoia', 'wawona', 'toulumne', 'tahoe', 'tenaya', 'lee',\
                           'meadow', 'olmstead', 'pass', 'elcapitan' , 'tuolumne']

       
additional_removal_words.extend(landmark_words)

df['review_remove_additional_words'] = df.review_remove_stop_words.map(
    lambda x: nlp_preprocessing.remove_stopwords(x, additional_removal_words ))

df[['review_text', 'review_remove_additional_words']].sample(2)

Unnamed: 0,review_text,review_remove_additional_words
1661,Incredible views of mountains with a bit of a hike to get there; tough if you get short of breath easily as the elevation is 7000+,incredible view mountain hike tough short breath easily elevation
2306,Only just enough parking so don’t arrive too late if going for the day. Misty trail exillerating. Amazing waterfalls and rainbow in the mist. It is quite strenuous for a 60 year old with lots of steps up to the second waterfall. A poncho could be useful in cooler weather. Ok when it’s warm as the mist cools you down and you soon dry off. Mirror Lake is a mirror at the moment but later in the summer I would think there wouldn’t be enough water to create the mirror effect. The path on the right of the lake going further up the valley was flooded so we had to clamber over boulders and fallen trees on a very indistinct path. Later on that path we came across a bear! We were glad we hadn’t bumped into when clambering. I was surprised there were not many wardens/rangers around or a notice to inform us the path was flooded. Restrooms could be more signposted or even marked on given map. Would also be helpful if they were open!!!Pizza restaurant very reasonably priced.,just parking not arrive late day misty trail exillerate amazing waterfall rainbow quite strenuous old lot step second waterfall poncho useful cooler weather ok warm cool soon dry mirror lake mirror moment later summer not water create mirror effect path right lake far flood clamber boulder fall tree indistinct path later path come bear glad not bump clamber surprised not warden ranger notice inform path flood restroom signposted mark map helpful open pizza restaurant reasonably price


### Outlook Sentiment
Using the ratings to derive a binary outlook sentiment (ratings of 4 and above are 'Positive', the rest 'Negative'), for Logistic Regression & Scattertext plots

In [86]:
df['Outlook_Sentiment'] = df.rating.map(lambda x: 'Positive' if x >= 4 else 'Negative')
df['outlook_sentiment_number'] = df.rating.map(lambda x: 1 if x >= 4 else 0)
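The same thresholding can also be written without a lambda, which makes it easy to check the class balance afterwards. The sketch below uses a small made-up ratings table rather than the review data; only the column names are borrowed from the cell above.

```python
import pandas as pd

# Illustrative ratings only; the real values come from df.rating
ratings = pd.DataFrame({'rating': [5, 4, 3, 1, 5]})

is_positive = ratings['rating'] >= 4                      # boolean mask, True for 4 and 5
ratings['outlook_sentiment_number'] = is_positive.astype(int)
ratings['Outlook_Sentiment'] = is_positive.map({True: 'Positive', False: 'Negative'})

# Share of each class; a heavy skew matters for the later Logistic Regression
class_share = ratings['Outlook_Sentiment'].value_counts(normalize=True)
print(class_share)
```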

In [87]:
df.to_csv("Reviews_cleaned_for_NLP.csv", index = False)

The text has now been preprocessed and is ready for topic modeling.  
Helper functions mentioned in this notebook can be found in the [nlp_preprocessing](nlp_preprocessing.py) python file.

Additional pre-processing steps have been mentioned below for future reference. 

### For future work, the following cleaning can also be undertaken:
1. Letter repeats (tweet tokenizer & Crazy Tokenizer)
1. Spell check
1. Different languages
1. Named entity extraction
1. Identifying actual hyphenated words instead of separating them
1. Removing words that are shorter than 4 letters, but selectively
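As one possible take on the letter-repeats item, the sketch below collapses runs of three or more identical letters down to two using a plain regular expression. It is an illustration rather than the tokenizer-based approach named in the list, and the function name `collapse_repeats` is hypothetical.

```python
import re

# A letter followed by two or more copies of itself, i.e. a run of 3+
REPEAT_PATTERN = re.compile(r'([a-zA-Z])\1{2,}')

def collapse_repeats(text):
    """Shorten any run of 3+ identical letters to exactly two letters."""
    return REPEAT_PATTERN.sub(r'\1\1', text)

collapsed = collapse_repeats('sooooo goooood')
print(collapsed)
```

Collapsing to two letters rather than one keeps legitimate doubles such as "good" intact; a spell check step would still be needed to turn "soo" into "so".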

Exploring the reviews for entities that are present and seeing what can be dropped

In [109]:
sample_review_df = df.loc[[3982],'review_text']
sample_review = ' '.join(sample_review_df.tolist())

sample_review_cleaned = nlp_preprocessing.cleaning(sample_review)

spacy_tokens = spacy_nlp(sample_review_cleaned)
[[token.text, token.lemma_, token.pos_, token.ent_type_] for token in spacy_tokens]

[['the', 'the', 'DET', ''],
 ['mist', 'mist', 'PROPN', ''],
 ['trail', 'trail', 'NOUN', ''],
 ['was', 'be', 'AUX', ''],
 ['an', 'an', 'DET', ''],
 ['adventure', 'adventure', 'NOUN', ''],
 ['we', '-PRON-', 'PRON', ''],
 ['all', 'all', 'DET', ''],
 ['enjoyed', 'enjoy', 'VERB', ''],
 [' ', ' ', 'SPACE', ''],
 ['from', 'from', 'ADP', ''],
 ['ages', 'age', 'NOUN', ''],
 [' ', ' ', 'SPACE', ''],
 ['to', 'to', 'PART', ''],
 ['  ', '  ', 'SPACE', ''],
 ['the', 'the', 'DET', ''],
 ['steps', 'step', 'NOUN', ''],
 ['up', 'up', 'ADV', ''],
 ['were', 'be', 'AUX', ''],
 ['super', 'super', 'ADV', ''],
 ['wet', 'wet', 'ADJ', ''],
 ['and', 'and', 'CCONJ', ''],
 ['a', 'a', 'DET', ''],
 ['bit', 'bit', 'ADV', ''],
 ['scary', 'scary', 'ADJ', ''],
 [' ', ' ', 'SPACE', ''],
 ['but', 'but', 'CCONJ', ''],
 ['with', 'with', 'ADP', ''],
 ['careful', 'careful', 'ADJ', ''],
 ['footing', 'footing', 'NOUN', ''],
 [' ', ' ', 'SPACE', ''],
 ['you', '-PRON-', 'PRON', ''],
 ['can', 'can', 'VERB', ''],
 ['do', 'do', 'AUX

Based on the above, entities labelled 'PERSON', 'LOC', 'GPE', 'FAC' or 'ORG' will be removed.

To get a better idea of the parts of speech (POS) in the corpus, let's look at each word, its lemma, and its POS tag for a couple of random reviews. 
Based on that, we can decide to keep only certain POS types in our corpus.

In [13]:
spacy_all_reviews = spacy_nlp(all_reviews[:10000])
[[ent.text, ent.label_] for ent in spacy_all_reviews.ents]

[['pm', 'TIME'],
 ['today', 'DATE'],
 ['oct', 'DATE'],
 ['yosemite valley', 'LOC'],
 ['several miles', 'QUANTITY'],
 ['yosemite valley', 'LOC'],
 ['washburn', 'PERSON'],
 ['about   mile', 'QUANTITY'],
 ['nevada', 'GPE'],
 ['half', 'CARDINAL'],
 ['four mile', 'QUANTITY'],
 ['yosemite valley', 'LOC'],
 ['only  feet', 'QUANTITY'],
 ['a mile', 'QUANTITY'],
 ['half', 'CARDINAL'],
 ['year', 'DATE'],
 ['october', 'DATE'],
 ['half', 'CARDINAL'],
 ['half', 'CARDINAL'],
 ['a few hours', 'TIME'],
 ['half', 'CARDINAL'],
 ['yosemite valley', 'LOC'],
 ['the vista point', 'ORG'],
 ['half', 'CARDINAL'],
 ['first', 'ORDINAL'],
 ['mid summer', 'DATE'],
 ['nevada', 'GPE'],
 ['the yosemite park', 'LOC'],
 ['a typical season', 'DATE'],
 ['the low season', 'DATE'],
 ['next year', 'DATE'],
 ['one', 'CARDINAL'],
 ['a year or more', 'DATE'],
 ['a couple of miles', 'QUANTITY'],
 ['yosemite valley', 'LOC'],
 ['yosemite valley', 'LOC'],
 ['nevada', 'GPE'],
 ['mid august', 'DATE'],
 ['night', 'TIME'],
 ['nevada', 

Based on perusing the above corpus sample, only 'NOUN', 'VERB', 'ADJ' and 'ADV' will be kept.  
Both entity removal and POS tags can be used to further remove unnecessary words.

In [None]:
%%time

pos =  ['NOUN', 'VERB', 'ADJ', 'ADV']
ents = ['PERSON', 'LOC', 'GPE', 'FAC', 'ORG']

# Assumes spacy_pos_filtering lives in nlp_preprocessing alongside remove_stopwords
df['review_pos_ent_filter'] = df.reviews_basic_clean.map(lambda x: nlp_preprocessing.spacy_pos_filtering(x, pos, ents))
df['review_pos_ent_filter'] = df.review_pos_ent_filter.map(lambda x: nlp_preprocessing.remove_stopwords(x, sklearn_stop_words))
df['review_pos_ent_filter'] = df.review_pos_ent_filter.map(lambda x: nlp_preprocessing.remove_stopwords(x, additional_removal_words))

df[['review_text','review_pos_ent_filter']].sample(2)

# TODO: add a method to account for negative modifiers (not, nothing, etc.) so that meaning is preserved

Since important modifiers were being dropped by the above method, the corpus will not, for now, undergo the further pre-processing defined in the above code block.
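One common way to keep negative modifiers through later filtering is to fuse each negation with the word that follows it, so the pair survives as a single token. The sketch below is only an illustration of that idea and is not part of `nlp_preprocessing`; the function name `join_negations` and the negation list are assumptions.

```python
NEGATIONS = {'not', 'no', 'never'}  # illustrative list, extend as needed

def join_negations(text):
    """Fuse each negation word with the next token, e.g. 'not crowded' -> 'not_crowded'."""
    tokens = text.split()
    joined = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            joined.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # skip the token that was just fused
        else:
            joined.append(tokens[i])
            i += 1
    return ' '.join(joined)

negated = join_negations('trail not crowded but no shade')
print(negated)
```

This would have to run before POS filtering and stop word removal, since both of those steps can drop the negation word on its own.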

### Example Code Snippets for Reference to help with the NLP Pre-processing steps

### Tokenization 

In [138]:
from nltk.tokenize import word_tokenize

my_text = "Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from \
the store. Should I pick up some black-eyed peas as well?"

print(word_tokenize(my_text))

# (N-Grams)

from nltk.util import ngrams
my_words = word_tokenize(my_text) # This is the list of all words
twograms = list(ngrams(my_words,2)) # This is for two-word combos, but can pick any n
print(twograms)

# Regular Expressions

from nltk.tokenize import RegexpTokenizer

# RegexpTokenizer with whitespace delimiter
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
print(whitespace_tokenizer.tokenize(my_text))

# RegexpTokenizer to match only capitalized words
cap_tokenizer = RegexpTokenizer(r"[A-Z]['\w]+")
print(cap_tokenizer.tokenize(my_text))

from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
regexp_tokenize(s, pattern=r'\w+|\$[\d\.]+|\S+')

wordpunct_tokenize(s)

blankline_tokenize(s)


['Hi', 'Mr.', 'Smith', '!', 'I', '’', 'm', 'going', 'to', 'buy', 'some', 'vegetables', '(', 'tomatoes', 'and', 'cucumbers', ')', 'from', 'the', 'store', '.', 'Should', 'I', 'pick', 'up', 'some', 'black-eyed', 'peas', 'as', 'well', '?']
[('Hi', 'Mr.'), ('Mr.', 'Smith'), ('Smith', '!'), ('!', 'I'), ('I', '’'), ('’', 'm'), ('m', 'going'), ('going', 'to'), ('to', 'buy'), ('buy', 'some'), ('some', 'vegetables'), ('vegetables', '('), ('(', 'tomatoes'), ('tomatoes', 'and'), ('and', 'cucumbers'), ('cucumbers'

### Preprocessing: Stop Words

In [137]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
set(stopwords.words('english'))

#Example impact with code

my_text = ["Hi Mr. Smith! I’m going to buy some vegetables (tomatoes and cucumbers) from \
the store. Should I pick up some black-eyed peas as well?"]

# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(my_text)
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
pd.DataFrame(X.toarray(), columns=cv.get_feature_names())

### POS Tagging With NLTK

In [128]:
from nltk.tag import pos_tag
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text))
print(tokens)

# For help with the tag codes, run the line below (requires `import nltk`)
# nltk.help.upenn_tagset()

[('James', 'NNP'), ('Smith', 'NNP'), ('lives', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]


### Named Entity Recognition

In [132]:
from nltk.chunk import ne_chunk
my_text = "James Smith lives in the United States."
tokens = pos_tag(word_tokenize(my_text)) # this labels each word as a part of speech
entities = ne_chunk(tokens) # this extracts entities from the list of words
# help(entities)

### Compound Term Extraction

In [21]:
from nltk.tokenize import MWETokenizer # multi-word expression
my_text = "You all are the greatest students of all time."
mwe_tokenizer = MWETokenizer([('You','all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))
' '.join(mwe_tokens)

'You_all are the greatest students of_all_time .'