## Table of Contents

1. [Data Description](#section1)<br>
2. [Data Loading](#section2)<br>    
3. [Remove Punctuation and StopWords from Text](#section3)<br>
4. [Train Model](#section4)<br>
    - 4.1 [Train Test Split](#section401)<br>
    - 4.2 [Create CountVectorizer Object](#section402)<br>
    - 4.3 [Build MultinomialNB Model](#section403)<br>
    - 4.4 [Display Confusion Matrix](#section404)<br>
5. [Display HMM POS Tagging](#section5)<br>
6. [Use Viterbi Parser](#section6)<br>


In [1]:
## Organizing Imports
import pandas as pd
import string
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
import re
from nltk.tokenize import word_tokenize
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report 
import spacy
# None shows the full column text; passing -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

## 1. Data Description

- Each observation in this dataset is a review of a particular business by a particular user.
- The "stars" column is the number of stars (1 through 5) that the reviewer assigned to the business; more stars means a better rating.

## 2. Data Loading 
- Read the yelp.csv file into a DataFrame called yelp.
- Inspect yelp with the head, info, and describe methods.

In [2]:
yelp=pd.read_csv('yelp.csv')

In [3]:
yelp.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,"My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best ""toast"" I've ever had.\n\nAnyway, I can't wait to go back!",review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small ""Here's The Beef"" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11"" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that's how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that you have to try these things yourself because all these bad reviewers have some serious issues.",review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
2,6oRAC4uyJCsJl1X0WZpVSA,2012-06-14,IESLBzqUCLdSzSqm0eCSxQ,4,love the gyro plate. Rice is so good and I also dig their candy selection :),review,0hT2KtfLiobPvh6cDC8JQg,0,1,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!! It's very convenient and surrounded by a lot of paths, a desert xeriscape, baseball fields, ballparks, and a lake with ducks.\n\nThe Scottsdale Park and Rec Dept. does a wonderful job of keeping the park clean and shaded. You can find trash cans and poopy-pick up mitts located all over the park and paths.\n\nThe fenced in area is huge to let the dogs run, play, and sniff!",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,"General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... ""Mistakes are inevitable, it's how we recover from them that is important""!!!\n\nThanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)",review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0


In [4]:
yelp.shape

(10000, 10)

In [5]:
yelp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
business_id    10000 non-null object
date           10000 non-null object
review_id      10000 non-null object
stars          10000 non-null int64
text           10000 non-null object
type           10000 non-null object
user_id        10000 non-null object
cool           10000 non-null int64
useful         10000 non-null int64
funny          10000 non-null int64
dtypes: int64(4), object(6)
memory usage: 781.3+ KB


In [6]:
yelp.describe()

Unnamed: 0,stars,cool,useful,funny
count,10000.0,10000.0,10000.0,10000.0
mean,3.7775,0.8768,1.4093,0.7013
std,1.214636,2.067861,2.336647,1.907942
min,1.0,0.0,0.0,0.0
25%,3.0,0.0,0.0,0.0
50%,4.0,0.0,1.0,0.0
75%,5.0,1.0,2.0,1.0
max,5.0,77.0,76.0,57.0


### There are 10,000 observations in the dataset with 10 columns

## 3. Remove Punctuation and StopWords from Text
- Remove punctuation and stopwords from the ‘text’ column

In [7]:
## Data Before removing punctuations and StopWords from text column

In [8]:
yelp[['text','user_id']][0:2]

Unnamed: 0,text,user_id
0,"My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best ""toast"" I've ever had.\n\nAnyway, I can't wait to go back!",rLtl8ZkDX5vH5nAx9C3q5Q
1,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small ""Here's The Beef"" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11"" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that's how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. That goes to show you that you have to try these things yourself because all these bad reviewers have some serious issues.",0a2KyEL0d3Yb1V6aivbIuQ


In [9]:
stop_words = stopwords.words('english')
stop_words[0:10]


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [10]:
## Creating a function to remove punctuation and stopwords
def remove_punctuation_stopwords(sentence):
    # lowercase, then keep only runs of word characters (this drops punctuation)
    sentence = sentence.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    # a set gives constant-time membership checks
    stop_words = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stop_words]
    return " ".join(filtered_words)
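To see what the `\w+` pattern does to contractions and hyphenated words, here is a minimal standard-library sketch. It assumes `re.findall` as a stand-in for `RegexpTokenizer(r'\w+')`, and it uses a small hand-picked stopword set (`SMALL_STOP_WORDS`, a hypothetical name) instead of the full NLTK list.

```python
import re

# Hypothetical, hand-picked subset of stopwords for illustration only;
# the notebook itself uses the full NLTK English list.
SMALL_STOP_WORDS = {"i", "ve", "it", "was", "the", "on", "a"}

def clean(sentence):
    # Lowercase, keep runs of word characters, then drop stopwords.
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(t for t in tokens if t not in SMALL_STOP_WORDS)

# The apostrophe is not a word character, so the contraction is split in two.
print(re.findall(r"\w+", "I've".lower()))
print(clean("It was the best I've ever had on a semi-busy Saturday!"))
```

The same splitting explains why `semi busy` shows up as two words in `new_text` and why fragments of contractions vanish once they match a stopword.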

In [11]:
## Creating a new column called new_text which contains text after removing punctuation and stopwords
yelp['new_text'] = yelp['text'].apply(remove_punctuation_stopwords)

In [12]:
## Check the new_text column after punctuation removal 
yelp[['text','new_text']][0:2]

Unnamed: 0,text,new_text
0,"My wife took me here on my birthday for breakfast and it was excellent. The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure. Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning. It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary. It was phenomenal and simply the best I've ever had. I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it. It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious. It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete. It was the best ""toast"" I've ever had.\n\nAnyway, I can't wait to go back!",wife took birthday breakfast excellent weather perfect made sitting outside overlooking grounds absolute pleasure waitress excellent food arrived quickly semi busy saturday morning looked like place fills pretty quickly earlier get better favor get bloody mary phenomenal simply best ever pretty sure use ingredients garden blend fresh order amazing everything menu looks excellent white truffle scrambled eggs vegetable skillet tasty delicious came 2 pieces griddled bread amazing absolutely made meal complete best toast ever anyway wait go back
1,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.\n\nIn any case, my friend and I arrived at about 5:50 PM this past Sunday. It was pretty crowded, more than I thought for a Sunday evening and thought we would have to wait forever to get a seat but they said we'll be seated when the girl comes back from seating someone else. We were seated at 5:52 and the waiter came and got our drink orders. Everyone was very pleasant from the host that seated us to the waiter to the server. The prices were very good as well. We placed our orders once we decided what we wanted at 6:02. We shared the baked spaghetti calzone and the small ""Here's The Beef"" pizza so we can both try them. The calzone was huge and we got the smallest one (personal) and got the small 11"" pizza. Both were awesome! My friend liked the pizza better and I liked the calzone better. The calzone does have a sweetish sauce but that's how I like my sauce!\n\nWe had to box part of the pizza to take it home and we were out the door by 6:42. So, everything was great and not like these bad reviewers. 
That goes to show you that you have to try these things yourself because all these bad reviewers have some serious issues.",idea people give bad reviews place goes show please everyone probably griping something fault many people like case friend arrived 5 50 pm past sunday pretty crowded thought sunday evening thought would wait forever get seat said seated girl comes back seating someone else seated 5 52 waiter came got drink orders everyone pleasant host seated us waiter server prices good well placed orders decided wanted 6 02 shared baked spaghetti calzone small beef pizza try calzone huge got smallest one personal got small 11 pizza awesome friend liked pizza better liked calzone better calzone sweetish sauce like sauce box part pizza take home door 6 42 everything great like bad reviewers goes show try things bad reviewers serious issues


## 4. Train Model
- Create two objects X and y.
- X will be the cleaned 'new_text' column of the yelp dataframe and y will be the 'stars' column.
- Split the data into training and testing sets, then create a CountVectorizer object.
- Train a MultinomialNB model.
- Display the confusion matrix.

In [13]:
X= yelp.new_text
y= yelp.stars

In [14]:
# check the shapes of X and y
print('X dimensionality', X.shape)
print('y dimensionality', y.shape)

X dimensionality (10000,)
y dimensionality (10000,)


In [15]:
# examine the class distribution
yelp.stars.value_counts()

4    3526
5    3337
3    1461
2    927 
1    749 
Name: stars, dtype: int64

In [16]:
## Examining classification counts: balanced or imbalanced?

In [17]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,6))
yelp.groupby('stars').new_text.count().plot.bar(ylim=0)
plt.show()

<Figure size 800x600 with 1 Axes>

### 4.1 Train Test Split

In [18]:
# split X and y into training and testing sets; by default, 75% goes to training and 25% to testing
# random_state=1 for reproducibility
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500,)
(2500,)
(7500,)
(2500,)


### Note: 
-  We do the train/test split before fitting the CountVectorizer so that the vocabulary is learned from the training data only. This simulates the real world, where future data contains words we have not seen before.
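The effect of fitting on the training split only can be sketched in plain Python. This is a conceptual illustration, not scikit-learn's implementation; `fit_vocabulary` and `transform` are hypothetical helper names, and the documents are made up.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    # Map each distinct training word to a column index (sorted for stability).
    words = sorted({w for doc in train_docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocabulary):
    # Count only words that are in the fitted vocabulary; unseen words are dropped.
    rows = []
    for doc in docs:
        counts = Counter(w for w in doc.split() if w in vocabulary)
        rows.append([counts[w] for w in vocabulary])
    return rows

train_docs = ["good food good service", "bad service"]
test_docs = ["good pizza"]  # "pizza" never appears in the training documents

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

The test row has the same number of columns as the training rows, and the unseen word simply contributes nothing, which is the situation a deployed model would face.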

### 4.2 Create CountVectorizer Object

In [19]:
# 1. import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# 2. instantiate CountVectorizer with a token pattern that also keeps single-character tokens
vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')

In [20]:
# 3. fit and transform training data
X_train_dtm = vect.fit_transform(X_train)

In [21]:
# examine the training document-term matrix
X_train_dtm

<7500x25684 sparse matrix of type '<class 'numpy.int64'>'
	with 425424 stored elements in Compressed Sparse Row format>

In [22]:
# 4. transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<2500x25684 sparse matrix of type '<class 'numpy.int64'>'
	with 134955 stored elements in Compressed Sparse Row format>

### 4.3 Build MultinomialNB Model

In [23]:
# Instantiate Multinomial Naive Bayes Model
mnb = MultinomialNB()

In [24]:
# Train the Model 
%time mnb.fit(X_train_dtm, y_train)

CPU times: user 18.7 ms, sys: 3.29 ms, total: 22 ms
Wall time: 26.2 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [25]:
# make class predictions for X_test_dtm
y_pred_class = mnb.predict(X_test_dtm)

In [26]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.4672

In [27]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 58,  20,  23,  56,  28],
       [ 23,  18,  37, 133,  23],
       [  5,   8,  37, 273,  42],
       [  4,   1,  20, 633, 226],
       [  7,   4,   7, 392, 422]])

### Note: The bar plot above shows that the classes are imbalanced, so let's try resampling the training data with SMOTE
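As a rough intuition for SMOTE, each synthetic sample is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is a simplified standard-library illustration of that interpolation step, not the imbalanced-learn implementation; `smote_like_sample` and the 2-D points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    # Create one synthetic point between a random minority sample and
    # one of its k nearest minority neighbours (squared Euclidean distance).
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical 2-D minority-class points, for illustration only.
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [5.0, 5.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print(point)
```

Because the new points are interpolated, their coordinates are generally fractional, so resampled word-count vectors need not contain whole numbers.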

In [28]:
from imblearn.over_sampling import SMOTE 

Using TensorFlow backend.


In [29]:
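sm = SMOTE(random_state=2)
# fit_resample replaces fit_sample, which is no longer available in recent imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X_train_dtm, y_train.to_numpy())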

In [30]:

for label in range(1, 6):
    print("After OverSampling, counts of label '{}': {}".format(label, sum(y_train_res == label)))

After OverSampling, counts of label '1': 2642
After OverSampling, counts of label '2': 2642
After OverSampling, counts of label '3': 2642
After OverSampling, counts of label '4': 2642
After OverSampling, counts of label '5': 2642


In [31]:
mnb.fit(X_train_res, y_train_res.ravel())

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [32]:
predictions = mnb.predict(X_test_dtm) 


In [33]:
from sklearn.metrics import confusion_matrix, classification_report 

print(classification_report(y_test, predictions)) 


              precision    recall  f1-score   support

           1       0.47      0.41      0.44       185
           2       0.39      0.21      0.28       234
           3       0.32      0.21      0.25       365
           4       0.46      0.66      0.54       884
           5       0.58      0.50      0.54       832

    accuracy                           0.48      2500
   macro avg       0.45      0.40      0.41      2500
weighted avg       0.48      0.48      0.47      2500



In [34]:
metrics.confusion_matrix(y_test, predictions)

array([[ 76,  37,  19,  33,  20],
       [ 39,  50,  52,  73,  20],
       [ 15,  19,  75, 215,  41],
       [ 11,  11,  60, 582, 220],
       [ 21,  10,  26, 358, 417]])

In [35]:
metrics.accuracy_score(y_test, predictions)

0.48

### Conclusion : 
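- Resampling with SMOTE changed overall accuracy only slightly (0.4672 without SMOTE, 0.48 with it). The confusion matrices show more correct predictions for 1-, 2- and 3-star reviews (58 to 76, 18 to 50, 37 to 75) and fewer for 4- and 5-star reviews (633 to 582, 422 to 417), so MultinomialNB performance did not improve much overall in this case.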
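The comparison can be checked directly from the two confusion matrices printed above. The sketch below copies them as plain lists and recomputes accuracy and per-class recall; `accuracy` and `recall_per_class` are helper names introduced here for illustration.

```python
# Confusion matrices copied from the outputs above
# (rows = true stars 1..5, columns = predicted stars 1..5).
cm_plain = [[58, 20, 23, 56, 28],
            [23, 18, 37, 133, 23],
            [5, 8, 37, 273, 42],
            [4, 1, 20, 633, 226],
            [7, 4, 7, 392, 422]]
cm_smote = [[76, 37, 19, 33, 20],
            [39, 50, 52, 73, 20],
            [15, 19, 75, 215, 41],
            [11, 11, 60, 582, 220],
            [21, 10, 26, 358, 417]]

def accuracy(cm):
    # Correct predictions sit on the diagonal.
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def recall_per_class(cm):
    # Recall = correct predictions for a class / all true members of that class.
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

for name, cm in [("without SMOTE", cm_plain), ("with SMOTE", cm_smote)]:
    recalls = ", ".join("{:.2f}".format(r) for r in recall_per_class(cm))
    print("{}: accuracy={:.4f}, recall per star=[{}]".format(name, accuracy(cm), recalls))
```

Looking at recall per class rather than accuracy alone makes the trade-off visible: the minority classes gain while the two majority classes lose a little.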

## 5. Display HMM POS Tagging

In [36]:
nlp=spacy.load('en_core_web_sm')

In [37]:
def convert_to_pos_tags(text_value):
    doc = nlp(text_value)
    pos_tag_list=[]
    # iterating over each text and then adding it to a list
    for token in doc:
        pos_tag_list.append((token.text,token.pos_))
    return pos_tag_list

### 5.1 Display the HMM POS tagging on the first 4 rows of ‘text’

In [38]:
tags_pos=yelp['new_text'][0:4].apply(lambda x:convert_to_pos_tags(x))

In [39]:
## Displaying the data as pandas dataframe
## Note: the text after the stop words and punctuation removal is used here 
pd.DataFrame(tags_pos)

Unnamed: 0,new_text
0,"[(wife, NOUN), (took, VERB), (birthday, NOUN), (breakfast, NOUN), (excellent, ADJ), (weather, NOUN), (perfect, ADJ), (made, VERB), (sitting, VERB), (outside, ADP), (overlooking, VERB), (grounds, NOUN), (absolute, ADJ), (pleasure, NOUN), (waitress, NOUN), (excellent, ADJ), (food, NOUN), (arrived, VERB), (quickly, ADV), (semi, ADV), (busy, ADJ), (saturday, PROPN), (morning, NOUN), (looked, VERB), (like, SCONJ), (place, NOUN), (fills, NOUN), (pretty, ADV), (quickly, ADV), (earlier, ADV), (get, VERB), (better, ADJ), (favor, NOUN), (get, AUX), (bloody, ADJ), (mary, PROPN), (phenomenal, PROPN), (simply, ADV), (best, ADV), (ever, ADV), (pretty, ADV), (sure, ADJ), (use, VERB), (ingredients, NOUN), (garden, VERB), (blend, NOUN), (fresh, ADJ), (order, NOUN), (amazing, ADJ), (everything, PRON), (menu, NOUN), (looks, VERB), (excellent, ADJ), (white, ADJ), (truffle, NOUN), (scrambled, VERB), (eggs, PROPN), (vegetable, NOUN), (skillet, NOUN), (tasty, ADJ), (delicious, PROPN), (came, VERB), (2, NUM), (pieces, NOUN), (griddled, VERB), (bread, NOUN), (amazing, ADJ), (absolutely, ADV), (made, VERB), (meal, NOUN), (complete, ADJ), (best, ADJ), (toast, NOUN), (ever, ADV), (anyway, INTJ), (wait, VERB), (go, VERB), (back, ADV)]"
1,"[(idea, NOUN), (people, NOUN), (give, VERB), (bad, ADJ), (reviews, NOUN), (place, NOUN), (goes, VERB), (show, NOUN), (please, INTJ), (everyone, PRON), (probably, ADV), (griping, VERB), (something, PRON), (fault, VERB), (many, ADJ), (people, NOUN), (like, SCONJ), (case, NOUN), (friend, NOUN), (arrived, VERB), (5, NUM), (50, NUM), (pm, NOUN), (past, ADP), (sunday, PROPN), (pretty, ADV), (crowded, VERB), (thought, NOUN), (sunday, PROPN), (evening, NOUN), (thought, VERB), (would, VERB), (wait, VERB), (forever, ADV), (get, VERB), (seat, NOUN), (said, VERB), (seated, ADJ), (girl, NOUN), (comes, VERB), (back, ADV), (seating, VERB), (someone, PRON), (else, ADV), (seated, VERB), (5, NUM), (52, NUM), (waiter, NOUN), (came, VERB), (got, VERB), (drink, NOUN), (orders, NOUN), (everyone, PRON), (pleasant, ADJ), (host, NOUN), (seated, VERB), (us, PRON), (waiter, ADJ), (server, NOUN), (prices, NOUN), (good, ADJ), (well, ADV), (placed, VERB), (orders, NOUN), (decided, VERB), (wanted, VERB), (6, NUM), (02, NUM), (shared, VERB), (baked, VERB), (spaghetti, NOUN), (calzone, PROPN), (small, ADJ), (beef, NOUN), (pizza, NOUN), (try, VERB), (calzone, PROPN), (huge, PROPN), (got, VERB), (smallest, ADJ), (one, NUM), (personal, NOUN), (got, VERB), (small, ADJ), (11, NUM), (pizza, NOUN), (awesome, ADJ), (friend, NOUN), (liked, VERB), (pizza, NOUN), (better, ADV), (liked, VERB), (calzone, NOUN), (better, ADJ), (calzone, PROPN), (sweetish, ADJ), (sauce, NOUN), (like, SCONJ), (sauce, NOUN), (box, NOUN), ...]"
2,"[(love, NOUN), (gyro, PROPN), (plate, VERB), (rice, PROPN), (good, ADJ), (also, ADV), (dig, VERB), (candy, NOUN), (selection, NOUN)]"
3,"[(rosie, PROPN), (dakota, PROPN), (love, PROPN), (chaparral, PROPN), (dog, PROPN), (park, PROPN), (convenient, NOUN), (surrounded, VERB), (lot, NOUN), (paths, NOUN), (desert, NOUN), (xeriscape, PROPN), (baseball, NOUN), (fields, NOUN), (ballparks, NOUN), (lake, PROPN), (ducks, PROPN), (scottsdale, PROPN), (park, PROPN), (rec, PROPN), (dept, NOUN), (wonderful, ADJ), (job, NOUN), (keeping, VERB), (park, NOUN), (clean, ADJ), (shaded, ADJ), (find, VERB), (trash, NOUN), (cans, NOUN), (poopy, NOUN), (pick, NOUN), (mitts, PROPN), (located, VERB), (park, NOUN), (paths, NOUN), (fenced, VERB), (area, NOUN), (huge, ADJ), (let, VERB), (dogs, NOUN), (run, VERB), (play, VERB), (sniff, NOUN)]"
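The tags above come from spaCy's pretrained pipeline rather than from an explicit hidden Markov model. For readers who want to see how an HMM tagger chooses a tag sequence, here is a minimal Viterbi decoding sketch in plain Python. The two-tag model and all probabilities are made up for illustration and are not estimated from the Yelp data.

```python
# Toy HMM with made-up probabilities (hypothetical, for illustration only).
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.7, "VERB": 0.3}
trans_p = {"NOUN": {"NOUN": 0.4, "VERB": 0.6},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"wife": 0.5, "took": 0.1, "breakfast": 0.4},
          "VERB": {"wife": 0.1, "took": 0.8, "breakfast": 0.1}}

def viterbi(words):
    # best[t][s] = (probability of the best tag path ending in s at word t, previous tag)
    best = [{s: (start_p[s] * emit_p[s][words[0]], None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][word], p) for p in states
            )
            column[s] = (prob, prev)
        best.append(column)
    # Backtrack from the most probable final tag.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["wife", "took", "breakfast"]))
```

A real HMM tagger would estimate the start, transition, and emission probabilities from a tagged corpus and would need smoothing for words it has never seen.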


## 6. Use Viterbi Parser to Parse rows

- Parse the first 4 rows of ‘text’ using the Viterbi parser
- Use toy_pcfg1 and toy_pcfg2 to get the probabilistic context-free grammars
- Use the PCFG suitable for each sentence
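Before using NLTK's parser, it may help to see the idea behind Viterbi parsing: for every span of the sentence, keep only the most probable subtree per nonterminal. The sketch below is a small plain-Python CKY-style version under strong assumptions: a hypothetical toy grammar in Chomsky normal form with made-up probabilities. It is not NLTK's `ViterbiParser`.

```python
# Hypothetical toy PCFG in Chomsky normal form (probabilities are made up).
lexical = {            # A -> 'word' [p]
    "the": [("Det", 0.5)], "a": [("Det", 0.5)],
    "boy": [("N", 0.5)], "cookie": [("N", 0.5)],
    "saw": [("V", 1.0)],
}
binary = [             # A -> B C [p]
    ("S", "NP", "VP", 1.0),
    ("NP", "Det", "N", 1.0),
    ("VP", "V", "NP", 1.0),
]

def viterbi_parse(tokens):
    n = len(tokens)
    # chart[(i, j)][A] = (best probability, tree) for A spanning tokens[i:j]
    chart = {}
    for i, word in enumerate(tokens):
        if word not in lexical:
            raise ValueError("Grammar does not cover: " + word)
        chart[(i, i + 1)] = {tag: (p, (tag, word)) for tag, p in lexical[word]}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                for lhs, left, right, p in binary:
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        lp, ltree = chart[(i, k)][left]
                        rp, rtree = chart[(k, j)][right]
                        prob = p * lp * rp
                        # Keep only the most probable analysis per nonterminal.
                        if prob > cell.get(lhs, (0.0, None))[0]:
                            cell[lhs] = (prob, (lhs, ltree, rtree))
            chart[(i, j)] = cell
    return chart[(0, n)].get("S")  # None if no full parse exists

prob, tree = viterbi_parse("the boy saw a cookie".split())
print(tree)
print(prob)
```

As in this sketch, a word without any lexical rule makes parsing impossible, which is why grammar coverage matters in the cells that follow.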

In [40]:
import sys, time
from nltk import tokenize
from nltk.parse import ViterbiParser
from nltk.grammar import toy_pcfg1, toy_pcfg2
from nltk.corpus import treebank

In [41]:
# Define two demos.  Each demo has a sentence and a grammar.
demos = [
    ("I was busy on my wife birthday", toy_pcfg1),
    ("the boy saw a cookie under my table", toy_pcfg2),
    ]

In [42]:
demos

[('I was busy on my wife birthday', <Grammar with 17 productions>),
 ('the boy saw a cookie under my table', <Grammar with 23 productions>)]

In [89]:
import nltk
from nltk.grammar import Nonterminal
from nltk.corpus import treebank
# load and view training data
training_set = treebank.parsed_sents()
print(training_set[1])

(S
  (NP-SBJ (NNP Mr.) (NNP Vinken))
  (VP
    (VBZ is)
    (NP-PRD
      (NP (NN chairman))
      (PP
        (IN of)
        (NP
          (NP (NNP Elsevier) (NNP N.V.))
          (, ,)
          (NP (DT the) (NNP Dutch) (VBG publishing) (NN group))))))
  (. .))


In [90]:
# extract the productions for all annotated training sentences
treebank_productions = list(
    set(production
        for sent in training_set
        for production in sent.productions())
)
# view some production rules
treebank_productions[0:10]

[CD -> '5.435',
 NN -> 'Safety',
 NP-SBJ -> PRP$ NNP CC NNP NN,
 VB -> 'total',
 VP -> VBZ PP-MNR-CLR SBAR-PRP,
 VP -> VBZ ADVP PP-CLR,
 NNPS -> 'Manufacturers',
 NNP -> 'F.H.',
 VBN -> 'designated',
 ADVP-CLR -> RB NP]

In [91]:
# add productions for each word, POS tag
for word, tag in treebank.tagged_words():
    t = nltk.Tree.fromstring("("+ tag + " " + word  +")")
    for production in t.productions():
        treebank_productions.append(production)

# build the PCFG based grammar
treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'), treebank_productions)

In [92]:
# build the parser
viterbi_parser = nltk.ViterbiParser(treebank_grammar)
# get sample sentence tokens
tokens = nltk.word_tokenize(yelp['new_text'][0])
# get parse tree for sample sentence
try:
    result = list(viterbi_parser.parse(tokens))
except ValueError:
    # raised when a token has no lexical rule in the grammar
    print('ValueError: Grammar does not cover some of the input words:' )

ValueError: Grammar does not cover some of the input words:


- #### Note: If a word is not covered by the grammar, the parser raises a ValueError, as shown above

In [111]:
#### Since the parser raises a ValueError, let's first add the unknown words to the grammar
### The example below does this for the 1st text

In [98]:
tagged_sent = nltk.pos_tag(nltk.word_tokenize(yelp['new_text'][0]))
print(tagged_sent)

[('wife', 'NN'), ('took', 'VBD'), ('birthday', 'JJ'), ('breakfast', 'NN'), ('excellent', 'NN'), ('weather', 'NN'), ('perfect', 'NN'), ('made', 'VBD'), ('sitting', 'VBG'), ('outside', 'JJ'), ('overlooking', 'VBG'), ('grounds', 'NNS'), ('absolute', 'JJ'), ('pleasure', 'NN'), ('waitress', 'NN'), ('excellent', 'JJ'), ('food', 'NN'), ('arrived', 'VBD'), ('quickly', 'RB'), ('semi', 'JJ'), ('busy', 'JJ'), ('saturday', 'JJ'), ('morning', 'NN'), ('looked', 'VBD'), ('like', 'IN'), ('place', 'NN'), ('fills', 'NNS'), ('pretty', 'RB'), ('quickly', 'RB'), ('earlier', 'RBR'), ('get', 'VB'), ('better', 'JJR'), ('favor', 'NN'), ('get', 'VB'), ('bloody', 'JJ'), ('mary', 'JJ'), ('phenomenal', 'NN'), ('simply', 'RB'), ('best', 'JJS'), ('ever', 'RB'), ('pretty', 'RB'), ('sure', 'JJ'), ('use', 'NN'), ('ingredients', 'NNS'), ('garden', 'JJ'), ('blend', 'VBP'), ('fresh', 'JJ'), ('order', 'NN'), ('amazing', 'VBG'), ('everything', 'NN'), ('menu', 'NN'), ('looks', 'VBZ'), ('excellent', 'JJ'), ('white', 'JJ'), ('

In [99]:
# extend productions for sample sentence tokens
for word,tag in tagged_sent:
    t = nltk.Tree.fromstring("("+ tag + " " + word+ ")")
    for production in t.productions():
        treebank_productions.append(production)

In [100]:
# rebuild grammar
treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'),treebank_productions)

In [101]:
# rebuild parser
viterbi_parser = nltk.ViterbiParser(treebank_grammar)
# get parse tree for sample sentence


In [102]:
tokens =nltk.word_tokenize(yelp['new_text'][0])
tokens[0:10]

['wife',
 'took',
 'birthday',
 'breakfast',
 'excellent',
 'weather',
 'perfect',
 'made',
 'sitting',
 'outside']

### Test Case: Printing the PCFG parse for the first 10 tokens of the 1st review text

In [103]:
#result = list(viterbi_parser.parse(tokens))
# print parse tree
for t in viterbi_parser.parse(tokens[0:10]):
    print(t)

(S
  (NP-SBJ-27 (NN wife))
  (VP
    (VBD took)
    (PRT (JJ birthday))
    (NP
      (NN breakfast)
      (NN excellent)
      (NN weather)
      (NN perfect))
    (PP (VBN made) (PP (VBG sitting) (PP (IN outside)))))) (p=8.28183e-49)


In [109]:
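# NOTE: final_result is shared across calls, so parses accumulate and
# every call returns the same, growing list
final_result=[]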
def parsing_the_sentence(review_text):
    tagged_sent = nltk.pos_tag(nltk.word_tokenize(review_text))
    # extend productions for sample sentence tokens
    for word,tag in tagged_sent:
        t = nltk.Tree.fromstring("("+ tag + " " + word+ ")")
        for production in t.productions():
            treebank_productions.append(production)

    # rebuild grammar
    treebank_grammar = nltk.grammar.induce_pcfg(Nonterminal('S'),treebank_productions)
    # rebuild parser
    viterbi_parser = nltk.ViterbiParser(treebank_grammar)
    tokens =nltk.word_tokenize(review_text)
    for t in viterbi_parser.parse(tokens):
        final_result.append(t)
    return final_result

In [94]:
def get_subset_data():
    '''
    Split each text and keep only its first 10 words
    so that Viterbi parsing stays tractable (rows 0 to 4 of new_text).
    '''
    yelp_new_data=[]
    for x in yelp['new_text'][0:5]:
        first_words=x.split(" ")[0:10]
        yelp_new_data.append(" ".join(first_words))
    return yelp_new_data


In [95]:
yelp_new_data=get_subset_data()

#### Parse the first 4 rows of ‘text’ using Viterbi Parser 
- Note: Only the first 10 words of each row's text are used here, because parsing the full text takes a very long time on a local system

In [96]:
new_data_with_text=pd.DataFrame(yelp_new_data,columns=['text_subset'])
new_data_with_text

Unnamed: 0,text_subset
0,wife took birthday breakfast excellent weather perfect made sitting outside
1,idea people give bad reviews place goes show please everyone
2,love gyro plate rice good also dig candy selection
3,rosie dakota love chaparral dog park convenient surrounded lot paths
4,general manager scott petello good egg go detail let assure


In [110]:
result=new_data_with_text['text_subset'].apply(lambda x:parsing_the_sentence(x))

### Printing the subset of the result for the first 4 words

In [108]:
print(result)

0    [[[(NN wife) (p=0.000441585)], [(VBD took) (p=0.00923852), (PRT (JJ birthday)) (p=0.000105028), (NP\n  (NN breakfast)\n  (NN excellent)\n  (NN weather)\n  (NN perfect)) (p=7.14244e-18), (PP (VBN made) (PP (VBG sitting) (PP (IN outside)))) (p=3.31147e-14)]], [[(NN idea) (p=0.000883002), (NNS people) (p=0.00786981)], [(VB give) (p=0.00492005), (NP-CLR (JJ bad) (NNS reviews)) (p=1.19064e-07), (S-CLR\n  (NP-SBJ-1 (NN place))\n  (VP\n    (VBZ goes)\n    (ADJP-PRD (VB show))\n    (NP-TMP (IN please) (NN everyone)))) (p=1.58766e-25)]], [[(NP (NN love) (NN gyro) (NN plate) (NN rice)) (p=4.24074e-20), (ADJP (JJ good) (RB also)) (p=2.26861e-06)], [(VBZ dig) (p=0.00040833), (NP-TMP (NN candy) (NN selection)) (p=9.46313e-11)]], [[(NN rosie) (p=6.30199e-05), (NN dakota) (p=6.30199e-05), (NN love) (p=6.30199e-05)], [(JJ chaparral) (p=0.000131251), (NP-TMP (NN dog)) (p=1.50047e-06), (PP\n  (NN park)\n  (NP\n    (NN convenient)\n    (VBD surrounded)\n    (NN lot)\n    (NNS paths))) (p=6.42294e-24