### Group Members:
    
 - Aiswarya S Parvathy
 - Vengadesh S
 - Nipun Gupta   

### Objective: 

Build a model that predicts whether a restaurant review is positive or negative.

### Importing pandas

In [1]:
import pandas as pd

Reading the CSV dataset into a pandas DataFrame

In [67]:
Rest_rev = pd.read_csv('RestaurantReview.csv')

In [68]:
Rest_rev.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


Checking details of the dataframe

In [69]:
Rest_rev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 Review    1000 non-null object
Liked      1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 15.7+ KB


In [70]:
Rest_rev.columns

Index([' Review', 'Liked'], dtype='object')

The column name **Review** has a leading space, so we rename the column to remove it

In [71]:
Rest_rev = Rest_rev.rename(columns={' Review':'Review'})

In [72]:
Rest_rev.columns

Index(['Review', 'Liked'], dtype='object')

The leading space has been removed from the **Review** column name
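
When more than one column name might carry stray whitespace, renaming each column by hand becomes tedious. A minimal sketch, assuming a small made-up CSV in place of `RestaurantReview.csv`, that strips whitespace from every column name in one step:

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the leading space in the header.
sample_csv = io.StringIO(
    " Review,Liked\n"
    "Wow... Loved this place.,1\n"
    "Crust is not good.,0\n"
)
sample = pd.read_csv(sample_csv)

# Strip leading/trailing whitespace from all column names at once.
sample.columns = sample.columns.str.strip()
print(sample.columns.tolist())
```

The same line, `Rest_rev.columns = Rest_rev.columns.str.strip()`, could be used in place of the explicit `rename` call.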

In [73]:
Rest_rev.describe()

| | Liked |
|---|---|
| count | 1000.0 |
| mean | 0.5 |
| std | 0.50025 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.5 |
| 75% | 1.0 |
| max | 1.0 |


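The output of **describe()** covers only the numerical column, not the text column. Since **Liked** only takes the values 0 and 1, its mean of 0.5 shows that positive and negative reviews are equally represented.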

Checking for presence of missing values

In [7]:
Rest_rev.isnull().sum()

Review    0
Liked     0
dtype: int64

There are no missing values in the dataframe

Checking for presence of duplicates

In [8]:
Rest_rev.duplicated(subset=None, keep='first').sum()

4

There are 4 duplicate records in the dataframe

In [9]:
Rest_rev.shape

(1000, 2)

Removing the duplicate records

In [10]:
# Keep the first occurrence of each fully duplicated row
Rest_rev = Rest_rev.drop_duplicates(keep='first')

In [11]:
Rest_rev.duplicated(subset=None, keep='first').sum()

0

In [12]:
Rest_rev.shape

(996, 2)

Duplicate records have been removed from the dataframe
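
Before dropping duplicates it can help to look at every copy of the repeated rows. A small sketch on made-up data (the sample reviews are hypothetical, not taken from the dataset): `keep=False` flags all copies for inspection, and resetting the index after the drop closes the gaps left by the removed rows.

```python
import pandas as pd

# Hypothetical sample with one repeated review.
sample = pd.DataFrame({
    "Review": ["Great food.", "Slow service.", "Great food."],
    "Liked": [1, 0, 1],
})

# keep=False marks every copy of a duplicated row, not just the later ones.
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduped))
```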

**Cleaning the punctuation marks**

In [13]:
import re
import string

List of punctuation marks to be removed

In [14]:
print(list(string.punctuation))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [15]:
def remove_punctuation(text):
    no_punct = "".join([c if c not in string.punctuation else " " for c in text])
    return no_punct

In [16]:
Rest_rev['Review'] = Rest_rev['Review'].apply(remove_punctuation)
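
As an alternative to the character-by-character loop in `remove_punctuation`, the same replacement can be sketched with `str.translate`, which is typically faster on long texts. The helper name below is hypothetical. Note that replacing punctuation with a space splits contractions, so `don't` becomes `don t`.

```python
import string

# Translation table that maps every punctuation character to a single space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def remove_punctuation_translate(text):
    """Replace each punctuation character with a space (hypothetical helper)."""
    return text.translate(PUNCT_TO_SPACE)

print(repr(remove_punctuation_translate("Crust is not good.")))
print(repr(remove_punctuation_translate("don't")))
```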

Converting all uppercase characters to lowercase

In [17]:
Rest_rev['Review'] = Rest_rev['Review'].str.lower()

Removing numbers from the **Review** column, as they do not contribute to predicting the sentiment of a review

In [18]:
# Raw string avoids the invalid-escape warning for \d
Rest_rev['Review'] = [re.sub(r'\d+', '', e) for e in Rest_rev['Review']]

Collapsing runs of whitespace in the text into a single space

In [19]:
# Raw string avoids the invalid-escape warning for \s
Rest_rev['Review'] = [re.sub(r'\s+', ' ', e) for e in Rest_rev['Review']]

Removing leading or trailing spaces from the text

In [20]:
Rest_rev['Review'] = Rest_rev['Review'].str.strip()
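
The cleaning steps above (punctuation, case, digits, whitespace) can also be gathered into one function, which makes the order of operations explicit and easy to reuse on new reviews. This is a sketch with a hypothetical function name, not the code used in the cells above:

```python
import re
import string

def clean_review(text):
    """Apply the cleaning steps in the same order as the notebook cells."""
    # 1. Replace punctuation with spaces.
    text = "".join(c if c not in string.punctuation else " " for c in text)
    # 2. Lowercase.
    text = text.lower()
    # 3. Remove digits.
    text = re.sub(r"\d+", "", text)
    # 4. Collapse runs of whitespace, then trim both ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_review("Wow... Loved this place."))
print(repr(clean_review("  10/10!!  ")))
```

A review made up only of digits and punctuation ends up as an empty string, so it may be worth checking for empty reviews after cleaning. On the DataFrame the function would be applied as `Rest_rev['Review'].apply(clean_review)`.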

In [23]:
# Uncomment to save the cleaned reviews; the next cell reloads this file
#Rest_rev.to_csv('Rest_rev_1.csv', index=False)

In [115]:
Rest_rev = pd.read_csv('Rest_rev_1.csv')

#### Importing nltk

In [116]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nipun.gupta\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### Tokenization

In [117]:
from nltk.tokenize import word_tokenize

In [118]:
Rest_rev['Review'] = Rest_rev['Review'].apply(word_tokenize)

In [119]:
Rest_rev.head()

| | Review | Liked |
|---|---|---|
| 0 | [wow, loved, this, place] | 1 |
| 1 | [crust, is, not, good] | 0 |
| 2 | [not, tasty, and, the, texture, was, just, nasty] | 0 |
| 3 | [stopped, by, during, the, late, may, bank, ho... | 1 |
| 4 | [the, selection, on, the, menu, was, great, an... | 1 |


The **Review** column has been tokenized, i.e. each review has been split into a list of word tokens
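
Because the text has already been lowercased and stripped of punctuation, a plain whitespace split produces tokens like the ones shown above. The sketch below uses a hypothetical helper as a dependency-free approximation. NLTK's `word_tokenize` applies extra rules (it may split some words such as `cannot`), so the two are not guaranteed to agree on every review.

```python
def simple_tokenize(text):
    """Split already-cleaned text on whitespace (hypothetical stand-in for word_tokenize)."""
    return text.split()

print(simple_tokenize("not tasty and the texture was just nasty"))
```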

#### Removing the stopwords

In [120]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nipun.gupta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loading the **English** language stopwords

In [121]:
stopwords_english = stopwords.words('english')

List of English stopwords

In [122]:
print(stopwords_english)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Some words in the list above, such as **not**, **until**, **while** and **against**, may contribute to the sentiment of a review. They should be retained in the reviews, so they are removed from the stopword list

In [123]:
stopwords_english_set = set(stopwords_english)

In [124]:
stopwords_english_set = stopwords_english_set.difference({'until', 'while', 'against', 'between', 'during', 'before', 'after', 'above', 'below', 'not'})

In [125]:
stopwords_english = list(stopwords_english_set)

In [126]:
print(stopwords_english)

['each', 'herself', "haven't", 'once', 'haven', 'being', 'same', 'these', 'her', 'an', 'm', "mustn't", 'what', "couldn't", 'all', 'weren', "should've", 'again', 'or', 'that', 'doesn', 'his', 'our', 'some', "you'd", 'here', "shan't", 'yourself', 'through', 'no', 'won', 'too', 'did', 'had', 'own', 'having', 'when', "you're", 'nor', 't', 'didn', 'i', 'he', 'how', "shouldn't", 'is', "isn't", 'are', 'off', 'should', "that'll", 'ma', 'at', "won't", 'don', 'we', 'under', "hadn't", 'does', 'hadn', 'wasn', 're', 'only', 'as', 'they', 'about', 'the', 'been', 'such', "needn't", 'o', 'over', 'now', "doesn't", 'just', 'whom', 'more', "she's", 'you', 'which', "don't", 'll', 'their', 'any', 'but', 'who', 'most', 'me', 'wouldn', 'both', "wouldn't", 'in', 'on', 'were', 'if', 'from', 'will', 've', 'themselves', 'isn', 'am', 'then', 'aren', 'up', 'was', 'shan', 'himself', 'needn', 'than', 'there', 'itself', 'other', 'so', 'out', 'for', 'shouldn', 'and', 'ours', "you'll", 'she', 'them', 'd', 'have', 's', 

In [127]:
def remove_stopwords(text):        
    words = [w for w in text if w not in stopwords_english]
    return words

In [128]:
Rest_rev['Review'] = Rest_rev['Review'].apply(remove_stopwords)

In [129]:
Rest_rev.head()

| | Review | Liked |
|---|---|---|
| 0 | [wow, loved, place] | 1 |
| 1 | [crust, not, good] | 0 |
| 2 | [not, tasty, texture, nasty] | 0 |
| 3 | [stopped, during, late, may, bank, holiday, ri... | 1 |
| 4 | [selection, menu, great, prices] | 1 |


The stopwords have been removed from the **Review** column
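
The filtering step can be sketched with a set instead of a list, since membership tests on a set do not scan every element. The stopword subset and function name below are illustrative assumptions; the full list comes from `nltk.corpus.stopwords`. The second call shows a possible side effect of the earlier punctuation step: `didn't` was split into `didn` and `t`, both of which are stopwords, so the negation disappears.

```python
# Small illustrative subset of stopwords, not the full NLTK list.
illustrative_stopwords = {"the", "was", "is", "i", "didn", "t", "not"}

# Keep negation in the reviews by taking "not" out of the stopword set.
custom_stopwords = illustrative_stopwords - {"not"}

def filter_tokens(tokens, stopword_set):
    """Return the tokens that are not in the stopword set."""
    return [w for w in tokens if w not in stopword_set]

print(filter_tokens(["crust", "is", "not", "good"], custom_stopwords))
# "didn" and "t" are both stopwords, so the negation is lost here.
print(filter_tokens(["i", "didn", "t", "like", "the", "crust"], custom_stopwords))
```

If contractions matter for the model, one option is to expand them (for example `didn't` to `did not`) before punctuation is removed.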

#### Lemmatization of Reviews

In [130]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nipun.gupta\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [131]:
from nltk.stem import WordNetLemmatizer
#from nltk.stem import PorterStemmer

In [132]:
lemmatizer = WordNetLemmatizer()

In [133]:
def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

In [134]:
Rest_rev['Review'] = Rest_rev['Review'].apply(word_lemmatizer)

In [135]:
Rest_rev.head()

| | Review | Liked |
|---|---|---|
| 0 | [wow, loved, place] | 1 |
| 1 | [crust, not, good] | 0 |
| 2 | [not, tasty, texture, nasty] | 0 |
| 3 | [stopped, during, late, may, bank, holiday, ri... | 1 |
| 4 | [selection, menu, great, price] | 1 |


In [136]:
Rest_rev = Rest_rev.rename(columns={'Review':'cleaned_Review'})

In [137]:
Rest_rev.head()

| | cleaned_Review | Liked |
|---|---|---|
| 0 | [wow, loved, place] | 1 |
| 1 | [crust, not, good] | 0 |
| 2 | [not, tasty, texture, nasty] | 0 |
| 3 | [stopped, during, late, may, bank, holiday, ri... | 1 |
| 4 | [selection, menu, great, price] | 1 |
