# Assignment 2
## Extracting Reviews and creating Feedback Detection Model

### Nripendra Pratap Singh - 19200326
The aim of this assignment is to extract data from the website (http://mlg.ucd.ie/modules/yalp/). For this Assignment, I have chosen 3 categories of businesses:

- Restaurants
- Gym
- Automobile


In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import os
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support, f1_score, precision_score, recall_score
from sklearn.naive_bayes import BernoulliNB


In [2]:
def get_webpage_bs(url):
    feedback_page = url
    page = urlopen(feedback_page)
    soup = BeautifulSoup(page, 'html.parser')
    return soup

weblink = 'http://mlg.ucd.ie/modules/yalp/'
soup = get_webpage_bs(weblink)
mydivs = soup.findAll("div", {"class": "category"})

In [3]:
mydivs

[<div class="category"><h4><a href="automotive_list.html">Category: Automotive</a>  (132 businesses)</h4></div>,
 <div class="category"><h4><a href="cafes_list.html">Category: Cafes</a>  (96 businesses)</h4></div>,
 <div class="category"><h4><a href="fashion_list.html">Category: Fashion</a>  (159 businesses)</h4></div>,
 <div class="category"><h4><a href="gym_list.html">Category: Gym</a>  (122 businesses)</h4></div>,
 <div class="category"><h4><a href="hair_salons_list.html">Category: Hair and Salons</a>  (143 businesses)</h4></div>,
 <div class="category"><h4><a href="hotels_list.html">Category: Hotels</a>  (113 businesses)</h4></div>,
 <div class="category"><h4><a href="restaurants_list.html">Category: Restaurants</a>  (100 businesses)</h4></div>]

#### Creating Empty Dataframes to store data from the websites

In [4]:
df_auto = pd.DataFrame(columns=['review', 'label'])
df_gym = pd.DataFrame(columns=['review', 'label'])
df_resto = pd.DataFrame(columns=['review', 'label'])
english_stop_words = stopwords.words('english')

### Defining Functions that will be used in the Program throughout the assignment

##### Webpage data extraction program

In [5]:
def get_data_df(weblink, url_category):
    """Scrape every business page linked from a category page into a DataFrame."""
    rows = []
    category_sites = get_webpage_bs(url_category)
    for link in category_sites.find_all('a', href=True):
        curr_company = get_webpage_bs(weblink + link['href'])
        info = curr_company.findAll("div", {"class": "review"})
        for curr_info in info:
            rating = curr_info.select("p:nth-of-type(2)")
            review = curr_info.select("p:nth-of-type(3)")
            # The rating is the leading integer of the image's alt text
            rating = int(rating[0].img.get('alt').split('-')[0].strip())
            # Ratings above 3 count as positive, everything else as negative
            rating = "positive" if rating > 3 else "negative"
            review = preprocess_reviews(review[0].text)
            # Collect rows in a list; DataFrame.append was removed in pandas 2.0
            rows.append({'review': review, 'label': rating})
    return pd.DataFrame(rows, columns=['review', 'label'])

#### The Preprocessing Section
This section preprocesses each review before it is written to the CSV backup, which is later used to train and run the models.

The preprocess_reviews function takes a review, lowercases it, removes the punctuation, and passes the result on for stop word removal and stemming.

In [6]:
def preprocess_reviews(line):
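    # Raw strings keep the regex escapes from being read as string escapes
    REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
    # Compiled but not applied below, so hyphens and slashes stay in the text
    REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")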
    line = REPLACE_NO_SPACE.sub("", line.lower())
    line = get_stemmed_text(remove_stop_words(line))
    
    return line

##### Remove Stop Words
This function removes the words that contribute little to the meaning of a sentence. Such words are used to frame the sentence, but their presence or absence does not affect its overall meaning.

For example, in "somewhere, there is a cat", only "somewhere" and "cat" carry the meaning of the sentence, while "there", "is" and "a" are stop words.

In [7]:
def remove_stop_words(review):
    # Keep only the words that are not in NLTK's English stop word list
    return ' '.join(word for word in review.split()
                    if word not in english_stop_words)
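As a small illustration of the filtering step, the sketch below applies the same idea to the example sentence. The stop word set `DEMO_STOP_WORDS` and the function name `drop_stop_words` are hypothetical and hand-picked for this example; the notebook itself relies on NLTK's English list.

```python
# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
DEMO_STOP_WORDS = {"there", "is", "a", "the", "and"}


def drop_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Return text with every whitespace-separated stop word removed."""
    return " ".join(w for w in text.split() if w not in stop_words)


# Punctuation is assumed to have been stripped already, as preprocess_reviews does
example = drop_stop_words("somewhere there is a cat")
print(example)  # somewhere cat
```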

###### Porter Stemmer
Stemming reduces each word to its stem or base form. This removes the variations of a word that arise from its usage in context (plurals, tenses and so on), so that the model treats all of those variants as a single term.

In [8]:
def get_stemmed_text(review):
    
    stemmer = PorterStemmer()
    return ' '.join([stemmer.stem(word) for word in review.split()])
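To make the effect of stemming concrete, here is a minimal sketch that stems a few words whose stems also show up in the review previews further down. The word list is chosen purely for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words picked for illustration; their stems appear in the scraped reviews
words = ["attitude", "people", "favorite", "highly"]
stems = [stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print("%s -> %s" % (word, stem))
```

Note that a stem does not have to be a dictionary word; it only has to be the same for all variants of a word.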

In [9]:
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """pretty print for confusion matrixes"""
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    # Print header
    print("    " + empty_cell, end=" ")
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print(cell, end=" ")
        print()

###### Backup Creation
This section reads the data from the CSV files, and extracts it from the website only if the files are not present. Since website crawling is an expensive operation, it helps to keep the data in a backup that can be read directly.

In [10]:
# Create the backup folder on the first run
os.makedirs('data_backup', exist_ok=True)
exists_auto = os.path.isfile('data_backup/df_auto.csv')
exists_gym = os.path.isfile('data_backup/df_gym.csv')
exists_resto = os.path.isfile('data_backup/df_resto.csv')
if exists_auto:
    df_auto = pd.read_csv('data_backup/df_auto.csv')
    auto = False
else:
    auto = True
if exists_gym:
    df_gym = pd.read_csv('data_backup/df_gym.csv')
    gym = False
else:
    gym = True
if exists_resto:
    df_resto = pd.read_csv('data_backup/df_resto.csv')
    resto = False
else:
    resto = True

In [11]:
for curr_id in mydivs:
    if 'Automotive' in curr_id.text and auto :
        url_category = weblink + curr_id.a.get('href')
        df_auto = get_data_df(weblink, url_category)
        df_auto.to_csv('data_backup/df_auto.csv', index=False)
    if 'Gym' in curr_id.text and gym:
        url_category = weblink + curr_id.a.get('href')
        df_gym = get_data_df(weblink, url_category)
        df_gym.to_csv('data_backup/df_gym.csv', index=False)
    if 'Restaurants' in curr_id.text and resto:
        url_category = weblink + curr_id.a.get('href')
        df_resto = get_data_df(weblink, url_category)
        df_resto.to_csv('data_backup/df_resto.csv', index=False)
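The two cells above repeat the same "read the backup, otherwise scrape and save" logic for each category. As a sketch only, that pattern could be folded into one helper. The names `load_or_build` and `fake_scrape` are hypothetical, and the stand-in builder below returns a fixed DataFrame instead of crawling the website.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Read csv_path if it exists; otherwise call build_fn() and cache the result."""
    if os.path.isfile(csv_path):
        return pd.read_csv(csv_path)
    df = build_fn()
    os.makedirs(os.path.dirname(csv_path) or ".", exist_ok=True)
    df.to_csv(csv_path, index=False)
    return df


calls = []


def fake_scrape():
    # Stand-in for get_data_df(weblink, url_category); records each call
    calls.append(1)
    return pd.DataFrame({"review": ["good place"], "label": ["positive"]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_backup", "df_demo.csv")
    first = load_or_build(path, fake_scrape)   # builds and writes the CSV
    second = load_or_build(path, fake_scrape)  # reads the CSV back

print("builder was called %d time(s)" % len(calls))
```

In the notebook, `build_fn` would be something like `lambda: get_data_df(weblink, url_category)`.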

In [12]:
df_auto.head()

| | review | label |
|---|---|---|
| 0 | man work tonight 8-12-17 rude real jerk need h... | negative |
| 1 | chri rude person gave attitud chang peopl go w... | negative |
| 2 | one favorit ga station stop store alway clean ... | positive |
| 3 | oh thank heaven seven eleven dont know thank s... | negative |
| 4 | five star guy work weekday morn around 8-9am-i... | positive |


In [13]:
df_gym.head()

| | review | label |
|---|---|---|
| 0 | your look box east valley highli recommend gym... | positive |
| 1 | realli excit tri fun workout routin would also... | negative |
| 2 | interest take box bootcamp class research foun... | negative |
| 3 | work 1 1 box bout 6 month love price reason al... | positive |
| 4 | place liter kick butt everi singl time actual ... | positive |


In [14]:
df_resto.head()

| | review | label |
|---|---|---|
| 0 | husband rare afternoon decid tri place friend ... | negative |
| 1 | year thought wine store sister stop told go ni... | positive |
| 2 | place charm went husband love simpl clean deco... | positive |
| 3 | want tri place coupl year final stop last nigh... | positive |
| 4 | decor look ok layout busi difficult walk sit/t... | negative |


Once the data has been preprocessed, it must be converted into numeric values that the machine learning algorithm can use to predict the class labels of the reviews.

This assignment uses the TF-IDF vectoriser. TF-IDF (Term Frequency - Inverse Document Frequency) measures how important a word is to a document within a collection: the weight grows with the number of times the word occurs in the document and shrinks as more documents contain it. With scikit-learn's default settings each document vector is L2-normalised, so every weight lies between 0 and 1.

In the runs of the TF-IDF vectoriser below, the number of words (features) is limited to 2500 in order to:
- Have a similar number of features across models, so that a model trained on one category can be applied to the data of another
- Leave out words beyond the 2500 most frequent, which are rare and whose absence should not make a great difference to the model
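The weighting described above can be seen on a toy corpus. This is a minimal sketch with made-up reviews, not the scraped data: "place" occurs in every document and "rude" in only one, so "rude" should receive the larger inverse document frequency.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration
corpus = [
    "great place friendly staff",
    "rude staff bad place",
    "clean place great price",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Map each term to its inverse document frequency
idf = {term: vectorizer.idf_[idx] for term, idx in vectorizer.vocabulary_.items()}

print("idf('place') = %.3f" % idf["place"])  # occurs in all documents
print("idf('rude')  = %.3f" % idf["rude"])   # occurs in one document
print("weights range from %.3f to %.3f" % (X.min(), X.max()))
```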

### Running TFIDF Vectoriser over the Data in the Category Automobile

In [15]:
tfidf_vectorizer = TfidfVectorizer(max_features = 2500)
tfidf_vectorizer.fit(df_auto['review'])
X_auto = tfidf_vectorizer.transform(df_auto['review'])
X_test_auto = tfidf_vectorizer.transform(df_auto['review'])  # same rows as X_auto

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_auto, df_auto['label'], train_size = 0.75) 
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

# Final model fitted on all Automobile reviews; the accuracy below is measured on that same data
final_tfidf_auto = LogisticRegression(C=0.2)
final_tfidf_auto.fit(X_auto, df_auto['label'])

print ("Final Accuracy: %s" 
       % accuracy_score(df_auto['label'], final_tfidf_auto.predict(X_test_auto)))

print_cm(confusion_matrix(df_auto['label'], final_tfidf_auto.predict(X_test_auto)),labels = ['negative','positive'])

Accuracy for C=0.01: 0.626
Accuracy for C=0.05: 0.64
Accuracy for C=0.25: 0.852
Accuracy for C=0.5: 0.884
Accuracy for C=1: 0.908
Final Accuracy: 0.895
             negative positive 
    negative    594.0    194.0 
    positive     16.0   1196.0 




In the above run, we also measured how Logistic Regression performs as the inverse regularisation parameter C varies from 0.01 to 1. Regularisation helps to prevent overfitting by penalising large coefficients. Because C is the inverse of the regularisation strength, smaller values mean a heavier penalty, and C = 1 is the lightest penalty among the values tried.

The final value of C is set to 0.2, a middle ground intended to avoid both overfitting and underfitting.
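The effect of C on the fitted model can be checked directly by looking at the size of the coefficient vector. The sketch below uses a synthetic dataset from `make_classification` rather than the review data, so the numbers are illustrative only; the point is that the coefficients shrink as C gets smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used here only to illustrate the role of C
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

norms = {}
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X, y)
    # L2 norm of the learned weights: smaller C means stronger shrinkage
    norms[c] = float(np.linalg.norm(model.coef_))
    print("C=%s  ||w|| = %.3f" % (c, norms[c]))
```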


### Running TFIDF Vectoriser over the Data in the Category Gym

In [16]:
tfidf_vectorizer = TfidfVectorizer(max_features = 2500)
tfidf_vectorizer.fit(df_gym['review'])
X_gym = tfidf_vectorizer.transform(df_gym['review'])
X_test_gym = tfidf_vectorizer.transform(df_gym['review'])  # same rows as X_gym

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_gym, df_gym['label'], train_size = 0.75) 
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(y_val, lr.predict(X_val))))

final_tfidf_gym = LogisticRegression(C=0.2)
final_tfidf_gym.fit(X_gym, df_gym['label'])
print ("Final Accuracy: %s" 
       % accuracy_score(df_gym['label'], final_tfidf_gym.predict(X_test_gym)))

print_cm(confusion_matrix(df_gym['label'], final_tfidf_gym.predict(X_test_gym)),labels = ['negative','positive'])

Accuracy for C=0.01: 0.628
Accuracy for C=0.05: 0.646
Accuracy for C=0.25: 0.808
Accuracy for C=0.5: 0.856
Accuracy for C=1: 0.876
Final Accuracy: 0.8615
             negative positive 
    negative    433.0    268.0 
    positive      9.0   1290.0 




### Running TFIDF Vectoriser over the Data in the Category Restaurants

In [17]:
tfidf_vectorizer = TfidfVectorizer(max_features = 2500)
tfidf_vectorizer.fit(df_resto['review'])
X_resto = tfidf_vectorizer.transform(df_resto['review'])
X_test_resto = tfidf_vectorizer.transform(df_resto['review'])  # same rows as X_resto

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_resto, df_resto['label'], train_size = 0.75) 

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print ("Accuracy for C=%s: %s" % (c, accuracy_score(y_val, lr.predict(X_val))))

    
final_tfidf_resto = LogisticRegression(C=0.2)
final_tfidf_resto.fit(X_resto, df_resto['label'])
print ("Final Accuracy: %s" 
       % accuracy_score(df_resto['label'], final_tfidf_resto.predict(X_test_resto)))


print_cm(confusion_matrix(df_resto['label'], final_tfidf_resto.predict(X_test_resto)),labels = ['negative','positive'])

Accuracy for C=0.01: 0.564
Accuracy for C=0.05: 0.6
Accuracy for C=0.25: 0.82
Accuracy for C=0.5: 0.862
Accuracy for C=1: 0.864
Final Accuracy: 0.868
             negative positive 
    negative    597.0    241.0 
    positive     23.0   1139.0 




##### Naive Bayes

To supplement the accuracy results obtained with the Logistic Regression model, I also trained a Naive Bayes classifier to see how it affects the accuracy and precision measures.

In [18]:
NB_auto = BernoulliNB(alpha=0.02)
NB_auto.fit(X_auto, df_auto['label'])

NB_gym = BernoulliNB(alpha=0.02)
NB_gym.fit(X_gym, df_gym['label'])

NB_resto = BernoulliNB(alpha=0.02)
NB_resto.fit(X_resto, df_resto['label'])

BernoulliNB(alpha=0.02, binarize=0.0, class_prior=None, fit_prior=True)

From the above tests we can conclude that:
- Logistic Regression was chosen as the algorithm to train the model, and its performance was analysed.
- Logistic Regression reached accuracies between roughly 86% and 90% with C = 0.2. False negatives were very few; however, false positives were considerably more numerous.
- Following the idea of Mental Accounting, classifying a review as negative when it is actually positive would be more harmful to the business than classifying a negative review as positive. In the latter case, the customer can still go through other reviews and judge the overall quality of a business, whereas feedback labelled negative weighs more heavily on the customer than feedback labelled positive.
- With the above theory in mind, the confusion matrices show that the model not only classifies most reviews correctly but also has negligible false negatives (positive reviews classified as negative), so we can accept this model.
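The asymmetry between the two kinds of error can be read off the confusion matrices printed earlier. The sketch below copies those three matrices by hand and computes, per category, the accuracy and the share of each class that ends up with the wrong label. The helper name `error_rates` is hypothetical.

```python
# Confusion matrices copied from the outputs above
# (rows = true label, columns = predicted label, order: negative, positive)
matrices = {
    "auto": [[594, 194], [16, 1196]],
    "gym": [[433, 268], [9, 1290]],
    "resto": [[597, 241], [23, 1139]],
}


def error_rates(cm):
    """Accuracy plus the share of each true class that was mislabelled."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "neg_as_pos": fp / (tn + fp),  # negative reviews labelled positive
        "pos_as_neg": fn / (fn + tp),  # positive reviews labelled negative
    }


rates = {name: error_rates(cm) for name, cm in matrices.items()}
for name, r in rates.items():
    print("%-5s accuracy=%.4f  neg->pos=%.3f  pos->neg=%.3f"
          % (name, r["accuracy"], r["neg_as_pos"], r["pos_as_neg"]))
```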

## STEP 3

## Automobile Classifier Model
Model trained on Automobile, run on restaurant and gym data

### Logistic Regression
Running on Gym Data

In [19]:
print ("Final Accuracy: %s" % accuracy_score(df_gym['label'], final_tfidf_auto.predict(X_test_gym)))

Final Accuracy: 0.6515


###### Confusion Matrix

In [20]:
print_cm(confusion_matrix(df_gym['label'], final_tfidf_auto.predict(X_test_gym)),labels = ['negative','positive'])

             negative positive 
    negative     18.0    683.0 
    positive     14.0   1285.0 


###### F1 Score

In [21]:
f1_score(df_gym['label'], final_tfidf_auto.predict(X_test_gym), pos_label="negative")

0.04911323328785812

###### Precision

In [22]:
precision_score(df_gym['label'], final_tfidf_auto.predict(X_test_gym), pos_label="negative")

0.5625

Running on Restaurant Data

In [23]:
print ("Final Accuracy: %s" % accuracy_score(df_resto['label'], final_tfidf_auto.predict(X_test_resto)))

Final Accuracy: 0.581


In [24]:
print_cm(confusion_matrix(df_resto['label'], final_tfidf_auto.predict(X_test_resto)),labels = ['negative','positive'])

             negative positive 
    negative     15.0    823.0 
    positive     15.0   1147.0 


### Naive Bayes

In [25]:
print ("Final Accuracy: %s" % accuracy_score(df_resto['label'], NB_auto.predict(X_test_resto)))

Final Accuracy: 0.545


In [26]:
print ("Final Accuracy: %s" % accuracy_score(df_gym['label'], NB_auto.predict(X_test_gym)))

Final Accuracy: 0.5285


In [27]:
print_cm(confusion_matrix(df_gym['label'], NB_auto.predict(X_test_gym)),labels = ['negative','positive'])

             negative positive 
    negative    390.0    311.0 
    positive    632.0    667.0 


In [28]:
f1_score(df_gym['label'], NB_auto.predict(X_test_gym), pos_label="negative")

0.4526987811955891

From the above runs of Naive Bayes and Logistic Regression on the Gym data, we observe that Logistic Regression produces better accuracy (0.6515 against 0.5285) but suffers from a low F1 score for the negative class (0.049), whereas Naive Bayes produces lower accuracy but a higher F1 score (0.453).

When dealing with reviews, as in this assignment, classifying a review as positive when it is in fact negative, or vice versa, does not do significant damage. By contrast, if this were a cancer prediction model, classifying someone who has cancer as cancer-free would have real-life consequences. In such cases, where false positives and false negatives play an important role, we need to concentrate on increasing precision, recall and F1 score.

In our case, since false positives and false negatives do not play a significant role, we do not need to focus on higher precision, recall and F1 score, and can concentrate on increasing the accuracy of the model.

This observation led me to go ahead with Logistic Regression for this case study: Naive Bayes did provide a significantly higher F1 score, but it did not help with accuracy.
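The accuracy, precision and F1 figures compared above can be reproduced from the two Gym confusion matrices alone. The sketch below copies the matrices by hand and treats "negative" as the class of interest, matching the `pos_label="negative"` argument used in the cells above. The helper name `negative_class_scores` is hypothetical.

```python
# Automobile-trained models evaluated on Gym data, copied from the outputs above
# (rows = true label, columns = predicted label, order: negative, positive)
cm_lr = [[18, 683], [14, 1285]]    # Logistic Regression
cm_nb = [[390, 311], [632, 667]]   # Naive Bayes


def negative_class_scores(cm):
    """Accuracy, precision, recall and F1 with 'negative' as the class of interest."""
    hits = cm[0][0]            # negative reviews predicted negative
    missed = cm[0][1]          # negative reviews predicted positive
    false_alarms = cm[1][0]    # positive reviews predicted negative
    total = sum(cm[0]) + sum(cm[1])
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + missed)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (cm[0][0] + cm[1][1]) / total
    return accuracy, precision, recall, f1


acc_lr, prec_lr, rec_lr, f1_lr = negative_class_scores(cm_lr)
acc_nb, prec_nb, rec_nb, f1_nb = negative_class_scores(cm_nb)

print("LR: accuracy=%.4f precision=%.4f recall=%.4f f1=%.4f"
      % (acc_lr, prec_lr, rec_lr, f1_lr))
print("NB: accuracy=%.4f precision=%.4f recall=%.4f f1=%.4f"
      % (acc_nb, prec_nb, rec_nb, f1_nb))
```

The low F1 score of Logistic Regression comes almost entirely from recall: only 18 of the 701 negative Gym reviews are recognised as negative.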

## Gym Classifier Model

In [29]:
print ("Final Accuracy: %s" % accuracy_score(df_auto['label'], final_tfidf_gym.predict(X_test_auto)))

Final Accuracy: 0.6045


In [30]:
print ("Final Accuracy: %s" % accuracy_score(df_resto['label'], final_tfidf_gym.predict(X_test_resto)))

Final Accuracy: 0.5815


###### Confusion Matrix
Confusion Matrix for Gym_Model on Automobile Data 

In [31]:
print_cm(confusion_matrix(df_auto['label'], final_tfidf_gym.predict(X_test_auto)),labels = ['negative','positive'])

             negative positive 
    negative      2.0    786.0 
    positive      5.0   1207.0 


Confusion Matrix for Gym_Model on Restaurant Data

In [32]:
print_cm(confusion_matrix(df_resto['label'], final_tfidf_gym.predict(X_test_resto)),labels = ['negative','positive'])

             negative positive 
    negative      5.0    833.0 
    positive      4.0   1158.0 


## Resto Classifier model

In [33]:
print ("Final Accuracy: %s" % accuracy_score(df_auto['label'], final_tfidf_resto.predict(X_test_auto)))

Final Accuracy: 0.603


In [34]:
print ("Final Accuracy: %s" % accuracy_score(df_gym['label'], final_tfidf_resto.predict(X_test_gym)))

Final Accuracy: 0.6475


###### Confusion Matrix
Confusion Matrix for Restaurant_Model on Automobile Data

In [35]:
print_cm(confusion_matrix(df_auto['label'], final_tfidf_resto.predict(X_test_auto)),labels = ['negative','positive'])

             negative positive 
    negative     71.0    717.0 
    positive     77.0   1135.0 


Confusion Matrix for Restaurant_Model on Gym Data

In [36]:
print_cm(confusion_matrix(df_gym['label'], final_tfidf_resto.predict(X_test_gym)),labels = ['negative','positive'])

             negative positive 
    negative     24.0    677.0 
    positive     28.0   1271.0 


### Conclusions

From the above experiment we can see that:
- Logistic Regression was chosen over Naive Bayes because this case study warranted an accuracy-based approach rather than a focus on precision and recall.
- Across categories, Logistic Regression correctly identified the true positives but failed on the true negatives, classifying most of them as positive. This may be because reviews are category dependent and often use terms specific to their category, which makes training on one category and testing on another less feasible.
- The category-specific Logistic Regression models score well on both true positives and true negatives, but fail to replicate these results on cross-category data. This further supports the finding that reviews from category A are not suitable data for predicting reviews from category B, as the reviews of different categories have little in common.
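One detail of the cross-category runs may be worth checking. Each category above was vectorised with its own `TfidfVectorizer`, so a given column index can stand for different words in different categories. The sketch below is not what the notebook does: on made-up, already-stemmed toy reviews, it shows the mismatch and an alternative in which the target category is transformed with the vectoriser fitted on the source category. All variable names and data here are hypothetical, and no claim is made about how the real accuracies would change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy reviews standing in for two categories
source_reviews = ["great servic friendli staff", "rude staff terribl servic",
                  "love place fast repair", "bad repair rude mechan"]
source_labels = ["positive", "negative", "positive", "negative"]
target_reviews = ["great class friendli trainer", "rude trainer terribl gym"]

# Separate vectorisers, as in the notebook: one vocabulary per category
source_vec = TfidfVectorizer(max_features=2500).fit(source_reviews)
target_vec = TfidfVectorizer(max_features=2500).fit(target_reviews)

# Words present in both vocabularies but stored under different column indices
shared = set(source_vec.vocabulary_) & set(target_vec.vocabulary_)
mismatched = sorted(w for w in shared
                    if source_vec.vocabulary_[w] != target_vec.vocabulary_[w])
print("shared words with different columns:", mismatched)

# Alternative: reuse the source vectoriser so columns mean the same words
model = LogisticRegression(C=0.2).fit(
    source_vec.transform(source_reviews), source_labels)
X_target = source_vec.transform(target_reviews)
predictions = model.predict(X_target)
print(list(predictions))
```

Words that the source vocabulary has never seen are simply dropped by `transform`, which is one concrete way in which category-specific vocabulary limits cross-category prediction.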