# Advanced Topics in Data Mining and Knowledge Discovery 
## Assignment 2 
Yelp is a company that develops, hosts and markets the Yelp.com website and the Yelp mobile app, which publish crowd-sourced reviews about businesses. It also operates an online reservation service called Yelp Reservations. [Wikipedia](https://en.wikipedia.org/wiki/Yelp)

In this assignment you will classify reviews into 3 categories: restaurants, beauty and shopping. The reviews come from the Kaggle Yelp reviews dataset.

The columns are as follows:
* review_id  - a unique id for each review.
* user_id - a unique id for each user.
* business_id - a unique id for each business.
* review_date - the date the review was published.
* **review_text** - the review itself. 
* **business_category** - the category the review belongs to, either **restaurant**, **beauty** or **shopping**.

## Questions 

### Text Data Cleaning and Preprocessing

You're given the following text:

"Eugene loves all animals, but especially cats, he loves cats so much that he has 8 of them. His cats surely love him back, but you never know, as cats are independent creatures."

You're using either tf–idf or count vectorization for text representation.

1. Given that "cat" is one of your features, what is the count of "cat" in this sentence?

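Without lemmatization in preprocessing, the count of "cat" is 0: the sentence contains only the plural "cats" (four times), which the vectorizer treats as a different token.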
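The count can be checked with a minimal sketch. It tokenizes the sentence with a regular expression that mirrors the default `token_pattern` of scikit-learn's `CountVectorizer` (lowercased tokens of two or more word characters) and counts exact matches. The helper name `count_tokens` is hypothetical.

```python
import re
from collections import Counter

TEXT = (
    "Eugene loves all animals, but especially cats, he loves cats so much "
    "that he has 8 of them. His cats surely love him back, but you never "
    "know, as cats are independent creatures."
)

def count_tokens(text):
    """Count lowercased tokens made of two or more word characters."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return Counter(tokens)

counts = count_tokens(TEXT)
print(counts["cat"])   # 0: the exact token "cat" never appears
print(counts["cats"])  # 4: the plural is a separate feature
```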

2. What can you do to the text so cat and cats will be considered the same? When is it important to do so?

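Run a lemmatization algorithm on the text, which maps the inflected forms of a word (such as "cats" and "cat") to the same base form. This is important when the different forms carry the same meaning for the task, because merging them avoids splitting one concept's counts across several features and keeps the vocabulary smaller.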
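To illustrate the effect, here is a minimal sketch that merges inflected forms before counting. It uses a tiny hand-written lookup table (`TOY_LEMMAS`, hypothetical) as a stand-in for a real lemmatizer such as NLTK's `WordNetLemmatizer`, so it only covers the words listed in the table.

```python
import re
from collections import Counter

TEXT = (
    "Eugene loves all animals, but especially cats, he loves cats so much "
    "that he has 8 of them. His cats surely love him back, but you never "
    "know, as cats are independent creatures."
)

# Hypothetical stand-in for a real lemmatizer: inflected form -> base form.
TOY_LEMMAS = {
    "cats": "cat",
    "loves": "love",
    "animals": "animal",
    "creatures": "creature",
}

def lemmatized_counts(text):
    """Tokenize, lowercase, map each token to its base form, then count."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(TOY_LEMMAS.get(token, token) for token in tokens)

counts = lemmatized_counts(TEXT)
print(counts["cat"])   # 4: all occurrences of "cats" now count as "cat"
print(counts["love"])  # 3: "loves" (twice) and "love" (once) are merged
```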

3. What other cleaning operations are important when working with text and why? 

Important cleaning operations when working with text are:  
1. Lemmatization (as mentioned in the previous question).
2. Removing punctuation and stop words, i.e. very common words that carry little meaning, such as 'the' and 'and'. This step is important because we don't want these words to receive high scores and be treated as important parts of the sentence.
3. Converting all words to lower case, which also makes lemmatization easier.
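The punctuation, stop-word and lowercasing steps can be sketched with the standard library alone. The stop-word set `TOY_STOP_WORDS` and the helper `clean` are hypothetical and deliberately tiny; a real list, such as the one shipped with NLTK, is much longer.

```python
import string

# Hypothetical, deliberately small stop-word list for illustration only.
TOY_STOP_WORDS = {"the", "and", "a", "is", "was", "to", "of"}

def clean(text):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punctuation.lower().split()
    return [token for token in tokens if token not in TOY_STOP_WORDS]

cleaned = clean("The pasta was GREAT, and the service is fast!")
print(cleaned)  # ['pasta', 'great', 'service', 'fast']
```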


# CODE

In [None]:
# Imports
import pandas as pd

In [171]:
# Data Loading:
df = pd.read_csv('https://raw.githubusercontent.com/m-braverman/ta_dm_course_data/master/train3.csv')

4. Prepare the text for the classifier.

In [None]:
import nltk
import string
import re
from copy import deepcopy
from nltk.tokenize import  word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires NLTK data: run nltk.download('stopwords'), nltk.download('punkt') and nltk.download('wordnet') once before use
stopwords = set(nltk.corpus.stopwords.words('english') + ['reuter', '\x03'])
text_column = 'review_text'
lemmatizer = WordNetLemmatizer()
table = str.maketrans('', '', string.punctuation)

def pre_process(row, column_name):
    """Return the cleaned, lemmatized text of one review as a single string."""
    # Strings are immutable, so no copy is needed
    text = row[column_name]

    # Remove punctuation
    text = text.translate(table)

    # Replace numbers with 'num'
    text = re.sub(r'\d+', 'num', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Lowercase, drop stop words, and lemmatize the remaining words
    cleaned_words = []
    for word in tokens:
        word = word.lower()
        if word not in stopwords:
            cleaned_words.append(lemmatizer.lemmatize(word))

    return " ".join(cleaned_words)

df[f"final_{text_column}"] = df.apply(lambda row: pre_process(row, text_column), axis=1)

# Convert to feature vector
feature_extraction = TfidfVectorizer()
X = feature_extraction.fit_transform(df[f"final_{text_column}"].values)

df



5. Split the data into train and validation sets.

In [173]:

from sklearn.model_selection import train_test_split
import numpy as np



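# Split row positions once so the DataFrame rows and the feature-matrix rows stay aligned
train_idx, validation_idx = train_test_split(np.arange(len(df)), test_size=0.3)

train, validation = df.iloc[train_idx], df.iloc[validation_idx]

X_train = X[train_idx]
y_train = df['business_category'].iloc[train_idx]

X_validation = X[validation_idx]
y_validation = df['business_category'].iloc[validation_idx]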




6. Create and train a classifier of your choosing:

In [None]:


# train classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)


7. Predict on the validation set you set aside previously:

In [175]:

pred = clf.predict(X_validation)


8. Calculate and display the accuracy, precision, recall and F1 score on the validation set:

In [261]:
from sklearn import metrics
from sklearn.metrics import f1_score,precision_score, recall_score

confusion_matrix = metrics.confusion_matrix(y_validation, pred)
f1 = f1_score(y_validation, pred, average="macro")
precision = precision_score(y_validation, pred, average="macro")
recall = recall_score(y_validation, pred, average="macro")

print(f"Accuracy is - {np.mean(pred == y_validation) * 100}%")
print(f"F1 score is - {f1}")
print(f"Precision score is - {precision}")
print(f"Recall score is - {recall}")
print("Confusion Matrix is - ")
pd.DataFrame(confusion_matrix)



Accuracy is - 79.51959544879898%
F1 score is - 0.7366708072629126
Precision score is - 0.8357003746606698
Recall score is - 0.7374508021174284
Confusion Matrix is - 


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 195 | 27 | 4 |
| 1 | 2 | 363 | 0 |
| 2 | 52 | 77 | 71 |
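The reported scores can be re-derived by hand from the confusion matrix above. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, and computes accuracy plus macro-averaged precision, recall and F1 (the unweighted mean of the per-class values).

```python
# Confusion matrix copied from the output above.
# Assumption: rows are true classes, columns are predicted classes.
cm = [
    [195, 27, 4],
    [2, 363, 0],
    [52, 77, 71],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

precisions, recalls, f1s = [], [], []
for i in range(n):
    true_positives = cm[i][i]
    predicted_as_i = sum(cm[r][i] for r in range(n))  # column sum
    actually_i = sum(cm[i])                           # row sum
    p = true_positives / predicted_as_i
    r = true_positives / actually_i
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n

print(f"accuracy={accuracy:.4f} precision={macro_precision:.4f} "
      f"recall={macro_recall:.4f} f1={macro_f1:.4f}")
# accuracy=0.7952 precision=0.8357 recall=0.7375 f1=0.7367
```

The per-class values show where the errors are: the class at index 2 has a recall of only 71/200 = 0.355, which pulls the macro-averaged recall and F1 below the overall accuracy.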


9. Why do we use validation?

We use a validation set because we need to evaluate the model on data it did not see during training in order to estimate its real performance. If we evaluate the model on the training data, the results will look too good, since an overfitted model can simply memorize that data, and we won't know how the model performs on new reviews.
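A toy illustration of this point, as a sketch with made-up data: the hypothetical `MemorizingClassifier` below stores every training review verbatim and falls back to the majority class for anything it has not seen. It is perfect on the training set and much worse on held-out reviews, which is exactly what a validation set is meant to reveal.

```python
from collections import Counter

# Hypothetical toy data: (review text, category) pairs.
train = [
    ("great pasta and wine", "restaurant"),
    ("lovely massage and spa", "beauty"),
    ("tasty burger and fries", "restaurant"),
    ("cheap shoes on sale", "shopping"),
]
validation = [
    ("relaxing spa day", "beauty"),
    ("new shoes and bags", "shopping"),
    ("tasty pasta", "restaurant"),
]

class MemorizingClassifier:
    """Memorizes training texts; predicts the majority class for unseen text."""

    def fit(self, examples):
        self.lookup = dict(examples)
        labels = Counter(label for _, label in examples)
        self.default = labels.most_common(1)[0][0]
        return self

    def predict(self, text):
        return self.lookup.get(text, self.default)

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(model.predict(text) == label for text, label in examples) / len(examples)

model = MemorizingClassifier().fit(train)
train_acc = accuracy(model, train)
val_acc = accuracy(model, validation)
print(f"train accuracy: {train_acc:.2f}")       # 1.00
print(f"validation accuracy: {val_acc:.2f}")    # 0.33
```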

## LIME
LIME is used to explain what machine learning classifiers (or models) are doing.

In this part you'll be using LIME to gain a deeper understanding of *WHY* the classifier decided to classify a review as a particular category. 
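The intuition behind perturbation-based explanations can be sketched without the `lime` package. The code below is not LIME itself: LIME samples many random perturbations of the text and fits a locally weighted linear model to the classifier's outputs, whereas this simplified leave-one-word-out version removes one word at a time and records how much the probability of a class drops. The word weights in `TOY_WEIGHTS` and the functions `predict_proba` and `word_importance` are hypothetical stand-ins for a trained classifier.

```python
import math

CLASS_NAMES = ["Beauty", "Restaurant", "Shopping"]

# Hypothetical per-class word weights standing in for a trained classifier.
TOY_WEIGHTS = {
    "spa": [2.0, 0.0, 0.0],
    "massage": [1.5, 0.0, 0.0],
    "pasta": [0.0, 2.0, 0.0],
    "menu": [0.0, 1.5, 0.0],
    "shoes": [0.0, 0.0, 2.0],
}

def predict_proba(text):
    """Softmax over summed word weights; unknown words contribute nothing."""
    scores = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for k, weight in enumerate(TOY_WEIGHTS.get(word, [0.0, 0.0, 0.0])):
            scores[k] += weight
    exps = [math.exp(score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_importance(text, class_index):
    """Drop each word in turn and record how much the class probability falls."""
    words = text.split()
    base = predict_proba(text)[class_index]
    importance = {}
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        importance[word] = base - predict_proba(reduced)[class_index]
    return importance

review = "the spa massage was near a pasta place"
scores = word_importance(review, CLASS_NAMES.index("Beauty"))
for word, delta in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {delta:+.3f}")
```

A positive value means the word pushes the prediction toward the chosen class, and a negative value means it pushes away from it. Here "spa" and "massage" support Beauty, while "pasta" counts against it.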

In [None]:
! pip install lime

10. Create a LIME explainer:

In [250]:

from sklearn.pipeline import make_pipeline
from random import sample

# c = make_pipeline(feature_extraction, clf)
# print(c.predict_proba([validation[f"final_{text_column}"]]))
# Class names must follow the order of clf.classes_ (scikit-learn sorts labels alphabetically)
class_names = ['Beauty', 'Restaurant', 'Shopping']
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

11. Use the LIME explainer to explain the reviews at the generated indices (run the random generator once). Display the results in this notebook; the explanation should be present for all classes.

In [258]:
# Random index generator:
samples = sample(range(1,200), 3)

c = make_pipeline(feature_extraction,clf)

for random_index in samples:
    random_variable = list(validation["final_review_text"])[random_index]
    exp = explainer.explain_instance(random_variable, c.predict_proba, num_features=6)
    probabilities = c.predict_proba([random_variable])
    print('Document id: %d' % random_index)
    print('Probability(Beauty) =', probabilities[0,0])
    print('Probability(Restaurant) =', probabilities[0,1])
    print('Probability(Shopping) =', probabilities[0,2])
    print('True class: %s' % list(validation["business_category"])[random_index])

    print(exp.as_list())

Document id: 72
Probability(Beauty) = 0.0016778360183894163
Probability(Restaurant) = 0.9972963655636207
Probability(Shopping) = 0.0010257984179918363
True class: Restaurant
[('ordered', 0.01006996042832716), ('pasta', 0.00844659031359341), ('meal', 0.008101461552225232), ('beef', 0.007969118078242661), ('menu', 0.0076294100923698805), ('terroni', -0.006585050388098162)]
Document id: 134
Probability(Beauty) = 0.974411137263074
Probability(Restaurant) = 0.022435006412692738
Probability(Shopping) = 0.003153856324238081
True class: Beauty
[('spa', -0.045087946827372785), ('treatment', -0.03704682727963171), ('appointment', -0.030080710348156334), ('massage', -0.028154534987072915), ('facility', -0.02147455329709962), ('room', -0.020320916247461168)]
Document id: 65
Probability(Beauty) = 0.9024222859258673
Probability(Restaurant) = 0.08766692978168626
Probability(Shopping) = 0.00991078429245204
True class: Beauty
[('pool', -0.09074446445045903), ('tower', -0.038811260935328766), ('room', -0.0