<a href="https://colab.research.google.com/github/kiranbkulkarni/Data_Explorer/blob/master/19200530_Assignment_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COMP47670 Assignment 2: Text Classification

**Presented By**: Kiran Kulkarni(*19200530*)

### Overview:  
The objective of this assignment is to scrape consumer reviews from a set of web pages and evaluate the performance of text classification on the data. The reviews have been divided into five categories here:
    * http://mlg.ucd.ie/modules/yalp *
Each review has a star rating. For this assignment, we will assume that 1-star to 3-star reviews are “negative”, and 4-star to 5-star reviews as “positive”.


The following tasks were completed as part of the assignment

* Task 1: To select three categories from the above mentioned base URL. To scrape all the reviews for each cartegory ans store them as three seperate datasets. To segregate the reviews based on the ratings and to provide a class label.(i.e., Positive or Negative)

* Task 2: For the three category datasets: 
    * a.	From the reviews in this category, apply appropriate preprocessing steps to create a numeric representation of the data, suitable for classification.
    * b.	Build a classification model using a classifier of your choice, to distinguish between “positive” and “negative” reviews.
    * c.	Test the predictions of the classification model using an appropriate evaluation strategy. Report and discuss the evaluation results in your notebook.

* 3.	Evaluate the performance of each of your three classification models when applied to data from the other two selected categories. That is, for each unique pair of selected categories (A,B), run the experiments:
    * a.	Train a classification model on the data from “Category A”, and evaluate its performance on the data from “Category B”.
    * b.	Train a classification model on the data from “Category B”, and evaluate its performance on the data from “Category A”.
    
    
#### GUIDELINES: 
For the assignment, only these third-party packages were used: NumPy, Pandas, Scikit-learn, NLTK, SciPy, Requests, BeautifulSoup, Matplotlib, Seaborn, Gensim


### TASK 1: Scraping for reviews

Before we begin the task 1, few necessary python packages should be installed on the machine. Some machines does have them installed and some don't.

In [0]:
#Installing the necessary packages

#!pip install beautifulsoup4
#!pip install lxml
#!pip install langdetect
#!pip install requests
#!pip intsall gensim
#!pip install seaborn
#!pip install nltk

Once the necessary packges are installed, we should import them to our project.

In [0]:
#import nltk - natural language toolkit
import nltk

#import the beautifulsoup 4 package for scraping the websites for reviews.
import bs4 as bs

#import the request package for hanfling http requests.
import urllib.request

#import pandas to hangle the panel data and store the scraped data into dataframes.
import pandas as pd

#import datetime. Not relevant to this project. 
from datetime import datetime

#import numpy as np. To handle numerical data.
import numpy as np

#import regular expression
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

from sklearn.linear_model import LogisticRegression

#accuracy using accuracy_score
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

After installing NLTK package, a few sub modules needs to be downloaded. Using the following lines of code, 'punkt', 'stopwords' are downloadedm

In [0]:
#download the stopwords subpackage of nltk

#nltk.download ('stopwords')

#download the punkt subpackage of nltk

#nltk.download ('punkt')

#download all packages in nltk

#nltk.download()

To begin with, I have chosen Automotive, Cafe, and Fashion as my three choices of category. This choice is arbitrary. Choosing the categories which are more similar in nature(i.e., cafes and restaurants) could add bias to our Machine Learning model. So, keeping this in mind I have taken categories that are not closely related. 

### Scrape for data

In order tp scrape the data from the url, I have written a function to each category which takes a url as input and returns a data frame containing the requested data. This function is more specific the kind of data  we are trying to scrape. It cannot be generalised.

In [0]:

# A function to scrape data from the base url
def scrape_data(url): # url of each category
    df_category = pd.DataFrame() #initialise an empty dataframe to hold the data
    review_rating = [] # a list to hold the review text for each review
    review_text = [] # a list to hold review rating for each review

    sauce_category = urllib.request.urlopen(url).read() # passing the category url to create a sauce object
    soup_category = bs.BeautifulSoup(sauce_category, 'lxml') # passing the sauce object to create a soup object
    for item_url in soup_category.find_all('a'): # for loop to iterate overate each url for each shop.
        if item_url.get('href') == "index.html": # an if condition to skip the index.html
            continue # to skip and continue to next item
        sauce_n = urllib.request.urlopen('http://mlg.ucd.ie/modules/yalp/'+item_url.get('href')).read() #passing the item_url to create a item_sauce object
        soup_n = bs.BeautifulSoup(sauce_n, 'html.parser') #passing the item_suace object to create a item_soup object
        for review_containers in soup_n.find_all("div", {"class": "review"}): # for lopp to iterate over each review
            review_rating.append(review_containers.find_all("p", {"class": "rating"})[0].find_all('img')[0]["alt"]) #extracting review rating and appending it to the review_rating list
            review_text.append(review_containers.find_all("p", {"class": "review-text"})[0].text) # extracting the review_text and appending it to the review_text list
        
    df_category['review_rating'] = review_rating #assigning the review rating to data frame
    df_category['review_text'] = review_text #assigning the review text to data frame 
    
    return df_category # return the data in the form of a dataframe.

In [0]:
cafe_url = 'http://mlg.ucd.ie/modules/yalp/cafes_list.html' # a variable to hold the cafe url
auto_url = 'http://mlg.ucd.ie/modules/yalp/automotive_list.html' # a variable to hold automotive url
fash_url = 'http://mlg.ucd.ie/modules/yalp/fashion_list.html' # a variable to hold fashion url

df_cafe = scrape_data(cafe_url) # a dataframe to hold the cafe data
df_auto = scrape_data(auto_url) # a dataframe to hold the automotive data
df_fash = scrape_data(fash_url) # a dataframe to hold the fashion data

Storing the data into a csv.

In [0]:
df_cafe.to_csv('cafe_reviews.csv') # saving data frame into a csv file
df_auto.to_csv('automotive_reviews.csv')
df_fash.to_csv('fashion_reviews.csv')

### Text pre-processing and assigning the traget label

Before beginning the text pre-processing, let's first clean the collected data

In [0]:
df_cafe['review_rating'] = df_cafe['review_rating'].str.split('-', expand=True)[0].astype(int) # split function applied to extract the rating(numeric part) from rating text 
df_auto['review_rating'] = df_auto['review_rating'].str.split('-', expand=True)[0].astype(int)
df_fash['review_rating'] = df_fash['review_rating'].str.split('-', expand=True)[0].astype(int)

In [0]:
print(df_cafe.head()) # to confirm the changes in previous step have been applied

print(type(df_cafe.review_rating[0]))

   review_rating                                        review_text
0              4  Pros: Lots of items you would not expect from ...
1              4  Best egg-tarts in town! There's really not muc...
2              2  I've been to ABC Bakery a few times since I re...
3              1  FYI, Closed Monday's New ownership for about 1...
4              4  The inside may not look like much but they mak...
<class 'numpy.int32'>


### To classify the reviews based on the rating. If the rating is below 3-star it's considered as Negative else it's Positive.

In [0]:

# a funtion to classify the review text based on the rating
def classify_review(df):
    df['class_label'] = np.where(df['review_rating']>=4, 'Positive', 'Negative') # if the rating is equal to or greater than 4, it's classified as Postive else Negative
    return df


In [0]:
classify_review(df_cafe) #applying classify_review() on each category
classify_review(df_auto)
classify_review(df_fash)

Unnamed: 0,review_rating,review_text,class_label
0,5,Looking for the best tactical supplies? Look n...,Positive
1,1,Stood in line like an idiot for 5 minutes to p...,Negative
2,4,Another great store with quality Equipment. Th...,Positive
3,5,The Problem with this store is not that they h...,Positive
4,5,Great place! We went in at almost closing time...,Positive
...,...,...,...
1995,4,"God, I'd never thought I'd see the day when I'...",Positive
1996,1,They keep shooting themselves in the foot. Apo...,Negative
1997,1,"Extremely dark., so dark you can't see the out...",Negative
1998,1,"This place is dark, loud, and filled with enou...",Negative


In [0]:
df_cafe.sample() #checking the sample

Unnamed: 0,review_rating,review_text,class_label
1960,3,If you want a smoothie this is a great place. ...,Negative


In [0]:

# function to to do the first level of cleaning the data.  
def text_preprocess(review_text):
    #Make text lower, remove text in square brackets, remove punctuation and remove words.
    review_text = review_text.lower()
    review_text = re.sub('\[.*?\]', '', review_text)
    review_text = re.sub('[%s]' % re.escape(string.punctuation), '', review_text)
    review_text = re.sub('\w*\d\w*', '', review_text)
    review_text = re.sub('[''""...]', '', review_text)
    review_text = re.sub('\n', '', review_text)
    
    return review_text

#storing the reference to the function in a lambda
pre_process = lambda txt: text_preprocess(txt)

In [0]:
df_cafe_clean = df_cafe #copying the raw data into a new dataframe
df_cafe_clean.review_text = df_cafe_clean.review_text.apply(pre_process)

df_auto_clean = df_auto #copying the raw data into a new dataframe
df_auto_clean.review_text = df_auto_clean.review_text.apply(pre_process)

df_fash_clean = df_fash #copying the raw data into a new dataframe
df_fash_clean.review_text = df_fash_clean.review_text.apply(pre_process)

In [0]:
df_cafe_clean

Unnamed: 0,review_rating,review_text,class_label
0,4,pros lots of items you would not expect from a...,Positive
1,4,best eggtarts in town theres really not much t...,Positive
2,2,ive been to abc bakery a few times since i rea...,Negative
3,1,fyi closed mondays new ownership for about we...,Negative
4,4,the inside may not look like much but they mak...,Positive
...,...,...,...
1995,3,i hate to be one of those obnoxious people who...,Negative
1996,3,its always a stop here for me either for a qui...,Negative
1997,3,it is nice to go there if youd like to go shop...,Negative
1998,3,my girlfriend and i had lunch there a few days...,Negative


In [0]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

#from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
# CountVectorizer(ngram_range=(min_n, max_n)), specify the n-grams required
vect = CountVectorizer(ngram_range=(1,3),max_features=None, stop_words='english')
# Fit the vectorizer
vect.fit(df_cafe_clean.review_text)

# Transform the review column
X_cafe_review = vect.transform(df_cafe_clean.review_text)

# Create the bow representation
X_df=pd.DataFrame(X_cafe_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   aaa  aaa meat  aaa meat tender  aaa ribeye  aaa ribeye steak  aaand  \
0    0         0                0           0                 0      0   
1    0         0                0           0                 0      0   
2    0         0                0           0                 0      0   
3    0         0                0           0                 0      0   
4    0         0                0           0                 0      0   

   aaand im  aaand im dead  aau  aau basketball  ...  été accueillis par  \
0         0              0    0               0  ...                   0   
1         0              0    0               0  ...                   0   
2         0              0    0               0  ...                   0   
3         0              0    0               0  ...                   0   
4         0              0    0               0  ...                   0   

   été décu  été décu le  été très  été très bien  été époustouflée  \
0         0            0   

In [0]:
# Import the required vectorizer package and stop words list
from sklearn.feature_extraction.text import TfidfVectorizer,ENGLISH_STOP_WORDS

# Define the vectorizer and specify the arguments
#my_pattern = r'\b[^\d\W][^\d\W]+\b'
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=None,stop_words=ENGLISH_STOP_WORDS).fit(df_cafe_clean.review_text)

# Transform the vectorizer
X_txt = vect.transform(df_cafe_clean.review_text)

# Transform to a data frame and specify the column names
X=pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names())
print('Top 5 rows of the DataFrame: ', X.head())

Top 5 rows of the DataFrame:     aaa  aaa meat  aaa ribeye  aaand  aaand im  aau  aau basketball   ab  \
0  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
1  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
2  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
3  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
4  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   

   ab eggr  aback  ...  étoile plus  étrangement  étrangement leau  été  \
0      0.0    0.0  ...          0.0          0.0               0.0  0.0   
1      0.0    0.0  ...          0.0          0.0               0.0  0.0   
2      0.0    0.0  ...          0.0          0.0               0.0  0.0   
3      0.0    0.0  ...          0.0          0.0               0.0  0.0   
4      0.0    0.0  ...          0.0          0.0               0.0  0.0   

   été accueillis  été décu  été très  été époustouflée  üppige  \
0

In [0]:
from sklearn.linear_model import LogisticRegression

#accuracy using accuracy_score
from sklearn.metrics import accuracy_score

In [0]:
# Import the required vectorizer package and stop words list
from sklearn.feature_extraction.text import TfidfVectorizer,ENGLISH_STOP_WORDS

# Define the vectorizer and specify the arguments
#my_pattern = r'\b[^\d\W][^\d\W]+\b'
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=None,stop_words=ENGLISH_STOP_WORDS).fit(df_cafe_clean.review_text)

# Transform the vectorizer
X_txt = vect.transform(df_cafe_clean.review_text)

# Transform to a data frame and specify the column names
X=pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names())
print('Top 5 rows of the DataFrame: ', X.head())

Top 5 rows of the DataFrame:     aaa  aaa meat  aaa ribeye  aaand  aaand im  aau  aau basketball   ab  \
0  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
1  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
2  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
3  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
4  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   

   ab eggr  aback  ...  étoile plus  étrangement  étrangement leau  été  \
0      0.0    0.0  ...          0.0          0.0               0.0  0.0   
1      0.0    0.0  ...          0.0          0.0               0.0  0.0   
2      0.0    0.0  ...          0.0          0.0               0.0  0.0   
3      0.0    0.0  ...          0.0          0.0               0.0  0.0   
4      0.0    0.0  ...          0.0          0.0               0.0  0.0   

   été accueillis  été décu  été très  été époustouflée  üppige  \
0

In [0]:
y = df_cafe_clean.class_label

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X, y)
print('Accuracy of logistic regression: ', log_reg.score(X, y))



Accuracy of logistic regression:  0.855


In [0]:
# Create an array of prediction
y_predict = log_reg.predict(X)

# Print the accuracy using accuracy score
print('Accuracy of logistic regression: ', accuracy_score(y, y_predict))

Accuracy of logistic regression:  0.855


In [0]:
from sklearn.model_selection import train_test_split

In [0]:
y = df_fashion_clean_en.class_label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123, test_size=0.2, stratify= y)

In [0]:

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X_train, y_train)
print('Accuracy of training data: ', log_reg.score(X_train, y_train))



Accuracy of training data:  0.826875


In [0]:
print('Accuracy on Testing data', log_reg.score(X_test, y_test))

Accuracy on Testing data 0.75


In [0]:
y_predicted = log_reg.predict(X_test)
print('Accuracy score on Test data', accuracy_score(y_test, y_predicted))

Accuracy score on Test data 0.75


In [0]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_test, y_predicted)/len(y_test))

[[0.02 0.25]
 [0.   0.73]]


In [0]:
def vectorise_reviews(df_category):
    vect = TfidfVectorizer(ngram_range=(1, 2), max_features=None,stop_words=ENGLISH_STOP_WORDS).fit(df_category.review_text)

    # Transform the vectorizer
    X_txt = vect.transform(df_category.review_text)

    # Transform to a data frame and specify the column names
    X=pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names())
    print('Top 5 rows of the DataFrame: ', X.head())
    
    return X

In [0]:
vectorised_cafe = vectorise_reviews(df_cafe_clean)

Top 5 rows of the DataFrame:     aaa  aaa meat  aaa ribeye  aaand  aaand im  aau  aau basketball   ab  \
0  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
1  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
2  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
3  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
4  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   

   ab eggr  aback  ...  étoile plus  étrangement  étrangement leau  été  \
0      0.0    0.0  ...          0.0          0.0               0.0  0.0   
1      0.0    0.0  ...          0.0          0.0               0.0  0.0   
2      0.0    0.0  ...          0.0          0.0               0.0  0.0   
3      0.0    0.0  ...          0.0          0.0               0.0  0.0   
4      0.0    0.0  ...          0.0          0.0               0.0  0.0   

   été accueillis  été décu  été très  été époustouflée  üppige  \
0

In [0]:
def fit_and_predict(df_c):
    vectorised_reviews_X = vectorise_reviews(df_c)
    target_y = df_c.class_label

    X_train, X_test, y_train, y_test = train_test_split(vectorised_reviews_X, target_y, random_state=123, test_size=0.2, stratify= y)

    # Build a logistic regression model and calculate the accuracy
    log_reg = LogisticRegression().fit(X_train, y_train)
    print('Accuracy of training data: ', log_reg.score(X_train, y_train))
    print('Accuracy on Testing data', log_reg.score(X_test, y_test))
    y_predicted = log_reg.predict(X_test)
    print('Accuracy score on Test data', accuracy_score(y_test, y_predicted))
    from sklearn.metrics import confusion_matrix

    print(confusion_matrix(y_test, y_predicted)/len(y_test))

In [0]:
fit_and_predict(df_cafe_clean)

Top 5 rows of the DataFrame:     aaa  aaa meat  aaa ribeye  aaand  aaand im  aau  aau basketball   ab  \
0  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
1  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
2  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
3  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   
4  0.0       0.0         0.0    0.0       0.0  0.0             0.0  0.0   

   ab eggr  aback  ...  étoile plus  étrangement  étrangement leau  été  \
0      0.0    0.0  ...          0.0          0.0               0.0  0.0   
1      0.0    0.0  ...          0.0          0.0               0.0  0.0   
2      0.0    0.0  ...          0.0          0.0               0.0  0.0   
3      0.0    0.0  ...          0.0          0.0               0.0  0.0   
4      0.0    0.0  ...          0.0          0.0               0.0  0.0   

   été accueillis  été décu  été très  été époustouflée  üppige  \
0



Accuracy of training data:  0.826875
Accuracy on Testing data 0.75
Accuracy score on Test data 0.75
[[0.02 0.25]
 [0.   0.73]]


In [0]:
fit_and_predict(df_auto_clean)

Top 5 rows of the DataFrame:      aa  aa auto  aaa  aaa allied  aaa approved  aaa aproval  aaa came  \
0  0.0      0.0  0.0         0.0           0.0          0.0       0.0   
1  0.0      0.0  0.0         0.0           0.0          0.0       0.0   
2  0.0      0.0  0.0         0.0           0.0          0.0       0.0   
3  0.0      0.0  0.0         0.0           0.0          0.0       0.0   
4  0.0      0.0  0.0         0.0           0.0          0.0       0.0   

   aaa car  aaa card  aaa member  ...  zsofia summer  zullo  zullo internet  \
0      0.0       0.0         0.0  ...            0.0    0.0             0.0   
1      0.0       0.0         0.0  ...            0.0    0.0             0.0   
2      0.0       0.0         0.0  ...            0.0    0.0             0.0   
3      0.0       0.0         0.0  ...            0.0    0.0             0.0   
4      0.0       0.0         0.0  ...            0.0    0.0             0.0   

   zullo tenaflynj  électroniques  électroniques ont  ét



Accuracy of training data:  0.979375
Accuracy on Testing data 0.8275
Accuracy score on Test data 0.8275
[[0.2325 0.1575]
 [0.015  0.595 ]]


In [0]:
fit_and_predict(df_fash_clean)

Top 5 rows of the DataFrame:     aaa  aaa specific  aahed  aahed cars  aaron  aaron athletic  \
0  0.0           0.0    0.0         0.0    0.0             0.0   
1  0.0           0.0    0.0         0.0    0.0             0.0   
2  0.0           0.0    0.0         0.0    0.0             0.0   
3  0.0           0.0    0.0         0.0    0.0             0.0   
4  0.0           0.0    0.0         0.0    0.0             0.0   

   aaron recommend  aaron ross  aaron salesmans  abandoned  ...  zurich  \
0              0.0         0.0              0.0        0.0  ...     0.0   
1              0.0         0.0              0.0        0.0  ...     0.0   
2              0.0         0.0              0.0        0.0  ...     0.0   
3              0.0         0.0              0.0        0.0  ...     0.0   
4              0.0         0.0              0.0        0.0  ...     0.0   

   zurich surprise  échange  échange et  écouté  écouté son  également  \
0              0.0      0.0         0.0     0.0 



Accuracy of training data:  0.96875
Accuracy on Testing data 0.8325
Accuracy score on Test data 0.8325
[[0.2    0.1525]
 [0.015  0.6325]]


In [0]:
def predict_on_test(cat_a, cat_b, cat_c):
    vect = TfidfVectorizer(ngram_range=(1, 2), max_features=None,stop_words=ENGLISH_STOP_WORDS).fit(cat_a.review_text)

    # Transform the vectorizer
    X_cat_a_txt = vect.transform(cat_a.review_text)
    X_cat_b_txt = vect.transform(cat_b.review_text)
    X_cat_c_txt = vect.transform(cat_c.review_text)

    # Transform to a data frame and specify the column names
    X_cat_a_features = pd.DataFrame(X_cat_a_txt.toarray(), columns=vect.get_feature_names())
    X_cat_b_features = pd.DataFrame(X_cat_b_txt.toarray(), columns=vect.get_feature_names())
    X_cat_c_features = pd.DataFrame(X_cat_c_txt.toarray(), columns=vect.get_feature_names())
    #print('Top 5 rows of the DataFrame: ', X_cafe_features.head())
    #print('Top 5 rows of the DataFrame: ', X_auto_features.head())
    #print('Top 5 rows of the DataFrame: ', X_fash_features.head())
    X_cat_a_train = X_cat_a_features
    y_cat_a_train = cat_a['class_label']


    # Build a logistic regression model and calculate the accuracy
    log_reg = LogisticRegression().fit(X_cat_a_train, y_cat_a_train)

    X_test_cat_b_features = X_cat_b_features
    y_test_cat_b = cat_b['class_label']

    X_test_cat_c_features = X_cat_c_features
    y_test_cat_c = cat_c['class_label']


    y_predict_cat_b = log_reg.predict(X_test_cat_b_features)
    y_predict_cat_c = log_reg.predict(X_test_cat_c_features)

    accuracy_cat_b = accuracy_score(y_test_cat_b, y_predict_cat_b)
    accuracy_cat_c = accuracy_score(y_test_cat_c, y_predict_cat_c)


    print(accuracy_cat_b)
    print(accuracy_cat_c)


In [0]:
predict_on_test(df_cafe_clean, df_auto_clean, df_fash_clean)



0.7645
0.736


In [0]:
predict_on_test(df_auto_clean, df_fash_clean, df_cafe_clean)



0.8265
0.817


In [0]:
predict_on_test(df_fash_clean, df_cafe_clean, df_auto_clean)



0.8345
0.8735
