## CSE 151A Project: Predicting Restaurant Price Categories Based on Review Data

Abstract: As food prices continue to surge, amateur and seasoned foodies are left to question: what qualities or features lead a restaurant to establish their prices? Over the next 6 weeks we plan to answer this question. By leveraging the CSE Google Review dataset, we will (A) create a machine learning model that accurately predicts the price bracket of a restaurant, and (B) determine the statistical significance of each input feature. Specifically, we will train a regression model on sklearn, on engineered features such as geolocation and sentiment derived from text reviews. Since many features of a restaurant such as their food suppliers  are not public knowledge, the logic behind restaurant prices is often unseen by its customers. We hope that with the use of feature engineering and machine learning, we can encourage more clarity on what differentiates inexpensive from expensive restaurants.


# Library Imports

In [None]:
# Add imports here as needed
import pandas as pd
import numpy as np
import json
import urllib
import requests
import gzip
import ast
from collections import Counter
import re

# Sentiment analysis
# !pip install vaderSentiment
# !pip install install googletrans==3.1.0a0
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Geographic object imports
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
from sklearn.feature_extraction.text import TfidfVectorizer
from googletrans import Translator
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.naive_bayes import CategoricalNB
# from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
# import our svm libraries
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from google.colab import files
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
import imblearn
from imblearn.over_sampling import RandomOverSampler
# from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, make_scorer
from imblearn.over_sampling import SMOTE
# pip install keras_tuner
# pip install scikeras
import keras
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasClassifier


In [None]:
import gensim.downloader as api

wv = api.load('word2vec-google-news-300') #Download Google API to compute word similarity to
#Download time: 11 minutes



# Load Preprocessed Data in Database

### Loading Place Data
- Filter by prices are not None; standardize prices by $; one-hot encode
- Get geographical location

Load data:

In [None]:
data_link = 'https://datarepo.eng.ucsd.edu/mcauley_group/data/googlelocal/places.clean.json.gz'

In [None]:
def load_place(line, place_list):
  #Read a line, remove parsing characters like u"
  place = line.decode('utf-8').strip().replace('u"', '"')

  #Place is now a dictionary
  place = ast.literal_eval(place)

  if (place['price'] is not None) and (place['price'] != '         '):
    #Remove unnecessary keys (hours, phone)
    place.pop('hours', None)
    place.pop('phone', None)

    #Append place to the list
    place_list.append(place)

In [None]:
place_list = [] #Store in empty list

response = urllib.request.urlopen(data_link) #Open the file at the link

with gzip.open(response, 'rb') as f:
  for line in f:
    load_place(line, place_list=place_list)

#Time to load: 5m 41s
#You may have to restart your runtime to get 5m runtime! If it's 8m>, something is wrong -- you should try again. Make sure your RAM graph looks accurate/fine.

In [None]:
places = pd.DataFrame(place_list) #Converting to dataframe #Time = 9s

Standardize price from foreign currency to USD

In [None]:
def convert_to_usd(symbol):
  conversion = { '$': 1, '‚Ç¨': 1.08, '¬£': 1.26, '‚Ç¥': 0.027, '‚Ç©':0.00075, '‡∏ø':0.028, 'R':0.052, '‚Ç±':0.01788}
  if len(symbol) > 0 and symbol[0] in conversion:
    return 10 * len(symbol) * conversion[symbol[0]]
  else:
    return None

def convert_to_symbol(dollars):
  # $ = Inexpensive, usually $10 and under
  # $$ = Moderately expensive, usually between $10-$25
  # $$$ = Expensive, usually between $25-$45
  # $$$$ = Very Expensive, usually $50 and up'
  if dollars < 10:
    return '$'
  elif dollars < 25:
    return '$$'
  elif dollars < 45:
    return '$$$'
  return '$$$$'

In [None]:
places['price'] = places['price'].apply(convert_to_usd) #Convert into USD
places['price'] = places['price'].apply(convert_to_symbol) #Convert into $, $$, $$$, etc.

In [None]:
places['price'].value_counts()

$$      209259
$$$     192061
$$$$      2543
$          821
Name: price, dtype: int64

One-Hot Encode Price:

In [None]:
places = pd.get_dummies(places, columns=['price'], prefix='Class')

Get geographical objects and save to `places`

In [None]:
df_geo = pd.DataFrame(places[['gPlusPlaceId', 'address']])
df_geo = df_geo.assign(latitude = places[places['gps'].notna()]['gps'].apply(lambda x: x[0]))
df_geo = df_geo.assign(longitude = places[places['gps'].notna()]['gps'].apply(lambda x: x[1]))

#Filter for only not-null
df_geo = df_geo[df_geo['latitude'].notna()]

#Get location kmeans cluster
## can adjust k based on needs
kmeans = KMeans(n_clusters = 200, random_state=42)
df_geo['location_cluster'] = kmeans.fit_predict(df_geo[['latitude', 'longitude']])
#Runtime: 1m34s



In [None]:
#Add location_cluster to the places database
places['location_cluster'] = pd.Series(df_geo.location_cluster).astype(int)

In [None]:
places.head()

Unnamed: 0,name,address,closed,gPlusPlaceId,gps,Class_$,Class_$$,Class_$$$,Class_$$$$,location_cluster
0,T C's Referee Sports Bar,"[5322 W 26th St, Sioux Falls, SD 57106]",False,100327153115986850675,"[43.529494, -96.792244]",0,1,0,0,0.0
1,Old Chicago,"[17960 NW Evergreen Pkwy, Beaverton, OR 97006]",False,118222137795476771294,"[45.535176, -122.862242]",0,1,0,0,0.0
2,China Cottage,"[3718 Wilmington Pike, Dayton, OH 45429]",False,106432060150136868000,"[39.692899, -84.136173]",0,1,0,0,0.0
3,Smokey Mountain Wings,"[3607 Outdoor Sportsman Pl, Kodak, TN 37764]",False,100184392614713668281,"[35.98598, -83.610598]",0,1,0,0,0.0
4,Sabatinos Italian Kitchen,"[242 Massachusetts Ave, Arlington, MA 02474]",False,110300304875024740707,"[42.406904, -71.143994]",0,0,1,0,0.0


### Loading Review Data
- Group by category
- Get rid of non-restaurant google locations
- Translate reviews

Load data:

In [None]:
data_link = 'https://datarepo.eng.ucsd.edu/mcauley_group/data/googlelocal/reviews.clean.json.gz'

In [None]:
def load_review(line, review_list):
  #Read a line, remove parsing characters like /n, 'b, u"
  review = line.decode('utf-8').strip().replace("u'", "'")

  #Place is now a dictionary
  review = ast.literal_eval(review)

  if review['gPlusPlaceId'] is not None:
    #Remove unnecessary keys (hours, phone)
    review.pop('reviewerName', None)
    review.pop('phone', None)
    review.pop('gPlusUserId')

    #Append place to the list
    review_list.append(review)

In [None]:
review_list = [] #Store in empty list

response = urllib.request.urlopen(data_link) #Open the file at the link

with gzip.open(response, 'rb') as f:
  #We want to get 1,000 reviews to start
  for i in range(5000):
    line = f.readline()
    load_review(line, review_list=review_list)

#Time to load data: 1s

In [None]:
reviews = pd.DataFrame(review_list)

In [None]:
reviews.head()

Unnamed: 0,rating,reviewText,categories,gPlusPlaceId,unixReviewTime,reviewTime
0,3.0,Ch·∫•t l∆∞·ª£ng t·∫°m ·ªïn,[Gi·∫£i Tr√≠ - Caf√©],108103314380004200232,1372687000.0,"Jul 1, 2013"
1,5.0,Wc si temiz duzenli..,[Turkish Cuisine],102194128241608748649,1342871000.0,"Jul 21, 2012"
2,5.0,‰ΩïÂõû„ÇÇÁßÅ„ÅØ‰∫àÂÆö„Å´‰ºë„Åø„Åå„Çª„É´„Éê„Å´Ë°å„Å£„Åü„ÅÆ„ÅßË¶ö„Åà„Å¶Ë¶ã„Å¶„ÄÅÂàÜ„Åã„Çä„Åæ„Åô‚ùó,"[Fishing, Pond Fish Supplier, Seafood Market]",101409858828175402384,1390654000.0,"Jan 25, 2014"
3,5.0,‰ªäÂ∫¶„ÅØ‰∫àÂÆö„Å´‰ºë„Åø„ÅåÁôªÁ±≥Â∏Ç„Å´Ë°å„Åç„Åü„ÅÑ‚ùó‚òÄüòÖüåå Ê•Ω„Åó„ÅÑ„Å´Êó•Â∏∞„Çä„Å´ÁôªÁ±≥„ÅÆË¶ãÂ≠¶„ÅÆË¶≥ÂÖâ(*^)(*^-...,[Museum],101477177500158511502,1389188000.0,"Jan 8, 2014"
4,4.0,Ê∞ó‰ªôÊ≤ºË≠¶ÂØüÁΩ≤„Å´ÁßªËª¢‰∏≠„Å´ÁµÜ üëÆüêé‚ò∫üôãüöìÈ†ëÂºµ„Çç„ÅÜ‚ùó,[Police],106994170641063333085,1390486000.0,"Jan 23, 2014"


Clean categories and pick the most important one

In [None]:
#Clean categories/translate them
def trans(cats):
  translator=Translator()
  if cats is not None:
    return translator.translate(cats[0]).text
  else:
    return None

In [None]:
#Apply translation to the first word in the categories column
reviews['top_category'] = reviews['categories'].apply(trans)
#Runtime: 8 minutes

Pick which categories are 'relevant'

In [None]:
def similar(category, threshold=0.5):
  sims = []
  if category is None:
    return False

  for word in category.split():
    if word in wv:
      sims.append(wv.similarity(word, 'restaurant'))
    else:
      sims.append(0)

  return max(sims) > threshold

In [None]:
#Apply threshhold to reviews
relevant_category = reviews['top_category'].apply(similar)
reviews['relevant_cat'] = relevant_category

In [None]:
#Drop non-relevant places/review tags
indexRel = reviews[reviews.relevant_cat == False].index
reviews.drop(indexRel, inplace=True)

Translate reviews:

In [None]:
#Check language function
def check(word):
  translator=Translator()
  result = translator.translate(word)
  return [result.src, result.text]

In [None]:
reviewLang = reviews.reviewText.apply(check)
#This took 3m 45s to run for 1943 samples (pre-filtered by category)
#This took 9m 15s to run for 5000 samples
#This took 19m 33s to run for 10,000 samples

In [None]:
translated = pd.DataFrame(reviewLang)
translated = pd.DataFrame(translated['reviewText'].to_list(), columns = ['language', 'translated']) #Split our [src, text] into two columns
#translated.head()

In [None]:
#Added columns to our main dataframe
reviews['language'] = translated['language']
reviews['translated'] = translated['translated']

In [None]:
#Drop the un-translated column
reviews.drop('reviewText', axis=1, inplace=True)

### Use TF-IDF score to find most unique "defining" words of restaurant vibe

In [None]:
#Let's find the most unique/defining word in the group using TF-IDF:

cats_comp = list(reviews.top_category) #Train on the top_category for every relevant input

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cats_comp)

tfidf_dict = dict(zip(vectorizer.get_feature_names_out(), X.toarray().sum(axis=0)))

#tfidf_dict

In [None]:
def def_cat(category):
  words = {}

  category = category.lower().replace("restaurant", "") #Had to edit filtering requirements because too many restaurants were just categorized as "restaurant"

  if category is None:
    return None

  for word in category.split():
    if word in tfidf_dict:
      words[word] = tfidf_dict[word]

  if len(words) > 0:
    sorted_words = dict(sorted(words.items(), key=lambda item: item[1]))
    return list(sorted_words.keys())[-1]
  else:
    return "restaurant" #if the only category word is "restaurant" we need a catchall

In [None]:
#Get the defining category
kw = reviews.top_category.apply(def_cat)

In [None]:
# Compute similarity scores between words
similarity_matrix = []
for word1 in list(kw): #For each word in kw
  similarity_scores = []
  for word2 in list(kw): #Compare to the other words in kw (kw = keyword in category)
    if word2 in wv and word1 in wv:
      similarity_scores.append(wv.similarity(word1, word2)) #Return the similarity score if kw is in wv training library
    else:
      similarity_scores.append(0) #Return a similarity of 0 if can't find it
  similarity_matrix.append(similarity_scores)

#Runtime should be about 1m

In [None]:
# Perform K-means clustering
num_clusters = 5  # I want 5 clusters for now
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(similarity_matrix)
cluster_labels = kmeans.labels_

#Runtime: 5s



In [None]:
#Assign each keyword in kw to a cluster
word_clusters = {}
cat_list = []
for i, word in enumerate(list(kw)):
  cluster_id = cluster_labels[i] #Get the cluster id = what kmeans grouped as in prev cell
  cat_list.append(cluster_id)
  if cluster_id not in word_clusters: #If this is a new cluster #, create a list to start storing keywords
      word_clusters[cluster_id] = []
  word_clusters[cluster_id].append(word) #Add the keyword to word_clusters[number]

In [None]:
reviews['cat_cluster'] = pd.Series(cat_list)

### Sentiment Analysis

In [None]:
analyzer = SentimentIntensityAnalyzer()

def sentiment_analysis_and_charged_words(review):
  if type(review) == str and review is not None:
    sentiment = analyzer.polarity_scores(review)
    charged_words = [word for word in review.split() if analyzer.polarity_scores(word)['compound'] != 0]
    return sentiment, charged_words
  else:
    return None, None

def sentiment_analysis_and_charged_words_2(reviews):
    rating_scores = []
    all_words = set()
    charged_words = []
    price_related_keywords = [
        'price', 'cost', 'costly', 'expensive', 'cheap', 'cheapest', 'value', 'best-value', 'money', "money's",
        'worth', 'pricing', 'affordable', 'budget', 'economical', 'budget-friendly',
        'pricey', 'pricy', 'economic', 'bargain', 'deal', 'discount', 'charge', 'fee', 'expense',
        'worth-it', 'sale', 'sales', 'purchase', 'transaction', 'deals',
        'payment', 'payments', 'costly', 'inexpensive', 'cost-effective', 'effective',
        'overpriced', 'underpriced', 'overprice', 'underprice', 'overprices', 'underprices', 'cheaper', 'cheapest', 'pricier', 'priciest',
        'refund', 'income', 'revenue', 'premium', 'garbage', 'inexpensive', 'extravagant', 'boujee', 'rip-off'
        "an arm and a leg", "break the bank", "dent in your wallet", "dent in the wallet", "arm and leg",
         "expensive proposition", "big bucks", "high price", "high cost", "high-price", "high-cost",
        "top dollar", "top-dollar", "cost a pretty penny", "pretty penny", "cost a fortune"
    ]
    if reviews is not None and type(reviews) == str:
      review = reviews.lower()
      for word in price_related_keywords:
        if word in review.split():
          charged_words.append(word)

    return list(set(charged_words))


reviews['sentiment'], reviews['charged_words'] = zip(*reviews['translated'].apply(lambda x: sentiment_analysis_and_charged_words(x)))
reviews['price_words'] = reviews['translated'].apply(sentiment_analysis_and_charged_words_2)

## Merge tables together

In [None]:
#Left join since we want only restaurants associated with the reviews
df = places.merge(reviews, how='inner', on='gPlusPlaceId')

#Time to merge: 0s

## Cleanup/Standardization

In [None]:
df['reviewTime'] = pd.to_datetime(df['reviewTime'])
df['unixReviewTime'] = pd.to_datetime(df['unixReviewTime'], unit='s')

In [None]:
reviews.columns

Index(['rating', 'categories', 'gPlusPlaceId', 'unixReviewTime', 'reviewTime',
       'top_category', 'relevant_cat', 'language', 'translated', 'cat_cluster',
       'sentiment', 'charged_words', 'price_words'],
      dtype='object')

In [None]:
#make all review text data an empty string
#take care of this case in data analysis
df['translated'].fillna('', inplace=True)

In [None]:
df.columns

Index(['name', 'address', 'closed', 'gPlusPlaceId', 'gps', 'Class_$',
       'Class_$$', 'Class_$$$', 'Class_$$$$', 'location_cluster', 'rating',
       'categories', 'unixReviewTime', 'reviewTime', 'top_category',
       'relevant_cat', 'language', 'translated', 'cat_cluster', 'sentiment',
       'charged_words', 'price_words'],
      dtype='object')

In [None]:
#fill out this information with unknowns
#assume closed is false --> significant amount of data shows that closed is usually false
#can use essentially the mode in this case
df['name'].fillna('Unknown', inplace=True)
df['address'].fillna('Unknown', inplace=True)
df['closed'].fillna(False, inplace=True)
df['gps'].fillna('Unknown', inplace=True)

In [None]:
df['reviewTime'] = pd.to_datetime(df['reviewTime'])
df['unixReviewTime'] = pd.to_datetime(df['unixReviewTime'])

In [None]:
#imputing missing values for time using the mean
#may not be most effective method
#making an assumption that a majority of these reviews were made at the same time which could skew data when training
df['unixReviewTime'].fillna(df['unixReviewTime'].mean(), inplace=True)
df['reviewTime'].fillna(df['reviewTime'].mean(), inplace=True)

In [None]:
#Get the negative sentiment value
df['sentiment'] = df['sentiment'].apply(lambda x: None if (type(x) is not str) else int(float(x[8:11])))

In [None]:
language_ohe = pd.get_dummies(df['language'], prefix='Class') #places = pd.get_dummies(places, columns=['price'], prefix='Class')
df = df.join(language_ohe)
df.drop(['sentiment', 'language'], axis=1, inplace=True)

## Save to CSV

In [None]:
df.to_csv('proc.csv')

# Creating the Model

### Loading the data

In [None]:
df = pd.read_csv('proc.csv', index_col=0)
df = df[df.columns.drop(list(df.filter(regex='Unnamed')))] #Copied off stackoverflow here: https://stackoverflow.com/questions/19071199/drop-columns-whose-name-contains-a-specific-string-from-pandas-dataframe

In [None]:
df['cat_cluster'] = df['cat_cluster'].fillna(0) #Fill NAs with 0s (should be in above section but i forgot :/)
df['location_cluster'] = df['location_cluster'].fillna(0)

In [None]:
str_cols = ['name', 'address', 'gps', 'top_category', 'relevant_cat', 'translated', 'charged_words', 'price_words', 'categories', 'unixReviewTime', 'reviewTime']
df_num = df[df.columns.drop(str_cols)]

In [None]:
df_num

Unnamed: 0,closed,gPlusPlaceId,location_cluster,rating,cat_cluster,sentiment,Class_ca,Class_da,Class_de,Class_en,...,Class_fr,Class_hi,Class_it,Class_ms,Class_nl,Class_pt,Class_tr,Class_$$,Class_$$$,Class_$$$$
0,False,115757957627721988675,0.0,5.0,4.0,,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
1,False,116555416797255000560,0.0,4.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,True,114781865961627441828,0.0,5.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,False,114735143729299990529,0.0,5.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,False,105589668159024738610,0.0,5.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
882,False,104180837589620084282,0.0,4.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
883,False,110513581755637073383,0.0,3.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
884,False,100311016010349448338,0.0,3.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
885,False,101569325572579174871,0.0,5.0,0.0,,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [None]:
#Get X, Y data
X = df_num.drop(['Class_$$', 'Class_$$$', 'Class_$$$$'], axis=1).astype(float)
y = df_num[['Class_$$', 'Class_$$$', 'Class_$$$$']].astype(float)

display(X.shape)
display(y.shape)

(887, 18)

(887, 3)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
display(X_train.shape)
display(y_train.shape)

(709, 18)

(709, 3)

In [None]:
df_num.columns

Index(['closed', 'gPlusPlaceId', 'location_cluster', 'rating', 'cat_cluster',
       'sentiment', 'Class_ca', 'Class_da', 'Class_de', 'Class_en', 'Class_es',
       'Class_fr', 'Class_hi', 'Class_it', 'Class_ms', 'Class_nl', 'Class_pt',
       'Class_tr', 'Class_$$', 'Class_$$$', 'Class_$$$$'],
      dtype='object')

## Model 1: DNN

In [None]:
X_test = X_test[['closed', 'location_cluster', 'rating', 'cat_cluster', 'Class_ca', 'Class_da', 'Class_de', 'Class_en', 'Class_es',
       'Class_fr', 'Class_hi', 'Class_it', 'Class_ms', 'Class_nl', 'Class_pt',
       'Class_tr']]
X_train = X_train[['closed', 'location_cluster', 'rating', 'cat_cluster', 'Class_ca', 'Class_da', 'Class_de', 'Class_en', 'Class_es',
       'Class_fr', 'Class_hi', 'Class_it', 'Class_ms', 'Class_nl', 'Class_pt',
       'Class_tr']]

In [None]:
def build_model(hp):
    input_layer = Input(shape=16)
    x = input_layer

    for i in range(4):
        units = hp.Int(f'units_{i}', min_value=8, max_value=18, step=4)
        activation = hp.Choice(f'activation_{i}', ['relu', 'sigmoid', 'tanh'])
        x = Dense(units=units, activation=activation)(x)
    output = Dense(units=3, activation='softmax')(x)
    model = Model(inputs=input_layer, outputs=output)

    lr = hp.Float("learning_rate", min_value=0.0001, max_value=0.3, sampling="log")
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss='binary_focal_crossentropy', metrics=['accuracy'])
    return model

#X_train, X_test, y_train, y_test = train_test_split(df_normalized, classes_encoded, test_size=0.1, random_state=1)

tuner = RandomSearch(
    hypermodel=build_model,
    objective='val_accuracy',
    max_trials=3,
    executions_per_trial=2,
    directory='idk_dir',
    project_name='hyperparams_optimization')

tuner.search(X_train, y_train, epochs=100, validation_data=(X_test, y_test))

best_hyperparameters = tuner.get_best_hyperparameters()[0]

best_model = tuner.hypermodel.build(best_hyperparameters)

best_model.fit(X_train, y_train, epochs=best_hyperparameters['epochs'])

y_pred_probabilities = best_model.predict(X_test)
y_pred = np.argmax(y_pred_probabilities, axis=1)
y_true_expanded = np.argmax(y_test.values, axis=1)
test_accuracy = accuracy_score(y_true_expanded, y_pred)

print("Test Accuracy:", test_accuracy)
print("Optimized Hyperparameters\nNumber of Units in Each Hidden Layer:", [best_hyperparameters[f'units_{i}'] for i in range(4)])
print("Activation Function for Each Hidden Layer:", [best_hyperparameters[f'activation_{i}'] for i in range(4)])
print(F"Learning Rate: {best_hyperparameters['learning_rate']}\nEpochs: {best_hyperparameters['epochs']}")

conf_matrix = confusion_matrix(y_true_expanded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Reloading Tuner from idk_dir/hyperparams_optimization/tuner0.json
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 7

### Model 1 Updates

In [None]:
num_df = pd.read_csv("CSE-151A-Project-/final_num_df_cse151a.csv")
X = num_df.drop(['Class_$','Class_$$', 'Class_$$$', 'Class_'], axis=1).astype(float)
y = num_df[['Class_$$', 'Class_$$$', 'Class_']].astype(float)
X = X[['closed', 'location_cluster', 'rating', 'relevant_cat', 'cat_cluster','sentiment']]
X['cat_cluster'] += 1
X['location_cluster'] = X['location_cluster'].fillna(-1)
X['location_cluster'] += 1
display(X.shape)
display(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='accuracy',
    min_delta=0,
    patience=10,
    verbose=0,
    mode='auto',
    baseline=None,
    restore_best_weights=False,
    start_from_epoch=0
)

a= y_train['Class_$$$'] * 3 + y_train['Class_$$'] * 2 + y_train['Class_']
a -= 1
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(a), y=a)
class_weight = {0: class_weights[0], 1: class_weights[1], 2: class_weights[2]}



In [None]:
def build_model(hp):
    input_layer = Input(shape=6)
    x = input_layer

    for i in range(4):
        units = hp.Int(f'units_{i}', min_value=8, max_value=18, step=4)
        activation = hp.Choice(f'activation_{i}', ['relu', 'sigmoid', 'tanh'])
        x = Dense(units=units, activation=activation)(x)
    output = Dense(units=3, activation='softmax')(x)
    model = Model(inputs=input_layer, outputs=output)

    lr = hp.Float("learning_rate", min_value=0.0001, max_value=0.3, sampling="log")
    epochs = hp.Int('epochs', min_value=50, max_value=200, step=50)

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss='categorical_focal_crossentropy', metrics=['accuracy'], sample_weight_mode=class_weight)
    return model

tuner = RandomSearch(
    hypermodel=build_model,
    objective='accuracy',
    max_trials=3,
    executions_per_trial=2,
    directory='save_directory',
    project_name='hyperparams_optimization')

tuner.search(X_train, y_train, epochs=100)

best_hyperparameters = tuner.get_best_hyperparameters()[0]

best_model = tuner.hypermodel.build(best_hyperparameters)

best_model.fit(X_train, y_train, epochs=best_hyperparameters['epochs'])

y_pred_probabilities = best_model.predict(X_test)
y_pred = np.argmax(y_pred_probabilities, axis=1)
y_true_expanded = np.argmax(y_test.values, axis=1)
test_accuracy = accuracy_score(y_true_expanded, y_pred)

print("Test Accuracy:", test_accuracy)
print("Optimized Hyperparameters\nNumber of Units in Each Hidden Layer:", [best_hyperparameters[f'units_{i}'] for i in range(4)])
print("Activation Function for Each Hidden Layer:", [best_hyperparameters[f'activation_{i}'] for i in range(4)])
print(F"Learning Rate: {best_hyperparameters['learning_rate']}\nEpochs: {best_hyperparameters['epochs']}")

conf_matrix = confusion_matrix(y_true_expanded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

Trial 3 Complete [00h 00m 39s]
accuracy: 0.6375176310539246

Best accuracy So Far: 0.6382228434085846
Total elapsed time: 00h 01m 23s
Epoch 1/150
Epoch 2/150
Epoch 3/150
Epoch 4/150
Epoch 5/150
Epoch 6/150
Epoch 7/150
Epoch 8/150
Epoch 9/150
Epoch 10/150
Epoch 11/150
Epoch 12/150
Epoch 13/150
Epoch 14/150
Epoch 15/150
Epoch 16/150
Epoch 17/150
Epoch 18/150
Epoch 19/150
Epoch 20/150
Epoch 21/150
Epoch 22/150
Epoch 23/150
Epoch 24/150
Epoch 25/150
Epoch 26/150
Epoch 27/150
Epoch 28/150
Epoch 29/150
Epoch 30/150
Epoch 31/150
Epoch 32/150
Epoch 33/150
Epoch 34/150
Epoch 35/150
Epoch 36/150
Epoch 37/150
Epoch 38/150
Epoch 39/150
Epoch 40/150
Epoch 41/150
Epoch 42/150
Epoch 43/150
Epoch 44/150
Epoch 45/150
Epoch 46/150
Epoch 47/150
Epoch 48/150
Epoch 49/150
Epoch 50/150
Epoch 51/150
Epoch 52/150
Epoch 53/150
Epoch 54/150
Epoch 55/150
Epoch 56/150
Epoch 57/150
Epoch 58/150
Epoch 59/150
Epoch 60/150
Epoch 61/150
Epoch 62/150
Epoch 63/150
Epoch 64/150
Epoch 65/150
Epoch 66/150
Epoch 67/150
Epoc

## Model 2: SVM

### Process Data

In [28]:
df = pd.read_csv('proc.csv', index_col=0)
df = df[df.columns.drop(list(df.filter(regex='Unnamed')))] #Copied off stackoverflow here: https://stackoverflow.com/questions/19071199/drop-columns-whose-name-contains-a-specific-string-from-pandas-dataframe

In [29]:
df['cat_cluster'] = df['cat_cluster'].fillna(0) #Fill NAs with 0s (should be in above section but i forgot :/)
df['location_cluster'] = df['location_cluster'].fillna(0)

In [30]:
str_cols = ['name', 'address', 'gps', 'top_category', 'relevant_cat', 'translated', 'charged_words', 'price_words', 'categories', 'unixReviewTime', 'reviewTime']
df_num = df[df.columns.drop(str_cols)]

In [31]:
df_num.head(5)

Unnamed: 0,closed,gPlusPlaceId,Class_$,Class_$$,Class_$$$,Class_$$$$,location_cluster,rating,cat_cluster,Class_ca,...,Class_de,Class_en,Class_es,Class_fr,Class_hi,Class_it,Class_ms,Class_nl,Class_pt,Class_tr
0,False,115757957627721988675,0,1,0,0,0.0,5.0,1.0,0,...,0,1,0,0,0,0,0,0,0,0
1,False,116555416797255000560,0,1,0,0,0.0,4.0,0.0,0,...,0,0,0,0,0,0,0,0,0,0
2,True,114781865961627441828,0,0,1,0,0.0,5.0,0.0,0,...,0,0,0,0,0,0,0,0,0,0
3,False,114735143729299990529,0,1,0,0,0.0,5.0,0.0,0,...,0,0,0,0,0,0,0,0,0,0
4,False,105589668159024738610,0,1,0,0,0.0,5.0,0.0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
def cost_label_enc(row):
  if row['Class_$'] == 1:
    return 1
  elif row['Class_$$'] == 1:
    return 2
  elif row['Class_$$$'] == 1:
    return 3
  else:
    return 4

In [33]:
df_num['cost'] =df_num.apply(cost_label_enc, axis=1)
df_num.drop(['Class_$', 'Class_$$', 'Class_$$$', 'Class_$$$$'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_num['cost'] =df_num.apply(cost_label_enc, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_num.drop(['Class_$', 'Class_$$', 'Class_$$$', 'Class_$$$$'], axis=1, inplace=True)


In [34]:
X = df_num.drop(['cost'], axis=1).astype(float)
y = df_num[['cost']].astype(float)

In [35]:
#Resample
#ros = RandomOverSampler(random_state=42)
#X_res, y_res = ros.fit_resample(X, y)

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [37]:
scaler = StandardScaler()

In [38]:
#Scale all numbers
X_train_num = scaler.fit_transform(X_train)
X_test_num = scaler.fit_transform(X_test)

In [40]:
#svm = SVC(kernel = 'linear', class_weight={1.0: 1, 2.0: 1.1160116448326055, 3.0: 1.9847222222222223, 4.0: 11.484848484848484}, decision_function_shape='ovo')

#param_grid = {'C': [ 10, 100, 1000, 10_000, 100_000],
#              'gamma': [5, 3, 1, 0.1, 0.01, 0.001, 0.0001],
#              'kernel': ['rbf', 'sigmoid'],
#              'class_weight': [{1.0: 1, 2.0: 1.1160116448326055, 3.0: 1.9847222222222223, 4.0: 11.484848484848484}]}

#grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 2)

#grid.fit(X_train_num, y_train.values.ravel())

In [41]:
#grid.best_params_

In [47]:
svm = SVC(kernel = 'linear', class_weight={1.0: 1, 2.0: 1.1160116448326054, 3.0: 1.9847222222222223, 4.0: 11.484848484848484}, decision_function_shape='ovo')

In [48]:
svm.fit(X_train_num, y_train)

  y = column_or_1d(y, warn=True)


In [49]:
y_pred = svm.predict(X_test_num)

In [50]:
y_pred

array([2., 2., 2., 2., 2., 3., 2., 3., 2., 2., 2., 3., 3., 2., 3., 2., 2.,
       3., 2., 2., 2., 3., 2., 2., 3., 3., 2., 3., 2., 2., 2., 2., 2., 2.,
       2., 2., 2., 3., 2., 3., 3., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2.,
       3., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 3., 2., 2., 2., 2., 3.,
       3., 3., 2., 2., 2., 2., 2., 2., 3., 2., 2., 2., 3., 3., 2., 3., 2.,
       2., 2., 3., 2., 3., 3., 2., 3., 2., 2., 2., 2., 2., 3., 2., 2., 2.,
       3., 2., 2., 2., 2., 3., 2., 2., 3., 3., 2., 2., 2., 2., 2., 3., 2.,
       2., 2., 2., 3., 2., 2., 3., 2., 2., 2., 2., 2., 2., 2., 2., 2., 3.,
       2., 3., 2., 2., 3., 2., 2., 2., 2., 3., 2., 2., 3., 2., 3., 2., 2.,
       2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 2., 3., 3., 3., 2., 2.,
       2., 2., 2., 2., 2., 3., 2., 2.])

In [51]:
#Get results:

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         2.0       0.63      0.80      0.70       105
         3.0       0.52      0.33      0.40        70
         4.0       0.00      0.00      0.00         3

    accuracy                           0.60       178
   macro avg       0.38      0.38      0.37       178
weighted avg       0.58      0.60      0.57       178



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [55]:
#Training error for SVM
y_train_pred = svm.predict(X_train_num)
train_error = [list(y_train.cost)[i] == y_train_pred[i] for i in range(len(y_train_pred))]

In [56]:
np.mean(train_error)

0.6304654442877292

In [58]:
#Testing error
#Training error for SVM
y_test_pred = svm.predict(X_test_num)
test_error = [list(y_test.cost)[i] == y_test_pred[i] for i in range(len(y_test_pred))]

print(np.mean(test_error))

0.601123595505618


## Model 3: XGBoost

In [None]:
!pip install xgboost



### Process Data

In [None]:
df = pd.read_csv('proc.csv')
df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]

In [None]:
df['cat_cluster'] = df['cat_cluster'].fillna(0) #Fill NAs with 0s (should be in above section but i forgot :/)
df['location_cluster'] = df['location_cluster'].fillna(0)

In [None]:
str_cols = ['name', 'address', 'gps', 'top_category', 'relevant_cat', 'translated', 'charged_words', 'price_words', 'categories', 'unixReviewTime', 'reviewTime']
df_num = df[df.columns.drop(str_cols)]

In [None]:
df_num['Class_$'].mean(),df_num['Class_$$'].mean(),df_num['Class_$$$'].mean(),df_num['Class_$$$$'].mean()

(0.0, 0.6347237880496054, 0.34949267192784667, 0.015783540022547914)

In [None]:
df_num.shape

(887, 21)

In [None]:
def cost_label_enc(row):
  #Still want to maintain ordinal values but no $ instances, so it doesn't make sense to encode values as [2, 3, 4]
  if row['Class_$$'] == 1:
    return 0
  elif row['Class_$$$'] == 1:
    return 1
  #For class $$$$:
  else:
    return 2

In [None]:
df_num['cost'] =df_num.apply(cost_label_enc, axis=1)
df_num.drop(['Class_$', 'Class_$$', 'Class_$$$', 'Class_$$$$'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_num['cost'] =df_num.apply(cost_label_enc, axis=1)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_num.drop(['Class_$', 'Class_$$', 'Class_$$$', 'Class_$$$$'], axis=1, inplace=True)


In [None]:
X = df_num.drop(['cost'], axis=1).astype(float)
y = df_num[['cost']].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(y_train.value_counts())

cost
0       458
1       240
2        11
dtype: int64


In [None]:
scaler = StandardScaler()

In [None]:
#Scale all numbers
X_train_num = scaler.fit_transform(X_train)
X_test_num = scaler.fit_transform(X_test)

### XGBoost - No Oversampling

https://xgboost.readthedocs.io/en/stable/parameter.html

In [None]:
xgb = XGBClassifier(
    n_estimators=2, max_depth=2,
    learning_rate=1,
    objective='multi:softmax',
    eval_metric='merror',
    #scale_pos_weight=
)

In [None]:
xgb.fit(X_train_num, y_train)

In [None]:
train_pred = xgb.predict(X_train_num)
print(classification_report(y_train, train_pred))

              precision    recall  f1-score   support

           0       0.68      0.92      0.78       458
           1       0.59      0.23      0.33       240
           2       0.00      0.00      0.00        11

    accuracy                           0.67       709
   macro avg       0.42      0.38      0.37       709
weighted avg       0.64      0.67      0.62       709



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
pred = xgb.predict(X_test_num)

In [None]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.61      0.77      0.68       105
           1       0.48      0.31      0.38        70
           2       0.00      0.00      0.00         3

    accuracy                           0.58       178
   macro avg       0.36      0.36      0.35       178
weighted avg       0.55      0.58      0.55       178



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(confusion_matrix(y_test, pred))

[[81 24  0]
 [48 22  0]
 [ 3  0  0]]


### XGBoost - Oversampling

In [None]:
ros = RandomOverSampler(
    sampling_strategy={
        0: 500,
        1: 300,
        2: 200
    }
)

X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)

print(y_train_ros.value_counts())

cost
0       500
1       300
2       200
dtype: int64




In [None]:
y_train_ros = y_train_ros.astype(int)
y_test_ros = y_test.astype(int)

In [None]:
X_train_num_ros = scaler.fit_transform(X_train_ros)
X_test_num_ros = scaler.fit_transform(X_test)

In [None]:
xgb_oversampled = XGBClassifier(
    n_estimators=2, max_depth=2,
    learning_rate=1,
    objective='multi:softmax',
    eval_metric='merror'
)

In [None]:
xgb_oversampled.fit(X_train_num_ros, y_train_ros)

In [None]:
oversampled_train_pred = xgb_oversampled.predict(X_train_num_ros)
print(classification_report(y_train_ros, oversampled_train_pred))

              precision    recall  f1-score   support

           0       0.57      0.91      0.70       500
           1       0.53      0.17      0.26       300
           2       0.87      0.42      0.57       200

    accuracy                           0.59      1000
   macro avg       0.66      0.50      0.51      1000
weighted avg       0.62      0.59      0.54      1000



In [None]:
pred_oversampled = xgb_oversampled.predict(X_test_num_ros)

In [None]:
print(classification_report(y_test_ros, pred_oversampled))

              precision    recall  f1-score   support

           0       0.63      0.85      0.72       105
           1       0.56      0.29      0.38        70
           2       0.00      0.00      0.00         3

    accuracy                           0.61       178
   macro avg       0.39      0.38      0.37       178
weighted avg       0.59      0.61      0.57       178



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(confusion_matrix(y_test_ros, pred_oversampled))

[[89 16  0]
 [50 20  0]
 [ 3  0  0]]


### Gridsearch - No oversampling

In [None]:
xgb_gridsearch = XGBClassifier(objective='multi:softmax', eval_metric='merror')

In [None]:
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.5, 1],
    'subsample': [0.6, 0.8, 1.0],
    'gamma': [0.0, 1.0],
    'colsample_bytree': [0.5, 1]
}

In [None]:
grid_search = GridSearchCV(estimator=xgb_gridsearch, param_grid=param_grid,
                           scoring='f1_macro', cv=5, verbose=1)

In [None]:
grid_search.fit(X_train_num, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
top_params = grid_search.best_params_
top_score = grid_search.best_score_
top_model = grid_search.best_estimator_

In [None]:
top_params

{'colsample_bytree': 0.5,
 'gamma': 1.0,
 'learning_rate': 1,
 'max_depth': 5,
 'subsample': 0.6}

In [None]:
top_score

0.34942860750949944

In [None]:
gs_train_pred = xgb.predict(X_train_num)
print(classification_report(y_train, gs_train_pred))

              precision    recall  f1-score   support

           0       0.68      0.92      0.78       458
           1       0.59      0.23      0.33       240
           2       0.00      0.00      0.00        11

    accuracy                           0.67       709
   macro avg       0.42      0.38      0.37       709
weighted avg       0.64      0.67      0.62       709



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
top_pred = top_model.predict(X_test_num)
acc = (top_pred == y_test['cost']).mean()

In [None]:
print(classification_report(y_test, top_pred))

              precision    recall  f1-score   support

           0       0.60      0.70      0.65       105
           1       0.42      0.33      0.37        70
           2       0.00      0.00      0.00         3

    accuracy                           0.54       178
   macro avg       0.34      0.34      0.34       178
weighted avg       0.52      0.54      0.53       178



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
print(confusion_matrix(y_test, top_pred))

[[74 31  0]
 [47 23  0]
 [ 2  1  0]]


In [None]:
print(f'Top XGBoost Model Accuracy: {acc}')

Top XGBoost Model Accuracy: 0.5449438202247191


### GridSearch with Oversampling

In [None]:
xgb_gridsearch_ros = XGBClassifier(objective='multi:softmax', eval_metric='merror')

In [None]:
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.5, 1],
    'subsample': [0.6, 0.8, 1.0],
    'gamma': [0.0, 1.0],
    'colsample_bytree': [0.5, 1]
}

In [None]:
grid_search_ros = GridSearchCV(estimator=xgb_gridsearch_ros, param_grid=param_grid,
                           scoring='accuracy', cv=5, verbose=1)

In [None]:
grid_search_ros.fit(X_train_num_ros, y_train_ros)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


In [None]:
top_params_ros = grid_search_ros.best_params_
top_score_ros = grid_search_ros.best_score_
top_model_ros = grid_search_ros.best_estimator_

In [None]:
top_params_ros

{'colsample_bytree': 1,
 'gamma': 0.0,
 'learning_rate': 1,
 'max_depth': 5,
 'subsample': 0.6}

In [None]:
top_score_ros

0.726

In [None]:
gs_oversampled_train_pred = xgb.predict(X_train_num_ros)
print(classification_report(y_train_ros, gs_oversampled_train_pred))

              precision    recall  f1-score   support

           0       0.53      0.75      0.62       500
           1       0.42      0.41      0.41       300
           2       0.00      0.00      0.00       200

    accuracy                           0.50      1000
   macro avg       0.32      0.39      0.34      1000
weighted avg       0.39      0.50      0.43      1000



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
top_pred_ros = top_model_ros.predict(X_test_num_ros)
acc_ros = (top_pred_ros == y_test['cost']).mean()

In [None]:
print(classification_report(y_test, top_pred_ros))

              precision    recall  f1-score   support

           0       0.60      0.61      0.60       105
           1       0.39      0.37      0.38        70
           2       0.00      0.00      0.00         3

    accuracy                           0.51       178
   macro avg       0.33      0.33      0.33       178
weighted avg       0.51      0.51      0.51       178



In [None]:
print(confusion_matrix(y_test, top_pred_ros))

[[64 41  0]
 [40 26  4]
 [ 3  0  0]]


In [None]:
print(f'Top XGBoost Model Accuracy: {acc_ros}')

Top XGBoost Model Accuracy: 0.5056179775280899


### SMOTE Oversampling

In [None]:
smote = SMOTE(random_state=64)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

xgb = XGBClassifier(
    n_estimators=100, max_depth=2,
    learning_rate=1,
    objective='multi:softmax',
    eval_metric='mlogloss'
)

xgb.fit(X_train_smote, y_train_smote)
predictions = xgb.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.58      0.71      0.64       105
           1       0.41      0.24      0.31        70
           2       0.00      0.00      0.00         3

    accuracy                           0.52       178
   macro avg       0.33      0.32      0.31       178
weighted avg       0.50      0.52      0.50       178



In [None]:
predictions = xgb.predict(X_train_smote)
print(classification_report(y_train_smote, predictions))

              precision    recall  f1-score   support

           0       0.78      0.89      0.83       458
           1       0.89      0.74      0.81       458
           2       0.95      0.99      0.97       458

    accuracy                           0.87      1374
   macro avg       0.87      0.87      0.87      1374
weighted avg       0.87      0.87      0.87      1374

