# sprint 2
https://www.kaggle.com/code/ramantalwar00/classsification-restaurant-price-final/notebook


## classification of restaurants
When you get restaurant recommendations as a user, you might want to filter restaurants by price range. As a restaurant owner, you might want to know which restaurant features influence your price tag. That's why we want to create a model that classifies restaurants into cheap, medium or expensive categories.

To do this we will try out different classifiers with default settings and then run a grid search on the most promising ones to get the best model.

In [1]:
from fastai.imports import *
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import seaborn as sns

original_df = pd.read_csv("tripadvisor_dataset/restaurant_listings.csv")
pd.set_option("display.max_columns", None)



in short, we will do the same preprocessing as we did in sprint 1

In [None]:
#see notebook sprint 1 for details on how we got this
coords=pd.read_csv("tripadvisor_dataset/coordinaten2.csv").replace(0,np.nan)

In [None]:
original_df["rank"]=original_df["rank"].str.replace("#","").astype(float)
original_df["general rating"]=original_df["general rating"].map(lambda x: x.split(" ")[0]).astype(float)
original_df["number of reviews"]=original_df["number of reviews"].map(lambda x: x.split(" ")[0].replace(",","")).astype(float)
original_df['city'] = original_df["address"].str.split(', ').str[-1].str.split(" ").str[0]
first_tag=original_df.tags.str.split("|",expand=True)[0].dropna()
ranges=first_tag[first_tag.str.find("$")!=-1]
original_df["price_tag"]=ranges
original_df=original_df.merge(coords,on="id")
original_df.drop(columns=["food rating", "service rating","price range"], inplace=True)

*NOTE* we can do this preprocessing on the original df because it works row by row: no step uses aggregated statistics (mean/mode/median/...), so nothing leaks from the test rows into the training rows.

In [None]:
original_df.columns

now we will split the data and use a seed ;) Because we need actual price tags (our labels) to train and evaluate, we first remove the rows without a price tag before making the train/test split. In the end we can use our model to fill in those missing price tags.

In [None]:
df_with_price_tag=original_df[~original_df.price_tag.isna()].copy()

In [None]:
df_with_price_tag.price_tag

we will also remove features that won't help with the classification


In [None]:
df_with_price_tag.drop(columns=["restaurant name","address","phone number","website url","menu url","timetable","email address","id","tags"],inplace=True)

### descriptions

We can use the descriptions; we have 2 types. We will first check which one is the most useful.

In [None]:
print(f"description is missing: {df_with_price_tag.description.isna().sum()}")
print(f"dutch description is missing {df_with_price_tag['dutch description'].isna().sum()}")
print(f"size training set is {len(df_with_price_tag)}")

we have too many missing values for the descriptions: 77% of them are missing, so we decided not to include them as features
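A minimal sketch of how such a missing-value share can be checked per column. The toy frame and the 0.7 threshold are assumptions for illustration, not values from our dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_with_price_tag (all values are made up).
toy = pd.DataFrame({
    "description": ["Cosy bistro", np.nan, np.nan, np.nan],
    "dutch description": [np.nan, np.nan, "Gezellige bistro", np.nan],
    "rank": [1.0, 2.0, np.nan, 4.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
print(missing_share)

# Columns above a chosen threshold (hypothetical: 0.7) are candidates to drop.
MAX_MISSING = 0.7
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
print(to_drop)
```

Looking at the share instead of the raw count makes it easier to compare columns when the number of rows changes.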

In [None]:
df_with_price_tag.drop(columns=["description","dutch description"],inplace=True)
df_with_price_tag.head(2)

### splitting the data

In [None]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df_with_price_tag,test_size=0.2,train_size=0.8,shuffle=True,random_state=42)


In [None]:
df_train.price_tag.value_counts()

There is a big class imbalance in our training data, and we have to take this into account.
We have multiple options to handle it. The first option is to continue with the imbalanced dataset: when choosing our models we will look for algorithms that have a class weight parameter, so we can assign a corresponding weight to each class.

The second option is to artificially oversample our minority classes or downsample our majority classes. 
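The two options can be sketched as follows. This is only an illustration on made-up label counts: `compute_class_weight` and `resample` are scikit-learn helpers, and the toy frame is an assumption, not our real training set.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with a similar kind of skew (counts are made up).
y = pd.Series(["$$ - $$$"] * 80 + ["$"] * 15 + ["$$$$"] * 5, name="price_tag")

# Option 1: keep the data as is and weight each class by
# n_samples / (n_classes * class_count), which is what "balanced" means.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights.round(2))))

# Option 2: oversample every class (with replacement) up to the majority size.
df = pd.DataFrame({"price_tag": y})
target = df["price_tag"].value_counts().max()
parts = [
    resample(group, replace=True, n_samples=target, random_state=42)
    for _, group in df.groupby("price_tag")
]
balanced = pd.concat(parts)
print(balanced["price_tag"].value_counts())
```

If resampling is used, it should only be applied to the training split, otherwise duplicated rows can end up in both train and test.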

Before we start with the feature engineering part, I would start by making a baseline classifier: the random forest. I chose it because it is hard to get wrong: it is insensitive to outliers and does not require feature scaling. The results are also interpretable: we can see which features are used most to split our data (the most important features) and focus on those in our other classifier models.


inspiration from [this](https://www.kaggle.com/code/jhoward/how-random-forests-really-work) notebook used in the course [Practical Deep Learning for Coders 2022](https://www.youtube.com/watch?v=8SF_h3xF3cE&list=PLfYUBJiXbdtSvpQjSnJJ_PmDQB_VyT5iU&ab_channel=JeremyHoward)

personally I also wanted to experiment a bit with pipelines so I got some inspiration from [this blog](https://towardsdatascience.com/getting-the-most-out-of-scikit-learn-pipelines-c2afc4410f1a#:~:text=They%20can%20be%20nested%20and,(model)%20at%20the%20end.)

In [None]:
rf_train=df_train.copy()
rf_test=df_test.copy()

In [None]:
rf_train.head(1)

In [None]:
numeric=["rank","general rating","number of reviews","value rating","atmosphere rating","latitude","longitude"]
mutlihot_col = ['cuisines','special diets',"meals","restaurant features"]
cat_cols = ['travelers choice', 'michelin', 'city']#one hot encoding
label="price_tag"

In [None]:
for col in mutlihot_col:
    rf_train[col]=rf_train[col].fillna("X").str.replace(" ","").str.split(",")

ColumnTransformers are built similarly to Pipelines, except you include a third value in each tuple representing the columns to be transformed in that step.
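A small sketch of that tuple structure on toy data. The column values and city names are made up, and the nested `Pipeline` for the numeric columns is our assumption about how sequential steps are best combined, not something taken from the blog.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "rank": [1.0, np.nan, 3.0, 10.0],
    "general rating": [4.5, 4.0, np.nan, 3.0],
    "city": ["Gent", "Brugge", "Gent", "Aalst"],
})

# Steps that must run one after another on the same columns go in a nested Pipeline.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Each tuple is (name, transformer, columns).
transformer = ColumnTransformer([
    ("num", numeric_steps, ["rank", "general rating"]),
    ("ohe", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = transformer.fit_transform(toy)
# The result can be sparse depending on its density, so convert if needed.
X = X.toarray() if hasattr(X, "toarray") else X
print(X.shape)
```

Transformers listed as separate tuples run side by side on the original columns, which is why imputing and scaling are chained in one nested pipeline here.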

In [None]:
rf_train.head(2)

In [None]:
from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import MultiLabelBinarizer

from help_script import MultiHotEncoder

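# imputing and scaling must run in sequence on the same columns, so they share a nested Pipeline;
# as separate ColumnTransformer steps they would run side by side and the scaled copy would keep its NaNs
num_pipe = Pipeline([
    ('imputing',SimpleImputer()),
    ('scaling',StandardScaler())
])

cols_trans = ColumnTransformer([
    ('mhe',MultiHotEncoder(),mutlihot_col),
    ('ohe', OneHotEncoder(drop='first'), cat_cols), 
    ('num',num_pipe,numeric)
    ])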

In [None]:
pipe = Pipeline([
    ('trans', cols_trans),
    ('clf', LogisticRegression(max_iter=500, class_weight='balanced'))
])

In [None]:
pipe

In [None]:
# pipe.fit(rf_train,rf_train["price_tag"])

In [None]:
# from sklearn.ensemble import RandomForestClassifier

# rf = RandomForestClassifier(100, min_samples_leaf=5)
# rf.fit(trn_xs, trn_y);
# mean_absolute_error(val_y, rf.predict(val_xs))

now let's start with the feature engineering part. As we've learned, our model can only interpret numbers, so we must turn all our attributes into numerical values.

for the attributes that are already numerical we can look at feature scaling methods.

### numerical values

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_cols=df_train.select_dtypes(numerics).columns

In [None]:
fig=plt.figure(figsize=(10,15))
for i, col in enumerate(numerical_cols):
    plt.subplot(4,2,i+1)
    sns.histplot(df_train[col])
fig.tight_layout()
plt.show()

#### rank and number of reviews

we can see that the rank and the number of reviews have a long-tailed distribution, so I would take the log of the data first and then apply standardization.

In [None]:
fig=plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
sns.histplot(df_train["rank"])
plt.subplot(2,2,2)
sns.histplot(np.log(df_train["rank"]))
plt.subplot(2,2,3)
sns.histplot(df_train["number of reviews"])
plt.subplot(2,2,4)
sns.histplot(np.log(df_train["number of reviews"]))
fig.tight_layout()
plt.show()


It already looks so much better! We will keep this, but because the number of reviews is sometimes zero (and log(0) is undefined) we will add 1 before taking the log.
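As a small illustration of the +1 trick (the review counts below are made up): NumPy's `np.log1p` computes log(1 + x) directly, so a zero count maps to zero instead of minus infinity.

```python
import numpy as np
import pandas as pd

# Made-up review counts with a long tail and one zero.
reviews = pd.Series([0, 3, 12, 150, 4000], name="number of reviews")

# log1p(x) == log(1 + x), which is defined for x = 0.
shifted_log = np.log1p(reviews)
print(shifted_log.round(3).tolist())

# Same result as adding 1 by hand before taking the log.
same = np.allclose(shifted_log, np.log(reviews + 1))
print(same)
```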

In [None]:
##WE WILL DO THIS AFTER IMPUTATION
# df_train["lg_rank"]=np.log(df_train["rank"])
# df_train["lg_reviews"]=np.log(df_train["number of reviews"]+1)

#### general rating, atmosphere rating and value rating

look at these three again without the -1's

In [None]:
fig=plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
sns.histplot(df_train["general rating"].replace(-1,np.nan))
plt.subplot(2,2,2)
sns.histplot(df_train["atmosphere rating"].replace(-1,np.nan))
plt.subplot(2,2,3)
sns.histplot(df_train["value rating"].replace(-1,np.nan))
fig.tight_layout()
plt.show()

this already looks acceptable

#### lat & lon
for the coordinates we think it's best just to apply standardization to them

and for the missing coordinates we wrote a script that derives the center latitude and longitude of each city; we fill the gaps with those center coordinates
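A minimal sketch of this fill-in step on toy data. The coordinates are rough, illustrative values and `how="left"` is our assumption here: it keeps restaurants whose city has no known center instead of dropping them.

```python
import numpy as np
import pandas as pd

# Toy data; coordinates are illustrative, not taken from our files.
restaurants = pd.DataFrame({
    "id": [1, 2, 3],
    "city": ["Gent", "Brugge", "Gent"],
    "latitude": [51.05, np.nan, np.nan],
    "longitude": [3.72, np.nan, np.nan],
})
centers = pd.DataFrame({
    "city": ["Gent", "Brugge"],
    "latitude_center": [51.054, 51.209],
    "longitude_center": [3.725, 3.224],
})

# how="left" keeps every restaurant, even if its city has no known center.
merged = restaurants.merge(centers, on="city", how="left")

# fillna with a Series fills each gap with the value from the same row.
merged["latitude"] = merged["latitude"].fillna(merged["latitude_center"])
merged["longitude"] = merged["longitude"].fillna(merged["longitude_center"])
merged = merged.drop(columns=["latitude_center", "longitude_center"])
print(merged)
```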

In [None]:
city_centers=pd.read_csv("sprint1/city_centers.csv")

In [None]:
df_train=df_train.merge(city_centers,on="city")
df_train

In [None]:
df_train.loc[df_train.latitude.isna(),"latitude"]=df_train[df_train.latitude.isna()].latitude_center
df_train.loc[df_train.longitude.isna(),"longitude"]=df_train[df_train.longitude.isna()].longitude_center

In [None]:
df_train.latitude.isna().sum()
# df_train.loc[df_train.latitude.isna(),"latitude"]

In [None]:
df_train.drop(columns=["Unnamed: 0","latitude_center","longitude_center"],inplace=True)

In [None]:
df_train.columns

#### Missing values
Because our model won't like NaNs we have to replace them with something. We decided to replace them with the median of the corresponding feature. But we also think that the fact that a value is missing can itself be a very good predictor. That's why we add a "missing" indicator column for each feature that has missing values.
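For comparison, scikit-learn can do both steps in one go: `SimpleImputer` has an `add_indicator` option that appends 0/1 columns for features with missing values. A sketch on toy data (the values and the `_missing` suffix are our own choices):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "rank": [1.0, np.nan, 30.0, 5.0],
    "value rating": [4.0, 4.5, np.nan, 3.0],
})

# add_indicator=True appends one 0/1 column per feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the features that got an indicator column.
indicator_names = [f"{c}_missing" for c in toy.columns[imputer.indicator_.features_]]
result = pd.DataFrame(filled, columns=list(toy.columns) + indicator_names)
print(result)
```

Fitting the imputer on the training set only and reusing it on the test set keeps the medians free of test information.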

In [None]:
df_train["rank_missing"]=0
df_train["atmosphere_missing"]=0
df_train["value_missing"]=0
df_train["general_missing"]=0

In [None]:
df_train["atmosphere rating"]=df_train["atmosphere rating"].replace(-1,np.nan)
df_train["value rating"]=df_train["value rating"].replace(-1,np.nan)
df_train["general rating"]=df_train["general rating"].replace(-1,np.nan)

In [None]:
df_train.loc[df_train["rank"].isna(),"rank_missing"] = 1
df_train.loc[df_train["atmosphere rating"].isna(),"atmosphere_missing"] = 1
df_train.loc[df_train["value rating"].isna(),"value_missing"] = 1
df_train.loc[df_train["general rating"].isna(),"general_missing"] = 1

imputing our missing values with the median

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_data=imp_mean.fit_transform(df_train[["rank","general rating","value rating","atmosphere rating"]])
imputed_data

In [None]:
df_train["rank"]=imputed_data[:,0]
df_train["general rating"]=imputed_data[:,1]
df_train["value rating"]=imputed_data[:,2]
df_train["atmosphere rating"]=imputed_data[:,3]

taking the log now, standardizing later (see the standardizing section)

In [None]:
df_train["lg_rank"]=np.log(df_train["rank"])
df_train["lg_reviews"]=np.log(df_train["number of reviews"]+1)
df_train.drop(columns=["rank","number of reviews"],inplace=True)

In [None]:
df_train.head(1)

### booleans

In [None]:
df_train.select_dtypes(bool)

In [None]:
df_train.select_dtypes(bool).isna().sum()

no missing values! turn these into zeros and ones

In [None]:
df_train["travelers choice"]=df_train["travelers choice"].astype(int)
df_train["michelin"]=df_train["michelin"].astype(int)

### encoding categorical variables

we already explained how we did this and used it in our sprint 1 notebook so excuse us for just copy pasting the code 😅

In [None]:
mutlihot_col = ['cuisines','special diets',"meals","restaurant features"]

In [None]:
for col in mutlihot_col:
    df_train[col]=df_train[col].fillna(col+"_missing").str.replace(" ","").str.split(",")

In [None]:
df_train[mutlihot_col]

In [None]:
#multi hot encoding of the meals, restaurant features ,cuisines and diets
from sklearn.preprocessing import MultiLabelBinarizer

mlbs=[]
columns=["meals","restaurant features","cuisines","special diets"]
mh_encodings=[]
for col in columns:
    mlb= MultiLabelBinarizer()
    mlbs.append(mlb)
    # X=df_train[col].str.replace(" ","").str.split(",").fillna("X").to_list()
    #I want a list of sets that i can then pass to the MultiLabelBinarizer
    # lijst=[set(i) for i in X]
    mh_encodings.append(mlb.fit_transform(df_train[col]))


In [None]:
for i in mh_encodings:
    print(i.shape)

In [None]:
for i in mlbs:
    print(i.classes_)

one hot encode the city

In [None]:
from sklearn.preprocessing import OneHotEncoder
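# note: on scikit-learn >= 1.2 the `sparse` argument is called `sparse_output` (`sparse` was removed in 1.4)
enc=OneHotEncoder(sparse=False,handle_unknown="infrequent_if_exist")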
enc.fit(np.array(df_train["city"]).reshape(-1,1))
oh_cities=enc.transform(np.array(df_train["city"]).reshape(-1,1))
oh_cities

In [None]:
df_train.drop(columns=["cuisines","special diets","meals","restaurant features","city"],inplace=True)
df_train

### standardizing

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data=scaler.fit_transform(df_train[["general rating","value rating","atmosphere rating","latitude","longitude","lg_rank","lg_reviews"]])
scaled_data.shape

In [None]:
df_train["general rating"]=scaled_data[:,0]
df_train["value rating"]=scaled_data[:,1]
df_train["atmosphere rating"]=scaled_data[:,2]
df_train["latitude"]=scaled_data[:,3]
df_train["longitude"]=scaled_data[:,4]
df_train["lg_rank"]=scaled_data[:,5]
df_train["lg_reviews"]=scaled_data[:,6]

### label

our label is the price tag, we will ordinal encode this
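A compact way to express this ordinal encoding is a dictionary with `Series.map`. This is a sketch on a toy series; the three tag strings are the ones that occur in our data.

```python
import pandas as pd

# Cheap < medium < expensive, so the integers keep the order of the classes.
price_order = {"$": 0, "$$ - $$$": 1, "$$$$": 2}
tags = pd.Series(["$", "$$$$", "$$ - $$$", "$"], name="price_tag")

# map() returns NaN for any tag that is not in the dictionary,
# which makes unexpected values easy to spot.
encoded = tags.map(price_order)
assert encoded.notna().all(), "unexpected price tag"

encoded = encoded.astype(int)
print(encoded.tolist())  # [0, 2, 1, 0]
```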

In [None]:
df_train.loc[df_train.price_tag=="$","price_tag"]=0
df_train.loc[df_train.price_tag=="$$ - $$$","price_tag"]=1
df_train.loc[df_train.price_tag=="$$$$","price_tag"]=2

In [None]:
y_train=df_train["price_tag"].astype(int)
df_train.drop(columns=["price_tag"],inplace=True)

finally putting it all together

In [None]:
oh_cities.shape

In [None]:
mh_encodings[0].shape

In [None]:
##testing if the shape is correct
np.concatenate((oh_cities,mh_encodings[0]),axis=1).shape

In [None]:
df_train

In [None]:
X_train=np.concatenate((oh_cities,mh_encodings[0],mh_encodings[1],mh_encodings[2],mh_encodings[3],df_train),axis=1)
X_train.shape

we will also make a variable, feature_labels, that tells us which feature each column of our array comes from

In [None]:
feature_labels=[]

In [None]:
feature_labels.extend(enc.categories_[0])
for i in mlbs:
    feature_labels.extend(i.classes_)
feature_labels.extend(df_train.columns)

In [None]:
len(feature_labels), feature_labels[:3]

#### now apply the same preprocessing for our test set


first, fill in the missing locations with the city_centers

In [None]:
df_test=df_test.merge(city_centers,on="city")

In [None]:
df_test.loc[df_test.latitude.isna(),"latitude"]=df_test[df_test.latitude.isna()].latitude_center
df_test.loc[df_test.longitude.isna(),"longitude"]=df_test[df_test.longitude.isna()].longitude_center

In [None]:
df_test.latitude.isna().sum()

In [None]:
df_test.drop(columns=["Unnamed: 0","latitude_center","longitude_center"],inplace=True)

In [None]:
#inserting the missing columns
df_test["rank_missing"]=0
df_test["atmosphere_missing"]=0
df_test["value_missing"]=0
df_test["general_missing"]=0
df_test["atmosphere rating"]=df_test["atmosphere rating"].replace(-1,np.nan)
df_test["value rating"]=df_test["value rating"].replace(-1,np.nan)
df_test["general rating"]=df_test["general rating"].replace(-1,np.nan)
df_test.loc[df_test["rank"].isna(),"rank_missing"] = 1
df_test.loc[df_test["atmosphere rating"].isna(),"atmosphere_missing"] = 1
df_test.loc[df_test["value rating"].isna(),"value_missing"] = 1
df_test.loc[df_test["general rating"].isna(),"general_missing"] = 1


imputing our data

In [None]:
#imputing our data
imputed_data=imp_mean.transform(df_test[["rank","general rating","value rating","atmosphere rating"]])
print(imputed_data.shape)
df_test["rank"]=imputed_data[:,0]
df_test["general rating"]=imputed_data[:,1]
df_test["value rating"]=imputed_data[:,2]
df_test["atmosphere rating"]=imputed_data[:,3]
df_test["lg_rank"]=np.log(df_test["rank"])
df_test["lg_reviews"]=np.log(df_test["number of reviews"]+1)
df_test.drop(columns=["rank","number of reviews"],inplace=True)
df_test["travelers choice"]=df_test["travelers choice"].astype(int)
df_test["michelin"]=df_test["michelin"].astype(int)

multihot encoding

In [None]:
for col in mutlihot_col:
    df_test[col]=df_test[col].fillna(col+"_missing").str.replace(" ","").str.split(",")

In [None]:
#multihot encoding
# mlbs=[]
columns=["meals","restaurant features","cuisines","special diets"]
mh_encodings=[]
for i,col in enumerate(columns):
    mlb= mlbs[i]
    mh_encodings.append(mlb.transform(df_test[col]))


we can already see that there are classes in our test set that don't appear in our training set; we will ignore these
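What "ignoring" means here can be shown on a toy example (the cuisine lists are made up): a fitted `MultiLabelBinarizer` only knows the classes from the training data, so an unseen label gets no column and scikit-learn emits a warning.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

train_cuisines = [["Italian", "Pizza"], ["Belgian"], ["Italian"]]
test_cuisines = [["Pizza", "Sushi"], ["Belgian"]]  # "Sushi" never appears in training

mlb = MultiLabelBinarizer()
mlb.fit(train_cuisines)

# Record the warning instead of printing it, so we can inspect it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoded = mlb.transform(test_cuisines)

print(mlb.classes_)                      # columns learned from the training data only
print(encoded)                           # the unseen label gets no column
print([str(w.message) for w in caught])  # the warning about the unknown class
```

The test matrix therefore keeps exactly the same columns as the training matrix, which is what the classifiers need.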

In [None]:
#OH encoding
# enc=OneHotEncoder(sparse=False,handle_unknown="infrequent_if_exist")
#reuse the encoder that was fitted on the training set
oh_cities=enc.transform(np.array(df_test["city"]).reshape(-1,1))
df_test.drop(columns=["cuisines","special diets","meals","restaurant features","city"],inplace=True)


In [None]:
#scaling
# scaler = StandardScaler()
scaled_data=scaler.transform(df_test[["general rating","value rating","atmosphere rating","latitude","longitude","lg_rank","lg_reviews"]])
scaled_data.shape
df_test["general rating"]=scaled_data[:,0]
df_test["value rating"]=scaled_data[:,1]
df_test["atmosphere rating"]=scaled_data[:,2]
df_test["latitude"]=scaled_data[:,3]
df_test["longitude"]=scaled_data[:,4]
df_test["lg_rank"]=scaled_data[:,5]
df_test["lg_reviews"]=scaled_data[:,6]
df_test.loc[df_test.price_tag=="$","price_tag"]=0
df_test.loc[df_test.price_tag=="$$ - $$$","price_tag"]=1
df_test.loc[df_test.price_tag=="$$$$","price_tag"]=2

y_test=df_test["price_tag"].astype(int)
df_test.drop(columns=["price_tag"],inplace=True)



In [None]:
X_test=np.concatenate((oh_cities,mh_encodings[0],mh_encodings[1],mh_encodings[2],mh_encodings[3],df_test),axis=1)

In [None]:
X_train.shape,y_train.shape, X_test.shape,y_test.shape

### baseline

In [None]:
from sklearn.metrics import balanced_accuracy_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=1)
rf.fit(X_train, y_train)
balanced_accuracy_score(y_test, rf.predict(X_test))

this is already a good result. We can also find the features that were used the most in the decision trees to make splits (the important features)

In [None]:
rf.feature_importances_

In [None]:
importances=pd.DataFrame(dict(cols=feature_labels, imp=rf.feature_importances_))

In [None]:
importances.sort_values("imp",ascending=False).head(10)

above we can see the 10 most important features for our random forest ensemble. It is interesting to see that we were right: the fact that a feature is missing is important for the classifier, "restaurant features missing" is in the top 10. It is also interesting to see that table service and reviews are the most important features for determining our price tag.

### model selection

our first idea was to try out a lot of different classifiers with default parameters, see which ones have the best accuracy, then do a grid search on the best 3 models and pick the best one. When we finished this, we realised that our dataset is heavily imbalanced and that we have to do something about it.


*NOTE: we have class imbalance so we chose models that have the class weight property*

we must choose our performance metric carefully because of the class imbalance, see [this article](https://towardsdatascience.com/multiclass-classification-evaluation-with-roc-curves-and-roc-auc-294fd4617e3a) for inspiration

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
def plot_cm(classifier):
    cm=confusion_matrix(y_test, classifier.predict(X_test))
    df_cm = pd.DataFrame(cm, columns=np.unique(y_test), index = np.unique(y_test))
    df_cm.index.name = 'Actual'
    df_cm.columns.name = 'Predicted'

    sns.heatmap(df_cm, cbar=False, annot=True, square=True, fmt='.0f',
                annot_kws={'size': 10})

In [None]:
classifiers=["svc linear","svc rbf","Logistic Regression","Naive Bayes","Light Gradient Boosting machine(LGBM)","xgboost","catboost"]
scores_list=[]

**svc linear**

quick experiment to see the effect of the class weight

In [None]:
clf=SVC(kernel="linear", C=0.025)
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)

In [None]:
clf=SVC(kernel="linear", C=0.025,class_weight="balanced")
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

**svc rbf**


In [None]:
clf = SVC(gamma=2, C=1,class_weight="balanced")
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

by looking carefully at the confusion matrix, and not blindly at our classification score, we can see that even though this classifier has a better score than the linear SVC, it predicts 1 for almost every input, so the model is actually really bad

we can also confirm that the balanced accuracy is a better performance metric
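A tiny made-up example shows why: a model that always predicts the majority class gets a high accuracy, while the balanced accuracy (the mean of the per-class recalls) exposes it.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 10 toy labels: one cheap (0), eight medium (1), one expensive (2).
y_true = np.array([0] * 1 + [1] * 8 + [2] * 1)
# A "classifier" that predicts the majority class for everything.
y_pred = np.ones_like(y_true)

acc = accuracy_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)

print(acc)            # 0.8
print(round(bal, 3))  # 0.333
```

The recall is 0 for class 0, 1 for class 1 and 0 for class 2, so the balanced accuracy is 1/3, exactly what random guessing between three classes would be expected to give.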

**logistic regression**

In [None]:
clf=LogisticRegression(random_state=0,max_iter=500,class_weight="balanced")
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

**Naive Bayes**

In [None]:
clf=GaussianNB() ##no class weights here: GaussianNB has no class_weight parameter, it estimates the class priors from the training labels
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

**KNN**

In [None]:
# clf=KNeighborsClassifier()
# clf.fit(X_train,y_train)
# mscore=clf.score(X_test,y_test)
# print("The score is: ",mscore)
# scores_list.append(mscore)

**random forest**

In [None]:
# clf=RandomForestClassifier(300)
# clf.fit(X_train,y_train)
# mscore=clf.score(X_test,y_test)
# print("The score is: ",mscore)
# scores_list.append(mscore)

**Light gradient boosting machine (LGBM)**

In [None]:
clf=LGBMClassifier(random_state=0,class_weight="balanced")
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

**XGBOOST**

with help from [Unbalanced multiclass data with XGBoost](https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost)

In [None]:
from sklearn.utils import class_weight
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=y_train
)

In [None]:
clf=XGBClassifier(use_label_encoder=False,random_state=0)
clf.fit(X_train,y_train,sample_weight=classes_weights)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

**catboost**

In [None]:
classes_weights

In [None]:
classes = np.unique(y_train)
weights = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))
class_weights

In [None]:
clf=CatBoostClassifier(random_state=0,class_weights=class_weights)
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
plot_cm(clf)

putting it all together

In [None]:
scores_list

In [None]:
sb=pd.DataFrame(list(zip(classifiers,scores_list)),columns=['Classifier','Score'])
sb.sort_values("Score")

### Gridsearch
Now we will try to squeeze the last drops of performance out of our most promising models, starting with our best one, the SVC with linear kernel
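One thing to keep in mind: by default `GridSearchCV` ranks parameter settings by the estimator's own score, which is plain accuracy for classifiers. A sketch of how the search could be pointed at balanced accuracy instead, on synthetic data (the data, the small grid and the stratified folds are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic, imbalanced 3-class data standing in for X_train / y_train.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.15, 0.75, 0.10], random_state=0,
)

search = GridSearchCV(
    estimator=SVC(kernel="linear", class_weight="balanced"),
    param_grid={"C": [0.01, 0.1, 1]},
    scoring="balanced_accuracy",  # the default would be plain accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

With this setting the cross-validated score used to pick the parameters is the same metric we report on the test set.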

In [None]:
linear_svc_grid={
    "kernel":["linear"],
    "C":[0.01,0.1,1,10,100],

}
# LGBM_grid = {'n_estimators': [50, 100, 150, 200],
#         'max_depth': [4, 8, 12],
#         'learning_rate': [0.05, 0.1, 0.15]}
xgboost_grid = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5]
        }

catboost_grid={'learning_rate': [0.01, 0.1,1],
        'n_estimators':[100,200,400],
        'depth': [4, 10,15],#CatBoost supports a depth of at most 16
        'l2_leaf_reg': [0,1, 3, 5, 9]}

# rf_grid={
#     "n_estimators":[100,200,300,400],
#     "max_depth":[1,2,3,4,5,6,7,8],
#     'min_samples_leaf':[2,4]
# }

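log_grid={
    #the default lbfgs solver only supports "l2" (or no penalty); saga also handles "l1"
    #("elasticnet" would additionally need an l1_ratio, and the string "none" is no longer accepted by recent scikit-learn)
    "solver":["saga"],
    "penalty":["l1", "l2"],
    "C":[0.01,0.1,1,10],
    "max_iter":[100,150,200,300]
}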
final_class=['linear_svc','xgboost_classifier','catboost_classifier','logistic_regression']
final_scores=[]

In [None]:
from joblib import dump, load


In [None]:
linear_svc_clf=GridSearchCV(estimator=SVC(class_weight="balanced"),param_grid=linear_svc_grid,n_jobs=1,cv=5,verbose=True)
linear_svc_clf.fit(X_train,y_train)
dump(linear_svc_clf, 'linear_svc_clf.joblib')

pred=linear_svc_clf.predict(X_test)
balanced_score=balanced_accuracy_score(y_test,pred)
final_scores.append(balanced_score)
plot_cm(linear_svc_clf)

In [None]:
xgboost_clf=GridSearchCV(estimator=XGBClassifier(use_label_encoder=False),param_grid=xgboost_grid,n_jobs=1,cv=5,verbose=True)

#Fit the model
xgboost_clf.fit(X_train,y_train,sample_weight=classes_weights)
#Score and Store the model
dump(xgboost_clf, 'xgboost_clf.joblib') 

pred=xgboost_clf.predict(X_test)
balanced_score=balanced_accuracy_score(y_test,pred)
final_scores.append(balanced_score)
plot_cm(xgboost_clf)

In [None]:
catboost_clf=GridSearchCV(estimator=CatBoostClassifier(class_weights=class_weights),param_grid=catboost_grid,n_jobs=1,cv=5,verbose=True)

#Fit the model
catboost_clf.fit(X_train,y_train)
dump(catboost_clf, 'catboost_clf.joblib') 

pred=catboost_clf.predict(X_test)
balanced_score=balanced_accuracy_score(y_test,pred)
final_scores.append(balanced_score)
plot_cm(catboost_clf)

In [None]:
log_reg_clf=GridSearchCV(estimator=LogisticRegression(class_weight="balanced"),param_grid=log_grid,n_jobs=1,cv=5,verbose=True)

#Fit the model
log_reg_clf.fit(X_train,y_train)

dump(log_reg_clf, 'logistic_regression.joblib') 
#Score and Store the model
pred=log_reg_clf.predict(X_test)
balanced_score=balanced_accuracy_score(y_test,pred)
final_scores.append(balanced_score)
plot_cm(log_reg_clf)

In [None]:
final_sb=pd.DataFrame(list(zip(final_class,final_scores)),columns=['Classifier','Score'])
final_sb

if it turns out to be hard to pick out the expensive restaurants, try to separate them with an isolation forest or PCA and handle the rest with binary classification

for the latter you can then do a grid search

### trying out TTA (test time augmentation)

Test-time augmentation, or TTA for short, is a technique for improving the skill of predictive models.

It is typically used to improve the predictive performance of deep learning models on image datasets where predictions are averaged across multiple augmented versions of each image in the test dataset.

Although popular with image datasets and neural network models, test-time augmentation can be used with any machine learning algorithm on tabular datasets, such as those often seen in regression and classification predictive modeling problems.

In [None]:
# create a test set for a row of real data with an unknown label
from numpy.random import normal
from scipy.stats import mode

def create_test_set(row, n_cases=3, feature_scale=0.2):
	test_set = list()
	test_set.append(row)
	# make copies of row
	for _ in range(n_cases):
		# create vector of random gaussians
		gauss = normal(loc=0.0, scale=feature_scale, size=len(row))
		# add to test case
		new_row = row + gauss
		# store in test set
		test_set.append(new_row)
	return test_set

In [None]:
# make predictions using test-time augmentation
def test_time_augmentation(model, X_test, noise):
	# evaluate model
	y_hat = list()
	for i in range(X_test.shape[0]):
		# retrieve the row
		row = X_test[i]
		# create the test set
		test_set = create_test_set(row, feature_scale=noise)
		# make a prediction for all examples in the test set
		labels = model.predict(test_set)
		# select the label as the mode of the distribution
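		# (np.ravel handles both the array result of older SciPy and the scalar result of newer SciPy)
		label = int(np.ravel(mode(labels)[0])[0])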
		# store the prediction
		y_hat.append(label)
	return y_hat

In [None]:
clf=SVC(kernel="linear", C=0.025,class_weight="balanced")
clf.fit(X_train,y_train)
pred=clf.predict(X_test)
mscore=clf.score(X_test,y_test)
balanced_score=balanced_accuracy_score(y_test,pred)
print("The score is: ",mscore)
print("The balanced accuracy is: ",balanced_score)
scores_list.append(balanced_score)

In [None]:
# evaluate different noise scales (feature_scale) for the synthetic examples created at test time
examples = np.arange(0.01,1,0.01)
results = list()
for e in examples:
	pred = test_time_augmentation(clf, X_test, e)
	balanced_score=balanced_accuracy_score(y_test,pred)
	print("The balanced accuracy is: ",balanced_score)
	results.append(balanced_score)

In [None]:
df=pd.DataFrame({"feature_scale":examples,"results":results})
df

In [None]:
df.results.plot()

In [None]:
df[df.results>=0.68]

we can see that, even though it's a tiny improvement, we were actually able to improve our model!