# Introduction
This dataset contains 2,224 entries describing different chocolate bars,
their manufacturers, the origin of the beans, and so on.

The dataset also contains a rating for the quality of each chocolate.

## Flavors of Cacao Rating System:
- #### Rating Scale:
    - 4.0 - 5.0 = Outstanding
    - 3.5 - 3.9 = Highly Recommended
    - 3.0 - 3.49 = Recommended
    - 2.0 - 2.9 = Disappointing
    - 1.0 - 1.9 = Unpleasant
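
The scale above can be turned into a small lookup. The sketch below is only an illustration: the helper name `rating_label` is made up, and treating each band's lower bound as inclusive (so that a value such as 3.95 would fall into "Highly Recommended") is an assumption, since the published scale leaves small gaps between bands.

```python
import numpy as np
import pandas as pd

# Lower bound of each band, taken from the rating scale above.
BINS = [1.0, 2.0, 3.0, 3.5, 4.0, np.inf]
LABELS = ["Unpleasant", "Disappointing", "Recommended",
          "Highly Recommended", "Outstanding"]

def rating_label(ratings: pd.Series) -> pd.Series:
    """Map numeric ratings to the named bands (hypothetical helper)."""
    # right=False makes each band include its lower bound, e.g. [3.5, 4.0)
    return pd.cut(ratings, bins=BINS, labels=LABELS, right=False)

example = pd.Series([3.75, 3.25, 2.75, 4.0, 1.5])
labels = rating_label(example)
print(labels.tolist())
```

A banded target like this could later serve a classifier, whereas the rest of the notebook treats the rating as a continuous value.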

### About this project
In this project we will try to predict the rating of a chocolate bar
from the data we have, and work out which factors drive
its rating and quality,
or whether there is a "secret" formula that the data does not capture.

In [458]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_confusion_matrix, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [459]:
chocolate = pd.read_csv('data/chocolate.csv')
pd.set_option('display.max_columns', None)
chocolate = chocolate.iloc[:,1:]
chocolate.head()

Unnamed: 0,ref,company,company_location,review_date,country_of_bean_origin,specific_bean_origin_or_bar_name,cocoa_percent,rating,counts_of_ingredients,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,2454,5150,U.S.A,2019,Madagascar,"Bejofo Estate, batch 1",76.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,2458,5150,U.S.A,2019,Dominican republic,"Zorzal, batch 1",76.0,3.5,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,2454,5150,U.S.A,2019,Tanzania,"Kokoa Kamili, batch 1",76.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,797,A. Morin,France,2012,Peru,Peru,63.0,3.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,797,A. Morin,France,2012,Bolivia,Bolivia,70.0,3.5,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,


In [460]:
chocolate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 20 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ref                               2224 non-null   int64  
 1   company                           2224 non-null   object 
 2   company_location                  2224 non-null   object 
 3   review_date                       2224 non-null   int64  
 4   country_of_bean_origin            2224 non-null   object 
 5   specific_bean_origin_or_bar_name  2224 non-null   object 
 6   cocoa_percent                     2224 non-null   float64
 7   rating                            2224 non-null   float64
 8   counts_of_ingredients             2224 non-null   int64  
 9   beans                             2224 non-null   object 
 10  cocoa_butter                      2224 non-null   object 
 11  vanilla                           2224 non-null   object 
 12  lecith

## Data insight
The taste columns contain some null values,
but a quick overview shows that "first_taste", "second_taste", ...
are essentially a list of the tastes found in each chocolate, which we will need
to transform into numerical data anyway.
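
A quick way to see where the gaps are is to count missing values per column. The frame below is a made-up stand-in with the same kind of holes as the taste columns; on the real data the same call would be applied to `chocolate`.

```python
import numpy as np
import pandas as pd

# Toy frame (invented rows) that mimics the sparsity of the taste columns.
toy = pd.DataFrame({
    "rating": [3.75, 3.5, 3.25],
    "first_taste": ["cocoa", "vegetal", "fatty"],
    "third_taste": ["full body", np.nan, "bready"],
    "fourth_taste": [np.nan, np.nan, np.nan],
})

# Number of missing values per column, largest first.
null_counts = toy.isna().sum().sort_values(ascending=False)
print(null_counts)
```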

# Transforming the data for use
In this section we will transform all the categorical columns into data that we can use.

## Drop unnecessary columns
We will drop ref, company, company_location, review_date and specific_bean_origin_or_bar_name
(and, a little further below, beans). These values should not affect the quality of the chocolate.
Company might, but there are too many distinct companies to encode, so we drop it as well.

In [461]:
dropc = ["specific_bean_origin_or_bar_name"]
chocolate_n = chocolate.iloc[:,4:]
chocolate_n = chocolate_n.drop(dropc, axis = 1)
chocolate_n

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_salt,have_not_sugar,have_sweetener_without_sugar,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fatty,mild nuts,mild fruit,


### Look into object dtypes, which are potential categories

In [462]:
# select the columns where the dtype is object
cat = chocolate_n.select_dtypes(include=['object']).copy()

huge = []
is_binary = []
drop_cols = []
for col in cat.columns:
    num = len(cat[col].unique())
    if num < 4:
        if num == 1:
            drop_cols.append(col)
            continue
        elif num == 2:
            is_binary.append(col)
        else:
            huge.append(col)
        print(col, cat[col].unique())
    else:
        huge.append(col)
        print(col, num)
chocolate_n = chocolate_n.drop(drop_cols, axis=1)
chocolate_n
#chocolate["sugar"].unique()

country_of_bean_origin 62
cocoa_butter ['have_cocoa_butter' 'have_not_cocoa_butter']
vanilla ['have_not_vanila' 'have_vanila']
lecithin ['have_not_lecithin' 'have_lecithin']
salt ['have_not_salt' 'have_salt']
sugar ['have_sugar' 'have_not_sugar']
sweetener_without_sugar ['have_not_sweetener_without_sugar' 'have_sweetener_without_sugar']
first_taste 456
second_taste 480
third_taste 333
fourth_taste 89


Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_salt,have_not_sugar,have_sweetener_without_sugar,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fatty,mild nuts,mild fruit,


## First look
For instance, "sugar" has only two values, so it is effectively a boolean,
<br> while "beans" has just one value, so we can drop that column.

In [463]:
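def replace_binary(data_set: pd.DataFrame, cols: list):
    """Encode "have_x" / "have_not_x" columns as 1 / 0."""
    data_copy = data_set.copy()
    for column in cols:
        # values containing "_not_" mean the ingredient is absent
        data_copy[column] = np.where(data_set[column].str.contains("_not_"), 0, 1)
    return data_copy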

chocolate_n2 = replace_binary(chocolate_n, is_binary)
chocolate_n2

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


# First approach
### Label encoding
Let's use label encoding to encode the tastes, and then try to predict the score.

This approach is easy to implement but might work badly,
because it imposes an arbitrary order on the categories: there is no reason
why "cocoa" should be 10 when "mild fruit" is 1,
<br>yet a model would treat "cocoa" as carrying ten times the weight.
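
To make the ordering problem concrete, here is a minimal sketch of what pandas category codes look like on a handful of taste values. The four values are picked for illustration; the point is that the codes follow alphabetical order and that a missing value becomes -1.

```python
import pandas as pd

tastes = pd.Series(["cocoa", "mild fruit", None, "blackberry"])

# Categories are sorted alphabetically, so the codes carry no meaning
# about flavor; missing entries are encoded as -1.
codes = tastes.astype("category").cat.codes
print(dict(zip(tastes.fillna("<missing>"), codes)))
```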

In [464]:
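def label_encode(data_set: pd.DataFrame, columns: list):
    """Replace each listed column with its integer category codes."""
    data_copy = data_set.copy()
    for column in columns:
        data_copy[column] = data_copy[column].astype('category')
        # codes follow the sorted category order; missing values become -1
        data_copy[column] = data_copy[column].cat.codes
    return data_copy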
chocolate_cat = label_encode(chocolate_n2, huge)
chocolate_cat

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,28,76.0,3.75,3,1,0,0,0,1,0,86,27,111,-1
1,12,76.0,3.50,3,1,0,0,0,1,0,86,460,256,-1
2,52,76.0,3.25,3,1,0,0,0,1,0,319,126,26,-1
3,36,63.0,3.75,4,1,0,1,0,1,0,137,222,247,-1
4,3,70.0,3.50,4,1,0,1,0,1,0,437,288,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,2,80.0,2.75,4,1,0,0,1,0,1,446,84,316,-1
2220,8,75.0,3.75,3,1,0,0,0,1,0,397,214,-1,-1
2221,1,72.0,3.50,3,1,0,0,0,1,0,265,357,0,-1
2222,9,70.0,3.25,3,1,0,0,0,1,0,130,255,180,-1


## Trying a few models
### Preparing the data

In [465]:
train_set, test_set = train_test_split(chocolate_cat, test_size = 0.2, random_state = 2)
# print( train_set.shape)
# print( test_set.shape)

def split_x_y(data_set):
    return data_set.drop('rating', axis = 1), data_set['rating']

x_train, y_train = split_x_y(train_set)
x_test, y_test = split_x_y(test_set)
print(train_set.shape, test_set.shape)

(1779, 14) (445, 14)


## KNN
Let's use KNN for our first try.

In [466]:
def display_scores(m_scores):
    # cross_val_score returns negative MSE; negate and take the root to get RMSE
    sqrt_scores = np.sqrt(-m_scores)
    print("Scores:", sqrt_scores)
    print("Mean:", sqrt_scores.mean())
    print("Standard deviation:", sqrt_scores.std())

from sklearn.neighbors import KNeighborsRegressor

x_train_copy = x_train.copy()

scalar = StandardScaler()
x_train_copy = scalar.fit_transform(x_train_copy)

for i in range(3, 30, 6):
    model = KNeighborsRegressor(n_neighbors=i)
    scores = cross_val_score(model, x_train_copy, y_train, scoring="neg_mean_squared_error", cv = 10)
    print(model)
    display_scores(scores)
    print()

KNeighborsRegressor(n_neighbors=3)
Scores: [0.5158611  0.45784363 0.46132409 0.45754358 0.47781625 0.46253353
 0.50016071 0.4723966  0.46928938 0.42753068]
Mean: 0.470229954669768
Standard deviation: 0.023059324603576885

KNeighborsRegressor(n_neighbors=9)
Scores: [0.45440806 0.43384781 0.42844363 0.41732532 0.45175    0.43334574
 0.46102555 0.44397388 0.44680902 0.38126735]
Mean: 0.43521963672249014
Standard deviation: 0.021910518194583048

KNeighborsRegressor(n_neighbors=15)
Scores: [0.44910787 0.41608116 0.4094478  0.41019034 0.43898845 0.43051881
 0.45018494 0.4374701  0.43571543 0.38828433]
Mean: 0.4265989236206152
Standard deviation: 0.018908055844986532

KNeighborsRegressor(n_neighbors=21)
Scores: [0.45233234 0.41429402 0.40581815 0.40351865 0.43596292 0.42975818
 0.443765   0.43676509 0.43294371 0.39120936]
Mean: 0.4246367404165433
Standard deviation: 0.01878303862173257

KNeighborsRegressor(n_neighbors=27)
Scores: [0.45142911 0.41165207 0.40502322 0.40031172 0.43549966 0.42796

### KNN overview
Surprisingly, even with the "bad" encoding of tastes, KNN performs reasonably well,
with a cross-validated RMSE of roughly 0.42 to 0.47 (an MSE of about 0.18 at the larger neighbor counts).
Since this is an expert rating and therefore somewhat subjective, that error may be acceptable.

On the other hand, most of our values lie between 2.5 and 4, so an error of this size might be significant.


#### We will try to improve the score later
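
One thing to keep in mind when tuning: in the KNN cell above the scaler is fitted on the whole training set before cross-validation, so every validation fold has already contributed to the scaling statistics. A possible remedy, sketched here on synthetic data (the arrays `X_demo` and `y_demo` are invented for the example), is to put the scaler and the model into a `Pipeline` so that scaling is refitted inside each fold.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and the ratings.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.2 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.4, size=200)

# Scaling is part of the estimator, so each CV fold fits its own scaler.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=15)),
])

neg_mse = cross_val_score(knn_pipeline, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=10)
rmse_per_fold = np.sqrt(-neg_mse)
print("Mean RMSE:", rmse_per_fold.mean())
```

With only a few numeric features the difference is likely small, but the pipeline form also makes it harder to mix scaled and unscaled data by accident.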

## Linear regression
Next, let's try the linear regression model.

In [467]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, x_train_copy, y_train, scoring="neg_mean_squared_error", cv = 10)

display_scores(scores)

Scores: [0.44092177 0.41540094 0.41062824 0.39885982 0.41700179 0.4286739
 0.4334701  0.44040698 0.43278802 0.38106429]
Mean: 0.4199215859702467
Standard deviation: 0.018317160989485017


### Linear regression overview
Linear regression actually performs a little better than the KNN model,
with a mean RMSE of about 0.42 (an MSE of roughly 0.18), which again isn't that bad.
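
Because the printed scores are root mean squared errors while values around 0.18 are mean squared errors, it helps to convert between the two explicitly. The sketch below squares the mean cross-validated RMSE values printed above; squaring a mean of per-fold RMSEs only approximates the mean MSE, so the results are indicative.

```python
# Mean cross-validated RMSE values copied from the notebook output above.
cv_rmse = {
    "knn (n_neighbors=21)": 0.4246367404165433,
    "linear regression": 0.4199215859702467,
}

# RMSE is in rating points; squaring it gives the (approximate) MSE.
approx_mse = {name: rmse ** 2 for name, rmse in cv_rmse.items()}
for name, mse in approx_mse.items():
    print(f"{name}: RMSE={cv_rmse[name]:.3f}, MSE~{mse:.3f}")
```

RMSE is the easier number to interpret here, since it is expressed in the same units as the rating itself.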


In [468]:
lin_reg.fit(x_train_copy, y_train)

x_test_copy = scalar.transform(x_test)
pred = lin_reg.predict(x_test_copy)

lin_mse = mean_squared_error(y_test, pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
print(lin_rmse, lin_mse)
print(pred[:10])
#pred

0.430718973948816 0.18551883451952084
[2.97389056 3.13620375 2.98275809 2.98894151 3.18459156 3.07516332
 3.17256206 3.28733251 3.20915721 3.17708218]


In [469]:
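model = KNeighborsRegressor(n_neighbors=10)
# Caution: the model is fitted on the unscaled x_train but evaluated on the
# scaled x_test_copy; fitting on x_train_copy would keep both in the same feature space.
model.fit(x_train, y_train)

pred = model.predict(x_test_copy)

knn_mse = mean_squared_error(y_test, pred)
knn_rmse = np.sqrt(knn_mse)
knn_rmse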

0.5043699484939802

### Correlation to rating
Let's take a look at the correlations with "rating".

In [470]:
train_set.corr()["rating"]

country_of_bean_origin     0.038539
cocoa_percent             -0.082468
rating                     1.000000
counts_of_ingredients     -0.083579
cocoa_butter               0.022779
vanilla                   -0.158961
lecithin                  -0.063290
salt                      -0.043152
sugar                      0.099833
sweetener_without_sugar   -0.096052
first_taste               -0.061379
second_taste              -0.074703
third_taste               -0.103739
fourth_taste              -0.044089
Name: rating, dtype: float64

We can see that every feature is only weakly correlated with the rating; the strongest is vanilla at about -0.16.

Country of bean origin shows almost no correlation (expected, given the poor choice of encoding),
and, surprisingly, the first, second, ... taste columns show a minor negative correlation.
Since a missing taste is encoded as -1, the lowest code, one tentative reading is
that the fewer tastes a chocolate has, the higher its rating, although the arbitrary ordering of the other codes makes this a weak signal.
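
A more direct way to check the "fewer tastes, higher rating" idea is to compare mean ratings with and without a given taste slot filled, rather than relying on label codes. The rows below are invented purely to show the mechanics and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Invented rows; on the real data this would be chocolate_n2.
toy = pd.DataFrame({
    "rating": [3.75, 3.5, 3.0, 2.75, 3.25, 3.5],
    "fourth_taste": [np.nan, np.nan, "sour", "bitter", np.nan, "sweet"],
})

# Group by whether the fourth taste slot is filled, then average the rating.
presence = np.where(toy["fourth_taste"].notna(),
                    "has fourth taste", "no fourth taste")
mean_by_presence = toy.groupby(presence)["rating"].mean()
print(mean_by_presence)
```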

## Another approach
Let's take a look at how many distinct values there are in the country of bean origin,
and binary encode only the top choices for country of origin.
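
Binary encoding writes each category's integer code in base 2 and stores one bit per column, so n categories need roughly log2(n) columns instead of n one-hot columns. A minimal sketch with plain NumPy follows; the function name `binary_encode` is made up for illustration, and a library encoder such as `category_encoders.BinaryEncoder` may assign the integer codes in a different order.

```python
import numpy as np

def binary_encode(codes, n_bits):
    """Return an (n_samples, n_bits) array with the bits of each integer code."""
    codes = np.asarray(codes, dtype=np.int64)
    # Shift amounts from the most significant bit down to bit 0.
    shifts = np.arange(n_bits - 1, -1, -1)
    return (codes[:, None] >> shifts) & 1

bits = binary_encode([1, 2, 3, 16], n_bits=5)
print(bits)
```

Each row is the base-2 representation of the code, most significant bit first. Note that rows sharing bits look "close" to a distance-based model even when the categories are unrelated.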

In [471]:
chocolate_n2["country_of_bean_origin"].value_counts().head(20)

Venezuela             238
Peru                  207
Dominican republic    200
Ecuador               194
Madagascar            157
Blend                 140
Nicaragua              92
Brazil                 74
Bolivia                71
Belize                 65
Colombia               65
Vietnam                64
Tanzania               63
Guatemala              53
Papua new guinea       48
Mexico                 45
Costa rica             42
Trinidad               38
Ghana                  32
U.s.a.                 28
Name: country_of_bean_origin, dtype: int64

In [472]:
chocolate_n2

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


In [473]:
def replace_names(data_set : pd.DataFrame, column : str, top_n = 15):
    """Map the top_n most frequent values of `column` to 0..top_n-1 and all others to top_n."""
    value_map = {}
    data_copy = data_set.copy()
    # value_counts() is sorted by frequency, so the position in its index is the rank
    for index_val, name in enumerate(data_set[column].value_counts().index):
        if index_val < top_n:
            print(name, index_val)
            value_map[name] = index_val
        else:
            value_map[name] = top_n
    data_copy[column] = data_copy[column].replace(value_map)
    return data_copy, value_map


#chocolate_n2["country_of_bean_origin"].value_counts().index
chocolate_n3 = replace_names(chocolate_n2, "country_of_bean_origin")[0]
chocolate_n3 = chocolate_n3.rename(columns={"country_of_bean_origin" : "bean_origin"})
chocolate_n3

Venezuela 0
Peru 1
Dominican republic 2
Ecuador 3
Madagascar 4
Blend 5
Nicaragua 6
Brazil 7
Bolivia 8
Belize 9
Colombia 10
Vietnam 11
Tanzania 12
Guatemala 13
Papua new guinea 14


Unnamed: 0,bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,4,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,2,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,12,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,1,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,8,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,5,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,10,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,9,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,15,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


In [474]:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['bean_origin'])
df_binary = encoder.fit_transform(chocolate_n3)

df_binary


Unnamed: 0,bean_origin_0,bean_origin_1,bean_origin_2,bean_origin_3,bean_origin_4,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,0,0,0,0,1,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,0,0,0,1,0,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,0,0,0,1,1,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,0,0,1,0,0,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,0,0,1,0,1,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,0,1,1,1,1,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,0,1,0,1,1,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,0,1,1,1,0,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,0,1,0,1,0,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


## Let's take a look at how well this approach performs
We will drop the taste columns for now.

In [491]:
# helper for comparing how a model performs on different versions of the data set
def run_test(data_set, name :str, test_model):
    print()
    print(test_model)
    print(name)
    train_set_r, test_set_r = train_test_split(data_set, test_size = 0.2, random_state = 2)
    x_train_r, y_train_r = split_x_y(train_set_r)

    scalar_r = StandardScaler()
    x_train_copy_r = scalar_r.fit_transform(x_train_r)

    scores_r = cross_val_score(test_model, x_train_copy_r, y_train_r, scoring="neg_mean_squared_error", cv = 10)

    display_scores(scores_r)

    train_set_r2, test_set_r2 = train_test_split(train_set_r, test_size = 0.1, random_state = 2)
    x_train_r2, y_train_r2 = split_x_y(train_set_r2)
    x_test_r2, y_test_r2 = split_x_y(test_set_r2)

    x_train_copy_r2 = scalar_r.fit_transform(x_train_r2)
    test_model.fit(x_train_copy_r2, y_train_r2)
    x_test_r2_copy = scalar_r.transform(x_test_r2)
    pred_lin = test_model.predict(x_test_r2_copy)
    print(pred_lin[:10])
    print(list(y_test_r2[:10]))
    print("original mean : ", y_test_r2.mean(), "original std:", y_test_r2.std())
    print("predicted mean : ", pred_lin.mean(), "predicted std:", pred_lin.std())

In [492]:
lin_reg = LinearRegression()

df_binary_nl = df_binary.iloc[:,:-4]
run_test(df_binary_nl, "test binary encoding", lin_reg)

df_binary_nl = df_binary_nl.iloc[:,5:]
run_test(df_binary_nl, "test without origin", lin_reg)

chocolate_cat_r = chocolate_cat.iloc[:,:-4]
run_test(chocolate_cat_r, "test cat encoding", lin_reg)

run_test(chocolate_cat, "test cat encoding", lin_reg)

####
print()
print("----- KNN : 50 ----")

knn_model = KNeighborsRegressor(n_neighbors=50)

df_binary_nl = df_binary.iloc[:,:-4]
run_test(df_binary_nl, "test binary encoding", knn_model)

df_binary_nl = df_binary_nl.iloc[:,5:]
run_test(df_binary_nl, "test without origin", knn_model)

chocolate_cat_r = chocolate_cat.iloc[:,:-4]
run_test(chocolate_cat_r, "test cat encoding", knn_model)

run_test(chocolate_cat, "test cat encoding", knn_model)


LinearRegression()
test binary encoding
Scores: [0.4445332  0.42453713 0.41445475 0.40282278 0.42194185 0.42386839
 0.44024029 0.44091397 0.42978971 0.39638413]
Mean: 0.423948620266262
Standard deviation: 0.015204277477613201
[3.27503926 3.15785176 2.83271748 3.28578144 3.19239765 3.28810078
 3.20643574 3.20729023 3.10060078 3.28578144]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
predicted mean :  3.209007445746225 predicted std: 0.10450094663238171

LinearRegression()
test without origin
Scores: [0.44386937 0.42342922 0.41372851 0.40215668 0.42306793 0.42691484
 0.44227604 0.44061876 0.43012784 0.38867406]
Mean: 0.42348632420423254
Standard deviation: 0.016970996870033766
[3.24651736 3.1986658  2.84908694 3.28313846 3.18743533 3.22454471
 3.24651736 3.13421268 3.08880252 3.28313846]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
pr

## We can see that our model has just "learned" the mean
Its predictions scatter more or less randomly around it, with a much smaller spread than the true ratings.
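
A useful yardstick for this is the error of a model that always predicts the mean: its RMSE equals the (population) standard deviation of the targets. The notebook prints a standard deviation of about 0.42 for the held-out ratings, which is essentially the same as the cross-validated RMSE of about 0.42, so the models above do little better than this baseline. The sketch below uses the first ten held-out ratings printed above.

```python
import numpy as np

# First ten held-out ratings, copied from the output above.
y_true = np.array([3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25])

# Baseline: predict the mean rating for every bar.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline_pred) ** 2))

# The baseline RMSE is exactly the population standard deviation (ddof=0).
print(baseline_rmse, y_true.std())
```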


So this approach has failed to add any value.

In fact it is even worse than the category encoding.

There is another approach we can try:
<br> let's replace the country names with their region and sub-region, which leaves far fewer categories to one-hot encode.
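
As a rough picture of the idea, the sketch below maps a few countries to regions with a hand-written dictionary and one-hot encodes the result. The dictionary `REGION_OF` is an illustrative stand-in; the notebook looks regions up with the `countryinfo` package instead.

```python
import pandas as pd

# Hand-written lookup for illustration only.
REGION_OF = {
    "Madagascar": "Africa",
    "Peru": "Americas",
    "Bolivia": "Americas",
    "Vietnam": "Asia",
}

origins = pd.Series(["Madagascar", "Peru", "Bolivia", "Vietnam"], name="origin")
regions = origins.map(REGION_OF)

# One indicator column per region; exactly one of them is 1 in each row.
one_hot = pd.get_dummies(regions, prefix="origin").astype(int)
print(one_hot)
```

Unlike label or binary codes, the indicator columns impose no order and no accidental similarity between regions.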

In [477]:
country_name = "Grenada"
from countryinfo import CountryInfo

country = CountryInfo(country_name)
country.region()
country.subregion()
def name_to_loc(country_name : str):
    return CountryInfo(country_name).latlng()

print(country_name ,name_to_loc(country_name))
#print(pc.country_name_to_country_alpha3(country_name))

Grenada [12.11666666, -61.66666666]


In [478]:
from sklearn.base import BaseEstimator, TransformerMixin

country_name_fix = {
    "Blend" : "Tanzania",  # blends have no single origin; this places them in Eastern Africa
    "Trinidad" : "Trinidad and Tobago",
    "U.s.a." : "USA",
    "Sao tome" : "São Tomé and Príncipe",
    "Congo" : "Democratic Republic of the Congo",
    "St. lucia" : "Saint Lucia",
    "Sao tome & principe" : "São Tomé and Príncipe",
    "Sumatra" : "Indonesia",
    "Tobago" : "Trinidad and Tobago",
    "Bolvia" : "Bolivia",
    "Principe" : "São Tomé and Príncipe",
    "Sulawesi" : "Indonesia",
    "St.vincent-grenadines" : "Saint Vincent and the Grenadines",
    "Burma" : "India", # for some reason CountryInfo doesn't know this name
}

class RegionTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column : str, prefix : str = "", log = False):
        self.log = log
        self.name_map = {}
        self.unique = {}
        self.column = column
        self.prefix = prefix
        self.name_map_sub = {}
    def fit(self, x, y=None):
        r_set, self.name_map, self.name_map_sub, self.unique \
            = replace_country_name_to_continent(x, self.column, self.prefix, self.log)
        return self
    def transform(self, x):
        df_copy : pd.DataFrame = x.copy()
        for original_name in df_copy[self.column].value_counts().index:
            name = original_name
            if name in country_name_fix:
                name = country_name_fix[name]
            country_if = CountryInfo(name)
            region_n = country_if.region()
            sub_region_n = country_if.subregion()
            if region_n in self.unique:
                self.name_map[original_name] = region_n
                self.name_map_sub[original_name] = sub_region_n
                continue
            else:
                self.name_map[original_name] = "other"
                self.name_map_sub[original_name] = "other"
        df_copy[self.column] = df_copy[self.column].replace(self.name_map)
        df_copy["sub_" + self.column] = x[self.column].replace(self.name_map_sub)
        return df_copy

def replace_country_name_to_continent(df_data : pd.DataFrame, column : str, prefix : str, log = False):
    name_map = {}
    name_map_sub = {}
    unique = {}
    df_copy = df_data.copy()
    for original_name in df_data[column].value_counts().index:
        name = original_name
        if name in country_name_fix:
            name = country_name_fix[name]
        country_if = CountryInfo(name)
        region_n = country_if.region()
        sub_region_n = country_if.subregion()
        name_map[original_name] = region_n
        name_map_sub[original_name] = sub_region_n
        if region_n in unique:
            unique[region_n] += 1
        else:
            unique[region_n] = 1
    # for region, freq in unique.items():
    #     print(region, freq)
    df_copy[column] = df_data[column].replace(name_map)
    df_copy["sub_" + column] = df_data[column].replace(name_map_sub)
    if log:
        print(df_copy[column].value_counts())
    return df_copy, name_map, name_map_sub, unique
        #print(name, country_if.region(), country_if.subregion())

region_t = RegionTransformer("country_of_bean_origin", prefix="", log=True)
chocolate_no_taste = chocolate_n2.iloc[:,:-4]
chocolate_no_taste = region_t.fit_transform(chocolate_no_taste)
chocolate_no_taste

Americas    1545
Africa       449
Asia         139
Oceania       91
Name: country_of_bean_origin, dtype: int64


Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,sub_country_of_bean_origin
0,Africa,76.0,3.75,3,1,0,0,0,1,0,Eastern Africa
1,Americas,76.0,3.50,3,1,0,0,0,1,0,Caribbean
2,Africa,76.0,3.25,3,1,0,0,0,1,0,Eastern Africa
3,Americas,63.0,3.75,4,1,0,1,0,1,0,South America
4,Americas,70.0,3.50,4,1,0,1,0,1,0,South America
...,...,...,...,...,...,...,...,...,...,...,...
2219,Africa,80.0,2.75,4,1,0,0,1,0,1,Eastern Africa
2220,Americas,75.0,3.75,3,1,0,0,0,1,0,South America
2221,Americas,72.0,3.50,3,1,0,0,0,1,0,Central America
2222,Africa,70.0,3.25,3,1,0,0,0,1,0,Middle Africa


Next, let's keep only the top 3 values in the sub-region column.
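
The grouping itself can be sketched in a few lines of pandas. The helper name `keep_top_n` and the "other" label are assumptions for this example (the notebook's own transformer uses integer codes instead), and the sub-region counts are made up.

```python
import pandas as pd

def keep_top_n(values: pd.Series, n: int, other: str = "other") -> pd.Series:
    """Keep the n most frequent values and replace the rest with `other`."""
    top = values.value_counts().nlargest(n).index
    return values.where(values.isin(top), other)

# Made-up sub-region column with two rare values.
sub_regions = pd.Series(
    ["South America"] * 4 + ["Eastern Africa"] * 3
    + ["Central America"] * 2 + ["Caribbean", "Polynesia"]
)
grouped = keep_top_n(sub_regions, n=3)
print(grouped.value_counts())
```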

In [479]:
class TopXsub(BaseEstimator, TransformerMixin):
    def __init__(self, column : str, top : int):
        self.top = top
        self.column = column
        self.value_map = {}
    def fit(self, x, y=None):
        data_copy, self.value_map = replace_names(x, self.column, self.top)
        return self
    def transform(self, x):
        data_copy = x.copy()
        for name in x[self.column].value_counts().index:
            if name in self.value_map:
                continue
            else:
                self.value_map[name] = self.top
        data_copy[self.column] = data_copy[self.column].replace(self.value_map)
        return data_copy

top3 = TopXsub("sub_country_of_bean_origin", 3)
chocolate_no_taste2 = top3.fit_transform(chocolate_no_taste)
top3_region = TopXsub("country_of_bean_origin", 2)
chocolate_no_taste2 = top3_region.fit_transform(chocolate_no_taste2)
chocolate_no_taste2

South America 0
Eastern Africa 1
Central America 2
Americas 0
Africa 1


Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,sub_country_of_bean_origin
0,1,76.0,3.75,3,1,0,0,0,1,0,1
1,0,76.0,3.50,3,1,0,0,0,1,0,3
2,1,76.0,3.25,3,1,0,0,0,1,0,1
3,0,63.0,3.75,4,1,0,1,0,1,0,0
4,0,70.0,3.50,4,1,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...
2219,1,80.0,2.75,4,1,0,0,1,0,1,1
2220,0,75.0,3.75,3,1,0,0,0,1,0,0
2221,0,72.0,3.50,3,1,0,0,0,1,0,2
2222,1,70.0,3.25,3,1,0,0,0,1,0,3


In [480]:
print(chocolate_no_taste2["country_of_bean_origin"].value_counts())
chocolate_no_taste2["sub_country_of_bean_origin"].value_counts()
#chocolate_no_taste = replace_country_name_to_continent(chocolate_no_taste, "country_of_bean_origin", "origin_")[0]
#chocolate_no_taste

0    1545
1     449
2     230
Name: country_of_bean_origin, dtype: int64


0    851
3    669
1    370
2    334
Name: sub_country_of_bean_origin, dtype: int64

## Outlook
Instead of the 62 distinct countries of origin we started with, we are now down to 13 sub-regions,
most of them with a good number of entries.

The exceptions are:
- Australia and New Zealand: 3
- Polynesia: 3
- Eastern Asia: 2

These can always be folded into a catch-all category, as the top-3 grouping above does.

## New correlations
Let's take a look at the correlations now.

In [481]:
chocolate_no_taste2.corr()["rating"]

country_of_bean_origin       -0.022021
cocoa_percent                -0.078508
rating                        1.000000
counts_of_ingredients        -0.094850
cocoa_butter                  0.012224
vanilla                      -0.164881
lecithin                     -0.070179
salt                         -0.051381
sugar                         0.092216
sweetener_without_sugar      -0.087438
sub_country_of_bean_origin   -0.025640
Name: rating, dtype: float64

In [482]:
chocolate_cat_r.corr()["rating"]

country_of_bean_origin     0.025628
cocoa_percent             -0.078508
rating                     1.000000
counts_of_ingredients     -0.094850
cocoa_butter               0.012224
vanilla                   -0.164881
lecithin                  -0.070179
salt                      -0.051381
sugar                      0.092216
sweetener_without_sugar   -0.087438
Name: rating, dtype: float64

In [483]:
chocolate_no_taste2["sub_country_of_bean_origin"].value_counts()

0    851
3    669
1    370
2    334
Name: sub_country_of_bean_origin, dtype: int64

Let's take a look at the rating now


In [493]:
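from sklearn.neighbors import KNeighborsRegressor

# One shared KNN model, reused for every encoding variant below.
knn_model = KNeighborsRegressor(n_neighbors=50)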

run_test(chocolate_no_taste2, "region one-hot encoding", knn_model)

chocolate_no_taste3 = chocolate_no_taste2.drop(["sub_country_of_bean_origin"],
                                               axis=1)
run_test(chocolate_no_taste3, "test oner-hot drop sub encoding", knn_model)

chocolate_cat_r = chocolate_cat.iloc[:,:-4]
run_test(chocolate_cat_r, "test cat encoding", knn_model)


KNeighborsRegressor(n_neighbors=50)
region one-hot encoding
Scores: [0.45236662 0.40471912 0.40158976 0.39778209 0.42847674 0.41752716
 0.44619511 0.4348937  0.43364044 0.38601084]
Mean: 0.4203201574930052
Standard deviation: 0.021071193289562418
[3.325 3.11  2.935 3.355 3.245 3.305 3.265 3.025 3.13  3.355]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
predicted mean :  3.2278707865168537 predicted std: 0.12073859841232384

KNeighborsRegressor(n_neighbors=50)
test oner-hot drop sub encoding
Scores: [0.45518702 0.40917531 0.40009988 0.39817791 0.4255177  0.41414339
 0.4432558  0.43888073 0.43042375 0.388908  ]
Mean: 0.4203769491251081
Standard deviation: 0.020648301366899723
[3.355 3.065 2.94  3.32  3.14  3.36  3.355 3.04  3.205 3.32 ]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
predicted mean :  3.2373595505617976 predicted std: 0.

## not much effect
Sadly, that gave almost no improvement at all,
but it is still an improvement nonetheless.
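
The region columns printed earlier hold integer codes (0 to 3), and a distance-based model such as KNN treats those codes as ordered numbers. A minimal sketch of a true one-hot encoding with `pd.get_dummies`, on a made-up toy frame; the `sub` prefix is an arbitrary choice, and whether this helps on the real data is untested here:

```python
import pandas as pd

# Toy frame with an integer-coded region column, as in the tables above.
toy = pd.DataFrame({
    "cocoa_percent": [76.0, 70.0, 63.0, 80.0],
    "sub_country_of_bean_origin": [0, 3, 1, 2],
})

# Replace the coded column with one 0/1 indicator column per category.
one_hot = pd.get_dummies(
    toy,
    columns=["sub_country_of_bean_origin"],
    prefix="sub",
    dtype=int,
)
print(one_hot)
```

With indicator columns, every pair of distinct regions is equally far apart, instead of region 0 being "closer" to region 1 than to region 3.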


## let's do the same thing for the taste values

## count number of flavors

In [485]:
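from sklearn.base import BaseEstimator, TransformerMixin

class CountFlavors(BaseEstimator, TransformerMixin):
    """Add a num_flavors column counting the non-missing *_taste entries per row."""
    def fit(self, x, y=None):
        # Nothing to learn; y is accepted (and ignored) to follow the
        # scikit-learn fit(X, y) convention.
        return self
    def transform(self, x):
        df_copy = x.copy()
        df_copy["num_flavors"] = 0
        for column in df_copy.columns:
            if "_taste" in column:
                # Mark missing tastes explicitly, then count the real ones.
                df_copy[column] = df_copy[column].fillna("no flavor")
                df_copy["num_flavors"] += np.where(df_copy[column] == "no flavor", 0, 1)
        return df_copy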

chocolate_n2

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,
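
The flavor count that `CountFlavors` computes can also be sketched without the explicit loop. This is a hedged, vectorized equivalent for the case where missing tastes are `NaN`, as in the table above; `count_flavors` and the `sample` frame are hypothetical names used only for illustration:

```python
import numpy as np
import pandas as pd

def count_flavors(df: pd.DataFrame) -> pd.Series:
    """Count the non-missing entries across all *_taste columns, row by row."""
    taste_cols = [c for c in df.columns if "_taste" in c]
    return df[taste_cols].notna().sum(axis=1)

# Small sample shaped like the taste columns of chocolate_n2.
sample = pd.DataFrame({
    "first_taste": ["cocoa", "vegetal", np.nan],
    "second_taste": ["blackberry", "nutty", np.nan],
    "third_taste": ["full body", np.nan, np.nan],
    "fourth_taste": [np.nan, np.nan, np.nan],
})
sample["num_flavors"] = count_flavors(sample)
print(sample["num_flavors"].tolist())
```

Unlike the transformer, this version does not fill the missing tastes with "no flavor"; it only produces the count.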


In [486]:
my_pipe = Pipeline(
    [
      ("1", RegionTransformer("country_of_bean_origin", prefix="", log=True)),
      ("2", TopXsub("sub_country_of_bean_origin", 3)),
      ("3", TopXsub("country_of_bean_origin", 2)),
      ("4", CountFlavors()),
      ("5", TopXsub("first_taste", 3)),
      ("6", TopXsub("second_taste", 4)),
      ("7", TopXsub("third_taste", 3)),
      ("8", TopXsub("fourth_taste", 4)),
    ])

chocolate_n5 = my_pipe.fit_transform(chocolate_n2)
chocolate_n5

Americas    1545
Africa       449
Asia         139
Oceania       91
Name: country_of_bean_origin, dtype: int64
South America 0
Eastern Africa 1
Central America 2
Americas 0
Africa 1
creamy 0
sandy 1
intense 2
sweet 0
nutty 1
no flavor 2
earthy 3
no flavor 0
cocoa 1
nutty 2
no flavor 0
cocoa 1
roasty 2
sour 3


Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste,sub_country_of_bean_origin,num_flavors
0,1,76.0,3.75,3,1,0,0,0,1,0,3,4,3,0,1,3
1,0,76.0,3.50,3,1,0,0,0,1,0,3,4,3,0,3,3
2,1,76.0,3.25,3,1,0,0,0,1,0,3,4,3,0,1,3
3,0,63.0,3.75,4,1,0,1,0,1,0,3,4,3,0,0,3
4,0,70.0,3.50,4,1,0,1,0,1,0,3,1,0,0,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,1,80.0,2.75,4,1,0,0,1,0,1,3,4,3,0,1,3
2220,0,75.0,3.75,3,1,0,0,0,1,0,3,4,0,0,0,2
2221,0,72.0,3.50,3,1,0,0,0,1,0,3,4,3,0,2,3
2222,1,70.0,3.25,3,1,0,0,0,1,0,3,4,3,0,3,3
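
Judging from the log above, `TopXsub(column, x)` appears to give the `x` most frequent values the codes `0` to `x-1` and to lump everything else under the code `x`. Its definition is not shown in this section, so the following is only a sketch of that assumed behavior; `top_x_codes` is a hypothetical helper, not the notebook's class:

```python
import pandas as pd

def top_x_codes(series: pd.Series, x: int) -> pd.Series:
    """Code the x most frequent values as 0..x-1 and all other values as x."""
    top = series.value_counts().index[:x]
    mapping = {name: code for code, name in enumerate(top)}
    # Values outside the top x (and missing values) map to NaN, then to x.
    return series.map(mapping).fillna(x).astype(int)

regions = pd.Series(
    ["Americas"] * 5 + ["Africa"] * 3 + ["Asia"] * 2 + ["Oceania"]
)
codes = top_x_codes(regions, 2)
print(codes.value_counts().sort_index())
```

In this toy series, Asia and Oceania share the catch-all code 2, which matches the way the two smallest regions end up in one group in the notebook's output.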


In [487]:
chocolate_n5.corr()["rating"]

country_of_bean_origin       -0.022021
cocoa_percent                -0.078508
rating                        1.000000
counts_of_ingredients        -0.094850
cocoa_butter                  0.012224
vanilla                      -0.164881
lecithin                     -0.070179
salt                         -0.051381
sugar                         0.092216
sweetener_without_sugar      -0.087438
first_taste                  -0.128189
second_taste                  0.038270
third_taste                  -0.103693
fourth_taste                 -0.034067
sub_country_of_bean_origin   -0.025640
num_flavors                  -0.074751
Name: rating, dtype: float64

In [494]:
run_test(chocolate_n5, "test", knn_model)


KNeighborsRegressor(n_neighbors=50)
test
Scores: [0.4392753  0.39657086 0.40231989 0.39007878 0.42260039 0.41858963
 0.43140744 0.43199153 0.42918478 0.38910026]
Mean: 0.41511188574282487
Standard deviation: 0.017919886198967148
[3.43  3.14  2.9   3.285 3.2   3.215 3.34  2.86  3.11  3.4  ]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
predicted mean :  3.2250505617977527 predicted std: 0.12861956237496333


In [531]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=30, max_depth=4)
run_test(chocolate_n5, "test", forest)


RandomForestRegressor(max_depth=4, n_estimators=30)
test
Scores: [0.43835896 0.40230967 0.39939744 0.37630656 0.4069506  0.42525803
 0.43651759 0.43144148 0.41696228 0.38187672]
Mean: 0.41153793340069045
Standard deviation: 0.02081297686664735
[3.18966555 3.18463689 2.93902445 3.26721461 3.18150523 3.2425894
 3.21464215 2.93250695 2.91047429 3.67036444]
[3.0, 3.25, 3.0, 3.0, 3.75, 3.5, 4.0, 3.5, 3.5, 3.25]
original mean :  3.1980337078651684 original std: 0.4165150806593712
predicted mean :  3.210064240041205 predicted std: 0.13646830559173687


In [490]:
print("original mean :",chocolate["rating"].mean(), ", original std:",
      chocolate["rating"].std())

original mean : 3.198561151079137 , original std: 0.43432896919136804


## Honestly, I have no idea if I can improve this.

Our model just "predicts" the mean of the sample and gives us a value
near the mean: the predicted ratings have a standard deviation of only about
0.12 to 0.14, while the original ratings have about 0.42.
It's not really that great; the only thing we can learn
from this is that on average experts rate chocolate at about 3.2.

There might just be too little data!
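
The suspicion that the model only predicts the mean can be checked against an explicit baseline. A minimal sketch, assuming RMSE as the metric (the metric used by `run_test` is not shown here) and synthetic data in which the features carry no information about the target; `cv_rmse` is a hypothetical helper:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                   # synthetic, uninformative features
y = rng.normal(loc=3.2, scale=0.42, size=n)   # ratings-like target

def cv_rmse(model, X, y, folds=10):
    """Mean cross-validated RMSE (scikit-learn returns it negated)."""
    scores = cross_val_score(
        model, X, y, cv=folds, scoring="neg_root_mean_squared_error"
    )
    return -scores.mean()

# DummyRegressor(strategy="mean") always predicts the training-set mean.
baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"), X, y)
knn_rmse = cv_rmse(KNeighborsRegressor(n_neighbors=50), X, y)
print("baseline RMSE:", baseline_rmse)
print("KNN RMSE:     ", knn_rmse)
```

If a real model's error on the chocolate data is about the same as this kind of baseline, the features add little beyond the mean rating.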

### conclusion
Well, maybe there is just not enough data about each chocolate,
or there is a secret that we do not know.
There is almost no correlation between "bean-origin" and
the rating, and there is little correlation between the ingredients
and the rating.

The fact that there are so many ingredients doesn't make our job easy,
as we simply have too little data per ingredient.

So in conclusion, neither KNN nor linear regression
could give us a good result.


