# Introduction
This dataset has about 2,200 entries describing different chocolate bars:
their manufacturers, the origin of the beans, and so on.

The dataset also contains a rating for the quality of each chocolate.

## Flavors of Cacao Rating System:
- #### Rating Scale:
    - 4.0 - 5.0 = Outstanding
    - 3.5 - 3.9 = Highly Recommended
    - 3.0 - 3.49 = Recommended
    - 2.0 - 2.9 = Disappointing
    - 1.0 - 1.9 = Unpleasant
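
The scale above can be written as a small lookup. This is only a minimal sketch: the helper name `rating_label` is hypothetical, and it assumes that each band starts at its lower bound and runs up to the start of the next band (the ratings shown in this notebook appear to come in steps of 0.25, so the gaps between the listed bands should never be hit).

```python
def rating_label(rating: float) -> str:
    """Map a numeric rating to its category (assumed half-open bands)."""
    if rating >= 4.0:
        return "Outstanding"
    if rating >= 3.5:
        return "Highly Recommended"
    if rating >= 3.0:
        return "Recommended"
    if rating >= 2.0:
        return "Disappointing"
    return "Unpleasant"


# label a few ratings of the kind that appear in the data
labels = [rating_label(r) for r in (3.75, 3.5, 3.25, 2.75)]
print(labels)
```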

### About this project
In this project we will try to predict the rating of a chocolate
based on the data we have, and work out which factors drive
the rating and quality of the chocolate.
It may also turn out that there is a "secret" formula that the data does not capture.

In [351]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [352]:
chocolate = pd.read_csv('data/chocolate.csv')
pd.set_option('display.max_columns', None)
chocolate = chocolate.iloc[:,1:]
chocolate.head()

Unnamed: 0,ref,company,company_location,review_date,country_of_bean_origin,specific_bean_origin_or_bar_name,cocoa_percent,rating,counts_of_ingredients,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,2454,5150,U.S.A,2019,Madagascar,"Bejofo Estate, batch 1",76.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,2458,5150,U.S.A,2019,Dominican republic,"Zorzal, batch 1",76.0,3.5,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,2454,5150,U.S.A,2019,Tanzania,"Kokoa Kamili, batch 1",76.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,797,A. Morin,France,2012,Peru,Peru,63.0,3.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,797,A. Morin,France,2012,Bolivia,Bolivia,70.0,3.5,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,


In [353]:
chocolate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 20 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   ref                               2224 non-null   int64  
 1   company                           2224 non-null   object 
 2   company_location                  2224 non-null   object 
 3   review_date                       2224 non-null   int64  
 4   country_of_bean_origin            2224 non-null   object 
 5   specific_bean_origin_or_bar_name  2224 non-null   object 
 6   cocoa_percent                     2224 non-null   float64
 7   rating                            2224 non-null   float64
 8   counts_of_ingredients             2224 non-null   int64  
 9   beans                             2224 non-null   object 
 10  cocoa_butter                      2224 non-null   object 
 11  vanilla                           2224 non-null   object 
 12  lecith

## Data insight
We can see that there are some null values.
A quick overview shows that they sit in the taste columns ("first_taste", "second_taste", ...),
which together are basically a list of the tastes found in the chocolate,
and which we will need to transform into numerical data anyway.
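
A quick way to confirm where the nulls are is to count them per column. The sketch below uses a tiny made-up frame (`toy`) in place of the real data; the column names mirror the dataset, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "rating": [3.75, 3.5, 3.25],
    "first_taste": ["cocoa", "vegetal", "fatty"],
    "third_taste": ["full body", np.nan, "mild fruit"],
    "fourth_taste": [np.nan, np.nan, np.nan],
})

# number of missing values in each column
null_counts = toy.isna().sum()

# keep only the columns that actually contain nulls
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

On the real frame the same two lines would be applied to `chocolate` instead of `toy`.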

# Transforming the data for use
In this section we will transform all the categories into data that we can use.

## Drop unnecessary columns
We will drop company, company location, specific_bean_origin_or_bar_name, beans,
review date and so on. Those values should not affect the quality of the chocolate.
The company might, but there are simply too many distinct companies, so we will drop it as well.

In [354]:
dropc = ["specific_bean_origin_or_bar_name"]
chocolate_n = chocolate.iloc[:,4:]
chocolate_n = chocolate_n.drop(dropc, axis = 1)
chocolate_n

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_salt,have_not_sugar,have_sweetener_without_sugar,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fatty,mild nuts,mild fruit,


### Look into Object dtypes, that are potential categories.

In [355]:
# select the columns where the dtype is object
cat = chocolate_n.select_dtypes(include=['object']).copy()

huge = []
is_binary = []
drop_cols = []
for col in cat.columns:
    num = len(cat[col].unique())
    if num < 4:
        if num == 1:
            drop_cols.append(col)
            continue
        elif num == 2:
            is_binary.append(col)
        else:
            huge.append(col)
        print(col, cat[col].unique())
    else:
        huge.append(col)
        print(col, num)
chocolate_n = chocolate_n.drop(drop_cols, axis=1)
chocolate_n
#chocolate["sugar"].unique()

country_of_bean_origin 62
cocoa_butter ['have_cocoa_butter' 'have_not_cocoa_butter']
vanilla ['have_not_vanila' 'have_vanila']
lecithin ['have_not_lecithin' 'have_lecithin']
salt ['have_not_salt' 'have_salt']
sugar ['have_sugar' 'have_not_sugar']
sweetener_without_sugar ['have_not_sweetener_without_sugar' 'have_sweetener_without_sugar']
first_taste 456
second_taste 480
third_taste 333
fourth_taste 89


Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_salt,have_not_sugar,have_sweetener_without_sugar,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fatty,mild nuts,mild fruit,


## First look
For instance, "sugar" has only two values, so it is a boolean value.
<br>On the other hand, "beans" has just one value, so we can drop that column.

In [356]:
def replace_binary(data_set: pd.DataFrame, cols: list):
    # map "have_not_*" values to 0 and "have_*" values to 1
    data_copy = data_set.copy()
    for column in cols:
        data_copy[column] = np.where(data_set[column].str.contains("_not_"), 0, 1)
    return data_copy

chocolate_n2 = replace_binary(chocolate_n, is_binary)
chocolate_n2

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


# First approach
###  Label Encoding
Let's use label encoding to encode the tastes, and try to predict the score after that.

This approach is quite easy to implement but might work badly,
because it imposes an arbitrary order on the categories. There is no reason,
for instance, that "cocoa" should be 10 when "mild fruit" is given the value 1;
<br>a model that treats the codes as numbers would give "cocoa" ten times the weight.
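
To make the ordering problem concrete, here is a small sketch on a made-up series (`tastes` is a hypothetical name). pandas assigns category codes in sorted order of the labels and uses -1 for missing values, so the resulting numbers carry no information about the tastes themselves.

```python
import pandas as pd

# a few made-up taste values, including one missing entry
tastes = pd.Series(["cocoa", "mild fruit", None, "cocoa", "bready"])

# categories are sorted alphabetically: bready=0, cocoa=1, mild fruit=2; missing=-1
codes = tastes.astype("category").cat.codes
print(codes.tolist())
```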

In [357]:
def label_encode(data_set: pd.DataFrame, columns: list):
    # replace each category by its integer code; missing values become -1
    data_copy = data_set.copy()
    for column in columns:
        data_copy[column] = data_copy[column].astype('category')
        data_copy[column] = data_copy[column].cat.codes
    return data_copy
chocolate_cat = label_encode(chocolate_n2, huge)
chocolate_cat

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,28,76.0,3.75,3,1,0,0,0,1,0,86,27,111,-1
1,12,76.0,3.50,3,1,0,0,0,1,0,86,460,256,-1
2,52,76.0,3.25,3,1,0,0,0,1,0,319,126,26,-1
3,36,63.0,3.75,4,1,0,1,0,1,0,137,222,247,-1
4,3,70.0,3.50,4,1,0,1,0,1,0,437,288,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,2,80.0,2.75,4,1,0,0,1,0,1,446,84,316,-1
2220,8,75.0,3.75,3,1,0,0,0,1,0,397,214,-1,-1
2221,1,72.0,3.50,3,1,0,0,0,1,0,265,357,0,-1
2222,9,70.0,3.25,3,1,0,0,0,1,0,130,255,180,-1


## Trying a few models
### Preparing the data

In [358]:
train_set, test_set = train_test_split(chocolate_cat, test_size = 0.2, random_state = 2)
# print( train_set.shape)
# print( test_set.shape)

def split_test_train(data_set):
    return data_set.drop('rating', axis = 1), data_set['rating']

x_train, y_train = split_test_train(train_set)
x_test, y_test = split_test_train(test_set)
print(train_set.shape, test_set.shape)

(1779, 14) (445, 14)


## KNN
Let's use KNN for our first try.

In [359]:
def display_scores(m_scores):
    print("Scores:", m_scores)
    print("Mean:", m_scores.mean())
    print("Standard deviation:", m_scores.std())

from sklearn.neighbors import KNeighborsRegressor

x_train_copy = x_train.copy()

scalar = StandardScaler()
x_train_copy = scalar.fit_transform(x_train_copy)

for i in range(3, 30, 6):
    model = KNeighborsRegressor(n_neighbors=i)
    scores = cross_val_score(model, x_train_copy, y_train, scoring="neg_mean_squared_error", cv = 10)
    print(model)
    display_scores(scores)
    print()

KNeighborsRegressor(n_neighbors=3)
Scores: [-0.26611267 -0.20962079 -0.21281991 -0.20934613 -0.22830836 -0.21393727
 -0.25016074 -0.22315855 -0.22023252 -0.18278249]
Mean: -0.22164794271990518
Standard deviation: 0.021869242289076347

KNeighborsRegressor(n_neighbors=9)
Scores: [-0.20648668 -0.18822392 -0.18356395 -0.17416042 -0.20407806 -0.18778853
 -0.21254456 -0.19711281 -0.1996383  -0.14536479]
Mean: -0.1898962029964114
Standard deviation: 0.01847783244351037

KNeighborsRegressor(n_neighbors=15)
Scores: [-0.20169788 -0.17312353 -0.1676475  -0.16825612 -0.19271086 -0.18534644
 -0.20266648 -0.19138009 -0.18984794 -0.15076472]
Mean: -0.18234415621010455
Standard deviation: 0.015933441456796652

KNeighborsRegressor(n_neighbors=21)
Scores: [-0.20460454 -0.17163953 -0.16468837 -0.1628273  -0.19006366 -0.18469209
 -0.19692737 -0.19076375 -0.18744025 -0.15304476]
Mean: -0.18066916385145226
Standard deviation: 0.015835972876983226

KNeighborsRegressor(n_neighbors=27)
Scores: [-0.20378824 -0.

### KNN overview
Surprisingly, even with the "bad" encoding of tastes, KNN appears to perform reasonably well:
the mean squared error is about 0.18 (the scorer reports it negated), which is an RMSE of roughly 0.42 rating points.
That might be good enough, since this is an expert rating and is somewhat subjective.

On the other hand, most of our ratings lie between 2.5 and 4, so an error of this size might be significant.
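
One caveat about the cross-validation above: the scaler is fitted on the whole training set before the folds are split, so each validation fold has already influenced the scaling. A possible alternative, sketched here on synthetic data (`X_demo` and `y_demo` are made up, not the chocolate data), is to put the scaler and the model in a `Pipeline` so that scaling is refitted inside every fold. The sketch also converts the negated MSE scores into RMSE.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data, only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# the scaler is fitted on the training part of each fold only
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=15)),
])

# scikit-learn maximizes scores, so the MSE comes back negated
neg_mse = cross_val_score(knn_pipeline, X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=10)

# flip the sign and take the square root to get RMSE in the target's units
rmse_per_fold = np.sqrt(-neg_mse)
print("Mean RMSE:", rmse_per_fold.mean())
```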


#### we will try to improve the score later

## Linear Regression
As the next step, let's try the linear regression model.

In [360]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
scores = cross_val_score(lin_reg, x_train_copy, y_train, scoring="neg_mean_squared_error", cv = 10)

display_scores(scores)

Scores: [-0.19441201 -0.17255794 -0.16861555 -0.15908916 -0.1738905  -0.18376131
 -0.18789632 -0.19395831 -0.18730547 -0.14521   ]
Mean: -0.17666965675048202
Standard deviation: 0.015137441089699288


### Linear Regression overview
Linear regression actually performs a little better than the KNN model,
with a mean squared error of about 0.177 (an RMSE of about 0.42), which again isn't that bad.


In [361]:
lin_reg.fit(x_train_copy, y_train)

# scale the test features with the scaler fitted on the training data
x_test_copy = scalar.transform(x_test)
pred = lin_reg.predict(x_test_copy)

lin_mse = mean_squared_error(y_test, pred)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse, lin_mse)

0.430718973948816 0.18551883451952084


In [362]:
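model = KNeighborsRegressor(n_neighbors=10)
# fit on the scaled training features so that they match the scaled test features
# (the value recorded below came from fitting on the unscaled x_train)
model.fit(x_train_copy, y_train)

pred = model.predict(x_test_copy)

knn_mse = mean_squared_error(y_test, pred)
knn_rmse = np.sqrt(knn_mse)
knn_rmse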

0.5043699484939802

### Correlation to rating
Let's take a look at the correlation of each feature with "rating".

In [363]:
train_set.corr()["rating"]

country_of_bean_origin     0.038539
cocoa_percent             -0.082468
rating                     1.000000
counts_of_ingredients     -0.083579
cocoa_butter               0.022779
vanilla                   -0.158961
lecithin                  -0.063290
salt                      -0.043152
sugar                      0.099833
sweetener_without_sugar   -0.096052
first_taste               -0.061379
second_taste              -0.074703
third_taste               -0.103739
fourth_taste              -0.044089
Name: rating, dtype: float64

We can see that more or less everything has some effect on the outcome,
although every correlation is weak (none exceeds 0.16 in absolute value).

Country of bean origin has almost no effect (expected, given the poor choice of encoding),
and surprisingly the first, second, ... taste columns have some minor impact.
Since -1 encodes "no taste", the negative correlation suggests
that the fewer tastes there are in a chocolate, the higher the quality is.
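
Because the label codes are arbitrary, a more direct check of this idea is to count the tastes listed for each bar and correlate that count with the rating. The sketch below does this on a made-up frame (`toy`); the values are illustrative only and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "rating": [3.75, 3.5, 3.0, 2.75],
    "first_taste": ["cocoa", "vegetal", "fatty", "waxy"],
    "second_taste": ["blackberry", "nutty", "bready", "cloying"],
    "third_taste": [np.nan, np.nan, "sour", "vegetal"],
    "fourth_taste": [np.nan, np.nan, np.nan, "bitter"],
})

# count the non-missing taste entries in each row
taste_cols = [c for c in toy.columns if c.endswith("_taste")]
num_tastes = toy[taste_cols].notna().sum(axis=1)

# Pearson correlation between the number of tastes and the rating
corr = num_tastes.corr(toy["rating"])
print(num_tastes.tolist(), corr)
```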

## Another approach
Let's take a look at how many distinct values there are in the country of bean origin,
and binary encode only the top choices for country of origin.

In [364]:
chocolate_n2["country_of_bean_origin"].value_counts().head(20)

Venezuela             238
Peru                  207
Dominican republic    200
Ecuador               194
Madagascar            157
Blend                 140
Nicaragua              92
Brazil                 74
Bolivia                71
Colombia               65
Belize                 65
Vietnam                64
Tanzania               63
Guatemala              53
Papua new guinea       48
Mexico                 45
Costa rica             42
Trinidad               38
Ghana                  32
U.s.a.                 28
Name: country_of_bean_origin, dtype: int64

In [365]:
chocolate_n2

Unnamed: 0,country_of_bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,Madagascar,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,Dominican republic,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,Tanzania,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,Peru,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,Bolivia,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Blend,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,Colombia,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,Belize,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,Congo,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


In [366]:
def replace_names(data_set : pd.DataFrame, column : str, top_n = 15):
    # map the top_n most frequent values to 0..top_n-1 and every other value to top_n
    value_map = {}
    data_copy = data_set.copy()
    # use the data set passed in, not the global chocolate_n2
    for index_val, name in enumerate(data_set[column].value_counts().index):
        if index_val < top_n:
            print(name, index_val)
            value_map[name] = index_val
        else:
            value_map[name] = top_n
    data_copy[column] = data_copy[column].replace(value_map)
    return data_copy


#chocolate_n2["country_of_bean_origin"].value_counts().index
chocolate_n3 = replace_names(chocolate_n2, "country_of_bean_origin")
chocolate_n3 = chocolate_n3.rename(columns={"country_of_bean_origin" : "bean_origin"})
chocolate_n3

Venezuela 0
Peru 1
Dominican republic 2
Ecuador 3
Madagascar 4
Blend 5
Nicaragua 6
Brazil 7
Bolivia 8
Colombia 9
Belize 10
Vietnam 11
Tanzania 12
Guatemala 13
Papua new guinea 14


Unnamed: 0,bean_origin,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,4,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,2,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,12,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,1,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,8,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,5,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,9,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,10,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,15,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


In [367]:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['bean_origin'])
df_binary = encoder.fit_transform(chocolate_n3)

df_binary


Unnamed: 0,bean_origin_0,bean_origin_1,bean_origin_2,bean_origin_3,bean_origin_4,cocoa_percent,rating,counts_of_ingredients,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,0,0,0,0,1,76.0,3.75,3,1,0,0,0,1,0,cocoa,blackberry,full body,
1,0,0,0,1,0,76.0,3.50,3,1,0,0,0,1,0,cocoa,vegetal,savory,
2,0,0,0,1,1,76.0,3.25,3,1,0,0,0,1,0,rich cocoa,fatty,bready,
3,0,0,1,0,0,63.0,3.75,4,1,0,1,0,1,0,fruity,melon,roasty,
4,0,0,1,0,1,70.0,3.50,4,1,0,1,0,1,0,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,0,1,1,1,1,80.0,2.75,4,1,0,0,1,0,1,waxy,cloying,vegetal,
2220,0,1,0,1,1,75.0,3.75,3,1,0,0,0,1,0,strong nutty,marshmallow,,
2221,0,1,1,1,0,72.0,3.50,3,1,0,0,0,1,0,muted,roasty,accessible,
2222,0,1,0,1,0,70.0,3.25,3,1,0,0,0,1,0,fatty,mild nuts,mild fruit,


## Let's take a look at how well this approach performs
We will drop the taste columns for now.

In [368]:
# helper for easily comparing the performance of two data sets
def run_lin_reg(data_set, name):
    print()
    print(name)
    train_set_r, test_set_r = train_test_split(data_set, test_size = 0.2, random_state = 2)
    x_train_r, y_train_r = split_test_train(train_set_r)

    scalar_r = StandardScaler()
    x_train_copy_r = scalar_r.fit_transform(x_train_r)

    lin_reg_r = LinearRegression()
    scores_r = cross_val_score(lin_reg_r, x_train_copy_r, y_train_r, scoring="neg_mean_squared_error", cv = 10)

    display_scores(scores_r)


In [369]:
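# drop the four taste columns (the last four)
df_binary_nl = df_binary.iloc[:,:-4]
run_lin_reg(df_binary_nl, "test binary encoding")

# additionally drop the five binary-encoded origin columns
df_binary_nl = df_binary_nl.iloc[:,5:]
run_lin_reg(df_binary_nl, "test without origin")

# label-encoded data without the taste columns
chocolate_cat_r = chocolate_cat.iloc[:,:-4]
run_lin_reg(chocolate_cat_r, "test cat encoding")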


test binary encoding
Scores: [-0.19760977 -0.18023177 -0.17177274 -0.16226619 -0.17803492 -0.17966442
 -0.19381151 -0.19440513 -0.18471919 -0.15712038]
Mean: -0.17996360267928352
Standard deviation: 0.012807449770013812

test without origin
Scores: [-0.19702002 -0.1792923  -0.17117128 -0.16173    -0.17898647 -0.18225628
 -0.19560809 -0.19414489 -0.18500996 -0.15106753]
Mean: -0.17962868152277503
Standard deviation: 0.014186397656965215

test cat encoding
Scores: [-0.19611467 -0.1798936  -0.17155319 -0.16015839 -0.17983221 -0.18440125
 -0.19338259 -0.19382862 -0.18457848 -0.15203581]
Mean: -0.17957788221790646
Standard deviation: 0.01384063728212879


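We can see that this approach failed to add any value.

In fact the binary encoding (mean MSE of about 0.1800) is slightly worse than the cat encoding (about 0.1796),
and dropping the origin columns altogether gives practically the same score.

Let's try another approach.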

## let's do the same thing for the taste values

In [370]:
def count_flavors(df_data : pd.DataFrame, n_shown = 27):
    # fill missing tastes, count the tastes per bar and tally how often each flavor occurs
    df_copy = df_data.copy()
    df_copy["num_flavors"] = 0
    flavors = {}
    for column in df_copy.columns:
        if "_taste" in column:
            df_copy[column] = df_copy[column].fillna("no flavor")
            df_copy["num_flavors"] += np.where(df_copy[column] == "no flavor", 0, 1)
            # compute the counts once per column and add them to the running totals
            for name, val in df_copy[column].value_counts().items():
                flavors[name] = flavors.get(name, 0) + val
    # sort by descending frequency
    flavors = dict(sorted(flavors.items(), key=lambda item: -item[1]))
    # print the n_shown most frequent entries (the "no flavor" placeholder included)
    for name, freq in list(flavors.items())[:n_shown]:
        print(name, freq)
    return df_copy, flavors

chocolate_b1, best_flavors = count_flavors(df_binary)

no flavor 2679
nutty 238
sweet 237
cocoa 203
roasty 198
creamy 184
earthy 164
sandy 153
fatty 149
floral 133
intense 132
spicy 122
sour 122
molasses 82
woody 80
vanilla 78
sticky 77
fruit 74
coffee 69
rich 67
gritty 65
dried fruit 60
dry 59
grassy 56
bitter 56
tart 54
caramel 54


## Flavor look
There are quite a lot of distinct "flavors", but each of the top 12 flavors
appears more than 100 times across the chocolates.

Let's use the One-Hot encoding approach and encode the values that way.
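
One possible shape for that encoding is sketched below on a made-up frame (`toy`). The shortlist `top_flavors` and the `flavor_` column prefix are assumptions for illustration; in the notebook the shortlist would come from the frequency counts above. Each new column is 1 when the flavor appears in any taste column of the row, so the position of the taste (first, second, ...) is deliberately ignored.

```python
import numpy as np
import pandas as pd

# toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "first_taste": ["nutty", "cocoa", "sweet"],
    "second_taste": ["sweet", "nutty", np.nan],
})

# hypothetical shortlist of frequent flavors
top_flavors = ["nutty", "sweet"]
taste_cols = [c for c in toy.columns if c.endswith("_taste")]

encoded = toy.copy()
for flavor in top_flavors:
    # 1 if the flavor appears in any taste column of the row, else 0
    encoded["flavor_" + flavor] = (toy[taste_cols] == flavor).any(axis=1).astype(int)

print(encoded)
```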


