# Introduction
This dataset contains a little over 2,200 entries (2,224 rows) describing different chocolate bars,
their manufacturers, the origin of the beans, and so on.

The dataset also contains a rating for the quality of each chocolate bar.

## Flavors of Cacao Rating System:
- #### Rating Scale:
    - 4.0 - 5.0 = Outstanding
    - 3.5 - 3.9 = Highly Recommended
    - 3.0 - 3.49 = Recommended
    - 2.0 - 2.9 = Disappointing
    - 1.0 - 1.9 = Unpleasant
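
The scale above can be expressed as a small lookup, for example to label each bar with its category. This is only a sketch: the name `rating_label` is hypothetical, and it assumes each band starts at its lower bound and extends up to the next band, because the printed ranges leave small gaps (such as between 3.49 and 3.5).

```python
# Lower bound of each band, ordered from best to worst.
BANDS = [
    (4.0, "Outstanding"),
    (3.5, "Highly Recommended"),
    (3.0, "Recommended"),
    (2.0, "Disappointing"),
    (1.0, "Unpleasant"),
]


def rating_label(rating: float) -> str:
    """Map a numeric rating (1.0 to 5.0) to its rating-scale category."""
    if rating > 5.0:
        raise ValueError(f"rating above scale: {rating}")
    for lower, label in BANDS:
        if rating >= lower:
            return label
    raise ValueError(f"rating below scale: {rating}")


print(rating_label(3.75))  # Highly Recommended
```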

### About this project
In this project we will try to predict the rating of a chocolate bar
from the data we have, and we will try to determine which factors drive
its rating and quality,
or whether there is a "secret" formula that the data does not capture.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import plot_confusion_matrix, confusion_matrix

In [2]:
chocolate = pd.read_csv('data/chocolate.csv')
pd.set_option('display.max_columns', None)
chocolate.head()

Unnamed: 0.1,Unnamed: 0,ref,company,company_location,review_date,country_of_bean_origin,specific_bean_origin_or_bar_name,cocoa_percent,rating,counts_of_ingredients,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,0,2454,5150,U.S.A,2019,Madagascar,"Bejofo Estate, batch 1",76.0,3.75,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,1,2458,5150,U.S.A,2019,Dominican republic,"Zorzal, batch 1",76.0,3.5,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,2,2454,5150,U.S.A,2019,Tanzania,"Kokoa Kamili, batch 1",76.0,3.25,3,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,3,797,A. Morin,France,2012,Peru,Peru,63.0,3.75,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,4,797,A. Morin,France,2012,Bolivia,Bolivia,70.0,3.5,4,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,


In [3]:
chocolate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 21 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        2224 non-null   int64  
 1   ref                               2224 non-null   int64  
 2   company                           2224 non-null   object 
 3   company_location                  2224 non-null   object 
 4   review_date                       2224 non-null   int64  
 5   country_of_bean_origin            2224 non-null   object 
 6   specific_bean_origin_or_bar_name  2224 non-null   object 
 7   cocoa_percent                     2224 non-null   float64
 8   rating                            2224 non-null   float64
 9   counts_of_ingredients             2224 non-null   int64  
 10  beans                             2224 non-null   object 
 11  cocoa_butter                      2224 non-null   object 
 12  vanill

## Data insight
The output above shows that there are some null values
(for example, `fourth_taste` is empty in the first rows).
A quick look also shows that `first_taste`, `second_taste`, and so on
together form a list of the tastes of each chocolate, which we will need
to transform into numerical data anyway.
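
One possible way to turn the taste columns into numerical data is a multi-hot encoding: one 0/1 column per distinct taste, regardless of which of the four columns it appeared in. The sketch below is an assumption about how this could be done, not necessarily the approach taken later; the `toy` frame and the `multi_hot` name are hypothetical, with values borrowed from the sample rows.

```python
import pandas as pd

# Toy frame that mimics the four taste columns, including missing values.
toy = pd.DataFrame({
    "first_taste": ["cocoa", "vegetal", "fatty"],
    "second_taste": ["blackberry", "nutty", "mild nuts"],
    "third_taste": ["full body", None, "mild fruit"],
    "fourth_taste": [None, None, None],
})
taste_cols = ["first_taste", "second_taste", "third_taste", "fourth_taste"]

# Collect every distinct taste across the four columns, ignoring nulls.
tastes = sorted({t for col in taste_cols for t in toy[col].dropna()})

# One 0/1 column per taste: 1 if the taste appears in any of the row's taste columns.
multi_hot = pd.DataFrame(
    {t: toy[taste_cols].eq(t).any(axis=1).astype(int) for t in tastes}
)

print(multi_hot)
```

With this layout the position of a taste (first, second, ...) is discarded, which is a deliberate simplification; missing tastes simply produce zeros.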

## Transforming the data for use
In this section we will transform all the categorical columns into data that we can use.

In [5]:
cat = chocolate.select_dtypes(include=['object']).copy()

Unnamed: 0,company,company_location,country_of_bean_origin,specific_bean_origin_or_bar_name,beans,cocoa_butter,vanilla,lecithin,salt,sugar,sweetener_without_sugar,first_taste,second_taste,third_taste,fourth_taste
0,5150,U.S.A,Madagascar,"Bejofo Estate, batch 1",have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,blackberry,full body,
1,5150,U.S.A,Dominican republic,"Zorzal, batch 1",have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,cocoa,vegetal,savory,
2,5150,U.S.A,Tanzania,"Kokoa Kamili, batch 1",have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,rich cocoa,fatty,bready,
3,A. Morin,France,Peru,Peru,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fruity,melon,roasty,
4,A. Morin,France,Bolivia,Bolivia,have_bean,have_cocoa_butter,have_not_vanila,have_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,vegetal,nutty,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2219,Zotter,Austria,Blend,Raw,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_salt,have_not_sugar,have_sweetener_without_sugar,waxy,cloying,vegetal,
2220,Zotter,Austria,Colombia,"APROCAFA, Acandi",have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,strong nutty,marshmallow,,
2221,Zotter,Austria,Belize,Maya Mtn,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,muted,roasty,accessible,
2222,Zotter,Austria,Congo,Mountains of the Moon,have_bean,have_cocoa_butter,have_not_vanila,have_not_lecithin,have_not_salt,have_sugar,have_not_sweetener_without_sugar,fatty,mild nuts,mild fruit,


In [11]:
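# For low-cardinality columns print the distinct values;
# for all other columns print only how many distinct values there are.
for col in cat.columns:
    values = cat[col].unique()
    if len(values) < 4:
        print(col, values)
    else:
        print(col, len(values))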

company 502
company_location 66
country_of_bean_origin 62
specific_bean_origin_or_bar_name 1398
beans ['have_bean']
cocoa_butter ['have_cocoa_butter' 'have_not_cocoa_butter']
vanilla ['have_not_vanila' 'have_vanila']
lecithin ['have_not_lecithin' 'have_lecithin']
salt ['have_not_salt' 'have_salt']
sugar ['have_sugar' 'have_not_sugar']
sweetener_without_sugar ['have_not_sweetener_without_sugar' 'have_sweetener_without_sugar']
first_taste 456
second_taste 480
third_taste 333
fourth_taste 89


For instance, "sugar" has only two values, so in practice it is binary data
that we can transform into 1 for sugar and 0 for no sugar.
<br>
On the other hand, "beans" has just one value, so we can drop that column.
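
The two steps described above can be sketched as follows. This is a minimal illustration on a hypothetical `toy` frame, assuming that every binary column follows the `have_...` / `have_not_...` naming seen in the output above.

```python
import pandas as pd

# Toy frame with one constant column and two binary columns.
toy = pd.DataFrame({
    "beans": ["have_bean", "have_bean", "have_bean"],
    "sugar": ["have_sugar", "have_not_sugar", "have_sugar"],
    "salt": ["have_not_salt", "have_not_salt", "have_salt"],
})

# Columns with a single distinct value carry no information, so drop them.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
encoded = toy.drop(columns=constant_cols)

# "have_not_..." means the ingredient is absent (0); anything else means present (1).
for col in encoded.columns:
    encoded[col] = (~encoded[col].str.startswith("have_not_")).astype(int)

print(encoded)
```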
