# DSCI Project Proposal: Variables that affect Chocolate Rating

### Question:
How do different variables (manufacturing company, bean origin, cocoa percentage, company location, and bean type) affect the expert rating of chocolate bars?
### Introduction:
To make chocolate, pods are harvested from the cacao plant and processed into cocoa, which is then combined with other ingredients. Many factors play into the final product, such as the percentage of cocoa, the company that produced the chocolate, and the specific origin of the beans. Studies of how these factors affect the rating of chocolate have shown that cocoa beans from different countries can produce chocolate that is rated generally higher or lower. Based on this observation and others, our question is whether the rating of a chocolate bar can be predicted from multiple variables. To answer it, we will use the “Chocolate Bar Ratings” dataset uploaded by Rachael Tatman, which contains expert ratings of over 1,700 chocolate bars together with several variables describing each bar.

##### Columns: 
* Company: Name of the company manufacturing the bar
* Specific Bean Origin: The specific geo-region of origin for the bar
* REF: A value linked to when the review was entered in the database. Higher = more recent.
* Review Date: Date of publication of the review.
* CocoaPercent: Cocoa percentage (darkness) of the chocolate bar being reviewed.
* Company Location: Manufacturer base country.
* Rating: Expert rating for the bar from 1-5.
* BeanType: The variety (breed) of bean used, if provided.
* Broad Bean Origin: The broad geo-region of origin for the bean.

### Methods:
We will keep only the more recent ratings (reviews published after 2010), using the review date column, so that our analysis is not outdated. We plan to make a scatter plot with cocoa percentage on the x-axis, rating on the y-axis, and the companies distinguished by differently coloured points.

### Expected outcomes and significance:
##### What do you expect to find? 
We expect to find which variables have the greatest impact on the expert rating of a chocolate bar. We will examine the relationships between brand and rating, country of cocoa origin and rating, and cocoa percentage and rating.
##### What impact could such findings have?
(1) Our own decisions about which type of chocolate to purchase. (2) Informing companies which cocoa percentage is rated best. (3) Informing companies which country they should purchase cacao from.
##### What future questions could this lead to?
Do the specific bean origin and the type of bean impact the rating? How reliable is this dataset? This could be investigated by comparing it with ratings from other datasets.
#### Sources
* Jinap, S., Dimick, P. S., & Hollender, R. (1995). Flavour evaluation of chocolate formulated from cocoa beans from different countries. Food Control, 6(2), 105-110. https://doi.org/10.1016/0956-7135(95)98914-M
* Verna, R. (2013). The history and science of chocolate. Malaysian J Pathol, 35(2), 111-121. http://www.mjpath.org.my/2013.2/history-and-science-of-chocolate.pdf

 




### Preliminary exploratory data analysis: 

In [3]:
import pandas as pd
import altair as alt
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split

chocolate = pd.read_csv("data/flavors_of_cacao.csv")
chocolate.head()

# split the data into training (75%) and testing (25%) sets;
# train_test_split already returns DataFrames, so no conversion is needed
chocolate_training, chocolate_testing = train_test_split(
    chocolate, train_size=0.75, random_state=2000 # do not change the random_state
)
chocolate_training

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin
1294,Pierre Marcolini,"Ocumare, Puerto Cabello, Venezuela",93,2006,72%,Belgium,4.00,Criollo,Venezuela
883,Kah Kow,"Rizek Cacao, Cibao Valley, Domin. Rep.",1061,2013,70%,Domincan Republic,3.50,,Dominican Republic
1770,Zart Pralinen,"Kakao Kamili, Kilombero Valley",1824,2016,85%,Austria,3.00,"Criollo, Trinitario",Tanzania
296,Cacao Hunters,Macondo,1816,2016,71%,Colombia,3.75,,Colombia
542,Dick Taylor,Belize,955,2012,72%,U.S.A.,2.75,Trinitario,Belize
...,...,...,...,...,...,...,...,...,...
1659,Theo,"Ghana, Panama, Ecuador",188,2007,75%,U.S.A.,3.00,Blend,"Ghana, Panama, Ecuador"
1590,Summerbird,Peru,1800,2016,71%,Denmark,3.00,"Criollo, Trinitario",Peru
1230,Original Beans (Felchlin),Papua Kerafat,1438,2014,68%,Switzerland,2.75,,Papua New Guinea
840,Hotel Chocolat (Coppeneur),"Island Growers, 96hr c.",623,2011,65%,U.K.,3.25,Trinitario,St. Lucia


In [5]:
# keep only the reviews published after 2010;
# .copy() makes an independent DataFrame so the column assignment below
# does not trigger pandas' SettingWithCopyWarning
chocolate_recent = chocolate_training[chocolate_training["Review\nDate"] > 2010].copy()

# strip the percentage signs and convert the cocoa percentage to a fraction
# (e.g. "70%" -> 0.70) so that it can be sorted and averaged numerically
chocolate_recent["Cocoa\nPercent"] = chocolate_recent["Cocoa\nPercent"].str.rstrip('%').astype('float') / 100.0

chocolate_recent.head()


Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin
883,Kah Kow,"Rizek Cacao, Cibao Valley, Domin. Rep.",1061,2013,0.7,Domincan Republic,3.5,,Dominican Republic
1770,Zart Pralinen,"Kakao Kamili, Kilombero Valley",1824,2016,0.85,Austria,3.0,"Criollo, Trinitario",Tanzania
296,Cacao Hunters,Macondo,1816,2016,0.71,Colombia,3.75,,Colombia
542,Dick Taylor,Belize,955,2012,0.72,U.S.A.,2.75,Trinitario,Belize
106,Arete,"La Masica, FHIA",1908,2016,0.7,U.S.A.,3.5,,Honduras
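
The column names in the table above contain embedded newlines (for example `"Cocoa\nPercent"`), which makes them awkward to type and to pass to plotting code. A minimal sketch of one way to tidy them follows; the helper name `tidy_columns` and the two-row sample frame are hypothetical and only mimic the real headers.

```python
import pandas as pd

# Hypothetical two-row sample that mimics the raw headers of the dataset.
sample = pd.DataFrame(
    {
        "Company \n(Maker-if known)": ["Kah Kow", "Arete"],
        "Cocoa\nPercent": ["70%", "70%"],
        "Review\nDate": [2013, 2016],
        "Rating": [3.5, 3.5],
    }
)


def tidy_columns(df):
    """Return a copy of df whose column names are lower-case snake_case."""
    cleaned = (
        df.columns.str.strip()
        .str.lower()
        # collapse every run of non-alphanumeric characters into one underscore
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    out = df.copy()
    out.columns = cleaned
    return out


tidy = tidy_columns(sample)
print(list(tidy.columns))
```

Applying the same helper to `chocolate_recent` would allow columns to be referenced as, say, `cocoa_percent` instead of `"Cocoa\nPercent"`, although the later cells in this notebook would then need the new names.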


In [11]:
# lots of repeated ratings were found
count_rating_duplicates = chocolate_recent.pivot_table(index = ['Rating'], aggfunc ='size')
count_rating_duplicates

Rating
1.50      2
2.00      7
2.25      7
2.50     70
2.75    143
3.00    158
3.25    207
3.50    230
3.75    120
4.00     52
dtype: int64
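
The same tally can also be produced with `value_counts`, which may read more directly than a pivot table. The sketch below uses a small hypothetical series in place of `chocolate_recent["Rating"]`.

```python
import pandas as pd

# Hypothetical ratings standing in for chocolate_recent["Rating"].
ratings = pd.Series([3.5, 2.75, 3.5, 4.0, 2.75, 3.5], name="Rating")

# Count how often each rating occurs, ordered by rating value.
rating_counts = ratings.value_counts().sort_index()
print(rating_counts)
```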

Since there is a limited number of distinct ratings, it would be better to investigate the dataset by finding the average cocoa percentage for each rating category and plotting a bar graph of rating category vs. average cocoa percentage.

We tried to calculate the average for each repeated rating, but we were unsuccessful. We plan to continue working on this part.
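
One possible way to get all of the per-rating averages in a single step is a pandas `groupby`. The sketch below is only an illustration under stated assumptions: the sample frame is hypothetical (its two 1.5-rated rows use the cocoa fractions 0.70 and 0.65 shown later in this notebook, the 2.0-rated rows are made up), and the names `rating_average_sketch`, `average_cocoa` and `n_bars` are ours.

```python
import pandas as pd

# Hypothetical sample standing in for chocolate_recent.
sample = pd.DataFrame(
    {
        "Rating": [1.5, 1.5, 2.0, 2.0, 2.0],
        "Cocoa\nPercent": [0.70, 0.65, 0.75, 0.80, 0.70],
    }
)

# Group the bars by rating, then compute the mean cocoa fraction
# and the number of bars in each rating category.
rating_average_sketch = (
    sample.groupby("Rating")["Cocoa\nPercent"]
    .agg(["mean", "size"])
    .rename(columns={"mean": "average_cocoa", "size": "n_bars"})
    .reset_index()
)
print(rating_average_sketch)
```

Run on `chocolate_recent` instead of `sample`, this should give one row per rating category without filtering each category by hand.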

In [16]:
rating_category1 = chocolate_recent[chocolate_recent["Rating"]==1.50]
rating_category1
# rating_category1_cocoa_average = rating_category1.mean()
# rating_category1_cocoa_average

# take out only the cocoa_amount columns for each rating category
rating_category1_with_cocoa = rating_category1.assign(cocoa_average=rating_category1.loc[:,"Cocoa\nPercent"].mean())
rating_category1_with_cocoa
rating_category2 = chocolate_recent[chocolate_recent["Rating"]==2.00]
rating_category2
rating_category2_with_cocoa = rating_category2.assign(cocoa_average=rating_category2.loc[:,"Cocoa\nPercent"].mean())
rating_category2_with_cocoa

rating_category3 = chocolate_recent[chocolate_recent["Rating"]==2.25]
rating_category3
rating_category3_with_cocoa = rating_category3.assign(cocoa_average=rating_category3.loc[:,"Cocoa\nPercent"].mean())
rating_category3_with_cocoa

rating_category4 = chocolate_recent[chocolate_recent["Rating"]==2.50]
rating_category4
rating_category4_with_cocoa = rating_category4.assign(cocoa_average=rating_category4.loc[:,"Cocoa\nPercent"].mean())
rating_category4_with_cocoa


# rating_filtered1 = rating_category1_with_cocoa[['Rating','cocoa_average']].copy()
# rating_filtered2 = rating_category2_with_cocoa[['Rating','cocoa_average']].copy()
# rating_filtered3 = rating_category3_with_cocoa[['Rating','cocoa_average']].copy()
# rating_filtered4 = rating_category4_with_cocoa[['Rating','cocoa_average']].copy()
# df=[rating_filtered1, rating_filtered2, rating_filtered3, rating_filtered4]
# filtered_table = pd.concat(df)
# filtered_table

Unnamed: 0,Rating,cocoa_average
988,1.5,0.675000
1129,1.5,0.675000
1130,2.0,0.767143
676,2.0,0.767143
1217,2.0,0.767143
...,...,...
1127,2.5,0.716643
279,2.5,0.716643
1064,2.5,0.716643
192,2.5,0.716643


In [17]:
rating_category1_with_cocoa = rating_category1.assign(cocoa_average=rating_category1.loc[:,"Cocoa\nPercent"].mean())
rating_category1_with_cocoa

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
988,Machu Picchu Trading Co.,Peru,721,2011,0.7,Peru,1.5,,Peru,0.675
1129,Middlebury,Houseblend,887,2012,0.65,U.S.A.,1.5,,,0.675


In [37]:
rating_category2_with_cocoa = rating_category2.assign(cocoa_average=rating_category2.loc[:,"Cocoa\nPercent"].mean())
rating_category2_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
1130,Middlebury,Alto Beni,887,2012,0.75,U.S.A.,2.0,Criollo,Bolivia,0.767143


In [38]:
rating_category3 = chocolate_recent[chocolate_recent["Rating"]==2.25]
rating_category3
rating_category3_with_cocoa = rating_category3.assign(cocoa_average=rating_category3.loc[:,"Cocoa\nPercent"].mean())
rating_category3_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
1574,SRSLY,Dominican Republic,1121,2013,0.84,U.S.A.,2.25,,Dominican Republic,0.735714


In [36]:
rating_category4 = chocolate_recent[chocolate_recent["Rating"]==2.50]
rating_category4
rating_category4_with_cocoa = rating_category4.assign(cocoa_average=rating_category4.loc[:,"Cocoa\nPercent"].mean())
rating_category4_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
327,C-Amaro,Cuba,1442,2014,0.75,Italy,2.5,,Cuba,0.716643


In [39]:
rating_category5 = chocolate_recent[chocolate_recent["Rating"]==2.75]
rating_category5
rating_category5_with_cocoa = rating_category5.assign(cocoa_average=rating_category5.loc[:,"Cocoa\nPercent"].mean())
rating_category5_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
542,Dick Taylor,Belize,955,2012,0.72,U.S.A.,2.75,Trinitario,Belize,0.721469


In [40]:
rating_category6 = chocolate_recent[chocolate_recent["Rating"]==3.00]
rating_category6
rating_category6_with_cocoa = rating_category6.assign(cocoa_average=rating_category6.loc[:,"Cocoa\nPercent"].mean())
rating_category6_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
1770,Zart Pralinen,"Kakao Kamili, Kilombero Valley",1824,2016,0.85,Austria,3.0,"Criollo, Trinitario",Tanzania,0.727785


In [42]:
rating_category7 = chocolate_recent[chocolate_recent["Rating"]==3.25]
rating_category7
rating_category7_with_cocoa = rating_category7.assign(cocoa_average=rating_category7.loc[:,"Cocoa\nPercent"].mean())
rating_category7_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
1521,Soma,Little Big Man,1339,2014,0.7,Canada,3.25,Blend,Madagascar & Ecuador,0.718019


In [45]:
rating_category8 = chocolate_recent[chocolate_recent["Rating"]==3.50]
rating_category8
rating_category8_with_cocoa = rating_category8.assign(cocoa_average=rating_category8.loc[:,"Cocoa\nPercent"].mean())
rating_category8_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
883,Kah Kow,"Rizek Cacao, Cibao Valley, Domin. Rep.",1061,2013,0.7,Domincan Republic,3.5,,Dominican Republic,0.709435


In [46]:
rating_category9 = chocolate_recent[chocolate_recent["Rating"]==3.75]
rating_category9
rating_category9_with_cocoa = rating_category9.assign(cocoa_average=rating_category9.loc[:,"Cocoa\nPercent"].mean())
rating_category9_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
296,Cacao Hunters,Macondo,1816,2016,0.71,Colombia,3.75,,Colombia,0.7075


In [47]:
rating_category10 = chocolate_recent[chocolate_recent["Rating"]==4.00]
rating_category10
rating_category10_with_cocoa = rating_category10.assign(cocoa_average=rating_category10.loc[:,"Cocoa\nPercent"].mean())
rating_category10_with_cocoa.head(1)

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_average
1756,Woodblock,Ocumare,741,2011,0.7,U.S.A.,4.0,,Venezuela,0.711923


In [51]:
# average cocoa percentage per rating, copied from the cells above;
# the values are stored as numbers (not strings) so they sort and plot numerically
rating_average_table = pd.DataFrame(
    {
        "Rating": [1.50, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 3.75, 4.00],
        "Average cocoa percentage": [0.675, 0.767, 0.736, 0.717, 0.721, 0.728, 0.718, 0.709, 0.708, 0.712],
    }
)
rating_average_table

Unnamed: 0,Rating,Average cocoa percentage
0,1.5,0.675
1,2.0,0.767
2,2.25,0.736
3,2.5,0.717
4,2.75,0.721
5,3.0,0.728
6,3.25,0.718
7,3.5,0.709
8,3.75,0.708
9,4.0,0.712


In [63]:
rating_recent_plot = (
    alt.Chart(rating_average_table, title="Ratings and average cocoa percentage")
    .mark_point()
    .encode(
        x=alt.X("Rating", title="Rating"),
        y=alt.Y("Average cocoa percentage", title="Average cocoa percentage"),
    )
    .properties(width=480, height=400)
)

rating_recent_plot

### Information on training dataset

In [44]:
info_table=chocolate_recent.describe()
info_table

Unnamed: 0,REF,Review\nDate,Cocoa\nPercent,Rating
count,996.0,996.0,996.0,996.0
mean,1300.558233,2013.803213,0.716782,3.223645
std,383.515714,1.691073,0.05623,0.42768
min,623.0,2011.0,0.42,1.5
25%,971.0,2012.0,0.7,3.0
50%,1313.0,2014.0,0.7,3.25
75%,1630.0,2015.0,0.74,3.5
max,1952.0,2017.0,1.0,4.0


In [None]:
# cocoa fraction rescaled to 0-100 and stored under a plain column name for plotting
chocolate_recent_percentage = chocolate_recent.assign(
    cocoa_amount=chocolate_recent["Cocoa\nPercent"] * 100
)[["cocoa_amount", "Rating"]]

chocolate_recent_plot = (
    alt.Chart(chocolate_recent_percentage, title="Cocoa percentage and chocolate bar ratings")
    .mark_point(opacity=0.7)  # semi-transparent points make overlapping ratings visible
    .encode(
        x=alt.X("cocoa_amount", title="Amount of cocoa (%)"),
        y=alt.Y("Rating", title="Rating"),
    )
    .properties(width=480, height=400)
)

chocolate_recent_plot

This plot shows the approximate distribution of chocolate bar ratings by cocoa percentage. The points are concentrated around 70% cocoa. Because many bars share the same rating and cocoa percentage, the points overlap, so we will try to find a way to represent the distribution more accurately.
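
One option we could try for the overlapping points is to count how many bars fall on each (cocoa percentage, rating) pair and plot one point per pair. The sketch below assumes a frame with `cocoa_amount` and `Rating` columns; the sample values and the column name `n_bars` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: several bars share the same cocoa amount and rating.
sample = pd.DataFrame(
    {
        "cocoa_amount": [70, 70, 70, 72, 72, 85],
        "Rating": [3.5, 3.5, 3.0, 2.75, 2.75, 3.0],
    }
)

# One row per distinct (cocoa_amount, Rating) pair with the number of bars on it.
pair_counts = (
    sample.groupby(["cocoa_amount", "Rating"])
    .size()
    .reset_index(name="n_bars")
)
print(pair_counts)
```

The `n_bars` column could then be mapped to point size or colour in the chart, so that heavily repeated combinations stand out instead of hiding behind a single point.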

In [None]:
# overall mean cocoa fraction of the recent training data
# (the relationship between bean origin and rating is still to be explored)
chocolate_recent["Cocoa\nPercent"].mean()
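
As a possible starting point for relating bean origin to rating, the bars could be grouped by broad bean origin and summarised. This is only a sketch under stated assumptions: the sample rows, the helper name `summarise_origins` and the `min_bars` threshold are hypothetical, and origins with very few bars are dropped because their averages would be unreliable.

```python
import pandas as pd

# Hypothetical sample standing in for chocolate_recent.
sample = pd.DataFrame(
    {
        "Broad Bean\nOrigin": ["Venezuela", "Venezuela", "Peru", "Peru", "Belize"],
        "Rating": [4.0, 3.5, 3.0, 1.5, 2.75],
    }
)


def summarise_origins(df, min_bars=2):
    """Mean rating and bar count per origin, keeping origins with at least min_bars bars."""
    summary = df.groupby("Broad Bean\nOrigin")["Rating"].agg(
        mean_rating="mean", n_bars="size"
    )
    kept = summary.loc[lambda d: d["n_bars"] >= min_bars]
    return kept.sort_values("mean_rating", ascending=False)


origin_summary = summarise_origins(sample)
print(origin_summary)
```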
