# DSCI Project Proposal: Variables that affect Chocolate Rating

### Question:
How do different variables (manufacturing company, bean origin, cocoa percentage, company location, and bean type) affect the expert rating of chocolate bars?
### Introduction:
To make chocolate, pods are harvested from the cacao plant and processed into cocoa, which is then combined with other ingredients. Many factors play into the result, such as the percentage of cocoa, the company that produced the chocolate, and the specific bean origin. Studies on how these factors affect chocolate ratings have shown that cocoa beans from different countries can produce chocolate that is rated generally higher or lower. Based on this observation and others, our question is whether the rating of a chocolate bar can be predicted from multiple variables. To answer it, we will use the “Chocolate Bar Ratings” dataset uploaded by Rachael Tatman, which contains expert ratings of over 1,700 chocolate bars along with several variables describing each bar.

##### Columns: 
* Company: Name of the company manufacturing the bar
* Specific Bean Origin: The specific geo-region of origin for the bar
* REF: A value linked to when the review was entered in the database. Higher = more recent.
* Review Date: Date of publication of the review.
* CocoaPercent: Cocoa percentage (darkness) of the chocolate bar being reviewed.
* Company Location: Manufacturer base country.
* Rating: Expert rating for the bar from 1-5.
* BeanType: The variety (breed) of bean used, if provided.
* Broad Bean Origin: The broad geo-region of origin for the bean.

### Methods:
We will keep only the more recent reviews, using the review date column, so that our analysis is not outdated. We then plan to make a graph with cocoa percentage on the x-axis, rating on the y-axis, and companies distinguished by different coloured points.

### Expected outcomes and significance:
##### What do you expect to find? 
We expect to find which variables have the greatest impact on the expert rating of a chocolate bar. In particular, we will examine the relationships between brand and rating, country of cocoa origin and rating, and cocoa percentage and rating.
##### What impact could such findings have?
(1) Guide our own decisions about which type of chocolate to purchase. (2) Inform companies which cocoa percentage tends to be rated best. (3) Indicate which countries companies should purchase cacao from.
##### What future questions could this lead to?
Do the specific bean origin and the bean type affect the rating? How reliable is this dataset? Its reliability could be investigated by comparing its ratings with those in other datasets.
#### Sources
* S. Jinap, P. S. Dimick, R. Hollender, (1995). Flavour evaluation of chocolate formulated from cocoa beans from different countries. Food Control, 6(2), 105-110. https://doi.org/10.1016/0956-7135(95)98914-M
* Roberto Verna, (2013). The history and science of chocolate. Malaysian J Pathol, 35(2), 111-121. http://www.mjpath.org.my/2013.2/history-and-science-of-chocolate.pdf

 




### Preliminary exploratory data analysis: 

In [7]:
import pandas as pd
import altair as alt
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split

chocolate = pd.read_csv("dsci-100-group18/data/flavors_of_cacao.csv")
chocolate.head()

# split the data into training (75%) and testing (25%) sets
chocolate_training, chocolate_testing = train_test_split(
    chocolate, train_size=0.75, random_state=2000  # do not change the random_state
)
# train_test_split already returns DataFrames, so no conversion is needed
chocolate_training

Unnamed: 0,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin
1294,Pierre Marcolini,"Ocumare, Puerto Cabello, Venezuela",93,2006,72%,Belgium,4.00,Criollo,Venezuela
883,Kah Kow,"Rizek Cacao, Cibao Valley, Domin. Rep.",1061,2013,70%,Domincan Republic,3.50,,Dominican Republic
1770,Zart Pralinen,"Kakao Kamili, Kilombero Valley",1824,2016,85%,Austria,3.00,"Criollo, Trinitario",Tanzania
296,Cacao Hunters,Macondo,1816,2016,71%,Colombia,3.75,,Colombia
542,Dick Taylor,Belize,955,2012,72%,U.S.A.,2.75,Trinitario,Belize
...,...,...,...,...,...,...,...,...,...
1659,Theo,"Ghana, Panama, Ecuador",188,2007,75%,U.S.A.,3.00,Blend,"Ghana, Panama, Ecuador"
1590,Summerbird,Peru,1800,2016,71%,Denmark,3.00,"Criollo, Trinitario",Peru
1230,Original Beans (Felchlin),Papua Kerafat,1438,2014,68%,Switzerland,2.75,,Papua New Guinea
840,Hotel Chocolat (Coppeneur),"Island Growers, 96hr c.",623,2011,65%,U.K.,3.25,Trinitario,St. Lucia


In [2]:
# keep only the reviews dated after 2010, then strip the percent signs
# into a new cocoa_amount column so the bars can be sorted
# by cocoa percentage in descending order

chocolate_recent = chocolate_training[chocolate_training["Review\nDate"] > 2010]
chocolate_recent.head()

cocoa_percentage = pd.DataFrame(chocolate_recent["Cocoa\nPercent"].str.replace("%",""))


chocolate_recent_percentage = chocolate_recent.assign(cocoa_amount = cocoa_percentage)
chocolate_recent_percentage = chocolate_recent_percentage.sort_values(by="cocoa_amount", ascending=False).reset_index()
chocolate_recent_percentage.head()

Unnamed: 0,index,Company \n(Maker-if known),Specific Bean Origin\nor Bar Name,REF,Review\nDate,Cocoa\nPercent,Company\nLocation,Rating,Bean\nType,Broad Bean\nOrigin,cocoa_amount
0,1623,TCHO,Peru- Ecuador,915,2012,99%,U.S.A.,3.25,,"Peru, Ecuador",99
1,392,Chocolate Alchemist-Philly,"Tumbes, ""Zarumilla""",1772,2016,90%,U.S.A.,2.5,,Peru,90
2,906,Kto,Belize,1426,2014,90%,U.S.A.,3.25,Trinitario,Belize,90
3,1786,Zotter,El Ceibo Coop,879,2012,90%,Austria,3.25,,Bolivia,90
4,831,Hotel Chocolat (Coppeneur),"Los Rios, H. Iara",1065,2013,90%,U.K.,3.0,Forastero (Nacional),Ecuador,90
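
One caveat with the step above: `str.replace` only strips the `%` sign, so `cocoa_amount` still holds text and `sort_values` orders it character by character. The sketch below uses a small made-up series rather than the real data to show the difference and one possible fix with `pd.to_numeric`. The values and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cocoa percentages, formatted like the "Cocoa\nPercent" column
toy_percent = pd.Series(["70%", "100%", "85%", "99%"])

# Stripping "%" leaves strings, so sorting is lexicographic: "100" sorts below "70"
as_text = toy_percent.str.replace("%", "", regex=False)
text_order = as_text.sort_values(ascending=False).tolist()

# Converting to numbers first gives the intended numeric order
as_number = pd.to_numeric(as_text)
number_order = as_number.sort_values(ascending=False).tolist()

print(text_order)    # ['99', '85', '70', '100']
print(number_order)  # [100, 99, 85, 70]
```

If the real column were converted the same way, any 100% bars would sort to the top instead of below the 70% bars.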


In [3]:
# lots of repeated ratings were found
count_rating_duplicates = chocolate_recent_percentage.pivot_table(index = ['Rating'], aggfunc ='size')
count_rating_duplicates

Rating
1.50      2
2.00      7
2.25      7
2.50     70
2.75    143
3.00    158
3.25    207
3.50    230
3.75    120
4.00     52
dtype: int64

Since there are only a few distinct rating values, it would be better to explore the dataset by averaging the cocoa percentage within each rating category and plotting a bar graph of rating category vs. cocoa percentage.

We tried to calculate the average for each repeated rating, but the attempt was unsuccessful. We plan to keep working on this part.

In [4]:
rating_category1 = chocolate_recent_percentage[chocolate_recent_percentage["Rating"]==1.50]
rating_category1
rating_category1_cocoa_average = rating_category1["cocoa_amount"].mean()
rating_category1_cocoa_average

3532.5
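
The value 3532.5 is not a plausible cocoa percentage. A likely explanation, which we have not checked against the rows themselves, is that `cocoa_amount` still holds text, so the values are joined end to end before being divided (for instance, "70" and "65" would become 7065, and 7065 / 2 = 3532.5). Assuming that is the cause, a minimal sketch of per-rating averages might look like the following. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for chocolate_recent_percentage
toy = pd.DataFrame({
    "Rating": [1.50, 1.50, 3.00, 3.00, 3.00],
    "cocoa_amount": ["70", "65", "70", "72", "80"],  # text, as left by str.replace
})

# Convert the text to numbers before averaging
toy["cocoa_amount"] = pd.to_numeric(toy["cocoa_amount"])

# Mean cocoa percentage for every rating category in one step
mean_cocoa_by_rating = toy.groupby("Rating")["cocoa_amount"].mean()
print(mean_cocoa_by_rating)
```

Grouping by `Rating` avoids filtering each rating category by hand, and the resulting series could feed the planned bar graph of rating category vs. cocoa percentage.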

### Information on training dataset

In [5]:
info_table=chocolate_recent_percentage.describe()
info_table

Unnamed: 0,index,REF,Review\nDate,Rating
count,996.0,996.0,996.0,996.0
mean,922.206827,1300.558233,2013.803213,3.223645
std,518.363352,383.515714,1.691073,0.42768
min,1.0,623.0,2011.0,1.5
25%,486.75,971.0,2012.0,3.0
50%,930.5,1313.0,2014.0,3.25
75%,1382.25,1630.0,2015.0,3.5
max,1793.0,1952.0,2017.0,4.0


In [6]:
chocolate_recent_plot = (
    alt.Chart(chocolate_recent_percentage, title="Variables and the chocolate bar ratings")
    .mark_point(opacity=0.7)  # semi-transparent points, since many bars share the same values
    .encode(
        x=alt.X("cocoa_amount", title="Amount of cocoa (%)"),
        y=alt.Y("Rating", sort="x", title="Rating"),
    )
    .properties(width=480, height=400)
)

chocolate_recent_plot



This plot shows the approximate distribution of chocolate bar ratings by cocoa percentage. The points are most concentrated around 70% cocoa. Because many ratings are repeated, we will look for a way to represent the distribution more accurately.
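
One possible way to handle the repeated values, sketched here on made-up data, is to count how many bars share each pair of cocoa percentage and rating. The counts could then drive point size or colour in the plot. The frame `toy` and the column name `n_bars` are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Made-up bars; several share the same cocoa percentage and rating
toy = pd.DataFrame({
    "cocoa_amount": [70, 70, 70, 72, 72, 85],
    "Rating": [3.5, 3.5, 3.0, 3.5, 3.5, 2.75],
})

# One row per (cocoa_amount, Rating) pair with the number of bars in it
pair_counts = (
    toy.groupby(["cocoa_amount", "Rating"])
    .size()
    .reset_index(name="n_bars")
)
print(pair_counts)
```

Each overlapping cluster of points then becomes a single row, so the plot can show how many bars sit behind it.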