## Business Understanding

### Overview
People who enjoy chocolate often say, "If there is no chocolate in heaven, I'm not going." Chocolate is not only delicious; dark chocolate, which is made from the seed of the cacao tree, is also rich in nutrients and is one of the best available sources of antioxidants. According to research, these antioxidants can lower blood pressure, reduce the risk of clotting, and increase blood circulation to the heart, which in turn lowers the risk of stroke, coronary heart disease, and death from heart disease.

### Problem Statement
Willy Wonka's Chocolate wants to add gourmet chocolate bars to its lineup. To guide our initial approach to potential suppliers, we want to know which qualities the highest-rated chocolate bars share.

### Objectives
To find out if the cacao beans' country of origin affects the rating.

To find out if ratings vary depending on the percentage of cocoa in the bar.

To see if the number of ingredients affects the rating.

To find out if the type of ingredients affects the rating.

### Business Questions
Which country do the cacao beans most commonly come from?

Which ingredients are most widely used?

How many ingredients does the best-rated chocolate have?

### Expected Benefits to the Organization
The findings of this analysis will help us narrow our search for suppliers.

## Data Understanding

### Overview
I will explore the provided data so that Willy Wonka's Chocolate can focus its supplier search.

### Data Description

* "id" - id number of the review
* "manufacturer" - Name of the bar manufacturer
* "company_location" - Location of the manufacturer
* "year_reviewed" - Year of the review (2006 to 2021)
* "bean_origin" - Country of origin of the cacao beans
* "bar_name" - Name of the chocolate bar
* "cocoa_percent" - Cocoa content of the bar (%)
* "num_ingredients" - Number of ingredients
* "ingredients" - B (Beans), S (Sugar), S* (Sweetener other than sugar or beet sugar), C (Cocoa Butter), V (Vanilla), L (Lecithin), Sa (Salt)
* "review" - Summary of most memorable characteristics of the chocolate bar
* "rating" - 1.0-1.9 Unpleasant, 2.0-2.9 Disappointing, 3.0-3.49 Recommended, 3.5-3.9 Highly Recommended, 4.0-5.0 Outstanding
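
The rating bands listed above can be turned into a categorical column. The following is a minimal sketch, assuming the band edges from the description; the helper name `rating_band` and the sample ratings are illustrative and not part of the original notebook.

```python
import pandas as pd

# Band edges taken from the rating scale in the data description.
# right=False makes each band include its lower edge, e.g. [3.0, 3.5).
bins = [1.0, 2.0, 3.0, 3.5, 4.0, 5.01]  # 5.01 so that a rating of 5.0 is kept
labels = ["Unpleasant", "Disappointing", "Recommended",
          "Highly Recommended", "Outstanding"]


def rating_band(ratings: pd.Series) -> pd.Series:
    """Map numeric ratings to descriptive bands (hypothetical helper)."""
    return pd.cut(ratings, bins=bins, labels=labels, right=False)


# Illustrative ratings only, not the full dataset
sample = pd.Series([3.25, 3.5, 3.75, 3.0, 1.0, 4.0])
bands = rating_band(sample)
print(bands.tolist())
```

On the real data, the same helper could be applied as `rating_band(df_chocs["rating"])`, which would make it easy to count bars per band.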

In [1]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import scipy.stats

# Import visualization libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

plt.style.use('ggplot')

import warnings
warnings.filterwarnings("ignore")

In [5]:
# Loading the data (a forward slash avoids the invalid "\c" escape sequence)
df_chocs = pd.read_csv("Data/chocolate bars.csv")

In [7]:
# previewing the data set
df_chocs.head()

| | id | manufacturer | company_location | year_reviewed | bean_origin | bar_name | cocoa_percent | num_ingredients | ingredients | review | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2454 | 5150 | U.S.A. | 2019 | Tanzania | Kokoa Kamili, batch 1 | 76.0 | 3.0 | B,S,C | rich cocoa, fatty, bready | 3.25 |
| 1 | 2458 | 5150 | U.S.A. | 2019 | Dominican Republic | Zorzal, batch 1 | 76.0 | 3.0 | B,S,C | cocoa, vegetal, savory | 3.5 |
| 2 | 2454 | 5150 | U.S.A. | 2019 | Madagascar | Bejofo Estate, batch 1 | 76.0 | 3.0 | B,S,C | cocoa, blackberry, full body | 3.75 |
| 3 | 2542 | 5150 | U.S.A. | 2021 | Fiji | Matasawalevu, batch 1 | 68.0 | 3.0 | B,S,C | chewy, off, rubbery | 3.0 |
| 4 | 2546 | 5150 | U.S.A. | 2021 | Venezuela | Sur del Lago, batch 1 | 72.0 | 3.0 | B,S,C | fatty, earthy, moss, nutty,chalky | 3.0 |


In [12]:
# Previewing the last five rows of the data
df_chocs.tail()

| | id | manufacturer | company_location | year_reviewed | bean_origin | bar_name | cocoa_percent | num_ingredients | ingredients | review | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1813 | 2302 | Pangea | Spain | 2019 | Peru | Chilique | 71.0 | 2.0 | B,S | sandy, fruit, cocoa, sour | 3.5 |
| 1814 | 1363 | Park 75 | U.S.A. | 2014 | Blend | South America | 65.0 | 3.0 | B,S,V | mild nutty, basic cocoa | 3.5 |
| 1815 | 1251 | Parliament | U.S.A. | 2014 | Bolivia | Alto Beni | 70.0 | 2.0 | B,S | gritty, woody, acidic | 3.0 |
| 1816 | 1255 | Parliament | U.S.A. | 2014 | Dominican Republic | Oko Caribe, batch 4 | 70.0 | 2.0 | B,S | mild spice, grapes | 3.5 |
| 1817 | 1542 | Parliament | U.S.A. | 2015 | Guatemala | Lachua, Q'egchi families | 70.0 | 2.0 | B,S | intense, blackberry, acidic | 3.5 |
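
The previews show that `ingredients` stores comma-separated codes such as `B,S,C`. To study the type of ingredients, one option is to expand that column into one indicator column per code. This is only a sketch on toy values copied from the previews; the variable names are illustrative, and the original notebook may handle this differently.

```python
import pandas as pd

# Toy values copied from the previewed rows
ingredients = pd.Series(["B,S,C", "B,S", "B,S,V"], name="ingredients")

# One 0/1 column per ingredient code; columns come back sorted alphabetically
flags = ingredients.str.get_dummies(sep=",")
print(flags)

# Share of bars containing each ingredient
print(flags.mean())
```

The resulting indicator columns could then be joined back to the main DataFrame and compared against `rating`.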


In [9]:
# getting the shape of the data
print(f"This data has {df_chocs.shape[0]} rows and {df_chocs.shape[1]} columns")

This data has 1818 rows and 11 columns


### Preliminary Data Inspection

In [10]:
# Getting metadata about the data (dtypes and non-null counts)
df_chocs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1818 entries, 0 to 1817
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                1818 non-null   int64  
 1   manufacturer      1818 non-null   object 
 2   company_location  1818 non-null   object 
 3   year_reviewed     1818 non-null   int64  
 4   bean_origin       1818 non-null   object 
 5   bar_name          1818 non-null   object 
 6   cocoa_percent     1818 non-null   float64
 7   num_ingredients   1754 non-null   float64
 8   ingredients       1754 non-null   object 
 9   review            1818 non-null   object 
 10  rating            1818 non-null   float64
dtypes: float64(3), int64(2), object(6)
memory usage: 156.4+ KB
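
The non-null counts above show 1,754 of 1,818 entries for `num_ingredients` and `ingredients`, which implies 64 missing values in each column. A quick way to confirm this kind of gap is to count nulls per column. The sketch below uses a small stand-in DataFrame (`toy` is hypothetical), not the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_chocs with one row missing its ingredient data
toy = pd.DataFrame({
    "cocoa_percent": [76.0, 70.0, 100.0],
    "num_ingredients": [3.0, 2.0, np.nan],
    "ingredients": ["B,S,C", "B,S", np.nan],
})

# Count missing values per column and keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]
print(missing)
```

Running `df_chocs.isna().sum()` on the real data should report the same two columns.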


In [11]:
# Obtaining the summary descriptive statistics of the data
df_chocs.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 1818.0 | 1443.671617 | 758.434534 | 5.0 | 810.0 | 1462.0 | 2108.0 | 2712.0 |
| year_reviewed | 1818.0 | 2014.451595 | 3.960332 | 2006.0 | 2012.0 | 2015.0 | 2018.0 | 2021.0 |
| cocoa_percent | 1818.0 | 71.520627 | 5.615671 | 42.0 | 70.0 | 70.0 | 74.0 | 100.0 |
| num_ingredients | 1754.0 | 3.00057 | 0.899705 | 1.0 | 2.0 | 3.0 | 4.0 | 6.0 |
| rating | 1818.0 | 3.188119 | 0.447606 | 1.0 | 3.0 | 3.25 | 3.5 | 4.0 |


#### Observations
`year_reviewed` should be cast to a datetime type.

`id` should be cast to object, since it is an identifier rather than a quantity.
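
The two casts noted above might look like the following. This is a hedged sketch on a toy frame (`toy` is a hypothetical stand-in for `df_chocs`); converting the year through a string with `format="%Y"` is one possible approach and maps each year to January 1 of that year.

```python
import pandas as pd

# Toy stand-in for df_chocs with values from the previewed rows
toy = pd.DataFrame({"id": [2454, 2458], "year_reviewed": [2019, 2021]})

# Treat the review id as a label rather than a number
toy["id"] = toy["id"].astype(str)

# Parse the four-digit year; each value becomes YYYY-01-01
toy["year_reviewed"] = pd.to_datetime(toy["year_reviewed"].astype(str), format="%Y")

print(toy.dtypes)
```

The year remains accessible afterwards through `toy["year_reviewed"].dt.year`.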

In [15]:
# Getting the data types of the data
df_chocs.dtypes.value_counts()

object     6
float64    3
int64      2
dtype: int64

The dataset has 6 string (object) features and 5 numeric features.
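
To see which columns fall into each group rather than only the counts, `select_dtypes` can split the frame by type. This is a small sketch on a toy frame (`toy` is hypothetical); on the real data the same calls would be made on `df_chocs`.

```python
import pandas as pd

# Toy stand-in for df_chocs with one text column and two numeric columns
toy = pd.DataFrame({
    "manufacturer": ["5150", "Pangea"],
    "cocoa_percent": [76.0, 71.0],
    "rating": [3.25, 3.5],
})

# Split column names into numeric and non-numeric groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```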