# Exploratory Visualisation with Tabular Data 🍫

Data often lives in tables, whether in spreadsheets or in CSV files. In this notebook we will explore the tabular dataset [Chocolate Bar Ratings](https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings?resource=download), available on Kaggle.

The learning objectives for this activity are:

*   Open the dataset using pandas and examine its variables
*   Summarise and clean the dataset
*   Analyse the distribution of chocolate ratings across the dataset

**About the Dataset**

This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.

Flavors of Cacao Rating System:
* 5 = Elite (transcending beyond the ordinary limits)
* 4 = Premium (superior flavor development, character and style)
* 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
* 2 = Disappointing (passable but contains at least one significant flaw)
* 1 = Unpleasant (mostly unpalatable)

Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known.
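
To make the scale concrete, here is a minimal sketch that turns a numeric rating into its band label. The helper `rating_band` is hypothetical, and it assumes that a fractional rating belongs to the band of its integer part (so 3.75 is still in band 3); the dataset itself does not state how in-between ratings such as 2.5 should be labelled.

```python
import math

# Band labels taken from the rating scale described above.
RATING_BANDS = {
    5: "Elite",
    4: "Premium",
    3: "Satisfactory to praiseworthy",
    2: "Disappointing",
    1: "Unpleasant",
}


def rating_band(rating: float) -> str:
    """Return the band label for a rating (hypothetical helper).

    Assumption: a fractional rating is assigned to the band of its
    integer part, e.g. 3.75 falls in band 3.
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating out of range: {rating}")
    return RATING_BANDS[math.floor(rating)]


for r in (1.0, 2.5, 3.75, 4.0, 5.0):
    print(r, rating_band(r))
```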

More information about the dataset can be found [here](https://www.kaggle.com/datasets/rtatman/chocolate-bar-ratings?resource=download)

**Acknowledgements**

These ratings were compiled by Brady Brelinski, Founding Member of the Manhattan Chocolate Society. For up-to-date information, as well as additional content (including interviews with craft chocolate makers), please see his website: Flavors of Cacao

We can use the [pandas](https://pandas.pydata.org) package to work with tabular data like this conveniently.


## Pandas

We will use pandas to achieve our learning objectives. You will have already been introduced to pandas in the previous session.



In [68]:
#@title Question 0
import ipywidgets as widgets
import sys
from IPython.display import display
from IPython.display import clear_output

out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('Graphs', 1), ('Images', 2), ('DataFrames', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) Pandas helps us to work with what type of data structure?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = 3
        if(a==right_answer): 
            color = '\x1b[6;30;42m' + "Correct" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    
        
        
display(out)
check.on_click(sjekksvar)


In [None]:
# read in the data

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/johnpinney/irc_viz/main/flavors_of_cacao.csv')

In [None]:
df.head()

In [None]:
# check for na values
df[df.isna().any(axis=1)] # display rows with one or more na values
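
As a complement to listing the affected rows, it can help to count the missing values in each column. The sketch below uses a small made-up table (the column names and values are illustrative, not taken from the chocolate dataset) so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "Rating": [3.0, 2.75, np.nan, 3.5],
    "Bean Type": ["Criollo", None, "Trinitario", None],
})

# Number of missing values in each column.
na_per_column = toy.isna().sum()

# Rows with at least one missing value (same idiom as the cell above).
rows_with_na = toy[toy.isna().any(axis=1)]

print(na_per_column)
print(len(rows_with_na), "rows contain a missing value")
```

The same two lines should work on `df`; which columns turn out to have gaps is something to read off the output rather than assume.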

In [None]:
df.dtypes

In [None]:
# format the variables
df['Cocoa\nPercent'] = df['Cocoa\nPercent'].apply(lambda x: x.strip('%')).astype('float64')
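
The conversion above can also be written with the vectorised `.str` accessor. This is a minimal sketch on made-up values in the same `NN%` text format; it is an alternative spelling of the same idea, not a change to the result.

```python
import pandas as pd

# Toy values, made up, in the same "NN%" text format as the cocoa column.
percent_text = pd.Series(["63%", "70%", "72.5%"])

# .str.rstrip removes the trailing '%' from every entry in one step,
# then astype converts the remaining text to floating-point numbers.
percent = percent_text.str.rstrip("%").astype("float64")

print(percent.tolist())
print(percent.dtype)
```

Note that either version expects text input: once the column holds floats, running the conversion cell a second time raises an error, so re-read the CSV first if you need to rerun it.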

In [69]:
#@title Question 1
out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('1795', 1), ('9', 2), ('1796', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) How many ratings have been collected?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = 1
        if(a==right_answer): 
            color = '\x1b[6;30;42m' + "Correct" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    

display(out)
check.on_click(sjekksvar)


In [None]:
# enter code to find out total number of ratings


In [None]:
#@title Answer 1 - Show Code
print(df['Rating'].shape)    # shape gives the total number of rows
print(df['Rating'].count())  # count gives the number of non-NA values

# we can also examine the distribution of ratings by plotting the histogram
df['Rating'].hist(bins=20)
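
The difference between `shape` and `count` only shows up when values are missing. A minimal sketch with a made-up series containing one gap:

```python
import numpy as np
import pandas as pd

# Toy ratings, made up for illustration; one value is missing.
toy_ratings = pd.Series([3.0, np.nan, 2.5, 4.0])

n_rows = toy_ratings.shape[0]         # every row, missing or not
n_non_missing = toy_ratings.count()   # only rows holding a value

print(n_rows, n_non_missing)
```

If the two numbers agree for `df['Rating']`, no ratings are missing and either can be used as the total.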

In [70]:
#@title Question 2
out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('101', 1), ('1039', 2), ('1796', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) How many types of Bar or Specific Bean Origin are there?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = 2
        if(a==right_answer): 
            color = '\x1b[6;30;42m' + "Correct" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    

display(out)
check.on_click(sjekksvar)


In [None]:
# enter code to find out number of types of Bar or Specific Bean Origin

In [None]:
#@title Answer 2 - Show Code

df['Specific Bean Origin\nor Bar Name'].unique().shape # calling unique removes duplicate entries, leaving one value per category
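
pandas also offers `nunique`, which returns the number of distinct values directly. The sketch below, on made-up names, shows one difference worth knowing: `unique` keeps a missing value as its own entry, while `nunique` ignores it unless `dropna=False` is passed.

```python
import pandas as pd

# Toy bar names, made up for illustration; the last entry is missing.
names = pd.Series(["Madagascar", "Chuao", "Madagascar", None])

n_unique_with_missing = len(names.unique())       # missing value counted as a category
n_unique = names.nunique()                        # missing value ignored by default
n_unique_keep_na = names.nunique(dropna=False)    # missing value counted again

print(n_unique_with_missing, n_unique, n_unique_keep_na)
```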

In [71]:
#@title Question 3
out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('Togo', 1), ('Ecuador', 2), ('Venezuela', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) Which country makes the most chocolate in this dataset?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = 3
        if(a==right_answer): 
            color = '\x1b[6;30;42m' + "Correct" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    

display(out)
check.on_click(sjekksvar)


In [None]:
# enter your code for finding the country which makes the most chocolate
country = df.columns[-1]

In [None]:
#@title Answer 3 - Show Code
df['Broad Bean\nOrigin'].value_counts().plot(kind='bar') # value counts tallies the counts for each category and we can plot this as a bar graph

In [None]:
#@title Answer 3 - Show Code
# let's make it bigger so we can read the labels
import matplotlib.pyplot as plt
plt.figure(figsize=(16,10))
df['Broad Bean\nOrigin'].value_counts().plot(kind='bar')

In [None]:
# notice above there is a missing label for one of the origins; let's see how this is encoded
origin_counts = df['Broad Bean\nOrigin'].value_counts()
origin_counts.index[5], origin_counts.iloc[5]  # .iloc selects by position; plain [5] on a label index is deprecated
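
If the blank label turns out to be a whitespace-only string (an assumption to confirm from the output of the cell above), one way to handle it is to convert such entries into proper missing values so that `isna` and `value_counts` treat them consistently. A minimal sketch on made-up origins:

```python
import numpy as np
import pandas as pd

# Toy origins, made up for illustration. Two entries are "blank":
# one is a non-breaking space, the other an ordinary space.
origins = pd.Series(["Venezuela", "\xa0", "Peru", " ", "Peru"])

# Strip surrounding whitespace, then turn empty strings into NaN.
cleaned = origins.str.strip().replace("", np.nan)

print(cleaned.isna().sum(), "missing origins")
print(cleaned.value_counts())
```

By default `value_counts` leaves missing values out, so the blank bar would disappear from the plot after this cleaning step.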

In [72]:
#@title Question 4
out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('Brazil', 1), ('Honduras', 2), ('Vietnam', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) Which countries produce the highest-rated bars?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = [2]
        if(a in right_answer): 
            color = '\x1b[6;30;42m' + "Correct - Honduras has the highest mean and median value" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False - Both Vietnam and Brazil have the second highest median rating" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    

display(out)
check.on_click(sjekksvar)


In [None]:
# enter your answer here. hint: can we be confident about a high rating when there is only one review?


In [None]:
#@title Answer 4 - Show Code
# some countries have very few ratings, so let's subset our dataframe to include only countries with more than 10 ratings

origin_counts = df['Broad Bean\nOrigin'].value_counts()
subset = origin_counts[origin_counts > 10]

subset_df = df[df['Broad Bean\nOrigin'].isin(subset.index)]

plot = subset_df.boxplot('Rating', by='Broad Bean\nOrigin', figsize=(16,10), showmeans=True)
plt.xticks(rotation=90)
plt.show()

In [None]:
#@title Answer 4 - Show Code
# select the Rating column before averaging, so text columns are not passed to mean()
subset_df.groupby('Broad Bean\nOrigin')['Rating'].mean().sort_values()
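
Following the hint about small samples, it can be useful to see the mean, the median and the number of ratings side by side. This is a minimal sketch on a made-up table; the column names `origin` and `rating` and the threshold of 2 are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy data, made up for illustration: origin C has a single, high rating.
toy = pd.DataFrame({
    "origin": ["A", "A", "A", "B", "B", "C"],
    "rating": [3.0, 3.5, 4.0, 2.5, 3.5, 4.0],
})

# One row per origin with mean, median and number of ratings.
summary = (
    toy.groupby("origin")["rating"]
       .agg(["mean", "median", "count"])
       .sort_values("mean", ascending=False)
)

# Keep only origins with enough ratings to trust the average.
reliable = summary[summary["count"] >= 2]

print(summary)
print(reliable)
```

In the toy table, C tops the ranking on a single rating and drops out once the count filter is applied, which is the same effect the "more than 10 ratings" subset guards against.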

In [73]:
#@title Question 5
out = widgets.Output()

alternativ = widgets.RadioButtons(
    options=[('Highest ratings for > 90%', 1), ('Highest ratings for < 50%', 2), ('Highest ratings for ~ 70%', 3)],
    description='',
    disabled=False
)
print('\033[1m','1) What’s the relationship between cocoa solids percentage and rating?','\033[0m')
check = widgets.Button(description="Check my answer")
display(alternativ)
display(check)


def sjekksvar(b):
        a = int(alternativ.value)
        right_answer = 3
        if(a==right_answer): 
            color = '\x1b[6;30;42m' + "Correct" + '\x1b[0m' +"\n" #green color
        else:
            color = '\x1b[5;30;41m' + "False" + '\x1b[0m' +"\n" #red color
        svar = ["","","",""] 
        with out:
            clear_output()
        with out:
            print(color+""+svar[a-1])   
    

display(out)
check.on_click(sjekksvar)


In [None]:
# enter your code here

In [None]:
#@title Answer 5 - Show Code

# scatter plot with rating on the x-axis and cocoa percentage on the y-axis
df.plot.scatter(x='Rating', y='Cocoa\nPercent', figsize=(10,10))
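
Because ratings take only a few distinct values, many points in a scatter plot sit on top of each other. One alternative is to group the cocoa percentage into bands and compare the mean rating per band. The sketch below uses made-up numbers and arbitrarily chosen band edges, so it shows the technique rather than the answer for the real data.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "cocoa": [55.0, 65.0, 70.0, 72.0, 75.0, 85.0, 100.0],
    "rating": [2.75, 3.25, 3.5, 3.5, 3.25, 3.0, 2.0],
})

# Band edges are an arbitrary choice; each band includes its right edge.
bins = [40, 60, 70, 80, 90, 100]
toy["cocoa_band"] = pd.cut(toy["cocoa"], bins=bins)

# Mean rating within each cocoa band.
band_mean = toy.groupby("cocoa_band", observed=True)["rating"].mean()

print(band_mean)
```

Calling `band_mean.plot(kind='bar')` would then give one bar per band, which is often easier to read than overlapping points.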

