In [30]:
import pandas as pd
import numpy as np

import scipy.stats as st
import altair as alt

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

import panel as pn

In [81]:
df = pd.read_csv('../data/flavors_of_cacao.csv')

I want to do several things with this data, which comes from Rachael Tatman's Kaggle dataset (https://www.kaggle.com/rtatman/chocolate-bar-ratings).

First, I must organize this DataFrame. I aim to sort it by score and country (the Rating and Company Location columns). Then, I will make a FacetGrid with one panel per country that plots the KDE and mean of that country's chocolate ratings.

In [82]:
df.head()

Unnamed: 0,Company,REF,Review Date,Cocoa Percent,Company Location,Rating,Bean Origin
0,A. Morin,1876,2016,63%,France,3.75,Sao Tome
1,A. Morin,1676,2015,70%,France,2.75,Togo
2,A. Morin,1676,2015,70%,France,3.0,Togo
3,A. Morin,1680,2015,70%,France,3.5,Togo
4,A. Morin,1704,2015,70%,France,3.5,Peru


In [83]:
df_origin = df['Bean Origin'].astype('string')

In [84]:
df = df.drop(labels='Bean Origin', axis='columns')

In [85]:
df['Bean Origin'] = df_origin
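The three cells above (convert, drop, reattach) can probably be collapsed into a single assignment. A minimal sketch on a toy frame, whose values are illustrative rather than taken from the full dataset:

```python
import pandas as pd

# Toy stand-in for df; only two columns are needed to show the idea.
toy = pd.DataFrame({'Rating': [3.75, 2.75], 'Bean Origin': ['Sao Tome', 'Togo']})

# Assigning the converted column back replaces it in place,
# so no drop-and-reattach step is needed and column order is unchanged.
toy['Bean Origin'] = toy['Bean Origin'].astype('string')
print(toy.dtypes)
```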

In [89]:
df.iloc[[1120]]

Unnamed: 0,Company,REF,Review Date,Cocoa Percent,Company Location,Rating,Bean Origin
1120,Michel Cluizel,81,2006,99%,France,2.0,


In [68]:
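# Assign the result back: replace(..., inplace=True) on a column selection
# may not update df itself under pandas copy-on-write.
df['Bean Origin'] = df['Bean Origin'].replace(' ', np.nan)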
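The blank in a row like 1120 is not necessarily a single plain space; an apparently empty cell can hold several spaces or a non-breaking space, which an exact match on ' ' would miss. A minimal sketch, assuming a toy Series in place of the real column, that turns any empty or whitespace-only entry into a missing value:

```python
import pandas as pd

# Toy stand-in for df['Bean Origin']; the values are illustrative only.
origin = pd.Series(['Togo', ' ', '\xa0', 'Peru', ''], dtype='string')

# Mask every entry that is empty once surrounding whitespace is stripped.
# str.strip() with no arguments also removes the non-breaking space '\xa0'.
cleaned = origin.mask(origin.str.strip() == '')

n_missing = int(cleaned.isna().sum())
print(cleaned.tolist(), n_missing)
```

Whether the real file contains such characters is an assumption; printing `repr()` of a suspect cell is a quick way to check.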

In [None]:
df = df.sort_values(by=['Company Location', 'Rating']).copy()

What I want to do:

-Remove rows with NaNs or blank values

DONE -Remove multi-origin chocolates (rows with country of origin containing a comma)

DONE -Sort/organize dataset

What I want to determine:

-Which country MAKES the best chocolate

-The mean chocolate rating score

-Which country grows the best beans

-Plot histogram of Making Country vs. Origin

-Which company makes the best chocolate
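For the making-country versus origin comparison in the list above, one option is a table of counts, which could then feed a bar chart or heatmap. A sketch using `pd.crosstab` on a few toy rows (the values are illustrative, not results from the dataset):

```python
import pandas as pd

# Toy rows standing in for the two columns of interest.
toy = pd.DataFrame({
    'Company Location': ['France', 'France', 'U.S.A.', 'U.S.A.', 'France'],
    'Bean Origin': ['Togo', 'Peru', 'Peru', 'Peru', 'Togo'],
})

# Rows are making countries, columns are bean origins, cells are bar counts.
counts = pd.crosstab(toy['Company Location'], toy['Bean Origin'])
print(counts)
```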

First, I will remove multi-origin chocolates by eliminating those rows with a ',' in the name.

In [10]:
df.dtypes

Bean Origin String     string
Company                object
REF                     int64
Review Date             int64
Cocoa Percent          object
Company Location       object
Rating                float64
dtype: object

In [42]:
condition_origin = df['Bean Origin'].str.contains(',')

df[condition_origin].index

ValueError: Cannot mask with non-boolean array containing NA / NaN values

In [12]:
df.drop(axis='index', labels=[197,  224,  509,  629,  631,  635,  746,  748,  783,  930,  939,
             957,  972,  984,  985,  986, 1070, 1071, 1089, 1162, 1215, 1216,
            1262, 1288, 1337, 1338, 1356, 1434, 1510, 1528, 1530, 1535, 1536,
            1539, 1544, 1547, 1623, 1631, 1659, 1691], inplace=True)

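So that finally worked, and successfully removed the multi-origin beans from this DataFrame. The ValueError arose because str.contains returns a missing value for rows where the origin is missing, so the mask was not purely boolean. Dropping rows by hard-coded index label is quite a workaround, though, and the labels would need updating if the data changes.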
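The hard-coded labels can likely be avoided: `str.contains` accepts an `na` argument that sets the result for missing values, which keeps the mask boolean. A minimal sketch on a toy frame (column values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df, including one missing origin.
toy = pd.DataFrame({
    'Bean Origin': ['Togo', 'Peru, Ecuador', np.nan, 'Ghana, Togo', 'Peru'],
    'Rating': [2.75, 3.5, 2.0, 3.0, 3.5],
})

# na=False treats a missing origin as "no comma", so the mask has no NA values.
multi_origin = toy['Bean Origin'].str.contains(',', na=False)

# Keep only single-origin rows; rows with a missing origin are kept here
# and can be handled in a separate step.
single_origin = toy[~multi_origin]
print(single_origin)
```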

Next, I want to identify and remove any rows with blank values.

Notice that row 1778 (check it for yourself) is blank. There are probably others.

In [13]:
# Keep rows whose bean origin is neither missing nor blank
df1 = df[df['Bean Origin'].notna() & (df['Bean Origin'].str.strip() != '')]

Which country grows the best beans? I want to group by bean origin, and then compute the mean() of the rating. Easy!

In [14]:
ratings = df1.groupby('Bean Origin')['Rating'].mean()
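A mean on its own can mislead when a country has only one or two reviewed bars. A sketch, on toy data, that reports the count next to the mean and sorts from best to worst; the same pattern with 'Company Location' as the key would address which country makes the best chocolate:

```python
import pandas as pd

# Toy ratings; the numbers are illustrative, not results from the dataset.
toy = pd.DataFrame({
    'Bean Origin': ['Togo', 'Togo', 'Peru', 'Peru', 'Peru', 'Ghana'],
    'Rating': [2.75, 3.25, 3.5, 3.0, 4.0, 3.75],
})

# Mean rating and number of bars per origin, highest mean first.
summary = (
    toy.groupby('Bean Origin')['Rating']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(summary)
```

In this toy example the top-ranked origin rests on a single bar, which is the kind of case the count column is meant to expose.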

In [38]:
alt.Chart(df).mark_point().encode(
    x=alt.X('Bean Origin:O', axis=alt.Axis(title='Bean Origin Country')),
    y=alt.Y('Rating:Q', axis=alt.Axis(title='Rating [0-5]')),
)