<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/02_data/05_exploratory_data_analysis/colab-part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Exploratory Data Analysis

[Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis), often shortened to EDA, is a term that you'll hear quite a bit in the field of data science. EDA is the process of examining a dataset to find facts about the data and communicating those facts, often through visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as *data preprocessing*. Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring. Because of this tight coupling, we'll clean the data as necessary to help understand the data.

In this lab we will apply our Pandas knowledge to explore a dataset about chocolate. Part 1 of the lab will explore each column in our dataset individually. Part 2 will take the results of our preprocessed data and search for patterns across columns and rows.

## Introduction



### The Dataset: Chocolate Bar Ratings

In this lab we will use a [chocolate bar ratings dataset](https://www.kaggle.com/rtatman/chocolate-bar-ratings). The dataset is from the [Flavors of Cacao](http://flavorsofcacao.com/flavor.html) data.

On the [Kaggle page for the dataset](https://www.kaggle.com/rtatman/chocolate-bar-ratings), we can find some basic information about the dataset. For instance, there are over 1,700 chocolate bars that have been rated. We can also preview the columns found in the dataset:

Column | Data Type | Description
-------|-----------|-------------
Company (Maker-if known) | String | Name of the company manufacturing the bar.
Specific Bean Origin or Bar Name | String | The specific geo-region of origin for the bar.
REF | Number | A value linked to when the review was entered in the database. Higher = more recent.
Review Date | Number | Date of publication of the review.
Cocoa Percent | String | Cocoa percentage (darkness) of the chocolate bar being reviewed.
Company Location | String | Manufacturer base country.
Rating | Number | Expert rating for the bar.
Bean Type | String | The variety (breed) of bean used, if provided.
Broad Bean Origin | String | The broad geo-region of origin for the bean.

This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

*   Is there a relationship between numeric rating and properties such as percentage of cocoa, bean type, origin, and maker?
*   Are some of the properties of cacao beans correlated?
*   Where are the top chocolate bars from?
*   Are there multiple entries for the same bar from the same maker, but with different ratings over the years? If so, has there been any change in the chocolate bar that could account for the differences?
*   Do makers who produce a wide variety of bars have a higher chance of creating a top-rated chocolate bar?

I'm sure you can think of even more. So, what are we waiting for? Let's load the data!

## Acquiring the Data

The data is hosted on Kaggle, so we can use our Kaggle credentials to download the data into the lab. The dataset is located at [https://www.kaggle.com/rtatman/chocolate-bar-ratings](https://www.kaggle.com/rtatman/chocolate-bar-ratings). We can use the `kaggle` command line utility to do this.

First off, upload your `kaggle.json` file into the lab now.

Next, run the following command to get the credential files set to the right permissions and located in the correct spot.

In [0]:
! chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'

Now we can run the `kaggle` command to actually download the data.

In [0]:
! kaggle datasets download rtatman/chocolate-bar-ratings
! ls

We now have our data downloaded to our virtual machine and stored in the file `chocolate-bar-ratings.zip`.

## Creating a `DataFrame`

We now need to load the data into memory. We can do this easily using Pandas' `read_csv()` function.

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df

Let's also make sure that our data types match what was documented:

In [0]:
df.dtypes

In this output, `object` types are strings while `int64` types are whole numbers and `float64` types are fractional numbers. This seems to match the documentation that we saw for the dataset.

From just a glance at the `DataFrame`, we can see a few facts about our data:

* There are 1,795 rows and 9 columns.
* The columns are the columns we expected based on the documentation, though some have `\n` (new line) embedded in them. We'll need to clean that up.
* The data seems to be sorted by the 'Company' column.
* There is definitely some missing data, as we can see in the 'Bean Type' column.

We will look more closely at each column throughout this lab.
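As a side note, here is a minimal sketch of why a zipped CSV can be loaded directly and how a newline can end up inside a column name. The file name `toy-chocolate.zip`, the toy rows, and the variable names are assumptions made up for this illustration. They are not part of the real dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Toy CSV (hypothetical data). Two header fields contain a quoted newline.
csv_text = (
    '"Company\n(Maker-if known)","Cocoa\nPercent",Rating\n'
    'Maker A,63%,3.75\n'
    'Maker B,70%,3.0\n'
)

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'toy-chocolate.zip')
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('toy.csv', csv_text)
    # read_csv infers zip compression from the '.zip' extension when the
    # archive holds a single file.
    toy = pd.read_csv(zip_path)

print(toy.shape)

# Column names that still carry an embedded newline character.
newline_cols = [c for c in toy.columns if '\n' in c]
print(newline_cols)
```

Listing the offending names this way can be quicker than spotting `\n` by eye in a wide `DataFrame`.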

## Cleaning Up Column Names

One of the more frustrating aspects of this dataset is the poor format of the column names. Typing 'Specific Bean Origin\nor Bar Name' in order to access the column is painful.

So our first order of business will be to update the column names.

In [0]:
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]

df

That's much better, but the columns are also in an odd order. Information about the company is spread across the columns, and so is the information about the cacao bean. Let's order the columns a little more meaningfully.

This order makes a little more sense:

**Company Information:**
* Company
* Company Location

**Chocolate Bar Information**
* Bean Type
* Specific Bean Origin
* Broad Bean Origin
* Cocoa Percent

**Review Information**
* REF
* Review Date
* Rating

We can reorder the columns by specifically selecting the columns in order and reassigning them to the `df` variable:

In [0]:
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]

df
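The renaming and reordering steps above can also be sketched on a tiny stand-in `DataFrame`. The frame `raw`, its single row, and the simplified column names are assumptions for illustration only. Replacing the newlines programmatically is just one alternative to typing the whole list of names by hand.

```python
import pandas as pd

# Hypothetical stand-in for the freshly loaded data.
raw = pd.DataFrame({
    'Company\n(Maker-if known)': ['Maker A'],
    'Cocoa\nPercent': ['63%'],
    'Company\nLocation': ['France'],
})

tidy = raw.copy()

# Swap each embedded newline for a space in every column name.
tidy.columns = tidy.columns.str.replace('\n', ' ', regex=False)

# Shorten one name, then select the columns in the order we want.
tidy = tidy.rename(columns={'Company (Maker-if known)': 'Company'})
tidy = tidy[['Company', 'Company Location', 'Cocoa Percent']]

print(list(tidy.columns))
```

Assigning a full list to `df.columns`, as we did above, is simpler when there are only a handful of columns. The string-based approach may scale better to wide tables.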

## Examining Each Column


In this section we will examine each column to learn about the data in the column. We will also make changes to the data as needed.

### Column: Company

The 'Company' column is the first in the list, so let's look at it first.

We can tell that the column contains string values. Let's see if any are missing:

In [0]:
df['Company'].isnull().any()

No data is missing. Let's now see how many distinct values there are:

In [0]:
df['Company'].unique().size

A few hundred is not a terribly long list. Let's print the list in alphabetical order to see how it looks.

In [0]:
for company in sorted(df['Company'].unique()):
  print(company)

This is some interesting data. Looking at it raises many questions. For instance:

* Should company names like 'Vintage Plantations' and 'Vintage Plantations (Tulicorp)' be changed to the same name?
* Is 'Cacao de Origin' a misspelling of 'Cacao de Origen'?
* Is 'Shattel' a misspelling of 'Shattell'?

These are the types of things you'll see and questions you'll ask when you encounter a new dataset. Rarely is the data in perfect condition. Often you'll spend a considerable amount of time researching topics related to the data in order to make a call about repairing aspects of the data.

In this particular case, it would be great if we could find a master list of all of the chocolate makers in the world. We could then cross-reference the names in the dataset with the names in the master list.

Unfortunately, we don't have a master list of chocolate makers. Instead, we will have to rely on manually inspecting the data and researching when things don't look right.
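Manual inspection can be given a small head start. The sketch below uses Python's standard `difflib` module to flag pairs of names that are nearly identical. The toy list of companies and the `0.9` similarity cutoff are assumptions chosen for illustration. Any pair it reports is only a candidate that still needs human judgment.

```python
import difflib

import pandas as pd

# Toy stand-in for the 'Company' column.
companies = pd.Series([
    'Shattel', 'Shattell', 'Cacao de Origin', 'Cacao de Origen', 'Amano',
])

names = sorted(companies.unique())
pairs = []
for i, name in enumerate(names):
    # Compare each name only with the names after it to avoid duplicates.
    for match in difflib.get_close_matches(name, names[i + 1:], n=3,
                                           cutoff=0.9):
        pairs.append((name, match))

for pair in pairs:
    print(pair)
```

A lower cutoff surfaces more candidates but also more false alarms, so the threshold is a judgment call.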

Let's say that for now we are confident that 'Cacao de Origin' and 'Shattel' are misspellings, so we will correct that data. We aren't confident enough to change any of the names with parentheses in them though.

Let's fix our misspellings!

#### Exercise 1: Fixing Misspellings

We have decided that we would like to change every instance of 'Cacao de Origin' to 'Cacao de Origen' and every instance of 'Shattel' to 'Shattell' in the 'Company' column of our dataset. Write the code to modify the values. Make sure your code doesn't produce any warnings. At the end of the code block, print the number of unique company names. There should be two fewer unique names than what you saw above.

**Student Solution**

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = ['Company', 'Specific Bean Origin', 'REF', 'Review Date',
              'Cocoa Percent', 'Company Location', 'Rating', 'Bean Type',
              'Broad Bean Origin']
df = df[['Company', 'Company Location', 'Bean Type', 'Specific Bean Origin',
         'Broad Bean Origin', 'Cocoa Percent', 'REF', 'Review Date', 'Rating']]

# Change 'Shattel' to 'Shattell'

# Change 'Cacao de Origin' to 'Cacao de Origen'

# Print the number of unique company names

---

##### Answer Key

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = ['Company', 'Specific Bean Origin', 'REF', 'Review Date',
              'Cocoa Percent', 'Company Location', 'Rating', 'Bean Type',
              'Broad Bean Origin']
df = df[['Company', 'Company Location', 'Bean Type', 'Specific Bean Origin',
         'Broad Bean Origin', 'Cocoa Percent', 'REF', 'Review Date', 'Rating']]

# Change 'Shattel' to 'Shattell'
df.loc[df['Company'] == 'Shattel', 'Company'] = 'Shattell'

# Change 'Cacao de Origin' to 'Cacao de Origen'
df.loc[df['Company'] == 'Cacao de Origin',
       'Company'] = 'Cacao de Origen'

# Print the number of unique company names
print(len(df['Company'].unique()))

---

### Column: Company Location

The [documentation](https://www.kaggle.com/rtatman/chocolate-bar-ratings) describes the 'Company Location' column as "*Manufacturer base country*."

Let's take a look at the data. As always, we'll first check to see if any data is missing.

In [0]:
df['Company Location'].isna().any()

No missing data.

Now we can see how many unique values there are:

In [0]:
df['Company Location'].unique().shape

There are just 60 locations, which is small enough that we can manually inspect the values. Let's print the data.

In [0]:
for location in sorted(df['Company Location'].unique()):
  print(location)

Overall, the data looks pretty clean. The column is supposed to contain countries and *most* entries are countries. There are a few problems with the country data though. We found at least five errors in the data. Let's see what you can find.

#### Exercise 2: Fixing Company Location Data

There are at least five errors in the company location data that need to be fixed. Some are fairly easy to spot (spelling errors), but some do require knowledge of what constitutes a country. Take some time to look at the data, and see if you can spot at least two of the issues. Write code to fix the issues.

**Student Solution**

In [0]:
# Fix at least two issues with the 'Company Location' data

---

##### Answer Key

Three of the errors are spelling errors, two of which result in "splitting" a country.

1. "Dominican Republic" is misspelled "Domincan Republic". This is purely a cosmetic error.
1. "Ecuador" is misspelled as "Eucador", but also correctly spelled in some cases. This splits the country.
1. "Nicaragua" is misspelled as "Niacragua", but also correctly spelled in some cases. This splits the country.

We found two "what is a country?" errors:

1. 'Amsterdam' is a city, not a country. It is in the Netherlands, which is informally known as 'Holland'.
1. 'U.K.' is a group of countries, some of which are already represented in the data. Likely this should be 'England'.

There are other changes that could arguably be made. For instance, the island of Martinique is technically part of France. We don't make that change, but a case could be made to do so.

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]
              
df.loc[df['Company Location'] == 'Domincan Republic',
       'Company Location'] = 'Dominican Republic'
df.loc[df['Company Location'] == 'Niacragua', 'Company Location'] = 'Nicaragua'
df.loc[df['Company Location'] == 'Eucador', 'Company Location'] = 'Ecuador'
df.loc[df['Company Location'] == 'Amsterdam', 'Company Location'] = 'Holland'
df.loc[df['Company Location'] == 'U.K.', 'Company Location'] = 'England'

for location in sorted(df['Company Location'].unique()):
  print(location)
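When several values need fixing at once, a single mapping can be easier to review than a series of `.loc` assignments. This is a minimal sketch on a toy `Series`. The names `locations` and `fixes` are made up for this example, and the mapping covers only the spelling errors.

```python
import pandas as pd

# Toy stand-in for the 'Company Location' column.
locations = pd.Series([
    'Ecuador', 'Eucador', 'Niacragua', 'France', 'Domincan Republic',
])

# Each key is a bad value and each value is its replacement.
fixes = {
    'Eucador': 'Ecuador',
    'Niacragua': 'Nicaragua',
    'Domincan Republic': 'Dominican Republic',
}

# replace() with a dict swaps whole values that match a key exactly.
cleaned = locations.replace(fixes)

print(sorted(cleaned.unique()))
```

Both approaches give the same result here, so the choice is mostly a matter of taste.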

---

### Column: Bean Type

Now that our company data is looking a little better, let's move into data about the cocoa going into the chocolate bar itself. The first piece of data is the 'Bean Type'. 'Bean Type' is defined as "*The variety (breed) of bean used, if provided*". This hints that there will be some missing data. Let's check and see.

In [0]:
df['Bean Type'].isna().any()

Indeed, we have missing data. Let's see how much is missing.

In [0]:
# Count the rows where 'Bean Type' is missing
df['Bean Type'].isna().sum()

Only one row of data is missing 'Bean Type'. Let's take a look at that row.

In [0]:
df[df['Bean Type'].isna()]

Now we have a choice to make about how to handle this missing data. Some options include:

* Leave it as is
* Remove the entire row
* Fill in the data with some value

Leaving undefined values lying around in our data can be problematic. Missing values are not counted and can be tricky to program around.

Removing the entire row actually isn't a bad option in this case. Since it is only one row out of over 1,700, it likely won't have too much effect on any analysis that we do.

As for filling in the row, we can:

* Use 'Unknown' or some other placeholder value
* Actually do research to find the true missing value
* See if there is a reasonable value already in the data
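To make the trade-offs concrete, here is a small sketch of leaving, dropping, and filling a missing value. The three-row `toy` frame is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Company': ['A', 'B', 'C'],
    'Bean Type': ['Criollo', np.nan, 'Trinitario'],
})

# Option 1: leave it. Note that value_counts() silently skips the NaN.
counts_as_is = toy['Bean Type'].value_counts()
print(counts_as_is.sum())

# Option 2: remove every row that is missing 'Bean Type'.
dropped = toy.dropna(subset=['Bean Type'])
print(len(dropped))

# Option 3: fill the gap with a placeholder value.
filled = toy.fillna({'Bean Type': 'Unknown'})
print(filled['Bean Type'].tolist())
```

Neither `dropna()` nor `fillna()` modifies `toy` here. Each returns a new `DataFrame`.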

In this case, we are just going to replace the missing value with 'Unknown'.

In [0]:
df.loc[df['Bean Type'].isna(), 'Bean Type'] = 'Unknown'
df[df['Bean Type'].isna()]

Now we can see how many unique bean types we have.

In [0]:
df['Bean Type'].unique().size

Only 42. Let's print them out.

In [0]:
for t in sorted(df['Bean Type'].unique()):
  print(t)

The data looks pretty good, but there is one small problem. After 'Unknown' there seems to be an empty line. What is that?

It turns out that it is a whitespace character. We thought we had only one missing value, but it looks like some values are present yet contain nothing but white space.

White space can be tricky because there are many different encodings that render as white space. Let's find out exactly which space character this is.

To get the space(s) we can sort the 'Bean Type' values again and get the last one, since we see the space last in the list. We can then print the space as hexadecimal characters.

In [0]:
space = sorted(df['Bean Type'].unique())[-1]
print(", ".join("0x{:02x}".format(ord(c)) for c in space))

We get `0xa0`, which is the Unicode code point (U+00A0) for a [non-breaking space](https://en.wikipedia.org/wiki/Non-breaking_space). It lies outside the 7-bit ASCII range, and it is different from the white space that you get when you hit the space bar. That space is encoded as `0x20`.
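Here is a quick sketch of how the two characters differ. It uses only the Python standard library, and the variable names are arbitrary.

```python
import unicodedata

nbsp = chr(0xa0)   # the character found in the data
space = ' '        # what the space bar produces

print(hex(ord(nbsp)), unicodedata.name(nbsp))
print(hex(ord(space)), unicodedata.name(space))

# The two look alike on screen but do not compare equal.
print(nbsp == space)

# Python still treats the non-breaking space as white space.
print(nbsp.isspace(), nbsp.strip() == '')
```

Because the two characters are not equal, comparing a column against `' '` would miss every one of these values.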

Let's see how many of these there are:

In [0]:
df[df['Bean Type'] == chr(0xa0)]

Almost 900! Let's encode those as 'Unknown' also.

#### Exercise 3: Fixing Non-Breaking Space

There are non-breaking space characters (`0xa0`) in the 'Bean Type' column. Replace these values with the word 'Unknown'.

**Student Solution**

In [0]:
# Your Code Goes Here

---

##### Answer Key

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]

df.loc[df['Bean Type'].isna(), 'Bean Type'] = 'Unknown'

df.loc[df['Bean Type'] == chr(0xa0), 'Bean Type'] = 'Unknown'

for bt in sorted(df['Bean Type'].unique()):
  print(bt)

---

### Column: Specific Bean Origin

Let's look at our next column: 'Specific Bean Origin'. 'Specific Bean Origin' is a string column that contains the "*specific geo-region of origin for the bar*."

First, we'll see if we are missing any data in the 'Specific Bean Origin' column.

In [0]:
df['Specific Bean Origin'].isna().any()

Good, we don't have any 'N/A' data. But we learned from the 'Bean Type' column that we also need to check string columns for values that are only white space.

A good way to do this is to apply a function that strips leading and trailing white space from every value in a column, and see if the resulting string is zero-length.

In [0]:
df[df['Specific Bean Origin'].apply(lambda x: x.strip()).str.len() == 0]

Here we can see that no data was returned, so we don't have any 'Specific Bean Origin' values that are only spaces.

If you run this code and get an error saying that a `float` object has no attribute `strip`, you likely have N/A values in your column. Always check `isna()` first.
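As an alternative sketch, the vectorized `.str.strip()` accessor passes missing values through instead of raising an error. The toy `Series` below is an assumption for illustration.

```python
import pandas as pd

# Toy values: a real origin, a non-breaking space, plain spaces, and a gap.
origins = pd.Series(['Madagascar', '\xa0', '  ', None])

# .str.strip() leaves the missing value missing rather than failing on it.
stripped = origins.str.strip()

# True only where a value is present but empty after stripping.
blank = stripped.eq('')

print(blank.tolist())
print(int(blank.sum()))
```

Missing values still need their own `isna()` check, since they are not flagged as blank here.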

Now that we know that every row has a 'Specific Bean Origin' value, let's see how many unique values we have.

In [0]:
df['Specific Bean Origin'].unique().size

Over 1,000 values! That is quite a bit of data to manually sift through. Let's look at the first bit of data, up until the first origin that starts with 'B'.

In [0]:
for origin in sorted(df['Specific Bean Origin'].unique()):
  if origin.startswith('B'):
    break
  print(origin)

This is some pretty ugly data. Most (but not all) rows contain the bean's geographical origin, but some seem to include the year and/or batch numbers as well, and some seem to contain different information entirely ("100 percent").

Looking at the data, we can also see some things that look odd. For instance, "Akesson Estate" and "Akesson's Estate" are likely the same origin. Also, "Ambolikapkly P." clearly looks like a misspelling of "Ambolikapiky P."

We could make all of the "Akesson" origins look the same, but should we? First, let's look at the entire rows for the offending data.

In [0]:
df[(df['Specific Bean Origin'] == 'Akesson Estate') | \
   (df['Specific Bean Origin'] == "Akesson's Estate")]

It is interesting that all of the bean types and origins are alike. It looks like Akesson('s) Estate serves many companies though.

It is tempting to go ahead and change the "Specific Bean Origin" values to make them match, but it is better to do more research into the industry before making those sorts of changes. You might disagree with this decision, and that is perfectly fine. When working with datasets, you will often have to make difficult calls to deal with ambiguous data. Different people will make different decisions, and that's okay.

The "Ambolikapkly P." issue is a little more obvious and can be validated with a quick internet search. The "Ambolikapkly" spelling shows up very few times and always in the context of this dataset. The other spelling is much more common. Let's go ahead and fix that.

In [0]:
df.loc[df['Specific Bean Origin'] == 'Ambolikapkly P.', 
       'Specific Bean Origin'] = 'Ambolikapiky P.'

#### Exercise 4: Finding and Repairing Bad Data

There are a few more obvious errors in the 'Specific Bean Origin' column of the dataset. Print out the column, scan the output, and see if you can find any more errors. Write the code to fix the errors. Find at least one error to fix.

The code to print the dataset is below.

In [0]:
for origin in sorted(df['Specific Bean Origin'].unique()):
  print(origin)

**Student Solution**

In [0]:
# Repair the data

---

##### Answer Key

Any reasonable fix is acceptable. Two that we found: 'Dominican Republicm, rustic' should be 'Dominican Republic, rustic', and 'Nicaraqua' should be 'Nicaragua'. There are likely many more.

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]
              
df.loc[df['Specific Bean Origin'] == 'Dominican Republicm, rustic', 
       'Specific Bean Origin'] = 'Dominican Republic, rustic'
df.loc[df['Specific Bean Origin'] == 'Nicaraqua',
       'Specific Bean Origin'] = 'Nicaragua'

for origin in sorted(df['Specific Bean Origin'].unique()):
  print(origin)

---

#### Exercise 5: Top Specific Bean Origins

There are just over 1,000 unique specific bean origins and over 1,700 entries in the dataset. Write code to find the top five most repeated origins. Print the origins and the number of times that each appears in the dataset.

**Student Solution**

In [0]:
# Find the top 5 bar origins

---

##### Answer Key

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]
              
df.groupby('Specific Bean Origin')['Specific Bean Origin'].count().sort_values(
    ascending=False).head()
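# An equivalent, shorter route is `value_counts()`, which counts each distinct
# value and sorts the counts in descending order. The sketch below runs on a
# toy Series (made-up values), not on the real column.
toy_origins = pd.Series(
    ['Madagascar', 'Peru', 'Madagascar', 'Ecuador', 'Peru', 'Madagascar'])
toy_top = toy_origins.value_counts().head(2)
print(toy_top)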

---

### Column: Broad Bean Origin

The 'Broad Bean Origin' is the "*broad geo-region of origin for the bean.*" In theory, this should be broader regions than the 'Specific Bean Origin' that we just worked with.

Let's dive in. First things first, let's check for N/A values.

In [0]:
# Count the rows where 'Broad Bean Origin' is missing
df['Broad Bean Origin'].isna().sum()

It looks like we are missing one origin. Let's take a look at the record.

In [0]:
df[df['Broad Bean Origin'].isna()]

The one record has a 'Specific Bean Origin' of 'Madagascar'. Let's see if there are any other chocolates from that same specific origin.

In [0]:
df[df['Specific Bean Origin'] == 'Madagascar']

Quite a few! And they all have a 'Broad Bean Origin' of 'Madagascar', except for our one missing value. It is probably safe to just set the missing value to 'Madagascar' also.

In [0]:
df.loc[(df['Specific Bean Origin'] == 'Madagascar') &
       (df['Broad Bean Origin'].isna()),
       'Broad Bean Origin'] = 'Madagascar'

df[df['Broad Bean Origin'].isna()]

Now that we have all of the N/A values handled, let's see if we have an issue with spaces.

In [0]:
df[df['Broad Bean Origin'].apply(lambda x: x.strip()).str.len() == 0]

There are spaces in 73 rows of the data. Let's see what those space values are.

In [0]:
spaces_df = df[df['Broad Bean Origin'].apply(
    lambda x: x.strip()).str.len() == 0]

for space in spaces_df['Broad Bean Origin'].unique():
  print(", ".join("0x{:02x}".format(ord(c)) for c in space))

It is that pesky `0xa0` again.

We can fix this by replacing all of the `0xa0` values with 'Unknown'. However, an even better fix would be if we could find similar chocolates with the same 'Specific Bean Origin' and then derive the 'Broad Bean Origin' from that.

Let's see if it is even possible. To do that, we can collect the 'Specific Bean Origin' values for rows that have a 'Broad Bean Origin' and for rows that don't. Then we can use `pd.merge()` to combine the two. If you remember, `pd.merge()` performs an inner join by default, so it returns only the values that appear in both of the given `Series`. This means that the return value will show us which specific origins appear both in rows with 'Broad Bean Origin' values and in rows without. A value can show up more than once in the result when it is repeated in either `Series`.

In [0]:
has_bbo_idx = df['Broad Bean Origin'].apply(lambda x: x.strip()).str.len() > 0

sbo_bbo = df[has_bbo_idx]['Specific Bean Origin']
sbo_no_bbo = df[~has_bbo_idx]['Specific Bean Origin']

pd.merge(sbo_bbo, sbo_no_bbo)

We have overlap, which is good. In theory, we could use the 'Broad Bean Origin' values from bars that *have* that value to fill in the 'Broad Bean Origin' for bars from the same specific region that *don't have* it.
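A minimal sketch of that fill-in idea follows. It uses a toy frame and, as an added safeguard that is our own assumption, only trusts specific origins that map to exactly one broad origin. The frame `toy` and the helper names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'Specific Bean Origin': ['Madagascar', 'Madagascar', 'Blend', 'Blend',
                             'Blend'],
    'Broad Bean Origin': ['Madagascar', '\xa0', 'Peru', 'Ecuador', '\xa0'],
})

# Rows whose broad origin is present but only white space.
blank = toy['Broad Bean Origin'].str.strip().eq('')
known = toy[~blank]

# Keep only specific origins that point to a single broad origin.
n_broad = known.groupby('Specific Bean Origin')['Broad Bean Origin'].nunique()
unambiguous = n_broad[n_broad == 1].index

lookup = (known[known['Specific Bean Origin'].isin(unambiguous)]
          .drop_duplicates('Specific Bean Origin')
          .set_index('Specific Bean Origin')['Broad Bean Origin'])

# Fill blanks from the lookup. Anything unresolved becomes 'Unknown'.
candidates = toy['Specific Bean Origin'].map(lookup)
filled = toy['Broad Bean Origin'].mask(blank, candidates).fillna('Unknown')

print(filled.tolist())
```

In this toy data 'Blend' maps to two different broad origins, so its blank row is left as 'Unknown' rather than guessed.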

But look closely at those 'Specific Bean Origin' values. Dark? Raw? Blend?

Those hardly look like geographic origins. The only two values that seem even close to regions are 'Amazonas' and 'Orinoco'. Let's look closer at the data for those regions.

In [0]:
df[(df['Specific Bean Origin'] == 'Orinoco') | 
   (df['Specific Bean Origin'] == 'Amazonas')]

Yuck! Amazonas turns out to be a very common place name. There are states called Amazonas in Brazil, Venezuela, and Peru. The Orinoco is a river that runs through both Venezuela and Colombia.

In neither case do we have definitive data to make the call about the 'Broad Bean Origin' for these rows.

Unfortunately, that is how it goes when working with data. You get imperfect data into your system, and then you try to research and find the best fix. But sometimes you just have to accept that you are missing data.

#### Exercise 6: Unknown Broad Bean Origins

We have a few 'Broad Bean Origin' values of `0xa0`. Change those values to the literal string 'Unknown'.

**Student Solution**

In [0]:
# Your Code Goes Here

---

##### Answer Key

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]

# Not required, this was illustrated in the lab
df.loc[(df['Specific Bean Origin'] == 'Madagascar') &
       (df['Broad Bean Origin'].isna()),
       'Broad Bean Origin'] = 'Madagascar'

# Solution
df.loc[df['Broad Bean Origin'] == chr(0xa0), 
       'Broad Bean Origin'] = 'Unknown'

df[df['Broad Bean Origin'].apply(lambda x: x.strip()).str.len() == 0]

---

### Column: Cocoa Percent

Next we will check out the 'Cocoa Percent' column. Remember that 'Cocoa Percent' is "*Cocoa percentage (darkness) of the chocolate bar*."

As usual, we'll first see if there is any missing data:

In [0]:
df['Cocoa Percent'].isna().any()

Nothing missing. Great!

Next, we should probably check to make sure that the percentages fall within a valid range: 0-100 or 0.0-1.0. You might recall that 'Cocoa Percent' isn't actually a numeric column, though, so we can't easily find the range. If we sample the data, we see that it looks like percentages from 0 to 100, but they are stored as strings with '%' symbols appended.

In [0]:
df['Cocoa Percent'].sample(10)

We need to remove those percentage signs and convert the digits that remain into numbers. There are a few ways that we can accomplish this.

One is to apply a lambda to each value. The lambda slices off the last character of each value (the '%' sign) and then converts what remains to a float using core Python syntax.

In [0]:
df['Cocoa Percent'].apply(lambda s: float(s[:-1]))

An alternative is to use `.str.strip('%')` on the `Series` to remove the percentage sign and then pass the resultant `Series` to `pd.to_numeric()` in order to convert the string values to numbers.

In [0]:
pd.to_numeric(df['Cocoa Percent'].str.strip('%'))

Is one way better than the other? Not necessarily. Feel free to choose whichever feels more natural to you.
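To back that up, here is a small sketch showing that both approaches produce identical results on toy data. The `Series` of percentages is made up for this example.

```python
import pandas as pd

pct = pd.Series(['63%', '70%', '100%', '72.5%'])

# Approach 1: slice off the trailing '%' and convert with float().
via_apply = pct.apply(lambda s: float(s[:-1]))

# Approach 2: strip '%' with the .str accessor, then convert in bulk.
via_str = pd.to_numeric(pct.str.strip('%'))

print(via_apply.equals(via_str))

# Once numeric, a range check becomes a one-liner.
print(via_str.between(0, 100).all())
```

One difference worth knowing: the slicing approach assumes the '%' is always the last character, while `.str.strip('%')` also works on values that have no percentage sign.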

Either way, we need to do the conversion and save the new values to 'Cocoa Percent'.

In [0]:
df['Cocoa Percent'] = df['Cocoa Percent'].apply(lambda s: float(s[:-1]))
df['Cocoa Percent'].describe()

We have now converted our 'Cocoa Percent' column from a string to a floating point number. We can see in the output of the call to `describe()` that the minimum cocoa percentage that we have is 42% and that the maximum is 100%. Both seem like reasonable values for cocoa content in a chocolate bar, so our work here is done.

### Column: REF

The 'REF' column is "*A value linked to when the review was entered in the database. Higher = more recent*." Let's take a look at it.

As always, we should check and see if there are any values missing.

In [0]:
df['REF'].isna().any()

We can `describe()` the data to see some basic statistics about it.

In [0]:
df['REF'].describe()

Here we can see that the data ranges from 5 through 1952 and that the mean is pretty high.

Are the values unique?

In [0]:
df['REF'].unique().size

Not unique. So 'REF' isn't a unique identifier for our rows of data.

There isn't much more that we can do with this column. We might want to visualize it to see if we can find any meaning. The numbers themselves aren't particularly interesting, but the quantity of each number might be. Let's find and plot the count of each 'REF'.

In [0]:
import matplotlib.pyplot as plt

ref_counts = df['REF'].groupby(df['REF']).count()
plt.figure(figsize=(20,10))
plt.bar(ref_counts.index.values, ref_counts)
plt.show()

From this chart we can see that 'REF' values repeat between 1 and 9 times with 4 being the most common. Overall, there isn't much interesting data or data repair for this column.
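Reading repeat counts off a bar chart can be imprecise. As a sketch, chaining two `value_counts()` calls tallies how often each repeat count occurs. The toy 'REF' values below are assumptions for illustration.

```python
import pandas as pd

ref = pd.Series([5, 5, 15, 15, 24])

# How many reviews share each REF value.
reviews_per_ref = ref.value_counts()

# How many REF values appear once, twice, and so on.
repeat_hist = reviews_per_ref.value_counts().sort_index()

print(repeat_hist)
print(repeat_hist.idxmax())
```

The index of `repeat_hist` is the number of repeats, and `idxmax()` returns the most common repeat count.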

### Column: Review Date

Review date is the date that the review for a given row was actually published. It is a numeric column.

First, let's see if any data is missing.

In [0]:
df['Review Date'].isna().any()

No missing data. Good.

Now we can check some basic statistics about the data.

In [0]:
df['Review Date'].describe()

We can see publication dates ranging from 2006 through 2017, which seem like reasonable years. If we had seen dates from the 1800s or from the future, we would have reason to worry. This range seems well within reason, though.

There isn't much else that we need to do for this column. Since we only have a few years when reviews were posted, we can create a visualization showing how many reviews were posted each year.

#### Exercise 7: Reviews Per Year

Create a visualization that shows the number of reviews that were created each year.

**Student Solution**

In [0]:
# Reviews Per Year Visualization

---

##### Answer Key

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = ['Company', 'Specific Bean Origin', 'REF', 'Review Date',
              'Cocoa Percent', 'Company Location', 'Rating', 'Bean Type',
              'Broad Bean Origin']

ref_counts = df['Review Date'].groupby(df['Review Date']).count()
min_val = df['Review Date'].min()
max_val = df['Review Date'].max()
plt.figure(figsize=(20,10))
plt.bar(ref_counts.index.values, ref_counts)
plt.xticks(np.arange(min_val, max_val+1))
plt.show()

---

### Column: Rating

We have now made it to the rating column. The rating is the "*expert rating for the bar*."  From the [documentation](https://www.kaggle.com/rtatman/chocolate-bar-ratings), the possible ratings are:

Rating | Meaning
-------|---------
5 | Elite (Transcending beyond the ordinary limits)
4 | Premium (Superior flavor development, character and style)
3 | Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
2 | Disappointing (Passable but contains at least one significant flaw)
1 | Unpleasant (mostly unpalatable)

Let's take a look at ratings. First off, are any missing?

In [0]:
df['Rating'].isna().any()

Nothing missing. Let's describe the column of data.

In [0]:
df['Rating'].describe()

It looks like our ratings are indeed floating-point values and that they range from 1.0 to 5.0. But are they really continuous?

In [0]:
sorted(df['Rating'].unique())

Interestingly enough, the values don't seem to be continuous, but instead seem to be divided into quarters. Instead of infinite possible values between 1.0 and 5.0, we really have 17 possible values: 1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 4.75, 5.0.
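One way to confirm the quarter-point pattern programmatically is to check that four times each rating is a whole number. This is a minimal sketch on a made-up `ratings` Series; the value 3.1 is included deliberately as an off-grid example.

```python
import pandas as pd

# Hypothetical ratings standing in for df['Rating'].
ratings = pd.Series([1.0, 2.75, 3.5, 3.75, 4.0, 3.1])

# A rating sits on the quarter-point grid when four times
# its value has no fractional part.
on_grid = (ratings * 4) % 1 == 0

# Show any ratings that break the pattern.
print(ratings[~on_grid])
```

Multiples of 0.25 are represented exactly in binary floating point, so the exact comparison with zero is safe for on-grid values.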

What does this mean for us?

Because the ratings take a small, fixed set of values, downstream we might be able to predict them with either a regression model or a classification model.

If we think about the ratings numbers, their relative position matters. For example, a 4.0 chocolate is better than a 2.0 chocolate. But does the magnitude matter? Is a 4.0 chocolate twice as good as a 2.0 chocolate? What does that even mean?

Let's set our modelers up for success and create a new column that they can use to potentially build models for our data.

#### Exercise 8: Ratings as Categories

In this exercise we are going to create a new column called 'Grade'. Grade is a categorical rating system that maps the following ratings to grades:

Rating | Grade
-------|------
5.00   | A
4.75   | B
4.50   | C
4.25   | D
4.00   | E
3.75   | F
3.50   | G
3.25   | H
3.00   | I
2.75   | J
2.50   | K
2.25   | L
2.00   | M
1.75   | N
1.50   | O
1.25   | P
1.00   | Q

Create the 'Grade' column and add it to our chocolate bar `DataFrame`.

**Student Solution**

In [0]:
# Your Code Goes Here

---

##### Answer Key

In [0]:
import pandas as pd

df = pd.read_csv('chocolate-bar-ratings.zip')
df.columns = [
  'Company',
  'Specific Bean Origin',
  'REF',
  'Review Date',
  'Cocoa Percent',
  'Company Location',
  'Rating',
  'Bean Type',
  'Broad Bean Origin'
]
df = df[[
  'Company',
  'Company Location',
  'Bean Type',
  'Specific Bean Origin',
  'Broad Bean Origin',
  'Cocoa Percent',
  'REF',
  'Review Date',
  'Rating',
]]

def grade(rating):
  if rating >= 5.00:
    return 'A'
  if rating >= 4.75:
    return 'B'
  if rating >= 4.50:
    return 'C'
  if rating >= 4.25:
    return 'D'
  if rating >= 4.00:
    return 'E'
  if rating >= 3.75:
    return 'F'
  if rating >= 3.50:
    return 'G'
  if rating >= 3.25:
    return 'H'
  if rating >= 3.00:
    return 'I'
  if rating >= 2.75:
    return 'J'
  if rating >= 2.50:
    return 'K'
  if rating >= 2.25:
    return 'L'
  if rating >= 2.00:
    return 'M'
  if rating >= 1.75:
    return 'N'
  if rating >= 1.50:
    return 'O'
  if rating >= 1.25:
    return 'P'
  if rating >= 1.00:
    return 'Q'
  raise ValueError(f'Unsupported rating: {rating}')

df['Grade'] = df['Rating'].apply(grade)

df[['Rating', 'Grade']].sample(10)

---
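A different way to build the same mapping is with a lookup table instead of a chain of comparisons. The sketch below is one possible approach, not the lab's official answer. It uses a made-up `ratings` Series, and the names `possible_ratings`, `grade_for`, and `grade_order` are introduced here for illustration. It also stores the grades as an ordered categorical, which records that 'A' ranks above 'Q' without implying any numeric distance between grades.

```python
import string
import pandas as pd

# The 17 possible ratings from 5.00 down to 1.00 in quarter-point steps.
possible_ratings = [5.0 - 0.25 * i for i in range(17)]

# Pair each rating with a letter: 5.00 -> 'A', 4.75 -> 'B', ..., 1.00 -> 'Q'.
# zip stops after 17 pairs because possible_ratings is the shorter input.
grade_for = dict(zip(possible_ratings, string.ascii_uppercase))

# Hypothetical ratings standing in for df['Rating'].
ratings = pd.Series([5.0, 3.75, 2.0, 1.0])

# Category order runs from worst ('Q') to best ('A').
grade_order = [grade_for[r] for r in reversed(possible_ratings)]
grades = pd.Categorical(
    ratings.map(grade_for), categories=grade_order, ordered=True)

print(list(grades))
```

Unlike the threshold-based function, a dictionary lookup maps any rating that is not exactly on the quarter-point grid to a missing value rather than assigning it to the nearest lower grade. Whether that is preferable depends on how strictly you want to validate the input.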

## Onward

We've now explored every column in our dataset. We have filled in missing values and repaired obviously bad data.

As you can imagine, you can spend a nearly unlimited amount of time trying to get a dataset into shape for analysis and modeling. It is common to hear that 60% to 80% of a data scientist's time is spent working on the data before it is fed to a model!

In this lab, we only tried to get the data into the state that it was intended to be in. Once we get into modeling, we will learn even more data manipulation techniques that need to be used in order to get models to train well on the data.

But we aren't quite ready for model building yet. There is still more Exploratory Data Analysis (EDA) to do. In part 2 of this unit, we will look more closely at the relationships between the columns.