# Scraping With BeautifulSoup for Chocolate Review Analysis

In this project we will scrape a Codecademy web page that contains information about chocolate, then analyze the data to answer the following questions:

- Where are the best cacao beans grown?
- Which countries produce the highest-rated bars?
- What’s the relationship between cocoa solids percentage and rating?

Because we are scraping the data from a website, it is not ready for analysis until we clean and tidy it.

## Structure of the data

The data is laid out in a table, and the cells of each column are labelled with a class named after that column. The table has the following format:

| Company (Maker-if known) | Specific Bean Origin or Bar Name | REF | Review Date | Cocoa Percent | Company Location | Rating | Bean Type | Broad Bean Origin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A. Morin | Sur del Lago | 1315 | 2014 | 70% | France | 3.5 | Criollo | Venezuela |
| Adi | Vanua Levu | 705 | 2011 | 60% | Fiji | 2.75 | Trinitario | Fiji |
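
Because every cell carries its column name as a class, a whole column can be collected with a single `find_all` call. The sketch below runs on a small hand-written HTML fragment, which is an assumption that only imitates the layout described above (the real page's markup may differ in detail). The helper `column_values` is a hypothetical name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the page: each cell has its column name as class.
SAMPLE_HTML = """
<table>
  <tr><td class="Company">Company (Maker-if known)</td><td class="Rating">Rating</td></tr>
  <tr><td class="Company">A. Morin</td><td class="Rating">3.5</td></tr>
  <tr><td class="Company">Adi</td><td class="Rating">2.75</td></tr>
</table>
"""


def column_values(soup, colname):
    """Return the text of every element with class `colname`, minus the header cell."""
    cells = soup.find_all(attrs={"class": colname})
    # The first match is the header cell, so it is dropped with [1:].
    return [cell.get_text().strip() for cell in cells][1:]


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
companies = column_values(sample_soup, "Company")
ratings = column_values(sample_soup, "Rating")

print(companies)
print(ratings)
```

Note that the extracted ratings are still strings at this point; converting them to numbers is a separate cleaning step.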

## Libraries

For this project we need several Python libraries: *requests* to request the page from the website, *bs4* to extract the data from the response, *pandas* for data cleaning, and *matplotlib* for visualizing the data.

In [2]:
# Importing necessary libraries

import requests
from bs4 import BeautifulSoup
import pandas as pd
from matplotlib import pyplot as plt

In [23]:
# Requesting the data 

URL = "https://content.codecademy.com/courses/beautifulsoup/cacao/index.html"
COLUMNS = ["Company", "Origin", "REF", "ReviewDate", "CocoaPercent", "CompanyLocation",
"Rating", "BeanType", "BroadBeanOrigin"]

response = requests.get(URL)
soup = BeautifulSoup(response.content, "html.parser")

In [42]:
# Extract the data and create the DataFrame

data_dict = {}

# Each cell has its column name as class; the first match is the header cell, so skip it
for colname in COLUMNS:
    data_dict[colname] = [element.get_text() for element in soup.find_all(attrs={"class": colname})][1:]

choco_df = pd.DataFrame(data_dict)

choco_df.head(10)

|  | Company | Origin | REF | ReviewDate | CocoaPercent | CompanyLocation | Rating | BeanType | BroadBeanOrigin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63% | France | 3.75 |  | Sao Tome |
| 1 | A. Morin | Kpime | 1676 | 2015 | 70% | France | 2.75 |  | Togo |
| 2 | A. Morin | Atsane | 1676 | 2015 | 70% | France | 3.0 |  | Togo |
| 3 | A. Morin | Akata | 1680 | 2015 | 70% | France | 3.5 |  | Togo |
| 4 | A. Morin | Quilla | 1704 | 2015 | 70% | France | 3.5 |  | Peru |
| 5 | A. Morin | Carenero | 1315 | 2014 | 70% | France | 2.75 | Criollo | Venezuela |
| 6 | A. Morin | Cuba | 1315 | 2014 | 70% | France | 3.5 |  | Cuba |
| 7 | A. Morin | Sur del Lago | 1315 | 2014 | 70% | France | 3.5 | Criollo | Venezuela |
| 8 | A. Morin | Puerto Cabello | 1319 | 2014 | 70% | France | 3.75 | Criollo | Venezuela |
| 9 | A. Morin | Pablino | 1319 | 2014 | 70% | France | 4.0 |  | Peru |


## Data Cleaning

Now that we have the data in a DataFrame, we will clean it and make it ready for analysis.

The first step is to identify the qualitative and quantitative columns, then make the necessary type conversions. Because the data came from scraping a website, every column holds string values (dtype `object`). Luckily, pandas exposes string methods through the `.str` accessor, which we can use to clean the data.
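
As a minimal sketch of that idea, the example below cleans a few made-up values (not taken from the scraped page) shaped like raw scraped text: padded with whitespace and, for the percentage, ending in `%`. The names `raw` and `cleaned` are hypothetical.

```python
import pandas as pd

# Made-up raw values, shaped like scraped text (padding and a trailing "%").
raw = pd.DataFrame({
    "CocoaPercent": [" 63%", "70% ", "72.5%"],
    "Rating": ["3.75 ", " 2.75", "4"],
})

cleaned = pd.DataFrame({
    # Strip the padding first, then the percent sign, then parse as a number.
    "CocoaPercent": pd.to_numeric(raw["CocoaPercent"].str.strip().str.rstrip("%")),
    # Ratings only need trimming before the conversion to float.
    "Rating": raw["Rating"].str.strip().astype(float),
})

print(cleaned.dtypes)
```

Stripping whitespace before removing the `%` matters: with a value such as `"70% "`, calling `rstrip("%")` first would leave the sign in place, because the last character is a space.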

In [43]:
# Data Types
choco_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Company          1795 non-null   object
 1   Origin           1795 non-null   object
 2   REF              1795 non-null   object
 3   ReviewDate       1795 non-null   object
 4   CocoaPercent     1795 non-null   object
 5   CompanyLocation  1795 non-null   object
 6   Rating           1795 non-null   object
 7   BeanType         1795 non-null   object
 8   BroadBeanOrigin  1795 non-null   object
dtypes: object(9)
memory usage: 126.3+ KB


In [44]:
# Trimming white space

for colname in COLUMNS:
    choco_df[colname] = choco_df[colname].str.strip()

# REF, ReviewDate, CocoaPercent, and Rating columns should be numerical

choco_df['REF'] = choco_df['REF'].astype(int)
choco_df['ReviewDate'] = choco_df['ReviewDate'].astype(int)
choco_df['CocoaPercent'] = pd.to_numeric(choco_df['CocoaPercent'].str.rstrip("%"))
choco_df['Rating'] = choco_df['Rating'].astype(float)

choco_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1795 entries, 0 to 1794
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Company          1795 non-null   object 
 1   Origin           1795 non-null   object 
 2   REF              1795 non-null   int64  
 3   ReviewDate       1795 non-null   int64  
 4   CocoaPercent     1795 non-null   float64
 5   CompanyLocation  1795 non-null   object 
 6   Rating           1795 non-null   float64
 7   BeanType         1795 non-null   object 
 8   BroadBeanOrigin  1795 non-null   object 
dtypes: float64(2), int64(2), object(5)
memory usage: 126.3+ KB


In [46]:
# Some BeanType values are missing, so we replace them with "Unknown"

choco_df['BeanType'] = choco_df['BeanType'].replace("", "Unknown")
choco_df.head(10)

|  | Company | Origin | REF | ReviewDate | CocoaPercent | CompanyLocation | Rating | BeanType | BroadBeanOrigin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A. Morin | Agua Grande | 1876 | 2016 | 63.0 | France | 3.75 | Unknown | Sao Tome |
| 1 | A. Morin | Kpime | 1676 | 2015 | 70.0 | France | 2.75 | Unknown | Togo |
| 2 | A. Morin | Atsane | 1676 | 2015 | 70.0 | France | 3.0 | Unknown | Togo |
| 3 | A. Morin | Akata | 1680 | 2015 | 70.0 | France | 3.5 | Unknown | Togo |
| 4 | A. Morin | Quilla | 1704 | 2015 | 70.0 | France | 3.5 | Unknown | Peru |
| 5 | A. Morin | Carenero | 1315 | 2014 | 70.0 | France | 2.75 | Criollo | Venezuela |
| 6 | A. Morin | Cuba | 1315 | 2014 | 70.0 | France | 3.5 | Unknown | Cuba |
| 7 | A. Morin | Sur del Lago | 1315 | 2014 | 70.0 | France | 3.5 | Criollo | Venezuela |
| 8 | A. Morin | Puerto Cabello | 1319 | 2014 | 70.0 | France | 3.75 | Criollo | Venezuela |
| 9 | A. Morin | Pablino | 1319 | 2014 | 70.0 | France | 4.0 | Unknown | Peru |
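
Before replacing the empty strings it can be useful to know how many there are. The sketch below shows one way to count and then fill them, using a short made-up Series rather than the scraped column; `bean_type`, `missing_count`, and `filled` are hypothetical names.

```python
import pandas as pd

# Made-up stand-in for the BeanType column after whitespace trimming.
bean_type = pd.Series(["", "Criollo", "", "Trinitario", "Criollo"])

# Count the empty strings, which mark a missing bean type.
missing_count = int((bean_type == "").sum())

# Replace only exact empty strings; other values are left untouched.
filled = bean_type.replace("", "Unknown")
counts = filled.value_counts()

print(missing_count)
print(counts)
```

This only catches values that are exactly empty, which is why the whitespace trimming has to happen first.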
