# Web Scraping using Python
## Largest chemical producers worldwide
_by Virginia Herrero_

The **overall** goal of this project is to scrape the list of the largest chemical producers by sales in 2021 from Wikipedia, and then clean, explore, analyse, and visualize the data obtained.

**Section 01: Web scraping**
* Obtain the list of largest chemical producers worldwide from Wikipedia.

**Section 02: Data cleaning**
* Clean the data scraped from Wikipedia.
* Rename columns, change data types, convert currency.

**Section 03: Data exploration and analysis**
* Null values
* Outliers

**Section 04: Data visualization**


## Section 01: Web scraping

In [1]:
# Import all required libraries

from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Specify the url
url = r"https://en.wikipedia.org/wiki/List_of_largest_chemical_producers"

In [3]:
# Send a GET request to the url
page = requests.get(url)
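
The request above is sent without a timeout and without checking that it succeeded. A minimal sketch of a more defensive fetch could look like the following. The helper name `fetch_html`, its `get` parameter, the 10-second timeout, and the User-Agent string are all illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the HTML of a page, raising an error if the request failed."""
    # Hypothetical User-Agent string; some sites reject requests without one
    headers = {"User-Agent": "chemical-producers-notebook/0.1 (educational example)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the string returned by `fetch_html(url)` could be passed to BeautifulSoup in place of `page.text`.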

In [4]:
# Parse the HTML content using BeautifulSoup and Python's built-in parser
soup = BeautifulSoup(page.text, "html.parser")

In [5]:
# Extract the table containing the data needed
table = soup.find("table")
table

<table class="wikitable sortable">
<tbody><tr>
<th>Rank<sup class="reference" id="cite_ref-report2021_4-0"><a href="#cite_note-report2021-4"><span class="cite-bracket">[</span>4<span class="cite-bracket">]</span></a></sup>
</th>
<th style="width:150px;">Company
</th>
<th>Chemical sales in 2021<br/><small>USD millions<style data-mw-deduplicate="TemplateStyles:r1041539562">.mw-parser-output .citation{word-wrap:break-word}.mw-parser-output .citation:target{background-color:rgba(0,127,255,0.133)}</style><sup class="citation nobold" id="ref_note01^"><a href="#endnote_note01^">[A]</a></sup></small>
</th>
<th>Change from 2020<br/><small>in percent</small>
</th>
<th>Headquarters
</th></tr>
<tr>
<td>1
</td>
<td><a href="/wiki/BASF" title="BASF">BASF</a>
</td>
<td>92,982
<p><br/>
</p><p><br/>
</p>
</td>
<td><span typeof="mw:File"><span title="Increase"><img alt="Increase" class="mw-file-element" data-file-height="300" data-file-width="300" decoding="async" height="11" src="//upload.wikimedia.org
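
`soup.find("table")` returns the first table on the page. As the output shows, that happens to be the data table here, but the result would change if another table appeared earlier in the page. A hedged sketch of selecting the table by its `wikitable` class instead, using a small made-up HTML sample (the `infobox` table is hypothetical):

```python
from bs4 import BeautifulSoup

# Made-up sample: a hypothetical leading table followed by the data table
sample_html = """
<table class="infobox"><tr><td>Not the data</td></tr></table>
<table class="wikitable sortable">
<tr><th>Rank</th><th>Company</th></tr>
<tr><td>1</td><td>BASF</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find("table") alone picks whichever table comes first in the document
first_table = sample_soup.find("table")

# Filtering on the CSS class targets the data table regardless of its position
data_table = sample_soup.find("table", class_="wikitable")
```

Here `first_table` is the hypothetical `infobox` table, while `data_table` is the one carrying the `wikitable` class.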

In [6]:
# Extract the titles of the table
titles = table.find_all("th")
titles

[<th>Rank<sup class="reference" id="cite_ref-report2021_4-0"><a href="#cite_note-report2021-4"><span class="cite-bracket">[</span>4<span class="cite-bracket">]</span></a></sup>
 </th>,
 <th style="width:150px;">Company
 </th>,
 <th>Chemical sales in 2021<br/><small>USD millions<style data-mw-deduplicate="TemplateStyles:r1041539562">.mw-parser-output .citation{word-wrap:break-word}.mw-parser-output .citation:target{background-color:rgba(0,127,255,0.133)}</style><sup class="citation nobold" id="ref_note01^"><a href="#endnote_note01^">[A]</a></sup></small>
 </th>,
 <th>Change from 2020<br/><small>in percent</small>
 </th>,
 <th>Headquarters
 </th>]

In [7]:
# Keep only the text of each header cell, stripped of surrounding whitespace
titles_table = [title.text.strip() for title in titles]
titles_table

['Rank[4]',
 'Company',
 'Chemical sales in 2021USD millions[A]',
 'Change from 2020in percent',
 'Headquarters']
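
The scraped titles still carry footnote markers (`[4]`, `[A]`), and the text on either side of a `<br/>` is fused together (`2021USD`, `2020in`). One possible way to get cleaner names at extraction time is sketched below on a simplified copy of the header cells. The sample HTML and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Simplified copy of the header row (attributes and styles omitted)
header_html = """
<table><tr>
<th>Rank<sup class="reference"><a href="#cite_note">[4]</a></sup></th>
<th>Company</th>
<th>Chemical sales in 2021<br/><small>USD millions<sup>[A]</sup></small></th>
<th>Change from 2020<br/><small>in percent</small></th>
<th>Headquarters</th>
</tr></table>
"""

header_soup = BeautifulSoup(header_html, "html.parser")

clean_titles = []
for th in header_soup.find_all("th"):
    # Drop footnote markers, which live inside <sup> elements
    for sup in th.find_all("sup"):
        sup.decompose()
    # Join the remaining text fragments with a space so that text
    # separated by <br/> is not fused together
    clean_titles.append(th.get_text(" ", strip=True))
```

On this sample, `clean_titles` holds `Rank`, `Company`, `Chemical sales in 2021 USD millions`, `Change from 2020 in percent`, and `Headquarters`.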

In [8]:
# Create a DataFrame with the scraped titles
df = pd.DataFrame(columns = titles_table)
df

| | Rank[4] | Company | Chemical sales in 2021USD millions[A] | Change from 2020in percent | Headquarters |
|---|---|---|---|---|---|


In [9]:
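# Fill the DataFrame with the scraped values
# Note: despite its name, column_data holds the table rows (<tr> elements)
column_data = table.find_all("tr")

# Skip the first row, which contains the headers
for row in column_data[1:]:
    row_data = row.find_all("td")
    individual_row_data = [data.text.strip() for data in row_data]

    # Append the row at the next free index position
    length = len(df)
    df.loc[length] = individual_row_data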
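
Appending with `df.loc` raises an error if a row has a different number of cells than the DataFrame has columns. An alternative sketch, shown on a small made-up table, collects the rows in a list first, skips any row whose cell count does not match, and builds the DataFrame in a single step. The sample HTML (including the single-cell row) and the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample with one malformed row to show the length check
sample_html = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Company</th><th>Headquarters</th></tr>
<tr><td>1</td><td>BASF</td><td>Germany, Ludwigshafen am Rhein</td></tr>
<tr><td>2</td><td>Sinopec</td><td>China, Beijing</td></tr>
<tr><td>Row with a single cell</td></tr>
</table>
"""

sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
columns = [th.get_text(strip=True) for th in sample_table.find_all("th")]

rows = []
for tr in sample_table.find_all("tr")[1:]:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows whose cell count matches the number of columns
    if len(cells) == len(columns):
        rows.append(cells)

# Build the DataFrame once from the collected rows
sample_df = pd.DataFrame(rows, columns=columns)
```

Building the DataFrame once also avoids growing it row by row, which matters little for a ten-row table but scales better for larger ones.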

The DataFrame now contains all the data scraped from the Wikipedia page:

In [10]:
df

| | Rank[4] | Company | Chemical sales in 2021USD millions[A] | Change from 2020in percent | Headquarters |
|---|---|---|---|---|---|
| 0 | 1 | BASF | 92982 | 32.9% | Germany, Ludwigshafen am Rhein |
| 1 | 2 | Sinopec | 65848 | 31.9% | China, Beijing |
| 2 | 3 | Dow | 54968 | 42.6% | United States, Midland, Michigan |
| 3 | 4 | SABIC | 43230 | 50.1% | Saudi Arabia, Riyadh |
| 4 | 5 | Formosa Plastics | 43173 | 47.8% | Taiwan, Taipei |
| 5 | 6 | Ineos | 39937 | 121% | United Kingdom, London |
| 6 | 7 | Petrochina | 39693 | 41.7% | China, Beijing |
| 7 | 8 | LyondellBasell Industries | 38995 | 66.6% | United States, Houston, Texas |
| 8 | 9 | LG Chem | 37257 | 41.8% | South Korea, Seoul |
| 9 | 10 | ExxonMobil | 36858 | 59.6% | United States, Spring, Texas |
