# Chemical Hazards Web Scraping

## 01. Importing Libraries

In [1]:
import pandas as pd



## 02. Data Load

In [9]:
cas = pd.read_excel("reg-substances-export.xlsx")

In [11]:
cas = cas[cas['Cas Number'] != "-"]
cas.head()

Unnamed: 0,Name,EC / List Number,Cas Number,ID,Registration Status,Registration Type,Submission Type,Total tonnage Band,Tonnage Band Min,Tonnage Band Max,Last Updated,Factsheet URL,Substance Information Page
1,"''amyl nitrite'', mixed isomers",203-770-8,110-46-3,100.003.429,Active,FULL,JOINT_SUBMISSION,≥ 10 to < 100 tonnes,10.0,100.0,26-07-2022,https://echa.europa.eu/registration-dossier/-/...,https://echa.europa.eu/substance-information/-...
2,"''amyl nitrite'', mixed isomers",203-770-8,110-46-3,100.003.429,Active,INTERMEDIATE,JOINT_SUBMISSION,Intermediate use only,,,10-12-2019,https://echa.europa.eu/registration-dossier/-/...,https://echa.europa.eu/substance-information/-...
4,"(((2,2-dimethylbut-3-yn-1-yl)oxy)methyl)benzene",811-585-4,1092536-54-3,100.242.210,Active,INTERMEDIATE,JOINT_SUBMISSION,Intermediate use only,,,12-05-2017,https://echa.europa.eu/registration-dossier/-/...,https://echa.europa.eu/substance-information/-...
6,"((2R,3R,4R,5R)-3-(benzoyloxy)-5-(2,4-dioxo-3,4...",807-766-2,863329-65-1,100.235.582,Cease Manufacture,INTERMEDIATE,INDIVIDUAL_SUBMISSION,Cease manufacture,,,01-02-2016,https://echa.europa.eu/registration-dossier/-/...,https://echa.europa.eu/substance-information/-...
7,"((2R,3R,4R,5R)-3-(benzoyloxy)-5-(2,4-dioxo-3,4...",807-766-2,863329-65-1,100.235.582,Active,INTERMEDIATE,JOINT_SUBMISSION,Intermediate use only,,,07-03-2017,https://echa.europa.eu/registration-dossier/-/...,https://echa.europa.eu/substance-information/-...


In [13]:
cas.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19943 entries, 1 to 26864
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Name                        19943 non-null  object 
 1   EC / List Number            19943 non-null  object 
 2   Cas Number                  19943 non-null  object 
 3   ID                          19943 non-null  object 
 4   Registration Status         19943 non-null  object 
 5   Registration Type           19942 non-null  object 
 6   Submission Type             19943 non-null  object 
 7   Total tonnage Band          19943 non-null  object 
 8   Tonnage Band Min            9171 non-null   float64
 9   Tonnage Band Max            8781 non-null   float64
 10  Last Updated                19943 non-null  object 
 11  Factsheet URL               19943 non-null  object 
 12  Substance Information Page  19943 non-null  object 
dtypes: float64(2), object(11)
memory usa

## 03. Data cleaning

In [15]:
cas_df = cas[['Name','Cas Number']]
cas_df = cas_df.drop_duplicates()
cas_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 17483 entries, 1 to 26863
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Name        17483 non-null  object
 1   Cas Number  17483 non-null  object
dtypes: object(2)
memory usage: 409.8+ KB
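
Filtering out `"-"` removes the obvious placeholders, but other malformed values may remain in `Cas Number`. The sketch below checks the format and the check digit of a CAS number (each digit before the check digit is weighted 1, 2, 3, ... from the right, and the sum modulo 10 must equal the check digit). The helper name `is_valid_cas` is hypothetical and not part of the original notebook.

```python
import re

# CAS Registry Numbers look like "110-46-3":
# 2 to 7 digits, then 2 digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(value) -> bool:
    """Return True if value is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(str(value).strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit
```

Applied to the frame above, this could be used as `cas_df[cas_df['Cas Number'].map(is_valid_cas)]`.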


## 04. Scraping

In [29]:
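import requests
from bs4 import BeautifulSoup

# Fetch the HTML that PubChem serves for the Formaldehyde compound page.
url = "https://pubchem.ncbi.nlm.nih.gov/compound/Formaldehyde"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# x =  soup.find("div", class_="sm:table w-full")

# Dump the parsed document to inspect what the server actually returned.
print(soup)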

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="index,follow,noarchive" name="robots"/>
<title>Formaldehyde | H2CO | CID 712 - PubChem</title>
<script type="text/javascript">
      window.ncbi_startTime = new Date ()
      
    </script>
<script type="application/ld+json">
      {
          "@context": "https://schema.org",
          "@type": "Organization",
          "name": "PubChem",
          "url": "https://pubchem.ncbi.nlm.nih.gov",
          "logo": "https://pubchem.ncbi.nlm.nih.gov/pcfe/logo/PubChem_logo.png",
          "foundingDate": "2004"
      }
      
    </script>
<link href="/pcfe/favicon/apple-touch-icon.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/pcfe/favicon/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>
<link href="/pcfe/favicon/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>
<link href="/pcfe/favicon/favicon.ico" rel="shortcut icon"/>
<link href="https://www.ncbi.nlm.nih.gov" rel="preconn
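
Before writing selectors such as `class_="sm:table w-full"`, it can help to check whether the raw response contains the target classes at all. The following is a minimal sketch using only the standard library; `ClassCounter` and `count_class_matches` are hypothetical names. It counts a tag when its `class` attribute contains every requested class, which is slightly looser than BeautifulSoup's exact-string match for a multi-class `class_` value.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains every class in `wanted`."""

    def __init__(self, wanted: str):
        super().__init__()
        self.wanted = set(wanted.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted and self.wanted <= classes:
            self.count += 1


def count_class_matches(html: str, wanted: str) -> int:
    """Return how many tags in `html` carry all classes listed in `wanted`."""
    parser = ClassCounter(wanted)
    parser.feed(html)
    return parser.count
```

For example, `count_class_matches(response.text, "sm:table w-full")` returning `0` would indicate that the element is not present in the HTML received by `requests`, so no BeautifulSoup selector can find it.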

In [23]:
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pubchem.ncbi.nlm.nih.gov/compound/Formaldehyde", headers={"Accept-Language":"en-US"})
soup = BeautifulSoup(response.content, "html.parser")

# Collect the first list item of every <div class="space-y-1"> block.
# The selector matches nothing in the HTML returned by requests, so the list stays empty.
entries = []
for block in soup.find_all("div", class_="space-y-1"):
    first_item = block.select_one("ul li")
    if first_item is None:
        continue  # skip blocks that contain no list
    entries.append({'title': first_item.get_text(strip=True)})

print(entries[0:2])

[]
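
The empty list above suggests that the hazard data is not in the HTML that `requests` receives. PubChem also serves compound records as JSON through its PUG View REST service, which may be a better source than the rendered page. The sketch below uses CID 712 (taken from the page title printed earlier) and only the standard library. The JSON layout it assumes (entries with a `Name` of `"GHS Hazard Statements"` and text under `Value` → `StringWithMarkup` → `String`) is an assumption that should be checked against a real response. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound"


def ghs_url(cid: int) -> str:
    """Build a PUG View URL restricted to the 'GHS Classification' heading."""
    return f"{PUG_VIEW}/{cid}/JSON?heading={quote('GHS Classification')}"


def fetch_record(cid: int, timeout: float = 30.0) -> dict:
    """Download and decode the JSON record (performs a network request)."""
    with urlopen(ghs_url(cid), timeout=timeout) as response:
        return json.load(response)


def hazard_statements(node) -> list:
    """Recursively collect strings from 'GHS Hazard Statements' entries.

    Assumes the layout described above; returns an empty list otherwise.
    """
    found = []
    if isinstance(node, dict):
        if node.get("Name") == "GHS Hazard Statements":
            markup = node.get("Value", {}).get("StringWithMarkup", [])
            found.extend(item["String"] for item in markup if "String" in item)
        for value in node.values():
            found.extend(hazard_statements(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(hazard_statements(item))
    return found
```

A call such as `hazard_statements(fetch_record(712))` would then return the statements as plain strings, provided the response matches the assumed layout.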


In [24]:
response = requests.get("https://www.google.com/search?sca_esv=e6608f0c11c7d82b&sca_upv=1&sxsrf=ACQVn08uzZu-LIqRnkRM-hZ-G_duJR0eHA:1714609290365&q=tatuagem+old+school&uds=AMwkrPtH4R_IcK4JzT8HHqNW5j-mekCTXylnC0Guylnd4CA3nXoqazORvzxDEosDvviehTpCGCF-ONAtX_VDTZHeW1fBvD-tzB-mx0mG0RZIFOyZdl5T1EAVQ4d93UDA_nNiqLvj4JMhaa5aJYKVgOlTBLo13k9QEuHhzaB7RVsuYrHASW-nLkN3JMJPsOEUqu55HrnQN-V7zl6XWj9t5hXulaqI1uzbMMIEqc5P8qD0TRNrL8jEFfAHVklD3_QUsnCkgAC3668c4dTaDGmvm_yuoUKxBLyEO3nq4oEyu-3nm9xjDqkBWdqd7VquSv8ImEXl17zMWLsQ&udm=2&prmd=isvnmbtz&sa=X&ved=2ahUKEwjO1rit2e2FAxVAkJUCHXl5BtEQtKgLegQIFBAB&biw=1920&bih=911&dpr=1", headers={"Accept-Language":"en-US"})
soup = BeautifulSoup(response.content, "html.parser")

# Search for result containers by class name; nothing matches in this response.
y = soup.find_all("div", class_="toI8Rb OSrXXb")
y

[]
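
Once a single lookup works, the same call has to be repeated for every row of `cas_df`. The sketch below is one possible way to do that politely: it pauses between requests and records failures instead of stopping. PubChem's usage policy, as far as I know, asks for no more than five requests per second, so the default interval of 0.25 s is an assumption chosen to stay under that. `lookup_all` and its `fetch` argument are hypothetical; `fetch` stands for whatever per-substance lookup function ends up being used.

```python
import time
from typing import Callable, Dict, Iterable


def lookup_all(
    keys: Iterable[str],
    fetch: Callable[[str], object],
    min_interval: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, object]:
    """Call fetch(key) for each key, pausing min_interval seconds after each call.

    Failures are stored as the exception object so one bad key does not
    abort the whole run.
    """
    results: Dict[str, object] = {}
    for key in keys:
        try:
            results[key] = fetch(key)
        except Exception as error:
            results[key] = error
        sleep(min_interval)
    return results
```

With the frame from the cleaning step, the keys could be supplied as `cas_df['Cas Number'].tolist()`. Passing `sleep` as a parameter keeps the function easy to test without real delays.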