<a href="https://colab.research.google.com/github/mikaelaraujo/seotools/blob/main/Structured_data_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Author: Mikael Araújo
# Version: 1.0
# Extracts metadata from the structured data (<script type="application/ld+json">) of a URL.
# The labels extracted are: headline, articleBody, keywords, and the image URL.
# The code also works around HTTP 403 responses by sending a browser-like User-Agent header.
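
The 403 workaround mentioned above amounts to sending headers that look like a regular browser. A minimal sketch, assuming the block is triggered by the default `python-requests` User-Agent; `get_with_browser_fallback` and `BROWSER_HEADERS` are hypothetical names that are not part of this notebook, and some sites will still refuse the request:

```python
# Hypothetical helper: retry once with browser-like headers when the server answers 403.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def get_with_browser_fallback(url, session=None, timeout=10):
    """Fetch url; if the first attempt returns 403, retry with BROWSER_HEADERS."""
    if session is None:
        import requests  # imported lazily so a custom session can be injected
        session = requests.Session()

    response = session.get(url, timeout=timeout)
    if response.status_code == 403:
        # Assumption: the server rejects the default python-requests User-Agent.
        response = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()  # raises for any remaining 4xx/5xx response
    return response
```

Passing a `session` in makes the retry logic easy to exercise without touching the network.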

In [None]:
import requests
import json
from bs4 import BeautifulSoup

In [None]:
def extract_data_from_url(url):
  """
  Extracts 'headline', 'articleBody', 'keywords', and 'image_url' from the structured data
  (<script type="application/ld+json">) of a URL.

  Args:
    url: The URL of the web page.

  Returns:
    A dictionary with the extracted data, or None if none is found.
  """
  try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers, timeout=10)  # timeout avoids hanging forever
    response.raise_for_status()  # Raises an exception for 4xx/5xx status codes

    soup = BeautifulSoup(response.content, 'html.parser')
    script_tags = soup.find_all('script', type="application/ld+json")

    for script_tag in script_tags:
      try:
        # .string is None for empty tags; '' makes json.loads raise JSONDecodeError
        json_data = json.loads(script_tag.string or '')
      except json.JSONDecodeError:
        continue  # Skip blocks that are empty or not valid JSON

      # A JSON-LD block may hold a single object or a list of objects
      items = json_data if isinstance(json_data, list) else [json_data]
      for item in items:
        if isinstance(item, dict) and all(k in item for k in ('headline', 'articleBody', 'keywords', 'image')):
          image = item['image']
          return {
              'headline': item['headline'],
              'articleBody': item['articleBody'],
              'keywords': item['keywords'],
              # 'image' may be an object with a 'url' key; otherwise return it as-is (e.g. a plain URL string)
              'image_url': image.get('url') if isinstance(image, dict) else image
          }

    return None  # No matching structured data found

  except requests.exceptions.RequestException as e:
    print(f"Error accessing the URL: {e}")
    return None
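
Real pages vary in how they shape JSON-LD: many publishers nest their objects under an `@graph` key, and schema.org allows `image` to be a URL string, an `ImageObject`, or a list of either. The sketch below shows one way to handle those shapes. `iter_jsonld_nodes` and `image_url_from` are hypothetical helpers, not part of the function above, and the `sample` data is made up for illustration.

```python
def iter_jsonld_nodes(data):
    """Yield every dict in a parsed JSON-LD value, descending into lists and '@graph'."""
    if isinstance(data, list):
        for entry in data:
            yield from iter_jsonld_nodes(entry)
    elif isinstance(data, dict):
        yield data
        yield from iter_jsonld_nodes(data.get("@graph", []))


def image_url_from(image):
    """Return the first URL found in an 'image' value, or None."""
    if isinstance(image, str):
        return image
    if isinstance(image, dict):
        return image.get("url")
    if isinstance(image, list):
        for entry in image:
            url = image_url_from(entry)
            if url:
                return url
    return None


# Made-up sample: an article nested under '@graph' with a list-valued image.
sample = {
    "@graph": [
        {"@type": "WebSite"},
        {
            "@type": "NewsArticle",
            "headline": "Example",
            "image": [{"url": "https://example.com/a.png"}],
        },
    ]
}
articles = [node for node in iter_jsonld_nodes(sample) if "headline" in node]
print(image_url_from(articles[0]["image"]))
```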

In [None]:
# Usage example
url = "https://coingape.com/heres-why-us-spot-ethereum-etf-see-largest-79m-in-outflows-since-july/"
data = extract_data_from_url(url)

if data:
  print(data)
else:
  print("Data not found.")

{'headline': 'Here&#8217;s Why US Spot Ethereum ETFs See Largest $79M in Outflows Since July?', 'articleBody': 'Unlike the spot Bitcoin ETFs in the United States seeing renewed demand, the spot Ethereum ETFs have seen waning interest in recent times. On Monday, these Ether ETFs saw the largest single outflows of $79 million since the launch in July. This shows that the Ethereum investment products have failed to garner enough investment participation and institutional attention.\r\nGrayscale Leads Spot Ethereum ETF Outflows\r\nO Monday, the total outflows from the spot Ethereum ETF stood at a staggering $79.3 million, the highest since July this year. Grayscale\'s ETHE played a major spoilsport yesterday with more than $80.8 million in outflows. Of the other market players, only Bitwise Ether ETF (ETHW) saw negligent inflows of $1.3 million. All other Ether ETFs saw zero inflows yesterday.\r\n\r\nInflows into US spot Ethereum ETFs have significantly dried down in recent times. In the p
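
The headline in the output above still contains the HTML entity `&#8217;`, because the JSON-LD values are returned exactly as the site published them. One way to decode it is the standard library's `html.unescape`; the `clean_text` helper name is hypothetical:

```python
import html


def clean_text(value):
    """Decode HTML entities in a string; return any other value unchanged."""
    return html.unescape(value) if isinstance(value, str) else value


headline = "Here&#8217;s Why US Spot Ethereum ETFs See Largest $79M in Outflows Since July?"
print(clean_text(headline))
```

The same helper could be applied to `articleBody` before display, if entities show up there as well.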

In [None]:
# prompt: Create a code snippet that displays the extracted labels separately

if data:
  print("Original article URL:\n", url)
  print("\n")
  print("Headline: \n", data['headline'])
  print("\n")
  print("Keywords: \n", data['keywords'])
  print("\n")
  print("Article Body:\n\n", data['articleBody'], "\n")
  print("Image URL: \n", data['image_url'])

else:
  print("Data not found.")

Original article URL:
 https://coingape.com/heres-why-us-spot-ethereum-etf-see-largest-79m-in-outflows-since-july/


Headline: 
 Here&#8217;s Why US Spot Ethereum ETFs See Largest $79M in Outflows Since July?


Keywords: 
 ETH price, ETHE, Ethereum (ETH), Grayscale Ethereum ETF, Spot Ethereum ETF


Article Body:

 Unlike the spot Bitcoin ETFs in the United States seeing renewed demand, the spot Ethereum ETFs have seen waning interest in recent times. On Monday, these Ether ETFs saw the largest single outflows of $79 million since the launch in July. This shows that the Ethereum investment products have failed to garner enough investment participation and institutional attention.
Grayscale Leads Spot Ethereum ETF Outflows
O Monday, the total outflows from the spot Ethereum ETF stood at a staggering $79.3 million, the highest since July this year. Grayscale's ETHE played a major spoilsport yesterday with more than $80.8 million in outflows. Of the other market players, only Bitwise Eth
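
The `keywords` value printed above is a single comma-separated string. If a list is more convenient downstream, it can be split. This is a sketch that assumes individual keywords never contain commas; `split_keywords` is a hypothetical helper, and it also accepts a list because schema.org permits either form:

```python
def split_keywords(keywords):
    """Turn a keywords value into a list of trimmed terms."""
    if isinstance(keywords, list):
        return [str(k).strip() for k in keywords]
    if not isinstance(keywords, str):
        return []
    return [part.strip() for part in keywords.split(",") if part.strip()]


print(split_keywords("ETH price, ETHE, Ethereum (ETH), Grayscale Ethereum ETF, Spot Ethereum ETF"))
```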