### Extracting links from a website

Example source: https://www.thepythoncode.com/article/extract-all-website-links-python

Version that saves to .txt: https://www.thepythoncode.com/code/extract-all-website-links-python

Dependencies: pip3 install requests bs4 colorama

We use requests to make HTTP requests conveniently, BeautifulSoup (bs4) to parse HTML, and colorama to change the text color.

We need two global variables, one for all the internal links of the site and another for all the external links:

- Internal links are URLs that point to other pages of the same site.
- External links are URLs that point to other sites.
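
The internal/external split described above can be sketched as a hostname comparison. This is a minimal sketch rather than the notebook's own code: `is_internal` is a hypothetical helper, it expects absolute URLs, and it assumes that only an exact hostname match counts as the same site, so subdomains are treated as external.

```python
from urllib.parse import urlparse


def is_internal(href, base_url):
    """Return True when `href` points to the same host as `base_url`."""
    # .hostname is lowercased and does not include the port
    return urlparse(href).hostname == urlparse(base_url).hostname


base = "https://www.example.com/"
print(is_internal("https://www.example.com/about/", base))      # True
print(is_internal("https://shop.example.com/", base))           # False
print(is_internal("https://other.org/www.example.com/", base))  # False
```

Comparing parsed hostnames, instead of searching for the domain name inside the URL string, keeps a foreign URL that merely mentions the domain in its path from being counted as internal.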

**Functions:**

**is_valid():** Not every link in an anchor tag (a tag) is valid: some point to parts of the site and some are JavaScript, so we write a function to validate URLs. It checks that the URL has both a scheme (the protocol, e.g. http or https) and a domain name. It does not restrict which scheme is used, so URLs such as javascript://... still pass, as the first phishing output in this notebook shows.
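
If only web links are wanted, the validation could be made stricter. The sketch below is one possible variant, not the function used in this notebook: `is_valid_http_url` and `ALLOWED_SCHEMES` are hypothetical names, and the assumption is that only http and https URLs should be crawled.

```python
from urllib.parse import urlparse

# Assumption: only these schemes are worth crawling
ALLOWED_SCHEMES = {"http", "https"}


def is_valid_http_url(url):
    """Return True for URLs with an http(s) scheme and a non-empty host."""
    parsed = urlparse(url)  # urlparse lowercases the scheme
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


print(is_valid_http_url("https://example.com/page"))              # True
print(is_valid_http_url("javascript://document.login.submit()"))  # False
print(is_valid_http_url("/relative/path"))                        # False
```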

**get_all_website_links():** Returns the valid internal URLs found on a web page and records the external ones in the global set.
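
Inside this function each href is made absolute and then stripped of its query string and fragment. That step can be sketched in isolation as follows; `normalize_link` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize_link(page_url, href):
    """Resolve `href` against `page_url` and drop the query and fragment."""
    absolute = urljoin(page_url, href)  # handles relative and absolute hrefs
    parsed = urlparse(absolute)
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"


print(normalize_link("https://example.com/a/b.html", "../c?x=1#top"))
# https://example.com/c
print(normalize_link("https://example.com/", "https://other.org/page#frag"))
# https://other.org/page
```

Dropping the query string means that pages differing only in their GET parameters collapse into a single entry, which keeps the sets small but can hide distinct pages.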

**crawl():** This function crawls the site: it gets all the links on the first page and then calls itself recursively to follow each extracted link. This can cause problems, because the program gets stuck on large sites with many links, such as google.com. For that reason a max_urls parameter was added, which stops the crawl once a given number of URLs has been visited.
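
The same idea can also be written without recursion. The sketch below is an alternative under stated assumptions, not the notebook's implementation: `crawl_iterative` is a hypothetical name, and `get_links` stands for any function that returns the links found on a page (here a small in-memory dictionary replaces real HTTP requests).

```python
from collections import deque


def crawl_iterative(start_url, get_links, max_urls=30):
    """Breadth-first crawl that visits at most `max_urls` pages."""
    visited = []              # pages in the order they were crawled
    seen = {start_url}        # every URL ever queued, to avoid repeats
    queue = deque([start_url])
    while queue and len(visited) < max_urls:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy site used in place of real pages (assumption for illustration)
site = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
print(crawl_iterative("a", lambda u: site.get(u, []), max_urls=3))
# ['a', 'b', 'c']
```

Keeping the visit counter local to the function avoids having to reset a global between runs, and the limit is checked before each page is fetched.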

In [14]:
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

In [25]:
# init the colorama module
colorama.init()

GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET
YELLOW = colorama.Fore.YELLOW

# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()

total_urls_visited = 0

def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
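        # note: this is a substring test, so a foreign URL whose path
        # contains the domain name is treated as internal
        if domain_name not in href:
            # external link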
            if href not in external_urls:
                print(f"{GRAY}[!] External link: {href}{RESET}")
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls

def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    print(f"{YELLOW}[*] Crawling: {url}{RESET}")
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
        
    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
    #print("[+] Total crawled URLs:", max_urls)

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)

In [16]:
# example of a phishing URL
url = "http://vlkote.hop.ru/"
max_urls=30
crawl(url, max_urls)

[*] Crawling: http://vlkote.hop.ru/
[!] External link: http://www.r3.ru/
[!] External link: http://vkontakte.ru/faq.php
[!] External link: javascript://document.location='reg0'
[!] External link: http://vkontakte.ru/login.php
[!] External link: javascript://document.login.submit()
[!] External link: http://www.alexa.com/site/ds/top_sites
[!] External link: http://vkontakte.ru/help.php
[!] External link: http://vkontakte.ru/techsupp.php
[!] External link: http://vkontakte.ru/blog.php
[!] External link: http://vkontakte.ru/
[+] Total Internal links: 0
[+] Total External links: 10
[+] Total URLs: 10


In [22]:
# example of a legitimate URL
url = "https://www.amazon.com.br/"
max_urls=30
crawl(url, max_urls)

[*] Crawling: https://www.amazon.com.br/
[!] External link: https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou
[!] External link: https://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy
[+] Total Internal links: 0
[+] Total External links: 2
[+] Total URLs: 2


In [26]:
# example of a legitimate URL
url = "https://www.magazineluiza.com.br/"
max_urls=30
crawl(url, max_urls)

[*] Crawling: https://www.magazineluiza.com.br/
[*] Internal link: https://www.magazineluiza.com.br/
[!] External link: https://lojas.magazineluiza.com.br/
[!] External link: https://www.parceiromagalu.com.br/
[!] External link: https://especiais.magazineluiza.com.br/regulamentos/
[!] External link: https://especiais.magazineluiza.com.br/acessibilidade/
[!] External link: https://especiais.magazineluiza.com.br/seguranca/
[*] Internal link: https://www.magazineluiza.com.br/central-de-atendimento/
[*] Internal link: https://www.magazineluiza.com.br/acompanhamento/
[*] Internal link: https://www.magazineluiza.com.br/cliente/login/
[*] Internal link: https://www.magazineluiza.com.br/informatica-e-acessorios/l/ia/
[*] Internal link: https://www.magazineluiza.com.br/ar-e-ventilacao/l/ar/
[*] Internal link: https://www.magazineluiza.com.br/artesanato/l/am/
[*] Internal link: https://www.magazineluiza.com.br/artigos-para-festa/l/af/
[*] Internal link: https://www.magazineluiza.com.br/audio/l/e

[*] Internal link: https://www.magazineluiza.com.br/geladeira-refrigerador/eletrodomesticos/s/ed/refr/tipo-de-degelo---frost-free/
[*] Internal link: https://www.magazineluiza.com.br/geladeira-refrigerador/eletrodomesticos/s/ed/refr/cor---inox/
[*] Internal link: https://www.magazineluiza.com.br/geladeira-duplex/eletrodomesticos/s/ed/ref2/tipo-de-degelo---frost-free/
[*] Internal link: https://www.magazineluiza.com.br/lava-loucas/eletrodomesticos/s/ed/lalo/
[*] Internal link: https://www.magazineluiza.com.br/fogao-a-lenha/eletrodomesticos/s/ed/fole/
[*] Internal link: https://www.magazineluiza.com.br/fogao-2-bocas/eletrodomesticos/s/ed/fg2b/
[*] Internal link: https://www.magazineluiza.com.br/eletrodomesticos/l/ed/brand---brastemp/
[*] Internal link: https://www.magazineluiza.com.br/eletrodomesticos/l/ed/brand---electrolux/
[*] Internal link: https://www.magazineluiza.com.br/eletrodomesticos/l/ed/brand---consul/
[*] Internal link: https://www.magazineluiza.com.br/smart-tv/tv-e-video/s/

[!] External link: http://especiais.magazineluiza.com.br/netshoes/
[*] Internal link: https://www.magazineluiza.com.br/acessorios-de-tecnologia/l/ia/
[!] External link: https://especiais.magazineluiza.com.br/black-friday/
[!] External link: https://www.listasmagalu.com/chadebebe/
[*] Internal link: https://www.magazineluiza.com.br/moveis-para-sala-de-estar/moveis/s/mo/msal/
[*] Internal link: https://www.magazineluiza.com.br/racks-e-paineis-para-tv/moveis/s/mo/rptv/
[*] Internal link: https://www.magazineluiza.com.br/portaldalu/moveis/painel-para-tv/mo/mopa/
[*] Internal link: https://www.magazineluiza.com.br/painel-para-tv/moveis/s/mo/mopa/httpswww.magazineluiza.com.br/racks/moveis-e-decoracao/s/mo/racm/
[*] Internal link: https://www.magazineluiza.com.br/painel-para-tv/moveis/s/mo/mopa/httpswww.magazineluiza.com.br/tv-led-plasma-lcd-e-outras/tv-e-video/s/et/peco/
[*] Internal link: https://www.magazineluiza.com.br/painel-para-tv/moveis/s/mo/mopa/httpswww.magazineluiza.com.br/dvd-play

[*] Crawling: https://www.magazineluiza.com.br/painel-para-tv-ate-60-polegadas-1-porta-led-nt-1115-notavel-notavel-moveis/p/hjf941ha56/mo/mopa/
[+] Total Internal links: 192
[+] Total External links: 59
[+] Total URLs: 251
[*] Crawling: https://www.magazineluiza.com.br/painel-para-tv/moveis/s/mo/mopa/httpswww.magazineluiza.com.br/racks/moveis-e-decoracao/s/mo/racm/
[+] Total Internal links: 192
[+] Total External links: 59
[+] Total URLs: 251
[*] Crawling: https://www.magazineluiza.com.br/painel-para-tv/moveis/s/mo/mopa/httpswww.magazineluiza.com.br/caixa-de-som-som-portatil/audio/s/ea/aucx/
[+] Total Internal links: 192
[+] Total External links: 59
[+] Total URLs: 251
[*] Crawling: https://www.magazineluiza.com.br/painel-suspenso-denver-sala-tv-ate-55-polegadas-1-porta-moveis-bechara/p/cb4ae6ec22/mo/mopa/
[+] Total Internal links: 192
[+] Total External links: 59
[+] Total URLs: 251
[*] Crawling: https://www.magazineluiza.com.br/painel-para-tv-ate-55-3-prateleiras-dj-moveis-greco/p/12

In [24]:
# example of a phishing URL
url = "https://servikalodkerrns.io/to/login/password.php"
max_urls=30
crawl(url, max_urls)

[*] Crawling: https://servikalodkerrns.io/to/login/password.php
[*] Internal link: https://servikalodkerrns.io/to/login/password.php
[!] External link: https://ynp.a78.myftpupload.com/
[!] External link: https://wordpress.org/
[*] Crawling: https://servikalodkerrns.io/to/login/password.php
[+] Total Internal links: 1
[+] Total External links: 2
[+] Total URLs: 3
[+] Total Internal links: 1
[+] Total External links: 2
[+] Total URLs: 3


In [20]:
# example of a phishing URL (raises an error)
url = "https://regularupdates.otheramazoncardpayment.xyz/"
max_urls=30
crawl(url, max_urls)

[*] Crawling: https://regularupdates.otheramazoncardpayment.xyz/


ConnectionError: HTTPSConnectionPool(host='regularupdates.otheramazoncardpayment.xyz', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fa93c3d4fa0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
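
A ConnectionError like the one above stops the whole crawl. One way to keep going is to catch the exception and skip the page. This is a minimal sketch, assuming that skipping unreachable pages is acceptable: `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None when the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class of ConnectionError, Timeout, etc.
        print(f"[x] Could not fetch {url}: {exc}")
        return None
    return response.content
```

A caller would check for None before parsing, for example by returning an empty set of links for that page.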

**Notes:**

- Requesting the same site many times in a short period can get your IP address blocked by the site; in that case you will need to use a proxy server.
- How should we handle URLs (especially phishing URLs) from which no links can be extracted?
- How can the generated data be used as a graph? Would there be one graph per site?
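
On the graph question, one possible representation is an adjacency mapping from each page to the pages it links to. The sketch below is only an illustration: `build_link_graph` is a hypothetical name, and it assumes the crawler records (source, target) pairs, which the code in this notebook does not do yet since it only keeps sets of URLs.

```python
def build_link_graph(edges):
    """Build a directed graph {page: set of linked pages} from URL pairs."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # make sure leaf pages appear as nodes
    return graph


# Placeholder URLs, not taken from a real crawl
edges = [
    ("https://example.com/", "https://example.com/login"),
    ("https://example.com/", "https://other.org/"),
    ("https://example.com/login", "https://example.com/"),
]
graph = build_link_graph(edges)
print(len(graph))                            # 3
print(sorted(graph["https://example.com/"]))
# ['https://example.com/login', 'https://other.org/']
```

Building one such mapping per crawled start URL would give one graph per site; merging the edge lists of several crawls would give a single combined graph instead.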
