### Extracting links from a website

Example link: https://www.thepythoncode.com/article/extract-all-website-links-python
Saving to .txt: https://www.thepythoncode.com/code/extract-all-website-links-python

Dependencies: pip3 install requests bs4 colorama

We will use requests to make HTTP requests conveniently, BeautifulSoup (bs4) to parse HTML, and colorama to change the text color.

In [1]:
import requests
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import colorama

We use colorama only to print in different colors, so that internal and external links are easy to tell apart:

In [2]:
# init the colorama module
colorama.init()

GREEN = colorama.Fore.GREEN
GRAY = colorama.Fore.LIGHTBLACK_EX
RESET = colorama.Fore.RESET
YELLOW = colorama.Fore.YELLOW

We need two global variables, one for all the internal links of the site and one for all the external links:

- Internal links are URLs that point to other pages of the same site.
- External links are URLs that point to other sites.
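
One way to make this distinction concrete is to compare the network location (`netloc`) of each link with that of the page being crawled. The sketch below is only illustrative: the helper name `is_internal` and the sample URLs are assumptions, not part of the original tutorial.

```python
from urllib.parse import urlparse


def is_internal(link, base_url):
    """Hypothetical helper: True when `link` points to the same host as `base_url`."""
    return urlparse(link).netloc == urlparse(base_url).netloc


base = "https://www.thepythoncode.com"
print(is_internal("https://www.thepythoncode.com/articles", base))  # same host
print(is_internal("https://github.com/", base))                     # different host
```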

In [3]:
# initialize the set of links (unique links)
internal_urls = set()
external_urls = set()

total_urls_visited = 0

Not all links in anchor tags (`a` tags) are valid: some point to parts of the site and some are JavaScript. So let's write a function to validate URLs.

It ensures that a proper scheme (a protocol such as http or https) and a domain name are present in the URL:

In [4]:
def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)
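
A quick check shows what `is_valid` accepts. Because it only tests that a scheme and a network location are present, a link such as `javascript://document.login.submit()` also passes. The stricter variant below, `is_valid_http`, is a hypothetical addition that restricts the scheme to `http` and `https`. It is a sketch, not part of the original tutorial.

```python
from urllib.parse import urlparse


def is_valid(url):
    """Same check as above: a scheme and a network location must both be present."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def is_valid_http(url):
    """Hypothetical stricter check: additionally require an http(s) scheme."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ("https://example.com/page", "/about", "javascript://document.login.submit()"):
    print(candidate, is_valid(candidate), is_valid_http(candidate))
```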

Now let's build a function that returns all valid URLs found on a web page:

In [5]:
def get_all_website_links(url):
    """
    Returns all URLs found on `url` that belong to the same website.
    """
    # all URLs of `url`
    urls = set()
    # domain name of the URL without the protocol
    domain_name = urlparse(url).netloc
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for a_tag in soup.find_all("a"):
        href = a_tag.attrs.get("href")
        if href == "" or href is None:
            # href empty tag
            continue
        # join the URL if it's relative (not absolute link)
        href = urljoin(url, href)
        parsed_href = urlparse(href)
        # remove URL GET parameters, URL fragments, etc.
        href = parsed_href.scheme + "://" + parsed_href.netloc + parsed_href.path
        if not is_valid(href):
            # not a valid URL
            continue
        if href in internal_urls:
            # already in the set
            continue
        if parsed_href.netloc != domain_name:
            # external link (different host than the page being crawled)
            if href not in external_urls:
                print(f"{GRAY}[!] External link: {href}{RESET}")
                external_urls.add(href)
            continue
        print(f"{GREEN}[*] Internal link: {href}{RESET}")
        urls.add(href)
        internal_urls.add(href)
    return urls
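
The normalization step inside the loop (resolving relative links with `urljoin`, then dropping the query string and fragment) can be tried in isolation. The helper name `normalize` and the sample URLs below are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize(base_url, href):
    """Hypothetical helper mirroring the loop body: resolve, then keep scheme, host and path."""
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    return parsed.scheme + "://" + parsed.netloc + parsed.path


base = "https://www.thepythoncode.com/articles/"
print(normalize(base, "/topic/web-scraping?page=2#top"))
print(normalize(base, "extract-links"))
```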

The function above only collects the links of one specific page. What if we want to extract the links of the whole site? Let's do that:

The `crawl` function below crawls the site: it gets all the links on the first page and then calls itself recursively to follow every link extracted so far. This can cause problems, though: the program would get stuck on large sites with many links, such as google.com. For that reason, a `max_urls` parameter was added so that the crawl stops once a given number of URLs has been checked.

In [6]:
def crawl(url, max_urls=30):
    """
    Crawls a web page and extracts all links.
    You'll find all links in `external_urls` and `internal_urls` global set variables.
    params:
        max_urls (int): number of max urls to crawl, default is 30.
    """
    global total_urls_visited
    total_urls_visited += 1
    print(f"{YELLOW}[*] Crawling: {url}{RESET}")
    links = get_all_website_links(url)
    for link in links:
        if total_urls_visited > max_urls:
            break
        crawl(link, max_urls=max_urls)
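
To see how `max_urls` bounds the traversal without touching the network, here is a sketch that runs the same recursive pattern over a small in-memory link graph. The graph `FAKE_SITE`, the function `crawl_fake`, and the `state` dictionary are all hypothetical stand-ins for real pages and for the globals used above.

```python
# Hypothetical link graph: each page maps to the pages it links to.
FAKE_SITE = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/c": [],
    "/a1": [], "/a2": [], "/b1": [],
}


def crawl_fake(page, state, max_urls=3):
    """Recursive crawl over FAKE_SITE that stops descending once the budget is exceeded."""
    state["visited"] += 1
    state["seen"].add(page)
    # only follow links that have not been recorded yet
    new_links = [link for link in FAKE_SITE.get(page, []) if link not in state["seen"]]
    state["seen"].update(new_links)
    for link in new_links:
        if state["visited"] > max_urls:
            break
        crawl_fake(link, state, max_urls=max_urls)


state = {"visited": 0, "seen": set()}
crawl_fake("/", state, max_urls=3)
print("pages crawled:", state["visited"])
```

Because the check runs before each recursive call rather than after it, the counter can end one above `max_urls`.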

In [8]:
import argparse
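
The command-line interface that appears commented out in the test cells below can be exercised inside a notebook by passing an explicit argument list to `parse_args`. The sketch reuses the parser definition from those comments. The sample argument values are assumptions.

```python
import argparse

parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
parser.add_argument("url", help="The URL to extract links from.")
parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.",
                    default=30, type=int)

# An explicit list stands in for sys.argv[1:], which a notebook kernel fills with its own flags.
args = parser.parse_args(["https://www.thepythoncode.com", "-m", "10"])
print(args.url, args.max_urls)
```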

**Test with the URL from the example:**

In [16]:
if __name__ == "__main__":

    #parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    #parser.add_argument("url", help="The URL to extract links from.")
    #parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.", default=30, type=int)

    #args = parser.parse_args()
    #url = args.url
    #max_urls = args.max_urls

    #crawl(url, max_urls=max_urls)
    url = "https://www.thepythoncode.com"
    max_urls=30
    crawl(url, max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
    print("[+] Total crawled URLs:", max_urls)

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)

[*] Crawling: https://www.thepythoncode.com
[+] Total Internal links: 137
[+] Total External links: 123
[+] Total URLs: 260
[+] Total crawled URLs: 30


**Test with a phishing URL:**

In [19]:
if __name__ == "__main__":

    #parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    #parser.add_argument("url", help="The URL to extract links from.")
    #parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.", default=30, type=int)

    #args = parser.parse_args()
    #url = args.url
    #max_urls = args.max_urls

    #crawl(url, max_urls=max_urls)
    url = "http://vlkote.hop.ru/"
    max_urls=30
    crawl(url, max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
    print("[+] Total crawled URLs:", max_urls)

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)

[*] Crawling: http://vlkote.hop.ru/
[!] External link: http://www.r3.ru/
[!] External link: http://vkontakte.ru/faq.php
[!] External link: javascript://document.location='reg0'
[!] External link: http://vkontakte.ru/login.php
[!] External link: javascript://document.login.submit()
[!] External link: http://www.alexa.com/site/ds/top_sites
[!] External link: http://vkontakte.ru/help.php
[!] External link: http://vkontakte.ru/techsupp.php
[!] External link: http://vkontakte.ru/blog.php
[!] External link: http://vkontakte.ru/
[+] Total Internal links: 137
[+] Total External links: 133
[+] Total URLs: 270
[+] Total crawled URLs: 30
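
The totals above appear to include the 137 internal and 123 external links collected in the previous run, because `internal_urls`, `external_urls`, and `total_urls_visited` are module-level globals that persist between cells. A minimal sketch of clearing them before each new crawl follows. `reset_state` is a hypothetical helper, not part of the original tutorial.

```python
# Module-level state, as defined earlier in the notebook
internal_urls = set()
external_urls = set()
total_urls_visited = 0


def reset_state():
    """Hypothetical helper: empty both link sets and zero the visit counter."""
    global total_urls_visited
    internal_urls.clear()
    external_urls.clear()
    total_urls_visited = 0


# simulate leftovers from an earlier crawl
internal_urls.add("https://www.thepythoncode.com/articles")
total_urls_visited = 31

reset_state()
print(len(internal_urls), len(external_urls), total_urls_visited)
```

Calling such a helper at the top of each test cell would keep the per-site counts and the saved `.txt` files independent of earlier runs.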


**Test with a legitimate URL:**

In [20]:
if __name__ == "__main__":

    #parser = argparse.ArgumentParser(description="Link Extractor Tool with Python")
    #parser.add_argument("url", help="The URL to extract links from.")
    #parser.add_argument("-m", "--max-urls", help="Number of max URLs to crawl, default is 30.", default=30, type=int)

    #args = parser.parse_args()
    #url = args.url
    #max_urls = args.max_urls

    #crawl(url, max_urls=max_urls)
    url = "https://www.amazon.com.br/"
    max_urls=30
    crawl(url, max_urls)

    print("[+] Total Internal links:", len(internal_urls))
    print("[+] Total External links:", len(external_urls))
    print("[+] Total URLs:", len(external_urls) + len(internal_urls))
    print("[+] Total crawled URLs:", max_urls)

    domain_name = urlparse(url).netloc

    # save the internal links to a file
    with open(f"{domain_name}_internal_links.txt", "w") as f:
        for internal_link in internal_urls:
            print(internal_link.strip(), file=f)

    # save the external links to a file
    with open(f"{domain_name}_external_links.txt", "w") as f:
        for external_link in external_urls:
            print(external_link.strip(), file=f)

[*] Crawling: https://www.amazon.com.br/
[*] Internal link: https://www.amazon.com.br/ref=nav_logo
[*] Internal link: https://www.amazon.com.br/ap/signin
[*] Internal link: https://www.amazon.com.br/gp/css/order-history
[*] Internal link: https://www.amazon.com.br/gp/cart/view.html
[*] Internal link: https://www.amazon.com.br/gp/site-directory
[*] Internal link: https://www.amazon.com.br/gp/browse.html
[*] Internal link: https://www.amazon.com.br/gp/goldbox
[*] Internal link: https://www.amazon.com.br/gp/bestsellers/
[*] Internal link: https://www.amazon.com.br/prime
[*] Internal link: https://www.amazon.com.br/Livros/b/
[*] Internal link: https://www.amazon.com.br/gp/help/customer/display.html
[*] Internal link: https://www.amazon.com.br/Eletronicos-e-Tecnologia/b/
[*] Internal link: https://www.amazon.com.br/Computadores-Informatica/b/
[*] Internal link: https://www.amazon.com.br/gp/new-releases/
[*] Internal link: https://www.amazon.com.br/ebooks-kindle/b/
[*] Internal link: https:/

[*] Internal link: https://www.amazon.com.br/M%C3%A1scaras-Prote%C3%A7%C3%A3o-Algod%C3%A3o-Mash-Branco/dp/B0874C3KQ4
[*] Internal link: https://www.amazon.com.br/Duna-Cr%C3%B4nicas-Livro-1-ebook/dp/B015EE5JX4
[*] Internal link: https://www.amazon.com.br/Box-Funda%C3%A7%C3%A3o-Trilogia-Isaac-Asimov-ebook/dp/B07PW57BK8
[*] Internal link: https://www.amazon.com.br/Corrente-ferro-Vol-%C3%BAltimas-horas-ebook/dp/B09FX5B2C3
[*] Internal link: https://www.amazon.com.br/Messias-Duna-Cr%C3%B4nicas-Livro-ebook/dp/B015EECXZ6
[*] Internal link: https://www.amazon.com.br/Chinelo-Infantil-Ipanema-Sandalia-Feminina/dp/B08TK7JHSQ
[*] Internal link: https://www.amazon.com.br/Shorts-Meninas-Brandili-Natural-Conj34689/dp/B09HR2C6XD
[*] Internal link: https://www.amazon.com.br/Vestido-Infantil-Dc-Super-Friends/dp/B09HL3M4S9
[*] Internal link: https://www.amazon.com.br/Chinelo-Drifter-Crian%C3%A7a-Unissex-Dourado/dp/B095W1ZQVP
[*] Internal link: https://www.amazon.com.br/Controle-Dualshock-PlayStation-4-Pr