## Explanation of Web Scraping Code for Email Extraction

This notebook scrapes email addresses from two biological collection platforms, SpeciesLink and SIBBr. Below is a breakdown of its functionality:

1. **Installing Required Libraries**:
   - The code starts by installing `requests` and `beautifulsoup4`, which are used for making HTTP requests and parsing HTML, respectively. `pandas` is imported later but is not installed by this step, so it must already be available in the environment.

2. **Importing Libraries**:
   - After installation, the required libraries are imported into the script for use in subsequent operations.

3. **Data Collection from SpeciesLink**:
   - The code sends a GET request to the SpeciesLink network map URL to fetch the page. If the request succeeds, it parses the HTML and extracts the rows of the first table, including the cell texts and the collection links.

4. **Processing Extracted Data**:
   - The extracted data is processed to split location information into separate columns and rename various columns for clarity. Unnecessary columns are dropped to streamline the DataFrame.

5. **Email Extraction Functions**:
   - Two functions, `get_emails_from_page(url)` and `get_emails_after_contact(url)`, are defined to extract email addresses from given URLs. The first retrieves emails from the entire page text. The second looks for emails under the "Contatos" (Contacts) heading; if that heading is absent, it collects the links listed under "Subcoleções" (Subcollections), and if neither heading exists it falls back to the first function.

6. **Data Collection from SIBBr**:
   - Another function, `extract_collection_info(url)`, is created to collect collection information from the SIBBr platform. It retrieves the list of collections available on the specified webpage.

7. **Additional Information Retrieval**:
   - The `get_additional_info(url)` function extracts additional details such as email addresses and acronyms from SIBBr pages.

8. **Merging DataFrames**:
   - DataFrames from SpeciesLink and SIBBr are merged on the collection name (`nome`) with an outer join, and the overlapping email and acronym columns are combined into single columns.

9. **Updating Missing Emails**:
   - The `update_missing_emails(df)` function iterates through the merged DataFrame to identify and fill in any missing email addresses by calling the appropriate extraction functions based on the link's domain.

10. **Exporting Results**:
    - The SpeciesLink and SIBBr DataFrames are each exported to a CSV file (`email_list1.csv` and `email_list2.csv`) for further analysis or record-keeping. The cells shown here do not write the final merged DataFrame to disk.

Overall, the notebook consolidates contact emails for biological collections from both platforms into a single table.


# Installation of Required Libraries

This cell installs the necessary libraries for web scraping:
- `requests`: A simple library for making HTTP requests in Python.
- `beautifulsoup4`: A library for parsing HTML and XML documents, commonly used for web scraping tasks.

Run the cell to install both libraries in the current environment.


In [54]:
!pip install requests beautifulsoup4


# Importing Required Libraries

In this cell, we import the essential libraries for the web scraping project:

- `requests`: This library is used to send HTTP requests to web pages and fetch their content.
  
- `BeautifulSoup` from `bs4`: A powerful library for parsing HTML and XML documents. It allows us to navigate the parse tree and extract data easily.
  
- `pandas`: A library used for data manipulation and analysis. It provides data structures like DataFrames that are useful for organizing and storing data.

- `re`: This is Python's built-in regular expressions library, which allows for pattern matching and text manipulation.

These libraries will enable us to retrieve web content, parse it, and store the relevant information efficiently.


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


# Web Scraping from SpeciesLink

In this cell, we perform web scraping on the SpeciesLink network map page to extract data from an HTML table. 

1. **Setting the URL**: 
   - We define the URL of the SpeciesLink network map.

2. **Sending HTTP Request**: 
   - The `requests.get(url)` function sends a GET request to the specified URL. The response is stored in the `response` variable.

3. **Checking Response Status**: 
   - We check if the request was successful by verifying if `response.status_code` is equal to 200.

4. **Parsing the HTML Content**:
   - If the request is successful, we parse the HTML content using `BeautifulSoup`, specifying the parser as `'html.parser'`.

5. **Extracting Text and Tables**:
   - We extract the entire text of the page using `soup.get_text()` and locate all `<table>` elements on the page with `soup.find_all('table')`.

6. **Initializing Variables**:
   - `table_index` is set to `0`, which indicates the first table on the page.
   - We create an empty list, `links_texts`, to store extracted links and their corresponding text.

7. **Processing the Table**:
   - We check if there are tables available and proceed to extract data from the specified table index:
     - We initialize empty lists for `headers` and `rows`.
     - If a header row exists, we extract the headers from the table and store them in the `headers` list.
     - We iterate over all rows in the table:
       - For each row, we extract the text from all cells (`<td>` elements).
       - We specifically process the second column to find any `<a>` tags (hyperlinks).
       - For each hyperlink found, we construct the full URL by prepending the SpeciesLink base URL to its `href` and store both the link and the display text in `links_texts`.
       - We append the last link found in that cell to the row of data.

8. **Creating a DataFrame**:
   - Finally, we create a `pandas` DataFrame, `df_specieslink`, to organize the extracted data. If header cells (`<th>`) are found, they are used as the column names; otherwise the DataFrame gets default integer column labels (0, 1, 2, ...), which the next cell relies on when renaming and dropping columns.

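The row-processing steps above can be sketched on a small inline HTML fragment instead of the live page. The markup below is made up for illustration, and `urljoin` from the standard library is swapped in for the string concatenation used in the notebook. The link variable is reset for each row, so a row without a hyperlink gets `None` rather than the previous row's link.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://specieslink.net'

# Hypothetical markup standing in for the real network-map table
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td><a href="/col/AAA">AAA</a></td><td>Collection A</td></tr>
  <tr><td>2</td><td>BBB</td><td>Collection B</td></tr>
</table>
"""


def extract_rows(html, link_column=1):
    """Return one list per data row: the cell texts plus the last link in `link_column`."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    rows = []
    if table is None:
        return rows
    for tr in table.find_all('tr'):
        cells = tr.find_all('td')
        if len(cells) <= link_column:
            continue  # Skip header rows and rows too short to hold the link column
        row = [cell.get_text(strip=True) for cell in cells]
        link = None  # Reset for every row
        for a_tag in cells[link_column].find_all('a', href=True):
            link = urljoin(BASE_URL, a_tag['href'])
        row.append(link)
        rows.append(row)
    return rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```
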

In [55]:
# Define the URL to scrape
url = 'https://specieslink.net/network-map'

# Send a GET request to the URL
response = requests.get(url)
table_index = 0  # Index to select which table to extract (0 for the first table)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the entire text from the page (not used later)
    text = soup.get_text()
    
    # Find all table elements on the page
    tables = soup.find_all('table')
    links_texts = []  # List to store links and their text

    # Check if the specified table index exists in the list of tables
    if len(tables) > table_index:
        table = tables[table_index]  # Select the specified table
        headers = []  # List to store table headers
        rows = []     # List to store rows of data
        
        # Extract headers from the first row of the table, if they exist
        header_row = table.find('tr')
        if header_row:
            headers = [th.get_text(strip=True) for th in header_row.find_all('th')]

        # Iterate through each row in the table
        for tr in table.find_all('tr'):
            cells = tr.find_all('td')  # Find all cells in the current row
            if len(cells) > 1:  # Only process data rows that have a second column
                row = [cell.get_text(strip=True) for cell in cells]  # Extract text from each cell

                link = None  # Reset per row so a row without links does not reuse the previous row's link
                second_column = cells[1]  # Select the second column to find links
                # Find all <a> tags within the second column and extract their href attribute
                for a_tag in second_column.find_all('a', href=True):
                    link = 'https://specieslink.net' + a_tag['href']  # Construct the full URL
                    text = a_tag.get_text(strip=True)  # Get the text for the link
                    links_texts.append((link, text))  # Append the link and its text to the list

                row.append(link)  # Append the last extracted link (or None) to the current row
                rows.append(row)  # Add the row data to the rows list
                
        # Create a DataFrame from the extracted table data
        if headers:  # If headers exist, use them as column names
            df_specieslink = pd.DataFrame(rows, columns=headers)
        else:  # If no headers, create DataFrame without specified columns
            df_specieslink = pd.DataFrame(rows)

In [56]:
# Splitting the contents of the 4th column (index 3) into two new columns: 'Cidade' and 'Estado'
# The contents are split based on ' / ' delimiter and expanded into separate columns
df_specieslink[['Cidade', 'Estado']] = df_specieslink[3].str.split(' / ', expand=True)

# Renaming the columns of the DataFrame for better readability
# The current index numbers correspond to the columns that need renaming
df_specieslink = df_specieslink.rename(columns={1: 'sigla', 2: 'nome', 4: 'ano', 6: 'link_specieslink'})

# Dropping unnecessary columns from the DataFrame
# Columns at index 0, 3, and 5 are removed as they are not needed for further analysis
df_specieslink = df_specieslink.drop(columns=[0, 3, 5])

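The location split can be tried on a toy Series. The values below are invented. Passing `n=1` is an addition to the original call: it caps the result at two columns even if a value contained a second ` / `, which would otherwise make the two-column assignment fail. A value without the delimiter leaves the second column missing.

```python
import pandas as pd

# Invented sample values in the "City / State" layout
locations = pd.Series(['Campinas / SP', 'Manaus / AM', 'Brasilia'])

# Split at most once, so the result never has more than two columns
parts = locations.str.split(' / ', n=1, expand=True)
parts.columns = ['Cidade', 'Estado']

print(parts)
```
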

In [57]:
def get_emails_from_page(url):
    """
    Retrieves unique email addresses from the specified web page URL.

    Parameters:
    url (str): The web page URL from which email addresses will be extracted.

    Returns:
    List[str]: A list of unique email addresses found on the web page.
                If the URL is inaccessible or an error occurs, an empty list is returned.

    Exceptions:
    The function handles exceptions that may arise during the HTTP request or HTML parsing. 
    If an error occurs, a message is printed to indicate the problem, 
    and an empty list is returned.

    Usage Example:
    >>> url = "https://example.com"
    >>> emails = get_emails_from_page(url)
    >>> print(emails)  # Prints the list of unique email addresses found on the page.

    Notes:
    - The function uses a regular expression to match standard email formats. 
      The pattern it uses is designed to capture most valid email addresses but may not account for all possible formats.
    - Ensure that the URL provided is publicly accessible and allows for scraping.
    """
    # This function retrieves unique email addresses from the specified web page URL
    try:
        # Sending a GET request to the provided URL
        response = requests.get(url)
        
        # Check if the response status code indicates success (200)
        if response.status_code == 200:
            # Parse the HTML content of the page with BeautifulSoup
            soup = BeautifulSoup(response.text, 'html.parser')
            # Extract all text from the page
            page_text = soup.get_text()
            
            # Define a regular expression pattern to match email addresses
            email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
            # Find all email addresses in the page text
            emails = email_pattern.findall(page_text)
            # Use a set to remove duplicate email addresses
            unique_emails = set(emails)
            # Return the unique email addresses as a list
            return list(unique_emails)
        else:
            # Print an error message if the request was unsuccessful
            print(f'Failed to access URL {url}. Status code: {response.status_code}')
            return []
    except Exception as e:
        # Handle any exceptions that occur during the request
        print(f'Error accessing URL {url}: {e}')
        return []


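To see what the pattern in `get_emails_from_page` captures, here is a minimal sketch that applies the same regular expression to a plain string, with no network access. The sample text and the helper name `find_unique_emails` are made up for illustration. Unlike `set`, `dict.fromkeys` keeps the order of first appearance, which makes the result reproducible between runs.

```python
import re

# Same pattern as in get_emails_from_page
EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')


def find_unique_emails(text):
    """Return the email addresses found in `text`, de-duplicated in order of appearance."""
    return list(dict.fromkeys(EMAIL_PATTERN.findall(text)))


# Invented sample text with one repeated address
sample = 'Curator: ana@example.org; backup: joao.silva@example.com.br, ana@example.org'
emails = find_unique_emails(sample)
print(emails)
```
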

In [None]:
def get_emails_after_contact(url):
    """
    Extracts unique email addresses from the "Contatos" section of a specified web page URL.
    If the "Contatos" section is not found, it looks for a "Subcoleções" section and appends
    any subcollection links not yet present to the main DataFrame.

    Parameters:
    url (str): The web page URL from which email addresses and subcollection information will be extracted.

    Returns:
    List[str]: A list of unique email addresses found in the "Contatos" section.
                An empty list is returned if no emails are found, if only subcollection
                links are found, or if an error occurs. If neither section is present,
                the result of `get_emails_from_page(url)` is returned.

    Exceptions:
    This function handles exceptions that may arise during the HTTP request or HTML parsing.
    If an error occurs, a message is printed to indicate the problem,
    and an empty list is returned.

    Usage Example:
    >>> url = "https://example.com"
    >>> emails = get_emails_after_contact(url)
    >>> print(emails)  # Prints the list of unique email addresses found in the "Contatos" section.

    Notes:
    - The function uses a regular expression to match standard email formats.
      The pattern it uses is designed to capture most valid email addresses but may not account for all possible formats.
    - If the "Contatos" section is absent, it checks for the "Subcoleções" section and extracts relevant links.
      The main DataFrame `df_specieslink` is updated with any new links found.
    - The function always returns a list, so callers can pass the result to `', '.join(...)`.
    - Ensure that the URL provided is publicly accessible and allows for scraping.
    """
    global df_specieslink
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find the <h4> tag with the text "Contatos"
            contact_header = soup.find('h4', string='Contatos')

            if contact_header:
                # Collect every sibling element that follows the <h4> tag
                siblings = contact_header.find_next_siblings()

                # Join the text of all following siblings into one string
                section_text = ''
                for sibling in siblings:
                    section_text += sibling.get_text() + ' '

                # Regular expression to find emails
                email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
                emails = email_pattern.findall(section_text)

                # Remove duplicate emails
                unique_emails = set(emails)
                if unique_emails:
                    return list(unique_emails)
                else:
                    print(f'No emails found in the section after the <h4> "Contatos" tag at {url}.')
                    return []
            else:
                print(f'<h4> tag with the text "Contatos" not found at {url}.')

                # Look for "Subcoleções" if "Contatos" is not found
                subcollections_header = soup.find('h4', string='Subcoleções')

                if subcollections_header:
                    list_section = subcollections_header.find_next_sibling()
                    subcollections_info = []

                    if list_section:
                        for li in list_section.find_all('li'):
                            a_tag = li.find('a', href=True)

                            if a_tag:
                                sigla = a_tag.get_text(strip=True)
                                link = 'https://specieslink.net' + a_tag['href']
                                name = li.get_text(strip=True).replace(sigla, '').strip().replace('- ', '')

                                subcollections_info.append({'nome': name, 'sigla': sigla, 'link_specieslink': link})

                    if subcollections_info:
                        # Build a DataFrame with the collected links
                        new_links_df = pd.DataFrame(subcollections_info)

                        # Drop links that are already in the main DataFrame
                        existing_links = set(df_specieslink['link_specieslink'])
                        new_links_df = new_links_df[~new_links_df['link_specieslink'].isin(existing_links)]
                        df_specieslink = pd.concat([df_specieslink, new_links_df], ignore_index=True)

                    # This page lists subcollections rather than emails; return a list so callers can join it
                    return []
                else:
                    print(f'<h4> tag with the text "Subcoleções" not found at {url}.')
                    return get_emails_from_page(url)  # Fall back to scanning the whole page for emails
        else:
            print(f'Failed to access URL {url}. Status code: {response.status_code}')
            return []
    except Exception as e:
        print(f'Error accessing URL {url}: {e}')
        return []


df_specieslink['email'] = df_specieslink['link_specieslink'].apply(lambda url: ', '.join(get_emails_after_contact(url)))


In [60]:
df_specieslink.to_csv('email_list1.csv')

In [61]:
url = 'https://collectory.sibbr.gov.br/collectory/'

def extract_collection_info(url):
    """
    Extracts information about collections from the specified SIBBr Collectory URL.

    This function sends a GET request to the provided SIBBr Collectory URL,
    retrieves the page content, and extracts the names and links of collections
    listed in the filtered collection list.

    Parameters:
    url (str): The URL of the SIBBr Collectory page from which collection information will be extracted.

    Returns:
    List[Dict[str, str]]: A list of dictionaries, each containing the name and link of a collection.
                          If no collections are found or if there is an error accessing the URL, 
                          an empty list is returned.

    Exceptions:
    This function handles exceptions that may arise during the HTTP request or HTML parsing. 
    If an error occurs, a message is printed to indicate the problem, 
    and an empty list is returned.

    Usage Example:
    >>> url = "https://collectory.sibbr.gov.br/collectory/"
    >>> collections = extract_collection_info(url)
    >>> print(collections)  # Prints a list of collections with their names and links.

    Notes:
    - The function specifically looks for a <ul> element with the id 'filtered-list' 
      and extracts all <li> items within it. 
      Each <li> item is expected to contain an <a> tag with the collection name and URL.
    - Ensure that the provided URL is publicly accessible and allows for scraping.
    """
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Find tag <ul> with id 'filtered-list'
            filtered_list = soup.find('ul', id='filtered-list')
            
            if filtered_list:
                collections = []
                
                # Find all tags <li> in <ul>
                for li in filtered_list.find_all('li'):
                    a_tag = li.find('a', href=True)
                    
                    if a_tag:
                        collection_name = a_tag.get_text(strip=True)
                        collection_url = 'https://collectory.sibbr.gov.br' + a_tag['href']
                        collections.append({'nome': collection_name, 'link_sibbr': collection_url})
                
                return collections
            else:
                print(f"<ul> tag with id 'filtered-list' not found at {url}.")
                return []
        else:
            print(f'Failed to access URL {url}. Status code: {response.status_code}')
            return []
    except Exception as e:
        print(f'Error accessing URL {url}: {e}')
        return []

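Both scrapers build absolute URLs by prefixing a fixed host to each `href`, which only works when the `href` is a root-relative path. A possible alternative, not used in the original notebook, is `urllib.parse.urljoin`, which also copes with `href` values that are already absolute. The paths below are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'https://collectory.sibbr.gov.br/collectory/'

# Hypothetical href values: one root-relative, one already absolute
hrefs = [
    '/collectory/public/show/co1',
    'https://example.org/other',
]

# urljoin resolves each href against the page URL
links = [urljoin(BASE_URL, href) for href in hrefs]

for link in links:
    print(link)
```
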

In [62]:
collection_info = extract_collection_info(url)


In [None]:
def get_additional_info(url):
    """
    Retrieves additional information (emails and acronym) from the specified URL.

    This function performs an HTTP GET request to the given URL, extracts email 
    addresses from the page's content, and looks for a specific acronym 
    contained within a designated HTML element.

    Parameters:
    - url (str): The URL of the page to scrape for additional information.

    Returns:
    - Tuple[str, str]: A tuple containing:
        - email_list (str): A comma-separated string of unique email addresses found on the page.
        - acronym_text (str): The acronym extracted from the page, or an empty string if not found.

    Example:
    - If the URL is valid and contains emails and an acronym, the function will return
      the emails as a string and the acronym as a string. If no emails or acronym 
      are found, it will return an empty string for each.

    Note:
    - The function handles errors related to network requests and parsing.
    - It looks for emails in two places: as plain text on the page and encoded in 
      JavaScript (within `onclick` attributes).
    """
    # Try to get the response from the URL
    try:
        response = requests.get(url)
        # Check if the request was successful
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract emails from the page text
            page_text = soup.get_text()
            email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
            emails = email_pattern.findall(page_text)

            # Extract emails from the contact div if it exists
            contact_div = soup.find('div', class_='contact')
            if contact_div:
                a_tags = contact_div.find_all('a', href=True)
                for a_tag in a_tags:
                    onclick_text = a_tag.get('onclick', '')
                    email_match = re.search(r"sendEmail\('([^']*)'\)", onclick_text)
                    if email_match:
                        email_encoded = email_match.group(1)
                        email = email_encoded.replace('(SPAM_MAIL@ALA.ORG.AU)', '@')
                        emails.append(email)

            # Get unique emails
            unique_emails = set(emails)
            email_list = ', '.join(unique_emails)

            # Extract acronym from the page
            acronym_tag = soup.find('span', class_='acronym')
            acronym_text = acronym_tag.get_text(strip=True).replace('Acronym: ', '') if acronym_tag else ''

            return email_list, acronym_text
        else:
            print(f'Failed to access URL {url}. Status code: {response.status_code}')
            return '', ''
    except Exception as e:
        print(f'Error accessing URL {url}: {e}')
        return '', ''

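The `onclick` handling in `get_additional_info` can be checked in isolation. This sketch applies the same regular expression and placeholder replacement to a made-up attribute value; the helper name `decode_onclick_email` and the sample address are assumptions for illustration.

```python
import re

# Same pattern and placeholder as in get_additional_info
SEND_EMAIL_PATTERN = re.compile(r"sendEmail\('([^']*)'\)")
PLACEHOLDER = '(SPAM_MAIL@ALA.ORG.AU)'


def decode_onclick_email(onclick_text):
    """Return the email hidden in a sendEmail('...') call, or None if there is none."""
    match = SEND_EMAIL_PATTERN.search(onclick_text)
    if match is None:
        return None
    return match.group(1).replace(PLACEHOLDER, '@')


# Invented onclick value in the obfuscated format
decoded = decode_onclick_email("return sendEmail('curadoria(SPAM_MAIL@ALA.ORG.AU)example.org')")
print(decoded)
```
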
In [None]:
# Loop through collection_info to retrieve additional information
for collection in collection_info:
    email, acronym = get_additional_info(collection['link_sibbr'])
    collection['email'] = email
    collection['sigla'] = acronym

# Create a DataFrame from the collection information
df_sibbr = pd.DataFrame(collection_info)


In [None]:
# Save the DataFrame df_sibbr to a CSV file named 'email_list2.csv'
df_sibbr.to_csv('email_list2.csv')

# Merge df_specieslink and df_sibbr on the 'nome' column
# The merge is performed as an outer join to include all records from both DataFrames
# Suffixes are added to distinguish columns from each DataFrame in case of overlap
merged_df = pd.merge(df_specieslink, df_sibbr, on='nome', how='outer', suffixes=('_df1', '_df2'))

# Combine the 'sigla' columns from both DataFrames
# The combine_first method fills missing values in 'sigla_df1' with values from 'sigla_df2'
merged_df['sigla'] = merged_df['sigla_df1'].combine_first(merged_df['sigla_df2'])

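# Combine the 'email' columns from both DataFrames
# Empty strings in 'email_df1' (pages where no email was found) are turned into missing values first,
# because combine_first only fills missing values with values from 'email_df2'
merged_df['email'] = merged_df['email_df1'].mask(merged_df['email_df1'] == '').combine_first(merged_df['email_df2'])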

# Combine the 'link' columns from both DataFrames
# The combine_first method fills missing values in 'link_sibbr' with values from 'link_specieslink'
merged_df['link'] = merged_df['link_sibbr'].combine_first(merged_df['link_specieslink'])

# Remove the original columns that were used for merging and combining
merged_df = merged_df.drop(columns=['sigla_df1', 'sigla_df2', 'email_df1', 'email_df2'])

# Create a new DataFrame containing records with missing (NaN or blank) email addresses
missing_emails_df = merged_df[merged_df['email'].fillna('').astype(str).str.strip() == '']

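The merge-and-combine step can be tried on two toy DataFrames. The rows below are invented. One detail worth noting: `combine_first` only fills values that are missing (`NaN`), so an empty string left by a page with no email would block the fallback to the other source. This sketch converts empty strings to missing values first with `Series.mask`.

```python
import pandas as pd

# Invented rows standing in for df_specieslink and df_sibbr
df1 = pd.DataFrame({
    'nome': ['Herbario A', 'Colecao B'],
    'sigla': ['HA', 'CB'],
    'email': ['a@example.org', ''],  # '' means no email was found
})
df2 = pd.DataFrame({
    'nome': ['Colecao B', 'Museu C'],
    'sigla': ['CB', 'MC'],
    'email': ['b@example.org', 'c@example.org'],
})

# Outer join keeps collections that appear in only one source
merged = pd.merge(df1, df2, on='nome', how='outer', suffixes=('_df1', '_df2'))

# Treat empty strings as missing, then prefer df1 and fall back to df2
email_df1 = merged['email_df1'].mask(merged['email_df1'] == '')
merged['email'] = email_df1.combine_first(merged['email_df2'])
merged['sigla'] = merged['sigla_df1'].combine_first(merged['sigla_df2'])
merged = merged.drop(columns=['sigla_df1', 'sigla_df2', 'email_df1', 'email_df2'])

result = dict(zip(merged['nome'], merged['email']))
print(result)
```
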

In [None]:
def update_missing_emails(df):
    """
    Updates missing email addresses in the DataFrame by extracting them from the corresponding links.

    Parameters:
    df (DataFrame): The DataFrame containing collection information, including links and email addresses.

    Returns:
    DataFrame: The updated DataFrame with missing email addresses filled in.
    """
    # Iterate over each row in the DataFrame
    for index, row in df.iterrows():
        email = row['email']  # Get the current email for the row
        link = row['link']    # Get the current link for the row
        
        # Check if the email is missing or contains only whitespace
        if pd.isna(email) or email.strip() == '':
            emails = []  # Default when the link is missing or does not match a known source

            if isinstance(link, str):
                # If the link belongs to specieslink.net, extract emails using the corresponding function
                if 'specieslink.net' in link:
                    emails = get_emails_after_contact(link)
                # If the link belongs to sibbr, get_additional_info returns a comma-separated
                # string of emails (plus the acronym), so wrap the string in a list
                elif 'sibbr' in link:
                    email_str, _ = get_additional_info(link)
                    emails = [email_str] if email_str else []

            # If emails are found, update the DataFrame with the new emails
            if emails:
                df.at[index, 'email'] = ', '.join(emails)  # Join multiple emails into a single string

    return df  # Return the updated DataFrame

# Update the main DataFrame with missing emails
merged_df = update_missing_emails(merged_df)

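The cells above write the two source tables to CSV but stop before saving the merged result. Below is a minimal sketch of that last step, using a toy DataFrame and an in-memory buffer so that it runs on its own. In the notebook the call would take `merged_df` and a file path of your choosing (the filename in the comment is hypothetical). `index=False` leaves out the row index, which the earlier `to_csv` calls include by default.

```python
import io

import pandas as pd

# Invented row standing in for merged_df
toy_df = pd.DataFrame({'nome': ['Herbario A'], 'sigla': ['HA'], 'email': ['a@example.org']})

# Write to an in-memory buffer instead of a file so the sketch has no side effects
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# In the notebook, the equivalent call would be something like:
# merged_df.to_csv('email_list_merged.csv', index=False)  # hypothetical filename
```
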