<a href="https://colab.research.google.com/github/poudelmohit/project_IUCN/blob/main/iucn_pdf_link_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Steps (Workflow):


1.   Mount the directory and install the required libraries
2.   Obtain a list of mammals (Mammal Diversity Database)
3.   Obtain the IUCN data (PDF) download link for each of those species
4.   Download the IUCN report of each species using the download link obtained
5.   Extract IUCN information into a dataframe by reading the PDFs
6.   Genomic approaches

# 1.1 Mounting Directory:




In [1]:
from google.colab import drive
MOUNTPOINT = '/content/drive'
drive.mount(MOUNTPOINT)

import os
directory = os.path.join(MOUNTPOINT,'MyDrive','Colab Notebooks','LAB','project_IUCN')
os.chdir(directory)

Mounted at /content/drive


In [2]:
! ls

all_download_links.csv	iucn_extracted_raw_data.csv	iucn_reports
data_extraction.py	iucn_pdf_link_extraction.ipynb	mammals_list.txt


# 1.2 Importing libraries:


In [None]:
! pip install selenium
# import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd

import re # regex to manipulate texts

! pip install pdfplumber
import pdfplumber # to work with pdf (IUCN reports)


Collecting selenium
  Downloading selenium-4.25.0-py3-none-any.whl.metadata (7.1 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.26.2-py3-none-any.whl.metadata (8.6 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting outcome (from trio~=0.17->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading selenium-4.25.0-py3-none-any.whl (9.7 MB)
Downloading trio-0.26.2-py3-none-any.whl (475 kB)

# 2. Obtaining the list of mammals from Mammal Diversity Database:

In [None]:

url =  'https://www.mammaldiversity.org/explore.html'
mammal_database = pd.read_html(url)[0]

mammal_database['scientific_name'] = mammal_database['Genus'] + " " + mammal_database["Species"]

# Convert DataFrame column to a list
mammals_list = mammal_database['scientific_name'].to_list()

# Correct the spelling error in the list
mammals_list = [species.replace('Caluromysiops irruptus', 'Caluromysiops irrupta') for species in mammals_list]


# Save to a text file, comma-separated
with open('mammals_list.txt', 'w') as file:
    file.write(','.join(mammals_list))

#### Some issues here:
##### a. This list has some incorrect species names (when compared with the IUCN site).
##### b. This list has all names on a single line, which needs to be handled when reading the file.
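
One way to sidestep the single-line issue is to store one name per line, so the file can be read back without splitting on commas. This is a minimal sketch; the helper names `save_names` and `load_names` are hypothetical and not part of this notebook.

```python
from pathlib import Path


def save_names(names, path):
    """Write one species name per line."""
    Path(path).write_text("\n".join(names) + "\n", encoding="utf-8")


def load_names(path):
    """Read the names back, skipping blank lines and stray whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

With this layout, the number of entries is simply `len(load_names(path))`.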

In [None]:
# let's check the number of species we have:

with open('mammals_list.txt', 'r') as file:
    total_entries = sum(len(line.split(',')) for line in file)

print(f"Total number of entries: {total_entries}")


Total number of entries: 6753


#### 6753 entries are present currently.

# 3.1 Creating a function to obtain IUCN data download link:

In [None]:
# ! pip install selenium
# might require installation.



def search_iucn_species(species_name):
    """
    Searches for a species on the IUCN Red List website and retrieves the common name and a download link for the IUCN species assessment report PDF.

    Args:
        species_name (str): The name of the species to search for, preferably the scientific name.

    Returns:
        dict: A dictionary containing:
            - "scientific_name": The input species name.
            - "common_name": The headline text of the species page.
            - "download_link": The URL of the first available download button, or None if no download buttons are found.
    """

    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    # Initialize the WebDriver with Chrome options
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Open the IUCN Red List website
        driver.get("https://www.iucnredlist.org/")

        # Find the search box element using the class attribute
        search_box = driver.find_element("css selector", "input.search.search--site")

        # Type the search query (species name) and hit Enter
        search_box.send_keys(species_name)
        search_box.send_keys(Keys.RETURN)

        # Wait for the search results to load
        time.sleep(1)

        # Find and click on the first 'View' link with the class "link--faux"
        view_link = driver.find_element("css selector", "a.link--faux")
        view_link.click()

        # Wait for the species page to load after clicking the link
        time.sleep(1)

        # Find the h1 element with the class "headline__title"
        headline = driver.find_element("css selector", "h1.headline__title")
        headline_text = headline.text

        # Find and click the download button by its name attribute
        download_button = driver.find_element("name", "download_search_results")
        download_button.click()

        # Wait for the download options to appear
        time.sleep(1)

        # Find all 'link--download' buttons
        download_buttons = driver.find_elements("css selector", "a.link--download")
        if download_buttons:
            # Get the href attribute of the first download button
            first_href = download_buttons[0].get_attribute("href")
        else:
            first_href = None
            print(f"No download buttons found for species: {species_name}")

        # Create a dictionary with the headline and first href
        result = {
            "scientific_name": species_name,
            "common_name": headline_text,
            "download_link": first_href
        }
        return result

    except Exception as e:
        # Report which species failed (and the error type), then return empty fields
        print(f"Error searching for species: {species_name} ({type(e).__name__})")
        return {
            "scientific_name": species_name,
            "common_name": None,
            "download_link": None
        }

    finally:
        # Close the browser
        driver.quit()



In [None]:
# just a test:
print(search_iucn_species("Didelphis virginiana"))

{'scientific_name': 'Didelphis virginiana', 'common_name': 'Virginia Opossum', 'download_link': 'https://www.iucnredlist.org/species/pdf/22176259'}
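
The fixed `time.sleep(1)` pauses in the function can be too short on a slow connection and wasteful on a fast one. Selenium provides `WebDriverWait` for waiting on a condition instead. The sketch below shows the same idea without any dependency: a hypothetical helper, `wait_until`, that polls a callable until it returns something truthy or a timeout expires. The name and default values are assumptions for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Exceptions raised by `condition` are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

Inside `search_iucn_species`, a call such as `wait_until(lambda: driver.find_elements("css selector", "a.link--download"))` would return the list of buttons as soon as it is non-empty, rather than after a fixed delay.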


#### This function works well. Next, I need to run it over the list of mammals (or any other species list) to get the download link of each IUCN report.

#### Although I currently have more than 6,000 entries, I will work with only about 1,000 first, just to check the code and pipeline.


# 3.2 Using the function to obtain the download links:

In [None]:
import pandas as pd

# Read the single line from the file mammals_list
with open('mammals_list.txt', 'r') as file:
    # Read the single line and split into a list of species names using comma as the separator
    mammals_list = file.readline().split(',')

# Loop over each species and its index in the mammals_list
for index, species in enumerate(mammals_list):
    species = species.strip()  # Remove any leading/trailing whitespace
    print(f"Working on species: {species} (Position: {index + 1})")

    # Look up the common name and download link for this species
    result = search_iucn_species(species)

    # Create a DataFrame for the current result
    df_link = pd.DataFrame([result])

    # Append the result to the CSV file (without header after the first write)
    df_link.to_csv("all_download_links.csv", mode='a', index=False, header=not index)


#### Currently, more than 1,000 species have been run through 'search_iucn_species()', and the results are saved to 'all_download_links.csv'.
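
Because the list is processed in batches, it helps to skip species that are already in the CSV when the loop is restarted. This is a minimal sketch using only the standard library. It assumes the CSV has a header row with a `scientific_name` column, as written by the loop above. The helper names `already_searched` and `remaining_species` are hypothetical.

```python
import csv
import os


def already_searched(csv_path):
    """Return the set of scientific names already written to the links CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row["scientific_name"] for row in csv.DictReader(handle)}


def remaining_species(all_species, csv_path):
    """Keep only the species that still need to be searched, in order."""
    done = already_searched(csv_path)
    return [name.strip() for name in all_species if name.strip() not in done]
```

Looping over `remaining_species(mammals_list, "all_download_links.csv")` instead of the full list avoids repeating searches. In that case the header should only be written when the CSV does not exist yet, rather than being tied to the loop index.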

# 4. Download IUCN-reports from the dataframe:

In [None]:
df_report_download_link = pd.read_csv('all_download_links.csv')
df_report_download_link.columns = ['scientific_name','common_name','download_link']

# deleting rows without download link:
df_report_download_link = df_report_download_link[~df_report_download_link['download_link'].isnull()] # 146 rows have no download_links

df_report_download_link = df_report_download_link.reset_index(drop=True)

In [None]:
! mkdir -p iucn_reports

In [None]:
for link in df_report_download_link['download_link']:
    # Quote the link so the shell does not interpret any special characters in it
    os.system(f"wget -P iucn_reports '{link}'")

In [None]:
! ls iucn_reports | wc -l

906


##### At this point, 906 pdfs have been downloaded
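
When the download cell is re-run, reports that are already on disk do not need to be fetched again. This is a minimal sketch of that check. It assumes each report is stored as `<id>.pdf`, where `<id>` is the last part of the link, which matches the file names listed in `iucn_reports`. The helper names `report_path` and `links_to_fetch` are hypothetical.

```python
import os
from urllib.parse import urlparse


def report_path(link, out_dir="iucn_reports"):
    """Map a download link to a local file path ending in .pdf."""
    name = os.path.basename(urlparse(link).path.rstrip("/"))
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return os.path.join(out_dir, name)


def links_to_fetch(links, out_dir="iucn_reports"):
    """Return the links whose report is not on disk yet."""
    return [link for link in links
            if not os.path.exists(report_path(link, out_dir))]
```

The download loop could then iterate over `links_to_fetch(df_report_download_link['download_link'])` only.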

# 5. Obtaining IUCN-values from the pdfs:

In [None]:


def extract_data_from_pdf(pdf_path):
    # Open the PDF file
    with pdfplumber.open(pdf_path) as pdf:
        text = ''

        # Loop through all the pages and extract text
        for page in pdf.pages:
            # Guard against pages that yield no text, so the concatenation cannot fail on None
            text += page.extract_text() or ''

    # Extract the fields using regular expressions
    try:
        scientific_name = re.search(r'^(Scientific Name:|Taxon Name:)\s*(.*)', text, re.MULTILINE).group(2).strip()
    except:
        scientific_name = ''
    try:
        taxonomy = re.search(r'^Animalia.*', text, re.MULTILINE).group(0).strip()
    except:
        taxonomy = ''
    try:
        red_list_category = re.search(r'Red List Category & Criteria:\s*(.*)', text).group(1).strip()
    except:
        red_list_category = ''
    try:
        date_assessed = re.search(r'Date Assessed:\s*(.*)', text).group(1).strip()
    except:
        date_assessed = ''
    try:
        year_published = re.search(r'Year Published:\s*(.*)', text).group(1).strip()
    except:
        year_published = ''
    try:
        current_population_trend = re.search(r'Current Population Trend:\s*(.*)', text).group(1).strip()
    except:
        current_population_trend = ''
    try:
        systems = re.search(r'Systems:\s*(.*)', text).group(1).strip()
    except:
        systems = ''
    try:
        range_description = re.search(r'Range Description:\s*(.*?)[.]\s', text, re.DOTALL).group(1).strip()
    except:
        range_description = ''
    try:
        habitat_and_ecology = re.search(r'Habitat and Ecology\s*(.*?)[.]\s', text, re.DOTALL).group(1).strip()
    except:
        habitat_and_ecology = ''
    try:
        threats = re.search(r'Threats\s*(.*?)[.]\s', text, re.DOTALL).group(1).strip()
    except:
        threats = ''

    # Return the extracted data as a dictionary
    return {
        "Scientific Name": scientific_name,
        "Taxonomy": taxonomy,
        "Red List Category & Criteria": red_list_category,
        "Date Assessed": date_assessed,
        "Year Published": year_published,
        "Current Population Trend": current_population_trend,
        "Systems": systems,
        "Range Description": range_description,
        "Habitat and Ecology": habitat_and_ecology,
        "Threats": threats
    }
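
The `try`/`except` blocks in the function all follow one pattern: search, take a capture group, strip it, and fall back to an empty string. This is a minimal sketch of a helper for that pattern; the name `first_match` is hypothetical. The sample text reuses values from one of the reports, but its line layout is an assumption.

```python
import re


def first_match(pattern, text, group=1, flags=0):
    """Return the stripped capture group of the first match, or '' if none."""
    match = re.search(pattern, text, flags)
    return match.group(group).strip() if match else ""


# Assumed layout of the extracted text, for illustration only.
sample = (
    "Scientific Name: Zaglossus bartoni (Thomas, 1907)\n"
    "Date Assessed: July 24, 2015\n"
    "Year Published: 2016\n"
)

name = first_match(r'^(Scientific Name:|Taxon Name:)\s*(.*)', sample,
                   group=2, flags=re.MULTILINE)
year = first_match(r'Year Published:\s*(.*)', sample)
systems = first_match(r'Systems:\s*(.*)', sample)  # field absent -> ''
```

Because a missing field gives `''` rather than raising an exception, no bare `except:` is needed, and genuine errors such as a mistyped pattern are not silently hidden.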



In [None]:
# ! ls iucn_reports/*.pdf

iucn_reports/102331567.pdf  iucn_reports/182235113.pdf	  iucn_reports/21950199.pdf
iucn_reports/10331066.pdf   iucn_reports/182235685.pdf	  iucn_reports/21950307.pdf
iucn_reports/10479343.pdf   iucn_reports/182236363.pdf	  iucn_reports/21950421.pdf
iucn_reports/111870274.pdf  iucn_reports/182239524.pdf	  iucn_reports/21950491.pdf
iucn_reports/111871718.pdf  iucn_reports/182239898.pdf	  iucn_reports/21950608.pdf
iucn_reports/111873502.pdf  iucn_reports/182240168.pdf	  iucn_reports/21950723.pdf
iucn_reports/111940150.pdf  iucn_reports/182240582.pdf	  iucn_reports/21950802.pdf
iucn_reports/115063540.pdf  iucn_reports/185202632.pdf	  iucn_reports/21950924.pdf
iucn_reports/115100163.pdf  iucn_reports/189740044.pdf	  iucn_reports/21950989.pdf
iucn_reports/115106154.pdf  iucn_reports/190269269.pdf	  iucn_reports/21951066.pdf
iucn_reports/115106311.pdf  iucn_reports/190319676.pdf	  iucn_reports/21951146.pdf
iucn_reports/115166757.pdf  iucn_reports/190412426.pdf	  iucn_reports/21951235.pdf
iucn

In [None]:
# extract_data_from_pdf('iucn_reports/21946586.pdf')
# extract_data_from_pdf('iucn_reports/166615690.pdf')
# extract_data_from_pdf('iucn_reports/45435876.pdf')
# extract_data_from_pdf('iucn_reports/210442893')
# extract_data_from_pdf('iucn_reports/21286959 ')
# extract_data_from_pdf('iucn_reports/17971958')


# Needs some cleaning in the dictionary values before saving as csv :)

In [3]:
def process_all_pdfs_in_directory(directory_path, output_csv):
    # List to hold all the extracted data
    all_data = []

    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(directory_path, filename)
            print(f"Processing {pdf_path}...")
            data = extract_data_from_pdf(pdf_path)
            all_data.append(data)

    # Create a DataFrame from the extracted data
    df = pd.DataFrame(all_data)

    # Save the DataFrame to a CSV file
    df.to_csv(output_csv, index=False)
    print(f"Data saved to {output_csv}")
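
`os.listdir` returns files in arbitrary order, and the extracted rows do not record which PDF they came from. This is a minimal sketch of two small helpers that give a stable order and keep the file name with each record. The names `list_pdfs` and `with_source`, and the `Source File` column, are hypothetical additions.

```python
from pathlib import Path


def list_pdfs(directory_path):
    """Return the PDF paths in a directory, sorted by file name."""
    return sorted(Path(directory_path).glob("*.pdf"))


def with_source(record, pdf_path):
    """Copy an extracted record and add the file it came from."""
    tagged = dict(record)
    tagged["Source File"] = Path(pdf_path).name
    return tagged
```

Keeping the file name makes it easier to trace an odd value in the CSV back to the report it was extracted from.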




In [None]:
# Using the function:

directory_path = 'iucn_reports/'  # Replace with your PDF directory
output_csv = 'iucn_extracted_raw_data.csv'

# Process all PDFs and save the results in a CSV file
process_all_pdfs_in_directory(directory_path, output_csv)


Processing iucn_reports/21964353.pdf...
Processing iucn_reports/21964496.pdf...
Processing iucn_reports/21964204.pdf...
Processing iucn_reports/22180055.pdf...
Processing iucn_reports/22180165.pdf...
Processing iucn_reports/22179860.pdf...
Processing iucn_reports/22179949.pdf...
Processing iucn_reports/166524217.pdf...
Processing iucn_reports/22179769.pdf...
Processing iucn_reports/116333652.pdf...
Processing iucn_reports/22175821.pdf...
Processing iucn_reports/197310136.pdf...
Processing iucn_reports/197321055.pdf...
Processing iucn_reports/97206475.pdf...
Processing iucn_reports/22173467.pdf...
Processing iucn_reports/197310863.pdf...
Processing iucn_reports/197310366.pdf...
Processing iucn_reports/22176554.pdf...
Processing iucn_reports/197310576.pdf...
Processing iucn_reports/22176668.pdf...
Processing iucn_reports/22176259.pdf...
Processing iucn_reports/22175337.pdf...
Processing iucn_reports/166526155.pdf...
Processing iucn_reports/197311087.pdf...
Processing iucn_reports/2217738

# Work From Here..

In [5]:
import pandas as pd
raw_df = pd.read_csv('iucn_extracted_raw_data.csv')

# 6. Cleaning the raw csv data:

### A. First, cleaning the 'Red List Category' column:

In [6]:
raw_df.head()

Unnamed: 0,Scientific Name,Taxonomy,Red List Category & Criteria,Date Assessed,Year Published,Current Population Trend,Systems,Range Description,Habitat and Ecology,Threats
0,"Zaglossus attenboroughi Flannery & Groves, 1998",Animalia Chordata Mammalia Monotremata Tachygl...,"Critically Endangered B1ab(iii,v)+2ab(iii,v) v...","July 24, 2015",2016,Decreasing,Terrestrial,This species is known from one specimen collec...,(see Appendix for additional information)\nThe...,(see Appendix for additional information)\nAll...
1,"Zaglossus bartoni (Thomas, 1907)",Animalia Chordata Mammalia Monotremata Tachygl...,Vulnerable A2cd ver 3.1,"July 24, 2015",2016,Decreasing,Terrestrial,This species ranges throughout the central mou...,(see Appendix for additional information)\nThi...,(see Appendix for additional information)\nAll...
2,"Zaglossus bruijnii (Peters & Doria, 1876)",Animalia Chordata Mammalia Monotremata Tachygl...,Critically Endangered A2acd ver 3.1,"July 24, 2015",2016,Decreasing,Terrestrial,This species is recorded only from the Vogelko...,(see Appendix for additional information)\nThi...,(see Appendix for additional information)\nAll...
3,"Caenolestes caniventer Anthony, 1921",Animalia Chordata Mammalia Paucituberculata Ca...,Near Threatened ver 3.1,"May 14, 2015",2015,Decreasing,Terrestrial,This species is found in western Ecuador and n...,(see Appendix for additional information)\nThi...,(see Appendix for additional information)\nThe...
4,"Caenolestes condorensis Albuja & Patterson, 1996",Animalia Chordata Mammalia Paucituberculata Ca...,Vulnerable D1+2 ver 3.1,"March 1, 2015",2015,Unknown,Terrestrial,This species is only found in one locality in ...,(see Appendix for additional information)\nThe...,(see Appendix for additional information)\nPla...


In [7]:
# raw_df['Red List Category & Criteria'].unique()

In [8]:
cleaned_df = raw_df.copy() # just copying so that I have both versions.

In [9]:
cleaned_df['Red List Category & Criteria'] = cleaned_df['Red List Category & Criteria'].apply(
    lambda x: (
        'critically_endangered' if 'Critically Endangered' in x else
        'endangered' if 'Endangered' in x else
        'vulnerable' if 'Vulnerable' in x else
        'near_threatened' if 'Near Threatened' in x else
        'least_concern' if 'Least Concern' in x else
        'data_deficient' if 'Data Deficient' in x else
        'extinct' if 'Extinct' in x else x

    )
)


In [10]:
cleaned_df['Red List Category & Criteria'].unique()

array(['critically_endangered', 'vulnerable', 'near_threatened',
       'least_concern', 'data_deficient', 'endangered', 'extinct'],
      dtype=object)
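
The order of the substring checks matters: 'Critically Endangered' must be tested before 'Endangered', and a value such as 'Extinct in the Wild' would otherwise be labelled plain 'extinct'. This is a minimal sketch of the same mapping as an ordered table. The function name `simplify_category` and the label `extinct_in_the_wild` are assumptions, and that category does not occur in the current data.

```python
# Ordered so that longer, more specific labels are tested first.
CATEGORY_LABELS = [
    ("Critically Endangered", "critically_endangered"),
    ("Extinct in the Wild", "extinct_in_the_wild"),
    ("Endangered", "endangered"),
    ("Vulnerable", "vulnerable"),
    ("Near Threatened", "near_threatened"),
    ("Least Concern", "least_concern"),
    ("Data Deficient", "data_deficient"),
    ("Extinct", "extinct"),
]


def simplify_category(value):
    """Map a raw 'Red List Category & Criteria' string to a short label."""
    if not isinstance(value, str):
        return value  # leave missing values (NaN) untouched
    for label, short in CATEGORY_LABELS:
        if label in value:
            return short
    return value  # unknown text is passed through unchanged
```

It can be applied with `cleaned_df['Red List Category & Criteria'].apply(simplify_category)`. Unlike the lambda, it does not fail on missing values.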

### B. Cleaning Habitat and Ecology, and Threats:

In [11]:
cleaned_df['Habitat and Ecology']  = cleaned_df['Habitat and Ecology'].str.replace(r'\(see Appendix for additional information\)', '', regex=True)
# Replace line breaks with a space so that words on adjacent lines are not fused together
cleaned_df['Habitat and Ecology']  = cleaned_df['Habitat and Ecology'].str.replace(r'\n', ' ', regex=True).str.strip()
# cleaned_df['Habitat and Ecology']



In [12]:
cleaned_df['Threats']  = cleaned_df['Threats'].str.replace(r'\(see Appendix for additional information\)', '', regex=True)
# Replace line breaks with a space so that words on adjacent lines are not fused together
cleaned_df['Threats']  = cleaned_df['Threats'].str.replace(r'\n', ' ', regex=True).str.strip()
cleaned_df['Threats']

Unnamed: 0,Threats
0,All long-beaked echidnas Zaglossus are highly ...
1,All long-beaked echidnas Zaglossus are highly ...
2,All long-beaked echidnas Zaglossus are highly ...
3,The major threat to this species is deforestation
4,Plausible threats could include land conversio...
...,...
900,Hunting of this species is unsustainable (Gold...
901,The major threat is forest loss due to illicit...
902,This species is threatened by habitat loss and...
903,The entirety of this species’ known distributi...
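
The two cleaning steps (dropping the appendix note and removing line breaks) are the same for both columns, so they can be wrapped in one function. This is a minimal sketch; the name `clean_text` is hypothetical. It collapses every run of whitespace into a single space, which keeps words on adjacent lines separated.

```python
import re

APPENDIX_NOTE = "(see Appendix for additional information)"


def clean_text(value):
    """Drop the appendix note and collapse runs of whitespace to one space."""
    if not isinstance(value, str):
        return value  # leave missing values untouched
    without_note = value.replace(APPENDIX_NOTE, " ")
    return re.sub(r"\s+", " ", without_note).strip()
```

It could be applied to each column in turn, for example `cleaned_df['Threats'].apply(clean_text)`.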


### C. Cleaning the Scientific Name column:

In [13]:
print(cleaned_df['Scientific Name'])

0        Zaglossus attenboroughi Flannery & Groves, 1998
1                       Zaglossus bartoni (Thomas, 1907)
2              Zaglossus bruijnii (Peters & Doria, 1876)
3                   Caenolestes caniventer Anthony, 1921
4       Caenolestes condorensis Albuja & Patterson, 1996
                             ...                        
900    Lepilemur seali Louis Jr., Engberg, Lei, Geng,...
901    Lepilemur septentrionalis Rumpler & Albignac, ...
902    Lepilemur tymerlachsoni Louis Jr., Engberg, Le...
903    Lepilemur wrightae Louis Jr., Engberg, Lei, Ge...
904         Palaeopropithecus ingens G. Grandidier, 1899
Name: Scientific Name, Length: 905, dtype: object


In [20]:
cleaned_df['Scientific Name'] = cleaned_df['Scientific Name'].str.split().str[:2].str.join(' ')
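
Keeping only the first two words discards everything after the species epithet, including the authority and any subspecies designation. This may explain why several distinct reports end up under the same name. The sketch below splits a name into the binomial and the remainder instead of dropping the remainder; the function name `split_name` is hypothetical.

```python
def split_name(full_name):
    """Split a name into (binomial, remainder) on whitespace."""
    words = full_name.split()
    return " ".join(words[:2]), " ".join(words[2:])


binomial, rest = split_name("Zaglossus bartoni (Thomas, 1907)")
```

Storing the remainder in its own column would keep the information available for later checks.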


In [25]:
cleaned_df[cleaned_df['Scientific Name'].duplicated(keep=False)]
# keep=False includes both the first occurrence and the subsequent occurrence.

Unnamed: 0,Scientific Name,Taxonomy,Red List Category & Criteria,Date Assessed,Year Published,Current Population Trend,Systems,Range Description,Habitat and Ecology,Threats
100,Dasycercus cristicauda,Animalia Chordata Mammalia Dasyuromorphia Dasy...,near_threatened,"March 18, 2014",2016,Stable,Terrestrial,The Crest-tailed Mulgara has (or had) a wide d...,The Crest-tailed Mulgara is a mostly nocturnal...,Threats are poorly understood but include pred...
101,Dasycercus cristicauda,Animalia Chordata Mammalia Dasyuromorphia Dasy...,near_threatened,"March 18, 2014",2016,Stable,Terrestrial,The Crest-tailed Mulgara has (or had) a wide d...,The Crest-tailed Mulgara is a mostly nocturnal...,Threats are poorly understood but include pred...
590,Trachypithecus hatinhensis,Animalia Chordata Mammalia Primates Cercopithe...,endangered,"November 21, 2015",2021,Decreasing,Terrestrial,The Ha Tinh Langur occurs in limestone areas i...,This species is typically found in forested ha...,"The main threat to this species is hunting, as..."
594,Trachypithecus hatinhensis,Animalia Chordata Mammalia Primates Cercopithe...,endangered,"November 21, 2015",2021,Decreasing,Terrestrial,The Ha Tinh Langur occurs in limestone areas i...,This species is typically found in forested ha...,"The main threat to this species is hunting, as..."
762,Cacajao calvus,Animalia Chordata Mammalia Primates Pitheciidae,least_concern,"March 6, 2020",2022,Stable,Terrestrial,The geographic distribution of Cacajao calvus ...,Most of the available information on the ecolo...,Habitat loss is the principal threat to this s...
765,Cacajao calvus,Animalia Chordata Mammalia Primates Pitheciidae,vulnerable,"January 26, 2015",2022,Decreasing,Terrestrial,© The IUCN Red List of Threatened Species: Cac...,No field data on ecology are available for thi...,The subspecies has a restricted range between ...
766,Cacajao calvus,Animalia Chordata Mammalia Primates Pitheciidae,least_concern,"August 14, 2021",2021,Decreasing,Terrestrial,Cacajao calvus rubicundus is endemic to Brazil...,Other forms of Cacajao calvus inhabit varzea h...,There are no evident threats for C
767,Cacajao calvus,Animalia Chordata Mammalia Primates Pitheciidae,vulnerable,"August 28, 2021",2021,Decreasing,Terrestrial,Cacajao calvus ucayalii is found south of the ...,"Occurs in flooded forests, low to medium hill ...",Habitat loss and hunting are the main threats ...


In [32]:
cleaned_df = cleaned_df.drop_duplicates().reset_index(drop=True)
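
`drop_duplicates()` without arguments removes only rows that are identical in every column, so names whose rows differ (such as the *Cacajao calvus* entries above) remain repeated. The pure-Python sketch below illustrates that behaviour on plain dictionaries. It is an analogue for illustration, not the pandas implementation, and the helper names are hypothetical.

```python
from collections import Counter


def drop_exact_duplicates(rows):
    """Remove rows that are identical in every field, keeping the first seen."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def names_still_repeated(rows, name_key="Scientific Name"):
    """Names that occur in more than one of the remaining rows."""
    counts = Counter(row[name_key] for row in rows)
    return sorted(name for name, n in counts.items() if n > 1)
```

On the dataframe itself, re-running `cleaned_df[cleaned_df['Scientific Name'].duplicated(keep=False)]` after the drop shows which names still need a manual decision.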

In [34]:
# index=False avoids writing the row index as an extra unnamed column
cleaned_df.to_csv('cleaned_df.csv', index=False)