
# *NASA-NIFC Incident Specific Data Web Crawler V1*
### By Katrina Sharonin

## Description:
The Incident Data Web Crawler is a recursive prototype program that, given a user-supplied keyword and a starting FTP database URL, traverses the entire tree of file directories to locate URLs containing the keyword. It aims to accelerate the search for incident fire data given a fire name.

The NIFC FTP Server (https://ftp.wildfire.gov/) is an official site for interagency wildland fire incident data and documents. Valuable data for several incidents is stored on the server, including historic fire perimeters, IR mission guides, IR category maps, and incident management documents. The server claims:

>'This ftp service is intended for short-term interagency sharing, not as a file archive or records repository. There shouldn't be anything data that isn't stored in a safer location, or much data that carries over from season to season.' 

However, most data for infrared operations and perimeters, mainly CAL FIRE, is stored here in a long-term manner. Making a public request for data from CAL FIRE will put most users here.

A simple inspection (visit https://ftp.wildfire.gov/public/incident_specific_data/ to see the incident specific data portion of the server) reveals a flaw in the structure of the FTP database: <span style='color:Red'> **no search option is available. Users must manually search through the database by clicking through directories.** </span>

Given this slow and inefficient structure, this script accelerates the search so that research-relevant data can be located quickly.

## Inspiration:
Comparing NASA tools means comparing NASA data on specific incidents. For example, the MASTER tool has its own archive of fire campaigns, with names included (https://daac.ornl.gov/cgi-bin/dataset_lister.pl?p=43). Research centered on comparing NASA data with data collected by wildfire management requires locating that data in the FTP directories. But...

<span style='color:Maroon'>
    
#### **My personal experience was painful to say the least: I spent hours looking for certain fires only to find they don't exist, or got put into some federal and not state folder. The "organization" of data interferes with its usability.**
    
#### **This interference with the data's accessibility ultimately impacts research. The fact that this script needs to exist demonstrates a fault in current data storage for wildfire incidents.**
    
## **If we want good research, we need usable, relevant, and accessible quality data**

#### **By automating search we can understand how much incident specific data is truly present and how much is not, how many fires exist that line up with NASA incident data, and so on. All in all, the search will help characterize available data in order to identify points of weakness, rather than just saying "we need more data."**
</span>

## Goals:
- Take user input of fire
- Search given starting URL directories 
- When matching name is found, **return link(s) to directory with the name contained (provide print statements)**
- Future development add ons: quantity of net files (rather than just dirs existing) -> calculate percent of empty dirs
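
The last goal (percent of empty directories) is not implemented in this notebook. A minimal sketch of how it could be computed, assuming a future crawler records each visited directory URL together with the entries it found (the `listings` dictionary and the `percent_empty` name are hypothetical, not part of the current code):

```python
def percent_empty(listings):
    """Return the percentage of directories whose listing has no entries.

    listings: dict mapping a directory URL to the list of entries found in it
    (a hypothetical structure; the current crawler only prints results).
    """
    if not listings:
        return 0.0
    empty = sum(1 for entries in listings.values() if not entries)
    return 100.0 * empty / len(listings)


# made-up crawl results for illustration only
sample_listings = {
    "https://example.test/data/a/": [],
    "https://example.test/data/b/": ["perimeter.shp"],
    "https://example.test/data/c/": [],
    "https://example.test/data/d/": ["map.pdf", "ir.zip"],
}
print(percent_empty(sample_listings))
```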

***

## Part 1: Welcome + Process User Input
- Run block to accept user input
- Take input and process for use in the incoming blocks


In [40]:
# RUN ME FOR USER WELCOME - Press Shift+Enter

# import libraries to parse and read through the websites 
import bs4, requests
import plyer
import time 

print('---- Welcome to the Incident Data Web Crawler ----')
print('')
print('To get started, please run the blocks below in sequential order using Shift + Enter')
print('')
print('This program will take your keyword and start from the Incident Specific Data/ directory.')
print('The final block will print out results, it may take a few minutes due to the sheer size.')
print('')
print('Sample inputs: "Fort Huachuca", "Avenza", "Melozitna", "Dixie", "Contact Creek"')
print('')
print('Report any bugs to katrina.sharonin@nasa.gov')



---- Welcome to the Incident Data Web Crawler ----

To get started, please run the blocks below in sequential order using Shift + Enter

This program will take your keyword and start from the Incident Specific Data/ directory.
The final block will print out results, it may take a few minutes due to the sheer size.

Sample inputs: "Fort Huachuca", "Avenza", "Melozitna", "Dixie", "Contact Creek"

Report any bugs to katrina.sharonin@nasa.gov


In [49]:
# RUN ME TO ACCEPT USER INPUT FOR KEYWORD

# Accept user prompt: Run this block to gather user input on the fire wanted
# after acceptance, regex will be applied to the key term -> grab capital alikes 
# 4+ situations:
# 1. King (first letter capitalized)
# 2. KING (all caps)
# 3. king (all lower)
# 4. King Fire (etc versions of these) -> exclude anything aligning with fire

print('Enter your search keyword.')
print('NOTE: Enter incident name in ALL lower case with no "Fire" or "fire" term. Please avoid underscores. Include spaces in between for names i.e. Creek Forest, not CreekForest for optimal results')
print(' ')
keyword = input('Please input the incident name you are searching for: ')
print('')
print('Name "' + keyword  + '" received.')

# normalize the input to lower case so that matching is case-insensitive
keyword = keyword.lower()

# remove the term 'fire' if present
# the input is already lower case, so this single check also covers "Fire" and "FIRE"
if 'fire' in keyword:
    print('The keyword "fire" was found in your input. Normalizing...')
    keyword = keyword.replace("fire", "")

# strip leading and trailing whitespace only - spaces between words are encoded below
keyword = keyword.strip()

# match encoding of inside spaces - " " do not exist in URLs of FTP server!
# also replace _ as they are mixed with spaces -> this way processing program treats them fairly
if " " in keyword:
    keyword = keyword.replace(" ", "%20")
    print('The keyword contained a space, it has been replaced with a "%20" to match URL encoding')
if "_" in keyword:
    keyword = keyword.replace("_", "%20")
    print('The keyword contained an underscore, it has been replaced with a "%20" to match URL encoding')

print(' ')
print('Final keyword for input:')
print("'"+ keyword + "'")

# Validate the normalized keyword before announcing success
# islower() is False when the keyword is empty or contains no letters
if keyword.islower():
    print('')
    print('Successful Input. Initiating search...')
else:
    print('Keyword is empty or contains no letters after normalization. Please re-run this block and enter an incident name without the term "fire".')

Enter your search keyword.
NOTE: Enter incident name in ALL lower case with no "Fire" or "fire" term. Please avoid underscores. Include spaces in between for names i.e. Creek Forest, not CreekForest for optimal results
 
Please input the incident name you are searching for: harrison gulch

Name "harrison gulch" received.
The keyword contained a space, it has been replaced with a "%20" to match URL encoding
 
Final keyword for input:
'harrison%20gulch'

Successful Input. Initiating search...
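
The normalization steps above (lower-casing, dropping the term "fire", trimming, and encoding spaces and underscores as `%20`) could also be packaged as one reusable function. The sketch below is an illustration, not the notebook's code: `normalize_keyword` is a hypothetical name, and it deliberately differs in one respect by removing only the standalone word "fire", so a name such as "Campfire Ridge" keeps its letters.

```python
import re


def normalize_keyword(raw):
    """Hypothetical helper: turn raw user input into the %20-encoded keyword."""
    keyword = raw.lower()
    # treat underscores as spaces so both spellings are handled the same way
    keyword = keyword.replace("_", " ")
    # drop the standalone word "fire" (assumption: only whole words are removed)
    keyword = re.sub(r"\bfire\b", " ", keyword)
    # collapse repeated whitespace and trim the ends
    keyword = " ".join(keyword.split())
    # encode inner spaces the way they appear in the server's URLs
    return keyword.replace(" ", "%20")


print(normalize_keyword("Harrison Gulch"))
print(normalize_keyword("King Fire"))
```

Collapsing whitespace before encoding avoids a doubled `%20%20` when "fire" is removed from the middle of a name.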


***

## Part 2: Main functions

***

## Design considerations and notes during dev:
- The FTP site builds its URLs by appending directory names, so the search recursively appends each directory name to the current URL until it has looped through ALL directories looking for the name
- Directory names are not always lower case, so each name is lower-cased before it is compared with the keyword (which was normalized with the lower() method)
- append to all possible present directories that have name -> keep searching till directory is 

- Ultimately return the directories, not the files (so should not have . at the end, instead should end with slash!)
- EX: /public/incident_specific_data/calif_n/2020_FEDERAL_Incidents/CA-KNF-007035_Slater/IR/NIROPS -> the parent is excluding '/NIROPS'
- look for title 'Parent directory' -> exclude from the aref class loop list
- GOAL: Capture all directories with the names present -> may include many subdirs captured

Ex dir and wanted results:
- main
   --> NIROPS
       --> Training
       --> ttfs
           --> phoenix_samples
               --> getty_fire_shape
   --> calin
       --> Getty_fire (this directory name should be detected)
           --> getty.shp (these files should be detected by the program)
           --> getty.zip
       --> South fire
   --> calis
       --> Tahoe fire
       --> sad fire
       
If 'getty' was used, the following links to dirs should return:
- https.../calin
- https.../Getty_fire
- https.../phoenix_samples

The results are followed by the message: "The following directories contain files that match your search!"

**Consequently, this may cause dir subduplicates to get returned. Nevertheless it narrows down searches to links only where the name appears!**
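
To check the expected results for the example tree above, here is a small sketch that runs the same idea on an in-memory copy of that tree instead of the live server. The nested dictionary and the `find_matching_dirs` name are assumptions made for illustration: directories map to dictionaries, files map to `None`, and names ending in `/` are treated as directories.

```python
def find_matching_dirs(listing, keyword, path="main/"):
    """Return paths of directories whose listing contains the keyword."""
    matches = []
    # a directory matches when any of its entries contains the keyword
    if any(keyword in name.lower() for name in listing):
        matches.append(path)
    # depth-first: recurse into sub-directories only, never into files
    for name, child in listing.items():
        if name.endswith("/"):
            matches.extend(find_matching_dirs(child, keyword, path + name))
    return matches


# in-memory stand-in for the example directory tree
example_tree = {
    "NIROPS/": {
        "Training/": {},
        "ttfs/": {"phoenix_samples/": {"getty_fire_shape/": {}}},
    },
    "calin/": {
        "Getty_fire/": {"getty.shp": None, "getty.zip": None},
        "South fire/": {},
    },
    "calis/": {"Tahoe fire/": {}, "sad fire/": {}},
}

for found in find_matching_dirs(example_tree, "getty"):
    print(found)
```

For 'getty' this returns the phoenix_samples, calin, and Getty_fire directories, which shows the parent/child overlap mentioned above: both calin and Getty_fire are reported.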

***

## Function 1: searchForKeyword()

- Description: given a page URL, return true if the keyWord is found within the files/directories listed at that URL. For example, for https://ftp.wildfire.gov/public/incident_specific_data/ and an input of 'fuels', it returns true because that page lists a folder titled 'Fuels/'
- Inputs: pageURL (string), keyWord (string)
- Output: boolean
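
Before a page can be searched, the header and parent links have to be separated from the real entries. The sketch below shows one possible way to do that, under two assumptions that should be verified against the live server: that the listing is an Apache-style index whose column-sort links have hrefs starting with `?`, and that the parent link is labeled "Parent Directory". It uses the standard library's `html.parser` instead of BeautifulSoup only to stay self-contained; `LinkCollector` and `listing_entries` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def listing_entries(html):
    """Return hrefs of real entries, skipping sort links and the parent link."""
    parser = LinkCollector()
    parser.feed(html)
    entries = []
    for href, text in parser.links:
        if href.startswith("?"):          # assumed column-sort links
            continue
        if text == "Parent Directory":    # assumed parent link label
            continue
        entries.append(href)
    return entries


# made-up listing page for illustration
sample_page = """
<a href="?C=N;O=D">Name</a> <a href="?C=S;O=A">Size</a>
<a href="/public/">Parent Directory</a>
<a href="Fuels/">Fuels/</a>
<a href="NameChange_Fire/">NameChange_Fire/</a>
<a href="map.pdf">map.pdf</a>
"""
print(listing_entries(sample_page))
```

Filtering on the href and the exact link text, rather than on a substring of the whole tag, keeps a directory such as the made-up `NameChange_Fire/` in the results.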

***


In [50]:
# RUN TO INITIATE SEARCHFORKEYWORD() FUNCTION IN MEMORY

# Repeat imports incase restart of kernel wipes local var memory
import bs4, requests
import plyer
import time 

# searchForKeyword() function
# Description: given a page URL, return true if the keyWord is found within the listed files/directories listed in the URL.
# Inputs: pageURL (string), keyWord (string)
# Output: boolean

def searchForKeyword(pageURL, keyWord):

    # Download page
    # Example of valid pageURL string input: 'https://ftp.wildfire.gov/public/incident_specific_data/'
    
    # request the page and return False on any HTTP or connection error
    try:
        # timeout (seconds) so that a stalled connection raises Timeout instead of hanging
        getPage = requests.get(pageURL, timeout=30)
        getPage.raise_for_status()
    except requests.exceptions.HTTPError as err: 
        # in case of forbidden access URL 
        print('Accessed a Forbidden URL/URL with error status, return false')
        return False
    except requests.exceptions.Timeout:
        # Maybe set up for a retry, or continue in a retry loop
        print('Timeout occurred, check FTP site status manually')
        return False
    except requests.exceptions.TooManyRedirects:
        # Tell the user their URL was bad and try a different one
        print('Too many redirects, check FTP site status manually')
        return False
    except requests.exceptions.RequestException as e:
        # catastrophic error. bail.
        print('Catastrophic error, bail execution due to RequestException')
        return False
  
    # Parse the HTML of the directory listing
    soup = bs4.BeautifulSoup(getPage.text, 'html.parser')
    
    # check the current text for any emptiness 
    a_categories = soup.find_all('a')
    # pop out using loop to exclude irrelevant categories 
    a_categories_modified = a_categories.copy()
    
    if not a_categories:
        print('Empty page detected, return false')
        return False
    
    # loop through the a_categories to get rid of irreleveant matches like parent dir
    # Also eliminate the Name, Last modified, size from the list to reduce search confusion
        
    # before filtering - debugging
    # print(a_categories)
    
    for link in a_categories:
        
        # eliminate from the link search to prevent recursive returns:
        # parent directory, name, last modified, size, description
        
        # Note: the filter assumes uniformity per every page for its a-class titles
        # Caveat: the checks below are case-sensitive and match anywhere in the tag,
        # so a directory whose name contains 'Name', 'Size' or 'Description' is filtered out too
        if 'Parent Directory' in str(link):
            # remove this link from the a_categories_modified
            a_categories_modified.remove(link)
        elif 'Name' in str(link) or  'Size' in str(link) or 'Last modified' in str(link) or 'Description' in str(link):
            # remove related categories that do not contribute to search, aka redundant categories
            a_categories_modified.remove(link)
    
    
    # after filtering - debugging
    # print(a_categories_modified)
    
    # define default boolean which will be switched on if found
    available = False

    
    # search the modified list of valid links, including files
    for link in a_categories_modified:
        # fetch the a class name of the link
        current_string_check = link.get('href')
        
        # print search - debugging
        # print(str(current_string_check).lower())
        
        if keyWord in str(current_string_check).lower():
            available = True
            break
        # try: if there is %20 -> check one side and the other for appearance
        # check is "_" can appear instead of spaces
        elif ("%20" in keyWord):
            # ISSUE: may be more than one %20 -> iterate through list 
            split = keyWord.split("%20")
            # excluding the %20, try both sides
            # ex: Minto_Lakes != Minto%20Lakes
            # if %20 isn't in, it may bug bc it wants whole thing
            
            # [true, true] -> counter should = 2
            counter_found = 0
            
            # iterate through splits to see if all are in
            for substring in split:
                if substring in str(current_string_check).lower():
                    counter_found += 1
                
            # if the counter matches arr size -> must be true for all substrings
            # therefore the keyword is found
            if len(split) == counter_found:
                available = True
                break
                

    # print('Process finished for:' + pageURL)
    # print('For input word in the given dir, "' + keyWord + '", its existence is')
    # print('Entered keyword function')
            
    # return True if the keyword was found in any link, otherwise False
    return available
    

# TEST CALLS - Debugging Only:
# searchForKeyword('https://ftp.wildfire.gov/public/incident_specific_data/', 'getty')
# searchForKeyword('https://ftp.wildfire.gov/public/incident_specific_data/.swp', 'getty')
# searchForKeyword('https://ftp.wildfire.gov/public/incident_specific_data/calif_n/!CALFIRE/2013_Incidents/CA-BTU-005638-Cedar/GIS/', 'getty')
# searchForKeyword('https://ftp.wildfire.gov/public/incident_specific_data/alaska/2022/', 'minto%20lakes')


***

## Function 2: crawler()
- Description: given a starting URL (by default 'https://ftp.wildfire.gov/public/incident_specific_data/') and a keyword (user-input defined), crawler() goes through each directory in depth-first search (DFS) order. If searchForKeyword() returns true for a URL, that URL is printed with a success message. In either case, the program prints each directory it recurses into. No list of results is returned.
- Inputs: keyword (string which was inputted by user and processed in earlier block), startingPageURL (string)
- Output: NONE (side-effects of printing)


- **Note: future versions of script may attempt to form an array from found URLs. However due to recursive nature, the array can get corrupted or overflowed with information**
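
One possible way to collect the found URLs without recursion is an explicit stack plus a single result list. The sketch below is an assumption about how a future version might look, not the current implementation: `crawl` is a hypothetical name, and `list_links` is a hypothetical callable that returns the hrefs listed at a URL, injected so the traversal can be tried without touching the server.

```python
from urllib.parse import urljoin


def crawl(start_url, keyword, list_links):
    """Iterative depth-first crawl that returns every matching directory URL."""
    matches = []
    visited = set()
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        hrefs = list_links(url)
        # same criterion as the notebook: keyword appears in a listed entry
        if any(keyword in href.lower() for href in hrefs):
            matches.append(url)
        # push in reverse so directories are visited in listing order
        for href in reversed(hrefs):
            if href.endswith("/"):
                child = urljoin(url, href)
                # stay inside the starting directory (skips parent links)
                if child.startswith(start_url):
                    stack.append(child)
    return matches


# made-up listings standing in for the server
fake_site = {
    "https://example.test/data/": ["calin/", "calis/", "/"],
    "https://example.test/data/calin/": ["Getty_fire/", "readme.txt"],
    "https://example.test/data/calin/Getty_fire/": ["getty.shp"],
    "https://example.test/data/calis/": [],
}
print(crawl("https://example.test/data/", "getty", lambda u: fake_site.get(u, [])))
```

Because every match is appended to one list owned by a single function call, there is no shared state between recursive frames to get out of sync, and each page only needs to be fetched once.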

***

In [51]:
# RUN TO BEGIN SEARCH USING CRAWLER() FUNCTION

# Repeat imports incase restart of kernel wipes local var memory
import bs4, requests
import plyer
import time 

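# crawler() function
# Description: depth-first search from currPageURL; prints every directory URL whose listing contains the keyword
# Inputs: currPageURL (string), keyword (string which was inputted by user and processed in earlier block)
# Output: no list of matches (results are printed); returns 1 on completion or False on a request error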

# starting URL -> this can be customized, but is used to reduce the search span for the function 
# DEFAULT:

starting_URL = 'https://ftp.wildfire.gov/public/incident_specific_data/'
# empty array which will be returned 
# append to array after every recursive call to the searchForKeyword function 
# found_URL_matches = []

def crawler(currPageURL, keyword):
    
    # search the current page with boolean returned
    keyword_boolean = searchForKeyword(currPageURL, keyword)
    
    if keyword_boolean:
        # if found, display result
        print('')
        print('NAME FOUND! Printing directory URL that has "' + keyword + '" in it...')
        print(currPageURL)
        print('')
        # append the URL to arr
        # found_URL_matches.append(currPageURL)
        
        # TEMPORARY
        # return
        
    # now for every directory, append -> create URL -> recurse
    # make sure to update the stored ver 
    # this could cause issues depending on how it traverses the file tree
    
    # repeat get page process for the current URL before access
    try:
        # timeout (seconds) so that a stalled connection raises Timeout instead of hanging
        getPage = requests.get(currPageURL, timeout=30)
        getPage.raise_for_status()
    except requests.exceptions.HTTPError as err: 
        # in case of forbidden access URL 
        print('Accessed a Forbidden URL/URL with error status, return false')
        return False
    except requests.exceptions.Timeout:
        # Maybe set up for a retry, or continue in a retry loop
        print('Timeout occurred, check FTP site status manually')
        return False
    except requests.exceptions.TooManyRedirects:
        # Tell the user their URL was bad and try a different one
        print('Too many redirects, check FTP site status manually')
        return False
    except requests.exceptions.RequestException as e:
        # catastrophic error. bail.
        print('Catastrophic error, bail execution due to RequestException')
        return False
  
    # Parse the HTML of the directory listing
    soup = bs4.BeautifulSoup(getPage.text, 'html.parser')
    
    # check the current text for any emptiness 
    a_categories = soup.find_all('a')
    # pop out using loop to exclude irrelevant categories 
    a_categories_modified = a_categories.copy()
    
    # loop through the a_categories to get rid of irreleveant matches like parent dir
    # Also eliminate the Name, Last modified, size from the list to reduce search confusion
        
    for link in a_categories:
        # eliminate from the link search to prevent recursive returns:
        # want to eliminate the possibility of lower case misses -> make whole thing lower
        if 'Parent Directory' in str(link):
            # remove this link from the a_categories_modified
            a_categories_modified.remove(link)
        elif 'Name' in str(link) or  'Size' in str(link) or 'Last modified' in str(link) or 'Description' in str(link):
            # remove related categories that do not contribute to search, aka redundant categories
            a_categories_modified.remove(link)
    
    
    for link in a_categories_modified:
        # recurse and pick directories
        # if is a directory -> call function with the URL name appended to meet it
        actual_URL = str(link.get('href'))
        # if ends with '/' -> is a directory -> recurse 
        # DO NOT RECURSE ON FILES!
        if actual_URL.endswith('/'):
            # print(actual_URL)
            # print('Directory detected through / in link')
            
            # modify currentURL by appending this
            new_URL_to_recurse = currPageURL + actual_URL
            
            # recurse on this file branch
            print('Recursing in: ' + new_URL_to_recurse)
            # print(found_URL_matches)
            # found_URL_matches.append(crawler(new_URL_to_recurse, keyword, []))
            
            # recursive call
            crawler(new_URL_to_recurse, keyword)
    
    # return found_URL_matches
    return 1


# start the crawl; matches are printed as they are found
# note: crawler() returns a status value (1 or False), not a list of URLs
found_URL_matches = crawler(starting_URL, keyword)

# print(found_URL_matches)

# BUG: Fort%20Huachuca%20RX != match with 'fort huachuca'
# likely due to mandated space -> then needs to be set of keywords or replace space with %20?


Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/Arizona/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/ForestData/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/ForestData/R06_CSA4/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/ForestData/R06_UMF/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/ForestData/R06_UMF/Ellis/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/BaseData/UpdateNationalBaseMap/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/Fuels/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/Fuels/AZCNF/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/Fuels/AZCNF/Cochise%20Stronghold%20Project/
Recursing in: https://ftp.wildfire.gov/public/incident_specific_data/Fu

KeyboardInterrupt: 

In [26]:
# test space here for string equivalencies

#Fort%20Huachuca%20RX != match with 'fort huachuca'

lower = "Fort%20Huachuca%20RX"
lower = lower.lower()
keyWord = 'fort huachuca'

boolean = keyWord in lower
print(boolean)

keyWord = keyWord.replace(" ", "%20")
print(keyWord)

# therefore need to replace spaces with %20 mark!


False
fort%20huachuca
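
The test above shows that a keyword with a literal space never matches a `%20`-encoded directory name. An alternative to encoding the keyword is to decode both sides to one canonical form before comparing. The sketch below is only an illustration of that idea (`canonical` and `href_matches` are hypothetical names); it uses `urllib.parse.unquote` from the standard library and treats underscores as spaces, so names such as Minto_Lakes also match.

```python
from urllib.parse import unquote


def canonical(name):
    """Decode %XX escapes, lower-case, and treat underscores as spaces."""
    decoded = unquote(name).lower().replace("_", " ")
    # collapse repeated whitespace so spacing differences do not matter
    return " ".join(decoded.split())


def href_matches(keyword, href):
    """Return True if the keyword appears in the href after normalizing both."""
    return canonical(keyword) in canonical(href)


print(href_matches("fort huachuca", "Fort%20Huachuca%20RX/"))
print(href_matches("minto lakes", "Minto_Lakes/"))
```

With this approach the keyword can be kept exactly as the user typed it, whether it contains spaces, underscores, or `%20`.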
