# Tell me about my sea time (using information from R2R)

The Rolling Deck to Repository (R2R) Program publishes a [beautiful API here](https://www.rvdata.us/about/technical-details/services/api).

The intention is to use a person's name to get all cruises they were on, then days on the ship, then miles sailed on each cruise.

## Process Flow

To get the cruises a person was on:

1. Run a `/person/person_id/{person_id}` query for each combination of last and first names.
2. Get the person's list of cruises from NautilusLive.org.
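
The `person_id` in step 1 is the name in `LastName, FirstName` form, percent-encoded for the URL path. A minimal sketch of building that URL for every name combination, assuming the same base URL used in the code cells of this notebook; `person_url` and the example names are hypothetical:

```python
from itertools import product
from urllib.parse import quote

R2R_API = "https://service.rvdata.us/api"


def person_url(last_name, first_name):
    """Build the R2R person query URL for one `LastName, FirstName` pair."""
    # quote() percent-encodes the comma, the space, and any apostrophes
    person_id = quote("{0}, {1}".format(last_name, first_name), safe="")
    return "{0}/person/person_id/{1}".format(R2R_API, person_id)


# Hypothetical spelling variations of one person's name
last_names = ["Obrien", "O'Brien"]
first_names = ["Pat"]

urls = [person_url(last, first) for last, first in product(last_names, first_names)]
for url in urls:
    print(url)
```

Encoding the whole `person_id` this way avoids hand-writing `%2C%20` and also handles apostrophes in names.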

To get days and miles sailed, get data from R2R, then process the retrieved data file and/or local data files. Loop through all applicable cruises and process files, in this order of format preference:

1. Get R2R-format processed nav from a `/product/cruise_id/{cruise_id}` query (not all cruises have this data ready to download).
2. Get INS data from a `/fileset/cruise_id/{cruise_id}` query.
3. Get data from OET people and put it in the `data-local` directory.

aaaand then process that data. Using... ?
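
One possible answer that needs no extra libraries: once a nav file has been reduced to an ordered list of positions, the miles sailed can be approximated by summing great-circle (haversine) distances between consecutive fixes. This is only a sketch. It assumes "miles" means nautical miles, treats the Earth as a sphere, and leaves the parsing of the nav file out entirely; `haversine_nm` and `track_length_nm` are hypothetical names.

```python
import math

# Mean Earth radius in km, converted to nautical miles (1 nm = 1.852 km)
EARTH_RADIUS_NM = 6371.0 / 1.852


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def track_length_nm(fixes):
    """Sum the leg distances along an ordered list of (lat, lon) fixes."""
    return sum(haversine_nm(*a, *b) for a, b in zip(fixes, fixes[1:]))


# Made-up track: one degree north along the prime meridian, then one degree east
print(round(track_length_nm([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]), 1))
```

A geodesic library would be more accurate on an ellipsoid, but for one-minute nav fixes the spherical approximation is likely close enough for a sea-time summary.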

## TODOs

- [ ] some entries had `file_format` but not `url` or `actual_url`. deal with.
  - [x] check if `actual_url` is present, non-empty, and valid.
  - [ ] write code for downloading files that have URLs
- [ ] lookup geo and/or nav libraries and see if easy option to parse nav files
- [ ] how best to look for errors or variations in spelling of someone's name?
  - ~~[ ] get all expeditions of a ship (within a timeframe?) and join the names?~~
  - [x] display all cruises returned for that person with the cruise's dates and leave it up to them to double-check
  - [x] implement the combination of results from multiple searches with name variations
  - [ ] allow user to manually add cruise IDs that were not found in R2R or NautilusLive. Explain where to put data files.
- [ ] parse [R2R formatted processed nav data](https://service.rvdata.us/info/format/100157)
- [ ] parse other r2rnav data type?
- [ ] ?parse INS data? or just say 'sorry, wait for the data to be available'?
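
For the download TODO above, a rough sketch of "fetch the file unless it is already on disk". It uses only the standard library so it stands alone, although the notebook itself uses `requests`. `download_if_missing`, its `opener` parameter, and the return shape are all hypothetical choices; the `opener` hook exists only so the function can be exercised without network access.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url, dest_dir, timeout=15, opener=urllib.request.urlopen):
    """Download `url` into `dest_dir` unless a file of that name already exists.

    Returns (local_path, downloaded) where `downloaded` is False if the
    existing local copy was reused.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Use the last path segment of the URL as the local file name
    local_path = dest_dir / url.rstrip("/").split("/")[-1]
    if local_path.exists():
        return local_path, False

    # Stream the response body to disk instead of holding it all in memory
    with opener(url, timeout=timeout) as response, open(local_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return local_path, True
```

The returned path could then be stored in `cruisesDict` under a key such as `r2rnav_local_file`.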


In [None]:
import time

import requests
from bs4 import BeautifulSoup
from tqdm.auto import tqdm

In [None]:
# Edit the values of these two variables to the person to search for
# the variables are lists in order to account for multiple names due to spelling/punctuation variations/mistakes, changed first or last names, etc.
# e.g. lastNames = ["Obrien", "O'Brien", "O'brian"] etc. - for instances where the metadata sent to r2r was entered incorrectly
# Note: Capitalization does not matter at all for r2r, and the names are converted to lowercase for NautilusLive.org
# DO NOT use this to search for multiple people at once, it will join their records.
lastNames = ["Lowe"]
firstNames = ["Justin"]

# How many seconds should we wait between making requests to R2R api?
wait_time = 0.5
req_timeout = 15

## Get Cruises from R2R with Person's Name(s)

In [None]:
# get cruises from r2r
# https://www.rvdata.us/about/technical-details/services/api
# Use format `LastName, FirstName` as person_id

tempSetCruises = set()

# Try all combinations of first and last names
for lastName in lastNames:
    for firstName in firstNames:
        try:
            response = requests.get(
                "https://service.rvdata.us/api/person/person_id/{0}%2C%20{1}".format(
                    lastName, firstName
                ),
                timeout=req_timeout,
            )
            response.raise_for_status()

            searchNameResults = response.json()["data"]

            # Track this name's cruises separately so the count printed below is per name, not cumulative
            nameCruises = {cruiseResult["cruise_id"] for cruiseResult in searchNameResults}
            tempSetCruises.update(nameCruises)

            print(
                "Found {0} cruises for {1} {2}: {3}".format(
                    len(nameCruises), firstName, lastName, ", ".join(sorted(nameCruises))
                )
            )

            if len(firstNames) > 1 or len(lastNames) > 1:
                # Pause between requests to be nice
                time.sleep(wait_time)

        except requests.exceptions.HTTPError as errh:
            print(f"HTTP Error for {firstName} {lastName}: {errh}")
        except requests.exceptions.ConnectionError as errc:
            print(f"Connection Error for {firstName} {lastName}: {errc}")
        except requests.exceptions.Timeout as errt:
            print(f"Timeout Error for {firstName} {lastName}: {errt}")
        except requests.exceptions.RequestException as err:
            print(f"Error for {firstName} {lastName}: {err}")

if len(firstNames) > 1 or len(lastNames) > 1:
    print(
        "\nFinal sorted list of all cruises found: {0}".format(
            ", ".join(sorted(tempSetCruises))
        )
    )

print(
    "NOTE: If you notice any missing cruises or cruises you weren't on, you can modify the 'lastNames' and 'firstNames' lists at the top of this notebook."
)


## Try to also get Cruises from Person's page on NautilusLive.org

In [None]:
# Also get cruise IDs from links on the user's NautilusLive.org page (if applicable)
nlSetCruises = set()

# NautilusLive.org user page base URL
nautilusLivePageURL = "https://nautiluslive.org/people/"

nlCruises = dict()

# Try all combinations of names
for firstName in firstNames:
    for lastName in lastNames:
        # combine first name with last name with hyphens, all lowercase
        nautilusLivePageName = "-".join([firstName, lastName]).lower()

        try:
            # use BeautifulSoup to parse the HTML page and extract the links from all <div> elements with class="cruises"
            response = requests.get(nautilusLivePageURL + nautilusLivePageName, timeout=req_timeout)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, "html.parser")

            # find all <div> elements with class="cruises"
            cruiseLinks = soup.find_all("div", class_="cruises")

            # extract the href attribute from each <a> element in the <div> elements with class="cruises"
            for div in cruiseLinks:
                a_tags = div.find_all('a')
                for a in a_tags:
                    href = a.get('href')
                    if href:
                        # parse out the cruise ID from the href attribute e.g. "NA160" from "/cruise/NA160"
                        cruise_id = href.split('/')[-1]
                        # Validate that the cruise ID has the format "NA" followed by three digits
                        if cruise_id.startswith('NA') and len(cruise_id) == 5 and cruise_id[2:].isdigit():
                            nlSetCruises.add(cruise_id)

                            # get name of link from <a> element
                            cruise_name = a.get_text()
                            # add cruise_id and cruise_name to nlCruises dict
                            nlCruises[cruise_id] = cruise_name

            if len(firstNames) > 1 or len(lastNames) > 1:
                print(
                    "Found the following cruises on NautilusLive.org for {0} {1}".format(
                        firstName, lastName
                    )
                )

            time.sleep(wait_time)  # Be nice to the server

        except requests.exceptions.RequestException as err:
            print(f"Error accessing NautilusLive.org for {firstName} {lastName}: {err}")

print(
    "\nFinal sorted list of cruises from NautilusLive.org has {0} cruises: {1}. \n".format(
        len(nlSetCruises),
        ", ".join(sorted(nlSetCruises))
    )
)

# print out any cruises that are missing from either list
if nlCruises and nlSetCruises != tempSetCruises:
    missing_from_nl = sorted(tempSetCruises.difference(nlSetCruises))
    if missing_from_nl:
        if len(missing_from_nl) > 1:
            print("These {0} R2R cruises are missing from the person's NautilusLive list: {1}".format(
                len(missing_from_nl), ", ".join(missing_from_nl)))
        else:
            print("This {0} R2R cruise is missing from the person's NautilusLive list: {1}".format(
                len(missing_from_nl), missing_from_nl[0]))

    missing_from_r2r = sorted(nlSetCruises.difference(tempSetCruises))
    if missing_from_r2r:
        if len(missing_from_r2r) > 1:
            print(
                "These {0} NautilusLive cruises are missing from the person's R2R list: {1}".format(
                    len(missing_from_r2r),
                    ", ".join(missing_from_r2r),
                )
            )
        else:
            print(
                "This {0} NautilusLive cruise is missing from the person's R2R list: {1}".format(
                    len(missing_from_r2r),
                    missing_from_r2r[0],
                )
            )

# TODO: use a new variable instead of just adding to existing set?
print("Cruise list had {0} cruises from R2R".format(len(tempSetCruises)))
tempSetCruises = tempSetCruises.union(nlSetCruises)
print("Cruise list now has {0} cruises from R2R and NautilusLive".format(len(tempSetCruises)))
#print(", ".join(sorted(tempSetCruises)))

# TODO: ask user if they want to union both sets of cruises? or just keep as is and add them without asking

## Get metadata about each Cruise

In [None]:
# Loop through Cruises and get data. Use: dates, `has_r2rnav`, ?

# exampleDict =	{
#     cruiseID: {
#         "cruise_id": TEXT,
#         "cruise_name": TEXT,
#         "depart_date": 'YYYY-MM-DD',
#         "arrive_date": 'YYYY-MM-DD',
#         "has_r2rnav": true/false,
#     },
#     ...
# }


cruisesDict = dict()

# progressBar = tqdm(sorted(tempSetCruises), desc="Requesting Cruise metadata", unit="request")
progressBar = tqdm(sorted(tempSetCruises), unit="request")
for cruiseID in progressBar:
    # progressBar.set_description("Requesting {0} data. Total Progress:".format(cruiseID))
    # ok, why did I change from use description?? Test or leave it...
    progressBar.set_postfix_str("Requesting {0} data.".format(cruiseID))

    # TODO: currently cruisesDict is only updated/added-to if R2R has metadata on the cruise... will R2R have a cruise personnel list but not metadata?
    #  how to handle NautilusLive cruises that R2R doesn't have?
    #  always create a dict entry with just... cruise_id and ... `error_notes` that is human readable?

    # from https://pycoders-nl.gitbook.io/pycoders-handbook/web-scraping/week-14/python-requests-library-and-fastapi#how-to-make-robust-api-requests
    try:
        response = requests.get(
            "https://service.rvdata.us/api/cruise/cruise_id/{0}".format(cruiseID),
            timeout=req_timeout,
        )
        response.raise_for_status()

        if response.status_code == 200:
            # Parse the JSON body once and reuse the first cruise record
            cruiseData = response.json()["data"][0]
            tempDict = {
                "cruise_id": cruiseData["cruise_id"],
                "cruise_name": cruiseData["cruise_name"],
                "depart_date": cruiseData["depart_date"],
                "arrive_date": cruiseData["arrive_date"],
                "has_r2rnav": cruiseData["has_r2rnav"],
            }

            cruisesDict.update({tempDict["cruise_id"]: tempDict})
        elif response.status_code == 204:
            print("ERROR: Cruise ID {0} returns 'No Cruise Found' from R2R".format(cruiseID))
        else:
            # `response.text` is a plain string and cannot be indexed by key, so print it as-is
            print(
                "ERROR: Cruise ID {0} returns Code {1}, text {2}".format(
                    cruiseID,
                    str(response.status_code),
                    response.text,
                )
            )

        # Pause for `wait_time` seconds between each request to be nice. Longer?
        time.sleep(wait_time)
    except requests.exceptions.HTTPError as errh:
        print(errh)
    except requests.exceptions.ConnectionError as errc:
        print(errc)
    except requests.exceptions.Timeout as errt:
        print(errt)
    except requests.exceptions.RequestException as err:
        print(err)
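
With `depart_date` and `arrive_date` stored as `YYYY-MM-DD` strings, the days on each ship can be counted directly from the metadata. A small sketch, assuming both the departure and arrival days count as days at sea (an assumption, not something R2R defines); the cruise IDs and dates below are made up, and `days_on_cruise` is a hypothetical helper:

```python
from datetime import date


def days_on_cruise(depart_date, arrive_date):
    """Count calendar days from departure through arrival, inclusive."""
    depart = date.fromisoformat(depart_date)
    arrive = date.fromisoformat(arrive_date)
    return (arrive - depart).days + 1


# Made-up entries in the same shape as cruisesDict
example_cruises = {
    "EX001": {"depart_date": "2020-01-01", "arrive_date": "2020-01-10"},
    "EX002": {"depart_date": "2020-02-27", "arrive_date": "2020-03-02"},
}

total_days = sum(
    days_on_cruise(cruise["depart_date"], cruise["arrive_date"])
    for cruise in example_cruises.values()
)
print("Total days at sea: {0}".format(total_days))
```

Note that this counts the whole cruise, so a person who joined or left mid-cruise would be over-counted.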

## Manually add missing cruise IDs

In [None]:
# TODO: allow user to add cruise IDs that were not found in R2R.
#  validate they are correct format if they start with `NA`.
#  Then check if they exist in R2R, if not, add to cruisesDict with error note.
#  Check if non-R2R-existent cruises have data in data-local, otherwise warn user that data will not be found for those cruises.
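
A possible starting point for the format check in the TODO above. It assumes the Nautilus pattern is `NA` followed by exactly three digits (the same rule used when scraping NautilusLive.org), normalizes input to upper case, and lets IDs from other ships through untouched. `check_manual_cruise_ids` is a hypothetical name, and looking the IDs up in R2R or `data-local` is left out.

```python
import re

# Assumed Nautilus cruise ID pattern: "NA" followed by exactly three digits
NA_CRUISE_ID = re.compile(r"NA\d{3}")


def check_manual_cruise_ids(raw_ids):
    """Split user-entered cruise IDs into (accepted, rejected) lists."""
    accepted, rejected = [], []
    for raw in raw_ids:
        cruise_id = raw.strip().upper()
        if not cruise_id:
            continue  # ignore blank entries
        # Only IDs that start with "NA" are checked against the pattern
        if cruise_id.startswith("NA") and not NA_CRUISE_ID.fullmatch(cruise_id):
            rejected.append(cruise_id)
        else:
            accepted.append(cruise_id)
    return accepted, rejected


accepted, rejected = check_manual_cruise_ids(["na160", " NA16 ", "FK123", ""])
print("Accepted:", accepted)
print("Rejected:", rejected)
```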

## Get links to nav data files (and their data formats) for each cruise

In [None]:
# Get the urls for r2rnav files for all applicable cruises

progressBar = tqdm(cruisesDict.keys())
for cruiseID in progressBar:
    progressBar.set_postfix_str("Requesting {0} url.".format(cruiseID))
    # print("Cruise {0} has r2rnav data? {1}".format(cruiseID, cruisesDict[cruiseID]['has_r2rnav']))

    if cruisesDict[cruiseID]["has_r2rnav"]:
        try:
            response = requests.get(
                "https://service.rvdata.us/api/product/cruise_id/{0}".format(cruiseID),
                timeout=req_timeout,
            )
            response.raise_for_status()

            r2rnav_expected_formats = ["r2rnav_geocsv", "r2rnav"]
            r2rnav_product_found = False
            r2rnav_found_formats = []
            r2rnav_url_found = False
            r2rnav_url_valid = False
            data_product_formats = []

            # Try each format in order of preference
            for r2rnav_format in r2rnav_expected_formats:
                if not r2rnav_url_found:  # will be false on the first instance of for loop, so on later loops: do not execute code if we already found the nav_url
                    for dataProduct in response.json()["data"]:
                        # first check that this is a data product we care about right now (i.e. it is nav, not CTD data etc somehow)
                        if dataProduct["file_format"] == r2rnav_format:
                            r2rnav_product_found = True
                            r2rnav_found_formats.append(dataProduct["file_format"])

                            # Check if the keys for the `url` and `actual_url` are present and non-empty
                            if "url" in dataProduct and dataProduct["url"] and "actual_url" in dataProduct and \
                                    dataProduct["actual_url"]:
                                r2rnav_url_found = True
                                cruisesDict[cruiseID].update({
                                    "r2rnav_url": dataProduct["url"],
                                    "r2rnav_url_actual": dataProduct["actual_url"],
                                    "r2rnav_format": dataProduct["file_format"],
                                    "has_r2rnav_valid_url": False
                                })

                                # now let's try to access the `actual_url`
                                try:
                                    # Test if URL is accessible
                                    url_check = requests.head(dataProduct["actual_url"], timeout=req_timeout)
                                    url_check.raise_for_status()

                                    cruisesDict[cruiseID].update({
                                        "has_r2rnav_valid_url": True
                                    })
                                    r2rnav_url_valid = True
                                except requests.exceptions.RequestException:
                                    pass
                            # ok they weren't present and non-empty
                            else:
                                if "url" in dataProduct and "actual_url" in dataProduct:  # test if keys are present
                                    # TODO: code out further non-empty url/actual_url tests if needed
                                    if not dataProduct["actual_url"]:  # test if actual_url is non-empty
                                        print("ERROR: Cruise {0} has a data product with a null `actual_url`".format(
                                            cruiseID))

                                        # update `has_r2rnav_valid_url` to false, since we only use `actual_url` to download data file
                                        cruisesDict[cruiseID].update({
                                            "has_r2rnav_valid_url": False,
                                        })

                                    if not dataProduct["url"]:  # test if url is empty
                                        print("ERROR: Cruise {0} has a data product with a null `url`".format(cruiseID))
                                else:
                                    # the keys weren't present...?
                                    print(
                                        "ERROR: Cruise {0} has a data product with no `url` and `actual_url` keys".format(
                                            cruiseID))

                                # print json result for debugging
                                print(dataProduct)

            if not r2rnav_url_valid:
                if not r2rnav_url_found:
                    if not r2rnav_product_found:
                        print(
                            "ERROR: Failed to find {0} formatted data products for Cruise {1}, even though `has_r2rnav` is True".format(
                                ", OR ".join(r2rnav_expected_formats),
                                cruiseID
                            )
                        )
                        for dataProduct in response.json()["data"]:
                            # start making a list of encountered data format types in case we need to use it in error message below
                            data_product_formats.append(dataProduct["file_format"])
                    else:  #yes product, no url
                        print(
                            "ERROR: No 'actual_url' found for Cruise {0}, but `has_r2rnav` is True and found {1} formatted data product(s)".format(
                                cruiseID, ", ".join(r2rnav_found_formats)
                            )
                        )
                else:  # `actual_url` exists and is non-empty, but we could not access it
                    print(
                        "WARN: Failed to access 'actual_url' for Cruise {0}. This could be a temporary error... ".format(
                            cruiseID
                        )
                    )

            # Pause for `wait_time` seconds between each request to be nice. Longer?
            time.sleep(wait_time)
        except requests.exceptions.HTTPError as errh:
            print(errh)
        except requests.exceptions.ConnectionError as errc:
            print(errc)
        except requests.exceptions.Timeout as errt:
            print(errt)
        except requests.exceptions.RequestException as err:
            print(err)

    # else: # Do something to alert user to cruises without r2rnav? don't need to do here, just loop through dict again for if !has_r2rnav #NA096 and 3 other have this other format. txt from r2r saved in this code dir.

In [None]:
# Test output so far
from pprint import pprint

pprint(cruisesDict)

In [None]:
# Test output so far, specifically which cruises don't have r2rnav files
cruises_missing_r2rnav = []
cruises_missing_url_r2rnav = []

for cruiseID in cruisesDict:
    if not cruisesDict[cruiseID]["has_r2rnav"]:
        cruises_missing_r2rnav.append(cruiseID)
    else:
        if "r2rnav_url_actual" not in cruisesDict[cruiseID]:
            cruises_missing_url_r2rnav.append(cruiseID)

if cruises_missing_r2rnav:
    print(
        "The following cruises don't have r2rnav data, will need to parse INS data or do something else:\n {0}".format(
            ", ".join(cruises_missing_r2rnav)
        )
    )

# Cruises that have r2rnav data but no usable URL were already collected in `cruises_missing_url_r2rnav` by the loop above.

if cruises_missing_url_r2rnav:
    print(
        "The following cruises don't have an 'actual url' to their r2rnav data, will need to parse INS data or do something else:\n {0}".format(
            ", ".join(cruises_missing_url_r2rnav)
        )
    )

In [None]:
# TODO: download files

# TODO: give option to load directly into memory?
#  or download and save locally and load into memory
#  (or is memory the smaller limit on mybinder and we should save all files to disk and load full data into memory only one at a time?)

# TODO: what module to use to read in lat longs and get distance of path? can just pandas do this? or some specific geo module better? this will prob inform how we load the files into vars...
# numpy?
# https://github.com/pyproj4/pyproj ? maybe just coordinate transforms...
# https://github.com/GenericMappingTools/pygmt
# hmm apparently GeoPandas is a thing? https://geopandas.org/en/stable/getting_started/introduction.html and pretty maps in jupyter??

# a bunch of words. https://www.geeksforgeeks.org/working-with-geospatial-data-in-python/


In [None]:
# TODO: check for nav files, download as needed - do the below...
# loop through list of URLs - first check if file already exists locally, if not, then download it
#  add location of local file (previously existent or just downloaded) to cruisesDict under new key 'r2rnav_local_file'



## Notes

Useful for later, to check whether a URL points to a tar.gz file or a directory:

```python
import re

regex = r".*\.tar\.gz$"
if re.match(regex,response.json()['data'][0]['actual_url']):
    print("tar.gz url")
```

### Starting point for r2rnav files

NA041 returns: 'r2rnav_url_actual': 'https://www.ncei.noaa.gov/archive/accession/0296681/data/0-data/NA041_615289_r2rnav/'

In that directory is:

https://www.ncei.noaa.gov/data/oceans/archive/arc0228/0296681/1.3/data/0-data/NA041_615289_r2rnav/data/NA041_1min.r2rnav

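So something like `"{0}/data/{1}_1min.r2rnav".format(cruisesDict[cruiseID]['r2rnav_url_actual'], cruiseID)` should give the file to download. Note that `r2rnav_url_actual` ends with a slash here, so strip it first to avoid a double slash.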
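
A sketch of that URL construction as a small helper. It assumes every r2rnav directory follows the NA041 layout (`data/<cruise_id>_1min.r2rnav`), which has only been observed for this one cruise; `r2rnav_1min_url` is a hypothetical name.

```python
def r2rnav_1min_url(actual_url, cruise_id):
    """Guess the URL of the 1-minute r2rnav file under an `actual_url` directory."""
    # Strip any trailing slash so the joined URL does not contain "//"
    return "{0}/data/{1}_1min.r2rnav".format(actual_url.rstrip("/"), cruise_id)


print(
    r2rnav_1min_url(
        "https://www.ncei.noaa.gov/archive/accession/0296681/data/0-data/NA041_615289_r2rnav/",
        "NA041",
    )
)
```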
