# Tell me about my sea time (using information from R2R)

The Rolling Deck to Repository (R2R) Program publishes a [beautiful API here](https://www.rvdata.us/about/technical-details/services/api).

The intention is to use a person's name to find all the cruises they were on, then work out the days on the ship and the miles sailed for each cruise.
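As a rough sketch of the "days on the ship" step, the snippet below counts calendar days between two `YYYY-MM-DD` date strings like the `depart_date` and `arrive_date` values used further down. The function name `days_on_ship`, the sample cruise IDs, and the choice to count both the departure and arrival day are assumptions for illustration only.

```python
from datetime import date


def days_on_ship(depart_date, arrive_date):
    """Count calendar days from departure through arrival, inclusive."""
    depart = date.fromisoformat(depart_date)
    arrive = date.fromisoformat(arrive_date)
    if arrive < depart:
        raise ValueError("arrive_date is before depart_date")
    # Assumption: both the departure day and the arrival day count as sea days
    return (arrive - depart).days + 1


# Hypothetical records shaped like the cruise dictionary built later on
example_cruises = {
    "EX0001": {"depart_date": "2020-01-01", "arrive_date": "2020-01-10"},
    "EX0002": {"depart_date": "2021-06-15", "arrive_date": "2021-06-15"},
}

total_days = sum(
    days_on_ship(c["depart_date"], c["arrive_date"])
    for c in example_cruises.values()
)
print("Total days on ship:", total_days)
```

Dropping the `+ 1` would count elapsed days instead, which may suit some definitions of sea time better.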

## To Dos and Options

To get the cruises a person was on:

1. Run a `/person/person_id/{person_id}` query and hope the name was spelled the same way each time. I have not tested this with an actual query, but the 'Try it out' feature of the API documentation does not seem to allow partial IDs (i.e. just a last name) or wildcards.
2. Offer an option to search for different names and display the output? Or just tell users to go to [R2R's website](https://www.rvdata.us), investigate, and come back to this script with a few names to search for? Since we are displaying the returned cruises anyway, it is easy enough to let them reject (or simply not add) the search results they don't want.
3. Loop through all cruise IDs with `/person/cruise_id/{cruise_id}` queries, then join the unique names and show all of them? Or show only matching last names?
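If option 3 ends up producing a pool of names, one way to surface spelling variations might be the standard library's `difflib`. This is only a sketch: `similar_names`, the `cutoff` value, and the sample names are hypothetical, and the "Last, First" shape is assumed from the person ID used in the query cell below.

```python
from difflib import get_close_matches


def similar_names(target, candidates, cutoff=0.8):
    """Return the candidates that closely match `target`, ignoring case."""
    # Map lowercase spellings back to the original spellings
    by_lower = {}
    for name in candidates:
        by_lower.setdefault(name.lower(), []).append(name)

    # `n` must be positive, so fall back to 1 when there are no candidates
    matches = get_close_matches(
        target.lower(),
        list(by_lower),
        n=max(len(by_lower), 1),
        cutoff=cutoff,
    )
    return sorted(name for key in matches for name in by_lower[key])


# Hypothetical names collected from several cruises
pool = ["Lowe, Justin", "Lowe, Justn", "lowe, justin", "Smith, Jane"]
for name in similar_names("Lowe, Justin", pool):
    print(name)
```

The matches would still need a human check, since a similarity score cannot tell a typo apart from a different person with a similar name.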


To get miles sailed, loop through all applicable cruises and (in order of preference):

1. Get R2R-format processed nav from a `/product/cruise_id/{cruise_id}` query (not all cruises have this data ready to download).
2. Get INS data from a `/fileset/cruise_id/{cruise_id}` query.

And then process that data. Using what, exactly, is still undecided.
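Until a nav library is chosen, miles sailed could be approximated in plain Python by summing great-circle (haversine) distances between consecutive positions. This is a sketch under assumptions: the function names are made up here, positions are assumed to be `(latitude, longitude)` pairs in decimal degrees already extracted from a nav file, and a spherical Earth is used, so results are approximate.

```python
from math import asin, cos, radians, sin, sqrt

# Mean Earth radius (6371.0088 km) converted to nautical miles (1 nm = 1.852 km)
EARTH_RADIUS_NM = 6371.0088 / 1.852


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_NM * asin(min(1.0, sqrt(a)))


def track_length_nm(positions):
    """Sum the distances between consecutive (lat, lon) positions."""
    return sum(
        haversine_nm(*start, *end)
        for start, end in zip(positions, positions[1:])
    )


# Hypothetical track: one degree of latitude north, then no movement
track = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
print("Track length (nm): {0:.2f}".format(track_length_nm(track)))
```

With one-minute nav data, GPS jitter while the ship is on station will inflate the total, so some filtering or subsampling may be worth considering.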

TODO:

- [ ] look up geo and/or nav libraries and see if there is an easy option to parse nav files
- [ ] how best to look for errors or variations in spelling of someone's name?
  - [ ] get all expeditions of a ship (within a timeframe?) and join the names?
  - [ ] Just display all cruises returned for that person with the cruise's dates and leave it up to them to double-check?
  - [ ] implement the combination of results from multiple searches with name variations
- [ ] parse [R2R formatted processed nav data](https://service.rvdata.us/info/format/100157)
- [ ] parse other r2rnav data type?
- [ ] parse INS data? or just say 'sorry, wait for the data to be available'?
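For the TODO item about combining results from multiple searches with name variations, one possible shape is sketched below. `merge_cruise_ids` and the sample data are hypothetical; each result is assumed to be a dict with a `cruise_id` key, like the entries read from the person query in this notebook.

```python
def merge_cruise_ids(search_results):
    """Combine results of several name searches.

    `search_results` maps each searched name to its list of result dicts.
    Returns a dict mapping cruise ID to the set of names that returned it,
    so the user can see where each cruise came from before accepting it.
    """
    merged = {}
    for searched_name, results in search_results.items():
        for result in results:
            merged.setdefault(result["cruise_id"], set()).add(searched_name)
    return merged


# Hypothetical results from two spellings of the same person
results_by_name = {
    "Lowe, Justin": [{"cruise_id": "EX0001"}, {"cruise_id": "EX0002"}],
    "Lowe, Justn": [{"cruise_id": "EX0002"}, {"cruise_id": "EX0003"}],
}

for cruise_id, names in sorted(merge_cruise_ids(results_by_name).items()):
    print(cruise_id, sorted(names))
```

Keeping the originating names alongside each cruise ID makes it easier to reject a whole variation if it turns out to be a different person.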


In [None]:
import requests
import json
import time
from tqdm.auto import tqdm

In [None]:
# Edit the values of these two variables to the person to search for
lastName = "Lowe"
firstName = "Justin"

# How many seconds should we wait between making requests to R2R api?
wait_time = 2

In [None]:
# Use LastName, FirstName to get cruises

tempSetCruises = set()

# from https://pycoders-nl.gitbook.io/pycoders-handbook/web-scraping/week-14/python-requests-library-and-fastapi#how-to-make-robust-api-requests
try:
    response = requests.get(
        "https://service.rvdata.us/api/person/person_id/{0}%2C%20{1}".format(
            lastName, firstName
        ),
        timeout=5,
    )
    response.raise_for_status()

    searchNameResults = response.json()["data"]

    for cruiseResult in searchNameResults:
        tempSetCruises.add(cruiseResult["cruise_id"])
        # print(cruiseResult['cruise_id'])

    print(
        "Sorted list from set of cruises for {0} {1}: {2}".format(
            firstName, lastName, sorted(tempSetCruises)
        )
    )

    print(
        "NOTE: This notebook/script does not currently handle errors in this list (e.g. missing cruises, cruises you weren't on, etc). Sorry. Maybe soon?"
    )

except requests.exceptions.HTTPError as errh:
    print(errh)
except requests.exceptions.ConnectionError as errc:
    print(errc)
except requests.exceptions.Timeout as errt:
    print(errt)
except requests.exceptions.RequestException as err:
    print(err)

In [None]:
# TODO: ask user for missing cruises? and add them to set? verify entered cruiseIDs are valid.

In [None]:
# Loop through Cruises and get data. Use: dates, `has_r2rnav`, ?

# exampleDict =	{
#     cruiseID: {
#         "cruise_id": TEXT,
#         "cruise_name": TEXT,
#         "depart_date": 'YYYY-MM-DD',
#         "arrive_date": 'YYYY-MM-DD',
#         "has_r2rnav": true/false,
#     },
#     ...
# }


cruisesDict = dict()

progressBar = tqdm(sorted(tempSetCruises))
for cruiseID in progressBar:
    # progressBar.set_description("Requesting {0} data. Total Progress:".format(cruiseID))
    progressBar.set_postfix_str("Requesting {0} data.".format(cruiseID))

    # from https://pycoders-nl.gitbook.io/pycoders-handbook/web-scraping/week-14/python-requests-library-and-fastapi#how-to-make-robust-api-requests
    try:
        response = requests.get(
            "https://service.rvdata.us/api/cruise/cruise_id/{0}".format(cruiseID),
            timeout=5,
        )
        response.raise_for_status()

        # Parse the JSON body once and take the first cruise record
        cruiseData = response.json()["data"][0]

        tempDict = {
            "cruise_id": cruiseData["cruise_id"],
            "cruise_name": cruiseData["cruise_name"],
            "depart_date": cruiseData["depart_date"],
            "arrive_date": cruiseData["arrive_date"],
            "has_r2rnav": cruiseData["has_r2rnav"],
        }

        cruisesDict[tempDict["cruise_id"]] = tempDict

        # Pause for `wait_time` seconds between each request to be nice. Longer?
        time.sleep(wait_time)
    except requests.exceptions.HTTPError as errh:
        print(errh)
    except requests.exceptions.ConnectionError as errc:
        print(errc)
    except requests.exceptions.Timeout as errt:
        print(errt)
    except requests.exceptions.RequestException as err:
        print(err)

In [None]:
# Get urls for r2rnav files for all applicable cruises

progressBar = tqdm(cruisesDict.keys())
for cruiseID in progressBar:
    progressBar.set_postfix_str("Requesting {0} url.".format(cruiseID))
    # print("Cruise {0} has r2rnav data? {1}".format(cruiseID, cruisesDict[cruiseID]['has_r2rnav']))

    if cruisesDict[cruiseID]["has_r2rnav"]:
        # print("Retrieving url for {0}".format(cruiseID))

        # from https://pycoders-nl.gitbook.io/pycoders-handbook/web-scraping/week-14/python-requests-library-and-fastapi#how-to-make-robust-api-requests
        try:
            response = requests.get(
                "https://service.rvdata.us/api/product/cruise_id/{0}".format(cruiseID),
                timeout=5,
            )
            response.raise_for_status()

            r2rnav_expected_format = "r2rnav_geocsv"
            r2rnav_expected_format2 = "r2rnav"
            r2rnav_url_found = False
            data_product_formats = []

            # loop through all data product entries returned. Many cruises only have one, but some (later ones?) have multiple different products
            for dataProduct in response.json()["data"]:
                if dataProduct["file_format"] == r2rnav_expected_format:
                    cruisesDict[cruiseID].update(
                        {
                            "r2rnav_url": dataProduct["url"],
                            "r2rnav_url_actual": dataProduct["actual_url"],
                            "r2rnav_format": dataProduct["file_format"],
                        }
                    )
                    r2rnav_url_found = True

            # if no products of first expected format found, loop through again looking for second type
            if not r2rnav_url_found:
                for dataProduct in response.json()["data"]:
                    # start making a list of encountered data format types in case we need to use it in error message below
                    data_product_formats.append(dataProduct["file_format"])
                    if dataProduct["file_format"] == r2rnav_expected_format2:
                        cruisesDict[cruiseID].update(
                            {
                                "r2rnav_url": dataProduct["url"],
                                "r2rnav_url_actual": dataProduct["actual_url"],
                                "r2rnav_format": dataProduct["file_format"],
                            }
                        )
                        r2rnav_url_found = True

            # if we didn't find first or second expected type then complain
            if not r2rnav_url_found:
                # check this isn't empty
                if data_product_formats:
                    print(
                        "ERROR: Cruise {0} does not have r2rnav data in expected formats, even though `has_r2rnav` is True, instead found the following formats: {1}".format(
                            cruiseID,
                            data_product_formats,
                        )
                    )
                    # set some new flag in dict for this cruise?? or set 'has_r2rnav' to false? or parse this diff data? Other values encountered have been: `r2rnav` - TODO: look into diff formats.
                else:
                    print(
                        "ERROR: No data products found for Cruise {0}, but `has_r2rnav` is True".format(
                            cruiseID
                        )
                    )

            # Pause for `wait_time` seconds between each request to be nice. Longer?
            time.sleep(wait_time)
        except requests.exceptions.HTTPError as errh:
            print(errh)
        except requests.exceptions.ConnectionError as errc:
            print(errc)
        except requests.exceptions.Timeout as errt:
            print(errt)
        except requests.exceptions.RequestException as err:
            print(err)

    # else: # Do something to alert user to cruises without r2rnav? don't need to do here, just loop through dict again for if !has_r2rnav

In [None]:
# Test output so far
from pprint import pprint

pprint(cruisesDict)

In [None]:
# Test output so far, specifically which cruises don't have r2rnav files
cruises_missing_r2rnav = []
for cruiseID in cruisesDict:
    if not cruisesDict[cruiseID]["has_r2rnav"]:
        cruises_missing_r2rnav.append(cruiseID)

if cruises_missing_r2rnav:
    print(
        "The following cruises don't have r2rnav data, will need to parse INS data or do something else: {0}".format(
            ", ".join(cruises_missing_r2rnav)
        )
    )

## Notes

Useful for later: to check whether a URL points to a tar.gz file or a directory:

```python
import re

# `response` is the result of an earlier R2R product request
regex = r".*\.tar\.gz$"
if re.match(regex, response.json()["data"][0]["actual_url"]):
    print("tar.gz url")
```
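The same check could also be written without a regex, as a small helper that looks only at the path part of the URL. This is a sketch: `looks_like_tarball` is a hypothetical name, the first sample URL is made up, and the second is the NA041 directory URL quoted in these notes.

```python
from urllib.parse import urlparse


def looks_like_tarball(url):
    """Return True if the URL path ends in .tar.gz (query strings are ignored)."""
    return urlparse(url).path.endswith(".tar.gz")


sample_urls = [
    "https://example.org/archive/EX0001_r2rnav.tar.gz",  # hypothetical
    "https://www.ncei.noaa.gov/archive/accession/0296681/data/0-data/NA041_615289_r2rnav/",
]
for url in sample_urls:
    kind = "tar.gz" if looks_like_tarball(url) else "directory or other"
    print(kind, url)
```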

## Starting point for r2rnav files

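For NA041, `r2rnav_url_actual` is `https://www.ncei.noaa.gov/archive/accession/0296681/data/0-data/NA041_615289_r2rnav/`.

In that directory is:

https://www.ncei.noaa.gov/data/oceans/archive/arc0228/0296681/1.3/data/0-data/NA041_615289_r2rnav/data/NA041_1min.r2rnav

So build something like `"{0}/data/{1}_1min.r2rnav".format(cruisesDict[cruiseID]['r2rnav_url_actual'].rstrip('/'), cruiseID)` and download it. The `rstrip('/')` avoids a double slash, since the returned URL already ends with `/`. Note that the file link above sits under a different path (`/data/oceans/archive/...`) than the accession URL, so the constructed URL still needs testing.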
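The URL pattern from this note can be wrapped in a small helper. This is a sketch only: `r2rnav_1min_url` is a hypothetical name, and whether the archive actually serves the file at the resulting address has not been verified here.

```python
def r2rnav_1min_url(actual_url, cruise_id):
    """Build the guessed URL of the 1-minute r2rnav file for a cruise."""
    # Strip any trailing slash so the joined path has no double slash
    return "{0}/data/{1}_1min.r2rnav".format(actual_url.rstrip("/"), cruise_id)


na041_dir = (
    "https://www.ncei.noaa.gov/archive/accession/0296681/"
    "data/0-data/NA041_615289_r2rnav/"
)
print(r2rnav_1min_url(na041_dir, "NA041"))
```

The result could then be passed to `requests.get` in the same try/except pattern used in the cells above.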
