# 1 Finding EMIT L2B Mineralogy Data

**Summary**

#TODO
 - Add uncertainty to the outputs
 - better describe some cells
 - add image to this cell

<div>
<img src="../img/mineralogy_search.png" width="750"/>
</div>

**Background**

**Requirements**

 - [NASA Earthdata Account](https://urs.earthdata.nasa.gov/home)   
 - *No Python setup requirements if connected to the workshop cloud instance!*  
 - **Local Only** Set up Python Environment - See **setup_instructions.md** in the `/setup/` folder to set up a local compatible Python environment 

**Learning Objectives**

**Tutorial Outline**

1. Setup  
2. Searching for EMIT L2B Mineralogy Data
3. Advanced Filtering
4. Visualizing Data
5. Creating a List of Results and Asset URLs
6. Streaming or Downloading Data

## 1. Setup

Import the required Python libraries.

In [None]:
# Import required libraries
import os
import sys
import folium
import earthaccess
import warnings
import folium.plugins
import pandas as pd
import geopandas as gpd
import math

from branca.element import Figure
from IPython.display import display
from shapely import geometry
from skimage import io
from datetime import timedelta
from shapely.geometry.polygon import orient
from matplotlib import pyplot as plt
import matplotlib.cm as cm

sys.path.append('../modules/')
from emit_tools import emit_xarray, ortho_xr, ortho_browse
from tutorial_utils import list_metadata_fields, results_to_geopandas, convert_bounds

### 1.2 NASA Earthdata Login Credentials

To download or stream NASA data you will need an Earthdata account, which you can create [here](https://urs.earthdata.nasa.gov/home). We will use the `login` function from the `earthaccess` library to authenticate before downloading at the end of the notebook. This function can also create a local `.netrc` file if one doesn't exist, or add your login info to an existing `.netrc` file. If no Earthdata Login credentials are found in the `.netrc`, you'll be prompted for them. This step is not necessary to conduct searches but is needed to download or stream data.

## 2. Searching for EMIT L2B Mineralogy Data

To find data we will use the [`earthaccess` Python library](https://github.com/nsidc/earthaccess). `earthaccess` searches the NASA Common Metadata Repository (CMR) API, a metadata system that catalogs Earth Science data and associated metadata records. The results can then be used to download granules or generate lists of granule search result URLs.

Using `earthaccess` we can search based on the attributes of a granule, which can be thought of as a spatiotemporal scene from an instrument containing multiple assets (e.g., Reflectance, Reflectance Uncertainty, and Masks for the EMIT L2A Reflectance collection). When conducting a search we can provide a product (in this case the mineralogy product), a date-time range, and spatial constraints. This process can also be used with other EMIT products or other NASA collections.

### 2.1 Querying for Datasets

Our first step in searching for data is determining which collection (e.g. EMIT L2A Estimated Surface Reflectance and Uncertainty and Masks, EMIT L2B Estimated Mineral Identification and Band Depth and Uncertainty) we want to search. The best way to do this is with the collection `short_name` (e.g. EMITL2ARFL, EMITL2BMIN) or `concept-id`. In rare cases, two collections can share the same `short_name`, so we will use the `concept-id`, which is a unique identifier for each collection. To find the `concept-id` we can search using some keywords.

In [None]:
# EMIT Collection Query
emit_collection_query = earthaccess.collection_query().keyword('EMIT L2B Mineral')
emit_collection_query.fields(['ShortName','EntryTitle','Version']).get()

From this list of results we can see that the `concept-id` for the desired mineral product is `C2408034484-LPCLOUD`. We can use this to define one of our search arguments.

In [None]:
concept_id = 'C2408034484-LPCLOUD'

### 2.2 Define Temporal Range

For our date range, we'll look at all EMIT data collected over 2023. The `date_range` can be specified as a pair of dates, start and end (up to, not including).

In [None]:
date_range = ('2023-01-01','2024-01-01')
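
The start and end convention described above can be sketched with the standard library. The helper name `year_date_range` is hypothetical and not part of `earthaccess` or the tutorial modules; it only builds the same kind of tuple used for `date_range`, with the end date set to the first day of the following year.

```python
from datetime import date


def year_date_range(year):
    """Return (start, end) ISO date strings covering one calendar year.

    The end date is the first day of the following year, matching the
    'up to, not including' convention described in the text.
    """
    start = date(year, 1, 1)
    end = date(year + 1, 1, 1)
    return (start.isoformat(), end.isoformat())


# Same tuple as the hard-coded date_range used in this notebook
date_range_2023 = year_date_range(2023)
print(date_range_2023)
```

A helper like this may be convenient when repeating the search for several years, since only the year needs to change.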

### 2.3 Define Spatial Region of Interest

For this example, our spatial region of interest will be the area around Cuprite, NV, a location where there have been several previous mineralogy studies. We can define this region using a rectangular polygon. If you want to make a polygon for a different region, you can use a tool like [geojson.io](http://geojson.io/).

Open the `geojson` as a `geodataframe`, and check the coordinate reference system (CRS) of the data.

In [None]:
roi_gdf = gpd.read_file('../../data/cuprite_bbox.geojson')
roi_gdf.crs

In [None]:
roi_gdf

We can see this `geodataframe` consists of a single polygon which we want to include in our search, but the geometry is the only information contained in the file, so let's add a column for the site name and set the value to "Cuprite".

In [None]:
roi_gdf['Name'] = 'Cuprite'

In [None]:
roi_gdf

In [None]:
# Function to convert a bounding box for use in leaflet notation

def convert_bounds(bbox, invert_y=False):
    """
    Helper method for changing bounding box representation to leaflet notation

    ``(lon1, lat1, lon2, lat2) -> ((lat1, lon1), (lat2, lon2))``
    """
    x1, y1, x2, y2 = bbox
    if invert_y:
        y1, y2 = y2, y1
    return ((y1, x1), (y2, x2))
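
To make the coordinate swap concrete, here is a small usage sketch. It repeats the function so the block runs on its own. The bounding box values are illustrative numbers roughly near Cuprite, NV, and are an assumption, not the bounds of `cuprite_bbox.geojson`.

```python
def convert_bounds(bbox, invert_y=False):
    """(lon1, lat1, lon2, lat2) -> ((lat1, lon1), (lat2, lon2))"""
    x1, y1, x2, y2 = bbox
    if invert_y:
        y1, y2 = y2, y1
    return ((y1, x1), (y2, x2))


# Hypothetical bounding box as (min_lon, min_lat, max_lon, max_lat)
example_bbox = (-117.3, 37.4, -117.1, 37.6)

# Leaflet expects (lat, lon) corner pairs, so x and y are swapped
leaflet_bounds = convert_bounds(example_bbox)
print(leaflet_bounds)

# invert_y swaps the two latitudes before building the corner pairs
flipped_bounds = convert_bounds(example_bbox, invert_y=True)
print(flipped_bounds)
```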

In [None]:
fig = Figure(width="750px", height="375px")
map1 = folium.Map(tiles='https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}', attr='Google')
fig.add_child(map1)

# Add roi geodataframe
roi_gdf.explore(
    "Name",
    popup=True,
    categorical=True,
    cmap='Set3',
    style_kwds=dict(opacity=0.7, fillOpacity=0.4),
    name="Regions of Interest",
    m=map1
)

map1.add_child(folium.LayerControl())
map1.fit_bounds(bounds=convert_bounds(roi_gdf.unary_union.bounds))
display(fig)

In our `earthaccess` search, we will use the `polygon` argument to find where this geometry intersects with the footprint of the EMIT scenes. To do this, we need to create a list of exterior polygon vertices in counter-clockwise order to submit in our search.

In [None]:
# Use orient to place vertices in counter-clockwise order
roi = orient(roi_gdf.geometry[0], sign = 1.0)
# Put the exterior coordinates in a list
roi = list(roi.exterior.coords)
roi
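
For intuition about what "counter-clockwise order" means, the sketch below checks ring orientation with the shoelace formula using only the standard library. It is not how `shapely.geometry.polygon.orient` is implemented; `signed_area` and `ensure_ccw` are hypothetical helper names, and the coordinates are illustrative.

```python
def signed_area(coords):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    coords = list(coords)
    area = 0.0
    # Pair each vertex with the next one; a repeated closing vertex adds zero
    for (x1, y1), (x2, y2) in zip(coords, coords[1:] + coords[:1]):
        area += x1 * y2 - x2 * y1
    return area / 2.0


def ensure_ccw(coords):
    """Return the vertices in counter-clockwise order, reversing them if needed."""
    coords = list(coords)
    return coords if signed_area(coords) > 0 else coords[::-1]


# Hypothetical closed ring listed clockwise as (lon, lat) pairs
clockwise_ring = [
    (-117.3, 37.4), (-117.3, 37.6), (-117.1, 37.6), (-117.1, 37.4), (-117.3, 37.4),
]

ccw_ring = ensure_ccw(clockwise_ring)
print(signed_area(clockwise_ring) < 0, signed_area(ccw_ring) > 0)
```

A check like this can help confirm the vertex order of a hand-drawn polygon before it is passed to a search.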

In [None]:
results = earthaccess.search_data(
    concept_id=concept_id,
    polygon=roi,
    temporal=date_range,
    count=500
)

## 3. Advanced Filtering

Now that we have some results, we will place them into a geodataframe that includes links to browse imagery and the files, so we can do some more advanced filtering of the data.

List the metadata fields available in the search results.

In [None]:
list_metadata_fields(results)

Some datasets have unique metadata that we can choose to include when we use our `results_to_geopandas` function to create a geodataframe. For example, `_cloud_cover` is not always available. We can add it to the default fields of this function by adding it to a `fields` argument in list form.

default_fields = [  
        "size",  
        "concept_id",  
        "dataset-id",  
        "native-id",  
        "provider-id",  
        "_related_urls",  
        "_single_date_time",  
        "_beginning_date_time",  
        "_ending_date_time",  
        "geometry",  
    ]

In [None]:
results_gdf = results_to_geopandas(results, fields=['_cloud_cover'])

Add an index column so we can easily reference rows in the geodataframe, for example when using `gdf.explore()`.

In [None]:
# Specify index so we can reference it with gdf.explore()
results_gdf['index']=results_gdf.index

In [None]:
results_gdf

Filter the results geodataframe by cloud cover. We'll use a cloud cover of 10% as our threshold, keeping only scenes below it.

In [None]:
# Filter Results
results_gdf = results_gdf[results_gdf['_cloud_cover'] < 10]
results_gdf.reset_index(drop=True, inplace=True)
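
One thing to keep in mind with a `<` threshold filter is that a missing cloud cover value is dropped as well, because a comparison against NaN is always false. The plain-Python sketch below illustrates that behavior. The records are hypothetical stand-ins for rows of `results_gdf`, and it assumes missing values are stored as NaN.

```python
import math

# Hypothetical records standing in for rows of results_gdf
records = [
    {"native-id": "scene_a", "_cloud_cover": 3.0},
    {"native-id": "scene_b", "_cloud_cover": 42.0},
    {"native-id": "scene_c", "_cloud_cover": math.nan},  # missing value
]

threshold = 10

# NaN < 10 evaluates to False, so scene_c is excluded along with scene_b
kept = [r["native-id"] for r in records if r["_cloud_cover"] < threshold]

# Listing the missing values separately makes the silent drop visible
missing = [r["native-id"] for r in records if math.isnan(r["_cloud_cover"])]

print(kept)
print(missing)
```

If scenes with unknown cloud cover should be kept for manual review, they would need to be handled explicitly before filtering.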

In [None]:
# Set up Figure and Basemap tiles
fig = Figure(width="1080px", height="540px")
map1 = folium.Map(tiles=None)
folium.TileLayer(tiles='https://mt1.google.com/vt/lyrs=y&x={x}&y={y}&z={z}',name='Google Satellite', attr='Google', overlay=True).add_to(map1)
folium.TileLayer(tiles='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}.png',
                name='ESRI World Imagery',
                attr='Tiles &copy; Esri &mdash; Source: Esri, i-cubed, USDA, USGS, AEX, GeoEye, Getmapping, Aerogrid, IGN, IGP, UPR-EGP, and the GIS User Community',
                overlay=True).add_to(map1)
fig.add_child(map1)
# Add Search Results by Row
# Create a color map for the results
# plt.get_cmap is used because cm.get_cmap was removed in Matplotlib 3.9
from matplotlib.colors import rgb2hex
cmap = plt.get_cmap('Set3')
n = len(results_gdf['native-id'].unique())
colors = [cmap(i) for i in range(n)]
colors = [rgb2hex(color) for color in colors]

for index, row in results_gdf.iterrows():
    color = colors[index % len(colors)]
    folium.GeoJson(row.geometry, name = row['native-id'],style_function=lambda feature, color=color: {'color': color, 'fillColor': color}).add_to(map1)

# Add roi geodataframe
roi_gdf.explore(
    "Name",
    popup=True,
    categorical=True,
    cmap='Set3',
    style_kwds=dict(opacity=0.7, fillOpacity=0.4),
    name="Regions of Interest",
    m=map1
)

# Zoom to Data
map1.fit_bounds(bounds=convert_bounds(results_gdf.unary_union.bounds))
# Add Layer controls
map1.add_child(folium.LayerControl(collapsed=False))
display(fig)

In [None]:
results_gdf._related_urls[0]

In [None]:
def get_asset_url(row,asset, key='Type',value='GET DATA'):
    """
    Retrieve a URL from the list of dictionaries for a row in the _related_urls column.
    Asset examples: L2B_MIN (used in this notebook), CH4PLM, CH4PLMMETA, RFL, MASK, RFLUNCERT
    Returns None if no URL matches.
    """
    # Add _ to asset so string matching works
    asset = f"_{asset}_"
    # Retrieve URL matching parameters
    for _dict in row['_related_urls']:
        if _dict.get(key) == value and asset in _dict['URL'].split('/')[-1]:
            return _dict['URL']
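
The matching logic can be tried without a search by passing a plain dictionary in place of a geodataframe row. The block repeats the function so it runs on its own. Every URL and file name below is made up for illustration (hosted on `example.com`), so treat the naming pattern as an assumption rather than a record of real granules.

```python
def get_asset_url(row, asset, key='Type', value='GET DATA'):
    """Return the first URL whose file name contains _<asset>_ and whose key matches value."""
    asset = f"_{asset}_"
    for _dict in row['_related_urls']:
        if _dict.get(key) == value and asset in _dict['URL'].split('/')[-1]:
            return _dict['URL']
    return None


# Hypothetical row with made-up related URLs
base = "https://example.com/data"
granule = "001_20230101T000000_0000000_001"
example_row = {
    '_related_urls': [
        {'Type': 'GET DATA', 'URL': f"{base}/EMIT_L2B_MINUNCERT_{granule}.nc"},
        {'Type': 'GET DATA', 'URL': f"{base}/EMIT_L2B_MIN_{granule}.nc"},
        {'Type': 'GET RELATED VISUALIZATION', 'URL': f"{base}/EMIT_L2B_MIN_{granule}.png"},
    ]
}

# The surrounding underscores keep 'L2B_MIN' from matching 'L2B_MINUNCERT'
data_url = get_asset_url(example_row, 'L2B_MIN')
browse_url = get_asset_url(example_row, 'L2B_MIN', value='GET RELATED VISUALIZATION')
no_match = get_asset_url(example_row, 'RFL')
print(data_url, browse_url, no_match, sep="\n")
```

Because unmatched assets come back as `None`, lists built with this function may contain `None` entries that are worth checking before use.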

In [None]:
# Iterate over rows in results_gdf, get the mineral URLs, and store them in a list
min_urls = results_gdf.apply(lambda row: get_asset_url(row, asset='L2B_MIN'), axis=1).tolist()
min_urls

In [None]:
min_png = results_gdf.apply(lambda row: get_asset_url(row, asset='L2B_MIN', value='GET RELATED VISUALIZATION'), axis=1).tolist()
min_png

With some knowledge of how the granules and assets are named, we can grab the RGB browse images to get an idea of what the location looks like.

In [None]:
# Replace Collection ID
rgb_urls = [s.replace('EMITL2BMIN', 'EMITL2ARFL') for s in min_png]
# Update Product and Asset Names
rgb_urls = [s.replace('EMIT_L2B_MIN', 'EMIT_L2A_RFL') for s in rgb_urls]
# Change file extension
#rgb_urls = [s.replace('.nc', '.png') for s in rgb_urls]
rgb_urls
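
The two substitutions can be seen on a single URL in the sketch below. The URL is made up (hosted on `example.com`) and `mineral_to_rgb_browse` is a hypothetical helper name. The sketch also passes `None` through unchanged, since calling `.replace` on a missing URL would raise an error. Note that the substitution only rewrites names; it does not guarantee that a matching reflectance browse image exists for every scene.

```python
def mineral_to_rgb_browse(url):
    """Swap collection and product names in a mineral browse URL (None passes through)."""
    if url is None:
        return None
    # Replace the collection ID, then the product and asset names
    url = url.replace('EMITL2BMIN', 'EMITL2ARFL')
    return url.replace('EMIT_L2B_MIN', 'EMIT_L2A_RFL')


# Hypothetical browse image URL following the pattern assumed above
granule = "EMIT_L2B_MIN_001_20230101T000000_0000000_001"
example_png = f"https://example.com/EMITL2BMIN.001/{granule}/{granule}.png"

rgb_example = mineral_to_rgb_browse(example_png)
print(rgb_example)
```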

In [None]:
cols = 3
rows = math.ceil(len(results_gdf)/cols)
fig, ax = plt.subplots(rows, cols, figsize=(12,12))
ax = ax.flatten()

for _n, index in enumerate(results_gdf.index.to_list()):
    img = io.imread(rgb_urls[index])
    ax[_n].imshow(img)
    ax[_n].set_title(f"Index: {index} - {results_gdf['native-id'][index]}", fontsize=8)
    ax[_n].axis('off')
# Hide any unused panels left over in the last row of the grid
for _ax in ax[len(results_gdf):]:
    _ax.axis('off')
plt.tight_layout()
plt.show()

## 5. Saving Lists of Results

In [None]:
with open('../../data/rgb_browse_urls.txt', 'w') as f:
    for line in rgb_urls:
        f.write(f"{line}\n")

In [None]:
with open('../../data/results_urls.txt', 'w') as f:
    for line in min_urls:
        f.write(f"{line}\n")
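
If any entry in a URL list is `None` (for example, when no matching asset was found for a scene), the loops above would write the literal text `None` into the file. The sketch below shows one way to skip such entries. `write_url_list` is a hypothetical helper, the URLs are made up, and a temporary directory is used so nothing is written to the workshop `data` folder.

```python
import os
import tempfile


def write_url_list(urls, path):
    """Write one URL per line, skipping entries that are None, and return the count written."""
    valid = [u for u in urls if u is not None]
    with open(path, 'w') as f:
        for u in valid:
            f.write(f"{u}\n")
    return len(valid)


# Hypothetical list containing one missing entry
example_urls = ["https://example.com/a.nc", None, "https://example.com/b.nc"]

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "results_urls.txt")
    n_written = write_url_list(example_urls, out_path)
    # Read the file back to confirm what was saved
    with open(out_path) as f:
        lines = f.read().splitlines()

print(n_written, lines)
```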

## 6. Streaming or Downloading Data
For the workshop, we will stream the data, but either method can be used, and each has trade-offs based on the internet speed, storage space, or use case. The EMIT files are very large due to the number of bands, so operations can take some time if streaming with a slower internet connection. Since the workshop is hosted in a Cloud workspace, we can stream the data directly to the workspace.

### 6.1 Streaming Data Workflow
For an example of streaming netCDF files, please see [Working with EMIT L2B Mineralogy.ipynb](Working_with_EMIT_L2B_Mineralogy.ipynb).

If you plan to stream the data, you can stop here and move to the next notebook.

### 6.2 Downloading Data Workflow
To download the scenes, we can use the `earthaccess` library to authenticate and then download the files.

First, log into Earthdata using the `login` function from the `earthaccess` library. The `persist=True` argument will create a local `.netrc` file if one doesn't exist, or add your login info to an existing `.netrc` file. If no Earthdata Login credentials are found in the `.netrc`, you'll be prompted for them. As mentioned in section 1.2, this step is not necessary to conduct searches, but is needed to download or stream data.

The outputs saved in section 5 can be downloaded by uncommenting and running the following cells.

In [None]:
# # Authenticate using earthaccess
# earthaccess.login(persist=True)

In [None]:
# # Open Text File and Read Lines
# file_list = ['../../data/rgb_browse_urls.txt','../../data/results_urls.txt']
# urls = []
# for file in file_list:
#     with open(file) as f:
#         urls.extend([line.rstrip('\n') for line in f])

In [None]:
# # Get requests https Session using Earthdata Login Info
# fs = earthaccess.get_requests_https_session()
# # Retrieve granule asset ID from URL (to maintain existing naming convention)
# for url in urls:
#     granule_asset_id = url.split('/')[-1]
#     # Define Local Filepath
#     fp = f'../../data/{granule_asset_id}'
#     # Download the Granule Asset if it doesn't exist
#     if not os.path.isfile(fp):
#         with fs.get(url,stream=True) as src:
#             with open(fp,'wb') as dst:
#                 for chunk in src.iter_content(chunk_size=64*1024*1024):
#                     dst.write(chunk)
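
Before starting a long download, it can help to see which files are still missing locally. The sketch below uses the same file-naming and existence check as the commented cell above, but performs no network requests. `pending_downloads` is a hypothetical helper, the URLs are made up, and a temporary directory stands in for the `data` folder.

```python
import os
import tempfile


def pending_downloads(urls, out_dir):
    """Return (url, local_path) pairs for files not yet present in out_dir."""
    pending = []
    for url in urls:
        # Keep the existing naming convention: the last part of the URL is the file name
        granule_asset_id = url.split('/')[-1]
        fp = os.path.join(out_dir, granule_asset_id)
        if not os.path.isfile(fp):
            pending.append((url, fp))
    return pending


example_urls = [
    "https://example.com/data/scene_a.nc",
    "https://example.com/data/scene_b.nc",
]

with tempfile.TemporaryDirectory() as tmp:
    # Pretend the first file was downloaded in an earlier session
    open(os.path.join(tmp, "scene_a.nc"), "wb").close()
    todo = pending_downloads(example_urls, tmp)
    todo_names = [os.path.basename(fp) for _, fp in todo]

print(todo_names)
```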


## Contact Info:  

Email: LPDAAC@usgs.gov  
Voice: +1-866-573-3222  
Organization: Land Processes Distributed Active Archive Center (LP DAAC)¹  
Website: <https://lpdaac.usgs.gov/>  
Date last modified: 06-21-2024  

¹Work performed under USGS contract 140G0121D0001 for NASA contract NNG14HH33I. 