<a href="https://colab.research.google.com/github/myrandaGoesToSpace/semantic-search-datasets/blob/main/semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Search on NOAA NCEI Data
Myranda Uselton Shirk, NOAA NCEI Data Scientist

## Introduction
The National Oceanic and Atmospheric Administration's (NOAA's) National Centers for Environmental Information (NCEI) hosts one of the world's largest environmental data archives and is the leading US authority on environmental information. A large part of NCEI's role in environmental science is providing access to these data and improving their discoverability. This notebook demonstrates a method for building a semantic natural language search engine over NCEI dataset abstracts and serves as a proof of concept for any organization that wishes to offer natural language search to its clients.

## Methods

### The Transformer Model
[HuggingFace](https://www.huggingface.co) is an online community dedicated to creating and sharing state-of-the-art transformer models within the data science community. Transformer models, first proposed by Google researchers in 2017, are neural network models that have achieved state-of-the-art results on many natural language processing tasks. The key to these models is a mechanism known as attention, which lets a model weigh the other words in a passage when representing a given word, so that context is embedded along with the word itself. The model used in this notebook is a Sentence Transformer model, which embeds sentences as vectors so that semantic similarity can be computed for question answering and inference problems.
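To make the idea of attention more concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention. It is a toy illustration only, not the Sentence Transformer's implementation: real transformer layers add learned projection matrices, multiple heads, and other components, and the random `tokens` array is a hypothetical stand-in for token vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(queries, keys, values):
    """Return (output, weights) for single-head attention.

    queries and keys have shape (n_tokens, d_k); values has shape (n_tokens, d_v).
    """
    d_k = queries.shape[-1]
    # Similarity of every token to every other token, scaled by sqrt(d_k)
    scores = queries @ keys.T / np.sqrt(d_k)
    # Each row becomes a probability distribution over the tokens
    weights = softmax(scores, axis=-1)
    # Each output vector is a weighted average of the value vectors
    return weights @ values, weights


# Hypothetical input: 4 tokens, each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(weights.round(2))  # row i shows how strongly token i attends to each token
```

Each output row mixes information from all tokens in proportion to the attention weights, which is how context from the rest of the passage enters a token's representation.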


### Model Application
Because the Sentence Transformer computes semantic similarity for question answering, it is reasonable to expect that the same method can be applied to semantic search over a corpus of documents. In this case, the user's query is embedded and its similarity is computed against the semantic embeddings of NCEI dataset abstracts.
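As a rough illustration of what "computing similarity against embeddings" means, the sketch below ranks a few toy vectors against a query vector. The three-dimensional vectors are hypothetical stand-ins for real embeddings, and the sketch assumes the embeddings are normalized to unit length, in which case the dot product equals the cosine similarity.

```python
import numpy as np


def normalize(vectors):
    """Scale each row of a 2-D array to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


# Hypothetical 3-dimensional "embeddings" for three documents and one query
doc_vectors = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]))
query_vector = normalize(np.array([[0.9, 0.1, 0.0]]))[0]

# For unit-length vectors, the dot product is the cosine of the angle between them
similarities = doc_vectors @ query_vector

# Indices of the documents, most similar first
ranking = np.argsort(-similarities)
for index in ranking:
    print(index, round(float(similarities[index]), 3))
```

The document whose vector points in nearly the same direction as the query ranks first, and the orthogonal one ranks last.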

### The Data
The data in this notebook are sourced from NCEI's metadata portal, Geoportal, and include dataset identifiers, titles, and abstracts. The data are retrieved using the Catalogue Service for the Web (CSW) protocol and parsed in Python using the Beautiful Soup library.
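The sketch below shows roughly what a CSW-style response looks like and how the three fields can be pulled out of it. It uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so that it runs without extra installs. The sample XML is hand-written for illustration and is not real Geoportal output; the `csw` namespace URI in particular is an assumption, while `dc` and `dct` are the usual Dublin Core namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written response; real Geoportal output may differ
SAMPLE_XML = """
<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/3.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dct="http://purl.org/dc/terms/">
  <csw:SearchResults>
    <csw:Record>
      <dc:identifier>example:0001</dc:identifier>
      <dc:title>Example Sea Surface Temperature Dataset</dc:title>
      <dct:abstract>Derived from example satellite observations.</dct:abstract>
    </csw:Record>
    <csw:Record>
      <dc:identifier>example:0002</dc:identifier>
      <dc:title>Example Dataset Without an Abstract</dc:title>
    </csw:Record>
  </csw:SearchResults>
</csw:GetRecordsResponse>
""".strip()

# Prefix-to-URI mapping used when searching for namespaced tags
NAMESPACES = {
    "csw": "http://www.opengis.net/cat/csw/3.0",  # assumed CSW version
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
}


def parse_records(xml_text):
    """Return one dict per csw:Record; missing fields are None."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.findall(".//csw:Record", NAMESPACES):
        records.append({
            "identifier": record.findtext("dc:identifier", namespaces=NAMESPACES),
            "title": record.findtext("dc:title", namespaces=NAMESPACES),
            "abstract": record.findtext("dct:abstract", namespaces=NAMESPACES),
        })
    return records


records = parse_records(SAMPLE_XML)
for record in records:
    print(record)
```

Note that the second record has no abstract, so any parsing code needs to handle that case explicitly.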

## Application

Install the libraries not included in Google Colab: Sentence Transformers and BS4 (Beautiful Soup).

In [None]:
!pip install -U sentence-transformers
!pip install bs4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected 

Load the relevant packages.

In [None]:
from sentence_transformers import SentenceTransformer, util
from bs4 import BeautifulSoup

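Below is the example code from the Sentence Transformer model's HuggingFace page showing how to use the model for semantic search. It is wrapped in a string literal so that it is displayed but not executed.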

In [None]:
'''
# Example - semantic search

query = "How many people live in London?"
docs = ["Around 9 Million people live in London", "London is known for its financial district"]

#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

#Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

#Compute dot score between query and all document embeddings
scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

#Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

#Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)


#Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
  
'''


Write functions for obtaining abstract, title, and identifier from Geoportal CSW request.

In [None]:
import pandas as pd
import requests

def find_abstracts(soup):
    """Collect the identifier, title, and abstract of every CSW record in `soup`."""
    id_list = []
    abs_list = []
    title_list = []

    for record in soup.find_all("csw:record"):
        # Descriptive names avoid shadowing the built-ins `id` and `abs`
        identifier = record.find("dc:identifier")
        abstract = record.find("dct:abstract")
        title = record.find("dc:title")

        id_list.append(identifier.text)
        title_list.append(title.text)

        # Records without an abstract get the placeholder "NA"
        if abstract is not None:
            abs_list.append(abstract.text)
        else:
            abs_list.append("NA")

    return id_list, title_list, abs_list

# Get the abstracts from Geoportal
URL = "https://www.ncei.noaa.gov/metadata/geoportal/opensearch?f=csw&from=0&size=5000&sort=title.sort"

page = requests.get(URL, timeout=120)
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.text, "lxml")

id_list, title_list, abs_list = find_abstracts(soup)
df = pd.DataFrame(list(zip(id_list,title_list, abs_list)), columns = ["identifier", "title", "abstract"])
df

| | identifier | title | abstract |
|---|---|---|---|
| 0 | gov.noaa.nodc:0000662 | (NCEI Accession 0000662) | |
| 1 | gov.noaa.nodc:5400007 | (NCEI Accession 5400007) | |
| 2 | gov.noaa.ncei:SeaSurfaceTemperature | .Sea Surface Temperature | Derived from processed satellite observations.... |
| 3 | gov.noaa.ncdc:C00386 | 1-Minute Page 1 Surface Weather Observations f... | This dataset includes one-minute observations ... |
| 4 | gov.noaa.ncdc:C00387 | 1-Minute Page 2 Surface Weather Observations f... | This dataset includes one-minute observations ... |
| ... | ... | ... | ... |
| 4995 | gov.noaa.ngdc.mgg.wcd:NA093_EM302 | EM302 Water Column Sonar Data Collected During... | Nautilus Cruise NA093 (EM302) |
| 4996 | gov.noaa.ngdc.mgg.wcd:NA094_EM302 | EM302 Water Column Sonar Data Collected During... | Nautilus Cruise NA094 (EM302) |
| 4997 | gov.noaa.ngdc.mgg.wcd:NA095_EM302 | EM302 Water Column Sonar Data Collected During... | Nautilus Cruise NA095 (EM302) |
| 4998 | gov.noaa.ngdc.mgg.wcd:NA096_EM302 | EM302 Water Column Sonar Data Collected During... | Nautilus Cruise NA096 (EM302) |


In [None]:
# Use the abstracts as the search documents, skipping records marked "NA"
docs_df = df[df["abstract"] != "NA"]
docs = list(docs_df["abstract"])
titles = list(docs_df["title"])

Load the Sentence Transformers model and embed the dataset abstracts.

In [None]:
#Load the model
model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')

doc_emb = model.encode(docs)



Save the abstract embeddings to a file. This allows whoever maintains the search engine to compute the embeddings for the corpus only once and reload them afterward.

In [None]:
import numpy as np
np.save("doc_emb", doc_emb)

In [None]:
doc_emb_saved = np.load("doc_emb.npy")
doc_emb_saved

array([[ 6.94027185e-05,  3.76847200e-02,  1.30539024e-02, ...,
         3.56099494e-02, -1.13519669e-01, -1.43111013e-02],
       [ 4.22849096e-02, -3.21854278e-02, -3.11591253e-02, ...,
         4.74197939e-02, -1.07490636e-01, -3.00586899e-03],
       [-6.01339852e-03, -2.23910846e-02, -4.13601585e-02, ...,
         5.25129661e-02, -1.01892620e-01, -3.97650823e-02],
       ...,
       [-5.70231751e-02,  8.11418425e-03,  3.22012976e-02, ...,
         1.98791665e-03, -2.65017198e-03,  5.94463907e-02],
       [-5.81365339e-02,  1.32378815e-02,  2.05236133e-02, ...,
         6.04736293e-03, -2.20852275e-03,  5.79196922e-02],
       [-5.61661944e-02,  1.02252932e-02,  1.95322707e-02, ...,
        -1.23474875e-03, -7.81550724e-03,  6.14073277e-02]], dtype=float32)
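A saved embedding array on its own does not record which dataset each row belongs to, so the rows only make sense while `docs` and `titles` are still in memory in the same order. One possible way to keep them together, sketched below with toy data, is to store the embeddings and their identifiers in a single `.npz` file. The helper names `save_index` and `load_index` and the file name are hypothetical and not part of the notebook.

```python
import os
import tempfile

import numpy as np


def save_index(path, embeddings, identifiers):
    """Store embeddings and their identifiers together in one .npz file."""
    if len(embeddings) != len(identifiers):
        raise ValueError("Need exactly one identifier per embedding row")
    np.savez(path, embeddings=embeddings, identifiers=np.array(identifiers))


def load_index(path):
    """Return (embeddings, identifiers) from a file written by save_index."""
    with np.load(path) as data:
        return data["embeddings"], data["identifiers"].tolist()


# Toy data standing in for real embeddings and dataset identifiers
toy_embeddings = np.arange(6, dtype=np.float32).reshape(3, 2)
toy_identifiers = ["example:a", "example:b", "example:c"]

index_path = os.path.join(tempfile.mkdtemp(), "doc_index.npz")
save_index(index_path, toy_embeddings, toy_identifiers)

loaded_embeddings, loaded_identifiers = load_index(index_path)
print(loaded_identifiers, loaded_embeddings.shape)
```

With this layout, row `i` of the loaded array can always be matched to `loaded_identifiers[i]`, even in a fresh session.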

Ask the user to enter a query, then compute its similarity against the previously embedded abstracts.

In [None]:
# Query
query = input("Enter your query: ")

# Encode the query (the abstracts were already encoded above)
query_emb = model.encode(query)

# Compute the dot score between the query and all document embeddings
scores = util.dot_score(query_emb, doc_emb_saved)[0].cpu().tolist()

# Combine docs, scores, and titles
doc_score_pairs = list(zip(docs, scores, titles))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output the ten highest-scoring abstracts with their titles and scores
for doc, score, title in doc_score_pairs[:10]:
    print("Score: %f" % score)
    print("Title: %s" % title)
    print("Abstract: %s" % doc)
    print('\n')

Enter your query: coral health in Atlantic ocean
Score: 0.607544
Title: EK60 Water Column Sonar Data Collected During HB1402
Abstract: Deep Water Coral (HB1402, EK60). The cruise will survey and collect samples of deep-sea corals and related marine life in the canyons in the northern Gulf of Maine in U.S. and Canadian waters. The U.S. and Canadian scientists (National Systematics Laboratory at NOAA’s Northeast Fisheries Science Center; Dalhousie University in Halifax, Nova Scotia) will work aboard the NOAA Ship Henry B. Bigelow. They will use a remotedly operated vehicle (ROV) owned by Canada. Called the Remotely Operated Platform for Ocean Science-- ROPOS for short -- the vehicle can record images and sample water, sediments, and marine life using sampling gear operated from the ship by ROPOS pilots. The sites to be sampled range from 200 to 3000 meters deep, or roughly 650 to 10,000 feet deep. The ROPOS generally works in the Pacific Ocean, and this cruise marks the first time the ve
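The search step above sorts every document even though only the top ten are printed. For larger corpora, one option is to select the best `k` scores first and sort only those. The sketch below does this with plain NumPy on toy vectors; the function name `top_k` is hypothetical, and it assumes unit-length embeddings so that the dot product can serve as the similarity score.

```python
import numpy as np


def top_k(query_vector, doc_matrix, k=10):
    """Return [(row_index, score), ...] for the k highest dot-product scores."""
    scores = doc_matrix @ query_vector
    k = min(k, len(scores))
    if k <= 0:
        return []
    # argpartition finds the k best rows without sorting the whole array
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only the k candidates, best first
    ordered = candidates[np.argsort(-scores[candidates])]
    return [(int(i), float(scores[i])) for i in ordered]


# Toy unit-length vectors standing in for abstract embeddings
toy_docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
])
toy_query = np.array([0.0, 1.0])

results = top_k(toy_query, toy_docs, k=2)
for row, score in results:
    print(row, score)
```

The returned row indices can then be used to look up the matching titles and abstracts, for example `titles[row]` and `docs[row]` in this notebook.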

Print the results in a more readable format by registering a CSS rule that wraps long lines in each cell's output.

In [None]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
#Output passages & scores
for doc, score, title in doc_score_pairs[:10]:
    print("Score: %f" %score)
    print("Title: %s" %title)
    print("Abstract: %s" %doc)
    print('\n')

Score: 0.595809
Title: Digital Video taken during the 3-person submersible Clelia dive 621 of the NOAA Office of Ocean Exploration's Islands in the Stream 2001 cruise, 2001-09-05 (NCEI Accession 0039428)
Abstract: Islands in the Stream is a three-month scientific expedition to marine protected areas and other habitats being considered for protection from offshore of Belize in Central America to North Carolina’s continental shelf. "Islands” are the coral reefs and hard-bottom (also called “hard substrate”) biological communities found along the course of a massive “stream” of ocean currents connecting the Caribbean Sea, the Gulf of Mexico, and the western Atlantic off North America. This expedition emphasizes habitat characterization and an understanding of these islands' "connection" as parts of an interrelated ecosystem. Though separated by large expanses of ocean water, the fishes, corals, and invertebrates common to these islands may demonstrate that the health and vitality of “down