# Build your own meteorite classifier

In this expert part of the data-access tutorial, we look at an example of how
we can use our understanding of APIs to create useful tools for our daily
research life. The best tools are those that address a direct need in your
analysis. Here, the example taken from real life is:

*"I need a function that accepts the name of one or many meteorites and returns their classification."*

This tutorial explains in depth how we can approach and solve this problem. To find the
solution, we require basic knowledge of how data is sent to and from websites, which we explain below.
You will see that the code needed to get this tool working is rather minimal.

## Step 1: Identify the data source

To get started, we need a database containing the information we are looking
for: meteorite name-class relationships. The first thought: We could create a
large text file with all known meteorite names and their classifications.
However, we don't want to build and maintain this database ourselves. It's a
lot of work and other services already do this for us.

A good place for meteorite data is the [Meteoritical Bulletin](https://www.lpi.usra.edu/meteor/). The
website does what we want: you enter a meteorite name and it returns basic
information, including the classification. Great, the data that we are looking
for is available somewhere on this website!

However, the main page does not allow us to submit multiple objects at the same
time. We could still make it work, but the [Meteorite Name Checking
Utility](https://www.lpi.usra.edu/meteor/metbullcheck.php) on the same website
does allow us to send multiple objects, so we continue with this service.

<div>
  <center>
    <img src="../06-Conclusion/gfx/demo_metsoc.png" width="800"/>
  </center>
</div>

## Step 2: Identify an access protocol

If we have more than a handful of meteorites that we want to look up, it
quickly becomes tiring to type in their names on the website and submit the
"Search" request. Instead, we want to do this programmatically. We need some
way to communicate with the server via a script.

This week, we learned about different ways to access data on remote servers:
HTTP POST requests using the `requests` module, secondary data clients like
`astroquery`, and TAP services. Most databases specify at least one of these
methods as their preferred access method for the user. However, there is no
official API for the Meteoritical Bulletin. We thus have to be a bit creative
to find a way to programmatically query our data.

When you fill out a text form such as the one on the main page of the
Meteoritical Bulletin and hit `Search`, you are in most cases executing what is
called an HTTP POST request. We saw this in notebook `3.1-How_to_query_an_API`,
where we used the `requests.post` function to send data to a server. The same
happens when you click the `Search` button: we are sending data to a URL. If we
can find out what this data looks like and where it is sent, we might be able
to replicate this behaviour using the `requests` module.

Fortunately, your browser can tell you exactly what data it is sending to which
URL. This information is typically hidden in a "Developer" sub-menu. We use
Mozilla Firefox here, but the same is possible in most browsers. With Firefox,
you can right-click on the page and select "Inspect". You can see what this
looks like in the figure below.

<div>
  <center>
    <img src="../06-Conclusion/gfx/demo_inspect.png" width="1000"/>
  </center>
</div>

The window at the bottom shows the website elements. The "Network" tab shows
the data that we send to the server and the data that we receive from it. When
you open the inspection window, this tab is empty. If you then execute a search
on the website (e.g. type in "Vigarano" and hit `Search`), you will see a "POST
request" appear. This entry contains the data that we sent to the server and
the target URL. It also contains the server's response, which in this case is
an HTML page (the page that is shown after you execute a search).

The following information is relevant to us:

- Domain and file that the request was sent to: `https://www.lpi.usra.edu/` and
`meteor/metbullcheck.php`. This means that our target URL is
`https://www.lpi.usra.edu/meteor/metbullcheck.php`, the page that we are on.

- Request: The request contains the actual data that we sent. We see that there are two key-value pairs: `sea: Vigarano` and `Search: Search`.
We deduce that `sea` holds the search term that we enter into the text form.

<div>
  <center>
    <img src="../06-Conclusion/gfx/zoom_inspect.png" width="1000"/>
  </center>
</div>

We have all that we need to access the server programmatically now.
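
Before sending anything, it can help to look at what such a request contains. The following is a minimal sketch that builds the POST request with the two key-value pairs identified above but does not send it, so no network access is needed. The variable name `prepared` is only illustrative.

```python
import requests

URL = "https://www.lpi.usra.edu/meteor/metbullcheck.php"

# Build the request without sending it, so nothing goes over the network.
prepared = requests.Request(
    "POST",
    URL,
    data={"sea": "Vigarano", "Search": "Search"},
).prepare()

print(prepared.method)                   # POST
print(prepared.url)                      # the target URL
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # sea=Vigarano&Search=Search
```

The body is the form-encoded version of our dictionary. This should correspond to what the browser lists under "Request" in the "Network" tab.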

## Step 3: Scripting the access

We use the `requests` module in Python and write a standard POST request with the information we just gathered.

In [1]:
import requests

URL = "https://www.lpi.usra.edu/meteor/metbullcheck.php"

data = {
    "Search": "Search",
    "sea": "Vigarano",
}

r = requests.post(URL, data=data)

if r.ok:
  print("The query was successful.")
else:
  print("The query failed.")

The query failed.


The query failed. Let's have a look at the server's response.

In [2]:
r.content

b'<!DOCTYPE html>\n<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->\n<!--[if IE 7]>    <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->\n<!--[if IE 8]>    <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->\n<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->\n<head>\n<title>Attention Required! | Cloudflare</title>\n<meta charset="UTF-8" />\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n<meta http-equiv="X-UA-Compatible" content="IE=Edge" />\n<meta name="robots" content="noindex, nofollow" />\n<meta name="viewport" content="width=device-width,initial-scale=1" />\n<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" />\n<!--[if lt IE 9]><link rel="stylesheet" id=\'cf_styles-ie-css\' href="/cdn-cgi/styles/cf.errors.ie.css" /><![endif]-->\n<style>body{margin:0;padding:0}</style>\n\n\n<!--[if gte IE 10]><!-->\n<script>\n  if (!navigator.cookieEnabled) {\n    window.addEventListen

The relevant sentence in the response is this one: "*This website is using a security service to protect itself from online attacks.*"
The server did receive our request but did not like the look of it. Many websites have security measures like this one to prevent bad actors
from repeatedly accessing their services.

But we come in good faith. To gain the server's trust, we pretend that we are accessing it from a browser
instead of from a script. We do this by sending some metadata describing our fake browser along with our request.

In [3]:
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
}

r = requests.post(URL, data=data, headers=headers)

if r.ok:
  print("The query was successful.")
else:
  print("The query failed.")

The query was successful.


Success! You can have a look at `r.content` again, but we already know what it looks like: it is the same HTML page that the browser renders
when we send a regular request through the website of the Meteoritical Bulletin. The main information that we want is a table containing
the name of the meteorite we were looking for and its class.
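
To see why the first request may have stood out, we can check which `User-Agent` the `requests` module announces when we do not set one ourselves. The sketch below also shows a `requests.Session`, which is one possible way to store the header once instead of passing it with every call. Whether the `User-Agent` is the only property that the server checks is an assumption based on the result above.

```python
import requests

# The User-Agent that `requests` sends by default identifies us as a script.
default_agent = requests.utils.default_user_agent()
print(default_agent)  # starts with "python-requests/", followed by the version

# A Session stores headers once and sends them with every request made through it.
session = requests.Session()
session.headers.update(
    {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    }
)
print(session.headers["User-Agent"])

# A later call such as session.post(URL, data=data) would then carry this header.
```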

## Step 4: Extracting the information

HTML is great for websites but terrible for anything else. Fortunately, the process of "website scraping" that we are doing now is quite common,
and so many tools exist to help us get the data we want. One of those tools is `pandas`, which can parse HTML and extract tabular data.

> NB: The `pd.read_html` function might fail with an error if you do not have the
> `lxml` python library on your system. A `python -m pip install lxml` should fix this.

In [4]:
import pandas as pd
tables = pd.read_html(r.content)

tables

[                                     0                                    1
 0                                  NaN                                  NaN
 1  MetSoc Home  Publications  Contacts  MetSoc Home  Publications  Contacts,
                                      0
 0  MetSoc Home  Publications  Contacts,
           0                                                  1  \
 0  Vigarano  Directions: Enter a name or a list of names in...   
 
                                                    2  
 0  Directions: Enter a name or a list of names in...  ,
   Search name   Status† Full name  Abbreviation Comments  Bull   Mass
 0    Vigarano  Official  Vigarano           NaN      CV3   NaN  15000,
   Type of name                           Explanation Use in publications
 0     Official  A formally recognized meteorite name            Approved]

`pandas` found five tables on the HTML page, and the fourth one contains the information that we want!

In [5]:
result = tables[3]
result

Unnamed: 0,Search name,Status†,Full name,Abbreviation,Comments,Bull,Mass
0,Vigarano,Official,Vigarano,,CV3,,15000


There it is. We requested a meteorite by name and got the class information returned. Let's put this into a function.
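
Selecting the table by its position works for this page, but it would break if the layout of the website changed. As a more defensive alternative, we could pick the table by its column names. The helper `find_result_table` below is a hypothetical sketch, not part of `pandas`. It assumes that the result table is the only one with both a "Search name" and a "Comments" column, as in the output above. The small tables in the example are stand-ins that mimic that output.

```python
import pandas as pd


def find_result_table(tables, required=("Search name", "Comments")):
    """Return the first table whose columns include all names in `required`."""
    for table in tables:
        columns = [str(column) for column in table.columns]
        if all(name in columns for name in required):
            return table
    raise ValueError(f"No table with the columns {required} was found.")


# Stand-in tables with the same shape as some of those returned by pd.read_html above.
example_tables = [
    pd.DataFrame({0: ["MetSoc Home  Publications  Contacts"]}),
    pd.DataFrame({"Search name": ["Vigarano"], "Comments": ["CV3"]}),
    pd.DataFrame({"Type of name": ["Official"], "Use in publications": ["Approved"]}),
]

example_result = find_result_table(example_tables)
print(example_result["Comments"].iloc[0])  # CV3
```

With the real response, `find_result_table(tables)` could take the place of `tables[3]`.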

## Step 5: Putting together the pieces

In [7]:
import pandas as pd
import requests

def get_class_by_name(name):
  """Query a meteorite's classification from the Meteoritical Bulletin.

  Parameters
  ----------
  name : str
      The name of a meteorite

  Returns
  -------
  str
      The meteorite's class.
  """

  URL = "https://www.lpi.usra.edu/meteor/metbullcheck.php"

  data = {
      "Search": "Search",
      "sea": name,
  }

  headers = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
  }

  r = requests.post(URL, data=data, headers=headers)
  tables = pd.read_html(r.content)
  result = tables[3]
  return result.Comments.values[0]

Let's try our new function with three meteorites:

In [8]:
for meteorite in ["Vigarano", "Orgueil", "Hoba"]:
    class_ = get_class_by_name(meteorite)
    print(f"{meteorite} is a {class_} meteorite.")

Vigarano is a CV3 meteorite.
Orgueil is a CI1 meteorite.
Hoba is a Iron, IVB meteorite.


It works! And it's really not that much code, is it?

## TODO: Smoothen the rough edges

We have a working prototype that's ready to be used in our analysis. Additionally, other people might be interested in this nice little tool.
Here is a list of items that we could work on next to improve our meteorite classifier. We leave them open here, as an exercise to the reader. This is the expert-level tutorial, after all.

- What happens if an unknown meteorite name is passed?
- What happens if the server is unavailable?
- Instead of sending many requests for many objects, we should send a
single request with many objects. This is possible on the website and it would
considerably reduce the run time of our script.
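
As a possible starting point for these exercises, and not a finished solution, the sketch below separates the request into small pieces that can be checked without contacting the server. All function names are hypothetical. It assumes that the form accepts one meteorite name per line in the `sea` field and that the result table is still the fourth table on the page when several names are submitted. Neither assumption is verified here. The behaviour for unknown names is left open.

```python
import io

import pandas as pd
import requests

URL = "https://www.lpi.usra.edu/meteor/metbullcheck.php"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
}


def build_payload(names):
    """Build the form data for a single request covering several names.

    Assumption: the form expects one name per line in the `sea` field.
    """
    return {"Search": "Search", "sea": "\n".join(names)}


def classes_from_table(result):
    """Map each searched name to the entry in the 'Comments' column."""
    return dict(zip(result["Search name"], result["Comments"]))


def get_classes_by_names(names, timeout=30):
    """Query the classes of several meteorites with one request."""
    # The timeout raises an exception instead of waiting forever on a silent server.
    r = requests.post(URL, data=build_payload(names), headers=HEADERS, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx status.
    r.raise_for_status()
    tables = pd.read_html(io.StringIO(r.text))
    # Assumption: the result table keeps its position (fourth table on the page).
    return classes_from_table(tables[3])
```

A call such as `get_classes_by_names(["Vigarano", "Orgueil", "Hoba"])` would then send a single request instead of three.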
