<a href="https://colab.research.google.com/github/mohityadav11a/asteroid_spectra/blob/main/data_fetch_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Abstract**
This machine learning project classifies asteroid spectra by taxonomic class. We use over 1,000 spectra from [1] to train several kinds of models: binary classifiers that, for example, distinguish the X class from all non-X classes, multi-label classifiers, and autoencoders for unsupervised clustering.

# 1. Data Fetching
**References**

[1] URL: http://smass.mit.edu/smass.html (Under 2)

[2] Bus, Schelte J. (1999). Compositional structure in the asteroid belt: Results of a spectroscopic survey. Ph.D. thesis, Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences.

In [None]:
# Import modules
import hashlib
import os
import pathlib
import tarfile
import urllib.parse  # used explicitly below for urlsplit
import urllib.request

In [None]:
# Mount the Google Drive, where we store files and models.
try:
    from google.colab import drive
    drive.mount("/gdrive")
    core_path = "/gdrive/MyDrive/colab/asteroid_taxonomy"
except ModuleNotFoundError:
    core_path = ""

Mounted at /gdrive


In [None]:
# Define function to compute the sha256 value of the downloaded files
def comp_sha256(file_name):

    # Set the SHA256 hashing
    hash_sha256 = hashlib.sha256()

    # Open the file in binary mode (read-only) and parse it in 65,536 byte chunks (in case of
    # large files, the loading will not exceed the usable RAM)
    with pathlib.Path(file_name).open(mode="rb") as f_temp:
        for _seq in iter(lambda: f_temp.read(65536), b""):
            hash_sha256.update(_seq)

    # Digest the SHA256 result
    sha256_res = hash_sha256.hexdigest()

    return sha256_res
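
As a quick sanity check of the chunked hashing, the sketch below writes a small temporary file and confirms that reading it in 65,536-byte chunks gives the same digest as hashing all bytes in one call. The helper `chunked_sha256` is a stand-in that mirrors `comp_sha256` so the block runs on its own, and the payload is arbitrary.

```python
import hashlib
import pathlib
import tempfile


def chunked_sha256(file_name, chunk_size=65536):
    """Stand-in mirroring comp_sha256: hash a file chunk by chunk."""
    hash_sha256 = hashlib.sha256()
    with pathlib.Path(file_name).open(mode="rb") as f_temp:
        for chunk in iter(lambda: f_temp.read(chunk_size), b""):
            hash_sha256.update(chunk)
    return hash_sha256.hexdigest()


# Arbitrary payload larger than one chunk, so several reads are needed
payload = b"asteroid" * 20000  # 160,000 bytes

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_file = pathlib.Path(tmp_dir) / "payload.bin"
    tmp_file.write_bytes(payload)
    chunked_digest = chunked_sha256(tmp_file)

# Hashing all bytes at once must give the same hex digest
direct_digest = hashlib.sha256(payload).hexdigest()
print(chunked_digest == direct_digest)
```

The chunk size only affects memory use, not the resulting digest, which is why the two values agree.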

In [None]:
# Create the level0 data directory
pathlib.Path(os.path.join(core_path, "data/lvl0/")).mkdir(parents=True, exist_ok=True)

In [None]:
files_to_dl = \
    {'file1': {'url': 'http://smass.mit.edu/data/smass/Bus.Taxonomy.txt',
               'sha256': '0ce970a6972dd7c49d512848b9736d00b621c9d6395a035bd1b4f3780d4b56c6'},
     'file2': {'url': 'http://smass.mit.edu/data/smass/smass2data.tar.gz',
               'sha256': 'dacf575eb1403c08bdfbffcd5dbfe12503a588e09b04ed19cc4572584a57fa97'}}

In [None]:
# Iterate through the dictionary and download the files
for dl_entry in files_to_dl.values():

    # Extract the filename from the URL
    split = urllib.parse.urlsplit(dl_entry["url"])
    filename = pathlib.Path(os.path.join(core_path, "data/lvl0/", split.path.split("/")[-1]))

    # Download the file only if it does not exist locally yet
    if not filename.is_file():

        print(f"Downloading now: {dl_entry['url']}")
        urllib.request.urlretrieve(url=dl_entry["url"], filename=filename)

    # Compute and compare the hash value, also for files kept from an earlier run. Unlike
    # an assert, this check is not skipped when Python runs with the -O flag.
    file_hash = comp_sha256(filename)
    if file_hash != dl_entry["sha256"]:
        raise ValueError(f"SHA256 mismatch for {filename}: got {file_hash}")
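
A file that fails the hash check stays on disk, so the existence check skips re-downloading it on later runs. One possible way to handle this, sketched here under assumptions, is to delete a file whose digest does not match. The helper name `verify_or_remove` is hypothetical and not part of the notebook. For brevity it hashes the whole file in one read, which is fine for this small demo file but not for large archives.

```python
import hashlib
import pathlib
import tempfile


def verify_or_remove(file_path, expected_sha256):
    """Hypothetical helper: return True if the digest matches, else delete the file."""
    file_path = pathlib.Path(file_path)
    # Whole-file read keeps the sketch short; use chunked reads for large files
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    if digest == expected_sha256:
        return True
    file_path.unlink()
    return False


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = pathlib.Path(tmp_dir) / "demo.txt"
    demo_file.write_bytes(b"demo content")
    good_hash = hashlib.sha256(b"demo content").hexdigest()

    # Matching digest: the file is kept
    ok_result = verify_or_remove(demo_file, good_hash)
    kept = demo_file.is_file()

    # Deliberately wrong digest: the file is removed
    bad_result = verify_or_remove(demo_file, "0" * 64)
    removed = not demo_file.is_file()

print(ok_result, kept, bad_result, removed)
```

With the bad file gone, a later run of the download loop would fetch it again instead of silently reusing it.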

In [None]:
# Untar the spectra data; the context manager closes the archive even if extraction fails
with tarfile.open(os.path.join(core_path, "data/lvl0/", "smass2data.tar.gz"), "r:gz") as tar:
    tar.extractall(os.path.join(core_path, "data/lvl0/"))
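
`extractall` writes members to the paths stored in the archive, so a cautious variant can first check that every member resolves inside the target directory. The following is a minimal sketch of that idea, not the notebook's method. The helper name `safe_extract` is hypothetical, the demo archive and its content are placeholders, and the check covers member names only (not symlink targets). Recent Python versions also offer a `filter` argument on `extractall` for this purpose.

```python
import io
import pathlib
import tarfile
import tempfile


def safe_extract(tar_path, dest_dir):
    """Hypothetical helper: extract only if all members resolve inside dest_dir."""
    dest_dir = pathlib.Path(dest_dir).resolve()
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = (dest_dir / member.name).resolve()
            # Reject names such as "../x" that would land outside dest_dir
            if target != dest_dir and dest_dir not in target.parents:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        tar.extractall(dest_dir)


with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_dir = pathlib.Path(tmp_dir)

    # Build a tiny placeholder archive with a single text file
    archive = tmp_dir / "demo.tar.gz"
    content = b"0.44 0.95\n"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="spectra/demo.txt")
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))

    # Extract it into a fresh directory and read the file back
    out_dir = tmp_dir / "out"
    out_dir.mkdir()
    safe_extract(archive, out_dir)
    extracted = (out_dir / "spectra" / "demo.txt").read_bytes()

print(extracted == content)
```

Since the SMASS archive is already verified against a known SHA256 value, this check is an extra safeguard rather than a requirement.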