# Croissant 🥐 Universe Explorer - openml

This notebook works on the output of a crawl of OpenML datasets.

We project Croissant metadata into a text embedding space so that it can be ingested by a visual universe explorer such as https://github.com/luisoala/croissant-universe-surfer.

## bertopic imports (to extract embeddings)

In [1]:
import numpy as np
from bertopic import BERTopic
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
import json


## croissant imports

In [2]:
import sys
from etils import epath
from IPython.display import Markdown
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import seaborn as sns
import re
sns.set()

# get croissant descriptions for openml

In [28]:
source = "openml" 
folder = epath.Path("/home/donald/repos/croissant/health/data")
files = folder.glob("*/*.parquet")
# files = list(files)[0:6]  # with several vendor crawls on your local machine, restrict the file generator to the relevant files

In [4]:
df = pl.scan_parquet(files).filter(pl.col("source") == source)
print(f"Report for {source}")

Report for openml


In [5]:
sns.set_style("white")
num_rows = df.select(pl.len()).collect().item()
display(Markdown(f"Scraped {num_rows} datasets for {source}"))
body = pl.col("body")

Scraped 5435 datasets for openml

In [7]:
a = np.array(df.collect())

In [8]:
print(a.shape)

(5435, 15)


In [23]:
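def extract_descriptions_and_urls(json_strings):
    """Return the description and URL of every record that parses and has both fields."""
    descriptions = []
    all_urls = []
    invalid_json_counter = 0
    for string in json_strings:
        try:
            ds_dict = json.loads(string)
            # Read both fields before appending so the two lists stay aligned.
            description = ds_dict["description"]
            url = ds_dict["url"]
        except (ValueError, KeyError, TypeError):
            # Malformed JSON (json.JSONDecodeError is a ValueError), a missing key,
            # or a record that is not a JSON object.
            invalid_json_counter += 1
            continue
        descriptions.append(description)
        all_urls.append(url)

    print(f"Decoding done, encountered {invalid_json_counter} invalid json files.")
    return descriptions, all_urls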

In [24]:
descriptions, urls = extract_descriptions_and_urls(x[0] for x in a) #get a list of croissant descriptions and urls from the openml crawl dump

Decoding done, encountered 1565 invalid json files.
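
The count above covers every record that raised an exception, so it may include well-formed JSON that merely lacks a `description` or `url` key. A minimal sketch, assuming the same kind of input as `extract_descriptions_and_urls`, that tallies the reasons separately. The helper name `count_failure_reasons`, the reason labels, and the toy records are hypothetical and not part of the crawl code:

```python
import json
from collections import Counter


def count_failure_reasons(json_strings):
    """Tally why records would be skipped (hypothetical helper for illustration)."""
    reasons = Counter()
    for string in json_strings:
        try:
            ds_dict = json.loads(string)
        except (ValueError, TypeError):
            # Not parseable as JSON: malformed text, or None / non-string input.
            reasons["invalid_json"] += 1
            continue
        if not isinstance(ds_dict, dict):
            reasons["not_an_object"] += 1
        elif "description" not in ds_dict:
            reasons["missing_description"] += 1
        elif "url" not in ds_dict:
            reasons["missing_url"] += 1
        else:
            reasons["ok"] += 1
    return reasons


# Toy records for illustration only; these are not taken from the openml crawl.
sample = [
    '{"description": "toy dataset", "url": "https://example.org/1"}',
    '{"description": "no url here"}',
    '{"url": "https://example.org/3"}',
    "not json at all",
    None,
]
reasons = count_failure_reasons(sample)
print(dict(reasons))
```

On the real data this could be called as `count_failure_reasons(x[0] for x in a)` to see how many of the skipped records are truly malformed.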


In [25]:
#sanity checks
print(descriptions[0])
print(len(descriptions))
print(urls[0])
print(len(urls))

This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.

Content
price price in US dollars (\$326--\$18,823)

carat weight of the diamond (0.2--5.01)

cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)

color diamond colour, from J (worst) to D (best)

clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

x length in mm (0--10.74)

y width in mm (0--58.9)

z depth in mm (0--31.8)

depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)

table width of top of diamond relative to widest point (43--95)
3870
https://www.openml.org/search?type=data&id=42225
3870


# Embed croissant descriptions

In [58]:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 170/170 [02:52<00:00,  1.02s/it]


In [59]:
#sanity checks
print(embeddings.shape)

(5435, 384)
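
Beyond the shape, one further sanity check is to look up the nearest neighbour of a description by cosine similarity and confirm that it is topically related. The sketch below uses plain Python and toy 3-dimensional vectors so that it runs without the model. The function names and vectors are hypothetical, and on the real `embeddings` array a vectorised NumPy version would be the more natural choice:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query_index, vectors):
    """Return (index, similarity) of the vector most similar to vectors[query_index]."""
    best_index, best_score = None, -2.0  # cosine similarity is always >= -1
    for index, vector in enumerate(vectors):
        if index == query_index:
            continue  # skip the query itself
        score = cosine_similarity(vectors[query_index], vector)
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


# Toy vectors standing in for rows of the embedding matrix.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
neighbor, score = nearest_neighbor(0, toy_vectors)
print(neighbor, round(score, 4))
```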


In [67]:
# write embeddings to csv for ingestion to the visual explorer
emb = pd.DataFrame(embeddings)
# index=False already keeps the row index out of the file; dropping emb.columns[0]
# would discard the first embedding dimension, so all columns are kept
emb.to_csv('openmlcroissant_inputs.csv', index=False)

In [69]:
#write corresponding urls to csv for ingestion to the visual explorer
urls = pd.DataFrame(urls)
urls.to_csv('openmlcroissant_labels.csv', index=False)
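
The two CSV files are only useful together if row i of the inputs belongs to row i of the labels. That pairing is an assumption about how the explorer consumes them and is not confirmed here. Since the outputs recorded above show 3870 URLs and an embedding matrix with 5435 rows, a row-count comparison before ingestion seems worthwhile. A minimal sketch using only the standard library, with toy files standing in for the real ones (the helper name `count_data_rows` is hypothetical):

```python
import csv
import tempfile
from pathlib import Path


def count_data_rows(csv_path):
    """Count the rows of a CSV file, excluding the header line."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header that DataFrame.to_csv writes by default
        return sum(1 for _ in reader)


# Toy files standing in for openmlcroissant_inputs.csv and openmlcroissant_labels.csv.
with tempfile.TemporaryDirectory() as tmp:
    inputs_path = Path(tmp) / "toy_inputs.csv"
    labels_path = Path(tmp) / "toy_labels.csv"
    inputs_path.write_text("0,1,2\n0.1,0.2,0.3\n0.4,0.5,0.6\n")
    labels_path.write_text("0\nhttps://example.org/1\nhttps://example.org/2\n")

    n_inputs = count_data_rows(inputs_path)
    n_labels = count_data_rows(labels_path)

rows_match = n_inputs == n_labels
print(f"inputs: {n_inputs} rows, labels: {n_labels} rows, match: {rows_match}")
```

For the real files, point `count_data_rows` at `openmlcroissant_inputs.csv` and `openmlcroissant_labels.csv`.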
