# Croissant 🥐 Universe Explorer - openml

This notebook works on the output produced by crawling OpenML datasets.

We project Croissant metadata into a text-embedding space that can be ingested by a visual universe explorer such as https://github.com/luisoala/croissant-universe-surfer.

## bertopic imports (to extract embeddings)

In [1]:
import numpy as np
from bertopic import BERTopic
from umap import UMAP
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer


## croissant imports

In [2]:
import sys
from etils import epath
from IPython.display import Markdown
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import seaborn as sns
import re
sns.set()

# Get croissant descriptions for openml

In [8]:
source = "openml" 
folder = epath.Path("/home/donald/repos/croissant/health/data")
files = folder.glob("*/*.parquet")

In [11]:
df = pl.scan_parquet(files).filter(pl.col("source") == source)
print(f"Report for {source}")

Report for openml


In [12]:
sns.set_style("white")
num_rows = df.select(pl.len()).collect().item()
display(Markdown(f"Scraped {num_rows} datasets for {source}"))
body = pl.col("body")

Scraped 5435 datasets for openml

In [13]:
a = np.array(df.collect())

In [14]:
print(a.shape)

(5435, 15)


In [15]:
def extract_first_description_and_all_urls_per_entry(json_strings):
    descriptions = []
    all_urls = []  # This will be a list of lists
    description_pattern = re.compile(r'"description": "(.*?)"(?=[,}])', re.DOTALL)
    url_pattern = re.compile(r'"url": "(.*?)"(?=[,}])', re.DOTALL)
    
    for string in json_strings:
        # Search for the first description
        description_match = description_pattern.search(string)
        if description_match:
            description = description_match.group(1).encode('utf-8').decode('unicode_escape')
            descriptions.append(description)
        else:
            descriptions.append('No description found')
        
        # Search for URLs and keep them in a list corresponding to the current string
        url_matches = url_pattern.findall(string)
        if url_matches:
            urls = [match.encode('utf-8').decode('unicode_escape') for match in url_matches]
            all_urls.append(urls)  # Append list of URLs for the current entry
        else:
            all_urls.append(['No URL found'])  # Append a placeholder if no URLs are found
    
    return descriptions, all_urls
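
The regex approach above works on the raw text, so it can stop early at an escaped quote that happens to be followed by a comma, and the `unicode_escape` round-trip can garble non-ASCII characters. A minimal alternative sketch, assuming each entry is a valid JSON string (the crawl dump may need decoding first), parses the document and walks it instead. The helper name `first_description_and_urls` and the sample record are hypothetical and not part of the original notebook.

```python
import json


def first_description_and_urls(json_string):
    """Return (first description, all urls) found anywhere in a JSON document."""
    try:
        doc = json.loads(json_string)
    except (TypeError, ValueError):
        # Not parseable as JSON: fall back to the notebook's placeholders.
        return "No description found", ["No URL found"]

    descriptions, urls = [], []

    def walk(node):
        # Visit dicts and lists depth-first, in document order.
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "description" and isinstance(value, str):
                    descriptions.append(value)
                elif key == "url" and isinstance(value, str):
                    urls.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(doc)
    description = descriptions[0] if descriptions else "No description found"
    return description, urls or ["No URL found"]


# Hypothetical record with embedded quotes and a non-ASCII character.
sample = json.dumps({
    "name": "diamonds",
    "description": 'Cut is "Ideal", "Premium", or "Fair" (prices in €)',
    "url": "https://www.openml.org/search?type=data&id=42225",
    "distribution": [
        {"description": "data file", "url": "https://example.org/data.arff"}
    ],
})
description, urls = first_description_and_urls(sample)
print(description)
print(urls)
```

Parsing first means escapes are resolved by the JSON decoder, so no manual `encode`/`decode` step is needed.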

In [16]:
# Get the lists of croissant descriptions and URLs from the openml crawl dump.
descriptions, urls = extract_first_description_and_all_urls_per_entry([str(x[0]) for x in a])

In [17]:
#sanity checks
print(descriptions[0])
print(len(descriptions))
print(urls[0])
print(len(urls))

This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.\n\nContent\nprice price in US dollars (\\$326--\\$18,823)\n\ncarat weight of the diamond (0.2--5.01)\n\ncut quality of the cut (Fair, Good, Very Good, Premium, Ideal)\n\ncolor diamond colour, from J (worst) to D (best)\n\nclarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))\n\nx length in mm (0--10.74)\n\ny width in mm (0--58.9)\n\nz depth in mm (0--31.8)\n\ndepth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)\n\ntable width of top of diamond relative to widest point (43--95)
5435
['https://www.openml.org/search?type=data&id=42225']
5435


# Embed croissant descriptions

In [58]:
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 170/170 [02:52<00:00,  1.02s/it]


In [59]:
#sanity checks
print(embeddings.shape)

(5435, 384)
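
Beyond checking the shape, one way to sanity-check an embedding matrix is to look at which rows are closest to a given row under cosine similarity. The sketch below is an illustration on toy vectors, not part of the original notebook; the function name `nearest_neighbours` and the toy data are assumptions.

```python
import numpy as np


def nearest_neighbours(embeddings, query_index, k=3):
    """Return indices of the k rows most cosine-similar to row `query_index`."""
    vectors = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    similarity = unit @ unit[query_index]
    similarity[query_index] = -np.inf  # exclude the query row itself
    return np.argsort(-similarity, kind="stable")[:k]


# Toy stand-in for the real (5435, 384) matrix.
toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
print(nearest_neighbours(toy, query_index=0, k=2))
```

In the notebook, something like `[descriptions[i] for i in nearest_neighbours(embeddings, 0)]` would then show whether the neighbours of the first dataset read as related.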


In [67]:
# Write embeddings to CSV for ingestion into the visual explorer.
# Column 0 of `emb` is the first embedding dimension, not an index, so it is kept;
# index=False below already keeps the row index out of the file.
emb = pd.DataFrame(embeddings)
emb.to_csv('openmlcroissant_inputs.csv', index=False)
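
As a quick check on what the CSV contains, the following sketch round-trips a small stand-in matrix through `to_csv(index=False)` and `read_csv`. It uses an in-memory buffer instead of a file, and `toy_embeddings` is a hypothetical placeholder for the real `embeddings` array.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the real embedding matrix.
toy_embeddings = np.arange(12, dtype=float).reshape(3, 4)

buffer = io.StringIO()
pd.DataFrame(toy_embeddings).to_csv(buffer, index=False)
buffer.seek(0)

# With index=False no index column is written, so every column read back
# is an embedding dimension.
restored = pd.read_csv(buffer)
print(restored.shape)
```

If the restored shape matches the original, nothing needs to be dropped before writing.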

In [69]:
# Write the corresponding URLs to CSV for ingestion into the visual explorer.
urls = pd.DataFrame(urls)
urls.to_csv('openmlcroissant_labels.csv', index=False)
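
Because `urls` is a list of lists, entries with several URLs produce several columns, and shorter rows are padded with missing values. If the explorer expects a single label per row (an assumption, not something the notebook states), a sketch like the following keeps only the first URL of each entry. The toy `url_lists` values are hypothetical.

```python
import pandas as pd

# Toy stand-in for the `urls` list of lists built earlier.
url_lists = [
    ["https://example.org/a"],
    ["https://example.org/b", "https://example.org/b/data.csv"],
    ["No URL found"],
]

# One column per list position; short rows are padded with missing values.
ragged = pd.DataFrame(url_lists)
print(ragged.shape)

# Keeping the first URL of each entry gives exactly one label column.
labels = pd.DataFrame({"url": [entry[0] for entry in url_lists]})
print(labels.shape)
```

Either form keeps one row per dataset, so the labels stay aligned with the rows of the embeddings file.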
