# Random Useless Facts

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/iragca/pit-predictive-analysis/blob/master/notebooks/1.0-random-useless-facts.ipynb)

## Executive Summary



## Table of Contents

```bash
.
├── Title Section
│       ├── Executive Summary
│       ├── Table of Contents                   #  ⭐ You are here
│       └── Setup                               #  Libraries and dataset imports
├── Introduction
│       └── Problem Statement
└── Methodology
```


## Setup

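Before running this notebook, set the `SUPABASE_URL` and `SUPABASE_KEY` environment variables. For local runs they can be exported in the shell or placed in a `.env` file, which `dotenv.load_dotenv()` reads. In Google Colab the notebook reads them through `google.colab.userdata` under the same names.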
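
For local runs, a quick pre-flight check can report which variables are still missing before any client is created. This is a minimal sketch: the helper name `missing_env_vars` and the constant `REQUIRED_VARS` are illustrative and do not appear elsewhere in the notebook.

```python
import os
from typing import Mapping

# Names taken from the setup note above
REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try without touching the real environment.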


In [1]:
try:
    import google.colab  # type: ignore
    IS_COLAB_ENVIRONMENT = True
except ImportError:
    IS_COLAB_ENVIRONMENT = False

if IS_COLAB_ENVIRONMENT:
    from google.colab import userdata # type: ignore
    %pip install iragca
    %pip install pydantic
    %pip install supabase

In [2]:
from enum import Enum
import os
from urllib.parse import urlencode, urljoin

import dotenv
import httpx
from pydantic import BaseModel
from supabase import Client, create_client

dotenv.load_dotenv()

True

# Introduction

## Problem Statement

Which topics appear most often among the facts, despite the mundane subject matter?

# Methodology

## API Wrapper

In [3]:
from typing import Optional


class Language(str, Enum):
    """
    Enumeration of supported languages for the Random Useless Facts API.

    Each enum member corresponds to the language code expected by the
    remote API.
    """
    ENGLISH = "en"
    GERMAN = "de"

class Fact(BaseModel):
    """
    Data model representing a response returned by the Random Useless Facts API.

    Parameters
    ----------
    id : str
        Unique identifier of the fact.
    text : str
        The textual content of the fact.
    source : str
        Origin or reference for the fact.
    source_url : str
        URL pointing to the source.
    language : Language
        Language in which the fact is written.
    permalink : str
        Permanent URL for accessing the fact.

    Notes
    -----
    This model is parsed from JSON using Pydantic and ensures type validation
    of the incoming API response.
    """
    id: str
    text: str
    source: str
    source_url: str
    language: Language
    permalink: str

    def __hash__(self) -> int:
        return hash(self.id)


class RandomUselessFactAPI:
    """
    Client for interacting with the Random Useless Facts API.

    Parameters
    ----------
    base_url : str, optional
        Base URL of the API. Defaults to ``"https://uselessfacts.jsph.pl/"``.

    Notes
    -----
    This class exposes convenience methods to construct API URLs and retrieve
    random facts in various languages. Requests are executed using ``httpx``.
    """
    def __init__(self, base_url: str = "https://uselessfacts.jsph.pl/") -> None:
        self.base_url = base_url

    def get_random_fact(self, language: Language = Language.ENGLISH) -> Fact:
        """
        Retrieve a random useless fact.

        Parameters
        ----------
        language : Language, optional
            Language in which the fact should be returned. Defaults to
            ``Language.ENGLISH``.

        Returns
        -------
        Fact
            Parsed API response containing the random fact.

        Raises
        ------
        httpx.HTTPError
            If the request fails or returns an error status code.

        Examples
        --------
        >>> api = RandomUselessFactAPI()
        >>> fact = api.get_random_fact()
        >>> fact.text
        'Banging your head against a wall for one hour burns 150 calories.'
        """
        url = self._build_random_fact_url(language)
        response = httpx.get(url)
        response.raise_for_status()
        data = response.json()

        return Fact(**data)
    
    def _build_random_fact_url(self, language: Optional[Language] = None) -> str:
        """
        Construct the request URL for fetching a random fact.

        Parameters
        ----------
        language : Optional[Language], optional
            Language code to include as a query parameter. Defaults to
            ``None``.

        Returns
        -------
        str
            Fully constructed URL pointing to the random fact endpoint.

        Notes
        -----
        This method is internal and is not intended to be called directly by
        users. It performs safe URL joining and query parameter encoding.
        """
        ENDPOINT = "api/v2/facts/random"
        path = urljoin(self.base_url, ENDPOINT)
        
        if language is not None:
            params = urlencode({"language": language.value})
            return f"{path}?{params}"
        return path
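
The URL construction above can be tried in isolation with the standard library alone. The sketch below repeats the same `urljoin` and `urlencode` steps outside the class. The `https://example.com/proxy` base is a hypothetical value, used only to show that `urljoin` replaces the last path segment when the base URL has no trailing slash.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://uselessfacts.jsph.pl/"
ENDPOINT = "api/v2/facts/random"

# Same two steps as _build_random_fact_url: join the path, then append the query
path = urljoin(BASE_URL, ENDPOINT)
url = f"{path}?{urlencode({'language': 'de'})}"
print(url)  # https://uselessfacts.jsph.pl/api/v2/facts/random?language=de

# Hypothetical base URLs with a path component
no_slash = urljoin("https://example.com/proxy", ENDPOINT)
with_slash = urljoin("https://example.com/proxy/", ENDPOINT)
print(no_slash)    # https://example.com/api/v2/facts/random
print(with_slash)  # https://example.com/proxy/api/v2/facts/random
```

A custom `base_url` that includes a path should therefore end with `/`.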
    

## Data Warehouse

In [4]:
from functools import lru_cache

class SupabaseStorage:
    """
    Storage interface for interacting with the Supabase `facts` table.

    This class handles:
    - Environment variable loading for Supabase credentials
    - Inserting single or multiple facts into the database
    - Retrieving all stored facts

    Attributes
    ----------
    supabase : supabase.Client
        Initialized Supabase client used for database operations.

    Raises
    ------
    ValueError
        If required Supabase environment variables are missing.
    """
    def __init__(self) -> None:
        """
        Initialize the Supabase client using environment variables.

        The method attempts to load the Supabase URL and API key using
        `load_env`. If either value is missing, an exception is raised.

        Raises
        ------
        ValueError
            If `SUPABASE_URL` or `SUPABASE_KEY` is not set.
        """
        env = self.load_env()
        if env and all(env.values()):
            self.supabase: Client = create_client(env["url"], env["key"])
        else:
            raise ValueError("SUPABASE_URL and SUPABASE_KEY must be set in environment variables.")
        
    @lru_cache
    def insert_fact(self, fact: Fact):
        """
        Insert a single fact into the Supabase `facts` table.

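        The method is wrapped in `functools.lru_cache` with its default
        `maxsize` of 128, so a repeated call with an equal `Fact` on the same
        instance returns the cached response instead of inserting again.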

        Parameters
        ----------
        fact : Fact
            The fact object to be inserted into the database.

        Returns
        -------
        supabase.lib.client_response.ClientResponse
            Response object returned by the Supabase insert operation.
        """
        response = self.supabase.table("facts").insert(fact.model_dump()).execute()
        return response
    
    def insert_facts(self, facts: list[Fact]):
        """
        Insert multiple facts into the Supabase `facts` table.

        Parameters
        ----------
        facts : list of Fact
            A list of fact objects to be inserted.

        Returns
        -------
        supabase.lib.client_response.ClientResponse
            Response object returned by the Supabase bulk insert operation.
        """
        response = self.supabase.table("facts").insert([fact.model_dump() for fact in facts]).execute()
        return response
    
    def get_all_facts(self) -> list[Fact]:
        """
        Retrieve all facts from the Supabase `facts` table.

        Returns
        -------
        list of Fact
            A list of `Fact` objects reconstructed from the database records.
        """
        response = self.supabase.table("facts").select("*").execute()
        data = response.data
        return [Fact(**item) for item in data]
    
    @staticmethod
    def load_env():
        """
        Load Supabase environment variables.

        This method supports both Google Colab and standard local
        environment variable loading.

        Returns
        -------
        dict
            Dictionary containing:
            - `"url"` : str
                Supabase project URL.
            - `"key"` : str
                Supabase API key.
        """
        if IS_COLAB_ENVIRONMENT:
            url: str = userdata.get("SUPABASE_URL")
            key: str = userdata.get("SUPABASE_KEY")
        else:
            url: str = os.environ.get("SUPABASE_URL", "")
            key: str = os.environ.get("SUPABASE_KEY", "")
        return {"url": url, "key": key}
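
The caching on `insert_fact` depends on its arguments being hashable, which is why `Fact` defines `__hash__`. The sketch below shows the same behaviour with standard-library stand-ins. `FactStub` and `insert_stub` are hypothetical names, and a dataclass replaces the Pydantic model on the assumption that both compare field by field while hashing on `id`.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass
class FactStub:
    """Hypothetical stand-in for Fact: field-wise equality, hash based on id."""

    id: str
    text: str

    def __hash__(self) -> int:
        return hash(self.id)


calls: list[str] = []


@lru_cache  # bare decorator, so the default maxsize of 128 applies
def insert_stub(fact: FactStub) -> str:
    calls.append(fact.id)  # stands in for the database round trip
    return f"inserted {fact.id}"


insert_stub(FactStub(id="a1", text="hello"))
insert_stub(FactStub(id="a1", text="hello"))  # equal argument: served from the cache
insert_stub(FactStub(id="b2", text="world"))

print(calls)                             # ['a1', 'b2']
print(insert_stub.cache_info().maxsize)  # 128
```

A cache hit never reaches the function body. If the same pattern holds for `insert_fact`, a repeated call with an equal fact would return the earlier response without contacting Supabase, so no duplicate-key error would be raised for it.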

    


## Ingesting Data

In [5]:
api = RandomUselessFactAPI()
storage = SupabaseStorage()

In [6]:
fact: Fact = api.get_random_fact(language=Language.ENGLISH)
print(fact.text)

The pancreas produces Insulin.


In [7]:
fact

Fact(id='f0438fa76b4372f93f5bad20c34fb448', text='The pancreas produces Insulin.', source='djtech.net', source_url='https://www.djtech.net/humor/shorty_useless_facts.htm', language=<Language.ENGLISH: 'en'>, permalink='https://uselessfacts.jsph.pl/api/v2/facts/f0438fa76b4372f93f5bad20c34fb448')

In [None]:
from itertools import batched, chain  # `batched` requires Python 3.12+
from time import sleep as cooldown

from tqdm import tqdm

FETCH_COUNT = 100
BATCH_SIZE = 10


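def fetch_facts(count: int, cooldown_sec: int = 1) -> list[Fact]:
    """Fetch `count` random facts one at a time, sleeping `cooldown_sec` seconds after each request."""
    facts = []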
    for _ in tqdm(range(count), desc="Fetching facts"):
        fact = api.get_random_fact()
        facts.append(fact)
        cooldown(cooldown_sec)  # To avoid hitting rate limits
    return facts

# Fetch facts from the API
facts = fetch_facts(FETCH_COUNT)

failed_batches = []

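# Insert facts into Supabase in batches
batch_total = -(-len(facts) // BATCH_SIZE)  # ceiling division keeps tqdm's total an integer
for batch in tqdm(batched(facts, BATCH_SIZE), total=batch_total, desc="Batch ingest"):
    try:
        storage.insert_facts(list(batch))
    except Exception:
        # Keep the whole batch so its facts can be retried one by one
        failed_batches.append(list(batch))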


errors = []

# Retry failed inserts individually
if failed_batches:
    # Flatten into a list so tqdm can show a total
    failed_facts = list(chain.from_iterable(failed_batches))
    for fact in tqdm(failed_facts, desc="Retrying failed inserts"):
        try:
            storage.insert_fact(fact)
        except Exception as e:
            errors.append((fact, str(e)))

# Analyze errors for duplicates
duplicate_count = sum(1 for fact, error in errors if "duplicate" in error.lower())
duplicate_count
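
The batch-then-retry flow above can be exercised without network access. This is a minimal sketch under stated assumptions: `InMemoryStore` is a hypothetical stand-in for the `facts` table, assumed to have a unique `id` column and all-or-nothing bulk inserts, and its error message is invented to contain the word "duplicate". `chunked` is an illustrative replacement for `itertools.batched` on Python versions before 3.12.

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(items: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` items; a stand-in for itertools.batched."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


class InMemoryStore:
    """Hypothetical stand-in for a table with a unique `id` column."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def insert(self, rows: list[dict]) -> None:
        # All-or-nothing: reject the whole call if any id is already taken
        ids = [row["id"] for row in rows]
        if len(set(ids)) < len(ids) or any(i in self.rows for i in ids):
            raise ValueError("duplicate key value violates unique constraint")
        for row in rows:
            self.rows[row["id"]] = row


def ingest(store: InMemoryStore, rows: list[dict], batch_size: int) -> list[tuple[dict, str]]:
    """Insert in batches, then retry the rows of failed batches one by one."""
    failed_batches = []
    for batch in chunked(rows, batch_size):
        try:
            store.insert(batch)
        except ValueError:
            failed_batches.append(batch)

    errors = []
    for batch in failed_batches:
        for row in batch:
            try:
                store.insert([row])
            except ValueError as exc:
                errors.append((row, str(exc)))
    return errors


rows = [{"id": fact_id} for fact_id in ["a", "b", "a", "c", "d"]]
store = InMemoryStore()
errors = ingest(store, rows, batch_size=2)
duplicate_count = sum(1 for _, message in errors if "duplicate" in message.lower())

print(sorted(store.rows))  # ['a', 'b', 'c', 'd']
print(duplicate_count)     # 1
```

The second batch fails because it contains an `id` that is already stored. On retry only that row is rejected, and the other row from the same batch is inserted.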

## Data Exploration

In [7]:
facts = storage.get_all_facts()
len(facts)

1501

In [8]:
facts

[Fact(id='a79df1c3d4ee87fa1c6ffb798cb6c3d1', text='Approximately every seven minutes of every day, someone in an aerobics class pulls their hamstring.', source='djtech.net', source_url='http://www.djtech.net/humor/useless_facts.htm', language=<Language.ENGLISH: 'en'>, permalink='https://uselessfacts.jsph.pl/api/v2/facts/a79df1c3d4ee87fa1c6ffb798cb6c3d1'),
 Fact(id='9930af0979d0b01ffa1311000e6cf29c', text='The country code for Russia is "007".', source='djtech.net', source_url='https://www.djtech.net/humor/shorty_useless_facts.htm', language=<Language.ENGLISH: 'en'>, permalink='https://uselessfacts.jsph.pl/api/v2/facts/9930af0979d0b01ffa1311000e6cf29c'),
 Fact(id='38583dc74ede6858ca400242b81efd8d', text='Steely Dan got their name from a sexual device depicted in the book `The Naked Lunch`. \xa0', source='djtech.net', source_url='http://www.djtech.net/humor/useless_facts.htm', language=<Language.ENGLISH: 'en'>, permalink='https://uselessfacts.jsph.pl/api/v2/facts/38583dc74ede6858ca400242