### Title: Analyzing Metadata of Hugging Face Models

#### Introduction
 Hugging Face provides a wide variety of machine learning models that can be used in diverse domains. 
 In this notebook, we analyze metadata of models available on Hugging Face Hub. 
 Specifically, we will retrieve metadata such as download counts, parameter counts, last-modified dates, and licenses, 
 and organize the data for insights.

#### Required Libraries
 We use the `huggingface_hub` library to interact with the Hugging Face Hub API, `pandas` for data manipulation, and `python-dotenv` to load the access token from a `.env` file.


In [1]:
import os
import re

import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import HfApi



### Step 1: Initialize the Hugging Face API
 The `HfApi` class is the client for the Hugging Face Hub API. We authenticate it with a token read from the `HF_TOKEN` environment variable.

In [2]:
load_dotenv()
hugging_face_token = os.getenv("HF_TOKEN")

if hugging_face_token is None:
    raise ValueError(
        "HF_TOKEN not found in .env file or environment variables."
        " Make sure you have a line like HF_TOKEN=your_hf_token in your .env."
    )

api = HfApi(token=hugging_face_token)

# Test the API. Note: with no `limit` argument, list_models() iterates over
# every public model on the Hub, so this call is slow.
try:
    # Materialize the iterator into a list
    test_models = list(api.list_models())
    print(f"Test fetch OK. Number of models fetched: {len(test_models)}")
except Exception as e:
    print(f"Error fetching test models: {e}")

Test fetch OK. Number of models fetched: 1320778
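
The test above walks the entire Hub listing. For a quick connectivity check, a bounded fetch is usually enough. The following is a minimal sketch, assuming the installed `huggingface_hub` version accepts the `limit` argument of `list_models`; the helper name `sample_model_ids` is hypothetical and not part of the library.

```python
from itertools import islice


def sample_model_ids(api, n=5):
    """Return the IDs of the first n models yielded by api.list_models().

    `api` is expected to be an HfApi instance (or any object exposing a
    compatible list_models method). `limit=n` asks for only n results;
    islice is a second guard in case more are yielded anyway.
    """
    return [model.modelId for model in islice(api.list_models(limit=n), n)]
```

With the client from the cell above, `sample_model_ids(api, n=5)` would return five model IDs without listing the whole Hub.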


### Step 2: Fetch Model Metadata
 We fetch listing metadata for every public model on the Hugging Face Hub (about 1.3 million at the time of this run).

In [4]:
try:
    # No `limit` is set, so every model is fetched; pass limit=N for a smaller run.
    # 'full=True' may or may not work on older versions.
    all_models = list(api.list_models(full=True))
    print(f"Fetched {len(all_models)} models.")
except Exception as e:
    print(f"Error fetching models: {e}")
    all_models = []

# If no models were fetched, we cannot proceed.
# SystemExit stops the cell without calling exit(), which is not reliable inside a notebook.
if not all_models:
    raise SystemExit("No models available. Exiting.")

Fetched 1321287 models.


### Step 3: Define Metadata Extraction Rules
 We use a regular expression to extract parameter-count tokens (e.g., "82M", "3.5B") from model IDs.

In [5]:
param_pattern = re.compile(r"(\d+(\.\d+)?[MBmb])")
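
The pattern above also matches a digit followed by `b` or `m` inside a longer token, for example the `4b` in `4bit`. The following is a minimal sketch of a stricter variant that also converts the match to billions of parameters. The names `PARAM_RE` and `parse_param_count` are hypothetical, and searching only the repository name (the part after `/`) is an assumption, not what the notebook does.

```python
import re

# The size token must not be glued to preceding letters, digits, or a dot,
# and must not be followed by another letter (so "4bit" is rejected).
PARAM_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)([MmBb])(?![A-Za-z])")


def parse_param_count(model_id):
    """Return the parameter count in billions parsed from a model ID, or None."""
    repo_name = model_id.split("/")[-1]  # ignore the organization prefix
    match = PARAM_RE.search(repo_name)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    # Millions are converted to billions; billions are returned unchanged.
    return value / 1000 if unit == "m" else value


for example_id in [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google-bert/bert-base-uncased",
    "some-org/model-4bit",  # made-up ID used to show the rejected case
]:
    print(example_id, "->", parse_param_count(example_id))
```

Returning a float (or `None`) instead of a string such as `"8B"` keeps the column numeric, which makes later sorting and filtering simpler.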

### Step 4: Process Models to Extract Metadata
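 For the first 100 models, we request detailed metadata (license, last-modified date, and downloads) with `api.model_info`; the remaining models receive default values. Parameter counts are parsed from every model ID, and the results are collected into a list of dictionaries.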

In [6]:
# Initialize variables (param_pattern is already defined in Step 3)
model_metadata_list = []

# Define the number of models for which we want detailed info
detailed_info_count = 100

for index, model in enumerate(all_models):
    try:
        model_id = model.modelId

        # For the first 100 models, fetch detailed information
        if index < detailed_info_count:
            model_info = api.model_info(model_id)

            # Safely extract license and last_updated
            card_data = model_info.cardData if model_info.cardData else {}
            license_info = card_data.get("license", "Unknown")  # Use get() to avoid KeyError
            last_updated = str(model_info.lastModified) if model_info.lastModified else "Unknown"
        else:
            # Set defaults for the remaining models
            license_info = "Unknown"
            last_updated = "Unknown"

        # Extract parameters
        match = param_pattern.search(model_id)
        if match:
            param_candidate = match.group(0)
            if param_candidate.lower().endswith("m"):
                parameters = f"{float(param_candidate[:-1]) / 1000:.3f}B"
            elif param_candidate.lower().endswith("b"):
                parameters = param_candidate
            else:
                parameters = "Unknown"
        else:
            parameters = "Unknown"

        # Add model metadata to the list
        metadata = {
            "model_id": model_id,
            "downloads (M)": model_info.downloads / 1_000_000 if index < detailed_info_count and model_info.downloads else 0,
            "parameters": parameters,
            "last_updated": last_updated,
            "license": license_info,
        }
        model_metadata_list.append(metadata)

    except AttributeError as e:
        print(f"Error fetching metadata for model {model_id}: {e}")
        metadata = {
            "model_id": model_id,
            "downloads (M)": 0,
            "parameters": "Unknown",
            "last_updated": "Unknown",
            "license": "Unknown",
        }
        model_metadata_list.append(metadata)
    except Exception as e:
        print(f"Error processing model {model_id}: {e}")


Error processing model bytedance-research/UI-TARS-7B-gguf: 404 Client Error. (Request ID: Root=1-679223ad-68b1aa0077655fbf1a34816f;628b5935-c68d-4426-ad0e-bf1b943b0dc0)

Repository Not Found for url: https://huggingface.co/api/models/bytedance-research/UI-TARS-7B-gguf.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
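
Per-model requests such as the one that failed above are slow and can hit repositories that were removed after the listing was fetched. One possible alternative is to read the license from the tags that come with each listing entry. The following is a minimal sketch, assuming the entry's `tags` list carries a tag of the form `license:<id>`; the helper name `license_from_tags` is hypothetical.

```python
def license_from_tags(tags):
    """Return the license ID from a list of Hub tags, or "Unknown"."""
    for tag in tags or []:  # tolerate None when an entry has no tags
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return "Unknown"


print(license_from_tags(["transformers", "pytorch", "license:apache-2.0"]))
print(license_from_tags(["transformers"]))
```

Inside the loop this could be called as `license_from_tags(model.tags)`, which needs no additional network request.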


### Step 5: Organize Metadata into a DataFrame
 The metadata list is converted into a pandas DataFrame for further analysis.

In [7]:
df = pd.DataFrame(model_metadata_list)
print(len(model_metadata_list))
print(f"\nDataFrame created with {len(df)} entries.")
print("DataFrame columns:", df.columns.tolist())

# Sort by downloads (M) if available
if "downloads (M)" in df.columns:
    df.sort_values(by="downloads (M)", ascending=False, inplace=True)
    df.reset_index(drop=True, inplace=True)
    df.index += 1
else:
    print("Warning: 'downloads (M)' column is missing; skipping sort.")

1321286

DataFrame created with 1321286 entries.
DataFrame columns: ['model_id', 'downloads (M)', 'parameters', 'last_updated', 'license']
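
The same ranking can be written without `inplace=True` by chaining the calls, and the result is easy to summarize per license. The following is a minimal sketch on a toy DataFrame; the rows are invented for illustration and only the column names follow the notebook.

```python
import pandas as pd

# Toy rows using the notebook's column names (all values are made up)
rows = [
    {"model_id": "org-a/model-x", "downloads (M)": 0.5, "license": "mit"},
    {"model_id": "org-b/model-y", "downloads (M)": 2.0, "license": "apache-2.0"},
    {"model_id": "org-c/model-z", "downloads (M)": 1.2, "license": "mit"},
]
toy = pd.DataFrame(rows)

# Sort by downloads, then rebuild the index so that it acts as a 1-based rank
ranked = toy.sort_values("downloads (M)", ascending=False).reset_index(drop=True)
ranked.index += 1

# Count how many models use each license
license_counts = ranked["license"].value_counts()

print(ranked)
print(license_counts)
```

Chaining returns a new DataFrame at each step, so the unsorted `toy` frame stays available for other analyses.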


### Step 6: Display the DataFrame
 The ten most-downloaded models are printed, with the pandas display options widened so that no rows or columns are truncated.

In [8]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("\nTop 10 models by 'downloads (M)':")
print(df.head(10).to_string())


Top 10 models by 'downloads (M)':
                                    model_id  downloads (M) parameters               last_updated     license
1              google-bert/bert-base-uncased      75.484575    Unknown  2024-02-19 11:06:12+00:00  apache-2.0
2     sentence-transformers/all-MiniLM-L6-v2      74.605162    Unknown  2024-11-01 10:26:30+00:00  apache-2.0
3                      openai-community/gpt2      12.553279    Unknown  2024-02-19 10:57:45+00:00         mit
4           pyannote/speaker-diarization-3.1       8.931313    Unknown  2024-05-10 19:43:23+00:00         mit
5           meta-llama/Llama-3.1-8B-Instruct       5.388887         8B  2024-09-25 17:00:57+00:00    llama3.1
6                    openai/whisper-large-v3       4.931610    Unknown  2024-08-12 10:20:10+00:00  apache-2.0
7                answerdotai/ModernBERT-base       4.708041    Unknown  2025-01-15 20:11:48+00:00  apache-2.0
8   meta-llama/Llama-3.2-11B-Vision-Instruct       2.555887        11B  2024-12-04 01

### Step 7: Save the DataFrame for Offline Analysis
 The metadata is saved to a CSV file for further offline analysis.

In [9]:
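# The 1-based index holds the download rank, so give that column a name in the CSV
df.to_csv("huggingface_model_metadata.csv", index_label="rank")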
print("\nMetadata saved to huggingface_model_metadata.csv")


Metadata saved to huggingface_model_metadata.csv


## Conclusion
 In this notebook, we fetched metadata for about 1.3 million models from the Hugging Face Hub. 
 The data is organized into a pandas DataFrame with download counts, parameter counts parsed from model IDs, 
 last-modified dates, and license types. Downloads, dates, and licenses are populated only for the first 100 models, so sorting by download count ranks the most popular models among those.