# Working with Hugging Face - Part 1

## Getting Started with Hugging Face

Start your journey with the Hugging Face platform by understanding what Hugging Face is and its common use cases. Then you'll learn about the Hugging Face Hub, including the models and datasets available, how to search for them, how to navigate model and dataset cards, and how to download them. Lastly, you'll learn about the high-level components of transformers and LLMs.

### Searching the Hub with Python
The Hugging Face Hub provides a nice user interface for searching for models and learning more about them. At times, you may find it convenient to be able to do the same thing without leaving the development environment. Fortunately, Hugging Face also provides a Python package which allows you to find models through code.

Use the huggingface_hub package to find the most downloaded model for text classification.

HfApi and ModelFilter from the huggingface_hub package are already loaded for you.

In [1]:
from huggingface_hub import HfApi

api = HfApi()

# Return the filtered list from the Hub
models = api.list_models(
    task="text-classification",
    sort="downloads",
    direction=-1,
    limit=1
)

# Store as a list
modelList = list(models)

print(modelList[0].modelId)

1231czx/llama3_it_ultra_list_and_bold500


### Saving a model
There may be situations where you want to download a model and store it locally (i.e. in a directory on your computer), for example when working offline.

Practice downloading and saving here. The AutoModel class is already loaded for you under the same name.

In [2]:
from transformers import AutoModel

modelId = "distilbert-base-uncased-finetuned-sst-2-english"

# Instantiate the AutoModel class
model = AutoModel.from_pretrained(modelId)

# Save the model
model.save_pretrained(save_directory=f"models/{modelId}")
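
To check that a saved directory can be loaded back without contacting the Hub, the round trip can be sketched with a tiny, randomly initialised model. The small `DistilBertConfig` values and the temporary directory below are assumptions chosen only to keep the example fast and offline; they are not the configuration of the checkpoint used above.

```python
import os
import tempfile

import torch
from transformers import AutoModel, DistilBertConfig

# Hypothetical tiny configuration so that nothing has to be downloaded
config = DistilBertConfig(vocab_size=100, dim=32, hidden_dim=64, n_layers=1, n_heads=2)
model = AutoModel.from_config(config)

with tempfile.TemporaryDirectory() as save_directory:
    # Write the configuration and weights to the local directory
    model.save_pretrained(save_directory)
    saved_files = sorted(os.listdir(save_directory))

    # Passing a local path makes from_pretrained read from disk
    reloaded = AutoModel.from_pretrained(save_directory)

    # Compare every tensor of the original and the reloaded model
    weights_match = all(
        torch.equal(p, q)
        for p, q in zip(model.state_dict().values(), reloaded.state_dict().values())
    )

print(saved_files)
print(weights_match)
```

The same pattern should apply to the directory written above: pass `f"models/{modelId}"` to `from_pretrained` instead of the Hub identifier.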

### Inspecting datasets
The datasets on Hugging Face vary in size, information, and features. It is therefore beneficial to inspect a dataset before committing to loading it into your environment.

Let's inspect the "derenrich/wikidata-en-descriptions-small" dataset.

Note: this exercise may take a minute due to the dataset size.

In [3]:
# Load the module
from datasets import load_dataset_builder

# Create the dataset builder
reviews_builder = load_dataset_builder("derenrich/wikidata-en-descriptions-small")

# Print the features
print(reviews_builder.info.features)

{'output': Value(dtype='string', id=None), 'qid': Value(dtype='string', id=None), 'name': Value(dtype='string', id=None), 'input': Value(dtype='string', id=None), 'instruction': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}


### Loading datasets
Hugging Face built the datasets package for interacting with datasets. It offers many convenient functions, including load_dataset_builder, which we just used. After inspecting a dataset to ensure it's the right one for your project, it's time to load it. For this, we can use the input parameters of load_dataset to specify which parts of a dataset to load, e.g. the "train" split of a Wikipedia articles dataset.

The load_dataset function from the datasets package is already loaded for you. Note: the load_dataset function was modified for the purpose of this exercise.

In [2]:
# Load the module
from datasets import load_dataset

# Load the train portion of the dataset
wikipedia = load_dataset("wikimedia/wikipedia", name="20231101.pl", split="train")

print(f"The length of the dataset is {len(wikipedia)}")

The length of the dataset is 1587721


### Manipulating datasets
There will likely be many occasions when you need to manipulate a dataset before using it in an ML task. Two common manipulations are filtering and selecting (or slicing). Given the size of these datasets, Hugging Face stores them in the Apache Arrow format.

This means that performing manipulations is slightly different from what you might be used to. Fortunately, there are already methods to help with this!

The dataset is already loaded for you under wikipedia.

In [12]:
# Filter the documents
filtered = wikipedia.filter(lambda row: "bebok" in row["text"])

# Create a sample dataset
example = filtered.select(range(1))

print(example[0]["text"])

Bobo – nadprzyrodzona istota z polskiego folkloru, prawdopodobnie demon z wierzeń słowiańskich. Innymi nazwami tej istoty są: bobok (Wielkopolska, Małopolska), babok (Kujawy), bebok (Śląsk), bobek, bobik (Kraj morawsko-śląski). 

W polskich wierzeniach ludowych bobo był małą, brzydką i złośliwą istotą, którą straszono dzieci w celu ich zdyscyplinowania. Wzmianka o bobie znajduje się w pochodzącej z początku XVII wieku Peregrynacji dziadowskiej – według niej bobo miał bić dzieci i czynić w domu różne szkody. Można go było przebłagać ofiarą z żywności. 

Relikty tych wierzeń zachowały się jeszcze w XIX wieku (odnotował je m.in. Oskar Kolberg); w Wielkopolsce i Małopolsce straszono dzieci powiedzeniem: „cicho bądź, bo cię bobok weźmie”, albo „jak nie bydzies jád, to cie bebák zjé”. Bohdan Baranowski cytuje dwa utwory ludowe traktujące o bobie; w Puszczy Sandomierskiej znana była kołysanka:

zaś w okolicach Olkusza:

We współczesnej kulturze do postaci boba nawiązuje zespół Formacja Nieżyw