# Hugging Face dataset loader

This notebook shows how to load Hugging Face Hub datasets into LangChain.

The Hugging Face Hub hosts a large number of community-curated datasets for a diverse range of tasks such as translation, automatic speech recognition, and image classification.


In [17]:
from langchain.document_loaders import HuggingFaceDatasetLoader

In [7]:
dataset_name = "imdb"
page_content_column = "text"


loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

In [None]:
data = loader.load()

In [10]:
data[:3]

[Document(page_content='I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are 
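Judging by the `metadata` fields in the streaming output further down this notebook, the column named by `page_content_column` becomes the document text and the remaining columns are kept as metadata. The following is a minimal sketch of that mapping, not the loader's actual implementation: the `Document` dataclass is a stand-in for LangChain's class, and `row_to_document` and the sample row are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def row_to_document(row: Dict[str, Any], page_content_column: str) -> Document:
    """Use one column as the text and keep every other column as metadata."""
    metadata = {key: value for key, value in row.items() if key != page_content_column}
    return Document(page_content=str(row[page_content_column]), metadata=metadata)


# Made-up row with a "text" column and a "label" column (assumed shape).
row = {"text": "A surprisingly good film.", "label": 1}
doc = row_to_document(row, page_content_column="text")
print(doc.page_content)
print(doc.metadata)
```

Under these assumptions, choosing a different `page_content_column` only changes which field ends up as text; everything else stays available in `metadata` for filtering or display.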

## Example
In this example, we use data from a dataset to answer a question.

In [8]:
from langchain.indexes import VectorstoreIndexCreator
from langchain.document_loaders.hugging_face_dataset import HuggingFaceDatasetLoader

In [24]:
dataset_name = "tweet_eval"
page_content_column = "text"
name = "stance_climate"


loader = HuggingFaceDatasetLoader(dataset_name, page_content_column, name)

In [26]:
index = VectorstoreIndexCreator().from_loaders([loader])

Found cached dataset tweet_eval


  0%|          | 0/3 [00:00<?, ?it/s]

Using embedded DuckDB without persistence: data will be transient


In [29]:
query = "What are the most used hashtags?"
result = index.query(query)

In [30]:
result

' The most used hashtags in this context are #UKClimate2015, #Sustainability, #TakeDownTheFlag, #LoveWins, #CSOTA, #ClimateSummitoftheAmericas, #SM, and #SocialMedia.'

## Dataset streaming 

Dataset streaming lets you work with a dataset without downloading it. The data is streamed as you iterate over the dataset. This is especially helpful when:

- You don’t want to wait for an extremely large dataset to download.
- The dataset size exceeds the amount of available disk space on your computer.
- You want to quickly explore just a few samples of a dataset.
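The idea can be illustrated without any network access. The sketch below assumes a lazily produced row source and hands out rows in fixed-size batches, much as the cells below do with `batch_size=5`. `fake_remote_rows` and `BatchedStream` are hypothetical names invented for this illustration; they are not part of LangChain or the `datasets` library, and this is not how the loader is implemented.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def fake_remote_rows() -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for a remote dataset: rows are produced lazily."""
    idx = 0
    while True:
        yield {"idx": idx, "sentence": f"sentence {idx}"}
        idx += 1


class BatchedStream:
    """Return the next `batch_size` rows each time next_batch() is called."""

    def __init__(self, rows: Iterable[Dict[str, object]], batch_size: int = 10) -> None:
        self._rows = iter(rows)
        self._batch_size = batch_size

    def next_batch(self) -> List[Dict[str, object]]:
        # islice pulls only batch_size rows; the rest is never materialized.
        return list(islice(self._rows, self._batch_size))


stream = BatchedStream(fake_remote_rows(), batch_size=5)
first = stream.next_batch()
second = stream.next_batch()
print([row["idx"] for row in first])   # [0, 1, 2, 3, 4]
print([row["idx"] for row in second])  # [5, 6, 7, 8, 9]
```

Because the source is an iterator, only the rows actually requested are ever produced, which is why streaming suits datasets that are too large to download or store.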

In [12]:

path = "glue"
name = "qnli"
batch_size = 5  # defaults to 10
page_content_column = "sentence"


loader = HuggingFaceDatasetLoader(
    path=path,
    name=name,
    page_content_column=page_content_column,
    batch_size=batch_size,
    streaming=True,
)



### Stream the first 5 elements

In [13]:
loader.load()

Downloading builder script: 100%|█████████████████████████████████| 28.8k/28.8k [00:00<00:00, 3.80MB/s]
Downloading metadata: 100%|███████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 4.27MB/s]
Downloading readme: 100%|█████████████████████████████████████████| 27.9k/27.9k [00:00<00:00, 3.76MB/s]


[Document(page_content='Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese.', metadata={'question': 'When did the third Digimon series begin?', 'label': 1, 'idx': 0}),
 Document(page_content='When MANPADS is operated by specialists, batteries may have several dozen teams deploying separately in small sections; self-propelled air defence guns may deploy in pairs.', metadata={'question': 'Which missile batteries often have individual launchers several kilometres from one another?', 'label': 1, 'idx': 1}),
 Document(page_content='He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer.', metadata={'question': "What two things does Popper argue Tarski's theory involves in an evaluat

### Stream the next 5 elements

In [14]:
loader.load()

[Document(page_content='When talking about the German language, the term German dialects is only used for the traditional regional varieties.', metadata={'question': "When is the term 'German dialects' used in regard to the German language?", 'label': 0, 'idx': 5}),
 Document(page_content='At the end of the Second Anglo-Dutch War, the English gained New Amsterdam (New York) in North America in exchange for Dutch control of Run, an Indonesian island.', metadata={'question': 'What was the name of the island the English traded to the Dutch in return for New Amsterdam?', 'label': 0, 'idx': 6}),
 Document(page_content='From the 1720s onward, the kingdom was beset with repeated Meithei raids into Upper Myanmar and a nagging rebellion in Lan Na.', metadata={'question': 'How were the Portuguese expelled from Myanmar?', 'label': 1, 'idx': 7}),
 Document(page_content='The bill also required rotation of principal maintenance inspectors and stipulated that the word "customer" properly applies to t