# Loading local datasets

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202262240993.png)

In [1]:
from datasets import load_dataset

## SQuAD-it uses a JSON format in which all of the text is stored under a single data field, so we load the dataset by specifying the field argument.

In [2]:
squad_it_dataset = load_dataset(
    "json", data_files="./data/SQuAD_it-train.json", field="data"
)

Using custom data configuration default-f503ae2a59b7a7f3
Reusing dataset json (C:\Users\ls\.cache\huggingface\datasets\json\default-f503ae2a59b7a7f3\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/1 [00:00<?, ?it/s]
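
To make the role of `field` more concrete, the sketch below builds a tiny file with the same nesting as SQuAD-it and extracts the list stored under `data` using only the standard library. The file name `tiny_squad.json`, the `version` key, and the two records are invented for illustration. My understanding is that `load_dataset` performs a comparable selection when `field="data"` is passed, but this is not the library's own code.

```python
import json
import os
import tempfile

# Hypothetical miniature file that mimics the SQuAD-it nesting:
# the records sit in a list under the top-level "data" key.
tiny = {
    "version": "1.1",
    "data": [
        {"title": "Example A", "paragraphs": []},
        {"title": "Example B", "paragraphs": []},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "tiny_squad.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(tiny, f)

# Selecting the "data" key yields one record per row.
with open(path, encoding="utf-8") as f:
    records = json.load(f)["data"]

print(len(records))        # number of rows
print(sorted(records[0]))  # column names
```

Each element of the list becomes one row, and the keys of each element (`title`, `paragraphs`) become the features shown in the dataset summary.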

## By default, loading local files creates a DatasetDict object with a train split.

In [3]:
squad_it_dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
})

## Inspecting a single example

In [4]:
# The output is very long, so this line is commented out
# squad_it_dataset["train"][0]
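
Since a full record is too long to print, one option is to shorten it before displaying it. The helper below is a minimal sketch; `preview` and the sample record are hypothetical and not part of the `datasets` library.

```python
import json


def preview(obj, max_chars=60, max_items=2):
    """Return a copy of obj with long strings and long lists shortened."""
    if isinstance(obj, str):
        return obj if len(obj) <= max_chars else obj[:max_chars] + "..."
    if isinstance(obj, list):
        return [preview(x, max_chars, max_items) for x in obj[:max_items]]
    if isinstance(obj, dict):
        return {k: preview(v, max_chars, max_items) for k, v in obj.items()}
    return obj


# Hypothetical record with the same top-level keys as the dataset.
example = {
    "title": "Example",
    "paragraphs": [
        {
            "context": "x" * 500,
            "qas": [{"question": "q1"}, {"question": "q2"}, {"question": "q3"}],
        },
        {"context": "short", "qas": []},
        {"context": "dropped", "qas": []},
    ],
}

short = preview(example)
print(json.dumps(short, indent=2, ensure_ascii=False))
```

Indexing a split returns a plain Python dict, so `preview(squad_it_dataset["train"][0])` should work the same way on the real data.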

# Loading the training and test sets together

https://huggingface.co/docs/datasets/loading.html#local-and-remote-files

In [5]:
data_files = {
    "train": "./data/SQuAD_it-train.json",
    "test": "./data/SQuAD_it-test.json",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

Using custom data configuration default-907da1154175c1ae
Reusing dataset json (C:\Users\ls\.cache\huggingface\datasets\json\default-907da1154175c1ae\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/2 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 442
    })
    test: Dataset({
        features: ['title', 'paragraphs'],
        num_rows: 48
    })
})

## Compressed files can also be loaded directly

In [6]:
data_files = {
    "train": "./data/SQuAD_it-train.json.gz",
    "test": "./data/SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

Using custom data configuration default-a96f274d1e7cab13
Reusing dataset json (C:\Users\ls\.cache\huggingface\datasets\json\default-a96f274d1e7cab13\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/2 [00:00<?, ?it/s]
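
If only the uncompressed JSON files are at hand, a `.json.gz` copy can be produced with the standard library. This is a sketch on a scratch file; the name `sample.json` and its contents are made up for illustration.

```python
import gzip
import json
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.json")  # hypothetical file name
dst = src + ".gz"

with open(src, "w", encoding="utf-8") as f:
    json.dump({"data": [{"title": "Example", "paragraphs": []}]}, f)

# Stream-copy the plain file into a gzip archive.
with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

# Reading the archive back confirms the content is unchanged.
with gzip.open(dst, "rt", encoding="utf-8") as f:
    restored = json.load(f)

print(restored["data"][0]["title"])
```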

> The data_files argument of the load_dataset() function is quite flexible and can be either a single file path, a list of file paths, or a dictionary that maps split names to file paths. You can also glob files that match a specified pattern according to the rules used by the Unix shell (e.g., you can glob all the JSON files in a directory as a single split by setting data_files="*.json"). See the 🤗 Datasets documentation for more details.
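
The three shapes of `data_files` mentioned in the note can be written out as plain Python values, and the glob behaviour can be illustrated with the standard `glob` module. This is only an approximation: the scratch directory and file names are invented, and the pattern resolution inside `datasets` may differ in detail from Python's `glob`.

```python
import glob
import os
import tempfile

# The three shapes described above, written as plain Python values.
single_file = "./data/SQuAD_it-train.json"
file_list = ["./data/SQuAD_it-train.json", "./data/SQuAD_it-test.json"]
split_map = {
    "train": "./data/SQuAD_it-train.json",
    "test": "./data/SQuAD_it-test.json",
}

# Shell-style globbing, shown on a scratch directory with made-up files.
scratch = tempfile.mkdtemp()
for name in ("a.json", "b.json", "notes.txt"):
    open(os.path.join(scratch, name), "w").close()

pattern = os.path.join(scratch, "*.json")
matched = sorted(os.path.basename(p) for p in glob.glob(pattern))
print(matched)  # ['a.json', 'b.json']
```

Any of `single_file`, `file_list`, `split_map`, or a pattern string such as `"*.json"` could then be passed as the `data_files` argument.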

# Loading datasets offline that were previously downloaded from the Hub or the Datasets GitHub repository

Set the environment variable HF_DATASETS_OFFLINE to 1 to enable full offline mode.
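
From inside a notebook, the variable can be set with `os.environ`. The sketch below assumes the flag has to be in place before `datasets` is imported, since as far as I know the library reads it at import time; in a session where `datasets` is already loaded, restarting the kernel first is the safer route.

```python
import os

# Set the flag before importing the library (assumed to be read at import time).
os.environ["HF_DATASETS_OFFLINE"] = "1"

# from datasets import load_dataset  # import only after the variable is set
print(os.environ["HF_DATASETS_OFFLINE"])
```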