![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202271331129.png)

# Loading data: load_dataset

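A TSV file is a variant of a CSV file that uses tabs instead of commas as the separator. We can load these files with the csv loading script by passing the delimiter argument to load_dataset(), as shown below: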

In [1]:
from datasets import load_dataset

data_files = {
    "train": "./data/drugsComTrain_raw.tsv",
    "test": "./data/drugsComTest_raw.tsv",
}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

Using custom data configuration default-a01bd41b65dcc0bc
Reusing dataset csv (C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e)


  0%|          | 0/2 [00:00<?, ?it/s]
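To make the role of delimiter concrete, here is a minimal standard-library sketch that does not use the datasets library. The two-row sample string is made up for illustration; it mimics the layout of the drug review files, including a header whose first field is empty. That empty header is a plausible reason for the 'Unnamed: 0' column that shows up in the loaded data, since the csv loader relies on pandas, which gives unnamed columns names of that form.

```python
import csv
import io

# Hypothetical sample in the same tab-separated layout as the drug review files.
# The first header field is empty, just like an exported DataFrame index.
sample_tsv = (
    "\tdrugName\tcondition\n"
    "0\tNaproxen\tGout, Acute\n"
    "1\tMobic\tInflammatory Conditions\n"
)

# delimiter="\t" tells the parser to split fields on tabs instead of commas.
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

print(reader.fieldnames)
print(rows[0]["condition"])
```

Because the separator is a tab, the comma inside "Gout, Acute" stays within a single field.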

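# Shuffling data (shuffle) and selecting data (select)

1. Dataset.shuffle() shuffles the order of the rows; the seed is fixed here for reproducibility.
2. Dataset.select() selects rows; its argument is an iterable of indices.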
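The idea behind combining the two calls can be sketched in plain Python: shuffle the row indices with a fixed seed, then keep the first few. This is only an illustration of the concept, not how the library is implemented, and the helper name shuffled_sample is hypothetical.

```python
import random


def shuffled_sample(num_rows, sample_size, seed):
    """Return `sample_size` row indices drawn from a seeded shuffle of all rows."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the result independent of global state.
    random.Random(seed).shuffle(indices)
    return indices[:sample_size]


first = shuffled_sample(num_rows=10, sample_size=3, seed=42)
second = shuffled_sample(num_rows=10, sample_size=3, seed=42)
print(first == second)
```

With the same seed, both calls pick the same rows, which is what makes the sample reproducible.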

In [2]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

Loading cached shuffled indices for dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-f5b3e362d7ff485e.arrow


{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

In [3]:
drug_dataset.keys()

dict_keys(['train', 'test'])

# Checking for unique values: unique

Use Dataset.unique() to verify that the number of unique IDs equals the number of rows in each split (train and test).

In [4]:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))
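
If an assertion like this ever failed, the next question would be which IDs repeat. A minimal plain-Python sketch of that check follows; find_duplicates is a hypothetical helper, not part of datasets, and the ID lists are toy data.

```python
from collections import Counter


def find_duplicates(values):
    """Return the sorted values that occur more than once."""
    counts = Counter(values)
    return sorted(value for value, count in counts.items() if count > 1)


ids = [87571, 178045, 80482]
ids_with_repeat = ids + [87571]

print(find_duplicates(ids))
print(find_duplicates(ids_with_repeat))
```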

# Renaming a column: rename_column

In [5]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

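# Lowercasing the condition column

In [6]:
# Convert the contents of the 'condition' column to lowercase
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


# Calling this function now would raise an error: some rows have an empty
# 'condition' field, and None has no lower() method
# drug_dataset.map(lowercase_condition)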

# Filtering out rows whose condition is empty

In [7]:
def filter_nones(x):
    return x["condition"] is not None
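
The reason the filter has to come before the lowercasing step can be reproduced on a few toy rows in plain Python. The rows below are invented for illustration; only lowercase_condition is taken from the notebook.

```python
rows = [
    {"condition": "ADHD"},
    {"condition": None},  # a missing value, as in some rows of the dataset
    {"condition": "Birth Control"},
]


def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


# Lowercasing every row fails as soon as a None is reached.
try:
    [lowercase_condition(row) for row in rows]
    failed = False
except AttributeError:
    failed = True

# Dropping the None rows first makes the same function safe to apply.
kept = [row for row in rows if row["condition"] is not None]
lowered = [lowercase_condition(row)["condition"] for row in kept]
print(failed, lowered)
```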

## Anonymous (lambda) functions

In [8]:
(lambda x: x * x)(3)

9

In [9]:
(lambda base, height: 0.5 * base * height)(4, 8)

16.0

## Filtering data: filter

In [10]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-919d91a03d003e4b.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-8e67ef41d36fd171.arrow


## With the empty rows removed, map can lowercase the condition column

> map() applies a function to each row, or to each batch of rows, of the dataset.

In [11]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-67d8981cc212cd65.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-1472924cb945031d.arrow


['left ventricular dysfunction', 'adhd', 'birth control']

# Counting the words in each review and creating a new column

In [12]:
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}

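compute_review_length() returns a dictionary whose key does not match any column name in the dataset. In that case, when compute_review_length() is passed to Dataset.map(), it is applied to every row of the dataset and creates a new review_length column.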

In [13]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first training example
drug_dataset["train"][0]

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-53840265db1d8148.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-79c032691f1a637e.arrow


{'patient_id': 206461,
 'drugName': 'Valsartan',
 'condition': 'left ventricular dysfunction',
 'review': '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
 'rating': 9.0,
 'date': 'May 20, 2012',
 'usefulCount': 27,
 'review_length': 17}
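
How the returned dictionary ends up as a new column can be sketched by merging it into the row by hand. This is a simplified model of the behaviour described above, not the library's code; apply_to_row is a hypothetical helper, and the review text is the one from the output above.

```python
def compute_review_length(example):
    return {"review_length": len(example["review"].split())}


def apply_to_row(row, function):
    # Keys returned by the function overwrite existing columns or add new ones.
    return {**row, **function(row)}


row = {
    "drugName": "Valsartan",
    "review": '"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
}
new_row = apply_to_row(row, compute_review_length)
print(new_row["review_length"])
```

Splitting on whitespace yields 17 pieces for this review, matching the review_length shown in the output.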

# Sorting by a given column: sort

In [14]:
drug_dataset["train"].sort("review_length")[:3]

Loading cached sorted indices for dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-22258ddd4621bce5.arrow


{'patient_id': [103488, 23627, 20558],
 'drugName': ['Loestrin 21 1 / 20', 'Chlorzoxazone', 'Nucynta'],
 'condition': ['birth control', 'muscle spasm', 'pain'],
 'review': ['"Excellent."', '"useless"', '"ok"'],
 'rating': [10.0, 1.0, 6.0],
 'date': ['November 4, 2008', 'March 24, 2017', 'August 20, 2016'],
 'usefulCount': [5, 2, 10],
 'review_length': [1, 1, 1]}

# Filtering out reviews of 30 words or fewer: filter

In [15]:
print(drug_dataset.num_rows)
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)
print(drug_dataset.num_rows)

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-8111a42f861070fa.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-d7b64874736838a1.arrow


{'train': 160398, 'test': 53471}
{'train': 138514, 'test': 46108}


# Handling HTML escape characters

In [16]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

In [17]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-d66db6d403a78e6e.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-76861f18ead11f74.arrow


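# The speed-up superpowers of the map() method

The Dataset.map() method takes a batched argument that, when set to True, makes it send a batch of examples to the map function at once (the batch size is configurable and defaults to 1,000). For instance, the previous map function that unescaped all the HTML took some time to run (you can read the elapsed time from the progress bar). We can speed this up by processing several elements at once with a list comprehension.

**With batched=True, the function passed to map receives a dictionary holding a slice of the dataset: the keys are the column names and each value is a list of up to batch_size items. The function must return the same kind of structure: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values.**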

In [18]:
# With batched=True the lambda receives lists as values rather than single elements, so it usually has to be rewritten
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)

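# Caution: html.unescape() expects a single string. Given a whole list, it does not
# unescape the elements (in CPython it typically returns the list unchanged), so the
# per-row lambda below should not be reused as-is with batched=True:
# new_drug_dataset = drug_dataset.map(
#     lambda x: {"review": html.unescape(x["review"])}, batched=True
# )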

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-39bcb5ade08da3a7.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-e4a6a80c90f4fbeb.arrow


## Speeding up the tokenizer with map

In [19]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)

In [20]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-87a2aae47092d473.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-7522c7e127de9f68.arrow


CPU times: total: 62.5 ms
Wall time: 49 ms


## For slow tokenizers, batched makes little difference; use num_proc to set the number of processes and speed things up

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202271913496.png)

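# When one example produces several features

> In machine learning, an example is usually the set of features fed to the model, and in most cases those features are simply the columns of one dataset row. In other cases, such as question answering, several features can be extracted from a single example.

1. With return_overflowing_tokens=True, the tokenizer returns not only the encoding of the first 128 tokens but also the encodings of the remaining text, so values such as 'input_ids' may hold several chunks for one review instead of a single one.
2. The limit of 128 counts tokens, not words: some words are split into several pieces, and special tokens are added as well. Different tokenizers split text differently.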

In [21]:
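# return_overflowing_tokens=True returns the encoding of the first 128 tokens and also the encodings of the remaining text
# 128 counts tokens, not words: some words are split into several pieces and special tokens are added; tokenizers differ in how they split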
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

In [22]:
print(drug_dataset["train"][0]["review"])

"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. 
We have tried many different medications and so far this is the most effective."


In [23]:
result = tokenize_and_split(drug_dataset["train"][0])
print(tokenizer.decode(result["input_ids"][0]))
print(tokenizer.decode(result["input_ids"][1]))
[len(inp) for inp in result["input_ids"]]

[CLS] " My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation ( very unusual for him. ) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever [SEP]
[CLS]. He is less emotional ( a good thing ), less cranky. He is remembering all the things he should. Overall his behavior is better. We have tried many different medications and so far this is the most effective. " [SEP]


[128, 49]
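
The two lengths above are consistent with a simple chunking rule, sketched here in plain Python. This is not the tokenizer's real implementation: it assumes no overlap between chunks and exactly two special tokens per chunk, and the ids 101 and 102 as well as the 173 made-up token ids are placeholders.

```python
def split_into_chunks(token_ids, max_length, cls_id, sep_id):
    """Split token ids into chunks of at most `max_length`, special tokens included."""
    body_size = max_length - 2  # room left once the two special tokens are added
    chunks = []
    for start in range(0, len(token_ids), body_size):
        chunks.append([cls_id] + token_ids[start:start + body_size] + [sep_id])
    return chunks


# 173 placeholder ids standing in for the tokens of one long review.
token_ids = list(range(1000, 1173))
chunks = split_into_chunks(token_ids, max_length=128, cls_id=101, sep_id=102)
lengths = [len(chunk) for chunk in chunks]
print(lengths)
```

Under these assumptions, 173 content tokens give one full chunk of 126 + 2 tokens and a second chunk of 47 + 2 tokens.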

In [24]:
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 138514
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

In [25]:
# Applying map directly to drug_dataset raises an error
# tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

## Method 1: remove all columns of the original dataset

The problem is that we’re trying to mix two different datasets of different sizes: the drug_dataset columns will have a certain number of examples (the 1,000 in our error), but the tokenized_dataset we are building will have more (the 1,463 in the error message). That doesn’t work for a Dataset, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the remove_columns argument:

In [26]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-1a6a4a17f5e08048.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-b28f1943efb924fb.arrow


In [27]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)
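
The constraint behind the error is that every column of a batch must have the same number of rows. A rough plain-Python model of that check is shown below; check_equal_lengths is a hypothetical helper, the columns are toy data, and the real library reports the problem with its own error type and message.

```python
def check_equal_lengths(columns):
    """Return the common column length, or raise if the columns disagree."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


old_columns = {"patient_id": [10, 11, 12]}  # 3 original rows
new_columns = {"input_ids": [[101, 5, 102], [101, 6, 102], [101, 7, 102], [101, 8, 102]]}  # 4 chunks

# Mixing old and new columns fails because 3 != 4.
try:
    check_equal_lengths({**old_columns, **new_columns})
    mixed_ok = True
except ValueError:
    mixed_ok = False

# Keeping only the new columns, which is what remove_columns achieves, passes.
only_new = check_equal_lengths(new_columns)
print(mixed_ok, only_new)
```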

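## Method 2: repeat the original columns to match the new data

As mentioned, we can also fix the length mismatch by making the old columns the same size as the new ones. To do this, we need the overflow_to_sample_mapping field that the tokenizer returns when return_overflowing_tokens=True is set. It maps each new feature index to the index of the sample it came from. With it, we can associate each key of the original dataset with a list of values of the right size by repeating each example's values as many times as it generates new features: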

In [28]:
# Inspect overflow_to_sample_mapping: for each new feature, it gives the index of the original sample it came from
tokenized_dataset["train"][:10]["overflow_to_sample_mapping"]

[0, 0, 1, 1, 2, 3, 3, 4, 5, 5]

In [29]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result
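
The repetition step in the loop above can be checked in isolation with toy data. The sketch below applies the same indexing expression to an invented batch and an invented mapping; repeat_columns is a hypothetical name for that loop.

```python
def repeat_columns(examples, sample_map):
    """Repeat each original value once per new feature that came from its row."""
    return {key: [values[i] for i in sample_map] for key, values in examples.items()}


examples = {"patient_id": [10, 11, 12], "rating": [9.0, 5.0, 1.0]}
# Row 0 produced two chunks, row 1 one chunk, row 2 two chunks.
sample_map = [0, 0, 1, 2, 2]

expanded = repeat_columns(examples, sample_map)
print(expanded["patient_id"])
```

Every expanded column now has as many entries as there are chunks, so old and new columns line up.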

In [30]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-56a17f2bb358609c.arrow
Loading cached processed dataset at C:\Users\ls\.cache\huggingface\datasets\csv\default-a01bd41b65dcc0bc\0.0.0\6b9057d9e23d9d8a2f05b985917a0da84d70c5dae3d22ddd8a3f22fb01c69d9e\cache-d0cb0cbdf07349fd.arrow


DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 206772
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 68876
    })
})

# Changing the dataset output format: set_format

## Converting to pandas

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202272157010.png)

In [31]:
drug_dataset.set_format("pandas")

In [32]:
drug_dataset["train"][:3]

|  | patient_id | drugName | condition | review | rating | date | usefulCount | review_length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 |
| 1 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 |
| 2 | 138000 | Ortho Evra | birth control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 | 89 |


In [33]:
train_df = drug_dataset["train"][:]

In [34]:
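# rename_axis() plus reset_index(name=...) yields the same two columns on older and
# newer pandas versions; renaming "index"/"condition" relies on pre-2.0 column naming
frequencies = (
    train_df["condition"]
    .value_counts()
    .rename_axis("condition")
    .reset_index(name="frequency")
)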
frequencies.head()

|  | condition | frequency |
| --- | --- | --- |
| 0 | birth control | 27655 |
| 1 | depression | 8023 |
| 2 | acne | 5209 |
| 3 | anxiety | 4991 |
| 4 | pain | 4744 |
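
For readers who want to see what this frequency table computes without pandas, the same counting can be sketched with the standard library. The condition list is toy data, and Counter.most_common() is used here as a stand-in for value_counts(), not as the notebook's method.

```python
from collections import Counter

# Toy stand-in for the "condition" column.
conditions = ["birth control", "adhd", "birth control", "pain", "adhd", "birth control"]

# most_common() returns (value, count) pairs sorted by descending count.
frequencies = Counter(conditions).most_common()
print(frequencies)
```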


# Creating a datasets (Arrow) Dataset from a pandas DataFrame

In [35]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

In [36]:
# Restore drug_dataset's output format from pandas back to the default Arrow format
drug_dataset.reset_format()

# Splitting the training set into training and validation sets: train_test_split

In [37]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})
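
The split sizes above can be reproduced with a little arithmetic. The sketch below assumes the training size is rounded down and the remainder goes to the other split, which matches the counts shown; it does not reproduce which rows the library actually picks, and split_indices is a hypothetical helper.

```python
import math
import random


def split_indices(num_rows, train_size, seed):
    """Shuffle row indices with a seed and cut them at floor(train_size * num_rows)."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_train = math.floor(train_size * num_rows)
    return indices[:num_train], indices[num_train:]


train_idx, validation_idx = split_indices(138514, train_size=0.8, seed=42)
print(len(train_idx), len(validation_idx))
```

Since 0.8 × 138514 = 110811.2, rounding down gives 110811 training rows and leaves 27703 for validation.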

# Saving the data to disk

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202272209854.png)

## Saving in Arrow format

In [38]:
drug_dataset_clean.save_to_disk("drug-reviews")

Flattening the indices:   0%|          | 0/111 [00:00<?, ?ba/s]

Flattening the indices:   0%|          | 0/28 [00:00<?, ?ba/s]

## Loading a dataset saved in Arrow format

In [39]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 110811
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27703
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46108
    })
})

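## Saving as CSV, JSON, or JSON Lines files

For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is to iterate over the keys and values of the DatasetDict object: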

In [40]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

Creating json from Arrow format:   0%|          | 0/12 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/3 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/5 [00:00<?, ?ba/s]

In [41]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":89879,"drugName":"Cyclosporine","condition":"keratoconjunctivitis sicca","review":"\"I have used Restasis for about a year now and have seen almost no progress.  For most of my life I've had red and bothersome eyes. After trying various eye drops, my doctor recommended Restasis.  He said it typically takes 3 to 6 months for it to really kick in but it never did kick in.  When I put the drops in it burns my eyes for the first 30 - 40 minutes.  I've talked with my doctor about this and he said it is normal but should go away after some time, but it hasn't. Every year around spring time my eyes get terrible irritated  and this year has been the same (maybe even worse than other years) even though I've been using Restasis for a year now. The only difference I notice was for the first couple weeks, but now I'm ready to move on.\"","rating":2.0,"date":"April 20, 2013","usefulCount":69,"review_length":147}


![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202272214356.png)

## Loading a dataset saved in JSON Lines format

In [42]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

Using custom data configuration default-4addb53bcc56e45e


Downloading and preparing dataset json/default to C:\Users\ls\.cache\huggingface\datasets\json\default-4addb53bcc56e45e\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

Dataset json downloaded and prepared to C:\Users\ls\.cache\huggingface\datasets\json\default-4addb53bcc56e45e\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]
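
The JSON Lines layout used here is simply one JSON object per line. A minimal standard-library round trip, independent of datasets, is sketched below; the rows and the file name drug-reviews-sample.jsonl are invented for the example.

```python
import json
import os
import tempfile

rows = [
    {"patient_id": 1, "review": "ok", "rating": 6.0},
    {"patient_id": 2, "review": "useless", "rating": 1.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "drug-reviews-sample.jsonl")

    # Write one JSON object per line.
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read the file back line by line, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f if line.strip()]

print(reloaded == rows)
```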

# Additional notes

## Splitting a dataset into training and test sets: train_test_split

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202271335259.png)

## Flattening nested data: flatten

Flatten the text and answer_start fields nested under answers.

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202271339390.png)

![](https://chushi123.oss-cn-beijing.aliyuncs.com/img/202202271338890.png)
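
What flattening does to a nested field can be illustrated on a single row in plain Python. The sketch assumes nested keys are joined with a dot (answers.text, answers.answer_start); flatten_row is a hypothetical helper and the row is toy data in a question answering layout.

```python
def flatten_row(row, parent_key=""):
    """Turn nested dictionaries into top-level keys joined with dots."""
    flat = {}
    for key, value in row.items():
        name = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_row(value, name))
        else:
            flat[name] = value
    return flat


row = {"id": "q1", "answers": {"text": ["Paris"], "answer_start": [42]}}
flat_row = flatten_row(row)
print(sorted(flat_row))
```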