### The downloaded data is stored in the .arrow format, i.e. the Apache Arrow columnar format, which is designed for high-performance analytics and efficient data interchange between systems. ###

- Columnar Storage: Unlike row-based formats (like CSV), Arrow stores data in columns. Operations that touch only a few fields, such as filtering or aggregating, become much faster, which is especially useful in machine learning and analytics.

- Memory Mapping: Hugging Face’s datasets library uses Arrow to cache datasets on disk in a way that allows them to be accessed as if they were in memory. This means you can work with large datasets without loading everything into RAM.

- Zero-Copy Reads: Arrow enables zero-copy reads, which means data can be accessed in place, without deserializing it or making extra copies. This speeds up processing and reduces overhead.

- Interoperability: Arrow is language-agnostic and integrates well with tools like NumPy, Pandas, PyTorch, and TensorFlow.

### Load dataset from internet

In [1]:
from datasets import load_dataset

In [None]:
# Optional: set a proxy if the Hugging Face Hub cannot be reached directly
# import os
# os.environ['http_proxy'] = 'XXX.XXX.XX'

In [4]:
# dataset = load_dataset('seamew/ChnSentiCorp')
# This copy of the dataset has problems, so load the local copy saved for the course instead

In [5]:
from datasets import load_from_disk

In [6]:
dataset = load_from_disk('../ChnSentiCorp/')

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9600
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 0
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
})

### Load another dataset from internet

In [8]:
dataset = load_dataset('dirtycomputer/ChnSentiCorp_htl_all')
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'review'],
        num_rows: 7766
    })
})

In [9]:
dataset.save_to_disk('./data/ChnSentiCorp_htl_all')

Saving the dataset (0/1 shards):   0%|          | 0/7766 [00:00<?, ? examples/s]

### Load one task (SST-2) from the GLUE benchmark, a collection of multiple NLP tasks

In [10]:
load_dataset(path='glue', name='sst2', split='train', revision="script")

The repository for glue contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/glue.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Generating train split:   0%|          | 0/67349 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/872 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1821 [00:00<?, ? examples/s]

Dataset({
    features: ['sentence', 'label', 'idx'],
    num_rows: 67349
})

## Basic operations on a dataset

In [12]:
dataset = load_from_disk('../ChnSentiCorp/')

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 9600
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 0
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
})

In [15]:
# A DatasetDict behaves like a dictionary: index it by split name
dataset = dataset['train']

In [16]:
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 9600
})

In [20]:
# A Dataset can be indexed by row number (slices such as dataset[:3] also work)
dataset[12]

{'text': '轻便，方便携带，性能也不错，能满足平时的工作需要，对出差人员来说非常不错', 'label': 1}

In [21]:
for i in range(10):
    print(dataset[i])

{'text': '选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般', 'label': 1}
{'text': '15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错', 'label': 1}
{'text': '房间太小。其他的都一般。。。。。。。。。', 'label': 0}
{'text': '1.接电源没有几分钟,电源适配器热的不行. 2.摄像头用不起来. 3.机盖的钢琴漆，手不能摸，一摸一个印. 4.硬盘分区不好办.', 'label': 0}
{'text': '今天才知道这书还有第6卷,真有点郁闷:为什么同一套书有两种版本呢?当当网是不是该跟出版社商量商量,单独出个第6卷,让我们的孩子不会有所遗憾。', 'label': 1}
{'text': '机器背面似乎被撕了张什么标签，残胶还在。但是又看不出是什么标签不见了，该有的都在，怪', 'label': 0}
{'text': '呵呵，虽然表皮看上去不错很精致，但是我还是能看得出来是盗的。但是里面的内容真的不错，我妈爱看，我自己也学着找一些穴位。', 'label': 0}
{'text': '这本书实在是太烂了,以前听浙大的老师说这本书怎么怎么不对,哪些地方都是误导的还不相信,终于买了一本看一下,发现真是~~~无语,这种书都写得出来', 'label': 0}
{'text': '地理位置佳，在市中心。酒店服务好、早餐品种丰富。我住的商务数码房电脑宽带速度满意,房间还算干净，离湖南路小吃街近。', 'label': 1}
{'text': '5.1期间在这住的，位置还可以，在市委市政府附近，要去商业区和步行街得打车，屋里有蚊子，虽然空间挺大，晚上熄灯后把窗帘拉上简直是伸手不见五指，很适合睡觉，但是会被该死的蚊子吵醒！打死了两只，第二天早上还是发现又没打死的，卫生间挺大，但是设备很老旧。', 'label': 1}


In [22]:
print(dataset['label'][:10])

[1, 1, 0, 0, 1, 0, 0, 0, 1, 1]


#### sort the data

Judging from the samples above, label 1 seems to mean positive and label 0 negative.

In [23]:
sorted_dataset = dataset.sort('label')

In [24]:
print(sorted_dataset['label'][:10])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [25]:
print(sorted_dataset['label'][-10:])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


#### shuffle the data

In [26]:
shuffled_dataset = sorted_dataset.shuffle(seed=10)
shuffled_dataset['label'][:10]

[0, 1, 0, 0, 0, 1, 1, 1, 0, 0]

In [27]:
# Select specific rows by their indices
dataset.select([0, 10, 20, 30, 40, 50])

Dataset({
    features: ['text', 'label'],
    num_rows: 6
})

#### filter the data

In [31]:
def f(data):
    return data['text'].startswith('非常不错')
    
dataset.filter(f)

Dataset({
    features: ['text', 'label'],
    num_rows: 13
})

#### split train/test data

In [32]:
dataset.train_test_split(test_size = 0.2)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 7680
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1920
    })
})

Split the data evenly into N shards and take one of them

In [35]:
dataset.shard(num_shards = 4, index = 0)

Dataset({
    features: ['text', 'label'],
    num_rows: 2400
})

#### rename the column

In [36]:
dataset.rename_column('text', 'text2')

Dataset({
    features: ['text2', 'label'],
    num_rows: 9600
})

#### delete columns

In [37]:
dataset.remove_columns(['text'])

Dataset({
    features: ['label'],
    num_rows: 9600
})

#### mapping function

In [38]:
def f2(data):
    data['text'] = 'My sentence: ' + data['text']
    return data

maped_dataset = dataset.map(f2)

In [39]:
dataset['text'][5]

'机器背面似乎被撕了张什么标签，残胶还在。但是又看不出是什么标签不见了，该有的都在，怪'

In [40]:
maped_dataset['text'][5]

'My sentence: 机器背面似乎被撕了张什么标签，残胶还在。但是又看不出是什么标签不见了，该有的都在，怪'

#### batch acceleration

In [41]:
def f3(data):
    text = data['text']
    text = ['My sentence: ' + i for i in text]
    data['text'] = text
    return data

maped_dataset2 = dataset.map(f3, batched = True, batch_size = 10000, num_proc = 4)

In [42]:
maped_dataset2['text'][5]

'My sentence: 机器背面似乎被撕了张什么标签，残胶还在。但是又看不出是什么标签不见了，该有的都在，怪'

#### configure data format