## Dataset walkthrough

The main entry points of the `datasets` library are:
- `datasets.list_datasets()` lists the available datasets
- `datasets.load_dataset(dataset_name, **kwargs)` instantiates a dataset
- `datasets.list_metrics()` lists the available metrics
- `datasets.load_metric(metric_name, **kwargs)` instantiates a metric

In [1]:
from datasets import Dataset,load_dataset,concatenate_datasets,load_from_disk
import datasets
import config
import os
import pandas as pd
from sklearn.model_selection import train_test_split

### Let's start with how to load data from memory and from disk

- #### load a dictionary 

In [3]:
my_dict = {"text": ['a','b','c'],"label":[0,1,0]}
dataset = Dataset.from_dict(my_dict)
print(dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 3
})


- #### load a pandas df

In [31]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
df = pd.read_csv(data_path,encoding='utf8')
df.head(2)

Unnamed: 0,text,label
0,yeah true i defiantly think though they are no...,1.0
1,2 that is really the issue carbon tax paris ac...,0.0


In [32]:
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 400
})

- #### load from local files  

In [38]:
## load a JSON Lines file
data_path = os.path.join(config.data_folder,'tweet','data','tweets.jsonl')
dataset = load_dataset('json', data_files=data_path, split='train')  ## split='train' returns a Dataset; without it you get a DatasetDict with a single 'train' split
dataset

Using custom data configuration default-f2c67655031212df


Downloading and preparing dataset json/default to /home/chengyu/.cache/huggingface/datasets/json/default-f2c67655031212df/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...


  0%|          | 0/1 [00:00<?, ?it/s]

Dataset json downloaded and prepared to /home/chengyu/.cache/huggingface/datasets/json/default-f2c67655031212df/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'label'],
    num_rows: 400
})
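For reference, a JSON Lines file holds one JSON object per line, with the keys becoming the dataset's columns. The sketch below writes and re-reads such a file using only the standard library. The file name `toy_tweets.jsonl` and the two records are made up for illustration.

```python
import json
import os
import tempfile

records = [
    {"text": "first example tweet", "label": 1},
    {"text": "second example tweet", "label": 0},
]

# Write one JSON object per line (hypothetical file name).
tmp_dir = tempfile.mkdtemp()
jsonl_path = os.path.join(tmp_dir, "toy_tweets.jsonl")
with open(jsonl_path, "w", encoding="utf8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back line by line to confirm the layout.
with open(jsonl_path, encoding="utf8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded)
```

A file laid out like this can be passed as `data_files` to `load_dataset('json', ...)`, as in the cell above.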

- #### Load a local csv file

In [52]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
dataset = load_dataset('csv', data_files=data_path)  ## without split=, the result is a DatasetDict whose only split is 'train'
dataset

Using custom data configuration default-7285691f768b7810
Reusing dataset csv (/home/chengyu/.cache/huggingface/datasets/csv/default-7285691f768b7810/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)


  0%|          | 0/1 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 400
    })
})

- You can also load the train and test splits together:
```python
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
```

- #### For other loading options, see https://huggingface.co/docs/datasets/loading


## Main data processing functions
- #### For details, see https://huggingface.co/docs/datasets/process

In [2]:
## first, load a sample dataset
dataset = load_dataset('glue', 'mrpc', split='train') # you can also choose specific fields to load

Downloading builder script:   0%|          | 0.00/7.78k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Reusing dataset glue (/Users/huang/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


- ### Sort, shuffle, select, split, and shard

In [8]:
dataset.column_names

['sentence1', 'sentence2', 'label', 'idx']

In [13]:
## shuffle
sample = dataset.shuffle(seed=42).select(range(10))

## sort 
sorted_sample = dataset.sort('label')

## filter: drop missing sentences, then keep sentences longer than 20 words
filtered_sample = dataset.filter(lambda example: example['sentence1'] is not None)
filtered_sample = filtered_sample.filter(lambda example: len(example['sentence1'].split()) >20)
print("org length: {}; filtered length: {}".format(len(dataset),len(filtered_sample)))

Loading cached shuffled indices for dataset at /Users/huang/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-f4871afcfd9acd36.arrow
Loading cached sorted indices for dataset at /Users/huang/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-c88b0988174eb7f6.arrow
Loading cached processed dataset at /Users/huang/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-4ac85e9794860a54.arrow


  0%|          | 0/4 [00:00<?, ?ba/s]

org length: 3668; filtered length: 2142


In [3]:
## train test split 
## stratified splitting is supported: https://huggingface.co/docs/datasets/v2.3.2/en/package_reference/main_classes#datasets.Dataset.train_test_split
## this requires datasets version 2.3.*; it can be installed from conda-forge
splited_ds = dataset.train_test_split(test_size=0.2,shuffle=True,seed=42,stratify_by_column='label')
splited_ds


DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 2934
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 734
    })
})

- #### Rename, remove, create new, align

In [7]:
## create new 
def compute_length(example):
    return {"sentence_length": len(example["sentence1"].split())}
sample = dataset.map(compute_length)
sample[0]

  0%|          | 0/3668 [00:00<?, ?ex/s]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0,
 'sentence_length': 19}

In [8]:
## rename columns
sample = sample.rename_column("sentence1", "sentenceA")
sample = sample.rename_column("sentence2", "sentenceB")
sample

Dataset({
    features: ['sentenceA', 'sentenceB', 'label', 'idx', 'sentence_length'],
    num_rows: 3668
})

In [9]:
## remove columns 
sample= sample.remove_columns('sentence_length')
sample

Dataset({
    features: ['sentenceA', 'sentenceB', 'label', 'idx'],
    num_rows: 3668
})

### Sample process for reading custom data (basic version). For a more complete version, see [this article](https://towardsdatascience.com/my-experience-with-uploading-a-dataset-on-huggingfaces-dataset-hub-803051942c2d) or the official documentation

- you can also use pandas if you want

In [15]:
RANDOM_SEED = 4

In [16]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
df = pd.read_csv(data_path,encoding='utf8')
df = df[~df['text'].isna()]  ## drop rows with missing text so every remaining value is a string
df.head(2)


In [104]:
df_train, df_test = train_test_split(df,test_size=0.3,random_state=RANDOM_SEED)

In [105]:
dataset_train = Dataset.from_pandas(df_train,split='train')
dataset_test = Dataset.from_pandas(df_test,split='test')
dataset = datasets.DatasetDict({'train':dataset_train, 'test':dataset_test})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 120
    })
})


- You can also split with the library's own functions: [link](https://huggingface.co/docs/datasets/process#split)
- For cross-validation, you can also use the [shard](https://huggingface.co/docs/datasets/process#shard) function

- #### Now we can follow the standard data processing steps

In [106]:
from transformers import AutoTokenizer

In [107]:
dataset['test'][5]

{'text': 'once again total lack of climate leadership at fancy words amp empty promises on the ground kowtowing to powerful fossil fuel industry is m o of this admin leasing to ff industry in public lands must be restricted amp at high price cut subsidies no more tax incentives',
 'label': 0.0,
 '__index_level_0__': 272}

In [108]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=6)  ## batched=True processes rows in batches; num_proc=6 uses 6 processes
                                                                              ## remove_columns=[...] can also be passed here to drop columns

In [110]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120
    })
})
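To make the `padding="max_length"` and `truncation=True` arguments concrete, here is a toy illustration in plain Python. It is not the BERT tokenizer: `toy_encode`, the tiny `vocab`, and the whitespace splitting are all made up for the example, and the real tokenizer additionally splits text into subwords and adds special tokens.

```python
def toy_encode(text, vocab, max_length=6, pad_id=0, unk_id=1):
    """Whitespace tokenizer with fixed-length padding and truncation (toy only)."""
    ids = [vocab.get(tok, unk_id) for tok in text.split()]
    ids = ids[:max_length]                 # truncation: cut anything past max_length
    attention_mask = [1] * len(ids)        # 1 marks real tokens
    padding = max_length - len(ids)        # padding: fill up to max_length
    return {
        "input_ids": ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }

vocab = {"climate": 2, "tax": 3, "carbon": 4}
short_text = toy_encode("carbon tax", vocab)
long_text = toy_encode("climate tax carbon climate tax carbon climate", vocab)
print(short_text)
print(long_text)
```

Every encoded row ends up with the same length, which is why the tokenized columns can be stacked into fixed-size tensors later.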


- ### Save and load encoded datasets 

In [111]:
dataset_out_path = os.path.join(config.data_folder,'tweet','data','tweets_dataset')
tokenized_datasets.save_to_disk(dataset_out_path)

In [113]:
tokenized_datasets = load_from_disk(dataset_out_path)
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120
    })
})


- You can also save each split as a JSON Lines file:
- See https://huggingface.co/course/chapter5/3?fw=pt#saving-a-dataset
```python
# save one JSON Lines file per split
for split, split_dataset in tokenized_datasets.items():
    split_dataset.to_json(f"reviews-{split}.jsonl")
# load the files back, using the same names that were written above
data_files = {
    "train": "reviews-train.jsonl",
    "test": "reviews-test.jsonl",
}
dataset_reloaded = load_dataset("json", data_files=data_files)
```

### For extremely large datasets, see https://huggingface.co/course/chapter5/4?fw=pt