## Dataset walkthrough

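The main entry points of the `datasets` library are:
- `datasets.list_datasets()` to list the available datasets
- `datasets.load_dataset(dataset_name, **kwargs)` to instantiate a dataset
- `datasets.list_metrics()` to list the available metrics
- `datasets.load_metric(metric_name, **kwargs)` to instantiate a metric

Note that recent releases of `datasets` deprecate or remove `list_datasets`, `list_metrics`, and `load_metric`: dataset listing moved to `huggingface_hub`, and metrics moved to the separate `evaluate` library.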

In [112]:
from datasets import Dataset,load_dataset,concatenate_datasets,load_from_disk
import datasets
import config
import os
import pandas as pd
from sklearn.model_selection import train_test_split

#### Let's start with how to load data in memory

- Load from a dictionary

In [3]:
my_dict = {"text": ['a','b','c'],"label":[0,1,0]}
dataset = Dataset.from_dict(my_dict)
print(dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 3
})


- Load from a pandas DataFrame

In [31]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
df = pd.read_csv(data_path,encoding='utf8')
df.head(2)

|   | text | label |
|---|------|-------|
| 0 | yeah true i defiantly think though they are no... | 1.0 |
| 1 | 2 that is really the issue carbon tax paris ac... | 0.0 |


In [32]:
dataset = Dataset.from_pandas(df)
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 400
})

- #### Load from local files

In [38]:
# Load a JSON Lines file
data_path = os.path.join(config.data_folder, 'tweet', 'data', 'tweets.jsonl')
# Pass split='train' to get a Dataset back; without it, load_dataset returns a DatasetDict
dataset = load_dataset('json', data_files=data_path, split='train')
dataset

Using custom data configuration default-f2c67655031212df


Downloading and preparing dataset json/default to /home/chengyu/.cache/huggingface/datasets/json/default-f2c67655031212df/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b...



Dataset json downloaded and prepared to /home/chengyu/.cache/huggingface/datasets/json/default-f2c67655031212df/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'label'],
    num_rows: 400
})

In [52]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
dataset = load_dataset('csv', data_files=data_path)  # no split argument: returns a DatasetDict whose only split is 'train'
dataset

Using custom data configuration default-7285691f768b7810
Reusing dataset csv (/home/chengyu/.cache/huggingface/datasets/csv/default-7285691f768b7810/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 400
    })
})

- #### For other loading options, see: https://huggingface.co/docs/datasets/loading
- #### For other processing options, see: https://huggingface.co/docs/datasets/process

### Sample process for reading custom data (basic version). For a more complete version, see [this article](https://towardsdatascience.com/my-experience-with-uploading-a-dataset-on-huggingfaces-dataset-hub-803051942c2d) or the official documentation

In [58]:
RANDOM_SEED = 4

In [103]:
data_path = os.path.join(config.data_folder,'tweet','data','tweets.csv')
df = pd.read_csv(data_path,encoding='utf8')
df = df[~df['text'].isna()]  # drop rows with missing text so every entry is a string
df.head(2)

|   | text | label |
|---|------|-------|
| 0 | yeah true i defiantly think though they are no... | 1.0 |
| 1 | 2 that is really the issue carbon tax paris ac... | 0.0 |


In [104]:
df_train, df_test = train_test_split(df,test_size=0.3,random_state=RANDOM_SEED)

In [105]:
dataset_train = Dataset.from_pandas(df_train,split='train')
dataset_test = Dataset.from_pandas(df_test,split='test')
dataset = datasets.DatasetDict({'train':dataset_train, 'test':dataset_test})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__'],
        num_rows: 120
    })
})


- You can also split with the library's own functions; see [Split](https://huggingface.co/docs/datasets/process#split).
- For cross-validation, you can use the [Shard](https://huggingface.co/docs/datasets/process#shard) function.

- #### Now we can follow the standard preprocessing workflow

In [106]:
from transformers import AutoTokenizer

In [107]:
dataset['test'][5]

{'text': 'once again total lack of climate leadership at fancy words amp empty promises on the ground kowtowing to powerful fossil fuel industry is m o of this admin leasing to ff industry in public lands must be restricted amp at high price cut subsidies no more tax incentives',
 'label': 0.0,
 '__index_level_0__': 272}

In [108]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=6)  # batched processing across 6 worker processes

In [110]:
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120
    })
})


- ### Save and load encoded datasets 

In [111]:
dataset_out_path = os.path.join(config.data_folder,'tweet','data','tweets_dataset')
tokenized_datasets.save_to_disk(dataset_out_path)

In [113]:
tokenized_datasets = load_from_disk(dataset_out_path)
print(tokenized_datasets)

DatasetDict({
    train: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 278
    })
    test: Dataset({
        features: ['text', 'label', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 120
    })
})
