In [None]:
import os
from transformers import pipeline
from datasets import load_dataset

### TOC
1. [Load custom datasets](#Load-custom-datasets)
2. [Time to slice and dice](#Time-to-slice-and-dice)

### Load custom datasets

#### Loading a local dataset

In [None]:
!wget -P 05_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
!wget -P 05_data https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz

In [None]:
!gzip -dkv 05_data/SQuAD_it-*.json.gz

In [None]:
squad_dataset = load_dataset('json', data_files='05_data/SQuAD_it-train.json', field='data')

In [None]:
squad_dataset

In [None]:
squad_dataset['train'][0]

In [None]:
data_files = {"train": "05_data/SQuAD_it-train.json", "test": "05_data/SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset

The load_dataset() function from the datasets library also supports automatic decompression of its input files, so we can skip the manual gzip step and point data_files directly at the compressed .json.gz files:

In [None]:
data_files = {"train": "05_data/SQuAD_it-train.json.gz", "test": "05_data/SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
squad_it_dataset
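To make the idea of reading a compressed file in one step, and the role of the field argument, more concrete, here is a minimal sketch that uses only the Python standard library. It is an analogy under stated assumptions, not how 🤗 Datasets is implemented: the toy payload and the helper name read_field are hypothetical, and the nested layout (records stored under a top-level "data" key) merely imitates the SQuAD format.

```python
import gzip
import io
import json

# Hypothetical toy payload that imitates the SQuAD layout:
# the records of interest sit under the top-level "data" key.
payload = {
    "version": "toy",
    "data": [
        {"title": "Doc A", "paragraphs": []},
        {"title": "Doc B", "paragraphs": []},
    ],
}

# Compress the JSON in memory, standing in for a .json.gz file.
compressed = gzip.compress(json.dumps(payload).encode("utf-8"))


def read_field(gz_bytes, field):
    """Decode gzip-compressed JSON as a stream and return one top-level field."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)[field]


# Only the list stored under "data" is kept; "version" is ignored.
records = read_field(compressed, "data")
print(len(records), records[0]["title"])
```

The same selection is what field="data" asks for in the cells above: the rows come from the list nested under that key rather than from the top level of the JSON document.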

#### Loading a remote dataset

In [None]:
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

In [None]:
squad_it_dataset

### Time to slice and dice

In [None]:
!wget -P 05_data "http://archive.ics.uci.edu/static/public/461/drug+review+dataset+druglib+com.zip"
!unzip -d 05_data 05_data/drug+review+dataset+druglib+com

In [None]:
!wget -P 05_data "http://archive.ics.uci.edu/static/public/462/drug+review+dataset+drugs+com.zip"
!unzip -d 05_data 05_data/drug+review+dataset+drugs+com

In [None]:
data_files = {"train": "05_data/drugsComTrain_raw.tsv", "test": "05_data/drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

In [None]:
drug_dataset

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]
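The shuffle-then-select pattern in the cell above can be pictured with a plain-Python analogy. This is only a sketch of the idea that a fixed seed makes a random sample repeatable; it does not reproduce the shuffling that 🤗 Datasets performs internally, and sample_indices is a hypothetical helper introduced for illustration.

```python
import random


def sample_indices(num_rows, sample_size, seed):
    """Shuffle row indices with a fixed seed, then keep the first `sample_size`."""
    indices = list(range(num_rows))
    # A dedicated Random instance keeps the seed local to this call.
    random.Random(seed).shuffle(indices)
    return indices[:sample_size]


first = sample_indices(num_rows=20, sample_size=5, seed=42)
second = sample_indices(num_rows=20, sample_size=5, seed=42)

# The same seed yields the same sample on every run.
print(first == second)
# Shuffling permutes the indices, so no row is drawn twice.
print(len(set(first)) == len(first))
```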

Note that we’ve fixed the seed in Dataset.shuffle() for reproducibility purposes. Dataset.select() expects an iterable of indices, so we’ve passed range(1000) to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks in our dataset:

- The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
- The condition column includes a mix of uppercase and lowercase labels.
- The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &\#039;.

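The three quirks can be reproduced on a couple of made-up rows using only the standard library. This is a rough sketch under stated assumptions: the rows below are invented for illustration and are not taken from the dataset, and the checks are plain-Python stand-ins rather than 🤗 Datasets calls.

```python
import html

# Made-up rows that imitate the quirks listed above (not real dataset entries).
rows = [
    {
        "Unnamed: 0": 101,
        "condition": "Left Ventricular Dysfunction",
        "review": "I&#039;m better.\r\nNo side effects.",
    },
    {
        "Unnamed: 0": 102,
        "condition": "ADHD",
        "review": "Didn&#039;t help much.",
    },
]

# Quirk 1: if the first column is an ID, every value should be distinct.
ids = [row["Unnamed: 0"] for row in rows]
ids_are_unique = len(set(ids)) == len(rows)

# Quirk 2: lowercase the condition labels so the casing is consistent.
conditions = [row["condition"].lower() for row in rows]

# Quirk 3: decode HTML character codes and measure review length in words.
reviews = [html.unescape(row["review"]) for row in rows]
review_lengths = [len(review.split()) for review in reviews]

print(ids_are_unique, conditions, reviews[1], review_lengths)
```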
Let’s see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the Dataset.unique() function to verify that the number of IDs matches the number of rows in each split: