#**Slicing and dicing our data**#
##Similar to Pandas, 🤗 Datasets provides several functions to manipulate the contents of Dataset and DatasetDict objects. We already encountered the `Dataset.map()` method in Chapter 3, and in this section we'll explore some of the other functions at our disposal.

##For this example we'll use the Drug Review Dataset that's hosted on the UC Irvine Machine Learning Repository, which contains patient reviews on various drugs, along with the condition being treated and a 10-star rating of the patient's satisfaction.

##First we need to download and extract the data, which can be done with the `wget` and `unzip` commands:

In [None]:
!pip install datasets


In [None]:
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip

--2024-06-27 14:37:50--  https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘drugsCom_raw.zip’

drugsCom_raw.zip        [         <=>        ]  41.00M  21.6MB/s    in 1.9s    

2024-06-27 14:37:52 (21.6 MB/s) - ‘drugsCom_raw.zip’ saved [42989872]

Archive:  drugsCom_raw.zip
  inflating: drugsComTest_raw.tsv    
  inflating: drugsComTrain_raw.tsv   


##Since TSV is just a variant of CSV that uses tabs instead of commas as the separator, we can load these files by using the csv loading script and specifying the delimiter argument in the load_dataset() function as follows:

In [None]:
from datasets import load_dataset

data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
# \t is the tab character in Python
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")
drug_dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})
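##To make the role of the `delimiter` argument concrete, here is a minimal sketch that parses tab-separated text with Python's built-in `csv` module. The sample rows are made up for illustration and only mimic a few of the columns shown above; this is not how `load_dataset()` is implemented.

```python
import csv
import io

# Hypothetical two-row sample in TSV form (values invented for illustration)
sample_tsv = (
    "drugName\tcondition\trating\n"
    "Naproxen\tGout, Acute\t9.0\n"
    "Mobic\tInflammatory Conditions\t10.0\n"
)

# Passing delimiter="\t" makes the reader split fields on tabs,
# so the comma inside "Gout, Acute" stays part of the field
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
rows = list(reader)

print(rows[0]["condition"])
print(len(rows))
```

##Because the separator is a tab, values that contain commas do not need any special quoting.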

##A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data you're working with. In 🤗 Datasets, we can create a random sample by chaining the `Dataset.shuffle()` and `Dataset.select()` functions together:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))
# Peek at the first few examples
drug_sample[:3]

{'Unnamed: 0': [87571, 178045, 80482],
 'drugName': ['Naproxen', 'Duloxetine', 'Mobic'],
 'condition': ['Gout, Acute', 'ibromyalgia', 'Inflammatory Conditions'],
 'review': ['"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"',
  '"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."',
  '"I have been taking Mobic for over a year with no side effects other than 

##Note that we've fixed the seed in `Dataset.shuffle()` for reproducibility purposes. `Dataset.select()` expects an iterable of indices, so we've passed `range(1000)` to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks in our dataset:

##The `Unnamed: 0` column looks suspiciously like an anonymized ID for each patient.
##The `condition` column includes a mix of uppercase and lowercase labels.
##The reviews are of varying length and contain a mix of Python line separators (`\r\n`) as well as HTML character codes like `&#039;`.
##Let's see how we can use 🤗 Datasets to deal with each of these issues. To test the patient ID hypothesis for the Unnamed: 0 column, we can use the `Dataset.unique()` function to verify that the number of IDs matches the number of rows in each split:

#**Let's break down the code below and explain its purpose:**#

##1. The loop iterates through each key (split) in `drug_dataset`, a `DatasetDict` that here holds the `train` and `test` splits.

##2. For each split, the code compares two quantities:
##- `len(drug_dataset[split])` is the total number of rows in the current split.
##- `drug_dataset[split].unique("Unnamed: 0")` retrieves the unique values in the `Unnamed: 0` column, and taking `len()` of that result counts them.

##3. The assertion checks that the two numbers are equal, that is, that no value in the `Unnamed: 0` column is repeated within a split.

##If the assertion fails for any split, Python raises an `AssertionError`, which would tell us that the column cannot serve as a unique identifier for each row.


In [None]:
# Iterate through each split in the drug_dataset dictionary
for split in drug_dataset.keys():
    # Assert that the length of the current split is equal to the number of unique values in the "Unnamed: 0" column
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))


##This seems to confirm our hypothesis, so let's clean up the dataset a bit by renaming the Unnamed: 0 column to something a bit more interpretable. We can use the `DatasetDict.rename_column()` function to rename the column across both splits in one go:

In [None]:
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset


DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

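##We can run a similar check on the `drugName` column to count the unique drugs in each split. Drug names repeat across reviews, so here the row count and the unique count are expected to differ, and the messages printed below simply report the two numbers: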

In [None]:
for split in drug_dataset.keys():
    num_rows = len(drug_dataset[split])
    num_unique_drugs = len(drug_dataset[split].unique("drugName"))
    if num_rows != num_unique_drugs:
        print(f"Assertion failed for split: {split}")
        print(f"Number of rows: {num_rows}")
        print(f"Number of unique drug names: {num_unique_drugs}")

Assertion failed for split: train
Number of rows: 161297
Number of unique drug names: 3436
Assertion failed for split: test
Number of rows: 53766
Number of unique drug names: 2637


##Next, let's normalize all the condition labels using `Dataset.map()`. As we did with tokenization in Chapter 3, we can define a simple function that can be applied across all the rows of each split in drug_dataset:

In [None]:
def lowercase_condition(example):
    if example["condition"] is not None:
        return {"condition": example["condition"].lower()}
    else:
        return {"condition": None}  # Or any suitable default value

drug_dataset.map(lowercase_condition)

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 161297
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'],
        num_rows: 53766
    })
})

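##Some entries in the `condition` column are `None`, which is why `lowercase_condition()` checks for them before calling `.lower()`. Rather than carry these rows along, we can drop them with `Dataset.filter()`. Lambda functions are handy when you want to define small, single-use functions (for more information about them, we recommend reading the excellent Real Python tutorial by Andre Burgaud). In the 🤗 Datasets context, we can use lambda functions to define simple map and filter operations, so let's use this trick to eliminate the `None` entries in our dataset: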

In [None]:
drug_dataset = drug_dataset.filter(lambda x: x["condition"] is not None)

##With the None entries removed, we can normalize our condition column:

In [None]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

['left ventricular dysfunction', 'adhd', 'birth control']
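
##The filter-then-map pattern used above can be mimicked on a plain list of dictionaries. The sketch below is only a mental model with made-up rows, not the 🤗 Datasets implementation: the predicate decides which rows survive, and the dictionary returned by the map function is merged into each row.

```python
# Hypothetical rows standing in for dataset examples
rows = [
    {"drugName": "Valsartan", "condition": "Left Ventricular Dysfunction"},
    {"drugName": "Guanfacine", "condition": "ADHD"},
    {"drugName": "Unknown", "condition": None},
]


def lowercase_condition(example):
    return {"condition": example["condition"].lower()}


# Step 1: keep only rows whose condition is present (like Dataset.filter)
kept = [row for row in rows if row["condition"] is not None]

# Step 2: merge the returned dictionary into each row (like Dataset.map)
normalized = [{**row, **lowercase_condition(row)} for row in kept]

print([row["condition"] for row in normalized])
```

##Filtering first means the map function never sees a `None` value, so it no longer needs a guard.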

#**Creating new columns**#
##Whenever you're dealing with customer reviews, a good practice is to check the number of words in each review. A review might be just a single word like "Great!" or a full-blown essay with thousands of words, and depending on the use case you'll need to handle these extremes differently. To compute the number of words in each review, we'll use a rough heuristic based on splitting each text by whitespace.
##Let's define a simple function that counts the number of words in each review:

#**Let's break down the code below and explain its purpose:**#

##1. The function `compute_review_length` takes a single parameter `example`. When the function is passed to `Dataset.map()` without `batched=True`, `example` is a dictionary holding one row of the dataset.

##2. Inside the function, we access the "review" key of the `example` dictionary, which contains the text of the review.

##3. The `split()` method is called on the review text. Without arguments, `split()` uses whitespace as the delimiter, effectively separating the text into words.

##4. The `len()` function is then used to count the number of elements in the resulting list of words, giving us the word count of the review.

##5. The function returns a new dictionary with a single key-value pair. The key is "review_length" and the value is the computed word count.

##This function follows good practices by having a clear, descriptive name and a single responsibility - computing the length of a review. It's a concise and efficient way to add a new feature (review length) to your dataset.

##To further improve this code, you could consider:

##1. Adding error handling to deal with cases where the "review" key might not exist in the input dictionary.
##2. Potentially normalizing the text (e.g., lowercasing, removing punctuation) before splitting, depending on your specific requirements.
##3. Adding a docstring to provide more detailed information about the function's purpose, parameters, and return value.


In [None]:
# Function to compute the length of a review in words
def compute_review_length(example):
    # Split the review text into words and count them
    return {"review_length": len(example["review"].split())}


##Unlike our `lowercase_condition()` function, `compute_review_length()` returns a dictionary whose key does not correspond to one of the column names in the dataset. In this case, when compute_review_length() is passed to Dataset.map(), it will be applied to all the rows in the dataset to create a new review_length column:

In [None]:
drug_dataset = drug_dataset.map(compute_review_length)
# Inspect the first three training examples
drug_dataset["train"][0:3]

{'patient_id': [206461, 95260, 92703],
 'drugName': ['Valsartan', 'Guanfacine', 'Lybrel'],
 'condition': ['left ventricular dysfunction', 'adhd', 'birth control'],
 'review': ['"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"',
  '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effect

##As expected, we can see a review_length column has been added to our training set. We can sort this new column with `Dataset.sort()` to see what the extreme values look like:

In [None]:
drug_dataset["train"].sort("review_length")[:3]

{'patient_id': [111469, 13653, 53602],
 'drugName': ['Ledipasvir / sofosbuvir',
  'Amphetamine / dextroamphetamine',
  'Alesse'],
 'condition': ['hepatitis c', 'adhd', 'birth control'],
 'review': ['"Headache"', '"Great"', '"Awesome"'],
 'rating': [10.0, 10.0, 10.0],
 'date': ['February 3, 2015', 'October 20, 2009', 'November 23, 2015'],
 'usefulCount': [41, 3, 0],
 'review_length': [1, 1, 1]}

##As we suspected, some reviews contain just a single word, which, although it may be okay for sentiment analysis, would not be informative if we want to predict the condition.
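
##As a quick check of the whitespace heuristic, the sketch below applies `str.split()` to a few made-up strings. It illustrates that any run of whitespace, including `\r\n`, counts as a single separator, and that punctuation is not treated specially, which is why the count is only approximate.

```python
def count_words(text):
    # str.split() with no arguments splits on any run of whitespace
    return len(text.split())


samples = [
    "Great!",
    "It is great\r\nas a pain reducer",
    "works   well , no side effects",
]

word_counts = [count_words(text) for text in samples]
print(word_counts)
```

##In the last string the stray comma is counted as a word of its own, one example of how the heuristic can over-count.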

##🙋 An alternative way to add new columns to a dataset is with the `Dataset.add_column()` function. This allows you to provide the column as a Python list or NumPy array and can be handy in situations where `Dataset.map()` is not well suited for your analysis.

##Let's use the `Dataset.filter()` function to remove reviews that contain 30 words or fewer. Similarly to what we did with the `condition` column, we can filter out the very short reviews by requiring that the reviews have a length above this threshold:

In [None]:
# Filter the drug_dataset to keep only rows where the review length is greater than 30 words
drug_dataset = drug_dataset.filter(lambda x: x["review_length"] > 30)

# Print the number of rows in the filtered dataset
print(drug_dataset.num_rows)

{'train': 139280, 'test': 46352}


##The last thing we need to deal with is the presence of HTML character codes in our reviews. We can use Python's html module to unescape these characters, like so:

In [None]:
import html

text = "I&#039;m a transformer called BERT"
html.unescape(text)

"I'm a transformer called BERT"

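##`html.unescape()` handles named entities as well as numeric ones. The short sketch below uses invented strings to show this, and also that a doubly escaped code is only unescaped one level per call.

```python
import html

# Numeric and named character references are both converted
numeric = html.unescape("I&#039;m fine")
named = html.unescape("Tom &amp; Jerry said &quot;hi&quot;")

# A doubly escaped apostrophe needs two passes to be fully restored
once = html.unescape("I&amp;#039;m fine")
twice = html.unescape(once)

print(numeric, named, once, twice, sep="\n")
```
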
##We'll use `Dataset.map()` to unescape all the HTML characters in our corpus:

In [None]:
drug_dataset = drug_dataset.map(lambda x: {"review": html.unescape(x["review"])})

#**The map() method's superpowers**#
##The Dataset.map() method takes a batched argument that, if set to True, causes it to send a batch of examples to the map function at once (the batch size is configurable but defaults to 1,000). For instance, the previous map function that unescaped all the HTML took a bit of time to run (you can read the time taken from the progress bars). We can speed this up by processing several elements at the same time using a list comprehension.

##When you specify `batched=True` the function receives a dictionary with the fields of the dataset, but each value is now a list of values, and not just a single value. The return value of the function should have the same structure: a dictionary with the fields we want to update or add to our dataset, each mapped to a list of values. For example, here is another way to unescape all HTML characters, but using `batched=True`:

In [None]:
new_drug_dataset = drug_dataset.map(
    lambda x: {"review": [html.unescape(o) for o in x["review"]]}, batched=True
)


##If you're running this code in a notebook, you'll see that this command executes way faster than the previous one. And it's not because our reviews have already been HTML-unescaped -- if you re-execute the instruction from the previous section (without batched=True), it will take the same amount of time as before. This is because list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.
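
##To make the batched calling convention concrete, here is a plain-Python sketch of what happens conceptually. The helper `map_batched` is hypothetical and is not part of 🤗 Datasets; it only mimics how a column-oriented batch (a dictionary of lists) is passed to the function and how the returned lists are written back.

```python
import html


def map_batched(columns, function, batch_size=1000):
    """Hypothetical stand-in for Dataset.map(..., batched=True)."""
    num_rows = len(next(iter(columns.values())))
    result = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dictionary mapping column names to lists of values
        batch = {
            name: values[start : start + batch_size]
            for name, values in columns.items()
        }
        updates = function(batch)
        # Returned columns replace the originals; untouched columns are copied over
        merged = {**batch, **updates}
        for name, values in merged.items():
            result.setdefault(name, []).extend(values)
    return result


columns = {
    "review": ["I&#039;m better", "It&#039;s fine", "No change"],
    "rating": [9.0, 7.0, 5.0],
}

unescaped = map_batched(
    columns,
    lambda batch: {"review": [html.unescape(text) for text in batch["review"]]},
    batch_size=2,
)
print(unescaped["review"])
```

##With `batch_size=2` the function is called twice here, once with two reviews and once with the remaining one.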

##Using `Dataset.map()` with `batched=True` will be essential to unlock the speed of the "fast" tokenizers that we'll encounter in Chapter 6, which can quickly tokenize big lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["review"], truncation=True)


##As you saw in Chapter 3, we can pass one or several examples to the tokenizer, so we can use this function with or without batched=True. Let's take this opportunity to compare the performance of the different options. In a notebook, **you can time a one-line instruction by adding** `%time` **before the line of code you wish to measure**:

In [None]:
%time tokenized_dataset = drug_dataset.map(tokenize_function, batched=True)

CPU times: user 1min 36s, sys: 702 ms, total: 1min 36s
Wall time: 1min 4s


##Dataset.map() also has some parallelization capabilities of its own. Since they are not backed by Rust, they won't let a slow tokenizer catch up with a fast one, but they can still be helpful (especially if you're using a tokenizer that doesn't have a fast version). To enable multiprocessing, use the num_proc argument and specify the number of processes to use in your call to Dataset.map():

##The code below tokenizes the drug reviews with a slow tokenizer and multiprocessing. Here's a breakdown of what each part does:

##1. First, it loads a BERT tokenizer using the `AutoTokenizer` class. Passing `use_fast=False` selects the slower pure-Python implementation of the tokenizer instead of the faster Rust-backed one.

##2. Next, it defines a function called `slow_tokenize_function`. This function takes a batch of examples and tokenizes the text in the "review" column. The `truncation=True` parameter truncates any review longer than the maximum sequence length the tokenizer is configured for (512 tokens for `bert-base-cased`).

##3. Finally, it applies the tokenization function to the entire `drug_dataset` using the `map` method. The `batched=True` parameter passes multiple examples to the function at once, and `num_proc=8` splits the work across 8 worker processes, which can speed up tokenization of large datasets when enough CPU cores are available.

##Tokenization converts the text into the token IDs that the model takes as input, so it is a necessary preprocessing step before feeding the reviews to a BERT model.

In [None]:
# Load the BERT tokenizer without using the fast tokenizer implementation
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# Define a function to tokenize the text in the 'review' column
def slow_tokenize_function(examples):
    # Tokenize the 'review' text, truncating if necessary
    return slow_tokenizer(examples["review"], truncation=True)

# Apply the tokenization function to the entire dataset
tokenized_dataset = drug_dataset.map(
    # The function to apply
    slow_tokenize_function,
    # Process the examples in batches for efficiency
    batched=True,
    # Use 8 worker processes in parallel
    num_proc=8
)

##💡 In machine learning, an example is usually defined as the set of features that we feed to the model. In some contexts, these features will be the set of columns in a Dataset, but in others (like here and for question answering), multiple features can be extracted from a single example and belong to a single column.

##Let's have a look at how it works! Here we will tokenize our examples and truncate them to a maximum length of 128, but we will ask the tokenizer to return all the chunks of the texts instead of just the first one. This can be done with return_overflowing_tokens=True:

In [None]:
def tokenize_and_split(examples):
    return tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )

##Let's test this on one example before using `Dataset.map()` on the whole dataset:

In [None]:
result = tokenize_and_split(drug_dataset["train"][0])
[len(inp) for inp in result["input_ids"]]

[128, 49]

##So, our first example in the training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first one of length 128 and the second one of length 49. Now let's do this for all elements of the dataset!

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)

##This call fails with an error that reports a mismatch between column lengths: one column has length 1,463 and another has length 1,000. The problem is that we're trying to mix two different datasets of different sizes: the `drug_dataset` columns will have a certain number of examples (the 1,000 in the error, which is the default batch size), but the `tokenized_dataset` we are building will have more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using `return_overflowing_tokens=True`). That doesn't work for a `Dataset`, so we need to either remove the columns from the old dataset or make them the same size as they are in the new dataset. We can do the former with the `remove_columns` argument:

In [None]:
tokenized_dataset = drug_dataset.map(
    tokenize_and_split, batched=True, remove_columns=drug_dataset["train"].column_names
)


##Now this works without error. We can check that our new dataset has many more elements than the original dataset by comparing the lengths:

In [None]:
len(tokenized_dataset["train"]), len(drug_dataset["train"])

(206772, 138514)

##We mentioned that we can also deal with the mismatched length problem by making the old columns the same size as the new ones. To do this, we will need the `overflow_to_sample_mapping` field the tokenizer returns when we set `return_overflowing_tokens=True`. It gives us a mapping from a new feature index to the index of the sample it originated from. Using this, we can associate each key present in our original dataset with a list of values of the right size by repeating the values of each example as many times as it generates new features:

In [None]:
def tokenize_and_split(examples):
    result = tokenizer(
        examples["review"],
        truncation=True,
        max_length=128,
        return_overflowing_tokens=True,
    )
    # Extract mapping between new and old indices
    sample_map = result.pop("overflow_to_sample_mapping")
    for key, values in examples.items():
        result[key] = [values[i] for i in sample_map]
    return result

##We can see it works with `Dataset.map()` without us needing to remove the old columns:

In [None]:
tokenized_dataset = drug_dataset.map(tokenize_and_split, batched=True)
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 207852
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 69224
    })
})
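
##The bookkeeping behind `overflow_to_sample_mapping` can be illustrated without a real tokenizer. In the sketch below, `chunk_tokens` is a hypothetical helper that splits on whitespace and cuts each text into chunks of at most `max_length` words; it only imitates the shape of the tokenizer output and ignores special tokens.

```python
def chunk_tokens(texts, max_length):
    """Hypothetical whitespace 'tokenizer' that returns overflow chunks."""
    input_ids, sample_map = [], []
    for sample_index, text in enumerate(texts):
        tokens = text.split()
        for start in range(0, len(tokens), max_length):
            input_ids.append(tokens[start : start + max_length])
            # Remember which original sample this chunk came from
            sample_map.append(sample_index)
    return {"input_ids": input_ids, "overflow_to_sample_mapping": sample_map}


examples = {
    "review": ["a b c d e", "f g", "h i j"],
    "rating": [9.0, 5.0, 7.0],
}

result = chunk_tokens(examples["review"], max_length=2)
sample_map = result.pop("overflow_to_sample_mapping")

# Repeat each original value once per chunk its sample produced
for key, values in examples.items():
    result[key] = [values[i] for i in sample_map]

print(sample_map)
print(result["rating"])
```

##After the loop every column in `result` has one entry per chunk, so all columns have the same length again.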

#**From Datasets to DataFrames and back**#
##To enable the conversion between various third-party libraries, 🤗 Datasets provides a `Dataset.set_format()` function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to `Pandas`:

In [None]:
drug_dataset.set_format("pandas")

##Now when we access elements of the dataset we get a `pandas.DataFrame` instead of a dictionary:

In [None]:
drug_dataset["train"][:3]

| | patient_id | drugName | condition | review | rating | date | usefulCount | review_length |
|---|---|---|---|---|---|---|---|---|
| 0 | 95260 | Guanfacine | adhd | "My son is halfway through his fourth week of ... | 8.0 | April 27, 2010 | 192 | 141 |
| 1 | 92703 | Lybrel | birth control | "I used to take another oral contraceptive, wh... | 5.0 | December 14, 2009 | 17 | 134 |
| 2 | 138000 | Ortho Evra | birth control | "This is my first time using any form of birth... | 8.0 | November 3, 2015 | 10 | 89 |


##Let's create a `pandas.DataFrame` for the whole training set by selecting all the elements of `drug_dataset["train"]`:

In [None]:
train_df = drug_dataset["train"][:]


##🚨 Under the hood, `Dataset.set_format()` changes the return format for the dataset's `__getitem__()` dunder method. This means that when we want to create a new object like `train_df` from a `Dataset` in the `"pandas"` format, we need to slice the whole dataset to obtain a `pandas.DataFrame`. You can verify for yourself that the type of `drug_dataset["train"]` is `Dataset`, irrespective of the output format.
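
##As a rough mental model of why the format only affects what indexing returns, consider the toy class below. `ToyDataset` is purely illustrative and unrelated to the real `Dataset` implementation: the stored columns never change, and `set_format()` only switches how `__getitem__()` presents a slice.

```python
class ToyDataset:
    """Illustrative only: stores columns once, formats them on access."""

    def __init__(self, columns):
        self._columns = columns
        self._format = "dict"

    def set_format(self, name):
        # Changing the format does not touch the stored data
        self._format = name

    def __getitem__(self, key):
        sliced = {name: values[key] for name, values in self._columns.items()}
        if self._format == "rows":
            # Present the same slice as a list of row tuples
            return list(zip(*sliced.values()))
        return sliced


toy = ToyDataset({"condition": ["adhd", "acne"], "rating": [8.0, 5.0]})
as_dict = toy[:]
toy.set_format("rows")
as_rows = toy[:]

print(type(toy).__name__, as_dict, as_rows)
```

##The object itself stays a `ToyDataset` throughout; only the value returned by slicing changes.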

##From here we can use all the Pandas functionality that we want. For example, we can do fancy chaining to compute the class distribution among the `condition` entries:

In [None]:
frequencies = (
    train_df["condition"]
    .value_counts()
    # Name the index explicitly so the result is the same across pandas versions
    .rename_axis("condition")
    .reset_index(name="frequency")
)
frequencies.head()


| | condition | frequency |
|---|---|---|
| 0 | birth control | 27655 |
| 1 | depression | 8023 |
| 2 | acne | 5209 |
| 3 | anxiety | 4991 |
| 4 | pain | 4744 |


##And once we're done with our Pandas analysis, we can always create a new `Dataset` object by using the `Dataset.from_pandas()` function as follows:

In [None]:
from datasets import Dataset

freq_dataset = Dataset.from_pandas(frequencies)
freq_dataset

Dataset({
    features: ['condition', 'frequency'],
    num_rows: 819
})

##This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let's create a validation set to prepare the dataset for training a classifier on. Before doing so, we'll reset the output format of drug_dataset from `"pandas"` to `"arrow"`:

In [None]:
drug_dataset.reset_format()

#**Creating a validation set**#
##Although we have a test set we could use for evaluation, it's a good practice to leave the test set untouched and create a separate validation set during development. Once you are happy with the performance of your models on the validation set, you can do a final sanity check on the test set. This process helps mitigate the risk that you'll overfit to the test set and deploy a model that fails on real-world data.

##🤗 Datasets provides a `Dataset.train_test_split()` function that is based on the famous functionality from scikit-learn. Let's use it to split our training set into train and validation splits (we set the seed argument for reproducibility):

In [None]:
drug_dataset_clean = drug_dataset["train"].train_test_split(train_size=0.8, seed=42)
# Rename the default "test" split to "validation"
drug_dataset_clean["validation"] = drug_dataset_clean.pop("test")
# Add the "test" set to our `DatasetDict`
drug_dataset_clean["test"] = drug_dataset["test"]
drug_dataset_clean

DatasetDict({
    train: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 111424
    })
    validation: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 27856
    })
    test: Dataset({
        features: ['patient_id', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'review_length'],
        num_rows: 46352
    })
})
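
##For intuition, a seeded shuffle-and-cut split can be sketched in plain Python. The code below is an assumption-laden illustration rather than the algorithm `Dataset.train_test_split()` uses, so it will not reproduce the same row assignment; it only shows why fixing the seed makes the split repeatable. The helper name `split_indices` and its `train_percent` parameter are hypothetical.

```python
import random


def split_indices(num_rows, train_percent, seed):
    """Shuffle row indices with a fixed seed, then cut them into two parts."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    cut = num_rows * train_percent // 100
    return indices[:cut], indices[cut:]


train_idx, valid_idx = split_indices(num_rows=10, train_percent=80, seed=42)
again_train, again_valid = split_indices(num_rows=10, train_percent=80, seed=42)

print(len(train_idx), len(valid_idx))
print(train_idx == again_train)
```

##Calling the helper twice with the same seed yields the same partition, and the two parts never share a row.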

#**Saving a dataset**#
##Although 🤗 Datasets will cache every downloaded dataset and the operations performed on it, there are times when you'll want to save a dataset to disk (e.g., in case the cache gets deleted). As shown in the table below, 🤗 Datasets provides three main functions to save your dataset in different formats:

| Data format | Function |
|---|---|
| Arrow | `Dataset.save_to_disk()` |
| CSV | `Dataset.to_csv()` |
| JSON | `Dataset.to_json()` |

##For example, let's save our cleaned dataset in the Arrow format:

In [None]:
drug_dataset_clean.save_to_disk("drug-reviews")

##Once the dataset is saved, we can load it by using the `load_from_disk()` function as follows:

In [None]:
from datasets import load_from_disk

drug_dataset_reloaded = load_from_disk("drug-reviews")
drug_dataset_reloaded

##For the CSV and JSON formats, we have to store each split as a separate file. One way to do this is by iterating over the keys and values in the DatasetDict object:

In [35]:
for split, dataset in drug_dataset_clean.items():
    dataset.to_json(f"drug-reviews-{split}.jsonl")

##This saves each split in JSON Lines format, where each row in the dataset is stored as a single line of JSON. Here's what the first example looks like:

In [36]:
!head -n 1 drug-reviews-train.jsonl

{"patient_id":151426,"drugName":"Chantix","condition":"smoking cessation","review":"\"I smoked for 38 years, half a pack a day. 2 days of Chantix my cigarettes tasted awful. I get no side affects, I'm sleeping so much better, I used to wake up in the middle of the night and think I had to get up and take a couple puffs to go back to sleep. Now when I wake up I fall right back to sleep. The best part is I don't even think about cigarettes, I'm on day 20. I'm just afraid once I stop the Chantix I might get cravings. I really feel blessed at this point for just being able to stop smoking, especially with no side affects. Oh and how can I forget, the Chantix was free through my insurance, also a blessing.\"","rating":10.0,"date":"November 8, 2016","usefulCount":29,"review_length":128}
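
##The JSON Lines layout is simple enough to reproduce with the standard library. The sketch below writes two made-up rows to an in-memory buffer, one JSON object per line, and reads them back; it is meant to illustrate the file format, not the internals of `Dataset.to_json()`.

```python
import io
import json

# Hypothetical rows (values invented for illustration)
rows = [
    {"drugName": "Chantix", "rating": 10.0, "review_length": 128},
    {"drugName": "Mobic", "rating": 9.0, "review_length": 45},
]

# Write: one JSON document per line, separated by newlines
buffer = io.StringIO()
for row in rows:
    buffer.write(json.dumps(row) + "\n")

# Read: parse each non-empty line independently
buffer.seek(0)
restored = [json.loads(line) for line in buffer if line.strip()]

print(restored == rows)
```

##Because every line is a complete JSON document, a reader can process the file one row at a time without loading it all into memory.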


##We can then use the techniques from section 2 to load the JSON files as follows:

In [37]:
data_files = {
    "train": "drug-reviews-train.jsonl",
    "validation": "drug-reviews-validation.jsonl",
    "test": "drug-reviews-test.jsonl",
}
drug_dataset_reloaded = load_dataset("json", data_files=data_files)

