### Time to slice and dice

First we need to download and extract the data, which can be done with the wget and unzip commands:
```bash
!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
!unzip drugsCom_raw.zip
```

In [None]:
from datasets import load_dataset

# The files are tab-separated, so we use the CSV loader with a tab delimiter
data_files = {"train": "drugsComTrain_raw.tsv", "test": "drugsComTest_raw.tsv"}
drug_dataset = load_dataset("csv", data_files=data_files, delimiter="\t")

To get a quick feel for the data, it helps to look at a small random sample. In 🤗 Datasets, we can create one by chaining the Dataset.shuffle() and Dataset.select() methods together:

In [None]:
drug_sample = drug_dataset["train"].shuffle(seed=42).select(range(1000))

drug_sample[:3]

From this sample we can already see a few quirks in our dataset:

1. The Unnamed: 0 column looks suspiciously like an anonymized ID for each patient.
2. The condition column includes a mix of uppercase and lowercase labels.
3. The reviews are of varying length and contain a mix of Python line separators (\r\n) as well as HTML character codes like &#039;.
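
As a small illustration of the third quirk, the sketch below cleans a single string with Python's standard library. The review text is made up for the example (it is not taken from the dataset), and collapsing line separators into spaces is just one possible choice:

```python
import html

# Hypothetical review text, written to mimic the quirks described above
raw_review = "I&#039;ve tried a few antidepressants.\r\nThis one worked &amp; I&#039;m glad."

# html.unescape() converts character codes such as &#039; back into characters
clean_review = html.unescape(raw_review)

# str.splitlines() splits on \r\n (among others); rejoin the pieces with spaces
clean_review = " ".join(clean_review.splitlines())

print(clean_review)
```

The same function could later be applied to every review in the dataset, for example through Dataset.map().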

In [None]:
# To test the patient ID hypothesis for the Unnamed: 0 column, we can use Dataset.unique()
# to verify that the number of IDs matches the number of rows in each split:
for split in drug_dataset.keys():
    assert len(drug_dataset[split]) == len(drug_dataset[split].unique("Unnamed: 0"))

In [None]:
# Rename the Unnamed: 0 column to something more interpretable
drug_dataset = drug_dataset.rename_column(
    original_column_name="Unnamed: 0", new_column_name="patient_id"
)
drug_dataset

In [None]:
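# Lowercase the condition column
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# Some conditions are None, and None has no .lower() method,
# so rows without a condition must be dropped before lowercasing
def filter_nones(x):
    return x["condition"] is not None

# filter() returns a new DatasetDict, so the result has to be assigned back
drug_dataset = drug_dataset.filter(filter_nones)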
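
The reason the None entries have to be filtered out before lowercasing can be shown with plain Python dictionaries. This is only a sketch with made-up rows, not the dataset itself:

```python
def lowercase_condition(example):
    return {"condition": example["condition"].lower()}

# Made-up rows; the second one has no condition
rows = [
    {"condition": "Left Ventricular Dysfunction"},
    {"condition": None},
    {"condition": "ADHD"},
]

# Lowercasing every row fails as soon as a None value is reached
try:
    [lowercase_condition(row) for row in rows]
except AttributeError as err:
    error_message = str(err)
    print(error_message)

# Dropping the None rows first makes the transformation safe
kept = [row for row in rows if row["condition"] is not None]
lowered = [lowercase_condition(row)["condition"] for row in kept]
print(lowered)
```

Dataset.filter() and Dataset.map() follow the same pattern, except that each call returns a new object instead of modifying the dataset in place.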

In [None]:
drug_dataset = drug_dataset.map(lowercase_condition)
# Check that lowercasing worked
drug_dataset["train"]["condition"][:3]

In [None]:
# Creating new columns
