# Working with Custom Data

### Introduction

In this lesson, we'll see how to use torchtext to work with custom data.  Many of the steps, like specifying field objects to process and numericalize the data, are the same.  One difference is that we now use the `TabularDataset` class to read csv and json files.  Let's get started.

### Loading our data

We can begin by loading our data from the csv file.

In [3]:
import pandas as pd
coconut_water_df = pd.read_csv('./coconut_water.csv', index_col = 0)

In [16]:
selected_df = coconut_water_df.iloc[:, 5:]

Because torchtext often reads data from local files, let's create a folder called `data` and then save the data to the path `./data/coconut_reviews.csv`.

In [17]:
selected_df.to_csv('./data/coconut_reviews.csv', index = False)
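
Note that `to_csv` does not create missing folders, so the `data` directory has to exist before the cell above runs. A minimal sketch of creating it from Python follows; the helper name `ensure_data_dir` is only an illustration and is not part of pandas or torchtext.

```python
from pathlib import Path


def ensure_data_dir(path="data"):
    """Create the folder (and any parents) if it is missing; do nothing otherwise."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Create ./data relative to the current working directory.
data_dir = ensure_data_dir("data")
print(data_dir.is_dir())
```

Because of `exist_ok=True`, calling the helper again is harmless when the folder is already there.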

And let's check that we stored it correctly.

In [126]:
coconut_water_df = pd.read_csv('./data/coconut_reviews.csv')

In [127]:
coconut_water_df[:2]

|   | Score | Time | Summary | Text |
|---|---|---|---|---|
| 0 | 1 | 1314144000 | Switched to O.N.E. | Must admit the taste of O.N.E. coconut water i... |
| 1 | 5 | 1313884800 | WOW!! | I love this stuff! Perfect blend of dark choc... |


### Using Torchtext

Ok, now let's try to use our data in `./data/coconut_reviews.csv` with torchtext.  We can do so by first defining a list of fields, and then specifying where to read the files from with `data.TabularDataset.splits`.

In [99]:
from torchtext import data
import torch
TEXT = data.Field(tokenize='spacy')
SCORE = data.LabelField(dtype = torch.float)

In [104]:
fields = [('score', SCORE), (None, None), (None, None), ('text', TEXT)]

In [179]:
train_data, test_data = data.TabularDataset.splits(
    path = 'data',
    train = 'coconut_reviews.csv',
    test = 'coconut_reviews.csv',
    format = 'csv',
    fields = fields,
    skip_header = True
)

And now we have a list of examples that we can work with.

In [92]:
train_data.examples[0].score

'1'

Let's take a moment to break down the code above.

First, we defined a field for each column from our dataset that we wanted to process: a `TEXT` field (a `data.Field`) for the review text, and a `SCORE` field (a `data.LabelField`) for the score, our target.  Then we specified a list of tuples, one tuple for each column in our csv file, in column order.  If we do not want to include a column, we fill its tuple with the elements `(None, None)`.  If we do want to include the column, then we specify the name of the attribute we want to store the column under, as well as the predefined field to process it with.
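
To make the column-to-tuple alignment concrete, here is a plain-Python sketch of the same idea. It does not use torchtext and is not how torchtext implements it; the two-row file, the `row_to_example` helper, and the strings standing in for the field objects are all made up for illustration.

```python
import csv
import io

# Hypothetical two-row file with the same four columns as coconut_reviews.csv.
csv_text = (
    "Score,Time,Summary,Text\n"
    "1,1314144000,Switched to O.N.E.,Must admit the taste\n"
    "5,1313884800,WOW!!,I love this stuff!\n"
)

# One entry per column, in column order; a None name marks a column to skip.
# Plain strings stand in for the SCORE and TEXT field objects here.
fields = [("score", "SCORE"), (None, None), (None, None), ("text", "TEXT")]


def row_to_example(row, fields):
    """Pair each column value with its attribute name, dropping skipped columns."""
    return {
        name: value
        for (name, _field), value in zip(fields, row)
        if name is not None
    }


reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row, similar in spirit to skip_header=True
examples = [row_to_example(row, fields) for row in reader]
print(examples[0])
```

The second and third columns (`Time` and `Summary`) never reach the example because their tuples carry no attribute name.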

We can see this if we take a closer look at one of our `examples` from our `train_data` above.  

In [234]:
train_data.examples[:2]

[<torchtext.data.example.Example at 0x12a69d550>,
 <torchtext.data.example.Example at 0x126bae750>]

In [180]:
first_example = train_data.examples[0]

We can see that our example has `score` and `text` attributes, just as we defined in our `fields` above.

In [181]:
first_example.score

'1'

In [51]:
first_example.text[:5]

['Must', 'admit', 'the', 'taste', 'of']

Once we have initialized our dataset, we can then numericalize and batch our data just as we've done previously.

In [236]:
TEXT.build_vocab(train_data)
SCORE.build_vocab(train_data)

In [129]:
TEXT.vocab.freqs.most_common(10)

[('.', 1852),
 ('I', 1470),
 ('the', 1457),
 (',', 1191),
 ('and', 962),
 ('it', 857),
 ('a', 828),
 ('to', 762),
 (' ', 705),
 ('is', 628)]

Now, if we look at how our labels were translated to numbers, we see that each label is not mapped to its corresponding integer.  For example, the label `'5'` is mapped to `0`.

In [237]:
SCORE.vocab.stoi

defaultdict(None, {'5': 0, '1': 1, '4': 2, '3': 3, '2': 4})
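
The indices above appear to follow how often each label occurs rather than the label's value, with the most frequent label receiving index 0. The sketch below reproduces that ordering in plain Python. The counts are made up for illustration, and this is not torchtext's own code.

```python
from collections import Counter

# Hypothetical label counts; the real counts come from the training data.
label_counts = Counter({"5": 60, "1": 20, "4": 10, "3": 6, "2": 4})

# Index labels from most frequent to least frequent.
itos = [label for label, _count in label_counts.most_common()]
stoi = {label: index for index, label in enumerate(itos)}
print(stoi)
```

Under this ordering, the index says nothing about the star rating itself, which is why a regression-style target needs a different mapping.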

We can set up the translation that we prefer by setting `stoi` to our own dictionary.

In [192]:
SCORE.vocab.stoi = {'5': 5, '1': 1, '4': 4, '3': 3, '2': 2}

Now, we can batch the data with the bucket iterator.

In [193]:
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 100

train_iter, test_iter = data.BucketIterator.splits(
    (train_data, test_data),
    sort_key=lambda x: len(x.text),
    batch_size=BATCH_SIZE,
    device=device)



> Notice that above we are specifying a `sort_key`.  The `BucketIterator` needs to be told how to bucket the data, and here we tell it to group documents by their length, so that documents of similar length end up in the same batch.
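
To see why grouping by length helps, the sketch below compares how much padding two batching strategies need. It is a simplified illustration of the idea only, not the `BucketIterator`'s actual algorithm; the helper names and the toy documents are hypothetical.

```python
def bucket_by_length(docs, batch_size):
    """Sort documents by length, then cut the sorted list into batches."""
    ordered = sorted(docs, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_needed(batch):
    """Count the pad tokens required to bring every document to the longest length."""
    longest = max(len(doc) for doc in batch)
    return sum(longest - len(doc) for doc in batch)


# Hypothetical tokenized documents of lengths 2, 9, 3, and 8.
docs = [["a"] * 2, ["b"] * 9, ["c"] * 3, ["d"] * 8]

# Batches taken in file order versus batches grouped by length.
unsorted_batches = [docs[0:2], docs[2:4]]
bucketed_batches = bucket_by_length(docs, batch_size=2)

unsorted_padding = sum(padding_needed(b) for b in unsorted_batches)
bucketed_padding = sum(padding_needed(b) for b in bucketed_batches)
print(unsorted_padding, bucketed_padding)
```

With these toy lengths, batching in file order needs 12 pad tokens while grouping by length needs only 2.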

Now let's select the first batch from our `train_iter`.

In [194]:
for batch in train_iter:
    first_batch = batch
    break



If we take a look at the data, it seems like it was numericalized properly.

In [195]:
first_batch.text

tensor([[ 62,   3,  32,  ..., 145,   3,   3],
        [ 11,  64,  88,  ...,   8, 293, 717],
        [  8,  60,  26,  ..., 721,  16, 101],
        ...,
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1],
        [  1,   1,   1,  ...,   1,   1,   1]])

In [231]:
first_doc_in_batch = first_batch.text[:, 0]

In [232]:
[TEXT.vocab.itos[i] for i in first_doc_in_batch][:15]

['This',
 'is',
 'a',
 'great',
 'product',
 '.',
 ' ',
 'Although',
 'I',
 "'d",
 'initially',
 'planned',
 'to',
 'buy',
 'a']
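
The slice `first_batch.text[:, 0]` picks out one document because each column of the tensor holds one document and each row holds one token position. The rows of 1s at the bottom look like padding added to the shorter documents. A plain-Python sketch of that layout follows, assuming a pad index of 1 and using made-up token indices; `pad_and_transpose` is a hypothetical helper, not a torchtext function.

```python
PAD_INDEX = 1  # assumed to match the rows of 1s at the bottom of the tensor above


def pad_and_transpose(docs, pad_index=PAD_INDEX):
    """Pad every document to the longest length, then return one row per position."""
    longest = max(len(doc) for doc in docs)
    padded = [doc + [pad_index] * (longest - len(doc)) for doc in docs]
    # zip(*padded) turns [document][position] into [position][document].
    return [list(row) for row in zip(*padded)]


# Hypothetical numericalized documents of lengths 3, 2, and 1.
docs = [[62, 11, 8], [3, 64], [32]]
batch = pad_and_transpose(docs)
print(batch)

# Reading down column 0 recovers the first document, like first_batch.text[:, 0].
first_doc = [row[0] for row in batch]
print(first_doc)
```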

We can also look at the scores in the batch.

In [221]:
first_batch.score

tensor([4., 5., 2., 1., 5., 5., 5., 5., 1., 3., 5., 1., 3., 1., 1., 4., 5., 1.,
        5., 5., 5., 4., 4., 5., 1., 5., 1., 5., 5., 5., 4., 5., 5., 5., 5., 5.,
        4., 2., 3., 4., 5., 5., 5., 4., 5., 1., 5., 5., 5., 4., 1., 5., 5., 5.,
        4., 1., 5., 5., 1., 5., 1., 5., 5., 5., 5., 5., 5., 3., 5., 5., 5., 5.,
        5., 5., 1., 3., 4., 1., 5., 4., 1., 5., 5., 3., 5., 5., 5., 1., 1., 1.,
        3., 1., 1., 5., 5., 5., 5., 5., 5., 1.])

### Working with JSON

Working with json occurs in a similar manner.  One difference is how we specify the fields.  With json, `fields` is a dictionary: each key is the name of a key in the json data, and each value is the same `(attribute name, field)` tuple we used for csv.  When we do not wish to include a key from the json, we can simply leave it out of the dictionary.

In [233]:
# fields = {'score': ('score', SCORE), 'text': ('text', TEXT)}
# train_data, test_data = data.TabularDataset.splits(
#                             path = 'data',
#                             train = 'train.json',
#                             test = 'test.json',
#                             format = 'json',
#                             fields = fields
# )
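
As far as I know, the json format here expects one JSON object per line (often called JSON Lines) rather than a single top-level array, so it is worth checking the files before loading them. The sketch below writes and reads such a file with the standard library only. The records, the temporary path, and the `wanted` list are hypothetical stand-ins for the real data and the `fields` dictionary.

```python
import json
import os
import tempfile

# Hypothetical records; "summary" plays the role of a key we do not want.
records = [
    {"score": "1", "text": "Must admit the taste", "summary": "Switched"},
    {"score": "5", "text": "I love this stuff!", "summary": "WOW!!"},
]

# Write one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "train.json")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back line by line, keeping only the keys we asked for.
wanted = ["score", "text"]
examples = []
with open(path, encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        examples.append({key: record[key] for key in wanted})
print(examples[0])
```

Keys that are not listed, such as `summary` here, are simply ignored, which mirrors leaving a key out of the `fields` dictionary.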

### Summary

In this lesson, we saw how to work with a custom dataset with torchtext.  We began by initializing each field that we would like to use.  Our field object specifies how to tokenize the data.

```python
from torchtext import data
import torch
TEXT = data.Field(tokenize='spacy')
SCORE = data.LabelField(dtype = torch.float)
```

Then, we declare a list of tuples where we specify the name of the attribute we would like our field to be stored as, and the Field object used to process it.

```python
fields = [('score', SCORE), (None, None), (None, None), ('text', TEXT)]
```

When working with json, this collection of fields is a dictionary.

```python
fields = {'score': ('score', SCORE), 'text': ('text', TEXT)}
```

Then, we retrieve the data with `TabularDataset.splits`.

```python
train_data, test_data = data.TabularDataset.splits(
    path = 'data',
    train = 'coconut_reviews.csv',
    test = 'coconut_reviews.csv',
    format = 'csv',
    fields = fields,
    skip_header = True
)
```

We then build the vocabulary used to numericalize the data with `build_vocab`, and batch the data with the `BucketIterator` as we have done previously.