# Finding Data Block Nirvana (a journey through the fastai data block API)

This notebook illustrates how to create a custom `ItemList` for use in the fastai data block API.  It is heavily annotated to show how the different parts of the API interact, what is happening at each step, and why.

Please consult the [fastai docs](https://docs.fast.ai/) for installing required packages and setting up your environment to run the code below.

The accompanying Medium article, which highlights the data block API mechanics based on my work here, can be found [here](https://medium.com/@wgilliam/finding-data-block-nirvana-a-journey-through-the-fastai-data-block-api-c38210537fe4).

## Yelp Dataset

This example utilizes a subset of the Yelp review dataset that I've made available as part of the code repo to illustrate how my `MixedTabularList` works with a pandas DataFrame containing categorical, continuous, and numericalized text data.  The full dataset and documentation can be found by following the links below.

Available from https://www.yelp.com/dataset/download  
Documentation here:  https://www.yelp.com/dataset/documentation/main  
More information here:  https://www.yelp.com/dataset

Unzip `joined_sample.zip` and place the extracted .csv file in a `data/yelp_dataset` folder relative to this notebook and you should be good to go.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
import pdb

from fastai.tabular import *
from fastai.text import *
from fastai.text.data import _join_texts

print(f'fastai version: {__version__}')  #=> I test this against 1.0.39

fastai version: 1.0.39


In [3]:
torch.cuda.set_device(1)
print(f'using GPU: {torch.cuda.current_device()}')

using GPU: 1


## Configuration and utility methods

In [4]:
PATH=Path('data/yelp_dataset/')
# PATH.ls()

## Define ItemBase subclass

`ItemBase` defines the inputs for your custom dataset, the X and optionally y values you are going to feed into the `forward` function of your PyTorch model.  Here we define what an input item looks like (we'll let fastai infer the `ItemBase` type to use based on our target values).

If your custom `ItemBase` needs to have some kind of data augmentation applied to it, you should overload the `apply_tfms` method as needed.  This method will be called when you apply a `transform` block via the Data Block API.
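
As a rough, framework-free sketch of the idea, an item bundles the values that will be fed to the model in `.data` together with a readable representation, and `apply_tfms` returns a transformed item. The `ToyItem` class and the `truncate_text` transform below are hypothetical stand-ins made up for illustration; they are not fastai's actual `ItemBase`.

```python
# Hypothetical stand-in for an ItemBase-like object; this is NOT fastai code.
class ToyItem:
    def __init__(self, cats, conts, text_ids, text):
        # `.data` is what would eventually be collated into a batch for the model
        self.data = [list(cats), list(conts), list(text_ids)]
        self.text = text

    def apply_tfms(self, tfms):
        # apply each transform (a plain callable here) to the item, in order
        item = self
        for tfm in tfms:
            item = tfm(item)
        return item

    def __str__(self):
        cats, conts, _ = self.data
        return f'cats: {cats}; conts: {conts}; Text: {self.text}'


def truncate_text(item, max_len=3):
    # example "augmentation": keep only the first `max_len` token ids
    cats, conts, ids = item.data
    return ToyItem(cats, conts, ids[:max_len], item.text)


item = ToyItem([3, 1], [0.5], [2, 4, 59, 70, 8], 'xxbos xxmaj service from .')
short = item.apply_tfms([truncate_text])
print(short.data)
```

The original item is left untouched here; whether a real transform should mutate or copy is a design choice this sketch does not settle.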

In [5]:
class MixedTabularLine(TabularLine):
    "Items that include both tabular data (`conts` and `cats`) and textual data (numericalized `ids`)"
    
    def __init__(self, cats, conts, cat_classes, col_names, txt_ids, txt_cols, txt_string):
        # tabular
        super().__init__(cats, conts, cat_classes, col_names)

        # add the text bits
        self.text_ids = txt_ids
        self.text_cols = txt_cols
        self.text = txt_string
        
        # append numericalized text data to your input (represents your X values that are fed into your model)
        # self.data = [tensor(cats), tensor(conts), tensor(txt_ids)]
        self.data += [ np.array(txt_ids, dtype=np.int64) ]
        self.obj = self.data
        
    def __str__(self):
        res = super().__str__() + f'Text: {self.text}'
        return res

## Define custom Processor, DataBunch, and utility methods

Our custom `ItemList` is going to require a custom `PreProcessor` and a custom `DataBunch`, so we define them here.

In [6]:
class MixedTabularProcessor(TabularProcessor):
    
    def __init__(self, ds:ItemList=None, procs=None, 
                 tokenizer:Tokenizer=None, chunksize:int=10000,
                 vocab:Vocab=None, max_vocab:int=60000, min_freq:int=2):
        #pdb.set_trace()
        super().__init__(ds, procs)
    
        self.tokenizer, self.chunksize = ifnone(tokenizer, Tokenizer()), chunksize
        
        vocab = ifnone(vocab, ds.vocab if ds is not None else None)
        self.vocab, self.max_vocab, self.min_freq = vocab, max_vocab, min_freq
        
    # process a single item in a dataset
    # NOTE: THIS METHOD HAS NOT BEEN TESTED AT THIS POINT (WILL COVER IN A FUTURE ARTICLE)
    def process_one(self, item):
        # process tabular data (copied from tabular.data)
        df = pd.DataFrame([item, item])
        for proc in self.procs: proc(df, test=True)
            
        if len(self.cat_names) != 0:
            codes = np.stack([c.cat.codes.values for n,c in df[self.cat_names].items()], 1).astype(np.int64) + 1
        else: 
            codes = [[]]
            
        if len(self.cont_names) != 0:
            conts = np.stack([c.astype('float32').values for n,c in df[self.cont_names].items()], 1)
        else: 
            conts = [[]]
            
        classes = None
        col_names = list(df[self.cat_names].columns.values) + list(df[self.cont_names].columns.values)
        
        # process textual data
        if len(self.text_cols) != 0:
            txt = _join_texts(df[self.text_cols].values, (len(self.text_cols) > 1))
            txt_toks = self.tokenizer._process_all_1(txt)[0]
            text_ids = np.array(self.vocab.numericalize(txt_toks), dtype=np.int64)
        else:
            txt_toks, text_ids = None, [[]]
            
        # return ItemBase
        return MixedTabularLine(codes[0], conts[0], classes, col_names, text_ids, self.text_cols, txt_toks)
    
    # processes the entire dataset
    def process(self, ds):
        #pdb.set_trace()
        # process tabular data and then set "preprocessed=False" since we still have text data possibly
        super().process(ds)
        ds.preprocessed = False
        
        # process text data from column(s) containing text
        if len(ds.text_cols) != 0:
            texts = _join_texts(ds.xtra[ds.text_cols].values, (len(ds.text_cols) > 1))

            # tokenize (set = .text)
            tokens = []
            for i in progress_bar(range(0, len(ds), self.chunksize), leave=False):
                tokens += self.tokenizer.process_all(texts[i:i+self.chunksize])
            ds.text = tokens

            # set/build vocab
            if self.vocab is None: self.vocab = Vocab.create(ds.text, self.max_vocab, self.min_freq)
            ds.vocab = self.vocab
            ds.text_ids = [ np.array(self.vocab.numericalize(toks), dtype=np.int64) for toks in ds.text ]
        else:
            ds.text, ds.vocab, ds.text_ids = None, None, []
            
        ds.preprocessed = True
        

In [7]:
# similar to "fastai.text.data.pad_collate" except that it is designed to work with MixedTabularLine items,
# where the final thing in an item is the numericalized text ids.
# we need a collate function to ensure a rectangular matrix of text ids, which will be of variable length.
def mixed_tabular_pad_collate(samples:BatchSamples, 
                              pad_idx:int=1, pad_first:bool=True) -> Tuple[LongTensor, LongTensor]:
    "Function that collects samples and adds padding."

    samples = to_data(samples)
    max_len = max([len(s[0][-1]) for s in samples])
    res = torch.zeros(len(samples), max_len).long() + pad_idx
   
    for i,s in enumerate(samples):
        if pad_first: 
            res[i,-len(s[0][-1]):] = LongTensor(s[0][-1])
        else:         
            res[i,:len(s[0][-1])] = LongTensor(s[0][-1])
            
        # replace the text_ids array (the last thing in the inputs) with the padded tensor matrix
        s[0][-1] = res[i]
              
    # for the inputs, return a list containing 3 elements: a list of cats, a list of conts, and a list of text_ids
    return [x for x in zip(*[s[0] for s in samples])], tensor([s[1] for s in samples])
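
To see what the padding step does in isolation, here is a minimal pure-Python sketch. The `pad_ids` helper is hypothetical and works on plain lists rather than tensors; it only mirrors the `pad_idx` and `pad_first` behavior of the collate function above.

```python
def pad_ids(seqs, pad_idx=1, pad_first=True):
    "Pad variable-length id lists to the length of the longest one."
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        padding = [pad_idx] * (max_len - len(s))
        # pad_first=True puts the padding in front of the real tokens
        padded.append(padding + list(s) if pad_first else list(s) + padding)
    return padded

batch = [[2, 4, 59], [2, 12], [2, 4, 29, 43, 8]]
print(pad_ids(batch))                   # padding goes in front
print(pad_ids(batch, pad_first=False))  # padding goes at the end
```

Every row ends up with the length of the longest sequence in that batch, which is why the padded width can differ from one batch to the next.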

In [8]:
# each "ds" is of type LabelList(Dataset)
class MixedTabularDataBunch(DataBunch):
    @classmethod
    def create(cls, train_ds, valid_ds, test_ds=None, path:PathOrStr='.', bs=64, 
               pad_idx=1, pad_first=True, no_check:bool=False, **kwargs) -> DataBunch:
        
        # only thing we're doing here is setting the collate_fn = to our new "pad_collate" method above
        collate_fn = partial(mixed_tabular_pad_collate, pad_idx=pad_idx, pad_first=pad_first)
        
        return super().create(train_ds, valid_ds, test_ds, path=path, bs=bs, num_workers=1,
                              collate_fn=collate_fn, no_check=no_check, **kwargs)

## Define ItemList subclass

An `ItemList` consists of a set of `ItemBase` objects. Once created, you can use any of the splitting or labeling methods prior to creating a `DataBunch` for training.

You'll likely want to set the following class variables to something specific to your situation:

**`_bunch`**:  
The name of the class used to create a `DataBunch`.  `TabularList` uses the default `DataBunch` as is and so does not set this variable. We create a custom `DataBunch` here because we need to add padding to the column with the text ids in order to ensure a rectangular matrix per batch before integrating the text bits with the tabular data.

When you call `databunch()` via the Data Block API, `_bunch.create` will be called passing in the datasets (training, validation and optionally test) defined by your `ItemLists` and returning a set of `DataLoader`s in a `DataBunch` for training.

**`_processor`**:  
A class or list of classes of type `PreProcessor` that will be used to create the default processor for this `ItemList`.

The processors are **called at the end of the labelling** to apply some kind of function on your items. The **default processor of the inputs** can be overridden by passing a `processor` in the kwargs when creating the `ItemList`; the **default processor of the targets** can be overridden by passing a `processor` in the kwargs of the labelling function.

Processors are useful for pre-processing data, and **you also need to save any computed state required for future datasets when `data.export()` is called.**

**`_item_cls`**:   
The name of the class that will be used to create the "items" by default.

**`_label_cls`**:   
The name of the class that will be used to create the labels by default. (**If this variable is set to None, the label class will be guessed** between `CategoryList`, `MultiCategoryList` and `FloatList` depending on the type of the first item. Since we are creating a custom `ItemList` with a very distinct signature, we want to set it to that class)
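
The class variables above act as hooks: the base class refers to them by name, and a subclass swaps in its own classes by overriding them. The toy sketch below shows that pattern only; `ToyList`, `ToyBunch`, and the other names are hypothetical and far simpler than fastai's real `ItemList` and `DataBunch` machinery.

```python
# Toy illustration only; these names are made up and are not fastai classes.
class ToyBunch:
    @classmethod
    def create(cls, train_ds, valid_ds):
        return f'{cls.__name__}(train={len(train_ds)}, valid={len(valid_ds)})'

class PaddedToyBunch(ToyBunch):
    pass

class ToyList:
    _bunch = ToyBunch          # default, used unless a subclass overrides it

    def __init__(self, train_ds, valid_ds):
        self.train_ds, self.valid_ds = train_ds, valid_ds

    def databunch(self):
        # the base class never names a concrete bunch; it defers to `_bunch`
        return self._bunch.create(self.train_ds, self.valid_ds)

class MixedToyList(ToyList):
    _bunch = PaddedToyBunch    # one-line override swaps in the custom class

print(ToyList([1, 2, 3], [4]).databunch())
print(MixedToyList([1, 2, 3], [4]).databunch())
```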



In [32]:
class MixedTabularList(TabularList):
    "A custom `ItemList` that merges tabular data along with textual data"
    
    _item_cls = MixedTabularLine
    _processor = MixedTabularProcessor
    _bunch = MixedTabularDataBunch
    
    def __init__(self, items:Iterator, cat_names:OptStrList=None, cont_names:OptStrList=None, 
                 text_cols=None, vocab:Vocab=None, pad_idx:int=1, 
                 procs=None, **kwargs) -> 'MixedTabularList':
        #pdb.set_trace()
        super().__init__(items, cat_names, cont_names, procs, **kwargs)
        
        self.cols = [] if cat_names is None else cat_names.copy()
        if cont_names: self.cols += cont_names.copy()
        if text_cols: self.cols += text_cols.copy()  # use the argument, not the notebook-level `txt_cols` global
        
        self.text_cols, self.vocab, self.pad_idx = text_cols, vocab, pad_idx
        
        # add any ItemList state into "copy_new" that needs to be copied each time "new()" is called; 
        # your ItemList acts as a prototype for training, validation, and/or test ItemList instances that
        # are created via ItemList.new()
        self.copy_new += ['text_cols', 'vocab', 'pad_idx']
        
        self.preprocessed = False
        
    # defines how to construct an ItemBase from the data in the ItemList.items array
    def get(self, i):
        if not self.preprocessed: 
            return self.xtra.iloc[i][self.cols] if hasattr(self, 'xtra') else self.items[i]
        
        codes = [] if self.codes is None else self.codes[i]
        conts = [] if self.conts is None else self.conts[i]
        text_ids = [] if self.text_ids is None else self.text_ids[i]
        text_string = None if self.text_ids is None else self.vocab.textify(self.text_ids[i])
        
        return self._item_cls(codes, conts, self.classes, self.col_names, text_ids, self.text_cols, text_string)
    
    # this is the method that is called in data.show_batch(), learn.predict() or learn.show_results() 
    # to transform a pytorch tensor back into an ItemBase. 
    # in a way, it does the opposite of calling ItemBase.data. It should take a tensor t and return 
    # the same kind of thing as the get method.
    def reconstruct(self, t:Tensor):
        return self._item_cls(t[0], t[1], self.classes, self.col_names, 
                              t[2], self.text_cols, self.vocab.textify(t[2]))
    
    # tells fastai how to display a custom ItemBase when data.show_batch() is called
    def show_xys(self, xs, ys) -> None:
        "Show the `xs` (inputs) and `ys` (targets)."
        from IPython.display import display, HTML
        
        # show tabular
        display(HTML('TABULAR:<br>'))
        super().show_xys(xs, ys)
        
        # show text
        items = [['text_data', 'target']]
        for i, (x,y) in enumerate(zip(xs,ys)):
            res = []
            res += [' '.join([ f'{tok}({self.vocab.stoi[tok]})' 
                              for tok in x.text.split() if (not self.vocab.stoi[tok] == self.pad_idx) ])]
                
            res += [str(y)]
            items.append(res)
            
        col_widths = [90, 1]
        
        display(HTML('TEXT:<br>'))
        display(HTML(text2html_table(items, (col_widths))))
        
    # tells fastai how to display a custom ItemBase when learn.show_results() is called
    def show_xyzs(self, xs, ys, zs):
        "Show `xs` (inputs), `ys` (targets) and `zs` (predictions)."
        from IPython.display import display, HTML
        
        # show tabular
        super().show_xyzs(xs, ys, zs)
        
        # show text
        items = [['text_data','target', 'prediction']]
        for i, (x,y,z) in enumerate(zip(xs,ys,zs)):
            res = []
            res += [' '.join([ f'{tok}({self.vocab.stoi[tok]})'
                              for tok in x.text.split() if (not self.vocab.stoi[tok] == self.pad_idx) ])]
                
            res += [str(y),str(z)]
            items.append(res)
            
        col_widths = [90, 1, 1]
        display(HTML('<br>' + text2html_table(items, (col_widths))))
    
        
    @classmethod
    def from_df(cls, df:DataFrame, cat_names:OptStrList=None, cont_names:OptStrList=None, 
                text_cols=None, vocab=None, procs=None, **kwargs) -> 'ItemList':
        
        return cls(items=range(len(df)), cat_names=cat_names, cont_names=cont_names, 
                   text_cols=text_cols, vocab=vocab, procs=procs, xtra=df, **kwargs)
    
    

## Fetch joined yelp reviews (includes business and user info)

In [33]:
joined_df = pd.read_csv(PATH/'joined_sample.csv', index_col=None)

display(len(joined_df))
display(joined_df.head())
display(joined_df.describe().T)

50000

| | business_id | cool | date | funny | review_id | stars | text | useful | user_id | user_average_stars | ... | business_hours | business_is_open | business_latitude | business_longitude | business_name | business_neighborhood | business_postal_code | business_review_count | business_stars | business_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8jpIK1WHmzzbXPaK51GenQ | 1 | 2012-08-08 | 3 | W7wcVRiw5T8TMrmGnxPsxQ | 4 | I've been here at least 10 times ... I like it... | 1 | g6gTSnUKZIxLZPQVrFKscw | 4.14 | ... | {'Tuesday': '6:30-14:30', 'Wednesday': '6:30-1... | 0 | 33.320994 | -111.912682 | Dessie's Cafe | | 85226 | 67 | 3.5 | AZ |
| 1 | wH4Q0y8C-lkq21yf4WWedw | 0 | 2015-01-31 | 0 | emypFL3PJjQBcllPZw_d5A | 5 | Although I had heard of Nekter, mainly from se... | 2 | LAEJWZSvzsfWJ686VOaQig | 5.0 | ... | {'Monday': '6:30-20:0', 'Tuesday': '6:30-20:0'... | 1 | 33.580474 | -111.881062 | Nekter Juice Bar | | 85260 | 59 | 4.0 | AZ |
| 2 | cRMC2eQ9CP6ivhEY8EdaGg | 1 | 2010-09-13 | 0 | 5X5ISEAp6HFTpMd_wlq_9w | 3 | Last week I met up with a highschool friend fo... | 1 | TwilnpgwW43r9-O2AS4PDQ | 3.14 | ... | {'Monday': '12:0-21:0', 'Tuesday': '12:0-21:0'... | 0 | 43.664193 | -79.380196 | Chino Locos | Church-Wellesley Village | M4Y 2C5 | 34 | 3.5 | ON |
| 3 | zunMkZ4U2eVojempQtLngg | 1 | 2014-03-07 | 0 | OGekU1U_wWgV--zL2gEgYw | 4 | A friend and I were driving by and decided to ... | 1 | eITkQlKYsYqOBASP-QS0iQ | 3.72 | ... | {'Monday': '11:0-1:0', 'Tuesday': '11:0-1:0', ... | 0 | 33.639158 | -112.18511 | The Australian AZ | | 85308 | 26 | 2.5 | AZ |
| 4 | 1vLf-v7foAu3tJ7vAEoKdA | 0 | 2014-11-26 | 1 | tTe2cLFmpkLop3wKcT0Zgw | 5 | Our Bulldog LOVES this place and so do we! Won... | 0 | l3okl_UjyNdqRKAzYGdWaA | 2.95 | ... | {'Monday': '7:30-19:0', 'Tuesday': '7:30-19:0'... | 1 | 33.582848 | -111.929296 | Lori's Grooming | | 85254 | 148 | 5.0 | AZ |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cool | 50000.0 | 0.5599 | 2.017199 | 0.0 | 0.0 | 0.0 | 1.0 | 90.0 |
| funny | 50000.0 | 0.47632 | 2.406208 | 0.0 | 0.0 | 0.0 | 0.0 | 388.0 |
| stars | 50000.0 | 3.73374 | 1.452036 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 |
| useful | 50000.0 | 1.34556 | 3.286281 | 0.0 | 0.0 | 0.0 | 2.0 | 212.0 |
| user_average_stars | 50000.0 | 3.740348 | 0.797706 | 1.0 | 3.4 | 3.81 | 4.21 | 5.0 |
| user_compliment_cool | 50000.0 | 33.42034 | 267.086374 | 0.0 | 0.0 | 0.0 | 2.0 | 13014.0 |
| user_compliment_cute | 50000.0 | 1.5728 | 28.521646 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.0 |
| user_compliment_funny | 50000.0 | 33.42034 | 267.086374 | 0.0 | 0.0 | 0.0 | 2.0 | 13014.0 |
| user_compliment_hot | 50000.0 | 23.13954 | 225.898754 | 0.0 | 0.0 | 0.0 | 1.0 | 12390.0 |
| user_compliment_list | 50000.0 | 0.98766 | 20.821018 | 0.0 | 0.0 | 0.0 | 0.0 | 2259.0 |


## Use and test our MixedTabularList ItemList with the Data Block API

In [34]:
cat_cols = ['business_id', 'user_id', 'business_stars', 'business_postal_code', 'business_state']
cont_cols = ['useful', 'user_average_stars', 'user_review_count', 'business_review_count']
txt_cols = ['text']

dep_var = ['stars']

procs = [FillMissing, Categorify, Normalize]

**Step 1: Define the source of your inputs**

In [35]:
il = MixedTabularList.from_df(joined_df, cat_cols, cont_cols, txt_cols, vocab=None, procs=procs, path=PATH)

In [36]:
print(f'CATS:\n{il.cat_names}')
print(f'CONTS:\n{il.cont_names}')
print(f'TEXT COLS:\n{il.text_cols}')
print(f'PROCS:\n{il.procs}')
print('')
print(il.get(0))

CATS:
['business_id', 'user_id', 'business_stars', 'business_postal_code', 'business_state']
CONTS:
['useful', 'user_average_stars', 'user_review_count', 'business_review_count']
TEXT COLS:
['text']
PROCS:
[<class 'fastai.tabular.transform.FillMissing'>, <class 'fastai.tabular.transform.Categorify'>, <class 'fastai.tabular.transform.Normalize'>]

business_id                                         8jpIK1WHmzzbXPaK51GenQ
user_id                                             g6gTSnUKZIxLZPQVrFKscw
business_stars                                                         3.5
business_postal_code                                                 85226
business_state                                                          AZ
useful                                                                   1
user_average_stars                                                    4.14
user_review_count                                                       26
business_review_count                              

**Step 2: Split your dataset into training and validation `ItemList`s**

This triggers a call to `ItemList.new()` for each `ItemList` that needs to be created (e.g., train, validation).  Here it will be called 2x, once to create the training dataset and once to create the validation dataset.
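
The split itself is conceptually simple. The sketch below is a hypothetical pure-Python version of a seeded percentage split over item indices; it is not fastai's implementation, so the particular indices chosen will differ from what `random_split_by_pct` picks, but the sizes for 50,000 items and `valid_pct=0.1` come out the same.

```python
import random

def toy_random_split(n_items, valid_pct=0.1, seed=42):
    "Shuffle the item indices and cut off the first `valid_pct` for validation."
    rng = random.Random(seed)       # seeded so the split is reproducible
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]   # (train indices, validation indices)

train_idxs, valid_idxs = toy_random_split(50000, valid_pct=0.1, seed=42)
print(len(train_idxs), len(valid_idxs))
```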

In [37]:
ils = il.random_split_by_pct(valid_pct=0.1, seed=42)


In [38]:
len(ils.train), len(ils.valid), ils.path

(45000, 5000, PosixPath('data/yelp_dataset'))

**Step 3: Add your labels (your targets or "y" values)**

This will grab the targets (the "y") for each `ItemList` in your `ItemLists` object (e.g., `.train`, `.valid`) and build a `LabelList(Dataset)` for each; these are then combined and returned in a `LabelLists` object.

You'll notice that the processor is created 1x but that .process is called 2x.  *Why?* So that the preprocessing defined by the training data is applied to the validation and optionally the test data later on.
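
A toy example may make the "created 1x, called 2x" point concrete. `ToyTextProcessor` below is a hypothetical, much simplified processor (whitespace tokenization, a made-up `xxunk` slot at index 0): the first call builds and stores the vocab, and the second call reuses that stored state instead of rebuilding it.

```python
from collections import Counter

class ToyTextProcessor:
    "Hypothetical processor: builds a vocab once, then reuses it."
    def __init__(self, min_freq=2):
        self.min_freq, self.itos = min_freq, None

    def process(self, texts):
        tokens = [t.split() for t in texts]
        if self.itos is None:
            # first call (training data): compute and remember the state
            freq = Counter(tok for toks in tokens for tok in toks)
            kept = sorted(tok for tok, c in freq.items() if c >= self.min_freq)
            self.itos = ['xxunk'] + kept
        stoi = {tok: i for i, tok in enumerate(self.itos)}
        # tokens missing from the vocab map to index 0 ('xxunk')
        return [[stoi.get(tok, 0) for tok in toks] for toks in tokens]

proc = ToyTextProcessor()                          # created once
train_ids = proc.process(['good food', 'good service', 'bad food'])
valid_ids = proc.process(['good wine'])            # reuses the training vocab
print(proc.itos, train_ids, valid_ids)
```

Because "wine" never appeared in the training texts, it falls back to the unknown index in the validation data rather than getting a new id.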

In [39]:
ll = ils.label_from_df(dep_var)

In [40]:
type(ll), type(ll.train), len(ll.lists)

(fastai.data_block.LabelLists, fastai.data_block.LabelList, 2)

In [41]:
ll.train

LabelList
y: CategoryList (45000 items)
[Category 4, Category 5, Category 3, Category 4, Category 2]...
Path: data/yelp_dataset
x: MixedTabularList (45000 items)
[MixedTabularLine business_id 8jpIK1WHmzzbXPaK51GenQ; user_id g6gTSnUKZIxLZPQVrFKscw; business_stars 3.5; business_postal_code 85226; business_state AZ; useful -0.1049; user_average_stars 0.4985; user_review_count -0.2776; business_review_count -0.3911; Text: xxbos i 've been here at least 10 times ... i like it ... but its not my favorite . i always get the spanish omelette egg whites without chorizo because i do n't eat meat . xxmaj once the chef forgot and put it in and i could hear him swearing from my table when he had to remake it . xxmaj the wait staff are all very friendly , but seem a bit overwhelmed keeping up with drink refills . xxmaj once my omelette came out scrambled instead of an omelette because the chef says its too hard to make an omelette out of egg whites ... which i 've gotten the other 9 times . xxmaj th

In [42]:
ll.train.x[0], ll.train.y[0], ll.train.x.codes[0], ll.train.x.cat_names, ll.train.x.text_ids[0]

(MixedTabularLine business_id 8jpIK1WHmzzbXPaK51GenQ; user_id g6gTSnUKZIxLZPQVrFKscw; business_stars 3.5; business_postal_code 85226; business_state AZ; useful -0.1049; user_average_stars 0.4985; user_review_count -0.2776; business_review_count -0.3911; Text: xxbos i 've been here at least 10 times ... i like it ... but its not my favorite . i always get the spanish omelette egg whites without chorizo because i do n't eat meat . xxmaj once the chef forgot and put it in and i could hear him swearing from my table when he had to remake it . xxmaj the wait staff are all very friendly , but seem a bit overwhelmed keeping up with drink refills . xxmaj once my omelette came out scrambled instead of an omelette because the chef says its too hard to make an omelette out of egg whites ... which i 've gotten the other 9 times . xxmaj they are a mom and pop restaurant and i sometimes think they feel like it 's ok to do it there way rather then the customers way . 
 
  xxmaj all in all though , wi

In [43]:
len(ll.train.x.vocab.itos), len(ll.valid.x.vocab.itos)

(25217, 25217)

**Step 6: Build your DataBunch**

We're skipping steps 4 (add a test dataset) and 5 (apply data augmentation) since we have neither a test set nor any transforms we need to apply to the data.

In [44]:
data_bunch = ll.databunch(bs=64)
b = data_bunch.one_batch()
len(b), len(b[0]), len(b[0][0]), len(b[0][1]), len(b[0][2]), b[1].shape

(2, 3, 64, 64, 64, torch.Size([64]))

`len(b) = 2`:  the inputs and the targets

`len(b[0]) = 3`: the three things in the input (cats, conts, text_ids)

`len(b[0][0|1|2]) = 64`: there are 64 of each of the 3 things (so there is a list of 64 categorical tensors, followed by a list of 64 continuous tensors, followed by a list of 64 text tensors)

The shapes of the categorical and continuous tensors are the same for every batch, whereas the length of the numericalized token ids is only guaranteed to be the same *within* a batch, thanks to the `mixed_tabular_pad_collate` function above.  This fulfills the requirement that each of the inputs forms a rectangular matrix per batch.
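
The nesting described above comes from transposing per-sample inputs into per-field groups. Here is a small sketch with made-up values and plain Python lists standing in for tensors; it reuses the same `zip(*...)` idiom as the collate function.

```python
# three toy samples, each shaped like ([cats, conts, text_ids], target)
samples = [
    ([[3, 7], [0.5, -1.0], [1, 2, 4]], 5),
    ([[1, 2], [0.1,  0.3], [2, 4, 9]], 1),
    ([[8, 4], [1.2, -0.4], [1, 1, 2]], 3),
]

# transpose the per-sample inputs into per-field groups
xs = [field for field in zip(*[s[0] for s in samples])]
ys = [s[1] for s in samples]

print(len(xs))                  # number of input fields
print([len(f) for f in xs])     # one entry per sample in each field
print(xs[0])                    # all categorical codes in the batch
```

With a batch size of 3 instead of 64, `len(xs)` is still 3 (cats, conts, text ids) and each field holds one entry per sample.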

In [45]:
b[0][0][0], b[0][1][0], b[0][2][0]

(tensor([21069, 15870,     9,   498,    13]),
 tensor([ 0.1975, -1.0674,  0.2514, -0.2308]),
 tensor([    1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
             1,     2,     4,   328,    37,    12,   164,    14,    45,   800,
            78,    29,    10,  4725,     9,   155,   902,    39,    49,   728,
           933,  1591,    10,    25,  

The above shows the categorical, continuous, and token ids for the first item in the batch.

In [46]:
data_bunch.show_batch()

| business_id | user_id | business_stars | business_postal_code | business_state | useful | user_average_stars | user_review_count | business_review_count | target |
|---|---|---|---|---|---|---|---|---|---|
| 2NiBvT5zL272IRcxru_x9A | WVbzw3IPJ29PWOsI2iESSw | 4.0 | 85054 | AZ | -0.4074 | 0.9495 | -0.3383 | -0.1494 | 5 |
| vo8rCTuhM19GhpY07VtXpw | WVVkGFSoatEZWU2oGdn4fQ | 2.0 | 85016 | AZ | -0.1049 | -1.7689 | -0.3475 | -0.4712 | 1 |
| wghDrzcZ0VloAtaIZ7GEBg | q9yxse9JxjhnEJy42leUGg | 4.5 | 85016 | AZ | -0.4074 | 0.4359 | -0.3231 | -0.2028 | 4 |
| 3dw6xhzG08htY5HcL2OjeA | TzRbkwSLFym6pgPJKYT5xw | 3.5 | 85212 | AZ | -0.1049 | 0.5987 | -0.2897 | -0.2397 | 3 |
| IWDA6Tp8aFIaIqJJnZF0oA | 0k4ZY5M55ceFdbw74AS_kA | 2.0 | 44107 | OH | -0.1049 | 0.1728 | -0.3323 | -0.4572 | 2 |


text_data,target
"xxbos(2) xxmaj(4) service(59) from(70) xxmaj(4) coco(5707) was(15) top(304) notch(1248) !(19) xxmaj(4) they(25) know(147) their(69) food(42) and(10) wine(453) parings(22235) -(41) great(50) happy(236) hour(265) ,(11) decent(397) prices(244) for(20) fresh(192) food(42) prepared(767) perfectly(590) .(8) 10-minute(20013) from(70) the(9) xxmaj(4) marriott(5151) desert(1412) ridge(8270) .(8) xxmaj(4) nice(104) outdoor(1194) patio(596) .(8) xxmaj(4) would(65) definitely(131) go(76) again(134) .(8) xxmaj(4) great(50) adult(2402) venue(1476) .(8)",5
"xxbos(2) xxmaj(4) this(29) place(43) will(80) kill(3152) your(87) dog(515) .(8) xxmaj(4) then(139) they(25) wo(390) n't(32) have(33) the(9) decency(9828) to(14) refund(1543) your(87) pet(1555) fees(1935) that(23) you(26) had(35) to(14) pay(352) the(9) entire(592) year(408) living(1234) there(48) .(8) xxmaj(4) if(54) you(26) have(33) a(13) pet(1555) ,(11) do(57) n't(32) live(513) here(58) .(8) xxmaj(4) maintenance(1727) will(80) show(387) up(73) while(181) you(26) are(39) at(40) work(187) unannounced(19606) .(8) xxmaj(4) your(87) dog(515) will(80) escape(3445) because(98) he(77) 's(36) confused(1983) and(10) scared(3007) .(8) xxmaj(4) the(9) apartment(1663) complex(2052) will(80) not(34) tell(354) you(26) why(298) they(25) are(39) calling(1391) you(26) at(40) work(187) until(432) you(26) show(387) up(73) at(40) 5(142) pm(551) and(10) realize(1487) your(87) dog(515) is(18) missing(1268) .(8) xxmaj(4) they(25) will(80) cover(1459) their(69) own(450) a(13) *(475) *(475) before(160) attempting(5914) to(14) find(211) your(87) dog(515) .(8) xxmaj(4) the(9) next(201) day(154) you(26) will(80) find(211) your(87) dog(515) dead(1927) on(31) the(9) side(242) of(17) 16th(7260) street(532) in(21) front(343) of(17) the(9) xxmaj(4) salvation(17185) xxmaj(4) army(9624) .(8) xxmaj(4) you(26) will(80) be(45) heartbroken(23334) and(10) they(25) will(80) be(45) unforgivable(20204) during(368) the(9) process(889) .(8) xxmaj(4) do(57) n't(32) live(513) here(58)",1
"xxbos(2) ""(56) xxmaj(4) interesting(861) ""(56) because(98) it(16) starts(2253) out(55) pretty(149) tame(12263) ,(11) but(30) if(54) you(26) 're(162) going(148) to(14) the(9) top(304) of(17) the(9) trail(2748) ,(11) you(26) 're(162) in(21) for(20) a(13) hike(3365) !(19) xxmaj(4) you(26) 'll(212) be(45) xxunk(0) at(40) the(9) top(304) !(19) !(19)",4
"xxbos(2) xxmaj(4) once(303) again(134) the(9) food(42) made(137) up(73) for(20) the(9) bad(206) customer(210) service(59) .(8) xxmaj(4) how(140) are(39) you(26) supposed(1119) to(14) eat(191) snow(2440) crab(654) without(346) a(13) way(151) to(14) get(66) it(16) out(55) of(17) the(9) shell(2057) ?(100) xxmaj(4) we(24) skipped(4664) the(9) crab(654) this(29) time(63) because(98) the(9) waitress(377) said(152) the(9) ""(56) health(1462) dept(6129) .(8) ""(56) wo(390) n't(32) allow(1978) them(96) to(14) use(339) the(9) scissors(8965) anymore(1401) but(30) you(26) could(103) use(339) a(13) plastic(1830) fork(3002) to(14) process(889) your(87) crab(654) legs(1747) ...(85) later(399) we(24) were(38) told(186) by(94) a(13) more(93) helpful(384) waiter(537) that(23) they(25) were(38) just(62) out(55) of(17) them(96) because(98) it(16) 's(36) busy(331) ((53) xxmaj(4) fri(6065) night(175) )(51) understandable(3671) but(30) not(34) acceptable(2873) xxmaj(4) so(37) a(13) lie(2518) /(110) incompetence(10430) in(21) the(9) back(72) cost(631) the(9) restaurant(125) 36(4795) bucks(1267) as(46) three(422) of(17) our(61) party(478) got(97) meals(879) 1(274) /(110) 2(150) of(17) what(81) the(9) crab(654) cost(631) and(10) the(9) second(428) strike(4745) of(17) not(34) coming(290) back(72) .(8) xxup(5) and(10) another(199) big(250) fail(3117) was(15) that(23) one(67) meal(222) came(114) out(55) nice(104) and(10) quick(371) ((53) the(9) shrimp(430) )(51) but(30) the(9) other(102) three(422) in(21) the(9) party(478) were(38) a(13) full(300) 25(948) minutes(177) later(399) xxrep(6) 4(117) .(8) and(10) it(16) was(15) our(61) fault(1716) the(9) fryer(6066) was(15) backed(4267) up(73) apparently(1223) ...(85) xxmaj(4) waiting(394) a(13) couple(379) of(17) weeks(675) but(30) xxmaj(4) three(422) strikes(8083) and(10) you(26) 're(162) out(55) xxrep(6) 4(117) .(8)",3
"xxbos(2) this(29) location(188) is(18) always(116) late(538) on(31) orders(682) .(8) i(12) placed(1328) an(75) order(119) and(10) was(15) to(14) 15(445) -(41) 20(402) minutes(177) .(8) i(12) waited(479) until(432) 20(402) minutes(177) until(432) i(12) left(275) my(22) house(334) ,(11) to(14) give(190) them(96) the(9) benefit(3646) of(17) a(13) couple(379) extra(418) minutes(177) .(8) get(66) there(48) ,(11) pizza(195) was(15) n't(32) even(101) started(439) yet(496) .(8) did(68) n't(32) remember(701) my(22) order(119) i(12) called(273) in(21) .(8) ended(558) up(73) waiting(394) for(20) 30(495) at(40) the(9) restaurant(125) for(20) them(96) to(14) make(141) the(9) 1(274) pizza(195) .(8) xxmaj(4) all(52) in(21) all(52) ,(11) it(16) took(183) over(130) an(75) hour(265) to(14) munch(6693) on(31) some(90) pizza(195) .(8) when(71) it(16) should(234) have(33) only(92) taken(708) 20(402) .(8) xxmaj(4) they(25) get(66) two(155) starts(2253) because(98) the(9) pizza(195) was(15) actually(299) better(153) than(121) expected(698) .(8)",2


Because we included the `Normalize` proc, notice that the continuous variables are normalized *per dataset*: 
`(x - x.mean) / x.std`
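
The formula can be worked through by hand. The sketch below uses made-up values and assumes the sample standard deviation; it also assumes the statistics are computed once on the training values and then reused for the validation values, in line with the note in Step 3 about preprocessing defined by the training data.

```python
from statistics import mean, stdev

train_useful = [0, 0, 1, 2, 7]      # made-up training values
valid_useful = [1, 3]               # made-up validation values

# assumption: mean and std come from the training data only
mu, sigma = mean(train_useful), stdev(train_useful)

def normalize(values):
    # (x - x.mean) / x.std, using the stored training statistics
    return [(v - mu) / sigma for v in values]

print(normalize(train_useful))
print(normalize(valid_useful))
```

After this step the training values have mean 0 and standard deviation 1, which is why the `useful` and review-count columns in the batch above show small positive and negative numbers instead of raw counts.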