Download the *Pets* dataset and store its location in a `Path` object

In [2]:
from fastai.vision.all import *
path = untar_data(URLs.PETS)

In [3]:
path

Path('/home/petewin/.fastai/data/oxford-iiit-pet')

Set `Path.BASE_PATH` to the dataset folder so that paths are displayed relative to it, in this case as `'.'`

From here on, `Path` objects inside the hidden `.fastai/data` folder print as if the dataset folder were our local working directory

In [4]:
Path.BASE_PATH = path

In [5]:
Path.BASE_PATH

Path('.')

In [6]:
path.ls()

(#2) [Path('annotations'),Path('images')]
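The change in how paths are displayed can be approximated with the standard library alone. This is only a sketch of the effect, not fastcore's implementation, and `show` is a hypothetical helper made up for illustration:

```python
from pathlib import PurePosixPath

# The dataset location from the output above
base = PurePosixPath('/home/petewin/.fastai/data/oxford-iiit-pet')

def show(p, base=base):
    """Hypothetical helper: display p relative to base when possible."""
    try:
        return str(p.relative_to(base))
    except ValueError:
        # p is not inside base, so fall back to the full path
        return str(p)

print(show(base))                                 # the base itself
print(show(base / 'images' / 'Birman_115.jpg'))   # a file inside the base
print(show(PurePosixPath('/tmp/other.jpg')))      # a path outside the base
```

Only the printed form changes in this sketch; the underlying path still points at the full location.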

We'll be focusing on classification rather than localization, and so we'll ignore the annotations 

Most functions in fastai return an `L`, which is an enhanced version of the standard Python `list` class

It can do a few extra things:

In [9]:
fname = (path/"images").ls()[:3]
fname

(#3) [Path('images/Birman_115.jpg'),Path('images/leonberger_142.jpg'),Path('images/Bombay_68.jpg')]

Above, it's showing that each item is a `Path` object and that there are `#3` items in this particular list

For longer lists it adds an ellipsis `...` after the first few items rather than trying to print every one

In [10]:
fname = (path/"images").ls()
fname

(#7393) [Path('images/Birman_115.jpg'),Path('images/leonberger_142.jpg'),Path('images/Bombay_68.jpg'),Path('images/japanese_chin_26.jpg'),Path('images/saint_bernard_149.jpg'),Path('images/Ragdoll_41.jpg'),Path('images/japanese_chin_32.jpg'),Path('images/Ragdoll_68.jpg'),Path('images/Persian_202.jpg'),Path('images/scottish_terrier_143.jpg')...]
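To make the display format concrete, here is a small standard-library imitation of it. `short_repr` is a hypothetical helper, and the cutoff of 10 items is an assumption read off the output above rather than taken from the library's source:

```python
def short_repr(items, max_n=10):
    """Hypothetical helper imitating the display style shown above."""
    shown = ','.join(repr(x) for x in items[:max_n])
    # Append an ellipsis only when some items were left out
    tail = '...' if len(items) > max_n else ''
    return f'(#{len(items)}) [{shown}{tail}]'

print(short_repr([1, 2, 3]))
print(short_repr(list(range(7393))))
```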

We need to take note of the filenames here: the breed name can be one word or several words joined by underscores `_`, and it is always followed by another underscore, a number, and `.jpg`

In [11]:
fname = (path/"images").ls()[0]
fname

Path('images/Birman_115.jpg')

This regex captures everything up to the last underscore, then expects a string of digits followed by `.jpg` at the end of the name

In [13]:
re.findall(r'(.+)_\d+.jpg$', fname.name)

['Birman']
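The multi-word names are the reason the greedy `(.+)` matters. A quick check with plain `re`, using filenames from the listing above (note that the `.` before `jpg` is unescaped in this pattern, so it matches any character; `\.` would be stricter):

```python
import re

# Same pattern as above; the greedy (.+) runs up to the LAST underscore
# that is followed by digits and the extension
pat = r'(.+)_\d+.jpg$'

names = ['Birman_115.jpg', 'saint_bernard_149.jpg', 'japanese_chin_26.jpg']
labels = [re.findall(pat, n)[0] for n in names]
print(labels)
```

The underscores inside `saint_bernard` and `japanese_chin` stay in the label because only the final `_<digits>` part is left outside the capture group.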

fastai comes with a built-in `RegexLabeller` class, which we can use in creating our `pets` DataBlock

In [15]:
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(seed=42),
                 get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms=Resize(460),
                 batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = pets.dataloaders(path/"images")

I'm always curious about some of the 'black box' aspects of programming in general. Not that it's really a black box, but I've made plenty of `DataBlock` objects at this point. So let's take a look under the hood

I'm going to make this a dataframe, just so it gives me everything.<br>
`dir` is going to return the names of all the attributes of my `pets` DataBlock object now that it's created

In [26]:
df = pd.DataFrame(dir(pets))
df

| | 0 |
|---|---|
| 0 | `__class__` |
| 1 | `__delattr__` |
| 2 | `__dict__` |
| 3 | `__dir__` |
| 4 | `__doc__` |
| 5 | `__eq__` |
| 6 | `__format__` |
| 7 | `__ge__` |
| 8 | `__getattribute__` |
| 9 | `__gt__` |
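Since `dir` sorts the double-underscore names first, the interesting attributes end up far down the list. One way to cut through that is to filter out names starting with an underscore. The sketch below uses a hypothetical stand-in class `Demo` rather than the real `pets` object, so the attribute names are made up for illustration:

```python
class Demo:
    """Hypothetical stand-in for an object such as `pets`."""
    def __init__(self):
        self.get_y = 'labeller'
        self.splitter = 'splitter'
        self._private = 0

obj = Demo()

# Keep only the names that do not start with an underscore
public = [a for a in dir(obj) if not a.startswith('_')]
print(public)

# vars() returns the instance attributes together with their values
print(vars(obj))
```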


In [25]:
pets.get_y

functools.partial(<function _using_attr at 0x7fe902a0c1f0>, <fastai.data.transforms.RegexLabeller object at 0x7fe8f67b2770>, 'name')

There we go. I was looking for what's stored at `get_y`, specifically what `'name'` was doing. It turns out it's just sitting in the partial as a string: the name of the attribute (here the `name` of each `Path`, i.e. the filename) that gets handed to the labeller (duh, but I wanted to see how)
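Going by that `functools.partial` output, the mechanism can be re-created in a few lines. This is a sketch, not fastai's source: `using_attr_sketch`, `_using_attr_sketch` and `label_from_name` are hypothetical names, and `label_from_name` is only a stand-in for `RegexLabeller`:

```python
import re
from functools import partial
from pathlib import PurePosixPath

def _using_attr_sketch(f, attr, x):
    # Look up the attribute named by `attr` on x, then hand it to f
    return f(getattr(x, attr))

def using_attr_sketch(f, attr):
    """Hypothetical re-creation: bind f and attr, leaving x for later."""
    return partial(_using_attr_sketch, f, attr)

def label_from_name(s):
    # Stand-in for RegexLabeller: return the first captured group
    return re.search(r'(.+)_\d+.jpg$', s).group(1)

get_y = using_attr_sketch(label_from_name, 'name')
print(get_y)                                          # a functools.partial
print(get_y(PurePosixPath('images/Birman_115.jpg')))  # label for one file
```

The string `'name'` is stored in the partial and only used later, when an actual path arrives and `getattr(x, 'name')` pulls out the filename.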

`item_tfms` and `batch_tfms` in the above `DataBlock` together implement a data augmentation strategy called *presizing*, which is meant to perform image augmentation while minimizing data destruction and maintaining good performance

#### Presizing

Needs:
- Images to be the same dimensions so they can collate to tensors to head to the GPU
- Minimize the number of distinct augmentation computations we perform

Therefore:
- Where possible, compose augmentation transforms into fewer transforms and make images a uniform size


Complications:
- If we perform augmentations after resizing, we may introduce empty zones, degrade data, or both

Therefore:
- *Presizing*
    - Resize images to dimensions larger than the final training size (here 460 vs. 224)
    - Compose all common augmentation operations (including the final rescale down) into a single one, and perform it only once, at the end of processing
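
The "compose into a single one" idea can be illustrated with plain 2x2 matrices. This is only a sketch of the principle, not how fastai implements it: the rotation angle and zoom factor are made-up values, and only the 460 and 224 sizes come from the `DataBlock` above. The point is that one combined matrix maps coordinates exactly like the three separate steps, so pixels would need to be interpolated once instead of three times:

```python
import math

def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(m, p):
    """Apply a 2x2 matrix to a point (x, y)."""
    x, y = p
    return (m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y)

theta = math.radians(10)                        # illustrative rotation angle
rotate = [[math.cos(theta), -math.sin(theta)],
          [math.sin(theta),  math.cos(theta)]]
zoom = [[1.1, 0.0], [0.0, 1.1]]                 # illustrative zoom factor
shrink = [[224 / 460, 0.0], [0.0, 224 / 460]]   # final rescale 460 -> 224

# Compose once: the rightmost matrix is applied first
combined = matmul(shrink, matmul(zoom, rotate))

p = (100.0, 50.0)
step_by_step = apply(shrink, apply(zoom, apply(rotate, p)))
in_one_go = apply(combined, p)
print(step_by_step, in_one_go)
```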