This is a simple example of a classifier model. I'll briefly explain some parts of the code; the conceptual guide, along with relevant snippets of the code, can be found in my notes *here* (reminder to link to chapter 2).

Our model here is designed to classify cat breeds!

We first do all our initial setup for fastai:

In [None]:
!pip install fastbook

In [None]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
import os
from fastbook import *
from fastai.vision.widgets import *

And then we grab our first category of images. In this case, Siamese cats.

In [None]:
results = search_images_ddg('siamese cats')
ims = results.attrgot('contentUrl')
len(ims)

Let's take a look at one of our cats here: 

In [None]:
#hide
ims = ['https://images.wagwalkingweb.com/media/daily_wag/blog_articles/hero/1678934108.5188236/everything-you-need-to-know-about-siamese-cats.png']


In [None]:
dest = 'images/siamese.jpg'
download_url(ims[0], dest)

In [None]:
im = Image.open(dest)
im.to_thumb(128,128)

This seems to have worked nicely, so let's use fastai's `download_images` to download all the URLs for each of our search terms. We'll put each in a separate folder:

In [None]:
#bear_types = 'grizzly','black','teddy'
cat_types = 'siamese', 'tabby', 'maine coon'

#path = Path('bears')
path = Path('cats')

In [None]:
!pip install duckduckgo_search
from duckduckgo_search import DDGS
import os
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
from fastcore.all import *

ddgs = DDGS()

def search_images(term, max_images=30):
    print(f"Searching for '{term}'")
    # return L(ddg_images(term, max_results=max_images)).itemgot('image')
    return L(ddgs.images(keywords=term, max_results=max_images)).itemgot('image')

if not path.exists():
    path.mkdir()
    print("made path")
for o in cat_types:
    dest = (path/o)
    dest.mkdir(exist_ok=True, parents=True)
    # search once per breed with the helper defined above, then download the results
    download_images(dest, urls=search_images(f'{o} cat'))


Our folder has image files, as we'd expect:

In [None]:
fns = get_image_files(path)
fns

Often when we download files from the internet, there are a few that are corrupt. Let's check:

In [None]:
failed = verify_images(fns)
failed

To remove all the failed images, you can use `unlink` on each of them. Note that, like most fastai functions that return a collection, `verify_images` returns an object of type `L`, which includes the `map` method. This calls the passed function on each element of the collection:

In [None]:
failed.map(Path.unlink);
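
For readers unfamiliar with `L.map`, the cell above behaves roughly like a plain loop over the failed paths. The sketch below uses only the standard library and throwaway files in a temporary directory (the file names are made up), so it can be run without touching the real dataset.

```python
import tempfile
from pathlib import Path

# Create a few throwaway files standing in for corrupt downloads.
tmp_dir = Path(tempfile.mkdtemp())
failed_paths = [tmp_dir / f"broken_{i}.jpg" for i in range(3)]
for p in failed_paths:
    p.write_bytes(b"not really an image")

# Rough equivalent of failed.map(Path.unlink): call the function on each element.
for p in failed_paths:
    Path.unlink(p)

remaining = sorted(tmp_dir.iterdir())
print(remaining)  # [] because every file has been deleted
```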

Now that we have downloaded some data, we need to assemble it in a format suitable for model training. In fastai, that format is a `DataLoaders` object.

## From Data to DataLoaders

The `DataLoaders` class just stores multiple `DataLoader` objects, usually a `train` set and a `valid` set.

```python
class DataLoaders(GetAttr):
    def __init__(self, *loaders): self.loaders = loaders
    def __getitem__(self, i): return self.loaders[i]
    train,valid = add_props(lambda i,self: self[i])
```
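
If the fastcore helpers in that snippet are unfamiliar, the same idea can be written with ordinary properties. This is only a sketch: `SimpleLoaders` is a hypothetical name, it leaves out the attribute delegation that `GetAttr` provides, and plain lists stand in for real `DataLoader` objects.

```python
class SimpleLoaders:
    """Hypothetical stand-in for DataLoaders: holds loaders and names the first two."""

    def __init__(self, *loaders):
        self.loaders = loaders

    def __getitem__(self, i):
        return self.loaders[i]

    @property
    def train(self):
        # First loader is treated as the training set.
        return self[0]

    @property
    def valid(self):
        # Second loader is treated as the validation set.
        return self[1]


# Plain lists stand in for real DataLoader objects here.
dls_sketch = SimpleLoaders(["train batch 1", "train batch 2"], ["valid batch 1"])
print(dls_sketch.train)
print(dls_sketch.valid)
```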


To turn downloaded data into a `DataLoaders` object we need at least four things:

- What kinds of data we are working with
- How to get the list of items
- How to label these items
- How to create the validation set

There are a number of default methods for common data/application pipelines, but a `DataBlock` gives us more control over how a `DataLoaders` object is constructed.

In [None]:
cats = DataBlock(
    blocks=(ImageBlock, CategoryBlock),  # independent variable, then dependent variable
    get_items=get_image_files,  # takes a path, recursively returns a list of image files
    splitter=RandomSplitter(valid_pct=0.2, seed=42),  # hold out 20% for validation; the fixed seed keeps the split reproducible
    get_y=parent_label,  # label each image with the name of its parent folder
    item_tfms=Resize(128))  # standardize the dimensions of the images


This command has given us a `DataBlock` object. This is like a *template* for creating a `DataLoaders`. We still need to tell fastai the actual source of our data—in this case, the path where the images can be found:

In [None]:
dls = cats.dataloaders(path)
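
To make the `splitter` and `get_y` arguments more concrete, here is a rough standard-library sketch of a seeded random split and of labelling by parent folder. The helper names `seeded_split` and `label_from_parent` are made up for illustration, and the file paths are invented; this is not how fastai implements `RandomSplitter` or `parent_label` internally.

```python
import random
from pathlib import Path


def label_from_parent(path):
    """Use the name of the containing folder as the label."""
    return Path(path).parent.name


def seeded_split(items, valid_pct=0.2, seed=42):
    """Shuffle indices with a fixed seed, then hold out the first valid_pct of them."""
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))
    return idxs[cut:], idxs[:cut]  # (train indices, validation indices)


# Invented file list: 10 images in each of three breed folders.
files = [Path("cats") / breed / f"{i}.jpg"
         for breed in ("siamese", "tabby", "maine coon") for i in range(10)]

train_idx, valid_idx = seeded_split(files)
print(len(train_idx), len(valid_idx))  # 24 6
print(label_from_parent(files[0]))     # siamese
```

Because the seed is fixed, calling `seeded_split(files)` again returns exactly the same validation indices, which is the point of passing `seed=42` above.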

A `DataLoaders` includes validation and training `DataLoader`s. `DataLoader` is a class that provides batches of a few items at a time to the GPU. When you loop through a `DataLoader`, fastai will give you 64 (by default) items at a time, all stacked up into a single tensor. We can take a look at a few of those items by calling the `show_batch` method on a `DataLoader`:

In [None]:
dls.valid.show_batch(max_n=6, nrows=1)
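
The batching idea itself is simple. The sketch below shows only the chunking, with integers standing in for images; it ignores shuffling, stacking into a tensor, and moving data to the GPU, and `iter_batches` is a hypothetical helper rather than a fastai function.

```python
def iter_batches(items, bs=64):
    """Yield consecutive chunks of at most bs items."""
    for start in range(0, len(items), bs):
        yield items[start:start + bs]


items = list(range(150))  # stand-ins for 150 images
sizes = [len(batch) for batch in iter_batches(items)]
print(sizes)  # [64, 64, 22]
```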

By default `Resize` *crops* the images to fit a square shape of the size requested, using the full width or height. This can result in losing some important details. Alternatively, you can ask fastai to pad the images with zeros (black), or squish/stretch them:

In [None]:
cats = cats.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = cats.dataloaders(path)
dls.valid.show_batch(max_n=6, nrows=1)

In [None]:
cats = cats.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = cats.dataloaders(path)
dls.valid.show_batch(max_n=6, nrows=1)
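
To see what each method costs, it can help to work through the geometry for one image. The sketch below only computes the size an image is scaled to before cropping or padding; `resize_plan` is a hypothetical helper, not part of fastai, and the 400x200 input is an invented example.

```python
def resize_plan(w, h, size, method):
    """Return the (width, height) the image is scaled to before any cropping or padding."""
    if method == "crop":
        scale = size / min(w, h)   # shorter side fills the square, the rest is cut off
    elif method == "pad":
        scale = size / max(w, h)   # longer side fills the square, the rest is padded
    elif method == "squish":
        return size, size          # both sides forced to size, aspect ratio changes
    else:
        raise ValueError(f"unknown method: {method}")
    return round(w * scale), round(h * scale)


for method in ("crop", "pad", "squish"):
    print(method, resize_plan(400, 200, 128, method))
# crop (256, 128)   -> 128 of the 256 columns are discarded
# pad (128, 64)     -> 64 rows of padding are added
# squish (128, 128) -> a 2:1 image is distorted to 1:1
```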

Depending on the specifications of our model, we can replace `Resize` with `RandomResizedCrop`, which crops a different random part of the image each time. It takes a `min_scale` parameter, which determines the minimum fraction of the image to select each time:

In [None]:
cats = cats.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = cats.dataloaders(path)
dls.train.show_batch(max_n=6, nrows=1, unique=True)

We used `unique=True` to have the same image repeated with different versions of this `RandomResizedCrop` transform. 
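
As a rough picture of what `min_scale` controls, the sketch below picks a random crop box whose area is at least `min_scale` of the original image. It is a simplification: `random_crop_box` is a hypothetical helper that keeps the original aspect ratio fixed and skips the final resize to the target size.

```python
import math
import random


def random_crop_box(w, h, min_scale=0.3, rng=random):
    """Pick (left, top, width, height) covering a random fraction >= min_scale of the image."""
    frac = rng.uniform(min_scale, 1.0)            # fraction of the area to keep
    cw = min(w, round(w * math.sqrt(frac)))       # same aspect ratio as the input
    ch = min(h, round(h * math.sqrt(frac)))
    left = rng.randint(0, w - cw)                 # random position that still fits
    top = rng.randint(0, h - ch)
    return left, top, cw, ch


rng = random.Random(0)  # fixed seed so the sketch is repeatable
boxes = [random_crop_box(400, 300, min_scale=0.3, rng=rng) for _ in range(100)]
print(boxes[:3])
```

A larger `min_scale` means each crop keeps more of the picture, so the crops vary less from one epoch to the next.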

A standard set of augmentations that generally work well is provided by the `aug_transforms` function. Because our images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms on a batch, we use the `batch_tfms` parameter (note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason):

In [None]:
cats = cats.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
#mult is multiplier, or the extent of our aug_transforms
dls = cats.dataloaders(path)
dls.train.show_batch(max_n=6, nrows=2, unique=True)
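
The ordering of the two kinds of transform can be sketched with lists of numbers standing in for images. This is only an analogy: `item_tfm` and `batch_tfm` are hypothetical functions, the item step plays the role of `Resize` (making everything the same size), and the batch step plays the role of an augmentation applied to the whole collated batch at once.

```python
def item_tfm(seq, size=4):
    """Per-item step: trim or zero-pad each sequence to a common length."""
    return (seq + [0] * size)[:size]


def batch_tfm(batch, scale=2):
    """Per-batch step: one operation applied across the whole collated batch."""
    return [[scale * x for x in row] for row in batch]


raw_items = [[1, 2], [3, 4, 5, 6, 7], [8]]       # items of different sizes
collated = [item_tfm(seq) for seq in raw_items]  # now every row has length 4
augmented = batch_tfm(collated)
print(collated)   # [[1, 2, 0, 0], [3, 4, 5, 6], [8, 0, 0, 0]]
print(augmented)  # [[2, 4, 0, 0], [6, 8, 10, 12], [16, 0, 0, 0]]
```

The batch step only works because the item step has already made every row the same length, which mirrors why `Resize` comes first.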

Now that we have assembled our data in a format fit for model training, let's actually train our classifier. 

We can train a first model and then use its losses to identify outliers in our data. Here, we use both `RandomResizedCrop` and `aug_transforms`.

In [None]:
cats = cats.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = cats.dataloaders(path)

We can now create our `Learner` and fine-tune it in the usual way:

In [None]:
learn = vision_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(10)

Now let's see whether the mistakes the model is making are mainly thinking that tabbies are Maine coons, or that Siamese cats are tabbies, or something else. To visualize this, we can create a *confusion matrix*:

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
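
For intuition about what the plot tabulates, here is a pure-Python sketch that counts (actual, predicted) pairs. The labels are made up for illustration and are not results from the model above; `confusion_matrix` here is a hypothetical helper, not the fastai method.

```python
from collections import Counter


def confusion_matrix(actual, predicted, vocab):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in vocab] for a in vocab]


vocab = ["maine coon", "siamese", "tabby"]
actual    = ["siamese", "siamese", "tabby", "tabby", "maine coon", "maine coon"]
predicted = ["siamese", "siamese", "tabby", "maine coon", "maine coon", "tabby"]

matrix = confusion_matrix(actual, predicted, vocab)
for label, row in zip(vocab, matrix):
    print(f"{label:>10}: {row}")

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(len(vocab)))
print(correct, "of", len(actual), "correct")  # 4 of 6 correct
```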

In the confusion matrix, rows are the actual breeds and columns are the predicted ones, so the diagonal holds the correct predictions and every off-diagonal cell is a mistake.

The loss is a number that is higher if the model is incorrect (especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer. In a couple of chapters we'll learn in depth how loss is calculated and used in the training process. For now, `plot_top_losses` shows us the images with the highest loss in our dataset. As the title of the output says, each image is labeled with four things: prediction, actual (target label), loss, and probability. The *probability* here is the confidence level, from zero to one, that the model has assigned to its prediction:

In [None]:
interp.plot_top_losses(3, nrows=3)

We can see that our very first image looks like both a Maine coon and a tabby, which is likely why our model got confused.

Besides that, our model seems to be performing well so far.
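
The behaviour of the loss described above can be illustrated with a small sketch. It assumes a cross-entropy style loss (the negative log of the probability given to the correct class), which is an assumption on my part for illustration; the probabilities are made up.

```python
import math


def cross_entropy(probs, actual_idx):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[actual_idx])


# In all three cases the correct class is index 1.
confident_right = cross_entropy([0.05, 0.90, 0.05], actual_idx=1)
unsure_right    = cross_entropy([0.30, 0.40, 0.30], actual_idx=1)
confident_wrong = cross_entropy([0.90, 0.05, 0.05], actual_idx=1)

print(round(confident_right, 2), round(unsure_right, 2), round(confident_wrong, 2))
# 0.11 0.92 3.0
```

A confident wrong answer is penalized most, and a correct but hesitant answer still has a noticeably higher loss than a confident correct one.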


We can then save our model (along with the definition of how we created our `DataLoaders`).


> This method even saves the definition of how to create your `DataLoaders`. This is important, because otherwise you would have to redefine how to transform your data in order to use your model in production. fastai automatically uses your validation set `DataLoader` for inference by default, so your data augmentation will not be applied, which is generally what you want.

When you call `export`, fastai will save a file called "export.pkl":

In [None]:
learn.export()

In [None]:
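# export() writes to learn.path (the working directory by default), not to the `cats` image folder
learn_inf = load_learner(learn.path/'export.pkl')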

When we're doing inference, we're generally just getting predictions for one image at a time. To do this, pass a filename to `predict`:

In [None]:
learn_inf.predict('images/siamese.jpg')

This has returned three things: the predicted category in the same format you originally provided (in this case that's a string), the index of the predicted category, and the probabilities of each category. The last two are based on the order of categories in the *vocab* of the `DataLoaders`; that is, the stored list of all possible categories. At inference time, you can access the `DataLoaders` as an attribute of the `Learner`:

In [None]:
learn_inf.dls.vocab
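
The relationship between the three returned values and the vocab can be sketched in plain Python. The vocab order and probabilities below are made-up values for illustration, not output from the model above.

```python
vocab = ["maine coon", "siamese", "tabby"]  # hypothetical vocab order
probs = [0.02, 0.95, 0.03]                  # made-up probabilities, one per vocab entry

# The predicted index is the position of the largest probability...
pred_idx = max(range(len(probs)), key=probs.__getitem__)
# ...and the predicted label is the vocab entry at that position.
pred_label = vocab[pred_idx]

print(pred_label, pred_idx, probs[pred_idx])  # siamese 1 0.95
```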

## Questionnaire

1. Provide an example of where the cat classification model might work poorly in production, due to structural or style differences in the training data.
1. Where do text models currently have a major deficiency?
1. What are possible negative societal implications of text generation models?
1. In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?
1. What kind of tabular data is deep learning particularly good at?
1. What's a key downside of directly using a deep learning model for recommendation systems?
1. What are the steps of the Drivetrain Approach?
1. How do the steps of the Drivetrain Approach map to a recommendation system?
1. Create an image recognition model using data you curate, and deploy it on the web.
1. What is `DataLoaders`?
1. What four things do we need to tell fastai to create `DataLoaders`?
1. What does the `splitter` parameter to `DataBlock` do?
1. How do we ensure a random split always gives the same validation set?
1. What letters are often used to signify the independent and dependent variables?
1. What's the difference between the crop, pad, and squish resize approaches? When might you choose one over the others?
1. What is data augmentation? Why is it needed?
1. What is the difference between `item_tfms` and `batch_tfms`?
1. What is a confusion matrix?
1. What does `export` save?
1. What is it called when we use a model for getting predictions, instead of training?
1. When might you want to use CPU for deployment? When might GPU be better?
1. What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC?
1. What are three examples of problems that could occur when rolling out a cat identification system in practice?
1. What is "out-of-domain data"?
1. What is "domain shift"?
1. What are the three steps in the deployment process?