This notebook accompanies [this YouTube video](https://youtu.be/lzXKsY3bANw), which explains how to use scikit-learn for image classification. Once you've installed the dependencies, you should be able to run every cell below and reproduce the code shown in the video.

In [45]:
# %pip install scikit-learn jupyterlab "embetter[sentence-tfm]"

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

In [2]:
pipe

In [16]:
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder
from sklearn.linear_model import LogisticRegression

image_emb_pipeline = make_pipeline(
    ImageLoader(convert="RGB"),
    ClipEncoder(),
    LogisticRegression()
)

image_emb_pipeline

If you want to follow along with the same dataset, it can be found here:

```
https://github.com/koaning/bulk-datasets/raw/main/pets.tar.gz
```
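To fetch and unpack the archive from Python rather than by hand, something like the sketch below should work. It uses only the standard library. The helper name `download_and_extract` and the local file name `pets.tar.gz` are illustrative choices, and the sketch assumes the archive unpacks into a `pets/` folder, which is what the loading cell further down expects.

```python
import tarfile
import urllib.request
from pathlib import Path

DATASET_URL = "https://github.com/koaning/bulk-datasets/raw/main/pets.tar.gz"


def download_and_extract(url=DATASET_URL, archive="pets.tar.gz", dest="."):
    """Download the archive if it is not on disk yet, then unpack it into dest."""
    archive_path = Path(archive)
    if not archive_path.exists():
        # Only hit the network when there is no local copy.
        urllib.request.urlretrieve(url, archive_path)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return Path(dest)
```

Calling `download_and_extract()` with no arguments downloads into the current directory, so the images should end up where `Path("pets")` can find them, provided the folder assumption holds.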

In [38]:
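from pathlib import Path

pet_types = set()  # distinct class labels
n_img = 0          # number of images found
image_paths = []   # file paths fed to the embedding pipeline
y = []             # one label per image

for path in Path("pets").glob("*.jpg"):
    # The label is everything before the last underscore in the file name.
    stem = path.stem
    label = stem[:stem.rfind("_")]
    image_paths.append(str(path))
    y.append(label)
    pet_types.add(label)
    n_img += 1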
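The loop above derives each label from the file name by keeping everything before the last underscore. Here is a minimal sketch of that rule as a standalone helper. The file names are made up for illustration, and `label_from_path` is a hypothetical name, not part of any library. One caveat with the slice `stem[:stem.rfind("_")]`: when a name has no underscore, `rfind` returns -1 and the slice silently drops the last character, so the sketch uses `rpartition` and falls back to the whole stem.

```python
from collections import Counter
from pathlib import Path


def label_from_path(path):
    """Return the part of the file stem before the last underscore."""
    stem = Path(path).stem
    head, sep, _ = stem.rpartition("_")
    # No underscore found: treat the whole stem as the label.
    return head if sep else stem


# Hypothetical file names, only to illustrate the naming scheme.
example_paths = [
    "pets/american_bulldog_12.jpg",
    "pets/american_bulldog_7.jpg",
    "pets/beagle_3.jpg",
]
labels = [label_from_path(p) for p in example_paths]
label_counts = Counter(labels)
print(label_counts)
```

Running a `Counter` over the real `y` list in the same way is a quick check that the classes are roughly balanced before cross-validating.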

In [52]:
from embetter.vision import ImageLoader
from embetter.multi import ClipEncoder

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

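model = LogisticRegression(max_iter=1_000)

# Embedding-only pipeline: the classifier stays separate so the
# embeddings can be computed once and reused for cross-validation.
image_emb_pipeline = make_pipeline(
    ImageLoader(convert="RGB"),
    ClipEncoder(),
)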

In [60]:
%%time 

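# fit_transform instead of transform: the steps have nothing to learn, but
# recent scikit-learn versions warn when transform runs on an unfitted pipeline.
X = image_emb_pipeline.fit_transform(image_paths)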

CPU times: user 18.4 s, sys: 3.15 s, total: 21.5 s
Wall time: 11.1 s


In [61]:
%%time 

cross_val_score(model, X, y, cv=5)

CPU times: user 13.2 s, sys: 634 ms, total: 13.9 s
Wall time: 2.12 s


array([0.9 , 0.88, 0.87, 0.88, 0.85])
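For a classifier, `cross_val_score` reports accuracy per fold by default, so the five numbers above can be condensed into a mean and a spread. The sketch below does that with the standard library, using the fold scores copied from the output above. The variable names are illustrative.

```python
from statistics import mean, pstdev

# Fold accuracies copied from the cross_val_score output above.
scores = [0.90, 0.88, 0.87, 0.88, 0.85]

mean_acc = mean(scores)
std_acc = pstdev(scores)  # population standard deviation across folds
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

In the notebook itself, the array returned by `cross_val_score` can be summarised directly with its `.mean()` and `.std()` methods.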