# 04. PyTorch Custom Datasets 

In previous sections we used some ready-made datasets with PyTorch.

How do we use our own data?

One way to do so is to build a custom dataset.

**Resources**: 
- Book version on learnpytorch.io
- Notebook on Daniel's Github

## Domain libraries

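Depending on what you are working on (vision, text, audio, recommendation, ...), look into the matching PyTorch domain library, such as TorchVision, TorchText, TorchAudio or TorchRec, for existing data loading functions.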


## Importing PyTorch and setting up device-agnostic code

In [1]:
import torch
from torch import nn

# PyTorch 1.10+ is needed
print(f"Torch version: {torch.__version__}")
print(f"Torch build cuda version: {torch.version.cuda}")

Torch version: 2.5.1
Torch build cuda version: 11.8


In [2]:
# Setup device agnostic code
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

cuda


In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_22:08:44_Pacific_Standard_Time_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0


In [4]:
!nvidia-smi

Fri Nov 15 19:51:09 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.14                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3060 ...  WDDM  |   00000000:01:00.0 Off |                  N/A |
| N/A   44C    P3             23W /   40W |       0MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## 1. Get some data

The dataset is a subset of Food101, which contains 101 food classes with 1,000 images each. The subset used here is reduced to three classes (pizza, steak, sushi) and 300 images in total (225 for training and 75 for testing).

It's better to start on a small scale and increase the scale only when necessary, because experiments run faster on less data.
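
To illustrate the "start small" idea, here is a minimal sketch of drawing a reproducible random subset from a list of file names. The helper name `sample_subset`, the `fraction` value and the `seed` are assumptions made for illustration. This is not necessarily how the pizza/steak/sushi archive was produced.

```python
import random


def sample_subset(items, fraction=0.1, seed=42):
    """Return a reproducible random sample holding `fraction` of `items`.

    Hypothetical helper: `fraction` and `seed` are illustrative defaults.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    items = list(items)
    # Keep at least one item whenever the input is not empty
    k = max(1, round(len(items) * fraction)) if items else 0
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(items, k))


# Example with made-up file names standing in for one class folder
all_images = [f"img_{i:04d}.jpg" for i in range(1000)]
small = sample_subset(all_images, fraction=0.1)
print(len(small), small[:3])
```

Raising `fraction` later is then a one-argument change, which keeps the scale-up step cheap.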

In [5]:
import os
import requests
from zipfile import ZipFile

# URL of the file to be downloaded
url = "https://github.com/mrdbourke/pytorch-deep-learning/blob/main/data/pizza_steak_sushi.zip?raw=true"
filename = "pizza_steak_sushi.zip"
extract_folder = "Food101_pizza_steak_sushi"

# Download the file if it does not exist
if not os.path.exists(filename):
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail on an HTTP error instead of saving an error page
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"{filename} downloaded.")

# Unzip the file into the specified folder if it does not exist
if not os.path.exists(extract_folder):
    with ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall(extract_folder)
    print(f"{filename} extracted to {extract_folder}.")
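
Before extracting, it can be useful to peek inside the archive and confirm it has the expected `split/class/image` layout. The sketch below uses only the standard `zipfile` module. The helper name `list_class_folders` is hypothetical, and it assumes the images sit two folders deep, as the extracted directory structure suggests.

```python
import os
from zipfile import ZipFile


def list_class_folders(zip_path):
    """Return the sorted 'split/class' prefixes of files stored in a zip archive."""
    prefixes = set()
    with ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = name.split("/")  # zip member names always use forward slashes
            # Keep only real files that live inside split/class/
            if len(parts) >= 3 and parts[-1]:
                prefixes.add("/".join(parts[:2]))
    return sorted(prefixes)


zip_path = "pizza_steak_sushi.zip"  # archive name from the download cell
if os.path.exists(zip_path):
    print(list_class_folders(zip_path))
```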

## 2. Becoming one with the data (data preparation and data exploration)

In [6]:
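def walk_through_dir(dir_path):
    """Print the number of subdirectories and files in each directory under dir_path."""
    for dirpath, dirnames, filenames in os.walk(dir_path):
        # Every file is reported as an image, which holds for this dataset
        print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")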

In [7]:
walk_through_dir(extract_folder)

There are 2 directories and 0 images in 'Food101_pizza_steak_sushi'.
There are 3 directories and 0 images in 'Food101_pizza_steak_sushi\test'.
There are 0 directories and 25 images in 'Food101_pizza_steak_sushi\test\pizza'.
There are 0 directories and 19 images in 'Food101_pizza_steak_sushi\test\steak'.
There are 0 directories and 31 images in 'Food101_pizza_steak_sushi\test\sushi'.
There are 3 directories and 0 images in 'Food101_pizza_steak_sushi\train'.
There are 0 directories and 78 images in 'Food101_pizza_steak_sushi\train\pizza'.
There are 0 directories and 75 images in 'Food101_pizza_steak_sushi\train\steak'.
There are 0 directories and 72 images in 'Food101_pizza_steak_sushi\train\sushi'.
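
The printed counts are handy for a quick look, but a dictionary is easier to reuse, for example to check class balance. A minimal sketch using `pathlib`, assuming the `root/split/class/image` layout shown above. The helper name `count_images_per_class` and the list of accepted extensions are assumptions.

```python
from pathlib import Path


def count_images_per_class(root, extensions=(".jpg", ".jpeg", ".png")):
    """Return {split: {class_name: image_count}} for a root/split/class/image layout."""
    counts = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in extensions
            )
            for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir())
        }
    return counts


root = Path("Food101_pizza_steak_sushi")  # folder created by the unzip step
if root.is_dir():
    print(count_images_per_class(root))
```

Unlike `walk_through_dir`, this version counts only files with an image extension, so stray files do not inflate the totals.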


In [8]:
# Setup train and test directories
train_dir = os.path.join(extract_folder, "train")
test_dir = os.path.join(extract_folder, "test")

train_dir, test_dir

('Food101_pizza_steak_sushi\\train', 'Food101_pizza_steak_sushi\\test')
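
Because each class has its own folder, the folder names can double as labels. The sketch below derives the class names and an integer index for each from a split directory, similar in spirit to what `torchvision.datasets.ImageFolder` does. The helper name `find_classes` here is an illustrative, standalone function and makes no claim to match any library's implementation.

```python
import os


def find_classes(directory):
    """Return (class_names, class_to_idx) based on the subfolders of `directory`."""
    # Sort so the name-to-index mapping is stable across runs and platforms
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        raise FileNotFoundError(f"Couldn't find any class folders in {directory!r}.")
    class_to_idx = {name: idx for idx, name in enumerate(classes)}
    return classes, class_to_idx


train_dir = os.path.join("Food101_pizza_steak_sushi", "train")
if os.path.isdir(train_dir):
    print(find_classes(train_dir))
```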