# PyTorch basics: data loading and preprocessing
PyTorch wraps common data-loading functionality in `torch.utils.data`, which makes it straightforward to prefetch data with multiple worker processes and to load it in batches.
torchvision also ships ready-made wrappers for common image datasets, including the CIFAR-10 dataset used earlier as well as ImageNet, COCO, MNIST, and LSUN. All of them are available through `torchvision.datasets`.

In [None]:
# Import the required packages
import torch
# Print the version
torch.__version__

## Dataset
`Dataset` is an abstract class. To make data easy to read, wrap it in a `Dataset` subclass.
A custom dataset inherits from `Dataset` and implements two methods:
1. `__getitem__()` returns one sample for a given index (from `0` to `len(self) - 1`).
2. `__len__()` returns the total number of samples in the dataset.

Below we build a custom dataset from the Kaggle competition [Blue Book for Bulldozers](https://www.kaggle.com/c/bluebook-for-bulldozers/data). To keep the introduction simple, we use the data dictionary included with the competition data, because it is small.

In [None]:
# Imports
from torch.utils.data import Dataset
import pandas as pd

In [None]:
# Define a dataset
class BulldozerDataset(Dataset):
    """Dataset demo."""
    def __init__(self, csv_file):
        """Read and load the CSV data during initialization."""
        self.df = pd.read_csv(csv_file)
    def __len__(self):
        """Return the number of rows in df."""
        return len(self.df)
    def __getitem__(self, idx):
        """Return the SalePrice of the row at position idx."""
        return self.df.iloc[idx].SalePrice

The dataset is now defined. We can instantiate it and access its contents through the instance.

In [None]:
ds_demo = BulldozerDataset('median_benchmark.csv')

We can now inspect the dataset directly with the following commands.


In [None]:
# Because __len__ is implemented, len() returns the total number of samples
len(ds_demo)

In [None]:
# Indexing calls __getitem__ and returns the corresponding sample
ds_demo[0]
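
If the competition CSV is not at hand, the same two-method contract can be tried on an in-memory table. The following is a minimal sketch, not the notebook's own code: the class name `PriceDataset`, the `SalesID` column, and all values are made up for illustration, and the price is returned as a tensor purely as a design choice.

```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class PriceDataset(Dataset):
    """Hypothetical stand-in for BulldozerDataset, backed by an in-memory DataFrame."""

    def __init__(self, df):
        # Reset the index so positions 0..len-1 match the row order.
        self.df = df.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # Return the price as a float32 tensor so that batches collate cleanly.
        price = float(self.df["SalePrice"].iloc[idx])
        return torch.tensor(price, dtype=torch.float32)


# Made-up rows, for illustration only.
toy_df = pd.DataFrame({"SalesID": [1, 2, 3],
                       "SalePrice": [24000.0, 31500.0, 18750.0]})
toy_ds = PriceDataset(toy_df)

print(len(toy_ds))  # uses __len__
print(toy_ds[0])    # uses __getitem__
```

Returning tensors from `__getitem__` is optional, since the default collate function also converts plain numbers, but it makes the dtype explicit.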

The custom dataset is ready. Next we read it with PyTorch's built-in data loader.
## Dataloader
`DataLoader` handles reading from a `Dataset`. Commonly used parameters are `batch_size` (the number of samples per batch), `shuffle` (whether to shuffle the data), and `num_workers` (the number of subprocesses used for loading; `0` loads in the main process). A simple example follows.

In [None]:
dl = torch.utils.data.DataLoader(ds_demo, batch_size=10, shuffle=True, num_workers=0)

A `DataLoader` is an iterable, so we can create an iterator from it and fetch one batch at a time.

In [None]:
idata=iter(dl)
print(next(idata))

The more common usage is to iterate over it with a for loop.

In [None]:
for i, data in enumerate(dl):
    print(i,data)
    # To save space, we stop after the first batch
    break
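
To see how `batch_size` splits a dataset, here is a small self-contained sketch that uses `TensorDataset` with made-up numbers instead of the bulldozer CSV. The variable names are illustrative assumptions; `shuffle=False` is chosen only so that the output order is deterministic.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten made-up samples with two features each, plus one label per sample.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.arange(10)
pairs_ds = TensorDataset(features, labels)

# shuffle=False keeps the order fixed; num_workers=0 loads in the main process.
pairs_dl = DataLoader(pairs_ds, batch_size=4, shuffle=False, num_workers=0)

batch_sizes = []
for batch_features, batch_labels in pairs_dl:
    batch_sizes.append(batch_features.shape[0])
    print(batch_features.shape, batch_labels.tolist())

# Number of batches, and the size of each one.
print(len(pairs_dl), batch_sizes)
```

With 10 samples and `batch_size=4`, the last batch holds only the remaining 2 samples. Passing `drop_last=True` to `DataLoader` would discard that short batch instead.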

We can now define datasets with `Dataset` and load and iterate over them with `DataLoader`. Beyond these, PyTorch also provides torchvision, an extension package for computer vision that wraps commonly used datasets, models, and transforms.
## torchvision package
torchvision is PyTorch's library dedicated to image processing. The `pip install torchvision` step at the end of the installation instructions on the official PyTorch website installs this package.

### torchvision.datasets
`torchvision.datasets` can be thought of as a set of datasets prepared by the PyTorch team. It handles much of the preprocessing for many image datasets, so they can be used directly:
- MNIST
- COCO (Captions and Detection)
- LSUN
- ImageFolder
- Imagenet-12
- CIFAR
- STL10
- SVHN
- PhotoTour

An example:

In [None]:
import torchvision.datasets as datasets
trainset = datasets.MNIST(root='./data',   # directory where the MNIST data is stored
                          train=True,      # load the training set; False loads the test set
                          download=True,   # download the dataset automatically if it is not already in root
                          transform=None)  # preprocessing applied to each image; None means no preprocessing


### torchvision.models
torchvision provides not only common image datasets but also pretrained models, which can be used directly after loading or as a starting point for transfer learning.
The `torchvision.models` module includes the following model architectures:
- AlexNet
- VGG
- ResNet
- SqueezeNet
- DenseNet

In [None]:
# Load a pretrained model; as with datasets, the weights are downloaded from the server on first use
import torchvision.models as models
# Note: torchvision 0.13 and later deprecate pretrained=True in favor of weights=models.ResNet18_Weights.DEFAULT
resnet18 = models.resnet18(pretrained=True)

### torchvision.transforms
The `transforms` module provides common image transformation classes for data preprocessing and data augmentation.

In [None]:
from torchvision import transforms
transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),  # pad 4 pixels of zeros on each side, then randomly crop to 32x32
    transforms.RandomHorizontalFlip(),  # flip horizontally with probability 0.5
    transforms.RandomRotation((-45, 45)),  # rotate by a random angle between -45 and 45 degrees
    transforms.ToTensor(),  # convert a PIL image to a float tensor with values in [0, 1]
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),  # per-channel (R, G, B) mean and standard deviation
])

You may wonder what the numbers (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) mean.

This official forum post explains them in detail:
https://discuss.pytorch.org/t/normalization-in-the-mnist-example/457/21
They are the per-channel mean and standard deviation computed on ImageNet and can be used directly. We can treat them as fixed values.

This completes the introduction to the basics of PyTorch. Next, we introduce the theoretical foundations of neural networks and implement the formulas in PyTorch.