## **Creating Your Own Datasets**
- Two abstract classes for datasets are provided:
    + torch_geometric.data.Dataset ([more info](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.Dataset))
    + torch_geometric.data.InMemoryDataset ([more info](https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html#torch_geometric.data.InMemoryDataset))
- Following the torchvision convention, each dataset is passed a root folder that indicates where the dataset should be stored.
    + The root folder is split into two folders: the raw_dir, where the dataset is downloaded to, and the processed_dir, where the processed dataset is saved.
- Each dataset can be passed a **transform**, a **pre_transform** and a **pre_filter** function
    + **Transform** : The transform function transforms the data object dynamically each time it is accessed (so it is best used for data augmentation).
    + **Pre-Transform** : The pre_transform function applies the transformation before the data objects are saved to disk (so it is best used for heavy precomputation that needs to be done **only once**).
    + **Pre-Filter** : The pre_filter function can filter out data objects before they are saved.
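
The three hooks differ mainly in when they run. The sketch below is a plain-Python stand-in, not PyG code: `ToyDataset`, `add_mean_degree`, `mark_augmented` and the dictionary-based "data objects" are all hypothetical. Only the ordering is meant to mirror the description above: pre_filter and pre_transform run once at processing time, while transform runs on every access.

```python
class ToyDataset:
    """Hypothetical stand-in that mimics when each hook runs."""

    def __init__(self, raw_items, transform=None, pre_transform=None, pre_filter=None):
        self.transform = transform
        items = list(raw_items)
        # Both "pre" hooks run exactly once, before anything is stored.
        if pre_filter is not None:
            items = [x for x in items if pre_filter(x)]
        if pre_transform is not None:
            items = [pre_transform(x) for x in items]
        self._stored = items  # stands in for what would be saved to disk

    def __len__(self):
        return len(self._stored)

    def __getitem__(self, idx):
        item = dict(self._stored[idx])  # copy so the stored version stays untouched
        return item if self.transform is None else self.transform(item)


calls = {"pre_transform": 0, "transform": 0}


def add_mean_degree(item):
    # Pretend this is an expensive precomputation.
    calls["pre_transform"] += 1
    return {**item, "degree": 2 * item["num_edges"] / item["num_nodes"]}


def mark_augmented(item):
    # Pretend this is a random augmentation.
    calls["transform"] += 1
    return {**item, "augmented": True}


raw = [
    {"num_nodes": 4, "num_edges": 6},
    {"num_nodes": 1, "num_edges": 0},
    {"num_nodes": 3, "num_edges": 3},
]
toy = ToyDataset(
    raw,
    transform=mark_augmented,
    pre_transform=add_mean_degree,
    pre_filter=lambda g: g["num_nodes"] > 1,
)

first = toy[0]
again = toy[0]
```

Here the single-node item is filtered out before `add_mean_degree` ever sees it, so the precomputation runs twice in total, while `mark_augmented` runs once per access.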

### **Creating "In Memory Datasets"**
You need to implement four fundamental methods:
#### torch_geometric.data.InMemoryDataset.***raw_file_names***():
- A list of files that must be found in the raw_dir in order to skip the download.

#### torch_geometric.data.InMemoryDataset.***processed_file_names***():
- A list of files that must be found in the processed_dir in order to skip the processing.
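
As a rough illustration of how these two properties are used, the sketch below checks whether every expected file already exists in a folder. It is plain Python rather than the library's code: `files_exist` and the temporary root folder are hypothetical, and the `raw` and `processed` subfolder names are an assumption about the default layout under the root folder.

```python
import os
import os.path as osp
import tempfile


def files_exist(folder, file_names):
    """Return True only if every expected file is present in `folder`."""
    return len(file_names) > 0 and all(
        osp.exists(osp.join(folder, name)) for name in file_names
    )


root = tempfile.mkdtemp()
raw_dir = osp.join(root, "raw")              # assumed default subfolder name
processed_dir = osp.join(root, "processed")  # assumed default subfolder name
os.makedirs(raw_dir)
os.makedirs(processed_dir)

raw_file_names = ["some_file_1", "some_file_2"]
processed_file_names = ["data.pt"]

# Simulate a finished download: both raw files are present.
for name in raw_file_names:
    open(osp.join(raw_dir, name), "w").close()

skip_download = files_exist(raw_dir, raw_file_names)
skip_processing = files_exist(processed_dir, processed_file_names)
print(skip_download, skip_processing)  # True False
```

In this state the download would be skipped, but processing would still run because `data.pt` does not exist yet.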

#### torch_geometric.data.InMemoryDataset.***download***():
- Downloads raw data into raw_dir.

#### torch_geometric.data.InMemoryDataset.***process***():
- Processes raw data and saves it into the processed_dir.
- We need to read and create a list of **torch_geometric.data.Data** objects and save it into the processed_dir.
- Because saving a huge Python list is really slow, we collate the list into one huge **torch_geometric.data.Data** object via **torch_geometric.data.InMemoryDataset.collate()** before saving.
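
The idea behind collating can be illustrated without PyG. The sketch below is a simplified, hypothetical version using plain lists (`toy_collate` and `toy_get` are not library functions): variable-length items are concatenated into one flat container, and a list of boundaries records where each item starts and ends so that it can be recovered later.

```python
from itertools import accumulate


def toy_collate(items):
    """Concatenate variable-length lists and record their boundaries."""
    flat = [value for item in items for value in item]
    slices = [0] + list(accumulate(len(item) for item in items))
    return flat, slices


def toy_get(flat, slices, idx):
    """Recover the idx-th original item from the flat storage."""
    return flat[slices[idx]:slices[idx + 1]]


# Three "graphs" with 3, 1 and 2 node features respectively.
node_features = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]
flat, slices = toy_collate(node_features)

print(slices)                     # [0, 3, 4, 6]
print(toy_get(flat, slices, 2))   # [0.5, 0.6]
```

This is the same shape of result that the example class stores and reloads as a `(data, slices)` pair, although the real method works on tensors and on every attribute of the data objects.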

In [2]:
import torch
from torch_geometric.data import InMemoryDataset


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        return

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])

### **Creating “Larger” Datasets**
- To create datasets that do not fit into memory, torch_geometric.data.Dataset must be used. It closely follows the concepts of the torchvision datasets.
- In addition to the methods described above, the following methods need to be implemented:

#### torch_geometric.data.Dataset.len():
- Returns the number of examples in your dataset.

#### torch_geometric.data.Dataset.get():
- Implements the logic to load a single graph.
- Internally, torch_geometric.data.Dataset.`__getitem__()` gets data objects from torch_geometric.data.Dataset.get().
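
The relationship between `len()`, `get()` and `__getitem__()` can be sketched in plain Python. `ToyDiskDataset` is hypothetical and uses JSON files in a temporary folder in place of `torch.save` and `torch.load`. It only illustrates that each example is stored in its own file and is read from disk when it is indexed, with an optional transform applied afterwards.

```python
import json
import os.path as osp
import tempfile


class ToyDiskDataset:
    """Hypothetical stand-in: one file per example, loaded on access."""

    def __init__(self, processed_dir, transform=None):
        self.processed_dir = processed_dir
        self.transform = transform
        self.num_examples = 0

    def process(self, raw_items):
        # Write every example to its own file, numbered from 0.
        for i, item in enumerate(raw_items):
            with open(osp.join(self.processed_dir, f"data_{i}.json"), "w") as f:
                json.dump(item, f)
        self.num_examples = len(raw_items)

    def len(self):
        return self.num_examples

    def get(self, idx):
        # Only the requested example is read from disk.
        with open(osp.join(self.processed_dir, f"data_{idx}.json")) as f:
            return json.load(f)

    def __len__(self):
        return self.len()

    def __getitem__(self, idx):
        data = self.get(idx)
        return data if self.transform is None else self.transform(data)


dataset = ToyDiskDataset(tempfile.mkdtemp())
dataset.process([{"num_nodes": 3}, {"num_nodes": 5}])
second = dataset[1]
print(len(dataset), second)  # 2 {'num_nodes': 5}
```

Note that the file index used when writing must match the index used in `get()`, which is why both start at 0 here.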

In [5]:
import os.path as osp

import torch
from torch_geometric.data import Data, Dataset


class MyOwnDataset(Dataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data_0.pt', 'data_1.pt', ...]
    
    """
    
    # If the dataset has already been downloaded and processed
    # and you do not want these steps to run again,
    # simply do not override the two methods below.
    
    def download(self):
        # Download to `self.raw_dir`.
        return
    
    def process(self):
        i = 0
        for raw_path in self.raw_paths:
            # Read data from `raw_path`.
            data = Data(...)

            if self.pre_filter is not None and not self.pre_filter(data):
                continue

            if self.pre_transform is not None:
                data = self.pre_transform(data)

            torch.save(data, osp.join(self.processed_dir, 'data_{}.pt'.format(i)))
            i += 1
    """
    
    def len(self):
        # Number of examples: one processed file per example.
        return len(self.processed_file_names)
    
    def get(self, idx):
        # Load the single graph stored as `data_{idx}.pt`.
        data = torch.load(osp.join(self.processed_dir, 'data_{}.pt'.format(idx)))
        return data


### **Simple Version**
- Simply create the Data objects yourself, in whatever way suits your data.
- Append the Data objects to a list, which is then fed to the DataLoader object.

In [6]:
from torch_geometric.data import Data, DataLoader

data_list = [Data(...), ..., Data(...)]
loader = DataLoader(data_list, batch_size=32)
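
How many batches such a loader yields follows from the list length and `batch_size`. The sketch below is plain Python and only chunks a list in order: `toy_batches` and the placeholder strings are hypothetical, and it assumes no shuffling and that the last, smaller batch is kept. The real DataLoader additionally merges the graphs of each batch into a single batch object.

```python
def toy_batches(items, batch_size):
    """Yield consecutive chunks of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


data_list = [f"graph_{i}" for i in range(70)]  # placeholders for Data objects
batch_sizes = [len(batch) for batch in toy_batches(data_list, batch_size=32)]
print(batch_sizes)  # [32, 32, 6]
```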