<a href="https://colab.research.google.com/github/kameshcodes/deep-learning-codes/blob/main/datasets_dataloader_pytorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

import pandas as pd

In [35]:
csv_path = "/content/sample_data/california_housing_train.csv"
df = pd.read_csv(csv_path)
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


- All columns are numerical $\rightarrow$ the data is easy to load, with no encoding or type conversion needed.
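
The numeric-columns observation can also be checked programmatically. The sketch below is a minimal illustration on a small hand-made frame (`sample` is a hypothetical stand-in built from two rows of the `df.head()` output, so it runs without the Colab CSV); the same check should apply to the full `df`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the housing frame: two rows copied from df.head().
sample = pd.DataFrame({
    "longitude": [-114.31, -114.47],
    "latitude": [34.19, 34.40],
    "median_income": [1.4936, 1.82],
    "median_house_value": [66900.0, 80100.0],
})

# True only if every column has a numeric dtype (int, float, bool).
all_numeric = all(is_numeric_dtype(sample[col]) for col in sample.columns)

# Count missing values; NaNs pass the dtype check but can still break training.
n_missing = int(sample.isna().sum().sum())

print(all_numeric, n_missing)
```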

### **1.1 Loading the CSV data in PyTorch**

In [24]:
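class HousingDataset(Dataset):
  def __init__(self, csv_path):
    df = pd.read_csv(csv_path)
    # Features: every column except the last. float16 is kept here to match the
    # printed output, but it rounds coarsely (a latitude prints as 37.7812);
    # float32 is the usual choice.
    self.X = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float16)
    # Target: the last column (median_house_value). It stays float32 because
    # prices above 65504 would overflow to inf in float16.
    self.y = torch.tensor(df.iloc[:, -1].values, dtype=torch.float32)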

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [25]:
dataset = HousingDataset(csv_path)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

next(iter(dataloader))

[tensor([[-122.2500,   37.7812,   52.0000, 1704.0000,  371.0000,  663.0000,
           340.0000,    4.2266],
         [-121.8125,   38.0000,   47.0000, 1265.0000,  254.0000,  587.0000,
           247.0000,    2.6367],
         [-122.2500,   37.8750,   52.0000, 2256.0000,  410.0000,  823.0000,
           377.0000,    5.7969],
         [-120.7500,   38.5625,    8.0000,  892.0000,  185.0000,  427.0000,
           164.0000,    2.6836]], dtype=torch.float16),
 tensor([275000.,  93500., 415300., 118800.])]
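
The coarse feature values in this batch (for example `37.7812`) are a side effect of storing `X` in `float16`. A minimal sketch of the two limits involved, using a few values of the kind seen in the data; the variable names are illustrative only:

```python
import torch

# Largest finite value a 16-bit float can hold.
f16_max = torch.finfo(torch.float16).max

# House prices exceed that limit, so in float16 they overflow to inf.
prices = torch.tensor([66900.0, 275000.0], dtype=torch.float16)
overflowed = torch.isinf(prices).all().item()

# Values inside the range survive but lose precision: float16 keeps only
# about three significant decimal digits.
lat16 = torch.tensor(37.78, dtype=torch.float16).item()
lat32 = torch.tensor(37.78, dtype=torch.float32).item()

print(f16_max, overflowed, lat16, lat32)
```

This is why the target tensor is kept in `float32`, and why `float32` is generally the safer default for the features as well.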

### **1.1.1 Scaling the data while loading**

In [26]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [45]:
class HousingDataset2(Dataset):
  def __init__(self, csv_path, scaler=None):
    df = pd.read_csv(csv_path)
    self.X = df.iloc[:, :-1].values
    self.y = df.iloc[:, -1].values

    # Scale the features only if a scaler was passed in.
    if scaler is not None:
      self.X = scaler.fit_transform(self.X)

    self.X = torch.tensor(self.X, dtype=torch.float32)
    # float16 would turn the targets into inf: house values exceed 65504,
    # the largest number a 16-bit float can represent.
    self.y = torch.tensor(self.y, dtype=torch.float32)

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]

In [46]:
# scaler = StandardScaler()
scaler = MinMaxScaler()
dataset = HousingDataset2(csv_path, scaler)
dataloader = DataLoader(dataset, batch_size=4, shuffle=True)

next(iter(dataloader))

[tensor([[0.7012, 0.1923, 0.1569, 0.0633, 0.0604, 0.0300, 0.0589, 0.3117],
         [0.1594, 0.6334, 0.2941, 0.1106, 0.0989, 0.0479, 0.1010, 0.3413],
         [0.1912, 0.5409, 0.6471, 0.1089, 0.1065, 0.0603, 0.1219, 0.3085],
         [0.7311, 0.0329, 0.4314, 0.0683, 0.0641, 0.0392, 0.0707, 0.3442]]),
 tensor([151900., 252100., 342300., 151400.])]
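
Because `HousingDataset2` calls `fit_transform` itself, building a second dataset for validation data would refit the scaler on that data. One possible variation, sketched here on synthetic arrays, fits the scaler on the training split only and reuses it elsewhere. The `ScaledDataset` class and its `fit` flag are hypothetical and not part of the notebook.

```python
import numpy as np
import torch
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import Dataset


class ScaledDataset(Dataset):
  """Hypothetical variant: the caller decides whether to fit the scaler."""

  def __init__(self, X, y, scaler, fit=False):
    # fit=True learns min/max from X; fit=False reuses what was learned before.
    X = scaler.fit_transform(X) if fit else scaler.transform(X)
    self.X = torch.tensor(X, dtype=torch.float32)
    self.y = torch.tensor(y, dtype=torch.float32)

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx], self.y[idx]


# Synthetic stand-in data: 10 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(10, 3))
y = rng.uniform(50_000, 500_000, size=10)

scaler = MinMaxScaler()
train_ds = ScaledDataset(X[:8], y[:8], scaler, fit=True)  # learns the statistics
val_ds = ScaledDataset(X[8:], y[8:], scaler, fit=False)   # reuses them

print(len(train_ds), len(val_ds))
```

With this arrangement the validation features may fall slightly outside `[0, 1]`, which is expected: they are scaled with the training minimum and maximum.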

**Note:** Do all DataFrame operations inside the `__init__` method so they run once rather than on every `__getitem__` call. If the processing is heavy, consider doing it outside the class.
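
As a sketch of the "outside the class" option, the preprocessing can run once on the DataFrame and the finished arrays can be handed to PyTorch's built-in `TensorDataset`. The small frame below reuses rows from the `df.head()` output, and the `rooms_per_household` feature is a made-up example of a processing step, not something the notebook computes.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in frame: a few columns and rows taken from the df.head() output.
df = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0, 1501.0],
    "households": [472.0, 463.0, 117.0, 226.0],
    "median_house_value": [66900.0, 80100.0, 85700.0, 73400.0],
})

# Preprocessing happens once, outside any Dataset class.
df["rooms_per_household"] = df["total_rooms"] / df["households"]
features = df.drop(columns="median_house_value").to_numpy()
targets = df["median_house_value"].to_numpy()

# TensorDataset only indexes the prepared tensors; no per-sample pandas work.
dataset = TensorDataset(
    torch.tensor(features, dtype=torch.float32),
    torch.tensor(targets, dtype=torch.float32),
)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

X_batch, y_batch = next(iter(loader))
print(X_batch.shape, y_batch.shape)
```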