
In [None]:
import time

import matplotlib.pyplot as plt

%matplotlib inline
import matplotlib_inline.backend_inline
import numpy as np
import torch
import torch.nn as nn
import torch.utils.data as data
from matplotlib.colors import to_rgba
from torch import Tensor
from tqdm.notebook import tqdm  # Progress bar

matplotlib_inline.backend_inline.set_matplotlib_formats("svg", "pdf")  # For export

In [None]:
print("Using torch", torch.__version__)

In [None]:
torch.manual_seed(42) #reproducibility

### Tensors
* A vector is a 1-D tensor, a matrix is a 2-D tensor, etc.
* Tensor Creation
    * torch.zeros creates a tensor filled with zeros
    * torch.ones creates a tensor filled with ones
    * torch.rand creates a tensor with random values uniformly sampled between 0 and 1
    * torch.randn creates a tensor with random values sampled from a normal distribution with mean 0 and variance 1
    * torch.arange(N, M) creates a tensor containing the values N, N+1, N+2, ..., M-1 (the end value M is excluded)
    * torch.Tensor(input list) creates a tensor from the list you provide
* Tensor Shape and Size
    * x.shape
    * x.size()
    
* Tensor to Numpy and Numpy to Tensor
    * Numpy to tensor: torch.from_numpy()
    * Tensor to Numpy: x.numpy() -> this requires the tensor to be on the CPU and not the GPU, so call .cpu() on the tensor beforehand
        * np_arr = tensor.cpu().numpy()
        
* Tensor reshaping
    * view: a tensor of shape (2,3) can be reshaped to any other shape with the same number of elements, e.g. a tensor of shape (6) or (3,2)
    * permute: reorders dimensions, e.g. x.permute(1,0) swaps dimensions 0 and 1
    
* Tensor Operations
    * addition: x + y returns a new tensor, while x.add_(y) adds y to x in place
    * torch.matmul: performs the matrix product of two tensors, where the specific behavior depends on the dimensions. If both inputs are matrices (2-dimensional tensors), it performs the standard matrix product. For higher-dimensional inputs, the function supports broadcasting
    * torch.mm: performs the matrix product of two matrices but doesn't support broadcasting
    * torch.bmm: performs the matrix product with a batch dimension
    * torch.einsum: performs matrix multiplications and more (sums of products) using the Einstein summation convention
    
    * the most commonly used are torch.matmul and torch.bmm
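The creation functions listed above can be tried side by side. The following is a minimal sketch; the variable names are only illustrative.

```python
import torch

zeros = torch.zeros(2, 3)            # 2x3 tensor filled with zeros
ones = torch.ones(2, 3)              # 2x3 tensor filled with ones
uniform = torch.rand(2, 3)           # uniform samples between 0 and 1
normal = torch.randn(2, 3)           # samples from a standard normal distribution
steps = torch.arange(2, 6)           # values 2, 3, 4, 5 (the end value 6 is excluded)
from_list = torch.Tensor([1, 2, 3])  # float32 tensor built from a Python list

print(zeros.shape, ones.shape, uniform.shape, normal.shape)
print(steps)
print(from_list, from_list.dtype)
```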
    

In [None]:
x = Tensor(2, 3, 4) #allocates an uninitialized tensor of shape [2,3,4]; the printed values are arbitrary
print(x)
print("shape",x.shape)
print("size", x.size())
print("-" * 35)
x = Tensor([[2,3], [4,5]]) #creates a tensor from a nested list
print(x)
print("shape",x.shape)
print("size", x.size())
print("-" * 35)
x = torch.rand(2,3,4) #creates a tensor with random values between 0 and 1 with the shape [2,3,4]
print(x)
print("shape",x.shape)
print("size", x.size())
print("-" * 35)

In [None]:
np_array = np.array([[1,2], [3,4]])
tensor = torch.from_numpy(np_array)
print("np_array", np_array)
print("tensor", tensor)
print("-" * 50)

tensor = torch.ones(3)
np_array = tensor.numpy()

print("tensor", tensor)
print("np_array", np_array)
print("-" * 50)
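The cell above converts a CPU tensor directly. When the tensor may live on the GPU, the round trip can be sketched as follows, assuming the same automatic device choice used later in this notebook. Calling .cpu() on a tensor that is already on the CPU simply returns it.

```python
import numpy as np
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

np_arr = np.array([[1.0, 2.0], [3.0, 4.0]])
t = torch.from_numpy(np_arr).to(device)  # NumPy -> tensor, then move to the chosen device
back = t.cpu().numpy()                   # bring the tensor back to the CPU before converting

print(type(back), back.shape)
```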

### Tensor Operations
#### Adding two tensors

In [None]:
x1 = torch.rand(2,3)
x2 = torch.rand(2,3)
y = x1+ x2 
print("sum non inplace", y)

#using inplace operations - marked with an underscore postfix
print("X2 before: ", x2)
x2.add_(x1)

print("X2 after: ", x2)


#### Matmul

In [None]:
x = torch.arange(6).view(2,3)
W = torch.arange(9).view(3,3)
print("X: ", x)
print("W: ", W)

In [None]:
h = torch.matmul(x, W)
print("h", h)
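For comparison, here is a small sketch of the other product functions from the list at the top of this section. The batched tensors `a` and `b` are made up for illustration; the 2-D case reuses the same values as the cell above, cast to float.

```python
import torch

x = torch.arange(6, dtype=torch.float32).view(2, 3)
W = torch.arange(9, dtype=torch.float32).view(3, 3)

h_matmul = torch.matmul(x, W)               # standard matrix product, shape [2, 3]
h_mm = torch.mm(x, W)                       # same result, but only for 2-D inputs
h_einsum = torch.einsum("ij,jk->ik", x, W)  # the same product in einsum notation

# Batched case: 4 independent [2, 3] @ [3, 5] products
a = torch.randn(4, 2, 3)
b = torch.randn(4, 3, 5)
out_bmm = torch.bmm(a, b)                   # shape [4, 2, 5]
out_matmul = torch.matmul(a, b)             # matmul handles the batch dimension too
out_broadcast = torch.matmul(a, b[0])       # a single [3, 5] matrix broadcast over the batch

print(h_matmul)
print(out_bmm.shape, out_broadcast.shape)
```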

### Tensor Reshaping

In [None]:
x = torch.arange(6)
print("x arange 6: ", x)
x = x.view(2,3)
print("X reshaped 2, 3: ",x)
x = x.view(3,2)
print("X reshaped 3, 2", x)

x = x.permute(1,0) #swapping dimension 0 and 1
print("X with dimensions 0 and 1 swapped (permute)",x)
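Note that view and permute are not interchangeable even when the resulting shapes match. A minimal sketch, starting from a fresh 2x3 tensor rather than the `x` of the cell above:

```python
import torch

x = torch.arange(6).view(2, 3)  # [[0, 1, 2], [3, 4, 5]]
viewed = x.view(3, 2)           # keeps the element order, only changes the row length
permuted = x.permute(1, 0)      # swaps the two dimensions, i.e. a transpose

print("view(3, 2):\n", viewed)
print("permute(1, 0):\n", permuted)
```

Both results have shape [3, 2], but only permute moves elements relative to their neighbours.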

### Indexing

In [None]:
x = torch.arange(12).view(3,4)
x

In [None]:
print("first column", x[:, 0])
print("second column", x[:, 1])

print("first row", x[0])
print("second row", x[1])

print("last two rows: ", x[1:3, :])


### Dynamic Computation Graph and Back Propagation
* Why do we need gradients?
    * Suppose we have defined a function, a neural net, that is supposed to compute a certain output y for an input vector x. We then define an error measure that tells us how wrong the network is, i.e. how badly it predicts y from x. Based on this error measure, we can use the gradients to update the weights that were responsible for the output, so that the next time we present x to the network, the output is closer to what we want

In [None]:
#first thing is to specify which tensors require gradients. By default when we create a tensor it does not require gradients
x = torch.ones((3,))
print(x.requires_grad)
print(x)

In [None]:
#we can change the requires grad for an existing tensor using x.requires_grad_()
x.requires_grad_(True)
print(x.requires_grad)

In [None]:
x = torch.arange(3, dtype = torch.float32, requires_grad = True)
print("X: -> ", x)

In [None]:
#building the computation graph
a = x+ 2
b = a**2
c = b + 3
y = c.mean()
print("Y: -> ", y)

Computation graph: x -> a = x + 2 -> b = a**2 -> c = b + 3 -> y = mean(c)

In [None]:
# We can now perform backpropagation on the computation graph by calling .backward() on the last output,
# which calculates the gradients for each tensor that has requires_grad=True
y.backward()
print(x.grad) #x.grad now contains the gradient, which indicates how a change in x would affect the output y given the current input (0,1,2)
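The result can be checked by hand. With y = (1/3) * sum((x_i + 2)^2 + 3), the chain rule gives dy/dx_i = 2 * (x_i + 2) / 3, so for x = (0, 1, 2) we expect (4/3, 2, 8/3). The sketch below rebuilds the same graph in one line and compares autograd against this formula; `manual` is just an illustrative name.

```python
import torch

x = torch.arange(3, dtype=torch.float32, requires_grad=True)
y = ((x + 2) ** 2 + 3).mean()  # same graph as above, written in one expression
y.backward()

manual = 2 * (x.detach() + 2) / 3  # hand-derived gradient
print("autograd:", x.grad)
print("manual:  ", manual)
```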

### GPU Support
* A GPU can perform many thousands of small operations in parallel, making it well suited for the large matrix operations in neural networks
* GPU availability: torch.cuda.is_available()
* choose automatically: torch.device('cuda') if torch.cuda.is_available() else torch.device("cpu")
* by default, all tensors you create are stored on the CPU. We can push a tensor to the GPU with .to(device) (or .cuda())

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

In [None]:
x = torch.zeros(2,3)
x = x.to(device)
print("X", x)

#### cpu vs gpu runtime

In [None]:
x = torch.randn(5000, 5000)

#CPU version
start_time = time.time()
_ = torch.matmul(x, x)
end_time = time.time()

print(f"CPU time: {(end_time-start_time):6.5f}s")

#GPU version
if torch.cuda.is_available():
    x = x.to(device)
    start = torch.cuda.Event(enable_timing= True)
    end = torch.cuda.Event(enable_timing= True)
    start.record()
    _ = torch.matmul(x,x)
    end.record()
    torch.cuda.synchronize()
    print(f"GPU time: {0.001 * start.elapsed_time(end):6.5f}s")  # Milliseconds to seconds

### Reproducibility while using GPU

In [None]:
# GPU operations have a separate seed that we also want to set
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)

#we want to ensure that all operations are deterministic on GPU (if used) for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

### Continuous XOR
* torch.nn defines a series of useful classes such as linear network layers, activation functions, and loss functions
* torch.nn.functional contains functions that are used in network layers
* nn.Module -> a neural network is built out of modules. Modules can contain other modules, and a neural network is itself considered a module

In [None]:
class SimpleClassifier(nn.Module):
    def __init__(self, num_inputs, num_hidden, num_outputs):
        super().__init__()
        #some init for my module
        self.linear1 = nn.Linear(num_inputs, num_hidden)
        self.act_fn = nn.Tanh()
        self.linear2 = nn.Linear(num_hidden, num_outputs)
        
    def forward(self, x):
        #function for performing the calculation of the module
        x = self.linear1(x)
        x= self.act_fn(x)
        x = self.linear2(x)
        return x
    
#a simple nn with 2 input neurons and four hidden neurons
model = SimpleClassifier(num_inputs = 2, num_hidden = 4, num_outputs =1)
print(model)

In [None]:
for name, param in model.named_parameters():
    print(f"Parameter {name}, shape {param.shape}")
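To see where these shapes come from, the sketch below rebuilds the same layer stack with nn.Sequential (a stand-in used only to keep the example self-contained, not the class defined above) and counts the parameters: 2*4 weights + 4 biases in the first layer, 4*1 weights + 1 bias in the second.

```python
import torch.nn as nn

# Stand-in with the same layers as SimpleClassifier(2, 4, 1)
model_sketch = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))

shapes = {name: tuple(p.shape) for name, p in model_sketch.named_parameters()}
num_params = sum(p.numel() for p in model_sketch.parameters())

print(shapes)
print("Total parameters:", num_params)
```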

### The dataset class
* torch.utils.data defines two data classes: data.Dataset and data.DataLoader
* data.Dataset provides a uniform interface to access the training/test data
    * we define __getitem__ to return the i-th data point in the dataset and __len__ to return the size of the dataset
* data.DataLoader efficiently loads and stacks the data points from the dataset into batches during training
    * batch_size: number of samples to stack per batch
    * shuffle: if True, the data is returned in a random order
    * num_workers: number of subprocesses to use for data loading
    * pin_memory: if True, copies tensors into CUDA pinned memory before returning them, which can save some time for large data points on GPUs (mainly useful for the training set)
    * drop_last: if True, the last batch is dropped in case it is smaller than the specified batch size
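The effect of batch_size and drop_last is easiest to see on a toy dataset. The sketch below uses data.TensorDataset with 10 made-up samples as an assumption for illustration; it is not the XOR dataset defined in this notebook.

```python
import torch
import torch.utils.data as data

# 10 toy samples with one feature each, plus integer labels
toy = data.TensorDataset(
    torch.arange(10, dtype=torch.float32).view(10, 1),
    torch.arange(10),
)

keep = data.DataLoader(toy, batch_size=4, shuffle=False, drop_last=False)
drop = data.DataLoader(toy, batch_size=4, shuffle=False, drop_last=True)

keep_sizes = [inputs.shape[0] for inputs, _ in keep]
drop_sizes = [inputs.shape[0] for inputs, _ in drop]

print("drop_last=False:", keep_sizes)
print("drop_last=True: ", drop_sizes)
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples, and drop_last=True discards it.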
    


In [None]:
class XORDataset(data.Dataset):
    def __init__(self, size, std=0.1):
        """
            inputs: 
                size - Number of data points we need to generate
                std - standard deviation of the noise (used in generate_continous_xor)
        """
        super().__init__()
        self.size = size
        self.std = std
        self.generate_continous_xor()
        
    def generate_continous_xor(self):
        """
            each data point in the XOR dataset has two variables , x and y that can either be 0 or 1
            the label is their XOR combination i.e 1 if only x or only y is 1 while the other is 0
            if x=y , the label is 0
        """
        data = torch.randint(low=0, high =2, size = (self.size, 2), dtype = torch.float32)
        label = (data.sum(dim=1) ==1).to(torch.long)
        # add a bit of Gaussian noise to the data points to make the task more challenging
        data += self.std * torch.randn(data.shape)
        
        self.data = data
        self.label  = label
        
    def __len__(self):
        #number of data points we have
        return self.size
    
    def __getitem__(self, idx):
        #return the idx-th data point of the dataset
        data_point = self.data[idx]
        data_label = self.label[idx]
        return data_point, data_label
        
        

In [None]:
dataset = XORDataset(size =200)
print("Size of dataset: ", len(dataset))
print("Data Point 0: ", dataset[0])

In [None]:
def visualize_samples(data, label):
    if isinstance(data, Tensor):
        data = data.cpu().numpy()
    if isinstance(label, Tensor):
        label = label.cpu().numpy()
        
    data_0 = data[label == 0]
    data_1 = data[label == 1]
    
    
    plt.figure(figsize=(4, 4))
    plt.scatter(data_0[:, 0], data_0[:, 1], edgecolor="#333", label="Class 0")
    plt.scatter(data_1[:, 0], data_1[:, 1], edgecolor="#333", label="Class 1")
    plt.title("Dataset samples")
    plt.ylabel(r"$x_2$")
    plt.xlabel(r"$x_1$")
    plt.legend()
    
visualize_samples(dataset.data, dataset.label)

### DataLoader

In [None]:
data_loader = data.DataLoader(dataset, batch_size =8, shuffle = True)

data_inputs , data_labels = next(iter(data_loader))

print("Data inputs", data_inputs.shape, "\n", data_inputs)
print("Data labels", data_labels.shape, "\n", data_labels)

### Optimization
* Get a batch from the data loader
* Obtain the predictions from the model for the batch
* Calculate the loss based on the difference between predictions and labels
* Backpropagation: calculate the gradients of the loss with respect to every parameter
* Update the parameters of the model in the direction opposite to the gradients, which decreases the loss

pytorch loss functions
* nn.BCELoss()
* nn.BCEWithLogitsLoss() - combines a sigmoid layer and the BCE loss in a single class(more stable)

pytorch optimizers
* torch.optim.SGD - SGD updates the parameters by multiplying the gradients by a small constant, called the learning rate, and subtracting the result from the parameters, so each step moves slowly towards a lower loss
    * optimizer.step() - updates the parameters based on the gradients as explained above
    * optimizer.zero_grad() - sets the gradients of all parameters to zero, a crucial step before performing backpropagation
        * we need to clear the gradients because backward() adds new gradients to the previous ones instead of overwriting them (a parameter might occur multiple times in a computation graph, and in that case the gradients must be summed rather than replaced)
        * call optimizer.zero_grad() before calculating the gradients of a batch
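Both points, gradient accumulation and the SGD update rule, can be checked on a single toy parameter. The following is a minimal sketch; the parameter `w` and the loss sum(w^2) are chosen only for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # gradient is 2 * w = [2, -4]
loss.backward()
first_grad = w.grad.clone()

# Second backward pass WITHOUT zero_grad: the new gradient is added to the old one
loss = (w ** 2).sum()
loss.backward()
accumulated_grad = w.grad.clone()  # [4, -8]

# Correct order: clear the gradients, backpropagate, then step
optimizer.zero_grad()
loss = (w ** 2).sum()
loss.backward()
optimizer.step()  # w <- w - lr * grad = [1, -2] - 0.1 * [2, -4]

print("first:", first_grad, "accumulated:", accumulated_grad)
print("updated w:", w.detach())
```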
        
      

    

In [None]:
loss_module = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
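Because the model outputs raw logits, nn.BCEWithLogitsLoss applies the sigmoid internally. As a quick sanity check, the sketch below compares it with nn.BCELoss applied to explicit sigmoid outputs; the logits and targets are made-up values.

```python
import torch
import torch.nn as nn

logits = torch.tensor([-2.0, 0.0, 3.0])   # raw model outputs (illustrative values)
targets = torch.tensor([0.0, 1.0, 1.0])

loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_logits.item(), loss_probs.item())
```

The two values agree up to floating-point error for moderate logits; the combined version is preferred because it stays stable for very large or very small logits.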

### Training
* Push our model to the desired device
* Set our model to training mode
    * certain modules need to perform a different forward step during training than during testing (e.g. BatchNorm and Dropout), and we can switch between the two modes using model.train() and model.eval()
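The model in this notebook contains no such layers, so the mode switch has no visible effect here. To illustrate the difference anyway, the sketch below uses a standalone nn.Dropout layer, which is an assumption for demonstration and not part of SimpleClassifier.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
out_train = drop(x)  # each entry is either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
out_eval = drop(x)   # in eval mode dropout passes the input through unchanged

print("train mode values:", out_train.unique())
print("eval mode equals input:", torch.equal(out_eval, x))
```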

In [None]:
train_dataset = XORDataset(size =1000)
train_data_loader = data.DataLoader(train_dataset, batch_size = 128, shuffle = True)
#push the model to the device of our choice (GPU if available)
model.to(device)

In [None]:
def train_model(model, optimizer, data_loader, loss_module, num_epochs=100):
    model.train() #set the model in training mode
    
    for epoch in tqdm(range(num_epochs)):
        for data_inputs , data_labels in data_loader:
            #step 1; Move input data to device (only strictly necessary if we use GPU)
            data_inputs = data_inputs.to(device)
            data_labels = data_labels.to(device)
            
            #step 2: Run the model on the input data
            preds = model(data_inputs)
            preds = preds.squeeze(dim=1) #Output is [Batch_size, 1] but we want [Batch_size] -remove unnecessary dimensions
            
            #step3: calculate the loss
            loss = loss_module(preds, data_labels.float())
            
            #step 4: perform backpropagation
            #Before calculating the gradients we need to ensure they are all zero
            #The gradients would not be overwritten , but actually added to existing ones
            optimizer.zero_grad()
            #perform backpropagation
            loss.backward()
            
            #step5: Update the parameters
            optimizer.step()

train_model(model, optimizer, train_data_loader, loss_module)

### Saving a model
* we extract the so-called state dict, which contains all learnable parameters

In [None]:
state_dict = model.state_dict()
print(state_dict)

In [None]:
#saving the state
torch.save(state_dict, "our_model.tar")

In [None]:
#load the state dict from the disk
state_dict = torch.load("our_model.tar")

#create a new model and load the state
new_model = SimpleClassifier(num_inputs = 2, num_hidden=4, num_outputs = 1)
new_model.load_state_dict(state_dict)

# Verify that the parameters are the same
print("Original model\n", model.state_dict())
print("\nLoaded model\n", new_model.state_dict())

### Evaluation
* we don't need to keep track of the computation graph because we do not intend to calculate gradients. Wrapping the evaluation in with torch.no_grad() deactivates graph tracking, which reduces the required memory and speeds up the model
* set the model to eval mode with model.eval()
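The effect of torch.no_grad() can be observed directly on a toy tensor. In this minimal sketch, `w` stands in for a model parameter.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = (w * 2).sum()        # recorded in the computation graph
with torch.no_grad():
    untracked = (w * 2).sum()  # same values, but no graph is built

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```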

In [None]:
test_dataset = XORDataset(size = 500)
test_data_loader = data.DataLoader(test_dataset, batch_size=128, shuffle =False, drop_last = False)


In [None]:
def eval_model(model, data_loader):
    model.eval()
    true_preds, num_preds = 0.0, 0.0
    with torch.no_grad():
        for data_inputs, data_labels in data_loader:
            
            data_inputs, data_labels = data_inputs.to(device), data_labels.to(device)
            preds = model(data_inputs)
            preds = preds.squeeze(dim=1)
            preds = torch.sigmoid(preds) #map predictions between 0 and 1
            pred_labels = (preds>=0.5).long() #binarize predictions
            
            true_preds += (pred_labels == data_labels).sum()
            num_preds += data_labels.shape[0]
            
    acc = true_preds / num_preds
    print(f"Accuracy of the model: {100.0*acc:4.2f}%")

In [None]:
new_model.to(device) #move the loaded model to device
eval_model(new_model, test_data_loader)


### Visualizing classification boundaries
* shows where the model has created decision boundaries and which points would be classified as 0 and which as 1
    * blue - class 0
    * orange - class 1
    * blurry - unsure

In [None]:
@torch.no_grad()  # Decorator, same effect as "with torch.no_grad(): ..." over the whole function.
def visualize_classification(model, data, label):
    if isinstance(data, Tensor):
        data = data.cpu().numpy()
    if isinstance(label, Tensor):
        label = label.cpu().numpy()
    data_0 = data[label == 0]
    data_1 = data[label == 1]

    plt.figure(figsize=(4, 4))
    plt.scatter(data_0[:, 0], data_0[:, 1], edgecolor="#333", label="Class 0")
    plt.scatter(data_1[:, 0], data_1[:, 1], edgecolor="#333", label="Class 1")
    plt.title("Dataset samples")
    plt.ylabel(r"$x_2$")
    plt.xlabel(r"$x_1$")
    plt.legend()

    # Let's make use of a lot of operations we have learned above
    model.to(device)
    c0 = Tensor(to_rgba("C0")).to(device)
    c1 = Tensor(to_rgba("C1")).to(device)
    x1 = torch.arange(-0.5, 1.5, step=0.01, device=device)
    x2 = torch.arange(-0.5, 1.5, step=0.01, device=device)
    xx1, xx2 = torch.meshgrid(x1, x2)  # Meshgrid function as in numpy
    model_inputs = torch.stack([xx1, xx2], dim=-1)
    preds = model(model_inputs)
    preds = torch.sigmoid(preds)
    # Specifying "None" in a dimension creates a new one
    output_image = (1 - preds) * c0[None, None] + preds * c1[None, None]
    output_image = (
        output_image.cpu().numpy()
    )  # Convert to numpy array. This only works for tensors on CPU, hence first push to CPU
    plt.imshow(output_image, origin="lower", extent=(-0.5, 1.5, -0.5, 1.5))
    plt.grid(False)


visualize_classification(new_model, dataset.data, dataset.label)
plt.show()