# Module 1 - Implementing and training a neural network

## Environment verification
Start by confirming you have PyTorch, TorchVision and TensorBoard installed.


In [1]:
import torch
import torchvision
from torch.utils.data import DataLoader

## QUESTIONS - General autonomous driving questions
This part presents questions about autonomous driving, some general and some specific to Formula Student. Read the relevant parts of the rulebook and the beginner's guide to answer them. Feel free to use the internet.

1. List some pros and cons of using a stereo camera versus LiDAR versus RADAR for perception. You can research industry examples to see why companies choose specific sensors over others.


Sensors in Autonomous Vehicles: Choosing the Best Approach
Autonomous vehicles commonly combine several sensor types for environment perception. The most widely used are LiDAR, RADAR, and cameras, which complement each other to give a more complete and redundant view of the surroundings.

LiDAR:

Advantages:

Distance Accuracy: LiDAR is highly accurate in measuring distances, crucial for collision avoidance.

Light Independence: It works well in various lighting conditions, including complete darkness.

Fast Scanning: It can quickly create 3D point clouds, enabling real-time perception.

Disadvantages:

High Cost: LiDAR sensors can be expensive, especially high-quality ones.

Susceptible to Obstructions: Opaque objects can block the laser beam, causing blind spots.

Less Color Information: It doesn't provide color information, limiting some applications.

Cameras:

Advantages:

Low Cost: Cameras are relatively affordable compared to other options.

High Resolution: They can capture detailed images, useful for obstacle detection and navigation.

Color and Texture: Cameras capture color and texture information, beneficial for object recognition.

Disadvantages:

Lighting and Weather Sensitivity: Performance can degrade in low light, intense sunlight (glare), rain, or snow.

Complex Processing: Stereoscopic image processing can be computationally expensive and requires powerful hardware.

Depth Challenge: Estimating depth accurately at long distances can be challenging.

RADAR:

Advantages:

Works in All Weather Conditions: RADAR is robust and performs well in rain, snow, and fog.

Long Range: It can detect objects at long distances, ideal for high-speed vehicle detection.

Less Affected by Reflective Surfaces: RADAR is less sensitive to reflective surfaces than LiDAR.

Disadvantages:

Lower Spatial Resolution: Compared to cameras and LiDAR, RADAR has lower spatial resolution and doesn't provide detailed object shape information.

Complex Interactions: Interpreting RADAR signals in scenarios with multiple objects can be complicated.

Considerable Cost: RADAR sensors can still be expensive.

Use Examples:

Cameras:

Tesla: Tesla uses cameras in its advanced driver-assistance systems, including Autopilot.

Waymo: Waymo, a subsidiary of Alphabet and a leader in autonomous vehicles, combines cameras with other technologies.

Mobileye: Intel's Mobileye provides advanced computer vision solutions and cameras for autonomous vehicles and driver-assistance systems.

LiDAR:

Velodyne Lidar: Velodyne is a leading LiDAR sensor manufacturer, supplying several automotive and autonomous technology companies.

Luminar: Luminar focuses on developing LiDAR sensors and provides technology for autonomous vehicles.

Aurora: Autonomous vehicle company Aurora uses LiDAR technology in its autonomous vehicles and transport systems.

RADAR:

Bosch: Bosch is one of the major manufacturers of RADAR sensors for the automotive industry, used in advanced driver-assistance systems.

Continental: Continental provides automotive radar systems to various car manufacturers.

Uber ATG (Advanced Technologies Group): Uber ATG used RADAR sensors in its autonomous vehicles before selling the autonomous vehicle division.

In conclusion, the choice of the ideal sensor depends on the project's specific requirements, the available budget, and operational conditions. A common approach is to combine multiple sensors to create redundancy and maximize environmental perception. This strategy compensates for individual weaknesses and results in safer and more efficient autonomous vehicle systems.

Extra Sensors:

Ultrasonic:

Advantages: Low cost, detection of obstacles at short distances, effective in parking maneuvers.

Disadvantages: Limited range, does not provide color information, low resolution.

Applications: Parking, detection of obstacles at short distances.

Inertial Sensors (IMU):

Advantages: Measures acceleration and rotation, useful for detecting changes in the vehicle's position and orientation.

Disadvantages: Does not provide information about objects in the environment.

Applications: Complement for navigation and control systems.

Additional Video: https://www.youtube.com/watch?v=qbxx7dsVLkw&list=PLtuNXpGOPQ_aeLQNxB4rLzfb8uktPABU9&index=3


2. Stereo cameras are capable of perceiving both color and depth for each pixel. These cameras can be bought as plug-and-play solutions (for example Intel RealSense or StereoLabs ZED 2) or self-made using industrial cameras (for example Basler). Computing depth from multiple cameras requires processing, called "depth estimation", which is done onboard in the plug-and-play solutions. Which solution would you opt for if you had a small team with a tight budget? Consider complexity, reliability and cost in your decision.

Development of a "Self-Made" Stereo Vision Solution:

Hardware Component Selection:

Camera Selection: We need to choose high-quality stereo cameras that are compatible and have features such as proper synchronization and the ability to capture high-resolution images.

Additional Sensor Selection: In addition to cameras, we may need additional sensors such as gyroscopes and accelerometers to improve depth estimation accuracy.

Camera Calibration:
Camera calibration is a critical process. This involves determining the intrinsic (camera properties) and extrinsic (relative position and orientation relationships) parameters of stereo cameras.
Calibrating the lenses and ensuring that the cameras are correctly aligned is essential for obtaining accurate depth information.

Image Acquisition:
Implementation of a system to capture images from both stereo cameras simultaneously.
Precise synchronization of the cameras to ensure that the images are aligned in time.

Image Processing for Stereo Matching:
Implementation of stereo matching algorithms to find correspondences between points in the left and right camera images.
The disparity calculated from these correspondences is used to estimate depth.

Depth Calibration:
Calibrating the depth output to convert depth information into real-world units.

Integration with the Application:
Integration of the generated depth information with your application or system to meet project requirements.

Optimization and Improvements:
Optimization of stereo matching algorithms to improve accuracy and performance.

Testing and Validation:
Conducting rigorous tests to ensure that the system provides accurate and reliable depth information.

Ongoing Maintenance:
Addressing potential issues, software updates, and continuous system maintenance.
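
The link between disparity and depth in the stereo matching step above can be sketched with the standard pinhole relation Z = f * B / d for a rectified camera pair. The function name, focal length, baseline, and disparity values below are made-up illustration choices, not the specifications of any camera mentioned here.

```python
def depth_from_disparity(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Depth (metres) of a point seen by a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline.
FOCAL_LENGTH_PX = 700.0
BASELINE_M = 0.12

for d in (70.0, 7.0, 3.5):
    z = depth_from_disparity(d, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"disparity {d:5.1f} px -> depth {z:5.2f} m")
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more at long range than at short range, which is the depth challenge listed for cameras in question 1.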

Advantages of Plug-and-Play Solutions (Intel RealSense, StereoLabs ZED 2):

Ease of Use: These solutions are designed to be user-friendly, allowing you to get started quickly without the need to assemble a complex system.

Pre-Calibration: Cameras come pre-calibrated, eliminating the need for manual calibration.

Embedded Processing: Plug-and-play solutions include embedded processors that perform real-time depth processing, eliminating the need to implement stereo matching algorithms.

APIs and Documentation: Well-documented APIs are provided for integration with your applications.

Support and Updates: You receive support from the manufacturing company and regular firmware and software updates.

If I had a small team with a limited budget, I would opt for a plug-and-play solution such as the Intel RealSense or StereoLabs ZED 2. These already include depth estimation, so the team does not have to implement, calibrate, and maintain its own stereo pipeline. Even if the unit price turns out higher than a pair of industrial cameras, the reduction in development complexity and workload can offset the additional cost.


3. In an autonomous car, monitoring of and reaction to critical failures are essential to prevent uncontrolled behavior. According to the rulebook and the beginner's guide, what must happen if the car detects a camera and/or LiDAR malfunction? Select the correct option(s), mentioning the relevant rule(s) you found:
    1. Play a sound using the TSAC.
    2. Eject the processing computer.
    3. Activate the EBS.
    4. Send a text message to the officials notifying the issue.
    5. Autonomously approach the ASR to perform a safe shutdown.

Answer: 3

"Concerning the high-level parts of the AS that rely on a variety of different sensor inputs, the system shall detect, if any of those is malfunctioning. If the proper vehicle operation cannot be ensured (e.g. loss of environmental perception) the system shall react by activating the EBS immediately."

From: FSG23_AS_Beginners_Guide_v1.1.pdf

4. Usually an autonomous driving pipeline is divided into perception, planning and control. Which algorithms are most commonly used by Formula Student teams in each of these stages? You can research other teams' social media or FSG Academy, for example.

Perception: MLP, CNN, Image Processing Algorithms, RANSAC, EKF, AHRS

Planning: Trajectory Planning Algorithms, SLAM

Control: PID Controllers and Model-Based Control, EKF


## Dataset
The dataset used is the well-known MNIST, which consists of grayscale images of handwritten digits (0 to 9), each 28 pixels wide and 28 pixels high.

The goal of most models using this dataset, including ours, is to classify the digit shown in each image.

Download the training and validation datasets:

In [2]:
training_set: torch.utils.data.Dataset = torchvision.datasets.MNIST("./data", train=True, download=True, transform=torchvision.transforms.ToTensor())
validation_set: torch.utils.data.Dataset = torchvision.datasets.MNIST("./data", train=False, download=True, transform=torchvision.transforms.ToTensor())

## Part 1 - MLP evaluation

Import the example MLP:

In [3]:
from bobnet import BobNet

Create an instance of this model:

In [4]:
model1 = BobNet()

Define the hyperparameters for this model:

In [5]:
# batch size
MLP_BATCH_SIZE=64

# learning rate
MLP_LEARNING_RATE=0.001

# momentum
MLP_MOMENTUM=0.9

# training epochs to run
MLP_EPOCHS=10

Create the training and validation dataloaders from the datasets downloaded earlier:

In [6]:
# create the training loader
mlp_training_loader = DataLoader(training_set, batch_size=MLP_BATCH_SIZE, shuffle=True) 

# create the validation loader (no shuffling: sample order does not affect evaluation)
mlp_validation_loader = DataLoader(validation_set, batch_size=MLP_BATCH_SIZE, shuffle=False)
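
As a quick sanity check on the loaders, the number of batches per epoch is the dataset size divided by the batch size, rounded up. The sketch below uses placeholder tensors with MNIST's split sizes (60,000 training and 10,000 validation samples) instead of the real images, so it runs without downloading anything; `batches_per_epoch` and the `fake_*` names are introduced here for illustration only.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches a DataLoader yields when drop_last=False (the default)."""
    return math.ceil(num_samples / batch_size)


# Placeholder datasets: one dummy value per sample, only the lengths matter here.
fake_training_set = TensorDataset(torch.zeros(60_000, 1))
fake_validation_set = TensorDataset(torch.zeros(10_000, 1))

train_batches = len(DataLoader(fake_training_set, batch_size=64))
val_batches = len(DataLoader(fake_validation_set, batch_size=64))

print(train_batches, batches_per_epoch(60_000, 64))
print(val_batches, batches_per_epoch(10_000, 64))
```

The resulting counts, 938 and 157, are the denominators that appear in the training log further down.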

Define the loss function and the optimizer:

In [7]:
mlp_loss_fn = torch.nn.CrossEntropyLoss()

mlp_optimizer = torch.optim.SGD(model1.parameters(), lr=MLP_LEARNING_RATE, momentum=MLP_MOMENTUM)
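
`torch.nn.CrossEntropyLoss` takes raw scores (logits) and integer class labels, and applies log-softmax internally. A useful reference point, sketched below with made-up inputs, is the loss of a model that gives equal scores to all 10 digits: it equals ln(10), about 2.303, which is typically close to where an untrained classifier starts.

```python
import math

import torch

loss_fn = torch.nn.CrossEntropyLoss()

# A batch of 4 samples with identical logits for all 10 classes,
# i.e. a model that has no preference between digits.
uniform_logits = torch.zeros(4, 10)
labels = torch.tensor([3, 1, 4, 1])  # arbitrary example labels

uniform_loss = loss_fn(uniform_logits, labels).item()
print(uniform_loss, math.log(10))  # both are about 2.3026
```

A training loss that stays near this value suggests the model is still predicting close to uniformly.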

Run the training and validation:

In [8]:
import utils

# how many batches between logs
LOGGING_INTERVAL=100

utils.train_model(model1, MLP_EPOCHS, mlp_optimizer, mlp_loss_fn, mlp_training_loader, mlp_validation_loader, LOGGING_INTERVAL)

Epoch 0 (99/938): training_loss = 2.3244037170602816
Epoch 0 (199/938): training_loss = 2.311289117563909
Epoch 0 (299/938): training_loss = 2.3059036030019806
Epoch 0 (399/938): training_loss = 2.3023344843011153
Epoch 0 (499/938): training_loss = 2.2993474040098323
Epoch 0 (599/938): training_loss = 2.2965415658457253
Epoch 0 (699/938): training_loss = 2.2938109979097425
Epoch 0 (799/938): training_loss = 2.2907284765279337
Epoch 0 (899/938): training_loss = 2.2870842132207683
Epoch 0 (99/157): validation_loss = 2.270613670349121
Epoch 1 (99/938): training_loss = 2.262201097276476
Epoch 1 (199/938): training_loss = 2.246020259569638
Epoch 1 (299/938): training_loss = 2.2368425820583484
Epoch 1 (399/938): training_loss = 2.228174024357234
Epoch 1 (499/938): training_loss = 2.220615878611624
Epoch 1 (599/938): training_loss = 2.211533604559795
Epoch 1 (699/938): training_loss = 2.2028195039397147
Epoch 1 (799/938): training_loss = 2.193259028827443
Epoch 1 (899/938): training_loss = 2.

tensor(1.6519)
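
The implementation of `utils.train_model` is not shown in this notebook. As a rough idea of what one training epoch with the loss and optimizer above typically involves, here is a minimal sketch. `train_one_epoch`, the tiny linear model, and the random data are all assumptions made for illustration and are not taken from `utils.py`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, optimizer, loss_fn, loader) -> float:
    """Run one pass over `loader` and return the mean training loss."""
    model.train()
    running_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()            # clear gradients from the previous step
        loss = loss_fn(model(inputs), labels)
        loss.backward()                  # backpropagate
        optimizer.step()                 # update the weights
        running_loss += loss.item()
    return running_loss / len(loader)


# Stand-in data with MNIST-like shapes: 256 random "images" and random labels.
torch.manual_seed(0)
images = torch.rand(256, 1, 28, 28)
targets = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, targets), batch_size=64, shuffle=True)

# Hypothetical toy model: flatten the image and apply one linear layer.
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.001, momentum=0.9)

mean_loss = train_one_epoch(toy_model, toy_optimizer, torch.nn.CrossEntropyLoss(), loader)
print(f"mean training loss: {mean_loss:.4f}")
```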

### QUESTIONS
Explore the architecture in the script `mod1/bobnet.py`.
1. Why does the input layer have 784 inputs? Consider the MNIST dataset samples' characteristics.

The input layer has 784 inputs because each MNIST image is 28 pixels wide and 28 pixels high, with a single channel. The MLP receives the image flattened into one vector with one input per pixel: 28 × 28 = 784.
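
The 784 figure can be checked on a dummy batch. Whether `BobNet` flattens its input exactly this way is an assumption, since `bobnet.py` is not reproduced here.

```python
import torch

# A dummy batch shaped like MNIST after ToTensor(): (batch, channels, height, width).
batch = torch.zeros(64, 1, 28, 28)

# Flatten everything except the batch dimension: 1 * 28 * 28 = 784 values per image.
flat = torch.flatten(batch, start_dim=1)
print(flat.shape)  # torch.Size([64, 784])
```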

2. Why does the output layer have 10 outputs?

The output layer has 10 outputs because MNIST has 10 classes, one per digit from 0 to 9. The network takes an image as input and produces one score per class; the predicted digit is the class with the highest score.
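
As a small illustration of how 10 outputs become a digit prediction, the sketch below uses made-up scores for a single image rather than real model output.

```python
import torch

# Made-up scores for one image, one value per digit class 0..9.
logits = torch.tensor([[0.1, -1.2, 0.3, 2.5, 0.0, -0.7, 0.4, 1.1, -0.3, 0.2]])

probabilities = torch.softmax(logits, dim=1)   # sums to 1 across the 10 classes
predicted_digit = torch.argmax(logits, dim=1)  # index of the highest score

print(predicted_digit.item())  # 3
```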

## Part 2 - CNN implementation

Head over to the `cnn.py` file and implement a convolutional architecture (add some convolutional layers and fully connected layers). You can look up the LeNet or AlexNet architectures for insight and inspiration (a simpler version with fewer layers is fine). 2D convolutional layers in PyTorch are created using the `torch.nn.Conv2d` class. Activation and loss functions can be found under `torch.nn.functional` (like ReLU and softmax).
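
For orientation, here is one possible LeNet-style layout for 28 x 28 single-channel inputs. It is only a sketch: the class name `SketchCNN` and the layer sizes are assumptions chosen for illustration, not the contents of `cnn.py` or a reference solution.

```python
import torch
import torch.nn.functional as F


class SketchCNN(torch.nn.Module):
    """LeNet-style network for 1x28x28 inputs (illustrative, not the reference solution)."""

    def __init__(self) -> None:
        super().__init__()
        self.conv1 = torch.nn.Conv2d(1, 6, kernel_size=5, padding=2)  # 1x28x28 -> 6x28x28
        self.conv2 = torch.nn.Conv2d(6, 16, kernel_size=5)            # 6x14x14 -> 16x10x10
        self.fc1 = torch.nn.Linear(16 * 5 * 5, 120)
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 6x14x14
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 16x5x5
        x = torch.flatten(x, start_dim=1)           # -> 400 values per image
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                          # raw logits, no softmax


sketch_logits = SketchCNN()(torch.zeros(2, 1, 28, 28))
print(sketch_logits.shape)  # torch.Size([2, 10])
```

The forward pass returns raw logits because `torch.nn.CrossEntropyLoss`, the loss used in this notebook, applies log-softmax itself, so an extra softmax at the end of the network is not needed.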

In [9]:
import torch
from cnn import CNN

In [10]:
model2 = CNN()

In [11]:
# batch size
MLP_BATCH_SIZE=64

# learning rate
MLP_LEARNING_RATE=0.001

# momentum
MLP_MOMENTUM=0.9

# training epochs to run
MLP_EPOCHS=10

In [12]:
# create the training loader
mlp_training_loader = DataLoader(training_set, batch_size=MLP_BATCH_SIZE, shuffle=True)

# create the validation loader (no shuffling: sample order does not affect evaluation)
mlp_validation_loader = DataLoader(validation_set, batch_size=MLP_BATCH_SIZE, shuffle=False)

In [13]:
mlp_loss_fn = torch.nn.CrossEntropyLoss()

mlp_optimizer = torch.optim.SGD(model2.parameters(), lr=MLP_LEARNING_RATE, momentum=MLP_MOMENTUM)

In [None]:
import utils

# how many batches between logs
LOGGING_INTERVAL=100

utils.train_model(model2, MLP_EPOCHS, mlp_optimizer, mlp_loss_fn, mlp_training_loader, mlp_validation_loader, LOGGING_INTERVAL)

Epoch 0 (99/938): training_loss = 2.3258340117907284
Epoch 0 (199/938): training_loss = 2.314182950024629
Epoch 0 (299/938): training_loss = 2.3102758169971582
Epoch 0 (399/938): training_loss = 2.3083539289938177
Epoch 0 (499/938): training_loss = 2.3071903250738233
Epoch 0 (599/938): training_loss = 2.306415553881051
Epoch 0 (699/938): training_loss = 2.3058560382313655
Epoch 0 (799/938): training_loss = 2.305422584166067
Epoch 0 (899/938): training_loss = 2.3050860144007324
Epoch 0 (99/157): validation_loss = 2.3256568908691406
Epoch 1 (99/938): training_loss = 2.325635221269396
Epoch 1 (199/938): training_loss = 2.313941799815576
Epoch 1 (299/938): training_loss = 2.310120033181232
Epoch 1 (399/938): training_loss = 2.308163289139444
Epoch 1 (499/938): training_loss = 2.3069919522157414
Epoch 1 (599/938): training_loss = 2.306235903292546
Epoch 1 (699/938): training_loss = 2.3056740382198613
Epoch 1 (799/938): training_loss = 2.3052605221954843
Epoch 1 (899/938): training_loss = 2.

### QUESTIONS

1. What are the advantages of using convolutional layers versus fully-connected layers for image processing?


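Parameter Sharing: Convolutional layers reuse the same filter weights at every image position, which greatly reduces the number of parameters. Fully connected layers need one weight per input-output pair, so they have far more parameters and are more prone to overfitting.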

Spatial Hierarchy: Convolutional layers capture a hierarchy of features, from low to high-level, preserving spatial structure. Fully connected layers do not preserve spatial structure.

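Translation Invariance: Convolutional layers apply the same filter everywhere, so a feature is detected wherever it appears in the image. Strictly, convolution is translation-equivariant, and pooling adds approximate invariance. Fully connected layers do not possess this property.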

Efficiency: Convolutional layers are efficient, reusing weights, while fully connected layers can be computationally expensive, especially with large images.

Local Receptive Fields: Convolutional layers use local receptive fields, capturing local details. Fully connected layers are not as effective in this regard.

Feature Hierarchies: CNNs learn feature hierarchies, useful for image processing tasks. Fully connected layers do not have this advantage.
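
The translation property in the list above can be checked numerically: shifting the input of a convolution shifts its output by the same amount. The sketch below uses a random image and circular padding, an assumption made so that the property holds exactly at the image borders as well.

```python
import torch

torch.manual_seed(0)

# One 3x3 filter; circular padding makes the shift property hold exactly at the borders.
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

image = torch.rand(1, 1, 8, 8)
shifted_image = torch.roll(image, shifts=2, dims=3)  # move content 2 pixels to the right

with torch.no_grad():
    shift_then_conv = conv(shifted_image)
    conv_then_shift = torch.roll(conv(image), shifts=2, dims=3)

same_result = torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6)
print(same_result)  # True
```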

In conclusion, convolutional layers are suitable for image processing tasks because they leverage the spatial structure of images, reduce the number of parameters, and effectively capture local and hierarchical features. Fully connected layers are often used in conjunction with convolutional layers for end-to-end learning tasks in neural networks.
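
To put a number on the parameter savings, the sketch below compares a small convolutional layer with a fully connected layer that produces an output of the same size from a 28 x 28 image. The layer sizes and the `count_parameters` helper are illustrative choices, not taken from the models in this notebook.

```python
import torch


def count_parameters(module: torch.nn.Module) -> int:
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())


# Both layers map a 1 x 28 x 28 image to 16 feature maps of 28 x 28.
conv_layer = torch.nn.Conv2d(1, 16, kernel_size=3, padding=1)
dense_layer = torch.nn.Linear(28 * 28, 16 * 28 * 28)

conv_params = count_parameters(conv_layer)    # 16 * (3*3*1) weights + 16 biases
dense_params = count_parameters(dense_layer)  # 784 * 12544 weights + 12544 biases

print(conv_params, dense_params)
```

In this example the convolutional layer needs 160 parameters, while the fully connected layer needs 9,847,040.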