<a href="https://colab.research.google.com/github/marcelarosalesj/e2e-vision-apps/blob/main/Week_2_Project_Self_Driving_Car.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### This project is from [Abubakar Abid's](https://twitter.com/abidlabs) course: *Building Computer Vision Applications* on CoRise. Learn more about the course [here](https://corise.com/course/vision-applications).

# Week 2 Project: Building the "Eyes" of a Self-Driving Car

Welcome to the second week's project for *Building Computer Vision Applications*!

This week, we are going to get familiar with the key steps of machine learning, with a particular focus on image segmentation. Specifically, we will cover:

* finding image segmentation datasets and pretrained models 📖
* fine-tuning an image segmentation model on new data 👾
* building a computer vision app you can run on your phone or laptop 📷
* measuring the performance of a segmentation model on test data and the real world 📈

# Introduction

Self-driving cars are an exciting real-world application of machine learning, with the potential to save many lives each year. In order for self-driving cars to be fully autonomous, they need to "see" and "understand" the world around them. What are the machine learning algorithms that enable this? Let's take a look at [Tesla's website](https://www.tesla.com/AI): "Our per-camera networks analyze raw images to perform **semantic segmentation**..."

What is semantic segmentation? Semantic segmentation is the process of assigning a class to *every pixel in an image*. In week 1, we studied *image classification*, which assigns a class to the entire image. Semantic segmentation is a more fine-grained version, which recognizes that an image can be made up of different objects: for example, an image taken by a camera on a self-driving car could consist of pedestrians, trees, and other cars. Semantic segmentation is used in many other applications as well, such as medical machine learning, where it can be used to identify organs in radiological images. Rather than assigning a single label to the entire image, a semantic segmentation model assigns each pixel a category so that we understand both *what* is in an image and *where* it is.

By the end of this project, you'll have built an app that you can run on your laptop or phone that performs semantic segmentation on pictures of outdoor scenes and distinguishes the road from the cars, the pedestrians, and so on. It will look something like this:

![](https://i.ibb.co/RNv8MgQ/image.png)

# Step 0: Hardware Setup & Software Libraries

We will be utilizing GPUs to train our machine learning model, so we will need to make sure that our Colab notebook is set up correctly. Go to the menu bar and click on Runtime > Change runtime type > Hardware accelerator and **make sure it is set to GPU**. Your Colab notebook may restart once you make the change.

We're going to be using some fantastic open-source Python libraries to load our dataset (`datasets`), train our model (`transformers`), evaluate our model (`evaluate`), and build a demo of our model (`gradio`). So let's go ahead and install all of these libraries. 

In [1]:
!pip install datasets transformers evaluate gradio huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In Week 1, you created a Hugging Face account to upload your Gradio demo to Spaces. This week, we'll be uploading a model to your Hugging Face account *programmatically*! The first step is to log in using your Hugging Face token:

In [2]:
from huggingface_hub import notebook_login

In [3]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


In [4]:
!pip install wandb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
%env WANDB_PROJECT="CoRise-CV_applications_week_2"

env: WANDB_PROJECT="CoRise-CV_applications_week_2"


In [6]:
import wandb

In [7]:
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mmarcelarosalesj[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

# Step 1: Loading a Dataset

In this project, we will be using the `datasets` library, which can load tens of thousands of datasets with a single line of code. It can also be used to apply preprocessing functions. Learn more about the datasets library here: https://huggingface.co/docs/datasets/tutorial

Most datasets are divided into different splits. For example, you'll often see a *training* data subset, which is used to build the model, a *validation* data subset, which is used to measure the performance of the model while it is training, and a *test* dataset which is used to measure the performance of the model at the very end of training, and is usually considered to describe how well the model will perform in the real world (we'll come back to this).

Specifically, we will be using the `segments/sidewalk-semantic` dataset that is available for free from the Hugging Face Hub: https://huggingface.co/datasets/segments/sidewalk-semantic

* **Load the Semantic Sidewalk Dataset**

In [None]:
import datasets

dataset = datasets.load_dataset("segments/sidewalk-semantic")

* **Explore the dataset by running code below and reading the dataset card linked above. Answer the questions below**

In [None]:
print(dataset)

In [None]:
print(f"Size of image: {dataset['train'][0]['pixel_values'].size}")

In [None]:
for i in range(10):
  display(dataset['train'][i]['pixel_values'])

* How many training samples do we have? 
 * 1000
* What's the size of each image? 
 * 1920 x 1080
* How many categories are in this dataset's labels? 
 * From the documentation, there are 35 categories.
* Look at a random subset of ~10 training images, do you notice anything interesting about the images in the dataset? Are they as diverse/representative as you would expect or do they have limitations?
 * The images in the dataset are all from Belgium. Most of the roads have a single lane, and there is not much happening on them.
 * In my opinion, this dataset is not very diverse. A model trained on it would probably only work well in Belgium (or in a city with similar roads) when there are few cars or people on the streets.

* **Simplifying the Training Dataset**

You'll notice that the original dataset has many similar categories (for example, "vehicle-car" is a category, along with "vehicle-truck"). To simplify the training process, we will collapse together related categories. In the end, we will have 5 separate categories:
* 0: road/sidewalk/path
* 1: human
* 2: vehicles
* 3: other objects (e.g. traffic lights)
* 4: nature and background

For the purpose of this exercise, we will also make the images a lot smaller (64px by 64px) so that training is easier and faster. The following code processes the training images and labels.

We've written the function that applies this transformation to a given sample. Efficiently apply it to each item in the dataset, using, for example, 8 CPU workers (even then, this code may take a few minutes to run).

In [None]:
import numpy as np
from PIL import Image


num_classes = 5

def transform(sample):
    sample["pixel_values"] = sample["pixel_values"].convert("RGB").resize((64,64))
    sample["label"] = sample["label"].resize((64,64), Image.NEAREST)
    collapse_categories = {**{i: 0 for i in range(1, 8)}, 
                            **{i: 1 for i in range(8, 10)}, 
                            **{i: 2 for i in range(10, 18)}, 
                            **{i: 3 for i in range(18, 28)}}
    sample["label"] = np.vectorize(lambda x: collapse_categories.get(x, 4))(np.array(sample["label"]))
    return sample
    
dataset = dataset.map(transform, num_proc=8)
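To make the category collapsing concrete, here is a minimal pure-Python sketch of the same mapping applied to a tiny label mask. The `collapse` helper and the `toy_labels` values are made up for illustration; only the `collapse_categories` dictionary mirrors the cell above. Note that id 0 is not in the dictionary, so it falls back to class 4 like any other unmapped id.

```python
# Same mapping as in transform(): original label ids -> 5 coarse classes.
collapse_categories = {
    **{i: 0 for i in range(1, 8)},    # road / sidewalk / path
    **{i: 1 for i in range(8, 10)},   # human
    **{i: 2 for i in range(10, 18)},  # vehicles
    **{i: 3 for i in range(18, 28)},  # other objects
}


def collapse(label_rows):
    """Map each original id to its coarse class; unmapped ids fall back to 4."""
    return [[collapse_categories.get(x, 4) for x in row] for row in label_rows]


# Hypothetical 2x4 label mask with made-up original ids.
toy_labels = [[0, 2, 9, 10],
              [17, 18, 27, 30]]

collapsed = collapse(toy_labels)
print(collapsed)
```

The notebook cell does the same thing on a NumPy array with `np.vectorize`, which avoids writing the nested loop by hand.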

Finally, shuffle the dataset and split the dataset into a training dataset (with 99% of the samples) and a test dataset (with the remaining 1%). We have a very small test dataset so that the evaluation step is quick. If you were training a model in a more realistic setting, you would pick a bigger evaluation dataset.

You might find the `train_test_split()` method in the `datasets` library useful.

In [None]:
dataset = dataset['train'].train_test_split(test_size=0.01, shuffle=True)

train_ds = dataset["train"]
test_ds = dataset["test"]

In [None]:
train_ds

In [None]:
test_ds

In [None]:
# Size of samples
train_ds[0]['pixel_values'].size

After you run the steps above, examine the `train_ds` and `test_ds` objects, and confirm that the samples look as you expect. Specifically,

* How many training and test samples do we have?
 * training samples 990
 * test samples 10
* What's the size of each image?
 * 64 x 64
* What are the potential risks or downsides of having such a small test dataset?
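 * With only 10 test samples, the evaluation metrics are noisy and may not be representative of real-world data, so it is hard to tell reliably whether the model is overfitting or how well it will generalize.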
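One rough way to see why a 10-sample test set is risky is to look at the standard error of a measured score. The sketch below treats each test sample as an independent pass/fail trial, which is a simplifying assumption (segmentation metrics are computed per pixel and samples are not truly independent), and the accuracy value of 0.8 is made up for illustration.

```python
import math


def standard_error(accuracy, n_samples):
    """Standard error of a proportion estimated from n independent samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Hypothetical true accuracy of 0.8, measured on 10 vs. 1000 test samples.
se_small = standard_error(0.8, 10)
se_large = standard_error(0.8, 1000)
print(f"n=10:   +/- {se_small:.3f}")
print(f"n=1000: +/- {se_large:.3f}")
```

Under these assumptions, the uncertainty with 10 samples is ten times larger than with 1000 samples, since the standard error shrinks with the square root of the sample count.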

# Step 2: Loading a Pretrained Model

We will be using the `transformers` library, which can load tens of thousands of machine learning models with a few lines of code. It can also be used to fine-tune these models. Learn more about the `transformers` library here: https://huggingface.co/docs/transformers/index

Specifically, we will be using the `Segformer` model that is available for anyone from the Hugging Face Hub: https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512. While the details of this architecture are beyond the scope of this course, we will point out that it is based on transformers, just like the vision transformer (ViT) network we used last week for image classification. Also, notice that it has already been fine-tuned for detecting everyday objects. Starting from this checkpoint and fine-tuning it _further_ on our specific dataset speeds up the training process.

* **Load the Segformer Model and FeatureExtractor for Inference**

In [None]:
from transformers import AutoFeatureExtractor, SegformerForSemanticSegmentation
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

model.eval()
model.to(device);

We also need to load the **feature extractor** corresponding to the model, so that we can convert the input images into a feature vector that the model can take as input.

In [None]:
extractor = AutoFeatureExtractor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

# Step 3: Fine-tuning Your Model on the Dataset

## 3a. Preprocess the Dataset and Load the Metric

Off the shelf, the Segformer model will not be usable for the task that we have in mind, since it was trained for "general" image segmentation, not for the specific categories that we would like to predict. As a result, we will need to "fine-tune" our model.

Learn more about fine-tuning models with the `transformers` library here: https://huggingface.co/docs/transformers/training

We will also need to decide which metric to use for our task. Since our task is image segmentation, the `mean IOU` metric seems reasonable: https://huggingface.co/spaces/evaluate-metric/mean_iou

* **Preprocess the Dataset**

We will convert the images to feature vectors on the fly as we train the model using the `set_transform()` method. This time, the `transform()` has been left for you to write:

In [None]:
def transform(example_batch):
    images = [x for x in example_batch['pixel_values']]
    labels = [x for x in example_batch['label']]
    inputs = extractor(images, labels)
    return inputs

train_ds.set_transform(transform)
test_ds.set_transform(transform)

In [None]:
train_ds[0]['pixel_values'].shape

In [None]:
len(train_ds)

In [None]:
len(test_ds)

## 3b. Fine-Tune the Segformer Model on a Training Subset (and Overfit)

As we discussed in lecture, a good way to start training a model is by making sure that you are able to overfit on a small subset of the training dataset. Train your model on 10 images from your training dataset for 10 epochs. 

We will start by defining our training hyperparameters as a `TrainingArguments` instance.

Note that we leave the choice of learning rate to you. You may need to try different learning rates and batch sizes until you are able to overfit successfully on this training dataset.


In [None]:
train_subset_ds = train_ds.select(range(10))
train_subset_ds.set_transform(transform)

In [None]:
from transformers import TrainingArguments
from transformers import Trainer

lr = 0.0005
epochs = 10
batch_size = 2

training_args = TrainingArguments(
    "overfit-segmentation-model",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="steps",
    save_steps=20,
    eval_steps=5,
    logging_strategy='epoch',
    logging_steps=1,
    report_to='wandb'
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset_ds,
    eval_dataset=test_ds,
)

train_results = trainer.train()
trainer.save_model("saved_model_files")
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

* **Plot the Loss on the Training and Test Sets Over the 10 Epochs** 

In [None]:
import pandas as pd

df = pd.DataFrame(trainer.state.log_history)[['eval_loss', 'loss', 'epoch']]

df_eval_loss = df[['eval_loss', 'epoch']].dropna()
df_train_loss = df[['loss', 'epoch']].dropna()

df_epoch = df_eval_loss.merge(df_train_loss, on='epoch', how='left')

# plot.line() returns a matplotlib Axes, so name it `ax` rather than shadowing `plt`
ax = df_epoch.plot.line(x='epoch', y=['loss', 'eval_loss'])

In [110]:
df.columns

Index(['loss', 'learning_rate', 'epoch', 'step', 'eval_loss', 'eval_runtime',
       'eval_samples_per_second', 'eval_steps_per_second', 'train_runtime',
       'train_samples_per_second', 'train_steps_per_second', 'total_flos',
       'train_loss'],
      dtype='object')

* Is there any sign of overfitting? [ANSWER HERE]

## 3c. Fine-Tune the Segformer Model on the Entire Training Set

* **Load the Mean IoU Metric**

In addition to the loss, we now have to decide on a *metric* we will use to measure the performance of our machine learning model. A natural choice for image segmentation is *mean Intersection-over-Union (mean IoU)*: for each class, IoU is the area of overlap between the predicted segmentation and the ground truth divided by the area of their union, and mean IoU averages this value over the classes. It is probably the most common metric used for segmentation tasks.
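As a worked illustration of the metric, here is a minimal pure-Python sketch that computes per-class IoU and their mean on a flattened toy prediction. The function names and the toy data are made up for this example, and the `mean_iou` metric in the `evaluate` library may treat edge cases (such as ignored labels or classes absent from both masks) differently.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class over flattened label lists; None if the class never appears."""
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(pred, target, num_classes):
    """Average the IoU over the classes that appear in pred or target."""
    valid = [v for v in per_class_iou(pred, target, num_classes) if v is not None]
    return sum(valid) / len(valid)


# Hypothetical flattened masks with 3 classes and 6 pixels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

print(per_class_iou(pred, target, 3))
print(mean_iou(pred, target, 3))
```

For these toy masks the per-class values are 1/3, 2/3 and 1/2, so the mean IoU is 0.5.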

Read about the `evaluate` library, which contains many common machine learning metrics here: https://github.com/huggingface/evaluate

And use `evaluate.load()` to load the mean IoU metric:

In [None]:
import numpy as np
import evaluate
from torch import nn

# Load the mean IoU metric from the Hugging Face Hub
metric = evaluate.load("mean_iou")

We will need to write some code to apply the mean IoU metric to the outputs of the neural network. The model's logits are first upsampled to match the size of the labels and then converted to class predictions with an argmax. This code has already been written for you:

In [1]:
def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        logits_tensor = nn.functional.interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()
        metrics = metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_classes,
            ignore_index=255,
            reduce_labels=False,
        )
        for key, value in metrics.items():
            if isinstance(value, np.ndarray):
                metrics[key] = value.tolist()
        return metrics


Now, we will take all of the code that you have written and use it to fine-tune the Segformer model on the sidewalk segmentation dataset. Simply run the code below, and your model will fine-tune for 5 epochs. On a **GPU**, this should take about 30 minutes or less with the default settings.

**Important Note:** these default settings may **NOT** produce a very good segmentation model. For this task, you likely need significantly more training time. That is OK, the point of this exercise is not to train a highly-performant model, but to walk through the steps that would be needed to do that. We will **NOT** be looking at the performance of this model to grade your project. If you have been able to overfit on a small training subset (in part 3b), and the loss is going down in this part, that is sufficient.

In [2]:
lr = 0.1
epochs = 5
batch_size = 1

training_args = TrainingArguments(
    "regular-segmentation-model",
    learning_rate=lr,
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="steps",
    save_steps=200,
    eval_steps=200,
    logging_steps=20,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

trainer.train()


## 3d. Upload your model to the Hugging Face Hub!

In two lines of code, upload your feature extractor and model to the Hugging Face Hub!

In [None]:
extractor.push_to_hub("my-segmentation-model")

In [None]:
model.push_to_hub("my-segmentation-model")

What is the URL for your model on the Hub? [ANSWER HERE]

Please make sure that the model is **public**

# Step 4: Reporting Model Metrics

* **Plot the Loss and Mean IoU on the Training and Test Sets Over the 5 Epochs** 

In [None]:
# FILL HERE
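One possible way to approach this plot is to split `trainer.state.log_history` into separate training and evaluation curves. The sketch below is an assumption-laden illustration: the helper names are made up, the `fake_log_history` values are invented, and it assumes the evaluation entries carry an `eval_mean_iou` key (the `Trainer` prefixes keys returned by `compute_metrics` with `eval_`).

```python
def extract_curves(log_history):
    """Split Trainer-style log entries into (step, value) curves."""
    return {
        "train_loss": [(e["step"], e["loss"]) for e in log_history if "loss" in e],
        "eval_loss": [(e["step"], e["eval_loss"]) for e in log_history if "eval_loss" in e],
        "eval_mean_iou": [(e["step"], e["eval_mean_iou"])
                          for e in log_history if "eval_mean_iou" in e],
    }


def plot_curves(curves):
    """Plot loss and mean IoU side by side (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt

    fig, (ax_loss, ax_iou) = plt.subplots(1, 2, figsize=(10, 4))
    for name in ("train_loss", "eval_loss"):
        steps, values = zip(*curves[name])
        ax_loss.plot(steps, values, label=name)
    ax_loss.set_xlabel("step")
    ax_loss.legend()
    steps, values = zip(*curves["eval_mean_iou"])
    ax_iou.plot(steps, values, label="eval_mean_iou")
    ax_iou.set_xlabel("step")
    ax_iou.legend()
    return fig


# Made-up log entries in the shape of trainer.state.log_history.
fake_log_history = [
    {"step": 20, "epoch": 0.02, "loss": 1.2},
    {"step": 40, "epoch": 0.04, "loss": 0.9},
    {"step": 200, "epoch": 0.2, "eval_loss": 1.0, "eval_mean_iou": 0.15},
    {"step": 400, "epoch": 0.4, "eval_loss": 0.8, "eval_mean_iou": 0.22},
    {"step": 400, "epoch": 0.4, "train_loss": 1.05},  # final summary entry
]

curves = extract_curves(fake_log_history)
print(curves["eval_mean_iou"])
```

In the notebook, you would call `plot_curves(extract_curves(trainer.state.log_history))` after training. Note that the training loss logged by the `Trainer` is measured on training batches, so a training-set mean IoU would need a separate evaluation pass.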

* Is there any sign of overfitting? [ANSWER HERE]

# Step 5: Building a Demo

A high-level metric like mean test IoU doesn't give us a great idea of how the model will work when presented with new data from the real world. To understand this, we will build a web-based demo that we can use on our phones or computers through a web browser to test our model.

The `gradio` library lets you build web demos of machine learning models with just a few lines of code. Learn more about Gradio here: https://gradio.app/getting_started/

Gradio lets you build machine learning demos simply by specifying (1) a prediction function, (2) the input type and (3) the output type of your model. We have already written the prediction function here:

In [None]:
import matplotlib.pyplot as plt

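def classify(im):
  # Preprocess the PIL image and move the tensors to the same device as the model
  inputs = extractor(images=im, return_tensors="pt").to(device)
  with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)
  logits = outputs.logits
  # Pick the highest-scoring class for every pixel
  classes = logits[0].detach().cpu().numpy().argmax(axis=0)
  # One RGB color per class (0: road, 1: human, 2: vehicles, 3: other objects, 4: background)
  colors = np.array([[128,0,0], [128,128,0], [0, 0, 128], [128,0,128], [0, 0, 0]], dtype=np.uint8)
  return colors[classes]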

* **Build a Gradio web demo of your image segmentation model and `launch()` it**

Create a `gradio.Interface` and launch it! For image segmentation, the input component should be an `Image` component that passes the image in as a "PIL" image, and the output should be an `Image` component as well.

In [None]:
import gradio as gr

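# The input arrives as a PIL image; the output is the color-coded segmentation map
interface = gr.Interface(fn=classify, inputs=gr.Image(type="pil"), outputs=gr.Image())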

interface.launch(debug=True)

# Step 6: Trying your Model with "Real World" Data!

* **Use the share link created above to open up your app on your phone**

Now test your model on some real images -- perhaps you can go outside and take a picture of your car. Or you can upload a picture of a road you found online. Although your model may not have been trained for very long, is it still able to distinguish any object classes? Why do you think that may or may not be the case?  

[ANSWER HERE]

# Bonus: Extensions

Now that you've worked through the project and have a functioning app, what else can we try?
* **Try training the model to convergence.** For this project, we only trained the model for 5 epochs, which is far too little for a real image segmentation model. Instead you can let the model train until it fully converges. How far can you increase the mean IoU?
* **Try a zero-shot image segmentation model.** If you're tired of waiting for your model to train, you could try a zero-shot image segmentation model, which does not have to be retrained for specific applications. How well does a zero-shot segmentation model like [GroupViT](https://huggingface.co/nvidia/groupvit-gcc-yfcc) work for this problem?
* **Set up an inference widget.** After you uploaded your model to the Hugging Face Hub, you may have noticed a message on the right side of the screen saying, "Unable to determine this model’s pipeline type. Check the docs." This is usually where the inference widget goes, which allows anyone to try out your model directly from the web. Follow the docs to set up the inference widget for your model. 
* **Systematically explore different learning rates**: The learning rate is one of the most important hyperparameters when it comes to training machine learning models. Explore at least 8 different learning rates across 4 orders of magnitude. Which learning rates produce the best model?
* **Try training a segmentation model on the original data**: To speed up the learning process, we reduced the number of classes and the resolution of the images. Can you successfully train a model on the original data? This might require you to have Colab Pro, so that you can fit the images in the original resolution in memory.
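
For the learning-rate extension above, one way to pick 8 values across 4 orders of magnitude is to space them evenly on a log scale. This is only a sketch: the helper name and the range from `1e-6` to `1e-2` are assumptions chosen for illustration, not values prescribed by the course.

```python
import math


def log_spaced_learning_rates(low, high, count):
    """Return `count` learning rates spaced evenly on a log10 scale."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (count - 1)
    return [10 ** (log_low + i * step) for i in range(count)]


# Hypothetical sweep: 8 values from 1e-6 to 1e-2 (4 orders of magnitude).
learning_rates = log_spaced_learning_rates(1e-6, 1e-2, 8)
for lr in learning_rates:
    print(f"{lr:.2e}")
```

Each value could then be passed as `learning_rate` to a fresh `TrainingArguments`, reloading the pretrained model before every run so that the runs do not build on each other.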



