<a href="https://colab.research.google.com/github/mhrgroup/course_self_supervised_learning/blob/main/Section%2003%3A%20Labeling%20Task/ssl_section03_lecture05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lecture 05: Labeling Challenges**

By the end of this lecture, you will be able to:

1. Describe the challenges of labeling large repositories.
2. Implement supervised labeling.


# **5.1. Labeling Large Repositories**

---

**Solution 1:** Manual labeling, for example:

* Manually label a repository of ten million animal images.

* Identify positive, negative, and neutral statements in thousands of text documents.

* Characterize material properties by purchasing ingredients and laboratory equipment and hiring experts to build and break thousands of material samples.

> ***Manual labeling is often a costly, labor-intensive, and time-consuming task.***

**Solution 2:** Manually label a small portion of the data first. Next, train a supervised learning model on the labeled portion and use it to label the rest.

* Let’s say we have ten million images of cats and dogs, with only 10,000 (0.1%) of them manually labeled. The problem is how to label the remaining 9,990,000 unlabeled images.

* Since only 0.1% of the data is available for training (i.e., the model must generalize to the other 99.9%), the model is likely to be inaccurate and/or overfitted.

* We can benefit from transfer learning and fine-tuning techniques to increase the accuracy of such supervised models.

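The proportions in the cats-and-dogs example can be checked with a few lines of arithmetic. This is only a sketch; the variable names are illustrative and not part of the lecture code.

```python
# Repository sizes taken from the cats-and-dogs example above.
num_total   = 10_000_000
num_labeled = 10_000

# Everything that was not labeled by hand still needs a label.
num_unlabeled = num_total - num_labeled
rate_labeled  = num_labeled / num_total

print('Unlabeled images  : {:,}'.format(num_unlabeled))
print('Labeled fraction  : {:.1%}'.format(rate_labeled))
print('Unlabeled fraction: {:.1%}'.format(1 - rate_labeled))
```
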
> ***Heads up: in either of these solutions, the unlabeled portion of the repository remains unused during learning.***

* Let's create experiments to practice supervised labeling using CIFAR-10.

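Before moving to CIFAR-10, the three steps of supervised labeling (label a small portion, train on it, label the rest) can be sketched on a toy problem. Everything here is an assumption made for illustration: the data are synthetic 2-D points, and a simple nearest-centroid classifier stands in for the deep network used later in this lecture.

```python
import random

random.seed(0)

def make_point(label):
    # Two hypothetical classes centered at (-2, -2) and (+2, +2).
    center = -2.0 if label == 0 else 2.0
    return (random.gauss(center, 1.0), random.gauss(center, 1.0))

# Hypothetical repository of 2,000 items. The true labels are kept
# only so that the sketch can score itself at the end.
labels_true = [random.randint(0, 1) for _ in range(2000)]
data        = [make_point(y) for y in labels_true]

# Step 1: manually label a small portion (here 5%).
num_labeled    = int(len(data) * 0.05)
datain_labeled = data[:num_labeled]
dataou_labeled = labels_true[:num_labeled]
datain_rest    = data[num_labeled:]

# Step 2: fit a supervised model on the labeled portion.
def fit_centroids(points, labels):
    centroids = {}
    for c in set(labels):
        members = [p for p, y in zip(points, labels) if y == c]
        centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids

def predict(centroids, point):
    # Return the class whose centroid is closest to the point.
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(point, centroids[c])))

centroids = fit_centroids(datain_labeled, dataou_labeled)

# Step 3: let the model label the rest of the repository.
labels_predicted = [predict(centroids, p) for p in datain_rest]

num_correct = sum(p == t for p, t in zip(labels_predicted,
                                         labels_true[num_labeled:]))
accuracy = num_correct / len(datain_rest)
print('Labeled by the model: {}'.format(len(labels_predicted)))
print('Agreement with true labels: {:05.2f}%'.format(accuracy * 100))
```

The toy classes are well separated, so the model labels almost everything correctly. With real images and very few labels, the agreement is usually much lower, which is what the experiments below explore.
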
> **Abbreviations:**
*	datain: input data
*	dataou: output data
*	te: testing
*	tf: tensorflow
*	tr: training

In [None]:
#@title Install necessary libraries & restart the session

# Install the required libraries using the `pip` package manager.
!pip install tensorflow==2.15

# Import the time module to add a delay before restarting the session.
import time

# Import `clear_output` from IPython to clear the notebook output, ensuring a clean display for the user.
from IPython.display import clear_output

# Clear the output after the packages are installed to make the notebook cleaner.
clear_output()

# Print a message to let the user know that the libraries are installed & the session will restart.
print("Necessary Libraries are Installed. Restarting the session!")

# Add a short delay (1 second) before restarting to allow the message to be displayed to the user.
time.sleep(1)

# Import the `os` module to access low-level operating system functionality.
import os

# Use `os._exit(00)` to exit the current Python runtime environment forcefully.
# This effectively simulates a restart in notebook environments like Google Colab or Jupyter.
# After this command, the environment will be restarted & all the packages installed will be properly loaded.
os._exit(00)

# **5.2. Supervised Labeling Experiments with Limited Labeled Data**
---

## 5.2.1. Supervised Labeling with Randomly Generated Parameters

In [None]:
#@title Import necessary libraries
import time
import tensorflow as tf

In [None]:
#@title Load and process the CIFAR-10 data
(datain_tr, dataou_tr), (datain_te, dataou_te) = tf.keras.datasets.cifar10.load_data()

datain_tr = datain_tr/255 # transform uint8 values to the range [0, 1]
datain_te = datain_te/255 # transform uint8 values to the range [0, 1]

dataou_tr = tf.keras.utils.to_categorical(dataou_tr)
dataou_te = tf.keras.utils.to_categorical(dataou_te)

print('Shape of datain_tr: {}'.format(datain_tr.shape))
print('Shape of datain_te: {}'.format(datain_te.shape))
print('Shape of dataou_tr: {}'.format(dataou_tr.shape))
print('Shape of dataou_te: {}'.format(dataou_te.shape))

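To make the two preprocessing steps concrete, here is a plain-Python sketch of what they do to a few hypothetical values. `scale_pixels` and `one_hot` are illustrative helpers written for this note; they are not TensorFlow functions.

```python
def scale_pixels(pixels):
    # Map 8-bit intensities (0..255) to floats in the range [0, 1].
    return [p / 255 for p in pixels]

def one_hot(label, num_classes):
    # Vector of zeros with a single 1.0 at the index of the class.
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

print(scale_pixels([0, 51, 255]))
print(one_hot(3, 10))
```

One-hot targets of this form are what the `categorical_crossentropy` loss expects when the output layer has one softmax unit per class.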

In [None]:
#@title Limit the labeled training data
'''
Let's say we use 5% of training inputs and outputs.
'''

rate_labeled = 0.05
num_labeled  = int(datain_tr.shape[0] * rate_labeled)

print("Number of labeled training data: {}".format(num_labeled))

# randomly select 5% of training data without replacement (no duplicate images)
index_tr  = tf.random.shuffle(tf.range(datain_tr.shape[0]))[:num_labeled].numpy()

datain_tr = datain_tr[index_tr,:,:,:]
dataou_tr = dataou_tr[index_tr,:]

# print shapes of data

print('Shape of datain_tr: {}'.format(datain_tr.shape))
print('Shape of datain_te: {}'.format(datain_te.shape))
print('Shape of dataou_tr: {}'.format(dataou_tr.shape))
print('Shape of dataou_te: {}'.format(dataou_te.shape))

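When a labeled subset is picked by random indices, it matters whether the same index can be drawn twice. The sketch below uses only Python's `random` module and the CIFAR-10 training-set size of 50,000 to compare both ways of drawing 5% of the indices. The variable names are illustrative.

```python
import random

random.seed(0)

num_total   = 50_000                     # CIFAR-10 training images
num_labeled = int(num_total * 0.05)      # 5% of the training set

# With replacement: the same image can be selected more than once.
index_with = [random.randrange(num_total) for _ in range(num_labeled)]

# Without replacement: every selected image is distinct.
index_without = random.sample(range(num_total), num_labeled)

unique_with    = len(set(index_with))
unique_without = len(set(index_without))

# Expected number of distinct indices when drawing with replacement.
expected_unique = num_total * (1 - (1 - 1 / num_total) ** num_labeled)

print('Distinct images (with replacement)   : {}'.format(unique_with))
print('Distinct images (without replacement): {}'.format(unique_without))
print('Expected distinct (with replacement) : {:.1f}'.format(expected_unique))
```

Duplicates reduce the number of distinct labeled images slightly, and a duplicated image can end up in both the training and the validation portion of the subset.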

In [None]:
#@title Create a model similar to DenseNet121

layerin = tf.keras.Input(shape=(datain_tr.shape[1],
                                datain_tr.shape[2],
                                datain_tr.shape[3]))

upscale = tf.keras.layers.Lambda(lambda x: tf.image.resize_with_pad(x,
                                                                    160,
                                                                    160,
                                                                    method=tf.image.ResizeMethod.BILINEAR))(layerin)

model_DenseNet121 = tf.keras.applications.DenseNet121(include_top  = False,
                                                      weights      = None,
                                                      input_shape  = (160,160,3),
                                                      input_tensor = upscale,
                                                      pooling      = 'max')

layerou = tf.keras.layers.Dense(dataou_tr.shape[-1], activation = 'softmax')

model   = tf.keras.models.Sequential([model_DenseNet121,
                                      tf.keras.layers.BatchNormalization(),
                                      layerou])


model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
              loss      = 'categorical_crossentropy',
              metrics   = ['accuracy'])

print('\nsummary of model_DenseNet121:\n')
model_DenseNet121.summary()

print('\nsummary of model:\n')
model.summary()

In [None]:
#@title Train the model for five epochs
t_start = time.time()

history = model.fit(datain_tr,
                    dataou_tr,
                    epochs           = 5,
                    batch_size       = 128,
                    verbose          = 1,
                    shuffle          = True,
                    validation_split = 0.05)

t_end   = time.time()

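The `validation_split = 0.05` argument holds out part of the already small labeled subset. Keras takes these samples from the end of the arrays before shuffling and does not train on them. The following sketch works out the resulting sizes, assuming 2,500 labeled images (5% of 50,000) and the batch size of 128 used above. The variable names are illustrative.

```python
import math

num_labeled      = 2500    # assumed: 5% of the 50,000 training images
validation_split = 0.05
batch_size       = 128

# Samples held out for validation and samples left for gradient updates.
num_val   = round(num_labeled * validation_split)
num_train = num_labeled - num_val

# Number of batches (weight updates) per epoch; the last batch is smaller.
steps_per_epoch = math.ceil(num_train / batch_size)

print('Validation samples: {}'.format(num_val))
print('Training samples  : {}'.format(num_train))
print('Steps per epoch   : {}'.format(steps_per_epoch))
```
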

In [None]:
#@title Compute testing accuracy
_, accuracy_te = model.evaluate(datain_te,
                                dataou_te,
                                batch_size = 128)

print('\nTraining time: {:06.2f} sec'.format(t_end - t_start))
print('\nTesting accuracy: {:05.2f}%'.format(accuracy_te * 100))

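Measuring accuracy is only part of supervised labeling: the softmax outputs of the trained model still have to be turned into class labels for the unlabeled images. Below is a minimal sketch of that last step. The probability rows are made up for illustration (in practice they would come from a call such as `model.predict`), and the class names follow the standard CIFAR-10 order.

```python
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']

# Hypothetical softmax outputs for three unlabeled images.
probabilities = [
    [0.01, 0.02, 0.05, 0.70, 0.02, 0.15, 0.02, 0.01, 0.01, 0.01],
    [0.05, 0.03, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.80, 0.06],
    [0.02, 0.02, 0.04, 0.40, 0.04, 0.38, 0.04, 0.02, 0.02, 0.02],
]

def argmax(row):
    # Index of the largest probability in the row.
    return max(range(len(row)), key=row.__getitem__)

labels      = [argmax(row) for row in probabilities]
names       = [class_names[i] for i in labels]
confidences = [max(row) for row in probabilities]

for name, confidence in zip(names, confidences):
    print('{:<10s} (confidence {:.2f})'.format(name, confidence))
```

The third row shows why such labels need care: the model is almost evenly split between cat and dog, yet the image still receives a single hard label.
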
In [None]:
#@title Clean up memory
%reset

## 5.2.2. Supervised Labeling with Transfer Learning and Fine-Tuning

In [None]:
#@title Import necessary libraries
import time
import tensorflow as tf

In [None]:
#@title Load and process the CIFAR-10 data
(datain_tr, dataou_tr), (datain_te, dataou_te) = tf.keras.datasets.cifar10.load_data()

datain_tr = datain_tr/255 # transform uint8 values to the range [0, 1]
datain_te = datain_te/255 # transform uint8 values to the range [0, 1]

dataou_tr = tf.keras.utils.to_categorical(dataou_tr)
dataou_te = tf.keras.utils.to_categorical(dataou_te)

# print shapes of data

print('Shape of datain_tr: {}'.format(datain_tr.shape))
print('Shape of datain_te: {}'.format(datain_te.shape))
print('Shape of dataou_tr: {}'.format(dataou_tr.shape))
print('Shape of dataou_te: {}'.format(dataou_te.shape))


In [None]:
#@title Limit the labeled training data
'''
Let's say we use 5% of training inputs and outputs.
'''
rate_labeled = 0.05
num_labeled  = int(datain_tr.shape[0] * rate_labeled)

print("Number of labeled training data: {}".format(num_labeled))

# randomly select 5% of training data without replacement (no duplicate images)
index_tr  = tf.random.shuffle(tf.range(datain_tr.shape[0]))[:num_labeled].numpy()

datain_tr = datain_tr[index_tr,:,:,:]
dataou_tr = dataou_tr[index_tr,:]

# print shapes of data

print('Shape of datain_tr: {}'.format(datain_tr.shape))
print('Shape of datain_te: {}'.format(datain_te.shape))
print('Shape of dataou_tr: {}'.format(dataou_tr.shape))
print('Shape of dataou_te: {}'.format(dataou_te.shape))

In [None]:
#@title Create a model similar to DenseNet121
layerin = tf.keras.Input(shape=(datain_tr.shape[1], datain_tr.shape[2],datain_tr.shape[3]))

upscale = tf.keras.layers.Lambda(lambda x: tf.image.resize_with_pad(x,
                                                                    160,
                                                                    160,
                                                                    method=tf.image.ResizeMethod.BILINEAR))(layerin)

'''
Here we are going to use ImageNet weights to create our base model.
'''
model_base = tf.keras.applications.DenseNet121(include_top  = False,
                                               weights      = 'imagenet',
                                               input_shape  = (160,160,3),
                                               input_tensor = upscale,
                                               pooling      = 'max')

layerou = tf.keras.layers.Dense(dataou_tr.shape[-1], activation = 'softmax')

model   = tf.keras.models.Sequential([model_base,
                                      tf.keras.layers.BatchNormalization(),
                                      layerou])

model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
              loss      = 'categorical_crossentropy',
              metrics   = ['accuracy'])

print('\nsummary of model_DenseNet121:\n')
model_base.summary()

print('\nsummary of model:\n')
model.summary()

In [None]:
#@title Transfer learning
'''
Freeze the base model and train only the layers added on top of it
(batch normalization and the output layer).
'''
model_base.trainable = False

model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
              loss      = 'categorical_crossentropy',
              metrics   = ['accuracy'])

print('\nsummary of model_DenseNet121:\n')
model_base.summary()

print('\nsummary of model:\n')
model.summary()

'''
Train for three epochs
'''

t_start = time.time()

history = model.fit(datain_tr,
                    dataou_tr,
                    epochs           = 3,
                    batch_size       = 128,
                    verbose          = 1,
                    shuffle          = True,
                    validation_split = 0.05)

t_end   = time.time()

t_transfer_learning = t_end - t_start

print('\nTransfer learning training time: {:06.2f} sec'.format(t_transfer_learning))


In [None]:
#@title Fine-tuning
'''
Unfreeze the base model and train the whole model.
'''
model_base.trainable = True

model.compile(optimizer = tf.keras.optimizers.Adam(0.00001),
              loss      = 'categorical_crossentropy',
              metrics   = ['accuracy'])

print('summary of model_DenseNet121:\n')
model_base.summary()

print('summary of model:\n')
model.summary()

'''
Train for two epochs.
'''

t_start = time.time()

history = model.fit(datain_tr,
                    dataou_tr,
                    epochs           = 2,
                    batch_size       = 128,
                    verbose          = 1,
                    shuffle          = True,
                    validation_split = 0.05)

t_end   = time.time()

t_fine_tuning = t_end - t_start

print('\nFine-tuning training time: {:06.2f} sec'.format(t_fine_tuning))

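The effect of freezing and unfreezing can be checked without training anything. The sketch below is an illustration under stated assumptions: `toy_base` is a hypothetical two-layer network standing in for the pretrained DenseNet121, and `count_trainable` is a helper written for this note. It counts the weight tensors the optimizer would update in each phase and recompiles after every change to `trainable`, as the cells above do.

```python
import tensorflow as tf

# Hypothetical tiny base network standing in for the pretrained DenseNet121.
toy_base = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
], name='toy_base')

inputs    = tf.keras.Input(shape=(4,))
outputs   = tf.keras.layers.Dense(3, activation='softmax')(toy_base(inputs))
toy_model = tf.keras.Model(inputs, outputs)

def count_trainable(model):
    # Number of weight tensors that the optimizer would update.
    return len(model.trainable_weights)

num_initial = count_trainable(toy_model)

# Phase 1 (transfer learning): freeze the base and recompile.
toy_base.trainable = False
toy_model.compile(optimizer = tf.keras.optimizers.Adam(0.001),
                  loss      = 'categorical_crossentropy')
num_frozen = count_trainable(toy_model)

# Phase 2 (fine-tuning): unfreeze and recompile with a smaller learning rate.
toy_base.trainable = True
toy_model.compile(optimizer = tf.keras.optimizers.Adam(0.00001),
                  loss      = 'categorical_crossentropy')
num_unfrozen = count_trainable(toy_model)

print('Trainable weight tensors (initial)   : {}'.format(num_initial))
print('Trainable weight tensors (frozen)    : {}'.format(num_frozen))
print('Trainable weight tensors (fine-tune) : {}'.format(num_unfrozen))
```

Each `Dense` layer contributes a kernel and a bias, so only the two tensors of the output layer remain trainable while the base is frozen.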

In [None]:
#@title Compute testing accuracy
_, accuracy_te = model.evaluate(datain_te,
                                dataou_te,
                                batch_size = 128)

print('\nTotal Training time: {:06.2f} sec'.format(t_transfer_learning + t_fine_tuning))
print('\nTesting accuracy: {:05.2f}%'.format(accuracy_te * 100))

In [None]:
#@title Clean up memory
%reset

> ***Reminder: in all these practices, we did not touch the remaining, assumed to be unlabeled, training input data.***

# **Lecture 05: Labeling Challenges**

In this lecture, you learned about:

1. Challenges of labeling large repositories.
2. Supervised labeling implementation.

> ***In the following lecture, we will talk about "Self-Supervised Learning (SSL)."***