<a href="https://colab.research.google.com/github/masch78/global_opt/blob/maren/ML1_Captcha_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning 1 - KIT

In this exercise you will classify images taken from Google's reCAPTCHA.
<div>
<img src=https://i.ds.at/LuvqbQ/rs:fill:1600:0/plain/2022/06/23/captcha.jpg width="300">
</div>

reCAPTCHA was created to differentiate between real humans and computer programs. With the breakthrough of deep-learning-based methods, these tactics no longer work reliably: computer programs are now able to solve classic captchas.

This notebook shows the initial steps to load the datasets, create a dummy classifier and use the classifier to create the resulting file, which you will upload for grading.

## Your Task



* Split the labeled data into sensible training and validation datasets
* Train a model to classify the training data
* Evaluate the model on your validation data
* If you think your model has a high accuracy and generalizes well, predict the classes of the images from the test dataset and upload the result.csv at https://kit-ml1.streamlitapp.com/
* You will get bonus points in the exam if your accuracy on the test data is high enough

## Learning Goals

* How to preprocess data
* How to split data to prevent over- and underfitting
* How to train a model
* How to improve accuracy on unlabeled data
    * Model architecture
    * Model initialization
    * Optimizer
    * Batch size
    * Image Augmentation
    * ...



In [1]:
## Lots of imports
import matplotlib.pyplot as plt # for visualization
import numpy as np #for fast calculation of matrices and vectors
import os # for reading and writing files
import pandas as pd # for creating dataframes and later saving the .csv
import torch # PyTorch
import torch.nn as nn # layers of neural networks
from torch.utils.data import random_split, DataLoader # Creating datasets
import torchvision # the part of PyTorch which is used for images
from torchvision import datasets, models, transforms # used for loading images


torch.manual_seed(3407) # fixes the random seed so you can compare your results (some GPU operations may still be non-deterministic)
np.random.seed(3407)

Download the two .zip files that are available on ILIAS.
You should have `train_val.zip` and `test.zip`.





## Using Google Colab and Google Drive


* Upload both files (drag and drop) to your free Google Drive account https://drive.google.com/drive/my-drive
* In the left sidebar of Colab, click the folder (Files / Dateien) icon.
* Then press the *Mount drive / Drive bereitstellen* button, which shows the Google Drive symbol (triangle).
* Allow access to your Google Drive.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
## If you did this correctly you should see here "drive" and "sample_data"
from google.colab import drive
drive.mount('/content/drive')
!ls


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
drive  sample_data


Unzip the files in your Google Drive. You only need to do this once.

In [4]:
#!unzip drive/MyDrive/train_val.zip -d drive/MyDrive/
#!unzip drive/MyDrive/test.zip -d drive/MyDrive/

The last 5000 lines of the streaming output were truncated.
  inflating: drive/MyDrive/test_data/test/06730.png  
  inflating: drive/MyDrive/test_data/test/06731.png  
  inflating: drive/MyDrive/test_data/test/06732.png  
  inflating: drive/MyDrive/test_data/test/06733.png  
  inflating: drive/MyDrive/test_data/test/06734.png  
  inflating: drive/MyDrive/test_data/test/06735.png  
  inflating: drive/MyDrive/test_data/test/06736.png  
  inflating: drive/MyDrive/test_data/test/06737.png  
  inflating: drive/MyDrive/test_data/test/06738.png  
  inflating: drive/MyDrive/test_data/test/06739.png  
  inflating: drive/MyDrive/test_data/test/06740.png  
  inflating: drive/MyDrive/test_data/test/06741.png  
 extracting: drive/MyDrive/test_data/test/06742.png  
  inflating: drive/MyDrive/test_data/test/06743.png  
  inflating: drive/MyDrive/test_data/test/06744.png  
  inflating: drive/MyDrive/test_data/test/06745.png  
 extracting: drive/MyDrive/test_data/test/06746.png  
 

This should have created the folders `train_val_data` and `test_data` in your Google Drive.

In [5]:
root = "./drive/MyDrive/" # where are these folders located?

Now we have to create datasets from these folders.

In the train_val folder the images are sorted into one subfolder per class.
For the test folder we don't know the correct classes.

We will use `ImageFolder` datasets from [PyTorch](https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html#torchvision.datasets.ImageFolder).

Each `ImageFolder` applies [transforms](https://pytorch.org/vision/stable/transforms.html) to augment the image and convert it into a tensor.

Some initial transforms are given. You are allowed to (and probably should) add more transformations or modify the existing ones.

In [None]:
test_transform = transforms.Compose([
        transforms.CenterCrop(120), # crops (or zero-pads) every image to 120x120; you can choose a different resolution
        # you can add more augmentations here
        transforms.ToTensor(), # creates a tensor out of Image
    ])

train_val_transform = transforms.Compose([
        transforms.CenterCrop(120), # should be the same resolution as the test_transform
        transforms.ToTensor(),
    ])

Now we use these transformations to create our dataset

In [None]:
train_val_folder = root + "train_val_data/"
train_val_dataset = datasets.ImageFolder(train_val_folder, transform=train_val_transform)

train_val_length = len(train_val_dataset)
print(f"The trainval dataset contains {train_val_length} labeled images") # should be 3000


test_folder = root + "test_data/"
test_dataset = datasets.ImageFolder(test_folder, transform=test_transform)

print(f"The test dataset contains {len(test_dataset)} unlabeled images") # should be 8730

Let's look at the first element of our dataset

In [None]:
first_elem = train_val_dataset[0]
print(f"An element of the dataset contains {len(first_elem)} fields (should be 2). The first field is an image, the second is its label.\n")

# the first index should be a tensor representation of an image
print("tensor of first image", first_elem[0], "\n")

print("image should be of shape 3,size,size: ", first_elem[0].shape)

# convert tensor back to a PIL image and visualize it with display()
display(transforms.ToPILImage()(first_elem[0]))
# Each folder is a class
classes = train_val_dataset.classes
print("We have the following classes", classes)

# Each classname is assigned an index
class_names = train_val_dataset.class_to_idx
print("Each class gets an index value", class_names)

# the second index is the numerical value of our label taken from the folder name
print(f"For the first image we have index {first_elem[1]}")

Split this dataset into a training set and a validation set.
For this you can use [random_split](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split).

In this example we use 10% of the dataset for training and 90% for validation. You should change this percentage to a reasonable value.
Keep overfitting and underfitting in mind: with too little training data the model cannot learn the classes, and with too little validation data you cannot reliably tell whether it generalizes.

In [None]:
train_percentage = 0.1 # how much of the dataset should be used for training --> change this value

no_train_images = int(train_val_length * train_percentage)
no_valid_images = train_val_length - no_train_images

train_dataset, valid_dataset = random_split(dataset=train_val_dataset, lengths=[no_train_images ,no_valid_images], generator=torch.Generator().manual_seed(42))

print(f"we divided the {len(train_val_dataset)} labeled images into {len(train_dataset)} training images and {len(valid_dataset)} validation images")
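`random_split` draws indices uniformly at random, so a small validation set can end up with very few images of some classes. A minimal sketch of a stratified alternative follows. It assumes the labels are available as a list (for an `ImageFolder` this is the `targets` attribute); the helper name `stratified_split` and the toy dataset are illustrative and not part of the exercise code.

```python
import torch
from torch.utils.data import Subset, TensorDataset


def stratified_split(targets, train_fraction, seed=42):
    """Return (train_indices, valid_indices) with per-class proportions preserved."""
    targets = torch.as_tensor(targets)
    generator = torch.Generator().manual_seed(seed)
    train_idx, valid_idx = [], []
    for cls in torch.unique(targets):
        # indices of all samples of this class, in random order
        cls_idx = torch.nonzero(targets == cls, as_tuple=True)[0]
        perm = cls_idx[torch.randperm(len(cls_idx), generator=generator)]
        n_train = int(len(perm) * train_fraction)
        train_idx.extend(perm[:n_train].tolist())
        valid_idx.extend(perm[n_train:].tolist())
    return train_idx, valid_idx


# Toy stand-in for the labeled dataset: 120 samples, 12 classes, 10 per class.
toy_targets = [i % 12 for i in range(120)]
toy_dataset = TensorDataset(torch.zeros(120, 3, 8, 8), torch.tensor(toy_targets))

train_idx, valid_idx = stratified_split(toy_targets, train_fraction=0.8)
toy_train = Subset(toy_dataset, train_idx)
toy_valid = Subset(toy_dataset, valid_idx)
print(len(toy_train), len(toy_valid))  # 96 24
```

With the real data this could be called as `stratified_split(train_val_dataset.targets, 0.8)`, and the resulting index lists wrapped in `Subset(train_val_dataset, ...)`.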

Let's create [DataLoaders](https://pytorch.org/docs/stable/data.html).
A DataLoader serves our data in batches and can load it in parallel, which speeds up training.

The important arguments of the DataLoader are `dataset`, `batch_size`, `shuffle` and `num_workers`.
The `dataset` argument is already given; you should choose fitting values for the other arguments.

Let's create DataLoaders for training and validation.

In [None]:
train_loader = DataLoader(dataset=train_dataset) # You are free to add values for other arguments
valid_loader = DataLoader(dataset=valid_dataset) # You are free to add values for other arguments
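To make the effect of these arguments concrete, here is a small sketch on random tensors instead of the real images. The values `batch_size=32` and `num_workers=0` are arbitrary example choices, not recommendations from the exercise, and all `toy_*` names are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for train_dataset: 100 RGB images of size 120x120 with labels 0..11.
toy_images = torch.rand(100, 3, 120, 120)
toy_labels = torch.randint(0, 12, (100,))
toy_dataset = TensorDataset(toy_images, toy_labels)

toy_train_loader = DataLoader(
    toy_dataset,
    batch_size=32,   # number of samples per batch (and per gradient step)
    shuffle=True,    # reshuffle every epoch; usually wanted for training only
    num_workers=0,   # 0 loads in the main process; >0 uses background worker processes
)
toy_valid_loader = DataLoader(toy_dataset, batch_size=32, shuffle=False)

batch_shapes = [tuple(images.shape) for images, _ in toy_train_loader]
print(batch_shapes)
# [(32, 3, 120, 120), (32, 3, 120, 120), (32, 3, 120, 120), (4, 3, 120, 120)]
```

The last batch is smaller because 100 is not a multiple of 32; passing `drop_last=True` would discard it instead.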

Let's visualize images from the train loader

In [None]:
def vis_batch(loader):
    def show(inp, label):
        fig = plt.gcf()
        plt.imshow(inp.permute(1,2,0))
        plt.title(label)
    
    for batch_inputs, labels in loader:
        grid = torchvision.utils.make_grid(batch_inputs)
        show(grid, label=[classes[int(labels[x])] for x in range(len(labels))])
        break
vis_batch(train_loader)

Let's create a dummy PyTorch model that takes an image and predicts a class

In [None]:
# Do not use this model.
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=3, kernel_size=3,padding=1),
    nn.Flatten(),
    nn.Linear(in_features=120*120*3, out_features=12) # your model has to predict 12 classes so your last layer should most likely be a linear layer with 12 out_features
)

You should use a different model. 
Also you should now train your model. 
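The notebook leaves the training step open. The following is a minimal sketch of one common way to train and evaluate a classifier in PyTorch, run here on random toy data so that it is self-contained. The functions `train_one_epoch` and `accuracy`, the toy network, the Adam optimizer and the learning rate are all assumptions chosen for illustration, not the method prescribed by the exercise.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the loader and return the mean training loss."""
    model.train()
    total_loss, n_samples = 0.0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * images.size(0)
        n_samples += images.size(0)
    return total_loss / n_samples


@torch.no_grad()
def accuracy(model, loader, device):
    """Fraction of correctly classified samples in the loader."""
    model.eval()
    correct, n_samples = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        n_samples += labels.size(0)
    return correct / n_samples


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy data and a toy network (hypothetical, only to make the sketch runnable).
torch.manual_seed(0)
toy_loader = DataLoader(
    TensorDataset(torch.rand(64, 3, 32, 32), torch.randint(0, 12, (64,))),
    batch_size=16,
)
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 12),  # 12 output classes, as in the exercise
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)

for epoch in range(2):
    loss = train_one_epoch(toy_model, toy_loader, criterion, optimizer, device)
    acc = accuracy(toy_model, toy_loader, device)
    print(f"epoch {epoch}: loss={loss:.4f}, accuracy={acc:.3f}")
```

On real data, `accuracy` would be called with the validation loader rather than the training loader, so that the reported number reflects generalization.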

# My Model

## Import Libraries

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler as lr_scheduler
from torch.optim.lr_scheduler import _LRScheduler
import torch.utils.data as data

import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

from sklearn import decomposition
from sklearn import manifold
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

import copy
from collections import namedtuple
import os
import random
import shutil
import time



Set the seed for reproducibility.

In [7]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Set the directories of the datasets

In [9]:
train_dir = root + "train_val_data/"
test_dir= root + "test_data/"

Normalize the data: compute the per-channel mean and standard deviation of the labeled images.

In [11]:
train_data = datasets.ImageFolder(root = train_dir, 
                                  transform = transforms.ToTensor())

# Initialise Mean and Standard deviation array
means = torch.zeros(3)
stds = torch.zeros(3)

for img, label in train_data:
    means += torch.mean(img, dim = (1,2))
    stds += torch.std(img, dim = (1,2))

means /= len(train_data)
stds /= len(train_data)
    
print(f'Calculated means: {means}')
print(f'Calculated stds: {stds}')

Calculated means: tensor([0.4795, 0.4722, 0.4359])
Calculated stds: tensor([0.1675, 0.1676, 0.1834])
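Note that averaging per-image standard deviations, as done above, is an approximation: it ignores brightness differences between images and therefore typically yields a somewhat smaller value than the standard deviation over all pixels of the dataset. A sketch of the pooled computation from running sums is shown below; `channel_stats` and the toy images are hypothetical names introduced only for this illustration.

```python
import torch


def channel_stats(images):
    """Per-channel mean and std over all pixels of all images (each C x H x W)."""
    n_pixels = 0
    channel_sum = torch.zeros(3, dtype=torch.float64)
    channel_sq_sum = torch.zeros(3, dtype=torch.float64)
    for img in images:
        img = img.double()
        n_pixels += img.shape[1] * img.shape[2]
        channel_sum += img.sum(dim=(1, 2))
        channel_sq_sum += (img ** 2).sum(dim=(1, 2))
    mean = channel_sum / n_pixels
    # Var[x] = E[x^2] - E[x]^2 (population variance)
    std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()
    return mean.float(), std.float()


# Toy images of different sizes and brightness levels.
torch.manual_seed(0)
toy_images = [torch.rand(3, 20, 30) * 0.5, torch.rand(3, 25, 25) * 0.5 + 0.5]
pooled_mean, pooled_std = channel_stats(toy_images)
averaged_std = torch.stack([img.std(dim=(1, 2)) for img in toy_images]).mean(dim=0)
print(pooled_mean, pooled_std, averaged_std)
```

For the real data the function could be fed a generator such as `(img for img, _ in train_data)`, since it only iterates once.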


Image Transformation and Augmentation of the Training Data

In [13]:
pretrained_size = 224 
pretrained_means = [0.485, 0.456, 0.406] # ImageNet channel means; depend on the pretrained network
pretrained_stds= [0.229, 0.224, 0.225]  # ImageNet channel stds; depend on the pretrained network

train_transforms = transforms.Compose([
                           transforms.Resize(pretrained_size), # resizes the shorter side to 224, keeping the aspect ratio
                           transforms.RandomRotation(5),
                           transforms.RandomHorizontalFlip(0.5),
                           transforms.RandomCrop(pretrained_size, padding = 10),
                           transforms.ToTensor(), #creates a tensor out of image
                           transforms.Normalize(mean = pretrained_means, 
                                                std = pretrained_stds)
                       ])

test_transforms = transforms.Compose([
                           transforms.Resize(pretrained_size),
                           transforms.CenterCrop(pretrained_size),
                           transforms.ToTensor(),
                           transforms.Normalize(mean = pretrained_means, 
                                                std = pretrained_stds)
                       ])

Load data with transforms

In [14]:
train_data = datasets.ImageFolder(root = train_dir, 
                                  transform = train_transforms)

test_data = datasets.ImageFolder(root = test_dir, 
                                 transform = test_transforms)

Create validation Split

In [15]:
VALID_RATIO = 0.9 # note: despite its name, this is the fraction used for training (90% training, 10% validation)

n_train_examples = int(len(train_data) * VALID_RATIO)
n_valid_examples = len(train_data) - n_train_examples

train_data, valid_data = data.random_split(train_data, 
                                           [n_train_examples, n_valid_examples])

print("# of training examples", n_train_examples, "# of validation examples", n_valid_examples)

# of training examples 2700 # of validation examples 300


Overwrite the validation transforms, making sure to do a deepcopy first so that the training data transforms are not changed as well.

In [16]:
valid_data = copy.deepcopy(valid_data)
valid_data.dataset.transform = test_transforms
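The reason for the deepcopy is that both subsets returned by `random_split` reference the same underlying dataset object, so changing its `transform` would affect training as well. The sketch below demonstrates this with a hypothetical `ToyDataset` that stands in for `ImageFolder`; its "transforms" simply tag each value.

```python
import copy
import torch
from torch.utils.data import Dataset, random_split


class ToyDataset(Dataset):
    """Hypothetical stand-in for ImageFolder: holds numbers and a `transform`."""

    def __init__(self, transform):
        self.values = list(range(10))
        self.transform = transform

    def __len__(self):
        return len(self.values)

    def __getitem__(self, index):
        return self.transform(self.values[index])


full = ToyDataset(transform=lambda x: ("train", x))
train_part, valid_part = random_split(
    full, [8, 2], generator=torch.Generator().manual_seed(0)
)

# Both subsets point to the very same underlying dataset object.
shares_dataset = train_part.dataset is valid_part.dataset

# Copy first, then change the transform only on the copy.
valid_part = copy.deepcopy(valid_part)
valid_part.dataset.transform = lambda x: ("valid", x)

print(shares_dataset, train_part[0][0], valid_part[0][0])  # True train valid
```

The price of this approach is that the copied subset carries its own copy of the dataset object; for an `ImageFolder` that is only a list of file paths and labels, not the images themselves.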

In [17]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 2700
Number of validation examples: 300
Number of testing examples: 8730


Create the iterators with the largest batch size

An iterator is an object that allows you to traverse through a collection of items, such as a list or a dataset, one at a time. In Python, an iterator is an object that implements two methods: `__iter__()` and `__next__()`.

The `__iter__()` method returns the iterator object itself. This method is called when the for loop starts.

The `__next__()` method returns the next item from the iterator. This method is called every time the for loop proceeds to the next iteration.

In deep learning, iterators are often used to load and process data in small chunks, as loading the entire dataset into memory can be impractical. One common example is the `DataLoader` class in PyTorch, which is used to create an iterator that loads and processes data in batches for training and evaluation.
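The protocol described above can be illustrated with a few lines of plain Python. `BatchIterator` is a made-up class for this illustration and is far simpler than PyTorch's actual implementation.

```python
class BatchIterator:
    """Yield consecutive slices ("batches") of a list."""

    def __init__(self, items, batch_size):
        self.items = items
        self.batch_size = batch_size
        self.position = 0

    def __iter__(self):
        # Called once when a for loop starts; an iterator returns itself.
        return self

    def __next__(self):
        # Called for every step of the loop; StopIteration ends the loop.
        if self.position >= len(self.items):
            raise StopIteration
        batch = self.items[self.position:self.position + self.batch_size]
        self.position += self.batch_size
        return batch


batches = list(BatchIterator(list(range(7)), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A `DataLoader` is strictly speaking an iterable rather than an iterator: its `__iter__()` returns a fresh iterator each time, which is why the same loader can be looped over once per epoch.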

In [21]:
# Peak GPU memory occupied by tensors so far in this process (not the total GPU capacity)
max_gpu_memory = torch.cuda.max_memory_allocated()

# Calculate the maximum batch size that can fit in the GPU memory
#batch_size = max_gpu_memory // (input_size * output_size * 4) # 4 bytes per float

print("Max GPU", max_gpu_memory)
#print("Max batch size", batch_size)

Max GPU 0
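The value 0 is expected here: `torch.cuda.max_memory_allocated()` reports the peak memory used by tensors so far, and nothing has been moved to the GPU yet. The sketch below shows one way to query the total capacity of the device instead. The helper name `gpu_memory_report` is made up for this example, and turning the capacity into a safe batch size still requires experimentation.

```python
import torch


def gpu_memory_report():
    """Return (total_bytes, peak_allocated_bytes) for GPU 0, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    total = torch.cuda.get_device_properties(0).total_memory
    peak = torch.cuda.max_memory_allocated(0)
    return total, peak


report = gpu_memory_report()
if report is None:
    print("No CUDA device available")
else:
    total, peak = report
    print(f"total: {total / 2**30:.1f} GiB, peak allocated so far: {peak / 2**20:.1f} MiB")
```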


In [None]:
BATCH_SIZE = 64

train_iterator = data.DataLoader(train_data, 
                                 shuffle = True, 
                                 batch_size = BATCH_SIZE)

valid_iterator = data.DataLoader(valid_data, 
                                 batch_size = BATCH_SIZE)

test_iterator = data.DataLoader(test_data, 
                                batch_size = BATCH_SIZE)


The following method should not be changed. It predicts the classes for each image in the test dataset and stores them in a .csv file.


In [None]:
def create_result_file(model, test_dataset, classes): # DO NOT CHANGE THIS METHOD
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    keys = ["ImageName", *classes]  

    prediction_dict = {key: [] for key in keys}
    names = test_dataset.imgs
    model.to(device)
    model.eval() # set model to evaluation mode. 
    for i in range(len(test_dataset)):
        input = test_dataset.__getitem__(i)
        input = input[0].to(device).unsqueeze(0) # take image tensor and add batch dimension
        with torch.no_grad(): # don't calculate gradients
            outputs = model(input).cpu().squeeze().numpy() # get prediction for input image
            prediction_dict["ImageName"].append(os.path.basename(names[i][0])) # save image name
            for class_idx, class_name in enumerate(classes): # save prediction for each class 
                prediction_dict[class_name].append(outputs[class_idx])
        
    df = pd.DataFrame.from_dict(prediction_dict) # convert list into pandas dataframe
    df.to_csv("result.csv", index=False) # save dataframe as .csv

After training we can call `create_result_file(model, test_dataset, classes)`.
In the given code we skip training and use our untrained model.


In [None]:
create_result_file(model, test_dataset, classes)
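Before uploading, it can be worth checking that the table has the layout `create_result_file` produces: an `ImageName` column followed by one column per class, and one row per test image. The following sketch performs such a check on a toy table; `check_result_frame` and the toy class names are hypothetical and not part of the exercise code or the grading server.

```python
import pandas as pd


def check_result_frame(df, classes, n_images):
    """Return a list of problems found in a result table (empty list = looks fine)."""
    problems = []
    expected_columns = ["ImageName", *classes]
    if list(df.columns) != expected_columns:
        problems.append("unexpected columns")
    if len(df) != n_images:
        problems.append(f"expected {n_images} rows, found {len(df)}")
    if df.isna().any().any():
        problems.append("table contains missing values")
    return problems


toy_classes = ["Bicycle", "Bus"]  # hypothetical class names
toy_df = pd.DataFrame(
    {"ImageName": ["00000.png", "00001.png"], "Bicycle": [1.2, -0.3], "Bus": [0.1, 2.4]}
)
print(check_result_frame(toy_df, toy_classes, n_images=2))  # []
print(check_result_frame(toy_df, toy_classes, n_images=3))  # ['expected 3 rows, found 2']
```

On the real file this could be called as `check_result_frame(pd.read_csv("result.csv"), classes, len(test_dataset))`.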

If you use Google Colab, press the `update/aktualisieren` button.
<div>
<img src=https://git.scc.kit.edu/vy9905/ml2images/-/raw/main/UpdateColab.jpg width="300">
</div>
You should see that the file result.csv was created. You can now download this file and upload it at

https://kit-ml1.streamlitapp.com/
