# Session 12: Final Coding Test

```{contents}

```

In [None]:
from google.colab import drive
drive.mount('/content/drive')

The test is designed to be completed in 1 hour and 30 minutes. Time yourself as you work through it.

## Language Model for Alphabet characters and numbers

As you have seen, the most basic use of a language model is to represent each word as a numerical vector, so that words with similar meanings lie close to each other in the vector space.

Before you start coding, think about what a language model trained on alphabetic characters and digits will produce.

**Read the Data**

Our data consists of the titles of 10,000 English articles, stored as a single `list`.

In [None]:
import pickle as pkl

with open("/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/problem_1.pkl", "rb") as f:
  data = pkl.load(f)

print(data[0:10])

In [None]:
data

**Approach**

We will build a **very basic** language model by doing the following:
- Use an `Embedding` layer to create a representation of each character in the dataset
- Update the weights of the `Embedding` layer by solving the problem **"Predict the next character given the single character immediately before it"** (a multi-class classification problem, where the number of classes is the number of unique characters appearing in the dataset)

Therefore, from the text above, we need to build a dataset in which `x` is a single character and `y` is the character immediately after it.

Example: the first sentence in the dataset is `aba decides against community broadcasting licence`

This sentence is 50 characters long (including spaces), so we will create 49 data samples from it to train the model:
```
# For example, the "aba decides" segment will produce (x, y) pairs like
x     | y
------|-------
a     | b
b     | a
a     | space
space | d
d     | e
e     | c
c     | i
i     | d
d     | e
e     | s
```
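
As an illustration of the table above, here is a minimal sketch on one toy string only (the names `sentence` and `pairs` are hypothetical). It pairs each character with the one after it by zipping the string against a copy of itself shifted by one position:

```python
# Toy illustration on a single string; `sentence` and `pairs` are hypothetical names
sentence = "aba decides"

# A string of length n yields n - 1 (x, y) pairs
pairs = list(zip(sentence[:-1], sentence[1:]))

for x_c, y_c in pairs:
  print(repr(x_c), "->", repr(y_c))

print(len(sentence), "characters give", len(pairs), "pairs")
```

The same counting explains why the 50-character sentence above gives 49 samples.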

#### TODO 1

Write code to:
- Create `x_char` and `y_char` containing the training data described above (built from all the data in the `data` variable)
- Create `unique_chars`, a `list` containing the unique characters in `data` (including the space character), sorted in ascending order.
- Based on `unique_chars`:
  - Create `NUM_CHAR`, the number of unique characters in `data`
  - Create `char_to_index` and `index_to_char`, 2 dictionaries (`dict`) that map each character to its index and vice versa
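
To make the expected form of these variables concrete, here is a sketch on a tiny, made-up input. It is only an illustration: `toy_data` and every `toy_`-prefixed name are hypothetical, chosen so that they do not clash with the variables you must create yourself.

```python
# Made-up input standing in for `data`
toy_data = ["ab 1", "ba"]

# Collect every distinct character across all titles, then sort in ascending order
toy_unique_chars = sorted(set("".join(toy_data)))
toy_num_char = len(toy_unique_chars)

# Two dictionaries: character -> index and index -> character
toy_char_to_index = {ch: i for i, ch in enumerate(toy_unique_chars)}
toy_index_to_char = {i: ch for ch, i in toy_char_to_index.items()}

print(toy_unique_chars)
print(toy_char_to_index)
```

Because the space character sorts before digits and letters, it receives index 0, which is why the plotting code further down labels index 0 as "SPACE".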

In [None]:
# YOUR SOLUTION

After you have created all the variables above, the code below generates `x` and `y`, the model's training data.

In [None]:
x = []
y = []
for char_x, char_y in zip(x_char, y_char):
  x.append(char_to_index[char_x])
  y.append(char_to_index[char_y])

len(x), len(y)

#### TODO 2

Use TensorFlow to build the model as follows. The model consists of 2 layers:
- `Input`
- `Embedding` with
  - Number of rows equal to the number of unique characters
  - Each character represented by 1 vector of 2 numbers
  - Name this layer `"embedding"`
    - `model.add(Embedding(..., name="embedding"))`

Just create the model; compiling and fitting are not required yet.



In [None]:
# YOUR SOLUTION

Since the last layer of the model is `Embedding`, calling `predict` on the indices of all the unique characters returns their representations (the model must be implemented correctly for the code below to run).

In [None]:
character_embeddings = model.predict(list(index_to_char.keys()))
print(character_embeddings.shape) # notice the printed results
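
Conceptually, an embedding layer is a lookup table: row `i` of its weight matrix is the vector for index `i`. The NumPy sketch below mimics that lookup to show where a three-dimensional output shape comes from. It is not the Keras implementation; `table` and `indices` are hypothetical names, the weights are random, the vocabulary size of 5 is made up, and each input sample is assumed to hold exactly 1 index.

```python
import numpy as np

num_char, dim = 5, 2  # toy sizes: 5 characters, 2 numbers per vector
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(num_char, dim))  # stand-in for the embedding weights

# One index per sample: shape (5, 1)
indices = np.arange(num_char).reshape(-1, 1)

# Integer-array indexing keeps the shape of `indices` and appends the vector dimension
vectors = table[indices]

print(vectors.shape)
```

Each sample therefore comes back as a `(1, 2)` array, which is why the plotting code reads coordinates as `vector[0, 0]` and `vector[0, 1]`.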

Visualize the vector space of characters

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
for index, vector in enumerate(character_embeddings):
  plt.scatter(vector[0, 0], vector[0, 1], alpha=0)
  if index != 0:
    plt.text(vector[0, 0], vector[0, 1], index_to_char[index])
  else:
    plt.text(vector[0, 0], vector[0, 1], "SPACE")
plt.show()

We see that without training, the characters are scattered across the space with no meaningful structure.

### Train

To train the model, we need to one-hot encode the variable `y`.

The one-hot result is `y_encode`, with shape `(num_sample, 37)`.

Since the Embedding layer returns a result with shape `(batch_size, 1, 2)`, we reshape the one-hot result to a matching shape, `(num_sample, 1, 37)` (if this is unclear, try deleting the `expand_dims` line and then doing TODO 3 below; you will see an error).
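
The shape argument can be checked with plain NumPy. The sketch below uses made-up class indices and zero-filled stand-ins (hypothetical names `embedded` and `dense_kernel`) rather than the real model, and it uses `np.eye(num_char)[y]` as a NumPy equivalent of one-hot encoding.

```python
import numpy as np

num_sample, num_char = 4, 37
y_toy = np.array([3, 0, 36, 12])  # made-up class indices

y_onehot = np.eye(num_char)[y_toy]             # shape (4, 37)
y_expanded = np.expand_dims(y_onehot, axis=1)  # shape (4, 1, 37)

# Stand-in for "Embedding output followed by a Dense layer with 37 units"
embedded = np.zeros((num_sample, 1, 2))  # shape (batch_size, 1, 2)
dense_kernel = np.zeros((2, num_char))   # a Dense layer acts on the last axis only
logits = embedded @ dense_kernel         # shape (4, 1, 37)

print(y_onehot.shape, y_expanded.shape, logits.shape)
```

The prediction keeps the middle axis of length 1, so the target needs the same axis before the two can be compared element by element.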

In [None]:
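import numpy as np
import tensorflow as tf

# Use the integer indices built above, not the raw characters in x_char
x = np.array(x)

# One-hot encode y, then add a length-1 axis in the middle
y_encode = tf.keras.utils.to_categorical(y, num_classes=NUM_CHAR)
y_encode = np.expand_dims(y_encode, axis=1)
print(y_encode.shape)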

#### TODO 3

- Add 1 `Dense` layer to the model, which is used for prediction
- Train the simple model above for 5 epochs with `model.compile(..., optimizer="adam", metrics=["accuracy"])`

In [None]:
# YOUR SOLUTION

After training, we remove the `Dense` layer at the end to obtain the language model (feed in 1 character index and get back the vector representing that character).

In [None]:
from tensorflow.keras.models import Model

language_model = Model(
  inputs=model.input,
  outputs=model.get_layer("embedding").output
)

In [None]:
char_embeddings = language_model.predict(list(index_to_char.keys()))

plt.figure(figsize=(6, 6))
for index, vector in enumerate(char_embeddings):
  plt.scatter(vector[0, 0], vector[0, 1], alpha=0)
  if index != 0:
    plt.text(vector[0, 0], vector[0, 1], index_to_char[index])
  else:
    plt.text(vector[0, 0], vector[0, 1], "SPACE")
plt.show()

Observing the positions of the characters in the vector space, you will see:
- A cluster of the vowels `e u o a i`
- A cluster of the remaining alphabet characters
- A cluster of digits
- The SPACE character

## Optical Character Recognition

In this part, we practice combining CNN and RNN models to solve an optical character recognition (OCR) problem.

The model is already trained; you only need to **write code to create the model architecture according to the instructions**, then load the weights file into the model.

### Prepare the dataset

We have the variable `char_list` which contains all alphabetic characters (a-z, A-Z) and numbers (0-9)

In [None]:
import string

char_list = string.ascii_letters+string.digits
print(char_list)

**Read data from pickle file**

Update the paths below to point to the 2 files `img.pkl` and `label.pkl` in the `ocr_data` folder.

In [None]:
import pickle

def load_pickle_data(path):
  # Use a context manager so the file is closed even if loading fails
  with open(path, 'rb') as f:
    return pickle.load(f)

IMG_PATH = '/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/ocr_data/img.pkl'
LABEL_PATH = '/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/ocr_data/label.pkl'

img = load_pickle_data(IMG_PATH)
label = load_pickle_data(LABEL_PATH)

Visualize the data

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(20,4))
for i in range(10):
  target = np.random.randint(0, len(img))
  plt.subplot(2,5, i+1)
  imgplot = plt.imshow(img[target].reshape(32, 128),cmap='binary')
  title = 'Ground Truth: '
  for j in label[target]:
    title += char_list[j]
  plt.title(title)
plt.show()

Convert the images from a list to a NumPy array

In [None]:
img = np.array(img)
print('Dataset Shape =', img.shape)

Dataset Shape = (1371, 32, 128, 1)


### Build the architecture

#### TODO 4
**If you do this correctly, the steps below WILL work**

- When you run the code cell below, instructions for creating the model architecture will appear.
- Each colored rectangle represents 1 layer in the model.
- Click on a rectangle to **show/hide** the parameters of that layer.
  - Use the layer name and the suggested parameters to work out the remaining parameters yourself.
  - For the 2 purple layers (LSTM), you can ignore the parameter `dropout=0.2`.




In [None]:
from IPython.display import HTML
HTML('<iframe width="100%" height="250" src="https://final-exam-litahung.vercel.app/ml" allowfullscreen></iframe>')

In [None]:
# YOUR SOLUTION

Change the path below to the location of the weights file (`ocr_weights.hdf5`) in your Drive.

In [None]:
model.load_weights('/content/drive/MyDrive/Colab Notebooks/ML-intensive/data/ocr_weights.hdf5')

### Inference

If you create the right architecture, you can run the code below

In [None]:
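from tensorflow.keras import backend as K

# Run the OCR model, then greedily decode its CTC output into character indices
prediction = model.predict(img)
result = K.ctc_decode(prediction,
                      input_length=np.ones(prediction.shape[0]) * prediction.shape[1],
                      greedy=True)[0][0]
result = K.get_value(result)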

The variable `result` contains the model's predictions as character indices. Values equal to `-1` do not correspond to any character and should be skipped.

In [None]:
print('Result shape:', result.shape)
print('1st item of result:', result[0])
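
A small helper makes the role of `-1` concrete. This is a sketch with a hypothetical function name, `indices_to_text`, and a made-up row of indices; it mirrors the skipping logic used in the plotting code below.

```python
import string

# Same character table as defined earlier: a-z, A-Z, then 0-9
char_list = string.ascii_letters + string.digits

def indices_to_text(indices):
  """Map character indices to a string, skipping the -1 entries."""
  return "".join(char_list[k] for k in indices if k != -1)

# Made-up row in the same format as one row of `result`
example_row = [7, 4, 11, 11, 14, -1, -1, -1]
print(indices_to_text(example_row))  # prints "hello"
```

Applied to the real output, this would be called as `indices_to_text(result[0])`.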

In [None]:
plt.figure(figsize=(20,4))
for i in range(10):
  target = np.random.randint(0, len(img))
  plt.subplot(2,5, i+1)
  imgplot = plt.imshow(img[target].reshape(32, 128),cmap='binary')
  title = 'Ground Truth: '
  for j in label[target]:
    title += char_list[j]
  title = title + '\nPrediction: '
  for k in result[target]:
    if k == -1:
      continue
    title += char_list[k]
  plt.title(title)
plt.show()