# Session 1: Assignment

```{contents}

```

## Problem 1: Face Recognition with PCA and KNN (7 points)

In this problem, we will use PCA to extract features and then use a KNN model to make predictions from the extracted features.

### Prepare the dataset

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

In [None]:
import sklearn.datasets as datasets
dataset = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.8)

### Analyze the data

In [None]:
print(dataset.keys())

Some important keys to consider when solving this problem:
- `images`: the grayscale image dataset (already normalized)
- `data`: the grayscale image dataset (already normalized), with each image flattened into one vector
- `target`: label of each image/face (type `int`)
- `target_names`: name of each face (type `str`)

In [None]:
data = dataset.data
target = dataset.target
target_names = dataset.target_names
num_image, h, w = dataset["images"].shape

print("Number of images:", num_image)
print("Height of each image:", h)
print("Width of each image:", w)
print("Data shape:", data.shape)

Let's look at the names of the people included in the dataset.

In [None]:
print("Number of people in the dataset:", len(target_names))
print(target_names)

The variable `target` contains the labels of the above 7 faces, numbered from 0 to 6

In [None]:
print(target)
print("-" * 30)
ids, counts = np.unique(target, return_counts=True)
for id, count in zip(ids, counts):
  print(f"There are {count} images of {target_names[id]}")

Visualization

In [None]:
target_with_name = [target_names[id] for id in target]
fig = px.histogram(x=target_with_name, color=target_with_name)
fig.show()

Visualize 5 random images of each person

In [None]:
n_people = 7
n_image = 5

fig, axes = plt.subplots(n_people, n_image, figsize=(10,15))
for row in range(n_people):
  current_id_indices = np.where(target == row)[0]
  random_indices = np.random.choice(current_id_indices, size=n_image, replace=False)

  for col in range(n_image):
    current_ax = axes[row][col]
    current_ax.grid(False)  # pass a boolean; the string 'off' is truthy
    current_ax.axis('off')

    image_index = random_indices[col]
    current_ax.imshow(data[image_index].reshape(h,w), cmap='gray')
    current_ax.set_title(target_names[row])

plt.show()

### Use PCA to reduce data dimension and draw Embedding Space

#### TODO 1

Use PCA to reduce the dimension of the data set to 3 dimensions, then print out the amount of information retained.

- `pca.explained_variance_ratio_` is the fraction of the total variance explained by each principal component. It tells you how much information (variance) can be attributed to each component. The ratios sum to 1.0 only when all components are kept; with fewer components, the sum is the fraction of variance retained.

- For example, if you have two components, and the output is `[0.8, 0.2]`, it means that the first component explains 80% of the variance in the data, and the second component explains 20% of the variance.

- `pca.explained_variance_ratio_` is calculated by dividing `pca.explained_variance_` by the total variance of the data. `pca.explained_variance_` holds the eigenvalue associated with each component, which measures how much of the variance of the data lies along that component.
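
As an illustration of the relationship described above, the following sketch computes the ratios by hand with NumPy on synthetic data. The data, the feature scales, and the variable names are assumptions made for this example; it is not the solution to TODO 1 and does not use the face dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (assumption): 200 samples, 5 features with very different spreads
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# The eigenvalues of the covariance matrix are the variances along each component
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order, so reverse

# Each ratio is one eigenvalue divided by the total variance
explained_variance_ratio = eigenvalues / eigenvalues.sum()

print(explained_variance_ratio)
print("All components:", explained_variance_ratio.sum())
print("First 3 components:", explained_variance_ratio[:3].sum())
```

Keeping all five components gives a sum of 1.0, while keeping only the first three gives a smaller value, which is the fraction of variance retained.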

In [None]:
# YOUR SOLUTION


The results above show that:
- the 1st component explains 20.05% of the variance in the data
- the 2nd component explains 13.60% of the variance in the data
- the 3rd component explains 6.75% of the variance in the data

Together, the three components explain 40.4% of the variance, so 59.6% is not captured. To represent the data more faithfully, we need to increase the number of components.

For plotting, however, we can use at most 3 components, because the representation is drawn on a 3D graph.

In [None]:
# YOUR SOLUTION

#### TODO 2

Use `plotly.express` to draw a `scatter_3d` diagram of the 3D dataset. Points must be colored by person name (use the variable `target_with_name` declared above).

In [None]:
# YOUR SOLUTION

The embedding space formed by the 3 main components does not separate the classes well: points belonging to different people are mixed together, so it will be difficult to classify them properly.

In practice, face recognition applications typically do not use PCA to extract features. Instead, they use a pretrained model (trained on a large amount of face data) to create an embedding vector for each face. These pretrained embeddings are good enough that, even after reducing them to 3 dimensions and plotting them, the faces are still separated very well.

### Train Test Split

#### TODO 3

Use the **Stratified Split** technique to split the dataset into 2 sets: Train and Test
- The train set accounts for 80%
- Shuffle the data
- Use random seed 42 so that the result is reproducible
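
To show what stratification does, here is a minimal sketch on synthetic labels, assuming scikit-learn's `train_test_split` with its `stratify` argument. The toy arrays `X` and `y` are made up for the example and stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 70 of class 0, 20 of class 1, 10 of class 2
y = np.array([0] * 70 + [1] * 20 + [2] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # dummy features, one column

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    train_size=0.8,   # 80% of the samples go to the train set
    shuffle=True,
    stratify=y,       # keep the class proportions the same in both sets
    random_state=42,  # fixed seed for reproducibility
)

print("Train class counts:", np.bincount(y_train))
print("Test class counts:", np.bincount(y_test))
```

With stratification, each class keeps roughly the same share in the train and test sets as in the full dataset, which matters when some people have far more images than others.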


In [None]:
# YOUR SOLUTION

### Feature Extraction with PCA

#### TODO 4

Use PCA to extract features from the train set and the test set, under the constraint that 99% of the information (variance) is retained.

Name the 2 new variables `x_train_pca` and `x_test_pca`.
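
The sketch below shows one way to request a variance threshold instead of a fixed number of components, assuming scikit-learn's `PCA` with a float `n_components` and `svd_solver="full"`. The synthetic data and the `*_demo` names are assumptions for this example. It also assumes the usual practice of fitting PCA on the train portion only and reusing the fitted transform on the test portion.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic data (assumption): 50 features driven by 5 latent factors plus small noise
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 50))
X_train_demo, X_test_demo = X[:240], X[240:]

# A float in (0, 1) asks PCA to keep enough components to reach that variance fraction
pca_demo = PCA(n_components=0.99, svd_solver="full")
X_train_demo_pca = pca_demo.fit_transform(X_train_demo)  # fit on train only
X_test_demo_pca = pca_demo.transform(X_test_demo)        # reuse the same projection

print("Components kept:", pca_demo.n_components_)
print("Variance retained:", pca_demo.explained_variance_ratio_.sum())
```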

In [None]:
# YOUR SOLUTION


Compare the original face with the approximate face (reconstructed with PCA).

Instead of applying the reconstruction formula by hand, use `pca.inverse_transform`.
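
For reference, this sketch compares `inverse_transform` with the manual reconstruction on random data, assuming a scikit-learn `PCA` fitted with the default `whiten=False`. The array `X` and the `*_demo` names are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # synthetic data (assumption)

pca_demo = PCA(n_components=5).fit(X)
Z = pca_demo.transform(X)  # coordinates in the 5-dimensional PCA space

# Reconstruction through the library call
X_approx = pca_demo.inverse_transform(Z)

# The same reconstruction by hand: project back and add the mean
X_manual = Z @ pca_demo.components_ + pca_demo.mean_

print("Same result:", np.allclose(X_approx, X_manual))
print("Mean squared reconstruction error:", np.mean((X - X_approx) ** 2))
```

The reconstruction has the same shape as the original data, so for images each reconstructed row can be reshaped to `(h, w)` and displayed next to the original.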

In [None]:
# YOUR SOLUTION

### Classification with KNN

#### TODO 5

- Train K-NN models to classify faces using the features extracted with PCA
  - Experiment with different values of `k` to find the best model.
  - Call the `score` method to view the accuracy on the train set and the test set
  ```
  print('Accuracy on Train Set',model.score(x_train_pca, y_train))
  print('Accuracy on Test Set',model.score(x_test_pca, y_test))
  ```
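
A minimal sketch of the `k` experiment, run on synthetic Gaussian blobs rather than the face features. The data, the candidate values of `k`, and the simple index-based split are assumptions for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic data (assumption): three Gaussian blobs of 60 points each, in 2D
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
y = np.repeat([0, 1, 2], 60)

# Simple 80/20 split by shuffled index
order = rng.permutation(len(y))
train_idx, test_idx = order[:144], order[144:]

scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X[train_idx], y[train_idx])
    train_acc = knn.score(X[train_idx], y[train_idx])
    test_acc = knn.score(X[test_idx], y[test_idx])
    scores[k] = (train_acc, test_acc)
    print(f"k={k}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")

# Pick the k with the highest test accuracy
best_k = max(scores, key=lambda k: scores[k][1])
print("Best k:", best_k)
```

With `k=1` the train accuracy is perfect because each point is its own nearest neighbor, so the test accuracy is the more informative number. Choosing `k` on the test set is kept here for simplicity; a separate validation set or cross-validation would give a less optimistic estimate.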

In [None]:
# YOUR SOLUTION

#### TODO 6

Use `classification_report` to print the per-person metrics of the model (precision, recall, and F1-score) on the test set.
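
A small sketch of what the report contains, assuming scikit-learn's `classification_report` and hand-written toy labels. The label lists and the names `person_a`, `person_b`, and `person_c` are hypothetical.

```python
from sklearn.metrics import classification_report

# Toy labels (assumption): 8 samples, 3 classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
names = ["person_a", "person_b", "person_c"]  # hypothetical class names

# Text report with one row per class
print(classification_report(y_true, y_pred, target_names=names))

# The same numbers as a dictionary, convenient for further processing
report = classification_report(y_true, y_pred, target_names=names, output_dict=True)
print("Overall accuracy:", report["accuracy"])
```

Each row reports precision, recall, F1-score, and support for one class, which shows which people the model confuses most often.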

In [None]:
# YOUR SOLUTION

#### TODO 7

- Plot faces to compare the predictions of the KNN model with the actual labels (for example, 5 random photos per person). Suggested steps:
  - Randomly pick 5 photos from the test set for each person (0 - 6)
  - Use the trained model to predict their names
  - Plot the images, displaying both the correct name and the name predicted by the model

In [None]:
# YOUR SOLUTION

## Problem 2: Console game Semantris (3 points)

In this problem, we will build a simple game that runs in the Python console. The game imitates Google's Semantris; try it [here](https://research.google.com/semantris/) (choose Play Arcade).

### Guidance

We are given the pre-trained model below:

In [None]:
import tensorflow_hub as hub
import tensorflow as tf

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)

This model converts English words into vectors. In the example below, each word becomes a vector of 512 numbers.

In [None]:
result = model(["cat", "dog", "chihuahua"]).numpy()
print(type(result))
print(result.shape)

<class 'numpy.ndarray'>
(3, 512)
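
One common way to compare such vectors is cosine similarity. Whether the game diagram uses this measure is an assumption, and the vectors below are random stand-ins rather than real model outputs; in the notebook they would be rows of `result`.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-ins for 512-dimensional embeddings (not real model outputs)
rng = np.random.default_rng(0)
vec_a = rng.normal(size=512)
vec_near = vec_a + 0.5 * rng.normal(size=512)  # a perturbed copy of vec_a
vec_far = rng.normal(size=512)                 # an unrelated random vector

print("a vs near:", cosine_similarity(vec_a, vec_near))
print("a vs far:", cosine_similarity(vec_a, vec_far))
```

A vector that points in nearly the same direction scores close to 1, while an unrelated vector scores close to 0.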


Let's design a game following [this diagram](https://drive.google.com/file/d/1WQdZGszYniiBzoDpdx-VWEIyNz2rFvip/view?usp=sharing)

Notes
- File `words.txt`: download it [here](https://drive.google.com/file/d/1KYMBK_j3g7_ROEJ46Nb0PmUerY5Xdyx_/view?usp=sharing)
- You need to `strip` and `lowercase` the value `y`
- The value of `y` must not duplicate `x`
- The user has only 3 lives for the entire game.
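
The input-handling notes above can be sketched as two small helpers. The function names `normalize_guess` and `is_duplicate` are hypothetical, and the sketch assumes `x` is a single word and `y` is the text typed by the user; the rest of the game flow is left to the diagram.

```python
def normalize_guess(raw_text):
    """Strip surrounding whitespace and lowercase the user's input y."""
    return raw_text.strip().lower()

def is_duplicate(y, x):
    """Return True when the normalized guess y is the same word as x (assumed to be one word)."""
    return y == x.strip().lower()

x = "cat"  # hypothetical word shown to the player
for raw in ["  Dog ", " CAT\n"]:
    y = normalize_guess(raw)
    print(repr(y), "duplicate of x" if is_duplicate(y, x) else "accepted")
```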

In [None]:
with open('Problem_2_words.txt', 'r') as f:
    words = [line.strip() for line in f]

print(f"Number of words in file txt: {len(words)}")

Number of words in file txt: 476


In [None]:
# YOUR SOLUTION