<a href="https://colab.research.google.com/github/j0ngle/wvu-engr-camp/blob/main/ASL_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
PATH = '/content/drive/MyDrive/data'

# Preprocessing Data

In [None]:
import os
import pandas as pd

def get_csv_path(filename, path):
  #Despite the name, this loads the CSV at path/filename and returns a DataFrame
  csv_path = os.path.join(path, filename)
  return pd.read_csv(csv_path)

In [None]:
train_whole = get_csv_path('ASL_train.csv', PATH)
test_whole = get_csv_path('ASL_test.csv', PATH)

train_whole.head()

The image data is saved in the CSV as strings, each holding a 1D array of length 2500. Because of how Pandas' to_csv works, there is a newline character every 8 entries. Below are the steps to convert these strings back to numpy arrays:

1. Get rid of end brackets by getting the substring of each image from index 2 to index len(< str >) - 2

2. Use numpy's .fromstring function on \n separators to convert the string to an array

3. Reshape the numpy array with .reshape to get the original 50x50 image
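
The three steps above can be tried out on a tiny array first. This is only a sketch, not the notebook's exact code: `parse_image_string` is a made-up helper, it swaps in `str.split` for `np.fromstring`, and a 4x4 array stands in for a 50x50 image.

```python
import numpy as np

def parse_image_string(text, shape):
  # Hypothetical helper: strip the surrounding brackets, split on any
  # whitespace (spaces and newlines alike), then restore the 2D shape.
  values = np.array(text.strip("[] \n").split(), dtype=float)
  return values.reshape(shape)

# A 4x4 stand-in for a 50x50 image, flattened and written out as text
# with forced line wrapping, similar to what ends up in a CSV cell.
original = np.arange(16, dtype=float).reshape(4, 4)
as_text = np.array2string(original.ravel(), max_line_width=30)

restored = parse_image_string(as_text, (4, 4))
print(restored.shape)
```

If the round trip works, `restored` should be identical to `original`, which is a cheap sanity check before running the conversion over the whole dataset.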

After the list is filled, I am going to ditch the dataframe and convert all of the lists to numpy arrays. I had previously replaced the index in the dataframe with updated values, but that wasn't playing nice with model.fit() later. Sending them straight to numpy arrays with .asarray() should address this problem

Another Note: I forgot that I need my labels to be integers and I'm not running the preprocessing step again, so I'm just going to encode them manually here
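
As an aside, the manual label encoding can also be done with a lookup table instead of an if/elif chain. This is a sketch of an alternative, not what the cell below does; `label_to_index` and `encode_label` are names I made up for illustration.

```python
import string

# Same ordering the notebook uses: A-Z first, then the three special signs.
class_names = list(string.ascii_uppercase) + ['nothing', 'space', 'del']
label_to_index = {name: index for index, name in enumerate(class_names)}

def encode_label(label):
  # Raises KeyError on an unexpected label instead of silently leaving a gap.
  return label_to_index[label]

print(encode_label('B'), encode_label('del'))
```

One upside of the dictionary is that a misspelled label fails loudly rather than leaving a `None` in the label list.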

In [None]:
import numpy as np

length = len(train_whole['images'])
img_array_train = [None] * length
label_array_train = [None] * length

for i in range(0, length):
  #Grab image
  img = train_whole['images'][i]

  #Get substring (omitting brackets)
  img = img[2:len(img) - 2]

  #Convert to array, reshape, and add to new list
  img_array_train[i] = np.fromstring(img, sep='\n').reshape(50, 50)

  #Handling labels
  label = train_whole['labels'][i]
  if len(label) == 1:
    label_array_train[i] = ord(label) - ord('A')
  else:
    if label == 'nothing':
      label_array_train[i] = 26
    elif label == 'space':
      label_array_train[i] = 27
    elif label == 'del':
      label_array_train[i] = 28

length = len(test_whole['images'])
img_array_test = [None] * length
label_array_test = [None] * length

for i in range(0, length):
  img = test_whole['images'][i]

  img = img[2:len(img) - 2]

  img_array_test[i] = np.fromstring(img, sep='\n').reshape(50, 50)

  label = test_whole['labels'][i]
  if len(label) == 1:
    label_array_test[i] = ord(label) - ord('A')
  else:
    if label == 'nothing':
      label_array_test[i] = 26
    elif label == 'space':
      label_array_test[i] = 27
    elif label == 'del':
      label_array_test[i] = 28

In [None]:
class_names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
       'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
       'V', 'W', 'X', 'Y', 'Z', 'nothing', 'space', 'del']

In [None]:
from sklearn.utils import shuffle

#x represents data
#y represents labels
#Shuffle images and labels in one call so every image keeps its label
x_train_whole, y_train_whole = shuffle(np.asarray(img_array_train),
                                       np.asarray(label_array_train),
                                       random_state=42)

x_test, x_train = x_train_whole[:17000] / 255.0, x_train_whole[17000:] / 255.0
y_test, y_train = y_train_whole[:17000], y_train_whole[17000:]

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Just confirming that the data is formatted correctly

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import cv2

def plot_sign(img, label):
  plt.imshow(img, cmap='binary')
  plt.axis('off')
  plt.show()
  print("Label: ", class_names[label])

In [None]:
instance = 0
plot_sign(x_train[instance], y_train[instance])

All of the data appears to be formatted and labeled correctly. The Kaggle contributor included a test set, but there is only 1 image for each sign (totalling 29 images including del, nothing, and space).

I'm going to make a new test set containing about 20% of all of the training data and then a validation set containing a fraction of what remains. This way there is adequate material to train, validate, and test on
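
Here is a rough sketch of that kind of three-way split using only numpy. The function name `split_dataset` and the fractions (20% test, 10% of the remainder for validation) are assumptions for illustration; the notebook itself slices off a fixed 17000 samples for testing.

```python
import numpy as np

def split_dataset(x, y, test_fraction=0.2, val_fraction=0.1, seed=42):
  # Shuffle images and labels with the same permutation so pairs stay aligned.
  rng = np.random.default_rng(seed)
  order = rng.permutation(len(x))
  x, y = x[order], y[order]

  n_test = int(len(x) * test_fraction)
  n_val = int((len(x) - n_test) * val_fraction)

  x_test, y_test = x[:n_test], y[:n_test]
  x_val, y_val = x[n_test:n_test + n_val], y[n_test:n_test + n_val]
  x_train, y_train = x[n_test + n_val:], y[n_test + n_val:]
  return (x_train, y_train), (x_val, y_val), (x_test, y_test)

# Toy data: sample i has label i, which makes alignment easy to check.
toy_x = np.arange(100).reshape(100, 1)
toy_y = np.arange(100)
train, val, test = split_dataset(toy_x, toy_y)
print(len(train[0]), len(val[0]), len(test[0]))
```

The three pieces never overlap, so the test set stays unseen until the very end.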

# Creating the Model

Here we are going to build the model. Before we do that, we need to establish what everything is and what it's doing.

`keras.models.Sequential` - Defines the Sequential API. This just says we are going to pass our data straight through the network, one layer after another, and that we aren't going to need the ability to send it out of order at all. It takes a `list` of layers as its input.

`keras.layers.Flatten` - A Dense layer expects a 1D vector, not a 2D matrix, so we use this layer to unroll the 50x50 image into a vector of 2500 values the network can understand

`keras.layers.Dense` - This is the most basic layer. It is simply a fully connected layer, meaning the output of every neuron in a layer is mapped to every neuron in the next layer

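ReLU and Softmax - Common activation functions. ReLU zeroes out negative values and is typically used on the hidden layers. Softmax turns the output layer's values into probabilities that add up to 1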
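
To make these pieces less abstract, here is a rough numpy sketch of what Flatten, Dense, ReLU, and Softmax each do to a single image. It is not how Keras implements them internally; the random weights, the single 300-neuron hidden layer, and the helper names are all assumptions for illustration.

```python
import numpy as np

def relu(z):
  # Negative values become 0, positive values pass through unchanged.
  return np.maximum(z, 0.0)

def softmax(z):
  # Subtracting the max keeps exp() from overflowing; the result sums to 1.
  exp = np.exp(z - np.max(z))
  return exp / np.sum(exp)

def dense(x, weights, bias, activation):
  # Fully connected: every input contributes to every output neuron.
  return activation(x @ weights + bias)

rng = np.random.default_rng(0)
image = rng.random((50, 50))
flat = image.reshape(-1)  # what Flatten does: (50, 50) -> (2500,)

# Random stand-in weights; a trained network would have learned these.
w1, b1 = rng.normal(0, 0.01, (2500, 300)), np.zeros(300)
w2, b2 = rng.normal(0, 0.01, (300, 29)), np.zeros(29)

hidden = dense(flat, w1, b1, relu)
probs = dense(hidden, w2, b2, softmax)
print(flat.shape, hidden.shape, probs.shape)
```

The final `probs` vector has one entry per class, which is why the output layer needs exactly as many neurons as there are classes.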



To define the model, we simply pass in a list of our layers. You can probably gather what all the inputs mean. For the most part the number of neurons we choose for each layer is arbitrary, with the exception of the output layer.

The output *must* have the same number of neurons as we have possible output classes. In the case of the ASL dataset we have 29 (26 letters + space + del + nothing)

---


```
import tensorflow as tf
from tensorflow import keras
# Import from tensorflow.keras so the layers match the TensorFlow install
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

layers = [
  Flatten(input_shape=[50, 50]),
  Dense(300, activation="relu"),  
  Dense(100, activation="relu"),
  Dense(29, activation="softmax") 
]

model = Sequential(layers)
```

We can use `.summary()` to show all the details of every layer.

```
model.summary()
```

For reference, a model I am working on has 1,436,849,665 trainable parameters
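
If you are curious where the parameter counts in the summary come from, a Dense layer has one weight per input-output pair plus one bias per output neuron. The sketch below does that arithmetic for the layer sizes used in the model above; `dense_params` is just a helper name I made up.

```python
def dense_params(n_inputs, n_units):
  # One weight for each (input, neuron) pair, plus one bias per neuron.
  return n_inputs * n_units + n_units

# Flatten has no parameters; it only turns 50x50 into 2500 inputs.
layer_sizes = [50 * 50, 300, 100, 29]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```

That works out to 783,329 trainable parameters for this network, nearly all of them in the first Dense layer.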

---

`.compile()` is used to prep our network for training. Here we define our loss function, optimizer, and desired metrics.

**sparse_categorical_crossentropy** - let's break this down. "Crossentropy" is a type of loss function which essentially figures out how far our output is from the correct answer. "categorical" suggests we are working with numerical labels which correspond to some category list. By default, categorical crossentropy expects your labels to be one-hot encoded. Ours are not. Thus we need to specify that our labels are "sparse"

**one-hot encoding** - If I remember correctly it works like this. Let's say we have n=4 possible categories (for simplicity). As it stands our labels are numbers that represent each of the 4 categories, so a label is either 0, 1, 2, or 3. One-hot converts each label into an array with length = n that is all zeros except for a 1 at the label's index. So our four options are [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], or [0, 0, 0, 1]
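
To see why the "sparse" version is just a convenience, here is a small numpy sketch that computes the loss both ways and gets the same numbers. These are hand-written illustrations of the formulas, not the Keras implementations, and the probabilities are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
  # All zeros except a 1 at each label's index.
  encoded = np.zeros((len(labels), num_classes))
  encoded[np.arange(len(labels)), labels] = 1.0
  return encoded

def categorical_crossentropy(y_one_hot, probs):
  # Expects one-hot labels; only the true class's term survives the sum.
  return -np.sum(y_one_hot * np.log(probs), axis=1)

def sparse_categorical_crossentropy(y_int, probs):
  # Expects integer labels; picks out the true class's probability directly.
  return -np.log(probs[np.arange(len(y_int)), y_int])

labels = np.array([2, 0])
probs = np.array([[0.1, 0.2, 0.6, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])

dense_loss = categorical_crossentropy(one_hot(labels, 4), probs)
sparse_loss = sparse_categorical_crossentropy(labels, probs)
print(dense_loss, sparse_loss)
```

Both losses are the negative log of the probability assigned to the correct class, so a confident correct prediction gives a loss near 0.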

**sgd** - Short for "Stochastic Gradient Descent". There is some information on this in the Extra Content section of the powerpoint if you are interested :)
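
For a quick feel of what SGD does, here is a toy sketch that learns a single weight `w` so that `w * x` matches `y = 2x`, updating after every individual sample. The data, learning rate, and epoch count are all made up for illustration, and a real optimizer usually works on mini-batches rather than single samples.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x  # the "right answer" is w = 2

w = 0.0
learning_rate = 0.1
for epoch in range(20):
  # "Stochastic": visit the samples in a random order each epoch.
  for i in rng.permutation(len(x)):
    # Gradient of the squared error (w * x - y)^2 with respect to w.
    gradient = 2.0 * (w * x[i] - y[i]) * x[i]
    # Step a little way downhill.
    w -= learning_rate * gradient
print(w)
```

Each update nudges `w` in the direction that reduces the error on one sample, and over many updates it settles very close to 2.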

---
```
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
```

`.fit` just starts the training loop. We pass in our training set, x_train, and our training labels, y_train

*epochs* - Essentially the number of times we are going to loop through the whole training set. So here we are going to run through all of the images 10 times.

*validation_split* - During training, it is generally bad to test your results with your testing set. Once the model has seen the image in the test set it can end up having a bias toward those images because it's seen them before. Instead, we can use this to create what's called a "validation set". This is just a subset of our training set used for periodic testing to give us feedback on what's going on. It serves as a good estimate of how it'll perform on the testing set without spoiling the test set itself
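
As far as I know, Keras takes the validation samples from the end of the arrays you pass to `.fit`, before any shuffling happens. The numpy sketch below mimics that idea on toy data; `hold_out_tail` is a made-up helper, not a Keras function, and the 0.1 fraction is only an example value.

```python
import numpy as np

def hold_out_tail(x, y, validation_split):
  # Keep the last `validation_split` fraction of samples for validation.
  n_val = int(len(x) * validation_split)
  split_at = len(x) - n_val
  return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

toy_x = np.arange(100).reshape(100, 1)
toy_y = np.arange(100)
(train_x, train_y), (val_x, val_y) = hold_out_tail(toy_x, toy_y, validation_split=0.1)
print(len(train_x), len(val_x))
```

Because the tail of the data is used, it matters that the training set was shuffled beforehand. Otherwise the validation set could end up containing only a few classes.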

---

```
history = model.fit(x_train, y_train, epochs=10)

```


Notice the variable name we stored the results in, `history`. 

History is a neat thing with Keras that automatically saves the results from each epoch. So here we can create a DataFrame from the history and plot it.

We would expect the loss to decline over time and the accuracy to increase.

When analyzing graphs be sure to keep convergence in mind. If you don't know what convergence means, it's basically just the graph gradually getting closer and closer to some value. We don't want the graph to oscillate wildly - that suggests something could be wrong with the network, training loop, loss function, optimizer, input preprocessing, etc.
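
If you would rather check this with numbers than by eye, one rough approach is to count how often the loss goes up from one epoch to the next. The loss values below are invented for illustration (they are not results from this model), and `count_increases` is a made-up helper.

```python
def count_increases(values):
  # Number of epochs where the value went up compared to the previous epoch.
  return sum(1 for prev, curr in zip(values, values[1:]) if curr > prev)

# Invented loss curves, for illustration only.
smooth_loss = [2.9, 2.1, 1.6, 1.3, 1.1, 1.0, 0.95]
noisy_loss = [2.9, 2.1, 2.6, 1.9, 2.4, 1.8, 2.3]

print(count_increases(smooth_loss), count_increases(noisy_loss))
```

In a real run the same function could be applied to `history.history['loss']`. A few small bumps are normal; frequent large ones are worth investigating.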

---

```
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.show()
```

# Testing

To get some predictions, we can call `.predict`. Just pass in the testing set and it'll spit out some predictions - though these may not be exactly what you'd expect.

Say we feed a single image of the letter B to the model. Instead of spitting out '1' (the index of B in our class list), it will give us a list of 29 elements. Each of these elements corresponds to the model's certainty that that index is the correct one, and together they add up to 1. So it'll look something like this:

[0.01, 0.90, 0.03, 0.01, 0.02, 0.01, ...]

So we need a way to grab the index with the highest certainty. Luckily `np.argmax` exists. This is a numpy function that just gives you the index of the largest element.
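
Here is a tiny standalone example of `np.argmax` on a batch of predictions. The probabilities are made up and use only 3 classes to keep it readable.

```python
import numpy as np

# One row per image, one column per class (made-up values).
probs = np.array([[0.05, 0.90, 0.05],
                  [0.70, 0.20, 0.10],
                  [0.10, 0.30, 0.60]])

# axis=-1 means "find the largest entry within each row".
predicted = np.argmax(probs, axis=-1)
print(predicted)
```

Each entry in `predicted` is the class index the model is most confident about for that image.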

```
import numpy as np
y_pred = np.argmax(model.predict(x_test), axis=-1)
y_pred
```

Now we have an array of index predictions and an array of labels. Now it's super easy to determine the accuracy. We can just compare each index of our prediction list to the same index in our label list and bang, we have an accuracy.

```
total = 0
correct = 0
for i in range(0, len(y_test)):
  total += 1

  if (y_pred[i] == y_test[i]):
    correct += 1

print("Accuracy: ", correct / total)
```
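
The same accuracy can be computed without a loop, since comparing two numpy arrays gives an array of True/False values whose mean is the fraction that match. This is a sketch on made-up labels; `accuracy` is just a helper name.

```python
import numpy as np

def accuracy(y_true, y_pred):
  # Element-wise comparison gives booleans; their mean is the fraction correct.
  return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

example_labels = [0, 1, 2, 2]
example_preds = [0, 1, 1, 2]
print("Accuracy: ", accuracy(example_labels, example_preds))
```

Three of the four made-up predictions match, so this prints an accuracy of 0.75.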

And numbers are cool and all, but it's always cooler to actually see the results. This loop just prints out an image, its predicted label, and its actual label

```
class_names = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L',
       'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
       'V', 'W', 'X', 'Y', 'Z', 'nothing', 'space', 'del']

for i in range(0, 10):
  plot_sign(x_test[i], y_test[i])
  index = y_pred[i]
  print("Pred :", class_names[index])
```