# GRADTDA5622 - Big Data Computing Foundations 2
## Final Project: Sign Language Translator
Replace the example information below with your real information:
- Semester: Spring/Autumn 20xx (FILL IN)
- Instructor: Mr. X (FILL IN)
- Section: 12345 (FILL IN) (if applicable)
- Student Name(s): Able Baker (FILL IN)
- Student Email(s): baker.12345@osu.edu (FILL IN)
- Student ID(s): 123456789  (FILL IN)
***

***
# Section: Overview
- Insert a short description of the scope of this exercise, any supporting information, etc.
- **(I will fill this in for each assignment - Tom Bihari)**
***

### Assignment Overview
In this assignment, you will build a sign language translator.

**The Objectives of This Assignment are:**
1. To perform a classification task on image data, using the K-Nearest-Neighbors algorithm from the SciKit-Learn library (https://sklearn.org/).
2. To understand domain-independent evaluation measures.
3. To understand the impact meta-parameters have on algorithm performance.
4. To develop a tool that uses the classification model that was developed.
5. To get practice discussing / explaining your results, findings, and insights.

### Problem Statement
Assume that you are the Director of Data Science for American Signing, Inc. (ASI), a company that provides innovative sign-language solutions.  ASI has new technology that can capture American Sign Language images in real time.  ASI would like your team to design a new product that can translate ASL images to text.

### Things To Do
Follow the instructions for each step in the sections below.

### Notes

- This dataset has exactly the same format as the widely used "standard" MNIST dataset.  See the links below for the documentation of the dataset.  You can also search for MNIST on the web.
  - This dataset was pulled on 4/13/23 from: https://www.kaggle.com/code/madz2000/cnn-using-keras-100-accuracy
  - See also: https://en.wikipedia.org/wiki/MNIST_database
  - These are 28x28 gray-scale pixel images, with 256-color (or gray-scale) values.
- You will use the KNN classifier that is provided in the SciKit-Learn library (similar to the Case Studies you have done).  You do not need to write your own.
- You will be adjusting the number of training records for the exercise (so it runs reasonably fast), so you do not need to run the algorithm on "all" training records.
- Some code is provided (partially filled in) to assist in the development of the final product - this is a starting point.  You may adjust it as you choose.
- By the way, this exercise is identical to one where, for example, the images are medical scans, etc., and you are trying to classify cells as diseased or healthy.  (There are lots of MNIST examples on the web.)
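
As a minimal sketch of the layout described in the notes above, assuming each row holds one label followed by 784 pixel values (28 x 28), the following builds a tiny synthetic frame and reshapes one row into an image array. The column names, the `demo_` variables, and the random pixel values are made up for illustration and are not taken from the real CSV files.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 3 rows, 1 label column + 784 pixel columns.
rng = np.random.default_rng(0)
n_rows, side = 3, 28
pixels = rng.integers(0, 256, size=(n_rows, side * side))  # gray values 0..255
demo_pdf = pd.DataFrame(pixels, columns=[f"pixel{i + 1}" for i in range(side * side)])
demo_pdf.insert(0, "label", [0, 1, 2])  # hypothetical label column placed first

# Split into labels (first column) and attributes (all remaining columns).
demo_y = demo_pdf.iloc[:, 0]
demo_X = demo_pdf.iloc[:, 1:]

# One row of 784 values becomes one 28x28 image.
demo_image = demo_X.iloc[0].to_numpy().reshape(side, side)
print(demo_X.shape, demo_image.shape)
```

Slicing with `1:` keeps every pixel column, which is what allows a row to be reshaped to exactly 28 x 28.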

It is essential that you **communicate** your goals, thought process, actions, results, and conclusions to the **audience** that will consume this work.  It is **not enough** to show just the code.  It is not appropriate to show long sections of **unexplained printout**, etc.  Be kind to your readers and provide value to them!

**ALWAYS follow this pattern** when doing **each portion** of the work.  This allows us to give feedback and assign scores, and to give partial credit.  Make it easy for the reader to understand your work.
- Say (briefly) **what** you are trying to do, and **why**.
- Do it (code).
- Show or describe the **result** clearly (and briefly as needed), and explain the significant **conclusions or insights** derived from the results. 

**HAVE FUN!**

***
# Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import time

In [None]:
# Convert label numbers 0-25 to characters A-Z
def label_to_letter(label_num):
    return chr(ord('A') + label_num)

# Convert characters A-Z to label numbers 0-25
def letter_to_label(letter):
    return ord(letter) - ord('A')

# Take a numpy array of y labels (true or predicted) and build a string with the decoded letters 
def y_labels_to_string(y_array):
    st = ""
    for num in np.nditer(y_array):
        st += label_to_letter(num)
    return st

***
# Read the Data

In [None]:
# Read the training data
train_pdf = pd.read_csv('../shared/sign_mnist_train.csv')

In [None]:
# Read the test data
test_pdf = pd.read_csv('../shared/sign_mnist_test.csv')
test_pdf.describe()

***
# Pre-process the Data for Use

In [None]:
# Normally, you might do EDA here to analyze and handle missing or bad values, transform or eliminate columns, scale values, etc.
# We don't have to do this with this dataset.

In [None]:
# Split the training and test datasets into X and y parts.
# y contains the labels (first column)
# X contains the attributes (remaining columns of the dataset)

train_row_count = train_pdf.shape[0]  # Gives number of rows
train_col_count = train_pdf.shape[1]  # Gives number of columns
test_row_count  =  test_pdf.shape[0]  # Gives number of rows
test_col_count  =  test_pdf.shape[1]  # Gives number of columns
print(train_row_count,train_col_count,test_row_count,test_col_count)

num_train = 6000  # Trim the number of training rows to use (you will adjust this later when optimizing)
num_test  = 1000  # Trim the number of test rows to use (don't change this)

y_train = train_pdf.iloc[:num_train, 0]
X_train = train_pdf.iloc[:num_train, 1:]  # All pixel columns (a slice of 1:-1 would drop the last pixel)

y_test = test_pdf.iloc[:num_test, 0]
X_test = test_pdf.iloc[:num_test, 1:]  # All pixel columns, matching X_train

***
# Run the K-Nearest-Neighbors Algorithm

In [None]:
# Set a value for K (you will adjust this later when optimizing)
K = 5

In [None]:
# Initialize the SciKit-Learn classifier and fit the data (train the classifier)
classifier = KNeighborsClassifier(n_neighbors=K, metric='euclidean', weights='distance')
classifier.fit(X_train, y_train)

In [None]:
# Predict the test data
y_pred = classifier.predict(X_test)

In [None]:
# Compute confusion matrix, classification report, and accuracy score
conf_matrix = confusion_matrix(y_test,y_pred)
classif_report = classification_report(y_test,y_pred)

print('Confusion Matrix (Rows=Actual, Cols=Predicted)')
print(conf_matrix)
print('\nClassification Report')
print(classif_report)
print('\nAccuracy:', accuracy_score(y_test,y_pred))

## Evaluate the classification errors
- Describe the errors qualitatively.  What do you notice?
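
One possible way to start looking at the errors is to list the largest off-diagonal entries of the confusion matrix. The sketch below does this on a small made-up example. The helper name `top_confusions` and the `demo_` label lists are hypothetical, and this is only one way to summarize errors, not a required part of the assignment.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def top_confusions(y_true, y_pred, n=3):
    """Return up to n (true_label, predicted_label, count) tuples, largest count first."""
    labels = np.unique(np.concatenate([np.asarray(y_true), np.asarray(y_pred)]))
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    np.fill_diagonal(cm, 0)  # ignore correct predictions
    flat_order = np.argsort(cm, axis=None)[::-1][:n]  # largest counts first
    rows, cols = np.unravel_index(flat_order, cm.shape)
    return [(int(labels[r]), int(labels[c]), int(cm[r, c]))
            for r, c in zip(rows, cols) if cm[r, c] > 0]

# Made-up labels for illustration only.
demo_true = [0, 0, 0, 1, 1, 2, 2, 2]
demo_pred = [0, 1, 1, 1, 1, 2, 0, 2]
print(top_confusions(demo_true, demo_pred))
```

Each tuple can be decoded with `label_to_letter` to see which letters are confused with each other.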

In [None]:
# Add your text here

In [None]:
# Show an image taken from a row in an image dataframe (e.g., test_pdf).
import matplotlib.pyplot as plt
def show_image(caption,pixels):
    plt.figure(figsize=(1, 1))
    plt.gca().axes.get_xaxis().set_visible(False)
    plt.gca().axes.get_yaxis().set_visible(False)
    plt.title(caption)
    image = np.asarray(pixels).reshape((28,28))  #print(image.shape)
#    https://matplotlib.org/stable/tutorials/colors/colormaps.html
#    plt.imshow(image, cmap='gray', vmin = 0, vmax = 255, interpolation='none') #with grayscale colormap
    plt.imshow(image) #with default colormap (viridis)
    plt.show()

In [None]:
# Show test records that are mis-classified.
max_to_show = 5  # Limit the display
shown = 0
for i in range(min(len(y_test), len(y_pred))):
    if(y_pred[i] != y_test[i]):
        my_pred_label = y_pred[i]
        my_true_label  = y_test[i]
        my_pixels = test_pdf.iloc[i,1:]
        caption = "Record=" + str(i) + \
            ".  True label=" +      str(my_true_label) + ": " + label_to_letter(my_true_label) + \
            ".  Predicted label=" + str(my_pred_label) + ": " + label_to_letter(my_pred_label) + "."
        show_image(caption,my_pixels)
        shown += 1
        if shown >= max_to_show:
            print("Not all shown.")
            break

***
# Optimize
- Now we want to see how the behavior of the algorithm changes based on the meta-parameters.  The inputs are:
  - Different values for **K**
  - Different values for **num_train** (number of training records)
- The metrics are:
  - Accuracy
  - Running Time (This is secondary, but helps ensure the processing doesn't take too long.)

## Take the code you created above and make it into a single function with this shape:
- my_accuracy, my_running_time = **run_knn**( K, num_train )
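
A minimal sketch of the timing pattern, assuming wall-clock time around fit and predict is an acceptable measure of running time. It uses synthetic data from `make_classification` rather than the sign-language files, and the helper name `timed_knn_accuracy` and its arguments are hypothetical. Your own `run_knn` will need to slice `train_pdf` and `test_pdf` instead.

```python
import time
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def timed_knn_accuracy(X_tr, y_tr, X_te, y_te, k):
    """Fit a KNN classifier, predict, and return (accuracy, elapsed_seconds)."""
    start = time.perf_counter()
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean', weights='distance')
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    elapsed = time.perf_counter() - start
    return acc, elapsed

# Synthetic stand-in data: 300 records, 20 attributes, 3 classes.
demo_X, demo_y = make_classification(n_samples=300, n_features=20, n_informative=10,
                                     n_classes=3, random_state=0)
demo_acc, demo_time = timed_knn_accuracy(demo_X[:200], demo_y[:200],
                                         demo_X[200:], demo_y[200:], k=5)
print(f"accuracy={demo_acc:.3f}, seconds={demo_time:.4f}")
```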

In [None]:
def run_knn(K,num_train):
    # Fill in
    return acc, run_time

## Try various values for **K** and graph the results
- Try K = 1-20 or so.
- Use 6000 for num_train
- Choose a "best" value for K
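
The sweep described above can be sketched as a loop that collects one accuracy per K. The example below runs on synthetic data, so its numbers say nothing about the sign-language dataset. The names `demo_Ks` and `demo_accus` are hypothetical and mirror the `Ks` and `accus` lists that the graphing cell expects.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 300 training records, 100 test records.
demo_X, demo_y = make_classification(n_samples=400, n_features=20, n_informative=10,
                                     n_classes=3, random_state=0)
X_tr, y_tr = demo_X[:300], demo_y[:300]
X_te, y_te = demo_X[300:], demo_y[300:]

demo_Ks, demo_accus = [], []
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    demo_Ks.append(k)
    demo_accus.append(model.score(X_te, y_te))  # mean accuracy on the test split

# Pick the K with the highest accuracy (the first one wins on ties).
best_index = max(range(len(demo_accus)), key=demo_accus.__getitem__)
print("best K:", demo_Ks[best_index], "accuracy:", round(demo_accus[best_index], 3))
```

Highest accuracy is only one way to choose. A slightly less accurate K may be preferable if the curve is noisy or the running time differs noticeably.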

In [None]:
# Fill in

In [None]:
# Graph Accuracy:
plt.figure(figsize=(12, 6))
plt.plot(Ks, accus, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Accuracy per K Value')
plt.xlabel('K Value')
plt.ylabel('Accuracy')

In [None]:
# Explain, in your words, what you observe in the results.

## Try various values for **num_train** and graph the results
- Try several values of num_train, in multiples of 1000.
- Use a value for K that worked well above.
- Choose a "best" value for num_train.

In [None]:
# Fill in

In [None]:
# Graph Accuracy
# Fill in

In [None]:
# Explain, in your words, what you observe in the results.

***
# Create a Sign Language to Text Translator
- Now we want to build a translator that takes as input a CSV file of sign language images and prints out the corresponding text.
- The translator will work as follows:
  - Based on the experiments you ran above, choose a "best" K value and "num_train" value.
  - Create a new classifier using these parameters and train it on the training data (as you did above).
  - Create a function "translate_signs_to_text" that processes a CSV file of images.

## Take the code you created above and make it into a single function with this shape:
- true_text, pred_text, accuracy = **translate_signs_to_text**( input_sign_csv_filename )

In [None]:
# Initialize a SciKit-Learn classifier and fit the data (train the classifier)
#   with the "best" configuration parameters you "chose" above.

chosen_K = ...          # Fill this in
chosen_num_train = ...  # Fill this in

y_train = ...
X_train = ...

chosen_classifier = ...
chosen_classifier.fit(X_train, y_train)

In [None]:
# Fill in this function
def translate_signs_to_text(input_sign_csv_filename):
    # read the CSV file
    input_pdf = ...

    # create X and y tables from the dataframe
    y_true = ...
    X      = ...

    # predict the y values
    y_pred = ...
    
    # return the true text, predicted text, and the accuracy
    accuracy = ...
    true_text = y_labels_to_string(y_true)
    pred_text = ...
    return true_text, pred_text, accuracy

In [None]:
# Translate Message1
true_text, pred_text, acc = translate_signs_to_text('../shared/sign_mnist_message1.csv')
print("PRED:",pred_text)
# Guess the actual message.

In [None]:
# Now print the true message and the accuracy.  Were you right?
print("TRUE:",true_text)
print("ACC: ",acc)

In [None]:
# Translate Message2
true_text, pred_text, acc = translate_signs_to_text('../shared/sign_mnist_message2.csv')
print("PRED:",pred_text)
# Guess the actual message.

In [None]:
# Now print the true message and the accuracy.  Were you right?
print("TRUE:",true_text)
print("ACC: ",acc)

In [None]:
# Translate Message3
true_text, pred_text, acc = translate_signs_to_text('../shared/sign_mnist_message3.csv')
print("PRED:",pred_text)
# Guess the actual message.

In [None]:
# Now print the true message and the accuracy.  Were you right?  (What is the source of this quote?)
print("TRUE:",true_text)
print("ACC: ",acc)

***
# Optional Extra Credit: Create a Text to Sign Language Translator
- Now we want to build a translator function that takes as input a text string and outputs a CSV file of sign language images.
  - Create a function "translate_text_to_signs" as below.
  - Try the translator on several text strings (and use the translate_signs_to_text function above to check the results). 

In [None]:
def translate_text_to_signs(input_string, output_sign_csv_filename):
    # Hints:
    # Create a blank dataframe of the same shape as the dataframes above.
    # For each character in the input string:
    #   Find a row in an existing dataframe that matches that character.
    #   Copy that row to the new dataframe.
    # Write the final dataframe to the CSV file.
    return

***
# Write a summary of what you have learned in this exercise.

In [None]:
# Add your text here