# Dataset Exploration

**Goal:** Evaluate candidate datasets for training a braces / no-braces binary CNN classifier.
1. What types of ready-made datasets are available online?
2. What image file types do those datasets use?
3. How do the answers to the above affect UX workflows?

## 0 Set Up

### 0.1 Imports

In [None]:
# import utility, data analysis, and plotting libraries
import os
import numpy as np
import matplotlib.pyplot as plt

# import AI/ML packages
import cv2
from sklearn.model_selection import train_test_split

In [None]:
# set plotting defaults
%matplotlib inline

## 1 Kaggle Datasets

### 1.1 Kaggle Setup

__A) Create Kaggle API Token__
1. create (if needed) and log in to a Kaggle account
1. go to the account page and select `Create New API Token` to download the `kaggle.json` file
1. upload `kaggle.json` to Colab in the `/content/` directory (the default working directory)

__B) Set Up Colab with Kaggle API Token__

1. pip install kaggle

In [None]:
! pip install kaggle

2. create a hidden `.kaggle/` directory
3. copy the `kaggle.json` we uploaded in part (A) to the `.kaggle/` directory
4. restrict the API key file to owner read/write only (`chmod 600`)

In [None]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle
! chmod 600 ~/.kaggle/kaggle.json
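
The same token setup can also be sketched in plain Python, for example to reuse it outside a notebook. This is a minimal sketch and not part of the Kaggle tooling: `install_kaggle_token` is a hypothetical helper name, and the default paths assume the Colab layout described above.

```python
import os
import shutil
import stat


def install_kaggle_token(src="kaggle.json", home=None):
    """Copy a Kaggle API token into <home>/.kaggle with owner-only access.

    Hypothetical helper that mirrors the mkdir / cp / chmod shell steps.
    """
    home = home or os.path.expanduser("~")
    kaggle_dir = os.path.join(home, ".kaggle")
    os.makedirs(kaggle_dir, exist_ok=True)  # no error if the directory exists
    dest = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(src, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return dest
```

Calling `install_kaggle_token()` from `/content/` would be expected to have the same effect as the three shell commands.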

### 1.2 Kaggle Data Extraction

1. Download the Tufts Dental Database

In [None]:
! kaggle datasets download deepologylab/tufts-dental-database

2. Unzip the download

In [None]:
%%capture
! unzip tufts-dental-database.zip
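
Before relying on hard-coded folder paths, it can help to peek at the archive's layout. The following is a minimal sketch using the standard-library `zipfile` module; `top_level_entries` is a hypothetical helper, and the archive name in the usage comment is assumed to match the download above.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level file and folder names inside a zip archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/")[0] for name in zf.namelist()})


# Usage, assuming the archive downloaded above is in the working directory:
# print(top_level_entries("tufts-dental-database.zip"))
```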

## 2 EDA for training & segmented images

### 2.1 Metadata eval

1. List the file names of the training images and the labeled teeth masks

In [None]:
train_image_fnames = os.listdir('./Tufts Dental Database/Radiographs')
label_image_fnames = os.listdir('./Tufts Dental Database/Segmentation/teeth_mask')

2. Check how many dental images and masks each folder contains
3. Peek at the file names and confirm all files are JPGs

In [None]:
print(f'Length of training images: {len(train_image_fnames)}\nLength of teeth masks: {len(label_image_fnames)}')

# what do the file names look like?
sample_training_fnames = ', '.join(train_image_fnames[:5])
sample_label_fnames = ', '.join(label_image_fnames[:5])
print(f'Example training image file names: {sample_training_fnames}\nExample label image file names: {sample_label_fnames}')

In [None]:
def count_file_endings(file_list):
    file_endings = {}
    for fname in file_list:
        _, file_extension = os.path.splitext(fname)
        if file_extension in file_endings:
            file_endings[file_extension] += 1
        else:
            file_endings[file_extension] = 1
    return file_endings

training_file_endings = count_file_endings(train_image_fnames)
label_file_endings = count_file_endings(label_image_fnames)

print(f'Training file endings: {training_file_endings}\nLabel file endings: {label_file_endings}')
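
The tally above is case-sensitive, so `.JPG` and `.jpg` are counted as different endings. If only the file type matters, a case-insensitive variant can be sketched with `collections.Counter`. `count_extensions` is a hypothetical name, and the sample file names are made up for illustration.

```python
import os
from collections import Counter


def count_extensions(file_list):
    """Count file extensions, treating '.JPG' and '.jpg' as the same type."""
    return Counter(os.path.splitext(fname)[1].lower() for fname in file_list)


# Made-up file names, for illustration only
print(count_extensions(["1.JPG", "2.jpg", "notes.txt"]))
```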

4. Confirm each image in the training set has a corresponding mask

In [None]:
# uppercase file names so the comparison ignores case (e.g. '.JPG' vs '.jpg')
train_image_fnames = sorted(fname.upper() for fname in train_image_fnames)
label_image_fnames = sorted(fname.upper() for fname in label_image_fnames)

In [None]:
train_image_set = set(train_image_fnames)
label_image_set = set(label_image_fnames)

# find the files unique to each set (the two halves of the symmetric difference)
train_only = train_image_set - label_image_set
label_only = label_image_set - train_image_set

print(f'Images in training set but not in label set: {train_only}\nImages in label set but not in training set: {label_only}')
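
Uppercasing works when the two folders differ only in letter case. A slightly more general check compares files by their stem (the name without its extension), which would also hold if the masks used a different extension. This is a minimal sketch; `pair_by_stem` is a hypothetical helper and not part of the original workflow.

```python
import os


def pair_by_stem(image_fnames, mask_fnames):
    """Pair images with masks that share a case-insensitive stem.

    Returns (pairs, unmatched_images, unmatched_masks).
    """
    def stem(fname):
        return os.path.splitext(fname)[0].lower()

    images = {stem(f): f for f in image_fnames}
    masks = {stem(f): f for f in mask_fnames}
    shared = sorted(images.keys() & masks.keys())
    pairs = [(images[s], masks[s]) for s in shared]
    unmatched_images = sorted(images[s] for s in images.keys() - masks.keys())
    unmatched_masks = sorted(masks[s] for s in masks.keys() - images.keys())
    return pairs, unmatched_images, unmatched_masks
```

Building the image and mask arrays from `pairs` would keep them index-aligned, which matters once the two arrays are split together.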

### 2.2 Data sampling (view some images)

In [None]:
MAX_PIXEL_VALUE = 255
TRAIN_IMG_DIR = './Tufts Dental Database/Radiographs'
LABEL_IMG_DIR = './Tufts Dental Database/Segmentation/teeth_mask'

def get_train_img_path(img_fname: str):
  return os.path.join(TRAIN_IMG_DIR, img_fname)

def get_label_img_path(img_fname: str):
  return os.path.join(LABEL_IMG_DIR, img_fname)

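def read_image(fpath: str, astype=np.float32, width=256, height=256, normalize=True):
  # read the file, then resize; cv2.resize takes the target size as (width, height)
  raw_img = plt.imread(fpath).astype(astype)
  resized_img = cv2.resize(raw_img, (width, height))
  if normalize:
    # scale to [0, 1]; assumes 8-bit pixel values in [0, 255]
    resized_img = resized_img / MAX_PIXEL_VALUE

  return resized_img.astype(astype)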

1. Read the training and mask images
2. Store the images in a NumPy array

In [None]:
train_images = []
for fname in train_image_fnames:
  train_image = read_image(get_train_img_path(fname))
  train_images.append(train_image)

train_images = np.array(train_images)

In [None]:
label_images = []
for fname in label_image_fnames:
  # names were uppercased above; the mask files on disk are assumed to use lowercase names
  label_image = read_image(get_label_img_path(fname.lower()))
  label_images.append(label_image)

label_images = np.array(label_images)

In [None]:
def display_images(images, titles, num_images=5):
  _, axes = plt.subplots(1, num_images, figsize=(20, 20))
  for i in range(num_images):
    axes[i].imshow(images[i])
    axes[i].set_title(titles[i])
    axes[i].axis('off')

  plt.tight_layout()
  plt.show()

3. Display the train images

In [None]:
display_images(train_images, train_image_fnames)

4. Display the mask images

In [None]:
display_images(label_images, label_image_fnames)

### 2.3 Split Training Images for Validation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    train_images,
    label_images,
    test_size=0.33,
    random_state=42)

print(f'X_train shape: {X_train.shape}\nX_test shape: {X_test.shape}\ny_train shape: {y_train.shape}\ny_test shape: {y_test.shape}')
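
With a float `test_size` and no `train_size`, scikit-learn rounds the test count up and assigns the remaining samples to training. The expected shapes can therefore be estimated before running the split. This is a minimal sketch: `expected_split_sizes` is a hypothetical helper, and the sample count of 100 is illustrative, not the size of this dataset.

```python
import math


def expected_split_sizes(n_samples, test_size=0.33):
    """Estimate (n_train, n_test) for a float test_size with train_size unset."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(expected_split_sizes(100))  # (67, 33)
```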

## 3 (PAUSED)

__Team Meeting Notes:__
Radiography datasets are going to be a dead end. The core aim of Project Smile is to enable diagnosis without medical equipment.

__Next Steps:__
- Where can we get training data that would enable diagnosis without medical equipment?