# Importing Libraries
**For more examples of what Kosh can do visit [GitHub Examples](https://github.com/LLNL/kosh/tree/stable/examples).**

In [None]:
from numbers import Number
from collections import defaultdict

import matplotlib.pyplot as plt
import seaborn as sns
import kosh
import math
import statistics
import numpy as np
from PIL import Image
import os
import sys

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

store = kosh.connect("my_store.sqlite", delete_all_contents=True)
print("Kosh is ready!")

# Loading Data

As mentioned in the `README.md`, the data we will be using is [Optical Recognition of Handwritten Digits](https://archive.ics.uci.edu/dataset/80/optical+recognition+of+handwritten+digits). Scikit-Learn ships a pre-processed copy of that data as one of its [Toy Datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html). We will load it with [`sklearn.datasets.load_digits()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits), which returns the dictionary-like structure printed below.


In [None]:
digits = load_digits() 
for key, val in digits.items():
    print(f"----- {key} -----")
    print(val)
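
Printing every value in full is verbose, so a shorter summary can help. The sketch below is one possible way to inspect only the array shapes; the `summary` dictionary is a name introduced here for illustration and is not part of the tutorial.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Collect the shape of each array-valued entry (illustrative helper dict)
summary = {key: digits[key].shape for key in ("data", "images", "target")}

for key, shape in summary.items():
    print(f"{key}: shape={shape}")
```

`data` holds each image flattened to 64 values, `images` holds the same pixels as 8 x 8 arrays, and `target` holds one digit label per image.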


# Converting to Images

To show how Kosh can load different types of data, we first convert the image arrays from [`sklearn.datasets.load_digits()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits) into PNG files so that we can read them back with Kosh's Loaders. A Loader lets the user read any type of data by "associating" a file with a Kosh Dataset. Kosh comes with a set of built-in Loaders (see [Example_02_Read_Data.ipynb](https://github.com/LLNL/kosh/blob/stable/examples/Example_02_Read_Data.ipynb)), and users can write custom ones (see [Example_Custom_Loader.ipynb](https://github.com/LLNL/kosh/blob/stable/examples/Example_Custom_Loader.ipynb)). Note that associating a file does not copy its data into the Kosh database; the store only keeps a reference to the file. If the associated file is deleted, its data is no longer available through Kosh. You can create one dataset per file or associate multiple files with a single dataset.

In [None]:
os.makedirs('images', exist_ok=True)

for i, image in enumerate(digits['images']):
    temp_image = Image.fromarray(image.astype(np.uint8))
    if i == 0:
        plt.imshow(temp_image, cmap='gray')
    temp_image.save(f'images/image_{i}.png')
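
The conversion above casts the pixel values to `np.uint8` before saving. Because the pixels are whole numbers in the range 0..16 and PNG is a lossless format, the round trip should preserve every value. The sketch below checks this on a made-up 8 x 8 array; the array contents and the file name `example.png` are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Hypothetical 8x8 image with integer pixel values in 0..16, like the digits data
pixels = (np.arange(64).reshape(8, 8) % 17).astype(np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.png")
    # A 2-D uint8 array is saved as a single-channel grayscale image
    Image.fromarray(pixels).save(path)
    with Image.open(path) as img:
        restored = np.array(img)

print(restored.shape, restored.dtype)
print("Round trip preserved pixels:", np.array_equal(pixels, restored))
```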
    


# Creating Kosh Dataset

We create our Kosh Dataset and add whatever metadata we want. Each metadata entry becomes a dataset attribute, which can later be used to find and filter specific datasets. See [kosh/examples/Example_Simulation_Workflow.ipynb](https://github.com/LLNL/kosh/blob/stable/examples/Example_Simulation_Workflow.ipynb) for more information on how to add, update, and extract metadata.

We add the metadata found in the `DESCR` of `sklearn.datasets.load_digits()`. Note that `id`, `name` and `creator` are specially named Kosh Dataset attributes. The attributes of a dataset can be listed with `dataset.list_attributes()` and read via `dataset.MY_ATTRIBUTE`, or via `getattr(dataset, 'MY ATTRIBUTE')` when the attribute name contains spaces.

In [None]:
# Copy paste from digits['DESCR']
# print(digits['DESCR'])

metadata = {"name": "Optical recognition of handwritten digits dataset",  # 'name' is a special named Kosh attribute for name of Kosh Dataset
            "Number of Instances": 1797,
            "Number of Attributes": 64,
            "Attribute Information": "8x8 image of integer pixels in the range 0..16.",
            "Missing Attribute Values": None,
            "Creator": "E. Alpaydin (alpaydin '@' boun.edu.tr)", # 'creator' is a special named Kosh attribute for creator of Kosh dataset
            "Date": "July; 1998"}
            
  
dataset = store.create(metadata=metadata)

print("BEFORE ASSOCIATING DATA:\n\n", dataset)

print('\n\nAttributes:\n\n')
print(dataset.list_attributes())
print(getattr(dataset, 'Attribute Information'))
print(dataset.Date)

# Associating Data

Here we associate all the PNG files with a single Kosh Dataset so we can use the `dataset.to_dataframe()` method. As discussed in the Ball Bounce Metadata Machine Learning tutorial, each dataset can have its own metadata attributes, and each associated file within a dataset can carry metadata attributes of its own. `dataset.to_dataframe()` extracts the attributes of all associated files in a Kosh Dataset into a Pandas DataFrame. By default this dataframe always has the `id`, `mime_type`, `uri`, and `associated` columns.

When a user associates a file to a dataset, the data in the file now becomes a Kosh Dataset feature. The features of a dataset can be seen with `dataset.list_features()` and extracted via `dataset['MY_FEATURE'][:]`.

In [None]:
for i, target in enumerate(digits['target']):
    if i % 10 == 0:
        print(f"Image {i+1} of {len(digits['target'])}")
    temp_image_path = f'images/image_{i}.png'
    metadata={"label": int(target)}  # int or float type
    dataset.associate(temp_image_path,
                      metadata=metadata,
                      mime_type="png")  # mime_type determines which Kosh loader is used
    
print("AFTER ASSOCIATING DATA:\n\n", dataset)


df = dataset.to_dataframe()
print("\n\nPANDAS DATAFRAME:\n\n", df)


print('\n\nFeatures:\n\n')
print(dataset.list_features())
# Look up the feature by file name instead of hard-coding a machine-specific absolute path
image_0_feature = next(f for f in dataset.list_features() if f.endswith('image_0.png'))
print(dataset[image_0_feature][:])

# Splitting Data

We will extract our features and labels of interest from the dataframe above, since `dataset.to_dataframe()` also includes other metadata by default. The features are the 8 x 8 = 64 pixels of each image and the label is the digit shown in the image. We call scikit-learn's `train_test_split()` function twice: first to hold out the test data, then to split the remainder into train and validation data.

There is no need to scale the data since all pixel values share the same range (0 to 16).

In [None]:
# Read each associated PNG (via its uri) into a flattened 64-value array column
df['image_data'] = df.apply(lambda row: np.array(Image.open(row["uri"])).ravel(), axis=1) 
print("\n\nDataframe with image_data:\n\n",df)

# List comprehension since each row of 'image_data' is an array itself
reshaped = np.array([array for array in df['image_data'].values]).reshape(-1, len(digits['feature_names']))

df_original = pd.DataFrame(reshaped, columns=digits['feature_names'])
df_original['label'] = df['label']
print("\n\nMachine Learning Dataframe:\n\n", df_original)

df_original_features = df_original[digits['feature_names']].copy()
df_original_labels = df_original['label'].copy()


# Splitting data
df_train_features, df_test_features, df_train_labels, df_test_labels = train_test_split(df_original_features, df_original_labels, test_size=0.2, random_state=42)
df_train_features, df_validation_features, df_train_labels, df_validation_labels = train_test_split(df_train_features, df_train_labels, test_size=0.2, random_state=42)

print(df_train_features.head())

print(f"Train Size features: {df_train_features.shape} and labels: {df_train_labels.shape}")
print(f"Validation Size features: {df_validation_features.shape} and labels: {df_validation_labels.shape}")
print(f"Test Size features: {df_test_features.shape} and labels: {df_test_labels.shape}")
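
Applying `test_size=0.2` twice does not give an 80/20 split of the training data relative to the whole set. The sketch below works through the arithmetic for 1797 samples, assuming `train_test_split()` rounds the test share up to a whole sample and gives the rest to the training side; the helper `split_sizes` is a hypothetical name introduced here.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 1797

# First split: hold out the test data
n_train_full, n_test = split_sizes(n_total, 0.2)
# Second split: carve the validation data out of what remains
n_train, n_validation = split_sizes(n_train_full, 0.2)

for name, n in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%} of all data)")
```

Under that assumption the overall proportions come out to roughly 64% train, 16% validation, and 20% test.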

# Turning Pandas DataFrames into Matrices

Now we convert our Pandas DataFrames into NumPy matrices (arrays) so the Machine Learning algorithms can process the data.

In [None]:
X_train = df_train_features.to_numpy()
y_train = df_train_labels.to_numpy()

X_validation = df_validation_features.to_numpy()
y_validation = df_validation_labels.to_numpy()

X_test = df_test_features.to_numpy()
y_test = df_test_labels.to_numpy()

# Train The Model

We will now train a `sklearn.linear_model.LogisticRegression()` model by calling its `fit()` method on the training data. `max_iter` is raised well above the default so the solver is not cut off before it converges.

In [None]:
LogReg = LogisticRegression(max_iter=int(1e6))
LogReg.fit(X_train, y_train)
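
To get a feel for what a fitted model contains, the sketch below trains on the raw `load_digits()` arrays directly, not on the tutorial's train split, and inspects the learned parameters. The variable name `model` and the value `max_iter=5000` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Illustration only: fit on the full toy dataset rather than a train split
X, y = load_digits(return_X_y=True)

model = LogisticRegression(max_iter=5000)
model.fit(X, y)

print("Classes:", model.classes_)                  # one entry per digit
print("Coefficient shape:", model.coef_.shape)     # one weight per class and pixel
print("Intercept shape:", model.intercept_.shape)  # one bias per class
```

Each row of `coef_` holds the 64 pixel weights for one digit class, which is why no reshaping of the flattened images is needed before training.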


# Inference and Confusion Matrix

Now that our model is trained, we can compute its score on the train, validation, and test data with `sklearn.linear_model.LogisticRegression.score()`, which returns the mean accuracy. We can also see what the model predicts with `sklearn.linear_model.LogisticRegression.predict()` and compare the predictions against the true labels in a confusion matrix built with `sklearn.metrics.confusion_matrix()`.


In [None]:
print("Train Mean Accuracy:", LogReg.score(X_train, y_train))
print("Validation Mean Accuracy:", LogReg.score(X_validation, y_validation))
print("Test Mean Accuracy:", LogReg.score(X_test, y_test))

print(f'Train Prediction {LogReg.predict(X_train[-1].reshape(1,-1))[0]} and actual value {y_train[-1]}')
print(f'Validation Prediction {LogReg.predict(X_validation[-1].reshape(1,-1))[0]} and actual value {y_validation[-1]}')
print(f'Test Prediction {LogReg.predict(X_test[-1].reshape(1,-1))[0]} and actual value {y_test[-1]}')

for X, y, data_type in zip([X_train, X_validation, X_test],
                           [y_train, y_validation, y_test],
                           ['Train_Data', 'Validation_Data', 'Test_Data']):
                
    predictions = LogReg.predict(X)
    cm = confusion_matrix(y, predictions, labels=LogReg.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                  display_labels=LogReg.classes_)
    disp.plot()
    plt.title(data_type)
    plt.savefig(f'{data_type}_confusion_matrix.png')
    plt.show()
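
The mean accuracy and the confusion matrix are closely related: the diagonal of the matrix counts the correct predictions, so accuracy is the diagonal sum divided by the total count. The sketch below shows this in plain Python on made-up labels; `confusion_counts` and `accuracy_from_matrix` are hypothetical helpers, not scikit-learn functions, and rows are taken to be true labels with columns as predicted labels.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a confusion matrix: rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the diagonal (correct predictions) over all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_counts(y_true, y_pred, labels=[0, 1, 2])
for row in cm:
    print(row)
print("Accuracy:", accuracy_from_matrix(cm))
```

Off-diagonal entries show which digits get confused with which, something a single accuracy number cannot reveal.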
