**To do**
1. Load all the images into a single array.
2. Combine the CSV files into a pandas DataFrame.
3. Create a dataset of image and label pairs (using each filename to look up its row in the DataFrame).
4. If possible, export the dataset for later use in other notebooks.

We will be using a Bengali handwritten digit dataset named [NumtaDB: Bengali Handwritten Digits](https://www.kaggle.com/datasets/BengaliAI/numta).

**Note:** The testing images have no labels, so we load only the training images and labels from the dataset and later split them into train and test sets for further analysis.
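
As a minimal sketch of the later split mentioned in the note, the arrays can be shuffled with one shared permutation and cut at a fixed ratio. The function name `split_train_test`, the 80/20 ratio, and the seed are assumptions made for illustration and are not part of this notebook; the stand-in arrays only mimic the shape of the real data.

```python
import numpy as np


def split_train_test(images, labels, test_fraction=0.2, seed=0):
    """Shuffle images and labels together, then split them into train and test parts."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))  # one permutation keeps image/label pairs aligned
    n_test = int(round(len(images) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]


# Stand-in data: 10 fake 32x32 grayscale images whose pixel value equals their label.
demo_labels = np.arange(10)
demo_images = np.stack([np.full((32, 32), d, dtype=np.uint8) for d in demo_labels])
x_train, y_train, x_test, y_test = split_train_test(demo_images, demo_labels)
print(x_train.shape, x_test.shape)
```

Using a single index permutation for both arrays is what keeps each image paired with its own label after shuffling.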

In [24]:
import glob
import os
import cv2
import numpy as np
import pandas as pd
from itertools import chain

In [2]:
os.listdir()

['.ipynb_checkpoints',
 'Bangla Handwritten Digit Recognition Using Deep CNN for Large and Unbiased Dataset.pdf',
 'dataset exploration.ipynb',
 'NumtaDB - Assembled Bengali Handwritten Digits.pdf',
 'NumtaDB_Bengali Handwritten Digits']

In [3]:
dataset_directory_files = glob.glob("NumtaDB_Bengali Handwritten Digits/*")

In [4]:
dataset_directory_files

['NumtaDB_Bengali Handwritten Digits\\testing-a',
 'NumtaDB_Bengali Handwritten Digits\\testing-all-corrected',
 'NumtaDB_Bengali Handwritten Digits\\testing-auga',
 'NumtaDB_Bengali Handwritten Digits\\testing-augc',
 'NumtaDB_Bengali Handwritten Digits\\testing-b',
 'NumtaDB_Bengali Handwritten Digits\\testing-c',
 'NumtaDB_Bengali Handwritten Digits\\testing-d',
 'NumtaDB_Bengali Handwritten Digits\\testing-e',
 'NumtaDB_Bengali Handwritten Digits\\testing-f',
 'NumtaDB_Bengali Handwritten Digits\\training-a',
 'NumtaDB_Bengali Handwritten Digits\\training-a.csv',
 'NumtaDB_Bengali Handwritten Digits\\training-b',
 'NumtaDB_Bengali Handwritten Digits\\training-b.csv',
 'NumtaDB_Bengali Handwritten Digits\\training-c',
 'NumtaDB_Bengali Handwritten Digits\\training-c.csv',
 'NumtaDB_Bengali Handwritten Digits\\training-d',
 'NumtaDB_Bengali Handwritten Digits\\training-d.csv',
 'NumtaDB_Bengali Handwritten Digits\\training-e',
 'NumtaDB_Bengali Handwritten Digits\\training-e.csv']

In [25]:
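# Collect the PNG paths from every "training-*" directory.
# os.path.basename replaces the Windows-only split on "\\", so this also runs on other systems.
dataset_training_image_paths = [
    glob.glob(os.path.join(directory, "**", "*.png"), recursive=True)
    for directory in dataset_directory_files
    if os.path.isdir(directory) and os.path.basename(directory).startswith("training-")
]
dataset_training_image_paths = list(
    chain.from_iterable(dataset_training_image_paths)
)  # flatten [[training-a files], [training-b files], ...] into a single list
# The label files are the training-*.csv files that sit next to the image directories.
dataset_training_labels_paths = [
    path
    for path in dataset_directory_files
    if os.path.isfile(path) and path.endswith(".csv")
]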

In [26]:
# combining all the labels files

labels_df = pd.concat(map(pd.read_csv, dataset_training_labels_paths), ignore_index=True)
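
To illustrate what the `pd.concat` call does, here is a small self-contained sketch that uses two in-memory CSV snippets in place of the real `training-*.csv` files. The two-column layout and the values are made up for the demonstration.

```python
import io

import pandas as pd

# Two tiny stand-ins for training-a.csv and training-b.csv (contents are illustrative).
csv_a = io.StringIO("filename,digit\na00000.png,5\na00001.png,3\n")
csv_b = io.StringIO("filename,digit\nb00000.png,1\n")

# ignore_index=True discards each file's own 0..n-1 index and renumbers the combined rows.
demo_df = pd.concat(map(pd.read_csv, [csv_a, csv_b]), ignore_index=True)
print(demo_df)
```

Without `ignore_index=True`, the row numbers would restart at 0 for each file, so the combined frame would contain repeated index values.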

In [27]:
labels_df.head()

,filename,original filename,scanid,digit,database name original,contributing team,database name,num,districtid,institutionid,gender,age,datestamp
0,a00000.png,Scan_58_digit_5_num_8.png,58,5,BHDDB,Buet_Broncos,training-a,,,,,,
1,a00001.png,Scan_73_digit_3_num_5.png,73,3,BHDDB,Buet_Broncos,training-a,,,,,,
2,a00002.png,Scan_18_digit_1_num_3.png,18,1,BHDDB,Buet_Broncos,training-a,,,,,,
3,a00003.png,Scan_166_digit_7_num_3.png,166,7,BHDDB,Buet_Broncos,training-a,,,,,,
4,a00004.png,Scan_108_digit_0_num_1.png,108,0,BHDDB,Buet_Broncos,training-a,,,,,,


In [28]:
# setting the filename as index so that it can be used as key to retrieve related label for each image
labels_df.set_index('filename', inplace=True)
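
A filename index only works as a lookup key if every filename appears exactly once. The sketch below uses a made-up three-row frame rather than the real labels to show a uniqueness check and a single-cell lookup with `.at`.

```python
import pandas as pd

# Illustrative stand-in for the combined labels frame.
demo_labels_df = pd.DataFrame(
    {"filename": ["a00000.png", "a00001.png", "b00000.png"], "digit": [5, 3, 1]}
).set_index("filename")

# If a filename occurred twice, a lookup would return several rows instead of one label.
assert demo_labels_df.index.is_unique

digit = demo_labels_df.at["a00001.png", "digit"]  # scalar lookup by row key and column
print(digit)
```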

In [29]:
labels_df.head()

filename,original filename,scanid,digit,database name original,contributing team,database name,num,districtid,institutionid,gender,age,datestamp
a00000.png,Scan_58_digit_5_num_8.png,58,5,BHDDB,Buet_Broncos,training-a,,,,,,
a00001.png,Scan_73_digit_3_num_5.png,73,3,BHDDB,Buet_Broncos,training-a,,,,,,
a00002.png,Scan_18_digit_1_num_3.png,18,1,BHDDB,Buet_Broncos,training-a,,,,,,
a00003.png,Scan_166_digit_7_num_3.png,166,7,BHDDB,Buet_Broncos,training-a,,,,,,
a00004.png,Scan_108_digit_0_num_1.png,108,0,BHDDB,Buet_Broncos,training-a,,,,,,


In [33]:
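resize_size = 32  # target width and height in pixels
images = []
labels = []
for img_path in dataset_training_image_paths:
    key = os.path.basename(img_path)  # the filename, e.g. "a00000.png", is the index of labels_df
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    if img is None:  # cv2.imread returns None instead of raising when a file cannot be read
        continue  # skip the file so images and labels stay aligned
    img = cv2.resize(img, (resize_size, resize_size))
    images.append(img)
    labels.append(labels_df.at[key, "digit"])  # single-cell lookup by filename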

In [42]:
# converting to numpy array
images = np.array(images)
labels = np.array(labels)
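
For the export step in the to-do list, one option is NumPy's compressed `.npz` format. This is a sketch under assumptions: the file name `numtadb_32x32.npz`, the array keys, and the zero-filled stand-in arrays are all illustrative and not taken from this notebook.

```python
import os
import tempfile

import numpy as np

# Stand-ins with the same layout as the real arrays: (N, 32, 32) images and (N,) labels.
demo_images = np.zeros((4, 32, 32), dtype=np.uint8)
demo_labels = np.array([5, 3, 1, 7])

export_path = os.path.join(tempfile.mkdtemp(), "numtadb_32x32.npz")  # hypothetical name
np.savez_compressed(export_path, images=demo_images, labels=demo_labels)

# Another notebook can restore both arrays from the single file.
with np.load(export_path) as data:
    loaded_images = data["images"]
    loaded_labels = data["labels"]
print(loaded_images.shape, loaded_labels)
```

Storing both arrays in one file keeps the images and their labels together, so they cannot drift out of sync between notebooks.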

In [43]:
# Next, convert the whole process into a Python function so that other notebooks can reuse it.
# See necessary_functions.py.