### Classifying Ancient Egyptian Landmarks

Created by Josh Sanchez. Last updated February 1, 2024.

This dataset consists of roughly 3,500 images of Ancient Egyptian landmarks.
The dataset was created by user Marvy Ayman Halim on Kaggle and can be found at: https://www.kaggle.com/datasets/marvyaymanhalim/ancient-egyptian-landmarks-dataset.

In [1]:
# import necessary packages

# data preprocessing
import os
import shutil
import pandas as pd

# data visualization
%matplotlib inline
from matplotlib import pyplot as plt

# data scaling
from preamble import *
import mglearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Data Preprocessing
Import the data and clean up the folder and file names. Then index the image dataset in a .csv file of filenames and labels.

**Note:** I removed the Node.js folder from the dataset because it does not pertain to the project. The original data can be found in the *Data* folder. The *raw_data* folder contains the original dataset minus the Node.js folder. The *processed_data* folder contains copies of the images with properly formatted folder and file names.

In [12]:
filenames = []
labels = []

# create the processed_data folder (no error if it already exists)
# the path is set outside any try block so it is always defined on reruns
processed_data_dir = os.path.join(os.getcwd(), "processed_data")
os.makedirs(processed_data_dir, exist_ok=True)

for folder in os.scandir("raw_data"):
    # if the selected item is not a directory
    if not folder.is_dir():
        continue

    # lowercase the folder name and replace space with underscore
    changed_folder_name = folder.name.lower()
    changed_folder_name = changed_folder_name.replace(" ", "_")

    # add folder to processed_data
    try:
        os.mkdir(os.path.join(processed_data_dir, changed_folder_name))
    except OSError as error:
        print(error)

    # pull filename and store in filenames for dataframe
    for file in os.scandir(folder.path):
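        # skip anything that is not a regular file (e.g. a nested folder);
        # an empty folder simply produces no iterations of this loop
        if not file.is_file():
            continue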
        
        # lowercase the file name and replace space with underscore
        changed_file_name = file.name.lower()
        changed_file_name = changed_file_name.replace(" ", "_")

        # copy the file into its folder in processed_data under the new name
        processed_file = os.path.join(processed_data_dir, changed_folder_name, changed_file_name)
        shutil.copy(file.path, processed_file)

        # store properly formatted folder name as a label
        labels.append(changed_folder_name)

        # store filename as a filename
        filenames.append(changed_file_name)

# create the dataframe once, after all folders have been scanned,
# by pairing each filename with its label
df = pd.DataFrame(list(zip(filenames, labels)), columns=["image", "label"])

# write the dataframe to a csv file
df.to_csv("ancient_egyptian_dataset.csv", index=False)
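
The renaming and labeling steps above can also be sketched as two small functions. This is a minimal illustration, not the notebook's actual code: `normalize_name` and `list_labeled_images` are hypothetical helper names, and storing the path relative to the label folder (instead of the bare filename) is an assumption made so that rows stay distinguishable when two folders contain files with the same name.

```python
from pathlib import Path


def normalize_name(name):
    """Lowercase a folder or file name and replace spaces with underscores."""
    return name.lower().replace(" ", "_")


def list_labeled_images(root):
    """Return (relative_path, label) pairs for every file one level below root.

    The label is the normalized name of the folder that holds the file.
    """
    rows = []
    for folder in sorted(Path(root).iterdir()):
        # only label folders are of interest; skip loose files
        if not folder.is_dir():
            continue
        label = normalize_name(folder.name)
        for file in sorted(folder.iterdir()):
            if file.is_file():
                rows.append((f"{label}/{normalize_name(file.name)}", label))
    return rows
```

The returned list can be passed to `pd.DataFrame(rows, columns=["image", "label"])` in place of the zipped `filenames` and `labels` lists.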

# Describe the sample and features

In [None]:
df.head(10)

# Understanding patterns/trends in the data

To understand the data, it must first be described. The code below shows that there are 1337 "unique" images (that is, distinct image filenames, not necessarily distinct images) and 22 unique labels. At this point, the model might attempt to cluster the images into 22 distinct groups.

In [None]:
df.describe()
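
For text columns, `describe()` reports `count`, `unique`, `top`, and `freq`. The sketch below uses a tiny made-up dataframe (the filenames and labels are hypothetical) to show how the number of unique filenames can be smaller than the number of rows when the same filename appears under more than one label. Whether that is what happens in this dataset is an assumption that would need to be checked.

```python
import pandas as pd

# toy data: "1.jpg" and "2.jpg" each appear under two different labels
toy = pd.DataFrame(
    {
        "image": ["1.jpg", "2.jpg", "1.jpg", "2.jpg", "3.jpg"],
        "label": ["sphinx", "sphinx", "luxor_temple", "luxor_temple", "luxor_temple"],
    }
)

summary = toy.describe()
n_rows = int(summary.loc["count", "image"])           # total number of rows
n_unique_names = int(summary.loc["unique", "image"])  # distinct filenames

# rows whose filename is shared with at least one other row
duplicated_names = toy[toy.duplicated("image", keep=False)]
```

Running the same `duplicated` check on `df` would list the filenames that are shared between label folders, if any.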

The code below shows the number of images for each label, which is the number of points expected in each cluster.

In [None]:
df["label"].value_counts()
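
The per-label counts can also be drawn as a bar chart to make class imbalance easier to spot. This is a minimal sketch; `plot_label_counts` is a hypothetical helper name, and the figure size and axis titles are arbitrary choices.

```python
import pandas as pd
from matplotlib import pyplot as plt


def plot_label_counts(labels):
    """Draw a horizontal bar chart of how many rows carry each label."""
    # sort ascending so the most frequent label ends up at the top of the chart
    counts = labels.value_counts().sort_values()
    fig, ax = plt.subplots(figsize=(8, 6))
    counts.plot.barh(ax=ax)
    ax.set_xlabel("number of images")
    ax.set_ylabel("label")
    ax.set_title("Images per label")
    fig.tight_layout()
    return counts, ax
```

In the notebook this would be called as `plot_label_counts(df["label"])`.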

# Grayscaling Images

In [None]:
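# use a gray colormap by default when displaying single-channel images;
# this changes how images are drawn, not the image data itself
plt.rcParams['image.cmap'] = "gray"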
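
Setting the colormap only affects how matplotlib displays two-dimensional arrays. To convert the pixel data itself, one common approach is a weighted sum of the red, green, and blue channels. The sketch below uses the ITU-R BT.601 luma weights. `to_grayscale` is a hypothetical helper, and it assumes the input is an array of shape (height, width, 3) or (height, width, 4), such as one loaded with `plt.imread`.

```python
import numpy as np


def to_grayscale(rgb):
    """Collapse an RGB(A) image array to one channel with BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    # keep only the first three channels so an alpha channel is ignored
    return rgb[..., :3] @ weights
```

The result has shape (height, width) and can be shown with `plt.imshow(gray)`, which then picks up the gray colormap set above.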

## Continuing this project

In order to continue working on the project, the following questions need to be answered:

* What other descriptive plots can I create to see trends in the data?
* Do I need to grayscale my images?
* Do I need to get my images to be of the same resolution?
* How do I tell the algorithm to cluster into 2 groups, sculpture and structure, in an unsupervised approach?
    * How can it decipher that an image is a structure or sculpture?
* Should the setup of a CNN or pixel boundaries be done at this step?
* Should images be cropped at this point, or does the pixel boundary eliminate the need for image cropping?
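
As a starting point for the two-group question, the sketch below clusters images into exactly two groups with k-means. This is one possible approach, not a decided design: it assumes the images have already been converted to grayscale arrays of identical resolution, and `cluster_into_two` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def cluster_into_two(images, seed=0):
    """Assign each image to one of two clusters based on its raw pixels.

    images: array of shape (n_images, height, width), all the same size.
    Returns an array of n_images cluster ids (0 or 1).
    """
    # flatten every image into one row of pixel features
    flat = images.reshape(len(images), -1)
    # rescale each pixel feature to the range [0, 1]
    scaled = MinMaxScaler().fit_transform(flat)
    model = KMeans(n_clusters=2, n_init=10, random_state=seed)
    return model.fit_predict(scaled)
```

Setting `n_clusters=2` fixes the number of groups, but nothing guarantees that the two clusters correspond to sculpture and structure. K-means on raw pixels may just as well split the images by brightness or background, so the cluster contents would need to be inspected by eye.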