### Classifying Ancient Egyptian Landmarks

Created by Josh Sanchez. Last updated February 1, 2024.

This dataset consists of roughly 3,500 images of Ancient Egyptian landmarks.
The dataset was created by user Marvy Ayman Halim on Kaggle and can be found at: https://www.kaggle.com/datasets/marvyaymanhalim/ancient-egyptian-landmarks-dataset.

In [2]:
# import necessary packages
import os

# data processing
import pandas as pd

# data visualization
%matplotlib inline
from matplotlib import pyplot as plt

# data preprocessing and scaling
from preamble import *
plt.rcParams['image.cmap'] = "gray"
import mglearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Data Preprocessing
Import the data and clean up the folder and file names. Then record each image filename and its label in a CSV file.

In [38]:
parent_dir = "D:/Data Visualization and Machine Learning/Final Project/raw_data"
filenames = []
labels = []

# materialize the listing first so renaming does not disturb the scan
for folder in list(os.scandir(parent_dir)):
    # skip anything that is not a directory
    if not folder.is_dir():
        continue

    # lowercase the folder name and replace spaces with underscores
    changed_folder_name = folder.name.lower().replace(" ", "_")

    # an empty folder yields no entries, so it needs no special check
    for file in list(os.scandir(folder.path)):
        # lowercase the file name and replace spaces with underscores
        changed_file_name = file.name.lower().replace(" ", "_")

        # store the formatted folder name as the label
        labels.append(changed_folder_name)

        # store the formatted file name as the image
        filenames.append(changed_file_name)

        # rename the file in place
        os.rename(file.path, os.path.join(folder.path, changed_file_name))

    # rename the folder once all of its files are processed
    os.rename(folder.path, os.path.join(parent_dir, changed_folder_name))

# create the dataframe once, pairing each filename with its label
df = pd.DataFrame(list(zip(filenames, labels)), columns=["image", "label"])

# write the dataframe to a csv file
df.to_csv("ancient_egyptian_dataset.csv", index=False)

# Describe the sample and features

In [4]:
df.head(10)

| | image | label |
|---|---|---|
| 0 | 15634726773_a8ac65d6ef_m_-_copy.jpg | akhenaten |
| 1 | 19281291360_5a49331215_m.jpg | akhenaten |
| 2 | 2906415757_50c2bc0414_m.jpg | akhenaten |
| 3 | 41957529164_421e9f622f_m.jpg | akhenaten |
| 4 | 4902788942_1c4ee56ede_m.jpg | akhenaten |
| 5 | 7731634374_fe4e21a493_m.jpg | akhenaten |
| 6 | 9711457465_051cf60521_n.jpg | akhenaten |
| 7 | a.1.jpg | akhenaten |
| 8 | a.10.jpg | akhenaten |
| 9 | a.11.jpg | akhenaten |


# Understanding patterns/trends in the data

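To understand the data, it must first be described. The output below shows 3,576 rows with 1,321 unique image filenames and 22 unique labels. The filename count is lower than the row count because the same filename (for example, 8.jpg, which occurs 18 times) is reused in different label folders. At this point, a clustering model might attempt to group the images into 22 distinct clusters.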

In [5]:
df.describe()

| | image | label |
|---|---|---|
| count | 3576 | 3576 |
| unique | 1321 | 22 |
| top | 8.jpg | khafre_pyramid |
| freq | 18 | 444 |
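
Since the `unique` figure counts filenames rather than files, it can help to check which filenames repeat across labels and to build a key that identifies each file on its own. The following is a minimal sketch on a toy frame; the rows and the `path` column name are hypothetical and only stand in for the real CSV.

```python
import pandas as pd

# Toy rows standing in for the real CSV (hypothetical filename/label pairs).
df = pd.DataFrame({
    "image": ["8.jpg", "8.jpg", "a.1.jpg", "8.jpg"],
    "label": ["akhenaten", "sphinx", "akhenaten", "nefertiti"],
})

# Count how many distinct labels each filename appears under.
labels_per_name = df.groupby("image")["label"].nunique()

# Keep only the filenames that occur under more than one label.
repeated = labels_per_name[labels_per_name > 1]
print(repeated)

# Build a per-file key from the label folder and the filename.
df["path"] = df["label"] + "/" + df["image"]
print(df["path"].is_unique)
```

Using a combined key like this keeps rows distinguishable even when two folders contain a file with the same name.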


The code below shows the number of images for each label. In a clustering setup, this is the number of points expected in each cluster.

In [6]:
df["label"].value_counts()

label
khafre_pyramid                             444
hatshepsut                                 349
sphinx                                     345
bent_pyramid_for_senefru                   336
colossoi_of_memnon                         234
the_great_temple_of_ramesses_ii            219
mask_of_tutankhamun                        201
ramessum                                   189
menkaure_pyramid                           120
statue_of_king_zoser                       109
nefertiti                                  104
colossal_statue_of_ramesses_ii              99
temple_of_kom_ombo                          99
amenhotep_iii_and_tiye                      98
ramses_ii_red_granite_statue                97
akhenaten                                   95
hatshepsut_face                             84
goddess_isis_with_her_child                 84
pyramid_of_djoser                           76
bust_of_ramesses_ii                         70
statue_of_tutankhamun_with_ankhesenamun     67
temple_
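
The counts above are uneven (444 images for khafre_pyramid against 67 for statue_of_tutankhamun_with_ankhesenamun), so a plain random split could leave the rarer labels thinly represented. As a hedged sketch, the already imported `train_test_split` accepts a `stratify` argument that keeps label proportions similar in both parts. The toy frame, its counts, and the 80/20 split are assumptions for illustration, not values from the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with deliberately uneven label counts (hypothetical data).
toy_labels = ["khafre_pyramid"] * 50 + ["sphinx"] * 30 + ["akhenaten"] * 20
toy_df = pd.DataFrame({
    "image": [f"img_{i}.jpg" for i in range(len(toy_labels))],
    "label": toy_labels,
})

# stratify keeps each label's share roughly equal in train and test.
train_df, test_df = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["label"],
    random_state=0,
)

print(train_df["label"].value_counts())
print(test_df["label"].value_counts())
```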

## Continuing this project

In order to continue working on the project, the following questions need to be answered:

* What other descriptive plots can I create to see trends in the data?
* Do I need to greyscale my images?
* Do I need to get my images to be of the same resolution?
* How do I tell the algorithm to cluster into 2 groups, sculpture and structure, in an unsupervised approach?
    * How can it decipher that an image is a structure or sculpture?
* Should the setup of a CNN or pixel boundaries be done at this step?
* Should images be cropped at this point, or does the pixel boundary eliminate the need for image cropping?
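
One possible way to explore the greyscale, resolution, and two-group questions is sketched below. It is not a settled design. It assumes Pillow for image handling and scikit-learn's `KMeans` for clustering, neither of which appears in the notebook so far. The 64x64 target size and the helper name `to_feature_vector` are hypothetical, and the images are synthetic solid-color stand-ins rather than real photos.

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

TARGET_SIZE = (64, 64)  # hypothetical common resolution (width, height)

def to_feature_vector(img, size=TARGET_SIZE):
    """Greyscale, resize, scale to [0, 1], and flatten one image."""
    grey = img.convert("L")          # single luminance channel
    resized = grey.resize(size)      # every image gets the same resolution
    pixels = np.asarray(resized, dtype=np.float32) / 255.0
    return pixels.ravel()

# Synthetic stand-ins: two bright and two dark images of different sizes.
images = [
    Image.new("RGB", (240, 160), color=(230, 220, 200)),
    Image.new("RGB", (100, 300), color=(210, 200, 190)),
    Image.new("RGB", (320, 200), color=(30, 20, 10)),
    Image.new("RGB", (50, 50), color=(15, 25, 35)),
]

# One row per image, one column per pixel.
X = np.stack([to_feature_vector(img) for img in images])
print(X.shape)  # (4, 4096)

# Ask for exactly two clusters; which id each group gets is arbitrary.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)
print(cluster_ids)
```

Setting `n_clusters=2` only fixes the number of groups. K-means separates images by raw pixel similarity (here, brightness), so nothing guarantees that the two clusters will correspond to sculpture versus structure on real photos.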