<h2>📁 Introduction: Data Preparation</h2>

<p>
This notebook covers the data preparation steps for our Deep Learning project, part of the 2024/25 Master’s in Data Science program. The task is to develop a deep-learning model that classifies rare species into their correct <strong>family</strong>, using image data curated from the <a href="https://eol.org/" target="_blank">Encyclopedia of Life (EOL)</a> and structured metadata based on the <em>BioCLIP</em> dataset.
</p>

<p>
Before training our model, it is critical to organize and preprocess the data properly. The dataset includes:
</p>

<ul>
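  <li>A CSV file containing image file paths and the corresponding taxonomic labels (<code>kingdom</code>, <code>phylum</code>, <code>family</code>)</li>
  <li>Image files of the rare species, stored in folders named by phylum and family (e.g., <code>mollusca_unionidae</code>)</li>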
</ul>

<p>
In this notebook, we:
</p>

<ol>
  <li>Load and inspect the metadata CSV</li>
  <li>Validate image paths and label consistency</li>
  <li>Split the data into training, validation, and test sets</li>
  <li>Apply preprocessing (e.g., resizing, normalization) and augmentations</li>
</ol>
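
<p>
The path and label validation in step 2 could be sketched as follows. This is a minimal illustration, not the project's actual check: the helper names <code>find_missing_files</code> and <code>find_label_mismatches</code> are hypothetical, the demo file names are made up, and the <code>&lt;phylum&gt;_&lt;family&gt;</code> folder convention is an assumption inferred from the <code>file_path</code> values in the metadata.
</p>

```python
from pathlib import Path
import tempfile

import pandas as pd


def find_missing_files(metadata: pd.DataFrame, image_root: Path) -> pd.DataFrame:
    """Return the rows whose file_path does not exist under image_root."""
    exists = metadata["file_path"].map(lambda p: (image_root / p).is_file())
    return metadata[~exists.astype(bool)]


def find_label_mismatches(metadata: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose folder name is not '<phylum>_<family>' (assumed convention)."""
    folder = metadata["file_path"].str.split("/").str[0]
    expected = metadata["phylum"] + "_" + metadata["family"]
    return metadata[folder != expected]


# Tiny hypothetical example; the file names are invented for illustration.
demo = pd.DataFrame({
    "phylum": ["mollusca", "chordata", "chordata"],
    "family": ["unionidae", "turdidae", "indriidae"],
    "file_path": [
        "mollusca_unionidae/a.jpg",
        "chordata_turdidae/b.jpg",
        "chordata_turdidae/c.jpg",  # folder disagrees with the family label
    ],
})

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create only the first file so the other two rows are reported as missing.
    (root / "mollusca_unionidae").mkdir()
    (root / "mollusca_unionidae" / "a.jpg").touch()

    missing = find_missing_files(demo, root)
    mismatched = find_label_mismatches(demo)

print(len(missing), "missing files;", len(mismatched), "label mismatches")
```

<p>
On the real data, <code>image_root</code> would point at the folder that contains <code>metadata.csv</code>, and both results should ideally be empty before moving on to the split.
</p>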

<p>
The goal is to ensure a clean and structured dataset pipeline that supports robust and reproducible model training, following the project guidelines and evaluation criteria.
</p>
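
<p>
One way to keep the split reproducible is to sample each family separately with a fixed random seed. The sketch below uses only pandas; the function name <code>stratified_split</code>, the split fractions, and the seed are illustrative assumptions rather than the values used in the project, and it assumes the DataFrame index is unique.
</p>

```python
import pandas as pd


def stratified_split(metadata, label_col="family", val_frac=0.1, test_frac=0.1, seed=42):
    """Split into train/val/test, sampling each label group separately."""
    # Draw the test rows per family with a fixed seed.
    test = metadata.groupby(label_col).sample(frac=test_frac, random_state=seed)
    remaining = metadata.drop(test.index)
    # Rescale so that val_frac stays relative to the full dataset.
    val = remaining.groupby(label_col).sample(
        frac=val_frac / (1 - test_frac), random_state=seed
    )
    train = remaining.drop(val.index)
    return train, val, test


# Hypothetical toy data: two families with ten images each.
demo = pd.DataFrame({
    "family": ["unionidae"] * 10 + ["turdidae"] * 10,
    "file_path": [f"image_{i}.jpg" for i in range(20)],  # made-up names
})

train, val, test = stratified_split(demo, val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))
print(test["family"].value_counts().to_dict())
```

<p>
Families with very few images can end up with no validation or test samples under fractional sampling, so rare classes may need separate handling.
</p>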


In [1]:
import pandas as pd

In [3]:
metadata = pd.read_csv('../../rare_species 1/metadata.csv')
metadata.head(10)

Unnamed: 0,rare_species_id,eol_content_id,eol_page_id,kingdom,phylum,family,file_path
0,75fd91cb-2881-41cd-88e6-de451e8b60e2,12853737,449393,animalia,mollusca,unionidae,mollusca_unionidae/12853737_449393_eol-full-si...
1,28c508bc-63ff-4e60-9c8f-1934367e1528,20969394,793083,animalia,chordata,geoemydidae,chordata_geoemydidae/20969394_793083_eol-full-...
2,00372441-588c-4af8-9665-29bee20822c0,28895411,319982,animalia,chordata,cryptobranchidae,chordata_cryptobranchidae/28895411_319982_eol-...
3,29cc6040-6af2-49ee-86ec-ab7d89793828,29658536,45510188,animalia,chordata,turdidae,chordata_turdidae/29658536_45510188_eol-full-s...
4,94004bff-3a33-4758-8125-bf72e6e57eab,21252576,7250886,animalia,chordata,indriidae,chordata_indriidae/21252576_7250886_eol-full-s...
5,dc48f2ce-4feb-4ef7-b2a2-c3c3f42bf19b,28657539,491832,animalia,arthropoda,formicidae,arthropoda_formicidae/28657539_491832_eol-full...
6,3d881320-8ba8-4580-a72c-0e7ab116b664,29548208,47043290,animalia,chordata,fringillidae,chordata_fringillidae/29548208_47043290_eol-fu...
7,7faca96a-54e6-4c80-b9e4-77ab126d904a,21232818,1033999,animalia,arthropoda,gomphidae,arthropoda_gomphidae/21232818_1033999_eol-full...
8,9f89ecab-aabd-41a4-b5b4-8ce106d85959,20315204,46561012,animalia,chordata,myliobatidae,chordata_myliobatidae/20315204_46561012_eol-fu...
9,b6ec7a70-c470-4ede-8930-05844e1efd2e,20124498,46570095,animalia,chordata,pleuronectidae,chordata_pleuronectidae/20124498_46570095_eol-...
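
<p>
The <code>Unnamed: 0</code> column in the preview above is the name pandas gives to a header-less first column, which usually means the CSV was saved together with its index. A possible inspection step, sketched on an in-memory stand-in that copies a few label values from the preview (the variable name <code>preview</code> is hypothetical and the real file has more columns):
</p>

```python
import io

import pandas as pd

# Stand-in for metadata.csv with label columns only.
# The empty first header mimics a CSV that was written with its index.
csv_text = """,kingdom,phylum,family
0,animalia,mollusca,unionidae
1,animalia,chordata,geoemydidae
2,animalia,chordata,cryptobranchidae
3,animalia,chordata,turdidae
5,animalia,arthropoda,formicidae
"""

# index_col=0 uses the first column as the index instead of "Unnamed: 0".
preview = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Basic checks: missing values, class balance, and number of target classes.
print(preview.isna().sum())
print(preview["phylum"].value_counts())
print(preview["family"].nunique(), "distinct families")
```

<p>
Running the same checks on the full metadata gives a first view of class imbalance across families, which matters for both the split and the choice of evaluation metric.
</p>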
