# Separate Training Dataset: Dolphins vs Whales

#### Author: Daniela Vorkel

This notebook covers data preparation: it separates the training images into two folders, one per class. All data were retrieved from a Kaggle challenge (https://www.kaggle.com/competitions/happy-whale-and-dolphin/data). The longer-term goal is to build a binary image classifier (a CNN) that can distinguish images of whale dorsal fins from images of dolphin dorsal fins.

The dataset on Kaggle contains two folders: 'train_images' and 'test_images'.
The 'train_images' folder contains mixed images of both dolphins and whales. In addition, the file 'train.csv' holds the image name, species classification and individual ID for each image.

To build a model from a subfolder structure, in which each subfolder represents one class (0 or 1), the 'train.csv' file is used to sort all images of the 'train_images' folder into class folders.
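As a rough illustration of the intended layout, the sketch below derives class indices from subfolder names. It assumes the two class folders are called `dolphin` and `whale` and that indices are assigned in alphabetical order; that ordering is a common convention, but it is an assumption here. The helper name `classes_from_subfolders` is hypothetical.

```python
import tempfile
from pathlib import Path


def classes_from_subfolders(root):
    """Map each subfolder name under `root` to a class index (alphabetical order)."""
    names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(names)}


# Demo on a throwaway directory that mimics the intended layout.
with tempfile.TemporaryDirectory() as root:
    for name in ("whale", "dolphin"):
        (Path(root) / name).mkdir()
    class_index = classes_from_subfolders(root)
    print(class_index)
```

With this convention, 'dolphin' would be class 0 and 'whale' class 1.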

## 1. Setup and import libraries

In [19]:
import pandas as pd
import os
import shutil
import csv

## 2. Load csv-file

In the csv-file, we find image names, species and individual IDs for each animal.

In [3]:
train = pd.read_csv("train.csv")
train.head(5)

| | image | species | individual_id |
|---|---|---|---|
| 0 | 00021adfb725ed.jpg | melon_headed_whale | cadddb1636b9 |
| 1 | 000562241d384d.jpg | humpback_whale | 1a71fbb72250 |
| 2 | 0007c33415ce37.jpg | false_killer_whale | 60008f293a2b |
| 3 | 0007d9bca26a99.jpg | bottlenose_dolphin | 4b00fe572063 |
| 4 | 00087baf5cef7a.jpg | humpback_whale | 8e5253662392 |


## 3. Investigate DataFrame

### 3.1 Check datatype and counts

#### There are no missing values; the DataFrame is consistent:

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51033 entries, 0 to 51032
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   image          51033 non-null  object
 1   species        51033 non-null  object
 2   individual_id  51033 non-null  object
dtypes: object(3)
memory usage: 1.2+ MB
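The same check can be made explicit by counting missing values per column and looking for repeated file names. The sketch below is only an illustration: it runs on a small toy DataFrame (`toy`, with one invented row that has a gap) instead of the real 'train.csv'.

```python
import pandas as pd

# Toy stand-in for train.csv: two rows from the preview above,
# plus one invented row with a missing species.
toy = pd.DataFrame({
    "image": ["00021adfb725ed.jpg", "000562241d384d.jpg", "example_missing.jpg"],
    "species": ["melon_headed_whale", "humpback_whale", None],
    "individual_id": ["cadddb1636b9", "1a71fbb72250", "hypothetical_id"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
# Number of file names that occur more than once.
duplicate_images = toy["image"].duplicated().sum()

print(missing_per_column)
print("duplicate image names:", duplicate_images)
```

On the real data, the same two lines would be applied to `train` instead of `toy`.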


### 3.2 Check which species are present and how many animals of each species are listed

#### Some of the provided species names need to be fixed:

- two species have no 'whale' or 'dolphin' suffix: beluga and globis -> assign both to whales
- some species names contain typos (for example 'bottlenose_dolpin' and 'kiler_whale')

In [5]:
train.species.value_counts()

bottlenose_dolphin           9664
beluga                       7443
humpback_whale               7392
blue_whale                   4830
false_killer_whale           3326
dusky_dolphin                3139
spinner_dolphin              1700
melon_headed_whale           1689
minke_whale                  1608
killer_whale                 1493
fin_whale                    1324
gray_whale                   1123
bottlenose_dolpin            1117
kiler_whale                   962
southern_right_whale          866
spotted_dolphin               490
sei_whale                     428
short_finned_pilot_whale      367
common_dolphin                347
cuviers_beaked_whale          341
pilot_whale                   262
long_finned_pilot_whale       238
white_sided_dolphin           229
brydes_whale                  154
pantropic_spotted_dolphin     145
globis                        116
commersons_dolphin             90
pygmy_killer_whale             76
rough_toothed_dolphin          60
frasiers_dolphin               14
Name: species, dtype: int64
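Names that contain neither 'whale' nor 'dolphin' can also be found programmatically. The sketch below is a minimal example on a hand-copied subset of the species names listed above. It flags names that lack both keywords, so it catches 'bottlenose_dolpin' but not 'kiler_whale'; typos of that kind still need a visual check.

```python
# Subset of species names copied from the value_counts() output above.
species_names = [
    "bottlenose_dolphin",
    "beluga",
    "humpback_whale",
    "bottlenose_dolpin",
    "kiler_whale",
    "globis",
    "pilot_whale",
]

keywords = ("whale", "dolphin")

# Keep every name that contains none of the keywords.
unclassified = sorted(
    name for name in species_names
    if not any(keyword in name for keyword in keywords)
)
print(unclassified)
```

On the full DataFrame, `species_names` could be replaced by `train.species.unique()`.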

## 4. Fix names of species

### 4.1 Correct typos in species names

In [6]:
train.species = train.species.str.replace('kiler_whale','killer_whale')
train.species = train.species.str.replace('bottlenose_dolpin','bottlenose_dolphin')

### 4.2 Add the 'whale' suffix to 'beluga' and 'globis'

In [7]:
train.species = train.species.str.replace('beluga','beluga_whale')
train.species = train.species.str.replace('globis','globis_whale')
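Because `str.replace` works on substrings, running the cell above a second time would turn 'beluga_whale' into 'beluga_whale_whale'. One possible alternative, sketched below on a toy Series, is an exact-match replacement with a dictionary, which can be re-run safely. The names `species` and `corrections` are illustrative only.

```python
import pandas as pd

# Toy Series with the four problematic names plus one correct name.
species = pd.Series(
    ["kiler_whale", "bottlenose_dolpin", "beluga", "globis", "humpback_whale"]
)

# Exact-match corrections: only values equal to a key are replaced.
corrections = {
    "kiler_whale": "killer_whale",
    "bottlenose_dolpin": "bottlenose_dolphin",
    "beluga": "beluga_whale",
    "globis": "globis_whale",
}

fixed_once = species.replace(corrections)
# Applying the same mapping again changes nothing.
fixed_twice = fixed_once.replace(corrections)

print(fixed_once.tolist())
```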

### 4.3 Check and confirm changed names

In [8]:
train.species.value_counts()

bottlenose_dolphin           10781
beluga_whale                  7443
humpback_whale                7392
blue_whale                    4830
false_killer_whale            3326
dusky_dolphin                 3139
killer_whale                  2455
spinner_dolphin               1700
melon_headed_whale            1689
minke_whale                   1608
fin_whale                     1324
gray_whale                    1123
southern_right_whale           866
spotted_dolphin                490
sei_whale                      428
short_finned_pilot_whale       367
common_dolphin                 347
cuviers_beaked_whale           341
pilot_whale                    262
long_finned_pilot_whale        238
white_sided_dolphin            229
brydes_whale                   154
pantropic_spotted_dolphin      145
globis_whale                   116
commersons_dolphin              90
pygmy_killer_whale              76
rough_toothed_dolphin           60
frasiers_dolphin                14
Name: species, dtype: int64

## 5. Add labels for 'dolphin' or 'whale' to the DataFrame

#### Use a lambda function to add a new column with the label 'dolphin' or 'whale':


In [9]:
train['label'] = train.species.map(lambda x: 'dolphin' if 'dolphin' in x else 'whale')
train.head()

| | image | species | individual_id | label |
|---|---|---|---|---|
| 0 | 00021adfb725ed.jpg | melon_headed_whale | cadddb1636b9 | whale |
| 1 | 000562241d384d.jpg | humpback_whale | 1a71fbb72250 | whale |
| 2 | 0007c33415ce37.jpg | false_killer_whale | 60008f293a2b | whale |
| 3 | 0007d9bca26a99.jpg | bottlenose_dolphin | 4b00fe572063 | dolphin |
| 4 | 00087baf5cef7a.jpg | humpback_whale | 8e5253662392 | whale |
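The labelling rule (any species name containing 'dolphin' is a dolphin, everything else a whale) can also be written without a lambda, and the resulting class balance is worth checking. The sketch below shows one way on a toy DataFrame built from the five species in the preview; it is an illustration, not the notebook's own cell.

```python
import pandas as pd

# Toy stand-in with the five species from the preview above.
toy = pd.DataFrame({
    "species": [
        "melon_headed_whale",
        "humpback_whale",
        "false_killer_whale",
        "bottlenose_dolphin",
        "humpback_whale",
    ]
})

# Boolean mask: True where the species name contains 'dolphin'.
is_dolphin = toy["species"].str.contains("dolphin", regex=False)
toy["label"] = is_dolphin.map({True: "dolphin", False: "whale"})

# Number of rows per class.
label_counts = toy["label"].value_counts()
print(label_counts)
```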


## 6. Create and save two separate csv-files for 'whale' and 'dolphin'

### 6.1 Filter the DataFrame for 'whale' rows and save the result as a new csv-file

In [20]:
whale = train[train.label == 'whale']
whale.head(5)

| | image | species | individual_id | label |
|---|---|---|---|---|
| 0 | 00021adfb725ed.jpg | melon_headed_whale | cadddb1636b9 | whale |
| 1 | 000562241d384d.jpg | humpback_whale | 1a71fbb72250 | whale |
| 2 | 0007c33415ce37.jpg | false_killer_whale | 60008f293a2b | whale |
| 4 | 00087baf5cef7a.jpg | humpback_whale | 8e5253662392 | whale |
| 6 | 000be9acf46619.jpg | beluga_whale | afb9b3978217 | whale |


#### Save as csv:

In [21]:
whale.to_csv('whale.csv')

### 6.2 Filter the DataFrame for 'dolphin' rows and save the result as a new csv-file

In [22]:
dolphin = train[train.label == 'dolphin']
dolphin.head(5)

| | image | species | individual_id | label |
|---|---|---|---|---|
| 3 | 0007d9bca26a99.jpg | bottlenose_dolphin | 4b00fe572063 | dolphin |
| 5 | 000a8f2d5c316a.jpg | bottlenose_dolphin | b9907151f66e | dolphin |
| 9 | 000c476c11bad5.jpg | bottlenose_dolphin | b11b2404c7e3 | dolphin |
| 12 | 00144776eb476d.jpg | bottlenose_dolphin | b9907151f66e | dolphin |
| 14 | 00177f3c614d1e.jpg | bottlenose_dolphin | 812be36c2aef | dolphin |


#### Save as csv:

In [23]:
# to_csv returns None when given a path, so do not reassign 'dolphin'
dolphin.to_csv('dolphin.csv')
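Both files could also be written in a single loop by grouping on the 'label' column. The sketch below is a minimal example on a toy DataFrame; it writes into a temporary directory (`out_dir`, a name chosen for illustration) and keeps the index as the first column, as `to_csv` does by default.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the labelled DataFrame.
toy = pd.DataFrame({
    "image": ["00021adfb725ed.jpg", "0007d9bca26a99.jpg", "00087baf5cef7a.jpg"],
    "species": ["melon_headed_whale", "bottlenose_dolphin", "humpback_whale"],
    "label": ["whale", "dolphin", "whale"],
})

out_dir = tempfile.mkdtemp()
written = {}

# One csv-file per label; the DataFrame index becomes the first column.
for label, group in toy.groupby("label"):
    path = os.path.join(out_dir, f"{label}.csv")
    group.to_csv(path)
    written[label] = path

# Read one file back to confirm that the original row index was preserved.
reloaded = pd.read_csv(written["whale"], index_col=0)
print(reloaded)
```

Keeping the index column matters for any later step that reads the image name from the second column of the csv-file.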

## 7. Separate images of whales and dolphins into new folders

### 7.1 Create two new folders

In [17]:
# exist_ok=True avoids an error if the cell is run again
os.makedirs('dolphin', exist_ok=True)
os.makedirs('whale', exist_ok=True)

### 7.2 Use new csv-files to save images

Since the images of whales and dolphins are mixed in the 'train_images' folder, we can now sort them into separate folders using the previously created csv-files:

### 7.3 Save images to 'dolphin'-folder

In [32]:
# define path to access created csv-file for dolphins
csvfile = '/.../PROJECT_happy_whales/.../dolphin.csv'
# define path of source folder containing images of dolphins and whales
source_folder = r'/.../PROJECT_happy_whales/.../train_images/'
# define path of destination folder called 'dolphin'
destination_folder = '/.../PROJECT_happy_whales/.../dolphin/'

In [33]:
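# match images with the help of the csv-file and copy them to the destination folder
with open(csvfile, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # row[1] holds the image file name (row[0] is the saved DataFrame index)
        source = os.path.join(source_folder, row[1])
        destination = os.path.join(destination_folder, os.path.basename(source))
        if os.path.isfile(source):
            # note: shutil.copy copies the file; the original stays in the source folder
            shutil.copy(source, destination)
            print('moved: ', row)
        else:
            # no matching file found, e.g. for the header row
            print(row)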

['', 'image', 'species', 'individual_id', 'label']
moved:  ['3', '0007d9bca26a99.jpg', 'bottlenose_dolphin', '4b00fe572063', 'dolphin']
moved:  ['5', '000a8f2d5c316a.jpg', 'bottlenose_dolphin', 'b9907151f66e', 'dolphin']
moved:  ['9', '000c476c11bad5.jpg', 'bottlenose_dolphin', 'b11b2404c7e3', 'dolphin']
moved:  ['12', '00144776eb476d.jpg', 'bottlenose_dolphin', 'b9907151f66e', 'dolphin']
moved:  ['14', '00177f3c614d1e.jpg', 'bottlenose_dolphin', '812be36c2aef', 'dolphin']
moved:  ['15', '0017b3749cd769.jpg', 'bottlenose_dolphin', '445270d9ad52', 'dolphin']
moved:  ['27', '0028f6fa123686.jpg', 'bottlenose_dolphin', '956562ff2888', 'dolphin']
moved:  ['30', '002e00960cec44.jpg', 'common_dolphin', 'e943980b7a98', 'dolphin']
moved:  ['37', '0039599b58fc80.jpg', 'bottlenose_dolphin', 'e69d5f9f8d1e', 'dolphin']
moved:  ['41', '003e374b59c0e1.jpg', 'dusky_dolphin', '456bb79da64c', 'dolphin']
moved:  ['50', '0049b56a584eb1.jpg', 'bottlenose_dolphin', 'e69d5f9f8d1e', 'dolphin']
moved:  ['55', 

### 7.4 Save images to 'whale'-folder

Repeat the same steps as for the dolphin images:

In [None]:
csvfile = '/.../PROJECT_happy_whales/.../whale.csv'
source_folder = r'/.../PROJECT_happy_whales/.../train_images/'
destination_folder = '/.../PROJECT_happy_whales/.../whale/'

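with open(csvfile, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # row[1] holds the image file name (row[0] is the saved DataFrame index)
        source = os.path.join(source_folder, row[1])
        destination = os.path.join(destination_folder, os.path.basename(source))
        if os.path.isfile(source):
            # note: shutil.copy copies the file; the original stays in the source folder
            shutil.copy(source, destination)
            print('moved: ', row)
        else:
            # no matching file found, e.g. for the header row
            print(row)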

['', 'image', 'species', 'individual_id', 'label']
moved:  ['0', '00021adfb725ed.jpg', 'melon_headed_whale', 'cadddb1636b9', 'whale']
moved:  ['1', '000562241d384d.jpg', 'humpback_whale', '1a71fbb72250', 'whale']
moved:  ['2', '0007c33415ce37.jpg', 'false_killer_whale', '60008f293a2b', 'whale']
moved:  ['4', '00087baf5cef7a.jpg', 'humpback_whale', '8e5253662392', 'whale']
moved:  ['6', '000be9acf46619.jpg', 'beluga_whale', 'afb9b3978217', 'whale']
moved:  ['7', '000bef247c7a42.jpg', 'humpback_whale', '444d8894ccc8', 'whale']
moved:  ['8', '000c3d63069748.jpg', 'beluga_whale', 'df94b15285b9', 'whale']
moved:  ['10', '001001f099519f.jpg', 'minke_whale', '19fbb960f07d', 'whale']
moved:  ['11', '00103cbe9d25ce.jpg', 'fin_whale', '180c0ab04dcd', 'whale']
moved:  ['13', '00167e8375c967.jpg', 'beluga_whale', '0ad50d0d9b06', 'whale']
moved:  ['16', '0018064338b499.jpg', 'blue_whale', '4790ec346170', 'whale']
moved:  ['17', '001b0900f56e89.jpg', 'humpback_whale', 'bc14b5054353', 'whale']
moved:
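Since the dolphin and whale cells differ only in their paths, the copy step could be wrapped in one function. The sketch below is one possible version, not the notebook's own code: the function name `copy_listed_images` is hypothetical, it assumes the csv-file has an 'image' column, and the demo uses invented file names in a temporary directory.

```python
import csv
import os
import shutil
import tempfile


def copy_listed_images(csv_path, source_folder, destination_folder):
    """Copy every file named in the csv-file's 'image' column.

    Returns two lists: the names that were copied and the names not found.
    """
    os.makedirs(destination_folder, exist_ok=True)
    copied, missing = [], []
    with open(csv_path, newline="") as handle:
        # DictReader consumes the header row, so it is never treated as a file.
        for row in csv.DictReader(handle):
            source = os.path.join(source_folder, row["image"])
            if os.path.isfile(source):
                shutil.copy(source, os.path.join(destination_folder, row["image"]))
                copied.append(row["image"])
            else:
                missing.append(row["image"])
    return copied, missing


# Demo with invented file names: 'a.jpg' exists, 'b.jpg' does not.
root = tempfile.mkdtemp()
src = os.path.join(root, "train_images")
os.makedirs(src)
open(os.path.join(src, "a.jpg"), "wb").close()

csv_path = os.path.join(root, "dolphin.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["", "image", "species", "individual_id", "label"])
    writer.writerow(["0", "a.jpg", "bottlenose_dolphin", "id_a", "dolphin"])
    writer.writerow(["1", "b.jpg", "bottlenose_dolphin", "id_b", "dolphin"])

copied, missing = copy_listed_images(csv_path, src, os.path.join(root, "dolphin"))
print("copied:", copied)
print("missing:", missing)
```

Returning the two lists instead of printing every row makes it easy to compare the number of copied files with the number of rows in each csv-file.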