The dataset contains two folders and four text files. 

* The *images* folder contains one subfolder per species class; each subfolder holds the images of the birds of that species.
* The *text* folder contains one subfolder per species class; each subfolder holds the text files with the textual descriptions of the birds. Note that there are 10 textual descriptions for each bird. The file names of the images and of the textual descriptions match, which makes it easier to pair them.
* The file *classes.txt* lists the class names, each with an associated numerical index.
* The file *images.txt* lists the image names, each with an associated numerical index.
* The file *image_class_labels.txt* maps each image index to a class index, which tells us the class a given image belongs to.
* The file *train_test_split.txt* tells us whether each image should be used for *train* (value 1) or for *test* (value 0).
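
As a rough illustration of how the four index files fit together, the sketch below joins them into one table with one row per image. It assumes that every file is space-separated with two columns and no header, and that names contain no spaces. The helper names (`read_index_file`, `load_metadata`), the column names, and the toy file contents are hypothetical and only illustrate the layout.

```python
import os
import tempfile

import pandas as pd


def read_index_file(path, names):
    """Read a space-separated, header-less index file into a DataFrame."""
    return pd.read_csv(path, sep=" ", header=None, names=names)


def load_metadata(data_directory):
    """Combine the four index files into one row per image."""
    def path(name):
        return os.path.join(data_directory, name)

    classes = read_index_file(path("classes.txt"), ["class_id", "class_name"])
    images = read_index_file(path("images.txt"), ["image_id", "image_name"])
    labels = read_index_file(path("image_class_labels.txt"), ["image_id", "class_id"])
    split = read_index_file(path("train_test_split.txt"), ["image_id", "train"])
    return (
        images.merge(labels, on="image_id")
        .merge(classes, on="class_id")
        .merge(split, on="image_id")
    )


# Tiny made-up files, only to show the assumed layout (not real dataset content).
toy_files = {
    "classes.txt": "1 001.Black_footed_Albatross\n2 002.Laysan_Albatross\n",
    "images.txt": (
        "1 001.Black_footed_Albatross/img_a.jpg\n"
        "2 001.Black_footed_Albatross/img_b.jpg\n"
        "3 002.Laysan_Albatross/img_c.jpg\n"
    ),
    "image_class_labels.txt": "1 1\n2 1\n3 2\n",
    "train_test_split.txt": "1 0\n2 1\n3 1\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, content in toy_files.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(content)
    demo = load_metadata(tmp)

print(demo)
```

With the real dataset, `load_metadata("CUB")` would be the equivalent call, provided the files follow the layout assumed here.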


In [47]:
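import os

data_directory = "CUB"

# Read the class names from classes.txt (each line: "<index> <class name>")
classes = []
with open(os.path.join(data_directory, "classes.txt")) as f:
    for line in f:
        # Keep only the class name, dropping the numerical index
        classes.append(line.strip().split()[1])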
classes

['001.Black_footed_Albatross',
 '002.Laysan_Albatross',
 '003.Sooty_Albatross',
 '004.Groove_billed_Ani',
 '005.Crested_Auklet',
 '006.Least_Auklet',
 '007.Parakeet_Auklet',
 '008.Rhinoceros_Auklet',
 '009.Brewer_Blackbird',
 '010.Red_winged_Blackbird',
 '011.Rusty_Blackbird',
 '012.Yellow_headed_Blackbird',
 '013.Bobolink',
 '014.Indigo_Bunting',
 '015.Lazuli_Bunting',
 '016.Painted_Bunting',
 '017.Cardinal',
 '018.Spotted_Catbird',
 '019.Gray_Catbird',
 '020.Yellow_breasted_Chat',
 '021.Eastern_Towhee',
 '022.Chuck_will_Widow',
 '023.Brandt_Cormorant',
 '024.Red_faced_Cormorant',
 '025.Pelagic_Cormorant',
 '026.Bronzed_Cowbird',
 '027.Shiny_Cowbird',
 '028.Brown_Creeper',
 '029.American_Crow',
 '030.Fish_Crow',
 '031.Black_billed_Cuckoo',
 '032.Mangrove_Cuckoo',
 '033.Yellow_billed_Cuckoo',
 '034.Gray_crowned_Rosy_Finch',
 '035.Purple_Finch',
 '036.Northern_Flicker',
 '037.Acadian_Flycatcher',
 '038.Great_Crested_Flycatcher',
 '039.Least_Flycatcher',
 '040.Olive_sided_Flycatcher',
 ...

In [48]:
# Read images from the images directory. Each image is in a folder named after its class.
# Add the image paths and their class labels to a DataFrame.
import pandas as pd
data = []
for c in classes:
    for img in os.listdir(os.path.join(data_directory, "images", c)):
        data.append([os.path.join(data_directory, "images", c, img), c])
df_images = pd.DataFrame(data, columns=["image", "label"])

def get_number_from_name(name):
    # The second-to-last "_"-separated token of the file name is the image number within its class
    return int(name.split("_")[-2].split(".")[0])

df_images["n"] = df_images["image"].apply(get_number_from_name)
df_images


Unnamed: 0,image,label,n
0,CUB/images/001.Black_footed_Albatross/Black_Fo...,001.Black_footed_Albatross,77
1,CUB/images/001.Black_footed_Albatross/Black_Fo...,001.Black_footed_Albatross,35
2,CUB/images/001.Black_footed_Albatross/Black_Fo...,001.Black_footed_Albatross,51
3,CUB/images/001.Black_footed_Albatross/Black_Fo...,001.Black_footed_Albatross,7
4,CUB/images/001.Black_footed_Albatross/Black_Fo...,001.Black_footed_Albatross,46
...,...,...,...
11783,CUB/images/200.Common_Yellowthroat/Common_Yell...,200.Common_Yellowthroat,94
11784,CUB/images/200.Common_Yellowthroat/Common_Yell...,200.Common_Yellowthroat,125
11785,CUB/images/200.Common_Yellowthroat/Common_Yell...,200.Common_Yellowthroat,4
11786,CUB/images/200.Common_Yellowthroat/Common_Yell...,200.Common_Yellowthroat,37
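
To make the behaviour of `get_number_from_name` concrete, here is a small sketch that applies it to a few paths. The file names are hypothetical: they only mimic the `<Species>_<number>_<suffix>` pattern visible in the truncated paths, and the numeric suffixes are made up.

```python
def get_number_from_name(name):
    # The second-to-last "_"-separated token is the image number within its class
    return int(name.split("_")[-2].split(".")[0])


# Hypothetical paths; the trailing suffixes are invented for illustration.
image_path = "CUB/images/006.Least_Auklet/Least_Auklet_0010_123456.jpg"
text_path = "CUB/text/006.Least_Auklet/Least_Auklet_0010_123456.txt"
other_class_path = "CUB/images/005.Crested_Auklet/Crested_Auklet_0010_654321.jpg"

n_image = get_number_from_name(image_path)
n_text = get_number_from_name(text_path)
n_other = get_number_from_name(other_class_path)

# An image and its description share the same number...
print(n_image, n_text)
# ...but so can images from different classes, so n identifies an image
# only together with its label.
print(n_image == n_other)
```

This is why the pair (`label`, `n`) rather than `n` alone is the natural key for pairing images with descriptions.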


In [49]:
# Read texts from the text directory. Each text is in a folder named after its class.
# Add the text paths and their class labels to a DataFrame.
data = []
for c in classes:
    for txt in os.listdir(os.path.join(data_directory, "text", c)):
        data.append([os.path.join(data_directory, "text", c, txt), c])
df_texts = pd.DataFrame(data, columns=["text", "label"])

df_texts["n"] = df_texts["text"].apply(get_number_from_name)
df_texts

Unnamed: 0,text,label,n
0,CUB/text/001.Black_footed_Albatross/Black_Foot...,001.Black_footed_Albatross,10
1,CUB/text/001.Black_footed_Albatross/Black_Foot...,001.Black_footed_Albatross,82
2,CUB/text/001.Black_footed_Albatross/Black_Foot...,001.Black_footed_Albatross,69
3,CUB/text/001.Black_footed_Albatross/Black_Foot...,001.Black_footed_Albatross,68
4,CUB/text/001.Black_footed_Albatross/Black_Foot...,001.Black_footed_Albatross,41
...,...,...,...
11783,CUB/text/200.Common_Yellowthroat/Common_Yellow...,200.Common_Yellowthroat,125
11784,CUB/text/200.Common_Yellowthroat/Common_Yellow...,200.Common_Yellowthroat,86
11785,CUB/text/200.Common_Yellowthroat/Common_Yellow...,200.Common_Yellowthroat,75
11786,CUB/text/200.Common_Yellowthroat/Common_Yellow...,200.Common_Yellowthroat,77


In [50]:
# Join df_texts and df_images on label and n (n alone is not unique across classes)

df = df_texts.merge(df_images, on=["label", "n"])
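
# A possible sanity check for this kind of join, shown on toy data rather than the real frames:
# `validate="one_to_one"` makes pandas raise if a (label, n) pair is duplicated on either side,
# and an outer join with `indicator=True` reveals rows that have no partner.
# The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_texts and df_images (hypothetical values).
toy_texts = pd.DataFrame(
    {"text": ["a_1.txt", "a_2.txt", "b_1.txt"], "label": ["A", "A", "B"], "n": [1, 2, 1]}
)
toy_images = pd.DataFrame(
    {"image": ["a_1.jpg", "b_1.jpg", "b_2.jpg"], "label": ["A", "B", "B"], "n": [1, 1, 2]}
)

# validate checks that (label, n) is unique in both frames;
# indicator adds a "_merge" column: "both", "left_only" or "right_only".
checked = toy_texts.merge(
    toy_images,
    on=["label", "n"],
    how="outer",
    validate="one_to_one",
    indicator=True,
)

# Rows without a partner on the other side.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```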


In [51]:
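# Read the train/test flags from train_test_split.txt (space-separated, no header)
df_train_test = pd.read_csv(os.path.join(data_directory, "train_test_split.txt"), sep=" ", header=None, names=["n", "train"])
# Join df with df_train_test on n.
# Caveat: n is the number parsed from the file name and repeats across classes,
# so this join gives every image with the same n the same train flag.
df = df.merge(df_train_test, on="n")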
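
# One possible alternative, sketched on toy data: key the split by the image index from images.txt.
# This assumes that the first column of train_test_split.txt is that image index, and that
# images.txt stores names as "<class folder>/<file name>". Both are assumptions, and the
# helper `add_split`, the column names, and the toy rows are hypothetical.

```python
import os

import pandas as pd

# Toy stand-ins (hypothetical values): two images that share n = 10.
toy_df = pd.DataFrame(
    {
        "n": [10, 10],
        "label": ["001.A", "002.B"],
        "image": ["CUB/images/001.A/A_0010_1.jpg", "CUB/images/002.B/B_0010_2.jpg"],
    }
)
toy_images_index = pd.DataFrame(
    {"image_id": [1, 2], "image_name": ["001.A/A_0010_1.jpg", "002.B/B_0010_2.jpg"]}
)
toy_split = pd.DataFrame({"image_id": [1, 2], "train": [0, 1]})


def add_split(df, images_index, split):
    """Attach the train flag through the per-image index instead of n."""
    out = df.copy()
    # Rebuild the assumed images.txt name: "<class folder>/<file name>"
    out["image_name"] = out["label"] + "/" + out["image"].map(os.path.basename)
    out = out.merge(images_index, on="image_name", validate="one_to_one")
    return out.merge(split, on="image_id", validate="one_to_one")


result = add_split(toy_df, toy_images_index, toy_split)
# The two rows share n = 10 but can now carry different train flags.
print(result[["n", "label", "image_id", "train"]])
```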

In [52]:
columns = ["n", "label", "image", "text", "train"]
df = df[columns]
df

Unnamed: 0,n,label,image,text,train
0,10,001.Black_footed_Albatross,CUB/images/001.Black_footed_Albatross/Black_Fo...,CUB/text/001.Black_footed_Albatross/Black_Foot...,0
1,10,003.Sooty_Albatross,CUB/images/003.Sooty_Albatross/Sooty_Albatross...,CUB/text/003.Sooty_Albatross/Sooty_Albatross_0...,0
2,10,004.Groove_billed_Ani,CUB/images/004.Groove_billed_Ani/Groove_Billed...,CUB/text/004.Groove_billed_Ani/Groove_Billed_A...,0
3,10,005.Crested_Auklet,CUB/images/005.Crested_Auklet/Crested_Auklet_0...,CUB/text/005.Crested_Auklet/Crested_Auklet_001...,0
4,10,006.Least_Auklet,CUB/images/006.Least_Auklet/Least_Auklet_0010_...,CUB/text/006.Least_Auklet/Least_Auklet_0010_79...,0
...,...,...,...,...,...
11783,143,062.Herring_Gull,CUB/images/062.Herring_Gull/Herring_Gull_0143_...,CUB/text/062.Herring_Gull/Herring_Gull_0143_46...,1
11784,143,066.Western_Gull,CUB/images/066.Western_Gull/Western_Gull_0143_...,CUB/text/066.Western_Gull/Western_Gull_0143_54...,1
11785,143,118.House_Sparrow,CUB/images/118.House_Sparrow/House_Sparrow_014...,CUB/text/118.House_Sparrow/House_Sparrow_0143_...,1
11786,144,118.House_Sparrow,CUB/images/118.House_Sparrow/House_Sparrow_014...,CUB/text/118.House_Sparrow/House_Sparrow_0144_...,0


In [53]:
# Save the combined table as a semicolon-separated CSV file, without the index column
df.to_csv("info_CUB.csv", sep=";", index=False)
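
# Because the file uses ";" as separator, it has to be read back with the same `sep`.
# A minimal round-trip sketch on a made-up frame, using an in-memory buffer in place of info_CUB.csv:

```python
import io

import pandas as pd

# Made-up rows with the same columns as the saved table.
toy = pd.DataFrame(
    {
        "n": [1, 2],
        "label": ["001.A", "002.B"],
        "image": ["a.jpg", "b.jpg"],
        "text": ["a.txt", "b.txt"],
        "train": [0, 1],
    }
)

buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)
header_line = buffer.getvalue().splitlines()[0]

# Read the data back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```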