# Extracting sources metadata

This notebook extracts the main features of every downloaded CoralNet source.

Using the Excel sheet `verified_labels.xlsx`, which contains all the verified labels of the CoralNet website,
it produces a new Excel sheet containing, for each source:

- name
- number of images
- number of labels
- number of labels per image (annotation count of the source's first image)
- number of health labels

This summary makes it easier to select which sources to combine.

## Import necessary libraries

In [1]:
import numpy as np
import os
import pandas as pd

## Read the verified labels excel sheet

In [2]:
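# work from the directory that holds verified_labels.xlsx
os.chdir("/home/jantina/code/src/preprocessing/")

# read columns A, C and D of the 'Health_Labels' sheet and drop incomplete rows
data = pd.read_excel('verified_labels.xlsx',
                     sheet_name='Health_Labels',
                     usecols="A,C,D").dropna()

# with the missing values gone, the IDs can be stored as integers
data["Label ID"] = data["Label ID"].astype(int)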
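
The `.dropna()` call before the integer cast matters: a column that contains missing values is stored as floats, and `astype(int)` cannot convert NaN. The following is a minimal sketch on an in-memory frame rather than the real sheet; the `Short Code` column and all values are invented for illustration.

```python
import pandas as pd

# toy stand-in for the sheet; values are invented for illustration only
raw = pd.DataFrame({
    "Label ID": [82.0, None, 4100.0],
    "Short Code": ["Acrop", "Sand", None],
})

# drop every row that has a missing value in any column,
# then store the remaining IDs as integers
clean = raw.dropna().copy()
clean["Label ID"] = clean["Label ID"].astype(int)
```

Only the first row survives here, because `dropna()` without arguments removes a row as soon as any of its columns is missing.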

## Extract data from all CoralNet sources

Note: this step requires all sources to be in the `/data/jantina/data/CoralNet/used` directory.
Right now only the used sources are there; the others are in `/data/jantina/data/CoralNet/other`.

In [3]:
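# directory that holds one sub-directory per CoralNet source
directory = '/data/jantina/data/CoralNet/used'

# one list per output column
names = []
images = []
labels = []
annots = []
health_labels = []

# iterate over the sources in that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)

    try:
        # read both files first, so that a failure adds nothing to any list
        annotations = pd.read_csv(os.path.join(f, 'annotations.csv'))
        label = pd.read_csv(os.path.join(f, 'labelset.csv'))

        # number of annotations per image (one row per image name)
        temp = annotations.groupby('Name').count()

        # labels of this source that are verified health labels
        label["Label ID"] = label["Label ID"].astype(int)
        df_common = label.loc[label["Label ID"].isin(data["Label ID"])]

        # keep only the health labels that actually occur in the annotations
        temp2 = annotations.groupby('Label').count()
        df_common = temp2.loc[temp2.index.isin(df_common["Short Code"])]

        # annotation count of the first image, taken by position;
        # assumes every image of a source has the same number of annotations
        labels_per_image = temp["Label"].iloc[0]

    except (OSError, KeyError, ValueError, IndexError) as err:
        # skip stray files and incomplete sources, but report them
        print(f"[WARN] Skipping {filename}: {err}")
        continue

    # append all values together so that the lists stay aligned
    names.append(filename)  # the source name is its directory name
    images.append(len(temp))
    labels.append(len(label))
    annots.append(labels_per_image)
    health_labels.append(len(df_common))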
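
The per-source computation can also be written as a function that takes the two tables as DataFrames, which makes it possible to check the logic on toy data without touching the file system. This is a minimal sketch, not the notebook's own code: `summarize_source` is a hypothetical helper, and the toy values are invented. It assumes the column names used in the loop (`Name`, `Label`, `Label ID`, `Short Code`).

```python
import pandas as pd


def summarize_source(name, annotations, labelset, health_ids):
    """Return one summary row for a source (hypothetical helper)."""
    # annotations per image, one entry per image name
    per_image = annotations.groupby("Name")["Label"].count()

    # short codes of the labels whose ID is a verified health label
    is_health = labelset["Label ID"].astype(int).isin(health_ids)
    health_codes = labelset.loc[is_health, "Short Code"]

    # health labels that occur at least once in the annotations
    used_health = set(annotations["Label"]) & set(health_codes)

    return {
        "Name": name,
        "# Images": len(per_image),
        "# Labels": len(labelset),
        # count of the first image; assumes a uniform count per image
        "# Labels/Images": int(per_image.iloc[0]),
        "# Health Labels": len(used_health),
    }


# toy data, invented for illustration only
annotations = pd.DataFrame({
    "Name": ["a.jpg", "a.jpg", "b.jpg", "b.jpg"],
    "Label": ["Acrop", "Sand", "Acrop", "Bleach"],
})
labelset = pd.DataFrame({
    "Label ID": [10, 20, 30],
    "Short Code": ["Acrop", "Sand", "Bleach"],
})

row = summarize_source("toy_source", annotations, labelset, health_ids=[30])
```

Returning one dictionary per source keeps all five values together, so a source that fails to load cannot leave the columns out of step with each other.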

## Create dataframe with all information

In [4]:
df = pd.DataFrame(list(zip(names,
                           images,
                           labels,
                           annots, 
                           health_labels)),
               columns = ['Name', 
                         '# Images',
                         '# Labels', 
                         '# Labels/Images',
                         '# Health Labels'])

df = df.sort_values('Name')

## Save metadata as excel file

In [5]:
os.chdir("/home/jantina/code/src/preprocessing/")
df.to_excel("sources_metadata.xlsx")
print("[INFO] Excel sheet created!")
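
Once the summary exists, choosing which sources to combine comes down to filtering the table. The sketch below shows one possible way to do that on a toy table with the same columns as `df`; the thresholds `MIN_IMAGES` and `MIN_HEALTH_LABELS` and all values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# toy summary table with the same columns as `df`; values are invented
summary = pd.DataFrame({
    "Name": ["s1", "s2", "s3"],
    "# Images": [120, 15, 300],
    "# Labels": [40, 12, 25],
    "# Labels/Images": [50, 10, 100],
    "# Health Labels": [3, 0, 5],
})

# hypothetical thresholds, chosen for illustration only
MIN_IMAGES = 100
MIN_HEALTH_LABELS = 1

# keep sources that are large enough and contain health labels,
# listing the ones with the most health labels first
candidates = summary[
    (summary["# Images"] >= MIN_IMAGES)
    & (summary["# Health Labels"] >= MIN_HEALTH_LABELS)
].sort_values("# Health Labels", ascending=False)

selected_names = candidates["Name"].tolist()
```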