# Creating a project from an existing dataset
This notebook shows how the `geti-sdk` package can be used to create a project from an existing dataset, importing all the images and annotations contained in it.

### Set up the connection to the platform
First, set up the connection to the server. This is done by instantiating a `Geti` object with the hostname (or IP address) and authentication details for the server. As in notebook [001 create_project](001_create_project.ipynb), the server details are stored in the `.env` file and are loaded in the cell below. For details on how to create the `.env` file, please see the [readme](README.md).

In [None]:
from geti_sdk import Geti
from geti_sdk.utils import get_server_details_from_env

geti_server_configuration = get_server_details_from_env()
geti = Geti(server_config=geti_server_configuration)

## Import from a supported dataset format

The Intel Geti SDK package provides a class `GetiIE` to easily import datasets from one of the formats supported by Geti, including:
- Datumaro
- COCO
- VOC
- YOLO

For more details about the Geti dataset import capabilities, refer to its [documentation](https://docs.geti.intel.com/docs/user-guide/geti-fundamentals/project-management/dataset-export-import).

### Get the dataset

Prepare your dataset by ensuring it is a zip archive in one of the formats mentioned above, accessible on the filesystem.
This example uses a small dataset downloaded from the Internet. It contains some annotated images of cats and dogs, which can be used to create a _classification_ project.

In [None]:
import os

from geti_sdk.demos.data_helpers.download_helpers import download_file

# Replace the URL below with the path to your dataset archive
# To use a local dataset, set `dataset_archive` directly to the file path of your dataset, e.g., "/home/duck/dataset.zip"
dataset_archive_url = "https://storage.geti.intel.com/test-data/geti-sdk/datasets/cats-dogs-classification-datumaro.zip"
dataset_archive_folder = "/tmp"
dataset_archive = os.path.join(dataset_archive_folder, os.path.basename(dataset_archive_url))

if not os.path.exists(dataset_archive):
    download_file(url=dataset_archive_url, target_folder=dataset_archive_folder)
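
Before importing, it can be worth checking that the archive really is a readable zip file. The following is a minimal sketch using only the Python standard library. The helper name `list_archive_contents` is hypothetical and not part of `geti-sdk`, and the demo builds a tiny throwaway archive with made-up member names instead of touching the downloaded dataset.

```python
import os
import tempfile
import zipfile
from typing import List


def list_archive_contents(archive_path: str) -> List[str]:
    """Return the member names of a zip archive, or raise ValueError if it is unusable."""
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} is not a valid zip archive")
    with zipfile.ZipFile(archive_path) as archive:
        # testzip() returns the name of the first corrupt member, or None if all are fine
        corrupt_member = archive.testzip()
        if corrupt_member is not None:
            raise ValueError(f"Corrupt member in archive: {corrupt_member}")
        return archive.namelist()


# Demo on a throwaway archive (member names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_archive = os.path.join(tmp_dir, "demo.zip")
    with zipfile.ZipFile(demo_archive, "w") as archive:
        archive.writestr("annotations/default.json", "{}")
        archive.writestr("images/cat1.jpg", b"not a real image")
    demo_members = list_archive_contents(demo_archive)

print(demo_members)
```

In the notebook, the same helper could be called on `dataset_archive` to get a quick look at the archive layout before starting the import.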

### Import the dataset

Now that the dataset archive is ready, it's time to import it.
This can be done with the `import_dataset_as_new_project` method of `GetiIE`, which also lets you assign a custom name to the project.
The choice of `project_type` depends of course on the type of annotations stored in the archive.

In [None]:
from geti_sdk.import_export import GetiIE
from geti_sdk.rest_clients import ProjectClient

project_client = ProjectClient(session=geti.session, workspace_id=geti.workspace_id)
geti_ie = GetiIE(session=geti.session, workspace_id=geti.workspace_id, project_client=project_client)

project = geti_ie.import_dataset_as_new_project(
    filepath=dataset_archive,
    project_name="Cats & Dogs",
    project_type="classification",
)

That's it! A new project named `Cats & Dogs` should now appear in your workspace.
To check its properties, you can print a summary using the cell below.
You can expect to find the same labels (classes) present in the dataset archive; for certain project types like `detection`, you may also find a label called `No object` to represent the absence of any object in the image.


In [None]:
print(project.summary)

## Import from a format not natively supported

If your dataset does not comply with one of the supported formats, there are several ways to work around this.
- You can try to convert your dataset to one of the supported formats and come back to the automated approach. This can be done by writing a script that will do the conversion. The drawback of this approach is that you can end up keeping multiple copies of the same dataset.
- You can implement an [AnnotationReader](https://openvinotoolkit.github.io/geti-sdk/geti_sdk.annotation_readers.html#) of your own by following the implementation examples already present in the Intel Geti SDK package, such as [DirectoryTreeAnnotationReader](https://github.com/openvinotoolkit/geti-sdk/blob/main/geti_sdk/annotation_readers/directory_tree_annotation_reader.py). This is especially useful if you have an established home-grown annotation format and the data you gather will continue to be kept in this format.
- You can create a project from scratch and upload the data and annotations to it, in multiple steps. This is the most straightforward approach, but it requires more interactions with the server.

The following example adopts the last approach: it reads the dataset annotations from a `csv` file, then uses `geti-sdk` to create a detection project and upload images and annotations to it.

First, let's read a few lines from the dataset annotation file to see what it looks like.

In [None]:
import csv

ANNOTATION_FILE_PATH = "./custom_dataset.csv"
annotation_file_contents = r"""image,xmin,ymin,xmax,ymax,label_name
dog1.jpg,263,100,873,690,dog
dog2.jpg,104,16,270,178,dog"""
with open(ANNOTATION_FILE_PATH, "w") as csv_file:
    csv_file.write(annotation_file_contents)

with open(ANNOTATION_FILE_PATH, newline="") as csv_file:
    reader = csv.reader(csv_file)
    header_line = next(reader)
    first_data_line = next(reader)
print(header_line)
print(first_data_line)

In this example, the dataset annotation `csv` file contains six columns: `image` holds the image file name; `xmin`, `ymin`, `xmax` and `ymax` hold the bounding box coordinates; and `label_name` holds the object class label. The annotation file structure may vary, and the processing code must be adjusted accordingly. It is also important to take into account everything known about the dataset, such as the computer vision task(s) it is labeled for, the number of classes, and the number of images, in order to process it efficiently.\
For example, you may not know the number of classes in the dataset, in which case you must find it by reading the full annotation file into memory and extracting the unique values from the `label_name` column.\
In other cases, you may know the number of classes and their names, but the files are so large that you would prefer to read and process the annotations line by line.
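
Discovering the class names does not necessarily require loading the whole file at once. The sketch below streams the rows with the standard `csv` module and keeps only the set of label names in memory. The helper name `collect_label_names` is hypothetical, and an inline string stands in for the annotation file so that the example is self-contained.

```python
import csv
import io
from typing import Iterable, List


def collect_label_names(csv_lines: Iterable[str], label_column: str = "label_name") -> List[str]:
    """Stream CSV rows and return the sorted unique values of the label column."""
    reader = csv.DictReader(csv_lines)
    unique_labels = set()
    for row in reader:
        # Only the label value is retained; the row itself is discarded
        unique_labels.add(row[label_column])
    return sorted(unique_labels)


# Inline stand-in for the annotation file, using the same columns as above
sample_csv = """image,xmin,ymin,xmax,ymax,label_name
dog1.jpg,263,100,873,690,dog
dog2.jpg,104,16,270,178,dog
"""
found_labels = collect_label_names(io.StringIO(sample_csv))
print(found_labels)
```

With a real file, an open file handle such as `open(ANNOTATION_FILE_PATH, newline="")` could be passed in place of the `io.StringIO` object.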

To create a project, first initialize a `ProjectClient` and call its `create_project` method, which is explained in the previous notebook [001 create project](./001_create_project.ipynb). The dataset is labeled for `detection`, so the created project should have the same type. The list of labels is passed directly to the `create_project` method; the label names should match the ones used in the dataset.

In [None]:
from geti_sdk.rest_clients.project_client.project_client import ProjectClient

project_client = ProjectClient(session=geti.session, workspace_id=geti.workspace_id)

# Label names for the first (and only) trainable task in our Project.
CLASS_NAMES = [
    "dog",
]

project = project_client.create_project(
    project_name="Dog Detection Project",
    project_type="detection",
    labels=[
        CLASS_NAMES,
    ],
)

You can examine the list of labels that are present in the newly created project. The `get_all_labels` method of the `Project` object returns a list of Geti SDK objects representing the labels in the project. It's handy to define a dictionary that maps label names to the label objects, to be used later.

In [None]:
all_labels = project.get_all_labels()
label_dict = {label.name: label for label in all_labels}
print(all_labels)
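
Since annotations are later looked up by label name, it can help to confirm up front that every name in the dataset has a counterpart in the project. The following is a minimal sketch operating on plain strings; `find_missing_labels` is a hypothetical helper and the example names are made up.

```python
from typing import Iterable, List


def find_missing_labels(
    dataset_label_names: Iterable[str], project_label_names: Iterable[str]
) -> List[str]:
    """Return the dataset label names that have no matching project label."""
    return sorted(set(dataset_label_names) - set(project_label_names))


# Made-up example: the dataset mentions "cat", but the project only knows "dog"
missing_labels = find_missing_labels(["dog", "cat", "dog"], ["dog", "No object"])
print(missing_labels)
```

In the notebook, the project side of the comparison could be `label_dict.keys()`; an empty result would mean that every dataset label can be resolved.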

To upload the images and annotations to the project, you need an `ImageClient` and an `AnnotationClient`.

In [None]:
from geti_sdk.rest_clients.annotation_clients.annotation_client import AnnotationClient
from geti_sdk.rest_clients.media_client.image_client import ImageClient

image_client = ImageClient(session=geti.session, workspace_id=geti.workspace_id, project=project)
annotation_client = AnnotationClient(session=geti.session, workspace_id=geti.workspace_id, project=project)

The project is now ready to be populated. This can be done in two steps:
1. Upload an image to the project.
2. Prepare and upload the corresponding annotation.

The first part is straightforward, thanks to the `upload_image` method of the `ImageClient`. The method loads an image from disk and sends it to the server. It returns an `Image` object that can later be used to upload the annotation.

In [None]:
from geti_sdk.demos.constants import DEFAULT_DATA_PATH

image_path = os.path.join(DEFAULT_DATA_PATH, "example", first_data_line[0])
image_object = image_client.upload_image(image=image_path)
image_object

Annotations are uploaded with the `upload_annotation` method of the `AnnotationClient`. The method requires the `Image` object and an `AnnotationScene` object, which must be created from the annotation data. The `AnnotationScene` is a container for the annotations of a single data sample; it consists of one or more `Annotation` instances, each representing a single object in the image. Each `Annotation` requires a shape and a list of labels for that shape.

In [None]:
from geti_sdk.data_models.annotation_scene import AnnotationScene
from geti_sdk.data_models.annotations import Annotation
from geti_sdk.data_models.shapes import Rectangle

# From the CSV file entry, extract the coordinates of the rectangle
x_min, y_min, x_max, y_max = first_data_line[1:5]

# Create a Rectangle object to represent the shape of the annotation
# Note: the Rectangle object requires the x, y, width and height of the rectangle,
# so it's necessary to calculate the width and height from the x_min, y_min, x_max and y_max
rectangle = Rectangle(
    x=int(x_min),
    y=int(y_min),
    width=int(x_max) - int(x_min),
    height=int(y_max) - int(y_min),
)

# Create the Annotation object.
# The Label object can be found in the previously created label_dict,
# using the label name from the CSV file entry as a key
label = label_dict[first_data_line[5]]
annotation = Annotation(
    labels=[
        label,
    ],
    shape=rectangle,
)

# Create the AnnotationScene object and upload the annotation
annotation_scene = AnnotationScene(
    [
        annotation,
    ]
)
annotation_client.upload_annotation(image_object, annotation_scene)
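
The conversion from corner coordinates to position and size can be isolated and checked without a server connection. The sketch below is one possible way to do it with plain Python. The helper name `corners_to_xywh` is hypothetical, and rejecting boxes with non-positive width or height is an added assumption, not something the SDK is known to require.

```python
from typing import Dict, Sequence


def corners_to_xywh(row: Sequence[str]) -> Dict[str, int]:
    """Convert a CSV row [image, xmin, ymin, xmax, ymax, label_name] to x/y/width/height."""
    x_min, y_min, x_max, y_max = (int(value) for value in row[1:5])
    # Assumption: a box with non-positive width or height indicates a data error
    if x_max <= x_min or y_max <= y_min:
        raise ValueError(f"Degenerate bounding box in row: {row}")
    return {"x": x_min, "y": y_min, "width": x_max - x_min, "height": y_max - y_min}


box = corners_to_xywh(["dog1.jpg", "263", "100", "873", "690", "dog"])
print(box)
```

Because the dictionary keys match the keyword arguments used in the cell above, the result could be passed on as `Rectangle(**box)`.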

The previous steps can be slightly refactored and extended to upload the rest of the dataset:

```python
from typing import List

def upload_and_annotate_image(dataset_line: List[str]) -> None:
    """
    Uploads an image and its annotation to the project

    :param dataset_line: The line from the dataset that contains the image name and annotation
        in format ['image', 'xmin', 'ymin', 'xmax', 'ymax', 'label_name']
    """
    # Image names in the CSV are relative to the example data folder used earlier
    image_path = os.path.join(DEFAULT_DATA_PATH, "example", dataset_line[0])
    image_object = image_client.upload_image(image=image_path)

    x_min, y_min, x_max, y_max = map(int, dataset_line[1:5])
    rectangle = Rectangle(
        x=x_min,
        y=y_min,
        width=x_max - x_min,
        height=y_max - y_min,
    )
    annotation = Annotation(
        labels=[label_dict[dataset_line[5]]],
        shape=rectangle,
    )
    annotation_scene = AnnotationScene([annotation])
    annotation_client.upload_annotation(image_object, annotation_scene)
    print(f"Uploaded and annotated {image_path}")

# Iterate over the rest of the lines in the CSV file and upload and annotate the images
with open(ANNOTATION_FILE_PATH, newline='') as csv_file:
    reader = csv.reader(csv_file)
    header_line = next(reader)
    next(reader)  # Skip the first data line, which was already uploaded above
    for line in reader:
        upload_and_annotate_image(line)
```
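
The function above builds one `AnnotationScene` per CSV line, which fits the sample file because each image appears only once. If an annotation file lists several boxes for the same image (an assumption; the sample file does not), the rows would likely need to be grouped first so that each image gets a single scene with one `Annotation` per row. A minimal sketch of the grouping step, using only the standard library and a hypothetical helper `group_rows_by_image`:

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def group_rows_by_image(csv_lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group annotation rows by the value of their `image` column."""
    reader = csv.DictReader(csv_lines)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["image"]].append(row)
    return dict(grouped)


# Inline stand-in for the annotation file; the second dog1.jpg row is made up
# purely to illustrate an image with more than one bounding box
sample_csv = """image,xmin,ymin,xmax,ymax,label_name
dog1.jpg,263,100,873,690,dog
dog1.jpg,10,20,110,220,dog
dog2.jpg,104,16,270,178,dog
"""
rows_by_image = group_rows_by_image(io.StringIO(sample_csv))
for image_name, rows in rows_by_image.items():
    print(image_name, len(rows))
```

Each group could then be turned into a list of `Annotation` objects and wrapped in one `AnnotationScene` before a single upload per image.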

