# Creating a project from an existing dataset
In this notebook, we will use the `geti-sdk` package to create a project from an existing dataset, and upload images and annotations to it.

### Setting up the connection to the platform
First, we set up the connection to the server. This is done by instantiating a `Geti` object with the hostname (or IP address) and authentication details for the server. As in notebook [001 create_project](001_create_project.ipynb), the server details are stored in the `.env` file and are loaded in the cell below. For details on how to create the `.env` file, please see the [readme](README.md).

In [None]:
from geti_sdk.utils import get_server_details_from_env

geti_server_configuration = get_server_details_from_env()

In [None]:
from geti_sdk import Geti

geti = Geti(server_config=geti_server_configuration)

## 1. Automated Project Creation
The Intel Geti SDK package provides a method to create a project from an existing dataset. This method will create a project, upload the images and annotations to the project, and create the necessary labels and classes. This approach is useful when you have a dataset that is already annotated in one of the supported formats (COCO, Pascal VOC, YOLO, etc.).

### Getting the COCO dataset
In the next cell, we get the path to the MS COCO dataset. 

If you already have the COCO dataset on your machine, please specify the `dataset_path` to point to the folder containing the dataset. 

If you do not have the dataset yet, the `get_coco_dataset` method will attempt to download it. Even though only the 2017 validation subset is downloaded, this is still a ~1 GB download, so it may take some time depending on your internet connection.

Of course the data will only be downloaded once; if you have downloaded the dataset previously, the method should detect it and return the path to the data.

In [None]:
from geti_sdk.demos import get_coco_dataset

COCO_PATH = get_coco_dataset(dataset_path=None)

### Reading the dataset
Next, we need to load the COCO dataset using Datumaro. The `geti-sdk` package provides the `DatumAnnotationReader` class to do so. It can read datasets in all formats supported by Datumaro.

In [None]:
from geti_sdk.annotation_readers import DatumAnnotationReader

annotation_reader = DatumAnnotationReader(
    base_data_folder=COCO_PATH, annotation_format="coco"
)

### Selecting the labels
The MS COCO dataset contains 80 different classes, and while we could create a project including all of them, for this demo we will select only a couple of them. This is done using the `filter_dataset` method of the annotation reader.

In [None]:
annotation_reader.filter_dataset(labels=["dog", "cat", "horse"], criterion="OR")
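
To make the `criterion` argument more concrete, here is a minimal sketch of the set logic behind an "OR" versus an "AND" selection. It runs on a hypothetical in-memory mapping of image names to label sets and is not the reader's actual implementation. The `filter_items` function and the sample data are invented for illustration, and the assumption that "OR" keeps every image containing at least one of the listed labels should be checked against the geti-sdk documentation.

```python
from typing import Dict, Iterable, Set


def matches(present: Set[str], wanted: Set[str], criterion: str) -> bool:
    """Decide whether one item's label set satisfies the criterion."""
    if criterion == "OR":
        # At least one of the requested labels is present
        return bool(present & wanted)
    if criterion == "AND":
        # All of the requested labels are present
        return wanted <= present
    raise ValueError(f"Unsupported criterion: {criterion}")


def filter_items(
    item_labels: Dict[str, Set[str]], labels: Iterable[str], criterion: str = "OR"
) -> Dict[str, Set[str]]:
    """Keep only the items whose labels satisfy the criterion."""
    wanted = set(labels)
    return {
        name: present
        for name, present in item_labels.items()
        if matches(present, wanted, criterion)
    }


# Hypothetical image names and label sets, for illustration only
items = {
    "a.jpg": {"dog", "person"},
    "b.jpg": {"cat", "dog", "horse"},
    "c.jpg": {"car"},
}
selected_or = filter_items(items, ["dog", "cat", "horse"], criterion="OR")
selected_and = filter_items(items, ["dog", "cat", "horse"], criterion="AND")
print(sorted(selected_or))   # ['a.jpg', 'b.jpg']
print(sorted(selected_and))  # ['b.jpg']
```

Under this reading, "OR" is the natural choice for the demo: an image is kept as soon as it shows a dog, a cat or a horse.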

### Creating the project
Now that we have a selection of data we would like to upload, we get to create the project. The COCO dataset is best suited for detection or segmentation type projects. 

To create the project, we will use the `create_single_task_project_from_dataset` method of the `Geti` instance that we set up previously. This will not only create the project, but also upload the media and annotations from our dataset.

The project name and type can be set via their respective input parameters, `project_name` and `project_type`. Have a look at notebook [001 create project](./001_create_project.ipynb) for further details about which values are supported for the `project_type` parameter.

The number of images that are uploaded and annotated can be controlled as well, through the `number_of_images_to_upload` and `number_of_images_to_annotate` parameters. Finally, if `enable_auto_train` is set to `True`, the project will start training right after all annotations have been uploaded (provided that enough images have been annotated to trigger auto-training).

In [None]:
PROJECT_NAME = "COCO animal detection demo"
PROJECT_TYPE = "detection"

In [None]:
project = geti.create_single_task_project_from_dataset(
    project_name=PROJECT_NAME,
    project_type=PROJECT_TYPE,
    path_to_images=COCO_PATH,
    annotation_reader=annotation_reader,
    number_of_images_to_upload=100,
    number_of_images_to_annotate=90,
    enable_auto_train=True,
)

That's it! A new project named `COCO animal detection demo` should now appear in your workspace. To check its properties, we can print a summary of it in the cell below.

In [None]:
print(project.summary)

As you might have noticed, there is one additional label in the project, the `No Object` label. This is added by the system automatically to represent the absence of any 'horse', 'cat' or 'dog' in an image.

## 2. Manual Project Creation
If your dataset does not comply with one of the supported formats, there are several ways to work around this.
- You can try to convert your dataset to one of the supported formats and come back to the automated approach. This can be done by writing a script that will do the conversion. The drawback of this approach is that you can end up keeping multiple copies of the same dataset.
- You can implement an [AnnotationReader](https://openvinotoolkit.github.io/geti-sdk/geti_sdk.annotation_readers.html#) of your own by following a few implementation examples already present in the Intel Geti SDK package - [DirectoryTreeAnnotationReader](https://github.com/openvinotoolkit/geti-sdk/blob/main/geti_sdk/annotation_readers/directory_tree_annotation_reader.py) and [DatumAnnotationReader](https://github.com/openvinotoolkit/geti-sdk/blob/main/geti_sdk/annotation_readers/datumaro_annotation_reader/datumaro_annotation_reader.py). It is especially useful if you have an established home-grown annotation format and the data you gather will be kept in this format in the future as well.
- You can create a project manually and upload the data and annotations to it. This is the most straightforward approach, but it requires a bit more work with the Geti SDK entities.

In this section we will take the last approach and create a project manually. We will read the dataset annotations from a `csv` file and use the `geti-sdk` package to create a detection project and upload images and annotations to it.\
First, let's read a few lines from the dataset annotation file to see what it looks like.

In [None]:
import csv

ANNOTATION_FILE_PATH = "./custom_dataset.csv"
annotation_file_contents = r"""image,xmin,ymin,xmax,ymax,label_name
/images/val2017/000000001675.jpg,0,16,640,308,cat
/images/val2017/000000004795.jpg,157,131,532,480,cat"""
with open(ANNOTATION_FILE_PATH, "w") as csv_file:
    csv_file.write(annotation_file_contents)

with open(ANNOTATION_FILE_PATH, newline="") as csv_file:
    reader = csv.reader(csv_file)
    header_line = next(reader)
    first_data_line = next(reader)
print(header_line)
print(first_data_line)

We see that in our example the annotation `csv` file contains six columns: `image` holds the sample path, `xmin`, `ymin`, `xmax` and `ymax` hold the bounding box coordinates, and `label_name` holds the object class label. The annotation file structure may vary, and the processing code must be adjusted accordingly. It is also important to take into account everything that is known about the dataset, such as the computer vision task(s) it is labeled for, the number of classes and the number of images, in order to process it efficiently.\
For example, you may not know the number of classes in the dataset, so you have to find it out by reading the full annotation file into memory and extracting the unique values from the `label_name` column.\
In other cases, you may know the number of classes and their names, but the sample files are so big you would prefer to read and process the annotations line by line.
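
Both situations can be handled with the standard `csv` module. The sketch below streams over the rows one at a time and collects the unique label names, assuming the six-column header written to the file in the cell above. `collect_label_names` and `EXPECTED_COLUMNS` are hypothetical helper names, not part of the `geti-sdk` package, and the in-memory `sample` only stands in for a real annotation file.

```python
import csv
import io
from typing import Set, TextIO

# Column layout of the example annotation file (an assumption of this sketch)
EXPECTED_COLUMNS = ["image", "xmin", "ymin", "xmax", "ymax", "label_name"]


def collect_label_names(csv_file: TextIO) -> Set[str]:
    """Stream over the annotation rows and return the unique label names."""
    reader = csv.DictReader(csv_file)
    # Fail early if the header does not match the layout we expect
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected header: {reader.fieldnames}")
    # The reader yields one row at a time, so only the label set is kept in memory
    return {row["label_name"] for row in reader}


# In-memory stand-in for the annotation file, using the two example rows
sample = io.StringIO(
    "image,xmin,ymin,xmax,ymax,label_name\n"
    "/images/val2017/000000001675.jpg,0,16,640,308,cat\n"
    "/images/val2017/000000004795.jpg,157,131,532,480,cat\n"
)
label_names = sorted(collect_label_names(sample))
print(label_names)  # ['cat']
```

With a file on disk, the same function can be called on the handle returned by `open(ANNOTATION_FILE_PATH, newline="")`.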

To create a project, we need to initialize a `ProjectClient` and call its `create_project` method, which is explained in detail in the previous notebook [001 create project](./001_create_project.ipynb). Our dataset is labeled for `detection`, so we will create a project of the corresponding type. It will have only one trainable task, which is detection, so we pass a single list of labels to the `create_project` method. We use our prior knowledge of the dataset here: it was labeled for one-class detection, so we only need one label.

In [None]:
from geti_sdk.rest_clients.project_client.project_client import ProjectClient

project_client = ProjectClient(session=geti.session, workspace_id=geti.workspace_id)

# Label names for the first (and only) trainable task in our Project.
CLASS_NAMES = [
    "cat",
]

project = project_client.create_project(
    project_name="Manually Created Detection Project",
    project_type="detection",
    labels=[
        CLASS_NAMES,
    ],
)

We can examine the list of labels that are present in our newly created project. The `get_all_labels` method of the `Project` object returns a list of Geti SDK objects representing the labels in the project. We will compile a dictionary that will help us map label names to the label objects later.

In [None]:
all_labels = project.get_all_labels()
label_dict = {label.name: label for label in all_labels}
print(all_labels)

To upload the images and annotations to the project, we will need an `ImageClient` and an `AnnotationClient`, respectively.

In [None]:
from geti_sdk.rest_clients.annotation_clients.annotation_client import AnnotationClient
from geti_sdk.rest_clients.media_client.image_client import ImageClient

image_client = ImageClient(
    session=geti.session, workspace_id=geti.workspace_id, project=project
)
annotation_client = AnnotationClient(
    session=geti.session, workspace_id=geti.workspace_id, project=project
)

Now we have everything we need to populate our project's dataset manually. For the first entry in the dataset, we will break the process into two steps:
1. Upload the image to the project.
2. Prepare and upload the annotation to the project.

The first step is straightforward: we use the `upload_image` method of the `ImageClient` to upload the image to the project. The method can load an image from disk and send it to the server. It returns an `Image` object that we will use to upload the annotation in the next step.

In [None]:
image_path = first_data_line[0]
image_object = image_client.upload_image(image=COCO_PATH + image_path)
image_object

To upload the annotation, we will use the `upload_annotation` method of the `AnnotationClient`. The method requires the `Image` object and an `AnnotationScene` object, which we need to create from the annotation data. The `AnnotationScene` object is a container for the annotations of a single data sample. It consists of several `Annotation` instances, each representing a single object in the image. An `Annotation` requires a bounding shape and a list of labels for that shape.\
Now let's build these objects in code, from the bottom up.

In [None]:
from geti_sdk.data_models.annotation_scene import AnnotationScene
from geti_sdk.data_models.annotations import Annotation
from geti_sdk.data_models.shapes import Rectangle

# From the CSV file entry we can get the coordinates of the rectangle
x_min, y_min, x_max, y_max = first_data_line[1:5]

# We need to create a Rectangle object to represent the shape of the annotation
# Note: the Rectangle object requires the x, y, width and height of the rectangle,
# so we need to calculate the width and height from the x_min, y_min, x_max and y_max
rectangle = Rectangle(
    x=int(x_min),
    y=int(y_min),
    width=int(x_max) - int(x_min),
    height=int(y_max) - int(y_min),
)

# We can now create the Annotation object,
# We can get a Label object from the label_dict we created earlier
# using the label name from the CSV file entry as a key
label = label_dict[first_data_line[5]]
annotation = Annotation(
    labels=[
        label,
    ],
    shape=rectangle,
)

# We can now create the AnnotationScene object and upload the annotation
annotation_scene = AnnotationScene(
    [
        annotation,
    ]
)
annotation_client.upload_annotation(image_object, annotation_scene)

Now we can gather all the steps in one function and apply it to the rest of the dataset, line by line.

```python
from typing import List

def upload_and_annotate_image(dataset_line: List[str]) -> None:
    """
    Uploads an image and its annotation to the project

    :param dataset_line: The line from the dataset that contains the image path and annotation
        in format ['image_path', 'xmin', 'ymin', 'xmax', 'ymax', 'label_name']
    """
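    # Paths in the CSV file are relative to the dataset root, as in the single-image cell above
    image_path = COCO_PATH + dataset_line[0]
    image_object = image_client.upload_image(image=image_path)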

    x_min, y_min, x_max, y_max = map(int, dataset_line[1:5])
    rectangle = Rectangle(
        x=x_min,
        y=y_min,
        width=x_max - x_min,
        height=y_max - y_min,
    )
    annotation = Annotation(
        labels=[label_dict[dataset_line[5]],],
        shape=rectangle,
    )
    annotation_scene = AnnotationScene([annotation])
    annotation_client.upload_annotation(image_object, annotation_scene)
    print(f"Uploaded and annotated {image_path}")

# We can now iterate over the rest of the lines in the CSV file and upload and annotate the images
with open(ANNOTATION_FILE_PATH, newline="") as csv_file:
    reader = csv.reader(csv_file)
    header_line = next(reader)
    next(reader)  # skip the first data line, which was already uploaded above
    for line in reader:
        upload_and_annotate_image(line)
```
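
The loop above uploads one image and one annotation scene per CSV row. Since an `AnnotationScene` can hold several `Annotation` instances, a dataset that lists several boxes for the same image on separate rows may be better served by grouping the rows per image first. The following is a minimal sketch of that grouping step using only the standard library. `Box` and `group_boxes_by_image` are hypothetical names that are not part of the `geti-sdk` package, the sample rows are made up, and the sketch assumes the same six-column layout as the example file.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, NamedTuple, TextIO


class Box(NamedTuple):
    """A labelled box in x, y, width, height form (pixels)."""

    x: int
    y: int
    width: int
    height: int
    label_name: str


def group_boxes_by_image(csv_file: TextIO) -> Dict[str, List[Box]]:
    """Group the annotation rows by image path, converting corners to x/y/width/height."""
    grouped: Dict[str, List[Box]] = defaultdict(list)
    for row in csv.DictReader(csv_file):
        x_min, y_min = int(row["xmin"]), int(row["ymin"])
        x_max, y_max = int(row["xmax"]), int(row["ymax"])
        # Reject degenerate boxes before they reach the server
        if x_max <= x_min or y_max <= y_min:
            raise ValueError(f"Invalid box for {row['image']}: {row}")
        box = Box(x_min, y_min, x_max - x_min, y_max - y_min, row["label_name"])
        grouped[row["image"]].append(box)
    return dict(grouped)


# Made-up rows: two boxes for the first image, one for the second
sample = io.StringIO(
    "image,xmin,ymin,xmax,ymax,label_name\n"
    "/images/a.jpg,0,16,640,308,cat\n"
    "/images/a.jpg,10,20,110,220,cat\n"
    "/images/b.jpg,157,131,532,480,cat\n"
)
boxes_by_image = group_boxes_by_image(sample)
for image_path, boxes in boxes_by_image.items():
    print(image_path, len(boxes))
```

Each `Box` maps directly onto the `Rectangle(x=..., y=..., width=..., height=...)` call used earlier, so every image would need a single upload followed by one `AnnotationScene` built from all of its boxes.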

