# Setting up Projects and Data in Slideflow

After you have successfully created your conda environment and installed Slideflow, you can start making your first project. In this tutorial, we will create a project that will be used to test the functionality of Slideflow.

Slideflow deals in **Projects** and in **Data**. 

This is the typical data directory structure that is recommended for working with Slideflow:

- ```PROJECTS/```: directory where all projects are stored
    - ```TEST_PROJECT/```
        - ```annotations.csv```: annotations file (recommended: put any other annotation files in an ```annotations/``` directory)
        - ```slideflow.log```: Slideflow's console output log (you can manually set the desired logging level)
        - ```settings.json```: project settings which should be edited for each project
        - ```datasets.json```: address book for dataset directories
        - ```models/```: folder containing trained model folders
        - ```eval/```: folder containing result folders from model evaluation 
        - ```script.py``` or ```notebook.ipynb```: your experiment scripts or notebooks (recommended: put these in a ```scripts/``` directory)
- ```DATA/```: the directories below can live anywhere, as long as ```datasets.json``` points to them; each should contain a subdirectory specific to each dataset.
    - ```slides/```: slide image directory 
    - ```roi/```: region of interest CSV files generated in QuPath by ```export_rois.groovy``` script
    - ```tiles/```: folder used to temporarily store extracted tiles prior to saving as TFRecords; typically tiles are deleted once TFRecords are created
    - ```tfrecords/```: folder used to store TFRecords 

The easiest place to put the ```tiles/``` and ```tfrecords/``` directories is in the project directory since you will be extracting tiles and creating TFRecords for each project.

It is recommended to use the above directory structure to keep your projects organized.  
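
The layout above can be scaffolded with a few lines of standard-library Python. This is a minimal sketch, not part of Slideflow: the helper name `make_project_skeleton` and the `DATASET_NAME` subdirectory are assumptions made for illustration, and the helper only creates empty directories, leaving the JSON and annotation files to you.

```python
from pathlib import Path
import tempfile


def make_project_skeleton(base, project_name, dataset_name):
    """Create the recommended empty directories under `base` (hypothetical helper)."""
    base = Path(base)
    project_dir = base / "PROJECTS" / project_name
    # Project-level folders from the layout above
    for sub in ("models", "eval", "annotations", "scripts"):
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    # Data folders, each with a subdirectory specific to the dataset
    for sub in ("slides", "roi", "tiles", "tfrecords"):
        (base / "DATA" / sub / dataset_name).mkdir(parents=True, exist_ok=True)
    return project_dir


# Try it in a throwaway directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    project_dir = make_project_skeleton(tmp, "TEST_PROJECT", "DATASET_NAME")
    created = sorted(p.name for p in project_dir.iterdir())
    print(created)
```

To use it for real, pass your own base directory instead of the temporary one.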

---------

1. [Importing libraries](#import-libraries)
    - [Note on filepaths](#note-on-structuring-filepaths-in-your-code)<br>
2. [Create a Project](#create-a-project)
    - [Own data, manual creation](#option-1-create-a-project-with-your-own-data-manually)<br>
    - [Own data, sf.create_project()](#option-2-create-a-project-with-your-own-data-using-slideflows-api-sfcreate_project)<br>
    - [Test data, existing labshare project](#easiest-option-3-create-a-test-project-on-randi-by-accessing-the-test-project-data-on-the-labshare)<br>
    - [Test data, download from Amazon/Box](#option-4-create-a-test-project-by-downloading-some-test-data)<br>
3. [Update settings.json and datasets.json](#update-settingsjson-and-datasetsjson)<br><br>
4. [Create a Dataset](#create-a-dataset-class-object)<br>

### Import libraries

Always the first step.

In [None]:
# Set environment variables with os package
import os
os.environ['SF_BACKEND'] = 'torch' # Alternative is 'tensorflow'
os.environ['SF_SLIDE_BACKEND'] = 'cucim' # Alternative is 'libvips'
os.environ['CUDA_VISIBLE_DEVICES'] = '0' # Set which GPU(s) to use 

# Check if GPU is available
if os.environ['SF_BACKEND']=='torch':
    import torch
    print('GPU available: ', torch.cuda.is_available())
    print('GPU count: ', torch.cuda.device_count())
    if torch.cuda.is_available():  # current_device() raises an error when no GPU is visible
        print('GPU current: ', torch.cuda.current_device())
        print('GPU name: ', torch.cuda.get_device_name(torch.cuda.current_device()))
elif os.environ['SF_BACKEND']=='tensorflow':
    import tensorflow as tf
    print("GPU: ", len(tf.config.list_physical_devices('GPU')))

# import slideflow
import slideflow as sf

# Set verbose logging
import logging
logging.getLogger('slideflow').setLevel(logging.INFO)
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '10'

# Check if slideflow was properly installed
sf.about()

### Note on structuring filepaths in your code

We have a lot of different servers and computers that we work on, and we have to be careful about how we structure our filepaths. You don't want to hardcode all your paths in your code, because if you have to switch your notebook from one server to another, you'll have to change all your paths.

So, set your paths at the beginning of your notebook, and then use those variables throughout your code. The below is just an example, but think about structuring things this way. 

For example, for projects on Randi's scratch space (or local workstation):

In [None]:
# Set root paths for making a quick test project on Randi's /scratch space just to play around with code
username = "skochanny"
project_name = "TEST_PROJECT"
project_root_path = f'/scratch/{username}/PROJECTS/{project_name}'
# labshare_path will change depending on whether you are on Randi, wheelbarrow, or a workstation
labshare_path = '/gpfs/data/pearson-lab/'
# Relative path to data on the labshare. Keep it relative (no leading "/"):
# os.path.join discards labshare_path if the second argument is an absolute path
relative_data_path = "PROJECTS/TEST_PROJECTS/lung-adeno-v-squam"
data_path = os.path.join(labshare_path, relative_data_path)
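
One pitfall with `os.path.join` is worth seeing once: if a later argument is an absolute path, everything before it is dropped. The short sketch below uses `posixpath` (the POSIX flavour of `os.path`) only so that the result is the same on any operating system; the paths are the example paths from this tutorial.

```python
import posixpath  # POSIX flavour of os.path, so the result is the same on any OS

labshare_path = '/gpfs/data/pearson-lab/'

# Relative second argument: the two parts are joined as expected
good = posixpath.join(labshare_path, 'PROJECTS/TEST_PROJECTS/lung-adeno-v-squam')

# Absolute second argument: everything before it is silently dropped
bad = posixpath.join(labshare_path, '/PROJECTS/TEST_PROJECTS/lung-adeno-v-squam')

print(good)  # /gpfs/data/pearson-lab/PROJECTS/TEST_PROJECTS/lung-adeno-v-squam
print(bad)   # /PROJECTS/TEST_PROJECTS/lung-adeno-v-squam
```

This is why the relative paths in this tutorial never start with a `/`.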

## Create a project

There are a few different ways to create a project, depending on whether you want to use your own data or download some test data just to get started.

### Option 1: Create a project with your own data manually

This option is simple, and what I have done many times in the past. I use a file browser or the command line and I create all of my directories manually, then use a text editor to create & edit the necessary files. Or I copy a previous project and modify it. 

Look, it works, and I don't have to worry about bugs in my code. 

Make the directory structure as listed above in [Setting up Projects and Data in Slideflow](#setting-up-projects-and-data-in-slideflow), and then go to the last section about how to [Update the settings.json and datasets.json](#update-settingsjson-and-datasetsjson).

### Option 2: Create a project with your own data using Slideflow's API, *sf.create_project()*

See the [note](#note-on-structuring-filepaths-in-your-code) about paths above, or hard-code paths here. We'll assume you're setting up a project on the labshare, working from Randi.

In [None]:
project_name = "TEST_PROJECT"
# labshare_path will change depending on if you are on Randi v. wheelbarrow v. workstation
labshare_path = '/gpfs/data/pearson-lab/'
project_root_path = f"{labshare_path}/PROJECTS/{project_name}"

We'll put the ```tiles/``` and ```tfrecords/``` directories in the project directory. 

Notes:
- The ```rois``` argument is broken: it expects the ROIs to be a tar.gz file instead of a directory, so you need to manually edit the ```datasets.json``` file afterwards.
- Last time I did this, the ```name``` argument for the project name didn't work, and I had to manually edit that as well.

In [None]:
project = sf.create_project(
        root = project_root_path,
        annotations = f"{project_root_path}/annotations.csv",
        name = 'LUADvsLUSC', # if you already have a created datasets.json, you can put the source name here
        slides = f"{labshare_path}/relative/path/to/slides",
        # rois = os.path.join(labshare_path, relative_roi_path), # is broken, wants ROIs to be a tar.gz file instead of a directory
        tiles = f"{project_root_path}/tiles",
        tfrecords = f"{project_root_path}/tfrecords",
    )

In [None]:
project_name = "TCGA_BMI"
# labshare_path will change depending on if you are on Randi v. wheelbarrow v. workstation
labshare_path = '/gpfs/data/pearson-lab/'
project_root_path = f"{labshare_path}/PROJECTS/{project_name}"

project = sf.create_project(
        root = project_root_path,
        annotations = f"{project_root_path}/annotations.csv",
        name = 'LUADvsLUSC', # if you already have a created datasets.json, you can put the source name here
        slides = f"{labshare_path}/relative/path/to/slides",
        # rois = os.path.join(labshare_path, relative_roi_path), # is broken, wants ROIs to be a tar.gz file instead of a directory
        tiles = f"{project_root_path}/tiles",
        tfrecords = f"{project_root_path}/tfrecords",
    )

### (EASIEST) Option 3: Create a test project on Randi by accessing the test project data on the labshare

We'll assume you just want a quick test project on Randi's scratch space to play around with Slideflow.

See the [note](#note-on-structuring-filepaths-in-your-code) about paths above, or hard-code paths here.

In [None]:
# Set root paths
username = "skochanny" # change me
root_path = f'/scratch/{username}/PROJECTS'
labshare_path = '/gpfs/data/pearson-lab/'
project_name = "TEST_PROJECT"
relative_annotation_path = 'PROJECTS/TEST_PROJECTS/lung-adeno-v-squam/annotations.csv' # no leading "/" (e.g. "/DL_OTHER..."): os.path.join would discard labshare_path
relative_slide_path = 'PROJECTS/TEST_PROJECTS/lung-adeno-v-squam/slides'
relative_roi_path = 'PROJECTS/TEST_PROJECTS/lung-adeno-v-squam/roi'

We'll put the ```tiles/``` and ```tfrecords/``` directories in the project directory. 

Notes:
- The ```rois``` argument is broken: it expects the ROIs to be a tar.gz file instead of a directory, so you need to manually edit the ```datasets.json``` file afterwards.
- Last time I did this, the ```name``` argument for the project name didn't work, and I had to manually edit that as well.
- I used ```os.path.join()``` below but you can also use ```f"{}"``` to format strings.

In [None]:
# Create a new project, if one does not already exist
project_root_path = os.path.join(root_path, project_name)
project = sf.create_project(
        root = project_root_path,
        annotations = os.path.join(labshare_path, relative_annotation_path),
        name = 'LUADvsLUSC', # if you already have a created datasets.json, you can put the source name here
        slides = os.path.join(labshare_path, relative_slide_path),
        # rois = os.path.join(labshare_path, relative_roi_path),
        tiles = os.path.join(project_root_path, "tiles"),
        tfrecords = os.path.join(project_root_path, "tfrecords")
    )

### Option 4: Create a test project by downloading some test data

#### UChicago Box repo option:

We have created a project plus test data, which you can download from [here](https://uchicago.box.com/s/02puzu0dzp9mtfej2gabe0t4d1zn2m0b). You will need to update the paths in ```datasets.json``` and ```settings.json``` to point to your data directories.

In [None]:
# TODO: make automatic download of data from box
dl_path="https://uchicago.box.com/s/02puzu0dzp9mtfej2gabe0t4d1zn2m0b"

#### Amazon S3 repo option:

We have three things available for download, which you can choose by specifying the ```REMOTE_DIRECTORY_NAME``` variable.
- ```REMOTE_DIRECTORY_NAME = 'TEST_PROJECT'```: Project folder with some sample data for testing (TCGA lung)
- ```REMOTE_DIRECTORY_NAME = 'lung-adeno-v-squam'```: TCGA lung slides, ROIs, and annotation file
- ```REMOTE_DIRECTORY_NAME = 'thyroid-braf-v-ras'```: TCGA thyroid slides, ROIs, and annotation file

You will need to update the paths in ```datasets.json``` and ```settings.json``` to point to your data directories.

In [None]:
# Download a test project/test data from our public Amazon S3 bucket at s3://slideflow-test-projects
import boto3
import os 

def download_s3_folder(bucket_name, s3_folder, local_dir=None):
    """
    Download the contents of a folder directory
    Args:
        bucket_name: the name of the s3 bucket
        s3_folder: the folder path in the s3 bucket
        local_dir: a relative or absolute directory path in the local file system
    """
    s3 = boto3.resource('s3') # assumes credentials & configuration are handled outside python in .aws directory or environment variables
    bucket = s3.Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=s3_folder):
        if obj.key.endswith('/'):
            continue  # skip "directory" placeholder keys, which have no file content
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        target_dir = os.path.dirname(target)
        if target_dir:  # empty when the file goes into the current directory
            os.makedirs(target_dir, exist_ok=True)
        bucket.download_file(obj.key, target)

BUCKET_NAME = 'slideflow-test-projects' # replace with your bucket name
REMOTE_DIRECTORY_NAME = 'TEST_PROJECT' # Project folder with some sample data for testing
# REMOTE_DIRECTORY_NAME = 'lung-adeno-v-squam' # TCGA lung slides, ROIs, and annotation file
# REMOTE_DIRECTORY_NAME = 'thyroid-braf-v-ras' # TCGA thyroid slides, ROIs, and annotation file

# Set local project directory.
project_dir = "/Users/sarakochanny/Python/slideflow-tutorials/TEST_PROJECT"

# downloadDirectoryFroms3(BUCKET_NAME, REMOTE_DIRECTORY_NAME)
download_s3_folder(BUCKET_NAME, REMOTE_DIRECTORY_NAME, local_dir=project_dir)

If boto3 is not installed, you can install it with ```!pip install boto3```.

In [None]:
#!pip3 install boto3
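
To see where downloaded files will land before running a real download, the path mapping used in `download_s3_folder` can be tried on its own. This is a small sketch with no network access: `local_target` is an illustrative helper that mirrors the `target` expression above, and the two keys are made-up examples, not a listing of the actual bucket.

```python
import os


def local_target(key, s3_folder, local_dir=None):
    """Return the local path an S3 key would be saved to (illustrative helper)."""
    if local_dir is None:
        return key
    # Strip the remote folder prefix, then re-root the remainder under local_dir
    return os.path.join(local_dir, os.path.relpath(key, s3_folder))


# Hypothetical keys, for illustration only
keys = [
    "TEST_PROJECT/settings.json",
    "TEST_PROJECT/models/README.txt",
]
targets = [local_target(k, "TEST_PROJECT", local_dir="my_project") for k in keys]
for t in targets:
    print(t)
```

The remote folder name is dropped and the rest of the key is recreated under `local_dir`, so `project_dir` should already be the folder you want the project contents to sit in.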

## Update settings.json and datasets.json

Make sure that your ```settings.json``` and ```datasets.json``` files contain the correct information. Slideflow's ```create_project``` function does most of the work, but it has some rough edges (see the notes above), and minor typos in these files will cause problems.

```settings.json```

The ```settings.json``` file should be in your project folder. Every path can be relative (```./``` is notation for the current directory). The "sources" entry is a list of the source names listed in ```datasets.json```.

Here is an example of what ```settings.json``` should look like.
```
{
    "name": "TEST_PROJECT",
    "annotations": "./annotations.csv",
    "dataset_config": "./datasets.json",
    "sources": [
        "SOURCE_1",
        "SOURCE_2"
    ],
    "models_dir": "./models",
    "eval_dir": "./eval"
}
```
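
If you would rather not type the file by hand, the same content can be written with Python's `json` module, which guarantees valid JSON syntax. This is only a sketch: it writes to a temporary directory, and the values are the placeholder values from the example above, so substitute your own project directory and source names.

```python
import json
import os
import tempfile

# Same keys and placeholder values as the example above
settings = {
    "name": "TEST_PROJECT",
    "annotations": "./annotations.csv",
    "dataset_config": "./datasets.json",
    "sources": ["SOURCE_1", "SOURCE_2"],
    "models_dir": "./models",
    "eval_dir": "./eval",
}

# A temporary directory stands in for your project directory
with tempfile.TemporaryDirectory() as project_dir:
    settings_path = os.path.join(project_dir, "settings.json")
    with open(settings_path, "w") as f:
        json.dump(settings, f, indent=4)
    # Read it back to confirm the file parses
    with open(settings_path) as f:
        reloaded = json.load(f)

print(reloaded["sources"])
```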

```datasets.json``` 

Slideflow does not require your directories to all be in one place: your slides and ROIs can be stored in one place, the tiles and TFRecords in another, and the Project folders in yet another. Slideflow *does* need an “address book” which lists the paths to the data for each dataset (“datasets” are called “sources”, as you saw in ```settings.json``` above). The “address book” is the file ```datasets.json```, and its purpose is to act as the one place where all the paths to your data are logged.

The easiest place to put the ```tiles/``` and ```tfrecords/``` directories is in the project directory since you will be extracting tiles and creating TFRecords for each project.

Here is what ```datasets.json``` should look like. This file requires absolute paths ("hard paths") to your data, not relative paths.

```
{
  "SOURCE_1":
  {
    "slides": "/directory",
    "roi": "/directory",
    "tiles": "/directory",
    "tfrecords": "/directory"
  },
  "SOURCE_2":
  {
    "slides": "/directory",
    "roi": "/directory",
    "tiles": "/directory",
    "tfrecords": "/directory"
  }
}
```

You can either add the lines to the JSON file manually or you can add a source to a project with the below code:

In [None]:
import slideflow as sf
P = sf.load_project('/path/to/project/directory')
P.add_source(
    name="SOURCE_NAME",
    slides="/slides/directory",
    roi="/roi/directory",
    tiles="/tiles/directory",
    tfrecords="/tfrecords/directory"
)

Once your Project has been created and your data paths have been added to the ```datasets.json``` file, you can start working with Slideflow.
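
Since minor typos in these two files cause problems, a quick consistency check can save some debugging. The sketch below is not part of Slideflow: `check_project_config` is a hypothetical helper that checks that every source named in `settings.json` exists in `datasets.json` and that each listed path is absolute. The inline JSON strings stand in for your real files and deliberately contain two mistakes.

```python
import json
import os


def check_project_config(settings, datasets):
    """Return a list of human-readable problems (hypothetical helper, not Slideflow API)."""
    problems = []
    for source in settings.get("sources", []):
        if source not in datasets:
            problems.append(f"source '{source}' is missing from datasets.json")
            continue
        for key, path in datasets[source].items():
            # datasets.json is expected to hold absolute paths
            if isinstance(path, str) and not os.path.isabs(path):
                problems.append(f"{source}.{key} is not an absolute path: {path}")
    return problems


# Stand-ins for the parsed contents of settings.json and datasets.json
settings = json.loads('{"sources": ["SOURCE_1", "SOURCE_2"]}')
datasets = json.loads('''
{
  "SOURCE_1": {"slides": "/directory", "roi": "/directory",
               "tiles": "./tiles", "tfrecords": "/directory"}
}
''')

problems = check_project_config(settings, datasets)
for p in problems:
    print(p)
```

With real files, replace the two `json.loads(...)` calls with `json.load(open(path))` on your own `settings.json` and `datasets.json`; `json.load` will also raise an error on syntax mistakes such as trailing commas.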

## Create a Dataset class object

I am pretty sure that you don't need to create a whole project to work with Slideflow; you can just create a Dataset object.

To initialize a [Dataset object](https://slideflow.dev/dataset/), you need the following:
- `config`: Path to the `datasets.json` file that lists data.
- `sources`: Name of each of the datasets you want to include in the analysis. These are the names that you provided for each dataset listed in the `datasets.json` file
- `annotations`: Path to the annotations file.
- `tile_px` and `tile_um`: Tile size in pixels and in microns. These will most likely be 299 and 302, respectively, which is about 10x magnification. For feature extraction, 224px/224um is the expected tile size for most extractors.

In [None]:
# Load the dataset (LUADvsLUSC) from the test project data
dataset = sf.Dataset(
        config='/hard/path/to/datasets.json',
        sources=['LUADvsLUSC'],
        annotations='/hard/path/to/annotations/annotations.csv',
        tile_px=299,
        tile_um=302) 

# Get a summary of the dataset
dataset.summary()
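
The "about 10x" figure follows from simple arithmetic on `tile_px` and `tile_um`. The sketch below assumes the common rule of thumb that a 40x scan is roughly 0.25 microns per pixel and that magnification scales inversely with microns per pixel; that rule is an assumption of this example, not something Slideflow reports, and the function names are illustrative.

```python
def microns_per_pixel(tile_px, tile_um):
    """Effective resolution of a tile: microns covered by each pixel."""
    return tile_um / tile_px


def approx_magnification(tile_px, tile_um, mpp_at_40x=0.25):
    """Rough equivalent magnification, assuming ~0.25 um/px at 40x."""
    return 40 * mpp_at_40x / microns_per_pixel(tile_px, tile_um)


for px, um in [(299, 302), (224, 224)]:
    mpp = microns_per_pixel(px, um)
    mag = approx_magnification(px, um)
    print(f"tile_px={px}, tile_um={um}: {mpp:.3f} um/px, ~{mag:.1f}x")
```

Both tile sizes work out to roughly 1 micron per pixel, so they correspond to about the same magnification and differ mainly in how much tissue each tile covers.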

In some situations, you may want to perform analysis only on a subset of images within a single dataset. You can filter a dataset by a specific feature of the annotation file with `Dataset.filter()`.

In [None]:
# Filter by site: only slides from the listed sites are kept in the filtered dataset.
# This is just an example of filtering; for the purposes of this tutorial, we will not filter.
filter_dataset = dataset.filter({'site': ['Site-97', 'Site-40', 'Site-9', 'Site-177', 'Site-130', 'Site-69', 'Site-67', 'Site-93', 'Site-96']})
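
To decide which values to put in a filter, it helps to list the distinct values of a column in your annotations file first. This is a minimal standard-library sketch: the in-memory CSV is a made-up stand-in for `annotations.csv`, its column names are assumptions for illustration, and `unique_values` is a hypothetical helper.

```python
import csv
import io

# Stand-in for annotations.csv; real files will have more columns and rows
example_csv = io.StringIO(
    "patient,slide,site\n"
    "P1,slide1,Site-97\n"
    "P2,slide2,Site-40\n"
    "P3,slide3,Site-97\n"
)


def unique_values(csv_file, column):
    """Return the sorted distinct values found in one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    return sorted({row[column] for row in reader})


sites = unique_values(example_csv, "site")
print(sites)  # ['Site-40', 'Site-97']
```

With a real file, pass `open('/path/to/annotations.csv', newline='')` instead of the in-memory example.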