# Preparing for SageMaker
## Code
### Uploading local code
All code should live in a single folder. The name of this folder is passed as the `source_dir` argument.

### Importing third party modules
You can add a `requirements.txt` file for external packages, which SageMaker installs as long as it is accompanied by a `setup.py`.

### Importing custom modules
For internal modules, you need to explicitly pass the path to the module through the `dependencies` argument.

For example, if you want to access a module called `ssd300` in the scripts folder, you need to pass it to SageMaker as:
```python
dependencies=['scripts/ssd300']
```
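A quick pre-flight check can catch a missing dependency path before the estimator is constructed. The sketch below is not part of the SageMaker SDK: `missing_dependencies` is a hypothetical helper, and it assumes the dependency paths are given relative to a root directory you choose (for example, the project root).

```python
import os
import tempfile


def missing_dependencies(dependencies, root="."):
    """Return the dependency paths that do not exist under `root`.

    Hypothetical helper for illustration, not part of the SageMaker SDK.
    """
    return [d for d in dependencies if not os.path.exists(os.path.join(root, d))]


# Demo on a throwaway layout: scripts/ssd300 exists, scripts/other does not.
with tempfile.TemporaryDirectory() as project_root:
    os.makedirs(os.path.join(project_root, "scripts", "ssd300"))
    demo_missing = missing_dependencies(
        ["scripts/ssd300", "scripts/other"], root=project_root
    )

print(demo_missing)
```

An empty list means every path was found under the chosen root.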

## Data
### Local data
For local data stored in a directory whose path is `data_dir`, you need to specify it as:
```python
inputs = {'training': f'file://{data_dir}'}
```
### Remote data
For remote data, you need to upload it to S3 and then point SageMaker to that bucket.
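As a rough sketch of the remote case, the `inputs` dictionary can point at an `s3://` URI instead of a `file://` path. The bucket name and key prefix below are placeholders, and `s3_uri` and `upload_and_build_inputs` are hypothetical helpers. The upload step relies on `sagemaker.Session.upload_data`, and assumes AWS credentials are configured, so it is wrapped in a function that is not called here.

```python
def s3_uri(bucket, key_prefix):
    """Build an s3:// URI from a bucket name and a key prefix."""
    return f"s3://{bucket}/{key_prefix.strip('/')}"


def upload_and_build_inputs(local_dir, bucket, key_prefix, channel="training"):
    """Upload `local_dir` to S3 and return an `inputs` dict for `fit`.

    Requires the `sagemaker` package and AWS credentials, so the import is
    done lazily and this function is not executed in the demo below.
    """
    import sagemaker

    session = sagemaker.Session()
    uri = session.upload_data(path=local_dir, bucket=bucket, key_prefix=key_prefix)
    return {channel: uri}


# Placeholder bucket and prefix, not real resources.
inputs = {"training": s3_uri("my-example-bucket", "ssd300/data/")}
print(inputs)
```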

## Training Code
### Command line arguments
All training scripts must be able to accept the following arguments:
```python
import argparse
import os
p = argparse.ArgumentParser()
p.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))  # for TensorFlow
p.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))  # for PyTorch
p.add_argument('--data_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
```
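Put together, the argument handling in a training script might look like the sketch below. It is one possible arrangement rather than a required one. The `parse_args` function and the `--epochs` flag are illustrative, and the environment variables are set by hand only to simulate what the container would provide. Registering both `--model-dir` and `--model_dir` on one argument lets the same script accept either spelling.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker's environment variables."""
    p = argparse.ArgumentParser()
    # Both spellings map to args.model_dir.
    p.add_argument('--model-dir', '--model_dir', type=str,
                   default=os.environ.get('SM_MODEL_DIR'))
    p.add_argument('--data_dir', type=str,
                   default=os.environ.get('SM_CHANNEL_TRAINING'))
    # Hyperparameters arrive as command line flags; `epochs` is only an example.
    p.add_argument('--epochs', type=int, default=1)
    # parse_known_args tolerates flags that this script does not declare.
    args, _unknown = p.parse_known_args(argv)
    return args


# Simulate the container environment, then parse a sample command line.
os.environ['SM_MODEL_DIR'] = '/opt/ml/model'
os.environ['SM_CHANNEL_TRAINING'] = '/opt/ml/input/data/training'
args = parse_args(['--epochs', '2', '--batch_size', '1'])
print(args.model_dir, args.data_dir, args.epochs)
```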
### Accessing uploaded data
`SM_CHANNEL_XX` is the location of your data. For example, if you passed inputs with:
```python
inputs = {'training': training_data, 'test': test_data}
```
then you can access that data through `os.environ.get('SM_CHANNEL_TRAINING')` and `os.environ.get('SM_CHANNEL_TEST')`.

By default, each channel is placed under `/opt/ml/input/data/`. That means `SM_CHANNEL_TRAINING` is equivalent to `/opt/ml/input/data/training`, and `SM_CHANNEL_TEST` is equivalent to `/opt/ml/input/data/test`.
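The mapping from channel name to directory can be sketched as a small lookup. `channel_dir` is a hypothetical helper: it reads the `SM_CHANNEL_<NAME>` variable and falls back to the default path described above when the variable is not set. The `/tmp/local_training_data` path is made up for the demo.

```python
import os


def channel_dir(channel):
    """Return the data directory for a channel name such as 'training' or 'test'."""
    default = f'/opt/ml/input/data/{channel}'
    return os.environ.get(f'SM_CHANNEL_{channel.upper()}', default)


# With no variable set, the default location is used.
os.environ.pop('SM_CHANNEL_TEST', None)
print(channel_dir('test'))

# When the variable is set (as it is inside the container), it takes precedence.
os.environ['SM_CHANNEL_TRAINING'] = '/tmp/local_training_data'
print(channel_dir('training'))
```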

### Writing artifacts
Any artifacts you want to keep must be written to `SM_MODEL_DIR`.
```python
p.add_argument('-o', '--output_path', default=os.environ.get('SM_MODEL_DIR'))
```
The default location for `SM_MODEL_DIR` is `/opt/ml/model`.
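A minimal sketch of writing an artifact follows. `save_artifact` is a hypothetical helper and the JSON payload is only a stand-in for real model files. A temporary directory plays the role of `SM_MODEL_DIR` so the example can run outside a container.

```python
import json
import os
import tempfile


def save_artifact(obj, filename, model_dir=None):
    """Write `obj` as JSON under the model directory and return the file path."""
    model_dir = model_dir or os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, 'w') as f:
        json.dump(obj, f)
    return path


# Demo: a temporary directory stands in for SM_MODEL_DIR.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_artifact({'epochs': 2}, 'metrics.json', model_dir=tmp)
    with open(saved_path) as f:
        reloaded = json.load(f)

print(reloaded)
```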

In [3]:
import sagemaker
import os
from sagemaker.tensorflow import TensorFlow

sess = sagemaker.Session()
role = "SageMakerRole"

In [4]:
git_config = {'repo': 'https://github.com/mynameisvinn/SSD300', 
              'branch': 'master'}

In [5]:
tf_estimator = TensorFlow(entry_point='train_ssd300.py', 
                          role=role,
                          source_dir="scripts",
                          git_config=git_config,
                          instance_count=1, 
                          instance_type='local',
                          framework_version='1.12.0', 
                          py_version='py3',
                          script_mode=True,
                          dependencies=['scripts/ssd300'],
                          hyperparameters={
                              'epochs': 2,
                              'batch_size': 1,
                              'data_def_dir': '/opt/ml/input/data/training/tooth_id_v1.3',
                              'reload_data_path': '/opt/ml/input/data/training/image_label_sample_data.npy',
                              'exp_name': 'myexperiment',
                              'model_type': 'tooth-id',
                              'steps_per_epoch': 1,
                          }
                         )

In [6]:
data_dir = os.path.join(os.getcwd(), 'for_vin')
f'file://{data_dir}'

'file:///Users/mynameisvinn/Dropbox/Temp/ml_dental_ssd300/for_vin'

In [7]:
inputs = {'training': f'file://{data_dir}'}
tf_estimator.fit(inputs) 

ValueError: Dependency ssd300 does not exist in the repo.