# Set up account

To use AWS services you need an additional Beckman account (called an `adm` account, for example admnbozinovic@beckman.com). Request this account from IT and follow [these instructions](https://wiki.beckman.com/display/GC/Log+In+AWS+Console+using+ADM+account).

Use Microsoft Edge, Chrome in private mode, or Firefox to open the AWS access portal, then select the SageMaker domain (`d-ekzpkxrxgszq`):
https://beckman.awsapps.com/start#/


# Set up new conda environment

SageMaker comes with many built-in [Images](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-images.html) that have many preloaded packages. Each image can support different Jupyter kernels (or equivalent conda environments). Let's create a new conda environment 'hana' on 'Data Science 3.0' Image and run a notebook on it.

First, open the image terminal directly: navigate to the Launcher (click "Amazon SageMaker Studio" in the top left corner) and choose "Open Image terminal".
![](assets/aws0.png)
![](assets/aws1.png)


```zsh
conda env list
```
This lists only the base environment at `/opt/conda`.

I had to install pip (for some reason it is not preinstalled):
```zsh
apt-get update
apt-get install python-pip
```

Let's create a new environment:
```zsh
conda create -n hana python=3.9
conda activate hana
conda install jupyter
```

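Let's also add this environment as a Jupyter kernel. First, list the kernels Jupyter already knows about:
```zsh
jupyter kernelspec list
```
Then register the active environment as a kernel named `hana`:
```zsh
ipython kernel install --name hana --user
```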
Now our custom environment is showing up:
![aws2](assets/aws2.png)
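
To confirm that a notebook is really running on the new kernel, a quick check from inside the notebook can help. The following is a minimal sketch using only the standard library; the helper name `describe_interpreter` is made up for illustration.

```python
import sys


def describe_interpreter() -> dict:
    """Return the interpreter path, environment prefix, and Python version."""
    return {
        "executable": sys.executable,  # path of the running Python binary
        "prefix": sys.prefix,          # root directory of the active environment
        "version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


info = describe_interpreter()
for name, value in info.items():
    print(f"{name}: {value}")
```

When the notebook uses the `hana` kernel, the prefix should point into the `hana` environment and the version should match the one requested at creation time (3.9 above).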

Let's install required packages in the `hana` env:

```zsh
pip install streamlit pandas polars
pip install torch torchvision torchaudio
pip install jupyter methodtools pytorch-lightning scikit-learn colorama libtmux onnxruntime openpyxl xlsxwriter matplotlib pulp
```
Most of these are already preinstalled.

That's it. Our kernel is now ready to be used.
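
One way to double-check the installation from a notebook is to ask Python which modules it can find. This is a small sketch, not part of the original setup; `missing_modules` is a hypothetical helper, and the list uses import names, which differ from some pip names (`scikit-learn` imports as `sklearn`, `pytorch-lightning` as `pytorch_lightning`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names for a subset of the packages installed above.
required = ["streamlit", "pandas", "polars", "torch", "sklearn",
            "pytorch_lightning", "matplotlib", "openpyxl", "xlsxwriter"]

missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

An empty result only shows that the modules are discoverable in the current kernel; it does not verify versions.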

There are [more ways](https://aws.amazon.com/blogs/machine-learning/four-approaches-to-manage-python-packages-in-amazon-sagemaker-studio-notebooks/) to manage packages in SageMaker Studio. I experimented with [Lifecycle Configuration](#optional-set-up-lifecycle-configuration-not-functional-yet), but without success so far.

# Clone HANA GitHub repository

The best way to do this is via the terminal (SageMaker has a GitHub tab option, but I couldn't make it work). One of the easy ways is to use a Personal Access Token (there is an SSH key option as well). First, in GitHub, navigate to [Personal Access Tokens](https://github.com/settings/tokens).

Generate a token (give it any name you like) and make sure to configure it for Single-Sign-On (SSO).
![](assets/github_personal_access_token.png)

Then, in the SageMaker Studio terminal, clone the repo:
```
git clone https://github.com/BecdxMicrobiology/HANA.git
```
**When asked for a password, enter the token that you just created!**

To set up your global git identity, edit and run the following lines:
```zsh
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
```
That way all your commits will carry your name and email.
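
To verify from a notebook that the identity was stored, the global git configuration can be read back. This is a minimal sketch; `git_global_value` is a hypothetical helper that shells out to `git config --global --get`.

```python
import subprocess


def git_global_value(key):
    """Return the global git config value for key, or None if unset or git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "config", "--global", "--get", key],
            capture_output=True, text=True, check=False,
        )
    except OSError:  # git is missing or cannot be executed
        return None
    value = result.stdout.strip()
    return value or None


identity = {key: git_global_value(key) for key in ("user.name", "user.email")}
for key, value in identity.items():
    print(f"{key}: {value if value is not None else 'not set'}")
```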

# To run SageMaker in VS-Code web UI

SageMaker Studio is AWS's own UI. If you prefer the VS Code look, install the somewhat awkwardly named "Code Server" (see the installation [instructions](https://aws.amazon.com/blogs/machine-learning/host-code-server-on-amazon-sagemaker/)). You will then be able to run VS Code in a browser.

![Alt Text](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/10/14/Screenshot-2022-09-29-at-11.33.15-2-1024x267.png)


# S3 access

S3 access requires credentials. Follow the instructions on the command-line access screen of your AWS account to run `aws configure sso`, or set up credentials another way:

![](../assets/aws_access.jpeg)
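
Once credentials are configured, a quick sanity check is to ask AWS STS which account and principal they resolve to. The sketch below wraps the `get_caller_identity` call; `describe_identity` and the `FakeSTS` stand-in (with made-up account values) are assumptions added here so the example runs without credentials.

```python
def describe_identity(sts_client):
    """Summarize which AWS account and principal the current credentials map to."""
    identity = sts_client.get_caller_identity()
    return f"account {identity['Account']} as {identity['Arn']}"


class FakeSTS:
    """Stand-in for boto3.client('sts') so the sketch runs without credentials."""

    def get_caller_identity(self):
        return {
            "UserId": "EXAMPLEID",
            "Account": "123456789012",
            "Arn": "arn:aws:iam::123456789012:user/example",
        }


print(describe_identity(FakeSTS()))
# With real credentials: describe_identity(boto3.client("sts"))
```

If the credentials live in a named profile, something like `boto3.Session(profile_name="<your-profile>").client("sts")` would be the place to select it.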

Accessing files on S3 is less straightforward than working with local files, because every interaction goes through the AWS APIs. `boto3` and `s3fs` are two libraries for this, each with its own API. I find `s3fs` nicer to use since "S3Fs is a Pythonic file interface to S3. It builds on top of botocore". Pandas is an exception: `pd.read_csv` just works with `s3://` paths (it relies on `s3fs` being installed). There is also the `s3fs-fuse` library (and apparently an even higher-performance one called `goofys`) that offers to mount S3 as a file system; however, I ran into issues when trying to install them. Some examples follow:

```python
import boto3
import s3fs
import pandas as pd
import os

bucket_name = 'bec-ip-prod-sac-projecthana'
file_key = 'part1_data/FF1/All_Retest_Isolates_Dropped_Read.csv'
```

## Read into DataFrame

```python
df = pd.read_csv(f's3://{bucket_name}/{file_key}')
```

## s3fs

```python
s3 = s3fs.S3FileSystem(anon=False)  # Set anon to True for public buckets
```

### ls

```python
contents = s3.ls(bucket_name)
print(contents)
```
```
['bec-ip-prod-sac-projecthana/Flagging Tool', 'bec-ip-prod-sac-projecthana/part1_data', 'bec-ip-prod-sac-projecthana/part1_partitioned']
```


### glob

```python
s3.glob(f's3://{bucket_name}/**/a*.csv')
```
```
['bec-ip-prod-sac-projecthana/part1_partitioned/all_Tier_204112023.csv',
 'bec-ip-prod-sac-projecthana/part1_partitioned/all_Tier_204112023_deduplicated.csv']
```

### download

```python
# Specify local directory for downloaded file
local_directory = '/tmp/'
local_file_path = os.path.join(local_directory, 'All_Retest_Isolates_Dropped_Read.csv')

# Ensure the local directory exists
os.makedirs(local_directory, exist_ok=True)

# Download CSV file
s3.download(f's3://{bucket_name}/{file_key}', local_file_path)

# Load CSV into Pandas DataFrame
df = pd.read_csv(local_file_path)

# Clean up downloaded file
# os.remove(local_file_path)
```
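
The download, load, and clean-up steps can also be bundled so that the local copy is removed automatically. This is only a sketch of one possible pattern: `fetch_and_load` is a hypothetical helper, and `fake_download` and the `example-bucket` path stand in for real S3 access so the example runs anywhere.

```python
import os
import tempfile


def fetch_and_load(download, remote_path, load):
    """Download remote_path into a temporary directory, load it, then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, os.path.basename(remote_path))
        download(remote_path, local_path)  # e.g. s3.download
        return load(local_path)            # e.g. pd.read_csv


def fake_download(remote_path, local_path):
    """Stand-in downloader that writes a tiny CSV instead of contacting S3."""
    with open(local_path, "w") as handle:
        handle.write("isolate,result\nA1,pass\n")


def read_text(path):
    """Stand-in loader that returns the file content as a string."""
    with open(path) as handle:
        return handle.read()


print(fetch_and_load(fake_download, "s3://example-bucket/data/sample.csv", read_text))
```

With the objects defined earlier, the call would presumably look like `fetch_and_load(s3.download, f's3://{bucket_name}/{file_key}', pd.read_csv)`. The temporary directory is deleted once the loader returns, so the loader has to read the file fully, as `pd.read_csv` does by default.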

## boto3

```python
s3_client = boto3.client('s3')
```

### download

```python
# Specify local directory for downloaded file
local_directory = '/tmp/'
local_file_path = os.path.join(local_directory, 'All_Retest_Isolates_Dropped_Read.csv')

# Ensure the local directory exists
os.makedirs(local_directory, exist_ok=True)

# Download CSV file
s3_client.download_file(bucket_name, file_key, local_file_path)

# Load CSV into Pandas DataFrame
df = pd.read_csv(local_file_path)

# Clean up downloaded file
os.remove(local_file_path)
```
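
Note that `boto3` takes the bucket and key as separate arguments, whereas `s3fs` and pandas accept a single `s3://` path. A small converter makes it easier to switch between the two styles. The helpers `split_s3_uri` and `join_s3_uri` below are hypothetical and not part of either library.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {uri!r}")
    return bucket, key


def join_s3_uri(bucket, key):
    """Build 's3://bucket/key' from its parts."""
    return f"s3://{bucket}/{key}"


bucket, key = split_s3_uri(
    "s3://bec-ip-prod-sac-projecthana/part1_data/FF1/All_Retest_Isolates_Dropped_Read.csv"
)
print(bucket)
print(key)
```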


### ls

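```python
# A single call returns at most 1000 keys
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='part1')
for f in response.get('Contents', []):  # 'Contents' is missing when nothing matches
    print(f['Key'])
```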
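
A single list call returns at most 1000 keys, so larger prefixes need pagination. The sketch below uses the `list_objects_v2` paginator from boto3; `list_keys` is a hypothetical helper, and the fake client classes with made-up keys exist only so the example runs without AWS access.

```python
def list_keys(s3_client, bucket, prefix=""):
    """Yield every object key under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # "Contents" is absent on empty pages
            yield obj["Key"]


class FakePaginator:
    """Returns canned pages in place of real S3 responses."""

    def __init__(self, pages):
        self.pages = pages

    def paginate(self, **kwargs):
        return iter(self.pages)


class FakeS3Client:
    """Stand-in for boto3.client('s3') with two result pages and one empty page."""

    def get_paginator(self, operation_name):
        return FakePaginator([
            {"Contents": [{"Key": "part1_data/a.csv"}]},
            {"Contents": [{"Key": "part1_data/b.csv"}]},
            {},
        ])


print(list(list_keys(FakeS3Client(), "example-bucket", "part1")))
```

With a real client the call would be `list_keys(boto3.client('s3'), bucket_name, 'part1')`.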

# (Optional) Set up LifeCycle configuration (not functional yet)

A startup script can be run automatically when the kernel starts. To set this up, we'll follow this [guide](https://aws.amazon.com/blogs/machine-learning/customize-amazon-sagemaker-studio-using-lifecycle-configurations/).

First, find the SageMaker domain ID (this should be `d-ekzpkxrxgszq`):
```zsh
aws sagemaker list-domains
```

Find your user profile name (for example: `admnbozinovic-beckman-com-f0f`):
```zsh
aws sagemaker list-user-profiles --domain-id d-ekzpkxrxgszq
```
Next, create a startup bash script, for example named `install-package.sh`. This is just one example; there are many [more](https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples):


```bash
#!/bin/bash
# This script installs a single pip package on a SageMaker Studio Kernel Application

set -eux  # exit on errors and unset variables; print each command for logging

# PARAMETERS
PACKAGE=pyarrow

pip install --upgrade "$PACKAGE"
```

Next, convert the script to a base64-encoded string:
```zsh
LCC_CONTENT=$(openssl base64 -A -in install-package.sh)
```
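
The same single-line encoding can be produced in Python, which may be convenient when working from a notebook. This is a minimal sketch; `encode_script` is a hypothetical helper, and the script text here is a shortened version of the example above.

```python
import base64


def encode_script(script_text):
    """Return script_text as a single-line base64 string."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


script = "#!/bin/bash\nset -eux\npip install --upgrade pyarrow\n"
lcc_content = encode_script(script)
print(lcc_content)
```

To encode the actual file, pass its content instead, for example `encode_script(open("install-package.sh").read())`.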

Next, create the lifecycle configuration entity:
```zsh
aws sagemaker create-studio-lifecycle-config \
--studio-lifecycle-config-name install-pip-package-on-kernel \
--studio-lifecycle-config-content "$LCC_CONTENT" \
--studio-lifecycle-config-app-type KernelGateway
```

This prints the configuration's `ARN`. A lifecycle configuration is immutable, so you will have to create a new one if the script changes:
```
"StudioLifecycleConfigArn": "arn:aws:sagemaker:us-west-2:327333652600:studio-lifecycle-config/install-pip-package-on-kernel"
```

Associate the lifecycle configuration with the user profile in the domain:

```zsh
aws sagemaker update-user-profile --domain-id d-ekzpkxrxgszq \
--user-profile-name admnbozinovic-beckman-com-f0f \
--user-settings '{
"KernelGatewayAppSettings": {
  "LifecycleConfigArns":
    ["arn:aws:sagemaker:us-west-2:327333652600:studio-lifecycle-config/install-pip-package-on-kernel"]
  }
}'
```
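This step currently fails for me, because the Studio execution role is not allowed to call `sagemaker:UpdateUserProfile`: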

```
An error occurred (AccessDeniedException) when calling the UpdateUserProfile operation: User: arn:aws:sts::327333652600:assumed-role/AmazonSageMaker-ExecutionRole-20230803T101829/SageMaker is not authorized to perform: sagemaker:UpdateUserProfile on resource: arn:aws:sagemaker:us-west-2:327333652600:user-profile/d-ekzpkxrxgszq/admnbozinovic-beckman-com-f0f because no identity-based policy allows the sagemaker:UpdateUserProfile action
```
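
The error says that no identity-based policy allows the action, so an account administrator would presumably need to grant it to the execution role (or run the command under a role that already has it). As a starting point for that conversation, the sketch below builds a minimal IAM policy document for this one action. The helper `update_user_profile_policy` is hypothetical, and I have not verified that this policy alone resolves the issue.

```python
import json


def update_user_profile_policy(user_profile_arn):
    """Build an IAM policy document allowing UpdateUserProfile on one user profile."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sagemaker:UpdateUserProfile",
                "Resource": user_profile_arn,
            }
        ],
    }


# Resource ARN copied from the error message above.
profile_arn = (
    "arn:aws:sagemaker:us-west-2:327333652600:"
    "user-profile/d-ekzpkxrxgszq/admnbozinovic-beckman-com-f0f"
)
print(json.dumps(update_user_profile_policy(profile_arn), indent=2))
```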
