# Creating an augmented manifest file (AMF)
An augmented manifest file lists the locations of your training and/or validation data. It is written in the [JSON Lines](https://jsonlines.org/) format.
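
As a rough illustration of the format, the sketch below writes two records as JSON Lines and reads them back using only the standard library. The `example-bucket` name, the keys, and the helper names `dump_jsonl` and `load_jsonl` are made up for this example; the notebook itself uses the `jsonlines` package instead.

```python
import io
import json

# Hypothetical records; the bucket and keys are placeholders.
records = [
    {"source-ref": "s3://example-bucket/image_0001.jpg", "class": "1"},
    {"source-ref": "s3://example-bucket/image_0002.jpg", "class": "0"},
]


def dump_jsonl(rows, fp):
    """Write each row as one JSON object per line."""
    for row in rows:
        fp.write(json.dumps(row) + "\n")


def load_jsonl(fp):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]


# Round-trip through an in-memory text buffer instead of a real file.
buffer = io.StringIO()
dump_jsonl(records, buffer)
buffer.seek(0)
round_trip = load_jsonl(buffer)
```

Because every line is a complete JSON object, a file like this can be appended to or processed one record at a time.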
## Preparation
We start by setting up an AWS S3 session and defining the S3 bucket (or bucket alias) that holds our training/validation data.

In [None]:
from urllib import request, parse
from http.cookiejar import CookieJar
import numpy
import getpass
import netrc
import requests
import json
import boto3
import os
import matplotlib.pyplot as plt
import io
from urllib.parse import urlparse
import sys  
!{sys.executable} -m pip install --user jsonlines
import jsonlines
from PIL import Image

session = boto3.Session(profile_name='sandbox')
client = session.client('s3')

# bucket = 'dnewman2-pwm6dtmajjx83833e31rt99qeg1d4usw2a-s3alias'
bucket = '<your S3 bucket here>'
sourceRef='s3://' + bucket

## File creation

For this example, each line in the AMF references a single HLS image and a class indicating whether that image is cloudy (1) or clear (0).

For example,

`{"source-ref": "s3://lp-prod-public/HLSL30.020/HLS.L30.T01FBF.2021104T213801.v2.0/HLS.L30.T01FBF.2021104T213801.v2.0.jpg", "class": "1"}`
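
To show how a `source-ref` value breaks down, here is a small sketch that splits an `s3://` URI into its bucket and key with `urllib.parse.urlparse`. The helper name `split_source_ref` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse


def split_source_ref(source_ref):
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(source_ref)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {source_ref!r}")
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


example_bucket, example_key = split_source_ref(
    "s3://lp-prod-public/HLSL30.020/HLS.L30.T01FBF.2021104T213801.v2.0/"
    "HLS.L30.T01FBF.2021104T213801.v2.0.jpg"
)
```

Here `example_bucket` is the bucket name and `example_key` is the full object key, which are the two values that `get_object` expects.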

This function takes an open, empty file, the name of the S3 bucket containing the training data, and a JSON object representing a listing of objects within that bucket. Passing the listing in lets us limit the number of objects that the AMF represents.
For each `.jpg` key in the listing, the function displays the image and asks the user whether it is cloudy. It then records the response as a new row in the AMF.

In [None]:
def createManifestFile(f, bucket, response):
    """Label each .jpg in an S3 listing and write one JSON line per image to f."""
    with jsonlines.Writer(f) as writer:
        # 'Contents' is missing from the listing when no objects matched
        for obj in response.get('Contents', []):
            s3_loc = obj['Key']
            if s3_loc.endswith('.jpg'):
                s3_object = client.get_object(
                    Bucket=bucket,
                    Key=s3_loc)
                # Render image (read the body into memory so PIL can seek in it)
                img = Image.open(io.BytesIO(s3_object['Body'].read()))
                plt.imshow(img)
                plt.show()
                # Ask for classification; repeat until the answer is 0 or 1
                while True:
                    try:
                        cloudy = int(input("Cloudy? [1:cloudy, 0:not cloudy]:"))
                        if cloudy not in (0, 1):
                            raise ValueError
                        data = {
                            'source-ref': 's3://' + bucket + '/' + s3_loc,
                            'class': str(cloudy),
                        }
                        writer.write(data)
                        break
                    except ValueError:
                        print("Invalid input. Enter 0 or 1.")
    f.close()
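
The record-building step can also be separated from the interactive prompt, which makes it easier to check without S3 access or a display. The sketch below is one possible way to do that, not part of the notebook's workflow; `build_records` and its `label_for` callback are hypothetical names.

```python
def build_records(keys, bucket_name, label_for):
    """Return one AMF record per .jpg key.

    label_for is any callable that maps a key to 0 (clear) or 1 (cloudy).
    """
    records = []
    for key in keys:
        if not key.endswith(".jpg"):
            continue  # skip anything that is not an image
        label = int(label_for(key))
        if label not in (0, 1):
            raise ValueError(f"Label for {key!r} must be 0 or 1, got {label}")
        records.append({
            "source-ref": f"s3://{bucket_name}/{key}",
            "class": str(label),
        })
    return records


# Usage with placeholder keys and a fixed labelling rule.
sample = build_records(
    ["a.jpg", "notes.txt", "b.jpg"],
    "example-bucket",
    lambda key: 1 if key == "a.jpg" else 0,
)
```

In the interactive case, `label_for` would show the image and return the user's answer.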

## Generate training augmented manifest file
List the first 100 objects in the bucket and generate our training dataset from them. We save the continuation token from the response so that we can fetch the next set of objects for our validation file.

In [None]:
response = client.list_objects_v2(
    Bucket=bucket,
    MaxKeys=100,
)
f = open("cloud_training.json",'w')
createManifestFile(f, bucket, response)
# NextContinuationToken is only present when the bucket holds more objects than were listed
token = response['NextContinuationToken']
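
The continuation token generalizes to any number of pages. As a sketch, the generator below keeps calling `list_objects_v2` until the response is no longer truncated. The name `list_all_keys` is hypothetical, and `s3_client` is assumed to be a boto3 S3 client such as the `client` created above.

```python
def list_all_keys(s3_client, bucket_name, page_size=100):
    """Yield every key in bucket_name, following continuation tokens."""
    kwargs = {"Bucket": bucket_name, "MaxKeys": page_size}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break
        # Ask for the page that follows the one just read.
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

For example, `list(list_all_keys(client, bucket))` would collect every key in the bucket, one page of 100 at a time.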

## Write the file to our S3 bucket

In [None]:

with open("cloud_training.json", "rb") as f:
    client.put_object(
        Bucket='dug-cloud-manifest',
        Key='cloud_training.json',
        Body=f
    )
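
Before uploading, it can be worth checking that every line of the manifest parses and carries the two expected fields. The following is a minimal sketch of such a check using only the standard library; `validate_manifest` is a hypothetical helper, and the rules it enforces (an `s3://` prefix and a class of `"0"` or `"1"`) are assumptions based on the example record shown earlier.

```python
import json


def validate_manifest(path):
    """Return the number of records in the manifest at path.

    Raises ValueError on the first malformed line
    (json.JSONDecodeError is a subclass of ValueError).
    """
    count = 0
    with open(path) as manifest:
        for line_number, line in enumerate(manifest, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number}: expected a JSON object")
            if not str(record.get("source-ref", "")).startswith("s3://"):
                raise ValueError(f"line {line_number}: bad or missing source-ref")
            if record.get("class") not in ("0", "1"):
                raise ValueError(f"line {line_number}: class must be '0' or '1'")
            count += 1
    return count
```

Calling `validate_manifest("cloud_training.json")` returns the number of labelled images, which is also a quick way to see how many of the listed objects were `.jpg` files.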

## Generate validation augmented manifest file
List the next 100 objects in the bucket and generate our validation dataset from them.

In [None]:
# Generate validation augmented manifest file
response = client.list_objects_v2(
    Bucket=bucket,
    MaxKeys=100,
    ContinuationToken=token
)
f = open("cloud_validation.json",'w')
createManifestFile(f, bucket, response)

## Write the file to our S3 bucket

In [None]:
with open("cloud_validation.json", "rb") as f:
    client.put_object(
        Bucket='dug-cloud-manifest',
        Key='cloud_validation.json',
        Body=f
    )

We have now created one AMF for our training data and one for our validation data, each built from a listing of up to 100 objects. We can use these augmented manifest files to train our 'cloudy?' classification model.

In [None]:
print('Done!')