# Predicting missed tifs
### Context

Crop mask inference is set up such that:

- one tif file (satellite time series) added to `crop-mask-earthengine` -> one prediction file (crop/non-crop class) in `crop-mask-preds`

Similarly:
- 10 tif files added to `crop-mask-earthengine` -> 10 prediction files in `crop-mask-preds`
- 10,000 tif files added to `crop-mask-earthengine` -> 10,000 prediction files in `crop-mask-preds`

### Problem

However, when tens of thousands of tif files are processed at the same time, the system may fail on a few of them because it cannot scale up fast enough:
- 100,000 tif files added to `crop-mask-earthengine` -> 99,823 prediction files in `crop-mask-preds`

### Solution

In this event, this notebook can be used to:
1. Identify the tif files that failed to produce a prediction file.
2. Rename only those tif files, which triggers crop-mask inference on just those files and produces the missing prediction files.

Example:
1. 100,000 tif files added to `crop-mask-earthengine` -> 99,823 prediction files in `crop-mask-preds`
2. The notebook identifies the 177 missing files (100,000 - 99,823)
3. 177 tif files renamed in `crop-mask-earthengine` -> 177 prediction files in `crop-mask-preds`
4. Now `crop-mask-preds` contains 100,000 prediction files
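
The identification step in the example above boils down to a set difference between tif names and prediction names. Below is a minimal sketch on made-up file names. It assumes, as the code in Section 1 does, that a prediction file carries the name of its tif with a `pred_` marker; the `.nc` extension, the file names, and the helper `stem_without_pred` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical file names; the ".nc" extension for predictions is an assumption
tifs = ["batch_0/00000001.tif", "batch_0/00000002.tif", "batch_0/00000003.tif"]
preds = ["batch_0/pred_00000001.nc", "batch_0/pred_00000003.nc"]


def stem_without_pred(name):
    """Return the file stem with a leading 'pred_' prefix removed, if present."""
    stem = PurePosixPath(name).stem
    return stem[len("pred_"):] if stem.startswith("pred_") else stem


tif_stems = {stem_without_pred(t) for t in tifs}
pred_stems = {stem_without_pred(p) for p in preds}

# Tifs that have no matching prediction file
missing_stems = sorted(tif_stems - pred_stems)
print(missing_stems)  # ['00000002']
```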


## 1. Setup

In [104]:
#!gcloud auth login
#!gcloud config set project bsos-geog-harvest1

In [78]:
import os
from collections import defaultdict
from google.cloud import storage
from pathlib import Path
from tqdm.notebook import tqdm

In [79]:
client = storage.Client()

In [80]:
tifs_bucket_name = "crop-mask-earthengine"
preds_bucket_name = "crop-mask-preds"

In [116]:
def get_files_dict(bucket_name, prefix=None):
    """Group the file stems in a bucket by parent folder and count the files."""
    blobs = client.list_blobs(bucket_name, prefix=prefix)
    files_dict = defaultdict(list)
    amount = 0
    for blob in tqdm(blobs, desc=f"From {bucket_name}"):
        p = Path(blob.name)
        # Strip the "pred_" marker so prediction names line up with tif names
        files_dict[str(p.parent)].append(p.stem.replace("pred_", ""))
        amount += 1
    return files_dict, amount
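
To see what `get_files_dict` returns without touching a bucket, the grouping logic can be run on a plain list of names. The sketch below is an offline stand-in, not part of the original notebook: `group_by_parent` and the blob names are hypothetical, and it uses `PurePosixPath` instead of `Path` so the keys keep forward slashes on any operating system.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_parent(blob_names):
    """Group file stems by parent folder, dropping the 'pred_' marker."""
    files_dict = defaultdict(list)
    for name in blob_names:
        p = PurePosixPath(name)
        files_dict[str(p.parent)].append(p.stem.replace("pred_", ""))
    return files_dict


# Hypothetical blob names, shaped like "<prefix>/<batch>/<file>.tif"
names = [
    "Kenya/Kenya_2019_7/batch_0/00000001.tif",
    "Kenya/Kenya_2019_7/batch_0/00000002.tif",
    "Kenya/Kenya_2019_7/batch_1/00000001.tif",
]
grouped = group_by_parent(names)
for folder, stems in grouped.items():
    print(folder, stems)
```

Each key is a batch folder and each value is the list of file stems in it, which is the shape the comparison in Section 2 relies on.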

## 2. Identifying tif files with missing predictions

In [135]:
prefix = "Kenya/Kenya_2019_7/"

In [136]:
# This may take some time; if the cell fails, try running it again
tif_files, tif_amount = get_files_dict(tifs_bucket_name, prefix=prefix)
pred_files, pred_amount = get_files_dict(preds_bucket_name, prefix=prefix)

missing = {}
for full_k in tqdm(tif_files.keys(), desc="Missing files"):
    if full_k not in pred_files:
        diffs = tif_files[full_k]
    else:
        diffs = list(set(tif_files[full_k]) - set(pred_files[full_k]))
    if len(diffs) > 0:
        missing[full_k] = diffs

batches_with_issues = len(missing.keys())
print("------------------------------------------------------------------------------")
print(prefix) 
print("------------------------------------------------------------------------------")
if batches_with_issues > 0:
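    # Count the files actually listed below; tif_amount - pred_amount can be off if the preds bucket holds extra files
    print(f"\u2716 {batches_with_issues}/{len(tif_files.keys())} batches have a total {sum(len(f) for f in missing.values())} missing predictions:")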
    for batch, files in missing.items():
        print("\t--------------------------------------------------")
        print(f"\t{Path(batch).stem}: {len(files)}")
        print("\t--------------------------------------------------")
        for f in files:
            print(f"\t{f}")
else:
    print("\u2714 all files in each batch match")

------------------------------------------------------------------------------
Kenya/Kenya_2019_7/
------------------------------------------------------------------------------
✔ all files in each batch match


## 3. Renaming the missing files to retrigger crop-mask inference
Only execute this cell if you are sure that crop-mask inference has completed and missed these files.

In [134]:
bucket = client.bucket(tifs_bucket_name)
for batch, files in tqdm(missing.items(), desc="Going through batches"):
    for file in tqdm(files, desc="Renaming files", leave=False):
        blob_name = f"{batch}/{file}.tif"
        blob = bucket.blob(blob_name)
        if blob.exists():
            new_blob_name = f"{batch}/{file}-retry1.tif"
            bucket.rename_blob(blob, new_blob_name)
        else:
            print(f"Could not find: {blob_name}")
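
Because renaming retriggers inference, it can help to preview the renames before running the cell above. The sketch below builds the old and new blob names without contacting the bucket. It is not part of the original notebook: `next_retry_stem` and `plan_renames` are hypothetical helpers, and incrementing an existing `-retryN` suffix (rather than always appending `-retry1`) is an assumption about how a second retry round might be named.

```python
import re


def next_retry_stem(stem):
    """Increment an existing '-retryN' suffix, or append '-retry1' if there is none."""
    match = re.fullmatch(r"(.*)-retry(\d+)", stem)
    if match:
        return f"{match.group(1)}-retry{int(match.group(2)) + 1}"
    return f"{stem}-retry1"


def plan_renames(missing):
    """Build (old_blob_name, new_blob_name) pairs without touching the bucket."""
    return [
        (f"{batch}/{stem}.tif", f"{batch}/{next_retry_stem(stem)}.tif")
        for batch, stems in missing.items()
        for stem in stems
    ]


# Hypothetical input in the same shape as the `missing` dict from Section 2
example_missing = {"Kenya/Kenya_2019_7/batch_0": ["00000002", "00000005-retry1"]}
for old, new in plan_renames(example_missing):
    print(f"{old} -> {new}")
```

Once the printed plan looks right, the same pairs could be passed to `bucket.rename_blob` as in the cell above.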
