<a href="https://colab.research.google.com/github/isabellasoldner/isabellasoldner.github.io/blob/master/power_plants/image_annotations/annotation_results/compare_gen_and_orig_cooling_annotations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Using this Colab notebook requires a GitHub personal access token. A guide to obtaining one is here:

https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line

In [1]:
from os.path import join
from google.colab import drive


ROOT = "/content/drive"
drive.mount(ROOT)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Mount your Google Drive so that the repo can be cloned into it.

The next cell reads the GitHub credentials file if you have used this code before; otherwise it prompts for the values and creates the file in your own personal Drive.

In [2]:
fn = F"{ROOT}/My Drive/creds.txt"
try:
    # Reuse credentials saved by a previous run: one value per line.
    with open(fn, 'r') as f:
        GIT_ORGANISATION = f.readline().rstrip()
        GIT_TOKEN = f.readline().rstrip()
        GIT_REPOSITORY = f.readline().rstrip()
except FileNotFoundError:
    # First run: prompt for each value and save it to Drive.
    with open(fn, 'w') as f:
        GIT_ORGANISATION = input('Input git organisation (eg. WattTime).')
        f.write(GIT_ORGANISATION)
        f.write("\n")
        GIT_TOKEN = input('Input git token.')
        f.write(GIT_TOKEN)
        f.write("\n")
        GIT_REPOSITORY = input('Input git repository (eg. watttime-google-ai).')
        f.write(GIT_REPOSITORY)
        f.write("\n")
        # No explicit f.close() is needed: the with block closes the file.

Input git organisation (eg. WattTime).WattTime
Input git token.<redacted>
Input git repository (eg. watttime-google-ai).watttime-google-ai


Clone the repo into the file path of your choice. The clone fails if PROJECT_PATH already exists and is not empty; in that case, update with git pull instead or choose a new directory.

In [3]:
PROJECT_PATH = "drive/My Drive/repo"
!mkdir "$PROJECT_PATH"
git_url = F"https://{GIT_TOKEN}@github.com/{GIT_ORGANISATION}/{GIT_REPOSITORY}.git"
!git clone "$git_url" "$PROJECT_PATH"

mkdir: cannot create directory ‘drive/My Drive/repo’: File exists
fatal: destination path 'drive/My Drive/repo' already exists and is not an empty directory.


Move into repo directory

In [18]:
cd "$PROJECT_PATH"

/content/drive/My Drive/repo


To update an existing clone instead of cloning again, run git pull from inside the repo directory.

In [4]:
!git pull --quiet

fatal: not a git repository (or any of the parent directories): .git
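
The clone and pull cells above can be folded into one step that picks the right command for the current state of PROJECT_PATH. The following is a minimal sketch, not the notebook's own method: `sync_command` and `sync_repo` are hypothetical helper names, and it assumes `git` is on the PATH. Passing `-C` makes the pull independent of the current working directory, which may be why the pull above reported "not a git repository".

```python
import os
import subprocess


def sync_command(project_path, git_url):
    """Return the git command to run: pull for an existing checkout, clone otherwise."""
    if os.path.isdir(os.path.join(project_path, ".git")):
        # Existing checkout: update it in place, regardless of the current directory.
        return ["git", "-C", project_path, "pull", "--quiet"]
    # No checkout yet: clone (git creates the directory if it is missing).
    return ["git", "clone", git_url, project_path]


def sync_repo(project_path, git_url):
    """Run the chosen command and raise CalledProcessError if git fails."""
    subprocess.run(sync_command(project_path, git_url), check=True)
```

In the notebook this would be called as `sync_repo(PROJECT_PATH, git_url)`. A directory that exists, is not empty, and has no `.git` folder still makes the clone fail, as described earlier.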


Install the required packages listed in the repo's requirements.txt

In [0]:
!pip install -r 'requirements.txt' --quiet

Collecting absl-py==0.8.1
  Downloading https://files.pythonhosted.org/packages/3b/72/e6e483e2db953c11efa44ee21c5fdb6505c4dffa447b4263ca8af6676b62/absl-py-0.8.1.tar.gz (103kB)
     |████████████████████████████████| 112kB 3.3MB/s 
Collecting appnope==0.1.0
  Downloading https://files.pythonhosted.org/

In [21]:
ls

[0m[01;34madvisor[0m/         [01;34mdrive[0m/                      lint.sh        requirements.txt
[01;34mcolab[0m/           [01;34mexperimental[0m/               [01;34mmodeling[0m/      run_tests.sh
[01;34mcommon[0m/          gcp_upgrade_to_python37.sh  [01;34mpower_plants[0m/  [01;34msensing[0m/
[01;34mdatabase_utils[0m/  [01;34mground_truth[0m/               README.md      [01;34mtest_data[0m/


Load modules from the git repo using this syntax.

In [6]:
from importlib.machinery import SourceFileLoader
common = SourceFileLoader('common', join(PROJECT_PATH, 'common/common.py')).load_module()
gcs_utils = SourceFileLoader('gcs_utils', join(PROJECT_PATH, 'common/gcs_utils.py')).load_module()
normalize = SourceFileLoader('normalize', join(PROJECT_PATH, 'power_plants/normalize.py')).load_module()


ModuleNotFoundError: ignored
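
As an alternative to calling `SourceFileLoader(...).load_module()`, which recent Python versions deprecate, a file can be loaded through `importlib.util`. This is a sketch under that assumption; `load_module_from_path` is a hypothetical helper name, not part of the repo.

```python
import importlib.util
import sys


def load_module_from_path(name, path):
    """Load a Python source file as a module without installing it."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so later `import name` statements reuse this module.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module
```

For example, `common = load_module_from_path('common', join(PROJECT_PATH, 'common/common.py'))`. A ModuleNotFoundError like the one above usually means the loaded file itself imports something that is not installed or not on `sys.path`; this helper does not change that.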

This script reads the output JSONL from two models run using run_automl_model.py: one trained on generalised
cooling labels from Platts and one on the original Platts cooling labels. It reports statistics on the
classification scores of each model's predictions, checks whether the predictions of the two models
match, plots histograms, writes CSV files of the images to the bucket so they can be read into AutoML to
check the labels, and allows spot checking of the images.
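
To make the flattening step concrete, here is a small standard-library sketch of how one prediction record is reduced to its first annotation, with an UNKNOWN placeholder when the model returned no label. The record layout (`ID`, `annotations`, `classification.score`) is taken from the helper functions in this notebook; the sample IDs, labels and scores are made up, and `top_annotation` is a hypothetical name.

```python
import json

# Placeholder used when a record has no annotations (score below the cutoff).
UNKNOWN = {"annotation_spec_id": "UNKNOWN",
           "classification": {"score": 0},
           "display_name": "UNKNOWN"}


def top_annotation(record):
    """Return ID, label and score of the first annotation, or UNKNOWN if there is none."""
    first = record["annotations"][0] if record["annotations"] else UNKNOWN
    return {
        "ID": record["ID"],
        "annotation_spec_id": first["annotation_spec_id"],
        "display_name": first["display_name"],
        "classification_score": first["classification"].get("score"),
    }


# Two made-up JSONL lines; the values are illustrative only.
sample_jsonl = "\n".join([
    json.dumps({"ID": "gs://bucket/a.png",
                "annotations": [{"annotation_spec_id": "1",
                                 "classification": {"score": 0.9},
                                 "display_name": "cooling_tower"}]}),
    json.dumps({"ID": "gs://bucket/b.png", "annotations": []}),
])

rows = [top_annotation(json.loads(line)) for line in sample_jsonl.splitlines()]
```

The second row comes back labelled UNKNOWN with a score of 0, which is how unlabelled images are later excluded from the statistics.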

In [0]:
import pandas as pd
import logging
import google.cloud.storage as gcs
import plotly.graph_objects as go
from PIL import Image
from PIL import ImageFont
from PIL import ImageDraw
import matplotlib.pyplot as plt
import numpy as np
from random import randint


In [0]:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "drive/My Drive/Colab Notebooks/emissions-monitoring-user.json"


In [0]:
pd.set_option('display.expand_frame_repr', False)


def get_pd_from_jsonl_in_bucket(BUCKET_NAME, file, outfile):
    """ downloads jsonl model results files from bucket and formats into a pandas dataframe"""
    download_blob(BUCKET_NAME, file, outfile)
    df = pd.read_json(outfile, lines=True)
    df['annotations'] = df['annotations'].apply(lambda d: d if not len(d) == 0 else
                                                [{'annotation_spec_id': 'UNKNOWN',
                                                  'classification': {'score': 0}, 'display_name': 'UNKNOWN'}])
    df['annotation_spec_id'] = [x[0]['annotation_spec_id'] for x in df['annotations']]
    df['classification'] = [x[0]['classification'] for x in df['annotations']]
    df['display_name'] = [x[0]['display_name'] for x in df['annotations']]
    df['classification_score'] = df['classification'].apply(lambda x: x.get('score'))
    df.drop(columns=['annotations', 'classification'], inplace=True)
    return df


def append_jsonl(list_of_files):
    """Append all the different model results files in the bucket together.

    Relies on the module-level BUCKET_NAME and outfile set in the main block.
    """
    frames = [get_pd_from_jsonl_in_bucket(BUCKET_NAME, f, outfile) for f in list_of_files]
    if not frames:
        return pd.DataFrame(columns=['ID', 'annotation_spec_id', 'display_name', 'classification_score'])
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    return pd.concat(frames, ignore_index=True)


def histogram_two_overlayed(xin, yin, xname, yname, title):
    """Plot an overlayed histogram of two values"""
    fig = go.Figure()
    fig.add_trace(go.Histogram(x=xin, name=xname))
    fig.add_trace(go.Histogram(x=yin, name=yname))

    # Overlay both histograms
    fig.update_layout(title_text=title, barmode='overlay')
    # Reduce opacity to see both histograms
    fig.update_traces(opacity=0.75)
    fig.show()
    return


def numpy_image_from_bucket(BUCKET_NAME, file, outfile, caption):
    """get and image from model results in numpy form and caption it with label results"""
    download_blob(BUCKET_NAME, file, outfile)
    img = Image.open(outfile)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("Arial Bold.ttf", 70)
    width, height = img.size
    draw.text((width / 15 + 25, height - 70),
              caption, (255, 255, 0), font=font, align="center")
    pix = np.array(img)
    return pix


def download_blob(BUCKET_NAME, file, outfile):
    """download blob to file"""
    bucket = gcs_client.get_bucket(BUCKET_NAME)
    blob = bucket.blob(file)
    blob.download_to_filename(outfile)
    return


def get_random_pp_images(matching, BUCKET_NAME, dst_prefix):
    """ Grab 4 random power plant images from the results and plot them with locationid captions"""
    tmp_image = 'tmp.png'
    nrow = 2
    ncol = 2
    fig, axs = plt.subplots(nrow, ncol, figsize=(15, 15))
    offset = randint(0, len(matching) - 4)
    images = []
    for i in range(nrow):
        for j in range(ncol):
            count = offset + j + ncol*(i)
            gen_str = matching["display_name_gen"].astype(str).iloc[count]
            orig_str = matching["display_name_orig"].astype(str).iloc[count]
            map_str = matching["display_name_mapped"].astype(str).iloc[count]
            title_str = matching["ID_gen"].astype(str).iloc[count].replace(
                F"gs://{BUCKET_NAME}/{dst_prefix}", '').replace('.', '_').split('_')[3]
            # orig_str holds the original label, map_str the original label mapped to general
            caption = F"Gen: {gen_str}, Orig Mapped: {map_str}, Orig: {orig_str}"
            img = numpy_image_from_bucket(BUCKET_NAME, matching["ID_orig"].iloc[count].
                                          replace(F"gs://{BUCKET_NAME}/", ""), tmp_image, caption)
            images.append(axs[i, j].imshow(img))
            axs[i, j].label_outer()
            axs[i, j].title.set_text(title_str)
            # turn the axes off on this subplot (plt.axis only affects the current axes)
            axs[i, j].axis('off')
    fig.tight_layout()
    plt.show(block=True)
    return

In [0]:

if __name__ == "__main__":
    # ======== Setup ========
    BUCKET_NAME = 'satellite_project_auto_ml'
    DST_PREFIX_AUTOML = "processed_data/google_maps/processed_annotations/all_countries/"
    DST_PREFIX_ORIG = F"{DST_PREFIX_AUTOML}orig/prediction-global_multilabel_" \
        F"20200427112146-2020-04-29T05:13:29.652Z/image_classification_"
    DST_PREFIX_GEN = F"{DST_PREFIX_AUTOML}prediction-global_multilabel_" \
        F"20200427112033-2020-04-29T04:20:42.051Z/image_classification_"
    DST_PREFIX_IM = "processed_data/google_maps/images_png/zoom_17/dist_2-0_km/vis_class/all/"
    OUTPUT_NOMATCHING_IMAGES = F"gs://{BUCKET_NAME}/{DST_PREFIX_AUTOML}non-matching_images.csv"
    OUTPUT_MATCHING_IMAGES = F"gs://{BUCKET_NAME}/{DST_PREFIX_AUTOML}matching_images.csv"

    common.setup_logging()
    common.check_google_credentials()
    gcs_client = gcs.Client(gcs_utils.CLOUD_PROJECT)

    # ====== Download model results =======
    outfile = 'test.jsonl'
    complete_orig = gcs_utils.get_file_prefixes(gcs_client, DST_PREFIX_ORIG, '.jsonl', bucket=BUCKET_NAME)
    complete_gen = gcs_utils.get_file_prefixes(gcs_client, DST_PREFIX_GEN, '.jsonl', bucket=BUCKET_NAME)

    df_orig = append_jsonl(complete_orig).add_suffix('_orig')
    df_gen = append_jsonl(complete_gen).add_suffix('_gen')

    # ====== Map the original labels from original models to general labels =======
    df_orig["display_name_mapped"] = df_orig["display_name_orig"].map(normalize.GENERAL_COOLING_TYPES)

    # ====== Check models are same length and merge =======

    assert len(df_orig.index) == len(df_gen.index)

    result = df_orig.merge(df_gen, left_on="ID_orig", right_on="ID_gen")

    assert len(df_orig.index) == len(result.index)

    # ====== Calculate and output statistics on the models =======

    df_labelled_gen = result[result['display_name_gen'] != 'UNKNOWN']

    df_labelled_orig = result[result['display_name_orig'] != 'UNKNOWN']

    percentage_labelled_gen = len(df_labelled_gen)/len(result)*100.0
    percentage_labelled_orig = len(df_labelled_orig)/len(result)*100.0

    logging.info("The percentage of labelled images for the general model is (default cutoff : 0.5): \n")
    logging.info(F"{percentage_labelled_gen}")
    logging.info("and for the original: \n")
    logging.info(F"{percentage_labelled_orig}")

    df_stats_gen = df_labelled_gen["classification_score_gen"].describe()
    df_stats_orig = df_labelled_orig["classification_score_orig"].describe()

    logging.info(F"The stats on the general label model (excluding not-labelled) {df_stats_gen}")
    logging.info(F"The stats on the original label model (excluding not-labelled) {df_stats_orig}")

    """
    ====== Find which labels match between the models, split into dfs, =======
    ====== plot scores and find non-matching pair frequency, and output to csvs =======
    """
    matching = result[result['display_name_gen'] == result['display_name_mapped']]
    nonmatching = result[result['display_name_gen'] != result['display_name_mapped']]
    logging.info(
        "Two models were made, one with general labels and one with the original Platts labels.")
    logging.info("See GENERAL_COOLING_TYPES for mapping.")
    logging.info("Plotted a histogram of the classification score for both models:")
    logging.info("1. All images:")
    histogram_two_overlayed(result["classification_score_gen"],
                            result["classification_score_orig"], 'General', 'Original', 'Classification Scores')
    logging.info("2. Where the labels between the models match:")
    histogram_two_overlayed(matching["classification_score_gen"],
                            matching["classification_score_orig"], 'General', 'Original',
                            'Classification Scores Matching between Platts and general labels')
    logging.info("3. Where the labels between the models don't match:")
    histogram_two_overlayed(nonmatching["classification_score_gen"],
                            nonmatching["classification_score_orig"], 'General', 'Original',
                            'Classification Non-matching between Platts and General labels.')

    logging.info(
        F"There are {len(nonmatching)} non-matching pairs between the general model "
        F"and Platts original model predictions.")

    unique_df = nonmatching.groupby(['display_name_gen', 'display_name_mapped']
                                    ).size().reset_index().rename(columns={0: "Frequency"})
    logging.info("The non-matching pairs for every affected power plant:")
    logging.info("a) between general labels and original Platts labels (remapped to general) are:")
    logging.info(unique_df)

    unique_df_orig = nonmatching.groupby(['display_name_gen',
                                          'display_name_orig']).size().reset_index().rename(columns={0: "Frequency"})
    logging.info("b) between general labels and original Platts labels are:")
    logging.info(unique_df_orig)

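    col_order = ['set', 'ID_gen', 'display_name_gen', 'display_name_orig']
    # reindex expects a flat list of column names, not a nested list
    matching_for_input = matching.reindex(columns=col_order)
    # copy first so the assignment below does not write to a slice of result
    nonmatching = nonmatching.copy()
    nonmatching['display_name_mapped'] = nonmatching['display_name_mapped'].astype(str) + '_map'
    nonmatching_for_input = nonmatching.reindex(columns=col_order + ['display_name_mapped'])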

    # these files are uploaded as an AutoML dataset so we can easily visually inspect the labels + images
    matching_for_input.to_csv(OUTPUT_MATCHING_IMAGES, index=False, header=False)
    nonmatching_for_input.to_csv(OUTPUT_NOMATCHING_IMAGES,
                                 index=False, header=False)

    # ===== spot check images and their labels ======
    userAnswer = True
    while(userAnswer):
        get_random_pp_images(matching, BUCKET_NAME, DST_PREFIX_IM)
        ans = int(input('Do you want another set of images? 0=>no, 1=>yes: '))
        if ans == 0:
            userAnswer = False
        elif ans == 1:
            userAnswer = True
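
The comparison logic in the main block can be tried on a toy example without access to the bucket. This is a sketch only: `TOY_GENERAL_TYPES` is a made-up stand-in for `normalize.GENERAL_COOLING_TYPES`, whose real contents are not shown here, and the IDs and labels are invented.

```python
import pandas as pd

# Hypothetical mapping; the real one lives in normalize.GENERAL_COOLING_TYPES.
TOY_GENERAL_TYPES = {"mechanical draft": "tower",
                     "natural draft": "tower",
                     "once through": "once_through"}

df_orig = pd.DataFrame({
    "ID_orig": ["a", "b", "c"],
    "display_name_orig": ["mechanical draft", "once through", "natural draft"],
})
df_gen = pd.DataFrame({
    "ID_gen": ["a", "b", "c"],
    "display_name_gen": ["tower", "tower", "tower"],
})

# Map the original labels onto the general label set, then join on image ID.
df_orig["display_name_mapped"] = df_orig["display_name_orig"].map(TOY_GENERAL_TYPES)
result = df_orig.merge(df_gen, left_on="ID_orig", right_on="ID_gen")

# Split into rows where the two models agree and rows where they do not.
agree = result["display_name_gen"] == result["display_name_mapped"]
matching = result[agree]
nonmatching = result[~agree]

# Frequency of each disagreeing (general, mapped) label pair.
pair_counts = (nonmatching.groupby(["display_name_gen", "display_name_mapped"])
               .size().reset_index(name="Frequency"))
```

One consequence worth keeping in mind: any original label missing from the mapping becomes NaN in `display_name_mapped`, and NaN never compares equal, so such rows always land in `nonmatching`.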
