### CEO Labeling Meta-Statistics
**Author:** Benjamin Yeh (by253@cornell.edu / byeh1@umd.edu) <br>
**Description:** This notebook contains:
1. Code to generate dataframe containing meta information from labeler sets 
2. Code to generate statistics from meta dataframe

In [1]:
import numpy as np
import pandas as pd

#### 1. Generate Meta Dataframe 

The steps for generating the meta dataframe are outlined below:
* User defines parameters of project:
    * 1.1 `completed_date` - Date when all plots are labeled for *both* sets 1 and 2.
    * 1.2 `final_date` - Date when all labels *should* be in agreement between sets 1 and 2.
    * 1.3 `IS_AREA_CHANGE` - Indicates whether labeling project is area change (multi-year) or cropmap (single-year).
    * 1.4 `YEAR` - Year of the labeling project observations for a cropmap (single-year) project; an area change project uses `YEAR_1` and `YEAR_2` instead.
* Meta dataframe is generated by the following process:
    * 2.1 Labels at the completed date are loaded into a dataframe for each of sets 1 and 2, and disagreeing points are found by comparing the labels of the two sets.
    * 2.2 A dataframe of the labels at the final date for sets 1 and 2 is made, and the final labels *at* the disagreeing points found in the above step are extracted.
    * 2.3 A dataframe is made from the disagreeing points, their initial labels from set 1 and 2, and the final labels.
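
As a rough illustration of steps 2.1 to 2.3, the sketch below builds a small disagreement table from toy data. The frames `completed_1`, `completed_2`, and `final`, along with all of their values, are made up for illustration, and the sketch assumes rows are aligned by position across the sets.

```python
import pandas as pd

# Toy labels at the completed date (hypothetical values, rows aligned by position)
completed_1 = pd.DataFrame({"plotid": [1, 2, 3], "label": ["Stable P", "P gain", "Stable NP"]})
completed_2 = pd.DataFrame({"plotid": [1, 2, 3], "label": ["Stable P", "Stable NP", "Stable P"]})
# Toy labels at the final date, after disagreements were resolved
final = pd.DataFrame({"plotid": [1, 2, 3], "label": ["Stable P", "Stable NP", "Stable NP"]})

# 2.1 Boolean mask of points where the two sets disagree
disagree = completed_1["label"] != completed_2["label"]

# 2.2 + 2.3 Collect the initial labels and the final label at those points
meta = pd.DataFrame({
    "plotid": completed_1.loc[disagree, "plotid"],
    "set_1_label": completed_1.loc[disagree, "label"],
    "set_2_label": completed_2.loc[disagree, "label"],
    "final_label": final.loc[disagree, "label"],
}).reset_index(drop=True)

print(meta)
```

Only plots 2 and 3 survive the mask, since plot 1 was labeled identically by both sets.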

In [2]:
# Dates
completed_date = "01-10"
final_date = "01-17"

# Indicate below whether labeling project is area change (multi-year) or cropmap (single-year)
IS_AREA_CHANGE = True

# If area change project, indicate each year of observations
if IS_AREA_CHANGE:
    YEAR_1 = "2020"
    YEAR_2 = "2021"
# If cropmap project, indicate single year of observations
else:
    YEAR = ""

# Helper function for reading path location of label CSVs 
#   -> This will need to be modified to resemble user's directory
path = lambda s, d: f"data/ceo-Tigray-2020-2021-Change-({s})-sample-data-2022-{d}.csv"

In [3]:
# Function for loading individual labeling CSVs
def load_dataframes(completed_date : str, final_date : str):
    # Load dataframe for set 1 and 2 @ date where labels are both "completed"
    complete_dataframe_set_1 = pd.read_csv(path("set-1", completed_date))
    complete_dataframe_set_2 = pd.read_csv(path("set-2", completed_date))

    # Load dataframe for set 1 and 2 @ date where set 1 and 2 *should* be in "agreement"
    final_dataframe_set_1 = pd.read_csv(path("set-1", final_date))
    final_dataframe_set_2 = pd.read_csv(path("set-2", final_date))

    return complete_dataframe_set_1, complete_dataframe_set_2, final_dataframe_set_1, final_dataframe_set_2

# Function for computing area change 
def compute_area_change(label_1 : str, label_2 : str) -> str:
    switch = {
        ("Planted", "Planted") : "Stable P",
        ("Not planted", "Not planted") : "Stable NP",
        ("Planted", "Not planted") : "P loss",
        ("Not planted", "Planted") : "P gain",
    }

    return switch[label_1, label_2]
        
# Function for computing disagreements
def compute_disagreements(df1 : pd.DataFrame, df2 : pd.DataFrame):
    if IS_AREA_CHANGE:
        disagreements = (df1["area_change"] != df2["area_change"])
    else:
        disagreements = (df1["crop_noncrop"] != df2["crop_noncrop"])
        
    return disagreements

# Function for computing confused points
#   -> Points where labelers initially agreed @ completed date, but the final
#      agreed label differs from that initial label
def compute_confusions(completed_agreements : pd.Series, fdf : pd.DataFrame):
    raise NotImplementedError

# Aux function for creating meta dataframe
def create_meta_dataframe_aux(
        cdf1 : pd.DataFrame, 
        cdf2 : pd.DataFrame, 
        fdf : pd.DataFrame, 
        disagreements : pd.Series
        ):
    
    # Extract longitude and latitude from final dataframe
    #   -> There may be *slight* variation in `lon` and `lat` across the three dataframes;
    #      but otherwise plot/sampleid/lon/lat refer to same locations
    lon, lat = fdf.loc[disagreements, "lon"].values, fdf.loc[disagreements, "lat"].values
    
    # Extract columns to subset and define helper funcs
    columns = ["plotid", "sampleid", "email", "analysis_duration"] 
    if IS_AREA_CHANGE:
        columns.append("area_change")
        # Helper function for renaming columns by set
        rename_fn = lambda s : {
            "area_change" : f"{s}_label",
            "email" : f"{s}_email",
            "analysis_duration" : f"{s}_analysis_duration"
        }
    else:
        columns.append("crop_noncrop")
        rename_fn = lambda s : {
            "crop_noncrop" : f"{s}_label",
            "email" : f"{s}_email",
            "analysis_duration" : f"{s}_analysis_duration"
        }

    # Subset and rename by set
    cdf1 = cdf1.loc[disagreements, columns].rename(columns = rename_fn("set_1"))
    cdf2 = cdf2.loc[disagreements, columns].rename(columns = rename_fn("set_2"))
    fdf = fdf.loc[disagreements, columns].rename(columns = rename_fn("final")).drop(columns = ['final_email', 'final_analysis_duration'])

    # Assemble dataframe
    meta_dataframe = cdf1.merge(
        cdf2, left_on = ["plotid","sampleid"], right_on = ["plotid","sampleid"]
        ).merge(
        fdf, left_on = ["plotid","sampleid"], right_on = ["plotid","sampleid"]
        )
    
    # Insert lon and lat
    meta_dataframe["lon"], meta_dataframe["lat"] = lon, lat

    # Create "meta-feature" columns 
    #   -> (1) Label overridden
    #   -> (2) LabelER overridden
    #   -> (3) Correct/incorrect analysis duration

    # Convert analysis duration to float
    #   -> Keeps the leading number of each duration string
    duration_columns = ["set_1_analysis_duration", "set_2_analysis_duration"]
    meta_dataframe[duration_columns] = meta_dataframe[duration_columns].apply(
        lambda column : column.str.split(" ").str[0].astype(float)
        )

    # (1) 
    compute_incorrect_label = lambda l1, l2, f : l2 if l1 == f else l1 if l2 == f else "Both"
    meta_dataframe["overridden_label"] = meta_dataframe.apply(
        lambda df : compute_incorrect_label(df["set_1_label"], df["set_2_label"], df["final_label"]),
        axis = 1
        )
    
    # (2)
    compute_incorrect_email = lambda e1, e2, l1, l2, f : e2 if l1 == f else e1 if l2 == f else "Both" 
    meta_dataframe["overridden_email"] = meta_dataframe.apply(
        lambda df : compute_incorrect_email(df["set_1_email"], df["set_2_email"], df["set_1_label"], df["set_2_label"], df["final_label"]),
        axis = 1
        )
    
    # (3)
    compute_incorrect_analysis = lambda t1, t2, l1, l2, f: t2 if l1 == f else t1 if l2 == f else 'Both'
    compute_correct_analysis = lambda t1, t2, l1, l2, f: t1 if l1 == f else t2 if l2 == f else 'None'
    meta_dataframe["overridden_analysis"] = meta_dataframe.apply(
        lambda df : compute_incorrect_analysis(df["set_1_analysis_duration"], df["set_2_analysis_duration"], df["set_1_label"], df["set_2_label"], df["final_label"]),
        axis = 1
    )
    meta_dataframe["nonoverridden_analysis"] = meta_dataframe.apply(
        lambda df : compute_correct_analysis(df["set_1_analysis_duration"], df["set_2_analysis_duration"], df["set_1_label"], df["set_2_label"], df["final_label"]),
        axis = 1
    )

    # Rearrange columns
    rcolumns = [
        "plotid", "sampleid", "lon", "lat", "set_1_email", "set_2_email", "overridden_email", 
        "set_1_analysis_duration", "set_2_analysis_duration", "overridden_analysis", "nonoverridden_analysis", 
        "set_1_label", "set_2_label", "final_label", "overridden_label"
    ]
    meta_dataframe = meta_dataframe[rcolumns]

    return meta_dataframe

# Function for creating meta dataframe
def create_meta_dataframe(completed_date : str, final_date : str):

    # (1) Load labeling CSVs to dataframes
    cdf1, cdf2, fdf1, fdf2 = load_dataframes(completed_date, final_date)

    # (2) If labeling project is area change, compute area change
    if IS_AREA_CHANGE:
        for df in [cdf1, cdf2, fdf1, fdf2]:
            df["area_change"] = df.apply(
                lambda df : compute_area_change(df[f"Was this a planted crop in {YEAR_1}?"], df[f"Was this a planted crop in {YEAR_2}?"]),
                axis = 1
                )
    # (2.5) If cropmap, just rename crop column
    else:
        for df in [cdf1, cdf2, fdf1, fdf2]:
            # TODO: Find what the "native" column name is for cropmap project
            raise NotImplementedError("Native column name for cropmap is unknown.")
    
    # (3) Compute disagreements for "completed" and "final" dataframes
    cdisagreements = compute_disagreements(cdf1, cdf2)
    fdisagreements = compute_disagreements(fdf1, fdf2)
    # Disagreements between set 1 and 2 @ completed date
    print(f"Disagreements Between Set 1 and 2 (Completed): {cdisagreements.sum()}")
    # Disagreements between set 1 and 2 @ final date
    #   -> Sanity check - should be none!
    print(f"Disagreements Between Set 1 and 2 (Final): {fdisagreements.sum()}")
    assert (fdisagreements.sum() == 0), "There should be no disagreements by final labeling date between sets 1 and 2."

    # (4) Create dataframe from *just* disagreement points:
    #     -> plotid/sampleid/lon/lat
    #     -> Emails of labeler 1, labeler 2, and the overridden labeler
    #     -> Set 1, set 2, overridden, and nonoverridden analysis durations
    #     -> Set 1, set 2, final, and overridden labels

    meta_dataframe = create_meta_dataframe_aux(cdf1, cdf2, fdf1, cdisagreements)
    
    return meta_dataframe
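
The `compute_confusions` stub above is left unimplemented. One possible sketch is shown below, assuming a "confused" point is one where both sets agreed at the completed date but the final label differs from that agreed label. The helper name `find_confusions`, its signature, and the toy data are hypothetical and differ from the stub's signature.

```python
import pandas as pd

def find_confusions(cdf1: pd.DataFrame, cdf2: pd.DataFrame,
                    fdf: pd.DataFrame, label_col: str) -> pd.Series:
    """Hypothetical helper: flag points where both sets agreed at the
    completed date but the final label differs from that agreed label."""
    agreed = cdf1[label_col] == cdf2[label_col]
    changed = fdf[label_col] != cdf1[label_col]
    return agreed & changed

# Toy data (made-up values, rows aligned by position)
cdf1 = pd.DataFrame({"area_change": ["Stable P", "P gain", "Stable NP"]})
cdf2 = pd.DataFrame({"area_change": ["Stable P", "Stable NP", "Stable NP"]})
fdf = pd.DataFrame({"area_change": ["Stable NP", "Stable NP", "Stable NP"]})

confusions = find_confusions(cdf1, cdf2, fdf, "area_change")
print(confusions.tolist())
```

Only the first toy point is flagged: the second was a disagreement to begin with, and the third kept its agreed label.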

In [4]:
# Generate and load dataframe 
meta_dataframe = create_meta_dataframe(completed_date, final_date)
meta_dataframe.head()

Disagreements Between Set 1 and 2 (Completed): 49
Disagreements Between Set 1 and 2 (Final): 0


| | plotid | sampleid | lon | lat | set_1_email | set_2_email | overridden_email | set_1_analysis_duration | set_2_analysis_duration | overridden_analysis | nonoverridden_analysis | set_1_label | set_2_label | final_label | overridden_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 163 | 163 | 37.120252 | 13.520786 | jwagner@unistra.fr | bbarker1@umd.edu | Both | 124.0 | 105.2 | Both | | Stable P | P gain | Stable NP | Both |
| 1 | 252 | 252 | 39.154225 | 14.230454 | hkerner@umd.edu | ckuei@terpmail.umd.edu | Both | 43.7 | 949.7 | Both | | P gain | Stable P | Stable NP | Both |
| 2 | 296 | 296 | 38.953575 | 14.07516 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | hkerner@umd.edu | 172.2 | 187.8 | 172.2 | 187.8 | Stable P | Stable NP | Stable NP | Stable P |
| 3 | 299 | 299 | 39.335162 | 13.653124 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | hkerner@umd.edu | 108.4 | 601.7 | 108.4 | 601.7 | P gain | Stable NP | Stable NP | P gain |
| 4 | 300 | 300 | 36.72535 | 13.779008 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | engineer.arnoldmuhairwe@gmail.com | 49.6 | 584.5 | 584.5 | 49.6 | Stable P | Stable NP | Stable P | Stable NP |
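
Reading the rows above: a set's label counts as overridden when the other set's label matches the final label, and both count as overridden when neither matches. A minimal standalone restatement of that rule follows; the function name `overridden` is chosen here for illustration, and the inputs are taken from rows 2, 4, and 0 of the output above.

```python
def overridden(label_1: str, label_2: str, final: str) -> str:
    """Return the label that was overridden, or "Both" if neither set
    matched the final label. Assumes label_1 != label_2, as is the case
    for disagreement points."""
    if label_1 == final:
        return label_2
    if label_2 == final:
        return label_1
    return "Both"

# Row 2: set 2 matched the final label, so set 1's label was overridden
print(overridden("Stable P", "Stable NP", "Stable NP"))
# Row 4: set 1 matched the final label, so set 2's label was overridden
print(overridden("Stable P", "Stable NP", "Stable P"))
# Row 0: neither set matched the final label
print(overridden("Stable P", "P gain", "Stable NP"))
```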


#### 2. Meta Analysis
**Questions:**
* 1 Distribution of overridden points
    * 1.1 What is the distribution of incorrect labels?
    * 1.2 What is the distribution of mistaken labels?
    * 1.3 What is the exact distribution of label-label changes? 
* 2 Distribution of labelers overridden
    * 2.1 What is the frequency of labelers overridden?
* 3 Analysis duration 
    * 3.1 What is the difference in analysis duration for labels overridden?
    * 3.2 Which overridden labels have the highest analysis duration? 

**2.1.1** What is the distribution of incorrect labels?

In [5]:
# (1a) Distribution of overridden labels

def label_overrides(df : pd.DataFrame):
    # Subset 
    sdf = df[df["overridden_label"] != "Both"]

    # Counts of each label overridden
    counts = sdf["overridden_label"].value_counts()

    # Increment with instances of both
    #   -> .get() avoids a KeyError when a label only appears in "Both" rows
    bdf = df[df["overridden_label"] == "Both"]
    for label_1, label_2 in zip(bdf["set_1_label"], bdf["set_2_label"]):
        counts[label_1] = counts.get(label_1, 0) + 1
        counts[label_2] = counts.get(label_2, 0) + 1
    counts = counts.sort_index()

    # Print 
    print("{:^25}\n{}".format("Incorrect Labels", "-"*25))
    for label, count in zip(counts.index, counts.values):
        print("{:^17}: {:>2}".format(label, count))

In [6]:
# Read table as: "Number of times initial {label} was incorrect"
label_overrides(meta_dataframe)

    Incorrect Labels     
-------------------------
     P gain      :  9
     P loss      :  5
    Stable NP    : 11
    Stable P     : 30


**2.1.2** What is the distribution of mistaken labels?

In [7]:
# (1b) Distribution of mistaken labels

def label_mistakes(df : pd.DataFrame):
    # Counts of mistaken label
    counts = df["final_label"].value_counts().sort_index()
    
    # Print
    print("{:^25}\n{}".format("Mistaken Labels", "-"*25))
    for label, count in zip(counts.index, counts.values):
        print("{:^17}: {:>2}".format(label, count))

In [8]:
# Read table as: "Number of times final {label} mistaken for something else"
label_mistakes(meta_dataframe)

     Mistaken Labels     
-------------------------
     P gain      :  4
     P loss      :  4
    Stable NP    : 33
    Stable P     :  8


**2.1.3** What is the exact distribution of label-label changes? 

In [9]:
# (1c) Distribution of exact label-label changes

def label_transitions(df : pd.DataFrame):
    # Subset
    sdf = df[df["overridden_label"] != "Both"]

    # Counts of each label-label transition
    transitions = pd.Series(list(zip(sdf["overridden_label"], sdf["final_label"]))).value_counts().sort_index()

    # Increment transitions with instances from both incidents
    #   -> TODO: Add robustness if none; 
    bdf = df[df["overridden_label"] == "Both"]
    if bdf.shape[0] != 0:
        for set_label in ["set_1_label", "set_2_label"]:
            temp_transitions = pd.Series(list(zip(bdf[set_label], bdf["final_label"]))).value_counts().sort_index()
            transitions = transitions.add(temp_transitions, fill_value = 0)
        transitions = transitions.astype(int)

    # Print 
    print("{:^43}\n{}".format("Label-Label Transitions", "-"*42))
    for (initial, final), count in zip(transitions.index, transitions.values):
        print("{:^15} -> {:^15} : {:^3}".format(initial, final, count))

In [10]:
# Read table as: "Number of times initially labeled as {left label} by one or both sets, and final agreement was {right label}"
label_transitions(meta_dataframe)

          Label-Label Transitions          
------------------------------------------
    P gain      ->    Stable NP    :  7 
    P gain      ->    Stable P     :  2 
    P loss      ->    Stable NP    :  4 
    P loss      ->    Stable P     :  1 
   Stable NP    ->     P gain      :  4 
   Stable NP    ->     P loss      :  2 
   Stable NP    ->    Stable P     :  5 
   Stable P     ->     P gain      :  3 
   Stable P     ->     P loss      :  3 
   Stable P     ->    Stable NP    : 24 
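
The same kind of transition counts can be viewed as a matrix. As a sketch, `pd.crosstab` tabulates initial labels against final labels; the rows below are made-up single-override cases, and the handling of "Both" rows is omitted for brevity.

```python
import pandas as pd

# Toy single-override rows (hypothetical values)
df = pd.DataFrame({
    "overridden_label": ["Stable P", "Stable P", "P gain", "Stable NP"],
    "final_label": ["Stable NP", "Stable NP", "Stable NP", "Stable P"],
})

# Rows: initially assigned (overridden) label; columns: final agreed label
matrix = pd.crosstab(df["overridden_label"], df["final_label"])
print(matrix)
```

Transitions that never occur show up as zeros, which makes the dominant confusions easier to spot than in a flat list.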


**2.2.1** What is the frequency of labelers overridden?

In [11]:
# (2a) Number of times labeler overridden

def labeler_overrides(df : pd.DataFrame):
    # Counts of each labeler overridden
    counts = df["overridden_email"].value_counts().sort_values(ascending = False)

    # Print
    print("{:^43}\n{}".format("Frequency of Labeler Overridden", "-"*42))
    for labeler, count in zip(counts.index, counts.values):
        print(" {:<34} : {:>3}".format(labeler, count))

In [12]:
labeler_overrides(meta_dataframe)

      Frequency of Labeler Overridden      
------------------------------------------
 logdaye@gmail.com                  :  19
 engineer.arnoldmuhairwe@gmail.com  :   9
 Both                               :   6
 ckuei@terpmail.umd.edu             :   5
 hkerner@umd.edu                    :   4
 jwagner@unistra.fr                 :   3
 cnakalem@umd.edu                   :   2
 taryndev@umd.edu                   :   1


**2.3.1** What is the difference in analysis duration for labels overridden?

In [13]:
# (3a) What is the difference in analysis duration for labels overridden?

def median_duration(df : pd.DataFrame):
    # Subset 
    sdf = df[df["overridden_label"] != "Both"]

    # Subset overridden and nonoverridden analysis times
    overridden = sdf["overridden_analysis"].astype(np.float64)
    nonoverridden = sdf["nonoverridden_analysis"].astype(np.float64)

    # Append overridden analysis time with durations from both incidents
    #   -> TODO: Add robustness if none; 
    bdf = df[df["overridden_label"] == "Both"]
    if bdf.shape[0] != 0:
        overridden = pd.concat([
            overridden,
            pd.Series(bdf[["set_1_analysis_duration", "set_2_analysis_duration"]].astype(np.float64).values.flatten())
        ])

    # Print median duration times
    print("{:^37}\n{}".format("Median Analysis Duration", "-"*35))
    print(
        "Overridden Points     : {:.2f} secs \nNon-Overridden Points : {:.2f} secs"
        .format(overridden.median(), nonoverridden.median())
    )

In [14]:
# Read table as: "Median analysis time among disagreed points"
median_duration(meta_dataframe)

      Median Analysis Duration       
-----------------------------------
Overridden Points     : 131.30 secs 
Non-Overridden Points : 159.10 secs


**2.3.2** Which overridden labels have the highest analysis duration?

* Points overridden after a short analysis time are most likely obvious mistakes, whereas points overridden after a longer analysis duration more likely indicate an ambiguous point.

* Identifying ambiguous points may be important for:
    * (1) Downstream analysis involving alternate area change estimation
    * (2) Deriving a systematic disagreement resolution procedure for difficult points that are *currently* skipped in the model training pipeline
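
One way to act on this is to flag disagreement points whose analysis duration, in either set, falls at or above a chosen quantile of the pooled durations. The sketch below uses made-up durations and a hypothetical helper `flag_long_durations`; the 0.75 cutoff is arbitrary.

```python
import numpy as np
import pandas as pd

def flag_long_durations(df: pd.DataFrame, q: float) -> pd.DataFrame:
    """Hypothetical helper: return rows where either set's analysis
    duration is at or above the q-th quantile of the pooled durations."""
    cols = ["set_1_analysis_duration", "set_2_analysis_duration"]
    # Pool durations from both sets before taking the quantile
    threshold = np.quantile(df[cols].to_numpy().ravel(), q)
    return df[(df[cols] >= threshold).any(axis=1)]

# Toy durations in seconds (made-up values)
df = pd.DataFrame({
    "plotid": [1, 2, 3, 4],
    "set_1_analysis_duration": [40.0, 120.0, 90.0, 60.0],
    "set_2_analysis_duration": [55.0, 700.0, 80.0, 650.0],
})

flagged = flag_long_durations(df, 0.75)
print(flagged["plotid"].tolist())
```

Returning the subset as a dataframe, rather than printing counts, keeps the flagged points available for downstream steps.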

In [15]:
def highest_duration(df : pd.DataFrame, q : float):
    # (2) Combine durations across both sets
    durations = df[["set_1_analysis_duration", "set_2_analysis_duration"]].values.flatten()
    
    # (3) Find qth quantile of analysis durations
    quantile = np.quantile(durations, q) 

    # (4) Subset df where analysis durations higher than q 
    #       -> In either set 1 or set 2
    sdf = df[(df["set_1_analysis_duration"] >= quantile) | (df["set_2_analysis_duration"] >= quantile)]
    
    # (5) Print number of points with analysis duration higher than quantile
    print("{:^53}\n{}".format("Highest Analysis Durations", "-"*52))
    print(
        "{:.2f} Quantile of Analysis Durations : {:.2f} secs \nAnalysis Time Greater than {:.2f} Quantile : {} points"
        .format(q, quantile, q, sdf.shape[0])
    )
    
    # (6) Label-label transitions from points with analysis duration higher than quantile
    tdf = sdf[sdf["overridden_label"] != "Both"]
    transitions = pd.Series(list(zip(tdf["overridden_label"], tdf["final_label"]))).value_counts().sort_index()

    # (7) Increment transitions count with instances from both incidents
    #   -> TODO: Add robustness if none; 
    bdf = sdf[sdf["overridden_label"] == "Both"]
    if bdf.shape[0] != 0:
        for set_label in ["set_1_label", "set_2_label"]:
            temp_transitions = pd.Series(list(zip(bdf[set_label], bdf["final_label"]))).value_counts().sort_index()
            transitions = transitions.add(temp_transitions, fill_value = 0)
        transitions = transitions.astype(int)

    # Print label-label transitions
    print("\n{:^53}\n{}".format("Label-Label Transitions", "-"*52))
    for (initial, final), count in zip(transitions.index, transitions.values):
        print("{:^25} -> {:^15} : {:^3}".format(initial, final, count))

In [16]:
# Read table as: "Among q-th quantile of analysis times for disagreed points"
# Note: transition table follows same logic as above, where 'count' denotes occurrence of 
#       {left label} by either one or both sets. Hence, total count may exceed no. points!
highest_duration(meta_dataframe, 0.85)

             Highest Analysis Durations              
----------------------------------------------------
0.85 Quantile of Analysis Durations : 592.24 secs 
Analysis Time Greater than 0.85 Quantile : 15 points

               Label-Label Transitions               
----------------------------------------------------
         P gain           ->    Stable NP    :  4 
         P gain           ->    Stable P     :  1 
        Stable NP         ->     P gain      :  1 
        Stable NP         ->    Stable P     :  2 
        Stable P          ->     P gain      :  1 
        Stable P          ->     P loss      :  2 
        Stable P          ->    Stable NP    :  6 
