### CEO Meta-Analysis - Crop Land Area Estimation
**Author:** Benjamin Yeh (by253@cornell.edu / byeh1@umd.edu) <br>
**Description:** This notebook contains:
1. Code to generate dataframe containing meta information from labeler sets 
2. Code to generate statistics from meta dataframe

In [1]:
import numpy as np
import pandas as pd
from meta_utils import create_meta_dataframe

#### 1. Generate Meta Dataframe 

In [2]:
# Modify the helper function below so that it returns the path to each label csv file
def path_fn(set_id : str, date : str) -> str:
    """ Returns string path to csv label file.

    Gives the path + file name to the csv label file by labeler set `set_id` at the timestamp `date`. For CEO
    labeling projects, the files are named identically except for labeler set and timestamp date. 
    
    Example : how to generalize the file name
    -> File for set 1 :
        ceo-Tigray-2020-2021-Change-(set-1)-sample-data-2022-01-10.csv
    -> File for set 2 : 
        ceo-Tigray-2020-2021-Change-(set-2)-sample-data-2022-01-17.csv
    -> Generalized file name:
        ceo-Tigray-2020-2021-Change-({set_id})-sample-data-2022-{date}.csv

    Args
        set_id : str
          String indicating the label set as it appears on the labeling csv file - e.g., 'set-1', or 'set-2'.
        date : str
          String indicating the date as it appears on the labeling csv file.
    Returns
        path : str
          String indicating path to csv label file for `set_id` at `date`. 
    
    """
    
    # TODO: Block-begin 
    path = f"data/ceo-Tigray-2020-2021-Change-({set_id})-sample-data-2022-{date}.csv"
    # TODO: Block-end
    return path

# Indicate here the dates (MM-DD) as they appear in the label csv file names
cdate = "01-10"
fdate = "01-17"

# Indicate here whether labeling project is area change
area_change = True
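
As a quick sanity check, the template can be compared against the two file names quoted in the docstring. The sketch below re-declares the helper under the hypothetical name `example_path_fn` so that it runs on its own; in the notebook you would call `path_fn` directly. Pairing `set-1` with `01-10` and `set-2` with `01-17` follows the docstring example and is not necessarily how `create_meta_dataframe` combines them.

```python
def example_path_fn(set_id: str, date: str) -> str:
    # Same template as `path_fn` above, copied so this cell is self-contained.
    return f"data/ceo-Tigray-2020-2021-Change-({set_id})-sample-data-2022-{date}.csv"

# (set_id, date) pairs taken from the docstring example.
docstring_examples = [("set-1", "01-10"), ("set-2", "01-17")]
generated_paths = [example_path_fn(set_id, date) for set_id, date in docstring_examples]

# Print the paths so they can be compared by eye with the files on disk.
for path in generated_paths:
    print(path)
```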

In [3]:
# Create meta dataframe
if area_change:
    y1, y2 = input("Year 1 of observations : "), input("Year 2 of observations : ")
    meta_dataframe = create_meta_dataframe(path_fn, cdate, fdate, area_change, y1, y2)
else:
    meta_dataframe = create_meta_dataframe(path_fn, cdate, fdate)

meta_dataframe.head()

               Loading dataframes from file...               
-----------------------------------------------------------
Native dataframe shapes : (600, 14) , (600, 14) , (600, 14)
Loading and checking dataframes complete!

                 Computing disagreements...                  
-----------------------------------------------------------
Disagreements between labeler sets 1 and 2 : 49

                 Creating meta dataframe...                  


| | plotid | sampleid | lon | lat | set_1_email | set_2_email | overridden_email | set_1_analysis_duration | set_2_analysis_duration | overridden_analysis | nonoverridden_analysis | set_1_label | set_2_label | final_label | overridden_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 163 | 163 | 37.120252 | 13.520786 | jwagner@unistra.fr | bbarker1@umd.edu | Both | 124.0 | 105.2 | Both | | Stable P | P gain | Stable NP | Both |
| 1 | 252 | 252 | 39.154225 | 14.230454 | hkerner@umd.edu | ckuei@terpmail.umd.edu | Both | 43.7 | 949.7 | Both | | P gain | Stable P | Stable NP | Both |
| 2 | 296 | 296 | 38.953575 | 14.07516 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | hkerner@umd.edu | 172.2 | 187.8 | 172.2 | 187.8 | Stable P | Stable NP | Stable NP | Stable P |
| 3 | 299 | 299 | 39.335162 | 13.653124 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | hkerner@umd.edu | 108.4 | 601.7 | 108.4 | 601.7 | P gain | Stable NP | Stable NP | P gain |
| 4 | 300 | 300 | 36.72535 | 13.779008 | hkerner@umd.edu | engineer.arnoldmuhairwe@gmail.com | engineer.arnoldmuhairwe@gmail.com | 49.6 | 584.5 | 584.5 | 49.6 | Stable P | Stable NP | Stable P | Stable NP |


#### 2. Meta Analysis

**Questions:**
* 1 Distribution of disagreement points
    * 1.1 What is the distribution of overridden labels?
    * 1.2 What is the distribution of consensus labels?
    * 1.3 What is the distribution of disagreements?
    * 1.4 What is the distribution of label changes? 
* 2 Distribution of labelers overridden
    * 2.1 What is the frequency of labelers overridden?
* 3 Analysis duration 
    * 3.1 What is the difference in analysis duration for labels overridden?
    * 3.2 Which overridden labels have the highest analysis duration? 

In [4]:
from meta_utils import (
    label_overrides, label_mistakes, label_disagreements, label_transitions, 
    labeler_overrides, median_duration, highest_duration
)

**2.1.1** What is the distribution of overridden (incorrect) labels?

In [5]:
# Read table as: "Number of times label overridden"
label_overrides(meta_dataframe)

    Incorrect Labels     
-------------------------
     P gain      :  9
     P loss      :  5
    Stable NP    : 11
    Stable P     : 30


**2.1.2** What is the distribution of consensus labels that were mistaken for another label?

In [6]:
# Read table as: "Number of times consensus label 'mistaken' for a different label"
label_mistakes(meta_dataframe)

     Mistaken Labels     
-------------------------
     P gain      :  4
     P loss      :  4
    Stable NP    : 33
    Stable P     :  8


**2.1.3** What is the distribution of disagreements?

In [7]:
# Read table as: "Number of disagreements between {label 1} and {label 2}"
# Note: This is a count of *distinct* label pair disagreements

label_disagreements(meta_dataframe)

       Distribution of Disagreements       
------------------------------------------
    P gain      x    Stable NP    :  6 
    P gain      x    Stable P     :  4 
    P loss      x    Stable NP    :  5 
    P loss      x    Stable P     :  3 
   Stable NP    x    Stable P     : 31 
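
To make "distinct label pair" concrete, here is a minimal sketch of one way such a count could be produced. It is not the implementation of `label_disagreements`; it uses only the five rows printed by `meta_dataframe.head()` above, and the variable names are illustrative.

```python
from collections import Counter

# (set_1_label, set_2_label) for the five rows shown by `meta_dataframe.head()`.
label_pairs = [
    ("Stable P", "P gain"),
    ("P gain", "Stable P"),
    ("Stable P", "Stable NP"),
    ("P gain", "Stable NP"),
    ("Stable P", "Stable NP"),
]

# Sorting each pair makes (A, B) and (B, A) count as the same disagreement.
pair_counts = Counter(tuple(sorted(pair)) for pair in label_pairs)

for (label_a, label_b), count in sorted(pair_counts.items()):
    print(f"{label_a:^13} x {label_b:^13} : {count:2d}")
```

Because the pairs are unordered, each disagreed point contributes exactly one count, which is why the table above sums to the 49 disagreements reported when the dataframes were loaded.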


**2.1.4** What is the distribution of label $\rightarrow$ label changes? 

In [8]:
# Read table as: "Number of times initially labeled as {left hand side} by one or both sets, and final agreement was {right hand side}"
# Question: Is there more disagreement among crop or non-crop points?

label_transitions(meta_dataframe)

          Label-Label Transitions          
------------------------------------------
    P gain      ->    Stable NP    :  7 
    P gain      ->    Stable P     :  2 
    P loss      ->    Stable NP    :  4 
    P loss      ->    Stable P     :  1 
   Stable NP    ->     P gain      :  4 
   Stable NP    ->     P loss      :  2 
   Stable NP    ->    Stable P     :  5 
   Stable P     ->     P gain      :  3 
   Stable P     ->     P loss      :  3 
   Stable P     ->    Stable NP    : 24 
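
The reading rule in the comment above can be sketched as follows. This is an assumption about the counting logic, not the code of `label_transitions`: every initial label that differs from the final label is counted as one transition, so a point where both sets were overridden contributes two. The rows are the five printed by `meta_dataframe.head()`.

```python
from collections import Counter

# (set_1_label, set_2_label, final_label) for the five rows shown by head().
rows = [
    ("Stable P", "P gain", "Stable NP"),
    ("P gain", "Stable P", "Stable NP"),
    ("Stable P", "Stable NP", "Stable NP"),
    ("P gain", "Stable NP", "Stable NP"),
    ("Stable P", "Stable NP", "Stable P"),
]

transitions = Counter()
for set_1_label, set_2_label, final_label in rows:
    # Each initial label that differs from the final label was overridden;
    # when both sets were overridden, the point contributes two transitions.
    for initial_label in (set_1_label, set_2_label):
        if initial_label != final_label:
            transitions[(initial_label, final_label)] += 1

for (initial_label, final_label), count in sorted(transitions.items()):
    print(f"{initial_label:^13} -> {final_label:^13} : {count:2d}")
```

On these five points the sketch yields seven transitions, which matches the full table: its counts sum to 55, that is, the 49 disagreed points plus the 6 points where both sets were overridden.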


**2.2.1** What is the frequency of labelers overridden?

In [9]:
labeler_overrides(meta_dataframe)

      Frequency of Labeler Overridden      
------------------------------------------
 logdaye@gmail.com                  :  19
 engineer.arnoldmuhairwe@gmail.com  :   9
 Both                               :   6
 ckuei@terpmail.umd.edu             :   5
 hkerner@umd.edu                    :   4
 jwagner@unistra.fr                 :   3
 cnakalem@umd.edu                   :   2
 taryndev@umd.edu                   :   1


**2.3.1** What is the difference in analysis duration for labels overridden?

In [10]:
# Read table as: "Median analysis time among disagreed points"
median_duration(meta_dataframe)

      Median Analysis Duration       
-----------------------------------
Overridden Points     : 131.30 secs 
Non-Overridden Points : 159.10 secs
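
A minimal sketch of the comparison, assuming that points where both sets were overridden are left out because they have no non-overridden duration. How `median_duration` actually treats those points is not shown in this notebook. The values are the three single-override rows from `meta_dataframe.head()`.

```python
from statistics import median

# (overridden_analysis, nonoverridden_analysis) in seconds for the head() rows
# in which exactly one labeler set was overridden.
single_override_durations = [
    (172.2, 187.8),
    (108.4, 601.7),
    (584.5, 49.6),
]

overridden_median = median(o for o, _ in single_override_durations)
nonoverridden_median = median(n for _, n in single_override_durations)

print(f"Overridden Points     : {overridden_median:.2f} secs")
print(f"Non-Overridden Points : {nonoverridden_median:.2f} secs")
```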


**2.3.2** Which overridden labels have the highest analysis duration?

Overridden points with a short analysis time are most likely obvious mistakes, whereas points overridden after a longer analysis duration more likely indicate an ambiguous point.

In [11]:
# Read table as: "Statistics for disagreed points whose analysis time exceeds the q-th quantile"
highest_duration(meta_dataframe, 0.85)

             Highest Analysis Durations              
----------------------------------------------------
0.85 Quantile of Analysis Durations : 592.24 secs 
Analysis Time Greater than 0.85 Quantile : 15 points

               Label-Label Transitions               
----------------------------------------------------
         P gain           ->    Stable NP    :  4 
         P gain           ->    Stable P     :  1 
        Stable NP         ->     P gain      :  1 
        Stable NP         ->    Stable P     :  2 
        Stable P          ->     P gain      :  1 
        Stable P          ->     P loss      :  2 
        Stable P          ->    Stable NP    :  6 
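
The quantile cut can be sketched as below. Two assumptions are made for illustration: the durations of both labeler sets are pooled, and the quantile uses linear interpolation (the default of pandas `Series.quantile`). The helper `linear_quantile` is hypothetical and the durations are those of the five `meta_dataframe.head()` rows, so the threshold differs from the 592.24 secs reported for the full dataframe.

```python
import math

def linear_quantile(values, q):
    """q-th quantile with linear interpolation between neighbouring ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# set_1_analysis_duration and set_2_analysis_duration (secs) from the head() rows.
set_1_durations = [124.0, 43.7, 172.2, 108.4, 49.6]
set_2_durations = [105.2, 949.7, 187.8, 601.7, 584.5]
durations = set_1_durations + set_2_durations

threshold = linear_quantile(durations, 0.85)
slow_points = sorted(d for d in durations if d > threshold)

print(f"0.85 Quantile of Analysis Durations : {threshold:.2f} secs")
print(f"Analysis Time Greater than 0.85 Quantile : {len(slow_points)} points")
```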


In [12]:
# Note: the transition table follows the same logic as above, where 'count' denotes an occurrence of
#       {left label} in either one or both sets. Hence, the total count may exceed the number of points!

# TODO: For the highest analysis duration points, display the same statistics shown earlier in the notebook
#       -> Label distribution, disagreement distributions, etc. 