# Prepare East Sudan Points

**Author**: Ivan Zvonkov

**Date Modified**: June 19, 2024

**Description**: Processes points re-checked by SatLabel Squad into updated points for East Sudan.

In [1]:
import pandas as pd

## 1. Load all relevant files

In [2]:
df2022_set1_rechecked = pd.read_csv("points_raw/ceo-Sudan-Feb-2022---Feb-2023-(Set-1)-sample-data-2024-04-29_rechecked2.csv")
df2022_set2 = pd.read_csv("points_raw/ceo-Sudan-Feb-2022---Feb-2023-(Set-2)-sample-data-2024-05-22.csv")

df2023_set1_rechecked = pd.read_csv("points_raw/ceo-Sudan-Feb-2023---Feb-2024-(Set-1)-sample-data-2024-05-06_rechecked2.csv")
df2023_set2 = pd.read_csv("points_raw/ceo-Sudan-Feb-2023---Feb-2024-(Set-2)-sample-data-2024-05-22.csv")

## Check Lengths of Each CEO set

In [3]:
len(df2022_set1_rechecked)

1197

In [4]:
len(df2022_set2)

1207

Not the same!

Checking for duplicate plot ids in both sets.

In [5]:
df2022_set1_rechecked["plotid"].value_counts()

453     2
0       1
794     1
801     1
800     1
       ..
398     1
397     1
396     1
395     1
1195    1
Name: plotid, Length: 1196, dtype: int64

In [6]:
df2022_set2["plotid"].value_counts() 

176     2
358     2
59      2
454     2
456     2
       ..
401     1
400     1
399     1
398     1
1195    1
Name: plotid, Length: 1196, dtype: int64

Looks like CEO allows duplicate plot ids, presumably when two people end up labeling the same point in a single set.

Without the duplicates, though, both 2022 sets have 1196 points.

Will deal with this later in the notebook.
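As a minimal sketch of how the duplicates could be listed and dropped explicitly, the snippet below uses a made-up toy DataFrame (the `toy` frame and its email values are hypothetical, only the `plotid` column name comes from the CEO export). Keeping the first row per plot id is an assumption that mirrors the `.iloc[0]` used later in this notebook.

```python
import pandas as pd

# Toy stand-in for a CEO export; values are made up for illustration.
toy = pd.DataFrame({
    "plotid": [0, 1, 1, 2],
    "email": ["a@example.com", "b@example.com", "c@example.com", "a@example.com"],
})

# Rows whose plotid occurs more than once (keep=False marks every copy).
duplicate_rows = toy[toy["plotid"].duplicated(keep=False)]
duplicate_ids = sorted(duplicate_rows["plotid"].unique().tolist())

# Keep only the first row for each plotid.
deduplicated = toy.drop_duplicates(subset="plotid", keep="first")
```

Inspecting `duplicate_rows` before dropping anything shows which labelers produced the repeated entries.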

In [7]:
len(df2023_set1_rechecked)

1196

In [8]:
len(df2023_set2)

1196

No duplicate points in 2023, good.

During rechecking we used a new column "Type of point".

In [9]:
df2022_set1_rechecked["Type of point"].value_counts()

Obvious non-crop     117
Obvious crop          95
Probably crop         75
Probably non-crop     52
Name: Type of point, dtype: int64

In [10]:
df2023_set1_rechecked["Type of point"].value_counts()

Obvious crop         161
Obvious non-crop     143
Probably non-crop     19
Probably crop         16
Name: Type of point, dtype: int64

## 2. Process Points 2022

For each point, is it crop or not?

Pseudocode:

1. RECHECK OVERRIDE: If "Type of point" is available, use it and ignore the other set.
2. If not available, use agreement between sets.
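The two rules above can be sketched as a small standalone function. This is an illustration only: `resolve_label` is a hypothetical helper, not part of the notebook, and the `"Non-crop"` answer string used in the usage lines is an assumed placeholder (the notebook only ever compares against `"Crop"`).

```python
def resolve_label(type_of_point, set1_answer, set2_answer):
    """Return 1.0 (crop), 0.0 (non-crop), or None when the two sets disagree."""
    # RECHECK OVERRIDE: a string in "Type of point" wins over both sets.
    if isinstance(type_of_point, str):
        return 0.0 if "non-crop" in type_of_point else 1.0
    # No recheck: accept the label only when both sets give the same answer.
    if set1_answer == set2_answer:
        return 1.0 if set1_answer == "Crop" else 0.0
    # Disagreement: the caller decides whether to skip the point.
    return None


# Usage with made-up answers; NaN stands for a point that was not rechecked.
rechecked = resolve_label("Probably non-crop", "Crop", "Crop")
agreed = resolve_label(float("nan"), "Crop", "Crop")
disagreed = resolve_label(float("nan"), "Crop", "Non-crop")
```

Returning `None` on disagreement keeps the skip decision visible at the call site instead of hiding it inside the helper.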

In [11]:
is_crop_col = "Does this pixel contain active cropland?"

In [12]:
points_2022 = []

for i in range(1196):
    point_2022_set1_rechecked = df2022_set1_rechecked[df2022_set1_rechecked["plotid"] == i]
    
    # Warn about duplicate plot ids (the first row is used below)
    if len(point_2022_set1_rechecked) > 1:
        print(f"Duplicate plotid: {i}")
        
    point_2022_set1_rechecked = point_2022_set1_rechecked.iloc[0]
    
    # Keep only East Sudan points (east of 32° lon)
    if point_2022_set1_rechecked["lon"] < 32:
        continue

    # RECHECK OVERRIDE
    type_of_point = point_2022_set1_rechecked["Type of point"]
    if isinstance(type_of_point, str):
        if "non-crop" in type_of_point:
            label = 0.0
        else:
            label = 1.0

    # No recheck: use agreement between sets
    else:
        point_2022_set2 = df2022_set2[df2022_set2["plotid"] == i].iloc[0]
        
        # Check agreement
        if point_2022_set1_rechecked[is_crop_col] == point_2022_set2[is_crop_col]:
            if point_2022_set2[is_crop_col] == "Crop":
                label = 1.0
            else:
                label = 0.0
        
        
#         # Labelers I trust a bit more set 1:
#         elif point_2022_set1_rechecked["email"] in ["gmuhawen@asu.edu"]:
#             if point_2022_set1_rechecked[is_crop_col] == "Crop":
#                 label = 1.0
#             else:
#                 label = 0.0
                
#         # Labelers I trust a bit more set 2
#         elif point_2022_set2["email"] in ["izvonkov@umd.edu", "hkerner@umd.edu", "mpurohi3@asu.edu", "sbaber@umd.edu"]:
#             if point_2022_set2[is_crop_col] == "Crop":
#                 label = 1.0
#             else:
#                 label = 0.0
                
        # Disagreement: skip this point
        else:
            print(f"Disagreement for plotid: {i} skipping.")
            continue
        
    lat = point_2022_set1_rechecked["lat"]
    lon = point_2022_set1_rechecked["lon"]
    points_2022.append({"longitude": lon, "latitude": lat, "label": label})

Duplicate plotid: 453
Disagreement for plotid: 517 skipping.
Disagreement for plotid: 526 skipping.
Disagreement for plotid: 527 skipping.
Disagreement for plotid: 537 skipping.
Disagreement for plotid: 548 skipping.
Disagreement for plotid: 551 skipping.
Disagreement for plotid: 555 skipping.
Disagreement for plotid: 559 skipping.
Disagreement for plotid: 560 skipping.
Disagreement for plotid: 563 skipping.
Disagreement for plotid: 568 skipping.
Disagreement for plotid: 569 skipping.
Disagreement for plotid: 574 skipping.
Disagreement for plotid: 575 skipping.
Disagreement for plotid: 601 skipping.
Disagreement for plotid: 602 skipping.
Disagreement for plotid: 611 skipping.
Disagreement for plotid: 634 skipping.
Disagreement for plotid: 650 skipping.
Disagreement for plotid: 672 skipping.
Disagreement for plotid: 683 skipping.
Disagreement for plotid: 685 skipping.
Disagreement for plotid: 690 skipping.
Disagreement for plotid: 804 skipping.
Disagreement for plotid: 807 skipping.
Dis

In [13]:
df2022 = pd.DataFrame(points_2022)

In [14]:
df2022["label"].value_counts()

1.0    311
0.0    235
Name: label, dtype: int64

## 4. Process Points 2023

In [15]:
points_2023 = []

for i in range(1196):
    point_2023_set1_rechecked = df2023_set1_rechecked[df2023_set1_rechecked["plotid"] == i]
    
    # Warn about duplicate plot ids (the first row is used below)
    if len(point_2023_set1_rechecked) > 1:
        print(f"Duplicate plotid: {i}")
        
    point_2023_set1_rechecked = point_2023_set1_rechecked.iloc[0]
    
    # Keep only East Sudan points (east of 32° lon)
    if point_2023_set1_rechecked["lon"] < 32:
        continue

    # RECHECK OVERRIDE
    type_of_point = point_2023_set1_rechecked["Type of point"]
    if isinstance(type_of_point, str):
        if "non-crop" in type_of_point:
            label = 0.0
        else:
            label = 1.0

    # No recheck: use agreement between sets
    else:
        point_2023_set2 = df2023_set2[df2023_set2["plotid"] == i].iloc[0]
        
        # Set 2 not labeled, default to set 1
        if not isinstance(point_2023_set2[is_crop_col], str):
            if point_2023_set1_rechecked[is_crop_col] == "Crop":
                label = 1.0
            else:
                label = 0.0
            
        
        # Check agreement
        elif point_2023_set1_rechecked[is_crop_col] == point_2023_set2[is_crop_col]:
            if point_2023_set2[is_crop_col] == "Crop":
                label = 1.0
            else:
                label = 0.0
        
        
        # If disagreement and Gedeon is a labeler give him override:
        elif point_2023_set1_rechecked["email"] == "gmuhawen@asu.edu":
            if point_2023_set1_rechecked[is_crop_col] == "Crop":
                label = 1.0
            else:
                label = 0.0
                
        # If disagreement and Ivan or Hannah is the set 2 labeler, give them the override
        elif point_2023_set2["email"] in ["izvonkov@umd.edu", "hkerner@umd.edu"]:
            if point_2023_set2[is_crop_col] == "Crop":
                label = 1.0
            else:
                label = 0.0
                
        # Disagreement without a trusted labeler: skip this point
        else:
            print(f"Disagreement for plotid: {i} skipping.")
            continue
        
    lat = point_2023_set1_rechecked["lat"]
    lon = point_2023_set1_rechecked["lon"]
    points_2023.append({"longitude": lon, "latitude": lat, "label": label})

In [16]:
df2023 = pd.DataFrame(points_2023)

In [17]:
df2023["label"].value_counts()

1.0    305
0.0    274
Name: label, dtype: int64

## 5. Split for Training and Area Estimation 

Going to try a 50/50 split and hope for the best.
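A seeded generator would make a split like this reproducible across notebook runs. The sketch below is one way to do that, under assumptions: `random_split`, `train_fraction`, and the seed value are hypothetical names and choices, not part of this notebook.

```python
import numpy as np
import pandas as pd


def random_split(index, train_fraction=0.5, seed=0):
    """Assign each index entry to "training" or "testing" at random."""
    rng = np.random.default_rng(seed)  # fixed seed gives the same split every run
    draws = rng.random(len(index))     # uniform floats in [0, 1)
    subset = pd.Series("testing", index=index)
    subset[draws < train_fraction] = "training"
    return subset


# Usage on a made-up index of 100 points.
split_a = random_split(pd.RangeIndex(100), seed=42)
split_b = random_split(pd.RangeIndex(100), seed=42)
```

Because each point is drawn independently, the two subsets are only approximately equal in size.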

In [18]:
import numpy as np

In [19]:
random_float = np.random.rand(len(df2022.index))

In [20]:
subset_col = pd.Series(index=df2022.index, data="testing")
subset_col[0.5 <= random_float] = "training"

In [21]:
subset_col.value_counts()

training    289
testing     257
dtype: int64

In [22]:
df2022["subset"] = subset_col

In [23]:
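# Note: subset_col is indexed by df2022.index, so this assignment aligns on row index, not on location.
# df2023 rows whose index is beyond the end of df2022 get NaN (see the df2023 output below).
df2023["subset"] = subset_col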

In [24]:
df2022

| | longitude | latitude | label | subset |
|---|---|---|---|---|
| 0 | 34.412608 | 13.322394 | 1.0 | training |
| 1 | 33.042904 | 13.296256 | 1.0 | training |
| 2 | 33.967062 | 12.616816 | 1.0 | testing |
| 3 | 35.666443 | 12.672629 | 1.0 | training |
| 4 | 34.499577 | 12.942798 | 0.0 | training |
| ... | ... | ... | ... | ... |
| 541 | 34.191256 | 14.892369 | 1.0 | training |
| 542 | 32.990788 | 13.772671 | 1.0 | training |
| 543 | 33.035949 | 14.585912 | 0.0 | testing |
| 544 | 33.454790 | 13.852573 | 1.0 | training |


In [25]:
df2023

| | longitude | latitude | label | subset |
|---|---|---|---|---|
| 0 | 34.412608 | 13.322394 | 1.0 | training |
| 1 | 33.042904 | 13.296256 | 1.0 | training |
| 2 | 34.215501 | 13.411822 | 0.0 | testing |
| 3 | 33.967062 | 12.616816 | 1.0 | training |
| 4 | 35.666443 | 12.672629 | 0.0 | training |
| ... | ... | ... | ... | ... |
| 574 | 34.191256 | 14.892369 | 0.0 | |
| 575 | 32.990788 | 13.772671 | 0.0 | |
| 576 | 33.035949 | 14.585912 | 1.0 | |
| 577 | 33.454790 | 13.852573 | 1.0 | |
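The empty `subset` values in the last rows of `df2023` come from pandas aligning `subset_col` on the row index of `df2022`, which is shorter. The same alignment means a location can land in different subsets in the two years (the point at longitude 33.967062 is "testing" in `df2022` and "training" in `df2023`). If the intent is for each location to share one subset across years, one option is to assign subsets per unique coordinate and merge them back. The sketch below uses made-up toy frames. The seed, the toy names, and the exact-match merge on coordinates are all assumptions, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df2022 and df2023; coordinates are made up.
toy_2022 = pd.DataFrame({"longitude": [33.0, 34.0, 35.0], "latitude": [13.0, 12.5, 14.0]})
toy_2023 = pd.DataFrame({"longitude": [33.0, 33.5, 34.0, 35.0], "latitude": [13.0, 13.2, 12.5, 14.0]})

# One row per unique location across both years.
locations = (
    pd.concat([toy_2022, toy_2023])[["longitude", "latitude"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Random 50/50 assignment per location (arbitrary seed for reproducibility).
rng = np.random.default_rng(0)
locations["subset"] = np.where(rng.random(len(locations)) < 0.5, "training", "testing")

# Left merges keep every point of each year and attach the shared subset.
toy_2022 = toy_2022.merge(locations, on=["longitude", "latitude"], how="left")
toy_2023 = toy_2023.merge(locations, on=["longitude", "latitude"], how="left")
```

Merging on float coordinates assumes both years carry identical values for the same plot; merging on `plotid` would be more robust if that column were kept in the processed points.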


In [27]:
df2022.to_csv("points_processed/points_2022_EastSudan_50subset_v3.csv", index=False)
df2023.to_csv("points_processed/points_2023_EastSudan_50subset_v3.csv", index=False)