# Intercomparison of cross-validation points

We will join each dataset's 'waterflag' values together where the lat, lon, and month values align.

There are two sensible strategies for joining the datasets. The 'inner' join returns only those samples where _all_ the analysts reported a waterflag value (0, 1, 2, or 3). This is a strict strategy that returns very few samples, but we have more confidence in them.

The 'outer' strategy combines all sample values regardless of whether every analyst actually completed that sample. This second strategy is less strict and so returns more samples, but some rows will contain only two observations, with the remainder filled with NaNs (so we have less confidence in many of the samples).
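
The difference between the two strategies can be sketched on toy data. The two small tables below are hypothetical (two analysts, made-up coordinates and the made-up suffix `_b`); only the `lat`, `lon`, `month` and `waterflag` column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: two analysts, keyed on lat/lon/month.
a = pd.DataFrame({'lat': [1.0, 1.0, 2.0], 'lon': [10.0, 10.0, 20.0],
                  'month': [1, 2, 1], 'waterflag': [0, 1, 2]})
b = pd.DataFrame({'lat': [1.0, 2.0], 'lon': [10.0, 20.0],
                  'month': [1, 2], 'waterflag': [0, 3]})

keys = ['lat', 'lon', 'month']
# inner: keep only samples that both analysts labelled
inner = a.merge(b, on=keys, how='inner', suffixes=(None, '_b'))
# outer: keep every sample, filling missing labels with NaN
outer = a.merge(b, on=keys, how='outer', suffixes=(None, '_b'))

n_inner = len(inner)
n_outer = len(outer)
# rows in the outer join where at least one analyst has no label
n_incomplete = int(outer[['waterflag', 'waterflag_b']].isna().any(axis=1).sum())
print(n_inner, n_outer, n_incomplete)
```

Only one sample is shared by both toy analysts, so the inner join keeps a single row, while the outer join keeps all four distinct samples, three of them with a NaN.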

Once the dataframes are joined, we calculate the majority classification (the mode of all analysts' classifications). For each sample, we compute the percentage of individual analysts' classifications that agree with the majority classification. The mean and standard deviation of all agreement percentages are reported at the end of the notebook.

In [1]:
import os
import pandas as pd
import numpy as np

In [2]:
data_folder = 'data/clean/'

## Open first dataframe

In [3]:
# first collect the CSV file names (note: `dfs` holds file names, not dataframes)
dfs=[]
for f in os.listdir(data_folder):
    name, ext = os.path.splitext(f)
    if ext == '.csv':
        dfs.append(f)

In [4]:
# open the first dataframe and drop the columns not needed for the comparison
df = pd.read_csv(data_folder+dfs[0]).set_index('Unnamed: 0').drop(['class', 'plot_id'], axis=1)
len(df)

211

## Join other dataframes where lat, lon and month match

* Use 'outer' join method so that we retain all observations (lots of NaNs).

* Use 'inner' join method to retain only the observations where every analyst has labelled the sample. This will result in only 15 observations in total (out of a possible 80 samples × 12 months = 960).

In [5]:
join_method = 'outer'

for f in dfs[1:]:
    other=pd.read_csv(data_folder+f).set_index('Unnamed: 0').drop(['class', 'plot_id'], axis=1)
    df = df.merge(other, on=['lat', 'lon', 'month'], suffixes=[None, f[0:3]], how=join_method)
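
As a rough check on how sparse an outer join is, one can count how many analysts labelled each sample. The sketch below uses a hypothetical three-analyst table; the suffixes `abc` and `xyz` are invented stand-ins for the three-character file-name prefixes used above.

```python
import numpy as np
import pandas as pd

# Hypothetical joined table: three analysts' waterflag columns after an outer join.
joined = pd.DataFrame({
    'lat': [1.0, 1.0, 2.0],
    'lon': [10.0, 10.0, 20.0],
    'month': [1, 2, 1],
    'waterflag': [0, 1, np.nan],
    'waterflagabc': [0, np.nan, 2],
    'waterflagxyz': [0, np.nan, 2],
})

flags = joined.filter(like='waterflag')
# number of analysts who labelled each sample
n_labels = flags.notna().sum(axis=1)
# number of samples labelled by every analyst
n_complete = int((n_labels == flags.shape[1]).sum())
print(n_labels.tolist(), n_complete)
```

Assuming each file has unique lat/lon/month keys and no missing waterflag values, the fully labelled rows are the ones an inner join would keep.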

## Find unique mode of waterflag for each lat,lon,month

If `dropnans` is set to True, NaNs are excluded from the mode calculation, so NaN is never returned as the mode even when it is the most common value for the sample. For example, if 7 of the 9 analysts have NaNs and the other two have 'water', then the mode will be 'water'.
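
The effect of the flag can be seen on a single made-up row that mirrors the example above: nine analysts, seven of them NaN and two agreeing. The value 1.0 is just a placeholder label, not a statement about which code means 'water'.

```python
import numpy as np
import pandas as pd

# Hypothetical row: 7 analysts skipped the sample, 2 gave the same label (1.0).
row = pd.DataFrame([[np.nan] * 7 + [1.0, 1.0]])

mode_drop = row.mode(axis=1, dropna=True)    # NaNs ignored when counting
mode_keep = row.mode(axis=1, dropna=False)   # NaN counted like any other value

print(mode_drop.iloc[0, 0], mode_keep.iloc[0, 0])
```

With `dropna=True` the mode is the label shared by the two analysts; with `dropna=False` the mode is NaN because NaN is the most frequent entry.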

In [6]:
dropnans = True

In [7]:
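# row-wise mode of the analysts' waterflag columns (one column per tied value)
u = df.filter(like='waterflag').mode(axis=1, dropna=dropnans)
if len(u.columns) > 1:
    # keep the first (smallest) mode. Note: .all() skips NaN and treats non-zero
    # values as True, so ties are kept; .isna().all(axis=1) would blank them
    u = u.iloc[:, 0].where(u.iloc[:, 1:].all(axis=1))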

df['waterflag_mode'] = u
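
How tied modes are handled matters when many rows have only two observations. The sketch below, on hypothetical data with made-up column names `a`, `b`, `c`, compares the `.all(axis=1)` test used above with an `.isna().all(axis=1)` variant. The variant is an assumption about what "unique mode" could mean, not a change to this notebook's results.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: row 0 has a clear majority (1), row 1 is a 1-vs-2 tie.
flags = pd.DataFrame({'a': [1.0, 1.0], 'b': [1.0, 2.0], 'c': [2.0, np.nan]})

# mode() returns one column per tied value, sorted ascending, padded with NaN
modes = flags.mode(axis=1, dropna=True)
first_mode = modes.iloc[:, 0]

# test as written in the cell above: NaN is skipped and 2.0 is truthy,
# so both rows pass and the tie resolves to the smallest value
as_written = first_mode.where(modes.iloc[:, 1:].all(axis=1))

# variant: a mode is unique only if every further mode column is NaN
unique_mode = first_mode.where(modes.iloc[:, 1:].isna().all(axis=1))

print(as_written.tolist(), unique_mode.tolist())
```

On this toy data the first test keeps 1.0 for the tied row, while the variant sets it to NaN. Which behaviour is preferable depends on whether tied samples should count towards the agreement statistics.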

Print the counts for each majority classification.

In [8]:
df['waterflag_mode'].value_counts(dropna=False)

2.0    465
1.0    391
0.0    171
Name: waterflag_mode, dtype: int64

## Count agreements with majority waterflag

In [9]:
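x=[]
for i in range(len(df)):
    # columns 3-11 hold the nine analysts' waterflags; column 12 is waterflag_mode
    num_agree=(df.iloc[i, 3:12] == df.iloc[i, 12]).sum()
    # number of analysts who actually labelled this sample (non-NaN)
    total_valid = np.count_nonzero(~np.isnan(df.iloc[i, 3:12]))
    x.append((num_agree/total_valid)*100)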

df['agree_percent'] = x
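
The same agreement percentage can also be computed without hard-coded column positions. This is a minimal sketch on a hypothetical three-analyst table (`df_toy` and the `abc`/`xyz` suffixes are invented for illustration), selecting the analyst columns by name instead of by index.

```python
import numpy as np
import pandas as pd

# Hypothetical table: three analysts plus a precomputed majority column.
df_toy = pd.DataFrame({
    'waterflag': [1.0, 2.0, 0.0],
    'waterflagabc': [1.0, np.nan, 0.0],
    'waterflagxyz': [2.0, 2.0, 0.0],
    'waterflag_mode': [1.0, 2.0, 0.0],
})

# analyst columns only (exclude the majority column itself)
flags = df_toy.filter(like='waterflag').drop(columns='waterflag_mode')

# analysts matching the majority label per row; NaN never compares equal
num_agree = flags.eq(df_toy['waterflag_mode'], axis=0).sum(axis=1)
# analysts who actually labelled the sample
total_valid = flags.notna().sum(axis=1)

agree_percent = num_agree / total_valid * 100
print(agree_percent.round(1).tolist())
```

In the first toy row two of three analysts match the majority, and in the second row the NaN is excluded from the denominator, so both valid labels agree.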

## Mean and std of agreement across all cross-validation samples

In [10]:
df['agree_percent'].mean()

81.89803867019097

In [11]:
df['agree_percent'].std()

17.946191522129137

### Export table

In [12]:
df.to_csv('results/cross_validation_results.csv')