# Combine ribbit scores with verified data

This notebook combines the csv file of ribbit scores with the csv files of manually verified data so that the user can assess the validity of the RIBBIT model.

## Setup

In [5]:
# run the file setup_functions.ipynb to define settings, import packages, and define functions
%run ../ribbit_functions/setup_functions.ipynb

### Combine csv files

In [None]:
# Run this cell only if you need to combine multiple manually verified data files into one
# Data must have the following columns: "Site", "Logger", "Sample Date", "Species", "NAAMP", "File ID"
if input("Are you sure you want to run this? (type 'yes' to continue)")=='yes':
    folder_path = "../manually_verified_data/ichaway_verified_data/"
    raw_ich = combine_csvs(folder_path, new_csv_name = "ichaway_verified_data.csv")
else:
    print('aborted')
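
`combine_csvs()` is defined in `setup_functions.ipynb`, which is not shown here. As a rough idea of what a step like this involves, the sketch below stacks every csv file in a folder and checks for the required columns. The name `combine_csvs_sketch`, the choice to write the combined file into the same folder, and all toy values are assumptions made for illustration, not the notebook's actual implementation.

```python
import glob
import os
import tempfile

import pandas as pd

REQUIRED_COLUMNS = ["Site", "Logger", "Sample Date", "Species", "NAAMP", "File ID"]


def combine_csvs_sketch(folder_path, new_csv_name):
    """Hypothetical stand-in for combine_csvs(): stack every csv file in a folder."""
    csv_paths = sorted(glob.glob(os.path.join(folder_path, "*.csv")))
    # skip a previously written combined file so its rows are not duplicated
    csv_paths = [p for p in csv_paths if os.path.basename(p) != new_csv_name]
    frames = []
    for path in csv_paths:
        frame = pd.read_csv(path)
        missing = set(REQUIRED_COLUMNS) - set(frame.columns)
        if missing:
            raise ValueError(f"{path} is missing columns: {sorted(missing)}")
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    # assumption: the combined file is written into the same folder
    combined.to_csv(os.path.join(folder_path, new_csv_name), index=False)
    return combined


# toy demonstration in a temporary folder (all values are made up)
with tempfile.TemporaryDirectory() as tmp:
    row = {"Site": "W1", "Logger": 5, "Sample Date": "2/5/2015",
           "Species": "LICAP", "NAAMP": 1, "File ID": "20150205_194700"}
    pd.DataFrame([row]).to_csv(os.path.join(tmp, "part1.csv"), index=False)
    pd.DataFrame([row]).to_csv(os.path.join(tmp, "part2.csv"), index=False)
    combined = combine_csvs_sketch(tmp, new_csv_name="combined.csv")
    combined_exists = os.path.exists(os.path.join(tmp, "combined.csv"))

print(len(combined))
```

Checking the columns before concatenating makes a badly formatted input file fail loudly instead of silently adding empty columns to the combined data.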


## Import and clean data
### Define file and folder paths for data import and cleaning 

In [6]:
# file path to csv file with ribbit scores 
ribbit_scores_fp = "./ribbit_scores_Dec2022/ribbit_scores_combined.csv" #*# change this to file path for ribbit scores

# file path to csv file with manually verified data 
verified_data_fp = "./ichaway_verified_data/ichaway_verified_data.csv" #*# change this to file path for verified files

# path to the folder that contained the audio files WHEN THE MODEL WAS RUN
# This prefix is used to rebuild the file paths that index the csv file containing the ribbit scores.
audio_files_fp = '/Volumes/Expansion/Frog Call Project/Calling Data/ichaway/' #*# change this to file path for where the audio data was WHEN THE MODEL WAS RUN 



### Import and clean ribbit score data 

In [7]:
# Import ribbit scores based on ribbit_scores_fp
rs_ich = pd.read_csv(ribbit_scores_fp, index_col = 0)

rs_ich['date']=pd.to_datetime(rs_ich['date']) # convert column to date-time format


### Import and clean manually verified data

In [8]:
# import manually verified data
raw_ich = pd.read_csv(verified_data_fp)[["Site", "Logger", "Sample Date", "Species", "NAAMP", "File ID", "Start Date"]]

# rename columns for convenience
raw_ich = raw_ich.rename(columns = {"Site":"site", "Logger":"logger", "Sample Date":"date", "Species":"species", "File ID":"file_name", "Start Date":"folder_date"})

# create year column based on date string
raw_ich['year'] = raw_ich.date.astype(str).str[-4:]
# astype returns a new dataframe, so assign the result back to keep the conversion
raw_ich = raw_ich.astype({"year":"int"})


# create full file path from file names and logger numbers 
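# note: '%-m' and '%-d' (no zero padding) are platform-specific strftime codes; they work on macOS and Linux but not on Windows
raw_ich['folder_date'] = pd.to_datetime(raw_ich['folder_date'], format='%m/%d/%Y').dt.strftime('%-m-%-d-%y')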
raw_ich['file_path'] = audio_files_fp + 'ichaway_' + raw_ich['year'].astype('string') + '/' + raw_ich['logger'].astype('string') + 'a/' + raw_ich['folder_date'] + '/' + raw_ich['file_name'] + '.wav' #*#
# set file path as index 
raw_ich = raw_ich.set_index('file_path')
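
To make the path construction concrete, here is a small worked example on a single made-up row. The prefix `/audio/ichaway/` stands in for `audio_files_fp`, and the "Sample Date" format `m/d/YYYY` is an assumption (the code above only relies on the last four characters being the year). The folder date is built from the month and day components directly, which is one way to avoid the platform-specific `%-m` and `%-d` codes.

```python
import pandas as pd

# hypothetical prefix standing in for audio_files_fp
audio_prefix = "/audio/ichaway/"

# one made-up row using the renamed columns from the cell above
toy = pd.DataFrame({
    "logger": [5],
    "date": ["2/5/2015"],
    "folder_date": ["02/02/2015"],
    "file_name": ["20150205_194700"],
})

# year is taken from the last four characters of the date string
toy["year"] = toy["date"].astype(str).str[-4:]

# build "2-2-15" from the parsed date without zero padding
parsed = pd.to_datetime(toy["folder_date"], format="%m/%d/%Y")
toy["folder_date"] = (
    parsed.dt.month.astype(str) + "-"
    + parsed.dt.day.astype(str) + "-"
    + parsed.dt.strftime("%y")
)

toy["file_path"] = (
    audio_prefix + "ichaway_" + toy["year"] + "/"
    + toy["logger"].astype(str) + "a/"
    + toy["folder_date"] + "/" + toy["file_name"] + ".wav"
)

print(toy["file_path"].iloc[0])
# /audio/ichaway/ichaway_2015/5a/2-2-15/20150205_194700.wav
```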

# identify which rows are Lcapito observations 
raw_ich['Lcapito'] = raw_ich['species'] == 'LICAP'
raw_ich['Lcapito'] = raw_ich['Lcapito'].astype('category')

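# create "verified_ich" dataframe with one row per file and a column (Lcapito) that is True if the file has a Lcapito and False if it does not
# sorting in descending order puts any True row first within each file, so head(1) keeps it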
verified_ich = raw_ich.sort_values(["file_path", "Lcapito"], ascending = False).groupby('file_path').head(1) 
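
The sort-then-`head(1)` step can be checked on a toy table. In this sketch the file names and the non-LICAP species codes are made up; the point is only that a file with several species rows collapses to a single row, and that the row is flagged `True` whenever any of its rows was LICAP.

```python
import pandas as pd

# made-up observations: a.wav has one LICAP row, b.wav has none
toy = pd.DataFrame(
    {
        "file_path": ["a.wav", "a.wav", "b.wav", "b.wav"],
        "species": ["HYGRA", "LICAP", "HYGRA", "ACGRY"],
    }
).set_index("file_path")
toy["Lcapito"] = toy["species"] == "LICAP"

# descending sort puts True before False within each file; head(1) keeps that first row
one_per_file = (
    toy.sort_values(["file_path", "Lcapito"], ascending=False)
    .groupby("file_path")
    .head(1)
)

# cross-check: "any row is LICAP" per file should give the same flags
any_licap = toy.groupby("file_path")["Lcapito"].any()

print(one_per_file["Lcapito"].sort_index())
```

If only the flag is needed, the `any()` form is a shorter alternative; the `head(1)` form also keeps the other columns of the retained row.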

# these files were labeled incorrectly in the ichaway data - there are gopher frogs in them 
# logger 5a: 
#20150205_194700
#20150205_204700
#20150205_214700

# fix these mistakes
temp = audio_files_fp + 'ichaway_2015/5a/2-2-15/20150205_'
incorrect_files =  [temp + '194700.wav', temp + '204700.wav', temp + '214700.wav']
verified_ich.loc[verified_ich.index.isin(incorrect_files),'Lcapito'] = True




### Merge ribbit data and verified data  

In [9]:
# merge option 1 

# merge with ribbit scores based on file path
# this drops some files where the file path minutes don't match between the rs_ich and verified_ich
verified_ich = verified_ich.drop(columns = ["year", "date", "logger"]).merge(rs_ich, left_index = True, right_index = True)
verified_ich = verified_ich.dropna(subset=['Lcapito']) # drop any rows with "NaN" for Lcapito - if data was entered incorrectly, empty, etc. 
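
Because an inner merge drops unmatched files silently, it can help to list them before merging. The sketch below uses the `indicator` option of `DataFrame.merge` on two made-up tables whose index values stand in for file paths; the column name `score` follows the default `score_col` used later in this notebook, and the paths and scores are invented.

```python
import pandas as pd

# made-up stand-ins for verified_ich and rs_ich, indexed by file path
verified = pd.DataFrame(
    {"Lcapito": [True, False, False]},
    index=["p/20150205_194700.wav", "p/20150205_204700.wav", "p/20150205_214500.wav"],
)
scores = pd.DataFrame(
    {"score": [61.2, 3.4, 8.9]},
    index=["p/20150205_194700.wav", "p/20150205_204700.wav", "p/20150205_214700.wav"],
)

# an outer merge keeps every file; the _merge column records where each row came from
check = verified.merge(
    scores, how="outer", left_index=True, right_index=True, indicator=True
)
only_verified = check.index[check["_merge"] == "left_only"]
only_scores = check.index[check["_merge"] == "right_only"]

print("verified but no score:", list(only_verified))
print("score but not verified:", list(only_scores))
```

Comparing the two lists side by side makes mismatched minutes (here `214500` versus `214700`) easy to spot.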


In [16]:
# merge option 2 - better but more likely to cause problems 

# merge with ribbit scores based on hour of file path (ignore minutes - these sometimes don't match for some reason)
# still drops some files but not as many 
# warning: potential to match incorrect files (e.g. if one file is labeled 10:01 and another 10:58)
#rs_ich["fp_shortened"] = rs_ich.index.str[:-8]
#verified_ich["fp_shortened"] = verified_ich.index.str[:-8]
#verified_ich = verified_ich.drop(columns = ["year", "date", "logger"]).merge(rs_ich, left_on = "fp_shortened", right_on = "fp_shortened")
#verified_ich = verified_ich.dropna(subset=['Lcapito']).drop(columns = ["fp_shortened"]) # drop any rows with "NaN" for Lcapito - if data was entered incorrectly, empty, etc. 
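
The warning above can be reproduced on toy data. With file names of the form `YYYYMMDD_HHMMSS.wav`, `str[:-8]` removes the minutes, seconds, and extension, so two recordings from the same hour share one key and the merge pairs every verified row with every score row. The paths and scores below are made up; `validate="one_to_one"` is a standard `merge` option that raises an error when keys are not unique.

```python
import pandas as pd

# made-up example: two recordings in the same hour on each side
verified = pd.DataFrame(
    {"Lcapito": [True, False]},
    index=["p/20150205_190100.wav", "p/20150205_195800.wav"],
)
scores = pd.DataFrame(
    {"score": [61.2, 3.4]},
    index=["p/20150205_190300.wav", "p/20150205_195900.wav"],
)

# dropping the last 8 characters ("MMSS.wav") leaves the path up to the hour
verified["fp_shortened"] = verified.index.str[:-8]
scores["fp_shortened"] = scores.index.str[:-8]

# duplicate keys on both sides produce every pairing: 2 x 2 rows
merged = verified.merge(scores, on="fp_shortened")
n_rows = len(merged)

# validate turns the silent duplication into an explicit error
try:
    verified.merge(scores, on="fp_shortened", validate="one_to_one")
    unique_keys = True
except pd.errors.MergeError:
    unique_keys = False

print(n_rows, unique_keys)
```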

### TODO:  still losing some files after merging using this option - why?



## Using `get_top_rs()`

### Function definition

`def get_top_rs(df, n = 5, min_score = 0.0, t_unit = "Y", \
               group_col = 'no_groups', groups = ["0"], \
               score_col = "score", time_stamp_col = "time_stamp", \
               save_csv = False):`

**Purpose:** get a list of the audio files with the top ribbit scores that meet the given criteria

**Input:** 
* `df` - data frame with ribbit scores 
* `n` - number of files per group (e.g. n = 5 gets top 5 ribbit scores per group)
* `min_score` - minimum ribbit score needed for file to be included 
      (e.g. if you want all files above a ribbit score of 50, you could have min_score = 50 and n = 999999999999)
* `t_unit` - unit for how often we want the top scores (options: D, W, M, Y, Q - day, week, month, year, quarter year)
* `group_col` - the name of the column with the labels grouping our files 
      (e.g. "pond" for sandhills or "site" for ichaway wetlands)
* `groups` - list of the groupings 
      (e.g. for sandhills the pond numbers [398, 399, 400, 401, 402, 403]; for ichaway would be the wetlands' names)
* `score_col` - column name where ribbit score is stored 
* `time_stamp_col` - column name where time stamp for ribbit score is stored
* `save_csv` - False if we do not want to save our output to a csv. Otherwise string of the file path where we want to save the csv file 
      (e.g. "./ribbit_scores/top_ribbit_scores_per_year.csv")

**Output:**
dataframe with the top `n` files with a ribbit score over `min_score` for each group in `groups` for every `t_unit`
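
The selection described above can be pictured with a short pandas sketch. This is not the code of `get_top_rs()` (which lives in `setup_functions.ipynb`); `top_scores_sketch` is a hypothetical helper that omits the `groups` and `save_csv` arguments, and treating `min_score` as inclusive is an assumption. The toy sites, time stamps, and scores are made up.

```python
import pandas as pd


def top_scores_sketch(df, n=5, min_score=0.0, t_unit="Y",
                      group_col="site", score_col="score",
                      time_stamp_col="time_stamp"):
    """Rough illustration of the described selection (not get_top_rs itself)."""
    # assumption: files scoring exactly min_score are kept
    kept = df[df[score_col] >= min_score].copy()
    # bucket each time stamp into its day / week / month / quarter / year
    kept["period"] = kept[time_stamp_col].dt.to_period(t_unit)
    # highest scores first, then keep the first n rows of each group and period
    return (
        kept.sort_values(score_col, ascending=False)
        .groupby([group_col, "period"])
        .head(n)
    )


# made-up scores for two sites across two years
toy = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B"],
    "time_stamp": pd.to_datetime([
        "2015-02-05 19:47", "2015-03-01 20:00", "2016-02-05 19:47",
        "2015-02-05 19:47", "2015-02-06 19:47",
    ]),
    "score": [10.0, 50.0, 30.0, 5.0, 70.0],
})

top = top_scores_sketch(toy, n=1, min_score=8.0)
print(top[["site", "period", "score"]])
```

With `n = 1` and `t_unit = "Y"`, the toy data yields one row per site per year: the 2015 and 2016 maxima for site A and the 2015 maximum for site B.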
       

### Example

In [11]:
# create variable of all ichaway wetlands that had audio loggers 
ichaway_wetlands = verified_ich['site'].unique()

# Get top 3 audio files for each wetland and save it to a csv file
temp = get_top_rs(verified_ich, n = 3, group_col = 'site', groups = ichaway_wetlands, save_csv = "./example_top_rs_ichaway.csv")


