# Creating a Cleaned Dataset for Snow Spotter (Water Year 2021)

#### Summary

Zooniverse data exports provide a very large CSV file with a wide range of fields. For data analysis, however, we are only interested in a few of them. This notebook describes the code used to create cleaned netCDF and CSV files for Niwot Ridge Water Year 2021. The final cleaned file will have columns for datetime, median value, mean value, and mean threshold value. Note that this code is specific to the 2021 Water Year; the Snow Spotter Data Cleaning repository contains Jupyter notebook files for other water years.

##### First, import the following packages

In [21]:
import pandas as pd
import json
from datetime import datetime
import xarray as xr
import numpy as np
from dateutil.relativedelta import relativedelta

##### Now import the data export as a dataframe, having pandas read only the "subject_data" and "annotations" columns

In [None]:
df = pd.read_csv("D:\\JohnsWork\\NiwotRidge\\Niwot_WY2021\\NiwotWY2021_Classifications_Data.csv", usecols = ['subject_data', 'annotations'])
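
One detail worth keeping in mind: `usecols` selects columns but does not reorder them, so the dataframe keeps the order in which the columns appear in the file. The functions in this notebook index by position (`df.iloc[index, 0]` for annotations and `df.iloc[index, 1]` for subject data), so they rely on `annotations` coming before `subject_data` in the export. The sketch below illustrates this with a made-up two-row CSV; its contents are hypothetical and are not real export data.

```python
import io
import pandas as pd

# Hypothetical miniature export: 'annotations' precedes 'subject_data' in the file.
csv_text = (
    "classification_id,annotations,subject_data\n"
    "1,a1,s1\n"
    "2,a2,s2\n"
)

# The order of the names passed to usecols is ignored; file order is kept.
demo = pd.read_csv(io.StringIO(csv_text), usecols=["subject_data", "annotations"])
column_order = list(demo.columns)
print(column_order)

# Selecting by column name avoids depending on column position.
first_subject = demo.loc[0, "subject_data"]
print(first_subject)
```

If the column order of an export ever differs, accessing cells by name as in the last lines is one way to keep the extraction functions working.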

#### Next, extract the datetime from the subject_data column.

 * First, the cell is transformed into a dict so that we can read the 'Filename' entry of the subject metadata.
 
 * If your filenames follow the same pattern as ours, slicing off the prefix and the file extension leaves a string in the form `%Y_%m_%d_%H%M%S`. If your files are named differently, you will have to adjust the slice so that it contains only the date and time.
 
 * We can then parse that string with `datetime.strptime`, which gives us a column of datetimes that we will need for our analysis later on.
 
 * For this water year, the code also shifts any timestamp that falls in April forward by one year.

In [79]:
arr = []
row_count = len(df.index)
def extract_meta_data(index):
    subject_data = df.iloc[index,1]
    #string to dict
    subject_dict = json.loads(subject_data)
    #shed outer layer of dict
    shedded_subject_dict = dict(ele for sub in subject_dict.values() for ele in sub.items())
    #extract the metadata, see important note in markdown above
    metadata = shedded_subject_dict['Filename']
    if len(metadata) == 28:
        date = metadata[7:len(metadata)-4]
    else:
        date = metadata[7:len(metadata)-8]
    #extract datetime, see important note in markdown above
    pre_date_time = datetime.strptime(date, '%Y_%m_%d_%H%M%S')
    #timestamps in April are shifted forward by one year
    if pre_date_time.month == 4:
        date_time = pre_date_time + relativedelta(years=1)
    else:
        date_time = pre_date_time
    
    return date_time
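
To make the slicing step easier to follow, here is a minimal sketch that applies the same index arithmetic to a single filename string. The filenames and the helper name `parse_filename_date` are hypothetical, chosen only so that their lengths match the two cases the function handles. The April year shift is left out.

```python
from datetime import datetime

def parse_filename_date(filename):
    """Mirror the slicing in extract_meta_data on one filename string."""
    # Skip the 7-character prefix, then drop 4 trailing characters for a
    # 28-character name and 8 trailing characters otherwise.
    suffix_len = 4 if len(filename) == 28 else 8
    date = filename[7:len(filename) - suffix_len]
    return datetime.strptime(date, "%Y_%m_%d_%H%M%S")

# Hypothetical filenames (not taken from the real export).
plain = "Niwot1_2020_10_05_120000.JPG"       # 28 characters
longer = "Niwot1_2020_10_05_120000_(1).JPG"  # 4 extra characters before the extension

print(parse_filename_date(plain))
print(parse_filename_date(longer))
```

Both calls return the same datetime, because the longer name only differs in the part that is sliced away.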

#### The next function extracts the participant response

 * Once again, the cell must first be transformed into a dict so that we can call the 'value' key.
 
 * This value will be either "Yes", "No", or "Unsure". We need to convert this into a numerical value so that we can average and graph responses later on. We use a simple if/else statement to convert "Yes" to 1, "No" to 0, and "Unsure" to None, which pandas stores as NaN once the values are placed in a dataframe.

In [7]:
def extract_value(index):
    annotation = df.iloc[index,0]
    #get rid of brackets 
    str_annotation = str(annotation)
    annotation_dict_str = str_annotation[1:len(str_annotation)-1]
    #string to dict
    annotation_dict = json.loads(annotation_dict_str)
    #grab yes or no value from dict 
    annotation_value = annotation_dict['value']

    if annotation_value == 'Yes':
        return 1
    elif annotation_value == 'No':
        return 0
    else:
        return None
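
As an illustration of the conversion, the sketch below parses a hypothetical annotations cell and maps the answer to a number. It assumes, as the function above does, that each cell is a JSON list holding exactly one task response. The task label text, the `RESPONSE_CODES` mapping, and the helper name `response_to_number` are made up for this example.

```python
import json

# Hypothetical annotations cell: a JSON list with a single task response.
raw = '[{"task": "T0", "task_label": "Is there snow?", "value": "Yes"}]'

# "Yes" and "No" map to numbers; any other answer (e.g. "Unsure") maps to None.
RESPONSE_CODES = {"Yes": 1, "No": 0}

def response_to_number(annotation_json):
    # Parse the whole list and take its only element, which is equivalent
    # to stripping the outer brackets before parsing.
    first_task = json.loads(annotation_json)[0]
    return RESPONSE_CODES.get(first_task["value"])

print(response_to_number(raw))
print(response_to_number(raw.replace("Yes", "Unsure")))
```

Parsing the list directly is a small variation on the bracket-stripping approach; both give the same result when the list has one element.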

#### Now, we combine the extracted datetime and response into a new dataframe, and then find and combine the median and mean response for each datetime.

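The resulting dataframe is indexed by datetime and has one column each for the median value, the mean value, and the mean threshold. The mean threshold is 1 when the mean response is at least 0.9, 0 when it is lower, and NaN when a datetime has no "Yes" or "No" responses.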

In [None]:
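#start from an empty list so that re-running this cell does not duplicate rows
arr = []
#loop over every row of the export, including the first one (index 0)
for i in range(len(df)):
    arr.append([extract_meta_data(i), extract_value(i)])

final = pd.DataFrame(arr, columns = ['datetime','value'])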

median_final = final.groupby('datetime').median().reset_index()
mean_final = final.groupby('datetime').mean().reset_index()
combined_final = median_final
combined_final['mean_value'] = mean_final.value
combined_final.columns = ['datetime', 'median_value', 'mean_value']
#create the "mean_threshold" column
combined_final['mean_threshold'] = np.where(combined_final['mean_value'] >= 0.9, 1, np.where(np.isnan(combined_final['mean_value']), np.nan, 0))
combined_final.set_index('datetime', inplace=True)
combined_final.head()
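
The aggregation is easier to check on a handful of rows. The sketch below runs the same median, mean, and threshold logic on a toy dataframe; the dates and responses are invented for illustration and are not part of the Snow Spotter data.

```python
import numpy as np
import pandas as pd

# Invented responses for three image timestamps (NaN stands for "Unsure").
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-01-01 12:00"] * 4 + ["2021-01-02 12:00"] * 4 + ["2021-01-03 12:00"] * 2
    ),
    "value": [1, 1, 1, 1] + [1, 1, 0, np.nan] + [np.nan, np.nan],
})

grouped = toy.groupby("datetime")["value"]
summary = pd.DataFrame({
    "median_value": grouped.median(),  # NaN responses are skipped
    "mean_value": grouped.mean(),
})

# 1 if at least 90% of the Yes/No answers were "Yes", 0 if fewer,
# NaN if there were no Yes/No answers at all.
summary["mean_threshold"] = np.where(
    summary["mean_value"] >= 0.9,
    1,
    np.where(summary["mean_value"].isna(), np.nan, 0),
)
print(summary)
```

The second timestamp shows why the threshold column is useful: two of three answers are "Yes", so the median is 1, but the mean is below 0.9 and the threshold is 0.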

#### Save the cleaned data set

This code will save the cleaned dataset as a NetCDF file for easy management, and also as a CSV file.

In [81]:
cleaned_export = combined_final.to_xarray()
cleaned_export.to_netcdf("D:\\JohnsWork\\NiwotRidge\\Cleaned Data\\netCDF\\NiwotWY2021_Cleaned_Data.nc")

In [82]:
combined_final.to_csv("D:\\JohnsWork\\NiwotRidge\\Cleaned Data\\csv\\NiwotWY2021_Cleaned_Data.csv")  
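
When the CSV is loaded again later, the datetime index has to be parsed explicitly. The sketch below shows one way to do that, using a small stand-in dataframe and an in-memory buffer in place of the real file; the values are invented for illustration.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for combined_final, with invented values.
cleaned = pd.DataFrame(
    {
        "median_value": [1.0, np.nan],
        "mean_value": [0.95, np.nan],
        "mean_threshold": [1.0, np.nan],
    },
    index=pd.DatetimeIndex(
        ["2021-01-01 12:00:00", "2021-01-02 12:00:00"], name="datetime"
    ),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer)
buffer.seek(0)

# Restore the datetime index and parse it back into timestamps.
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=["datetime"])
print(restored)
```

NaN values are written as empty fields and come back as NaN, so timestamps without a usable response stay marked as missing.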