# Creating a Cleaned Dataset for Snow Spotter 

#### Summary

When you request a data export, Zooniverse provides a large CSV file with many columns. For this analysis, however, we only need specific fields from two of them, so the first step is to have pandas read in only the annotations and subject_data columns. These columns contain the image datetime that was supplied as metadata in the initial manifest upload, as well as the participant response for each image. We can use a series of transformations to extract this information. 

From here, the next step is to convert the Yes/No/Unsure responses into numerical values and then average the responses for each individual image. This is done by converting "Yes" responses to 1.0, "No" responses to 0.0, and "Unsure" responses to NaN. The median and mean of these values are then taken for each image and combined into a single dataset, which is saved as a netCDF file.
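
As a rough illustration of this conversion and averaging, the sketch below uses a small, made-up table of responses (the datetimes and answers are hypothetical, not taken from a real export). It shows that NaN values are skipped when pandas computes the mean and median, so "Unsure" votes do not pull the average toward either answer.

```python
import pandas as pd

# Hypothetical responses for two images, three volunteers each.
responses = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-12-01 09:56:05"] * 3 + ["2021-12-02 09:56:05"] * 3
    ),
    "response": ["Yes", "Yes", "Unsure", "No", "Yes", "No"],
})

# Map text answers to numbers; anything not in the dict (e.g. "Unsure") becomes NaN.
responses["value"] = responses["response"].map({"Yes": 1.0, "No": 0.0})

# mean and median ignore NaN by default, so only Yes/No votes are counted.
summary = responses.groupby("datetime")["value"].agg(["mean", "median"])
print(summary)
```

For the first image only the two "Yes" votes are counted, so both the mean and the median come out as 1.0.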

#### First, import the following packages

In [19]:
import os
import pandas as pd
import json
import matplotlib.pyplot as plt
from datetime import datetime
import xarray as xr

#### Now import the data export and create a dataframe with only the subject_data and annotations columns. 

In [None]:
#set the working directory to the folder where the export is located
path = "file_location"
os.chdir(path)

df = pd.read_csv("filename.csv", usecols = ['subject_data', 'annotations'])
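
Before writing the extraction functions, it can help to look at what a single cell in each column contains. The sketch below uses hypothetical cell contents (the subject ID, task name, and task label are invented) shaped the way the rest of this notebook assumes: subject_data is a JSON object keyed by subject ID, and annotations is a JSON list with one entry per task.

```python
import json

# Hypothetical cell contents; real exports will have different IDs and labels.
sample_subject_data = '{"12345678": {"retired": null, "metadata": "2021_12_01_095605"}}'
sample_annotations = '[{"task": "T0", "task_label": "Is there snow?", "value": "Yes"}]'

subject = json.loads(sample_subject_data)
annotations = json.loads(sample_annotations)

# The subject ID is the only top-level key; its value holds the manifest fields.
subject_id, fields = next(iter(subject.items()))
print(subject_id, fields["metadata"])

# The annotations cell is a list, so the first task's answer is at position 0.
print(annotations[0]["value"])
```

On a real export, `df['subject_data'].iloc[0]` and `df['annotations'].iloc[0]` can be printed in the same way to confirm the key names before going further.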

#### The following code extracts the datetime from the subject_data column.

 * First, the cell is transformed into a dict so that we can call the 'metadata' key. 
 
 * If the metadata in your manifest contains only the image date and time, this value will already be in the form Year_Month_Day_HourMinuteSecond. If you labeled your metadata in another way, you will have to adjust this value so that it only contains the date. 
 
 * We can then parse the string with `datetime.strptime`, which gives us a column containing only datetimes that we will need for our analysis later on.
 
**Important Note:** When uploading images to Zooniverse, the choice of metadata label will have an impact on the following code. It is best practice to label the metadata with only the image date and time, as this will limit the number of steps needed to extract a datetime from the classifications export. However, if something else is chosen, additional lines of code can extract the date from the metadata. While not preferable, as long as the date is somewhere within the metadata label, it can be extracted. For instance, if I had labeled my metadata with both location and date, such as "sagehen_2021_12_01_095605", I would slice off the first eight characters ("sagehen_") as shown below.  

```python
    metadata = shedded_subject_dict['metadata']
    # drop the 8-character "sagehen_" prefix, keeping only the date and time
    date = metadata[8:]
```
The choice of metadata label also determines the datetime format, and the format string passed to `datetime.strptime` can be adjusted to fit any datetime format. The code below assumes the metadata is in the form Year_Month_Day_HourMinuteSecond (such as 2021_12_01_095605).
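
If the length of the prefix varies between labels, slicing at a fixed position will not work. One possible alternative, sketched below, is to search the label for the date pattern with a regular expression. The helper name `parse_label` and the pattern are assumptions for illustration; the pattern only matches the Year_Month_Day_HourMinuteSecond form used in this notebook.

```python
import re
from datetime import datetime

# Matches a Year_Month_Day_HourMinuteSecond stamp anywhere in the label.
STAMP = re.compile(r"\d{4}_\d{2}_\d{2}_\d{6}")

def parse_label(label):
    """Return the datetime embedded in a metadata label, or None if there is none."""
    match = STAMP.search(label)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y_%m_%d_%H%M%S")

print(parse_label("sagehen_2021_12_01_095605"))
print(parse_label("2021_12_01_095605"))
```

Returning None for labels without a date makes it easier to spot problem rows than letting `strptime` raise an error partway through a long export.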

In [18]:
arr = []
row_count = len(df.index)

def extract_meta_data(index):
    """Return the image datetime stored in the subject_data cell of one row."""
    subject_data = df['subject_data'].iloc[index]
    #string to dict
    subject_dict = json.loads(subject_data)
    #shed outer layer of dict (the outer key is the subject ID)
    shedded_subject_dict = dict(ele for sub in subject_dict.values() for ele in sub.items())
    #extract the metadata, see important note in markdown above
    metadata = shedded_subject_dict['metadata']
    #if the label holds more than the date, slice it here (e.g. metadata[8:])
    date = metadata
    #extract datetime, see important note in markdown above
    date_time = datetime.strptime(date, '%Y_%m_%d_%H%M%S')
    return date_time

#### The next function extracts the participant response

 * Once again, the cell must first be transformed into a dict so that we can call the 'value' key.
 
 * This value will be either "Yes", "No", or "Unsure". We need to convert this into a numerical value so that we can average and graph responses later on. We use a simple if/else statement to convert "Yes" to 1.0, "No" to 0.0, and "Unsure" or "It's dark" to NaN.

In [4]:
def extract_value(index):
    """Return 1.0 for "Yes", 0.0 for "No", and NaN for any other response."""
    annotation = df['annotations'].iloc[index]
    #string to list of dicts (one dict per task)
    annotation_list = json.loads(str(annotation))
    #grab yes or no value from the first (and only) task
    annotation_value = annotation_list[0]['value']

    if annotation_value == "Yes":
        return 1.0
    elif annotation_value == "No":
        return 0.0
    else:
        #"Unsure", "It's dark", or any other response
        return float('nan')
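
Because every response other than "Yes" or "No" is converted to a missing value, a misspelled or unexpected answer would be dropped without warning. A quick check, sketched below on a hypothetical annotations column, is to count the raw answers before converting them. The helper name `first_answer` and the sample strings are assumptions for illustration.

```python
import json
import pandas as pd

# Hypothetical annotations column; a real export would come from df['annotations'].
annotations = pd.Series([
    '[{"task": "T0", "value": "Yes"}]',
    '[{"task": "T0", "value": "No"}]',
    '[{"task": "T0", "value": "Yes"}]',
    '[{"task": "T0", "value": "Unsure"}]',
])

def first_answer(cell):
    """Parse the JSON list and return the answer given to the first task."""
    return json.loads(cell)[0]["value"]

# Count how often each raw answer occurs.
counts = annotations.map(first_answer).value_counts()
print(counts)
```

Any label in the counts that is not one of the expected answers is a sign that the conversion rules need another branch.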

#### Now, combine the extracted datetime and response into a new dataframe, and then find and combine the median and mean response for each datetime. 

The resulting dataframe will have one column for datetime, one column for median_value, and one column for mean_value. 

In [6]:
#loop over every row of the export, starting at row 0
for i in range(row_count):
    data = []
    data.append(extract_meta_data(i))
    data.append(extract_value(i))
    arr.append(data)


final = pd.DataFrame(arr, columns = ['datetime','value'])

#average 15 participant responses
median_final = final.groupby('datetime').median().reset_index()
mean_final = final.groupby('datetime').mean().reset_index()
#rename returns a new dataframe, so the result must be assigned
combined_final = median_final.rename(columns = {'value':'median_value'})
combined_final['mean_value'] = mean_final['value']
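
The median and mean can also be computed in a single pass with named aggregation, which avoids building two intermediate dataframes. The sketch below is one possible alternative rather than the method used above; the demo data is made up, and the extra `n_votes` column (the number of non-NaN responses per image) is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-response table with the same columns as `final`.
final_demo = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-12-01 09:56:05"] * 3 + ["2021-12-02 10:00:00"] * 2
    ),
    "value": [1.0, 1.0, np.nan, 0.0, 1.0],
})

# One groupby, three named output columns; NaN responses are ignored by all three.
summary = (
    final_demo.groupby("datetime")["value"]
    .agg(median_value="median", mean_value="mean", n_votes="count")
    .reset_index()
)
print(summary)
```

Keeping a vote count alongside the averages makes it possible to see which images had most of their responses discarded as "Unsure".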

#### Save the cleaned data set as a netCDF file. 

In [10]:
cleaned_export = combined_final.to_xarray()
cleaned_export.to_netcdf('your_file_location\\your_preferred_filename.nc')
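
When a dataframe with a default integer index is converted, the resulting dataset is indexed by row number rather than by time. If a time-indexed dataset is preferred, one option is to set datetime as the index before converting. The sketch below uses a small, made-up dataframe and assumes xarray is installed; it does not write a file.

```python
import pandas as pd

# Hypothetical cleaned table with the same columns as the combined dataframe.
demo = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-12-01 09:56:05", "2021-12-02 09:56:05"]),
    "median_value": [1.0, 0.0],
    "mean_value": [0.8, 0.2],
})

# With datetime as the index, it becomes the dataset's dimension and coordinate.
ds = demo.set_index("datetime").to_xarray()
print(ds)
```

The resulting dataset can then be saved with `to_netcdf` in the same way as above.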