## Data Cleaning (Appendix, Draft)
Project FeederWatch is a citizen-science data source supported by the Cornell Lab of Ornithology. It collects observations of bird species at backyard feeders and nearby habitats across the United States and Canada in an annual November-April survey.

Our raw file comes from the [Project FeederWatch](https://feederwatch.org/explore/raw-dataset-requests/) 2021 New York checklist data and site description data. The file is very large and covers sightings from roughly November 2020 through April 2021. After downloading the file, we used an R function provided by FeederWatch to conduct taxonomic roll-up and zero-filling, two procedures recommended by FeederWatch to limit errors. The R code used for this cleaning is provided [here](https://engagement-center.github.io/Project-FeederWatch-Zerofilling-Taxonomic-Rollup-Public/).

1. *Taxonomic roll-up*: A process of combining observations that were recorded under different species codes but would best be treated as the same species. For example, some observers may take note of subspecies, which are then recorded under different codes than the overall species when they should logically be combined.

2. *Zero-filling*: A process of adding counts of 0 for every species that was not recorded on a given checklist. This is essential because the raw observation data is inherently presence-only.
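To make the two procedures concrete, here is a minimal pandas sketch on toy data. It is not the FeederWatch R function: the species codes, the `rollup_map` dictionary, and the choice to zero-fill against every species seen in the toy table are assumptions made purely for illustration.

```python
import pandas as pd

# Toy checklist data; column names follow this appendix, values are illustrative only.
obs = pd.DataFrame({
    "sub_id": ["S1", "S1", "S2"],
    "species_code": ["daejun", "slcjun", "blujay"],
    "how_many": [2, 3, 1],
})

# Taxonomic roll-up: map a subspecies-level code to its parent species code
# (hypothetical mapping), then sum the counts within each checklist.
rollup_map = {"slcjun": "daejun"}
obs["species_code"] = obs["species_code"].replace(rollup_map)
rolled = obs.groupby(["sub_id", "species_code"], as_index=False)["how_many"].sum()

# Zero-filling: build every (checklist, species) pair and fill the missing counts with 0.
full_index = pd.MultiIndex.from_product(
    [rolled["sub_id"].unique(), rolled["species_code"].unique()],
    names=["sub_id", "species_code"],
)
zero_filled = (
    rolled.set_index(["sub_id", "species_code"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)
print(zero_filled)
```

After the roll-up, checklist `S1` holds a single junco row with the two counts summed, and after zero-filling each checklist has one row per species, with 0 where the species was absent.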

We decided to focus our research on sightings in New York State only, which is still a rather large subset of the data. For now, we have decided to drop `latitude` and `longitude`. We also dropped irrelevant columns, such as `ENTRY_TECHNIQUE` (a variable indicating method of site localization), `PROJ_PERIOD_ID` (calendar year of end of FeederWatch season), `sub_id` and `obs_id` (identifiers for the checklist and the species observation, respectively), `effort_hrs_atleast` (survey time), and `DATA_ENTRY_METHOD` (web/mobile/paper).

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import scipy.stats as stats

In [None]:
# importing sql
%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:

In [None]:
# reading in raw provided data
csv = pd.read_csv("rolled_up_NY_df.csv")

In [None]:
# making dataframe
df = pd.DataFrame(csv)

In [None]:
new_columns = list(map(str.lower, df.columns))
df.columns = new_columns

In [None]:
# dropping irrelevant columns
df.drop(['unnamed: 0', '...1', 'latitude', 'longitude', 'entry_technique', 'proj_period_id', 'reviewed', 'sub_id', 'obs_id',
        'effort_hrs_atleast', 'data_entry_method'], axis= 1, inplace= True)

In [None]:
# dropping observations that are not valid
df = df[df['valid'] == 1]

In [None]:
df.head()

We also created new dataframes to only specify the top species for our data exploration.

`species_limited_df` is a slice of `df` holding the data only for the top (most frequent) species.

In `snow_df`, we have created bins (`snow_category`) to categorize the snowfall for all sightings where snow depth is not null.

The `species_limited_time` dataframe is a manipulated version of `species_limited_df` with a datetime object, allowing for time series calculations and comparisons.

The `df_timeofday` dataframe narrows the data down to entries where a sighting was recorded in exactly one of the four periods of the two-day observation window. In other words, it keeps only the entries such that `day1_am` + `day1_pm` + `day2_am` + `day2_pm` = 1; `df_timeofday_specieslimited` further limits these entries to the top species. This allows us to plot differences in observation counts based on time of day (morning vs. afternoon) in our exploratory section.
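The single-period filter can be sketched on toy data as follows. The four indicator columns mirror the names used in this appendix; using `idxmax` to label the period is one possible approach and is an assumption of this sketch, not necessarily how the analysis cells do it.

```python
import pandas as pd

# Toy rows with the four period indicators (1 = observer watched during that period).
toy = pd.DataFrame({
    "day1_am": [1, 0, 1, 0],
    "day1_pm": [0, 0, 1, 1],
    "day2_am": [0, 1, 0, 0],
    "day2_pm": [0, 0, 0, 0],
})
periods = ["day1_am", "day1_pm", "day2_am", "day2_pm"]

# Keep only rows where exactly one period is flagged; .copy() gives an independent frame.
single = toy[toy[periods].sum(axis=1) == 1].copy()

# idxmax returns, for each row, the name of the column holding the single 1.
single["unique_time"] = single[periods].idxmax(axis=1)
print(single)
```

The third toy row is dropped because two periods are flagged, so its count could not be attributed to a morning or an afternoon.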

In [None]:
# selecting the single most frequent species code by row count (value_counts sorts in descending order)
frequent_species = df['species_code'].value_counts()[:1].index

# creating new dataframe limited to just the most frequent species observations
species_limited_df = df[df['species_code'].isin(frequent_species)]

species_limited_df.to_csv("bluejay_df.csv")

By convention, bird species are stored as 6-letter codes. However, this makes the data harder to read and interpret later on. To remedy this, we do an inner join with a taxonomy table provided by FeederWatch, which adds a column with each species' full common name.
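For readers less familiar with SQL, the same kind of inner join can be sketched in pandas. The toy rows below are made up; only the column names `species_code`, `american_english_name`, and `species_name` come from this appendix, and `validate="many_to_one"` is an optional safeguard added for this sketch.

```python
import pandas as pd

# Toy sightings and a toy translation table (values are illustrative only).
sightings = pd.DataFrame({
    "species_code": ["blujay", "blujay", "norcar"],
    "how_many": [2, 1, 4],
})
translation = pd.DataFrame({
    "species_code": ["blujay", "norcar", "amegfi"],
    "american_english_name": ["Blue Jay", "Northern Cardinal", "American Goldfinch"],
})

# Inner join on species_code; the common name is renamed to species_name.
# validate raises an error if a code appears more than once in the translation table.
named = sightings.merge(
    translation.rename(columns={"american_english_name": "species_name"}),
    on="species_code",
    how="inner",
    validate="many_to_one",
)
print(named)
```

Because the join is an inner join, any sighting whose code is missing from the translation table would be dropped, so comparing row counts before and after is a cheap sanity check.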

In [None]:
# joining common names (pd.read_csv already returns a DataFrame)
species_translate_df = pd.read_csv("PFW-species-translation-table.csv")
%sql species_limited_df << SELECT loc_id, subnational1_code, month, day, year, species_limited_df.species_code, how_many, valid, day1_am, day1_pm, day2_am, day2_pm, snow_dep_atleast, american_english_name AS species_name FROM species_limited_df INNER JOIN species_translate_df ON species_limited_df.species_code = species_translate_df.species_code;

In [None]:
species_limited_df.head()

In [None]:
# dropping rows where snow depth was null (.copy() so the new column is added to an independent dataframe)
snow_df = species_limited_df.dropna(subset=['snow_dep_atleast']).copy()

# creating new category with string corresponding to each value in snow depth (for binning in the line plots)
snow_df['snow_category'] = 'No Snow'
snow_df.loc[snow_df['snow_dep_atleast'] == 0.001, 'snow_category'] = '< 5 cm'
snow_df.loc[snow_df['snow_dep_atleast'] == 5.000, 'snow_category'] = '5 to 15 cm'
snow_df.loc[snow_df['snow_dep_atleast'] == 15.001, 'snow_category'] = '> 15 cm'
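
The snow-depth binning can also be sketched as a single lookup, shown here on toy data. The assumption that a code of `0` means no snow and the use of an ordered categorical are illustrative choices for this sketch. Unlike the cell above, an unexpected depth code would become missing rather than defaulting to 'No Snow'.

```python
import pandas as pd

# Lookup from snow_dep_atleast codes to labels (0 assumed to mean no snow).
snow_labels = {0.0: "No Snow", 0.001: "< 5 cm", 5.0: "5 to 15 cm", 15.001: "> 15 cm"}

# Toy data with one missing snow depth.
toy = pd.DataFrame({"snow_dep_atleast": [0.0, 0.001, 5.0, 15.001, None]})
toy = toy.dropna(subset=["snow_dep_atleast"]).copy()

# Map each code to its label and store it as an ordered category,
# so plots can sort the bins from least to most snow.
toy["snow_category"] = pd.Categorical(
    toy["snow_dep_atleast"].map(snow_labels),
    categories=list(snow_labels.values()),
    ordered=True,
)
print(toy)
```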

In [None]:
# building a datetime column from year, month, and day; it is added to species_limited_df
# directly because the join with sites_df later in this section also selects date_time
species_limited_df['date_time'] = pd.to_datetime(species_limited_df[['year', 'month', 'day']])

# averaging count and snow depth per species and date
# (the numeric columns are selected before .mean() so text columns do not raise an error)
species_limited_time = species_limited_df.groupby(["species_name", "date_time"])[['how_many', 'snow_dep_atleast']].mean()
species_limited_time.head()

In [None]:
# keeping only entries with a single sighting period (exactly one of the 4 time-period flags is set,
# as opposed to sightings aggregated over multiple periods); .copy() avoids writing into a view of df
df_timeofday = df[df['day1_am'] + df['day1_pm'] + df['day2_am'] + df['day2_pm'] == 1].copy()

# creating new category with string corresponding to unique sighting day1_am, day1_pm, day2_am, day2_pm
df_timeofday['unique_time'] = 'day1_am'
df_timeofday.loc[df_timeofday['day1_pm'] == 1, 'unique_time'] = 'day1_pm'
df_timeofday.loc[df_timeofday['day2_am'] == 1, 'unique_time'] = 'day2_am'
df_timeofday.loc[df_timeofday['day2_pm'] == 1, 'unique_time'] = 'day2_pm'

#limit to only top species 
df_timeofday_specieslimited = df_timeofday[df_timeofday['species_code'].isin(frequent_species)]


We also created a joined dataframe `join_df` that combines `species_limited_df` and `sites_df` using an `INNER JOIN` on `loc_id`, which provides us with information about the environment in which the observation entry took place. By doing this, we lose about half of our `species_limited_df` data entries because their location is not described in `sites_df`. 
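
One way to check how many entries an inner join on `loc_id` discards is a left merge with pandas' `indicator` option, sketched below on toy data. The toy frames and the `share_lost` name are assumptions for illustration; `validate="many_to_one"` is included because it raises an error if a `loc_id` appears more than once in the site table, a case in which a join on `loc_id` alone would duplicate sightings.

```python
import pandas as pd

# Toy sightings and site descriptions (values are illustrative only).
sightings = pd.DataFrame({"loc_id": ["L1", "L2", "L3", "L4"], "how_many": [1, 2, 3, 4]})
sites = pd.DataFrame({"loc_id": ["L1", "L3"], "hab_park": [1, 0]})

# A left merge keeps every sighting; the _merge column records whether a site row was found.
check = sightings.merge(
    sites, on="loc_id", how="left", indicator=True, validate="many_to_one"
)
matched = check["_merge"] == "both"

# Share of sightings that an inner join would drop.
share_lost = 1 - matched.mean()

# Rows the inner join would keep.
join_toy = check[matched].drop(columns="_merge")
print(share_lost)
print(join_toy)
```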

In [None]:
# reading in raw provided data
csv_sites = pd.read_csv("PFW_count_site_data_public_2021.csv")

#creating dataframe
sites_df = pd.DataFrame(csv_sites)

# keeping only the columns that will be involved in analysis
sites_df = sites_df[['loc_id', 'proj_period_id', 'yard_type_pavement', 'yard_type_garden', 'yard_type_landsca', 'yard_type_woods', 
'yard_type_desert','hab_dcid_woods', 'hab_evgr_woods', 'hab_mixed_woods', 'hab_orchard', 'hab_park', 'hab_water_fresh', 
'hab_water_salt', 'hab_residential','hab_industrial', 'hab_agricultural', 'hab_desert_scrub', 'hab_young_woods', 'hab_swamp', 
'hab_marsh', 'brsh_piles_atleast', 'water_srcs_atleast', 'bird_baths_atleast', 'nearby_feeders', 'squirrels', 'cats', 'dogs', 'humans',
'housing_density', 'population_atleast']]

sites_df

In [None]:
#join species_limited_df (sightings of top species) with sites_df (location details)
#%sql join_df << SELECT species_limited_df.loc_id, subnational1_code, species_code, how_many, snow_dep_atleast, species_name, date_time, proj_period_id, yard_type_pavement, yard_type_garden, yard_type_landsca, yard_type_woods, yard_type_desert,hab_dcid_woods, hab_evgr_woods, hab_mixed_woods, hab_orchard, hab_park, hab_water_fresh, hab_water_salt, hab_residential,hab_industrial, hab_agricultural, hab_desert_scrub, hab_young_woods, hab_swamp, hab_marsh, brsh_piles_atleast, water_srcs_atleast, bird_baths_atleast, nearby_feeders, squirrels, cats, dogs, humans, housing_density, population_atleast, FROM species_limited_df INNER JOIN sites_df ON species_limited_df.loc_id = sites_df.loc_id;

#changed join_df to have these columns for the hypothesis testing section
%sql join_df << SELECT month, day, year, species_limited_df.loc_id, species_code, species_name, how_many, date_time, day1_am, day1_pm, day2_am, day2_pm, Light_Snow, Heavy_Snow, No_Snow, proj_period_id, yard_type_pavement, yard_type_garden, yard_type_landsca, yard_type_woods, yard_type_desert,hab_dcid_woods, hab_evgr_woods, hab_mixed_woods, hab_orchard, hab_park, hab_water_fresh, hab_water_salt, hab_residential,hab_industrial, hab_agricultural, hab_desert_scrub, hab_young_woods, hab_swamp, hab_marsh, brsh_piles_atleast, water_srcs_atleast, bird_baths_atleast, nearby_feeders, squirrels, cats, dogs, humans, housing_density, population_atleast, FROM species_limited_df INNER JOIN sites_df ON species_limited_df.loc_id = sites_df.loc_id;

In [None]:
join_df.head()