# INFO 2950 Final Project Phase IV
Mary Kolbas (mck86)  
Tammy Zhang (tz332)  


## Introduction

_What is the context of the work? What research question are you trying to answer? What are your main findings? Include a brief summary of your results._

Our project investigates observations of avian species at bird feeders across North America from the winter of 2020 through the spring of 2021. The datasets behind our analysis are all provided by the Cornell Lab of Ornithology's [Project FeederWatch](https://feederwatch.org/about/project-overview/), an extensive citizen-science data source that annually engages residents of the US and Canada in a November-April survey of the birds they see at their backyard feeders. In this survey, Project FeederWatch asks observers to record the species they notice and how many individuals are present, and allows them to enter information on their backyard's characteristics, including the kinds of habitat present and other environmental factors. Of these factors, we decided to focus on local housing density (categorized as "rural", "rural/suburban", "suburban", and "urban") as the key variable of interest relative to the species and counts observed.



### Research Question
Our general research question is divided into two main parts as follows.

**How does local housing density influence the distribution of avian species commonly observed at backyard feeders, on both a broad continental scale and on a smaller New York specific scale?**
- Part A: Is North American bird species diversity at backyard feeders independent of housing density in the immediate area?
- Part B: Do taxonomic groups in New York that share biological characteristics (like lifestyle and preferred diet/habitat) also display marked preferences for feeders with different housing densities, or are they observed at constant rates across backyards with different housing densities?


### Main Findings and Result Summary

For Research Question Part A, we found an association between US-Canada avian species diversity and a feeder's local housing density. This relationship can be approximately modeled with the linear regression $y = 0.291 - 0.013x_1 - 0.024x_2 - 0.126x_3$, where $y$ is the proportion of a region's total species found at feeders of a given housing density and the baseline condition is a feeder in a rural area. Each $x_i$ is an indicator variable for a housing density level ($x_1$ = rural/suburban, $x_2$ = suburban, and $x_3$ = urban). The overall trend suggested by our findings is that species diversity for a region (e.g. a U.S. state) decreases on average with increasing housing density: more urban areas tend to have fewer species observed at feeders.
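
To make the fitted coefficients concrete, the short Python sketch below evaluates the model for each housing density level. The function name `predicted_proportion` and the dictionary layout are illustrative and not taken from our analysis code; only the coefficients come from the model reported above.

```python
# Coefficients from the fitted model reported above; "rural" is the baseline,
# so its indicator contribution is zero.
INTERCEPT = 0.291
COEFFICIENTS = {
    "rural": 0.0,
    "rural/suburban": -0.013,
    "suburban": -0.024,
    "urban": -0.126,
}


def predicted_proportion(housing_density: str) -> float:
    """Predicted share of a region's species seen at feeders of this density."""
    return round(INTERCEPT + COEFFICIENTS[housing_density], 3)


for level in COEFFICIENTS:
    print(f"{level}: {predicted_proportion(level)}")
```

Under this model, an urban feeder is predicted to record about 16.5% of a region's species, compared with about 29.1% for a rural feeder.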

For Research Question Part B, we focused on two main taxonomic groups that are among the most commonly observed at backyard feeders: _Accipiter_ (soaring birds of prey such as hawks) and _Junco_ (small sparrows). Surprisingly, we found no statistically significant preference in either genus for habitats with particular housing densities, even after adjusting for the different levels of total observations taking place across different areas.

Overall, we found that while there may be a relationship between overall avian species diversity at backyard feeders and the housing density around the feeder, it is less clear how individual taxonomic groups preferentially associate with different housing densities. We attribute this difficulty to some of the statistical problems inherent to presence-only data such as those provided by Project FeederWatch, as well as to the sheer number of species included in the datasets; both issues are discussed in greater detail later on.


## Data Description

Our data in `df` consists of observations (rows) of bird sightings in NY state between November 2020 and April 2021. It has the following attributes (columns):
- `loc_id`: Unique identifier for each survey site
- `subnational1_code`: Country abbreviation and State or Province abbreviation of each survey site. 
- `Month`: Month of 1st day of two-day sighting
- `Day`: Day of 1st day of two-day sighting
- `Year`: Year of 1st day of two-day sighting
- `species_code`: Bird species observed, stored as 6-letter species codes
- `how_many`: Maximum number of individuals seen at one time during observation period
- `valid`: Validity of each observation based on flagging system
- `day1_am`: binary indicating if observer watched during morning of count Day 1
- `day1_pm`: binary indicating if observer watched during afternoon of count Day 1
- `day2_am`: binary indicating if observer watched during morning of count Day 2
- `day2_pm`: binary indicating if observer watched during afternoon of count Day 2
- `snow_dep_atleast`: Participant estimate of minimum snow depth


In our case, we have restricted our data to entries that are from NY state (`subnational1_code` = "US-NY") and are valid (`valid` = 1).
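
A minimal sketch of this filter in pandas is shown below. It uses a small made-up table in place of the raw FeederWatch file: the rows and the names `raw_df` and `ny_valid` are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw observations; all values are invented for illustration.
raw_df = pd.DataFrame({
    "loc_id": ["L1", "L2", "L3", "L4"],
    "subnational1_code": ["US-NY", "US-NY", "CA-ON", "XX"],
    "species_code": ["daejun", "coohaw", "daejun", "bkcchi"],
    "valid": [1, 0, 1, 1],
})

# Keep only valid observations reported from New York State.
is_ny = raw_df["subnational1_code"] == "US-NY"
is_valid = raw_df["valid"] == 1
ny_valid = raw_df[is_ny & is_valid]

print(ny_valid)
```

Filtering on the exact code "US-NY" also excludes rows with a mislogged "XX" location, since they never match.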

Our data from `species_translate_df` contains a species translation table provided by FeederWatch. The most relevant columns are as follows:
- `species_code`: variable storing a 6-letter string representing a species code
- `american_english_name`: variable storing a string representing the full common name of the species in English

We use `species_translate_df` in an inner join during Data Cleaning to create `df`.
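
The effect of the inner join can be sketched as follows. The two small tables and the names `observations`, `species_translate`, and `named` are hypothetical stand-ins, not the actual cleaning code, which is described in the appendix.

```python
import pandas as pd

# Invented observations; "zzzzzz" is a deliberately unmatched species code.
observations = pd.DataFrame({
    "loc_id": ["L1", "L2", "L3"],
    "species_code": ["daejun", "coohaw", "zzzzzz"],
    "how_many": [4, 1, 2],
})

# Invented excerpt of a translation table with the two relevant columns.
species_translate = pd.DataFrame({
    "species_code": ["daejun", "coohaw"],
    "american_english_name": ["Dark-eyed Junco", "Cooper's Hawk"],
})

# Inner join: observations whose code has no match in the table are dropped.
named = observations.merge(species_translate, on="species_code", how="inner")

print(named[["species_code", "american_english_name", "how_many"]])
```

Because the join is an inner join, any observation with a species code missing from the translation table silently disappears, so it is worth comparing row counts before and after.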

In [None]:
species_translate_df.head()

In [None]:
df.head()

Our data in `join_df` consists of the observations (rows) of bird sightings in NY state between November 2020 and April 2021 from `df`, with additional columns from `sites_df`, which holds location and environment details keyed by a unique location id. This dataframe has the following additional attributes (columns):


- `proj_period_id`: Calendar year of end of FeederWatch season
- `yard_type_pavement`: binary variable whether location is pavement (no vegetation)
- `yard_type_garden`: binary variable whether location is a garden/courtyard 
- `yard_type_landsca`: binary variable whether location is a landscaped yard
- `yard_type_woods`: binary variable whether location is natural vegetation
- `yard_type_desert`: binary variable whether location is a natural or landscaped desert
- `hab_dcid_woods`: binary variable whether location is within 0.5mi from deciduous woods
- `hab_evgr_woods`: binary variable whether location is within 0.5mi from evergreen woods
- `hab_mixed_woods`: binary variable whether location is within 0.5mi from mixed deciduous-evergreen woods
- `hab_orchard`: binary variable whether location is within 0.5mi from an orchard
- `hab_park`: binary variable whether location is within 0.5mi from a park
- `hab_water_fresh`: binary variable whether location is within 0.5mi from fresh water
- `hab_water_salt`: binary variable whether location is within 0.5mi from salt water
- `hab_residential`: binary variable whether location is within 0.5mi from a residential area
- `hab_industrial`: binary variable whether location is within 0.5mi from an industrial or commercial area
- `hab_agricultural`: binary variable whether location is within 0.5mi from agricultural fields
- `hab_desert_scrub`: binary variable whether location is within 0.5mi from a desert or scrub
- `hab_young_woods`: binary variable whether location is within 0.5mi from secondary growth woods
- `hab_swamp`: binary variable whether location is within 0.5mi from a swamp (wooded)
- `hab_marsh`: binary variable whether location is within 0.5mi from a marsh
- `brsh_piles_atleast`: Minimum number of brush piles within the count area
- `water_srcs_atleast`: Minimum number of water sources within the count area
- `bird_baths_atleast`: Minimum number of bird baths within the count area
- `nearby_feeders`: binary variable whether other feeders (aside from those maintained by participant) within 90m of the count site
- `squirrels`: binary variable whether squirrels take food from the feeders at least 3 times a week
- `cats`: binary variable whether cats are active within 30m of the feeder for at least 30 min 3 times a week
- `dogs`: binary variable whether dogs are active within 30m of the feeder for at least 30 min 3 times a week
- `humans`: binary variable whether humans are active within 30m of the feeder for at least 30 min 3 times a week
- `housing_density`: Participant-defined description of the housing density of the neighborhood, where 1 = "rural", 2 = "rural/suburban", 3 = "suburban", 4 = "urban"
- `population_atleast`: categorical variable expressing participant estimated population of city or town
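
Since `housing_density` is stored as a numeric code, it can help to attach readable labels before plotting or modeling. The sketch below assumes the coding listed above; the table `sites` and the column name `housing_density_label` are hypothetical.

```python
import pandas as pd

# Coding of housing_density as listed in the attribute description above.
HOUSING_DENSITY_LABELS = {
    1: "rural",
    2: "rural/suburban",
    3: "suburban",
    4: "urban",
}

# Invented site rows for illustration.
sites = pd.DataFrame({
    "loc_id": ["L1", "L2", "L3"],
    "housing_density": [1, 4, 2],
})

# Map each numeric code to its label; unmapped codes would become NaN.
sites["housing_density_label"] = sites["housing_density"].map(HOUSING_DENSITY_LABELS)

print(sites)
```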

In [None]:
join_df.head()

This dataset was created by Project FeederWatch both for researchers seeking to conduct formal analyses and to make the data freely accessible to students, journalists, and the general public. The data is collected by contributors and participants. The project is run and supported by the Cornell Lab of Ornithology and Birds Canada. Since 2016, Project FeederWatch has been sponsored by Wild Birds Unlimited and the National Science Foundation.

Because this is community-curated data, many processes may have influenced what was observed and recorded and what was not, including mislogged information. The dataset includes a validity checker that flags odd sightings and has them checked by a reviewer. The data we use for this project drops invalid observations. FeederWatch also notes that there may be mislogged locations in `subnational1_code`, represented as "XX" locations, which we dropped when choosing to look only at NY state data.

Because this is community-curated data, there is also volunteer bias in the observations: certain locations or species may be more common in the dataset because of active and dedicated users. However, many users are also very knowledgeable about birds and can therefore provide a lot of information and identification resources.

Preprocessing was done to change the very large dataset into a smaller, more manageable one with relevant columns. We filtered the data, dropped irrelevant columns, and added columns to bin data or aid in time series analysis. Details on the Data Cleaning process can be found in the appendix. 

Our raw datasource can be found [here](https://cornell.box.com/s/wzdfg3lotvqr6wc5ik680jzit8fu3t7s) via Cornell Box, or on the [FeederWatch Website](https://feederwatch.org/explore/raw-dataset-requests/).

## Data Limitations

There are some notable limitations with our dataset - as FeederWatch notes on their website, the sheer scale of the data collected and the nature of citizen science involving a large number of participants taking unverifiable observations in varying circumstances mean that the data inherently will have imperfections. For example, some species may appear highly similar to each other, which may cause increased rates of misidentification for those species. When proceeding with further analysis, this effect can be limited somewhat by grouping together similar species into general families. FeederWatch also notes that it is likely for meaningful biological patterns to still emerge from the data despite the possibility of erroneous entries.

A significant limitation of the data is described by FeederWatch as follows: "a recorded observation is a function of both the biological event (number of species actually present) and the observation process (probability that an individual, when present, will be observed)". Without using formal estimation of detection probabilities, it cannot be said that higher numbers of observations for a species necessarily indicate that the species is actually present at greater frequencies - we can only make conclusions about observations, not definitively the state of the biological system. 

For example, we cannot say with complete confidence that a species is more frequent in a certain month, only that it is more frequently observed in that month. While this is a subtle nuance, it is important to consider. It is possible that some species visit feeders very frequently but are rarely observed, because they visit at times when people tend not to be watching or because they are quick and difficult to identify.
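
The gap between presence and observation can be illustrated with a toy simulation. Everything in it is hypothetical: two imaginary species visit a feeder equally often, but observers detect one of them far less reliably. The detection probabilities are arbitrary values chosen for illustration, not estimates from FeederWatch data.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

VISITS = 10_000  # both imaginary species truly visit the feeder this many times


def observed_count(visits: int, detection_probability: float) -> int:
    """Number of visits that end up recorded, given a per-visit detection chance."""
    return sum(random.random() < detection_probability for _ in range(visits))


conspicuous = observed_count(VISITS, 0.9)  # easy to spot and identify
elusive = observed_count(VISITS, 0.2)      # quick, or visits when nobody is watching

print(f"conspicuous species recorded: {conspicuous}")
print(f"elusive species recorded:     {elusive}")
```

Although both species are equally present in this simulation, the recorded counts differ sharply, which is why observation counts alone cannot be read as abundance.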

Without taking these limitations into consideration, there is a risk of erroneously representing the data and misleading people about the state of avian biodiversity.

## Preregistration statement

H0 : p1 = p2  
The proportion of _Accipiter_ observations in each habitat type is equal to the proportion of that habitat type across all observations -> _Accipiter_ is observed at equal frequencies across sites with differing housing densities.

HA : p1 != p2  
The proportion of _Accipiter_ observations in each habitat type is not equal to the proportion of that habitat type across all observations -> _Accipiter_ is more frequently observed at sites with certain housing densities.

We hypothesize this may be the case because birds within the genus _Accipiter_ are birds of prey that need environments where they can find food, which may be easier in environments with small birds congregating around bird feeders or in urban areas where rodents thrive.


H0 : p1 = p2  
The proportion of _Junco_ observations in each habitat type is equal to the proportion of that habitat type across all observations -> _Junco_ is observed at equal frequencies across sites with differing housing densities.

HA : p1 != p2  
The proportion of _Junco_ observations in each habitat type is not equal to the proportion of that habitat type across all observations -> _Junco_ is more frequently observed at sites with certain housing densities.


We hypothesize this may be the case because birds within the genus _Junco_ (small North American sparrows) are common birds that are typically not fazed by humans; we therefore think they may be more likely to be observed in more populated areas.
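
As an illustration of how such a comparison of proportions could be carried out, the sketch below computes a chi-square goodness-of-fit statistic by hand. The counts are invented, and this is one possible approach under stated assumptions, not necessarily the procedure used in our evaluation of significance.

```python
# Invented counts for illustration.
# Index order: rural, rural/suburban, suburban, urban.
genus_counts = [40, 45, 50, 15]        # observations of one genus per density level
all_counts = [2000, 3000, 4000, 1000]  # all observations per density level

total_genus = sum(genus_counts)
total_all = sum(all_counts)

# Under H0, the genus is spread across density levels like observations overall.
expected = [total_genus * n / total_all for n in all_counts]

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (obs - exp) ** 2 / exp for obs, exp in zip(genus_counts, expected)
)

# Critical value of the chi-square distribution with 3 degrees of freedom
# (4 categories - 1) at a significance level of 0.05.
CRITICAL_VALUE = 7.815
reject_h0 = chi_square > CRITICAL_VALUE

print(round(chi_square, 3), reject_h0)
```

With these made-up counts the statistic is about 5.0, which is below the critical value, so the null hypothesis would not be rejected.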

## Data analysis
Use summary functions like mean and standard deviation along with
visual displays like scatterplots and histograms to describe data.
Provide at least one model showing patterns or relationships between
variables that addresses your research question. This could be a
regression or clustering, or something else that measures some property
of the dataset.

### Bar Graphs to compare proportion of observations in each habitat vs. proportion of that habitat type across all observations

## Evaluation of significance
Use hypothesis tests, simulation, randomization, or
any other techniques we have learned to compare the patterns you observe in the
dataset to simple randomness.

## Interpretation and conclusions
What did you find over the course of your data
analysis, and how confident are you in these conclusions? Detail your results
more so than in the introduction, now that the reader is familiar with your
methods and analysis. Interpret these results in the wider context of the real-life
application from where your data hails.

## Limitations

## Source Code

## Acknowledgments

## Questions for reviewers

1. Do you see any potentially interesting narratives in our data that we should focus on / emphasize more in further analysis?
2. Do you have concerns about our research questions being too broad or ambitious in scope?
3. Are there any logical/mathematical issues that stand out to you about how the data is currently being cleaned and manipulated?