# Motivation and Problem Statement
**Plan A:** My initial proposal is to use trail data from the WTA website (trailhead coordinates, hike length, elevation gain, rating, and number of reviews) and compare it with each trailhead’s distance from a point in Seattle. I’m interested in finding out whether there are any correlations between how far a hike is from Seattle and its attributes and popularity. It may be interesting to see how the popularity of a hike changes with its distance from Seattle, and whether hikes that are farther away or longer tend to receive better ratings. For instance, there is a known effect where individuals perceive more expensive items as higher quality, even if there isn’t a tangible difference; the effort needed to reach or complete a hike could play a similar role. However, the analysis would rely on the very large assumption that most hikers are coming from the Seattle area, which isn’t entirely reflective of reality since hikers also come from other cities.

**Plan B:** In case Plan A is too ambitious or I am unable to receive proper permission for using the website data (see Data Selected for Analysis for details on why that is), I have a backup! The U.S. Forest Service (USFS) operates an annual lottery for overnight camping permits in the Enchantments. The Enchantments are a popular area for hikers in the state of Washington, but to protect the plants and animals there, the USFS limits entry via a permit system. Each year, hopeful hikers must register for the lottery for a chance to get a permit. Hikers must select their top 3 choices for an entry day and the zone within the Enchantments they want to visit, the “Core Zone” being the most popular. For the 2019 and 2020 cycles, the raw applicant data was also released publicly as PDFs. While the USFS posts a summary PDF on the most popular days, which gives potential applicants an idea of what days to avoid, I wonder if analysis can identify the combinations that maximize a hiker’s chances of being selected. Since the lottery strives to fill each day’s allotted permits to the maximum, I also wonder whether single-hiker applicants have an increased chance of entry if they are used to fill in the remainders, and if so, by how much.

I am interested in both Plan A and Plan B because both topics relate to the outdoors. I’m a frequent hiker, and Plan A can reveal some details on the relationship between how much effort is needed to get to a hike (and complete it) and its popularity or perceived awesomeness. Meanwhile, Plan B has the more direct benefit of learning what day, zone, and party size combinations might be best for scoring an Enchantments permit, which I have not managed to do yet.


# Data Selected for Analysis
Plan A would require scraping the WTA website’s trail directory for relevant information like hike length, elevation gain, rating, number of reviews, and coordinates. All of these details are located on each individual hike’s page, so every page would have to be retrieved first. As far as I’m aware, the WTA has not released a publicly available database or API. Thus, Plan A would require (1) advance permission, which I’ve asked about but am still waiting on a reply for, (2) scraping with a Python script, and (3) cleaning the data and performing the actual analysis. The closest text to a license posted on the WTA website is the WTA Terms of Service, which states that "You may view, copy, download, and print Content that is available on this website, subject to the following conditions:
- The Content may be used solely for internal informational purposes. No part of this website or its Content may be reproduced or transmitted in any form, by any means, electronic or mechanical, including photocopying and recording, for any other purpose.
- The Content may not be modified.
- Copyright, trademark, and other proprietary notices may not be removed.
Nothing contained on this website should be construed as granting, by implication, estoppel, or otherwise, any license or right to use this website or any Content displayed on this website, through the use of framing or otherwise, except: (a) as expressly permitted by these Terms of Use; or (b) with our prior written permission or the permission of such third party that may own the trademark or copyright of material displayed on this website."

There are many sources of trail information, although fewer since the recent acquisition of HikingProject by onX (see: https://www.hikingproject.com/data). Among them, the WTA provides the best source of information for trails within Washington, since that is the website’s focus, whereas other sources cover a national inventory of trails. I am not aware of any ethical considerations in using the data gathered, if permission is granted. User-generated content will not be collected and the data describes places rather than people, which reduces worry about privacy. All data would also already be available to other users, as it would be pulled from a publicly available area. If express permission is not granted by WTA, there are two concerns: a Python scraper may increase the burden on the website through increased web traffic, and use of the collected data may not be in accordance with the Terms of Service if the data itself is released publicly rather than kept private with only the results released.

For Plan B, the permit lottery data is available from [here](https://web.archive.org/web/20201020211744/https://www.fs.usda.gov/Internet/FSE_DOCUMENTS/fseprd695975.pdf) (2019) and [here](https://www.fs.usda.gov/Internet/FSE_DOCUMENTS/fseprd695975.pdf) (2020). The PDFs can be converted into spreadsheet format via Google Sheets. While the license is not explicitly stated within the page or the PDF, the USFS is part of the U.S. Department of Agriculture, whose website has a “Digital Rights and Copyright” section that states, “Most information presented on the USDA Web site is considered public domain information. Public domain information may be freely distributed or copied, but use of appropriate byline/photo/image credits is requested. Attribution may be cited as follows: “U.S. Department of Agriculture… Some material on the USDA Web site are protected by copyright, trademark, or patent, and/or are provided for personal use only… and USDA has made every attempt to identify and clearly label them.” The lottery data is assumed to be in the public domain, given that the lottery is held by the USFS, the document is not labeled as being copyrighted, and the dataset does not contain any personal information. The dataset is appropriate for Plan B because it is the only known dataset containing Enchantments permit lottery data and it comes directly from the lottery operator (the USFS). There do not appear to be any immediate ethical concerns with using the dataset, aside from potentially increasing one’s chances of gaining a permit should the resulting analysis be accurate. If the best days to select or other methods of increasing one’s chances of receiving a permit are released publicly, an influx of applicants may tailor their submissions to increase their chances of obtaining a permit.


# Unknowns and Dependencies
Plan A contains more unknowns, since it is a lengthier project that would require scraping for data. The need to scrape already introduces complications, such as permission not being granted or the scraper not working as intended. Because Plan A contains this additional scraping step, it is also at higher risk than Plan B of running behind schedule or not being completed by the end of the quarter. As a graduating senior, I also have commitments to my capstone, to finding a full-time job post-graduation, and to moving out of my current apartment and back home, all of which will require a significant time investment.

Overall, while Plan A is more interesting, Plan B seems to be more feasible since it relies on an already-provided, smaller dataset and has a smaller set of possible questions to answer (i.e., less time is needed to choose what exactly to analyze).



# Research Questions/Hypotheses
The Washington Trails Association's (WTA) webmaster did not grant permission to scrape the website for its trail-related data. In addition, the time commitment for Plan A was already large. As a result, I will be choosing Plan B, which explores the Enchantments permit lottery system. Questions include:

* What is the average probability of winning the permit lottery for trips starting on each day of the week?
* What is the likelihood of winning the permit lottery on a weekend day compared to a weekday?
* What is the likelihood of winning the permit lottery depending on group size?
* Is a single-person party more likely to win the lottery than a group of 2+ people?
* Which days provide the greatest chance of winning the permit lottery? 
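
Each of these questions comes down to a grouped success rate: the share of applications awarded within a category such as day of the week or party size. The sketch below shows one way this could be computed with Python's standard library. The sample rows are invented for illustration and are not taken from the USFS data, each application is simplified to a single entry-date choice rather than three, and treating only Saturday and Sunday as the weekend is an assumption that could be changed (for example, to include Friday).

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Hypothetical records: (entry date, group size, awarded?).
# Invented for illustration only; NOT real lottery data.
SAMPLE_APPLICATIONS = [
    (date(2020, 7, 3), 1, True),    # Friday
    (date(2020, 7, 3), 4, False),
    (date(2020, 7, 4), 2, False),   # Saturday
    (date(2020, 7, 4), 8, False),
    (date(2020, 7, 7), 1, True),    # Tuesday
    (date(2020, 7, 7), 3, True),
    (date(2020, 7, 8), 2, False),   # Wednesday
    (date(2020, 7, 8), 1, True),
]


def success_rate_by(applications, key):
    """Return {category: share of applications awarded} for a grouping function."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for entry_date, group_size, awarded in applications:
        label = key(entry_date, group_size)
        totals[label] += 1
        wins[label] += int(awarded)
    return {label: wins[label] / totals[label] for label in totals}


def day_of_week(entry_date, group_size):
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return DAY_NAMES[entry_date.weekday()]


def weekend_or_weekday(entry_date, group_size):
    # Assumption: only Saturday (5) and Sunday (6) count as the weekend.
    return "weekend" if entry_date.weekday() >= 5 else "weekday"


def solo_or_group(entry_date, group_size):
    return "solo" if group_size == 1 else "group (2+)"


for grouping in (day_of_week, weekend_or_weekday, solo_or_group):
    print(grouping.__name__, success_rate_by(SAMPLE_APPLICATIONS, grouping))
```

Passing a grouping function keeps the counting logic in one place, so a new question (such as success rate by zone) would only need another small function.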

# Background
With a growing population and scenic viewpoints increasingly being shared on social media comes more traffic to recreational areas and greater strain on the resources within them. As a result, permit systems have been implemented by various land-managing agencies like the National Park Service, Bureau of Land Management, and U.S. Forest Service. With the creation of the Recreation.gov website in the early 2000s, it has become even easier for land-managing agencies to set up and run reservation lotteries via the integrated system. Since each permit lottery varies slightly, there does not appear to be a specific study of the recreational permit lottery system as a whole; meanwhile, the Enchantments lottery is too small in scale to have attracted much research attention.

However, some related real-life phenomena lend possible credence to the research questions being asked. For instance, single-person parties might have greater chances of winning the permit lottery because of their increased flexibility: if there is only one space left for the day, no group of 2 or more can "win" that last spot. In the realm of theme parks, places like Disneyland offer single rider queues where lone individuals can fill in the remaining seats for a ride. On [Mousehacking.com](https://www.mousehacking.com/blog/disney-world-single-rider-lines), it's claimed that the single rider line can cut wait times by 25-50%.

There's also clear evidence that outdoor permit lotteries are tougher to win on weekends. People are likely to prefer venturing outdoors when they're not working, or when they can minimize time off, and for most people, that's Friday night to Sunday. Yosemite National Park runs a [similar lottery system](https://www.nps.gov/yose/planyourvisit/hdpermits.htm) for Half Dome, where in the 2018 season the "average success rate on weekdays was 47%, but only 24% on weekends." Likewise, the Mt. Whitney lottery [shows similar trends](https://www.fs.usda.gov/Internet/FSE_DOCUMENTS/fseprd617167.pdf), with Fridays and Saturdays experiencing increased demand and competition.
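
To make the single-person reasoning concrete, here is a toy sketch of how a day's quota might fill up. The mechanics are an assumption made purely for illustration (applications are handled in draw order, and a group that no longer fits is skipped while the draw continues); the actual USFS procedure may differ. The function name `fill_day`, the quota of 8, and the draw order are all hypothetical.

```python
def fill_day(group_sizes_in_draw_order, daily_quota):
    """Award spots in draw order, skipping any group that no longer fits.

    Assumed mechanics for illustration only, not the documented USFS process.
    Returns a list of booleans, one per application, in draw order.
    """
    remaining = daily_quota
    awarded = []
    for size in group_sizes_in_draw_order:
        fits = size <= remaining
        awarded.append(fits)
        if fits:
            remaining -= size
    return awarded


# Hypothetical day: 8 spots, five applications drawn in this order.
draw_order = [4, 3, 2, 1, 5]
results = fill_day(draw_order, daily_quota=8)
for size, won in zip(draw_order, results):
    print(f"group of {size}: {'awarded' if won else 'not awarded'}")
```

In this made-up draw, the groups of 4 and 3 leave one spot, the group of 2 is skipped, and the solo applicant drawn after it takes the last spot. Whether the real lottery behaves this way is exactly what the data would need to show.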

Thus, the name of the game seems to be flexibility: increased flexibility allows individuals to participate in the permit lottery in a way that maximizes their chance of winning. Yet how much of an advantage do the choice of day and group size actually provide in the Enchantments lottery? Is there even an advantage as imagined? That's what this research hopes to find out.

# Methodology
Much of the analysis will involve summarizing the raw row-by-row data into more digestible forms and then comparing the numbers to spot any differences. To check whether differences between scenarios (for example, weekday versus weekend success rates) are statistically significant, a t-test will be used. The data may be visualized in a variety of diagrams and charts, including a table of the days with the best odds of winning the permit lottery, a bar chart comparing the successful and unsuccessful application percentages by group size, and a time series graph showing the number of permit applications over the course of the season. The table will help answer the question of which days provide the greatest chance of winning the lottery. Meanwhile, the bar charts can help convey the likelihood of winning a permit by group size and day of the week. The time series chart can show a more holistic view of the hiking season than single day-by-day views or summaries; for instance, it can show the highest peaks (most competitive periods) and lowest valleys (least competitive periods). To conduct the analysis, the data from the U.S. Forest Service will likely be cleaned up and stored inside a Python dictionary, which will allow the statistical work to be done with Python's statistical and visualization libraries.
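
As a rough sketch of the significance check, the block below computes a t statistic by hand for two groups of win/loss outcomes coded as 1 and 0. It uses Welch's unequal-variance form of the t-test, which is my choice for the illustration since the text above does not specify a variant, and the outcome lists are invented rather than drawn from the lottery data. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` could be used for that step.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test.

    Each sample needs at least two values, and at least one sample must have
    non-zero variance; otherwise the statistic is undefined.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Invented outcomes (1 = permit awarded, 0 = not awarded); NOT real data.
weekday_outcomes = [1, 0, 1, 1, 0, 1, 0, 1]
weekend_outcomes = [0, 0, 1, 0, 0, 0, 1, 0]

t_stat, dof = welch_t(weekday_outcomes, weekend_outcomes)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

A positive statistic here means the first group (weekdays, in this made-up example) had the higher success rate; how large it must be to count as significant depends on the degrees of freedom and the chosen threshold.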