## RQ2 Data Collection

Wikiloc data extraction for Germany. **See scrapy_setup_info.md for setting up scrapy, including edits I made to the settings to fix 403 error messages and make the scraping more polite.**

The Scrapy spiders are provided by Chai-Allah et al. (2023) through their GitHub repo: [Wiki4CES](https://github.com/achaiallah-hub/Wiki4CES)

From what I understand, the spiders provided in the Wiki4CES repo do the following:
1. **extract_link.py** Extracts the URLs for all the trails. You give it a starting region (an initial URL) and it goes through each city/town in that region and extracts all the trail links (URLs) in that city's listing. This spider needs to be run first to get the URLs for steps 2 and 3.
2. **wikiloc_track.py** Scrapes the trail details like track name, difficulty, distance, author, views and description. It loads the trail URLs from a file called link.csv (presumably created in step 1).
3. **wikiloc_image.py** Scrapes image data from the trail pages, including URL, track name, user name, date, and location (latitude & longitude). It reads the trail pages from a file called link.csv (presumably created in step 1).
4. **download_image.py** Downloads images from the URLs in a csv file called wikiloc_image.csv (presumably created in step 3). Not needed for my work, so I have removed this file.

**NOTE:** Running one spider initially seemed to try to run all the spiders at once, producing error messages saying certain files don't exist (which makes sense, as these files need to be created by other spiders first). This is most likely because Scrapy imports every spider module in the project before it starts the requested crawl, so any code at module or class level (such as reading link.csv to build start_urls) is executed even for spiders that are not being run. At first I commented out the code within the other spiders. UPDATE: It is okay once the errors have been resolved (Scrapy doesn't actually run the other spiders, it only loads them, so their code must be valid and the files they read must exist), so I'm leaving finished scripts uncommented as I correct them.

In [1]:
# SETUP

# Import packages
import os
import pandas as pd
import glob

import geopandas as gpd
from shapely.geometry import LineString, Point
from rasterstats import zonal_stats

# Create folders for storing scrapy outputs
path_list = ["./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs"]

for path in path_list:
  if not os.path.exists(path):
    # makedirs (unlike mkdir) also creates any missing parent folders
    os.makedirs(path)
    print("Folder %s created!" % path)
  else:
    print("Folder %s already exists" % path)

Folder ./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs already exists


#### Step 1: Updating extract_link.py

**Step 1a: Update start_urls (including Germany region info)** 
Edit extract_link.py to replace the start_urls. Originally this contained https://www.wikiloc.com/trails/france/auvergne-rhone-alpes, but this URL no longer seems to exist as it just redirects to https://www.wikiloc.com/trails/outdoor. 

The URL format now needs to be https://www.wikiloc.com/trails/outdoor/ + *country_name* + / + *region_name*, so for Germany I will try https://www.wikiloc.com/trails/outdoor/germany and then each of the regions within.

For Germany, Wikiloc has trails for the following regions:

| No.   | Region                 | URL ending              | Approx. trail count |
| ----- | ---------------------- | ----------------------- | ------------------- |
| 1     | Baden-Wurttemberg      | /baden-wurttemberg      | 77,800      |
| 2     | Bavaria                | /bavaria                | 79,400      |
| 3     | Berlin                 | /berlin                 | 12,000      |
| 4     | Brandenburg            | /brandenburg            | 8,340       |
| 5     | Bremen                 | /bremen                 | 1,500       |
| -     | DE.16,11               | (don't use)             | 5           |
| 6     | Hamburg                | /hamburg                | 7,310       |
| 7     | Hessen                 | /hessen                 | 26,300      |
| 8     | Mecklenburg-Vorpommern | /mecklenburg-vorpommern | 6,320       |
| 9     | Niedersachsen          | /niedersachsen          | 31,400      |
| 10    | Nordrhein-Westfalen    | /nordrhein-westfalen    | 86,800      |
| 11    | Rheinland-Pfalz        | /rheinland-pfalz        | 55,400      |
| 12    | Saarland               | /saarland               | 5,960       |
| 13    | Sachsen                | /sachsen                | 21,000      |
| 14    | Saxony-Anhalt          | /saxony-anhalt          | 7,260       |
| 15    | Schleswig-Holstein     | /schleswig-holstein     | 9,650       |
| 16    | Thüringen              | /thuringen              | 5,650       |

*NOTE* The number of trails seems to be increasing steadily (for example, within a one-week period, the total count for Germany went up by ~1000). The numbers shown here (taken 25 April 2025) may therefore not be up to date and are just meant to give a sense of the size of each region and the time it will take to scrape.

**Also check to make sure no new regions are added!**

DE.16,11 appears to be a few trails in Berlin. I don't think I need to bother with this as there are so few of them, they are in an urban area, and none of the routes look like they have anything to do with forests.
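
As a minimal sketch of how the region URLs could be assembled from the "URL ending" column of the table above: the helper name `region_start_urls` and the constants are my own and are not part of the Wiki4CES spiders. The resulting list could be pasted into start_urls, or looped over one region at a time.

```python
# Hypothetical helper (not part of Wiki4CES): build one start URL per region.
BASE_URL = "https://www.wikiloc.com/trails/outdoor/germany"

# URL endings copied from the region table (DE.16,11 deliberately left out)
REGION_ENDINGS = [
    "/baden-wurttemberg", "/bavaria", "/berlin", "/brandenburg",
    "/bremen", "/hamburg", "/hessen", "/mecklenburg-vorpommern",
    "/niedersachsen", "/nordrhein-westfalen", "/rheinland-pfalz",
    "/saarland", "/sachsen", "/saxony-anhalt", "/schleswig-holstein",
    "/thuringen",
]


def region_start_urls(base_url, endings):
    """Join the base URL and each region ending without doubling slashes."""
    return [base_url.rstrip("/") + "/" + ending.strip("/") for ending in endings]


start_urls = region_start_urls(BASE_URL, REGION_ENDINGS)
print(len(start_urls))
print(start_urls[4])
```

Keeping the endings in one list also makes it easier to spot when Wikiloc adds a new region.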

**Step 1b: Update xpath expressions & other edits**
After overcoming the initial 403 error messages (see scrapy_setup_info.md), the spider seemed to correctly generate the URLs for all the cities within the region, but still didn't return the trail URLs. I started looking into the XPath expressions in extract_link.py, as I wondered if the page structure had changed over time (like the URLs).

I found this video useful for understanding XPath: https://www.youtube.com/watch?v=4EvxqTSzUkI 
I then went to https://www.wikiloc.com/trails/outdoor/germany/bremen and did right click > Inspect to see the HTML (I selected Bremen as the testing region as it has the fewest trails). After searching for the components on the main Bremen page and then on one city page (for example: https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt), I made a couple of changes to the XPaths in extract_link.py (see comments in script). I also made some changes to the pagination handling (see comments in script). In addition, since the URLs saved initially were just the back half of the URL, without the beginning (e.g. /cycling-trails/bremen-achim-18077390), I adjusted the code to add the beginning part as well. This ended up being required for the other spiders to work properly. 
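
One way to add the beginning of the URL is the standard library's `urljoin`, sketched below. The names `WIKILOC_ROOT` and `to_absolute` are my own for illustration and this is not necessarily how the edited spider does it. Inside a Scrapy callback, `response.urljoin(href)` gives the same result relative to the page being parsed.

```python
from urllib.parse import urljoin

# Assumed site root; the relative example path is the one quoted above
WIKILOC_ROOT = "https://www.wikiloc.com"


def to_absolute(href, root=WIKILOC_ROOT):
    """Turn a relative trail link into a full URL; absolute links pass through unchanged."""
    return urljoin(root, href)


print(to_absolute("/cycling-trails/bremen-achim-18077390"))
print(to_absolute("https://www.wikiloc.com/cycling-trails/bremen-achim-18077390"))
```

Both calls print the same full URL, so the helper is safe to apply even if some links are already complete.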


#### Step 2: Complete workflow for running extract_link.py

*For now, this is just for the Bremen region.*

**Step 2a**
In extract_link.py:
1. Update start_urls: 'https://www.wikiloc.com/trails/outdoor/germany/bremen' and save.

**Step 2b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o crawling_outputs\link-bremen.csv

**Step 2c**
Remove duplicates: I am not sure why duplicates are occurring, but the code below simply removes them.

*NOTE:* For Bremen at the time of scraping (31 MARCH 2025), the website shows 1460 trails, however I get 1532 trails (after the duplicates are removed) - this means there are an extra 72 trails. I'm not sure why this is but I wonder if it has something to do with trails which cross borders (and therefore are in more than 1 region of Germany). It could be that these trails can be searched for in both regions but are only included in the count of 1 to avoid double-counting? **I should check for duplicates across regions to make sure all trails are unique.**

In [7]:
# STEP 2C: Remove CSV duplicates 

# Create a list of the csv paths with all the scraped trail links
link_csv_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-*.csv')

# Load csvs from list, remove duplicates and then write results to same file (overwrite)
for csv_path in link_csv_paths:
    loaded_csv = pd.read_csv(csv_path, sep="\t")
    loaded_csv.drop_duplicates(inplace=True)
    loaded_csv.to_csv(csv_path, index=False)

# Check
#link_bremen = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-bremen.csv', sep="\t")
#link_bremen
#link_nieder = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-niedersachsen.csv', sep="\t")
#link_nieder
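
For the cross-region check mentioned in the note above, a possible approach is to record which regions each URL appears in and keep only the URLs seen in more than one. This is a sketch with made-up example links. The function name `cross_region_duplicates` is hypothetical, and in practice the lists would come from the link-*.csv files.

```python
from collections import defaultdict


def cross_region_duplicates(links_by_region):
    """Return {url: sorted list of regions} for URLs found in more than one region."""
    regions_by_url = defaultdict(set)
    for region, urls in links_by_region.items():
        for url in urls:
            regions_by_url[url].add(region)
    return {
        url: sorted(regions)
        for url, regions in regions_by_url.items()
        if len(regions) > 1
    }


# Made-up example data (not real scraped links)
example_links = {
    "bremen": [
        "https://www.wikiloc.com/cycling-trails/example-a-1",
        "https://www.wikiloc.com/hiking-trails/example-b-2",
    ],
    "niedersachsen": [
        "https://www.wikiloc.com/hiking-trails/example-b-2",
        "https://www.wikiloc.com/hiking-trails/example-c-3",
    ],
}

shared = cross_region_duplicates(example_links)
print(shared)
```

If the result is non-empty, the shared trails can be assigned to a single region before the later steps so that no trail is counted twice.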

#### Step 3: updating wikiloc_track.py

I edited wikiloc_track.py to update the xpaths (as with the extract_link.py - see comments in script). I also added code so that I could also scrape additional information: 
- date recorded
- photo/waypoint captions (title and body)
- comments
- **photo/waypoint latitudes and longitudes**
- **start point latitude and longitude** (unfortunately the end point is not stored in the html)

Because I handled all the coordinate extraction in this script, I did not use or update the wikiloc_image.py script (and I since deleted it from my repo). I extracted all latitude values in one column, and all longitude values in another column. I can then extract the minimum and maximum values from each column in order to create a bounding box.
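
The bounding box idea can be sketched in a few lines. The function name `bounding_box` and the (min_lon, min_lat, max_lon, max_lat) ordering are my own choices for illustration, and the coordinates are just example values.

```python
def bounding_box(latitudes, longitudes):
    """Return (min_lon, min_lat, max_lon, max_lat) for one trail's coordinate lists."""
    if not latitudes or not longitudes:
        raise ValueError("At least one coordinate pair is needed")
    return (min(longitudes), min(latitudes), max(longitudes), max(latitudes))


# Illustrative values: two coordinate pairs from around Bremen
lats = [53.094486, 53.22507]
lons = [8.801539, 8.813878]

bbox = bounding_box(lats, lons)
print(bbox)
```

A trail with only a start point gives a box with zero width and height, which is one reason buffering is needed before any area-based step.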

Additionally, I removed the author extraction completely so that no personal information is collected. 

Although I updated the XPaths for the following features, I commented them out as I don't think I'll need them for my analysis:
- trail difficulty
- view counts
- download counts
- trail length/distance

#### Step 4: Complete workflow for running wikiloc_track.py

*For now, this is just for the Bremen region.*

**Step 4a**
In wikiloc_track.py:
1. Change CSV name in start_urls to: crawling_outputs\link-bremen.csv

**Step 4b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki_track -o crawling_outputs\track-bremen.json

**Needs to be output as json** otherwise (as csv) the utf-8 encoding doesn't seem to work properly and the German special characters are not handled well. 

For Bremen (1532 trails), with download delays and autothrottle on, this stage takes about **65 minutes**.
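
To get a rough sense of how long the larger regions might take, the Bremen timing can be scaled up. This is only a back-of-the-envelope sketch: it assumes the time per trail stays constant, which may not hold with autothrottle, and the trail count used for Nordrhein-Westfalen is the approximate value from the region table.

```python
# Timing measured for Bremen (from the note above)
BREMEN_TRAILS = 1532
BREMEN_MINUTES = 65

# Average seconds needed per trail page
seconds_per_trail = BREMEN_MINUTES * 60 / BREMEN_TRAILS


def estimated_hours(n_trails, sec_per_trail=seconds_per_trail):
    """Linear estimate of scraping time in hours (assumes constant time per trail)."""
    return n_trails * sec_per_trail / 3600


print(f"{seconds_per_trail:.2f} s per trail")
# Approximate count for Nordrhein-Westfalen, the largest region in the table
print(f"{estimated_hours(86_800):.1f} h for ~86,800 trails")
```

At roughly 2.5 seconds per trail, the largest regions would each take a couple of days, so it may be worth running them separately.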

#### Step 5: Filter for 2018 & distance

Some initial filtering can be applied to reduce the amount of data that goes through the generating geometries step.

This function filters on two fields:
1. Filter for 2018 only. Data from the year 2018 is needed so that the social media data matches the forest definition data (which is for 2018).
2. Filter out very long trails (for now, >175 km). These tend to correspond to motorised transport (which, when covering large distances, may be difficult to link to CES for forests) or to unexpected use of the website/errors. For example, this trail https://www.wikiloc.com/hiking-trails/xabia-teulada-126272868 is recorded as a ~5 hour hike, but it goes from Germany to Spain. This trail already gets removed with the 2018 filter, but there may be others like it. 


In [2]:
# STEP 5: FILTER 2018 & DISTANCE

# Create a list of the json paths with all scraped data
track_json_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/track-*.json')

# Load jsons from list, select only 2018 data & trails less than certain distance, return new json
# This outputs to the PROCESSING folder!
def dist_year_filter(json_paths):
    for json_path in json_paths:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output
        output_path = "./processing/" + name_wo_ext + "_2018_distfilter.json" 

        # Load json as df
        track_df = pd.read_json(json_path) 

        # Select rows where date_recorded includes "2018" (rows with a missing date are dropped)
        track_2018_df = track_df[track_df["date_recorded"].str.contains("2018", na=False)]

        # Select rows where distance is less than 175 km
        track_2018short_df = track_2018_df[track_2018_df["distance_km"] < 175]
        
        # Save the filtered df as a json
        track_2018short_df.to_json(output_path)

# Run the function
dist_year_filter(track_json_paths)

# Load the json and check
#bremen_2018_df = pd.read_json("./processing/track-bremen_2018_distfilter.json")
#bremen_2018_df
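
The two filter rules can also be written out for a single record, which makes the edge cases easier to test. This is a plain-Python sketch and not the code used above: the helper names are hypothetical, and it uses a slightly stricter year check (a whole four-digit year via a regular expression) instead of a substring match.

```python
import re


def recorded_year(date_recorded):
    """Extract a four-digit year from text such as 'April 2018'; None if absent."""
    if not isinstance(date_recorded, str):
        return None
    match = re.search(r"\b(?:19|20)\d{2}\b", date_recorded)
    return int(match.group(0)) if match else None


def keep_trail(trail, year=2018, max_km=175):
    """True if the trail was recorded in `year` and is shorter than `max_km`."""
    distance = trail.get("distance_km")
    if distance is None:
        return False
    return recorded_year(trail.get("date_recorded")) == year and distance < max_km


# Made-up records using the same field names as the scraped JSON
trails = [
    {"date_recorded": "April 2018", "distance_km": 44.19},
    {"date_recorded": "May 2019", "distance_km": 12.0},
    {"date_recorded": "June 2018", "distance_km": 1800.0},
    {"date_recorded": None, "distance_km": 5.0},
]

print([keep_trail(t) for t in trails])  # [True, False, False, False]
```

Records with a missing date or distance are dropped rather than raising an error.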

#### Step 6: Generating geometries (buffered lines/points)

In step 4, I extracted all latitude values in one column, and all longitude values in another column. Since the coordinates are in order (first the start coordinate, then the waypoints in order), a line can be created wherever multiple coordinate pairs exist. Where only one coordinate pair exists (the start coordinates), a point can be created. Both points and lines can then be buffered in order to generate polygon geometries for all trails.

**IMPORTANT** For now the buffer is set to 15m (on each side of line, or the radius for points) as I found this was the approximate minimum sight distance used in Xiang, 1996 "for hikers’ unobstructed forward and rear view of the surroundings".
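
As a rough sanity check on what a 15 m buffer means in terms of area, the idealised areas can be worked out by hand. This sketch assumes perfectly round buffer ends and a single straight segment. Real buffered polygons will be slightly smaller, because circles are approximated by short straight edges, and lines with bends overlap themselves at the corners.

```python
import math

BUFFER_M = 15  # buffer distance used in this step


def buffered_point_area_m2(radius_m):
    """Area of a circle of the given radius."""
    return math.pi * radius_m ** 2


def buffered_segment_area_m2(length_m, radius_m):
    """Rectangle along the segment plus two half-circle end caps."""
    return 2 * radius_m * length_m + math.pi * radius_m ** 2


point_ha = buffered_point_area_m2(BUFFER_M) / 10_000
km_ha = buffered_segment_area_m2(1000, BUFFER_M) / 10_000

print(f"Buffered point: {point_ha:.4f} ha")
print(f"Buffered 1 km straight line: {km_ha:.4f} ha")
```

So a trail with only a start point covers about 0.07 ha, while each kilometre of line adds about 3 ha.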

The code/function below generates a buffered line/point geometry for each trail within each json and outputs a shapefile.


Technical notes for pairing up lat and long coordinates:
- apply with axis= 1 applies the function to each row
- lambda row sets up an anonymous function to do something for each row
- zip pairs the items in each list by index (so the lats and longs get paired up according to their order in the list)
- the list part converts the output from zip into a list
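
The pairing step can be seen on its own with two short lists (illustrative values). One thing to be aware of is that zip silently stops at the shorter list, so the hypothetical helper below checks the lengths first.

```python
def pair_coordinates(latitudes, longitudes):
    """Pair longitudes and latitudes by position as (lon, lat) tuples."""
    if len(latitudes) != len(longitudes):
        # zip would otherwise drop the unmatched values without warning
        raise ValueError("latitudes and longitudes have different lengths")
    return list(zip(longitudes, latitudes))


latitudes = [53.094486, 53.22507]
longitudes = [8.801539, 8.813878]

coordinates = pair_coordinates(latitudes, longitudes)
print(coordinates)  # [(8.801539, 53.094486), (8.813878, 53.22507)]
```

Longitude comes first in each pair because shapely expects (x, y) order.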

In [3]:
### STEP 6 ALT: buffered point/line geometries

# Create a list of the json paths with all the filtered trails
track2018_json_paths = glob.glob('./processing/track-*_2018_distfilter.json')

# Function which creates point geom if only one coordinate pair available, otherwise line geom
def geom_from_coords(coords):
    if len(coords) == 1:
        return Point(coords[0])
    else:
        return LineString(coords)

# Load jsons from list, generate buffer geometries and saves to processing folder as shp
def buffergeom_generator(json_path_list):
    for json_path in json_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_buffer.shp"

        # Load json as df
        track_df = pd.read_json(json_path)

        # Pair-up the lats and longs into coordinates
        track_df["coordinates"] = track_df.apply(lambda row: list(zip(row["longitudes"], row["latitudes"])), axis=1)

        # Run the geom_from_coords function (defined above) on all rows
        track_df['geometry'] = track_df['coordinates'].apply(geom_from_coords)

        # Convert to geodataframe, declaring that the coordinates are WGS84 lat/long
        track_gdf = gpd.GeoDataFrame(track_df, geometry="geometry", crs="EPSG:4326")

        # Reproject to ETRS89-LAEA (units in metres) so the buffer distance is in metres
        track_gdf = track_gdf.to_crs("EPSG:3035")

        # Buffer all rows so that all geometries are now polygons
        buffer_geoms = track_gdf.buffer(15)

        # Replace geometries in gdf with buffered geometries
        track_gdf = track_gdf.set_geometry(buffer_geoms)

        # Write to shapefile
        track_gdf.to_file(output_path, driver="ESRI Shapefile")

# Run the buffer geometry generator
buffergeom_generator(track2018_json_paths)
        



**Note: exporting the data as a shp truncates the long text fields (e.g. "description" "photo_caption", "comments", etc).** I also tried writing the file as a geojson, which works fine, but then when loading back in as a geodataframe, there are problems with the list structures in the "photo_caption" and "comments" fields. 

For now, I think the best work-around is to use the shapefile to do the spatial intersection steps, and then to **join the results back to the main json** (i.e. STEP 5 OUTPUTS, filtered for year and distance) for any text analysis steps. I think the track URLs can be used for the join field as these are unique. 
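
The join back to the full records could look something like the sketch below. It is plain Python with made-up records, and the function name `join_on_url` is hypothetical. Shapefile field names are limited to 10 characters (so "description" becomes "descriptio"), but "url_track" is short enough to survive unchanged, which is what makes it usable as the key.

```python
def join_on_url(spatial_rows, full_records, key="url_track"):
    """Attach the untruncated fields from the JSON records to each spatial row."""
    full_by_url = {record[key]: record for record in full_records}
    joined = []
    for row in spatial_rows:
        full = full_by_url.get(row[key])
        if full is not None:
            # Where both have the same field, keep the full (untruncated) version
            joined.append({**row, **full})
    return joined


# Made-up examples: one row as read back from a shapefile, two full JSON records
spatial_rows = [
    {"url_track": "https://www.wikiloc.com/hiking-trails/example-1",
     "descriptio": "A long descr", "max_class": "3"},
]
full_records = [
    {"url_track": "https://www.wikiloc.com/hiking-trails/example-1",
     "description": "A long description that the shapefile cut off"},
    {"url_track": "https://www.wikiloc.com/hiking-trails/example-2",
     "description": "A trail that did not pass the spatial filters"},
]

joined = join_on_url(spatial_rows, full_records)
print(joined[0]["description"])
print(joined[0]["max_class"])
```

Only trails that passed the spatial filters are kept, and each one carries both its spatial results and its full text.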

#### Step 7: Additional Filtering

Required steps:
1. Filter to only include trails which intersect with Natura 2000 areas. 
2. Filter for consensus forest and non-consensus forest. Here I can calculate the zonal statistics of the consensus map for the buffered trail geometries (filtered to only those which intersect with Natura sites). I can drop the results for the extra classes and just focus on classes 3 and 6, and drop trails which have no class 3 or 6 pixels. For the remaining trails, I can calculate the area/percentage of class 3 and 6 to get a sense of which forest class is most relevant to each trail. 

More filtering to consider:
1. Filter certain activity types?
2. Remove trails without any associated text (this depends if I end up using text or just trail counts)
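
The conversion from pixel counts to area and percentage in required step 2 is simple arithmetic, sketched here with made-up numbers. It assumes the consensus raster has 5 m pixels, as the "5m" in its file name suggests, so one pixel covers 25 m2, or 0.0025 ha. The function names are hypothetical.

```python
PIXEL_SIZE_M = 5  # assumption: resolution taken from the raster file name
PIXEL_AREA_HA = PIXEL_SIZE_M ** 2 / 10_000  # 25 m2 = 0.0025 ha


def class_coverage(pixel_count, total_area_ha):
    """Return (area in ha, percentage of the buffered trail) for one class."""
    area_ha = pixel_count * PIXEL_AREA_HA
    return area_ha, area_ha / total_area_ha * 100


def dominant_class(count_3, count_6):
    """Return '3' or '6' for the class with more pixels, or None if neither occurs."""
    if count_3 == 0 and count_6 == 0:
        return None
    return "3" if count_3 >= count_6 else "6"


# Made-up example: 1000 class-3 pixels and 200 class-6 pixels in a 50 ha buffer
area_3, percent_3 = class_coverage(1000, 50.0)
area_6, percent_6 = class_coverage(200, 50.0)

print(area_3, percent_3)
print(area_6, percent_6)
print(dominant_class(1000, 200))
```

Trails for which `dominant_class` returns None have no class 3 or 6 pixels and would be dropped.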

In [4]:
# STEP 7: NATURA INTERSECT

# Create a list of the shp paths with all the buffer trail geometries
track2018_buffer_paths = glob.glob('./processing/track-*_2018_distfilter_buffer.shp')

# Load Germany Natura sites
natura = gpd.read_file("./outputs/natura2000_3035_DE.shp", 
                       columns=["SITECODE", "SITENAME", "MS", "SITETYPE"])

# Load shps from list, check for intersections, remove duplicates & save as shp
def natura_intersects(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_natura.shp"

        # Load shp as gdf
        track_buffer_gdf = gpd.read_file(shp_path)

        # Inner join means any non-intersecting geometries will be dropped
        # Duplication: trail geoms are duplicated for every different Natura site they overlap with
        inter_trails = track_buffer_gdf.sjoin(natura, how="inner", predicate="intersects")

        # Get rid of duplicate trail geometries
        inter_trails = inter_trails.drop_duplicates(subset=["url_track"])

        # Also drop the Natura info (to avoid confusion as site listed may not be the only site)
        inter_trails = inter_trails.drop(columns=["index_right", "SITECODE", 
                                                  "SITENAME", "MS", "SITETYPE"])

        # Write to shapefile
        inter_trails.to_file(output_path, driver="ESRI Shapefile")

# Run intersecting function to remove trails which do not intersect 
natura_intersects(track2018_buffer_paths)


In [None]:
# STEP 7: CONSENSUS MAP ZONAL STATISTICS (FOREST INTERSECTS)

# Path to forest consensus map
consensus_map = "./outputs/forest_consensus_3035_DE_5m_2018.tif"

# Test
bremen_trails = gpd.read_file("./processing/track-bremen_2018_distfilter_buffer_natura.shp")

# Calculate the consensus map zonal stats (pixel count per class) for each buffered trail
# With categorical=True (and no geojson_output) this returns one {class: count} dict per trail
consensus_trail_test_stats = zonal_stats(bremen_trails, consensus_map, categorical=True)

# Convert list of dictionaries to dataframe
zonalstats_test_df = pd.DataFrame(consensus_trail_test_stats)

# Rename the columns for clarity, using the class value each column actually holds
# (column order follows the order classes were first found, so don't rename by position)
zonalstats_test_df.columns = [f"{int(c)}_count" for c in zonalstats_test_df.columns]

# Make sure all seven classes (0-6) have a column, then replace the NaN values with 0
class_cols = [f"{c}_count" for c in range(7)]
zonalstats_test_df = zonalstats_test_df.reindex(columns=class_cols).fillna(0)

# Join stats with trails (join based on index)
trails_stats = bremen_trails.join(zonalstats_test_df)

# Drop unneeded counts (keep only class 3 and 6)
trails_stats = trails_stats.drop(columns=["0_count", "1_count", "2_count", 
                                          "4_count", "5_count"])

# Remove any trails which don't have either class 3 or 6 coverage
ind_to_drop = trails_stats[(trails_stats['3_count'] == 0.0) & (trails_stats['6_count'] == 0.0)].index
trails_stats_filter = trails_stats.drop(ind_to_drop)

# Convert the counts per class to area (hectares)
# One 5 m x 5 m pixel (resolution taken from the raster file name) = 25 m2 = 0.0025 ha
trails_stats_filter["3_area_ha"] = trails_stats_filter["3_count"] * 0.0025
trails_stats_filter["6_area_ha"] = trails_stats_filter["6_count"] * 0.0025

# Calculate the total area of each geometry
trails_stats_filter["total_area_ha"] = trails_stats_filter.area * 0.0001

# Calculate the percentage coverage of class 3 and 6
trails_stats_filter["3_percent"] = (trails_stats_filter["3_area_ha"] / \
                                           trails_stats_filter["total_area_ha"]) * 100
trails_stats_filter["6_percent"] = (trails_stats_filter["6_area_ha"] / \
                                           trails_stats_filter["total_area_ha"]) * 100

# Add a column for the class with the highest percentage
trails_stats_filter["max_class"] = trails_stats_filter[["3_percent", "6_percent"]].idxmax(axis=1) 

# Remove the "_percent" suffix from the values created in the last step (this column only)
trails_stats_filter["max_class"] = trails_stats_filter["max_class"].str.replace("_percent", "")


trails_stats_filter


Unnamed: 0,track_name,url_track,track_type,date_publi,descriptio,distance_k,date_recor,photo_capt,comments,latitudes,...,coordinate,geometry,3_count,6_count,3_area_ha,6_area_ha,total_area_ha,3_percent,6_percent,max_class
1,Findorff-Blockland-St. Jürgen-Ritterhude-Findorff,https://www.wikiloc.com/cycling-trails/findorf...,Road Bike,2018-04-07T17:44+0200,Finndorff-Blockland-St. Jürgen-Ritterhude-Find...,44.19,April 2018,"['Pause am Wümme Deich', 'Am Wümme Deich', 'We...","['Nice trail, nice view.']","[53.094692, 53.135328, 53.13783, 53.147445, 53...",...,"[(8.808772, 53.094692), (8.872359, 53.135328),...","POLYGON ((4243755.279 3338326.772, 4243754.704...",902.0,12.0,2.255,0.03,70.808673,3.184638,0.042368,3
2,Flughafenrunde,https://www.wikiloc.com/cycling-trails/flughaf...,Road Bike,2018-06-02T21:07+0200,Flughafenrunde,30.03,April 2018,"['Foto', 'Foto', 'Foto', 'Abzweig zum Park lin...",['None'],"[53.092769, 53.069321, 53.069129, 53.063334, 5...",...,"[(8.803195, 53.092769), (8.761088, 53.069321),...","POLYGON ((4237965.53 3329685.548, 4237964.15 3...",325.0,0.0,0.8125,0.0,62.014026,1.310187,0.0,3
3,HBFindorff - OHZ Mühlencafé - OHZ Bahnhof,https://www.wikiloc.com/cycling-trails/hbfindo...,Road Bike,2018-04-04T13:59+0200,,29.72,April 2018,"['OHZ', 'Café', 'Foto']",['None'],"[53.094486, 53.22507, 53.229502]",...,"[(8.801539, 53.094486), (8.813878, 53.22507), ...","POLYGON ((4241763.697 3346957.133, 4240261.637...",175.0,0.0,0.4375,0.0,48.532067,0.901466,0.0,3
5,"WSV Hasenbühren - Bremen, Stephaniebrücke über...",https://www.wikiloc.com/cycling-trails/wsv-has...,Road Bike,2018-06-02T20:59+0200,"WSV Hasenbühren - Bremen, Stephaniebrücke über...",16.91,May 2018,"['Knotenpunkt Abzweig zur Flughafenrunde', 'Un...",['None'],"[53.121748, 53.063357, 53.063338, 53.079259]",...,"[(8.665211, 53.121748), (8.747375, 53.063357),...","POLYGON ((4237005.192 3329049.327, 4231626.285...",776.0,0.0,1.94,0.0,35.629803,5.44488,0.0,3
6,Rund um Bremen,https://www.wikiloc.com/cycling-trails/rund-um...,Road Bike,2018-05-11T16:45+0200,Rund um Bremen,50.28,May 2018,"['Start', 'Silbersee mit Bade- / Picknick Mögl...",['None'],"[53.083553, 53.083553, 53.008837, 52.940169, 5...",...,"[(8.787574, 53.083553), (8.787574, 53.083553),...","POLYGON ((4237015.913 3322983.308, 4241897.959...",33.0,322.0,0.0825,0.805,53.979273,0.152836,1.491313,6


In [None]:
# STEP 7: CONSENSUS MAP ZONAL STATISTICS (FOREST INTERSECTS)

# Create a list of the shp paths with all the buffered trails which intersect Natura sites
track2018_natura_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_natura.shp')

# Path to forest consensus map
consensus_map = "./outputs/forest_consensus_3035_DE_5m_2018.tif"

# Load shps from list, calculate zonal stats for consensus map, remove trails which don't have
# class 3 or 6 coverage, find the dominant coverage class, save output as shp 
def forest_coverage(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        # (the input name already ends in "_natura", so only "_forest" is appended)
        output_path = "./processing/" + name_wo_ext + "_forest.shp"

        # Load shp as gdf
        track_gdf = gpd.read_file(shp_path)

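        # Calculate the consensus map zonal stats (pixel count per class) for each trail
        # With categorical=True (and no geojson_output) this returns one {class: count} dict per trail
        zonalstats = zonal_stats(track_gdf, consensus_map, categorical=True)
        
        # Convert list of dictionaries to dataframe
        zonalstats_df = pd.DataFrame(zonalstats)

        # Rename the columns for clarity, using the class value each column actually holds
        # (column order follows the order classes were first found, so don't rename by position)
        zonalstats_df.columns = [f"{int(c)}_count" for c in zonalstats_df.columns]
        
        # Make sure all seven classes (0-6) have a column, then replace the NaN values with 0
        class_cols = [f"{c}_count" for c in range(7)]
        zonalstats_df = zonalstats_df.reindex(columns=class_cols).fillna(0)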

        # Join stats with trails (join based on index)
        trails_stats = track_gdf.join(zonalstats_df)

        # Drop unneeded counts (keep only class 3 and 6)
        trails_stats = trails_stats.drop(columns=["0_count", "1_count", "2_count", 
                                                  "4_count", "5_count"])

        # Remove any trails which don't have either class 3 or 6 coverage
        ind_to_drop = trails_stats[(trails_stats['3_count'] == 0.0) & (trails_stats['6_count'] == 0.0)].index
        trails_stats_filter = trails_stats.drop(ind_to_drop)

        # Convert the counts per class to area (hectares)
        # One 5 m x 5 m pixel (resolution taken from the raster file name) = 25 m2 = 0.0025 ha
        trails_stats_filter["3_area_ha"] = trails_stats_filter["3_count"] * 0.0025
        trails_stats_filter["6_area_ha"] = trails_stats_filter["6_count"] * 0.0025

        # Calculate the total area of each geometry
        trails_stats_filter["total_area_ha"] = trails_stats_filter.area * 0.0001

        # Calculate the percentage coverage of class 3 and 6
        trails_stats_filter["3_percent"] = (trails_stats_filter["3_area_ha"] / \
                                                trails_stats_filter["total_area_ha"]) * 100
        trails_stats_filter["6_percent"] = (trails_stats_filter["6_area_ha"] / \
                                                trails_stats_filter["total_area_ha"]) * 100

        # Add a column for the class with the highest percentage
        trails_stats_filter["max_class"] = trails_stats_filter[["3_percent", "6_percent"]].idxmax(axis=1) 

        # Remove the "_percent" suffix from the values created in the last step (this column only)
        trails_stats_filter["max_class"] = trails_stats_filter["max_class"].str.replace("_percent", "")

        # Write to shapefile
        trails_stats_filter.to_file(output_path, driver="ESRI Shapefile")

# Run forest coverage function to remove trails which don't overlap with class 3 or 6 
# and to find out which class (3 or 6) has the most coverage
forest_coverage(track2018_natura_paths)

