## RQ2 Data Collection

Wikiloc data extraction for Germany. **See scrapy_setup_info.md for setting up scrapy, including edits I made to the settings to fix 403 error messages and make the scraping more polite.**

The Scrapy spiders are provided by Chai-Allah et al. (2023) through their GitHub repo: [Wiki4CES](https://github.com/achaiallah-hub/Wiki4CES)

From what I understand, the spiders provided in the Wiki4CES repo do the following:
1. **extract_link.py** Extracts the URLs for all the trails. You give it a starting region (an initial URL) and it goes through each city/town in that region and extracts all the trail links (URLs) in that city's listing. This spider needs to be run first to get the URLs for steps 2 and 3.
2. **wikiloc_track.py** Scrapes the trail details like track name, difficulty, distance, author, views and description. It loads the trail URLs from a file called link.csv (presumably created in step 1).
3. **wikiloc_image.py** Scrapes image data from the trail pages, including URL, track name, user name, date, and location (latitude & longitude). It reads the trail pages from a file called link.csv (presumably created in step 1).
4. **download_image.py** Downloads images from the URLs in a CSV file called wikiloc_image.csv (presumably created in step 3). Not needed for my work, so I have removed this file.

**NOTE:** Running one spider seems to try to run all the spiders at once, producing error messages saying certain files don't exist (which makes sense, as these files need to be created by other spiders first). This is probably because Scrapy imports every module in the spiders folder to find the spider with the requested name, so any module-level code (such as reading link.csv) runs on import. I tried looking for a solution, but at first I just commented out the code within the other spiders. UPDATE: It seems to be okay once the errors have been resolved (it doesn't actually run the other spiders, but it does seem to check that the required files exist and that the code is valid), so I'm leaving finished scripts uncommented as I correct them.

In [1]:
# SETUP

# Import packages
import os
import pandas as pd
import glob

import geopandas as gpd
from shapely.geometry import LineString, Point
from rasterstats import zonal_stats

# Create folders for storing scrapy outputs
path_list = ["./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs"]

for path in path_list:
  if not os.path.exists(path):
    os.mkdir(path)
    print("Folder %s created!" % path)
  else:
    print("Folder %s already exists" % path)

Folder ./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs already exists


#### Step 1: Updating extract_link.py

**Step 1a: Update start_urls (including Germany region info)** 
Edit extract_link.py to replace the start_urls. Originally this contained https://www.wikiloc.com/trails/france/auvergne-rhone-alpes - this URL doesn't seem to exist anymore, as it just redirects to https://www.wikiloc.com/trails/outdoor. 

The URL format now needs to be https://www.wikiloc.com/trails/outdoor/ + *country_name* + / + *region_name*, so for Germany I will try https://www.wikiloc.com/trails/outdoor/germany and then each of the regions within.

For Germany, Wikiloc has trails for the following regions:

| No.   | Region                 | URL ending              | Rough count |
| ----- | ---------------------- | ----------------------- | ----------- |
| 1     | Baden-Wurttemberg      | /baden-wurttemberg      | 77,800      |
| 2     | Bavaria                | /bavaria                | 79,400      |
| 3     | Berlin                 | /berlin                 | 12,000      |
| 4     | Brandenburg            | /brandenburg            | 8,340       |
| 5     | Bremen                 | /bremen                 | 1,500       |
| -     | DE.16,11               | (don't use)             | 5           |
| 6     | Hamburg                | /hamburg                | 7,310       |
| 7     | Hessen                 | /hessen                 | 26,300      |
| 8     | Mecklenburg-Vorpommern | /mecklenburg-vorpommern | 6,320       |
| 9     | Niedersachsen          | /niedersachsen          | 31,400      |
| 10    | Nordrhein-Westfalen    | /nordrhein-westfalen    | 86,800      |
| 11    | Rheinland-Pfalz        | /rheinland-pfalz        | 55,400      |
| 12    | Saarland               | /saarland               | 5,960       |
| 13    | Sachsen                | /sachsen                | 21,000      |
| 14    | Saxony-Anhalt          | /saxony-anhalt          | 7,260       |
| 15    | Schleswig-Holstein     | /schleswig-holstein     | 9,650       |
| 16    | Thüringen              | /thuringen              | 5,650       |
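
As a rough illustration, the start URLs for all sixteen regions could be assembled from the URL endings in the table. This is only a sketch: `region_start_url`, `BASE_URL` and `REGION_SLUGS` are names made up for this example, and the URL pattern is the one described above, so it may change again.

```python
# Sketch only: build the Wikiloc start URL for each German region.
# BASE_URL follows the pattern described above; all names here are hypothetical.
BASE_URL = "https://www.wikiloc.com/trails/outdoor"

# URL endings from the table (the DE.16,11 entry is deliberately left out)
REGION_SLUGS = [
    "baden-wurttemberg", "bavaria", "berlin", "brandenburg", "bremen",
    "hamburg", "hessen", "mecklenburg-vorpommern", "niedersachsen",
    "nordrhein-westfalen", "rheinland-pfalz", "saarland", "sachsen",
    "saxony-anhalt", "schleswig-holstein", "thuringen",
]


def region_start_url(region_slug, country="germany"):
    """Return the region listing URL for one URL ending (leading slash optional)."""
    return f"{BASE_URL}/{country}/{region_slug.lstrip('/')}"


start_urls = [region_start_url(slug) for slug in REGION_SLUGS]
print(start_urls[4])
```

Each entry of `start_urls` could then be pasted into the spider one region at a time, which is how the workflow further down is run.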

*NOTE* The number of trails seems to be increasing steadily (for example, within a one-week period, the total count for Germany went up by ~1000). So the numbers shown here may not be up to date (taken 25 April 2025) and are just to give a sense of the size of each region and the time it will take to scrape.

**Also check to make sure no new regions are added!**

DE.16,11 appears to be a few trails in Berlin - I don't think I need to bother with this as there are so few, they are in an urban area, and none of the routes really look like they have anything to do with forests.

**Step 1b: Update xpath expressions & other edits**
After overcoming initial 403 error messages (see scrapy_setup_info.md), the spider seemed to correctly generate the URLs for all the cities within the region, but still didn't return the trail URLs. I started looking into the xpath expressions in extract_link.py, as I wondered whether the page structure had changed a bit over time (like the URLs).

I found this video useful for understanding xpath: https://www.youtube.com/watch?v=4EvxqTSzUkI 
I then went to https://www.wikiloc.com/trails/outdoor/germany/bremen and did right click > Inspect to see the HTML (I selected Bremen as the testing region as it has the fewest trails). After searching for the components on the main Bremen page and then on the page for one city (for example: https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt), I made a couple of changes to the xpaths in extract_link.py (see comments in script). I also made some changes to the pagination handling (see comments in script). Also, since the URLs saved initially were just the back half of the URL, without the beginning (e.g. /cycling-trails/bremen-achim-18077390), I adjusted the code to add the beginning part as well. This ended up being required for the other spiders to work properly. 
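
One way to add the missing beginning of the URL is the standard library's `urljoin`. This is a minimal sketch and not necessarily how the edited spider does it: `WIKILOC_ROOT` and `absolute_trail_url` are hypothetical names, and the site root is an assumption based on the URLs above.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs used elsewhere in this notebook
WIKILOC_ROOT = "https://www.wikiloc.com"


def absolute_trail_url(href):
    """Turn a relative trail link into a full URL; full URLs pass through unchanged."""
    return urljoin(WIKILOC_ROOT, href)


print(absolute_trail_url("/cycling-trails/bremen-achim-18077390"))
```

Inside a spider, Scrapy's `response.urljoin(href)` should give the same result relative to the page being parsed.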


#### Step 1 ADDITION: extract_link_large.py

For large/popular states like Bavaria and Nordrhein-Westfalen, I ran into problems where many trail links were missing after the link extraction. The gaps were much larger than the small numbers missing for other regions - for example, for Bavaria I was expecting around 80,000 trails but only got ~10,000. I found out this is because Wikiloc uses a pagination cap of 1000 pages, so for especially popular activities, even if there are more than 1000 pages' worth of trails, URLs like https://www.wikiloc.com/trails/hiking/germany/bavaria?page=1001 simply don't work. Given that there are 10 trails per page, a maximum of 10,000 trails can be accessed through the 1000 pages. For most region/activity combinations this is not a problem, but for hiking trails in Bavaria (for example) there are about 32,000 trails, meaning 22,000 trails are missed with this approach.
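
The effect of the cap can be written out as simple arithmetic. This is only a sketch: `missed_by_pagination` is a made-up helper, and the 1000-page cap and 10 trails per page are the values described above.

```python
PAGE_CAP = 1000        # highest page number that still loads
TRAILS_PER_PAGE = 10   # trails listed on each results page


def missed_by_pagination(total_trails, page_cap=PAGE_CAP, per_page=TRAILS_PER_PAGE):
    """Return how many trails cannot be reached through the paginated listing."""
    reachable = min(total_trails, page_cap * per_page)
    return total_trails - reachable


# Hiking trails in Bavaria (about 32,000) vs. a small listing that fits under the cap
print(missed_by_pagination(32_000))
print(missed_by_pagination(1_460))
```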

To resolve this issue I created another scraper for handling large regions specifically. Instead of supplying a URL to the region and filtering all trails by activity, this approach uses a directory of places within a region, accessing each in turn. For each place within a region, the scraper checks whether further filtering is possible by activity. If available it filters the trails for the place by activity, and if not it extracts the trails directly. 

**NOTE** It seems like the best way is to run both extract_link.py and extract_link_large.py - this seems to cover all bases and get the most links. This could be because not all trails are linked to a place beyond simply the region, so the second script might not get everything either. Just make sure to remove duplicates AFTER merging the lists.
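
A minimal sketch of the merge-then-deduplicate order, using plain lists instead of the CSV files. `merge_link_lists` is a hypothetical helper and the link strings are placeholders, not real trails.

```python
def merge_link_lists(*link_lists):
    """Concatenate several lists of trail URLs, then drop repeats (first-seen order is kept)."""
    merged = [url for links in link_lists for url in links]
    # dict.fromkeys keeps one entry per URL and preserves insertion order
    return list(dict.fromkeys(merged))


# Placeholder outputs of the two spiders for the same region
links_regular = ["trail-a", "trail-b", "trail-b"]
links_large = ["trail-b", "trail-c"]

print(merge_link_lists(links_regular, links_large))
```

Deduplicating only after merging means a trail found by both spiders is counted once, which deduplicating each list on its own would not guarantee.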

#### Step 2: Workflow for running extract_link.py or extract_link_large.py

*For now, this is just for the Bremen region.*

**Step 2a**
In extract_link.py:
1. Update start_urls: 'https://www.wikiloc.com/trails/outdoor/germany/bremen' and save. If using extract_link_large the start URL looks more like this 'https://www.wikiloc.com/directory/bjnYAg' (this is the URL which shows a list of all the places within a region/state)

**Step 2b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o crawling_outputs\link-bremen.csv

**Step 2c**
Remove duplicates: I am not sure why duplicates are occurring, but the code below simply removes them.

*NOTE:* For Bremen at the time of scraping (31 MARCH 2025), the website shows 1460 trails, however I get 1532 trails (after the duplicates are removed) - this means there are an extra 72 trails. I'm not sure why this is but I wonder if it has something to do with trails which cross borders (and therefore are in more than 1 region of Germany). It could be that these trails can be searched for in both regions but are only included in the count of 1 to avoid double-counting? **I should check for duplicates across regions to make sure all trails are unique.**
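
The cross-region check could look something like the sketch below. It works on plain lists of URLs keyed by region name; `cross_region_duplicates` is a hypothetical helper and the URLs are placeholders.

```python
from collections import defaultdict


def cross_region_duplicates(links_by_region):
    """Map each URL found in more than one region to the sorted list of those regions."""
    regions_by_url = defaultdict(set)
    for region, urls in links_by_region.items():
        for url in urls:
            regions_by_url[url].add(region)
    return {
        url: sorted(regions)
        for url, regions in regions_by_url.items()
        if len(regions) > 1
    }


# Placeholder data: one trail is listed under both regions
example = {
    "bremen": ["trail-a", "trail-b"],
    "niedersachsen": ["trail-b", "trail-c"],
}
print(cross_region_duplicates(example))
```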

In [3]:
# STEP 2C: Remove CSV duplicates 

# Create a list of the csv paths with all the scraped trail links
link_csv_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-*.csv')

# Load csvs from list, remove duplicates and then write results to same file (overwrite)
for csv_path in link_csv_paths:
    loaded_csv = pd.read_csv(csv_path, sep="\t")
    loaded_csv.drop_duplicates(inplace=True)
    loaded_csv.to_csv(csv_path, index=False)

# Check
#link_bremen = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-bremen.csv', sep="\t")
#link_bremen
#link_nieder = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-niedersachsen.csv', sep="\t")
#link_nieder

#### Step 3: updating wikiloc_track.py

I edited wikiloc_track.py to update the xpaths (as with extract_link.py - see comments in script). I also added code to scrape additional information: 
- date recorded
- photo/waypoint captions (title and body)
- comments
- **photo/waypoint latitudes and longitudes**
- **start point latitude and longitude** (unfortunately the end point is not stored in the html)

Because I handled all the coordinate extraction in this script, I did not use or update the wikiloc_image.py script (and I since deleted it from my repo). I extracted all latitude values in one column, and all longitude values in another column. I can then extract the minimum and maximum values from each column in order to create a bounding box.
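
The bounding box idea can be sketched as follows. `bounding_box` is a hypothetical helper and the coordinates are toy values; the output order (min x, min y, max x, max y) is chosen to match the convention shapely uses for `bounds`.

```python
def bounding_box(latitudes, longitudes):
    """Return (min_lon, min_lat, max_lon, max_lat) for one trail's coordinates."""
    if not latitudes or not longitudes:
        raise ValueError("need at least one coordinate pair")
    return (min(longitudes), min(latitudes), max(longitudes), max(latitudes))


# Toy values: start point followed by two waypoints
lats = [53.07, 53.10, 53.08]
lons = [8.80, 8.85, 8.79]
print(bounding_box(lats, lons))
```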

Additionally, I removed the author extraction completely so that no personal information is collected. 

Although I updated the xpaths for the following features, I commented them out as I don't think I'll need them for my analysis:
- trail difficulty
- view counts
- download counts
- trail length/distance

#### Step 4: Workflow for running wikiloc_track.py

*For now, this is just for the Bremen region.*

**Step 4a**
In wikiloc_track.py:
1. Change CSV name in start_urls to: crawling_outputs\link-bremen.csv

**Step 4b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki_track -o crawling_outputs\track-bremen.json

**Needs to be output as json** otherwise (as csv) the utf-8 encoding doesn't seem to work properly and the German special characters are not handled well. 

For Bremen (1532 trails), with download delays and autothrottle on, this stage takes about **65 minutes**.
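
To get a feel for how this scales to the larger regions, the Bremen timing can be extrapolated linearly. This is a rough sketch only: `estimated_hours` is a hypothetical helper, the Bavaria figure is the rough count from the Step 1a table, and autothrottle adjusts delays dynamically, so real run times may differ noticeably.

```python
def estimated_hours(n_trails, ref_trails=1532, ref_minutes=65):
    """Scale the Bremen timing (1532 trails in about 65 minutes) linearly to n_trails."""
    seconds_per_trail = ref_minutes * 60 / ref_trails
    return n_trails * seconds_per_trail / 3600


# Bavaria, using the rough count of 79,400 trails
print(round(estimated_hours(79_400), 1))
```

Under this assumption Bavaria alone would take a bit more than two days of continuous scraping.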

#### Step 5: Filter for 2018 & distance

Some initial filtering can be applied to reduce the amount of data that goes through the generating geometries step.

This function filters on two fields:
1. Filter for 2018 only. Data from the year 2018 is needed so that the social media data matches the forest definition data (which is for 2018).
2. Filter out very long trails (for now, >175km). These tend to correspond to motorised transport (which, when covering large distances, may be difficult to tie to CES for forests) or to unexpected use of the website/errors. For example, this trail https://www.wikiloc.com/hiking-trails/xabia-teulada-126272868 is recorded as a ~5 hour hike, but it goes from Germany to Spain. This trail already gets removed with the 2018 filter, but there may be others like it. 


In [None]:
# STEP 5: FILTER 2018 & DISTANCE (ATTRIBUTE)

# Create a list of the json paths with all scraped data
track_json_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/track-*.json')

# Load jsons from list, select only 2018 data & trails less than certain distance, return new json
# This outputs to the PROCESSING folder!
def dist_year_filter(json_paths):
    for json_path in json_paths:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output
        output_path = "./processing/" + name_wo_ext + "_2018_distfilter.json" 

        # Load json as df
        track_df = pd.read_json(json_path) 

        # Select rows where date_recorded includes "2018" (na=False drops rows with no date)
        track_2018_df = track_df[track_df["date_recorded"].str.contains("2018", na=False)]

        # Select rows where distance is less than 175 km
        track_2018short_df = track_2018_df[track_2018_df["distance_km"] < 175]
        
        # Save the df as a json
        track_2018short_df.to_json(output_path)

# Run the function
dist_year_filter(track_json_paths)

# Load the json and check
#bw_2018_df = pd.read_json("./processing/track-badenwurttemberg_2018_distfilter.json")
#bw_2018_df

#### Step 6: Generating geometries (buffered lines/points)

In the track scraping (Steps 3 and 4), I extracted all latitude values in one column, and all longitude values in another column. Since the coordinates are in order (first the start coordinate, then the waypoints in order), a line can be created where multiple coordinates exist. Where only 1 coordinate pair exists (the start coordinates) a point can be created - then both points and lines can be buffered in order to generate polygon geometries for all trails.

**IMPORTANT** The buffer is set to 30m (on each side of the line, or the radius for points). This is roughly the average of a minimum sight distance of 15m used in Xiang, 1996 "for hikers’ unobstructed forward and rear view of the surroundings" and a buffer of 50m used in Torkko et al., 2023, which was found to most accurately capture perceived greenery in urban areas. The 30m buffer is likely a small overestimation of sight in dense forest, but a large underestimation for high, open viewpoints where people can see much further. I opted for this more conservative approach with small buffers in an effort to increase the chance that the textual content is about forests (see forest masking step later).

The code/function below generates a buffered line/point geometry for each trail within each json and outputs a shapefile.

Technical notes for pairing up lat and long coordinates:
- apply with axis= 1 applies the function to each row
- lambda row sets up an anonymous function to do something for each row
- zip pairs the items in each list by index (so the lats and longs get paired up according to their order in the list)
- the list part converts the output from zip into a list
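
The pairing can be tried out on a single toy row without pandas. `pair_coordinates` and `geometry_kind` are hypothetical names used only for this illustration. Note that `zip` silently stops at the shorter list, so the sketch adds a length check that the main code does not have.

```python
def pair_coordinates(latitudes, longitudes):
    """Pair longitudes and latitudes by position as (x, y) = (lon, lat) tuples."""
    if len(latitudes) != len(longitudes):
        # zip would otherwise drop the unmatched values without warning
        raise ValueError("latitude and longitude lists differ in length")
    return list(zip(longitudes, latitudes))


def geometry_kind(coords):
    """Mirror the point-vs-line decision: one pair gives a point, more give a line."""
    return "point" if len(coords) == 1 else "line"


# Toy values: start point followed by two waypoints
coordinates = pair_coordinates([53.07, 53.10, 53.08], [8.80, 8.85, 8.79])
print(coordinates, geometry_kind(coordinates))
```

Longitude comes first in each pair because shapely expects coordinates in (x, y) order.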

In [None]:
### STEP 6: buffered point/line geometries

# Create a list of the json paths with all the filtered trails
track2018_json_paths = glob.glob('./processing/track-*_2018_distfilter.json')

# Function which creates point geom if only one coordinate pair available, otherwise line geom
def geom_from_coords(coords):
    if len(coords) == 1:
        return Point(coords[0])
    else:
        return LineString(coords)

# Load jsons from list, generate buffer geometries and saves to processing folder as shp
def buffergeom_generator(json_path_list):
    for json_path in json_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_buffer.shp"

        # Load json as df
        track_df = pd.read_json(json_path)

        # Pair-up the lats and longs into coordinates
        track_df["coordinates"] = track_df.apply(lambda row: list(zip(row["longitudes"], row["latitudes"])), axis=1)

        # Run the geom_from_coords function (defined above) on all rows
        track_df['geometry'] = track_df['coordinates'].apply(geom_from_coords)

        # Convert to geodataframe
        track_gdf = gpd.GeoDataFrame(track_df, geometry="geometry")

        # Define projection (scraped coordinates are lon/lat, EPSG:4326) and reproject to metres
        track_gdf = track_gdf.set_crs("EPSG:4326")
        track_gdf = track_gdf.to_crs("EPSG:3035")

        # Buffer all rows so that all geometries are now polygons
        # SET BUFFER DISTANCE HERE (M)
        buffer_geoms = track_gdf.buffer(30)

        # Replace geometries in gdf with buffered geometries
        track_gdf = track_gdf.set_geometry(buffer_geoms)

        # Write to shapefile
        track_gdf.to_file(output_path, driver="ESRI Shapefile")

# Run the buffer geometry generator
buffergeom_generator(track2018_json_paths)
        
        

**Note: exporting the data as a shp truncates the long text fields (e.g. "description" "photo_caption", "comments", etc).** I also tried writing the file as a geojson, which works fine, but then when loading back in as a geodataframe, there are problems with the list structures in the "photo_caption" and "comments" fields. 

For now, I think the best work-around is to use the shapefile to do the spatial intersection steps, and then to **join the results back to the main json** (i.e. STEP 5 OUTPUTS, filtered for year and distance) for any text analysis steps. I think the track URLs can be used for the join field as these are unique. 

**ADDITIONAL AREA FILTER** Previously I ran a filter to remove very long trails based on the distance attribute entered by the user on Wikiloc. Unfortunately this does not catch every case, as sometimes the distance is entered incorrectly. To account for this, I added another filter based on the area of the buffered geometries. This is only meant to catch really extreme cases (such as flights and long distances outside Germany). I decided to remove everything over 65 km2 based on a manual assessment of the extreme cases in QGIS. 
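
To give a sense of scale for the 65 km2 threshold, the area of a buffered line can be approximated by hand. This sketch assumes a perfectly straight, non-overlapping line with round end caps, which real trails are not, and both function names are made up for the illustration.

```python
import math


def buffered_line_area_m2(length_m, buffer_m=30):
    """Area of a straight line buffered on both sides, including the round end caps."""
    return 2 * buffer_m * length_m + math.pi * buffer_m ** 2


def straight_length_for_area_km(area_m2, buffer_m=30):
    """Length (km) of a straight line whose buffer has the given area."""
    return (area_m2 - math.pi * buffer_m ** 2) / (2 * buffer_m) / 1000


# Line length that corresponds to the 65 km2 threshold with a 30 m buffer
print(round(straight_length_for_area_km(65_000_000)))
# Buffered area (km2) of a straight 175 km line, i.e. the Step 5 distance limit
print(round(buffered_line_area_m2(175_000) / 1_000_000, 1))
```

Under these simplifications the threshold corresponds to a line of roughly 1,080 km, while a 175 km line covers only about 10.5 km2, so the area filter really does only target extreme cases.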

In [34]:
# STEP 6: ADDITIONAL AREA FILTER 

# Create a list of the shp paths with all the buffer trail geometries
track2018_buffer_paths = glob.glob('./processing/track-*_2018_distfilter_buffer.shp')

# Load shps from list, select only trails under 65km2, return new shp
# This outputs to the PROCESSING folder!
def area_filter(shp_paths):
    for shp_path in shp_paths:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output
        output_path = "./processing/" + name_wo_ext + "_areafilt.shp" 

        # Load shp as gdf
        track_gdf = gpd.read_file(shp_path)

        # Calculate the area
        track_gdf["area"] = track_gdf.geometry.area

        # Select rows where area is less than 65km2 (65000000m2)
        track_areafilt_gdf = track_gdf[track_gdf["area"] < 65000000]
        
        # Write to shapefile
        track_areafilt_gdf.to_file(output_path, driver="ESRI Shapefile")

# Run the function
area_filter(track2018_buffer_paths)

#### Step 7: Natura & Forest Consensus Map Filtering

Required steps:
1. Filter to only include trails which intersect with Natura 2000 areas. 
2. Filter for forests - here I can calculate the zonal statistics of the consensus map using the buffered trail geometries (filtered to only those which intersect with Natura sites). I can then calculate the max class for each trail geometry and only include the trails which have a max class of 3, 4, 5, or 6 (i.e. classes where at least half of the forest definitions agree on forest presence). In other words, if the max class of a trail is a non-forest class (0, 1, 2), then I remove it from the output.
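
The max-class rule can be illustrated on a single trail without any raster data. This is a simplified sketch: `max_class` and `is_forest_trail` are hypothetical helpers, and the pixel counts are invented.

```python
FOREST_CLASSES = {3, 4, 5, 6}


def max_class(pixel_counts):
    """Return the class with the most pixels; ties go to the lowest class number."""
    return max(sorted(pixel_counts), key=lambda cls: pixel_counts[cls])


def is_forest_trail(pixel_counts):
    """Keep a trail only if its dominant class is one of the forest classes."""
    return max_class(pixel_counts) in FOREST_CLASSES


# Invented counts per consensus class for two trail buffers
mostly_forest = {0: 10, 1: 0, 2: 5, 3: 40, 6: 20}
tied = {0: 50, 3: 50}
print(max_class(mostly_forest), is_forest_trail(mostly_forest))
print(max_class(tied), is_forest_trail(tied))
```

The tie behaviour matches pandas `idxmax`, which returns the first column holding the maximum, so a trail split evenly between a non-forest and a forest class is removed.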

UPDATE:
For RQ3 I found that, with all the filters applied in this script, there are very few remaining trails for the comparison between class 3 and 6. I decided to try to still demonstrate the process (of comparing class 3 and 6), but with the adjustment to consider all 2018 trails in forest areas - i.e. not just the ones intersecting with Natura sites. SO: I created an alternative step to include a column in the shapefile indicating whether the trail intersects with Natura or not - this can be used as a filter later for RQ2 (which only uses Natura trails) but means that I have the trail text ready for non-Natura areas as well (for RQ3).

In [None]:
# STEP 7: NATURA INTERSECT (ORIGINAL - NATURA ONLY)

# Create a list of the shp paths with all the buffered and filtered trail geometries
track2018_bufferfilt_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt.shp')

# Load Germany Natura sites
natura = gpd.read_file("./outputs/natura2000_3035_DE.shp", 
                       columns=["SITECODE", "SITENAME", "MS", "SITETYPE"])

# Load shps from list, check for intersections, remove duplicates & save as shp
def natura_intersects(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_natura.shp"

        # Load shp as gdf
        track_buffer_gdf = gpd.read_file(shp_path)

        # Inner join means any non-intersecting geometries will be dropped
        # Duplication: trail geoms are duplicated for every different Natura site they overlap with
        inter_trails = track_buffer_gdf.sjoin(natura, how="inner", predicate="intersects")

        # Get rid of duplicate trail geometries
        inter_trails = inter_trails.drop_duplicates(subset=["url_track"])

        # Also drop the Natura info (to avoid confusion as site listed may not be the only site)
        inter_trails = inter_trails.drop(columns=["index_right", "SITECODE", 
                                                  "SITENAME", "MS", "SITETYPE"])

        # Write to shapefile
        inter_trails.to_file(output_path, driver="ESRI Shapefile")

# Run intersecting function to remove trails which do not intersect with Natura sites
natura_intersects(track2018_bufferfilt_paths)

In [None]:
# STEP 7: NATURA INTERSECT (ALTERNATIVE - ALL WITH NATURA FLAG COLUMN)

# Create a list of the shp paths with all the buffered and filtered trail geometries
track2018_bufferfilt_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt.shp')

# Load Germany Natura sites
natura = gpd.read_file("./outputs/natura2000_3035_DE.shp", 
                       columns=["SITECODE", "SITENAME", "MS", "SITETYPE"])

# Load shps from list, check for intersections, remove duplicates & save as shp
def natura_intersects(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_naturaflag.shp"

        # Load shp as gdf
        track_buffer_gdf = gpd.read_file(shp_path)

        # Left join to ensure all trails are kept (but where an intersection occurs natura info is added to the trail/row)
        # Duplication: trail geoms are duplicated for every different Natura site they overlap with
        join = track_buffer_gdf.sjoin(natura, how="left", predicate="intersects")

        # Add column indicating natura/non-natura based on whether the trail has a joined Natura info (sitecode) or not
        join["natura"] = join["SITECODE"].notnull().map({True: "natura", False: "non-natura"})

        # Get rid of duplicate trail geometries (which result from when a trail intersects with more than 1 Natura site)
        join = join.drop_duplicates(subset=["url_track"])

        # Also drop the Natura info (to avoid confusion as site listed may not be the only site)
        join = join.drop(columns=["index_right", "SITECODE", 
                                  "SITENAME", "MS", "SITETYPE"])

        # Write to shapefile
        join.to_file(output_path, driver="ESRI Shapefile")

# Run intersecting function to flag each trail as natura/non-natura (all trails are kept)
natura_intersects(track2018_bufferfilt_paths)


In [None]:
# STEP 7: CONSENSUS MAP ZONAL STATISTICS (FOREST INTERSECTS)
# TAKES ABOUT 23 MIN (EACH SET - NATURA ONLY & ALL W NATURA FLAG)

# Create a list of the shp paths with all the buffered trails (either natura only or natura flag)
track2018_natura_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt_natura.shp')
track2018_naturaflag_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt_naturaflag.shp')

# Path to forest consensus map
consensus_map = "./outputs/forest_consensus_3035_DE_5m_2018.tif"

# Load shps from list, calculate zonal stats for consensus map, remove trails which don't have
# class 3, 4, 5, 6 coverage, find the dominant coverage class, save output as shp 
def forest_coverage(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_forest.shp"

        # Load shp as gdf
        track_gdf = gpd.read_file(shp_path)

        # Calculate the consensus map zonal stats (pixel count per class) for each trail
        # Returns a list with one dictionary of per-class counts per trail
        zonalstats = zonal_stats(track_gdf, consensus_map, categorical=True)
        
        # Convert list of dictionaries to dataframe
        zonalstats_df = pd.DataFrame(zonalstats)

        # Force the columns in order (this is not guaranteed!)
        # reindex also adds an empty column for any class that never occurs in this region
        zonalstats_df = zonalstats_df.reindex(columns=[0, 1, 2, 3, 4, 5, 6])

        # Rename the columns for clarity
        zonalstats_df.columns=["0_count", "1_count", "2_count", "3_count",
                               "4_count", "5_count", "6_count"]
        
        # Replace the NaN values with 0
        zonalstats_df.fillna(0, inplace=True)

        # Join stats with trails (join based on index)
        trails_stats = track_gdf.join(zonalstats_df)

        # Convert the counts per class to area (based on 25m2 pixel size converted to hectares)
        for cls in range(7):
            trails_stats[f"{cls}_area_ha"] = trails_stats[f"{cls}_count"] * 0.0025

        # Calculate the total area of each geometry (convert m2 to ha)
        trails_stats["total_area_ha"] = trails_stats.area * 0.0001

        # Calculate the percentage coverage
        for cls in range(7):
            trails_stats[f"{cls}_percent"] = (trails_stats[f"{cls}_area_ha"] / trails_stats["total_area_ha"]) * 100
        
        # Add a column for the class with the highest percentage
        trails_stats["max_class"] = trails_stats[["0_percent", "1_percent", "2_percent", "3_percent", "4_percent", "5_percent", "6_percent"]].idxmax(axis=1) 

        # Remove any trails which have max_class of 0,1 or 2
        ind_to_drop = trails_stats[(trails_stats['max_class'] == "0_percent") | (trails_stats['max_class'] == "1_percent") | (trails_stats['max_class'] == "2_percent")].index
        trails_stats_filter = trails_stats.drop(ind_to_drop)

        # Replace the row values created in the last step to remove the word "percent"
        trails_stats_filter.replace("3_percent","3", inplace=True)
        trails_stats_filter.replace("4_percent","4", inplace=True)
        trails_stats_filter.replace("5_percent","5", inplace=True)
        trails_stats_filter.replace("6_percent","6", inplace=True)

        # Write to shapefile
        trails_stats_filter.to_file(output_path, driver="ESRI Shapefile")

# Run forest coverage function to find out which class has the most coverage for each trail (max class)
# and remove trails which don't have a max class of 3, 4, 5, or 6 (i.e. remove the trails which have a non-forest class as their max class)
forest_coverage(track2018_natura_paths)
forest_coverage(track2018_naturaflag_paths)





The code below is only for RQ2 (to generate geometries to associate with the clusters) and so it only runs with the outputs from # STEP 7: NATURA INTERSECT **(ORIGINAL - NATURA ONLY)**


In [None]:
# STEP 7: EXPORT ALL GEOMS (ONLY NATURA TRAILS)

# Create a list of the shp paths with all the final filtered trails per region (natura trails only)
filtered_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt_natura_forest.shp')

# Create a list for storing the gdfs (for reading in all the shp paths)
gdf_list = []

# Read in each shp, read as gdf and add to list
for shp_path in filtered_paths:
    filtered_track_gdf = gpd.read_file(shp_path)
    gdf_list.append(filtered_track_gdf)

# Combine all the gdfs to create one shp
combined_shp = gpd.GeoDataFrame(pd.concat(gdf_list, ignore_index=True))

# Write to shapefile
combined_shp.to_file("./processing/master_geoms_natura.shp", driver="ESRI Shapefile")

#### Step 8: Finalise Text for Analysis

At this stage I have the original data with the full text, and the filtered data for the areas of interest but without the full text (it was truncated during conversion to shp). I now need to do the following to finalise the text that can be used for analysis:

1. Link the full text back to the filtered trails. For ease I also drop the geometries at this stage so that the outputs can be saved in a non-spatial format (CSV for now).
2. Combine the data from the different regions together.
3. Remove any duplicates (trails which cross borders between regions may be listed in both)
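
The three steps above can be sketched on toy data as follows. This is only a minimal illustration, not the notebook's actual processing: the two small DataFrames and their values are made up, and only the `url_track` key and the `region` column mirror names used in this notebook. It also assumes that a duplicate means "same trail URL".

```python
import pandas as pd

# Hypothetical filtered trails (text truncated by the shapefile conversion)
filtered = pd.DataFrame({
    "url_track": ["a", "b"],
    "descriptio": ["A long des", "Another lo"],
})

# Hypothetical original data holding the full text
original = pd.DataFrame({
    "url_track": ["a", "b", "c"],
    "description": ["A long description", "Another long description", "Unfiltered trail"],
})

# 1. Left join on the trail URL so that only filtered trails remain,
#    then drop the truncated text column
region_1 = filtered.merge(original, how="left", on="url_track").drop(columns="descriptio")
region_1["region"] = "region_1"

# Pretend trail "b" crosses a border and is also listed in a second region
region_2 = region_1[region_1["url_track"] == "b"].copy()
region_2["region"] = "region_2"

# 2. Combine the regions
combined = pd.concat([region_1, region_2], ignore_index=True)

# 3. Remove duplicates, identified by the trail URL
deduped = combined.drop_duplicates(subset="url_track")
print(deduped)
```

Note that the two copies of trail "b" differ in their `region` value, so comparing whole rows would keep both; restricting the comparison to `url_track` is what removes the cross-border copy here.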

In [None]:
# STEP 8: REINCORPORATE FULL TEXT (ALL W NATURA FLAG)

# Create a list of the shp paths with all the final filtered trails per region
filtered_paths = glob.glob('./processing/track-*_2018_distfilter_buffer_areafilt_naturaflag_forest.shp')

# Load shps from list, drop geometries (for ease), load original json with full text and merge
def full_text_merge(shp_path_list):
    for shp_path in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(shp_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # Extract corresponding file name for original json
        original_name = name_wo_ext[:-45]
        # For output file naming: assemble the new file path for the output
        # ("_natflag" is included so the name matches the files loaded when combining the regions)
        output_path = "./processing/" + original_name + "_natflag_filtered_full_text.csv"

        # Load shp as gdf
        filtered_track_gdf = gpd.read_file(shp_path)
        # Drop geometries
        filtered_track_df = filtered_track_gdf.drop(columns='geometry')


        # Load the original json with full text
        original_track_df = pd.read_json("./processing/" + original_name + "_distfilter.json")

        # Left join on the trail URL so that only filtered trails remain
        filt_fulltxt = pd.merge(filtered_track_df, original_track_df, how="left",
                                on="url_track", suffixes=("_filter", "_original"))

        # Clean up the columns: drop the truncated shp columns and the duplicated (suffixed) columns
        filt_fulltxt = filt_fulltxt.drop(columns=["date_publi", "descriptio", "distance_k", "date_recor",
                                                "photo_capt", "comments_filter",
                                                "latitudes_filter", "longitudes_filter",
                                                "track_name_original", "latitudes_original",
                                                "longitudes_original", "track_type_original", "coordinate"
                                                ])
        filt_fulltxt.rename(columns={"track_name_filter":"track_name", 
                                    "track_type_filter":"track_type",
                                    "comments_original":"comments"}, inplace=True)
        
        # Add a column for the region name (useful later)
        track_region = name_wo_ext[:-50]
        filt_fulltxt["region"] = track_region[6:]

        # Save as csv (utf-8-sig encoding seems to work for the special characters)
        # Using csv for now as this will be easier for some manual checking
        filt_fulltxt.to_csv(output_path, index=False, encoding="utf-8-sig")

# Use the function to create csvs of full text for the filtered trails of each region
full_text_merge(filtered_paths)
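
The fixed slice offsets in `full_text_merge` (`[:-45]`, `[:-50]` and `[6:]`) depend on the exact length of the file name suffix, so they break silently if the naming changes. A minimal alternative sketch is shown below, assuming the file names always follow the `track-<region>_2018_<suffix>.shp` pattern used in this notebook. The helper name `parse_track_name` is hypothetical, and `str.removesuffix` / `str.removeprefix` need Python 3.9 or later.

```python
import os

# Suffix appended by the earlier processing steps (45 characters long)
SUFFIX = "_distfilter_buffer_areafilt_naturaflag_forest"


def parse_track_name(shp_path, suffix=SUFFIX):
    """Return (original_name, region) for a filtered shapefile path.

    Hypothetical helper: assumes names like 'track-<region>_2018<suffix>.shp'.
    """
    # File name without folder or extension
    stem = os.path.splitext(os.path.basename(shp_path))[0]
    if not stem.endswith(suffix):
        raise ValueError(f"Unexpected file name: {shp_path}")
    # e.g. 'track-bavaria_2018'
    original_name = stem.removesuffix(suffix)
    # Strip the 'track-' prefix and the trailing '_2018' to get the region
    region = original_name.removeprefix("track-").rsplit("_", 1)[0]
    return original_name, region


example = "./processing/track-bavaria_2018_distfilter_buffer_areafilt_naturaflag_forest.shp"
print(parse_track_name(example))
```

Raising an error on an unexpected name is a design choice here: it makes a renamed input fail loudly instead of producing a wrongly sliced region name.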



In [None]:
# STEP 8: COMBINE REGIONAL TRAILS (ALL W NATURA FLAG)

# Load in each region manually
baden_wurt = pd.read_csv("./processing/track-badenwurttemberg_2018_natflag_filtered_full_text.csv")
bavaria = pd.read_csv("./processing/track-bavaria_2018_natflag_filtered_full_text.csv")
berlin = pd.read_csv("./processing/track-berlin_2018_natflag_filtered_full_text.csv")
branden = pd.read_csv("./processing/track-brandenburg_2018_natflag_filtered_full_text.csv")
bremen = pd.read_csv("./processing/track-bremen_2018_natflag_filtered_full_text.csv")

hamburg = pd.read_csv("./processing/track-hamburg_2018_natflag_filtered_full_text.csv")
hessen = pd.read_csv("./processing/track-hessen_2018_natflag_filtered_full_text.csv")
meck_vor = pd.read_csv("./processing/track-mecklenburgvorpommern_2018_natflag_filtered_full_text.csv")
nieder = pd.read_csv("./processing/track-niedersachsen_2018_natflag_filtered_full_text.csv")
nord_west = pd.read_csv("./processing/track-nordrheinwestfalen_2018_natflag_filtered_full_text.csv")

rhein_pfalz = pd.read_csv("./processing/track-rheinlandpfalz_2018_natflag_filtered_full_text.csv")
saar = pd.read_csv("./processing/track-saarland_2018_natflag_filtered_full_text.csv")
sachs = pd.read_csv("./processing/track-sachsen_2018_natflag_filtered_full_text.csv")
sax_anh = pd.read_csv("./processing/track-saxonyanhalt_2018_natflag_filtered_full_text.csv")
schles_hol = pd.read_csv("./processing/track-schleswigholstein_2018_natflag_filtered_full_text.csv")
thuri = pd.read_csv("./processing/track-thuringen_2018_natflag_filtered_full_text.csv")

# Combine regions by adding rows to master
master = pd.concat([baden_wurt, bavaria, berlin, branden, bremen,  
                    hamburg, hessen, meck_vor, nieder, nord_west,
                    rhein_pfalz, saar, sachs, sax_anh, schles_hol, thuri], axis=0)

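# Remove duplicates (trails which cross regional borders may be listed in more than one region).
# Duplicates are identified by the trail URL, because the "region" column differs between copies
# and a comparison of whole rows would therefore keep them
master.drop_duplicates(subset="url_track", inplace=True)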

# Save master csv
master.to_csv("./processing/master_natflag_version.csv", index = False, encoding="utf-8-sig")