## RQ2 Data Collection

Wikiloc data extraction for Germany. **See scrapy_setup_info.md for setting up scrapy, including edits I made to the settings to fix 403 error messages and make the scraping more polite.**

The Scrapy spiders are provided by Chai-Allah et al. (2023) through their GitHub repo: [Wiki4CES](https://github.com/achaiallah-hub/Wiki4CES)

From what I understand, the spiders provided in the Wiki4CES repo do the following:
1. **extract_link.py** Extracts the URLs for all the trails. You give it a starting region (an initial URL) and it goes through each city/town in that region and extracts all the trail links (URLs) in that city's listing. This spider needs to be run first to get the URLs for steps 2 and 3.
2. **wikiloc_track.py** Scrapes the trail details like track name, difficulty, distance, author, views and description. It loads the trail URLs from a file called link.csv (presumably created in step 1).
3. **wikiloc_image.py** Scrapes image data from the trail pages, including URL, track name, user name, date, and location (latitude & longitude). It reads the trail pages from a file called link.csv (presumably created in step 1).
4. **download_image.py** Downloads images from the URLs in a csv file called wikiloc_image.csv (presumably created in step 3). Not needed for my work, so I have removed this file.

**NOTE:** Running one spider appears to run all the spiders at once, producing error messages saying certain files don't exist (which makes sense, as these files need to be created by other spiders first). This is probably because Scrapy imports every spider module in the project when looking up the requested spider, so any code that runs at import time (such as opening link.csv) is executed for every spider file. I initially worked around this by commenting out the code within the other spiders. UPDATE: It seems to be okay once the errors have been resolved (it doesn't actually run the other spiders, but it does seem to check that the expected files exist and that the code is valid), so I'm leaving finished scripts uncommented as I correct them.

In [2]:
# SETUP

# Import packages
import os
import pandas as pd
import glob

import shapely
from shapely.geometry import box
import geopandas as gpd

# Create folders for storing scrapy outputs
path_list = ["./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs"]

for path in path_list:
  if not os.path.exists(path):
    # makedirs also creates any missing parent folders (os.mkdir would fail on them)
    os.makedirs(path)
    print("Folder %s created!" % path)
  else:
    print("Folder %s already exists" % path)

Folder ./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs already exists


#### Step 1: Updating extract_link.py

**Step 1a: Update start_urls (including Germany region info)** 
Edit extract_link.py to replace the start_urls. Originally this contained https://www.wikiloc.com/trails/france/auvergne-rhone-alpes, but this URL no longer seems to exist, as it just redirects to https://www.wikiloc.com/trails/outdoor. 

The URL format now needs to be https://www.wikiloc.com/trails/outdoor/ + *country_name* + / + *region_name*, so for Germany I will use https://www.wikiloc.com/trails/outdoor/germany followed by each of the region endings below.

For Germany, Wikiloc has trails for the following regions:

| Count | Region                 | URL ending              |
| ----- | ---------------------- | ----------------------- | 
| 1     | Baden-Wurttemberg      | /baden-wurttemberg      | 
| 2     | Bavaria                | /bavaria                |
| 3     | Berlin                 | /berlin                 |
| 4     | Brandenburg            | /brandenburg            |
| 5     | Bremen                 | /bremen                 |
| -     | DE.16,11               | (don't use)             |
| 6     | Hamburg                | /hamburg                |
| 7     | Hessen                 | /hessen                 |
| 8     | Mecklenburg-Vorpommern | /mecklenburg-vorpommern |
| 9     | Niedersachsen          | /niedersachsen          |
| 10    | Nordrhein-Westfalen    | /nordrhein-westfalen    |
| 11    | Rheinland-Pfalz        | /rheinland-pfalz        |
| 12    | Saarland               | /saarland               |
| 13    | Sachsen                | /sachsen                |
| 14    | Saxony-Anhalt          | /saxony-anhalt          |
| 15    | Schleswig-Holstein     | /schleswig-holstein     |
| 16    | Thüringen              | /thuringen              |

*NOTE* The number of trails seems to be increasing steadily (for example, within a one-week period, the total count for Germany went up by ~1000). I'll need to keep track of the number of expected trails on the day of download. Also check to make sure no new regions have been added!

DE.16,11 appears to be a few trails in Berlin. I don't think I need to bother with it, as there are so few trails, they are in an urban area, and none of the routes look like they have anything to do with forests.
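
As a minimal sketch of the URL format described above, the region URLs can be assembled from the "URL ending" column of the table. The names `BASE_URL`, `REGION_SLUGS` and `region_start_url` are illustrative only and do not come from the Wiki4CES code.

```python
# Base of the URL format described above (country = germany)
BASE_URL = "https://www.wikiloc.com/trails/outdoor/germany"

# Slugs copied from the "URL ending" column of the table (DE.16,11 left out on purpose)
REGION_SLUGS = [
    "baden-wurttemberg", "bavaria", "berlin", "brandenburg", "bremen",
    "hamburg", "hessen", "mecklenburg-vorpommern", "niedersachsen",
    "nordrhein-westfalen", "rheinland-pfalz", "saarland", "sachsen",
    "saxony-anhalt", "schleswig-holstein", "thuringen",
]


def region_start_url(slug):
    """Return the listing URL for one region slug (hypothetical helper)."""
    return f"{BASE_URL}/{slug}"


# One start URL per region, keyed by slug
region_urls = {slug: region_start_url(slug) for slug in REGION_SLUGS}
print(region_urls["bremen"])
```

Each value could then be pasted into start_urls in extract_link.py, one region at a time.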

**Step 1b: Update xpath expressions & other edits**
After overcoming initial 403 error messages (see scrapy_setup_info.md), the spider seemed to correctly generate the urls for all the cities within the region, but still didn't return the trail URLs. I started looking into the xpath expressions in the extract_link.py as I wondered if the path structure has changed a bit over time (like the URLs).

I found this video useful for understanding xpath https://www.youtube.com/watch?v=4EvxqTSzUkI 
I then went to https://www.wikiloc.com/trails/outdoor/germany/bremen and did right click > Inspect to see the html (I selected Bremen as the testing region as it has the fewest trails). After searching for the components on the main Bremen page and then on the page for one city (for example: https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt), I made a couple of changes to the xpaths in extract_link.py (see comments in script). I also made some changes to the pagination handling (see comments in script). In addition, since the URLs saved initially were just the back half of the URL, without the beginning (e.g. /cycling-trails/bremen-achim-18077390), I adjusted the code to add the beginning part as well. This ended up being required for the other spiders to work properly. 
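
To illustrate the last point, here is a small sketch of turning a relative trail link into a full URL with the standard library. It is not necessarily how extract_link.py was changed; inside a Scrapy callback, `response.urljoin(...)` offers the same behaviour relative to the page being parsed.

```python
from urllib.parse import urljoin

# Page on which the relative link was found (city listing used as the example above)
page_url = "https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt"

# Relative link as originally saved by the spider
relative_link = "/cycling-trails/bremen-achim-18077390"

# A link starting with "/" is resolved against the site root, not the current folder
full_link = urljoin(page_url, relative_link)
print(full_link)
```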


#### Step 2: Complete workflow for running extract_link.py

*For now, this is just for the Bremen region.*

**Step 2a**
In extract_link.py:
1. Update start_urls: 'https://www.wikiloc.com/trails/outdoor/germany/bremen' and save.

**Step 2b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o crawling_outputs\link-bremen.csv

**Step 2c**
Remove duplicates: I am not sure why duplicates are occurring, but the code below simply removes them.

*NOTE:* For Bremen at the time of scraping (31 MARCH 2025), the website shows 1460 trails, however I get 1532 trails (after the duplicates are removed) - this means there are an extra 72 trails. I'm not sure why this is but I wonder if it has something to do with trails which cross borders (and therefore are in more than 1 region of Germany). It could be that these trails can be searched for in both regions but are only included in the count of 1 to avoid double-counting? **I should check for duplicates across regions to make sure all trails are unique.**
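
A possible way to run the cross-region check is sketched below, using only the standard library. The function name and the example links are made up for illustration; in practice the lists would come from the `Link` column of each link-*.csv file.

```python
from collections import defaultdict


def links_in_multiple_regions(links_by_region):
    """Map each link found in more than one region to the sorted list of those regions."""
    regions_per_link = defaultdict(set)
    for region, links in links_by_region.items():
        for link in links:
            regions_per_link[link].add(region)
    return {
        link: sorted(regions)
        for link, regions in regions_per_link.items()
        if len(regions) > 1
    }


# Made-up links, for illustration only
example = {
    "bremen": [
        "https://www.wikiloc.com/hiking-trails/a-1",
        "https://www.wikiloc.com/hiking-trails/b-2",
    ],
    "niedersachsen": [
        "https://www.wikiloc.com/hiking-trails/b-2",
        "https://www.wikiloc.com/hiking-trails/c-3",
    ],
}

shared = links_in_multiple_regions(example)
print(shared)
```

If the border-crossing explanation is right, the number of Bremen links that also appear in a neighbouring region should be close to the 72 extra trails.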

In [99]:
# STEP 2C: Remove CSV duplicates 

# Create a list of the csv paths with all the scraped trail links
link_csv_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-*.csv')

# Load csvs from list, remove duplicates and then write results to same file (overwrite)
for csv_path in link_csv_paths:
    loaded_csv = pd.read_csv(csv_path, sep="\t")
    loaded_csv.drop_duplicates(inplace=True)
    loaded_csv.to_csv(csv_path, index=False)

# Check
#link_bremen = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-bremen.csv', sep="\t")
#link_bremen
link_nieder = pd.read_csv('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/link-niedersachsen.csv', sep="\t")
link_nieder

Unnamed: 0,Link
0,https://www.wikiloc.com/hiking-trails/rundwand...
1,https://www.wikiloc.com/offroading-trails/30-m...
2,https://www.wikiloc.com/hiking-trails/rund-alt...
3,https://www.wikiloc.com/hiking-trails/alt-burl...
4,https://www.wikiloc.com/running-trails/alt-wol...
...,...
30387,https://www.wikiloc.com/hiking-trails/altenau-...
30388,https://www.wikiloc.com/hiking-trails/altenau-...
30389,https://www.wikiloc.com/running-trails/altenau...
30390,https://www.wikiloc.com/hiking-trails/oderteic...


#### Step 3: updating wikiloc_track.py

I edited wikiloc_track.py to update the xpaths (as with extract_link.py; see comments in script). I also added code to scrape additional information: 
- date recorded
- photo/waypoint captions (title and body)
- comments
- **photo/waypoint latitudes and longitudes**
- **start point latitude and longitude** (unfortunately the end point is not stored in the html)

Because I handled all the coordinate extraction in this script, I did not use or update the wikiloc_image.py script (and I since deleted it from my repo). I extracted all latitude values in one column, and all longitude values in another column. I can then extract the minimum and maximum values from each column in order to create a bounding box.

Additionally, I removed the author extraction completely so that no personal information is collected. 

Although I updated the xpaths for the following features, I commented them out as I don't think I'll need them for my analysis:
- trail difficulty
- view counts
- download counts
- trail length/distance

#### Step 4: Complete workflow for running wikiloc_track.py

*For now, this is just for the Bremen region.*

**Step 4a**
In wikiloc_track.py:
1. Change CSV name in start_urls to: crawling_outputs\link-bremen.csv

**Step 4b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki_track -o crawling_outputs\track-bremen.json

**Needs to be output as json** otherwise (as csv) the utf-8 encoding doesn't seem to work properly and the German special characters are not handled well. 
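
One thing to be aware of when inspecting the JSON file: as far as I know, Scrapy's JSON feed escapes non-ASCII characters as `\uXXXX` sequences by default, so umlauts can look broken in a text editor even though they load correctly. The sketch below shows the round trip with the standard library; the field names and values are invented.

```python
import json

# One record as it might appear in an ASCII-escaped JSON feed (made-up values)
raw = '[{"track_name": "Th\\u00fcringer Wald", "description": "Sch\\u00f6ne Runde"}]'

# Loading decodes the escapes back into the real characters
records = json.loads(raw)
print(records[0]["track_name"])

# Writing with ensure_ascii=False keeps the characters readable in the output text
readable = json.dumps(records, ensure_ascii=False)
print(readable)
```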

For Bremen (1532 trails), with download delays and autothrottle on, this stage takes about **65 minutes**.
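
The Bremen timing gives a rough way to estimate run times for larger regions. The sketch below assumes the per-trail rate stays the same and that Niedersachsen has about 30,391 links (inferred from the row index in the Step 2c output), so treat the result as an order-of-magnitude guide only.

```python
def seconds_per_trail(n_trails, minutes):
    """Average scraping time per trail, in seconds."""
    return minutes * 60 / n_trails


def estimated_hours(n_trails, rate_seconds):
    """Projected run time in hours for a region with n_trails links."""
    return n_trails * rate_seconds / 3600


# Observed for Bremen: 1532 trails in about 65 minutes
rate = seconds_per_trail(1532, 65)

# Assumed link count for Niedersachsen (see note above)
hours_nieder = estimated_hours(30391, rate)

print(round(rate, 2), "seconds per trail")
print(round(hours_nieder, 1), "hours for Niedersachsen")
```

That works out to roughly 2.5 seconds per trail, or somewhere around 21 hours for Niedersachsen.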

#### Step 5: Filter for 2018 & distance

Some initial filtering can be applied to reduce the amount of data that goes through the generating geometries step.

This function filters on two fields:
1. Filter for 2018 only. Data from the year 2018 is needed so that the social media data matches the forest definition data (which is for 2018).
2. Filter out very long trails (for now, >175 km). These tend to correspond to motorised transport (which, when covering large distances, may be difficult to pin down to CES for forests) or to unexpected use of the website/errors. For example, this trail https://www.wikiloc.com/hiking-trails/xabia-teulada-126272868 is recorded as a ~5 hour hike, but it goes from Germany to Spain. This trail already gets removed with the 2018 filter, but there may be others like it. 


In [3]:
# STEP 5: FILTER 2018 & DISTANCE

# Create a list of the json paths with all scraped data
track_json_paths = glob.glob('./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/track-*.json')

# Load jsons from list, select only 2018 data & trails less than certain distance, return new json
# This outputs to the PROCESSING folder!
def dist_year_filter(json_paths):
    for json_path in json_paths:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output
        output_path = "./processing/" + name_wo_ext + "_2018_distfilter.json" 

        # Load json as df
        track_df = pd.read_json(json_path) 

        # Select rows where date_recorded includes "2018" (rows with a missing date are dropped)
        track_2018_df = track_df[track_df["date_recorded"].str.contains("2018", na=False)]

        # Select rows where distance is less than 175 km
        track_2018short_df = track_2018_df[track_2018_df["distance_km"] < 175]
        
        # Save the filtered df as a json
        track_2018short_df.to_json(output_path)

# Run the function
dist_year_filter(track_json_paths)

# Load the json and check
#bremen_2018_df = pd.read_json("./processing/track-bremen_2018_distfilter.json")
#bremen_2018_df

#### Step 6: Generating geometries (bbox)

In step 4, I extracted all latitude values in one column, and all longitude values in another column. I can then extract the minimum and maximum values from each column in order to create a bounding box. This bounding box contains all available coordinates (without downloading the actual gpx or kml file - which is likely more invasive to scrape from the website) - this includes the trail start coordinates and any available photo/waypoint coordinates (unfortunately the trail end coordinates were not available in the htmls).

The code/function below generates a bounding box geometry for each trail within each json and outputs a shapefile.

In [None]:
### STEP 6: bbox geometries

# Create a list of the json paths with all the filtered trails
track2018_json_paths = glob.glob('./processing/track-*_2018_distfilter.json')

# Load jsons from list, generate bbox geometries and save to processing folder as shp
def bbox_generator(json_path_list):
    for json_path in json_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(json_path)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "_bbox.shp"

        # Load json as df
        track_df = pd.read_json(json_path)
        
        # Create column for xmin & xmax (lowest & highest longitude)
        track_df["xmin"] = [min(x) for x in track_df.longitudes]
        track_df["xmax"] = [max(x) for x in track_df.longitudes]

        # Create column for ymin & ymax (lowest & highest latitude)
        track_df["ymin"] = [min(x) for x in track_df.latitudes]
        track_df["ymax"] = [max(x) for x in track_df.latitudes]

        # Run shapely box function using new columns
        # .apply(lambda row: ..., axis=1) runs the code after the : for each row in the df
        # lambda simply indicates a function without a name is being used
        track_df["geometry"] = track_df.apply(lambda row: box(row["xmin"], row["ymin"], 
                                                              row["xmax"], row["ymax"]), axis=1)
        
        # Convert the df to a gdf
        track_gdf = gpd.GeoDataFrame(track_df, geometry='geometry')

        # Set the CRS (coordinates are lon/lat, WGS84) and reproject to match other data
        track_gdf = track_gdf.set_crs("EPSG:4326")
        track_gdf = track_gdf.to_crs("EPSG:3035")

        # Save the gdf as a shp
        track_gdf.to_file(output_path, driver="ESRI Shapefile")

# Run the function
#bbox_generator(track2018_json_paths)


**Note: exporting the data as a shp truncates the long text fields (e.g. "description" "photo_caption", "comments", etc).** I also tried writing the file as a geojson, which works fine, but then when loading back in as a geodataframe, there are problems with the list structures in the "photo_caption" and "comments" fields. 

For now, I think the best work-around is to use the shapefile to do the spatial intersection steps, and then to **join the results back to the main json** (i.e. STEP 5 OUTPUTS, filtered for year and distance) for any text analysis steps. I think the track URLs can be used as the join field as these are unique. 
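
A minimal sketch of that join with pandas is shown below. The two small dataframes are made-up stand-ins for the Step 5 JSON and for the attribute table of the shapefile after the spatial steps, and the column names `url` and `in_natura` are hypothetical.

```python
import pandas as pd

# Stand-in for the Step 5 JSON (full-length text fields intact)
tracks = pd.DataFrame({
    "url": [
        "https://www.wikiloc.com/hiking-trails/a-1",
        "https://www.wikiloc.com/hiking-trails/b-2",
    ],
    "description": [
        "A long description that a shapefile would truncate",
        "Another full-length text",
    ],
})

# Stand-in for the shapefile attributes after the spatial filtering
spatial = pd.DataFrame({
    "url": ["https://www.wikiloc.com/hiking-trails/b-2"],
    "in_natura": [True],
})

# Keep only the spatially selected trails and pull the full text back in.
# validate raises an error if a URL is not unique on either side.
joined = spatial.merge(tracks, on="url", how="left", validate="one_to_one")
print(joined)
```

Shapefile text fields are limited to 254 characters, so it would be worth confirming that no track URL exceeds that length before relying on it as the key.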

- apply with axis= 1 applies the function to each row
- lambda row sets up an anonymous function to do something for each row
- zip pairs the items in each list by index (so the lats and longs get paired up according to their order in the list)
- the list part converts the output from zip into a list
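
The pairing described in the bullets above can be seen on a single made-up trail:

```python
# Made-up coordinates for one trail (three recorded locations)
longitudes = [8.80, 8.82, 8.85]
latitudes = [53.07, 53.08, 53.10]

# zip stops at the shorter list without warning, so check the lengths first
assert len(longitudes) == len(latitudes)

# Pair by position: (longitude, latitude), i.e. (x, y) order as shapely expects
coordinates = list(zip(longitudes, latitudes))
print(coordinates)
```

On Python 3.10 or newer, `zip(longitudes, latitudes, strict=True)` raises an error on mismatched lengths instead of silently dropping values.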

In [None]:
### STEP 6 ALT: buffered line geometries

from shapely.geometry import LineString, Point

# Load tester json
bremen_2018_df = pd.read_json("./processing/track-bremen_2018_distfilter.json")

# Pair-up the lats and longs into coordinates
bremen_2018_df["coordinates"] = bremen_2018_df.apply(lambda row: list(zip(row["longitudes"], row["latitudes"])), axis=1)

# Function which creates point geom if only one coordinate pair available, otherwise line geom
def make_geometry(coords):
    if len(coords) == 1:
        return Point(coords[0])
    else:
        return LineString(coords)

# Run the make geom function on all rows
bremen_2018_df['geometry'] = bremen_2018_df['coordinates'].apply(make_geometry)

# Convert to geodataframe
bremen_2018_gdf = gpd.GeoDataFrame(bremen_2018_df, geometry="geometry")

# Define projection (coordinates are lon/lat, WGS84) and reproject
bremen_2018_gdf = bremen_2018_gdf.set_crs("EPSG:4326")
bremen_2018_gdf = bremen_2018_gdf.to_crs("EPSG:3035")


# Buffer all rows so that all geometries are now polygons
# PLACE HOLDER BUFFER VALUE FOR NOW
buffer_geoms = bremen_2018_gdf.buffer(100)

# Replace geometries in gdf with buffered geometries
bremen_2018_gdf = bremen_2018_gdf.set_geometry(buffer_geoms)

# Write as shapefile for visualising
bremen_2018_gdf.to_file("./processing/test_startupdate.shp", driver="ESRI Shapefile")





#### Step 7: Additional Filtering

1. Decide how to handle trails with only one set of coordinates (start location) - buffer or delete? For now I have decided to buffer.
2. Filter to only include areas which intersect with Natura 2000 areas. 
3. Filter for consensus forest and non-consensus forest


More filtering to consider:
1. Filter certain activity types?
2. Remove trails without any associated text (this depends if I end up using text or just trail counts)
3. Remove trails which have a bbox area below a certain value? (loop trails without photos/waypoints = too hard to tell where exactly the trail is)
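
To show why a zero-area test picks out the single-coordinate trails, here is a small worked example in plain Python (no shapely) with made-up coordinates. It also shows that a trail whose coordinates all share one longitude or one latitude gives a zero-area box too, so an `area == 0` filter would presumably catch those as well.

```python
def bbox_bounds(longitudes, latitudes):
    """Return (xmin, ymin, xmax, ymax) for one trail's coordinate lists."""
    return min(longitudes), min(latitudes), max(longitudes), max(latitudes)


def bbox_area(bounds):
    """Width times height of the box, in squared coordinate units."""
    xmin, ymin, xmax, ymax = bounds
    return (xmax - xmin) * (ymax - ymin)


# Only a start point: the box collapses to a point
single = bbox_bounds([8.80], [53.07])

# Two points on the same longitude: the box collapses to a line
aligned = bbox_bounds([8.80, 8.80], [53.07, 53.10])

# Points that differ in both directions: a proper box
normal = bbox_bounds([8.80, 8.85], [53.07, 53.10])

print(bbox_area(single), bbox_area(aligned), bbox_area(normal) > 0)
```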

In [5]:
# STEP 7: BUFFER SINGLE COORDINATE GEOMETRIES

# Create a list of the shp paths with all the bbox geometries
track2018_bbox_paths = glob.glob("./processing/track-*_2018_distfilter_bbox.shp")

def single_coord_trails(shp_path_list):
    for bboxshp in shp_path_list:
        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(bboxshp)[1] 
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: assemble the new file path for the output 
        output_path = "./processing/" + name_wo_ext + "+buffer.shp"

        # Load shp as gdf and calculate the area of all geoms
        allgeoms = gpd.read_file(bboxshp)
        allgeoms["area"] = allgeoms.area

        # Extract the rows where the area = 0 and where area > 0
        single_coord = allgeoms[allgeoms["area"] == 0.0]
        bbox = allgeoms[allgeoms["area"] > 0.0]
        
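        # Generate new geometries for the single coordinate trails based on trail distance
        # "distance_k" is distance_km with its name truncated by the shapefile format;
        # EPSG:3035 units are metres, so convert km to m before buffering
        single_coord_geoms = single_coord.buffer(single_coord["distance_k"] * 1000)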

        # Replace existing geometry with new geometry for single coordinate trails
        single_coord = single_coord.set_geometry(single_coord_geoms)

        # Combine the new buffered points with the bbox areas
        allgeoms_new = pd.concat([single_coord, bbox])

        # Save the gdf as a shp
        allgeoms_new.to_file(output_path, driver="ESRI Shapefile")

# Run the function
single_coord_trails(track2018_bbox_paths)



In [None]:
# STEP 7: NATURA INTERSECT

