## RQ2 Data Collection

Wikiloc data extraction for Germany. **See scrapy_setup_info.md for setting up scrapy, including edits I made to the settings to fix 403 error messages and make the scraping more polite.**

The Scrapy spiders are provided by Chai-Allah et al. (2023) through their GitHub repo: [Wiki4CES](https://github.com/achaiallah-hub/Wiki4CES).

From what I understand, the spiders provided in the Wiki4CES repo do the following:
1. **extract_link.py** Extracts the URLs for all the trails. You give it a starting region (an initial URL) and it goes through each city/town in that region and extracts all the trail links (URLs) in that city's listing. This spider needs to be run first to get the URLs for steps 2 and 3.
2. **wikiloc_track.py** Scrapes the trail details such as track name, difficulty, distance, author, views and description. It loads the trail URLs from a file called link.csv (presumably created in step 1).
3. **wikiloc_image.py** Scrapes image data from the trail pages, including URL, track name, user name, date, and location (latitude & longitude). It reads the trail page URLs from a file called link.csv (presumably created in step 1).
4. **download_image.py** Downloads images from the URLs in a CSV file called wikiloc_image.csv (presumably created in step 3). Not needed for my work, so I have removed this file.

**NOTE:** For some reason, running one spider seems to try to run all the spiders at once, and you end up getting error messages saying certain files don't exist (which makes sense, as these files need to be created by other spiders first). I looked for a solution, but for now I've just commented out the code within the other spiders. UPDATE: It seems to be okay once the errors have been resolved (it doesn't actually run the other spiders, but it seems to check that the required files exist and that the code is valid), so I'm leaving finished scripts uncommented as I correct them.
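
One possible explanation for the note above: `scrapy crawl` imports every module in the spiders folder in order to find the spider by name, so any file read written at module or class level runs even for spiders that are not being crawled. Whether the Wiki4CES spiders read link.csv at that level is an assumption here. A minimal sketch of deferring the read into a helper (the name `load_links` is hypothetical), which a spider could call from inside `start_requests()`:

```python
import csv
import os
import tempfile


def load_links(csv_path):
    """Return the trail URLs in csv_path, or an empty list if the file is missing.

    Hypothetical helper: calling it from inside a spider's start_requests()
    (rather than at class level) means the file is only read when that
    spider actually runs.
    """
    if not os.path.exists(csv_path):
        return []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # "Link" is the column name seen in the link CSV outputs
        return [row["Link"] for row in reader if row.get("Link")]


# Demonstration with a temporary file standing in for link.csv
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "link.csv")
    missing = load_links(demo_path)  # file has not been created yet
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        f.write("Link\nhttps://www.wikiloc.com/cycling-trails/bremen-achim-18077390\n")
    found = load_links(demo_path)
```

With this pattern, a missing input file gives an empty crawl for that spider instead of an error when a different spider is started.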

In [1]:
# SETUP

# Import packages
import os
import pandas as pd

from shapely.geometry import box
import geopandas as gpd

# Create folders for storing scrapy outputs
path_list = ["./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs"]

for path in path_list:
    if not os.path.exists(path):
        # makedirs also creates any missing parent folders
        os.makedirs(path)
        print("Folder %s created!" % path)
    else:
        print("Folder %s already exists" % path)

Folder ./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs already exists


#### Step 1: Updating extract_link.py

**Step 1a: Update start_urls (including Germany region info)** 
Edit extract_link.py to replace the start_urls. Originally this contained https://www.wikiloc.com/trails/france/auvergne-rhone-alpes, but this URL no longer seems to exist: it just redirects to https://www.wikiloc.com/trails/outdoor.

The URL format now needs to be https://www.wikiloc.com/trails/outdoor/*country_name*/*region_name*, so for Germany I will try https://www.wikiloc.com/trails/outdoor/germany and then each of the regions within it.

For Germany, Wikiloc has trails for the following regions:

| Count | Region                 | URL ending              |
| ----- | ---------------------- | ----------------------- | 
| 1     | Baden-Wurttemberg      | /baden-wurttemberg      | 
| 2     | Bavaria                | /bavaria                |
| 3     | Berlin                 | /berlin                 |
| 4     | Brandenburg            | /brandenburg            |
| 5     | Bremen                 | /bremen                 |
| -     | DE.16,11               | (don't use)             |
| 6     | Hamburg                | /hamburg                |
| 7     | Hessen                 | /hessen                 |
| 8     | Mecklenburg-Vorpommern | /mecklenburg-vorpommern |
| 9     | Niedersachsen          | /niedersachsen          |
| 10    | Nordrhein-Westfalen    | /nordrhein-westfalen    |
| 11    | Rheinland-Pfalz        | /rheinland-pfalz        |
| 12    | Saarland               | /saarland               |
| 13    | Sachsen                | /sachsen                |
| 14    | Saxony-Anhalt          | /saxony-anhalt          |
| 15    | Schleswig-Holstein     | /schleswig-holstein     |
| 16    | Thüringen              | /thuringen              |
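
A minimal sketch of building the start URL for each region from the URL endings in the table above. The names `BASE_URL`, `REGION_SLUGS` and `region_url` are only illustrative and are not part of the Wiki4CES code:

```python
BASE_URL = "https://www.wikiloc.com/trails/outdoor"
COUNTRY = "germany"

# URL endings copied from the table above (DE.16,11 is deliberately left out)
REGION_SLUGS = [
    "baden-wurttemberg", "bavaria", "berlin", "brandenburg", "bremen",
    "hamburg", "hessen", "mecklenburg-vorpommern", "niedersachsen",
    "nordrhein-westfalen", "rheinland-pfalz", "saarland", "sachsen",
    "saxony-anhalt", "schleswig-holstein", "thuringen",
]


def region_url(slug):
    """Assemble the listing URL for one region."""
    return f"{BASE_URL}/{COUNTRY}/{slug}"


# One start URL per region, keyed by the URL ending
region_urls = {slug: region_url(slug) for slug in REGION_SLUGS}
print(region_urls["bremen"])
```

Keeping the endings in one list also makes it easy to spot when Wikiloc adds or renames a region.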

*NOTE* The number of trails seems to be increasing steadily (for example, within a one-week period, the total count for Germany went up by ~1000). I'll need to keep track of the number of expected trails on the day of download. Also check to make sure no new regions are added!

DE.16,11 appears to be a few trails in Berlin. I don't think I need to bother with this as there are so few, they are in an urban area, and none of the routes really seem to have anything to do with forests.

**Step 1b: Update xpath expressions & other edits**
After overcoming the initial 403 error messages (see scrapy_setup_info.md), the spider seemed to correctly generate the URLs for all the cities within the region, but still didn't return the trail URLs. I started looking into the XPath expressions in extract_link.py, as I wondered whether the page structure had changed over time (like the URLs).

I found this video useful for understanding XPath: https://www.youtube.com/watch?v=4EvxqTSzUkI 
I then went to https://www.wikiloc.com/trails/outdoor/germany/bremen and did right click > Inspect to see the HTML (I selected Bremen as the testing region as it has the fewest trails). After searching for the components on the main Bremen page and then for one city (for example: https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt), I made a couple of changes to the XPaths in extract_link.py (see comments in script). I also made some changes to the pagination handling (see comments in script). Also, since the URLs saved initially were just the back half of the URL, without the beginning (e.g. /cycling-trails/bremen-achim-18077390), I adjusted the code to add the beginning part as well. This ended up being required for the other spiders to work properly.
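
A minimal sketch of turning the relative trail links into full URLs, using the standard-library `urljoin`. This is only an illustration of the idea and not necessarily how the edited spider does it; inside a Scrapy callback, `response.urljoin(link)` does the same job. The helper name `to_absolute` is hypothetical:

```python
from urllib.parse import urljoin

BASE = "https://www.wikiloc.com"


def to_absolute(link):
    """Prefix a relative trail link with the site root; absolute URLs pass through unchanged."""
    return urljoin(BASE, link)


relative = "/cycling-trails/bremen-achim-18077390"
absolute = to_absolute(relative)
already_absolute = to_absolute(absolute)
```

Because `urljoin` leaves an already complete URL untouched, the helper is safe to apply to a mix of relative and absolute links.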


#### Step 2: Complete workflow for running extract_link.py

*For now, this is just for the Bremen region.*

**Step 2a**
In extract_link.py:
1. Update start_urls: 'https://www.wikiloc.com/trails/outdoor/germany/bremen' and save.

**Step 2b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o crawling_outputs\link-bremen.csv

**Step 2c**
Remove duplicates: I am not sure why duplicates are occurring, but the code below simply removes them.

*NOTE:* For Bremen at the time of scraping (31 MARCH 2025), the website shows 1460 trails, however I get 1532 trails (after the duplicates are removed) - this means there are an extra 72 trails. I'm not sure why this is but I wonder if it has something to do with trails which cross borders (and therefore are in more than 1 region of Germany). It could be that these trails can be searched for in both regions but are only included in the count of 1 to avoid double-counting? **I should check for duplicates across regions to make sure all trails are unique.**

In [None]:
# STEP 2C: Remove CSV duplicates 

# Store scrapy spider path (where outputs are stored)
scrapy_output = "./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/"

# Create a list of the link CSVs
link_csv_list = ["link-bremen.csv", "link-niedersachsen.csv"]

# Load CSVs from list, remove duplicates and then write results to same file (overwrite)
# (the loop variable is named csv_name so it does not shadow Python's standard-library csv module)
for csv_name in link_csv_list:
    loaded_csv = pd.read_csv(scrapy_output + csv_name, sep="\t")
    loaded_csv = loaded_csv.drop_duplicates()
    loaded_csv.to_csv(scrapy_output + csv_name, index=False)

# Check
#link_bremen = pd.read_csv(scrapy_output + "link-bremen.csv", sep="\t")
#link_bremen
link_nieder = pd.read_csv(scrapy_output + "link-niedersachsen.csv", sep="\t")
link_nieder

Unnamed: 0,Link
0,https://www.wikiloc.com/hiking-trails/rundwand...
1,https://www.wikiloc.com/offroading-trails/30-m...
2,https://www.wikiloc.com/hiking-trails/rund-alt...
3,https://www.wikiloc.com/hiking-trails/alt-burl...
4,https://www.wikiloc.com/running-trails/alt-wol...
...,...
30387,https://www.wikiloc.com/hiking-trails/altenau-...
30388,https://www.wikiloc.com/hiking-trails/altenau-...
30389,https://www.wikiloc.com/running-trails/altenau...
30390,https://www.wikiloc.com/hiking-trails/oderteic...
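
The cross-region duplicate check mentioned in the Step 2c note could be sketched as follows. The function name `find_shared_links` and the example URLs are made up for illustration; in practice each list would come from the `Link` column of one region's link CSV:

```python
from collections import defaultdict


def find_shared_links(links_by_region):
    """Map each URL that appears in more than one region to the regions it appears in."""
    regions_by_link = defaultdict(set)
    for region, links in links_by_region.items():
        for link in links:
            regions_by_link[link].add(region)
    return {
        link: sorted(regions)
        for link, regions in regions_by_link.items()
        if len(regions) > 1
    }


# Made-up URLs standing in for the scraped trail links
demo = {
    "bremen": ["https://example.org/trail-a", "https://example.org/trail-b"],
    "niedersachsen": ["https://example.org/trail-b", "https://example.org/trail-c"],
}
shared = find_shared_links(demo)
```

Any URL returned here would be counted once per region if the regional files were simply stacked, so it would need to be dropped from all but one of them.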


#### Step 3: updating wikiloc_track.py

I edited wikiloc_track.py to update the XPaths (as with extract_link.py; see comments in script). I also added code to scrape additional information:
- date recorded
- photo/waypoint captions (title and body)
- comments
- **photo/waypoint latitudes and longitudes**
- **start point latitude and longitude** (unfortunately the end point is not stored in the html)

Because I handled all the coordinate extraction in this script, I did not use or update the wikiloc_image.py script (and I since deleted it from my repo). I extracted all latitude values in one column, and all longitude values in another column. I can then extract the minimum and maximum values from each column in order to create a bounding box.

Additionally, I removed the author extraction completely so that no personal information is collected. 

Although I updated the XPaths for the following features, I commented them out as I don't think I'll need them for my analysis:
- trail difficulty
- view counts
- download counts
- trail length/distance

#### Step 4: Complete workflow for running wikiloc_track.py

*For now, this is just for the Bremen region.*

**Step 4a**
In wikiloc_track.py:
1. Change CSV name in start_urls to: crawling_outputs\link-bremen.csv

**Step 4b**
In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki_track -o crawling_outputs\track-bremen.json

**The output needs to be JSON**; as CSV, the UTF-8 encoding doesn't seem to work properly and the German special characters are not handled well.
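
A small standard-library illustration of why JSON is a safe container for the German special characters: the text survives a round trip whether the characters are written as `\uXXXX` escapes or kept as-is. The record below reuses a track name from the Bremen data. As far as I know, Scrapy's JSON export uses the escaped form unless `FEED_EXPORT_ENCODING = "utf-8"` is set in settings.py, but that setting has not been verified for this project:

```python
import json

record = {"track_name": "Niederende - Ovelgönne", "region": "Thüringen"}

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters human-readable in the file
readable = json.dumps(record, ensure_ascii=False)

# Both forms decode back to exactly the same text
same_after_roundtrip = json.loads(escaped) == json.loads(readable) == record
```

Either form loads correctly with `pd.read_json`, so the escaped version only affects how the raw file looks in a text editor.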

For Bremen (1532 trails), with download delays and autothrottle on, this stage takes about **65 minutes**.
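
A rough way to budget time for larger regions, assuming the scraping time scales linearly with the number of trails (an assumption: autothrottle and server response times may change the rate). The 10,000-trail figure is a hypothetical example, not a real region count:

```python
BREMEN_TRAILS = 1532
BREMEN_MINUTES = 65

# Average time per trail observed for Bremen (about 2.5 seconds)
seconds_per_trail = BREMEN_MINUTES * 60 / BREMEN_TRAILS


def estimate_minutes(n_trails):
    """Estimate the scraping time in minutes, assuming the Bremen rate holds."""
    return n_trails * seconds_per_trail / 60


# Hypothetical region with 10,000 trails
example_minutes = estimate_minutes(10_000)
print(f"{seconds_per_trail:.2f} s per trail, {example_minutes / 60:.1f} h for 10,000 trails")
```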

#### Step 5: Filter for 2018

Although most of the filtering will happen later, an initial filter to only retain data from 2018 can be applied to reduce the amount of data that goes through the next step (generating the geometries). Data from the year 2018 is needed so that the social media data matches the forest definition data (which is for 2018).


In [10]:
# STEP 5: FILTER 2018 ONLY

# Create a list of the jsons with all scraped data
track_json_list = ["track-bremen.json"] # , "track-niedersachsen.json"

# Load jsons from list, select only 2018 data and write a new json
def filter_2018(json_list):
    # Store scrapy spider path (where outputs are stored)
    scrapy_output = "./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/"

    # json_name avoids shadowing Python's standard-library json module
    for json_name in json_list:
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(json_name)[0]
        # For output file naming: assemble the new file path for the output
        output_path = scrapy_output + name_wo_ext + "_2018.json"

        # Load json as df
        track_df = pd.read_json(scrapy_output + json_name)

        # Select rows where date_recorded includes "2018"
        # (na=False treats a missing date as "not 2018" instead of raising an error)
        track_2018_df = track_df[track_df["date_recorded"].str.contains("2018", na=False)]

        # Save the filtered df as a json
        track_2018_df.to_json(output_path)

# Run the function
filter_2018(track_json_list)

# Load the filtered json and check
bremen_2018_df = pd.read_json("./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/track-bremen_2018.json")
bremen_2018_df

Unnamed: 0,track_name,url_track,track_type,date_published,description text,date_recorded,photo_captions,comments,latitudes,longitudes
6,Dammsiel bis Grambke,https://www.wikiloc.com/cycling-trails/dammsie...,Road Bike,2018-06-30T15:22+0200,Dieser Weg ist der zweite Teil zum Weg von Gra...,June 2018,[None],[Gehört zum Trail 'Grüner Weg Waller Feldmark'],[53.158181],[8.777526]
14,Dag 7 Peene : Bremen (D) - Steenwijk (Nl),https://www.wikiloc.com/recreational-vehicle-t...,Motorhome,2018-11-17T18:42+0100,Dag 7 Peene : Bremen (D) - Steenwijk (Nl),November 2018,[None],[None],[53.073133],[8.803362]
37,Werdersee,https://www.wikiloc.com/stand-up-paddle-sup-tr...,Stand up Paddle,2018-05-25T18:03+0200,Werdersee,May 2018,[None],[None],[53.069035],[8.805651]
74,Weser-Romantische Straße (D9),https://www.wikiloc.com/outdoor-trails/weser-r...,Unspecified,2018-09-24T20:23+0200,Weser-Romantische Straße (D9),September 2018,[None],[None],[53.57506],[8.561429]
110,Findorff-Blockland-St. Jürgen-Ritterhude-Findorff,https://www.wikiloc.com/cycling-trails/findorf...,Road Bike,2018-04-07T17:44+0200,Finndorff-Blockland-St. Jürgen-Ritterhude-Find...,April 2018,"[Pause am Wümme Deich, Am Wümme Deich, Weitere...","[Nice trail, nice view.]","[53.135328, 53.13783, 53.147445, 53.147445, 53...","[8.872359, 8.862605, 8.846398, 8.846398, 8.845..."
...,...,...,...,...,...,...,...,...,...,...
1486,Aschenbeck - Tarmstedt,https://www.wikiloc.com/cycling-trails/aschenb...,Road Bike,2018-08-20T09:48+0200,Aschenbeck - Tarmstedt,August 2018,[None],[None],[52.932978],[8.404686]
1491,Niederende - Ovelgönne,https://www.wikiloc.com/mountain-biking-trails...,Mountain Bike,2018-08-21T17:23+0200,Niederende - Ovelgönne,August 2018,[None],[None],[53.185632],[8.779968]
1500,Bremen - Hamburg,https://www.wikiloc.com/car-trails/bremen-hamb...,Car,2018-06-06T22:30+0200,Bremen - Hamburg,June 2018,[None],[None],[53.071848],[8.80532]
1518,Nienburg-Bremen,https://www.wikiloc.com/bicycle-touring-trails...,Bicycle Touring,2018-07-02T21:50+0200,,"July 2, 2018",[None],[None],[52.644734],[9.216162]
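
The output above shows that `date_recorded` comes in at least two formats ("June 2018" and "July 2, 2018"). A minimal sketch of pulling out the year as a number, which works for both formats and avoids matching "2018" inside some other string. The helper name `recorded_year` is hypothetical and this is an alternative to the substring filter, not the method used above:

```python
import re

# A standalone four-digit year starting with 19 or 20
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def recorded_year(date_text):
    """Return the first four-digit year found in date_text, or None if there is none."""
    if not isinstance(date_text, str):
        return None
    match = YEAR_PATTERN.search(date_text)
    return int(match.group(0)) if match else None


years = [recorded_year(text) for text in ["June 2018", "July 2, 2018", "May 2019", None]]
```

On the DataFrame this could be applied as `track_df["date_recorded"].map(recorded_year) == 2018`.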


#### Step 6: Generating geometries (bbox)

In step 4, I extracted all latitude values in one column and all longitude values in another. I can then take the minimum and maximum of each column to create a bounding box. This bounding box contains all available coordinates, without downloading the actual GPX or KML file (which is likely more invasive to scrape from the website). It includes the trail start coordinates and any available photo/waypoint coordinates (unfortunately the trail end coordinates were not available in the HTML).

The code/function below generates a bounding box geometry for each trail within each json and outputs a shapefile.

In [13]:
### STEP 6: bbox geometries

# Create a list of the jsons with all scraped data
track_json_list = ["track-bremen_2018.json"] # , "track-niedersachsen.json"

# Load jsons from list, generate bbox geometries and save to processing folder as shp
def bbox_generator(json_list):
    # json_name avoids shadowing Python's standard-library json module
    for json_name in json_list:
        # For output file naming: remove extension from input file name
        name_wo_ext = os.path.splitext(json_name)[0]
        # For output file naming: assemble the new file path for the output
        output_path = "./processing/" + name_wo_ext + ".shp" 

        # Store scrapy spider path (where outputs are stored)
        scrapy_output = "./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/"
        # Load json as df
        track_df = pd.read_json(scrapy_output + json_name)
        
        # Create column for xmin & xmax (lowest & highest longitude)
        track_df["xmin"] = [min(x) for x in track_df.longitudes]
        track_df["xmax"] = [max(x) for x in track_df.longitudes]

        # Create column for ymin & ymax (lowest & highest latitude)
        track_df["ymin"] = [min(x) for x in track_df.latitudes]
        track_df["ymax"] = [max(x) for x in track_df.latitudes]

        # Run shapely box function using new columns
        # .apply(lambda row: ..., axis=1) runs the code after the : for each row in the df
        # lambda simply indicates a function without a name is being used
        track_df["geometry"] = track_df.apply(lambda row: box(row["xmin"], row["ymin"], 
                                                              row["xmax"], row["ymax"]), axis=1)
        
        # Convert the df to a gdf
        track_gdf = gpd.GeoDataFrame(track_df, geometry='geometry')

        # Set the CRS (EPSG:4326 = WGS84 longitude/latitude)
        track_gdf = track_gdf.set_crs("EPSG:4326")

        # Save the gdf as a shp
        # NOTE this truncates the long text descriptions, etc!!
        track_gdf.to_file(output_path, driver='ESRI Shapefile')

# Run the function
bbox_generator(track_json_list)

# Load the shapefile and check
bremen_gdf = gpd.read_file("./processing/track-bremen_2018.shp")
bremen_gdf



Unnamed: 0,track_name,url_track,track_type,date_publi,descriptio,date_recor,photo_capt,comments,latitudes,longitudes,xmin,xmax,ymin,ymax,geometry
0,Dammsiel bis Grambke,https://www.wikiloc.com/cycling-trails/dammsie...,Road Bike,2018-06-30T15:22+0200,Dieser Weg ist der zweite Teil zum Weg von Gra...,June 2018,['None'],"[""Gehört zum Trail 'Grüner Weg Waller Feldmark'""]",[53.158181],[8.777526],8.777526,8.777526,53.158181,53.158181,"POLYGON ((8.77753 53.15818, 8.77753 53.15818, ..."
1,Dag 7 Peene : Bremen (D) - Steenwijk (Nl),https://www.wikiloc.com/recreational-vehicle-t...,Motorhome,2018-11-17T18:42+0100,Dag 7 Peene : Bremen (D) - Steenwijk (Nl),November 2018,['None'],['None'],[53.073133],[8.803362],8.803362,8.803362,53.073133,53.073133,"POLYGON ((8.80336 53.07313, 8.80336 53.07313, ..."
2,Werdersee,https://www.wikiloc.com/stand-up-paddle-sup-tr...,Stand up Paddle,2018-05-25T18:03+0200,Werdersee,May 2018,['None'],['None'],[53.069035],[8.805651],8.805651,8.805651,53.069035,53.069035,"POLYGON ((8.80565 53.06904, 8.80565 53.06904, ..."
3,Weser-Romantische Straße (D9),https://www.wikiloc.com/outdoor-trails/weser-r...,Unspecified,2018-09-24T20:23+0200,Weser-Romantische Straße (D9),September 2018,['None'],['None'],[53.57506],[8.561429],8.561429,8.561429,53.575060,53.575060,"POLYGON ((8.56143 53.57506, 8.56143 53.57506, ..."
4,Findorff-Blockland-St. Jürgen-Ritterhude-Findorff,https://www.wikiloc.com/cycling-trails/findorf...,Road Bike,2018-04-07T17:44+0200,Finndorff-Blockland-St. Jürgen-Ritterhude-Find...,April 2018,"['Pause am Wümme Deich', 'Am Wümme Deich', 'We...","['Nice trail, nice view.']","[53.135328, 53.13783, 53.147445, 53.147445, 53...","[8.872359, 8.862605, 8.846398, 8.846398, 8.845...",8.751890,8.872359,53.094692,53.181573,"POLYGON ((8.87236 53.09469, 8.75189 53.09469, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Aschenbeck - Tarmstedt,https://www.wikiloc.com/cycling-trails/aschenb...,Road Bike,2018-08-20T09:48+0200,Aschenbeck - Tarmstedt,August 2018,['None'],['None'],[52.932978],[8.404686],8.404686,8.404686,52.932978,52.932978,"POLYGON ((8.40469 52.93298, 8.40469 52.93298, ..."
83,Niederende - Ovelgönne,https://www.wikiloc.com/mountain-biking-trails...,Mountain Bike,2018-08-21T17:23+0200,Niederende - Ovelgönne,August 2018,['None'],['None'],[53.185632],[8.779968],8.779968,8.779968,53.185632,53.185632,"POLYGON ((8.77997 53.18563, 8.77997 53.18563, ..."
84,Bremen - Hamburg,https://www.wikiloc.com/car-trails/bremen-hamb...,Car,2018-06-06T22:30+0200,Bremen - Hamburg,June 2018,['None'],['None'],[53.071848],[8.80532],8.805320,8.805320,53.071848,53.071848,"POLYGON ((8.80532 53.07185, 8.80532 53.07185, ..."
85,Nienburg-Bremen,https://www.wikiloc.com/bicycle-touring-trails...,Bicycle Touring,2018-07-02T21:50+0200,,"July 2, 2018",['None'],['None'],[52.644734],[9.216162],9.216162,9.216162,52.644734,52.644734,"POLYGON ((9.21616 52.64473, 9.21616 52.64473, ..."


#### Step 7: Additional Filtering


Must do:
- filter out outlier trails (with a massive bbox)
- spatial filter for Natura 2000 areas
- figure out how to handle trails with only one set of coordinates (start location) - buffer or delete?
- spatial filter for consensus forest and non-consensus forest


To consider:
- filter certain activity types?
- remove trails without any associated text (this depends if I end up using text or just trail counts)
- remove trails which have a bbox area below a certain value? (loop trails without photos/waypoints = too hard to tell where exactly the trail is)
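
A minimal sketch of how the bbox-based filters above could be flagged, assuming a simple approximation of about 111.32 km per degree of latitude. The function names and the 100 km threshold are hypothetical placeholders, not decisions made for the analysis:

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def bbox_size_km(xmin, ymin, xmax, ymax):
    """Approximate width and height of a lon/lat bounding box in kilometres."""
    mid_lat = math.radians((ymin + ymax) / 2)
    # A degree of longitude shrinks with the cosine of the latitude
    width_km = (xmax - xmin) * KM_PER_DEGREE_LAT * math.cos(mid_lat)
    height_km = (ymax - ymin) * KM_PER_DEGREE_LAT
    return width_km, height_km


def classify_bbox(xmin, ymin, xmax, ymax, max_side_km=100.0):
    """Label a bbox as 'point' (start location only), 'too_large' or 'ok'."""
    if xmin == xmax and ymin == ymax:
        return "point"
    width_km, height_km = bbox_size_km(xmin, ymin, xmax, ymax)
    if max(width_km, height_km) > max_side_km:
        return "too_large"
    return "ok"


# Two bboxes taken from the Bremen output, plus one made-up oversized box
start_only = classify_bbox(8.777526, 53.158181, 8.777526, 53.158181)
with_waypoints = classify_bbox(8.751890, 53.094692, 8.872359, 53.181573)
oversized = classify_bbox(8.8, 52.5, 13.4, 53.0)
```

The labels could be stored in a new column so that the "point" trails can later be buffered or dropped in one step once that decision is made.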