## RQ2 Data Collection


[Add description]

Wikiloc data extraction for Germany

Python code provided by Chai-Allah et al. (2023) through their GitHub repo: https://github.com/achaiallah-hub/Wiki4CES

In [19]:
# SETUP

# Import packages
import os
import pandas as pd

# Create folders for storing scrapy outputs
path_list = ["./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs"]

for path in path_list:
    if not os.path.exists(path):
        # makedirs also creates any missing parent folders (os.mkdir would fail without them)
        os.makedirs(path)
        print("Folder %s created!" % path)
    else:
        print("Folder %s already exists" % path)

Folder ./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs created!


#### Scrapy Set-up

To run scrapy you need to set up a scrapy "project". In Anaconda Prompt:

1. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis
2. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda 
3. scrapy startproject wikiloc_scrapy

Helpful scrapy information: https://docs.scrapy.org/en/latest/intro/tutorial.html

Then you need to add the scrapy "spiders" (the processes which crawl and scrape information from the web) as .py scripts to the directory wikiloc_scrapy/wikiloc_scrapy/spiders. Basically, the .py scripts from the achaiallah-hub repo need to be saved in this directory. In Anaconda Prompt:

1. cd wikiloc_scrapy\wikiloc_scrapy\spiders
2. git init
3. git remote add origin git@github.com:achaiallah-hub/Wiki4CES.git
4. git pull origin main

5. Afterwards, delete the git files created in wikiloc_scrapy\wikiloc_scrapy\spiders (otherwise you end up working on the achaiallah-hub Wiki4CES repo rather than your own repo!)

**NOTE** Instead of Steps 2-4, I first tried the simple command git clone https://github.com/achaiallah-hub/Wiki4CES.git (this works, but it stores the .py scripts inside a repo folder called Wiki4CES, and I think this causes problems when trying to use the spiders later). Steps 2-4 are a work-around: this way the Python scripts sit directly within the "spiders" directory rather than inside another directory. Steps 2-4 require an **ssh-key** to be set up.

From what I understand, the spiders provided in the Wiki4CES repo do the following:
1. **extract_link.py** Extracts the URLs for all the trails. You give it a starting point (an initial URL) and it goes through each city/town and extracts all the trail links (URLs) in that city's listing. It stores each link as {"Link": link}. It needs to be run first to get the URLs for steps 2 and 3.
2. **wikiloc_track.py** Scrapes trail details such as track name, difficulty, distance, author, views and description. It loads the trail URLs from a file called link.csv (presumably created in step 1).
3. **wikiloc_image.py** Scrapes image data from the trail pages, including URL, track name, user name, date, and location (latitude & longitude). It reads the trail pages from a file called link.csv (presumably created in step 1).
4. **download_image.py** Downloads images from the URLs in a CSV file called wikiloc_image.csv (presumably created in step 3).

#### Step 1: extract_link.py

Edit extract_link.py to replace the start_urls. Originally this contained https://www.wikiloc.com/trails/france/auvergne-rhone-alpes - this URL doesn't seem to exist anymore, as it just redirects to https://www.wikiloc.com/trails/outdoor.

My guess is that the URL format now needs to be https://www.wikiloc.com/trails/outdoor/ + *country_name* + *region_name* so for Germany I will try https://www.wikiloc.com/trails/outdoor/germany and then each of the regions within (I think I probably need to do each separately as that's the way the code was written for the original project).

For Germany, there are the following regions:

| Count | Region                 | URL ending              |
| ----- | ---------------------- | ----------------------- | 
| 1     | Baden-Wurttemberg      | /baden-wurttemberg      | 
| 2     | Bavaria                | /bavaria                |
| 3     | Berlin                 | /berlin                 |
| 4     | Brandenburg            | /brandenburg            |
| 5     | Bremen                 | /bremen                 |
| -     | DE.16,11               | (don't use)             |
| 6     | Hamburg                | /hamburg                |
| 7     | Hessen                 | /hessen                 |
| 8     | Mecklenburg-Vorpommern | /mecklenburg-vorpommern |
| 9     | Niedersachsen          | /niedersachsen          |
| 10    | Nordrhein-Westfalen    | /nordrhein-westfalen    |
| 11    | Rheinland-Pfalz        | /rheinland-pfalz        |
| 12    | Saarland               | /saarland               |
| 13    | Sachsen                | /sachsen                |
| 14    | Saxony-Anhalt          | /saxony-anhalt          |
| 15    | Schleswig-Holstein     | /schleswig-holstein     |
| 16    | Thüringen              | /thuringen              |
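
The guessed URL pattern and the URL endings in the table can be combined programmatically. This is a minimal sketch, assuming the pattern https://www.wikiloc.com/trails/outdoor/germany + URL ending holds for every region (only the Bremen URL is confirmed to work in these notes). The names `REGION_SLUGS` and `region_start_urls` are invented here for illustration.

```python
# Base URL for Germany on Wikiloc (the pattern is the guess described above)
BASE_URL = "https://www.wikiloc.com/trails/outdoor/germany"

# URL endings copied from the table (DE.16,11 deliberately left out)
REGION_SLUGS = [
    "baden-wurttemberg", "bavaria", "berlin", "brandenburg", "bremen",
    "hamburg", "hessen", "mecklenburg-vorpommern", "niedersachsen",
    "nordrhein-westfalen", "rheinland-pfalz", "saarland", "sachsen",
    "saxony-anhalt", "schleswig-holstein", "thuringen",
]


def region_start_urls(base_url=BASE_URL, slugs=REGION_SLUGS):
    """Return one candidate start URL per region."""
    return [f"{base_url}/{slug}" for slug in slugs]


for url in region_start_urls():
    print(url)
```

Each printed URL is a candidate value for start_urls in extract_link.py, to be used one region at a time.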

**NOTE** The number of trails seems to be increasing steadily (within one week, the total count for Germany went up by ~1000). I'll need to keep track of the number of expected trails on the day of download. Also check to make sure no new regions are added!

DE.16,11 appears to be a few trails in Berlin - I don't think I need to bother with this as there are so few, they are in an urban area, and none of the routes look like they have anything to do with forests.


In Anaconda Prompt (with conda environment activated and from the scrapy project's top level directory):
1. (if needed) conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda 
2. (if needed) cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o link.csv

**NOTE:** For some reason, this seems to try to run all the spiders at once, and you end up getting error messages saying that certain files don't exist (which makes sense, as these files first need to be created by other spiders). I looked for a solution but, for now, I've just commented out the code within the other spiders. UPDATE: It seems to be okay once the errors have been resolved, so I'm leaving finished scripts uncommented as I correct them.
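
One possible explanation, offered as an assumption rather than something verified against the Wiki4CES code: Scrapy imports every module in the spiders folder to find the spider named on the command line, so a file read placed at module or class level in another spider runs as well and fails if its input file does not exist yet. A minimal sketch of a lazy alternative follows. The helper `load_links` is hypothetical; it would only be called when a spider actually starts (for example from inside its start_requests method) and returns an empty list when the file is missing.

```python
import csv
import os


def load_links(csv_path, column="Link"):
    """Read trail URLs from a CSV file, or return [] if it does not exist yet."""
    if not os.path.exists(csv_path):
        return []
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f) if row.get(column)]


# A missing file no longer raises an error when the module is imported
print(load_links("file_that_does_not_exist.csv"))
```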

**PROBLEM** HTTP Status Code 403 - Forbidden / Access Denied - could this be an anti-scraping measure?

**Solution Attempt #1**

1. pip install scrapy_cloudflare_middleware
2. edit downloader middleware in settings.py according to: https://github.com/clemfromspace/scrapy-cloudflare-middleware?tab=readme-ov-file

This gave an error message, so I tried installing an older version of requests/urllib3 as per https://stackoverflow.com/questions/76414514/cannot-import-name-default-ciphers-from-urllib3-util-ssl-on-aws-lambda-us
3. conda install requests==2.28.2
4. Now try scrapy crawl command

I got this to run, but it ended up back at the 403 error messages again. The solution may be too old: this source suggests that the cloudflare middleware solution no longer works: https://www.zenrows.com/blog/scrapy-cloudflare#conclusion

I reverted to the original set-up by:
1. Commenting out the downloader middleware in settings.py
2. ~~conda update requests~~ (just left requests as is for now)

**Solution Attempt #2**

1. Add default request headers according to https://www.zenrows.com/blog/scrapy-headers#most-important-ones
2. Now try scrapy crawl command
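
For reference, default request headers are set through Scrapy's DEFAULT_REQUEST_HEADERS setting in settings.py. The values below are illustrative placeholders only: they are not the headers actually used in this project, nor the exact ones recommended in the linked article.

```python
# settings.py (excerpt): all header values here are illustrative assumptions
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Scrapy reads the browser identification string from this separate setting
# (placeholder value, replace with a full browser user-agent string)
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```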

Now the spider seems to correctly generate the urls for all the cities within the region, but still doesn't return the trail URLs. I started looking into the xpath expressions in the extract_link.py as I wondered if the path structure has changed a bit over time (like the URLs).

I found this video useful for understanding xpath https://www.youtube.com/watch?v=4EvxqTSzUkI 
I then went to https://www.wikiloc.com/trails/outdoor/germany/bremen and did right click > Inspect to see the HTML. After searching for the components on the main Bremen page and then on one city page (for example: https://www.wikiloc.com/trails/outdoor/germany/bremen/alte-neustadt), I made a couple of changes to the xpaths.

Now the spider generates URLs (or at least parts of URLs) for Bremen (which I'm using as my testing region)! 

**NOTE:** Since the URLs saved so far are just the back half of the URL, without the beginning (e.g. /cycling-trails/bremen-achim-18077390), I adjusted the code to add the beginning part as well. I'm not 100% sure if this is needed or not (I think it might be), so I might need to remove this later.
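
Adding the beginning of the URL can be done with the standard library's urljoin. This is a minimal sketch rather than the exact change made in extract_link.py; `to_absolute` is a hypothetical helper name, and the site root is assumed to be https://www.wikiloc.com.

```python
from urllib.parse import urljoin

BASE = "https://www.wikiloc.com"


def to_absolute(link, base=BASE):
    """Prefix relative trail links with the site root; absolute links are unchanged."""
    return urljoin(base, link)


print(to_absolute("/cycling-trails/bremen-achim-18077390"))
# https://www.wikiloc.com/cycling-trails/bremen-achim-18077390
```

Inside a Scrapy callback, response.urljoin(link) does the same thing relative to the page being parsed.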

The next issue was that only 861 URLs were being saved for Bremen (the website says there are 1450). This was an issue with the pagination handling, so I made some modifications to the next and next_page sections of extract_link.py. Now I get 2061 URLs, but a check in Excel showed there were many duplicates. After removing these I get 1458, which I think is correct (I think the website rounds its counts, as all trail counts at the city level are divisible by 10).

Below I will record the workflow per region to make this clear now that things seem to be working!

**Extract Links Workflow Example: Bremen**

In extract_link.py:
1. Update start_urls: 'https://www.wikiloc.com/trails/outdoor/germany/bremen' and save.

In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki -o crawling_outputs\link-bremen.csv
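
Since only the output file name changes between regions, the crawl command can be generated from the region's URL ending. This is a minimal sketch with a hypothetical helper `crawl_command`; it only builds the command and does not run it, and start_urls in extract_link.py still has to be edited by hand for each region.

```python
from pathlib import PureWindowsPath


def crawl_command(region_slug, spider="wiki", out_dir="crawling_outputs"):
    """Build the scrapy crawl command for one region as an argument list."""
    out_file = PureWindowsPath(out_dir) / f"link-{region_slug}.csv"
    return ["scrapy", "crawl", spider, "-o", str(out_file)]


print(" ".join(crawl_command("bremen")))
# scrapy crawl wiki -o crawling_outputs\link-bremen.csv
```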


Remove duplicates:
I am not sure why duplicates are occurring, but the step below simply removes them.
(see code below)

*NOTE:* For Bremen at the time of scraping (31 MARCH 2025), the website shows 1460 trails, however I get 1532 trails (after the duplicates are removed) - this means there are an extra 72 trails. I'm not sure why this is but I wonder if it has something to do with trails which cross borders (and therefore are in more than 1 region of Germany). It could be that these trails can be searched for in both regions but are only included in the count of 1 to avoid double-counting? **I should check for duplicates across regions to make sure all trails are unique.**

In [21]:
# Remove CSV duplicates 

# Store scrapy spider path (where outputs are stored)
scrapy_output = "./wikiloc_scrapy/wikiloc_scrapy/spiders/crawling_outputs/"

# Load the CSV as a df
link_bremen = pd.read_csv(scrapy_output + "link-bremen.csv", sep="\t")

# Remove duplicate rows (keeps the first occurrence; the original index labels are retained)
link_bremen.drop_duplicates(inplace=True)

# Write the results to same file (overwrite)
link_bremen.to_csv(scrapy_output + "link-bremen.csv", index=False)

# Check
link_bremen

|      | Link                                               |
| ---- | -------------------------------------------------- |
| 0    | https://www.wikiloc.com/bicycle-touring-trails...  |
| 1    | https://www.wikiloc.com/hiking-trails/weser-ra...  |
| 2    | https://www.wikiloc.com/mountain-biking-trails...  |
| 3    | https://www.wikiloc.com/hiking-trails/weser-ra...  |
| 4    | https://www.wikiloc.com/hiking-trails/weser-ra...  |
| ...  | ...                                                |
| 2021 | https://www.wikiloc.com/hiking-trails/bremen-4...  |
| 2023 | https://www.wikiloc.com/outdoor-trails/06bremh...  |
| 2027 | https://www.wikiloc.com/outdoor-trails/05lohnb...  |
| 2037 | https://www.wikiloc.com/bicycle-touring-trails...  |
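
The check for trails that appear in more than one region could look roughly like this. It is a minimal sketch that assumes the same trail has an identical URL in every region listing; the function name and the example links are made up for illustration and are not scraped data.

```python
from collections import defaultdict


def links_in_multiple_regions(links_by_region):
    """Map each link that appears in more than one region to those regions."""
    regions_per_link = defaultdict(set)
    for region, links in links_by_region.items():
        for link in links:
            regions_per_link[link].add(region)
    return {
        link: sorted(regions)
        for link, regions in regions_per_link.items()
        if len(regions) > 1
    }


# Made-up example links, not real scraped data
example = {
    "bremen": ["/hiking-trails/example-a-1", "/hiking-trails/example-b-2"],
    "niedersachsen": ["/hiking-trails/example-b-2", "/hiking-trails/example-c-3"],
}
print(links_in_multiple_regions(example))
```

In practice the lists would come from the per-region link CSV files once more than one region has been scraped.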


PLAN
- get everything working for Bremen
- run "formally" by keeping track of date of download, number of trails listed on website, number actually downloaded, etc
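
One way to keep track of the download date and trail counts is a small CSV log with one row per scraping run. This is a minimal sketch; the function name, the column names and the log file name in the usage note are all assumptions made for illustration.

```python
import csv
import os
from datetime import date

LOG_FIELDS = ["region", "download_date", "trails_on_website",
              "trails_downloaded", "difference"]


def log_download(log_path, region, trails_on_website, trails_downloaded,
                 download_date=None):
    """Append one row per scraping run, writing the header on first use."""
    row = {
        "region": region,
        "download_date": (download_date or date.today()).isoformat(),
        "trails_on_website": trails_on_website,
        "trails_downloaded": trails_downloaded,
        "difference": trails_downloaded - trails_on_website,
    }
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
    return row
```

For the Bremen run noted earlier, the call would be log_download("download_log.csv", "bremen", 1460, 1532, date(2025, 3, 31)), which records a difference of 72.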

#### Step 2: wikiloc_track.py

I edited wikiloc_track.py to update the xpaths and add extraction of:
- date recorded
- photo captions (title and body)
- comments

I removed the author extraction so that no personal information is collected. Although I updated the xpaths for the following features, I commented them out as I don't think I'll need them for my analysis:
- trail difficulty
- view counts
- download counts
- trail length/distance

To run the script:

In wikiloc_track.py:
1. Change CSV name in start_urls to crawling_outputs\link-bremen.csv

In Anaconda Prompt:
1. conda activate C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\thesis_env_conda
2. cd C:\Users\ninam\Documents\UZH\04_Thesis\code\nm_forest_thesis\wikiloc_scrapy\wikiloc_scrapy\spiders
3. scrapy crawl wiki_track -o crawling_outputs\track-bremen.json

**Needs to be output as json** otherwise the utf-8 encoding doesn't seem to work properly and the German special characters are not handled well. 
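
When the JSON is read back in, the German characters should come through intact. This is a minimal sketch using the standard json module with a made-up record. Depending on the FEED_EXPORT_ENCODING setting, Scrapy's JSON feed may store non-ASCII characters as \uXXXX escapes; json decodes these back to the original characters either way. `load_tracks` is a hypothetical helper name.

```python
import json

# Made-up record in the escaped form a JSON feed may contain
raw = '[{"track_name": "Th\\u00fcringer Wald"}]'
records = json.loads(raw)
print(records[0]["track_name"])  # Thüringer Wald


def load_tracks(json_path):
    """Load a scraped track file, reading it explicitly as UTF-8."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)
```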