# Setup

### Run Previous Script

Uncomment these lines when debugging this notebook on its own; otherwise, run this notebook after the master script.

In [1]:
#%run ./1_Master_Script.ipynb
#%run ./2_Cleaning_layers.ipynb

Master Script has been run successfully!
Cleaning Layers Script has been run successfully!
Chosen Study Area: Stolen Lands


### Pathfinder Wiki Cities

NOTE: this database collects data on ALL cities from the Wiki. For that reason, we're using the source file for cities, as we're interested only in the wiki links it provides.

In [2]:
path = source_loc

cities = gpd.read_file(path + 'cities.geojson')

cities

Unnamed: 0,Name,link,capital,size,text,articleLength,geometry
0,Aaminiut,https://pathfinderwiki.com/wiki/Aaminiut,False,2,<p><b>Aaminiut</b> is the largest town in the ...,700,POINT (-43.3315 67.15341)
1,Aaramor,https://pathfinderwiki.com/wiki/Aaramor,False,2,<p>The city-fortress of <b>Aaramor</b> is loca...,1400,POINT (-5.35496 51.01807)
2,Abberton,https://pathfinderwiki.com/wiki/Abberton,False,3,"<p><b>Abberton</b> is a small, declining town ...",600,POINT (-0.55697 32.20194)
3,Abken,https://pathfinderwiki.com/wiki/Abken,False,2,"<p>One of the newest settlements in <a href=""h...",1600,POINT (-20.54576 44.95886)
4,Absalom,https://pathfinderwiki.com/wiki/Absalom,True,0,"<p>For more than 4,000 years, <b>Absalom</b> (...",14500,POINT (-0.23431 30.88863)
...,...,...,...,...,...,...,...
846,Zimar,https://pathfinderwiki.com/wiki/Zimar,False,1,"<p>One of the main defensive <a href=""https://...",2300,POINT (6.7078 31.58416)
847,Ziplatna,https://pathfinderwiki.com/wiki/Ziplatna,False,1,<p><b>Ziplatna</b> is the northernmost of the ...,300,POINT (-99.19609 14.74814)
848,Zlatomesto,https://pathfinderwiki.com/wiki/Zlatomesto,False,2,<p><b>Zlatomesto</b> is a small town in <a hre...,900,POINT (-24.7329 51.73437)
849,Zom Kullan,https://pathfinderwiki.com/wiki/Zom_Kullan,True,0,"<p><b>Zom Kullan</b>, the capital city of <a h...",400,POINT (153.268 2.37258)


### Check Age

Here, we check when we last ran this script. More specifically, we look for a copy of the file this script creates, and check its timestamp. If enough time has passed, or the file doesn't exist, then we run the code.

This is important for two reasons. First, this script takes some time to run, though not much. Second, and more importantly, web scraping puts load on the wiki's servers and should be kept as light as possible. Checking the file's age first keeps the amount of scraping to a minimum.

In [3]:
max_age = 7 # Days

In [4]:
path = clean_loc

try:

    with open(path + 'city_info.txt') as file:
        output_old = json.load(file)

    # Now check how much time has elapsed
    last_generated = output_old['timestamp']
    last_generated = datetime.datetime.fromisoformat(last_generated)

    time_elapsed = datetime.datetime.today() - last_generated
    days_elapsed = time_elapsed.days

    print(f"Days since last generation: {days_elapsed}")

    # Assign flag
    if days_elapsed >= max_age:
        
        regenerate_city_info = True

    else:

        regenerate_city_info = False

        # Rename the object, so it can be called in the future

        cities_info = output_old

except Exception as error:

    print("City info not found. Regenerating new copy. Error printed below:")
    print(f"ERROR MESSAGE: {error}")
    
    regenerate_city_info = True

print(f"Regenerate Cities: {regenerate_city_info}")

City info not found. Regenerating new copy.
ERROR MESSAGE: [Errno 2] No such file or directory: 'D:/0Coding Projects/GitHub/My Repositories/Pathfinder Mapping/Data/Cleaned_Data/City Info/city_info.txt'
Regenerate Cities: True
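
The age check above can also be written as a small function that is easy to test on its own. This is a minimal sketch, not the notebook's actual code: the name `needs_refresh` and its `now` parameter are hypothetical, and it assumes the stored timestamp is a string that `datetime.datetime.fromisoformat` can read.

```python
import datetime


def needs_refresh(timestamp, max_age_days, now=None):
    """Return True when an ISO-format timestamp is at least max_age_days old."""
    # 'now' is a hypothetical parameter that lets the check be tested with fixed dates
    now = now or datetime.datetime.today()
    last_generated = datetime.datetime.fromisoformat(timestamp)
    return (now - last_generated).days >= max_age_days


# Fixed reference date so the result does not depend on when this is run
reference = datetime.datetime(2024, 1, 10, 12, 0, 0)
print(needs_refresh("2024-01-01 09:30:00", 7, now=reference))  # 9 days old
print(needs_refresh("2024-01-08 09:30:00", 7, now=reference))  # 2 days old
```

Passing `now` explicitly is only a convenience for testing; leaving it out falls back to the current date and time, as in the cell above.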


### Establish API

While we *could* collect all the data we need rather easily by web-scraping with BeautifulSoup, that would be very rude. It would cause a drastic spike in traffic on the wiki, potentially harming the experience of other users and risking an IP ban from moderators. So instead, we'll use the MediaWiki API to gather data in a less strenuous way.

In [5]:
# Establish session
session = req.Session()

# MediaWiki API URL
api_url = 'https://pathfinderwiki.com/w/api.php'

# Web Scraping

Typically, web scraping is very straightforward in Python: using BeautifulSoup, you parse a set of websites from their URLs. But that has two issues here:

1) It's inefficient. This script runs through *hundreds* of Pathfinder Wiki pages, so the usual method takes multiple minutes to run.

2) It's rude. The Pathfinder Wiki folks do not like web scrapers, as they strain the wiki's servers.

So, for the benefit of this code and to respect the wishes of the Pathfinder Wiki operators, we will use the MediaWiki API for parsing instead. BeautifulSoup will still be used for sorting through the HTML code, of course.

Useful links on MediaWiki API:

1) Use the MediaWiki API: https://www.mediawiki.org/wiki/API:Main_page

2) Follow their etiquette rules: https://www.mediawiki.org/wiki/API:Etiquette

3) Use a header: https://foundation.wikimedia.org/wiki/Policy:User-Agent_policy
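
One piece of that etiquette guidance is to make requests one at a time rather than in parallel. The sketch below shows one way to do that, with a short pause between calls. The helper `fetch_serially`, the stand-in `fake_fetch`, and the delay values are all assumptions made for illustration; the delay is not a documented limit of the Pathfinder Wiki.

```python
import time


def fetch_serially(fetch, page_names, delay_seconds=1.0):
    """Call fetch(name) for each page name, one at a time, pausing between calls."""
    results = {}
    for index, name in enumerate(page_names):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests, not before the first
        results[name] = fetch(name)
    return results


# Stand-in for a real API call, so the sketch runs without network access
def fake_fetch(name):
    return f"<p>{name}</p>"


pages = fetch_serially(fake_fetch, ["Absalom", "Zimar"], delay_seconds=0.1)
print(pages)
```

In the notebook, a function that actually requests a page would take the place of `fake_fetch`.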

### Create Header

This is effectively the script's ID, so the Pathfinder Wiki operators know who is causing the traffic. If this code misbehaves and makes their lives harder, they can contact the code's operator rather than banning them outright.

In [6]:
headers = {
    'User-Agent': "Boost's Webscraper Script (Discord: _boost, Pathfinder Wiki Server)"
}

### API Request

We still use BeautifulSoup to sort through the HTML code, but the API pulls and parses each page first.

In [7]:
def api_request(page_name):

    # session, api_url and headers are defined in earlier cells;
    # reading them here does not require a global statement

    # Ask the API to parse the page and return its rendered HTML
    PARAMS = {
        'action': 'parse',
        'page': page_name,
        'format': 'json'
    }
    
    page = session.get(url=api_url, params=PARAMS, headers=headers)
    page = page.json()

    # The rendered HTML sits under parse -> text -> '*'
    page_text = page['parse']['text']['*']

    return page_text
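
When a request fails, for example because the page does not exist, the MediaWiki API replies with an `error` object instead of a `parse` object, so the lookup `page['parse']` raises a `KeyError`. The sketch below shows one way to surface the API's own message instead. The helper `extract_html` is hypothetical, and both payloads are hand-written to resemble API responses rather than copied from real output.

```python
def extract_html(payload):
    """Return the rendered HTML from an action=parse response, or raise on an API error."""
    if 'error' in payload:
        error = payload['error']
        raise LookupError(f"{error.get('code')}: {error.get('info')}")
    return payload['parse']['text']['*']


# Hand-written payloads shaped like API responses (illustrative, not real output)
ok_payload = {'parse': {'title': 'Absalom', 'text': {'*': '<p>Absalom</p>'}}}
error_payload = {'error': {'code': 'missingtitle', 'info': "The page you specified doesn't exist."}}

print(extract_html(ok_payload))

try:
    extract_html(error_payload)
except LookupError as error:
    print(f"API error: {error}")
```

Raising a descriptive exception keeps the failure reason available to whatever code calls the function.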

### Web Scrape

In [8]:
def scrape_data(page_name):
    
    page_text = api_request(page_name)
    
    soup = BeautifulSoup(page_text, 'html.parser')

    # Dictionary mapping each key's text to the text of the element that follows it
    data_dict = dict()

    # Keys make up the name column of the dataframe
    keys = soup.find_all("div",{'class': 'key'})

    for key in keys:

        # NOTE: some keys have data which is actually multi-column, but that can be cleaned in Pandas

        elements = key.find_next()

        # Store the value under its key
        data_dict[key.text] = elements.text

    return data_dict

In [None]:
# Relational dictionary containing the data

if regenerate_city_info:

    city_demos = dict()
    
    # Construct list of cities that have pages
    na_link = 'https://pathfinderwiki.com/wiki/PathfinderWiki:Map_Locations_Without_Articles'
    city_list = cities.copy()
    city_list = city_list.loc[city_list['link'] != na_link]
    city_list = city_list['Name'].to_list()
    
    for city in city_list:
    
        try:
    
            city_demos[city] = scrape_data(city)
    
        except Exception:
    
            # Some errors are not really solvable, and are not worth looking into.
            # Catching Exception (not a bare except) still lets Ctrl+C stop the loop.
            city_demos[city] = 'ERROR'
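
Because failed cities are stored as the string `'ERROR'` next to real dictionaries, later steps need to tell the two apart. A minimal sketch of that separation follows. The helper `split_results` is hypothetical, and the sample values are invented placeholders, not scraped data.

```python
def split_results(city_demos):
    """Separate successfully scraped cities from those marked 'ERROR'."""
    scraped = {name: data for name, data in city_demos.items() if data != 'ERROR'}
    failed = sorted(name for name, data in city_demos.items() if data == 'ERROR')
    return scraped, failed


# Invented placeholder values, shaped like the output of the loop above
sample = {
    'City A': {'Example key': 'example value'},
    'City B': 'ERROR',
    'City C': {},
}

scraped, failed = split_results(sample)
print(f"Scraped: {len(scraped)}, failed: {failed}")
```

A city with an empty dictionary, like `'City C'`, counts as scraped: the page loaded but no keys were found on it.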

### Timestamp

This wraps the scraped data in an outer dictionary, so a timestamp can be stored alongside it. The timestamp is used to decide whether to rerun the script: since the script takes some time to run, we save the data and refresh it only once the file is old enough.

In [None]:
if regenerate_city_info:

    # ISO 8601 string, readable later with datetime.datetime.fromisoformat
    today = datetime.datetime.now().isoformat()

In [None]:
if regenerate_city_info:

    cities_info = {
        'data': city_demos,
        'timestamp': today
    }

# Export

In [None]:
if regenerate_city_info:

    path = clean_loc
    
    with open(path + 'city_info.txt', 'w') as f:
        json.dump(cities_info, f)
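
The saved file only helps if the timestamp can be read back later by the age check. The sketch below writes a dictionary of the same shape to a temporary folder and reloads it to confirm the round trip. The temporary folder stands in for `clean_loc`, and the sample data is an invented placeholder.

```python
import datetime
import json
import os
import tempfile

# Same shape as cities_info, with invented placeholder data
cities_info = {
    'data': {'Example City': {'Example key': 'example value'}},
    'timestamp': datetime.datetime.now().isoformat(),
}

# A temporary folder stands in for clean_loc so nothing real is overwritten
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, 'city_info.txt')

    with open(file_path, 'w') as f:
        json.dump(cities_info, f)

    with open(file_path) as f:
        reloaded = json.load(f)

# The timestamp survives the round trip and parses back into a datetime
parsed = datetime.datetime.fromisoformat(reloaded['timestamp'])
print(reloaded == cities_info, type(parsed).__name__)
```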

# Run Message

This message confirms that the script ran successfully when it is called from another script.

In [None]:
print("City Scraping Script has been run successfully!")