# Terraria wiki article recommender
### Hubert Nowakowski 160302
### Mukhammad Sattorov 159351

<hr>

# Introduction
This project recommends similar articles from the Terraria wiki based on the text content of those articles. Our dataset consists of 2000 pages scraped from `terraria.wiki.gg`. We chose this wiki in particular for 3 main reasons:
1. The person writing this likes the game and knows it very well :^)
2. This wiki has a little under 6 thousand articles, so we could get decent coverage with a sane number of downloaded articles.
3. A lot of mechanics, NPCs, items, events, bosses, etc. in this game affect and depend on each other, so we were hoping to see some interesting relations between articles.

<hr>

# Scraping

### Approach and libraries used
We started on the main page of the wiki, collected all the hyperlinks leading to other pages into a queue, and performed a breadth-first search (BFS). On every page visited from that point onwards, we took all of the text from the main article div and added all not-yet-visited hyperlinks to the queue for further search.
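
To make the traversal order concrete, here is a minimal sketch of the same queue-plus-visited-set idea run on a tiny in-memory link graph. The `TOY_LINKS` dictionary and the `crawl_order` helper are hypothetical stand-ins for the real wiki and scraper, used only so the example runs without any network access.

```python
from collections import deque

# Hypothetical link graph standing in for the wiki: page -> pages it links to.
TOY_LINKS = {
    "/wiki/Main": ["/wiki/Bosses", "/wiki/NPCs"],
    "/wiki/Bosses": ["/wiki/Eye_of_Cthulhu", "/wiki/NPCs"],
    "/wiki/NPCs": ["/wiki/Guide", "/wiki/Bosses"],
    "/wiki/Eye_of_Cthulhu": ["/wiki/Bosses"],
    "/wiki/Guide": [],
}


def crawl_order(start_links, max_pages):
    """Return the order in which BFS would visit pages, up to max_pages."""
    queue = deque(dict.fromkeys(start_links))  # de-duplicate, keep order
    visited = set(queue)
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        # Enqueue only links we have not seen before.
        for link in TOY_LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


order = crawl_order(TOY_LINKS["/wiki/Main"], max_pages=10)
print(order)
```

Pages linked directly from the starting page are visited first, and pages found deeper in the graph are visited later, which is why a capped crawl tends to favour articles close to the main page.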

In an ideal world we'd have been able to just blast requests as fast as possible and be done with downloading the dataset within a few minutes at most. The scraping guidelines of the wiki, however, specify a 1 second cooldown between requests. This still wouldn't have been too bad, but even a 1 second cooldown was short enough to trigger Cloudflare's captchas and eventually put the scraper's IP on a blacklist for a few hours.

This forced us to ditch the regular `requests` library in favour of `cloudscraper`, which works in a nigh-identical way but employs workarounds for Cloudflare's protections specifically. After a bit of trial and error with request cooldowns and headers, we arrived at the working solution shown below. This implementation allowed us to scrape the aforementioned 2000 articles within a *serviceable* timeframe of a little over 2 hours.
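
As a rough sanity check on that timeframe, the sketch below estimates how long the waiting alone takes. It assumes the randomised 3 to 5 second delay from the `sleep(uniform(2, 4) + 1)` call in the scraper code and ignores network and parsing time, so it is a lower bound and not a measurement.

```python
PAGES = 2000                      # number of articles scraped
MIN_DELAY, MAX_DELAY = 3.0, 5.0   # seconds, from sleep(uniform(2, 4) + 1)

# The mean of a uniform distribution is the midpoint of its range.
mean_delay = (MIN_DELAY + MAX_DELAY) / 2
total_seconds = PAGES * mean_delay
total_hours = total_seconds / 3600

print(f"{total_hours:.2f} h")  # prints "2.22 h"
```

The expected waiting time of about 2.2 hours is consistent with the observed runtime of a little over 2 hours.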

Aside from the hurdles with the anti-bot protection, the rest of the scraping was very straightforward: the highly standardised layout of the wiki's pages allowed us to inspect the same div on each article for both text content and other links.
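
The sketch below illustrates that extraction step on a small, made-up HTML fragment. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies; the `ArticleParser` class and `SAMPLE_HTML` are hypothetical and only mimic the layout described above (one article div, internal `/wiki` links, image links skipped).

```python
from html.parser import HTMLParser

# Hypothetical page fragment mimicking the layout described above.
SAMPLE_HTML = """
<h1 id="firstHeading">Guide</h1>
<div class="mw-content-ltr mw-parser-output">
  <p>The <a href="/wiki/Guide">Guide</a> is an <a href="/wiki/NPCs">NPC</a>.</p>
  <a href="/wiki/File:Guide.png"><img src="guide.png"></a>
  <a href="https://example.com">external</a>
</div>
<div class="footer"><a href="/wiki/Terraria_Wiki:About">About</a></div>
"""


class ArticleParser(HTMLParser):
    """Collects text and internal text links from the article div only."""

    def __init__(self):
        super().__init__()
        self.depth = 0         # > 0 while inside the article div
        self.text_parts = []
        self.links = []
        self._href = None      # href of the <a> currently open, if any
        self._has_img = False  # whether that <a> wraps an image

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1  # nested div inside the article
            elif attrs.get("class") == "mw-content-ltr mw-parser-output":
                self.depth = 1
        elif self.depth and tag == "a":
            self._href, self._has_img = attrs.get("href"), False
        elif self.depth and tag == "img":
            self._has_img = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "a" and self._href is not None:
            # Keep internal links only, and skip links that wrap an image.
            if self._href.startswith("/wiki") and not self._has_img:
                self.links.append(self._href)
            self._href = None

    def handle_data(self, data):
        if self.depth:
            self.text_parts.append(data)


parser = ArticleParser()
parser.feed(SAMPLE_HTML)
body_text = " ".join("".join(parser.text_parts).split())  # collapse whitespace
print(body_text)
print(parser.links)
```

Because every article shares the same container div, a single rule of this kind is enough to separate article content from navigation and footer links.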

The downloaded dataset was saved to a `raw_text.csv` file for the further processing explained in the next section.
### The code:

In [None]:
import re
from collections import deque
from random import uniform
from time import sleep

import bs4
import cloudscraper
import pandas as pd

# scraping guidelines
# https://terraria.wiki.gg/robots.txt

URL = "https://terraria.wiki.gg"

def get_url(url, scraper):
    """Fetch url and return its HTML, or None if the request ultimately fails."""
    res = scraper.get(url)
    if res.status_code == 429:
        # Rate limited: wait once and retry before giving up entirely.
        print("Returned 429, retrying in 15 seconds")
        sleep(15)
        res = scraper.get(url)
        if res.status_code == 429:
            raise ConnectionRefusedError("try in a few hours or on a phone hotspot lol")
    if res.status_code != 200:
        print(f"WARNING! Get request returned code other than 200: {res.status_code}")
        return None
    print(f"Url fetched successfully: {url}")
    return res.text

def bfs(initial_links, scraper, df, iterations):
    """Crawl breadth-first from initial_links and append up to `iterations` pages to df."""
    queue = deque(dict.fromkeys(initial_links))  # drop duplicate start links, keep order
    visited = set(queue)
    rows = []
    while len(rows) < iterations and queue:
        # robots.txt says 1 sec cooldown is fine but i still sometimes got blocked by cloudflare, so wait 3-5 s
        sleep(uniform(2, 4) + 1)
        curr_link = queue.popleft()
        curr = get_url(URL + curr_link, scraper)
        if curr is None:
            continue  # failed request, skip this page
        soup = bs4.BeautifulSoup(curr, "html.parser")
        heading = soup.find("h1", {"id": "firstHeading"})
        body = soup.find("div", {"class": "mw-content-ltr mw-parser-output"})
        if heading is None or body is None:
            continue  # page does not follow the standard article layout
        # internal wiki links only, skipping links that wrap an image
        new_links = [a.get("href") for a in body.find_all("a", attrs={'href': re.compile(r'^/wiki')}) if not a.find("img")]
        rows.append([heading.text, curr_link, body.text.strip().replace('\n', ' ').replace('  ', ' ')])
        for link in new_links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
    # build the frame once; concatenating inside the loop copies df on every page
    return pd.concat((df, pd.DataFrame(rows, columns=["title", "link", "body"])), ignore_index=True)



def scrape():
    df = pd.DataFrame(columns=["title", "link", "body"])
    scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0',})
    main_site = get_url(URL, scraper)
    # id="main-section" <-- extract all wiki.gg hrefs from here and start bfs
    # h1 id="firstHeading" (page title) div class="mw-content-ltr mw-parser-output" (body) <--- for other pages
    soup = bs4.BeautifulSoup(main_site, "html.parser")
    body = soup.find("div", {"id": "main-section"})
    links = [a.get("href") for a in body.find_all("a", attrs={'href': re.compile(r'^/wiki')}) if not a.find("img")]
    df = bfs(links, scraper, df, 2000)
    #print(df)
    df.to_csv("raw_text.csv")

<hr>

# Stemming and lemmatization