# Terraria wiki article recommender
### Hubert Nowakowski 160302
### Mukhammad Sattorov 159351

<hr>

# Introduction
This project recommends similar articles from the Terraria wiki based on the text content of those articles. Our dataset consists of 1000 pages scraped from `terraria.wiki.gg`. We chose this wiki in particular for 3 main reasons:
1. The person writing this likes the game and knows it very well :^)
2. The wiki has a little under 6 thousand articles, so we could get decent coverage with a sane number of downloaded articles.
3. A lot of mechanics, NPCs, items, events, bosses, etc. in this game affect and depend on each other, so we were hoping to see some interesting relations between articles.

<hr>

# Scraping

### Approach and libraries used
We started on the main page of the wiki, collected all the hyperlinks leading to other pages into a queue, and performed a breadth-first search (BFS). On every page visited from that point onwards, we took all of the text from the main article div and added all not-yet-visited hyperlinks to the queue for further search.
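The traversal described above can be sketched without any network access. This is only an illustration under stated assumptions: `bfs_crawl`, `get_links` and the `toy_wiki` link graph are hypothetical names made up for this example, not part of the scraper itself.

```python
from collections import deque

def bfs_crawl(start_links, get_links, limit):
    """Visit up to `limit` pages breadth-first and return them in visit order.

    `get_links(page)` is a hypothetical callback returning the links found on a page.
    """
    queue = deque(dict.fromkeys(start_links))  # de-duplicate while keeping order
    visited = set(queue)
    order = []
    while queue and len(order) < limit:
        page = queue.popleft()          # oldest discovered page first
        order.append(page)
        for link in get_links(page):
            if link not in visited:     # enqueue every page at most once
                visited.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for the wiki (made up for illustration).
toy_wiki = {
    "/wiki/Ores": ["/wiki/Weapons", "/wiki/Hardmode"],
    "/wiki/Weapons": ["/wiki/Ores", "/wiki/Magic_weapons"],
    "/wiki/Hardmode": ["/wiki/Wall_of_Flesh"],
    "/wiki/Magic_weapons": [],
    "/wiki/Wall_of_Flesh": ["/wiki/Hardmode"],
}

crawl_order = bfs_crawl(["/wiki/Ores"], lambda page: toy_wiki.get(page, []), limit=4)
print(crawl_order)
```

Pages linked directly from the start page are visited before pages two clicks away, which is why a capped crawl of this kind tends to collect the most central articles first.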

In an ideal world we'd have been able to just blast requests as fast as possible and be done with downloading the dataset within a few minutes at most. The scraping guidelines of the wiki, however, specify a 1 second cooldown between requests. This still wouldn't have been too bad, but even a 1 second cooldown was short enough to trigger Cloudflare's captchas and eventually put the scraper's IP on a blacklist for a few hours.

This forced us to ditch the regular `requests` library in favour of `cloudscraper`, which works in a nigh-identical way but includes workarounds for Cloudflare's anti-bot protection specifically. After a bit of trial and error with request cooldowns and headers, we arrived at the working solution shown in the code section below. This implementation allowed us to scrape the aforementioned 1000 articles within a *serviceable* timeframe of a little over 40 minutes.
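The cooldown-and-retry idea can be shown in isolation. The sketch below is a simplified stand-in rather than the scraper's actual code: `polite_get` and its parameters are hypothetical, `fetch` is any callable returning `(status_code, text)`, and the delays (2 to 3 seconds between requests, 15 seconds after an HTTP 429) simply mirror the values used in this project.

```python
import random
import time

def polite_get(fetch, url, min_delay=2.0, max_delay=3.0,
               retry_wait=15.0, max_retries=1, sleep=time.sleep):
    """Wait a randomised delay, call fetch(url), and retry after a pause on HTTP 429.

    `fetch` is a hypothetical callable returning (status_code, text).
    `sleep` is injectable so the function can be tried out without really waiting.
    """
    sleep(random.uniform(min_delay, max_delay))  # jittered cooldown before every request
    status, text = fetch(url)
    retries = 0
    while status == 429 and retries < max_retries:
        sleep(retry_wait)                        # back off, then try once more
        status, text = fetch(url)
        retries += 1
    if status == 429:
        raise ConnectionRefusedError(f"still rate-limited after {retries} retries: {url}")
    return status, text

# Fake server: rate-limits the first request, then answers normally.
responses = iter([(429, ""), (200, "<html>ok</html>")])
waits = []  # records the requested sleep durations instead of sleeping
status, text = polite_get(lambda url: next(responses),
                          "https://example.invalid/wiki/Ores",
                          sleep=waits.append)
print(status, waits)
```

Randomising the delay is a judgment call: a fixed interval is easier to reason about, but slightly irregular timing seemed less likely to trip the anti-bot checks in our experience.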

Aside from the hurdles with anti-bot protection, the rest of the scraping was very straightforward: the standardised layout of the wiki's pages allowed us to inspect a single div on each article for both text content and links.

The downloaded dataset has been put into a `raw_text.csv` file for further processing explained in the next section.
### The code:

In [94]:
import bs4
import re
from time import sleep
from random import uniform
import cloudscraper
import pandas as pd

# scraping guidelines
# https://terraria.wiki.gg/robots.txt

URL = "https://terraria.wiki.gg"

# making sure we don't add junk/duplicate pages to the dataset,
# e.g. links to a section of a page (#...) or links with a query string (?...)
def clean_link(link):
    link = link.split('#')[0]
    link = link.rstrip('/')
    return re.sub(r'\?.*$', '', link)


def get_url(url, scraper):
    res = scraper.get(url)
    if res.status_code != 200:
        print(f"WARNING! Get request returned code other than 200: {res.status_code}")
    if res.status_code == 429:
        print("Returned 429, retrying in 15 seconds")
        sleep(15)
        res = scraper.get(url)
        if res.status_code == 429:
            raise ConnectionRefusedError("try in a few hours or on a phone hotspot lol")
    if res.status_code == 200:
        print(f"Url fetched successfully: {url}")
    return res.text

def bfs(initial_links, scraper, df, iterations):
    queue = initial_links
    visited = set(initial_links[:])
    while iterations and queue:
        sleep(uniform(1, 2) + 1) # robots.txt says 1 sec cooldown is fine but i still sometimes got blocked by cloudflare
        curr_link = clean_link(queue.pop(0))
        curr = get_url(URL + curr_link, scraper)
        soup = bs4.BeautifulSoup(curr, "html.parser")
        title = soup.find("h1", {"id": "firstHeading"}).text
        body = soup.find("div", {"class": "mw-content-ltr mw-parser-output"})
        new_links = [a.get("href") for a in body.find_all("a", attrs={'href': re.compile(r'^/wiki')}) if not a.find("img")]
        df = pd.concat((df,
                         pd.DataFrame([[title, curr_link, body.text.strip().replace('\n', ' ').replace('  ', ' ')]],
                                       columns=["title", "link", "body"])), ignore_index=True)
        for link in new_links:
            if link not in visited:
                visited.add(link)
                queue.append(link)
        iterations -= 1
        print(f"Remaining: {iterations}")
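    # links are only cleaned after the visited check, so a page can be fetched more than once
    # (e.g. via a '#section' link); keep just the first copy
    return df.drop_duplicates(subset='link', keep='first')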


# entrypoint for this entire section
def scrape():
    df = pd.DataFrame(columns=["title", "link", "body"])
    scraper = cloudscraper.create_scraper(delay=10, browser={'custom': 'ScraperBot/1.0',})
    main_site = get_url(URL, scraper)
    # id="main-section" <-- extract all wiki.gg hrefs from here and start bfs
    # h1 id="firstHeading" (page title) div class="mw-content-ltr mw-parser-output" (body) <--- for other pages
    soup = bs4.BeautifulSoup(main_site, "html.parser")
    body = soup.find("div", {"id": "main-section"})
    links = [a.get("href") for a in body.find_all("a", attrs={'href': re.compile(r'^/wiki')}) if not a.find("img")]
    df = bfs(links, scraper, df, 1000)
    #print(df)
    df.to_csv("raw_text.csv")

# don't run this unless you want to re-scrape everything (takes over 40 minutes)
#scrape()

<hr>

# Data processing

### Approach
We decided to go with simple lemmatization for data processing. We considered further cleanup, e.g. removing extremely common stop words, but deemed it unnecessary given our approach to encoding and querying in the next section.

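Ultimately every article's text is turned into one long lower-case string in which words are reduced towards their base dictionary form. Note that the lemmatizer is called with `pos='v'`, so every token is treated as a verb: verb forms are reduced (e.g. `confused` becomes `confuse`), while other words may stay unchanged.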

The result is then saved to `processed_text.csv` for later use so that this cell does not need to be re-run (it takes around 20 seconds).

A short example of such processed text is visible below.
### Code:

In [95]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('wordnet')
# nltk.download('stopwords')

def tokenize(df):
    df['tokens'] = df['body'].map(word_tokenize)
    return df

def lemmatize(df):
    lemmatizer = WordNetLemmatizer()
    df['lemma'] = df['tokens'].map(lambda tokens: [lemmatizer.lemmatize(token.lower(), pos='v') for token in tokens])
    df['lemma'] = df['lemma'].map(lambda x: ' '.join(x))
    return df

# entrypoint for this entire section
def process_text():
    df = pd.read_csv('raw_text.csv', index_col=0).fillna(' ')
    df = lemmatize(tokenize(df))
    df = df.drop(columns=['tokens', 'body'])
    df.to_csv("processed_text.csv")
    return df

df = process_text()
print(f"Example processed text: {df['lemma'][0][:100]}")

Example processed text: terraria author re-logicdr studiosengine software not to be confuse with the latin plural of terrari


<hr>

# Similarities and querying
### Approach
For this section we decided to go with Term Frequency-Inverse Document Frequency (TF-IDF). As mentioned earlier, with this approach common stop words in the dataset do not have a large impact on the results: since they appear in almost every article, they receive the lowest IDF weights. Conversely, very specific terms shared between articles boost their similarity considerably.
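For reference, this is the weighting we understand scikit-learn's `TfidfVectorizer` to apply with the settings used in this notebook (`use_idf=True`, `smooth_idf=False`, and the default `norm='l2'`). Here $n$ is the number of articles, $\mathrm{df}(t)$ the number of articles containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in article $d$:

$$
\mathrm{idf}(t) = \ln\frac{n}{\mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t),
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
$$

A term that occurs in every article has $\mathrm{df}(t) = n$ and therefore the minimum possible value $\mathrm{idf}(t) = 1$, while a term found in a single article gets $\ln n + 1$. Note that the minimum is 1 rather than 0, so very frequent words are down-weighted relative to rare ones but not removed entirely.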

For the similarity calculations we opted for cosine similarity, which tends to behave better than Euclidean distance for high-dimensional sparse vectors because it compares the angle between vectors rather than the distance between them (our matrix is 976x32303).
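A small example may make the angle-versus-distance point more concrete. The vectors below are made-up raw term counts over a three-word vocabulary, not values from our matrix, and the helper functions are written out by hand purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Made-up term counts over the vocabulary ["boss", "summon", "fish"].
short_article = [2, 1, 0]
long_article = [20, 10, 0]   # same topic, ten times as much text
other_article = [0, 1, 2]    # different topic, similar length to short_article

print(cosine(short_article, long_article), euclidean(short_article, long_article))
print(cosine(short_article, other_article), euclidean(short_article, other_article))
```

Cosine similarity rates the two same-topic articles as (almost exactly) identical in direction, whereas Euclidean distance puts the short article closer to the unrelated one simply because their lengths match. One caveat: this difference matters mainly for vectors that are not length-normalised. For rows scaled to unit L2 norm, squared Euclidean distance equals 2 minus twice the cosine, so both measures would produce the same ranking.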

We take the title of an article from the database as a query, and after comparing the similarities between its vectorized form and various rows in `df_vectorized`, take the top 10 highest similarity matches.

In the case of several articles (a history of visited articles) we simply concatenate all of their texts into a single string and run it through the same pipeline as above.
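The history-concatenation step can be illustrated on a toy corpus. To keep the sketch short and dependency-free it uses raw word counts instead of TF-IDF weights, so it is a simplification of the pipeline rather than a copy of it. The `articles` texts, `count_vector` and `recommend` are all invented for this example.

```python
import math
from collections import Counter

# Made-up, already-lemmatized article texts (not taken from the dataset).
articles = {
    "Fishing": "fish rod bait ocean angler quest",
    "Achievements": "achievement complete quest angler boss",
    "Angler": "angler npc quest fish reward",
    "Ores": "ore mine pickaxe bar craft",
}

def count_vector(text):
    """Bag-of-words vector as a Counter (raw counts, no IDF weighting)."""
    return Counter(text.split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recommend(history, top_n=2):
    # Concatenate the texts of all visited articles into one query document.
    query = count_vector(" ".join(articles[title] for title in history))
    scores = {
        title: cosine(query, count_vector(text))
        for title, text in articles.items()
        if title not in history  # never recommend what was already read
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

ranking = recommend(["Fishing", "Achievements"])
print(ranking)
```

Because the query is a single merged document, words shared by several visited articles (here `angler` and `quest`) are counted more than once and pull the recommendation towards pages that sit at the intersection of the history.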

### Vectorizing code + matrix:

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=False)
# the previous section needs to have been executed
df = pd.read_csv('processed_text.csv', index_col=0).fillna(' ')
df_vectorized = pd.DataFrame(vectorizer.fit_transform(df['lemma']).toarray(), index=df.index, columns=vectorizer.get_feature_names_out())
df_vectorized = pd.concat((df[['title', 'link']], df_vectorized), axis=1)
df_vectorized

| | title | link | 00 | 000 | 0000000000 | 000005162020 | 00012 | 00016 | 0002 | 00021 | ... | むらさきのいと | ゆうしゃの衣装 | ノート | 勇気 | 河童 | 移动中国版 | 简体中文 | 𝐼𝑚𝑝𝑒𝑛𝑑𝑖𝑛𝑔 | 𝑎𝑝𝑝𝑟𝑜𝑎𝑐ℎ𝑒𝑠 | 𝑑𝑜𝑜𝑚 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Terraria | /wiki/Terraria | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Ores | /wiki/Ore | 0.029576 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Recipes | /wiki/Recipes | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Weapons | /wiki/Weapons | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Magic weapons | /wiki/Magic | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Frozen Slime Block | /wiki/Frozen_Slime_Block | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 996 | Vortex Fragment Block | /wiki/Vortex_Fragment_Block | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 997 | Nebula Fragment Block | /wiki/Nebula_Fragment_Block | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 998 | Stardust Fragment Block | /wiki/Stardust_Fragment_Block | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### Finding similarity code:

In [97]:
from sklearn.metrics.pairwise import cosine_similarity

def page_hist_lemma(df, history):
    # join the lemmatized texts of all articles in the history into one query document
    lemma = df[df['title'].isin(history)]['lemma'].to_list()
    return [" ".join(lemma)]

def get_recommendations(df, df_vectorized, history, vectorizer):
    # reset both indices so that positions returned by argsort line up with row labels
    df = df.reset_index(drop=True)
    df_vectorized = df_vectorized.reset_index(drop=True)
    query_vec = vectorizer.transform(page_hist_lemma(df, history))
    # skip the 'title' and 'link' columns; the remaining columns are the TF-IDF features
    similarities = cosine_similarity(df_vectorized.iloc[:, 2:], query_vec).ravel()

    # prevents getting recommended the same articles again
    similarities[df['title'].isin(history).to_numpy()] = 0
    top_indices = similarities.argsort()[::-1][:10]
    res = df.loc[top_indices, ['title', 'link']].copy()
    res['similarity'] = similarities[top_indices]
    return res.sort_values(by='similarity', ascending=False)

<hr>

# Examples:

In [98]:
# names of some early game bosses
history = [
    'Brain of Cthulhu',
    'Eater of Worlds',
    'Skeletron',
    'Queen Bee'
]

# we get articles about other early/mid game bosses
get_recommendations(df, df_vectorized, history, vectorizer)

| | title | link | similarity |
|---|---|---|---|
| 135 | Wall of Flesh | /wiki/Wall_of_Flesh | 0.893719 |
| 138 | The Destroyer | /wiki/The_Destroyer | 0.888379 |
| 155 | Frost Legion | /wiki/Frost_Legion | 0.878192 |
| 136 | Queen Slime | /wiki/Queen_Slime | 0.873574 |
| 129 | Eye of Cthulhu | /wiki/Eye_of_Cthulhu | 0.87354 |
| 133 | Deerclops | /wiki/Deerclops | 0.873384 |
| 111 | Old Man | /wiki/Old_Man | 0.871002 |
| 139 | Skeletron Prime | /wiki/Skeletron_Prime | 0.860523 |
| 269 | Mechanical bosses | /wiki/Mechanical_bosses | 0.858665 |
| 320 | Mechanical bosses | /wiki/Mechanical_boss | 0.858665 |


In [99]:
# with fishing and achievements
history = [
    'Fishing',
    'Achievements'
]

# the top result is the fishing-based NPC who is needed for most fishing-related achievements
get_recommendations(df, df_vectorized, history, vectorizer)

| | title | link | similarity |
|---|---|---|---|
| 99 | Angler | /wiki/Angler | 0.750786 |
| 107 | Guide | /wiki/Guide | 0.715211 |
| 20 | Hardmode | /wiki/Hardmode | 0.712933 |
| 19 | Events | /wiki/Events | 0.705487 |
| 814 | Events | /wiki/Event | 0.705487 |
| 135 | Wall of Flesh | /wiki/Wall_of_Flesh | 0.698027 |
| 319 | Biomes | /wiki/Abandoned_Minecart_Tracks | 0.694872 |
| 51 | Biomes | /wiki/Biomes | 0.694872 |
| 199 | NPCs | /wiki/NPC | 0.686466 |
| 15 | NPCs | /wiki/NPCs | 0.686466 |


In [100]:
# some context around summoning the Wall of Flesh boss
history = [
    'Guide',
    'The Underworld',
    'Pre-Hardmode',
    'Boss'
]

# ...and we get recommended the article about the boss itself and hardmode which defeating it unlocks!
get_recommendations(df, df_vectorized, history, vectorizer)

| | title | link | similarity |
|---|---|---|---|
| 240 | Drunk world | /wiki/Drunk_world | 0.772963 |
| 20 | Hardmode | /wiki/Hardmode | 0.768756 |
| 135 | Wall of Flesh | /wiki/Wall_of_Flesh | 0.763772 |
| 199 | NPCs | /wiki/NPC | 0.76296 |
| 15 | NPCs | /wiki/NPCs | 0.76296 |
| 51 | Biomes | /wiki/Biomes | 0.758257 |
| 319 | Biomes | /wiki/Abandoned_Minecart_Tracks | 0.758257 |
| 16 | Guide:Getting started | /wiki/Guide:Getting_started | 0.75441 |
| 310 | Achievements | /wiki/Achievements | 0.748292 |
| 116 | Tavernkeep | /wiki/Tavernkeep | 0.745377 |


In [101]:
# searching about evil biomes and bosses
history = [
    'Evil biomes',
    'Boss'
]

get_recommendations(df, df_vectorized, history, vectorizer)
# we get:
# articles about the underground and surface crimson and corruption which are the said evil biomes
# the general biomes page
# Eater of Worlds and Brain of Cthulhu which are their respective evil biome bosses
# the Dryad who is an NPC revolving around curing the world of evil biomes

| | title | link | similarity |
|---|---|---|---|
| 67 | Underground Corruption | /wiki/Underground_Corruption | 0.784167 |
| 69 | Underground Crimson | /wiki/Underground_Crimson | 0.744557 |
| 66 | The Corruption | /wiki/The_Corruption | 0.732141 |
| 68 | The Crimson | /wiki/The_Crimson | 0.725262 |
| 51 | Biomes | /wiki/Biomes | 0.705481 |
| 319 | Biomes | /wiki/Abandoned_Minecart_Tracks | 0.705481 |
| 130 | Eater of Worlds | /wiki/Eater_of_Worlds | 0.685462 |
| 135 | Wall of Flesh | /wiki/Wall_of_Flesh | 0.685438 |
| 103 | Dryad | /wiki/Dryad | 0.676847 |
| 131 | Brain of Cthulhu | /wiki/Brain_of_Cthulhu | 0.663096 |
