# NLP Project - Web Scraping and Text Analysis of Game Reviews on Rock Paper Shotgun
## Part I. Build Dataset

### Step 1 - Webpage Exploration & Code Test

### Scrape single webpage

In [80]:
import requests
from bs4 import BeautifulSoup
import json
import re

In [85]:
# attempt to scrape single webpage using bs4

# original code obtained from the lecture notebook - Week 5 Web data and generative text - Webscraping Wikipedia
ROCKPAPERSHOTGUN = "https://www.rockpapershotgun.com/reviews?page=1"
response = requests.get(url=ROCKPAPERSHOTGUN)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

# tag located using Simplescraper - Chrome Extension, https://simplescraper.io/
# soup.select = search all with the provided tag
titles = soup.select("p.title a.link.link--expand")

# get the link & title of the first review article
# get href links, reference code found in: https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python
# .text vs. .get_text(): https://stackoverflow.com/questions/35496332/differences-between-text-and-get-text
first_title_text = titles[0].text
first_title_url = titles[0].get("href")
print(first_title_url)
print(first_title_text)

https://www.rockpapershotgun.com/avatar-frontiers-of-pandora-review

Avatar: Frontiers Of Pandora review: these frontiers will bore ya                    
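
The printed title above is surrounded by stray whitespace, because `.text` returns the link text exactly as it sits in the HTML. Below is a minimal sketch of normalising such a string before storing it. `normalise_whitespace` is a hypothetical helper name, and `raw_title` is a sample string modelled on the output above rather than the exact page content.

```python
def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no argument splits on any whitespace run and drops empty pieces
    return " ".join(text.split())


# sample modelled on the printed title above (assumed, not fetched from the site)
raw_title = "\nAvatar: Frontiers Of Pandora review: these frontiers will bore ya                    \n"
print(normalise_whitespace(raw_title))
```

Applying the same helper to every scraped title would keep later string comparisons and de-duplication independent of page layout.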


### Scrape webpages with different formats

A total of 11 different HTML structures, grouped into 2 types (8 in Type I, 3 in Type II), were identified on RPS, and the extraction code was adjusted for each.

#### FORMAT TYPE I
- all information is available to scrape (the HTML for the brief, developer info and review body varies slightly between pages)

##### webpage format 1 - brief variation 1
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 1 - soup.select(".article_body_content > :nth-child(1)")
3. developer format 1 - soup.select("div.article_body_content li")

In [97]:
link = "https://www.rockpapershotgun.com/persona-5-tactica-review"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [99]:
info = soup.select("div.article_body_content li")
len(info)

7

In [100]:
# extract game title - same for all format
link_title = " ".join(link.replace("https://www.rockpapershotgun.com/", "").split("-")).title()

# remove everything after 'Review' to obtain a clean link title using regex
# remove characters/symbols using regex, reference code found in: https://stackoverflow.com/questions/875968/how-to-remove-symbols-from-a-string-with-python
# regex for 'anything after Review' is obtained using ChatGPT, and tested on https://regex101.com/
cleaned_link_title = re.sub(r'Review.*', 'Review', link_title)
cleaned_link_title

# replace unwanted text to obtain the game title; this is not always accurate, as some URL slugs are a full sentence describing the article rather than just the game title
game_title = cleaned_link_title.replace(" Review", "").replace("Wot I Think ", "")
game_title

'Persona 5 Tactica'
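
The title-cleaning steps above can be collected into one function, which makes it easier to see where the approach works and where it does not. This is a minimal sketch that reuses the same string operations as the cell above. `game_title_from_link` and `BASE_URL` are hypothetical names introduced for illustration.

```python
import re

BASE_URL = "https://www.rockpapershotgun.com/"


def game_title_from_link(link):
    """Derive an approximate game title from a review URL slug."""
    slug = link.replace(BASE_URL, "")
    link_title = " ".join(slug.split("-")).title()
    # drop anything that follows the word 'Review'
    cleaned = re.sub(r"Review.*", "Review", link_title)
    # remove the remaining labels that are not part of the game title
    return cleaned.replace(" Review", "").replace("Wot I Think ", "")


for link in (
    BASE_URL + "persona-5-tactica-review",
    BASE_URL + "trepang2-review-its-an-indie-fear",
    BASE_URL + "early-access-review-hades",
):
    print(game_title_from_link(link))
```

The first two links give `Persona 5 Tactica` and `Trepang2`. The third gives `Early Access`, because the game name comes after the word "review" in the slug and is cut off, which illustrates the inaccuracy mentioned in the comment above.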

In [101]:
# extract topics provided at the end of review - same for all format
topic = soup.select("#content_above > .page_content span.tagged_with_item.tagged_with_item--secondary a")
topics = []
# extract text from each paragraph
for item in topic:
    topics.append(item.text)
print(topics)

['Atlus', 'PC', 'RPG', 'Sega', 'SEGA of America', 'Simulation', 'Strategy', 'Wot I Think', 'Xbox One', 'Xbox Series X/S']


In [102]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)


The Persona 5 kids are trapped in high school, their semesters stretched into infinity by the inescapable march of spin-offs that have been retroactively stuffed into every available blank spot. Since the credits rolled on Persona 5 (and also before, and sometimes concurrently), the Phantom Thieves have spent their time in a rubbish Dynasty Warriors knock-off, got lost in a dream about a really cheap rhythm-action game and ended up trapped in a cross-generational dungeon crawler. Now they find themselves embroiled in - of all things - an X-COM style turn-based strategy game. 

Your energy for Persona 5 Tactica is going to come down to how much tolerance you have for the Persona 5 brand’s Baz Luhrmann-style maximalism. If you're emotionally invested in the gang, it'll definitely help. But even if you're a Persona veteran or a total newcomer, you'll find the story a bit loose, the chats a bit tiring, and the combat a bit simplistic. Then again, if you're a Persona fan, it's still more t

In [103]:
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'Atlus'
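
The cell above assumes the developer is always the first `li` item. Below is a minimal sketch of a slightly more defensive variant that scans the item texts for the `Developer:` label instead of relying on position. It works on plain strings (for example `[item.text for item in info]`). `developer_from_items` is a hypothetical helper, and the publisher entries in the sample lists are made up for illustration.

```python
def developer_from_items(item_texts):
    """Return the developer name from a list of info-item strings, or None."""
    label = "Developer:"
    for text in item_texts:
        stripped = text.strip()
        if stripped.startswith(label):
            # keep only what follows the label, without surrounding whitespace
            return stripped[len(label):].strip()
    return None


# illustrative item texts, not copied from a real page
print(developer_from_items(["Developer: Atlus\n", "Publisher: Example Publisher"]))
print(developer_from_items(["Publisher: Example Publisher", "Developer:King Art"]))
print(developer_from_items([]))
```

Returning `None` when no labelled item exists lets the calling code fall back to another format instead of silently storing the wrong field.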

In [105]:
# extract brief
brief = str(soup.select(".article_body_content > :nth-child(1)"))

# use regex to get text between <strong> and </strong> and <ul> 
# select text between tags, reference code found in: https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags
# ChatGPT is used here for debugging and explaining the code
# re.DOTALL allows the . to match any character, including newline characters
pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
match = re.search(pattern, brief)
# match.group(1): retrieves the content captured by the regex group
# .strip(): removes whitespace at the beginning and end of the str, reference code found in: https://www.w3schools.com/python/ref_string_strip.asp
text = match.group(1).strip().replace("\r", "").replace("\n", "")

# remove everything in <> and <> itself, if any
# regex is obtained using ChatGPT, and tested on https://regex101.com/
brief_text = re.sub('<.*?>', '', text)
print(brief_text)

A turn-based strategy 'em up with the Persona 5 gang that might be a bit too simplistic for some, and anime for others, but still makes for a nice excuse to hang with pals.
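
In the cell above, `re.search` returns `None` when a page does not follow this layout, and `match.group(1)` then raises `AttributeError`. Below is a minimal sketch of the same two regex steps wrapped in a function that returns `None` instead. `extract_brief` is a hypothetical name, and `sample` is a made-up fragment that only imitates the structure (bold label, brief text, then a `<ul>`).

```python
import re

# text between the first <strong>...</strong> label and the following <ul>
BRIEF_PATTERN = re.compile(r"<strong>.+?</strong>(.+?)<ul>", re.DOTALL)
# any remaining HTML tag
TAG_PATTERN = re.compile(r"<.*?>")


def extract_brief(fragment):
    """Return the brief text from an HTML fragment, or None if the layout differs."""
    match = BRIEF_PATTERN.search(fragment)
    if match is None:
        return None
    text = match.group(1).strip().replace("\r", "").replace("\n", "")
    # remove leftover tags such as <br/> or </p>
    return TAG_PATTERN.sub("", text).strip()


# made-up fragment for illustration only
sample = (
    "<p><strong>Review in brief:</strong><br/>\nA short verdict.</p>\n"
    "<ul><li>Developer: Example Studio</li></ul>"
)
print(extract_brief(sample))
print(extract_brief("<p>No brief on this page</p>"))
```

A `None` result can then be used as the signal to try the next brief variation.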


##### webpage format 2 - brief variation 2
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 2 - soup.select(".article_body_content > :nth-child(2)")
3. developer format 1 - soup.select("div.article_body_content li")

In [106]:
link = "https://www.rockpapershotgun.com/wo-long-fallen-dynasty-review"
response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [107]:
info = soup.select("div.article_body_content li")
len(info)

6

In [108]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)

It's easy to label Wo Long: Fallen Dynasty as a Soulslike that's basically Nioh and Sekiro mashed together. To some extent, that's true. But I think Wo Long is Team Ninja having sanded off some - not all - of the edges from the overwhelmingly complex Nioh, to craft an action RPG with even crisper combat than its predecessor. And crucially, it's more approachable as a result. Clever ranks and meters reward, whether risk-taking is your jam or you're more of a careful type. It may still be a tricky venture filled with demonic, titanic cows that'll want to pound your skull into the dirt, but this is Team Ninja's most encouraging effort yet.Nioh and its sequel are set in a feudal Japan where important historic figures fight over crystals, as these crystals are sparkly and have powers. However! They also have the power to corrupt, turning your shoguns and wizards into bloated demons and gangly harpies. The story is, err... certainly present. And Wo Long is no different, in fact, it's largely
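
In the output above, paragraph boundaries are lost ("effort yet.Nioh and its sequel") because the loop concatenates each `item.text` with no separator. Below is a minimal sketch of joining the paragraph strings with an explicit separator. `join_paragraphs` is a hypothetical helper, and the sample strings are shortened from the output above.

```python
def join_paragraphs(paragraph_texts, separator="\n\n"):
    """Join non-empty paragraph strings with an explicit separator."""
    cleaned = [text.strip() for text in paragraph_texts]
    # skip paragraphs that are empty after trimming
    return separator.join(text for text in cleaned if text)


paragraphs = [
    "this is Team Ninja's most encouraging effort yet.",
    "",
    "Nioh and its sequel are set in a feudal Japan.",
]
print(join_paragraphs(paragraphs))
```

If this were adopted, the disclaimer string removed with `.replace(...)` would probably need adjusting, since it currently expects a single leading `\n`.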

In [109]:
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'Team Ninja, Koei Tecmo'

In [30]:
# extract brief, with different html tag, located by Simplescraper 
brief = str(soup.select(".article_body_content > :nth-child(2)"))
pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
match = re.search(pattern, brief)
brief_text = match.group(1).strip().replace("<br/>", "").replace("\r", "").replace("\n", "")
print(brief_text)

 Team Ninja has streamlined Nioh with dashes of Sekiro, but it stands on its own as a Soulslike with, arguably, the crispest combat out there.


##### webpage format 3 - brief variation 3
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 3 - soup.select(".article_body_content > :nth-child(3)")
3. developer format 1 - soup.select("div.article_body_content li")

In [110]:
link = "https://www.rockpapershotgun.com/trepang2-review-its-an-indie-fear"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [111]:
info = soup.select("div.article_body_content li")
len(info)

6

In [112]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)

Here's a move I pull in most gunfights in Trepang2: slidekick into an enemy, grab them out of mid-air, briefly hold them in front of me as a human shield, only to pull the pin on their vest's grenade and hurl them into a group of their pals, who do try to scatter before this meaty bomb bursts but sadly forget that they also need to avoid me and my shotgun. Often this is all in slow-motion. Trepang2 is unashamedly aiming to be a new F.E.A.R. and does a pretty great job of it for a game made by a core team of only four people (plus external artists and such). Give me a shotgun, a slidekick, and slo-mo, and I'm happy.It's the near future and you are a supersoldier fighting for a secret organisation. Corps and cults are engaged in questionable science, creating fleshy bioweapons and poking at incomprehensible otherworldly entities, so here you come in your black helicopter to invade their offices and secret bases. Many parts feel familiar—SCP, Resident Evil, creepypasta horror stories, may

In [113]:
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'Trepang Studios'

In [35]:
# extract brief, using different brief html tag, located by Simplescraper
brief = str(soup.select(".article_body_content > :nth-child(3)"))
pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
match = re.search(pattern, brief)
brief_text = match.group(1).strip().replace("<br/>", "").replace("\r", "").replace("\n", "")
print(brief_text)

Trepang2 is a short but joyously violent first-person horror shooter and the closest we've had to F.E.A.R. in years


##### webpage format 4 - brief variation 4
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 4 - soup.select(":nth-child(1) > em")[0]
3. developer format 1 - soup.select("div.article_body_content li")

In [114]:
link = "https://www.rockpapershotgun.com/stoneshard-early-access-review"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [115]:
info = soup.select("div.article_body_content li")
len(info)

6

In [116]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)

Like any decent professional videogames journalist, I have spent the last ten years devaluing the term “roguelike” through overuse. Anything where a goblin politely waits for you to take your turn? Roguelike. Anything where a big slime can permanently murder you? You better believe that’s a roguelike. We would walk down the street with wheelbarrows full of roguelike. We burned piles of it just to keep warm. We introduced the word “roguelite” as a form of quantitative easing, but it failed to curb spiralling rogueflation. So now, faced with Stoneshard – a roguelike that’s really, very, incredibly, actually like Rogue – I’m left reaching for useful adjectives.Stoneshard is a top-down, tile-based dungeon crawler in which time only moves when you do. It’s as if Superhot and chess stole Diablo’s car to go on a road trip to Durham. It’s like NetHack and Dungeons of Dredmor held hands and stared lovingly into one another’s eyes as they walked into the ocean. You are an adventurer in a richly 

In [39]:
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'Ink Stains Games'

In [117]:
# extract brief, using different brief html tag, located by Simplescraper
brief = soup.select(":nth-child(1) > em")[0].text
print(brief)

Premature Evaluation is the weekly column in which Steve Hogarty explores the wilds of early access. This week, he's being pulverised by trolls and torn to shreds by wolves in challenging roguelike Stoneshard.


##### webpage format 5 - body variation 1 + brief variation 5 (body and brief are in the same code format)
1. review format 1 - (soup.select("div.article_body_content p"))[1:]
2. brief format 5 - soup.select(".article_body_content aside p")[0]
3. developer format 1 - soup.select("div.article_body_content li")

In [118]:
link = "https://www.rockpapershotgun.com/gears-tactics-review"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [119]:
info = soup.select("div.article_body_content li")
len(info)

6

In [120]:
# extract main body text
body = (soup.select("div.article_body_content p"))[1:]
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)

I've been playing Gears Tactics with every spare moment of the last week. The second I'm done writing review, I'm going back for more. It's superb. In fact, it's good to the extent where, as risky as it is to say such a thing, I'd argue it sets the new gold standard for turn-based tactics.I say that as a lifelong XCOM (and X-COM) fan, too, and without disrespect to those games. After all, the formula set by 1994's X-COM is the foundation for pretty much the entire genre, and Gears Tactics sits squarely atop it. It's a game about assembling squads from a pool of soldiers, and sending them on missions. There, they have a set number of actions each turn, to spend on moving around the map, shooting baddies, or using special abilities. After your actions run out, the baddies have a turn. In that much at least, little has been reinvented. But just as happened when XCOM arrived in 2010, Gears Tactics has redefined just how much fun can be had with that simple recipe for squad-based xenocide.I

In [121]:
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'Splash Damage, The Coalition'

In [45]:
# using different brief html tag, located by Simplescraper
brief = soup.select(".article_body_content aside p")[0].text
brief

'This is 100% a Gears Of War game, that also happens to be a top flight strategy effort.'

##### webpage format 6 - all info available but len(li)=0; brief variation 1
1. review format 1 - soup.select("div.article_body_content p")[2:]
2. brief format 1 - soup.select(".article_body_content > :nth-child(2)") + regex
3. developer format 1 - soup.find('strong', string='Developer:')

In [122]:
link = "https://www.rockpapershotgun.com/unto-the-end-review"
response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [123]:
info = soup.select("div.article_body_content li")
len(info)

0

In [126]:
# using different brief html tag, located by Simplescraper
# regex captures the text that follows the bold label and its <br/>, up to the next tag
brief_text = str(soup.select(".article_body_content > :nth-child(2)"))
pattern = re.compile(r'<strong>.*?</strong><br/>\s*(.*?)\s*<', re.DOTALL)
# find all matches, reference code found in: https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python
matches = re.findall(pattern, brief_text)
matches[0]

'A unique, demanding 2D sword fighting game that frustrates as much as it impresses.'
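
`re.findall` returns an empty list when nothing matches, so `matches[0]` raises `IndexError` on pages with a different layout. Below is a minimal sketch of the same pattern with a guard. `first_brief` is a hypothetical name and `sample` is a made-up fragment that imitates the label, `<br/>`, text structure.

```python
import re

# bold label, a <br/>, then the brief text up to the next tag
BRIEF_AFTER_BREAK = re.compile(r"<strong>.*?</strong><br/>\s*(.*?)\s*<", re.DOTALL)


def first_brief(fragment):
    """Return the first captured brief, or None when the pattern is absent."""
    matches = BRIEF_AFTER_BREAK.findall(fragment)
    return matches[0] if matches else None


# made-up fragment for illustration only
sample = "<p><strong>Review in brief:</strong><br/>\nA demanding sword fighting game.<br/></p>"
print(first_brief(sample))
print(first_brief("<p>No label here</p>"))
```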

In [127]:
# extract main body text
body = soup.select("div.article_body_content p")[2:]
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")
print(updated_body_text)

Unto The End is a 2D sword fighting game about a little beardy man trying to find his way back home after getting lost hunting a deer. You'll guide him through gloomy, cramped caves, climbing and exploring and gathering scraps of leather, bone, and healing herbs. And you'll fight. Lord, you'll fight.It's very much a game about sword fights, but rather than a hack and slash where you cut down monsters by the dozen, 2 Ton Studios emphasise that Unto The End is about fewer, harder fights. Fights that are deliberate and careful, and reward observation and a cool head rather than sheer aggression or strength. It's a strong concept, gorgeously realised, with some great ideas and a lot of personality.




In [50]:
developer_tag = soup.find('strong', string='Developer:')
# on some pages an extra <br> sits between the <strong> label and the developer name
# step through the siblings to find the text, reference code found in: https://stackoverflow.com/questions/48780530/python-beautifulsoup-find-next-sibling

# use try.. except to handle that case, reference code found in: https://stackoverflow.com/questions/1835756/using-try-vs-if-in-python
# calling .strip() on a Tag such as <br> (rather than on a string) raises TypeError, so the except branch moves one sibling further
try:
    developer_info = developer_tag.next_sibling.strip()
    print(developer_info)
except TypeError:
    developer_info = developer_tag.next_sibling.next_sibling.strip()
    print(developer_info)

2 Ton Studios
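
For comparison, the optional `<br>` can also be handled without the try/except. The sketch below is a regex alternative that runs on the raw HTML string. It is not the sibling-based method used in this notebook. `developer_from_html` is a hypothetical name and both sample fragments are made up, including the placeholder publisher.

```python
import re

# label, any number of <br> tags, then the text up to the next tag
DEVELOPER_AFTER_LABEL = re.compile(
    r"<strong>Developer:</strong>(?:\s*<br\s*/?>)*\s*([^<]+)"
)


def developer_from_html(fragment):
    """Return the text after the 'Developer:' label, or None if it is missing."""
    match = DEVELOPER_AFTER_LABEL.search(fragment)
    if match is None:
        return None
    # an all-whitespace capture counts as missing
    return match.group(1).strip() or None


# made-up fragments: one with the extra <br/>, one without
with_break = (
    "<p><strong>Developer:</strong><br/> 2 Ton Studios<br/>"
    "<strong>Publisher:</strong> Example Publisher</p>"
)
without_break = "<p><strong>Developer:</strong> Stray Fawn Studio<br/></p>"
print(developer_from_html(with_break))
print(developer_from_html(without_break))
```

The sibling-based version stays closer to the parsed tree, while this variant covers both layouts in one expression and returns `None` when the label is absent.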


##### webpage format 7 - all info available but len(li)=0; brief variation 2
1. review format 1 - soup.select("div.article_body_content p")[2:]
2. brief format 1 - soup.select(".article_body_content > :nth-child(1)")[0] + regex
3. developer format 1 - soup.find('strong', string='Developer:')

In [128]:
link = "https://www.rockpapershotgun.com/nimbatus-review-early-access"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [129]:
info = soup.select("div.article_body_content li")
len(info)

0

In [130]:
# extract main body text
body = soup.select("div.article_body_content p")[2:]

body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","").replace("Manage cookie settings", "").replace("To see this content please enable targeting cookies.","")
print(updated_body_text)

I've been doing a lot of tinkering recently. Endless iterations of death-dealing drones, flying factories, huge robotic worms and automated weapon platforms - I've lost count of all my experiments in destruction. Nimbatus’ drone editor is a complex, liberating tool where patient engineers can craft a bewildering array of exotic, inventive machines. Then a giant snake blows them up. When my massive artillery drone - the better part of an hour of work - was smashed to bits in less than a minute, I was less than chuffed. It was bristling with missile launchers and surrounded by shields, but it got stuck on a barely-visible piece of scenery and then blown up by the aforementioned scaley menace, who, it turns out, is invincible. I’ve become well acquainted with these setbacks, watching my glorious drones suffer inglorious ends, but I keep finding myself back in the workshop, slapping on new components and trying to figure out how the heck logic gates work.  
 
Above is the full extent of th
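
The chained `.replace(...)` calls above grow with every new piece of page residue. Below is a minimal sketch that keeps the unwanted phrases in one tuple and removes them in a loop. The three phrases are the ones used in this notebook, `strip_boilerplate` is a hypothetical helper name, and `sample` is a made-up string.

```python
BOILERPLATE = (
    "\nThis review is based on a copy of the game provided by the publisher.",
    "Manage cookie settings",
    "To see this content please enable targeting cookies.",
)


def strip_boilerplate(text, phrases=BOILERPLATE):
    """Remove every known boilerplate phrase from the text."""
    for phrase in phrases:
        text = text.replace(phrase, "")
    return text


# made-up body text for illustration only
sample = (
    "Great game.To see this content please enable targeting cookies."
    "Manage cookie settings"
    "\nThis review is based on a copy of the game provided by the publisher."
)
print(strip_boilerplate(sample))
```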

In [131]:
developer_tag = soup.find('strong', string='Developer:')
try:
    developer_info = developer_tag.next_sibling.strip()
    print(developer_info)
except TypeError:
    developer_info = developer_tag.next_sibling.next_sibling.strip()
    print(developer_info)

Stray Fawn Studio


In [132]:
brief_str = soup.select(".article_body_content > :nth-child(1)")[0].text

if brief_str == "" or brief_str is None: # probably an img link, no brief, or simply an empty element
    brief_text = "n/a"
    print("this is not a brief, should be TYPE II format 1")

else:
    # regex capture any text before "Developer:"
    # tested using https://regex101.com/
    pattern = re.compile(r'(.+?)\nDeveloper:', re.DOTALL)
    match = re.search(pattern, brief_str)
    if match:
        brief_text = match.group(1).strip()
        print("this is a brief, should be TYPE I Format 8")
    else:
        if "Developer" in brief_str:
            # the block holds only the info list, so there is no brief
            brief_text = "n/a"
            print("this is not a brief, should be TYPE II format 1")
        else:
            brief_text = brief_str
            print("this is a brief, should be TYPE I Format 8")

print(brief_text)

this is a brief, should be TYPE I Format 8
Premature Evaluation is the weekly column in which we explore the wilds of early access. This week, Fraser’s put on his (prescription) welding goggles to fix up drones in Nimbatus, a space drone construction sim.
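
The branching above can be read as a small decision function: an empty block means no brief, text before a `Developer:` line is the brief, a block that only mentions the developer has no brief, and anything else is the brief itself. Below is a minimal sketch on plain strings. `brief_from_first_block` is a hypothetical name, it returns `None` where the notebook uses `"n/a"`, and it also trims the result in the last case.

```python
import re

# any text that comes before a line starting with "Developer:"
BEFORE_DEVELOPER = re.compile(r"(.+?)\nDeveloper:", re.DOTALL)


def brief_from_first_block(block_text):
    """Return the brief contained in the first content block, or None."""
    if not block_text:
        return None
    match = BEFORE_DEVELOPER.search(block_text)
    if match:
        return match.group(1).strip()
    if "Developer" in block_text:
        # only the info list is present, so there is no brief
        return None
    return block_text.strip()


# made-up block texts for illustration only
print(brief_from_first_block("A short brief.\nDeveloper: Example Studio"))
print(brief_from_first_block("Developer: Example Studio"))
print(brief_from_first_block("Just a brief."))
```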


##### webpage format 8 - all info available but len(li)=0; brief variation 3
1. review format 1 - soup.select("div.article_body_content p")[1:]
2. brief format 1 - soup.select("em")
3. developer format 1 - soup.find('strong', string='Developer:')

In [133]:
link = "https://www.rockpapershotgun.com/early-access-review-hades"
response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [134]:
info = soup.select("div.article_body_content li")
len(info)

0

In [135]:
# extract main body text
# older posts have a less refined structure
# in case developer information or an empty string is included at the beginning, iterate until the body text is found
paragraphs = soup.select("div.article_body_content p")
# default: skip the first four paragraphs if none of them looks like body text
start = 4
for i, paragraph in enumerate(paragraphs[:4]):
    # body text starts at the first non-empty paragraph without the "Release" info line
    if "Release" not in paragraph.text and paragraph.text != "":
        start = i
        break
body = paragraphs[start:]

body_text = ""

# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","").replace("Manage cookie settings", "").replace("To see this content please enable targeting cookies.","")
print(updated_body_text)

Let’s take another look at Hades, the rogue-ish-action-hack-n-slash-n-chat-em-up by Supergiant Games, developers of Bastion and Transistor, in which you play Zagreus, the immortal son of the lord of the underworld on a quest to repeatedly run away from home. Home, in this case, is a giant-ass castle in helltown where the tortured souls of the deceased languish while they await processing, like an infernal waiting room or a less depressing version of Digbeth Coach Station. Yeah, that’s right Birmingham, your dumb coach station is whatever the building equivalent of abject misery is. The toilets cost 30p and they don’t give change. I once saw a rat eating a pigeon there.When I last pointed my Premature Evaluation telescope at Hades one year ago, it was already extremely good, cursing me with a misplaced confidence in the quality of early access games that was then cruelly eroded over the course of 2019 by the confusing parade of broken shite that followed. There was that janky pirate MMO

In [136]:
developer_tag = soup.find('strong', string='Developer:')
try:
    developer_info = developer_tag.next_sibling.strip()
    print(developer_info)
except TypeError:
    developer_info = developer_tag.next_sibling.next_sibling.strip()
    print(developer_info)

Supergiant Games


In [137]:
# extract brief
brief = soup.select("em")[0].text
print(brief)

Premature Evaluation is the weekly column in which Steve Hogarty explores the wilds of early access. This week, he returns to the realm of the damned to see what's been goin' down in helltown.


#### FORMAT TYPE II
- some information is missing (either missing brief or brief & developer info)

##### webpage format 1 - no brief, part of info (release date, etc.) is embedded in body text
1. review format 1 - soup.select("div.article_body_content p")[1:]
2. brief format 1 - no brief, or the first p of body text
3. developer format 1 - soup.find('strong', string='Developer:')

In [138]:
link = "https://www.rockpapershotgun.com/wot-i-think-journey-to-the-savage-planet"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [139]:
info = soup.select("div.article_body_content li")
len(info)

0

In [140]:
# extract main body text
# iterate until the body text is found
paragraphs = soup.select("div.article_body_content p")
# default: skip the first four paragraphs if none of them looks like body text
start = 4
for i, paragraph in enumerate(paragraphs[:4]):
    # body text starts at the first non-empty paragraph without the "Release" info line
    if "Release" not in paragraph.text and paragraph.text != "":
        start = i
        break
body = paragraphs[start:]

body_text = ""

# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","").replace("Manage cookie settings", "").replace("To see this content please enable targeting cookies.","")
print(updated_body_text)


My clone wakes up in the hell pod for the dozenth time, and I am immediately desperate to leave. Martin Tweed, proud president of the universe's 4th best space exploration company, yells at me from inside a cheesy company FMV memo. The ship's computer lectures me about our dwindling supplies of resurrection goop. The Grob machine will not stop bleeting. Fortunately, there's a planet out there to be explored, exploited and catalogued. All of it is colourful, most of it is gooey, and some of it is hostile. Far too much of it is fart jokes.Journey To The Savage Planet is a romp, and a worthwhile one. It hasn't quite left me feeling full, but I am a picky eater. If you're hankering for a good explore 'em up, the Savage Planet ticks the boxes. You will turn corners and see giant mushroom forests (with a friend, if you like). You will grapple up barnacle cliffs, and hop about lava caves. You'll shoot at glowing lizards, blast through walls with harvested "bombegranates", and be snuck up on b

In [141]:
developer_tag = soup.find('strong', string='Developer:')
try:
    developer_info = developer_tag.next_sibling.strip()
    print(developer_info)
except TypeError:
    developer_info = developer_tag.next_sibling.next_sibling.strip()
    print(developer_info)

Typhoon Studios


In [65]:
# no brief

##### webpage format 2 - no brief; len(li)!=0
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 1 - no brief
3. developer format 1 - soup.select("div.article_body_content li")

In [142]:
link = "https://www.rockpapershotgun.com/iron-harvest-review"

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [143]:
info = soup.select("div.article_body_content li")
len(info)

6

In [144]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
for item in body:
    body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(updated_body_text)

Iron Harvest is an alternate history RTS in which the great empires of the early 20th century invented giant stompy robots with machine guns for heads and spinning blades for arms. Inspired by the work of Polish artist Jakub Rozalski, who paints smokey pastoral scenes loomed over by grey machines lumbering across hazy horizons, the game depicts a world in which the mechanization of warfare didn’t stop with tanks and railguns, but instead progressed to iron mech suits and industrial-era megazords.These are very good robots. Not the pristine, precision-engineered war machines of modern science-fiction, with all of their ball bearings and cupholders and heated seats, but noisy, oily, juddering wrecks that stink of WD40 and tremble like nervous metal greyhounds. These are janky, diesel-powered mechs that look like they’re about to rattle themselves to pieces sooner than stroll into battle. Iron Harvest conveys the look and feel of Rozalski’s fantasy universe, tapping into the sense of awe 

In [145]:
# different tag to locate developer info - Simplescraper
info = soup.select("div.article_body_content li")
developer = info[0].text.replace("Developer: ", "").replace("Developer:", "").replace("\n","")
developer

'King Art'

In [70]:
# no brief info

##### webpage format 3 - no info & brief, only body text; or brief is embedded in body text
1. review format 1 - soup.select("div.article_body_content p")
2. brief format 1 - no brief, or the first p of body text
3. developer format 1 - no info

In [146]:
link = "https://www.rockpapershotgun.com/wot-i-think-star-wars-chess"
# link = "https://www.rockpapershotgun.com/witch-hunt-review-early-access" # premature evaluation

response = requests.get(url=link)
html_code = response.text
soup = BeautifulSoup(html_code, "html.parser")

In [147]:
info = soup.select("div.article_body_content li")
len(info)

0

In [149]:
# extract main body text
body = (soup.select("div.article_body_content p"))
body_text = ""
# extract text from each paragraph
# special case: for a Premature Evaluation article, the first paragraph of the body text is the brief
if "Premature Evaluation" in str(body):
    brief_text = body[0].text.replace("\n","")
    for item in body[1:]:
        body_text += item.text
else:
    brief_text = "n/a"
    for item in body:
        body_text += item.text
updated_body_text = body_text.replace("\nThis review is based on a copy of the game provided by the publisher.","")

print(brief_text)
print(updated_body_text)

n/a
Oh man, I can't imagine loving anything more than Star Wars. Spaceships and robots and laser swords and that masked evil dude who sounds like he needs a throat sweet - those cool 80s films are the best thing ever. And I really like chess too: it's like a 3D videogame, only you don't need to wear silly glasses! So Star Wars Chess is a dream come true, maybe even the game I've been waiting for my whole life. The only way the universe could possibly get any better would be if they made some new Star Wars films. That would be so awesome.


 
Star Wars Chess is chess with characters from Star Wars. Apart from Han, who only appears as a frozen carbonite block in the background. Maybe they thought everyone would be too confused because he looks like the guy in Indiana Jones and the Last Crusade? It's got Luke, and R2-D2, and my favorite, Chewie!  They're all here! The white side are guys from the Rebel Alliance - R2D2 as pawns, Chewie as knight (let the wookiee win!), C-3P0 as the bishop 
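
The Premature Evaluation special case above can be isolated as a function on plain paragraph strings. This is a minimal sketch with one deliberate difference: it looks for the marker in the first paragraph only, on the assumption (not confirmed by this notebook) that the column intro always comes first, whereas the cell above searches the whole body. `split_brief_and_body` is a hypothetical name and the sample lists are made up.

```python
def split_brief_and_body(paragraph_texts, marker="Premature Evaluation"):
    """Split paragraph strings into (brief, body); brief is 'n/a' when absent."""
    if paragraph_texts and marker in paragraph_texts[0]:
        # the column intro serves as the brief, the rest is the review body
        brief = paragraph_texts[0].replace("\n", "")
        body = "".join(paragraph_texts[1:])
    else:
        brief = "n/a"
        body = "".join(paragraph_texts)
    return brief, body


# made-up paragraph lists for illustration only
column = ["Premature Evaluation is the weekly column.\n", "First paragraph.", "Second paragraph."]
regular = ["First paragraph.", "Second paragraph."]
print(split_brief_and_body(column))
print(split_brief_and_body(regular))
```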

In [150]:
developer_tag = soup.find('strong', string='Developer:')
print(developer_tag)
# try:
#     developer_info = developer_tag.next_sibling.strip()
#     print(developer_info)
# except TypeError:
#     developer_info = developer_tag.next_sibling.next_sibling.strip()
#     print(developer_info)

# no developer info

None


In [151]:
# no brief
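
Across the eleven formats above, the developer name is located in one of three ways: from the `li` items when `len(info)` is non-zero, from the text after the `<strong>Developer:</strong>` label when there are no `li` items, or not at all when neither exists. Below is a minimal sketch of that decision as a function. `developer_strategy` and the returned labels are hypothetical names introduced for illustration.

```python
def developer_strategy(li_count, has_developer_label):
    """Pick how to locate the developer, given what the page contains."""
    if li_count > 0:
        # Type I formats 1-5 and Type II format 2
        return "li"
    if has_developer_label:
        # Type I formats 6-8 and Type II format 1
        return "strong"
    # Type II format 3: no developer info on the page
    return "none"


print(developer_strategy(7, False))
print(developer_strategy(0, True))
print(developer_strategy(0, False))
```

Here `li_count` would correspond to `len(soup.select("div.article_body_content li"))`, and `has_developer_label` to `soup.find('strong', string='Developer:') is not None`.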

### References

Tools used in this notebook:

- Simplescraper, a Chrome Extension to obtain HTML selectors for the selected content on any website. Available at: https://simplescraper.io

- regex101, regular expression tester. Available at: https://regex101.com/
  
Code References:

- How can I get href links from HTML using Python? [online] Stack Overflow. Available at: https://stackoverflow.com/questions/3075550/how-can-i-get-href-links-from-html-using-python

- Differences between .text and .get_text() [online] Stack Overflow. Available at: https://stackoverflow.com/questions/35496332/differences-between-text-and-get-text

- How to remove symbols from a string with Python [online] Stack Overflow. Available at: https://stackoverflow.com/questions/875968/how-to-remove-symbols-from-a-string-with-python

- Python String strip() Method [online] W3Schools. Available at: https://www.w3schools.com/python/ref_string_strip.asp

- Regex select all text between tags [online] Stack Overflow. Available at: https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags

- How can I find all matches to a regular expression in Python? [online] Stack Overflow. Available at: https://stackoverflow.com/questions/4697882/how-can-i-find-all-matches-to-a-regular-expression-in-python

- Python BeautifulSoup find next sibling [online] Stack Overflow. Available at: https://stackoverflow.com/questions/48780530/python-beautifulsoup-find-next-sibling

- Using try vs. if in Python [online] Stack Overflow. Available at: https://stackoverflow.com/questions/1835756/using-try-vs-if-in-python
