# NLP Project - Web Scraping and Text Analysis of Game Reviews on Rock Paper Shotgun
## Part I. Build Dataset
### Step 3 - Scrape webpages using bs4 & loops to get content of all the game-related reviews

#### 1. Extract links & titles from `rps_links_output_file.json`

In [1]:
import json
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup

In [2]:
# Open the JSON file
with open('rps_links_output_file.json', 'r') as file:
    # Load the JSON content into a variable
    data = json.load(file)

In [3]:
print(f"{len(data)} review links have been scraped from rps - until 1/12/2023")

2388 review links have been scraped from rps - until 1/12/2023


#### 2. Data Pre-Processing - obtain game reviews from the past 3.5 years:
1. Remove all reviews published on or before 1/6/2020, keeping only those from the past three and a half years so that the dataset reflects the latest information. Many posts from before mid-2020 lack developer details and briefs, and their inconsistent formatting substantially increases the time needed for accurate data extraction and cleaning.
2. Remove all non-game-related reviews, e.g. hardware and CPU reviews.
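
The two rules above can be sketched as a small, network-free predicate. This is only an illustration: the names `CUTOFF`, `NON_GAME_LABELS` and `keep_review` are hypothetical, and the sample timestamps and labels are made up.

```python
from datetime import date

# Hypothetical names for illustration; the notebook inlines this logic in its loop.
CUTOFF = date(2020, 6, 1)
NON_GAME_LABELS = {"Hardware", "CPU"}


def keep_review(published: str, labels: list[str]) -> bool:
    """Return True for game reviews published after the cutoff date."""
    # Only the first 10 characters (YYYY-MM-DD) of the timestamp are used.
    is_recent = date.fromisoformat(published[:10]) > CUTOFF
    # A review is game-related when it carries none of the excluded labels.
    is_game = NON_GAME_LABELS.isdisjoint(labels)
    return is_recent and is_game


print(keep_review("2023-11-27T10:00:00+00:00", ["Indie", "PC", "Wot I Think"]))
print(keep_review("2023-05-02", ["Hardware", "Reviews"]))
print(keep_review("2019-12-31", ["PC", "RPG"]))
```

Keeping the rules in one function makes it possible to check them on a few hand-written cases before running the full scrape.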

In [5]:
reviews_in_recent_years = []

# cutoff date: keep only reviews published after 1/6/2020
# ChatGPT was used here to ask how to compare dates in Python
# convert str to datetime object
# reference code also found in: https://stackoverflow.com/questions/72702859/use-python-datetime-strptime
datetime_to_compare = datetime.strptime('2020-06-01', '%Y-%m-%d')

for entry in data:  # 'entry' avoids shadowing the built-in dict
    response = requests.get(url=entry['URL'])
    html_code = response.text
    soup = BeautifulSoup(html_code, "html.parser")

    # publish date: first 10 characters (YYYY-MM-DD) of the <time> tag's datetime attribute
    date = soup.find("time").get('datetime')[:10]

    # extract topics provided at the end of review - same for all webpages
    topic = soup.select("#content_above > .page_content span.tagged_with_item.tagged_with_item--secondary a")
    label = [item.text for item in topic]

    if datetime.strptime(date, '%Y-%m-%d') > datetime_to_compare:
        if "Hardware" not in label and "CPU" not in label:
            reviews_in_recent_years.append(entry)
            print(date)
            print(label)
    else:
        # stop at the first older review; this assumes the links are ordered newest first
        break

2023-11-27
['Action Adventure', 'Historical', 'Indie', 'PC', 'Strategy: Real-Time Strategy', 'Wot I Think']
2023-11-27
['Management', 'PC', 'Simulation', 'Strategy', 'The Station', 'Thunderful', 'Thunderful Publishing', 'Wot I Think']
2023-11-23
['Action Adventure', 'Indie', 'RPG', 'Sports', 'Strategy: Turn-Based Strategy', 'Wot I Think']
2023-11-21
['Action Adventure', 'Microids', 'PC', 'Platformer', 'Wot I Think']
2023-11-17
['Action Adventure', 'Darya Noghani', 'Indie', 'RPG', 'Wot I Think']
2023-11-17
['PC', 'PS4', 'PS5', 'Shooter', 'Wot I Think', 'Xbox One', 'Xbox Series X/S']
2023-11-16
['Action Adventure', 'Exploration', 'Indie', 'Moon Lagoon', 'PC', 'Science Fiction', 'Secret Mode', 'Story Rich', 'Wot I Think']
2023-11-15
['Artefacts Studio', 'Dear Villagers', 'PC', 'Simulation', 'Wot I Think']
2023-11-15
['Action Adventure', 'Indie', 'PC', 'Platformer', 'Puzzle', 'Raw Fury', 'Reviews', 'Wot I Think']
2023-11-14
['Atlus', 'PC', 'RPG', 'Sega', 'SEGA of America', 'Simulation', 'S

In [6]:
print(f"There are {len(reviews_in_recent_years)} review articles published on RockPaperShotgun.com in the past three and a half years (1/6/2020 - 1/12/2023)")

There are 540 review articles published on RockPaperShotgun.com in the past three and a half years (1/6/2020 - 1/12/2023)


#### 3. Extract data (title, developer, label, brief, review, date) from each webpage
- Note that different webpages use different HTML formats for the developer, brief, and body text (review), so each variation needs its own extraction code.
- Please refer to `1a_web_scrape_test_code.ipynb` for the different formats. I have summarised 11 formats in total; due to the limited timeframe, any other formats or errors will be manually examined, fixed, or removed after all the data has been extracted.
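
One way to picture "one piece of code per format" is an ordered list of extractor functions that are tried in turn until one succeeds. The sketch below is a simplified, standard-library-only illustration, not the notebook's actual selectors: the function names, the format names and the sample HTML strings are all hypothetical.

```python
import re
from typing import Callable, Optional


# Hypothetical, simplified extractors; the real selectors live in the main cell.
def brief_before_list(html: str) -> Optional[str]:
    """Brief sits between a <strong> heading and the <ul> of game details."""
    match = re.search(r"<strong>.+?</strong>(.+?)<ul>", html, re.DOTALL)
    if match is None:
        return None
    text = re.sub(r"<.*?>", "", match.group(1)).strip()
    return text or None


def brief_in_em(html: str) -> Optional[str]:
    """Brief is the first italicised (<em>) passage."""
    match = re.search(r"<em>(.+?)</em>", html, re.DOTALL)
    return match.group(1).strip() if match else None


# Extractors are tried in this order; the labels are made up for the example.
EXTRACTORS: list[tuple[str, Callable[[str], Optional[str]]]] = [
    ("brief before list", brief_before_list),
    ("brief in em", brief_in_em),
]


def extract_brief(html: str) -> tuple[str, str]:
    """Try each extractor in order; fall back to 'n/a' when none matches."""
    for name, extractor in EXTRACTORS:
        brief = extractor(html)
        if brief:
            return name, brief
    return "no brief", "n/a"


page_a = "<p><strong>Game review</strong> A cosy city builder.<ul><li>Developer: X</li></ul></p>"
page_b = "<p><em>A short roguelike.</em></p><p>Body text.</p>"
page_c = "<p>Body text only.</p>"

for page in (page_a, page_b, page_c):
    print(extract_brief(page))
```

Returning the matched format name alongside the brief plays the same role as the `index... is Type ..., format ...` messages printed during extraction: it shows which variation each page fell into.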

In [60]:
# code used to extract each format is first tested in '1a_web_scrape_test_code.ipynb' 
# combine all the code together to extract everything at once

game_titles = []
developers = []
labels = []
briefs = []
reviews = []
dates = []

# iterate using enumerate - get both index and element
for i, element in enumerate(reviews_in_recent_years):
    response = requests.get(url=element['URL'])
    html_code = response.text
    soup = BeautifulSoup(html_code, "html.parser")

    # extract date - same for all webpages
    date = soup.find("time").get('datetime')[:10]
    dates.append(date)

    # extract game titles from url link - same for all webpages
    link_title = " ".join(element['URL'].replace("https://www.rockpapershotgun.com/", "").split("-")).title()
    cleaned_link_title = re.sub(r'Review.*', 'Review', link_title) # Use regex to remove everything after 'review'
    game_title = cleaned_link_title.replace(" Review", "").replace("Wot I Think ", "")
    game_titles.append(game_title)

    # extract topics provided at the end of review - same for all webpages
    topic = soup.select("#content_above > .page_content span.tagged_with_item.tagged_with_item--secondary a")
    label = []
    for item in topic:
        label.append(item.text)
    labels.append(label)

    info = soup.select("div.article_body_content li")

    if len(info) != 0: # info/brief is directly available for scraping (mostly)
        # extract developer info
        developer = info[0].text.replace("Developer:", "").replace("\n","")
        developers.append(developer)

        # try.. except.. reference code found in: https://www.programiz.com/python-programming/exception-handling
        try: 
            brief = str(soup.select(".article_body_content > :nth-child(1)"))
            pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
            text = re.search(pattern, brief).group(1).strip().replace("\r", "").replace("\n", "")
            # remove everything in <> and <> itself, if any
            brief_text = re.sub(r'<.*?>', '', text)
            if brief_text == "":
                brief_text = "n/a"

            print(f"index{i} is Type I, format 1 - {element['Title']}: {element['URL']}")

            # extract review
            body = soup.select("div.article_body_content p")
            body_text = ""
            for item in body:
                body_text += item.text

        except AttributeError:
            try: 
                brief = str(soup.select(".article_body_content > :nth-child(2)"))
                pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
                text = re.search(pattern, brief).group(1).strip().replace("\r", "").replace("\n", "")
                # remove everything in <> and <> itself, if any
                brief_text = re.sub(r'<.*?>', '', text)

                print(f"index{i} is Type I, format 2 - {element['Title']}: {element['URL']}")

                # extract review
                body = soup.select("div.article_body_content p")
                body_text = ""
                for item in body:
                    body_text += item.text

            except AttributeError:
                try:
                    brief = str(soup.select(".article_body_content > :nth-child(3)"))
                    pattern = re.compile(r'<strong>.+?</strong>(.+?)<ul>', re.DOTALL)
                    text = re.search(pattern, brief).group(1).strip().replace("\r", "").replace("\n", "")
                    # remove everything in <> and <> itself, if any
                    brief_text = re.sub(r'<.*?>', '', text)

                    print(f"index{i} is Type I, format 3 - {element['Title']}: {element['URL']}")

                    # extract review
                    body = soup.select("div.article_body_content p")
                    body_text = ""
                    for item in body:
                        body_text += item.text

                except AttributeError:
                    try: 
                        brief_text = soup.select(":nth-child(1) > em")[0].text

                        print(f"index{i} is Type I, format 4 - {element['Title']}: {element['URL']}")

                        # extract review
                        body = soup.select("div.article_body_content p")
                        body_text = ""
                        for item in body:
                            body_text += item.text

                    except IndexError:
                        try:
                            brief_text = soup.select(".article_body_content aside p")[0].text

                            print(f"index{i} is Type I, format 5 - {element['Title']}: {element['URL']}")

                            # extract review
                            body = (soup.select("div.article_body_content p"))[1:]
                            body_text = ""
                            for item in body:
                                body_text += item.text
                        
                        except IndexError:
                            brief_text = "n/a"

                            print(f"index{i} is Type II, format 2 - {element['Title']}: {element['URL']}")

                            # extract review
                            body = (soup.select("div.article_body_content p"))
                            body_text = ""
                            for item in body:
                                body_text += item.text             

    else: # info is not directly available for scraping or there is no info/brief at all 
        developer_tag = soup.find('strong', string='Developer:')
        if developer_tag is not None: # if developer_tag can be found - some or all info can be scraped
            try:
                brief_str = str(soup.select(".article_body_content > :nth-child(2)"))
                pattern = re.compile(r'<strong>.*?</strong><br/>\s*(.*?)\s*<', re.DOTALL)
                brief_text = re.findall(pattern, brief_str)[0]      

                print(f"index{i} is Type I, format 6 - {element['Title']}: {element['URL']}")

                # extract review
                body = soup.select("div.article_body_content p")[2:]
                body_text = ""
                for item in body:
                    body_text += item.text

            except IndexError:
                brief_str = soup.select(".article_body_content > :nth-child(1)")[0].text
                
                if brief_str is None or brief_str == "": # probably an image link with no brief, or simply an empty text
                    try:
                        brief = soup.select("em")[0].text
                        brief_text = brief

                        print(f"index{i} is Type I, format 8 - {element['Title']}: {element['URL']}")

                    except IndexError:
                        brief_text = "n/a"

                        print(f"index{i} is Type II, format 1 - {element['Title']}: {element['URL']}")

                    # extract review - skip leading paragraphs that are empty or contain release info
                    # (each paragraph is tested against itself; at most four are skipped)
                    paragraphs = soup.select("div.article_body_content p")
                    start = 0
                    while start < min(4, len(paragraphs)) and (
                        "Release" in paragraphs[start].text or paragraphs[start].text == ""
                    ):
                        start += 1
                    body = paragraphs[start:]

                    body_text = ""
                    for item in body:
                        body_text += item.text

                else:
                    # regex capture any text before "Developer:" and after "review"
                    pattern = re.compile(r'review(.+?)\nDeveloper:', re.DOTALL)
                    match = re.search(pattern, brief_str)
                    if match:
                        text = match.group(1).strip()
                        # remove everything in <> and <> itself, if any
                        brief_text = re.sub(r'<.*?>', '', text)

                        print(f"index{i} is Type I, format 7 - {element['Title']}: {element['URL']}")

                        # extract review
                        body = soup.select("div.article_body_content p")[2:]
                        body_text = ""
                        for item in body:
                            body_text += item.text

                    else:
                        if "Developer" in brief_str: # the text starts with developer info, no brief can be extracted
                            try:
                                brief = soup.select("em")[0].text
                                brief_text = brief

                                print(f"index{i} is Type I, format 8 - {element['Title']}: {element['URL']}")

                            except IndexError:
                                brief_text = "n/a"

                                print(f"index{i} is Type II, format 1 - {element['Title']}: {element['URL']}")
            
                            # extract review - skip leading paragraphs that are empty or contain release info
                            # (each paragraph is tested against itself; at most four are skipped)
                            paragraphs = soup.select("div.article_body_content p")
                            start = 0
                            while start < min(4, len(paragraphs)) and (
                                "Release" in paragraphs[start].text or paragraphs[start].text == ""
                            ):
                                start += 1
                            body = paragraphs[start:]

                            body_text = ""
                            for item in body:
                                body_text += item.text
        
                        else:
                            brief_text = brief_str

                            print(f"index{i} is Type I, format 7 - {element['Title']}: {element['URL']}")

                            # extract review
                            body = soup.select("div.article_body_content p")[2:]
                            body_text = ""
                            for item in body:
                                body_text += item.text

            try:
                developer_info = developer_tag.next_sibling.strip()
                developers.append(developer_info)
            except TypeError:
                developer_info = developer_tag.next_sibling.next_sibling.strip()
                developers.append(developer_info)
                        
        else:  # no info can be scraped at all, only the body text
            print(f"index{i} is Type II, format 3 - {element['Title']}: {element['URL']}")
            developers.append("n/a")

            # but Premature Evaluation is different, its first paragraph of body text can be the brief
            # extract review
            body = (soup.select("div.article_body_content p"))
            body_text = ""
            if "Premature Evaluation" in str(body):
                brief_text = body[0].text.replace("\n","")
                for item in body[1:]:
                    body_text += item.text
            else:
                brief_text = "n/a"
                for item in body:
                    body_text += item.text
   
    briefs.append(brief_text)
    updated_body_text = body_text.replace(
        "\nThis review is based on a copy of the game provided by the publisher.",""
        ).replace(
            "Manage cookie settings", ""
            ).replace(
                "To see this content please enable targeting cookies.", "")
    review = {}
    review['review'] = updated_body_text
    reviews.append(review)

index0 is Type I, format 1 - Last Train Home review: freezing to death in the Russian Civil War shouldn't be this entertaining: https://www.rockpapershotgun.com/last-train-home-review
index1 is Type I, format 1 - SteamWorld Build review: joyful citybuilding and delightful dungeon digging: https://www.rockpapershotgun.com/steamworld-build-review
index2 is Type I, format 1 - Knuckle Sandwich review: a turn-based RPG that's a little too random: https://www.rockpapershotgun.com/knuckle-sandwich-review
index3 is Type I, format 1 - Flashback 2 review: a broken travesty of a retro revival: https://www.rockpapershotgun.com/flashback-2-review
index4 is Type I, format 1 - Small Saga review: a short, story-heavy RPG that's light on challenge: https://www.rockpapershotgun.com/small-saga-review
index5 is Type I, format 1 - Call Of Duty: Modern Warfare 3 multiplayer review: a tiring nostalgia trip: https://www.rockpapershotgun.com/call-of-duty-modern-warfare-3-multiplayer-review
index6 is Type I, fo

* Note that the code above is designed to run on the whole dataset (2386 links). The 10 formats handled still do not cover every variation across all 2386 links, and minor adjustments to the code might be needed to extract all of the information correctly (given the limited timeframe, the remaining formats would need extra time to handle). It is, however, more than enough to correctly extract the reviews and related information from the past three and a half years.
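
Because some pages may not match any handled format, it can help to list the records that ended up with the `n/a` placeholder or an empty field, so they can be examined by hand. The sketch below assumes records shaped like the dictionaries built in step 4 (`Developer`, `Brief`, `Review` keys); the helper name `needs_manual_review` and the sample records are made up.

```python
def needs_manual_review(record: dict) -> list[str]:
    """Return the fields that are empty or hold the 'n/a' placeholder."""
    checked_fields = ("Developer", "Brief", "Review")
    return [
        field
        for field in checked_fields
        if record.get(field, "").strip() in ("", "n/a")
    ]


# Made-up records for illustration only.
sample_records = [
    {"Title": "Game A review", "Developer": "Studio A", "Brief": "A brief.", "Review": "Long text."},
    {"Title": "Game B review", "Developer": "n/a", "Brief": "n/a", "Review": "Long text."},
    {"Title": "Game C review", "Developer": "Studio C", "Brief": "A brief.", "Review": ""},
]

for index, record in enumerate(sample_records):
    missing = needs_manual_review(record)
    if missing:
        print(f"index{index} - {record['Title']}: check {', '.join(missing)}")
```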

#### 4. Clean & Store the extracted dataset to local JSON file
The result is stored in the JSON file `game_reviews_rps_in_recent_years.json`.

In [61]:
# use regex to strip ':', whitespace, '<br>' or '<br/>' from the start and end of the text
# regex tested using regex101
# ChatGPT was used here to debug and fix the regex
def remove_weird_character(text):
    # match <br> / <br/> as whole tags; a character class such as [<br>] would
    # also strip single 'b' and 'r' letters from the ends of the text
    pattern = re.compile(r'^(?:[:\s]|<br\s*/?>)+|(?:[:\s]|<br\s*/?>)+$')
    return pattern.sub('', text)
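
The trimming step can be checked in isolation on a few strings. The sketch below is a stand-alone illustration under a hypothetical name (`strip_edges`); it uses an alternation so that `<br>` and `<br/>` are matched as whole tags rather than as individual characters, and the sample inputs are made up.

```python
import re

# Hypothetical helper: strips ':', whitespace and whole <br> / <br/> tags
# from both ends of a string, leaving the middle untouched.
EDGE_JUNK = re.compile(r"^(?:[:\s]|<br\s*/?>)+|(?:[:\s]|<br\s*/?>)+$")


def strip_edges(text: str) -> str:
    return EDGE_JUNK.sub("", text)


print(strip_edges(": Thunderful <br/>"))
print(strip_edges("<br>Raw Fury:"))
print(strip_edges("robber"))
```

The last call is a guard case: a word that merely starts or ends with the letters `b` or `r` should come back unchanged.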

In [62]:
merged_dict = {}
rps_data = []

for i in range(0, len(reviews_in_recent_years)):
    merged_dict = {
        'Title': reviews_in_recent_years[i]['Title'],
        'URL': reviews_in_recent_years[i]['URL'],
        'Game Title': game_titles[i],
        'Developer': remove_weird_character(developers[i].replace("\n", "")),
        'Labels': ", ".join(labels[i]), # convert list to str for csv
        'Brief': remove_weird_character(briefs[i]),
        'Review': reviews[i]['review'].replace("\n", "").replace("\r", ""), # remove all newlines for further processing
        'Publish Date': dates[i]
    }
    rps_data.append(merged_dict)

In [63]:
with open('game_reviews_rps_in_recent_years.json', 'w') as output:
    json.dump(rps_data, output)
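
A quick way to confirm that the file was written correctly is to load it back and compare it with the in-memory data. The sketch below does this round trip in a temporary directory with a made-up record that uses the same keys as above; the file name `sample_reviews.json` and the explicit `encoding`, `ensure_ascii` and `indent` arguments are choices made for this example, not settings used in the notebook.

```python
import json
import tempfile
from pathlib import Path

# Made-up record using the same keys as the merged dictionaries.
records = [
    {
        "Title": "Game A review",
        "URL": "https://example.com/game-a-review",
        "Game Title": "Game A",
        "Developer": "Studio A",
        "Labels": "Indie, PC",
        "Brief": "A brief.",
        "Review": "Text.",
        "Publish Date": "2023-11-27",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample_reviews.json"  # hypothetical file name
    # Write the records, keeping non-ASCII characters readable in the file.
    with open(path, "w", encoding="utf-8") as output:
        json.dump(records, output, ensure_ascii=False, indent=2)
    # Read them back to confirm nothing was lost or altered.
    with open(path, "r", encoding="utf-8") as source:
        loaded = json.load(source)

print(loaded == records)
```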

### References

Tools used in this notebook:

- Simplescraper, a Chrome extension for obtaining the HTML elements of selected content on any website. Available at: https://simplescraper.io

- regex101, regular expression tester. Available at: https://regex101.com/
  
Code References:

- Using Python datetime.strptime [online] Stack Overflow. Available at: https://stackoverflow.com/questions/72702859/use-python-datetime-strptime

- Python Exception Handling [online] Programiz. Available at: https://www.programiz.com/python-programming/exception-handling