# Games Research


## Table of contents

1. [Introduction](#Introduction)
2. [Imports](#Imports)
3. [Data Acquisition](#Data_aqu)

# <a class="anchor" id="Introduction"></a>1. Introduction



# <a class="anchor" id="Imports"></a>2. Imports

In [1]:
!pip install selenium
!pip install pydot
!pip install pydotplus
from bs4 import BeautifulSoup
from bs4.element import Tag as HtmlTag
import requests
import os
import re
from random import randint
import time
from time import sleep
from tqdm import tqdm
from abc import ABC, abstractmethod
from enum import Enum
from functools import partial
from typing import Callable, List
from numbers import Number
import pandas as pd
import numpy as np
import concurrent.futures
import seaborn as sns
import matplotlib.pyplot as plt
from multiprocessing import Pool
from sklearn.preprocessing import LabelEncoder
from IPython.display import Image, display
import pydotplus
from concurrent.futures import ThreadPoolExecutor, as_completed
from scipy import misc
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC



# <a class="anchor" id="Data_aqu"></a>3. Data Acquisition

To build a prediction model, we first needed to gather relevant data.

We looked for the largest digital distribution service and storefront for video games, scraped the fields we expected to be useful, and saved them as a dataframe.

## The platform we chose as our data source:  [Steam](https://store.steampowered.com "Steam")
####  
<div>
<img src=https://www.turn-on.de/media/cache/article_images/media/cms/2017/07/steam-logo.jpg?890000 width="300">
</div>

We chose Steam because of the scale of its community and the number of features listed for each game.
When we started scraping, we immediately ran into a problem: the search page loads more games only as you scroll down. To handle this we used Selenium, which drives a real browser and can scroll the page for us.
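
One way to handle this kind of scrolling without guessing how many results each scroll loads is to keep scrolling until enough rows are present or the row count stops growing. The sketch below is only an illustration of that idea, not the method used in this notebook: `scroll_until_enough`, `count_rows`, `scroll_once`, and `FakePage` are hypothetical names, and the two callables stand in for the Selenium calls that would count result rows and scroll the window.

```python
from typing import Callable


def scroll_until_enough(
    count_rows: Callable[[], int],
    scroll_once: Callable[[], None],
    target: int,
    max_scrolls: int = 1000,
) -> int:
    """Scroll until `target` rows are loaded or the page stops growing.

    Hypothetical helper: both callables are supplied by the caller, so the
    loop itself does not depend on Selenium.
    """
    loaded = count_rows()
    for _ in range(max_scrolls):
        if loaded >= target:
            break
        scroll_once()
        new_loaded = count_rows()
        if new_loaded == loaded:  # nothing new appeared: end of results
            break
        loaded = new_loaded
    return loaded


# Stand-in for a page that loads 25 rows per scroll, up to 60 rows in total.
class FakePage:
    def __init__(self, per_scroll: int = 25, total: int = 60) -> None:
        self.rows = min(per_scroll, total)
        self.per_scroll = per_scroll
        self.total = total

    def count(self) -> int:
        return self.rows

    def scroll(self) -> None:
        self.rows = min(self.rows + self.per_scroll, self.total)


page = FakePage()
loaded_rows = scroll_until_enough(page.count, page.scroll, target=50)
print(loaded_rows)
```

With Selenium, `count_rows` could be something like `lambda: len(driver.find_elements(By.CSS_SELECTOR, ".search_result_row"))`, and `scroll_once` a function that runs the scroll script and then waits for the new rows to load.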

## Getting Game Links

## Reading game links and scraping data from each page

In [None]:
# Set up the web driver
driver = webdriver.Firefox()
driver.get("https://store.steampowered.com/search?category1=998&supportedlang=english&ndl=1")

# Specify the number of links to scrape
num_links_to_scrape = 50

# Calculate the number of times to scroll down based on the number of links loaded per scroll
num_links_per_scroll = 25
num_scrolls = int(num_links_to_scrape / num_links_per_scroll)

# Scroll down to load more search results
for i in range(num_scrolls):
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)
    except Exception as e:
        print(e)
        break

# Extract game links and prices from search results
soup = BeautifulSoup(driver.page_source, "html.parser")

# Read the link and the price from the same result row so the two lists stay aligned
game_links = []
prices = []
for game in soup.select("a.search_result_row")[:num_links_to_scrape]:
    game_links.append(game.get("href"))

    price_tag = game.select_one(".search_price")
    if price_tag is None:
        price = ""
    elif price_tag.select_one("strike"):
        # Discounted game: keep the original (struck-through) price
        price = price_tag.select_one("strike").text.strip()
    else:
        price = price_tag.text.strip()

    # Drop the shekel sign and mark missing or empty prices as "N/A"
    price = price.replace("₪", "").strip()
    prices.append(price if price else "N/A")

# Save game links and prices to dataframe
df = pd.DataFrame({"link_to_game_page": game_links, "price": prices})

# Quit the web driver
driver.quit()

# Print the number of links and the dataframe
print(f"Total links scraped: {len(game_links)}")
print(df)
df.to_csv('game_links_and_prices.csv', index=False)

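The `price` column saved above is still text: a number with the currency sign removed, or a label such as `N/A`. Before it can be used as a numeric feature it has to be converted. The following is a minimal sketch under stated assumptions: `parse_price` is a hypothetical helper, the sample strings are illustrative rather than actual scraped values, prices are assumed to use a dot as the decimal separator, and treating any label that contains "free" as a price of 0 is an assumption about how free games are labelled.

```python
import math
import re


def parse_price(raw: str) -> float:
    """Convert a scraped price label to a float (hypothetical helper).

    Returns 0.0 for labels containing "free" and NaN when no number is found.
    """
    text = raw.strip().lower()
    if "free" in text:
        return 0.0
    # Drop thousands separators, then take the first number in the string.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else math.nan


# Illustrative inputs only, not values taken from the scraped data.
samples = ["₪59.95", "1,299.00", "Free To Play", "N/A", ""]
parsed = [parse_price(s) for s in samples]
print(parsed)
```

Applied to the dataframe, this could look like `df["price"] = df["price"].astype(str).map(parse_price)`.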

In [None]:
data = pd.read_csv("game_links_and_prices.csv")
print(len(data))

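The scraping loop that follows skips package pages and reads each game's ID from the downloaded page. A link can also be classified from its path alone, which avoids substring checks that might match part of a game's name. The sketch below is a hedged illustration: `extract_app_id` is a hypothetical helper, it assumes app pages have paths of the form `/app/<id>/<name>/`, and the package link is a made-up example.

```python
from typing import Optional
from urllib.parse import urlparse


def extract_app_id(link: str) -> Optional[str]:
    """Return the app ID from a store link, or None if it is not an app page.

    Hypothetical helper; assumes paths of the form /app/<id>/<name>/.
    """
    parts = [p for p in urlparse(link).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "app" and parts[1].isdigit():
        return parts[1]
    return None


app_link = "https://store.steampowered.com/app/221260/Little_Inferno/?snr=1_7_7_230_150_55"
sub_link = "https://store.steampowered.com/sub/12345/"  # illustrative package link
app_id = extract_app_id(app_link)
sub_id = extract_app_id(sub_link)
print(app_id, sub_id)
```

Rows for which the helper returns `None` could be dropped before any request is sent, for example with `data[data["link_to_game_page"].map(extract_app_id).notna()]`.
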
In [2]:
# Read the CSV file containing the URLs
data = pd.read_csv('game_links_and_prices.csv')

# Create an empty list to store the scraped data
game_data_list = []

# Initialize an empty dataframe to store the scraped data
game_data_df = pd.DataFrame(columns=['game_id',
                                     'game_name',
                                     'game_price',
                                     'release_date',
                                     'publisher',
                                     'developer',
                                     'all_review_score',
                                     'all_review_count',
                                     'genre',
                                     'features'])

# Create a session object
session = requests.Session()

# Collect the feature list of every game
all_games_features = []

# Loop through each URL in the CSV file
for index, row in tqdm(data.iterrows(), total=len(data)):
    
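    # Skip package ("sub") pages; match the path segment so that game names
    # containing "sub" are not skipped by accident
    if '/sub/' in row['link_to_game_page']:
        continue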
    
    # Extract game price
    game_price = row['price']

    # Send a GET request to the URL, retrying up to three times
    response = None
    for i in range(3):
        try:
            response = session.get(row['link_to_game_page'])
            if response.status_code == 200:
                break
        except Exception as e:
            print(f"An error occurred while requesting {row['link_to_game_page']}: {e}")
        time.sleep(1)

    # If every attempt failed, skip to the next URL
    if response is None or response.status_code != 200:
        print(f"Failed to get data from {row['link_to_game_page']}")
        continue


    # Parse the HTML of the response with Beautiful Soup
    soup = BeautifulSoup(response.content, "html.parser")
    # Check if the age verification prompt is present on the page
    age_gate = soup.find('div', {'class': 'agegate_text_container'})

    # If age verification is required
    if age_gate is not None:
        # Collect the form's input fields (named form_data so that the
        # `data` dataframe is not overwritten)
        form_data = {}
        for input_tag in soup.find_all('input'):
            name = input_tag.get('name')
            value = input_tag.get('value')
            if name is not None:
                form_data[name] = value

        # Set a birth date of 1 January 1980 (over 18)
        form_data['ageDay'] = '1'
        form_data['ageMonth'] = '1'
        form_data['ageYear'] = '1980'

        # Send a POST request with the form values through the same session.
        # Assumption: the form is posted back to the page URL; adjust this if
        # the form's action attribute points elsewhere.
        response = session.post(row['link_to_game_page'], data=form_data)
        soup = BeautifulSoup(response.content, 'html.parser')

        
    # Extract the game ID from the page
    game_id = soup.find('meta', {'property': 'og:url'})['content'].split('/')[4]    
    
    # Extract the game name from the page, dropping trademark symbols and curly apostrophes
    game_name = soup.find('b', string='Title:').next_sibling.strip().replace("™", "").replace("®", "").replace("’", "")
    
    # Extract the release date from the page
    release_date = soup.find('div', {'class': 'date'})
    if release_date is not None:
        release_date = release_date.text.strip()
    
    # Find the publisher and the developer in the "dev_row" div elements
    publisher = None
    developer = None
    for dev_row in soup.find_all("div", {"class": "dev_row"}):
        links = dev_row.find_all("a", href=True)
        if not links:
            continue
        # Keep only the first name listed in the first matching row
        if "Publisher:" in dev_row.text and publisher is None:
            publisher = links[0].text.strip()
        elif "Developer:" in dev_row.text and developer is None:
            developer = links[0].text.strip()

    
    # Extract the all review score and number of all reviews from the page
    all_reviews = soup.find_all('span', {'class': 'game_review_summary'})
    if all_reviews and len(all_reviews) > 1:
        all_review_score = all_reviews[1].text.strip()
    else:
        all_review_score = '0'
    if soup.find('meta', {'itemprop': 'reviewCount'}):
        all_review_num = soup.find('meta', {'itemprop': 'reviewCount'})['content']
    else:
        all_review_num = '0'


    # Extract the game genres from the page
    genres_tag = soup.find('span', {'data-panel': '{"flow-children":"row"}'})
    if genres_tag is not None:
        genres = genres_tag.find('a').text.strip()
    else:
        genres = ''

    # Find the div containing the game features
    features_div = soup.find('div', {'class': 'game_area_features_list_ctn'})

    # Store the feature labels in a list (empty if the page has no features block)
    features = []
    if features_div is not None:
        for feature in features_div.find_all('div', {'class': 'label'}):
            features.append(feature.text.strip())
    
    # Make all features list
    all_games_features.append(features)

    # Store the scraped fields as one row
    game_data_list.append({
        'game_id': game_id,
        'game_name': game_name,
        'game_price': game_price,
        'release_date': release_date,
        'publisher': publisher,
        'developer': developer,
        'all_review_score': all_review_score,
        'all_review_count': all_review_num,
        'genre': genres,
        'features': features
    })

# Build the dataframe once, after the loop, instead of concatenating row by row
game_data_df = pd.DataFrame(game_data_list, columns=game_data_df.columns)

# Save the dataframe to a CSV file
game_data_df.to_csv('game_data_df.csv', index=False)

# Print the number of games scraped
print(len(game_data_df))



 27%|██▋       | 2726/10000 [32:52<1:35:01,  1.28it/s]

An error occurred while requesting https://store.steampowered.com/app/221260/Little_Inferno/?snr=1_7_7_230_150_55: ('Connection aborted.', ConnectionAbortedError(10053, 'An established connection was aborted by the software in your host machine', None, 10053, None))
An error occurred while requesting https://store.steampowered.com/app/221260/Little_Inferno/?snr=1_7_7_230_150_55: HTTPSConnectionPool(host='store.steampowered.com', port=443): Max retries exceeded with url: /app/221260/Little_Inferno/?snr=1_7_7_230_150_55 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001ED8D4D7C10>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
An error occurred while requesting https://store.steampowered.com/app/221260/Little_Inferno/?snr=1_7_7_230_150_55: HTTPSConnectionPool(host='store.steampowered.com', port=443): Max retries exceeded with url: /app/221260/Little_Inferno/?snr=1_7_7_230_150_55 (Caused by NewConnectionError('<urllib3.connect

 27%|██▋       | 2727/10000 [33:07<10:14:24,  5.07s/it]

An error occurred while requesting https://store.steampowered.com/app/702670/Donut_County/?snr=1_7_7_230_150_55: HTTPSConnectionPool(host='store.steampowered.com', port=443): Max retries exceeded with url: /app/702670/Donut_County/?snr=1_7_7_230_150_55 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001ED8C3CB940>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))


100%|██████████| 10000/10000 [2:11:26<00:00,  1.27it/s] 


9940

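The log above shows several consecutive failures for the same URL, so a fixed one-second pause between attempts may be too short to ride out a dropped connection. One common alternative is exponential backoff, where the wait doubles after each failure. The sketch below is illustrative only: `retry_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable so that the demonstration runs without actually waiting.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `action` up to `attempts` times, doubling the wait after each failure.

    Returns the first successful result, or None if every attempt raised.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"Attempt {attempt + 1} failed: {error}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Demonstration with a fake action that fails twice, then succeeds.
calls = {"count": 0}
waits: List[float] = []


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


# Record the requested waits in a list instead of sleeping.
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In the scraping loop, the request could be wrapped as `retry_with_backoff(lambda: session.get(link))`, where `link` is the page URL. A non-200 response does not raise an exception, so the status-code check is still needed afterwards.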