## **Purpose**
 In this notebook we will scrape video game data from the MetaCritic website (https://www.metacritic.com/)
* We will collect the following information:
    * Game name
    * Developer name
    * Critic score
    * Number of critic that rated the game
    * User score
    * Number of users that rated the game
    * The games ESRB rating
    * Whether the game in multiplayer of not
    * Genre of the game



In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import csv
import os

Load in csv file we created in the notebook ("100-Collecting_VGChartz.ipynb")

In [3]:
vg_df= pd.read_csv("../data/prep/100-Collecting_VGChartz.csv", low_memory = False)

In [4]:
meta_full_url_list=vg_df.meta_url.tolist()

To prevent scraping data that we already collected I will read in the csv file. I will check it the metaCritic url is already in the csv file. If that url is already in the csv file then it means that we have already collected the data associated with that url and there is no need to collect it again. If the url is not in the csv file then it will be passed into the scraper and we will collect the data for that url.

Check if the csv file already exists

In [5]:
if not os.path.exists("../data/raw/Original_MetaCritic.csv"):
    print("Missing dataset file")
    exisit=False
else:
    print("Success!")
    exists=True

Success!


If the file doesnt already exist then create it.

In [25]:
if exists==False:
    with open('../data/raw/Original_MetaCritic.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['','meta_game_name','meta_developer','meta_critic_score','meta_critic_count','meta_user_score','meta_user_count','meta_esrb','meta_genre','meta_multiplayer','meta_full_url'])

This function checks if a value (url) is in a dataframe. I will convert the csv file to a dataframe and this function will check if a certain url is in that dataframe hence, checking if the url is in the csv file

In [6]:
df= pd.read_csv("../data/raw/Original_MetaCritic.csv", low_memory = False)
obtained_meta_full_url_list=df.meta_full_url.tolist()
print('We have already collected: ' + str(len(obtained_meta_full_url_list)))

We have already collected: 8115


In [7]:
get_meta_full_url_list= [x for x in meta_full_url_list if x not in obtained_meta_full_url_list]

In [8]:
len(get_meta_full_url_list)

10217

In [19]:
get_meta_full_url_list[:1]

['https://www.metacritic.com/game/playstation-3/grand-theft-auto-v']

I will now scrape the necessary data from MetaCritic.

In [25]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
meta_game_name=[]
meta_developer=[]
meta_user_score=[]
meta_user_count=[]
meta_critic_score=[]
meta_critic_count=[]
meta_esrb=[]
meta_full_url=[]
meta_genre=[]
meta_error_url=[]
meta_multiplayer=[]
data = []
count=0
fails=0
for url in get_meta_full_url_list:
    print(url)
    r = requests.get(url, headers=headers)
    r = r.text
    soup = BeautifulSoup(r,'html.parser')
    multiplayer_data='no'
    print(fails)
    try:
        title=soup.find('div',class_='product_title')
        title_a_tag=title.find('a')
        title_data=title_a_tag.text.strip()
      
        user_score_div_tag=soup.find('div',class_="userscore_wrap feature_userscore")
        user_score=user_score_div_tag.find('a')
        user_score_data=user_score.text.strip()
    
        
        user_count_div_tag=soup.find('div',class_="userscore_wrap feature_userscore")
        user_count_d_tag= user_count_div_tag.find('div',class_='summary')
        user_count=user_count_d_tag.find('a')
        user_count_data=user_count.text.replace('Ratings','')
        user_count_data= user_count_data.strip()
        
        try:
            developer_li=soup.find("li",class_="summary_detail developer")
            developer=developer_li.find('span',class_='data')
            developer_data=developer.text.strip()
            
        except:
            developer_data=np.nan
            
        try:
            rating_li=soup.find("li",class_="summary_detail product_rating")
            rating=rating_li.find('span',class_='data')
            rating_data=rating.text.strip()
            
        except:
            rating_data=np.nan
                   
        try:
            genre_li=soup.find('li',class_="summary_detail product_genre")
            genre=genre_li.find('span',class_='data')
            genre_data=genre.text.strip()
           
        except:
            genre_data=np.nan
       
        try:
            player_li=soup.find('li',class_="summary_detail product_players")
            number=player_li.find('span',class_='data')        
            if ('1 Player' not in number)&('No Online Multiplayer' not in number):
                multiplayer_data='yes'
        except:
            multiplayer_data=np.nan
            
        try:
            critic_score_a_tag = soup.find('a',class_="metascore_anchor")
            critic_score=critic_score_a_tag.find('span')
            critic_score_data=critic_score.text.strip()
        except:
            crtic_score_data=np.nan
            
        try:
            critic_count_a_tag = soup.find('a',class_="metascore_anchor")
            critic_count_div_tag=soup.find('div',class_="summary")
            critic_count=critic_count_div_tag.find('a')
            critic_count_data=critic_count.text.replace('Critic Reviews','')
            critic_count_data= critic_count_data.strip()
        except:
            critic_count_data=np.nan
        
            
        
        meta_game_name.append(title_data)
        meta_developer.append(developer_data)
        meta_critic_score.append(critic_score_data)
        meta_critic_count.append(critic_count_data)
        meta_user_score.append(user_score_data)
        meta_user_count.append(user_count_data)
        meta_esrb.append(rating_data)
        meta_genre.append(genre_data)
        meta_full_url.append(url)
        meta_multiplayer.append(multiplayer_data)
        get_meta_full_url_list.remove(url)
        count+=1
    except:
        get_meta_full_url_list.remove(url)
        fails+=1

print('metacritic count = '+str(count))
print('metacricic fails = '+str(fails))

meta_columns ={'meta_game_name':meta_game_name,
          'meta_developer':meta_developer,
          'meta_critic_score':meta_critic_score,
          'meta_critic_count':meta_critic_count,
          'meta_user_score':meta_user_score,
          'meta_user_count':meta_user_count,
          'meta_esrb':meta_esrb,
          'meta_genre':meta_genre,
          'meta_multiplayer':meta_multiplayer,
          'meta_full_url':meta_full_url}


meta_df = pd.DataFrame(meta_columns)
meta_df = meta_df[['meta_game_name','meta_developer','meta_critic_score','meta_critic_count','meta_user_score','meta_user_count','meta_esrb','meta_genre','meta_multiplayer','meta_full_url']]
del meta_df.index.name
meta_df.to_csv("../data/raw/Original_MetaCritic.csv",sep=",",encoding='utf-8', mode='a', header=False)

https://www.metacritic.com/game/xbox-one/middle-earth-shadow-of-mordor
0
https://www.metacritic.com/game/playstation-2/yu-gi-oh-the-duelists-of-the-roses
0
https://www.metacritic.com/game/gamecube/lego-star-wars-the-video-game
1
https://www.metacritic.com/game/psp/lego-star-wars-ii-the-original-trilogy
2
https://www.metacritic.com/game/xbox/spider-man-the-movie
2
https://www.metacritic.com/game/wii/toy-story-3-the-video-game
2
https://www.metacritic.com/game/wii/hannah-montana-spotlight-world-tour
3
https://www.metacritic.com/game/ds/bakugan-battle-brawlers
3
https://www.metacritic.com/game/xbox/true-crime-streets-of-la
3
https://www.metacritic.com/game/playstation-3/dead-rising-2
3
https://www.metacritic.com/game/wii/metroid-other-m
3
https://www.metacritic.com/game/xbox-one/watch-dogs-2
3
https://www.metacritic.com/game/xbox-360/devil-may-cry-4
3
https://www.metacritic.com/game/playstation-3/saints-row-iv
3
https://www.metacritic.com/game/xbox-360/alan-wake
3
https://www.metacritic.c