## **Purpose**
In this notebook we add an is_series column to our VGchartz/MetaCritic dataset that says whether a video game is part of a series or not. We also create the dataset that we will use to answer the research question 'Do video game series get worse?'.

## **Datasets**
300.csv, 400.csv


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import os
from fuzzywuzzy import fuzz 



In [2]:
if not os.path.exists("../data/prep/400.csv"):
    print("Missing dataset file")
else:
    print("Success!")

Success!


In [24]:
df = pd.read_csv("../data/prep/400.csv")
df1 = pd.read_csv("../data/prep/300.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4571 entries, 0 to 4570
Data columns (total 9 columns):
name                 4571 non-null object
global_sales         4571 non-null float64
na_sales             4571 non-null float64
eu_sales             4571 non-null float64
jp_sales             4571 non-null float64
other_sales          4571 non-null float64
meta_critic_score    4571 non-null float64
meta_user_score      4571 non-null float64
release_date         4571 non-null object
dtypes: float64(7), object(2)
memory usage: 321.5+ KB


## **Adding is_series Column**

There are several ways to determine whether a game is part of a series. The following example uses the fuzzywuzzy library to compare two titles.

In [5]:
from fuzzywuzzy import fuzz 
Str1 = "Grand Theft Auto V"
Str2 = "Grand Theft Auto III"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

In [6]:
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)

89
94
84


As you can see, both titles score highly on all three measures, which suggests they are part of the same series.
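
As a rough check on where the first score comes from, the sketch below reproduces it with the standard library. It assumes that fuzz.ratio behaves like difflib's SequenceMatcher.ratio() scaled to 0-100; depending on how fuzzywuzzy is installed, its scores may differ slightly.

```python
from difflib import SequenceMatcher

a = "Grand Theft Auto V".lower()
b = "Grand Theft Auto III".lower()

# ratio() is 2 * M / T, where M is the number of matched characters
# and T is the combined length of both strings.
# The shared prefix "grand theft auto " gives M = 17, and T = 18 + 20 = 38.
score = round(100 * SequenceMatcher(None, a, b).ratio())
print(score)
```

The two titles differ only in their numeral, so almost every character is matched and the score stays high.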

Although fuzzy matching worked well, I am going to use the get_close_matches() function from difflib, as its run time is much better for this task. I am going to use the dataset we created in the previous notebook (400.csv) because I don't want, for example, Grand Theft Auto V for PS3 matching with Grand Theft Auto V for PS4. I will then join the resulting dataframe to 300.csv.

In [7]:
import difflib
def find_it(w):
    return difflib.get_close_matches(w, df.name, n=100000, cutoff=.72)
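
To give a feel for what the cutoff does, here is a minimal sketch on a handful of made-up titles. The list `titles` and the helper `close_titles` are illustrative assumptions and are not part of the notebook's data or code.

```python
import difflib

# Hypothetical sample titles, not taken from the dataset
titles = ["Grand Theft Auto V", "Grand Theft Auto III",
          "BioShock", "BioShock 2", "Halo 3"]

def close_titles(title, cutoff):
    # n=len(titles) returns every candidate scoring at or above the cutoff
    return difflib.get_close_matches(title, titles, n=len(titles), cutoff=cutoff)

# A loose cutoff keeps the sequel; a strict one keeps only the exact title
loose = close_titles("Grand Theft Auto V", cutoff=0.72)
strict = close_titles("Grand Theft Auto V", cutoff=0.95)
print(loose)
print(strict)
```

A lower cutoff pulls in more sequels but also raises the chance of unrelated titles matching, so the value of .72 is a trade-off rather than a fixed rule.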

I run the get_close_matches() function for every game name in my dataframe.

In [8]:
name_list=df.name.tolist()
l=[]
for i in name_list:
    l.append( find_it(i))

In [9]:
tem_df = pd.DataFrame()
column_values = pd.Series(name_list)
tem_df.insert(loc=0, column='name', value=column_values)

In [10]:
column_values = pd.Series(l)
tem_df.insert(loc=0, column='Matched_Name', value=column_values)

In [11]:
tem_df.insert(loc=0, column='is_series', value='yes')

In [12]:
mention_list=[]
for index, row in tem_df.iterrows():
    str1 = " ".join(str(x) for x in row['Matched_Name'])
    mention_list.append(str1)

In [22]:
tem_df['Matched_Name'] = mention_list
#temp_df.set_index('Name')
tem_df.head()

Unnamed: 0,is_series,Matched_Name,name
0,yes,.hack//G.U. Last Recode,.hack//G.U. Last Recode
1,yes,.hack//Infection Part 1,.hack//Infection Part 1
2,yes,007 Legends GT Legends,007 Legends
3,yes,007 Racing,007 Racing
4,yes,007: Quantum of Solace,007: Quantum of Solace


I remove the games that only matched with themselves, as they are not part of a series. I now have a dataframe containing only the games that are part of a series.

In [14]:
# Keep only rows whose joined match string differs from the game's own name
newdf = tem_df[tem_df['Matched_Name'] != tem_df['name']]
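
The filter works because a game that matched only itself ends up with a joined match string identical to its own name. A small sketch on hypothetical match lists (the frame `toy` and its `matches` column are assumptions for illustration) shows this, along with an equivalent check that counts matches instead of comparing strings:

```python
import pandas as pd

# Hypothetical match lists, shaped like the output of get_close_matches()
toy = pd.DataFrame({
    "name": ["BioShock", "BioShock 2", "Halo 3"],
    "matches": [["BioShock", "BioShock 2"],
                ["BioShock 2", "BioShock"],
                ["Halo 3"]],
})

# Same idea as the notebook: join the matches and compare with the name
toy["Matched_Name"] = toy["matches"].apply(" ".join)
in_series = toy[toy["Matched_Name"] != toy["name"]]

# Alternative check: a game is in a series if it has more than one match
by_count = toy[toy["matches"].apply(len) > 1]

print(in_series["name"].tolist())
```

Counting matches avoids relying on string equality, which may be a little easier to reason about, but both approaches select the same rows here.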

I then join it back to the full dataset loaded from 300.csv.

In [15]:
final_df=df1.merge(newdf, on=['name'], how='left')

Finally, I change the is_series value of the games that aren't part of a series from NaN to 'no'.

In [16]:
final_df['is_series'] = final_df['is_series'].replace(np.nan,'no')
final_df=final_df.set_index('name')
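
The left join followed by the NaN replacement can be sketched on toy data as follows. The frames `all_games` and `matched` are hypothetical stand-ins for the 300.csv dataframe and the filtered match table.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (one row per game and platform)
all_games = pd.DataFrame({
    "name": ["BioShock", "BioShock 2", "Halo 3"],
    "platform": ["PS3", "PS3", "X360"],
})

# Hypothetical stand-in for the table of games that matched another title
matched = pd.DataFrame({
    "name": ["BioShock", "BioShock 2"],
    "is_series": ["yes", "yes"],
})

# A left join keeps every game; unmatched games get NaN in is_series
flagged = all_games.merge(matched, on="name", how="left")
flagged["is_series"] = flagged["is_series"].fillna("no")
print(flagged)
```

Using how='left' matters here: an inner join would silently drop every game that is not part of a series.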

In [17]:
final_df.sample()

name,developer,rank,platform,release_date,publisher,na_sales,eu_sales,jp_sales,other_sales,global_sales,...,meta_critic_count,meta_user_score,meta_user_count,meta_esrb,meta_genre,meta_multiplayer,meta_full_url,release_year,is_series,Matched_Name
BioShock 2,2K Marin,228,PS3,2010-02-09,2K Games,850000,460000,20000,220000,1550000.0,...,62.0,8.2,585,M,Action,yes,https://www.metacritic.com/game/playstation-3/...,2010,yes,BioShock 2 BioShock


In [18]:
final_df['Matched_Name']=final_df['Matched_Name'].replace(np.nan,'none')

In [19]:
print(final_df.shape)

(7374, 25)


In [20]:
final_df.to_csv("../data/prep/500.csv",sep=",",encoding='utf-8')

## **Creating dataset for research question 3**

I am also interested in comparing different series and seeing whether video game series get worse as they go on. This time I am going to use fuzzy matching.

I have created a list of video game series. I picked these series at random in order to avoid bias, and I feel they represent a large number of genres, developers, etc.

In [76]:
series_list=['grand theft auto', 'assassin creed', 'fifa','red dead redemption','metal gear solid','resident evil','star wars','angry birds',
             'tomb raider', 'spider-man','hitman', 'dishonored','far cry','just cause','saints row','harry potter','tom clancys',
             'wwe 2k','ape escape','arcana heart','army men','asphalt','sims','mafia','pac-man','dragon ball z','madden nfl','nba 2k','darksiders'
             ,'dead or alive','dead space','disgaea','fallout','way of the samurai','transformers','tenchu','teenage mutant ninja turtles',
             'ninja gaiden','atv offroad fury','all star baseball','armored core','battlefield','bioshock','breath of fire','burnout','call of duty',
             'crazy taxi','dance revolution','dragon age','donkey kong','f1','farming simulator','fight night','final fantasy','Football Manager',
             'forza','god eater','grandia','guilt gear','guitar hero','halo','motogp','nhl 2k','naruto','need for speed','pro evolution soccer',]

I will iteratively create a dataframe for each series containing all the games that are in that series.
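
To make the matching rule concrete, here is a simplified stand-in for a partial-ratio score built on the standard library. The function `sliding_partial_ratio` is an assumption for illustration: it slides the shorter string across the longer one and keeps the best score, which approximates the idea behind fuzzywuzzy's partial_ratio but is not its actual implementation.

```python
from difflib import SequenceMatcher

def sliding_partial_ratio(first, second):
    """Best 0-100 similarity between the shorter string and any
    equally long window of the longer string."""
    short, long_ = sorted((first, second), key=len)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)

# Hypothetical sample titles
titles = ["Grand Theft Auto V", "Grand Theft Auto: San Andreas", "Halo 3"]
gta = [t for t in titles
       if sliding_partial_ratio("grand theft auto", t.lower()) > 90]
print(gta)
```

Because the score only asks whether the series name appears somewhere inside the title, very short series names may also match unrelated titles that happen to contain the same characters, so the resulting groups are worth spot-checking.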

In [77]:
d = {}
for name in series_list:
    # Score every game name against the series name (both lowercased)
    scores = df['name'].apply(lambda game: fuzz.partial_ratio(game.lower(), name.lower()))
    # .copy() avoids pandas' SettingWithCopyWarning when the new column is added
    d[name] = df[scores > 90].copy()
    d[name]['series_name'] = name


In [78]:
d['grand theft auto']

Unnamed: 0,name,global_sales,na_sales,eu_sales,jp_sales,other_sales,meta_critic_score,meta_user_score,release_date,series_name
1567,Grand Theft Auto,240000.0,170000.0,60000.0,0.0,0.0,68.0,7.1,2004-10-26,grand theft auto
1568,Grand Theft Auto 2,3420000.0,1130000.0,2070000.0,0.0,220000.0,70.0,7.9,1999-10-25,grand theft auto
1569,Grand Theft Auto III,13110000.0,6990000.0,4520000.0,300000.0,1300000.0,95.0,8.4,2001-10-23,grand theft auto
1570,Grand Theft Auto IV,22530000.0,11600000.0,7640000.0,580000.0,2720000.0,95.3,7.4,2008-04-29,grand theft auto
1571,Grand Theft Auto Online,10000.0,0.0,10000.0,0.0,0.0,83.0,5.9,2013-10-01,grand theft auto
1572,Grand Theft Auto V,64290000.0,26190000.0,28140000.0,1660000.0,8320000.0,97.0,8.2,2013-09-17,grand theft auto
1573,Grand Theft Auto: Chinatown Wars,2410000.0,860000.0,1060000.0,80000.0,410000.0,91.5,7.8,2009-03-17,grand theft auto
1574,Grand Theft Auto: Episodes from Liberty City,90000.0,0.0,80000.0,0.0,10000.0,63.0,8.2,2010-04-13,grand theft auto
1575,Grand Theft Auto: Liberty City Stories,11260000.0,4460000.0,4230000.0,310000.0,2240000.0,83.0,7.8,2005-10-25,grand theft auto
1576,Grand Theft Auto: San Andreas,2920000.0,1260000.0,1540000.0,0.0,130000.0,93.0,8.8,2005-06-07,grand theft auto


That seemed to work well.

I now have a dataframe for each series. To simplify things I will join these dataframes together.

In [79]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames in one call
series_df = pd.concat([d[name] for name in series_list])

In [80]:
print('Number of different games:',len(series_df))
print('Number of different series:',len(series_df['series_name'].unique()))

Number of different games: 709
Number of different series: 64


This dataframe contains over 700 games from over 60 series. This dataset should be sufficient to answer research question 3, 'Do video game series get worse over time?'.

In [81]:
series_df['release_date']= pd.to_datetime(series_df['release_date'])
series_df['release_year'] = series_df['release_date'].dt.year

In [82]:
len_df= series_df.groupby('series_name').max()['release_year']- series_df.groupby('series_name').min()['release_year']
len_df= pd.DataFrame(len_df)
len_df.columns = ['length']
len_df=len_df.reset_index()
series_df=series_df.merge(len_df, left_on='series_name', right_on='series_name')
series_df.head()

Unnamed: 0,name,global_sales,na_sales,eu_sales,jp_sales,other_sales,meta_critic_score,meta_user_score,release_date,series_name,release_year,length
0,Grand Theft Auto,240000.0,170000.0,60000.0,0.0,0.0,68.0,7.1,2004-10-26,grand theft auto,2004,14
1,Grand Theft Auto 2,3420000.0,1130000.0,2070000.0,0.0,220000.0,70.0,7.9,1999-10-25,grand theft auto,1999,14
2,Grand Theft Auto III,13110000.0,6990000.0,4520000.0,300000.0,1300000.0,95.0,8.4,2001-10-23,grand theft auto,2001,14
3,Grand Theft Auto IV,22530000.0,11600000.0,7640000.0,580000.0,2720000.0,95.3,7.4,2008-04-29,grand theft auto,2008,14
4,Grand Theft Auto Online,10000.0,0.0,10000.0,0.0,0.0,83.0,5.9,2013-10-01,grand theft auto,2013,14
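
The length column is the number of years between the first and last release in each series. An equivalent way to compute it without a separate merge is sketched below on toy data; the frame `toy` and its release years are hypothetical values for illustration.

```python
import pandas as pd

# Hypothetical release years for two series
toy = pd.DataFrame({
    "series_name": ["grand theft auto"] * 3 + ["bioshock"] * 2,
    "release_year": [1999, 2008, 2013, 2007, 2010],
})

# transform() returns one value per row, so the result aligns with toy directly
years = toy.groupby("series_name")["release_year"]
toy["length"] = years.transform("max") - years.transform("min")
print(toy)
```

Selecting the release_year column before aggregating also avoids taking the max and min of every other column in the dataframe.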


In [83]:
series_df=series_df.set_index('series_name')
series_df.to_csv('../data/prep/500-game_series.csv')