<a href="https://colab.research.google.com/github/kallenhager/metacritic_scraper/blob/main/analysis/analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis

I scraped this data intending to answer the following questions for PC games:

* **Do Metacritic scores correlate with sales\*?**
    * Should a videogame publisher/developer pay attention to critic reviews?
<BR>
* **Which critics' scores correlate best with sales\*?**
    * Specifically, which critics should a videogame publisher/developer pay attention to?


*Note: I don't actually have sales data. Videogame sales figures are not publicly available. The total number of reviews a game has on Steam is generally accepted as the best publicly available proxy for PC game sales (https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales).

### Correlation Matrix (Python)

Here I produce a Spearman correlation matrix using pandas's `DataFrame.corr` method. `corr` excludes null values pairwise, which is great for my data set because each critic reviewed only some of the games.
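
To make the null handling concrete, here is a minimal sketch on a made-up three-column frame (the `critic_a`/`critic_b` names and all values are invented for illustration, not taken from the scraped data). Because missing values are dropped pair by pair, each cell of the matrix can rest on a different number of games, which is why the observation count has to be tracked separately.

```python
import numpy as np
import pandas as pd

# Toy data: 'critic_b' is missing two of the six games (values are invented).
toy = pd.DataFrame({
    'total_reviews': [10, 20, 30, 40, 50, 60],
    'critic_a':      [55, 60, 70, 65, 80, 90],
    'critic_b':      [np.nan, 70, 60, np.nan, 85, 90],
})

corr = toy.corr(method='spearman')      # NaNs are excluded pairwise

# Number of games each pairwise correlation is actually based on.
present = toy.notna().astype(int)
pair_counts = present.T @ present

print(corr.round(3))
print(pair_counts)
```

In this toy frame the `total_reviews`/`critic_b` cell is computed from four games, while the `total_reviews`/`critic_a` cell uses all six.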

In [None]:
import pandas as pd
import plotnine
from scipy.stats import spearmanr

df = pd.read_csv('Metacritic_Steam_2021.csv', na_values=['tbd']) # user_score has some 'tbd's in it
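df_corr = df.select_dtypes(exclude='object').corr(method='spearman') # numeric columns only; newer pandas versions raise on string columns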
df_corr = df_corr.sort_values(by=['total_reviews'], ascending=False) # sort by correlation to total_reviews
display(df_corr.head(n=7))
display(df_corr.tail(n=15))

That gives a correlation matrix, but it isn't the form I need, because I'm going to take the correlations into R and feed them to the cocor package to test for significant differences. The `zou2007` test in cocor needs each correlation and its observation count (n). I also want p-values, to see whether these correlations are even worth comparing. SciPy's `spearmanr` function returns both the correlation and its p-value.

In [None]:
spearmanr_dict = {'variable':[], 'correlation':[], 'pvalue':[], 'n':[]} # create dictionary, which later we'll convert to Dataframe. 
for column_name in df.select_dtypes(exclude='object'):                  # numeric columns only; 'object' columns contain strings
    if column_name == 'total_reviews':                                  # skip the total_reviews vs. total_reviews correlation
        continue

    series = df[column_name]                                            # df[...] returns a pandas Series: the values plus the column name and dtype
    has_review = series.notna()                                         # total_reviews has 276 values, but each critic reviewed only a handful of games
    trimmed_total_reviews = df.loc[has_review, 'total_reviews'].tolist() # spearmanr needs inputs of the same length,
    trimmed_each_column = series[has_review].tolist()                   # so keep only the games this column has a value for
    
    correlation, pvalue = spearmanr(trimmed_total_reviews, trimmed_each_column)
    if pd.isna(correlation):                                            # don't add correlations that can't be computed (which creates NaN) due to small n
        continue

    spearmanr_dict['variable'].append(series.name)
    spearmanr_dict['correlation'].append(correlation)
    spearmanr_dict['pvalue'].append(pvalue)
    spearmanr_dict['n'].append(len(trimmed_total_reviews))

df_spearmanr = pd.DataFrame(spearmanr_dict)
df_spearmanr



Unnamed: 0,variable,correlation,pvalue,n
0,appid,-0.161779,7.076097e-03,276
1,pos_reviews,0.994981,5.778750e-276,276
2,neg_reviews,0.913155,8.504820e-109,276
3,rank,-0.386549,2.869959e-11,276
4,meta_score,0.386612,2.847250e-11,276
...,...,...,...,...
152,WellPlayed,0.416596,1.973749e-02,31
153,Windows Central,0.252563,2.037356e-01,27
154,Worth Playing,0.424994,3.238901e-03,46
155,XGN,0.492805,3.206326e-01,6
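
The R cells further down read a file named `df_spearmanr.csv`, and the leading `Unnamed: 0` column in the table above is what pandas typically produces when a CSV was saved together with its index and then read back. A minimal sketch of that export step, assuming a temporary local path (the path is a placeholder, not the notebook's actual Drive location; the two rows are copied from the table above):

```python
import os
import tempfile
import pandas as pd

# Two rows standing in for df_spearmanr.
demo = pd.DataFrame({
    'variable': ['meta_score', 'XGN'],
    'correlation': [0.386612, 0.492805],
    'pvalue': [2.847250e-11, 3.206326e-01],
    'n': [276, 6],
})

path = os.path.join(tempfile.mkdtemp(), 'df_spearmanr.csv')  # hypothetical location

demo.to_csv(path)                    # the index is written as an unnamed first column
with_index = pd.read_csv(path)

demo.to_csv(path, index=False)       # index=False leaves it out
without_index = pd.read_csv(path)

print(list(with_index.columns))
print(list(without_index.columns))
```

Writing with `index=False` keeps the file down to the four columns the R code actually uses.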


### Google Colab (R)

The next step needs the `cocor` package from R, so I'm moving to Google Colab, which makes it easy to run R and Python in the same notebook.

##### Import, Clean, Preliminary Analysis

In [6]:
from google.colab import drive
drive.mount('/content/drive')
# load this once to run R, then put '%%R' by itself at the beginning of each R block
%load_ext rpy2.ipython 

Mounted at /content/drive


In [None]:
%%R
install.packages('cocor')
library(tidyverse)
library(cocor)
library(knitr)

In [39]:
%%R
df <- read.csv(file = 'drive/MyDrive/Colab Notebooks/metacritic_scraper/df_spearmanr.csv')
df <- filter(df, pvalue < .05) %>%                                              # only significant values, sort by correlation
  arrange(-correlation)

df <- df[-c(2, 3, 45, 46),]                                                     # remove irrelevant rows: pos_reviews, neg_reviews, rank, and appid
rownames(df) <- NULL                                                            # reset index(row names)
df

            variable correlation       pvalue   n
1                EGM   1.0000000 0.000000e+00   3
2    Washington Post   0.8196886 4.583714e-02   6
3         BaziCenter   0.6585395 3.840358e-02  10
4          Game Rant   0.6442759 2.907041e-03  19
5           Player 2   0.6085100 7.576676e-04  27
6      Hooked Gamers   0.5408697 1.154574e-03  33
7        Meristation   0.5337783 9.428026e-06  61
8        Finger Guns   0.5105695 2.143100e-02  20
9     Hardcore Gamer   0.5059294 5.951102e-05  57
10          Gamersky   0.4977392 2.553432e-02  20
11        IGN Italia   0.4953671 2.349815e-05  66
12     Jeuxvideo.com   0.4678987 4.846886e-08 123
13       GameCritics   0.4570950 8.911441e-05  68
14         Millenium   0.4566809 1.103523e-03  48
15       Everyeye.it   0.4443979 9.652410e-04  52
16       Riot Pixels   0.4425910 1.288956e-03  50
17         CD-Action   0.4309273 2.608244e-06 110
18     Worth Playing   0.4249944 3.238901e-03  46
19        WellPlayed   0.4165956 1.973749e-02  31


Findings so far: 
* the correlation between **user_score** and total_reviews was not significant at the p < .05 threshold
* **Electronic Gaming Monthly (EGM)** reviews have a perfect correlation of 1, but that throws an error in cocor and the n was only 3, so EGM is excluded from the final analysis. Still noteworthy.
* **Game Revolution** reviews have a significant negative correlation with total_reviews. As a publisher, do the opposite of what they suggest.
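
The EGM result is easier to judge with a quick enumeration. A minimal sketch using plain ranks rather than the scraped scores: with only three observations there are 3! = 6 possible rank orderings, and exactly one of them gives a Spearman correlation of 1.

```python
from itertools import permutations

def spearman_rho(x_ranks, y_ranks):
    """Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    d_squared = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x_ranks = (1, 2, 3)                           # ranks of total_reviews for 3 games
orderings = list(permutations(x_ranks))       # every possible ranking by the critic
rhos = [spearman_rho(x_ranks, y) for y in orderings]

n_perfect = sum(1 for rho in rhos if rho == 1)
print(f"{n_perfect} of {len(orderings)} orderings give rho = 1")
```

So a perfect correlation turns up by chance about one time in six when n = 3, which is a further reason to read the EGM row with caution.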

In [42]:
%%R
df <- filter(df, correlation < 1)                                               # drop the perfect correlation (EGM, n = 3), which throws an error in cocor
rownames(df) <- NULL                                                            # reset index(row names)
df

            variable correlation       pvalue   n
1    Washington Post   0.8196886 4.583714e-02   6
2         BaziCenter   0.6585395 3.840358e-02  10
3          Game Rant   0.6442759 2.907041e-03  19
4           Player 2   0.6085100 7.576676e-04  27
5      Hooked Gamers   0.5408697 1.154574e-03  33
6        Meristation   0.5337783 9.428026e-06  61
7        Finger Guns   0.5105695 2.143100e-02  20
8     Hardcore Gamer   0.5059294 5.951102e-05  57
9           Gamersky   0.4977392 2.553432e-02  20
10        IGN Italia   0.4953671 2.349815e-05  66
11     Jeuxvideo.com   0.4678987 4.846886e-08 123
12       GameCritics   0.4570950 8.911441e-05  68
13         Millenium   0.4566809 1.103523e-03  48
14       Everyeye.it   0.4443979 9.652410e-04  52
15       Riot Pixels   0.4425910 1.288956e-03  50
16         CD-Action   0.4309273 2.608244e-06 110
17     Worth Playing   0.4249944 3.238901e-03  46
18        WellPlayed   0.4165956 1.973749e-02  31
19      Impulsegamer   0.4137837 1.211908e-02  36


#### cocor

Now use `cocor` to compare each correlation with every other correlation to determine whether they are significantly different. The significant differences need to be stored in a way that is conducive to later visualization (*I'm not sure it's even possible to usefully visualize this statically*; https://www.researchgate.net/post/How_to_elegantly_show_multiple_significant_differences_between_groups_on_a_bar_graph; *seems like I'm going to have to learn Shiny or Dash next*).
<br>
<br>
For `cocor` I'm using:
  * independent groups: each critic reviewed only a handful of games, so there is little overlap between the sets of games
  * `zou2007`: a relatively recent, robust method for comparing correlations
  * confidence level = .90: a .95 confidence level wasn't giving me enough significant differences, and a false positive isn't much of a concern with the comparisons I'm making
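
For readers who want to see what the `zou2007` option computes, here is a minimal Python sketch of Zou's (2007) confidence interval for the difference between two independent correlations. It is an illustration written for this note, not cocor's own code, and the function name `zou2007_ci` is made up. Each correlation first gets a Fisher z-transform interval, and the two intervals are then combined into limits for `r1 - r2`.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def zou2007_ci(r1, n1, r2, n2, conf_level=0.90):
    """Confidence interval for r1 - r2 (independent groups), after Zou (2007)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)    # two-sided critical value

    def fisher_ci(r, n):
        # Fisher z-transform interval for a single correlation
        centre, half_width = atanh(r), z / sqrt(n - 3)
        return tanh(centre - half_width), tanh(centre + half_width)

    l1, u1 = fisher_ci(r1, n1)
    l2, u2 = fisher_ci(r2, n2)
    lower = (r1 - r2) - sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = (r1 - r2) + sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper

# Player 2 (r = 0.6085100, n = 27) vs. meta_score (r = 0.3866119, n = 276)
lower, upper = zou2007_ci(0.6085100, 27, 0.3866119, 276)
significant = lower > 0 or upper < 0      # True only if the interval excludes 0
print(round(lower, 4), round(upper, 4), significant)
```

The Player 2 vs. meta_score pair is the one cocor is run on later in this notebook, where it reports a 90% interval of -0.0446 to 0.4135, so the sketch can be checked against that output.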

In [31]:
%%R
df$correlation[]

 [1] 1.0000000 0.8196886 0.6585395 0.6442759 0.6085100 0.5408697 0.5337783
 [8] 0.5105695 0.5059294 0.4977392 0.4953671 0.4678987 0.4570950 0.4566809
[15] 0.4443979 0.4425910 0.4309273 0.4249944 0.4165956 0.4137837 0.4131182
[22] 0.4050172 0.4022146 0.3896962 0.3866119 0.3864614 0.3660718 0.3476699
[29] 0.3470870 0.3448234 0.3423294 0.3348343 0.3219316 0.3123230 0.3072509
[36] 0.2918734 0.2758500 0.2688698 0.2625260 0.2569171 0.2476513 0.2475964


In [None]:
%%R
#this one is working best

i  <- 1
for (variable1 in df$variable){  
  ii <- 1
  for (variable2 in df$variable){
    result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[ii], n1=df$n[i], n2=df$n[ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)
    print(paste(df$variable[i], 'vs.', df$variable[ii]))
    #print(paste(i, 'vs.', ii))
    #print(result)
    result <- get.cocor.results(result)
    #print(result$zou2007$conf.int)

    ci <- result$zou2007$conf.int
    if ((ci[1] > 0 && ci[2] > 0) || (ci[1] < 0 && ci[2] < 0)) {                   # interval excludes 0: the two correlations differ significantly
      print('*')
    }
  ii <- ii + 1
  #print(ii)
  }
  i <- i + 1
  #print(i)
}

In [None]:
%%R

sig_diff <- c()                                                                 # one entry per variable: the variables it differs from significantly
n_vars <- nrow(df)
for (i in seq_len(n_vars)) {
  sig <- c()                                                                    # reset for each variable
  for (ii in seq_len(n_vars)) {
    if (i == ii) next                                                           # skip comparing a correlation with itself
    result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[ii], n1=df$n[i], n2=df$n[ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)
    ci <- get.cocor.results(result)$zou2007$conf.int
    if ((ci[1] > 0 && ci[2] > 0) || (ci[1] < 0 && ci[2] < 0)) {                 # interval excludes 0
      sig <- append(sig, as.character(df$variable[ii]))
    }
  }
  sig_diff <- append(sig_diff, paste(sig, collapse = ", "))
}
names(sig_diff) <- df$variable                                                  # label each entry with its variable
print(sig_diff)
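
One way to store the pairwise results for later visualization is a symmetric True/False table. With 42 variables, a nested loop makes 42 × 42 = 1,764 comparisons, but there are only 42 × 41 / 2 = 861 distinct pairs. The sketch below visits each pair once. The helper `pairwise_flags` and the threshold rule `placeholder_differs` are hypothetical names. The threshold rule only stands in for the real cocor decision and is not a statistical test.

```python
from itertools import combinations
import pandas as pd

def pairwise_flags(names, differs):
    """Symmetric True/False table; `differs(a, b)` is called once per unordered pair."""
    table = pd.DataFrame(False, index=names, columns=names)
    for a, b in combinations(names, 2):
        flag = bool(differs(a, b))
        table.loc[a, b] = flag
        table.loc[b, a] = flag
    return table

# Three correlations copied from the filtered table earlier in this notebook.
correlations = {'Washington Post': 0.8196886, 'Game Rant': 0.6442759, 'WellPlayed': 0.4165956}

def placeholder_differs(a, b, gap=0.3):
    # Placeholder rule (assumption): flag pairs whose correlations are far apart.
    return abs(correlations[a] - correlations[b]) > gap

flags = pairwise_flags(list(correlations), placeholder_differs)
print(flags)
print(flags.sum(axis=1))      # how many other variables each one differs from
```

A table like this can be passed straight to a heatmap, which may be easier to read than significance brackets on a bar chart.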

In [None]:
%%R

i  <- 4
ii <- 20
result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[i+ii], n1=df$n[i], n2=df$n[i+ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)

print(result)
result <- get.cocor.results(result)
print(result$zou2007$conf.int)

ci <- result$zou2007$conf.int
if ((ci[1] > 0 && ci[2] > 0) || (ci[1] < 0 && ci[2] < 0)) {                       # interval excludes 0
  print('significant difference')
}


  Results of a comparison of two correlations based on independent groups

Comparison between r1.jk = 0.6085 and r2.hm = 0.3866
Difference: r1.jk - r2.hm = 0.2219
Group sizes: n1 = 27, n2 = 276
Null hypothesis: r1.jk is equal to r2.hm
Alternative hypothesis: r1.jk is not equal to r2.hm (two-sided)
Alpha: 0.05

zou2007: Zou's (2007) confidence interval
  90% confidence interval for r1.jk - r2.hm: -0.0446 0.4135
  Null hypothesis retained (Interval includes 0)

[1] -0.04461771  0.41346895


In [None]:
%%R

sig_diff <- c()
for (variable in df$variable){
    
    sig <- append(sig, variable)
    sig_string <-paste(sig, collapse = ", ")
    sig_diff <- append(sig_diff, sig_string)
}
print(sig_diff)

 [1] "Washington Post, BaziCenter, Game Rant, Player 2, Hooked Gamers, Meristation, Finger Guns, Hardcore Gamer, Gamersky, IGN Italia, Jeuxvideo.com, GameCritics, Millenium, Everyeye.it, Riot Pixels, CD-Action, Worth Playing, WellPlayed, Impulsegamer, IGN, Ragequit.gr, Wccftech, Gamer Escape, meta_score, COGconnected, Softpedia, New Game Network, SpazioGames, TheGamer, PC Games, Multiplayer.it, GamingTrend, GameStar, PC Invasion, Hey Poor Player, Shacknews, TheSixthAxis, The Games Machine, Noisy Pixel, Eurogamer Italy, Checkpoint Gaming, Game Revolution, Washington Post, BaziCenter, Game Rant, Player 2, Hooked Gamers, Meristation, Finger Guns, Hardcore Gamer, Gamersky, IGN Italia, Jeuxvideo.com, GameCritics, Millenium, Everyeye.it, Riot Pixels, CD-Action, Worth Playing, WellPlayed, Impulsegamer, IGN, Ragequit.gr, Wccftech, Gamer Escape, meta_score, COGconnected, Softpedia, New Game Network, SpazioGames, TheGamer, PC Games, Multiplayer.it, GamingTrend, GameStar, PC Invasion, Hey Poor Pl

In [None]:
%%R
i <- 1

variable_v <- c()
for (variable in df$variable){
    variable_v <- append(variable_v, variable)
}
dfv <- data.frame(variable_v) %>%
print()

          variable_v
1    Washington Post
2         BaziCenter
3          Game Rant
4           Player 2
5      Hooked Gamers
6        Meristation
7        Finger Guns
8     Hardcore Gamer
9           Gamersky
10        IGN Italia
11     Jeuxvideo.com
12       GameCritics
13         Millenium
14       Everyeye.it
15       Riot Pixels
16         CD-Action
17     Worth Playing
18        WellPlayed
19      Impulsegamer
20               IGN
21       Ragequit.gr
22          Wccftech
23      Gamer Escape
24        meta_score
25      COGconnected
26         Softpedia
27  New Game Network
28       SpazioGames
29          TheGamer
30          PC Games
31    Multiplayer.it
32       GamingTrend
33          GameStar
34       PC Invasion
35   Hey Poor Player
36         Shacknews
37      TheSixthAxis
38 The Games Machine
39       Noisy Pixel
40   Eurogamer Italy
41 Checkpoint Gaming
42   Game Revolution


In [None]:
%%R

sig <- c()
i <- 1

for (variable in df$variable){
    sig <- append(sig, variable)
}


df2 <- data.frame(sig) %>%
rename(!!df$variable[1]:= sig) %>%
print()

In [None]:
%%R

tibble_row(
  x = 1,                                                                        # placeholder value; replace with the real column values
  .name_repair = "check_unique"
)

##### Debugging Tools

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df.dtypes)