<a href="https://colab.research.google.com/github/kallenhager/metacritic_scraper/blob/main/analysis/analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis

I scraped this data intending to answer the following questions for PC games:

* **Do Metacritic scores correlate with sales\*?**
    * Should a videogame publisher/developer pay attention to critic reviews?
<BR>
* **Which critics' scores correlate best with sales\*?**
    * Specifically, which critics should a videogame publisher/developer pay attention to?


*Note: I don't actually have sales data. Videogame sales figures are not publicly available. The total number of reviews a game has on Steam is generally accepted as the best publicly available proxy for PC game sales (https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales).

R will be needed later, so the analysis will be done in Google Colab.

### Correlation Matrix (Python)

Here I produce a Spearman correlation matrix using pandas's `DataFrame.corr` method. It excludes null values pairwise, which suits this data set because each critic reviewed only some of the games.
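To make "excludes null values pairwise" concrete, here is a minimal plain-Python sketch of the idea. It is not pandas's implementation: the helper names are hypothetical, `None` stands in for NaN, and the numbers are made up. It keeps only the games where both values are present, ranks each side (ties get the average rank), and takes the Pearson correlation of the ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1          # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either side has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)

def spearman_pairwise(x, y):
    """Spearman's rho and n over the rows where both x and y are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return math.nan, len(pairs)
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys)), len(pairs)

# Made-up example: five games, one critic who scored only three of them.
total_reviews = [1200, 350, 9800, 40, 5100]
critic_score = [80, None, 95, 60, None]
rho, n = spearman_pairwise(total_reviews, critic_score)
print(rho, n)
```

Note that n here is the number of games the critic actually scored, not the number of games in the data set, so each critic's correlation rests on a different sample size.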

In [128]:
import pandas as pd
from scipy.stats import spearmanr

df = pd.read_csv('drive/MyDrive/Colab Notebooks/metacritic_scraper/Metacritic_Steam_2021.csv', na_values=['tbd']) # user_score has some 'tbd's in it
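df_corr = df.select_dtypes(exclude='object').corr(method='spearman') # numeric columns only; newer pandas versions raise an error on string columns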
df_corr = df_corr.sort_values(by=['total_reviews'], ascending=False) # sort by correlation to total_reviews
display(df_corr.head(n=7))
display(df_corr.tail(n=15))

Unnamed: 0,appid,total_reviews,pos_reviews,neg_reviews,rank,meta_score,user_score,4Players.de,Adventure Gamers,Android Central,App Trigger,Areajugones,Atomix,Attack of the Fanboy,AusGamers,BaziCenter,Buried Treasure,But Why Tho?,CD-Action,CGMagazine,COGconnected,Carole Quintaine,Checkpoint Gaming,Comicbook.com,Critical Hit,Cubed3,Cultured Vultures,DarkStation,DarkZero,Destructoid,Dexerto,Digital Chumps,Digital Spy,Digital Trends,Digitally Downloaded,DualShockers,EGM,Easy Allies,Edge Magazine,Eurogamer Italy,...,RPGamer,Ragequit.gr,Riot Pixels,Road to VR,SECTOR.sk,Screen Rant,Shacknews,Shindig,Siliconera,Slant Magazine,Softpedia,SpazioGames,Stevivor,Telegraph,The Digital Fix,The Games Machine,The Indie Game Website,The Loadout,The Mako Reactor,The Overpowered Noobs,TheGamer,TheSixthAxis,TierraGamer,TrueGaming,Trusted Reviews,Twinfinite,UploadVR,VG247,VGC,Vandal,Vgames,Washington Post,Wccftech,We Got This Covered,WellPlayed,Windows Central,Worth Playing,XGN,ZTGD,null_404
total_reviews,-0.161779,1.0,0.994981,0.913155,-0.386549,0.386612,0.042865,0.260959,0.186555,,-0.258199,0.503875,0.341846,-0.066001,0.008691,0.658539,,-0.097317,0.430927,0.241894,0.386461,,0.247596,0.316228,0.436527,0.090182,-0.085795,0.735612,0.199169,0.308416,-0.66398,0.239709,,0.737865,-0.200189,0.49589,1.0,0.281387,0.139437,0.247651,...,-0.196044,0.405017,0.442591,0.464286,-0.148642,0.163957,0.27585,-0.222616,-0.130931,0.326994,0.366072,0.347087,0.089468,,0.617213,0.262526,0.202249,0.866025,0.5,,0.344823,0.26887,0.272345,0.258363,-0.223607,0.370796,0.447214,-0.019604,-0.061721,0.171119,-0.429617,0.819689,0.402215,-0.7,0.416596,0.252563,0.424994,0.492805,0.238598,
EGM,1.0,1.0,1.0,1.0,-0.5,0.5,-1.0,,,,,,,-1.0,,,,,1.0,,1.0,,,,,,,,,,,,,1.0,,1.0,1.0,,,,...,,,-1.0,,,,-1.0,,,,,1.0,,,,1.0,1.0,,,,-1.0,,,,,,,-1.0,,-1.0,-1.0,,,-1.0,,,,,,
pos_reviews,-0.158239,0.994981,1.0,0.87882,-0.426456,0.426678,0.089355,0.303177,0.196647,,-0.258199,0.605534,0.341846,-0.061876,0.008691,0.781631,,-0.079476,0.46379,0.307896,0.387391,,0.264273,0.316228,0.391983,0.113505,-0.079555,0.617914,0.225831,0.286265,-0.689518,0.229192,,0.737865,-0.15667,0.49589,1.0,0.312653,0.168648,0.255211,...,-0.196044,0.415673,0.508415,0.392857,-0.148642,0.201315,0.291467,-0.16595,0.130931,0.34099,0.379335,0.373411,-0.004709,,0.771517,0.292486,0.231755,0.866025,0.5,,0.350001,0.296624,0.256484,0.321635,-0.223607,0.392273,0.447214,0.093119,-0.061721,0.192586,-0.429617,0.819689,0.433821,-0.7,0.418027,0.280695,0.452558,0.492805,0.278736,
neg_reviews,-0.170504,0.913155,0.87882,1.0,-0.151694,0.152194,-0.1674,0.106636,-0.136032,,-0.258199,0.0442,0.152445,-0.100769,-0.085668,-0.036927,,-0.207611,0.263722,0.070161,0.30497,,0.061752,-0.316228,0.231626,-0.089405,-0.136492,0.647339,-0.037585,0.092153,-0.672493,0.165247,,-0.105409,-0.31334,0.555738,1.0,0.218857,-0.004758,0.135726,...,-0.429788,0.255187,0.26267,-0.036037,-0.27251,-0.030662,0.174194,-0.344042,-0.654654,0.101788,0.096473,0.146955,0.042379,,0.617213,0.074141,0.043125,0.866025,0.5,,0.200939,0.092678,0.326118,0.144472,-0.782624,0.304516,0.447214,-0.254851,-0.308607,0.070546,-0.517389,0.698253,0.206655,-0.7,0.292035,-0.064168,0.348373,0.463817,0.169472,
The Loadout,0.0,0.866025,0.866025,0.866025,-0.866025,0.866025,0.866025,,,,,,,,,,,,,,,,1.0,,,,,,,1.0,,,,,,,,,,,...,,,,,,,1.0,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,0.866025,,,
Washington Post,0.030359,0.819689,0.819689,0.698253,-0.333947,0.333947,0.344124,1.0,-0.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.235702,,...,,1.0,,,,,,,,,,,,,,,-0.333333,,,,1.0,,,,,,,,,,,1.0,,,,,1.0,,,
Digital Trends,-0.210819,0.737865,0.737865,-0.105409,0.210819,-0.210819,0.105409,,,,,,,-0.777778,,1.0,,,-0.5,,1.0,,0.5,,,,,,,0.866025,,,,1.0,,-1.0,1.0,,1.0,,...,,,-0.210819,,,0.866025,-0.055556,,,,,0.5,-1.0,,,0.5,,,,,-0.5,,-1.0,-0.5,,,,-0.5,,-0.866025,-1.0,,,-1.0,,,-1.0,,,


Unnamed: 0,appid,total_reviews,pos_reviews,neg_reviews,rank,meta_score,user_score,4Players.de,Adventure Gamers,Android Central,App Trigger,Areajugones,Atomix,Attack of the Fanboy,AusGamers,BaziCenter,Buried Treasure,But Why Tho?,CD-Action,CGMagazine,COGconnected,Carole Quintaine,Checkpoint Gaming,Comicbook.com,Critical Hit,Cubed3,Cultured Vultures,DarkStation,DarkZero,Destructoid,Dexerto,Digital Chumps,Digital Spy,Digital Trends,Digitally Downloaded,DualShockers,EGM,Easy Allies,Edge Magazine,Eurogamer Italy,...,RPGamer,Ragequit.gr,Riot Pixels,Road to VR,SECTOR.sk,Screen Rant,Shacknews,Shindig,Siliconera,Slant Magazine,Softpedia,SpazioGames,Stevivor,Telegraph,The Digital Fix,The Games Machine,The Indie Game Website,The Loadout,The Mako Reactor,The Overpowered Noobs,TheGamer,TheSixthAxis,TierraGamer,TrueGaming,Trusted Reviews,Twinfinite,UploadVR,VG247,VGC,Vandal,Vgames,Washington Post,Wccftech,We Got This Covered,WellPlayed,Windows Central,Worth Playing,XGN,ZTGD,null_404
Game Revolution,-0.211598,-0.606199,-0.691982,-0.491822,0.108658,-0.097354,-0.046068,-0.5,,,,-0.666886,0.316228,0.737865,0.866025,,,,0.131579,-0.866025,-0.948683,,0.917663,-1.0,,,,,-1.0,0.308021,1.0,1.0,,,,-1.0,,0.0,-1.0,0.317821,...,,-0.616849,-0.5,,,0.44239,0.648886,1.0,,0.866025,,-0.529641,0.4,,-1.0,-0.004293,,,,,-0.316228,0.866025,0.5,-0.866025,0.5,,,0.645497,-0.866025,-0.67082,-1.0,,-0.49359,1.0,,-0.866025,-0.172212,1.0,-1.0,
Dexerto,0.178764,-0.66398,-0.689518,-0.672493,-0.349015,0.349015,0.252174,,,,,1.0,-0.5,,-1.0,,,,,-0.4,-0.359092,,-0.866025,,,,,,,,1.0,1.0,,,,,,,-1.0,-0.210819,...,,-0.5,1.0,,,-0.5,0.5,,,,1.0,0.0,-1.0,,,0.0,,,,,0.5,0.0,-0.158114,,,1.0,,-1.0,,0.316228,,,0.632456,,1.0,-0.105409,,1.0,-1.0,
We Got This Covered,-0.9,-0.7,-0.7,-0.7,0.3,-0.359092,0.9,,-1.0,,,,,0.866025,,,,,-1.0,1.0,-0.5,,0.774597,,,,,,,,,,,-1.0,,1.0,-1.0,1.0,-0.866025,,...,,,1.0,,,0.0,0.894427,,,,,-0.8,1.0,,,-0.9,-1.0,,,,0.866025,0.866025,,-1.0,,-0.632456,,0.866025,-1.0,-0.3,0.0,,-0.5,1.0,-1.0,0.0,-0.105409,,,
PlaySense,-1.0,-1.0,-1.0,-1.0,-1.0,1.0,1.0,,,,,,,,1.0,,,,,1.0,-1.0,,,,,,,,,,-1.0,,,,,,,,,,...,,,,,,,,,,,,-1.0,1.0,,,-1.0,,,,,,1.0,,,,,,1.0,,1.0,,,,,-1.0,1.0,,,,
Android Central,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Buried Treasure,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Carole Quintaine,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Digital Spy,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Forbes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
IGN Adria,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


I have a correlation matrix, but this isn't the form I need: I'm going to take the correlations into R and feed them to the cocor package to test whether they differ significantly. cocor's `zou2007` test needs each correlation and its observation count (n). I also want p-values, to see whether these correlations are even worth comparing. SciPy's `spearmanr` function returns both the correlation and its p-value.

In [129]:
spearmanr_dict = {'variable': [], 'correlation': [], 'pvalue': [], 'n': []}  # collect results here, convert to a DataFrame below
total_reviews = df['total_reviews']                                     # has 276 values

for column_name in df.select_dtypes(exclude='object'):                  # numeric columns only (skips columns containing strings)
    if column_name == 'total_reviews':                                  # skip the total_reviews vs. total_reviews correlation
        continue

    series = df[column_name]
    has_score = series.notna()                                          # each critic reviewed only a handful of games;
                                                                        # spearmanr needs samples of equal length, so keep only the scored games
    correlation, pvalue = spearmanr(total_reviews[has_score], series[has_score])
    if pd.isna(correlation):                                            # skip correlations that can't be computed (NaN) due to small n
        continue

    spearmanr_dict['variable'].append(column_name)
    spearmanr_dict['correlation'].append(correlation)
    spearmanr_dict['pvalue'].append(pvalue)
    spearmanr_dict['n'].append(int(has_score.sum()))                    # number of games this column has a value for

df_spearmanr = pd.DataFrame(spearmanr_dict)
df_spearmanr.to_csv('drive/MyDrive/Colab Notebooks/metacritic_scraper/df_spearmanr.csv', index=False)  # saved so the R cells below can read it
df_spearmanr



Unnamed: 0,variable,correlation,pvalue,n
0,appid,-0.161779,7.076097e-03,276
1,pos_reviews,0.994981,5.778750e-276,276
2,neg_reviews,0.913155,8.504820e-109,276
3,rank,-0.386549,2.869959e-11,276
4,meta_score,0.386612,2.847250e-11,276
...,...,...,...,...
152,WellPlayed,0.416596,1.973749e-02,31
153,Windows Central,0.252563,2.037356e-01,27
154,Worth Playing,0.424994,3.238901e-03,46
155,XGN,0.492805,3.206326e-01,6


### Import, Clean, Preliminary Analysis

In [130]:
from google.colab import drive
drive.mount('/content/drive')
# load this once to run R, then put '%%R' by itself at the beginning of each R block
%load_ext rpy2.ipython 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [None]:
%%R
install.packages('cocor')
library(tidyverse)
library(cocor)
library(knitr)

In [132]:
%%R
df <- read.csv(file = 'drive/MyDrive/Colab Notebooks/metacritic_scraper/df_spearmanr.csv')
df <- filter(df, pvalue < .05) %>%                                              # only significant values, sort by correlation
  arrange(-correlation)

df <- df[-c(2, 3, 45, 46),]                                                     # remove irrelevant rows (pos_reviews, neg_reviews, rank, appid) by position
rownames(df) <- NULL                                                            # reset index(row names)
df

            variable correlation       pvalue   n
1                EGM   1.0000000 0.000000e+00   3
2    Washington Post   0.8196886 4.583714e-02   6
3         BaziCenter   0.6585395 3.840358e-02  10
4          Game Rant   0.6442759 2.907041e-03  19
5           Player 2   0.6085100 7.576676e-04  27
6      Hooked Gamers   0.5408697 1.154574e-03  33
7        Meristation   0.5337783 9.428026e-06  61
8        Finger Guns   0.5105695 2.143100e-02  20
9     Hardcore Gamer   0.5059294 5.951102e-05  57
10          Gamersky   0.4977392 2.553432e-02  20
11        IGN Italia   0.4953671 2.349815e-05  66
12     Jeuxvideo.com   0.4678987 4.846886e-08 123
13       GameCritics   0.4570950 8.911441e-05  68
14         Millenium   0.4566809 1.103523e-03  48
15       Everyeye.it   0.4443979 9.652410e-04  52
16       Riot Pixels   0.4425910 1.288956e-03  50
17         CD-Action   0.4309273 2.608244e-06 110
18     Worth Playing   0.4249944 3.238901e-03  46
19        WellPlayed   0.4165956 1.973749e-02  31


Findings so far: 
* the correlation between **user_score** and total_reviews was not significant at p < .05
* **Electronic Gaming Monthly (EGM)** reviews have a perfect correlation of 1, but that throws an error in cocor and the n was only 3, so EGM is excluded from the final analysis. It is still noteworthy.

In [133]:
%%R
df <- filter(df, correlation < 1)                                               # drop EGM's perfect correlation (r = 1), which cocor can't process
rownames(df) <- NULL                                                            # reset index(row names)
df

            variable correlation       pvalue   n
1    Washington Post   0.8196886 4.583714e-02   6
2         BaziCenter   0.6585395 3.840358e-02  10
3          Game Rant   0.6442759 2.907041e-03  19
4           Player 2   0.6085100 7.576676e-04  27
5      Hooked Gamers   0.5408697 1.154574e-03  33
6        Meristation   0.5337783 9.428026e-06  61
7        Finger Guns   0.5105695 2.143100e-02  20
8     Hardcore Gamer   0.5059294 5.951102e-05  57
9           Gamersky   0.4977392 2.553432e-02  20
10        IGN Italia   0.4953671 2.349815e-05  66
11     Jeuxvideo.com   0.4678987 4.846886e-08 123
12       GameCritics   0.4570950 8.911441e-05  68
13         Millenium   0.4566809 1.103523e-03  48
14       Everyeye.it   0.4443979 9.652410e-04  52
15       Riot Pixels   0.4425910 1.288956e-03  50
16         CD-Action   0.4309273 2.608244e-06 110
17     Worth Playing   0.4249944 3.238901e-03  46
18        WellPlayed   0.4165956 1.973749e-02  31
19      Impulsegamer   0.4137837 1.211908e-02  36


### cocor

Now use `library(cocor)` to compare each correlation with every other correlation and determine whether they are significantly different. The significant differences need to be stored in a way that is conducive to later visualization (*I don't know if it's even possible to usefully visualize this statically*; https://www.researchgate.net/post/How_to_elegantly_show_multiple_significant_differences_between_groups_on_a_bar_graph; *it seems I'll have to learn a visualization tool next*).
<br>
<br>
For `cocor` I'm using:
  * independent groups: each critic reviewed only a handful of games, so the samples overlap little
  * zou2007: a relatively recent, robust method for comparing correlations
  * confidence level = .90: false positives aren't much of a concern for the comparisons I'm making, so this could be lowered
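For intuition about what the `zou2007` test checks, here is a rough Python sketch of Zou's (2007) confidence interval for the difference between two independent correlations. It assumes the standard formulation (a Fisher z interval for each correlation, then combined) and is not taken from cocor's source. The function names are hypothetical; the two input rows come from the table above.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf_level):
    """Confidence interval for a single correlation via the Fisher z-transform."""
    if n <= 3 or abs(r) >= 1:
        raise ValueError("needs n > 3 and |r| < 1")
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def zou_ci_independent(r1, n1, r2, n2, conf_level=0.90):
    """Zou-style interval for r1 - r2 when the two samples are independent."""
    l1, u1 = fisher_ci(r1, n1, conf_level)
    l2, u2 = fisher_ci(r2, n2, conf_level)
    diff = r1 - r2
    lower = diff - math.sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = diff + math.sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper

def differs(r1, n1, r2, n2, conf_level=0.90):
    """True when the interval for r1 - r2 excludes 0."""
    lower, upper = zou_ci_independent(r1, n1, r2, n2, conf_level)
    return lower > 0 or upper < 0

# Game Rant (r = 0.6442759, n = 19) vs. WellPlayed (r = 0.4165956, n = 31)
print(zou_ci_independent(0.6442759, 19, 0.4165956, 31))
print(differs(0.6442759, 19, 0.4165956, 31))
```

The guard in `fisher_ci` shows why, in this formulation, a correlation of exactly 1 or an n of 3 cannot be tested: `atanh(1)` is infinite, and the standard error divides by `sqrt(n - 3)`.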

In [135]:
%%R
variable <- c()
sig_diff <- c()
i  <- 1
for (variable1 in df$variable){
  ii <- 1
  sig <- c()                                                                    # variables whose correlation differs significantly from variable1's
  for (variable2 in df$variable){
    result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[ii], n1=df$n[i], n2=df$n[ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)
    conf_int <- get.cocor.results(result)$zou2007$conf.int

    if ((conf_int[1] > 0 && conf_int[2] > 0) || (conf_int[1] < 0 && conf_int[2] < 0)) {   # interval excludes 0: significant difference
      sig <- append(sig, variable2)
    }
    ii <- ii + 1
  }
  sig_diff <- append(sig_diff, paste(sig, collapse = ", "))
  variable <- append(variable, variable1)
  i <- i + 1
}
df_sig <- data.frame(sig_diff, variable)
df2 <- left_join(df, df_sig) %>%
  rename(sig_diff_from = sig_diff) %>%
  write_csv('drive/MyDrive/Colab Notebooks/metacritic_scraper/analyzed.csv')     # write_csv returns its input, so df2 holds the joined table
knitr::kable(df2)                                                               # pretty-printing library

Joining, by = "variable"


|variable | correlation | pvalue | n | sig_diff_from |
|:--------|------------:|-------:|--:|:--------------|

#### Findings:
* **Game Revolution** reviews have a significant negative correlation with total_reviews. Every other reviewer is significantly better than them. As a publisher, do the opposite of what they suggest.

#### Debugging Tools

In [None]:
%%R

sig_diff <- c()
for (i in seq_len(nrow(df) - 1)){                                               # every row except the last
    for (ii in seq_len(nrow(df) - i)){                                          # offset to each later row, so i+ii stays in range
      result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[i+ii], n1=df$n[i], n2=df$n[i+ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)
      #print(result)
      conf_int <- get.cocor.results(result)$zou2007$conf.int
      #print(conf_int)

      if ((conf_int[1] > 0 && conf_int[2] > 0) || (conf_int[1] < 0 && conf_int[2] < 0)) {
        sig_diff <- append(sig_diff, paste(df$variable[i], 'vs.', df$variable[i+ii]))   # record the significantly different pair
      }
    }
}
print(sig_diff)

In [None]:
%%R

i  <- 4
ii <- 20
result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[i+ii], n1=df$n[i], n2=df$n[i+ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)

print(result)
result <- get.cocor.results(result)
print(result$zou2007$conf.int)

if (((result$zou2007$conf.int[1]>0) & (result$zou2007$conf.int[2]>0)) | ((result$zou2007$conf.int[1]<0)) & ((result$zou2007$conf.int[2]<0))) {
  print('significant difference')
}

In [None]:
%%R

sig <- c()                                                                      # start from an empty vector instead of relying on an earlier cell
sig_diff <- c()
for (variable in df$variable){
    
    sig <- append(sig, variable)
    sig_string <-paste(sig, collapse = ", ")
    sig_diff <- append(sig_diff, sig_string)
}
print(sig_diff)

In [None]:
%%R
i <- 1

variable_v <- c()
for (variable in df$variable){
    variable_v <- append(variable_v, variable)
}
dfv <- data.frame(variable_v) %>%
print()

In [None]:
%%R

sig <- c()
i <- 1

for (variable in df$variable){
    sig <- append(sig, variable)
}


df2 <- data.frame(sig) %>%
rename(!!df$variable[1]:= sig) %>%
print()

In [None]:
%%R

tibble_row(
  x = 1,                                                                        # placeholder value, only to test the call
  .name_repair = c("check_unique", "unique", "universal", "minimal")
)

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df.dtypes)