<a href="https://colab.research.google.com/github/kallenhager/metacritic_scraper/blob/main/analysis/analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis

I scraped this data intending to answer the following questions for PC games:

* **Do Metacritic scores correlate with sales\*?**
    * Should a videogame publisher/developer pay attention to critic reviews?
<BR>
* **Which critics' scores correlate best with sales\*?**
    * Specifically, which critics should a videogame publisher/developer pay attention to?


\*Note: I don't actually have sales data. Videogame sales figures are not publicly available. The total number of reviews a game has on Steam is generally accepted as the best publicly available proxy for PC game sales (https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales).

In [None]:
import pandas as pd
import plotnine
from scipy.stats import spearmanr

### Correlation Matrix

Here I produce a Spearman correlation matrix using pandas's `DataFrame.corr` method. `.corr` excludes null values pair by pair, which is great for my data set, since most critics reviewed only some of the games.
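To make the null handling concrete, here is a minimal sketch on a made-up frame (the column names `critic_a` and `critic_b` and all values are hypothetical, not taken from the scraped data). Each pair of columns is correlated using only the rows where both have a value.

```python
import numpy as np
import pandas as pd

# Toy data: each hypothetical critic skipped some games (NaN)
toy = pd.DataFrame({
    'total_reviews': [100, 250, 400, 900, 1500],
    'critic_a':      [60, 70, np.nan, 85, 90],
    'critic_b':      [np.nan, 80, 75, np.nan, 60],
})

# Spearman correlation; rows with a NaN are dropped per pair of columns,
# not across the whole frame
toy_corr = toy.corr(method='spearman')
print(toy_corr['total_reviews'])
```

Here `critic_a` rises with `total_reviews` on the four games it covers and `critic_b` falls on its three, so the two correlations come out at the extremes even though neither column is complete.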

In [18]:
df = pd.read_csv('Metacritic_Steam_2021.csv', na_values=['tbd']) # user_score has some 'tbd's in it
df_corr = df.corr(method='spearman', numeric_only=True) # numeric_only skips the string columns, which raise an error in pandas >= 2.0
df_corr = df_corr.sort_values(by=['total_reviews'], ascending=False) # sort by correlation to total_reviews
display(df_corr.head(n=7))
display(df_corr.tail(n=15))

FileNotFoundError: ignored

This gives a correlation matrix, but it isn't really the form I need my data in, since I'm going to take the correlations into R and feed them to the cocor package to test for significant differences. cocor's `zou2007` test needs each correlation and its observation count (n). I also want p-values to see whether these correlations are even worth comparing. SciPy's `spearmanr` function returns both the correlation and its p-value.

In [None]:
spearmanr_dict = {'variable':[], 'correlation':[], 'pvalue':[], 'n':[]} # collect results in a dictionary, converted to a DataFrame below
for each_column in df.select_dtypes(exclude='object'):                  # for each column name in df, excluding columns of type 'object' (columns containing strings)

    if each_column == 'total_reviews':                                  # skip the total_reviews * total_reviews correlation
        continue

    # total_reviews has 276 values, but each critic only reviewed a handful of games,
    # and spearmanr needs inputs of the same length: keep only the games this column has a value for
    reviewed = df[each_column].notna()                                  # boolean mask, True where this column is not NaN
    trimmed_total_reviews = df.loc[reviewed, 'total_reviews']           # total_reviews for those games
    trimmed_each_column = df.loc[reviewed, each_column]                 # this column's values for those games

    correlation, pvalue = spearmanr(trimmed_total_reviews, trimmed_each_column)
    if pd.isna(correlation):                                            # don't add correlations that can't be computed (NaN) due to small n
        continue

    spearmanr_dict['variable'].append(each_column)
    spearmanr_dict['correlation'].append(correlation)
    spearmanr_dict['pvalue'].append(pvalue)
    spearmanr_dict['n'].append(len(trimmed_total_reviews))

df_spearmanr = pd.DataFrame(spearmanr_dict)
df_spearmanr



            variable  correlation         pvalue    n
0              appid    -0.161779   7.076097e-03  276
1        pos_reviews     0.994981  5.778750e-276  276
2        neg_reviews     0.913155  8.504820e-109  276
3               rank    -0.386549   2.869959e-11  276
4         meta_score     0.386612   2.847250e-11  276
..               ...          ...            ...  ...
152       WellPlayed     0.416596   1.973749e-02   31
153  Windows Central     0.252563   2.037356e-01   27
154    Worth Playing     0.424994   3.238901e-03   46
155              XGN     0.492805   3.206326e-01    6
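The R cells later in this notebook read these results from a file named `df_spearmanr.csv`, so the DataFrame has to be written out at some point. A minimal sketch of that step, assuming a throwaway temporary folder rather than the Google Drive folder the notebook actually uses, and a two-row stand-in for `df_spearmanr` copied from the output above:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for df_spearmanr: two rows copied from the output above
df_spearmanr = pd.DataFrame({
    'variable': ['meta_score', 'WellPlayed'],
    'correlation': [0.386612, 0.416596],
    'pvalue': [2.847250e-11, 1.973749e-02],
    'n': [276, 31],
})

# Hypothetical output location; swap in the real folder when running the notebook
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / 'df_spearmanr.csv'

# index=False keeps the row index out of the file, so R sees only the four data columns
df_spearmanr.to_csv(csv_path, index=False)

# Read it back to confirm the file has the expected columns
round_trip = pd.read_csv(csv_path)
print(round_trip.columns.tolist())
```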


### Google Colab

The comparison needs the `cocor` package from R, so I'm moving to Google Colab, which makes it easy to run R and Python in the same notebook.

In [19]:
from google.colab import drive
drive.mount('/content/drive')
# load this once to run R, then put '%%R' by itself at the beginning of each R block
%load_ext rpy2.ipython 

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%%R
install.packages('cocor')
library(tidyverse)
library(cocor)
library(knitr)

In [24]:
%%R
df <- read.csv(file = 'drive/MyDrive/Colab Notebooks/metacritic_scraper/df_spearmanr.csv')

df <- filter(df, pvalue < .05) %>%                                              # only significant values, sort by correlation
  arrange(-correlation)

df <- df[-c(1, 2, 3, 45, 46),]                                                  # drop irrelevant rows (pos_reviews, neg_reviews, rank, appid) and EGM, whose correlation of 1 causes NaN in cocor
rownames(df) <- NULL                                                            # reset index(row names)
df

            variable correlation       pvalue   n
1    Washington Post   0.8196886 4.583714e-02   6
2         BaziCenter   0.6585395 3.840358e-02  10
3          Game Rant   0.6442759 2.907041e-03  19
4           Player 2   0.6085100 7.576676e-04  27
5      Hooked Gamers   0.5408697 1.154574e-03  33
6        Meristation   0.5337783 9.428026e-06  61
7        Finger Guns   0.5105695 2.143100e-02  20
8     Hardcore Gamer   0.5059294 5.951102e-05  57
9           Gamersky   0.4977392 2.553432e-02  20
10        IGN Italia   0.4953671 2.349815e-05  66
11     Jeuxvideo.com   0.4678987 4.846886e-08 123
12       GameCritics   0.4570950 8.911441e-05  68
13         Millenium   0.4566809 1.103523e-03  48
14       Everyeye.it   0.4443979 9.652410e-04  52
15       Riot Pixels   0.4425910 1.288956e-03  50
16         CD-Action   0.4309273 2.608244e-06 110
17     Worth Playing   0.4249944 3.238901e-03  46
18        WellPlayed   0.4165956 1.973749e-02  31
19      Impulsegamer   0.4137837 1.211908e-02  36


Findings so far: 
* the correlation between user_score and total_reviews was not significant at the .05 level
<br>
* EGM (Electronic Gaming Monthly) had a perfect correlation of 1, but that throws an error in cocor and its n was only 3, so it is excluded from the final analysis; still noteworthy
<br>
<br>
<br>

Now use `library(cocor)` to compare each correlation with every other correlation and determine whether they are significantly different. The significant differences need to be stored in a way that is conducive to later visualization (*idk if it's even possible to usefully visualize this statically*; https://www.researchgate.net/post/How_to_elegantly_show_multiple_significant_differences_between_groups_on_a_bar_graph; *seems like I'm gonna have to learn Shiny or Dash next*).
<br>
<br>
For `cocor` I'm using:
  * independent groups; each critic only reviewed a handful of games, little overlap
  * zou2007; relatively recent, robust method for comparing correlations
  * confidence level = .90; because a .95 confidence level wasn't giving me enough significant differences, and a false positive isn't much of a concern with the comparisons I'm making
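For intuition about what the `zou2007` option computes, here is a hand-rolled Python sketch of Zou's (2007) interval for the difference between two independent correlations. It is not cocor's code, and the function names `fisher_ci` and `zou_ci_independent` are made up for this illustration. Each correlation gets its own Fisher z confidence interval, and the two intervals are then combined into one interval for r1 - r2; the difference counts as significant when that interval excludes 0.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def fisher_ci(r, n, conf_level):
    """Confidence interval for a single correlation via the Fisher z-transformation."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    half_width = z_crit / sqrt(n - 3)
    return tanh(atanh(r) - half_width), tanh(atanh(r) + half_width)


def zou_ci_independent(r1, n1, r2, n2, conf_level=0.90):
    """Zou (2007) interval for r1 - r2 when the two groups are independent."""
    l1, u1 = fisher_ci(r1, n1, conf_level)
    l2, u2 = fisher_ci(r2, n2, conf_level)
    diff = r1 - r2
    lower = diff - sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2)
    upper = diff + sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2)
    return lower, upper


# Values from the single comparison shown further down in this notebook
# (r1 = 0.5409 with n1 = 33, r2 = 0.2919 with n2 = 84), at the 95% level
lower, upper = zou_ci_independent(0.5408697, 33, 0.2919, 84, conf_level=0.95)
significant = lower > 0 or upper < 0   # True only if the interval excludes 0
print(lower, upper, significant)
```

With those inputs the sketch should land close to the interval cocor prints for the same pair, which makes it a handy cross-check on the R output.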

In [None]:
%%R
df$correlation[1]

[1] 0.60851


In [None]:
%%R

i  <- 5
ii <- 30
result <- cocor.indep.groups(r1.jk=df$correlation[i], r2.hm=df$correlation[i+ii], n1=df$n[i], n2=df$n[i+ii], test='zou2007', alpha=0.05, conf.level=0.90, null.value=0)

print(result)
result <- get.cocor.results(result)
print(result$zou2007$conf.int)

ci <- result$zou2007$conf.int
if ((ci[1] > 0 && ci[2] > 0) || (ci[1] < 0 && ci[2] < 0)) {                       # the interval excludes 0
  print('significant difference')
}


  Results of a comparison of two correlations based on independent groups

Comparison between r1.jk = 0.5409 and r2.hm = 0.2919
Difference: r1.jk - r2.hm = 0.249
Group sizes: n1 = 33, n2 = 84
Null hypothesis: r1.jk is equal to r2.hm
Alternative hypothesis: r1.jk is not equal to r2.hm (two-sided)
Alpha: 0.05

zou2007: Zou's (2007) confidence interval
  95% confidence interval for r1.jk - r2.hm: -0.1018 0.5418
  Null hypothesis retained (Interval includes 0)

[1] -0.1017585  0.5418006


In [None]:
%%R
variable <- 1
v <- variable
x <- c()

for (i in df$variable){
    x <- append(x, i)
}

enframe(x, name = v, value = df$variable[1]) %>%
print()

# A tibble: 43 × 2
     `1` EGM            
   <int> <chr>          
 1     1 EGM            
 2     2 Washington Post
 3     3 BaziCenter     
 4     4 Game Rant      
 5     5 Player 2       
 6     6 Hooked Gamers  
 7     7 Meristation    
 8     8 Finger Guns    
 9     9 Hardcore Gamer 
10    10 Gamersky       
# … with 33 more rows


In [None]:
%%R

# one-row tibble with one column per critic; "unique" generates column names because x is unnamed
as_tibble_row(x, .name_repair = "unique")

##### Debugging Tools

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df.dtypes)