In [27]:
# Notebook-wide definitions.

import json
import numpy

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go

init_notebook_mode(connected=True)

# Assessing Whether Reviews Are Equal

There are a lot of factors influencing the data recorded as a 'review' in my system.

A user who has received a product for free has had it gifted to them by either a friend or a developer. It is therefore tough to discern whether they are a reviewer, a typical consumer, or somewhere in between. We eschew the notion of personal motivation by using pre-classified information, and looking solely at the wealth of data collected about the users' habits.

Furthermore, we look at the aggregated data provided by the platform: as opposed to producing a 'mapreduce' view of the information in my system, I take summary information from other APIs. This has been done to save time; however, some investigation is needed to assess how they compute it before using it.

## R1

### Aim

To discern whether reviews left by users who have received the product for free are biased.

### Purpose

If there is a trend either way then reviews cannot be treated as an atomically equal metric; this is problematic as there are so many of them, and may result in the need for a new score system to be created for the dataset (at least I have control of this). Simply factoring this information into graphical representations is not useful because the usefulness of its visibility relies on the personal assumptions of the viewer; this is what I wish to make concrete.

I believe one of three things will happen:

- Free reviews will favour products more positively (as they do not have the stigma of having spent money on the experience). They may also have a personal relationship with the developer.

- Free reviews will be more spread out and perhaps more considered because the person receiving a product for free may be a professional reviewer, community influencer, or at least somebody aware that the context of their purchase will be highlighted.

- There will be no trend due to the wide mix of people reviewing - perhaps most people leaving reviews on products that were received for free received them as gifts and therefore even though it was not their money, they still feel obliged to give an honest opinion.

### Investigation

Out of all reviews:
- 1,162,276 (~75.81%) positive.
- 370,912 (~24.19%) negative.

Out of all reviews,
- 1,496,026 (~97.58%) were submitted by users who purchased the product themselves.
- 37,162 (~2.42%) were submitted by users who received the product for free.

Out of 1,496,026 purchased reviews:
- 1,133,417 positive (~75.76%).
- 362,609 negative (~24.24%).

The lack of change from the overall split is due to the comparatively small number of free reviews, which therefore requires further investigation.

Out of 37,162 free reviews:
- 28,859 (~77.66%) positive.
- 8,303 (~22.34%) negative.

This appears to suggest that free reviews are slightly more biased towards positive recommendations. In order to verify this, we need to look at their deviation from the ratio on the product for which they were left.
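The percentages in the breakdown above can be re-derived from the raw counts. The following is a minimal sketch that uses only the counts quoted in this section; the helper name `percent` is illustrative and not part of the notebook.

```python
def percent(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(part / whole * 100, 2)

# Counts quoted in the breakdown above.
purchased_positive, purchased_negative = 1133417, 362609
free_positive, free_negative = 28859, 8303

purchased_total = purchased_positive + purchased_negative
free_total = free_positive + free_negative
all_total = purchased_total + free_total

# Share of reviews by how the product was acquired.
share_purchased = percent(purchased_total, all_total)
share_free = percent(free_total, all_total)

# Share of positive reviews within each group and overall.
positive_purchased = percent(purchased_positive, purchased_total)
positive_free = percent(free_positive, free_total)
positive_all = percent(purchased_positive + free_positive, all_total)

print("Purchased share:", share_purchased, "Free share:", share_free)
print("Positive (purchased):", positive_purchased)
print("Positive (free):", positive_free)
print("Positive (all):", positive_all)
```

Re-deriving the splits this way makes it easy to check that the purchased and free subsets add up to the overall totals.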

For every product that has **at least one free review**, we build an intermediate index in order to construct the same set of comparisons. In it, we **discount products that have no purchased reviews** (in which case they are free or anomalous).

In [28]:
review_demographic_count = json.load(open('./dumps/review-demographic-count.json'))

num_purchased_products_with_free_reviews = 0

percent_purchased_reviews_per_product = []

# This is the deviation of the free review ratio to the purchased ratio of the same product.
deviation_per_product = []
deviation_and_total_free_per_product = []

num_more_free_than_purchased = 0
num_positive_deviation = 0
num_negative_deviation = 0

for item in review_demographic_count:
    # Skip products that are themselves free.
    if item['is_free']:
        continue
    
    num_purchased_products_with_free_reviews += 1
    
    if item['total_reviews_purchased'] < item['total_reviews_free']:
        num_more_free_than_purchased += 1
    
    # Compute the split for free and purchased.
    percent_purchased = item['total_reviews_purchased'] / item['total_reviews'] * 100
    percent_purchased_reviews_per_product.append(percent_purchased)
    
    # Compute the positive percentages for free and purchased.
    positive_percent_purchased = item['total_reviews_purchased_positive'] / item['total_reviews_purchased'] * 100
    positive_percent_free = item['total_reviews_free_positive'] / item['total_reviews_free'] * 100
    
    # Compute the difference between them.
    deviation = positive_percent_free - positive_percent_purchased
    deviation_per_product.append(deviation)
    
    if deviation > 0:
        num_positive_deviation += 1
    if deviation < 0:
        num_negative_deviation += 1
    
    deviation_and_total_free_per_product.append({
        'total_free': item['total_reviews_free'],
        'deviation': deviation
    })
    
average_percent_purchased_reviews_per_product = numpy.average(percent_purchased_reviews_per_product)
average_deviation_per_product = numpy.average(deviation_per_product)

print("Number of paid products with more free reviews than purchased reviews: " + str(num_more_free_than_purchased))
print("Number of paid products with free reviews: " + str(num_purchased_products_with_free_reviews))
print("Average percent of purchased reviews per product: " + str(average_percent_purchased_reviews_per_product))
print("Average deviation in percent of free positive reviews against purchased positive reviews: " + str(average_deviation_per_product))
print("Number of products for which this deviation is positive: " + str(num_positive_deviation))
print("Number of products for which this deviation is negative: " + str(num_negative_deviation))

Number of paid products with more free reviews than purchased reviews: 6
Number of paid products with free reviews: 795
Average percent of purchased reviews per product: 91.898367361
Average deviation in percent of free positive reviews against purchased positive reviews: 5.18526901452
Number of products for which this deviation is positive: 570
Number of products for which this deviation is negative: 217


This is almost sound, except the deviation might be drastic if there are only a couple of free reviews per product (for example, 70% positive purchased with 1000 reviews versus 100% positive free with 1 review is unfair to compare).
We can graph this to get a better understanding of what is happening.
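Alongside a graph, one way to dampen the effect of products with only a couple of free reviews is to weight each product's deviation by its free-review count, or to drop products below a minimum count. This is a sketch under assumptions rather than part of the analysis: the sample rows and the threshold of 10 are hypothetical, and only the row shape mirrors `deviation_and_total_free_per_product`.

```python
# Hypothetical rows in the same shape as deviation_and_total_free_per_product.
sample = [
    {'total_free': 1, 'deviation': 30.0},
    {'total_free': 2, 'deviation': -20.0},
    {'total_free': 50, 'deviation': 4.0},
    {'total_free': 200, 'deviation': 6.0},
]

def unweighted_mean_deviation(rows):
    """Plain average: every product counts equally, however few free reviews it has."""
    return sum(row['deviation'] for row in rows) / len(rows)

def weighted_mean_deviation(rows):
    """Average weighted by free-review count, so tiny samples contribute less."""
    total_weight = sum(row['total_free'] for row in rows)
    return sum(row['deviation'] * row['total_free'] for row in rows) / total_weight

def with_minimum_free_reviews(rows, minimum):
    """Keep only products with at least `minimum` free reviews."""
    return [row for row in rows if row['total_free'] >= minimum]

unweighted = unweighted_mean_deviation(sample)
weighted = weighted_mean_deviation(sample)
filtered = with_minimum_free_reviews(sample, 10)  # assumed threshold

print(unweighted, weighted, len(filtered))
```

Comparing the weighted and unweighted figures gives a quick sense of how much the small-sample products are driving the average.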

// TODO: A double bar graph which will inherently show the deviation.

In [29]:
deviation_and_total_free_per_product.sort(
    key=lambda i: i['total_free']
)

x = []
y = []

for item in deviation_and_total_free_per_product:
    x.append(numpy.log2(item['total_free']))
    y.append(item['deviation'])
    
layout = go.Layout(
    title='Deviation Of Free Reviews',
    xaxis=dict(
        title="Total Free Reviews Log2"
    ),
    yaxis=dict(
        title="Deviation Of Ratio From Purchased Reviews"
    ),
    showlegend=False
)   

data = [go.Scatter(
    x=x,
    y=y,
    mode='markers'
)]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

It appears that products with fewer free reviews have wilder deviation from the purchased percentage, which makes sense from a mathematical perspective. I did not expect this disparity to continue past 10 reviews. This suggests that there is only a slight trend. This is particularly unexpected as there are few products for which there are more free reviews than purchased reviews.

However, there are significantly (353) more products for which the percentage of positive reviews is greater in the subset of those left for free, so although the deviation may not be drastic, it is reasonable to state that free reviews are more positive.

Finally, having observed something, we can also look at which products free reviews are left for, and ergo the cases for which this is most applicable - are they left chiefly on large, popular games, or younger games (in order to promote them)?
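To put a rough number on how lopsided the 570 versus 217 split is, a sign test can be sketched with a normal approximation. This test is not part of the analysis above; it is an illustrative assumption that treats each product as an independent coin flip and ignores products with zero deviation.

```python
import math

def sign_test_z(num_positive, num_negative):
    """Normal-approximation z-score for a sign test (ties excluded).

    Under the null hypothesis each product is equally likely to deviate
    in either direction, so the positive count is Binomial(n, 0.5).
    """
    n = num_positive + num_negative
    expected = n / 2
    std_dev = math.sqrt(n) / 2
    return (num_positive - expected) / std_dev

# Counts printed by the cell above.
z = sign_test_z(570, 217)

# Two-sided p-value from the normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2))

print("z-score:", round(z, 2))
print("p-value below 1e-6:", p_value < 1e-6)
```

A large z-score here only says the direction of the deviation is consistent across products; it says nothing about its size.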

There are several ways to do this - the most straightforward is to look at the ratio of free to purchased reviews per genre.

In [30]:
genres = json.load(open('./dumps/genres.json'))

genres_counts = []

# Pre-fill array with genres.
for genre in genres:
    genres_counts.append({
        "genre": genre['genre'],
        "total_products": genre['count'],
        "total_reviews_free": 0,
        "total_reviews": 0
    })

for item in review_demographic_count:
    for genre in item['genres']:
        # Find the matching genre to add to.
        for genre_count in genres_counts:
            if genre_count['genre'] == genre:
                genre_count['total_reviews_free'] += item['total_reviews_free']
                genre_count['total_reviews'] += item['total_reviews']
                break
                
for item in genres_counts:
    if item['total_reviews'] == 0:
        item['percentage'] = 0
    else:
        item['percentage'] = item['total_reviews_free'] / item['total_reviews'] * 100

genres_counts.sort(key=lambda genre: genre['percentage'], reverse=True)

x = []
y = []
names = []

for item in genres_counts:
    x.append(item['genre'])
    y.append(item['percentage'])
    names.append("Total Reviews: " + str(item['total_reviews']))
    
layout = go.Layout(
    title='Free Reviews Per Genre',
    xaxis=dict(
        title='Genre'
    ),
    yaxis=dict(
        title='Percent Reviews Free'
    ),
    showlegend=False
)   

data = [go.Bar(
    x=x,
    y=y,
    hovertext=names
)]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

As we can see, software and casual games receive the most free/gifted reviews proportional to purchased reviews. This is somewhat influenced by the number of reviews per genre, which can be seen on hover.

### Results

- On average, ~8% of reviews per product are submitted by users who have received the product for free.
- For ~71.7% of paid products with free reviews (570 of 795), the percentage of positive free reviews is greater than the percentage of positive purchased reviews.
- On average, the percentage of positive reviews is ~5 percentage points higher among reviews submitted for free.
- There are more free reviews submitted for software products and casual games.

## R2

### Aim

To discern whether reviews left by users who have a greater usage of the product are seen as more helpful by the community.

### Purpose

Since these reviews are bubbled to the forefront of the UI, and userscore is derived from all reviews, it is worth exploring whether users who have used the product for longer receive more positive votes on their review. If so, this may also indicate, as in R1, that 'userscore' as a metric of the ratio between binary recommendation decisions is flawed, as their recommendation is therefore more impactful.

I would expect user reviews with amounts of hours logged at the higher end of the spectrum to receive more upvotes; however, this partially relies on both the poster's ability to leverage that experience into creating a meaningful review, and whether the reader actually sees the amount of hours put in.

If this is the case, then I need to examine whether people who have used products for longer are more biased towards positive reviews in order to grade them properly.

### Investigation

The total time using a product per account is [time logged against a verified server](https://www.reddit.com/r/GlobalOffensive/comments/359xkj/apparently_you_can_fake_the_time_how_long_youve/cr2hv9k/). This is separate to a statistic that allows the client to determine how long has been spent playing a game. The client-based method has been ignored in my data collection, as some people fake the time they have spent using a product in order to illegitimately boost their statistics on the community platform.

Initially, I can look at the amount of 'helpfulness' against the amount of time played on a review, and see if it tends higher the longer the user has used the product. I can do this for one product and then assess whether the outcome holds true for all games.

Caution must be exercised however, as the general maximum intended usage of a product is wildly variable, and not easily available. I can look instead at the times as a scale.

I will be using the product with the most reviews first - '221100', or 'Day Z'.

In [31]:
playtime_for_221100 = json.load(open('./dumps/playtime-for-221100.json'))

playtime_for_221100.sort(key=lambda i: i['author_total_playtime'])

x = []
y1 = []
y2 = []

for item in playtime_for_221100:
    x.append(item['author_total_playtime'] / 60) # Divide to get hours.
    y1.append(item['average_votes_up'])
    y2.append(item['max_votes_up'])
    
layout = go.Layout(
    title='Playtime Against Votes Up',
    xaxis=dict(
        title="Playtime (Hrs)"
    ),
    yaxis=dict(
        title="Votes Up"
    )
)   

data = [
    go.Scatter(
        x=x,
        y=y1,
        mode='markers'
    )
]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

There is a twofold problem here: the vast majority of reviews may not actually be seen, and those that are seen but deemed unhelpful (as opposed to neutral) cannot be tagged as such. Without simply ignoring them, we can gain a better understanding of the playtime distribution by binning the playtime into a histogram and taking the *distinct* average of `votes_up`, which reduces the influence of the disproportionate number of reviews with few votes. This accentuates the peaks and gives a clearer picture not of which time bands have indeterminate reviews, but rather of which have definitely positive reviews.
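The difference between a plain average and a *distinct* average can be seen on a small example. This sketch assumes that "distinct average" means the mean over the unique vote counts; how the `distinct_average_votes_up` field in the dumps was actually computed is not shown here, and the sample votes are hypothetical.

```python
def average(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)

def distinct_average(values):
    """Mean over unique values only (assumed meaning of 'distinct average')."""
    return average(set(values))

# Hypothetical votes_up for reviews in one playtime band:
# many unseen reviews with zero votes and one highly rated review.
votes_up = [0, 0, 0, 0, 1, 1, 2, 40]

plain = average(votes_up)
distinct = distinct_average(votes_up)

print("Average:", plain)
print("Distinct average:", distinct)
```

Because repeated low values collapse to a single entry, the distinct average is pulled towards the rarer high vote counts.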

In order to build a histogram I need to decide on the bounds in a way that does not bias the graphical representation towards one specific outcome. One option is to split the sorted playtimes into bands that each contain the same number of entries; in doing so, I will achieve a clean aggregation of reviews.
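Two common ways of choosing band bounds are equal-count bands (each band holds the same number of entries) and equal-width bands (each band covers the same span of hours). The sketch below contrasts them on hypothetical playtimes; the function names are illustrative and do not appear in the notebook.

```python
def equal_count_bands(sorted_values, size):
    """Split sorted values into consecutive bands of `size` items.

    A trailing band with fewer than `size` items is dropped.
    """
    full_bands = len(sorted_values) // size
    return [sorted_values[i * size:(i + 1) * size] for i in range(full_bands)]

def equal_width_bands(values, width):
    """Group values into bands covering [k * width, (k + 1) * width)."""
    bands = {}
    for value in values:
        bands.setdefault(int(value // width), []).append(value)
    return bands

# Hypothetical playtimes in hours, already sorted.
playtimes_hours = [0.5, 1, 2, 3, 5, 8, 40, 120, 900]

count_bands = equal_count_bands(playtimes_hours, 3)
width_bands = equal_width_bands(playtimes_hours, 100)

print(count_bands)
print(width_bands)
```

With heavily skewed playtimes, equal-width bands leave most entries in the first band, whereas equal-count bands keep every band equally populated at the cost of uneven hour ranges.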

In [32]:
# Define size of band in count.
increment = 500

x = []
y1 = []
y2 = []
y3 = []

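for i in range(len(playtime_for_221100) // increment):
    start_j = increment * i
    
    # Use the first and last entries of the band for its label; indexing
    # start_j + increment would run past the end when the length is an
    # exact multiple of the increment.
    min_time = playtime_for_221100[start_j]['author_total_playtime'] / 60
    max_time = playtime_for_221100[start_j + increment - 1]['author_total_playtime'] / 60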
    
    averages = []
    distinct_averages = []
    
    max_votes_up = -1
    
    for j in range(increment):
        item = playtime_for_221100[start_j + j]
        
        # Set max votes per band.
        if(item['max_votes_up'] > max_votes_up):
            max_votes_up = item['max_votes_up']
            
        averages.append(item['average_votes_up'])
        distinct_averages.append(item['distinct_average_votes_up'])
            
    # Set bar values.
    x.append(str(round(min_time, 2)) + " - " + str(round(max_time, 2)))  
    y1.append(max_votes_up)
    y2.append(numpy.mean(averages))
    y3.append(numpy.mean(distinct_averages))

layout = go.Layout(
    title='221100: Playtime Against Votes Up',
    xaxis=dict(
        title="Playtime (Hrs)"
    ),
    yaxis=dict(
        title="Votes Up"
    ),
    showlegend=True
)

data = [
    go.Bar(
        x=x,
        y=y1,
        name="Max"
    ),
    go.Scatter(
        x=x,
        y=y2,
        name="Average"
    ),
    go.Scatter(
        x=x,
        y=y3,
        name="Distinct Average"
    )
]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

It appears there are peak playtimes at which people trust reviews for this specific product, and past a certain point there are fewer reviews reaching higher helpfulness scores.

The distinct average and average diverge in the earlier time bands because there are a greater number of reviews (that are therefore not seen or unhelpful) between 0 - 100 hours than there are greater than 100 hours.

This may also be due to the type of product - a multi-player non-story game has open-ended playtime. The median for time played in this instance is 362 hours, which is very long.

Instead, it may be better to look at games that are smaller in scope and linearly designed (and ergo designed to be completed), such as '347560', or 'Terra Incognita'. It is worth noting that I found it hard to find a product with an intentionally linear usage - 'story' and the like are not tags that exist in the system, and even for games that incorporate narrative elements, the iterative nature of the platform does not lend itself well to things that are only meant to be completed once.

In [33]:
playtime_for_347560 = json.load(open('./dumps/playtime-for-347560.json'))

playtime_for_347560.sort(key=lambda i: i['author_total_playtime'])

# Define size of band in count.
increment = 10

x = []
y1 = []
y2 = []
y3 = []

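for i in range(len(playtime_for_347560) // increment):
    start_j = increment * i
    
    # Use the first and last entries of the band for its label; indexing
    # start_j + increment would run past the end when the length is an
    # exact multiple of the increment.
    min_time = playtime_for_347560[start_j]['author_total_playtime'] / 60
    max_time = playtime_for_347560[start_j + increment - 1]['author_total_playtime'] / 60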
    
    averages = []
    distinct_averages = []
    
    max_votes_up = -1
    
    for j in range(increment):
        item = playtime_for_347560[start_j + j]
        
        # Set max votes per band.
        if(item['max_votes_up'] > max_votes_up):
            max_votes_up = item['max_votes_up']
            
        averages.append(item['average_votes_up'])
        distinct_averages.append(item['distinct_average_votes_up'])
            
    # Set bar values.
    x.append(str(round(min_time, 2)) + " - " + str(round(max_time, 2)))  
    y1.append(max_votes_up)
    y2.append(numpy.mean(averages))
    y3.append(numpy.mean(distinct_averages))

layout = go.Layout(
    title='347560: Playtime Against Votes Up',
    xaxis=dict(
        title="Playtime (Hrs)"
    ),
    yaxis=dict(
        title="Votes Up"
    ),
    showlegend=True
)

data = [
    go.Bar(
        x=x,
        y=y1,
        name="Max"
    ),
    go.Scatter(
        x=x,
        y=y2,
        name="Average"
    ),
    go.Scatter(
        x=x,
        y=y3,
        name="Distinct Average"
    )
]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

// TODO: Add more subplots for more games.

Both games seem to have a distinct peak for which reviews are deemed most helpful. It is more helpful to think of these times as peaks as opposed to the explicit average votes up because the user interface and peer pressure may encourage a positive feedback loop. I may be able to correlate the use time band in which each product receives its most helpful reviews in order to recommend an intended use time that maximises helpful feedback.

Interestingly, the second game has been through two major iterations, and its peaks correlate with the time in which the game is expected to be beaten (~3 hours, and then ~6 hours for another playthrough). This may suggest that reviews are better when the user has just finished the game.

### Results

Although there were useful observations made about the kinds of reviews that are deemed the most helpful, since there was not a linear correlation in users who have used the product for longer receiving more positive votes on their review, this as a metric can be discarded from the issue of review equality.

An interesting piece of further investigation on this metric would be how long people actually use products for versus their expected 'completion' time (for applicable products). Furthermore, cross-referencing helpfulness against playtime could expose how long it takes to complete the game, and may avoid the need for polling as [some sites do](https://howlongtobeat.com).

## R3

### Aim

To discern whether the amount of time for which people play a game before leaving the best reviews is predictable.

### Purpose

If the time played and peak reviews have a correlation across products then we may be able to posit something that helps developers either plan the length of content, or recruit testers and in-depth feedback from a target demographic.

### Investigation

I calculated the peak review time on average for every product.

In [34]:
playtime_peak_votes_up = json.load(open('./dumps/playtime-peak-votes-up.json'))

x = [item['author_total_playtime'] / 60 for item in playtime_peak_votes_up]
y = [item['distinct_average_votes_up'] for item in playtime_peak_votes_up]

layout = go.Layout(
    title='Peak Votes Up By Playtime',
    xaxis=dict(
        title="Playtime (Hrs)"
    ),
    yaxis=dict(
        title="Peak Votes Up"
    ),
    showlegend=False
)

data = [
    go.Scatter(
        x=x,
        y=y,
        mode='markers'
    )
]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

The average peak review time is 50.15 hours, while the median is 5.18 hours. This suggests that a typical product receives its most useful reviews from users who have spent around 5 to 6 hours using it, which is a reasonable [completion time for most games on average](https://www.theringer.com/2016/8/25/16038806/video-game-length-playtimes-f7b8e38f949f). It is worth noting that it is hard to make a concrete connection between completion time and playtime for reviews as 'completion' as a concept has many [different definitions](https://howlongtobeat.com/stats.php), and the coverage for any definition against this sample set is far from complete.

This carries the caveat that it does not account for several factors we have explored along the way; it is apparent that the range of playtimes is very large, and this is likely because the kinds of game differ widely. The number of upvotes may also depend on the exposure of the game, which weakens the use of both axes.
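The gap between the mean and the median is what a heavily right-skewed distribution looks like: a handful of products with very long peak playtimes pull the mean up while leaving the median almost untouched. The sketch below illustrates this with hypothetical values, not the real dump.

```python
import statistics

# Hypothetical peak playtimes in hours for nine products.
# Most sit in single digits; one open-ended game peaks at 400 hours.
peak_playtimes_hours = [2, 3, 4, 5, 5, 6, 7, 9, 400]

mean_hours = statistics.mean(peak_playtimes_hours)
median_hours = statistics.median(peak_playtimes_hours)

# Dropping the single outlier shows how much it drives the mean.
mean_without_outlier = statistics.mean(peak_playtimes_hours[:-1])

print("Mean:", mean_hours)
print("Median:", median_hours)
print("Mean without outlier:", mean_without_outlier)
```

For this kind of data the median is the more representative summary of a typical product.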

## R4

### Aim

To discern whether the most helpful reviews are biased.

### Purpose

To decide whether reviews can be treated as equal.

Since there was no correlation between playtime and most helpful, the most helpful reviews are left by users that have a range of playtimes. Therefore, despite the assumption that people who put more time into a product are more likely to rate it highly, this is unimportant as their reviews are not necessarily rated higher.

In general I don't believe that the most helpful reviews will be biased; however, they are susceptible to being swayed by popular opinion, so it is worth asserting.

### Investigation

I can simply leverage the intermediate indexes produced in R1 - R3 and see the distribution of positive to negative. If it is not heavily biased then there is likely no trend.

In [35]:
top_reviews_voted_up = json.load(open('./dumps/top-reviews-voted-up.json'))

num_positive = 0
num_negative = 0

num_products_positive = 0
num_products_negative = 0

num_agree = 0
num_indeterminate = 0
positive_top_reviews_on_negative_products = 0
negative_top_reviews_on_positive_products = 0

for item in top_reviews_voted_up:
    if item['voted_up']:
        num_positive += 1
    else:
        num_negative += 1
    
    if item['user_score'] == 50:
        num_indeterminate += 1
    
    if item['user_score'] > 50:
        num_products_positive += 1
        
        if item['voted_up']:
            num_agree += 1
        else:
            negative_top_reviews_on_positive_products += 1
        
    if item['user_score'] < 50:
        num_products_negative += 1
        
        if not item['voted_up']:
            num_agree += 1
        else:
            positive_top_reviews_on_negative_products += 1

There are 833 products rated higher than 50 (and ergo 'positive' overall for the sake of this test), and 127 products rated lower than 50 (and ergo 'negative').

The top reviews appear to be slightly biased towards positivity in total, with 603 (~62%) of top reviews being positive, and 374 (~38%) being negative. I would expect this to more closely reflect the ratio of positive to negative products. This may be due to the sample size. To ensure nothing unusual is happening, and that each top review follows user opinions, we can see if each top review reflects the user score of the product.

Using 50 as the boundary for splitting negative and positive, the number of top reviews that align with their review score is 664, and the number that are indeterminate is 17. This totals ~70% of top reviews that do not go 'against the grain', which correlates with the number of positive to negative products. It is slightly less than I was expecting, however it does not account for changes in the product over time - a product that shifted in recent opinion may still have a high overall score (which is the reason the platform introduced 'recent' as a metric), and therefore we can assume that almost all top reviews reflect public opinion.

For the outliers, we look at what number went against a positive overall userscore versus a negative one, and the results are interesting - 29 top reviews were positive reviews on a negatively rated product, and 267 (~820% more) were negative reviews on a positively rated product. This may be because products more often end up alienating existing users than winning over new ones.
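The proportions quoted in this investigation follow directly from the counts. The sketch below re-derives them; the variable names are illustrative, and the counts are the ones stated in the text rather than read from the dump.

```python
# Counts stated in the investigation above.
num_positive_top = 603
num_negative_top = 374
num_agree = 664
num_indeterminate = 17
positive_on_negative_products = 29
negative_on_positive_products = 267

num_top_reviews = num_positive_top + num_negative_top

# Share of top reviews that are positive.
share_positive = num_positive_top / num_top_reviews * 100

# Share of top reviews that agree with, or are neutral to, the user score.
share_aligned = (num_agree + num_indeterminate) / num_top_reviews * 100

# How many more outliers go against a positive score than a negative one.
relative_increase = (
    (negative_on_positive_products - positive_on_negative_products)
    / positive_on_negative_products * 100
)

print(round(share_positive, 1), round(share_aligned, 1), round(relative_increase, 1))
```

The four outcome counts (agree, indeterminate, and the two kinds of outlier) also sum to the total number of top reviews, which is a useful consistency check.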

### Results

- Top reviews are slightly biased towards positive recommendations.
- Top reviews are largely reflective of the overall user opinion of the product.
- Outliers are much more likely to be negative than positive.

There is no strong bias and therefore reviews can be treated as equal for the context of this project.

## R5

### Aim

To discern whether reviews left at the start of a product's lifecycle are more biased towards positive.

### Purpose

There are many reasons one could posit as to why a product may have more positive reviews at its inception - stakeholders self-reviewing, optimism/leniency of early adopters etc.

If they are more positive, time series investigation cannot rely on the 'start' user score as a metric.

### Investigation

We can look at the start user score and determine how many are positive, and cross-reference this with the number of present user scores per band. If they differ drastically, there is a problem.

In [36]:
user_score_start_end = json.load(open('./dumps/user-score-start-end.json'))
present_user_scores = json.load(open('./dumps/present-user-scores.json'))
true_user_score = json.load(open('./dumps/true-user-score.json'))

x = [None] * 101
y = [None] * 101

x2 = [i for i in range(101)]
y2 = [None] * 101

for item in user_score_start_end:
    x[item['start_user_score']] = item['start_user_score']
    if y[item['start_user_score']]:
        y[item['start_user_score']] += 1
    else:
        y[item['start_user_score']] = 1
        
for item in true_user_score:
    if y2[item['true_user_score']] is None:
        y2[item['true_user_score']] = 1
    else:
        y2[item['true_user_score']] += 1

layout = go.Layout(
    title='Number Of Games Per Starting User Score',
    xaxis=dict(
        title="User Score"
    ),
    yaxis=dict(
        title="Total Products"
    ),
    showlegend=True
)

data = [go.Bar(
    x=x,
    y=y,
    name='Start User Scores'
),
go.Bar(
    x=x2,
    y=y2,
    name="True User Scores"
),
go.Bar(
    x=present_user_scores['x'],
    y=present_user_scores['y'],
    name='Present User Scores Reported By Steam'
)]

fig = go.Figure(data=data, layout=layout)

iplot(fig)

In [37]:
# Find the number of positive to negative reviews of all games.
print("Percentage of all reviews positive: " + str(round(1162276 / (1162276 + 370912) * 100, 2)) + "%")

Percentage of all reviews positive: 75.81%


### Results

As you can see, there is a greater proportion of products with higher user score at the start compared to present user scores. This does not account for the things that may happen during a game's lifecycle to make this happen, however scores at the start are much higher.

The majority still end up relatively optimistic, with ~75.8% of all reviews being positive - perhaps fans are more likely to leave reviews in the first place. Also, looking at the true user scores, we can see that the scores are spread out more evenly at the end. This implies that reviews left by users who acquired the product outside of the platform are more favourable. With 'true user score' we can see lower representation of higher end scores, which have spread down into the middle, and furthermore more products with less than 50.

Due to review counts and division in the function producing the start user score, there are spikes at multiples of 5.
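The spiking effect can be reproduced in isolation. This sketch assumes the start user score is the rounded percentage of positive reviews among a small number of early reviews; the function producing the real start score is not shown in this notebook, so the cap of 20 reviews and the rounding are assumptions.

```python
from collections import Counter

def reachable_scores(max_reviews):
    """Count how often each rounded score can occur.

    Enumerates every (positive, total) combination with up to
    `max_reviews` reviews and tallies the resulting integer score.
    """
    counts = Counter()
    for total in range(1, max_reviews + 1):
        for positive in range(total + 1):
            counts[round(positive / total * 100)] += 1
    return counts

counts = reachable_scores(20)  # assumed cap on early review count

print("Ways to reach 50:", counts[50])
print("Ways to reach 49:", counts[49])
print("Ways to reach 100:", counts[100])
```

With few reviews only a limited set of fractions is possible, so some scores are reachable many times while their neighbours cannot be reached at all, which shows up as spikes in the bar chart.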

## R6

The user score reported by the platform does not align with the information I have collected - this suggests that it is discounting some reviews (for example, those left by users who acquired the product key outside of the platform).

I constructed my own from all reviews, and will sweep over previous investigations to see if they differ drastically.