# Exploratory Data Analysis Overview

## Contents

- [Context](#Context)
- [Model Definition](#Model-Definition)
- [Assumptions](#Assumptions)
- [Known Factors Influencing Data](#Known-Factors-Influencing-Data)
- [Data Overview](#Data-Overview)

## Context

Simplified, the intention of the project is to improve the __['New Product Development'](https://en.wikipedia.org/wiki/New_product_development)__ (NPD) process of software; that is, to inform product owners and help them tune their release cycle to maximise their intended outcomes.

My initial hypotheses centre on potentially actionable outcomes. Consequently, although many theories could be formed about relationships within the data, I focus specifically on those whose confirmation could prove useful in the context of industrial lifecycle management. This explains the absence of some seemingly obvious or interesting avenues of exploration, such as pricing, which is elaborated upon elsewhere.

We use 'product' because the digital platform has expanded beyond video games, so some of the products here are software tools. This breadth provides interesting avenues for exploration and comparison of findings; in addition to seeing whether trends hold true for the entire sample set, we may posit why based on the kind of product involved.

...

## Model Definition

#### Users

A *user* represents somebody who has purchased the product, and encompasses a set of interactions within this system including:
- Using the product
- Publishing reviews about the product

#### Developers

A *developer* represents somebody with control over a product's lifecycle management, and encompasses a set of interactions within this system including:
- Publishing updates to the product's listing
- Publishing updates to the product itself

#### Data Model

I have collected product data from four discrete sources, which represent different categorical features. This should prove sufficient, although more kinds of data are available.

- Product listing
- Product reviews
- Product updates
- Product usage

##### Listing

A *listing* distinguishes a product from other products. Listings are stored in the 'store' bucket (the discrepancy in nomenclature is explained in the source). The fields are largely not subject to change. Each listing may have multiple 'genres', also referred to as 'categories' or 'tags'.

```json
{
  "date_released": "8 Nov, 2013",
  "genres": [
    "Indie",
    "RPG",
    "Simulation",
    "Early Access"
  ],
  "is_free": false,
  "metacritic_score": "68",
  "name": "Example Product",
  "owners": 872046,
  "players_average_forever": 1231,
  "players_forever": 837417,
  "players_median_forever": 391,
  "price": "1499",
  "product_id": "108600",
  "score_rank": 65,
  "user_score": 86
}
```
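
Several listing fields (the release date, price and Metacritic score) are stored as strings in the sample above. The following is a minimal sketch of normalising them before analysis. `normalise_listing` is a hypothetical helper, and it assumes that every release date follows the "8 Nov, 2013" pattern and that `price` and `metacritic_score` always hold integer strings; neither assumption is confirmed by the source.

```python
from datetime import datetime


def normalise_listing(listing):
    """Return a copy of a listing with string-encoded fields converted.

    Hypothetical helper: assumes "date_released" looks like "8 Nov, 2013"
    and that "price" and "metacritic_score" hold integer strings.
    """
    parsed = dict(listing)
    parsed["date_released"] = datetime.strptime(
        listing["date_released"], "%d %b, %Y"
    ).date()
    parsed["price"] = int(listing["price"])
    parsed["metacritic_score"] = int(listing["metacritic_score"])
    return parsed


# A subset of the fields from the sample listing above.
example = {
    "date_released": "8 Nov, 2013",
    "metacritic_score": "68",
    "price": "1499",
}
parsed = normalise_listing(example)
```

Real records may have missing or differently formatted values, so a production version would need to handle parse failures.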

##### Review

A *review* represents a pre-classified unit of sentiment, one per user. It also contains some flattened information about the user.

```json
{
  "author_id": "76561682049575891",
  "author_last_played": 1475862888,
  "author_num_games_owned": 90,
  "author_num_reviews": 3,
  "author_total_playtime": 623,
  "date_created": 1399060157,
  "language": "english",
  "product_id": "284850",
  "received_for_free": false,
  "recommendation_id": "16000027",
  "review_length": 122,
  "voted_up": true,
  "votes_funny": 0,
  "votes_up": 0,
  "written_during_early_access": true
}
```

##### Update

An *update* is a modification to the product at a point in time.

```json
{
  "date_created": 1387226838,
  "feed_name": "steam_community_announcements",
  "product_id": "2281001",
  "update_id": "1999102271521259752"
}
```

##### Usage

*Usage* refers to active time spent using the product. This is represented as daily concurrent users since the product's release. The series has been flattened and denormalized to promote speed; the timestamp of any value can be derived from its index, the start time (a Unix epoch) and the step.

```json
{
  "product_id": "593440",
  "start": "1366834400",
  "step": "86400",
  "values": "[1, 16, 31, 46, 61, 76, 91, 106 ... 13]"
}
```
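
As a sketch of how such a record might be indexed, the helpers below map between a position in `values` and a Unix timestamp. They assume that `step` is expressed in seconds (86400 being one day) and that the first value corresponds to `start`; `timestamp_at` and `index_at` are hypothetical names, and the values in the example are made up.

```python
import json


def timestamp_at(usage, index):
    """Unix time of the sample at `index`: start + index * step."""
    return int(usage["start"]) + index * int(usage["step"])


def index_at(usage, unix_time):
    """Index of the sample covering `unix_time` (floor division by step)."""
    return (unix_time - int(usage["start"])) // int(usage["step"])


# Illustrative record: start and step match the sample above,
# but the values are invented and not taken from the data set.
usage = {
    "start": "1366834400",
    "step": "86400",
    "values": "[1, 16, 31, 46]",
}

# The series itself is stored as a JSON-encoded string.
values = json.loads(usage["values"])

third_day = timestamp_at(usage, 2)
same_index = index_at(usage, third_day + 100)
```

Because the fields are stored as strings, each helper converts them with `int` on every call; caching the parsed record would be a sensible optimisation for large scans.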

## Assumptions

TODO: Revisit

Since the problem contains many moving parts, there are aspects of the available data that can be compounded or ignored for the sake of simplicity (this may reduce the accuracy and efficacy of the results). I will attempt to mitigate the negative consequences of these assumptions where possible.

- There is a set of actions that may be undertaken during the product's lifecycle, which are constrained by this system. These actions have, for now, been reduced to publishing a product update as the only action of importance.

- The contents of a review are unimportant (although I do log the length of a review).

- It is sensible to define a *positive* interaction as one that increases user engagement with the product. This may take the form of more owners, more reviews, an increase in time spent using it, etc. We assume that this is desirable.

- It is reasonable to disregard the classification of an update in favour of it simply having happened. Any further qualification of its type and content is ignored, as the considerable variance in update descriptions (represented chiefly through __['changelogs'](https://en.wikipedia.org/wiki/Changelog)__) would likely require an infeasible amount of manual investigation. As the systems developers use to publish updates are not restrictive, an 'update' may not actually represent an update in the sense that I am seeking. Measures have been taken to minimise this risk: from a cursory trawl through the endpoint used, it seems that for my sample set the vast majority of developers exclusively tag updates as "community_announcement", which has consequently been used as a filter during collection.
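
The collection-time filter described in the last assumption can be sketched as follows. This is an illustration only: it assumes the tag is exposed through the `feed_name` field and uses the value seen in the sample update record ("steam_community_announcements"); the collector's actual filter may differ. `filter_updates` is a hypothetical name.

```python
# Assumption: the feed name is taken from the sample update record;
# the collector's real filter may differ.
ACCEPTED_FEED = "steam_community_announcements"


def filter_updates(updates, accepted_feed=ACCEPTED_FEED):
    """Keep only update records published through the accepted feed."""
    return [u for u in updates if u.get("feed_name") == accepted_feed]


# Illustrative records; the identifiers and the second feed name are made up.
updates = [
    {"update_id": "1", "feed_name": "steam_community_announcements"},
    {"update_id": "2", "feed_name": "some_other_feed"},
]
kept = filter_updates(updates)
```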

## Known Factors Influencing Data

TODO: Revisit

There are factors that may influence and, furthermore, corrupt the data. I will attempt to conduct changepoint analysis in order to mitigate this, and otherwise keep these factors in mind when investigating anomalies or unexpected results. They are listed here for posterity.

- The date influences the number of concurrent users (CCU) through a localised seasonal effect: it spikes at weekends, and drops during weekdays when users are otherwise at work.

- Price changes influence CCU and review sentiment: a decrease in price correlates with a spike in users and in positive review scores.

- External publicity of a product may affect any aspect of the data; for example, 'review bombing' sees an increase in negative reviews when the product is attached to controversy, irrespective of the product itself.
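
One simple way to inspect the weekly seasonal effect is to average a usage series by day of the week. The sketch below is a descriptive check rather than the changepoint analysis mentioned above. It assumes daily samples whose timestamps follow `start + i * step` and interprets them in UTC; `mean_by_weekday` is a hypothetical helper and the series is made up.

```python
from collections import defaultdict
from datetime import datetime, timezone


def mean_by_weekday(start, step, values):
    """Average the samples per weekday (0 = Monday ... 6 = Sunday), in UTC."""
    buckets = defaultdict(list)
    for i, value in enumerate(values):
        day = datetime.fromtimestamp(start + i * step, tz=timezone.utc)
        buckets[day.weekday()].append(value)
    return {wd: sum(vs) / len(vs) for wd, vs in sorted(buckets.items())}


# Made-up series of 14 daily samples starting Monday 2018-01-01 00:00 UTC,
# with higher values on Saturdays and Sundays.
start = 1514764800
values = [10, 10, 10, 10, 10, 20, 20] * 2
profile = mean_by_weekday(start, 86400, values)
```

A weekend mean that is consistently higher than the weekday mean would be consistent with the seasonal effect described in the first factor, although users' local time zones are ignored here.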

## Data Overview

#### Overview

The sample set after collection and culling is tailored to my investigation. Although the exact mechanical steps have not been explicitly logged, the software used for collection is open source, and a database snapshot (backup) can be provided upon request.

I have collected:
- 1,583 listings
- 1,549,896 reviews
- 38,721 updates
- Usage information for all listings

#### Semantics

In terms of representation, there are 24 genres (sans 'early access'). Understandably, most games are tagged as 'indie', meaning independently published. As there is a limited number of tags per game, it is reasonable to assume that all products are 'independent' (most publishers forego 'early access' as a release strategy as their purpose is to circumvent it) and that some have simply forgotten or opted not to tag themselves as 'indie'.

```python
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go

# Load plotly.js from the CDN so the chart renders inline in the notebook.
init_notebook_mode(connected=True)

# Number of listings tagged with each genre ('Early Access' excluded).
product_categories = [
    {"genre": "Accounting", "count": 1},
    {"genre": "Action", "count": 936},
    {"genre": "Adventure", "count": 520},
    {"genre": "Animation & Modeling", "count": 19},
    {"genre": "Audio Production", "count": 6},
    {"genre": "Casual", "count": 368},
    {"genre": "Design & Illustration", "count": 23},
    {"genre": "Education", "count": 12},
    {"genre": "Free to Play", "count": 140},
    {"genre": "Gore", "count": 93},
    {"genre": "Indie", "count": 1262},
    {"genre": "Massively Multiplayer", "count": 129},
    {"genre": "Nudity", "count": 23},
    {"genre": "RPG", "count": 364},
    {"genre": "Racing", "count": 85},
    {"genre": "Sexual Content", "count": 10},
    {"genre": "Simulation", "count": 467},
    {"genre": "Software Training", "count": 5},
    {"genre": "Sports", "count": 111},
    {"genre": "Strategy", "count": 383},
    {"genre": "Utilities", "count": 25},
    {"genre": "Video Production", "count": 7},
    {"genre": "Violent", "count": 141},
    {"genre": "Web Publishing", "count": 2},
]

# Split the records into parallel axes for the bar chart.
x = [category["genre"] for category in product_categories]
y = [category["count"] for category in product_categories]

data = [go.Bar(x=x, y=y)]

iplot(data, filename='jupyter/test')
```
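
To put the claim that most games are tagged 'indie' into numbers, the counts can be expressed as a share of all listings. This is a minimal sketch, assuming the 1,583 listings reported in the overview are the correct denominator for the genre tally; `genre_share` is a hypothetical helper.

```python
TOTAL_LISTINGS = 1583  # listing count reported in the overview


def genre_share(count, total=TOTAL_LISTINGS):
    """Percentage of listings carrying a genre tag, rounded to one decimal."""
    return round(100 * count / total, 1)


# Counts copied from the genre tally in this section.
indie_share = genre_share(1262)   # roughly 79.7
action_share = genre_share(936)   # roughly 59.1
```

Since a listing can carry several genres, these shares are not expected to sum to 100.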

#### Ranges

The 

#### Constraints