# Initial Hypotheses

## Contents

- [Context](#Context)
- [Definition Of Terms](#Definition-Of-Terms)
- [Assumptions](#Assumptions)
- [Hypotheses](#Hypotheses)

## Context

Testing my intuition seems like a logical way to begin exploratory data analysis.

I have collected product data from four discrete sources, each representing a different categorical feature. This should prove sufficient, although more kinds of data are available.

- Product listing
- Product reviews
- Product updates
- Product use

Since the simplified, overarching intention of the project is to improve the __['New Product Development'](https://en.wikipedia.org/wiki/New_product_development)__ (NPD) process of software, my initial hypotheses centre on potentially actionable outcomes. As a result, although many theories could be proposed about relationships within the data, I am focusing specifically on those whose confirmation could prove useful for lifecycle management in industry.

This should explain the absence of some seemingly *obvious* or interesting avenues of exploration, such as pricing and critical reception.

## Definition Of Terms

A *listing* distinguishes a product from other products. Its fields are largely not subject to change.

A *review* represents a pre-classified unit of sentiment, fixed at one per user.

An *update* is a modification to the product at a point in time.

*Use* refers to active time spent using the product.

A *user* represents somebody who has purchased the product, and encompasses a set of interactions within this system including:
- Using the product
- Publishing reviews about the product

A *developer* represents somebody with control over a product's lifecycle management, and encompasses a set of interactions within this system including:
- Publishing updates to the product's listing
- Publishing updates to the product itself
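
The terms above could be sketched as simple data structures. This is only an illustration under my own assumptions: the class and field names (`Review.positive`, `Use.duration`, `Product.add_review` and so on) are hypothetical and do not reflect the actual schema of the collected data. The one rule it encodes from the definitions is that a user may hold at most one review per product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Review:
    user_id: str
    positive: bool  # pre-classified sentiment (hypothetical field name)


@dataclass
class Update:
    published_at: datetime  # the point in time at which the product changed


@dataclass
class Use:
    user_id: str
    duration: timedelta  # active time spent using the product


@dataclass
class Product:
    listing: dict  # fields that distinguish the product; rarely change
    reviews: dict = field(default_factory=dict)  # user_id -> Review
    updates: list = field(default_factory=list)
    usage: list = field(default_factory=list)

    def add_review(self, review: Review) -> None:
        """Store a review, allowing at most one per user."""
        if review.user_id in self.reviews:
            raise ValueError(f"user {review.user_id} has already reviewed this product")
        self.reviews[review.user_id] = review


product = Product(listing={"name": "Example"})
product.add_review(Review(user_id="u1", positive=True))
```

Keying reviews by `user_id` makes the one-per-user rule cheap to check; a second review from `u1` raises `ValueError`.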

## Assumptions

Since the problem contains many moving parts, some aspects of the available data can be combined or ignored for the sake of simplicity, although this may reduce the accuracy and efficacy of the results.

There is a set of actions that may be undertaken during the product's lifecycle, which are constrained by this system. These actions have, for now, been reduced to publishing a product update.

It is reasonable to classify an update using a function of its delta size and versioning information. Any further qualifications of its type and content are ignored, as the incredible variance in their meta descriptions (represented chiefly through __['changelogs'](https://en.wikipedia.org/wiki/Changelog)__) would likely require an impractical amount of manual investigation. There are problems with this...
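
As a minimal sketch of such a classification function: the version format (`major.minor.patch`), the 100 MiB threshold, and the names `version_bump` and `classify_update` are all assumptions of mine for illustration, and real products may well version themselves differently.

```python
def version_bump(previous: str, current: str) -> str:
    """Return which component changed between two 'major.minor.patch' strings."""
    old = [int(part) for part in previous.split(".")]
    new = [int(part) for part in current.split(".")]
    for name, before, after in zip(("major", "minor", "patch"), old, new):
        if before != after:
            return name
    return "none"


def classify_update(previous: str, current: str, delta_bytes: int,
                    large_delta: int = 100 * 1024 * 1024) -> str:
    """Label an update by its version bump and whether its delta is large.

    `large_delta` is a hypothetical cut-off, not a value taken from the data.
    """
    bump = version_bump(previous, current)
    size = "large" if delta_bytes >= large_delta else "small"
    return f"{bump}/{size}"


label = classify_update("1.2.3", "1.3.0", delta_bytes=5 * 1024 * 1024)
```

Combining the two signals into one label keeps the number of update classes small, which suits the simplification described above.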

It is sensible to define a *positive* interaction as one that increases user engagement with the product. This may take the form of positive review scores or an increase in time spent using the product. Such interactions are therefore to be sought and maximised.

...

Anomalies...

## Hypotheses

### Products that are updated more frequently receive more positive reviews

#### Why?

If the ratio of positive to negative reviews trends upward with the number of updates a product has received, then this may provide business justification for using continued development as a means of increasing positive user engagement.

#### Exploration



```python
import pandas as pd
import numpy as np
import scipy as sp
import plotly as py
import plotly.graph_objs as go

from couchbase.n1ql import N1QLQuery

import source.shared.database

# Connect to Couchbase and open the bucket that stores reviews
db = source.shared.database.Couchbase()
bucket = db.cluster.open_bucket('review')

# Fetch each review's creation date and 'funny' vote count, oldest first
q = N1QLQuery('SELECT date_created, votes_funny FROM review ORDER BY date_created')
results = bucket.n1ql_query(q)

# Split the rows into x (dates) and y (vote counts) for plotting
x = []
y = []

for row in results:
    x.append(row['date_created'])
    y.append(row['votes_funny'])

# Render a bar chart inline in the notebook
data = [go.Bar(
    x=x,
    y=y
)]

py.offline.iplot(data, filename='jupyter/test')
```
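
The cell above plots `votes_funny` by creation date. As a possible next step towards the hypothesis itself, the sketch below computes a Pearson correlation between update counts and the share of positive reviews. It rests on several assumptions of mine: the rows are made-up toy values rather than real results, the helper names are hypothetical, and I use the positive share (positive / total) instead of the positive-to-negative ratio so that products with no negative reviews do not cause a division by zero.

```python
from math import sqrt


def positive_ratio(positive: int, negative: int) -> float:
    """Share of reviews that are positive (0.0 when there are no reviews)."""
    total = positive + negative
    return positive / total if total else 0.0


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Illustrative, made-up rows: (update_count, positive_reviews, negative_reviews)
products = [(1, 10, 10), (2, 12, 8), (3, 14, 6), (4, 16, 4)]

update_counts = [row[0] for row in products]
ratios = [positive_ratio(row[1], row[2]) for row in products]

# The toy data is exactly linear, so r comes out very close to 1
r = pearson(update_counts, ratios)
```

With real data, the same two lists could be built from per-product aggregates of the `review` and update records before calling `pearson`.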