<h1><center>Bayesian Data Analysis project report</center></h1>

<img src="http://icopartners.com/newblog/wp-content/uploads/2018/02/header.png">

<h1><center>Luxury Shirts Inc.</center></h1>


 <img src="shirt2.jpeg"> 

* First product (the shirt pictured above) to launch on Kickstarter (https://www.kickstarter.com/about?ref=global-footer)
* Price point for the shirt is 200€
* The goal of the campaign is to gather 200 000€ to cover production and other expenses

#### How likely is the project to succeed, or should the pricing and/or goal be changed?

## Introduction

The goal of this project and report is to analyze whether the presented business case seems plausible based on historical Kickstarter project data.

Kickstarter is a platform where users can submit projects online and gather funding from other Kickstarter users. The main success metric for a project is whether it collected at least as much pledged money as the goal set for it. If the project does not meet or exceed the goal, all the pledged funding is canceled and the project is labeled unsuccessful.

In this analysis, we estimate the number of backers (people willing to fund the project) needed for a high probability of success given the scale of the project, and check whether this project fits within that predicted requirement.
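As a rough sanity check on scale, the campaign's own numbers already imply a lower bound on the number of backers. The sketch below assumes (this is not stated in the source) that every backer pledges exactly one shirt at the full 200€ price; `min_backers` is a hypothetical helper name used only for illustration.

```python
import math


def min_backers(goal_eur: float, pledge_per_backer_eur: float) -> int:
    """Smallest number of backers whose pledges reach the goal,
    assuming every backer pledges the same amount."""
    return math.ceil(goal_eur / pledge_per_backer_eur)


# Campaign figures from the business case: 200 000€ goal, 200€ per shirt.
required = min_backers(200_000, 200)
print("Minimum backers at one shirt each:", required)
```

Under that assumption the campaign needs at least 1000 backers, which gives a reference point for the backer counts the model predicts.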

This notebook will present the following:
* Description of the data and analysis problem
* Description of the used model and comparison to other tested models
* Discussion about used priors
* Technical implementation of the model and running it
* Convergence analysis and predictive checking
* Conclusion based on the results
* Discussion about potential improvements

## Description of data and analysis problem

The data used for this analysis is historical Kickstarter project data downloaded from Kaggle (https://www.kaggle.com/kemical/kickstarter-projects).

The main data points used per project are:
* Success (successful/failed)
* Goal of the project ($)
* Number of backers (int)
* Duration of the project
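Duration is not a raw field in itself; it can be derived from a project's launch and deadline timestamps. A minimal sketch, assuming the two timestamps are available as columns named `launched` and `deadline` (both names, and the example rows, are assumptions made for illustration):

```python
import pandas as pd

# Two illustrative projects with launch and deadline timestamps.
df = pd.DataFrame({
    "launched": ["2015-06-30 16:17:39", "2015-02-02 14:09:20"],
    "deadline": ["2015-07-30 16:17:39", "2015-03-01 22:00:00"],
})

# Parse the strings and take the difference in whole days.
delta = pd.to_datetime(df["deadline"]) - pd.to_datetime(df["launched"])
df["duration_days"] = delta.dt.days

print(df)
```

Note that `.dt.days` drops the fractional part of a day, so a campaign lasting 27 days and 8 hours counts as 27 days.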

Other data were also available for the projects such as the amount of money pledged, but due to the nature of the analysis problem we only concentrate on variables that are known before the project has started.

The data set has over 370 000 data points, but we cut it down using thresholds (projects with fewer than 10 backers are removed) and resampled it to 100 projects for more efficient computation. The success metric has also been converted to a binary variable instead of one containing multiple values (for example, all canceled projects are removed).
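The filtering steps described above can be sketched with pandas on a small toy frame. This is an illustrative alternative, not the notebook's own implementation (which works on NumPy arrays); the column names `state`, `backers` and `goal`, the helper `prepare`, and the toy rows are all assumptions.

```python
import pandas as pd

# Toy stand-in for the raw project data.
raw = pd.DataFrame({
    "state": ["successful", "failed", "canceled", "successful", "failed", "live"],
    "backers": [120, 15, 300, 4, 57, 80],
    "goal": [5000, 30000, 10000, 1000, 90000, 2000],
})


def prepare(df, min_backers=10, n_samples=None, seed=0):
    """Keep finished projects, encode success as 0/1, drop small projects,
    and optionally draw a bootstrap sample (with replacement)."""
    out = df[df["state"].isin(["successful", "failed"])].copy()
    out["success"] = (out["state"] == "successful").astype(int)
    out = out[out["backers"] >= min_backers]
    if n_samples is not None:
        out = out.sample(n=n_samples, replace=True, random_state=seed)
    return out.reset_index(drop=True)


clean = prepare(raw)
print(clean)
```

Sampling with replacement mirrors the default behavior of `sklearn.utils.resample`, so the resampled set can contain the same project more than once.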

If you want to run the model yourself, you should download the zip file from the following link and extract the csv file as "projects.csv": https://www.kaggle.com/kemical/kickstarter-projects/downloads/kickstarter-projects.zip/3

In the following code we transform the data into a useful form and display the first few rows to get a sense of what it looks like. How the data is used in our model is discussed in the next section.

In [14]:
#Import libraries that will be used throughout this notebook
import pystan
from pystan import StanModel
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import pandas as pd
from IPython.display import display, HTML
from psis import psisloo
from sklearn.model_selection import StratifiedKFold, ShuffleSplit, KFold
from sklearn.utils import resample

In [24]:
projects = pd.read_csv("projects.csv", header=None)
pm = np.array(projects) #All project data
pm = np.delete(pm, (0), axis=0) #Delete top row (the header)
pm = np.delete(pm, (0,1,2,4,8,11,12,13,14,15,16), axis=1) #Delete unneeded columns


success_idx = pm[:,4] == "successful" #Boolean mask of rows that were successful
failed_idx = pm[:,4] == "failed" #Boolean mask of rows that failed

#Remove all rows where the end result is not successful or failed
remove_idx = np.where(~np.logical_or(success_idx,failed_idx))
pm = np.delete(pm, remove_idx, axis=0)

#Find new indexes for success and fail cases
success_idx = pm[:,4] == "successful"
failed_idx = pm[:,4] == "failed"

#Replace failed values with 0 and successful values with 1
pm[success_idx,4] = 1
pm[failed_idx,4] = 0

#Remove all rows where the data points are below a threshold
remove_idx = np.where(pm[:,5].astype(int) < 10) #Remove all projects with backer count less than 10
pm = np.delete(pm, remove_idx, axis=0)
pm = resample(pm,n_samples=100,random_state=0) #Resample data used in fitting into model

evaluated_projets = pm.shape[0] #All projects evaluated

print("Data points after formatting data and resampling: ",evaluated_projets)

pd.DataFrame(pm).to_csv("sanitized.csv",header=False, index=False)
#Note: the first timestamp column is the campaign deadline (end) and the second is the launch time (start)
sanitized = pd.read_csv("sanitized.csv", header=None, names=["Category","End time","Goal ($)","Start time","Success (bool)","Backers (int)"])
display(HTML(sanitized.head(50).to_html(max_rows=50)))

Data points after formatting data and resampling:  100

| | Category | End time | Goal ($) | Start time | Success (bool) | Backers (int) |
|---|---|---|---|---|---|---|
| 0 | Film & Video | 2015-07-30 16:17:39 | 90000 | 2015-06-30 16:17:39 | 0 | 264 |
| 1 | Music | 2011-03-08 20:37:28 | 2000 | 2011-02-06 20:37:28 | 1 | 86 |
| 2 | Music | 2012-01-15 07:17:17 | 1000 | 2011-12-27 07:17:17 | 1 | 30 |
| 3 | Technology | 2015-01-23 21:24:31 | 30000 | 2014-12-24 21:24:31 | 0 | 57 |
| 4 | Film & Video | 2016-12-27 14:54:28 | 5000 | 2016-11-17 14:54:28 | 1 | 57 |
| 5 | Music | 2015-03-01 22:00:00 | 3000 | 2015-02-02 14:09:20 | 1 | 91 |
| 6 | Film & Video | 2013-09-03 03:23:29 | 2000 | 2013-07-25 03:23:29 | 1 | 26 |
| 7 | Theater | 2013-03-31 03:14:23 | 3800 | 2013-03-01 03:14:23 | 1 | 62 |
| 8 | Art | 2016-01-04 16:33:41 | 50000 | 2015-11-30 16:33:41 | 0 | 158 |
| 9 | Music | 2014-08-31 21:00:00 | 7500 | 2014-07-23 02:58:58 | 1 | 184 |


## Description of the model

Because each outcome is binary (success or failure), a binomial model is a natural choice for this problem. The data points $y_i$ are binomially distributed and the model is of the form: <h3>$$y_i \mid \theta_i \sim \mathrm{Bin}(n_i, \theta_i),$$</h3>

where $\theta_i$ is the probability of success for project $i$; with a single binary outcome per project, $n_i = 1$. The form of $\theta_i$ and the variations tested during this project are discussed below. One example of the models tested is <h3>$$\operatorname{logit}(\theta_i) = a + bx + cy,$$</h3>

where $x$ is the number of backers of the project and $y$ is the goal of the project (not to be confused with the outcome $y_i$ above). The logit transformation ensures that the probability $\theta_i$ lies between 0 and 1. To draw insights from this equation, we fix a desired probability of success (80% in our case) and use the parameter values drawn from the posterior distribution to evaluate the distribution of the variable of interest (such as the backer count).
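To make the last step concrete: fixing the target probability $p$ and solving $\operatorname{logit}(p) = a + bx + cy$ for $x$ gives $x = (\operatorname{logit}(p) - a - cy)/b$, which can be evaluated once per posterior draw. The sketch below is only illustrative. The "posterior draws" are synthetic stand-ins generated with NumPy, not output from the fitted model; the helper name `required_backers` is hypothetical; and it assumes unscaled covariates and $b > 0$ in every draw.

```python
import numpy as np


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))


def required_backers(a, b, c, goal, p_target=0.8):
    """Backer count x that solves logit(p_target) = a + b*x + c*goal.
    Assumes b is positive and not close to zero."""
    return (logit(p_target) - a - c * goal) / b


# Synthetic stand-ins for posterior draws of a, b and c (NOT real results).
rng = np.random.default_rng(0)
a = rng.normal(-1.0, 0.1, size=4000)
b = rng.normal(0.01, 0.001, size=4000)
c = rng.normal(-1e-4, 1e-5, size=4000)

# One required-backer value per draw, for a hypothetical goal of 200 000.
x = required_backers(a, b, c, goal=200_000)
print("5%, 50%, 95% quantiles:", np.percentile(x, [5, 50, 95]))
```

Evaluating the formula draw by draw carries the posterior uncertainty in $a$, $b$ and $c$ through to the required backer count, so the result is a distribution rather than a single number.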

## Discussion about priors
...

## Stan implementation and running
...

## Convergence analysis
...

## Model comparison
...

## Results/Conclusion
...

## Problems and improvement ideas
...