## Crowdfunding Analysis - Final Project
### Goal: Predict whether a crowdfunding campaign will succeed or fail


#### Dataset: https://www.kaggle.com/yashkantharia/kickstarter-campaigns-dataset-20


### Importing dependencies and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # plotting

df = pd.read_csv('data.csv')

In [None]:
# Describing the Data
# df.describe()
df.info()

### Data Cleaning

* Duplicates
* Removing useless columns
* Converting variables
* Normalizing values

In [None]:
# id, creator_id and index are unique identifiers / serial values, so they carry no predictive signal
ul_cols = ['id', 'creator_id', 'index']
df = df.drop(columns=ul_cols)
df.info()

In [None]:
# Removing Duplicates
print("# of Duplicates: ", df.duplicated().sum())
df.drop_duplicates(inplace=True, ignore_index=True)

# Length after duplicate removal
print("Length of dataset after removal of duplicates is ", len(df))

In [None]:
# Inspect the status field before converting it to binary
df['status'].describe()

In [None]:
# Status to binary: 0 = failed, 1 = successful
# (any value other than "successful" is mapped to 0)
df['status'] = (df['status'] == 'successful').astype(int)
df['status'].describe()

In [None]:
df.info()

### Which features could affect the outcome?
* Title length (h0 - the shorter the better)
* Launch season (h0 - specific seasons may affect success rate)
* Launch year (h0 - crowdfunding grew in popularity as the years progressed; Google Trends shows a decrease in search interest since 2013, yet traffic to crowdfunding websites grew)
  * https://trends.google.com/trends/explore?date=today%205-y&q=indigogo
  * https://trends.google.com/trends/explore?date=today%205-y&q=kickstarter
  * https://www.similarweb.com/website/kickstarter.com/#overview
  * https://www.similarweb.com/website/indiegogo.com/#overview
* Length (deadline - launched_at = campaign length, or duration; is there an optimal number?)
* Categories (h0 - some are more successful than others)
* Goal (may be treated as an interval scale)
* City (some cities might be more successful than others; if so, why?)
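The campaign length feature from the list above could be derived roughly as follows. This is a minimal sketch on a toy frame: the `deadline` column name is assumed from the feature list, and `duration_days` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; `deadline` is an assumed column name.
toy = pd.DataFrame({
    "launched_at": ["2019-01-01 10:00:00", "2019-03-15 08:30:00"],
    "deadline": ["2019-01-31 10:00:00", "2019-05-14 08:30:00"],
})

# Parse both columns as datetimes so they can be subtracted
launched = pd.to_datetime(toy["launched_at"])
deadline = pd.to_datetime(toy["deadline"])

# Campaign length in whole days (hypothetical column name)
toy["duration_days"] = (deadline - launched).dt.days
print(toy["duration_days"].tolist())
```

Whole days are probably precise enough here, since campaign lengths are usually chosen in days or weeks.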

#### Title Length

In [None]:
# Title length: number of characters in each campaign name
df['title_length'] = df['name'].str.len()
df.info()

#### Launch year + season
* Seasons (Northern Hemisphere, matching the month mapping used in the code):
  * 1 - Winter
  * 2 - Spring
  * 3 - Summer
  * 4 - Fall

In [None]:
# Extracting launch year and season
# Convert the series to datetime using pandas
launch = pd.to_datetime(df['launched_at'])
# add the year launched to the df
df['year_launched'] = launch.dt.year

In [None]:
# Converting months to seasons
months = launch.dt.month
# Season code for each month, January through December
seasons = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 1]
# Build a month -> season dictionary
month_to_season = dict(zip(range(1, 13), seasons))
# Map each launch month to its season code
df['season_launched'] = months.map(month_to_season)
df.info()
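A quick way to check the launch-season hypothesis before modelling is to compare success rates per season. The sketch below uses a small made-up frame with the same `status` and `season_launched` columns. The values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Illustrative values only; not real campaign data.
toy = pd.DataFrame({
    "status": [1, 0, 1, 1, 0, 0],
    "season_launched": [1, 1, 2, 2, 3, 4],
})

# Mean of a 0/1 column is the share of successful campaigns in each group
success_by_season = toy.groupby("season_launched")["status"].mean()
print(success_by_season)
```

The same `groupby` pattern should work for `year_launched` or any other categorical feature.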

### Importing models for the first iteration
* Logistic Regression
* Random Forest


In [None]:
!pip install scikit-learn

In [None]:
# General dependencies + Scikit learn RF classifier
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Splitting data into train and test datasets (random mask, roughly 80/20)
part = np.random.rand(len(df)) < 0.8
train = df[part]
test = df[~part]
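The random mask above gives a different split on every run and does not guarantee the same class balance in both parts. One possible alternative, sketched here on a toy frame, is scikit-learn's `train_test_split` with `stratify` and a fixed `random_state`. The names `train_s` and `test_s` are hypothetical, chosen so they do not clash with the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a balanced binary label; illustrative values only.
toy = pd.DataFrame({
    "title_length": range(10, 30),
    "status": [0, 1] * 10,
})

# Reproducible 80/20 split that keeps the status ratio equal in both parts
train_s, test_s = train_test_split(
    toy, test_size=0.2, stratify=toy["status"], random_state=0
)
print(len(train_s), len(test_s))
```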

In [None]:
df.info()

In [None]:
# Preparing training and test data
cols = ['backers_count', 'blurb_length', 'title_length', 'season_launched', 'year_launched']
x_train = train[cols]
y = train['status']
x_test = test[cols]
y_test = test['status']

In [None]:
x_train.info()

In [None]:
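# Random Forest
# status is a binary label, so use the classifier rather than the regressor
# (a regressor's default cross-validation score is R^2, not accuracy)
from sklearn.ensemble import RandomForestClassifier
m = RandomForestClassifier(n_estimators=20, random_state=0)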

In [None]:
# Naive Bayes
from sklearn.naive_bayes import GaussianNB  
m = GaussianNB(priors=None, var_smoothing=1e-09)

In [None]:
scores = cross_val_score(m, x_train, y, cv = 10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
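Because both model cells assign to `m`, the cell above scores only whichever model was defined last. A minimal sketch of scoring both models side by side follows. It uses synthetic data from `make_classification` so that it runs on its own; the names `X_demo`, `y_demo`, `models` and `results` are hypothetical, and the fold settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for x_train / y (assumption: 5 numeric features, binary label)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Stratified folds keep the class ratio stable; repeats reduce split noise
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=cv)
    results[name] = scores
    print(f"{name}: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```

Swapping `X_demo` and `y_demo` for `x_train` and `y` should give the same comparison on the campaign data.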