## Project 2: Machine Learning
### **Kieran Walsh**, **Dillan Gajarawala**
#### Dataset: Kickstarter Projects
#### Link: https://www.kaggle.com/kemical/kickstarter-projects#ks-projects-201801.csv

### Importing Required Packages

In [6]:
import pandas as pd
import sklearn as sk
import numpy as np
import warnings
warnings.filterwarnings('ignore')

data_path = 'ks-projects-201801.csv'

### Importing the Data

In [7]:
cols = ['ID', 'name', 'category', 'main_category', 'currency', 'deadline', 'goal', 'launched',
       'pledged', 'state', 'backers', 'country', 'usd pledged', 'usd_pledged_real', 'usd_goal_real']

In [8]:
projects = pd.read_csv(data_path, usecols=cols, index_col='ID')

In [9]:
projects.head()

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


### Data Exploration

In [10]:
len(projects)

378661

In [11]:
projects.state.value_counts()

failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
Name: state, dtype: int64

In [12]:
projects.main_category.value_counts()

Film & Video    63585
Music           51918
Publishing      39874
Games           35231
Technology      32569
Design          30070
Art             28153
Food            24602
Fashion         22816
Theater         10913
Comics          10819
Photography     10779
Crafts           8809
Journalism       4755
Dance            3768
Name: main_category, dtype: int64

In [13]:
projects.category.value_counts()

Product Design     22314
Documentary        16139
Music              15727
Tabletop Games     14180
Shorts             12357
                   ...  
Residencies           69
Letterpress           49
Chiptune              35
Literary Spaces       27
Taxidermy             13
Name: category, Length: 159, dtype: int64

In [14]:
projects.currency.value_counts()

USD    295365
GBP     34132
EUR     17405
CAD     14962
AUD      7950
SEK      1788
MXN      1752
NZD      1475
DKK      1129
CHF       768
NOK       722
HKD       618
SGD       555
JPY        40
Name: currency, dtype: int64

In [15]:
projects.country.value_counts()

US      292627
GB       33672
CA       14756
AU        7839
DE        4171
N,0"      3797
FR        2939
IT        2878
NL        2868
ES        2276
SE        1757
MX        1752
NZ        1447
DK        1113
IE         811
CH         761
NO         708
HK         618
BE         617
AT         597
SG         555
LU          62
JP          40
Name: country, dtype: int64

In [16]:
projects.isna().sum()

name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

### Data Cleaning

In [17]:
# remove rows with states we do not want to predict
misc_states = ['undefined', 'live', 'suspended', 'canceled']
projects = projects[~projects.state.isin(misc_states)]
projects.state.value_counts()

failed        197719
successful    133956
Name: state, dtype: int64

In [18]:
len(projects)

331675

In [19]:
# remove columns we do not need
projects = projects.drop(['usd pledged', 'name', 'goal', 'pledged', 'country', 'category'], axis=1)
projects.head()

ID,main_category,currency,deadline,launched,state,backers,usd_pledged_real,usd_goal_real
1000002330,Publishing,GBP,2015-10-09,2015-08-11 12:12:28,failed,0,0.0,1533.95
1000003930,Film & Video,USD,2017-11-01,2017-09-02 04:43:57,failed,15,2421.0,30000.0
1000004038,Film & Video,USD,2013-02-26,2013-01-12 00:20:50,failed,3,220.0,45000.0
1000007540,Music,USD,2012-04-16,2012-03-17 03:24:11,failed,1,1.0,5000.0
1000014025,Food,USD,2016-04-01,2016-02-26 13:38:27,successful,224,52375.0,50000.0


In [20]:
projects.dtypes

main_category        object
currency             object
deadline             object
launched             object
state                object
backers               int64
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

In [21]:
# convert date strings to date objects for calculation
import datetime as dt
projects['deadline'] = projects['deadline'].map(lambda x: dt.datetime.strptime(x, "%Y-%m-%d").date())
projects['launched'] = projects['launched'].map(lambda x: dt.datetime.strptime(x, "%Y-%m-%d %H:%M:%S").date())
projects.head()

ID,main_category,currency,deadline,launched,state,backers,usd_pledged_real,usd_goal_real
1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,0.0,1533.95
1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,2421.0,30000.0
1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,220.0,45000.0
1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,1.0,5000.0
1000014025,Food,USD,2016-04-01,2016-02-26,successful,224,52375.0,50000.0


In [22]:
# create a new column which is the number of days the kickstarter was open
projects['days_open'] = projects.apply(lambda row: (row.deadline-row.launched).days, axis=1)

In [23]:
projects.head()

ID,main_category,currency,deadline,launched,state,backers,usd_pledged_real,usd_goal_real,days_open
1000002330,Publishing,GBP,2015-10-09,2015-08-11,failed,0,0.0,1533.95,59
1000003930,Film & Video,USD,2017-11-01,2017-09-02,failed,15,2421.0,30000.0,60
1000004038,Film & Video,USD,2013-02-26,2013-01-12,failed,3,220.0,45000.0,45
1000007540,Music,USD,2012-04-16,2012-03-17,failed,1,1.0,5000.0,30
1000014025,Food,USD,2016-04-01,2016-02-26,successful,224,52375.0,50000.0,35
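
The `days_open` values above can be checked by hand. The following is a minimal standard-library sketch of the same calculation on the raw date strings; the function name `days_open` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

def days_open(launched, deadline):
    """Whole days between the launch date and the deadline date."""
    # The launch timestamp is truncated to a date, as in the notebook.
    launched_day = datetime.strptime(launched, "%Y-%m-%d %H:%M:%S").date()
    deadline_day = datetime.strptime(deadline, "%Y-%m-%d").date()
    return (deadline_day - launched_day).days

# First row of the table: launched 2015-08-11, deadline 2015-10-09.
print(days_open("2015-08-11 12:12:28", "2015-10-09"))  # 59
```

Because the time of day is dropped before subtracting, a campaign launched late in the evening counts the same as one launched that morning.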


### Data Preprocessing

In [26]:
# select the target column (state) and the candidate feature columns
cols = ['main_category', 'currency', 'state', 'backers', 'usd_pledged_real', 'usd_goal_real', 'days_open']
proj2 = projects[cols]

In [27]:
proj2.head()

ID,main_category,currency,state,backers,usd_pledged_real,usd_goal_real,days_open
1000002330,Publishing,GBP,failed,0,0.0,1533.95,59
1000003930,Film & Video,USD,failed,15,2421.0,30000.0,60
1000004038,Film & Video,USD,failed,3,220.0,45000.0,45
1000007540,Music,USD,failed,1,1.0,5000.0,30
1000014025,Food,USD,successful,224,52375.0,50000.0,35


In [28]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import make_column_transformer

In [29]:
train, test = train_test_split(proj2)

In [30]:
train.state.value_counts(normalize=True)

failed        0.595881
successful    0.404119
Name: state, dtype: float64
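
These proportions set a floor for the accuracy scores that follow: a model that always predicts `failed` is already right about 60% of the time. A minimal sketch of that arithmetic, using the full-dataset counts printed after cleaning (the helper name `majority_baseline` is hypothetical):

```python
def majority_baseline(class_counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Counts from projects.state.value_counts() after removing the other states.
counts = {"failed": 197719, "successful": 133956}
baseline = majority_baseline(counts)
print(round(baseline, 4))  # 0.5961
```

Any cross-validated accuracy close to this value means the features are adding little beyond the class balance.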

In [31]:
mms = MinMaxScaler()
ohe = OneHotEncoder(sparse=False)

# main column transformer, to be combined with a model in a pipeline
ct = make_column_transformer(
    (ohe, ['main_category', 'currency']),
    (mms, ['backers', 'usd_goal_real', 'days_open']),
    remainder='passthrough')

# choose which columns to use as features
cols_redux = ['main_category', 'currency', 'backers', 'usd_goal_real', 'days_open']
train_redux = train[cols_redux]

train_redux.head()

ID,main_category,currency,backers,usd_goal_real,days_open
1790788508,Film & Video,DKK,0,758.73,25
1689692032,Art,USD,4,3000.0,30
1889427607,Theater,GBP,145,92444.74,34
859450349,Film & Video,USD,147,25000.0,47
1892979629,Fashion,USD,143,15000.0,30
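
`MinMaxScaler` rescales each numeric column to the range [0, 1] using the minimum and maximum seen when it is fitted. The sketch below applies the same formula in plain Python to the five `backers` values shown above; `min_max_scale` is a hypothetical helper, and the real scaler is fitted on the whole training set, so its minimum and maximum will differ.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] as (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# backers column from the five rows of train_redux.head()
backers = [0, 4, 145, 147, 143]
scaled = min_max_scale(backers)
print([round(x, 3) for x in scaled])
```

Scaling matters most for distance-based models such as KNN, where an unscaled goal amount in the tens of thousands would otherwise dominate a duration measured in days.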


### Modeling & Training

In [34]:
#Importing needed modules
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

In [35]:
#model = GaussianNB()
model = LogisticRegression()
#model = KNeighborsClassifier()

p1 = make_pipeline(ct, model)
X = train_redux
y = train.state
cross_val_score(p1, X, y, cv=10, scoring='accuracy', n_jobs=-1).mean()

0.6613911126625247

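A KNN classifier might be the most appropriate model here: it predicts from the labels of the nearest training examples rather than from a fitted probability model, as Gaussian NB and logistic regression do.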
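
To make the neighbor-based idea concrete, here is a minimal pure-Python sketch of k-nearest-neighbors voting. It is an illustration only, not scikit-learn's implementation: the function `knn_predict`, the choice `k=3`, and the data points are all made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k nearest training points."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up, already-scaled (backers, goal) pairs for illustration only.
points = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.1),
          (0.1, 0.8), (0.0, 0.9), (0.2, 0.7)]
labels = ["successful", "successful", "successful",
          "failed", "failed", "failed"]

print(knn_predict(points, labels, (0.75, 0.15)))
```

The query point sits among the three `successful` examples, so they supply all three votes.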

#### Model & Feature Selection

##### Helper functions

In [36]:
def get_perms(cols_redux):
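    '''
    Takes in a list of column names and returns a sorted list of
    feature subsets to try. Despite the name, these are subsets in
    the original column order, not permutations: each single column,
    each pair of columns, and each column followed by every later
    tail of the list. For lists longer than three items this does
    not cover every possible subset.
    '''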
    perms = set()
    for i in range(0, len(cols_redux)):
        for j in range(i+1, len(cols_redux) + 1):
            first, others = cols_redux[i], cols_redux[j:]
            if j < len(cols_redux):
                perms.add((first, others[0]))
            perms.add((first, *others))
    perms = sorted([list(x) for x in set(perms)])
    return perms
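
As written, `get_perms` returns feature subsets rather than permutations, and for lists longer than three columns it skips some of them (for example, the first three of five columns on their own). If an exhaustive search is wanted, one option is `itertools.combinations`. The sketch below is an alternative, not the function used for the results in this notebook; `all_feature_subsets` is a hypothetical name.

```python
from itertools import combinations

def all_feature_subsets(columns):
    """Every non-empty subset of columns, preserving the original order."""
    return [
        list(subset)
        for size in range(1, len(columns) + 1)
        for subset in combinations(columns, size)
    ]

cols_all = ['main_category', 'currency', 'backers',
            'usd_goal_real', 'days_open']
subsets = all_feature_subsets(cols_all)
print(len(subsets))  # 2**5 - 1 = 31
```

Each subset is then scored with 10-fold cross-validation, so an exhaustive search over five columns means 310 model fits per model type.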

In [37]:
def score_perms(perms, model_s):
    '''
    Takes in a list of feature subsets and a model, and scores
    each subset on that model with 10-fold cross-validation to
    find which features yield the most accurate results. Relies
    on the notebook-level train, y, ohe, and mms. Returns a string
    with the model type and the most accurate subset with its
    accuracy score
    '''
    max_perf = 0
    max_combo = tuple()
    for perm in perms:
        ohe_needed = []
        mms_needed = []
        #Adds the features to respective lists to pass into the
        #preprocessing functions
        for feature in perm:
            if feature == 'main_category': ohe_needed.append(feature)
            elif feature == 'currency': ohe_needed.append(feature)
            else: mms_needed.append(feature)   
        #Creates the appropriate transformer for the data  
        ct_s = make_column_transformer (
                (ohe, ohe_needed),
                (mms, mms_needed),
                remainder= 'passthrough')
        #Makes the pipeline and gets the accuracy score with the current
        #permutation
        pipe = make_pipeline(ct_s, model_s)
        train_redux_s = train[perm]
        X_s = train_redux_s
        cv_score = cross_val_score(pipe, X_s, y, cv=10, 
                                   scoring='accuracy', n_jobs=-1).mean()
        #Keep track of the features with the most accuracy
        if cv_score > max_perf: 
            max_perf = cv_score
            max_combo = (perm, cv_score)
    return ('model: ' + str(type(model_s)) + ' max: ' + str(max_combo))

##### Testing Gaussian NB and Logistic Regression

In [38]:
#Trying out the categorical features
cols_cat = ['main_category', 'currency']
perms_cat = get_perms(cols_cat)

In [39]:
score_perms(perms_cat, GaussianNB())

"model: <class 'sklearn.naive_bayes.GaussianNB'> max: (['main_category'], 0.5901405540324229)"

In [40]:
score_perms(perms_cat, LogisticRegression())

"model: <class 'sklearn.linear_model._logistic.LogisticRegression'> max: (['main_category', 'currency'], 0.6233377412872371)"

In [41]:
#Trying out the quantitative features
cols_quant = ['backers', 'usd_goal_real', 'days_open']
perms_quant = get_perms(cols_quant)

In [42]:
score_perms(perms_quant, GaussianNB())

"model: <class 'sklearn.naive_bayes.GaussianNB'> max: (['backers', 'days_open'], 0.676743485557713)"

In [43]:
score_perms(perms_quant, LogisticRegression())

"model: <class 'sklearn.linear_model._logistic.LogisticRegression'> max: (['backers', 'usd_goal_real', 'days_open'], 0.6273376541818273)"

In [44]:
#Getting all the permutations of all features
cols_all = ['main_category', 'currency', 'backers', 
            'usd_goal_real', 'days_open']
perms_all = get_perms(cols_all)

In [45]:
score_perms(perms_all, GaussianNB())

"model: <class 'sklearn.naive_bayes.GaussianNB'> max: (['currency', 'backers'], 0.7099366905277311)"

In [46]:
score_perms(perms_all, LogisticRegression())

"model: <class 'sklearn.linear_model._logistic.LogisticRegression'> max: (['main_category', 'currency', 'backers', 'usd_goal_real', 'days_open'], 0.6613911126625247)"

##### Testing K-Nearest Neighbors Classifier

In [49]:
#score_perms(perms_quant, KNeighborsClassifier())

### The Final Model - KNN Classifier

In [50]:
ct = make_column_transformer(
    (mms, ['backers', 'usd_goal_real', 'days_open']),
    remainder='passthrough')

cols_redux = ['backers', 'usd_goal_real', 'days_open']
train_redux = train[cols_redux]

model = KNeighborsClassifier()

p1 = make_pipeline(ct, model)
X = train_redux
y = train.state
cross_val_score(p1, X, y, cv=5, scoring='accuracy', n_jobs=-1).mean()

0.917300486378751

### Testing

In [46]:
from sklearn import metrics

test_redux = test[cols_redux]

X_test = test_redux
y_test = test.state

p1.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('minmaxscaler',
                                                  MinMaxScaler(copy=True,
                                                               feature_range=(0,
                                                                              1)),
                                                  ['backers', 'usd_goal_real',
                                                   'days_open'])],
                                   verbose=False)),
                ('kneighborsclassifier',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neigh

In [38]:
def test_model(X, y):
    '''
    Prints the accuracy, classification report, and confusion
    matrix of the fitted pipeline p1 on the given data
    '''
    predicted = p1.predict(X)
    expected = y
    print(metrics.accuracy_score(expected, predicted))
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

test_model(X_test, y_test)

0.9162676829194756
              precision    recall  f1-score   support

      failed       0.94      0.92      0.93     49502
  successful       0.89      0.91      0.90     33417

    accuracy                           0.92     82919
   macro avg       0.91      0.92      0.91     82919
weighted avg       0.92      0.92      0.92     82919

[[45613  3889]
 [ 3054 30363]]
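
The figures in the classification report can be recomputed directly from the confusion matrix. A short worked sketch, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, with labels in sorted order (`failed`, `successful`):

```python
# Confusion matrix from the test run: rows are true classes, columns are
# predicted classes, both in the order (failed, successful).
matrix = [[45613, 3889],
          [3054, 30363]]
labels = ["failed", "successful"]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total
print(f"accuracy: {accuracy:.4f}")

results = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])                     # row sum
    precision = matrix[i][i] / predicted_as_label
    recall = matrix[i][i] / actually_label
    results[label] = (precision, recall)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Read this way, 3,889 failed projects were predicted as successful and 3,054 successful projects were predicted as failed, out of 82,919 test rows.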


### Findings

#### Data trends
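1. The number of backers, the fundraising goal, and the number of days a campaign is open were the most useful predictors of success: a KNN classifier trained on these three features alone reached about 92% accuracy on the test set
2. Main category and currency had little predictive value on their own: models using only these two features scored 59-62% in cross-validation, barely above the 59.6% share of failed projects in the training set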

#### Significance for Kickstarter users
1. Attract as many backers as possible
2. Set a realistic goal amount (not too high) and campaign duration (neither too short nor too long)
3. Projects in any category can succeed

#### Significance for the Kickstarter company
1. More successful projects mean more profit for Kickstarter
2. Sponsor projects that have realistic goals and timelines in order to increase the number of backers and the likelihood of success
