James Kerwin - 1/28/21

# Analysing Kickstarter Data

Data gathered from: https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201612.csv

Kickstarter is "an American public benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity". The data I used was collected from the Kickstarter platform. It includes information on projects from the Kickstarter website; a project is a finite work with a clear goal that you’d like to bring to life. Think albums, books, or films. The columns included are:

ID: internal kickstarter id <br>
name: name of project <br>
category: specific category of project (Narrative Film, Documentary, Restaurant, Drink, etc.) <br>
main_category: general category of project (Film & Video, Food, etc.) <br>
currency: currency used to support project <br>
deadline: deadline for crowdfunding <br>
goal: the funding goal is the amount of money that a creator needs to complete their project <br>
launched: date project launched <br>
pledged: amount pledged by “crowd” <br>
state: current state the project is in (successful, canceled, failed) <br>
backers: number of backers <br>
country: country pledged from <br>
usd_pledged: pledged amount in USD (conversion made by KS) <br>
usd_pledged_real: pledged amount in USD (conversion made by fixer.io api) <br>
usd_goal_real: goal amount in USD <br>

My research questions include: 

1. Can the name of a project determine its success?

2. Can projects be clustered into groups based off of their parameters (such as, for example, projects that nearly succeeded and had a short deadline, or food projects that were wildly successful)?

3. Can other factors determine a project’s success (such as the amount of time a project had to be funded, the initial funding goal it sets, or the main category it’s placed in)?

In [None]:
#importing all required libraries
import pandas as pd
import numpy as np
%matplotlib inline
import string
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Preprocessing

The data includes two csv files. The following code is intended to determine which csv I should use.

In [None]:
#getting dataframes
df1 = pd.read_csv('ks-projects-201612.csv')
df2 = pd.read_csv('ks-projects-201801.csv')

Loading the dataframes was initially somewhat annoying because the files were not UTF-8 encoded, so pandas refused to load them. I fixed this by re-saving them as UTF-8 through Notepad++. I imagine the remaining DtypeWarning comes from columns that contain mixed types following the re-encoding (perhaps I'm wrong).
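
An alternative to re-saving the files is to tell pandas how to decode them. The following is only a sketch under assumptions: the tiny in-memory CSV and the `latin-1` encoding are stand-ins (the actual encoding of the Kaggle files may differ), and `low_memory=False` is the `read_csv` option that has pandas infer each column's type from the whole file at once, which usually avoids mixed-type warnings.

```python
import io
import pandas as pd

# Toy stand-in for a non-UTF-8 file: 'é' is stored as a single Latin-1 byte
raw_bytes = "ID,name,goal\n1,Café Project,1000\n2,Plain Project,2500\n".encode("latin-1")

# encoding= tells pandas how to decode the bytes (latin-1 is an assumption here);
# low_memory=False processes the file in one pass so each column gets one inferred dtype
toy_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1", low_memory=False)

print(toy_df)
print(toy_df.dtypes)
```

With the real files, the same two keyword arguments would be passed alongside the file name.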

In [None]:
#inspecting
df1.head()

In [None]:
#inspecting
df1.info()

In [None]:
#inspecting
df2.head()

In [None]:
#inspecting
df2.info()

Upon immediate inspection, there exists some overlap between the two dataframes I created, although they are not exactly the same. In dataframe 1, the rows at indices 0, 1, 2, and 3 seem to match the rows at indices 0, 2, 3, and 4 in dataframe 2 (not perfectly because of the additional columns at ends of df1 and df2). Since df1 was smaller than df2, I figured that df2 might contain all of the information within df1; the following code tests this hypothesis (df2 also contains two additional columns missing in df1, which could prove beneficial).

In [None]:
#determining which IDs from df1 are in df2
df1['in_df2'] = df1['ID '].isin(df2['ID'])

In [None]:
#the number of IDs from df1 in df2. If df2 contains all of df1, should equal number of rows in df1
df1['in_df2'].value_counts()

df2 seems to contain every instance from df1 (assuming each project has a unique ID number, which seems intuitively true). From now on I'll only be using df2.
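
The two assumptions in that conclusion (IDs are unique, and every ID from the smaller file appears in the larger one) can each be checked in one line. A minimal sketch on toy frames with made-up IDs; `small` and `large` are hypothetical stand-ins for df1 and df2:

```python
import pandas as pd

# Toy frames standing in for df1 and df2 (hypothetical IDs)
small = pd.DataFrame({"ID": [10, 20, 30]})
large = pd.DataFrame({"ID": [10, 15, 20, 30, 40]})

# True if no project ID repeats in the larger frame
ids_unique = large["ID"].is_unique
# True if every ID in the smaller frame also appears in the larger one
all_present = small["ID"].isin(large["ID"]).all()
# IDs that would be lost by dropping the smaller frame
missing = small.loc[~small["ID"].isin(large["ID"]), "ID"].tolist()

print(ids_unique, all_present, missing)
```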

In [None]:
#number of NaN values
df2.isna().sum()

In [None]:
#few enough NaN to drop
df2.dropna(inplace=True)

In [None]:
#create timeline column - time passed from launch date to deadline
df2['launched'] = pd.to_datetime(df2['launched'])
df2['deadline'] = pd.to_datetime(df2['deadline'])
df2['timeline'] = df2['deadline'] - df2['launched']
df2['timeline'] = df2['timeline'].dt.days

# Visualization

In [None]:
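#inspecting correlations (numeric columns only; recent pandas versions raise an error on text columns)
corrMatrix = df2.select_dtypes(include='number').corr()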
sns.heatmap(corrMatrix, annot=True)
plt.show()

In [None]:
#histograms
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
df2.hist(ax = ax)

In [None]:
#some of these make more sense on a log scale
fig = plt.figure(figsize = (15,20))
plt.subplot(3, 3, 1)
plt.hist(df2['goal'], log=True)
plt.title('goal')
plt.subplot(3, 3, 2)
plt.hist(df2['pledged'], log=True)
plt.title('pledged')
plt.subplot(3, 3, 3)
plt.hist(df2['backers'], log=True) 
plt.title('backers')
plt.subplot(3, 3, 4)
plt.hist(df2['usd pledged'], log=True)
plt.title('usd pledged')
plt.subplot(3, 3, 5)
plt.hist(df2['usd_pledged_real'], log=True) 
plt.title('usd_pledged_real')
plt.subplot(3, 3, 6)
plt.hist(df2['usd_goal_real'], log=True) 
plt.title('usd_goal_real')
plt.subplot(3, 3, 7)
plt.hist(df2['timeline'], log=True) 
plt.title('timeline')

In [None]:
#prepares data for NLP and also used for following histogram
def apply_state(x):
    if x == 'successful':
        return 1
    else:
        return -1
df2['state_#'] = df2['state'].apply(lambda x: apply_state(x))

Although the "state" column includes more than just 'successful' and 'failed' (also includes 'canceled', among others), I treat everything as 'failed' if it isn't successful simply for the purposes of this project.

In [None]:
plt.hist(df2['state_#']) 
plt.title('failed or succeeded')

# Natural Language Processing

Here I will attempt to answer my first research question.

In [None]:
#Preprocessing, turning names all into strings and removing whitespace
df2['name'] = df2['name'].apply(lambda x: str(x))
df2['name'] = df2['name'].str.strip()

In [None]:
#More preprocessing
def full_remove(x, removal_list):
    for w in removal_list:
        x = x.replace(w, ' ')
    return x

## Remove digits
digits = [str(x) for x in range(10)]
remove_digits = [full_remove(x, digits) for x in df2['name']]

## Remove punctuation
remove_punc = [full_remove(x, list(string.punctuation)) for x in remove_digits]

## Make everything lower-case and remove any white space
sents_lower = [x.lower() for x in remove_punc]
sents_processed = [x.strip() for x in sents_lower]

I kept preprocessing relatively light; I felt as if the removal of stopwords could only create problems as I'd have projects with names like "Where Hank?" and other short, uninformative phrases. It's possible stopwords do have an observable effect on the success of a project, so they were kept.
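
For reference, the same light cleaning can be written as a single pass with a translation table. This is a sketch, not the code used above: `clean_name` is a hypothetical helper, and like `full_remove` it replaces digits and punctuation with spaces and leaves stopwords alone.

```python
import string

# Map every digit and punctuation character to a space
_TABLE = str.maketrans({ch: " " for ch in string.digits + string.punctuation})

def clean_name(name):
    """Replace digits/punctuation with spaces, lower-case, and trim both ends."""
    return name.translate(_TABLE).lower().strip()

print(clean_name("Where Hank?"))
print(clean_name("Top 10 Pickles!"))
```

Interior runs of spaces are left in place, which is harmless here because the later steps split on whitespace.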

In [None]:
#Making sure it worked
sents_processed[1]

In [None]:
#Stemmer
def stem_with_lancaster(words):
    stemmer = nltk.LancasterStemmer()
    new_words = [stemmer.stem(w) for w in words]
    return new_words

In [None]:
#Stemming sentences
lancaster = [stem_with_lancaster(x.split()) for x in sents_processed]
lancaster = [" ".join(i) for i in lancaster]
lancaster[0:10]

In [None]:
#Creating NLP vectorizer, transforming features, etc.
vectorizer = CountVectorizer(analyzer = "word", 
                             preprocessor = None, 
                             max_features = 6000, ngram_range=(1,5))
data_features = vectorizer.fit_transform(lancaster)
tfidf_transformer = TfidfTransformer()
data_features_tfidf = tfidf_transformer.fit_transform(data_features)
data_mat = data_features_tfidf.toarray()
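
As a side note, scikit-learn's `TfidfVectorizer` (already imported at the top) is documented as equivalent to `CountVectorizer` followed by `TfidfTransformer`. A small sketch on made-up stemmed names, which are hypothetical stand-ins for the `lancaster` list, comparing the two routes:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

# Toy stemmed names (hypothetical)
toy_names = ["craft beer farm", "beer fest", "photograph book", "farm photograph"]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy_names)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same settings
one_step = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(toy_names)

# Both results are sparse matrices; densify only this tiny example to compare them
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

Both routes return a SciPy sparse matrix, which stores only the non-zero weights and so is far smaller than the dense array produced by `toarray()`.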

In [None]:
#converting to numpy arrays
y = df2['state_#'].to_numpy()
labels = df2['name'].to_numpy()

In [None]:
#Used for determining test size in later cell
y.shape

In [None]:
#This step was necessary because the dense matrix took too much space on my computer
#Didn't have enough RAM to work with millions of floats later, so I converted to 8-bit integers ('b') through numpy
#Caveat: TF-IDF weights are floats between 0 and 1, not just 1s and 0s, so this cast truncates almost every weight to 0
data_mat = data_mat.astype('b')
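
One thing to keep in mind about this cast: TF-IDF weights are generally fractional values between 0 and 1, and NumPy integer casts truncate toward zero. The sketch below uses made-up weights (an assumption, not values from the dataset) to compare the `'b'` cast with a `float32` cast, which halves the memory of the default `float64` while keeping the values.

```python
import numpy as np

# Toy TF-IDF-like weights (hypothetical values in [0, 1])
weights = np.array([[0.0, 0.6, 0.8],
                    [1.0, 0.0, 0.0]])

# Same cast as above: every fractional weight truncates to 0
as_int8 = weights.astype('b')
# Half the memory of float64, values preserved to float32 precision
as_float32 = weights.astype(np.float32)

print(as_int8)
print(as_float32)
print(weights.nbytes, as_float32.nbytes, as_int8.nbytes)
```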

In [None]:
#Creating test and training data
np.random.seed(0)
test_index = np.append(np.random.choice((np.where(y==-1))[0], 31555, replace=False), np.random.choice((np.where(y==1))[0], 31555, replace=False))
train_index = list(set(range(len(labels))) - set(test_index))
train_data = data_mat[train_index,]
train_labels = y[train_index]
test_data = data_mat[test_index,]
test_labels = y[test_index]

I chose 31555 per class (so the test set was 63110 in total, split evenly between failed and successful projects) because that made the test size around 17% of the total size and the train size around 83%. These are pretty good proportions.
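
The sampling scheme can be sketched on toy labels to make its consequences visible. Everything here is hypothetical (the 70/30 label mix, `per_class`, and the variable names); it uses `np.setdiff1d` in place of the set arithmetic above but follows the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels (hypothetical): 70 failures (-1) and 30 successes (+1)
toy_y = np.array([-1] * 70 + [1] * 30)
per_class = 10  # test examples drawn from each class

test_idx = np.concatenate([
    rng.choice(np.where(toy_y == -1)[0], per_class, replace=False),
    rng.choice(np.where(toy_y == 1)[0], per_class, replace=False),
])
# Everything not sampled for testing becomes training data
train_idx = np.setdiff1d(np.arange(len(toy_y)), test_idx)

print(len(test_idx) / len(toy_y))       # share of the data held out
print((toy_y[test_idx] == 1).mean())    # share of successes in the test set
print((toy_y[train_idx] == 1).mean())   # share of successes left for training
```

Because the test set is balanced by construction, a classifier that guesses at random (or always predicts one class) lands at 50% test error, while the training set ends up more skewed toward failures than the original data.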

In [None]:
#Create polarity function and subjectivity function
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity
pol_list = [pol(x) for x in sents_processed]
sub_list = [sub(x) for x in sents_processed]

In [None]:
#inspect polarity and subjectivity of first 10 sentences
for i in range(10):
    print(sents_processed[i], '\t', pol_list[i], sub_list[i])

It seems as if not a lot can be inferred from these sentences. Their polarity and subjectivity scores don't seem very skewed in any direction, which is an early indication that the project names may not carry much usable signal.

## Naive Bayes

In [None]:
nb_clf = MultinomialNB().fit(train_data, train_labels)
nb_preds_test = nb_clf.predict(test_data)
nb_errs_test = np.sum((nb_preds_test > 0.0) != (test_labels > 0.0))
print("Test error: ", float(nb_errs_test)/len(test_labels))

## Logistic Regression

In [None]:
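## Fit logistic classifier on training data
## (loss="log_loss" and penalty=None are the current scikit-learn spellings of the older loss="log" and penalty="none")
clf = SGDClassifier(loss="log_loss", penalty=None)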
clf.fit(train_data, train_labels)
## Pull out the parameters (w,b) of the logistic regression model
w = clf.coef_[0,:]
b = clf.intercept_
## Get predictions on training and test data
preds_train = clf.predict(train_data)
preds_test = clf.predict(test_data)
## Compute errors
errs_train = np.sum((preds_train > 0.0) != (train_labels > 0.0))
errs_test = np.sum((preds_test > 0.0) != (test_labels > 0.0))
print("Training error: ", float(errs_train)/len(train_labels))
print("Test error: ", float(errs_test)/len(test_labels))

In [None]:
## Convert vocabulary into a list:
vocab = np.array([z[0] for z in sorted(vectorizer.vocabulary_.items(), key=lambda x:x[1])])
## Get indices of sorting w
inds = np.argsort(w)
## Words with large negative values
neg_inds = inds[0:50]
print("Highly negative words: ")
print(vocab[neg_inds].tolist())
## Words with large positive values (the last 50 entries, including the single largest weight)
pos_inds = inds[-50:]
print("Highly positive words: ")
print(vocab[pos_inds].tolist())

Both the Naive Bayes and the Logistic Regression models had a very high degree of error.

It seems as if natural language processing fails at determining whether a Kickstarter project will be successful, seeing as how the test error is 49.9%, which is essentially a coin flip on a test set split evenly between the two classes. Perhaps owing to the short names of projects and the unusual words that they use (many of which are unique product names, which rarely repeat), no easily determinable pattern can be found. Perhaps with a larger dataset a trend could be determined (or perhaps the project name can never be correlated to its success, and its success is largely dependent on other, unrelated factors), but here it is not so. Nonetheless it is interesting to inspect the highly negative and positive words: negative words (words conducive to failure) include "black", "cancel", "eclips[e]", "photograph", "halloween", "apparel", and (interestingly) "kickstart". Positive words include "beer", "farm", "laundry", "pickl[e]", "medicin[e]", and "cult". Although the natural language processing fails to determine whether a project will succeed or fail, the learned weights hint that projects focusing on such topics as beer, laundry, medicine, and pickles fare better than photography or Halloween projects. Quite interesting.

It should also be noted that the failure of the NLP is not surprising. Whereas determining whether a movie review is positive or negative can be done even by children in elementary school, probably very few humans could accurately determine whether a Kickstarter project would be successful based purely on its name. Perhaps with a more powerful computer and more data available it could be done, but here it seems impossible; it is simply too difficult a task.

So, ultimately, the answer to my first research question is no - the name of a project cannot determine its success.

# K-Means Clustering
Here I will attempt to answer my second research question.

Initially, I wanted to incorporate the category and each state of a project into the clustering, but as you will see, that did not pan out. Nonetheless, I include that attempt here and demonstrate later on why I could not use those columns.

In [None]:
#one-hot encoding categories
df_category = pd.get_dummies(df2['main_category'])

In [None]:
#one-hot encoding states
df_state = pd.get_dummies(df2['state'])

I chose to one-hot encode the state and category columns so that the full range of their info could be stored and clustered later on (without bias, as one-hot encoding is an unbiased form of encoding; rather than assigning each category a number, which would be interpreted by algorithms to mean that some categories are 'closer' than others, it generates vectors for the categories, with a 1 in the category that the project is located in and a 0 for the rest).
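
A tiny illustration of what `pd.get_dummies` produces, using a made-up category column (the values are hypothetical; `dtype=int` is passed only so the output prints as 0s and 1s):

```python
import pandas as pd

# Toy category column (hypothetical values)
toy = pd.Series(["Food", "Film & Video", "Food", "Music"], name="main_category")

# One column per distinct category; each row has a single 1 marking its category
one_hot = pd.get_dummies(toy, dtype=int)
print(one_hot)
```

Every row sums to 1, so no category is numerically "closer" to another than any other pair.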

In [None]:
#removing extraneous columns
df_full = df2[['ID','backers','usd_pledged_real','usd_goal_real','timeline']]

I decided to remove many columns, because a lot of them were just variations on the same info. For example, goal and usd_goal_real are very similar, as are pledged and usd_pledged_real. Also, if I included too many columns (like currency), this would never load on my computer.

In [None]:
#putting it together
df_full = pd.concat([df_full, df_category, df_state], axis=1)

In [None]:
#scaling to remove bias from large numbers
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_full)

In [None]:
#generating scores for WCSS graph
score = []
range_values = range(1, 35)
for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_scaled)
    score.append(kmeans.inertia_)

In [None]:
#WCSS Graph (range_values on the x-axis so each point sits at its actual number of clusters)
plt.plot(range_values, score, 'bx-')
plt.title('WCSS vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS Score')
plt.show()

The elbow point seems to occur around 19 clusters. This is simply too many clusters to ever run on my computer. I'll have to remove the dummy variables for category and convert state into simple 0 and 1s. I did not want to simplify this project to that extent, but with the amount of data I'm working with, it seems necessary.
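
When reading an elbow off a plot like this, it helps to pair each WCSS value with the k that produced it: `plt.plot(score)` with no x values plots against 0-based positions, so a list that starts at k = 1 is drawn one step to the left. A minimal sketch on synthetic data (the blobs are a hypothetical stand-in for `df_scaled`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (hypothetical stand-in for df_scaled)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

range_values = range(1, 8)
score = []
for k in range_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score.append(km.inertia_)  # WCSS for this k

# Pair each WCSS value with its k so a table or plot is labeled correctly
for k, wcss in zip(range_values, score):
    print(k, round(wcss, 1))
```

On data like this the WCSS drops sharply up to k = 3 and flattens afterwards, which is the elbow shape being looked for.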

In [None]:
#making new dataframe with fewer columns (.copy() so that adding a column does not trigger SettingWithCopyWarning)
df_full = df2[['backers','usd_pledged_real','usd_goal_real','timeline']].copy()
def apply_state(x):
    if x == 'successful':
        return 1
    else:
        return 0
df_full['state_#'] = df2['state'].apply(apply_state)

In [None]:
#checking out dataframe
df_full.head()

In [None]:
#scaling again
df_scaled = scaler.fit_transform(df_full)

In [None]:
#generating scores again for WCSS Graph
score = []
range_values = range(1, 20)
for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_scaled)
    score.append(kmeans.inertia_)

In [None]:
#Making WCSS Graph (range_values on the x-axis so each point sits at its actual number of clusters)
plt.plot(range_values, score, 'bx-')
plt.title('WCSS vs. Number of Clusters')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS Score')
plt.show()

That's much better. The elbow point seems to be around 7 clusters, so I'll choose that for my k.

In [None]:
#Generating kmeans
kmeans = KMeans(n_clusters = 7, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
labels = kmeans.fit_predict(df_scaled)

In [None]:
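#creating dataframe with clusters
#(labels reuse df_full's index; df2 has gaps in its index after dropna, so a default 0..n-1 index would misalign rows)
df_cluster = pd.concat([df_full, pd.DataFrame({'cluster': labels}, index=df_full.index)], axis = 1)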
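
A note on attaching labels this way: `pd.concat(..., axis=1)` pairs rows by index rather than by position, and an index has gaps after rows are dropped with `dropna()`. The toy example below (made-up values and hypothetical names) shows what happens when the labels get a fresh 0..n-1 index versus the frame's own index.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as happens after dropna() (hypothetical values)
toy = pd.DataFrame({"backers": [5, 40, 7]}, index=[0, 2, 3])
toy_labels = np.array([1, 0, 1])

# A fresh DataFrame gets index 0, 1, 2, so rows are paired by index and NaNs appear
misaligned = pd.concat([toy, pd.DataFrame({"cluster": toy_labels})], axis=1)

# Reusing the original index keeps each label on its own row
aligned = pd.concat([toy, pd.DataFrame({"cluster": toy_labels}, index=toy.index)], axis=1)

print(misaligned)
print(aligned)
```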

In [None]:
#graphing
for i in df_cluster.columns:
    plt.figure(figsize = (25, 5))
    for j in range(7):
        plt.subplot(1, 7, j+1)
        cluster = df_cluster[ df_cluster['cluster'] == j ]
        cluster[i].hist(bins  = 20)
        plt.title( '{}\nCluster {}'.format(i, j))
plt.show()

In [None]:
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(df_scaled)
pca_df = pd.DataFrame(data = principal_comp, columns = ['pca1', 'pca2'])
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis = 1)
plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x = 'pca1', y = 'pca2', hue = 'cluster', data = pca_df, palette = ['red', 'green', 'blue', 'black', 'yellow', 'orange', 'purple'])

I am happy to report that the clustering, unlike the Natural Language Processing, was a success! The following clusters were generated, with corresponding characteristics:

Cluster 0: Structurally similar to cluster 1. Second-most backers (behind only cluster 1); most pledged USD overall; the second highest goals; long timelines<br>
Cluster 1: Structurally similar to cluster 0. Most backers of any cluster; second most pledged USD overall; the third highest goals; long timelines<br>
Cluster 2: Mostly a small number of backers, with relatively little pledged, very low goals, and medium-length timelines; highest proportion of success, but also the smallest cluster by a large margin<br>
Cluster 3: The median number of backers among the clusters; median amount pledged, as well; high goals, with timelines similar to clusters 5 and 6<br>
Cluster 4: Second-fewest backers; similar amount pledged to cluster 2, although with more variation; median goals with a short timeline<br>
Cluster 5: Few backers and a middling amount pledged with high goals. Has the most 'failed' projects (proportionally)<br>
Cluster 6: Similar to clusters 1 and 2 in number of backers, amount pledged, and goal, but a shorter timeline<br>

It seems the answer to my second research question is yes - this data can absolutely be clustered.

# KNN Modeling
Here I will attempt to answer my third research question.

In [None]:
#Used just these 2 columns because I feared the others might be too strongly correlated with the success of a project
#(obviously, I thought, a project with a lot of money pledged will be successful)
df_knn = df2[['usd_goal_real','timeline']]

In [None]:
#Decided to use the dummies I made earlier here
df_knn = pd.concat([df_knn, df_category], axis=1)

In [None]:
df_knn.head()

In [None]:
#x consists of goal, timeline, and the category
x = df_knn

In [None]:
#whether or not a project failed
y = df2['state'].apply(lambda x: apply_state(x))

In [None]:
#creating training/testing data
training_data, validation_data, training_labels, validation_labels = train_test_split (
    x,
    y,
    test_size = 0.2,
    random_state = 100
    )

In [None]:
#scaling data
sc = StandardScaler()
training_data = sc.fit_transform(training_data)
validation_data = sc.transform(validation_data)

In [None]:
#classifying
classifier = KNeighborsClassifier(n_neighbors= 3)
classifier.fit(training_data, training_labels.values.ravel())
print(classifier.score(validation_data, validation_labels))

Seeing as how this (relatively simple) KNN code took upwards of 20 minutes for my computer to run, I've decided to opt out of testing cv scores on this dataset. I'll stick with 3 neighbors for now.

I'm pleasantly surprised by the received classifier score. It seems the model I created (with just 3 neighbors) using just the goal, amount of time passed, and the category can determine the success of a project 63% of the time. Since the success of a project is a very complicated thing, I'm surprised it can be modeled (better than random chance) by using such simple characteristics as its timeline, its goal, and its category. It seems the answer to my third research question is yes - other factors (like category, timeline, and goal) CAN determine a project’s success, to a certain extent.
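
A useful yardstick for a score like this is the accuracy of always predicting the most common class, which can sit well above 50% when the classes are unbalanced. The sketch below computes that baseline on made-up labels; the 13-to-7 mix is purely hypothetical and is not taken from the dataset.

```python
import numpy as np

# Toy labels (hypothetical): 1 = successful, 0 = not successful
toy_labels = np.array([0] * 13 + [1] * 7)

# Accuracy of always predicting the most common class
majority_baseline = np.bincount(toy_labels).max() / len(toy_labels)
print(majority_baseline)
```

On the real data, the same calculation applied to `validation_labels` gives the number the KNN score would need to beat.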