# Analyzing Kickstarter Technology Projects

This notebook analyzes the most-funded successful [Kickstarter Technology Projects](https://www.kickstarter.com/discover/advanced?state=successful&category_id=16&sort=most_funded).

This data has been converted into a [KimonoLabs API](https://www.kimonolabs.com/apis/74e0z48i); you can download the data to play along at home.

In [62]:
import json

#data_path = '/Users/lindsayrgwatt/Dropbox/kickstarter_technology_032015.json'
# Raw string so the backslashes in the Windows path are never read as escape sequences
data_path = r'C:\Users\lindwatt\Dropbox\kickstarter_technology_032015.json'

with open(data_path) as data_file:    
    data = json.load(data_file)
    
#print data.keys()
print "Our scraping yielded %i records" % data['count']

Our scraping yielded 1700 records


Let's take a look at a sample successful project:

In [63]:
print data['results']['collection1'][0]

{u'date_funded': u'Apr 15 2014', u'description': u"Pono's mission is to provide the best possible listening experience of your favorite digital music.", u'creator': u'the PonoMusic Team', u'title': {u'text': u'Pono Music - Where Your Soul Rediscovers Music', u'href': u'https://www.kickstarter.com/projects/1003614822/ponomusic-where-your-soul-rediscovers-music?ref=discovery'}, u'percent_funded': u'778%', u'location': {u'text': u'San Francisco, CA', u'href': u'https://www.kickstarter.com/discover/places/san-francisco-ca?ref=city'}, u'amount_funded': u'$6,225,354'}


In [64]:
projects = data['results']['collection1']

I'm interested in what people are building. Let's look at the most popular words used to describe funded technology projects:

In [65]:
word_counts = {}
stop_words = [ 'and', 'the', 'a', 'to', 'your', 'with', 'for', 'is', 'of', 'that', 'you', 'in', 'an', 'or', 'on', 'no']
stop_words += ['from', 'it', 'can', 'into', 'by', 'any', 'this', 'all', 'we', 'at', 'use', 'our', 'most', 'are', 'be']
stop_words += ['&', '-', 's', ' ', '', 'us', '1', '2', '3', '4', '5', '6' ,'7', '8', '9', '0', 'as', 'like', 'do']
# Clustering suggests these are low-signal words in this dataset
stop_words += ['world', 'first', 'new', 'open', 'source', 'make', 'easy', 'help', 'will', 'go', 'more', 'create', 'up', 'get']
import re

non_alpha = re.compile(r'\W+')  # one or more non-word characters

for project in projects:
    description_sentence = project['description'].lower()
    description_sentence = non_alpha.sub(' ' , description_sentence)
    description_words = description_sentence.split(" ")
    project['keywords'] = [word for word in description_words if word not in stop_words]
    
    # Tally each keyword across all projects
    for word in project['keywords']:
        word_counts[word] = word_counts.get(word, 0) + 1

descending_word_count = sorted(word_counts.items(), key=lambda x: (x[1],x[0]), reverse=True)

sorted_words = sorted(word_counts.keys())

for project in projects:
    project['keyword_mapping'] = {}
    for word in project['keywords']:
        if word in project['keyword_mapping']:
            project['keyword_mapping'][word] += 1
        else:
            project['keyword_mapping'][word] = 1
    
    project['word_vector'] = []
    for word in sorted_words:
        if word in project['keyword_mapping']:
            project['word_vector'].append(project['keyword_mapping'][word])
        else:
            project['word_vector'].append(0)

counter = 0
for word in descending_word_count:
    print word[0], word[1]
    counter += 1
    if counter > 15:
        break

# Check that the word vector code is working
#for i in range(len(projects[0]['word_vector'])):
#    if projects[0]['word_vector'][i] > 0:
#        print sorted_words[i], projects[0]['word_vector'][i]
#print projects[0]['keywords']

3d 146
iphone 100
arduino 98
phone 96
smart 93
control 89
power 86
app 82
light 81
android 80
affordable 80
system 79
device 79
usb 77
technology 72
high 72
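
As a point of comparison, the same tally can be sketched with `collections.Counter` from the standard library. This is a minimal sketch written for Python 3, not the notebook's own code: `sample_descriptions`, `sample_stop_words`, and the `keywords` helper are hypothetical stand-ins for the real project data and the stop-word list above.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the notebook's project descriptions and stop words
sample_descriptions = [
    "A 3D printer for your desktop",
    "Control your phone with an Arduino",
    "An affordable 3D scanner",
]
sample_stop_words = set(['a', 'an', 'for', 'your', 'with'])

non_alpha = re.compile(r'\W+')

def keywords(description, stop_words):
    """Lowercase, split on runs of non-word characters, drop stop words and empty strings."""
    tokens = non_alpha.split(description.lower())
    return [t for t in tokens if t and t not in stop_words]

# Counter.update adds one to the count of every keyword it is given
word_counts = Counter()
for description in sample_descriptions:
    word_counts.update(keywords(description, sample_stop_words))

top_words = word_counts.most_common(2)
print(top_words)
```

Note that `most_common` does not break ties the way the notebook does: the cell above sorts on `(count, word)` in reverse, so words with equal counts may come out in a different order here.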


Now let's try using k-means clustering to organize these projects into meaningful groups.

The code below is borrowed from O'Reilly's [Programming Collective Intelligence](http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325).

In [66]:
from math import sqrt
import random

def pearson(v1, v2):
    # Simple sums
    sum1=sum(v1)
    sum2=sum(v2)
    
    # Sum of the squares
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])
    
    # Sum of the products
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])
    
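    # Calculate r (Pearson score); use a float length so that integer
    # word counts are not truncated by Python 2 integer division
    n=float(len(v1))
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    
    # Return 1 - r so that smaller values mean more similar vectors
    return 1.0-num/den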

def kcluster(rows,distance=pearson,k=4):
    # Determine the minimum and maximum values for each point
    ranges=[(min([row[i] for row in rows]), max([row[i] for row in rows])) for i in range(len(rows[0]))]
    
    # Create k randomly placed centroids
    clusters=[[random.random()*(ranges[i][1]-ranges[i][0])+ranges[i][0] for i in range(len(rows[0]))] for j in range(k)]
    
    lastMatches=None
    for t in range(100):
        print 'Iteration %d' % t
        bestMatches=[[] for i in range(k)]
        
        # Find which centroid is the closest for each row
        for j in range(len(rows)):
            row=rows[j]
            bestMatch=0
            for i in range(k):
                d=distance(clusters[i],row)
                if d < distance(clusters[bestMatch], row): bestMatch=i
            bestMatches[bestMatch].append(j)
    
        # If the results are the same as last time, this is complete
        if bestMatches==lastMatches: break
        lastMatches=bestMatches
        
        # Move the centroids to the average of their members
        for i in range(k):
            avgs=[0.0]*len(rows[0])
            if len(bestMatches[i])>0:
                for rowID in bestMatches[i]:
                    for m in range(len(rows[rowID])):
                        avgs[m] += rows[rowID][m]
                for j in range(len(avgs)):
                    avgs[j]/=len(bestMatches[i])
                clusters[i]=avgs
    
    return bestMatches

# Use a separate name so the JSON loaded into `data` earlier is not overwritten
word_vectors = [project['word_vector'] for project in projects]
kclust = kcluster(word_vectors,k=6)

#for k in range(len(kclust)):
#    print "Cluster number: %d" % k
#    for project in kclust[k]:
#        #print "%s :: %s" % (projects[project]['title']['text'], projects[project]['description'])
#        print projects[project]['title']['text']
    

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Iteration 6
Iteration 7
Iteration 8
Iteration 9
Iteration 10
Iteration 11
Iteration 12
Iteration 13
Iteration 14
Iteration 15
Iteration 16
Iteration 17
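
To make the distance measure concrete, here is a minimal sketch of the Pearson-based distance and the nearest-centroid step on toy vectors. It is written for Python 3 and is not the book's code: the names `pearson_distance` and `nearest_centroid` and the three-element vectors are illustrative assumptions.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - r, where r is the Pearson correlation of v1 and v2."""
    n = float(len(v1))
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(v * v for v in v1)
    sum2_sq = sum(v * v for v in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        # Mirrors the notebook: a vector with no variance gets distance 0
        return 0
    return 1.0 - num / den

def nearest_centroid(row, centroids):
    """Return the index of the centroid with the smallest distance to row."""
    return min(range(len(centroids)),
               key=lambda i: pearson_distance(centroids[i], row))

# Toy centroids: one rising, one falling
centroids = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]

print(pearson_distance([1, 2, 3], [2, 4, 6]))   # perfectly correlated
print(pearson_distance([1, 2, 3], [3, 2, 1]))   # perfectly anti-correlated
print(nearest_centroid([0, 1, 5], centroids))   # a rising row
```

The distance ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), so it compares the shape of two word vectors rather than their magnitude.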


In [67]:
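# Loop over each cluster and print its most common words
# (the check below runs after the increment, so MAX_WORDS + 1 = 16 words are printed).
# Then print the title and description of each project whose keywords contain
# exactly TOP_WORDS - 1 = 2 of the cluster's top 3 words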
MAX_WORDS = 15
TOP_WORDS = 3

for k in range(len(kclust)):
    print "\nCluster %d\n" % k
    words = {}
    for project in kclust[k]:
        temp_words = projects[project]['keyword_mapping']
        for word in temp_words:
            if word in words:
                words[word] += temp_words[word]
            else:
                words[word] = temp_words[word]
    
    descending_cluster_word_count = sorted(words.items(), key=lambda x: (x[1],x[0]), reverse=True)
    
    counter = 0
    for word in descending_cluster_word_count:
        print word[0], word[1]
        counter += 1
        if counter > MAX_WORDS:
            break 
    
    print ('\nSample projects:\n')
    popular_words = [descending_cluster_word_count[i][0] for i in range(TOP_WORDS)]
    
    for project in kclust[k]:
        keywords = projects[project]['keywords']
        if len(set(keywords).intersection(popular_words)) == TOP_WORDS - 1:
            print '%s :: %s' % (projects[project]['title']['text'], projects[project]['description'])
    


Cluster 0

arduino 87
system 52
compatible 48
wireless 39
usb 37
sound 35
platform 33
music 29
device 27
high 26
time 25
board 22
bluetooth 22
video 21
built 19
small 17

Sample projects:

MicroView: Chip-sized Arduino with built-in OLED Display! :: The MicroView is the first chip-sized Arduino compatible that lets you see what your Arduino is thinking using a built-in OLED display.
Shrunk down an Arduino to the size of a finger-tip! :: RFduino: A finger-tip sized, Arduino compatible, wireless enabled microcontroller, low cost enough to leave in all of your projects!
Touch Board: Interactivity Everywhere :: Now anyone can transform touch into sound (and so much more!) with the Touch Board, an easy-to-use Arduino-compatible device.
smARtDUINO: Open System by former ARDUINO's manufacturer :: For years we manufactured the ARDUINO in Italy. Now we created a new Open System: modular, scalable, the world's cheapest and smallest!
Microduino: Arduino in your pocket, small, stackable, smart ::

In [68]:
currencies = {}

for project in projects:
    # The first character of the amount is (the start of) the currency prefix
    currency = project['amount_funded'][0]
    currencies[currency] = currencies.get(currency, 0) + 1

for currency in currencies:
    print "%s : %i" % (currency, currencies[currency])

k : 6
£ : 158
$ : 1521
€ : 15
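
The tally above keys on the first character only, which is why the two-letter krona prefix shows up as `k`. One alternative is to split each amount into its full prefix and its digits with a regular expression. This is a sketch under assumptions, written for Python 3: `amount_pattern`, `currency_names`, and `parse_amount` are hypothetical names, and every amount other than the Pono Music figure is made up for illustration.

```python
import re

# A run of non-digits (the currency prefix) followed by digits and commas
amount_pattern = re.compile(r'^(\D+)([\d,]+)$')

# Hypothetical mapping from prefix to a currency label
currency_names = {u'$': 'dollar', u'£': 'pound', u'€': 'euro', u'kr': 'krona'}

def parse_amount(amount):
    """Return (currency label, integer amount) for a string such as '$6,225,354'."""
    match = amount_pattern.match(amount.strip())
    if match is None:
        raise ValueError('Unrecognized amount: %r' % amount)
    prefix, digits = match.groups()
    return currency_names[prefix.strip()], int(digits.replace(',', ''))

print(parse_amount(u'$6,225,354'))  # amount from the sample project above
print(parse_amount(u'kr12,500'))    # made-up krona amount
```

Parsing the whole prefix this way would remove the need to strip a leftover `r` from krona amounts before converting them to integers.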


In [69]:
import numpy as np

currency_names = {u'k': "krona", u'£': "pound", u'$': "dollar", u'€': "euro"}

for project in projects:
    # Only the first character is checked, so the "kr" prefix matches on "k"
    project['currency'] = currency_names[project['amount_funded'][0]]

# Hard-coded rates: US dollars per one unit of each currency
exchange_rate = {
    'pound':1.49,
    'euro':1.30,
    'krona':0.15,
    'dollar':1
}

for project in projects:
    # Swedish Krona shows up as "kr" prefix; all other prefixes are only one letter,
    # so drop the first character, then strip commas and any leftover "r"
    amount = project['amount_funded'][1:].replace(",","").replace("r","")
    project['usd_funding'] = int(amount)*exchange_rate[project['currency']]

print "There are %d projects overall" % len(projects)
total_fundraising = sum([project['usd_funding'] for project in projects])
print "These projects raised a total of: $%.2f" % total_fundraising
print "This is an average of: $%.2f" % (total_fundraising/len(projects))
print "And a median of: $%.2f" % np.median(np.array([project['usd_funding'] for project in projects]))

for k in range(len(kclust)):
    print "\nThere are %d projects in cluster %d" % (len(kclust[k]), k)
    total_fundraising = sum([projects[i]['usd_funding'] for i in kclust[k]])
    print "Total funds raised: $%.2f" % total_fundraising
    print "Average funds raised: $%.2f" % (total_fundraising/len(kclust[k]))
    print "Median funds raised: $%.2f" % (np.median(np.array([projects[i]['usd_funding'] for i in kclust[k]])))

There are 1700 projects overall
These projects raised a total of: $212472913.87
This is an average of: $124984.07
And a median of: $45992.00

There are 329 projects in cluster 0
Total funds raised: $46449092.74
Average funds raised: $141182.65
Median funds raised: $43336.00

There are 181 projects in cluster 1
Total funds raised: $41444599.55
Average funds raised: $228975.69
Median funds raised: $79939.00

There are 433 projects in cluster 2
Total funds raised: $40454448.59
Average funds raised: $93428.29
Median funds raised: $36915.00

There are 246 projects in cluster 3
Total funds raised: $30804666.66
Average funds raised: $125222.22
Median funds raised: $49236.00

There are 271 projects in cluster 4
Total funds raised: $31860275.03
Average funds raised: $117565.59
Median funds raised: $55287.00

There are 240 projects in cluster 5
Total funds raised: $21459831.30
Average funds raised: $89415.96
Median funds raised: $42301.50
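
The overall mean ($124,984.07) sits well above the overall median ($45,992.00), which suggests that a few very large campaigns pull the average up. The sketch below shows the effect on made-up numbers. It is written for Python 3 and uses plain Python rather than the notebook's `np.median`; the `summarize` helper and the `sample` amounts are hypothetical.

```python
def summarize(amounts):
    """Return (total, mean, median) for a non-empty list of numbers."""
    ordered = sorted(amounts)
    n = len(ordered)
    total = sum(ordered)
    middle = n // 2
    if n % 2:
        # Odd count: the median is the middle value
        median = ordered[middle]
    else:
        # Even count: the median is the average of the two middle values
        median = (ordered[middle - 1] + ordered[middle]) / 2.0
    return total, total / float(n), median

# Hypothetical funding amounts: four modest projects and one large outlier
sample = [20000, 30000, 40000, 50000, 1000000]
total, mean, median = summarize(sample)

print("Total: $%.2f" % total)
print("Mean: $%.2f" % mean)
print("Median: $%.2f" % median)
```

A single outlier moves the mean a long way while leaving the median untouched, which is why the per-cluster medians are probably the safer figures to compare.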
